This notebook demonstrates how to prepare your data for a benchmark of the EileForecast forecasting models, using a Kaggle dataset as an example.
To run a benchmark on your own data, you need to:
* Adjust this notebook to your data and run it
* Run 2_run_your_benchmark.py 
* Evaluate the benchmark in 3_evaluate_your_benchmark.ipynb. 

Create a new folder "example" in /data/raw/.
Download the public Kaggle electricity consumption data "train.csv" from https://www.kaggle.com/datasets/utathya/electricity-consumption/data and put it into the new folder.

In [1]:
import pandas as pd
import yaml
path_to_project = "/home/dev/projects/paper/" # put your project path here.
path_to_raw = path_to_project + "data/raw/"
path_to_raw_data = path_to_raw + "example/"

Read in the data and take a look at its structure.

In [3]:
df = pd.read_csv(path_to_raw_data + "train.csv")
df

,ID,datetime,temperature,var1,pressure,windspeed,var2,electricity_consumption
0,0,2013-07-01 00:00:00,-11.4,-17.1,1003.0,571.910,A,216.0
1,1,2013-07-01 01:00:00,-12.1,-19.3,996.0,575.040,A,210.0
2,2,2013-07-01 02:00:00,-12.9,-20.0,1000.0,578.435,A,225.0
3,3,2013-07-01 03:00:00,-11.4,-17.1,995.0,582.580,A,216.0
4,4,2013-07-01 04:00:00,-11.4,-19.3,1005.0,586.600,A,222.0
...,...,...,...,...,...,...,...,...
26491,34891,2017-06-23 19:00:00,-0.7,-15.0,1009.0,51.685,A,225.0
26492,34892,2017-06-23 20:00:00,-2.9,-11.4,1005.0,56.105,A,213.0
26493,34893,2017-06-23 21:00:00,-1.4,-12.9,995.0,61.275,A,213.0
26494,34894,2017-06-23 22:00:00,-2.9,-11.4,996.0,67.210,A,210.0


The data contains dates, weather data, two variables not further explained, and electricity consumption.
The date column is named "datetime", and the signal we want to forecast is in the column "electricity_consumption".
Weather data is in the columns "temperature", "pressure" and "windspeed".

Check the first and the last date of the dataset.

In [4]:
first_date = df.datetime.min()
last_date = df.datetime.max()
print(f"Data is recorded from {first_date} to {last_date}")

Data is recorded from 2013-07-01 00:00:00 to 2017-06-23 23:00:00
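
Besides the first and last date, it can be useful to know whether every hour in between is present. The following is a minimal sketch of such a check, assuming hourly data; the helper name `missing_hours` and the toy timestamps are made up for illustration and are not part of EileForecast.

```python
import pandas as pd


def missing_hours(timestamps: pd.Series) -> pd.DatetimeIndex:
    """Return the hourly timestamps between the first and last entry that do not occur."""
    ts = pd.DatetimeIndex(pd.to_datetime(timestamps))
    # Full hourly grid from the first to the last observed timestamp
    expected = pd.date_range(start=ts.min(), end=ts.max(), freq=pd.Timedelta(hours=1))
    return expected.difference(ts)


# Toy data with one gap (02:00 is absent); replace with df.datetime for the real data
toy = pd.Series(["2013-07-01 00:00:00", "2013-07-01 01:00:00", "2013-07-01 03:00:00"])
gaps = missing_hours(toy)
print(f"{len(gaps)} missing hourly timestamp(s): {list(gaps)}")
```

On the real data, `missing_hours(df.datetime)` would list the hours for which no row exists.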


The EileForecast benchmark framework expects electricity data (load and production) in one dataframe and external features, such as weather, in another dataframe.

The electricity data needs to be in the so-called long format, i.e. one row for each combination of signal_id and date. As we only have one load signal here, this is automatically the case. If we also had flexible load signals or PV production data, we would need to concatenate them to the consumption data, e.g. with df = pd.concat([df, flexible, production]).

The electricity dataset needs the columns:
* "signal_id": the name of the signal consuming or producing power
* "date": the timestamp of electricity consumption or production, in UTC
* "power": the consumed or produced electricity
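
To illustrate the long format with more than one signal, here is a small sketch with made-up values. The second signal name "production" and all numbers are assumptions for illustration only; the Kaggle example contains just the "load" signal.

```python
import pandas as pd

# Three hourly UTC timestamps shared by both toy signals
dates = pd.date_range("2013-07-01", periods=3, freq=pd.Timedelta(hours=1), tz="UTC")

load = pd.DataFrame({"date": dates, "power": [216.0, 210.0, 225.0]})
load["signal_id"] = "load"

# Hypothetical PV production signal with invented values
production = pd.DataFrame({"date": dates, "power": [0.0, 1.5, 3.0]})
production["signal_id"] = "production"

# Long format: the signals are stacked below each other,
# giving one row per (signal_id, date) combination
long_df = pd.concat([load, production], ignore_index=True).set_index("signal_id")
print(long_df)
```

Each timestamp appears once per signal, so the frame has six rows for two signals and three hours.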



In [5]:
# we keep electricity data in df and create a new dataframe weather for weather data
df.rename(columns={"datetime":"date", "electricity_consumption": "power", "windspeed": "wind_speed"}, inplace=True)
weather = pd.DataFrame(df[["date", "temperature", "wind_speed", "pressure"]])

df.drop(columns=["ID", "var1", "pressure", "var2", "temperature", "wind_speed"], inplace=True)

df["date"] = pd.to_datetime(df["date"], utc=True) # the dates need to be UTC timestamps
df["signal_id"] = "load" # we have only load data
df.set_index("signal_id", inplace=True)
df.dropna(inplace=True)
df


signal_id,date,power
load,2013-07-01 00:00:00+00:00,216.0
load,2013-07-01 01:00:00+00:00,210.0
load,2013-07-01 02:00:00+00:00,225.0
load,2013-07-01 03:00:00+00:00,216.0
load,2013-07-01 04:00:00+00:00,222.0
...,...,...
load,2017-06-23 19:00:00+00:00,225.0
load,2017-06-23 20:00:00+00:00,213.0
load,2017-06-23 21:00:00+00:00,213.0
load,2017-06-23 22:00:00+00:00,210.0


In [6]:
weather

,date,temperature,wind_speed,pressure
0,2013-07-01 00:00:00,-11.4,571.910,1003.0
1,2013-07-01 01:00:00,-12.1,575.040,996.0
2,2013-07-01 02:00:00,-12.9,578.435,1000.0
3,2013-07-01 03:00:00,-11.4,582.580,995.0
4,2013-07-01 04:00:00,-11.4,586.600,1005.0
...,...,...,...,...
26491,2017-06-23 19:00:00,-0.7,51.685,1009.0
26492,2017-06-23 20:00:00,-2.9,56.105,1005.0
26493,2017-06-23 21:00:00,-1.4,61.275,995.0
26494,2017-06-23 22:00:00,-2.9,67.210,996.0


Save the electricity data as electricity_demand.parquet (use this file name even if the data also contains production signals) and the weather data as weather.parquet in the folder data/raw/example/.

In [8]:
df.to_parquet(path_to_raw_data + "electricity_demand.parquet")
weather.to_parquet(path_to_raw_data + "weather.parquet")
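
Before handing the files to the benchmark, a quick sanity check of the electricity dataframe may save a failed run. The sketch below only tests the requirements stated in this notebook (a signal_id, UTC dates, a power column); the function `check_electricity_frame` is a hypothetical helper, not part of EileForecast, and the framework itself may check more or different things.

```python
import pandas as pd


def check_electricity_frame(frame: pd.DataFrame) -> list[str]:
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    # signal_id may be the index (as in this notebook) or a regular column
    if frame.index.name != "signal_id" and "signal_id" not in frame.columns:
        problems.append("signal_id is missing")
    for column in ("date", "power"):
        if column not in frame.columns:
            problems.append(f"column '{column}' is missing")
    if "date" in frame.columns:
        dtype = frame["date"].dtype
        if not isinstance(dtype, pd.DatetimeTZDtype) or str(dtype.tz) != "UTC":
            problems.append("column 'date' is not a timezone-aware UTC timestamp")
    if "power" in frame.columns and frame["power"].isna().any():
        problems.append("column 'power' contains missing values")
    return problems


# Toy frame in the same layout as df above
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2013-07-01 00:00", "2013-07-01 01:00"], utc=True),
     "power": [216.0, 210.0]},
    index=pd.Index(["load", "load"], name="signal_id"),
)
problems = check_electricity_frame(toy)
print(problems if problems else "all checks passed")
```

Calling `check_electricity_frame(df)` on the dataframe prepared above would run the same checks on the real data.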


Each dataset forecasted with EileForecast needs a dataset config file that contains dataset-specific configuration.
The following code creates such a file. Information that we do not have for this dataset is commented out. If you have it for your data, delete the # at the start of the corresponding line.

In [None]:
config_content = {
    '_target_': 'eile_forecast.forecast.split_data.GluontsDatasetEile',
    'name': 'example',
    'add_seasonalities': True,  # if True, seasonal sine and cosine waves are added automatically in the code
    'add_holidays_and_weekends': False,  # only possible if you also provide the country (or country and state); if True, the code adds holiday and weekend binary variables automatically
    'add_weather': True,  # set to True if you have weather data
    'benchmark_end_date': df.date.max().strftime('%Y-%m-%d %H:%M'),
    'main_signal_name': 'load',
    'forecast_signal': 'load',  # load, or pv if df contains PV production
    # 'pv_signal_name': 'production',  # example value; add if known and contained in electricity_demand.parquet
    # 'flexible_load_signals': ['machine1', 'machine2'],  # example value; add if known and if these should be subtracted from the main signal (useful for load shifting applications)
    'benchmark_start_date': df.date.min().strftime('%Y-%m-%d %H:%M'),
    # 'country': 'Germany',  # example value; add if known. Together with add_holidays_and_weekends, this adds country holiday and weekend binaries as external features
    # 'state': 'Bavaria',  # example value; add if known. Together with add_holidays_and_weekends, this adds country and state holiday and weekend binaries as external features
}

# Write the config to a YAML file
yaml_file_path = f'{path_to_project}conf/datasets/example.yaml'
with open(yaml_file_path, 'w') as yaml_file:
    yaml.dump(config_content, yaml_file, default_flow_style=False)
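
To confirm that the config file contains what was intended, it can be read back and compared. The sketch below writes a copy of the example config to a temporary directory instead of the project folder. The helper `holiday_config_problems` is hypothetical; it only encodes the rule from the comments above that add_holidays_and_weekends needs a country. The dates are the ones printed earlier in this notebook.

```python
import tempfile
from pathlib import Path

import yaml

config = {
    "_target_": "eile_forecast.forecast.split_data.GluontsDatasetEile",
    "name": "example",
    "add_seasonalities": True,
    "add_holidays_and_weekends": False,
    "add_weather": True,
    "benchmark_start_date": "2013-07-01 00:00",
    "benchmark_end_date": "2017-06-23 23:00",
    "main_signal_name": "load",
    "forecast_signal": "load",
}


def holiday_config_problems(cfg: dict) -> list[str]:
    """Hypothetical check: holiday features need at least a country."""
    if cfg.get("add_holidays_and_weekends") and "country" not in cfg:
        return ["add_holidays_and_weekends is True but no country is given"]
    return []


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.yaml"
    with open(path, "w") as yaml_file:
        yaml.safe_dump(config, yaml_file, default_flow_style=False)
    # Read the file back to verify that nothing was lost or changed
    with open(path) as yaml_file:
        loaded = yaml.safe_load(yaml_file)

print("round trip ok:", loaded == config)
print("problems:", holiday_config_problems(loaded))
```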

For a benchmark, we need a training and a test set. The length of the training set is called the training length. 
We opt for an 80:20 split, i.e. we train the models on the first 80% of the timestamps and test on the most recent 20%.


In [11]:
train_len = int(len(pd.date_range(start=first_date, end=last_date, freq="h"))*0.80)  # "h" = hourly; the uppercase alias "H" is deprecated in recent pandas versions
train_len

27916
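
The same number can be derived without building the full date range, which also makes it easy to see where the test period begins. This is a sketch assuming hourly resolution; the function name `hourly_train_length` is made up for illustration.

```python
import pandas as pd


def hourly_train_length(first, last, train_fraction=0.8):
    """Number of hourly steps in the training part of the period [first, last]."""
    first, last = pd.Timestamp(first), pd.Timestamp(last)
    # Both endpoints count, hence the + 1
    n_hours = int((last - first) / pd.Timedelta(hours=1)) + 1
    return int(n_hours * train_fraction)


first = "2013-07-01 00:00:00"
last = "2017-06-23 23:00:00"
train_len = hourly_train_length(first, last)  # 27916, as in the cell output above

# First timestamp that belongs to the test set
test_start = pd.Timestamp(first) + pd.Timedelta(hours=train_len)
print(train_len, test_start)
```

Note that the length counts calendar hours between the first and last date, not rows in the dataframe, so missing hours in the data do not change it.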

Now, set the training length in the file 2_run_your_benchmark.py and run the file.