## Data origin :: smard

This data comes from the online platform "smard", provided and maintained by the "Bundesnetzagentur" (the German Federal Network Agency). It is a central collection and publication of data and information on electricity generation, transmission, and consumption in the German electricity market. The URL for the following data is: [smard; Data](https://www.smard.de/home/downloadcenter/download-marktdaten#!?downloadAttributes=%7B%22selectedCategory%22:false,%22selectedSubCategory%22:false,%22selectedRegion%22:false,%22from%22:1628373600000,%22to%22:1629323999999,%22selectedFileType%22:false%7D)


## Feature Description

Features are documented in the [smard manual](https://www.smard.de/home/benutzerhandbuch).

## Import

In [148]:
import time, os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import glob
#from pandas_profiling import ProfileReport
import json

#from cesium import datasets

from functools import reduce

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make numpy printouts easier to read.
#np.set_printoptions(precision=3, suppress=True)

from datetime import datetime, timezone, timedelta
import pytz
resample_factor = 15

import warnings

### Time Zone Adaption

All timestamps on SMARD refer to the local time valid at that moment: Central European Time (CET, UTC+1) or Central European Summer Time (CEST, UTC+2).
- The changeover from standard time to summer time takes place on the last Sunday in March at 02:00 by skipping the hour from 02:00 to 03:00.
- The changeover from summer time to standard time takes place on the last Sunday in October at 03:00 by repeating the hour from 02:00 to 03:00.
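
Both changeover rules can be observed directly in pandas. The sketch below uses a few hand-picked timestamps from 2021 (illustrative values, not taken from the SMARD files) and the same `tz_localize` arguments that this notebook applies to the data. The skipped local time and the repeated local time both become `NaT`.

```python
import pandas as pd

# Illustrative local timestamps around the 2021 changeovers (not SMARD data).
local = pd.DatetimeIndex([
    "2021-03-28 01:45",  # valid, still CET
    "2021-03-28 02:15",  # does not exist: the hour 02:00-03:00 is skipped
    "2021-03-28 03:00",  # valid, now CEST
    "2021-10-31 02:30",  # ambiguous: the hour 02:00-03:00 occurs twice
])

localized = local.tz_localize("Europe/Berlin", ambiguous="NaT", nonexistent="NaT")
utc = localized.tz_convert("UTC")

n_lost = int(utc.isna().sum())
print(utc)
print(f"{n_lost} of {len(utc)} timestamps became NaT")
```

Rows whose index is `NaT` can no longer be matched by timestamp, so it is worth counting them after the conversion.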

## Part 1: SMARD Data Mining

## 1.1 Import "Predicted Energy Consumption"

In [149]:
# Read csv file and assign to dataframe
df_prog_cons = pd.read_csv("../data/smard/df_prog_cons.csv")

In [150]:
print(df_prog_cons.shape)
df_prog_cons.head()

(158876, 6)


Unnamed: 0,dt_start_cet,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw
0,2016-12-31 23:00:00,4922.0,12966.0,2282.0,1857.0,6636.0
1,2016-12-31 23:15:00,4890.0,12916.0,2282.0,1857.0,6614.0
2,2016-12-31 23:30:00,4859.0,12873.0,2282.0,1857.0,6595.0
3,2016-12-31 23:45:00,4833.0,12830.0,2282.0,1857.0,6572.0
4,2017-01-01 00:00:00,4808.0,12315.0,2152.0,1774.0,6078.0


In [151]:
df_prog_cons.tail()

Unnamed: 0,dt_start_cet,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw
158871,2021-07-13 20:45:00,3637.0,8334.0,274.0,193.0,2620.0
158872,2021-07-13 21:00:00,3677.0,8426.0,297.0,208.0,2666.0
158873,2021-07-13 21:15:00,3580.0,8309.0,297.0,208.0,2678.0
158874,2021-07-13 21:30:00,3485.0,8192.0,297.0,208.0,2688.0
158875,2021-07-13 21:45:00,3371.0,8061.0,297.0,208.0,2699.0


In [152]:
# Convert time to datetime object and set it as index
df_prog_cons["dt_start_cet"] = pd.to_datetime(df_prog_cons["dt_start_cet"])
df_prog_cons.set_index('dt_start_cet', inplace=True)

# Localize to Europe/Berlin (CET/CEST), convert to UTC, and store as naive UTC timestamps.
# Skipped (nonexistent) and repeated (ambiguous) local times become NaT.
df_prog_cons.index = df_prog_cons.index.tz_localize('Europe/Berlin', ambiguous="NaT", nonexistent="NaT")
df_prog_cons.index = df_prog_cons.index.tz_convert(pytz.utc)
df_prog_cons.index = df_prog_cons.index.tz_localize(None)
df_prog_cons.index.names = ['dt_start_utc']
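
With `ambiguous="NaT"`, the rows of the repeated hour in October end up with a `NaT` index. If the file lists that hour twice in chronological order (an assumption about the export, not verified here), `ambiguous="infer"` is one possible alternative that keeps those rows. A minimal sketch on synthetic timestamps:

```python
import pandas as pd

# Synthetic quarter-hour wall-clock times across the changeover on 2021-10-31,
# with the hour 02:00-03:00 listed twice in chronological order.
first_pass = pd.date_range("2021-10-31 01:45", "2021-10-31 02:45", freq="15min")
second_pass = pd.date_range("2021-10-31 02:00", "2021-10-31 03:00", freq="15min")
local = first_pass.append(second_pass)

# "infer" resolves the repeated hour from the order of the timestamps.
utc = local.tz_localize("Europe/Berlin", ambiguous="infer").tz_convert("UTC")

# Differences between consecutive UTC timestamps.
steps = utc[1:] - utc[:-1]
print(utc.tz_localize(None))
```

In UTC the sequence is evenly spaced again, which is what the later row-based shift relies on.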

We need to drop any columns that consist of infinite values or NaNs, as they are unusable.

In [153]:
count_inf = np.isinf(df_prog_cons).values.sum()
count_null = df_prog_cons.isnull().sum().sum()

print("The data frame contains " + str(count_inf) + " infinite values and " + str(count_null) + " missing values.")

The data frame contains 0 infinite values and 0 missing values.


Fortunately, we can keep every column of the dataframe.
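
Had the check found problematic values, the unusable columns could be dropped roughly as follows. This is a minimal sketch on a made-up frame: the column names are hypothetical, and treating a column as unusable only when it is entirely NaN or infinite is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up example frame; column names are hypothetical.
demo = pd.DataFrame({
    "ok": [1.0, 2.0, 3.0],
    "all_nan": [np.nan, np.nan, np.nan],
    "all_inf": [np.inf, -np.inf, np.inf],
})

# Treat +/-inf like missing values, then drop columns with no usable entry.
cleaned = demo.replace([np.inf, -np.inf], np.nan).dropna(axis=1, how="all")
print(list(cleaned.columns))
```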

In [154]:
# Print the type of each variable in the dataframe

df_prog_cons.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 158876 entries, 2016-12-31 22:00:00 to 2021-07-13 19:45:00
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   50Hertz_power_mw  158876 non-null  float64
 1   DE_power_mw       158876 non-null  float64
 2   DK_power_mw       158876 non-null  float64
 3   DK1_power_mw      158876 non-null  float64
 4   TTG_power_mw      158876 non-null  float64
dtypes: float64(5)
memory usage: 7.3 MB


All features are numerical, so no type conversion is needed.
We now add a feature, `total_pred_cons`, that holds the total predicted energy consumption, computed as the row-wise sum of all columns.

In [155]:
df_prog_cons['total_pred_cons'] = df_prog_cons.sum(axis=1)
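
`df.sum(axis=1)` adds up every column present at that moment, so running the cell a second time would include `total_pred_cons` itself in the sum. One way to guard against this is to name the columns explicitly, sketched here with the first row of the table shown earlier:

```python
import pandas as pd

# First row of the df_prog_cons table shown earlier.
row = pd.DataFrame({
    "50Hertz_power_mw": [4922.0],
    "DE_power_mw": [12966.0],
    "DK_power_mw": [2282.0],
    "DK1_power_mw": [1857.0],
    "TTG_power_mw": [6636.0],
})

power_cols = [c for c in row.columns if c.endswith("_power_mw")]

# Running the assignment twice gives the same result,
# because the total itself is not part of power_cols.
row["total_pred_cons"] = row[power_cols].sum(axis=1)
row["total_pred_cons"] = row[power_cols].sum(axis=1)
print(row["total_pred_cons"].iloc[0])
```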

## 1.2 Import "Realisierte Erzeugung" (Realized Generation)

This dataset contains the realized electricity generation per energy source.

In [156]:
df = pd.read_csv("../data/smard/Stromerzeugung/Realisierte_Erzeugung_2021.csv", delimiter=";", decimal=',')

df.shape

(17372, 14)

In [157]:
df.head()

Unnamed: 0,Datum,Uhrzeit,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh]
0,01.01.2021,00:00,1.147,308,84,1.058,0,55,2.034,2.899,937,1.402,73,358
1,01.01.2021,00:15,1.145,301,88,1.025,0,55,2.036,2.905,872,1.416,95,358
2,01.01.2021,00:30,1.138,299,101,955.0,0,55,2.037,2.904,829,1.419,82,356
3,01.01.2021,00:45,1.139,298,108,931.0,0,55,2.038,2.901,805,1.412,98,354
4,01.01.2021,01:00,1.138,310,105,943.0,0,55,2.038,2.911,785,1.363,125,357


In [158]:
# Merge "Datum" (date) and "Uhrzeit" (time) into one datetime column and set it as index
df["datetime"] = df['Datum'] + ' ' + df['Uhrzeit']
df.drop(columns=['Datum', 'Uhrzeit'], inplace=True)
df["datetime"] = pd.to_datetime(df["datetime"], dayfirst=True)
df.set_index('datetime', inplace=True)

# Localize to Europe/Berlin (CET/CEST), convert to UTC, and store as naive UTC timestamps.
df.index = df.index.tz_localize('Europe/Berlin', ambiguous="NaT", nonexistent="NaT")
df.index = df.index.tz_convert(pytz.utc)
df.index = df.index.tz_localize(None)
df.index.names = ['dt_start_utc']

In [159]:
df.head()

dt_start_utc,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh]
2020-12-31 23:00:00,1.147,308,84,1.058,0,55,2.034,2.899,937,1.402,73,358
2020-12-31 23:15:00,1.145,301,88,1.025,0,55,2.036,2.905,872,1.416,95,358
2020-12-31 23:30:00,1.138,299,101,955.0,0,55,2.037,2.904,829,1.419,82,356
2020-12-31 23:45:00,1.139,298,108,931.0,0,55,2.038,2.901,805,1.412,98,354
2021-01-01 00:00:00,1.138,310,105,943.0,0,55,2.038,2.911,785,1.363,125,357


In [160]:
# Print the type of each variable in the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 17372 entries, 2020-12-31 23:00:00 to 2021-06-30 21:45:00
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Biomasse[MWh]                 17372 non-null  object
 1   Wasserkraft[MWh]              17372 non-null  int64 
 2   Wind Offshore[MWh]            17372 non-null  object
 3   Wind Onshore[MWh]             17372 non-null  object
 4   Photovoltaik[MWh]             17372 non-null  object
 5   Sonstige Erneuerbare[MWh]     17372 non-null  int64 
 6   Kernenergie[MWh]              17372 non-null  object
 7   Braunkohle[MWh]               17372 non-null  object
 8   Steinkohle[MWh]               17372 non-null  object
 9   Erdgas[MWh]                   17372 non-null  object
 10  Pumpspeicher[MWh]             17372 non-null  object
 11  Sonstige Konventionelle[MWh]  17372 non-null  int64 
dtypes: int64(3), object(9)
memory usage: 1.

Not all columns are of numerical type. For further work we need to convert them.

In [161]:
df[df['Erdgas[MWh]'].str.contains("-")].shape[0]

288

The column "Erdgas[MWh]" has strings with the value "-". We replace them with the numerical value "0".

In [162]:
df["Erdgas[MWh]"] = df["Erdgas[MWh]"].replace("-", int(0))

Now we can convert all values to a numerical type.

In [163]:
df = df.astype("float32", copy=False)
df.dtypes

Biomasse[MWh]                   float32
Wasserkraft[MWh]                float32
Wind Offshore[MWh]              float32
Wind Onshore[MWh]               float32
Photovoltaik[MWh]               float32
Sonstige Erneuerbare[MWh]       float32
Kernenergie[MWh]                float32
Braunkohle[MWh]                 float32
Steinkohle[MWh]                 float32
Erdgas[MWh]                     float32
Pumpspeicher[MWh]               float32
Sonstige Konventionelle[MWh]    float32
dtype: object
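
One caveat: values such as `1.147` in the raw file look like German-formatted numbers, where `.` separates thousands and `,` marks decimals. If that reading is right (it is an assumption, not confirmed by the source), `astype("float32")` would turn `"1.147"` into 1.147 rather than 1147. A possible way to parse such strings is sketched below with a hypothetical helper `parse_german_number`:

```python
import pandas as pd

def parse_german_number(series: pd.Series) -> pd.Series:
    """Hypothetical helper: parse strings like '1.147' or '2.034,5' as numbers."""
    text = series.astype(str).str.strip()
    text = text.str.replace(".", "", regex=False)   # drop thousands separators
    text = text.str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    # Placeholders such as "-" cannot be parsed and become NaN.
    return pd.to_numeric(text, errors="coerce")

# Made-up sample values in the assumed format.
raw = pd.Series(["1.147", "308", "-", "2.034,5"])
parsed = parse_german_number(raw)
print(parsed.tolist())
```

Whether a `-` placeholder should then become 0, as done for "Erdgas[MWh]", or stay NaN is a separate modelling decision; `parsed.fillna(0)` would reproduce the former.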

We now add a feature, `rel_total`, that holds the total supplied energy, computed as the row-wise sum of all columns.

In [164]:
df['rel_total'] = df.sum(axis=1)

The data is typically published with a delay of 2 hours. We therefore shift it by 8 rows, because consecutive rows are 15 minutes apart (2 h / 15 min = 8).

In [165]:
df = df.shift(periods=8)
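
The number of rows can also be derived from the delay and the sampling interval instead of being hard-coded. A small sketch on synthetic data (the variable names are illustrative):

```python
import pandas as pd

delay = pd.Timedelta(hours=2)      # publication delay stated in the text
step = pd.Timedelta(minutes=15)    # spacing between two rows
periods = delay // step            # number of rows to shift

idx = pd.date_range("2021-01-01 00:00", periods=12, freq="15min")
demo = pd.DataFrame({"value": range(12)}, index=idx)
shifted = demo.shift(periods=periods)

# The first `periods` rows have no earlier value and become NaN.
n_missing = int(shifted["value"].isna().sum())
print(periods, n_missing)
```

A row-based shift only corresponds to 2 hours if the index is evenly spaced, so gaps or `NaT` entries in the index would distort it.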

In [166]:
df_real_prov = df.copy()

In [167]:
df_real_prov.head()

dt_start_utc,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh],rel_total
2020-12-31 23:00:00,,,,,,,,,,,,,
2020-12-31 23:15:00,,,,,,,,,,,,,
2020-12-31 23:30:00,,,,,,,,,,,,,
2020-12-31 23:45:00,,,,,,,,,,,,,
2021-01-01 00:00:00,,,,,,,,,,,,,


## Merge 

We merge both data frames into a single data frame.

In [168]:
df_merge = df_prog_cons.merge(df_real_prov, left_index=True, right_index=True)
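
`merge` performs an inner join by default, so only timestamps present in both frames survive. This is why the merged frame starts at the first timestamp of the 2021 generation data even though the consumption forecast goes back to 2016. A sketch with two tiny made-up frames:

```python
import pandas as pd

# Made-up frames with partially overlapping quarter-hour indexes.
left = pd.DataFrame(
    {"total_pred_cons": [10.0, 11.0, 12.0]},
    index=pd.date_range("2021-01-01 00:00", periods=3, freq="15min"),
)
right = pd.DataFrame(
    {"rel_total": [5.0, 6.0]},
    index=pd.date_range("2021-01-01 00:15", periods=2, freq="15min"),
)

# Default how="inner": keep only index values found in both frames.
merged = left.merge(right, left_index=True, right_index=True)
print(merged)
```

Passing `how="left"` or `how="outer"` would keep the non-matching rows and fill the missing side with NaN.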

In [169]:
df_merge.head()

dt_start_utc,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw,total_pred_cons,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh],rel_total
2020-12-31 23:00:00,1616.0,3821.0,227.0,84.0,1437.0,7185.0,,,,,,,,,,,,,
2020-12-31 23:15:00,1552.0,3708.0,227.0,84.0,1391.0,6962.0,,,,,,,,,,,,,
2020-12-31 23:30:00,1495.0,3598.0,227.0,84.0,1344.0,6748.0,,,,,,,,,,,,,
2020-12-31 23:45:00,1442.0,3488.0,227.0,84.0,1297.0,6538.0,,,,,,,,,,,,,
2021-01-01 00:00:00,1392.0,3378.0,184.0,70.0,1248.0,6272.0,,,,,,,,,,,,,


We create a new feature: the difference between forecasted consumption and supplied power.

In [170]:
df_merge.eval("diff_prog_real = total_pred_cons - rel_total", inplace=True)
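
`eval` computes the expression row by row on the named columns. Rows where either operand is missing, such as the first rows after the shift, yield NaN. A sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values; the first row mimics a row without realized data.
demo = pd.DataFrame({
    "total_pred_cons": [7185.0, 6962.0, 6748.0],
    "rel_total": [np.nan, 7000.0, 6700.0],
})

# Same expression as in the notebook; equivalent to
# demo["total_pred_cons"] - demo["rel_total"].
demo.eval("diff_prog_real = total_pred_cons - rel_total", inplace=True)
print(demo["diff_prog_real"].tolist())
```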

## Save Data Set

For further work we save the data frame as a "pickle".

In [172]:
df_merge.to_pickle('../data/pickle/df_merged_1.pickle')
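
To check that the file round-trips, it can be read back with `pd.read_pickle`. The sketch below uses a temporary directory and a made-up frame rather than the real path:

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a named datetime index, similar in shape to df_merge.
demo = pd.DataFrame(
    {"diff_prog_real": [1.5, -2.0]},
    index=pd.DatetimeIndex(
        ["2021-01-01 00:00", "2021-01-01 00:15"], name="dt_start_utc"
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df_demo.pickle")
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# Index, dtypes and values are preserved by the round trip.
pd.testing.assert_frame_equal(demo, restored)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.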