## Data origin :: smard

The origin for this data is the ENTSO-E Transparency Platform ("entsoe"). It is a central collection and publication of electricity generation, transportation and consumption data and information for the pan-European market. The URL for the following data is: https://transparency.entsoe.eu/balancing/r2/imbalance/show


### Original features

Each instance represents one 15-minute interval. The features are the columns of the data frames shown in the sections that follow.

## Part 1: Data mining

We import the data set, explore it briefly, drop duplicates and unused features and cast the data types.

## Imports

### Modules, classes and functions

In [229]:
import glob
import json
import os
import time
import warnings
from datetime import datetime, timedelta, timezone
from functools import reduce

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pytz
import seaborn as sns
from cesium import datasets
from pandas_profiling import ProfileReport
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

# Make numpy printouts easier to read.
#np.set_printoptions(precision=3, suppress=True)

resample_factor = 15

## Import Forecasted Energy Consumption

All times on SMARD refer to Central European Time (CET, UTC+1) or Central European Summer Time (CEST, UTC+2), whichever is in effect at the given time.
- The switch from standard time to summer time takes place on the last Sunday in March at 02:00 by skipping the hour from 02:00 to 03:00.
- The switch from summer time to standard time takes place on the last Sunday in October at 03:00 by repeating the hour from 02:00 to 03:00.
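The two transitions matter as soon as naive local timestamps are localized. The following is a minimal sketch, using made-up timestamps from the 2021 transition days, of what `tz_localize` does with the `ambiguous="NaT"` and `nonexistent="NaT"` options that this notebook uses:

```python
import pandas as pd

# Naive local timestamps around the spring transition (02:00-03:00 is skipped).
spring = pd.DatetimeIndex(["2021-03-28 01:45", "2021-03-28 02:15", "2021-03-28 03:00"])
# Naive local timestamps around the autumn transition (02:00-03:00 occurs twice).
autumn = pd.DatetimeIndex(["2021-10-31 01:45", "2021-10-31 02:15", "2021-10-31 03:00"])

# Same options as in the notebook: problematic local times become NaT.
spring_utc = spring.tz_localize(
    "Europe/Berlin", ambiguous="NaT", nonexistent="NaT"
).tz_convert("UTC")
autumn_utc = autumn.tz_localize(
    "Europe/Berlin", ambiguous="NaT", nonexistent="NaT"
).tz_convert("UTC")

print(spring_utc)  # the skipped 02:15 is NaT
print(autumn_utc)  # the repeated 02:15 is NaT
```

With these options every timestamp inside the repeated autumn hour loses its value, because pandas cannot tell the first occurrence from the second. If the rows are in chronological order, `ambiguous="infer"` may be an alternative worth testing.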

In [230]:
df_prog_cons = pd.read_csv("data/smard/df_prog_cons.csv")
df_prog_cons["dt_start_cet"] = pd.to_datetime(df_prog_cons["dt_start_cet"])
df_prog_cons.set_index("dt_start_cet", inplace=True)

# Localize to German local time and convert to UTC.
# Ambiguous or nonexistent local times (DST changes) become NaT.
df_prog_cons.index = df_prog_cons.index.tz_localize("Europe/Berlin", ambiguous="NaT", nonexistent="NaT")
df_prog_cons.index = df_prog_cons.index.tz_convert(pytz.utc)
df_prog_cons.index.names = ["dt_start_utc"]

# Drop the timezone info to get naive UTC timestamps.
df_prog_cons.index = df_prog_cons.index.tz_localize(None)

df_prog_cons.to_csv("temp_prg_cons.csv")

In [231]:
df_prog_cons.tail()

dt_start_utc,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw
2021-07-13 18:45:00,3637.0,8334.0,274.0,193.0,2620.0
2021-07-13 19:00:00,3677.0,8426.0,297.0,208.0,2666.0
2021-07-13 19:15:00,3580.0,8309.0,297.0,208.0,2678.0
2021-07-13 19:30:00,3485.0,8192.0,297.0,208.0,2688.0
2021-07-13 19:45:00,3371.0,8061.0,297.0,208.0,2699.0


In [232]:
df_prog_cons['total_pred_cons'] = df_prog_cons.sum(axis=1)

In [233]:
count = np.isinf(df_prog_cons).values.sum()
print("The data frame contains " + str(count) + " infinite values and",df_prog_cons.isnull().sum().sum(),"missing values.")

The data frame contains 0 infinite values and 0 missing values.
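The check above only looks at the column values. Timestamps that were set to `NaT` during localization sit in the index and are not counted by `isnull()` on the frame. A small sketch with a hypothetical toy frame (`toy`, `missing_index` and `toy_clean` are illustrative names, not part of the notebook) shows how the index could be checked as well:

```python
import pandas as pd

# Hypothetical toy frame: one index entry is NaT, as produced by
# tz_localize(..., ambiguous="NaT") for a repeated local hour.
idx = pd.DatetimeIndex(
    ["2021-10-30 23:45", pd.NaT, "2021-10-31 02:00"], name="dt_start_utc"
)
toy = pd.DataFrame({"DE_power_mw": [8000.0, 8100.0, 8200.0]}, index=idx)

missing_values = int(toy.isnull().sum().sum())  # NaN inside the columns
missing_index = int(toy.index.isna().sum())     # NaT inside the index

print(missing_values, "missing values,", missing_index, "missing timestamps")

# Keep only rows with a valid timestamp.
toy_clean = toy[toy.index.notna()]
```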


## 1.3 Actual generation (Realisierte Erzeugung)

In [234]:
df = pd.read_csv("data/smard/Stromerzeugung/Realisierte_Erzeugung_2021.csv", delimiter=";", decimal=',')

# Combine the date and time columns into a single timestamp.
df["datetime"] = pd.to_datetime(df["Datum"] + " " + df["Uhrzeit"], dayfirst=True)
df.drop(columns=["Datum", "Uhrzeit"], inplace=True)
df.set_index("datetime", inplace=True)

# Localize to German local time, convert to UTC and drop the timezone info.
# Ambiguous or nonexistent local times (DST changes) become NaT.
df.index = df.index.tz_localize("Europe/Berlin", ambiguous="NaT", nonexistent="NaT")
df.index = df.index.tz_convert(pytz.utc)
df.index = df.index.tz_localize(None)

In [235]:
df.head()

datetime,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh]
2020-12-31 23:00:00,1.147,308,84,1.058,0,55,2.034,2.899,937,1.402,73,358
2020-12-31 23:15:00,1.145,301,88,1.025,0,55,2.036,2.905,872,1.416,95,358
2020-12-31 23:30:00,1.138,299,101,955.0,0,55,2.037,2.904,829,1.419,82,356
2020-12-31 23:45:00,1.139,298,108,931.0,0,55,2.038,2.901,805,1.412,98,354
2021-01-01 00:00:00,1.138,310,105,943.0,0,55,2.038,2.911,785,1.363,125,357
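In the output above, the same column holds values such as 1.058 next to 955.0. This suggests that the "." in the raw file is a thousands separator (German number format), so that 1.058 stands for 1058 MWh. The following sketch uses a hypothetical three-column excerpt, not the real file, to show how `read_csv` could be told about this. Reading "-" as a missing value is also an assumption about the raw format:

```python
import io

import pandas as pd

# Hypothetical excerpt in the German number format ("." thousands, "," decimal).
raw = io.StringIO(
    "Datum;Uhrzeit;Biomasse[MWh];Wind Onshore[MWh];Erdgas[MWh]\n"
    "01.01.2021;00:00;1.147;1.058;1.402\n"
    "01.01.2021;00:15;1.145;955;-\n"
)

sample = pd.read_csv(
    raw,
    delimiter=";",
    decimal=",",
    thousands=".",    # "1.147" is read as 1147
    na_values="-",    # assumption: "-" marks a missing value
    # Keep date and time as text so the thousands option cannot touch them.
    dtype={"Datum": str, "Uhrzeit": str},
)

print(sample)
```

Missing values then arrive as `NaN` instead of being replaced by 0, which keeps them distinguishable from a true zero in later sums.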


In [236]:
df.index.names = ['dt_start_utc']
df.head()

dt_start_utc,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh]
2020-12-31 23:00:00,1.147,308,84,1.058,0,55,2.034,2.899,937,1.402,73,358
2020-12-31 23:15:00,1.145,301,88,1.025,0,55,2.036,2.905,872,1.416,95,358
2020-12-31 23:30:00,1.138,299,101,955.0,0,55,2.037,2.904,829,1.419,82,356
2020-12-31 23:45:00,1.139,298,108,931.0,0,55,2.038,2.901,805,1.412,98,354
2021-01-01 00:00:00,1.138,310,105,943.0,0,55,2.038,2.911,785,1.363,125,357


In [237]:
#cols = df.columns.tolist()
#cols =cols.remove("datetime")

In [238]:
#print(cols)

In [239]:
# Replace "-" placeholders with 0 so the column can be cast to float.
df["Erdgas[MWh]"] = df["Erdgas[MWh]"].replace("-", 0)

df = df.astype("float32")

# Total realized generation across all sources.
df["rel_total"] = df.sum(axis=1)
df.head()

dt_start_utc,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh],rel_total
2020-12-31 23:00:00,1.147,308.0,84.0,1.058,0.0,55.0,2.034,2.899,937.0,1.402,73.0,358.0,1823.539917
2020-12-31 23:15:00,1.145,301.0,88.0,1.025,0.0,55.0,2.036,2.905,872.0,1.416,95.0,358.0,1777.526978
2020-12-31 23:30:00,1.138,299.0,101.0,955.0,0.0,55.0,2.037,2.904,829.0,1.419,82.0,356.0,2684.498047
2020-12-31 23:45:00,1.139,298.0,108.0,931.0,0.0,55.0,2.038,2.901,805.0,1.412,98.0,354.0,2656.490234
2021-01-01 00:00:00,1.138,310.0,105.0,943.0,0.0,55.0,2.038,2.911,785.0,1.363,125.0,357.0,2687.449951


The data is published with a delay of about 2 hours, so we shift it by 8 periods (8 × 15 min = 2 h).
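As a minimal illustration of this shift, here is a hypothetical quarter-hourly frame (the values are made up). `shift(periods=8)` leaves the index untouched and moves the values eight rows down, so each row ends up carrying the value from two hours earlier:

```python
import pandas as pd

# Hypothetical quarter-hourly frame covering 3 hours (12 rows).
idx = pd.date_range("2021-01-01 00:00", periods=12, freq="15min", name="dt_start_utc")
gen = pd.DataFrame({"rel_total": [float(i) for i in range(12)]}, index=idx)

# Index stays the same, values move 8 rows (= 2 hours) down.
shifted = gen.shift(periods=8)

print(shifted)
```

The first eight rows become `NaN`. The shift only equals two hours if the index is gap-free at 15-minute spacing.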

In [240]:
df = df.shift(periods=8)

In [241]:
df_real_prov = df.copy()

In [242]:
df_real_prov.head()

dt_start_utc,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh],rel_total
2020-12-31 23:00:00,,,,,,,,,,,,,
2020-12-31 23:15:00,,,,,,,,,,,,,
2020-12-31 23:30:00,,,,,,,,,,,,,
2020-12-31 23:45:00,,,,,,,,,,,,,
2021-01-01 00:00:00,,,,,,,,,,,,,


In [243]:
df.to_csv("test_erdgas.csv")

## Merge

In [244]:
df_merge = df_prog_cons.merge(df_real_prov, left_index=True, right_index=True)

In [245]:
df_merge.head()

dt_start_utc,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw,total_pred_cons,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh],rel_total
2020-12-31 23:00:00,1616.0,3821.0,227.0,84.0,1437.0,7185.0,,,,,,,,,,,,,
2020-12-31 23:15:00,1552.0,3708.0,227.0,84.0,1391.0,6962.0,,,,,,,,,,,,,
2020-12-31 23:30:00,1495.0,3598.0,227.0,84.0,1344.0,6748.0,,,,,,,,,,,,,
2020-12-31 23:45:00,1442.0,3488.0,227.0,84.0,1297.0,6538.0,,,,,,,,,,,,,
2021-01-01 00:00:00,1392.0,3378.0,184.0,70.0,1248.0,6272.0,,,,,,,,,,,,,
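`merge` defaults to an inner join, so only timestamps present in both frames survive. The following sketch uses two hypothetical toy frames with made-up values to show the effect. It also shows how an outer join with `indicator=True` could reveal which rows exist on one side only:

```python
import pandas as pd

# Hypothetical toy frames: the right frame covers fewer timestamps.
idx_left = pd.date_range("2021-01-01 00:00", periods=4, freq="15min", name="dt_start_utc")
idx_right = pd.date_range("2021-01-01 00:15", periods=2, freq="15min", name="dt_start_utc")
left = pd.DataFrame({"total_pred_cons": [7185.0, 6962.0, 6748.0, 6538.0]}, index=idx_left)
right = pd.DataFrame({"rel_total": [10.0, 20.0]}, index=idx_right)

# Default how="inner": rows without a partner are dropped silently.
inner = left.merge(right, left_index=True, right_index=True)

# Outer join with an indicator column to inspect what would be lost.
outer = left.merge(right, left_index=True, right_index=True, how="outer", indicator=True)

print(len(inner), "rows after inner join")
print(outer["_merge"].value_counts())
```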


In [246]:
df_merge.eval("diff_prog_real = total_pred_cons - rel_total", inplace=True)

In [247]:
df_merge.tail()

dt_start_utc,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw,total_pred_cons,Biomasse[MWh],Wasserkraft[MWh],Wind Offshore[MWh],Wind Onshore[MWh],Photovoltaik[MWh],Sonstige Erneuerbare[MWh],Kernenergie[MWh],Braunkohle[MWh],Steinkohle[MWh],Erdgas[MWh],Pumpspeicher[MWh],Sonstige Konventionelle[MWh],rel_total,diff_prog_real
2021-06-30 20:45:00,4194.0,9994.0,756.0,625.0,3352.0,18921.0,1.121,562.0,420.0,1.919,77.0,36.0,1.656,3.135,1.692,2.051,169.0,273.0,1548.574097,17372.425903
2021-06-30 21:00:00,4246.0,10153.0,834.0,694.0,3427.0,19354.0,1.123,632.0,431.0,1.874,32.0,36.0,1.657,3.143,1.651,2.059,455.0,272.0,1869.506958,17484.493042
2021-06-30 21:15:00,4291.0,10299.0,834.0,694.0,3492.0,19610.0,1.121,595.0,450.0,1.904,8.0,36.0,1.657,3.139,1.622,2.061,418.0,272.0,1790.504028,17819.495972
2021-06-30 21:30:00,4319.0,10429.0,834.0,694.0,3555.0,19831.0,1.125,572.0,475.0,1.914,2.0,36.0,1.657,3.119,1.604,2.062,259.0,272.0,1627.480957,18203.519043
2021-06-30 21:45:00,4328.0,10546.0,834.0,694.0,3612.0,20014.0,1.128,567.0,450.0,1.908,1.0,36.0,1.657,3.028,1.571,2.052,153.0,272.0,1490.343994,18523.656006
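The forecast columns carry the suffix `_power_mw`, while the generation columns are labelled in MWh. If the names reflect the units (an assumption, not confirmed by the source), `diff_prog_real` subtracts energy per quarter hour from power. Energy delivered over one interval can be converted into average power by dividing by the interval length in hours. A minimal sketch with a hypothetical helper:

```python
def mwh_per_interval_to_mw(energy_mwh: float, interval_minutes: float = 15.0) -> float:
    """Average power in MW for an energy amount delivered over one interval."""
    return energy_mwh / (interval_minutes / 60.0)


# 1147 MWh within 15 minutes corresponds to an average of 4588 MW.
print(mwh_per_interval_to_mw(1147.0))
```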


## Save Data Set

In [248]:
df_merge.to_pickle('data/pickle/df_merged_1.pickle')

In [249]:
count = np.isinf(df_prog_cons).values.sum()
print("The data frame contains " + str(count) + " infinite values")
print("The Data Frame has",df_prog_cons.isnull().sum().sum(),"missing values.")

The data frame contains 0 infinite values
The Data Frame has 0 missing values.


In [250]:
#df.fillna(0, inplace=True)
#df.fillna(df.median(), inplace=True)