## Data origin :: ane.energy

The data originates from the ENTSO-E transparency platform, which centrally collects and publishes electricity generation, transportation and consumption data and information for the pan-European market. The URL for the following data is: https://transparency.entsoe.eu/balancing/r2/imbalance/show


## Imports

### Modules, classes and functions

In [55]:
# `datetime` is imported further down via `from datetime import ...`.
import time, os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import glob
from pandas_profiling import ProfileReport
import json

from cesium import datasets

from functools import reduce

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

from datetime import datetime, timezone, timedelta

# Lower-case "min" is the frequency alias accepted across pandas versions.
resample_size = "15min"
resample_factor = 15

#windowfactor = 15

import warnings

### Original features

The instances represent article orders. The given features are as follows:

## Part 1: Data mining

We import the data set, explore it briefly, drop duplicates and unused features and cast the data types.

## Data

### 1.1 Import Data EPEX

In [56]:
def import_csv(path):
    """Read a CSV file and return it sorted by its UTC timestamp index."""
    # Parse 'dt_start_utc' as datetime and use it as the index.
    df = pd.read_csv(path, parse_dates=['dt_start_utc'], index_col='dt_start_utc')
    df.sort_index(inplace=True)
    return df

In [57]:
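def obj_2_datetime(df_list):
    """Cast 'dt_start_utc' to datetime in every frame and sort each frame by index."""
    for df in df_list:
        df["dt_start_utc"] = pd.to_datetime(df["dt_start_utc"])
        df.sort_index(inplace=True)
    # Return the whole list, not only the last frame of the loop.
    return df_list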

In [58]:
df_epex_da = import_csv("data/ane_energy/epex_da_de.csv") 

In [59]:
df_epex_da.drop(['Unnamed: 0', 'epex_da_de_mwh'], axis=1, inplace=True)

In [60]:
print("The Data Frame has",df_epex_da.isnull().sum().sum(),"missing values. Period:", df_epex_da.index.min(), "till",df_epex_da.index.max())

The Data Frame has 0 missing values. Period: 2004-12-31 23:00:00 till 2021-07-13 21:00:00


In [61]:
df_epex_da.tail

<bound method NDFrame.tail of                      sechs_h_regelung  epex_da_de_eur_mwh
dt_start_utc                                             
2004-12-31 23:00:00                 0               23.89
2005-01-01 00:00:00                 0               20.05
2005-01-01 01:00:00                 0               15.00
2005-01-01 02:00:00                 0               13.41
2005-01-01 03:00:00                 0               13.73
...                               ...                 ...
2021-07-13 17:00:00                 0              105.00
2021-07-13 18:00:00                 0              100.29
2021-07-13 19:00:00                 0               98.52
2021-07-13 20:00:00                 0               95.03
2021-07-13 21:00:00                 0               86.10

[144911 rows x 2 columns]>

# These checks only work before 'Unnamed: 0' and 'epex_da_de_mwh' are dropped.
print(df_epex_da.epex_da_de_mwh.isnull().sum())
print(df_epex_da.shape[0])
print((df_epex_da.sechs_h_regelung.sum()))

All missing values are in the column "epex_da_de_mwh". Comparing the column "Unnamed: 0" with the shape of the data frame shows that it merely counts the rows. It is therefore reasonable to remove both columns from this data frame. "sechs_h_regelung" is not filled exclusively with "0", so we decide to keep it.
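
The checks described above can be reproduced on a small stand-in frame. This is only a sketch: the toy values are invented, and it assumes the frame still contains the "Unnamed: 0" and "epex_da_de_mwh" columns, i.e. that the checks run before the drop.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw EPEX frame (all values invented for illustration).
idx = pd.date_range("2021-01-01", periods=4, freq="h", name="dt_start_utc")
raw = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3],
        "sechs_h_regelung": [0, 1, 0, 0],
        "epex_da_de_eur_mwh": [50.0, 48.5, 47.0, 51.2],
        "epex_da_de_mwh": [np.nan, 10.0, np.nan, 12.0],
    },
    index=idx,
)

# 1) Missing values per column: shows which column holds the NaNs.
missing_per_column = raw.isnull().sum()

# 2) Is "Unnamed: 0" just a running row counter 0..n-1?
is_row_counter = (raw["Unnamed: 0"].to_numpy() == np.arange(len(raw))).all()

# 3) Does the flag column contain anything besides 0?
flag_has_nonzero = (raw["sechs_h_regelung"] != 0).any()

print(missing_per_column)
print("row counter:", is_row_counter, "| flag has non-zero entries:", flag_has_nonzero)
```

If the second check is true and the first one singles out one column, dropping those two columns loses no information that the index does not already carry.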

In [62]:
df_epex_da = df_epex_da.resample(resample_size).interpolate(method='polynomial', order=2)

In [63]:
df_epex_da.tail()

dt_start_utc,sechs_h_regelung,epex_da_de_eur_mwh
2021-07-13 20:00:00,0.0,95.03
2021-07-13 20:15:00,0.0,93.350088
2021-07-13 20:30:00,0.0,91.301784
2021-07-13 20:45:00,0.0,88.885088
2021-07-13 21:00:00,0.0,86.1
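
The upsampling step above fits a second-order polynomial through the hourly values of every column, including the 0/1 flag "sechs_h_regelung". The sketch below, on invented toy data, shows one possible alternative in which only the price is interpolated and the flag is forward-filled so that it stays binary. Treating the flag as piecewise constant is an assumption, not something the data source states. Linear interpolation is used here only to keep the sketch free of the SciPy dependency that `method='polynomial'` needs.

```python
import pandas as pd

# Toy hourly frame (values invented for illustration).
idx = pd.date_range("2021-07-13 20:00", periods=3, freq="h", name="dt_start_utc")
hourly = pd.DataFrame(
    {"sechs_h_regelung": [0, 1, 0], "epex_da_de_eur_mwh": [95.0, 87.0, 91.0]},
    index=idx,
)

# Upsample to a 15-minute grid; the newly created rows are NaN until filled.
quarter_hourly = hourly.resample("15min").asfreq()

# Continuous price: interpolate between the hourly values.
quarter_hourly["epex_da_de_eur_mwh"] = (
    quarter_hourly["epex_da_de_eur_mwh"].interpolate(method="linear")
)

# Binary flag: carry the last hourly value forward and restore the integer dtype.
quarter_hourly["sechs_h_regelung"] = (
    quarter_hourly["sechs_h_regelung"].ffill().astype(int)
)

print(quarter_hourly)
```

Handling the columns separately avoids fractional flag values between two hourly observations.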


In [64]:
count_inf = np.isinf(df_epex_da).values.sum()
count_nan = df_epex_da.isnull().sum().sum()
print("The data frame contains " + str(count_inf) + " infinite values and " + str(count_nan) + " missing values.")

The data frame contains 0 infinite values and 0 missing values.
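
Because the same check for infinite and missing values recurs after each processing step, it could be wrapped in a small helper. `count_inf_nan` is a hypothetical name, not a function from this notebook; the sketch restricts the infinity check to numeric columns, since `np.isinf` is not defined for object columns.

```python
import numpy as np
import pandas as pd

def count_inf_nan(df):
    """Return (number of infinite values, number of missing values) in df."""
    # Only numeric columns can hold +/-inf.
    numeric = df.select_dtypes(include="number")
    count_inf = int(np.isinf(numeric.to_numpy(dtype=float)).sum())
    count_nan = int(df.isnull().sum().sum())
    return count_inf, count_nan

# Toy frame with two infinite values and one missing value.
demo = pd.DataFrame({"a": [1.0, np.inf, np.nan], "b": [0.0, -np.inf, 2.0]})
n_inf, n_nan = count_inf_nan(demo)
print(f"The data frame contains {n_inf} infinite values and {n_nan} missing values.")
```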


### Merge EPEX

In [65]:
#df_merged= df_merged.merge(df_epex_da, left_index=True, right_index=True)

## 1.2 Import ES Forecasts

In [68]:
df_es_fc_solar_ts = import_csv("data/ane_energy/es_fc_solar_ts.csv") 
df_es_fc_total_load_ts = import_csv("data/ane_energy/es_fc_total_load_ts.csv")
df_es_fc_total_renewables_ts = import_csv("data/ane_energy/es_fc_total_renewables_ts.csv")
df_es_fc_wind_offshore_ts = import_csv("data/ane_energy/es_fc_wind_offshore_ts.csv")
df_es_fc_wind_onshore_ts = import_csv("data/ane_energy/es_fc_wind_onshore_ts.csv")

In [69]:
df_es_fc_solar_ts.name = "df_es_fc_solar_ts"
df_es_fc_total_load_ts.name = "df_es_fc_total_load_ts"
df_es_fc_total_renewables_ts.name = "df_es_fc_total_renewables_ts"
df_es_fc_wind_offshore_ts.name = "df_es_fc_wind_offshore_ts"
df_es_fc_wind_onshore_ts.name = "df_es_fc_wind_onshore_ts"



In [70]:
df_list_es = [
    df_es_fc_solar_ts,
    df_es_fc_total_load_ts,
    df_es_fc_total_renewables_ts,
    df_es_fc_wind_offshore_ts,
    df_es_fc_wind_onshore_ts,
]

In [71]:
for df in df_list_es:
    print(df.head)
    print("The Data Frame",df.name,"has",df.isnull().sum().sum(),"missing values. Period:", df.index.min(), "till",df.index.max())

<bound method NDFrame.head of                      50Hertz_power_mw  DE_power_mw  DK_power_mw  DK1_power_mw  \
dt_start_utc                                                                    
2016-12-31 23:00:00               0.0          0.0          0.0           0.0   
2016-12-31 23:15:00               0.0          0.0          0.0           0.0   
2016-12-31 23:30:00               0.0          0.0          0.0           0.0   
2016-12-31 23:45:00               0.0          0.0          0.0           0.0   
2017-01-01 00:00:00               0.0          0.0          0.0           0.0   
...                               ...          ...          ...           ...   
2021-07-13 20:45:00               0.0          0.0          0.0           0.0   
2021-07-13 21:00:00               0.0          0.0          0.0           0.0   
2021-07-13 21:15:00               0.0          0.0          0.0           0.0   
2021-07-13 21:30:00               0.0          0.0          0.0           0.0  

In [72]:
df_merged_es = reduce(lambda  left,right: pd.merge(left, right, on=['dt_start_utc'],
                                            how='inner'), df_list_es)

In [73]:
df_merged_es.tail()

dt_start_utc,50Hertz_power_mw_x,DE_power_mw_x,DK_power_mw_x,DK1_power_mw_x,TTG_power_mw_x,50Hertz_power_mw_y,DE_power_mw_y,DK_power_mw_y,DK1_power_mw_y,TTG_power_mw_y,...,50Hertz_power_mw_y,DE_power_mw_y,DK_power_mw_y,DK1_power_mw_y,TTG_power_mw_y,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw
2021-07-13 20:45:00,0.0,0.0,0.0,0.0,0.0,9651.0,51324.0,3505.0,2268.0,14931.0,...,323.0,743.0,226.0,127.0,420.0,3637.0,8334.0,274.0,193.0,2620.0
2021-07-13 21:00:00,0.0,0.0,0.0,0.0,0.0,9421.0,49935.0,3289.0,2144.0,14327.0,...,330.0,747.0,214.0,120.0,417.0,3677.0,8426.0,297.0,208.0,2666.0
2021-07-13 21:15:00,0.0,0.0,0.0,0.0,0.0,9244.0,49336.0,3289.0,2144.0,14301.0,...,337.0,746.0,214.0,120.0,409.0,3580.0,8309.0,297.0,208.0,2678.0
2021-07-13 21:30:00,0.0,0.0,0.0,0.0,0.0,9065.0,48488.0,3289.0,2144.0,14043.0,...,343.0,746.0,214.0,120.0,403.0,3485.0,8192.0,297.0,208.0,2688.0
2021-07-13 21:45:00,0.0,0.0,0.0,0.0,0.0,8914.0,47817.0,3289.0,2144.0,13865.0,...,350.0,744.0,214.0,120.0,394.0,3371.0,8061.0,297.0,208.0,2699.0
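
The merged header above contains repeated `_x`/`_y` suffixes: all forecast files share the same column names, and a chained `pd.merge` only adds suffixes when the two frames being merged collide, so the same suffixes are reused. One way to keep the columns distinguishable is to prefix each frame's columns with a source label before joining. The sketch below uses invented toy frames and hypothetical labels (`solar`, `wind_onshore`); `pd.concat(..., axis=1, join="inner")` aligns on the shared `dt_start_utc` index in the same way as the inner merge.

```python
import pandas as pd

idx = pd.date_range("2021-07-13 21:00", periods=3, freq="15min", name="dt_start_utc")

# Toy frames that share column names, like the forecast files (values invented).
frames = {
    "solar": pd.DataFrame(
        {"DE_power_mw": [0.0, 0.0, 0.0], "DK_power_mw": [0.0, 0.0, 0.0]}, index=idx
    ),
    "wind_onshore": pd.DataFrame(
        {"DE_power_mw": [800.0, 810.0, 820.0], "DK_power_mw": [30.0, 30.0, 30.0]}, index=idx
    ),
}

# Prefix every column with its source label so that names stay unique.
prefixed = [df.add_prefix(f"{label}_") for label, df in frames.items()]

# Inner join on the shared datetime index.
merged = pd.concat(prefixed, axis=1, join="inner")

print(merged.columns.tolist())
```

With unique names, each forecast series can later be selected unambiguously by its column label.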


In [74]:
#df_merged_es.to_csv("df_merged_test.csv")

In [75]:
#df.fillna(df.median(), inplace=True)

In [76]:
#df.fillna(0, inplace=True)

## Merge

In [77]:
df_merged_3 = df_epex_da.merge(df_merged_es, left_index=True, right_index=True)

In [78]:
count_inf = np.isinf(df_merged_3).values.sum()
count_nan = df_merged_3.isnull().sum().sum()
print("The data frame contains " + str(count_inf) + " infinite values and " + str(count_nan) + " missing values.")

The data frame contains 0 infinite values and 0 missing values.


In [79]:
df_merged_3.head()

dt_start_utc,sechs_h_regelung,epex_da_de_eur_mwh,50Hertz_power_mw_x,DE_power_mw_x,DK_power_mw_x,DK1_power_mw_x,TTG_power_mw_x,50Hertz_power_mw_y,DE_power_mw_y,DK_power_mw_y,...,50Hertz_power_mw_y,DE_power_mw_y,DK_power_mw_y,DK1_power_mw_y,TTG_power_mw_y,50Hertz_power_mw,DE_power_mw,DK_power_mw,DK1_power_mw,TTG_power_mw
2016-12-31 23:00:00,0.0,20.96,0.0,0.0,0.0,0.0,0.0,6291.0,40241.0,2942.0,...,331.0,3340.0,1004.0,652.0,3009.0,4922.0,12966.0,2282.0,1857.0,6636.0
2016-12-31 23:15:00,2.9437899999999997e-88,20.463647,0.0,0.0,0.0,0.0,0.0,6134.0,39805.0,2942.0,...,331.0,3338.0,1004.0,652.0,3007.0,4890.0,12916.0,2282.0,1857.0,6614.0
2016-12-31 23:30:00,2.667555e-88,20.622226,0.0,0.0,0.0,0.0,0.0,5976.0,39023.0,2942.0,...,331.0,3339.0,1004.0,652.0,3008.0,4859.0,12873.0,2282.0,1857.0,6595.0
2016-12-31 23:45:00,1.057543e-88,20.934692,0.0,0.0,0.0,0.0,0.0,5871.0,38682.0,2942.0,...,331.0,3339.0,1004.0,652.0,3008.0,4833.0,12830.0,2282.0,1857.0,6572.0
2017-01-01 00:00:00,0.0,20.9,0.0,0.0,0.0,0.0,0.0,5810.0,38162.0,2735.0,...,331.0,3370.0,1094.0,748.0,3039.0,4808.0,12315.0,2152.0,1774.0,6078.0


In [80]:
df_merged_3.to_pickle('data/pickle/df_merged_3.pickle')

In [81]:
count = np.isinf(df_merged_3).values.sum()
print("The data frame contains " + str(count) + " infinite values")
print("The Data Frame has",df_merged_3.isnull().sum().sum(),"missing values.")

The data frame contains 0 infinite values
The Data Frame has 0 missing values.
