# ENTSO-E Actual Generation per Type data

In this Jupyter Notebook we import the ENTSO-E Actual Generation per Type data, as processed by the OPSD time series scripts,
and correct the hourly data with the values reported by Eurostat.
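
The correction step itself is not spelled out in this section. As a minimal sketch, assuming the correction is a uniform scaling so that the hourly values add up to the reported annual total, it could look like the following. The helper name `scale_to_annual_total` and the toy numbers are illustrative only and not part of the notebook.

```python
import pandas as pd


def scale_to_annual_total(hourly, annual_total):
    """Scale an hourly series so that its sum equals annual_total.

    Hypothetical helper: assumes one uniform correction factor for all hours.
    """
    measured_total = hourly.sum()
    if measured_total == 0:
        # nothing to scale; avoid a division by zero
        return hourly.copy()
    return hourly * (annual_total / measured_total)


# toy example: four hours of generation in MWh (made-up numbers)
hourly_gas = pd.Series(
    [100.0, 150.0, 50.0, 100.0],
    index=pd.date_range("2018-01-01", periods=4, freq="60min"),
)

# the reported total is 500 MWh, the measured total is 400 MWh
corrected_gas = scale_to_annual_total(hourly_gas, annual_total=500.0)
```

A uniform factor keeps the shape of the hourly profile and only changes its level; other correction schemes are possible.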

## Data sources

1. ENTSO-E Transparency Platform, Actual Generation per Type. Available online: https://transparency.entsoe.eu/generation/r2/actualGenerationPerProductionType/show (accessed on Oct 02, 2020).
 - Processed with the OPSD time series scripts
 
2. Eurostat, Energy Balances in the MS Excel file format (2020 edition). Available online: https://ec.europa.eu/eurostat/de/web/energy/data/energy-balances (accessed on Oct 02, 2020).




## Import python libraries

In [59]:
import numpy as np
import pandas as pd
import yaml


#Helpers
import os
#import pycountry
import glob
from datetime import datetime, date, timedelta, time


#Plotting
import matplotlib.pyplot as plt
#import seaborn as sns


%matplotlib inline
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [15, 6]

## Set data directories

Create input, processed and output folders if they don't exist. If the paths are relative, the corresponding folders will be created inside the current working directory.

In [60]:
input_directory_path = os.path.join('input')
processed_directory_path = 'processed'
output_directory_path = os.path.join('output')

sources_yaml_path = os.path.join('input', 'sources.yml')

os.makedirs(input_directory_path, exist_ok=True)
os.makedirs(processed_directory_path, exist_ok=True)
os.makedirs(output_directory_path, exist_ok=True)

## Define functions

In [61]:
# Import function for the OPSD time series

def load_timeseries_opsd(years=None, fn=None, countries=None, source="ENTSOE_transparency"):
    """
    Read generation data from the OPSD time series package (own modification).

    Parameters
    ----------
    years : None or slice()
        Years for which to read data (currently not applied)

    fn : file name or url location (file format .csv)

    countries : Countries for which to read data (currently not applied)

    source : "ENTSOE_transparency" ("ENTSOE_power_statistics" is not implemented)

    Returns
    -------
    generation : pd.DataFrame
        Generation time series with UTC timestamps as index and a
        six-level column MultiIndex
    """

    if source == 'ENTSOE_transparency':
        generation = (pd.read_csv(fn, index_col=[0], header=[0, 1, 2, 3, 4, 5], parse_dates=True)
                      .dropna(how="all", axis=0))

    else:
        raise NotImplementedError(f"Data for source `{source}` not available.")

    #generation = generation.rename(columns={'GB_UKM' : 'GB'}).filter(items=countries)

    return generation
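
The function above reads a CSV file whose header spans several rows. As a small illustration of how `pd.read_csv` turns such header rows into a column MultiIndex, the sketch below writes a toy frame with two header levels to an in-memory CSV and reads it back. The level names, column names and values are made up for the example; the OPSD file uses six header levels.

```python
import io

import pandas as pd

# toy frame with two header levels
columns = pd.MultiIndex.from_tuples(
    [("DE", "gas"), ("DE", "solar"), ("FR", "gas")],
    names=["region", "variable"],
)
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=pd.DatetimeIndex(
        ["2018-01-01 00:00", "2018-01-01 01:00"], name="utc_timestamp"
    ),
    columns=columns,
)

# write to an in-memory CSV and read it back, one header row per level
buffer = io.StringIO(toy.to_csv())
loaded = pd.read_csv(buffer, index_col=[0], header=[0, 1], parse_dates=True)

# selecting on the first level returns all columns of one country
germany = loaded["DE"]
```

Selecting a full tuple such as `loaded[("FR", "gas")]` returns a single series, which mirrors the `entsoe_gen_type[('DE','gas')]` selection used later in the notebook.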

In [62]:
def import_eurostat_energy_balance_sheets(path):
    """
    Load and standardize the raw eurostat energy balance sheet files.

    Parameters
    ----------
    path : Path to data directory


    """
    
    
    # collect all .xlsb files in the data directory
    
    filenames = sorted(glob.glob(path + "/*.xlsb"))
    
    # import xlsb files
    # pd.concat appends the data of all files to one dataframe
    # encoding: "utf-16", see ENTSO-E documentation
    # column selection is possible by using "usecols=['DateTime','ResolutionCode','AreaCode','AreaTypeCode','GenerationUnitEIC',...]" 
    
    entsoe_pp_timeseries = pd.concat((pd.read_csv(f, sep='\t', encoding='utf-16', index_col = 3) for f in filenames))
    
    entsoe_pp_timeseries.drop(columns=["Year","Month","Day"], inplace=True)
    
    entsoe_pp_timeseries.index = pd.to_datetime(entsoe_pp_timeseries.index)

    # set generation and consumption to absolute values (assuming that the negative entries are incorrect)
    entsoe_pp_timeseries['ActualGenerationOutput'] = entsoe_pp_timeseries.ActualGenerationOutput.abs()
    
    entsoe_pp_timeseries['ActualConsumption'] = entsoe_pp_timeseries.ActualConsumption.abs()

    return entsoe_pp_timeseries

In [63]:
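def change_ProductionTypeName(entsoe_timeseries):
    """Return a copy of the data with harmonized production type names."""
    # The keys are regular expressions. Exact names are anchored with ^...$
    # so that a pattern cannot rewrite only a part of a longer name.
    new_names = {r'^Fossil Hard coal$': 'Hard Coal',
                 r'^Fossil Brown coal/Lignite$': 'Lignite',
                 r'^Fossil Gas$': 'Gas',
                 r'^Fossil Oil$': 'Other fossil',
                 r'^Fossil Coal-derived gas$': 'Other fossil',
                 r'^Fossil Peat$': 'Other fossil',
                 r'^Fossil Oil Shale$': 'Other fossil',
                 r'^Other$': 'Other fossil',
                 r'.*Hydro.*': 'Hydro',
                 r'.*Oil.*': 'Oil'
                 }
    # replace(..., inplace=True) returns None, so return a new dataframe instead
    result = entsoe_timeseries.copy()
    result['ProductionTypeName'] = result['ProductionTypeName'].replace(new_names, regex=True)
    return result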

## Set filter parameter

In [65]:
# Change the production type names
new_ProductionTypeName = False

# Production type mapping in YAML notation, kept as a comment because
# a bare YAML block is not valid Python:
#
# renewables:
#     Solar: solar
#     Wind Onshore: wind_onshore
#     Wind Offshore: wind_offshore
#     Biomass: biomass
#     Other renewable: other_renewable
# conventional:
#     Fossil Hard coal: hard_coal
#     Fossil Brown coal/Lignite: lignite
#     Fossil Gas: gas
#     Fossil Oil: other_fossil
#     Fossil Coal-derived gas: other_fossil
#     Fossil Peat: other_fossil
#     Fossil Oil Shale: other_fossil
#     Other: other_fossil
#     Hydro Pumped Storage: hydro
#     Hydro Run-of-river and poundage: hydro
#     Hydro Water Reservoir: hydro
#     Fossil Oil: oil
#     Fossil Oil shale: oil


#old                        : new
#------------------------------------------------
#'Fossil Hard coal'         : 'Hard Coal',
#'Fossil Brown coal/Lignite': 'Lignite',
#'Fossil Gas'               : 'Gas',
#'Fossil Oil'               : 'Other fossil',
#'Fossil Coal-derived gas'  : 'Other fossil',
#'Fossil Peat'              : 'Other fossil',
#'Fossil Oil Shale'         : 'Other fossil',
#'Other'                    : 'Other fossil',
#'.*Hydro.*'                : 'Hydro',
#'.*Oil.*'                  : 'Oil'

# dataset period
start = '2019-01-01'
end = '2020-01-01'
closed='left' # end is not included 

# test dataset for gaps, datetime and duplicates
test_dataset = False

# countries to analyze
#countries = ['AT', 'BE', 'BG', 'CH', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GB', 'GR', 'HR', 'HU', 'IE', 'IT', 'LT', 'LU', 'LV', 'ME', 'NL', 'NO', 'PL', 'PT', 'RO', 'RS', 'SE', 'SI', 'SK']

#'AL', 
#missing in the data 'BA', 'MK'

# Dict to convert between alpha-3 and alpha-2 country codes
import pycountry

countries_dic = {}
for country in pycountry.countries:
    countries_dic[country.alpha_3] = country.alpha_2

In [66]:
with open(sources_yaml_path, 'r', encoding='UTF-8') as f:
    # safe_load avoids the missing-Loader problem of a bare yaml.load call
    sources = yaml.safe_load(f)

  


In [67]:
sources['eurostat energy balances']['Energy Balances in the MS Excel file format']['variable_type']

{'Anthracite': 'hard_coal',
 'Coking coal': 'hard_coal',
 'Other bituminous coal': 'hard_coal',
 'Sub-bituminous coal': 'hard_coal',
 'Lignite': 'lignite',
 'Patent fuel': 'other_fossil',
 'Coke oven coke': 'hard_coal',
 'Gas coke': 'hard_coal',
 'Coal tar': 'hard_coal',
 'Brown coal briquettes': 'lignite',
 'Gas works gas': 'gas',
 'Coke oven gas': 'gas',
 'Blast furnace gas': 'gas',
 'Other recovered gases': 'gas',
 'Peat': 'other_fossil',
 'Peat products': 'other_fossil',
 'Oil shale and oil sands': 'oil',
 'Crude oil': 'oil',
 'Natural gas liquids': 'oil',
 'Refinery feedstocks': 'other_fossil',
 'Other hydrocarbons': 'other_fossil',
 'Refinery gas': 'other_fossil',
 'Ethane': 'other_fossil',
 'Liquefied petroleum gases': 'other_fossil',
 'Motor gasoline (excluding biofuel portion)': 'other_fossil',
 'Aviation gasoline': 'other_fossil',
 'Gasoline-type jet fuel': 'other_fossil',
 'Other kerosene': 'other_fossil',
 'Naphtha': 'other_fossil',
 'Fuel oil': 'oil',
 'White spirit and sp

In [97]:
years = ['2018','2017','2016','2015']

# mapping of eurostat fuel names to technology names
variable_type = sources['eurostat energy balances']['Energy Balances in the MS Excel file format']['variable_type']

df = pd.DataFrame()

for year in years:
    # one sheet per year; sum the two selected rows of the balance for every fuel column
    df[year] = pd.read_excel(io=os.path.join(input_directory_path, 'DE-Energy-balance-sheets-June-2020-edition.xlsb'),
                             sheet_name=year, engine='pyxlsb', header=135, skipfooter=10,
                             usecols=list(variable_type), na_values='Z').iloc[1:3].sum()

# rename the index (fuel names) to technology names
df.rename(variable_type, inplace=True)


# convert to MWh (the factor 11630 corresponds to values given in ktoe: 1 ktoe = 11630 MWh)
df = df * 11630

In [98]:
df.index

df = df.groupby(df.index).sum()
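
The steps above (renaming fuels to technologies, summing fuels of the same technology, converting the unit) can be sketched on a toy table. This is an illustration under the assumption that the balance sheet values are in ktoe, which is what the factor 11630 implies; the numbers are made up, and only the three fuel names are taken from the `variable_type` mapping shown earlier.

```python
import pandas as pd

KTOE_TO_MWH = 11630  # 1 ktoe = 11.63 GWh = 11630 MWh

# excerpt of the fuel -> technology mapping
fuel_to_tech = {
    "Anthracite": "hard_coal",
    "Coking coal": "hard_coal",
    "Lignite": "lignite",
}

# toy balance in ktoe, indexed by fuel name (made-up numbers)
balance = pd.DataFrame(
    {"2018": [10.0, 20.0, 100.0], "2017": [15.0, 25.0, 110.0]},
    index=["Anthracite", "Coking coal", "Lignite"],
)

per_tech = balance.rename(index=fuel_to_tech)        # fuel name -> technology
per_tech = per_tech.groupby(per_tech.index).sum()    # add up fuels of one technology
per_tech_twh = per_tech * KTOE_TO_MWH / 1_000_000    # ktoe -> MWh -> TWh
```

After the rename several rows share the same label, which is why the `groupby` on the index is needed before the values can be read per technology.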

In [100]:
(df/1000000)

| technology (TWh) | 2018 | 2017 | 2016 | 2015 |
|---|---|---|---|---|
| biomass | 33.217001 | 33.668001 | 33.509996 | 32.891001 |
| gas | 56.370994 | 60.202997 | 56.925 | 39.626003 |
| geothermal | 0.177997 | 0.162994 | 0.174997 | 0.133001 |
| hard_coal | 79.335999 | 89.461007 | 108.832993 | 116.803009 |
| hydro | 24.057004 | 25.983002 | 25.957997 | 24.739999 |
| lignite | 142.164004 | 144.958995 | 146.187995 | 151.143003 |
| marine | 0.0 | 0.0 | 0.0 | 0.0 |
| nuclear | 76.005004 | 76.324003 | 84.633999 | 91.785995 |
| oil | 0.354006 | 0.476004 | 0.493007 | 0.892998 |
| other_fossil | 0.022004 | 0.007001 | 0.003001 | 0.017003 |


In [None]:
# check that the pyxlsb engine is installed
import importlib

importlib.import_module('pyxlsb')

## Load and filter data

In [None]:
# load and standardize data

entsoe_gen_type = load_timeseries_opsd(years=None, fn=input_directory_path + '/time_series_60min_multiindex.csv', countries=None, source="ENTSOE_transparency")

In [None]:
DE = entsoe_gen_type['DE']
DE

In [None]:
entsoe_gen_type[('DE','gas')]


In [None]:
start = date(2018, 1, 1)
end = date(2018, 12, 31)

In [None]:
start

In [None]:
DE = DE.loc[start:end, :]

In [None]:
DE.isnull().sum()

In [None]:
cols = {'ResolutionCode': 'resolution',
            'areacode': 'areacode',
            'AreaTypeCode': 'AreaTypeCode',
            'AreaName': 'region',
            'MapCode': 'mapcode',
            'ProductionType': 'variable',
            'ActualGenerationOutput': 'generation_actual',
            'ActualConsumption': 'consumption_actual',
            'UpdateTime': 'updatetime'}

entsoe_gen_type.rename(columns=cols, inplace=True)

In [None]:
entsoe_gen_type.drop(columns=['areacode','AreaTypeCode','mapcode','consumption_actual','updatetime'], inplace=True)

In [None]:
entsoe_gen_type.dropna(subset=['generation_actual'], inplace=True)

In [None]:
dfs = {}
res = '15'
df = (entsoe_gen_type.loc[entsoe_gen_type['resolution'] == 'PT' + res + 'M', :]
         .copy().sort_index(axis='columns'))
df.drop(columns=['resolution'], inplace=True)

stacked = ['region',  'variable']

In [None]:
df.set_index(stacked, append=True, inplace=True)

In [None]:
df.index.duplicated(keep="last")

In [None]:
df = df[~df.index.duplicated(keep="last")]

In [None]:
df = df.unstack(stacked)


In [None]:
df = df.loc[:, (df > 0).any(axis=0)]

In [None]:
headers = ['region', 'variable']

In [None]:
df

In [None]:
df = df.reorder_levels(headers, axis=1)

In [None]:
dfs = {}
for res in ['15', '30', '60']:
    df = (entsoe_gen_type.loc[entsoe_gen_type['resolution'] == 'PT' + res + 'M', :]
         .copy().sort_index(axis='columns'))
    df.drop(columns=['resolution'], inplace=True)

    # juggle the index and columns
    df.set_index(stacked, append=True, inplace=True)
    # at this point, only the values we are interested in are left as
    # columns
    df.columns.rename(unstacked, inplace=True)
    df = df.unstack(stacked)
    
    # keep only columns that have at least some nonzero values
    df = df.loc[:, (df > 0).any(axis=0)]
    
    # add source, url and unit to the column names.
    # Note: pd.concat inserts new MultiIndex values in front of the old ones
    df = pd.concat([df],
                   keys=[tuple(append_headers.values())],
                   names=append_headers.keys(),
                   axis='columns')
    
    # reorder and sort columns
    df = df.reorder_levels(headers, axis=1)
    
    dfs[res + 'min'] = df
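
The index juggling in the loop above can be followed on a toy frame. The sketch below covers only the steps whose inputs are defined in this notebook (moving `region` and `variable` into the index, dropping duplicates, unstacking, and removing all-zero columns); the names `unstacked` and `append_headers` used in the loop are not defined in this section and are left out. All values are made up.

```python
import pandas as pd

t0 = pd.Timestamp("2018-01-01 00:00")
t1 = pd.Timestamp("2018-01-01 01:00")

# long format: one row per timestamp, region and production type
long_frame = pd.DataFrame(
    {
        "region": ["DE", "DE", "DE", "FR", "DE", "DE", "FR"],
        "variable": ["gas", "gas", "solar", "gas", "gas", "solar", "gas"],
        "generation_actual": [10.0, 12.0, 0.0, 5.0, 11.0, 0.0, 6.0],
    },
    index=pd.DatetimeIndex([t0, t0, t0, t0, t1, t1, t1], name="utc_timestamp"),
)

stacked = ["region", "variable"]

# move region and variable into the index, next to the timestamp
wide = long_frame.set_index(stacked, append=True)

# the first two rows share the same index entry; keep the later one
wide = wide[~wide.index.duplicated(keep="last")]

# turn region and variable into column levels
wide = wide.unstack(stacked)

# keep only columns that have at least some values above zero
wide = wide.loc[:, (wide > 0).any(axis=0)]

# drop the outer 'generation_actual' level, leaving (region, variable)
wide.columns = wide.columns.droplevel(0)
```

Dropping duplicates before `unstack` matters because `unstack` raises an error when the index contains duplicate entries.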

In [None]:
entsoe_gen_type.rename(columns={Date})



In [None]:
# keep only entries for selected geographic entities as specified in
# areas.csv
area_filter = areas['primary AreaName ENTSO-E'].dropna()
df_raw = df_raw.loc[df_raw['region'].isin(area_filter)]

# set generation and consumption to absolute values (assuming that the negative entries are incorrect)
#entsoe_pp_timeseries['ActualGenerationOutput'] = entsoe_pp_timeseries.ActualGenerationOutput.abs()

#entsoe_pp_timeseries['ActualConsumption'] = entsoe_pp_timeseries.ActualConsumption.abs()

In [None]:
# check the available columns

entsoe_gen_type.columns

In [None]:
entsoe_gen_type.AreaName.unique()

In [None]:
entsoe_gen_type[entsoe_gen_type.AreaName == 'NO2 BZ']

In [None]:
entsoe_gen_type.MapCode.unique()

In [None]:
# check the available 'ProductionType' values

entsoe_gen_type.ProductionType.unique()

In [None]:
# check the available countries

entsoe_gen_type.MapCode.unique()

In [None]:
# replace DE_* names with DE (DE is represented as four areas)

entsoe_gen_unit.MapCode.replace({'.*DE.*' : 'DE'}, regex = True, inplace = True)

In [None]:
# new names for production types

if new_ProductionTypeName:
    entsoe_gen_unit = change_ProductionTypeName(entsoe_gen_unit)

In [None]:
# Which resolutions do exist in the data?

entsoe_gen_unit.ResolutionCode.unique()

In [None]:
# How many generators are in the data?

len(entsoe_gen_unit.GenerationUnitEIC.unique().tolist())

In [None]:
if test_dataset:
    for i in entsoe_gen_unit.GenerationUnitEIC.unique():
        unit_gen = entsoe_gen_unit.query("GenerationUnitEIC == @i")

        # test if different resolution codes exist for one power plant
        if len(unit_gen.ResolutionCode.unique()) >= 2:
            print('The data for generator ' + unit_gen.GenerationUnitEIC.iloc[0] + ' contains different time resolutions')
            # for 2018 all data OK
            # for 2019 all data OK
        
        if unit_gen.index.has_duplicates:
            #print('The data for generator ' + unit_gen.GenerationUnitEIC.iloc[0] + ' contains duplicates in the index')
            #many duplicates in 2019!
            count = unit_gen.index.duplicated(keep='first').sum()
            if count > 3:
                print('The data for generator ' + unit_gen.GenerationUnitEIC.iloc[0] + ' contains more than 3 duplicates in the index')
                #many duplicates with more than 3 duplicates in 2019!


## Resampling the data

Resample all generation data to hourly generation data per unit and store the result in a new dataframe 'gen_data'. Information on the individual generator units is stored in 'unit_data'.

In [None]:
# set timeframe
t_index = pd.date_range(start=start, end=end, freq='60Min', closed=closed)

# dataframe for generation data
gen_data = pd.DataFrame(index=t_index)

# dataframe for powerplant information
unit_data = pd.DataFrame()


# slicing over all generator units
# takes some time
for i in entsoe_gen_unit.GenerationUnitEIC.unique():
    unit_gen = entsoe_gen_unit.query("GenerationUnitEIC == @i").copy()
    duplicate_count = 0
    unit_gen['duplicate_count'] = duplicate_count
    # test if different resolution codes exist for one power plant
    if len(unit_gen.ResolutionCode.unique()) >= 2:
        print('The data for generator ' + unit_gen.GenerationUnitEIC.iloc[0] + ' contains different time resolutions')
        # for 2018 all data OK
        # for 2019 all data OK
    
    # check if duplicates exist in index (datetime) for the power plant and drop them
    if unit_gen.index.has_duplicates:
        #many duplicates in 2019!
        
        duplicate_count = unit_gen.index.duplicated(keep='first').sum()
        
        #drop all duplicates and only keep the first entry 
        unit_gen = unit_gen[~unit_gen.index.duplicated(keep='first')]
        unit_gen['duplicate_count'] = duplicate_count
    
    #resampling the data to 1h and store it in "gen_data"
    gen_data[i] = resampling(pp_gen=unit_gen, start=start, end=end, resolution='60Min')['ActualGenerationOutput']
   
    #store power plant info in unit_data
    unit_data = unit_data.append((unit_gen.set_index('GenerationUnitEIC')[['AreaCode', 'AreaTypeCode', 'AreaName', 'MapCode', 'PowerSystemResourceName', 'ProductionTypeName','InstalledGenCapacity','duplicate_count']].iloc[0]))    
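
The `resampling` helper called in the loop above is not defined in this section. The sketch below is a hypothetical stand-in, not the original function: it assumes that sub-hourly values are averaged per hour and that hours without data stay empty (NaN). The name `resample_to_hourly` and the toy data are illustrative only.

```python
import pandas as pd


def resample_to_hourly(unit_gen, start, end):
    """Hypothetical stand-in for the notebook's `resampling` helper.

    Averages all values within each hour and aligns the result to a
    complete hourly index from start (included) to end (excluded).
    """
    hourly_index = pd.date_range(start=start, end=end, freq="60min")
    hourly_index = hourly_index[hourly_index < pd.Timestamp(end)]
    hourly = unit_gen.resample("60min").mean()
    # hours without any data become NaN rows
    return hourly.reindex(hourly_index)


# toy unit with 15-minute data for two hours (made-up numbers)
quarter_hourly = pd.DataFrame(
    {"ActualGenerationOutput": [1.0, 2.0, 3.0, 4.0, 8.0, 8.0, 8.0, 8.0]},
    index=pd.date_range("2019-01-01 00:00", periods=8, freq="15min"),
)

hourly_unit = resample_to_hourly(
    quarter_hourly, start="2019-01-01 00:00", end="2019-01-01 03:00"
)
```

Aligning every unit to the same complete hourly index is what allows the results to be stored side by side as columns of `gen_data`.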

## Group the data

By using the "unit_data" dataframe in combination with the .groupby() function the data can be easily grouped and analyzed.

### Hourly data per country and technology

In [None]:
# will result in a multi index dataframe
data_country_tech_hourly = gen_data.groupby([unit_data.MapCode, unit_data.ProductionTypeName], axis=1).sum()

In [None]:
data_country_tech_hourly.head()
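
To illustrate how the unit metadata acts as a grouper for the generation columns, here is a small sketch with made-up units `U1` to `U3`. It groups on the transposed frame instead of passing `axis=1`, which is an assumption made for compatibility with recent pandas versions and gives the same kind of result.

```python
import pandas as pd

# hourly generation, one column per generation unit (made-up numbers)
gen_data = pd.DataFrame(
    {"U1": [1.0, 2.0], "U2": [2.0, 3.0], "U3": [4.0, 4.0]},
    index=pd.date_range("2019-01-01", periods=2, freq="60min"),
)

# unit metadata, indexed by the same unit identifiers
unit_data = pd.DataFrame(
    {
        "MapCode": ["DE", "DE", "FR"],
        "ProductionTypeName": ["Gas", "Gas", "Solar"],
    },
    index=["U1", "U2", "U3"],
)

# group the units (rows after transposing) by country and technology,
# then transpose back so that the timestamps are the index again
grouped = (
    gen_data.T
    .groupby([unit_data["MapCode"], unit_data["ProductionTypeName"]])
    .sum()
    .T
)
```

The result has a two-level column index (country, technology), so `grouped["DE"]` returns all technologies of one country.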

### Monthly data per country and technology

In [None]:
# use the month of the timestamp index as a grouper; this avoids adding a
# helper column that would otherwise remain in the hourly dataframe
month = pd.DatetimeIndex(data_country_tech_hourly.index).month

#will result in a multi index dataframe
data_country_tech_monthly = data_country_tech_hourly.groupby(month).sum()
data_country_tech_monthly.index.name = 'Month'


In [None]:
data_country_tech_monthly.head()

### Yearly data per country and technology

In [None]:
# sum the data from multiindex dataframe and convert multiindex into columns and rows
data_country_tech_yearly = data_country_tech_hourly.sum().unstack(-1)

In [None]:
data_country_tech_yearly.head()
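
The `sum().unstack(-1)` step can be followed on a toy frame: summing over time gives one value per (country, technology) pair, and `unstack(-1)` moves the technology level into the columns. The values below are made up.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("DE", "Gas"), ("DE", "Solar"), ("FR", "Gas")],
    names=["MapCode", "ProductionTypeName"],
)
hourly = pd.DataFrame(
    [[1.0, 0.0, 3.0], [1.0, 0.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
    index=pd.date_range("2019-01-01", periods=4, freq="60min"),
    columns=columns,
)

# series indexed by (MapCode, ProductionTypeName)
totals = hourly.sum()

# countries as rows, technologies as columns
yearly = totals.unstack(-1)
```

Combinations that do not occur in the data, such as solar in FR in this toy example, show up as NaN in the resulting table.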

### Germany as example

In [None]:
DE = data_country_tech_hourly['DE']

In [None]:
DE.head()

In [None]:
# production per technology in GWh
DE.sum()/1000

In [None]:
# seaborn is commented out in the import cell, so import it here
import seaborn as sns

ax = sns.barplot(data=DE) 


## Export data

Save the data as .csv files. All files are saved in the output directory of this notebook. This takes some time (about 2 min).

In [None]:
# hourly data
data_country_tech_hourly.to_csv(output_directory_path + '/data_country_tech_hourly.csv')

# monthly data
data_country_tech_monthly.to_csv(output_directory_path + '/data_country_tech_monthly.csv')

# yearly data
data_country_tech_yearly.to_csv(output_directory_path + '/data_country_tech_yearly.csv')

# power plant information
unit_data.to_csv(output_directory_path + '/unit_data.csv')

# hourly unit generation data
gen_data.to_csv(output_directory_path + '/gen_data.csv')