# Processing
Part of the project [Open Power System Data](http://open-power-system-data.org/).

Go back to the main notebook ([GitHub](https://github.com/Open-Power-System-Data/datapackage_timeseries/blob/master/main.ipynb) / [local copy](main.ipynb#)).

This notebook processes the data combined by the read script ([GitHub](https://github.com/Open-Power-System-Data/datapackage_timeseries/blob/master/read.ipynb) / [local copy](read.ipynb#)).

# Table of Contents
* [1. Preparations](#1.-Preparations)
	* [1.1 Libraries](#1.1-Libraries)
	* [1.2 Load raw data](#1.2-Load-raw-data)
* [2. Own calculations](#2.-Own-calculations)
	* [2.1 Aggregate German data from individual TSOs](#2.1-Aggregate-German-data-from-individual-TSOs)
	* [2.2 Create hourly data from 15' data](#2.2-Create-hourly-data-from-15'-data)
* [3. Create metadata](#3.-Create-metadata)
	* [3.1 General metadata](#3.1-General-metadata)
	* [3.2 Columns-specific metadata](#3.2-Columns-specific-metadata)
* [4. Writing the data to disk](#4.-Writing-the-data-to-disk)
	* [4.1 Write to SQL-database](#4.1-Write-to-SQL-database)
	* [4.2 Write to Excel](#4.2-Write-to-Excel)
	* [4.3 Write to CSV](#4.3-Write-to-CSV)
* [5. Missing data handling](#5.-Missing-data-handling)


# 1. Preparations

## 1.1 Libraries

Load the Python libraries used in this notebook.

In [None]:
from datetime import timedelta
import pandas as pd
import logging
import pycountry
import json
import sqlite3
import copy
from itertools import chain

## 1.2 Load raw data

Load the datasets compiled by the read script ([local copy](read.ipynb#) / [GitHub](https://github.com/Open-Power-System-Data/datapackage_timeseries/blob/master/read.ipynb)). There is one file per time resolution (15 and 60 minutes), each with a five-level column header.

In [None]:
data_sets = {}
for res_key in ['15min', '60min']:
    data_sets[res_key] = pd.read_csv(
        'raw_data_' + res_key + '.csv',
        header=[0,1,2,3,4],
        index_col=0,
        parse_dates=True
        )

# 2. Own calculations

## 2.1 Aggregate German data from individual TSOs

The wind and solar in-feed data for the four German balancing areas is summed up and stored in new columns. These columns are then used to calculate profiles, that is, the share of wind/solar capacity producing at a given time. The column headers are created in the fashion introduced in the read script.

In [None]:
HEADERS = ['variable', 'country', 'attribute', 'source', 'web']

In [None]:
web = 'http://data.open-power-system-data.org/datapackage_timeseries'
for tech in ['wind', 'solar']:
    for attribute in ['generation', 'forecast']:
        sum_col = pd.Series()
        for tso in ['50hertz', 'amprion', 'tennet', 'transnetbw']:
            add_col = data_sets['15min'][tech, 'DE' + tso, attribute]
            if len(sum_col) == 0:
                sum_col = add_col
            else:
                sum_col = sum_col + add_col.values
                
        # Create a new MultiIndex
        tuples = [(tech, 'DE', attribute, 'own calculation', web)]
        columns = pd.MultiIndex.from_tuples(tuples, names=HEADERS)
        sum_col.columns = columns
        data_sets['15min'] = data_sets['15min'].combine_first(sum_col)
        
        # Calculate the profile column
        if attribute == 'generation':
            profile_col = sum_col.values / data_sets['15min'][tech, 'DE', 'capacity']
            tuples = [(tech, 'DE', 'profile', 'own calculation', web)]
            columns = pd.MultiIndex.from_tuples(tuples, names=HEADERS)
            profile_col.columns = columns
            data_sets['15min'] = data_sets['15min'].combine_first(profile_col)
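
The aggregation and profile calculation above can be illustrated on a small toy frame. This is only a sketch: the values, the flat column names and the `capacity` series are invented for illustration and do not come from the OPSD data. It uses `skipna=False` so that, as with the `+` operator in the cell above, a missing value in any control area yields a missing total.

```python
import pandas as pd

# Toy quarter-hourly index; all values below are invented for illustration
index = pd.date_range('2016-01-01', periods=4, freq='15min')

# Hypothetical in-feed per control area in MW
infeed = pd.DataFrame({
    'DE50hertz': [100.0, 110.0, 120.0, 130.0],
    'DEamprion': [50.0, 55.0, 60.0, 65.0],
    'DEtennet': [200.0, 210.0, 220.0, 230.0],
    'DEtransnetbw': [10.0, 15.0, 20.0, 25.0],
}, index=index)

# Hypothetical installed capacity in MW
capacity = pd.Series(1000.0, index=index)

# Germany-wide in-feed: row-wise sum over the four control areas.
# skipna=False propagates missing values instead of treating them as zero.
total = infeed.sum(axis=1, skipna=False)

# Profile: share of installed capacity producing at each time step
profile = total / capacity
```

In this toy case the first time step sums to 360 MW, which gives a profile value of 0.36.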

## 2.2 Create hourly data from 15' data

The German renewables in-feed data comes in 15-minute intervals. We resample it to hourly intervals in order to match the load data from ENTSO-E.

In [None]:
resampled = data_sets['15min'].resample('H').mean()
data_sets['60min'] = data_sets['60min'].combine_first(resampled)
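
As a minimal sketch of what the resampling step does, the following toy example averages four quarter-hourly values into one hourly value and then merges the result with an hourly frame. The column names and numbers are hypothetical. The frequency string `'60min'` is used here in place of `'H'` because it should be accepted by both older and newer pandas versions.

```python
import pandas as pd

# Two hours of invented quarter-hourly data in MW
index = pd.date_range('2016-01-01 00:00', periods=8, freq='15min')
quarter_hourly = pd.DataFrame(
    {'wind_DE_generation': [100.0, 200.0, 300.0, 400.0,
                            10.0, 20.0, 30.0, 40.0]},
    index=index)

# Each hourly value is the mean of the four quarter-hourly values
# whose timestamps fall into that hour (labelled by the start of the hour)
hourly = quarter_hourly.resample('60min').mean()

# Hypothetical hourly load data on the same index
load = pd.DataFrame({'load_DE': [50000.0, 48000.0]}, index=hourly.index)

# combine_first keeps the existing columns and adds the resampled ones
combined = load.combine_first(hourly)
```

Taking the mean is appropriate here because the values are power (MW), not energy; for energy quantities a sum would be the natural choice.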

# 3. Create metadata

In this part, we create the metadata that will document the data output in CSV format. The metadata will be stored in JSON format, which is very much like a Python dictionary.
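
To illustrate the correspondence between a Python dictionary and JSON, here is a small sketch with an invented dictionary (the keys and values are placeholders, not the actual OPSD metadata). Serializing and parsing it again returns an equal dictionary, as long as it only contains strings, numbers, booleans, `None`, lists and dictionaries with string keys.

```python
import json

# Invented example metadata; not the real datapackage content
example = {
    'name': 'example-package',
    'keywords': ['timeseries', 'electricity'],
    'resources': [{'path': 'example.csv', 'format': 'csv'}],
}

# Serialize the dictionary to a JSON string
text = json.dumps(example, indent=2)

# Parse the JSON string back into a dictionary
restored = json.loads(text)
```

Note that the round trip is not exact for every Python object: tuples come back as lists and non-string keys come back as strings.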

## 3.1 General metadata

First, we define the general metadata for the timeseries datapackage.

In [None]:
metadata = {
    'name': 'opsd-timeseries',
    'title': 'Time-series data: load, wind and solar, prices',
    'description': 'This data package contains different kinds of timeseries ' +
        'data relevant for power system modelling. Currently, the data ' + 
        'includes hourly electricity consumption (load) from ENTSO-E for 36 ' +
        'European countries, wind and solar power generation from German ' +
        'transmission system operators 50Hertz, Amprion, TenneT and '+
        'TransnetBW for every quarter hour, and daily wind and solar ' +
        'capacity data from Netztransparenz.de and Bundesnetzagentur. We use ' +
        'this data to calculate Germany-wide renewables in-feed and profile ' +
        'timeseries. While some of the wind in-feed data dates back to ' +
        '2005, the full dataset is only available from 2012 onwards. The ' +
        'data has been downloaded from the sources, resampled and merged in ' +
        'a large CSV file with hourly resolution. Additionally, the data ' +
        'available at a higher resolution (German renewables in-feed, 15 ' +
        'minutes) is provided in a separate file. All data processing is ' +
        'conducted in python and pandas and has been documented in the ' +
        'Jupyter notebooks linked below.',
    'opsd-jupyter-notebook-url': 'https://github.com/Open-Power-System-Data/' +
        'datapackage_timeseries/blob/master/main.ipynb',
    'version': '2016-03-18',
    'opsd-changes-to-last-version': 'Introduced various formats for output data',
    'keywords': [
        'timeseries','electricity','in-feed','capacity','renewables', 'wind',
        'solar','load','tso','europe','germany'
        ],
    'geographical-scope': 'Europe/Germany',
    'licenses': [{
        'url': 'http://example.com/license/url/here',
        'version': '1.0',
        'name': 'License Name Here',
        'id': 'license-id-from-open'
        }],
    'views': [{}],
    'sources': [{
        'name': 'See the "Source" column in the field documentation'
        }],
    'maintainers': [{
        'web': 'http://example.com/',
        'name': 'Jonathan Muehlenpfordt',
        'email': 'muehlenpfordt@neon-energie.de'
        }],
    # The following is an example of how the file-specific metadata is
    # structured. The actual metadata is created below.
    'resources': [{
        'path': 'path_to.csv',
        'format': 'csv',
        'mediatype': 'text/csv',
        'schema': {
            'fields': [{
                'name': 'load_AT_actual',
                'description': 'Consumption in Austria in MW',
                'type': 'number',
                'source': {
                    'name': 'Example',
                    'web': 'http://www.example.com'
                    },
                'opsd-properties': {
                    'Country': 'AT',
                    'Variable': 'load',
                    }
                }]
            }
        }]
    }

indexfield = {
    'name': 'timestamp',
    'description': 'Start of timeperiod in UTC',
    'type': 'datetime',
    'format': 'YYYY-MM-DDThh:mm:ssZ'
    }

descriptions = {
    'load': 'Consumption in {geo} in MW',
    'generation': 'Actual {tech} generation in {geo} in MW',
    'forecast': '{tech} day-ahead generation forecast in {geo} in MW',
    'capacity': '{tech} capacity in {geo} in MW',
    'profile': 'Share of {tech} capacity producing in {geo}',
    'offshoreshare': '{tech} actual offshore generation in {geo} in MW'
    }

## 3.2 Columns-specific metadata

For each dataset/output file, the metadata has an entry in the "resources" list that describes the file. The main part of each entry is the "schema" dictionary, which contains a list of "fields", one per column in the dataset. The first field is the timestamp index of the dataset. For the other fields, we iterate over the MultiIndex columns of the datasets to construct the corresponding metadata.

At the same time, a copy of the datasets is created that has a single-level column index instead of the MultiIndex.

In [None]:
data_sets_singleindex = copy.deepcopy(data_sets)
resources = []
for res_key, data_set in data_sets.items():
    columns_singleindex = []
    fields = [indexfield]
    for col in data_set.columns:
        h = {k: v for k, v in zip(HEADERS, col)}
        if len(h['country']) > 2:
            geo = h['country'] + ' control area'
        elif h['country'] == 'NI':
            geo = 'Northern Ireland'
        elif h['country'] == 'CS':
            geo = 'Serbia and Montenegro'
        else:
            geo = pycountry.countries.get(alpha2=h['country']).name

        field = {}    
        field['description'] = descriptions[h['attribute']].format(
            tech=h['variable'], geo=geo)
        field['type'] = 'number'
        field['source'] = {
            'name': h['source'],
            'web': h['web']
            }
        field['opsd-properties'] = {
            'Country': h['country'],
            'Variable': h['variable'],
            }
        components = [h['variable'], h['country']]
        if not h['variable'] == 'load':
            components.append(h['attribute'])
            field['opsd-properties']['Attribute'] = h['attribute']
        field['name'] = '_'.join(components)
        columns_singleindex.append(field['name'])
        fields.append(field)
        
    resource = {
        'path': 'timeseries' + res_key + '.csv',
        'format': 'csv',
        'mediatype': 'text/csv',
        'alternative_formats': [
        {
          'path': 'timeseries' + res_key + '.csv',
          'stacking': 'Singleindex',
          'format': 'csv'
        },
        {
          'path': 'timeseries' + res_key + '.xlsx',
          'stacking': 'Singleindex',
          'format': 'xlsx'
        },
        {
          'path': 'timeseries' + res_key + '_multiindex.xlsx',
          'stacking': 'Multiindex',
          'format': 'xlsx'
        },
        {
          'path': 'timeseries' + res_key + '_multiindex.csv',
          'stacking': 'Multiindex',
          'format': 'csv'
        },
        {
          'path': 'timeseries' + res_key + '_stacked.csv',
          'stacking': 'Stacked',
          'format': 'csv'
        }
      ],        
        'schema': {'fields': fields}
        }       
    resources.append(resource)
    data_sets_singleindex[res_key].columns = columns_singleindex
    
metadata['resources'] = resources
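
The way the single-level column names are derived from the MultiIndex can be sketched in isolation. The two column tuples below are invented placeholders; the naming rule follows the loop above, where the attribute is appended for every variable except `load`.

```python
import pandas as pd

HEADERS = ['variable', 'country', 'attribute', 'source', 'web']

# Two invented column tuples with placeholder source information
columns = pd.MultiIndex.from_tuples([
    ('load', 'AT', 'actual', 'Example', 'http://www.example.com'),
    ('wind', 'DE', 'generation', 'Example', 'http://www.example.com'),
], names=HEADERS)

names = []
for col in columns:
    # Map each header level to its value for this column
    h = dict(zip(HEADERS, col))
    components = [h['variable'], h['country']]
    # Load columns are named without the attribute
    if h['variable'] != 'load':
        components.append(h['attribute'])
    names.append('_'.join(components))
```

With these inputs, `names` contains `load_AT` and `wind_DE_generation`.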

Execute this to write the metadata to disk.

In [None]:
datapackage_json = json.dumps(metadata, indent=2, separators=(',', ': '))
with open('datapackage.json', 'w') as f:
    f.write(datapackage_json)

# 4. Writing the data to disk

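Finally, we write the data to disk in several formats (SQLite, Excel and CSV) and save it in the directory of this notebook. First, we prepare the MultiIndex and stacked variants of the datasets.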

In [None]:
data_sets_multiindex = {}
data_sets_stacked = {}
for res_key in ['15min', '60min']:
    data_sets_multiindex[res_key + '_multiindex'] = data_sets[res_key]
    
    stacked = data_sets[res_key].copy()
    stacked.columns = stacked.columns.droplevel(['source', 'web'])
    stacked = stacked.transpose().stack(dropna=True).to_frame(name='data')
    data_sets_stacked[res_key + '_stacked'] = stacked
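
The stacked layout may be easier to picture with a toy example. The sketch below uses an invented two-column frame and, as an assumption made for portability across pandas versions, drops missing values with an explicit `dropna()` rather than the `dropna=True` argument used above.

```python
import numpy as np
import pandas as pd

# Invented hourly data with one missing value
index = pd.date_range('2016-01-01', periods=2, freq='60min')
columns = pd.MultiIndex.from_tuples(
    [('load', 'AT', 'actual'), ('wind', 'DE', 'generation')],
    names=['variable', 'country', 'attribute'])
wide = pd.DataFrame([[6000.0, np.nan], [5900.0, 1200.0]],
                    index=index, columns=columns)

# Move the timestamps into the row index next to the column levels,
# so that every row holds exactly one observation
stacked = wide.transpose().stack().dropna().to_frame(name='data')
```

The result has one row per non-missing observation, indexed by variable, country, attribute and timestamp.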

## 4.1 Write to SQL-database

This file format is required for the filtering function on the OPSD website.

In [None]:
for res_key, data_set in data_sets_singleindex.items():
    f = 'timeseries' + res_key
    ds = data_set.copy()
    ds.index = ds.index.strftime('%Y-%m-%dT%H:%M:%SZ')
    ds.to_sql(f, sqlite3.connect(f + '.sqlite'),
              if_exists='replace', index_label='timestamp') 
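
As a rough sketch of how such a database can be queried afterwards, the example below writes an invented frame to an in-memory SQLite database and filters it by timestamp. The table name `timeseries_example` and the data are hypothetical. Because the timestamps are stored as ISO 8601 strings, a plain string comparison selects the correct time range.

```python
import sqlite3

import pandas as pd

# Invented hourly data
index = pd.date_range('2016-01-01', periods=2, freq='60min')
frame = pd.DataFrame({'load_AT': [6000.0, 5900.0]}, index=index)

# Store the timestamps as ISO 8601 strings, as in the cell above
frame.index = frame.index.strftime('%Y-%m-%dT%H:%M:%SZ')

# An in-memory database avoids creating a file for this illustration
connection = sqlite3.connect(':memory:')
frame.to_sql('timeseries_example', connection,
             if_exists='replace', index_label='timestamp')

# Filter by timestamp with a lexicographic string comparison
result = pd.read_sql_query(
    "SELECT timestamp, load_AT FROM timeseries_example "
    "WHERE timestamp >= '2016-01-01T01:00:00Z'",
    connection)
connection.close()
```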

## 4.2 Write to Excel

This takes about 15 minutes to complete.

In [None]:
for res_key, data_set in chain(data_sets_singleindex.items(),
                               data_sets_multiindex.items()):
    f = 'timeseries' + res_key
    data_set.to_excel(f+ '.xlsx', float_format='%.2f')

## 4.3 Write to CSV

This takes about 10 minutes to complete.

In [None]:
for res_key, data_set in chain(data_sets_singleindex.items(),
                               data_sets_multiindex.items(),
                               data_sets_stacked.items()):
    f = 'timeseries' + res_key
    data_set.to_csv(f + '.csv', float_format='%.2f',
                    date_format='%Y-%m-%dT%H:%M:%SZ')
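
To show the effect of the `float_format` and `date_format` arguments, the following sketch writes an invented frame to an in-memory text buffer instead of a file. The column name and values are placeholders.

```python
import io

import pandas as pd

# Invented hourly data
index = pd.date_range('2016-01-01', periods=2, freq='60min', name='timestamp')
frame = pd.DataFrame({'load_AT': [6000.0, 5912.3]}, index=index)

# Write to a text buffer with the same formatting options as above
buffer = io.StringIO()
frame.to_csv(buffer, float_format='%.2f',
             date_format='%Y-%m-%dT%H:%M:%SZ')

lines = buffer.getvalue().splitlines()
```

Each data line then consists of the formatted timestamp followed by the value with two decimal places, for example `2016-01-01T00:00:00Z,6000.00`.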

# 5. Missing data handling

Work in progress.

In [None]:
compact = data_sets['15min'].copy()
compact.columns = compact.columns.droplevel([3, 4])
compact.insert(0,'time',  compact.index.time)
compact

In [None]:
#for col in data_set.columns:
#    df = col.to_frame()
df = compact.iloc[:,15].to_frame()
# tag missing values that lie between the first and the last valid observation
df['tag'] = ((df.index >= df.first_valid_index()) &
             (df.index <= df.last_valid_index()) &
             df.iloc[:, 0].isnull().values)

# make another DF to hold info about each region
regs_isnull = pd.DataFrame()

# first row of a consecutive region is a True preceded by a False in tags
regs_isnull['start_idx'] = df.index[df['tag'] & ~df['tag'].shift(1).fillna(False).astype(bool)]

# last row of a consecutive region is a True followed by a False in tags
regs_isnull['end_idx'] = df.index[df['tag'] & ~df['tag'].shift(-1).fillna(False).astype(bool)]

if df['tag'].any():
    # how long is each region
    regs_isnull['spans'] = regs_isnull['end_idx'] - regs_isnull['start_idx'] + timedelta(minutes=15)

    # index of the region with the longest span
    max_idx = regs_isnull['spans'].idxmax()

    # we can get the start and end points of the longest region from the original dataframe
    df.loc[regs_isnull.loc[max_idx][['start_idx', 'end_idx']].values]

In [None]:
df.loc[regs_isnull.loc[max_idx][['start_idx', 'end_idx']].values]

In [None]:
regs_isnull[regs_isnull['spans']>timedelta(minutes=60)]
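
The gap detection above can be tried out on a short invented series. This is a simplified sketch of the same idea, not the final method: it works on a single `Series`, uses `shift(..., fill_value=False)` (which assumes a reasonably recent pandas version), and ignores missing values before the first or after the last valid observation.

```python
import numpy as np
import pandas as pd

# Invented quarter-hourly series with two gaps and a leading missing value
index = pd.date_range('2016-01-01', periods=10, freq='15min')
values = pd.Series(
    [np.nan, 1.0, np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 4.0],
    index=index)

# Only count missing values between the first and last valid observation
inside = ((values.index >= values.first_valid_index()) &
          (values.index <= values.last_valid_index()))
is_gap = values.isnull() & inside

# A gap starts where a True is preceded by a False,
# and ends where a True is followed by a False
previous = is_gap.shift(1, fill_value=False)
following = is_gap.shift(-1, fill_value=False)
starts = is_gap.index[(is_gap & ~previous).values]
ends = is_gap.index[(is_gap & ~following).values]

# One row per gap; the span includes the last missing quarter hour
gaps = pd.DataFrame({'start': starts, 'end': ends})
gaps['span'] = gaps['end'] - gaps['start'] + pd.Timedelta(minutes=15)

# Row describing the longest gap
longest = gaps.loc[gaps['span'].idxmax()]
```

In this toy series the two gaps last 30 and 45 minutes, and the longest one starts at 01:30.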

In [None]:
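# the 'time' column may already exist from the cell above; inserting it twice raises an error
if 'time' not in compact.columns:
    compact.insert(0, 'time', compact.index.time)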

In [None]:
pv = compact.xs(('solar'), level=('variable'), axis=1, drop_level=False)
pv.index = pd.MultiIndex.from_arrays([pv.index.date, pv.index.time], names=['date','time'])
pv

In [None]:
pv.groupby(level='time').max()

In [None]:
pv.unstack().idxmax().to_frame().unstack().transpose()