# Representative period processing

This script processes representative periods for the NE-model.
The processing consists of the following steps:
1. Python configuration for the script.
2. Read `.gdx` time series from a pre-processed NE-model input folder.
3. Format them into a single massive dataframe for TSAM.
4. Run TSAM and process the representatives into Backbone samples.
5. Output Backbone settings.

You will need an input folder processed by the NE-model `building_input_data.py`,
as this script uses the preprocessed `.gdx` time series files.

## 1. Config

Import the necessary packages etc.

In [None]:
## Import necessary packages

import os # For gathering all timeseries files.
from itertools import product # More efficient looping
import gams.transfer as gt # For reading said timeseries files.
import tsam.timeseriesaggregation as tsam # For aggregating representatives.

## 2. Read time series input data

Here, we read the time series input data from the given `input_folder_path`.
The `omit_suffixes`, `aggregate_weather_years` and `aggregate_timeseries_names` filters are also applied
here to avoid unnecessary input data reading.
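
As a rough illustration of how the filters act on the filenames, here is a minimal sketch using only the standard library. It assumes the files follow a `ts_<name>_<suffix>.gdx` pattern, where the suffix is usually a weather year. The helper `parse_ts_filename` and the directory listing are hypothetical, and the parsing differs slightly from the cell below for names containing several dots.

```python
def parse_ts_filename(filename):
    """Split e.g. 'ts_cf_1982.gdx' into ('ts_cf', '1982'); return None for other files."""
    stem, _, extension = filename.rpartition('.')
    parts = stem.split('_')
    if extension != "gdx" or parts[0] != "ts" or len(parts) < 2:
        return None
    return ('_'.join(parts[:-1]), parts[-1])

# Hypothetical directory listing, for illustration only.
example_files = [
    "ts_cf_1982.gdx", "ts_cf_1983.gdx", "ts_influx_1982.gdx",
    "ts_cf_forecasts.gdx", "readme.txt",
]
omit = ["forecasts", "demands"]  # Same role as `omit_suffixes`.
years = ["1982"]                 # Same role as `aggregate_weather_years` (strings!).

parsed = [p for p in map(parse_ts_filename, example_files) if p is not None]
parsed = [p for p in parsed if p[1] not in omit]  # Drop unwanted suffixes.
parsed = [p for p in parsed if p[1] in years]     # Keep selected weather years.
print(parsed)
```

Only the `(name, year)` pairs that survive every filter are read from disk later on.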

In [None]:
## Configure input data to be included for aggregation.

# Path to the NE-model preprocessed input data folder
# (the one containing a ton of ts_*.gdx files).
input_folder_path = "./north_european_model/input_National_Trends_2040_nucTypical/"

# Select weather years to be processed as `list(<str>)` (strings, not integers!).
# `None` processes all available years.
aggregate_weather_years = None

# Select time series names for aggregation `list(<str>)`.
# `None` processes all available timeseries.
aggregate_timeseries_names = None

# Omit `.gdx` files whose name ends in one of these (the last `_`-separated part).
omit_suffixes = ["forecasts", "demands"]

In [None]:
## Gather and filter the timeseries files in the input folder.

# Reading and preprocessing relevant filenames.
timeseries_files = os.listdir(input_folder_path) # Gather all files in the dir
timeseries_files = [f.split('.') for f in timeseries_files] # Split file suffix
timeseries_files = [f[0].split('_') for f in timeseries_files if f[-1] == "gdx"] # Split filename by '_' and filter by .gdx
timeseries_files = [('_'.join(f[0:-1]), f[-1]) for f in timeseries_files if f[0] == "ts"] # Form filenames/years and filter by 'ts' prefix

# Config filtering by suffix, timeseries names, and weather years.
if omit_suffixes is not None:
    timeseries_files = [f for f in timeseries_files if f[-1] not in omit_suffixes]
if aggregate_weather_years is not None:
    timeseries_files = [f for f in timeseries_files if f[-1] in aggregate_weather_years]
if aggregate_timeseries_names is not None:
    timeseries_files = [f for f in timeseries_files if f[0] in aggregate_timeseries_names]

# Determine sets of remaining filenames and years.
filtered_filenames = set([f[0] for f in timeseries_files])
filtered_years = set([f[-1] for f in timeseries_files])

In [None]:
## Read .gdx data into a nested dictionary with `year`->`param_name` as the keys.

gdx_df_dict = dict() # Initialize empty dict for collecting input data.
# Loop over the files actually found, so that no nonexistent name/year combination is attempted.
for (filename, year) in timeseries_files:
    gdx = gt.Container(os.path.join(input_folder_path, f"{filename}_{year}.gdx")) # Read input data file.
    gdx_df_dict.setdefault(year, dict()) # Initialize empty sub-dictionary for parameter values.
    for (param_name, vals) in gdx.data.items():
        gdx_df_dict[year][param_name] = vals.records # Record values per year per param_name

## 3. Format data for TSAM

All the data needs to be in a single dataframe with timesteps as indices for TSAM,
while all timeseries values need to be pivoted to columns.
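
The reshaping can be sketched without pandas as follows. This is only an illustration of the long-to-wide idea under assumed data: the records, the `node` dimension, the timestep names, and the parameter name are all hypothetical, and the actual cell below uses `DataFrame.pivot` instead.

```python
# Hypothetical long-format records: dimension column(s), time index `t`, and `value`.
records = [
    {"node": "FI00", "t": "t000001", "value": 0.31},
    {"node": "FI00", "t": "t000002", "value": 0.28},
    {"node": "SE01", "t": "t000001", "value": 0.45},
    {"node": "SE01", "t": "t000002", "value": 0.47},
]
param_name = "ts_cf"  # Hypothetical parameter name.

wide = {}  # Maps timestep -> {column name -> value}.
for row in records:
    # Every column except the time index and the value becomes part of the column name.
    dims = [str(v) for (k, v) in sorted(row.items()) if k not in ("t", "value")]
    column = '-'.join([param_name, *dims])
    wide.setdefault(row["t"], {})[column] = row["value"]

for (t, columns) in wide.items():
    print(t, columns)
```

Each timestep ends up as one row and each parameter/dimension combination as one column, which is the layout TSAM expects.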

>**NOTE!**
>Currently, all timeseries are used raw, as-is, without any weighting.
>Better results might be achieved via some form of weighting, but I'm unsure how that should be implemented in TSAM.
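
One possible route for weighting, as far as I know, is the `weightDict` argument of `tsam.TimeSeriesAggregation`, which maps column names to weights. Below is a minimal sketch of building such a dictionary from column name prefixes. The helper `build_weight_dict`, the column names, and the weight values are hypothetical, and whether prefix-based weights actually improve the representatives is untested.

```python
def build_weight_dict(columns, prefix_weights, default=1.0):
    """Assign each column the weight of a matching name prefix, else `default`."""
    weights = {}
    for column in columns:
        weights[column] = default
        for (prefix, weight) in prefix_weights.items():
            if column.startswith(prefix):
                weights[column] = weight
    return weights

# Hypothetical pivoted column names, for illustration only.
columns = ["ts_cf-FI00", "ts_cf-SE01", "ts_influx-FI00"]
weights = build_weight_dict(columns, {"ts_cf": 2.0})
print(weights)

# The result could then be handed to TSAM along the lines of:
# tsam.TimeSeriesAggregation(data, weightDict=weights, ...)
```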

In [None]:
## Settings for TSAM data formatting
# Don't change these unless you know what you're doing.

index_column_name = 't' # Name of the time index column.
value_column_name = 'value' # Name of the value column.
forecast_filter = {'f00'} # Filter out forecasts besides these.

In [None]:
## Format data for TSAM
# This unfortunately seems to take ~1.5 min for the full dataset.

tsam_df_dict = dict() # Initialize empty dict for collecting all timeseries per year.
for (year, param_dict) in gdx_df_dict.items():
    for (param_name, vals) in param_dict.items():
        agg_cols = vals.columns.difference([index_column_name, value_column_name]) # Gather all other column names to be aggregated.
        agg_col_name = '-'.join([param_name, *agg_cols]) # Name for the aggregated column including parameter name.
        if 'f' in vals.columns: # Forecast filtering applied when needed.
            vals = vals.loc[vals['f'].isin(forecast_filter)]
        vals = vals.copy() # Work on a copy to avoid mutating the input data (and a SettingWithCopyWarning).
        vals[agg_col_name] = param_name + '-' + vals[agg_cols].agg('-'.join, axis=1) # Create aggregate column value.
        vals = vals[[index_column_name, value_column_name, agg_col_name]] # Omit old columns.
        vals = vals.pivot( # Pivot timeseries dataframe for TSAM
            index=index_column_name, columns=agg_col_name, values=value_column_name
        )
        if tsam_df_dict.get(year) is None:
            tsam_df_dict[year] = vals # First dataframe becomes the starting point.
        else:
            tsam_df_dict[year] = tsam_df_dict[year].join(vals) # Rest are joined on index.

## 4. Run TSAM and process Backbone samples

In [None]:
## Configure TSAM aggregation
# (see https://tsam.readthedocs.io/en/latest/timeseriesaggregationDoc.html)

noTypicalPeriods = 4 # Number of representative periods.
hoursPerPeriod = 168 # Hours per representative period.
resolution = 1 # Resolution of input data in hours.
clusterMethod = "hierarchical" # Select clustering method. (`hierarchical` or `k_medoids` recommended)
rescaleClusterPeriods = False # Don't rescale periods, we don't use that data anyhow.
extremePeriodMethod = 'None' # Extreme period integration method; the string 'None' disables it.

In [None]:
## TSAM time series aggregation
# This seems amazingly fast, less than 5 seconds for all the data.

tsam_dict = dict() # Initialize dict to store TSAM aggregation per year.
for (year, data) in tsam_df_dict.items():
    aggregation = tsam.TimeSeriesAggregation( ## Define TSAM aggregation.
        data,
        noTypicalPeriods=noTypicalPeriods,
        hoursPerPeriod=hoursPerPeriod,
        clusterMethod=clusterMethod,
        resolution=resolution,
        rescaleClusterPeriods=rescaleClusterPeriods,
        extremePeriodMethod=extremePeriodMethod,
    )
    aggregation.createTypicalPeriods() ## Run TSAM aggregation.
    tsam_dict[year] = aggregation


In [None]:
## Process Backbone samples per year from the aggregation
# The samples are processed based on the "clusters" in TSAM.
# NOTE! There is no guarantee that TSAM samples occur in a sequence similar to Backbone samples!
# However, very brief testing suggests that this is the case.

sample_dict = dict() # Initialize dict to store Backbone samples per year.
for (year, agg) in tsam_dict.items():
    # Cluster center indices. As far as I can tell, TSAM reports these as period indices,
    # so the first timestep of each sample is the index times the period length in timesteps.
    sample_start_indices = agg.clusterCenterIndices
    # Calculate sample weights based on the cluster occurrence number.
    sample_weights = agg.clusterPeriodNoOccur
    sample_weights = {
        ind: val/sum(sample_weights.values()) for (ind, val) in sample_weights.items()
    }
    # Figure out the order the samples occur during the year
    sample_order = agg.clusterOrder # Order of samples representing the raw data (year).
    # TODO
    # How do we actually want to represent the year?
    # If we use the periods where they actually occur (clusterCenterIndices),
    # the weights won't match the outcome desired by TSAM when fed to Backbone.
    # If we instead match the weights, the periods cannot occur where Backbone needs them to.
    # Then again, we need to place the samples where they are to get the correct data
    # into Backbone (agg.clusterCenterIndices), so I guess we can't really consider the order?
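
To make the trade-off above more concrete, here is a minimal sketch of turning TSAM-like outputs into sample start/end timesteps and weights. It assumes the cluster center indices are period indices and that period `i` covers the 0-based timesteps `i * steps_per_period` up to (but excluding) `(i + 1) * steps_per_period`. The input values are hypothetical, and Backbone's own timestep naming is not considered.

```python
from collections import Counter

hours_per_period = 168  # Same role as `hoursPerPeriod` above.
resolution = 1          # Same role as `resolution` above.
steps_per_period = hours_per_period // resolution

# Hypothetical TSAM-like outputs for a short 8-period "year", for illustration only.
cluster_center_indices = [1, 4, 6]        # Period index of each cluster's representative.
cluster_order = [0, 0, 1, 1, 1, 2, 2, 0]  # Cluster id assigned to each original period.

occurrences = Counter(cluster_order)  # How many periods each cluster represents.
samples = []
for (cluster, period) in enumerate(cluster_center_indices):
    samples.append({
        "cluster": cluster,
        "start": period * steps_per_period,      # First timestep (0-based, assumed).
        "end": (period + 1) * steps_per_period,  # One past the last timestep.
        "weight": occurrences[cluster] / len(cluster_order),
    })
samples.sort(key=lambda s: s["start"])  # Chronological order of the representatives.

for s in samples:
    print(s)
```

This places each sample where its representative period actually occurs and carries the weight alongside it, which sidesteps rather than resolves the ordering question in the TODO above.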

# TODO?

Interesting idea by Niina: First reduce the number of years by selecting "representative years", then represent them using "representative periods".
This could significantly reduce the computational burden to represent multiple years.
However, we would likely want to include different weather years for the investments, so I'm not sure if TSAM can handle that.