
# Representative Days for EPM Model

This notebook calculates the representative days used in the EPM model. It is divided into 5 parts:
1. Group data by season.
2. Format it for the Poncelet algorithm.
3. Calculate representative year among historical data.
4. Calculate special and representative days within this year.
5. Export pHours, pVREgen in EPM format.

It is based on previously developed GAMS code for the Poncelet algorithm. The objective is to automate the process and make it more user-friendly.
For each season, the code automatically identifies the days with the minimum PV production, the minimum wind production, and the maximum load. These are called the special days.
It then removes the special days from the input file for the Poncelet algorithm and runs the algorithm to generate the representative days.

## Special days methodology

**Clustering-Based Method**
To increase the influence of extreme days on the model, we proceed as follows:
- Clustering: All days are grouped into k clusters using K-Means, based on their features (VRE production and load).
- Selection of Extremes: Days with the lowest PV, lowest wind, and highest load are identified as special days. For multi-zone cases, total daily PV or wind production across all zones is used.
- Cluster Exclusion and Weight:
    - The entire cluster containing each extreme day is excluded from the main dataset.
    - The centroid (i.e., most representative day) of each excluded cluster is then included as a special day.
    - This day is assigned a weight equal to the cluster’s share of the original dataset.

This method ensures that days with system stress (e.g., low renewable production) have a greater impact on optimization results through the higher weight.
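
The cluster exclusion and weighting step can be sketched as follows. This is a minimal illustration using NumPy only, not the implementation in `utils_reprdays`: the helper name `special_day_from_cluster`, the feature layout, and the toy values are assumptions, and the cluster labels are taken as given (in practice they would come from K-Means).

```python
import numpy as np

def special_day_from_cluster(features, labels, extreme_idx):
    """Return (representative day index, weight) for the cluster holding an extreme day.

    features: (n_days, n_features) array of daily features (hypothetical layout).
    labels: (n_days,) cluster label of each day, e.g. produced by K-Means.
    extreme_idx: index of the extreme day (e.g. the day with the lowest PV).
    """
    cluster = labels[extreme_idx]
    members = np.flatnonzero(labels == cluster)
    centroid = features[members].mean(axis=0)
    # Pick the actual day of the cluster that lies closest to the centroid
    distances = np.linalg.norm(features[members] - centroid, axis=1)
    representative = members[np.argmin(distances)]
    # Weight = share of the original dataset covered by the excluded cluster
    weight = len(members) / len(labels)
    return int(representative), weight

# Toy data: 6 days, 2 features (daily PV, daily load); values are illustrative only
features = np.array([
    [0.10, 0.90],
    [0.12, 0.80],
    [0.14, 0.85],
    [0.60, 0.50],
    [0.62, 0.52],
    [0.58, 0.48],
])
labels = np.array([0, 0, 0, 1, 1, 1])

lowest_pv_day = int(np.argmin(features[:, 0]))
rep_day, weight = special_day_from_cluster(features, labels, lowest_pv_day)
print(rep_day, weight)
```

Here the weight is expressed as a share of the dataset; multiplying it by the number of days in the dataset gives a weight in days.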


In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt

from utils_reprdays import *

We provide two examples: one for a country model and one for a regional model. The country example is based on data provided by the utility; the regional example is based on data downloaded from Renewable Ninja.

## 0. User input parameters (to manually change)

In [2]:
seasons_dict = {
    1: 2,
    2: 2,
    3: 2,
    4: 2,
    5: 1,
    6: 1,
    7: 1,
    8: 1,
    9: 1,
    10: 2,
    11: 2,
    12: 2
}  # groups months into seasons (two seasons here); adjust to the case study

# The name of the file that must be in the input folder
filenames_input = {'PV': 'data_capp_solar.csv',
                   'Wind': 'data_capp_wind.csv',
                   'Load': 'load_full_year.csv'
                   }

# Zones to exclude from the analysis
zones_to_remove = ['STP', 'Rwanda']


## 1. Create the folder structure

The following cell creates the folders required for the script to run.
1. `input/` holds the raw input files listed in `filenames_input`, including any data downloaded from Renewable Ninja.
2. `output/` holds the representative days, as well as some auxiliary files.

In [3]:
folder_input = 'input'
# Make folder
if not os.path.exists(folder_input):
    os.makedirs(folder_input)

folder_output = 'output'
if not os.path.exists(folder_output):
    os.makedirs(folder_output)
    print(f'Output folder: {folder_output}')

## 2. Process data to group by season
Renewable Ninja data is processed to group months into seasons, to be used as input to EPM. This step may be skipped if one wishes to keep the seasonal definition at the monthly scale, or updated based on the seasonal grouping that makes the most sense for the case study at hand.
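
As a quick check of a seasonal grouping, the number of days assigned to each season can be counted before running the rest of the notebook. The sketch below reuses the `seasons_dict` defined above on a non-leap year; the `days` frame and its column names are introduced only for this illustration.

```python
import pandas as pd

seasons_dict = {1: 2, 2: 2, 3: 2, 4: 2, 5: 1, 6: 1,
                7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2}

# One row per calendar day of a non-leap year (2023)
days = pd.DataFrame({'date': pd.date_range('2023-01-01', '2023-12-31', freq='D')})
days['month'] = days['date'].dt.month
days['season'] = days['month'].map(seasons_dict)

# Number of days that fall in each season
days_per_season = days.groupby('season').size()
print(days_per_season.to_dict())  # {1: 153, 2: 212}
```

With this grouping, season 1 (May to September) covers 153 days and season 2 covers the remaining 212 days.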

In [4]:

filenames = {key: os.path.join(folder_input, filename) for key, filename in filenames_input.items()}

def month_to_season(data, seasons_dict, other_columns=None):
    """Map months to seasons and renumber days sequentially within each season."""
    # Avoid a TypeError below when no extra grouping columns are given
    other_columns = list(other_columns) if other_columns else []
    # The raw files store the month in a column named 'season'
    data = data.rename(columns={'season': 'month'})
    data['season'] = data['month'].map(seasons_dict)
    data = data.sort_values(by=['season', 'month', 'day', 'hour'])
    # Drop February 29 so that every year has 365 days
    data = data[~((data['month'] == 2) & (data['day'] == 29))]
    # Renumber days sequentially within each season (relies on the sort above and on 24 rows per day)
    data['season_day'] = data.groupby(other_columns + ['season']).cumcount() // 24 + 1
    data = data.drop(columns=['day']).rename(columns={'season_day': 'day'})
    data = data.set_index(other_columns + ['season', 'day', 'hour']).reset_index().drop(columns=['month'])
    data = data.sort_values(by=other_columns + ['season', 'day', 'hour'])
    return data


for key, filename in filenames.items():
    if not os.path.exists(filename):
        print(f'File {filename} does not exist. Please check the input folder.')
        raise FileNotFoundError(f'File {filename} not found.')
    data = pd.read_csv(filename, index_col=False)

    # Load data hours should start with 0, not 1
    if data['hour'].min() == 1:
        data['hour'] = data['hour'] - 1

    # Rename the 'value' column to the data year (2023)
    data = data.rename(columns={'value': 2023})

    data = data[~data['zone'].isin(zones_to_remove)]

    data = month_to_season(data, seasons_dict, other_columns=['zone'])

    name, ext = os.path.splitext(filename)
    filename = f'{name}_season{ext}'

    data.to_csv(filename, float_format='%.4f', index=False)
    print(f'Data saved to {filename}')

Data saved to input/data_capp_solar_season.csv
Data saved to input/data_capp_wind_season.csv
Data saved to input/load_full_year_season.csv


## 3. Format data for the Poncelet algorithm

In [9]:

filenames = {}
for tech, filename in filenames_input.items():
    name, ext = os.path.splitext(filename)
    filename = f'{name}_season{ext}'
    filename = os.path.join(folder_input, filename)

    if not os.path.exists(filename):
        print(f'File {filename} does not exist. Please check the input folder.')
        raise FileNotFoundError(f'File {filename} not found.')
    filenames.update({tech: filename})

# Read the seasonal files and combine them into a single DataFrame
df_energy = format_data_energy(filenames)

# Drop columns with all NaN values
df_energy = df_energy.dropna(axis=1, how='all')

Representative year 2023
Annual capacity factor (%): tech              zone        PV      Wind      Load
0               Angola  0.193551  0.133399  0.775620
1              Burundi  0.173415       NaN  0.742856
2                  CAR  0.151549       NaN  0.803250
3             Cameroon  0.171599  0.069570  0.740909
4                 Chad  0.184827  0.407908  0.685850
5                Congo  0.139433       NaN  0.720891
6                  DRC  0.165349       NaN  0.880314
7     EquatorialGuinea  0.138428       NaN  0.805217
8                Gabon  0.123646       NaN  0.783922


## 4. Generate representative days

**User-defined parameters**

The user needs to set the following parameters in the cells below:

1. `n_rep_days` (passed to the optimization as `nbr_days`): Defines the number of representative days used in the Poncelet algorithm.
2. Elements in `filenames`: Specify whether the input data should be treated as `standard` or `renewable_ninja`, depending on its source and formatting.
3. `n_clusters`: Only relevant if the clustering method is used to identify special days. Specifies the number of clusters to generate. The extreme clusters among them are then used to extract the special days.
4. `settings.csv` file: This file defines the level of detail in the Poncelet algorithm. A higher number of zones or time series increases numerical complexity. Start with settings_bins10.csv (fewer bins) for easier computation. Increase the number of bins progressively if the problem remains easy to solve.
5. `n_features_selection` (optional): Enables **automatic feature selection** to reduce the number of time series used in the Poncelet algorithm.
   This is useful when:
   - Modeling many zones or countries.
   - The number of pairwise correlations becomes large, increasing computational time.

    Use this parameter if the Poncelet algorithm takes more than a few minutes to run.
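
To see why feature selection helps, note that the number of distinct pairs among N time series is N(N-1)/2, so it grows quadratically with N. The short sketch below only illustrates this arithmetic; the helper name and the series counts are illustrative, and whether the formatting step computes every pair is an assumption.

```python
from math import comb

def n_pairwise_correlations(n_series):
    """Number of distinct pairs among n_series time series: n * (n - 1) / 2."""
    return comb(n_series, 2)

# Illustrative counts: a large multi-zone case versus a reduced selection
print(n_pairwise_correlations(100))  # 4950
print(n_pairwise_correlations(30))   # 435
```

Reducing 100 series to 30 cuts the number of pairs by more than a factor of ten.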

**Data formatting**
Proper formatting of input data is essential for seamless integration into the model. Please follow these guidelines:

**Renewable Ninja Data**:

If the data is extracted through this code, the formatting will be correct by default. Use the specification `renewable_ninja` in the filenames to indicate that the data originates from Renewable Ninja.

**User-Specific Data**:

Example: capacity factors derived from actual production data provided by the client, or load demand provided by the utility. Users are responsible for ensuring that the data is formatted correctly. Reference examples are available in the `data_test` folder. The required columns are `zone`, `season`, `day`, `hour`, followed by the value columns (in this notebook, a single column named after the data year, `2023`). In this case, use the specification `standard` in the filenames to indicate that the data is formatted in the standard way.
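
A small check along the following lines can catch common formatting problems before the data reaches the algorithm. It is a sketch under stated assumptions: `check_standard_format` is a hypothetical helper, not part of `utils_reprdays`, and the rules it applies (at least one value column, hours numbered 0 to 23, 24 hours per day) are inferred from the processing steps in this notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ['zone', 'season', 'day', 'hour']

def check_standard_format(data):
    """Hypothetical helper: return a list of formatting problems (empty if none)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in data.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
        return problems
    # Assumption: every other column holds values (e.g. one column per year)
    value_columns = [c for c in data.columns if c not in REQUIRED_COLUMNS]
    if not value_columns:
        problems.append('no value column found')
    if data['hour'].min() != 0 or data['hour'].max() != 23:
        problems.append('hours should run from 0 to 23')
    hours_per_day = data.groupby(['zone', 'season', 'day'])['hour'].nunique()
    if (hours_per_day != 24).any():
        problems.append('some days do not have 24 distinct hours')
    return problems

# Toy example: one zone, one season, one day, one value column
good = pd.DataFrame({'zone': 'Angola', 'season': 1, 'day': 1,
                     'hour': range(24), '2023': 0.2})
bad = good.assign(hour=good['hour'] + 1)  # hours 1..24 instead of 0..23

print(check_standard_format(good))
print(check_standard_format(bad))
```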

In [11]:
# Clustering the data to extract clusters corresponding to extreme conditions
n_clusters = 20

# Feature selection is optional but recommended if you are working with a large number of zones or time series. This reduces the number of pairwise correlations and helps avoid high computational complexity in the optimization step.
n_features_selection = 30

name_data = 'capp'

In [12]:

df_energy_cluster, df_closest_days, centroids_df = cluster_data_new(df_energy, n_clusters=n_clusters)

# Extracting special days as centroids of the extreme clusters
special_days, df_energy_no_special = get_special_days_clustering(df_closest_days,
                                                                 df_energy_cluster, threshold=0.1)

print('Number of hours in the year:', len(df_energy_no_special))
print('Removed days:', (len(df_energy) - len(df_energy_no_special)) / 24)

# Format the data (including correlation calculation) and save it in a .csv file
_, path_data_file = format_optim_repr_days(df_energy_no_special, name_data, folder_output)


selected_series, df, path_data_file_selection = (
    select_representative_series_hierarchical(path_data_file, n=n_features_selection, method='ward', metric='euclidean', scale=True, scale_method='standard'))

# Launch the optimization to find the representative days
n_rep_days = 2
# Option 1: include all the features when identifying representative days
# path_data = path_data_file
# Option 2: work only with the reduced set of features from the selection step
path_data = path_data_file_selection

launch_optim_repr_days(path_data, folder_output, nbr_days=n_rep_days, main_file='OptimizationModelZone.gms',
                       nbr_bins=10)

# Get the results
repr_days = parse_repr_days(folder_output, special_days)

# Format the data to be used in EPM
format_epm_phours(repr_days, folder_output, name_data=name_data)
format_epm_pvreprofile(df_energy, repr_days, folder_output, name_data=name_data)
# only activate when load data is provided
format_epm_demandprofile(df_energy, repr_days, folder_output, name_data=name_data)

# Export in .csv format
repr_days.to_csv(os.path.join(folder_output, 'repr_days_{}.csv'.format(name_data)), index=False)
df_energy.to_csv(os.path.join(folder_output, 'df_energy_{}.csv'.format(name_data)), index=False)

Number of hours in the year: 7776
Removed days: 41.0
File saved at: output/data_formatted_optim_capp.csv
File saved at /Users/lucas/Documents/World Bank/Projects/EPM_APPLIED/EPM_CAPP/pre-analysis/representative_days/output/data_formatted_optim_capp_selection.csv
File saved to: gams/bins_settings_10.csv
Launch GAMS code
End GAMS code
Number of days: 10
Total weight: 365
season
Q1    153.0
Q2    212.0
Name: weight, dtype: float64
pHours file saved at: output/pHours_capp.csv
Number of hours: 365
VRE Profile file saved at: output/pVREProfile_capp.csv
pDemandProfile file saved at: output/pDemandProfile_capp.csv
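
The weights reported above should add up to the number of days in each season (153 and 212 here, 365 in total). A check of this kind can be sketched as follows. The `season` and `weight` column names follow the printed output, while `check_weights` and the toy weights are assumptions made for illustration and are not the results of this notebook.

```python
import pandas as pd

def check_weights(repr_days, expected_days):
    """Compare per-season weight totals with the expected number of days."""
    totals = repr_days.groupby('season')['weight'].sum()
    for season, n_days in expected_days.items():
        total = totals.get(season, 0.0)
        if abs(total - n_days) > 1e-6:
            raise ValueError(f'{season}: weights sum to {total}, expected {n_days}')
    return totals

# Toy weights (illustrative only)
repr_days = pd.DataFrame({
    'season': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'day': [12, 87, 140, 30, 190],
    'weight': [80.0, 60.0, 13.0, 112.0, 100.0],
})

totals = check_weights(repr_days, {'Q1': 153, 'Q2': 212})
print(totals.sum())  # 365.0
```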




In [13]:
# Set to True to reload previously exported results instead of re-running the cells above
if False:
    df_energy = pd.read_csv(os.path.join(folder_output, 'df_energy_{}.csv'.format(name_data)))
    repr_days = pd.read_csv(os.path.join(folder_output, 'repr_days_{}.csv'.format(name_data)))


## Plot results 

Optional cell to plot the results. It will plot the load, wind and solar data for the representative year.

In [14]:
# Get data 
input_file = pd.read_csv(os.path.join(folder_output, 'data_formatted_optim_{}.csv'.format(name_data)), index_col=[0,1,2])
input_file.index.names = ['season', 'day', 'hour']

VREProfile = pd.read_csv(os.path.join(folder_output, 'pVREProfile_{}.csv'.format(name_data)), index_col=[0,1,2,3])

pHours = pd.read_csv(os.path.join(folder_output, 'pHours_{}.csv'.format(name_data)), index_col=[0,1])

In [17]:
# Checking the representative days

# === SETTINGS ===

season_colors = {
    'Q1': 'darkred',
    'Q2': 'dimgrey',
    'Q3': 'steelblue',
    'Q4': 'seagreen'}


# Total renewable production over all zones
plot_vre_repdays(input_file=input_file, vre_profile=VREProfile, pHours=pHours,
                 season_colors=season_colors, min_alpha=0.5, max_alpha=1,
                 path=os.path.join(folder_output, 'plot_vre_repdays_{}.png'.format(name_data)))

In [None]:
# Checking the representative days

# === SETTINGS ===
season_colors = {
    'Q1': 'darkred',
    'Q2': 'dimgrey',
    'Q3': 'steelblue',
    'Q4': 'seagreen'}

# Representative days per country
country = ['Angola']
plot_vre_repdays(input_file=input_file, vre_profile=VREProfile, pHours=pHours,
                 season_colors=season_colors, countries=country, min_alpha=0.5, max_alpha=1)