# Hydro Representative Years (in progress)

**Objective**  
Select hydrological representative years and convert them into EPM-ready availability series for each hydro plant.

**Data requirements (user-provided) and method**  
- Data requirements: Place one CSV per hydro plant in `pre-analysis/prepare-data/input/hydro_representative_years/`, with months (1-12) in the index and calendar years as columns. Place `capacity_hydro_sample.csv`, the optional `impact_cc_hydro_sample.csv`, and any extra availability file (for example `pAvailability_others_sample.csv`) directly under `pre-analysis/prepare-data/input/`. The repo ships with toy CSVs, so the notebook runs end-to-end after cloning.  
- Method: Configure clustering/selection parameters, load and clean the monthly profiles, apply any climate adjustments, compute representative years, and export availability tables plus QA plots under `pre-analysis/prepare-data/output/hydro_representative_years/`.
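
As a rough illustration of the expected per-plant layout, the sketch below builds a tiny in-memory CSV and reads it the same way the notebook does (`index_col=0`). The column header `month` and all values are made up for the example; only the shape (months down the rows, years across the columns) reflects the requirement above.

```python
import io

import pandas as pd

# Hypothetical plant file: first column = month (1-12), remaining columns = calendar years.
# The values are invented for illustration only.
toy_csv = io.StringIO(
    "month,1980,1981\n"
    + "\n".join(f"{m},{10 + m},{20 + m}" for m in range(1, 13))
)

df = pd.read_csv(toy_csv, index_col=0)   # months end up in the index
df.columns = pd.to_numeric(df.columns)   # year labels become integers

print(df.shape)          # 12 months x 2 years
print(df.loc[1, 1980])   # value for January 1980
```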

**Overview of steps**  
1. Import helpers and point to the working directories.
2. Declare rename dictionaries, clustering options, and scenario filters.
3. Build helper functions to compute seasonal averages and align metadata.
4. Load all hydro CSVs, compute representative years, and format the resulting EPM tables.
5. Review the exported CSVs and diagnostic plots.


## 1. User Inputs

Edit these parameters first. Paths can be absolute or relative to `pre-analysis/prepare-data`. By default, the notebook reads files directly from the shared `input/` and `output/` folders so it is immediately reusable across countries.


In [None]:
# Paths relative to `pre-analysis/prepare-data/`
monthly_hydro_folder = 'input/hydro_representative_years'  # folder with one CSV per plant
output_folder = 'output/hydro_representative_years'        # folder that will store CSVs/plots

capacity_file = 'input/capacity_hydro_sample.csv'          # installed capacity per plant (after rename)
impact_cc_file = 'input/impact_cc_hydro_sample.csv'        # set to None to skip climate adjustments
availability_other_file = 'input/pAvailability_others_sample.csv'  # optional: merged with hydro availability
existing_plants = ['kaleta', 'souapiti']                   # subset used for diagnostics

n_clusters = 2                      # number of representative years to extract
method = 'real'                     # clustering method passed to run_reduced_scenarios
format_preparation = 'stochastic'   # 'stochastic' or 'monte_carlo' downstream formatting
extract_extreme = True              # True to keep explicit min/max stress cases
rename_scenario = None              # optional dict: raw scenario name -> label; overwritten when n_clusters is 1 or 3
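
These settings also drive the names of the exported files. The sketch below mirrors the suffix logic used later in the notebook; `build_suffix` is a hypothetical helper name introduced only for this example.

```python
def build_suffix(n_clusters, use_climate_impacts, extract_extreme):
    """Rebuild the file suffix from the user inputs (hypothetical helper)."""
    suffix = f"{n_clusters}"
    if use_climate_impacts:      # corresponds to `impact_cc_file is not None`
        suffix += "_cc"
    if extract_extreme:
        suffix += "_extreme"
    return suffix


# With the defaults above (2 clusters, climate file set, extremes kept):
print(f"pAvailability_hydro_{build_suffix(2, True, True)}.csv")
print(f"pProbaScenarios_hydro_{build_suffix(2, True, True)}.csv")
```

Setting `impact_cc_file = None` or `extract_extreme = False` drops the matching part of the suffix, so runs with different settings do not overwrite each other.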


## 2. Setup: imports and helper paths

Loads pandas, resolves the `pre-analysis/prepare-data` root regardless of where the kernel started, and imports utilities from the `representative_days` package.


In [1]:
from pathlib import Path
import sys

import pandas as pd


def locate_prepare_data_dir():
    """Return the absolute path to `pre-analysis/prepare-data`."""
    cwd = Path.cwd().resolve()
    candidates = [
        cwd,
        cwd / 'pre-analysis' / 'prepare-data',
        cwd / 'prepare-data',
    ]
    candidates += [parent / 'pre-analysis' / 'prepare-data' for parent in cwd.parents]
    seen = set()
    for candidate in candidates:
        candidate = candidate.resolve()
        if candidate in seen:
            continue
        seen.add(candidate)
        if (candidate / 'hydro_representative_years.ipynb').exists():
            return candidate
    raise FileNotFoundError(
        'Cannot locate `pre-analysis/prepare-data`. Start the kernel inside the repo or update `locate_prepare_data_dir`.'
    )


PREPARE_DATA_DIR = locate_prepare_data_dir()
INPUT_DIR = PREPARE_DATA_DIR / 'input'
OUTPUT_DIR = PREPARE_DATA_DIR / 'output'


def resolve_path(path_like):
    """Resolve absolute paths while keeping `None` untouched."""
    if path_like is None:
        return None
    path = Path(path_like)
    return path if path.is_absolute() else PREPARE_DATA_DIR / path


if str(PREPARE_DATA_DIR) not in sys.path:
    # Allow importing the `representative_days` package without manual PYTHONPATH tweaks.
    sys.path.insert(0, str(PREPARE_DATA_DIR))

try:
    from representative_days.utils_reprdays import run_reduced_scenarios, plot_uncertainty, nb_days
except ModuleNotFoundError as exc:
    raise ModuleNotFoundError(
        'Install or activate the `representative_days` package so `utils_reprdays` can be imported.'
    ) from exc


monthly_hydro_folder = resolve_path(monthly_hydro_folder)
output_folder = resolve_path(output_folder)
capacity_file = resolve_path(capacity_file)
impact_cc_file = resolve_path(impact_cc_file)
availability_other_file = resolve_path(availability_other_file)
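
The path handling can be checked in isolation. The sketch below is a standalone variant of `resolve_path` that takes the root as an argument; `resolve_against` and the stand-in root are assumptions made for the example, not part of the notebook.

```python
from pathlib import Path


def resolve_against(root, path_like):
    """Standalone variant of `resolve_path` with an explicit root (illustrative only)."""
    if path_like is None:
        return None
    path = Path(path_like)
    return path if path.is_absolute() else root / path


root = Path.cwd().resolve()                    # stand-in for PREPARE_DATA_DIR
absolute = root / 'elsewhere' / 'capacity.csv'  # already absolute

print(resolve_against(root, 'input/capacity_hydro_sample.csv'))  # joined onto the root
print(resolve_against(root, absolute))                           # returned unchanged
print(resolve_against(root, None))                               # optional inputs stay None
```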


## 3. Define plant metadata helpers

In [3]:
rename_hpp = {
    'amaria': 'Amaria',
    'baneah': 'Baneah', 
    'bonkon_diaria': 'Bonkon-Diaria', 
    'boureya': 'Boureya', 
    'diallol': 'Diallol', 
    'diareguela': 'Diareguela', 
    'digan': 'Digan',
    'donkea': 'Donkea', 
    'farankonedou': 'Farankonedou', 
    'fello_sounga': 'Fello Sounga',
    'fetore': 'Fetore', 
    'fomi': 'Fomi', 
    'garafiri': 'Garafiri',
    'grand_chutes': 'Grand Chutes', 
    'grand_kinkon': 'Grand Kinkon',
    'guozoguezia': 'Guozoguezia',
    'hakkunde': 'Hakkunde-Mitti',
    'kaleta': 'Kaleta',
    'kassa': 'Kassa',
    'kogebedou': 'Kogebedou',
    'korafindi': 'Korafindi',
    'kouloutamba': 'Koukoutamba',
    'kouravel': 'Kouravel',
    'morisanako': 'Morisanako',
    'niagara': 'Niagara',
    'nzebela': 'Nzebela',
    'poudalde': 'Poudalde',
    'souapiti': 'Souapiti',
    'tiopo_105': 'Tiopo'
    
}

In [4]:
def calculate_seasonal_average(df):
    """Calculate the seasonal average of the data.
    
    Dry-season values (November to June) are replaced by their seasonal mean; November and
    December are grouped with January to June of the following calendar year. Values for
    July to October are left unchanged.
    
    Parameters
    ----------
    df : pd.DataFrame
        Monthly values with months (1-12) in the index and calendar years as columns.
    
    Returns
    -------
    pd.DataFrame
        Same layout as `df` (months in the index, years as string column labels).
    """

    temp = df.stack()
    temp.index.names = ['month', 'year']
    temp = temp.reorder_levels(['year', 'month'])
    temp = temp.sort_index()
    temp = temp.reset_index(name='value')
    temp['seasonal year'] = temp['year'].astype(int)
    temp.loc[temp['month'] >= 11, 'seasonal year'] += 1
    temp = temp.astype({'year': int, 'month': int, 'value': float, 'seasonal year': int})
    temp_filtered = temp[temp['month'].isin([11, 12, 1, 2, 3, 4, 5, 6])]
    seasonal_averages = temp_filtered.groupby('seasonal year')['value'].mean()
    seasonal_averages.name = 'value avg'
    temp_filtered = pd.merge(temp_filtered, seasonal_averages, left_on='seasonal year', right_index=True)
    temp_filtered.loc[:, 'value'] = temp_filtered.loc[:, 'value avg']
    temp_filtered = temp_filtered.loc[:, ['year', 'month', 'value']]
    temp_filtered.rename(columns={'value': 'value avg'}, inplace=True)

    temp = pd.merge(temp, temp_filtered, on=['year', 'month'], how='left')
    # Replace the value by the average value if value avg is not missing
    temp.loc[~temp['value avg'].isnull(), 'value'] = temp.loc[~temp['value avg'].isnull(), 'value avg']
    
    temp = temp.loc[:, ['year', 'month', 'value']]
    temp = temp.pivot(index='month', columns='year', values='value') 
    temp.columns = temp.columns.astype(str)

    return temp
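
To make the dry-season grouping concrete, here is a compact, self-contained sketch of the same idea on a two-year toy table. It is not the notebook's function: `seasonal_average_sketch` is a hypothetical re-implementation written for this example, and the input values are invented.

```python
import pandas as pd


def seasonal_average_sketch(df):
    """Replace Nov-Jun values by their dry-season mean; keep Jul-Oct untouched."""
    long = df.rename_axis('month').reset_index().melt(
        id_vars='month', var_name='year', value_name='value'
    )
    long['year'] = long['year'].astype(int)
    # November and December belong to the dry season that ends the following June.
    long['season'] = long['year'] + (long['month'] >= 11).astype(int)
    dry = long['month'].isin([11, 12, 1, 2, 3, 4, 5, 6])
    season_mean = long[dry].groupby('season')['value'].transform('mean')
    long.loc[dry, 'value'] = season_mean
    out = long.pivot(index='month', columns='year', values='value')
    out.columns = out.columns.astype(str)
    return out


# Toy input: year 2000 holds 1..12, year 2001 holds 13..24.
toy = pd.DataFrame(
    {
        '2000': [float(m) for m in range(1, 13)],
        '2001': [float(m + 12) for m in range(1, 13)],
    },
    index=range(1, 13),
)
out = seasonal_average_sketch(toy)
print(out)
```

In the result, November 2000 through June 2001 all share one value (the mean of those eight months), while July to October keep their original values. The first and last seasons are averaged over fewer months because the neighbouring years are missing from the data.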

## 4. Load hydro files and compute representative years
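
When a climate-impact file is supplied, each historical year is duplicated once per climate scenario and scaled by `1 + factor`. The sketch below illustrates that step on toy data. The scenario names and factors are invented, and the reading of the file as one relative change per scenario is inferred from how the notebook applies it.

```python
import pandas as pd

# Toy monthly profile: two months (rows) and two years (columns).
monthly = pd.DataFrame({'1980': [100.0, 80.0], '1981': [120.0, 90.0]}, index=[1, 2])

# Hypothetical factors: relative change in production per climate scenario.
impact_cc = pd.Series({'dry': -0.1, 'wet': 0.05})

# One scaled copy of the table per scenario, with columns labelled '<year>-<scenario>'.
scaled = pd.concat(
    [
        (monthly * (1 + factor)).rename(columns=lambda year: f'{year}-{scenario}')
        for scenario, factor in impact_cc.items()
    ],
    axis=1,
)
print(list(scaled.columns))
print(scaled)
```

With two years and two scenarios the clustering step sees four candidate columns instead of two.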

In [None]:
if monthly_hydro_folder is None:
    raise ValueError('Set `monthly_hydro_folder` to the folder that stores the monthly hydro CSVs.')
if not monthly_hydro_folder.exists():
    raise FileNotFoundError(f'Cannot find monthly hydro folder: {monthly_hydro_folder}')

# Default output mirrors the shared `output/` folder so teams can compare runs quickly.
default_output = OUTPUT_DIR / 'hydro_representative_years'
out_folder = output_folder or default_output
out_folder = Path(out_folder)
out_folder.mkdir(parents=True, exist_ok=True)

suffix_file = f'{n_clusters}'
if impact_cc_file is not None:
    suffix_file = f'{suffix_file}_cc'
if extract_extreme:
    suffix_file = f'{suffix_file}_extreme'

impact_cc = None
if impact_cc_file is not None:
    if not impact_cc_file.exists():
        raise FileNotFoundError(f'Cannot find climate-impact file: {impact_cc_file}')
    impact_cc = pd.read_csv(impact_cc_file, index_col=0).squeeze()

if existing_plants is None:
    existing_plants = []

csv_files = sorted(monthly_hydro_folder.glob('*.csv'))
if not csv_files:
    raise FileNotFoundError(f'No CSV files found under {monthly_hydro_folder}')

data, results = {}, {}
tot, tot_existing = [], []

# Representative years are determined from the system-level production, not individual files.
for csv_path in csv_files:
    print(csv_path.name)
    name = csv_path.stem
    df = pd.read_csv(csv_path, index_col=0)

    # Keep years from 1980 onward to focus on recent hydrology.
    df.columns = pd.to_numeric(df.columns)
    df = df.loc[:, [i for i in df.columns if i >= 1980]]
    df.columns = df.columns.astype(str)

    # Keep the average production for the dry season (Nov-Jun).
    df = calculate_seasonal_average(df)

    if impact_cc is not None:
        df_cc = pd.DataFrame()
        for scenario, factor in impact_cc.items():
            temp = df.copy() * (1 + factor)
            temp.columns = [f'{col}-{scenario}' for col in temp.columns]
            df_cc = pd.concat([df_cc, temp], axis=1)

        df = df_cc.copy()

    data.update({name: df.copy()})
    tot = tot + [df]
    if name in existing_plants:
        tot_existing = tot_existing + [df]

# Sum all the hydropower production to feed the clustering algorithm.
tot = sum(tot)
tot_existing = sum(tot_existing)

# Indicator (useful to keep track of average production levels).
prod = (tot.T * nb_days).T * 24 / 1e3
indicator = {'yearly_prod_avg': prod.sum().mean(), 'monthly_prod_avg': prod.mean(axis=1)}

# Find the representative years and optionally tag explicit min/max chronicles.
if extract_extreme:
    # The driest and wettest 5% of the scenario columns become explicit 'min' and 'max' cases.
    n_extreme = int(len(tot.columns) / 20)
    ranked = tot.sum().sort_values()

    worst_case = ranked.iloc[:n_extreme]
    min_prod = {k: i.loc[:, list(worst_case.index)].mean(axis=1) for k, i in data.items()}
    data = {k: i.drop(columns=list(worst_case.index)) for k, i in data.items()}
    proba_min = worst_case.shape[0] / len(tot.columns)

    # Slice from `len(ranked) - n_extreme` rather than `-n_extreme`: with fewer than 20 columns
    # n_extreme is 0 and `iloc[-0:]` would select every column instead of none.
    best_case = ranked.iloc[len(ranked) - n_extreme:]
    max_prod = {k: i.loc[:, list(best_case.index)].mean(axis=1) for k, i in data.items()}
    data = {k: i.drop(columns=list(best_case.index)) for k, i in data.items()}
    proba_max = best_case.shape[0] / len(tot.columns)

years_repr = run_reduced_scenarios(tot, n_clusters=n_clusters, method=method)
scenarios = [i.split(' - ')[0] for i in years_repr.columns]

# Select the data for the representative years
data = {k: i.loc[:, scenarios].stack() for k, i in data.items()}
if format_preparation == 'monte_carlo':
    mc_folder = out_folder / 'output_monte_carlo'
    mc_folder.mkdir(parents=True, exist_ok=True)

elif format_preparation == 'stochastic':
    data = pd.concat(data, axis=1, names=['hpp']).T
    data.columns.names = ['Month', 'Scenarios']
    data = data.reorder_levels(['Scenarios', 'Month'], axis=1)
    data = data.sort_index(axis=1)

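    # Label the scenarios automatically: 'baseline' for one cluster, or 'low'/'medium'/'high'
    # (ranked by total production) for three clusters. This overwrites any user-supplied `rename_scenario`.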
    if n_clusters == 1:
        rename_scenario = {scenarios[0]: 'baseline'}
        data.rename(columns=rename_scenario, level='Scenarios', inplace=True)
    elif n_clusters == 3:
        temp = tot.loc[:, scenarios].sum()
        rename_scenario = {
            temp.idxmin(): 'low',
            temp.idxmax(): 'high',
            [i for i in temp.index if i not in [temp.idxmin(), temp.idxmax()]][0]: 'medium',
        }
        data = data.rename(columns=rename_scenario, level='Scenarios')

    if extract_extreme:
        min_prod = pd.concat(min_prod, axis=1, names=['hpp']).T
        min_prod = pd.concat([min_prod], axis=1, keys=['min'])
        min_prod.columns.names = ['Scenarios', 'Month']
        min_prod = min_prod.sort_index(axis=1)

        data = pd.concat([data, min_prod], axis=1)

        max_prod = pd.concat(max_prod, axis=1, names=['hpp']).T
        max_prod = pd.concat([max_prod], axis=1, keys=['max'])
        max_prod.columns.names = ['Scenarios', 'Month']
        max_prod = max_prod.sort_index(axis=1)

        data = pd.concat([data, max_prod], axis=1)

    # Rename the hpp to the readable names expected downstream.
    if rename_hpp is not None:
        data.rename(index=rename_hpp, level='hpp', inplace=True)

    if capacity_file is None:
        raise ValueError('Capacity file is required to convert production into availability.')
    if not capacity_file.exists():
        raise FileNotFoundError(f'Cannot find capacity file: {capacity_file}')

    print('Capacity file found -> calculate availability profiles')
    capacity = pd.read_csv(capacity_file, index_col=0).squeeze()
    capacity = capacity.loc[data.index.get_level_values('hpp').unique()]
    availability = (data.T / capacity).T

    # Export in EPM format
    availability.index.names = [None]

    if availability_other_file is not None and availability_other_file.exists():
        print(f'Merging additional availability from {availability_other_file.name}')
        availability_other = pd.read_csv(availability_other_file, index_col=0, header=0)
        availability_other.columns.names = ['Month']
        availability_other.columns = availability_other.columns.astype(int)

        availability_other = pd.concat(
            [availability_other] * len(availability.columns.get_level_values('Scenarios').unique()),
            keys=availability.columns.get_level_values('Scenarios').unique(),
            axis=1,
        )
        availability = pd.concat((availability, availability_other), axis=0)

    availability.round(3).to_csv(out_folder / f'pAvailability_hydro_{suffix_file}.csv')

    # For sensitivity analysis
    if n_clusters == 3:
        for i in ['low', 'medium', 'high']:
            availability.loc(axis=1)[i].round(3).to_csv(out_folder / f'pAvailability_hydro_{i}.csv')

    # Probabilities are encoded in the scenario names ('<scenario> - <probability>').
    pProbaScenarios = pd.Series(
        [i.split(' - ')[1] for i in years_repr.columns],
        index=pd.Index([i.split(' - ')[0] for i in years_repr.columns], name='Scenarios'),
        name='Value',
    )
    if rename_scenario is not None:
        pProbaScenarios = pProbaScenarios.rename(index=rename_scenario)
    if extract_extreme:
        pProbaScenarios = pProbaScenarios.astype(float) * (1 - proba_min - proba_max)
        pProbaScenarios.loc['min'] = proba_min
        pProbaScenarios.loc['max'] = proba_max

    pProbaScenarios.to_csv(out_folder / f'pProbaScenarios_hydro_{suffix_file}.csv')


fello_sounga.csv
farankonedou.csv
baneah.csv
poudalde.csv
nzebela.csv
tiopo_105.csv
niagara.csv
digan.csv
bonkon_diaria.csv
grand_chutes.csv
diallol.csv
amaria.csv
fetore.csv
kaleta.csv
kaleta
kouloutamba.csv
donkea.csv
fomi.csv
garafiri.csv
garafiri
guozoguezia.csv
hakkunde.csv
korafindi.csv
souapiti.csv
souapiti
kogebedou.csv
diareguela.csv
kouravel.csv
kassa.csv
morisanako.csv
grand_kinkon.csv
boureya.csv
Capacity file exist to calculate the availability
Availability file exists for other sources
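
The extreme-case handling and the probability rescaling in the cell above can be followed on a small example. The sketch below uses invented yearly totals and two invented representative-year labels. The `'<scenario> - <probability>'` label format is assumed from the way the notebook splits the column names of `years_repr`.

```python
import pandas as pd

# Hypothetical yearly system totals for 40 scenario columns (values are made up).
totals = pd.Series({f'{1980 + i}': 1000.0 + 10 * i for i in range(40)})

n_extreme = int(len(totals) / 20)              # 5% of the columns
ranked = totals.sort_values()
worst = ranked.iloc[:n_extreme]                # driest columns -> 'min' case
best = ranked.iloc[len(ranked) - n_extreme:]   # wettest columns -> 'max' case
proba_min = len(worst) / len(totals)
proba_max = len(best) / len(totals)

# Assumed label format '<scenario> - <probability>'.
labels = ['1995 - 0.6', '2005 - 0.4']
proba = pd.Series(
    [float(label.split(' - ')[1]) for label in labels],
    index=[label.split(' - ')[0] for label in labels],
)

# Shrink the cluster probabilities so that, together with the tails, they sum to 1.
proba = proba * (1 - proba_min - proba_max)
proba.loc['min'] = proba_min
proba.loc['max'] = proba_max
print(proba)
print(proba.sum())
```

Each tail receives a probability equal to its share of the columns, and the representative years share the remainder in proportion to their cluster weights.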


## 5. Review outputs and QA plots

In [6]:
if 'baseline' in data.columns.get_level_values('Scenarios'):
    capacity_baseline = data.xs('baseline', level='Scenarios', axis=1)
    prod = (capacity_baseline * nb_days).T * 24 / 1e3
    display(prod.sum(axis=1), prod.sum().sum())

    prod_current = prod.loc[:, ['Kaleta', 'Souapiti']]
    display(prod_current.sum(axis=1))
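
The `* nb_days ... * 24 / 1e3` pattern converts monthly mean power into energy. The sketch below works through the arithmetic under two assumptions that are not confirmed by the notebook itself: that `nb_days` maps each month to its number of days, and that the monthly values are mean power in MW (which would make the result GWh).

```python
import pandas as pd

# Assumption: days per month for a non-leap year, indexed by month number.
days_per_month = pd.Series(
    [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], index=range(1, 13)
)

# Hypothetical plant running at a constant 100 MW mean power every month.
mean_power_mw = pd.Series(100.0, index=range(1, 13))

# MW x days x 24 h/day = MWh; dividing by 1e3 gives GWh.
energy_gwh = mean_power_mw * days_per_month * 24 / 1e3

print(energy_gwh.loc[1])   # January: 100 MW * 31 d * 24 h / 1000
print(energy_gwh.sum())    # full year: about 100 MW * 8760 h / 1000
```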

In [7]:
prod = data.mul(nb_days, level='Month', axis=1) * 24 / 1e3
capacity = data.sum()
existing_plants_named = [rename_hpp[i] for i in existing_plants if i in rename_hpp]
capacity_existing = data.loc[existing_plants_named, :].sum()
prod = prod.sum()
prod, capacity, capacity_existing = (
    prod.unstack('Scenarios'),
    capacity.unstack('Scenarios'),
    capacity_existing.unstack('Scenarios'),
)


In [8]:
filename = out_folder / f'production_hydro_tot_{suffix_file}.png'
plot_uncertainty(
    tot,
    df2=capacity,
    title="Hydropower capacity (MW)",
    ylabel="MW",
    filename=str(filename),
    ymax=4500,
)

filename = out_folder / f'production_hydro_existing_{suffix_file}.png'
plot_uncertainty(
    tot_existing,
    df2=capacity_existing,
    title="Hydropower capacity existing (MW)",
    ylabel="MW",
    filename=str(filename),
    ymax=1100,
)
