
# Parameter Exploration with Gaussian Processes

This notebook explores which parameter combinations best reproduce the observed dengue incidence in Colombia, using a pre-trained Gaussian Process (GP) emulator for the maximum incidence.

Key points of the analysis:

* **Municipality-Specific Calibration:** Municipality-level heterogeneity in transmission is captured via municipality-specific average infectivities. All other model parameters are held constant across municipalities (except for the timing of the first case, which is addressed separately).

* **Data Splitting for Calibration and Testing:** Among 173 municipalities with at least three detected epidemics (1,186 outbreaks total), data are split within each municipality into 67% for calibration (737 outbreaks total) and 33% for testing (449 outbreaks total).

* **Goal:** Identify the parameter combinations that result in the best agreement between GP-predicted maximum incidence values and the empirical dengue outbreak data.

**Note:** The GP implementation uses slightly different variable names than those in the manuscript. The table below shows the correspondence.

| GP Variable        | Manuscript Term           | Description                                                                 |
|--------------------|---------------------------|------------------------------------------------------------------------------|
| `alphaRest`        | Average infectivity        | Baseline probability of infection per day.                                   |
| `alphaAmp`         | Seasonality strength       | Scaling factor controlling seasonal variation in infection probability.      |
| `alphaShift`       | First case timing          | Timing of the first case relative to the seasonal peak in infection probability. |
| `infTicksCount`    | Infectious period          | Average number of days an individual remains infectious.                     |
| `avgVisitsCount`   | Average mobility           | Average number of daily visits per individual.                               |
| `pVisits`          | Mobility skewness          | Parameter controlling variability of daily visits.                           |
| `propSocialVisits` | Social structure           | Probability that a visit occurs within a family cluster.                     |
| `locPerSGCount`    | Family cluster size        | Number of locations per family cluster.                                      |

--- 

## Setup

### Imports

This chunk prepares the environment for calibrating the GP emulator to empirical dengue outbreak data:

1. **`SIR_gp`**: Imports the GP emulator class (`SIR_GP`).
2. **Emukit**: Provides tools for parameter space definitions (`ParameterSpace`, `ContinuousParameter`) and Latin Hypercube Sampling (`LatinDesign`) to explore the parameter space efficiently.
3. **pandas, NumPy, PyTorch**: Used for tabular data handling, array operations, and the tensors passed to the GP.
4. **Spearman correlation**: `scipy.stats.spearmanr` will be used to compare GP predictions with empirical data.
5. **Warnings**: Suppressed to keep notebook output clean.



In [1]:
# Import GP emulator class and required packages
from SIR_gp import SIR_GP  # assuming SIR_GP is the class name in SIR_gp.py

import pandas as pd
import numpy as np
import torch

# Emukit for parameter space definition and Latin Hypercube Sampling
from emukit.core import ParameterSpace, ContinuousParameter
from emukit.core.initial_designs.latin_design import LatinDesign

# For statistical correlation analysis
from scipy.stats import spearmanr

import warnings
warnings.filterwarnings("ignore")  # suppress all warnings to keep notebook output clean


### Functions

In [2]:

def filter_min_count_epidemics(df, min_count=3):
    """
    Filter for municipalities with at least `min_count` outbreaks.

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe containing epidemic records with column 'ocha_ID' (municipality code).
    min_count : int
        Minimum number of outbreaks required to keep a municipality.

    Returns
    -------
    pd.DataFrame
        Filtered dataframe containing only municipalities with at least `min_count` outbreaks.
    """
    # Count outbreaks per municipality
    outbreak_counts = df['ocha_ID'].value_counts()
    
    # Identify municipalities meeting threshold
    valid_municipalities = outbreak_counts[outbreak_counts >= min_count].index
    
    # Filter dataframe
    df_filtered = df[df['ocha_ID'].isin(valid_municipalities)].copy()
    
    return df_filtered


def split_dataframe(df, train_proportion=0.67):
    """
    Split dataframe into two subsets (train and test) while keeping outbreaks from
    the same municipality grouped.

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe containing epidemic records with column 'ocha_ID' (municipality code).
    train_proportion : float
        Proportion of outbreaks per municipality to include in the training set.

    Returns
    -------
    tuple of pd.DataFrame
        df_train : DataFrame with `train_proportion` of outbreaks per municipality.
        df_test : DataFrame with remaining outbreaks.
    """
    df_train = pd.DataFrame()
    df_test = pd.DataFrame()
    
    # Split data by municipality
    for municipality, group in df.groupby('ocha_ID'):
        num_outbreaks = len(group)
        indices = np.arange(num_outbreaks)
        np.random.shuffle(indices)  # Shuffle indices for random selection
        
        split_idx = int(train_proportion * num_outbreaks)
        
        df_train = pd.concat([df_train, group.iloc[indices[:split_idx]]], ignore_index=True)
        df_test = pd.concat([df_test, group.iloc[indices[split_idx:]]], ignore_index=True)
    
    return df_train, df_test


def format_data(LHS_sample, epidemics, epidemic_id):
    """
    Formats one epidemic's input data for GP prediction.

    Each epidemic is characterized by its starting day ('start_day'),
    which is used to shift the phase of the seasonality.

    Parameters
    ----------
    LHS_sample : np.ndarray
        Candidate parameter combinations (without correction factor column).
    epidemics : pd.DataFrame
        Epidemic metadata, must contain 'start_day'.
    epidemic_id : int
        Index of the epidemic being processed.

    Returns
    -------
    torch.Tensor
        Formatted tensor ready for GP prediction.
    """
    res = LHS_sample.copy()
    start_day = epidemics['start_day'].iloc[epidemic_id]  # positional lookup, independent of index labels

    # Apply phase shift adjustment (wrap around the seasonal cycle)
    phase_shift = (res[:, 2] + start_day / 365.0) % 1.0
    res[:, 2] = phase_shift

    return torch.from_numpy(res).float().contiguous()


def predict_var(LHS_sample, epidemics, GP_model, verbose=False):
    """
    Performs GP predictions for each epidemic given a set of sampled parameters.

    Parameters
    ----------
    LHS_sample : np.ndarray
        Candidate parameter matrix (N_samples x N_params)
    epidemics : pd.DataFrame
        Epidemic data containing either 'max_incidence' or 'duration_days'
    GP_model : SIR_GP
        Trained GP emulator
    verbose : bool
        If True, prints progress messages

    Returns
    -------
    np.ndarray
        Matrix of GP predictions [N_samples x N_epidemics]
    """
    # Select observed variable type
    if GP_model.model_type == "maxIncidence":
        observed = epidemics['max_incidence'].values
    elif GP_model.model_type == "duration":
        observed = np.log10(epidemics['duration_days'].values)
    else:
        raise ValueError(f"Unknown GP model type: {GP_model.model_type}")

    # Extract correction factor (last column)
    correction = LHS_sample[:, -1].reshape(-1, 1)

    n_samples = LHS_sample.shape[0]
    n_epidemics = len(observed)
    results = np.empty((n_samples, n_epidemics))

    if verbose:
        print(f"Predicting {n_samples} parameter sets across {n_epidemics} epidemics...")

    for i in range(n_epidemics):
        if verbose and i % 25 == 0:
            print(f" → Processing epidemic {i+1}/{n_epidemics}")

        formatted_data = format_data(LHS_sample=LHS_sample[:, :-1],
                                     epidemics=epidemics,
                                     epidemic_id=i)

        preds, _, _ = GP_model.predict_ys(parsed_data=formatted_data)

        # Apply correction factor for maxIncidence
        if GP_model.model_type == "maxIncidence":
            results[:, i] = preds.numpy().flatten() * correction.flatten()
        else:
            results[:, i] = preds.numpy().flatten()

    # Clip outputs to valid ranges
    clip_bounds = (0.0, 1.0) if GP_model.model_type == "maxIncidence" else (0.0, 3.0)
    return np.clip(results, *clip_bounds)

def calculate_rmse(epidemics, predictions, model_type):
    """
    Calculate the Root Mean Squared Error (RMSE) between GP predictions and observed epidemic data.

    Parameters
    ----------
    epidemics : pd.DataFrame
        Empirical outbreak data for a set of epidemics. Must contain either 'max_incidence' or 'duration_days'.
    predictions : np.ndarray
        GP predictions with shape (n_candidates, n_epidemics)
    model_type : str
        Type of GP model. Options:
        - "maxIncidence" : compare predicted vs observed maximum incidence
        - "duration"     : compare predicted vs log10(observed duration_days)

    Returns
    -------
    np.ndarray
        RMSE values for each candidate parameter set (shape: n_candidates,)
    """
    
    # Extract observed values based on model type
    if model_type == "maxIncidence":
        observed = epidemics['max_incidence'].values
    elif model_type == "duration":
        observed = np.log10(epidemics['duration_days'].values)
    else:
        raise ValueError(f"Unknown model_type: {model_type}. Must be 'maxIncidence' or 'duration'.")
    
    # Ensure observed has shape (1, n_epidemics) for broadcasting
    observed = observed.reshape(1, -1)
    
    # Compute RMSE for each candidate parameter set
    rmse = np.sqrt(np.mean((predictions - observed) ** 2, axis=1))
    
    return rmse


def shuffle_var(emp_df, f_alpha_df, f_candidate_LHS, n_iter, shuffle_var, GP_model):
    """
    Conduct permutation tests by shuffling outbreak times, municipalities, or both.

    Parameters
    ----------
    emp_df : pd.DataFrame
        Empirical outbreak data.
    f_alpha_df : pd.DataFrame
        Municipality-specific alphaRest estimates.
    f_candidate_LHS : np.ndarray
        LHS parameter set used for predictions (excluding alphaRest).
    n_iter : int
        Number of permutation iterations.
    shuffle_var : str
        Type of shuffle: 'day', 'municipality', or 'both'.
    GP_model : SIR_GP
        Trained GP emulator.

    Returns
    -------
    list
        Spearman correlation coefficients from each permutation iteration.
    """
    f_df = emp_df.copy()
    shuffle_stats = []

    for i in range(n_iter):
        if i % 10 == 0:
            print(f"{i} out of {n_iter} permutations done.")

        f_predictions = pd.DataFrame()

        # Shuffle data based on specified method
        if shuffle_var == 'day':
            f_df['start_day'] = np.random.permutation(f_df['start_day'].values)
        elif shuffle_var == 'municipality':
            f_df['ocha_ID'] = np.random.permutation(f_df['ocha_ID'].values)
        elif shuffle_var == 'both':
            f_df['start_day'] = np.random.permutation(f_df['start_day'].values)
            f_df['ocha_ID'] = np.random.permutation(f_df['ocha_ID'].values)
        else:
            raise ValueError(f"Unknown shuffle_var: {shuffle_var}. Choose 'day', 'municipality', or 'both'.")

        # Loop over municipalities to make predictions
        for m in f_alpha_df['municipality']:
            # Subset empirical data for current municipality
            f_df_m = f_df.loc[f_df['ocha_ID'] == m].reset_index(drop=True)

            # Get municipality-specific alphaRest
            f_alpha_m = f_alpha_df.loc[f_alpha_df['municipality'] == m, 'alphaRest'].values

            # Combine alphaRest with LHS parameters
            f_params = np.hstack((f_alpha_m, f_candidate_LHS)).reshape(1, -1)

            # Predict max incidence
            f_pred = predict_var(LHS_sample=f_params, epidemics=f_df_m, GP_model=GP_model)

            # Combine predictions with original data
            f_pred_df = pd.DataFrame(f_pred.flatten(), columns=['pred'])
            f_combined_df = pd.concat([f_df_m, f_pred_df], axis=1)

            # Append to cumulative predictions
            f_predictions = pd.concat([f_predictions, f_combined_df], ignore_index=True)

        # Compute Spearman correlation between observed and predicted
        iter_stat = spearmanr(f_predictions['max_incidence'], f_predictions['pred']).statistic
        shuffle_stats.append(iter_stat)

    return shuffle_stats


### Data and Parameter Setup

Before calibrating the Gaussian Process (GP) emulator to empirical dengue outbreak data, we define the paths to all required datasets and the pre-trained GP model.  

We also specify the ranges of model parameters to explore. `alphaRest` is not sampled by LHS; instead, it is discretized into evenly spaced steps over a separate, narrow range.  

We configure sampling via Latin Hypercube Sampling (LHS) to generate candidate parameter combinations, set the proportion of data used for calibration (`PROP_FIT = 0.67`), and define output settings to retain the top parameter sets and perform permutations.

**Note:** For demonstration purposes, the parameters are set to `N_SAMPLE_LHS = 1000`, `PARAM_STEPS_ALPHA = 10`, and `N_SHUFFLE = 100`.
To reproduce analyses comparable to those presented in the [preprint](https://www.medrxiv.org/content/10.1101/2024.11.28.24318136v2), use `N_SAMPLE_LHS = 25000`, `PARAM_STEPS_ALPHA = 50`, and `N_SHUFFLE = 1000`.
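
To get a feel for how these settings scale, the sketch below estimates the number of candidate rows and the size of the prediction matrix that `predict_var` allocates with `np.empty((n_samples, n_epidemics))`. It assumes NumPy's default `float64` (8 bytes per value) and the 737 calibration outbreaks reported in this notebook. The helper name `prediction_matrix_size` is hypothetical and not part of the notebook.

```python
def prediction_matrix_size(n_lhs, n_alpha_steps, n_epidemics, bytes_per_value=8):
    """Return (n_candidates, size in GiB) of a dense prediction matrix."""
    # Every LHS sample is paired with every alphaRest step
    n_candidates = n_lhs * n_alpha_steps
    size_gib = n_candidates * n_epidemics * bytes_per_value / 1024**3
    return n_candidates, size_gib


demo = prediction_matrix_size(1000, 10, 737)    # demonstration settings
full = prediction_matrix_size(25000, 50, 737)   # preprint-scale settings

print(f"demo: {demo[0]} candidates, {demo[1]:.3f} GiB")
print(f"full: {full[0]} candidates, {full[1]:.3f} GiB")
```

Under these assumptions, the preprint-scale settings imply 1,250,000 candidates and a prediction matrix on the order of 7 GiB, so it may be necessary to process candidates in batches on machines with limited memory.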


In [3]:
# -------------------------------
# Paths to Data and Model Files
# -------------------------------
PATH_DATA = "../data/empirical/OpenDengue_detected_epidemics.txt"  # Empirical dengue outbreak summary
PATH_GP_TRAIN = "../data/GP/data/sim-training-maxIncidence-round15.txt"  # GP training dataset
PATH_GP_TEST = "../data/GP/data/DD-AML-test-LHS-10000-condSim-logDuration.txt"  # GP test dataset
PATH_GP_MODEL = "../data/GP/model/maxIncidence-round15-snap3.pth"  # Pre-trained GP model snapshot

# -------------------------------
# Parameter Ranges for Exploration
# -------------------------------
PARAM_RANGES = ParameterSpace([
    ContinuousParameter("alphaAmp", 0, 1),
    ContinuousParameter("alphaShift", 0, 1),
    ContinuousParameter("infTicksCount", 4, 6),
    ContinuousParameter("avgVisitsCount", 1, 5),
    ContinuousParameter("pVisits", 0.05, 0.95),
    ContinuousParameter("propSocialVisits", 0, 1),
    ContinuousParameter("locPerSGCount", 1, 20),
    ContinuousParameter("correctionFactor", 0, 0.1)  # correction factor applied to incidences
])

# Parameter range for alphaRest (not included in PARAM_RANGES)
PARAM_RANGES_ALPHA = [0, 0.03]  # Min and max values
PARAM_STEPS_ALPHA = 10  # Number of steps to discretize alphaRest

# -------------------------------
# Sampling and Fitting Configuration
# -------------------------------
N_SAMPLE_LHS = 1000       # Number of parameter sets to draw from Latin Hypercube Sampling (LHS)
RANDOM_STATE_SEED = 42    # Seed for reproducibility (LHS sample, train/test split)
PROP_FIT = 0.67           # Proportion of empirical data used for GP calibration

# -------------------------------
# Output and Evaluation Settings
# -------------------------------
TOP_X = 250     # Keep top X parameter sets with lowest RMSE
N_SHUFFLE = 100 # Number of iterations for permutation

### Empirical Data Preparation

In this step, we:

1. Load the detected dengue epidemics (`OpenDengue_detected_epidemics.txt`).
2. Filter municipalities to only those with at least three outbreaks, ensuring sufficient data for calibration.
3. Split the filtered data within each municipality into a calibration set (`df_calibration`, 67% of outbreaks) and a testing set (`df_testing`, 33% of outbreaks).
4. Extract the list of unique municipalities used for GP calibration.
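
As a rough illustration of how the within-municipality split behaves, the sketch below applies the same `int(train_proportion * n)` rule used by `split_dataframe` to a few outbreak counts. The helper name `split_sizes` is hypothetical and only serves this example.

```python
def split_sizes(n_outbreaks, train_proportion=0.67):
    """Return (n_train, n_test) under the floor rule int(train_proportion * n)."""
    n_train = int(train_proportion * n_outbreaks)  # int() truncates toward zero
    return n_train, n_outbreaks - n_train


# Toy outbreak counts per municipality
for n in (3, 4, 5, 10):
    n_train, n_test = split_sizes(n)
    print(f"{n} outbreaks -> {n_train} calibration, {n_test} testing")
```

Because the calibration count is rounded down, small municipalities contribute less than 67% of their outbreaks to calibration (for example, 2 of 4). This flooring is one plausible reason the calibration set reported below (737 of 1,186 outbreaks, about 62%) is smaller than the nominal proportion.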


In [4]:
# -------------------------------
# Load and Prepare Empirical Epidemic Data
# -------------------------------

np.random.seed(RANDOM_STATE_SEED)  # Ensure reproducibility

# Load detected dengue epidemics
epidemics_all = pd.read_csv(PATH_DATA, sep="\t", header=0)

# Filter to municipalities with at least 3 outbreaks
epidemics_filtered = filter_min_count_epidemics(epidemics_all, min_count=3)

# Split filtered data into calibration (fit) and testing (prediction) sets
df_calibration, df_testing = split_dataframe(epidemics_filtered, train_proportion=PROP_FIT)

# List of unique municipalities used for calibration
unique_municipalities = df_calibration['ocha_ID'].unique()

# -------------------------------
# Display basic info
# -------------------------------
print("Filtered epidemics (municipalities with ≥3 outbreaks):", epidemics_filtered.shape)
print("Calibration set (used for GP fitting):", df_calibration.shape)
print("Testing set (used for prediction assessment):", df_testing.shape)
print("Number of unique municipalities in calibration set:", len(unique_municipalities))



Filtered epidemics (municipalities with ≥3 outbreaks): (1186, 5)
Calibration set (used for GP fitting): (737, 5)
Testing set (used for prediction assessment): (449, 5)
Number of unique municipalities in calibration set: 173




### Generating Parameter Candidates

To explore the parameter space efficiently, we generate a diverse set of parameter combinations using Latin Hypercube Sampling (LHS). Each sample represents a unique configuration of model parameters that will be evaluated using the trained GP emulator.

We include two sets of parameters:

* **GP parameter values, except `alphaRest`,** are sampled with LHS.
* **`alphaRest` values** are sampled over a linear range.

The two sets are then combined into one array of parameter candidates.
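
The combination step can be illustrated on a toy example. The sketch below uses the same `np.repeat` / `np.tile` / `np.hstack` pattern as the notebook, but with two made-up LHS rows and three made-up `alphaRest` values; all `*_toy` names and numbers are illustrative assumptions.

```python
import numpy as np

# 2 toy "LHS samples" with 2 parameters each
lhs_toy = np.array([[0.1, 0.2],
                    [0.3, 0.4]])

# 3 toy alphaRest values as a column vector
alpha_toy = np.array([0.0, 0.015, 0.03]).reshape(-1, 1)

# Repeat each LHS row once per alphaRest value (rows stay in contiguous blocks)
lhs_rep = np.repeat(lhs_toy, alpha_toy.shape[0], axis=0)

# Stack the full alphaRest vector once per LHS row
alpha_til = np.tile(alpha_toy, (lhs_toy.shape[0], 1))

# alphaRest becomes the first column, followed by the LHS parameters
candidates_toy = np.hstack((alpha_til, lhs_rep))
print(candidates_toy.shape)
print(candidates_toy)
```

Each block of three consecutive rows shares the same LHS parameters and cycles through all `alphaRest` values. The later per-block search over `alphaRest` relies on this ordering.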


In [5]:
# Generate candidate parameter combinations via Latin Hypercube Sampling (LHS)

# Draw N_SAMPLE_LHS samples from the defined parameter ranges
lhs_samples = LatinDesign(PARAM_RANGES).get_samples(N_SAMPLE_LHS)

# Create a linearly spaced vector of alphaRest values
alpha_rest_values = np.linspace(
    PARAM_RANGES_ALPHA[0],
    PARAM_RANGES_ALPHA[1],
    PARAM_STEPS_ALPHA
).reshape(-1, 1)

# Combine LHS samples with alphaRest values:
# - Each LHS sample is repeated for all alphaRest values
# - The alphaRest vector is tiled to match the number of LHS samples
lhs_repeated = np.repeat(lhs_samples, alpha_rest_values.shape[0], axis=0)
alpha_tiled = np.tile(alpha_rest_values, (lhs_samples.shape[0], 1))

# Final candidate matrix: alphaRest + all other parameters
param_candidates = np.hstack((alpha_tiled, lhs_repeated))

# Print dimensional summaries
print(f"LHS samples shape: {lhs_samples.shape}")
print(f"alphaRest range shape: {alpha_rest_values.shape}")
print(f"Combined candidate matrix shape: {param_candidates.shape}")


LHS samples shape: (1000, 8)
alphaRest range shape: (10, 1)
Combined candidate matrix shape: (10000, 9)


## GP

### Load pre-trained GP

In this section, we load the pre-trained Gaussian Process (GP) emulator for maximum dengue incidence.
The model, trained earlier on individual-based simulations, serves as a fast surrogate for exploring the parameter space.

**Note:** Loading the GP from disk requires that the GP is re-trained for a single iteration (`train(1)`) on the test data, which is done internally by the `.load()` function. This step can introduce very small stochastic differences between runs. Therefore, reported RMSE values or predictions may differ slightly between consecutive loads of the same model file.

In [6]:
# === Load the pre-trained Gaussian Process (GP) emulator ===

# Path to the trained model, training data, and test data are defined above.
# The GP emulates the maximum dengue incidence ("imax") based on key model parameters.

myGP = SIR_GP(training_data=PATH_GP_TRAIN, model_type="maxIncidence")  # Initialize GP emulator
myGP.load(filename=PATH_GP_MODEL)                                      # Load pre-trained weights

# Perform a sanity check to verify that the loaded GP performs as expected
rmse = myGP.get_rmse(PATH_GP_TEST)
print(f"Sanity check RMSE on test set: {rmse:.4f}")

Model loaded. Loss: -1.773698091506958
Sanity check RMSE on test set: 0.0421


### Predictions

Here, we use the trained GP model to predict the maximum incidence for each epidemic in the calibration dataset across all sampled parameter sets.

- `param_candidates`: LHS + alphaRest samples (each row = one parameter combination)
- `df_calibration`: calibration epidemics (subset of empirical outbreaks)
- Output: prediction matrix with shape [n_param_candidates × n_epidemics]
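
Before each prediction, `format_data` shifts the `alphaShift` column by the outbreak's start day and wraps the result back into the unit interval. The sketch below reproduces that arithmetic on toy values; the specific numbers, and the reading of `start_day` as a number of days, are assumptions made for illustration.

```python
import numpy as np

alpha_shift_toy = np.array([0.10, 0.50, 0.90])  # toy alphaShift values (fraction of a year)
start_day_toy = 146                              # toy start day; 146 / 365 = 0.4

# Same rule as in format_data: add the start day as a fraction of the year,
# then wrap around the seasonal cycle with modulo 1
shifted_toy = (alpha_shift_toy + start_day_toy / 365.0) % 1.0
print(shifted_toy)
```

The last value illustrates the wrap-around: 0.90 + 0.40 exceeds one full cycle and is mapped back to roughly 0.30.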

In [7]:

predictions = predict_var(
    LHS_sample=param_candidates,
    GP_model=myGP,
    epidemics=df_calibration,
    verbose=True
)

print(f"Prediction matrix shape: {predictions.shape}")



Predicting 10000 parameter sets across 737 epidemics...
 → Processing epidemic 1/737
 → Processing epidemic 26/737
 → Processing epidemic 51/737
 → Processing epidemic 76/737
 → Processing epidemic 101/737
 → Processing epidemic 126/737
 → Processing epidemic 151/737
 → Processing epidemic 176/737
 → Processing epidemic 201/737
 → Processing epidemic 226/737
 → Processing epidemic 251/737
 → Processing epidemic 276/737
 → Processing epidemic 301/737
 → Processing epidemic 326/737
 → Processing epidemic 351/737
 → Processing epidemic 376/737
 → Processing epidemic 401/737
 → Processing epidemic 426/737
 → Processing epidemic 451/737
 → Processing epidemic 476/737
 → Processing epidemic 501/737
 → Processing epidemic 526/737
 → Processing epidemic 551/737
 → Processing epidemic 576/737
 → Processing epidemic 601/737
 → Processing epidemic 626/737
 → Processing epidemic 651/737
 → Processing epidemic 676/737
 → Processing epidemic 701/737
 → Processing epidemic 726/737
Prediction matrix s

## Calibration

### Municipality-Specific RMSE Calculation

This section calculates the Root Mean Squared Error (RMSE) for each candidate parameter set separately for each municipality, allowing us to identify which parameter combinations best reproduce observed outbreaks at the municipal level.

**Key steps:**

1. Iterate over each municipality.
2. Identify indices in the calibration dataset corresponding to the current municipality.
3. Compute RMSE between GP predictions and empirical data for that municipality.
4. Store RMSE values in a matrix of shape `[n_candidates × n_municipalities]`.
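
The RMSE in step 3 is computed row-wise via broadcasting, as in `calculate_rmse`. A minimal sketch on toy numbers (two candidate parameter sets, three outbreaks of one municipality; all values are made up):

```python
import numpy as np

# Toy predictions: 2 candidate parameter sets x 3 outbreaks
pred_toy = np.array([[0.10, 0.20, 0.30],
                     [0.20, 0.20, 0.20]])

# Toy observed maximum incidence for the same 3 outbreaks
obs_toy = np.array([0.10, 0.20, 0.30])

# Reshape observations to (1, n_outbreaks) so they broadcast across candidates,
# then average squared errors over outbreaks (axis=1)
rmse_toy = np.sqrt(np.mean((pred_toy - obs_toy.reshape(1, -1)) ** 2, axis=1))
print(rmse_toy)
```

The first candidate matches the observations exactly and gets an RMSE of zero; the second misses two of three outbreaks by 0.1 and gets an RMSE of about 0.082.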


In [8]:
num_municipalities = len(unique_municipalities)      # Number of municipalities
num_candidates = param_candidates.shape[0]          # Number of parameter sets
rmse_municipality = np.empty((num_candidates, num_municipalities))  # Initialize RMSE matrix

# Compute RMSE for each municipality
for m_idx, municipality in enumerate(unique_municipalities):
    # Indices of outbreaks corresponding to this municipality
    municipality_indices = df_calibration.index[df_calibration['ocha_ID'] == municipality].tolist()
    
    # Calculate RMSE for this municipality across all candidate parameter sets
    rmse_values = calculate_rmse(
        epidemics=df_calibration.iloc[municipality_indices],
        predictions=predictions[:, municipality_indices],
        model_type=myGP.model_type
    )
    
    rmse_municipality[:, m_idx] = rmse_values.flatten()

print(f"RMSE matrix shape: {rmse_municipality.shape}")


RMSE matrix shape: (10000, 173)


### Selecting Optimal `alphaRest` per Municipality


This section identifies the best `alphaRest` value within each block of candidate parameters for each municipality, based on the minimum RMSE.

Key steps:

1. Iterate over municipalities.
2. For each block of parameter candidates that share the same non-`alphaRest` values, find the `alphaRest` value that minimizes RMSE.
3. Store both the optimal `alphaRest` and corresponding minimum RMSE for each LHS sample and municipality.
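
The block-wise search in step 2 can also be written without explicit loops. The sketch below is an alternative formulation, not the notebook's implementation: it reshapes a toy RMSE matrix so that one axis indexes the `alphaRest` steps within each LHS block. It assumes the candidate rows are ordered in contiguous blocks per LHS sample, as produced by `np.repeat` earlier. All sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lhs, n_alpha, n_muni = 4, 3, 2                   # toy sizes
alpha_vals = np.linspace(0.0, 0.03, n_alpha)       # toy alphaRest grid

# Toy RMSE matrix: rows are candidates in blocks of n_alpha per LHS sample
rmse_toy = rng.random((n_lhs * n_alpha, n_muni))

# Axis 0: LHS sample, axis 1: alphaRest step, axis 2: municipality
blocks = rmse_toy.reshape(n_lhs, n_alpha, n_muni)

best_idx = blocks.argmin(axis=1)    # index of best alphaRest, shape (n_lhs, n_muni)
best_alpha = alpha_vals[best_idx]   # corresponding alphaRest values
best_rmse = blocks.min(axis=1)      # corresponding minimum RMSE

print(best_alpha.shape, best_rmse.shape)
```

This should give the same result as the nested loops while avoiding Python-level iteration, which may matter at preprint-scale settings.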


In [9]:

# Initialize arrays to store optimal alphaRest and corresponding RMSE
alpha_municipalities = np.empty((N_SAMPLE_LHS, num_municipalities))
min_rmse_municipalities = np.empty((N_SAMPLE_LHS, num_municipalities))

# Loop over municipalities
for m_idx in range(num_municipalities):
    optimal_alpha = []  # Store best alphaRest per LHS sample
    optimal_rmse = []   # Store corresponding RMSE

    # Loop over each block of parameter candidates with the same LHS sample (different alphaRest values)
    for i in range(0, num_candidates, PARAM_STEPS_ALPHA):
        # Subset RMSE for the current block
        rmse_block = rmse_municipality[i:(i + PARAM_STEPS_ALPHA), m_idx]

        # Find index of minimum RMSE within block
        min_idx = np.argmin(rmse_block)
        min_rmse = rmse_block[min_idx]

        # Store corresponding alphaRest value
        alpha_value = alpha_rest_values[min_idx]
        optimal_alpha.append(alpha_value)
        optimal_rmse.append(min_rmse)

    # Convert lists to arrays and store in final matrices
    alpha_municipalities[:, m_idx] = np.array(optimal_alpha).flatten()
    min_rmse_municipalities[:, m_idx] = np.array(optimal_rmse).flatten()

print(f"Optimal alphaRest matrix shape: {alpha_municipalities.shape}")  # [LHS_samples × municipalities]
print(f"Minimum RMSE matrix shape: {min_rmse_municipalities.shape}")     # [LHS_samples × municipalities]


Optimal alphaRest matrix shape: (1000, 173)
Minimum RMSE matrix shape: (1000, 173)


### Selecting the Globally Best Parameter Set

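This step identifies the single best LHS parameter set across all municipalities. For each LHS sample, the minimum RMSE values (obtained with each municipality's optimal `alphaRest`) are summed over municipalities, and the sample with the lowest sum is selected. The corresponding optimal `alphaRest` values are then collected for each municipality.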


In [10]:

# Sum minimum RMSEs across municipalities for each LHS sample
rmse_sums = np.sum(min_rmse_municipalities, axis=1)  # Shape: [n_LHS_samples]

# Identify the index of the parameter set with the lowest total RMSE
best_fit_id = np.argmin(rmse_sums)

# Display summary
print(f"RMSE sums shape: {rmse_sums.shape}")                   # [n_LHS_samples,]
print(f"LHS candidate matrix shape: {lhs_samples.shape}")  # [n_LHS_samples × n_parameters (excluding alphaRest)]
print("Best-fitting parameter set (excluding alphaRest):")
print(lhs_samples[best_fit_id:(best_fit_id+1), :])

# Collect corresponding optimal alphaRest values for each municipality
alpha_data = {
    'municipality': unique_municipalities,
    'alphaRest': alpha_municipalities[best_fit_id]
}

alpha_df = pd.DataFrame(alpha_data)
print(alpha_df.describe())  # Summary statistics of alphaRest across municipalities

RMSE sums shape: (1000,)
LHS candidate matrix shape: (1000, 8)
Best-fitting parameter set (excluding alphaRest):
[[0.0075  0.3395  4.265   1.102   0.40145 0.5605  6.6145  0.02705]]
        alphaRest
count  173.000000
mean     0.018555
std      0.005594
min      0.010000
25%      0.013333
50%      0.016667
75%      0.020000
max      0.030000


### Selecting the Top Parameter Sets

This section identifies the TOP_X parameter sets with the lowest total RMSE across municipalities, providing a shortlist of best candidates for further analysis or reporting.
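
The ranking relies on `np.argsort`, which returns indices in ascending order of the summed RMSE. A toy sketch, with made-up totals and `top_k` as a hypothetical stand-in for `TOP_X`:

```python
import numpy as np

rmse_sums_toy = np.array([0.9, 0.4, 0.7, 0.5, 0.6])  # toy total RMSE per LHS sample
top_k = 3                                             # toy stand-in for TOP_X

# Ascending sort: the first index is the best-fitting sample
top_idx = np.argsort(rmse_sums_toy)[:top_k]

print(top_idx)
print(rmse_sums_toy[top_idx])
```

The first entry of `top_idx` coincides with `np.argmin`, so rank 1 in the shortlist is the globally best parameter set selected in the previous step.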


In [11]:

# Identify indices of TOP_X LHS samples with lowest RMSE sums
top_indices = np.argsort(rmse_sums)[:TOP_X]  # Indices of top candidates

# Extract corresponding parameter sets (excluding alphaRest)
top_fits = lhs_samples[top_indices, :]  # Shape: [TOP_X × n_parameters]
print(f"Top {TOP_X} parameter sets shape: {top_fits.shape}")

# Convert to DataFrame for easier inspection and summary statistics
top_fits_df = pd.DataFrame(top_fits, columns=PARAM_RANGES.parameter_names)
print(top_fits_df.describe())


Top 250 parameter sets shape: (250, 8)
         alphaAmp  alphaShift  infTicksCount  avgVisitsCount     pVisits  \
count  250.000000  250.000000     250.000000      250.000000  250.000000   
mean     0.390028    0.480280       4.921448        2.705008    0.491529   
std      0.286588    0.292636       0.567150        1.161026    0.245857   
min      0.000500    0.000500       4.001000        1.006000    0.052250   
25%      0.138750    0.240750       4.475000        1.636000    0.281075   
50%      0.332000    0.475000       4.907000        2.638000    0.483350   
75%      0.608250    0.726000       5.313000        3.661000    0.699350   
max      0.993500    0.999500       5.997000        4.990000    0.944150   

       propSocialVisits  locPerSGCount  correctionFactor  
count        250.000000     250.000000        250.000000  
mean           0.538640      11.033064          0.024252  
std            0.290985       5.459492          0.017556  
min            0.001500       1.028500  


### Collecting `alphaRest` for Top Parameter Sets

This section creates a DataFrame containing the `alphaRest` values per municipality for each of the TOP_X best parameter sets, along with their total RMSE and rank.


In [12]:

# Initialize DataFrame to store alphaRest info for TOP_X candidates
alpha_df_topX = pd.DataFrame()

# Loop over TOP_X best candidates
for rank, idx in enumerate(top_indices, start=1):
    alpha_data_iter = {
        'municipality': unique_municipalities,
        'alphaRest': alpha_municipalities[idx],
        'rmse': rmse_sums[idx],
        'rank': rank
    }
    
    # Convert to DataFrame
    alpha_df_iter = pd.DataFrame(alpha_data_iter)
    
    # Concatenate to cumulative DataFrame
    alpha_df_topX = pd.concat([alpha_df_topX, alpha_df_iter], ignore_index=True)

# Display summary statistics
print(alpha_df_topX.describe())


          alphaRest          rmse          rank
count  43250.000000  43250.000000  43250.000000
mean       0.011004      0.704196    125.500000
std        0.009310      0.049112     72.169041
min        0.000000      0.598654      1.000000
25%        0.003333      0.664435     63.000000
50%        0.006667      0.707132    125.500000
75%        0.013333      0.745508    188.000000
max        0.030000      0.777695    250.000000


## GP Predictions

In this section, we generate GP predictions for each municipality using the best-fitting parameters combined with municipality-specific `alphaRest` values.

**Steps:**

1. Loop over municipalities.
2. Extract municipality-specific epidemics from the testing dataset (`df_testing`).
3. Combine the municipality’s `alphaRest` with the best-fitting LHS parameters.
4. Predict maximum incidence using the GP model.
5. Store predictions alongside the corresponding empirical data.



In [13]:

# Initialize DataFrame to store predictions
my_predictions = pd.DataFrame()

# Loop over municipalities
for m in alpha_df['municipality']:
    # Extract epidemics for the current municipality
    df_pred_m = df_testing.loc[df_testing['ocha_ID'] == m].reset_index(drop=True)
    
    # Extract municipality-specific alphaRest
    alpha_m = alpha_df.loc[alpha_df['municipality'] == m, 'alphaRest'].values  # shape (1,)
    
    # Combine alphaRest with best-fitting LHS parameters
    pred_params = np.hstack((alpha_m, lhs_samples[best_fit_id]))  # shape (n_parameters + 1,)
    pred_params = pred_params.reshape(1, -1)  # reshape to (1, n_parameters + 1)
    
    # Predict max incidence for this municipality
    pred_m = predict_var(
        LHS_sample=pred_params,
        epidemics=df_pred_m,
        GP_model=myGP,
        verbose=False
    )
    
    # Combine predictions with original data
    pred_m_df = pd.DataFrame(pred_m.flatten(), columns=['pred'])
    combined_df = pd.concat([df_pred_m, pred_m_df], axis=1)
    
    # Append to cumulative predictions DataFrame
    my_predictions = pd.concat([my_predictions, combined_df], ignore_index=True)

# Display shape of final predictions
print(f"Predictions DataFrame shape: {my_predictions.shape}")

Predictions DataFrame shape: (449, 6)


### Permutation Test for Model Validation

This section evaluates the robustness of GP predictions by **permutation testing**, where the empirical data (i.e., starting day and municipality) are randomly shuffled. For each shuffle, predictions are recalculated, and the Spearman rank correlation between observed and predicted maximum incidences is stored.
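
One common way to summarize such a test is an empirical p-value: the share of permuted correlations that are at least as large as the observed one. The notebook itself only stores and describes the permuted correlations, so the sketch below is an optional add-on under that assumption. The helper name `empirical_p_value` and the toy null distribution are hypothetical.

```python
import numpy as np


def empirical_p_value(observed, permuted):
    """One-sided empirical p-value with the usual +1 correction."""
    permuted = np.asarray(permuted, dtype=float)
    # Count permutations whose statistic is at least as extreme as the observed one
    n_extreme = np.sum(permuted >= observed)
    return (1 + n_extreme) / (1 + permuted.size)


# Toy null distribution of correlations (not the notebook's results)
rng = np.random.default_rng(1)
null_toy = rng.normal(loc=0.0, scale=0.05, size=100)

p_toy = empirical_p_value(0.45, null_toy)
print(p_toy)
```

With the +1 correction, the smallest attainable p-value is 1 / (n_iter + 1), so increasing `N_SHUFFLE` refines the resolution of the test.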


In [14]:
permut_both = shuffle_var(
    emp_df=df_testing,
    f_alpha_df=alpha_df,
    f_candidate_LHS=lhs_samples[best_fit_id],
    n_iter=N_SHUFFLE,
    shuffle_var='both',
    GP_model=myGP
)

0 out of 100 permutations done.
10 out of 100 permutations done.
20 out of 100 permutations done.
30 out of 100 permutations done.
40 out of 100 permutations done.
50 out of 100 permutations done.
60 out of 100 permutations done.
70 out of 100 permutations done.
80 out of 100 permutations done.
90 out of 100 permutations done.



#### Summarizing Permutation Test Results


In [15]:

# Create DataFrame of permutation results
permut_df = pd.DataFrame({
    'shuffle_iteration': range(N_SHUFFLE),
    'both': permut_both
})

# Summary statistics of permutation correlations
print("Permutation test (both) summary:")
print(permut_df['both'].describe())

# Compute Spearman correlation for the observed (unshuffled) predictions
observed_corr = spearmanr(my_predictions['max_incidence'], my_predictions['pred']).statistic
print(f"Observed Spearman correlation: {observed_corr:.4f}")

Permutation test (both) summary:
count    100.000000
mean      -0.005019
std        0.050506
min       -0.142570
25%       -0.041272
50%       -0.009846
75%        0.033469
max        0.156240
Name: both, dtype: float64
Observed Spearman correlation: 0.4473



## Export Results


In [16]:

# 1. GP predictions for the testing epidemics, made with the calibrated parameters
my_predictions.to_csv(
    'gp_predictions_calibration.tsv', 
    index=False, header=True, sep='\t'
)

# 2. Best-fitting alphaRest parameters (per municipality, single best parameter set)
alpha_df.to_csv(
    'best_alphaRest_per_municipality.tsv', 
    index=False, header=True, sep='\t'
)

# 3. Top X alphaRest parameters across municipalities (long-form for TOP_X candidates)
alpha_df_topX.to_csv(
    f'top{TOP_X}_alphaRest_per_municipality.tsv', 
    index=False, header=True, sep='\t'
)

# 4. Top X best-fitting LHS parameter sets (excluding alphaRest)
top_fits_df.to_csv(
    f'top{TOP_X}_lhs_parameters.tsv', 
    index=False, header=True, sep='\t'
)

# 5. Permutation test results (Spearman correlations)
permut_df.to_csv(
    'permutation_test_results.tsv', 
    index=False, header=True, sep='\t'
)