## Purging

In time series problems, using a KFold or StratifiedKFold can lead to data leakage if the training observations are temporally close to the validation observations.
This causes the model to indirectly see future information, inflating the metrics.

**Purging** is a validation technique that:

1. Splits the data by *groups* (in this case, dates or time indices).
2. Inserts a time gap between the training and validation sets (`group_gap`) to prevent information leakage.
3. Allows you to limit the maximum size of the training or test groups (`max_train_group_size`, `max_test_group_size`).

**Advantage**: Reproduces a real *out-of-sample* training and validation scenario, preventing the model from being trained with data very close to the test.

In [6]:
import sys
import os
sys.path.append(os.path.abspath("../../scripts"))
from sklearn.model_selection import GridSearchCV, BaseCrossValidator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, brier_score_loss
from sklearn.model_selection import BaseCrossValidator
import numpy as np
import pandas as pd
from model_utils import rolling_test, optimize_threshold
from preprocessing import feature_engineering, analyze_features
import joblib

In [7]:
class PurgedGroupTimeSeriesSplit(BaseCrossValidator):
    def __init__(self, n_splits=5, group_gap=1, max_train_group_size=None, max_test_group_size=None):
        self.n_splits = n_splits
        self.group_gap = group_gap
        self.max_train_group_size = max_train_group_size
        self.max_test_group_size = max_test_group_size

    def split(self, X, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None")

        unique_groups = np.unique(groups)
        n_groups = len(unique_groups)

        if self.n_splits > n_groups:
            raise ValueError("Number of splits must be less than or equal to the number of groups")

        group_test_size = self.max_test_group_size or (n_groups // self.n_splits)
        group_test_starts = range(n_groups - self.n_splits * group_test_size, n_groups, group_test_size)

        for test_start in group_test_starts:
            test_end = test_start + group_test_size
            train_end = test_start - self.group_gap
            train_start = 0 if self.max_train_group_size is None else max(0, train_end - self.max_train_group_size)

            train_groups = unique_groups[train_start:train_end]
            test_groups = unique_groups[test_start:test_end]

            train_indices = np.where(np.isin(groups, train_groups))[0]
            test_indices = np.where(np.isin(groups, test_groups))[0]
            yield train_indices, test_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

In [11]:
X = pd.read_csv('../../data/X.csv', index_col='date', parse_dates=True)
y = pd.read_csv('../../data/y.csv', index_col='date', parse_dates=True)
groups = X.index
pipeline = joblib.load("../../models/final_mlp_pipeline.joblib")
model = pipeline.named_steps['model']

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [12]:
split = 150

X_train, y_train = X.iloc[:split], y.iloc[:split]
X_test, y_test = X.iloc[split:], y.iloc[split:]

In [13]:
preprocessor = ColumnTransformer([
    ("power", PowerTransformer(), X_train.columns),
    ("quantile", QuantileTransformer(output_distribution='normal'), X_train.columns)
], remainder='drop')

In [14]:
pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("model", model)
])

In [15]:
purged_split = PurgedGroupTimeSeriesSplit(
    n_splits=5,
    group_gap=5,
)

In our case:

- **Groups**: Each observation date.
- **Time gap**: 5 steps (`group_gap=5`), which prevents observations close in time to the validation set from entering the training.
- **Number of splits**: 5.
- **Model evaluated**: The final pipeline (`MLPClassifier` + `PowerTransformer` + `QuantileTransformer`), pre-calibrated and with an adjusted threshold.
- **Metrics**: Accuracy, ROC AUC, F1, and Brier Score in train and test.

This allows us to verify whether our usual validation method (`rolling test`) was introducing information leakage. If the metrics are similar, we can conclude that there was no significant leakage, and that is the main purpose of this experiment.

In [None]:
train_scores = {'accuracy': [], 'roc_auc': [], 'f1': [], 'brier': []}
test_scores = {'accuracy': [], 'roc_auc': [], 'f1': [], 'brier': []}


for train_idx, test_idx in purged_split.split(X, y, groups=groups):

    X_train_split, X_test_split = X.iloc[train_idx], X.iloc[test_idx]
    y_train_split, y_test_split = y.iloc[train_idx], y.iloc[test_idx]
    
    pipeline.fit(X_train_split, y_train_split.values.ravel())
    
    y_train_pred = pipeline.predict(X_train_split)
    y_train_proba = pipeline.predict_proba(X_train_split)[:, 1]
    y_test_pred = pipeline.predict(X_test_split)
    y_test_proba = pipeline.predict_proba(X_test_split)[:, 1]
    
    train_scores['accuracy'].append(accuracy_score(y_train_split, y_train_pred))
    train_scores['roc_auc'].append(roc_auc_score(y_train_split, y_train_proba))
    train_scores['f1'].append(f1_score(y_train_split, y_train_pred))
    train_scores['brier'].append(brier_score_loss(y_train_split, y_train_proba))
    
    test_scores['accuracy'].append(accuracy_score(y_test_split, y_test_pred))
    test_scores['roc_auc'].append(roc_auc_score(y_test_split, y_test_proba))
    test_scores['f1'].append(f1_score(y_test_split, y_test_pred))
    test_scores['brier'].append(brier_score_loss(y_test_split, y_test_proba))

for metric in train_scores.keys():
    print(f"Train {metric.capitalize()} Scores:", train_scores[metric])
    print(f"Test {metric.capitalize()} Scores:", test_scores[metric])
    print(f"Average Train {metric.capitalize()} Score:", np.mean(train_scores[metric]))
    print(f"Average Test {metric.capitalize()} Score:", np.mean(test_scores[metric]))



Train Accuracy Scores: [0.5548654244306418, 0.6736842105263158, 0.4397905759162304, 0.47038327526132406, 0.5091383812010444]
Test Accuracy Scores: [0.6354166666666666, 0.375, 0.4791666666666667, 0.46875, 0.5208333333333334]
Average Train Accuracy Score: 0.5295723734671112
Average Test Accuracy Score: 0.4958333333333333
Train Roc_auc Scores: [0.5975056689342404, 0.7793040293040293, 0.39284145805884935, 0.38663809894818857, 0.45904371584699455]
Test Roc_auc Scores: [0.6633928571428571, 0.42569930069930073, 0.4575163398692811, 0.42683456361267913, 0.41521739130434787]
Average Train Roc_auc Score: 0.5230665942184605
Average Test Roc_auc Score: 0.4777320905256932
Train F1 Scores: [0.6861313868613139, 0.7801418439716312, 0.5367965367965368, 0.6122448979591837, 0.6747404844290658]
Test F1 Scores: [0.7445255474452555, 0.5238095238095238, 0.6031746031746031, 0.5321100917431193, 0.684931506849315]
Average Train F1 Score: 0.6580110300035462
Average Test F1 Score: 0.6177102546043634
Train Brier Sc

## Results

Averages in tests with PGTSS:

- **Accuracy**: ~0.496
- **ROC AUC**: ~0.478
- **F1**: ~0.618
- **Brier Score**: ~0.256

Compared to the previous rolling test, the metrics are **very similar**, indicating that our original method was not introducing relevant information leakage.

Furthermore, the fact that PGTSS does not improve the metrics is expected:
- Its goal is to detect leakage, not optimize performance.
- If the metrics had dropped drastically, it would indicate that the rolling test was overestimating performance.

## Conclusion

Using the PurgedGroupTimeSeriesSplit confirms that our original pipeline and validation method (rolling test) were not contaminated by future data.
This gives us confidence that the holdout metrics reflect realistic performance.

**Role in the project**:
- Remains as an internal verification tool.
- Does not replace the rolling test as the primary validation method, but serves as an additional check against data leakage.