<a   href="https://colab.research.google.com/github/N-Nieto/OHBM_SEA-SIG_Educational_Course/blob/master/03_pitfalls/03_02_data_leakage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### If you are running in Google Colab, uncomment the cell below to load the data.
### If you are running locally, ignore the cell.

For questions on this notebook contact: n.nieto@fz-juelich.de

In [None]:
# from pathlib import Path
# from urllib.request import urlretrieve
# # Clean files
# import pandas as pd
# import numpy as np

# # Download necessary data files
# Path("data").mkdir(exist_ok=True)

# # 01_basic_ML.ipynb needs these files
# urlretrieve('https://zenodo.org/records/17056022/files/cleaned_VBM_GM_Schaefer100x17_mean_aggregation.csv?download=1', './data/cleaned_VBM_GM_Schaefer100x17_mean_aggregation.csv')
# urlretrieve('https://zenodo.org/records/17056022/files/cleaned_IXI_behavioural.csv?download=1', './data/cleaned_IXI_behavioural.csv')

# # 02_XAI.ipynb also needs this file
# urlretrieve('https://zenodo.org/records/17056022/files/cleaned_VBM_GM_TianxS1x3TxMNI6thgeneration_mean_aggregation.csv?download=1', './data/cleaned_VBM_GM_TianxS1x3TxMNI6thgeneration_mean_aggregation.csv')

# # Load data
# df_behav = pd.read_csv("data/cleaned_IXI_behavioural.csv", index_col=0)

# # Some height values are not sensible, we filter them out
# height = df_behav["HEIGHT"].values
# df_behav = df_behav[np.logical_and(height > 120, height < 200)]

# # Remove NaNs and duplicates
# df_behav.dropna(inplace=True)
# df_behav.drop_duplicates(keep='first', inplace=True)
# df_behav.to_csv('data/cleaned_IXI_behavioural.csv')

# # Remove NaNs
# df_cortical_100 = pd.read_csv("data/cleaned_VBM_GM_Schaefer100x17_mean_aggregation.csv", index_col=0)
# df_cortical_100.dropna(inplace=True)
# df_cortical_100.to_csv('data/cleaned_VBM_GM_Schaefer100x17_mean_aggregation.csv')

# # Remove NaNs
# df_subcortical = pd.read_csv("data/cleaned_VBM_GM_TianxS1x3TxMNI6thgeneration_mean_aggregation.csv", index_col=0)
# df_subcortical.dropna(inplace=True)
# df_subcortical.to_csv('data/cleaned_VBM_GM_TianxS1x3TxMNI6thgeneration_mean_aggregation.csv')

# data_path = Path("data/")


# Leakage exploration

In [None]:
# Import modules
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from pathlib import Path

if 'data_path' not in locals():
    data_path = Path("../data/")

## Load data

In [20]:
# Prepare the data
# Features: Cortical + Subcortical
features = ["cortical", "subcortical"]

# Target: Sex
target = ["SEX_ID (1=m, 2=f)"]
# Confounding variables: none for this example
confounding = []

df_data = pd.read_csv(data_path / "cleaned_IXI_behavioural.csv", index_col=0)
columns_features = []
for feature in features:
    if feature == "cortical":
        df_feature = pd.read_csv(
            data_path / "cleaned_VBM_GM_Schaefer100x17_mean_aggregation.csv",
            index_col=0,
        )
    elif feature == "subcortical":
        df_feature = pd.read_csv(
            data_path
            / "cleaned_VBM_GM_TianxS1x3TxMNI6thgeneration_mean_aggregation.csv",
            index_col=0,
        )
    else:
        # Stop here: continuing would join a stale or undefined df_feature
        raise ValueError(f"Feature not recognized: {feature}")

    df_data = df_data.join(df_feature, how="inner")
    columns_features = columns_features + df_feature.columns.to_list()

print(f"Final data shape: {df_data.shape}")

y = df_data[target].values.ravel()
if target == ["SEX_ID (1=m, 2=f)"]:
    y = np.where(y == 2, 0, 1)  # Put the classes as 0 and 1

X = df_data.loc[:, columns_features].values  # only brain features

print("X shape")
print(X.shape)


Final data shape: (533, 122)
X shape
(533, 116)


In [21]:
test_size = 0.3
random_state = 42
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, shuffle=True, random_state=random_state
)
# Check size of data
print("X shape", X.shape)
print("y shape", y.shape)
print("X_train shape", X_train.shape)
print("X_test shape", X_test.shape)
print("y_train shape", y_train.shape)
print("y_test shape", y_test.shape)


X shape (533, 116)
y shape (533,)
X_train shape (373, 116)
X_test shape (160, 116)
y_train shape (373,)
y_test shape (160,)


# Leakage example 1:
### Train on whole data:

In [None]:
# Train our model on the whole data (Fig. 2 in Sasse et al., 2025)
model = RandomForestClassifier(max_depth=10, random_state=random_state)
model.fit(X, y)

print("Raw Data - Train accuracy:", accuracy_score(y, model.predict(X)))            # Trained and evaluated on the whole data! Very bad!
print("Raw Data - Test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # The test rows are part of X!

Raw Data - Train accuracy: 1.0
Raw Data - Test accuracy: 1.0


### Correct procedure:

In [28]:
# Train our model on the train set and test on the test set
model = RandomForestClassifier(max_depth=10, random_state=random_state)
model.fit(X_train, y_train)

print("Raw Data - Train accuracy:", accuracy_score(model.predict(X_train), y_train))
print("Raw Data - Test accuracy:", accuracy_score(model.predict(X_test), y_test))

Raw Data - Train accuracy: 1.0
Raw Data - Test accuracy: 0.7375


When the model was trained on the whole dataset, it scored perfectly on both the training and the test data, because the test samples had already been seen during training.
When the model was trained only on the training set, the test accuracy dropped from 1.0 to about 0.74.
The lower value is the honest estimate of how the model performs on unseen data.
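One way to make this concrete is to count how many test rows the model has already seen during training. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the IXI features), and `count_shared_rows` is a hypothetical helper written only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the brain features (assumption: not the IXI data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=0
)


def count_shared_rows(a, b):
    """Hypothetical helper: number of rows of `b` that also appear in `a`."""
    rows_a = {row.tobytes() for row in np.ascontiguousarray(a)}
    return sum(row.tobytes() in rows_a for row in np.ascontiguousarray(b))


# Leaky setup: the model is fitted on all rows, so every test row was seen
leaky_model = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
shared_leaky = count_shared_rows(X_demo, X_te)
acc_leaky = accuracy_score(y_te, leaky_model.predict(X_te))

# Correct setup: the model is fitted on the training rows only
clean_model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
shared_clean = count_shared_rows(X_tr, X_te)
acc_clean = accuracy_score(y_te, clean_model.predict(X_te))

print(f"Test rows seen in training (leaky): {shared_leaky} of {len(X_te)}")
print(f"Test rows seen in training (clean): {shared_clean} of {len(X_te)}")
print(f"Test accuracy (leaky): {acc_leaky:.3f}")
print(f"Test accuracy (clean): {acc_clean:.3f}")
```

In the leaky setup every test row overlaps with the training data, so the test score mostly measures memorization rather than generalization.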

# Leakage example 2:
### Preprocessing (PCA and scaling) on the whole dataset:

In [29]:

# Define reproducible cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Preprocess the whole data (leakage)
n_components = 7
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X.copy(), y)

# Scale the entire dataset (leakage)
scaler = StandardScaler()
X_pca_scaled = scaler.fit_transform(X_pca)

# Evaluate using fixed CV
model = DecisionTreeClassifier(max_depth=10, random_state=random_state)
scores_leakage = []

for train, test in cv.split(X, y):
    model.fit(X_pca_scaled[train, :], y[train])
    pred = model.predict(X_pca_scaled[test, :])
    scores_leakage.append(roc_auc_score(y[test], pred))

### Correct procedure:

In [32]:
# PCA and scaling fitted inside each fold, on the training part only
pca = PCA(n_components=n_components)
scaler = StandardScaler()
model = DecisionTreeClassifier(max_depth=10, random_state=random_state)

scores_no_leakage = []

for train, test in cv.split(X, y):
    X_train = X[train, :]
    y_train = y[train]
    X_test = X[test, :]
    y_test = y[test]

    # Fit PCA on the training fold only, then apply it to the test fold
    X_train = pca.fit_transform(X_train, y_train)
    X_test = pca.transform(X_test)

    # Fit the scaler on the training fold only, then apply it to the test fold (no leakage)
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Fit ML model
    model.fit(X_train, y_train)

    pred = model.predict(X_test)
    scores_no_leakage.append(roc_auc_score(y_test, pred))

In [34]:
# ========== Compare Results ==========
# The scores come from roc_auc_score on hard predictions, so they are AUC values
results_df = pd.DataFrame(
    {
        "Fold": np.arange(1, 6),
        "AUC with Leakage": scores_leakage,
        "AUC without Leakage": scores_no_leakage,
        "Difference": np.array(scores_leakage) - np.array(scores_no_leakage),
    }
)

print(results_df)
print("\nMean AUC with Leakage (%): ", round(np.mean(scores_leakage)*100, 4))
print("Mean AUC without Leakage (%): ", round(np.mean(scores_no_leakage)*100, 4))
print("Mean Difference (AUC units): ", round(np.mean(results_df["Difference"]), 4))

   Fold  AUC with Leakage  AUC without Leakage  Difference
0     1          0.734397             0.635816    0.098582
1     2          0.702482             0.559929    0.142553
2     3          0.682733             0.644244    0.038489
3     4          0.622070             0.687883   -0.065813
4     5          0.674901             0.643166    0.031735

Mean AUC with Leakage (%):  68.3317
Mean AUC without Leakage (%):  63.4208
Mean Difference (AUC units):  0.0491


The approach with leakage yielded a higher score than the correct approach in four of the five folds and on average (68.3% vs. 63.4%). Although the effect of leakage is modest in this example, it can be much more severe in larger and more complex datasets.
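A common way to keep preprocessing inside each fold without writing the loop by hand is scikit-learn's `Pipeline`, which refits every step on the training part of each split. The following is a minimal sketch on synthetic data (an assumption, so that it runs without the IXI files). The step names `pca`, `scaler`, and `tree` are illustrative, and `scoring="roc_auc"` scores predicted probabilities, unlike the manual loop in this notebook, which passes hard labels to `roc_auc_score`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: not the IXI data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=10, random_state=0
)

# All preprocessing lives inside the pipeline, so it is refitted per fold
pipe = Pipeline(
    [
        ("pca", PCA(n_components=7)),
        ("scaler", StandardScaler()),
        ("tree", DecisionTreeClassifier(max_depth=10, random_state=42)),
    ]
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# cross_val_score clones the pipeline and fits it on each training fold only
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")

print("Per-fold ROC AUC:", scores.round(3))
print("Mean ROC AUC:", round(scores.mean(), 3))
```

To try this on the notebook data, replace `X_demo` and `y_demo` with `X` and `y`.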

It is important to note that the results and the size of the leakage effect can change with the model, the random seeds, the sample size, the number and distribution of features, whether the signal is linear or non-linear, the amount of noise, etc.
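To illustrate how much the setup matters, the sketch below swaps PCA for supervised feature selection (`SelectKBest`, which is not used elsewhere in this notebook) and runs it on pure-noise synthetic data with many more features than samples. Everything here is an assumption made for illustration: the data sizes, the seeds, the classifier, and the hypothetical helper `mean_cv_accuracy`. Because the labels are unrelated to the features, any score clearly above chance in the leaky variant comes from leakage alone.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the features carry no information about the balanced labels
rng = np.random.default_rng(0)
n_samples, n_features, k = 100, 2000, 20
X_noise = rng.normal(size=(n_samples, n_features))
y_noise = np.repeat([0, 1], n_samples // 2)


def mean_cv_accuracy(seed):
    """Hypothetical helper: mean CV accuracy with and without leakage."""
    cv_seed = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Leaky: features are selected using all rows, including future test rows
    X_selected = SelectKBest(f_classif, k=k).fit_transform(X_noise, y_noise)
    leaky = cross_val_score(
        LogisticRegression(max_iter=1000), X_selected, y_noise, cv=cv_seed
    ).mean()

    # Correct: selection is part of the pipeline and refitted in each fold
    honest_pipe = make_pipeline(
        SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000)
    )
    honest = cross_val_score(honest_pipe, X_noise, y_noise, cv=cv_seed).mean()
    return leaky, honest


# Repeat with a few CV seeds to see how stable the gap is
results = {seed: mean_cv_accuracy(seed) for seed in (0, 1, 2)}
for seed, (leaky, honest) in results.items():
    print(f"seed={seed}: leaky={leaky:.3f}, no leakage={honest:.3f}")
```

Changing `n_features`, `k`, or the classifier in this sketch is a quick way to see how the gap between the two variants grows or shrinks.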

# To do!
Let's try another model, a different preprocessing step, or different random seeds to see how the effect of data leakage changes.