# Notebook 05: Cross-Validation & Leakage

## The Validation Trap

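Data leakage is the silent killer of model insights. When training and test sets share information they shouldn't, validation metrics look better than the model's real performance on new data. GroupKFold and TimeSeriesSplit guard against two common causes, group leakage and temporal leakage, so that the scores we interpret reflect how the model actually generalizes.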

---

## What is Data Leakage?

Data leakage occurs when the model is trained or evaluated with information it would not have when making real predictions. Three common forms:

- **Temporal leakage**: Using future data to predict the past
- **Group leakage**: Same entity appears in both train and test
- **Target leakage**: Using features that encode the target or that wouldn't be available at prediction time
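
To make the target-leakage case concrete, here is a minimal sketch on synthetic data. Everything in it is an illustrative assumption rather than part of this notebook's dataset: the feature matrix, the noise levels, and the hypothetical `leaky` column, which is a near-copy of the target.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 200

# Three ordinary features; only the first one carries signal about y.
X_clean = rng.normal(size=(n, 3))
y = X_clean[:, 0] + rng.normal(size=n)

# Hypothetical leaky feature: almost a copy of the target, i.e. something
# that would not exist yet at prediction time.
leaky = y + rng.normal(scale=0.01, size=n)
X_leaky = np.column_stack([X_clean, leaky])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
r2_clean = cross_val_score(Ridge(alpha=1.0), X_clean, y, cv=cv, scoring="r2").mean()
r2_leaky = cross_val_score(Ridge(alpha=1.0), X_leaky, y, cv=cv, scoring="r2").mean()

print(f"R^2 without leaky feature: {r2_clean:.3f}")
print(f"R^2 with leaky feature:    {r2_leaky:.3f}")
```

No choice of cross-validation scheme catches this kind of leakage, because the problem sits in the feature itself. A score that looks too good to be true is usually the first hint.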

## Cross-Validation Schemes

- **KFold**: Standard k-fold CV (assumes independent samples)
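- **GroupKFold**: Keeps all samples from a group in the same fold, so no group appears in both train and test
- **TimeSeriesSplit**: Trains on earlier samples and tests on later ones, so temporal order is respected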
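
The difference between the three schemes is easiest to see from the indices they produce. The sketch below uses a six-sample toy array and hypothetical group labels `a`, `b`, `c` (both are assumptions chosen for illustration) and prints each scheme's train/test indices.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit

X = np.arange(12).reshape(6, 2)                     # 6 samples, assumed to be in time order
groups = np.array(["a", "a", "b", "b", "c", "c"])   # hypothetical entity IDs

splits = {
    "KFold": list(KFold(n_splits=3).split(X)),
    "GroupKFold": list(GroupKFold(n_splits=3).split(X, groups=groups)),
    "TimeSeriesSplit": list(TimeSeriesSplit(n_splits=3).split(X)),
}

for name, folds in splits.items():
    print(name)
    for train_idx, test_idx in folds:
        print(f"  train={train_idx.tolist()} test={test_idx.tolist()}")
```

In the printed output, each GroupKFold test fold contains exactly one group, and every TimeSeriesSplit training set ends before its test set begins.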

## When to Use Each

- **KFold**: Independent samples, no groups, no time order
- **GroupKFold**: Repeated measurements, player/patient IDs
- **TimeSeriesSplit**: Time-ordered data where the model predicts later observations from earlier ones

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold, GroupKFold, TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

import sys
from pathlib import Path
project_root = Path().resolve().parent if Path().resolve().name == 'notebooks' else Path().resolve()
sys.path.insert(0, str(project_root))

from src.utils import set_seed

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

set_seed(42)
print("✓ Imports successful!")

## Step 1: KFold vs GroupKFold

Demonstrate how GroupKFold prevents group leakage.

In [None]:
# === TODO: Demonstrate KFold vs GroupKFold
# Hints:
#   - Create toy dataset with repeated group IDs
#   - Apply KFold and GroupKFold
#   - Show that GroupKFold keeps groups together
#   - Compare CV scores
# Acceptance: Show GroupKFold gives more conservative estimate
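
One possible way to approach this exercise is sketched below. It is not a reference solution. The toy dataset is an assumption made for illustration: each group gets a random feature "fingerprint" and a random target offset, so a model can score well on groups it has already seen without learning anything that transfers to new groups. The helper `folds_with_shared_groups` is hypothetical and only counts folds where a group appears on both sides of the split.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_groups, per_group, n_features = 30, 8, 60

# Repeated group IDs: 8 measurements for each of 30 entities.
groups = np.repeat(np.arange(n_groups), per_group)

# Each group has its own feature fingerprint and its own target offset.
fingerprints = rng.normal(size=(n_groups, n_features))
offsets = rng.normal(size=n_groups)
X = fingerprints[groups] + rng.normal(scale=0.05, size=(groups.size, n_features))
y = offsets[groups] + rng.normal(scale=0.1, size=groups.size)

model = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
group_kfold = GroupKFold(n_splits=5)


def folds_with_shared_groups(fold_indices):
    """Count folds in which at least one group is in both train and test."""
    return sum(bool(set(groups[tr]) & set(groups[te])) for tr, te in fold_indices)


kfold_leaks = folds_with_shared_groups(kfold.split(X))
group_leaks = folds_with_shared_groups(group_kfold.split(X, y, groups=groups))

kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")
group_scores = cross_val_score(model, X, y, cv=group_kfold, groups=groups, scoring="r2")

print(f"KFold:      folds with shared groups = {kfold_leaks}, mean R^2 = {kfold_scores.mean():.3f}")
print(f"GroupKFold: folds with shared groups = {group_leaks}, mean R^2 = {group_scores.mean():.3f}")
```

On data built this way, the KFold score mostly measures how well the model recognizes groups it already saw. GroupKFold tests only on unseen groups, so its lower score is the more honest estimate of performance on new entities.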

## Step 2: TimeSeriesSplit

Demonstrate temporal cross-validation.

In [None]:
# === TODO: TimeSeriesSplit demo
# Hints:
#   - Create time-ordered dataset
#   - Apply TimeSeriesSplit
#   - Plot fold boundaries
#   - Compute fold metrics
# Acceptance: Plot fold boundaries and compute fold metrics
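
A sketch of how this exercise might be tackled follows. The dataset is a synthetic assumption (a linear trend plus a seasonal term with period 25 and Gaussian noise), and mean absolute error is just one reasonable choice of fold metric. The features are deterministic functions of the time index, so they would be known at prediction time.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
t = np.arange(200)

# Time-ordered toy data: trend + seasonality + noise (illustrative assumption).
season = 2 * np.pi * t / 25
X = np.column_stack([t, np.sin(season), np.cos(season)])
y = 0.02 * t + np.sin(season) + rng.normal(scale=0.2, size=t.size)

tscv = TimeSeriesSplit(n_splits=5)
model = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])

fold_bounds = []  # (train_start, train_end, test_start, test_end) per fold
fold_mae = []
fig, ax = plt.subplots(figsize=(10, 4))

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    fold_bounds.append((train_idx[0], train_idx[-1], test_idx[0], test_idx[-1]))

    # One horizontal bar per fold: the training window, then the test window.
    ax.barh(fold, len(train_idx), left=train_idx[0], color="tab:blue",
            label="train" if fold == 1 else None)
    ax.barh(fold, len(test_idx), left=test_idx[0], color="tab:orange",
            label="test" if fold == 1 else None)

    # Fit on the past only, score on the following block.
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

ax.set_xlabel("Sample index (time order)")
ax.set_ylabel("Fold")
ax.set_title("TimeSeriesSplit fold boundaries")
ax.legend(loc="lower right")
plt.show()

for fold, mae in enumerate(fold_mae, start=1):
    print(f"Fold {fold}: MAE = {mae:.3f}")
```

The training window grows with each fold while the test window always lies strictly after it, which is what keeps future observations out of training.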

## Summary

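A cross-validation scheme that matches the data structure prevents group and temporal leakage: use GroupKFold when entities repeat, TimeSeriesSplit when order matters, and plain KFold only when samples are independent. Target leakage has to be caught separately, by checking that every feature would be available at prediction time.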

**Next**: Notebook 06 is a summary quiz to reinforce learning.