# Lesson 4: Data Leakage Detection & Prevention

**Module 3: Data & Pipeline Engineering** | **Time**: 4-5 hours | **Difficulty**: Intermediate-Advanced

---

## 🎯 Learning Objectives

✅ Understand what data leakage is and why it’s the most insidious ML bug  
✅ Identify the 3 main types of leakage (target, train-test, temporal)  
✅ Build leak-free preprocessing pipelines  
✅ Detect leakage in existing pipelines  
✅ Answer 5 interview questions on data leakage  

---

## 📚 Table of Contents

1. [What Is Data Leakage?](#1-what-is-leakage)
2. [Type 1: Target Leakage](#2-target-leakage)
3. [Type 2: Train-Test Contamination](#3-train-test)
4. [Type 3: Temporal Leakage](#4-temporal)
5. [Building Leak-Free Pipelines](#5-leak-free)
6. [Hands-On: Leaky vs Clean Pipeline](#6-hands-on)
7. [Leakage Detection Checklist](#7-checklist)
8. [Exercises](#8-exercises)
9. [Interview Preparation](#9-interview)

---

## 1. What Is Data Leakage? <a id='1-what-is-leakage'></a>

Data leakage occurs when the training process uses **information that will not be available at prediction time** (the target itself, the test set, or the future), leading to **overly optimistic performance** during development that **doesn’t hold in production**.

### The Danger

```
                 WITH LEAKAGE                    WITHOUT LEAKAGE
              ┌─────────────────┐          ┌─────────────────┐
  Dev:        │  Accuracy: 99%  │          │  Accuracy: 85%  │
              └─────────────────┘          └─────────────────┘
              ┌─────────────────┐          ┌─────────────────┐
  Production: │  Accuracy: 55%  │  💥      │  Accuracy: 83%  │  ✅
              └─────────────────┘          └─────────────────┘
```

### Leakage Flow Diagram

```
  ┌─────────────────────────────────────────────────────┐
  │                 THREE TYPES OF LEAKAGE                  │
  ├────────────────┬────────────────┬───────────────────┤
  │ TARGET LEAKAGE  │ TRAIN-TEST LEAK │ TEMPORAL LEAKAGE    │
  │                │                │                     │
  │ Features that  │ Info from test │ Future info used    │
  │ encode the     │ set leaks into │ to predict past     │
  │ target label   │ training       │ events              │
  └────────────────┴────────────────┴───────────────────┘
```

---

## 2. Type 1: Target Leakage <a id='2-target-leakage'></a>

Target leakage occurs when a **feature encodes the target variable**, typically because it is recorded after the outcome and would NOT be available at prediction time.

### Classic Examples

| Task | Leaky Feature | Why It Leaks |
|------|--------------|---------------|
| Predict hospital readmission | `discharge_notes` | Only exist AFTER discharge decision |
| Predict loan default | `collection_agency_id` | Only assigned AFTER default |
| Predict churn | `cancellation_reason` | Only known AFTER churning |
| Predict click | `purchase_amount` | Purchase happens AFTER click |

### Visual: Timeline of Feature Availability

```
  PREDICTION POINT
        │
  TIME  ▼
  ─────┼──────────────────────────▶
       │
  ✅ Features available      ❌ Features NOT available
  at prediction time:       at prediction time:
  - user_age               - discharge_notes
  - prior_visits           - cancellation_reason
  - account_age            - collection_agency_id
```

### The Heuristic

> **Ask yourself: "Would I have this feature at the time I need to make a prediction?"**  
> If the answer is **NO**, it’s leakage. Remove it.
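
One way to make this heuristic mechanical is to record, for each column, whether its value is known before or after the prediction point, and to drop anything that arrives later. The sketch below assumes a hand-maintained `AVAILABILITY` map and a helper called `drop_unavailable_features`; both names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical metadata: is each column known "before" or "after" the moment
# a prediction must be made? Maintained by hand with domain experts.
AVAILABILITY = {
    "user_age": "before",
    "prior_visits": "before",
    "account_age": "before",
    "cancellation_reason": "after",  # only known once the customer has churned
}


def drop_unavailable_features(df, availability):
    """Keep only columns explicitly marked as known before prediction time."""
    undocumented = [c for c in df.columns if c not in availability]
    if undocumented:
        # Fail loudly: an undocumented column is a leakage risk, not a free feature
        raise ValueError(f"No availability recorded for: {undocumented}")
    keep = [c for c in df.columns if availability[c] == "before"]
    return df[keep]


customers = pd.DataFrame({
    "user_age": [34, 51],
    "prior_visits": [3, 0],
    "account_age": [12, 48],
    "cancellation_reason": ["price", None],
})
safe = drop_unavailable_features(customers, AVAILABILITY)
print(list(safe.columns))
```

Raising on undocumented columns is a deliberate choice in this sketch: a new column has to be classified before it can reach the model.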

---

## 3. Type 2: Train-Test Contamination <a id='3-train-test'></a>

When information from the **test set leaks into the training process**.

### Most Common Cause: Preprocessing Before Splitting

```
  ❌ WRONG (Leaky Pipeline):
  ┌─────────────┐    ┌────────────────┐    ┌───────────┐
  │  ALL DATA   │─▶│ Normalize on  │─▶│  Split    │
  │             │    │ ENTIRE dataset│    │ train/test│
  └─────────────┘    └────────────────┘    └───────────┘
       Test set statistics are embedded in training data! 💥

  ✅ CORRECT (Clean Pipeline):
  ┌─────────────┐    ┌───────────┐    ┌────────────────┐
  │  ALL DATA   │─▶│  Split    │─▶│ Normalize on  │
  │             │    │ train/test│    │ TRAIN only    │
  └─────────────┘    └───────────┘    └────────────────┘
       fit on train, transform both train AND test
```

### Common Train-Test Leaks

| Action | Leaky | Clean |
|--------|-------|-------|
| StandardScaler | `scaler.fit(ALL_DATA)` | `scaler.fit(X_train)` |
| Missing value imputation | Impute on all data | Impute stats from train only |
| Feature selection | Select features on all data | Select features on train only |
| Oversampling (SMOTE) | SMOTE before split | SMOTE on train only |
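
The feature-selection row is easy to underestimate, so here is a minimal sketch of it on pure noise. The data, the choice of `SelectKBest` with `k=20`, and the variable names are assumptions made for illustration. Because the labels are random, any cross-validated accuracy well above 50% can only come from leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise: no feature is predictive
y_noise = rng.integers(0, 2, size=100)  # labels are coin flips

# Leaky: pick the 20 "best" features using ALL rows, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# Clean: selection lives inside the pipeline, so it is refit on each training fold
clean_model = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
clean_acc = cross_val_score(clean_model, X_noise, y_noise, cv=5).mean()

print(f"Leaky CV accuracy: {leaky_acc:.2f}")
print(f"Clean CV accuracy: {clean_acc:.2f}")
```

The only difference between the two runs is where the selector is fit, which is exactly the distinction the table draws.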

---

## 4. Type 3: Temporal Leakage <a id='4-temporal'></a>

Temporal leakage means training or building features with **information from after the prediction point**. It is especially dangerous with time-series data.

### Visual: Temporal Split vs Random Split

```
  ❌ RANDOM SPLIT (leaky for time-series):
  Time: Jan   Feb   Mar   Apr   May   Jun   Jul   Aug
        [T]   [V]   [T]   [T]   [V]   [T]   [V]   [T]
        T=Train, V=Validation
        The model validated on February was trained on March-August data! 💥

  ✅ TEMPORAL SPLIT (correct):
  Time: Jan   Feb   Mar   Apr   May   Jun  | Jul   Aug
        [  T   R   A   I   N   I   N   G  ]|[ T E S T ]
        Past data only ──────────────────▶| Future
```

### Examples of Temporal Leakage

| Scenario | Leak | Fix |
|----------|------|-----|
| Stock prediction | Using tomorrow’s moving average | Use only backward-looking windows |
| Churn prediction | Including next month’s activity | Use only activity recorded before the prediction date |
| Demand forecasting | Including the forecast period’s actual sales as a feature | Use only data up to the prediction point |
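
The moving-average row can be sketched with pandas. The toy `sales` series is made up for illustration; the point is only the difference between a centered window, which includes the next day, and a window shifted so that it covers strictly earlier days.

```python
import pandas as pd

# Toy daily series, invented for this example
sales = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0],
    index=pd.date_range("2023-01-01", periods=7, freq="D"),
)

# Leaky: a centered window averages yesterday, today AND tomorrow
leaky_ma = sales.rolling(window=3, center=True).mean()

# Clean: shift by one day first, so the window covers the 3 days BEFORE each date
past_ma = sales.shift(1).rolling(window=3).mean()

features = pd.DataFrame({"sales": sales, "leaky_ma": leaky_ma, "past_ma": past_ma})
print(features)
```

A quick sanity check for any time-based feature is to overwrite later values and confirm that the feature at an earlier date does not change.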

---

## 5. Building Leak-Free Pipelines <a id='5-leak-free'></a>

### The Solution: `sklearn.pipeline.Pipeline`

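A scikit-learn `Pipeline` bundles preprocessing and the model into one estimator, so every step is **fit only on the data passed to `.fit()`**. As long as you split first and pass only the training set, no test-set statistics can reach the transformers: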

```
  ┌───────────┐    ┌─────────────┐    ┌─────────────┐    ┌────────────┐
  │ Imputer   │─▶│ Scaler      │─▶│ Encoder     │─▶│ Model      │
  │ (fit on   │    │ (fit on     │    │ (fit on     │    │ (fit on    │
  │  train)   │    │  train)     │    │  train)     │    │  train)    │
  └───────────┘    └─────────────┘    └─────────────┘    └────────────┘
  pipeline.fit(X_train, y_train)  →  All steps fit on train ONLY
  pipeline.predict(X_test)        →  All steps transform (no refit)
```
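
The diagram includes an encoder step, which in practice usually means routing numeric and categorical columns through different transformers. The sketch below shows one way to do that with `ColumnTransformer`; the toy DataFrame, its column names, and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n_rows = 400
toy = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, n_rows),
    "age": rng.integers(18, 80, n_rows).astype(float),
    "region": rng.choice(["north", "south", "east"], n_rows),
})
toy.loc[rng.random(n_rows) < 0.1, "income"] = np.nan  # ~10% missing incomes
label = (toy["age"] > 45).astype(int)                 # toy target

# Numeric columns: impute, then scale. Categorical columns: one-hot encode.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Split first; the pipeline then sees only the training rows during fit
X_tr, X_te, y_tr, y_te = train_test_split(toy, label, test_size=0.25, random_state=0)
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)

# The imputer's medians come from the training rows alone
fitted_imputer = (
    model.named_steps["preprocess"].named_transformers_["num"].named_steps["imputer"]
)
print("Train-only medians:", fitted_imputer.statistics_)
print(f"Test accuracy: {score:.3f}")
```

Inspecting `statistics_` after fitting is a cheap way to confirm that the learned parameters match what the training split alone would produce.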

---

## 6. Hands-On: Leaky vs Clean Pipeline <a id='6-hands-on'></a>

In [None]:
# ============================================================
# DEMO: Leaky vs Clean Pipeline
# ============================================================
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Generate synthetic classification data
n = 5000
X = np.random.randn(n, 10)
# Build the target from the complete data, BEFORE adding missing values;
# otherwise NaN > 0 evaluates to False and silently forces those labels to 0
y = (X[:, 0] + X[:, 1] * 0.5 + np.random.randn(n) * 0.3 > 0).astype(int)
# Introduce missing values (~10%)
mask = np.random.random(X.shape) < 0.1
X[mask] = np.nan
X_df = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(10)])
y = pd.Series(y)

print(f"Dataset: {X_df.shape}, Missing values: {X_df.isna().sum().sum()}")
print(f"Class distribution: {dict(y.value_counts())}")

In [None]:
# ============================================================
# ❌ LEAKY PIPELINE: Preprocess BEFORE splitting
# ============================================================
print("="*60)
print("❌ LEAKY PIPELINE")
print("="*60)

# Impute on ALL data (leak!)
imputer_leak = SimpleImputer(strategy='mean')
X_imputed = imputer_leak.fit_transform(X_df)

# Scale on ALL data (leak!)
scaler_leak = StandardScaler()
X_scaled = scaler_leak.fit_transform(X_imputed)

# NOW split (too late! Statistics from test set already embedded)
X_train_leak, X_test_leak, y_train_leak, y_test_leak = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train and evaluate
rf_leak = RandomForestClassifier(n_estimators=100, random_state=42)
rf_leak.fit(X_train_leak, y_train_leak)
y_pred_leak = rf_leak.predict(X_test_leak)

acc_leak = accuracy_score(y_test_leak, y_pred_leak)
f1_leak = f1_score(y_test_leak, y_pred_leak)
print(f"  Accuracy: {acc_leak:.4f}")
print(f"  F1 Score: {f1_leak:.4f}")
print("  (May be optimistic: test-set statistics leaked into preprocessing)")

In [None]:
# ============================================================
# ✅ CLEAN PIPELINE: Split FIRST, then preprocess
# ============================================================
print("="*60)
print("✅ CLEAN PIPELINE (using sklearn Pipeline)")
print("="*60)

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

# Create a pipeline (fit only on train!)
clean_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

# .fit() fits ALL steps on training data only
clean_pipeline.fit(X_train, y_train)

# .predict() transforms test data using train-fitted params
y_pred_clean = clean_pipeline.predict(X_test)

acc_clean = accuracy_score(y_test, y_pred_clean)
f1_clean = f1_score(y_test, y_pred_clean)
print(f"  Accuracy: {acc_clean:.4f}")
print(f"  F1 Score: {f1_clean:.4f}")
print("  (Honest estimate of real-world performance)")

# Compare
print("\n" + "="*60)
print("COMPARISON")
print("="*60)
print(f"{'Metric':<12} {'Leaky':<12} {'Clean':<12} {'Difference':<12}")
print(f"{'Accuracy':<12} {acc_leak:<12.4f} {acc_clean:<12.4f} {acc_leak-acc_clean:+.4f}")
print(f"{'F1 Score':<12} {f1_leak:<12.4f} {f1_clean:<12.4f} {f1_leak-f1_clean:+.4f}")
print("\n>>> On i.i.d. synthetic data with a scale-invariant model, the gap may be tiny.")
print(">>> Either way, only the clean pipeline's score is a trustworthy estimate.")

In [None]:
# ============================================================
# DEMO: Target Leakage Detection
# ============================================================
print("="*60)
print("TARGET LEAKAGE DETECTION")
print("="*60)

# Create a dataset with a leaky feature
n = 5000
df = pd.DataFrame({
    'age': np.random.randint(18, 80, n),
    'income': np.random.normal(50000, 15000, n),
    'credit_score': np.random.randint(300, 850, n),
    'loan_amount': np.random.uniform(1000, 50000, n),
})

# Target: loan default (1 = defaulted)
df['defaulted'] = ((df['credit_score'] < 500) & 
                   (df['loan_amount'] > 20000)).astype(int)

# LEAKY FEATURE: collection_agency_id (only assigned AFTER default)
df['collection_agency_id'] = np.where(
    df['defaulted'] == 1,
    np.random.choice(['AgencyA', 'AgencyB', 'AgencyC'], n),
    'None'
)

# Detection: Check feature-target correlation
print("\nFeature-target correlations (suspicious if too high):")
for col in ['age', 'income', 'credit_score', 'loan_amount', 'collection_agency_id']:
    if pd.api.types.is_numeric_dtype(df[col]):
        corr = df[col].corr(df['defaulted'])
    else:
        # For categorical: correlate a "no agency assigned" indicator with the target
        is_none = (df[col] == 'None').astype(int)
        corr = is_none.corr(df['defaulted'])
    flag = ' 🚩 LEAK!' if abs(corr) > 0.9 else ''
    print(f"  {col:<25} corr={corr:+.3f}{flag}")

print("\n>>> collection_agency_id has near-perfect correlation with target!")
print(">>> This feature would NOT exist at prediction time. Remove it.")
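
Correlation only measures linear association, so a complementary check is to see how well each feature predicts the target on its own. The sketch below is one possible screening approach, not a standard API: the helper `single_feature_auc`, the 0.99 threshold, and the `days_in_collections` column are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_loans = 2000
loans = pd.DataFrame({
    "credit_score": rng.integers(300, 850, n_loans),
    "loan_amount": rng.uniform(1_000, 50_000, n_loans),
})
target = ((loans["credit_score"] < 500) & (loans["loan_amount"] > 20_000)).astype(int)
# Hypothetical leaky column: non-zero only for loans that already defaulted
loans["days_in_collections"] = np.where(target == 1, rng.integers(1, 90, n_loans), 0)


def single_feature_auc(frame, y, cv=5):
    """Cross-validated ROC AUC of a shallow tree trained on each column alone."""
    scores = {}
    for col in frame.columns:
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        scores[col] = cross_val_score(
            tree, frame[[col]], y, cv=cv, scoring="roc_auc"
        ).mean()
    return pd.Series(scores).sort_values(ascending=False)


auc = single_feature_auc(loans, target)
suspects = auc[auc > 0.99].index.tolist()  # threshold is an arbitrary choice
print(auc.round(3))
print("Investigate:", suspects)
```

A flagged feature is a prompt to ask when that value becomes available, not proof of leakage by itself.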

In [None]:
# ============================================================
# DEMO: Temporal Leakage - Walk-Forward Validation
# ============================================================
print("="*60)
print("TEMPORAL SPLIT: Walk-Forward Validation")
print("="*60)

# Simulate time-series data
dates = pd.date_range('2023-01-01', periods=365, freq='D')
ts_df = pd.DataFrame({
    'date': dates,
    'feature_1': np.random.randn(365).cumsum(),
    'feature_2': np.random.randn(365),
    'target': (np.random.randn(365).cumsum() > 0).astype(int)
})

# Temporal split: first 80% train, last 20% test
split_idx = int(len(ts_df) * 0.8)
train_ts = ts_df[:split_idx]
test_ts = ts_df[split_idx:]

print(f"Train period: {train_ts['date'].min().date()} to {train_ts['date'].max().date()}")
print(f"Test period:  {test_ts['date'].min().date()} to {test_ts['date'].max().date()}")
print(f"\nTrain: {len(train_ts)} days | Test: {len(test_ts)} days")
print("\n>>> No future data leaks into training!")

# Walk-forward validation
print("\nWalk-Forward Cross-Validation Folds:")
print("  Fold 1: Train [Jan-Mar] → Test [Apr]")
print("  Fold 2: Train [Jan-Apr] → Test [May]")
print("  Fold 3: Train [Jan-May] → Test [Jun]")
print("  ...")
print("\n>>> Each fold only uses past data to train!")
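
The folds printed above are illustrative text. A minimal sketch of generating real walk-forward folds with scikit-learn's `TimeSeriesSplit` follows; the 365-row index and `n_splits=3` are assumptions chosen to mirror the simulated series.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 365                   # same length as the simulated series above
day_index = np.arange(n_days)  # rows must already be sorted by date

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(day_index))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {fold}: train days {train_idx[0]}-{train_idx[-1]} "
        f"-> test days {test_idx[0]}-{test_idx[-1]}"
    )
```

Each training window starts at day 0 and grows, and every test window lies strictly after its training window.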

## 7. Leakage Detection Checklist <a id='7-checklist'></a>

```
  ┌───────────────────────────────────────────────┐
  │       LEAKAGE DETECTION CHECKLIST              │
  ├───────────────────────────────────────────────┤
  │ ☐ Suspiciously high accuracy (>95%)?            │
  │ ☐ Any feature with >0.9 target correlation?      │
  │ ☐ Preprocessing done BEFORE train/test split?    │
  │ ☐ Using future info for time-series data?        │
  │ ☐ Test performance >> cross-validation?          │
  │ ☐ SMOTE/oversampling applied before splitting?   │
  │ ☐ Any feature NOT available at prediction time?  │
  └───────────────────────────────────────────────┘
```

---

## 8. Exercises <a id='8-exercises'></a>

### Exercise 1: Find the Leak
Given a medical dataset with columns `[age, blood_pressure, cholesterol, diagnosis_code, treatment_outcome]` where the task is to predict `treatment_outcome` — which feature is likely leaky and why?

### Exercise 2: Fix the Pipeline
Refactor this leaky code using `sklearn.pipeline.Pipeline`:
```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Leaky!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
model.fit(X_train, y_train)
```

### Exercise 3: Cross-Validation Without Leakage
Implement k-fold cross-validation that correctly handles scaling inside each fold using `cross_val_score` with a Pipeline.

---

## 9. Interview Preparation <a id='9-interview'></a>

### Q1: "What is data leakage? Give me a real example."

**Answer:**  
"Data leakage is when information that wouldn’t be available at prediction time is used during training. Example: predicting hospital readmission using discharge notes — those notes only exist AFTER the discharge decision, so they encode the outcome.

The telltale sign is **suspiciously high training/validation accuracy that drops in production**."

---

### Q2: "How do you prevent train-test contamination in preprocessing?"

**Answer:**  
"Use `sklearn.pipeline.Pipeline`. It ensures all preprocessing steps (imputing, scaling, encoding) are **fit only on training data**. When you call `pipeline.predict(X_test)`, it applies the train-fitted transformers without re-fitting.

Key rules: split FIRST, then preprocess. Never call `.fit()` on test data. For cross-validation, use `cross_val_score` with a Pipeline."

---

### Q3: "How would you detect leakage in an existing model?"

**Answer:**  
"Red flags: (1) Test accuracy suspiciously close to 100%, (2) Large gap between CV and production performance, (3) Feature has near-perfect correlation with target. Detection: examine feature importance — if one feature dominates, check if it’s available at prediction time. Also verify preprocessing happens after splitting."

---

### Q4: "How do you split time-series data without leakage?"

**Answer:**  
"Never use random splits for time-series. Use **temporal splitting**: train on past data, test on future data. For cross-validation, use **walk-forward** (expanding window): train on [Jan-Mar], test on [Apr]; train on [Jan-Apr], test on [May]; etc. `sklearn.model_selection.TimeSeriesSplit` implements this expanding-window scheme."

---

### Q5: "You built a model with 98% accuracy but it performs at 60% in production. What happened?"

**Answer:**  
"Most likely data leakage, though distribution shift can produce the same symptom. I’d investigate:
1. **Target leakage**: Is any feature derived from the target?
2. **Train-test contamination**: Was preprocessing done before splitting?
3. **Temporal leakage**: Was future data used to build features?
4. **Distribution shift**: Has the production data distribution changed?

First fix: rebuild with a Pipeline, verify no leaky features, add post-load validation. Then compare dev vs production metrics."

---

## 🎓 Key Takeaways

1. **Leakage is the most insidious ML bug** — it makes your model look great in dev, then fail in production
2. **Three types**: Target leakage, train-test contamination, temporal leakage
3. **Always split FIRST**, then preprocess
4. **Use `sklearn.pipeline.Pipeline`** so preprocessing is fit on training data only
5. **For time-series**: temporal splits only, never random
6. **Suspect leakage** whenever accuracy is too good to be true

---

➡️ **Next Lesson**: [Lesson 5: Feature Stores with Feast](./lesson_05_feature_stores_feast.ipynb)