# Lesson 4: Data Leakage

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Advanced

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Identify the different types of Data Leakage  
✅ Understand why simple random splits fail for time-series data  
✅ Detect leaks by analyzing feature importance  
✅ Answer the classic "My model failed in production" interview question  

---

## 📚 Table of Contents

1. [What is Data Leakage?](#1-what-is-data-leakage)
2. [Type 1: Target Leakage](#2-type-1-target-leakage)
3. [Type 2: Train-Test Contamination](#3-type-2-train-test-contamination)
4. [Hands-On: Detecting Leakage](#4-hands-on-detecting-leakage)
5. [Interview Preparation](#5-interview-preparation)

---

## 1. What is Data Leakage?

Data leakage happens when your training data contains information that will **not** be available at prediction time.

**Symptom**: Suspiciously strong validation/test metrics (e.g., 99.9% AUC), followed by terrible performance in production.

> "If it looks too good to be true, it's probably data leakage."

## 2. Type 1: Target Leakage

This occurs when you include a feature that is a **proxy for the target** or whose value is only known **after** the target event.

### Example: Predicting Churn
- **Target**: `churned_next_month` (1/0)
- **Leaky Feature**: `cancellation_reason`
  
**Why**: A user only has a cancellation reason **after** they churn. The model sees `cancellation_reason="Price"` and instantly knows they churned. In production, an active user won't have this value yet.
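
One quick screen for this kind of leak is to check whether a feature's mere presence lines up with the label. The sketch below uses a made-up five-row table (the `tenure_months` column and all values are hypothetical, invented for illustration) and measures how often "has a cancellation reason" agrees with the target.

```python
import pandas as pd

# Toy data, invented for illustration only.
customers = pd.DataFrame({
    "tenure_months": [3, 24, 12, 1, 36],
    "cancellation_reason": ["Price", None, None, "Support", None],
    "churned_next_month": [1, 0, 0, 1, 0],
})

# Is the feature populated exactly when the target is 1?
has_reason = customers["cancellation_reason"].notna().astype(int)
agreement = (has_reason == customers["churned_next_month"]).mean()
print(f"Share of rows where 'has a reason' equals the label: {agreement:.2f}")

# A feature that only exists after the target event should be dropped before training.
features = customers.drop(columns=["cancellation_reason", "churned_next_month"])
print(list(features.columns))
```

Perfect or near-perfect agreement like this is a strong hint that the column is filled in after the outcome is already known.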

## 3. Type 2: Train-Test Contamination

### Temporal Leakage (Time-Travel)
Using future data to predict the past.

**Example**: Predicting Stock Price.
- **Random Split**: Train on Jan 1st and Jan 3rd. Test on Jan 2nd.
- **Leak**: The model implicitly learns the trend from Jan 3rd (future) and uses it to predict Jan 2nd (past).
- **Fix**: Always split chronologically, so every training row precedes every test row (Train: Jan-Feb, Test: March).
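
A chronological split can be as simple as filtering on a cutoff date. The following is a minimal sketch on a hypothetical daily price series (a random walk generated here purely for illustration; the column names `date` and `close` are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical daily price series for Jan-Mar 2024 (random walk, illustration only).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", "2024-03-31", freq="D")
prices = pd.DataFrame({
    "date": dates,
    "close": 100 + rng.normal(0, 1, len(dates)).cumsum(),
})

# Split by a cutoff date instead of sampling rows at random.
cutoff = pd.Timestamp("2024-03-01")
train = prices[prices["date"] < cutoff]
test = prices[prices["date"] >= cutoff]

# Sanity check: every training row precedes every test row.
assert train["date"].max() < test["date"].min()
print(len(train), len(test))
```

The final assertion is cheap to keep in a real pipeline as a guard against accidental shuffling.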

### Preprocessing Leakage
Calculating statistics (mean, variance) on the **entire dataset** before splitting.

**Wrong**:
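```python
from sklearn.model_selection import train_test_split

# The mean is computed on ALL rows, so test-set values leak into the training data
df['age'] = df['age'].fillna(df['age'].mean())
train, test = train_test_split(df, test_size=0.2, random_state=42)
```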

**Right**:
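```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()  # work on independent copies
# Calculate the mean ONLY on train, then reuse it for test
train_mean = train['age'].mean()
train['age'] = train['age'].fillna(train_mean)
test['age'] = test['age'].fillna(train_mean)
```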
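
The same discipline is easy to lose during cross-validation, where there are many train/validation splits instead of one. One common way to keep it, sketched here on synthetic data with artificially injected missing values, is to put the preprocessing steps inside a scikit-learn pipeline so they are refit on each training fold. The choice of `LogisticRegression` and mean imputation is only an assumption for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% of values knocked out (illustration only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputer and scaler live inside the pipeline, next to the model.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the imputation mean and scaling statistics never see the validation fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```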

## 4. Hands-On: Detecting Leakage

We'll introduce a leak intentionally and show how to catch it.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Create Data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
df = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(10)])
df['target'] = y

# 2. INJECT LEAKAGE
# We cheat by adding a new column, leaky_feat, that is just the target plus a little noise
# (pd.np was removed from pandas, so we use NumPy directly)
rng = np.random.default_rng(42)
df['leaky_feat'] = df['target'] + rng.normal(0, 0.1, len(df))

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train Model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

print(f"Test Accuracy: {accuracy_score(y_test, model.predict(X_test)):.4f}")

# 4. DETECT LEAKAGE via Feature Importance
importances = pd.Series(model.feature_importances_, index=X.columns)
print("\nTop Feature Importances:")
print(importances.sort_values(ascending=False).head(5))

print("\nAnalysis: if 'leaky_feat' dominates everything, that is a huge red flag!")
```
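
Feature importance requires training a model first. A complementary check, sketched below as one possible approach rather than a standard recipe, scores each feature on its own with ROC AUC and flags anything that separates the classes almost perfectly. The data is rebuilt here so the block runs by itself, and `SUSPICION_THRESHOLD` is an arbitrary cutoff chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# Rebuild the synthetic data with the same injected leak.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(10)])
rng = np.random.default_rng(42)
df["leaky_feat"] = y + rng.normal(0, 0.1, len(y))

# AUC of each raw feature used as a score on its own.
# max(auc, 1 - auc) makes the check independent of the feature's sign.
single_auc = {}
for col in df.columns:
    auc = roc_auc_score(y, df[col])
    single_auc[col] = max(auc, 1 - auc)

ranked = pd.Series(single_auc).sort_values(ascending=False)
print(ranked.head(3))

SUSPICION_THRESHOLD = 0.99  # arbitrary cutoff chosen for this sketch
suspects = ranked[ranked > SUSPICION_THRESHOLD].index.tolist()
print("Suspicious features:", suspects)
```

A flagged feature is not proof of leakage, only a prompt to ask when and how that column gets populated.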

## 5. Interview Preparation

### Common Questions

#### Q1: "How do you handle categorical encoding in train/test splits?"
**Answer**: "To avoid leakage, I fit the encoder (e.g., OneHot or Target Encoder) **only** on the training set. Then I transform the test set using the learned mapping. If the test set has categories not seen in training, I handle them as 'unknown' rather than fitting on the full dataset."
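
A minimal sketch of that answer with scikit-learn's `OneHotEncoder`, assuming a hypothetical `plan` column whose test set contains a category never seen in training:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "enterprise" appears only in the test set.
train = pd.DataFrame({"plan": ["basic", "pro", "basic", "pro"]})
test = pd.DataFrame({"plan": ["pro", "enterprise"]})

# Fit on the training set only; unseen categories become an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["plan"]])

# Transform the test set with the mapping learned from train.
encoded_test = encoder.transform(test[["plan"]]).toarray()
print(encoder.categories_[0])
print(encoded_test)
```

With `handle_unknown="ignore"`, the unseen `"enterprise"` row is encoded as all zeros instead of raising an error, which mirrors what would happen with a brand-new category in production.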

#### Q2: "Why is Cross-Validation leaky for Time Series?"
**Answer**: "Standard K-Fold ignores time order. Whether or not the folds are shuffled, a validation fold from Feb ends up being predicted by a model trained on Jan and March. Learning from March (future) to predict Feb (past) is classic temporal leakage. I would use `TimeSeriesSplit` instead, which always validates on data that comes after the training window."
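
To make the difference concrete, the sketch below prints the fold indices that `KFold` and `TimeSeriesSplit` produce for 12 observations assumed to be in time order (the toy array is an illustration, not real data).

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations, already sorted by time

print("KFold:")
for train_idx, val_idx in KFold(n_splits=3).split(X):
    # Some training indices come AFTER the validation indices.
    print("  train", train_idx, "-> validate", val_idx)

print("TimeSeriesSplit:")
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in splits:
    # Training indices always come BEFORE the validation indices.
    print("  train", train_idx, "-> validate", val_idx)
```

Note that `TimeSeriesSplit` relies on row order, so the data has to be sorted by timestamp before splitting.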