# Lesson 3: Sampling Strategies

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Handle massive Class Imbalance (e.g., Fraud Detection)  
✅ Implement Stratified Splitting  
✅ Use SMOTE (Synthetic Minority Over-sampling Technique)  
✅ Understand when Upsampling is better than Downsampling  

---

## 📚 Table of Contents

1. [The Imbalance Problem](#1-imbalance)
2. [Strategic Splitting: Stratified Sampling](#2-stratified)
3. [Resampling Techniques: Up vs Down](#3-resampling)
4. [Advanced: SMOTE](#4-smote)
5. [Interview Preparation](#5-interview-questions)

---

## 1. The Imbalance Problem

**Scenario**: 1000 transactions. 995 are Legit (Class 0), 5 are Fraud (Class 1).

**The Trap**: A "dumb" model that always predicts "Legit" gets **99.5% Accuracy** while catching zero fraud cases.
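The arithmetic behind the trap can be checked in a few lines. This is a minimal sketch in plain Python that hard-codes the 995/5 scenario above; the variable names are illustrative only.

```python
# Scenario from above: 995 legit (0) and 5 fraud (1) transactions.
y_true = [0] * 995 + [1] * 5

# A "model" that ignores its input and always predicts legit.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: caught frauds / actual frauds.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")    # Accuracy: 99.5%
print(f"Fraud recall: {recall:.1%}")  # Fraud recall: 0.0%
```

Accuracy looks excellent, yet the recall of 0% shows the model is useless for the task it was built for.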

**The Solution**: We need to change the data distribution during *training* so the model pays attention to the minority class.

## 2. Strategic Splitting: Stratified Sampling

Before resampling, you MUST split your data correctly.

**Random Split Risk**: If you have 5 fraud cases and do an 80/20 random split, you might end up with ALL 5 fraud cases in the Test set. Your train/test sizes are fine, but your training set has ZERO fraud examples.

**Fix**: `stratify=y` ensures the ratio of classes is preserved in both Train and Test sets.
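To see the difference, the sketch below splits a toy version of the 995/5 scenario both ways. It assumes scikit-learn and NumPy are installed; the single placeholder feature column and the chosen seeds are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data matching the scenario: 995 legit rows and 5 fraud rows.
y = np.array([0] * 995 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, one column

# Plain random split: the fraud count in each part depends on the seed.
for seed in range(3):
    _, _, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f"seed={seed} random: train fraud={y_tr.sum()}, test fraud={y_te.sum()}")

# Stratified split: the class ratio is preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(f"stratified: train fraud={y_train.sum()}, test fraud={y_test.sum()}")
```

With 5 fraud cases and an 80/20 split, stratification places 4 of them in Train and 1 in Test regardless of the seed, while the random split gives no such guarantee.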

## 3. Resampling Techniques: Up vs Down

### Downsampling (Undersampling)
- **Action**: Delete random rows from Majority class.
- **Pro**: Faster training.
- **Con**: Loss of information.
- **When to use**: You have HUGE data (millions of rows).

### Upsampling (Oversampling)
- **Action**: Duplicate random rows from Minority class.
- **Pro**: Keeps all information.
- **Con**: Overfitting (Model memorizes duplicates).
- **When to use**: Small dataset.
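Both techniques come down to choosing row indices. The following is a minimal NumPy sketch on made-up data (random features, 995 majority rows, 5 minority rows); it illustrates the idea rather than a production resampler.

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 995 + [1] * 5)
X = rng.normal(size=(len(y), 3))  # made-up features

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Downsampling: keep a random subset of the majority, WITHOUT replacement.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
down_idx = np.concatenate([kept_majority, minority_idx])
X_down, y_down = X[down_idx], y[down_idx]

# Upsampling: draw minority rows WITH replacement until classes match.
drawn_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)
up_idx = np.concatenate([majority_idx, drawn_minority])
X_up, y_up = X[up_idx], y[up_idx]

print(np.bincount(y_down))  # [5 5]
print(np.bincount(y_up))    # [995 995]
```

Downsampling leaves only 10 rows here, which shows why it suits huge datasets, while upsampling repeats the same 5 fraud rows about 199 times each, which is where the overfitting risk comes from.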

## 4. Advanced: SMOTE

**S**ynthetic **M**inority **O**ver-sampling **TE**chnique.

Instead of duplicating, it generates **new synthetic examples** by interpolating between existing minority samples.
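The interpolation step can be sketched by hand: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. The helper `smote_like_sample` below is a hypothetical illustration written for this lesson, not the `imbalanced-learn` implementation.

```python
import numpy as np

def smote_like_sample(X_min, i, k, rng):
    """Create one synthetic point from minority row i (illustrative helper)."""
    x = X_min[i]
    # Distances from x to every minority row; exclude x itself.
    dists = np.linalg.norm(X_min - x, axis=1)
    dists[i] = np.inf
    neighbors = np.argsort(dists)[:k]  # indices of the k nearest neighbors
    j = rng.choice(neighbors)          # pick one neighbor at random
    lam = rng.random()                 # interpolation weight in [0, 1)
    return x + lam * (X_min[j] - x)    # point on the segment x -> neighbor

rng = np.random.default_rng(0)
X_min = rng.normal(size=(6, 2))        # six made-up minority samples
synthetic = np.array(
    [smote_like_sample(X_min, i, k=3, rng=rng) for i in range(len(X_min))]
)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point lies between two real minority samples, the new rows stay inside the region the minority class already occupies instead of being exact copies.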

**Library**: `imbalanced-learn` (`pip install imbalanced-learn`; imported as `imblearn`)

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 1. Generate Imbalanced Data
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    weights=[0.99],  # 99% Majority
    random_state=42,
)

print(f"Original Class Distribution: {Counter(y)}")

# 2. Stratified Split (CRITICAL STEP)
# Never resample before splitting! You will leak data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # <--- MAGIC KEYWORD
)

print(f"Train Distribution: {Counter(y_train)}")
print(f"Test Distribution: {Counter(y_test)}")

# 3. Apply SMOTE (Only on Training Data)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Resampled Train Distribution: {Counter(y_train_resampled)}")
```

## 5. Interview Preparation

### Common Questions

#### Q1: "In fraud detection, accuracy is 99%. Is the model good?"
**Answer**: "Likely not. If fraud is 1% of data, a dummy classifier gets 99%. I would look at Precision, Recall, and F1-score, specifically focusing on Recall (catching as many frauds as possible)."
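A worked example helps make this answer concrete. The sketch below uses scikit-learn's metric functions on hypothetical predictions (the counts are invented for illustration): 10 frauds in 1000 rows, of which the model catches 2, misses 8, and raises 2 false alarms.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 10 frauds (1) followed by 990 legit rows (0).
y_true = [1] * 10 + [0] * 990
# Hypothetical predictions: 2 frauds caught, 8 missed, 2 false alarms.
y_pred = [1] * 2 + [0] * 8 + [1] * 2 + [0] * 988

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy:  {accuracy:.3f}")   # Accuracy:  0.990
print(f"Precision: {precision:.3f}")  # Precision: 0.500
print(f"Recall:    {recall:.3f}")     # Recall:    0.200
print(f"F1:        {f1:.3f}")         # F1:        0.286
```

The model reaches 99% accuracy while missing 8 out of 10 frauds, which is exactly the gap that Recall exposes.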

#### Q2: "When would you use Random Undersampling over SMOTE?"
**Answer**: "If I have massive data (e.g., 100 million negative clicks vs 1 million positive clicks), SMOTE is computationally expensive because it runs k-nearest-neighbor searches over the minority class. Undersampling reduces the data size, making training feasible, and with 1 million positives, simpler methods work fine."

#### Q3: "Should I resample the Test set?"
**Answer**: "**NEVER**. The test set must reflect the REAL production distribution. If you balance the test set, your metrics (Precision in particular) will look far better than the model's real-world performance."
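The size of the distortion can be estimated with simple arithmetic. The sketch below assumes a classifier with a fixed true positive rate of 80% and false positive rate of 5% (both numbers are made up for illustration) and computes its precision on a balanced test set versus one with 1% fraud.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR/FPR at a given positive-class share."""
    true_pos = tpr * prevalence          # share of all rows that are true positives
    false_pos = fpr * (1 - prevalence)   # share of all rows that are false positives
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.80, 0.05  # assumed classifier behavior, for illustration only

balanced = precision_at_prevalence(tpr, fpr, prevalence=0.50)
real = precision_at_prevalence(tpr, fpr, prevalence=0.01)

print(f"Precision on balanced test set: {balanced:.3f}")  # 0.941
print(f"Precision at 1% fraud:          {real:.3f}")      # 0.139
```

The same classifier appears to be right 94% of the time when it flags fraud on a balanced test set, but only about 14% of the time at the real 1% fraud rate.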