# Lab 3: Class Imbalance Project - Handle Imbalanced Data Like a Pro!

## Welcome to the Real World of Imbalanced Data!

Many real-world classification problems have imbalanced classes. Fraud detection, disease diagnosis, spam filtering - they all face this challenge!

### What You'll Build:
A complete **fraud detection system** using highly imbalanced credit card transaction data, applying multiple techniques to handle the imbalance!

### Learning Goals:
- Understand why class imbalance is a problem
- Learn why accuracy is MISLEADING for imbalanced data
- Apply proper evaluation metrics (precision, recall, F1, ROC-AUC, PR-AUC)
- Use undersampling techniques
- Use oversampling techniques (SMOTE, ADASYN)
- Apply class weights
- Compare all rebalancing strategies
- Calibrate model probabilities
- Build a production-ready pipeline

### Don't Panic!
- Read each instruction carefully
- Try the TODO exercises yourself first
- Hints are provided if you get stuck
- Solutions are at the end (but try not to peek!)

**Let's tackle class imbalance!**

## Step 1: Import Libraries

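First, let's import the tools we need, including `imblearn` (the imbalanced-learn package) for resampling. If it is not installed yet, run `pip install imbalanced-learn` before this cell; otherwise the imports below will fail.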

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_auc_score, roc_curve, precision_recall_curve,
    average_precision_score
)
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Imbalanced-learn library for resampling
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Make plots look nice
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries imported successfully!")
print("You're ready to handle imbalanced data!")
print("\n⚠️ Make sure you have installed: pip install imbalanced-learn")

## Step 2: Create Highly Imbalanced Dataset

We'll simulate credit card fraud detection data:
- **99.7% legitimate** transactions
- **0.3% fraudulent** transactions

This extreme imbalance is realistic for fraud detection!

In [None]:
# Create highly imbalanced credit card fraud dataset
np.random.seed(42)
n_transactions = 10000
fraud_ratio = 0.003  # 0.3% fraud rate

# Generate features
n_features = 28
X = np.random.randn(n_transactions, n_features)

# Generate imbalanced target
n_fraud = int(n_transactions * fraud_ratio)
n_legit = n_transactions - n_fraud

y = np.array([0] * n_legit + [1] * n_fraud)

# Make fraudulent transactions distinguishable (but not perfectly)
# Add extra zero-mean noise to the fraud rows: this widens their spread (variance)
# in the first 10 features rather than shifting their mean
fraud_mask = (y == 1)
X[fraud_mask, :5] += np.random.randn(n_fraud, 5) * 3  # Much wider spread in first 5 features
X[fraud_mask, 5:10] += np.random.randn(n_fraud, 5) * 2  # Moderately wider spread in next 5

# Add transaction amounts (inserted below as the first column, 'Amount')
amounts = np.random.exponential(100, n_transactions)
amounts[fraud_mask] = np.random.exponential(300, n_fraud)  # Fraud tends to be larger

# Create DataFrame
feature_names = [f'V{i}' for i in range(1, n_features+1)]
df = pd.DataFrame(X, columns=feature_names)
df.insert(0, 'Amount', amounts)
df['Class'] = y

# Shuffle
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("Dataset created!")
print(f"Total transactions: {len(df):,}")
print(f"Total features: {len(df.columns) - 1}")
print("\nClass distribution:")
print(df['Class'].value_counts())
print(f"\nFraud rate: {df['Class'].mean()*100:.3f}%")
print("\n⚠️ This is EXTREMELY imbalanced - typical for fraud detection!")

## Step 3: Explore Imbalanced Data

### TODO 1: Explore and Visualize Class Imbalance

1. Display class distribution (counts and percentages)
2. Create a bar plot showing the imbalance
3. Calculate the imbalance ratio
4. Show some statistics for each class

💡 **Hint:**
```python
class_counts = df['Class'].value_counts()
class_pct = df['Class'].value_counts(normalize=True) * 100
imbalance_ratio = class_counts[0] / class_counts[1]
```

In [None]:
# TODO 1: YOUR CODE HERE
# Explore and visualize class imbalance



## Step 4: Split Data with Stratification

**CRITICAL:** Always use stratified splitting for imbalanced data!

### TODO 2: Split Data with Stratification

1. Split into train (80%) and test (20%) sets
2. Use `stratify=y` to maintain class distribution
3. Verify that both sets have similar class distributions

💡 **Hint:**
```python
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

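⚠️ **Common Mistake:** Forgetting `stratify` leaves the number of fraud cases in each split to chance. With only a few dozen fraud cases in total, the test set could end up with very few of them (or none)!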
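
To see why this matters, here is a minimal sketch on hypothetical toy labels (10,000 rows, 30 positives, roughly mirroring this lab's setup; `X_demo` and `y_demo` are illustration-only names). It compares the number of positives that land in the test set with and without `stratify`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 10,000 rows with 30 positives (0.3%)
y_demo = np.array([0] * 9970 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Stratified split: the 20% test set receives 20% of the positives
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
strat_count = int(y_te_strat.sum())

# Unstratified splits: the positive count depends on the random seed
plain_counts = []
for seed in range(20):
    _, _, _, y_te_plain = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    plain_counts.append(int(y_te_plain.sum()))

print(f"Stratified test positives: {strat_count}")
print(f"Unstratified test positives over 20 seeds: "
      f"min={min(plain_counts)}, max={max(plain_counts)}")
```

With so few positives, a difference of two or three fraud cases in the test set changes recall noticeably, which is why the stratified split is the safer default here.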

In [None]:
# TODO 2: YOUR CODE HERE
# Split data with stratification



## Step 5: Train Baseline Model (No Rebalancing)

### TODO 3: Train Baseline Model

Train a RandomForestClassifier with default parameters:
1. No rebalancing yet!
2. Fit on training data
3. Make predictions on test set

💡 **Hint:**
```python
clf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
clf_baseline.fit(X_train, y_train)
y_pred_baseline = clf_baseline.predict(X_test)
```

In [None]:
# TODO 3: YOUR CODE HERE
# Train baseline model



## Step 6: Evaluate with Accuracy (THE WRONG METRIC!)

### TODO 4: Calculate Accuracy (and understand why it's misleading)

1. Calculate accuracy on test set
2. Compare with a naive "always predict majority class" baseline
3. Realize why accuracy is useless here!

💡 **Hint:**
```python
accuracy = accuracy_score(y_test, y_pred_baseline)
# Naive baseline: always predict 0 (legitimate)
y_pred_naive = np.zeros_like(y_test)
accuracy_naive = accuracy_score(y_test, y_pred_naive)
```

In [None]:
# TODO 4: YOUR CODE HERE
# Calculate accuracy and understand why it's misleading



✅ **Check:** Did you get ~99.7% accuracy? That number looks impressive but tells you almost nothing!
- A model that predicts "no fraud" for everything also gets ~99.7% accuracy
- But it catches ZERO fraud cases!
- **Accuracy alone cannot tell a useful fraud model from a useless one!**

⚠️ **Never use accuracy alone for imbalanced classification!**
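
The arithmetic behind this can be checked on a small standalone example. The labels below are hypothetical (2,000 transactions with 6 frauds, i.e. 0.3%), not the lab's actual test set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test set: 2,000 transactions, 6 of them fraud (0.3%)
y_true_demo = np.array([0] * 1994 + [1] * 6)

# A "model" that never flags fraud
y_never_fraud = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_never_fraud)  # 1994 correct out of 2000
rec = recall_score(y_true_demo, y_never_fraud)    # 0 of 6 frauds caught

print(f"Accuracy: {acc:.4f}")
print(f"Recall:   {rec:.4f}")
```

Accuracy comes out at 0.997 while recall is 0.0, so the two metrics tell opposite stories about the same predictions.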

## Step 7: Evaluate with Proper Metrics

### TODO 5: Calculate Proper Metrics for Imbalanced Data

Calculate and display:
1. **Confusion Matrix**: See actual TP, FP, TN, FN
2. **Precision**: Of predicted frauds, how many are correct?
3. **Recall**: Of actual frauds, how many did we catch?
4. **F1-Score**: Harmonic mean of precision and recall
5. **ROC-AUC**: Area under ROC curve
6. **Precision-Recall AUC**: Often more informative than ROC-AUC when positives are rare, because it ignores the huge number of true negatives!

Plot:
- Confusion matrix
- ROC curve
- Precision-Recall curve

💡 **Hint:**
```python
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred_baseline))
print(classification_report(y_test, y_pred_baseline, target_names=['Legit', 'Fraud']))

# For ROC and PR curves, need probability predictions
y_pred_proba = clf_baseline.predict_proba(X_test)[:, 1]
```
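
It can help to compute precision, recall, and F1 by hand once. The sketch below uses made-up confusion-matrix counts (they are not results from this lab) and cross-checks the manual formulas against scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical confusion-matrix counts (illustration only)
tn, fp, fn, tp = 1990, 4, 2, 4

precision_manual = tp / (tp + fp)   # of predicted frauds, how many are real
recall_manual = tp / (tp + fn)      # of real frauds, how many were caught
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))

# Rebuild label arrays with the same counts and cross-check with scikit-learn
y_true_cm = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_cm = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(f"Precision: {precision_manual:.4f} vs {precision_score(y_true_cm, y_pred_cm):.4f}")
print(f"Recall:    {recall_manual:.4f} vs {recall_score(y_true_cm, y_pred_cm):.4f}")
print(f"F1:        {f1_manual:.4f} vs {f1_score(y_true_cm, y_pred_cm):.4f}")
```

Note that `tn` never appears in any of the three formulas, which is why these metrics are not inflated by the large legitimate class.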

In [None]:
# TODO 5: YOUR CODE HERE
# Calculate and visualize proper metrics



## Step 8: Apply Random Undersampling

### TODO 6: Apply Random Undersampling

Undersample the majority class to balance the dataset:
1. Use RandomUnderSampler from imblearn
2. Train model on resampled data
3. Evaluate on original test set
4. Compare with baseline

💡 **Hint:**
```python
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

clf_rus = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rus.fit(X_train_rus, y_train_rus)
```

⚠️ **Important:** Always evaluate on the ORIGINAL test set, not resampled!
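
Conceptually, random undersampling with default settings keeps every minority row and draws an equally sized random subset of the majority rows. The following NumPy sketch shows that idea on hypothetical toy data (`X_toy`, `y_toy` are illustration-only names); it is a simplified stand-in, not imblearn's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical training labels: 7,976 legitimate rows and 24 fraud rows
y_toy = np.array([0] * 7976 + [1] * 24)
X_toy = rng.normal(size=(len(y_toy), 3))

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep every minority row; sample the same number of majority rows without replacement
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, kept_majority])

X_under, y_under = X_toy[keep], y_toy[keep]

print("Before:", np.bincount(y_toy))
print("After: ", np.bincount(y_under))
```

The class counts go from 7976 vs 24 to 24 vs 24: the classes are balanced, but more than 99% of the legitimate rows are thrown away, which is the main cost of this technique.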

In [None]:
# TODO 6: YOUR CODE HERE
# Apply random undersampling



## Step 9: Apply Random Oversampling

### TODO 7: Apply Random Oversampling

Oversample the minority class by duplicating samples:
1. Use RandomOverSampler
2. Train and evaluate
3. Compare with undersampling

💡 **Hint:**
```python
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
```

In [None]:
# TODO 7: YOUR CODE HERE
# Apply random oversampling



## Step 10: Apply SMOTE (Synthetic Minority Over-sampling)

### TODO 8: Apply SMOTE

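SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors, instead of duplicating rows: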
1. Use SMOTE from imblearn
2. Visualize a few synthetic samples (optional)
3. Train and evaluate
4. Compare with random oversampling

💡 **Hint:**
```python
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
```
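
The core SMOTE step can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. The code below is a simplified illustration on hypothetical random data (the helper `make_synthetic` is made up for this example), not imblearn's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical minority-class samples (24 points, 2 features)
X_min = rng.normal(size=(24, 2))

k = 5
# Ask for k + 1 neighbours because each point's closest neighbour is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neigh_idx = nn.kneighbors(X_min)

def make_synthetic(i):
    """Return a point on the segment between sample i and a random neighbour."""
    j = rng.choice(neigh_idx[i, 1:])   # column 0 is the point itself
    lam = rng.uniform(0.0, 1.0)        # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i]), j

x_new, j = make_synthetic(0)
print("base point:     ", X_min[0])
print("neighbour:      ", X_min[j])
print("synthetic point:", x_new)
```

Because the new point always lies between two existing minority samples, SMOTE fills in the region the minority class already occupies rather than inventing samples far away from it.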

In [None]:
# TODO 8: YOUR CODE HERE
# Apply SMOTE



## Step 11: Try SMOTE Variants

### TODO 9: Try SMOTE Variants (ADASYN or BorderlineSMOTE)

Try advanced SMOTE variants:
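1. **ADASYN**: Adaptive Synthetic Sampling - generates more synthetic samples for minority points that are surrounded by majority-class neighbors
2. **BorderlineSMOTE**: Generates synthetic samples only from minority points near the class boundary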

Pick one and compare with regular SMOTE!

💡 **Hint:**
```python
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
adasyn = ADASYN(random_state=42)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)
```
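
The "adaptive" part of ADASYN can be illustrated with a short sketch: each minority point gets a weight proportional to the share of majority-class points among its nearest neighbors, and points with higher weights receive more synthetic samples. The data, `k`, and `n_to_generate` below are hypothetical, and this is a simplified view of the weighting idea rather than imblearn's exact procedure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Hypothetical 2-D data: 500 majority points and 20 partly overlapping minority points
X_maj = rng.normal(loc=0.0, size=(500, 2))
X_min = rng.normal(loc=2.0, size=(20, 2))
X_all = np.vstack([X_maj, X_min])
y_all = np.array([0] * 500 + [1] * 20)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
_, idx = nn.kneighbors(X_min)

# Share of majority-class points among each minority point's k neighbours
# (column 0 is the query point itself, so it is skipped)
r = (y_all[idx[:, 1:]] == 0).mean(axis=1)

# Normalise into weights; fall back to uniform weights if no minority point
# has any majority neighbour
weights = r / r.sum() if r.sum() > 0 else np.full(len(r), 1.0 / len(r))

n_to_generate = 480  # synthetic points needed to balance 500 vs 20
per_sample = np.rint(weights * n_to_generate).astype(int)

print("Majority share among neighbours:", np.round(r, 2))
print("Synthetic samples per minority point:", per_sample)
```

Minority points deep inside their own cluster get few or no synthetic neighbors, while points close to the majority class get many. Regular SMOTE, by contrast, treats all minority points equally.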

In [None]:
# TODO 9: YOUR CODE HERE
# Try SMOTE variants



## Step 12: Apply Class Weights (No Resampling!)

### TODO 10: Use Class Weights

Instead of resampling, use class weights to penalize mistakes on the minority class more heavily:
1. Train with `class_weight='balanced'`
2. No resampling needed!
3. Compare with resampling methods

💡 **Hint:**
```python
clf_weighted = RandomForestClassifier(
    n_estimators=100, 
    class_weight='balanced',  # This does the magic!
    random_state=42
)
clf_weighted.fit(X_train, y_train)  # Use original unbalanced data!
```
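
To see what `class_weight='balanced'` works out to, the weights can be computed directly. scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`; the labels below are hypothetical toy values:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 7,976 legitimate rows and 24 fraud rows
y_toy = np.array([0] * 7976 + [1] * 24)
classes = np.unique(y_toy)

# Manual formula: n_samples / (n_classes * count_per_class)
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

# Same computation via scikit-learn's helper
sk_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

for cls, w_manual, w_sk in zip(classes, manual_weights, sk_weights):
    print(f"class {cls}: manual={w_manual:.3f}, sklearn={w_sk:.3f}")
```

Here each fraud row counts roughly 332 times as much as a legitimate row (about 166.7 vs 0.5), so both classes contribute equally to the training objective in total.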

In [None]:
# TODO 10: YOUR CODE HERE
# Apply class weights



## Step 13: Comprehensive Comparison

### TODO 11: Compare All Methods

Create a comprehensive comparison table:
1. Baseline (no rebalancing)
2. Random Undersampling
3. Random Oversampling
4. SMOTE
5. SMOTE variant (ADASYN/Borderline)
6. Class Weights

For each, show:
- Precision
- Recall
- F1-Score
- ROC-AUC
- PR-AUC (Average Precision)

Visualize with grouped bar chart!

💡 **Hint:**
```python
results = {
    'Method': [],
    'Precision': [],
    'Recall': [],
    'F1': [],
    'ROC-AUC': [],
    'PR-AUC': []
}

# For each method:
results['Method'].append('Baseline')
results['Precision'].append(precision_score(y_test, y_pred_baseline))
# ... etc
```
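
One way to avoid repeating the same five metric calls for every method is a small helper that returns one table row. The function name `summarize` and the toy arrays are illustrative choices, not part of the lab's solution:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score
)

def summarize(name, y_true, y_pred, y_score):
    """Return one comparison-table row for a fitted method."""
    return {
        'Method': name,
        # zero_division=0 avoids a warning when no positives are predicted
        'Precision': precision_score(y_true, y_pred, zero_division=0),
        'Recall': recall_score(y_true, y_pred, zero_division=0),
        'F1': f1_score(y_true, y_pred, zero_division=0),
        'ROC-AUC': roc_auc_score(y_true, y_score),
        'PR-AUC': average_precision_score(y_true, y_score),
    }

# Tiny hypothetical example: 4 legitimate and 2 fraud transactions
y_true_toy = np.array([0, 0, 0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.2, 0.6, 0.3, 0.7, 0.4])
y_pred_toy = (y_score_toy >= 0.5).astype(int)

row = summarize('Toy model', y_true_toy, y_pred_toy, y_score_toy)
print(pd.DataFrame([row]).to_string(index=False))
```

In the lab you would call the helper once per method with that method's predictions and probabilities, collect the rows in a list, and pass the list to `pd.DataFrame`.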

In [None]:
# TODO 11: YOUR CODE HERE
# Create comprehensive comparison



## Step 14: Calibrate Best Model

### TODO 12: Calibrate Probability Predictions

Choose your best method and calibrate its probabilities:
1. Select best model from comparison
2. Apply CalibratedClassifierCV
3. Create calibration curve
4. Compare calibrated vs uncalibrated probabilities

💡 **Hint:**
```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

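# Choose best model (e.g., SMOTE)
clf_best = clf_smote  # or whichever performed best

# Calibrate on the ORIGINAL training data, not the resampled data:
# SMOTE changes the fraud rate, so probabilities calibrated on resampled
# data would be skewed toward fraud. With SMOTE inside an imblearn pipeline,
# only the part of each CV fold used for fitting is resampled.
smote_rf = ImbPipeline([
    ('sampler', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
clf_calibrated = CalibratedClassifierCV(smote_rf, method='sigmoid', cv=3)
clf_calibrated.fit(X_train, y_train)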

# Get probabilities
y_proba_uncalib = clf_best.predict_proba(X_test)[:, 1]
y_proba_calib = clf_calibrated.predict_proba(X_test)[:, 1]

# Plot calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_proba_uncalib, n_bins=10
)
```
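
When comparing calibrated and uncalibrated probabilities, keep in mind that ROC-AUC only measures ranking, so it cannot detect poor calibration. The Brier score (mean squared error of the probabilities, lower is better) can. The sketch below uses four made-up predictions to show two probability sets with identical ranking but different calibration:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# Hypothetical labels: three legitimate transactions and one fraud
y_true_cal = np.array([0, 0, 0, 1])

# Same ranking, different probability levels
p_inflated = np.array([0.4, 0.4, 0.4, 0.9])   # overstates fraud risk for legit rows
p_sensible = np.array([0.1, 0.1, 0.1, 0.9])

auc_inflated = roc_auc_score(y_true_cal, p_inflated)
auc_sensible = roc_auc_score(y_true_cal, p_sensible)
brier_inflated = brier_score_loss(y_true_cal, p_inflated)
brier_sensible = brier_score_loss(y_true_cal, p_sensible)

print(f"ROC-AUC: inflated={auc_inflated:.3f}, sensible={auc_sensible:.3f}")
print(f"Brier:   inflated={brier_inflated:.4f}, sensible={brier_sensible:.4f}")
```

Both sets reach a ROC-AUC of 1.0, but the Brier score drops from 0.1225 to 0.01 for the better-calibrated probabilities. Reporting it alongside the calibration curve gives a single number to compare.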

In [None]:
# TODO 12: YOUR CODE HERE
# Calibrate best model



## Step 15: Build Production Pipeline

### TODO 13: Create Production-Ready Pipeline

Build a complete pipeline with:
1. Scaling
2. Resampling (using imblearn Pipeline)
3. Model training

💡 **Hint:**
```python
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler

pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('sampler', SMOTE(random_state=42)),  # imblearn Pipeline supports this!
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)
```
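
The reason resampling belongs inside the pipeline is that it must be applied to training data only. The hand-written cross-validation loop below sketches that behavior under simplifying assumptions: it uses hypothetical toy data, plain random oversampling as a stand-in for SMOTE, and `LogisticRegression` instead of a random forest to keep it fast:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Hypothetical toy data: 1,000 negatives and 30 positives with shifted means
X_cv = np.vstack([rng.normal(0.0, 1.0, (1000, 4)), rng.normal(1.5, 1.0, (30, 4))])
y_cv = np.array([0] * 1000 + [1] * 30)

scores, train_pos_rates, val_pos_rates = [], [], []
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for tr, va in cv.split(X_cv, y_cv):
    X_tr, y_tr = X_cv[tr], y_cv[tr]

    # Oversample the minority class in the TRAINING fold only
    pos = np.flatnonzero(y_tr == 1)
    neg = np.flatnonzero(y_tr == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])

    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

    # Validate on the untouched, still imbalanced fold
    proba = model.predict_proba(X_cv[va])[:, 1]
    scores.append(average_precision_score(y_cv[va], proba))
    train_pos_rates.append(y_tr[idx].mean())
    val_pos_rates.append(y_cv[va].mean())

print("Positive rate in resampled training folds:", np.round(train_pos_rates, 3))
print("Positive rate in validation folds:        ", np.round(val_pos_rates, 3))
print("PR-AUC per fold:                          ", np.round(scores, 3))
```

Resampling the whole dataset before splitting would leak copies (or interpolations) of validation rows into training and inflate the scores. Keeping the sampler as a pipeline step avoids that without writing the loop by hand.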

In [None]:
# TODO 13: YOUR CODE HERE
# Build production pipeline



## Step 16: Final Evaluation and Business Interpretation

### TODO 14: Final Evaluation and Business Discussion

1. Evaluate final pipeline on test set
2. Create comprehensive report with all metrics
3. Discuss business implications:
   - What's the cost of false positives? (blocking legitimate transaction)
   - What's the cost of false negatives? (missing fraud)
   - Which metric matters most for this business?
4. Recommend threshold adjustment if needed

💡 **Hint:**
```python
# Get probability predictions
y_proba_final = pipeline.predict_proba(X_test)[:, 1]

# Try different thresholds
thresholds = [0.3, 0.5, 0.7]
for threshold in thresholds:
    y_pred_threshold = (y_proba_final >= threshold).astype(int)
    # Calculate metrics
```
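
If the business can put rough costs on each error type, the threshold can be chosen by minimizing total cost rather than by eye. The costs, labels, and scores below are entirely hypothetical and only illustrate the mechanics:

```python
import numpy as np

# Hypothetical business costs (assumptions for illustration)
COST_FP = 5.0     # reviewing or blocking a legitimate transaction
COST_FN = 200.0   # missing a fraudulent transaction

# Hypothetical labels and predicted fraud probabilities
y_true_thr = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_score_thr = np.array([0.05, 0.12, 0.15, 0.22, 0.35, 0.41, 0.62, 0.33, 0.45, 0.81])

def total_cost(threshold):
    """Total cost of errors when flagging scores at or above the threshold."""
    y_pred = (y_score_thr >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true_thr == 0)).sum())
    fn = int(((y_pred == 0) & (y_true_thr == 1)).sum())
    return fp * COST_FP + fn * COST_FN

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
costs = [total_cost(t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

for t, c in zip(thresholds, costs):
    print(f"threshold={t:.1f}  cost={c:.0f}")
print(f"Lowest-cost threshold: {best_threshold}")
```

Because a missed fraud is assumed to cost 40 times as much as a false alarm, the lowest-cost threshold here is 0.3, well below the default of 0.5. In practice the threshold should be tuned on a validation set, not on the final test set.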

In [None]:
# TODO 14: YOUR CODE HERE
# Final evaluation and business interpretation



## Congratulations!

### You Did It!

You just:
- ✅ Understood why class imbalance is challenging
- ✅ Learned why accuracy is misleading
- ✅ Applied proper evaluation metrics
- ✅ Used undersampling (RandomUnderSampler)
- ✅ Used oversampling (RandomOverSampler, SMOTE, ADASYN)
- ✅ Applied class weights
- ✅ Compared all rebalancing strategies
- ✅ Calibrated model probabilities
- ✅ Built a production-ready pipeline
- ✅ Discussed business implications

### What You Learned:

**1. The Problem:**
- Imbalanced data is common in the real world (fraud, disease, spam)
- Accuracy alone is MISLEADING - you can get 99%+ by always predicting the majority class!
- Need proper metrics: precision, recall, F1, ROC-AUC, PR-AUC

**2. Evaluation Metrics:**
- **Precision**: Of predicted positives, how many are correct?
- **Recall**: Of actual positives, how many did we find?
- **F1**: Harmonic mean of precision and recall
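- **ROC-AUC**: Threshold-independent measure of ranking quality; can look optimistic when negatives vastly outnumber positives
- **PR-AUC**: Focuses on the positive class, so it is often more informative for very imbalanced data!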

**3. Rebalancing Techniques:**
- **Undersampling**: Fast but loses data
- **Random Oversampling**: Simple but risk of overfitting
- **SMOTE**: Creates synthetic samples by interpolation; often a strong choice, but not guaranteed to win
- **ADASYN**: Adaptive, focuses on hard cases
- **Class Weights**: No resampling needed, often effective

**4. Which Method to Use?**
- **Large dataset**: Try undersampling first (fast)
- **Small dataset**: Use SMOTE or class weights
- **Very imbalanced**: Try SMOTE or class weights (combining both can over-correct, so validate carefully)
- **Production**: Use pipelines for reproducibility

**5. Business Considerations:**
- Cost of false positive vs false negative
- Adjust threshold based on business needs
- Monitor model performance over time
- Consider ensemble of different rebalancing methods

### Key Insights:
- NEVER use accuracy alone for imbalanced data
- Stratified splitting is CRITICAL
- Always evaluate on original (imbalanced) test set
- PR-AUC often better than ROC-AUC for extreme imbalance
- Class weights are underrated - try them first!
- Business context determines which metric matters most

### Best Practices:
1. Always check class distribution first
2. Use stratified train/test split
3. Try multiple rebalancing techniques
4. Use proper metrics (NOT accuracy!)
5. Plot ROC and PR curves
6. Calibrate probabilities for deployment
7. Consider business costs
8. Use pipelines for production

### Next Steps:
- Apply to real imbalanced datasets (Kaggle has many!)
- Try ensemble methods (EasyEnsemble, BalancedBaggingClassifier)
- Experiment with cost-sensitive learning
- Learn about anomaly detection approaches

---

## Extension Exercises (Optional, Harder!)

1. **Cost-Sensitive Learning**: Implement custom loss function with different costs for FP and FN
2. **Ensemble of Samplers**: Combine multiple resampling strategies
3. **Threshold Optimization**: Find optimal threshold using business costs
4. **One-Class Classification**: Try Isolation Forest or One-Class SVM
5. **Advanced Techniques**: Try SMOTE-ENN, SMOTE-Tomek
6. **Real Dataset**: Apply to Kaggle Credit Card Fraud dataset

---

## You're an Imbalanced Data Expert Now!

**You just mastered handling imbalanced classification from scratch!**

**That's AMAZING! Keep balancing those datasets!**

---
## Solutions (Only Look After Trying!)

Here are the solutions to all TODOs. But remember: **you learn by doing, not by copying!**

In [None]:
# SOLUTION TO TODO 1
print("Class Distribution Analysis")
print("="*60)

class_counts = df['Class'].value_counts()
class_pct = df['Class'].value_counts(normalize=True) * 100

print("\nCounts:")
print(class_counts)
print("\nPercentages:")
print(class_pct)

imbalance_ratio = class_counts[0] / class_counts[1]
print(f"\nImbalance Ratio: {imbalance_ratio:.1f}:1")
print(f"(Majority class is {imbalance_ratio:.0f}x larger than minority class)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
class_counts.plot(kind='bar', ax=axes[0], color=['steelblue', 'coral'])
axes[0].set_title('Class Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Class (0=Legit, 1=Fraud)')
axes[0].set_ylabel('Number of Transactions')
axes[0].set_xticklabels(['Legitimate', 'Fraud'], rotation=0)
for i, v in enumerate(class_counts):
    axes[0].text(i, v + 100, f'{v:,}', ha='center', fontweight='bold')

# Pie chart
axes[1].pie(class_counts, labels=['Legitimate', 'Fraud'], autopct='%1.3f%%',
           colors=['steelblue', 'coral'], startangle=90)
axes[1].set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Statistics by class
print("\nStatistics by Class:")
print(df.groupby('Class')['Amount'].describe())

In [None]:
# SOLUTION TO TODO 2
print("Splitting data with stratification...\n")

X = df.drop('Class', axis=1)
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Split completed!")
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

print("\nClass distribution in training set:")
print(y_train.value_counts())
print(f"Fraud rate: {y_train.mean()*100:.3f}%")

print("\nClass distribution in test set:")
print(y_test.value_counts())
print(f"Fraud rate: {y_test.mean()*100:.3f}%")

print("\n✅ Both sets have similar fraud rates - stratification worked!")

In [None]:
# SOLUTION TO TODO 3
print("Training baseline model (no rebalancing)...\n")

clf_baseline = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_baseline.fit(X_train, y_train)
y_pred_baseline = clf_baseline.predict(X_test)

print("✅ Baseline model trained!")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# SOLUTION TO TODO 4
print("Evaluating with ACCURACY (THE WRONG METRIC!)")
print("="*60)

accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"\nModel Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Naive baseline
y_pred_naive = np.zeros_like(y_test)
accuracy_naive = accuracy_score(y_test, y_pred_naive)
print(f"Naive 'Always Predict 0' Accuracy: {accuracy_naive:.4f} ({accuracy_naive*100:.2f}%)")

print("\n⚠️ PROBLEM IDENTIFIED:")
print("   Both models have ~99.7% accuracy!")
print("   But the naive model catches ZERO fraud cases!")
print("   This proves accuracy is USELESS for imbalanced data!")

# Show predictions
print("\nBaseline model predictions:")
print(pd.Series(y_pred_baseline).value_counts())
print(f"\nFraud cases predicted: {y_pred_baseline.sum()}")
print(f"Actual fraud cases: {y_test.sum()}")

In [None]:
# SOLUTION TO TODO 5
print("Evaluating with PROPER METRICS")
print("="*60)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_baseline)
print("\nConfusion Matrix:")
print(cm)
print("\n[TN  FP]")
print("[FN  TP]")

tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives (correct legit): {tn}")
print(f"False Positives (legit flagged as fraud): {fp}")
print(f"False Negatives (fraud missed): {fn}")
print(f"True Positives (fraud caught): {tp}")

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_baseline, target_names=['Legitimate', 'Fraud']))

# Calculate metrics
precision = precision_score(y_test, y_pred_baseline, zero_division=0)  # 0 instead of a warning if no fraud is predicted
recall = recall_score(y_test, y_pred_baseline)
f1 = f1_score(y_test, y_pred_baseline)

print(f"\nKey Metrics:")
print(f"  Precision: {precision:.4f} (of predicted frauds, {precision*100:.1f}% are correct)")
print(f"  Recall: {recall:.4f} (we catch {recall*100:.1f}% of actual frauds)")
print(f"  F1-Score: {f1:.4f} (harmonic mean)")

# ROC and PR curves
y_pred_proba_baseline = clf_baseline.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba_baseline)
pr_auc = average_precision_score(y_test, y_pred_proba_baseline)

print(f"  ROC-AUC: {roc_auc:.4f}")
print(f"  PR-AUC: {pr_auc:.4f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
           xticklabels=['Legit', 'Fraud'], yticklabels=['Legit', 'Fraud'])
axes[0].set_title('Confusion Matrix')
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba_baseline)
axes[1].plot(fpr, tpr, linewidth=2, label=f'ROC (AUC={roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# 3. Precision-Recall Curve
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_pred_proba_baseline)
axes[2].plot(recall_curve, precision_curve, linewidth=2, label=f'PR (AUC={pr_auc:.3f})')
axes[2].axhline(y=y_test.mean(), color='k', linestyle='--', label='Baseline')
axes[2].set_xlabel('Recall')
axes[2].set_ylabel('Precision')
axes[2].set_title('Precision-Recall Curve')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ These metrics actually tell us how well we're catching fraud!")

In [None]:
# SOLUTION TO TODO 6
from imblearn.under_sampling import RandomUnderSampler

print("Applying Random Undersampling...\n")

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print(f"Original training set: {len(X_train)} samples")
print(f"After undersampling: {len(X_train_rus)} samples")
print(f"Data loss: {len(X_train) - len(X_train_rus)} samples ({(1-len(X_train_rus)/len(X_train))*100:.1f}%)")

print("\nClass distribution after undersampling:")
print(pd.Series(y_train_rus).value_counts())

# Train model
clf_rus = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_rus.fit(X_train_rus, y_train_rus)
y_pred_rus = clf_rus.predict(X_test)
y_proba_rus = clf_rus.predict_proba(X_test)[:, 1]

# Evaluate
print("\nPerformance on test set:")
print(f"  Precision: {precision_score(y_test, y_pred_rus):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_rus):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_pred_rus):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_rus):.4f}")

print("\n✅ Undersampling balances classes but loses data!")

In [None]:
# SOLUTION TO TODO 7
from imblearn.over_sampling import RandomOverSampler

print("Applying Random Oversampling...\n")

ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print(f"Original training set: {len(X_train)} samples")
print(f"After oversampling: {len(X_train_ros)} samples")
print(f"Samples added: {len(X_train_ros) - len(X_train)} ({(len(X_train_ros)/len(X_train)-1)*100:.1f}% increase)")

print("\nClass distribution after oversampling:")
print(pd.Series(y_train_ros).value_counts())

# Train model
clf_ros = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_ros.fit(X_train_ros, y_train_ros)
y_pred_ros = clf_ros.predict(X_test)
y_proba_ros = clf_ros.predict_proba(X_test)[:, 1]

# Evaluate
print("\nPerformance on test set:")
print(f"  Precision: {precision_score(y_test, y_pred_ros):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_ros):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_pred_ros):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_ros):.4f}")

print("\n✅ Oversampling duplicates minority class samples!")

In [None]:
# SOLUTION TO TODO 8
from imblearn.over_sampling import SMOTE

print("Applying SMOTE (Synthetic Minority Over-sampling)...\n")

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Original training set: {len(X_train)} samples")
print(f"After SMOTE: {len(X_train_smote)} samples")
print(f"Synthetic samples created: {len(X_train_smote) - len(X_train)}")

print("\nClass distribution after SMOTE:")
print(pd.Series(y_train_smote).value_counts())

# Train model
clf_smote = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = clf_smote.predict(X_test)
y_proba_smote = clf_smote.predict_proba(X_test)[:, 1]

# Evaluate
print("\nPerformance on test set:")
print(f"  Precision: {precision_score(y_test, y_pred_smote):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_smote):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_pred_smote):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_smote):.4f}")

print("\n✅ SMOTE creates synthetic samples instead of duplicating!")
print("   Often better than random oversampling.")

In [None]:
# SOLUTION TO TODO 9
from imblearn.over_sampling import ADASYN

print("Applying ADASYN (Adaptive Synthetic Sampling)...\n")

adasyn = ADASYN(random_state=42)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)

print(f"Original training set: {len(X_train)} samples")
print(f"After ADASYN: {len(X_train_adasyn)} samples")

print("\nClass distribution after ADASYN:")
print(pd.Series(y_train_adasyn).value_counts())

# Train model
clf_adasyn = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf_adasyn.fit(X_train_adasyn, y_train_adasyn)
y_pred_adasyn = clf_adasyn.predict(X_test)
y_proba_adasyn = clf_adasyn.predict_proba(X_test)[:, 1]

# Evaluate
print("\nPerformance on test set:")
print(f"  Precision: {precision_score(y_test, y_pred_adasyn):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_adasyn):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_pred_adasyn):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_adasyn):.4f}")

print("\n✅ ADASYN focuses on hard-to-learn examples!")

In [None]:
# SOLUTION TO TODO 10
print("Applying Class Weights (No Resampling!)...\n")

clf_weighted = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # This handles imbalance!
    random_state=42,
    n_jobs=-1
)

# Train on ORIGINAL unbalanced data
clf_weighted.fit(X_train, y_train)
y_pred_weighted = clf_weighted.predict(X_test)
y_proba_weighted = clf_weighted.predict_proba(X_test)[:, 1]

print("Trained on original training set (no resampling!)")
print(f"Training samples: {len(X_train)}")

# Evaluate
print("\nPerformance on test set:")
print(f"  Precision: {precision_score(y_test, y_pred_weighted):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_weighted):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_pred_weighted):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_weighted):.4f}")

print("\n✅ Class weights handle imbalance internally - no resampling needed!")
print("   Often competitive with resampling methods.")

In [None]:
# SOLUTION TO TODO 11
print("COMPREHENSIVE COMPARISON OF ALL METHODS")
print("="*80)

# Collect all results
methods_data = {
    'Baseline': (y_pred_baseline, y_pred_proba_baseline),  # name as defined in the TODO 5 solution
    'Undersampling': (y_pred_rus, y_proba_rus),
    'Oversampling': (y_pred_ros, y_proba_ros),
    'SMOTE': (y_pred_smote, y_proba_smote),
    'ADASYN': (y_pred_adasyn, y_proba_adasyn),
    'Class Weights': (y_pred_weighted, y_proba_weighted)
}

results = {
    'Method': [],
    'Precision': [],
    'Recall': [],
    'F1': [],
    'ROC-AUC': [],
    'PR-AUC': []
}

for method, (y_pred, y_proba) in methods_data.items():
    results['Method'].append(method)
    results['Precision'].append(precision_score(y_test, y_pred, zero_division=0))
    results['Recall'].append(recall_score(y_test, y_pred))
    results['F1'].append(f1_score(y_test, y_pred))
    results['ROC-AUC'].append(roc_auc_score(y_test, y_proba))
    results['PR-AUC'].append(average_precision_score(y_test, y_proba))

results_df = pd.DataFrame(results)
print("\n", results_df.to_string(index=False))

# Find best method
best_f1_idx = results_df['F1'].idxmax()
best_method = results_df.loc[best_f1_idx, 'Method']
print(f"\n🏆 Best F1-Score: {best_method} ({results_df.loc[best_f1_idx, 'F1']:.4f})")

# Visualize comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

metrics = ['Precision', 'Recall', 'F1', 'ROC-AUC', 'PR-AUC']
colors = plt.cm.Set3(np.linspace(0, 1, len(results_df)))

for idx, metric in enumerate(metrics):
    axes[idx].bar(results_df['Method'], results_df[metric], color=colors)
    axes[idx].set_title(metric, fontsize=12, fontweight='bold')
    axes[idx].set_ylabel('Score')
    plt.setp(axes[idx].get_xticklabels(), rotation=45, ha='right')  # rotate existing labels without resetting ticks
    axes[idx].grid(True, alpha=0.3, axis='y')
    
    # Highlight best
    best_idx = results_df[metric].idxmax()
    axes[idx].bar(results_df.loc[best_idx, 'Method'], results_df.loc[best_idx, metric],
                 color='gold', edgecolor='black', linewidth=2)

# Overall comparison
axes[5].axis('off')
summary_text = f"""Summary:

Best Overall: {best_method}
Best F1: {results_df['F1'].max():.4f}
Best Recall: {results_df['Recall'].max():.4f}
Best Precision: {results_df['Precision'].max():.4f}

Typical patterns (verify against your table):
• Baseline often has high precision, low recall
• Resampling usually raises recall
• SMOTE often balances precision/recall well
• Class weights are often competitive
"""
axes[5].text(0.1, 0.5, summary_text, fontsize=11, verticalalignment='center',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Comparison of Rebalancing Methods', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ Comprehensive comparison complete!")

In [None]:
# SOLUTION TO TODO 12
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

print("Calibrating best model...\n")

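# Choose best model (e.g., SMOTE based on comparison)
best_clf = clf_smote

print("Using SMOTE model for calibration...")

# Calibrate on the ORIGINAL training data rather than the SMOTE-resampled data.
# Resampled data has a ~50% fraud rate, so calibrating on it would inflate the
# predicted fraud probabilities. With SMOTE inside an imblearn pipeline, only the
# part of each CV fold used for fitting is resampled; the calibrator sees real data.
smote_rf = ImbPipeline([
    ('sampler', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])
clf_calibrated = CalibratedClassifierCV(smote_rf, method='sigmoid', cv=3)
clf_calibrated.fit(X_train, y_train)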

# Get probabilities
y_proba_uncalib = best_clf.predict_proba(X_test)[:, 1]
y_proba_calib = clf_calibrated.predict_proba(X_test)[:, 1]

print("✅ Calibration complete!\n")

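# Evaluate calibration
# ROC-AUC only measures ranking; the Brier score (lower is better) also reflects calibration
from sklearn.metrics import brier_score_loss

print("Comparing uncalibrated vs calibrated:")
print(f"  Uncalibrated ROC-AUC: {roc_auc_score(y_test, y_proba_uncalib):.4f}")
print(f"  Calibrated ROC-AUC: {roc_auc_score(y_test, y_proba_calib):.4f}")
print(f"  Uncalibrated Brier score: {brier_score_loss(y_test, y_proba_uncalib):.4f}")
print(f"  Calibrated Brier score: {brier_score_loss(y_test, y_proba_calib):.4f}")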

# Plot calibration curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Calibration curve
fraction_pos_uncalib, mean_pred_uncalib = calibration_curve(
    y_test, y_proba_uncalib, n_bins=10, strategy='uniform'
)
fraction_pos_calib, mean_pred_calib = calibration_curve(
    y_test, y_proba_calib, n_bins=10, strategy='uniform'
)

axes[0].plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
axes[0].plot(mean_pred_uncalib, fraction_pos_uncalib, 's-', label='Uncalibrated', linewidth=2)
axes[0].plot(mean_pred_calib, fraction_pos_calib, 'o-', label='Calibrated', linewidth=2)
axes[0].set_xlabel('Mean Predicted Probability')
axes[0].set_ylabel('Fraction of Positives')
axes[0].set_title('Calibration Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# 2. Probability histograms
axes[1].hist(y_proba_uncalib[y_test==0], bins=50, alpha=0.5, label='Uncalib Legit', color='blue')
axes[1].hist(y_proba_uncalib[y_test==1], bins=50, alpha=0.5, label='Uncalib Fraud', color='red')
axes[1].hist(y_proba_calib[y_test==0], bins=50, alpha=0.3, label='Calib Legit', color='cyan')
axes[1].hist(y_proba_calib[y_test==1], bins=50, alpha=0.3, label='Calib Fraud', color='orange')
axes[1].set_xlabel('Predicted Probability')
axes[1].set_ylabel('Count')
axes[1].set_title('Probability Distributions')
axes[1].legend()
axes[1].set_yscale('log')  # fraud counts are tiny next to legitimate ones

plt.tight_layout()
plt.show()

print("\n✅ Calibration brings predicted probabilities closer to observed fraud rates!")
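
ROC-AUC only measures how well the scores rank fraud above legitimate transactions, so it says little about whether a predicted 0.9 really corresponds to a 90% fraud rate. The sketch below, in plain Python with made-up toy scores, computes a Brier score and uniform-width reliability bins by hand. The helper names `brier_score` and `reliability_bins` are illustrative and not part of scikit-learn; the binning only approximates what `calibration_curve(..., strategy='uniform')` reports.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, fraction_positive, count) per non-empty bin."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, positives, count]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        stats[idx][0] += p
        stats[idx][1] += y
        stats[idx][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in stats if n > 0]


# Toy data (hypothetical): 10 transactions, 2 of them fraud
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
inflated = [0.4] * 8 + [0.9] * 2    # legit transactions scored too high
deflated = [0.05] * 8 + [0.9] * 2   # same ranking, lower legit scores

brier_inflated = brier_score(y_true, inflated)
brier_deflated = brier_score(y_true, deflated)
print(f"Brier score, inflated legit scores: {brier_inflated:.4f}")
print(f"Brier score, lower legit scores:    {brier_deflated:.4f}")

bins = reliability_bins(y_true, inflated, n_bins=5)
for mean_pred, frac_pos, count in bins:
    print(f"  mean predicted={mean_pred:.2f}  "
          f"fraction positive={frac_pos:.2f}  n={count}")
```

Both score lists rank every fraud above every legitimate transaction, so their ROC-AUC is identical, yet the Brier scores differ. That gap is what the calibration curve makes visible.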

In [None]:
# SOLUTION TO TODO 13
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler

print("Building production-ready pipeline...\n")

# Create pipeline
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('sampler', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
])

print("Pipeline structure:")
print(pipeline)

# Train pipeline
print("\nTraining pipeline...")
pipeline.fit(X_train, y_train)

# Evaluate
y_pred_pipeline = pipeline.predict(X_test)
y_proba_pipeline = pipeline.predict_proba(X_test)[:, 1]

print("\n✅ Pipeline trained successfully!")
print("\nPerformance:")
print(f"  Precision: {precision_score(y_test, y_pred_pipeline):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_pipeline):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_pred_pipeline):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_pipeline):.4f}")

print("\n✅ Pipeline is ready for production!")
print("   Benefits:")
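print("   • Consistent preprocessing at training and prediction time")
print("   • Prevents data leakage (scaler and SMOTE are fit on training data only)")
print("   • Easy to deploy as a single object")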
print("   • Can be saved with joblib")
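
The leakage point deserves a concrete picture. When resampling sits inside the pipeline, cross-validation splits first and resamples only the training part of each fold, so the held-out data keeps its real class ratio. Here is a minimal plain-Python sketch of that order of operations. It swaps in random oversampling for SMOTE to stay dependency-free, and `oversample_minority` and `resampled_folds` are hypothetical helper names, not imblearn functions.

```python
import random


def oversample_minority(train_idx, y, rng):
    """Duplicate minority-class indices until both classes are equally frequent."""
    pos = [i for i in train_idx if y[i] == 1]
    neg = [i for i in train_idx if y[i] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:  # nothing to duplicate
        return list(train_idx)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def resampled_folds(y, n_splits=3, seed=42):
    """Yield (balanced_train_idx, untouched_val_idx) for stratified folds."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    for k in range(n_splits):
        # 1) Split first: every n-th index of each class is held out
        val_idx = pos[k::n_splits] + neg[k::n_splits]
        held_out = set(val_idx)
        train_idx = [i for i in pos + neg if i not in held_out]
        # 2) Resample the training part only
        yield oversample_minority(train_idx, y, rng), val_idx


# Toy labels (hypothetical): 30 transactions, 3 of them fraud
y = [0] * 27 + [1] * 3
folds = list(resampled_folds(y))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    train_fraud = sum(y[i] for i in train_idx)
    val_fraud = sum(y[i] for i in val_idx)
    print(f"Fold {fold}: train {train_fraud}/{len(train_idx)} fraud, "
          f"validation {val_fraud}/{len(val_idx)} fraud")
```

Resampling before the split would instead let copies of the same fraud case land on both sides of the split, which inflates validation scores.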

In [None]:
# SOLUTION TO TODO 14
print("FINAL EVALUATION AND BUSINESS INTERPRETATION")
print("="*80)

# Get probability predictions
y_proba_final = pipeline.predict_proba(X_test)[:, 1]

# Try different thresholds
print("\nThreshold Analysis:")
print("-" * 80)
thresholds = [0.3, 0.5, 0.7, 0.9]

threshold_results = []
for threshold in thresholds:
    y_pred_threshold = (y_proba_final >= threshold).astype(int)
    
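    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    cm = confusion_matrix(y_test, y_pred_threshold, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    
    # zero_division=0 avoids warnings when no transaction is flagged as fraud
    prec = precision_score(y_test, y_pred_threshold, zero_division=0)
    rec = recall_score(y_test, y_pred_threshold, zero_division=0)
    f1 = f1_score(y_test, y_pred_threshold, zero_division=0)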
    
    threshold_results.append({
        'Threshold': threshold,
        'TP': tp,
        'FP': fp,
        'FN': fn,
        'TN': tn,
        'Precision': prec,
        'Recall': rec,
        'F1': f1
    })
    
    print(f"\nThreshold = {threshold}")
    print(f"  True Positives (fraud caught): {tp}")
    print(f"  False Positives (legit flagged): {fp}")
    print(f"  False Negatives (fraud missed): {fn}")
    print(f"  Precision: {prec:.4f}, Recall: {rec:.4f}, F1: {f1:.4f}")

threshold_df = pd.DataFrame(threshold_results)

# Visualize threshold impact
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(threshold_df['Threshold'], threshold_df['Precision'], 'o-', label='Precision', linewidth=2)
axes[0].plot(threshold_df['Threshold'], threshold_df['Recall'], 's-', label='Recall', linewidth=2)
axes[0].plot(threshold_df['Threshold'], threshold_df['F1'], '^-', label='F1', linewidth=2)
axes[0].set_xlabel('Threshold')
axes[0].set_ylabel('Score')
axes[0].set_title('Metrics vs Threshold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

x = np.arange(len(thresholds))
width = 0.2
axes[1].bar(x - width, threshold_df['TP'], width, label='True Pos', color='green')
axes[1].bar(x, threshold_df['FP'], width, label='False Pos', color='orange')
axes[1].bar(x + width, threshold_df['FN'], width, label='False Neg', color='red')
axes[1].set_xlabel('Threshold')
axes[1].set_ylabel('Count')
axes[1].set_title('Confusion Matrix Components')
axes[1].set_xticks(x)
axes[1].set_xticklabels(thresholds)
axes[1].legend()

plt.tight_layout()
plt.show()

# Business interpretation
print("\n" + "="*80)
print("BUSINESS INTERPRETATION")
print("="*80)

print("""
💰 Cost Analysis:

False Positive (FP): Legitimate transaction flagged as fraud
  • Customer inconvenience
  • Manual review cost (~$10)
  • Potential customer churn
  Cost: ~$10 - $50 per FP

False Negative (FN): Fraudulent transaction missed
  • Direct financial loss (avg transaction ~$300)
  • Investigation costs
  • Reputation damage
  Cost: ~$300 - $1000 per FN

📊 Recommendations:

1. For Conservative Approach (minimize fraud loss):
   • Use LOWER threshold (0.3)
   • Maximizes recall (catch more fraud)
   • More false positives (more reviews)
   • Best when fraud cost >> review cost

2. For Balanced Approach:
   • Use threshold ~0.5
   • Balance precision and recall
   • F1-score optimization
   • Good starting point

3. For Customer Experience Focus:
   • Use HIGHER threshold (0.7-0.9)
   • Minimize false positives
   • Accept some fraud losses
   • Best when customer satisfaction is critical

🎯 Suggested Strategy:
   Start with threshold = 0.5
   Monitor false positive rate
   Adjust based on business metrics (tune on validation data, not the test set)
   A/B test different thresholds
""")

print("\n✅ Final model is ready for deployment!")
print("\nFinal Performance (threshold=0.5):")
final_pred = (y_proba_final >= 0.5).astype(int)
print(classification_report(y_test, final_pred, target_names=['Legitimate', 'Fraud']))
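
The cost figures in the business interpretation can be turned into a simple rule for comparing thresholds: total cost = FP × cost per FP + FN × cost per FN. The sketch below uses hypothetical confusion counts (not results from this lab) together with the rough per-error costs quoted above; `expected_cost` and `cheapest_threshold` are illustrative helpers.

```python
def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of the errors made at one threshold."""
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(counts, cost_fp, cost_fn):
    """Return (best_threshold, costs_by_threshold) for the given error costs."""
    costs = {
        t: expected_cost(c["FP"], c["FN"], cost_fp, cost_fn)
        for t, c in counts.items()
    }
    return min(costs, key=costs.get), costs


# Hypothetical confusion counts per threshold (NOT results from this lab)
counts = {
    0.3: {"FP": 120, "FN": 5},
    0.5: {"FP": 40, "FN": 10},
    0.7: {"FP": 15, "FN": 18},
    0.9: {"FP": 3, "FN": 30},
}

# Low end of the review cost: $10 per false positive, $300 per missed fraud
best_cheap_review, costs_cheap = cheapest_threshold(counts, cost_fp=10, cost_fn=300)
# High end of the review cost: $50 per false positive, $300 per missed fraud
best_costly_review, costs_costly = cheapest_threshold(counts, cost_fp=50, cost_fn=300)

for t in counts:
    print(f"threshold={t}: ${costs_cheap[t]:,} (FP=$10)  "
          f"${costs_costly[t]:,} (FP=$50)")
print(f"Cheapest threshold with $10 reviews: {best_cheap_review}")
print(f"Cheapest threshold with $50 reviews: {best_costly_review}")
```

With these made-up counts the preferred threshold shifts as reviews get more expensive, which is why the per-error costs should be agreed with the business before fixing a threshold.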