# 03 - Imbalance Handling

Objective: compare several strategies for handling severe class imbalance without leaking information from the validation or test sets into training.

Methods:
- SMOTE on training data only
- Random undersampling (baseline)
- Class weights for algorithms that support them

Resampling is applied to the training set only; the validation and test sets keep their original class distribution.


In [None]:
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.utils.class_weight import compute_class_weight
from pathlib import Path

PROCESSED_DIR = Path('data/processed')
X_train = pd.read_csv(PROCESSED_DIR / 'X_train_scaled.csv')
# Squeeze the single label column into a Series
y_train = pd.read_csv(PROCESSED_DIR / 'y_train.csv').squeeze('columns')

# Class weights (for use in later models).
# `classes` is passed as an ndarray; recent scikit-learn versions reject a plain list.
classes = np.array([0, 1])
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight_dict = {int(c): float(w) for c, w in zip(classes, class_weights)}
class_weight_dict
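
To make the `'balanced'` setting less of a black box, the sketch below reproduces the heuristic by hand. scikit-learn documents it as `n_samples / (n_classes * np.bincount(y))`. The toy label vector is hypothetical and only serves to keep the arithmetic easy to follow.

```python
import numpy as np

# Toy labels (hypothetical): 8 non-fraud, 2 fraud.
y = np.array([0] * 8 + [1] * 2)

counts = np.bincount(y)  # samples per class: [8, 2]
n_samples, n_classes = len(y), len(counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
weights = n_samples / (n_classes * counts)
weight_dict = {int(c): float(w) for c, w in enumerate(weights)}
print(weight_dict)  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so each class contributes the same total weight to the loss (8 × 0.625 = 2 × 2.5 = 5).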

In [None]:
# Inspect imbalance before resampling
class_counts = y_train.value_counts().rename(index={0: 'Non-Fraud', 1: 'Fraud'})
display(class_counts)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid')
fig, ax = plt.subplots(figsize=(6,4))
sns.barplot(x=class_counts.index, y=class_counts.values, ax=ax)
ax.set_ylabel('Count')
ax.set_title('Training Class Distribution (Imbalanced)')
plt.show()


In [None]:
# SMOTE on training data only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Random undersampling for comparison
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

X_train.shape, X_train_smote.shape, X_train_rus.shape

In [None]:
# Save resampled datasets for later models
X_train_smote.to_csv(PROCESSED_DIR / 'X_train_smote.csv', index=False)
y_train_smote.to_csv(PROCESSED_DIR / 'y_train_smote.csv', index=False)
X_train_rus.to_csv(PROCESSED_DIR / 'X_train_rus.csv', index=False)
y_train_rus.to_csv(PROCESSED_DIR / 'y_train_rus.csv', index=False)

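Rationale: SMOTE synthesizes new minority-class samples by interpolating between existing minority samples and their nearest neighbors, which reduces the classifier's bias toward the majority class. Applying it only to the training set keeps synthetic points out of the validation and test sets, so reported metrics reflect the real class distribution rather than an inflated estimate. Random undersampling is a simpler baseline that discards majority-class rows instead. Class weights leave the data unchanged and remain available for algorithms that natively support them.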