# Handling Class Imbalance

**Purpose**: Address the severe class imbalance problem discovered in EDA.

**Problem Summary** (from EDA):
- Forward: 74.2% (7,343 samples) - dominant class
- Left: 16.4% (1,620 samples)
- Right: 9.5% (937 samples) - severe minority
- **Imbalance ratio**: 7.8:1 (Forward:Right)

**Why this matters**: 
- Models will bias toward predicting "Forward" to minimize training error
- Minority classes (especially Right) will be poorly predicted
- Overall accuracy can be high while minority class accuracy is terrible
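
The last point can be made concrete with a small sketch. It uses the class counts from the EDA summary above and a hypothetical "model" that always predicts Forward; the variable names are illustrative and not part of the notebook's pipeline.

```python
import numpy as np

# Class counts taken from the EDA summary above (labels: -1=Left, 0=Forward, 1=Right).
counts = {-1: 1620, 0: 7343, 1: 937}
y_true = np.concatenate([np.full(n, label) for label, n in counts.items()])

# Hypothetical baseline that ignores its input and always predicts Forward (label 0).
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
# Recall per class: fraction of that class's samples that were predicted correctly.
recall = {label: float(np.mean(y_pred[y_true == label] == label)) for label in counts}

print(f"Accuracy: {accuracy:.3f}")
for label, r in recall.items():
    print(f"Recall for label {label:2d}: {r:.2f}")
```

Accuracy lands at the Forward share of the data, while recall for Left and Right is zero, which is why accuracy alone is a misleading metric here.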

**Goal**: Create balanced training sets using multiple strategies, compare them, and save them for use in the modeling notebooks.

## 1. Setup and Load Data

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import json

# For class imbalance handling
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Settings
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

In [None]:
# Load temporal splits (created in EDA)
# Only the TRAINING set is balanced; validation and test sets remain unchanged!
data = np.load('../data/splits_temporal.npz')

X_train = data['X_train']
y_train = data['y_train']
X_val = data['X_val']
y_val = data['y_val']
X_test = data['X_test']
y_test = data['y_test']

print("Loaded temporal splits:")
print(f"Train: {len(X_train)} samples")
print(f"Val:   {len(X_val)} samples")
print(f"Test:  {len(X_test)} samples")

# Label names for visualization
label_names = {-1: 'Left', 0: 'Forward', 1: 'Right'}

## 2. Review Original Imbalance

Let's visualize the class distribution in our training set to see the problem clearly.

In [None]:
# Count classes in training set
train_counts = Counter(y_train)

print("Training Set Distribution:")
print("-" * 50)
for label in [-1, 0, 1]:
    count = train_counts[label]
    percentage = (count / len(y_train)) * 100
    print(f"{label_names[label]:8s}: {count:5d} samples ({percentage:5.1f}%)")
print("-" * 50)

# Calculate imbalance ratios
forward_to_left = train_counts[0] / train_counts[-1]
forward_to_right = train_counts[0] / train_counts[1]
print(f"\nImbalance ratios:")
print(f"  Forward:Left  = {forward_to_left:.1f}:1")
print(f"  Forward:Right = {forward_to_right:.1f}:1  ← Severe!")

In [None]:
# Visualize original distribution
fig, ax = plt.subplots(figsize=(10, 6))

labels = ['Left', 'Forward', 'Right']
counts = [train_counts[-1], train_counts[0], train_counts[1]]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

bars = ax.bar(labels, counts, color=colors, alpha=0.7, edgecolor='black')
ax.set_ylabel('Number of Samples', fontsize=12)
ax.set_title('Original Training Set Distribution', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add count labels on bars
for bar, count in zip(bars, counts):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 50,
            f'{count}\n({count/len(y_train)*100:.1f}%)',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n⚠️ Problem: Right class has only {train_counts[1]:,} samples, while Forward has {train_counts[0]:,}!")
print("Models trained on this will likely ignore Right turns.")

## 3. Method 1: Class Weights

**How it works**: 
- Don't change the dataset
- Instead, change the loss function to penalize mistakes on minority classes more
- Formula: `weight[class] = n_samples / (n_classes × n_samples_in_class)`

**Advantages**:
- Simple and fast
- No synthetic data created
- Works with any model that supports weighted loss

**Disadvantages**:
- Doesn't increase training data for minority classes
- Model still sees fewer minority examples
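
As a worked example of the formula above, the sketch below plugs in the class counts from the EDA summary. These are full-dataset counts, so the weights computed on the training split will differ somewhat; the dictionary names are illustrative only.

```python
# Counts from the EDA summary (full dataset); the training split will differ.
class_counts = {"Left": 1620, "Forward": 7343, "Right": 937}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# weight[class] = n_samples / (n_classes * n_samples_in_class)
weights = {name: n_samples / (n_classes * n) for name, n in class_counts.items()}

for name, w in weights.items():
    print(f"{name:8s}: {w:.3f}")

# The ratio between two weights equals the inverse ratio of the class counts.
ratio = weights["Right"] / weights["Forward"]
print(f"Right/Forward weight ratio: {ratio:.2f}")
```

With these weights, each class contributes the same total weight to the loss (`weight × count` is identical for all three classes), which is what "balanced" means here.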

In [None]:
# Compute class weights using sklearn
# This computes: total_samples / (n_classes × samples_per_class)
classes = np.unique(y_train)
weights_array = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=y_train
)

print("Computed Class Weights:")
print("-" * 40)
for label, weight in zip(classes, weights_array):
    print(f"{label_names[label]:8s} (label={label:2d}): {weight:.3f}")
print("-" * 40)

print("\nInterpretation:")
right_weight = weights_array[np.where(classes == 1)[0][0]]
forward_weight = weights_array[np.where(classes == 0)[0][0]]
print(f"- Right class has weight {right_weight:.2f} (highest)")
print(f"- Forward class has weight {forward_weight:.2f} (lowest)")
print(f"→ Mistakes on Right turns will be penalized ~{right_weight / forward_weight:.1f}× more than mistakes on Forward!")

In [None]:
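# Create weight dictionary for Keras/TensorFlow
# Keras expects labels mapped to 0, 1, 2 (not -1, 0, 1)
# So we map: -1→0, 0→1, 1→2
# (the labels passed to model.fit must use the same mapping, e.g. y_train + 1)
class_weights_dict = {}
for i, label in enumerate([-1, 0, 1]):
    # Cast to a plain Python float so the dict serializes cleanly
    class_weights_dict[i] = float(weights_array[np.where(classes == label)[0][0]])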

print("\nClass weights dictionary (for Keras):")
print(class_weights_dict)
print("\nUsage in Keras:")
print("  model.fit(X, y, class_weight=class_weights_dict, ...)")

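# Save for use in modeling notebooks
# Note: np.save pickles the dict, so reload it with
# np.load('../data/class_weights.npy', allow_pickle=True).item()
np.save('../data/class_weights.npy', class_weights_dict)
print("\n✅ Saved to: data/class_weights.npy")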

## 4. Method 2: SMOTE (Synthetic Minority Over-sampling)

**How it works**:
- Create synthetic samples for minority classes
- For each minority sample, find its k nearest neighbors
- Create new samples by interpolating between the sample and its neighbors
- Formula: `new_sample = sample + random(0,1) × (neighbor - sample)`

**Advantages**:
- Balances dataset by creating more minority samples
- Model sees more varied minority examples
- Often improves minority class performance

**Disadvantages**:
- Creates synthetic (not real) data
- Can create unrealistic samples if applied incorrectly
- Increases training time (more samples)
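
The interpolation step described above can be sketched in plain NumPy. This is a simplified illustration on hypothetical 2D toy data, not the `imblearn` implementation; the helper `smote_sample` is an invented name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples in 2D (hypothetical data, for illustration only).
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
k = 5  # number of nearest neighbors to consider


def smote_sample(X, k, rng):
    """Generate one synthetic point by interpolating toward a random neighbor."""
    i = rng.integers(len(X))
    # Euclidean distances from sample i to every minority sample
    dists = np.linalg.norm(X - X[i], axis=1)
    # Indices of the k nearest neighbors, skipping the sample itself (distance 0)
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    # new_sample = sample + random(0,1) * (neighbor - sample)
    return X[i] + gap * (X[j] - X[i])


synthetic = np.array([smote_sample(minority, k, rng) for _ in range(10)])
print(synthetic.shape)
```

Because every synthetic point lies on a line segment between two real minority samples, it stays inside the region the minority class already occupies. For flattened images, that means a pixel-wise blend of two real frames.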

In [None]:
# SMOTE requires 1D feature vectors, so flatten images
# (N, 64, 64) → (N, 4096), where N is the number of training samples
X_train_flat = X_train.reshape(len(X_train), -1)

print(f"Original shape: {X_train.shape}")
print(f"Flattened shape: {X_train_flat.shape}")

In [None]:
# Apply SMOTE
print("Applying SMOTE...")
smote = SMOTE(random_state=42)
X_train_smote_flat, y_train_smote = smote.fit_resample(X_train_flat, y_train)

# Reshape back to images
X_train_smote = X_train_smote_flat.reshape(-1, 64, 64)

print("\nBefore SMOTE:")
print(Counter(y_train))
print(f"Total: {len(y_train)} samples")

print("\nAfter SMOTE:")
print(Counter(y_train_smote))
print(f"Total: {len(y_train_smote)} samples")

print("\n✅ Result: All classes now have equal representation!")

In [None]:
# Visualize SMOTE-generated samples
# Show a few synthetic samples (those added after original dataset)
original_size = len(y_train)
synthetic_indices = range(original_size, min(original_size + 10, len(y_train_smote)))

fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.flatten()

for i, idx in enumerate(synthetic_indices):
    axes[i].imshow(X_train_smote[idx], cmap='gray')
    axes[i].set_title(f"Synthetic {label_names[y_train_smote[idx]]}", fontsize=10)
    axes[i].axis('off')

plt.suptitle('Example SMOTE-Generated Synthetic Samples', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Note: These are interpolations between real samples.")
print("They should look realistic (not random noise).")

In [None]:
# Save SMOTE-balanced dataset
# IMPORTANT: Keep validation and test sets unchanged!
np.savez('../data/splits_temporal_smote.npz',
         X_train=X_train_smote, y_train=y_train_smote,
         X_val=X_val, y_val=y_val,
         X_test=X_test, y_test=y_test)

print("✅ Saved to: data/splits_temporal_smote.npz")
print(f"   Training samples: {len(X_train_smote)}")
print(f"   Validation samples: {len(X_val)} (unchanged)")
print(f"   Test samples: {len(X_test)} (unchanged)")

## 5. Method 3: Random Undersampling

**How it works**:
- Randomly remove samples from majority class
- Keep all minority samples
- Result: balanced dataset, but smaller

**Advantages**:
- Simple and fast
- No synthetic data created
- Can reduce overfitting on majority class

**Disadvantages**:
- Throws away real data (bad when data is already limited!)
- May lose important majority class patterns
- Likely NOT good for our small dataset
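
A minimal NumPy sketch of the idea, assuming toy labels with roughly the same class proportions as the EDA summary (scaled down to 1,000 samples) and hypothetical random features. `RandomUnderSampler` does the equivalent work in the notebook itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy labels mimicking the EDA class proportions (hypothetical, scaled down).
y = np.array([0] * 742 + [-1] * 164 + [1] * 94)
X = rng.normal(size=(len(y), 4))  # hypothetical features

# Target size per class = size of the smallest class
classes, counts = np.unique(y, return_counts=True)
n_keep = counts.min()

# For each class, keep a random subset of n_keep indices (without replacement)
keep_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
    for c in classes
])
X_under, y_under = X[keep_idx], y[keep_idx]

discarded = 1 - len(y_under) / len(y)
print(f"Kept {len(y_under)} of {len(y)} samples")
print(f"Discarded {discarded:.0%} of the samples")
```

Every class is cut down to the size of the smallest one, so the more skewed the original distribution, the larger the share of real data that is lost.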

In [None]:
# Apply random undersampling
print("Applying Random Undersampling...")
rus = RandomUnderSampler(random_state=42)
X_train_under_flat, y_train_under = rus.fit_resample(X_train_flat, y_train)

# Reshape back
X_train_under = X_train_under_flat.reshape(-1, 64, 64)

print("\nBefore Undersampling:")
print(Counter(y_train))
print(f"Total: {len(y_train)} samples")

print("\nAfter Undersampling:")
print(Counter(y_train_under))
print(f"Total: {len(y_train_under)} samples")

print("\n⚠️ Warning: We threw away {:.0f}% of our training data!".format(
    (len(y_train) - len(y_train_under)) / len(y_train) * 100
))
print("This is likely NOT a good strategy for our already-small dataset.")

In [None]:
# Save undersampled dataset (for completeness, but probably won't use)
np.savez('../data/splits_temporal_undersample.npz',
         X_train=X_train_under, y_train=y_train_under,
         X_val=X_val, y_val=y_val,
         X_test=X_test, y_test=y_test)

print("✅ Saved to: data/splits_temporal_undersample.npz")
print("   (Included for completeness, but not recommended)")

## 6. Method 4: Combined Approach (SMOTE + Tomek Links)

**How it works**:
- First, apply SMOTE to oversample minority classes
- Then, apply Tomek Links to remove borderline samples
- Tomek Links: pairs of opposite-class samples that are nearest neighbors
- Removing them cleans the decision boundary

**Advantages**:
- Balances classes while cleaning noisy samples
- Often better than SMOTE alone
- Creates cleaner decision boundaries

**Disadvantages**:
- More complex
- Slower than plain SMOTE
- May remove useful borderline cases
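
To illustrate the Tomek link definition above, here is a brute-force sketch on toy 1D data. The function `find_tomek_links` is a hypothetical helper written for this example; it computes all pairwise distances, so it only suits small inputs.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    # Pairwise Euclidean distances; the diagonal is set to inf so that
    # a point is never selected as its own nearest neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # i < j avoids reporting the same pair twice
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# Toy 1D data: two clusters plus one close opposite-class pair at the boundary
X = np.array([[0.0], [0.1], [1.0], [1.05], [2.0], [2.1]])
y = np.array([0, 0, 0, 1, 1, 1])

links = find_tomek_links(X, y)
print(links)  # [(2, 3)]
```

Only the pair at the class boundary qualifies: the other mutual nearest neighbors share a label. Whether one or both members of a link are removed depends on how the cleaning step is configured.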

In [None]:
# Apply SMOTE + Tomek Links
print("Applying SMOTE + Tomek Links...")
smt = SMOTETomek(random_state=42)
X_train_combined_flat, y_train_combined = smt.fit_resample(X_train_flat, y_train)

# Reshape back
X_train_combined = X_train_combined_flat.reshape(-1, 64, 64)

print("\nOriginal:")
print(Counter(y_train))
print(f"Total: {len(y_train)} samples")

print("\nAfter SMOTE + Tomek:")
print(Counter(y_train_combined))
print(f"Total: {len(y_train_combined)} samples")

print("\n💡 Note: Slightly fewer samples than pure SMOTE (Tomek removed borderline cases)")

In [None]:
# Save combined approach dataset
np.savez('../data/splits_temporal_combined.npz',
         X_train=X_train_combined, y_train=y_train_combined,
         X_val=X_val, y_val=y_val,
         X_test=X_test, y_test=y_test)

print("✅ Saved to: data/splits_temporal_combined.npz")

## 7. Comparison of All Methods

Let's compare all balancing strategies side-by-side.

In [None]:
# Create comparison table
import pandas as pd

methods_data = {
    'Original': (y_train, 'No balancing'),
    'SMOTE': (y_train_smote, 'Oversample minority'),
    'Undersample': (y_train_under, 'Undersample majority'),
    'SMOTE+Tomek': (y_train_combined, 'Oversample + clean boundary')
}

summary_data = []
for name, (y, desc) in methods_data.items():
    counts = Counter(y)
    summary_data.append({
        'Method': name,
        'Description': desc,
        'Total': len(y),
        'Left': counts.get(-1, 0),
        'Forward': counts.get(0, 0),
        'Right': counts.get(1, 0),
        'Imbalance Ratio': f"{counts.get(0, 1) / counts.get(1, 1):.2f}:1"
    })

df_summary = pd.DataFrame(summary_data)
print("\n" + "="*80)
print("SUMMARY: Comparison of Class Balancing Methods")
print("="*80)
print(df_summary.to_string(index=False))
print("="*80)

In [None]:
# Visualize all methods side-by-side
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, (name, (y, desc)) in enumerate(methods_data.items()):
    ax = axes[i]
    counts = Counter(y)
    
    labels = ['Left', 'Forward', 'Right']
    values = [counts.get(-1, 0), counts.get(0, 0), counts.get(1, 0)]
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
    
    bars = ax.bar(labels, values, color=colors, alpha=0.7, edgecolor='black')
    ax.set_ylabel('Number of Samples', fontsize=11)
    ax.set_title(f'{name}\n({desc})', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    
    # Add count labels
    for bar, count in zip(bars, values):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + height*0.02,
                f'{count}',
                ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.suptitle('Comparison of Class Balancing Methods', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Recommendations

Based on our dataset characteristics and goals:

In [None]:
print("\n" + "="*80)
print("RECOMMENDATIONS")
print("="*80)

print("\n1. START WITH: Class Weights")
print("   ✅ Simplest approach")
print("   ✅ No synthetic data created")
print("   ✅ Works well with neural networks")
print("   → Use: data/class_weights.npy")

print("\n2. IF Class Weights Don't Work: Try SMOTE")
print("   ✅ Provides more minority class examples")
print("   ✅ Often improves minority class F1 score")
print("   ⚠️  Increases training time (more samples)")
print("   → Use: data/splits_temporal_smote.npz")

print("\n3. ADVANCED: SMOTE + Tomek")
print("   ✅ Cleanest decision boundaries")
print("   ✅ May work better than plain SMOTE")
print("   ⚠️  More complex, slower")
print("   → Use: data/splits_temporal_combined.npz")

print("\n4. AVOID: Random Undersampling")
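discarded_pct = (len(y_train) - len(y_train_under)) / len(y_train) * 100
print(f"   ❌ Throws away {discarded_pct:.0f}% of training data")
print(f"   ❌ Our dataset is already small ({len(y_train):,} training samples)")
print("   ❌ Likely to underperform")
print("   → Don't use unless other methods fail")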

print("\n" + "="*80)
print("EVALUATION STRATEGY")
print("="*80)
print("When comparing methods, focus on:")
print("  • Per-class F1 scores (especially Right class)")
print("  • Confusion matrix (are Right turns being predicted?)")
print("  • F1-macro (average of per-class F1, weights all classes equally)")
print("  ⚠️  DON'T rely on overall accuracy alone!")
print("="*80)

## 9. Save Metadata

Save a summary of all methods for easy reference in modeling notebooks.

In [None]:
# Create metadata summary
metadata = {
    'original_distribution': {
        'left': int(train_counts[-1]),
        'forward': int(train_counts[0]),
        'right': int(train_counts[1]),
        'total': int(len(y_train)),
        'imbalance_ratio': float(train_counts[0] / train_counts[1])
    },
    'methods': {
        'class_weights': {
            'description': 'Inverse frequency weights for loss function',
            'file': 'class_weights.npy',
            'changes_dataset': False,
            'train_size': int(len(y_train)),
            'weights': {int(k): float(v) for k, v in class_weights_dict.items()},
            'recommendation': 'Start here - simplest approach, no synthetic data'
        },
        'smote': {
            'description': 'SMOTE oversampling of minority classes',
            'file': 'splits_temporal_smote.npz',
            'changes_dataset': True,
            'train_size': int(len(y_train_smote)),
            'distribution': {int(k): int(v) for k, v in Counter(y_train_smote).items()},
            'recommendation': 'Try if class weights insufficient for minority class'
        },
        'undersample': {
            'description': 'Random undersampling of majority class',
            'file': 'splits_temporal_undersample.npz',
            'changes_dataset': True,
            'train_size': int(len(y_train_under)),
            'distribution': {int(k): int(v) for k, v in Counter(y_train_under).items()},
            'recommendation': 'NOT recommended - throws away too much data'
        },
        'smote_tomek': {
            'description': 'SMOTE oversampling + Tomek Links boundary cleaning',
            'file': 'splits_temporal_combined.npz',
            'changes_dataset': True,
            'train_size': int(len(y_train_combined)),
            'distribution': {int(k): int(v) for k, v in Counter(y_train_combined).items()},
            'recommendation': 'Advanced option - may work better than plain SMOTE'
        }
    },
    'usage_example': {
        'class_weights': 'model.fit(X, y, class_weight=np.load("class_weights.npy", allow_pickle=True).item(), ...)',
        'smote': 'data = np.load("splits_temporal_smote.npz"); X_train = data["X_train"]',
        'combined': 'data = np.load("splits_temporal_combined.npz"); X_train = data["X_train"]'
    }
}

# Save metadata
with open('../data/imbalance_summary.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("✅ Metadata saved to: data/imbalance_summary.json")
print("\nYou can read this file in modeling notebooks to see details of each method.")

## 10. Summary

**Files created**:
1. `data/class_weights.npy` - Class weights for loss function (**recommended start**)
2. `data/splits_temporal_smote.npz` - SMOTE-balanced training set
3. `data/splits_temporal_undersample.npz` - Undersampled training set (not recommended)
4. `data/splits_temporal_combined.npz` - SMOTE + Tomek Links balanced set
5. `data/imbalance_summary.json` - Metadata about all methods
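
One detail when reading `class_weights.npy` back: because the weights are stored as a Python dict, NumPy pickles them, and loading requires `allow_pickle=True`. The sketch below shows the round trip in a temporary directory with hypothetical weight values; the real file lives under `data/`.

```python
import os
import tempfile

import numpy as np

# Hypothetical weights, keyed by Keras class index (0=Left, 1=Forward, 2=Right)
class_weights = {0: 2.04, 1: 0.45, 2: 3.52}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "class_weights.npy")

    # np.save wraps the dict in a 0-d object array and pickles it
    np.save(path, class_weights)

    # Loading pickled objects must be enabled explicitly; .item() unwraps the dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded)
```

Since the same weights are also written to `imbalance_summary.json`, reading them from the JSON file is an alternative that avoids pickling. Note that JSON turns the integer keys into strings, so they would need converting back with `int(...)`.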

**Next steps**:
1. In baseline models notebook: Try each method, compare results
2. In CNN models notebook: Use best method from baseline experiments
3. Focus evaluation on per-class F1 scores, not just accuracy
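
For the evaluation step, a sketch along these lines may be useful. It assumes scikit-learn's `f1_score` and `confusion_matrix`, and the ground truth and predictions are small hypothetical arrays, not results from any model in this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

labels = [-1, 0, 1]  # Left, Forward, Right

# Hypothetical ground truth and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, -1, -1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, -1, 0, 0, 0])

accuracy = np.mean(y_true == y_pred)
# average=None returns one F1 score per class, in the order given by `labels`
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
for name, score in zip(["Left", "Forward", "Right"], per_class_f1):
    print(f"F1 {name:8s}: {score:.2f}")
print(f"F1-macro: {macro_f1:.2f}")

# Rows are true classes, columns are predicted classes (both in `labels` order)
print(confusion_matrix(y_true, y_pred, labels=labels))
```

In this toy case accuracy is 0.70 even though no Right turn is ever predicted. The per-class F1 for Right and the macro average expose that failure.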

**Expected results**:
- Class weights are expected to improve Right class F1 (a rough guess is 10-20%; verify this in the modeling notebooks)
- SMOTE may improve further, but with longer training time
- Overall accuracy might go down slightly, but that's okay!
- Goal: Balanced performance across all three classes