# Convolutional Neural Network (CNN) Models

**Goal**: Test if CNNs can beat the 93.0% / 0.910 F1-Macro achieved by Random Forest.

## What is a CNN?

**Convolutional Neural Network** = specialized neural network for images

**Key difference from traditional models**:
- Traditional (LR, RF): Treat image as flat list of 4,096 pixels (loses spatial structure)
- CNN: Preserves 2D structure, learns spatial patterns

**How CNNs work**:
1. **Convolutional layers**: Scan small filters (e.g., 3×3) across image to detect patterns
   - Early layers: detect edges, corners
   - Middle layers: detect shapes, textures
   - Deep layers: detect complex patterns (track positions)

2. **Pooling layers**: Reduce image size while keeping important features
   - Makes model faster and more robust

3. **Dense layers**: Combine all features to make final prediction
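
The convolution and pooling steps above can be sketched in plain NumPy. This is an illustrative toy, not the Keras implementation: the helper names `conv2d_valid` and `max_pool_2x2`, the 8×8 image, and the hand-written edge filter are all assumptions made for the example. A real CNN learns its filters during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Like Keras Conv2D, this is technically cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop a trailing odd row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Toy 8x8 image: dark left half, bright right half (one vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Hand-written vertical-edge filter (hypothetical; a CNN would learn this).
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

edges = np.maximum(conv2d_valid(image, vertical_edge), 0.0)  # ReLU
pooled = max_pool_2x2(edges)

print("conv output shape:", edges.shape)   # (6, 6)
print("pooled shape:     ", pooled.shape)  # (3, 3)
print(pooled)
```

The filter responds only where the dark-to-bright edge falls inside its 3×3 window, and pooling halves the resolution while keeping that response.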

**Example architecture**:
```
Input 64×64 image
→ Conv (find edges)
→ Pool (reduce size)
→ Conv (find shapes)
→ Pool (reduce size)
→ Flatten → Dense layers → Prediction
```

## Challenge for Our Dataset

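**Problem**: CNNs trained from scratch are usually fit on tens of thousands of labeled images or more (50K+ is a common rule of thumb, not a hard threshold). We have only 9,900.
- Risk: Overfitting (memorizing training data instead of learning patterns)
- Mitigation: Regularization. This notebook uses dropout and early stopping; weight decay is a further option that is not tested here.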

**Baseline to beat**: Random Forest = 93.0% accuracy, 0.910 F1-Macro
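
Because the baseline is quoted as both accuracy and F1-Macro, it helps to see how the two can diverge on imbalanced classes. The sketch below computes both from a confusion matrix. The helper name `f1_per_class_from_cm` and the counts are hypothetical, chosen only for illustration; they are not results from this project.

```python
import numpy as np

def f1_per_class_from_cm(cm):
    """Per-class F1 from a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurs
    denom = predicted + actual  # equals 2*TP + FP + FN
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

# Hypothetical counts, for illustration only.
cm = [[8,  2, 0],   # true Left
      [1, 80, 1],   # true Forward
      [0,  3, 5]]   # true Right

f1 = f1_per_class_from_cm(cm)
accuracy = np.trace(cm) / np.sum(cm)
macro_f1 = f1.mean()  # unweighted mean, so rare classes count as much as common ones

print(f"accuracy: {accuracy:.3f}")
print(f"per-class F1: {np.round(f1, 3)}")
print(f"macro F1: {macro_f1:.3f}")
```

In this made-up example the dominant Forward class keeps accuracy high while the weaker Left and Right classes pull F1-Macro down, which is why both numbers are tracked throughout.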

## 1. Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import time
import json

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers, callbacks

from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Settings
plt.style.use('default')
%matplotlib inline

label_names = {-1: 'Left', 0: 'Forward', 1: 'Right'}

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")

## 2. Load and Prepare Data

In [None]:
# Load temporal splits
data_temporal = np.load('../data/splits_temporal.npz')
data_random = np.load('../data/splits_random.npz')
data_tfi = np.load('../data/splits_temporal_tfi.npz')

# Temporal splits
X_train_temp = data_temporal['X_train']
y_train_temp = data_temporal['y_train']
X_val = data_temporal['X_val']
y_val = data_temporal['y_val']
X_test = data_temporal['X_test']
y_test = data_temporal['y_test']

# Random splits (for comparison)
X_train_rand = data_random['X_train']
y_train_rand = data_random['y_train']
X_val_rand = data_random['X_val']
y_val_rand = data_random['y_val']
X_test_rand = data_random['X_test']
y_test_rand = data_random['y_test']

# TFI-balanced
X_train_tfi = data_tfi['X_train']
y_train_tfi = data_tfi['y_train']

# Load class weights
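class_weights = np.load('../data/class_weights.npy', allow_pickle=True).item()
# Keras matches class_weight keys against the mapped labels (0, 1, 2) used for training.
# If the file was saved with the original labels (-1, 0, 1), shift the keys by one.
if min(class_weights) < 0:
    class_weights = {int(k) + 1: float(v) for k, v in class_weights.items()}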

print("Temporal splits:")
print(f"  Train: {X_train_temp.shape}, labels: {Counter(y_train_temp)}")
print(f"  Val:   {X_val.shape}")
print(f"  Test:  {X_test.shape}")
print(f"\nRandom splits:")
print(f"  Train: {X_train_rand.shape}")
print(f"\nTFI-balanced:")
print(f"  Train: {X_train_tfi.shape}, labels: {Counter(y_train_tfi)}")

In [None]:
# Reshape for CNN: (samples, height, width) → (samples, height, width, channels)
# Grayscale has 1 channel
X_train_temp_cnn = X_train_temp.reshape(-1, 64, 64, 1)
X_val_cnn = X_val.reshape(-1, 64, 64, 1)
X_test_cnn = X_test.reshape(-1, 64, 64, 1)

X_train_rand_cnn = X_train_rand.reshape(-1, 64, 64, 1)
X_val_rand_cnn = X_val_rand.reshape(-1, 64, 64, 1)
X_test_rand_cnn = X_test_rand.reshape(-1, 64, 64, 1)

X_train_tfi_cnn = X_train_tfi.reshape(-1, 64, 64, 1)

# Normalize to [0, 1]
X_train_temp_cnn = X_train_temp_cnn / 255.0
X_val_cnn = X_val_cnn / 255.0
X_test_cnn = X_test_cnn / 255.0

X_train_rand_cnn = X_train_rand_cnn / 255.0
X_val_rand_cnn = X_val_rand_cnn / 255.0
X_test_rand_cnn = X_test_rand_cnn / 255.0

X_train_tfi_cnn = X_train_tfi_cnn / 255.0

# Map labels: -1→0, 0→1, 1→2 (Keras expects labels starting from 0)
y_train_temp_mapped = y_train_temp + 1
y_val_mapped = y_val + 1
y_test_mapped = y_test + 1

y_train_rand_mapped = y_train_rand + 1
y_val_rand_mapped = y_val_rand + 1
y_test_rand_mapped = y_test_rand + 1

y_train_tfi_mapped = y_train_tfi + 1

print(f"Prepared for CNN:")
print(f"  X_train shape: {X_train_temp_cnn.shape}")
print(f"  y_train mapped labels: {np.unique(y_train_temp_mapped)}")
print(f"  Value range: [{X_train_temp_cnn.min():.2f}, {X_train_temp_cnn.max():.2f}]")

## 3. Helper Functions

In [None]:
def evaluate_cnn(model, X_test, y_test, model_name="CNN"):
    """
    Evaluate CNN model.
    y_test should be mapped (0, 1, 2)
    """
    y_pred = np.argmax(model.predict(X_test, verbose=0), axis=1)
    
    acc = accuracy_score(y_test, y_pred)
    f1_macro = f1_score(y_test, y_pred, average='macro')
    f1_per_class = f1_score(y_test, y_pred, average=None, labels=[0, 1, 2])
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
    
    print(f"\n{'='*60}")
    print(f"{model_name}")
    print(f"{'='*60}")
    print(f"Accuracy:  {acc:.3f} ({acc*100:.1f}%)")
    print(f"F1-Macro:  {f1_macro:.3f}")
    print(f"\nPer-class F1:")
    for idx, f1 in enumerate(f1_per_class):
        original_label = idx - 1
        print(f"  {label_names[original_label]:8s}: {f1:.3f}")
    
    return {
        'accuracy': acc,
        'f1_macro': f1_macro,
        'f1_left': f1_per_class[0],
        'f1_forward': f1_per_class[1],
        'f1_right': f1_per_class[2],
        'confusion_matrix': cm,
        'predictions': y_pred  # reuse the predictions computed above
    }

def plot_confusion_matrix(cm, title="Confusion Matrix"):
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Left', 'Forward', 'Right'],
                yticklabels=['Left', 'Forward', 'Right'],
                cbar_kws={'label': 'Count'})
    plt.ylabel('True Label', fontsize=12)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

def plot_training_history(history, title="Training History"):
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Accuracy
    axes[0].plot(history.history['accuracy'], label='Train', linewidth=2)
    axes[0].plot(history.history['val_accuracy'], label='Validation', linewidth=2)
    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Accuracy', fontsize=12)
    axes[0].set_title('Accuracy over Epochs', fontsize=12, fontweight='bold')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # Loss
    axes[1].plot(history.history['loss'], label='Train', linewidth=2)
    axes[1].plot(history.history['val_loss'], label='Validation', linewidth=2)
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('Loss', fontsize=12)
    axes[1].set_title('Loss over Epochs', fontsize=12, fontweight='bold')
    axes[1].legend()
    axes[1].grid(alpha=0.3)
    
    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

print("Helper functions defined")

## 4. Simple CNN Architecture

**Design philosophy**: Start simple to avoid overfitting

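**Architecture** (Keras `Conv2D` defaults to `padding='valid'`, so each 3×3 convolution trims 2 pixels per dimension):
- Conv2D(32 filters, 3×3) + ReLU → 64×64 to 62×62
- MaxPooling(2×2) → reduces 62×62 to 31×31
- Conv2D(64 filters, 3×3) + ReLU → 31×31 to 29×29
- MaxPooling(2×2) → reduces 29×29 to 14×14
- Flatten → 14×14×64 = 12,544 features
- Dense(128) + ReLU + Dropout(0.5)
- Dense(3) + Softmax

**Total parameters**: ~1.6M (1,624,963), almost all of them in the Flatten → Dense(128) connection. Check against `model.summary()`.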
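
The layer shapes and parameter count can be checked by hand before building the model. The sketch below assumes the Keras defaults used in `build_simple_cnn()` (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size). The helper names `conv_out`, `conv_params`, and `dense_params` are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output height/width of a conv or pooling layer with no padding."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

size = 64
size = conv_out(size, 3)             # Conv2D(32, 3x3):   64 -> 62
size = conv_out(size, 2, stride=2)   # MaxPooling2D(2x2): 62 -> 31
size = conv_out(size, 3)             # Conv2D(64, 3x3):   31 -> 29
size = conv_out(size, 2, stride=2)   # MaxPooling2D(2x2): 29 -> 14
flat_features = size * size * 64     # 14 * 14 * 64

params = {
    'conv1': conv_params(3, 1, 32),
    'conv2': conv_params(3, 32, 64),
    'dense1': dense_params(flat_features, 128),
    'output': dense_params(128, 3),
}
total = sum(params.values())

print(f"flattened features: {flat_features:,}")
for name, count in params.items():
    print(f"{name:7s}: {count:>10,}  ({count / total:.1%})")
print(f"total  : {total:>10,}")
```

The first dense layer dominates the count, so shrinking the feature map before `Flatten` (or reducing the dense width) is the most direct way to cut parameters.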

In [None]:
def build_simple_cnn():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
        layers.MaxPooling2D((2, 2)),
        
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(3, activation='softmax')
    ])
    
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Build and show architecture
model = build_simple_cnn()
model.summary()

## 5. Experiment 1: Simple CNN on Random Split

**Purpose**: Demonstrate data leakage from temporal correlation.

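From EDA: consecutive frames have a correlation of 0.89, and frames up to 100 steps apart are still correlated above 0.5.

A random split scatters temporally adjacent frames across train and test, so the test set contains near-duplicates of training frames and the measured performance is inflated.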
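
To make the leakage mechanism concrete, the sketch below measures how far each test frame is from its nearest training frame under the two split strategies. It uses frame indices only, not the real images. The 20% test fraction, the "last frames as test" temporal split, and the helper name `nearest_train_gap` are assumptions for illustration; the actual splits live in the `.npz` files.

```python
import numpy as np

rng = np.random.default_rng(42)
n_frames = 9900          # dataset size quoted above
n_test = n_frames // 5   # assumed 20% test fraction
frames = np.arange(n_frames)

def nearest_train_gap(train_idx, test_idx):
    """Distance (in frames) from each test frame to the closest training frame."""
    train_sorted = np.sort(train_idx)
    pos = np.searchsorted(train_sorted, test_idx)
    last = len(train_sorted) - 1
    left = train_sorted[np.clip(pos - 1, 0, last)]
    right = train_sorted[np.clip(pos, 0, last)]
    return np.minimum(np.abs(test_idx - left), np.abs(test_idx - right))

# Random split: test frames are scattered among training frames.
perm = rng.permutation(n_frames)
rand_gap = nearest_train_gap(perm[n_test:], perm[:n_test])

# Temporal split (assumed): the last 20% of the recording is held out.
temp_gap = nearest_train_gap(frames[:-n_test], frames[-n_test:])

for name, gap in [("random", rand_gap), ("temporal", temp_gap)]:
    print(f"{name:8s} median gap: {np.median(gap):7.1f} frames, "
          f"within 100 frames of a training frame: {np.mean(gap <= 100):.1%}")
```

Under the random split nearly every test frame sits right next to a training frame, well inside the range where the EDA found strong correlation. Under the temporal split only the first few test frames do.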

In [None]:
print("Training Simple CNN on RANDOM split...")

model_rand = build_simple_cnn()

early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(factor=0.5, patience=5, verbose=1)

start_time = time.time()
history_rand = model_rand.fit(
    X_train_rand_cnn, y_train_rand_mapped,
    validation_data=(X_val_rand_cnn, y_val_rand_mapped),
    epochs=50,
    batch_size=32,
    callbacks=[early_stop, reduce_lr],
    verbose=2
)
train_time_rand = time.time() - start_time

print(f"\nTraining completed in {train_time_rand:.1f} seconds")

results_cnn_rand = evaluate_cnn(model_rand, X_test_rand_cnn, y_test_rand_mapped, 
                                "Simple CNN (Random Split)")

In [None]:
plot_training_history(history_rand, "Simple CNN - Random Split")
plot_confusion_matrix(results_cnn_rand['confusion_matrix'], 
                      "Simple CNN (Random Split) - Confusion Matrix")

### Analysis: Random Split Results

*Fill in observations after running:*
- Accuracy: ____%
- F1-Macro: ____
- Gap from baseline (Random Forest 93%): ____
- Training vs validation gap: ____ (check for overfitting)

*Expected: High accuracy due to data leakage (temporally similar frames in train/test)*

## 6. Experiment 2: Simple CNN on Temporal Split (No Balancing)

**Purpose**: Realistic evaluation without data leakage.

In [None]:
print("Training Simple CNN on TEMPORAL split (original data)...")

model_temp_orig = build_simple_cnn()

early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(factor=0.5, patience=5, verbose=1)

start_time = time.time()
history_temp_orig = model_temp_orig.fit(
    X_train_temp_cnn, y_train_temp_mapped,
    validation_data=(X_val_cnn, y_val_mapped),
    epochs=50,
    batch_size=32,
    callbacks=[early_stop, reduce_lr],
    verbose=2
)
train_time_temp_orig = time.time() - start_time

print(f"\nTraining completed in {train_time_temp_orig:.1f} seconds")

results_temp_orig = evaluate_cnn(model_temp_orig, X_test_cnn, y_test_mapped,
                                 "Simple CNN (Temporal, No Balancing)")

In [None]:
plot_training_history(history_temp_orig, "Simple CNN - Temporal Split (Original)")
plot_confusion_matrix(results_temp_orig['confusion_matrix'],
                      "Simple CNN (Temporal) - Confusion Matrix")

### Analysis: Temporal Split vs Random Split Comparison

*Fill in after running both experiments:*

| Metric | Random Split | Temporal Split | Gap |
|--------|--------------|----------------|-----|
| Accuracy | ____% | ____% | ____% |
| F1-Macro | ____ | ____ | ____ |
| F1-Right | ____ | ____ | ____ |

*Expected: Random split shows 5-10% higher accuracy due to data leakage*

**Conclusion**: *(Fill in based on results)*

## 7. Experiment 3: Simple CNN with Class Weights
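
This notebook loads precomputed weights from `class_weights.npy`, and how they were derived is not shown here. One common choice, and the one scikit-learn uses for `class_weight='balanced'`, is `n_samples / (n_classes * class_count)`. The sketch below illustrates that heuristic under the assumption that something similar was used; the helper name and label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Hypothetical mapped labels (0=Left, 1=Forward, 2=Right), for illustration only.
labels = [0] * 10 + [1] * 80 + [2] * 10

weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

Passing such a dictionary as `class_weight` to `model.fit` scales each sample's loss by its class weight, so mistakes on rare classes cost more. The keys must match the mapped labels (0, 1, 2).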

In [None]:
print("Training Simple CNN with class weights...")

model_temp_weighted = build_simple_cnn()

early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(factor=0.5, patience=5, verbose=1)

start_time = time.time()
history_temp_weighted = model_temp_weighted.fit(
    X_train_temp_cnn, y_train_temp_mapped,
    validation_data=(X_val_cnn, y_val_mapped),
    epochs=50,
    batch_size=32,
    class_weight=class_weights,
    callbacks=[early_stop, reduce_lr],
    verbose=2
)
train_time_temp_weighted = time.time() - start_time

print(f"\nTraining completed in {train_time_temp_weighted:.1f} seconds")

results_temp_weighted = evaluate_cnn(model_temp_weighted, X_test_cnn, y_test_mapped,
                                     "Simple CNN (Temporal + Class Weights)")

In [None]:
plot_training_history(history_temp_weighted, "Simple CNN - Class Weights")
plot_confusion_matrix(results_temp_weighted['confusion_matrix'],
                      "Simple CNN (Weighted) - Confusion Matrix")

### Analysis: Class Weights Impact

*Fill in after running:*

| Metric | No Weights | With Weights | Change |
|--------|------------|--------------|--------|
| Accuracy | ____% | ____% | ____% |
| F1-Right | ____ | ____ | ____ |

**Observations**: *(Fill in based on results)*

## 8. Experiment 4: Simple CNN with TFI Balanced Data

In [None]:
print("Training Simple CNN on TFI-balanced data...")

model_tfi = build_simple_cnn()

early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(factor=0.5, patience=5, verbose=1)

start_time = time.time()
history_tfi = model_tfi.fit(
    X_train_tfi_cnn, y_train_tfi_mapped,
    validation_data=(X_val_cnn, y_val_mapped),
    epochs=50,
    batch_size=32,
    callbacks=[early_stop, reduce_lr],
    verbose=2
)
train_time_tfi = time.time() - start_time

print(f"\nTraining completed in {train_time_tfi:.1f} seconds")

results_tfi = evaluate_cnn(model_tfi, X_test_cnn, y_test_mapped,
                           "Simple CNN (TFI Balanced)")

In [None]:
plot_training_history(history_tfi, "Simple CNN - TFI Balanced")
plot_confusion_matrix(results_tfi['confusion_matrix'],
                      "Simple CNN (TFI) - Confusion Matrix")

### Analysis: TFI Impact

*Fill in after running:*

Training set size: ____ samples (vs ____ original)

| Metric | Original | TFI | Change |
|--------|----------|-----|--------|
| Accuracy | ____% | ____% | ____% |
| F1-Right | ____ | ____ | ____ |

**Observations**: *(Fill in)*

## 9. Experiment 5: Medium CNN (AlexNet-Style)

**Purpose**: Test if deeper architecture improves performance.

**Architecture** (simplified from 2020 project):
- Conv2D(96, 5×5, stride=2) + ReLU + MaxPool
- Conv2D(128, 3×3) + ReLU + MaxPool
- Conv2D(256, 3×3) + ReLU
- Flatten → Dense(512) + ReLU + Dropout(0.5)
- Dense(3) + Softmax

**Total parameters**: ~2.5M (2,507,587 with the default `padding='valid'`), about 1.5× the simple CNN's count as reported by `model.summary()`.

**Risk**: May overfit on 9,900 samples

In [None]:
def build_medium_cnn():
    model = models.Sequential([
        layers.Conv2D(96, (5, 5), strides=2, activation='relu', input_shape=(64, 64, 1)),
        layers.MaxPooling2D((2, 2)),
        
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        
        layers.Conv2D(256, (3, 3), activation='relu'),
        
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(3, activation='softmax')
    ])
    
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

model_medium = build_medium_cnn()
model_medium.summary()

In [None]:
print("Training Medium CNN on temporal split + class weights...")

early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(factor=0.5, patience=5, verbose=1)

start_time = time.time()
history_medium = model_medium.fit(
    X_train_temp_cnn, y_train_temp_mapped,
    validation_data=(X_val_cnn, y_val_mapped),
    epochs=50,
    batch_size=32,
    class_weight=class_weights,
    callbacks=[early_stop, reduce_lr],
    verbose=2
)
train_time_medium = time.time() - start_time

print(f"\nTraining completed in {train_time_medium:.1f} seconds")

results_medium = evaluate_cnn(model_medium, X_test_cnn, y_test_mapped,
                              "Medium CNN (Temporal + Weights)")

In [None]:
plot_training_history(history_medium, "Medium CNN")
plot_confusion_matrix(results_medium['confusion_matrix'],
                      "Medium CNN - Confusion Matrix")

### Analysis: Simple vs Medium Architecture

*Fill in after running:*

| Model | Params | Accuracy | F1-Macro | Train Time |
|-------|--------|----------|----------|------------|
| Simple CNN | ~1.6M | ____% | ____ | ____s |
| Medium CNN | ~2.5M | ____% | ____ | ____s |

**Check training curves**: Does medium CNN show more overfitting (train-val gap)?

**Observations**: *(Fill in)*

## 10. Experiment 6: Regularization Tuning

**Purpose**: Find best regularization for simple CNN.

**Test**: Heavier dropout (0.7 instead of 0.5)
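
For intuition about what changes between the two rates, the sketch below applies inverted dropout, the training-time behavior of a dropout layer, to a block of constant activations. It is a NumPy illustration, not the Keras layer; `inverted_dropout` is a hypothetical helper.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero a fraction `rate` of units and rescale the rest by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 128))  # 1000 samples through a 128-unit layer

stats = {}
for rate in (0.5, 0.7):
    dropped = inverted_dropout(activations, rate, rng)
    stats[rate] = {
        'zero_fraction': float(np.mean(dropped == 0)),
        'mean_activation': float(dropped.mean()),
        'surviving_value': 1.0 / (1.0 - rate),
    }
    print(f"rate={rate}: {stats[rate]['zero_fraction']:.1%} of units zeroed, "
          f"survivors scaled to {stats[rate]['surviving_value']:.2f}, "
          f"mean activation {stats[rate]['mean_activation']:.3f}")
```

At 0.7 only about 38 of the 128 dense units are active on any training step, which is stronger regularization but also a noisier training signal. The rescaling keeps the expected activation unchanged, so no adjustment is needed at inference time.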

In [None]:
def build_simple_cnn_heavy_dropout():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
        layers.MaxPooling2D((2, 2)),
        
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.7),  # Heavy dropout
        layers.Dense(3, activation='softmax')
    ])
    
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

print("Training Simple CNN with heavy dropout (0.7)...")

model_heavy = build_simple_cnn_heavy_dropout()

early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(factor=0.5, patience=5, verbose=1)

start_time = time.time()
history_heavy = model_heavy.fit(
    X_train_temp_cnn, y_train_temp_mapped,
    validation_data=(X_val_cnn, y_val_mapped),
    epochs=50,
    batch_size=32,
    class_weight=class_weights,
    callbacks=[early_stop, reduce_lr],
    verbose=2
)
train_time_heavy = time.time() - start_time

print(f"\nTraining completed in {train_time_heavy:.1f} seconds")

results_heavy = evaluate_cnn(model_heavy, X_test_cnn, y_test_mapped,
                             "Simple CNN (Heavy Dropout 0.7)")

In [None]:
plot_training_history(history_heavy, "Simple CNN - Heavy Dropout (0.7)")
plot_confusion_matrix(results_heavy['confusion_matrix'],
                      "Simple CNN (Heavy Dropout) - Confusion Matrix")

### Analysis: Regularization Impact

*Fill in:*

| Dropout | Accuracy | F1-Macro | Overfitting (train-val gap) |
|---------|----------|----------|----------------------------|
| 0.5 | ____% | ____ | ____ |
| 0.7 | ____% | ____ | ____ |

**Observations**: *(Does heavier dropout reduce overfitting? Improve generalization?)*
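
One way to fill in the train-val gap column consistently is to read it at the epoch with the lowest validation loss rather than at the last epoch. The sketch below works on a dictionary shaped like `history.history`; the helper name `train_val_gap` and the numbers are hypothetical and only illustrate the calculation.

```python
def train_val_gap(history_dict):
    """Train minus validation accuracy at the epoch with the lowest val_loss."""
    val_loss = history_dict['val_loss']
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = (history_dict['accuracy'][best_epoch]
           - history_dict['val_accuracy'][best_epoch])
    return best_epoch, gap

# Hypothetical curves, for illustration only (not results from this notebook).
example_history = {
    'accuracy':     [0.80, 0.90, 0.95, 0.99],
    'val_accuracy': [0.78, 0.86, 0.88, 0.87],
    'val_loss':     [0.60, 0.40, 0.35, 0.45],
}

best_epoch, gap = train_val_gap(example_history)
print(f"best epoch (0-based): {best_epoch}, train-val gap: {gap:.3f}")
```

In the notebook this could be called as `train_val_gap(history_temp_weighted.history)` and `train_val_gap(history_heavy.history)`. Note that Keras reports training accuracy with dropout active, so the gap slightly understates overfitting.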

## 11. Overall Comparison: CNNs vs Baselines

In [None]:
import pandas as pd

# Load baseline results
with open('../results/baseline_results.json', 'r') as f:
    baseline_results = json.load(f)

# Create comprehensive comparison
all_results = pd.DataFrame([
    {'Model': 'Random Forest', 'Split': 'Temporal', 'Balance': 'Weights',
     'Params': 'N/A', 
     'Acc': f"{baseline_results['rf_weighted']['accuracy']:.3f}",
     'F1-Macro': f"{baseline_results['rf_weighted']['f1_macro']:.3f}",
     'F1-Right': f"{baseline_results['rf_weighted']['f1_right']:.3f}",
     'Time': '~45s'},
    
    {'Model': 'Simple CNN', 'Split': 'Random', 'Balance': 'None',
     'Params': f"{model_rand.count_params():,}",
     'Acc': f"{results_cnn_rand['accuracy']:.3f}",
     'F1-Macro': f"{results_cnn_rand['f1_macro']:.3f}",
     'F1-Right': f"{results_cnn_rand['f1_right']:.3f}",
     'Time': f'{train_time_rand:.0f}s'},
    
    {'Model': 'Simple CNN', 'Split': 'Temporal', 'Balance': 'None',
     'Params': f"{model_temp_orig.count_params():,}",
     'Acc': f"{results_temp_orig['accuracy']:.3f}",
     'F1-Macro': f"{results_temp_orig['f1_macro']:.3f}",
     'F1-Right': f"{results_temp_orig['f1_right']:.3f}",
     'Time': f'{train_time_temp_orig:.0f}s'},
    
    {'Model': 'Simple CNN', 'Split': 'Temporal', 'Balance': 'Weights',
     'Params': f"{model_temp_weighted.count_params():,}",
     'Acc': f"{results_temp_weighted['accuracy']:.3f}",
     'F1-Macro': f"{results_temp_weighted['f1_macro']:.3f}",
     'F1-Right': f"{results_temp_weighted['f1_right']:.3f}",
     'Time': f'{train_time_temp_weighted:.0f}s'},
    
    {'Model': 'Simple CNN', 'Split': 'Temporal', 'Balance': 'TFI',
     'Params': f"{model_tfi.count_params():,}",
     'Acc': f"{results_tfi['accuracy']:.3f}",
     'F1-Macro': f"{results_tfi['f1_macro']:.3f}",
     'F1-Right': f"{results_tfi['f1_right']:.3f}",
     'Time': f'{train_time_tfi:.0f}s'},
    
    {'Model': 'Medium CNN', 'Split': 'Temporal', 'Balance': 'Weights',
     'Params': f"{model_medium.count_params():,}",
     'Acc': f"{results_medium['accuracy']:.3f}",
     'F1-Macro': f"{results_medium['f1_macro']:.3f}",
     'F1-Right': f"{results_medium['f1_right']:.3f}",
     'Time': f'{train_time_medium:.0f}s'},
    
    {'Model': 'Simple CNN Heavy', 'Split': 'Temporal', 'Balance': 'Weights',
     'Params': f"{model_heavy.count_params():,}",
     'Acc': f"{results_heavy['accuracy']:.3f}",
     'F1-Macro': f"{results_heavy['f1_macro']:.3f}",
     'F1-Right': f"{results_heavy['f1_right']:.3f}",
     'Time': f'{train_time_heavy:.0f}s'}
])

print("\n" + "="*100)
print("COMPLETE RESULTS: CNNs vs Baselines")
print("="*100)
print(all_results.to_string(index=False))
print("="*100)

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# F1-Macro comparison
models_list = ['RF-Weight', 'CNN-Random', 'CNN-Temp', 'CNN-Weight', 'CNN-TFI', 'Medium', 'Heavy']
f1_macros = [
    baseline_results['rf_weighted']['f1_macro'],
    results_cnn_rand['f1_macro'],
    results_temp_orig['f1_macro'],
    results_temp_weighted['f1_macro'],
    results_tfi['f1_macro'],
    results_medium['f1_macro'],
    results_heavy['f1_macro']
]

colors = ['green' if 'RF' in m else 'red' if 'Random' in m else 'steelblue' for m in models_list]
axes[0].bar(range(len(models_list)), f1_macros, color=colors, alpha=0.7)
axes[0].axhline(y=0.910, color='green', linestyle='--', linewidth=2, label='RF Baseline (0.910)')
axes[0].set_xticks(range(len(models_list)))
axes[0].set_xticklabels(models_list, rotation=45, ha='right')
axes[0].set_ylabel('F1-Macro', fontsize=12)
axes[0].set_title('F1-Macro: CNNs vs Random Forest', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Training time comparison
times = [45, train_time_rand, train_time_temp_orig, train_time_temp_weighted, 
         train_time_tfi, train_time_medium, train_time_heavy]
axes[1].bar(range(len(models_list)), times, color=colors, alpha=0.7)
axes[1].set_xticks(range(len(models_list)))
axes[1].set_xticklabels(models_list, rotation=45, ha='right')
axes[1].set_ylabel('Training Time (seconds)', fontsize=12)
axes[1].set_title('Computational Cost', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### Overall Analysis: Do CNNs Beat Random Forest?

*Fill in after seeing all results:*

**Best CNN performance**: ____ accuracy, ____ F1-Macro  
**Random Forest baseline**: 93.0% accuracy, 0.910 F1-Macro

**Gap**: ____ (CNN - RF)

**Interpretation**:
- If gap > +2%: CNNs successfully learned spatial features worth the complexity
- If gap ≈ 0±1%: Comparable performance, but CNNs have higher computational cost
- If gap < -2%: CNNs overfit on small dataset, simpler models are better

**Observations**: *(Fill in)*

## 12. Error Analysis

**Purpose**: Understand where models fail and why.

In [None]:
# Use best CNN model for error analysis
# Identify incorrectly classified samples
y_pred_best = results_temp_weighted['predictions']  # Change based on best model
y_test_original = y_test  # Original labels (-1, 0, 1)
y_pred_original = y_pred_best - 1  # Map back from (0,1,2) to (-1,0,1)

# Find misclassified samples
errors = y_test_original != y_pred_original
error_indices = np.where(errors)[0]

print(f"Total errors: {len(error_indices)} / {len(y_test)} ({len(error_indices)/len(y_test)*100:.1f}%)")

# Categorize errors
error_types = {}
for idx in error_indices:
    true_label = y_test_original[idx]
    pred_label = y_pred_original[idx]
    key = f"{label_names[true_label]}→{label_names[pred_label]}"
    error_types[key] = error_types.get(key, 0) + 1

print("\nError breakdown:")
for error_type, count in sorted(error_types.items(), key=lambda x: x[1], reverse=True):
    print(f"  {error_type:20s}: {count:3d} errors")

In [None]:
# Visualize worst errors (most confident wrong predictions)
# Get prediction probabilities
probs = model_temp_weighted.predict(X_test_cnn, verbose=0)
pred_confidence = np.max(probs, axis=1)

# Find confident errors
confident_errors = []
for idx in error_indices:
    confident_errors.append((idx, pred_confidence[idx]))

# Sort by confidence (most confident errors first)
confident_errors.sort(key=lambda x: x[1], reverse=True)

# Show top 10 confident errors
fig, axes = plt.subplots(2, 5, figsize=(18, 8))
axes = axes.flatten()

for i in range(min(10, len(confident_errors))):
    idx, conf = confident_errors[i]
    
    # Index the reshaped array so this works even if X_test is stored flat
    axes[i].imshow(X_test_cnn[idx, :, :, 0], cmap='gray')
    true_label = y_test_original[idx]
    pred_label = y_pred_original[idx]
    
    axes[i].set_title(
        f"True: {label_names[true_label]}\nPred: {label_names[pred_label]} ({conf:.2f})",
        fontsize=10, color='red'
    )
    axes[i].axis('off')

plt.suptitle('Most Confident Errors (Wrong Predictions with High Confidence)', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Error Analysis Observations

*After visualizing errors, categorize them:*

**Common error patterns**:
1. Temporal lag cases: *(Do errors show label mismatch issue we discussed?)*
2. Ambiguous images: *(Images that genuinely look like they could go either way)*
3. Systematic biases: *(Does model always confuse specific pairs?)*

**Comparison to Random Forest**: *(Do they fail on same samples? Different samples?)*

**Insights**: *(Fill in)*

## 13. Data Leakage Analysis: Random vs Temporal Split

**Purpose**: Quantify the impact of temporal correlation on evaluation.

In [None]:
# Compare same model on random vs temporal splits
leakage_analysis = pd.DataFrame([
    {'Split Type': 'Random', 
     'Accuracy': f"{results_cnn_rand['accuracy']:.3f}",
     'F1-Macro': f"{results_cnn_rand['f1_macro']:.3f}"},
    {'Split Type': 'Temporal',
     'Accuracy': f"{results_temp_orig['accuracy']:.3f}",
     'F1-Macro': f"{results_temp_orig['f1_macro']:.3f}"}
])

print("\n" + "="*60)
print("DATA LEAKAGE ANALYSIS")
print("="*60)
print(leakage_analysis.to_string(index=False))
print("="*60)

# Calculate gap
acc_gap = results_cnn_rand['accuracy'] - results_temp_orig['accuracy']
f1_gap = results_cnn_rand['f1_macro'] - results_temp_orig['f1_macro']

print(f"\nPerformance inflation from random split:")
print(f"  Accuracy gap:  {acc_gap:.3f} ({acc_gap*100:.1f} percentage points)")
print(f"  F1-Macro gap:  {f1_gap:.3f}")

### Data Leakage Conclusions

*Fill in after seeing gap:*

**Observed gap**: ____% accuracy inflation from random split

**Interpretation**:
- If gap > 5%: Severe leakage - random split completely invalid
- If gap 2-5%: Moderate leakage - temporal split essential
- If gap < 2%: Minimal leakage (surprising given 0.89 frame correlation!)

**Conclusion**: *(Fill in)*

**If the gap is material, this validates**: the temporal split methodology chosen in the EDA.

## 14. Learning Curves (Dataset Size Analysis)

**Purpose**: Understand if more data would help.

Train on increasing fractions of training data: 20%, 40%, 60%, 80%, 100%

**What to look for**:
- If curves still rising at 100%: More data would help
- If curves plateau: Model has saturated, more data won't help
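
The "still rising vs plateau" call can be made a little less subjective with a simple threshold on the last step of the curve. The sketch below is one possible heuristic, not an established test: the helper name `still_rising`, the 0.005 threshold, and the example scores are all assumptions for illustration. With a single training run per fraction, differences this small can also be noise.

```python
def still_rising(scores, min_gain=0.005):
    """True if the final step of the curve improved by at least `min_gain`."""
    if len(scores) < 2:
        return False
    return (scores[-1] - scores[-2]) >= min_gain

# Hypothetical test-accuracy curves at 20/40/60/80/100% of the data,
# for illustration only (not results from this notebook).
rising_curve = [0.80, 0.84, 0.86, 0.88, 0.90]
plateau_curve = [0.85, 0.89, 0.90, 0.900, 0.901]

print("rising curve  ->", still_rising(rising_curve))
print("plateau curve ->", still_rising(plateau_curve))
```

In the notebook the input would be `[r['accuracy'] for r in learning_curve_results]`. Repeating each fraction with several seeds would give a more trustworthy answer than any single-run threshold.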

In [None]:
# Train on different data sizes
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
learning_curve_results = []

for frac in fractions:
    n_samples = int(len(X_train_temp_cnn) * frac)
    
    X_subset = X_train_temp_cnn[:n_samples]
    y_subset = y_train_temp_mapped[:n_samples]
    
    print(f"\nTraining on {frac*100:.0f}% data ({n_samples} samples)...")
    
    model_lc = build_simple_cnn()
    early_stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True, verbose=0)
    
    history = model_lc.fit(
        X_subset, y_subset,
        validation_data=(X_val_cnn, y_val_mapped),
        epochs=50,
        batch_size=32,
        class_weight=class_weights,
        callbacks=[early_stop],
        verbose=0
    )
    
    # Evaluate
    y_pred = np.argmax(model_lc.predict(X_test_cnn, verbose=0), axis=1)
    acc = accuracy_score(y_test_mapped, y_pred)
    f1 = f1_score(y_test_mapped, y_pred, average='macro')
    
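    # Report train/val accuracy at the epoch with the lowest val_loss (the one
    # EarlyStopping restores when it triggers), not at the last epoch that ran.
    best_epoch = int(np.argmin(history.history['val_loss']))
    learning_curve_results.append({
        'fraction': frac,
        'n_samples': n_samples,
        'accuracy': acc,
        'f1_macro': f1,
        'train_acc': history.history['accuracy'][best_epoch],
        'val_acc': history.history['val_accuracy'][best_epoch]
    })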
    
    print(f"  Test accuracy: {acc:.3f}, F1-Macro: {f1:.3f}")

print("\nLearning curve analysis complete")

In [None]:
# Plot learning curves
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

n_samples_list = [r['n_samples'] for r in learning_curve_results]
test_accs = [r['accuracy'] for r in learning_curve_results]
train_accs = [r['train_acc'] for r in learning_curve_results]
val_accs = [r['val_acc'] for r in learning_curve_results]

# Test accuracy vs dataset size
axes[0].plot(n_samples_list, test_accs, marker='o', linewidth=2, markersize=8, label='Test')
axes[0].axhline(y=0.930, color='green', linestyle='--', label='RF Baseline (93%)')
axes[0].set_xlabel('Training Samples', fontsize=12)
axes[0].set_ylabel('Test Accuracy', fontsize=12)
axes[0].set_title('Learning Curve: Accuracy vs Dataset Size', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Train vs val accuracy (overfitting check)
axes[1].plot(n_samples_list, train_accs, marker='o', linewidth=2, markersize=8, label='Train')
axes[1].plot(n_samples_list, val_accs, marker='s', linewidth=2, markersize=8, label='Validation')
axes[1].set_xlabel('Training Samples', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Overfitting Check: Train vs Val', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

### Learning Curve Analysis

*Fill in observations:*

**Curve shape**: *(Flat? Still rising? Plateau?)*

**Would more data help?**: *(Yes if curve rising, No if plateau)*

**Overfitting severity**: *(Large train-val gap? Getting worse with more data?)*

**Insights**: *(Fill in)*

## 15. Save Results and Best Model

In [None]:
# Identify best model (change based on your results)
best_model = model_temp_weighted  # Update this
best_results = results_temp_weighted  # Update this
best_name = "Simple CNN (Temporal + Class Weights)"  # Update this

# Save model
best_model.save('../models/best_cnn.keras')
print(f"Best model saved: {best_name}")
print(f"  Accuracy: {best_results['accuracy']:.3f}")
print(f"  F1-Macro: {best_results['f1_macro']:.3f}")

# Save all CNN results
cnn_results = {
    'cnn_random': {k: v.tolist() if isinstance(v, np.ndarray) else v 
                   for k, v in results_cnn_rand.items()},
    'cnn_temporal_orig': {k: v.tolist() if isinstance(v, np.ndarray) else v 
                          for k, v in results_temp_orig.items()},
    'cnn_temporal_weighted': {k: v.tolist() if isinstance(v, np.ndarray) else v 
                              for k, v in results_temp_weighted.items()},
    'cnn_tfi': {k: v.tolist() if isinstance(v, np.ndarray) else v 
                for k, v in results_tfi.items()},
    'cnn_medium': {k: v.tolist() if isinstance(v, np.ndarray) else v 
                   for k, v in results_medium.items()},
    'cnn_heavy': {k: v.tolist() if isinstance(v, np.ndarray) else v 
                  for k, v in results_heavy.items()},
    'learning_curve': learning_curve_results,
    'best_model': best_name
}

with open('../results/cnn_results.json', 'w') as f:
    json.dump(cnn_results, f, indent=2)

print("\nResults saved to: results/cnn_results.json")

## 16. Final Summary and Conclusions

*Fill in after all experiments complete:*

### Key Findings

**1. CNN vs Random Forest**:
- Best CNN: ____% accuracy, ____ F1-Macro
- Random Forest: 93.0% accuracy, 0.910 F1-Macro
- Gap: ____ 
- **Conclusion**: *(CNNs beat/match/lose to RF)*

**2. Data Leakage from Random Split**:
- Random split: ____% accuracy (inflated)
- Temporal split: ____% accuracy (realistic)
- Inflation: ____%
- **Validates**: Temporal split methodology was essential

**3. Architecture Complexity**:
- Simple CNN (~1.6M params): ____% 
- Medium CNN (~2.5M params): ____%
- **Observation**: *(Does deeper help or hurt?)*

**4. Class Balancing**:
- Original: ____ F1-Right
- Class weights: ____ F1-Right
- TFI: ____ F1-Right
- **Impact**: *(Minimal/Moderate/Significant)*

**5. Dataset Size**:
- Learning curve: *(Still rising / Plateaued)*
- More data would: *(Help / Not help)*

### Honest Assessment

*Choose the appropriate conclusion based on results:*

**If CNN > 94%**:
> "CNNs successfully leveraged spatial hierarchies to beat traditional ML by X%, demonstrating that convolutional architectures can extract geometric features unavailable to pixel-based methods. The performance gain justifies the added complexity."

**If CNN ≈ 93%**:
> "CNNs achieved comparable performance to Random Forest (93% vs 93%), but with 4× longer training time and less interpretability. For this specific problem with clean edge features, simpler tree-based methods are preferable."

**If CNN < 92%**:
> "CNNs underperformed Random Forest (X% vs 93%) due to overfitting on the small 9,900-sample dataset. Despite heavy regularization, the parameter-to-sample ratio was too high. This demonstrates that deep learning is not always superior - dataset size matters more than model sophistication."

### Paper Implications

*(Fill in your honest take)*

**Key message for paper**: *(What did we learn about CNNs vs traditional ML for this task?)*