# PA4: CNN Training Dynamics - Analysis & Visualization

**Name:** [Your Name Here]

**Date:** [Date]

## Overview

This notebook explores CNN architectures, training dynamics, hyperparameter tuning, and overfitting through systematic experimentation. All visualizations use **matplotlib.pyplot only** (no seaborn).

## Learning Objectives
- Compare sequential vs. functional CNN architectures
- Evaluate different optimizers and their convergence behavior
- Design and conduct systematic hyperparameter search
- Identify, induce, and mitigate overfitting
- Assess effectiveness of training callbacks

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

# Import student implementations
from student_code import (
    build_sequential_cnn,
    build_functional_inception_cnn,
    EarlyStoppingCallback,
    LearningRateSchedulerCallback,
    train_model_with_config,
    run_grid_search
)

# Import utility functions
from utils import (
    load_mnist_data,
    plot_training_history,
    plot_sample_predictions,
    plot_grid_search_results,
    plot_overfitting_comparison,
    compare_optimizers
)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Configure matplotlib
plt.style.use('default')
%matplotlib inline

## Section 1: Data Loading and Exploration

Load the MNIST dataset, using subset sizes small enough to train on a CPU.

In [None]:
# Load data (using subset for CPU-feasible training)
(X_train, y_train), (X_val, y_val), (X_test, y_test) = load_mnist_data(
    train_samples=10000,  # Subset of full 60k training set
    val_samples=2000,
    test_samples=10000    # Full test set
)

print(f"Training set: X_train.shape = {X_train.shape}, y_train.shape = {y_train.shape}")
print(f"Validation set: X_val.shape = {X_val.shape}, y_val.shape = {y_val.shape}")
print(f"Test set: X_test.shape = {X_test.shape}, y_test.shape = {y_test.shape}")

In [None]:
# Visualize sample images using matplotlib
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    img = X_train[i].reshape(28, 28)
    label = np.argmax(y_train[i])
    ax.imshow(img, cmap='gray')
    ax.set_title(f'Label: {label}')
    ax.axis('off')
plt.suptitle('Sample MNIST Images', fontsize=14)
plt.tight_layout()
plt.show()

## Section 2: Model Architecture Comparison

Build and compare sequential and functional CNN architectures.

### Task 2.1: Build Both Models

In [None]:
# Build sequential model
model_sequential = build_sequential_cnn()
model_sequential.build(input_shape=(None, 28, 28, 1))

print("="*60)
print("SEQUENTIAL CNN ARCHITECTURE")
print("="*60)
model_sequential.summary()
print(f"\nTotal parameters: {model_sequential.count_params():,}")

In [None]:
# Build functional model with inception module
model_functional = build_functional_inception_cnn()
model_functional.build(input_shape=(None, 28, 28, 1))

print("="*60)
print("FUNCTIONAL CNN WITH INCEPTION MODULE")
print("="*60)
model_functional.summary()
print(f"\nTotal parameters: {model_functional.count_params():,}")

### Task 2.2: Analysis Questions

Answer the following questions about your architectures:

**Q1: Why is the Functional API necessary for the inception module? Could you build this with Sequential API?**

*Your answer here*

**Q2: How do the parameter counts compare? Why?**

*Your answer here*

**Q3: Explain how `padding='same'` vs. `padding='valid'` affects the output shape through your CNN layers.**

*Your answer here*

**Q4: What are the trade-offs between these two architectures in terms of complexity and expressiveness?**

*Your answer here*
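
When working on Q3, it can help to check the shape arithmetic by hand. The helper below is a minimal sketch of the standard output-size rules for one spatial axis (`'same'` gives `ceil(n / stride)`, `'valid'` gives `floor((n - kernel) / stride) + 1`). The function name `conv_output_size` and the two-block conv/pool stack are hypothetical and are not taken from `student_code.py`; substitute your own layer sizes.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial axis for a conv or pooling layer."""
    if padding == "same":
        # 'same' pads the input so the output depends only on the stride
        return math.ceil(n / stride)
    if padding == "valid":
        # 'valid' adds no padding, so the kernel must fit entirely inside the input
        return (n - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


def trace_shapes(padding, input_size=28, blocks=2):
    """Trace sizes through a hypothetical stack of (3x3 conv, 2x2 max-pool) blocks."""
    size = input_size
    trace = [size]
    for _ in range(blocks):
        size = conv_output_size(size, kernel=3, padding=padding)  # conv layer
        trace.append(size)
        size = conv_output_size(size, kernel=2, stride=2, padding="valid")  # pooling
        trace.append(size)
    return trace


print("same: ", trace_shapes("same"))   # [28, 28, 14, 14, 7]
print("valid:", trace_shapes("valid"))  # [28, 26, 13, 11, 5]
```

Comparing these traces against the shapes printed by `model.summary()` is a quick way to confirm which padding each of your layers uses.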

## Section 3: Optimizer Comparison

Compare different optimizers on the same architecture.

### Task 3.1: Design and Run Optimizer Comparison

**Your Design Choices:**
- Choose 2-3 optimizers to compare (e.g., SGD, Adam, RMSprop, Adagrad)
- Choose a learning rate to use for all optimizers (for fair comparison)
- Choose batch size and number of epochs

**Justify your choices below:**

*Why did you choose these optimizers? What learning rate makes sense? How many epochs?*

In [None]:
# TODO: Define your experimental parameters
optimizers_to_test = []  # Your choices: e.g., ['sgd', 'adam', 'rmsprop']
learning_rate = 0.001    # Your choice: what LR makes sense?
batch_size = 32          # Your choice
epochs = 20              # Your choice: enough to see convergence?

# Run the comparison
optimizer_histories = {}

for opt_name in optimizers_to_test:
    print(f"\nTraining with {opt_name.upper()}...")
    
    # Build fresh model
    model = build_sequential_cnn()
    
    # Train model
    history = train_model_with_config(
        model, X_train, y_train, X_val, y_val,
        optimizer_name=opt_name,
        learning_rate=learning_rate,
        batch_size=batch_size,
        epochs=epochs,
        verbose=1
    )
    
    optimizer_histories[opt_name] = history
    print(f"Final val accuracy: {history.history['val_accuracy'][-1]:.4f}")

In [None]:
# Visualize optimizer comparison using matplotlib
fig = compare_optimizers(optimizer_histories, metrics=['loss', 'accuracy'])
plt.show()

### Task 3.2: Optimizer Analysis Questions

**Q1: Which optimizer converged fastest? Provide evidence from your training curves.**

*Your answer here*

**Q2: Which optimizer achieved the best final validation accuracy? Why do you think this happened?**

*Your answer here*

**Q3: Did any optimizer show signs of instability or oscillation? What might cause this?**

*Your answer here*

**Q4: Based on your results, which optimizer would you recommend for this problem? Justify your choice.**

*Your answer here*
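
One way to put a number on "converged fastest" in Q1 is to count the epochs each optimizer needs to reach a fixed validation accuracy. The sketch below assumes that definition; the function name `epochs_to_reach` is hypothetical, and the curves are invented placeholders that only demonstrate the call, not real results.

```python
def epochs_to_reach(values, threshold):
    """Return the 1-based epoch at which `values` first reaches `threshold`, or None."""
    for epoch, value in enumerate(values, start=1):
        if value >= threshold:
            return epoch
    return None


# Invented placeholder curves for illustration only (NOT measured results)
toy_curves = {
    "optimizer_a": [0.62, 0.81, 0.88, 0.91, 0.93],
    "optimizer_b": [0.90, 0.95, 0.96, 0.97, 0.97],
}

for name, curve in toy_curves.items():
    print(name, epochs_to_reach(curve, threshold=0.95))
```

In this notebook, the same function could be applied to `optimizer_histories[name].history['val_accuracy']` for each optimizer you trained. The threshold is your choice, so state it alongside the result.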

## Section 4: Hyperparameter Grid Search

Systematically explore hyperparameter combinations.

### Task 4.1: Design Your Grid Search

**Your Task:** Design a grid search with 6-12 total configurations.

**Constraints:**
- CPU-feasible: Keep total training time to about 20-30 minutes at most
- Meaningful: Explore parameters that might actually matter

**Consider:**
- Which optimizers to compare?
- What learning rate ranges make sense?
- Does batch size matter for this problem?
- How many epochs per configuration?

**Document your design rationale:**

*Why did you choose these parameters and ranges? What are you trying to learn?*
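
Before committing to a grid, it is worth checking how many runs it implies and roughly how long they will take. The following is a small sketch using only the standard library. The grid values, the helper names `expand_grid` and `estimate_minutes`, and the 0.5 seconds-per-epoch figure are all placeholders for illustration; time one epoch on your own machine to get a realistic number.

```python
from itertools import product

# Hypothetical grid: the values are placeholders, not recommendations
example_grid = {
    "optimizer": ["sgd", "adam"],
    "learning_rate": [0.01, 0.001, 0.0001],
    "batch_size": [32, 64],
}


def expand_grid(grid):
    """Return one dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def estimate_minutes(n_configs, epochs, seconds_per_epoch):
    """Rough total training time, ignoring model-building and evaluation overhead."""
    return n_configs * epochs * seconds_per_epoch / 60


configs = expand_grid(example_grid)
print(len(configs))   # 2 * 3 * 2 = 12 configurations
print(configs[0])     # {'optimizer': 'sgd', 'learning_rate': 0.01, 'batch_size': 32}
print(estimate_minutes(len(configs), epochs=15, seconds_per_epoch=0.5))  # 1.5
```

Because the count is the product of the list lengths, adding one more value to any parameter multiplies the total rather than adding to it, which is why the 6-12 configuration budget fills up quickly.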

In [None]:
# TODO: Design your parameter grid (aim for 6-12 total configurations)
param_grid = {
    'optimizer': [],        # Your choices
    'learning_rate': [],    # Your choices
    'batch_size': []        # Your choices
}

# Calculate total configurations
from itertools import product
n_configs = len(list(product(*param_grid.values())))
print(f"Total configurations: {n_configs}")

# Use smaller subset for grid search to save time
X_train_small = X_train[:5000]
y_train_small = y_train[:5000]

# TODO: Choose number of epochs (balance speed vs. convergence)
grid_search_epochs = 15  # Your choice

print(f"Running grid search over {n_configs} configurations...")
# Rough estimate that assumes ~0.5 s per epoch; time one epoch on your machine and adjust
print(f"Estimated time: ~{n_configs * grid_search_epochs * 0.5 / 60:.1f} minutes on CPU\n")

results = run_grid_search(
    build_sequential_cnn,
    X_train_small, y_train_small,
    X_val, y_val,
    param_grid,
    epochs=grid_search_epochs,
    verbose=0
)

print("Grid search complete!")

In [None]:
# Display results in a table
import pandas as pd

# TODO: Adapt this to your result structure
results_df = pd.DataFrame(results)
print("Grid Search Results:")
print(results_df.to_string(index=False))

In [None]:
# Visualize grid search results using matplotlib
# TODO: Choose which parameters to visualize and which metric to focus on
fig = plot_grid_search_results(
    results,
    x_param='learning_rate',  # Adjust based on your param_grid
    color_param='optimizer',   # Adjust based on your param_grid
    metric='final_val_acc'     # Or 'final_val_loss'
)
plt.show()

### Task 4.2: Grid Search Analysis

**Q1: Which configuration achieved the best validation performance? Report all hyperparameters.**

*Your answer here*

**Q2: How did learning rate affect performance? Provide specific examples from your results.**

*Your answer here*

**Q3: What interactions did you observe between hyperparameters (e.g., does the best learning rate depend on the optimizer)?**

*Your answer here*

**Q4: If you were to run a follow-up grid search, what would you explore and why?**

*Your answer here*

## Section 5: Overfitting Analysis

Observe, induce, and mitigate overfitting through controlled experiments.

### Task 5.1: Baseline Training

First, establish a baseline with reasonable training.

In [None]:
# TODO: Choose training parameters for baseline
baseline_optimizer = 'adam'      # Your choice
baseline_lr = 0.001              # Your choice
baseline_epochs = 30             # Your choice

print("Training baseline model...")
print(f"Optimizer: {baseline_optimizer}, LR: {baseline_lr}, Epochs: {baseline_epochs}\n")

model_baseline = build_sequential_cnn()

history_baseline = train_model_with_config(
    model_baseline,
    X_train, y_train,
    X_val, y_val,
    optimizer_name=baseline_optimizer,
    learning_rate=baseline_lr,
    batch_size=32,
    epochs=baseline_epochs,
    verbose=1
)

# Calculate train/val gap
final_train_acc = history_baseline.history['accuracy'][-1]
final_val_acc = history_baseline.history['val_accuracy'][-1]
gap_baseline = final_train_acc - final_val_acc

print("\nBaseline Results:")
print(f"Final train accuracy: {final_train_acc:.4f}")
print(f"Final val accuracy: {final_val_acc:.4f}")
print(f"Train/Val gap: {gap_baseline:.4f}")

In [None]:
# Plot baseline training curves
fig = plot_training_history(history_baseline, title="Baseline Training")
plt.show()

### Task 5.2: Induce Worse Overfitting

**Your Task:** Deliberately create worse overfitting.

**Options to consider:**
- Reduce training data (e.g., use only 500-2000 samples)
- Increase model complexity (add more layers/filters)
- Train for many more epochs
- Increase learning rate

**Document your approach:**

*How will you induce overfitting? Why did you choose this approach?*

In [None]:
# TODO: Design your overfitting experiment
# Option 1: Reduce training data
overfit_train_size = 1000  # Your choice (e.g., 500, 1000, 2000)
X_train_overfit = X_train[:overfit_train_size]
y_train_overfit = y_train[:overfit_train_size]

# Option 2: Or modify other parameters
overfit_epochs = 30       # Your choice
overfit_lr = 0.001        # Your choice

print("Inducing overfitting...")
print(f"Strategy: Training set reduced to {overfit_train_size} samples\n")

model_overfit = build_sequential_cnn()

history_overfit = train_model_with_config(
    model_overfit,
    X_train_overfit, y_train_overfit,
    X_val, y_val,
    optimizer_name=baseline_optimizer,
    learning_rate=overfit_lr,
    batch_size=32,
    epochs=overfit_epochs,
    verbose=1
)

# Calculate train/val gap
final_train_acc_overfit = history_overfit.history['accuracy'][-1]
final_val_acc_overfit = history_overfit.history['val_accuracy'][-1]
gap_overfit = final_train_acc_overfit - final_val_acc_overfit

print("\nOverfitting Results:")
print(f"Final train accuracy: {final_train_acc_overfit:.4f}")
print(f"Final val accuracy: {final_val_acc_overfit:.4f}")
print(f"Train/Val gap: {gap_overfit:.4f}")
# Report both gaps without assuming the direction of the change
print(f"\nTrain/val gap: baseline {gap_baseline:.4f} vs. reduced data {gap_overfit:.4f}")

### Task 5.3: Mitigate Overfitting with Early Stopping

**Your Task:** Choose early stopping parameters to mitigate overfitting.

**Consider:**
- What patience value makes sense?
- Should you monitor loss or accuracy?
- What about min_delta?

**Document your choices:**

*Why did you choose these early stopping parameters?*
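
To build intuition for `patience` and `min_delta` before training, you can replay the stopping rule on a list of validation losses. This is a sketch of the common patience logic for a loss-style metric (lower is better). It is not the `EarlyStoppingCallback` from `student_code.py`, whose details may differ. The function name `simulate_early_stopping` is hypothetical and the loss values are invented for illustration.

```python
def simulate_early_stopping(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based, for a sequence of validation losses.

    An epoch counts as an improvement only if the loss falls more than
    `min_delta` below the best value seen so far.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering the stop
    return len(val_losses), best_epoch


# Invented loss curve with a temporary plateau before a later, lower minimum
toy_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.38, 0.39, 0.40]

print(simulate_early_stopping(toy_losses, patience=2))                  # (5, 3)
print(simulate_early_stopping(toy_losses, patience=3))                  # (9, 6)
print(simulate_early_stopping(toy_losses, patience=3, min_delta=0.02))  # (6, 3)
```

On this toy curve, a patience of 2 stops during the plateau and misses the lower loss at epoch 6, while a `min_delta` of 0.02 treats that small improvement as noise. The same trade-off applies when you pick values for the real run.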

In [None]:
# TODO: Design your early stopping strategy
es_monitor = 'val_loss'          # Your choice: 'val_loss' or 'val_accuracy'?
es_patience = 5                  # Your choice: how patient should we be?
es_min_delta = 0.0               # Your choice: minimum improvement to count?

print("Training with early stopping...")
print(f"Monitor: {es_monitor}, Patience: {es_patience}, Min Delta: {es_min_delta}\n")

model_early_stop = build_sequential_cnn()

# Create early stopping callback with your parameters
early_stop_callback = EarlyStoppingCallback(
    monitor=es_monitor,
    patience=es_patience,
    min_delta=es_min_delta,
    restore_best_weights=True
)

history_early_stop = train_model_with_config(
    model_early_stop,
    X_train_overfit, y_train_overfit,  # Same overfitting scenario
    X_val, y_val,
    optimizer_name=baseline_optimizer,
    learning_rate=overfit_lr,
    batch_size=32,
    epochs=50,  # Set high - early stopping will kick in
    callbacks=[early_stop_callback],
    verbose=1
)

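# Note: history records every epoch that ran, so the [-1] entries describe the last
# epoch before stopping. With restore_best_weights=True the model itself is rolled
# back to its best epoch, whose metrics may differ from the values reported here.
epochs_trained = len(history_early_stop.history['loss'])
final_train_acc_es = history_early_stop.history['accuracy'][-1]
final_val_acc_es = history_early_stop.history['val_accuracy'][-1]
gap_early_stop = final_train_acc_es - final_val_acc_es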

print("\nEarly Stopping Results:")
print(f"Stopped at epoch: {epochs_trained}/50")
print(f"Final train accuracy: {final_train_acc_es:.4f}")
print(f"Final val accuracy: {final_val_acc_es:.4f}")
print(f"Train/Val gap: {gap_early_stop:.4f}")

In [None]:
# Compare all three scenarios using matplotlib
scenarios = {
    f'Baseline ({len(X_train)} samples)': history_baseline,
    f'Overfitting ({overfit_train_size} samples)': history_overfit,
    f'With Early Stopping (patience={es_patience})': history_early_stop
}

fig = plot_overfitting_comparison(scenarios, metric='accuracy')
plt.show()

### Task 5.4: Overfitting Analysis Questions

**Q1: Describe the overfitting you observed in your baseline model. What evidence supports your conclusion?**

*Your answer here*

**Q2: How effective was your strategy for inducing overfitting? Explain the mechanism.**

*Your answer here*

**Q3: How effective was early stopping at mitigating overfitting? Support your answer with specific metrics.**

*Your answer here*

**Q4: At what point during training did overfitting become severe? How can you tell from the curves?**

*Your answer here*

**Q5: What other strategies could you use to reduce overfitting in this scenario?**

*Your answer here*
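
For Q1 and Q4, two simple quantities are often useful: the per-epoch gap between training and validation accuracy, and the epoch at which validation loss is lowest (after which further training mostly fits the training set). The helpers below are a sketch under that interpretation. Their names are hypothetical and the numbers are invented placeholders, not results.

```python
def generalization_gaps(train_acc, val_acc):
    """Per-epoch gap: training accuracy minus validation accuracy."""
    return [t - v for t, v in zip(train_acc, val_acc)]


def best_val_loss_epoch(val_loss):
    """1-based epoch with the lowest validation loss (first occurrence on ties)."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


# Invented placeholder curves for illustration only (NOT measured results)
toy_train_acc = [0.80, 0.90, 0.96, 0.99, 1.00]
toy_val_acc = [0.78, 0.86, 0.88, 0.88, 0.87]
toy_val_loss = [0.70, 0.45, 0.40, 0.48, 0.60]

gaps = generalization_gaps(toy_train_acc, toy_val_acc)
print([round(g, 2) for g in gaps])
print(best_val_loss_epoch(toy_val_loss))  # 3
```

With the notebook's variables, the inputs would be `history_overfit.history['accuracy']`, `history_overfit.history['val_accuracy']`, and `history_overfit.history['val_loss']`, assuming the history object records those keys.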

## Section 6: Callback Effectiveness

Analyze the impact of learning rate scheduling.

### Task 6.1: Design Learning Rate Schedule

**Your Task:** Choose LR scheduling parameters.

**Consider:**
- What initial learning rate?
- How aggressively should LR decay (decay_rate)?
- How often should it decay (decay_steps)?
- Which optimizer benefits most from scheduling?

**Document your design:**

*Why did you choose these scheduling parameters?*
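
Printing the schedule before training makes it easier to judge whether the decay is too aggressive. The sketch below assumes a staircase (step) decay, where the learning rate is multiplied by `decay_rate` once every `decay_steps` epochs. Your `LearningRateSchedulerCallback` may implement a different formula, so treat `step_decay` as a hypothetical stand-in and compare it against your own implementation.

```python
def step_decay(epoch, initial_lr, decay_rate, decay_steps):
    """Staircase decay for a 0-based epoch index."""
    return initial_lr * decay_rate ** (epoch // decay_steps)


# Preview a 30-epoch run with initial_lr=0.01, decay_rate=0.5, decay_steps=10
for epoch in (0, 9, 10, 19, 20, 29):
    lr = step_decay(epoch, initial_lr=0.01, decay_rate=0.5, decay_steps=10)
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

Under this assumption the learning rate is 0.0100 for epochs 0-9, 0.0050 for epochs 10-19, and 0.0025 for epochs 20-29, so a 30-epoch run sees only two decay events.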

In [None]:
# TODO: Design your LR scheduling experiment
lr_initial = 0.01        # Your choice
lr_decay_rate = 0.5      # Your choice (e.g., 0.5 = halve, 0.1 = reduce by 10x)
lr_decay_steps = 10      # Your choice (decay every N epochs)
lr_optimizer = 'sgd'     # Your choice (SGD often benefits most)

print("Training with learning rate scheduler...")
print(f"Initial LR: {lr_initial}, Decay rate: {lr_decay_rate}, Decay every: {lr_decay_steps} epochs\n")

model_lr_schedule = build_sequential_cnn()

# Create LR scheduler callback
lr_scheduler = LearningRateSchedulerCallback(
    initial_lr=lr_initial,
    decay_rate=lr_decay_rate,
    decay_steps=lr_decay_steps
)

history_lr_schedule = train_model_with_config(
    model_lr_schedule,
    X_train, y_train,
    X_val, y_val,
    optimizer_name=lr_optimizer,
    learning_rate=lr_initial,
    batch_size=32,
    epochs=30,
    callbacks=[lr_scheduler],
    verbose=1
)

print("\nLR Schedule completed.")

In [None]:
# Plot LR scheduling effect
fig = plot_training_history(
    history_lr_schedule, 
    title=f"Training with LR Decay (rate={lr_decay_rate}, every {lr_decay_steps} epochs)"
)
plt.show()

### Task 6.2: Callback Analysis Questions

**Q1: How did you choose your early stopping patience? What factors did you consider?**

*Your answer here*

**Q2: Did the learning rate scheduler improve convergence compared to a fixed LR (ideally a run with the same optimizer and initial LR but no scheduler)? Provide evidence.**

*Your answer here*

**Q3: How would you choose decay_rate and decay_steps in practice for a new problem?**

*Your answer here*

**Q4: Could you combine early stopping and LR scheduling? What might be the benefits and challenges?**

*Your answer here*

## Section 7: Executive Summary

### Task 7.1: Key Findings

Summarize your key findings from this assignment in 3-5 bullet points:

*Your summary here*

### Task 7.2: Practical Recommendations

Based on your experiments, what recommendations would you give to someone training CNNs for image classification?

*Your recommendations here*

### Task 7.3: Reflection

What was the most surprising or interesting finding from your experiments?

*Your reflection here*

## Section 8: Peer Review Preparation

### For your peer reviewers:

**What aspect of your analysis are you most proud of?**

*Your answer here*

**What question or uncertainty would you like feedback on?**

*Your answer here*

**What was the most challenging part of this assignment?**

*Your answer here*

---

## Submission Checklist

Before submitting, ensure you have:

- [ ] Completed all code implementations in `student_code.py`
- [ ] All tests passing (`python -m pytest tests.py`)
- [ ] Filled in all TODO sections in this notebook with your design choices
- [ ] Justified your experimental design decisions
- [ ] Answered all analysis questions
- [ ] Created clear matplotlib visualizations (no seaborn)
- [ ] Written executive summary and reflections
- [ ] Exported this notebook as PDF for peer review
- [ ] Double-checked that all cells run without errors

**Note:** Export this notebook as PDF via: File → Save and Export Notebook As → PDF