# Task 4.5 Solution: Training Diagnostics Lab

This notebook contains solutions to the exercises from Task 4.5.

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
import time
import os
import gzip
import urllib.request

np.random.seed(42)
plt.style.use('default')
%matplotlib inline

print("Setup complete!")

In [None]:
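# Load MNIST for experiments
def load_mnist(path='../data'):
    """Download MNIST (if needed) and return (X_train, y_train, X_test, y_test).

    Files already present in `path` are reused, so the four .gz files can be
    placed there manually if the download fails.
    """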
    os.makedirs(path, exist_ok=True)
    base_url = 'http://yann.lecun.com/exdb/mnist/'
    files = {
        'train_images': 'train-images-idx3-ubyte.gz',
        'train_labels': 'train-labels-idx1-ubyte.gz',
        'test_images': 't10k-images-idx3-ubyte.gz',
        'test_labels': 't10k-labels-idx1-ubyte.gz'
    }
    
    def download(filename):
        filepath = os.path.join(path, filename)
        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(base_url + filename, filepath)
        return filepath
    
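    def load_images(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(16)  # skip the 16-byte IDX header (magic, count, rows, cols)
            # Flatten each 28x28 image to 784 values and scale pixels to [0, 1]
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, 784).astype(np.float32) / 255.0
    
    def load_labels(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(8)  # skip the 8-byte IDX header (magic, count)
            return np.frombuffer(f.read(), dtype=np.uint8)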
    
    return (load_images(download(files['train_images'])),
            load_labels(download(files['train_labels'])),
            load_images(download(files['test_images'])),
            load_labels(download(files['test_labels'])))

X_train, y_train, X_test, y_test = load_mnist()
print(f"Loaded {len(X_train)} training samples")

---

## Exercise Solution: Diagnose These Training Curves

### Problem A: Learning Rate Too High (Loss Exploding)

**Symptoms:**
- Loss increases exponentially
- Training is completely unstable
- Model outputs become NaN

**Diagnosis:** The learning rate is so high that each weight update overshoots the minimum by a wide margin, so the loss grows from step to step instead of shrinking.

**Fix:** Decrease learning rate by 10-100x
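
A minimal sketch of why a large step makes the loss explode, assuming a toy one-dimensional loss f(w) = w² rather than the MNIST model. The gradient is 2w, so every update multiplies w by (1 - 2·lr); once |1 - 2·lr| exceeds 1, the iterate grows geometrically. The helper name `gradient_descent_on_quadratic` is purely illustrative.

```python
import numpy as np

def gradient_descent_on_quadratic(lr: float, steps: int = 20, w0: float = 1.0) -> np.ndarray:
    """Run gradient descent on f(w) = w**2 and return the loss before each step."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(w ** 2)
        grad = 2.0 * w       # df/dw
        w = w - lr * grad    # equivalent to w *= (1 - 2 * lr)
    return np.array(losses)

stable = gradient_descent_on_quadratic(lr=0.1)     # |1 - 0.2| = 0.8 < 1: shrinks
diverging = gradient_descent_on_quadratic(lr=1.5)  # |1 - 3.0| = 2.0 > 1: grows

print(f"lr=0.1 final loss: {stable[-1]:.2e}")
print(f"lr=1.5 final loss: {diverging[-1]:.2e}")
```

On a real network the threshold depends on the local curvature of the loss, which is why the practical advice is to cut the learning rate by an order of magnitude or more and retry.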

In [None]:
# Visualize Problem A
fig, ax = plt.subplots(figsize=(8, 4))

x = np.arange(20)
y_exploding = np.exp(x * 0.3) + np.random.randn(20) * 10

ax.plot(y_exploding, 'r-', linewidth=2, label='LR too high')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Problem A: Learning Rate Too High - Loss Exploding')
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

print("Diagnosis: Learning rate too high")
print("Fix: Decrease LR by 10-100x (e.g., from 10.0 to 0.1)")

---

### Problem B: Learning Rate Too Low (Loss Stuck)

**Symptoms:**
- Loss barely decreases
- Training takes extremely long
- Loss appears flat after many epochs

**Diagnosis:** The learning rate is so small that weight updates are negligible. The model is learning, but at a glacial pace.

**Fix:** Increase learning rate by 10-100x
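
The visualization for this problem holds the loss near 2.3. For a 10-class problem such as MNIST, that is roughly the cross-entropy of a classifier that assigns equal probability to every class, so a curve stuck there suggests the model is still close to chance level. A quick check of that number:

```python
import numpy as np

num_classes = 10
uniform_probs = np.full(num_classes, 1.0 / num_classes)

# Cross-entropy for any true label when every class gets probability 1/10
chance_loss = -np.log(uniform_probs[0])
print(f"Chance-level loss for {num_classes} classes: {chance_loss:.4f}")
```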

In [None]:
# Visualize Problem B
fig, ax = plt.subplots(figsize=(8, 4))

x = np.arange(20)
y_stuck = 2.3 + np.random.randn(20) * 0.01

ax.plot(y_stuck, 'b-', linewidth=2, label='LR too low')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Problem B: Learning Rate Too Low - Loss Stuck')
ax.set_ylim(0, 5)
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

print("Diagnosis: Learning rate too low")
print("Fix: Increase LR by 10-100x (e.g., from 0.0001 to 0.001 or 0.01)")

---

### Problem C: Overfitting

**Symptoms:**
- Training loss decreases steadily
- Validation loss decreases initially, then starts increasing
- Gap between train and validation loss grows

**Diagnosis:** The model is memorizing the training data instead of learning generalizable patterns. It performs well on training data but poorly on unseen data.

**Fixes:**
1. Add regularization (L2, Dropout)
2. Get more training data
3. Use data augmentation
4. Reduce model complexity
5. Early stopping
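
Early stopping (fix 5) can be sketched as a patience rule over the recorded validation losses. This is one common formulation, not the only one; the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the sample `val_history` values are all illustrative assumptions.

```python
from typing import List, Tuple

def early_stopping_epoch(val_losses: List[float], patience: int = 3,
                         min_delta: float = 0.0) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a recorded validation-loss history.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss by more than `min_delta`.
    """
    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    # Patience never ran out: training used every epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: improves, then rises as overfitting sets in
val_history = [2.30, 1.90, 1.70, 1.62, 1.60, 1.63, 1.68, 1.75, 1.83]
best_epoch, stop_epoch = early_stopping_epoch(val_history, patience=3)
print(f"Best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

In a real training loop you would also keep a copy of the weights from `best_epoch` and restore them after stopping.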

In [None]:
# Visualize Problem C
fig, ax = plt.subplots(figsize=(8, 4))

x = np.arange(20)
y_train_loss = 2.3 - np.log(x + 1) * 0.8 + np.random.randn(20) * 0.05
y_val_loss = 2.3 - np.log(x + 1) * 0.3 + x * 0.05 + np.random.randn(20) * 0.05

ax.plot(y_train_loss, 'b-', linewidth=2, label='Train Loss')
ax.plot(y_val_loss, 'r-', linewidth=2, label='Validation Loss')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Problem C: Overfitting - Train/Val Divergence')
ax.grid(True, alpha=0.3)

# Mark the divergence point
ax.axvline(x=8, color='orange', linestyle='--', alpha=0.7, label='Divergence point')
ax.annotate('Overfitting starts here', xy=(8, 1.5), xytext=(12, 1.8),
            arrowprops=dict(arrowstyle='->', color='orange'),
            fontsize=10, color='orange')

# Build the legend last so it includes the 'Divergence point' line
ax.legend()

plt.tight_layout()
plt.show()

print("Diagnosis: Overfitting")
print("Fixes:")
print("  1. Add L2 regularization (weight decay)")
print("  2. Add Dropout layers")
print("  3. Get more training data")
print("  4. Use early stopping at the divergence point")

---

### Problem D: Learning Rate Too High (Oscillating)

**Symptoms:**
- Loss oscillates up and down
- No steady decrease
- Training is unstable but doesn't explode

**Diagnosis:** The learning rate is high enough that updates keep overshooting and the loss bounces around the optimum, but not so high that the loss explodes.

**Fixes:**
1. Decrease learning rate by 2-10x
2. Use learning rate scheduling
3. Increase batch size
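
Learning rate scheduling (fix 2) can take many forms. The sketch below shows a simplified reduce-on-plateau rule: halve the learning rate when the loss has not improved for a set number of epochs. The class `ReduceOnPlateau`, its parameters, and the sample loss values are illustrative assumptions, not a library API.

```python
class ReduceOnPlateau:
    """Cut the learning rate by `factor` when the loss stops improving."""

    def __init__(self, lr: float, factor: float = 0.5, patience: int = 2,
                 min_lr: float = 1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best_loss = float('inf')
        self.bad_epochs = 0

    def step(self, loss: float) -> float:
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # No new best for `patience` epochs: shrink the step size
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# Hypothetical oscillating loss history
losses = [1.8, 1.5, 1.9, 1.6, 1.7, 1.4, 1.6, 1.5]
scheduler = ReduceOnPlateau(lr=0.4, factor=0.5, patience=2)
lr_history = [scheduler.step(loss) for loss in losses]
print(lr_history)
```

With these values the rate drops twice, each time after two epochs without a new best loss.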

In [None]:
# Visualize Problem D
fig, ax = plt.subplots(figsize=(8, 4))

x = np.arange(20)
y_oscillating = 1.5 + np.sin(x * 0.5) * 0.5 + np.random.randn(20) * 0.3

ax.plot(y_oscillating, 'g-', linewidth=2, label='LR causing oscillation')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Problem D: Learning Rate Too High - Oscillating')
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

print("Diagnosis: Learning rate too high (oscillating)")
print("Fixes:")
print("  1. Decrease LR by 2-10x")
print("  2. Use learning rate scheduling (reduce on plateau)")
print("  3. Increase batch size for more stable gradients")

---

## Summary: Quick Reference Table

| Symptom | Diagnosis | Fix |
|---------|-----------|-----|
| Loss explodes to Inf/NaN | LR too high | Decrease LR by 10-100x |
| Loss barely moves | LR too low | Increase LR by 10-100x |
| Loss oscillates | LR too high | Decrease LR by 2-10x |
| Train loss down, val loss up | Overfitting | Add regularization, more data |
| Both losses high, not moving | Underfitting | Bigger model, train longer |
| Can't overfit one batch | Bug in code | Debug forward/backward |
| Early layer gradients = 0 | Vanishing gradients | Use ReLU, He init |
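
The "can't overfit one batch" row refers to a sanity check that can be sketched as follows: train on a single fixed batch and confirm the loss approaches zero. To keep the example self-contained it uses a plain softmax-regression model on random data rather than the notebook's MLP; the sizes, learning rate, and step count are assumptions chosen so that memorization is easy.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 32, 256, 4

# One fixed batch with random labels: there is nothing to generalize,
# so the only way to reduce the loss is to memorize the batch.
X = rng.standard_normal((n_samples, n_features))
y = rng.integers(0, n_classes, size=n_samples)

W = np.zeros((n_features, n_classes))
b = np.zeros(n_classes)
lr = 0.1
losses = []

for step in range(200):
    probs = softmax(X @ W + b)
    losses.append(-np.mean(np.log(probs[np.arange(n_samples), y] + 1e-10)))

    grad = probs.copy()
    grad[np.arange(n_samples), y] -= 1  # dLoss/dLogits for softmax + cross-entropy
    W -= lr * (X.T @ grad) / n_samples
    b -= lr * grad.mean(axis=0)

final_acc = np.mean(np.argmax(X @ W + b, axis=1) == y)
print(f"Loss: {losses[0]:.3f} -> {losses[-1]:.3f}, batch accuracy: {final_acc:.0%}")
```

If the same procedure applied to your own model leaves the loss far from zero, suspect the forward pass, the gradient computation, or the label handling before tuning any hyperparameters.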

---

## Bonus: Implementing the Fixes

### Fix 1: Learning Rate Comparison

In [None]:
class SimpleMLP:
    """Simple MLP for demonstrating LR effects."""
    
    def __init__(self, layer_sizes: List[int]):
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            W = np.random.randn(layer_sizes[i], layer_sizes[i + 1]).astype(np.float32) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i + 1], dtype=np.float32)
            self.layers.append({'W': W, 'b': b, 'cache': {}})
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        out = X
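        # Hidden layers: affine transform followed by ReLU
        for layer in self.layers[:-1]:
            layer['cache']['X'] = out
            out = out @ layer['W'] + layer['b']
            layer['cache']['Z'] = out  # pre-activation, needed for the ReLU gradient
            out = np.maximum(0, out)  # ReLU
        
        # Output layer: affine transform only (softmax is applied below)
        self.layers[-1]['cache']['X'] = out
        out = out @ self.layers[-1]['W'] + self.layers[-1]['b']
        
        # Numerically stable softmax: subtract the row max before exponentiating
        out_shifted = out - np.max(out, axis=1, keepdims=True)
        exp_out = np.exp(out_shifted)
        self.probs = exp_out / np.sum(exp_out, axis=1, keepdims=True)
        return self.probs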
    
    def backward(self, targets: np.ndarray, lr: float) -> None:
        batch_size = len(targets)
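        # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot);
        # the 1/batch_size factor is applied in dW and db below
        grad = self.probs.copy()
        grad[np.arange(batch_size), targets] -= 1
        
        for i in range(len(self.layers) - 1, -1, -1):
            layer = self.layers[i]
            X = layer['cache']['X']
            
            dW = X.T @ grad / batch_size
            db = np.mean(grad, axis=0)
            # Propagate to this layer's input using the weights before they are updated
            grad = grad @ layer['W'].T
            
            if i > 0:
                # This layer's input is ReLU(Z) of the previous layer
                Z = self.layers[i - 1]['cache']['Z']
                grad = grad * (Z > 0)
            
            layer['W'] -= lr * dW
            layer['b'] -= lr * db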
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.argmax(self.forward(X), axis=1)

In [None]:
def train_with_lr(lr: float, epochs: int = 20) -> Dict[str, List[float]]:
    """Train model with given learning rate."""
    np.random.seed(42)
    model = SimpleMLP([784, 256, 128, 10])
    history = {'loss': [], 'acc': []}
    
    X_sub, y_sub = X_train[:5000], y_train[:5000]
    batch_size = 64
    
    for epoch in range(epochs):
        indices = np.random.permutation(len(X_sub))
        epoch_loss = 0
        n_batches = 0
        
        for start in range(0, len(X_sub), batch_size):
            batch_idx = indices[start:start + batch_size]
            X_batch = X_sub[batch_idx]
            y_batch = y_sub[batch_idx]
            
            probs = model.forward(X_batch)
            loss = -np.mean(np.log(probs[np.arange(len(y_batch)), y_batch] + 1e-10))
            
            # Clamp NaN or very large losses so a diverged run still plots
            if np.isnan(loss) or loss > 100:
                loss = 100
            
            model.backward(y_batch, lr)
            epoch_loss += loss
            n_batches += 1
        
        acc = np.mean(model.predict(X_test[:1000]) == y_test[:1000])
        history['loss'].append(epoch_loss / n_batches)
        history['acc'].append(acc)
    
    return history

# Compare different learning rates
print("Training with different learning rates...")
lr_high = train_with_lr(10.0)
lr_good = train_with_lr(0.1)
lr_low = train_with_lr(0.0001)
print("Done!")

In [None]:
# Visualize LR comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(lr_high['loss'], 'r-', linewidth=2, alpha=0.7, label='LR=10.0 (too high)')
axes[0].plot(lr_good['loss'], 'g-', linewidth=2, label='LR=0.1 (good)')
axes[0].plot(lr_low['loss'], 'b-', linewidth=2, alpha=0.7, label='LR=0.0001 (too low)')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Curves: Learning Rate Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(0, 10)

axes[1].plot(lr_high['acc'], 'r-', linewidth=2, alpha=0.7, label='LR=10.0 (too high)')
axes[1].plot(lr_good['acc'], 'g-', linewidth=2, label='LR=0.1 (good)')
axes[1].plot(lr_low['acc'], 'b-', linewidth=2, alpha=0.7, label='LR=0.0001 (too low)')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Curves: Learning Rate Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nFinal Results:")
print(f"  LR=10.0:   Loss={lr_high['loss'][-1]:.4f}, Acc={lr_high['acc'][-1]:.2%}")
print(f"  LR=0.1:    Loss={lr_good['loss'][-1]:.4f}, Acc={lr_good['acc'][-1]:.2%}")
print(f"  LR=0.0001: Loss={lr_low['loss'][-1]:.4f}, Acc={lr_low['acc'][-1]:.2%}")

---

## Key Takeaways

1. **Always start with the 'overfit one batch' test** - if your model can't memorize 32 samples, there's a bug in your code.

2. **Learning rate is usually the first thing to check** - try 10x higher or 10x lower to diagnose.

3. **Vanishing gradients are sneaky** - use ReLU activation and proper He initialization.

4. **Overfitting shows up as train/val divergence** - add regularization (L2, Dropout) or get more data.

5. **Track gradient magnitudes** - if early layers have tiny gradients, you have a vanishing gradient problem.
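
Takeaways 3 and 5 can be illustrated with a small experiment: backpropagate through a deep random MLP and compare the weight-gradient norm of the first layer with that of the last. This is a sketch under stated assumptions (8 layers of width 64, random inputs, a dummy upstream gradient of ones, no biases), and `layer_gradient_norms` is a hypothetical helper, not part of the notebook's `SimpleMLP`.

```python
import numpy as np
from typing import List

def layer_gradient_norms(activation: str, n_layers: int = 8, width: int = 64,
                         seed: int = 0) -> List[float]:
    """Backpropagate a dummy gradient through a random MLP and return the
    weight-gradient norm of each layer, ordered from first layer to last."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((32, width))

    # Forward pass, caching each layer's weights, input, and pre-activation
    weights, inputs, pre_acts = [], [], []
    for _ in range(n_layers):
        scale = np.sqrt(2.0 / width) if activation == 'relu' else np.sqrt(1.0 / width)
        W = rng.standard_normal((width, width)) * scale  # He init for ReLU
        z = x @ W
        weights.append(W)
        inputs.append(x)
        pre_acts.append(z)
        x = np.maximum(0, z) if activation == 'relu' else 1.0 / (1.0 + np.exp(-z))

    # Backward pass, starting from a dummy upstream gradient of ones
    grad = np.ones_like(x)
    norms = []
    for W, x_in, z in zip(reversed(weights), reversed(inputs), reversed(pre_acts)):
        if activation == 'relu':
            grad = grad * (z > 0)
        else:
            s = 1.0 / (1.0 + np.exp(-z))
            grad = grad * s * (1.0 - s)  # sigmoid derivative, at most 0.25
        norms.append(float(np.linalg.norm(x_in.T @ grad)))
        grad = grad @ W.T
    return norms[::-1]

ratios = {}
for name in ('sigmoid', 'relu'):
    norms = layer_gradient_norms(name)
    ratios[name] = norms[0] / norms[-1]
    print(f"{name:8s} first-layer / last-layer gradient norm: {ratios[name]:.2e}")
```

A ratio far below 1 means the first layer receives a much weaker learning signal than the last, which is the pattern to look for when logging per-layer gradient norms during training.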

---

## Cleanup

In [None]:
import gc
gc.collect()

print("Cleanup complete!")