# Lab 1.5.3: Regularization Experiments

**Module:** 1.5 - Neural Network Fundamentals  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand overfitting and how to detect it
- [ ] Implement L2 (weight decay) regularization
- [ ] Implement Dropout regularization
- [ ] Visualize the underfitting ↔ overfitting spectrum
- [ ] Find optimal regularization strength experimentally

---

## 📚 Prerequisites

- Completed: Notebooks 01-02
- Knowledge of: Neural network training basics

---

## 🌍 Real-World Context

**Why regularization matters:**

In 2012, the ImageNet competition was won by AlexNet, which used **Dropout** - a regularization technique that was considered radical at the time. Now it's standard practice!

Modern LLMs use various regularization techniques:
- **Weight decay** in AdamW optimizer
- **Dropout** in attention and feed-forward layers (though many large-scale pretraining runs reduce or disable it)
- **Early stopping** during training

Understanding regularization is crucial for training models that generalize well to new data.

---

## 🧒 ELI5: What is Overfitting?

> **Imagine you're studying for a test by memorizing the practice problems.**
>
> If you just memorize the exact answers, you'll do great on practice problems, but terrible on the real test with new questions. That's **overfitting**!
>
> The goal is to **understand the concepts** so you can answer any question, not just memorize specific answers.
>
> **Regularization is like a teacher saying:**
> - "Don't just memorize - explain the concept!" (L2 regularization)
> - "Can you still solve this if I cover part of your notes?" (Dropout)
> - "Stop studying when you're not improving anymore" (Early stopping)
>
> These techniques force the model to learn general patterns, not specific examples.

---

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict
import time
import sys
import os
from pathlib import Path

# Add scripts directory to path (robust approach)
notebook_dir = Path().resolve()
if notebook_dir.name == 'notebooks':
    scripts_dir = notebook_dir.parent / 'scripts'
else:
    scripts_dir = notebook_dir / 'scripts'
    if not scripts_dir.exists():
        scripts_dir = notebook_dir.parent / 'scripts'

if scripts_dir.exists():
    sys.path.insert(0, str(scripts_dir))

np.random.seed(42)
plt.style.use('default')
%matplotlib inline

print("Setup complete!")

---

## Part 1: Creating an Overfitting-Prone Dataset

To study overfitting, we need a scenario where it's likely to happen:
- Small training set
- Large model capacity
- No regularization
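
To make "large model capacity" concrete, the sketch below counts the trainable parameters of the `[784, 512, 256, 10]` architecture that this lab trains and compares the count with the 1,000-sample training set. The helper name `count_parameters` is an illustrative assumption, not part of the lab's scripts.

```python
# Count trainable parameters for a fully connected network.
layer_sizes = [784, 512, 256, 10]  # architecture used in this lab
train_size = 1000                  # size of the small training subset


def count_parameters(sizes):
    """Weights (fan_in * fan_out) plus biases (fan_out) for each layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


n_params = count_parameters(layer_sizes)
print(f"Parameters: {n_params:,}")                                      # Parameters: 535,818
print(f"Parameters per training sample: {n_params / train_size:.0f}")   # 536
```

With several hundred parameters available per training example, the network has more than enough freedom to memorize the training set, which is the condition this part sets up.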

In [None]:
# Load MNIST
import gzip
import urllib.request

def load_mnist(path='../data'):
    os.makedirs(path, exist_ok=True)
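    # Note: this host is sometimes unreachable; if the download fails, point base_url at a mirror you trust
    base_url = 'http://yann.lecun.com/exdb/mnist/'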
    files = {
        'train_images': 'train-images-idx3-ubyte.gz',
        'train_labels': 'train-labels-idx1-ubyte.gz',
        'test_images': 't10k-images-idx3-ubyte.gz',
        'test_labels': 't10k-labels-idx1-ubyte.gz'
    }
    
    def download(filename):
        filepath = os.path.join(path, filename)
        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(base_url + filename, filepath)
        return filepath
    
    def load_images(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(16)
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, 784).astype(np.float32) / 255.0
    
    def load_labels(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(8)
            return np.frombuffer(f.read(), dtype=np.uint8)
    
    X_train = load_images(download(files['train_images']))
    y_train = load_labels(download(files['train_labels']))
    X_test = load_images(download(files['test_images']))
    y_test = load_labels(download(files['test_labels']))
    
    return X_train, y_train, X_test, y_test

print("Loading MNIST...")
X_train_full, y_train_full, X_test, y_test = load_mnist()

# Create small training set (overfitting-prone)
TRAIN_SIZE = 1000  # Very small!
X_train = X_train_full[:TRAIN_SIZE]
y_train = y_train_full[:TRAIN_SIZE]

print(f"\n📊 Dataset sizes:")
print(f"   Training: {len(X_train):,} samples (small, overfitting-prone!)")
print(f"   Test: {len(X_test):,} samples")
print(f"   Ratio: 1:{len(X_test)//len(X_train)} (test is {len(X_test)//len(X_train)}x larger!)")

---

## Part 2: Building the Model with Regularization Options

In [None]:
class RegularizedMLP:
    """
    MLP with L2 regularization and Dropout support.
    """
    
    def __init__(
        self, 
        layer_sizes: List[int],
        l2_lambda: float = 0.0,
        dropout_rate: float = 0.0
    ):
        """
        Args:
            layer_sizes: e.g., [784, 512, 256, 10]
            l2_lambda: L2 regularization strength (0 = no regularization)
            dropout_rate: Probability of dropping neurons (0 = no dropout)
        """
        self.l2_lambda = l2_lambda
        self.dropout_rate = dropout_rate
        self.training = True
        
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            # He initialization
            W = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i + 1])
            self.layers.append({
                'W': W, 'b': b, 
                'cache': {}, 
                'dW': None, 'db': None,
                'dropout_mask': None
            })
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        out = X
        
        for i, layer in enumerate(self.layers[:-1]):
            layer['cache']['X'] = out
            
            # Linear
            out = out @ layer['W'] + layer['b']
            
            # ReLU
            layer['cache']['Z'] = out
            out = np.maximum(0, out)
            
            # Dropout (only during training)
            if self.training and self.dropout_rate > 0:
                mask = (np.random.rand(*out.shape) > self.dropout_rate).astype(float)
                out = out * mask / (1 - self.dropout_rate)  # Inverted dropout
                layer['dropout_mask'] = mask
            else:
                layer['dropout_mask'] = None
        
        # Output layer
        self.layers[-1]['cache']['X'] = out
        out = out @ self.layers[-1]['W'] + self.layers[-1]['b']
        
        # Softmax
        out_shifted = out - np.max(out, axis=1, keepdims=True)
        exp_out = np.exp(out_shifted)
        self.probs = exp_out / np.sum(exp_out, axis=1, keepdims=True)
        
        return self.probs
    
    def compute_loss(self, targets: np.ndarray) -> float:
        """Compute cross-entropy loss + L2 regularization."""
        batch_size = len(targets)
        
        # Cross-entropy loss
        ce_loss = -np.mean(np.log(self.probs[np.arange(batch_size), targets] + 1e-10))
        
        # L2 regularization loss
        l2_loss = 0.0
        if self.l2_lambda > 0:
            for layer in self.layers:
                l2_loss += np.sum(layer['W'] ** 2)
            l2_loss = 0.5 * self.l2_lambda * l2_loss
        
        return ce_loss + l2_loss
    
    def backward(self, targets: np.ndarray, learning_rate: float):
        """Backward pass with L2 regularization."""
        batch_size = len(targets)
        
        # Gradient from softmax + cross-entropy
        grad = self.probs.copy()
        grad[np.arange(batch_size), targets] -= 1
        
        # Backward through layers
        for i in range(len(self.layers) - 1, -1, -1):
            layer = self.layers[i]
            X = layer['cache']['X']
            
            # Compute gradients
            layer['dW'] = X.T @ grad / batch_size
            layer['db'] = np.mean(grad, axis=0)
            
            # Add L2 regularization gradient
            if self.l2_lambda > 0:
                layer['dW'] += self.l2_lambda * layer['W']
            
            # Gradient for next layer
            grad = grad @ layer['W'].T
            
            # Apply ReLU and dropout gradients (except for last layer)
            if i > 0:
                Z = self.layers[i - 1]['cache']['Z']
                grad = grad * (Z > 0).astype(float)  # ReLU backward
                
                # Dropout backward
                if self.layers[i - 1]['dropout_mask'] is not None:
                    grad = grad * self.layers[i - 1]['dropout_mask'] / (1 - self.dropout_rate)
            
            # Update weights
            layer['W'] -= learning_rate * layer['dW']
            layer['b'] -= learning_rate * layer['db']
    
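    def predict(self, X: np.ndarray) -> np.ndarray:
        # Temporarily switch to inference mode, then restore the caller's mode
        was_training = self.training
        self.training = False
        try:
            probs = self.forward(X)
        finally:
            self.training = was_training
        return np.argmax(probs, axis=1)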
    
    def get_l2_norm(self) -> float:
        """Get total L2 norm of weights (useful for visualization)."""
        total = 0.0
        for layer in self.layers:
            total += np.sum(layer['W'] ** 2)
        return np.sqrt(total)
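
The backward pass starts from `probs` with 1 subtracted at each target index and divides by the batch size later, when forming `dW`. One way to convince yourself that this is the gradient of the mean cross-entropy with respect to the logits is a finite-difference check. The sketch below is standalone and does not use `RegularizedMLP`; the helper names (`softmax`, `mean_cross_entropy`) and the random inputs are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    shifted = logits - np.max(logits, axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)


def mean_cross_entropy(logits, targets):
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))   # 4 samples, 3 classes
targets = np.array([0, 2, 1, 2])

# Analytic gradient of the mean loss: (probs - one_hot) / batch_size
analytic = softmax(logits)
analytic[np.arange(len(targets)), targets] -= 1
analytic /= len(targets)

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for idx in np.ndindex(*logits.shape):
    bumped_up, bumped_down = logits.copy(), logits.copy()
    bumped_up[idx] += eps
    bumped_down[idx] -= eps
    numeric[idx] = (mean_cross_entropy(bumped_up, targets)
                    - mean_cross_entropy(bumped_down, targets)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The same pattern (perturb one parameter, compare against the analytic gradient) can be applied to individual weights of the full model when debugging a hand-written backward pass.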

---

## Part 3: Training Without Regularization (Observe Overfitting)

In [None]:
def train_model(
    model: RegularizedMLP,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    epochs: int = 50,
    batch_size: int = 32,
    lr: float = 0.01,
    verbose: bool = True
) -> Dict:
    """
    Train model and return history.
    """
    history = {
        'train_loss': [], 'test_loss': [],
        'train_acc': [], 'test_acc': [],
        'weight_norm': []
    }
    
    for epoch in range(epochs):
        # Training
        model.training = True
        indices = np.random.permutation(len(X_train))
        epoch_loss = 0
        n_batches = 0
        
        for start in range(0, len(X_train), batch_size):
            batch_idx = indices[start:start + batch_size]
            X_batch = X_train[batch_idx]
            y_batch = y_train[batch_idx]
            
            model.forward(X_batch)
            loss = model.compute_loss(y_batch)
            model.backward(y_batch, lr)
            
            epoch_loss += loss
            n_batches += 1
        
        # Evaluate
        train_preds = model.predict(X_train)
        test_preds = model.predict(X_test[:2000])  # Subset for speed
        
        model.training = False
        model.forward(X_train)
        train_loss = model.compute_loss(y_train)
        model.forward(X_test[:2000])
        test_loss = model.compute_loss(y_test[:2000])
        model.training = True
        
        history['train_loss'].append(train_loss)
        history['test_loss'].append(test_loss)
        history['train_acc'].append(np.mean(train_preds == y_train))
        history['test_acc'].append(np.mean(test_preds == y_test[:2000]))
        history['weight_norm'].append(model.get_l2_norm())
        
        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1:3d} | "
                  f"Train Acc: {history['train_acc'][-1]:.2%} | "
                  f"Test Acc: {history['test_acc'][-1]:.2%} | "
                  f"Gap: {history['train_acc'][-1] - history['test_acc'][-1]:.2%}")
    
    return history

In [None]:
# Train without regularization
print("🏋️ Training WITHOUT regularization (expect overfitting!)")
print("=" * 60)

np.random.seed(42)
model_no_reg = RegularizedMLP([784, 512, 256, 10], l2_lambda=0.0, dropout_rate=0.0)
history_no_reg = train_model(model_no_reg, X_train, y_train, X_test, y_test, epochs=50, lr=0.1)

print("\n📊 Final Results:")
print(f"   Train Accuracy: {history_no_reg['train_acc'][-1]:.2%}")
print(f"   Test Accuracy:  {history_no_reg['test_acc'][-1]:.2%}")
print(f"   Generalization Gap: {history_no_reg['train_acc'][-1] - history_no_reg['test_acc'][-1]:.2%}")
print("\n⚠️ Notice the large gap between train and test accuracy - that's overfitting!")

In [None]:
# Visualize overfitting
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss curves
axes[0].plot(history_no_reg['train_loss'], 'b-', linewidth=2, label='Train Loss')
axes[0].plot(history_no_reg['test_loss'], 'r-', linewidth=2, label='Test Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Curves (No Regularization)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curves
axes[1].plot(history_no_reg['train_acc'], 'b-', linewidth=2, label='Train Acc')
axes[1].plot(history_no_reg['test_acc'], 'r-', linewidth=2, label='Test Acc')
axes[1].axhline(y=1.0, color='g', linestyle=':', alpha=0.5)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Curves (No Regularization)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Weight norm growth
axes[2].plot(history_no_reg['weight_norm'], 'purple', linewidth=2)
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Weight Norm (L2)')
axes[2].set_title('Weight Magnitude Growth')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key observations:")
print("   1. Train loss keeps decreasing → model is memorizing")
print("   2. Test loss starts increasing → model is overfitting")
print("   3. Weights keep growing → no constraint on complexity")

---

## Part 4: L2 Regularization (Weight Decay)

### The Idea

L2 regularization adds a penalty for large weights:

$$\mathcal{L}_{total} = \mathcal{L}_{CE} + \frac{\lambda}{2} \sum_i w_i^2$$

This encourages smaller weights, which leads to simpler models that generalize better.
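
The name "weight decay" comes from what this penalty does to a plain gradient-descent step. With learning rate $\eta$, the penalty contributes $\lambda w$ to the gradient, so

$$w \leftarrow w - \eta\left(\nabla_w \mathcal{L}_{CE} + \lambda w\right) = (1 - \eta\lambda)\, w - \eta \nabla_w \mathcal{L}_{CE}$$

The sketch below checks the two forms numerically. The weights, the stand-in gradient, and the values of `lr` and `lam` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # current weights
grad_ce = rng.normal(size=5)    # stand-in for the cross-entropy gradient
lr, lam = 0.1, 0.01             # illustrative values only

# (a) Gradient step on L_CE + (lam / 2) * sum(w^2): the penalty adds lam * w to the gradient.
w_l2 = w - lr * (grad_ce + lam * w)

# (b) Weight decay form: shrink the weights, then take the plain gradient step.
w_decay = (1 - lr * lam) * w - lr * grad_ce

print(np.allclose(w_l2, w_decay))  # True
```

This equivalence holds for plain SGD, which is what this lab uses. With adaptive optimizers such as Adam the two forms are no longer the same, which is the motivation for AdamW's decoupled weight decay.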

In [None]:
# Test different L2 regularization strengths
print("🔬 Experimenting with L2 Regularization")
print("=" * 60)

l2_values = [0.0, 0.0001, 0.001, 0.01, 0.1]
l2_results = {}

for l2_lambda in l2_values:
    np.random.seed(42)
    model = RegularizedMLP([784, 512, 256, 10], l2_lambda=l2_lambda)
    history = train_model(model, X_train, y_train, X_test, y_test, epochs=50, lr=0.1, verbose=False)
    l2_results[l2_lambda] = history
    
    gap = history['train_acc'][-1] - history['test_acc'][-1]
    print(f"λ = {l2_lambda:6.4f} | Train: {history['train_acc'][-1]:.2%} | "
          f"Test: {history['test_acc'][-1]:.2%} | Gap: {gap:.2%}")

print("=" * 60)

In [None]:
# Visualize L2 regularization effects
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

colors = plt.cm.viridis(np.linspace(0, 0.9, len(l2_values)))

for (l2, history), color in zip(l2_results.items(), colors):
    label = f'λ={l2}'
    
    # Accuracy gap
    gap = [t - v for t, v in zip(history['train_acc'], history['test_acc'])]
    axes[0].plot(gap, color=color, linewidth=2, label=label)
    
    # Test accuracy
    axes[1].plot(history['test_acc'], color=color, linewidth=2, label=label)
    
    # Weight norm
    axes[2].plot(history['weight_norm'], color=color, linewidth=2, label=label)

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Train - Test Accuracy')
axes[0].set_title('Generalization Gap')
axes[0].legend(fontsize=8)
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Test Accuracy')
axes[1].legend(fontsize=8)
axes[1].grid(True, alpha=0.3)

axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Weight Norm')
axes[2].set_title('Weight Magnitude')
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 5: Dropout Regularization

### The Idea

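Dropout randomly "drops" neurons during training:

$$\tilde{h} = \frac{h \odot m}{1 - p}$$

where $m \sim \text{Bernoulli}(1 - p)$ is a random mask and $p$ is the drop probability (`dropout_rate` in the code). Dividing by $1 - p$ is the "inverted dropout" scaling used in `RegularizedMLP.forward`; it keeps the expected activation equal to $h$.

**Why it works:**
1. Prevents neurons from co-adapting
2. Acts like training many different networks
3. At test time, all neurons contribute; because the scaling is already applied during training, no rescaling is needed at inference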
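
The role of the scaling can be checked numerically. The sketch below mirrors the inverted-dropout step in `RegularizedMLP.forward` (mask, then divide by the keep probability) on a standalone array. The constant activations and the array shape are illustrative assumptions chosen so the averages are easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # drop probability
h = np.ones((10_000, 100))            # constant activations make the mean easy to read

keep_mask = rng.random(h.shape) > p   # True with probability 1 - p
h_train = h * keep_mask / (1 - p)     # inverted dropout: rescale the surviving units
h_test = h                            # inference: no mask, no rescaling

print(f"fraction kept:         {keep_mask.mean():.3f}")
print(f"mean train activation: {h_train.mean():.3f}")
print(f"mean test activation:  {h_test.mean():.3f}")
```

Roughly half the units survive, yet the average training-time activation stays close to the test-time value of 1.0, so the next layer sees inputs on the same scale in both modes.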

In [None]:
# Test different dropout rates
print("🔬 Experimenting with Dropout")
print("=" * 60)

dropout_values = [0.0, 0.1, 0.2, 0.3, 0.5]
dropout_results = {}

for dropout_rate in dropout_values:
    np.random.seed(42)
    model = RegularizedMLP([784, 512, 256, 10], dropout_rate=dropout_rate)
    history = train_model(model, X_train, y_train, X_test, y_test, epochs=50, lr=0.1, verbose=False)
    dropout_results[dropout_rate] = history
    
    gap = history['train_acc'][-1] - history['test_acc'][-1]
    print(f"Dropout = {dropout_rate:.1f} | Train: {history['train_acc'][-1]:.2%} | "
          f"Test: {history['test_acc'][-1]:.2%} | Gap: {gap:.2%}")

print("=" * 60)

In [None]:
# Visualize dropout effects
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

colors = plt.cm.plasma(np.linspace(0, 0.9, len(dropout_values)))

for (drop, history), color in zip(dropout_results.items(), colors):
    label = f'Dropout={drop}'
    
    # Accuracy gap
    gap = [t - v for t, v in zip(history['train_acc'], history['test_acc'])]
    axes[0].plot(gap, color=color, linewidth=2, label=label)
    
    # Test accuracy
    axes[1].plot(history['test_acc'], color=color, linewidth=2, label=label)

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Train - Test Accuracy')
axes[0].set_title('Generalization Gap with Dropout')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Test Accuracy with Dropout')
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 6: The Underfitting ↔ Overfitting Spectrum

Let's visualize how regularization affects the bias-variance tradeoff.

In [None]:
# Create comprehensive comparison
print("🎯 Finding the Sweet Spot")
print("=" * 60)

# Combine L2 and Dropout
configs = [
    ('No Regularization', 0.0, 0.0),
    ('L2 only (λ=0.001)', 0.001, 0.0),
    ('Dropout only (0.2)', 0.0, 0.2),
    ('L2 + Dropout', 0.001, 0.2),
    ('Strong Regularization', 0.01, 0.5),
]

comparison_results = {}

for name, l2, drop in configs:
    np.random.seed(42)
    model = RegularizedMLP([784, 512, 256, 10], l2_lambda=l2, dropout_rate=drop)
    history = train_model(model, X_train, y_train, X_test, y_test, epochs=50, lr=0.1, verbose=False)
    comparison_results[name] = history
    
    gap = history['train_acc'][-1] - history['test_acc'][-1]
    print(f"{name:25s} | Train: {history['train_acc'][-1]:.2%} | "
          f"Test: {history['test_acc'][-1]:.2%} | Gap: {gap:.2%}")

In [None]:
# Create the spectrum visualization
fig, ax = plt.subplots(figsize=(12, 6))

# Prepare data
names = list(comparison_results.keys())
train_accs = [comparison_results[n]['train_acc'][-1] for n in names]
test_accs = [comparison_results[n]['test_acc'][-1] for n in names]
gaps = [t - v for t, v in zip(train_accs, test_accs)]

x = np.arange(len(names))
width = 0.35

bars1 = ax.bar(x - width/2, train_accs, width, label='Train Accuracy', color='#2196F3', alpha=0.8)
bars2 = ax.bar(x + width/2, test_accs, width, label='Test Accuracy', color='#4CAF50', alpha=0.8)

# Add gap annotations
for i, (train, test, gap) in enumerate(zip(train_accs, test_accs, gaps)):
    mid = (train + test) / 2
    ax.annotate(f'Gap: {gap:.1%}', (i, mid), ha='center', fontsize=9, fontweight='bold')

ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Underfitting ↔ Good Fit ↔ Overfitting Spectrum', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(names, rotation=15, ha='right')
ax.grid(True, alpha=0.3, axis='y')
ax.set_ylim(0.7, 1.05)

# Add regions (expected pattern; check against your own results)
ax.axvspan(-0.5, 0.5, alpha=0.1, color='red', label='Overfitting')
ax.axvspan(2.5, 3.5, alpha=0.1, color='green', label='Sweet Spot')
ax.axvspan(3.5, 4.5, alpha=0.1, color='blue', label='Underfitting')

# Build the legend last so the shaded regions are included
ax.legend()

plt.tight_layout()
plt.show()

---

## Part 7: Summary and Recommendations

In [None]:
print("\n" + "=" * 80)
print("                    REGULARIZATION RECOMMENDATIONS")
print("=" * 80)

print("""
┌─────────────────┬────────────────────────────────────────────────────────────┐
│ Technique       │ When to Use                                                │
├─────────────────┼────────────────────────────────────────────────────────────┤
│ L2 (Weight      │ • Default choice, always a good starting point             │
│ Decay)          │ • Built into AdamW optimizer (modern standard)             │
│                 │ • Typical values: 0.0001 to 0.01                           │
├─────────────────┼────────────────────────────────────────────────────────────┤
│ Dropout         │ • Large models with many parameters                        │
│                 │ • Fully-connected layers (less common in CNNs now)         │
│                 │ • Typical values: 0.1 to 0.5                               │
├─────────────────┼────────────────────────────────────────────────────────────┤
│ Early Stopping  │ • When you see test loss increasing                        │
│                 │ • Save model at best validation performance                │
│                 │ • Patience: typically 5-10 epochs                          │
├─────────────────┼────────────────────────────────────────────────────────────┤
│ Data            │ • Limited training data                                    │
│ Augmentation    │ • Computer vision tasks                                    │
│                 │ • Creates "virtual" training examples                      │
└─────────────────┴────────────────────────────────────────────────────────────┘

Key Signs:
• Overfitting: Train acc >> Test acc, Test loss increasing
• Underfitting: Both train and test acc are low
• Good fit: Small gap between train and test, test acc high
""")

---

## ⚠️ Common Mistakes

### Mistake 1: Applying dropout during inference

```python
# ❌ Wrong - dropout still active during testing
model.training = True
probs = model.forward(X_test)  # Some activations randomly zeroed!

# ✅ Right - disable dropout during testing
model.training = False
probs = model.forward(X_test)
```

### Mistake 2: Too much regularization

```python
# ❌ Wrong - model can't learn anything
model = RegularizedMLP([784, 512, 256, 10], l2_lambda=1.0, dropout_rate=0.9)

# ✅ Right - start small and increase if needed
model = RegularizedMLP([784, 512, 256, 10], l2_lambda=0.001, dropout_rate=0.2)
```

### Mistake 3: Not monitoring both train and test metrics

```python
# ❌ Wrong - only looking at training loss
print(f"Train loss: {train_loss}")

# ✅ Right - always compare train vs test
print(f"Train loss: {train_loss}, Test loss: {test_loss}")
print(f"Gap: {train_acc - test_acc}")
```

---

## ✋ Try It Yourself

### Exercise 1: Find the Optimal Regularization

Use grid search to find the best combination of L2 and Dropout for this dataset.

<details>
<summary>💡 Hint</summary>
Try L2 in [0.0001, 0.001, 0.01] and Dropout in [0.0, 0.1, 0.2, 0.3].
Track test accuracy for each combination.
</details>
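
One possible shape for the search loop is sketched below. It is a skeleton under stated assumptions: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a placeholder so the snippet runs on its own. In the exercise you would replace it with a function that builds a `RegularizedMLP`, trains it, and returns an accuracy.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Config = Tuple[float, float]


def grid_search(
    l2_values: Iterable[float],
    dropout_values: Iterable[float],
    evaluate: Callable[[float, float], float],
) -> Tuple[Config, Dict[Config, float]]:
    """Score every (l2, dropout) pair and return the best pair plus all scores."""
    scores = {
        (l2, drop): evaluate(l2, drop)
        for l2, drop in product(l2_values, dropout_values)
    }
    best = max(scores, key=scores.get)
    return best, scores


def toy_score(l2: float, dropout: float) -> float:
    """Placeholder scorer, NOT a real accuracy: peaks at l2=0.001, dropout=0.2."""
    return -abs(l2 - 0.001) - abs(dropout - 0.2)


best, scores = grid_search([0.0001, 0.001, 0.01], [0.0, 0.1, 0.2, 0.3], toy_score)
print(best, len(scores))  # (0.001, 0.2) 12
```

If you pick hyperparameters by test accuracy, the reported test score becomes optimistic. Consider holding out part of the training data as a validation split and selecting on that instead.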

In [None]:
# Your code here: Grid search for optimal regularization

### Exercise 2: Implement Early Stopping

Add early stopping to the training loop:
- Monitor validation loss
- Stop if it doesn't improve for N epochs
- Return the best model
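
The patience logic in the steps above can be sketched independently of the model. The function below takes a list of per-epoch validation losses and reports which epoch was best and where training would stop. The function name and the loss values are made up for illustration, and saving and restoring the model weights at the best epoch is left for the exercise.

```python
from typing import List, Tuple


def early_stopping_epoch(val_losses: List[float], patience: int = 5) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    stop_epoch is the first epoch at which the loss has failed to improve on the
    best value for `patience` epochs in a row, or the last epoch if that never happens.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up loss curve: improves for three epochs, then drifts upward.
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.71, 0.9, 1.1]
print(early_stopping_epoch(losses, patience=3))  # (2, 5)
```

In a real training loop the same counters live inside the epoch loop, and a copy of the weights is stored whenever the validation loss improves so the best model can be returned.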

In [None]:
# Your code here: Implement early stopping

---

## 🎉 Checkpoint

You've learned:

- ✅ How to detect overfitting (train >> test)
- ✅ L2 regularization constrains weight magnitudes
- ✅ Dropout prevents co-adaptation of neurons
- ✅ The underfitting ↔ overfitting spectrum
- ✅ How to find optimal regularization experimentally

---

## 📖 Further Reading

- [Dropout Paper (Srivastava et al.)](https://jmlr.org/papers/v15/srivastava14a.html)
- [Regularization for Deep Learning (Goodfellow)](https://www.deeplearningbook.org/contents/regularization.html)
- [L2 Regularization and Batch Norm](https://arxiv.org/abs/1706.05350)

---

## 🧹 Cleanup

In [None]:
import gc
gc.collect()

print("✅ Cleanup complete!")
print("\n🎯 Next: Proceed to notebook 04-normalization-comparison.ipynb")