# Day 12: Dropout - A Simple Way to Prevent Neural Networks from Overfitting 🎲

Welcome to Day 12 of 30 Papers in 30 Days!

Today we're exploring **Dropout** - the elegantly simple regularization technique that revolutionized neural network training. It's like training an ensemble of networks for the price of one!

## What You'll Learn

1. **Why Overfitting Happens**: The curse of too many parameters
2. **The Dropout Solution**: Random neuron silencing during training
3. **Inverted Dropout**: The practical implementation trick
4. **Dropout Variants**: Spatial, DropConnect, AlphaDropout
5. **MC Dropout**: Uncertainty estimation for free!
6. **Implementation**: Build dropout from scratch

## The Big Idea (in 30 seconds)

**Problem**: Neural networks overfit. They memorize training data instead of learning patterns.

**Solution**: During training, randomly "drop" (zero out) each neuron with probability 1 - p, where p is the keep probability used throughout this notebook!

**Why it works**:
- No neuron can rely on any other specific neuron
- Forces redundant representations
- Like training an exponentially large ensemble of networks that share weights!

**Result**: Better generalization, less overfitting!

Let's dive in! 🚀

In [None]:
# Setup and imports
import numpy as np
import matplotlib.pyplot as plt
import sys
import os

# Add current directory to path
sys.path.append('.')

# Set random seed for reproducibility
np.random.seed(42)

# Import our implementations
from implementation import Dropout, Dropout2D, NaiveDropout, DropoutNetwork
from train_minimal import load_mnist, DropoutMLP, SGD, train_epoch, evaluate
from visualization import (
    plot_dropout_masks, 
    plot_training_curves_comparison,
    plot_ensemble_interpretation
)

print("✅ All imports successful!")
print("🎲 Ready to explore dropout!")

## Part 1: Understanding the Problem - Overfitting

Before we solve a problem, let's understand it. Neural networks have many parameters - sometimes millions or billions. With that many knobs to turn, it's easy to perfectly fit the training data while failing on new data.

Let's visualize this problem.

In [None]:
# Demonstrate overfitting
def demonstrate_overfitting():
    """Show how a network can overfit to training data."""
    
    print("📊 Demonstrating Overfitting...")
    
    # Load MNIST data
    X_train, y_train, X_test, y_test = load_mnist()
    
    # Use small subset to make overfitting more obvious
    X_train_small = X_train[:1000]
    y_train_small = y_train[:1000]
    
    print(f"Training samples: {len(X_train_small)}")
    print(f"Test samples: {len(X_test)}")
    
    # Train WITHOUT dropout (prone to overfitting)
    print("\nTraining WITHOUT dropout...")
    
    model_no_dropout = DropoutMLP(
        input_size=784,
        hidden_sizes=[512, 256],
        output_size=10,
        dropout_p=1.0,  # No dropout (keep everything)
        input_dropout_p=1.0
    )
    
    optimizer = SGD(model_no_dropout.get_params(), lr=0.01, momentum=0.9)
    
    train_accs_no_drop = []
    test_accs_no_drop = []
    
    for epoch in range(30):
        train_epoch(model_no_dropout, X_train_small, y_train_small, optimizer, batch_size=32)
        
        train_acc, _ = evaluate(model_no_dropout, X_train_small, y_train_small)
        test_acc, _ = evaluate(model_no_dropout, X_test, y_test)
        
        train_accs_no_drop.append(train_acc)
        test_accs_no_drop.append(test_acc)
        
        if (epoch + 1) % 10 == 0:
            print(f"  Epoch {epoch+1}: Train {train_acc:.3f} | Test {test_acc:.3f} | Gap {train_acc - test_acc:.3f}")
    
    # Train WITH dropout
    print("\nTraining WITH dropout (p=0.5)...")
    
    np.random.seed(42)  # Reset for fair comparison
    
    model_dropout = DropoutMLP(
        input_size=784,
        hidden_sizes=[512, 256],
        output_size=10,
        dropout_p=0.5,  # Keep 50% of neurons
        input_dropout_p=0.9
    )
    
    optimizer = SGD(model_dropout.get_params(), lr=0.01, momentum=0.9)
    
    train_accs_drop = []
    test_accs_drop = []
    
    for epoch in range(30):
        train_epoch(model_dropout, X_train_small, y_train_small, optimizer, batch_size=32)
        
        train_acc, _ = evaluate(model_dropout, X_train_small, y_train_small)
        test_acc, _ = evaluate(model_dropout, X_test, y_test)
        
        train_accs_drop.append(train_acc)
        test_accs_drop.append(test_acc)
        
        if (epoch + 1) % 10 == 0:
            print(f"  Epoch {epoch+1}: Train {train_acc:.3f} | Test {test_acc:.3f} | Gap {train_acc - test_acc:.3f}")
    
    # Plot comparison
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # No dropout
    ax = axes[0]
    ax.plot(train_accs_no_drop, 'b-', label='Train', linewidth=2)
    ax.plot(test_accs_no_drop, 'r-', label='Test', linewidth=2)
    ax.fill_between(range(30), test_accs_no_drop, train_accs_no_drop, alpha=0.2, color='red')
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title('WITHOUT Dropout: Overfitting! ❌', fontsize=14, weight='bold', color='red')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.text(15, 0.75, f'Gap: {train_accs_no_drop[-1] - test_accs_no_drop[-1]:.2f}', 
           fontsize=12, ha='center', color='red', weight='bold')
    
    # With dropout
    ax = axes[1]
    ax.plot(train_accs_drop, 'b-', label='Train', linewidth=2)
    ax.plot(test_accs_drop, 'g-', label='Test', linewidth=2)
    ax.fill_between(range(30), test_accs_drop, train_accs_drop, alpha=0.2, color='green')
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title('WITH Dropout: Much Better! ✅', fontsize=14, weight='bold', color='green')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.text(15, 0.75, f'Gap: {train_accs_drop[-1] - test_accs_drop[-1]:.2f}', 
           fontsize=12, ha='center', color='green', weight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Key Observation:")
    print(f"  Without dropout: Train-Test gap = {train_accs_no_drop[-1] - test_accs_no_drop[-1]:.3f}")
    print(f"  With dropout:    Train-Test gap = {train_accs_drop[-1] - test_accs_drop[-1]:.3f}")
    print(f"  Dropout reduces overfitting by ~{100*(1 - (train_accs_drop[-1] - test_accs_drop[-1])/(train_accs_no_drop[-1] - test_accs_no_drop[-1])):.0f}%!")

demonstrate_overfitting()

## Part 2: How Dropout Works

Dropout is beautifully simple:

1. **During Training**: For each forward pass, zero out each neuron independently with probability (1-p), where p is the keep probability
2. **During Inference**: Use all neurons (no dropout)
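
The two steps above can be sketched as a tiny layer. This is a minimal illustration, not the `Dropout` class imported from `implementation`: the name `SimpleDropout`, its `training` flag, and its `seed` argument are hypothetical, and it already applies the 1/p scaling that Part 3 explains.

```python
import numpy as np


class SimpleDropout:
    """Hypothetical minimal dropout layer; p is the KEEP probability."""

    def __init__(self, p=0.5, seed=0):
        if not 0.0 < p <= 1.0:
            raise ValueError("keep probability p must be in (0, 1]")
        self.p = p
        self.training = True
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x):
        if not self.training:
            return x  # inference: every neuron is used, nothing is dropped
        # Fresh mask on every call; kept units are scaled by 1/p.
        self.mask = (self.rng.random(x.shape) < self.p) / self.p
        return x * self.mask

    def backward(self, grad_out):
        if not self.training:
            return grad_out
        # Gradient flows only through the units kept in the last forward pass.
        return grad_out * self.mask


layer = SimpleDropout(p=0.5, seed=0)
x = np.ones((2, 6))
out_a = layer.forward(x)     # one random mask
out_b = layer.forward(x)     # a second, independently drawn mask
layer.training = False
out_eval = layer.forward(x)  # input passes through unchanged
```

Storing the mask in `forward` is what lets `backward` route gradients through exactly the same units, which is the part people most often get wrong when writing dropout by hand.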

Let's visualize what a dropout mask looks like.

In [None]:
# Visualize dropout masks
def visualize_dropout_mechanism():
    """Show how dropout masks work."""
    
    print("üé≤ Visualizing Dropout Masks...")
    
    # Different keep probabilities
    keep_probs = [0.9, 0.7, 0.5, 0.3]
    
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    
    for i, p in enumerate(keep_probs):
        # Generate mask for 16x16 layer
        mask = (np.random.rand(16, 16) < p).astype(float)
        
        # Show mask
        ax = axes[0, i]
        ax.imshow(mask, cmap='RdYlGn', vmin=0, vmax=1)
        ax.set_title(f'keep_prob = {p}\n{int(mask.sum())}/256 active', fontsize=11, weight='bold')
        ax.axis('off')
        
        # Show histogram of multiple samples
        ax = axes[1, i]
        active_counts = [(np.random.rand(256) < p).sum() for _ in range(1000)]
        ax.hist(active_counts, bins=20, color='steelblue', alpha=0.7, edgecolor='black')
        ax.axvline(256*p, color='red', linestyle='--', linewidth=2, label=f'Expected: {256*p:.0f}')
        ax.set_xlabel('Active Neurons', fontsize=10)
        ax.set_ylabel('Count', fontsize=10)
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)
    
    axes[0, 0].set_ylabel('Dropout Mask\n(Green=Active)', fontsize=11)
    axes[1, 0].set_ylabel('Distribution\n(1000 samples)', fontsize=11)
    
    plt.suptitle('Dropout Masks at Different Keep Probabilities', fontsize=14, weight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Key Insights:")
    print("  • Each training step uses a DIFFERENT random mask")
    print("  • p=0.5 is the standard for hidden layers")
    print("  • p=0.8-0.9 for input layers (don't drop too much input!)")
    print("  • p=1.0 for output layers (need all outputs!)")

visualize_dropout_mechanism()

## Part 3: Inverted Dropout - The Practical Trick

There are two ways to implement dropout:

**Original (Naive) Dropout:**
- Training: Apply mask, no scaling
- Inference: Scale by p

**Inverted Dropout (Standard):**
- Training: Apply mask AND scale by 1/p
- Inference: No change needed!

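Inverted dropout is the standard choice because the inference path needs no dropout-specific scaling: the same forward code works whether or not the model was trained with dropout.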
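
To see why the 1/p factor removes the need for inference-time scaling, here is a short derivation. It assumes each mask entry $m_i$ is drawn independently from a Bernoulli distribution with keep probability $p$; the symbols $m_i$ and $x_i$ are notation introduced for this sketch.

$$
m_i \sim \mathrm{Bernoulli}(p), \qquad
\mathbb{E}[m_i x_i] = p\,x_i, \qquad
\mathbb{E}\!\left[\frac{m_i x_i}{p}\right] = \frac{p\,x_i}{p} = x_i
$$

Naive dropout therefore has to multiply by $p$ at inference to match the training-time expectation, while inverted dropout already matches it.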

In [None]:
# Compare inverted vs naive dropout
def compare_dropout_implementations():
    """Compare inverted vs naive dropout."""
    
    print("🔄 Comparing Inverted vs Naive Dropout...")
    
    # Create sample input
    x = np.ones((1, 10))
    p = 0.5  # Keep probability
    
    print(f"\nInput: {x[0]}")
    print(f"Keep probability: {p}")
    
    # Set same random seed for comparison
    np.random.seed(42)
    mask = (np.random.rand(*x.shape) < p).astype(float)
    print(f"\nMask: {mask[0].astype(int)}")
    
    # Naive Dropout
    print("\n--- NAIVE DROPOUT ---")
    naive_train = x * mask  # No scaling during training
    naive_test = x * p       # Scale during inference
    
    print(f"Training output:  {naive_train[0]}")
    print(f"Inference output: {naive_test[0]}")
    print(f"Mean activation (train, this mask): {naive_train.mean():.3f}")
    print(f"Mean activation (test):             {naive_test.mean():.3f}")
    
    # Inverted Dropout
    print("\n--- INVERTED DROPOUT ---")
    inverted_train = (x * mask) / p  # Scale during training
    inverted_test = x                 # No change during inference
    
    print(f"Training output:  {inverted_train[0]}")
    print(f"Inference output: {inverted_test[0]}")
    print(f"Mean activation (train, this mask): {inverted_train.mean():.3f}")
    print(f"Mean activation (test):             {inverted_test.mean():.3f}")
    
    # Verify expected values match
    print("\n💡 Key Insight:")
    print("  With inverted dropout, the expected value during training")
    print("  equals the value during inference (1.0 in this case).")
    print("  This makes inference simpler - no scaling needed!")
    
    # Statistical verification
    print("\n📊 Statistical Verification (1000 trials):")
    
    naive_means = []
    inverted_means = []
    
    for _ in range(1000):
        mask = (np.random.rand(*x.shape) < p).astype(float)
        naive_means.append((x * mask).mean())
        inverted_means.append(((x * mask) / p).mean())
    
    print(f"  Naive training mean:    {np.mean(naive_means):.4f} (should be ~0.5)")
    print(f"  Inverted training mean: {np.mean(inverted_means):.4f} (should be ~1.0)")

compare_dropout_implementations()

## Part 4: Why Dropout Works - The Ensemble Interpretation

Dropout can be understood as training an **exponential ensemble** of networks!

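With n neurons that can each be kept or dropped, there are 2^n possible sub-networks, all sharing the same weights. Each training step samples and updates one of them. At inference, a single pass through the full network approximates the average prediction of all these sub-networks.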

This is like getting an ensemble for free!
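
For a layer small enough to enumerate, the ensemble claim can be checked directly. The sketch below assumes a single linear output unit and uses made-up activations `h` and weights `w`; in that linear case the probability-weighted average over all 2^n masks equals one pass with activations scaled by p (the naive-dropout inference rule). With nonlinearities between layers the match is only approximate.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4                      # hidden units, so 2**4 = 16 sub-networks
p = 0.5                    # keep probability
h = rng.normal(size=n)     # hypothetical hidden activations
w = rng.normal(size=n)     # hypothetical weights of a linear output unit

# Enumerate every binary mask and weight each sub-network by its probability.
ensemble_avg = 0.0
n_subnets = 0
for bits in itertools.product([0, 1], repeat=n):
    mask = np.array(bits, dtype=float)
    n_kept = mask.sum()
    prob = p ** n_kept * (1 - p) ** (n - n_kept)
    ensemble_avg += prob * np.dot(h * mask, w)
    n_subnets += 1

# Single pass through the full network with activations scaled by p.
scaled_pass = np.dot(p * h, w)

print(f"{n_subnets} sub-networks, average = {ensemble_avg:.6f}, scaled pass = {scaled_pass:.6f}")
```

Enumeration is only feasible for toy sizes, which is exactly why the single scaled pass matters in practice.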

In [None]:
# Visualize the ensemble interpretation
def visualize_ensemble_nature():
    """Show dropout as implicit ensemble training."""
    
    print("🔀 Dropout as Ensemble Learning...")
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Sample sub-networks
    n_neurons = 8
    p = 0.5
    
    # Show 3 different "sub-networks" (dropout masks)
    for i in range(3):
        ax = axes[i]
        
        # Generate random mask
        mask = (np.random.rand(1, n_neurons) < p)[0]
        
        # Visualize as network diagram
        # Input layer
        for j in range(4):
            ax.scatter(0, j, s=300, c='steelblue', zorder=3)
        
        # Hidden layer (with dropout)
        for j in range(n_neurons):
            y = j - (n_neurons - 4) / 2
            color = 'green' if mask[j] else 'lightgray'
            alpha = 1.0 if mask[j] else 0.3
            ax.scatter(1, y, s=300, c=color, alpha=alpha, zorder=3, edgecolor='black')
            
            # Draw connections
            for k in range(4):
                if mask[j]:
                    ax.plot([0, 1], [k, y], 'g-', alpha=0.3, linewidth=1)
                else:
                    ax.plot([0, 1], [k, y], 'gray', alpha=0.1, linewidth=0.5)
        
        # Output layer
        for j in range(2):
            y = j + 0.5
            ax.scatter(2, y, s=300, c='orange', zorder=3)
            
            for k in range(n_neurons):
                ky = k - (n_neurons - 4) / 2
                if mask[k]:
                    ax.plot([1, 2], [ky, y], 'g-', alpha=0.3, linewidth=1)
        
        ax.set_xlim(-0.5, 2.5)
        ax.set_ylim(-3, 6)
        ax.set_title(f'Sub-network {i+1}\n({int(mask.sum())}/{n_neurons} neurons active)', 
                    fontsize=12, weight='bold')
        ax.axis('off')
    
    plt.suptitle('Each Training Step = Different Sub-Network!', fontsize=14, weight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    # Statistics
    print(f"\n📊 Ensemble Statistics:")
    print(f"  Hidden layer neurons: {n_neurons}")
    print(f"  Possible sub-networks: 2^{n_neurons} = {2**n_neurons}")
    print(f"\n  For a real network with 512 hidden neurons:")
    print(f"  Possible sub-networks: 2^512 ≈ 10^154 (more than atoms in universe!)")
    print(f"\n💡 Dropout = Implicit ensemble of exponentially many networks!")

visualize_ensemble_nature()

## Part 5: Finding the Optimal Dropout Rate

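The dropout rate is a hyperparameter. Too little dropout = overfitting. Too much dropout = underfitting. Note that the sweep is parameterized by keep probability, so keep_prob = 1.0 means no dropout and keep_prob = 0.1 drops 90% of hidden units.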
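
One way to build intuition for the "too much dropout" end of the range is to look at the noise that inverted dropout injects. Each activation is multiplied by m/p with m drawn from Bernoulli(p), which has mean 1 and variance (1-p)/p. The sketch below checks this numerically; it is only one lens on the trade-off, not a full account of underfitting, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

noise_variance = {}
for keep_prob in [0.9, 0.7, 0.5, 0.3, 0.1]:
    # Inverted-dropout multiplier applied to each activation: m / p.
    multiplier = (rng.random(n_draws) < keep_prob) / keep_prob
    empirical = multiplier.var()
    theoretical = (1 - keep_prob) / keep_prob
    noise_variance[keep_prob] = (empirical, theoretical)
    print(f"keep_prob={keep_prob:.1f}  empirical var={empirical:.3f}  (1-p)/p={theoretical:.3f}")
```

The variance grows rapidly as the keep probability shrinks, so very aggressive dropout makes each training signal much noisier.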

Let's find the sweet spot!

In [None]:
# Find optimal dropout rate
def find_optimal_dropout_rate():
    """Sweep dropout rates to find optimal."""
    
    print("🔍 Finding Optimal Dropout Rate...")
    
    # Load data
    X_train, y_train, X_test, y_test = load_mnist()
    X_train, y_train = X_train[:3000], y_train[:3000]  # Subset for speed
    
    # Keep probabilities to test (1.0 = no dropout)
    keep_probs = [1.0, 0.9, 0.7, 0.5, 0.3, 0.1]
    
    results = {}
    epochs = 20
    
    for keep_prob in keep_probs:
        print(f"\nTesting keep_prob = {keep_prob}...")
        
        np.random.seed(42)
        
        model = DropoutMLP(
            input_size=784,
            hidden_sizes=[256, 128],
            output_size=10,
            dropout_p=keep_prob,
            input_dropout_p=1.0 if keep_prob == 1.0 else 0.9
        )
        
        optimizer = SGD(model.get_params(), lr=0.01, momentum=0.9)
        
        for epoch in range(epochs):
            train_epoch(model, X_train, y_train, optimizer, batch_size=64)
        
        train_acc, _ = evaluate(model, X_train, y_train)
        test_acc, _ = evaluate(model, X_test, y_test)
        
        results[keep_prob] = {
            'train_acc': train_acc,
            'test_acc': test_acc,
            'gap': train_acc - test_acc
        }
        
        print(f"  Train: {train_acc:.3f} | Test: {test_acc:.3f} | Gap: {train_acc - test_acc:.3f}")
    
    # Plot results
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    probs = list(results.keys())
    train_accs = [results[p]['train_acc'] for p in probs]
    test_accs = [results[p]['test_acc'] for p in probs]
    gaps = [results[p]['gap'] for p in probs]
    
    # Accuracy plot
    ax = axes[0]
    ax.plot(probs, train_accs, 'b-o', label='Train', linewidth=2, markersize=10)
    ax.plot(probs, test_accs, 'g-s', label='Test', linewidth=2, markersize=10)
    
    best_p = probs[np.argmax(test_accs)]
    ax.axvline(best_p, color='green', linestyle='--', alpha=0.7, label=f'Best: {best_p}')
    
    ax.set_xlabel('Keep Probability (p)', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title('Accuracy vs Dropout Rate', fontsize=14, weight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1.1)
    
    # Gap plot
    ax = axes[1]
    colors = ['red' if g > 0.1 else 'orange' if g > 0.05 else 'green' for g in gaps]
    ax.bar(range(len(probs)), gaps, color=colors, alpha=0.7, edgecolor='black')
    ax.set_xticks(range(len(probs)))
    ax.set_xticklabels([f'{p:.1f}' for p in probs])
    ax.set_xlabel('Keep Probability (p)', fontsize=12)
    ax.set_ylabel('Train-Test Gap', fontsize=12)
    ax.set_title('Overfitting vs Dropout Rate', fontsize=14, weight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\n💡 Best dropout rate: keep_prob = {best_p}")
    print(f"  Test accuracy: {results[best_p]['test_acc']:.3f}")
    print(f"  Train-Test gap: {results[best_p]['gap']:.3f}")
    
    return results

results = find_optimal_dropout_rate()

## Part 6: Spatial Dropout (Dropout2D)

For CNNs, standard dropout isn't ideal because neighboring activations within a feature map are strongly correlated. If you drop one activation, its neighbors can "fill in" the missing information.

**Spatial Dropout (Dropout2D)** drops entire feature channels instead!
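
The core of the idea is just the shape of the mask. Here is a minimal sketch, assuming an (N, C, H, W) layout and inverted scaling; `spatial_dropout_forward` is a hypothetical helper written for illustration and is not the `Dropout2D` class from `implementation`.

```python
import numpy as np


def spatial_dropout_forward(x, keep_prob, rng):
    """Drop whole channels of an (N, C, H, W) array (illustrative helper)."""
    n, c = x.shape[:2]
    # One Bernoulli draw per (sample, channel), broadcast over H and W.
    mask = (rng.random((n, c, 1, 1)) < keep_prob) / keep_prob
    return x * mask, mask


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8, 8))
y, mask = spatial_dropout_forward(x, keep_prob=0.5, rng=rng)

for ch in range(x.shape[1]):
    state = "kept (scaled 2x)" if mask[0, ch, 0, 0] > 0 else "dropped"
    print(f"sample 0, channel {ch}: {state}")
```

Because the mask has shape (N, C, 1, 1), broadcasting guarantees that every position in a channel shares the same fate.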

In [None]:
# Demonstrate spatial dropout
def demonstrate_spatial_dropout():
    """Compare standard vs spatial dropout."""
    
    print("🖼️ Demonstrating Spatial Dropout...")
    
    # Create sample feature map (batch=1, channels=4, height=8, width=8)
    np.random.seed(42)
    x = np.random.randn(1, 4, 8, 8)
    
    print(f"Input shape: {x.shape}")
    print(f"  Batch: {x.shape[0]}")
    print(f"  Channels: {x.shape[1]}")
    print(f"  Height x Width: {x.shape[2]}x{x.shape[3]}")
    
    # Standard dropout (flattened)
    standard_dropout = Dropout(p=0.5)
    x_flat = x.reshape(1, -1)
    y_standard = standard_dropout.forward(x_flat).reshape(x.shape)
    
    # Spatial dropout
    spatial_dropout = Dropout2D(p=0.5)
    y_spatial = spatial_dropout.forward(x)
    
    # Visualize
    fig, axes = plt.subplots(3, 4, figsize=(14, 10))
    
    # Original channels
    for c in range(4):
        axes[0, c].imshow(x[0, c], cmap='viridis')
        axes[0, c].set_title(f'Original Ch {c}', fontsize=10)
        axes[0, c].axis('off')
    axes[0, 0].set_ylabel('Original', fontsize=12, rotation=0, ha='right', labelpad=40)
    
    # Standard dropout
    for c in range(4):
        axes[1, c].imshow(y_standard[0, c], cmap='viridis')
        zeros_pct = 100 * np.sum(y_standard[0, c] == 0) / y_standard[0, c].size
        axes[1, c].set_title(f'Ch {c}: {zeros_pct:.0f}% zeros', fontsize=10)
        axes[1, c].axis('off')
    axes[1, 0].set_ylabel('Standard\nDropout', fontsize=12, rotation=0, ha='right', labelpad=40)
    
    # Spatial dropout
    for c in range(4):
        axes[2, c].imshow(y_spatial[0, c], cmap='viridis')
        is_dropped = np.all(y_spatial[0, c] == 0)
        status = 'DROPPED' if is_dropped else 'KEPT (2x)'
        color = 'red' if is_dropped else 'green'
        axes[2, c].set_title(f'Ch {c}: {status}', fontsize=10, color=color, weight='bold')
        axes[2, c].axis('off')
    axes[2, 0].set_ylabel('Spatial\nDropout', fontsize=12, rotation=0, ha='right', labelpad=40)
    
    plt.suptitle('Standard Dropout vs Spatial Dropout (Dropout2D)', fontsize=14, weight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Key Difference:")
    print("  Standard: Scattered zeros across all channels")
    print("  Spatial:  Entire channels are ON or OFF")
    print("\n  For CNNs, spatial dropout is better because:")
    print("  • Nearby pixels are correlated")
    print("  • Dropping one pixel: neighbors can compensate")
    print("  • Dropping whole channel: must use different features")

demonstrate_spatial_dropout()

## Part 7: MC Dropout - Uncertainty for Free!

One beautiful application of dropout is **Monte Carlo Dropout** (MC Dropout):

1. Keep dropout ON during inference
2. Run multiple forward passes
3. Mean = prediction, Variance = uncertainty!

This gives us uncertainty estimates without retraining or changing the model! The only cost is the extra forward passes.
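
The three steps can be sketched independently of any particular model. In the example below, `mc_dropout_predict` and `toy_forward` are hypothetical names, and the "network" is just a random linear layer with dropout on its input followed by a softmax, standing in for a trained model.

```python
import numpy as np


def mc_dropout_predict(stochastic_forward, x, n_samples=50):
    """Run n_samples stochastic passes; return per-class mean and variance."""
    samples = np.stack([stochastic_forward(x) for _ in range(n_samples)], axis=0)
    return samples.mean(axis=0), samples.var(axis=0)


# Hypothetical stand-in for a trained network with dropout left ON.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
keep_prob = 0.5


def toy_forward(x):
    mask = (rng.random(x.shape) < keep_prob) / keep_prob  # inverted dropout
    logits = (x * mask) @ W
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


x = rng.normal(size=(4, 5))
mean_probs, var_probs = mc_dropout_predict(toy_forward, x, n_samples=100)
predicted_class = mean_probs.argmax(axis=1)
uncertainty = var_probs.sum(axis=1)  # total variance across classes

print("predictions:", predicted_class)
print("uncertainty:", np.round(uncertainty, 4))
```

Passing the forward function as an argument keeps the sampling logic separate from the model, so the same helper would work with any callable that leaves dropout active.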

In [None]:
# Demonstrate MC Dropout
def demonstrate_mc_dropout():
    """Show how MC Dropout provides uncertainty estimates."""
    
    print("🎯 Demonstrating MC Dropout for Uncertainty...")
    
    # Load data and train model
    X_train, y_train, X_test, y_test = load_mnist()
    X_train, y_train = X_train[:3000], y_train[:3000]
    
    np.random.seed(42)
    
    model = DropoutMLP(
        input_size=784,
        hidden_sizes=[256, 128],
        output_size=10,
        dropout_p=0.5
    )
    
    optimizer = SGD(model.get_params(), lr=0.01, momentum=0.9)
    
    print("Training model...")
    for epoch in range(15):
        train_epoch(model, X_train, y_train, optimizer, batch_size=64)
        if (epoch + 1) % 5 == 0:
            train_acc, _ = evaluate(model, X_train, y_train)
            print(f"  Epoch {epoch+1}: Train acc = {train_acc:.3f}")
    
    # Standard evaluation
    print("\n--- Standard Evaluation ---")
    model.eval()
    test_acc, _ = evaluate(model, X_test, y_test)
    print(f"Test accuracy: {test_acc:.3f}")
    
    # MC Dropout evaluation
    print("\n--- MC Dropout Evaluation ---")
    
    n_samples = 100
    n_test = 200
    
    X_subset = X_test[:n_test]
    y_subset = y_test[:n_test]
    
    # Run multiple forward passes with dropout ON
    model.train()  # Keep dropout active!
    
    all_predictions = []
    for i in range(n_samples):
        preds = model.forward(X_subset)
        all_predictions.append(preds)
    
    all_predictions = np.stack(all_predictions, axis=0)  # (n_samples, n_test, 10)
    
    # Compute statistics
    mean_pred = all_predictions.mean(axis=0)
    var_pred = all_predictions.var(axis=0)
    
    # Get predictions and uncertainty
    predicted_classes = mean_pred.argmax(axis=1)
    uncertainty = var_pred.sum(axis=1)  # Total variance across classes
    
    # Check correctness
    correct = (predicted_classes == y_subset)
    
    print(f"MC Dropout accuracy: {correct.mean():.3f}")
    print(f"\nUncertainty Analysis:")
    print(f"  Correct predictions:   mean uncertainty = {uncertainty[correct].mean():.4f}")
    print(f"  Incorrect predictions: mean uncertainty = {uncertainty[~correct].mean():.4f}")
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Uncertainty distribution
    ax = axes[0]
    ax.hist(uncertainty[correct], bins=20, alpha=0.7, label='Correct', color='green', density=True)
    ax.hist(uncertainty[~correct], bins=20, alpha=0.7, label='Incorrect', color='red', density=True)
    ax.set_xlabel('Uncertainty (Total Variance)', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.set_title('Uncertainty: Correct vs Incorrect', fontsize=14, weight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    
    # Rejection curve
    ax = axes[1]
    
    # Sort by uncertainty
    order = np.argsort(uncertainty)
    
    reject_rates = np.linspace(0, 0.5, 20)
    accuracies = []
    
    for reject_rate in reject_rates:
        n_keep = int(n_test * (1 - reject_rate))
        keep_indices = order[:n_keep]  # Keep least uncertain
        acc = correct[keep_indices].mean()
        accuracies.append(acc)
    
    ax.plot(100 * reject_rates, accuracies, 'b-o', linewidth=2, markersize=6)
    # Baseline is the MC Dropout accuracy on this subset with nothing rejected
    ax.axhline(correct.mean(), color='r', linestyle='--', label='No rejection')
    ax.set_xlabel('Rejection Rate (%)', fontsize=12)
    ax.set_ylabel('Accuracy on Remaining', fontsize=12)
    ax.set_title('Accuracy vs Rejection Rate', fontsize=14, weight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Key Insights:")
    ratio = uncertainty[~correct].mean() / uncertainty[correct].mean()
    print(f"  • Wrong predictions are {ratio:.1f}x more uncertain")
    print(f"  • Rejecting uncertain samples improves accuracy")
    print(f"  • MC Dropout = Free uncertainty estimation!")

demonstrate_mc_dropout()

## Part 8: Key Takeaways

Let's summarize what we've learned about dropout!

In [None]:
# Summary
print("="*60)
print("KEY TAKEAWAYS: DROPOUT")
print("="*60)

print("""
🎯 WHAT IS DROPOUT?
   Randomly zero out neurons during training to prevent overfitting.

🔧 IMPLEMENTATION:
   • Training: mask = (rand < p); output = (input * mask) / p
   • Inference: output = input (no change!)

⚙️ RECOMMENDED RATES:
   • Input layer:  p = 0.8-0.9 (keep most input)
   • Hidden layers: p = 0.5 (standard)
   • Output layer: p = 1.0 (never dropout!)

🧠 WHY IT WORKS:
   • Prevents co-adaptation of neurons
   • Forces redundant representations
   • Implicit ensemble of 2^n networks
   • Adds beneficial noise during training

📦 VARIANTS:
   • Dropout:   Standard for dense layers
   • Dropout2D: Spatial dropout for CNNs
   • AlphaDropout: For SELU activations
   • DropConnect: Drop weights, not neurons
   • DropBlock: Drop contiguous regions

🎲 MC DROPOUT:
   • Keep dropout ON at inference
   • Run multiple forward passes
   • Mean = prediction, Variance = uncertainty

⚠️ COMMON MISTAKES:
   • Forgetting model.eval() during inference
   • Dropout on output layer
   • Too aggressive dropout (underfitting)
   • Using dropout with BatchNorm (tricky!)

🚀 MODERN USAGE:
   • Still widely used in NLP (transformers)
   • Less common in vision (BatchNorm preferred)
   • Useful for uncertainty estimation (MC Dropout)
""")

print("="*60)
print("🎉 Congratulations! You've mastered Dropout!")
print("="*60)
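
The variants list above mentions DropConnect, which is not demonstrated in the earlier cells. As a rough sketch of "drop weights, not neurons", the hypothetical helper below masks individual weights and rescales by 1/keep_prob so the expected masked weight equals the original. That rescaling is an assumption borrowed from inverted dropout, and sharing one mask across the whole batch is a simplification made here for brevity.

```python
import numpy as np


def dropconnect_forward(x, W, keep_prob, rng):
    """Illustrative DropConnect-style layer: mask weights, not activations."""
    # One Bernoulli draw per weight; 1/keep_prob keeps E[masked W] equal to W.
    weight_mask = (rng.random(W.shape) < keep_prob) / keep_prob
    return x @ (W * weight_mask)


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6))
W = rng.normal(size=(6, 4))

# Averaging many stochastic passes should approach the unmasked product.
avg = np.mean([dropconnect_forward(x, W, 0.5, rng) for _ in range(20000)], axis=0)
print(f"max |average - x @ W| = {np.abs(avg - x @ W).max():.4f}")
```

Compared with standard dropout, the randomness here lives on the connections, so a neuron can lose some of its inputs without being silenced entirely.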

## Exercises

Ready to test your understanding? Head to the `exercises/` directory for 5 progressive challenges:

1. **Build Dropout from Scratch** - Implement forward/backward passes
2. **Dropout Rate Sweep** - Find the optimal rate empirically
3. **Spatial Dropout** - Implement Dropout2D for CNNs
4. **MC Dropout** - Build uncertainty estimation
5. **Regularization Comparison** - Compare dropout with L2, early stopping

Good luck! 🚀