# Module 07: Regularization Techniques (Dropout, Batch Normalization)

**Difficulty**: ⭐⭐⭐ (Advanced)

**Estimated Time**: 60-75 minutes

**Prerequisites**: 
- [Module 04: Introduction to TensorFlow and Keras](04_introduction_to_tensorflow_keras.ipynb)
- [Module 05: Feed-Forward Neural Networks with Keras](05_feedforward_neural_networks_keras.ipynb)
- [Module 06: Optimizers (SGD, Adam, RMSprop)](06_optimizers_sgd_adam_rmsprop.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand overfitting in deep learning and diagnose it effectively
2. Apply L1 and L2 regularization to control model complexity
3. Implement Dropout for reducing co-adaptation of neurons
4. Use Batch Normalization to stabilize and accelerate training
5. Apply Early Stopping to prevent overtraining
6. Combine multiple regularization techniques effectively
7. Understand when and how to use each regularization method

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.datasets import fashion_mnist, cifar10
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Sklearn for data splitting
from sklearn.model_selection import train_test_split

# Reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")

## 2. Understanding Overfitting

### What is Overfitting?

**Overfitting** occurs when a model learns the training data **too well**, including noise and random fluctuations, at the expense of generalization to new data.

### Visual Signature of Overfitting:

```
Loss/Accuracy Curves:

Training Loss:     \          (keeps decreasing)
                    \___

Validation Loss:    \  /      (starts increasing)
                     \/
                      ^
                      |
                Overfitting begins here
```
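
The turning point in the sketch above can also be located programmatically. The following is a minimal sketch, assuming the per-epoch validation losses are available as a plain list (for example `history.history['val_loss']`). The helper name `first_sustained_rise` and its `window` parameter are hypothetical and not part of Keras.

```python
def first_sustained_rise(val_losses, window=3):
    """Return the 0-based index of the first epoch after which validation
    loss rises for `window` consecutive epochs, or None if it never does."""
    for start in range(len(val_losses) - window):
        segment = val_losses[start:start + window + 1]
        # Strictly increasing across the whole segment means a sustained rise
        if all(a < b for a, b in zip(segment, segment[1:])):
            return start
    return None


# Hypothetical loss curve: improves for four epochs, then degrades
val_losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.62, 0.70, 0.81]
turning_point = first_sustained_rise(val_losses)
print(f"Validation loss starts rising after epoch {turning_point + 1}")
```

Requiring several consecutive increases is one way to avoid flagging a single noisy epoch. A larger `window` is more conservative.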

### Causes of Overfitting:

1. **Model too complex**: Too many parameters relative to data
2. **Insufficient data**: Not enough samples to learn generalizable patterns
3. **Training too long**: Model memorizes training examples
4. **No regularization**: Nothing prevents memorization
5. **Noisy labels**: Model learns incorrect patterns

### Solutions (Regularization Techniques):

1. **L1/L2 Regularization**: Penalize large weights
2. **Dropout**: Randomly deactivate neurons during training
3. **Batch Normalization**: Normalize activations
4. **Early Stopping**: Stop training when validation performance degrades
5. **Data Augmentation**: Artificially expand training data
6. **Architecture simplification**: Use fewer layers/neurons

In [None]:
# Load Fashion-MNIST for demonstration
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Normalize and flatten
X_train_full = X_train_full.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

X_train_full_flat = X_train_full.reshape(-1, 784)
X_test_flat = X_test.reshape(-1, 784)

# Create validation split
X_train = X_train_full_flat[:50000]
X_valid = X_train_full_flat[50000:]
y_train = y_train_full[:50000]
y_valid = y_train_full[50000:]

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_valid.shape}")
print(f"Test set: {X_test_flat.shape}")

In [None]:
# Create a deliberately overfitting model (too complex)
def create_overfitting_model():
    """
    Create an overly complex model to demonstrate overfitting.
    Many parameters, deep architecture.
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        layers.Dense(512, activation='relu'),
        layers.Dense(512, activation='relu'),
        layers.Dense(512, activation='relu'),
        layers.Dense(512, activation='relu'),
        layers.Dense(10, activation='softmax')
    ], name='overfitting_model')
    return model

# Train overfitting model
print("Training overly complex model (will overfit)...\n")
model_overfit = create_overfitting_model()
model_overfit.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train with small dataset to induce overfitting
# Use only 5000 samples
X_train_small = X_train[:5000]
y_train_small = y_train[:5000]

history_overfit = model_overfit.fit(
    X_train_small, y_train_small,
    epochs=50,
    batch_size=32,
    validation_data=(X_valid, y_valid),
    verbose=0
)

print("Training completed!")

In [None]:
# Visualize overfitting
def plot_overfitting_analysis(history, title="Overfitting Analysis"):
    """
    Plot training history with overfitting annotations.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Loss plot
    epochs = range(1, len(history.history['loss']) + 1)
    ax1.plot(epochs, history.history['loss'], 'b-', linewidth=2, label='Training Loss')
    ax1.plot(epochs, history.history['val_loss'], 'r-', linewidth=2, label='Validation Loss')
    
    # Find the epoch with the lowest validation loss
    val_loss = np.array(history.history['val_loss'])
    # np.argmin returns the global minimum (0-based index), not the first local minimum
    best_epoch = np.argmin(val_loss)
    
    ax1.axvline(x=best_epoch+1, color='green', linestyle='--', linewidth=2, 
                label=f'Best Epoch ({best_epoch+1})')
    ax1.fill_between(epochs[best_epoch:], 0, max(history.history['val_loss']), 
                      alpha=0.2, color='red', label='Overfitting Region')
    
    ax1.set_xlabel('Epoch', fontsize=12)
    ax1.set_ylabel('Loss', fontsize=12)
    ax1.set_title('Loss Over Time', fontsize=12, fontweight='bold')
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3)
    
    # Accuracy plot
    ax2.plot(epochs, history.history['accuracy'], 'b-', linewidth=2, label='Training Accuracy')
    ax2.plot(epochs, history.history['val_accuracy'], 'r-', linewidth=2, label='Validation Accuracy')
    ax2.axvline(x=best_epoch+1, color='green', linestyle='--', linewidth=2, 
                label=f'Best Epoch ({best_epoch+1})')
    ax2.fill_between(epochs[best_epoch:], 0, 1, 
                      alpha=0.2, color='red', label='Overfitting Region')
    
    ax2.set_xlabel('Epoch', fontsize=12)
    ax2.set_ylabel('Accuracy', fontsize=12)
    ax2.set_title('Accuracy Over Time', fontsize=12, fontweight='bold')
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)
    
    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print(f"\nOverfitting Statistics:")
    print(f"  Best Epoch: {best_epoch + 1}")
    print(f"  Best Validation Loss: {val_loss[best_epoch]:.4f}")
    print(f"  Final Training Accuracy: {history.history['accuracy'][-1]:.4f}")
    print(f"  Final Validation Accuracy: {history.history['val_accuracy'][-1]:.4f}")
    print(f"  Accuracy Gap: {abs(history.history['accuracy'][-1] - history.history['val_accuracy'][-1]):.4f}")

plot_overfitting_analysis(history_overfit, "Demonstration of Overfitting")

## 3. L1 and L2 Regularization

### Weight Regularization Concept:

Add a penalty term to the loss function to discourage large weights.

### L2 Regularization (Ridge, Weight Decay):

$$
L_{\text{total}} = L_{\text{original}} + \lambda \sum_{i} w_i^2
$$

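- **Effect**: Penalizes large weights quadratically, so the model prefers many small weights over a few large ones
- **Gradient**: $\frac{\partial L_{\text{total}}}{\partial w_i} = \frac{\partial L_{\text{original}}}{\partial w_i} + 2\lambda w_i$
- **Use case**: Default choice, works well in most scenarios
- **Note**: The penalty is equivalent to weight decay under plain SGD; with adaptive optimizers such as Adam, the two are not the same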

### L1 Regularization (Lasso):

$$
L_{\text{total}} = L_{\text{original}} + \lambda \sum_{i} |w_i|
$$

- **Effect**: Pushes many weights toward zero, which encourages sparse models
- **Gradient**: $\frac{\partial L_{\text{total}}}{\partial w_i} = \frac{\partial L_{\text{original}}}{\partial w_i} + \lambda \cdot \text{sign}(w_i)$ (a subgradient, since $|w_i|$ is not differentiable at $w_i = 0$)
- **Use case**: Feature selection, when you want sparse networks

### Elastic Net (L1 + L2):

$$
L_{\text{total}} = L_{\text{original}} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2
$$

Combines benefits of both.
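
To make the three penalty formulas concrete, here is a minimal plain-Python sketch that evaluates the penalty term and its gradient for a small, made-up weight vector. The function names (`sign`, `elastic_net_penalty`, `elastic_net_grad`) and the example weights are assumptions for illustration only. They compute just the regularization term, not the full training loss.

```python
def sign(w):
    """Return -1, 0, or 1; used as the L1 subgradient (0 at w == 0)."""
    return (w > 0) - (w < 0)


def elastic_net_penalty(weights, lam1, lam2):
    """lam1 * sum(|w_i|) + lam2 * sum(w_i^2)."""
    l1 = lam1 * sum(abs(w) for w in weights)
    l2 = lam2 * sum(w * w for w in weights)
    return l1 + l2


def elastic_net_grad(weights, lam1, lam2):
    """Per-weight gradient of the penalty term only."""
    return [lam1 * sign(w) + 2 * lam2 * w for w in weights]


# Hypothetical weights chosen to include a small, a large, and a zero value
weights = [0.5, -2.0, 0.0, 3.0]

# Setting one coefficient to zero recovers pure L2 or pure L1
print("L2 penalty: ", elastic_net_penalty(weights, lam1=0.0, lam2=0.01))
print("L1 penalty: ", elastic_net_penalty(weights, lam1=0.01, lam2=0.0))
print("L2 gradient:", elastic_net_grad(weights, lam1=0.0, lam2=0.01))
print("L1 gradient:", elastic_net_grad(weights, lam1=0.01, lam2=0.0))
```

The L2 gradient grows with the size of each weight, while the L1 gradient has the same magnitude for every non-zero weight. That constant pull is why L1 tends to push small weights toward zero. In Keras, `regularizers.l1_l2(l1=..., l2=...)` applies both penalties to a layer's kernel.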

In [None]:
# Model with L2 regularization
def create_l2_model(l2_lambda=0.01):
    """
    Create model with L2 regularization.
    
    Parameters:
    -----------
    l2_lambda : float
        L2 regularization strength (typically 0.001 to 0.1)
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        layers.Dense(512, activation='relu', 
                    kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(10, activation='softmax')
    ], name='l2_regularized')
    return model

# Model with L1 regularization
def create_l1_model(l1_lambda=0.01):
    """
    Create model with L1 regularization.
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l1(l1_lambda)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l1(l1_lambda)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l1(l1_lambda)),
        layers.Dense(512, activation='relu',
                    kernel_regularizer=regularizers.l1(l1_lambda)),
        layers.Dense(10, activation='softmax')
    ], name='l1_regularized')
    return model

# Train models with regularization
print("Training models with L1/L2 regularization...\n")

# L2 model
model_l2 = create_l2_model(l2_lambda=0.001)
model_l2.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history_l2 = model_l2.fit(X_train_small, y_train_small, epochs=50, batch_size=32,
                          validation_data=(X_valid, y_valid), verbose=0)
print("L2 model trained!")

# L1 model
model_l1 = create_l1_model(l1_lambda=0.0001)
model_l1.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history_l1 = model_l1.fit(X_train_small, y_train_small, epochs=50, batch_size=32,
                          validation_data=(X_valid, y_valid), verbose=0)
print("L1 model trained!")

In [None]:
# Compare regularization methods
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Validation loss
ax1.plot(history_overfit.history['val_loss'], linewidth=2, label='No Regularization', marker='o')
ax1.plot(history_l2.history['val_loss'], linewidth=2, label='L2 Regularization', marker='s')
ax1.plot(history_l1.history['val_loss'], linewidth=2, label='L1 Regularization', marker='^')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Loss', fontsize=12)
ax1.set_title('Impact of L1/L2 Regularization on Validation Loss', fontsize=12, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Validation accuracy
ax2.plot(history_overfit.history['val_accuracy'], linewidth=2, label='No Regularization', marker='o')
ax2.plot(history_l2.history['val_accuracy'], linewidth=2, label='L2 Regularization', marker='s')
ax2.plot(history_l1.history['val_accuracy'], linewidth=2, label='L1 Regularization', marker='^')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Validation Accuracy', fontsize=12)
ax2.set_title('Impact of L1/L2 Regularization on Validation Accuracy', fontsize=12, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBest Validation Accuracy:")
print(f"  No Regularization: {max(history_overfit.history['val_accuracy']):.4f}")
print(f"  L2 Regularization: {max(history_l2.history['val_accuracy']):.4f}")
print(f"  L1 Regularization: {max(history_l1.history['val_accuracy']):.4f}")

## 4. Dropout: Preventing Co-Adaptation

### Dropout Intuition:

**Key Idea**: During training, randomly "drop" (set to zero) a fraction of neurons.

### How Dropout Works:

1. **Training**: Each neuron has probability $p$ of being dropped
   - Remaining neurons scaled by $1/(1-p)$ to maintain expected sum
   - Forces network to learn redundant representations

2. **Inference**: Use all neurons (no dropout)
   - No extra rescaling is needed, because activations were already scaled by $1/(1-p)$ during training

### Mathematical Formulation:

During training:
$$
\begin{align}
r_i &\sim \text{Bernoulli}(1-p) \quad \text{(dropout mask)} \\
h_i &= \frac{r_i}{1-p} \cdot f(\mathbf{W}\mathbf{x} + \mathbf{b}) \quad \text{(scaled activation)}
\end{align}
$$

### Why Dropout Works:

1. **Ensemble Effect**: Training many "thinned" networks, averaging at test time
2. **Breaks Co-Adaptation**: Neurons can't rely on specific other neurons
3. **Implicit Regularization**: Similar to L2 for linear models

### Typical Dropout Rates:
- **Hidden layers**: 0.2 to 0.5 (20-50% dropped)
- **Input layer**: 0.1 to 0.2 (10-20% dropped, if used at all)
- **Output layer**: Never use dropout
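
The training and inference behaviour described above can be sketched in plain Python. This is an illustrative version of inverted dropout under the assumption that $p < 1$; it is not the Keras implementation. The function name `dropout` and the `rng` argument are hypothetical.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: during training, zero each activation with
    probability p and scale the survivors by 1 / (1 - p).
    At inference the input is returned unchanged."""
    if not training or p == 0.0:
        return list(activations)
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(42)
activations = [1.0] * 10_000

dropped = dropout(activations, p=0.5, training=True, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(f"Fraction zeroed: {fraction_zeroed:.3f}")
print(f"Mean activation after dropout: {mean_after:.3f} (before: 1.000)")

# At inference the layer is the identity
print(dropout([1.0, 2.0], p=0.5, training=False, rng=rng))
```

With `p=0.5`, roughly half of the activations are zeroed and the rest are doubled, so the mean stays close to its original value. This is why no correction is needed at inference time.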

In [None]:
# Model with dropout
def create_dropout_model(dropout_rate=0.5):
    """
    Create model with dropout regularization.
    
    Parameters:
    -----------
    dropout_rate : float
        Fraction of neurons to drop (typically 0.2 to 0.5)
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        
        layers.Dense(512, activation='relu'),
        layers.Dropout(dropout_rate),  # Drop a fraction `dropout_rate` of activations
        
        layers.Dense(512, activation='relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(512, activation='relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(512, activation='relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(10, activation='softmax')
    ], name='dropout_model')
    return model

# Train with different dropout rates
dropout_rates = [0.2, 0.3, 0.5]
histories_dropout = {}

print("Training models with different dropout rates...\n")

for rate in dropout_rates:
    print(f"Training with dropout rate = {rate}...")
    model = create_dropout_model(dropout_rate=rate)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    history = model.fit(X_train_small, y_train_small, epochs=50, batch_size=32,
                       validation_data=(X_valid, y_valid), verbose=0)
    histories_dropout[rate] = history
    print(f"  Best validation accuracy: {max(history.history['val_accuracy']):.4f}")

print("\nAll dropout models trained!")

In [None]:
# Compare dropout rates
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot no-dropout baseline
ax1.plot(history_overfit.history['val_loss'], linewidth=2, 
         label='No Dropout', linestyle='--', alpha=0.7)
ax2.plot(history_overfit.history['val_accuracy'], linewidth=2, 
         label='No Dropout', linestyle='--', alpha=0.7)

# Plot different dropout rates
colors = ['green', 'blue', 'red']
for (rate, history), color in zip(histories_dropout.items(), colors):
    ax1.plot(history.history['val_loss'], linewidth=2, 
             label=f'Dropout {rate}', marker='o', color=color)
    ax2.plot(history.history['val_accuracy'], linewidth=2, 
             label=f'Dropout {rate}', marker='o', color=color)

ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Loss', fontsize=12)
ax1.set_title('Dropout Impact on Validation Loss', fontsize=12, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Validation Accuracy', fontsize=12)
ax2.set_title('Dropout Impact on Validation Accuracy', fontsize=12, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Batch Normalization: Stabilizing Training

### The Internal Covariate Shift Problem:

During training, the distribution of layer inputs changes as parameters update.
This slows down training and requires careful initialization and low learning rates.

### Batch Normalization Solution:

**Normalize** each feature across the mini-batch to have mean 0 and variance 1:

$$
\begin{align}
\mu_B &= \frac{1}{m} \sum_{i=1}^m x_i \quad \text{(batch mean)} \\
\sigma_B^2 &= \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 \quad \text{(batch variance)} \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \text{(normalize)} \\
y_i &= \gamma \hat{x}_i + \beta \quad \text{(scale and shift)}
\end{align}
$$

Where:
- $\gamma$ and $\beta$ are **learnable** parameters (scale and shift)
- $\epsilon$ is a small constant for numerical stability (for example $10^{-5}$; the Keras `BatchNormalization` layer defaults to $10^{-3}$)
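
The four steps above can be traced by hand for a single feature. The following plain-Python sketch assumes a tiny made-up mini-batch and fixed $\gamma$ and $\beta$; the function names `batch_norm_train` and `batch_norm_infer` are hypothetical. The inference function illustrates the general idea that stored statistics replace batch statistics at test time (Keras keeps moving averages for this purpose). It does not reproduce the Keras layer.

```python
import math


def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Apply the four steps above to one feature across a mini-batch."""
    m = len(batch)
    mu = sum(batch) / m                                        # batch mean
    var = sum((x - mu) ** 2 for x in batch) / m                # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # normalize
    y = [gamma * xh + beta for xh in x_hat]                    # scale and shift
    return y, mu, var


def batch_norm_infer(x, moving_mean, moving_var, gamma, beta, eps=1e-5):
    """At inference, use stored statistics instead of batch statistics."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta


batch = [2.0, 4.0, 6.0, 8.0]
y, mu, var = batch_norm_train(batch, gamma=1.0, beta=0.0)

out_mean = sum(y) / len(y)
out_var = sum((v - out_mean) ** 2 for v in y) / len(y)
print(f"batch mean={mu:.2f}, batch variance={var:.2f}")
print(f"output mean={out_mean:.4f}, output variance={out_var:.4f}")

# Hypothetical stored statistics, reused for a single inference-time input
print(batch_norm_infer(5.0, moving_mean=mu, moving_var=var, gamma=1.0, beta=0.0))
```

With $\gamma = 1$ and $\beta = 0$ the output has mean close to 0 and variance close to 1. Other values of $\gamma$ and $\beta$ let the network undo the normalization if that helps the next layer.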

### Benefits of Batch Normalization:

1. **Faster Training**: Can use higher learning rates
2. **Better Generalization**: Acts as regularization
3. **Less Sensitive to Initialization**: Normalizes activations
4. **Reduces Need for Dropout**: Provides regularization effect

### Where to Place BatchNorm:

**Option 1**: After activation
```python
Dense → Activation → BatchNorm → Dropout
```

**Option 2** (original paper, more common): Before activation
```python
Dense → BatchNorm → Activation → Dropout
```

Both work well in practice.

In [None]:
# Model with Batch Normalization
def create_batchnorm_model():
    """
    Create model with Batch Normalization.
    Using BatchNorm before activation (more common approach).
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        
        layers.Dense(10, activation='softmax')
    ], name='batchnorm_model')
    return model

# Model with both Batch Normalization and Dropout
def create_batchnorm_dropout_model(dropout_rate=0.3):
    """
    Create model combining BatchNorm and Dropout.
    Order: Dense → BatchNorm → Activation → Dropout
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(dropout_rate),
        
        layers.Dense(10, activation='softmax')
    ], name='batchnorm_dropout_model')
    return model

# Train models
print("Training BatchNorm models...\n")

# BatchNorm only
model_bn = create_batchnorm_model()
model_bn.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history_bn = model_bn.fit(X_train_small, y_train_small, epochs=50, batch_size=32,
                          validation_data=(X_valid, y_valid), verbose=0)
print("BatchNorm model trained!")
print(f"  Best validation accuracy: {max(history_bn.history['val_accuracy']):.4f}")

# BatchNorm + Dropout
model_bn_drop = create_batchnorm_dropout_model(dropout_rate=0.3)
model_bn_drop.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history_bn_drop = model_bn_drop.fit(X_train_small, y_train_small, epochs=50, batch_size=32,
                                     validation_data=(X_valid, y_valid), verbose=0)
print("\nBatchNorm + Dropout model trained!")
print(f"  Best validation accuracy: {max(history_bn_drop.history['val_accuracy']):.4f}")

In [None]:
# Compare all regularization techniques
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Validation loss comparison
ax1.plot(history_overfit.history['val_loss'], linewidth=2, label='No Regularization', alpha=0.7)
ax1.plot(history_l2.history['val_loss'], linewidth=2, label='L2 Only')
ax1.plot(histories_dropout[0.3].history['val_loss'], linewidth=2, label='Dropout Only')
ax1.plot(history_bn.history['val_loss'], linewidth=2, label='BatchNorm Only')
ax1.plot(history_bn_drop.history['val_loss'], label='BatchNorm + Dropout', linewidth=3)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Loss', fontsize=12)
ax1.set_title('Regularization Techniques Comparison: Loss', fontsize=12, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Validation accuracy comparison
ax2.plot(history_overfit.history['val_accuracy'], linewidth=2, label='No Regularization', alpha=0.7)
ax2.plot(history_l2.history['val_accuracy'], linewidth=2, label='L2 Only')
ax2.plot(histories_dropout[0.3].history['val_accuracy'], linewidth=2, label='Dropout Only')
ax2.plot(history_bn.history['val_accuracy'], linewidth=2, label='BatchNorm Only')
ax2.plot(history_bn_drop.history['val_accuracy'], label='BatchNorm + Dropout', linewidth=3)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Validation Accuracy', fontsize=12)
ax2.set_title('Regularization Techniques Comparison: Accuracy', fontsize=12, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary table
print("\n" + "="*80)
print("REGULARIZATION TECHNIQUES COMPARISON")
print("="*80)
print(f"{'Technique':<30} {'Best Val Acc':<20} {'Final Val Acc':<20}")
print("-"*80)

results = [
    ('No Regularization', history_overfit),
    ('L2 Only', history_l2),
    ('Dropout Only (0.3)', histories_dropout[0.3]),
    ('BatchNorm Only', history_bn),
    ('BatchNorm + Dropout', history_bn_drop)
]

for name, history in results:
    best_acc = max(history.history['val_accuracy'])
    final_acc = history.history['val_accuracy'][-1]
    print(f"{name:<30} {best_acc:<20.4f} {final_acc:<20.4f}")

print("="*80)

## 6. Early Stopping: Stop Before Overfitting

### Concept:

Monitor validation performance and **stop training** when it stops improving.

### How It Works:

1. Train the model while monitoring validation metric
2. If validation metric doesn't improve for `patience` epochs, stop
3. Optionally restore weights from best epoch

### Parameters:

- **monitor**: Metric to watch (`'val_loss'`, `'val_accuracy'`)
- **patience**: How many epochs to wait for improvement
- **mode**: `'min'` for loss, `'max'` for accuracy
- **restore_best_weights**: Restore model to best epoch (recommended)

### Benefits:

- Automatically finds optimal training duration
- Prevents overtraining
- Saves computation time
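
The patience rule described above can be simulated without training a model. This is a simplified sketch, assuming a made-up list of validation losses; the function name `early_stopping_epoch` is hypothetical. The Keras `EarlyStopping` callback has further options, such as `min_delta`, that this sketch ignores.

```python
def early_stopping_epoch(val_losses, patience):
    """Simulate patience-based early stopping on a list of validation losses.
    Returns (stop_epoch, best_epoch), both 0-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran for all epochs
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: best value at index 3, then no improvement
val_losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.57, 0.62, 0.70, 0.81]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"Stopped after epoch {stop_epoch + 1}; best epoch was {best_epoch + 1}")
```

The gap between the stopping epoch and the best epoch is why restoring the best weights matters. Without it, the model keeps the weights from the last `patience` epochs of non-improving training.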

In [None]:
# Train with Early Stopping
print("Training with Early Stopping...\n")

# Create model
model_early_stop = create_batchnorm_dropout_model(dropout_rate=0.3)
model_early_stop.compile(optimizer='adam', 
                         loss='sparse_categorical_crossentropy', 
                         metrics=['accuracy'])

# Define Early Stopping callback
early_stopping = EarlyStopping(
    monitor='val_loss',          # Monitor validation loss
    patience=10,                  # Wait 10 epochs for improvement
    mode='min',                   # Lower is better for loss
    restore_best_weights=True,    # Restore weights from best epoch
    verbose=1
)

# Train with callback
history_early_stop = model_early_stop.fit(
    X_train_small, y_train_small,
    epochs=100,  # Set high, early stopping will terminate
    batch_size=32,
    validation_data=(X_valid, y_valid),
    callbacks=[early_stopping],
    verbose=0
)

print(f"\nTraining stopped at epoch {len(history_early_stop.history['loss'])}")
print(f"Best validation accuracy: {max(history_early_stop.history['val_accuracy']):.4f}")

In [None]:
# Visualize Early Stopping
plot_overfitting_analysis(history_early_stop, "Early Stopping Example")

## 7. Combining Regularization Techniques

### Best Practices for Combining Regularization:

#### Common Combinations:

1. **BatchNorm + Dropout + Early Stopping** (a common default)
   ```python
   Dense → BatchNorm → ReLU → Dropout
   ```
   - BatchNorm: Faster training, some regularization
   - Dropout: Additional regularization
   - Early Stopping: Prevent overtraining

2. **L2 + Dropout + Early Stopping**
   - L2: Smooth weight regularization
   - Dropout: Ensemble effect
   - Early Stopping: Automatic duration

3. **BatchNorm + L2 + Early Stopping**
   - Good for very deep networks
   - BatchNorm may reduce need for dropout

### Guidelines:

| Scenario | Recommended Regularization |
|----------|---------------------------|
| **Small dataset** | L2 + Dropout (high rate) + Early Stopping |
| **Large dataset** | BatchNorm + Light Dropout + Early Stopping |
| **Deep network** | BatchNorm + Dropout + Early Stopping |
| **Need speed** | BatchNorm + Early Stopping |
| **Need sparsity** | L1 + Dropout + Early Stopping |

### Hyperparameter Tuning Order:

1. Start with BatchNorm (almost always helps)
2. Add moderate Dropout (0.2-0.3)
3. Add Early Stopping (patience ~10-20 epochs)
4. If still overfitting, increase Dropout or add L2
5. If underfitting, reduce regularization

In [None]:
# Create production-ready model with all best practices
def create_production_model():
    """
    Production-ready model with modern best practices:
    - Batch Normalization for stable training
    - Dropout for regularization
    - Will use Early Stopping during training
    """
    model = models.Sequential([
        layers.Input(shape=(784,)),
        
        # Layer 1
        layers.Dense(256),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(0.3),
        
        # Layer 2
        layers.Dense(128),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(0.3),
        
        # Layer 3
        layers.Dense(64),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(0.2),
        
        # Output layer (no dropout, no batchnorm)
        layers.Dense(10, activation='softmax')
    ], name='production_model')
    return model

# Train on FULL dataset with all regularization techniques
print("Training production model on full dataset...\n")

model_prod = create_production_model()
model_prod.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True, verbose=1),
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True, verbose=0)
]

# Train on FULL training set
history_prod = model_prod.fit(
    X_train, y_train,  # Full training set
    epochs=100,
    batch_size=128,
    validation_data=(X_valid, y_valid),
    callbacks=callbacks,
    verbose=0
)

print(f"\nTraining completed at epoch {len(history_prod.history['loss'])}")
print(f"Best validation accuracy: {max(history_prod.history['val_accuracy']):.4f}")

# Evaluate on test set
test_loss, test_accuracy = model_prod.evaluate(X_test_flat, y_test, verbose=0)
print(f"\nTest set performance:")
print(f"  Test Loss: {test_loss:.4f}")
print(f"  Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

In [None]:
# Plot production model training
plot_overfitting_analysis(history_prod, "Production Model with All Regularization Techniques")

## 8. Summary

### Key Concepts:

1. **Overfitting**:
   - Model learns training data too well
   - Validation loss increases while training loss decreases
   - Large gap between training and validation accuracy

2. **L1/L2 Regularization**:
   - L2: Penalizes large weights, most common
   - L1: Creates sparse weights, feature selection
   - Add regularization term to loss function

3. **Dropout**:
   - Randomly drop neurons during training
   - Prevents co-adaptation
   - Ensemble effect
   - Typical rates: 0.2-0.5 for hidden layers

4. **Batch Normalization**:
   - Normalizes layer inputs
   - Faster training, higher learning rates
   - Regularization effect
   - Place before or after activation

5. **Early Stopping**:
   - Monitor validation performance
   - Stop when no improvement
   - Restore best weights
   - Patience typically 10-20 epochs

6. **Combining Techniques**:
   - BatchNorm + Dropout + Early Stopping (most common)
   - Start with BatchNorm, add Dropout if needed
   - Always use Early Stopping

### Decision Guide:

```
Is your model overfitting?
    |
    |-- YES → Add regularization
    |       |
    |       |-- Try BatchNorm first (usually helps)
    |       |-- Add Dropout (0.2-0.5)
    |       |-- Add L2 if still overfitting
    |       |-- Always use Early Stopping
    |
    |-- NO → Check if underfitting
            |
            |-- Reduce regularization
            |-- Increase model capacity
            |-- Train longer
```

### What's Next?

- **Module 08**: Loss Functions and Metrics
- **Module 09**: Hyperparameter Tuning for Deep Learning

### Additional Resources:

- [Dropout: A Simple Way to Prevent Overfitting](https://jmlr.org/papers/v15/srivastava14a.html)
- [Batch Normalization Paper](https://arxiv.org/abs/1502.03167)
- [Keras Regularizers Documentation](https://keras.io/api/layers/regularizers/)
- [CS231n: Regularization](https://cs231n.github.io/neural-networks-2/)

## 9. Exercises

### Exercise 1: Dropout Rate Sensitivity

**Task**: Systematically test how dropout rate affects model performance.

**Requirements**:
- Test dropout rates: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
- Use same architecture and training configuration
- Plot validation accuracy vs dropout rate
- Identify optimal dropout rate

**Questions**:
1. What happens with very high dropout (0.6-0.7)?
2. Is there a "sweet spot" dropout rate?
3. How does dropout affect training time?

```python
# Your code here
```

In [None]:
# Exercise 1 Solution
# Uncomment to reveal

# dropout_rates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
# results = {}
# 
# for rate in dropout_rates:
#     model = create_dropout_model(dropout_rate=rate)
#     model.compile(optimizer='adam',
#                   loss='sparse_categorical_crossentropy',
#                   metrics=['accuracy'])
#     
#     history = model.fit(X_train, y_train, epochs=20,
#                        validation_data=(X_valid, y_valid),
#                        verbose=0)
#     
#     results[rate] = {
#         'best_val_acc': max(history.history['val_accuracy']),
#         'final_val_acc': history.history['val_accuracy'][-1]
#     }
# 
# # Plot results
# plt.figure(figsize=(10, 6))
# rates = list(results.keys())
# best_accs = [results[r]['best_val_acc'] for r in rates]
# plt.plot(rates, best_accs, marker='o', linewidth=2)
# plt.xlabel('Dropout Rate')
# plt.ylabel('Best Validation Accuracy')
# plt.title('Dropout Rate vs Model Performance')
# plt.grid(True)
# plt.show()

### Exercise 2: BatchNorm Placement Comparison

**Task**: Compare different BatchNorm placement strategies.

**Requirements**:
- Model 1: Dense → BatchNorm → Activation
- Model 2: Dense → Activation → BatchNorm
- Model 3: No BatchNorm (baseline)
- Train all with same configuration
- Compare convergence speed and final accuracy

**Questions**:
1. Which placement converges faster?
2. Which achieves better final accuracy?
3. Is there a significant difference?

```python
# Your code here
```

In [None]:
# Exercise 2 Solution
# Uncomment to reveal

# def create_bn_before_activation():
#     model = models.Sequential([
#         layers.Input(shape=(784,)),
#         layers.Dense(256),
#         layers.BatchNormalization(),
#         layers.Activation('relu'),
#         layers.Dense(128),
#         layers.BatchNormalization(),
#         layers.Activation('relu'),
#         layers.Dense(10, activation='softmax')
#     ])
#     return model
# 
# def create_bn_after_activation():
#     model = models.Sequential([
#         layers.Input(shape=(784,)),
#         layers.Dense(256, activation='relu'),
#         layers.BatchNormalization(),
#         layers.Dense(128, activation='relu'),
#         layers.BatchNormalization(),
#         layers.Dense(10, activation='softmax')
#     ])
#     return model
# 
# # Train and compare
# # (Add training code here)

### Exercise 3: Regularization for CIFAR-10

**Task**: Apply regularization techniques to improve CIFAR-10 classification.

**Requirements**:
- Load CIFAR-10 dataset
- Create model with BatchNorm + Dropout + L2
- Use Early Stopping
- Achieve >55% validation accuracy
- Compare with unregularized baseline

**Questions**:
1. Which regularization technique helped most?
2. How much did regularization improve generalization?
3. Did training take longer with regularization?

```python
# Your code here
```

In [None]:
# Exercise 3 Solution
# Uncomment to reveal

# # Load CIFAR-10
# (X_cifar_train, y_cifar_train), (X_cifar_test, y_cifar_test) = cifar10.load_data()
# 
# # Preprocess
# X_cifar_train = X_cifar_train.astype('float32') / 255.0
# X_cifar_test = X_cifar_test.astype('float32') / 255.0
# X_cifar_train_flat = X_cifar_train.reshape(-1, 3072)
# X_cifar_test_flat = X_cifar_test.reshape(-1, 3072)
# y_cifar_train = y_cifar_train.flatten()
# y_cifar_test = y_cifar_test.flatten()
# 
# # Split validation
# X_c_train, X_c_valid, y_c_train, y_c_valid = train_test_split(
#     X_cifar_train_flat, y_cifar_train, test_size=0.2, random_state=42
# )
# 
# # Create regularized model
# model_cifar = models.Sequential([
#     layers.Input(shape=(3072,)),
#     layers.Dense(512, kernel_regularizer=regularizers.l2(0.001)),
#     layers.BatchNormalization(),
#     layers.Activation('relu'),
#     layers.Dropout(0.4),
#     # ... add more layers
# ])
# 
# # Compile and train
# # (Add training code here)

---

**Congratulations!** You've completed Module 07. You now understand:
- How to diagnose and prevent overfitting
- L1/L2 regularization for weight control
- Dropout for ensemble-like regularization
- Batch Normalization for stable training
- Early Stopping to prevent overtraining
- How to combine regularization techniques effectively

Continue to **Module 08: Loss Functions and Metrics** to learn about different training objectives!