# Lab 2: Deep Learning Fundamentals

In this lab, we'll explore the techniques that make training deep neural networks possible and effective. You'll learn about normalization, regularization, optimization, and architectural patterns that are essential for modern deep learning.

## Learning Objectives

By the end of this lab, you will:
- Understand and apply batch normalization
- Implement dropout for regularization
- Use advanced optimizers (Adam, RMSprop, AdamW)
- Apply learning rate scheduling
- Handle vanishing/exploding gradients
- Build deep networks with residual connections
- Master weight initialization strategies
- Train robust deep learning models

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping, ModelCheckpoint
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

## Part 1: Batch Normalization

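**Batch Normalization** normalizes each layer's inputs using the statistics of the current mini-batch during training, and running averages of those statistics at inference. In practice this typically brings several benefits:
- Faster training
- Higher learning rates possible
- Less sensitive to initialization
- Acts as a mild regularizer, because batch statistics add noise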

### Algorithm:
For a mini-batch $\mathcal{B} = \{x_1, ..., x_m\}$:

1. Compute batch mean: $\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i$
2. Compute batch variance: $\sigma^2_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$
3. Normalize: $\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}}$
4. Scale and shift: $y_i = \gamma \hat{x}_i + \beta$
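
To make the four steps concrete, here is a minimal NumPy sketch of the forward pass. The function names `batch_norm_train` and `batch_norm_infer` are illustrative only (they are not Keras APIs), and the sketch leaves out the moving-average bookkeeping that a real layer performs between batches.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Steps 1-4 for a mini-batch x of shape (m, features)."""
    mu = x.mean(axis=0)                      # 1. batch mean
    var = x.var(axis=0)                      # 2. batch variance (1/m, biased)
    x_hat = (x - mu) / np.sqrt(var + eps)    # 3. normalize
    y = gamma * x_hat + beta                 # 4. scale and shift
    return y, mu, var

def batch_norm_infer(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """At inference, fixed statistics replace the batch statistics."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))   # badly scaled activations
gamma, beta = np.ones(4), np.zeros(4)               # learnable in a real layer

y, mu, var = batch_norm_train(x, gamma, beta)
print("mean per feature:", y.mean(axis=0).round(6))
print("std per feature: ", y.std(axis=0).round(3))

# Reusing the same statistics at inference reproduces the training output.
y_infer = batch_norm_infer(x, gamma, beta, mu, var)
```

With $\gamma = 1$ and $\beta = 0$ each feature comes out with roughly zero mean and unit variance; the learnable $\gamma$ and $\beta$ let the network undo the normalization where that helps.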

In [None]:
# Load Fashion-MNIST
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Normalize
X_train = X_train.reshape(-1, 784) / 255.0
X_test = X_test.reshape(-1, 784) / 255.0

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

In [None]:
# Model WITHOUT batch normalization
model_no_bn = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=(784,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Model WITH batch normalization
model_with_bn = keras.Sequential([
    layers.Dense(256, input_shape=(784,)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(128),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])

# Compile both
for model in [model_no_bn, model_with_bn]:
    model.compile(
        optimizer='sgd',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

In [None]:
# Train both models
history_no_bn = model_no_bn.fit(
    X_train, y_train,
    batch_size=128,
    epochs=15,
    validation_split=0.1,
    verbose=0
)

history_with_bn = model_with_bn.fit(
    X_train, y_train,
    batch_size=128,
    epochs=15,
    validation_split=0.1,
    verbose=0
)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(history_no_bn.history['loss'], label='Without BN')
axes[0].plot(history_with_bn.history['loss'], label='With BN')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('Training Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(history_no_bn.history['val_accuracy'], label='Without BN')
axes[1].plot(history_with_bn.history['val_accuracy'], label='With BN')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy')
axes[1].set_title('Validation Accuracy Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Typical Batch Normalization benefits (compare with the plots above):")
print("- Faster convergence")
print("- More stable training")
print("- Often better final accuracy")

## Part 2: Dropout Regularization

**Dropout** randomly drops neurons during training to prevent overfitting.

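- During training: Randomly set activations to 0 with probability $p$ and scale the surviving activations by $1/(1-p)$ (inverted dropout, as `layers.Dropout` does)
- During inference: Use all neurons, with no further scaling needed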

### Benefits:
- Prevents co-adaptation of neurons
- Ensemble effect (training many sub-networks)
- Strong regularization
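
The training/inference behavior can be sketched in a few lines of NumPy. This is a simplified illustration of inverted dropout, not the Keras implementation; the helper name `dropout` and its signature are assumptions made for this example.

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: drop with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                              # inference: identity, no scaling
    keep_mask = rng.random(a.shape) >= p      # True with probability 1 - p
    return a * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 256))                      # stand-in for a layer's activations

train_out = dropout(a, p=0.3, training=True, rng=rng)
eval_out = dropout(a, p=0.3, training=False, rng=rng)

print("fraction dropped:       ", (train_out == 0).mean())
print("mean activation (train):", train_out.mean())
print("mean activation (eval): ", eval_out.mean())
```

Because the survivors are scaled up during training, the expected activation is the same in both modes, which is why no correction is needed at inference.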

In [None]:
# Model with dropout
model_dropout = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=(784,)),
    layers.Dropout(0.3),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])

model_dropout.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Model without dropout (for comparison)
model_no_dropout = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=(784,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model_no_dropout.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

In [None]:
# Train both
history_dropout = model_dropout.fit(
    X_train, y_train,
    batch_size=128,
    epochs=20,
    validation_split=0.1,
    verbose=0
)

history_no_dropout = model_no_dropout.fit(
    X_train, y_train,
    batch_size=128,
    epochs=20,
    validation_split=0.1,
    verbose=0
)

# Compare overfitting
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Without dropout - training vs validation
axes[0].plot(history_no_dropout.history['accuracy'], label='Training')
axes[0].plot(history_no_dropout.history['val_accuracy'], label='Validation')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Without Dropout (Overfitting)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# With dropout - training vs validation
axes[1].plot(history_dropout.history['accuracy'], label='Training')
axes[1].plot(history_dropout.history['val_accuracy'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('With Dropout (Better Generalization)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice how dropout reduces the gap between training and validation accuracy!")

## Part 3: Advanced Optimizers

Beyond plain SGD, modern optimizers add momentum, adapt the learning rate per parameter, or both:

### SGD with Momentum
$$v_t = \beta v_{t-1} + (1-\beta) \nabla L$$
$$\theta_t = \theta_{t-1} - \alpha v_t$$

### RMSprop
$$s_t = \beta s_{t-1} + (1-\beta) (\nabla L)^2$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{s_t + \epsilon}} \nabla L$$

### Adam (Adaptive Moment Estimation)
Combines momentum and RMSprop:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla L)^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
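
The three update rules above translate almost line for line into NumPy. The sketch below applies them to a toy quadratic loss $L(\theta) = \frac{1}{2}\lVert\theta\rVert^2$, whose gradient is simply $\theta$. The function names, the `state` dictionary, and the hyperparameter values are choices made for this illustration; library implementations differ in details such as where $\epsilon$ is added.

```python
import numpy as np

def momentum_step(theta, grad, state, alpha=0.1, beta=0.9):
    state["v"] = beta * state["v"] + (1 - beta) * grad
    return theta - alpha * state["v"]

def rmsprop_step(theta, grad, state, alpha=0.01, beta=0.9, eps=1e-8):
    state["s"] = beta * state["s"] + (1 - beta) * grad**2
    return theta - alpha / np.sqrt(state["s"] + eps) * grad

def adam_step(theta, grad, state, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - alpha / (np.sqrt(v_hat) + eps) * m_hat

def minimize(step_fn, steps=1000):
    """Run one optimizer on L(theta) = 0.5 * ||theta||^2 from a fixed start."""
    theta = np.array([1.0, -2.0])
    state = {"v": np.zeros(2), "s": np.zeros(2), "m": np.zeros(2), "t": 0}
    for _ in range(steps):
        grad = theta                 # gradient of the toy quadratic loss
        theta = step_fn(theta, grad, state)
    return theta

results = {fn.__name__: minimize(fn)
           for fn in (momentum_step, rmsprop_step, adam_step)}
for name, theta in results.items():
    print(f"{name:14s} final theta = {theta}")
```

All three end up close to the minimum at zero. Note that the RMSprop and Adam steps are normalized by the gradient scale, so their step size is governed mostly by $\alpha$ rather than by the raw gradient magnitude.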

In [None]:
# Compare optimizers
optimizers_to_test = {
    'SGD': keras.optimizers.SGD(learning_rate=0.01),
    'SGD + Momentum': keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    'RMSprop': keras.optimizers.RMSprop(learning_rate=0.001),
    'Adam': keras.optimizers.Adam(learning_rate=0.001),
    'AdamW': keras.optimizers.AdamW(learning_rate=0.001)
}

histories = {}

for name, optimizer in optimizers_to_test.items():
    print(f"Training with {name}...")
    
    model = keras.Sequential([
        layers.Dense(128, activation='relu', input_shape=(784,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    history = model.fit(
        X_train[:10000], y_train[:10000],  # Subset for speed
        batch_size=128,
        epochs=20,
        validation_split=0.2,
        verbose=0
    )
    
    histories[name] = history

In [None]:
# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for name, history in histories.items():
    axes[0].plot(history.history['loss'], label=name)
    axes[1].plot(history.history['val_accuracy'], label=name)

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('Optimizer Comparison - Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy')
axes[1].set_title('Optimizer Comparison - Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTypical observations:")
print("- Adam converges fastest (good default choice)")
print("- SGD with momentum is more stable than plain SGD")
print("- AdamW adds decoupled weight decay regularization")

## Part 4: Learning Rate Scheduling

Learning rate schedules adjust the learning rate during training:

- **Step Decay**: Reduce LR by factor every N epochs
- **Exponential Decay**: $\alpha_t = \alpha_0 e^{-kt}$
- **Cosine Annealing**: $\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\frac{t\pi}{T}))$
- **Warmup**: Start with small LR, gradually increase

In [None]:
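# Learning rate schedules
# NOTE: step_decay and exponential_decay scale whatever `lr` they receive, so pass
# the INITIAL learning rate (as the plot below does). Keras' LearningRateScheduler
# passes the CURRENT rate each epoch, which would compound the decay.
def step_decay(epoch, lr):
    drop_rate = 0.5
    epochs_drop = 5
    return lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(epoch, lr):
    k = 0.1
    return lr * np.exp(-k * epoch)

def cosine_annealing(epoch, lr, total_epochs=20):
    # `lr` is unused: the rate depends only on the epoch, so this one is callback-safe
    min_lr = 1e-5
    max_lr = 0.001
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * epoch / total_epochs))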

# Visualize schedules
epochs = np.arange(0, 20)
initial_lr = 0.001

plt.figure(figsize=(12, 6))
plt.plot(epochs, [step_decay(e, initial_lr) for e in epochs], label='Step Decay', linewidth=2)
plt.plot(epochs, [exponential_decay(e, initial_lr) for e in epochs], label='Exponential Decay', linewidth=2)
plt.plot(epochs, [cosine_annealing(e, initial_lr) for e in epochs], label='Cosine Annealing', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
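
One caveat when moving from plotting to training: `LearningRateScheduler` calls the schedule with the optimizer's current learning rate as the second argument. A schedule that multiplies that argument would therefore apply its decay again every epoch. A possible workaround, sketched here with hypothetical factory helpers (`make_step_decay` and `make_exponential_decay` are not part of Keras), is to close over the initial rate and ignore the value passed in.

```python
import numpy as np

def make_step_decay(initial_lr, drop_rate=0.5, epochs_drop=5):
    """Return schedule(epoch, lr) computed from initial_lr, not the current lr."""
    def schedule(epoch, lr=None):
        # `lr` (the optimizer's current rate) is deliberately ignored
        return float(initial_lr * drop_rate ** (epoch // epochs_drop))
    return schedule

def make_exponential_decay(initial_lr, k=0.1):
    """Return schedule(epoch, lr) following initial_lr * exp(-k * epoch)."""
    def schedule(epoch, lr=None):
        return float(initial_lr * np.exp(-k * epoch))
    return schedule

step_schedule = make_step_decay(initial_lr=0.001)
exp_schedule = make_exponential_decay(initial_lr=0.001)

# Simulate a per-epoch callback that feeds back the previously returned rate.
lr = 0.001
for epoch in range(12):
    lr = step_schedule(epoch, lr)
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Under these assumptions, `step_schedule` could be handed to `LearningRateScheduler` directly, and the rate would halve once every five epochs instead of on every epoch after the first drop.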

In [None]:
# Train with learning rate schedule
model_lr_schedule = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

model_lr_schedule.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Add callbacks
lr_scheduler = LearningRateScheduler(lambda epoch, lr: cosine_annealing(epoch, lr))
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history_scheduled = model_lr_schedule.fit(
    X_train, y_train,
    batch_size=128,
    epochs=20,
    validation_split=0.1,
    callbacks=[lr_scheduler, early_stopping],
    verbose=0
)

print(f"Best validation accuracy: {max(history_scheduled.history['val_accuracy']):.4f}")

## Part 5: Weight Initialization

Proper initialization is crucial for training deep networks:

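### Xavier/Glorot Initialization (for Sigmoid/Tanh)
$$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$$

### He Initialization (for ReLU)
$$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}$$

Here $n_{in}$ and $n_{out}$ are the number of input and output units of the layer.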

### Why it matters:
- Poor initialization → vanishing/exploding gradients
- Good initialization → stable training
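
A quick NumPy experiment shows why the scale matters. The sketch below samples weights with the Glorot and He standard deviations, then pushes random inputs through a stack of ReLU layers and measures how large the activations are at the end. It uses a plain normal distribution (Keras' `GlorotNormal` and `HeNormal` use a truncated normal), and the helper `final_activation_rms` along with the chosen depth and width are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 512

glorot_std = np.sqrt(2.0 / (n_in + n_out))
he_std = np.sqrt(2.0 / n_in)
W_glorot = rng.normal(0.0, glorot_std, size=(n_in, n_out))
W_he = rng.normal(0.0, he_std, size=(n_in, n_out))
print(f"Glorot std: target {glorot_std:.4f}, sampled {W_glorot.std():.4f}")
print(f"He std:     target {he_std:.4f}, sampled {W_he.std():.4f}")

def final_activation_rms(weight_std, depth=10, width=512, batch=256, seed=1):
    """Push random inputs through `depth` ReLU layers; return the RMS activation."""
    layer_rng = np.random.default_rng(seed)
    a = layer_rng.normal(size=(batch, width))
    for _ in range(depth):
        W = layer_rng.normal(0.0, weight_std, size=(width, width))
        a = np.maximum(a @ W, 0.0)           # dense layer (no bias) + ReLU
    return float(np.sqrt(np.mean(a**2)))

rms_small = final_activation_rms(weight_std=0.01)
rms_he = final_activation_rms(weight_std=np.sqrt(2.0 / 512))
print(f"RMS activation after 10 layers, std=0.01: {rms_small:.2e}")
print(f"RMS activation after 10 layers, He init:  {rms_he:.2e}")
```

With the small fixed standard deviation the signal shrinks at every layer and is almost gone after ten layers, which is the forward-pass counterpart of vanishing gradients. With He scaling the activations stay at roughly the same magnitude as the input.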

In [None]:
# Compare initializations
initializers = {
    'Zeros': keras.initializers.Zeros(),
    'Random Normal': keras.initializers.RandomNormal(stddev=0.01),
    'Xavier': keras.initializers.GlorotNormal(),
    'He': keras.initializers.HeNormal()
}

init_histories = {}

for name, initializer in initializers.items():
    if name == 'Zeros':  # Skip zeros: all-zero kernels give zero hidden activations, so the hidden layers get no gradient
        continue
        
    print(f"Training with {name} initialization...")
    
    model = keras.Sequential([
        layers.Dense(128, activation='relu', kernel_initializer=initializer, input_shape=(784,)),
        layers.Dense(64, activation='relu', kernel_initializer=initializer),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    history = model.fit(
        X_train[:5000], y_train[:5000],
        batch_size=128,
        epochs=10,
        validation_split=0.2,
        verbose=0
    )
    
    init_histories[name] = history

# Plot
plt.figure(figsize=(12, 6))
for name, history in init_histories.items():
    plt.plot(history.history['val_accuracy'], label=name, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy')
plt.title('Effect of Weight Initialization')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nHe initialization typically works best for ReLU networks!")

## Part 6: Residual Connections (ResNet-style)

**Residual connections** (skip connections) help train very deep networks:

$$y = F(x) + x$$

Where $F(x)$ is the residual mapping to be learned.

### Benefits:
- Easier gradient flow
- Enables training of very deep networks (100+ layers)
- Identity mapping as fallback
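
The "identity mapping as fallback" point is easy to check numerically. Below is a minimal NumPy sketch of $y = \text{ReLU}(F(x) + x)$ with a two-layer $F$; batch normalization is left out to keep it short, and the function and variable names are illustrative rather than taken from any library.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """y = relu(F(x) + x) with F(x) = relu(x W1 + b1) W2 + b2."""
    f = relu(x @ W1 + b1) @ W2 + b2   # residual mapping F(x)
    return relu(f + x)                # skip connection adds the input back

rng = np.random.default_rng(0)
units = 8
x = relu(rng.normal(size=(4, units)))   # non-negative, like a post-ReLU activation

# With F's weights at zero, the block reduces to the identity mapping.
zeros_W, zeros_b = np.zeros((units, units)), np.zeros(units)
y_identity = residual_block(x, zeros_W, zeros_b, zeros_W, zeros_b)

# With small random weights, the output is a small correction on top of x.
W1 = rng.normal(0.0, 0.01, size=(units, units))
W2 = rng.normal(0.0, 0.01, size=(units, units))
y_small = residual_block(x, W1, zeros_b, W2, zeros_b)

print("identity preserved:", np.allclose(y_identity, x))
print("max |y - x| with small weights:", np.abs(y_small - x).max())
```

A plain (non-residual) block with the same near-zero weights would output values near zero and lose the input, whereas the residual block passes it through. This is one intuition for why adding more residual blocks tends not to hurt.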

In [None]:
# Residual block in Keras (Functional API)
def residual_block(x, units):
    """
    Create a residual block.
    """
    # Main path
    out = layers.Dense(units)(x)
    out = layers.BatchNormalization()(out)
    out = layers.Activation('relu')(out)
    out = layers.Dense(units)(out)
    out = layers.BatchNormalization()(out)
    
    # Shortcut connection
    if x.shape[-1] != units:
        x = layers.Dense(units)(x)  # Project to match dimensions
    
    # Add and activate
    out = layers.Add()([out, x])
    out = layers.Activation('relu')(out)
    
    return out

# Build ResNet-style model
inputs = keras.Input(shape=(784,))
x = layers.Dense(128)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)

# Add residual blocks
x = residual_block(x, 128)
x = residual_block(x, 128)
x = residual_block(x, 64)

# Output
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(10, activation='softmax')(x)

model_resnet = keras.Model(inputs=inputs, outputs=outputs)

model_resnet.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model_resnet.summary()

In [None]:
# Train ResNet model
history_resnet = model_resnet.fit(
    X_train, y_train,
    batch_size=128,
    epochs=15,
    validation_split=0.1,
    verbose=1
)

# Evaluate
test_loss, test_acc = model_resnet.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy with Residual Connections: {test_acc:.4f}")

## Part 7: Complete Modern Architecture

Let's build a complete model that combines the techniques from Parts 1-6:

In [None]:
# Modern best-practice model
def create_modern_model():
    inputs = keras.Input(shape=(784,))
    
    # Initial layer
    x = layers.Dense(256, kernel_initializer='he_normal')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.3)(x)
    
    # Residual block 1
    residual = x
    x = layers.Dense(256, kernel_initializer='he_normal')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dense(256, kernel_initializer='he_normal')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, residual])
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.3)(x)
    
    # Transition
    x = layers.Dense(128, kernel_initializer='he_normal')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.2)(x)
    
    # Residual block 2
    residual = x
    x = layers.Dense(128, kernel_initializer='he_normal')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dense(128, kernel_initializer='he_normal')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, residual])
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.2)(x)
    
    # Output
    outputs = layers.Dense(10, activation='softmax')(x)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

# Create and compile
model_modern = create_modern_model()
model_modern.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Callbacks
callbacks = [
    LearningRateScheduler(lambda epoch, lr: cosine_annealing(epoch, lr, total_epochs=20)),
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ModelCheckpoint('best_model.h5', monitor='val_accuracy', save_best_only=True)
]

# Train
history_modern = model_modern.fit(
    X_train, y_train,
    batch_size=128,
    epochs=20,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)

# Final evaluation
test_loss, test_acc = model_modern.evaluate(X_test, y_test, verbose=0)
print(f"\nFinal Test Accuracy: {test_acc:.4f}")

In [None]:
# Visualize final results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(history_modern.history['loss'], label='Training')
axes[0].plot(history_modern.history['val_loss'], label='Validation')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Modern Architecture - Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(history_modern.history['accuracy'], label='Training')
axes[1].plot(history_modern.history['val_accuracy'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Modern Architecture - Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Key Takeaways

1. **Batch Normalization** accelerates training and improves stability
2. **Dropout** prevents overfitting through regularization
3. **Adam optimizer** is a great default choice
4. **Learning rate scheduling** can improve convergence
5. **He initialization** works well for ReLU networks
6. **Residual connections** enable very deep networks
7. **Combine techniques** for best results
8. **Early stopping** prevents overtraining
9. **Model checkpointing** saves best weights
10. **Monitor validation metrics** to detect overfitting

## Modern Deep Learning Recipe

1. Use **He initialization** for weights
2. Add **Batch Normalization** after dense layers (before activation)
3. Use **ReLU** or variants (LeakyReLU, ELU) for hidden layers
4. Add **Dropout** for regularization (0.2-0.5)
5. Use **Adam** or **AdamW** optimizer
6. Apply **learning rate scheduling** (cosine annealing)
7. Use **residual connections** for deep networks
8. Monitor with **early stopping**
9. Save best model with **checkpointing**
10. Use **data augmentation** when possible

## Exercises

1. **Ablation Study**: Remove techniques one-by-one and measure impact
2. **Hyperparameter Search**: Find optimal dropout rate and layer sizes
3. **Deep Network**: Build a 10-layer network with residual connections
4. **Custom Scheduler**: Implement warmup + cosine decay
5. **L2 Regularization**: Add kernel regularization and compare with dropout
6. **Gradient Clipping**: Implement gradient clipping for stability
7. **Mixed Precision**: Use mixed precision training for speed

## Next Steps

In Lab 3, we'll explore:
- Convolutional Neural Networks (CNNs)
- Computer vision applications
- Transfer learning
- Famous CNN architectures

Excellent work! You now understand the key techniques for training deep neural networks.