# W03 · Optimization & Regularization

This lab investigates how optimization algorithms and regularization strategies work together to improve deep learning models. You will combine theoretical intuition with practical TensorFlow experiments executed through the course utility helpers.

## Learning Objectives

By the end of this lab you will be able to:

- Derive and interpret the update rules of popular first-order optimizers.
- Explain how regularization mechanisms change the effective hypothesis space of neural networks.
- Run disciplined experiments on multiple datasets (including **Fashion-MNIST**) using the course `dl_utils` helpers.
- Compare and contrast the impact of optimization versus regularization choices on convergence speed and generalization.

## Optimization Refresher

The core of gradient-based learning is the iterative parameter update

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t),
$$

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss at iteration $t$. Several refinements build on this primitive:

- **Momentum SGD** accumulates an exponential moving average of gradients, creating a velocity term $v_t$ for $g_t = \nabla_\theta \mathcal{L}(\theta_t)$:
  $$\begin{aligned}
  v_t &= \beta v_{t-1} + (1 - \beta) g_t, \\
  \theta_{t+1} &= \theta_t - \eta \, v_t.
  \end{aligned}$$
- **RMSProp** rescales the learning rate per-parameter using a running estimate of squared gradients $s_t$:
  $$\begin{aligned}
  s_t &= \beta s_{t-1} + (1 - \beta) g_t^2, \\
  \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t + \epsilon}} \, g_t.
  \end{aligned}$$
- **Adam** combines momentum and RMSProp with bias corrections:
  $$\begin{aligned}
  m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t, \\
  v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \\
  \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \\
  \hat{v}_t &= \frac{v_t}{1 - \beta_2^t}, \\
  \theta_{t+1} &= \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
  \end{aligned}$$

These optimizers differ in how aggressively they adapt the step size and in their sensitivity to poorly scaled gradients.
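
The three update rules above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Keras implementation: the toy quadratic loss, the starting point, and the hyperparameter defaults are all assumptions chosen to make the poorly scaled case visible.

```python
import numpy as np


def grad(theta: np.ndarray) -> np.ndarray:
    """Gradient of the assumed toy loss 0.5 * (100 * x**2 + y**2)."""
    return np.array([100.0, 1.0]) * theta


def momentum_step(theta, state, g, lr=0.01, beta=0.9):
    # v_t = beta * v_{t-1} + (1 - beta) * g_t
    state["v"] = beta * state.get("v", 0.0) + (1.0 - beta) * g
    return theta - lr * state["v"]


def rmsprop_step(theta, state, g, lr=0.01, beta=0.9, eps=1e-8):
    # s_t = beta * s_{t-1} + (1 - beta) * g_t^2
    state["s"] = beta * state.get("s", 0.0) + (1.0 - beta) * g**2
    return theta - lr * g / np.sqrt(state["s"] + eps)


def adam_step(theta, state, g, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1.0 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1.0 - beta2) * g**2
    m_hat = state["m"] / (1.0 - beta1**t)  # bias-corrected first moment
    v_hat = state["v"] / (1.0 - beta2**t)  # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)


def run(step_fn, steps=200):
    """Apply one optimizer repeatedly from the assumed start point (1, 1)."""
    theta, state = np.array([1.0, 1.0]), {}
    for _ in range(steps):
        theta = step_fn(theta, state, grad(theta))
    return theta


for name, fn in [("momentum", momentum_step), ("rmsprop", rmsprop_step), ("adam", adam_step)]:
    print(name, run(fn))
```

On the very first step, Adam's bias correction makes the update approximately $\eta \cdot \operatorname{sign}(g_t)$ in every coordinate, so both coordinates move by the same amount even though their gradients differ by a factor of 100. Momentum, by contrast, moves each coordinate in proportion to its raw gradient.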


## Regularization Refresher

Regularization constrains the model to favor simpler or more robust solutions:

- **$L_2$ weight decay** penalizes large weights by adding $\frac{\lambda}{2} \lVert \theta \rVert_2^2$ to the loss, leading to the modified gradient
  $$\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L} + \lambda \theta.$$
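- **Dropout** randomly zeros activations with probability $p$ during training, sampling subnetworks and thereby discouraging co-adaptation. The expected training-time activation is $(1-p)\,h$, so classic dropout scales activations by $(1-p)$ at inference time; the inverted dropout used by Keras instead scales the surviving activations by $1/(1-p)$ during training and leaves inference unchanged.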
- **Data augmentation** synthesizes additional training examples $\tilde{x}$ sampled from transformations $T(x)$ that preserve labels, effectively enlarging the dataset support.
- **Early stopping** halts training when validation loss stops improving, implicitly restricting the number of optimization steps and preventing overfitting.

Each technique balances the bias–variance trade-off differently, shaping how well the model generalizes to unseen data.
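
Two of the mechanisms above are small enough to sketch directly. The snippet below is a NumPy illustration under stated assumptions, not the Keras internals: `inverted_dropout` and `l2_regularized_grad` are hypothetical helper names, and the all-ones activation vector is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def inverted_dropout(h: np.ndarray, p: float, training: bool) -> np.ndarray:
    """Zero each activation with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return h  # inference: identity, no rescaling needed
    mask = rng.random(h.shape) >= p  # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)


def l2_regularized_grad(grad_loss: np.ndarray, theta: np.ndarray, lam: float) -> np.ndarray:
    """Gradient of L + (lam / 2) * ||theta||^2, i.e. grad L + lam * theta."""
    return grad_loss + lam * theta


h = np.ones(100_000)  # synthetic activations
dropped = inverted_dropout(h, p=0.3, training=True)
print(f"fraction zeroed: {np.mean(dropped == 0.0):.3f}")  # close to p
print(f"mean activation: {dropped.mean():.3f}")           # close to the original mean
```

Because the survivors are rescaled during training, the mean activation stays near its original value, which is why no correction is needed at inference time.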


## Technique Comparison Cheat Sheet

| Category | Technique | Primary Benefit | Key Hyperparameters | Typical Trade-offs |
| --- | --- | --- | --- | --- |
| Optimizer | SGD + Momentum | Smooths noisy gradients and accelerates along ravines | Learning rate $\eta$, momentum $\beta$ | Sensitive to $\eta$; may stagnate on plateaus |
| Optimizer | RMSProp | Adapts step sizes using squared-gradient averages | $\eta$, decay $\beta$, $\epsilon$ | Can forget long-term gradient information |
| Optimizer | Adam | Combines momentum and adaptivity for fast convergence | $\eta$, $\beta_1$, $\beta_2$, $\epsilon$ | May generalize worse than SGD on some tasks |
| Regularization | $L_2$ Weight Decay | Shrinks weights to reduce variance | Penalty $\lambda$ | Excessive decay underfits |
| Regularization | Dropout | Prevents co-adaptation of units | Drop probability $p$ | Slows convergence; tuning $p$ is task-dependent |
| Regularization | Data Augmentation | Expands dataset support to improve invariances | Transform family $T$, augmentation strength | Poorly chosen transforms can hurt accuracy |
| Regularization | Early Stopping | Stops over-training based on validation metrics | Patience, monitored metric | Requires reliable validation signal |


## Lab Roadmap

1. Configure the environment and import the reusable helpers from `notebooks/dl_utils`.
2. Build shared dataset pipelines for **Fashion-MNIST** and **MNIST**.
3. Define a compact multilayer perceptron with configurable regularizers.
4. Run optimization-focused experiments (SGD vs. Momentum vs. Adam) on Fashion-MNIST.
5. Run regularization-focused experiments (dropout, weight decay, augmentation) on MNIST.
6. Summarize and visualize the results, then reflect on the optimizer/regularizer interplay.


In [None]:

from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Any

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

from dl_utils import (
    build_callbacks,
    compile_and_fit,
    load_tfds_dataset,
    plot_history,
    prepare_for_training,
    summarize_history,
)

# Reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

print(f"TensorFlow version: {tf.__version__}")

### Dataset Preparation Helpers

The helpers below rely on `dl_utils.load_tfds_dataset` and `dl_utils.prepare_for_training` to build consistent `tf.data` pipelines with optional augmentation.

In [None]:

@dataclass
class DatasetBundle:
    name: str
    train: tf.data.Dataset
    validation: tf.data.Dataset
    test: tf.data.Dataset
    info: tfds.core.DatasetInfo


def normalize_image(image: tf.Tensor) -> tf.Tensor:
    image = tf.cast(image, tf.float32) / 255.0
    if image.shape.rank == 2:
        image = tf.expand_dims(image, -1)
    return image


def load_image_classification_data(
    name: str,
    *,
    batch_size: int = 64,
    validation_split: float = 0.1,
    augment: bool = False,
    max_train_size: int | None = 12000,
    max_val_size: int | None = 2000,
    max_test_size: int | None = 4000,
) -> DatasetBundle:
    """Load an image classification dataset with normalized batches."""

    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be between 0 and 1")

    # Whole-percent boundary for the TFDS split string; round() avoids float truncation.
    train_cut = round((1.0 - validation_split) * 100)
    train_raw, info = load_tfds_dataset(name, split=f"train[:{train_cut}%]", with_info=True)
    val_raw = load_tfds_dataset(name, split=f"train[{train_cut}%:]")
    test_raw = load_tfds_dataset(name, split="test")

    if max_train_size is not None:
        train_raw = train_raw.take(max_train_size)
    if max_val_size is not None:
        val_raw = val_raw.take(max_val_size)
    if max_test_size is not None:
        test_raw = test_raw.take(max_test_size)

    def preprocess(image: tf.Tensor, label: tf.Tensor) -> tuple[tf.Tensor, tf.Tensor]:
        return normalize_image(image), label

    train_ds = train_raw.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    val_ds = val_raw.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    test_ds = test_raw.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

    augment_fn = None
    if augment:
        def augment_fn(image: tf.Tensor, label: tf.Tensor) -> tuple[tf.Tensor, tf.Tensor]:
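            # NOTE: horizontal flips suit clothing images; for digits they are
            # not strictly label-preserving, so treat them with care on MNIST.
            image = tf.image.random_flip_left_right(image)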
            image = tf.image.random_brightness(image, max_delta=0.1)
            image = tf.clip_by_value(image, 0.0, 1.0)
            return image, label

    train_ds = prepare_for_training(
        train_ds,
        batch_size=batch_size,
        augment_fn=augment_fn,
        shuffle_buffer=2048,
    )
    val_ds = prepare_for_training(
        val_ds,
        batch_size=batch_size,
        shuffle_buffer=None,
        cache=True,
        prefetch=True,
    )
    test_ds = prepare_for_training(
        test_ds,
        batch_size=batch_size,
        shuffle_buffer=None,
        cache=True,
        prefetch=True,
    )

    return DatasetBundle(name=name, train=train_ds, validation=val_ds, test=test_ds, info=info)

### Model Builder

We use a compact multilayer perceptron with tunable dropout and $L_2$ weight decay so that optimizer and regularization effects remain visible within a few epochs.


In [None]:

def build_baseline_model(
    input_shape: tuple[int, ...],
    num_classes: int,
    *,
    dropout_rate: float = 0.3,
    l2_factor: float = 1e-4,
) -> tf.keras.Model:
    regularizer = tf.keras.regularizers.l2(l2_factor) if l2_factor else None

    model = tf.keras.Sequential(
        [
            # keras.Input(shape=...) is accepted by both Keras 2 and Keras 3.
            tf.keras.Input(shape=input_shape),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=regularizer),
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=regularizer),
            tf.keras.layers.Dropout(dropout_rate / 2.0 if dropout_rate else 0.0),
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ]
    )
    return model
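
To get a feel for the model's size, the parameter count can be worked out by hand. The sketch below assumes the 28×28×1 input shape and 10 classes shared by Fashion-MNIST and MNIST; `dense_params` is a hypothetical helper, not part of `dl_utils`. Dropout and flatten layers add no parameters.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


input_shape = (28, 28, 1)  # assumed: Fashion-MNIST / MNIST image shape
num_classes = 10

flat = 1
for dim in input_shape:
    flat *= dim  # Flatten: 28 * 28 * 1 = 784 features

layer_sizes = [flat, 256, 128, num_classes]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # [200960, 32896, 1290] 235146
```

Roughly 85% of the parameters sit in the first dense layer, so that is where the $L_2$ penalty has the most weights to act on.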


### Experiment Management

The experiment runner stitches together dataset loading, model creation, compilation, and evaluation. Results are stored in a structured dictionary for later comparison.

In [None]:

@dataclass
class ExperimentConfig:
    name: str
    dataset: str
    optimizer: tf.keras.optimizers.Optimizer
    epochs: int = 5
    batch_size: int = 64
    dropout_rate: float = 0.3
    l2_factor: float = 1e-4
    augment: bool = False
    max_train_size: int | None = 12000
    max_val_size: int | None = 2000
    max_test_size: int | None = 4000


def run_experiment(config: ExperimentConfig) -> dict[str, Any]:
    """Load data, build and train the model, then evaluate on the test split."""
    print(f"\n▶ Running experiment: {config.name}")
    data = load_image_classification_data(
        config.dataset,
        batch_size=config.batch_size,
        validation_split=0.1,
        augment=config.augment,
        max_train_size=config.max_train_size,
        max_val_size=config.max_val_size,
        max_test_size=config.max_test_size,
    )

    input_shape = data.info.features["image"].shape
    num_classes = data.info.features["label"].num_classes

    model = build_baseline_model(
        input_shape,
        num_classes,
        dropout_rate=config.dropout_rate,
        l2_factor=config.l2_factor,
    )

    # Early stopping on validation loss; TensorBoard logging is disabled.
    callbacks, log_dir = build_callbacks(
        experiment_name=config.name,
        tensorboard=False,
        patience=3,
        monitor="val_loss",
    )

    history = compile_and_fit(
        model,
        data.train,
        optimizer=config.optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
        epochs=config.epochs,
        validation_ds=data.validation,
        callbacks=callbacks,
        verbose=2,
    )

    # return_dict=True keys the results by metric name ("loss", "accuracy"),
    # which is more robust across Keras versions than zipping metrics_names.
    test_metrics = model.evaluate(data.test, verbose=0, return_dict=True)

    return {
        "config": config,
        "history": history,
        "model": model,
        "log_dir": log_dir,
        "test_metrics": test_metrics,
        "dataset_info": data.info,
    }

## Fashion-MNIST: Optimizer Face-off

We evaluate vanilla SGD, SGD with momentum, and Adam on the Fashion-MNIST dataset while keeping regularization fixed.

In [None]:

fashion_experiments = [
    ExperimentConfig(
        name="fashion_sgd",
        dataset="fashion_mnist",
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.05),
        epochs=5,
        dropout_rate=0.3,
        l2_factor=1e-4,
        augment=True,
        max_train_size=12000,
        max_val_size=2000,
    ),
    ExperimentConfig(
        name="fashion_momentum",
        dataset="fashion_mnist",
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.03, momentum=0.9, nesterov=True),
        epochs=5,
        dropout_rate=0.3,
        l2_factor=1e-4,
        augment=True,
        max_train_size=12000,
        max_val_size=2000,
    ),
    ExperimentConfig(
        name="fashion_adam",
        dataset="fashion_mnist",
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        epochs=5,
        dropout_rate=0.3,
        l2_factor=1e-4,
        augment=True,
        max_train_size=12000,
        max_val_size=2000,
    ),
]

fashion_results = [run_experiment(cfg) for cfg in fashion_experiments]

## MNIST: Regularization Ablations

With Adam fixed as the optimizer, we vary dropout, weight decay, and augmentation to see how regularization influences generalization.

In [None]:

mnist_experiments = [
    ExperimentConfig(
        name="mnist_min_reg",
        dataset="mnist",
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        epochs=4,
        dropout_rate=0.1,
        l2_factor=0.0,
        augment=False,
        max_train_size=10000,
        max_val_size=2000,
    ),
    ExperimentConfig(
        name="mnist_dropout_weightdecay",
        dataset="mnist",
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        epochs=4,
        dropout_rate=0.5,
        l2_factor=1e-3,
        augment=True,
        max_train_size=10000,
        max_val_size=2000,
    ),
]

mnist_results = [run_experiment(cfg) for cfg in mnist_experiments]

## Aggregate Metrics

We summarize best validation performance, highlight the epoch where it occurred, and report held-out test accuracy using the new `summarize_history` helper.

In [None]:

import pandas as pd

all_results = fashion_results + mnist_results
summary_rows: list[dict[str, Any]] = []

for result in all_results:
    config = result["config"]
    history_summary = summarize_history(result["history"], metrics=["loss", "accuracy"])
    summary = {row["metric"]: row for row in history_summary}

    summary_rows.append(
        {
            "experiment": config.name,
            "dataset": config.dataset,
            "optimizer": type(config.optimizer).__name__,
            "dropout": config.dropout_rate,
            "l2_factor": config.l2_factor,
            "augment": config.augment,
            "val_best_accuracy": summary["accuracy"].get("val_best"),
            "val_accuracy_epoch": summary["accuracy"].get("val_epoch"),
            "best_val_loss": summary["loss"].get("val_best"),
            "test_accuracy": result["test_metrics"].get("accuracy"),
        }
    )

comparison_df = pd.DataFrame(summary_rows)
display(comparison_df.sort_values(by="val_best_accuracy", ascending=False).reset_index(drop=True))
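
The internals of `summarize_history` are not shown in this lab, so the following is only a sketch of the kind of reduction such a helper performs: finding the best value of a validation metric and the epoch where it occurred. `best_epoch` is a hypothetical function, and the numbers are made up to mimic the layout of a Keras `History.history` dict; they are not results from the experiments.

```python
from __future__ import annotations


def best_epoch(values: list[float], mode: str) -> tuple[int, float]:
    """Return (1-based epoch, value) of the best entry; mode is 'min' or 'max'."""
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Made-up values, for illustration only.
history = {
    "val_loss": [0.62, 0.48, 0.45, 0.47, 0.50],
    "val_accuracy": [0.78, 0.83, 0.85, 0.84, 0.86],
}

print(best_epoch(history["val_loss"], "min"))      # (3, 0.45)
print(best_epoch(history["val_accuracy"], "max"))  # (5, 0.86)
```

Note that the best loss and the best accuracy need not fall on the same epoch, which is worth keeping in mind when reading the `val_accuracy_epoch` and `best_val_loss` columns side by side.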

## Visualize Training Dynamics

Plotting the learning curves clarifies how quickly each configuration converges and whether it overfits.

In [None]:

for result in all_results:
    plot_history(result["history"], metrics=["accuracy", "loss"])

## Key Takeaways

- Adaptive optimizers such as Adam typically reach competitive validation accuracy faster on Fashion-MNIST, but SGD with momentum can close the gap with careful learning-rate tuning.
- Stronger regularization (dropout + weight decay + augmentation) slows early training yet yields higher validation and test accuracy on MNIST, illustrating the bias–variance trade-off.
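- Early stopping (here `patience=3` on `val_loss`) halts a run once validation loss stops improving, which limits overfitting and wasted epochs when hyperparameters are suboptimal. It does not by itself prevent divergence, and with only 4 to 5 epochs per run it triggers rarely.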
- The `summarize_history` helper offers a succinct way to extract the best-performing epochs for downstream reporting.
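
The patience rule behind early stopping is easy to trace by hand. The sketch below assumes the common convention (stop once the monitored loss has failed to improve for `patience` consecutive epochs); how `build_callbacks` configures this internally is not shown here, and the loss values are made up.

```python
from __future__ import annotations


def early_stopping_epoch(val_losses: list[float], patience: int) -> tuple[int, int]:
    """Return (epoch at which training stops, best epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # patience exhausted
    return len(val_losses), best_epoch  # ran out of epochs first


losses = [0.60, 0.50, 0.47, 0.48, 0.49, 0.51, 0.55]  # made-up values
print(early_stopping_epoch(losses, patience=3))      # (6, 3)
print(early_stopping_epoch(losses[:5], patience=3))  # (5, 3)
```

The second call shows why a short epoch budget matters: with five epochs and a patience of three, the run ends before the stopping rule can fire.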

## Concept Checks

**Question 1.** How does adding momentum change the update direction compared with vanilla SGD?

<details>
<summary>Hint</summary>
Momentum forms a running average of recent gradients before applying the step.
</details>

<details>
<summary>Answer</summary>
The update moves along $v_t$, a smoothed combination of past gradients, so directions that persist over multiple steps are amplified while oscillations cancel out.
</details>

**Question 2.** Why can $L_2$ weight decay improve generalization?

<details>
<summary>Hint</summary>
Think about how the penalty term changes the magnitude of the parameters.
</details>

<details>
<summary>Answer</summary>
By shrinking weights toward zero, weight decay discourages overly large coefficients that fit noise, thereby reducing variance and improving performance on unseen data.
</details>
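
A worked step makes the shrinkage explicit. Substituting the regularized gradient from the refresher into the plain gradient-descent update (a sketch for vanilla SGD only; adaptive optimizers also rescale the penalty term) gives

$$
\theta_{t+1} = \theta_t - \eta \left( \nabla_\theta \mathcal{L}(\theta_t) + \lambda \theta_t \right) = (1 - \eta \lambda)\, \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t).
$$

Every step therefore multiplies the weights by $(1 - \eta\lambda)$ before applying the data gradient. As an illustrative example with assumed values $\eta = 0.1$ and $\lambda = 0.01$, the factor is $0.999$, so a weight that receives no gradient signal decays by about 0.1% per step.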

**Question 3.** What role does data augmentation play in this lab's regularization experiments?

<details>
<summary>Hint</summary>
Consider how augmented examples relate to the original dataset's support.
</details>

<details>
<summary>Answer</summary>
Augmentation broadens the training distribution with label-preserving transforms, making the model robust to small input perturbations and complementing other regularizers.
</details>


## Assignments

1. **Optimizer Tuning Challenge:** Extend the optimizer sweep to include RMSProp and AdamW on Fashion-MNIST. Report the learning curves and discuss how decoupled weight decay in AdamW changes the results.
2. **Regularization Grid Search:** For MNIST, run a 2×2 grid over dropout rates (0.2, 0.4) and $L_2$ penalties ($1 \times 10^{-4}$, $5 \times 10^{-4}$). Summarize the outcomes in a table similar to the one above and reason about the best configuration.
3. **Cross-Dataset Generalization:** Apply the full workflow to a third dataset of your choice (e.g., CIFAR-10 or EMNIST). Compare the effect of augmentation intensity on optimizer performance, highlighting any adjustments required in the model architecture.
4. **Theory to Practice Essay:** In a short write-up (≈500 words), connect the empirical findings to the theoretical equations introduced at the start of the lab. Focus on how adaptive learning rates interact with regularization to shape the loss landscape traversal.
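
As a possible starting point for the grid search in assignment 2, the four settings can be enumerated with `itertools.product`. This is only a scaffold under assumptions: the naming scheme is hypothetical, and the dictionaries hold just the fields that vary plus the dataset name.

```python
from itertools import product

dropout_rates = [0.2, 0.4]
l2_factors = [1e-4, 5e-4]

# One keyword-argument dict per grid cell; names are an assumed convention.
grid = [
    {
        "name": f"mnist_do{dropout:g}_l2{l2:g}",
        "dataset": "mnist",
        "dropout_rate": dropout,
        "l2_factor": l2,
    }
    for dropout, l2 in product(dropout_rates, l2_factors)
]

for kwargs in grid:
    print(kwargs["name"])
```

Each dictionary could then be unpacked into `ExperimentConfig(optimizer=..., **kwargs)`. Creating a fresh optimizer instance per configuration is advisable, since a Keras optimizer keeps state tied to the model it trained.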
