[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Harvard-CS1090/2026_CS1090B_public/blob/main/sec03/cs1090b_sec03_student_alternative.ipynb)

# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png">

# CS1090B Section 3 (alternative): Regularization and Data Augmentation

**Harvard University**<br/>
**Spring 2026**<br/>
**Instructors**: Pavlos Protopapas, Kevin Rader, and Chris Gumb<br/>

## Overview

In this section, we investigate a central question in deep learning:

> **Why do neural networks overfit, and how can we systematically improve generalization?**

We approach this question in two stages.

---

### Part 1: Understanding Regularization (Regression Setting)

We begin with a controlled synthetic regression problem where overfitting is easy to observe and diagnose. In this setting, we:

- Visualize how overfitting emerges in training vs. validation loss curves  
- Examine how model capacity affects generalization  
- Compare four regularization strategies:
  - Early stopping  
  - L1 and L2 weight penalties  
  - Dropout  
  - Gaussian noise augmentation  

The goal is not just to apply these techniques, but to understand **how each one constrains the model** and why that improves generalization.

---

### Part 2: Regularization in Practice (Image Classification)

We then move to a more realistic setting: image classification on Fashion-MNIST.

Here, we:

- Train a neural network under limited data conditions  
- Observe overfitting in a high-dimensional input space  
- Apply image-based data augmentation (flips, rotations, noise)  

This allows us to see how augmentation acts as an *implicit regularizer* by modifying the effective training distribution.


## Learning Objectives

By the end of this section, you should be able to:

### Conceptual Understanding
1. Define overfitting and explain how it appears in training vs. validation loss curves.
2. Explain why regularization improves generalization.
3. Compare different regularization techniques and describe how they constrain model capacity.
4. Understand data augmentation as implicit regularization.

### Practical Skills (PyTorch + Modeling)
5. Implement and train a neural network in PyTorch for regression and classification tasks.
6. Diagnose overfitting using learning curves.
7. Apply and compare the following regularization methods:
   - L1 / L2 weight penalties
   - Early stopping
   - Dropout
   - Gaussian noise augmentation
8. Modify optimizers to include weight decay.
9. Use dropout correctly during training and inference.
10. Implement data augmentation for image classification using torchvision transforms.
11. Use a PyTorch `DataLoader` to implement mini-batch SGD.

## Setup: Download Data

In [None]:
#@title Colab Setup
# Environment detection and setup
import os
import subprocess
import sys
import shutil

# Define the zip file URL and expected directories
assets_zip_url = "https://github.com/Harvard-CS1090/2026_CS1090B_public/raw/main/sec03/notebook_assets.zip"

assets_zip_name = "notebook_assets.zip"
expected_dirs = ["data", "fig"]

# Check if required directories already exist
all_dirs_exist = all(os.path.isdir(d) for d in expected_dirs)

if all_dirs_exist:
    print("Required directories already exist. Skipping download.")
else:
    print(f"Downloading {assets_zip_name} from GitHub...")

    # Use wget in Colab, or urllib for local
    try:
        if 'google.colab' in sys.modules:
            subprocess.run(['wget', '-q', assets_zip_url], check=True)
        else:
            import urllib.request
            urllib.request.urlretrieve(assets_zip_url, assets_zip_name)
        print(f"Downloaded {assets_zip_name}.")

        # Unzip the file
        import zipfile
        with zipfile.ZipFile(assets_zip_name, 'r') as zip_ref:
            zip_ref.extractall('.')
        print(f"Extracted {assets_zip_name}.")

        # Clean up the zip file
        os.remove(assets_zip_name)
        print(f"Removed {assets_zip_name}.")

        # Remove __MACOSX folder if it exists
        if os.path.isdir('__MACOSX'):
            shutil.rmtree('__MACOSX')
            print("Removed __MACOSX folder.")

    except Exception as e:
        print(f"Error during setup: {e}", file=sys.stderr)

print("Setup complete!")

## Part 1: Regularization for Regression

When a model fits the training data too closely - capturing noise rather than the underlying pattern - it **overfits**, performing well on training examples but poorly on unseen data. Regularization techniques address this by constraining the model in various ways.

In this section, we'll intentionally overfit a neural network on a regression task, then apply different regularization techniques to combat it: early stopping, L1/L2 weight penalties, dropout, and data augmentation via Gaussian noise. We'll compare how each method affects the training and validation loss curves.

In [None]:
#@title Imports, Device Setup, and Random Seeds
import copy
import os
import random as rn

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from torchvision import datasets, transforms

sns.set_style('whitegrid')

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

# Reproducibility
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(109)
rn.seed(109)
torch.manual_seed(109);

**Dataset:** We are using a synthetic dataset based on the function  
$$y = x \sin(x)$$

with added Gaussian noise.  

The exact functional form is not the focus here. What matters is that the relationship between input and output is **nonlinear** and **noisy**, which makes it an ideal setting for observing overfitting and testing regularization strategies.

Because the signal is structured but imperfect, the model can either:
- Learn the underlying pattern, or  
- Start memorizing noise  

In [None]:
#@title Generating a (Noisy) Toy Dataset: $f(x) = x\sin(x)$
# We'll model noisy data from the function f(x) = x * sin(x)
def f(x):
    return x * np.sin(x)

N = 25
x = np.linspace(0, 5, N) # small, equally spaced data points for simplicity
y = f(x) + np.random.normal(0, 0.5, len(x)) # noisy data
df = pd.DataFrame({'x': x, 'y': y})

# Split to train and val (no test for our examples)
x_train, x_val, y_train, y_val = train_test_split(
    df['x'],
    df['y'],
    train_size=0.7,
    random_state=109
)

# We'll use standardized data for input to the NN
scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train.to_frame()).ravel()
x_val_std = scaler.transform(x_val.to_frame()).ravel()
print(f"Generated {N} data points!")

# Torch models will want torch tensor objects
x_train_t = torch.tensor(x_train_std, dtype=torch.float32).unsqueeze(1).to(device)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1).to(device)
x_val_t = torch.tensor(x_val_std, dtype=torch.float32).unsqueeze(1).to(device)
y_val_t = torch.tensor(y_val.values, dtype=torch.float32).unsqueeze(1).to(device)

In [None]:
#@title Visualizing the Data
x_lin = np.linspace(0, 5, 1000) # large linspace for plotting

def plot_data_with_true_fn(df, idx, ax=None):
    if ax is None:
        ax = plt.gca()
    train_mask = np.ones(df.shape[0], dtype=bool)
    train_mask[idx] = False
    ax.scatter(df.x[train_mask], df.y[train_mask], c='b', label='train data')
    ax.scatter(df.x[~train_mask], df.y[~train_mask], c='orange', marker='^', label='validation data')
    ax.plot(x_lin, f(x_lin), label="true function", alpha=0.5, c='k')
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')  # observed targets, not model predictions
    ax.legend()

plot_data_with_true_fn(df, idx=x_val.index)

We will start with a small neural network and intentionally overfit. Below we define a baseline model using `nn.Sequential` and reusable plotting utilities that we'll use throughout Part 1 to compare different regularization strategies.

In [None]:
torch.manual_seed(109)
model1 = nn.Sequential(
            nn.Linear(1, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1)
        ).to(device)

optimizer = optim.Adam(model1.parameters(), lr=0.0003)
criterion = nn.MSELoss()

epochs = 3000
history1 = {'loss': [], 'val_loss': []}
for epoch in range(epochs):
    model1.train()
    optimizer.zero_grad()
    preds_train = model1(x_train_t)
    loss = criterion(preds_train, y_train_t)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()

    model1.eval()
    with torch.no_grad():
        preds_val = model1(x_val_t)
        loss = criterion(preds_val, y_val_t)
        val_loss = loss.item()

    history1['loss'].append(train_loss)
    history1['val_loss'].append(val_loss)
    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: train loss = {train_loss:.4f}, "
              f"val loss = {val_loss:.4f}")

> **❓ Question 1: Understanding the Training Loop**
>
> Examine the training code above and explain what each of the following does:
>
> 1. `optimizer = optim.Adam(model1.parameters(), lr=0.0003)`
> 2. `criterion = nn.MSELoss()`
> 3. `loss.backward()` and `optimizer.step()`


In [None]:
#@title Initial Training Results

# Helper functions we'll use throughout for plotting training results
def plot_history(history, title=None, ax=None, best_val=False):
    if ax is None:
        ax = plt.gca()

    ax.plot(history['loss'], label='train')
    ax.plot(history['val_loss'], label='validation')

    ax.set_yscale('log')  # <-- log scale for MSE

    ax.set_xlabel('epoch')
    ax.set_ylabel('MSE (log scale)')

    best_loss = np.nanmin(history['val_loss'])
    best_epoch = np.nanargmin(history['val_loss'])

    if best_val:
        ax.axvline(best_epoch, c='k', ls='--',
                label=f'best val loss = {best_loss:.2e}')

    ax.legend()

    if title:
        ax.set_title(title)
    else:
        ax.set_title("Training History")

def plot_predictions(model, ax=None):
    if ax is None:
        ax = plt.gca()
    model.eval()
    x_lin = np.linspace(df.x.min(), df.x.max(), 500).reshape(-1, 1)
    x_lin_std = scaler.transform(pd.DataFrame(x_lin, columns=['x']))
    with torch.no_grad():
        x_tensor = torch.tensor(x_lin_std, dtype=torch.float32).to(device)
        preds = model(x_tensor).detach().cpu().numpy()
        preds_val = model(x_val_t)
    ax.plot(x_lin, preds, c='blue', alpha=0.8, label='prediction')
    ax.legend()
    # Here we reference criterion which is out of function scope.
    # This is bad practice but it makes the function calls simpler
    # and we never change the definition of criterion throughout
    # part 1.
    final_val_loss = criterion(preds_val, y_val_t).item()
    ax.set_title(f"Final Model\n(val loss = {final_val_loss:.4f})")

def plot_data(df, idx=None, ax=None):

    if ax is None:
        ax = plt.gca()

    # If validation indices are provided, separate train and val
    if idx is not None:
        train_df = df.drop(idx)
        val_df = df.loc[idx]

        ax.scatter(train_df['x'], train_df['y'],
                   color='gray', alpha=0.6, label='train data')
        ax.scatter(val_df['x'], val_df['y'],
                   color='red', alpha=0.8, label='validation data')
    else:
        ax.scatter(df['x'], df['y'],
                   color='gray', alpha=0.6, label='data')

    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend()

def plot_results(model, history, title="Model Performance", best_val=False):
    fig, axs = plt.subplots(1, 2, figsize=(10, 4))
    plot_history(history, ax=axs[0], best_val=best_val)
    plot_data(df, idx=x_val.index, ax=axs[1])
    plot_predictions(model, ax=axs[1])
    plt.tight_layout()
    plt.suptitle(title, fontsize=16, y=1.05)
    plt.show()

plot_results(model1, history1, "Baseline (no regularization)")

> **❓ Question 2: Diagnosing Overfitting**
>
> 1. What do you notice about the gap between training and validation loss as epochs increase?
> 2. Why does the validation loss start to increase even as training loss continues to decrease?


### Weight Decay (L2) and L1 Regularization

Weight penalties add a cost to having large weights, discouraging the model from fitting noise. **L2 regularization** (weight decay) adds $\lambda \sum w^2$ to the loss, which shrinks all weights toward zero but rarely makes them exactly zero. **L1 regularization** adds $\lambda \sum |w|$ to the loss, which encourages *sparsity* - pushing many weights to exactly zero while keeping a few large ones. In practice, L2 produces smoother models while L1 can act as a form of feature selection.
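
To see where the difference in sparsity comes from, it can help to compare the gradient each penalty contributes for a single weight $w$. This is a standard calculus step rather than anything specific to this notebook:

$$
\frac{\partial}{\partial w}\,\lambda w^2 = 2\lambda w,
\qquad
\frac{\partial}{\partial w}\,\lambda |w| = \lambda\,\operatorname{sign}(w) \quad (w \neq 0).
$$

The L2 pull is proportional to $w$, so it fades as a weight gets small and the weight rarely reaches exactly zero. The L1 pull has constant magnitude $\lambda$ however small $w$ is, which is what can drive weights to (or very near) zero.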

**Note:** PyTorch also provides `torch.optim.AdamW`, which applies weight decay *directly* to the weights rather than through the gradient. In standard `Adam`, the `weight_decay` term is added to the gradient and then rescaled by the adaptive per-parameter step size, so the effective penalty differs across parameters and is weaker for those with large gradient histories. `AdamW` avoids this and is preferred in modern practice (e.g., transformer fine-tuning). For this small example the difference is negligible.
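
The difference between the two optimizers can be isolated with a minimal sketch. It assumes a single scalar weight whose data gradient is zero, so any change in the weight comes from the `weight_decay` term alone. The helper `one_step` and the values `lr=0.1`, `weight_decay=0.5`, and the starting weight `4.0` are illustrative choices, not part of the notebook.

```python
import torch

def one_step(opt_cls, lr=0.1, weight_decay=0.5):
    # One scalar parameter with a zero data gradient: only weight decay acts.
    w = torch.nn.Parameter(torch.tensor([4.0]))
    opt = opt_cls([w], lr=lr, weight_decay=weight_decay)
    w.grad = torch.zeros_like(w)
    opt.step()
    return w.item()

w_adam = one_step(torch.optim.Adam)    # penalty enters through the gradient
w_adamw = one_step(torch.optim.AdamW)  # decay is applied directly to the weight

print(f"Adam : {w_adam:.4f}")
print(f"AdamW: {w_adamw:.4f}")
```

On this first step, `AdamW` multiplies the weight by `1 - lr * weight_decay` (4.0 becomes about 3.8). `Adam` normalizes the penalty gradient, so the weight moves by roughly `lr` (to about 3.9) regardless of how large `weight_decay` is.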

In [None]:
torch.manual_seed(109)
model2 = nn.Sequential(
            nn.Linear(1, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1)
        ).to(device)

# NOTE: For our optimizer we use AdamW and a new argument, `weight_decay`
PENALTY = 0.5
optimizer = optim.AdamW(model2.parameters(), lr=0.0003, weight_decay=PENALTY)
criterion = nn.MSELoss()

epochs = 3000
history2 = {'loss': [], 'val_loss': []}
for epoch in range(epochs):
    model2.train()
    optimizer.zero_grad()
    preds_train = model2(x_train_t)
    loss = criterion(preds_train, y_train_t)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()

    model2.eval()
    with torch.no_grad():
        preds_val = model2(x_val_t)
        loss = criterion(preds_val, y_val_t)
        val_loss = loss.item()

    history2['loss'].append(train_loss)
    history2['val_loss'].append(val_loss)

    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: train loss = {train_loss:.4f}, "
              f"val loss = {val_loss:.4f}")

plot_results(model2, history2, "L2 Weight Decay")

> **❓ Question 3: Regularization Strength**
> 1. Is this enough regularization? Try different values.
> 2. What do you expect to happen if `weight_decay` becomes very large? Think about: what happens to model capacity, what kind of functions the network can represent, and what happens to training and validation loss.
> 3. If too little regularization leads to overfitting and too much leads to underfitting, how do we choose the right `weight_decay`?


---

📝 **Note on L1 Regularization**

Unlike L2 (which can be added directly using `weight_decay` in optimizers like `AdamW`),  
there is **no built-in simple argument for L1 regularization** in PyTorch optimizers.

Instead, we implement L1 manually by adding an extra penalty term to the loss:

$$
\text{loss} = \text{original loss} + \lambda \sum |w|
$$

In practice, this means:

- Compute the standard loss (e.g., MSE).
- Compute the L1 penalty by summing the absolute values of the weights.
- Add the penalty to the loss before calling `backward()`.

Here is a minimal reference implementation:

```python

l1_lambda = 0.005
# Inside the training loop, after computing the standard loss:
loss = criterion(preds, targets)
# Add L1 penalty on weights only (not biases, by common convention)
l1_penalty = sum(p.abs().sum() for name, p in model.named_parameters() if 'weight' in name)
loss = loss + l1_lambda * l1_penalty

loss.backward()
optimizer.step()
```

---

### Early Stopping

Early stopping monitors validation loss during training and halts the process when performance stops improving. The idea is simple: as training continues, the model begins to memorize noise in the training data, causing validation loss to rise even as training loss keeps falling. By saving the best model and stopping once validation loss hasn't improved for a set number of epochs (*patience*), we get a model that generalizes better without needing to tune a regularization hyperparameter directly.
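
The patience rule can be looked at in isolation before it is embedded in a training loop. The sketch below is a hypothetical helper (the class name `EarlyStopper` and the `min_delta` argument are assumptions for illustration, not something this notebook defines) applied to a made-up list of validation losses.

```python
class EarlyStopper:
    """Tracks validation loss and signals when to stop (illustrative helper)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")
        self.best_epoch = None
        self.counter = 0

    def step(self, val_loss, epoch_idx):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch_idx
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Made-up validation losses: they improve, then drift upward.
fake_val_losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.69, 0.75, 0.76, 0.77, 0.78]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch_idx, v in enumerate(fake_val_losses):
    if stopper.step(v, epoch_idx):
        stopped_at = epoch_idx
        break

print("stopped at epoch:", stopped_at)
print("best epoch:", stopper.best_epoch, "best loss:", stopper.best)
```

With `patience=3` this sequence stops at epoch 8, and the best loss (0.69) was seen at epoch 5. With `min_delta=0.05`, the small dip at epoch 5 no longer counts as an improvement, so the same sequence stops at epoch 5 instead.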

In [None]:
# Model with Early Stopping
torch.manual_seed(109)
model3 = nn.Sequential(
            nn.Linear(1, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1)
        ).to(device)

optimizer = optim.Adam(model3.parameters(), lr=0.0003)
criterion = nn.MSELoss()

epochs = 3000 # Max epochs to run if not stopped early
history3 = {'loss': [], 'val_loss': []}

# Early Stopping parameters
patience = 200
best_val_loss = np.inf
patience_counter = 0

print("Training model with Early Stopping (direct implementation)...")
for epoch in range(epochs):
    model3.train()
    optimizer.zero_grad()
    preds_train = model3(x_train_t)
    loss = criterion(preds_train, y_train_t)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()

    model3.eval()
    with torch.no_grad():
        preds_val = model3(x_val_t)
        loss = criterion(preds_val, y_val_t)
        val_loss = loss.item()

    history3['loss'].append(train_loss)
    history3['val_loss'].append(val_loss)

    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: train loss = {train_loss:.4f}, val loss = {val_loss:.4f}")

    # Early stopping logic
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print(f"Early stopping triggered at epoch {epoch}. Validation loss did not improve for {patience} epochs.")
        break

# Plotting results for the early stopped model
plot_results(model3, history3, "Early Stopping w/ Patience")

> **❓ Question 4: Understanding Early Stopping**
>
> Read the training loop carefully and think about the following:
>
> 1. What is the role of `best_val_loss`? Why is it initialized to `np.inf`? What does it track during training?
> 2. What does `patience = 200` mean? What happens if patience is very small? What happens if patience is very large?
> 3. Why do we reset `patience_counter` when validation loss improves?
> 4. Why do we monitor **validation loss** and not training loss? What would happen if we stopped based on training loss instead?


> **❓ Question 5: Best Model Restoration**
>
> When early stopping is triggered, we stop training, but are we using the model from the **last epoch before stopping**, or the model that achieved the **lowest validation loss**? Are these always the same? What is best practice?


### Restoring Our Best Model

In [None]:
# Model with Early Stopping and best-weight restoration
torch.manual_seed(109)
model4 = nn.Sequential(
            nn.Linear(1, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1)
        ).to(device)

optimizer = optim.Adam(model4.parameters(), lr=0.0003)
criterion = nn.MSELoss()

epochs = 3000 # Max epochs to run if not stopped early
history4 = {'loss': [], 'val_loss': []}

# Early Stopping parameters
patience = 100
best_val_loss = np.inf
patience_counter = 0
best_model_state = None

print("Training model with Early Stopping (direct implementation)...")
for epoch in range(epochs):
    model4.train()
    optimizer.zero_grad()
    preds_train = model4(x_train_t)
    loss = criterion(preds_train, y_train_t)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()

    model4.eval()
    with torch.no_grad():
        preds_val = model4(x_val_t)
        loss = criterion(preds_val, y_val_t)
        val_loss = loss.item()

    history4['loss'].append(train_loss)
    history4['val_loss'].append(val_loss)

    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: train loss = {train_loss:.4f}, val loss = {val_loss:.4f}")

    # Early stopping logic
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        best_model_state = copy.deepcopy(model4.state_dict()) # Save the best model state
    else:
        patience_counter += 1

    if patience_counter >= patience:
        print(f"Early stopping triggered at epoch {epoch}. Validation loss did not improve for {patience} epochs.")
        break

# Load the best model weights if early stopping occurred and a state was saved
if best_model_state is not None:
    model4.load_state_dict(best_model_state)
    print("Restored model to best state.")
else:
    print("No best model state to restore (perhaps val_loss never improved).")

# Plotting results for the early stopped model
plot_results(model4, history4, "Early Stopping w/ Patience & Best Weight Restoring", best_val=True)

💬 **Discuss:**

⁉️ **Q:** How is the weight restoration implemented? What do `state_dict()` and `load_state_dict()` do?

⁉️ **Q:** Why do we need `copy.deepcopy` here instead of a simple assignment?

⁉️ **Q:** How can we improve our implementation of early stopping? (Hint: think about minimum improvement thresholds)
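
As a starting point for the `state_dict()` and `copy.deepcopy` questions, here is a small experiment on a throwaway `nn.Linear(1, 1)` layer. The names `layer`, `alias`, and `snapshot` are illustrative only, and the in-place `add_` call stands in for further training steps.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(1, 1)

alias = layer.state_dict()                    # tensors share storage with the live parameters
snapshot = copy.deepcopy(layer.state_dict())  # independent copy of the current values

# Simulate further training by changing the weights in place.
with torch.no_grad():
    layer.weight.add_(1.0)

alias_tracks_model = torch.equal(alias["weight"], layer.weight)
snapshot_tracks_model = torch.equal(snapshot["weight"], layer.weight)
print("alias follows the update:   ", alias_tracks_model)
print("snapshot follows the update:", snapshot_tracks_model)

# Loading the snapshot copies the saved values back into the layer.
layer.load_state_dict(snapshot)
restored = torch.equal(snapshot["weight"], layer.weight)
print("weights restored:", restored)
```

The plain `state_dict()` reference changes along with the model, so it would not preserve the best epoch, while the deep copy keeps the values it had when it was taken.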

### Dropout

Dropout randomly zeroes a fraction of neuron activations during each training step, forcing the network not to rely on any single neuron. This acts as an implicit ensemble: on each forward pass, a different "thinned" subnetwork is active, and the final model approximates an average over all these subnetworks. At evaluation time, dropout is turned off and all neurons are active. To keep the expected activation the same in both modes, PyTorch's `nn.Dropout` scales the surviving activations by $1/(1-p)$ during training, so no rescaling is needed at evaluation time. The argument `p` (here `dropout_p`) controls what fraction of activations are dropped (e.g., 0.3 means 30% are zeroed).

**From Lecture 4:** Dropout randomly "drops" neurons during training, forcing the network to learn redundant representations and reducing co-adaptation between neurons.

<img src="./fig/lec04_dropout_mechanism.png" width="800">
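
The train/eval behavior of `nn.Dropout` can be checked directly on a vector of ones. This is a minimal sketch with illustrative names (`drop_p`, `demo_x`) and a deliberately large `drop_p = 0.5` so the effect is easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop_p = 0.5
drop = nn.Dropout(drop_p)
demo_x = torch.ones(10_000)

drop.train()
out_train = drop(demo_x)  # each entry is either 0 or 1 / (1 - drop_p)

drop.eval()
out_eval = drop(demo_x)   # identity: nothing dropped, nothing rescaled

print("values seen in train mode:", out_train.unique().tolist())
print("mean in train mode:", round(out_train.mean().item(), 3))
print("eval output equals input:", torch.equal(out_eval, demo_x))
```

In training mode the surviving entries are scaled up to 2.0, which keeps the mean close to 1. In evaluation mode the layer passes its input through unchanged.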

In [None]:
dropout_p = 0.1
torch.manual_seed(109)
model5 = nn.Sequential(
            nn.Linear(1, 100),
            nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(100, 1)
        ).to(device)

optimizer = optim.Adam(model5.parameters(), lr=0.0003)
criterion = nn.MSELoss()

epochs = 3000
history5 = {'loss': [], 'val_loss': []}
for epoch in range(epochs):
    model5.train()
    optimizer.zero_grad()
    preds_train = model5(x_train_t)
    loss = criterion(preds_train, y_train_t)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()

    model5.eval()
    with torch.no_grad():
        preds_val = model5(x_val_t)
        loss = criterion(preds_val, y_val_t)
        val_loss = loss.item()

    history5['loss'].append(train_loss)
    history5['val_loss'].append(val_loss)
    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: train loss = {train_loss:.4f}, "
              f"val loss = {val_loss:.4f}")

plot_results(model5, history5, "Dropout")

💬 **Discuss:**

⁉️ **Q:** Do you notice anything in the training history compared to before? (Look at the noise in the training loss curve)

⁉️ **Q:** What happens if we remove `model.eval()` before computing validation loss?

⁉️ **Q:** Run a forward pass on the same input multiple times, first with `model.train()` and then with `model.eval()`. What do you observe?

In [None]:
##  Uncomment to try - answers above question
# model5.train()
# print(model5(x_val_t[0:1]))
# print(model5(x_val_t[0:1]))

# model5.eval()
# print(model5(x_val_t[0:1]))
# print(model5(x_val_t[0:1]))

### Data Augmentation with Gaussian Noise

Adding random noise to inputs during training is a simple form of data augmentation that acts as a regularizer. Each time the model sees a training example, it sees a slightly perturbed version, which prevents it from memorizing exact input-output mappings. The noise is only applied during training - at evaluation time, the original clean inputs are used. This is especially useful when the dataset is small and collecting more data isn't feasible.

In [None]:
# This class will be utilized later when we add Gaussian noise as a model layer.
# 209 students take note!
class GaussianNoise(nn.Module):
    def __init__(self, stddev):
        super().__init__()
        self.stddev = stddev

    def forward(self, x):
        if self.training and self.stddev > 0:
            noise = torch.randn_like(x) * self.stddev
            return x + noise
        return x

In [None]:
torch.manual_seed(109)
model6 = nn.Sequential(
            GaussianNoise(0.1),
            nn.Linear(1, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1)
        ).to(device)

optimizer = optim.Adam(model6.parameters(), lr=0.01)
criterion = nn.MSELoss()

epochs = 2000
history6 = {'loss': [], 'val_loss': []}
for epoch in range(epochs):
    model6.train()
    optimizer.zero_grad()
    preds_train = model6(x_train_t)
    loss = criterion(preds_train, y_train_t)
    loss.backward()
    optimizer.step()
    train_loss = loss.item()

    model6.eval()
    with torch.no_grad():
        preds_val = model6(x_val_t)
        loss = criterion(preds_val, y_val_t)
        val_loss = loss.item()

    history6['loss'].append(train_loss)
    history6['val_loss'].append(val_loss)
    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: train loss = {train_loss:.4f}, "
              f"val loss = {val_loss:.4f}")

plot_results(model6, history6, "Data Augmentation w/ Gaussian Noise")

💬 **Discuss:**

⁉️ **Q:** What does `GaussianNoise(0.1)` do as the first layer of the model? How does it act differently during training vs. evaluation?

⁉️ **Q:** What happens if the noise standard deviation is increased or decreased? Try different values.

## Part 2: Classification with Fashion-MNIST

Having seen regularization in the regression setting, we now apply these ideas to a classification task. We'll build an MLP classifier on Fashion-MNIST, observe overfitting with limited data, and then use data augmentation to improve generalization.

---

Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Fashion-MNIST is intended to serve as a drop-in replacement for MNIST.

<img src="https://4.bp.blogspot.com/-OQZGt_5WqDo/Wa_Dfa4U15I/AAAAAAAAAUI/veRmAmUUKFA19dVw6XCOV2YLO6n-y_omwCLcBGAs/s400/out.jpg" width="400px" alt="Grid of Fashion-MNIST sample images showing 10 clothing categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot" />

### DataLoaders (PyTorch)

We use `torchvision.datasets.FashionMNIST` with `transforms.Compose` and PyTorch DataLoaders. We'll take a small train/validation split for faster experimentation.

In [None]:
# Load Fashion-MNIST with normalization
mean, std = (0.2860,), (0.3530,)
base_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

fashion_train = datasets.FashionMNIST(root='data', train=True, download=True, transform=base_transform)
fashion_test = datasets.FashionMNIST(root='data', train=False, download=True, transform=base_transform)

print(fashion_train, fashion_test)

💬 **Discuss:**

⁉️ **Q:** What does `transforms.Normalize(mean, std)` do? How does it transform the pixel values?
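
A small numeric sketch may help with this question. It uses plain tensor arithmetic rather than torchvision, relying on the fact that `Normalize` computes `(x - mean) / std` per channel. The names `demo_mean`, `demo_std`, and `demo_pixels` are illustrative and chosen so they do not overwrite the notebook's `mean` and `std`.

```python
import torch

demo_mean, demo_std = 0.2860, 0.3530  # same constants as the loading cell

# Three pixel intensities after ToTensor(): black, the dataset mean, white.
demo_pixels = torch.tensor([0.0, 0.2860, 1.0])

normalized = (demo_pixels - demo_mean) / demo_std  # arithmetic Normalize applies to one channel
recovered = normalized * demo_std + demo_mean      # inverse, useful when displaying images

print("normalized:", normalized)
print("recovered: ", recovered)
```

A pixel equal to the mean maps to 0, darker pixels become negative, brighter pixels become positive, and multiplying by the standard deviation and adding the mean back recovers the original values.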

In [None]:
labels = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
          'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


In [None]:
# Display a few examples (unnormalized for visualization)
fig, axs = plt.subplots(3, 5, figsize=(9, 7))
for i, ax in enumerate(axs.ravel()):
    image, label = fashion_train[i]
    # Unnormalize: the transform applied Normalize(mean, std), which computes (x - mean) / std
    # To display the original pixel values, we reverse this: x_original = x_normalized * std + mean
    image = image * std[0] + mean[0]
    ax.imshow(image.squeeze(), cmap='gray')
    ax.set_title(labels[label], fontsize=8)
    ax.axis('off')
plt.tight_layout()

We will use PyTorch `Subset` objects to create train/validation/test splits without writing images to disk.

In [None]:
from torch.utils.data import Subset

# Reduce dataset sizes for faster experimentation
generator = torch.Generator().manual_seed(109)
train_size = 1200
val_size = 240
test_size = 1000

train_val_indices = torch.randperm(len(fashion_train), generator=generator)[:train_size + val_size]
train_indices = train_val_indices[:train_size].tolist()
val_indices = train_val_indices[train_size:].tolist()
test_indices = torch.randperm(len(fashion_test), generator=generator)[:test_size].tolist()

train_subset = Subset(fashion_train, train_indices)
val_subset = Subset(fashion_train, val_indices)
test_subset = Subset(fashion_test, test_indices)

In [None]:
# Create DataLoaders
batch_size = 32

train_loader = torch.utils.data.DataLoader(train_subset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_subset, batch_size=batch_size, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_subset, batch_size=batch_size, shuffle=False)

data_batch, labels_batch = next(iter(train_loader))
print('data batch shape:', data_batch.shape)
print('labels batch shape:', labels_batch.shape)

💬 **Discuss:**

⁉️ **Q:** What is a DataLoader? Why do we use it instead of passing the full dataset to the model at once?
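
To make the batching behavior concrete, here is a toy example that is independent of Fashion-MNIST. It assumes a made-up dataset of ten samples wrapped in a `TensorDataset`; the names `toy_dataset` and `toy_loader` are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples with one feature each, plus integer labels 0..9.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
targets = torch.arange(10)
toy_dataset = TensorDataset(features, targets)

torch.manual_seed(0)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_sizes = []
seen_labels = []
for xb, yb in toy_loader:  # one full pass over the loader is one epoch
    batch_sizes.append(xb.shape[0])
    seen_labels.extend(yb.tolist())

print("batch sizes:", batch_sizes)
print("labels seen:", sorted(seen_labels))
```

The loader yields two batches of 4 and a final batch of 2, and every sample appears exactly once per epoch. Calling `optimizer.step()` once per batch, as `train_classifier` does, is what turns a full-batch update into mini-batch SGD.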

> **❓ Question 6: Subset Splits**
>
> 1. Why do we create separate train/validation/test subsets instead of reusing the full dataset?
> 2. How could class imbalance in a small subset affect validation accuracy?


### Build a Classifier

In [None]:
class FashionMLP(nn.Module):
    def __init__(self, input_dim=28*28, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes)
        )

    def forward(self, x):
        return self.net(x)

In [None]:
def train_classifier(model, train_loader, val_loader, optimizer, criterion, epochs=30, patience=None):
    history = {'loss': [], 'val_loss': [], 'accuracy': [], 'val_accuracy': []}

    # Early stopping state
    best_val_loss = np.inf
    patience_counter = 0
    best_model_state = None

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        correct = 0
        total = 0
        for xb, yb in train_loader:
            xb = xb.to(device)
            yb = yb.to(device)
            optimizer.zero_grad()
            outputs = model(xb)
            loss = criterion(outputs, yb)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * xb.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == yb).sum().item()
            total += yb.size(0)
        train_loss /= len(train_loader.dataset)
        train_acc = correct / total

        model.eval()
        val_loss = 0.0
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb = xb.to(device)
                yb = yb.to(device)
                outputs = model(xb)
                loss = criterion(outputs, yb)
                val_loss += loss.item() * xb.size(0)
                preds = outputs.argmax(dim=1)
                val_correct += (preds == yb).sum().item()
                val_total += yb.size(0)
        val_loss /= len(val_loader.dataset)
        val_acc = val_correct / val_total

        history['loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['accuracy'].append(train_acc)
        history['val_accuracy'].append(val_acc)

        # Early stopping logic
        if patience is not None:
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
                best_model_state = copy.deepcopy(model.state_dict())
            else:
                patience_counter += 1

            if patience_counter >= patience:
                break

    # Restore best model weights
    if patience is not None and best_model_state is not None:
        model.load_state_dict(best_model_state)

    return history


def evaluate_classifier(model, loader, criterion):
    model.eval()
    loss_total = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for xb, yb in loader:
            xb = xb.to(device)
            yb = yb.to(device)
            outputs = model(xb)
            loss = criterion(outputs, yb)
            loss_total += loss.item() * xb.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == yb).sum().item()
            total += yb.size(0)
    return loss_total / len(loader.dataset), correct / total

In [None]:
torch.manual_seed(109)
model_cls = FashionMLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_cls.parameters(), lr=0.001)

history_cls = train_classifier(
    model_cls,
    train_loader,
    val_loader,
    optimizer,
    criterion,
    epochs=30,
    patience=5
)

In [None]:
train_loss, train_acc = evaluate_classifier(model_cls, train_loader, criterion)
val_loss, val_acc = evaluate_classifier(model_cls, val_loader, criterion)
test_loss, test_acc = evaluate_classifier(model_cls, test_loader, criterion)

print(f"Train - Loss: {train_loss:.4f}, Acc: {train_acc:.2f}")
print(f"Val   - Loss: {val_loss:.4f}, Acc: {val_acc:.2f}")
print(f"Test  - Loss: {test_loss:.4f}, Acc: {test_acc:.2f}")

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
print(f"Val Acc at last training epoch (before best-model restore): {history_cls['val_accuracy'][-1]:.2f}")
axs[0].plot(history_cls['accuracy'])
axs[0].plot(history_cls['val_accuracy'])
axs[0].set_title('Baseline Classifier - Accuracy')
axs[0].set_ylabel('accuracy')
axs[0].set_xlabel('epoch')
axs[0].grid(True, alpha=0.3)
axs[0].legend(['train', 'validation'], loc='upper left')

axs[1].plot(history_cls['loss'])
axs[1].plot(history_cls['val_loss'])
axs[1].set_title('Baseline Classifier - Loss')
axs[1].set_ylabel('loss')
axs[1].set_xlabel('epoch')
axs[1].grid(True, alpha=0.3)
axs[1].legend(['train', 'validation'], loc='upper left')
plt.tight_layout();

### Data Augmentation

In Part 1, we added Gaussian noise directly inside the model as a regularizer. For image data, we can go further by applying transformations that exploit known invariances. A T-shirt is still a T-shirt whether flipped horizontally or slightly rotated. By applying random transforms on the fly, the model sees a different variation of each image every epoch, even though the dataset itself remains 1,200 samples. Below we combine random flips, small rotations, and Gaussian noise as a `torchvision` transform pipeline.

In [None]:
class AddGaussianNoise:
    def __init__(self, std=0.05):
        self.std = std

    def __call__(self, tensor):
        return torch.clamp(tensor + torch.randn_like(tensor) * self.std, 0.0, 1.0)

augment_transform = transforms.Compose([
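    transforms.RandomHorizontalFlip(),  # flips with the default probability of 0.5
    transforms.RandomRotation(10),      # random angle drawn from [-10, 10] degrees
    transforms.ToTensor(),              # PIL image -> float tensor in [0, 1]
    AddGaussianNoise(0.05),             # noise in pixel space, clamped back to [0, 1]
    transforms.Normalize(mean, std)     # standardize with the mean and std defined earlier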
])

fashion_train_aug = datasets.FashionMNIST(root='data', train=True, download=False, transform=augment_transform)
train_subset_aug = Subset(fashion_train_aug, train_indices)
train_loader_aug = torch.utils.data.DataLoader(train_subset_aug, batch_size=batch_size, shuffle=True)

> **❓ Question 7: Choosing Augmentations**
>
> Not all augmentations are appropriate for every dataset. Consider the transforms we used: `RandomHorizontalFlip`, `RandomRotation(10)`, and `AddGaussianNoise`.
>
> 1. Why is `RandomHorizontalFlip` a reasonable augmentation for Fashion-MNIST? For which classes (if any) might it *not* preserve the label?
> 2. Why do we limit rotation to just 10 degrees? What could go wrong with larger rotations (e.g., 90°)?
> 3. We did **not** include `RandomVerticalFlip`. Why would vertical flips be a bad augmentation for clothing images?
> 4. If you were working with a medical imaging dataset (e.g., chest X-rays), which of these augmentations would still be appropriate and which would not?


In [None]:
torch.manual_seed(109)
model_cls_aug = FashionMLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_cls_aug.parameters(), lr=0.001)

history_aug = train_classifier(
    model_cls_aug,
    train_loader_aug,
    val_loader,
    optimizer,
    criterion,
    epochs=60,
    patience=10
)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
print(f"Final epoch val Acc: {history_aug['val_accuracy'][-1]:.2f}")
print(f"Best epoch val Acc: {max(history_aug['val_accuracy']):.2f}")

axs[0].plot(history_aug['accuracy'])
axs[0].plot(history_aug['val_accuracy'])
axs[0].set_title('Augmented Classifier - Accuracy')
axs[0].set_ylabel('accuracy')
axs[0].set_xlabel('epoch')
axs[0].legend(['train', 'validation'], loc='upper left')

axs[1].plot(history_aug['loss'])
axs[1].plot(history_aug['val_loss'])
axs[1].set_title('Augmented Classifier - Loss')
axs[1].set_ylabel('loss')
axs[1].set_xlabel('epoch')
axs[1].legend(['train', 'validation'], loc='upper left')
plt.tight_layout();

In [None]:
train_loss, train_acc = evaluate_classifier(model_cls_aug, train_loader, criterion)
val_loss, val_acc = evaluate_classifier(model_cls_aug, val_loader, criterion)
test_loss, test_acc = evaluate_classifier(model_cls_aug, test_loader, criterion)

print(f"Train - Loss: {train_loss:.4f}, Acc: {train_acc:.2f}")
print(f"Val   - Loss: {val_loss:.4f}, Acc: {val_acc:.2f}")
print(f"Test  - Loss: {test_loss:.4f}, Acc: {test_acc:.2f}")

**Results comparison:** Augmentation improved validation accuracy over the baseline on our small 1,200-sample training set, while test accuracy remained similar. With only 1,000 test samples, a few points of difference can be noise, and the validation split comes from the same `fashion_train` pool, so gains may not transfer to the separate test set. Always rely on a truly held-out test set and avoid over-interpreting small differences.

Note: exact metrics can vary across runs and hardware (CPU vs GPU) due to nondeterminism.
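
To put a rough number on "a few points of difference can be noise", the sketch below computes the binomial margin of error for an accuracy estimate. The accuracy of 0.80 is a hypothetical value chosen for illustration, not a result from this notebook, and the normal approximation assumes the test examples are independent.

```python
import math

def accuracy_margin(acc, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n samples."""
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    return z * standard_error

# Hypothetical accuracy of 0.80 measured on 1,000 test images.
margin = accuracy_margin(0.80, 1000)
print(f"0.80 +/- {margin:.3f}")  # 0.80 +/- 0.025
```

Under these assumptions, a gap of one or two accuracy points between two models is within the range that sampling noise alone could produce on a 1,000-image test set.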

> **❓ Question 8: Model Layer vs. Data Transform for Noise**
>
> In Part 1 we added Gaussian noise as a **model layer** (`GaussianNoise(nn.Module)`), while in Part 2 we added it as a **data transform** (`AddGaussianNoise` in `transforms.Compose`). Both inject random noise during training to regularize the model.
>
> 1. At what point in the pipeline does noise get applied in each approach? How does this affect what the noise "means" (pixel space vs. normalized space)?
> 2. What are the trade-offs of each approach? When might you prefer one over the other?
> 3. The `AddGaussianNoise` transform clamps values to [0, 1]. Why is this necessary here but not in the `GaussianNoise` module?


## A Note on GPU Device Management

This notebook is configured to use a GPU when available (`device = 'cuda'`), falling back to CPU otherwise. Working with GPUs in PyTorch requires explicitly moving data and models to the same device. Here's the pattern you'll see throughout:

**1. Model to device**: `model.to(device)` moves all model parameters and buffers (weights, biases, BatchNorm running statistics) to the target device. This only needs to be done once after creating the model.

**2. Batches to device**: Inside the training loop, each mini-batch must also be moved: `xb = xb.to(device)` and `yb = yb.to(device)`. DataLoaders return CPU tensors by default, so this transfer happens every iteration. If the model is on GPU but the data is on CPU (or vice versa), PyTorch will raise a runtime error.

**3. Getting results back for evaluation**: There are two separate concerns when extracting values from PyTorch tensors:

- **Computation graph**: Tensors produced during a forward pass track operations for automatic differentiation (regardless of whether they're on CPU or GPU). To disconnect from this graph, either wrap code in `torch.no_grad()` (prevents tracking entirely; used in eval loops) or call `.detach()` on individual tensors.

- **Device transfer**: GPU tensors can't be converted to NumPy arrays directly; they must first move to CPU via `.cpu()`.

In practice you'll see two patterns in this notebook:
- **Scalars** (losses, accuracy counts): use `.item()` to extract a Python number. This handles both detaching and device transfer in one call. Example from the training loop: `loss.item()`, `(preds == yb).sum().item()`.
- **Full tensors** (predictions for plotting): use `.detach().cpu().numpy()` to get a NumPy array. Example from `plot_predictions`: `preds = model(x_tensor).detach().cpu().numpy()`. Inside a `torch.no_grad()` block the `.detach()` is technically redundant, but it's common practice to include it for safety.

In short: **model and data must live on the same device**, and **results must come back to CPU** before you can use them with non-PyTorch libraries like NumPy or matplotlib.
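
The three steps above can be condensed into a short sketch. The tiny linear model and the random batch are hypothetical stand-ins, not objects from this notebook, and the code falls back to CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 1. Model to device (done once). Toy model standing in for FashionMLP.
toy_model = nn.Linear(4, 3).to(device)
criterion = nn.CrossEntropyLoss()

# 2. Batch to device (done every iteration). Random data stands in for a DataLoader batch.
xb = torch.randn(8, 4).to(device)
yb = torch.randint(0, 3, (8,)).to(device)

outputs = toy_model(xb)
loss = criterion(outputs, yb)

# 3a. Scalars: .item() returns a plain Python float, whatever the device.
loss_value = loss.item()

# 3b. Full tensors: detach from the graph, move to CPU, then convert to NumPy.
probs = torch.softmax(outputs, dim=1).detach().cpu().numpy()

print(type(loss_value).__name__, probs.shape)  # float (8, 3)
```

Calling `.numpy()` on `outputs` directly would fail here because the tensor still requires gradients, which is why `.detach()` comes first when the code is not inside a `torch.no_grad()` block.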