# BME I9400 – **HW3: Multilayer Neural Nets in PyTorch (Fashion-MNIST)**
### Student Starter Notebook

**What you will learn**
- Build, train, and evaluate multilayer perceptrons (MLPs) in PyTorch
- Work with `Dataset`/`DataLoader`, seeding for reproducibility, device placement (CPU/GPU), and checkpoints
- Diagnose under/overfitting; apply regularization (dropout, weight decay)
- Do a small hyperparameter search

**Dataset**: **Fashion-MNIST** (10 classes, 28×28 grayscale). Downloaded via `torchvision`.

**References (course materials)**
- Lecture 09: *Multilayer Perceptrons*: `lecture_09_multilayer_perceptrons.ipynb`
- Lecture 10: *Neural Nets in PyTorch*: `lecture_10_neural_nets_in_pytorch.ipynb`

**What to submit**
- This notebook with **all TODOs completed** and all cells runnable top-to-bottom.
- **Save the notebook to your `my-work` folder.**

**Grading (100 pts)**
- (10) Reproducible setup + clean data pipeline
- (20) Baseline MLP implemented + trained with curves
- (35) Capacity & regularization comparison (3 configs + analysis)
- (35) Small hyperparameter search (design + results + chosen best)

---
**Checklist before you submit**
- [ ] Every section’s `TODO` is completed
- [ ] Plots are readable and titled
- [ ] You wrote the short answers where requested
- [ ] Notebook runs from top to bottom without errors


## 1) Setup, reproducibility, and device
**Instructions**
- Run the cell below first. Do not change the seeding unless you document why.
- The cell seeds Python, NumPy, and PyTorch, and sets the cuDNN deterministic flags to improve reproducibility.
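
To see what seeding buys you, here is a minimal sketch (separate from the graded cell). The helper `draw` is hypothetical and only illustrates that re-seeding PyTorch's global generator reproduces the same random draws.

```python
import torch

def draw(seed: int) -> torch.Tensor:
    # Re-seeding resets PyTorch's global CPU generator to a known state.
    torch.manual_seed(seed)
    return torch.rand(3)

a = draw(9400)
b = draw(9400)   # same seed as `a`
c = draw(1)      # different seed

print("same seed, same draws:", torch.equal(a, b))
print("different seed, same draws:", torch.equal(a, c))
```

Weight initialization and `DataLoader` shuffling draw from this generator by default, which is why the seed should be set before models and loaders are created.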

In [None]:
import os, random, math, time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

def set_seed(seed: int = 9400):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(9400)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

## 2) Data pipeline (Fashion-MNIST)
**Goal:** Build train/val/test dataloaders with appropriate normalization.

**Instructions**
- Use `FashionMNIST` via `torchvision.datasets`.
- Normalize with **mean≈0.286** and **std≈0.353** (already provided below).
- Split the training set into **50k train / 10k val** deterministically.
- Use `shuffle=True` for the train loader.
- Print one batch shape to confirm `[B,1,28,28]`.
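
As a quick worked example of what the normalization step does: `ToTensor()` scales pixels to `[0, 1]`, and `Normalize` then computes `(x - mean) / std` per channel. The sketch below applies that formula to the darkest and brightest possible pixels in plain Python; the helper name `normalize` is an illustration only.

```python
MEAN, STD = 0.286, 0.353  # constants from the instructions above

def normalize(pixel: float) -> float:
    """Map a pixel in [0, 1] (the ToTensor range) to its standardized value."""
    return (pixel - MEAN) / STD

lo, hi = normalize(0.0), normalize(1.0)
print(f"black -> {lo:.3f}, white -> {hi:.3f}")
```

After normalization the inputs are roughly zero-centered with unit scale, instead of all being non-negative.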

In [None]:
# TODO: Run this cell
DATA_DIR = './data'
BATCH_SIZE = 128  # you may adjust if you have limited compute

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.286,), (0.353,)),
])

full_train = datasets.FashionMNIST(DATA_DIR, train=True, download=True, transform=transform)
test_ds   = datasets.FashionMNIST(DATA_DIR, train=False, download=True, transform=transform)

train_size = 50000
val_size = len(full_train) - train_size
train_ds, val_ds = random_split(full_train, [train_size, val_size], generator=torch.Generator().manual_seed(9400))

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

xb, yb = next(iter(train_loader))
print('Batch shapes:', xb.shape, yb.shape)

**Short answer (1–2 sentences):** Why do we normalize images before training?

> TODO: Your answer here.

## 3) Baseline MLP + training loop
**Goal:** Implement a clean baseline MLP and a minimal training/eval loop, and save the best checkpoint.

**Specs**
- Model: `Flatten → Linear(784→256) → ReLU → Linear(256→10)`.
- Loss: `CrossEntropyLoss`; Optimizer: `Adam(lr=1e-3)`; Epochs: **10** (you may increase if you have time).
- Save the **best-val** checkpoint.
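
A checkpoint is only useful if you can restore it later. The following is a minimal sketch of the save/restore round trip using a tiny stand-in model and a temporary file; the file name `demo_ckpt.pt` and the `nn.Linear(4, 2)` model are assumptions for illustration, not part of the assignment.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; any nn.Module is saved and restored the same way.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_ckpt.pt")   # hypothetical file name
    torch.save(model.state_dict(), path)       # save the weights only

    restored = nn.Linear(4, 2)                 # must match the saved architecture
    state = torch.load(path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()  # switch to eval mode before computing metrics
x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print("outputs match:", same)
```

Because the training loop keeps updating `model` after the best epoch, reloading the saved `state_dict` is how you would get back the best-val weights before any final evaluation.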

In [None]:
# TODO: Implement the forward() method below
class MLP(nn.Module):
    def __init__(self, input_dim=28*28, hidden_dim=256, num_classes=10, dropout=0.0):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # TODO: Implement forward pass
        raise NotImplementedError('TODO: Implement forward pass')
       

In [None]:
# TODO: Implement the accuracy_from_logits() function, which returns a scalar float accuracy in [0,1]
# The inputs are the model outputs (called logits) and the targets
# Hint: use torch.argmax() to get the predicted class

def accuracy_from_logits(logits, targets):
    """Return scalar float accuracy in [0,1]."""
    # TODO: implement
    raise NotImplementedError

In [None]:
# TODO: Implement the evaluate_epoch() function below
# Hint: it closely mirrors train_epoch() (code provided below), but with no
# backward pass or optimizer step, since gradients are not needed
# You will need to set the model to evaluation mode
@torch.no_grad()
def evaluate_epoch(model, loader, criterion, device):
    """Iterate loader in eval mode; return (avg_loss, avg_acc)."""
    # TODO: implement
    raise NotImplementedError

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss, total_acc, n = 0.0, 0.0, 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        acc = accuracy_from_logits(logits, yb)
        bs = xb.size(0)
        total_loss += loss.item() * bs
        total_acc += acc * bs
        n += bs
    return total_loss / n, total_acc / n


In [None]:
# TODO: Run this cell to train your baseline model and save the best checkpoint
EPOCHS = 10
model = MLP(hidden_dim=256, dropout=0.0).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

best_val_acc = 0.0
history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
ckpt_path = 'best_mlp_baseline.pt'

for epoch in range(1, EPOCHS+1):
    tr_loss, tr_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate_epoch(model, val_loader, criterion, device)
    history['train_loss'].append(tr_loss); history['train_acc'].append(tr_acc)
    history['val_loss'].append(val_loss); history['val_acc'].append(val_acc)
    print(f'Epoch {epoch:02d}: train_loss={tr_loss:.4f} acc={tr_acc:.3f} | val_loss={val_loss:.4f} acc={val_acc:.3f}')
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), ckpt_path)
print('Saved best baseline to', ckpt_path)

plt.figure(); plt.plot(history['train_loss'], label='train'); plt.plot(history['val_loss'], label='val');
plt.xlabel('Epoch'); plt.ylabel('Loss'); plt.title('Baseline MLP – Loss'); plt.legend(); plt.show()

plt.figure(); plt.plot(history['train_acc'], label='train'); plt.plot(history['val_acc'], label='val');
plt.xlabel('Epoch'); plt.ylabel('Accuracy'); plt.title('Baseline MLP – Accuracy'); plt.legend(); plt.show()


**Short answer (2–3 sentences):** Based on the curves, does the baseline underfit, overfit, or generalize well? Explain briefly.

> TODO: Your answer here.

## 4) Capacity & regularization study
**Goal:** Compare three configurations for ~10 epochs each and discuss bias/variance trade-offs.

**Configs**
1. `1×128` (no dropout)
2. `2×256` (no dropout)
3. `2×256` + `dropout=0.3` + `weight_decay=1e-4`
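
One way to make "capacity" concrete is to count trainable parameters. The sketch below does this in plain Python for fully connected layers on 784 inputs and 10 outputs; `mlp_param_count` is a hypothetical helper, not part of the starter code. Dropout adds no parameters, so configs 2 and 3 have the same count.

```python
def mlp_param_count(hidden_sizes, in_dim=28 * 28, num_classes=10):
    """Count weights and biases of a fully connected net."""
    total, prev = 0, in_dim
    for width in list(hidden_sizes) + [num_classes]:
        total += prev * width + width  # weight matrix + bias vector
        prev = width
    return total

small = mlp_param_count((128,))
medium = mlp_param_count((256, 256))
print("1x128:", small)    # 101770
print("2x256:", medium)   # 269322
```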

**Deliverables**
- Val accuracy curves for all three
- A short paragraph: which overfits/underfits and why
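
Dropout behaves differently in training and evaluation mode, which matters when you compare train and val curves. A minimal sketch, assuming a standalone `nn.Dropout(p=0.3)` layer applied to a vector of ones (the seed and tensor size are arbitrary choices for the demo):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # each entry is either 0 or scaled to 1 / (1 - 0.3)

drop.eval()
y_eval = drop(x)    # dropout is the identity in eval mode

print("values seen in train mode:", y_train.unique().tolist())
print("unchanged in eval mode:", torch.equal(y_eval, x))
```

Since dropout is active only in training mode, training metrics for the regularized config are computed on a noisier network than validation metrics, so the two are not directly comparable.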

In [None]:
# TODO: Run this cell
# Helper model for deeper networks
class DeepMLP(nn.Module):
    def __init__(self, hidden_sizes=(128,), dropout=0.0, num_classes=10):
        super().__init__()
        layers = [nn.Flatten()]
        in_dim = 28*28
        for h in hidden_sizes:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            in_dim = h
        layers.append(nn.Linear(in_dim, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def fit_model(model, train_loader, val_loader, epochs, optimizer, criterion, device):
    hist = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
    for _ in range(epochs):
        tl, ta = train_epoch(model, train_loader, criterion, optimizer, device)
        vl, va = evaluate_epoch(model, val_loader, criterion, device)
        hist['train_loss'].append(tl); hist['train_acc'].append(ta)
        hist['val_loss'].append(vl); hist['val_acc'].append(va)
    return hist


In [None]:
# Use the three configs defined here to evaluate 3 different models
configs = {
    'small_1x128': dict(hidden_sizes=(128,), dropout=0.0, weight_decay=0.0),
    'medium_2x256': dict(hidden_sizes=(256,256), dropout=0.0, weight_decay=0.0),
    'medium_reg':   dict(hidden_sizes=(256,256), dropout=0.3, weight_decay=1e-4),
}

# TODO: fill in the required lines in the for loop below
results = {}
for name, cfg in configs.items():
    print('\nRunning', name, cfg)
    
    # TODO: create DeepMLP model  
    
    # TODO: create Adam optimizer
    
    # TODO: fit the model
    
    # TODO: store the results

# TODO: Plot the validation accuracy versus epoch for all three configs


**Short answer (3–4 sentences):** Which config overfits the most? Which underfits? Explain using the curves and final metrics.

> TODO: Your answer here.

## 5) Small hyperparameter search (use 2×256)
**Goal:** Tune a few hyperparameters quickly and pick a winner by validation accuracy.

**Grid**
- `lr ∈ {1e-3, 3e-4}`
- `batch_size ∈ {64, 128}`
- `dropout ∈ {0.0, 0.3}`
- `weight_decay ∈ {0.0, 1e-4}`

**Instructions**
- Train each candidate for ~8 epochs
- Report the **best config** and its max val accuracy
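
Before launching the search, it helps to know how many runs the grid implies. The sketch below enumerates the grid with `itertools.product`; the dictionary keys are illustrative names, and the epoch total assumes exactly 8 epochs per trial.

```python
from itertools import product

grid = {
    "lr": [1e-3, 3e-4],
    "batch_size": [64, 128],
    "dropout": [0.0, 0.3],
    "weight_decay": [0.0, 1e-4],
}

# One dict per combination, in the same order product() yields them.
trials = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(trials), "trials,", len(trials) * 8, "epochs in total at 8 epochs each")
print("first trial:", trials[0])
```

With 2 values for each of 4 hyperparameters there are 16 trials, so budget your compute accordingly (for example, by running on a GPU if one is available).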

In [None]:
# Use the provided code excerpts to implement a small hyperparameter search over dropout, learning rate, batch size, and weight decay
from itertools import product

def rebuild_loaders(bs):
    return (DataLoader(train_ds, batch_size=bs, shuffle=True, num_workers=2, pin_memory=True),
            DataLoader(val_ds,   batch_size=bs, shuffle=False, num_workers=2, pin_memory=True))

# This initializes the best configuration and the best validation accuracy
best_cfg, best_acc = None, -1
# Note: best_cfg is a dictionary that will store the best configuration found during the search {lr: ___, bs: ___, dropout: ___, weight_decay: ___}

# TODO: fill in the missing lines in the for loop below
for lr, bs, dp, wd in product([1e-3, 3e-4], [64, 128], [0.0, 0.3], [0.0, 1e-4]):
    print(f'\nTrial: lr={lr} bs={bs} dropout={dp} wd={wd}')
    tr_loader, v_loader = rebuild_loaders(bs)

    # TODO: create a DeepMLP model with the correct value for dropout

    # TODO: create an Adam optimizer with the correct value for learning rate and weight decay

    # TODO: fit the model

    # TODO: find the maximum validation accuracy

    # TODO: update the best configuration if the current one has a higher validation accuracy

print('\nBest config:', best_cfg, 'val_acc=', best_acc)


**Short answer (2–4 sentences):** Why might this best config outperform nearby ones in your grid?

> TODO: Your answer here.