In [None]:
# DS776 Environment Setup & Package Update
# Configures storage paths for proper cleanup/sync, then updates introdl if needed
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 03 Assignment
**Name:** [Student Name Here]  
**Total Points:** 40

## Submission Checklist
- [ ] All code cells executed with output saved
- [ ] All questions answered
- [ ] Notebook converted to HTML (use the Homework_03_Utilities notebook)
- [ ] Canvas notebook filename includes `_GRADE_THIS_ONE`
- [ ] Files uploaded to Canvas

---

# Better Training Techniques

In this assignment you will build a deeper CNN to improve classification performance on the FashionMNIST dataset. Deeper models can be harder to train, so you'll use some of the techniques from Lesson 3 to improve training. You'll also use data augmentation to improve performance while reducing overfitting. Along the way you'll see how to downsample a dataset for faster experimentation.

In [None]:
# YOUR IMPORTS HERE
# Add any additional imports you need below this line


from introdl.utils import config_paths_keys

# Configure paths
paths = config_paths_keys()
DATA_PATH = paths['DATA_PATH']
MODELS_PATH = paths['MODELS_PATH']

## Storage Guidance

**Always use the path variables** (`MODELS_PATH`, `DATA_PATH`, `CACHE_PATH`) instead of hardcoded paths. The actual locations depend on your environment:

| Variable | CoCalc Home Server | Compute Server |
|----------|-------------------|----------------|
| `MODELS_PATH` | `Homework_03_Models/` | `Homework_03_Models/` *(synced)* |
| `DATA_PATH` | `~/home_workspace/data/` | `~/cs_workspace/data/` *(local)* |
| `CACHE_PATH` | `~/home_workspace/downloads/` | `~/cs_workspace/downloads/` *(local)* |

**Why this matters:**
- On **Compute Servers**: Only `MODELS_PATH` syncs back to CoCalc (~10GB limit). Data and cache stay local (~50GB).
- On **CoCalc Home**: Everything syncs and counts against the ~10GB limit.
- **Storage_Cleanup.ipynb** (in this folder) helps free synced space when needed.

**Tip:** Always write `MODELS_PATH / 'model.pt'`; never hardcode paths like `'Homework_03_Models/model.pt'`.
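
As a minimal sketch of the tip above, assuming `MODELS_PATH` is a `pathlib.Path` (the stand-in value below is hypothetical; in the notebook it comes from `config_paths_keys()`), the `/` operator builds the file path without hardcoding the folder:

```python
from pathlib import Path

# Hypothetical stand-in; in the notebook MODELS_PATH comes from config_paths_keys().
MODELS_PATH = Path("Homework_03_Models")

# Build checkpoint paths from the variable so the same code runs on any server.
checkpoint_file = MODELS_PATH / "model.pt"

print(checkpoint_file.name)    # file name only
print(checkpoint_file.parent)  # folder taken from MODELS_PATH
```

Because the folder comes from the variable, the same cell works unchanged on the CoCalc home server and on a compute server.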

## Part 1 - Model and Data Setup (10 pts)

### 1.1 [5 pts] Build the Model

Implement a PyTorch model that subclasses `nn.Module` and reproduces the structure shown in the assignment. The model should have:
- Three blocks of convolutional layers
- Each block contains 3 Conv2d layers with ReLU activations
- MaxPool2d after blocks 1 and 2
- A single Linear layer as the classifier
- Total parameters should be around 542,922
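
The parameter target can be checked by hand. The sketch below uses the channel widths listed in the TODO comments of the next cell (1→32, 32→64, 64→128) and the `Linear(6272, 10)` classifier. It assumes 3×3 kernels with bias terms in every `Conv2d` layer, padding that preserves spatial size, and 2×2 max pooling; none of these is stated above, but under those assumptions the count matches the target.

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases for one Conv2d layer (kernel size 3 is an assumption)."""
    return out_ch * in_ch * kernel * kernel + out_ch

def block_params(in_ch, out_ch, n_layers=3):
    """First layer changes the channel count; the remaining layers keep it."""
    return conv_params(in_ch, out_ch) + (n_layers - 1) * conv_params(out_ch, out_ch)

conv_total = block_params(1, 32) + block_params(32, 64) + block_params(64, 128)

# Two 2x2 max pools (assumed) reduce 28x28 to 7x7, so the flattened size is 128 * 7 * 7.
flat_features = 128 * 7 * 7
linear_total = flat_features * 10 + 10  # weights plus biases

total = conv_total + linear_total
print(flat_features, total)  # 6272 542922
```

If your `summary` output shows a different total, the kernel size, padding, or bias settings are the first things to check.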

In [None]:
# YOUR CODE HERE
# TODO: Define the FashionMNISTModel class
# - Block 1: 1->32 channels, 3 conv layers, MaxPool2d
# - Block 2: 32->64 channels, 3 conv layers, MaxPool2d
# - Block 3: 64->128 channels, 3 conv layers, no pooling
# - Classifier: Flatten + Linear(6272, 10)

class FashionMNISTModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Your implementation
        pass
    
    def forward(self, x):
        # Your implementation
        pass

# Create model and show summary
# model = FashionMNISTModel()
# summary(model, input_size=(64, 1, 28, 28))

### 1.2 [5 pts] Setup the Data

Load the FashionMNIST dataset. Normalize with mean 0.2860 and standard deviation 0.3530. Downsample the training dataset to 10% of its original size so that experiments run quickly.

Use the FashionMNIST test dataset as your `valid_dataset`. For the DataLoaders, start with a batch size of 64.
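
To make these numbers concrete, here is a small arithmetic sketch. It assumes pixel values are scaled to [0, 1] before normalization (as torchvision's `ToTensor` does), that the FashionMNIST training split has 60,000 images, and that the last partial batch is kept.

```python
import math

MEAN, STD = 0.2860, 0.3530

def normalize(pixel):
    """Apply (x - mean) / std to a pixel already scaled to [0, 1]."""
    return (pixel - MEAN) / STD

# Black and white pixels after normalization.
low, high = normalize(0.0), normalize(1.0)
print(round(low, 3), round(high, 3))

# Size of the 10% training subset and the number of batches per epoch.
full_train_size = 60_000  # FashionMNIST training images
subset_size = int(0.1 * full_train_size)
batches_per_epoch = math.ceil(subset_size / 64)
print(subset_size, batches_per_epoch)  # 6000 94
```

With only 94 batches per epoch, each epoch on the subset runs roughly ten times faster than on the full training set.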

In [None]:
# YOUR CODE HERE
# TODO: Load FashionMNIST with normalization
# TODO: Downsample training set to 10% using provided code
# TODO: Create DataLoaders

# Use this code for downsampling:
# from torch.utils.data import Subset
# np.random.seed(42)  # use this seed for reproducibility
# subset_indices = np.random.choice(len(train_dataset), size=int(0.1 * len(train_dataset)), replace=False)
# train_dataset = Subset(train_dataset, subset_indices)

## Part 2 - Optimizer Comparison (10 pts)

### 2.1 [5 pts] Training with SGD

Train your model with Stochastic Gradient Descent. Track the accuracy metric. You'll likely need to increase both the learning rate and the number of epochs to see the validation accuracy plateau.

Instantiate a fresh model before training so that the results reflect a complete run from scratch.

In [None]:
# YOUR CODE HERE
# TODO: Create fresh model instance
# TODO: Setup SGD optimizer
# TODO: Train with appropriate learning rate and epochs

Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [None]:
# YOUR CODE HERE
# TODO: Load checkpoint and plot metrics

### 2.2 [5 pts] Training with AdamW

Now repeat the previous training using AdamW. You should be able to use the default learning rate of 0.001 and fewer epochs.
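
One reason the default learning rate tends to work for AdamW is that its step is normalized by a running estimate of the gradient magnitude. The following is a simplified scalar sketch of the standard SGD and AdamW update rules, not PyTorch's implementation; the hyperparameter values are the usual AdamW defaults and the function names are made up for this example. It shows that the first AdamW step has size close to `lr` whether the gradient is small or large, while the SGD step scales with the gradient.

```python
import math

def sgd_step(param, grad, lr):
    """Plain SGD: the step size is proportional to the gradient."""
    return param - lr * grad

def adamw_first_step(param, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.01):
    """First AdamW update (t = 1) starting from zero moment estimates."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)  # bias correction for t = 1
    v_hat = v / (1 - beta2)
    param = param - lr * weight_decay * param  # decoupled weight decay
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

for grad in (0.01, 100.0):
    sgd_change = abs(sgd_step(0.0, grad, lr=1e-3))
    adamw_change = abs(adamw_first_step(0.0, grad))
    print(f"grad={grad}: SGD step={sgd_change:.2e}, AdamW step={adamw_change:.2e}")
```

This is also why SGD usually needs its learning rate tuned to the problem, as in section 2.1.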

In [None]:
# YOUR CODE HERE
# TODO: Create fresh model instance
# TODO: Setup AdamW optimizer
# TODO: Train with default learning rate

Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [None]:
# YOUR CODE HERE
# TODO: Load checkpoint and plot metrics

### 2.3 Compare SGD and AdamW Training Performance

Make plots of validation loss and accuracy for both SGD and AdamW.

In [None]:
# YOUR CODE HERE
# TODO: Create comparison plots

## Part 3 - Advanced Training Techniques (13 pts)

### 3.1 [5 pts] Data Augmentation

Now use data augmentation. Build a `transform_train` pipeline that includes:
* Random horizontal flips
* Random crops of size 28 with padding of 4
* Random rotations of up to 10 degrees

Use the same seed to downsample `train_dataset` to 10% of its size.

In the next cell, set up the data and augmentation transforms (don't augment the validation data). Build the DataLoaders.
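
To see what a padded random crop does geometrically, here is a dependency-free sketch on a nested list rather than a tensor. It only illustrates the idea (zero-pad by 4, then cut a random 28×28 window) and is not torchvision's implementation; the helper name `random_crop_with_padding` is made up for this example.

```python
import random

def random_crop_with_padding(image, size=28, padding=4, rng=random):
    """Zero-pad a square image (list of rows), then cut a random size x size window."""
    width = len(image[0])
    padded_width = width + 2 * padding
    blank_row = [0] * padded_width
    padded = (
        [blank_row[:] for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [blank_row[:] for _ in range(padding)]
    )
    # Offsets range over 0..(padded size - crop size), i.e. 0..8 here.
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, padded_width - size)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    # Window offset relative to the original image, each in [-4, 4].
    return crop, (top - padding, left - padding)

rng = random.Random(0)
image = [[1] * 28 for _ in range(28)]  # dummy all-ones image
crop, offset = random_crop_with_padding(image, rng=rng)
print(len(crop), len(crop[0]), offset)
```

The output keeps the 28×28 shape the model expects, while the content shifts by up to 4 pixels in each direction, with zeros filling the exposed border.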

In [None]:
# YOUR CODE HERE
# TODO: Create augmentation transforms
# TODO: Setup datasets with augmentation
# TODO: Create DataLoaders

Train a new instance of your model with the new DataLoaders and AdamW. Training will take more epochs so you may have to experiment a little.

In [None]:
# YOUR CODE HERE
# TODO: Train model with augmentation

Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [None]:
# YOUR CODE HERE
# TODO: Load checkpoint and plot metrics

Compare validation loss and accuracy for the three different approaches so far: SGD, AdamW, and AdamW with augmentation. Make appropriate graphs and comment on the three training strategies in terms of their performance on metrics and overfitting.

📝 **YOUR ANALYSIS HERE:**

In [None]:
# YOUR CODE HERE
# TODO: Create comparison plots for all three approaches

### 3.2 [3 pts] Early Stopping

Early stopping isn't really necessary unless the metrics on the validation or test set start to degrade. Try it anyway to reinforce how it works. In this section, implement early stopping based on the validation loss, using AdamW and data augmentation. Add a plot comparing training with and without early stopping, and comment on the results. Do you get comparable performance?
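
The logic of early stopping on validation loss can be sketched independently of any training loop. The `EarlyStopper` class, its `patience` and `min_delta` parameters, and the loss values below are all hypothetical and for illustration only; the course's training helper may expose early stopping differently.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement, then a slow rise.
val_losses = [0.60, 0.48, 0.42, 0.43, 0.45, 0.44, 0.47, 0.50]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)  # 5 2 0.42
```

Training stops three epochs after the best one, so the weights worth keeping are those from `best_epoch`; a checkpoint should be saved whenever the loss improves.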

In [None]:
# YOUR CODE HERE
# TODO: Implement training with early stopping
# TODO: Compare with regular training

📝 **YOUR ANALYSIS HERE:**

### 3.3 [5 pts] OneCycleLR Scheduler

Create a new instance of the model. Implement a OneCycleLR learning rate scheduler and add it to your AdamW approach with data augmentation. You should be able to use a larger max learning rate of 0.003 or so. Experiment a little to see if you can get similar results to the above with fewer epochs (you may not be able to).
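
The shape of a one-cycle schedule can be sketched in plain Python. This is a simplified version (cosine warm-up to `max_lr`, then cosine decay), not PyTorch's `OneCycleLR` itself; the `pct_start`, `div_factor`, and `final_div_factor` values mirror that scheduler's documented defaults, and the step counts are made up.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.003, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0-based) for a simplified one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))

    def cosine(start, end, fraction):
        # Moves smoothly from `start` (fraction = 0) to `end` (fraction = 1).
        return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2

    if step < warmup_steps:
        return cosine(initial_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    return cosine(max_lr, min_lr, (step - warmup_steps) / decay_steps)

total_steps = 20 * 94  # e.g. 20 epochs of 94 batches (illustrative numbers)
schedule = [one_cycle_lr(s, total_steps) for s in range(total_steps)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```

The schedule is defined per optimizer step, not per epoch: with PyTorch's `OneCycleLR`, `scheduler.step()` is called after every batch, so the total number of steps is epochs times batches per epoch.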

In [None]:
# YOUR CODE HERE
# TODO: Implement OneCycleLR training

Load the checkpoint file and make graphs showing the training and validation losses and accuracies.

In [None]:
# YOUR CODE HERE
# TODO: Load checkpoint and plot metrics

Make a plot comparing the validation losses and accuracies for all of the training approaches above (there should be 4).

In [None]:
# YOUR CODE HERE
# TODO: Create comprehensive comparison plots

Which approach works best? Why?

📝 **YOUR ANSWER HERE:**

## Part 4 - Full Dataset Training (5 pts)

Take your best approach and apply it to the full dataset (don't downsample).

This will take a little more than a minute per epoch, so run your experiments on the smaller dataset above, then run this once. You can set `resume_from_checkpoint = True` if you want to extend the training.

How does this compare to the performance you achieved in HW 2? Import your best run from HW 2 and make a plot comparing it with your best approach from this assignment.

In [None]:
# YOUR CODE HERE
# TODO: Train best model on full dataset
# TODO: Compare with HW2 results

📝 **YOUR ANALYSIS HERE:**

## Part 5 - Reflection (2 pts)

1. What, if anything, did you find difficult to understand for this lesson? Why?

📝 **YOUR ANSWER HERE:**

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()