# Deep Learning Applications: Laboratory #1

In this first laboratory we will work with relatively simple architectures to get a feel for working with Deep Models. This notebook is designed to work with PyTorch, but as I said in the introductory lecture: please feel free to use and experiment with whatever tools you like.



## Exercise 1: Warming Up
In this series of exercises I want you to try to duplicate (on a small scale) the results of the ResNet paper:

> [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385), Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, CVPR 2016.

We will do this in steps using a Multilayer Perceptron on MNIST.

Recall that the main message of the ResNet paper is that **deeper** networks do not **guarantee** a greater reduction in training loss (or an improvement in validation accuracy). Below you will incrementally build a sequence of experiments to verify this for an MLP. A few guidelines:

+ I have provided some **starter** code at the beginning. **NONE** of this code should survive in your solutions. Not only is it **very** badly written, it is also written in my functional style, which obfuscates what it's doing (in part to **discourage** your reuse!). It's just to get you *started*.
+ These exercises ask you to compare **multiple** training runs, so it is **really** important that you factor this into your **pipeline**. Using [Tensorboard](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html) is a **very** good idea -- or, even better [Weights and Biases](https://wandb.ai/site).
+ You may work and submit your solutions in **groups of at most two**. Share your ideas with everyone, but the solutions you submit *must be your own*.

First some boilerplate to get you started, then on to the actual exercises!

### Preface: Some code to get you started

What follows is some **very simple** code for training an MLP on MNIST. The point of this code is to get you up and running (and to verify that your Python environment has all needed dependencies).

**Note**: As you read through my code and execute it, this would be a good time to think about *abstracting* **your** model definition, and training and evaluation pipelines in order to make it easier to compare performance of different models.

In [1]:
# Start with some standard imports.
import numpy as np
import matplotlib.pyplot as plt
from functools import reduce
import torch
from torchvision.datasets import MNIST
from torch.utils.data import Subset
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import wandb
from tqdm import tqdm


#### Data preparation

Here is some basic dataset loading and validation splitting code to get you started working with MNIST.

In [2]:
# Standard MNIST transform.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST train and test.
ds_train = MNIST(root='./datasets', train=True, download=True, transform=transform)
ds_test = MNIST(root='./datasets', train=False, download=True, transform=transform)

# Split train into train and validation.
val_size = 5000
I = np.random.permutation(len(ds_train))
ds_val = Subset(ds_train, I[:val_size])
ds_train = Subset(ds_train, I[val_size:])
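
The split above draws a fresh permutation on every execution, so two runs may validate on different samples. A minimal sketch of a seeded split, assuming reproducibility across runs is wanted; the helper name `split_train_val` and the toy dataset are hypothetical stand-ins, not part of the starter code:

```python
import torch
from torch.utils.data import TensorDataset, random_split

def split_train_val(dataset, val_size, seed=0):
    """Split `dataset` into (train, val) subsets, reproducibly for a given seed."""
    generator = torch.Generator().manual_seed(seed)
    train_size = len(dataset) - val_size
    return random_split(dataset, [train_size, val_size], generator=generator)

# Stand-in data so the sketch runs without downloading MNIST.
toy = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))
toy_train, toy_val = split_train_val(toy, val_size=20, seed=0)
```

Calling the helper again with the same seed should return the same validation indices, which keeps validation curves comparable between runs.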

#### Boilerplate training and evaluation code

This is some **very** rough training, evaluation, and plotting code. Again, just to get you started. I will be *very* disappointed if any of this code makes it into your final submission.

In [3]:
from sklearn.metrics import accuracy_score, classification_report

# Function to train a model for a single epoch over the data loader.
def train_epoch(model, dl, opt, epoch='Unknown', device='cpu'):
    model.train()
    losses = []
    for (xs, ys) in tqdm(dl, desc=f'Training epoch {epoch}', leave=True):
        xs = xs.to(device)
        ys = ys.to(device)
        opt.zero_grad()
        logits = model(xs)
        loss = F.cross_entropy(logits, ys)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return np.mean(losses)

# Function to evaluate model over all samples in the data loader.
def evaluate_model(model, dl, device='cpu'):
    model.eval()
    predictions = []
    gts = []
    for (xs, ys) in tqdm(dl, desc='Evaluating', leave=False):
        xs = xs.to(device)
        preds = torch.argmax(model(xs), dim=1)
        gts.append(ys)
        predictions.append(preds.detach().cpu().numpy())
        
    # Return accuracy score and classification report.
    return (accuracy_score(np.hstack(gts), np.hstack(predictions)),
            classification_report(np.hstack(gts), np.hstack(predictions), zero_division=0, digits=3))

# Simple function to plot the loss curve and validation accuracy.
def plot_validation_curves(losses_and_accs):
    losses = [x for (x, _) in losses_and_accs]
    accs = [x for (_, x) in losses_and_accs]
    plt.figure(figsize=(16, 8))
    plt.subplot(1, 2, 1)
    plt.plot(losses)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Average Training Loss per Epoch')
    plt.subplot(1, 2, 2)
    plt.plot(accs)
    plt.xlabel('Epoch')
    plt.ylabel('Validation Accuracy')
    plt.title(f'Best Accuracy = {np.max(accs)} @ epoch {np.argmax(accs)}')

#### A basic, parameterized MLP

This is a very basic implementation of a Multilayer Perceptron. Don't waste too much time trying to figure out how it works -- the important detail is that it allows you to pass in a list of input, hidden layer, and output *widths*. **Your** implementation should also support this for the exercises to come.

In [4]:
class MLP(nn.Module):
    def __init__(self, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(nin, nout) for (nin, nout) in zip(layer_sizes[:-1], layer_sizes[1:])])
    
    def forward(self, x):
        return reduce(lambda f, g: lambda x: g(F.relu(f(x))), self.layers, lambda x: x.flatten(1))(x)
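
For comparison, here is a sketch of the same idea (a list of widths turned into a stack of linear layers) written without `reduce`. The class name `ExplicitMLP` is hypothetical. One difference worth knowing: traced through by hand, the `reduce`-based `forward` above also applies a ReLU to the flattened input before the first linear layer, while this sketch applies ReLU only between linear layers.

```python
import torch
import torch.nn as nn

class ExplicitMLP(nn.Module):
    """MLP built from a list of widths: [input, hidden..., output]."""

    def __init__(self, layer_sizes):
        super().__init__()
        layers = [nn.Flatten()]
        pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
        for i, (n_in, n_out) in enumerate(pairs):
            layers.append(nn.Linear(n_in, n_out))
            # No activation after the final (output) layer.
            if i < len(pairs) - 1:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

mlp = ExplicitMLP([28 * 28, 16, 16, 10])
logits = mlp(torch.randn(4, 1, 28, 28))  # one row of 10 logits per image
```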


#### A *very* minimal training pipeline.

Here is some basic training and evaluation code to get you started.

**Important**: I cannot stress enough that this is a **terrible** example of how to implement a training pipeline. You can do better!

In [None]:
# Training hyperparameters.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
epochs = 100
lr = 0.0001
batch_size = 128

# Architecture hyperparameters.
input_size = 28*28
width = 16
depth = 2

# Dataloaders.
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=True, num_workers=4)
dl_val   = torch.utils.data.DataLoader(ds_val, batch_size, num_workers=4)
dl_test  = torch.utils.data.DataLoader(ds_test, batch_size, shuffle=True, num_workers=4)

# Instantiate model and optimizer.
model_mlp = MLP([input_size] + [width]*depth + [10]).to(device)
opt = torch.optim.Adam(params=model_mlp.parameters(), lr=lr)
wandb.init(
    # set the wandb project where this run will be logged
    project="dla_lab",
    
    # track hyperparameters and run metadata
    config={
    "learning_rate": lr,
    "architecture": "MLP",
    "dataset": "MNIST",
    "epochs": epochs,
    }
)

# Training loop.
losses_and_accs = []
for epoch in range(epochs):
    loss = train_epoch(model_mlp, dl_train, opt, epoch, device=device)
    (val_acc, _) = evaluate_model(model_mlp, dl_val, device=device)
    losses_and_accs.append((loss, val_acc))
    wandb.log({"accuracy": val_acc, "loss": loss})

wandb.finish()
# And finally plot the curves.
plot_validation_curves(losses_and_accs)
print(f'Accuracy report on TEST:\n {evaluate_model(model_mlp, dl_test, device=device)[1]}')

### Exercise 1.1: A baseline MLP

Implement a *simple* Multilayer Perceptron to classify the 10 digits of MNIST (e.g. two *narrow* layers). Use my code above as inspiration, but implement your own training pipeline -- you will need it later. Train this model to convergence, monitoring (at least) the loss and accuracy on the training and validation sets for every epoch. Below I include a basic implementation to get you started -- remember that you should write your *own* pipeline!

**Note**: This would be a good time to think about *abstracting* your model definition, and training and evaluation pipelines in order to make it easier to compare performance of different models.

**Important**: Given the *many* runs you will need to do, and the need to *compare* performance between them, this would **also** be a great point to study how **Tensorboard** or **Weights and Biases** can be used for performance monitoring.

#### Model

In [3]:
class MyMLP(nn.Module):

    def __init__(self, widths: list[int], nclasses: int=10) -> None:
        super(MyMLP, self).__init__()
        self.flatten = nn.Flatten()
        self.net = nn.Sequential()
        for width in widths:
            self.net.append(nn.LazyLinear(width))
            self.net.append(nn.ReLU())
        self.output = nn.LazyLinear(nclasses)

    def forward(self, X):
        f = self.flatten(X) # flattened input
        h = self.net(f)
        o = self.output(h)
        return o

#### Evaluation

In [2]:
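def validate(model, loss_fn, validation_dl):
    # Put layers such as dropout or batch norm into inference mode.
    model.eval()
    total_correct = 0
    total_loss = 0.0
    with torch.no_grad():
        for (X, y) in tqdm(validation_dl, desc="Validation", leave=False):
            X, y = X.to(device), y.to(device)
            prediction = model(X)
            total_correct += (prediction.argmax(1) == y).sum().item()
            # Assumes loss_fn returns the batch mean (the default reduction),
            # so weight it by the batch size before averaging over the dataset.
            total_loss += loss_fn(prediction, y).item() * X.size(0)
    n = len(validation_dl.dataset)
    return total_loss / n, total_correct / n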

#### Training

In [3]:
def training(model, loss_fn, optimizer, training_dl, validation_dl, epochs, log=True):
    for epoch in range(epochs):
        # Make sure the model is in training mode at the start of every epoch.
        model.train()
        for (X, y) in tqdm(training_dl, desc=f"Training #{epoch + 1}", leave=True):
            X, y = X.to(device), y.to(device)
            prediction = model(X)
            loss = loss_fn(prediction, y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            avg_loss, avg_acc = validate(model, loss_fn, validation_dl)
            # Sum of per-parameter gradient norms left over from the last batch
            # (the original summed parameter norms under the "grad" key).
            total_norm = sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)
            if log:
                # "loss" and "acc" are measured on the validation set.
                wandb.log({"loss": avg_loss, "acc": avg_acc, "grad": total_norm})
        

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
m = MyMLP([16, 16]).to(device)
wandb.init(
    # set the wandb project where this run will be logged
    project="dla_lab",
    
    # track hyperparameters and run metadata
    config={
    "learning_rate": 0.0001,
    "architecture": "MLP",
    "dataset": "MNIST",
    "epochs": 100,
    }
)
training(m, nn.CrossEntropyLoss(), torch.optim.Adam(params=m.parameters(), lr=0.0001), dl_train, dl_val, 100)
wandb.finish()
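
The run above repeats the learning rate and epoch count in both the `wandb.init` config and the `training` call, which makes it easy for the two to drift apart once several depths are compared. A minimal sketch of keeping them in one place, assuming a small dataclass is acceptable; `RunConfig` and its fields are hypothetical names, not part of the lab:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    width: int = 16
    depth: int = 2
    lr: float = 1e-4
    epochs: int = 100
    dataset: str = "MNIST"

    @property
    def widths(self):
        # Hidden-layer widths in the list form expected by MyMLP.
        return [self.width] * self.depth

    @property
    def name(self):
        # Human-readable run name, useful for telling runs apart in a dashboard.
        return f"mlp-w{self.width}-d{self.depth}-lr{self.lr:g}"

# One configuration per depth to compare.
configs = [RunConfig(depth=d) for d in (2, 8, 32)]
for cfg in configs:
    print(cfg.name, asdict(cfg))
```

Each configuration could then drive a run, for example by passing `name=cfg.name` and `config=asdict(cfg)` to `wandb.init` and building the model with `MyMLP(cfg.widths)`.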

### Exercise 1.2: Rinse and Repeat

Repeat the verification you did above, but with **Convolutional** Neural Networks. If you were careful about abstracting your model and training code, this should be a simple exercise. Show that **deeper** CNNs *without* residual connections do not always work better, whereas **even deeper** ones *with* residual connections do.

**Hint**: You probably should do this exercise using CIFAR10, since MNIST is *very* easy (at least up to about 99% accuracy).

**Spoiler**: If you plan to do optional exercise 3.3, you should think *very* carefully about the architectures of your CNNs here (so you can reuse them!).

#### Model

In [6]:
from typing import Union
# Your code here.
class CNN(nn.Module):
    def __init__(self, output_channels: list[int] = [64, 128, 256], kernel_sizes: list[Union[tuple[int], int]] = [3, 3, 3], 
                 strides: list[Union[tuple[int], int]] = [1, 1, 1], paddings: list[Union[tuple[int], int]] = [1, 1, 1],
                 classifier_head_widths: list[int] = [64, 128, 256], nclasses: int = 10) -> None:
        super(CNN, self).__init__()
        assert len(output_channels) == len(kernel_sizes) == len(paddings) == len(strides)
        self.net = nn.Sequential()
        for i in range(len(kernel_sizes)):
            self.net.append(nn.LazyConv2d(output_channels[i], kernel_size=kernel_sizes[i], padding=paddings[i], stride=strides[i]))
            self.net.append(nn.ReLU())
            self.net.append(nn.MaxPool2d(2, stride=2))
        self.flatten = nn.Flatten()
        self.head = nn.Sequential()
        for width in classifier_head_widths:
            self.head.append(nn.LazyLinear(width))
            self.head.append(nn.ReLU())
        self.head.append(nn.LazyLinear(nclasses))

    def forward(self, X):
        Z = self.flatten(self.net(X)) # encode image into vector
        O = self.head(Z) # classify
        return O
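
The `CNN` class above has no skip connections, so it covers only the "without residual connections" half of the exercise. A minimal sketch of a residual block that could be stacked for the other half, assuming the channel count is kept constant so the identity shortcut needs no projection; the name `ResidualBlock` is hypothetical, and the batch normalization used in the ResNet paper is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection (channels preserved)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        # The block learns a residual that is added to its own input.
        return F.relu(out + x)

block = ResidualBlock(64)
y = block(torch.randn(2, 64, 16, 16))  # same shape as the input
```

Because the input and output shapes match, any number of these blocks can be chained between downsampling stages without changing the rest of the network.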


#### Data

In [4]:
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor

from torch.utils.data import DataLoader

# Load CIFAR10 train and test.
ds_train = CIFAR10(root='./datasets', train=True, download=True, transform=ToTensor())
ds_test = CIFAR10(root='./datasets', train=False, download=True, transform=ToTensor())

# Split train into train and validation.
val_size = 10000
I = np.random.permutation(len(ds_train))
ds_val = Subset(ds_train, I[:val_size])
ds_train = Subset(ds_train, I[val_size:])

bs=32

dl_train = DataLoader(ds_train, batch_size=bs, shuffle=True)
dl_val = DataLoader(ds_val, batch_size=bs, shuffle=True)

Files already downloaded and verified
Files already downloaded and verified


In [10]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CNN().to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
wandb.init(project="DLA_LAB_01")
training(model, loss_fn, optimizer, dl_train, dl_val, 100)
wandb.finish()




#### Deeper Model

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
out_channels = [64, 128, 256, 256, 512]
kernel_sizes = [3] * 5
paddings = [1] * 5
strides = [1] * 5
model = CNN(kernel_sizes=kernel_sizes, output_channels=out_channels, strides=strides, paddings=paddings).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
wandb.init(project="DLA_LAB_01")
training(model, loss_fn, optimizer, dl_train, dl_val, 100)
wandb.finish()






-----
## Exercise 2: Choose at Least One

Below are **three** exercises that ask you to deepen your understanding of Deep Networks for visual recognition. You must choose **at least one** of them for your final submission. Feel free to do **more**, but you must submit at least **one**.

### Exercise 2.1: Explain why Residual Connections are so effective
Use the two models (with and without residual connections) that you developed above to study and **quantify** why the residual versions of the networks learn more effectively.

**Hint**: A good starting point might be looking at the gradient magnitudes passing through the networks during backpropagation.
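
One possible way to follow the hint is to record the gradient norm of every parameter after a backward pass and compare the profiles of the two models layer by layer. A minimal sketch, assuming per-parameter L2 norms are the quantity of interest; the helper `grad_norms` and the toy network are hypothetical:

```python
import torch
import torch.nn as nn

def grad_norms(model):
    """Return {parameter_name: L2 norm of its gradient} for parameters that have one."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters() if p.grad is not None}

# Tiny stand-in network and batch, just to exercise the helper.
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
loss = nn.functional.cross_entropy(net(torch.randn(16, 8)), torch.randint(0, 2, (16,)))
loss.backward()
norms = grad_norms(net)
```

Logging this dictionary once per epoch for both models would show whether the gradients reaching the earliest layers shrink as depth grows.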

In [None]:
# Your code here.

### Exercise 2.2: Fully-convolutionalize a network.
Take one of your trained classifiers and **fully-convolutionalize** it. That is, turn it into a network that can predict classification outputs at *all* pixels in an input image. Can you turn this into a **detector** of handwritten digits? Give it a try.

**Hint 1**: Sometimes the process of fully-convolutionalization is called "network surgery".
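
As an illustration of what such surgery can look like, a linear layer applied to a flattened `(C, H, W)` feature map can be rewritten as a convolution whose kernel covers the whole map. This is a sketch under that assumption; the helper `linear_to_conv` is a hypothetical name:

```python
import torch
import torch.nn as nn

def linear_to_conv(linear, in_channels, height, width):
    """Build a Conv2d that reproduces `linear` applied to a flattened (C, H, W) feature map."""
    conv = nn.Conv2d(in_channels, linear.out_features, kernel_size=(height, width))
    with torch.no_grad():
        # nn.Flatten orders features as (C, H, W), which matches this view.
        conv.weight.copy_(linear.weight.view(linear.out_features, in_channels, height, width))
        conv.bias.copy_(linear.bias)
    return conv

fc = nn.Linear(4 * 3 * 3, 10)
conv = linear_to_conv(fc, in_channels=4, height=3, width=3)
x = torch.randn(2, 4, 3, 3)
dense_out = fc(x.flatten(1))             # shape (2, 10)
conv_out = conv(x)                       # shape (2, 10, 1, 1)
larger = conv(torch.randn(2, 4, 8, 8))   # shape (2, 10, 6, 6): one prediction per window
```

On an input of the original size the two layers agree up to floating-point error, while on a larger input the convolution yields a grid of predictions.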

**Hint 2**: To test your fully-convolutionalized networks you might want to write some functions to take random MNIST samples and embed them into a larger image (i.e. in a regular grid or at random positions).
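
A minimal sketch of the random-position variant, assuming single-channel digits with a background value of 0 (normalized MNIST tensors have a negative background, so the canvas fill would need adjusting); the helper `embed_digits` is hypothetical and does not prevent overlaps:

```python
import torch

def embed_digits(digits, canvas_size=96, seed=0):
    """Paste each (H, W) digit at a random position on a blank square canvas.

    Returns the canvas and the list of (top, left) corners.
    """
    gen = torch.Generator().manual_seed(seed)
    canvas = torch.zeros(canvas_size, canvas_size)
    corners = []
    for d in digits:
        h, w = d.shape
        top = torch.randint(0, canvas_size - h + 1, (1,), generator=gen).item()
        left = torch.randint(0, canvas_size - w + 1, (1,), generator=gen).item()
        # Element-wise maximum keeps earlier digits visible where patches overlap.
        patch = canvas[top:top + h, left:left + w]
        canvas[top:top + h, left:left + w] = torch.maximum(patch, d)
        corners.append((top, left))
    return canvas, corners

fake_digits = [torch.rand(28, 28) for _ in range(3)]  # stand-ins for MNIST samples
canvas, corners = embed_digits(fake_digits)
```

The returned corners can serve as ground-truth locations when checking where the fully-convolutional network responds.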

In [None]:
# Your code here.

### Exercise 2.3: *Explain* the predictions of a CNN

Use the CNN model you trained in Exercise 1.2 and implement [*Class Activation Maps*](http://cnnlocalization.csail.mit.edu/#:~:text=A%20class%20activation%20map%20for,decision%20made%20by%20the%20CNN.):

> B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR'16 (arXiv:1512.04150, 2015).

Use your implementation to demonstrate how your trained CNN *attends* to specific image features to recognize *specific* classes.
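
The core computation of a class activation map is a weighted sum of the last convolutional feature maps, using the classifier weights of the chosen class. The sketch below assumes the network ends with global average pooling followed by a single linear layer, as in the cited paper; a CNN with a flatten-plus-MLP head would need to be adapted first. The function name and the random tensors are placeholders:

```python
import torch

def class_activation_map(features, fc_weight, class_idx):
    """Weighted sum of feature maps for one class.

    features:  (C, H, W) output of the last convolutional layer.
    fc_weight: (n_classes, C) weights of the linear layer after global average pooling.
    """
    return torch.einsum("c,chw->hw", fc_weight[class_idx], features)

features = torch.randn(8, 4, 4)
fc_weight = torch.randn(10, 8)
cam = class_activation_map(features, fc_weight, class_idx=3)  # shape (4, 4)
```

Averaging the map over its spatial positions recovers the class logit (without the bias), which is a handy sanity check; for visualization the map is typically upsampled to the input resolution and overlaid on the image.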

In [None]:
# Your code here.