# Deep Learning Applications: Laboratory #1

In this first laboratory we will work relatively simple architectures to get a feel for working with Deep Models. This notebook is designed to work with PyTorch, but as I said in the introductory lecture: please feel free to use and experiment with whatever tools you like.

**Important Notes**:
1. Be sure to **document** all of your decisions, as well as your intermediate and final results. Make sure your conclusions and analyses are clearly presented. Don't make us dig into your code or walls of printed results to try to draw conclusions from your code.
2. If you use code from someone else (e.g. Github, Stack Overflow, ChatGPT, etc) you **must be transparent about it**. Document your sources and explain how you adapted any partial solutions to creat **your** solution.



## Exercise 1: Warming Up
In this series of exercises I want you to try to duplicate (on a small scale) the results of the ResNet paper:

> [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385), Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, CVPR 2016.

We will do this in steps using a Multilayer Perceptron on MNIST.

Recall that the main message of the ResNet paper is that **deeper** networks do not **guarantee** more reduction in training loss (or in validation accuracy). Below you will incrementally build a sequence of experiments to verify this for an MLP. A few guidelines:

+ I have provided some **starter** code at the beginning. **NONE** of this code should survive in your solutions. Not only is it **very** badly written, it is also written in my functional style that also obfuscates what it's doing (in part to **discourage** your reuse!). It's just to get you *started*.
+ These exercises ask you to compare **multiple** training runs, so it is **really** important that you factor this into your **pipeline**. Using [Tensorboard](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html) is a **very** good idea -- or, even better [Weights and Biases](https://wandb.ai/site).
+ You may work and submit your solutions in **groups of at most two**. Share your ideas with everyone, but the solutions you submit *must be your own*.

First some boilerplate to get you started, then on to the actual exercises!

### Preface: Some code to get you started

What follows is some **very simple** code for training an MLP on MNIST. The point of this code is to get you up and running (and to verify that your Python environment has all needed dependencies).

**Note**: As you read through my code and execute it, this would be a good time to think about *abstracting* **your** model definition, and training and evaluation pipelines in order to make it easier to compare performance of different models.

In [8]:
# Standard libraries
#import copy
#from functools import reduce

# Numerical and plotting
import numpy as np
#import matplotlib.pyplot as plt
from tqdm import tqdm

# PyTorch core
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Subset

# Torchvision
import torchvision
import torchvision.transforms as transforms
import torchvision.transforms as T
from torchvision.datasets import MNIST
import torchvision.models as models

# Scikit-learn
from sklearn.metrics import classification_report, accuracy_score
from sklearn.svm import LinearSVC

# Experiment tracking
import wandb


#### Data preparation

Here is some basic dataset loading, validation splitting code to get you started working with MNIST.

In [4]:
# Standard MNIST transform.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST train and test.
ds_train = MNIST(root='./data', train=True, download=True, transform=transform)
ds_test = MNIST(root='./data', train=False, download=True, transform=transform)

# Split train into train and validation.
val_size = 5000
I = np.random.permutation(len(ds_train))
ds_val = Subset(ds_train, I[:val_size])
ds_train = Subset(ds_train, I[val_size:])

### Exercise 1.1: A baseline MLP

Implement a *simple* Multilayer Perceptron to classify the 10 digits of MNIST (e.g. two *narrow* layers). Use my code above as inspiration, but implement your own training pipeline -- you will need it later. Train this model to convergence, monitoring (at least) the loss and accuracy on the training and validation sets for every epoch. Below I include a basic implementation to get you started -- remember that you should write your *own* pipeline!

**Note**: This would be a good time to think about *abstracting* your model definition, and training and evaluation pipelines in order to make it easier to compare performance of different models.

**Important**: Given the *many* runs you will need to do, and the need to *compare* performance between them, this would **also** be a great point to study how **Tensorboard** or **Weights and Biases** can be used for performance monitoring.

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

batch_size = 128
dl_train = DataLoader(ds_train, batch_size=batch_size, shuffle=True, num_workers=4)
dl_val   = DataLoader(ds_val,   batch_size=batch_size, shuffle=False, num_workers=4)
dl_test  = DataLoader(ds_test,  batch_size=batch_size, shuffle=False, num_workers=4)

In [7]:
class SimpleMLP(nn.Module):
    def __init__(self, input_size=28*28, hidden1=128, hidden2=64, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_size, hidden1),
            nn.ReLU(inplace=True),
            nn.Linear(hidden1, hidden2),
            nn.ReLU(inplace=True),
            nn.Linear(hidden2, num_classes)
        )
    def forward(self, x):
        return self.net(x)

In [2]:
def train_one_epoch(model, dl, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for x, y in tqdm(dl, desc="Batches", leave=False):
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * x.size(0)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += x.size(0)
    return running_loss / total, correct / total

In [3]:
def evaluate(model, dl, device):
    model.eval()
    running_loss = 0.0
    preds_all = []
    gts_all = []
    with torch.no_grad():
        for x, y in tqdm(dl, desc="Batches", leave=False):
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = F.cross_entropy(logits, y, reduction='sum')
            running_loss += loss.item()
            preds_all.append(logits.argmax(dim=1).cpu().numpy())
            gts_all.append(y.cpu().numpy())
    preds_all = np.hstack(preds_all)
    gts_all = np.hstack(gts_all)
    return running_loss / len(gts_all), accuracy_score(gts_all, preds_all), classification_report(gts_all, preds_all, zero_division=0, digits=3)

In [8]:
model = SimpleMLP(hidden1=128, hidden2=64).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
epochs = 20

train_losses, train_accs = [], []
val_losses, val_accs = [], []
best_val_acc = 0.0
best_state = None

wandb.init(project="Lab-1", config={
"epochs": epochs,
"lr": 1e-3,
"batch_size": batch_size,
"model": "SimpleMLP"
})
config = wandb.config

for ep in range(1, epochs+1):
    train_loss, train_acc = train_one_epoch(model, dl_train, optimizer, device)
    val_loss, val_acc, _ = evaluate(model, dl_val, device)
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "train_accuracy": train_acc,
        "val_accuracy": val_acc
    }, step=ep)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pth")
    print(f"Epoch {ep:02d}: train_loss={train_loss:.4f} train_acc={train_acc:.4f} | val_loss={val_loss:.4f} val_acc={val_acc:.4f}")

artifact = wandb.Artifact("simple-mlp", type="model")
artifact.add_file("best_model.pth")
wandb.log_artifact(artifact)
best_state = model.state_dict()

[34m[1mwandb[0m: Currently logged in as: [33mmatteo-piras[0m ([33mmatteo-piras-universit-di-firenze[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch 01: train_loss=0.3480 train_acc=0.9004 | val_loss=0.2000 val_acc=0.9396
Epoch 02: train_loss=0.1462 train_acc=0.9564 | val_loss=0.1458 val_acc=0.9568
Epoch 03: train_loss=0.1013 train_acc=0.9692 | val_loss=0.1141 val_acc=0.9650
Epoch 04: train_loss=0.0752 train_acc=0.9769 | val_loss=0.1165 val_acc=0.9658
Epoch 05: train_loss=0.0584 train_acc=0.9820 | val_loss=0.0939 val_acc=0.9734
Epoch 06: train_loss=0.0489 train_acc=0.9845 | val_loss=0.0925 val_acc=0.9742
Epoch 07: train_loss=0.0378 train_acc=0.9877 | val_loss=0.1183 val_acc=0.9670
Epoch 08: train_loss=0.0323 train_acc=0.9892 | val_loss=0.1053 val_acc=0.9698
Epoch 09: train_loss=0.0264 train_acc=0.9911 | val_loss=0.1076 val_acc=0.9726
Epoch 10: train_loss=0.0225 train_acc=0.9926 | val_loss=0.1141 val_acc=0.9722
Epoch 11: train_loss=0.0197 train_acc=0.9935 | val_loss=0.1123 val_acc=0.9742
Epoch 12: train_loss=0.0176 train_acc=0.9942 | val_loss=0.1163 val_acc=0.9746
Epoch 13: train_loss=0.0169 train_acc=0.9944 | val_loss=0.1235 v

In [9]:
if best_state is not None:
    model.load_state_dict(best_state)


# Final test evaluation
test_loss, test_acc, test_report = evaluate(model, dl_test, device)
wandb.log({"test_loss": test_loss, "test_acc": test_acc})
wandb.log({"classification_report": str(test_report)})
wandb.finish()
print(f"Test loss: {test_loss:.4f}  Test acc: {test_acc:.4f}")
print("Classification report on TEST:\n", test_report)

0,1
test_acc,▁
test_loss,▁
train_accuracy,▁▅▆▇▇▇▇▇████████████
train_loss,█▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
val_accuracy,▁▄▆▆██▆▇▇▇████▇▇████
val_loss,█▄▂▃▁▁▃▂▂▂▂▃▃▄▄▄▄▄▅▅

0,1
classification_report,precis...
test_acc,0.978
test_loss,0.1148
train_accuracy,0.99678
train_loss,0.00903
val_accuracy,0.9734
val_loss,0.15369


Test loss: 0.1148  Test acc: 0.9780
Classification report on TEST:
               precision    recall  f1-score   support

           0      0.984     0.988     0.986       980
           1      0.989     0.994     0.992      1135
           2      0.962     0.988     0.975      1032
           3      0.967     0.983     0.975      1010
           4      0.970     0.986     0.978       982
           5      0.978     0.964     0.971       892
           6      0.980     0.974     0.977       958
           7      0.982     0.975     0.979      1028
           8      0.985     0.959     0.972       974
           9      0.982     0.965     0.974      1009

    accuracy                          0.978     10000
   macro avg      0.978     0.978     0.978     10000
weighted avg      0.978     0.978     0.978     10000



### Exercise 1.2: Adding Residual Connections

Implement a variant of your parameterized MLP network to support **residual** connections. Your network should be defined as a composition of **residual MLP** blocks that have one or more linear layers and add a skip connection from the block input to the output of the final linear layer.

**Compare** the performance (in training/validation loss and test accuracy) of your MLP and ResidualMLP for a range of depths. Verify that deeper networks **with** residual connections are easier to train than a network of the same depth **without** residual connections.

**For extra style points**: See if you can explain by analyzing the gradient magnitudes on a single training batch *why* this is the case. 

#### General Model Overview

The `GeneralModel` class implements a flexible neural network architecture composed of:

- **A sequence of user-defined blocks** (e.g., MLP layers, convolutional layers, residual blocks).  
- **A final classifier**, typically a linear layer.

The model includes two flattening mechanisms:

- **Optional input flattening** for fully connected architectures.  
- **Automatic output flattening** when the last block returns multi-dimensional tensors.

This design allows building MLPs, CNNs, or hybrid architectures without manual shape management.
  

In [5]:
class GeneralModel(nn.Module):
    def __init__(self, blocks, classifier, flatten_input=False):
        super().__init__()

        # Optional flattening before blocks (for MLPs)
        self.input_flatten = nn.Flatten() if flatten_input else nn.Identity()

        self.blocks = nn.Sequential(*blocks)
        self.classifier = classifier

    def forward(self, x):
        # Optional flatten before blocks
        x = self.input_flatten(x)

        # Apply blocks
        x = self.blocks(x)

        # Automatically flatten before classifier if needed
        if x.ndim > 2:
            x = x.view(x.size(0), -1)

        return self.classifier(x)

#### Residual MLP Block Overview

This block is a **flexible MLP version of a ResNet block**:

- Supports **one or multiple linear layers**.
- Handles **dimension changes** via a **projection** in the skip connection.
- During the construction of the `Sequential` block, it **skips the ReLU after the last layer**:
  - Ensures the **final ReLU is applied after adding the residual connection**, consistent with standard ResNet design.
- Allows **disabling the skip connection**, producing an **identical architecture without residuals** for clean comparisons.



In [9]:
#Code genrated by AI
class MLPBlock(nn.Module):
    def __init__(self, dim_in, dim_out, hidden_layers=2, use_skip=True):
        super().__init__()
        self.use_skip = use_skip

        # Skip connection projection (if needed)
        if use_skip:
            self.proj = nn.Linear(dim_in, dim_out) if dim_in != dim_out else nn.Identity()

        # Build MLP layers
        layers = [nn.Linear(dim_in, dim_out)]
        for _ in range(hidden_layers - 2):
            layers.append(nn.Linear(dim_out, dim_out))
        if hidden_layers > 1:
            layers.append(nn.Linear(dim_out, dim_out))

        self.net = nn.Sequential(*[
            nn.Sequential(l, nn.ReLU(inplace=True)) if i < hidden_layers - 1 else l
            for i, l in enumerate(layers)
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.net(x)
        if self.use_skip:
            skip = self.proj(x)
            return self.relu(out + skip)
        else:
            return self.relu(out)

As a personal reminder, this code:  
```python
self.net = nn.Sequential(*[
    nn.Sequential(l, nn.ReLU(inplace=True)) if i < hidden_layers - 1 else l
    for i, l in enumerate(layers)
])
```
is the compact version of this code:
```python
layers_with_activation = []

# Loop through all layers
for i, layer in enumerate(layers):

    # If this is NOT the last layer → add ReLU after it
    if i < hidden_layers - 1:
        block = nn.Sequential(
            layer,
            nn.ReLU(inplace=True)
        )
    else:
        # Last layer: no activation
        block = layer

    layers_with_activation.append(block)

# Build nn.Sequential using unpacking
self.net = nn.Sequential(*layers_with_activation)
```

#### Comparing three different configurations 2, 4 and 8 residual vs non-residual blocks

In [10]:

config = {
    "batch_size": 128,
    "lr": 0.01,
    "epochs": 10,
    "depths": [[784, 128, 128, 10],[784, 128, 128, 128, 128, 10],[784, 128, 128, 128 , 128, 128, 128, 128 , 128, 10] ],  # example depths
    "block_hidden_layers": 2,
    "use_skip_options": [False, True]
}

# Dataloaders.
train_loader = torch.utils.data.DataLoader(ds_train, config["batch_size"], shuffle=True, num_workers=4)
val_loader   = torch.utils.data.DataLoader(ds_val, config["batch_size"], num_workers=4)
dl_test  = DataLoader(ds_test,  batch_size=config["batch_size"], shuffle=False, num_workers=4)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
for depth in config["depths"]:
    for use_skip in config["use_skip_options"]:
        run_name = f"depth_{len(depth)-2}_skip_{use_skip}"
        print(f"\n=== Starting experiment: {run_name} ===")

        # Initialize a new wandb run, assign to a group
        wandb.init(
            project="Lab-1",
            name=run_name,
            group="mlp_depth_residual_comparison",  # all runs belong to this group
            config={
                **config,
                "layer_sizes": depth,
                "use_skip": use_skip
            },
            reinit=True   # allows multiple runs in the same notebook
        )

        # ---------------------------
        # Build model using GeneralModel
        # ---------------------------
        blocks = []
        for nin, nout in zip(depth[:-2], depth[1:-1]):
            blocks.append(
                MLPBlock(
                    nin,
                    nout,
                    hidden_layers=config["block_hidden_layers"],
                    use_skip=use_skip
                )
            )

        classifier = nn.Linear(depth[-2], depth[-1])

        model = GeneralModel(
            blocks=blocks,
            classifier=classifier,
            flatten_input=True  # important for MLPs
        ).to(device)
        

        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"], momentum=0.9)

        # Training loop
        for epoch in range(config["epochs"]):
            train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, device)
            val_loss, val_acc, val_report = evaluate(model, val_loader, device)

            # -------------------------------
            # Gradient norms analysis
            # -------------------------------
            # Take a single batch for gradient check
            x_batch, y_batch = next(iter(train_loader))
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            model.zero_grad()
            logits = model(x_batch)
            loss = F.cross_entropy(logits, y_batch)
            loss.backward()
            grad_norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]

            # First layer gradient norm (usually the first Linear layer)
            first_layer_grad_norm = None
            for p in model.parameters():
                if p.grad is not None:
                    first_layer_grad_norm = p.grad.norm().item()
                    break  # take only the first parameter's grad

            wandb.log({
                "train_loss": train_loss,
                "val_loss": val_loss,
                "train_accuracy": train_acc,
                "val_accuracy": val_acc,
                "grad_norm_mean": np.mean(grad_norms),
                "grad_norm_max": np.max(grad_norms),
                "grad_norm_min": np.min(grad_norms),
                "grad_first_layer": first_layer_grad_norm
                }, step=epoch)

            print(f"[{run_name}] Epoch {epoch}")
                  #f"train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, "
                  #f"val_loss={val_loss:.4f}, val_acc={val_acc:.4f}")
        
        # Final test evaluation
        test_loss, test_acc, test_report = evaluate(model, dl_test, device)
        wandb.log({"test_loss": test_loss, "test_acc": test_acc})
        wandb.log({"classification_report": str(test_report)})
        # Finish this run
        wandb.finish()



=== Starting experiment: depth_2_skip_False ===


[34m[1mwandb[0m: Currently logged in as: [33mmatteo-piras[0m ([33mmatteo-piras-universit-di-firenze[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[depth_2_skip_False] Epoch 0: 
[depth_2_skip_False] Epoch 1: 
[depth_2_skip_False] Epoch 2: 
[depth_2_skip_False] Epoch 3: 
[depth_2_skip_False] Epoch 4: 
[depth_2_skip_False] Epoch 5: 
[depth_2_skip_False] Epoch 6: 
[depth_2_skip_False] Epoch 7: 
[depth_2_skip_False] Epoch 8: 
[depth_2_skip_False] Epoch 9: 


0,1
grad_first_layer,█▄▅▂▄▇▁▅▆▁
grad_norm_max,█▄▅▂▄▇▁▅▆▁
grad_norm_mean,█▄▃▁▃▅▁▄▄▁
grad_norm_min,█▃▂▁▂▄▁▂▃▁
test_acc,▁
test_loss,▁
train_accuracy,▁▇▇▇██████
train_loss,█▂▂▂▁▁▁▁▁▁
val_accuracy,▁▅▆▇██████
val_loss,█▄▃▂▂▁▁▁▁▁

0,1
classification_report,precis...
grad_first_layer,0.28019
grad_norm_max,0.28019
grad_norm_mean,0.06919
grad_norm_min,0.00342
test_acc,0.9741
test_loss,0.09587
train_accuracy,0.9932
train_loss,0.02297
val_accuracy,0.9774



=== Starting experiment: depth_2_skip_True ===


[depth_2_skip_True] Epoch 0: 
[depth_2_skip_True] Epoch 1: 
[depth_2_skip_True] Epoch 2: 
[depth_2_skip_True] Epoch 3: 
[depth_2_skip_True] Epoch 4: 
[depth_2_skip_True] Epoch 5: 
[depth_2_skip_True] Epoch 6: 
[depth_2_skip_True] Epoch 7: 
[depth_2_skip_True] Epoch 8: 
[depth_2_skip_True] Epoch 9: 


0,1
grad_first_layer,▄▄▅█▁▅▄▃▃▅
grad_norm_max,▄▄▅█▁▅▄▃▃▅
grad_norm_mean,▄▄▅█▁▅▄▃▃▄
grad_norm_min,▄▄▅█▁▆▅▃▃▄
test_acc,▁
test_loss,▁
train_accuracy,▁▅▆▇▇▇████
train_loss,█▄▃▂▂▂▁▁▁▁
val_accuracy,▁▄▆▇▇▇████
val_loss,█▄▃▂▂▂▁▁▁▁

0,1
classification_report,precis...
grad_first_layer,0.64166
grad_norm_max,0.64166
grad_norm_mean,0.17522
grad_norm_min,0.01075
test_acc,0.979
test_loss,0.07208
train_accuracy,0.99522
train_loss,0.01707
val_accuracy,0.9816



=== Starting experiment: depth_4_skip_False ===


[depth_4_skip_False] Epoch 0: 
[depth_4_skip_False] Epoch 1: 
[depth_4_skip_False] Epoch 2: 
[depth_4_skip_False] Epoch 3: 
[depth_4_skip_False] Epoch 4: 
[depth_4_skip_False] Epoch 5: 
[depth_4_skip_False] Epoch 6: 
[depth_4_skip_False] Epoch 7: 
[depth_4_skip_False] Epoch 8: 
[depth_4_skip_False] Epoch 9: 


0,1
grad_first_layer,▁▂▆█▃▅▆▄▃▇
grad_norm_max,▁▃▅█▃▅▆▄▃▇
grad_norm_mean,▁▃▄█▃▅▅▄▂▇
grad_norm_min,▁▆▄█▄▆▅▄▂▅
test_acc,▁
test_loss,▁
train_accuracy,▁▁▆███████
train_loss,██▄▂▁▁▁▁▁▁
val_accuracy,▁▂████████
val_loss,█▇▂▁▁▁▁▁▁▁

0,1
classification_report,precis...
grad_first_layer,1.71414
grad_norm_max,1.71414
grad_norm_mean,0.38258
grad_norm_min,0.01744
test_acc,0.9578
test_loss,0.15878
train_accuracy,0.98376
train_loss,0.05679
val_accuracy,0.971



=== Starting experiment: depth_4_skip_True ===


[depth_4_skip_True] Epoch 0: 
[depth_4_skip_True] Epoch 1: 
[depth_4_skip_True] Epoch 2: 
[depth_4_skip_True] Epoch 3: 
[depth_4_skip_True] Epoch 4: 
[depth_4_skip_True] Epoch 5: 
[depth_4_skip_True] Epoch 6: 
[depth_4_skip_True] Epoch 7: 
[depth_4_skip_True] Epoch 8: 
[depth_4_skip_True] Epoch 9: 


0,1
grad_first_layer,▅▄▄▆█▂▂▃▂▁
grad_norm_max,▅▄▄▆█▂▂▃▂▁
grad_norm_mean,▄▅▄▆█▂▂▃▃▁
grad_norm_min,▃▄▃▆█▁▂▂▂▁
test_acc,▁
test_loss,▁
train_accuracy,▁▅▆▇▇▇████
train_loss,█▃▃▂▂▂▁▁▁▁
val_accuracy,▁▄▆▆█▇██▇█
val_loss,█▅▃▃▂▂▁▁▁▂

0,1
classification_report,precis...
grad_first_layer,0.3685
grad_norm_max,0.3685
grad_norm_mean,0.0787
grad_norm_min,0.00617
test_acc,0.9777
test_loss,0.07709
train_accuracy,0.99556
train_loss,0.01441
val_accuracy,0.9802



=== Starting experiment: depth_8_skip_False ===


[depth_8_skip_False] Epoch 0: 
[depth_8_skip_False] Epoch 1: 
[depth_8_skip_False] Epoch 2: 
[depth_8_skip_False] Epoch 3: 
[depth_8_skip_False] Epoch 4: 
[depth_8_skip_False] Epoch 5: 
[depth_8_skip_False] Epoch 6: 
[depth_8_skip_False] Epoch 7: 
[depth_8_skip_False] Epoch 8: 
[depth_8_skip_False] Epoch 9: 


0,1
grad_first_layer,▃▄▃▁▄▅█▇▃▁
grad_norm_max,▅▄▁▂▃▆▅█▃▁
grad_norm_mean,▄▅▁▂▃▆▅█▃▁
grad_norm_min,▄▃▂▁▄▆▇█▃▁
test_acc,▁
test_loss,▁
train_accuracy,▁█████████
train_loss,█▃▁▂▃▁▃▂▃▂
val_accuracy,▁▁▁▁▁▁▁▁▁▁
val_loss,▄▄▅▄▁█▃▆▂▃

0,1
classification_report,precis...
grad_first_layer,0.0
grad_norm_max,0.05391
grad_norm_mean,0.0032
grad_norm_min,0.0
test_acc,0.1135
test_loss,2.30113
train_accuracy,0.11193
train_loss,2.3015
val_accuracy,0.1172



=== Starting experiment: depth_8_skip_True ===


[depth_8_skip_True] Epoch 0: 
[depth_8_skip_True] Epoch 1: 
[depth_8_skip_True] Epoch 2: 
[depth_8_skip_True] Epoch 3: 
[depth_8_skip_True] Epoch 4: 
[depth_8_skip_True] Epoch 5: 
[depth_8_skip_True] Epoch 6: 
[depth_8_skip_True] Epoch 7: 
[depth_8_skip_True] Epoch 8: 
[depth_8_skip_True] Epoch 9: 


0,1
grad_first_layer,▄██▇▁▆▅▂▅▁
grad_norm_max,▄██▇▁▆▅▂▅▁
grad_norm_mean,▄██▆▁▅▅▂▅▁
grad_norm_min,▄█▆▅▁▅▄▂▄▁
test_acc,▁
test_loss,▁
train_accuracy,▁▆▆▇▇▇████
train_loss,█▃▃▂▂▁▁▁▁▁
val_accuracy,▁▁▅▇▇▆██▅█
val_loss,█▇▃▂▂▃▁▁▄▂

0,1
classification_report,precis...
grad_first_layer,0.04275
grad_norm_max,0.04275
grad_norm_mean,0.00655
grad_norm_min,0.00046
test_acc,0.9789
test_loss,0.08645
train_accuracy,0.99573
train_loss,0.01391
val_accuracy,0.9798


#### Results
For MNIST, a relatively simple dataset, shallow architectures (2 and 4 MLP blocks) achieve similarly good performance. In these cases, vanishing gradients are not noticeable, as confirmed by the gradient norms logged in WandB. However, architectures without skip connections still show slower convergence, suggesting that either mild gradient attenuation occurs or that skip connections provide additional architectural flexibility by allowing the network to partially bypass layers that are not contributing useful features.

The impact of skip connections becomes dramatic in the deeper 8-block architecture. Without skip connections, the network struggles to learn, likely due to severe vanishing gradients. Introducing skip connections enables effective gradient flow and allows the model to achieve good performance, demonstrating the importance of residual connections in facilitating training in deeper networks.

### Exercise 1.3: Rinse and Repeat (but with a CNN)

Repeat the verification you did above, but with **Convolutional** Neural Networks. If you were careful about abstracting your model and training code, this should be a simple exercise. Show that **deeper** CNNs *without* residual connections do not always work better and **even deeper** ones *with* residual connections.

**Hint**: You probably should do this exercise using CIFAR-10, since MNIST is *very* easy (at least up to about 99% accuracy).

**Tip**: Feel free to reuse the ResNet building blocks defined in `torchvision.models.resnet` (e.g. [BasicBlock](https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py#L59) which handles the cascade of 3x3 convolutions, skip connections, and optional downsampling). This is an excellent exercise in code diving. 

**Spoiler**: Depending on the optional exercises you plan to do below, you should think *very* carefully about the architectures of your CNNs here (so you can reuse them!).

#### Non-Residual BasicBlock
Since I'm going to use BasicBlock from trochvison.model.resnet for this experiment I define a Carbon copy of the BasicBlock class that dosen't have the residual connection (since the residual connection is not present the optional downsampling layer is not needed as well)

In [6]:
from typing import Optional, Callable

def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)

class NonResidualBasicBlock(nn.Module):
    expansion: int = 1

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError("BasicBlock only supports groups=1 and base_width=64")
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        
        # Two convolutions (same as BasicBlock)
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)  # just a final ReLU, no skip connection

        return out

#### Data preparation
I'm going to use CIFAR10 dataset for this experiment, and I will devide it into train and validation dataset with a 90/10 split ratio

In [None]:

config = {
    "batch_size": 128,
    "lr": 0.1,
    "epochs": 10,
}

# Transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
])

# Download CIFAR-10 dataset
full_train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                  download=True, transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                            download=True, transform=transform)

# Split train into train + validation
val_size = int(0.1 *len(full_train_dataset))
train_size = len(full_train_dataset) - val_size
train_dataset, val_dataset = torch.utils.data.random_split(full_train_dataset, [train_size, val_size])

# DataLoaders
batch_size = 128
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=config["batch_size"], shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=config["batch_size"], shuffle=False, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=config["batch_size"], shuffle=False, num_workers=4)

print(f"Train size: {len(train_dataset)}, Val size: {len(val_dataset)}, Test size: {len(test_dataset)}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



Train size: 45000, Val size: 5000, Test size: 10000


### Models definition
For this experiment I'm going to define and use 8 different model architectures four of them will use residual connections the other four will be a copy of the first four but will NOT use residual connections

#### Residual Models
I'm going to define four residual models with increasing depth

In [None]:
from torchvision.models.resnet import BasicBlock

# --- 4-block residual ---
res_4_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3, 64, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU()),
        BasicBlock(64, 64),
        BasicBlock(64, 64),
        BasicBlock(64, 128, stride=2, downsample=nn.Sequential(nn.Conv2d(64,128,1,2,bias=False), nn.BatchNorm2d(128))),
        BasicBlock(128, 128),
    ],
    classifier=nn.Linear(128*16*16, 10),
    flatten_input=False
)

# --- 8-block residual ---
res_8_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3, 64, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU()),
        BasicBlock(64, 64), BasicBlock(64, 64), BasicBlock(64, 64), BasicBlock(64, 64),
        BasicBlock(64, 128, stride=2, downsample=nn.Sequential(nn.Conv2d(64,128,1,2,bias=False), nn.BatchNorm2d(128))),
        BasicBlock(128, 128), BasicBlock(128, 128), BasicBlock(128, 128),
    ],
    classifier=nn.Linear(128*16*16, 10),
    flatten_input=False
)

res_16_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),

        # First 8 blocks, 64 channels
        BasicBlock(64,64), BasicBlock(64,64), BasicBlock(64,64), BasicBlock(64,64),
        BasicBlock(64,64), BasicBlock(64,64), BasicBlock(64,64), BasicBlock(64,64),

        # Next 8 blocks, 128 channels, downsample at the first of these
        BasicBlock(64,128,stride=2, downsample=nn.Sequential(nn.Conv2d(64,128,1,2,bias=False), nn.BatchNorm2d(128))),
        BasicBlock(128,128), BasicBlock(128,128), BasicBlock(128,128),
        BasicBlock(128,128), BasicBlock(128,128), BasicBlock(128,128), BasicBlock(128,128),
    ],
    classifier=nn.Linear(128*16*16,10),
    flatten_input=False
)

res_32_blocks_64_128_256 = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),

        # 16 blocks, 64 channels
        *[BasicBlock(64,64) for _ in range(16)],

        # 8 blocks, 128 channels, downsample at first
        BasicBlock(64,128,stride=2, downsample=nn.Sequential(nn.Conv2d(64,128,1,2,bias=False), nn.BatchNorm2d(128))),
        *[BasicBlock(128,128) for _ in range(7)],

        # 8 blocks, 256 channels, downsample at first
        BasicBlock(128,256,stride=2, downsample=nn.Sequential(nn.Conv2d(128,256,1,2,bias=False), nn.BatchNorm2d(256))),
        *[BasicBlock(256,256) for _ in range(7)],
    ],
    classifier=nn.Linear(256*8*8,10),  # final spatial size 8x8
    flatten_input=False
)



#### Non-Residual Models
These models are analouge to the previus but without residual connections

In [20]:
# --- 2-block non-residual ---
nores_2_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),
        NonResidualBasicBlock(64, 64),
        NonResidualBasicBlock(64, 128, stride=2),
    ],
    classifier=nn.Linear(128*16*16, 10),
    flatten_input=False
)

# --- 4-block non-residual ---
nores_4_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),
        NonResidualBasicBlock(64, 64),
        NonResidualBasicBlock(64, 64),
        NonResidualBasicBlock(64, 128, stride=2),
        NonResidualBasicBlock(128, 128),
    ],
    classifier=nn.Linear(128*16*16, 10),
    flatten_input=False
)

# --- 8-block non-residual ---
nores_8_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),
        NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64),
        NonResidualBasicBlock(64,128,stride=2), NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128)
    ],
    classifier=nn.Linear(128*16*16, 10),
    flatten_input=False
)

nores_16_blocks = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),

        # First 8 blocks, 64 channels
        NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64),
        NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64), NonResidualBasicBlock(64,64),

        # Next 8 blocks, 128 channels, downsample at first of these
        NonResidualBasicBlock(64,128,stride=2), NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128),
        NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128), NonResidualBasicBlock(128,128),
    ],
    classifier=nn.Linear(128*16*16,10),
    flatten_input=False
)

nores_32_blocks_64_128_256 = GeneralModel(
    blocks=[
        nn.Sequential(nn.Conv2d(3,64,1,1,bias=False), nn.BatchNorm2d(64), nn.ReLU()),

        # 16 blocks, 64 channels
        *[NonResidualBasicBlock(64,64) for _ in range(16)],

        # 8 blocks, 128 channels, downsample at first
        NonResidualBasicBlock(64,128,stride=2),
        *[NonResidualBasicBlock(128,128) for _ in range(7)],

        # 8 blocks, 256 channels, downsample at first
        NonResidualBasicBlock(128,256,stride=2),
        *[NonResidualBasicBlock(256,256) for _ in range(7)],
    ],
    classifier=nn.Linear(256*8*8,10),  # final spatial size 8x8
    flatten_input=False
)


In [None]:
models = [
    ("Res_4", res_4_blocks),
    ("Res_8", res_8_blocks),
    ("Res_16", res_16_blocks),
    ("Res_32_64_128_256", res_32_blocks_64_128_256),
    ("NoRes_4", nores_4_blocks),
    ("NoRes_8", nores_8_blocks),
    ("NoRes_16", nores_16_blocks),
    ("NoRes_32_64_128_256", nores_32_blocks_64_128_256),
]

#### Training, Testing and Logging of the eight models 

In [None]:
results = {}

for name, model in models:
    print(f"\n=== Training model: {name} ===")

    wandb.init(
            project="Lab-1",
            name=f"cnn_{name}",
            group="cnn_residual_comparison_exp2",  # all runs belong to this group
            config={
                **config,
                "BasicBlock_depth": model.blocks.__len__()-1,
                "optimizer": "Adam",
                "lr": 0.001,
            },
            reinit=True   # allows multiple runs in the same notebook
        )

    model = model.to(device)
    #optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"], momentum=0.9, weight_decay=5e-4)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

    for epoch in range(config["epochs"]):
        train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, device)
        val_loss, val_acc, _ = evaluate(model, val_loader, device)

        # -------------------------------
        # Gradient norms analysis
        # -------------------------------
        # Take a single batch for gradient check
        x_batch, y_batch = next(iter(train_loader))
        x_batch, y_batch = x_batch.to(device), y_batch.to(device)
        model.zero_grad()
        logits = model(x_batch)
        loss = F.cross_entropy(logits, y_batch)
        loss.backward()
        grad_norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]

        # First layer gradient norm (usually the first Linear layer)
        first_layer_grad_norm = None
        for p in model.parameters():
            if p.grad is not None:
                first_layer_grad_norm = p.grad.norm().item()
                break  # take only the first parameter's grad

        wandb.log({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "train_accuracy": train_acc,
            "val_accuracy": val_acc,
            "grad_norm_mean": np.mean(grad_norms),
            "grad_norm_max": np.max(grad_norms),
            "grad_norm_min": np.min(grad_norms),
            "grad_first_layer": first_layer_grad_norm
            }, step=epoch)
    
    # Final test evaluation
    test_loss, test_acc, test_report = evaluate(model, test_loader, device)
    wandb.log({"test_loss": test_loss, "test_acc": test_acc})
    wandb.log({"classification_report": str(test_report)})
    # Finish this run
    wandb.finish()

# -------------------------------
# Final summary
# -------------------------------
print("\n=== Summary ===")
for name, res in results.items():
    print(f"{name}: test_loss={res['test_loss']:.4f}, test_acc={res['test_acc']:.4f}")
            


=== Training model: Res_16 ===


                                                          

[Epoch 0]


                                                          

[Epoch 1]


                                                          

[Epoch 2]


                                                          

[Epoch 3]


                                                          

[Epoch 4]


                                                          

[Epoch 5]


                                                          

[Epoch 6]


                                                          

[Epoch 7]


                                                          

[Epoch 8]


                                                          

[Epoch 9]


                                                        

0,1
grad_first_layer,▃▃█▃▂▂▂▁▁▂
grad_norm_max,█▄▇▂▁▂▁▁▁▂
grad_norm_mean,█▇█▄▂▂▂▂▁▃
grad_norm_min,▁▆█▄▂▃▃▃▃▇
test_acc,▁
test_loss,▁
train_accuracy,▁▃▄▅▆▇▇▇██
train_loss,█▄▃▃▂▂▂▁▁▁
val_accuracy,▁▄▆▇██████
val_loss,█▅▄▁▁▁▁▁▂▃

0,1
classification_report,precis...
grad_first_layer,0.47147
grad_norm_max,1.91021
grad_norm_mean,0.21116
grad_norm_min,0.01076
test_acc,0.7538
test_loss,0.94612
train_accuracy,0.94329
train_loss,0.15984
val_accuracy,0.7536



=== Training model: Res_32_64_128_256 ===


                                                          

[Epoch 0]


                                                          

[Epoch 1]


                                                          

[Epoch 2]


                                                          

[Epoch 3]


                                                          

[Epoch 4]


                                                          

[Epoch 5]


                                                          

[Epoch 6]


                                                          

[Epoch 7]


                                                          

[Epoch 8]


                                                          

[Epoch 9]


                                                        

0,1
grad_first_layer,▁█▁▂▁▁▁▁▁▁
grad_norm_max,▁█▁▅▁▁▁▁▁▁
grad_norm_mean,▁█▁▃▁▁▁▁▁▁
grad_norm_min,▃█▁▄▁▂▁▁▂▁
test_acc,▁
test_loss,▁
train_accuracy,▁▄▆▆▇▇█▇▇█
train_loss,█▅▃▃▂▂▁▂▂▁
val_accuracy,▇▃█▁▇▇█▇▇█
val_loss,▁█▁▃▁▁▁▁▁▁

0,1
classification_report,precis...
grad_first_layer,0.31276
grad_norm_max,0.74001
grad_norm_mean,0.05331
grad_norm_min,0.00129
test_acc,0.7837
test_loss,1.01633
train_accuracy,0.97871
train_loss,0.06198
val_accuracy,0.8052



=== Training model: NoRes_16 ===


                                                          

[Epoch 0]


                                                          

[Epoch 1]


                                                          

[Epoch 2]


                                                          

[Epoch 3]


                                                          

[Epoch 4]


                                                          

[Epoch 5]


                                                          

[Epoch 6]


                                                          

[Epoch 7]


                                                          

[Epoch 8]


                                                          

[Epoch 9]


                                                        

0,1
grad_first_layer,▂▁▂▃█▃▂▆▂▇
grad_norm_max,▄▃▃▃█▃▂▅▁█
grad_norm_mean,▅▁▃▂█▂▁▅▁▇
grad_norm_min,▂▁▄▂█▄▁█▃█
test_acc,▁
test_loss,▁
train_accuracy,▁▃▄▅▆▆▇███
train_loss,█▅▄▄▃▂▂▁▁▁
val_accuracy,▁▃▄▅▅▆▇▇█▇
val_loss,█▆▅▄▄▃▂▃▁▄

0,1
classification_report,precis...
grad_first_layer,1.43009
grad_norm_max,3.79934
grad_norm_mean,0.80462
grad_norm_min,0.11739
test_acc,0.5865
test_loss,1.32164
train_accuracy,0.64511
train_loss,0.99864
val_accuracy,0.593



=== Training model: NoRes_32_64_128_256 ===


                                                          

[Epoch 0]


                                                          

[Epoch 1]


                                                          

[Epoch 2]


                                                          

[Epoch 3]


                                                          

[Epoch 4]


                                                          

[Epoch 5]


                                                          

[Epoch 6]


                                                          

[Epoch 7]


                                                          

[Epoch 8]


                                                          

[Epoch 9]


                                                        

0,1
grad_first_layer,▂▁▂█▂▂▁▁▂▂
grad_norm_max,▂▁▃█▂▂▁▂▂▂
grad_norm_mean,▂▁▂█▃▃▂▁▁▃
grad_norm_min,▃▁▃█▄▄▂▁▂▄
test_acc,▁
test_loss,▁
train_accuracy,▁▂▃▃▄▅▆▆▇█
train_loss,█▆▆▅▄▄▃▂▂▁
val_accuracy,▁▅▄▁▅▄▆██▇
val_loss,▅▃▃█▂▄▂▁▁▁

0,1
classification_report,precis...
grad_first_layer,0.72478
grad_norm_max,0.88451
grad_norm_mean,0.31876
grad_norm_min,0.08772
test_acc,0.4205
test_loss,1.56177
train_accuracy,0.44904
train_loss,1.47316
val_accuracy,0.4272



=== Summary ===


#### Results
Deeper residual models yeld higher performances in train set, this still apply but to a lesser extent in validation and test, this can be attributed to overfitting in fact we can see a performance discrepancy in train and test performances especially for deeper models where we see around 15% accuracy loss, usually this problem is mitigated or outright solved by data augmentation that notably I did not use here beeing the focus of the experiment more on seeing the effect of not using residual connection more than obtaining the best performances in testing in fact we can see that non residual models inversly show worse performances the deeper the model is, looking at the gradient norms, especially at the mean and the min I do notice that the norms of the models with no residual connections tend to be higher than the norms of the models with the residual connection suggesting that exploding gradient is occurring.

-----
## Exercise 2: Choose at Least One

Below are **three** exercises that ask you to deepen your understanding of Deep Networks for visual recognition. You must choose **at least one** of the below for your final submission -- feel free to do **more**, but at least **ONE** you must submit. Each exercise is designed to require you to dig your hands **deep** into the guts of your models in order to do new and interesting things.

**Note**: These exercises are designed to use your small, custom CNNs and small datasets. This is to keep training times reasonable. If you have a decent GPU, feel free to use pretrained ResNets and larger datasets (e.g. the [Imagenette](https://pytorch.org/vision/0.20/generated/torchvision.datasets.Imagenette.html#torchvision.datasets.Imagenette) dataset at 160px).

### Exercise 2.1: *Fine-tune* a pre-trained model
Train one of your residual CNN models from Exercise 1.3 on CIFAR-10. Then:
1. Use the pre-trained model as a **feature extractor** (i.e. to extract the feature activations of the layer input into the classifier) on CIFAR-100. Use a **classical** approach (e.g. Linear SVM, K-Nearest Neighbor, or Bayesian Generative Classifier) from scikit-learn to establish a **stable baseline** performance on CIFAR-100 using the features extracted using your CNN.
2. Fine-tune your CNN on the CIFAR-100 training set and compare with your stable baseline. Experiment with different strategies:
    - Unfreeze some of the earlier layers for fine-tuning.
    - Test different optimizers (Adam, SGD, etc.).

Each of these steps will require you to modify your model definition in some way. For 1, you will need to return the activations of the last fully-connected layer (or the global average pooling layer). For 2, you will need to replace the original, 10-class classifier with a new, randomly-initialized 100-class classifier.

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using:", device)

Using: cuda


In [5]:
train_transform = T.Compose([
    #T.RandomCrop(32, padding=4),
    #T.RandomHorizontalFlip(),
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

test_transform = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

train_set = torchvision.datasets.CIFAR100(root="./data", train=True,  download=True, transform=train_transform)
test_set  = torchvision.datasets.CIFAR100(root="./data", train=False, download=True, transform=test_transform)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
val_loader   = DataLoader(test_set,  batch_size=128, shuffle=False, num_workers=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cuda')

#### Stable Baseline

In [5]:
resnet = models.resnet18(weights="IMAGENET1K_V1")
resnet = resnet.to(device)
resnet.eval()

# REMOVE final fully connected layer → keep everything except FC
feature_extractor = nn.Sequential(*list(resnet.children())[:-1]).to(device)

# Freeze parameters
for p in feature_extractor.parameters():
    p.requires_grad = False

In [6]:
def extract_features(loader, model, device):
    feats, labels = [], []
    with torch.no_grad():
        for imgs, y in loader:
            imgs = imgs.to(device)
            f = model(imgs).squeeze()  # (batch, 512)
            feats.append(f.cpu())
            labels.append(y)
    return torch.cat(feats), torch.cat(labels)

In [7]:
train_feats, train_labels = extract_features(train_loader, feature_extractor, device)
val_feats, val_labels     = extract_features(val_loader,   feature_extractor, device)

# Convert to numpy for scikit-learn
train_feats = train_feats.numpy()
train_labels = train_labels.numpy()
val_feats = val_feats.numpy()
val_labels = val_labels.numpy()

print("Feature shape:", train_feats.shape)   

Feature shape: (50000, 512)


In [8]:
clf = LinearSVC()
clf.fit(train_feats, train_labels)

acc = clf.score(val_feats, val_labels)
print("CIFAR100 Feature Baseline Accuracy:", acc)

CIFAR100 Feature Baseline Accuracy: 0.5989


#### Fine Tuning

In [6]:
def get_finetuning_model():
    model = models.resnet18(weights="IMAGENET1K_V1")
    
    num_feats = model.fc.in_features
    model.fc = nn.Linear(num_feats, 100)  # CIFAR-100 has 100 classes

    return model.to(device)

##### Train Only Classifier Head

In [16]:
model = get_finetuning_model()

# Freeze entire backbone
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the classifier
for param in model.fc.parameters():
    param.requires_grad = True

config = {
    "batch_size": 128,
    "Adam_lr": 1e-3,
    "SGD_lr": 0.01,
    "epochs": 10,
    "optimizer": "Adam",
}

if config["optimizer"] == "Adam":
    optimizer = optim.Adam(model.fc.parameters(), lr=config["Adam_lr"])
elif config["optimizer"] == "SGD":
    optimizer = optim.SGD(model.fc.parameters(), lr=config["SGD_lr"], momentum=0.9)


wandb.init(
            project="Lab-1",
            name="no_augment_resnet18_finetune_head_only",
            group="fine_tuning_experiments",  # all runs belong to this group
            config={
                **config,
            },
            reinit=True   # allows multiple runs in the same notebook
        )

for epoch in range(config["epochs"]):
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, device)
    val_loss, val_acc, _ = evaluate(model, val_loader, device)
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "train_accuracy": train_acc,
        "val_accuracy": val_acc
    }, step=epoch)
    print(f"[HEAD ONLY] Epoch {epoch+1}/{config["epochs"]}  Train Acc={train_acc:.4f}  Val Acc={val_acc:.4f}")
wandb.finish()


                                                          

[HEAD ONLY] Epoch 1/10  Train Acc=0.4105  Val Acc=0.5283


                                                          

[HEAD ONLY] Epoch 2/10  Train Acc=0.5607  Val Acc=0.5574


                                                          

[HEAD ONLY] Epoch 3/10  Train Acc=0.5952  Val Acc=0.5728


                                                          

[HEAD ONLY] Epoch 4/10  Train Acc=0.6117  Val Acc=0.5755


                                                          

[HEAD ONLY] Epoch 5/10  Train Acc=0.6257  Val Acc=0.5917


                                                          

[HEAD ONLY] Epoch 6/10  Train Acc=0.6353  Val Acc=0.5869


                                                          

[HEAD ONLY] Epoch 7/10  Train Acc=0.6441  Val Acc=0.5884


                                                          

[HEAD ONLY] Epoch 8/10  Train Acc=0.6501  Val Acc=0.5951


                                                          

[HEAD ONLY] Epoch 9/10  Train Acc=0.6541  Val Acc=0.5939


                                                          

[HEAD ONLY] Epoch 10/10  Train Acc=0.6595  Val Acc=0.5903


0,1
train_accuracy,▁▅▆▇▇▇████
train_loss,█▃▂▂▂▂▁▁▁▁
val_accuracy,▁▄▆▆█▇▇██▇
val_loss,█▄▃▂▁▂▁▁▁▁

0,1
train_accuracy,0.65952
train_loss,1.20296
val_accuracy,0.5903
val_loss,1.49114


#### Train Classifaier head and last layer

In [17]:
model = get_finetuning_model()

# Freeze all layers first
for param in model.parameters():
    param.requires_grad = False

# Unfreeze last convolutional block
for param in model.layer4.parameters():
    param.requires_grad = True
# Unfreeze classifier
for param in model.fc.parameters():
    param.requires_grad = True

config = {
    "batch_size": 128,
    "Adam_lr_layer4": 1e-4,
    "Adam_lr_fc": 1e-3,
    "SGD_lr_layer4": 0.001,
    "SGD_lr_fc": 0.01,
    "epochs": 10,
    "optimizer": "Adam",
}

if config["optimizer"] == "Adam":
    optimizer = optim.Adam([
        {"params": model.layer4.parameters(), "lr": config["Adam_lr_layer4"]},
        {"params": model.fc.parameters(),      "lr": config["Adam_lr_fc"]},
    ])
elif config["optimizer"] == "SGD":
    optimizer = optim.SGD([
        {"params": model.layer4.parameters(), "lr": config["SGD_lr_layer4"], "momentum": 0.9},
        {"params": model.fc.parameters(),      "lr": config["SGD_lr_fc"],      "momentum": 0.9},
    ])

wandb.init(
            project="Lab-1",
            name="no_augment_resnet18_finetune_layer4_and_head",
            group="fine_tuning_experiments",  # all runs belong to this group
            config={
                **config,
            },
            reinit=True   # allows multiple runs in the same notebook
        )

for epoch in range(config["epochs"]):
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, device)
    val_loss, val_acc, _ = evaluate(model, val_loader, device)
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "train_accuracy": train_acc,
        "val_accuracy": val_acc
    }, step=epoch)
    print(f"[UNFREEZE layer4] Epoch {epoch+1}/{config["epochs"]}  Train Acc={train_acc:.4f}  Val Acc={val_acc:.4f}")
wandb.finish()


                                                          

[UNFREEZE layer4] Epoch 1/10  Train Acc=0.5504  Val Acc=0.6774


                                                          

[UNFREEZE layer4] Epoch 2/10  Train Acc=0.7748  Val Acc=0.7056


                                                          

[UNFREEZE layer4] Epoch 3/10  Train Acc=0.8948  Val Acc=0.7144


                                                          

[UNFREEZE layer4] Epoch 4/10  Train Acc=0.9708  Val Acc=0.7211


                                                          

[UNFREEZE layer4] Epoch 5/10  Train Acc=0.9953  Val Acc=0.7262


                                                          

[UNFREEZE layer4] Epoch 6/10  Train Acc=0.9989  Val Acc=0.7306


                                                          

[UNFREEZE layer4] Epoch 7/10  Train Acc=0.9994  Val Acc=0.7324


                                                          

[UNFREEZE layer4] Epoch 8/10  Train Acc=0.9996  Val Acc=0.7336


                                                          

[UNFREEZE layer4] Epoch 9/10  Train Acc=0.9986  Val Acc=0.7195


                                                          

[UNFREEZE layer4] Epoch 10/10  Train Acc=0.9940  Val Acc=0.6974




0,1
train_accuracy,▁▄▆███████
train_loss,█▄▃▂▁▁▁▁▁▁
val_accuracy,▁▅▆▆▇███▆▃
val_loss,▃▁▁▂▂▂▃▃▅█

0,1
train_accuracy,0.99396
train_loss,0.02662
val_accuracy,0.6974
val_loss,1.40907


#### Full Fine-tuning

In [9]:
model = get_finetuning_model()

for param in model.parameters():
    param.requires_grad = True

config = {
    "batch_size": 128,
    "Adam_lr": 1e-4,
    "SGD_lr": 0.001,
    "epochs": 10,
    "optimizer": "Adam",
}

if config["optimizer"] == "Adam":
    optimizer = optim.Adam(model.parameters(), lr=config["Adam_lr"])
elif config["optimizer"] == "SGD":
    optimizer = optim.SGD(model.parameters(), lr=config["SGD_lr"], momentum=0.9)

wandb.init(
            project="Lab-1",
            name="no_augment_resnet18_finetune_full_model",
            group="fine_tuning_experiments",  # all runs belong to this group
            config={
                **config,
            },
            reinit=True   # allows multiple runs in the same notebook
        )

for epoch in range(config["epochs"]):
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, device)
    val_loss, val_acc, _ = evaluate(model, val_loader, device)
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "train_accuracy": train_acc,
        "val_accuracy": val_acc
    }, step=epoch)
    print(f"[FULL FT] Epoch {epoch+1}/{config["epochs"]}  Train Acc={train_acc:.4f}  Val Acc={val_acc:.4f}")
wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33mmatteo-piras[0m ([33mmatteo-piras-universit-di-firenze[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


                                                          

[FULL FT] Epoch 1/10  Train Acc=0.5599  Val Acc=0.7289


                                                          

[FULL FT] Epoch 2/10  Train Acc=0.8027  Val Acc=0.7717


                                                          

[FULL FT] Epoch 3/10  Train Acc=0.9007  Val Acc=0.7861


                                                          

[FULL FT] Epoch 4/10  Train Acc=0.9633  Val Acc=0.7914


                                                          

[FULL FT] Epoch 5/10  Train Acc=0.9904  Val Acc=0.7952


                                                          

[FULL FT] Epoch 6/10  Train Acc=0.9970  Val Acc=0.8004


                                                          

[FULL FT] Epoch 7/10  Train Acc=0.9989  Val Acc=0.8058


                                                          

[FULL FT] Epoch 8/10  Train Acc=0.9987  Val Acc=0.7929


                                                          

[FULL FT] Epoch 9/10  Train Acc=0.9900  Val Acc=0.7615


                                                          

[FULL FT] Epoch 10/10  Train Acc=0.9850  Val Acc=0.7803


0,1
train_accuracy,▁▅▆▇██████
train_loss,█▄▂▂▁▁▁▁▁▁
val_accuracy,▁▅▆▇▇██▇▄▆
val_loss,█▃▁▁▁▁▁▂▅▄

0,1
train_accuracy,0.985
train_loss,0.06859
val_accuracy,0.7803
val_loss,0.85825


### Exercise 2.2: *Distill* the knowledge from a large model into a smaller one
In this exercise you will see if you can derive a *small* model that performs comparably to a larger one on CIFAR-10. To do this, you will use [Knowledge Distillation](https://arxiv.org/abs/1503.02531):

> Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, NeurIPS 2015.

To do this:
1. Train one of your best-performing CNNs on CIFAR-10 from Exercise 1.3 above. This will be your **teacher** model.
2. Define a *smaller* variant with about half the number of parameters (change the width and/or depth of the network). Train it on CIFAR-10 and verify that it performs *worse* than your **teacher**. This small network will be your **student** model.
3. Train the **student** using a combination of **hard labels** from the CIFAR-10 training set (cross entropy loss) and **soft labels** from predictions of the **teacher** (Kulback-Leibler loss between teacher and student).

Try to optimize training parameters in order to maximize the performance of the student. It should at least outperform the student trained only on hard labels in Setp 2.

**Tip**: You can save the predictions of the trained teacher network on the training set and adapt your dataloader to provide them together with hard labels. This will **greatly** speed up training compared to performing a forward pass through the teacher for each batch of training.

In [None]:
# Your code here.

### Exercise 2.3: *Explain* the predictions of a CNN

Use the CNN model you trained in Exercise 1.3 and implement [*Class Activation Maps*](http://cnnlocalization.csail.mit.edu/#:~:text=A%20class%20activation%20map%20for,decision%20made%20by%20the%20CNN.):

> B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR'16 (arXiv:1512.04150, 2015).

Use your CNN implementation to demonstrate how your trained CNN *attends* to specific image features to recognize *specific* classes. Try your implementation out using a pre-trained ResNet-18 model and some images from the [Imagenette](https://pytorch.org/vision/0.20/generated/torchvision.datasets.Imagenette.html#torchvision.datasets.Imagenette) dataset -- I suggest you start with the low resolution version of images at 160px.

In [None]:
# Your code here.