# Homework 3: optimization of a CNN model
The task of this homework is to optimize a CNN model for the CIFAR-100 dataset. You are free to define the model architecture and the training procedure. The only constraints are:
- It must be a `torch.nn.Module` object
- The number of trained parameters must be less than 1 million
- The test dataset must not be used for any step of training.
- The final training notebook should run on Google Colab in approximately 1 hour at most.
- Do not modify the random seeds, as they are needed for reproducibility.

For the grading, you must use the `evaluate` function defined below. It takes a model as input and prints the test accuracy.

As a guideline, you are expected to **discuss** and motivate your choices regarding:
- Model architecture
- Hyperparameters (learning rate, batch size, etc)
- Regularization methods
- Optimizer
- Validation scheme

Code without any explanation of the choices will not be accepted. Test accuracy is not the only measure of success for this homework.

Remember that most of the training process is randomized: store your model's weights after training and load them before the evaluation!

## Example

### Loading packages and libraries

In [14]:
import random
import numpy as np
import torch
import torchvision


# Fix all random seeds
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# For full determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Import the best device available
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print('Using device:', device)

# load the data
train_dataset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor())

Using device: mps


In [15]:
test_dataset = torchvision.datasets.CIFAR100(root='./data', train=False, download=True, transform=torchvision.transforms.ToTensor())

def evaluate(model):
    params_count = sum(p.numel() for p in model.parameters())
    print('The model has {} parameters'.format(params_count))

    if params_count > int(1e6):
        print('The model has too many parameters! Not allowed to evaluate.')
        return

    model = model.to(device)
    model.eval()
    correct = 0
    total = 0

    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()


    # print in bold red in a notebook
    print('\033[1m\033[91mAccuracy on the test set: {}%\033[0m'.format(100 * correct / total))


### Example of a simple CNN model

In [16]:
class TinyNet(torch.nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = torch.nn.Linear(8*8*64, 128)
        self.fc2 = torch.nn.Linear(128, 100)

    def forward(self, x):
        x = torch.nn.functional.relu(self.conv1(x))
        x = torch.nn.functional.max_pool2d(x, 2)
        x = torch.nn.functional.relu(self.conv2(x))
        x = torch.nn.functional.max_pool2d(x, 2)
        x = x.view(-1, 8*8*64)
        x = torch.nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x

print("Model parameters: ", sum(p.numel() for p in TinyNet().parameters()))

Model parameters:  556708


### Example of basic training

In [17]:

model = TinyNet()
model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
for epoch in range(10):
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, 10, loss.item()))


Epoch [1/10], Loss: 4.6139
Epoch [2/10], Loss: 4.5809
Epoch [3/10], Loss: 4.5756
Epoch [4/10], Loss: 4.5896
Epoch [5/10], Loss: 4.5961
Epoch [6/10], Loss: 4.5979
Epoch [7/10], Loss: 4.6145
Epoch [8/10], Loss: 4.5321
Epoch [9/10], Loss: 4.5010
Epoch [10/10], Loss: 4.3955
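
To put these loss values in context: with 100 classes, a classifier that assigns equal probability to every class has a cross-entropy of ln(100) ≈ 4.605. The short check below is a plain-Python sketch that only assumes the class count. Note that the loop above logs the loss of the last batch of each epoch rather than an epoch average, so the comparison is rough; still, values this close to ln(100) suggest the baseline has barely moved away from chance.

```python
import math

NUM_CLASSES = 100  # CIFAR-100


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes."""
    return -math.log(1.0 / num_classes)


chance_loss = uniform_cross_entropy(NUM_CLASSES)
chance_accuracy = 100.0 / NUM_CLASSES  # percent, assuming balanced classes

print(f"chance-level loss: {chance_loss:.4f}")
print(f"chance-level accuracy: {chance_accuracy:.2f}%")
```

Any improved training setup should therefore bring the loss clearly below this reference within the first few epochs.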


In [18]:
# save the model on a file
torch.save(model.state_dict(), 'tiny_net.pt')

loaded_model = TinyNet()
loaded_model.load_state_dict(torch.load('tiny_net.pt', weights_only=True))
evaluate(loaded_model)

The model has 556708 parameters
Accuracy on the test set: 3.37%


## Improved Approach

### Proposed Improved Training Approach

- **Model Architecture:** Compact CNN under 1M params using depthwise-separable convolutions, BatchNorm, Dropout, and global average pooling. Optionally add lightweight residual connections to stabilize deeper stacks.
- **Data Pipeline:** Normalize with CIFAR-100 stats (mean `[0.5071, 0.4867, 0.4408]`, std `[0.2675, 0.2565, 0.2761]`). Apply `RandomCrop(32, padding=4)`, `RandomHorizontalFlip`, mild `ColorJitter`, and `RandomErasing`. Consider `RandAugment` or `AutoAugment` if training time permits.
- **Validation Scheme:** Stratified or random 90/10 split of the training set, fixed by the provided seed for reproducibility. Track the best validation accuracy and use early stopping (patience ~10 epochs).
- **Optimizer & LR Schedule:** Prefer `AdamW` (lr ~1e-3, weight_decay=5e-4) or `SGD` with Nesterov momentum, with LR warmup followed by cosine annealing. A cosine schedule often converges better than a multi-step schedule.
- **Regularization:** Use label smoothing (e.g., `0.1`), Dropout2d in conv blocks, gradient clipping (e.g., `1.0`). Maintain an EMA of model parameters for evaluation to reduce variance.
- **Performance Tricks:** Mixed precision (AMP) on CUDA, `pin_memory` for DataLoaders, adequate `num_workers` (2–4). Save checkpoints to `small_cifar_net_best.pt` on validation improvement.
- **Evaluation:** Ensure test loader uses the same normalization as training. Reload best weights and evaluate with the provided `evaluate(model)` function.

This setup balances efficiency and accuracy within the parameter/time constraints while keeping seeds unchanged for reproducibility.
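
To make the label-smoothing choice more concrete, here is a minimal plain-Python sketch of the smoothed cross-entropy for a single sample. It assumes the common definition in which the target puts `1 - smoothing` on the true class and spreads `smoothing` uniformly over all classes, which is my understanding of what `nn.CrossEntropyLoss(label_smoothing=...)` computes. The function name `smoothed_cross_entropy` is introduced here for illustration only.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy of one sample against a label-smoothed target.

    The target distribution puts (1 - smoothing) on the true class and
    spreads `smoothing` uniformly over all classes (true class included).
    """
    num_classes = len(logits)
    # Numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    loss = 0.0
    for k, lp in enumerate(log_probs):
        q = smoothing / num_classes + (1.0 - smoothing) * (k == target)
        loss -= q * lp
    return loss


# A very confident, correct prediction on a 100-class problem
logits = [0.0] * 100
logits[7] = 20.0

plain = smoothed_cross_entropy(logits, target=7, smoothing=0.0)
smooth = smoothed_cross_entropy(logits, target=7, smoothing=0.1)
print(f"no smoothing:  {plain:.4f}")
print(f"smoothing 0.1: {smooth:.4f}")
```

Without smoothing, an extremely confident correct prediction is rewarded with a loss near zero. With smoothing it is penalized, which discourages over-confident logits. One consequence is that the training loss cannot reach zero when smoothing is enabled.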

In [30]:
import math
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, SubsetRandomSampler
import torch
import numpy as np

# CIFAR-100 normalization
CIFAR100_MEAN = (0.5071, 0.4867, 0.4408)
CIFAR100_STD = (0.2675, 0.2565, 0.2761)

# Transforms: augmentations for train, normalization for val/test
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.02),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD),
    transforms.RandomErasing(p=0.25)
])

val_test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD)
])

# Recreate datasets with proper transforms
train_dataset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=train_transform)
val_dataset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=val_test_transform)
# Use normalized test dataset for evaluation only
test_dataset = torchvision.datasets.CIFAR100(root='./data', train=False, download=True, transform=val_test_transform)

# Validation split (seed already set earlier)
val_ratio = 0.1
num_train = len(train_dataset)
indices = np.arange(num_train)
np.random.shuffle(indices)
val_size = int(num_train * val_ratio)
val_indices = indices[:val_size]
train_indices = indices[val_size:]

train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)

# DataLoaders (no CUDA: disable pin_memory)
batch_size = 128
num_workers = 2
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler, num_workers=num_workers, pin_memory=False)
val_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=val_sampler, num_workers=num_workers, pin_memory=False)
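
The cell above uses a purely random 90/10 split. A stratified split, which holds out the same fraction of every class, could be sketched as follows. This is an illustrative alternative, not what the notebook runs: `stratified_split` and the toy label list are hypothetical, and only the standard library is used.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_ratio=0.1, seed=42):
    """Return (train_indices, val_indices) with val_ratio of each class held out."""
    rng = random.Random(seed)  # local generator: the global RNG state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = int(len(idxs) * val_ratio)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx


# Toy labels shaped like the CIFAR-100 training set: 100 classes x 500 images
toy_labels = [i % 100 for i in range(50_000)]
train_idx, val_idx = stratified_split(toy_labels, val_ratio=0.1, seed=42)
print(len(train_idx), len(val_idx))
```

With the real dataset, the labels should be available as `train_dataset.targets`, and the two index lists can be passed to `SubsetRandomSampler` in place of `train_indices` and `val_indices`. Because CIFAR-100 is balanced, the random split is already close to stratified, so the expected gain is small.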


In [31]:
import torch
import torch.nn as nn

class DSConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, p_drop=0.1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.dropout = nn.Dropout2d(p_drop)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        x = self.bn(x)
        x = torch.relu(x)
        x = self.dropout(x)
        return x

class SmallCIFARNet(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.features = nn.Sequential(
            DSConvBlock(3, 64, stride=1, p_drop=0.05),
            nn.MaxPool2d(2),  # 32 -> 16
            DSConvBlock(64, 128, stride=1, p_drop=0.10),
            nn.MaxPool2d(2),  # 16 -> 8
            DSConvBlock(128, 256, stride=1, p_drop=0.15),
            nn.MaxPool2d(2),  # 8 -> 4
            DSConvBlock(256, 512, stride=1, p_drop=0.20),
            # Extra capacity at 512 channels
            DSConvBlock(512, 512, stride=1, p_drop=0.20),
            DSConvBlock(512, 512, stride=1, p_drop=0.20),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

model = SmallCIFARNet().to(device)
print('Model parameters:', sum(p.numel() for p in model.parameters()))


Model parameters: 765055
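
The parameter count can be reproduced by hand, which also shows why depthwise-separable convolutions make this depth affordable under the 1M budget. The sketch below is plain arithmetic; the helper names are illustrative, and the comparison network (the same layout with ordinary 3x3 convolutions) is hypothetical.

```python
def ds_block_params(in_ch, out_ch):
    """Parameters of depthwise 3x3 (no bias) + pointwise 1x1 (no bias) + BatchNorm."""
    depthwise = in_ch * 3 * 3      # one 3x3 filter per input channel
    pointwise = in_ch * out_ch     # 1x1 convolution mixing channels
    batchnorm = 2 * out_ch         # learnable scale and shift
    return depthwise + pointwise + batchnorm


def standard_conv_params(in_ch, out_ch):
    """Parameters of a plain 3x3 convolution (no bias) + BatchNorm, for comparison."""
    return in_ch * out_ch * 3 * 3 + 2 * out_ch


# (in_channels, out_channels) of the six blocks in SmallCIFARNet
blocks = [(3, 64), (64, 128), (128, 256), (256, 512), (512, 512), (512, 512)]
classifier = 512 * 100 + 100  # Linear(512, 100) weights + biases

ds_total = sum(ds_block_params(i, o) for i, o in blocks) + classifier
std_total = sum(standard_conv_params(i, o) for i, o in blocks) + classifier

print("depthwise-separable total:", ds_total)
print("standard 3x3 conv total:  ", std_total)
```

The depthwise-separable total matches the 765055 reported above, while the same layout with standard convolutions would need roughly 6.3 million parameters, far over the limit.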


In [32]:
from copy import deepcopy
import torch

# EMA helpers

def init_ema(model):
    ema = deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema


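def update_ema(ema_model, model, decay):
    # Blend floating-point tensors; copy integer buffers (e.g. BatchNorm's
    # num_batches_tracked) as-is, since averaging would truncate them.
    with torch.no_grad():
        msd = model.state_dict()
        for k, v in ema_model.state_dict().items():
            if v.is_floating_point():
                v.mul_(decay).add_(msd[k], alpha=1.0 - decay)
            else:
                v.copy_(msd[k])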

ema_decay = 0.995
ema_model = init_ema(model)
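
To motivate the choice of `ema_decay = 0.995`, it helps to see how long the average "remembers" old weights. The arithmetic below is a rough sketch that assumes the 45,000-image training split and batch size 128 used in this notebook; it is not a tuning procedure.

```python
import math

ema_decay = 0.995
train_size = 45_000      # 90% of the 50,000 training images
batch_size = 128

steps_per_epoch = math.ceil(train_size / batch_size)

# Weight that the EMA still places on its initial (random) parameters
# after one epoch of updates
initial_weight_left = ema_decay ** steps_per_epoch

# Number of steps after which a contribution has decayed by half
half_life_steps = math.log(0.5) / math.log(ema_decay)

print("steps per epoch:", steps_per_epoch)
print(f"initial weights remaining after 1 epoch: {initial_weight_left:.3f}")
print(f"half-life: {half_life_steps:.0f} steps")
```

With these settings the average spans a fraction of an epoch, so it smooths step-to-step noise without lagging far behind. A noticeable share of the random initialization is still present after the first epoch, which may be one reason EMA-based validation accuracy can lag during the first epoch or two.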


In [33]:
import math
import torch
import torch.nn as nn

# Optimizer, scheduler, loss, and training hyperparameters
base_lr = 1e-3
weight_decay = 5e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

epochs = 50
warmup_epochs = 5


def lr_lambda(e):
    if e < warmup_epochs:
        return (e + 1) / warmup_epochs
    t = (e - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1 + math.cos(math.pi * t))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

grad_clip = 1.0
patience = 10
best_val_acc = 0.0
no_improve = 0
best_path = 'small_cifar_net_best.pt'
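
The warmup-plus-cosine schedule defined above can be inspected without running any training. The sketch below re-declares the same `lr_lambda` in plain Python and lists the learning rate in effect during selected epochs, assuming `LambdaLR` multiplies `base_lr` by `lr_lambda(epoch)` with a 0-based epoch counter.

```python
import math

base_lr = 1e-3
epochs = 50
warmup_epochs = 5


def lr_lambda(e):
    """Multiplier applied to base_lr at (0-based) epoch e."""
    if e < warmup_epochs:
        return (e + 1) / warmup_epochs            # linear warmup
    t = (e - warmup_epochs) / max(1, epochs - warmup_epochs)
    return 0.5 * (1 + math.cos(math.pi * t))      # cosine decay


# Learning rate in effect during each epoch
schedule = [base_lr * lr_lambda(e) for e in range(epochs)]

for e in (0, 1, 4, 5, 25, 49):
    print(f"epoch {e + 1:2d}: lr = {schedule[e]:.6f}")
```

Two details are worth noting. First, the rate never reaches exactly zero, because the cosine argument stops just short of pi at the last epoch. Second, if the learning rate is logged after `scheduler.step()` at the end of an epoch, the logged value is the rate for the next epoch, not the one that was just used.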


In [34]:
import torch
import torch.nn as nn

# Training loop (no CUDA/AMP)
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    running_acc = 0.0
    total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        if grad_clip is not None:
            nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()

        preds = outputs.argmax(dim=1)
        correct = (preds == labels).sum().item()
        bs = labels.size(0)
        total += bs
        running_loss += loss.item() * bs
        running_acc += correct

        update_ema(ema_model, model, ema_decay)

    scheduler.step()
    epoch_loss = running_loss / max(1, total)
    epoch_acc = running_acc / max(1, total)

    # Validation using EMA weights
    ema_model.eval()
    val_correct = 0
    val_total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = ema_model(images)
            preds = outputs.argmax(dim=1)
            val_correct += (preds == labels).sum().item()
            val_total += labels.size(0)
    val_acc = val_correct / max(1, val_total)

    current_lr = scheduler.get_last_lr()[0]
    print(f'Epoch {epoch+1}/{epochs} | loss: {epoch_loss:.4f} | train acc: {epoch_acc*100:.2f}% | val acc: {val_acc*100:.2f}% | lr: {current_lr:.5f}')

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        no_improve = 0
        torch.save(ema_model.state_dict(), best_path)
    else:
        no_improve += 1
        if no_improve >= patience:
            print(f'Early stopping at epoch {epoch+1}. Best val acc: {best_val_acc*100:.2f}%')
            break


Epoch 1/50 | loss: 4.3586 | train acc: 4.57% | val acc: 0.94% | lr: 0.00040
Epoch 2/50 | loss: 4.0377 | train acc: 9.82% | val acc: 4.86% | lr: 0.00060
Epoch 3/50 | loss: 3.8282 | train acc: 13.69% | val acc: 18.80% | lr: 0.00080
Epoch 4/50 | loss: 3.6633 | train acc: 17.45% | val acc: 24.82% | lr: 0.00100
Epoch 5/50 | loss: 3.5432 | train acc: 20.01% | val acc: 29.56% | lr: 0.00100
Epoch 6/50 | loss: 3.4328 | train acc: 22.76% | val acc: 32.32% | lr: 0.00100
Epoch 7/50 | loss: 3.3457 | train acc: 24.80% | val acc: 34.20% | lr: 0.00100

In [36]:
# Load best and evaluate on test set

eval_model = SmallCIFARNet().to(device)
eval_model.load_state_dict(torch.load(best_path, weights_only=True, map_location=device))
loaded_model = eval_model

# Use provided evaluate(model) function
evaluate(loaded_model)


The model has 765055 parameters
Accuracy on the test set: 52.95%
