# Homework 3: optimization of a CNN model
The task of this homework is to optimize a CNN model for the CIFAR-100. You are free to define the architecture of the model, and the training procedure. The only contraints are:
- It must be a `torch.nn.Module` object
- The number of trained parameters must be less than 1 million
- The test dataset must not be used for any step of training. It is better if don't even import it.
- The final training notebook should run on Google Colab within a maximum 1 hour approximately.

For the grading, you must use the `evaluate` function defined below. It takes a model as input, and returns the test accuracy as output.

As a guideline, you are expected to **discuss** and motivate your choices regarding:
- Model architecture
- Hyperparameters (learning rate, batch size, etc)
- Regularization methods
- Optimizer
- Validation scheme

A code without any explanation of the choices will not be accepted. Test accuracy is not the only measure of success for this homework.

Remember that most of the train process is randomized, store your model's weights after training and load it before the evaluation!

## Example

### Loading packages and libraries

In [8]:
import torch
import torchvision
from evaluate import evaluate

# Import the best device available
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.mps.is_available() else 'cpu')
print('Using device:', device)

# load the data
train_dataset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor())

Using device: cuda
Files already downloaded and verified


### Example of a simple CNN model

In [3]:
class TinyNet(torch.nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = torch.nn.Linear(8*8*64, 128)
        self.fc2 = torch.nn.Linear(128, 100)

    def forward(self, x):
        x = torch.nn.functional.relu(self.conv1(x))
        x = torch.nn.functional.max_pool2d(x, 2)
        x = torch.nn.functional.relu(self.conv2(x))
        x = torch.nn.functional.max_pool2d(x, 2)
        x = x.view(-1, 8*8*64)
        x = torch.nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x

print("Model parameters: ", sum(p.numel() for p in TinyNet().parameters()))

Model parameters:  556708


### Example of basic training

In [4]:

model = TinyNet()
model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
for epoch in range(10):
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, 10, loss.item()))


Epoch [1/10], Loss: 4.5914
Epoch [2/10], Loss: 4.5974
Epoch [3/10], Loss: 4.6317
Epoch [4/10], Loss: 4.6144
Epoch [5/10], Loss: 4.5923
Epoch [6/10], Loss: 4.5894
Epoch [7/10], Loss: 4.5210
Epoch [8/10], Loss: 4.4610
Epoch [9/10], Loss: 4.4117
Epoch [10/10], Loss: 4.3461


In [5]:
# save the model on a file
torch.save(model.state_dict(), 'tiny_net.pt')

loaded_model = TinyNet()
loaded_model.load_state_dict(torch.load('tiny_net.pt', weights_only=True))
evaluate(loaded_model)

The model has 556708 parameters
[1m[91mAccuracy on the test set: 6.02%[0m


- Res net
- bottleneck building block for deeper nets with fast training
- idenitity shortcut
- scheduler for learning rate. study if a plateau is present and in case reduce the lr at plateau
- weight initialization using kaming he initialization (works better than xavier)
- regularization using weight decay: no dropout becauase when using BN it can be avoided
- optimizer: sdg or adam

In [43]:
from typing import Optional
from torch import nn
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
import torch.nn.functional as F
from training_utils import *

In [15]:
DEVICE = None
if torch.cuda.is_available():
    # Requires NVIDIA GPU with CUDA installed
    DEVICE = torch.device("cuda")
elif torch.mps.is_available():
    # Requires Apple computer with M1 or later chip
    DEVICE = torch.device("mps")
else:
    # Not recommended, because it's slow. Move to Google Colab!
    DEVICE = torch.device("cpu")

print(DEVICE)

cuda


In [78]:
transform = torchvision.transforms.ToTensor()

BATCH_SIZE = 128

# load the train dataset
train_dataset = torchvision.datasets.CIFAR100(
    root='./data/',
    train=True,
    download=True,
    transform=transform)

# Split the dataset into 40k-10k samples for training-validation.
from torch.utils.data import random_split
train_dataset,  valid_dataset = random_split(
    train_dataset,
    lengths=[40000, 10000],
    generator=torch.Generator().manual_seed(42)
)

train_dataloader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2)

valid_dataloader = DataLoader(
    dataset=valid_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2)

Files already downloaded and verified


I define a fit function as the one used in the tp

In [79]:
def fit(
    model: nn.Module,
    train_dataloader: DataLoader,
    optimizer: torch.optim.Optimizer,
    epochs: int,
    device: torch.device,
    scheduler_lr: Optional[torch.optim.lr_scheduler._LRScheduler] = None,
    val_dataloader: Optional[DataLoader] = None
):
    """
    the fit method simply calls the train_epoch() method for a
    specified number of epochs.
    """

    # keep track of the losses in order to visualize them later
    train_losses = []
    val_losses = []
    val_accuracies = []

    for epoch in range(epochs):
        # Train
        train_loss = train_epoch(
            model=model,
            train_dataloader=train_dataloader,
            optimizer=optimizer,
            device=device,
        )
        train_losses.append(train_loss)
        # Validate
        if val_dataloader is not None:
            val_loss, val_accuracy = predict(
                model=model, test_dataloader=val_dataloader, device=device, verbose=False
            )
            val_losses.append(val_loss)
            val_accuracies.append(val_accuracy)
            print(
                f"Epoch {epoch}: Train Loss={train_loss:.4f}, Val Loss={val_loss:.4f}, Val Accuracy={val_accuracy:.0f}%"
            )
        else:
            print(f"Epoch {epoch}: Train Loss={train_loss:.4f}")
        # LR scheduler
        if scheduler_lr is not None:
            scheduler_lr.step(metrics=val_loss)

    return train_losses, val_losses, val_accuracies

The architecture I choose is a ResNet. ResNets as we have seen in class are very good network to perform image classification tasks.
The one I choose is a residual block ResNEt with a skip connection.
 Skip connection is important because it allows to have deep networks, which offer better performance, without the problem of the vanishing gradient.

I define the residual block of the ResNet.
My block is a 3-layer block with a bottleneck. I choose this structure because it allows to hava e deep network but still with manageble training times.
Each layer is made up of a convolution, a batch normalization and a ReLu used as activation function, in this order.
The three covolutions used are the following:
- 1x1 convolution layer to reduce dimensions
- 3x3 (bottleneck) convolution layer on the reduced dimension
- 1x1 convolution to restore the dimension

In [85]:
class ResidualBlock(nn.Module):

    def __init__(self, in_planes, planes, stride=1):
        super().__init__()

        # First layer: 1x1 reduce dimension
        self.conv1 = nn.Conv2d(
            in_channels = in_planes,
            out_channels = planes//2,
            kernel_size=1,
            stride=stride,
            padding=1,
            bias=False)
        self.bn1 = nn.BatchNorm2d(planes//2)

        # Second layer: 3x3 bottleneck
        self.conv2 = nn.Conv2d(
            planes//2,
            planes//2,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False)
        self.bn2 = nn.BatchNorm2d(planes//2)

        # Third layer: 1x1 restore dimension
        self.conv3 = nn.Conv2d(
          in_channels = planes//2,
          out_channels = planes,
          kernel_size=1,
          stride=stride,
          padding=1,
          bias=False)
        self.bn3 = nn.BatchNorm2d(planes)

        # skip when dimensions match
        self.skip = nn.Sequential()
        # skip when dimensions don't match
        if in_planes != planes or stride > 1:
            self.skip = nn.Sequential(
                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes)
            )

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)

        x = self.conv3(x)
        x = self.bn3(x)
        x += self.skip(x)
        x = F.relu(x)

        return x

Now I write the network.

In [86]:
class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=100):
        super().__init__()
        self.in_planes = 32

        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block, 32, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 64, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 128, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 256, num_blocks[3], stride=2)
        self.linear = nn.Linear(1024, num_classes)

        self._initialize_weights()

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        # Iterate through all model parameters
        for m in self.modules():
            if isinstance(m, nn.Conv2d):  # Initialize Conv2d weights
                torch.nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):  # Initialize BatchNorm weights
                torch.nn.init.constant_(m.weight, 1)
                torch.nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):  # Initialize Linear weights
                torch.nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                torch.nn.init.constant_(m.bias, 0)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

print("Model parameters: ", sum(p.numel() for p in ResNet(block=ResidualBlock, num_blocks=[3,4,3,3]).parameters()))



Model parameters:  994724


In [87]:

cnn = ResNet(block=ResidualBlock, num_blocks=[3,4,3,3]).to(DEVICE)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0.001, momentum=0.9)
scheduler_lr = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

train_losses, valid_losses, valid_accs =   fit(
        model,
        train_dataloader = train_dataloader,
        optimizer = optimizer,
        epochs = 100,
        device = DEVICE,
        val_dataloader = valid_dataloader,
        scheduler_lr = scheduler_lr
    )
plot_loss( train_losses )


Epoch 0: Train Loss=2.3668, Val Loss=2.9326, Val Accuracy=29%
Epoch 1: Train Loss=2.3689, Val Loss=2.9377, Val Accuracy=29%
Epoch 2: Train Loss=2.3684, Val Loss=2.9532, Val Accuracy=29%
Epoch 3: Train Loss=2.3672, Val Loss=2.9299, Val Accuracy=29%
Epoch 4: Train Loss=2.3673, Val Loss=2.9310, Val Accuracy=29%
Epoch 5: Train Loss=2.3644, Val Loss=2.9291, Val Accuracy=29%
Epoch 6: Train Loss=2.3639, Val Loss=2.9268, Val Accuracy=29%
Epoch 7: Train Loss=2.3644, Val Loss=2.9307, Val Accuracy=29%
Epoch 8: Train Loss=2.3641, Val Loss=2.9375, Val Accuracy=29%
Epoch 9: Train Loss=2.3615, Val Loss=2.9285, Val Accuracy=29%
Epoch 10: Train Loss=2.3604, Val Loss=2.9301, Val Accuracy=29%
Epoch 11: Train Loss=2.3613, Val Loss=2.9193, Val Accuracy=29%
Epoch 12: Train Loss=2.3579, Val Loss=2.9338, Val Accuracy=29%
Epoch 13: Train Loss=2.3620, Val Loss=2.9282, Val Accuracy=29%
Epoch 14: Train Loss=2.3562, Val Loss=2.9256, Val Accuracy=29%
Epoch 15: Train Loss=2.3587, Val Loss=2.9255, Val Accuracy=29%
Ep

KeyboardInterrupt: 