# Deep Learning for Computer Vision

---

**Goethe University Frankfurt am Main**

Winter Semester 2022/23

<br>

## *Assignment 4 (Training)*

---

**Points:** 110<br>
**Due:** 23.11.2022, 10 am<br>
**Contact:** Matthias Fulde ([fulde@cs.uni-frankfurt.de](mailto:fulde@cs.uni-frankfurt.de))<br>

---

**Your Name:** Tilo-Lars Flasche

<br>

<br>

## Table of Contents

---

- [1 Nesterov Momentum](#1-Nesterov-Momentum-(10-Points))
- [2 Batch Normalization](#2-Batch-Normalization-(20-Points))
  - [2.1 Derivatives](#2.1-Derivatives-(15-Points))
  - [2.2 Redundant Bias](#2.2-Redundant-Bias-(5-Points))
- [3 Residual Networks](#3-Residual-Networks-(30-Points))
  - [3.1 Block](#3.1-Block-(10-Points))
  - [3.2 Network](#3.2-Network-(15-Points))
  - [3.3 Capacity](#3.3-Capacity-(5-Points))
- [4 Network Training](#4-Network-Training-(50-Points))
  - [4.1 Solver](#4.1-Solver-(10-Points))
  - [4.2 Training](#4.2-Training-(35-Points))
  - [4.3 Analysis](#4.3-Analysis-(5-Points))


<br>

## Setup

---

In this problem set we work with PyTorch, NumPy, and Matplotlib. See [this page](https://pytorch.org/get-started/locally/) for information on how to install PyTorch on your system.

We set the Matplotlib backend to inline in order to display images directly in the notebook. In addition, we enable autoreloading so that saved changes in imported modules are applied without the need to manually reload the modules every time.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# Enable autoreloading of imported modules.
%load_ext autoreload
%autoreload 2

In order to show the progress during training, we'll import the [tqdm](https://tqdm.github.io/) library.

In [None]:
from tqdm.notebook import tnrange

As you have observed in the last problem set, training a neural net without GPU support is terribly slow, even if we use the vector registers of the CPU. To accelerate training, we load the PyTorch library that allows us to perform the computations on the graphics card. You can check the GPU support of your environment with the statements below.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

# Check GPU support on your machine.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

print(device)

To avoid cluttering the notebook, we'll write our class and function definitions in external modules.

In [None]:
from networks import ResNet
from solver import Solver
from utils import show_training

<br>

## Definitions

---

For comparability, we want to be able to compute the `capacity` of a model, which is the total number of learnable parameters. This also includes biases and linear and affine parameters of normalization layers, but *not* hyperparameters like the learning rate.

In [None]:
def capacity(module):
    """
    Computes the number of learnable parameters.

    Parameters:
        - module (nn.Module): Module with parameters.

    Returns:
        - num_param (int): Number of parameters.

    """
    num_param = sum(p.numel() for p in module.parameters() if p.requires_grad)

    return num_param

<br>

## Dataset

---

We'll again use the CIFAR-10 dataset for this assignment, but this time we'll load and preprocess the dataset using the Torchvision library.

We can use the [Compose](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.Compose) class to chain multiple transformations that we wish to apply to our data. We first convert the images in the loaded training and test sets to PyTorch tensors using the [ToTensor](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.ToTensor) transform. The Torchvision datasets provide PIL images, which `ToTensor` converts to tensors with values in the range $[0,1]$, but we want the data to be centered, so we use the [Normalize](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.Normalize) transform to map the data to the $[-1,1]$ range.
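As a quick check of the normalization step, the sketch below applies the per-channel formula used by `Normalize`, `(x - mean) / std`, to a few scalar values in plain Python. The helper name `normalize` is made up for illustration and is not part of Torchvision.

```python
def normalize(x, mean=0.5, std=0.5):
    # Same elementwise formula that T.Normalize applies per channel.
    return (x - mean) / std

# Endpoints and midpoint of the [0, 1] range produced by ToTensor.
values = [0.0, 0.5, 1.0]
normalized = [normalize(v) for v in values]

print(normalized)  # [-1.0, 0.0, 1.0]
```

With mean and standard deviation both set to 0.5, the interval $[0,1]$ is mapped onto $[-1,1]$.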

To tune the hyperparameters of our models, we want to split the training set into a smaller set for training and a set for validation. We can accomplish this with the [random_split](https://pytorch.org/docs/stable/data.html) function, which takes a dataset and a list of sizes for the random subsets.

Note that in the calls to `CIFAR10` below, you can safely keep the `download` parameter set to `True`. In case the dataset has been downloaded before to the given directory, it will be loaded from there and not downloaded from the internet again.

In [None]:
# Convert images to tensors and standardize data.
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((.5, .5, .5), (.5, .5, .5))
])

# Load and preprocess training set.
data_train = torchvision.datasets.CIFAR10(
    root='./datasets',
    train=True,
    download=True,
    transform=transform,
)

# Split training set into sets for training and validation.
data_train, data_val = torch.utils.data.random_split(data_train, [49000, 1000])

# Load and preprocess test set.
data_test = torchvision.datasets.CIFAR10(
    root='./datasets',
    train=False,
    download=True,
    transform=transform
)

<br>

## Example

Since we have not worked with PyTorch so far, let's see an example of how to create a model and train it on the dataset.

For simple feed forward networks we can use the [Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html?highlight=sequential#torch.nn.Sequential) class of the PyTorch library to stack the different layers. For this example, we'll create the same model that we implemented in the previous problem set, so we'll use [Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d) for the convolution layers, [MaxPool2d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d) for the pooling layers, [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) for the fully-connected layers and [Dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout) for the dropout layer.

We'll again use the ReLU activation function, which we can plug in using the [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU) class. With [Flatten](https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html#torch.nn.Flatten) we can transform the tensor output of the last convolutional layer into a vector representation that is the input to the first linear layer.

In order to be able to utilize the GPU for training, we have to transfer the parameters and buffers of our model to the respective device.

In [None]:
# Define the same model as in the last assignment.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(in_channels=6, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(in_features=512, out_features=32),
    nn.ReLU(),
    nn.Linear(in_features=32, out_features=10)
)

# Transfer model to selected device.
model = model.to(device)

In order to train the model, we need a loss function and an optimizer. We'll again use [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) and stochastic gradient descent with momentum. For the latter we call [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD) with the parameters we want to optimize and provide values for learning rate and momentum.

In [None]:
# Create loss function and optimizer.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

To access the images in our datasets, we use the [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class from the data utils of the PyTorch library. Using this class we don't have to sample the minibatches manually. The result is an iterable that provides us with chunks of data where each entry is a tuple of images and corresponding labels.

In [None]:
# Set number of samples per batch.
batch_size = 128

# Create loader for training set.
loader_train = torch.utils.data.DataLoader(
    data_train,
    batch_size=batch_size,
    shuffle=True,
    num_workers=2
)

# Create loader for test set.
loader_test = torch.utils.data.DataLoader(
    data_test,
    batch_size=batch_size,
    shuffle=False,
    num_workers=2
)

Now let's train the model for a couple of epochs.

After each iteration, we reset the gradients to zero, since this is not done automatically. We compute the forward pass through our network, the loss and the backward pass. To perform the update, we call the `step` method of our optimizer.

In [None]:
# Enable training mode.
model.train()

# Set number of training epochs.
num_epochs = 5

for epoch in (pbar := tnrange(num_epochs)):
    for inputs, labels in loader_train:

        # Transfer data to selected device.
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Compute the forward pass.
        outputs = model(inputs)

        # Compute the loss and backward pass.
        loss = criterion(outputs, labels)
        loss.backward()

        # Update model parameters and reset gradients.
        optimizer.step()
        optimizer.zero_grad()

    # Show most recent loss every epoch.
    pbar.set_description(f'Epoch: {epoch + 1}  Loss: {loss.item():.5f}')

Let's check how well our model works on the test set.

When doing this, we can use the `no_grad` context manager to tell PyTorch that we don't need gradients, which allows faster computation. Besides that, we must remember to transfer the data to the GPU again. You should expect to see $> 50 \%$ accuracy on the test set with the given model and hyperparameter settings.

In [None]:
# Correct predictions and samples.
correct = 0
total = 0

# Enable evaluation mode.
model.eval()

with torch.no_grad():
    for inputs, labels in loader_test:

        # Transfer data to selected device.
        inputs = inputs.to(device)
        labels = labels.to(device)

        # Compute the forward pass.
        outputs = model(inputs)

        # Get predicted labels.
        prediction = torch.argmax(outputs.data, 1)

        # Store results for the current batch.
        total += labels.size(0)
        correct += torch.sum(prediction == labels).item()

# Show the result.
print(f'Accuracy: {100*correct / total:5.2f}%')

An alternative to using the `Sequential` class for creating a model, which provides more flexibility, is to subclass [Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module). Instances of this class are callable but can also maintain state like model parameters. Furthermore, [functional](https://pytorch.org/docs/stable/nn.functional.html) provides functions instead of classes that we can call directly in our model. Using this approach, we could define the same model we created above as follows:

<br>

```Python
class Model(nn.Module):

    def __init__(self):
        super().__init__()

        # Define conv layers.
        self.conv1 = nn.Conv2d(3, 6, 3, padding=1)
        self.conv2 = nn.Conv2d(6, 8, 3, padding=1)

        # Define fully-connected layers.
        self.fc1 = nn.Linear(512, 32)
        self.fc2 = nn.Linear(32, 10)

        # Define pooling layers.
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):

        # Pass input through conv net.
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))

        # Flatten all but batch dimension.
        x = torch.flatten(x, 1)
        
        # Pass input through fully-connected net.
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        return x
```

<br>

When training your network, you might also consider using a [learning rate scheduler](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate), since for training deep neural networks, it's almost always beneficial to decrease the learning rate as training goes on.

The reason this makes sense is quite simple:<br>

- If you choose a *too large* learning rate, you can make some fast initial progress, but you won't be able to reach a good local minimum, because you jump over the bottom point and bounce back and forth.
- If you choose a *too small* learning rate on the other hand, you'll probably reach a minimum, but training will take very long, because you only take tiny steps in the right direction.

So it's better to start with a rather large learning rate and decrease it on the way down towards the minimum.

Regarding regularization, the PyTorch optimizers take a `weight_decay` argument, which lets you apply L2 regularization to the parameters.
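The following minimal sketch combines both ideas: an SGD optimizer with `weight_decay` and a `StepLR` scheduler that halves the learning rate every two epochs. The tiny linear model and all hyperparameter values are placeholders chosen for illustration, not recommended settings.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; any nn.Module with parameters works here.
model = nn.Linear(4, 2)

# weight_decay adds an L2 penalty on the parameters to each update.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4
)

# Multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

learning_rates = []
for epoch in range(6):
    # Record the learning rate used in this epoch.
    learning_rates.append(optimizer.param_groups[0]['lr'])

    # ... one epoch of training would go here ...
    optimizer.step()

    # Advance the schedule once per epoch, after the optimizer step.
    scheduler.step()

print(learning_rates)
```

The recorded learning rates stay at 0.1 for two epochs, then drop to 0.05 and finally to 0.025.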

To get more thorough explanations on how all this works, we highly recommend exploring the [tutorials](https://pytorch.org/tutorials/) section on the PyTorch website and having a look at the documentation pages for the respective components.

<br>

## Exercises

---

### 1 Nesterov Momentum (10 Points)

---

The classical stochastic gradient descent with momentum update rule is given by

<br>

$$
    \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - V^{(t+1)}
    \hspace{2em}\text{with}\hspace{2em}
    V^{(t+1)} = \mu V^{(t)} + \eta\nabla\mathcal{L}(\boldsymbol{\theta}^{(t)}),
$$

<br>

where $\boldsymbol{\theta}$ is a parameter, $\mathcal{L}$ is the loss function, $\eta$ the learning rate, $V$ the velocity, and $\mu$ the momentum.

In Nesterov's accelerated gradient method, this update rule becomes

<br>

$$
    \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - V^{(t+1)}
    \hspace{2em}\text{with}\hspace{2em}
    V^{(t+1)} = \mu V^{(t)} + \eta\nabla\mathcal{L}(\boldsymbol{\theta}^{(t)} - \mu V^{(t)}),
$$

<br>

that is, we make a partial update in the direction of the accumulated gradient, before we evaluate the loss function and compute the gradient for the final update of the parameter.
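To make the difference between the two rules concrete, here is a small sketch in plain Python that applies both updates to a scalar parameter. The toy loss $\mathcal{L}(\theta) = \frac{1}{2}\theta^2$, the function names, and the values for $\eta$ and $\mu$ are assumptions chosen for illustration only.

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * theta**2.
    return theta

def momentum_step(theta, v, lr=0.1, mu=0.9):
    # Classical momentum: gradient evaluated at the current parameter.
    v_new = mu * v + lr * grad(theta)
    return theta - v_new, v_new

def nesterov_step(theta, v, lr=0.1, mu=0.9):
    # Nesterov: gradient evaluated at the look-ahead point theta - mu * v.
    v_new = mu * v + lr * grad(theta - mu * v)
    return theta - v_new, v_new

# Run two update steps with each rule from the same starting point.
theta_m, v_m = 1.0, 0.0
theta_n, v_n = 1.0, 0.0
for _ in range(2):
    theta_m, v_m = momentum_step(theta_m, v_m)
    theta_n, v_n = nesterov_step(theta_n, v_n)

print(theta_m, theta_n)  # approximately 0.72 and 0.729
```

The two functions differ only in the point at which the gradient is evaluated, which mirrors the single difference between the two formulas.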

Give an intuitive example for a situation in which Nesterov's method will perform a better update step than the classical momentum method. Be as precise as possible.

##### Answer

*Write your answer here.*

<br>

### 2 Batch Normalization (20 Points)

---

The batch normalizing transform for a vector $\mathbf{x}^{(n)} \in \mathbb{R}^D$ of a minibatch $\mathcal{B} \subset \mathbf{X}$ of $N$ samples is defined as

<br>

$$
    \mathbf{y}^{(n)} = \gamma\hat{\mathbf{x}}^{(n)} + \beta
$$

with

$$
    \hat{\mathbf{x}}^{(n)}
    =
    \frac{\mathbf{x}^{(n)}-\mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}}
$$

<br>

where the mean and variance are computed componentwise as

<br>

$$
    \mu_\mathcal{B}
    =
    \frac{1}{N} \sum_{n=1}^N \mathbf{x}^{(n)}
    \hspace{2em}\text{and}\hspace{2em}
    \sigma_\mathcal{B}^2
    =
    \frac{1}{N} \sum_{n=1}^N (\mathbf{x}^{(n)} - \mu_\mathcal{B})^2.
$$
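The forward pass defined by these formulas can be sketched in a few lines of NumPy. This is only an illustration of the training-time transform for inputs of shape $(N, D)$; the function name `batchnorm_forward` and the random test data are assumptions, and running averages for inference are left out.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Componentwise mean and (biased, 1/N) variance over the batch axis.
    mu = x.mean(axis=0)
    var = x.var(axis=0)

    # Normalize, then apply the learnable scale and shift.
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical minibatch with N = 8 samples and D = 4 features.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))

gamma = np.ones(4)
beta = np.zeros(4)
y = batchnorm_forward(x, gamma, beta)

print(y.mean(axis=0))
print(y.var(axis=0))
```

With $\gamma = 1$ and $\beta = 0$, each output feature has a mean close to zero and a variance close to one, up to the effect of $\epsilon$.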

<br>

### 2.1 Derivatives (15 Points)

---

We assume that we know the partial derivatives of the loss $\mathcal{L}$ with respect to the outputs $\mathbf{y}^{(n)}$ of a batch norm layer. Compute the partial derivatives with respect to the variance $\sigma_\mathcal{B}^2$, the mean $\mu_\mathcal{B}$, and the layer inputs $\mathbf{x}^{(n)}$, using the chain rule.

You can directly use

$$
    \frac{\partial\mathcal{L}}{\partial\hat{\mathbf{x}}^{(n)}}
    =
    \frac{\partial\mathcal{L}}{\partial\mathbf{y}^{(n)}}\gamma.
$$

<br>

##### Solution

*Write your solution here.*

<br>

### 2.2 Redundant Bias (5 Points)

---

Ignoring the constant added for numerical stability, we can write the normalization transform as

<br>

$$
    y = \gamma\left(\frac{x - \mu}{\sigma}\right) + \beta,
$$

<br>

where $x$ is the input, $y$ is the output, $\mu$ and $\sigma$ are mean and standard deviation, respectively, $\gamma$ are the linear parameters, and $\beta$ the affine parameters of the transformation. The difference between batch, layer, and instance normalizations is how the statistics $\mu$ and $\sigma$ are computed.

During inference, we use estimates of the mean and standard deviation based on moving averages obtained during training. Hence when testing the model, these terms are constant.

Show that for $x = x^\prime + b$, the bias $b$ is redundant, that is, we can learn $\beta$ such that

<br>

$$
    y = \gamma\left(\frac{x^\prime - \mu}{\sigma}\right) + \beta.
$$

<br>

##### Proof

*Write your proof here.*

<div style="text-align:right">$\square$</div>

<br>

### 3 Residual Networks (30 Points)

---

In this exercise we want to implement a residual convolutional network, analogous to the ResNet architecture introduced in the lecture.

<br>

### 3.1 Block (10 Points)

---

In the `networks` module, complete the definition of the `ResidualBlock` class.

The main branch of the residual block should consist of two conv layers, each followed by a spatial batch norm layer. For the first convolutional block, the normalization layer is followed by a ReLU activation. The number of filters in each of the two conv layers is equal to the parameter `out_channels` of the constructor. Both conv layers should use a kernel size of 3 and a padding of 1. The stride of the first conv layer is equal to the `stride` parameter of the constructor. The second conv layer uses a stride of 1. Neither of the two conv layers uses a bias.

The skip connection should be the identity mapping if the shape of the output is equal to the shape of the input. If this is not the case, use a conv layer with kernel size 1 to adjust the shape. The conv layer should again be followed by a batch norm layer and have no bias.

In the `forward` method of the class, pass the input through the two branches, add the result, and apply a final ReLU activation before storing the result in the `out` variable that is returned from the method.
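One possible reading of this specification is sketched below. The class name `SketchResidualBlock` is hypothetical, and the code is not the `ResidualBlock` from the `networks` module, which is not shown here; in particular, the condition used to decide when the skip connection needs a conv layer is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchResidualBlock(nn.Module):

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        # Main branch: conv-bn-relu followed by conv-bn, no biases.
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        # Skip branch: identity if shapes match, else 1x1 conv plus batch norm.
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        # Add both branches and apply the final activation.
        out = F.relu(self.main(x) + self.skip(x))
        return out

# The block halves the spatial size and doubles the channels here.
block = SketchResidualBlock(16, 32, stride=2)
x = torch.randn(2, 16, 8, 8)
out = block(x)

print(out.shape)  # torch.Size([2, 32, 4, 4])
```

Because the stride and the channel count both change in this example, the skip branch uses the 1x1 convolution; with `stride=1` and equal channel counts it falls back to the identity.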

<br>

### 3.2 Network (15 Points)

---

Now complete the definition of the `ResNet` class in the same file.

The residual network should start with a single conv layer with 3 input channels, a kernel size of 3, and stride and padding being equal to 1. The first conv layer is followed by a batch norm layer and a ReLU activation.

After this first convolutional block, the network should consist of a number of `ResidualBlock` instances, which are finally followed by an average pooling layer, whose result is flattened and fed into a `Linear` layer that maps to feature vectors of length 10. The last layer has no activation.

The number of residual blocks, the number of channels, and the size of the respective feature maps are up to you. However, the network that you define must not have more than $50{,}000$ parameters. This is a hard requirement!

You can use the function `capacity` to verify that you're below this threshold. In order to get a good result when training the network later, you should try to come close to the allowed number of parameters, while respecting the given specifications.

After training your model, you may come back to this exercise and change your architecture to see if you can get better results with a different setup, but make sure that you still meet the requirements.

In [None]:
# Create network instance.
model = ResNet()

# Check number of parameters.
print(capacity(model))

<br>

### 3.3 Capacity (5 Points)

---

When implementing your model to solve the previous exercise, you had to keep your model capacity below a certain threshold.

Calculate the number of learnable parameters in your model by hand. For each layer or block in your network give a formula to compute the number of parameters. In the end, the sum should match the number computed by the `capacity` function.

##### Answer

*Write your answer here.*

**2D Convolutional Layers**

The number of parameters of a 2D convolutional layer without a bias can be calculated with the following formula:
$$ \text{kernel_size}^2 \cdot \text{in_channels} \cdot \text{out_channels} $$
If the layer uses a bias, *out_channels* additional parameters are added.

**2D Batch Normalization Layers**

The number of parameters of a 2D batch normalization layer can be calculated with the formula:
$$ 2 \cdot \text{num_features} $$
where *num_features* is the number of input channels of the batch normalization layer. The batch normalization layer learns two vectors $\beta$ and $\gamma$, where each of these vectors has *num_features* components.

**Activation functions**

Activation functions do not have any learnable parameters. This holds for standard activation functions such as *ReLU*, *Tanh*, or *Sigmoid*.
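Per-layer counts like these can be cross-checked with small helper functions. The sketch below uses the standard parameter counts for convolutional, batch normalization, and linear layers; the function names are hypothetical, and the example sums the layers of the small `Sequential` model from the Example section rather than the residual network.

```python
def conv2d_params(kernel_size, in_channels, out_channels, bias=True):
    # One kernel_size x kernel_size filter per input/output channel pair.
    weights = kernel_size ** 2 * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def batchnorm2d_params(num_features):
    # One scale (gamma) and one shift (beta) per channel.
    return 2 * num_features

def linear_params(in_features, out_features, bias=True):
    # Weight matrix plus optional bias vector.
    return in_features * out_features + (out_features if bias else 0)

# Layers of the example model: two conv layers and two linear layers.
example_total = (
    conv2d_params(3, 3, 6)
    + conv2d_params(3, 6, 8)
    + linear_params(512, 32)
    + linear_params(32, 10)
)

print(example_total)  # 17354
```

For a residual block, the same helpers would be called with `bias=False` for the conv layers and combined with `batchnorm2d_params` for each normalization layer.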

<br>

### 4 Network Training (50 Points)

---

Our goal in this exercise is to finally train a *good* classifier for the CIFAR-10 dataset.

<br>

### 4.1 Solver (10 Points)

---

Before we actually start with training, however, we want to write a reusable class that encapsulates the training logic. Such a class allows us to focus on the important aspects of the training and saves us work if we face this task again.

In the `solver.py` module you find the definition of the `Solver` class.

Read the code to see how it works. It creates a loss function, an optimizer, and optionally, a learning rate scheduler from the given names and arguments. Additional arguments like the `batch_size` are stored directly as attributes of the object. The class provides methods to `save` and `load` the training state, including the model parameters and the state of the optimizer and scheduler.

<br>

#### 4.1.1 Test Method (3 Points)

The first task is to complete the definition of the `test` method.

The method takes a dataset and has an optional `num_samples` parameter. If its value is `None` the model is evaluated on the whole given dataset. If a value for this parameter is given, it should be used to randomly subsample the given dataset before testing.

Create a `DataLoader` for the dataset using the stored batch size and evaluate the model inside a `with torch.no_grad()` block. Compute the accuracy in percent and store it in the `accuracy` variable that is returned from the method.
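The evaluation logic described here could look roughly like the standalone function below. It is a sketch under assumptions, not the `Solver.test` method: the function name, its signature, and the use of `torch.randperm` with `Subset` for the random subsampling are choices made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def evaluate_accuracy(model, dataset, batch_size=128, num_samples=None, device='cpu'):
    # Optionally draw a random subset of the dataset.
    if num_samples is not None:
        indices = torch.randperm(len(dataset))[:num_samples].tolist()
        dataset = Subset(dataset, indices)

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    correct, total = 0, 0

    model.eval()
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Predicted class is the index of the largest score.
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    # Accuracy in percent.
    accuracy = 100 * correct / total
    return accuracy

# Toy data where the one-hot inputs already encode the correct labels.
labels = torch.tensor([0, 1, 2, 3, 0, 1])
inputs = nn.functional.one_hot(labels, num_classes=4).float()
toy_data = TensorDataset(inputs, labels)

acc_full = evaluate_accuracy(nn.Identity(), toy_data, batch_size=4)
acc_subset = evaluate_accuracy(nn.Identity(), toy_data, batch_size=4, num_samples=3)

print(acc_full, acc_subset)  # 100.0 100.0
```

Note that the function leaves the model in evaluation mode, so a training loop that calls it would need to switch back with `model.train()` afterwards.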

<br>

#### 4.1.2 Train Method (7 Points)

The second task is to complete the definition of the `train` method.

In this method first create a `DataLoader` for the training set, using the stored batch size, then implement the train loop for the number of epochs given as an argument. In each epoch, increment the `self.epoch` counter. Store the *average* loss for the epoch in the `self.loss_history` list.

Also once per epoch, call the `test` method with the training and validation datasets and the stored parameters for the number of samples to use for validation. Store the obtained accuracies in the `self.train_acc` and `self.val_acc` lists. If a scheduler is given, you should call the `step` method of the scheduler in every epoch.

In addition, you should keep track of the best validation set accuracy. In every epoch, check if the accuracy is better than the best accuracy seen so far. If this is the case, store a copy of the model parameters. After the training has finished, load the parameters that produced the best accuracy back into the model.
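The bookkeeping for the best model can be sketched as follows. The validation accuracies are made-up numbers and the weight perturbation merely stands in for a real training epoch; only the pattern of copying and restoring the `state_dict` is the point here.

```python
import copy

import torch
import torch.nn as nn

model = nn.Linear(2, 1)

# Hypothetical validation accuracies, one per epoch.
val_accuracies = [40.0, 55.0, 50.0]

best_acc = float('-inf')
best_state = None
snapshots = []

for epoch, val_acc in enumerate(val_accuracies):
    # Stand-in for one epoch of training: change the weights.
    with torch.no_grad():
        model.weight.add_(1.0)
    snapshots.append(model.weight.detach().clone())

    # Keep a deep copy, since state_dict() returns references
    # to tensors that later updates would overwrite.
    if val_acc > best_acc:
        best_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())

# Restore the parameters that gave the best validation accuracy.
model.load_state_dict(best_state)

print(best_acc)  # 55.0
```

After the loop, the model holds the weights from the second epoch, even though training continued for another epoch.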

Use the `tnrange` function of tqdm to show a progress bar for the epochs and the current validation set accuracy, as in the example above.

<br>

### 4.2 Training (35 Points)

---

Now use the `Solver` class and your implementation of `ResNet` to train a classifier for CIFAR-10.

Experiment with different residual network architectures within the constraints formulated in the previous exercise, try different optimizers and, if you want, learning rate schedulers. Try to find good hyperparameter values for your model. The goal is to achieve at least $70\%$ accuracy on the test set.

The ten bonus points in this assignment are for training the model to $80\%$ accuracy on the test set.

After you have finished training, save the best model in the `models` folder.

<br>

You can use the `show_training` function from the `utils.py` module to show the losses and accuracies recorded during training.

<br>

#### 4.2.1 Solution

Write your code in the cells below.

In [None]:
# Instantiate ResNet model
model = ResNet()
print("Number of residual blocks", model.n_blocks)

# Transfer model to selected device.
model = model.to(device)

# Instantiate Solver class
solver = Solver(
    model=model,
    data={'train': data_train, 'val': data_val},
)

solver.train(num_epochs=50)

<br>

#### 4.2.2 Evaluation

Compute and show the accuracy on the test set.

In [None]:
accuracy = solver.test(data_test)

print(f'Test accuracy: {accuracy:5.2f}%')

<br>

### 4.3 Analysis (5 Points)

---

Describe your approach and what you observed during training.

What did you try and what were the results? Which methods or settings that you used worked and which did not?

##### Answer

*Write your answer here.*

First I tried to train a model with 16 residual blocks over 25 epochs. This led to an accuracy of X%.

Next, I increased the number of residual blocks to 32 and trained the model again over 25 epochs. This time I obtained an accuracy of around 37%. I then increased the number of training epochs to 50 and evaluated the model again.

Finally I increased the number of residual blocks to 48. This time I received an accuracy of X%.

You can see that the accuracy increases with the number of residual blocks. The tradeoff is that the number of parameters also increases and the model needs more time to be trained.