# Lab 2: Your first Convolutional Neural Network

In today's lab session we will:

1. [Build a Shallow Convolutional Neural Network (CNN)](#Building-a-Shallow-Convolutional-Neural-Network)
2. [Train that network on CIFAR-10](#Training-on-CIFAR-10)

--- 
## Interactive notebooks

### Running on Google Colaboratory

Head over to https://colab.research.google.com and sign in with a Google account.

You should see something similar to the image below:

<img src="./media/colaboratory.png" width="900">

Go to *File > New Python 3 Notebook*. It should prompt you to sign in with your Google account.

---
### Running on the lab machines
Alternatively, you can run these notebooks on the lab machines locally. First, you will need to load anaconda by entering the following into a terminal: `module load anaconda`

Now, install tensorboard by entering the following into the terminal: `pip install tensorboard`

With this complete, run the jupyter notebook server by entering the following into the terminal: `jupyter notebook` and navigate to `http://localhost:8888` in your browser.

Go to *New > Notebook* to create a new notebook.

In the first cell type the following

```python
import torch
torch.__version__
```

and click the play button to the left of the cell to run the code (Alternatively, pressing `<Ctrl>-<Enter>` will also run the code).

<img src="./media/colaboratory-notebook.png" width="700">

You should get *at least* version `1.1.0`; more likely you will get `2.4.0`.

You can add new cells to the notebook by clicking the *+ Code* button in the toolbar.

---
# Building a Shallow Convolutional Neural Network

We'll be building a shallow Convolutional Neural Network (CNN) of two layers.

We'll be making heavy use of PyTorch's layers today, defined in the [`torch.nn`](https://pytorch.org/docs/1.2.0/nn.html) module.

**Task:** Open the documentation for the fully connected layer ([`nn.Linear`](https://pytorch.org/docs/1.2.0/nn.html#linear)) and the 2D convolutional layer ([`nn.Conv2d`](https://pytorch.org/docs/1.2.0/nn.html#conv2d)); you'll need these later.

Optimizers are defined in [`torch.optim`](https://pytorch.org/docs/1.2.0/optim.html). We'll use the [`SGD`](https://pytorch.org/docs/1.2.0/optim.html#torch.optim.SGD) optimizer like we used in the first lab.

[Loss functions](https://pytorch.org/docs/1.2.0/nn.html#loss-functions) are also part of the [`torch.nn`](https://pytorch.org/docs/1.2.0/nn.html#loss-functions) module. Also find the documentation for `nn.CrossEntropyLoss` (we used this last week); you'll need this later.

We will implement the following architecture, as described in your practical intro slides.

<img alt="CNN Architecture diagram" src="./media/cnn-ex-8.png" style="max-height: 300px;">

This diagram is drawn in the style put forward in the [AlexNet paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) where inputs/outputs are visualised as 3D volumes and layers are implicit between the inputs/outputs with only the receptive fields drawn (highlighted in orange, pink, and blue in the figure above).

The output shapes of each layer are listed in the table below

| Layer  | Output shape ($C \times H \times W$) |
|--------|------------------------------|
| Input  | $3 \times 32 \times 32$      |
| Conv1  | $32 \times 32 \times 32$     |
| Pool1  | $32 \times 16 \times 16$     |
| Conv2  | $64 \times 16 \times 16$     |
| Pool2  | $64 \times 8 \times 8$       |
| FC1    | $1024$                       |
| FC2    | $10$                         |
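
The shapes in the table can be reproduced with the usual output-size formula for a convolution or pooling window, $\lfloor (\text{size} + 2 \cdot \text{padding} - \text{kernel}) / \text{stride} \rfloor + 1$. The sketch below is plain Python. `conv_output_size` is a helper name made up for illustration, and the sketch assumes the second convolution and pooling layers use the same kernel, padding and stride as the first ones, which is consistent with the table.

```python
def conv_output_size(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial size after a convolution or pooling window (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# Walk a 32x32 input through the layers listed in the table.
conv1 = conv_output_size(32, kernel=5, padding=2)       # 5x5 conv, padding 2
pool1 = conv_output_size(conv1, kernel=2, stride=2)     # 2x2 max pool, stride 2
conv2 = conv_output_size(pool1, kernel=5, padding=2)    # assumed same settings as conv1
pool2 = conv_output_size(conv2, kernel=2, stride=2)     # assumed same settings as pool1
flattened = 64 * pool2 * pool2                          # 64 channels after Conv2

print(conv1, pool1, conv2, pool2, flattened)
```

The final value is the number of features that the first fully connected layer receives.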

Our network is designed to operate on images from CIFAR-10, a dataset containing 60,000 RGB images, each 32 $\times$ 32 in resolution, split into 50,000 images for training and 10,000 images for testing. 

There are 10 classes with 6,000 examples per class. Some examples of each class can be seen in the diagram below

<img alt="CIFAR-10 examples" src="./media/cifar10.png" style="max-height: 500px;" />


We've provided you with a boilerplate script `train_cifar.py` to get you started. Download the code to your laptop by cloning the git repository:

```console
$ git clone https://github.com/COMSM0045-Applied-Deep-Learning/labsheets.git
```

If you don't have git, then download a zip copy via the green button in the top right of https://github.com/COMSM0045-Applied-Deep-Learning/labsheets.


The code provided is in `lab-2-cnns/lab2-code/`. There are two files:
- `train_cifar.py`: This contains the code that you will need to edit in this lab

We'll draw the architecture of the CNN bit by bit, accompanying it by code to show you how to implement the first few layers. We'll leave you to implement the rest. 

First up is the input to the network. This is a single input image drawn as a 3D volume:

<img alt="Input shape" src="./media/cnn-ex-1.png" style="max-height: 200px;">

In PyTorch, our network definition is split into two parts. The first part takes place in the constructor and allocates the memory for the parameterized layers. The second part is the forward pass, which defines how the input data flows through those layers.

In this snippet of code from `train_cifar.py`, we define the bare skeleton of the CNN object. It has a few constructor arguments that define the shape of the input which we store in an `ImageShape` data structure.

```python
from typing import NamedTuple

from torch import nn


class ImageShape(NamedTuple):
    height: int
    width: int
    channels: int


class CNN(nn.Module):
    def __init__(self,
                 height: int = 32,
                 width: int = 32,
                 channels: int = 3,
                 class_count: int = 10):
        super().__init__()  # required before registering layers on an nn.Module
        self.input_shape = ImageShape(height, width, channels)
        self.class_count = class_count
        ...
```

Our first layer sits in between the input tensor and the output tensor. The receptive field of one of the convolutional filters is visualised by the orange cube in the input. Once the filter has been convolved with the input at one position, it produces a single value in the output tensor, indicated by the tip of the orange pyramid. The depth (horizontal) of the output tensor indicates the number of convolutional filters of the layer. In this case that is 32, i.e. there are 32 different orange cubes (different filter weights) convolved with the input.

<img alt="First conv layer" src="./media/cnn-ex-2.png" style="max-height: 200px;">

We define the convolutional layer as an attribute in the constructor, and pass the `images` through it during the `forward` pass.

Here is a webpage that visualises how the convolution layers, filters, and activation functions work: [https://poloclub.github.io/cnn-explainer/#:~:text=Understanding%20Hyperparameters](https://poloclub.github.io/cnn-explainer/#:~:text=Understanding%20Hyperparameters)

```python
from torch.nn import functional as F


class CNN(nn.Module):
    def __init__(self,
                 height: int = 32,
                 width: int = 32,
                 channels: int = 3,
                 class_count: int = 10):
        ...
        self.conv1 = nn.Conv2d(
            in_channels=self.input_shape.channels,
            out_channels=32,
            kernel_size=(5, 5),
            padding=(2, 2),
        )
        self.initialise_layer(self.conv1)
        ...
        
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        ...
        x = F.relu(self.conv1(images))
        ...
        
        
    @staticmethod
    def initialise_layer(layer):
        if hasattr(layer, "bias"):
            nn.init.zeros_(layer.bias)
        if hasattr(layer, "weight"):
            nn.init.kaiming_normal_(layer.weight)
```

Note that we apply the ReLU function in the forward pass. It is defined in the `torch.nn.functional` module (traditionally imported with the alias `F`).
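
To see what this first layer does to tensor shapes, here is a minimal standalone sketch that runs outside `train_cifar.py`. The random `images` tensor is only a stand-in for a real CIFAR-10 batch, and the variable names are chosen for illustration.

```python
import torch
from torch import nn
from torch.nn import functional as F

# A stand-in batch of 4 random "images" in NCHW layout (not real CIFAR-10 data).
images = torch.randn(4, 3, 32, 32)

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(5, 5), padding=(2, 2))
activation = F.relu(conv(images))

print(conv.weight.shape)               # one 3x5x5 filter per output channel
print(activation.shape)                # padding of 2 keeps the 32x32 spatial size
print(bool((activation >= 0).all()))   # ReLU clamps negative values to zero
```

The weight tensor holds 32 filters, each spanning all 3 input channels, which corresponds to the 32 "orange cubes" in the figure.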

We have defined a method `initialise_layer` that initialises the layer's parameters. The `bias` is initialised with zeros, and for the `weight` attribute we use [`kaiming_normal_`](https://pytorch.org/docs/1.2.0/nn.init.html#torch.nn.init.kaiming_normal_) to initialise the weight matrix with values from $\mathcal{N}(0, \sigma^2)$ where $\sigma$ is dependent upon the number of inputs to the layer.
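
As a rough illustration of how $\sigma$ depends on the number of inputs: with the default arguments of `kaiming_normal_` (fan-in mode and a gain of $\sqrt{2}$), the standard deviation is $\sigma = \sqrt{2 / \text{fan\_in}}$. The plain-Python sketch below assumes those defaults. `kaiming_normal_std` is a made-up helper name, not part of PyTorch.

```python
import math


def kaiming_normal_std(fan_in: int) -> float:
    """Standard deviation used by Kaiming-normal init, assuming default arguments."""
    return math.sqrt(2.0 / fan_in)


# fan_in is the number of inputs feeding one output unit.
conv1_std = kaiming_normal_std(3 * 5 * 5)   # 3 input channels, 5x5 kernel
fc1_std = kaiming_normal_std(4096)          # 4096 flattened input features

print(round(conv1_std, 4), round(fc1_std, 4))
```

Layers with more inputs get smaller initial weights, which keeps the scale of their outputs roughly comparable.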

Next we halve the spatial dimensions of the output of the convolutional layer by 'pooling' it: we place a $2 \times 2$ grid at each position in the input and take the max value within this grid. The purple grid in the figure slides across the full tensor. Note that the channel dimension stays the same: that's because pooling is applied to *each* channel independently.

<img alt="First pooling layer" src="./media/cnn-ex-3.png" style="max-height: 200px;">

```python
from torch.nn import functional as F


class CNN(nn.Module):
    def __init__(self,
                 height: int = 32,
                 width: int = 32,
                 channels: int = 3,
                 class_count: int = 10):
        ...
        self.pool1 = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        ...

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        ...
        x = F.relu(self.conv1(images))
        x = self.pool1(x)
        ...
```
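
To make the pooling operation concrete, here is a small sketch on a single-channel $4 \times 4$ tensor. It uses the functional form `F.max_pool2d` rather than the `nn.MaxPool2d` layer, purely to keep the example short.

```python
import torch
from torch.nn import functional as F

# One image, one channel, 4x4 pixels holding the values 0..15 row by row.
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled = F.max_pool2d(x, kernel_size=2, stride=2)

print(x[0, 0])
print(pooled[0, 0])   # the max of each non-overlapping 2x2 block
print(pooled.shape)   # spatial size halved, channel count unchanged
```

Each $2 \times 2$ block of the input collapses to its largest value, so the output is $2 \times 2$.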

If you are using Google Colaboratory, you need to copy-paste the contents of the file into your Colaboratory notebook.
For lab machines, you already have a GPU so you can test your code as you implement the subsequent layers.

Use the following commands to test the first two layers of the network:

```bash
$ cd lab2-code
$ python train_cifar.py
```

**BEWARE**: the code won't yet run without crashing as you need to implement the rest of the network and training process; what we've provided is just skeleton code!

It's your turn now to implement the remaining CNN layers. We're going to help you approach this in a step-by-step manner so you can verify your progress as you add each layer. To accomplish this, we'll print out the output shape of the network as we add each layer. This is useful because it shows you the output shape of the last layer you implemented, and consequently the input shape of the next one you need to implement.

**Task 1:** Within the training loop in the `Trainer.train` method, compute the model's forward pass using `self.model` on the `batch`. `batch` contains a $N \times 3 \times 32 \times 32$ batch of CIFAR-10 images. Assign the result of the forward pass to a local variable named `output`. Finally, print the output shape, and quit the program.

```python
## TASK 1: Compute the forward pass of the model, print the output shape
##         and quit the program
output = self.model(batch)
print(output.shape)
import sys; sys.exit(1)
```            

Run your code with the command `python train_cifar.py`. Your program should print the output shape of the first pooling layer:

```python
torch.Size([128, 32, 16, 16])
```

The data layout is in `NCHW` format, where 

- `N` is the batch size
- `C` is the channel depth
- `H` is the height
- `W` is the width

The batch size is 128 in this case, the spatial dimensions are $16 \times 16$, and the channel depth is 32.

**Task 2:** Now add the second convolutional layer. It has a $5 \times 5$ kernel, padding of $2 \times 2$, and 64 output channels (see the table of output shapes above).
Don't forget to pass the output through `relu` in the `forward` function. Run the code and verify the output shape against the diagram.

<img alt="Second conv layer" src="./media/cnn-ex-4.png" style="max-height: 200px;">

You'll need to add code below the following comments in your script
```python
## TASK 2-1: Define the second convolutional layer and initialise its parameters.
## Hint: copy the code for conv1, changing the name as well as the arguments for in_channels and out_channels. Also remember to initialise this layer!
``` 
and
```python
## TASK 2-2: Pass x through the second convolutional layer
## Hint: Don't forget to pass it through a relu after the convolution. 
```

**Task 3:** Next add the second pooling layer. Again you need to define this layer at `## Task 3-1` before calling this layer at `## Task 3-2`.  Run the code and verify the output shape.

<img alt="Second pooling layer" src="./media/cnn-ex-5.png" style="max-height: 200px;">

**Task 4:** Flatten the tensor produced by the second pooling layer from $64 \times 8 \times 8$ to $4096$ using [`torch.flatten`](https://pytorch.org/docs/1.2.0/torch.html). Take special note of the `start_dim` kwarg: you need to set this to 1 (the default is 0) to avoid flattening the whole batch. PyTorch layers expect a tensor of shape $N \times \ldots$ where $N$ is the batch size, so by setting `start_dim` to 1 we flatten each image, not the whole batch. Run the code and check that your output is a 2D tensor: the first dimension should be the batch size and the second should be 4096.
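
The effect of `start_dim` can be checked in isolation. In this sketch a tensor of zeros stands in for the output of the second pooling layer, and a batch size of 4 is used just to keep the numbers small.

```python
import torch

# Stand-in for the second pooling layer's output: N=4, C=64, H=8, W=8.
features = torch.zeros(4, 64, 8, 8)

per_image = torch.flatten(features, start_dim=1)   # keeps the batch dimension
whole_batch = torch.flatten(features)              # default start_dim=0 merges everything

print(per_image.shape)
print(whole_batch.shape)
```

Only the first form keeps one row per image, which is what the fully connected layer needs.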

<img alt="Flattened convolutional features" src="./media/cnn-ex-6.png" style="max-height: 300px;">

**Task 5:** Now take the flattened features and pass them through a fully connected layer (a.k.a. a [`Linear`](https://pytorch.org/docs/1.2.0/nn.html#torch.nn.Linear) layer in PyTorch) mapping the $4096$ features to $1024$. Copy the code:

```python
## TASK 5-1: Define the first FC layer and initialise its parameters
self.fc1 = nn.Linear(4096, 1024)
self.initialise_layer(self.fc1)
```
Now add code to use this layer in the forward pass under the following comment (don't forget the ReLU activation function after the fully connected layer):
```python
## TASK 5-2: Pass x through the first fully connected layer
```
Run the code and check the output shape is `(128, 1024)`.

<img alt="First FC layer" src="./media/cnn-ex-7.png" style="max-height: 300px;">

**Task 6:** Add the final fully connected layer that maps from the $1024$ features to the number of classes. This layer produces our [*logits*](https://developers.google.com/machine-learning/glossary/#logits), the unbounded class scores (**do NOT** use ReLU after this layer). Run the code and check the output shape is `(128, 10)`.
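
With all four parameterized layers in place, it can be instructive to count the learnable parameters. The plain-Python sketch below assumes every layer keeps its bias term (the PyTorch default) and that the second convolution uses a $5 \times 5$ kernel with 64 filters. `conv_params` and `linear_params` are helper names invented for this example.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus one bias per output channel for a square-kernel convolution."""
    return in_channels * out_channels * kernel * kernel + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit for a fully connected layer."""
    return in_features * out_features + out_features


counts = {
    "conv1": conv_params(3, 32, 5),
    "conv2": conv_params(32, 64, 5),
    "fc1": linear_params(4096, 1024),
    "fc2": linear_params(1024, 10),
}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count}")
print(f"total: {total}")
```

Under these assumptions the first fully connected layer holds roughly 98% of all parameters, even though the convolutional layers do most of the feature extraction.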

<img alt="Final FC layer" src="./media/cnn-ex-8.png" style="max-height: 300px;">

---
# Training on CIFAR-10

Now that you've defined your network we can go ahead and train it. To do so we need a loss function and an optimizer. For our loss function we'll use softmax cross entropy, the standard loss function for single-label classification tasks. For our optimizer we'll use Stochastic Gradient Descent (SGD). 

**Task 7:** Rename the `output` variable in the training loop of the `Trainer.train` method, where you store your model output, to `logits`, as the model now produces a logit vector of class scores; the subsequent code in `Trainer.train` depends on this variable.

Also remove the code you previously wrote to print the output and exit:
```python
print(output.shape)
import sys; sys.exit(1)
```

**Task 8:** In the `main(args)` function, replace
```python
## TASK 8: Redefine the criterion to be softmax cross entropy
criterion = lambda logits, labels: torch.tensor(0)
```
with
```python
## TASK 8: Redefine the criterion to be softmax cross entropy
criterion = nn.CrossEntropyLoss()
```
This defines the softmax cross entropy loss [`nn.CrossEntropyLoss`](https://pytorch.org/docs/1.2.0/nn.html#torch.nn.CrossEntropyLoss). You will need to remove the dummy loss code `lambda logits, labels: torch.tensor(0)`, which was only there so you could run your code in previous steps to check your progress so far.
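
To illustrate what "softmax cross entropy" computes, the sketch below compares `nn.CrossEntropyLoss` with a manual calculation: take the log-softmax of the logits, pick the entry for the correct class, negate, and average over the batch. The logits and labels are made-up values for a 3-class problem, not output from the lab network.

```python
import torch
from torch import nn
from torch.nn import functional as F

# Made-up logits for 2 examples and 3 classes, with their true labels.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)

# Manual version: negative log-probability of the correct class, batch-averaged.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual.item())
```

Because the softmax is applied inside the loss, the network itself should output raw logits.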

**Task 9:** Back in the training loop, replace the dummy loss function
```python
## TASK 9: Compute the loss using self.criterion and
##         store it in a variable called `loss`
loss = torch.tensor(0)
```
with your loss computed using `self.criterion`. This takes two arguments: the `logits` and the local variable `labels` (a 1D vector of length $N$ containing the labels corresponding to each example in the batch).

Run the code and check the loss printed out matches chance performance. Chance accuracy is 10% when randomly selecting one out of 10 classes. You don't need to run the code to completion; you can kill it by pressing `<Ctrl>-<C>` after you've checked performance for a few batches.
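
As a reference point for "chance performance": if a model assigned equal probability to each of the 10 classes, the cross entropy loss would be $-\ln(1/10) = \ln 10$. The snippet below just evaluates that number. An untrained network only approximates uniform predictions, so expect the printed loss to be in this region rather than exactly equal to it.

```python
import math

class_count = 10
chance_accuracy = 1 / class_count
chance_loss = math.log(class_count)   # cross entropy of a uniform prediction

print(f"chance accuracy: {chance_accuracy:.2f}, chance loss: {chance_loss:.4f}")
```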

We have **not** yet trained our model. This was simply the forward pass.
 
**Task 10:** Compute the backward pass (this populates the gradient buffers of your network parameters) by calling `backward` on your `loss` variable (i.e. `loss.backward()`). 

**Task 11:** Initialise the variable named `optimizer` in the `main` function to a [`torch.optim.SGD`](https://pytorch.org/docs/1.2.0/optim.html#torch.optim.SGD) object using the [parameters of your model](https://pytorch.org/docs/1.2.0/nn.html#torch.nn.Module.parameters) and the learning rate stored in `args.learning_rate`.

**Task 12:** Update your network's parameters by calling [`self.optimizer.step()`](https://pytorch.org/docs/1.2.0/optim.html#torch.optim.Optimizer.step) and zero-out your gradient buffers using [`self.optimizer.zero_grad()`](https://pytorch.org/docs/1.2.0/optim.html#torch.optim.Optimizer.zero_grad) in `Trainer.train`. Run your code and check that your model's accuracy during training and testing increases and the loss decreases.
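
The order of operations in a single training step can be sketched on a toy problem. Here a small `nn.Linear` model and random data stand in for the CNN and CIFAR-10, and the learning rate of 0.1 is an arbitrary choice for the example rather than the value in `args.learning_rate`.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                       # toy stand-in for the CNN
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4)                    # random stand-in batch
labels = torch.randint(0, 3, (8,))

before = model.weight.detach().clone()
loss = criterion(model(inputs), labels)
loss.backward()          # populate .grad on every parameter
optimizer.step()         # apply the SGD update using those gradients
optimizer.zero_grad()    # clear gradients so they don't accumulate into the next step
after = model.weight.detach().clone()

print(torch.equal(before, after))   # the weights have moved
```

Skipping `zero_grad()` would make each call to `backward()` add to the gradients left over from the previous batch.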

As you watch the model train, you should see the accuracy increasing from chance (10%) to around **65%** by the end of epoch 20. Note the difference between your `batch accuracy` and your overall `accuracy` on the full test set.

```
epoch: [19], step: [390/391], batch loss: 0.76462, batch accuracy: 72.50, data load time: 0.00019, step time: 0.00363
validation loss: 0.95978, accuracy: 66.66
```

For example, at the end of my training, the batch loss for the last batch was 0.76 and the batch accuracy 72.5%.
However when testing the model on the full test set, the accuracy was 66.6% with the loss 0.96.

Q. Why is the batch accuracy different from the overall accuracy?

A. The batch accuracy is computed on a small subset of the data (128 images) whereas the overall accuracy is computed on the full test set (10,000 images).
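
Both figures are computed the same way and differ only in which examples they cover. A minimal sketch of the computation is shown below, with made-up logits for a 2-class problem. The helper name `accuracy` is illustrative and not necessarily how the provided script computes it.

```python
import torch


def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = logits.argmax(dim=-1)
    return (predictions == labels).float().mean().item()


# Four made-up examples; the third one is misclassified.
logits = torch.tensor([[3.0, 0.1], [0.2, 1.5], [2.0, 0.3], [0.4, 0.9]])
labels = torch.tensor([0, 1, 1, 1])

batch_accuracy = accuracy(logits, labels)
print(batch_accuracy)   # 3 of 4 correct
```

With only 128 examples per batch, a handful of extra mistakes shifts the batch accuracy by several percentage points, which is why it fluctuates more than the full test set figure.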

Q. If you re-run the training do you get the same training accuracy? If not, why? What about the test accuracy?

A. No. In each iteration, the network predicts the classes for a different batch of examples (a set of 128 randomly selected images). The data loader shuffles the data, so the network sees it in a different order on each run. The test accuracy should be more stable as it is computed on the full test set; however, it may still vary slightly due to the random initialisation of the network weights and the randomness of SGD.
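
One way to reduce this run-to-run variation is to fix the random seed before creating the model and data loaders. The sketch below only demonstrates that `torch.manual_seed` makes random draws repeatable. It is not part of the provided script, `sample_weights` is a made-up helper, and fully reproducible training on a GPU may need further settings.

```python
import torch


def sample_weights(seed: int) -> torch.Tensor:
    """Draw a small random tensor after fixing PyTorch's global seed."""
    torch.manual_seed(seed)
    return torch.randn(3, 3)


same = torch.equal(sample_weights(0), sample_weights(0))
different = not torch.equal(sample_weights(0), sample_weights(1))

print(same, different)
```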

Q. Let's say you want to use the model deterministically in some real-world application. How do you think that is possible?

A. We could _save_ the model parameters and then load them later to use the _same_ parameters without running training again. This is called _checkpointing_.
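
A minimal sketch of checkpointing with `state_dict` is shown below. A toy `nn.Linear` model stands in for the trained CNN, and an in-memory buffer stands in for a checkpoint file on disk; in practice you would pass a file path to `torch.save` and `torch.load` instead.

```python
import io

import torch
from torch import nn

model = nn.Linear(4, 3)   # toy stand-in for the trained CNN

buffer = io.BytesIO()     # in-memory stand-in for a checkpoint file
torch.save(model.state_dict(), buffer)

# Later: rebuild the same architecture and load the saved parameters into it.
buffer.seek(0)
restored = nn.Linear(4, 3)
restored.load_state_dict(torch.load(buffer))
restored.eval()

inputs = torch.randn(2, 4)
with torch.no_grad():
    same_outputs = torch.equal(model(inputs), restored(inputs))

print(same_outputs)
```

The restored model gives identical outputs for the same inputs, without any retraining.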

---

In addition to the metrics being printed to the console, the code we've provided also logs values to a tensorboard log directory. Similarly to lab-1, we're going to launch a tensorboard server and visualise the training and validation curves.

Each time you run your code, a new subdirectory containing that run's logs will be saved within the `logs` directory. You will already see a bunch of subdirectories within `logs` from when you were testing your code earlier as you built your network.
Open a new terminal and navigate to the directory containing the `logs` directory.

Now, run the following commands 
```bash
module load anaconda
python3 -m tensorboard.main --logdir=/path/to/logs
```
in this terminal.


**Note:** If you get an error such as `tensorboard: command not found`, run
```console
pip install tensorboard
```
to install it.

Open up http://localhost:6006. Have a look at the loss and accuracy curves plotted for both training and test.

The x-axis is the number of steps we've trained for.

By default tensorboard smooths your data by computing a running average. You can adjust this smoothing using the slider in the left side bar. We'd recommend turning this off to begin with as the smoothing can be deceptive and hide issues with training.

Make sure your tensorboard settings match those in the screenshot below:

![Tensorboard settings](./media/tensorboard-settings.png)

Your network should produce a similar result to the graphs shown below:

If you see multiple curves, it's because you've trained the model multiple times already. Each run produces a new log folder within `logs`; clear all but the most recent one and rerun tensorboard.

![Expected accuracy](./media/expected-accuracy.png) ![Expected loss](./media/expected-loss.png)

**Task 13:** Run `python train_cifar.py --help` and investigate the hyperparameters that you can tweak. Pick one hyperparameter to change, run the code again and look at how the loss and accuracy curves differ.