# Lab 3: Techniques for training DNNs


## Contents

1. [Tuning Learning Rate](#Tuning-Learning-Rate)
2. [Tuning Batch Size](#Tuning-Batch-Size)
3. [Batch Normalisation](#Batch-Normalisation)
4. [SGD with Momentum](#SGD-with-Momentum)
5. [Addendum 1: A note on Batch Normalisation](#Addendum-1:-A-note-on-Batch-Normalisation)
6. [Addendum 2: Learning Rate Decay](#Addendum-2:-Learning-Rate-Decay)
7. [Addendum 3: CHALLENGE! Implementing BatchNorm2d](#Addendum-3:-CHALLENGE!-Implementing-BatchNorm2d)
---

**You will be continuing with the code from the last lab. You may want to make a separate copy of it first.**

Last lab session you built your first CNN and trained it on CIFAR-10. Let's revisit
the loss and accuracy curves we got last time.

![Last lab-session's loss curve](../lab-2-cnns/media/expected-loss.png)

*Legend:*
- Train: Orange
- Test: Red

![Last lab-session's accuracy curve](../lab-2-cnns/media/expected-accuracy.png)

*Legend:*
- Train: Grey
- Test: Blue

What can we learn by looking at these graphs? 

1. The **training accuracy** (~70%) could be higher; this dataset has 10 classes, so chance performance would be 10%. *Make sure you distinguish between the training curve and the test/validation curve for both loss and accuracy*. This indicates that our model is underfitting the training data. Our training performance, in normal circumstances, is typically an upper bound of the accuracy we can expect on the test set. In fact, as a rule of thumb, it is a good idea to overfit your training data, then regularise the model to reduce the train-validation gap in order to generalise to future data.

2. The **training loss** decreases quite slowly, so perhaps our learning rate is too low.

3. The **test loss** is close to the training loss. This is good: it means that our model generalises well from our training data. Towards the end of training it does seem to overfit a bit, but the **test accuracy** is still increasing, which is our primary goal.


Based on the above, we will first attempt to adjust the learning rate.

---
## Tuning Learning Rate

We've been training our networks with stochastic gradient descent (SGD), which updates the parameters $\mathbf{W}$ according to the update rule

$$\mathbf{W}_{t+1} = \mathbf{W}_t - \eta\nabla_{W_t} \mathcal{L}(\mathbf{x}, \mathbf{W}_t)$$

Where
- $\mathbf{x}$ is the mini-batch of inputs to the network.
- $\mathbf{W}_t$ are the network's weights at time $t$.
- $\mathcal{L}$ is the loss function.
- $\nabla_{W_t}\mathcal{L}$ is the matrix containing the partial derivatives of the loss function with respect to the parameters $\mathbf{W}_t$.
- $\eta$ is the learning rate.
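
To make the update rule concrete, here is a minimal sketch that applies it by hand and compares the result with one step of `torch.optim.SGD`. The toy weight matrix, random mini-batch and squared-error loss are assumptions chosen for illustration; they are not the lab's CNN or loss.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (assumptions): a single weight matrix and a random mini-batch.
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(8, 3)
target = torch.randn(8, 2)
eta = 0.1  # learning rate

# Reference: let torch.optim.SGD perform one update on a copy of W.
W_ref = W.detach().clone().requires_grad_(True)
optimizer = torch.optim.SGD([W_ref], lr=eta)
loss_ref = ((x @ W_ref - target) ** 2).mean()
loss_ref.backward()
optimizer.step()

# Manual: W_{t+1} = W_t - eta * gradient of the loss w.r.t. W_t
loss = ((x @ W - target) ** 2).mean()
loss.backward()
with torch.no_grad():
    W_next = W - eta * W.grad

# The two updated weight matrices should agree up to floating-point error.
print(torch.allclose(W_next, W_ref))
```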

When picking a learning rate, it's best to search exponentially varying values, e.g. 0.001, 0.005, 0.01, 0.05, 0.1.
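
One way to build such a grid is sketched below. The helper name `lr_grid` is hypothetical, and launching the script directly with `python` is an assumption about your setup; the loop only prints the commands rather than running them.

```python
def lr_grid(min_exp, max_exp):
    """Return 1eK and 5eK for each exponent K, stopping at 1e<max_exp>."""
    grid = []
    for e in range(min_exp, max_exp + 1):
        grid.append(f"1e{e}")
        if e < max_exp:
            grid.append(f"5e{e}")
    return grid

grid = lr_grid(-3, -1)  # the five example values listed above

# Print one command per learning rate (nothing is executed here).
for lr in grid:
    print(f"python train_cifar.py --learning-rate {lr}")
```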

**Task**: The default learning rate in the code is 1e-2. Try training with 5e-2 and 1e-1 by running `train_cifar.py` with `--learning-rate <lr>` (replace `<lr>` with `5e-2` or `1e-1`). Which learning rate gives you the best results?

We found setting the learning rate to 1e-1 to be the most effective, boosting the test accuracy by ~7%. Let's examine the learning curves this time to see what we can infer from them.

![LR=1e-1 loss curve](./media/loss-curve-bs=128-lr=1e-1.png) ![LR=1e-1 accuracy curve](./media/accuracy-curve-bs=128-lr=1e-1.png)

These graphs paint a very different picture from the ones at the top. The training accuracy saturates at 100%, so the network performs incredibly well on the training set. However, the test accuracy plateaus at ~70% and the test loss is increasing. This means our network has *overfitted* to the training data: its performance on the training data isn't reflective of its performance on unseen data.
We can use regularisation techniques to reduce this effect and try to close the gap between train and test performance.

Before moving on, for the sake of completeness, we present the results of scanning over a range of learning rate values and the final test set accuracy we achieved.


| Learning rate | Test accuracy (%) |
|---------------|-------------------|
| 1e-4 | 20.52 |
| 5e-4 | 31.35 |
| 1e-3 | 42.32 |
| 5e-3 | 54.47 |
| 1e-2 | 63.40 |
| 5e-2 | 72.04 |
| 1e-1 | 73.39 |
| 2e-1 | 71.19 |
| 5e-1 | 26.66 |

If you were to attempt to reproduce these results, you would not produce the same numbers due to the random initialisation of weights and contents of batches. However you should observe the same trend.

**Q.** Why do you think the network's performance dropped so much when using 5e-1 (0.5) for the learning rate?

---
## Tuning Batch Size

Another fundamental hyperparameter in training is the batch size, i.e. the number of examples that we propagate through the network at each iteration. The risk of smaller batches is that they aren't sufficiently representative of the entire dataset, so the parameter updates (particularly with a large learning rate) overfit that batch. Larger batches with a fixed number of epochs result in fewer weight updates, as the number of iterations in an epoch is reduced.

**Task:** By default the code provided uses a batch size of 128. Using your new learning rate (1e-1), try varying the batch size by specifying the `--batch-size <bs>` argument when running `train_cifar.py`. What happens when you use batches of 64 or 256 images?

Again, for completeness, we provide a comprehensive evaluation of batch sizes when holding the learning rate at 1e-1.


| Batch size | Test accuracy (%) | Number of steps | Time (s) |
|------------|-------------------|-----------------|----------|
| 1          | 10.00             | 1M              | 2366     |
| 2          | 10.00             | 500k            | 1193     |
| 4          | 10.00             | 250k            | 601      |
| 8          | 37.51             | 125k            | 323      |
| 16         | 51.69             | 62.5k           | 288      |
| 32         | 59.08             | 31.25k          | 190      |
| 64         | 72.95             | 15.63k          | 144      |
| 128        | 71.21             | 7.81k           | 116      |
| 256        | 68.78             | 3.91k           | 94       |
| 512        | 68.57             | 1.95k           | 78       |
| 1024       | 67.33             | 0.98k           | 74       |

Note the change in runtime depending on batch size. We can complete the same number of epochs more quickly with larger batches than with smaller ones, as we make better use of the GPU's parallel processing ability.
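
The "Number of steps" column follows from simple arithmetic, sketched below. It assumes the 50,000-image CIFAR-10 training split and a 20-epoch run; the helper `num_steps` is hypothetical. The table's figures match 50,000 × 20 / batch size without rounding, whereas a data loader that keeps the final partial batch of each epoch performs slightly more steps, which is what this sketch computes by default.

```python
import math

NUM_TRAIN_IMAGES = 50_000  # size of the CIFAR-10 training split
NUM_EPOCHS = 20            # assumption: the epoch count behind the table

def num_steps(batch_size, drop_last=False):
    """Total number of parameter updates over the whole run."""
    per_epoch = NUM_TRAIN_IMAGES / batch_size
    # Keep (ceil) or discard (floor) the final, smaller batch of each epoch.
    per_epoch = math.floor(per_epoch) if drop_last else math.ceil(per_epoch)
    return per_epoch * NUM_EPOCHS

for bs in (1, 64, 128, 1024):
    print(bs, num_steps(bs))
```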

With a batch size of 4 examples or fewer, the network ends up always predicting the same class (10% accuracy is chance level for 10 classes). This suggests the learning rate is far too high for this number of examples per batch and causes the loss to diverge.
The accuracy peaks at a batch size of 64 and then starts to decrease. This suggests that for the largest batch sizes, there aren't enough parameter updates to train the network to convergence. We can validate this hypothesis by comparing the loss curves from the BS=64 experiment and the BS=1024 experiment.


Batch-Size: 64
![Loss curve BS=64](./media/loss-curve-bs=64-lr=1e-1.png)

Batch-Size: 1024
![Loss curve BS=1024](./media/loss-curve-bs=1024-lr=1e-1.png)

The network with BS=1024 is still training as the training loss hasn't plateaued like in the BS=64 experiment. There is strong overfitting in the BS=64 experiment as you can see an increasing gap between training and test loss. The larger batches have a regularising effect on the training of the network.

---

## Batch Normalisation

The distribution of each layer's input, i.e. either the input data itself or the output from the previous hidden layer, plays a crucial role when training. This distribution varies during training. This phenomenon is known as *internal covariate shift*.

Optimisation would be easier if these distributions stayed fixed while the model's parameters are adjusted during training.

This is what batch normalisation does by normalising layer inputs. It was proposed in 2015 by Sergey Ioffe and Christian Szegedy at Google in [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf). Let's see how it works. 

From the paper:

> By fixing the distribution of the layer inputs as the training progresses, we expect to improve the training speed. It has been long known that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.

Consequently they propose normalising the input $x_i$ to each layer on a per channel basis (they skip decorrelating inputs).

$$ \hat{x}_{i, c, x, y} = \frac{x_{i, c, x, y} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$$

Where 
- $x$ is the output from the previous layer
- $\hat{x}$ is the input to the subsequent layer
- $i, c, x, y$ are the batch, channel, width, and height indices
- The mean $\mu_c$ and variance $\sigma_c^2$ are computed over each training batch
- $\epsilon$ is a small constant for numerical stability.

One issue with batch normalisation is that whilst it speeds up convergence, it restricts the possible functions the neural network can approximate. To restore the representational power of the network, two new parameters per channel are learnt: a multiplier $\gamma_c$ and a bias $\beta_c$ to adapt the mean and variance of the normalised input. Refer to your lectures for more details. 

$$ \hat{x}_{i, c, x, y} = \gamma_c \frac{x_{i, c, x, y} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c $$
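
As a sanity check of the formula, the sketch below normalises a random tensor by hand and compares it with `nn.BatchNorm2d` in training mode. The tensor sizes are arbitrary assumptions, and $\gamma_c = 1$, $\beta_c = 0$ are used because those are the values a freshly constructed PyTorch layer starts from. Note that it gives away part of the optional challenge in Addendum 3.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8)  # PyTorch layout: (batch, channel, height, width)
eps = 1e-5

# Per-channel statistics, computed over the batch and both spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

gamma = torch.ones(1, 3, 1, 1)   # scale, initially 1
beta = torch.zeros(1, 3, 1, 1)   # shift, initially 0
manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta

bn = nn.BatchNorm2d(3, eps=eps)  # new modules start in training mode
print(torch.allclose(manual, bn(x), atol=1e-5))
```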

Batch normalisation is typically added **after** each convolutional or fully connected layer and **before** the activation function.


**Task**: 
- Add [`BatchNorm2d`](https://pytorch.org/docs/1.2.0/nn.html#batchnorm2d) layers after each convolutional layer and a [`BatchNorm1d`](https://pytorch.org/docs/1.2.0/nn.html#batchnorm1d) layer after your first fully connected layer (you wouldn't put one on your output fully connected layer).
- Remember you should define the layers in the `__init__` method of your `CNN` class and then call them in the `forward` pass after performing the convolution and before the activation function. Each BatchNorm layer will have different input dimensions and will need to learn different parameters, so it is important that you define separate BatchNorm layers in your `__init__` method.
- Change `tb_log_dir_prefix` in `get_summary_writer_log_dir` to read `f'CNN_bn_bs={args.batch_size}_lr={args.learning_rate}_run_'` (this way we can distinguish our logs from the batch normalised networks and the un-normalised networks).
- Train your network with BS=128 and LR=1e-1. Expect to get an accuracy of ~74%.
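
The layer placement described in the task can be sketched as follows. `TinyBNNet` and its layer sizes are hypothetical and deliberately smaller than the lab's `CNN`; only the ordering (convolution or linear layer, then batch norm, then activation) is the point.

```python
import torch
from torch import nn
from torch.nn import functional as F

class TinyBNNet(nn.Module):
    """Hypothetical example network; sizes are illustrative, not the lab's CNN."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5, padding=2)
        self.bn1 = nn.BatchNorm2d(32)      # sized to conv1's output channels
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(32 * 16 * 16, 64)
        self.bn_fc1 = nn.BatchNorm1d(64)   # sized to fc1's output features
        self.fc2 = nn.Linear(64, 10)       # output layer: no batch norm

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # conv -> BN -> activation
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.bn_fc1(self.fc1(x)))            # linear -> BN -> activation
        return self.fc2(x)

logits = TinyBNNet()(torch.randn(4, 3, 32, 32))  # a fake batch of CIFAR-sized images
print(logits.shape)
```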

If we compare the loss of the batch-normalised network with the previous version without batch normalisation, we can see that it converges much faster. Interestingly, it also does not overfit as much: batch normalisation has a regularising effect.

![BN vs. Non-BN network loss](./media/loss-bn-vs-non-bn.png)

The BN network also performs better (~2%) in terms of accuracy.

![BN vs. Non-BN network accuracy](./media/accuracy-bn-vs-non-bn.png)

**NOTE:** Recent research casts doubt on the reason for batch normalisation's success, see [Addendum 1](#Addendum-1:-A-note-on-Batch-Normalisation) for more information on this (*optional*).

---
## SGD with Momentum

A useful extension to SGD is momentum.
We can think of the surface of the loss function we're optimising as a landscape. 
Each point on that landscape corresponds to a set of network parameters.
In SGD, we compute the steepest direction down from our current position and take a step in that direction iteratively until we reach one of the minima.
An alternative approach is to simulate a physical ball rolling across this landscape. 
As the ball rolls down the hill, it picks up momentum: it is not only the steepest direction that defines the ball's trajectory, but also its current velocity.
We can do exactly the same thing with our network during the optimisation process.
We give our ball (representing the network parameters) an initial velocity of 0, then compute its acceleration as a step in the steepest direction down.
We update the ball's velocity according to its previous velocity and its current acceleration and then update its position.

Practically, let's compare the implementation of these two strategies: 

Without momentum, i.e. normal SGD, we compute the steepest direction, and take a step of size $\eta$ (the learning rate) in the opposite direction

$$\mathbf{W}_{t+1} = \mathbf{W}_t - \eta\nabla_{W_t} \mathcal{L}(\mathbf{x}, \mathbf{W}_t)$$

With momentum, we simulate a ball moving in the loss landscape. It starts from being stationary, i.e. its velocity is 0.

$$ v_0 = 0 $$

We then update its velocity according to its acceleration, the gradient of the surface (akin to gravity).

$$ v_{t + 1} = \mu v_t - \eta\nabla_{W_t} \mathcal{L}(\mathbf{x}, \mathbf{W}_t) $$

where

- $v_t$ represents the ball's velocity at time $t$
- $\mu$ is the *momentum* coefficient, typically set to somewhere around 0.9. 
  When $\mu = 0$ the ball has no momentum (as in normal SGD) and when $\mu = 1$ the ball does not suffer friction.

We update the ball's position according to its new velocity $v_{t+1}$

$$ \mathbf{W}_{t + 1} = \mathbf{W}_t + v_{t + 1} $$
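
A minimal sketch comparing the two strategies on a one-dimensional toy loss $\mathcal{L}(w) = \frac{1}{2}w^2$ is given below. The loss, the step count and the function names are assumptions for illustration. The sketch moves the position with the freshly updated velocity, which is one common convention. PyTorch's `torch.optim.SGD` uses a closely related formulation, in which the velocity accumulates raw gradients and the learning rate is applied when the position is updated.

```python
def grad(w):
    return w  # gradient of the toy loss L(w) = 0.5 * w**2

def run_sgd(w, lr, steps):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_momentum(w, lr, mu, steps):
    v = 0.0  # the ball starts stationary
    for _ in range(steps):
        v = mu * v - lr * grad(w)  # velocity update
        w = w + v                  # position update with the new velocity
    return w

w_sgd = run_sgd(1.0, lr=0.01, steps=100)
w_mom = run_momentum(1.0, lr=0.01, mu=0.9, steps=100)

# Distance from the minimum at w = 0 after 100 steps, without and with momentum.
print(abs(w_sgd), abs(w_mom))
```

With this small learning rate, plain SGD is still far from the minimum after 100 steps, while the momentum run ends much closer to it. Setting `mu=0` in `run_momentum` recovers plain SGD.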



**Task:** 
1. Locate the part of your code defining the SGD optimizer
   ```python
       optimizer = optim.SGD(model.parameters(), args.learning_rate)
   ```
2. Add an argument `momentum` to your [`SGD`](https://pytorch.org/docs/1.2.0/optim.html#torch.optim.SGD) optimizer, and set that to 0.9. 
3. Change `tb_log_dir_prefix` in `get_summary_writer_log_dir` to read 
   `f'CNN_bn_bs={args.batch_size}_lr={args.learning_rate}_momentum=0.9_run_'`.
4. Train the network with a batch size of 128, learning-rate of 1e-1. Expect to get a test accuracy of around 77%.

**NOTE** You could instead define an argument at the top of your file
```python
    parser.add_argument("--sgd-momentum", default=0, type=float)
```
and pass it to the `SGD` constructor. This allows you to change the momentum value (e.g. to 0.9) from the command line, similar to how you can control the learning rate and batch size.
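
Put together, the command-line variant might look like the sketch below. The `nn.Linear` model is a placeholder for the lab's CNN, and an explicit argument list stands in for the real command line so that the snippet runs on its own.

```python
import argparse

from torch import nn, optim

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", default=1e-2, type=float)
parser.add_argument("--sgd-momentum", default=0, type=float)

# Explicit list instead of sys.argv, purely for demonstration.
args = parser.parse_args(["--learning-rate", "1e-1", "--sgd-momentum", "0.9"])

model = nn.Linear(4, 2)  # placeholder for the lab's CNN
optimizer = optim.SGD(
    model.parameters(),
    lr=args.learning_rate,
    momentum=args.sgd_momentum,  # argparse turns --sgd-momentum into sgd_momentum
)
print(optimizer.param_groups[0]["momentum"])
```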

If we compare our loss and accuracy curves with and without momentum we can see that with momentum the network converges a little quicker and reaches a higher test accuracy.

![Loss curves: momentum vs. no momentum](./media/loss-momentum-vs-no-momentum.png)
![Accuracy curves: momentum vs. no momentum](./media/accuracy-momentum-vs-no-momentum.png)


# End of Lab 3

---
# Addendum 1: A note on Batch Normalisation

Whilst batch normalisation was originally motivated by the desire to reduce internal covariate shift, a [recent analysis](https://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf) of the technique suggests that its key benefit is to constrain activations to zero mean and unit variance. Without this constraint, very deep networks (e.g. 110-layer ResNets) suffer from increasingly large activations with depth in the network. These large activations cause the network to diverge (the loss increases without bound) during training. Consequently, deep networks without batch normalisation have to be trained with smaller learning rates than their batch-normalised equivalents. The larger learning rates help reduce overfitting of the network, speed up convergence, and tend to improve the final accuracy too!
If you're interested in learning more, read [Understanding Batch Normalization - Bjorck et al (NeurIPS 2018)](https://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf).

# Addendum 2: Learning Rate Decay

When you read deep learning papers, you will often encounter loss or accuracy curves like the following:

![lr-decay-example](./media/lr-decay.png)

The sudden drops in the training error are the result of applying learning rate decay. (Source: the [ResNet](https://arxiv.org/pdf/1512.03385.pdf) paper.)

Learning rate scheduling is an empirical technique for improving deep neural network training. Again, think of the surface of the loss function we're optimising as a landscape. The initial learning rate lets the ball, i.e. the network parameters, roll downhill quickly. As the descent goes on, the ball may oscillate around a local minimum if the learning rate is too big. A decayed learning rate at this stage can hence help the network converge to a local minimum.

PyTorch supports multiple learning rate schedulers. Here we will apply the step scheduler to our training.

**Task**: 
- After defining the `optimizer`, define a [`StepLR`](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html), which scales the learning rate by 0.1 every 20 epochs:
```python
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
```
- Add a `scheduler` argument to `Trainer.__init__` and pass the `scheduler` to it when instantiating `Trainer`.
- At the end of each training epoch, call `self.scheduler.step()`, which increments the scheduler's counter by one epoch. The scheduler will update the learning rate for the next epoch when necessary.
- Change `tb_log_dir_prefix` in `get_summary_writer_log_dir` accordingly.
- Previously we trained the network for 20 epochs; now train it for 40 epochs with BS=128 and LR=1e-1. As a result of using `StepLR`, the learning rate is 0.1 for the first 20 epochs and 0.01 for the next 20. Expect to get an accuracy of ~78%.
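
The effect of the scheduler on the learning rate can be checked in isolation with the sketch below. The `nn.Linear` model is a placeholder for the lab's CNN and the loop body is a stand-in for a real training epoch, so only the placement of `scheduler.step()` and the recorded learning rates are meaningful.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # placeholder for the lab's CNN
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

lr_per_epoch = []
for epoch in range(40):
    # Record the learning rate that is in effect during this epoch.
    lr_per_epoch.append(optimizer.param_groups[0]["lr"])
    # ... the per-batch forward/backward passes would run here ...
    optimizer.step()   # stand-in for the real per-batch updates
    scheduler.step()   # once per epoch, after the optimiser has stepped

# Learning rates in effect during epochs 0, 19, 20 and 39.
print(lr_per_epoch[0], lr_per_epoch[19], lr_per_epoch[20], lr_per_epoch[39])
```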

Verify the effect of the step scheduler by inspecting the loss and accuracy curves.

![loss_stepLR](./media/loss-stepLR.png)
![accuracy_stepLR](./media/accuracy-stepLR.png)

## Reading
- [HOW DOES LEARNING RATE DECAY HELP MODERN NEURAL NETWORKS?](https://openreview.net/pdf?id=r1eOnh4YPB#:~:text=Learning%20rate%20decay%20(lrDecay)%20is,help%20both%20optimization%20and%20generalization.)
- [Learning Rate Wikipedia](https://en.wikipedia.org/wiki/Learning_rate)
- [Optimisation Chapter in the Deep Learning book](https://www.deeplearningbook.org/contents/optimization.html)

# Addendum 3: CHALLENGE! Implementing BatchNorm2d

Read Section 3 of the [BatchNorm](https://arxiv.org/pdf/1502.03167.pdf) paper and implement BatchNorm2d following _Algorithm 1_ and _Algorithm 2_. Track the dataset statistics using a running average: $$ \hat{x}_{new} = (1 - \text{momentum}) * \hat{x} + \text{momentum} * x_t$$ where $\hat{x}$ is the estimated statistic and $x_t$ is the newly observed value. 

By default set momentum=0.1 and $\epsilon$=1e-5.
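
To see how the running average behaves, here is a small plain-Python sketch on a scalar. The helper `update_running`, the starting estimate of 0 and the constant observations are assumptions for illustration; in the challenge the same update is applied to per-channel tensors.

```python
def update_running(estimate, observed, momentum=0.1):
    """One running-average step: (1 - momentum) * estimate + momentum * observed."""
    return (1 - momentum) * estimate + momentum * observed

running_mean = 0.0
for batch_mean in (1.0, 1.0, 1.0):
    running_mean = update_running(running_mean, batch_mean)
    print(round(running_mean, 3))
```

With momentum=0.1 the estimate moves only a tenth of the way towards each new observation (0.1, 0.19, 0.271 here), so it smooths out batch-to-batch noise.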

Code skeleton:

```python
import torch
import torch.nn as nn

class MyBatchNorm2d(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # TASK: Define and initialise parameters
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # TASK: Implement training phase
            pass
        else:
            # TASK: Implement testing phase
            pass
        raise NotImplementedError
```

Q: Which dimensions should you average over? _Hint: BatchNorm2d is also called SpatialBatchNorm. Read Sec 3.2_  
Q: Which parameters are trainable? Which are calculated directly, i.e. not trainable? *Hint: [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter) vs [self.register\_buffer](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer)*. _Hint2: [discussion](https://stackoverflow.com/questions/57540745/what-is-the-difference-between-register-parameter-and-register-buffer-in-pytorch)._  
Q: How should the affine parameters be initialised? _Hint: Initially BN acts as a plain normalising function_  
Q: How should the dataset statistics be initialised? _Hint: Normalised values are assumed to be $N(0, I)$_  

**Note**
- Each `torch.nn.Module` object has an attribute called `self.training`, which indicates whether the object (model) is in training mode or not. Calling `model.eval()` or `model.train()` will update this attribute.
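
The sketch below illustrates the flag using PyTorch's own `nn.BatchNorm2d`; the input tensor is an arbitrary assumption. In training mode a forward pass updates the running statistics, while in evaluation mode they are left untouched.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 4, 4) + 5.0  # inputs with a clearly non-zero mean

print(bn.training)   # modules start in training mode
bn(x)                # training mode: batch statistics used, running buffers updated
mean_after_train = bn.running_mean.clone()

bn.eval()
print(bn.training)
bn(x)                # eval mode: running statistics used, buffers left as they are
mean_after_eval = bn.running_mean.clone()
```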

You should be able to get the same performance using your own BatchNorm.

![Your-Own-BN](./media/your-own-bn.png)

*We provide our implementation of BatchNorm in `lab-3-training/code/batch_norm_ref.py` for your reference, don't look at it if you want to explore yourself!*