# Optimisation for Deep Learning

Learning outcomes
- understand mathematically and intuitively the most common optimisation algorithms used for optimising deep models
- implement your own optimiser in PyTorch

## Reminder of gradient based optimisation

So far we've looked at some pretty simple gradient-based optimisers, including gradient descent and stochastic gradient descent.

In this notebook, we'll look at some more complex optimisers, which can overcome some of the shortcomings of the methods we've looked at previously, and will allow us to train deep models more quickly.

Here's a visualisation of how different optimisers might iteratively update the model weights.

![](images/optim_vis.gif)

## Challenges with optimising deep models
- Local structure may not be representative of global structure

## Again, don't be scared by local optima

1. Local minima become exponentially rare as the number of parameters in the model grows.
For every weight, we compute how the loss changes with respect to it.
At a local minimum, the loss must increase when moving in either direction along every weight axis, and it becomes exponentially unlikely that this holds along every axis at once as the number of weights grows.

2. Empirically, local minima perform well enough!
Even local minima can achieve good enough performance in practice.
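
The counting argument in point 1 can be made concrete with a toy calculation. This is only a rough sketch: it assumes that, at a point where the gradient is zero, the loss is equally likely to curve upwards or downwards along each weight axis, and that the axes are independent. Neither assumption holds exactly for real networks, and the function name is made up for illustration.

```python
def prob_all_axes_curve_up(n_params, p_up=0.5):
    """Toy estimate of the chance that a critical point is a local minimum.

    Assumes (crudely) that each weight axis independently curves upwards
    with probability p_up.
    """
    return p_up ** n_params

# The probability collapses quickly as the number of parameters grows
for n_params in (1, 10, 100):
    print(n_params, prob_all_axes_curve_up(n_params))
```

Real models have millions of parameters, so under this heuristic almost every critical point has at least one direction that still leads downhill.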

## Gradient Descent

![](images/gradient_descent.jpg)

## SGD

![](images/SGD.jpg)

## SGD with momentum

![](images/momentum.jpg)

## SGD with Nesterov momentum

![](images/nesterov.jpg)

## AdaGrad

Is there a more systematic way to reduce the learning rate over time?

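AdaGrad assumes so: it reduces the learning rate for each parameter by dividing it by the square root of the sum of all the squared gradients encountered so far for that parameter.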

![](images/adagrad.jpg)
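
As a minimal sketch of the idea, here is an AdaGrad-style update for a single scalar weight on the toy loss `w**2`. The function name, the starting values and the small `eps` term (added to avoid dividing by zero) are illustrative assumptions, not code taken from PyTorch.

```python
import math

def adagrad_step(w, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    """One AdaGrad-style update for a single scalar parameter."""
    grad_sq_sum += grad ** 2                            # running sum of squared gradients; it never shrinks
    effective_lr = lr / (math.sqrt(grad_sq_sum) + eps)  # learning rate scaled down by the accumulated gradients
    w -= effective_lr * grad
    return w, grad_sq_sum, effective_lr

w, grad_sq_sum = 5.0, 0.0
effective_lrs = []
for _ in range(5):
    grad = 2 * w                                        # gradient of the toy loss w**2
    w, grad_sq_sum, effective_lr = adagrad_step(w, grad, grad_sq_sum)
    effective_lrs.append(effective_lr)

print(effective_lrs)  # each entry is smaller than the one before
```

Because the accumulated sum only ever grows, the effective learning rate shrinks on every step.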

## RMSProp

The problem with AdaGrad is that the learning rate can only decrease: once it has slowed down, it can never recover and increase to speed up optimisation. So if a steep part of the loss surface is encountered before a flatter part, the learning rate for this parameter is divided by the large gradients accumulated in the steep region, and becomes too small to make meaningful progress in the flatter region.

RMSProp is similar to AdaGrad except for how it accumulates the gradient to decay the learning rate for each parameter. Instead of continuing to sum up the squares of all of the gradients encountered in each given direction, it takes an *exponential moving average* of them. This gives the learning rate the chance to increase if a steep gradient has not been encountered recently, because historical gradients have an exponentially smaller influence on the learning rate with each optimisation step.

![](images/rmsprop.jpg)
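
The difference can be illustrated by feeding both accumulators the same made-up gradient sequence: a few steep steps followed by many flat ones. This is a sketch under stated assumptions: the gradient values, the `decay` of 0.9 and the function name are chosen for illustration and are not PyTorch's defaults.

```python
import math

def effective_learning_rates(grads, lr=0.01, decay=0.9, eps=1e-8):
    """Track the effective learning rate of AdaGrad and RMSProp for one parameter."""
    sq_sum, sq_avg = 0.0, 0.0
    adagrad, rmsprop = [], []
    for g in grads:
        sq_sum += g ** 2                                # AdaGrad: sum of all squared gradients
        sq_avg = decay * sq_avg + (1 - decay) * g ** 2  # RMSProp: exponential moving average
        adagrad.append(lr / (math.sqrt(sq_sum) + eps))
        rmsprop.append(lr / (math.sqrt(sq_avg) + eps))
    return adagrad, rmsprop

grads = [10.0] * 5 + [0.1] * 50                         # steep region first, then a flat region
adagrad, rmsprop = effective_learning_rates(grads)

print('AdaGrad after steep / after flat:', adagrad[4], adagrad[-1])
print('RMSProp after steep / after flat:', rmsprop[4], rmsprop[-1])
```

In this toy run, AdaGrad's effective learning rate stays stuck at the small value it reached in the steep region, while RMSProp's grows again once the steep gradients have faded from the moving average.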

## Adam

![](images/adam.jpg)

## So which algorithm do I use?

Well... as usual, it depends on your problem and your dataset.

It's still a highly active field of research. But in general, **SGD with momentum or Adam** are the go-to choices for optimising deep models.

## Using these optimisation algorithms

Let's set up the same neural network as in the previous module, then swap in Adam and other optimisers, and see how to configure SGD to use momentum.

In [None]:
import sys
sys.path.append('..')
from utils import NN, get_dataloaders
import torch
import torch.nn.functional as F

my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)

learning_rate = 0.0001

# HOW TO USE DIFFERENT OPTIMISERS PROVIDED BY PYTORCH
# Uncomment exactly one of these lines; a later assignment would overwrite an earlier one.
# optimiser = torch.optim.SGD(my_nn.parameters(), lr=learning_rate, momentum=0.1)
# optimiser = torch.optim.Adagrad(my_nn.parameters(), lr=learning_rate)
# optimiser = torch.optim.RMSprop(my_nn.parameters(), lr=learning_rate)
optimiser = torch.optim.Adam(my_nn.parameters(), lr=learning_rate)

The stuff below is exactly the same as before!

In [None]:
# GET DATALOADERS
test_loader, val_loader, train_loader = get_dataloaders()
criterion = F.cross_entropy

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter

# TRAINING LOOP
def train(model, optimiser, tag, graph_name, epochs=1):
    writer = SummaryWriter(log_dir=f'../../runs/{tag}') # make a different writer for each tagged optimisation run
    for epoch in range(epochs):
        for idx, minibatch in enumerate(train_loader):
            inputs, labels = minibatch
            prediction = model(inputs)             # pass the data forward through the model
            loss = criterion(prediction, labels)   # compute the loss
            print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss.item())
            optimiser.zero_grad()                  # reset the gradients attribute of each of the model's params to zero
            loss.backward()                        # backward pass to compute and set the gradients of all of the model's params
            optimiser.step()                       # update the model's parameters
            writer.add_scalar(f'Optimisers/{graph_name}', loss.item(), epoch*len(train_loader) + idx)    # write loss to a graph
    writer.close()                                 # flush any pending events to disk

# train(my_nn, optimiser, 'Optimisers/Adam', graph_name=f'lr={learning_rate}')

Let's compare the training curves generated using some of the optimisers that we explained above.

In [None]:
optimisers = [
    {
        'optimiser_class': torch.optim.SGD, 
        'tag': 'SGD'
    },
    {
        'optimiser_class': torch.optim.Adam,
        'tag': 'Adam'
    },
    {
        'optimiser_class': torch.optim.Adagrad,
        'tag': 'Adagrad'
    },
    {
        'optimiser_class': torch.optim.RMSprop,
        'tag': 'RMSProp'
    }
]

learning_rates = [0.01, 0.001, 0.0001, 0.00001]

for optimiser_obj in optimisers:   
    for lr in learning_rates:
        my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)
        optimiser_class = optimiser_obj['optimiser_class']
        optimiser = optimiser_class(my_nn.parameters(), lr=lr)
        tag = optimiser_obj['tag']
        train(my_nn, optimiser, f'Optimisers/{tag}', graph_name=f'lr={lr}', epochs=1)
    

## Implementing our own PyTorch optimiser

To understand a bit further what's happening under the hood, let's implement SGD from scratch.

In [None]:
class SGD():
    def __init__(self, model_params, learning_rate):
        self.model_params = list(model_params) # parameters() returns a generator, which would be exhausted after one pass, so store it as a list
        self.learning_rate = learning_rate

    def step(self):
        with torch.no_grad():
            for param in self.model_params:
                param -= self.learning_rate * param.grad

    def zero_grad(self):
        for param in self.model_params:
            if param.grad is None: # not yet set (loss.backward() not yet called), so there is nothing to reset
                continue
            param.grad = torch.zeros_like(param.grad)


In [None]:
my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)
optimiser = SGD(my_nn.parameters(), learning_rate=0.1)

train(my_nn, optimiser, 'Loss/Train/custom_sgd', graph_name='custom_sgd')  # graph_name is a required argument of train()
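
The same pattern can be extended to momentum. The sketch below is a hypothetical `SGDMomentum` class, which is not part of the original notebook. It assumes the convention where a velocity accumulates `momentum * velocity + grad` and the parameter moves by `learning_rate * velocity`. It is checked on a single weight with the toy loss `w**2` rather than on the full network.

```python
import torch

class SGDMomentum():
    """Hypothetical momentum variant of a from-scratch SGD optimiser (illustrative only)."""
    def __init__(self, model_params, learning_rate, momentum=0.9):
        self.model_params = list(model_params)   # store as a list so it can be iterated on every step
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocities = [torch.zeros_like(p) for p in self.model_params]  # one velocity per parameter

    def step(self):
        with torch.no_grad():
            for param, velocity in zip(self.model_params, self.velocities):
                if param.grad is None:           # skip params that have no gradient yet
                    continue
                velocity.mul_(self.momentum).add_(param.grad)  # velocity = momentum * velocity + grad
                param -= self.learning_rate * velocity

    def zero_grad(self):
        for param in self.model_params:
            if param.grad is not None:
                param.grad.zero_()

# Toy check: one weight, loss = w**2, two optimisation steps
w = torch.tensor([1.0], requires_grad=True)
momentum_optimiser = SGDMomentum([w], learning_rate=0.1, momentum=0.9)
for _ in range(2):
    momentum_optimiser.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    momentum_optimiser.step()
print(w)
```

Because the class exposes the same `step()` and `zero_grad()` methods, it could be passed to `train` in the same way as the `SGD` class above.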

## Challenges
- flash card match images with name of optimisation algorithm
- roughly sketch the paths that different optimisation algorithms might take