# Optimisation

> Deep Learning provides __a lot__ of optimization techniques and __optimization is one of the most (if not the most) important aspects__ of training a model

![](images/optim_vis.gif)

> __You can find optimisation related procedures for PyTorch in [`torch.optim` package](https://pytorch.org/docs/stable/optim.html)__

## SGD

Previously we have seen __Stochastic Gradient Descent__, a basic optimization technique for gradient-based models, which has the following update formula:

$$
\theta_{t+1} = \theta_t - \alpha \frac{1}{M}\sum_{i}^{M} \nabla L(h(x^i;\theta_t), y^i)
$$

$$
\alpha \rightarrow \text{learning rate}
$$

$$
L \rightarrow \text{cost function}
$$

$$
M \rightarrow \text{batch size}
$$

$$
\theta_t \rightarrow \text{model parameters at timestep t}
$$
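
As a minimal sketch of the update above (plain Python, no PyTorch), assume a one-parameter model $h(x;\theta) = \theta x$ with a squared-error loss. The names `sgd_step`, `squared_error_grad` and the toy data are illustrative assumptions, not part of any library:

```python
def sgd_step(theta, batch, grad_fn, lr):
    """One SGD update: theta - lr * (mean gradient over the mini-batch)."""
    mean_grad = sum(grad_fn(theta, x, y) for x, y in batch) / len(batch)
    return theta - lr * mean_grad


def squared_error_grad(theta, x, y):
    """Gradient of (theta * x - y) ** 2 with respect to theta."""
    return 2.0 * (theta * x - y) * x


# Toy data generated by y = 3 * x, so the optimum is theta = 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
batch_size = 2  # M in the formula above

theta = 0.0
for epoch in range(50):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        theta = sgd_step(theta, batch, squared_error_grad, lr=0.05)

print(theta)  # ends up very close to 3.0
```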


### Know-how

- Basic, but __often used in SOTA (state of the art) neural network models__ (or its two upcoming variants)
- Harder (and we mean it) to fine-tune, but often brings great results (especially when mixed with a `scheduler`, more about that later)
- Used as a sanity check of whether our neural network actually works
- __Might not be the best default (see Adam)__

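```python
import torch

# Toy classifier: 100 input features, 10 outputs
clf = torch.nn.Linear(100, 10)

# Plain SGD over all parameters of the model
sgd = torch.optim.SGD(clf.parameters(), lr=1e-03)
```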

## Momentum (SGD + Momentum)

> Momentum is a small modification to SGD __taking into account moving average of previous parameter updates__

There are two different formulas for the momentum update; we will go with the one provided by PyTorch and based on [this paper](http://www.cs.toronto.edu/%7Ehinton/absps/momentum.pdf):

$$
v_{t+1} = \mu v_t + \frac{1}{M}\sum_{i}^{M} \nabla L(h(x^i;\theta_t), y^i)
$$
$$
\theta_{t+1} = \theta_t - \alpha v_{t+1}
$$
$$
v_0 = 0
$$

Here, we define `v` (velocity) which takes into account previous update multiplied by $\mu$.

> __WARNING:__ This formula may be different in other frameworks, so it can be one source of different results (e.g. between this and Tensorflow)!
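
To make the warning concrete, here is a small sketch in plain Python comparing two common ways of writing the momentum update. The function names and the made-up gradient sequence are assumptions for illustration only. The first form keeps the learning rate outside the velocity (as documented for `torch.optim.SGD`), the second folds it into the velocity:

```python
def run_pytorch_style(grads, lrs, mu):
    """v <- mu * v + g;  theta <- theta - lr * v."""
    theta, v = 0.0, 0.0
    for g, lr in zip(grads, lrs):
        v = mu * v + g
        theta -= lr * v
    return theta


def run_classic_style(grads, lrs, mu):
    """v <- mu * v - lr * g;  theta <- theta + v  (learning rate inside v)."""
    theta, v = 0.0, 0.0
    for g, lr in zip(grads, lrs):
        v = mu * v - lr * g
        theta += v
    return theta


grads = [1.0, 1.0, 1.0, 1.0]          # fixed, made-up gradients
constant_lr = [0.1, 0.1, 0.1, 0.1]
decayed_lr = [0.1, 0.1, 0.01, 0.01]   # learning rate dropped mid-run

same = (run_pytorch_style(grads, constant_lr, 0.9),
        run_classic_style(grads, constant_lr, 0.9))
different = (run_pytorch_style(grads, decayed_lr, 0.9),
             run_classic_style(grads, decayed_lr, 0.9))
print(same)
print(different)
```

With a constant learning rate both forms give the same parameters (up to floating point error). Once the learning rate changes during training (e.g. because of a scheduler) they drift apart, because the second form still carries velocity that was scaled by the old learning rate.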

### Intuition

- If previous step was in the same direction, current optimization step will be even larger in this direction (snowball effect)
- If previous step was in the opposite direction, next step will be smaller (more cautious)
- The effective step size changes more from step to step, which adds oscillations (and, due to that, some regularization)

### Why?

- Make the convergence faster if the loss surface is smooth and does not curve back
- If the gradient changes rapidly, steps will become smaller and more cautious

### Why not?

- We may "overshoot" the minimum
- We may take steps that are too small
- __Both happen because feedback from previous steps might simply be wrong__

### Know-how

- Set `momentum` parameter as high as possible and close to `1` (usually `0.99` or `0.999`)
- Set `lr` as high as possible without divergence (might be hard and increase oscillations too much)
- Rule of thumb: twice as high `lr` as with SGD (see [here](https://distill.pub/2017/momentum/))

```python
momentum = torch.optim.SGD(clf.parameters(), lr=1e-03, momentum=0.99)
```

## Nesterov Accelerated Gradient (NAG) - SGD with momentum and Nesterov update

> Momentum with Nesterov update evaluates gradient calculated __after momentum was applied to parameters__

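$$
\hat{\theta} = \theta + \mu v
$$

$$
v = \mu v - \alpha \nabla L(h(x^i;\hat{\theta}), y^i)
$$

$$
\theta = \theta + v
$$

Note the different convention: here $v$ already includes the learning rate and the minus sign, so it is added to $\theta$.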

$$
\hat{\theta} \rightarrow \text{Look ahead parameters of Nesterov}
$$

$$
\theta \rightarrow \text{Current parameters}
$$

![](images/momentum_vs_nesterov.jpeg)

__Source:__ [Stanford CS231n](https://cs231n.github.io/neural-networks-3/)
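
A minimal sketch of the difference, in plain Python and using the same convention as the formulas above (velocity includes the learning rate and sign). The toy loss $L(\theta) = \theta^2$ and all function names are assumptions made for illustration:

```python
def momentum_step(theta, v, grad_fn, lr, mu):
    """Classic momentum: gradient evaluated at the current parameters."""
    v = mu * v - lr * grad_fn(theta)
    return theta + v, v


def nesterov_step(theta, v, grad_fn, lr, mu):
    """Nesterov: gradient evaluated at the look-ahead parameters."""
    lookahead = theta + mu * v
    v = mu * v - lr * grad_fn(lookahead)
    return theta + v, v


def grad(theta):
    """Gradient of the toy loss theta ** 2 (minimum at 0)."""
    return 2.0 * theta


def run(step_fn, steps=30, lr=0.1, mu=0.9):
    theta, v = 1.0, 0.0
    path = []
    for _ in range(steps):
        theta, v = step_fn(theta, v, grad, lr, mu)
        path.append(theta)
    return path


momentum_path = run(momentum_step)
nesterov_path = run(nesterov_step)

# How far below the minimum (theta = 0) each method overshoots
print(min(momentum_path), min(nesterov_path))
```

In this toy run the first step is identical (velocity is still zero), and afterwards the Nesterov variant overshoots the minimum less than plain momentum. This is only one hand-picked example, not a general guarantee.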

### Why?

- We know momentum will take us to a different point, no matter what
- Our estimates of gradient __did not take this shift into account previously!__

### Results

- Gives us better theoretical estimates of gradient
- It might be different in practice, but it's worth a shot
- Similar idea to pure `Momentum`

### Practical concerns

- We always keep "ahead" gradient instead of normal gradient

```python
nesterov = torch.optim.SGD(clf.parameters(), lr=1e-03, momentum=0.99, nesterov=True)
```

# Pre-Adam optimizers

> Many other optimizer versions have been proposed throughout the years but most didn't see widespread adoption

Most popular predecessors of `Adam` (described below) were:
- __AdaGrad__ - one of the first optimizers with adaptive learning rate on a per-parameter basis:
    - __might be useful for sparse data__
    - __keeps accumulated squared gradients__
- __AdaDelta__ - extension of AdaGrad, tries to reduce dying learning rates as the optimization progresses:
    - __moving window of accumulated past squared gradients__
- __RMSProp__ - similar goal to AdaDelta, unpublished, developed independently

## Adam

> ADAptive Moment estimation makes updates on a per-parameter basis and tries to normalize them with moving averages of the gradient mean and (uncentered) variance

__Adam (or AdamW) should be your default choice for optimization__

$$
g_t = \nabla L(h(x^i;\theta_{t-1}), y^i)
$$

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1-\beta_2)g^2_t
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
$$

$$
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

$$
m_t \rightarrow \text{moving average of first moment (mean of gradients) at timestep t}
$$

$$
v_t \rightarrow \text{moving average of second moment (uncentered variance of gradients) at timestep t}
$$

$$
\hat{m}_t \rightarrow \text{moving average of mean with bias correction (power of timestep!)}
$$

$$
\hat{v}_t \rightarrow \text{moving average of uncentered variance with bias correction (power of timestep!)}
$$
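
The formulas above translate almost line by line into code. Below is a minimal sketch for a single scalar parameter in plain Python. `adam_step` is a hypothetical helper written for this example (not the PyTorch implementation), and the default hyperparameters are the commonly used ones:

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Two parameters whose gradients differ by six orders of magnitude
steps = {}
for name, grad in [("small_grad", 1e-4), ("large_grad", 1e+2)]:
    theta, m, v = adam_step(theta=0.0, grad=grad, m=0.0, v=0.0, t=1)
    steps[name] = theta

print(steps)
```

Both parameters move by roughly `lr` on the first step, regardless of gradient scale. This is the per-parameter normalization that makes Adam less sensitive to the raw magnitude of gradients.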

### AdamW

> AdamW is a slight change to Adam which decouples weight decay (`weight_decay` in PyTorch) from the gradient-based update

> Originally Adam with L2 regularization had regularization term inside mean and variance, hence __moving mean/variance was influenced by regularization__, which skewed the results
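
A rough sketch of the difference in plain Python, for a single scalar parameter. All function names are made up for this example and the code is a simplification, not the PyTorch implementation. To isolate the effect of regularization, the loss gradient is set to zero:

```python
import math


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Normalized Adam direction m_hat / (sqrt(v_hat) + eps) plus new state."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, grad, m, v, t, lr, weight_decay):
    """L2 penalty: the decay term is added to the gradient, so it enters m and v."""
    direction, m, v = adam_direction(grad + weight_decay * theta, m, v, t)
    return theta - lr * direction, m, v


def adamw_step(theta, grad, m, v, t, lr, weight_decay):
    """Decoupled decay: weights shrink directly, m and v only see the loss gradient."""
    theta = theta - lr * weight_decay * theta
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * direction, m, v


theta0 = 1.0
l2_theta, _, _ = adam_l2_step(theta0, 0.0, 0.0, 0.0, 1, lr=1e-3, weight_decay=1e-2)
w_theta, _, _ = adamw_step(theta0, 0.0, 0.0, 0.0, 1, lr=1e-3, weight_decay=1e-2)

print(theta0 - l2_theta)  # about lr: the penalty got normalized like a gradient
print(theta0 - w_theta)   # about lr * weight_decay * theta0
```

With the L2 variant the small penalty term is rescaled by Adam's normalization, so the weight moves by about a full `lr`. With the decoupled variant it shrinks by `lr * weight_decay * theta`, which is what one usually expects from weight decay.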

### Know-how

- Always use `AdamW` when using weight decay
- Use a `warm up` period - lower learning rate at the beginning of training; due to random initialization, there is large variance in the initial estimates of statistics which disrupts the training later on
- Keep default `beta1` and `beta2` values, usually those should not be altered
- Algorithm is quite robust to the choice of learning rate - easier optimization
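
The warm-up advice above can be sketched as a multiplier applied to the base learning rate. `warmup_factor` is a hypothetical helper and linear warm-up is just one possible choice. A function of this shape (with `warmup_steps` fixed) could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`:

```python
def warmup_factor(step, warmup_steps):
    """Multiplier for the base learning rate: ramps linearly up to 1."""
    return min(1.0, (step + 1) / warmup_steps)


base_lr = 1e-3
schedule = [base_lr * warmup_factor(step, warmup_steps=5) for step in range(8)]

for step, lr in enumerate(schedule):
    print(step, lr)
```

The learning rate starts at a fifth of `base_lr`, grows for five steps and then stays at `base_lr`.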

```python
# AdamW with decoupled weight decay (not an L2 penalty added to the loss)
adam = torch.optim.AdamW(clf.parameters(), lr=1e-3, weight_decay=1e-4)
```

## Post-Adam optimizers

> There were a lot of other optimizers after Adam was introduced, but __not many received widespread adoption__ as the tools we have usually work fine

> It might be worth checking out (and mixing) approaches below when you are pushing for the best score (e.g. on Kaggle)

Few interesting cases:
- `Nadam`, `AdaMax` etc. - small changes to `Adam` (the first one applies Nesterov to `Adam`)
- `RAdam` - research paper: https://arxiv.org/abs/1908.03265 - removes the need for warm up period in adaptive optimizers
- LookAhead optimizer - research paper: https://arxiv.org/abs/1907.08610: 
    - Keeps two sets of weights
    - One is "faster" - updated at every step by the inner optimizer (e.g. `SGD` or `Adam`)
    - Another one is "slower" - not updated during those steps
    - After `k` steps (usually `5`) the slow weights are moved part of the way toward the fast weights (linear interpolation) and the fast weights restart from there
    - __Demonstrated to improve over Adam and SGD__
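
The LookAhead steps listed above can be sketched in a few lines of plain Python. This is a simplified illustration with plain gradient descent as the inner optimizer. The function name, the toy loss $(\theta - 3)^2$ and the hyperparameter values are assumptions made for this example:

```python
def lookahead_minimize(grad_fn, theta0, inner_lr=0.1, k=5, alpha=0.5, outer_steps=20):
    """Lookahead around plain gradient descent for one scalar parameter."""
    slow = theta0
    for _ in range(outer_steps):
        fast = slow                                # fast weights start from slow weights
        for _ in range(k):                         # k steps of the inner optimizer
            fast = fast - inner_lr * grad_fn(fast)
        slow = slow + alpha * (fast - slow)        # move slow weights toward fast ones
    return slow


# Gradient of the toy loss (theta - 3) ** 2
result = lookahead_minimize(lambda theta: 2.0 * (theta - 3.0), theta0=0.0)
print(result)  # close to 3.0
```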
    
> __Resist the urge to combine every possible extension!__ Analyze what might help during your training and choose wisely, otherwise you might end up with worse score than initially!

# Schedulers

> Schedulers allow us to change hyperparameters of optimizers

__Almost always we change learning rate__ and we usually do that based on:
- Epochs we ran optimization for
- Some metric (usually on validation set)
- Using predefined algorithm, which, supposedly, should improve convergence somehow

> You can find PyTorch provided schedulers inside [`torch.optim.lr_scheduler` package](https://pytorch.org/docs/stable/optim.html)

Let's see widespread schedulers for each type:

## Step/MultiStep schedulers

> After `N` epochs passed, multiply learning rate by some value

PyTorch provides a few which allow us to do it:
- [`torch.optim.lr_scheduler.MultiplicativeLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.MultiplicativeLR) - multiplies learning rate by a factor returned by a user-provided function of the current epoch number
- [`torch.optim.lr_scheduler.StepLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.StepLR) - decays learning rate by `gamma` every `N` epochs
- [`torch.optim.lr_scheduler.MultiStepLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.MultiStepLR) - like above, but we specify a list of epoch numbers (`milestones`) at which we want to multiply the `learning_rate` by `gamma` (__used very often in many research papers on ImageNet as standard optimization procedure__)
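
To see which learning rate such schedules produce, here is a closed-form sketch in plain Python. `step_lr` and `multistep_lr` are hypothetical helpers that mimic the decay rules described above. They are not the PyTorch classes:

```python
from bisect import bisect_right


def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate decayed by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def multistep_lr(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate decayed by gamma once for every milestone already reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


for epoch in (0, 29, 30, 79, 80, 99):
    print(epoch, multistep_lr(0.1, epoch, milestones=[30, 80]))
```

With `milestones=[30, 80]` and `gamma=0.1` the learning rate is `0.1` for epochs `0-29`, about `0.01` for epochs `30-79` and about `0.001` afterwards.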

### Know-how

- Usually we multiply `lr` by `0.1` (or, less often, divide it by `5` or `2`)
- Most often used with non-adaptive optimizers, __but can also work with more advanced optimizers like Adam__

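```python
from torch.optim.lr_scheduler import MultiStepLR

# `optimizer`, `train` and `validate` are placeholders defined elsewhere
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()  # lr is multiplied by gamma at epochs 30 and 80
```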

## Reducing LR on Plateau

> If our metric stops improving for a few epochs (usually validation loss), we reduce `lr`

- [`torch.optim.lr_scheduler.ReduceLROnPlateau`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau) - does exactly this
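
The core logic can be sketched in plain Python. `PlateauReducer` is a hypothetical, heavily simplified class written for this example. The real PyTorch scheduler has more options (e.g. a threshold, a cooldown and a minimum learning rate), and the validation losses below are made up:

```python
class PlateauReducer:
    """Reduce lr by `factor` when the metric has not improved for more than `patience` epochs."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:       # improvement: remember it, reset the counter
            self.best = metric
            self.bad_epochs = 0
        else:                        # no improvement this epoch
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


reducer = PlateauReducer(lr=0.1, patience=2)
val_losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.7, 0.69]  # made-up validation losses
history = [reducer.step(loss) for loss in val_losses]
print(history)
```

Here the loss stops improving after the third epoch, and once three epochs in a row bring no improvement the learning rate drops from `0.1` to about `0.01`.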

### Know-how

- Usually we multiply `lr` by `0.1` (or, less often, divide it by `5` or `2`) - same as `step` schedulers
- Most often used with non-adaptive optimizers, __but can also work with more advanced optimizers like Adam__

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

# `train` and `validate` are placeholders defined elsewhere
optimizer = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
```

## Algorithmic schedulers

> Active area of research, previous approaches are more battle tested!

- [`torch.optim.lr_scheduler.CyclicLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.CyclicLR) - popularized by `fastai`, which does the following:
    - Initially increase learning rate (warmup phase, useful with `Adam`/`AdamW`)
    - Get to large learning rate which disrupts parameters (jumping out of local minima)
    - Get back to low learning rate to fine-tune new optima
    - If the minimum is wide and flat, we will not jump out of it, even with the larger learning rate
    - See research paper by Leslie N. Smith: https://arxiv.org/abs/1506.01186
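
The basic "triangular" policy from the paper linked above can be written as a small function of the iteration number. This is a plain Python sketch. `triangular_lr` and the concrete values are illustrative assumptions, and `step_size` is the number of iterations in half a cycle:

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate that rises linearly to max_lr and falls back to base_lr."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


lrs = [triangular_lr(i, base_lr=0.001, max_lr=0.01, step_size=4) for i in range(9)]
for i, lr in enumerate(lrs):
    print(i, round(lr, 5))
```

Over one full cycle (8 iterations here) the learning rate goes from `0.001` up to `0.01` at iteration `4` and back down to `0.001` at iteration `8`.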
    
![](images/cyclic_lr.png)

- [__`torch.optim.lr_scheduler.OneCycleLR`__](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.OneCycleLR) - as above (__use with `AdamW` or experimental optimizers__), but with a single increase/decrease cycle, a schedule which was found to work well and allow for SuperConvergence: https://arxiv.org/abs/1708.07120

> Super Convergence - a phenomenon where training with unusually high learning rates (e.g. `2` or `5`) results in fast convergence to flat minima

__Above is often used with `OneCycle` learning policy!__

# General optimization procedure

> __Below is a basic outline of the steps one might try in order to find a good optimization procedure__


1. Find a related research paper (and the optimization procedure it specifies) if you are using a well-known architecture (e.g. MobileNet, which we will learn about later)
2. Iterate over that with a __small__ grid search if results are unsatisfactory
3. If the architecture is custom, try `lr_finder` (with an appropriate optimizer, usually `Adam` or `RAdam`, see below): increase the learning rate until divergence (and set your LR `10x` lower). Example repository implementing it [here](https://github.com/davidtvs/pytorch-lr-finder)
4. Start with `Adam` as a good default (preferably with warm up period or use `RAdam`, example implementation [here](https://github.com/LiyuanLucasLiu/RAdam))
5. Experiment based on loss and apply appropriate schedulers (__usually lowering learning rate is beneficial, BUT more sophisticated approaches could be checked__)
6. If you have too much time (or really want this SOTA score) try to optimize with `SGD` & `momentum` over the optimization procedure you've found above (__for those with too much free time__)
7. Try AutoML approaches if you have better hardware and iterate over that (see for example [this project](https://auto.gluon.ai/stable/index.html)) (__last resort__)
8. Hyperparam search if you have enough hardware to handle that and are totally lost after above steps (__last resort__)
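
The learning rate finder idea from step 3 can be sketched in plain Python on a toy problem. This is not the API of the linked repository. `lr_range_test`, the toy loss $\theta^2$ and the divergence criterion (loss more than `4x` the best loss seen so far) are assumptions made for this example:

```python
def lr_range_test(loss_fn, grad_fn, theta, lr_start=1e-4, lr_end=10.0,
                  num_steps=100, diverge_factor=4.0):
    """Take one gradient step per learning rate, growing lr exponentially,
    and stop once the loss blows up compared to the best loss seen so far."""
    growth = (lr_end / lr_start) ** (1.0 / (num_steps - 1))
    lr, best_loss = lr_start, float("inf")
    history = []
    for _ in range(num_steps):
        theta = theta - lr * grad_fn(theta)
        loss = loss_fn(theta)
        history.append((lr, loss))
        if loss > diverge_factor * best_loss:
            break                       # divergence detected at this lr
        best_loss = min(best_loss, loss)
        lr *= growth
    return history


history = lr_range_test(lambda t: t ** 2, lambda t: 2.0 * t, theta=1.0)
diverged_lr = history[-1][0]
suggested_lr = diverged_lr / 10         # rule of thumb from step 3
print(diverged_lr, suggested_lr)
```

For this toy loss plain gradient descent becomes unstable once the learning rate exceeds `1`, so the test stops somewhat above that value and the suggested learning rate lands an order of magnitude lower.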

## Summary

- PyTorch provides __most widely used__ optimizers within the package
- For additional optimizers we should look to other libraries, __though it usually is not worth it__
- __Optimizers__ aim to improve weak (or perceived as weak) points of previous optimization techniques:
    - __SGD__ - is the backbone and basic technique
    - __Momentum__ - takes into account previous steps & makes the step smaller/larger appropriately
    - __Nesterov__ - like momentum, but gradient is evaluated at step after momentum was applied
    - __Adam (ADAptive Moments)__ - adaptive optimizer, each parameter has different learning rate based on mean and variance of gradients from previous steps
- __Schedulers__ alter learning rate, usually based on:
    - Training time (epochs) - `Step` schedulers, multiplying initial learning rate by a constant below `1`, like `0.1`
    - Training feedback - Reduce learning rate when validation loss stops improving
    - Algorithm - predefined schedule increasing and/or decreasing learning rate. Can cycle between large and small learning rates in order to escape local minima and find flat regions
- Basic PyTorch training loop consists of:
    - Setting up criterion
    - Casting model and data to device
    - Backpropagating loss
    - Taking optimizer step
    - Zeroing out gradient in model using `optimizer.zero_grad()`

# Challenges

## Assessment

- What is Learning Rate Finder and how does it fit with 1 Cycle policy? Check out [this blog post](https://sgugger.github.io/the-1cycle-policy.html) from `fastai` contributor

## Non-assessment

- Check out [Stochastic Weight Averaging](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging) - how does it work and why might it help?
- What is [Stochastic Gradient Descent with Warm Restarts](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.CosineAnnealingWarmRestarts)?
- What is [`torch.optim.lr_scheduler.LambdaLR`](https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.LambdaLR)?