%load_ext autoreload
%autoreload 2

%matplotlib inline

## Basic setup

Create an Anaconda environment
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install the fastai library
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Exponential moving average

Let $(y_1, y_2, \dots, y_n)$ be the values of a function $f:X \to Y$ at some points $(x_1, x_2, \dots, x_n)$, so that $f(x_1) = y_1, f(x_2) = y_2, \dots, f(x_n) = y_n$, and define
<br>
$$
\begin{aligned}
v_1 &= (1 - \beta) y_1, \\
v_2 &= \beta v_1 + (1 - \beta) y_2, \\
&\dots \\
v_n &= \beta v_{n-1} + (1 - \beta) y_n
\end{aligned}
$$
<br>
for $\beta \in [0, 1]$
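
The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name `ema` and the sample data are made up for the example, and the running value is assumed to start at $v_0 = 0$, which matches $v_1 = (1 - \beta)y_1$.

```python
def ema(ys, beta):
    """Exponential moving average of ys, assuming the running value starts at 0."""
    v = 0.0
    out = []
    for y in ys:
        # v_t = beta * v_{t-1} + (1 - beta) * y_t
        v = beta * v + (1 - beta) * y
        out.append(v)
    return out


# Hypothetical constant signal: the average climbs toward 10 from below.
smoothed = ema([10.0, 10.0, 10.0, 10.0], beta=0.5)
print(smoothed)  # [5.0, 7.5, 8.75, 9.375]
```

A higher `beta` gives more weight to the history and produces a smoother but slower-reacting curve.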

<div>
<img src="images/opts/ema_1.png"  height="600" width="800" />
</div>

Now let's try a higher $\beta$:

<div>
<img src="images/opts/ema_2.png"  height="600" width="800" />
</div>

And now a lower $\beta$:

<div>
<img src="images/opts/ema_3.png"  height="600" width="800" />
</div>

Bias in the early stage:

<div>
<img src="images/opts/ema_4.png"  height="600" width="800" />
</div>

In the early (warm-up) stage the average is biased toward zero, so it is better to use the so-called bias correction:
$$
v^{corr}_t =\frac{v_t}{1 - \beta^t}
$$
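
A small sketch of why the correction helps, under the assumption that the running value starts at zero and $\beta < 1$. The function name `ema_bias_corrected` and the constant test signal are illustrative only.

```python
def ema_bias_corrected(ys, beta):
    """EMA with bias correction v_t / (1 - beta**t); requires beta < 1."""
    v = 0.0
    out = []
    for t, y in enumerate(ys, start=1):
        v = beta * v + (1 - beta) * y
        # Divide by (1 - beta^t) to undo the pull toward the zero start value.
        out.append(v / (1 - beta ** t))
    return out


# For a constant signal, the corrected average recovers the signal from step 1,
# whereas the uncorrected average would start at (1 - beta) * 10 = 1.0.
corrected = ema_bias_corrected([10.0, 10.0, 10.0], beta=0.9)
print(corrected)
```

As $t$ grows, $\beta^t$ goes to zero and the corrected and uncorrected averages coincide.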


## Different loss functions and noisy gradients

Applied machine learning is heavily experiment-based and needs many experiments before a stable result is reached. On the other hand, DL models need "big" data for training, and full-batch processing is almost never possible.
<br>
For instance, $100000$ examples, or even more than $1000000$, cannot fit into a single batch in GPU memory, and processing them one at a time is too slow.

In batch gradient descent the loss should go down at every iteration; if it does not, the learning rate may be too large or something else is wrong:

<div>
<img src="images/opts/bt_1.png"  height="600" width="800" />
</div>

On the other hand, the gradients are quite similar at each epoch, so if we get stuck in a local extremum or a saddle point, we can stay there for a long time:

<div>
<img src="images/opts/bt_2.png"  height="600" width="800" />
</div>

Generally the data is divided into mini-batches, and the model is trained on these mini-datasets with batch gradient descent:

<div>
<img src="images/opts/bt_3.png"  height="600" width="800" />
</div>
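
Splitting a dataset into shuffled mini-batches can be sketched as follows. This is a plain-Python illustration; the helper name `iterate_minibatches`, the fixed seed, and the toy dataset are assumptions made for the example, and in practice a framework data loader would do this job.

```python
import random


def iterate_minibatches(examples, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover the dataset once (one epoch)."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the sketch reproducible
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


data = list(range(10))  # hypothetical dataset of 10 examples
batches = list(iterate_minibatches(data, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.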

But too much noise can increase the training time significantly:

<div>
<img src="images/opts/bt_4.png"  height="600" width="800" />
</div>

Noisy gradient descent:

<div>
<img src="images/opts/ng_1.png"  height="600" width="800" />
</div>

We need to somehow speed up training (the number of experiments might be large):

<div>
<img src="images/opts/ng_2.png"  height="600" width="800" />
</div>

The loss function is often much more complicated than MSE with its convex landscape.

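For probability activations, softmax is usually used:
$$
\sigma(x)_j = \frac{e^{x_j}}{\sum_i{e^{x_i}}}
$$
<br>
with the cross-entropy loss, where $t$ is the one-hot target over $K$ classes, $s = \sigma(x)$ are the predicted probabilities and $p$ is the index of the true class:
$$
C = -\sum_{i=1}^{K} t_{i} \log (s_{i}) = -\log\left ( \frac{e^{x_{p}}}{\sum_{j=1}^{K} e^{x_{j}}} \right )
$$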
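
A minimal plain-Python sketch of softmax followed by cross entropy for a single example with a one-hot target. The function names and the sample scores are illustrative; subtracting the maximum score is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, true_class):
    """Cross entropy with a one-hot target: -log of the true-class probability."""
    return -math.log(softmax(scores)[true_class])


scores = [1.0, 2.0, 3.0]  # hypothetical raw network outputs for 3 classes
probs = softmax(scores)
loss = cross_entropy(scores, true_class=2)
print(probs, loss)
```

With a one-hot target only the true-class term of the sum survives, which is why the loss reduces to a single logarithm.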

Training deep learning models may use a combination of several loss functions:

- Object detection: bounding box regression plus (weighted) cross entropy for image classification.
- Instance segmentation: bounding box regression plus (weighted) pixel binary classification plus (weighted) cross entropy for classification 

Different losses combined for different tasks:

<div>
<img src="images/opts/loss_odcs_1.png"  height="600" width="800" />
</div>

Detect, segment and classify at once:

<div>
<img src="images/opts/loss_odcs_2.png"  height="600" width="800" />
</div>

On the other hand, the loss can have very different surfaces: because there are so many parameters, the chance that many of them share the same direction is low:

<div>
<img src="images/opts/loss_ld_1.jpeg"  height="600" width="800" />
</div>

## Loss landscape and optimization

Non-smooth surface:

<div>
<img src="images/opts/ls_1.jpeg"  height="600" width="800" />
</div>

Smooth surface:

<div>
<img src="images/opts/ls_2.jpeg"  height="600" width="800" />
</div>

Optimizing the steps (moving average):

<div>
<img src="images/opts/lsgd_1.png"  height="600" width="800" />
</div>

Optimizing the surface (landscape):

<div>
<img src="images/opts/lsgd_2.gif"  height="600" width="800" />
</div>

## Gradient descent with momentum

Recall the gradient descent algorithm: for a cost function $C$ and learning rate $\alpha$, we update the parameters (weights) by:
$$
W^l_{i,j} = W^l_{i,j} - \alpha \frac{\partial{C}}{\partial{W^l_{i,j}}}
$$
<br>
and
$$
b^l_{j} = b^l_{j} - \alpha \frac{\partial{C}}{\partial{b^l_{j}}}
$$

Given the cost function $C$, denote the gradient with respect to the weights:
$$
d{W} = \nabla_{W}C
$$
<br>
and with respect to the biases:
$$
d{b} = \nabla_{b}C
$$

So our gradient descent optimization can be written as:
$$
W^l_{i,j} = W^l_{i,j} - \alpha \, d{W^l_{i,j}}
$$
<br>
and
$$
b^l_{j} = b^l_{j} - \alpha \, d{b^l_{j}}
$$

Or, in general, for batch or mini-batch gradient descent the update can be written as:
$$
W = W - \alpha d{W}
$$
<br>
and
$$
b = b - \alpha d{b}
$$
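
As a concrete sketch of the update rule, here is plain gradient descent fitting a one-dimensional linear model $y = Wx + b$ with an MSE cost. The toy data, the learning rate, and the number of iterations are assumptions chosen only so that the example converges.

```python
def gradients(W, b, xs, ys):
    """Gradients dW, db of the MSE cost C = mean((W*x + b - y)**2)."""
    n = len(xs)
    dW = sum(2 * (W * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (W * x + b - y) for x, y in zip(xs, ys)) / n
    return dW, db


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated by y = 2x + 1

W, b, alpha = 0.0, 0.0, 0.05
for _ in range(2000):
    dW, db = gradients(W, b, xs, ys)
    W = W - alpha * dW  # W = W - alpha * dW
    b = b - alpha * db  # b = b - alpha * db

print(W, b)  # close to 2 and 1
```

Here the whole toy dataset is used at every step; with mini-batches, `gradients` would be evaluated on the current mini-batch only.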

<div>
<img src="images/opts/gd_1.png"  height="600" width="800" />
</div>

For each iteration, compute $d W$ and $d b$ for the current mini-batch 

Here we have peaks, and the larger the oscillation, the slower the optimization.
We can use a moving average to reduce the horizontal variance and speed up optimization:
$$
V_{d W} = \beta V_{d W} + (1 - \beta)d W
$$
<br>
$$
V_{d b} = \beta V_{d b} + (1 - \beta)d b
$$

We can use these averages in the gradient descent update instead of the raw gradients:
$$
W = W - \alpha V_{d W}
$$
<br>
and
$$
b = b - \alpha V_{d b}
$$
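
The momentum update can be sketched on a toy cost whose two directions have very different curvature. The cost $C(W, b) = \frac{1}{2}(W^2 + 25 b^2)$, the starting point, and the hyperparameter values are assumptions for illustration; real $W$ and $b$ would be tensors rather than scalars.

```python
def grad(W, b):
    """Gradient of the toy cost C(W, b) = 0.5 * (W**2 + 25 * b**2)."""
    return W, 25.0 * b


alpha, beta = 0.07, 0.9
W, b = 10.0, 1.0
V_dW, V_db = 0.0, 0.0  # moving averages of the gradients, started at zero

for _ in range(500):
    dW, db = grad(W, b)
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    W = W - alpha * V_dW  # step along the averaged gradient
    b = b - alpha * V_db

print(W, b)  # both close to the minimum at (0, 0)
```

With this learning rate, plain gradient descent would flip the sign of `b` at every step, because $1 - 0.07 \cdot 25 < 0$; averaging the gradients damps that zigzag.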

<div>
<img src="images/opts/gd_mom_1.png"  height="600" width="800" />
</div>

## RMSProp optimizer

<b>Root Mean Square Prop</b>

For each iteration, compute $d W$ and $d b$ for the current mini-batch 

Compute a moving average as in momentum, but over the squared gradients:
$$
S_{d W} = \beta S_{d W} + (1 - \beta)d W^2 \text{ elementwise}
$$
<br>
$$
S_{d b} = \beta S_{d b} + (1 - \beta)d b^2 \text{ elementwise}
$$

Now update the weights and biases by:
$$
W = W - \alpha \frac{d W}{\sqrt{S_{d W}}}
$$
<br>
and
$$
b = b - \alpha \frac{d b}{\sqrt{S_{d b}}}
$$

<div>
<img src="images/opts/rms_1.png"  height="600" width="800" />
</div>

RMSProp steps:

<div>
<img src="images/opts/rms_2.png"  height="600" width="800" />
</div>

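Here, if $S_{d W}$ is large, the update in that direction is divided by a large number and is damped, so the oscillation shrinks and the step points more directly forward.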

To avoid division by zero (if $\sqrt{S_{d W}}$ or $\sqrt{S_{d b}}$ is almost zero), we can add a small $\epsilon$ to the denominators:
$$
W = W - \alpha \frac{d W}{\sqrt{S_{d W}} + \epsilon}
$$
<br>
and
$$
b = b - \alpha \frac{d b}{\sqrt{S_{d b}} + \epsilon}
$$

Here, for instance, $\epsilon = 10^{-8}$.
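
A scalar sketch of the RMSProp update, including the $\epsilon$ term. The toy cost $C(W, b) = \frac{1}{2}(W^2 + 25 b^2)$, the starting point, and the hyperparameter values are assumptions made for the example.

```python
import math


def grad(W, b):
    """Gradient of the toy cost C(W, b) = 0.5 * (W**2 + 25 * b**2)."""
    return W, 25.0 * b


alpha, beta, eps = 0.01, 0.9, 1e-8
W, b = 1.0, 1.0
S_dW, S_db = 0.0, 0.0  # moving averages of the squared gradients

for _ in range(500):
    dW, db = grad(W, b)
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # Each gradient is divided by the root of its own running mean square.
    W = W - alpha * dW / (math.sqrt(S_dW) + eps)
    b = b - alpha * db / (math.sqrt(S_db) + eps)

print(W, b)  # both end up within a few step sizes of 0
```

Although the gradient along `b` is 25 times larger than along `W`, both parameters move at a similar pace because of the normalization. With a constant learning rate the iterates hover near the minimum rather than converging exactly.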

## Adam optimizer

<b>Adaptive Moment Estimation </b>

For the Adam optimization algorithm we combine Momentum and RMSProp.

Set $V_{d W} = 0$, $S_{d W} = 0$, $V_{d b} = 0$, $S_{d b} = 0$

For each iteration, compute $d W$ and $d b$ for the current mini-batch 

First calculate momentums $V_{d W}$ and $V_{d b}$ with hyperparameter $\beta_1$:
$$
V_{d W} = \beta_1 V_{d W} + (1 - \beta_1)d W
$$
<br>
$$
V_{d b} = \beta_1 V_{d b} + (1 - \beta_1)d b
$$

Now calculate $S_{d W}$ and $S_{d b}$ with hyperparameter $\beta_2$:
$$
S_{d W} = \beta_2 S_{d W} + (1 - \beta_2)d W^2 \text{ elementwise}
$$
<br>
$$
S_{d b} = \beta_2 S_{d b} + (1 - \beta_2)d b^2 \text{ elementwise}
$$

For the current iteration, say $t$, calculate the so-called bias corrections for momentum:
$$
V^{corr}_{d W} = \frac {V_{d W}}{1 - \beta_1^t}
$$
<br>
$$
V^{corr}_{d b} = \frac{V_{d b}}{{1 - \beta_1^t}}
$$

The same correction for RMSProp parameters:
$$
S^{corr}_{d W} = \frac {S_{d W}}{1 - \beta_2^t}
$$
<br>
$$
S^{corr}_{d b} = \frac{S_{d b}}{{1 - \beta_2^t}}
$$

Now let's update the weights and biases:
$$
W = W - \alpha \frac{V^{corr}_{d W}}{\sqrt{S^{corr}_{d W}} + \epsilon}
$$
<br>
and
$$
b = b - \alpha \frac{V^{corr}_{d b}}{\sqrt{S^{corr}_{d b}} + \epsilon}
$$

This algorithm combines the momentum effect with the RMSProp optimizer.

We then have a number of hyperparameters:
- The learning rate $\alpha$
- Momentum parameter $\beta_1$ for derivatives $d W$ and $d b$ with common choice $0.9$
- RMSProp parameter $\beta_2$ for $d W^2$ and $d b^2$ squares with common choice $0.999$
- $\epsilon = 10^{-8}$

Almost always only $\alpha$ is tuned and the other parameters are left alone. The values above were recommended by the authors of the Adam paper, and I personally don't remember other values ever improving optimization performance; if anything, they made it worse.

$\beta_1$ controls the estimate of the first moment (the mean of the gradients) and $\beta_2$ the estimate of the second moment (the mean of the squared gradients), which is where the name Adaptive Moment Estimation comes from.
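
Putting the steps together, here is a scalar sketch of Adam with the common hyperparameter values listed above. The toy cost $C(W, b) = \frac{1}{2}(W^2 + 25 b^2)$, the starting point, the learning rate, and the iteration count are assumptions for illustration only.

```python
import math


def grad(W, b):
    """Gradient of the toy cost C(W, b) = 0.5 * (W**2 + 25 * b**2)."""
    return W, 25.0 * b


alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
W, b = 1.0, 1.0
V_dW = V_db = 0.0  # first-moment (momentum) estimates
S_dW = S_db = 0.0  # second-moment (RMSProp) estimates

for t in range(1, 301):  # t starts at 1 so the bias correction is defined
    dW, db = grad(W, b)
    V_dW = beta1 * V_dW + (1 - beta1) * dW
    V_db = beta1 * V_db + (1 - beta1) * db
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    S_db = beta2 * S_db + (1 - beta2) * db ** 2
    # Bias-corrected moments for iteration t.
    V_dW_corr, V_db_corr = V_dW / (1 - beta1 ** t), V_db / (1 - beta1 ** t)
    S_dW_corr, S_db_corr = S_dW / (1 - beta2 ** t), S_db / (1 - beta2 ** t)
    W = W - alpha * V_dW_corr / (math.sqrt(S_dW_corr) + eps)
    b = b - alpha * V_db_corr / (math.sqrt(S_db_corr) + eps)

print(W, b)  # both close to the minimum at (0, 0)
```

Thanks to the bias correction, the very first step has magnitude close to $\alpha$ in every direction, regardless of the scale of the gradient.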

## Visualization of different optimizers

Optimizers on surface:

<div>
<img src="images/opts/gds_1.gif"  height="600" width="800" />
</div>

Optimizers on projection:

<div>
<img src="images/opts/gds_2.gif"  height="600" width="800" />
</div>

## Learning rate decay

Near an extremum, a large learning rate can cause the so-called bouncing gradient:

<div>
<img src="images/opts/lrdc_1.png"  height="600" width="800" />
</div>

To avoid this problem, we can decrease the learning rate per iteration or per epoch:

$$
\alpha = \frac{1}{1 + dr \cdot ep} \cdot \alpha_0
$$
<br>
- $dr$ - is the decay rate
- $ep$ - is the epoch number

Or square root decay:

$$
\alpha = \frac{k}{\sqrt{ep}} \cdot \alpha_0
$$
<br>
- $k$ - is another hyperparameter
- $ep$ - is the epoch number

Or mini-batch decay:

$$
\alpha = \frac{k}{\sqrt{t}} \cdot \alpha_0
$$
<br>
- $k$ - is another hyperparameter
- $t$ - mini-batch number
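
The decay formulas above translate directly into small helper functions. The function names and the sample values of $\alpha_0$, $dr$ and $k$ are illustrative assumptions, not recommended settings.

```python
import math


def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)


def sqrt_decay(alpha0, k, step):
    """alpha = k / sqrt(step) * alpha0; step is an epoch or mini-batch number >= 1."""
    return k / math.sqrt(step) * alpha0


alpha0 = 0.2
per_epoch = [inverse_time_decay(alpha0, decay_rate=1.0, epoch=ep) for ep in range(4)]
print(per_epoch)  # roughly 0.2, 0.1, 0.067, 0.05
print(sqrt_decay(alpha0, k=1.0, step=4))  # roughly 0.1
```

The square-root variants are undefined at step 0, so the epoch or mini-batch counter has to start at 1.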

Stepwise decay:

<div>
<img src="images/opts/lrdc_2.png"  height="600" width="800" />
</div>
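
One common way to write a stepwise schedule is to multiply the learning rate by a constant factor every fixed number of epochs. The notes do not give a formula for this case, so the function below, its parameter names, and the sample values are assumptions for illustration.

```python
def step_decay(alpha0, drop_factor, epochs_per_drop, epoch):
    """Multiply the learning rate by drop_factor once every epochs_per_drop epochs."""
    return alpha0 * drop_factor ** (epoch // epochs_per_drop)


# Hypothetical schedule: halve the learning rate every 10 epochs.
rates = [step_decay(0.1, drop_factor=0.5, epochs_per_drop=10, epoch=ep) for ep in (0, 9, 10, 25)]
print(rates)  # roughly 0.1, 0.1, 0.05, 0.025
```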

## Local optimum and saddle points

In the early days of deep learning, people thought that local optima were the main problem of optimization. But at a local optimum the cost has to curve in the same direction along every
$$\frac{\partial C}{\partial W^l_{i,j}}$$ 
direction. If, as a rough heuristic, each direction independently curves up or down with probability $1/2$, then for $100000000$ or even $200000000$ parameters this has a probability of about $2^{-100000000}$ or $2^{-200000000}$.

Local optima in lower dimensional case:

<div>
<img src="images/opts/lo_1.png"  height="600" width="800" />
</div>

<b> Analysis in lower dimensions often does not generalize to higher dimensions </b>

- The probability of local optima is low
- But the probability of saddle points is high

Local optima vs saddle points in higher dimensions:

<div>
<img src="images/opts/lo_2.png"  height="600" width="800" />
</div>

Saddle points and plateaus:

<div>
<img src="images/opts/saddle_1.jpeg"  height="600" width="800" />
</div>

The main goal of optimization algorithms is to escape plateaus:

<div>
<img src="images/opts/saddle_2.gif"  height="600" width="800" />
</div>

## One cycle policy CycleLR

To avoid the saddle point trap, we can increase the learning rate during training and then decrease it with some linear or non-linear (exponential) function:

<div>
<img src="images/opts/one_cycle_1.png"  height="600" width="800" />
</div>

One cycle policy with a non-linear learning rate change:

<div>
<img src="images/opts/one_cycle_2.png"  height="600" width="800" />
</div>
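
A one-cycle style schedule can be sketched as a linear warm-up followed by a cosine decay. This is only one possible shape: the function name, its parameters, and the sample values are assumptions for illustration. In practice one would typically rely on a library scheduler such as PyTorch's `torch.optim.lr_scheduler.OneCycleLR`.

```python
import math


def one_cycle_lr(step, total_steps, warmup_steps, start_lr, max_lr, final_lr):
    """Linear warm-up from start_lr to max_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    # progress runs from 0 at the peak to 1 at the last step
    progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return final_lr + 0.5 * (max_lr - final_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical run of 100 mini-batch steps with the peak at step 30.
schedule = [
    one_cycle_lr(s, total_steps=100, warmup_steps=30,
                 start_lr=1e-3, max_lr=1e-2, final_lr=1e-5)
    for s in range(100)
]
print(schedule[0], schedule[30], schedule[99])  # start, peak, and final learning rate
```

The learning rate would be recomputed before every mini-batch step and passed to the optimizer.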

We can use this technique to find a good initial learning rate before training:

<div>
<img src="images/opts/lr_find_1.png"  height="600" width="800" />
</div>

<div>
<img src="images/opts/lr_find_2.png"  height="600" width="800" />
</div>

The one cycle policy speeds up the learning process significantly and reduces the chance of getting trapped in a saddle point or a bad extremum.

There are many techniques for manipulating the learning rate during training:

<div>
<img src="images/opts/one_cycle_mx_1.png"  height="600" width="800" />
</div>

Or even:

<div>
<img src="images/opts/one_cycle_mx_2.png"  height="600" width="800" />
</div>

## Questions