In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline



## Basic setup

Create an Anaconda environment:
<br>
```bash
conda create -n ml python=3.7.5 jupyter
```
Install the fastai library:
<br>
```bash
conda install -c pytorch -c fastai fastai
```

## Optimization algorithms

#### Moving average

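Let $(y_1, y_2, \dots, y_n)$ be values of a function $f:X \to Y$ at some points $(x_1, x_2, \dots, x_n)$, so that $f(x_t) = y_t$ for $t = 1, \dots, n$. The exponentially weighted moving average $(a_1, a_2, \dots, a_n)$ of the data is defined by
<br>
$$a_1 = y_1, \quad a_t = \beta a_{t-1} + (1 - \beta) y_t \quad \text{for } t = 2, \dots, n$$
<br>
for $\beta \in [0, 1]$. A larger $\beta$ gives more weight to the history and produces a smoother average.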
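The recursion can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the convention in which $\beta$ weights the previous average, $a_t = \beta a_{t-1} + (1 - \beta) y_t$, which matches the momentum update $V_{dW} = \beta V_{dW} + (1 - \beta) dW$ used later in this notebook. The function name `exponential_moving_average` and the sample data are hypothetical.

```python
def exponential_moving_average(ys, beta=0.9):
    """Return the running averages a_1..a_n of the sequence ys.

    Assumed convention: a_1 = y_1, a_t = beta * a_{t-1} + (1 - beta) * y_t.
    """
    averages = []
    a = None
    for y in ys:
        # The first value initialises the average; later values are blended in.
        a = y if a is None else beta * a + (1 - beta) * y
        averages.append(a)
    return averages


# Made-up sample data, chosen so the arithmetic is easy to follow by hand.
ys = [1.0, 3.0, 2.0, 4.0]
print(exponential_moving_average(ys, beta=0.5))  # prints [1.0, 2.0, 2.0, 3.0]
```

With $\beta = 0.5$ each new average is the midpoint between the previous average and the new value, which makes the smoothing effect easy to check.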

## Gradient descent with momentum

Recall the gradient descent algorithm: for a cost function $C$ and learning rate $\alpha$, the parameters (weights and biases) are updated by:
$$
W^l_{i,j} = W^l_{i,j} - \alpha \frac{\partial{C}}{\partial{W^l_{i,j}}}
$$
<br>
and
$$
b^l_{j} = b^l_{j} - \alpha \frac{\partial{C}}{\partial{b^l_{j}}}
$$

Given the cost function $C$, denote its gradient with respect to the weights by:
$$
d{W} = \nabla_{W}C
$$
<br>
and with respect to the biases by:
$$
d{b} = \nabla_{b}C
$$

With this notation, the gradient descent update can be written as:
$$
W^l_{i,j} = W^l_{i,j} - \alpha \, d{W^l_{i,j}}
$$
<br>
and
$$
b^l_{j} = b^l_{j} - \alpha \, d{b^l_{j}}
$$

More generally, for batch or mini-batch gradient descent, the update can be written as:
$$
W = W - \alpha d{W}
$$
<br>
and
$$
b = b - \alpha d{b}
$$
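The update rule can be tried out on a toy problem. The sketch below is only an illustration: the cost $C(w, b) = \frac{1}{2}(w^2 + 25 b^2)$, the starting point and the hyperparameters are assumptions chosen for the example, and the names `grad` and `gradient_descent` are hypothetical. Scalars stand in for $W$ and $b$.

```python
def grad(w, b):
    # Gradient of the assumed toy cost C(w, b) = 0.5 * (w**2 + 25 * b**2).
    return w, 25.0 * b


def gradient_descent(w, b, alpha=0.07, steps=50):
    """Apply W = W - alpha * dW and b = b - alpha * db for a fixed number of steps."""
    for _ in range(steps):
        dw, db = grad(w, b)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b


w, b = gradient_descent(10.0, 1.0)
print(w, b)
```

On this cost, each step multiplies $b$ by $1 - 25\alpha = -0.75$, so $b$ changes sign at every step while it shrinks, whereas $w$ shrinks slowly by a factor of $0.93$ per step.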

<div>
<img src="images/opts/gd_1.png"  height="600" width="800" />
</div>

The optimization path has peaks: the larger the oscillations, the slower the optimization.
We can use a moving average of the gradients to damp these oscillations and speed up the optimization:
$$
V_{d W} = \beta V_{d W} + (1 - \beta)d W
$$
<br>
$$
V_{d b} = \beta V_{d b} + (1 - \beta)d b
$$

We can use these averages in the gradient descent update instead of the raw gradients:
$$
W = W - \alpha V_{d W}
$$
<br>
and
$$
b = b - \alpha V_{d b}
$$

<div>
<img src="images/opts/gd_mom_1.png"  height="600" width="800" />
</div>
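The momentum update can be sketched in the same way. This is a minimal illustration under stated assumptions: the toy cost $C(w, b) = \frac{1}{2}(w^2 + 25 b^2)$, the starting point and the hyperparameters are made up for the example, the averages $V_{dW}$ and $V_{db}$ are initialised to zero, and the names `grad` and `momentum_descent` are hypothetical.

```python
def grad(w, b):
    # Gradient of the assumed toy cost C(w, b) = 0.5 * (w**2 + 25 * b**2).
    return w, 25.0 * b


def momentum_descent(w, b, alpha=0.07, beta=0.9, steps=200):
    """Gradient descent where the step uses a moving average of the gradients."""
    v_dw, v_db = 0.0, 0.0  # assumed initial values of V_dW and V_db
    for _ in range(steps):
        dw, db = grad(w, b)
        # V = beta * V + (1 - beta) * gradient
        v_dw = beta * v_dw + (1 - beta) * dw
        v_db = beta * v_db + (1 - beta) * db
        # Step along the averaged gradient instead of the raw gradient.
        w = w - alpha * v_dw
        b = b - alpha * v_db
    return w, b


w, b = momentum_descent(10.0, 1.0)
print(w, b)
```

Because the average mixes in earlier gradients, gradients that alternate in sign partly cancel inside $V_{db}$, which is the damping effect described above.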

## RMSProp optimization

<b>Root Mean Square Prop</b>

Compute a moving average as in momentum, but of the squared gradients (the squares are taken elementwise):
$$
S_{d W} = \beta S_{d W} + (1 - \beta)d W^2 \text{ elementwise}
$$
<br>
$$
S_{d b} = \beta S_{d b} + (1 - \beta)d b^2 \text{ elementwise}
$$

Now update the weights and biases by:
$$
W = W - \alpha \frac{d W}{\sqrt{S_{d W}}}
$$
<br>
and
$$
b = b - \alpha \frac{d b}{\sqrt{S_{d b}}}
$$

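If $S_{d W}$ is large, the gradient is divided by a large number, so the step in that direction is damped; if $S_{d W}$ is small, the step is comparatively larger.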

To avoid division by zero (when $\sqrt{S_{d W}}$ or $\sqrt{S_{d b}}$ is close to zero), we can add a small $\epsilon$ to the denominators:
$$
W = W - \alpha \frac{d W}{\sqrt{S_{d W}} + \epsilon}
$$
<br>
and
$$
b = b - \alpha \frac{d b}{\sqrt{S_{d b}} + \epsilon}
$$

For instance, $\epsilon = 10^{-8}$.
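
The RMSProp update, including the $\epsilon$ term, can be sketched as follows. This is an illustration only: the toy cost $C(w, b) = \frac{1}{2}(w^2 + 25 b^2)$, the starting point and the hyperparameters are assumptions, $S_{dW}$ and $S_{db}$ are initialised to zero, and the names `grad` and `rmsprop` are hypothetical. Scalars stand in for $W$ and $b$, so the elementwise square is an ordinary square.

```python
import math


def grad(w, b):
    # Gradient of the assumed toy cost C(w, b) = 0.5 * (w**2 + 25 * b**2).
    return w, 25.0 * b


def rmsprop(w, b, alpha=0.01, beta=0.9, eps=1e-8, steps=500):
    """Scale each gradient by the root of a moving average of its square."""
    s_dw, s_db = 0.0, 0.0  # assumed initial values of S_dW and S_db
    for _ in range(steps):
        dw, db = grad(w, b)
        # S = beta * S + (1 - beta) * gradient ** 2
        s_dw = beta * s_dw + (1 - beta) * dw ** 2
        s_db = beta * s_db + (1 - beta) * db ** 2
        # eps keeps the denominator away from zero.
        w = w - alpha * dw / (math.sqrt(s_dw) + eps)
        b = b - alpha * db / (math.sqrt(s_db) + eps)
    return w, b


w, b = rmsprop(1.0, 1.0)
print(w, b)
```

Although the gradient with respect to $b$ is 25 times larger than the one with respect to $w$ at the same parameter value, dividing by $\sqrt{S}$ normalises both, so the two parameters take steps of similar size.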