# Faster Optimizers

To speed up training, we can also use an optimizer that is faster than regular Gradient Descent.

## Momentum Optimization

Gradient Descent updates the weights by directly subtracting the gradient of the cost function multiplied by the learning rate. It does not care what the previous gradients were.

Momentum optimization cares a great deal about the previous gradients: at each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the *momentum vector* $m$, and it updates the weights by adding this new momentum vector. There is a new hyperparameter $\beta$, called the *momentum*, which must be set between 0 and 1. It acts as a kind of friction that prevents the momentum from growing too large. A typical value is 0.9.
$$
m \gets \beta m - \eta \nabla_\theta J(\theta) \\
\theta \gets \theta + m
$$
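If the gradient remains constant, the maximum size of the weight updates (the terminal velocity) is equal to that gradient multiplied by the learning rate multiplied by $\frac{1}{1 - \beta}$. For example, if $\beta = 0.9$, the terminal velocity is 10 times the gradient times the learning rate, so momentum optimization can end up going up to 10 times faster than Gradient Descent. This allows it to escape from plateaus much faster, and it can help it roll past local optima.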

> Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum.

Momentum optimization is easy to use in Keras:

```python
from tensorflow import keras  # assumed import, shared by the snippets in this section
optimizer = keras.optimizers.SGD(learning_rate=0.0001, momentum=0.9)
```
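
To make the update rule and its terminal velocity concrete, here is a minimal sketch in plain Python. It is not how Keras implements `SGD`; the function name `momentum_step` and the constant gradient of 1.0 are assumptions chosen for illustration.

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """Apply one momentum update to a scalar parameter."""
    m = beta * m - lr * grad   # m <- beta * m - eta * gradient
    theta = theta + m          # theta <- theta + m
    return theta, m

# Hypothetical setup: hold the gradient constant at 1.0 so that the
# momentum vector can approach its terminal velocity.
lr, beta, grad = 0.001, 0.9, 1.0
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad, lr, beta)

terminal_velocity = -lr * grad / (1 - beta)  # predicted limit of m
plain_gd_step = -lr * grad                   # step of regular Gradient Descent
print(m, terminal_velocity, m / plain_gd_step)
```

With $\beta = 0.9$, the ratio printed last should approach 10, which is the $\frac{1}{1 - \beta}$ factor discussed above.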

## Nesterov Accelerated Gradient

One small variant of momentum optimization is *Nesterov Accelerated Gradient (NAG)*. It measures the gradient not at the local position $\theta$, but slightly ahead in the direction of the momentum, at $\theta + \beta m$.
$$
m \gets \beta m - \eta\nabla_\theta J(\theta + \beta m)\\
\theta \gets \theta + m
$$
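Because the momentum vector generally points toward the optimum, the gradient measured slightly ahead is a bit more accurate than the one measured at the original position. Each step therefore ends up a bit closer to the optimum. These small improvements add up, and NAG usually converges faster than regular momentum optimization.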

To use it, simply set `nesterov=True` when creating the `SGD` optimizer:

```python
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
```
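
To see the look-ahead gradient in isolation, the following sketch compares regular momentum and NAG on a one-dimensional quadratic. The cost function $J(\theta) = \theta^2 / 2$, the hyperparameter values, and the helper names (`trajectory`, `tail_amplitude`) are assumptions made for this example, and the result only illustrates this particular toy problem.

```python
def grad(theta):
    """Gradient of the hypothetical cost J(theta) = theta**2 / 2."""
    return theta

def momentum_step(theta, m, lr, beta):
    m = beta * m - lr * grad(theta)             # gradient at theta
    return theta + m, m

def nag_step(theta, m, lr, beta):
    m = beta * m - lr * grad(theta + beta * m)  # gradient slightly ahead
    return theta + m, m

def trajectory(step, n_steps=60, theta0=1.0, lr=0.1, beta=0.9):
    """Return the list of theta values visited, starting from theta0."""
    theta, m, path = theta0, 0.0, [theta0]
    for _ in range(n_steps):
        theta, m = step(theta, m, lr, beta)
        path.append(theta)
    return path

def tail_amplitude(path, start=40):
    """Largest distance from the optimum (theta = 0) over the late steps."""
    return max(abs(theta) for theta in path[start:])

print("momentum:", tail_amplitude(trajectory(momentum_step)))
print("NAG:     ", tail_amplitude(trajectory(nag_step)))
```

On this toy problem, the late oscillations of NAG are noticeably smaller than those of regular momentum, which matches the intuition that the look-ahead gradient damps overshooting.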

## AdaGrad

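When the cost function looks like an elongated bowl (for example, when the input features have very different scales), Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, so convergence is slow. AdaGrad corrects the direction earlier by scaling down the gradient vector along the steepest dimensions. In other words, it decays the learning rate, but faster for steep dimensions than for dimensions with gentler slopes. This is called an *adaptive learning rate*.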
$$
s \gets s + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)\\
\theta \gets \theta - \eta\nabla_\theta J(\theta) \oslash \sqrt{s + \epsilon}
$$
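The first step accumulates the squares of the gradients into the vector $s$ ($\otimes$ is element-wise multiplication). The second step is almost identical to Gradient Descent, except that the gradient vector is scaled down by $\sqrt{s + \epsilon}$ ($\oslash$ is element-wise division, and $\epsilon$ is a small smoothing term that avoids division by zero). Along a steep dimension, $s$ grows quickly, so the updates in that dimension are scaled down more, and the resulting update points more directly toward the optimum.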

AdaGrad works well for simple quadratic problems, but it often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm stalls before reaching the global optimum. So you should not use AdaGrad for neural networks, but it is still useful for understanding the concept of adaptive learning rates.
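
The two AdaGrad steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the bowl $J = \frac{1}{2}(x^2 + 100y^2)$, the helper names, and the values of `lr` and `eps` are all hypothetical choices.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """One AdaGrad update on lists of parameters; returns the new (theta, s)."""
    # s <- s + grad (x) grad  (element-wise square, accumulated forever)
    s = [s_i + g_i * g_i for s_i, g_i in zip(s, grad)]
    # theta <- theta - lr * grad (/) sqrt(s + eps)  (element-wise division)
    theta = [t_i - lr * g_i / math.sqrt(s_i + eps)
             for t_i, g_i, s_i in zip(theta, grad, s)]
    return theta, s

def bowl_gradient(theta):
    """Gradient of the hypothetical bowl J = 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return [x, 100.0 * y]

theta, s = [1.0, 1.0], [0.0, 0.0]
grad = bowl_gradient(theta)           # [1.0, 100.0]: y is 100 times steeper
theta, s = adagrad_step(theta, s, grad)
print(theta)                          # both coordinates moved by about lr
```

Although the gradient along `y` is 100 times larger, both coordinates move by roughly the same amount on this first step, so the update heads toward the optimum instead of mostly down the steep side. Because `s` only ever grows, later steps keep shrinking, which is the early-stopping behavior described above.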

## RMSProp

RMSProp fixes this problem of AdaGrad by accumulating only the gradients from the most recent iterations (rather than all the gradients since the beginning of training). It does so by using exponential decay in the first step:
$$
s \gets \beta s + (1 - \beta) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)
\\
\theta \gets \theta - \eta \nabla_\theta J(\theta) \oslash \sqrt{s + \epsilon}
$$
The decay rate $\beta$ is typically set to 0.9. In Keras it is the `rho` argument, and you can use RMSProp like this:

```python
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
```

Except on very simple problems, RMSProp generally performs much better than AdaGrad, and it was the default choice in much research before Adam arrived.
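
The difference between the two accumulators can be checked with a small plain-Python sketch. The constant gradient of 1.0, the 500 steps, and the helper names are assumptions for illustration, not part of any library.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update on a scalar parameter; returns the new (theta, s)."""
    s = beta * s + (1 - beta) * grad * grad          # decaying average of squares
    theta = theta - lr * grad / math.sqrt(s + eps)   # scaled gradient step
    return theta, s

def adagrad_accumulate(s, grad):
    """AdaGrad's accumulator, which sums every squared gradient."""
    return s + grad * grad

# Hypothetical setup: a constant gradient of 1.0 for 500 steps.
theta, s_rms, s_ada = 0.0, 0.0, 0.0
for _ in range(500):
    theta, s_rms = rmsprop_step(theta, s_rms, 1.0)
    s_ada = adagrad_accumulate(s_ada, 1.0)

print(s_rms)  # levels off near the squared gradient
print(s_ada)  # keeps growing with the number of steps
```

Because `s_rms` levels off, the RMSProp step size stays close to the learning rate, whereas AdaGrad's step size would keep shrinking as `s_ada` grows.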

## Adam and Nadam

Adam, which stands for *adaptive moment estimation*, combines the ideas of momentum optimization and RMSProp: just like momentum optimization, it keeps track of an exponentially decaying average of past gradients, and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients:
$$
m \gets \beta_1m - (1 - \beta_1) \nabla_\theta J(\theta)
\\
s \gets \beta_2s + (1 - \beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)
\\
\hat{m} \gets \frac{m}{1-\beta_1^t}
\\
\hat{s} \gets \frac{s}{1-\beta_2^t}
\\
\theta \gets \theta + \eta \hat{m} \oslash\sqrt{\hat{s} + \epsilon}
$$
Here, $t$ is the iteration number (starting at 1).

The third and fourth steps (computing $\hat{m}$ and $\hat{s}$) are needed because $m$ and $s$ are initialized to zero, so they are biased toward zero at the beginning of training. These two steps boost $m$ and $s$ during the first iterations.

The momentum decay hyperparameter $\beta_1$ is typically initialized to 0.9, while the scaling decay hyperparameter $\beta_2$ is often initialized to 0.999.

Here is how to use Adam in Keras:

```python
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
```

Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate. You can often start with $\eta = 0.001$, making Adam even easier to use than Gradient Descent.
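
As a rough illustration of the five Adam steps, here is a scalar sketch in plain Python. It follows the sign convention of the equations above (`m` accumulates negative gradients, so the update adds it). The function name `adam_step`, the value of `eps`, and the example gradient of 4.0 are assumptions for this example.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update on a scalar parameter; t is the iteration number (from 1)."""
    m = beta1 * m - (1 - beta1) * grad           # decaying average of gradients
    s = beta2 * s + (1 - beta2) * grad * grad    # decaying average of squares
    m_hat = m / (1 - beta1 ** t)                 # bias correction for m
    s_hat = s / (1 - beta2 ** t)                 # bias correction for s
    theta = theta + lr * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s

# First iteration with a hypothetical gradient of 4.0.
theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, grad=4.0, t=1)
print(theta, m, s)
```

On the first iteration the bias-corrected ratio $\hat{m} / \sqrt{\hat{s}}$ has a magnitude of about 1, so the parameter moves by roughly the learning rate regardless of the size of the gradient.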

Finally, two variants of Adam are worth mentioning:

- *AdaMax*:
  - It replaces the second step with $s \gets \max(\beta_2 s, |\nabla_\theta J(\theta)|)$, drops the bias correction of $s$, and scales the update by $s$ itself rather than by $\sqrt{\hat{s}}$. This can make AdaMax more stable than Adam, but it really depends on the dataset. In general Adam performs well.
- *Nadam*:
  - It is Adam plus the Nesterov trick, so it often converges slightly faster than Adam.

> On some datasets, adaptive optimization methods may not generalize well. When that happens, try plain Nesterov Accelerated Gradient instead.

> ### Training Sparse Models
>
> All the optimization algorithms discussed above produce dense models, meaning that most parameters will be nonzero. To make the model faster at runtime, you can set the smallest weights to zero, but this may degrade the model's performance. Alternatively, you can apply strong $\ell_1$ regularization during training, as it pushes the optimizer to zero out as many weights as it can.

## Learning Rate Scheduling

Setting the learning rate properly is very important. If it is too large, training may diverge; if it is too small, training converges slowly. You can find a good learning rate by training the model several times with different values, but that is time-consuming, and it still leaves you with a constant value throughout training.

There is a better way: start with a large learning rate and then reduce it as training progresses, which gives both fast and good convergence. This technique is called a *learning schedule*. Here are the most common ones:

*Power scheduling*

- Set the learning rate to a function of the iteration number $t$:
  $$
  \eta(t) = \frac{\eta_0}{(1 + \frac{t}{s})^c}
  $$
  Here,

  - $t$: iteration number
  - $\eta_0$: initial learning rate
  - $s$: a step number hyperparameter
  - $c$: a power constant (typically set to 1)

  The learning rate drops at each step: with $c = 1$, after $s$ steps it is down to $\eta_0/2$, after $2s$ steps to $\eta_0/3$, and so on. This schedule drops quickly at first, then more and more slowly.

*Exponential scheduling*

- Set the learning rate to $\eta(t) = \eta_0 \cdot 0.1^{t/s}$. The learning rate then keeps dropping by a factor of 10 every $s$ steps.

*Piecewise constant scheduling*

- Use a constant learning rate for a number of epochs (e.g., $\eta_0 = 0.1$ for 5 epochs), then a smaller learning rate for another number of epochs (e.g., $\eta_1 = 0.05$ for 50 epochs), and so on. It can work well, but you have to choose every learning rate and how many epochs to use it for.

*Performance scheduling*

- Measure the validation error, and when it stops dropping, reduce the learning rate by a factor of $\lambda$.

*1-cycle scheduling*

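- This approach starts by increasing the learning rate from $\eta_0$ up to $\eta_1$ during the first half of training, then decreases it back down to $\eta_0$ during the second half. The maximum learning rate $\eta_1$ is found by running several training trials, and $\eta_0$ is roughly 10 times lower.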
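
The 1-cycle idea can be sketched as a plain Python function of the iteration number. The linear ramps, the function name `one_cycle_lr`, and the example values are assumptions for illustration; the description above only fixes the start, peak, and end rates.

```python
def one_cycle_lr(iteration, n_iterations, lr0, lr1):
    """Ramp linearly from lr0 up to lr1 over the first half of training,
    then linearly back down to lr0 over the second half."""
    half = n_iterations / 2
    if iteration <= half:
        fraction = iteration / half                    # 0 -> 1 going up
    else:
        fraction = (n_iterations - iteration) / half   # 1 -> 0 going down
    return lr0 + (lr1 - lr0) * fraction

# Hypothetical run of 100 iterations, with lr0 ten times lower than lr1.
for it in (0, 25, 50, 75, 100):
    print(it, one_cycle_lr(it, n_iterations=100, lr0=0.001, lr1=0.01))
```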

Implementing power scheduling in Keras is the easiest option:

```python
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
```

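The `decay` argument is the inverse of $s$, and $c$ is fixed at 1. Note that `decay` belongs to the older Keras optimizer API; newer releases expect a schedule object from `keras.optimizers.schedules` instead.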
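
As a quick check of the power scheduling formula and of the relation between `decay` and $s$, here is a plain-Python sketch. The function name `power_schedule` is hypothetical, and `s = 10_000` is chosen as the inverse of `decay=1e-4`.

```python
def power_schedule(t, lr0=0.01, s=10_000, c=1):
    """Learning rate at iteration t: lr0 / (1 + t / s) ** c."""
    return lr0 / (1 + t / s) ** c

# With c = 1, the rate is lr0 / 2 after s steps, lr0 / 3 after 2s steps, ...
for t in (0, 10_000, 20_000, 30_000):
    print(t, power_schedule(t))
```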

For exponential and piecewise constant scheduling, you can define a function that takes the current epoch and returns the learning rate:

```python
def exp_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)
```

If you don't want to hardcode $\eta_0$ and $s$, you can create an outer function that returns a configured version of the previous function:

```python
def exp_decay(lr0, s):
    def exp_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exp_decay_fn

exp_decay_fn = exp_decay(lr0=0.01, s=20)
```
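
Piecewise constant scheduling can follow the same pattern. The sketch below is one possible way to write it: the helper name `piecewise_constant`, the use of `bisect`, and the third rate of 0.01 are assumptions added for illustration, while the first two rates and durations match the example given earlier.

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """Build a schedule function from epoch boundaries and learning rates.

    boundaries: sorted epochs at which the rate changes.
    values: one learning rate per interval, so len(boundaries) + 1 entries.
    """
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")

    def piecewise_constant_fn(epoch):
        # Count how many boundaries are <= epoch to pick the interval.
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn

# 0.1 for epochs 0-4, 0.05 for epochs 5-54, then 0.01 (hypothetical third rate).
piecewise_constant_fn = piecewise_constant([5, 55], [0.1, 0.05, 0.01])
print([piecewise_constant_fn(epoch) for epoch in (0, 4, 5, 54, 55)])
```

Like `exp_decay_fn`, the returned function takes an epoch number, so it can be used wherever an epoch-based schedule function is expected.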

Next, create a `LearningRateScheduler` callback with this schedule function and pass it to the `fit()` method:

```python
lr_scheduler = keras.callbacks.LearningRateScheduler(exp_decay_fn)
history = model.fit(X, y, [...], callbacks=[lr_scheduler])
```

The callback updates the learning rate at the beginning of every epoch. When the model is saved, the current learning rate is saved with it, so you can continue training later. However, if your schedule function takes `epoch` as an argument, the epoch count restarts from 0 each time you call `fit()`. To solve this, set the `fit()` method's `initial_epoch` argument to the right epoch.

For performance scheduling, use the `ReduceLROnPlateau` callback. For example, the following callback multiplies the learning rate by 0.5 whenever the best validation loss does not improve for 5 consecutive epochs:

```python
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
```
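
The logic behind this kind of performance scheduling can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the function name `reduce_on_plateau` and the example losses are hypothetical, and the real callback's other options (such as `min_delta`, `cooldown`, and `min_lr`) are ignored.

```python
def reduce_on_plateau(val_losses, lr0, factor=0.5, patience=5):
    """Return the learning rate in effect during each epoch, given the
    validation loss measured at the end of each epoch."""
    lr, best, wait = lr0, float("inf"), 0
    lrs = []
    for loss in val_losses:
        lrs.append(lr)                 # rate used during this epoch
        if loss < best:
            best, wait = loss, 0       # new best loss: reset the counter
        else:
            wait += 1
            if wait >= patience:       # no improvement for `patience` epochs
                lr *= factor
                wait = 0
    return lrs

# Hypothetical losses: improvement for two epochs, then a plateau.
val_losses = [1.0, 0.9] + [0.9] * 6
lrs = reduce_on_plateau(val_losses, lr0=0.1)
print(lrs)
```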

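An alternative is to define the learning rate with one of the schedules available in `tf.keras.optimizers.schedules` and then pass it to the optimizer. This approach updates the learning rate at each step rather than at each epoch: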

```python
s = 20 * len(X_train) // 32  # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)
```

