# Faster Optimizers

To speed up training, we can also use a faster optimizer than regular Gradient Descent.

## Momentum Optimization

Gradient Descent updates the weights by directly subtracting the gradient of the cost function multiplied by the learning rate $\eta$. It does not care what the previous gradients were.

Momentum optimization cares a lot about the previous gradients: at each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the *momentum vector* $m$, and it updates the weights by adding this new momentum vector. There is a new hyperparameter $\beta$, called the *momentum*. It acts like friction, preventing the momentum vector from growing too large, and must be set between 0 (high friction) and 1 (no friction). A typical value is 0.9.
$$
\begin{aligned}
m &\gets \beta m - \eta \nabla_\theta J(\theta) \\
\theta &\gets \theta + m
\end{aligned}
$$
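If the gradient remains constant, the momentum vector converges to a terminal velocity: the maximum size of the weight updates is equal to that gradient multiplied by the learning rate multiplied by $\frac1{1 - \beta}$. For example, if $\beta=0.9$, the terminal velocity is 10 times the gradient times the learning rate, so momentum optimization can end up moving 10 times faster than Gradient Descent. This helps it escape from plateaus much faster and can help it roll past local optima.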
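
The terminal-velocity claim can be checked numerically. The following is a minimal sketch in plain Python, assuming a scalar weight and a gradient that stays constant at 1.0; the helper name `momentum_step` and the values of `eta` and `grad` are illustrative choices, not part of any library.

```python
def momentum_step(theta, m, grad, eta, beta):
    """One momentum update: m <- beta*m - eta*grad, then theta <- theta + m."""
    m = beta * m - eta * grad
    theta = theta + m
    return theta, m

# Assumed toy setting: constant gradient, illustrative hyperparameters.
eta, beta = 0.01, 0.9
grad = 1.0
theta, m = 0.0, 0.0

for _ in range(200):
    theta, m = momentum_step(theta, m, grad, eta, beta)

gd_step = eta * grad                  # step size of plain Gradient Descent
terminal_step = gd_step / (1 - beta)  # predicted limit of |m|

print(abs(m), terminal_step, abs(m) / gd_step)
```

After enough iterations, `abs(m)` settles at `terminal_step`, and the ratio to the plain Gradient Descent step is close to $\frac1{1 - \beta} = 10$.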

> Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum.
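
This overshooting can be seen on a toy problem. The sketch below assumes a one-dimensional cost $J(\theta) = \frac12\theta^2$ (so the gradient is simply $\theta$) and illustrative values for the learning rate and starting point; it counts how often $\theta$ crosses the minimum at 0.

```python
# Assumed toy problem: J(theta) = 0.5 * theta**2, whose gradient is theta.
eta, beta = 0.1, 0.9   # illustrative hyperparameters
theta, m = 1.0, 0.0    # start away from the minimum at theta = 0

history = [theta]
for _ in range(200):
    grad = theta               # gradient of 0.5 * theta**2
    m = beta * m - eta * grad  # momentum update
    theta = theta + m          # weight update
    history.append(theta)

# A sign change between consecutive values means theta overshot the minimum.
sign_changes = sum(1 for a, b in zip(history, history[1:]) if a * b < 0)

print("times the minimum was overshot:", sign_changes)
print("final theta:", theta)
```

With these settings, $\theta$ crosses zero repeatedly while the amplitude of the oscillation shrinks, so it ends up very close to the minimum.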

Momentum optimization is easy to use in Keras:

```python
from tensorflow import keras
optimizer = keras.optimizers.SGD(learning_rate=0.0001, momentum=0.9)
```

