# Advanced Optimization Methods

In the [previous notebook](./sgd.ipynb), we encountered stochastic gradient descent (SGD) as the standard method for fitting neural networks to data. Despite its simplicity, SGD is still the foundation of many state-of-the-art optimizers. In this notebook, we learn how the speed and accuracy of SGD can be improved with simple tricks.

This notebook is based on Sebastian Ruder's wonderful [blog post](http://ruder.io/optimizing-gradient-descent/index.html).

## Adagrad

After several iterations of SGD the weights gradually approach the optimum. However, at a certain point the gradient steps become too coarse to achieve further improvements. Once this happens, the learning rate should be reduced to arrive at a better approximation.

**Adagrad**, developed by [John Duchi, Elad Hazan, and Yoram Singer](http://jmlr.org/papers/v12/duchi11a.html), achieves this goal by reducing the learning rate with successive gradient steps.

More precisely, writing $g^{(i)} = \nabla \ell(w^{(i)})$ for the gradient of the loss function in the $i$th iteration, we let 
$$G^{(i)} = \sum_{k \le i} (g^{(k)})^2$$
denote the sum of squared gradients in the first $i$ steps. Then, the Adagrad update step becomes
$$w^{(i+1)} = w^{(i)} - \frac\alpha{\sqrt{G^{(i)}}}g^{(i)}.$$

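In the actual implementation, Adagrad maintains a separate learning rate for each weight, so as to account for situations where some weights converge faster than others. That is, the square and the square root in the formulas above are applied to each component of the gradient separately.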
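
As a rough illustration, the per-weight update can be sketched in plain Python on a toy quadratic loss. The function name `adagrad`, the toy loss, the use of the full gradient instead of a stochastic one, and the small constant `eps` (added to avoid division by zero) are assumptions made for this sketch and do not appear in the formulas above.

```python
import math

def adagrad(grad, w, alpha=0.5, steps=200, eps=1e-8):
    """Run Adagrad steps on the weights `w`, given a gradient function `grad`."""
    w = list(w)
    G = [0.0] * len(w)  # running sum of squared gradients, one entry per weight
    for _ in range(steps):
        g = grad(w)
        for j in range(len(w)):
            G[j] += g[j] ** 2                               # accumulate squared gradient
            w[j] -= alpha / (math.sqrt(G[j]) + eps) * g[j]  # per-weight step size
    return w

# Hypothetical toy loss l(w) = w_0^2 + 10 * w_1^2 with gradient (2 w_0, 20 w_1)
def toy_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w_adagrad = adagrad(toy_grad, [1.0, 1.0])
print(w_adagrad)
```

Because each weight is divided by its own accumulated gradient magnitude, both coordinates shrink at nearly the same pace in this example, even though the gradient of the second coordinate is ten times larger.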

## RMSProp

In practice, Adagrad often reduces the learning rate too aggressively, which results in slow learning. **RMSProp**, developed by [Geoff Hinton](https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf), resolves this problem by replacing the sum of squared gradients with a moving average:
$$G^{(i)} = .9 \, G^{(i-1)} + .1\, (g^{(i)})^2.$$

As before, we put
$$w^{(i+1)} = w^{(i)} - \frac\alpha{\sqrt{G^{(i)}}}g^{(i)}.$$
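
A minimal sketch of this update in plain Python is given below, again on a toy quadratic loss. The names `rmsprop` and `decay`, the toy loss, the initial value $G = 0$, and the stabilizing constant `eps` are assumptions of the sketch. The moving average is updated with the current gradient before the step is taken.

```python
import math

def rmsprop(grad, w, alpha=0.01, decay=0.9, steps=500, eps=1e-8):
    """Run RMSProp steps on the weights `w`, given a gradient function `grad`."""
    w = list(w)
    G = [0.0] * len(w)  # moving average of squared gradients, one entry per weight
    for _ in range(steps):
        g = grad(w)
        for j in range(len(w)):
            G[j] = decay * G[j] + (1.0 - decay) * g[j] ** 2
            w[j] -= alpha / (math.sqrt(G[j]) + eps) * g[j]
    return w

# Hypothetical toy loss l(w) = w_0^2 + 10 * w_1^2 with gradient (2 w_0, 20 w_1)
def toy_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w_rmsprop = rmsprop(toy_grad, [1.0, 1.0])
print(w_rmsprop)
```

Since the moving average forgets old gradients, the effective step size does not shrink to zero as it does in Adagrad. With a constant $\alpha$, the iterates in this sketch therefore end up hovering in a small neighborhood of the minimum rather than converging exactly.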

## Momentum

**Momentum** is a [classical technique](https://www.sciencedirect.com/science/article/pii/0041555364901375?via%3Dihub) in optimization for dampening oscillations appearing in the course of gradient descent:

<img src="images/momentum.png" alt="Drawing" style="width: 500px;"/>
Source: https://distill.pub/2017/momentum/

The idea is to smooth the gradients by replacing them with moving averages. Using a parameter $\beta<1$, we introduce the momentum vector
$$m^{(i)} = \beta m^{(i-1)} + g^{(i)},$$
and define the update step via
$$w^{(i+1)} = w^{(i)} - \alpha m^{(i)}.$$

Gabriel Goh's [Distill article](https://distill.pub/2017/momentum/) provides beautiful interactive illustrations and a mathematical analysis of how momentum compares with vanilla gradient descent.
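
The effect can be tried out with a small plain-Python sketch. The function name `sgd_momentum`, the toy quadratic loss, the use of the full gradient, and the chosen values of $\alpha$ and $\beta$ are assumptions for illustration. The momentum vector is updated with the current gradient before the step is taken, and setting `beta=0` recovers plain gradient descent.

```python
def sgd_momentum(grad, w, alpha=0.01, beta=0.9, steps=500):
    """Run gradient steps with momentum on `w`, given a gradient function `grad`."""
    w = list(w)
    m = [0.0] * len(w)  # momentum vector, starts at zero
    for _ in range(steps):
        g = grad(w)
        for j in range(len(w)):
            m[j] = beta * m[j] + g[j]  # accumulate a decaying sum of gradients
            w[j] -= alpha * m[j]       # step along the smoothed direction
    return w

# Hypothetical toy loss l(w) = w_0^2 + 10 * w_1^2 with gradient (2 w_0, 20 w_1)
def toy_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w_momentum = sgd_momentum(toy_grad, [1.0, 1.0], beta=0.9)
w_plain = sgd_momentum(toy_grad, [1.0, 1.0], beta=0.0)  # no momentum
print(w_momentum, w_plain)
```

On this toy problem, the run with momentum ends up considerably closer to the minimum along the flat first coordinate than the run without it, for the same learning rate and number of steps.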

## Adam

**Adam** was developed by [Diederik P. Kingma and Jimmy Ba](https://arxiv.org/abs/1412.6980) and augments RMSProp with momentum. More precisely, we put
$$m^{(i)} = \beta_1 m^{(i-1)} + (1 - \beta_1) g^{(i)},$$
$$G^{(i)} = \beta_2 G^{(i-1)} + (1 - \beta_2) (g^{(i)})^2,$$

and then define the update step
$$w^{(i+1)} = w^{(i)} - \frac\alpha{\sqrt{G^{(i)}}}m^{(i)}.$$

The actual implementation also contains bias-corrections for $m^{(i)}$ and $G^{(i)}$.
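
The following plain-Python sketch combines the two moving averages and includes the bias corrections in the form proposed in the Adam paper, where both averages are divided by $1 - \beta^i$ to compensate for their initialization at zero. The function name `adam`, the toy loss, the use of the full gradient, and the hyperparameter values are assumptions for illustration, not a reference implementation.

```python
import math

def adam(grad, w, alpha=0.01, beta1=0.9, beta2=0.999, steps=2000, eps=1e-8):
    """Run Adam steps on the weights `w`, given a gradient function `grad`."""
    w = list(w)
    m = [0.0] * len(w)  # moving average of gradients (momentum)
    G = [0.0] * len(w)  # moving average of squared gradients (as in RMSProp)
    for i in range(1, steps + 1):
        g = grad(w)
        for j in range(len(w)):
            m[j] = beta1 * m[j] + (1.0 - beta1) * g[j]
            G[j] = beta2 * G[j] + (1.0 - beta2) * g[j] ** 2
            m_hat = m[j] / (1.0 - beta1 ** i)  # bias-corrected first moment
            G_hat = G[j] / (1.0 - beta2 ** i)  # bias-corrected second moment
            w[j] -= alpha * m_hat / (math.sqrt(G_hat) + eps)
    return w

# Hypothetical toy loss l(w) = w_0^2 + 10 * w_1^2 with gradient (2 w_0, 20 w_1)
def toy_grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w_adam = adam(toy_grad, [1.0, 1.0])
print(w_adam)
```

Without the bias corrections, both averages would be strongly biased toward zero during the first iterations, because they start at zero and $\beta_1$, $\beta_2$ are close to 1.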


## Second-Order Methods?

In classical optimization, gradient descent approaches are typically superseded by methods that incorporate information on the Hessian, such as the Newton-Raphson algorithm. This has the advantage of speeding up convergence substantially. There are three reasons why this approach does not carry over to deep learning.

1. Even moderately sized networks have millions of weights, so the number of entries in the Hessian runs into the trillions. This is prohibitive unless highly sophisticated heuristics are used to reduce the number of weights.
2. The Hessian is typically [ill-conditioned in deep learning](https://arxiv.org/abs/1706.04454). That is, the vast majority of eigenvalues are close to 0 and matrix inversion is numerically unstable.
3. The number of training examples that are available nowadays is typically very large, in particular if techniques like data augmentation are used. Spending more computational time on refined optimization methods means that the algorithm can see less training data. Experience shows that the time gained by faster convergence is typically not worth the price paid by seeing fewer examples.
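
To put the first point into numbers, here is a back-of-the-envelope calculation. The helper `hessian_size` is hypothetical, and the figure of 4 bytes per entry (32-bit floats, dense storage) is an assumption.

```python
def hessian_size(num_weights, bytes_per_entry=4):
    """Return the number of entries and the dense storage size of the Hessian."""
    num_entries = num_weights ** 2  # the Hessian is a num_weights x num_weights matrix
    return num_entries, num_entries * bytes_per_entry

entries, n_bytes = hessian_size(1_000_000)  # a network with one million weights
print(f"{entries:.1e} entries, about {n_bytes / 1e12:.0f} TB")
# prints "1.0e+12 entries, about 4 TB"
```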

## Optimization in Keras

Keras makes it very convenient to specify the optimization algorithm to be used:

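```python
from keras import layers, models, optimizers

# Small fully connected network: 10 input features -> 64 hidden units -> 1 output
model = models.Sequential([
    layers.Dense(64, input_shape=(10,)),
    layers.Dense(1)
])

# `learning_rate` is the current argument name; older Keras releases called it `lr`
adam = optimizers.Adam(learning_rate=1e-3)
model.compile(loss='mean_squared_error', optimizer=adam)
```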