### Momentum updates 

Momentum can be used to update the weights instead of vanilla gradient descent. Intuitively, gradient descent is like a person walking down a hill along the steepest path, whereas a momentum update is like a ball rolling down that hill: it accelerates, and its inertia smooths out the trajectory.

Let $w$ denote the vector of all weights and biases, $J$ the cost function, and $\alpha$ the learning rate. The gradient descent update at the $k$-th iteration is

$$ w^{k+1} = w^k - \alpha \nabla J(w^k) $$

The momentum update adds a term $z^k$ that accumulates the gradients from the previous iterations.

$$
\begin{equation}
\begin{split}
z^{k+1} &= \beta z^k + \nabla J(w^k) \\
w^{k+1} &= w^k - \alpha z^{k+1}
\end{split}
\end{equation}
$$
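
To see what $z$ accumulates, the recursion can be unrolled. Assuming the accumulator starts at $z^0 = 0$ (an initialization the text does not state), each past gradient enters with a geometrically decaying weight:

$$
z^{k+1} = \sum_{i=0}^{k} \beta^{i} \, \nabla J(w^{k-i})
$$

If the gradient stayed constant at some value $g$, this sum would approach $\frac{g}{1-\beta}$, so the step would behave like one taken with learning rate $\frac{\alpha}{1-\beta}$. For $\beta = 0.9$ that approaches ten times the plain gradient descent step.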

The momentum coefficient $\beta$ is chosen between $0$ and $1$. Observe that $\beta=0$ (no momentum) reduces the update to vanilla gradient descent. See this [blog](https://distill.pub/2017/momentum/) for more details and an interactive visualization for experimenting with different values of $\alpha$ and $\beta$.
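
The two update rules can be compared with a minimal sketch in plain Python. The toy objective $J(w) = \frac{1}{2}(w_1^2 + 25 w_2^2)$, the function names, the starting point, and the values of $\alpha$ and $\beta$ are all illustrative assumptions, not part of the text above.

```python
def J(w):
    """Toy objective (an assumption for illustration): an elongated quadratic bowl."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def grad_J(w):
    """Gradient of the toy objective."""
    return [w[0], 25.0 * w[1]]


def momentum_descent(w0, alpha, beta, steps):
    """Run the momentum update; beta = 0 gives plain gradient descent."""
    w = list(w0)
    z = [0.0] * len(w)  # accumulated gradient, assuming z^0 = 0
    for _ in range(steps):
        g = grad_J(w)
        z = [beta * zi + gi for zi, gi in zip(z, g)]   # z^{k+1} = beta z^k + grad J(w^k)
        w = [wi - alpha * zi for wi, zi in zip(w, z)]  # w^{k+1} = w^k - alpha z^{k+1}
    return w


w_gd = momentum_descent([1.0, 1.0], alpha=0.03, beta=0.0, steps=100)
w_mom = momentum_descent([1.0, 1.0], alpha=0.03, beta=0.8, steps=100)
print("gradient descent, final J:", J(w_gd))
print("momentum,         final J:", J(w_mom))
```

On this particular problem the shallow direction limits plain gradient descent, and the accumulated term $z$ lets momentum move faster along it. Other objectives or hyperparameters may behave differently.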

Weight updates using momentum often converge faster than plain gradient descent, even when the latter uses learning rate decay. A slight variant called Nesterov momentum often performs better than regular momentum. It evaluates the gradient at a look-ahead point rather than at the current weights; if $w^k$ denotes that look-ahead point, the update can be written as

$$
\begin{equation}
\begin{split}
z^{k+1} &= \beta z^k + \nabla J(w^{k}) \\
w^{k+1} &= w^k - \alpha z^{k+1} - \alpha \beta \left(z^{k+1} - z^{k} \right)
\end{split}
\end{equation}
$$
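
A minimal sketch of Nesterov momentum in its commonly used look-ahead form: the gradient is taken at the point where the accumulated momentum is about to carry the weights, and the rest of the update matches regular momentum. Some references track the look-ahead point instead of the weights; the two forms describe the same method. The toy objective, the function names, and the chosen $\alpha$ and $\beta$ are illustrative assumptions.

```python
def J(w):
    """Toy objective (an assumption for illustration)."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def grad_J(w):
    """Gradient of the toy objective."""
    return [w[0], 25.0 * w[1]]


def nesterov_descent(w0, alpha, beta, steps):
    """Nesterov momentum, tracking the weights and a separate look-ahead point."""
    w = list(w0)
    z = [0.0] * len(w)  # accumulated gradient, assuming z^0 = 0
    for _ in range(steps):
        # Point the current momentum would carry the weights to.
        lookahead = [wi - alpha * beta * zi for wi, zi in zip(w, z)]
        g = grad_J(lookahead)
        z = [beta * zi + gi for zi, gi in zip(z, g)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]
    return w


w_nag = nesterov_descent([1.0, 1.0], alpha=0.03, beta=0.8, steps=200)
print("Nesterov momentum, final J:", J(w_nag))
```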

The optimization methods above apply the same global learning rate to all parameters. Adaptive learning rate methods such as Adagrad, RMSprop, and Adam tune the learning rate per parameter. They often perform better in practice and are widely used in state-of-the-art models.


### Adaptive learning rate methods

#### Adagrad
Adagrad uses a cache $c$ that accumulates the squared gradients over the iterations for every parameter (weight or bias). It uses this cache to normalize the weight updates: the effective learning rate is $\frac{\alpha}{\sqrt{c}+\epsilon}$, where $\epsilon$ is a smoothing term (usually $10^{-7}$) that avoids division by zero. The effective rate is low for parameters whose gradients are large and high for parameters whose gradients are small. Since the learning rate differs from parameter to parameter, such a method is called an adaptive learning rate method. The square, square root, and division are applied element-wise.

$$
\begin{equation}
\begin{split}
c &:= c + (\nabla J)^2 \\
w &:= w - \frac{\alpha \nabla J}{\sqrt{c}+\epsilon}
\end{split}
\end{equation}
$$
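
The Adagrad update can be sketched in a few lines of plain Python. The helper name `adagrad_step`, the toy gradient, and the value $\alpha = 0.1$ are assumptions made for illustration. The printed effective rates show the per-parameter behavior described above.

```python
import math


def adagrad_step(w, c, g, alpha=0.1, eps=1e-7):
    """One Adagrad update on lists of parameters w, caches c and gradients g."""
    c = [ci + gi * gi for ci, gi in zip(c, g)]  # accumulate squared gradients
    w = [wi - alpha * gi / (math.sqrt(ci) + eps) for wi, gi, ci in zip(w, g, c)]
    return w, c


def grad_J(w):
    """Gradient of a toy objective J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


w, c = [1.0, 1.0], [0.0, 0.0]
for k in range(5):
    w, c = adagrad_step(w, c, grad_J(w))
    rates = [0.1 / (math.sqrt(ci) + 1e-7) for ci in c]  # effective learning rates
    print(k, w, rates)
```

In this sketch the second parameter has a gradient 25 times larger than the first at the start, yet the first update moves both by almost the same amount, because each gradient is divided by its own accumulated magnitude.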

#### RMSprop

The effective learning rate of Adagrad decreases monotonically with each iteration (since $c$ never decreases). RMSprop counters this behavior by keeping a moving average of the squared gradients instead of simply summing them in the cache.
$$
\begin{equation}
\begin{split}
c &:= \beta c + (1-\beta)(\nabla J)^2 \\
w &:= w - \frac{\alpha \nabla J}{\sqrt{c}+\epsilon}
\end{split}
\end{equation}
$$
Here $\beta$, the decay rate, typically takes values between 0.9 and 0.99. The effective learning rate for RMSprop still depends on the magnitude of the gradients for each parameter, but it no longer decreases monotonically because the cache is a moving average. For this reason RMSprop is often more effective in practice than Adagrad.
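
The difference between the two caches can be illustrated with a small sketch. The helper names and the gradient sequence (large values followed by small ones) are hypothetical and chosen only to make the effect visible.

```python
import math


def adagrad_step(w, c, g, alpha=0.01, eps=1e-7):
    """One Adagrad update: the cache is a running sum of squared gradients."""
    c = [ci + gi * gi for ci, gi in zip(c, g)]
    w = [wi - alpha * gi / (math.sqrt(ci) + eps) for wi, gi, ci in zip(w, g, c)]
    return w, c


def rmsprop_step(w, c, g, alpha=0.01, beta=0.9, eps=1e-7):
    """One RMSprop update: the cache is a moving average of squared gradients."""
    c = [beta * ci + (1.0 - beta) * gi * gi for ci, gi in zip(c, g)]
    w = [wi - alpha * gi / (math.sqrt(ci) + eps) for wi, gi, ci in zip(w, g, c)]
    return w, c


# Hypothetical gradient sequence for a single parameter: large at first, then small.
grads = [10.0] * 5 + [0.1] * 5
w_a, c_a = [0.0], [0.0]
w_r, c_r = [0.0], [0.0]
for g in grads:
    w_a, c_a = adagrad_step(w_a, c_a, [g])
    w_r, c_r = rmsprop_step(w_r, c_r, [g])
    print(f"g={g:5.1f}  adagrad cache={c_a[0]:9.3f}  rmsprop cache={c_r[0]:9.3f}")
```

Once the gradients become small, the RMSprop cache shrinks again and the effective learning rate recovers, while the Adagrad cache can only grow.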

#### Adam
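Adam combines RMSprop with momentum. The gradient $\nabla J$ in the weight update is replaced by a momentum term $m$, a moving average of the gradients, which gives smoother weight updates. The simplified form shown here omits the bias-correction step of the full algorithm.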
$$
\begin{equation}
\begin{split}
m &:= \beta_1 m + (1-\beta_1)\nabla J \\
c &:= \beta_2 c + (1-\beta_2)(\nabla J)^2 \\
w &:= w - \frac{\alpha m}{\sqrt{c}+\epsilon}
\end{split}
\end{equation}
$$

Among adaptive learning rate methods, Adam often performs best, and it is usually safe to use it with the default hyperparameter values, that is, $\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999, \epsilon=10^{-7}$ (the original paper suggests $\epsilon = 10^{-8}$).
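
A minimal sketch of one Adam step in plain Python, following the simplified update given earlier in this section. The helper name `adam_step`, the toy gradient, and the larger $\alpha$ used in the loop are assumptions for illustration. The optional `bias_correction` flag is an addition that is not part of the simplified equations: it rescales $m$ and $c$ by $1-\beta_1^t$ and $1-\beta_2^t$, as the full algorithm does, to compensate for both starting at zero.

```python
import math


def adam_step(w, m, c, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7,
              bias_correction=False):
    """One Adam update; t is the 1-based step count (used only for bias correction)."""
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]       # momentum term
    c = [beta2 * ci + (1.0 - beta2) * gi * gi for ci, gi in zip(c, g)]  # RMSprop cache
    if bias_correction:
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        c_hat = [ci / (1.0 - beta2 ** t) for ci in c]
    else:
        m_hat, c_hat = m, c
    w = [wi - alpha * mh / (math.sqrt(ch) + eps)
         for wi, mh, ch in zip(w, m_hat, c_hat)]
    return w, m, c


def grad_J(w):
    """Gradient of a toy objective J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


w, m, c = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    w, m, c = adam_step(w, m, c, grad_J(w), t, alpha=0.01)
print("weights after 2000 steps:", w)
```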