# Gradient Descent with Momentum

### Gradient descent from a physical viewpoint

The gradient descent (GD) algorithm is often compared to the effect of gravity on a marble placed on a valley-shaped surface, as in Figure 1a) below. Regardless of whether we place the marble at A or B, it will eventually roll down and come to rest at position C.

![momentum](./img/momentum.png)

However, if the surface has two valley bottoms as in Figure 1b), then depending on whether the ball is placed at A or B, it will come to rest at C or D, respectively. Point D is a local minimum that we do not want.

Thinking more physically about Figure 1b): if the ball at point B has a large enough initial velocity, then when it reaches D its momentum can carry it on up the slope to the left of D. With an even larger initial velocity, the ball can climb all the way to E and then roll down to C, as in Figure 1c). This is exactly what we want. Readers may ask whether a ball rolling from A to C could likewise be carried by its momentum over E and down to D. This is less likely to happen, because slope CE is much higher than slope DE.

Based on this phenomenon, an algorithm was developed to keep GD's solution from getting stuck at an unwanted local minimum. That algorithm is called Momentum.

*Then, how do we represent momentum mathematically?*

In GD, we need to calculate the amount of change at time $t$ to update the position of the solution (i.e. the marble). If we treat this quantity as the velocity $v_t$ in physics, the new position of the marble will be $\theta_{t + 1} = \theta_t - v_t$.

The minus sign represents moving in the opposite direction of the derivative. Our job now is to compute $v_t$ so that it carries information about both **the slope** (i.e. the derivative) and **the momentum**, i.e. the previous velocity $v_{t-1}$ (we assume the initial velocity is $v_0 = 0$). The simplest way is to take a weighted sum of these two quantities:

$$v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)$$

Where:
- $v_t$: the velocity at time $t$
- $\gamma$: the momentum parameter, usually set to 0.9
- $v_{t - 1}$: the velocity at time $t - 1$
- $\eta$: the learning rate
- $\nabla J(\theta_t)$: the gradient of the cost function evaluated at $\theta_t$, the position at time $t$
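
To see what this recursion accumulates, it can be unrolled, assuming $v_0 = 0$ as above and writing $g_k = \nabla J(\theta_k)$ as shorthand (a symbol introduced here for illustration, not part of the original notation):

$$v_1 = \eta g_1, \qquad v_2 = \gamma \eta g_1 + \eta g_2, \qquad v_t = \eta \sum_{k=1}^{t} \gamma^{t-k} g_k$$

So the velocity is an exponentially weighted sum of past gradients, with older gradients discounted by powers of $\gamma$. In the idealized case where the gradient stays constant at $g$, this gives

$$v_t = \eta g \frac{1 - \gamma^t}{1 - \gamma} \longrightarrow \frac{\eta g}{1 - \gamma}$$

so with $\gamma = 0.9$ the step can grow to as much as 10 times the plain GD step $\eta g$.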

And the update rule for the solution is:
$$\theta_{t + 1} = \theta_t - v_t$$
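
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper name `momentum_step`, the quadratic test cost $J(\theta) = \theta_1^2 + 5\theta_2^2$, the starting point, and the iteration count are all assumptions chosen for the example.

```python
def momentum_step(theta, v_prev, grad, eta=0.1, gamma=0.9):
    """One momentum update on a list of floats (hypothetical helper).

    v_t         = gamma * v_{t-1} + eta * grad J(theta_t)
    theta_{t+1} = theta_t - v_t
    """
    g = grad(theta)
    v = [gamma * vi + eta * gi for vi, gi in zip(v_prev, g)]
    theta_next = [ti - vi for ti, vi in zip(theta, v)]
    return theta_next, v


def quad_grad(theta):
    # Gradient of the assumed test cost J(theta) = theta_1^2 + 5 * theta_2^2
    return [2.0 * theta[0], 10.0 * theta[1]]


theta = [3.0, -2.0]   # assumed starting point
v = [0.0, 0.0]        # v_0 = 0, as in the text
for _ in range(500):
    theta, v = momentum_step(theta, v, quad_grad)

print(theta)  # both coordinates should be very close to 0, the minimizer
```

Setting `gamma=0` in `momentum_step` drops the velocity term and recovers the plain GD update $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$.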

As a small example, consider this simple function, which has two local minima, one of which is the global minimum:

$$f(x) = x^2 + 10\sin(x)$$

Its derivative is:

$$f'(x) = 2x + 10\cos(x)$$
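
As a quick sanity check on the shape of this function, the stationary points can be located numerically from the derivative. The sketch below is only an illustration: the grid spacing, the bisection helper `bisect_root`, and the second derivative $f''(x) = 2 - 10\sin(x)$ used to classify each point are additions for this example.

```python
import math


def f(x):
    return x**2 + 10 * math.sin(x)


def df(x):
    return 2 * x + 10 * math.cos(x)


def d2f(x):
    # Second derivative, used only to classify stationary points
    return 2 - 10 * math.sin(x)


def bisect_root(fn, lo, hi, n_steps=60):
    """Root of fn in [lo, hi], assuming fn(lo) and fn(hi) differ in sign."""
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if fn(lo) * fn(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Outside [-5, 5] we have |2x| > 10 >= |10 cos(x)|, so f' cannot vanish there.
grid = [-5 + 0.1 * i for i in range(101)]
stationary = [
    bisect_root(df, a, b)
    for a, b in zip(grid, grid[1:])
    if df(a) * df(b) < 0
]

for x in stationary:
    kind = "minimum" if d2f(x) > 0 else "maximum"
    print(f"x = {x:.3f}, f(x) = {f(x):.3f}, local {kind}")
```

The scan should report two minima separated by one local maximum, with the minimum at negative $x$ having the lower function value and therefore being the global one.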

![No Momentum](./img/no_momentum.gif)

The animation above shows the path of the solution without Momentum: the algorithm converges after only 5 iterations, but to a local minimum rather than the global one.

![Momentum](./img/momentum.gif)

This animation shows the path of the solution when using Momentum: the ball climbs the slope into the region near the global minimum, oscillates around that point while decelerating, and finally settles there. Although it takes more iterations, GD with Momentum gives us a better solution, because it reaches the global minimum instead of stopping at a local one. The path of the ball in this case also looks much more like real physical motion.
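
The comparison between the two runs can be reproduced with a short sketch. The text does not state which settings produced the animations, so the starting point $x_0 = 5$, learning rate $\eta = 0.1$, momentum $\gamma = 0.9$, and the fixed budget of 500 iterations below are assumptions; the function name `descend` is likewise hypothetical.

```python
import math


def f(x):
    return x**2 + 10 * math.sin(x)


def grad(x):
    return 2 * x + 10 * math.cos(x)


def descend(x0, eta=0.1, gamma=0.0, n_iters=500):
    """Run the update v = gamma*v + eta*f'(x), x = x - v for n_iters steps.

    gamma = 0 gives plain GD; gamma > 0 adds momentum.
    """
    x, v = x0, 0.0  # v_0 = 0
    for _ in range(n_iters):
        v = gamma * v + eta * grad(x)
        x = x - v
    return x


x_plain = descend(5.0, gamma=0.0)      # assumed start x0 = 5
x_momentum = descend(5.0, gamma=0.9)

print(f"plain GD: x = {x_plain:.3f}, f(x) = {f(x_plain):.3f}")
print(f"momentum: x = {x_momentum:.3f}, f(x) = {f(x_momentum):.3f}")
```

Under these assumed settings, plain GD should settle at the local minimum near $x \approx 3.84$, while the momentum run should overshoot it, cross the hump, and settle at the global minimum near $x \approx -1.31$. Other starting points or learning rates may behave differently, since momentum does not guarantee escaping every local minimum.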