# 1. Using Momentum to Speed Up Training
We will now look at one of the most effective methods for improving plain gradient descent, called **momentum**. You can think of it as the change that gets you roughly 80% of the way to a better learning procedure.

A way to think of this is as follows: gradient descent without momentum requires a *force* or *push* each time we want the weights to move. In other words, each time we want to move, there has to be a gradient, so that we can step downhill along the direction it defines (against the gradient). With **momentum**, we can imagine that our update keeps moving even when no gradient is present.

This can be thought of as pushing a box on ice vs. pushing a box on gravel. If we are pushing the box on gravel, the minute we stop applying force, the box stops moving. This is analogous to gradient descent without momentum. However, if we were pushing the box on ice, we could push and then let go, and it would continue moving for a period of time before stopping. This is analogous to gradient descent with momentum. Let's put this into math.

## 1.1 Gradient Descent, *without* Momentum
Our update for $\theta$ can be described as:
#### $$\theta_t \leftarrow \theta_{t-1} - \eta g_t$$
This says that $\theta_t$ is equal to the previous value of $\theta$, minus the learning rate $\eta$ times the gradient $g_t$. From this we can see that if the gradient is 0, nothing happens to $\theta_t$. It just gets set to its old value and doesn't change.
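As a minimal sketch of the update above, the snippet below applies one step to a toy problem. The function name `gd_step`, the learning rate of 0.1, and the cost $J(\theta) = \theta^2$ are assumptions chosen only for illustration.

```python
import numpy as np

def gd_step(theta, grad, lr=0.1):
    """One plain gradient descent update: theta_t = theta_{t-1} - lr * g_t."""
    return theta - lr * grad

# Hypothetical cost J(theta) = theta^2, whose gradient is 2 * theta.
theta_start = np.array([1.0])

# Nonzero gradient: theta moves (here from 1.0 to roughly 0.8).
theta_moved = gd_step(theta_start, 2 * theta_start)

# Zero gradient: theta keeps its old value.
theta_stuck = gd_step(theta_moved, np.zeros(1))

print(theta_moved, theta_stuck)
```

The second call shows the "box on gravel" behaviour: with no gradient there is no push, so the parameter does not move.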

## 1.2 Gradient Descent, *with* Momentum
Now let's say that we add in **momentum**. Note that the term momentum is used very loosely here, since it is only an analogy to actual physical momentum. What we do is create a new variable, $v$, which stands for the velocity. The new velocity is equal to $\mu$ (the momentum term) times the old velocity, minus the learning rate times the gradient.

#### $$v_t \leftarrow \mu v_{t-1} - \eta g_t$$

This new term, $\mu v_{t-1}$, gives us the ability to "slide on ice" if you will. In other words, it allows us to continue to move in the same direction that we were going before. Now, we talked about how a box sliding on ice will still stop eventually. That means that we want our updated $v$ to be a fraction of the prior $v$, and hence $\mu$ should be between 0 and 1. Typical values of $\mu$ are 0.9, 0.95, 0.99, etc. This means that without any $g$, the velocity shrinks by a factor of $\mu$ at every step, so the update will still eventually "slow down". Our update rule for $\theta_t$ then becomes:

#### $$\theta_t \leftarrow \theta_{t-1} + v_t$$

Now, if we combine these two equations we can see that our total update rule is:

#### $$\theta_t \leftarrow \theta_{t-1} + \mu v_{t-1} -\eta g_t $$

And we can see that if we set the momentum term, $\mu$, equal to zero, we end up with the same update rule we originally had for gradient descent. 
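The two properties discussed above (coasting without a gradient, and reducing to plain gradient descent when $\mu = 0$) can be checked with a small sketch. The helper name `momentum_step` and the numbers used are assumptions for illustration only.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    """v_t = mu * v_{t-1} - lr * g_t, then theta_t = theta_{t-1} + v_t."""
    v_new = mu * v - lr * grad
    return theta + v_new, v_new

# 1) Coasting: with a zero gradient the velocity shrinks by mu each step.
theta = np.array([1.0])
v = np.array([0.5])
speeds = []
for _ in range(3):
    theta, v = momentum_step(theta, v, grad=np.zeros(1))
    speeds.append(float(v[0]))
print(speeds)  # roughly 0.45, 0.405, 0.3645

# 2) With mu = 0 the old velocity is ignored, which is plain gradient descent.
g = np.array([2.0])
theta_mu0, _ = momentum_step(np.array([1.0]), np.array([0.5]), g, mu=0.0)
theta_gd = np.array([1.0]) - 0.1 * g
print(theta_mu0, theta_gd)
```

Note that the parameter keeps moving in the first loop even though the gradient is zero, which is the "box on ice" behaviour.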

## 1.3 The Effect of Momentum
You may be wondering, what is the effect of using momentum? Well, as the plot below shows, by using momentum the cost converges to its minimum value much faster. This significantly speeds up training!

<img src="images/momentum.png">

From another perspective, we can think of a situation where we have unequal gradients in different directions. In the image below, we have a very large gradient that creates the valley (each side is very steep), and then in the other direction (the stream flowing down), it is a very small gradient. 

<img src="images/large-small-gradient.png">

For visualization purposes, let's assume we have 2 parameters to optimize: the vertical and horizontal parameter. The gradient in one direction is very steep, and the gradient in the other direction is very shallow. The idea is that if you don't have momentum, then you rely purely on the gradient, which points more in the steep direction than in the shallow direction. This is just a property of the gradient: its negative is the direction of steepest descent. Since the gradient vector points mostly in the steep direction, we are going to zigzag back and forth across the valley. That is a very inefficient way of reaching the minimum.

<img src="images/contours-momentum.png">

Once we add momentum, however, things change. Because in the shallow direction we move in the same direction every time, those velocities accumulate: a portion of our old velocity is added to each new update, which helps us along in that direction. In the steep direction, by contrast, successive gradients alternate in sign and partly cancel. The result is that we get there faster by taking bigger steps in the shallow direction.
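To make the valley picture concrete, here is a rough sketch that runs both methods on a hypothetical quadratic "ravine". The cost function, starting point, learning rate, and step count are all assumptions picked for illustration; they are not taken from the plots above.

```python
import numpy as np

# Hypothetical ravine: J(x, y) = 0.5 * (50 * x**2 + y**2).
# x is the steep direction (curvature 50), y is the shallow one (curvature 1).
curvature = np.array([50.0, 1.0])

def cost(theta):
    return 0.5 * np.sum(curvature * theta ** 2)

def grad(theta):
    return curvature * theta

def run(mu, lr=0.03, steps=100):
    """Run gradient descent with momentum mu (mu = 0 is plain gradient descent)."""
    theta = np.array([1.0, 10.0])
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        v = mu * v - lr * grad(theta)
        theta = theta + v
        path.append(theta.copy())
    return np.array(path)

path_gd = run(mu=0.0)
path_momentum = run(mu=0.9)

print("final cost, plain GD:", cost(path_gd[-1]))
print("final cost, momentum:", cost(path_momentum[-1]))
print("plain GD, first x values:", path_gd[:4, 0])  # sign flips: the zigzag
```

In this particular setup, plain gradient descent flips sign in the steep `x` direction on every step while creeping along the shallow `y` direction, and the momentum run ends at a lower cost after the same number of steps. Other learning rates or cost surfaces may behave differently.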

---
<br></br>
# 2. Nesterov Momentum
Nesterov momentum was introduced by **Yurii Nesterov** in 1983. It is described as:
> "A method for unconstrained convex minimization problem with rate of convergence O(1/$k^2$)"

The idea behind Nesterov momentum is as follows: instead of just using momentum to blindly keep going in the direction that we were already going, let's instead peek ahead by taking a big jump in the direction of the previous velocity, and calculate the gradient from there. We can think of it as gambling: if you are going to gamble, it is better to gamble first (take the big jump) and then make a correction, than to make a correction and then gamble.

<img src="images/nesterov1.png">

So first we peek ahead, jumping in the direction of the previous velocity (accumulated gradient):

<img src="images/nesterov2.png">

We then measure the gradient at that point and go downhill in the direction it indicates. We use that gradient to update our velocity (accumulated gradient). In other words, we combine the big jump with our gradient to get the accumulated gradient. So in a way, it's peeking ahead and then course correcting based on where we would have ended up.

<img src="images/nesterov3.png">

We then take that accumulated gradient (first green vector), multiply it by the momentum constant, $\mu$, and take the next big jump in the direction of that accumulated gradient. Again, at the place where we end up (head of second brown vector), we measure the gradient, go downhill (second red vector) to correct any errors we have made, and get a new accumulated gradient (second green vector).

The blue vectors show where we would go if we were using standard momentum: we first measure the gradient at the current position (small blue vector), add that to the brown vector, and end up making the jump given by the big blue vector (first brown vector plus small blue vector, i.e. the current gradient). The brown vector represents our look-ahead jump. Notice that it is in the same direction as the blue vector. The red vector is the gradient step taken at the look-ahead point. The green vector is just the sum of the brown vector and the red vector.

## 2.1 Nesterov Equations
So, with the visuals discussed, what do the equations look like? First, we are going to use $w$ to represent our weights instead of $\theta$. Now, let's start with the vector that represents the previous value of our weights, $w_{t-1}$, and the previous velocity, $v_{t-1}$:

<img src="images/nesterov-eq-1.png">

Now, we have this jump ahead, which we can call $w'_{t-1}$. It is in the same direction as our previous velocity vector, but it is slightly smaller since the jump is scaled by $\mu$:

<img src="images/nesterov-eq-2.png">

#### $$\text{look-ahead value:} \; w'_{t-1} = w_{t-1} + \mu v_{t-1}$$

Note that, as seen in the image above, the jump-ahead point is just the previous value of the weights, plus the momentum term multiplied by the previous velocity. Next, we calculate the gradient at this jump-ahead point, and then use that to update $v$:

<img src="images/nesterov-eq-3.png">

<img src="images/nesterov-eq-4.png">

#### $$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w'_{t-1})$$

Which is equal to:

#### $$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w_{t-1} +\mu v_{t-1})$$

And then the last step is to update the weights, $w_t$, by adding the new velocity (the accumulated gradient). This is the same rule as for standard momentum:

<img src="images/nesterov-eq-5.png">

#### $$w_t \leftarrow w_{t-1} + v_t$$

The main difference to note is that in the standard method we are taking the gradient and then making our jump, whereas in the Nesterov method we first take a big jump, then correct by taking the gradient from that point. 
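The difference between the two methods can be sketched side by side. The function names, the toy cost $J(w) = \frac{1}{2} w^2$ (whose gradient is simply $w$), and the starting values are assumptions for illustration; this follows the equations of this section, not any particular library's implementation.

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Standard momentum: gradient is measured at the current weights."""
    v_new = mu * v - lr * grad_fn(w)
    return w + v_new, v_new

def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Nesterov momentum: gradient is measured at the look-ahead point."""
    w_ahead = w + mu * v                      # w'_{t-1} = w_{t-1} + mu * v_{t-1}
    v_new = mu * v - lr * grad_fn(w_ahead)    # v_t
    return w + v_new, v_new                   # w_t = w_{t-1} + v_t

def grad_fn(w):
    # Gradient of the hypothetical cost J(w) = 0.5 * w**2.
    return w

w0 = np.array([1.0])
v0 = np.array([-0.5])

w_std, v_std = momentum_step(w0, v0, grad_fn)
w_nag, v_nag = nesterov_step(w0, v0, grad_fn)

print("standard momentum:", w_std, v_std)  # w is about 0.45, v about -0.55
print("Nesterov momentum:", w_nag, v_nag)  # w is about 0.495, v about -0.505
```

Here the look-ahead point (0.55) is closer to the minimum than the current weights (1.0), so the gradient measured there is smaller and the Nesterov step is slightly more cautious than the standard one.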

## 2.2 Reformulation
However, in practice, this is not how Nesterov momentum is usually implemented. Instead, we will reformulate the equations so that everything is expressed only in terms of $w'$, the look-ahead value of $w$, since that is the point at which we want to calculate the gradient. We can define $w'_t$ and $w'_{t-1}$ using the same definition:

#### $$w'_t = w_t + \mu v_t$$
#### $$w'_{t-1} = w_{t-1} + \mu v_{t-1}$$

In other words, these are the look-ahead values of $w$ at two consecutive steps. The second equation is illustrated below:

<img src="images/nesterov-eq-6.png">