<h2 id="Contents">Contents<a href="#Contents"></a></h2>
        <ol>
        <li><a class="" href="#Imports">Imports</a></li>
<li><a class="" href="#Optimization-Methods">Optimization Methods</a></li>
<ol><li><a class="" href="#Gradient-Descent">Gradient Descent</a></li>
<ol><li><a class="" href="#Stochastic-Gradient-Descent">Stochastic Gradient Descent</a></li>
<li><a class="" href="#Mini-Batch-Gradient-Descent">Mini-Batch Gradient Descent</a></li>
</ol><li><a class="" href="#Momentum">Momentum</a></li>
<li><a class="" href="#RMSProp">RMSProp</a></li>
<li><a class="" href="#Adam">Adam</a></li>
</ol><li><a class="" href="#Learning-Rate-Decay-and-Scheduling">Learning Rate Decay and Scheduling</a></li>
<ol><li><a class="" href="#Decay-on-every-iteration">Decay on every iteration</a></li>
<li><a class="" href="#Fixed-Interval-Scheduling">Fixed Interval Scheduling</a></li>
</ol>
        </ol>

# Imports

In [1]:
import numpy as np

# Optimization Methods

A good optimization algorithm can be the difference between waiting days and waiting just a few hours for a good result. Until now, we've always used Gradient Descent to update the parameters and minimize the cost. Here, we'll talk about some other optimization methods.

## Gradient Descent

Gradient descent is included here for completeness. When each step uses the gradient computed over all $m$ training examples, it is also called Batch Gradient Descent. The gradient descent rule is, for $l = 1, ..., L$:
$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}$$

where L is the number of layers and $\alpha$ is the learning rate.

In [2]:
def update_parameters_with_gd(parameters, grads, learning_rate):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= grads["dW" + str(l)]*learning_rate
        parameters["b" + str(l)] -= grads["db" + str(l)]*learning_rate
    return parameters
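
To make the update rule concrete, here is a minimal sketch that applies one step to a toy single-layer model. The function name `gd_step` and all the numbers are made up for illustration; it assumes the same `W1, b1, ...` / `dW1, db1, ...` dictionary layout used above.

```python
import numpy as np

def gd_step(parameters, grads, learning_rate):
    """Apply one gradient descent step to every W and b (hypothetical helper)."""
    num_layers = len(parameters) // 2
    for l in range(1, num_layers + 1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return parameters

# Toy values, chosen only so the arithmetic is easy to follow.
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, -0.2]]), "db1": np.array([[1.0]])}

parameters = gd_step(parameters, grads, learning_rate=0.1)
# W1 moves against its gradient: 1.0 - 0.1*0.1 and 2.0 - 0.1*(-0.2)
# b1 moves against its gradient: 0.5 - 0.1*1.0
```

Each parameter moves a distance of `learning_rate` times its gradient, in the direction that lowers the cost.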

### Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of Gradient Descent where you update the parameters after every single training example, instead of after a full pass over the training set.

The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.


- **(Batch) Gradient Descent**:

``` python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation on the whole training set
    a, caches = forward_propagation(X, parameters)
    # Compute the cost over all m examples
    cost = compute_cost(a, Y)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters: one update per pass over the data
    parameters = update_parameters(parameters, grads)
```

- **Stochastic Gradient Descent**:

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
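m = X.shape[1]  # number of training examples (one per column)
for i in range(0, num_iterations):
    cost = 0  # accumulate the cost over one pass through the data
    for j in range(0, m):
        # Forward propagation on example j (the slice keeps the column shape)
        a, caches = forward_propagation(X[:, j:j+1], parameters)
        # Accumulate the cost
        cost += compute_cost(a, Y[:, j:j+1])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters: one update per training example
        parameters = update_parameters(parameters, grads)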
```

>In Stochastic Gradient Descent, you use only one training example before updating the parameters. When the training set is large, SGD can make progress faster because it updates the parameters far more often. However, the parameters will "oscillate" toward the minimum rather than converge smoothly.

![](images/0401.png)

### Mini-Batch Gradient Descent

Mini-batch gradient descent uses an intermediate number of examples for each step: more than one, but fewer than all $m$. You loop over the mini-batches instead of looping over individual training examples. Building the mini-batches takes two steps:
- **Shuffle**: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. The shuffling is done synchronously between X and Y, so that after the shuffling the $i^{th}$ column of X is still the example corresponding to the $i^{th}$ label in Y. The shuffling step ensures that examples are split randomly into different mini-batches.
 
![](images/0402.png)

- **Partition**: Partition the shuffled (X, Y) into mini-batches of size `mini_batch_size` (here 64). The number of training examples is not always divisible by `mini_batch_size`, so the final mini-batch may be smaller than the full `mini_batch_size`; it simply holds the remaining examples.
  
![](images/0403.png)

In [3]:
def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    np.random.seed(seed)  # fixed seed so the shuffle is reproducible
    m = X.shape[1]        # number of training examples (one per column)
    mini_batches = []

    # Shuffle: apply the same permutation to the columns of X and Y
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))  # assumes Y has a single row

    # Partition: integer division, because range() needs an int
    num_complete_minibatches = m // mini_batch_size
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k*mini_batch_size:(k+1)*mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k*mini_batch_size:(k+1)*mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # For handling the end case (last mini-batch < mini_batch_size i.e less than 64)
    if m % mini_batch_size != 0:
        # Start index does not rely on k, so this also works when m < mini_batch_size
        start = num_complete_minibatches * mini_batch_size
        mini_batch_X = shuffled_X[:, start:]
        mini_batch_Y = shuffled_Y[:, start:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
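
As a rough illustration of how mini-batches are consumed during training, the sketch below fits a toy linear model. It is a self-contained variant, not the function above: the generator `iterate_mini_batches`, the synthetic data, and the hyperparameter values are all assumptions made for this example.

```python
import numpy as np

def iterate_mini_batches(X, Y, mini_batch_size, rng):
    """Yield shuffled (X, Y) mini-batches; the last one may be smaller."""
    m = X.shape[1]
    permutation = rng.permutation(m)  # same permutation for X and Y
    for start in range(0, m, mini_batch_size):
        idx = permutation[start:start + mini_batch_size]
        yield X[:, idx], Y[:, idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 148))      # 148 examples with 2 features each
true_W = np.array([[2.0, -3.0]])
Y = true_W @ X + 1.0               # noiseless linear targets, true bias 1.0

W = np.zeros((1, 2))
b = np.zeros((1, 1))
learning_rate = 0.1

for epoch in range(100):
    # A fresh shuffle every epoch: two batches of 64 and one of 20.
    for mb_X, mb_Y in iterate_mini_batches(X, Y, 64, rng):
        m_b = mb_X.shape[1]
        error = W @ mb_X + b - mb_Y                      # shape (1, m_b)
        dW = error @ mb_X.T / m_b                        # gradient of the mean squared error / 2
        db = np.sum(error, axis=1, keepdims=True) / m_b
        W -= learning_rate * dW                          # one update per mini-batch
        b -= learning_rate * db
```

With 148 examples and a batch size of 64, each epoch performs three parameter updates instead of one (batch gradient descent) or 148 (stochastic gradient descent).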

## Momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.

Momentum takes into account the past gradients to smooth out the update. The 'direction' of the previous gradients is stored in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.

For this algorithm to work, we first initialize the *velocity* as a vector of zeros. After that, the velocity and the parameters are updated using the formulas:
$$ \begin{cases}
v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\
W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}
\end{cases}$$

$$\begin{cases}
v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\
b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} 
\end{cases}$$
for $l = 1, ..., L$, where $L$ is the number of layers, $\beta$ is the momentum, and $\alpha$ is the learning rate.



In [4]:
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        v["dW" + str(l)] = beta*v["dW" + str(l)] + (1 - beta)*grads["dW" + str(l)]
        v["db" + str(l)] = beta*v["db" + str(l)] + (1 - beta)*grads["db" + str(l)]

        parameters["W" + str(l)] -= learning_rate*v["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate*v["db" + str(l)]
    return parameters, v
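
The update above expects `v` to exist already. A minimal sketch of the zero initialization, and of how the velocity builds up under a constant gradient, could look like this. The helper name `initialize_velocity` and the toy shapes and values are assumptions for illustration.

```python
import numpy as np

def initialize_velocity(parameters):
    """Return zero arrays shaped like each W and b (hypothetical helper)."""
    num_layers = len(parameters) // 2
    v = {}
    for l in range(1, num_layers + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v

parameters = {"W1": np.ones((3, 2)), "b1": np.ones((3, 1))}
v = initialize_velocity(parameters)

# Feed the same gradient three times and watch the velocity grow toward it.
beta = 0.9
constant_grad = np.ones((3, 2))
for step in range(3):
    v["dW1"] = beta * v["dW1"] + (1 - beta) * constant_grad
# After t steps with a constant gradient g, v equals (1 - beta**t) * g.
```

Because `v` starts at zero, the first few updates are smaller than the raw gradient would suggest, which is the "build up" effect described in the notes that follow.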

**Note that**:
- The velocity is initialized with zeros. So the algorithm will take a few iterations to "build up" velocity and start to take bigger steps.
- If $\beta = 0$, then this just becomes standard gradient descent without momentum. 

**How do you choose $\beta$?**

- The larger the momentum $\beta$ is, the smoother the update, because it takes the past gradients into account more. But if $\beta$ is too big, it could also smooth out the updates too much. 
- Common values for $\beta$ range from 0.8 to 0.999. If you don't feel inclined to tune this, $\beta = 0.9$ is often a reasonable default. 
- Tuning the optimal $\beta$ for your model might require trying several values to see what works best in terms of reducing the value of the cost function $J$.

## RMSProp

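Gradient Descent starts by quickly going down the steepest slope, which does not necessarily point straight toward the optimum, and then slowly moves along the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum. This is what RMSProp does. RMSProp keeps track of an exponentially decaying average of past squared gradients and scales each step by it. It updates the parameters as:
$$s \leftarrow \beta s + (1-\beta)\, \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)\\
\theta \leftarrow \theta - \eta\, \nabla_\theta J(\theta) \oslash \sqrt{s + \epsilon}$$

Here $\otimes$ is the element-wise product, $\oslash$ is element-wise division, $\eta$ is the learning rate (written $\alpha$ in the other sections), $\beta$ is the decay rate of the average, and $\epsilon$ is a small constant that avoids division by zero.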
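
A minimal sketch of an RMSProp step, written in the same dictionary style as the other update functions in this notebook, might look like the following. The function name and the default hyperparameter values are assumptions; `s` must start as zero arrays with the same shapes as the gradients, and each gradient is divided element-wise by $\sqrt{s + \epsilon}$.

```python
import numpy as np

def update_parameters_with_rmsprop(parameters, grads, s, learning_rate=0.001,
                                   beta=0.9, epsilon=1e-8):
    """One RMSProp step (illustrative sketch, not a reference implementation)."""
    num_layers = len(parameters) // 2
    for l in range(1, num_layers + 1):
        for name in ("W", "b"):
            p_key = name + str(l)        # e.g. "W1"
            g_key = "d" + name + str(l)  # e.g. "dW1"
            # Exponentially decaying average of the squared gradients
            s[g_key] = beta * s[g_key] + (1 - beta) * grads[g_key] ** 2
            # Scale the step element-wise by 1 / sqrt(s + epsilon)
            parameters[p_key] = parameters[p_key] - (
                learning_rate * grads[g_key] / np.sqrt(s[g_key] + epsilon)
            )
    return parameters, s

# Toy usage with made-up values.
parameters = {"W1": np.array([[1.0]]), "b1": np.array([[0.0]])}
grads = {"dW1": np.array([[2.0]]), "db1": np.array([[0.0]])}
s = {"dW1": np.zeros((1, 1)), "db1": np.zeros((1, 1))}
parameters, s = update_parameters_with_rmsprop(parameters, grads, s)
```

Dimensions with consistently large gradients accumulate a large `s` and therefore take smaller steps, while dimensions with small gradients take relatively larger ones.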

## Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum.

**How does Adam work?**
1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction). 
2. It calculates an exponentially weighted average of the squares of the past gradients, and  stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction). 
3. It updates parameters in a direction based on combining information from "1" and "2".

The update rule is, for $l = 1, ..., L$: 

$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\
v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\
s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\
W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{cases}$$
where:
- $t$ counts the number of Adam steps taken so far, starting at 1 (at $t = 0$ the bias-correction denominators would be zero)
- L is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages. 
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid dividing by zero

In [None]:
def update_parameters_with_adam(
    parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8
):

    L = len(parameters) // 2
    v_corrected = {}
    s_corrected = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = beta1 * v["dW" + str(l)] + (1 - beta1) * grads["dW" + str(l)]
        v["db" + str(l)] = beta1 * v["db" + str(l)] + (1 - beta1) * grads["db" + str(l)]

        v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - (beta1) ** t)
        v_corrected["db" + str(l)] = v["db" + str(l)] / (1 - (beta1) ** t)

        s["dW" + str(l)] = (
            beta2 * s["dW" + str(l)] + (1 - beta2) * (grads["dW" + str(l)]) ** 2
        )
        s["db" + str(l)] = (
            beta2 * s["db" + str(l)] + (1 - beta2) * (grads["db" + str(l)]) ** 2
        )

        s_corrected["dW" + str(l)] = s["dW" + str(l)] / (1 - (beta2) ** t)
        s_corrected["db" + str(l)] = s["db" + str(l)] / (1 - (beta2) ** t)

        parameters["W" + str(l)] -= (learning_rate * v_corrected["dW" + str(l)]) / (
            np.sqrt(s_corrected["dW" + str(l)]) + epsilon
        )
        parameters["b" + str(l)] -= (learning_rate * v_corrected["db" + str(l)]) / (
            np.sqrt(s_corrected["db" + str(l)]) + epsilon
        )

    return parameters, v, s, v_corrected, s_corrected
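
The Adam update expects `v` and `s` to be zero-initialized. The sketch below shows one way to do that and then checks numerically what the bias correction does on the very first step. The helper name `initialize_adam` and the toy gradient are assumptions for illustration.

```python
import numpy as np

def initialize_adam(parameters):
    """Return zero-filled v and s dicts matching each W and b (hypothetical helper)."""
    num_layers = len(parameters) // 2
    v, s = {}, {}
    for l in range(1, num_layers + 1):
        for name in ("W", "b"):
            key = "d" + name + str(l)
            v[key] = np.zeros_like(parameters[name + str(l)])
            s[key] = np.zeros_like(parameters[name + str(l)])
    return v, s

parameters = {"W1": np.ones((1, 2)), "b1": np.ones((1, 1))}
v_init, s_init = initialize_adam(parameters)

# First Adam step (t = 1) on a toy gradient, starting from the zero state.
beta1, beta2, epsilon, t = 0.9, 0.999, 1e-8, 1
g = np.array([[0.5, -2.0]])
v1 = beta1 * v_init["dW1"] + (1 - beta1) * g        # only 10% of g without correction
s1 = beta2 * s_init["dW1"] + (1 - beta2) * g ** 2   # only 0.1% of g**2 without correction
v1_corrected = v1 / (1 - beta1 ** t)                # recovers g
s1_corrected = s1 / (1 - beta2 ** t)                # recovers g**2
first_step = v1_corrected / (np.sqrt(s1_corrected) + epsilon)
```

On the first step the corrected averages equal the gradient and its square, so `first_step` is approximately the sign of each gradient entry, and the parameter moves by roughly `learning_rate` in each coordinate.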


# Learning Rate Decay and Scheduling

Apart from the choice of optimization algorithm, adjusting the learning rate during training can also speed up the training of the model.
During the first part of training, the model can get away with taking large steps, but over time a fixed learning rate $\alpha$ can leave the model stuck in a wide oscillation that never quite converges. If we slowly reduce $\alpha$ over time, we take smaller, slower steps that bring us closer to the minimum. This is the idea behind learning rate decay.

Learning rate decay can be achieved by using either adaptive methods or pre-defined learning rate schedules.

## Decay on every iteration

One pre-defined schedule shrinks the learning rate a little at every epoch. It is sometimes loosely called exponential learning rate decay, although the form used here is more accurately an inverse time decay:
$$\alpha = \frac{1}{1 + decayRate \times epochNumber} \alpha_{0}$$

where $\alpha_0$ is the initial learning rate. The problem with this method is that the learning rate keeps shrinking as the epoch number grows. It never reaches exactly zero, but it can get close to zero so quickly that the updates become too small for the model to keep learning.
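
As a small sketch, the schedule can be written as a one-line function. The name `update_lr` and the example values are assumptions for illustration.

```python
def update_lr(learning_rate0, epoch_num, decay_rate):
    """Learning rate after epoch_num epochs: learning_rate0 / (1 + decay_rate * epoch_num)."""
    return learning_rate0 / (1 + decay_rate * epoch_num)

# With learning_rate0 = 0.5 and decay_rate = 1.0 the rate halves after one epoch
# and keeps shrinking with every further epoch.
rates = [update_lr(0.5, epoch, 1.0) for epoch in range(5)]
```

The list `rates` is strictly decreasing, which shows how quickly the step size shrinks when the decay is applied at every epoch.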

## Fixed Interval Scheduling

One way to solve this problem is fixed interval scheduling: keep the learning rate constant for a fixed number of epochs ($timeInterval$), and only reduce it each time that interval has passed.

The mathematical form of this schedule is:
$$\alpha = \frac{1}{1 + decayRate \times \lfloor\frac{epochNumber}{timeInterval}\rfloor} \alpha_{0}$$
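
A minimal sketch of this schedule uses integer division for the floor. The function name `schedule_lr_decay` and the default interval of 1000 are assumptions for illustration.

```python
def schedule_lr_decay(learning_rate0, epoch_num, decay_rate, time_interval=1000):
    """Learning rate that only drops once every time_interval epochs."""
    # epoch_num // time_interval is the floor of epoch_num / time_interval
    return learning_rate0 / (1 + decay_rate * (epoch_num // time_interval))

# The rate stays at its initial value for the first 1000 epochs, then drops in steps.
lr_early = schedule_lr_decay(0.5, 999, 1.0)    # still in the first interval
lr_next = schedule_lr_decay(0.5, 1000, 1.0)    # first drop
lr_later = schedule_lr_decay(0.5, 2500, 1.0)   # two full intervals completed
```

Between drops the learning rate is constant, so it decays far more slowly than when the same formula is applied at every epoch.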