<font color = green >

# Improving deep neural networks

</font>

- Optimization Methods
    - Mini-Batch Gradient descent
    - Exponentially weighted averages
    - Gradient descent with momentum
    - Bias correction in exponentially weighted averages
    - RMSProp (root mean square prop)
    - Adam
    - Learning rate decay
- Multi class classification    

- Bias and Variance  
- Regularization: Weight decay 
- Regularization: dropout
- Getting more data
- Early stopping
- Vanishing / Exploding gradients
    - Changing initialization 
    - Using non-saturating functions 
    - Batch normalization 
    - Gradient clipping




<font color = green >

# Optimization Methods

</font>



<font color = green >

## Recall Gradient Descent

</font>

Each gradient step is computed over all $m$ examples. This is also called Batch Gradient Descent.

For $l = 1, ..., L$: 
$$ W^{[l]} = W^{[l]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial W^{[l]}} \quad b^{[l]} = b^{[l]} - \alpha \cdot \frac{\partial \mathcal{L}}{\partial b^{[l]}} $$

``` python

# Initialize_parameters
parameters = initialize_parameters(layers_dims)
# Loop (gradient descent)
for i in range(num_iterations):
    # Forward propagation 
    A_last, caches = forward_propagation_whole_process(X, parameters, keep_prob)
    # Compute cost
    cost = compute_cost_with_regularization(A_last, Y, parameters, lambd)
    # Backward propagation
    grads = backward_propagation_whole_process(A_last, Y, caches, lambd, keep_prob)
    # Update parameters
    parameters = update_parameters(parameters, grads, learning_rate)
```

<font color = green >

## Stochastic Gradient Descent

</font>

Each gradient step is computed on a single example (one of the $m$ examples).
``` python

# Initialize_parameters
parameters = initialize_parameters(layers_dims)
# Loop (gradient descent)
for i in range(num_iterations):
    for j in range(0, m):
        # Select the j-th example, keeping the 2-D shape (features, 1)
        x_j, y_j = X[:, j:j+1], Y[:, j:j+1]
        # Forward propagation 
        A_last, caches = forward_propagation_whole_process(x_j, parameters, keep_prob)
        # Compute cost
        cost = compute_cost_with_regularization(A_last, y_j, parameters, lambd)
        # Backward propagation
        grads = backward_propagation_whole_process(A_last, y_j, caches, lambd, keep_prob)
        # Update parameters
        parameters = update_parameters(parameters, grads, learning_rate)
```

<img src = "data/stochastic.png" align = 'left'>


<font color = green >

## Mini-Batch Gradient descent

</font>

Each gradient step is computed on $k$ of the $m$ examples.
<img src = "data/mini_batch.png" align = 'left' >


<font color = green >

### Mini-batch implementation 

</font>

1. Shuffle samples 
2. Split to mini-batches 

<img src = "data/batches.png" align = 'left' height = '500' width = '500'>

<div style="clear:left;"></div>

``` python

# Initialize_parameters
parameters = initialize_parameters(layers_dims)
# Loop (gradient descent for epochs)
for i in range(num_epochs):    
    # Define the random minibatches
    minibatches = random_mini_batches(X, Y, minibatch_size, seed)
    # Loop (gradient descent for minibatches)
    for minibatch in minibatches:
        # Select a minibatch
        (minibatch_X, minibatch_Y) = minibatch
        # Forward propagation 
        A_last, caches = forward_propagation_whole_process(minibatch_X, parameters, keep_prob)
        # Compute cost
        cost = compute_cost_with_regularization(A_last, minibatch_Y, parameters, lambd)
        # Backward propagation
        grads = backward_propagation_whole_process(A_last, minibatch_Y, caches, lambd, keep_prob)
        # Update parameters
        parameters = update_parameters(parameters, grads, learning_rate)
```

``` python

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    X - input data, of shape (input size, number of examples)
    Y - true "label" vector (1/0), of shape (1, number of examples)
    mini_batch_size - size of the mini-batches, integer
    
    Returns:
    list of (mini_batch_X, mini_batch_Y) tuples
    """    
    np.random.seed(seed)
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    #  Shuffle (X, Y) with the same permutation so pairs stay aligned
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Split to complete mini batches
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of full-size mini batches in the partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, mini_batch_size * k : mini_batch_size * (k+1)]
        mini_batch_Y = shuffled_Y[:, mini_batch_size * k : mini_batch_size * (k+1)]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size):
    # take only the examples left after the complete mini-batches
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches
```

<font color = green >
    
## Exponentially weighted averages

</font>

$
v_{0} = 0\\
v_{1} = \beta \cdot v_{0} + (1 - \beta)\cdot t^{1} \\
v_{2} = \beta \cdot v_{1} + (1 - \beta)\cdot t^{2} \\
v_{3} = \beta \cdot v_{2} + (1 - \beta)\cdot t^{3} \\
...\\
v_{n} = \beta \cdot v_{n-1} + (1 - \beta)\cdot t^{n} \\
$

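Example for $\beta = 0.9$, where $t^{n}$ is the $n$-th observation (here a temperature reading). With this $\beta$, $v_{n}$ averages over roughly the last $\frac{1}{1 - \beta} = 10$ observations.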

<img src = "data/temperature_2.png" align = 'left' height = '400' width = '400'>
<img src = "data/temperature_3.png" align = 'left' height = '400' width = '400'>
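The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the plots: the function name `exponentially_weighted_average` and the temperature values are made up for the example.

```python
import numpy as np

def exponentially_weighted_average(values, beta=0.9):
    """Return v_1..v_n for the recurrence v_n = beta * v_{n-1} + (1 - beta) * t_n, with v_0 = 0."""
    v = 0.0
    averages = []
    for t in values:
        v = beta * v + (1 - beta) * t
        averages.append(v)
    return np.array(averages)

# Hypothetical temperature readings, only for illustration
temperatures = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
ewa = exponentially_weighted_average(temperatures, beta=0.9)
print(ewa)
```

Note how the first values are far below the data because $v_{0} = 0$; a larger $\beta$ gives a smoother but slower-reacting curve.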





<font color = green >
    
## Gradient descent with momentum

</font>

For $l = 1, ..., L$: 
$$
v_{W}^{[l]} = \beta \cdot  v_{W}^{[l]} + (1 - \beta) \cdot \frac{\partial \mathcal{L}}{\partial W^{[l]}} 
\quad \rightarrow \quad W^{[l]} = W^{[l]} - \alpha \cdot v_{W}^{[l]}\\
v_{b}^{[l]} = \beta \cdot  v_{b}^{[l]} + (1 - \beta) \cdot \frac{\partial \mathcal{L}}{\partial b^{[l]}} 
\quad \rightarrow \quad b^{[l]} = b^{[l]} - \alpha \cdot v_{b}^{[l]}\\
$$

Note: 
- If $\beta = 0$, then this just becomes standard gradient descent without momentum.
- $\beta = 0.9$ is often a reasonable default value.
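A minimal sketch of the momentum update, assuming parameters and gradients are stored in dictionaries keyed like `"W1"` / `"dW1"` (this key naming and the function name `update_parameters_with_momentum` are assumptions for the example, not taken from the helper functions used earlier):

```python
import numpy as np

def update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.01):
    """One momentum step for every entry of `parameters` (e.g. "W1", "b1", ...)."""
    for key in parameters:
        # Exponentially weighted average of the gradients
        v[key] = beta * v[key] + (1 - beta) * grads["d" + key]
        # Step along the averaged gradient instead of the raw one
        parameters[key] = parameters[key] - learning_rate * v[key]
    return parameters, v

parameters = {"W1": np.array([[1.0]])}
grads = {"dW1": np.array([[2.0]])}
v = {"W1": np.zeros((1, 1))}  # velocities start at zero
parameters, v = update_parameters_with_momentum(parameters, grads, v)
```

The velocity dictionary `v` has to persist between calls, since it carries the history of past gradients.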


<font color = green >
    
## Bias correction in exponentially weighted averages

</font>

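Because $v_{0} = 0$, the first averages are biased towards zero. Dividing by $1 - \beta^{n}$ compensates for this, and the correction fades as $n$ grows:

$$v_{n} = \beta \cdot v_{n-1} + (1 - \beta)\cdot t^{n} \quad \rightarrow \quad 
{v_{n}\,}^{corrected} = \frac{v_{n}}{1 - \beta^{n}}$$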
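As a small sketch of the correction, the exponent below is the number of observations seen so far. The function name `ewa_with_bias_correction` and the constant test signal are assumptions made for illustration.

```python
import numpy as np

def ewa_with_bias_correction(values, beta=0.9):
    """Exponentially weighted average with the start-up bias removed."""
    v = 0.0
    corrected = []
    for n, t in enumerate(values, start=1):
        v = beta * v + (1 - beta) * t
        # Divide by (1 - beta^n) to undo the pull towards v_0 = 0
        corrected.append(v / (1 - beta ** n))
    return np.array(corrected)

# For a constant signal the corrected average equals the signal from the first step
print(ewa_with_bias_correction([10.0, 10.0, 10.0]))
```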

<font color = green >
    
## RMSProp (root mean square prop)

</font>

For $l = 1, ..., L$: 
$$
s_{W}^{[l]} = \beta \cdot s_{W}^{[l]} + (1 - \beta) \cdot (\frac{\partial \mathcal{L} }{\partial W^{[l]} })^2 
\quad\quad\quad
W^{[l]} = W^{[l]} - \alpha \frac{\frac{\partial \mathcal{L} }{\partial W^{[l]} }}{\sqrt{s_{W}^{[l]}} + \varepsilon}
\\
s_{b}^{[l]} = \beta \cdot s_{b}^{[l]} + (1 - \beta) \cdot (\frac{\partial \mathcal{L} }{\partial b^{[l]} })^2 
\quad\quad\quad
b^{[l]} = b^{[l]} - \alpha \frac{\frac{\partial \mathcal{L} }{\partial b^{[l]} }}{\sqrt{s_{b}^{[l]}} + \varepsilon}
$$
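A minimal sketch of one RMSProp step, in which the gradient is divided element-wise by the root of the running average of its squares. The dictionary keys (`"W1"` / `"dW1"`), the function name and the default hyperparameter values are assumptions for the example.

```python
import numpy as np

def update_parameters_with_rmsprop(parameters, grads, s, beta=0.9,
                                   learning_rate=0.001, epsilon=1e-8):
    """One RMSProp step for every entry of `parameters`."""
    for key in parameters:
        g = grads["d" + key]
        # Running average of the squared gradient (element-wise)
        s[key] = beta * s[key] + (1 - beta) * g ** 2
        # Scale the step by the root of that average; epsilon avoids division by zero
        parameters[key] = parameters[key] - learning_rate * g / (np.sqrt(s[key]) + epsilon)
    return parameters, s

parameters = {"W1": np.array([[1.0]])}
grads = {"dW1": np.array([[2.0]])}
s = {"W1": np.zeros((1, 1))}
parameters, s = update_parameters_with_rmsprop(parameters, grads, s)
```

Directions with consistently large gradients get smaller effective steps, which damps oscillations.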


<font color = green >
    
## Adam 

</font>


Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum. 

1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction). 
2. It calculates an exponentially weighted average of the squares of the past gradients, and  stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction). 
3. It updates parameters in a direction based on combining information from "1" and "2".

The update rule is, for $l = 1, ..., L$: 

$$
v_{W}^{[l]} = \beta_1 \cdot  v_{W}^{[l]} + (1 - \beta_1)\cdot   \frac{\partial \mathcal{L} }{ \partial W^{[l]} }; 
\quad\quad\quad
(v_{W}^{[l]})^{corrected} = \frac{v_{W}^{[l]}}{1 - (\beta_1)^t}\\
s_{W}^{[l]} = \beta_2 \cdot s_{W}^{[l]} + (1 - \beta_2) \cdot (\frac{\partial \mathcal{L} }{\partial W^{[l]} })^2 
\quad\quad\quad
(s_{W}^{[l]})^{corrected} = \frac{s_{W}^{[l]}}{1 - (\beta_2)^t} \\
W^{[l]} = W^{[l]} - \alpha \frac{(v_{W}^{[l]})^{corrected}}{\sqrt{(s_{W}^{[l]})^{corrected}} + \varepsilon}
\\
v_{b}^{[l]} = \beta_1 \cdot  v_{b}^{[l]} + (1 - \beta_1)\cdot   \frac{\partial \mathcal{L} }{ \partial b^{[l]} }; 
\quad\quad\quad
(v_{b}^{[l]})^{corrected} = \frac{v_{b}^{[l]}}{1 - (\beta_1)^t}\\
s_{b}^{[l]} = \beta_2 \cdot s_{b}^{[l]} + (1 - \beta_2) \cdot (\frac{\partial \mathcal{L} }{\partial b^{[l]} })^2 
\quad\quad\quad
(s_{b}^{[l]})^{corrected} = \frac{s_{b}^{[l]}}{1 - (\beta_2)^t} \\
b^{[l]} = b^{[l]} - \alpha \frac{(v_{b}^{[l]})^{corrected}}{\sqrt{(s_{b}^{[l]})^{corrected}} + \varepsilon}
$$


where:
- $t$ counts the number of Adam steps taken so far
- $L$ is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages. 
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid dividing by zero
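The update rule above can be sketched as follows. The dictionary layout (`"W1"` / `"dW1"`), the function name and the default values `beta1=0.9`, `beta2=0.999`, `epsilon=1e-8` are assumptions for this example (they are commonly used defaults, not values stated in these notes).

```python
import numpy as np

def update_parameters_with_adam(parameters, grads, v, s, t, beta1=0.9, beta2=0.999,
                                learning_rate=0.001, epsilon=1e-8):
    """One Adam step; `t` is the 1-based count of steps taken so far."""
    for key in parameters:
        g = grads["d" + key]
        # 1. Momentum-like average of the gradients
        v[key] = beta1 * v[key] + (1 - beta1) * g
        # 2. RMSProp-like average of the squared gradients
        s[key] = beta2 * s[key] + (1 - beta2) * g ** 2
        # Bias correction for both averages
        v_corrected = v[key] / (1 - beta1 ** t)
        s_corrected = s[key] / (1 - beta2 ** t)
        # 3. Combine both into the parameter update
        parameters[key] = parameters[key] - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return parameters, v, s

parameters = {"W1": np.array([[1.0]])}
grads = {"dW1": np.array([[2.0]])}
v = {"W1": np.zeros((1, 1))}
s = {"W1": np.zeros((1, 1))}
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t=1)
```

On the first step the corrected averages equal the gradient and its square, so the step size is approximately `learning_rate` regardless of the gradient's magnitude.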

<br>

<font color = green >
    
## Learning rate decay 

</font>

With mini-batch gradient descent the steps are noisy, so you may need to decrease $\alpha$ over time to let the parameters settle near a minimum. 

There is a variety of implementations,
e.g.

$\alpha_{0} = 0.2;  \quad\ \text{decay_rate} = 1; \quad\quad\quad \alpha = \frac{1}{1+ \text{epoch_number} \cdot \text{decay_rate}} \cdot \alpha_{0}$ 

or 

$\alpha =  0.95 ^{\text{epoch_number}} \cdot \alpha_{0}$ 
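Both schedules are one-liners in code. The function names below are made up for illustration; the formulas and the example values $\alpha_{0} = 0.2$, decay rate $1$ and base $0.95$ follow the text above.

```python
def inverse_time_decay(alpha0, epoch_number, decay_rate=1.0):
    """alpha = alpha0 / (1 + epoch_number * decay_rate)"""
    return alpha0 / (1 + epoch_number * decay_rate)

def exponential_decay(alpha0, epoch_number, base=0.95):
    """alpha = base ** epoch_number * alpha0"""
    return (base ** epoch_number) * alpha0

for epoch in range(4):
    print(epoch, inverse_time_decay(0.2, epoch), exponential_decay(0.2, epoch))
```

The first schedule drops quickly in the early epochs, while the exponential one shrinks by a fixed factor every epoch.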


<font color = green >
    
# Multi class classification

## Softmax Regression
</font>
 
 
For binary classification there is a single unit in the last layer $L$ with a sigmoid activation $\sigma$.
<br>For multi-class classification the last layer has $n^{[L]} = n_{c}$ units, where $n_{c}$ is the number of classes.

The output $Z^{[L]}$ is converted to a probability distribution by the following formula:
$$A^{[L]} = \frac{{e\,}^{Z^{[L]}}} {\sum \limits_{k} {e\,}^{z^{[L]}_{k}}} \quad \text{i.e.}\quad a^{[L]}_{i} = \frac{t_{i}} {\sum \limits_{k} t_{k}} 
\text{ where } t = {e\,}^{z^{[L]}}$$
e.g.

$Z^{[L]} = \begin{bmatrix} 5 \\ 2 \\ -1 \\ 3 \end{bmatrix}\quad\rightarrow\quad  t =\begin{bmatrix} e^{5}\\ e^{2}\\ e^{-1}\\ e^{3} \end{bmatrix}  
\quad\rightarrow\quad A^{[L]}= \begin{bmatrix}0.84203357\\  0.04192238\\ 0.00208719\\  0.11395685 \end{bmatrix}
\quad\rightarrow\quad  \hat{y} = \arg\max \, A^{[L]}$
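The worked example can be reproduced with a short NumPy sketch. Subtracting the column maximum before exponentiating is an addition for numerical stability (it does not change the result); the function name `softmax` and the column-per-example layout are assumptions consistent with the shapes used elsewhere in these notes.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (n_c, m)."""
    # Shifting by the max leaves the ratios unchanged but avoids overflow in exp
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return T / np.sum(T, axis=0, keepdims=True)

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])
A = softmax(Z)
y_hat = np.argmax(A, axis=0)  # index of the most probable class per example
print(A.ravel(), y_hat)
```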


<font color = green >
    
### Training softmax classification
</font>
 


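$$y=\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}\quad \quad \hat { y } =\begin{bmatrix} 0.5 \\ 0.1 \\ 0.3 \\ 0.1 \end{bmatrix}\\\text{ }\\ \mathcal{ L }(y,\hat { y } )=-\frac { 1 }{ m } \sum _{ i=1 }^{ m } \sum _{ k=1 }^{ n_{c} }{ y^{(i)}_{ k }\, \log\hat { y }^{(i)}_{ k }  }$$

For the single example above only the true class contributes to the sum: $-\log 0.1 \approx 2.3$.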
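A minimal sketch of the cross-entropy cost for one-hot labels, averaged over the $m$ examples (columns). The function name `cross_entropy_cost` is an assumption, and the sketch assumes every predicted probability is strictly positive so that the logarithm is defined.

```python
import numpy as np

def cross_entropy_cost(Y, Y_hat):
    """Y is one-hot, Y_hat holds predicted probabilities; both of shape (n_c, m)."""
    m = Y.shape[1]
    # Only the entry of the true class survives the multiplication by Y
    return -np.sum(Y * np.log(Y_hat)) / m

y = np.array([[0.0], [1.0], [0.0], [0.0]])
y_hat = np.array([[0.5], [0.1], [0.3], [0.1]])
cost = cross_entropy_cost(y, y_hat)
print(cost)
```

The cost is large here because the model assigns only probability `0.1` to the true class.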


<font color = green >

## Bias and Variance

</font>

**High bias**  (underfitting): train classification error is high and dev classification error is high.
<br>**High variance** (overfitting): train classification error is low but dev classification error is high.
<br>**Optimal** : train classification error is low and dev classification error is low.
<br><br>Note: compare these errors with the **human classification error**, e.g. 
<br>train classification error = 20%, dev classification error = 22% is **high bias** when human classification error is close to zero,<br> but it is close to **optimal** when human classification error is close to 20%.


Main machine learning approach: resolve **high bias first**, then resolve **high variance**.

The following can resolve **high bias**: 
- build larger network (more layers/more units) 
- train longer 
- apply advanced optimization algorithm
- reduce learning rate 
- reduce regularization 
- normalize input 
- use advanced params initialization
- use other network architecture 
- use more/other features

The following can resolve **high variance**: 
- get more data (consider augmented data)
- increase regularization 
- use other network architecture 




<font color = green >
    
## Regularization: Weight decay (Frobenius norm / L2 regularization)


</font>

$$ \mathcal{L} = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right))\quad\rightarrow \quad
\mathcal{L}_{F} = \mathcal{L} +  \frac{\lambda}{2m} \sum\limits_{l = 1}^{L}  \sum\limits_{i = 1}^{n^{[l]}}  \sum\limits_{j = 1}^{n^{[l-1]}} (w^{[l]}_{ij}) ^2
\\
\frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} \frac{\partial \mathcal{L} }{\partial Z^{[l]}} \,@\, A^{[l-1] T} \quad\rightarrow \quad 
\frac{\partial \mathcal{L}_{F} }{\partial W^{[l]}} = \frac{\partial \mathcal{L} }{\partial W^{[l]}}+ \frac{\lambda}{m} W^{[l]}\\
\quad \\
W^{[l]}  = W^{[l]} - \alpha \cdot \frac{\partial \mathcal{L}_{F}}{\partial W^{[l]}} \quad=\quad W^{[l]} - \alpha \cdot (\frac{\partial \mathcal{L} }{\partial W^{[l]}}+ \frac{\lambda}{m} W^{[l]}) \quad = \quad W^{[l]} \cdot (1 -\frac{\alpha \lambda}{m}) - \alpha \cdot \frac{\partial \mathcal{L} }{\partial W^{[l]}}
\\
$$



<font color = green >
    
### Implementation

</font>

- Add the regularization term to the cost
- Add the $\frac{\lambda}{m} W^{[l]}$ term to each weight gradient in backward propagation 
- Update the training loop to pass lambda and to use the regularized cost
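The first two steps can be sketched as two small helpers. The names `l2_penalty` and `add_l2_to_gradient`, and the convention that weight matrices are stored under keys starting with `"W"`, are assumptions for this example rather than the helpers used in the training loops above.

```python
import numpy as np

def l2_penalty(parameters, lambd, m):
    """(lambda / 2m) * sum of squared entries of all weight matrices (biases excluded)."""
    squared_sum = sum(np.sum(np.square(W))
                      for name, W in parameters.items() if name.startswith("W"))
    return lambd / (2 * m) * squared_sum

def add_l2_to_gradient(dW, W, lambd, m):
    """Gradient of the regularized cost with respect to W."""
    return dW + (lambd / m) * W

parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[5.0]])}
penalty = l2_penalty(parameters, lambd=0.5, m=5)
dW1 = add_l2_to_gradient(np.zeros((1, 2)), parameters["W1"], lambd=0.5, m=5)
```

The regularized cost is then the unregularized cost plus `penalty`, and the parameter update uses the adjusted `dW1`.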

<font color = green >
    
## Regularization: dropout


</font>

<img src = "data/19_dropout2.png" align = 'left' height = '500' width = '500'>

<div style="clear:left;"></div>

Note: Dropout is applied only during training, not at prediction time. 


   



<font color = green >
    
### Implementation

</font>

Define `keep_prob`, e.g. `0.8`, which means shutting down approximately `20%` of the units. In forward propagation, modify the post-activation computation: 

- Generate a random matrix with the same shape as $A$:<br> 
    $\quad $ `D = np.random.rand(A.shape[0], A.shape[1])` 
    
- Convert the entries of `D` to 0 or 1 (using `keep_prob` as the threshold):<br>
    $\quad $ `D = D < keep_prob` 
    
- Shut down some neurons of `A`:<br>
    $\quad $`A = A * D`
    
- Scale the values of the neurons that haven't been shut down (**Inverted dropout**):<br>
    $\quad$ `A = A / keep_prob` 

Note: 
- You may apply different `keep_prob` values to different layers  
- Don't apply dropout to the last layer 
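The four steps above, collected into one function. This is a sketch: the function name `dropout_forward` and the optional `rng` argument (a NumPy `Generator`, used here for reproducibility instead of `np.random.rand`) are assumptions for the example.

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8, rng=None):
    """Apply inverted dropout to the activations A; returns the new A and the mask D."""
    rng = np.random.default_rng() if rng is None else rng
    # Boolean mask: True (keep) with probability keep_prob
    D = rng.random(A.shape) < keep_prob
    # Zero out the dropped units, then rescale so the expected activation is unchanged
    A = A * D / keep_prob
    return A, D

A = np.ones((100, 100))
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=np.random.default_rng(0))
print(D.mean(), A_drop.mean())
```

The mask `D` should be kept (for example in the layer cache), because backward propagation has to apply the same mask and the same scaling to the gradient of `A`.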

<font color = green >
    
## Data Augmentation to reduce the variance 

</font>

<img src = "data/19_cat.png" align = 'left' height = '300' width = '300'>

<img src = "data/19_cat_2.png" align = 'left' height = '300' width = '300'>
<img src = "data/19_cat_3.png" align = 'left' height = '300' width = '300'>
<img src = "data/19_cat_4.png" align = 'left' height = '900' width = '900'>
<div style="clear:left;"></div>



   



<font color = green >
    
## Early stopping

</font>

<img src = "data/19_early_stop.png" align = 'left' height = '500' width = '500'>



   



<font color = green >
    
## Vanishing / Exploding gradients

</font>

<img src = "data/19_van.png" align = 'left' height = '500' width = '500'>
<div style="clear:left;"></div>

Simplified case: all $b^{[l]}=0$ and all $g^{[l]}(z) = z$
<br>Then $\hat{y} = W^{[L]} \, @ \, W^{[L-1]} \, @ \, ... \, @ \, W^{[2]} \, @ \, W^{[1]} \, @ \, X$

<br>Assume every layer except the last is initialized as $W^{[l]} = \begin{bmatrix} 0.9 & 0 \\ 0 & 0.9 \end{bmatrix}$. Then $\hat{y} = W^{[L]} @ \begin{bmatrix} 0.9 & 0 \\ 0 & 0.9 \end{bmatrix}^{L-1}  @  X $
<br>If $L$ is large, this product shrinks exponentially, so the gradients of the first layers are very small, each gradient step changes those weights very little, and convergence is very slow.

<br>
In practice the opposite case, exploding gradients, also occurs: when the entries are larger than $1$, the same product grows exponentially with $L$. 
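A quick numerical check of the simplified case, using a diagonal $2 \times 2$ matrix raised to the power $L - 1$. The function name `repeated_product_scale`, the depth of 100 layers and the value `1.1` for the exploding case are assumptions chosen for illustration.

```python
import numpy as np

def repeated_product_scale(diagonal_value, num_layers):
    """Top-left entry of (diagonal_value * I)^(num_layers - 1)."""
    W = diagonal_value * np.eye(2)
    return np.linalg.matrix_power(W, num_layers - 1)[0, 0]

vanishing = repeated_product_scale(0.9, num_layers=100)  # shrinks towards 0
exploding = repeated_product_scale(1.1, num_layers=100)  # grows without bound
print(vanishing, exploding)
```

Even a modest deviation from 1 in each layer is amplified exponentially by depth, in either direction.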

**Normal case of learning weight**

<img src = "data/vanish_1.png" align = 'left' height = '550' width = '550'>

**Vanishing case of learning weight**

<img src = "data/vanish_2.png" align = 'left' height = '550' width = '550'>

**By the chain rule, the derivative for an early layer is a product of many factors**

<img src = "data/vanish_3.png" align = 'left' height = '600' width = '600'>

<font color = green >
    
## Solutions for vanishing/exploding gradient

</font>


- Changing initialization 
- Using non-saturating functions 
- Batch normalization 
- Gradient clipping

<font color = green >
    
### Changing initialization 

</font>


Normalizing the inputs helps avoid exploding gradients and speeds up the training process.

<img src = "data/19_norm_1.png" align = 'left' height = '400' width = '400'>
<img src = "data/19_norm_3.png" align = 'left' height = '400' width = '400'>
<img src = "data/19_norm_2.png" align = 'left' height = '300' width = '300'>

<img src = "data/19_norm_4.png" align = 'left' height = '350' width = '350'>
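Input normalization can be sketched as follows, assuming `X` has shape (features, examples) as elsewhere in these notes. The function name `normalize_inputs` and the small `epsilon` guard against constant features are assumptions for the example.

```python
import numpy as np

def normalize_inputs(X, epsilon=1e-8):
    """Shift each feature (row) to zero mean and scale it to unit standard deviation."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True)
    X_norm = (X - mu) / (sigma + epsilon)
    return X_norm, mu, sigma

X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
X_norm, mu, sigma = normalize_inputs(X)
```

The `mu` and `sigma` computed on the training set should be reused to normalize the dev and test sets, so that all data goes through the same transformation.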



   



If layer $l$ has many inputs ($n^{[l-1]}$ is large), then $Z^{[l]} = W^{[l]} @ A^{[l-1]}+ b^{[l]}$ can become very large, e.g. for a single unit:  $z= w_{1} x_{1} + w_{2} x_{2} + ...+  w_{n} x_{n}$. To keep the layer output from growing, scale the initial parameters according to $n$: the larger $n$, the smaller $W$.  
<br> One approach is to use **Xavier (Glorot) initialization**: $$W^{[l]} = random \cdot \frac{1}{\sqrt{n^{[l-1]}}} $$

or a modified version that also takes the size of the layer itself into account:  <br> $$W^{[l]} = random \cdot \frac{1}{\sigma},\quad \text{where} \quad
\sigma^2 = \frac{ n^{[l-1]}+ n^{[l]}}{2} $$
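A sketch of both scalings, assuming that `random` in the formulas denotes samples from a standard normal distribution (that choice, the function names and the `rng` argument are assumptions for this example).

```python
import numpy as np

def initialize_xavier(n_prev, n_curr, rng=None):
    """W of shape (n_curr, n_prev), scaled by 1 / sqrt(n_prev)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.standard_normal((n_curr, n_prev)) / np.sqrt(n_prev)

def initialize_glorot(n_prev, n_curr, rng=None):
    """W of shape (n_curr, n_prev), scaled by 1 / sigma with sigma^2 = (n_prev + n_curr) / 2."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt((n_prev + n_curr) / 2)
    return rng.standard_normal((n_curr, n_prev)) / sigma

W_xavier = initialize_xavier(400, 300, rng=np.random.default_rng(0))
W_glorot = initialize_glorot(400, 300, rng=np.random.default_rng(0))
print(W_xavier.std(), W_glorot.std())
```

With 400 inputs the first variant gives weights with a standard deviation of about `0.05`, so a sum over 400 such terms stays on the order of 1.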

<img src = "data/xavier.png" align = 'left' height = '700' width = '700'>
<img src = "data/init_1.png" align = 'left' height = '700' width = '700'>
<img src = "data/init_2.png" align = 'left' height = '700' width = '700'>
<img src = "data/init_3.png" align = 'left' height = '500' width = '500'>


<font color = green >

### Using non-saturating functions

<img src = "data/satur_0.png" align = 'left' height = '700' width = '700'>
<img src = "data/satur_1.png" align = 'left' height = '337' width = '337'>
<img src = "data/satur_2.png" align = 'left' height = '400' width = '400'>
  
</font>


<font color = green >
    
### Batch normalization 
   
</font>

<img src = "data/batch_norm_2.png" align = 'left' height = '700' width = '700'>    
<img src = "data/batch_norm.png" align = 'left' height = '700' width = '700'>


**Note:** 
- Batch normalization works at the mini-batch level, not on the whole data set. To estimate the statistics of the whole data set (at a given layer) it keeps a running average: e.g. if the mean of the first batch is `10` and the mean of the second batch is `12`, it uses `11` as the estimate of the mean.

- $\gamma$ and $\beta$  are trainable model parameters 
- $\mu$ and $\sigma$ are not trained by gradient descent; they are accumulated as running averages during training and used at prediction time 
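A minimal forward-pass sketch of batch normalization in its commonly used form, for pre-activations `Z` of shape (units, examples). The function name, the exponentially weighted form of the running average, and the defaults `momentum=0.9` and `epsilon=1e-5` are assumptions for this example; the exact formulation in the figures above may differ in detail.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running_mean, running_var,
                      momentum=0.9, epsilon=1e-5, training=True):
    """Normalize Z per unit, then scale by gamma and shift by beta."""
    if training:
        # Statistics of the current mini-batch
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        # Running averages, kept for use at prediction time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)
    return gamma * Z_norm + beta, running_mean, running_var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(3, 50))
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
out, running_mean, running_var = batchnorm_forward(
    Z, gamma, beta, running_mean=np.zeros((3, 1)), running_var=np.ones((3, 1)))
```

Note that `beta` here is the batch normalization shift parameter, unrelated to the $\beta$ of momentum and Adam.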




<font color = green >
    
### Gradient clipping
   
</font>


#### Clip by threshold 
Clip every value in the gradient vector to lie within a fixed range, e.g. between `-1` and `1`. 

<img src = "data/clipping.png" align = 'left' height = '400' width = '400'>

#### Clip by norm 

Rescale the whole gradient vector when its norm exceeds a threshold, which keeps the direction of the gradient unchanged.


<img src = "data/clipping_2.png" align = 'left' height = '400' width = '400'>

**Note:** Clipping may produce small gradient values and therefore slow learning. The solution is to experiment with the thresholds. 
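Both clipping variants in a short NumPy sketch. The function names and the default thresholds are assumptions for the example.

```python
import numpy as np

def clip_by_value(grad, threshold=1.0):
    """Clip every entry to [-threshold, threshold]; may change the direction."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the whole gradient if its L2 norm exceeds max_norm; keeps the direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])      # norm 5
print(clip_by_value(g))        # entries limited to [-1, 1]
print(clip_by_norm(g))         # same direction, norm 1
```

In the example, clipping by value turns the ratio between the two components from 3:4 into 1:1, while clipping by norm preserves it.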

[What is Vanishing/Exploding Gradients Problem in NNs](https://youtu.be/2f_45VzKEfE?si=ROktoR-2OyUH-sgN)

[How to Choose the Correct Initializer for your Neural Network](https://youtu.be/ix3FcOaU6UI?si=AxSBpbu_RpshlOAG)

[How to Choose an Activation Function for Neural Networks](https://youtu.be/_UjA9sk_fK4?si=xXFPi1LfTmwB3ThU)