<font color = green >

# Improving deep neural networks

</font>

- Bias and Variance  
- Regularization: Weight decay 
- Regularization: dropout
- Getting more data
- Early stopping
- Vanishing / Exploding gradients
- Normalization of input 



<font color = green >

## Bias and Variance

</font>

**High bias**  (underfitting): train classification error is high and dev classification error is high.
<br>**High variance** (overfitting): train classification error is low but dev classification error is high.
<br>**Optimal** : train classification error is low and dev classification error is low.
<br><br>Note: these errors must be judged against **human classification error**, e.g.
<br>train classification error = 20%, dev classification error = 22% is **high bias** when human classification error is close to zero,<br> but it is close to **optimal** when human classification error is close to 20%.
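
The comparison above can be sketched as a small helper. This is only an illustration: the function name `diagnose` and the 5% tolerance `tol` are assumptions introduced here, not part of the original notes.

```python
def diagnose(train_err, dev_err, human_err=0.0, tol=0.05):
    """Label a model from its error rates (fractions, e.g. 0.20 for 20%).

    tol is a hypothetical threshold for what counts as a "large" gap.
    """
    avoidable_bias = train_err - human_err   # gap between train and human error
    variance = dev_err - train_err           # gap between dev and train error
    problems = []
    if avoidable_bias > tol:
        problems.append("high bias")
    if variance > tol:
        problems.append("high variance")
    return problems or ["close to optimal"]


print(diagnose(0.20, 0.22, human_err=0.00))  # ['high bias']
print(diagnose(0.20, 0.22, human_err=0.20))  # ['close to optimal']
print(diagnose(0.01, 0.15, human_err=0.00))  # ['high variance']
```

The same 20% / 22% pair gets a different label depending only on `human_err`, which is the point of the note.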


Main machine learning approach: resolve **high bias first**, then resolve **high variance**.

The following can resolve **high bias**: 
- build a larger network (more layers / more units)
- train longer
- apply a more advanced optimization algorithm
- reduce learning rate
- reduce regularization
- normalize input
- use better parameter initialization
- use another network architecture
- use more/other features

The following can resolve **high variance**: 
- get more data (consider augmented data)
- increase regularization 
- use other network architecture 




<font color = green >
    
## Regularization: Weight decay (Frobenius norm / L2 norm regularization)


</font>

$$ \mathcal{L} = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right))\quad\rightarrow \quad
\mathcal{L}_{F} = \mathcal{L} +  \frac{\lambda}{2m} \sum\limits_{l = 1}^{L}  \sum\limits_{i = 1}^{n^{[l]}}  \sum\limits_{j = 1}^{n^{[l-1]}} (w^{[l]}_{ij}) ^2
\\
\frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} \frac{\partial \mathcal{L} }{\partial Z^{[l]}} \,@\, A^{[l-1] T} \quad\rightarrow \quad 
\frac{\partial \mathcal{L}_{F} }{\partial W^{[l]}} = \frac{\partial \mathcal{L} }{\partial W^{[l]}}+ \frac{\lambda}{m} W^{[l]}\\
\quad \\
W^{[l]}  = W^{[l]} - \alpha \cdot \frac{\partial \mathcal{L}_{F}}{\partial W^{[l]}} \quad=\quad W^{[l]} - \alpha \cdot \left(\frac{\partial \mathcal{L} }{\partial W^{[l]}}+ \frac{\lambda}{m} W^{[l]}\right) \quad = \quad W^{[l]} \cdot \left(1 -\frac{\alpha \lambda}{m}\right) - \alpha \cdot \frac{\partial \mathcal{L} }{\partial W^{[l]}}
\\
$$
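
The last line of the derivation explains the name "weight decay": a gradient step on the regularized cost is the same as first shrinking $W^{[l]}$ by the factor $(1 - \frac{\alpha \lambda}{m})$ and then taking the usual step. A quick numerical check, with random arrays standing in for a real layer and hypothetical values for `alpha`, `lambd` and `m`:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))      # weights of one layer
dW = rng.standard_normal((3, 4))     # stand-in for the unregularized gradient dL/dW
alpha, lambd, m = 0.1, 0.7, 50       # hypothetical learning rate, lambda, number of examples

# Form 1: gradient step on the regularized cost L_F
W_step = W - alpha * (dW + (lambd / m) * W)

# Form 2: shrink ("decay") the weights first, then take the usual step
W_decay = W * (1 - alpha * lambd / m) - alpha * dW

print(np.allclose(W_step, W_decay))  # True
```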



<font color = green >
    
### Implementation

</font>

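- Add the regularization term to the cost
- Add the term $\frac{\lambda}{m} W^{[l]}$ to each $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ in backward propagation
- Update the training model to take `lambda` as a parameter and use the regularized cost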
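
A minimal sketch of the first two steps, assuming labels `Y` and output activations `AL` of shape `(1, m)` and the weights passed as a plain list of arrays. The function names are illustrative, not taken from the original notebook.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of the squared entries of every W[l]."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def cost_with_regularization(AL, Y, weights, lambd):
    """Cross-entropy cost plus the L2 penalty."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return cross_entropy + l2_penalty(weights, lambd, m)

def regularized_grad(dW, W, lambd, m):
    """Add the derivative of the penalty, (lambda / m) * W, to dL/dW."""
    return dW + (lambd / m) * W

# Toy values: 3 examples, two layers whose weights are all ones
AL = np.array([[0.8, 0.1, 0.6]])
Y = np.array([[1, 0, 1]])
weights = [np.ones((2, 3)), np.ones((1, 2))]
print(cost_with_regularization(AL, Y, weights, lambd=0.0))  # plain cross-entropy
print(cost_with_regularization(AL, Y, weights, lambd=0.1))  # larger by the penalty
```

With `lambd=0` both functions reduce to the unregularized versions, which makes it easy to compare runs with and without the penalty.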

<font color = green >
    
## Regularization: dropout


</font>

<img src = "data/19_dropout2.png" align = 'left' height = '500' width = '500'>

<div style="clear:left;"></div>

Note: Dropout is applied only during training, not when predicting.


   



<font color = green >
    
### Implementation

</font>

Define `keep_prob`, e.g. `0.8`, which means shutting down approximately `20%` of the units. In forward propagation, modify the computation of the post-activation $A$ as follows:

- Generate a random matrix with the same shape as $A$:<br> 
    $\quad $ `D = np.random.rand(A.shape[0], A.shape[1])` 
    
- convert entries of D to 0 or 1 (using keep_prob as the threshold):<br>
    $\quad $ `D = D < keep_prob` 
    
- shut down some neurons of A:<br>
    $\quad $`A = A * D`
    
- scale the value of neurons that haven't been shut down (**Inverted dropout**):<br>
    $\quad$ `A = A / keep_prob` 

Note: 
- You may apply different threshold (`keep_prob`) values to different layers
- Don't apply dropout to the last layer 
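
The four steps above can be combined into one function. This is a sketch rather than the notebook's own implementation: the name `dropout_forward` is hypothetical, and it uses NumPy's `Generator` API instead of `np.random.rand` so that the run is reproducible.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to the activations A of one hidden layer."""
    D = rng.random(A.shape) < keep_prob   # boolean mask: True = unit is kept
    A = A * D                             # shut down the dropped units
    A = A / keep_prob                     # rescale so the expected value is unchanged
    return A, D                           # keep D: the same mask is needed in backprop

rng = np.random.default_rng(1)
A = np.ones((4, 1000))
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=rng)
print(D.mean())       # fraction of kept units, close to 0.8
print(A_drop.mean())  # close to 1.0, the mean of A before dropout
```

Because of the division by `keep_prob`, the mean activation stays roughly the same as without dropout, so nothing has to be rescaled at prediction time.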

<font color = green >
    
## Data Augmentation to reduce the variance 

</font>

<img src = "data/19_cat.png" align = 'left' height = '300' width = '300'>

<img src = "data/19_cat_2.png" align = 'left' height = '300' width = '300'>
<img src = "data/19_cat_3.png" align = 'left' height = '300' width = '300'>
<img src = "data/19_cat_4.png" align = 'left' height = '900' width = '900'>
<div style="clear:left;"></div>
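
A minimal NumPy sketch of augmentation for image data. The notes do not say which transformations the figures use, so horizontal mirroring and random cropping are assumed here as common choices; `augment` and its `crop` parameter are illustrative names.

```python
import numpy as np

def augment(image, rng, crop=0.9):
    """Return a randomly flipped and cropped copy of an (H, W, C) image array."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                 # horizontal mirror
    h, w = image.shape[:2]
    ch, cw = int(h * crop), int(w * crop)         # size of the crop window
    top = rng.integers(0, h - ch + 1)             # random window position
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw, :]

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                   # stand-in for a real photo
batch = [augment(image, rng) for _ in range(4)]   # four variants of one example
print(batch[0].shape)  # (57, 57, 3)
```

In a real pipeline the crops would typically be resized back to the network's input size before training.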



   



<font color = green >
    
## Early stopping

</font>

<img src = "data/19_early_stop.png" align = 'left' height = '500' width = '500'>
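
Early stopping is commonly implemented by monitoring the dev error after every epoch and keeping the parameters from the epoch where it was lowest. The sketch below shows only that selection logic on a made-up list of dev errors; `early_stopping_epoch` and `patience` are hypothetical names, not part of the original notes.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch whose parameters should be kept.

    Training is considered stopped once the dev error has not improved
    for `patience` consecutive epochs.
    """
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err     # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for too long
    return best_epoch

# Dev error falls, then rises as the model starts to overfit.
dev_errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.26, 0.30, 0.35]
print(early_stopping_epoch(dev_errors))  # 3
```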



   



<font color = green >
    
## Vanishing / Exploding gradients

</font>

<img src = "data/19_van.png" align = 'left' height = '500' width = '500'>
<div style="clear:left;"></div>

Simplified case: all $b^{[l]}=0$ and all $g^{[l]}(z) = z$
<br>Then $\hat{y} = W^{[L]} \, @ \, W^{[L-1]} \, @ \, ... \, @ \, W^{[2]} \, @ \, W^{[1]} \, @ \, X$

<br>Let's assume that every $W^{[l]}$ with $l < L$ is initialized as $\begin{bmatrix} 0.9 & 0 \\ 0 & 0.9 \end{bmatrix}$. Then $\hat{y} = W^{[L]} \, @ \, \begin{bmatrix} 0.9 & 0 \\ 0 & 0.9 \end{bmatrix}^{L-1} @ \, X $
<br>If $L$ is large, the gradients of the first layers become very small, so each gradient step makes little progress and convergence is very slow (vanishing gradients).
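
To get a feeling for the numbers, the matrix power from the simplified case can be evaluated directly. The helper `product_scale` is a hypothetical name, and the value `1.1` is an added assumption used to show the opposite, exploding behaviour.

```python
import numpy as np

def product_scale(w, L):
    """Largest entry of (w * I)^(L-1), the factor applied to X in the simplified network."""
    W = w * np.eye(2)
    return np.linalg.matrix_power(W, L - 1).max()

for L in (10, 50, 150):
    # 0.9 shrinks toward zero as L grows, 1.1 grows without bound
    print(L, product_scale(0.9, L), product_scale(1.1, L))
```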

<br>
In practice gradients can also explode: with entries larger than 1 (e.g. $1.1$ instead of $0.9$) the same product grows exponentially with $L$.
<br>If the number of inputs $n$ to layer $l$ is large, then $Z^{[l]} = W^{[l]} @ A^{[l-1]}+ b^{[l]}$ can become very large, e.g. for a single unit: $z= w_{1} x_{1} + w_{2} x_{2} + ...+  w_{n} x_{n}$. To keep the layer output from growing, initialize the parameters based on $n$: the larger $n$, the smaller $W$.
<br> One approach is to use Xavier initialization: $$W^{[l]} = random \cdot \sqrt{\frac{1}{n^{[l-1]}}} $$
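
A sketch of this initialization, assuming that "random" means samples from a standard normal distribution (the notes do not specify the distribution). The function name `xavier_init` is illustrative.

```python
import numpy as np

def xavier_init(n_prev, n_curr, rng):
    """W of shape (n_curr, n_prev), drawn from N(0, 1) and scaled by sqrt(1 / n_prev)."""
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(1.0 / n_prev)

rng = np.random.default_rng(0)
for n_prev in (10, 1000):
    W = xavier_init(n_prev, 500, rng)
    A_prev = rng.standard_normal((n_prev, 200))   # unit-variance inputs, 200 examples
    Z = W @ A_prev
    print(n_prev, Z.std())                        # stays near 1 for both layer sizes
```

Without the scaling factor the spread of `Z` would grow like $\sqrt{n^{[l-1]}}$, which is the effect described above.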


<font color = green >
    
## Normalizing inputs

</font>

Normalizing inputs helps avoid exploding gradients and speeds up training.
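
A minimal sketch of input normalization, assuming the data matrix `X` has shape `(n_features, m)` and that normalization means subtracting the per-feature mean and dividing by the per-feature standard deviation. The function names and the small constant `eps` are illustrative additions.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation of the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Shift to zero mean and scale to unit variance (eps avoids division by zero)."""
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
# Two features on very different scales
X_train = rng.normal(loc=[[5.0], [-200.0]], scale=[[1.0], [50.0]], size=(2, 1000))
mu, sigma = fit_normalizer(X_train)
X_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(rng.normal(size=(2, 10)), mu, sigma)  # reuse the train statistics
print(X_norm.mean(axis=1), X_norm.std(axis=1))                # approximately [0, 0] and [1, 1]
```

The dev and test sets are normalized with the `mu` and `sigma` computed on the training set, so that all data goes through the same transformation.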

<img src = "data/19_norm_1.png" align = 'left' height = '400' width = '400'>
<img src = "data/19_norm_3.png" align = 'left' height = '400' width = '400'>
<img src = "data/19_norm_2.png" align = 'left' height = '300' width = '300'>

<img src = "data/19_norm_4.png" align = 'left' height = '350' width = '350'>



   

