## Vanishing/Exploding Gradients Problems

* *Vanishing gradients* problem: gradients often get smaller and smaller as backpropagation progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution.

* *Exploding gradients* problem: the opposite case, where gradients grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is mostly encountered in recurrent neural networks.

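Both problems are worse with the logistic activation function: it saturates at $0$ or $1$ for large negative or positive inputs, where its derivative is close to $0$, so almost no gradient is left to propagate back through the network.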
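
To make the saturation argument concrete, the sketch below evaluates the logistic derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at $0.25$ for $z = 0$ and approaches $0$ as $|z|$ grows. Under the simplifying assumption that the weights are ignored, backpropagating through $n$ logistic activations scales the gradient by at most $0.25^n$. The function names are illustrative and not taken from the book.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def max_gradient_scale(n_layers):
    """Upper bound on the factor contributed by n logistic activations
    (weights ignored): the derivative is at most 0.25 per layer."""
    return sigmoid_grad(0.0) ** n_layers

# The derivative collapses as the unit saturates.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Even in the best case the bound shrinks geometrically with depth.
for n in (1, 5, 10):
    print(f"{n:2d} layers: gradient scaled by at most {max_gradient_scale(n):.2e}")
```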

### Xavier and He Initialization
For the signal to flow properly in both directions, the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction. Both conditions cannot hold at once unless the layer has an equal number of input and output connections, but the initialization strategies below are a compromise that has proven to work well in practice.


| Activation function        | Uniform distribution [-r,r]           | Normal distribution  |
| :-------------|:-------------:| :-------------:|
| Logistic      | $r = \sqrt{\dfrac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma = \sqrt{\dfrac{2}{n_{inputs}+n_{outputs}}}$ |
| Hyperbolic tangent      | $r = 4\sqrt{\dfrac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma = 4\sqrt{\dfrac{2}{n_{inputs}+n_{outputs}}}$|
| ReLU (and its variants) |  $r = \sqrt{2}\sqrt{\dfrac{6}{n_{inputs}+n_{outputs}}}$ | $\sigma = \sqrt{2}\sqrt{\dfrac{2}{n_{inputs}+n_{outputs}}}$ |
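
As a rough illustration, the table can be turned into a small helper that returns $r$ and $\sigma$ for a layer and samples a weight matrix. This is a framework-free sketch using only the standard library. The names `init_params` and `init_weights_uniform` are made up for this example, and the multipliers are copied from the table exactly as these notes give them.

```python
import math
import random

# Multipliers taken from the table above (as given in these notes).
_SCALE = {"logistic": 1.0, "tanh": 4.0, "relu": math.sqrt(2.0)}

def init_params(n_inputs, n_outputs, activation="logistic"):
    """Return (r, sigma): the uniform bound and the normal std-dev from the table."""
    scale = _SCALE[activation]
    fan_sum = n_inputs + n_outputs
    r = scale * math.sqrt(6.0 / fan_sum)
    sigma = scale * math.sqrt(2.0 / fan_sum)
    return r, sigma

def init_weights_uniform(n_inputs, n_outputs, activation="logistic", seed=None):
    """Sample an n_inputs x n_outputs weight matrix uniformly from [-r, r]."""
    rng = random.Random(seed)
    r, _ = init_params(n_inputs, n_outputs, activation)
    return [[rng.uniform(-r, r) for _ in range(n_outputs)]
            for _ in range(n_inputs)]

r, sigma = init_params(300, 100, "relu")
W = init_weights_uniform(300, 100, "relu", seed=42)
print(f"r = {r:.4f}, sigma = {sigma:.4f}, W is {len(W)} x {len(W[0])}")
```

The two columns describe the same variance: a uniform distribution on $[-r, r]$ has standard deviation $r/\sqrt{3}$, which equals the $\sigma$ in the same row.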


### Nonsaturating Activation Function
* ReLU suffers from the *dying ReLUs* problem: if a neuron's weights get updated such that the weighted sum of the neuron's inputs is negative, it starts outputting $0$ and is unlikely to come back to life, since the gradient of ReLU is $0$ for negative inputs.
* Variants that address this (pp. 279 - 280): *leaky* ReLU, *randomized leaky* ReLU (RReLU), *parametric leaky* ReLU (PReLU), *scaled exponential linear unit* (SELU)

* $ReLU(z) = \max (0, z)$
* $LeakyReLU_\alpha(z) = \max (\alpha z, z)$
<img src="images\11_leaky_relu_plot.png" width = "300" alt="Leaky Relu">
* $ELU(z)_\alpha =\left \{\begin{array}{ll}  
                             \alpha (\exp (z)-1) & z<0 \\ 
                             z & z\geq 0 
                   \end{array} \right.$
<img src="images\11_elu_plot.png" width = "300" alt="ELU">

* $SELU(z)_\alpha = \lambda \left \{\begin{array}{ll}  
                             \alpha (\exp (z)-1) & z<0 \\ 
                             z & z\geq 0 
                   \end{array} \right.$
<img src="images\11_selu_plot.png" width = "300" alt="SELU">
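
The formulas above translate almost line by line into scalar Python functions. The following is a minimal sketch rather than a library implementation. The default `alpha` values and the SELU constants are assumptions that do not appear in these notes; the SELU values are the commonly quoted constants for self-normalizing networks. `leaky_relu` relies on $0 < \alpha < 1$ for the `max` form to be correct.

```python
import math

# SELU constants (assumed; not stated in these notes).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """LeakyReLU(z) = max(alpha * z, z), valid for 0 < alpha < 1."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """ELU(z) = alpha * (exp(z) - 1) for z < 0, and z otherwise."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def selu(z, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    """SELU(z) = lambda * ELU(z) with a specific alpha."""
    return lam * elu(z, alpha)

for z in (-2.0, -0.5, 0.0, 1.5):
    print(f"z={z:5.1f}  relu={relu(z):.3f}  leaky={leaky_relu(z):.3f}  "
          f"elu={elu(z):.3f}  selu={selu(z):.3f}")
```

Unlike ReLU, the other three functions return a nonzero value with a nonzero slope for negative inputs, which is what lets a neuron recover instead of dying.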

### Batch Normalization

* At training time, use the mean $\mu$ and standard deviation $\sigma$ of the current mini-batch to center ($-\mu$) and normalize ($/\sigma$) each input, then apply scaling ($\cdot \gamma$) and shifting ($+\beta$).

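* At test time, there is no mini-batch to compute statistics from, so use the empirical mean and standard deviation of the whole training set instead. Note that four parameters are kept for each batch-normalized layer: $\gamma$ and $\beta$ are learned by backpropagation, while $\mu$ and $\sigma$ are estimated from the training data.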
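
The two modes can be sketched for a single scalar feature as follows. This is an illustrative, framework-free version. The class name, the `momentum` and `eps` parameters, and the use of an exponential moving average to approximate the whole-training-set statistics are assumptions (a common implementation choice), not something these notes specify. $\gamma$ and $\beta$ are left at their initial values because no backpropagation is performed here.

```python
import math

class BatchNorm1Feature:
    """Batch Normalization for a single scalar feature (illustrative sketch)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma = 1.0        # scale, normally trained by backpropagation
        self.beta = 0.0         # shift, normally trained by backpropagation
        self.moving_mean = 0.0  # running estimates used at test time
        self.moving_var = 1.0
        self.momentum = momentum
        self.eps = eps          # avoids division by zero

    def __call__(self, batch, training):
        if training:
            # Statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update the running statistics with an exponential moving average.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # No batch statistics at test time: use the running estimates.
            mean, var = self.moving_mean, self.moving_var
        std = math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) / std + self.beta for x in batch]

bn = BatchNorm1Feature()
train_out = bn([2.0, 4.0, 6.0, 8.0], training=True)
test_out = bn([5.0], training=False)
print(train_out, test_out)
```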

### BN with TensorFlow
* (pp. 284 - 285)

### Gradient Clipping (pp. 286)

## Reusing Pretrained Layers
* Generally, transfer learning works well only if the inputs have similar low-level features.

### Reusing a TensorFlow Model (pp.287)

### Reusing Models from Other Frameworks (pp.288)

### Freezing the Lower Layers (pp. 289)
* `optimizer.minimize(loss, var_list = ...)`
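
The effect of passing a restricted `var_list` can be imitated without any framework: only the listed variables receive an update, so everything else stays frozen. The sketch below is not TensorFlow code. `sgd_step`, the layer names, and the scalar "weights" are all hypothetical stand-ins chosen for illustration.

```python
def sgd_step(params, grads, var_list, learning_rate=0.1):
    """Apply one gradient descent step to the variables named in var_list only."""
    for name in var_list:
        params[name] -= learning_rate * grads[name]
    return params

# Hypothetical scalar "layers"; real variables would be weight tensors.
params = {"hidden1": 0.5, "hidden2": -0.3, "hidden3": 0.8, "outputs": 0.1}
grads = {"hidden1": 1.0, "hidden2": 1.0, "hidden3": 1.0, "outputs": 1.0}

# Freeze hidden1 and hidden2 by leaving them out of var_list.
trainable = ["hidden3", "outputs"]
sgd_step(params, grads, var_list=trainable)
print(params)
```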

### Caching the Frozen Layers (pp.290)
* Run the whole training set through the frozen lower layers once, and then use mini-batches of the outputs to train higher layers.
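
The idea can be sketched as follows: because the frozen layers never change, their outputs are computed once and reused in every epoch. `frozen_layers` is a hypothetical stand-in for the real lower layers, and the call counter exists only to show that the expensive pass happens once per training instance rather than once per epoch.

```python
import random

calls = {"frozen": 0}

def frozen_layers(x):
    """Stand-in for the frozen lower layers (hypothetical fixed transform)."""
    calls["frozen"] += 1
    return [2.0 * v + 1.0 for v in x]

def cache_frozen_outputs(X):
    """Run every training instance through the frozen layers exactly once."""
    return [frozen_layers(x) for x in X]

def mini_batches(cached, batch_size, rng):
    """Yield shuffled mini-batches of the cached outputs."""
    indices = list(range(len(cached)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [cached[i] for i in indices[start:start + batch_size]]

X_train = [[float(i), float(i + 1)] for i in range(10)]
cached = cache_frozen_outputs(X_train)

rng = random.Random(0)
for epoch in range(3):
    for batch in mini_batches(cached, batch_size=4, rng=rng):
        pass  # train the upper layers on `batch` here

# One call per training instance, regardless of the number of epochs.
print(calls["frozen"])
```

The trade-off is memory: the cached outputs for the whole training set must be stored, in exchange for skipping the lower layers on every later pass.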

### Tweaking, Dropping, or Replacing the Upper Layers (pp. 290)
* Find the right number of layers to reuse

### Model Zoos (pp.291)

### Unsupervised Pretraining (pp.291)
 * If we have a complex task but not much labeled training data, we can use unsupervised pretraining (e.g., autoencoders) to train the lower layers on unlabeled data, and then fine-tune the model using the labeled data.

### Pretraining on an Auxiliary Task (pp.292)

## Resources
* Full [implementation](https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb) by [Aurélien Géron](https://github.com/ageron)