# 11. Training Deep Neural Networks

After introducing relatively shallow nets, let's move on to deeper DNNs: 10 or more layers, hundreds to thousands of neurons per layer, and tens of thousands of connections or more.

Here are some problems we may encounter along the way, and some techniques we may try out to solve them:

1. Vanishing / exploding gradients, which make lower layers very hard to train
                > Initialization 
2. Not enough data / too costly to label 
                > Transfer learning and unsupervised pretraining
3. Painfully slow training 
                > Optimizers to the rescue!
4. Serious overfitting risk for models with millions of parameters, especially if there are not enough training instances or if they are too noisy
                > Good ol' (and new) regularization techniques

### 1. Vanishing / Exploding Gradients Problems

As we know from the previous chapter, backpropagation goes from the output layer to the input layer, propagating the error gradient along the way. Once it has computed the gradient of the cost function with regard to each parameter of the network, it uses these gradients to update each parameter with a Gradient Descent step.

**Vanishing** gradients get smaller and smaller as the algorithm progresses down to the lower layers, leaving the lower layers' weights virtually unchanged (no convergence to a good solution).  
**Exploding** gradients get bigger and bigger, making the lower layers' weights extremely large (divergence).
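
To make both effects concrete, here is a minimal toy sketch. It assumes a chain of one-neuron layers that all share the same scalar weight, which is a deliberate simplification and not a real network. The helper name `chain_gradient` and its parameters are hypothetical. The chain rule multiplies one local slope per layer, so the gradient reaching the bottom of the chain shrinks or grows geometrically with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(weight, depth, use_sigmoid=True, x=0.5):
    """Gradient of the chain's output w.r.t. its input, for `depth`
    one-neuron layers sharing the same scalar weight (toy assumption)."""
    a = x
    grad = 1.0
    for _ in range(depth):
        z = weight * a
        if use_sigmoid:
            a = sigmoid(z)
            local = a * (1.0 - a)  # sigmoid derivative, never above 0.25
        else:
            a = z
            local = 1.0            # derivative of the identity activation
        grad *= weight * local     # chain rule: multiply the per-layer slopes
    return grad

# Sigmoid layers with weight 1: each layer scales the gradient by at most 0.25.
vanishing = chain_gradient(weight=1.0, depth=10)
# Linear layers with weight 2: each layer doubles the gradient.
exploding = chain_gradient(weight=2.0, depth=10, use_sigmoid=False)

print(vanishing)  # a tiny positive number, below 0.25 ** 10
print(exploding)  # 1024.0
```

With only 10 layers the first gradient is already below one millionth, while the second has grown by three orders of magnitude.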

This behavior was not clearly understood until Glorot and Bengio suggested in a 2010 [paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) that it may be due to the combination of the logistic sigmoid activation function and the weight initialization technique used at the time (a normal distribution with mean 0 and standard deviation 1).

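In short, they showed that with this activation function and this initialization scheme, **the variance of each layer's outputs is greater than the variance of its inputs**. The variance therefore keeps growing layer after layer until the sigmoid saturates, where its gradient is close to 0.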
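
The variance growth can be checked numerically. The sketch below is not the analysis from the paper. It only estimates the variance of a single neuron's weighted sum, assuming independent zero-mean inputs with unit variance. The function name `preactivation_variance` and the sizes used are assumptions chosen for illustration.

```python
import random

def preactivation_variance(fan_in, weight_std, n_samples=2000, seed=0):
    """Estimate Var(sum_i w_i * x_i) for unit-variance inputs x_i and
    zero-mean weights w_i drawn with standard deviation `weight_std`."""
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(n_samples):
        z = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, weight_std)
                for _ in range(fan_in))
        total_sq += z * z          # the mean of z is 0, so E[z^2] is the variance
    return total_sq / n_samples

fan_in = 100
# Weights ~ N(0, 1): the weighted sum has variance close to fan_in.
var_unit_weights = preactivation_variance(fan_in, weight_std=1.0)
# Weights with variance 1 / fan_in: the variance stays close to 1.
var_scaled_weights = preactivation_variance(fan_in, weight_std=(1.0 / fan_in) ** 0.5)

print(var_unit_weights, var_scaled_weights)
```

With 100 inputs, a weighted sum whose variance is around 100 lands deep in the flat regions of the sigmoid for most instances, which is where saturation comes from.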

#### Glorot and He Initialization

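The ideal solution would therefore be for each layer's outputs to have the same variance as its inputs in the forward pass ($var_{input} = var_{output}$), and for the gradients to have the same variance before and after flowing through the layer in the backward pass ($var_{forward} = var_{backwards}$).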

It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons ($fan_{in} = fan_{out}$), but there is a workaround: initialize the connection weights randomly, with $fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$, from either:
* A normal distribution with mean $0$ and variance $\sigma^2 = \frac{1}{fan_{avg}}$
* A uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$
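
The two sampling options above can be sketched in plain Python. This is a minimal illustration, not a library implementation. It assumes the usual definition $fan_{avg} = (fan_{in} + fan_{out}) / 2$, and the function names and layer sizes are hypothetical. A uniform distribution on $[-r, r]$ has variance $r^2 / 3$, so both options target the same variance $\frac{1}{fan_{avg}}$.

```python
import math
import random

def glorot_normal(fan_in, fan_out, rng):
    """Weights ~ N(0, 1 / fan_avg), as a fan_in x fan_out nested list."""
    fan_avg = (fan_in + fan_out) / 2
    std = math.sqrt(1.0 / fan_avg)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def glorot_uniform(fan_in, fan_out, rng):
    """Weights ~ U(-r, r) with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = math.sqrt(3.0 / fan_avg)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

def flatten(weights):
    return [v for row in weights for v in row]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(42)
fan_in, fan_out = 300, 100  # illustrative sizes: fan_avg = 200, target variance 0.005
w_normal = glorot_normal(fan_in, fan_out, rng)
w_uniform = glorot_uniform(fan_in, fan_out, rng)

var_normal = variance(flatten(w_normal))
var_uniform = variance(flatten(w_uniform))
print(var_normal, var_uniform)  # both should be close to 1 / 200
```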

Other initializations exist, differing in the variance used:

**Initialization** | **Activation function** | **$\sigma^2$ (Normal)**
---|---|---
Glorot | None, tanh, logistic, softmax | $\frac{1}{fan_{avg}}$ 
He | ReLU and variants | $\frac{2}{fan_{in}}$
LeCun | SELU | $\frac{1}{fan_{in}}$
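
The table can be read as a small lookup from scheme to variance. The sketch below encodes it that way. The helper `init_std` and the layer sizes are hypothetical and only meant to show how the three variances compare for the same layer.

```python
import math

def init_std(scheme, fan_in, fan_out):
    """Standard deviation of the normal distribution used by each scheme."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,  # none, tanh, logistic, softmax
        "he": 2.0 / fan_in,       # ReLU and variants
        "lecun": 1.0 / fan_in,    # SELU
    }
    return math.sqrt(variances[scheme])

for scheme in ("glorot", "he", "lecun"):
    print(scheme, round(init_std(scheme, fan_in=300, fan_out=100), 4))
# glorot 0.0707
# he 0.0816
# lecun 0.0577
```

In Keras, a scheme is typically selected by name through a layer's `kernel_initializer` argument, for example `kernel_initializer="he_normal"`.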

#### Nonsaturating activation functions

Although the ReLU function solves some of the issues of the sigmoid function, it is far from perfect. A common issue is **dying ReLUs**: a neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. It then keeps outputting 0, and since the gradient of ReLU is 0 for negative inputs, Gradient Descent can no longer bring it back.

To solve the problem, we could employ a **leaky ReLU**, $LeakyReLU_\alpha(z) = \max(\alpha z, z)$, which doesn't allow neurons to die since it keeps a small slope $\alpha$ when $z<0$.  
The slope is generally set to 0.01, but it could also be:
* Randomized: $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing (this also seems to act as a regularizer)
* Parametric: $\alpha$ is learned during training instead of being a hyperparameter. Performs strongly on complex datasets but is prone to overfitting on small ones.
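
As a minimal sketch of the difference, the snippet below compares the gradients of ReLU and leaky ReLU on a neuron whose weighted sum is negative for every instance. It assumes the common definition $\max(\alpha z, z)$ with $0 < \alpha < 1$, and the example values in `weighted_sums` are made up for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    # For 0 < alpha < 1, max(alpha * z, z) equals z when z > 0 and alpha * z otherwise.
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

# Hypothetical weighted sums of one neuron over the whole training set: all negative.
weighted_sums = [-3.0, -0.5, -1.2]

# ReLU: zero gradient everywhere, so Gradient Descent cannot update the neuron.
dead_with_relu = all(relu_grad(z) == 0.0 for z in weighted_sums)
# Leaky ReLU: a small but nonzero gradient, so the neuron can still recover.
alive_with_leaky = all(leaky_relu_grad(z) > 0.0 for z in weighted_sums)

print(dead_with_relu, alive_with_leaky)  # True True
```

The randomized and parametric variants keep the same functional form and differ only in how `alpha` is chosen: sampled during training, or learned alongside the weights.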