# Training Deep Neural Nets

The main challenges when training deep neural nets:
- vanishing and exploding gradients
- slow training
- overfitting

### Vanishing/Exploding Gradients Problems

Gradients often get smaller and smaller during the backpropagation process. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution.

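In some cases, the opposite can happen (mostly in recurrent neural networks): the gradients get bigger and bigger, so many layers get very large weight updates and training diverges. This is the exploding gradients problem.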

In general, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

##### Solution

When using the logistic sigmoid activation function and random initialization from a standard normal distribution, the variance of the outputs of each layer is much larger than the variance of its inputs.
As the inputs get larger in magnitude, the sigmoid function saturates at 0 or 1, with a derivative close to 0. Backpropagation then has almost no gradient to pass down, which causes the vanishing gradients problem.
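
The saturation effect is easy to check numerically. The following is a minimal sketch using only the standard library; the function names are illustrative and not taken from any framework.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid, s(z) * (1 - s(z)); it peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative shrinks quickly as z grows, i.e. as the unit saturates.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_derivative(z):.6f}")
```

Since the derivative never exceeds 0.25, every sigmoid layer tends to shrink the backpropagated signal unless the weights compensate, and saturated units shrink it far more.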

We need the signal to flow properly in both directions: forward when making predictions and backward when backpropagating gradients.

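We need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.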

**Xavier and He Initialization** are designed to approximately achieve this.

Using these initialization strategies can speed up the training considerably, and it is one of the tricks that led to the current success of Deep Learning.
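
As a rough illustration, the sketch below compares the output variance of a single linear layer (no activation function) under standard normal weights and under Xavier weights. It assumes the common normal-distribution variants of the formulas, a standard deviation of sqrt(2 / (fan_in + fan_out)) for Xavier and sqrt(2 / fan_in) for He; the helper names are hypothetical.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He normal initialization (usually paired with ReLU-like activations)."""
    return math.sqrt(2.0 / fan_in)

def linear_output_variance(weight_std, fan_in, fan_out, rng):
    """Variance of the outputs of one linear layer fed with unit-variance inputs."""
    inputs = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    weights = [[rng.gauss(0.0, weight_std) for _ in range(fan_out)] for _ in range(fan_in)]
    outputs = [sum(inputs[i] * weights[i][j] for i in range(fan_in)) for j in range(fan_out)]
    mean = sum(outputs) / fan_out
    return sum((o - mean) ** 2 for o in outputs) / fan_out

rng = random.Random(0)
n = 200  # same number of inputs and outputs, chosen arbitrarily for the demo
print("standard normal weights:", linear_output_variance(1.0, n, n, rng))
print("Xavier weights:         ", linear_output_variance(xavier_std(n, n), n, n, rng))
```

With standard normal weights the output variance is on the order of the number of inputs, while the Xavier-scaled weights keep it close to 1.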

**And also, using Nonsaturating Activation Functions can help**

In practice, ReLU behaves much better than the sigmoid function in deep networks, mostly because it does not saturate for positive values, and also because it is quite fast to compute.

However, it suffers from a problem called "*dying ReLUs*": during training, some neurons effectively die, meaning they stop outputting anything other than 0.

To solve this, you can use "***Leaky ReLU***" or the "***exponential linear unit (ELU)***" instead.
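
For reference, here is a small sketch of the three functions for scalar inputs. The default `alpha` values are the ones suggested in these notes; the function names are illustrative.

```python
import math

def relu(z):
    """Outputs 0 for all negative inputs, which is what lets a neuron 'die'."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Keeps a small slope alpha for negative inputs so the gradient is never exactly 0."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smoothly approaches -alpha for very negative inputs."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:5.1f}  relu={relu(z):.3f}  leaky_relu={leaky_relu(z):.3f}  elu={elu(z):.3f}")
```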

#### Which activation function to use?

- generally ELU > leaky ReLU > ReLU > tanh > logistic
- if you care a lot about runtime performance, prefer leaky ReLU over ELU
- if you don't want to tweak another hyperparameter, use the defaults alpha=0.01 for leaky ReLU and alpha=1 for ELU
- however, if you have enough time, use cross-validation to evaluate other activation functions and hyperparameter values

**Besides, Batch Normalization**

Batch Normalization consists of adding an operation in the model just before the activation function of each layer: it zero-centers and normalizes the inputs, then scales and shifts the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.
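
The operation itself is small. Below is a minimal sketch for a single feature over one mini-batch at training time; `scale`, `shift`, and `eps` are illustrative names, and the bookkeeping needed at prediction time (statistics collected during training) is left out.

```python
import math

def batch_norm(batch, scale=1.0, shift=0.0, eps=1e-5):
    """Zero-center and normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps avoids a division by zero when the batch has (almost) no variance
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    return [scale * x_hat + shift for x_hat in normalized]

batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))                        # mean close to 0, variance close to 1
print(batch_norm(batch, scale=2.0, shift=3.0))  # learned parameters would move it elsewhere
```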

This technique makes the model very robust against the vanishing gradients problem and makes it possible to use much larger learning rates to speed up training.

However, it makes predictions slower because of the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well ELU + He initialization performs before rushing into Batch Normalization.

### Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch. Instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse the lower layers of this network. This is called transfer learning.

It will not only speed up training considerably, but will also require much less training data.

Note: Transfer learning will only work well if the inputs have similar low-level features (this can sometimes be achieved by preprocessing).

### Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we can speed up the training process by:
- good initialization
- a good activation function
- batch normalization
- reusing pretrained models

Here comes another: **Faster Optimizers**

**Spoiler alert: The conclusion of this section is that you should almost always use Adam optimization.**

#### Momentum Optimization

The gradient is used as acceleration, not as speed.

Faster than plain Gradient Descent, but it might overshoot a bit.
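
A minimal sketch of the idea on the one-dimensional function f(theta) = theta^2, assuming the usual update m = beta * m - lr * gradient followed by theta = theta + m. The names `lr` and `beta` and their values are illustrative choices.

```python
def momentum_descent(grad, theta, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum; returns every position visited."""
    m = 0.0              # the momentum term plays the role of a velocity
    path = [theta]
    for _ in range(steps):
        m = beta * m - lr * grad(theta)  # the gradient accelerates the velocity
        theta = theta + m                # the velocity moves the parameter
        path.append(theta)
    return path

# Gradient of f(theta) = theta ** 2, whose minimum is at 0.
path = momentum_descent(lambda t: 2.0 * t, theta=5.0)
print("lowest position reached:", min(path))
print("final position:", path[-1])
```

Starting from 5, the path goes past the minimum at 0 into negative values before settling, which is the overshoot mentioned above.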

#### Nesterov Accelerated Gradient

Measure the gradient not at the local position, but a little ahead in the direction of the momentum.

Even faster, and it ends up closer to the optimum, but it can still overshoot.
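
The only change from plain momentum is where the gradient is evaluated. The sketch below runs both variants on f(theta) = theta^2 so they can be compared; the `nesterov` flag and the other parameter names are illustrative.

```python
def descend(grad, theta, lr=0.1, beta=0.9, steps=100, nesterov=False):
    """Momentum descent; with nesterov=True the gradient is measured at theta + beta * m."""
    m = 0.0
    path = [theta]
    for _ in range(steps):
        lookahead = theta + beta * m if nesterov else theta
        m = beta * m - lr * grad(lookahead)
        theta = theta + m
        path.append(theta)
    return path

grad = lambda t: 2.0 * t  # gradient of f(theta) = theta ** 2
plain = descend(grad, 5.0)
nag = descend(grad, 5.0, nesterov=True)
print("largest overshoot, plain momentum:", min(plain))
print("largest overshoot, Nesterov:      ", min(nag))
```

On this toy problem both variants go past the minimum, but the Nesterov variant overshoots less and ends up closer to 0 after the same number of steps.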

#### AdaGrad

Uses an adaptive learning rate that decays faster for steep dimensions. It performs well for simple quadratic problems, but unfortunately often stops too early when training neural nets.
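
A sketch of the update on an elongated bowl, assuming the usual formulation: accumulate the squared gradients per parameter and divide each step by the square root of that sum. The names and hyperparameter values are illustrative.

```python
import math

def adagrad(grad, theta, lr=0.5, eps=1e-10, steps=100):
    """AdaGrad on a list of parameters; also records the effective learning rate of parameter 0."""
    s = [0.0] * len(theta)  # running sum of squared gradients, one entry per parameter
    effective_lrs = []
    for _ in range(steps):
        g = grad(theta)
        s = [si + gi * gi for si, gi in zip(s, g)]
        scale = [lr / math.sqrt(si + eps) for si in s]
        theta = [ti - sc * gi for ti, sc, gi in zip(theta, scale, g)]
        effective_lrs.append(scale[0])
    return theta, effective_lrs

# Gradient of f(x, y) = x**2 + 10 * y**2, which is ten times steeper along y.
grad = lambda th: [2.0 * th[0], 20.0 * th[1]]
theta, effective_lrs = adagrad(grad, [5.0, 5.0])
print("final parameters:", theta)
print("effective learning rate, first and last step:", effective_lrs[0], effective_lrs[-1])
```

Because the sum of squared gradients can only grow, the effective learning rate can only shrink. In this run both parameters are still noticeably away from 0 after 100 steps, which hints at why the method can stall on longer training runs.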

#### RMSProp

Fixes AdaGrad's tendency to stop too early by accumulating only the gradients from the most recent iterations. Already a good-enough optimizer in most cases.
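
The difference from AdaGrad is a single line: the sum of squared gradients becomes an exponentially decaying average. A minimal one-dimensional sketch, with an assumed decay rate `beta` of 0.9 and illustrative names:

```python
import math

def rmsprop(grad, theta, lr=0.01, beta=0.9, eps=1e-10, steps=2000):
    """RMSProp on a single parameter; returns the final parameter and accumulator."""
    s = 0.0
    for _ in range(steps):
        g = grad(theta)
        s = beta * s + (1.0 - beta) * g * g  # old gradients fade away instead of piling up
        theta -= lr * g / math.sqrt(s + eps)
    return theta, s

theta, s = rmsprop(lambda t: 2.0 * t, theta=5.0)  # gradient of f(theta) = theta ** 2
print("final parameter:", theta)
print("final accumulator:", s)
```

Here the accumulator shrinks again once the gradients get small, so the steps do not fade to nothing the way they do with AdaGrad.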

#### Adam Optimization

Adam (*adaptive moment estimation*) combines the ideas of momentum and RMSProp.
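
A one-dimensional sketch of how the two ideas fit together: a decaying average of the gradients (the momentum part) divided by the square root of a decaying average of the squared gradients (the RMSProp part), both corrected for their start at zero. The values 0.9, 0.999, and 1e-8 are the commonly quoted defaults; the learning rate and names are illustrative.

```python
import math

def adam(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam on a single parameter; returns the final parameter value."""
    m = 0.0  # decaying average of gradients (momentum part)
    s = 0.0  # decaying average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        s = beta2 * s + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction: m and s start at 0
        s_hat = s / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta

theta = adam(lambda t: 2.0 * t, theta=5.0)  # gradient of f(theta) = theta ** 2
print("final parameter:", theta)
```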

### Learning Rate Scheduling

### Avoid Overfitting Through Regularization

#### Early Stopping

Early stopping works well in practice, but you can often get better performance by combining it with other regularization techniques.
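
One common way to implement it is sketched below, under the assumption that the validation loss is measured after every epoch and that training stops once it has not improved for a few epochs in a row. The `patience` parameter and the function name are hypothetical, not something these notes prescribe.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), scanning until `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop here and roll back to the model saved at best_epoch
    return best_epoch, best_loss

# Made-up validation losses, one per epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
print(early_stopping_epoch(losses, patience=3))
```

In a real training loop the model weights would be saved every time the validation loss improves, so the best version can be restored after stopping.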

#### l1 and l2 Regularization

#### Dropout

***The most popular regularization technique for deep neural networks is arguably dropout.***

At every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p (typically 50%) of being temporarily "dropped out", meaning it will be entirely ignored during this training step, but it may be active during the next training step. After training, neurons don't get dropped out anymore.

Note: at test time, each neuron receives input from more neurons than it did on average during training (twice as many with p = 50%), so the weights must be adjusted to compensate.

In general, we need to multiply each input connection weight by the *keep probability* (1-p) after training.
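
The sketch below illustrates why this scaling works for a single neuron: averaged over many random dropout masks, the weighted sum seen during training matches the weighted sum computed at test time with the scaled weights. The inputs, weights, and function names are made up for the demonstration.

```python
import random

def dropout(activations, p, rng):
    """Training-time dropout: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def scale_for_inference(weights, p):
    """After training, multiply each input connection weight by the keep probability (1 - p)."""
    keep_prob = 1.0 - p
    return [w * keep_prob for w in weights]

def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

rng = random.Random(42)
p = 0.5
inputs = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, -1.0, 2.0, 1.0]

trials = 20000
train_avg = sum(weighted_sum(dropout(inputs, p, rng), weights) for _ in range(trials)) / trials
test_value = weighted_sum(inputs, scale_for_inference(weights, p))
print("average weighted sum during training:", train_avg)
print("weighted sum at test time:           ", test_value)
```

An equivalent alternative is to divide the surviving activations by the keep probability during training, which leaves the weights untouched at test time.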

**Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.**

#### Max-Norm Regularization

For each neuron, it constrains the weights of the incoming connections so that their l2 norm is at most gamma, the max-norm hyperparameter.

Another regularization technique that is quite popular for neural networks.
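
A typical way to enforce the constraint is to check the norm after each training step and rescale the weight vector when it is too large. A minimal sketch for one neuron, with a hypothetical function name:

```python
import math

def max_norm_clip(weights, gamma):
    """Rescale a neuron's incoming weights so that their l2 norm does not exceed gamma."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > gamma:
        factor = gamma / norm
        return [w * factor for w in weights]
    return list(weights)  # already within the constraint, left unchanged

print(max_norm_clip([3.0, 4.0], gamma=1.0))  # norm 5 is rescaled down to norm 1
print(max_norm_clip([0.3, 0.4], gamma=1.0))  # norm 0.5 is left as is
```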

#### Data Augmentation

Generating new training instances from existing ones, boosting the size of the training set.

The generated instances should be realistic; ideally a human should not be able to tell which instances were generated and which were not.

Simply adding white noise will not help; the modifications should be learnable.
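
For images, typical learnable modifications are small shifts and flips. The sketch below applies two of them to tiny images stored as nested lists; the helper names are hypothetical, and whether a flip keeps the label valid depends on the task (it would not for handwritten digits, for example).

```python
def horizontal_flip(image):
    """Mirror each row of the image."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift the image to the right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def augment(dataset):
    """Return the original (image, label) pairs plus a flipped and a shifted copy of each."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
        augmented.append((shift_right(image, 1), label))
    return augmented

dataset = [([[1, 2, 3], [4, 5, 6]], "cat")]
for image, label in augment(dataset):
    print(label, image)
```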

### Practical Guidelines

This setting should work well in most cases:

- Initialization: He initialization
- Activation Function: ELU
- Normalization: Batch Normalization
- Regularization: Dropout
- Optimizer: Adam
- Learning rate schedule: None

Some other advice:
- if you can't find a good learning rate, adding an exponential decay learning rate schedule might help
- if your training set is a bit too small, try data augmentation
- if you need a sparse model, use l1 regularization
- if you want a lightning-fast model, drop Batch Normalization and maybe replace ELU with leaky ReLU; having a sparse model will also help
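
The exponential decay schedule mentioned in the first tip can be sketched as follows, assuming the learning rate is multiplied by `decay_rate` every `decay_steps` training steps. Both names and the example values are illustrative.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate=0.1):
    """Learning rate after `step` training steps; it shrinks by `decay_rate` every `decay_steps` steps."""
    return initial_lr * decay_rate ** (step / decay_steps)

for step in (0, 500, 1000, 2000):
    print(step, exponential_decay(0.1, step, decay_steps=1000))
```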