# Training Deep Neural Networks

When you train deep neural networks, you are likely to run into some of the following problems:

- Vanishing and exploding gradients
- Not having enough training data
- Training may be extremely slow
- Overfitting

We will go through each of these problems and present techniques to solve them.

## The Vanishing/Exploding Gradients Problems

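Gradients often get smaller and smaller as backpropagation progresses down to the lower layers, so the weights of those layers are left virtually unchanged (vanishing gradients). The opposite can also happen: the gradients grow bigger and bigger and training diverges (exploding gradients). More generally, deep neural networks suffer from unstable gradients: different layers may learn at widely different speeds.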

### Glorot and He Initialization

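This initialization aims to keep the signal flowing properly in both directions: the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction.

That cannot be guaranteed unless the layer has as many inputs as neurons, but a good compromise is to initialize the connection weights of each layer randomly, with a variance that depends on two numbers:

- Number of inputs of the layer = _fan-in_
- Number of neurons of the layer = _fan-out_

Glorot initialization uses a variance of 1 / _fan-avg_, where _fan-avg_ = (_fan-in_ + _fan-out_) / 2.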

Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the success of deep learning.

Similar strategies have been shown to work better with other activation functions, for example He initialization for ReLU and its variants, and LeCun initialization for SELU:

![Initialization strategies for each activation function](images/initializations.png)

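By default, Keras uses Glorot initialization with a uniform distribution. To use He initialization instead, set `kernel_initializer="he_normal"` or `kernel_initializer="he_uniform"` when creating a layer.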
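
As a rough illustration of what these initializers compute, the sketch below derives the Glorot uniform bound and the He normal standard deviation from _fan-in_ and _fan-out_. The function names and the layer sizes (300 inputs, 100 neurons) are assumptions made for this example, and the code is plain Python rather than the Keras implementation.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    """Bound r of the uniform distribution U(-r, r), with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(3.0 / fan_avg)

def he_normal_stddev(fan_in):
    """Standard deviation used by He initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def init_glorot_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-r, r)."""
    r = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

# Hypothetical layer: 300 inputs feeding 100 neurons
rng = random.Random(42)
limit = glorot_uniform_limit(300, 100)
weights = init_glorot_uniform(fan_in=300, fan_out=100, rng=rng)
```

For this layer _fan-avg_ is 200, so the weights are drawn from roughly U(-0.122, 0.122). A uniform distribution on (-r, r) has variance r² / 3, which is why the bound is sqrt(3 / _fan-avg_) when the target variance is 1 / _fan-avg_.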

### Nonsaturating Activation Functions

The ReLU activation function is widely used because it does not saturate for positive values and because it is fast to compute.

Unfortunately, this function suffers from a problem known as _dying ReLUs_: during training, some neurons effectively 'die', meaning they stop outputting anything other than 0.

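One alternative is the _leaky ReLU_, defined as max(αz, z), where the hyperparameter α is the slope of the function for z < 0. Leaky ReLU and its variants have been reported to outperform the strict ReLU:

- Randomized leaky ReLU (RReLU): α is picked randomly in a given range during training and fixed to an average value during testing. It also acts as a regularizer, reducing overfitting.
- Parametric leaky ReLU (PReLU): α is learned during training. It is treated as a parameter that backpropagation can modify rather than as a hyperparameter.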

Last but not least, the ELU (_exponential linear unit_) activation function has also been shown to outperform ReLU. One variant of it is the Scaled ELU (SELU).

> "So, which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may use the default α values used by Keras (e.g., 0.3 for leaky ReLU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set. That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific optimizations; therefore, if speed is your priority, ReLU might still be the best choice."
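
To make the differences between these activation functions concrete, here is a minimal plain-Python sketch of ReLU, leaky ReLU, ELU, and SELU for a single input value. The default `alpha` values are illustrative assumptions, not the Keras defaults; the two SELU constants are the standard values from the SELU paper.

```python
import math

def relu(z):
    # Outputs 0 for every negative input, which is what lets neurons "die"
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the slope for z < 0; 0.01 is an illustrative value
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth exponential curve for z < 0 that approaches -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

# Standard SELU constants
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # SELU is a scaled ELU with fixed alpha and scale
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z))
```

Only ReLU returns exactly 0 for negative inputs; the other three keep a nonzero output there.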

### Batch Normalization

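This technique consists of adding an operation to the model just before or after the activation function of each hidden layer. The operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting.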
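
The operation can be sketched in plain Python as follows. This is only an illustration of the training-time computation on one mini-batch, under the assumption that the scale (`gamma`) and offset (`beta`) vectors are given; in a real layer they are learned, and the function name and the small smoothing term `eps` are choices made for this example.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift it.

    batch: list of rows (one row per instance, one column per feature).
    gamma, beta: per-feature scale and offset (learned in a real layer).
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        std = math.sqrt(var + eps)  # eps avoids division by zero
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) / std + beta[j]
    return out

# Two features on very different scales end up on the same scale
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, both columns of `normalized` have a mean of about 0 and the same spread, even though the raw features differ by a factor of 200.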

BN also acts like a regularizer, reducing the need for other regularization techniques.

> You may find that training is rather slow, because each epoch takes much more time when you use Batch Normalization. This is usually counterbalanced by the fact that convergence is much faster with BN, so it will take fewer epochs to reach the same performance.

![Batch normalization in Keras](images/batchnormalization.png)

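The `BatchNormalization` class has quite a few hyperparameters you can tweak, such as `momentum`, which the layer uses when it updates its exponential moving averages of the batch means and variances. A good momentum value is typically close to 1 (for example 0.9, 0.99, or 0.999), with more 9s for larger datasets and smaller mini-batches.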
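
To see why momentum values close to 1 are used, here is a small sketch of an exponential moving average update of the kind the layer applies to its running statistics. The function name and the constant batch mean of 5.0 are assumptions made for the illustration.

```python
def update_moving_average(running, batch_value, momentum=0.99):
    # running <- running * momentum + batch_value * (1 - momentum)
    return running * momentum + batch_value * (1.0 - momentum)

# Hypothetical scenario: every mini-batch reports a mean of 5.0
running_mean = 0.0
for batch_mean in [5.0] * 500:
    running_mean = update_moving_average(running_mean, batch_mean, momentum=0.99)
```

A higher momentum makes the running estimate smoother but slower to catch up: with `momentum=0.99` the estimate needs a few hundred batches to get close to 5.0.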

## Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then reuse its lower layers. This technique is called _transfer learning_.

The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, try keeping all the hidden layers and just replacing the output layer.

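Try freezing all the reused layers first (that is, make their weights non-trainable so that Gradient Descent won't modify them), then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see whether performance improves. If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.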

You must always compile your model after you freeze or unfreeze layers.

If you do not have enough labeled data to train your model, other options include unsupervised pretraining and self-supervised learning (pretraining on an auxiliary task).

## Faster Optimizers

Besides the techniques mentioned above, training can be sped up considerably by using a faster optimizer than regular Gradient Descent: momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam and Nadam optimization.

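In momentum optimization, the gradient is used for acceleration, not for speed: at each step the optimizer scales a momentum vector m by the momentum hyperparameter β, subtracts the local gradient times the learning rate η from it, and then updates the weights by adding m.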

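Nesterov Accelerated Gradient is a variant of momentum optimization that measures the gradient slightly ahead in the direction of the momentum, at θ + βm rather than at θ.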
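
The two update rules can be compared on a toy problem. The sketch below minimizes the one-dimensional cost J(θ) = θ², which is an assumption chosen purely for illustration, as are the function names, the learning rate of 0.1, and the momentum of 0.9.

```python
def grad(theta):
    # Gradient of the toy cost J(theta) = theta**2
    return 2.0 * theta

def momentum_step(theta, m, lr=0.1, beta=0.9):
    # The gradient changes the momentum vector, which then moves the weight
    m = beta * m - lr * grad(theta)
    return theta + m, m

def nesterov_step(theta, m, lr=0.1, beta=0.9):
    # Same update, but the gradient is measured at theta + beta * m
    m = beta * m - lr * grad(theta + beta * m)
    return theta + m, m

theta_mom, m_mom = 5.0, 0.0
theta_nag, m_nag = 5.0, 0.0
for _ in range(100):
    theta_mom, m_mom = momentum_step(theta_mom, m_mom)
    theta_nag, m_nag = nesterov_step(theta_nag, m_nag)
```

On this problem both variants approach the minimum at 0, and the Nesterov variant oscillates less and ends up closer to it after the same number of steps.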

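AdaGrad uses an adaptive learning rate: it scales down the gradient vector along the steepest dimensions, so the learning rate decays faster for steep dimensions than for dimensions with gentler slopes. It requires much less tuning of the learning rate hyperparameter.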

AdaGrad runs the risk of slowing down a bit too fast and never converging. RMSProp fixes this by accumulating only the gradients from the most recent iterations, using exponential decay. Except on very simple problems, RMSProp generally performs better than AdaGrad.
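
The difference between the two comes down to how the squared gradients are accumulated. The sketch below feeds the same constant gradient to both update rules; the function names, the learning rate, the decay rate of 0.9, and the constant gradient of 1.0 are assumptions made for this illustration.

```python
import math

def adagrad_step(theta, s, g, lr=0.1, eps=1e-10):
    # s accumulates ALL past squared gradients, so it only ever grows
    s = s + g * g
    return theta - lr * g / math.sqrt(s + eps), s

def rmsprop_step(theta, s, g, lr=0.1, beta=0.9, eps=1e-10):
    # s is an exponentially decaying average of recent squared gradients
    s = beta * s + (1.0 - beta) * g * g
    return theta - lr * g / math.sqrt(s + eps), s

theta_ada, s_ada = 0.0, 0.0
theta_rms, s_rms = 0.0, 0.0
for _ in range(100):
    theta_ada, s_ada = adagrad_step(theta_ada, s_ada, g=1.0)
    theta_rms, s_rms = rmsprop_step(theta_rms, s_rms, g=1.0)

# Effective step size for the next unit gradient
ada_step_size = 0.1 / math.sqrt(s_ada)
rms_step_size = 0.1 / math.sqrt(s_rms)
```

After 100 identical gradients, AdaGrad's effective step has shrunk to about a tenth of the learning rate and keeps shrinking, while RMSProp's stays close to the learning rate.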

Adam is a common default choice nowadays. Adam stands for _adaptive moment estimation_ and combines the ideas of momentum optimization and RMSProp. Nadam is Adam plus the Nesterov trick, so it will often converge slightly faster than Adam.
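
As a rough sketch of how Adam combines the two ideas, the function below keeps a decaying average of the gradients (as in momentum optimization) and a decaying average of the squared gradients (as in RMSProp). The toy cost J(θ) = θ², the function name, and the learning rate of 0.05 in the loop are assumptions made for this example; the other defaults are commonly used values.

```python
import math

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1.0 - beta1) * g      # momentum-like average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # RMSProp-like average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias correction: m and v start at 0
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy cost J(theta) = theta**2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    g = 2.0 * theta
    theta, m, v = adam_step(theta, m, v, g, t, lr=0.05)
```

Note that the iteration counter `t` starts at 1, because the bias correction divides by `1 - beta ** t`.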

Optimizer comparison

![optimizers comparison](images/optimizers.png)

## Avoiding Overfitting Through Regularization

Early stopping is one of the best regularization techniques, and batch normalization also acts as a regularizer. Other popular techniques are:

- L1 and L2 regularization
- Dropout: It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step.
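- Monte Carlo (MC) Dropout: dropout is kept active at inference time, many predictions are made for the same input, and they are averaged. This can boost the performance of an already trained dropout model without retraining it, and it gives a better measure of the model's uncertainty.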
- Max-Norm Regularization
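
The dropout step described in the list above can be sketched in a few lines of plain Python. This version is the "inverted" variant, which divides the surviving activations by the keep probability (1 - p) during training so that nothing needs rescaling at inference time; the function name and the example values are assumptions made for the illustration.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training and scale
    the survivors by 1 / (1 - p); return the input unchanged at inference."""
    if not training or p == 0.0:
        return list(activations)
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10_000
dropped = dropout(activations, p=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With p = 0.5, about half of the activations are zeroed at each call and the rest are doubled, so the expected total input to the next layer stays the same. A different random mask is drawn at every training step.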

## Summary and Practical Guidelines