<a href="https://colab.research.google.com/github/Richish/hands_on_ml/blob/master/ch11_training_deep_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Challenges
These are the problems you may run into when training a much deeper DNN, perhaps with 10 layers or many more, each containing hundreds of neurons, connected by hundreds of thousands of connections:

1. The vanishing gradients problem (or the related exploding gradients problem), which affects deep neural networks and makes lower layers very hard to train.
2. You might not have enough training data for such a large network, or it might be too costly to label. Transfer learning helps here.
3. Training may be extremely slow. Faster optimizers help here.
4. A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances, or they are too noisy. Regularization techniques help here.

# Vanishing/Exploding Gradients Problems

During backpropagation: gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good
solution. This is called the vanishing gradients problem. In some cases, the opposite
can happen: the gradients can grow bigger and bigger, so many layers get insanely
large weight updates and the algorithm diverges. This is the exploding gradients problem,
which is mostly encountered in recurrent neural networks. More generally,
deep neural networks suffer from unstable gradients; different layers may learn at
widely different speeds.

This behavior was one of the reasons why deep neural networks were mostly abandoned for a
long time. It was only around 2010 that significant progress was made in understanding
it.

A paper titled “Understanding the Difficulty of Training Deep Feedforward
Neural Networks” by Xavier Glorot and Yoshua Bengio found a few suspects, including
the combination of the popular logistic sigmoid activation function and the
weight initialization technique that was most popular at the time, namely random initialization
using a normal distribution with a mean of 0 and a standard deviation of 1.
In short, they showed that with this activation function and this initialization scheme,
the variance of the outputs of each layer is much greater than the variance of its
inputs. Going forward in the network, the variance keeps increasing after each layer
until the activation function saturates at the top layers. This is actually made worse by
the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent
function has a mean of 0 and behaves slightly better than the logistic function in deep
networks).

With the logistic activation function, when inputs become large (negative or positive),
the function saturates at 0 or 1, with a derivative extremely close to 0. Thus when
backpropagation kicks in, it has virtually no gradient to propagate back through the
network, and what little gradient exists keeps getting diluted as backpropagation
progresses down through the top layers, so there is really nothing left for the lower layers.
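
The saturation effect described above can be checked numerically. This is a minimal sketch, not taken from the paper: `sigmoid` and `sigmoid_grad` are helper names introduced here, and the 10-layer product ignores the weight matrices, so it only gives a feel for how quickly the gradient can shrink.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at z = 0) and collapses as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.5f}  derivative = {sigmoid_grad(z):.2e}")

# Even in the best case, 10 stacked logistic activations scale the
# backpropagated gradient by at most 0.25 ** 10 (ignoring the weights).
best_case_factor = sigmoid_grad(0.0) ** 10
print(f"best-case factor after 10 layers: {best_case_factor:.2e}")
```

Because the derivative never exceeds 0.25, every logistic layer can only shrink the gradient that passes through its activation, and saturated units shrink it far more.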



## Initializers- Glorot, LeCun and He Initializations

### Glorot
In their paper, Glorot and Bengio propose a way to significantly alleviate this problem.
We need the signal to flow properly in both directions: in the forward direction
when making predictions, and in the reverse direction when backpropagating gradients.
We don’t want the signal to die out, nor do we want it to explode and saturate.
For the signal to flow properly, the authors argue that we need the variance of the
outputs of each layer to be equal to the variance of its inputs, and we also need the
gradients to have equal variance before and after flowing through a layer in the
reverse direction.

It is actually not possible to guarantee both unless the layer has an equal
number of inputs and neurons (these numbers are called the fan-in and fan-out of the
layer), but they proposed a good compromise that has proven to work very well in
practice: the connection weights of each layer must be initialized randomly as described in the equations below, where $\mathrm{fan}_{\mathrm{avg}} = (\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})/2$. This initialization strategy is called Xavier initialization (after the author’s first name) or Glorot initialization (after his last name).

Normal distribution with mean 0 and variance $\sigma^2 = \dfrac{1}{\mathrm{fan}_{\mathrm{avg}}}$

Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\dfrac{3}{\mathrm{fan}_{\mathrm{avg}}}}$
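
To make the two variants concrete, here is a minimal NumPy sketch that samples weights both ways and compares the empirical variance with 1/fan_avg. The function names `glorot_normal` and `glorot_uniform` and the layer size (300 inputs, 100 neurons) are chosen here for illustration, and a plain (untruncated) normal distribution is assumed.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    """Sample weights from N(0, sigma^2) with sigma^2 = 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    return rng.normal(0.0, np.sqrt(1.0 / fan_avg), size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Sample weights from U(-r, r) with r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    r = np.sqrt(3.0 / fan_avg)
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
fan_in, fan_out = 300, 100
target_var = 1.0 / ((fan_in + fan_out) / 2)  # 1 / 200

w_normal = glorot_normal(fan_in, fan_out, rng)
w_uniform = glorot_uniform(fan_in, fan_out, rng)
print(target_var, w_normal.var(), w_uniform.var())
```

Both samples should have a variance close to the target, since a uniform distribution on [-r, r] has variance r²/3.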

### LeCun
If you just replace fan_avg with fan_in in the equations above, you get an initialization strategy
that was actually already proposed by Yann LeCun in the 1990s, called LeCun initialization. It is equivalent to Glorot initialization when fan_in = fan_out. It took over a decade for researchers to realize
just how important this trick really is. Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning.

### He
Some papers have provided similar strategies for different activation functions.
These strategies differ only by the scale of the variance and whether they use fan_avg or fan_in, as shown in the table below. For the uniform distribution, just compute $r = \sqrt{3\sigma^2}$.

The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called He initialization (after the last name of its author). 

The SELU activation function will be explained later in this section. It should be used with LeCun initialization (preferably with a normal distribution, as we will see).

### Table of initializers:
| Initialization | Activation functions | σ² (Normal) |
| :--- | :--- | :--- |
| Glorot | None, logistic, tanh, softmax | 1 / fan_avg |
| LeCun | SELU | 1 / fan_in |
| He | ReLU and its variants | 2 / fan_in |
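
The table can be read as a small lookup of (scale, fan mode) pairs. The helper below is a sketch written for these notes: `INIT_RULES`, `init_variance` and `uniform_limit` are hypothetical names, not part of Keras.

```python
import math

# Variance rules from the table above, keyed by initializer name.
# Each entry is (scale, mode), so that sigma^2 = scale / fan_<mode>.
INIT_RULES = {
    "glorot": (1.0, "avg"),
    "lecun": (1.0, "in"),
    "he": (2.0, "in"),
}

def init_variance(name, fan_in, fan_out):
    """Return sigma^2 for the normal-distribution variant of an initializer."""
    scale, mode = INIT_RULES[name]
    fan = (fan_in + fan_out) / 2 if mode == "avg" else fan_in
    return scale / fan

def uniform_limit(name, fan_in, fan_out):
    """Return r such that U(-r, r) has the same variance: r = sqrt(3 * sigma^2)."""
    return math.sqrt(3.0 * init_variance(name, fan_in, fan_out))

for name in INIT_RULES:
    print(name, init_variance(name, 300, 100), uniform_limit(name, 300, 100))
```

For a layer with 300 inputs and 100 neurons, He initialization uses exactly twice the variance of LeCun initialization, which reflects the fact that ReLU zeroes out roughly half of its inputs.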

By default, Keras uses Glorot initialization with a uniform distribution. You can
change this to He initialization by setting kernel_initializer="he_uniform" or kernel_initializer="he_normal" when creating a layer, like this:


In [None]:
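from tensorflow import keras

# ReLU layer using He (normal) initialization instead of the default Glorot uniform
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")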

<tensorflow.python.keras.layers.core.Dense at 0x7ff6d47a33c8>

In [None]:
# If you want He initialization with a uniform distribution, but based on fan_avg rather
# than fan_in, you can use the VarianceScaling initializer (basically a custom initializer):
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)

<tensorflow.python.keras.layers.core.Dense at 0x7ff6d5128358>

## Activation Functions- Nonsaturating Activation Functions

### ReLU
One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/
exploding gradients problems were in part due to a poor choice of activation function.
Until then most people had assumed that if Mother Nature had chosen to use
roughly sigmoid activation functions in biological neurons, they must be an excellent
choice. But it turns out that other activation functions behave much better in deep
neural networks, in particular the ReLU activation function, mostly because it does
not saturate for positive values (and also because it is quite fast to compute).

#### Problem of dying ReLUs:
During training with ReLU, some neurons effectively die, meaning
they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and gradient descent does not affect it anymore since the gradient
of the ReLU function is 0 when its input is negative.

### Leaky ReLU (solves the problem of dying ReLUs):
This function is defined as $\mathrm{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$. The hyperparameter α defines how much the function “leaks”: it is the
slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
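
The difference between a dead ReLU and a leaky ReLU "in a coma" shows up directly in the gradients. The NumPy sketch below is only an illustration: the function names are introduced here, and the gradient at exactly z = 0 is arbitrarily taken from the negative side.

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z)."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """LeakyReLU_alpha(z) = max(alpha * z, z), valid for 0 <= alpha < 1."""
    return np.maximum(alpha * z, z)

def relu_grad(z):
    # Gradient is 0 for negative inputs: a "dead" neuron gets no update.
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # Gradient is alpha (not 0) for negative inputs, so the neuron can recover.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
```

With ReLU, a neuron whose weighted sum is negative for every training instance receives a zero gradient everywhere, whereas the leaky version always passes back a gradient of at least α.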

A 2015 paper compared several variants of the ReLU activation function and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). 

They also evaluated the **randomized leaky ReLU (RReLU)**, where α is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set).

Finally, they also evaluated the **parametric leaky ReLU (PReLU)**, where α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller
datasets it runs the risk of overfitting the training set.

### ELU
A 2015 paper by Djork-Arné Clevert et al. proposed a new activation
function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set.

ELU activation function:

$\mathrm{ELU}_\alpha(z) = \begin{cases} \alpha(\exp(z) - 1) & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}$

It looks like ReLU for positive values of z.

It differs from ReLU in three ways:
1. It takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
2. It has a nonzero gradient for z < 0, which avoids the dead neurons problem.
3. If α is equal to 1 then the function is smooth everywhere, including
around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
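
The three properties listed above can be verified with a short NumPy sketch. The names `elu` and `elu_grad` are introduced here for illustration and are not the Keras implementation.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU_alpha(z) = alpha * (exp(z) - 1) if z < 0 else z."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing on the (unused) positive branch.
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

def elu_grad(z, alpha=1.0):
    """Derivative: alpha * exp(z) for z < 0, and 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

print(elu([-10.0, -1.0, 0.0, 2.0]))        # approaches -alpha for very negative z
print(elu_grad([-1e-6, 1e-6]))             # slopes on both sides of 0 match when alpha = 1
print(elu_grad([-1e-6, 1e-6], alpha=0.5))  # a kink appears at 0 when alpha != 1
```

The gradient for negative inputs is small but never exactly zero, which is what keeps ELU units from dying.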

Drawbacks of ELU:
The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

#### SELU (Scaled ELU)
In a 2017 paper by Günter Klambauer et al., called “Self-Normalizing
Neural Networks”, the authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function (which is just a scaled version of the ELU activation function, as its name
suggests), then the network will self-normalize: the output of each layer will tend to preserve mean 0 and standard deviation 1 during training, which solves the vanishing/exploding gradients problem. As a result, this activation function often outperforms other activation functions very significantly for such neural nets (especially
deep ones). **However, there are a few conditions for self-normalization to happen:**

1. The input features must be standardized (mean 0 and standard deviation 1).
2. Every hidden layer’s weights must also be initialized using LeCun normal initialization. In Keras, this means setting kernel_initializer="lecun_normal".
3. **The network’s architecture must be sequential.** Unfortunately, if you try to use SELU in non-sequential architectures, such as recurrent networks or networks with skip connections (i.e., connections that skip layers, such as in wide & deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.
4. The paper only guarantees self-normalization if all layers are dense. However, in practice the SELU activation function seems to work great with convolutional neural nets as well.
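
Self-normalization can be observed without training anything, by pushing standardized random inputs through a deep stack of randomly initialized dense layers. The sketch below makes several assumptions: the layer width (100), the depth (50), and the batch of 500 random instances are arbitrary, and the two SELU constants are the standard rounded values of α and the scale.

```python
import numpy as np

# Constants from the SELU definition (rounded); they are what makes
# mean 0 / standard deviation 1 a stable fixed point.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU(z) = scale * ELU_alpha(z)."""
    return SELU_SCALE * np.where(z < 0, SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

rng = np.random.default_rng(42)
n_units = 100
Z = rng.normal(size=(500, n_units))  # condition 1: standardized inputs

for layer in range(50):
    # condition 2: LeCun normal initialization, sigma^2 = 1 / fan_in
    W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)  # conditions 3 and 4: a plain sequential stack of dense layers

print(f"after 50 layers: mean = {Z.mean():.2f}, std = {Z.std():.2f}")
```

The activations should stay close to mean 0 and standard deviation 1 even after 50 layers, instead of vanishing or exploding.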

### Which activation function to use:

For the hidden layers of your deep neural networks, although your mileage will vary, in
general:

SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.

If the network’s architecture prevents it from self-normalizing,
then ELU may perform better than SELU (since SELU
is not smooth at z = 0). 

If you care a lot about runtime latency, then you may prefer leaky ReLU. 

If you don’t want to tweak yet another hyperparameter, you may just use the default α values used by Keras (e.g., 0.3 for the leaky ReLU). 
If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.


In [None]:
# To use the leaky ReLU activation function, you must create a LeakyReLU instance like this:
from tensorflow import keras
leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
layer = keras.layers.Dense(10, activation=leaky_relu, kernel_initializer="he_normal")


In [None]:
# For PReLU, just replace LeakyReLU(alpha=0.2) with PReLU(). There is currently no
# official implementation of RReLU in Keras, but you can fairly easily implement your own.
p_relu = keras.layers.PReLU()
layer = keras.layers.Dense(10, activation=p_relu, kernel_initializer="he_normal")

In [None]:
# For SELU activation, just set activation="selu" and kernel_initializer="lecun_normal" when creating a layer:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

## Batch Normalization:

Although using He initialization along with ELU (or any variant of ReLU) can significantly
reduce the vanishing/exploding gradients problems at the beginning of training,
it doesn’t guarantee that they won’t come back during training.

Batch Normalization consists of adding an operation in the model just before or after the
activation function of each hidden layer, simply zero-centering and normalizing each
input, then scaling and shifting the result using two new parameter vectors per layer:
one for scaling, the other for shifting. This operation lets the model
learn the optimal scale and mean of each of the layer’s inputs. 

In many cases, if you
add a BN layer as the very first layer of your neural network, you do not need to
standardize your training set (e.g., using a StandardScaler): the BN layer will do it
for you (well, approximately, since it only looks at one batch at a time, and it can also
rescale and shift each input feature).

In order to zero-center and normalize the inputs, the algorithm needs to estimate
each input’s mean and standard deviation. It does so by evaluating the mean and standard
deviation of each input over the current mini-batch (hence the name “Batch
Normalization”).


So during training, BN just standardizes its inputs, then rescales and offsets them.
What about at test time? There may be no mini-batch to compute statistics over (you may need a prediction for a single instance), so the layer needs “final” input means and standard deviations to use instead.

These final statistics could be computed over the whole training set once training is over, but it is often preferred to estimate them
during training using a moving average of the layer’s input means and standard
deviations. To sum up, four parameter vectors are learned in each batch-normalized
layer: γ (the output scale vector) and β (the output offset vector) are learned through
regular backpropagation, and μ (the final input mean vector) and σ (the final input
standard deviation vector) are estimated using an exponential moving average. Note
that μ and σ are estimated during training, but they are not used at all during training,
only after training.
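
The training-time and test-time behaviors can be summarized in a forward-pass-only NumPy sketch. `BatchNorm1D` is a hypothetical class written for these notes, not the Keras layer: it tracks a moving variance rather than a standard deviation, the `momentum` and `eps` values are assumptions, and γ and β are left at their initial values because no backpropagation is performed.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization over a [batch, features] array (forward pass only)."""

    def __init__(self, n_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(n_features)         # output scale, learned by backprop
        self.beta = np.zeros(n_features)         # output offset, learned by backprop
        self.moving_mean = np.zeros(n_features)  # mu, estimated with a moving average
        self.moving_var = np.ones(n_features)    # sigma^2, estimated with a moving average
        self.momentum = momentum
        self.eps = eps                           # avoids division by zero

    def __call__(self, X, training):
        if training:
            mean, var = X.mean(axis=0), X.var(axis=0)  # statistics of the current mini-batch
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            mean, var = self.moving_mean, self.moving_var  # "final" statistics
        X_hat = (X - mean) / np.sqrt(var + self.eps)
        return self.gamma * X_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(n_features=3, momentum=0.9)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    out = bn(batch, training=True)

print(out.mean(axis=0), out.std(axis=0))       # roughly 0 and 1 per feature
print(bn.moving_mean, np.sqrt(bn.moving_var))  # roughly 5 and 2
single = bn(np.array([[5.0, 5.0, 5.0]]), training=False)  # works for a single instance
print(single)
```

Note that the moving averages are updated during training but only read when `training=False`, which is why a single instance can be normalized at test time.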

Batch Normalization
also acts like a regularizer, reducing the need for other regularization techniques
(such as dropout, described later in this chapter).
Batch Normalization does, however, add some complexity to the model (although it
can remove the need for normalizing the input data, as we discussed earlier). Moreover,
there is a runtime penalty: the neural network makes slower predictions due to
the extra computations required at each layer. So if you need predictions to be
lightning-fast, you may want to check how well plain ELU + He initialization perform
before playing with Batch Normalization.



You may find that training is rather slow, because each epoch takes
much more time when you use batch normalization. However, this
is usually counterbalanced by the fact that convergence is much
faster with BN, so it will take fewer epochs to reach the same performance.
All in all, wall time will usually be smaller (this is the
time measured by the clock on your wall).

### Implementing Batch Normalization with Keras

Just add a BatchNormalization layer before or after each hidden layer’s activation
function, and optionally add a BN layer as well as the first layer in your model. For
example, this model applies BN after every hidden layer and as the first layer in the
model (after flattening the input images):

In [1]:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [2]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

In [3]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]
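
The parameter counts reported by `model.summary()` follow from simple arithmetic, sketched below. The helper names `dense_params` and `bn_params` are introduced here for illustration.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Weights plus (optionally) one bias per unit."""
    return n_inputs * n_units + (n_units if use_bias else 0)

def bn_params(n_features):
    """gamma, beta, moving mean and moving variance: 4 values per feature."""
    return 4 * n_features

print(bn_params(784))          # first BN layer
print(dense_params(784, 300))  # first hidden layer
print(bn_params(300))          # second BN layer
print(dense_params(300, 100))  # second hidden layer
print(bn_params(100))          # third BN layer

# Only gamma and beta are trained by backpropagation, so half of the
# BN parameters are non-trainable.
non_trainable = (bn_params(784) + bn_params(300) + bn_params(100)) // 2
print(non_trainable)
```

The moving mean and moving variance are counted as parameters of the layer, but they are updated by the moving average rather than by gradient descent, which is why they are listed as non-trainable.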

The authors of the BN paper argued in favor of adding the BN layers before the activation
functions, rather than after (as we just did). There is some debate about this, as
it seems to depend on the task. So that’s one more thing you can experiment with to
see which option works best on your dataset. To add the BN layers before the activation
functions, we must remove the activation function from the hidden layers, and
add them as separate layers after the BN layers. Moreover, since a Batch Normalization
layer includes one offset parameter per input, you can remove the bias term from
the previous layer (just pass use_bias=False when creating it):

In [5]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    # No bias term: the BN layer that follows has its own offset parameter
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),  # BN comes before the activation here as well
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_6 (Batch (None, 784)               3136      
_________________________________________________________________
dense_6 (Dense)              (None, 300)               235200    
_________________________________________________________________
batch_normalization_7 (Batch (None, 300)               1200      
_________________________________________________________________
activation_2 (Activation)    (None, 300)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               30000     

#### Hyperparameters of Batch Normalization layer:

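1. Momentum: This hyperparameter is used when updating the exponential moving averages. Given a
new value $v_{\text{new}}$ (a new vector of input means or standard deviations computed over the current batch), the running average $v_{\text{avg}}$ is updated using the following equation:

$v_{\text{avg}} \leftarrow v_{\text{avg}} \times \text{momentum} + v_{\text{new}} \times (1 - \text{momentum})$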

A good momentum value is typically close to 1—for example, 0.9, 0.99, or 0.999 (you
want more 9s for larger datasets and smaller mini-batches).
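
To get a feel for what “more 9s” means, the sketch below counts how many updates a moving average needs to get within 1% of a constant value. Both function names and the 1% tolerance are choices made here for illustration.

```python
def update_moving_average(v_avg, v_new, momentum):
    """One step of: v_avg <- v_avg * momentum + v_new * (1 - momentum)."""
    return v_avg * momentum + v_new * (1.0 - momentum)

def steps_to_reach(target, momentum, tol=0.01):
    """Count updates until a moving average starting at 0 is within tol of target."""
    v_avg, steps = 0.0, 0
    while abs(v_avg - target) > tol * abs(target):
        v_avg = update_moving_average(v_avg, target, momentum)
        steps += 1
    return steps

for momentum in (0.9, 0.99, 0.999):
    print(momentum, steps_to_reach(target=1.0, momentum=momentum))
```

Each extra 9 makes the average roughly ten times slower to react, but also ten times less sensitive to any single noisy mini-batch, which is why it suits large datasets split into many small batches.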

2. Axis: This hyperparameter determines which axis should be normalized.
It defaults to -1, meaning that by default it will normalize the last axis (using
the means and standard deviations computed across the other axes). For example,
when the input batch is 2D (i.e., the batch shape is [batch size, features]), each
input feature will be normalized based on the mean and standard deviation
computed across all the instances in the batch. The first BN layer in the
previous code example will thus independently normalize (and rescale and shift) each of
the 784 input features. However, if we move the first BN layer before the Flatten
layer, then the input batches will be 3D, with shape [batch size, height, width], so
the BN layer will compute 28 means and 28 standard deviations (one per column of pixels,
computed across all instances in the batch and all rows in the column), and
it will normalize all pixels in a given column using the same mean and standard deviation.
There will also be just 28 scale parameters and 28 shift parameters. If instead
you still want to treat each of the 784 pixels independently, then you should set
axis=[1, 2].
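
The effect of the axis argument on the number of statistics can be reproduced with plain NumPy reductions. This is only a sketch of which axes get averaged over, using a random batch of 32 fake images; it does not call the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 28, 28))  # [batch size, height, width]

# axis=-1: one statistic per index of the last axis (per pixel column),
# computed across the batch and height axes.
mean_last_axis = batch.mean(axis=(0, 1))
print(mean_last_axis.shape)

# axis=[1, 2]: one statistic per (row, column) pair, computed across the batch only.
mean_per_pixel = batch.mean(axis=0)
print(mean_per_pixel.shape)

# After Flatten, axis=-1 also gives one statistic per pixel.
mean_flat = batch.reshape(32, -1).mean(axis=0)
print(mean_flat.shape)
```

The first reduction yields 28 means, while the last two yield 784 means each and contain the same values, so normalizing with axis=[1, 2] before Flatten matches normalizing with the default axis after it.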

Notice that the BN layer does not perform the same computation during training and
after training: it uses batch statistics during training, and the “final” statistics after
training (i.e., the final value of the moving averages).

Batch Normalization has become one of the most used layers in deep neural networks,
to the point that it is often omitted in the diagrams, as it is assumed that BN is
added after every layer. 

However, a recent paper by Hongyi Zhang et al. may
well change this: the authors show that by using a novel fixed-update (fixup) weight
initialization technique, they manage to train a very deep neural network (10,000 layers!)
without BN, achieving state-of-the-art performance on complex image classification
tasks.

## Gradient Clipping

Another popular technique to lessen the exploding gradients problem is to simply
clip the gradients during backpropagation so that they never exceed some threshold.
This is called Gradient Clipping.

This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs.
For other types of networks, BN is usually sufficient.

In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or
clipnorm argument when creating an optimizer. For example:



In [6]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)

This will clip every component of the gradient vector to a value between –1.0 and 1.0.
This means that all the partial derivatives of the loss (with regard to each and every
trainable parameter) will be clipped between –1.0 and 1.0. The threshold is a hyperparameter
you can tune. Note that it may change the orientation of the gradient vector:
for example, if the original gradient vector is [0.9, 100.0], it points mostly in the
direction of the second axis, but once you clip it by value, you get [0.9, 1.0], which
points roughly in the diagonal between the two axes. In practice however, this
approach works well. 

If you want to ensure that Gradient Clipping does not change
the direction of the gradient vector, you should clip by norm by setting clipnorm
instead of clipvalue. This will clip the whole gradient if its ℓ2 norm is greater than
the threshold you picked. For example, if you set clipnorm=1.0, then the vector [0.9,
100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation, but
almost eliminating the first component. If you observe that the gradients explode
during training (you can track the size of the gradients using TensorBoard), you may
want to try both clipping by value and clipping by norm, with different thresholds,
and see which option performs best on the validation set.
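
The two clipping modes can be compared on the example vector from the text with a few lines of NumPy. The helpers `clip_by_value` and `clip_by_norm` are written here for illustration on a single gradient vector; they are not the optimizer's internal implementation.

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Clip each component independently to [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold):
    """Rescale the whole vector if its l2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

grad = np.array([0.9, 100.0])
print(clip_by_value(grad, 1.0))  # [0.9, 1.0]: the direction changes
print(clip_by_norm(grad, 1.0))   # approx. [0.00899964, 0.9999595]: the direction is preserved
```

Clipping by norm divides both components by the same factor (the norm, about 100.004), which is why the ratio between them is unchanged.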