### Vanishing/Exploding Gradients Problems

The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once it has computed the gradient of the cost function with regard to each parameter, it uses these gradients to update each parameter with a Gradient Descent step. Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem.

In some cases the opposite happens: the gradients grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is called the exploding gradients problem.
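The effect can be illustrated with a minimal sketch, assuming a toy chain of one-neuron layers that all share the same weight (the function `chain_gradient` and its parameters are illustrative names, not part of any library). By the chain rule, the gradient reaching the first layer is a product of one factor per layer, so factors below 1 shrink it exponentially and factors above 1 blow it up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(weight, n_layers, activation="logistic", x=0.5):
    """Return |d output / d input| for a chain of one-neuron layers.

    Toy assumption: every layer has a single neuron and the same weight.
    The gradient is the product of (weight * local derivative) over all
    layers; since these are scalars, the product can be accumulated
    during the forward pass.
    """
    a = x
    grad = 1.0
    for _ in range(n_layers):
        z = weight * a
        if activation == "logistic":
            a = sigmoid(z)
            local = a * (1.0 - a)  # derivative of the logistic function, at most 0.25
        else:  # "linear": identity activation, derivative 1
            a = z
            local = 1.0
        grad *= weight * local
    return abs(grad)


# Logistic layers: each factor is at most 0.25, so the gradient vanishes.
print(chain_gradient(1.0, 20, "logistic"))
# Linear layers with weight 1.5: the gradient is 1.5 ** 20 and explodes.
print(chain_gradient(1.5, 20, "linear"))
```

Real networks use weight matrices rather than scalars, but the same multiplicative structure applies.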


### Glorot and He Initialization

For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer). A good compromise is to initialize the connection weights of each layer randomly, using either a normal distribution with mean 0 and variance `1 / fan_avg`, or a uniform distribution between -r and +r with `r = sqrt(3 / fan_avg)`, where `fan_avg = (fan_in + fan_out) / 2`. This initialization strategy is called Xavier initialization or Glorot initialization.
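As a rough sketch of Glorot initialization, the snippet below samples a weight matrix in plain Python and checks its empirical variance. It assumes the standard Glorot formulas (normal distribution with variance `1 / fan_avg`, or uniform distribution on `[-r, r]` with `r = sqrt(3 / fan_avg)`). The helper names `glorot_normal`, `glorot_uniform`, and `variance` are made up for illustration.

```python
import math
import random


def glorot_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix with variance 1 / fan_avg."""
    fan_avg = (fan_in + fan_out) / 2
    sigma = math.sqrt(1.0 / fan_avg)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]


def glorot_uniform(fan_in, fan_out, rng):
    """Sample uniformly in [-r, r]; a uniform on [-r, r] has variance r**2 / 3."""
    fan_avg = (fan_in + fan_out) / 2
    r = math.sqrt(3.0 / fan_avg)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]


def variance(matrix):
    """Empirical variance of all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(42)  # fixed seed so the run is reproducible
fan_in, fan_out = 300, 100
target = 1.0 / ((fan_in + fan_out) / 2)  # 1 / fan_avg

normal_var = variance(glorot_normal(fan_in, fan_out, rng))
uniform_var = variance(glorot_uniform(fan_in, fan_out, rng))
print(target, normal_var, uniform_var)
```

Both empirical variances should land close to the target, which shows that the two distributions are interchangeable as far as the variance is concerned.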


He initialization follows the same idea but is designed for ReLU and its variants: it draws the weights with variance `2 / fan_in`, which keeps the variance of the activations stable from layer to layer and prevents the gradients from becoming too small or too large during backpropagation.


For the tanh, logistic, or softmax activation functions, Glorot initialization is preferred.

For ReLU and its variants, He initialization is used.

For SELU, LeCun initialization is used.
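The three schemes differ only in how the weight variance is scaled. The sketch below summarizes this, assuming the usual normal-distribution variants (Glorot: `1 / fan_avg`, He: `2 / fan_in`, LeCun: `1 / fan_in`). The function `init_variance` and the `RECOMMENDED` mapping are hypothetical helpers written for this note, not a Keras API.

```python
def init_variance(scheme, fan_in, fan_out):
    """Weight variance used by the normal-distribution version of each scheme."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return 1.0 / fan_avg
    if scheme == "he":
        return 2.0 / fan_in
    if scheme == "lecun":
        return 1.0 / fan_in
    raise ValueError(f"unknown scheme: {scheme!r}")


# Activation function -> initialization scheme, following the guidance above.
RECOMMENDED = {
    "tanh": "glorot",
    "logistic": "glorot",
    "softmax": "glorot",
    "relu": "he",
    "leaky_relu": "he",
    "elu": "he",
    "selu": "lecun",
}

for activation, scheme in RECOMMENDED.items():
    print(activation, scheme, init_variance(scheme, fan_in=300, fan_out=100))
```

Note that when `fan_in` equals `fan_out`, Glorot and LeCun initialization give the same variance, and He initialization gives exactly twice that.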


By default, Keras uses Glorot initialization with a uniform distribution. We can change this to He initialization by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"` when creating a layer.
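For instance, the following sketch creates one layer with He normal initialization and another that uses `keras.initializers.VarianceScaling` to get a He-style uniform initialization based on `fan_avg` rather than `fan_in`. It assumes a TensorFlow installation that exposes `tensorflow.keras`; the layer sizes and variable names are arbitrary.

```python
from tensorflow import keras

# He initialization with a normal distribution, selected by name.
he_layer = keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He-style initialization with a uniform distribution, scaled by fan_avg
# instead of fan_in, built with the generic VarianceScaling initializer.
he_avg_init = keras.initializers.VarianceScaling(
    scale=2.0, mode="fan_avg", distribution="uniform"
)
he_avg_layer = keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```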



### Nonsaturating Activation Functions

The ReLU activation function is not perfect: it suffers from a problem known as dying ReLUs, where some neurons effectively die during training, meaning they stop outputting anything other than 0. In some cases, more than half of the network's neurons may be dead, especially if a large learning rate was used.

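A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent no longer affects it, because the gradient of the ReLU function is 0 when its input is negative.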

To solve this problem, we can use a variant of the ReLU function such as the leaky ReLU. It is defined as LeakyReLU(z) = max(az, z). The hyperparameter a defines how much the function leaks: it is the slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
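A minimal sketch of the leaky ReLU and its gradient, written in plain Python for scalar inputs (the function names are illustrative, and the formula max(az, z) assumes 0 <= a <= 1):

```python
def relu_grad(z):
    """Gradient of ReLU: 0 for negative inputs, so a dead neuron gets no update."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, a=0.01):
    """LeakyReLU(z) = max(a * z, z)."""
    return max(a * z, z)


def leaky_relu_grad(z, a=0.01):
    """Gradient of the leaky ReLU: the slope a is kept for negative inputs."""
    return 1.0 if z > 0 else a


z = -2.0
print(relu_grad(z))        # no gradient flows through a plain ReLU here
print(leaky_relu(z))       # small negative output instead of 0
print(leaky_relu_grad(z))  # small but nonzero gradient
```

Because the gradient for negative inputs is a rather than 0, the neuron's weights can still be updated, which is why it can eventually recover.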

Another variant is the randomized leaky ReLU (RReLU), where a is picked randomly in a given range during training and is fixed to an average value during testing. It was also reported to perform fairly well, and it seemed to act as a regularizer.
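The training/testing distinction can be sketched as follows. The range `[1/8, 1/3]` for the slope is an assumption chosen for illustration, and `rrelu` is a hypothetical helper rather than a library function.

```python
import random


def rrelu(z, lower=1 / 8, upper=1 / 3, training=False, rng=random):
    """Randomized leaky ReLU for a scalar input.

    During training the negative slope is drawn uniformly from [lower, upper];
    at test time it is fixed to the midpoint of that range.
    """
    if z >= 0:
        return z
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return a * z


rng = random.Random(0)
print([rrelu(-1.0, training=True, rng=rng) for _ in range(3)])  # varies per call
print(rrelu(-1.0, training=False))  # deterministic at test time
```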

In the parametric leaky ReLU (PReLU), a is learned during training: instead of being a hyperparameter, it becomes a parameter that is modified by backpropagation like any other. This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
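To make "learned during training" concrete, here is a toy sketch in which the slope a is fitted by Gradient Descent on a mean squared error. The data, learning rate, and target slope of 0.25 are all made up for illustration; in a real network, a would be updated by backpropagation together with the weights.

```python
def prelu(z, a):
    """Parametric leaky ReLU for a scalar input."""
    return z if z > 0 else a * z


def prelu_grad_a(z):
    """Derivative of PReLU with respect to a: only negative inputs depend on a."""
    return z if z < 0 else 0.0


# Toy task: recover the slope 0.25 that generated the targets.
zs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
targets = [prelu(z, 0.25) for z in zs]

a = 0.01             # initial slope
learning_rate = 0.1  # arbitrary choice for this toy problem
for _ in range(200):
    # Gradient of the mean squared error with respect to a.
    grad = sum(
        2.0 * (prelu(z, a) - t) * prelu_grad_a(z) for z, t in zip(zs, targets)
    ) / len(zs)
    a -= learning_rate * grad

print(a)  # converges toward the target slope
```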

The exponential linear unit (ELU) outperformed all the ReLU variants in its authors' experiments. It is defined as ELU(z) = α(exp(z) - 1) for z < 0 and ELU(z) = z for z ≥ 0, where the hyperparameter α is usually set to 1. It takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem. It has a nonzero gradient for z < 0, which avoids the dead neurons problem. If α is equal to 1, the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much to the left and right of z = 0. The main drawback of ELU is that it is slower to compute than ReLU and its variants, but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.
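A small sketch of ELU and its gradient for scalar inputs, assuming the standard definition ELU(z) = α(exp(z) - 1) for z < 0 and z otherwise, with α defaulting to 1 (function names are illustrative):

```python
import math


def elu(z, alpha=1.0):
    """Exponential linear unit."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)


def elu_grad(z, alpha=1.0):
    """Gradient of ELU: alpha * exp(z) for z < 0, which is never exactly 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)


# Negative inputs give negative outputs that approach -alpha.
print(elu(-1.0), elu(-10.0))

# With alpha = 1 the slope just left of 0 matches the slope on the right (1),
# so there is no kink at z = 0; with alpha = 0.5 the slope jumps from 0.5 to 1.
print(elu_grad(-1e-9, alpha=1.0), elu_grad(-1e-9, alpha=0.5), elu_grad(0.0))
```

The call to `math.exp` for negative inputs is what makes ELU slower to compute than ReLU.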

The SELU (scaled ELU) activation function can make a network self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. For self-normalization to happen, the input features must be standardized (mean 0 and standard deviation 1), every hidden layer's weights must be initialized with LeCun normal initialization, and the network's architecture must be sequential.
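The sketch below checks one ingredient of self-normalization numerically: if a neuron's pre-activation follows a standard normal distribution, its SELU output again has mean close to 0 and standard deviation close to 1. It assumes the commonly published SELU constants and uses simple midpoint integration. It illustrates only this fixed-point property, not the full training dynamics, and the helper names are made up for this note.

```python
import math

# Commonly published SELU constants (assumed here, not taken from the notes).
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(z):
    """Scaled ELU: SCALE * z for z > 0, SCALE * ALPHA * (exp(z) - 1) otherwise."""
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def expectation(f, lo=-10.0, hi=10.0, steps=20000):
    """Approximate E[f(Z)] for Z ~ N(0, 1) with the midpoint rule."""
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        total += f(z) * normal_pdf(z) * dz
    return total


mean = expectation(selu)
second_moment = expectation(lambda z: selu(z) ** 2)
std = math.sqrt(second_moment - mean ** 2)
print(mean, std)  # both should be close to 0 and 1 respectively
```

LeCun normal initialization and standardized inputs are what keep the pre-activations close to this standard normal regime in the first place.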


As a rule of thumb, SELU > ELU > leaky ReLU > ReLU > tanh > logistic. If the network's architecture prevents it from self-normalizing, then ELU may perform better than SELU.

To use the leaky ReLU activation function, we must create a LeakyReLU instance.

    from tensorflow import keras

    # alpha is the slope for z < 0 (some newer Keras releases name this argument negative_slope)
    leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
    layer = keras.layers.Dense(10, activation=leaky_relu, kernel_initializer="he_normal")

For SELU activation, we can set `activation="selu"` and `kernel_initializer="lecun_normal"` when creating a layer.

    from tensorflow import keras

    layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")