### Vanishing/Exploding Gradients Problems

The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. It then uses these gradients to update each parameter with a Gradient Descent step. Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem.

The opposite can also happen: the gradients grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is called the exploding gradients problem.
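A rough way to see both problems is to treat backpropagation as multiplying the gradient by one factor per layer. The sketch below is a deliberate simplification (it ignores the weights and uses the same factor for every layer, which is an assumption made only for illustration); the helper names are hypothetical.

```python
import math

def sigmoid_derivative(z):
    """Slope of the logistic function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backprop_factor(per_layer_factor, n_layers):
    """Overall scaling of a gradient that is multiplied by the same
    factor at each of n_layers layers (a simplifying assumption)."""
    return per_layer_factor ** n_layers

# The logistic function's slope is at most 0.25 (reached at z = 0).
max_sigmoid_slope = sigmoid_derivative(0.0)

# A factor below 1 shrinks the gradient geometrically (vanishing).
vanishing = backprop_factor(max_sigmoid_slope, 10)

# A factor above 1 grows it geometrically (exploding).
exploding = backprop_factor(1.5, 50)

print(f"10 layers, factor 0.25: {vanishing:.2e}")
print(f"50 layers, factor 1.50: {exploding:.2e}")
```

Even in this toy setting, ten layers are enough to shrink the gradient by about six orders of magnitude, which is why the lower layers barely move.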


### Glorot and He Initialization

For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (its fan-in and fan-out). A good compromise is to initialize the connection weights of each layer randomly, either from a normal distribution with mean 0 and variance `1 / fan_avg`, or from a uniform distribution between `-r` and `+r` with `r = sqrt(3 / fan_avg)`, where `fan_avg = (fan_in + fan_out) / 2`. This initialization strategy is called Xavier initialization or Glorot initialization.


He initialization follows the same idea but scales the variance by the fan-in only (variance `2 / fan_in`). It aims to maintain a stable variance of activations throughout the layers of the network, preventing the gradients from becoming too small or too large during backpropagation.


For the tanh, logistic, or softmax activation functions, Glorot initialization is preferred.

For ReLU and its variants, He initialization is used.

For SELU, LeCun initialization is used.
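The three strategies differ only in the variance they assign to the weights. The following sketch computes the corresponding standard deviation and uniform limit for a layer, assuming the usual variance formulas (`1 / fan_avg` for Glorot, `2 / fan_in` for He, `1 / fan_in` for LeCun); the function names are hypothetical and this is not how Keras implements its initializers.

```python
import math

def init_stddev(fan_in, fan_out, scheme):
    """Standard deviation of the weights for a given initialization scheme."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,  # tanh, logistic, softmax
        "he": 2.0 / fan_in,       # ReLU and its variants
        "lecun": 1.0 / fan_in,    # SELU
    }
    return math.sqrt(variances[scheme])

def uniform_limit(stddev):
    """Limit r of a uniform distribution on [-r, +r] with this stddev.
    A uniform distribution on [-r, +r] has variance r**2 / 3."""
    return math.sqrt(3.0) * stddev

# Example: a layer with 300 inputs and 100 neurons.
for scheme in ("glorot", "he", "lecun"):
    sigma = init_stddev(300, 100, scheme)
    print(f"{scheme:>6}: stddev={sigma:.4f}, uniform limit={uniform_limit(sigma):.4f}")
```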


By default, Keras uses Glorot initialization with a uniform distribution. We can change this to He initialization by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"` when creating a layer.



### Nonsaturating Activation Functions

The ReLU activation function is not perfect: it suffers from a problem known as dying ReLUs, where some neurons stop outputting anything other than 0 during training. In some cases, more than half of the network's neurons are dead, especially with a large learning rate.

A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for every instance in the training set. Since both the output and the gradient of ReLU are 0 for negative inputs, Gradient Descent no longer affects the neuron and it just keeps outputting 0s.

To solve this problem, we can use a variant of the ReLU function such as leaky ReLU. It is defined as LeakyReLU(z) = max(az, z). The hyperparameter a defines how much the function leaks: it is the slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.
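The difference between the two functions is easiest to see in their gradients for negative inputs. This is a minimal scalar sketch written for illustration (the function names are hypothetical, and the gradient at exactly z = 0 is arbitrarily taken from the negative side):

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Zero gradient for z <= 0: a neuron stuck there receives no updates.
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, a=0.01):
    # Equivalent to max(a*z, z) as long as 0 <= a <= 1.
    return max(a * z, z)

def leaky_relu_grad(z, a=0.01):
    # The small slope a keeps a nonzero gradient for z <= 0.
    return 1.0 if z > 0 else a

for z in (-2.0, 3.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```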

Randomized leaky ReLU (RReLU) is a variant where a is picked randomly in a given range during training and fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer.

Parametric leaky ReLU (PReLU) is a variant where a is allowed to be learned during training: instead of being a hyperparameter, it becomes a parameter modified by backpropagation. This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

The Exponential Linear Unit (ELU) outperformed all the ReLU variants in its authors' experiments. It takes on negative values when z < 0, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem. It has a nonzero gradient for z < 0, which avoids the dead neurons problem. When its hyperparameter α equals 1, the function is differentiable at z = 0, which helps speed up Gradient Descent since it does not bounce as much to the left and right of z = 0. The main drawback of ELU is that it is slower to compute than ReLU and its variants, but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

The SELU (Scaled ELU) activation function can make the network self-normalize: the output of each layer will tend to preserve mean 0 and standard deviation 1 during training, which solves the vanishing/exploding gradients problem. For self-normalization to happen, the input features must be standardized, every hidden layer's weights must be initialized using LeCun normal initialization, and the network's architecture must be sequential.
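As a point of reference, ELU and SELU can be written in a few lines. This sketch assumes the standard definitions, ELU(z) = α(exp(z) − 1) for z < 0 and z otherwise, and SELU(z) = scale × ELU(z, α) with the fixed constants commonly quoted for SELU; the constant and function names are chosen here for illustration.

```python
import math

# Fixed SELU constants (assumed here; they are the commonly quoted values).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def elu(z, alpha=1.0):
    """Identity for z >= 0, smooth exponential curve approaching -alpha for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    """Scaled ELU: an ELU with fixed alpha, multiplied by a fixed scale."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:5.1f}  elu={elu(z):8.4f}  selu={selu(z):8.4f}")
```

Unlike ReLU, both functions produce negative outputs for negative inputs, which is what lets the average output stay close to 0.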


As a rule of thumb, SELU > ELU > leaky ReLU > ReLU > tanh > logistic. If the network's architecture prevents it from self-normalizing, then ELU may perform better than SELU.

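To use the leaky ReLU activation function, we create a `LeakyReLU` layer instance and pass it as the layer's activation.

In [1]:
from tensorflow import keras

# alpha is the slope of the function for z < 0
leaky_relu = keras.layers.LeakyReLU(alpha=0.2)
# He initialization is the usual choice for ReLU and its variants
layer = keras.layers.Dense(10, activation=leaky_relu, kernel_initializer="he_normal")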

For SELU activation, we can set `activation="selu"` and `kernel_initializer="lecun_normal"` when creating a layer.

In [2]:
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

### Batch Normalization

Batch Normalization (BN) consists of adding an operation that zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling and the other for shifting. This operation lets the model learn the optimal scale and mean of each of the layer's inputs.


At test time there may be no batch to compute statistics from, so the final statistics are usually estimated during training using a moving average of the layer's input means and standard deviations. Four parameter vectors are learned in each batch-normalized layer: gamma (γ, the output scale vector) and beta (β, the output offset vector) are learned through regular backpropagation, while mu (μ, the final input mean vector) and sigma (σ, the final input standard deviation vector) are estimated using an exponential moving average.
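The training and test behaviors can be sketched for a single input feature as follows. This is an illustrative implementation, not the Keras one: the function names are hypothetical, the moving statistic tracked is the variance rather than the standard deviation, and the `momentum=0.99` and `eps=1e-3` defaults are assumptions for this example.

```python
import math

def batch_norm_train(batch, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """Normalize one feature over a mini-batch and update the moving statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Zero-center and normalize, then scale (gamma) and shift (beta).
    outputs = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    # Exponential moving averages, used later at test time.
    new_mean = momentum * moving_mean + (1 - momentum) * mean
    new_var = momentum * moving_var + (1 - momentum) * var
    return outputs, new_mean, new_var

def batch_norm_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """At test time, use the moving statistics instead of batch statistics."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta

outputs, mu, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0,
                                    moving_mean=0.0, moving_var=1.0)
print(outputs, mu, var)
print(batch_norm_inference(2.5, 1.0, 0.0, mu, var))
```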

Due to BN, the vanishing gradients problem is strongly reduced, to the point that we could use saturating activation functions such as the tanh and even the logistic activation function. The networks are also much less sensitive to the weight initialization. Higher learning rates can be used, speeding up the learning process. BN also acts like a regularizer, reducing the need for other regularization techniques.

Each epoch takes more time when using BN. However, this is usually counterbalanced by the fact that convergence is much faster with BN, so it takes fewer epochs to reach the same performance.


#### Implementing Batch Normalization with Keras

In [2]:
from tensorflow import keras

model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(10, activation="softmax")
 ])

In [3]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 batch_normalization (BatchN  (None, 784)              3136      
 ormalization)                                                   
                                                                 
 dense (Dense)               (None, 300)               235500    
                                                                 
 batch_normalization_1 (Batc  (None, 300)              1200      
 hNormalization)                                                 
                                                                 
 dense_1 (Dense)             (None, 100)               30100     
                                                                 
 batch_normalization_2 (Batc  (None, 100)              4

In [4]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

In [5]:
# Add the BN layers before the activation functions, rather than after.
# Each BN layer has its own offset parameter, so the preceding Dense layer's bias is redundant (use_bias=False).

model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
 keras.layers.BatchNormalization(),
 keras.layers.Activation("elu"),
 keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
 keras.layers.BatchNormalization(),
 keras.layers.Activation("elu"),
 keras.layers.Dense(10, activation="softmax")
 ])

### Gradient Clipping

Another popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This technique is most often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs.

In Keras, implementing Gradient Clipping is just a matter of setting the `clipvalue` or `clipnorm` argument when creating an optimizer.

`clipvalue` clips each component of the gradient vector independently, so it may change the orientation of the vector. To ensure that Gradient Clipping does not change the direction of the gradient vector, use `clipnorm` instead, which rescales the whole vector when its norm exceeds the threshold.
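The effect on the gradient's direction can be checked on a small example. The two helper functions below are hypothetical stand-ins written for illustration; they mimic what clipping by value and clipping by norm do to a single gradient vector, and are not the Keras implementation.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip every component to the range [-clip_value, +clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]                   # points almost entirely along the second axis
by_value = clip_by_value(grad, 1.0)   # components become nearly equal: direction changes
by_norm = clip_by_norm(grad, 1.0)     # same direction, norm rescaled to 1

print("clip by value:", by_value)
print("clip by norm: ", by_norm)
```

After clipping by value, the two components are almost the same size, so the vector points roughly along the diagonal. Clipping by norm keeps the original ratio between components, but nearly eliminates the first one.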


### Reusing Pretrained Layers

Transfer learning is a machine learning technique where a model trained for a specific task is reused for a different but related task. Transfer learning allows the new model to benefit from the knowledge acquired from the previous task. Transfer learning can reduce the cost and time of building and training the new model.

The output layer of the original model should usually be replaced since it is most likely not useful at all for the new task.

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task.

At first, we should try freezing all the reused layers, then train the model and see how it performs. Next, we can try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. It is also useful to reduce the learning rate when we unfreeze reused layers: this avoids wrecking their fine-tuned weights.


### EWMA

EWMA stands for Exponentially Weighted Moving Average. It's a statistical method used to smooth time series data and identify trends by giving more weight to recent observations while exponentially decreasing the weights for older data. 
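A minimal sketch of the idea, assuming the common recursive form v_t = β·v_(t−1) + (1 − β)·x_t with the average started at 0 (the function name and the starting value are choices made for this example):

```python
def ewma(values, beta=0.9):
    """Exponentially weighted moving average of a sequence.

    beta close to 1 gives heavy smoothing (long memory);
    beta close to 0 follows the latest observation closely.
    """
    average = 0.0
    smoothed = []
    for x in values:
        average = beta * average + (1 - beta) * x
        smoothed.append(average)
    return smoothed

noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
print(ewma(noisy, beta=0.5))
```

Because the average starts at 0, the first few values are biased toward 0; this is the reason some optimizers apply a bias correction to their moving averages.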


### Momentum Optimization

Momentum optimization is a technique that reduces the time needed to train a model. With mini-batch gradient descent, the path toward the minimum zigzags instead of heading straight there, so some time is wasted moving back and forth. Momentum optimization smooths out the zigzag path and makes it much straighter, which reduces training time.

Momentum optimization cares about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate η) from the momentum vector m, and it updates the weights by simply adding this momentum vector. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.


Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.
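The update rule described above, m ← βm − η∇J(θ) followed by θ ← θ + m, can be sketched on a one-dimensional problem. The cost function J(θ) = θ², the starting point, and the hyperparameter values are assumptions chosen only to keep the example small.

```python
def momentum_descent(gradient, theta, learning_rate=0.1, beta=0.9, n_steps=200):
    """Gradient Descent with momentum on a single scalar parameter."""
    m = 0.0  # momentum vector (a scalar here)
    for _ in range(n_steps):
        # Friction (beta) decays the old momentum; the local gradient is subtracted.
        m = beta * m - learning_rate * gradient(theta)
        # The parameter is updated by adding the momentum.
        theta = theta + m
    return theta

def quadratic_gradient(theta):
    """Gradient of the toy cost function J(theta) = theta**2."""
    return 2.0 * theta

final_theta = momentum_descent(quadratic_gradient, theta=5.0)
print(final_theta)  # close to the minimum at 0 after some oscillation
```

Printing θ at every step shows the overshooting behavior: the parameter crosses 0 several times before settling.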

In [3]:
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### Nesterov Accelerated Gradient

The idea of Nesterov Accelerated Gradient (NAG) is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ. Since the momentum vector generally points toward the optimum, the gradient measured a bit farther in that direction is slightly more accurate. These small improvements add up, and NAG ends up being significantly faster than regular Momentum optimization.
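In code, the change from regular momentum is a single line: the gradient is evaluated at the look-ahead point θ + βm. The sketch below uses a toy cost function J(θ) = θ² and hyperparameter values that are assumptions for illustration only.

```python
def nesterov_descent(gradient, theta, learning_rate=0.1, beta=0.9, n_steps=200):
    """Nesterov Accelerated Gradient on a single scalar parameter."""
    m = 0.0
    for _ in range(n_steps):
        # Measure the gradient slightly ahead, in the direction of the momentum.
        lookahead = theta + beta * m
        m = beta * m - learning_rate * gradient(lookahead)
        theta = theta + m
    return theta

def quadratic_gradient(theta):
    """Gradient of the toy cost function J(theta) = theta**2."""
    return 2.0 * theta

final_theta = nesterov_descent(quadratic_gradient, theta=5.0)
print(final_theta)
```

On this toy problem the look-ahead gradient starts braking before the parameter crosses the minimum, which damps the oscillations.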

Like other gradient-based optimizers, NAG can still get stuck in local minima.

In [1]:
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)