<h2> Training Deep Neural Networks </h2> 

To tackle complex problems, we need to train much deeper DNNs, with 10 layers or more, each containing hundreds of neurons. 

<h3> Issues faced when training DNN </h3> 

- Vanishing or exploding gradients
- Not enough training data for a large network
- Training may be extremely slow
- A model with millions of parameters risks overfitting the training set if there are not enough training instances. 

<h3> Vanishing/Exploding Gradients Problems </h3> 

- The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. 
- Once the gradient for each parameter is calculated, the gradients are used to update each parameter. 
- **However**, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. 

- Due to this vanishing gradient, the weights and biases of the lower layers are left virtually unchanged, so those layers barely learn. 


**The opposite can happen too**

- In some cases, the gradients grow bigger and bigger when moving to the lower layers, so those layers get very large weight updates and the algorithm diverges. This is the exploding gradients problem. 


More generally, deep neural networks suffer from unstable gradients: different layers may learn at widely different speeds. 

Factors which contribute to this:

- The combination of the logistic sigmoid activation function and the weight initialization technique (random initialization using a normal distribution with a mean of 0 and a standard deviation of 1). With this combination, the variance of each layer's outputs is much greater than the variance of its inputs, so the variance keeps growing through the forward pass until the activation function saturates in the top layers.

- The sigmoid activation function saturates for large positive and large negative inputs: its output flattens out at 1 and 0 respectively. In these saturated regions the derivative is close to zero, so almost no gradient passes back through the neuron, and little is left to propagate to the lower layers.
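The saturation effect can be checked numerically. The following is a minimal sketch in plain Python; the function names and the depth of 10 layers are illustrative assumptions, and the weights are ignored so that only the activation derivative is visible.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)). Its maximum is 0.25, at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the gradient at the centre with the gradient in the saturated regions.
for z in (0.0, 5.0, -5.0, 10.0):
    print(f"z = {z:6.1f}   sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Best case for the activation-derivative factor after passing back through
# n sigmoid layers (hypothetical depth; weights are deliberately left out).
n_layers = 10
best_case_factor = 0.25 ** n_layers
print(f"best-case factor after {n_layers} layers: {best_case_factor:.2e}")
```

Even in the best case, where every neuron sits at z = 0, each sigmoid layer multiplies the gradient by at most 0.25, so the factor shrinks geometrically with depth.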


<h3> Vanishing gradient can occur in multiple ways </h3> 

Either the gradient keeps shrinking as backpropagation moves toward the lower layers, until almost nothing is left, or a neuron's output is always zero, so no gradient flows through it at all (this usually happens when the weights make the neuron output zero for every input).


<h3> Glorot and He Initialization </h3> 

For signals to flow properly in both directions, the variance of each layer's outputs should equal the variance of its inputs during the forward pass, and the gradients should have equal variance before and after flowing through a layer during backpropagation. 

It is not possible to guarantee both conditions unless the layer has an equal number of inputs and neurons (fan(in) = fan(out)). Instead, a valid compromise is to **randomly initialize the connection weights of each layer** with a variance based on the average of the two:

fan(avg) = (fan(in) + fan(out))/2

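Different strategies exist, but they differ only by the scale of the variance and whether they use fan(avg) or fan(in): Glorot initialization uses a variance of 1/fan(avg), He initialization uses 2/fan(in), and LeCun initialization uses 1/fan(in).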

For any of these initialization strategies, we plug **fan(avg) or fan(in)** into the appropriate variance equation. The weights can then be drawn from a normal distribution with mean 0 and that variance, or from a uniform distribution between -r and +r, where r = sqrt(3 * variance). Biases are typically just initialized to zero. 

By default, Keras uses Glorot initialization with a uniform distribution: the variance is 1/fan(avg), which gives the boundaries -r and +r with r = sqrt(3/fan(avg)). 
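To make the recipe concrete, here is a minimal sketch of Glorot uniform initialization in plain Python. It is not Keras's own implementation; the function name `glorot_uniform` and the layer sizes (300 inputs, 100 neurons) are assumptions chosen for illustration.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-r, r), r = sqrt(3 / fan_avg)."""
    fan_avg = (fan_in + fan_out) / 2
    # A uniform distribution on [-r, r] has variance r^2 / 3, i.e. 1 / fan_avg here.
    limit = math.sqrt(3.0 / fan_avg)
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit

rng = random.Random(42)  # fixed seed so the sketch is reproducible
weights, limit = glorot_uniform(fan_in=300, fan_out=100, rng=rng)

# Check that the empirical variance is close to the target 1 / fan_avg.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(f"limit r = {limit:.4f}")
print(f"empirical variance = {empirical_var:.5f}, target = {1 / 200:.5f}")
```

Swapping in He or LeCun initialization would only change the variance formula (2/fan(in) or 1/fan(in)), and therefore the limit.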


<h3> Nonsaturating Activation Functions </h3> 

<h4> Dying ReLU </h4>

Dying ReLUs are neurons which output only 0s. This happens when a neuron's weights are tweaked such that the weighted sum of its inputs is negative for every instance in the training set, causing the ReLU output to always be zero. Since the gradient of ReLU is also 0 for negative inputs, gradient descent can no longer update the neuron. 

A variant of the ReLU function is the Leaky ReLU function. 

In Leaky ReLU, instead of max(0, z), we have max(a * z, z), where **a** is a hyperparameter which defines how much the function leaks. It is the slope of the function for z < 0 and is usually set to 0.01. This small slope ensures that the gradient is never exactly zero, so the neuron cannot die completely. 
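The difference between the two functions is easiest to see side by side. This is a minimal sketch in plain Python; the function names are illustrative, and it assumes 0 <= a < 1 so that max(a * z, z) picks the right branch.

```python
def relu(z):
    """Standard ReLU: max(0, z)."""
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    """Leaky ReLU: max(a * z, z). Identity for z >= 0, slope a for z < 0."""
    return max(a * z, z)

def leaky_relu_grad(z, a=0.01):
    """Slope of Leaky ReLU: 1 on the positive side, a on the negative side."""
    return 1.0 if z > 0 else a

for z in (-10.0, -1.0, 0.0, 3.0):
    print(f"z = {z:5.1f}   relu = {relu(z):5.2f}   leaky_relu = {leaky_relu(z):6.3f}")
```

For negative inputs ReLU returns 0 with a zero gradient, while Leaky ReLU returns a small negative value with gradient a, which is what keeps the neuron trainable.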


<h4> Exponential Linear Unit </h4> 

In this activation function, if z > 0, then the output is z. However, if z < 0, the output is a(exp(z) - 1). For all values below 0, it takes on a negative value and has a nonzero gradient, which avoids the dead neuron problem. With a = 1, the function is smooth everywhere, including around z = 0, so gradient descent doesn't bounce around as much and training typically converges faster than with ReLU. But it is slower to compute, and therefore slower at prediction time, because the function involves an exponential.
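The piecewise definition above can be sketched as follows. The function names are illustrative, and a = 1 is assumed as the default because that is the value for which the function is smooth at z = 0.

```python
import math

def elu(z, a=1.0):
    """ELU: z for z > 0, a * (exp(z) - 1) otherwise."""
    return z if z > 0 else a * (math.exp(z) - 1.0)

def elu_grad(z, a=1.0):
    """Derivative of ELU: 1 for z > 0, a * exp(z) otherwise (never exactly 0)."""
    return 1.0 if z > 0 else a * math.exp(z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z = {z:5.1f}   elu = {elu(z):8.5f}   elu' = {elu_grad(z):8.5f}")
```

For very negative z the output approaches -a, and the gradient becomes small but stays positive, unlike ReLU where it is exactly zero.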

<h4> Scaled Exponential Linear Unit (SELU) </h4> 

This activation function allows the network to self-normalize: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training. But the following conditions are required for that to happen:

- Input features must be standardized (mean 0 and standard deviation of 1).
- Every hidden layer's weights must be initialized with **LeCun normal initialization**.
- The network's architecture must be sequential (a plain stack of layers, as built with the Sequential API), or else self-normalization is not guaranteed. 
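As a rough illustration of the pieces involved, here is a minimal sketch of the SELU function and a LeCun normal sampler in plain Python. The two constants are the ones published with SELU; the function names and layer sizes are assumptions, and the sampler uses a plain Gaussian rather than the truncated normal that some libraries use.

```python
import math
import random

# Constants published with the SELU activation, at double precision.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    if z > 0:
        return SELU_SCALE * z
    return SELU_SCALE * SELU_ALPHA * math.expm1(z)

def lecun_normal(fan_in, fan_out, rng):
    """Weights drawn from a Gaussian with mean 0 and variance 1 / fan_in."""
    std = math.sqrt(1.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = lecun_normal(fan_in=100, fan_out=50, rng=rng)
flat = [w for row in weights for w in row]
weight_var = sum(w * w for w in flat) / len(flat)

print(selu(1.0), selu(-1.0))
print(f"weight variance = {weight_var:.5f}, target = {1 / 100:.5f}")
```

SELU is simply ELU multiplied by a fixed scale slightly larger than 1; the specific values of the two constants are what make the mean-0, standard-deviation-1 state stable under the conditions listed above.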


Effectiveness of activation functions (as a general rule of thumb):

SELU > ELU > Leaky ReLU (and variants) > ReLU > tanh > sigmoid

<h2> Batch Normalization (BN) </h2> 

Using ELU or Leaky ReLU and its variants (together with a suitable initialization) significantly reduces the risk of vanishing/exploding gradients at the beginning of training, since neurons are less likely to die and the signal stays in a reasonable range. But it doesn't guarantee that the problem won't come back during training. 

Batch Normalization addresses this. It involves adding an operation in the model just before or after the activation function of each hidden layer, which zero-centers and normalizes each input, then scales and shifts the result using 2 new parameter vectors per layer: one for scaling and one for shifting. 

Essentially, for each batch, you normalize each feature to have mean 0 and standard deviation 1 over that batch. Then you do element-wise multiplication with a vector gamma, which holds one scaling factor per feature. Then you shift the result with a vector beta, which holds one offset per feature. Both gamma and beta are learned during training. 
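The training-time computation described above can be sketched in plain Python as follows. This is an illustration only, not a library implementation: the function name `batch_norm`, the small smoothing term `eps`, and the toy batch are assumptions, and the moving averages that are used at prediction time are left out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta.

    batch: list of instances, each a list of feature values.
    gamma, beta: one scaling factor and one offset per feature.
    eps: small term that avoids division by zero (assumed value).
    """
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature mean and variance, computed over the current batch only.
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [
        [gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
         for j in range(n_features)]
        for row in batch
    ]

# Toy batch: 3 instances, 2 features on very different scales.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
scaled = batch_norm(batch, gamma=[2.0, 0.5], beta=[1.0, -1.0])

for row in normalized:
    print([round(v, 4) for v in row])
```

With gamma set to 1 and beta set to 0, both features end up on the same scale despite their very different raw ranges; other values of gamma and beta let the layer learn whatever scale and offset work best for each feature.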