# 1. Normalizing Inputs

Among the best practices for training a Neural Network is to normalize your data to obtain a mean close to 0. Normalizing the data generally speeds up learning and leads to faster convergence. 

* If you normalize your inputs this will speed up the training process a lot.
    * Normalization proceeds in these steps:
    * Let `x(i)` be the input vector for observation `(i)`, i.e. $x^{(i)}_1, x^{(i)}_2, ..., x^{(i)}_N$
    * Get the mean of the training set: `mean = (1/m) * sum(x(i))`
    * Subtract the mean from each input: `X = X - mean`
    * This makes your inputs centered around 0.
    * Get the variance of the (now centered) training set: `variance = (1/m) * sum(x(i)^2)`, with the square taken element-wise
    * Normalize the variance by dividing by the standard deviation: `X /= np.sqrt(variance)`
* These steps should be applied to training, validation, and testing sets (but using mean and variance of the train set).
* Why normalize?
    * If we don't normalize the inputs, features on very different scales give the cost function an elongated, narrow shape, and optimizing it takes a long time.
    * If we normalize, the opposite occurs. The cost function becomes more symmetric (its contours look closer to circles in the 2D example), so we can use a larger learning rate alpha and the optimization is faster.
    
<img src="figures/normalization.png" alt="normalization" style="width: 700px;"/>
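The normalization steps above can be sketched in NumPy as follows. This is a minimal sketch, assuming the data is stored with one column per example (shape `(n_features, m)`) and dividing by the standard deviation, i.e. the square root of the variance. The function names `fit_normalizer` and `normalize`, the small `eps` constant, and the synthetic data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set.

    X_train has shape (n_features, m): one column per example.
    eps is an assumed small constant that avoids division by zero
    for features that are constant.
    """
    mean = X_train.mean(axis=1, keepdims=True)
    std = X_train.std(axis=1, keepdims=True) + eps
    return mean, std

def normalize(X, mean, std):
    """Center and scale X using statistics computed elsewhere."""
    return (X - mean) / std

# Synthetic data: two features on very different scales.
rng = np.random.default_rng(0)
scale = np.array([[10.0], [0.1]])
shift = np.array([[5.0], [-3.0]])
X_train = rng.normal(size=(2, 1000)) * scale + shift
X_test = rng.normal(size=(2, 200)) * scale + shift

# Statistics come from the training set only and are reused for the test set.
mean, std = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mean, std)
X_test_norm = normalize(X_test, mean, std)
```

Note that the test set is transformed with the training mean and standard deviation, so its own mean and variance will be close to, but not exactly, 0 and 1.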


# 2. Normalizing Activations

## 2.1. Normalizing activations in a network

In the rise of deep learning, one of the most important ideas has been an algorithm called **batch normalization**, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch Normalization speeds up learning.

In the previous section, we normalized inputs by subtracting the mean and dividing by the standard deviation. This improved the shape of the cost function and helped gradient descent reach the minimum faster. The question is: for any hidden layer, can we normalize the activations to train the weights and the biases faster? This is what batch normalization is about.

There are some debates in the deep learning literature about whether we should normalize the values before the activation function (`Z[l]`) or after applying it (`A[l]`). In practice, normalizing `Z[l]` is done more often.

Algorithm:
* Let `z(i)` be the z vector at layer `[l]` for observation `(i)`, i.e. $z^{(i)}_1, z^{(i)}_2, ..., z^{(i)}_{n^{[l]}}$, where $n^{[l]}$ is the number of units in layer `[l]`
* Given `Z[l] = [z(1), ..., z(m)]`, i = 1 to m (one column per example)
* Compute `mean = 1/m * sum(z(i))`
* Compute `variance = 1/m * sum((z(i) - mean)^2)`
* Then `z_norm(i) = (z(i) - mean) / np.sqrt(variance + epsilon)` (add epsilon for numerical stability if variance = 0)
    * Forcing the inputs to a distribution with zero mean and variance of 1.
* Then `z_tilde(i) = gamma * z_norm(i) + beta`
    * This lets the values follow a different distribution (with another mean and variance).
    * gamma and beta are learnable parameters of the model.
    * We don't want the hidden units to always have mean 0 and variance 1. Maybe it makes sense for hidden units to have a different distribution.
    * In this way the NN learns the distribution of each layer's values.
    * Note: if `gamma = sqrt(variance + epsilon)` and `beta = mean` then `Z_tilde[l] = Z[l]`
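The algorithm above can be written as a short NumPy function. This is a minimal sketch of the forward pass only, assuming `Z` has shape `(n_units, m)` and that `gamma` and `beta` hold one value per hidden unit, with shape `(n_units, 1)`. The function name `batchnorm_forward` and the value of `eps` are assumptions made for illustration.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Batch-normalize Z over the examples in the (mini-)batch.

    Z:     (n_units, m) pre-activations, one column per example
    gamma: (n_units, 1) learnable scale
    beta:  (n_units, 1) learnable shift
    """
    mean = Z.mean(axis=1, keepdims=True)       # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)         # per-unit variance over the batch
    Z_norm = (Z - mean) / np.sqrt(var + eps)   # zero mean, (almost) unit variance
    Z_tilde = gamma * Z_norm + beta            # learnable rescale and shift
    return Z_tilde, Z_norm, mean, var

rng = np.random.default_rng(1)
Z = rng.normal(3.0, 2.0, size=(4, 64))
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde, Z_norm, mean, var = batchnorm_forward(Z, gamma, beta)

# Choosing gamma = sqrt(var + eps) and beta = mean undoes the normalization.
eps = 1e-5
Z_back, _, _, _ = batchnorm_forward(Z, np.sqrt(var + eps), mean, eps)
```

The last two lines check the note above numerically: with that particular choice of `gamma` and `beta`, the layer reproduces its input.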
    
    
## 2.2. Fitting Batch Normalization into a neural network

* Using batch norm in a NN with 3 hidden layers:
<img src="figures/bn.png" alt="bn" style="width: 900px;"/>
* Our NN parameters will be:
    * `W[1], b[1], ..., W[L], b[L], beta[1], gamma[1], ..., beta[L], gamma[L]`
    * `beta[1], gamma[1], ..., beta[L], gamma[L]` are updated using any optimization algorithm (like GD, RMSprop, Adam)
* If you are using a deep learning framework, you won't have to implement batch norm yourself:
    * Ex. in Tensorflow you can add this line: `tf.nn.batch_normalization()`
    * Ex. in tf.keras, you can add this layer: `tf.keras.layers.BatchNormalization`
* Batch normalization is usually applied with mini-batches.
* If we are using batch normalization, the parameters `b[1], ..., b[L]` have no effect because they are cancelled by the mean subtraction step, so:
```python
Z[l] = W[l]A[l-1] + b[l] => Z[l] = W[l]A[l-1]
Z_norm[l] = ...
Z_tilde[l] = gamma[l] * Z_norm[l] + beta[l]
```
* `b[l]` adds the same constant to every example, so subtracting the mean removes it.
* So if you are using batch normalization, you can remove `b[l]` or make it always zero.
* So the parameters will be `W[l]`, `beta[l]`, and `gamma[l]`.
* Shapes:
    * `Z[l] - (n[l], m)`
    * `beta[l] - (n[l], 1)`
    * `gamma[l] - (n[l], 1)`
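The claim that `b[l]` is cancelled by the mean subtraction can be checked numerically. The snippet below is a small sketch with made-up layer sizes and random values. The helper `normalize_rows` is a hypothetical name for the normalization step, and `gamma` and `beta` are assumed to have one entry per hidden unit, i.e. shape `(n[l], 1)`, so that they broadcast across the `m` examples.

```python
import numpy as np

rng = np.random.default_rng(2)
n_prev, n_l, m = 3, 4, 32                  # assumed layer sizes and batch size
A_prev = rng.normal(size=(n_prev, m))      # activations of the previous layer
W = rng.normal(size=(n_l, n_prev))
b = rng.normal(size=(n_l, 1))

def normalize_rows(Z, eps=1e-5):
    """Normalize each hidden unit (row) over the examples in the batch."""
    mean = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mean) / np.sqrt(var + eps)

Z_with_bias = W @ A_prev + b
Z_without_bias = W @ A_prev

# The bias shifts every example of a unit by the same amount,
# so it disappears once the per-unit mean is subtracted.
same = np.allclose(normalize_rows(Z_with_bias), normalize_rows(Z_without_bias))

gamma = np.ones((n_l, 1))                  # one scale per hidden unit
beta = np.zeros((n_l, 1))                  # one shift per hidden unit
Z_tilde = gamma * normalize_rows(Z_without_bias) + beta
```

Here `same` evaluates to `True`, and `beta` takes over the role of the bias as the learnable shift.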

## 2.3. Why does Batch normalization work?

* The first reason is the same reason as why we normalize X.
* The second reason is that batch normalization reduces the problem of a layer's input distribution changing as the earlier layers are updated (the problem of covariate shift):

<img src="figures/covariate_shift.png" alt="covariate_shift" style="width: 600px;"/>

* Batch normalization does some regularization:
    * Each mini-batch is scaled by the mean/variance computed on that mini-batch only.
    * This adds some noise to the values `Z[l]` within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations.
    * This has a slight regularization effect.
    * Using a bigger mini-batch size reduces this noise and therefore the regularization effect.
    * Don't rely on batch normalization as a regularizer. It's intended for normalizing the hidden units' activations and therefore speeding up learning. For regularization use other techniques (L2 or dropout).

## 2.4. Batch normalization at test time

* When we train a NN with Batch normalization, we compute the mean and the variance of the mini-batch.
* In testing we might need to process examples one at a time. The mean and the variance of one example won't make sense.
* We have to compute an estimate of the mean and variance to use at test time.
* We can use an exponentially weighted average of the mean and variance across the mini-batches seen during training.
* We then use these estimated values of the mean and variance at test time.
* This method is also sometimes called a "running average".
* In practice most often you will use a deep learning framework and it will contain some default implementation of doing such a thing.
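The running-average idea can be sketched as follows. This is an illustrative sketch, not the implementation of any particular framework: the `momentum` value of 0.9, the function names, and the synthetic mini-batches are all assumptions. During training, each mini-batch's statistics update the running estimates; at test time, the stored estimates replace the batch statistics, so a single example can be processed.

```python
import numpy as np

def update_running(running, batch_stat, momentum=0.9):
    """Exponentially weighted average of a per-unit batch statistic."""
    return momentum * running + (1.0 - momentum) * batch_stat

rng = np.random.default_rng(3)
n_units, batch_size = 3, 64
running_mean = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# Stand-in for training: mini-batches of Z drawn with mean 2 and std 3.
for _ in range(500):
    Z = rng.normal(2.0, 3.0, size=(n_units, batch_size))
    running_mean = update_running(running_mean, Z.mean(axis=1, keepdims=True))
    running_var = update_running(running_var, Z.var(axis=1, keepdims=True))

def batchnorm_inference(z, gamma, beta, running_mean, running_var, eps=1e-5):
    """Normalize with the stored estimates instead of batch statistics."""
    z_norm = (z - running_mean) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

# A single test example can now be normalized on its own.
z_single = rng.normal(2.0, 3.0, size=(n_units, 1))
out = batchnorm_inference(z_single, np.ones((n_units, 1)),
                          np.zeros((n_units, 1)), running_mean, running_var)
```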

# 3. Exploding / Vanishing Gradients

When training a deep neural network with gradient-based learning and backpropagation, we find the partial derivatives by traversing the network from the final layer (y_hat) to the initial layer. By the chain rule, the derivatives for layers farther from the output are obtained through repeated matrix multiplications.

In a network of n hidden layers, n derivatives will be multiplied together. If the derivatives are large, the gradient will increase exponentially as we propagate back through the model until it eventually explodes; this is the exploding gradient problem. Alternatively, if the derivatives are small, the gradient will decrease exponentially as we propagate through the model until it eventually vanishes; this is the vanishing gradient problem.

In order to explain this phenomenon, let's consider a neural network with $L$ layers, let's say all the activation functions are linear and each bias $b = 0$; we can then write:

```python
Y = W[L]W[L-1].....W[2]W[1]X
```

If we have 2 hidden units per layer and x1 = x2 = 1, we get:

```python
if W[l] = [1.5   0] 
          [0   1.5] (l != L because of different dimensions in the output layer)
Y = W[L]  [1.5   0]^(L-1) X = 1.5^L 	# which will be very large
          [0   1.5]
```

```python
if W[l] = [0.5   0]
          [0   0.5]
Y = W[L]  [0.5   0]^(L-1) X = 0.5^L 	# which will be very small
          [0   0.5]
```

* The last example shows that the activations (and similarly the derivatives) decrease or increase exponentially as a function of the number of layers.
* So if the weights are slightly larger than the identity matrix `I`, the activations and gradients will explode.
* And if the weights are slightly smaller than the identity matrix `I`, the activations and gradients will vanish.
* Microsoft trained a network with 152 layers (ResNet), which is a really big number. With such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values could get really big or really small. This makes training difficult: if your gradients are exponentially small in L, gradient descent takes tiny steps and needs a long time to learn anything.
* **This shows how the activations can explode or vanish! We can use the same logic and backpropagation to show that this can also happen to gradients.**
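The exponential growth and decay can be reproduced with a few lines of NumPy. This is a simplified sketch of the example above: it assumes linear activations, zero biases, and the same 2x2 weight matrix in every layer (ignoring the differently shaped output layer). The function name `forward_linear` and the depth of 50 are arbitrary illustrative choices.

```python
import numpy as np

def forward_linear(scale, num_layers, x):
    """Propagate x through num_layers linear layers with W = scale * I."""
    W = scale * np.eye(2)
    a = x
    for _ in range(num_layers):
        a = W @ a          # linear activation, no bias
    return a

x = np.ones((2, 1))                     # x1 = x2 = 1
big = forward_linear(1.5, 50, x)        # every entry equals 1.5 ** 50
small = forward_linear(0.5, 50, x)      # every entry equals 0.5 ** 50

print(big.ravel(), small.ravel())
```

With 50 layers, the entries of `big` are on the order of 1e8 while those of `small` are on the order of 1e-15, even though the weights differ from the identity by only 0.5.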




# 4. Weights Initialization

It's important for weights to be initialized randomly. This breaks symmetry and makes sure different hidden units can learn different things. If the weights are all initialized to the same value (for example zeros), every neuron in a layer computes exactly the same value and receives the same update, leading to the problem of symmetry. It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly. One thing to pay attention to though is that different initializations lead to different results.
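The symmetry problem is easy to see in a forward pass. The snippet below is a small sketch with arbitrary sizes, a tanh activation, and made-up input data, all of which are assumptions for illustration. With constant weights, every hidden unit produces the same activations; with small random weights, they differ.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 10))               # 3 input features, 10 examples

# All weights equal: every hidden unit computes the same function of X.
W_const = np.full((4, 3), 0.5)
A_const = np.tanh(W_const @ X)
rows_identical = np.allclose(A_const, A_const[0])

# Small random weights: the hidden units start out different.
W_rand = rng.standard_normal((4, 3)) * 0.01
A_rand = np.tanh(W_rand @ X)
rows_differ = not np.allclose(A_rand, A_rand[0])
```

Because identical units also receive identical gradients, gradient descent keeps them identical, so the layer behaves as if it had a single unit.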

A partial solution to the vanishing / exploding gradients problem in a neural network is a more careful choice of the random initialization of the weights. In a single neuron (Perceptron model): $z = w_1x_1 + w_2x_2 + ... + w_nx_n$
So if $n_x$ is large, we want the W's to be smaller so that the activations do not explode.

One way to solve this problem is to draw the weights from a distribution with variance $1/n_x$.

For layer $[l]$, we can sample the weights W from a normal distribution with mean 0 and variance $1/n_{[l-1]}$, where $n_{[l-1]}$ is the number of neurons in the preceding layer that feed into each neuron at layer $[l]$:

```python
np.random.randn(n[l], n[l-1]) * np.sqrt(1/n[l-1]) # W[l] has shape (n[l], n[l-1]); randn samples a standard normal
```

This is known as "Xavier initialization", and it's mostly used with the tanh activation function. Another variation, often called He initialization, is preferred with ReLU activation functions. It samples the weights as follows:

```python
np.random.randn(n[l], n[l-1]) * np.sqrt(2/n[l-1]) # W[l] has shape (n[l], n[l-1])
```

Bengio et al. sample the weights as follows:


```python
np.random.randn(n[l], n[l-1]) * np.sqrt(2/(n[l-1] + n[l])) # W[l] has shape (n[l], n[l-1])
```
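The three schemes can be collected into one helper. This is a minimal sketch, assuming the weights are drawn from a zero-mean normal distribution (a standard normal sample scaled by the square root of the target variance) and that `W[l]` has shape `(n[l], n[l-1])`. The function name `init_weights` and the method labels `"xavier"`, `"he"`, and `"bengio"` are hypothetical names chosen here for the three variances described above.

```python
import numpy as np

def init_weights(n_in, n_out, method="xavier", rng=None):
    """Sample a weight matrix of shape (n_out, n_in) with a scaled variance.

    n_in:  number of units in the previous layer, n[l-1]
    n_out: number of units in the current layer, n[l]
    """
    rng = np.random.default_rng() if rng is None else rng
    if method == "xavier":        # variance 1 / n[l-1], used with tanh
        var = 1.0 / n_in
    elif method == "he":          # variance 2 / n[l-1], used with ReLU
        var = 2.0 / n_in
    elif method == "bengio":      # variance 2 / (n[l-1] + n[l])
        var = 2.0 / (n_in + n_out)
    else:
        raise ValueError(f"unknown method: {method}")
    # standard_normal has variance 1; scaling by sqrt(var) gives variance var.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(var)

rng = np.random.default_rng(4)
W_he = init_weights(512, 256, method="he", rng=rng)
b = np.zeros((256, 1))            # biases can start at zero
```

For the layer above, the empirical variance of `W_he` should be close to `2 / 512`.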