<h2 id="Contents">Contents<a href="#Contents"></a></h2>
        <ol>
        <li><a class="" href="#Imports">Imports</a></li>
<li><a class="" href="#Initialization">Initialization</a></li>
<li><a class="" href="#Different-Types-of-Initialization">Different Types of Initialization</a></li>
<ol><li><a class="" href="#Zero-Initialization">Zero Initialization</a></li>
<li><a class="" href="#Random-Initialization">Random Initialization</a></li>
<li><a class="" href="#He-Initialization">He Initialization</a></li>
<li><a class="" href="#Glorot-Initialization">Glorot Initialization</a></li>
</ol>
</ol>

# Imports

In [3]:
import numpy as np
import matplotlib.pyplot as plt

# Initialization

Training a neural network requires specifying initial values for the weights, and the choice of initialization method affects how well the learning process goes.

A well-chosen initialization can:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error

# Different Types of Initialization

The weights of a neural network can be initialized in several ways, including:
1. Zero initialization: Set all weights to 0.
2. Random initialization: Draw all weights from a random distribution (a scaled standard normal in this notebook).
3. He initialization: Draw all weights from a normal distribution with a mean of 0 and standard deviation of $\sqrt{2 / n}$.
4. Glorot initialization: Draw all weights from a normal distribution with a mean of 0 and standard deviation of $\sqrt{2 / (n + m)}$.
5. Xavier initialization: Draw all weights from a normal distribution with a mean of 0 and standard deviation of $\sqrt{1 / n}$.

Here $n$ is the number of input units and $m$ is the number of output units of the current layer. Note that "Glorot" and "Xavier" both refer to the same author, Xavier Glorot, so the two names are often used interchangeably elsewhere; this notebook uses "Xavier" for the variant that depends only on $n$.
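
As a quick worked example of the three formulas, the sketch below evaluates the standard deviations for a hypothetical layer with $n = 4$ input units and $m = 3$ output units (these sizes are assumptions chosen for illustration, not values from the notebook).

```python
import numpy as np

n, m = 4, 3  # hypothetical layer: 4 input units, 3 output units

std_he = np.sqrt(2 / n)            # He: sqrt(2 / n)
std_glorot = np.sqrt(2 / (n + m))  # Glorot: sqrt(2 / (n + m))
std_xavier = np.sqrt(1 / n)        # Xavier (as defined above): sqrt(1 / n)

print(f"He:     {std_he:.4f}")      # 0.7071
print(f"Glorot: {std_glorot:.4f}")  # 0.5345
print(f"Xavier: {std_xavier:.4f}")  # 0.5000
```

For the same layer, He initialization produces the widest spread of weights and Xavier the narrowest.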

![](images/0301.png)

## Zero Initialization

This is the most trivial initialization method. All
- the weight matrices $(W^{[1]}, W^{[2]}, W^{[3]}, \dots, W^{[L-1]}, W^{[L]})$ and
- the bias vectors $(b^{[1]}, b^{[2]}, b^{[3]}, \dots, b^{[L-1]}, b^{[L]})$

are initialized to 0.

In [2]:
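def initialize_parameters_zeros(layers_dims):
    """Return all-zero weights W1..WL and biases b1..bL for the layer sizes in `layers_dims`."""
    parameters = {}
    L = len(layers_dims)  # number of layers, counting the input layer
    for l in range(1, L):
        # W[l] has shape (units in layer l, units in layer l-1)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters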

However, this type of initialization is of no use. Weights that are initialized to zero remain zero throughout training, so the model doesn't learn anything. Let's see why this happens.

Since the weights and biases are zero, every pre-activation is the zero vector, `z = 0`, and with a ReLU activation the output is 0 as well:

$$a = \mathrm{ReLU}(z) = \max(0, z) = 0$$

At the classification layer, where the activation function is sigmoid, you then get (for any input):

$$\sigma(z) = \frac{1}{ 1 + e^{-(z)}} = \frac{1}{2} = y_{pred}$$
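
The forward pass described above can be sketched numerically. This is a minimal illustration, assuming a hypothetical network with 2 inputs, one ReLU hidden layer of 4 units, and a sigmoid output; the layer sizes and the random inputs are not from the notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical 2 -> 4 -> 1 network with every parameter set to zero
W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))  # 5 arbitrary input examples, one per column

A1 = relu(W1 @ X + b1)           # hidden activations: all 0
y_pred = sigmoid(W2 @ A1 + b2)   # output: sigmoid(0) for every example

print(y_pred)  # every entry is 0.5, whatever X contains
```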

Every example therefore gets a predicted probability of 0.5, regardless of its input or its label, which leaves the cost function with nothing to use for adjusting the weights.

Your loss function:
$$ \mathcal{L}(y_{pred}, y) =  - y  \ln(y_{pred}) - (1-y)  \ln(1-y_{pred})$$

For `y=1`, `y_pred=0.5` it becomes:

$$ \mathcal{L}(0.5, 1) =  - (1)  \ln\left(\frac{1}{2}\right) = \ln 2 \approx 0.6931$$

For `y=0`, `y_pred=0.5` it becomes:

$$ \mathcal{L}(0.5, 0) =  - (1 - 0)  \ln\left(1 - \frac{1}{2}\right) = \ln 2 \approx 0.6931$$
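
The two values can be checked with a few lines of NumPy. The helper name `cross_entropy` is introduced here only for illustration.

```python
import numpy as np

def cross_entropy(y_pred, y):
    """Binary cross-entropy loss for a single example."""
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

loss_pos = cross_entropy(0.5, 1)  # true label 1, prediction 0.5
loss_neg = cross_entropy(0.5, 0)  # true label 0, prediction 0.5

print(loss_pos, loss_neg)  # both equal ln(2), about 0.6931
```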

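With the prediction being 0.5, the loss is the same whether the actual value (`y`) is 1 or 0. More importantly, the gradients with respect to the weights vanish: the gradient for the output-layer weights is proportional to the hidden activations, which are all 0, and the gradient flowing back into the hidden layers is multiplied by the output-layer weights, which are also 0. So none of the weights get adjusted and we are stuck with their initial values.

This is why the model makes the same prediction (0 in this notebook's run) for every example!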
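
The zero-gradient behaviour can be checked with one manual backward pass. This is a sketch under assumptions: a hypothetical 2 -> 4 -> 1 network (ReLU hidden layer, sigmoid output, cross-entropy loss) and made-up data, none of which come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))   # 5 hypothetical examples, one per column
Y = np.array([[1, 0, 1, 0, 1]])   # hypothetical labels
m = X.shape[1]

# Zero initialization
W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

# Forward pass
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)              # ReLU
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))  # sigmoid

# Backward pass for the cross-entropy loss
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m                # zero, because A1 is zero
db2 = dZ2.mean(axis=1, keepdims=True)
dZ1 = (W2.T @ dZ2) * (Z1 > 0)       # zero, because W2 is zero
dW1 = dZ1 @ X.T / m

print(np.abs(dW1).max(), np.abs(dW2).max())  # 0.0 0.0
print(db2)                                   # [[-0.1]]
```

In this sketch only the output bias receives a non-zero gradient (because the labels are not perfectly balanced), so the network can at best shift one constant prediction for all examples; the weights themselves never move.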

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, so we might as well be training a neural network with $n^{[l]}=1$ for every layer. This way, the network is no more powerful than a linear classifier like logistic regression.
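
Symmetry can also be observed directly. The sketch below makes several assumptions that are not in the notebook: a 2 -> 3 -> 1 network with a tanh hidden layer, every weight initialized to the same non-zero constant 0.5 (so that gradients are non-zero and the weights actually move), synthetic data, and plain gradient descent with a learning rate of 0.1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8))              # 8 synthetic examples
Y = (X[0:1] + X[1:2] > 0).astype(float)      # synthetic labels, shape (1, 8)
m = X.shape[1]

# Every hidden unit starts with identical weights
W1, b1 = np.full((3, 2), 0.5), np.zeros((3, 1))
W2, b2 = np.full((1, 3), 0.5), np.zeros((1, 1))

for _ in range(100):
    # Forward pass
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    # Backward pass (cross-entropy loss)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.mean(axis=1, keepdims=True)
    # Gradient descent update
    W1 -= 0.1 * dW1
    b1 -= 0.1 * db1
    W2 -= 0.1 * dW2
    b2 -= 0.1 * db2

rows_identical = np.allclose(W1, W1[0])  # compare every row with the first
weights_moved = not np.allclose(W1, 0.5)
print(rows_identical, weights_moved)
```

The weights do change during training, but all three hidden units receive the same update at every step, so they remain copies of one another and the layer behaves like a single unit.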

## Random Initialization

To break symmetry, initialize the weights randomly. Following random initialization, each neuron can then proceed to learn a different function of its inputs.

In [4]:
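def initialize_parameters_random(layers_dims):
    """Return random weights (standard normal scaled by 10) and zero biases."""
    parameters = {}
    L = len(layers_dims)  # number of layers, counting the input layer
    for l in range(1, L):
        # Scaling by 10 gives the weights a standard deviation of 10
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters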

With random initialization, the model starts out making essentially random predictions but improves with every epoch.

Random initialization breaks symmetry, but it is far from perfect. A poorly scaled initialization can lead to vanishing or exploding gradients, which slows down the optimization algorithm. In particular, initializing with overly large random numbers slows down the optimization.
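
To see why the scale matters, the sketch below compares the pre-activations `Z = W @ X` of a single layer for three weight scales. The layer size, batch size, and the threshold `|z| > 5` (beyond which the sigmoid derivative is below 0.007) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, n_examples = 100, 50, 1000  # hypothetical layer and batch sizes
X = rng.standard_normal((n_in, n_examples))

scales = {"x10": 10.0, "x0.01": 0.01, "He": np.sqrt(2 / n_in)}
z_std = {}
saturated = {}
for name, scale in scales.items():
    W = rng.standard_normal((n_out, n_in)) * scale
    Z = W @ X
    z_std[name] = Z.std()
    # Fraction of pre-activations where a sigmoid would be nearly flat
    saturated[name] = np.mean(np.abs(Z) > 5)
    print(f"{name:>6}: std(Z) = {z_std[name]:8.3f}, saturated = {saturated[name]:.3f}")
```

With a scale of 10, almost every pre-activation lands where a sigmoid is saturated and its gradient is close to zero; with a very small scale the signal shrinks instead. A scale tied to the layer size keeps the pre-activations at a moderate magnitude.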

## He Initialization

He initialization is named for the first author of He et al., 2015. It is the same as random initialization, the only difference being the factor that scales the standard normal samples: the random initialization above uses a fixed constant (10), while He initialization uses $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is then the standard deviation of the weights. Changing the standard deviation to $\sqrt{\frac{1}{\text{dimension of the previous layer}}}$ gives **Xavier initialization**.

In [5]:
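def initialize_parameters_he(layers_dims):
    """Return He-initialized weights and zero biases for the layer sizes in `layers_dims`."""
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, not counting the input layer

    for l in range(1, L + 1):
        # Scale standard normal samples by sqrt(2 / units in the previous layer)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters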
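
The effect of the factor 2 can be illustrated by pushing random inputs through a stack of ReLU layers and tracking the mean squared activation. This is a rough sketch with hypothetical sizes (width 512, depth 10, no biases), not an experiment from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, n_examples = 512, 10, 1000  # hypothetical network and batch sizes

def mean_square_after_relu_stack(scale):
    """Mean squared activation after `depth` ReLU layers with weights of std `scale`."""
    A = rng.standard_normal((width, n_examples))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale
        A = np.maximum(0, W @ A)
    return np.mean(A ** 2)

ms_he = mean_square_after_relu_stack(np.sqrt(2 / width))      # He scaling
ms_xavier = mean_square_after_relu_stack(np.sqrt(1 / width))  # Xavier scaling

print(f"He:     {ms_he:.4f}")
print(f"Xavier: {ms_xavier:.6f}")
```

With He scaling the activation magnitude stays roughly constant from layer to layer, whereas with the $\sqrt{1/n}$ scaling it roughly halves at each ReLU layer, because ReLU zeroes out about half of the pre-activations.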

## Glorot Initialization

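Glorot initialization follows the same recipe as He initialization, but the scaling takes both the input and the output size of the layer into account. Here, the standard deviation is $\sqrt{\frac{2}{\text{dimension of the previous layer}+\text{dimension of the current layer}}}$.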

In [6]:
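def initialize_parameters_glorot(layers_dims):
    """Return Glorot-initialized weights and zero biases for the layer sizes in `layers_dims`."""
    parameters = {}
    L = len(layers_dims) - 1  # number of layers, not counting the input layer

    for l in range(1, L + 1):
        # Scale standard normal samples by sqrt(2 / (previous layer units + current layer units))
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / (layers_dims[l-1] + layers_dims[l]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters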
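
As a sanity check, the sample standard deviation of a Glorot-initialized weight matrix should be close to the target value. The sketch below assumes a hypothetical layer with 300 inputs and 100 outputs and draws the weights the same way as the function above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 300, 100  # hypothetical layer sizes

target_std = np.sqrt(2 / (n_prev + n_curr))
W = rng.standard_normal((n_curr, n_prev)) * target_std
sample_std = W.std()

print(f"target std: {target_std:.4f}")  # 0.0707
print(f"sample std: {sample_std:.4f}")  # close to the target
```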