<a href="https://colab.research.google.com/github/guanshenwang/dscamp_public/blob/master/Project%20Object%20Recognition/Tutorials/1%20Model%20Initialization/pt1_Weight_Initialization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Networks Weight Initialization

Training a neural network requires choosing initial values for its weights.

A well chosen initialization can:
<ul>
<li>Speed up the convergence of gradient descent
<li>Increase the odds of gradient descent converging to a lower training (and generalization) error
</ul>


We consider three possible weight initialization methods in this module:

---


<ul>
<li> <i>Zero Initialization</i> -- This initializes all the weights to zero.
<li> <i>Random Initialization</i> -- This initializes the weights to large random values.
<li> <i>He Initialization</i> -- This initializes the weights to random values scaled according to a paper by He et al., 2015.
</ul>

These three initialization methods were applied to a three-layer neural network trained to separate the blue dots from the red dots shown in the image below.

![title](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/input_image.png)

Now we can see how each initialization method performs on this classification task.

There are two types of parameters to initialize in a neural network:
<ul>
<li>the weight matrices (${W^{[1]},W^{[2]},W^{[3]},\dots,W^{[L-1]},W^{[L]}}$)
<li>the bias vectors (${b^{[1]},b^{[2]},b^{[3]},\dots,b^{[L-1]},b^{[L]}}$)
    </ul>

## 1. Zero Weight Initialization

We see in the plot below that the cost does not decrease during the training phase.

![zero_weights_performance](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/zero_weights_performance.png)

We can also look at the decision boundary obtained after training the model (we would expect a boundary that separates the red dots from the blue dots):

![zero_weight_result](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/zero_weights_result.png)

The decision boundary is -- there is no boundary.

Why? Because the model makes exactly the same prediction for every example.



Initializing all weights to the same value prevents the network from breaking symmetry: every neuron in a layer computes the same output and receives the same gradient update, so they all learn the same thing. In other words, the network is no more powerful than a linear classifier such as logistic regression.

Below is a simple neural network:

![title](https://qph.cf2.quoracdn.net/main-qimg-c63faaef309939f9bfa4887b4ed537df)

In the hidden layer, all neurons (blue circles) will learn exactly the same values because they all start from exactly the same point!

Even worse, with some activation functions, zero initialization produces zero gradients for the weights, so gradient descent never moves them. The whole network then learns nothing but zeros.
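The symmetry argument can be checked numerically. The sketch below is not the notebook's three-layer model: it assumes a hypothetical one-hidden-layer network with tanh hidden units, a sigmoid output, cross-entropy loss, and no biases, and it uses made-up toy data. It runs a single backward pass and compares the gradient rows that belong to different hidden neurons.

```python
import numpy as np

def hidden_layer_gradients(W1, W2, X, y):
    """One forward/backward pass through a toy 1-hidden-layer network
    (tanh hidden units, sigmoid output, cross-entropy loss, no biases)."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)                    # hidden activations, shape (n_hidden, m)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))   # predictions, shape (1, m)
    dZ2 = A2 - y                            # loss gradient w.r.t. output pre-activation
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))             # 8 toy examples with 2 features (assumed data)
y = (X[0:1] * X[1:2] > 0).astype(float)     # toy labels, shape (1, 8)

# Constant initialization: every hidden neuron gets the same gradient row.
dW1_const, _ = hidden_layer_gradients(np.full((3, 2), 0.5), np.full((1, 3), 0.5), X, y)
rows_identical = np.allclose(dW1_const, dW1_const[0])

# Zero initialization: no weight receives any gradient at all.
dW1_zero, dW2_zero = hidden_layer_gradients(np.zeros((3, 2)), np.zeros((1, 3)), X, y)

# Random initialization: hidden neurons receive different gradients.
dW1_rand, _ = hidden_layer_gradients(
    rng.standard_normal((3, 2)), rng.standard_normal((1, 3)), X, y
)
rows_differ = not np.allclose(dW1_rand, dW1_rand[0])

print(rows_identical, np.allclose(dW1_zero, 0), np.allclose(dW2_zero, 0), rows_differ)
```

With identical starting weights the hidden neurons stay identical after every update, and with zero weights this toy network gets no weight update at all. Only the random start gives each neuron its own direction to move in.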


<b>What you should remember</b>:
<ul>
<li>The weights ${W^{[l]}}$ should NOT all be initialized to the same value, including zero. Instead, use random initialization to break symmetry.
<li>It is, however, okay to initialize the biases ${b^{[l]}}$ to zeros. Symmetry is still broken so long as ${W^{[l]}}$ is initialized randomly.
</ul>

**Pros and Cons of zero initialization:**

Pros:
1.   Convenient? At least it is fast to train.

Cons:
1.   Only works for linear problems. (But we need a neural network!)
2.   Sometimes it learns nothing at all.

It is simply a bad idea. Do not initialize weights to zeros.


## 2. Random Weight Initialization

Now we know the weights have to be set to different values. That way, each neuron can learn a different function of its inputs.

Now, what happens if the weights are initialized randomly, but to very large values?

Each weight matrix is initialized with the values ```np.random.randn( layers_dims[l],layers_dims[l-1] ) * 10```. This draws normally distributed random values with mean 0 and standard deviation 1 and multiplies each of them by 10. The bias vectors are initialized to zero.
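As a minimal sketch of the scheme just described, the function below builds every weight matrix with `np.random.randn(...) * 10` and every bias vector with zeros. The function name, the `"W1"`/`"b1"` dictionary keys, the `seed` argument, and the example `layers_dims` are assumptions made for illustration, not names taken from the notebook's code.

```python
import numpy as np

def initialize_parameters_random(layers_dims, scale=10.0, seed=0):
    """Large random weights, zero biases.

    layers_dims lists the layer sizes, input layer first.
    Returns a dict with keys "W1", "b1", ..., "WL", "bL" (hypothetical naming).
    """
    np.random.seed(seed)                      # fixed seed so the sketch is reproducible
    parameters = {}
    for l in range(1, len(layers_dims)):
        # Standard normal draws (mean 0, std 1) multiplied by the scale factor.
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * scale
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

# Hypothetical three-layer network: 2 inputs, hidden layers of 10 and 5 units, 1 output.
params = initialize_parameters_random([2, 10, 5, 1])
for name, value in params.items():
    print(name, value.shape)
```

Exposing `scale` as a parameter makes it easy to rerun the same experiment with a smaller factor and compare the cost curves.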

![random_weights_performance](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/random_weights_performance.png)

Good news: the cost function curve shows the network is learning! Cost is getting lower as we run more iterations.


Now we have a decision boundary that shows what the model predicts for each data point.

Bad news: training takes a long time, especially when the random initial values are too large.

![random_weight_result](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/random_weights_results.png)

<b>What to know</b>:
<ul>
<li>Random initialization lets the neural network do its job: each neuron is able to learn something different.

<li>But we need to pay attention to the scale of the weights: if we start from large values, the gradient (introduced in the next session) can become too large to handle. This is called the exploding gradient problem: training becomes very expensive and the results are unstable.

<li>What about starting from small values? Then the gradient may be too small to make a significant move. This is called the vanishing gradient problem.
</ul>


A very good explanation of exploding and vanishing gradients:

https://towardsdatascience.com/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11
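The effect of the weight scale can be seen without any training. The sketch below pushes random inputs through a stack of ReLU layers and records how large the activations are after each layer. The depth, width, and batch size are arbitrary choices for illustration, and the function name is hypothetical. It only shows the forward pass, but during backpropagation the gradients are multiplied by the same weight matrices, so they tend to grow or shrink in a similar way.

```python
import numpy as np

def activation_scale_per_layer(scale, n_layers=10, width=100, seed=0):
    """Feed random inputs through n_layers ReLU layers whose weights are
    standard normal draws times `scale`; return the std of each layer's output."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, 256))          # 256 random input examples
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * scale
        A = np.maximum(0.0, W @ A)                 # ReLU activation
        stds.append(float(A.std()))
    return stds

large = activation_scale_per_layer(scale=10.0)     # weights far too large
small = activation_scale_per_layer(scale=0.01)     # weights far too small

print("scale 10:   first layer %.3g, last layer %.3g" % (large[0], large[-1]))
print("scale 0.01: first layer %.3g, last layer %.3g" % (small[0], small[-1]))
```

With the large scale the signal grows by orders of magnitude from layer to layer, and with the small scale it shrinks toward zero. Neither gives gradient descent a useful starting point.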


**Pros and Cons of random initialization:**

Pros:

1.   Allows the neural network to learn more complex functions
2.   Easy to implement

Cons:

1.   Initializing to large weights may lead to high computational cost and unstable results.
2.   Smaller initial weights will slow down the optimization process.



## 3. He Initialization

Finally, we try "He Initialization", named after the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except that Xavier initialization scales the weights by ```sqrt(1./layers_dims[l-1])``` where He initialization uses ```sqrt(2./layers_dims[l-1])```.)

The only difference from the previous section is that instead of multiplying ```np.random.randn(..,..)``` by 10, we multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialization recommends for layers with a ReLU activation.
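A minimal sketch of this scaling, under the same assumptions as before: the function name, the `"W1"`/`"b1"` dictionary keys, the `seed` argument, and the layer sizes are illustrative and not taken from the notebook's code. Biases are kept at zero, as in the random initialization above.

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=0):
    """He initialization: standard normal weights scaled by
    sqrt(2 / size of the previous layer), zero biases."""
    np.random.seed(seed)                      # fixed seed so the sketch is reproducible
    parameters = {}
    for l in range(1, len(layers_dims)):
        he_scale = np.sqrt(2.0 / layers_dims[l - 1])
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * he_scale
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

# Hypothetical three-layer network: 2 inputs, hidden layers of 10 and 5 units, 1 output.
params = initialize_parameters_he([2, 10, 5, 1])

# Sanity check on a wide layer: the empirical std should be close to sqrt(2 / 500).
check = initialize_parameters_he([500, 400])
print(check["W1"].std(), np.sqrt(2.0 / 500))
```

Replacing `2.0` with `1.0` in `he_scale` gives the Xavier scaling factor mentioned above.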

![he_weights_performance](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/he_weights_performance.png)

After training, it works like a charm!

![he_weights_results](https://raw.githubusercontent.com/mahadevprakash90/dscamp_public/Aditya/Project%20Object%20Recognition/Tutorials/img/he_weights_result.png)

<b>Observations:</b>
<ul>
<li>The model with He initialization separates the blue and the red dots very well in a small number of iterations.
</ul>

**Pros and Cons of He initialization:**

Pros:

1.   Allows fast optimization
2.   Can learn complex and highly non-linear patterns



## 4. Conclusion

We have seen three different types of initialization. For the same number of iterations and the same hyperparameters, the comparison is:

|Model|Train Accuracy| Comment|
|---|---|---|
|3-layer NN with zeros initialization|50%|fails to break symmetry|
|3-layer NN with large random initialization|83%|too large weights|
|3-layer NN with He initialization|99%|recommended method|

What you should remember from this notebook:
<ul>
<li>Different initializations lead to different results.
<li>Random initialization is used to break symmetry and make sure different hidden units can learn different things.
<li>Don't initialize to values that are too large.
<li>He initialization works well for networks with ReLU activations.
</ul>