Part 1: Understanding Weight Initialization

In [1]:
## Q No. 1 :
"""
When building and training neural networks, the weights must be initialized appropriately for the model to train well
and reach high accuracy. Poorly initialized weights can give rise to the vanishing gradient problem or the
exploding gradient problem.
"""
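
As a rough illustration of the two failure modes named above, the sketch below applies the chain rule through a chain of one-unit layers. The helper name `chain_gradient`, the depth of 20, and the weight values are assumptions chosen for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(weight, depth, x=0.5, activation="sigmoid"):
    """Return d(output)/d(input) for `depth` stacked one-unit layers.

    Hypothetical helper: every layer shares the same scalar weight.
    """
    grad = 1.0
    a = x
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)  # sigmoid derivative, never above 0.25
        else:  # linear layer
            a = z
            local = 1.0
        grad *= weight * local     # chain rule through this layer
    return grad


# Sigmoid layers with weight 1: each factor is at most 0.25, so the
# gradient shrinks to at most 0.25**20 after 20 layers (vanishing).
vanishing = chain_gradient(weight=1.0, depth=20, activation="sigmoid")

# Linear layers with weight 2: the gradient doubles at every layer,
# giving 2**20 = 1048576 after 20 layers (exploding).
exploding = chain_gradient(weight=2.0, depth=20, activation="linear")

print(vanishing, exploding)
```

The same multiplication of per-layer factors happens in real networks, only with matrices instead of scalars.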


In [2]:
## Q No. 2 :
"""
One of the main reasons for the slow convergence and the suboptimal generalization results of MLPs
(Multilayer Perceptrons) trained with gradient descent is the lack of a proper initialization of the weights
to be adjusted. Even sophisticated learning procedures cannot compensate for bad initial weight values,
while a good initial guess leads to fast convergence and/or better generalization capability even with simple
gradient-based error minimization techniques. Although the initial weight space of MLPs is so critical, there has
been no study so far of its properties with regard to which regions lead to solutions or failures concerning
generalization and convergence in real-world problems. There exist only some preliminary studies for toy problems,
like XOR. A data mining approach, based on Self Organizing Feature Maps (SOM), is used in this paper to demonstrate
that a complete analysis of the MLP weight space is possible. This is the main novelty of this paper. The conclusions
drawn from this novel application of the SOM algorithm to MLP analysis significantly extend previous preliminary
results in the literature. MLP initialization procedures are reviewed along with all conclusions drawn so far in the
literature, and an extensive experimental study on more representative tasks, using our data mining approach, reveals
important initial weight space properties of MLPs, extending previous knowledge and literature results.
"""


In [3]:
## Q No. 3 :
"""
The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during
the course of a forward pass through a deep neural network. If either occurs, loss gradients will either be too
large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even 
able to do so at all.

Matrix multiplication is the essential math operation of a neural network. In deep neural nets with several layers,
one forward pass simply entails performing consecutive matrix multiplications at each layer, between that layer’s 
inputs and weight matrix. The product of this multiplication at one layer becomes the inputs of the subsequent layer,
and so on and so forth.
"""
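
The effect of repeated matrix multiplication described above can be sketched with NumPy. This is a simplified setting with assumptions that are not in the original answer: linear layers without activations, square weight matrices, and a hypothetical helper `forward_std` with an arbitrary depth of 50 and width of 256.

```python
import numpy as np


def forward_std(weight_std, depth=50, width=256, seed=0):
    """Standard deviation of the activations after `depth` linear layers.

    Hypothetical helper: each layer multiplies the activations by a fresh
    (width x width) Gaussian weight matrix with the given standard deviation.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        a = W @ a  # output of this layer becomes the input of the next
    return float(np.std(a))


too_large = forward_std(weight_std=1.0)                 # activations explode
too_small = forward_std(weight_std=0.01)                # activations vanish
balanced = forward_std(weight_std=1.0 / np.sqrt(256))   # scale roughly preserved

print(too_large, too_small, balanced)
```

Scaling the weights by `1 / sqrt(width)` keeps the activation scale roughly constant here, which is the idea that the initialization schemes in Part 2 build on.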


Part 2: Weight Initialization Techniques

In [4]:
## Q No. 4 :
"""
Zero initialization makes every neuron in a layer compute the same output and receive the same gradient, so all
neurons learn the same function at every iteration. Random initialization is a better choice because it breaks
this symmetry. However, initializing the weights with values that are too large or too small can slow down
optimization.
"""
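
The symmetry argument can be checked numerically. The sketch below is a minimal hand-written example under stated assumptions: a two-layer network without biases, a sigmoid hidden layer, a linear output, mean squared error, and the XOR inputs. The function `hidden_weight_gradient` is a hypothetical helper, not a library call.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hidden_weight_gradient(W1, W2, x, y):
    """Gradient of the mean squared error with respect to W1.

    Network (biases omitted for brevity): x -> sigmoid(x @ W1) -> h @ W2.
    """
    h = sigmoid(x @ W1)                     # hidden activations, (n, hidden)
    y_hat = h @ W2                          # network output, (n, 1)
    d_out = 2.0 * (y_hat - y) / len(x)      # dLoss/dy_hat
    d_hidden = (d_out @ W2.T) * h * (1.0 - h)
    return x.T @ d_hidden                   # (inputs, hidden)


x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# All-zero weights: no gradient reaches the hidden layer at all.
grad_zero = hidden_weight_gradient(np.zeros((2, 3)), np.zeros((3, 1)), x, y)

# Constant weights: every hidden neuron gets exactly the same gradient column.
grad_const = hidden_weight_gradient(np.full((2, 3), 0.5), np.full((3, 1), 0.5), x, y)

# Small random weights: the gradient columns differ, so the neurons diverge.
rng = np.random.default_rng(0)
grad_rand = hidden_weight_gradient(
    rng.normal(0.0, 0.05, size=(2, 3)), rng.normal(0.0, 0.05, size=(3, 1)), x, y
)

print(grad_zero, grad_const, grad_rand, sep="\n")
```

Each column of the returned matrix belongs to one hidden neuron, so identical columns mean the neurons stay identical after the update.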


In [5]:
## Q No. 5 :
"""
Random initialization for neural networks breaks symmetry and, as a result, tends to improve accuracy.
The weights are drawn at random, typically from a distribution centered on zero with a small spread, so they start
close to zero but differ from one another. Because of this, each neuron no longer performs the same computation.
"""


In [6]:
## Q No. 6 :
'''
Xavier (Glorot) initialization is one of the most widely used methods for initializing weight matrices in neural
networks. It draws weights with variance 2 / (fan_in + fan_out), so that the variance of activations and gradients
stays roughly constant from layer to layer. While it is straightforward to use in a deep learning setup, reflecting
on the mathematical reasoning behind this standard technique is worthwhile.
'''
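
To make the scaling rule concrete, here is a small NumPy sketch of the uniform variant, which samples from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The function name `glorot_uniform` and the layer sizes are assumptions for illustration; this is not the Keras implementation itself.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix with Glorot uniform scaling."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


fan_in, fan_out = 300, 100  # arbitrary example layer sizes
rng = np.random.default_rng(0)
W = glorot_uniform(fan_in, fan_out, rng)

# A uniform distribution on [-limit, limit] has variance limit**2 / 3,
# which equals 2 / (fan_in + fan_out).
target_var = 2.0 / (fan_in + fan_out)
print(W.var(), target_var)
```

The sample variance should land close to `target_var`, with small deviations due to the finite number of weights.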

In [7]:
## Q No. 7 :
"""
In summary, the main difference for machine learning practitioners is the following: He initialization works better
for layers with ReLU activation, while Xavier initialization works better for layers with sigmoid or tanh activation.
"""
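
The difference shows up when a signal passes through a deep stack of ReLU layers. The sketch below is an assumption-laden illustration: it uses plain normal draws with variance 2 / fan_in (He) and 2 / (fan_in + fan_out) (Xavier) rather than the truncated normal that Keras uses, and `relu_forward_rms`, the depth of 30, and the width of 512 are hypothetical choices.

```python
import numpy as np


def relu_forward_rms(scheme, depth=30, width=512, seed=0):
    """Root mean square of the activations after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    for _ in range(depth):
        if scheme == "he":
            std = np.sqrt(2.0 / width)            # variance 2 / fan_in
        else:  # "xavier"
            std = np.sqrt(2.0 / (width + width))  # variance 2 / (fan_in + fan_out)
        W = rng.normal(0.0, std, size=(width, width))
        a = np.maximum(0.0, W @ a)                # ReLU zeroes about half the units
    return float(np.sqrt(np.mean(a ** 2)))


he_rms = relu_forward_rms("he")
xavier_rms = relu_forward_rms("xavier")
print(he_rms, xavier_rms)
```

Because ReLU discards roughly half of the signal, the extra factor of 2 in the He variance is what keeps the activation scale from shrinking layer after layer.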


Part 3: Applying Weight Initialization

In [13]:
## Q No. 8 :
import tensorflow as tf

In [15]:
# Zero Initialization
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.Zeros()
layer = tf.keras.layers.Dense(
  3, kernel_initializer=initializer)

In [16]:
# Random Normal Distribution
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.RandomNormal(
  mean=0., stddev=1.)
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

In [17]:
# Random Uniform Initialization
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.RandomUniform(
  minval=0.,maxval=1.)
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

In [18]:
# Xavier/Glorot Uniform Initialization
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.GlorotUniform()
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

In [19]:
# Xavier/Glorot Normal Initialization
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.GlorotNormal()
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

In [20]:
# He Uniform Initialization
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.HeUniform()
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

In [21]:
# He Normal Initialization
from tensorflow.keras import layers
from tensorflow.keras import initializers
 
initializer = tf.keras.initializers.HeNormal()
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)

In [24]:
# Import python libraries required in this example:
from keras.models import Sequential
from keras.layers import Dense, Activation
import numpy as np

# Use numpy arrays to store inputs (x) and outputs (y):
x = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]]) 

# Define the network model and its arguments. 
# Set the number of neurons/nodes for each layer:
model = Sequential()
model.add(Dense(2, input_shape=(2,)))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('sigmoid')) 
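# Append the Dense(3) `layer` object created in an earlier initializer cell.
# Note: the model output then has shape (None, 3), which differs from y's shape (4, 1),
# so the architecture should be revisited before training on XOR.
model.add(layer)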

# Compile the model with MSE loss and SGD, tracking accuracy as a metric:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy']) 

# Print a summary of the Keras model:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_13 (Dense)            (None, 2)                 6         
                                                                 
 activation_4 (Activation)   (None, 2)                 0         
                                                                 
 dense_14 (Dense)            (None, 1)                 3         
                                                                 
 activation_5 (Activation)   (None, 1)                 0         
                                                                 
 dense_8 (Dense)             (None, 3)                 6         
                                                                 
Total params: 15 (60.00 Byte)
Trainable params: 15 (60.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [25]:
## Q No. 9 :
'''
The simplest way to initialize weights and biases is to set them to small uniform random values, which works
well for neural networks with a single hidden layer. When the network has more than one hidden layer, it is
better to use an initialization scheme such as “Glorot (also known as Xavier) Initialization”.
'''
