In [None]:
Part 1: Understanding Weight Initialization
1. Explain the importance of weight initialization in artificial neural networks. Why is it necessary to initialize 
the weights carefully?
2. Describe the challenges associated with improper weight initialization. How do these issues affect model 
training and convergence?
3. Discuss the concept of variance and how it relates to weight initialization. Why is it crucial to consider the 
variance of weights during initialization?

## 1. Importance of weight initialization in artificial neural networks. Why is it necessary to initialize the weights carefully?

In [None]:
The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network.

In [None]:
If activations explode or vanish during forward propagation, the loss gradients become too large or too small as well. 
In either case the network takes longer to converge, if it is able to converge at all.
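
The effect can be illustrated with a small NumPy experiment. This is only a sketch under simplifying assumptions: the layers are purely linear (no bias, no activation), and the function name `final_activation_std`, the width of 256 and the depth of 30 are hypothetical choices made for illustration.

```python
import numpy as np

def final_activation_std(weight_std, n_layers=30, width=256, seed=0):
    """Push a random batch through a stack of linear layers and return the output spread."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, width))      # batch of unit-variance inputs
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std
        x = x @ W                             # linear layer: no bias, no activation
    return float(x.std())

too_small = final_activation_std(weight_std=0.01)              # signal shrinks layer by layer
too_large = final_activation_std(weight_std=1.0)               # signal grows layer by layer
balanced = final_activation_std(weight_std=1 / np.sqrt(256))   # scale stays roughly constant

print(f"std=0.01      -> {too_small:.3e}")
print(f"std=1.0       -> {too_large:.3e}")
print(f"std=1/sqrt(n) -> {balanced:.3e}")
```

With 256 inputs per layer, each layer multiplies the signal's standard deviation by roughly `sqrt(256) * weight_std`, which is why only the last setting keeps the output on the same scale as the input.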

In [None]:
The weights of artificial neural networks are typically initialized to small random numbers. 
Stochastic gradient descent, the stochastic optimization algorithm used to train the model, expects to start from such a random point in weight space.

## 2. Challenges associated with improper weight initialization. How do these issues affect model training and convergence?

In [None]:
If all the weights are initialized to 0, the derivative of the loss function is the same for every w in W[l], so all weights in a layer still share the same value after every update. 
The hidden units therefore stay symmetric for all n iterations, i.e. a network with zero-initialized weights does no better than a linear model. 
The biases, in contrast, can be initialized to 0 without any harm, because randomly initialized weights already break the symmetry.

In [None]:
Studies have shown that initializing the weights with values sampled from a random distribution instead of constant values like zeros and ones actually helps a neural net train better and faster.

## 3. Discuss the concept of variance and how it relates to weight initialization. Why is it crucial to consider the variance of weights during initialization?

In [None]:
The term variance refers to a statistical measurement of the spread between numbers in a data set. 
More specifically, variance measures how far each number in the set is from the mean (average), and thus from every other number in the set. 
Variance is often denoted by the symbol σ².

In [None]:
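A good initialization keeps the variance of the forward-flowing signal roughly equal across all hidden layers. 
If the variance of the weights is too large, the signal grows from layer to layer; if it is too small, the signal shrinks toward zero. 
This is why the variance of the weights has to be chosen with the layer size in mind.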
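
The link between weight variance and signal variance can be made explicit with a short derivation. It is a sketch under assumptions that are not stated in the text above: a single linear unit with n_in inputs, weights and inputs that are independent and zero-mean, and identically distributed within each group. The symbols z, w_i, x_i and n_in are introduced here for illustration only.

```latex
z = \sum_{i=1}^{n_{\mathrm{in}}} w_i x_i
\quad\Longrightarrow\quad
\mathrm{Var}(z) = \sum_{i=1}^{n_{\mathrm{in}}} \mathrm{Var}(w_i)\,\mathrm{Var}(x_i)
               = n_{\mathrm{in}}\,\mathrm{Var}(w)\,\mathrm{Var}(x)

\mathrm{Var}(z) = \mathrm{Var}(x)
\quad\Longleftrightarrow\quad
\mathrm{Var}(w) = \frac{1}{n_{\mathrm{in}}}
```

Under these assumptions the signal variance is multiplied by n_in · Var(w) at every layer, so any value other than 1 compounds with depth.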

In [None]:
Part 2: Weight Initialization Techniques
4. Explain the concept of zero initialization. Discuss its potential limitations and when it can be appropriate 
to use.
5. Describe the process of random initialization. How can random initialization be adjusted to mitigate 
potential issues like saturation or vanishing/exploding gradients?
6. Discuss the concept of Xavier/Glorot initialization. Explain how it addresses the challenges of improper 
weight initialization and the underlying theory behind it.
7. Explain the concept of He initialization. How does it differ from Xavier initialization, and when is it 
preferred?

## 4. Explain the concept of zero initialization. Discuss its potential limitations and when it can be appropriate to use.

In [None]:
Initializing all the weights with zeros leads the neurons to learn the same features during training. 
In fact, any constant initialization scheme will perform very poorly. Consider a neural network with two hidden units, and assume we initialize all the biases to 0 and the weights with some constant α. Both hidden units then compute the same output for every input and receive the same gradient, so they remain identical throughout training.

In [None]:
If all the weights are initialized to zero, the derivatives remain the same for every w in W[l]. 
As a result, the neurons learn the same features in every iteration. This problem is known as the network failing to break symmetry. 
This holds not only for zero: any constant initialization produces a poor result. Zero initialization remains appropriate for the biases, since randomly initialized weights already break the symmetry.
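
The symmetry problem can be checked numerically. The sketch below assumes a tiny regression network (3 inputs, 2 tanh hidden units, 1 output, mean squared error) with every weight set to a constant `alpha = 0.5` and all biases at zero; the data, the sizes and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

alpha = 0.5
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))      # 8 samples, 3 features
y = rng.standard_normal((8, 1))      # regression targets

W1 = np.full((3, 2), alpha)          # input -> 2 hidden units, every weight equals alpha
W2 = np.full((2, 1), alpha)          # hidden -> output, every weight equals alpha

# Forward pass (biases are zero, so they are omitted)
h = np.tanh(X @ W1)
y_hat = h @ W2

# Backward pass for the mean squared error
d_out = 2 * (y_hat - y) / len(X)
grad_W2 = h.T @ d_out
d_h = (d_out @ W2.T) * (1 - h ** 2)  # derivative of tanh
grad_W1 = X.T @ d_h

print(np.allclose(h[:, 0], h[:, 1]))              # True: both hidden units output the same values
print(np.allclose(grad_W1[:, 0], grad_W1[:, 1]))  # True: both hidden units get the same gradient
```

Because the two gradient columns are equal, a gradient step changes both hidden units in exactly the same way, and they stay copies of each other.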

## 5. Describe the process of random initialization. How can random initialization be adjusted to mitigate potential issues like saturation or vanishing/exploding gradients?

In [None]:
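Random initialization breaks the symmetry between neurons and thereby improves accuracy. 
The weights are drawn at random from a distribution centered on zero, with values very close to zero. 
As a result, each neuron no longer performs the same computation. To avoid saturation and vanishing or exploding gradients, the spread of the distribution is scaled with the layer size, for example with a standard deviation proportional to 1/sqrt(n_in) for a layer with n_in inputs.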
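
How much the scale of the random weights matters for saturation can be sketched with a single tanh layer. The layer sizes, the 0.99 saturation threshold and the helper name `saturated_fraction` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256
X = rng.standard_normal((1000, fan_in))   # unit-variance inputs

def saturated_fraction(weight_std):
    """Fraction of tanh outputs with |h| > 0.99, where the local gradient 1 - h**2 is close to zero."""
    W = rng.standard_normal((fan_in, fan_out)) * weight_std
    h = np.tanh(X @ W)
    return float(np.mean(np.abs(h) > 0.99))

unscaled = saturated_fraction(1.0)                  # pre-activations have std of about sqrt(512)
scaled = saturated_fraction(1 / np.sqrt(fan_in))    # pre-activations have std of about 1

print(f"saturated with std=1.0:           {unscaled:.2%}")
print(f"saturated with std=1/sqrt(fan_in): {scaled:.2%}")
```

With unscaled weights most units sit in the flat part of tanh and pass almost no gradient back, whereas scaling by 1/sqrt(fan_in) keeps nearly all units in the responsive range.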

## 6. Discuss the concept of Xavier/Glorot initialization. Explain how it addresses the challenges of improper weight initialization and the underlying theory behind it.

In [None]:
Xavier Glorot's initialization is one of the most widely used methods for initializing weight matrices in neural networks. 
It is straightforward to use in a deep learning setup, but it is worth understanding the mathematical reasoning behind this standard technique.

In [None]:
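Xavier/Glorot initialization scales the weights so that the variance of the signal stays roughly the same from layer to layer, both in the forward pass and in backpropagation. It was derived for activations that are approximately linear around zero, such as tanh. With ReLU it lets the signal shrink from layer to layer, so He initialization is used for ReLU instead.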
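
As a concrete illustration, the uniform variant of Glorot initialization draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), which gives a weight variance of 2 / (fan_in + fan_out). The NumPy sketch below is a hand-written stand-in rather than a framework implementation; the function name `glorot_uniform` and the 300 x 100 layer shape are chosen for illustration.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = glorot_uniform(300, 100, rng)

target_var = 2.0 / (300 + 100)   # variance that Glorot initialization aims for
print(f"empirical variance: {W.var():.6f}")
print(f"target variance:    {target_var:.6f}")
```

The target is a compromise between 1/fan_in, which preserves the forward signal, and 1/fan_out, which preserves the backward signal.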

## 7. Explain the concept of He initialization. How does it differ from Xavier initialization, and when is it preferred?

In [None]:
He initialization works better for layers with ReLU activation. 
Xavier initialization works better for layers with sigmoid activation.

In [None]:
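He initialization is an initialization method for neural networks that takes into account the non-linearity of activation functions such as ReLU. It draws the weights with variance 2/n_in, where n_in is the number of inputs to the layer, whereas Xavier initialization uses 2/(n_in + n_out).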
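
The difference between the two schemes under ReLU can be sketched with a stack of square Dense+ReLU layers in NumPy. The depth of 20, the width of 512 and the helper names (`relu_stack_scale`, `he_std`, `glorot_std`) are assumptions made for illustration, and the normal variants of both initializers are used.

```python
import numpy as np

def he_std(fan_in, fan_out):
    return np.sqrt(2.0 / fan_in)               # He normal

def glorot_std(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))   # Glorot normal

def relu_stack_scale(std_fn, n_layers=20, width=512, seed=0):
    """Root-mean-square activation after n_layers of linear + ReLU layers."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((128, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * std_fn(width, width)
        x = np.maximum(x @ W, 0.0)             # ReLU zeroes the negative pre-activations
    return float(np.sqrt(np.mean(x ** 2)))

he_scale = relu_stack_scale(he_std)            # stays on the order of 1
glorot_scale = relu_stack_scale(glorot_std)    # shrinks by about 1/sqrt(2) per layer

print(f"He:     {he_scale:.4f}")
print(f"Glorot: {glorot_scale:.6f}")
```

ReLU discards roughly half of the signal energy at each layer, and the factor 2 in the He variance compensates for exactly that loss.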

In [None]:
Part 3: Applying Weight Initialization
8. Implement different weight initialization techniques (zero initialization, random initialization, Xavier 
initialization, and He initialization) in a neural network using a framework of your choice. Train the model 
on a suitable dataset and compare the performance of the initialized models.
9. Discuss the considerations and tradeoffs when choosing the appropriate weight initialization technique 
for a given neural network architecture and task.

## 8. Implement different weight initialization techniques (zero initialization, random initialization, Xavier initialization, and He initialization) in a neural network using a framework of your choice. Train the model on a suitable dataset and compare the performance of the initialized models.

## With explicit weight initializers (He normal)

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

mnist=tf.keras.datasets.mnist

(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

# Create a validation set from the full training data.
# Scale pixel values to [0, 1] by dividing by 255 (the images are uint8 in the 0-255 range).
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# scale the test set as well
X_test = X_test / 255.



from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Build the classifier: two ReLU hidden layers, each with He normal initialization.
model_clf = Sequential()
model_clf.add(Flatten(input_shape=[28, 28], name="inputLayer"))
model_clf.add(Dense(64, activation="relu", name="hiddenLayer1",
                    kernel_initializer=tf.keras.initializers.HeNormal(seed=None)))
model_clf.add(Dense(32, activation="relu", name="hiddenLayer2",
                    kernel_initializer=tf.keras.initializers.HeNormal(seed=None)))
model_clf.add(Dense(10, activation="softmax", name="outputLayer"))

LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]


model_clf.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history=model_clf.fit(X_train,y_train,epochs=EPOCHS,validation_data=VALIDATION_SET,batch_size=32)


pd.DataFrame(history.history)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


| epoch | loss | accuracy | val_loss | val_accuracy |
|---|---|---|---|---|
| 0 | 0.712177 | 0.801018 | 0.336159 | 0.9068 |
| 1 | 0.315951 | 0.910218 | 0.264639 | 0.9238 |
| 2 | 0.263491 | 0.923982 | 0.22806 | 0.9342 |
| 3 | 0.229933 | 0.933527 | 0.202457 | 0.942 |
| 4 | 0.206207 | 0.940364 | 0.194384 | 0.9458 |
| 5 | 0.187263 | 0.9462 | 0.17188 | 0.9528 |
| 6 | 0.171345 | 0.950836 | 0.159354 | 0.9572 |
| 7 | 0.158387 | 0.954345 | 0.159702 | 0.9566 |
| 8 | 0.147018 | 0.958364 | 0.144624 | 0.9608 |
| 9 | 0.137526 | 0.960836 | 0.138137 | 0.9632 |


## Without explicit weight initializers (Keras default: Glorot uniform)

In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

mnist=tf.keras.datasets.mnist

(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

# Create a validation set from the full training data.
# Scale pixel values to [0, 1] by dividing by 255 (the images are uint8 in the 0-255 range).
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# scale the test set as well
X_test = X_test / 255.



from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Build the same classifier without passing kernel_initializer,
# so each Dense layer falls back to the Keras default (glorot_uniform).
model_clf = Sequential()
model_clf.add(Flatten(input_shape=[28, 28], name="inputLayer"))
model_clf.add(Dense(64, activation="relu", name="hiddenLayer1"))
model_clf.add(Dense(32, activation="relu", name="hiddenLayer2"))
model_clf.add(Dense(10, activation="softmax", name="outputLayer"))

LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]


model_clf.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history=model_clf.fit(X_train,y_train,epochs=EPOCHS,validation_data=VALIDATION_SET,batch_size=32)


pd.DataFrame(history.history)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


| epoch | loss | accuracy | val_loss | val_accuracy |
|---|---|---|---|---|
| 0 | 0.79099 | 0.774909 | 0.351218 | 0.905 |
| 1 | 0.332397 | 0.9048 | 0.281841 | 0.9218 |
| 2 | 0.278637 | 0.920236 | 0.241497 | 0.9344 |
| 3 | 0.244738 | 0.929545 | 0.212757 | 0.9406 |
| 4 | 0.218616 | 0.938036 | 0.195551 | 0.9448 |
| 5 | 0.197561 | 0.943018 | 0.180302 | 0.9496 |
| 6 | 0.179976 | 0.948255 | 0.165023 | 0.9554 |
| 7 | 0.164794 | 0.952473 | 0.153268 | 0.961 |
| 8 | 0.15212 | 0.9568 | 0.145173 | 0.9608 |
| 9 | 0.140851 | 0.958927 | 0.138519 | 0.9616 |


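## With He initialization, final validation accuracy is marginally higher (0.9632 vs 0.9616)

In [None]:
Both runs leave the random seed unset, and the run without an explicit initializer still uses the Keras default (Glorot uniform) rather than no initialization at all. 
A gap this small, measured on a single run of each model, is within what run-to-run variation can produce.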

## 9. Discuss the considerations and tradeoffs when choosing the appropriate weight initialization technique for a given neural network architecture and task.

In [None]:
For tanh and sigmoid activations, we use Xavier/Glorot initialization.
For ReLU and its variants, we use He initialization.
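
This rule of thumb can be captured in a small helper. The function `suggest_initializer` is hypothetical and not part of any library; it only returns the string identifiers `"he_normal"` and `"glorot_uniform"`, which Keras accepts for the `kernel_initializer` argument. The set of activations it covers is an assumption and can be extended.

```python
def suggest_initializer(activation: str) -> str:
    """Return a Keras initializer identifier that is a common default for the activation."""
    name = activation.lower()
    if name in {"relu", "leaky_relu"}:
        return "he_normal"        # ReLU family: He initialization
    if name in {"tanh", "sigmoid", "softmax", "linear"}:
        return "glorot_uniform"   # roughly linear around zero: Xavier/Glorot initialization
    raise ValueError(f"no default initializer known for activation {activation!r}")

for act in ("relu", "tanh", "sigmoid"):
    print(act, "->", suggest_initializer(act))

# Possible use with Keras (assumes tensorflow is installed):
# Dense(64, activation="relu", kernel_initializer=suggest_initializer("relu"))
```

A lookup like this is only a starting point; the final choice is best confirmed by comparing seeded training runs on the actual task.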