In [3]:
# Weight initialization is a critical step in training artificial neural networks (ANNs) as it can significantly impact the convergence and performance of the model.
# Proper weight initialization helps in overcoming challenges associated with convergence and ensures that the model learns effectively.
# Why it is necessary to initialize weights carefully:
# - Avoiding vanishing or exploding gradients
# - Symmetry breaking
# - Faster convergence

# Challenges associated with improper weight initialization:
# - Vanishing or exploding gradients: without proper weight initialization, gradients can shrink toward zero or grow without bound as they propagate through the layers, and the model cannot learn efficiently.
# - Divergence: when weights are not properly initialized, the model may fail to converge to a good solution.
# - Slow convergence: if weights are not initialized properly, the network may converge very slowly or get stuck in a local minimum, making it difficult to train effectively.
# - Symmetry issues: poor weight initialization can result in symmetry issues, where neurons in the same layer learn the same features, limiting the expressive power of the network.

# Variance in weight initialization is crucial because it affects the scale of activations in the network.
# Activations that are too small or too large can lead to the issues listed above, such as vanishing or exploding gradients and slow convergence.
# The variance of weights should be carefully chosen based on the architecture of the network and the activation functions used.
# For example, Xavier/Glorot initialization sets the variance of the weights from the number of input and output units in a layer, while He initialization uses the number of input units. Both help keep the signal at a stable scale during training.
# In summary, careful weight initialization is essential to overcome convergence challenges, avoid issues like vanishing or exploding gradients, and promote stable and effective learning in neural networks. Variance determines the scale of the weights, and choosing an appropriate initialization method leads to better training outcomes.
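# The effect of the weight scale on the signal can be checked numerically. The sketch below is a deliberately
# simplified illustration, not part of the original notebook: it stacks five linear layers (no activation function)
# and compares two fixed weight standard deviations. The function name, layer width, and batch size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_std(weight_std, n_layers=5, width=256, batch=512):
    """Push random inputs through a stack of linear layers and return the output std."""
    h = rng.standard_normal((batch, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std
        h = h @ W  # no activation function, to isolate the effect of the weight scale
    return h.std()

std_small = final_activation_std(0.01)  # each layer shrinks the signal
std_large = final_activation_std(1.0)   # each layer amplifies the signal
print(f"weight std 0.01 -> output std {std_small:.2e}")
print(f"weight std 1.00 -> output std {std_large:.2e}")
```

# With 256 inputs per layer, each layer multiplies the signal's standard deviation by about sqrt(256) * weight_std,
# so the signal either collapses or blows up after only a few layers.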

In [4]:
# When all the weights in a neural network are initialized to zero, the derivative of the loss with respect to each weight in a layer's
# weight matrix is the same. As a result, all the weights in that layer stay equal to one another in subsequent iterations, making the hidden units symmetric.
# This continues for every iteration, so each hidden layer behaves as if it had a single unit and the network is little better than a linear model.
# Setting the biases to zero does not cause any issues: non-zero weights already break the symmetry, so even with zero biases the values in each neuron will differ.
# While zero initialization is straightforward and easy to implement, it has several limitations that make it unsuitable for most networks:
# - Symmetry issues: all neurons in a layer initialized to zero learn the same features during training. Neurons in the same layer are updated identically, which limits the expressive power of the network.
# - Zero gradients: during backpropagation, the error signal reaching a hidden layer is multiplied by the next layer's weights. If those weights are all zero, the hidden-layer gradients are zero as well, so learning is slow or stalls, especially in deep networks.
# - Lack of capacity to break symmetry: neurons need different initial weights to break symmetry and allow each neuron to learn different features. Zero initialization fails to provide this diversity.
# When zero initialization can be appropriate:
# - Linear models: in a model with a single linear layer (e.g., linear or logistic regression) there are no hidden units whose symmetry must be broken, so zero initialization works. A multi-layer network with strictly linear activations (e.g., the identity function) still has the symmetry problem, and such a network cannot capture non-linear relationships in the data in any case.
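# The symmetry argument can be made concrete with a tiny network. The following is a minimal sketch, assuming a
# one-hidden-layer tanh network with a squared-error loss; the function name, sizes, and random data are
# illustrative only and do not come from the original notebook.

```python
import numpy as np

def hidden_layer_gradient(init_value, n_in=4, n_hidden=3, seed=0):
    """Return dLoss/dW1 for a tiny tanh network whose weights all equal init_value."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((8, n_in))           # 8 random input rows
    y = rng.standard_normal((8, 1))              # random regression targets
    W1 = np.full((n_in, n_hidden), init_value)   # every hidden weight identical
    W2 = np.full((n_hidden, 1), init_value)      # every output weight identical
    h = np.tanh(x @ W1)                          # hidden activations
    y_hat = h @ W2                               # linear output
    d_out = (y_hat - y) / len(x)                 # gradient of 0.5 * mean squared error w.r.t. y_hat
    d_hidden = (d_out @ W2.T) * (1.0 - h ** 2)   # backpropagate through tanh
    return x.T @ d_hidden                        # gradient w.r.t. W1, shape (n_in, n_hidden)

grad_zero = hidden_layer_gradient(0.0)
grad_const = hidden_layer_gradient(0.5)
print("zero init, gradient is all zeros:", np.allclose(grad_zero, 0.0))
print("constant init, columns identical:", np.allclose(grad_const, grad_const[:, :1]))
```

# Each column of the gradient belongs to one hidden neuron. Identical columns mean every neuron receives the same
# update, so the neurons can never become different from one another.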

In [5]:
# Random weight initialization sets the weights of neural network connections to random values in a specific range. The idea is to break the symmetry between neurons and to reduce the risk of vanishing or exploding gradients during training.
# The simplest method is to draw weights from a normal distribution with a mean of 0 and a standard deviation of 1, so the weights are centred around 0 with a fixed spread.
# However, a fixed spread can produce weights that are too small or too large for a given layer, which can slow down learning or cause numerical instability. Therefore, the more common approach is to scale the randomly generated weights by a factor that depends on the number of input and output connections for each neuron.
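# A short numerical check of why the scaling factor matters, written as a sketch under assumed sizes
# (1000 inputs, 500 outputs, unit-variance inputs; none of these values come from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 1000, 500
x = rng.standard_normal((1000, n_in))        # inputs with unit variance

W_raw = rng.standard_normal((n_in, n_out))   # N(0, 1) weights
W_scaled = W_raw * np.sqrt(1.0 / n_in)       # same weights, scaled by 1/sqrt(n_in)

std_raw = (x @ W_raw).std()                  # roughly sqrt(n_in), about 31.6 here
std_scaled = (x @ W_scaled).std()            # roughly 1
print(f"unscaled: {std_raw:.2f}, scaled: {std_scaled:.2f}")
```

# Each pre-activation sums n_in independent terms, so its variance grows linearly with n_in unless the weights
# are scaled down accordingly.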

In [6]:
# Xavier weight initialization, also known as Glorot initialization, is a technique for initializing the weights of a neural network. Its objective is to prevent the vanishing or exploding gradient problem during training. The idea is to set the initial weights so that the variance of the outputs of each neuron is the same as the variance of its inputs, which keeps the gradients from vanishing or exploding as they propagate through the network during backpropagation.
# - A common simplified form of Xavier weight initialization for a layer with n inputs and m outputs is:
#     W = np.random.randn(n, m) * np.sqrt(1/n)
# where W is the weight matrix for the layer, np.random.randn(n, m) generates a matrix of standard normal random numbers, and np.sqrt(1/n) scales them so that the variance of the outputs of each neuron is the same as the variance of its inputs. The original Glorot formulation also takes the number of outputs into account and uses a variance of 2/(n + m); this is the variance used by the Keras GlorotNormal and GlorotUniform initializers.
# - Here, the factor 1/n is used because each output sums n weighted inputs, so its variance is n times the weight variance times the input variance. Choosing a weight variance of 1/n therefore keeps the output variance equal to the input variance, so the signal has roughly the same scale in each layer, which helps prevent the gradients from vanishing or exploding.
# - Let's consider an example. Suppose we have a neural network with an input layer of size 1000, a hidden layer of size 500, and an output layer of size 10. We can initialize the weights of both layers as follows:
#     W1 = np.random.randn(1000, 500) * np.sqrt(1/1000)
#     W2 = np.random.randn(500, 10) * np.sqrt(1/500)
# - Here, W1 is the weight matrix for the hidden layer with 1000 inputs and 500 outputs, and W2 is the weight matrix for the output layer with 500 inputs and 10 outputs. The np.sqrt(1/n) term keeps the variance of the outputs of each neuron close to the variance of its inputs.
# - With Xavier weight initialization, the network typically trains faster and more reliably than with unscaled random initialization or zero initialization.
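# For comparison, the Glorot formulation that uses both the number of inputs and outputs can be sketched in NumPy
# as below. The helper names are hypothetical, and a plain normal distribution is used for simplicity, whereas the
# Keras GlorotNormal initializer draws from a truncated normal.

```python
import numpy as np

def glorot_normal(n_in, n_out, rng):
    """Normal weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.standard_normal((n_in, n_out)) * std

def glorot_uniform(n_in, n_out, rng):
    """Uniform weights in [-limit, limit]; the limit gives the same variance as above."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W1_normal = glorot_normal(1000, 500, rng)
W1_uniform = glorot_uniform(1000, 500, rng)
target_std = np.sqrt(2.0 / (1000 + 500))
print(W1_normal.std(), W1_uniform.std(), target_std)
```

# A uniform distribution on [-limit, limit] has variance limit**2 / 3, which is why the uniform variant uses
# sqrt(6 / (n_in + n_out)) to match the normal variant.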

In [None]:
# He weight initialization is a weight initialization technique used in neural networks. It adapts the Xavier initialization method to the ReLU activation function and is commonly used in deep neural networks that use ReLU.

# - The basic idea behind He initialization is to initialize the weights of each neuron in the network with random values drawn from a Gaussian distribution with a mean of 0 and a standard deviation of sqrt(2/n), where n is the number of inputs to the neuron.

# - The formula for He initialization is given as:

# W ~ N(0, sqrt(2/n))

# Where,
# W - weight matrix
# N - normal distribution
# 0 - mean
# sqrt(2/n) - standard deviation

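# - The factor of 2 compensates for ReLU setting roughly half of the pre-activations to zero, which would otherwise halve the mean squared activation at every layer. With a standard deviation of sqrt(2/n), the scale of the activations and of the gradients stays roughly constant across layers, preventing vanishing or exploding gradients.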

# - He initialization is effective for networks that use the ReLU activation function, as it helps to address the problem of vanishing gradients that can occur when using a small initial weight range with ReLU.
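# The role of the factor 2 can be illustrated with a stack of ReLU layers. This is a sketch under assumed sizes
# (10 layers of width 512, random inputs); the function name and its parameter are hypothetical.

```python
import numpy as np

def rms_after_relu_stack(variance_numerator, n_layers=10, width=512, seed=0):
    """Return the root-mean-square activation after a stack of ReLU layers.

    Weights are drawn from a normal distribution with variance variance_numerator / width.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((256, width))        # inputs with RMS close to 1
    std = np.sqrt(variance_numerator / width)
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * std
        h = np.maximum(h @ W, 0.0)               # ReLU
    return np.sqrt(np.mean(h ** 2))

rms_he = rms_after_relu_stack(2.0)        # He scaling: variance 2/n
rms_1_over_n = rms_after_relu_stack(1.0)  # variance 1/n, without the factor 2
print(f"variance 2/n: {rms_he:.3f}, variance 1/n: {rms_1_over_n:.3f}")
```

# With variance 1/n the signal loses a factor of about sqrt(2) per ReLU layer, so after 10 layers it is about
# 32 times smaller than with He scaling.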


In [7]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import os
import tensorflow as tf

In [8]:
# Load the MNIST handwritten digit dataset (60,000 training and 10,000 test images).
mnist = tf.keras.datasets.mnist
(x_train_full, y_train_full), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [9]:
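# Hold out the first 5,000 training images for validation and scale pixel values to [0, 1].
x_valid, x_train = x_train_full[:5000] / 255.0, x_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
x_test = x_test / 255.0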

In [11]:
# Model 1: He normal initialization for the two ReLU hidden layers.
# The output layer keeps the Keras default initializer (Glorot uniform).
layers = [
    tf.keras.layers.Flatten(input_shape=[28, 28], name="inputlayer"),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer=tf.keras.initializers.HeNormal(seed=None), name="hiddenlayer"),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer=tf.keras.initializers.HeNormal(seed=None), name="hiddenlayer2"),
    tf.keras.layers.Dense(10, activation="softmax", name="outputLayer"),
]
model = tf.keras.models.Sequential(layers)

In [12]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputlayer (Flatten)        (None, 784)               0         
                                                                 
 hiddenlayer (Dense)         (None, 300)               235500    
                                                                 
 hiddenlayer2 (Dense)        (None, 100)               30100     
                                                                 
 outputLayer (Dense)         (None, 10)                1010      
                                                                 
Total params: 266610 (1.02 MB)
Trainable params: 266610 (1.02 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
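# The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-output
# pair plus one bias per unit. A small check (the helper name is an assumption for illustration):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [784, 300, 100, 10]  # flattened 28x28 input, two hidden layers, output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

# For example, the first hidden layer has 784 * 300 + 300 = 235500 parameters, matching the summary above.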


In [14]:
# Model 2: same architecture, with Glorot (Xavier) uniform initialization for the hidden layers.
layers2 = [
    tf.keras.layers.Flatten(input_shape=[28, 28], name="inputlayer"),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer=tf.keras.initializers.GlorotUniform(seed=None), name="hiddenlayer"),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer=tf.keras.initializers.GlorotUniform(seed=None), name="hiddenlayer2"),
    tf.keras.layers.Dense(10, activation="softmax", name="outputLayer"),
]
model2 = tf.keras.models.Sequential(layers2)

In [16]:
# Model 3: same architecture, with Glorot (Xavier) normal initialization for the hidden layers.
layers3 = [
    tf.keras.layers.Flatten(input_shape=[28, 28], name="inputlayer"),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer=tf.keras.initializers.GlorotNormal(seed=None), name="hiddenlayer"),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer=tf.keras.initializers.GlorotNormal(seed=None), name="hiddenlayer2"),
    tf.keras.layers.Dense(10, activation="softmax", name="outputLayer"),
]
model3 = tf.keras.models.Sequential(layers3)

In [18]:
# Use the same loss, optimizer, and metric for all three models so that only the initializer differs.
loss = "sparse_categorical_crossentropy"
optimizer = "SGD"
metrics = ["accuracy"]
model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
model2.compile(loss=loss, optimizer=optimizer, metrics=metrics)
model3.compile(loss=loss, optimizer=optimizer, metrics=metrics)

In [19]:
epochs = 10
validation_Set = (x_valid, y_valid)
history = model.fit(x_train, y_train, epochs=epochs, validation_data=validation_Set, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [20]:
history2 = model2.fit(x_train, y_train, epochs=epochs, validation_data=validation_Set, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
history3 = model3.fit(x_train, y_train, epochs=epochs, validation_data=validation_Set, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
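# To compare the three initializers, the recorded training histories can be summarized side by side. The helper
# below is a hypothetical sketch, not part of the original notebook; it assumes each history dict contains a
# "val_accuracy" list, which Keras records when the model is compiled with the "accuracy" metric and fitted with
# validation data.

```python
def summarize_histories(histories):
    """Return (name, final_val_accuracy, best_val_accuracy, best_epoch) for each run.

    `histories` maps a label to a Keras `History.history` dict.
    """
    rows = []
    for name, hist in histories.items():
        val_acc = hist["val_accuracy"]
        # Index of the highest validation accuracy, converted to a 1-based epoch number.
        best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
        rows.append((name, val_acc[-1], max(val_acc), best_epoch))
    return rows

# Possible usage with the three runs above:
# for row in summarize_histories({
#     "HeNormal": history.history,
#     "GlorotUniform": history2.history,
#     "GlorotNormal": history3.history,
# }):
#     print(row)
```

# Because the initializers are created with seed=None, the numbers will differ from run to run, so small
# differences between the three models should be interpreted with caution.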
