In [2]:
pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.15.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting wrapt<1.15,>=1.11.0
  Downloading wrapt-1.14.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (77 kB)
Collecting termcolor>=1.1.0
  Downloading termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Collecting ml-dtypes~=0.2.0
  Downloading ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting google-pasta>=0.1.1
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)

# Q1. Explain the concept of batch normalization in the context of Artificial Neural Networks

## Batch normalization is a technique for stabilizing and accelerating the training of neural networks. During training it normalizes each feature of a layer's activations using the mean and variance of the current mini-batch, then scales and shifts the result with two learnable parameters. Its original motivation was to reduce internal covariate shift, the change in the distribution of a layer's inputs as earlier layers are updated. In practice it tends to speed up convergence, tolerate higher learning rates, add a mild regularizing effect, and often improve generalization.

# Q2. Describe the benefits of using batch normalization during training

## 1. Faster convergence: Deep neural networks can be slow to train. Batch normalization targets a problem called "internal covariate shift," where the distribution of each layer's inputs changes during training as the parameters of earlier layers are updated, so later layers must keep adapting to a moving target. By normalizing each feature over the mini-batch at every normalized layer, batch normalization reduces this shift, and the network usually needs fewer training iterations to reach a given performance.

## 2. Higher learning rates: Typically, deep learning requires careful tuning of the learning rate, the parameter that controls how much the weights are updated in each iteration. A high learning rate can lead to instability and divergence, while a low learning rate can slow down training significantly. Batch normalization allows you to use larger learning rates without sacrificing stability, leading to faster training times.

## 3. Improved generalization: Generalization refers to the network's ability to perform well on unseen data. Batch normalization can improve generalization by reducing the network's sensitivity to the initial weights and hyperparameters. This makes it less prone to overfitting to the training data and more likely to generalize well to new examples.

## 4. Reduced dependence on weight initialization: Choosing good initial weights for a deep neural network can be crucial for successful training. Batch normalization helps alleviate this dependence by making the network less sensitive to the initial values. This can simplify the training process and make it more robust to different initialization schemes.

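## 5. Regularization effect: Batch normalization acts as a mild form of regularization, which helps to prevent overfitting. Because the mean and variance are computed per mini-batch, the normalized activation of each example depends on the other examples in its batch, which adds a small amount of noise during training. This makes it harder for the network to memorize the training data and encourages it to learn more generalizable features.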

## 6. Makes some activation functions viable: Certain activation functions, like tanh and sigmoid, saturate for inputs far from zero and can suffer from vanishing gradients in deep networks, which makes them difficult to train. When applied before the activation, batch normalization can mitigate this by keeping the pre-activations close to zero mean and unit variance, away from the saturated regions, making these functions more effective in deeper networks.
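
As a rough illustration of point 6, the sketch below compares the average sigmoid gradient for a set of pre-activations before and after they are normalized to zero mean and unit variance. The input values and `eps` are made up for the example, and γ and β are taken as 1 and 0, so this only shows the saturation effect, not a full batch normalization layer.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical pre-activations of one unit across a mini-batch:
# far from zero, so the sigmoid is saturated.
raw = [8.0, 9.0, 10.0, 11.0, 12.0]

# Normalize to zero mean and unit variance (gamma = 1, beta = 0 assumed).
eps = 1e-5  # small constant for numerical stability (assumed value)
mean = sum(raw) / len(raw)
var = sum((z - mean) ** 2 for z in raw) / len(raw)
normalized = [(z - mean) / math.sqrt(var + eps) for z in raw]

avg_grad_raw = sum(sigmoid_grad(z) for z in raw) / len(raw)
avg_grad_norm = sum(sigmoid_grad(z) for z in normalized) / len(normalized)

print(f"mean sigmoid gradient, raw inputs:        {avg_grad_raw:.6f}")
print(f"mean sigmoid gradient, normalized inputs: {avg_grad_norm:.6f}")
```

The normalized inputs sit near zero, where the sigmoid's slope is largest, so the average gradient is several orders of magnitude larger than for the raw inputs.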

# Q3. Discuss the working principle of batch normalization, including the normalization step and the learnable parameters.

## 1. Normalization Step: Calculate mean and variance: For each mini-batch of activations (outputs of a layer) within a training iteration, batch normalization computes the mean (μ) and variance (σ²) of every feature across the batch.
## Normalize activations: Each activation value within the batch is then normalized by subtracting the mean and dividing by the standard deviation, computed as √(σ² + ε) with a small constant ε for numerical stability, so that each feature has approximately a mean of 0 and a standard deviation of 1 over the batch.

## 2. Learnable Parameters: Gamma (γ) and Beta (β): After normalization, two learnable parameters, γ and β, are introduced to restore flexibility and learn the optimal distribution for each layer.
## Scaling and shifting: The normalized activations are scaled by γ and shifted by β, allowing the network to adjust the mean and variance of the activations if needed.
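
The two steps above can be sketched in a few lines of NumPy. This is a minimal training-mode forward pass for a `(batch, features)` array, not the Keras implementation; the function name `batch_norm_forward`, the value of `eps`, and the random input are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature (biased) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale by gamma, shift by beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # hypothetical activations

# With gamma = 1 and beta = 0 the output is just the normalized activations.
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature

# Other values of gamma and beta give the output a different scale and mean.
z = batch_norm_forward(x, gamma=np.full(4, 2.0), beta=np.full(4, 0.5))
print(z.mean(axis=0))  # close to 0.5
print(z.std(axis=0))   # close to 2.0
```

Because γ and β are learned, the network can recover any per-feature mean and scale it finds useful, including undoing the normalization.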

## 3. Key Points: Layer-wise application: Batch normalization is typically applied to the activations of hidden layers. It can also be placed directly after the input to standardize the features, and it is not usually applied to the output layer.
## Training vs. inference: During training, batch statistics (mean and variance) are calculated for each batch. During inference, the running averages of these statistics from training are used for normalization.
## Regularization effect: Batch normalization acts as a regularizer, reducing overfitting by introducing noise to the activations.
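
The training versus inference behaviour can be sketched with a toy layer that keeps running averages. The class `SimpleBatchNorm` is hypothetical and only illustrates the idea; the `momentum` and `eps` values and the synthetic data are assumptions chosen for the example.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch-norm layer that tracks running statistics (illustrative only)."""

    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use the statistics of the current mini-batch...
            mu, var = x.mean(axis=0), x.var(axis=0)
            # ...and fold them into exponential moving averages.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mu
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # At inference, use the averages accumulated during training.
            mu, var = self.moving_mean, self.moving_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=3)

# "Training": many mini-batches drawn from the same distribution.
for _ in range(2000):
    bn(rng.normal(loc=2.0, scale=0.5, size=(32, 3)), training=True)

# "Inference": a single example can be normalized, since no batch is needed.
single = np.array([[2.0, 2.0, 2.0]])
out = bn(single, training=False)
print(bn.moving_mean)  # close to 2.0
print(out)             # close to 0, because the input sits at the mean
```

Using the running averages makes inference deterministic and independent of whatever other examples happen to be in the batch.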

# Q.2 IMPLEMENTATION
## *Before Batch Normalization*

## 1. Choose a dataset of your choice (e.g., MNIST, CIFAR-10) and preprocess it

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
import tensorflow as tf
import keras
from keras.datasets import mnist

In [16]:
# Loading the MNIST dataset
(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()
# Scale pixel intensities from [0, 255] to [0, 1] (this also converts uint8 to float)
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
# Hold out the first 5,000 training images as a validation set
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [17]:
X_train_full.shape , y_train_full.shape

((60000, 28, 28), (60000,))

In [18]:
X_train.shape , y_train.shape

((55000, 28, 28), (55000,))

In [19]:
X_test.shape , y_test.shape

((10000, 28, 28), (10000,))

In [20]:
X_valid.shape , y_valid.shape

((5000, 28, 28), (5000,))

In [21]:
# Creating the layers of the model

# Setting seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

LAYERS = [ tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax")]
model = tf.keras.models.Sequential(LAYERS)

In [22]:
# Compiling the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [23]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_2 (Flatten)         (None, 784)               0         
                                                                 
 dense_6 (Dense)             (None, 300)               235500    
                                                                 
 leaky_re_lu_4 (LeakyReLU)   (None, 300)               0         
                                                                 
 dense_7 (Dense)             (None, 100)               30100     
                                                                 
 leaky_re_lu_5 (LeakyReLU)   (None, 100)               0         
                                                                 
 dense_8 (Dense)             (None, 10)                1010      
                                                                 
Total params: 266610 (1.02 MB)
Trainable params: 26661
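
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers of the model above.
layer_sizes = [(784, 300), (300, 100), (100, 10)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]

print(counts)       # [235500, 30100, 1010]
print(sum(counts))  # 266610
```

The Flatten and LeakyReLU layers contribute no parameters, so the total comes entirely from the Dense layers.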

In [24]:
#Training and Calculating the training time
import time
#Starting time
start = time.time()

history = model.fit(X_train,
                    y_train,
                    epochs=15,
                    validation_data=(X_valid,y_valid),
                    verbose = 2
                    )

#Ending time
end = time.time()

#Total time taken
print(f"Runtime of the program is {end - start}")

Epoch 1/15
1719/1719 - 6s - loss: 1.5672 - accuracy: 0.5777 - val_loss: 0.9944 - val_accuracy: 0.7808 - 6s/epoch - 3ms/step
Epoch 2/15
1719/1719 - 5s - loss: 0.7839 - accuracy: 0.8147 - val_loss: 0.6180 - val_accuracy: 0.8490 - 5s/epoch - 3ms/step
Epoch 3/15
1719/1719 - 5s - loss: 0.5618 - accuracy: 0.8566 - val_loss: 0.4878 - val_accuracy: 0.8772 - 5s/epoch - 3ms/step
Epoch 4/15
1719/1719 - 5s - loss: 0.4713 - accuracy: 0.8748 - val_loss: 0.4235 - val_accuracy: 0.8896 - 5s/epoch - 3ms/step
Epoch 5/15
1719/1719 - 5s - loss: 0.4221 - accuracy: 0.8846 - val_loss: 0.3847 - val_accuracy: 0.8970 - 5s/epoch - 3ms/step
Epoch 6/15
1719/1719 - 5s - loss: 0.3906 - accuracy: 0.8923 - val_loss: 0.3590 - val_accuracy: 0.9020 - 5s/epoch - 3ms/step
Epoch 7/15
1719/1719 - 5s - loss: 0.3683 - accuracy: 0.8978 - val_loss: 0.3400 - val_accuracy: 0.9080 - 5s/epoch - 3ms/step
Epoch 8/15
1719/1719 - 5s - loss: 0.3515 - accuracy: 0.9017 - val_loss: 0.3249 - val_accuracy: 0.9138 - 5s/epoch - 3ms/step
Epoch 9/

# Observation:
## val_accuracy: 0.9250
## Runtime of the program is 72.57323980331421

# After Batch Normalization

In [25]:
# delete the previous model
del model

In [26]:
# Defining a new model with batch normalization

tf.random.set_seed(42)  # Setting seeds for reproducibility
np.random.seed(42)

LAYERS_BN = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN)

In [27]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_3 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization (Batch  (None, 784)               3136      
 Normalization)                                                  
                                                                 
 dense_9 (Dense)             (None, 300)               235500    
                                                                 
 batch_normalization_1 (Bat  (None, 300)               1200      
 chNormalization)                                                
                                                                 
 dense_10 (Dense)            (None, 100)               30100     
                                                                 
 batch_normalization_2 (Bat  (None, 100)              

In [28]:
bn1 = model.layers[1]

In [29]:
for variable in bn1.variables:
    print(variable.name, variable.trainable)

batch_normalization/gamma:0 True
batch_normalization/beta:0 True
batch_normalization/moving_mean:0 False
batch_normalization/moving_variance:0 False
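
The four variables listed above explain the parameter counts in the summary: each normalized feature has a gamma, a beta, a moving mean, and a moving variance, and only the first two are trainable. The helper `batch_norm_params` is a hypothetical name used only for this arithmetic check.

```python
def batch_norm_params(n_features):
    """Return (total, trainable, non_trainable) parameter counts."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving_mean and moving_variance
    return trainable + non_trainable, trainable, non_trainable

for n_features in (784, 300):
    total, trainable, non_trainable = batch_norm_params(n_features)
    print(n_features, total, trainable, non_trainable)
# 784 3136 1568 1568
# 300 1200 600 600
```

The totals of 3136 and 1200 match the first two batch normalization rows of the model summary.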


In [30]:
# Compiling the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [31]:
#Training and Calculating the training time

#Starting time
start = time.time()

history = model.fit(X_train,
                    y_train,
                    epochs=15,
                    validation_data=(X_valid,y_valid),
                    verbose = 2
                    )

#Ending time
end = time.time()

#Total time taken
print(f"Runtime of the program is {end - start}")

Epoch 1/15
1719/1719 - 8s - loss: 0.8602 - accuracy: 0.7323 - val_loss: 0.4660 - val_accuracy: 0.8674 - 8s/epoch - 5ms/step
Epoch 2/15
1719/1719 - 6s - loss: 0.4392 - accuracy: 0.8721 - val_loss: 0.3434 - val_accuracy: 0.9016 - 6s/epoch - 4ms/step
Epoch 3/15
1719/1719 - 6s - loss: 0.3583 - accuracy: 0.8970 - val_loss: 0.2907 - val_accuracy: 0.9196 - 6s/epoch - 4ms/step
Epoch 4/15
1719/1719 - 6s - loss: 0.3122 - accuracy: 0.9093 - val_loss: 0.2612 - val_accuracy: 0.9260 - 6s/epoch - 4ms/step
Epoch 5/15
1719/1719 - 6s - loss: 0.2834 - accuracy: 0.9176 - val_loss: 0.2384 - val_accuracy: 0.9324 - 6s/epoch - 4ms/step
Epoch 6/15
1719/1719 - 6s - loss: 0.2629 - accuracy: 0.9231 - val_loss: 0.2215 - val_accuracy: 0.9380 - 6s/epoch - 4ms/step
Epoch 7/15
1719/1719 - 6s - loss: 0.2435 - accuracy: 0.9287 - val_loss: 0.2083 - val_accuracy: 0.9420 - 6s/epoch - 4ms/step
Epoch 8/15
1719/1719 - 6s - loss: 0.2323 - accuracy: 0.9321 - val_loss: 0.1955 - val_accuracy: 0.9450 - 6s/epoch - 4ms/step
Epoch 9/

# Observation:
## val_accuracy: 0.9548
## Runtime of the program is 98.73910808563232

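# Conclusion:
## Before Applying Batch Normalization
## val_accuracy: 0.9250
## Runtime of the program is 72.57323980331421 seconds

## After Applying Batch Normalization
## val_accuracy: 0.9548
## Runtime of the program is 98.73910808563232 seconds

## With the same optimizer, learning rate, and number of epochs, the model with batch normalization reached a higher validation accuracy (0.9548 vs. 0.9250) but took longer to train (about 98.7 s vs. 72.6 s), because each batch normalization layer adds computation to every step. The two models also differ in activation and initializer (LeakyReLU with he_normal vs. ReLU with the default initializer), so the gain cannot be attributed to batch normalization alone.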
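
To put the two runs side by side, the short sketch below computes the accuracy gain and the runtime ratio from the figures reported above. The dictionary names are made up for the example, and the runtimes are taken to be in seconds, as returned by `time.time()`.

```python
# Figures copied from the two observations reported in this notebook.
baseline = {"val_accuracy": 0.9250, "runtime_s": 72.57323980331421}
with_bn = {"val_accuracy": 0.9548, "runtime_s": 98.73910808563232}

acc_gain = with_bn["val_accuracy"] - baseline["val_accuracy"]
runtime_ratio = with_bn["runtime_s"] / baseline["runtime_s"]

print(f"validation accuracy gain: {acc_gain * 100:.2f} percentage points")  # 2.98
print(f"runtime ratio (BN / baseline): {runtime_ratio:.2f}x")               # 1.36x
```

These numbers come from a single run of each model, so they indicate the direction of the trade-off rather than a precise measurement.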