# Batch normalization

### Issues with training Deep Neural Networks 

- Two major issues arise when training deep neural networks: 1) internal covariate shift and 2) vanishing gradients.

### Internal Covariate shift

- Covariate shift refers to a change in the distribution of the inputs to a learning system. In a deep network, the inputs to each layer depend on the parameters of all preceding layers, so even small parameter updates early in the network can noticeably change what later layers receive. The effect is amplified as the signal propagates through the network, shifting the distribution of the inputs to internal layers during training. This phenomenon is known as internal covariate shift.

- Networks tend to converge faster when their inputs are whitened, i.e., have zero mean and unit variance and are uncorrelated. Internal covariate shift works against this: because the distribution of each layer's inputs keeps changing, the layer must continually adapt, which slows convergence. Batch normalization was developed to mitigate this effect by normalizing the inputs to each layer using statistics of the current mini-batch.
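
As a small illustration of the "zero mean, unit variance" part of whitening, the sketch below standardizes each feature of a synthetic data matrix with NumPy. The array `X` and its scales are made up for the example, and the features are not decorrelated here, so this is standardization rather than full whitening.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1000 samples, 3 features with very different scales
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(1000, 3))

# Standardize each feature (column) to zero mean and unit variance
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0).round(6))  # per-feature means, close to 0
print(X_std.std(axis=0).round(6))   # per-feature standard deviations, close to 1
```

Batch normalization applies the same idea inside the network, but with statistics taken from each mini-batch rather than from the whole dataset.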

### Vanishing Gradient

- Saturating non-linearities such as sigmoid or tanh cause trouble in deep networks: once activations fall into the saturation region, the gradient there is close to zero, and the effect compounds as the network grows deeper. Learning then becomes difficult and convergence slow. Common remedies are:

- Non-linearities such as ReLU, which do not saturate for positive inputs.
- Smaller learning rates
- Careful initializations
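
To make the saturation argument concrete, the sketch below compares the derivative of the sigmoid with that of ReLU at a few pre-activation values, then multiplies the per-layer derivatives over a stack of layers. It is a simplified picture that ignores the weights. The depth of 10 and the sample values of `z` are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# The sigmoid derivative peaks at z = 0 and collapses in the saturated region
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")

# Best case for sigmoid: every layer sits at z = 0, so each layer scales
# the gradient by 0.25. Weights are ignored in this simplified chain.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth
relu_chain = relu_grad(1.0) ** depth
print(f"sigmoid chain: {sigmoid_chain:.2e}, relu chain: {relu_chain:.1f}")
```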
---
### What is Normalization?

- Normalization in deep learning refers to rescaling the inputs or outputs of a layer so that they follow a consistent distribution (typically zero mean and unit variance), which tends to make training faster and more stable. One of the most common forms is batch normalization, which normalizes the activations of a layer over each mini-batch during training.
---
### What is batch normalization?

- Batch normalization is a technique that standardizes the inputs to a layer so that, within each mini-batch of the training data, every feature has zero mean and unit variance. The standardized values are then scaled and shifted by learned parameters.

### Steps involved in batch normalization

1) During training, for each mini-batch of data, compute the mean and variance of the activations of each layer. This can be done using the following formulas:

- Mean: $\mu_B = \frac{1}{m} \sum_{i=1}^m x_i$

- Variance: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2$

- Here, $m$ is the size of the mini-batch, and $x_i$ is the activation of a given neuron for the $i$-th example in the mini-batch. The statistics are computed separately for each neuron (feature).

2) Normalize the activations of each layer in the mini-batch using the following formula:

- $\hat{x_i} = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

- Here, $\epsilon$ is a small constant added for numerical stability.

3) Scale and shift the normalized activations using the learned parameters $\gamma$ and $\beta$, respectively:

- $y_i = \gamma \hat{x_i} + \beta$
- The parameters $\gamma$ and $\beta$ are learned during training using backpropagation.

4) During inference, the running mean and variance of each layer are used for normalization instead of the mini-batch statistics. These running statistics are updated using a moving average of the mini-batch statistics during training.
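
The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the Keras implementation: the class name `BatchNorm1D` is hypothetical, $\gamma$ and $\beta$ are kept at their initial values because no backpropagation is performed, and the `momentum=0.99` and `eps=1e-3` values are chosen to mirror the defaults of `tf.keras.layers.BatchNormalization`.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization for inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(num_features)   # scale (would be learned)
        self.beta = np.zeros(num_features)   # shift (would be learned)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Step 1: per-feature statistics of the current mini-batch
            mean = x.mean(axis=0)
            var = x.var(axis=0)  # biased variance, divides by m
            # Step 4 bookkeeping: moving averages used later at inference
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        # Step 2: normalize
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        # Step 3: scale and shift
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(42)
bn = BatchNorm1D(num_features=4)
batch = rng.normal(loc=3.0, scale=2.0, size=(32, 4))

out_train = bn(batch, training=True)   # uses mini-batch statistics
out_infer = bn(batch, training=False)  # uses the moving averages
print(out_train.mean(axis=0).round(4), out_train.std(axis=0).round(4))
```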
---
### The benefits of batch normalization include:

- Improved training performance: Batch normalization is intended to reduce internal covariate shift, i.e., the change in the distribution of each layer's inputs caused by updates to the parameters of the preceding layers. This allows the network to converge faster and with more stable gradients.

- Regularization: Because the mean and variance are estimated from each mini-batch, the normalized activation of a given example varies slightly from batch to batch. This noise acts as a mild form of regularization, which can help prevent overfitting.

- Increased robustness: Batch normalization makes the network more robust to changes in the input distribution, which can help improve its generalization performance.
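
The regularization effect can be illustrated with a short NumPy sketch: the same activation value is normalized inside several different random mini-batches, and the result changes each time because the batch statistics change. The batch size of 32, the value `1.0`, and the standard normal "other" samples are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-3
sample = np.array([1.0])  # one fixed activation value

outputs = []
for _ in range(5):
    # Place the same sample in a different random mini-batch each time
    others = rng.normal(loc=0.0, scale=1.0, size=31)
    batch = np.concatenate([sample, others])
    normalized = (batch - batch.mean()) / np.sqrt(batch.var() + eps)
    outputs.append(normalized[0])

# The normalized value of the same sample differs from batch to batch
print(np.round(outputs, 3))
```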
---

## Observing results before and after Normalization

### Before applying Batch Normalization

In [7]:
# Importing necessary modules
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
plt.style.use('fivethirtyeight')
%load_ext tensorboard



In [9]:
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]



In [15]:
# Define the layers of the model
tf.random.set_seed(42)  # fix the seed for reproducible results (optional)
np.random.seed(42)  # fix the seed for reproducible results (optional)

LAYERS = [
    tf.keras.layers.Flatten(input_shape = [28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS)

# Compiling the model
model.compile(loss = "sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [17]:
model.summary()

In [19]:
# Train the model and measure the wall-clock training time

# Start time
start = time.time()

history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), verbose=2)

# End time
end = time.time()

# Total time taken
print(f"Runtime of the program is {end - start:.2f} sec")

Epoch 1/10
1719/1719 - 4s - 2ms/step - accuracy: 0.6281 - loss: 1.2546 - val_accuracy: 0.7282 - val_loss: 0.8449
Epoch 2/10
1719/1719 - 3s - 1ms/step - accuracy: 0.7524 - loss: 0.7645 - val_accuracy: 0.7736 - val_loss: 0.6873
Epoch 3/10
1719/1719 - 3s - 2ms/step - accuracy: 0.7830 - loss: 0.6600 - val_accuracy: 0.7934 - val_loss: 0.6177
Epoch 4/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8009 - loss: 0.6046 - val_accuracy: 0.8082 - val_loss: 0.5759
Epoch 5/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8117 - loss: 0.5689 - val_accuracy: 0.8166 - val_loss: 0.5476
Epoch 6/10
1719/1719 - 3s - 1ms/step - accuracy: 0.8191 - loss: 0.5436 - val_accuracy: 0.8240 - val_loss: 0.5269
Epoch 7/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8240 - loss: 0.5245 - val_accuracy: 0.8280 - val_loss: 0.5110
Epoch 8/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8283 - loss: 0.5095 - val_accuracy: 0.8318 - val_loss: 0.4983
Epoch 9/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8317 - loss: 0.4972 - val_accuracy: 0.8352 - 

### Conclusion (before batch normalization)
- Runtime of the program: 28.55 sec
- Accuracy: 0.8351

### After applying Batch Normalization

In [24]:
# Delete the previous model
del model

In [32]:
# Defining new model with batch normalization
LAYERS_BN = [
    tf.keras.layers.Flatten(input_shape = [28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN)



In [34]:
model.summary()

In [36]:
bn1 = model.layers[1]

for variable in bn1.variables :
    print(variable.name, variable.trainable)

gamma True
beta True
moving_mean False
moving_variance False
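
Each `BatchNormalization` layer therefore holds four vectors per input feature, of which only `gamma` and `beta` are updated by backpropagation. The sketch below works out the resulting parameter counts by hand from the layer sizes of the model defined above. It is plain arithmetic and is not copied from the `model.summary()` output.

```python
# Layer widths of the model above: flattened input, then the Dense layers
input_features = 28 * 28
dense_units = [300, 100, 10]

# BatchNormalization sits in front of each Dense layer, so it sees the
# flattened input and the outputs of the first two Dense layers
bn_features = [input_features, 300, 100]

bn_params = sum(4 * n for n in bn_features)         # gamma, beta, moving_mean, moving_variance
bn_non_trainable = sum(2 * n for n in bn_features)  # moving_mean, moving_variance

dense_params = 0
fan_in = input_features
for units in dense_units:
    dense_params += fan_in * units + units  # weights + biases
    fan_in = units

total = dense_params + bn_params
print(total, total - bn_non_trainable, bn_non_trainable)
# prints: 271346 268978 2368
```

The non-trainable parameters are the moving statistics, which are updated by the moving average during training rather than by the optimizer.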


In [38]:
# Compiling the model
model.compile(loss = "sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [40]:
# Train the model and measure the wall-clock training time

# Start time
start = time.time()

history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), verbose=2)

# End time
end = time.time()

# Total time taken
print(f"Runtime of the program is {end - start:.2f} sec")

Epoch 1/10
1719/1719 - 5s - 3ms/step - accuracy: 0.7187 - loss: 0.8426 - val_accuracy: 0.8138 - val_loss: 0.5543
Epoch 2/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8013 - loss: 0.5716 - val_accuracy: 0.8348 - val_loss: 0.4760
Epoch 3/10
1719/1719 - 4s - 2ms/step - accuracy: 0.8245 - loss: 0.5065 - val_accuracy: 0.8468 - val_loss: 0.4408
Epoch 4/10
1719/1719 - 3s - 2ms/step - accuracy: 0.8363 - loss: 0.4691 - val_accuracy: 0.8538 - val_loss: 0.4199
Epoch 5/10
1719/1719 - 4s - 2ms/step - accuracy: 0.8452 - loss: 0.4429 - val_accuracy: 0.8570 - val_loss: 0.4057
Epoch 6/10
1719/1719 - 4s - 2ms/step - accuracy: 0.8527 - loss: 0.4226 - val_accuracy: 0.8616 - val_loss: 0.3954
Epoch 7/10
1719/1719 - 4s - 2ms/step - accuracy: 0.8587 - loss: 0.4060 - val_accuracy: 0.8650 - val_loss: 0.3874
Epoch 8/10
1719/1719 - 4s - 2ms/step - accuracy: 0.8633 - loss: 0.3917 - val_accuracy: 0.8684 - val_loss: 0.3808
Epoch 9/10
1719/1719 - 4s - 2ms/step - accuracy: 0.8675 - loss: 0.3792 - val_accuracy: 0.8698 - 

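### Conclusion (after batch normalization)
- Runtime of the program: 37.50 sec
- Accuracy: 0.8716
- Compared with the run without batch normalization (28.55 sec, accuracy 0.8351), training took longer but reached a higher accuracy in the same number of epochs. Note that the two models also differ in activation function and weight initialization (LeakyReLU with `he_normal` versus ReLU with the default initializer), so the gain cannot be attributed to batch normalization alone.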
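
The trade-off between the two runs can be quantified with a few lines of arithmetic. The numbers below are the runtimes and accuracies reported in the two conclusions. The variable names are chosen for this sketch only.

```python
# Figures reported in the two conclusions
runtime_plain, runtime_bn = 28.55, 37.50  # seconds
acc_plain, acc_bn = 0.8351, 0.8716

slowdown = runtime_bn / runtime_plain
acc_gain = acc_bn - acc_plain

print(f"Training time ratio: {slowdown:.2f}x")
print(f"Accuracy gain: {acc_gain * 100:.2f} percentage points")
```

On this single run, batch normalization cost roughly a third more training time for a gain of a few percentage points in accuracy. Timings will vary with hardware and from run to run.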