<a href="https://colab.research.google.com/github/Rishav-hub/Challenge1/blob/main/Documentation_5_BatchNormalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Batch Normalization
- Changing the activation function and the weight initialization reduces vanishing and exploding gradients at the start of training, but it does not guarantee that these problems won't come back later in training.
- To address this, in a 2015 paper, Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN).
- It adds an operation to the model just before or after the activation function of each hidden layer.
- The operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer.
- $\mu_{B} \Rightarrow$ batch mean and $\sigma^{2}_{B} \Rightarrow$ batch variance are non-trainable: they are computed from the data rather than learned by backpropagation.
- $\gamma \Rightarrow$ scaling and $\beta \Rightarrow$ shifting are trainable.

Steps:

- Mini-batch mean $\Rightarrow \mu_{B} = \frac{1}{M_{B}}\displaystyle\sum\limits_{i=1}^{M_{B}} X^{i}$, where $M_{B}$ is the number of instances in the mini-batch.
- Mini-batch variance $\Rightarrow \sigma^{2}_{B} = \frac{1}{M_{B}}\displaystyle\sum\limits_{i=1}^{M_{B}} (X^{i} - \mu_{B})^{2}$
- Normalizing $X^{i} \Rightarrow$ $\hat{X}^{i} = \frac{X^{i} - \mu_{B}}{\sqrt{\sigma^{2}_{B} + \epsilon}}$, where $\epsilon$ is a small smoothing term (for example $10^{-7}$) that avoids division by zero.
- Scaling and shifting: $$Z^{i} = \gamma\hat{X}^{i} + \beta$$
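
The four steps above can be sketched in a few lines of NumPy. This is only an illustration, not part of the original notebook: the function name `batch_norm_forward` and the random toy batch are assumptions, and $\epsilon$ is placed inside the square root.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-7):
    """Apply the four BN steps to a mini-batch X of shape (batch_size, n_features)."""
    mu = X.mean(axis=0)                    # step 1: mini-batch mean, per feature
    var = ((X - mu) ** 2).mean(axis=0)     # step 2: mini-batch variance, per feature
    X_hat = (X - mu) / np.sqrt(var + eps)  # step 3: zero-center and normalize
    return gamma * X_hat + beta            # step 4: scale and shift

# Toy mini-batch (hypothetical data): 32 instances, 4 features, far from zero mean.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

# With gamma = 1 and beta = 0 the output is simply the normalized input.
Z = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print("per-feature mean:", Z.mean(axis=0).round(6))
print("per-feature std: ", Z.std(axis=0).round(6))
```

With $\gamma = 1$ and $\beta = 0$, each feature of the output should have a mean close to 0 and a standard deviation close to 1, whatever the scale of the input batch.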

### Two approaches
- Batch Normalization layer before the activation function.
- Batch Normalization layer after the activation function.



#### Batch Normalization layer after the activation function
In the model below, each hidden Dense layer applies its ELU activation first, and the BN layer that follows normalizes the activated outputs before they reach the next layer.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

In [None]:
mnist = tf.keras.datasets.mnist

(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

In [None]:
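# Scale pixels to [0, 1]; the first 5,000 images form the validation set.
# Note: X_test is not rescaled here, so model.evaluate(X_test, ...) sees inputs in [0, 255].
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]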

In [None]:
model = tf.keras.models.Sequential([
 tf.keras.layers.Flatten(input_shape=[28, 28]),
 tf.keras.layers.BatchNormalization(),
 tf.keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal", use_bias=False),
 tf.keras.layers.BatchNormalization(),
 tf.keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal", use_bias=False),
 tf.keras.layers.BatchNormalization(),
 tf.keras.layers.Dense(10, activation="softmax")
])


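#### Why did we remove the bias terms in the Dense layers?
A Batch Normalization layer already includes one offset parameter ($\beta$) per input. When BN is applied to a layer's linear output, subtracting the batch mean cancels any constant bias and $\beta$ takes over its role, so the bias term of the previous layer can be dropped with use_bias=False.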
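
A small NumPy check can make this concrete. It is a sketch under stated assumptions, not code from the notebook: the `normalize` helper, the random pre-activations and the bias vector are all hypothetical, and it covers the case where normalization is applied directly to the linear output of a layer.

```python
import numpy as np

def normalize(X, eps=1e-7):
    """Zero-center and normalize each feature of a mini-batch (no gamma/beta)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return (X - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(42)
pre_activations = rng.normal(size=(32, 3))  # stand-in for X @ W from a Dense layer
bias = np.array([10.0, -4.0, 0.5])          # hypothetical bias vector

without_bias = normalize(pre_activations)
with_bias = normalize(pre_activations + bias)

# The bias shifts the batch mean by the same amount, so it cancels out.
print(np.allclose(without_bias, with_bias))
```

Because the constant shift disappears in the mean subtraction, any offset the network needs has to come from $\beta$ anyway.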

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235200    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30000     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1

In [None]:
784 * 4 # mean, variance, gamma and Beta

3136

Combining all BN layers 

In [None]:
784 * 4 + 300*4+ 100*4

4736

In [None]:
4736//2

2368

In [None]:
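270946 - 2368

268578

### Why did we divide the BN parameter count by 2?
Each BN layer holds four parameters per input, but only $\gamma$ and $\beta$ are trainable; the moving mean $\mu$ and moving variance $\sigma^{2}$ are estimated from the data, not learned by backpropagation. Half of the 4,736 BN parameters (2,368) are therefore non-trainable. Summing the Param # column of model.summary() for this model, with 100 × 10 + 10 = 1,010 parameters in the output layer, gives 270,946 parameters in total, so 270,946 - 2,368 = 268,578 are trainable. The same architecture with bias terms in the two hidden Dense layers would have 271,346 parameters, of which 268,978 are trainable.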
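
The counts can also be re-derived by hand. The sketch below is an illustration rather than part of the original notebook, and the helper names `dense_params`, `bn_params` and `count_params` are made up here. It counts the parameters of this 784-300-100-10 architecture with and without bias terms in the two hidden Dense layers.

```python
def dense_params(n_in, n_out, use_bias=True):
    """Weights plus optional biases of a fully connected layer."""
    return n_in * n_out + (n_out if use_bias else 0)

def bn_params(n_features):
    """Return (trainable, non_trainable) parameter counts of one BN layer."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving mean and moving variance
    return trainable, non_trainable

def count_params(use_bias_hidden):
    """Return (total, trainable) for Flatten-BN-Dense(300)-BN-Dense(100)-BN-Dense(10)."""
    bn_trainable = bn_fixed = 0
    for n_features in (784, 300, 100):  # one BN layer in front of each Dense layer
        t, f = bn_params(n_features)
        bn_trainable += t
        bn_fixed += f
    dense = (dense_params(784, 300, use_bias_hidden)
             + dense_params(300, 100, use_bias_hidden)
             + dense_params(100, 10, True))  # the output layer keeps its bias
    total = dense + bn_trainable + bn_fixed
    return total, total - bn_fixed

print(count_params(use_bias_hidden=True))   # (271346, 268978)
print(count_params(use_bias_hidden=False))  # (270946, 268578)
```

In both cases the gap between total and trainable is 2,368, which is half of the 4,736 BN parameters.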

In [None]:
for variable in model.layers[1].variables:
  print(variable.name, variable.trainable)

batch_normalization/gamma:0 True
batch_normalization/beta:0 True
batch_normalization/moving_mean:0 False
batch_normalization/moving_variance:0 False


In [None]:
LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = tf.keras.optimizers.SGD(learning_rate=1e-3) # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

In [None]:
EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

new_history = model.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
model.evaluate(X_test, y_test)



[43.61552047729492, 0.7975000143051147]

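#### Batch Normalization layer before the activation function
Here BN sits between each hidden Dense layer and its activation. Normalizing and shifting the pre-activation values keeps them in the range where the activation function still has a useful gradient, so the activation receives a better input.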

In [None]:
LAYERS_BN_BIAS_FALSE = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax")
]

model_2 = tf.keras.models.Sequential(LAYERS_BN_BIAS_FALSE)

LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = tf.keras.optimizers.SGD(learning_rate=1e-3) # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_2.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

new_history = model_2.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
model_2.evaluate(X_test, y_test)



[42.745121002197266, 0.8435999751091003]

#### Observation
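Model 2 scored higher than model 1 on the test set (accuracy of about 0.84 versus 0.80, with a slightly lower loss). In model 2 the BN layer sits between each Dense layer and its ReLU activation, whereas in model 1 it follows the ELU-activated output, so in this run placing BN before the activation did somewhat better. Treat this as a rough indication only: the two models also differ in activation function and weight initializer, and X_test was not divided by 255 like the training data, which is consistent with the very large loss values (about 43) in both evaluations.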

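### Why not use batch statistics at test time?
At test time we may need to make predictions for individual instances rather than for batches of instances: in this case, we have no way to compute a meaningful mean and standard deviation for each input. Moreover, even if we do have a batch of instances, it may be too small, or the instances may not be independent and identically distributed. The BN layer therefore stays in the model but switches to the moving averages of the mean and variance that it accumulated during training (the non-trainable moving_mean and moving_variance variables listed earlier).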
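
The switch between training and inference behaviour can be sketched as follows. This is a simplified illustration, not the Keras implementation: the class `SimpleBatchNorm` and the toy data are assumptions, and the default `momentum=0.99` and `eps=1e-3` are chosen to mirror the defaults of `tf.keras.layers.BatchNormalization`.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy BN layer that tracks moving statistics for use at inference time."""

    def __init__(self, n_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(n_features)
        self.beta = np.zeros(n_features)
        self.moving_mean = np.zeros(n_features)
        self.moving_var = np.ones(n_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, X, training):
        if training:
            # Use the statistics of the current mini-batch ...
            mean, var = X.mean(axis=0), X.var(axis=0)
            # ... and fold them into the exponential moving averages.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # At inference, rely on the accumulated averages only.
            mean, var = self.moving_mean, self.moving_var
        X_hat = (X - mean) / np.sqrt(var + self.eps)
        return self.gamma * X_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(n_features=2, momentum=0.9)  # lower momentum so the toy run converges fast
for _ in range(200):
    bn(rng.normal(loc=3.0, scale=2.0, size=(32, 2)), training=True)

single = np.array([[3.0, 3.0]])  # one instance: no batch statistics are available
out = bn(single, training=False)
print("moving mean:", bn.moving_mean, "moving variance:", bn.moving_var)
print("normalized single instance:", out)
```

After the training loop, the moving mean and variance should sit near the true values of the toy data (3 and 4), which lets the layer normalize a single instance without any batch statistics.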

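### Advantages
- Faster convergence.
- Helps reduce the vanishing and exploding gradient problems.
- Works with any activation function and makes training less sensitive to that choice.
- With a BN layer as the first layer, the inputs do not need to be standardized separately.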

### Disadvantages
- Adds complexity to the network.
- Increases the number of trainable parameters.
- Slower predictions, because of the extra computation in each BN layer.

### When to use BN?
- Mostly in CNNs.
- In deep networks, for example those with more than 16 layers.
- To reduce the vanishing and exploding gradient problems in RNNs.