
In [None]:
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
%load_ext tensorboard

Let's train a neural network on Fashion MNIST using the Leaky ReLU activation function:

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
# Scale pixel intensities from [0, 255] to [0, 1]
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
# Hold out the first 5,000 training images as a validation set
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [None]:
tf.random.set_seed(42)
np.random.seed(42)

LAYERS = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax"),
]


model = tf.keras.models.Sequential(LAYERS)
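
As a rough illustration of what the `LeakyReLU` layers in this model compute, here is a minimal NumPy sketch. The function name `leaky_relu` is made up for this example, and the slope `alpha=0.3` is assumed to match the Keras default; check the documentation for your installed version.

```python
import numpy as np

def leaky_relu(z, alpha=0.3):
    """Return z where z > 0 and alpha * z elsewhere (alpha is an assumed default)."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
out = leaky_relu(z)
print(out)
```

Unlike plain ReLU, negative inputs keep a small non-zero slope, so their gradient does not vanish completely.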

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               235500    
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               30100     
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
__________________________________________________

In [None]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid), verbose=2)

Epoch 1/10
1719/1719 - 2s - loss: 1.2819 - accuracy: 0.6229 - val_loss: 0.8886 - val_accuracy: 0.7160
Epoch 2/10
1719/1719 - 2s - loss: 0.7955 - accuracy: 0.7361 - val_loss: 0.7130 - val_accuracy: 0.7656
Epoch 3/10
1719/1719 - 2s - loss: 0.6816 - accuracy: 0.7721 - val_loss: 0.6427 - val_accuracy: 0.7898
Epoch 4/10
1719/1719 - 1s - loss: 0.6217 - accuracy: 0.7944 - val_loss: 0.5900 - val_accuracy: 0.8064
Epoch 5/10
1719/1719 - 2s - loss: 0.5832 - accuracy: 0.8075 - val_loss: 0.5582 - val_accuracy: 0.8202
Epoch 6/10
1719/1719 - 2s - loss: 0.5553 - accuracy: 0.8157 - val_loss: 0.5350 - val_accuracy: 0.8238
Epoch 7/10
1719/1719 - 2s - loss: 0.5338 - accuracy: 0.8225 - val_loss: 0.5157 - val_accuracy: 0.8304
Epoch 8/10
1719/1719 - 2s - loss: 0.5173 - accuracy: 0.8273 - val_loss: 0.5079 - val_accuracy: 0.8284
Epoch 9/10
1719/1719 - 2s - loss: 0.5040 - accuracy: 0.8290 - val_loss: 0.4895 - val_accuracy: 0.8386
Epoch 10/10
1719/1719 - 2s - loss: 0.4924 - accuracy: 0.8321 - val_loss: 0.4817 - 

# Batch Normalization

#### Internal Covariate Shift
* We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

* To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed.

* It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.

* As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.

* By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.

Reference: [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf)
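
To make "whitened" concrete, the following is a minimal NumPy sketch of PCA whitening on toy data. The data, the mixing matrix `A`, and all variable names are illustrative assumptions, not part of the paper or of this notebook's model. After the transform, the features have zero mean, unit variance, and no correlation, so their covariance matrix is approximately the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): 1000 samples, 3 correlated features
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
X = rng.normal(size=(1000, 3)) @ A + np.array([5.0, -2.0, 1.0])

# 1. Center each feature
X_centered = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Rotate onto the eigenvectors and rescale each direction to unit variance
X_white = X_centered @ eigvecs / np.sqrt(eigvals + 1e-12)

cov_white = np.cov(X_white, rowvar=False)
print(np.round(cov_white, 3))
```

Full whitening needs a covariance matrix and its decomposition for every layer. Batch normalization, as defined in this notebook, only standardizes each feature independently and does not decorrelate them.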

## Input: 
### Values of x over a mini-batch: $B = \{x^{(1)}, \dots, x^{(m_B)}\}$
### Learnable parameters: $\gamma$ and $\beta$


## Output: 
### $\{z^{(i)} = BN _{\gamma, \beta}(x^{(i)})\}$

## Algorithm:

### 1. mini-batch mean: $\mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)}$

### 2. mini-batch variance: $\sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} (x^{(i)} - \mu_B)^2$

### 3. normalize: $\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

### 4. scale and shift: $z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta \equiv BN_{\gamma, \beta}(x^{(i)})$
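
The four steps above can be sketched in NumPy as follows. This is a training-mode forward pass only. The function name `batch_norm_forward` and the toy batch are made up for illustration, and `eps=1e-3` is assumed to match the Keras default.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mu = x.mean(axis=0)                     # 1. mini-batch mean
    var = x.var(axis=0)                     # 2. mini-batch variance (divides by m_B)
    x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta             # 4. scale and shift (element-wise)

rng = np.random.default_rng(42)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))  # toy batch: m_B = 32, 4 features
gamma = np.ones(4)   # with gamma = 1 and beta = 0 the output is just x_hat
beta = np.zeros(4)

z = batch_norm_forward(x, gamma, beta)
print(z.mean(axis=0), z.std(axis=0))
```

Each feature of `z` has a mean of about 0 and a standard deviation slightly below 1, because `eps` is added to the variance before taking the square root.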

---

## BN Approach One

In [None]:
del model

In [None]:
LAYERS_BN = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN)

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_4 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_5 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_6 (Dense)              (None, 10)               

In [None]:
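# Each BatchNormalization layer stores 4 values per input feature:
# gamma, beta, moving_mean and moving_variance
784 * 4, 300 * 4, 100 * 4  # parameters per BN layer

784 * 4 + 300 * 4 + 100 * 4  # total BN parameters

# Only gamma and beta are trainable, so half of the total is non-trainable
(784 * 4 + 300 * 4 + 100 * 4) / 2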

2368.0

In [None]:
784 * 4  # gamma, beta, moving_mean and moving_variance for each of the 784 inputs

3136

In [None]:
300 * 4

1200

In [None]:
100 * 4

400

In [None]:
3136 + 1200 + 400

4736

In [None]:
4736 / 2

2368.0

In [None]:
266610 + 2368.0  # trainable parameters: Dense layers plus gamma and beta

268978.0

In [None]:
266610 + 4736  # total parameters: Dense layers plus all BN parameters

271346
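
The arithmetic in the cells above can be gathered into one small helper. This is only a sketch: `bn_param_counts` is a hypothetical name, and it assumes every `BatchNormalization` layer keeps four vectors per input feature (two trainable, two non-trainable), as the counts above suggest.

```python
def bn_param_counts(feature_sizes):
    """Return (total, trainable, non_trainable) BN parameter counts."""
    total = sum(4 * n for n in feature_sizes)      # gamma, beta, moving_mean, moving_variance
    trainable = sum(2 * n for n in feature_sizes)  # gamma and beta only
    return total, trainable, total - trainable

total, trainable, non_trainable = bn_param_counts([784, 300, 100])
print(total, trainable, non_trainable)  # 4736 2368 2368

dense_params = 266610  # Dense-layer parameters, taken from the first model summary
print(dense_params + trainable, dense_params + total)  # 268978 271346
```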

In [None]:
bn1 = model.layers[1]
for variable in bn1.variables:
    print(f"variable name: {variable.name.split('/')[-1][:-2]}, \nis trainable: {variable.trainable}\n\n")

variable name: gamma, 
is trainable: True


variable name: beta, 
is trainable: True


variable name: moving_mean, 
is trainable: False


variable name: moving_variance, 
is trainable: False
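
The two non-trainable variables are not updated by backpropagation. They are running estimates of the batch statistics, used in place of $\mu_B$ and $\sigma_B^2$ at inference time. The sketch below shows the idea with an exponential moving average. The values `momentum=0.99` and `eps=1e-3` are assumed to match the Keras defaults, the data is synthetic, and the real layer may differ in details.

```python
import numpy as np

momentum = 0.99  # assumed Keras default
eps = 1e-3       # assumed Keras default

moving_mean = np.zeros(3)  # initial values before any batch is seen
moving_var = np.ones(3)

rng = np.random.default_rng(0)
for _ in range(2000):
    # Synthetic mini-batch of 32 samples with known per-feature mean and scale
    batch = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.5], size=(32, 3))
    # Exponential moving average of the batch statistics
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean(axis=0)
    moving_var = momentum * moving_var + (1 - momentum) * batch.var(axis=0)

def batch_norm_inference(x, gamma, beta):
    """Inference-mode BN: normalize with the moving statistics, not the batch's."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

print(moving_mean, moving_var)
```

Because the moving statistics are fixed after training, the output for one sample no longer depends on the other samples in the batch.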




In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [None]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid), verbose=2)

Epoch 1/10
1719/1719 - 3s - loss: 0.8293 - accuracy: 0.7221 - val_loss: 0.5539 - val_accuracy: 0.8162
Epoch 2/10
1719/1719 - 3s - loss: 0.5703 - accuracy: 0.8035 - val_loss: 0.4792 - val_accuracy: 0.8378
Epoch 3/10
1719/1719 - 3s - loss: 0.5161 - accuracy: 0.8214 - val_loss: 0.4425 - val_accuracy: 0.8492
Epoch 4/10
1719/1719 - 3s - loss: 0.4788 - accuracy: 0.8314 - val_loss: 0.4212 - val_accuracy: 0.8562
Epoch 5/10
1719/1719 - 3s - loss: 0.4547 - accuracy: 0.8406 - val_loss: 0.4051 - val_accuracy: 0.8616
Epoch 6/10
1719/1719 - 3s - loss: 0.4386 - accuracy: 0.8445 - val_loss: 0.3931 - val_accuracy: 0.8628
Epoch 7/10
1719/1719 - 3s - loss: 0.4254 - accuracy: 0.8502 - val_loss: 0.3829 - val_accuracy: 0.8644
Epoch 8/10
1719/1719 - 3s - loss: 0.4123 - accuracy: 0.8538 - val_loss: 0.3759 - val_accuracy: 0.8672
Epoch 9/10
1719/1719 - 3s - loss: 0.4026 - accuracy: 0.8578 - val_loss: 0.3691 - val_accuracy: 0.8676
Epoch 10/10
1719/1719 - 3s - loss: 0.3924 - accuracy: 0.8614 - val_loss: 0.3631 - 

## BN Approach Two

Sometimes applying BN before the activation function works better (there's a debate on this topic). Moreover, the layer before a `BatchNormalization` layer does not need bias terms: the `BatchNormalization` layer already adds its own offset $\beta$, so a bias there would be a waste of parameters. You can set `use_bias=False` when creating those layers:

In [None]:
del model

In [None]:
LAYERS_BN_BIAS_FALSE = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN_BIAS_FALSE)
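
The claim that a bias is redundant in front of batch normalization can be checked numerically. In the NumPy sketch below (all names and the random data are illustrative assumptions), the normalization step subtracts the batch mean, so any constant bias `b` added beforehand cancels out.

```python
import numpy as np

def standardize(x, eps=1e-3):
    """Normalization step of BN: subtract the batch mean, divide by the batch std."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))  # toy batch of 32 samples with 5 features
W = rng.normal(size=(5, 3))   # toy dense-layer weights
b = rng.normal(size=3)        # toy dense-layer bias

with_bias = standardize(x @ W + b)
without_bias = standardize(x @ W)

max_diff = np.max(np.abs(with_bias - without_bias))
print(max_diff)  # zero up to floating-point rounding
```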

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [None]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

