# 7.2.2 Batch Normalization

## Explanation of Batch Normalization

Batch Normalization (BN) is a technique designed to improve the training of deep neural networks. It was introduced by Sergey Ioffe and Christian Szegedy in their 2015 paper, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." BN addresses training stability and speed by normalizing the activations of each layer using statistics computed over the current mini-batch.


## Key Terminologies

- **Activation**: The output of a neuron in a neural network after applying an activation function.
- **Mean**: The average value of a feature's activations, computed across the examples in a mini-batch.
- **Variance**: The measure of the spread of a feature's activations around that mean, computed across the same mini-batch.
- **Normalization**: The process of adjusting the activations so that each feature has a mean of zero and a variance of one across the mini-batch.

## Process of Batch Normalization

Batch Normalization involves normalizing the activations of a layer. The process can be divided into two main steps: the forward pass and the backward pass.



### Forward Pass

1. **Compute the Mean and Variance**:
   
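   For a given batch of data $X$ (where $X$ has $N$ examples and $D$ features, and $X_i$ denotes the $i$-th example), calculate the mean $\mu$ and variance $\sigma^2$ of each feature across the batch. All operations below are applied element-wise, one feature at a time: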

   $$
   \mu = \frac{1}{N} \sum_{i=1}^{N} X_i
   $$

   $$
   \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
   $$

2. **Normalize the Data**:

   Normalize the activations using the computed mean and variance:

   $$
   \hat{X} = \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}}
   $$

   Here, $\epsilon$ is a small constant added for numerical stability.

3. **Scale and Shift**:

   Apply learned parameters $\gamma$ (scale) and $\beta$ (shift) to the normalized data:

   $$
   Y = \gamma \hat{X} + \beta
   $$

   Here, $Y$ is the output of the batch normalization layer.
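
The three steps above can be traced on a tiny batch. The following is a minimal NumPy sketch; the input values, `eps`, and the fixed `gamma` and `beta` are illustrative assumptions (in a real network, `gamma` and `beta` are learned).

```python
import numpy as np

# Toy batch: N = 4 examples, D = 2 features (values chosen for illustration)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
eps = 1e-5                     # assumed small constant for numerical stability
gamma = np.array([2.0, 1.0])   # assumed scale (learned in practice)
beta = np.array([0.5, 0.0])    # assumed shift (learned in practice)

# Step 1: per-feature mean and (biased) variance over the batch axis
mu = X.mean(axis=0)            # [2.5, 25.0]
var = X.var(axis=0)            # [1.25, 125.0]

# Step 2: normalize each feature to roughly zero mean and unit variance
X_hat = (X - mu) / np.sqrt(var + eps)

# Step 3: scale and shift with gamma and beta
Y = gamma * X_hat + beta

print("mu:", mu)
print("var:", var)
print("mean of Y per feature:", Y.mean(axis=0))  # approximately beta
print("std of Y per feature:", Y.std(axis=0))    # approximately gamma
```

After the scale-and-shift step, each output feature has mean close to `beta` and standard deviation close to `gamma`, which is how the layer can recover a non-normalized distribution when that is useful.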


### Backward Pass

1. **Compute Gradients**:

   During backpropagation, compute the gradients of the loss with respect to the batch normalization parameters ($\gamma$, $\beta$) and the input $X$. The gradients are computed using the chain rule of calculus and involve the derivatives of the mean and variance.

   - Gradient with respect to $\gamma$:

     $$
     d\gamma = \sum_{i=1}^{N} \frac{\partial L}{\partial Y_i} \cdot \hat{X}_i
     $$

   - Gradient with respect to $\beta$:

     $$
     d\beta = \sum_{i=1}^{N} \frac{\partial L}{\partial Y_i}
     $$

   - The gradients with respect to the normalized input $\hat{X}$ and the original input $X$ involve more terms, because $\mu$ and $\sigma^2$ themselves depend on every example in the batch.
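
The remaining gradients follow from the chain rule. The equations below are a sketch of the standard derivation, written per feature with the same symbols as the forward pass; they are intended as a guide to where each term comes from rather than a full proof.

$$
\frac{\partial L}{\partial \hat{X}_i} = \frac{\partial L}{\partial Y_i} \cdot \gamma
$$

$$
\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{X}_i} \cdot (X_i - \mu) \cdot \left(-\frac{1}{2}\right) (\sigma^2 + \epsilon)^{-3/2}
$$

$$
\frac{\partial L}{\partial \mu} = \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{X}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{1}{N} \sum_{i=1}^{N} -2 (X_i - \mu)
$$

$$
\frac{\partial L}{\partial X_i} = \frac{\partial L}{\partial \hat{X}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2 (X_i - \mu)}{N} + \frac{\partial L}{\partial \mu} \cdot \frac{1}{N}
$$

The second term of $\partial L / \partial \mu$ is zero in exact arithmetic, since $\sum_{i=1}^{N} (X_i - \mu) = 0$. It is kept here so that each term can be traced back to a step of the forward pass.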



___
___
### Readings:
- [Batch Norm Explained Visually](https://towardsdatascience.com/batch-norm-explained-visually-how-it-works-and-why-neural-networks-need-it-b18919692739)
- [Batch Normalization](https://medium.com/nerd-for-tech/batch-normalization-51e32053f20)
- [Introduction to Batch Normalization](https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/)
- [Batch normalization in 3 levels of understanding](https://towardsdatascience.com/batch-normalization-in-3-levels-of-understanding-14c2da90a338)
___
___

## Benefits of Batch Normalization

- **Improved Training Speed**: BN can reduce the number of training epochs required for convergence.
- **Reduced Internal Covariate Shift**: The original paper argues that normalization mitigates internal covariate shift by stabilizing the distribution of layer inputs, although later work has questioned whether this is the main reason BN helps.
- **Increased Model Stability**: By normalizing activations, BN helps stabilize the training process and can reduce the dependence on careful initialization and learning rates.

## Use Cases

Batch Normalization is widely used in various types of neural networks, including:
- Convolutional Neural Networks (CNNs)
- Deep Feedforward Networks
- Recurrent Neural Networks (RNNs), although applying it across time steps is less straightforward

By normalizing the activations, BN often makes deeper and more complex networks easier to train, which can improve performance and, in some cases, generalization.
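
For convolutional feature maps, BN is commonly applied per channel: the mean and variance are computed over the batch and spatial axes, so each channel shares a single $\gamma$ and $\beta$. The following is a minimal NumPy sketch of the training-mode computation, assuming a channels-last layout `(N, H, W, C)`; the helper name `batch_norm_2d` and the random input are hypothetical and used only for illustration.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Training-mode BN for feature maps of shape (N, H, W, C).

    Hypothetical helper for illustration: statistics are shared across
    the batch and spatial axes, giving one mean/variance per channel.
    """
    mu = x.mean(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, C)
    var = x.var(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, C)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta                  # gamma, beta broadcast over C

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 3))  # 8 maps, 4x4, 3 channels
y = batch_norm_2d(x, gamma=np.ones(3), beta=np.zeros(3))

print(y.shape)                   # (8, 4, 4, 3)
print(y.mean(axis=(0, 1, 2)))    # close to 0 for every channel
print(y.std(axis=(0, 1, 2)))     # close to 1 for every channel
```

Compared with the fully connected case, only the reduction axes change; the normalize, scale, and shift steps are the same.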

In [1]:
import numpy as np

class BatchNormalization:
    def __init__(self, input_dim, epsilon=1e-8, momentum=0.9):
        self.epsilon = epsilon
        self.momentum = momentum
        self.gamma = np.ones(input_dim)
        self.beta = np.zeros(input_dim)
        self.running_mean = np.zeros(input_dim)
        self.running_var = np.ones(input_dim)
    
    def forward(self, X, training=True):
        if training:
            self.mean = np.mean(X, axis=0)
            self.var = np.var(X, axis=0)
            self.X_hat = (X - self.mean) / np.sqrt(self.var + self.epsilon)
            self.out = self.gamma * self.X_hat + self.beta
            
            # Store input for backward pass
            self.X = X
            
            # Update running mean and variance
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * self.mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * self.var
            
            return self.out
        else:
            # During inference, use running mean and variance
            X_hat = (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
            return self.gamma * X_hat + self.beta

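    def backward(self, d_out):
        # d_out is dL/dY with shape (N, D); forward(X, training=True) must be called first
        N, D = d_out.shape

        # Gradient with respect to the normalized activations X_hat
        d_X_hat = d_out * self.gamma
        # Gradients with respect to the batch variance and batch mean (one value per feature)
        d_var = np.sum(d_X_hat * (self.X - self.mean) * -0.5 * np.power(self.var + self.epsilon, -1.5), axis=0)
        d_mean = np.sum(d_X_hat * -1 / np.sqrt(self.var + self.epsilon), axis=0) + d_var * np.mean(-2 * (self.X - self.mean), axis=0)
        # Gradient with respect to the input: direct path plus the paths through var and mean
        d_X = d_X_hat / np.sqrt(self.var + self.epsilon) + d_var * 2 * (self.X - self.mean) / N + d_mean / N

        # Gradients with respect to the learnable scale and shift
        d_gamma = np.sum(d_out * self.X_hat, axis=0)
        d_beta = np.sum(d_out, axis=0)

        return d_X, d_gamma, d_beta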

# Example usage
np.random.seed(0)
X = np.random.randn(5, 3)  # Sample data: 5 samples, 3 features

batch_norm = BatchNormalization(input_dim=3)

# Forward pass
out = batch_norm.forward(X, training=True)
print("Forward pass output:\n", out)

# Backward pass (example gradients from subsequent layers)
d_out = np.random.randn(*X.shape)
d_X, d_gamma, d_beta = batch_norm.backward(d_out)
print("\nBackward pass gradients:\n")
print("d_X:\n", d_X)
print("d_gamma:\n", d_gamma)
print("d_beta:\n", d_beta)


Forward pass output:
 [[ 0.79835038 -0.10633531  0.7310406 ]
 [ 1.50500184  1.9398276  -1.5772949 ]
 [-0.40789861 -0.87537424 -0.54579944]
 [-1.20739247 -0.46346352  1.29223008]
 [-0.68806115 -0.49465453  0.09982365]]

Backward pass gradients:

d_X:
 [[-1.62886563e-03  1.56315931e+00 -5.47883128e-01]
 [ 4.16198431e-01 -1.95247371e-01 -8.73595579e-01]
 [-2.92845494e-01  1.15427271e-01  1.70826155e-01]
 [ 1.59490467e+00 -2.81277254e+00 -8.46188693e-01]
 [-1.71662874e+00  1.32943333e+00  2.09684124e+00]]
d_gamma:
 [-2.14074433 -2.65652776  4.48771932]
d_beta:
 [ 3.3829314   1.58283307 -1.98519581]
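
One way to gain confidence in a hand-written backward pass is a finite-difference gradient check. The sketch below restates the forward and backward formulas as standalone functions and compares the analytic `d_X` with a numerical estimate. The names `bn_forward`, `bn_backward`, and `numerical_grad` are illustrative and not part of the class above, and the scalar loss `L = sum(Y * W)` with fixed random weights `W` is an assumption made only so that there is something to differentiate.

```python
import numpy as np

def bn_forward(X, gamma, beta, eps=1e-8):
    """Training-mode forward pass; returns the output and a cache for backward."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta, (X, mu, var, X_hat, gamma, eps)

def bn_backward(d_out, cache):
    """Analytic gradients of the loss with respect to X, gamma, and beta."""
    X, mu, var, X_hat, gamma, eps = cache
    N = X.shape[0]
    d_X_hat = d_out * gamma
    d_var = np.sum(d_X_hat * (X - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    d_mu = (np.sum(-d_X_hat / np.sqrt(var + eps), axis=0)
            + d_var * np.mean(-2 * (X - mu), axis=0))
    d_X = d_X_hat / np.sqrt(var + eps) + d_var * 2 * (X - mu) / N + d_mu / N
    d_gamma = np.sum(d_out * X_hat, axis=0)
    d_beta = np.sum(d_out, axis=0)
    return d_X, d_gamma, d_beta

def numerical_grad(f, X, h=1e-5):
    """Central-difference estimate of df/dX for a scalar-valued f."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        old = X[idx]
        X[idx] = old + h
        f_plus = f(X)
        X[idx] = old - h
        f_minus = f(X)
        X[idx] = old               # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
W = rng.normal(size=(5, 3))        # fixed weights defining the toy loss

def loss(X_):
    Y, _ = bn_forward(X_, gamma, beta)
    return np.sum(Y * W)

_, cache = bn_forward(X, gamma, beta)
d_X, d_gamma, d_beta = bn_backward(W, cache)   # for this loss, dL/dY = W
d_X_num = numerical_grad(loss, X)

max_err = np.max(np.abs(d_X - d_X_num))
print("max abs difference between analytic and numerical d_X:", max_err)
```

A small difference (several orders of magnitude below the gradient values themselves) suggests the analytic formulas are consistent with the forward pass; a large one usually points to a missing term in the mean or variance path.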


___
___
## Usage in `TensorFlow`

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, BatchNormalization, Input
from tensorflow.keras.models import Sequential
import numpy as np

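# Generate some synthetic sample data
# Note: the labels are drawn independently of X_train, so there is no real signal to learn
np.random.seed(0)
X_train = np.random.randn(100, 10)  # 100 samples, 10 features
y_train = np.random.randint(0, 2, size=(100,))  # Binary classification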

# Define a simple neural network with Batch Normalization
model = Sequential([
    Input(shape=(10,)),  # Specify input shape using Input layer
    Dense(64),  # Dense layer
    BatchNormalization(),  # Batch Normalization layer
    tf.keras.layers.Activation('relu'),  # Activation function
    Dense(32),  # Hidden layer
    BatchNormalization(),  # Batch Normalization layer
    tf.keras.layers.Activation('relu'),  # Activation function
    Dense(1, activation='sigmoid')  # Output layer
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=15, batch_size=16)
# Evaluate on the training data itself; this measures fit to the training set, not generalization
loss, accuracy = model.evaluate(X_train, y_train)

Epoch 1/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - accuracy: 0.4666 - loss: 0.7710
Epoch 2/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.4902 - loss: 0.7623 
Epoch 3/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.4973 - loss: 0.7299 
Epoch 4/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6570 - loss: 0.6437 
Epoch 5/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6604 - loss: 0.6404 
Epoch 6/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6591 - loss: 0.6252 
Epoch 7/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.7294 - loss: 0.5929 
Epoch 8/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6679 - loss: 0.5834 
Epoch 9/15
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

In [3]:
print(f"Loss: {loss:.4f}")
print(f"Accuracy: {accuracy:.4f}")

Loss: 0.5401
Accuracy: 0.8500


## Conclusion

In this section, we explored Batch Normalization through a from-scratch NumPy implementation and through TensorFlow's Keras API. Batch Normalization normalizes the activations of each layer over the current mini-batch to improve the training process. This technique can speed up training, reduce sensitivity to initialization, and potentially lead to better generalization.

We demonstrated how to integrate Batch Normalization into a neural network model using TensorFlow's Keras API. By placing a `BatchNormalization` layer between each hidden `Dense` layer and its activation, we standardize the pre-activation outputs, which helps stabilize and accelerate the training process.

The code example provided:
- Generates synthetic training data.
- Constructs a simple feedforward neural network with Batch Normalization layers.
- Compiles, trains, and evaluates the model, showing how to incorporate this technique into a typical workflow.

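In practice, incorporating Batch Normalization can be beneficial for deep learning models, especially deeper networks that are otherwise sensitive to initialization and learning rate. Because the labels in this example are random and the model is evaluated on its own training data, the reported accuracy reflects how well the network fits the training set rather than how well it generalizes. The use of an `Input` layer to define the input shape, as demonstrated, also helps avoid common issues and warnings related to layer specifications.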
