# Batch Normalization (BatchNorm)

### Overview:
Batch Normalization (BatchNorm) is a technique commonly employed during training of Deep Neural Networks (DNNs). It addresses the issue of *internal covariate shift*, which is the change in the distribution of network activations during training. BatchNorm aims to stabilize and accelerate the training process by normalizing the inputs of each layer in a mini-batch.

Batch Normalization helps to address the unstable gradients problem and typically speeds up training. It also helps reduce overfitting, because it has a slight regularization effect on the network during training.
***
BatchNorm was introduced by **Sergey Ioffe** and **Christian Szegedy** in their 2015 paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). It was initially proposed to address the challenges of training very deep Neural Networks.
***

#### Internal Covariate Shift:

- **Covariate Shift:** In the context of Machine Learning and Neural Networks, <u>covariate shift refers to the change in the distribution of input data between training and testing phases</u>. When the distribution of the training data differs significantly from the distribution of the testing data, it can lead to decreased model performance.

- **Internal Covariate Shift:** Internal covariate shift <u>extends this idea to the distribution of the activations within a neural network's layers during training</u>. As we train a DNN, the distribution of activations (output values) within each layer tends to change as the model's parameters are updated. This shift in the distribution of activations is known as internal covariate shift.

- **Impact on Training:** Internal covariate shift can pose challenges during training for a couple of reasons:

    - Vanishing/Exploding Gradients: When the distribution of activations changes dramatically, it can lead to issues like vanishing or exploding gradients. These issues hinder the convergence of the optimization algorithm and slow down the training process.

    - Learning Rate Sensitivity: If the distribution shifts significantly, certain layers might require very small learning rates to prevent the network from diverging during training. This slows down the overall learning process.
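
A small NumPy sketch can make internal covariate shift concrete. It is an illustration only: the layer sizes, the random data and the simulated "parameter update" (a hypothetical scaling and perturbation of the weights, not a real gradient step) are assumptions chosen to make the shift visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 256 examples, 20 input features, 10 hidden units.
x = rng.normal(size=(256, 20))
w1 = rng.normal(scale=0.5, size=(20, 10))

def hidden_activations(weights):
    """ReLU activations of the first layer, i.e. the inputs of the next layer."""
    return np.maximum(x @ weights, 0.0)

before = hidden_activations(w1)

# Stand-in for a training update: scale and perturb the weights.
w1_updated = 1.5 * w1 + rng.normal(scale=0.2, size=w1.shape)
after = hidden_activations(w1_updated)

# The same input data now yields a different activation distribution,
# so the next layer has to adapt to a moving target.
print(f"mean before: {before.mean():.3f}, after: {after.mean():.3f}")
print(f"std  before: {before.std():.3f}, after: {after.std():.3f}")
```

The input data `x` never changes here; only the weights do, and that alone is enough to move the mean and spread of what the next layer receives.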

### Explanation:
During the training of a neural network, the distribution of inputs to each layer can shift due to changing parameters in previous layers. This can slow down training as the network needs to continuously adapt to these shifts. BatchNorm counters this by ensuring that the inputs to a layer have a consistent mean and variance.

BatchNorm operates within a mini-batch of training examples. For each feature in the mini-batch, it calculates the mean and variance, normalizes the features based on these statistics and then also scales and shifts the data using learnable parameters $\gamma \text{ (gamma)}$ and $\beta \text{ (beta)}$. 

### Formulas:
Batch Normalization is performed using the following formulas:   

For a feature $x$ in a mini-batch of $m$ examples $x^{(1)}, \dots, x^{(m)}$, the BatchNorm transformation is applied as follows:

1. Calculate the mean $\mu$ and variance $\sigma^2$ of the feature across the mini-batch: $\large \mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$ and $\large \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)^2$.
2. Normalize the feature: $\large \hat{x}^{(i)} = \frac{x^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\epsilon$ is a small constant to avoid division by zero.
3. Scale and shift the normalized feature using learnable parameters: $\large z^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta$, where $\gamma \text{ (gamma)}$ is the scaling parameter and $\beta \text{ (beta)}$ is the shifting (offset) parameter.
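
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `batch_norm_forward`, the value `eps=1e-5` and the toy data are assumptions, and the sketch covers only the mini-batch transform described above.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Apply the three BatchNorm steps to a mini-batch.

    x has shape (batch_size, num_features);
    gamma and beta have shape (num_features,).
    """
    # Step 1: per-feature mean and variance across the mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    # Step 2: normalize each feature.
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 3: scale and shift with the learnable parameters.
    return gamma * x_hat + beta

# Toy mini-batch: 4 examples, 2 features on very different scales.
x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])
gamma = np.ones(2)   # scale of 1: no rescaling
beta = np.zeros(2)   # shift of 0: no offset

z = batch_norm_forward(x, gamma, beta)
print(z.mean(axis=0))  # close to 0 for both features
print(z.std(axis=0))   # close to 1 for both features
```

With $\gamma = 1$ and $\beta = 0$ both features end up on the same scale even though the raw values differ by a factor of 100. During training, the network is free to move $\gamma$ and $\beta$ away from these starting values.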

### Implementation Details:
- **Poor performance for very small batches:** Because BatchNorm estimates the mean and variance of each feature from the current mini-batch, those estimates become noisy when the batch is small (16 examples or so), and BatchNorm might not work as well.
- **Learnable $\gamma$ and $\beta$:** Learnable means that we don't have to specify them during model creation. Their values will be learned by the network during training.
- **$\beta \text{ (beta)}$ vs Bias $b$:** The learnable parameter $\beta$ offsets the data in the same way a bias does, so a layer whose output feeds directly into BatchNorm does not need its own bias: any constant offset is removed by the mean subtraction and then re-introduced by $\beta$. We can therefore create that layer without the bias parameter, which makes training slightly faster and the model slightly simpler.
```python
from tensorflow import keras

# Beta in the BatchNorm layer takes over the role of the bias,
# so the Dense layer directly before it can drop its bias term.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation('relu'),
    # ...
])
```
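
The redundancy of the bias can be checked numerically. The sketch below uses plain NumPy rather than Keras, and all names (`batch_norm`, `w`, `b`) and sizes are illustrative assumptions. Adding a bias to the layer output right before BatchNorm leaves the result unchanged, because the mean subtraction removes any constant per-feature offset.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Mini-batch BatchNorm over the batch axis (axis 0)."""
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(42)
x = rng.normal(size=(32, 5))   # mini-batch: 32 examples, 5 inputs
w = rng.normal(size=(5, 3))    # weights of a layer with 3 units
b = rng.normal(size=3)         # bias of that layer
gamma, beta = np.ones(3), np.zeros(3)

with_bias = batch_norm(x @ w + b, gamma, beta)
without_bias = batch_norm(x @ w, gamma, beta)

# The bias shifts the batch mean by exactly b, so it cancels out.
print(np.allclose(with_bias, without_bias))  # prints True
```

Note that the cancellation relies on BatchNorm being applied directly to the linear output `x @ w + b`; if an activation function sits between the two, the bias is no longer removed exactly.
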
- **Slightly longer Epochs:** With Batch Normalization each epoch takes slightly longer, due to the additional computational overhead from the transformations and the extra learnable parameters, but the network usually needs fewer epochs to converge.
- **Normalizing Input Data:** With Batch Normalization we can <u>add a BatchNorm layer as the first layer of the model</u>, before the first Dense layer, <u>to normalize the input data inside the network</u>. This can remove the need to standardize the data separately before feeding it to the network. Note that during training the layer standardizes one mini-batch at a time rather than the whole training set at once.
```python
from tensorflow import keras

# Adding a BatchNorm layer after the Flatten layer,
# before the first Dense layer.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation='relu'),
    keras.layers.BatchNormalization(),
    # ...
])
```
- **Slight Regularization effect:** Batch Normalization can have a slight regularization effect on the training process, because the mean and variance are computed from the relatively small amount of data in each mini-batch, which adds some noise to the activations.
    - **Note:** The amount of regularization depends on the mini-batch size: the bigger the batches, the weaker the regularization effect.
    - **Note:** The regularization effect of BatchNorm must not be treated as the main source of regularization for the network! It is best viewed as an additional benefit rather than the main purpose of BatchNorm. We still have to employ traditional regularization techniques to counter overfitting, such as Dropout layers and weight regularization (for example L1 or L2 penalties).
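
The link between batch size and noise can be illustrated with a short simulation. This is a sketch under stated assumptions: the "dataset" is a hypothetical set of 100,000 random values standing in for one feature, and `batch_mean_spread` is an illustrative helper. It measures how much the mini-batch mean, which BatchNorm uses in place of the true mean, fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values of a single feature over a whole dataset.
activations = rng.normal(loc=2.0, scale=3.0, size=100_000)

def batch_mean_spread(batch_size, num_batches=1_000):
    """Standard deviation of the mini-batch mean across many random batches."""
    batches = rng.choice(activations, size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

for batch_size in (8, 64, 512):
    spread = batch_mean_spread(batch_size)
    print(f"batch size {batch_size:>3}: spread of batch mean = {spread:.3f}")
```

The spread shrinks roughly with the square root of the batch size, so small batches inject noticeably more noise into the normalization than large ones. This is the source both of the regularization effect and of the weaker performance with very small batches.
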
- **BatchNorm before the Activation function:** The authors of the original paper favoured applying BatchNorm before the activation function, but whether this actually helps depends on the network, so it is something we should test ourselves. Some practitioners argue that it does.
```python
from tensorflow import keras

# Adding a BatchNorm layer before the activation function
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300),
    keras.layers.BatchNormalization(),
    # Apply ReLU activation after BatchNorm
    keras.layers.Activation('relu'),
    # ...
])
```
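
To see what the choice of ordering changes, the following NumPy sketch compares the two placements on the same hypothetical pre-activations. The helper names and the random data are assumptions, and $\gamma = 1$, $\beta = 0$ are fixed for simplicity. It only shows how the output statistics differ; it does not say which ordering trains better.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Mini-batch BatchNorm with gamma = 1 and beta = 0."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
# Hypothetical pre-activations: 128 examples, 4 units.
z = rng.normal(loc=1.0, scale=2.0, size=(128, 4))

bn_then_relu = relu(batch_norm(z))   # BatchNorm before the activation
relu_then_bn = batch_norm(relu(z))   # BatchNorm after the activation

# BatchNorm -> ReLU: outputs are non-negative, so their mean is above zero.
print(bn_then_relu.min(), bn_then_relu.mean(axis=0))
# ReLU -> BatchNorm: outputs are re-centred, so each unit has mean close to 0.
print(relu_then_bn.min(), relu_then_bn.mean(axis=0))
```

In the first ordering the next layer receives non-negative values, in the second it receives zero-mean values, which is one reason the two placements can behave differently in practice.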
### Code Examples:
Simplified example of how BatchNorm might be implemented in a neural network in TensorFlow:

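```python
import tensorflow as tf

# `input_data` stands in for the input to a layer; a random tensor of
# shape (batch_size=32, features=10) is used here as a placeholder.
input_data = tf.random.normal([32, 10])

# training=True makes the layer normalize with this batch's statistics.
normalized_data = tf.keras.layers.BatchNormalization()(input_data, training=True)
```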
And PyTorch:
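```python
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_features=784, out_features=256)
        # Normalizes each of the 256 features over the mini-batch
        self.bn1 = nn.BatchNorm1d(num_features=256)

    def forward(self, x):
        # Linear layer first, then BatchNorm on its output
        return self.bn1(self.fc1(x))
```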
- BatchNorm was initially designed for convolutional and fully connected layers, but its concepts have been extended to other architectures.
- It has been influential in the design of subsequent normalization techniques like Layer Normalization and Group Normalization.

### Conclusion:
Batch Normalization is a crucial tool for improving the training of Deep Neural Networks. It contributes to faster convergence, better generalization, and more stable training dynamics. Understanding how BatchNorm works and when to apply it can significantly enhance our ability to design and train effective machine learning models.