## Batch Normalization

Batch Normalization (BN) is an algorithmic technique designed to make the training of Deep Neural Networks (DNN) faster and more stable. The main idea is to normalize the activation vectors from hidden layers using the mean and variance of the current batch. This normalization is applied just before or after the non-linear activation function.

### Problem Addressed by Batch Normalization

- **Small Learning Rate**: A small learning rate leads to slow training.
- **Non-Normalized Data**: If data is not normalized, the training process is also slower.

Input features are usually normalized before they are fed to the neurons of the first layer. This idea led researchers to explore normalizing the outputs of each hidden layer as well. For example, the output of the first layer can be normalized before being fed into the second layer, and so on.

### Internal Covariate Shift

Before we dive into how batch normalization works, it's important to understand **covariate shift** and its variant:

- **Covariate Shift**: This occurs when the distribution of input features differs between the training and testing phases. When this happens, the model's accuracy typically degrades, and it may need to be retrained on data that reflects the new distribution.
- **Internal Covariate Shift**: This refers to the change in the distribution of network activations during training due to changing network parameters. It can cause issues like instability or slow convergence during training.

Consider the analogy:  
**Peas → Beas → Knees → Cheese → Fleas**  
This represents how a message can get distorted as it’s passed along from one person to another. The first person says "peas," the second person hears "beas," the third person hears "knees," and so on. Similarly, the distribution of data can get distorted across layers in a deep neural network, leading to issues in training.
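
As a rough illustration of internal covariate shift, the toy sketch below shows how the same inputs produce differently distributed first-layer outputs once that layer's parameters change, so the next layer sees a moving target. The inputs, weights, and the helper `layer_output_stats` are made-up values and names for illustration only.

```python
# Toy sketch: one hidden neuron computing z = w * x + b. All values are made up.
inputs = [1.0, 2.0, 3.0, 4.0]

def layer_output_stats(w, b, xs):
    """Return (mean, variance) of z = w * x + b over the batch xs."""
    zs = [w * x + b for x in xs]
    mean = sum(zs) / len(zs)
    var = sum((z - mean) ** 2 for z in zs) / len(zs)
    return mean, var

# Statistics seen by the next layer before and after a hypothetical weight update.
before = layer_output_stats(w=0.5, b=0.0, xs=inputs)
after = layer_output_stats(w=2.0, b=1.0, xs=inputs)
print(before)  # (1.25, 0.3125)
print(after)   # (6.0, 5.0)
```

The inputs never changed, yet the mean and variance handed to the next layer did. This is the drift that Batch Normalization is meant to dampen.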

### How Batch Normalization Solves These Problems

Batch Normalization normalizes the outputs from each layer before passing them into the next. The output is normalized to have zero mean and unit variance. Here's how it works step-by-step:

Let’s consider the weighted sum of inputs for a single neuron, $Z_{11}$, with two input features, cgpa and iq:

$$
Z_{11} = w_1 \cdot \text{cgpa} + w_2 \cdot \text{iq} + b
$$
Then, the activation function $g(Z_{11}) = a_{11}$ applies a non-linear transformation.

#### Step 1: Normalize the activation
The first step in batch normalization is to normalize the weighted sum $Z_{11}$. This is done by subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$ of the batch:

$$
Z_{11}^N = \frac{Z_{11} - \mu}{\sigma}
$$

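Here, $\mu$ is the mean of $Z_{11}$ over the batch, and $\sigma$ is its standard deviation over the batch. In practice, a small constant $\epsilon$ is added to the variance before taking the square root, so the denominator becomes $\sqrt{\sigma^2 + \epsilon}$ and division by zero is avoided.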
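
Step 1 can be sketched in plain Python for one neuron over a small batch. The batch values and the function name `normalize_batch` are hypothetical, and the `eps` term is an assumption added for numerical stability rather than part of the formula as written above.

```python
import math

def normalize_batch(zs, eps=1e-5):
    """Normalize one neuron's weighted sums over a batch to roughly zero mean and unit variance."""
    mu = sum(zs) / len(zs)                           # batch mean
    var = sum((z - mu) ** 2 for z in zs) / len(zs)   # batch variance (sigma squared)
    sigma = math.sqrt(var + eps)                     # eps is an added stability term
    return [(z - mu) / sigma for z in zs]

# Hypothetical values of Z_11 for a batch of four examples.
z11_batch = [2.0, 4.0, 6.0, 8.0]
z11_norm = normalize_batch(z11_batch)
print(z11_norm)
```

The normalized values keep their ordering and relative spacing; only their location and scale change.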

#### Step 2: Scale and Shift the normalized value
Once the activation is normalized, we scale and shift it using two learnable parameters: $\gamma$ (gamma) and $\beta$ (beta):

$$
Z_{11}^{BN} = \gamma \cdot Z_{11}^N + \beta
$$

- $\gamma$ (gamma) is a scaling factor.
- $\beta$ (beta) is a shifting factor.

These parameters are learned during training, similar to how weights and biases are learned. In Keras, $\gamma$ is initialized to 1 and $\beta$ to 0 by default, so the layer initially leaves the normalized values unchanged.
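
A minimal sketch of Step 2 follows. The function name `scale_and_shift` and the input values are hypothetical. The defaults of 1 for `gamma` and 0 for `beta` mirror the usual starting values, and the second call uses arbitrary numbers standing in for learned ones.

```python
def scale_and_shift(z_norm, gamma=1.0, beta=0.0):
    """Apply z_bn = gamma * z_norm + beta to each normalized value."""
    return [gamma * z + beta for z in z_norm]

# Hypothetical, already-normalized values for one neuron.
z_norm = [-1.5, -0.5, 0.5, 1.5]

identity = scale_and_shift(z_norm)                      # gamma=1, beta=0 leaves values unchanged
shifted = scale_and_shift(z_norm, gamma=2.0, beta=0.5)  # stand-in for learned values
print(identity)  # [-1.5, -0.5, 0.5, 1.5]
print(shifted)   # [-2.5, -0.5, 1.5, 3.5]
```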

#### Activation after Batch Normalization
After scaling and shifting, the normalized activation $Z_{11}^{BN}$ is passed through the non-linear activation function:

$$
g(Z_{11}^{BN}) = a_{11}
$$
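
Putting the pieces together, the sketch below runs one neuron through the whole pipeline: weighted sum, normalization, scale and shift, then activation. The `(cgpa, iq)` pairs, the weights, and the function name `bn_neuron_forward` are made up, `eps` is an assumed stability constant, and ReLU is used only as an example choice for $g$.

```python
import math

def bn_neuron_forward(batch, w1, w2, b, gamma=1.0, beta=0.0, eps=1e-5):
    """Weighted sum -> batch normalization -> ReLU for a single neuron."""
    # Z_11 = w1 * cgpa + w2 * iq + b for every example in the batch
    zs = [w1 * cgpa + w2 * iq + b for cgpa, iq in batch]
    mu = sum(zs) / len(zs)
    var = sum((z - mu) ** 2 for z in zs) / len(zs)
    # Normalize, then scale and shift in one expression
    z_bn = [gamma * (z - mu) / math.sqrt(var + eps) + beta for z in zs]
    # g is taken to be ReLU here, purely as an example activation
    return [max(0.0, z) for z in z_bn]

# Hypothetical (cgpa, iq) pairs and weights
batch = [(6.5, 95.0), (7.2, 110.0), (8.1, 120.0), (9.0, 135.0)]
a11 = bn_neuron_forward(batch, w1=0.4, w2=0.02, b=-1.0)
print(a11)
```

With `gamma=1` and `beta=0`, examples whose weighted sum falls below the batch mean become negative after normalization and are zeroed by ReLU.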

### Why Step 2 (Scale and Shift)?

You might wonder why we scale and shift the normalized activations. The idea behind this step is to give the model flexibility. While normalization (step 1) ensures the data has zero mean and unit variance, we sometimes need the model to have different distributions. By introducing the learnable parameters $\gamma$ and $\beta$, we allow the network to adapt and learn a different distribution for each neuron. This ensures that the model doesn’t get "stuck" with a rigid normalization requirement if it's not suitable for the data.
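
One way to see this flexibility is that the network can undo the normalization entirely if that turns out to be best. In the sketch below, setting $\gamma = \sigma$ and $\beta = \mu$ recovers the original values. The numbers are hypothetical, and the stability constant is left out so the recovery is exact up to floating-point error.

```python
import math

zs = [2.0, 4.0, 6.0, 8.0]  # hypothetical weighted sums for one neuron
mu = sum(zs) / len(zs)
sigma = math.sqrt(sum((z - mu) ** 2 for z in zs) / len(zs))

z_norm = [(z - mu) / sigma for z in zs]

# If training drives gamma toward sigma and beta toward mu, the original values come back.
gamma, beta = sigma, mu
recovered = [gamma * z + beta for z in z_norm]
print(recovered)
```

Any other pair of values for `gamma` and `beta` gives the neuron a different mean and spread, which is the extra freedom Step 2 provides.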

### Batch Normalization in Keras

In Keras, Batch Normalization is treated as a **layer** in the model. If the layer it follows has two neurons, the BN layer has four trainable parameters: one $\gamma$ and one $\beta$ for each neuron. Keras updates these parameters with gradient descent, just as it does for weights.
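
The parameter arithmetic can be checked with a small helper. `batchnorm_param_counts` is a hypothetical function written for this illustration, not part of Keras. It only reproduces the counting, including the per-neuron running statistics that Keras lists as non-trainable parameters in `model.summary()`.

```python
def batchnorm_param_counts(units):
    """Count BatchNormalization parameters for a layer with `units` outputs.

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    trainable = 2 * units      # one gamma and one beta per neuron
    non_trainable = 2 * units  # one moving mean and one moving variance per neuron
    return trainable, non_trainable

print(batchnorm_param_counts(2))  # (4, 4)
print(batchnorm_param_counts(3))  # (6, 6)
```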

### Batch Normalization During Training vs. Testing

- **Training Phase**: During training, $\mu$ (mean) and $\sigma$ (standard deviation) are calculated from the current batch. These are used to normalize the activations.
  
- **Testing Phase**: During testing, predictions are often made on a single data point, so a batch mean and standard deviation cannot be computed reliably. To solve this, Batch Normalization uses an **Exponentially Weighted Moving Average (EWMA)** of the batch means and standard deviations accumulated during training, and applies these running averages at test time.

$$
\mu = \text{EWMA}(\mu_{\text{training}})
$$
$$
\sigma = \text{EWMA}(\sigma_{\text{training}})
$$
These parameters are **non-learnable**: they are not updated by gradient descent. Instead, they are accumulated during training and then held fixed for inference.
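
The running-average bookkeeping can be sketched as follows. The function names and batch values are hypothetical. The sketch tracks a moving variance rather than a moving standard deviation, which is how Keras stores it, and the `momentum` and `eps` defaults follow the Keras `BatchNormalization` defaults of 0.99 and 0.001. The loop uses `momentum=0.9` only so the movement is visible after a few batches.

```python
def update_running_stats(running_mean, running_var, batch, momentum=0.99):
    """One EWMA update of the running mean and variance from a training batch."""
    mu = sum(batch) / len(batch)
    var = sum((z - mu) ** 2 for z in batch) / len(batch)
    new_mean = momentum * running_mean + (1.0 - momentum) * mu
    new_var = momentum * running_var + (1.0 - momentum) * var
    return new_mean, new_var

def normalize_at_inference(z, running_mean, running_var, eps=1e-3):
    """Normalize a single value using stored statistics instead of batch statistics."""
    return (z - running_mean) / (running_var + eps) ** 0.5

# Start from mean 0 and variance 1, then process a few hypothetical training batches.
running_mean, running_var = 0.0, 1.0
for batch in ([2.0, 4.0, 6.0, 8.0], [3.0, 5.0, 7.0, 9.0], [1.0, 3.0, 5.0, 7.0]):
    running_mean, running_var = update_running_stats(
        running_mean, running_var, batch, momentum=0.9
    )

print(running_mean, running_var)
print(normalize_at_inference(5.0, running_mean, running_var))
```

At test time only `normalize_at_inference` is needed, so a single example can be normalized without any batch.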

### Summary: Parameters in Batch Normalization

For each neuron, we store four parameters:
- **Two learnable parameters**: $\gamma$ (gamma) and $\beta$ (beta).
- **Two non-learnable parameters**: The mean $\mu$ and the standard deviation $\sigma$, which are computed during training and used during testing.

### Advantages of Batch Normalization

1. **Stable Training**: Batch Normalization stabilizes training, allowing for a wider range of hyperparameters to be used effectively.
2. **Faster Training**: By normalizing the activations, the network converges faster during training.
3. **Acts as a Regularizer**: Though not as strong as dropout, BN helps to reduce overfitting by adding a slight regularization effect.
4. **Reduces the Impact of Poor Weight Initialization**: By normalizing the activations, BN mitigates some of the issues caused by bad weight initialization, improving training efficiency.

In conclusion, Batch Normalization is an effective technique for improving the stability and speed of training deep neural networks. By normalizing activations, scaling and shifting them, and using the Exponentially Weighted Moving Average at test time, BN can improve the overall performance and generalization of the model.



In [None]:
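# Keras Implementation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(3, activation='relu', input_dim=2))  # hidden layer 1: 2 inputs -> 3 neurons
model.add(BatchNormalization())                      # normalizes the 3 outputs of hidden layer 1
model.add(Dense(2, activation='relu'))               # hidden layer 2: 3 inputs -> 2 neurons
model.add(BatchNormalization())                      # normalizes the 2 outputs of hidden layer 2
model.add(Dense(1, activation='sigmoid'))            # output layer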