# Batch Normalization

Batch normalization (often called *BatchNorm*) is a technique used in deep neural networks to stabilize and accelerate the training process. It was introduced by Sergey Ioffe and Christian Szegedy in 2015 to address issues such as *internal covariate shift* and *vanishing/exploding gradients*.

## Key Idea

1. Normalization Per Mini-Batch  
   During training, for each mini-batch (of size $m$), BatchNorm normalizes each feature (or channel) by subtracting the mini-batch mean and dividing by the mini-batch standard deviation:

   $$
   \hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}
   $$

   where:
   - $x$ is the input value (e.g., activation from the previous layer)
   - $\mu_{\text{batch}}$ is the per-feature mean over the mini-batch
   - $\sigma_{\text{batch}}^2$ is the per-feature variance over the mini-batch
   - $\epsilon$ is a small constant added for numerical stability (it avoids division by zero)

2. Trainable Scale and Shift  
   After normalization, BatchNorm applies a trainable linear transformation:

   $$
   y = \gamma \hat{x} + \beta
   $$

   where $\gamma$ and $\beta$ are learnable parameters that allow the network to “undo” any normalization if needed, effectively letting the model retain representational power.




3. At Inference Time  
   Instead of using the mini-batch mean and variance (which won’t be reliable for a single sample), BatchNorm uses an exponential moving average of the mean and variance collected during training. This ensures consistent normalization at inference.
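
The training-time steps (per-feature statistics, normalization, scale and shift) can be sketched in a few lines. This is a minimal illustration in plain Python lists, not a framework implementation: the function name `batchnorm_train` and the toy inputs are hypothetical, and $\epsilon$ is added to the variance under the square root, as in the original BatchNorm paper.

```python
import math

def batchnorm_train(batch, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per feature, then scale and shift.

    batch: list of samples, each a list of feature values.
    gamma, beta: per-feature scale and shift (lists of floats).
    Returns the outputs plus the batch mean and variance per feature.
    """
    m = len(batch)
    columns = list(zip(*batch))  # one tuple of values per feature
    means = [sum(col) / m for col in columns]
    # Biased variance (divide by m), which is what the normalization uses.
    variances = [sum((v - mu) ** 2 for v in col) / m
                 for col, mu in zip(columns, means)]
    outputs = []
    for sample in batch:
        x_hat = [(v - mu) / math.sqrt(var + eps)
                 for v, mu, var in zip(sample, means, variances)]
        outputs.append([g * xh + b for g, xh, b in zip(gamma, x_hat, beta)])
    return outputs, means, variances

# Toy input (hypothetical values): 2 samples, 2 features, identity scale/shift.
outputs, means, variances = batchnorm_train(
    [[1.0, 2.0], [3.0, 6.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(means, variances)
print(outputs)
```

With $\gamma = 1$ and $\beta = 0$, each feature of the output has roughly zero mean and unit variance across the batch, even though the two input features have different scales.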

## Benefits

1. Improved Training Stability  
   By normalizing the inputs to each layer, BatchNorm helps reduce the risk of vanishing or exploding gradients. This makes deeper networks more trainable.

2. Faster Training  
   BatchNorm allows you to use higher learning rates because it stabilizes the distribution of intermediate activations. This often results in faster convergence.

3. Regularization Effect  
   The mini-batch statistics introduce a small amount of noise (because each mini-batch’s mean and variance differ slightly). This acts like a regularizer and can help reduce overfitting.

4. Possible Removal of Dropout  
   In some architectures, especially convolutional neural networks, BatchNorm can reduce the need for dropout, simplifying the model without harming performance.

## Where It’s Used

BatchNorm is commonly applied after a convolution or fully connected layer, and before applying the nonlinearity (e.g., ReLU). It has been widely adopted in state-of-the-art models for image classification, object detection, and many other tasks.

## Practical Tips

- **Batch Size**  
  For BatchNorm to work effectively, you typically need a sufficiently large batch size (e.g., 32 or more). For very small batch sizes, the mini-batch statistics can become too noisy.

- **Layer Order**  
  A common pattern is:

  $$
  \text{Convolution} \rightarrow \text{BatchNorm} \rightarrow \text{ReLU} \rightarrow \text{Pooling}
  $$

  The order is not fixed: some architectures place BatchNorm after the activation, and pre-activation ResNets apply BatchNorm and ReLU before the convolution.


# Batch Normalization Example with Multivariate Data

Below is a more detailed example illustrating how Batch Normalization works on a mini-batch with **two features**. We’ll go through both the training and testing stages.

## Training Time

### Step 1: Mini-Batch Data

Assume we have two training examples, each with two features $(x_1, x_2)$:

- Example 1: $[4, 10]$
- Example 2: $[6, 12]$

We have:
$$
\begin{aligned}
x^{(1)} &= [4, 10], \\
x^{(2)} &= [6, 12].
\end{aligned}
$$

### Step 2: Compute Mini-Batch Mean and Variance

We compute the mean and variance *separately* for each feature across the batch.

1. **Mean for each feature**  

   $$
   \mu_1 = \frac{4 + 6}{2} = 5, 
   \quad
   \mu_2 = \frac{10 + 12}{2} = 11
   $$

   So the mini-batch mean vector is:
   $$
   \mu_{\text{batch}} = [5, \, 11].
   $$

2. **Variance for each feature**  

   For the first feature:
   $$
   \sigma_1^2 
   = \frac{(4 - 5)^2 + (6 - 5)^2}{2}
   = \frac{1 + 1}{2}
   = 1
   $$

   For the second feature:
   $$
   \sigma_2^2 
   = \frac{(10 - 11)^2 + (12 - 11)^2}{2}
   = \frac{1 + 1}{2}
   = 1
   $$

   So the mini-batch variance vector is:
   $$
   \sigma_{\text{batch}}^2 = [1, \, 1].
   $$

3. **Standard Deviation**  

   $$
   \sigma_{\text{batch}} 
   = \sqrt{ \sigma_{\text{batch}}^2 }
   = [1, \, 1].
   $$
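
As a quick check of Step 2, the same statistics can be computed directly. The sketch below assumes the two-sample batch from Step 1; the variable names are illustrative. It also shows the unbiased variance (dividing by $m - 1$) for contrast, since the choice of divisor matters a lot for a batch this small.

```python
import math

batch = [[4.0, 10.0], [6.0, 12.0]]
m = len(batch)
features = list(zip(*batch))  # [(4.0, 6.0), (10.0, 12.0)]

mu_batch = [sum(col) / m for col in features]
# Biased variance (divide by m), as in the worked example.
var_batch = [sum((v - mu) ** 2 for v in col) / m
             for col, mu in zip(features, mu_batch)]
std_batch = [math.sqrt(v) for v in var_batch]

# Unbiased variance (divide by m - 1), shown only for comparison.
var_unbiased = [sum((v - mu) ** 2 for v in col) / (m - 1)
                for col, mu in zip(features, mu_batch)]

print(mu_batch)      # [5.0, 11.0]
print(var_batch)     # [1.0, 1.0]
print(std_batch)     # [1.0, 1.0]
print(var_unbiased)  # [2.0, 2.0]
```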

### Step 3: Normalize the Inputs

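We normalize each feature of each example using the batch mean and standard deviation, with a small constant $\epsilon = 10^{-5}$. For simplicity this example adds $\epsilon$ to the standard deviation, whereas the test-time formula below adds it to the variance inside the square root. With $\sigma = 1$ the two conventions differ by only about $5 \times 10^{-6}$: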

$$
\hat{x}^{(i)} = \frac{x^{(i)} - \mu_{\text{batch}}}{\sigma_{\text{batch}} + \epsilon}.
$$

- For $x^{(1)} = [4, \, 10]$:

  $$
  \hat{x}^{(1)}_1 
  = \frac{4 - 5}{1 + 10^{-5}} 
  \approx -0.99999, 
  \quad
  \hat{x}^{(1)}_2 
  = \frac{10 - 11}{1 + 10^{-5}}
  \approx -0.99999.
  $$

  So:
  $$
  \hat{x}^{(1)} \approx [-0.99999, \, -0.99999].
  $$

- For $x^{(2)} = [6, \, 12]$:

  $$
  \hat{x}^{(2)}_1 
  = \frac{6 - 5}{1 + 10^{-5}} 
  \approx 0.99999, 
  \quad
  \hat{x}^{(2)}_2 
  = \frac{12 - 11}{1 + 10^{-5}}
  \approx 0.99999.
  $$

  So:
  $$
  \hat{x}^{(2)} \approx [0.99999, \, 0.99999].
  $$

### Step 4: Apply Scale and Shift

Let the learnable scale parameter be $\gamma = [2, \, 2]$ and the learnable shift parameter be $\beta = [3, \, 5]$. We apply:

$$
y^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta
$$

where $\odot$ denotes element-wise multiplication.

- For $\hat{x}^{(1)}$:

  $$
  y^{(1)} 
  = [2, 2] \odot [-0.99999, -0.99999] + [3, 5]
  \approx [-1.99998, -1.99998] + [3, 5]
  \approx [1.00002, 3.00002].
  $$

- For $\hat{x}^{(2)}$:

  $$
  y^{(2)} 
  = [2, 2] \odot [0.99999, 0.99999] + [3, 5]
  \approx [1.99998, 1.99998] + [3, 5]
  \approx [4.99998, 6.99998].
  $$

Hence, the final outputs of the BatchNorm layer for this mini-batch are approximately:

$$
y^{(1)} \approx [1.00002, \, 3.00002], 
\quad
y^{(2)} \approx [4.99998, \, 6.99998].
$$
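
The numbers from Steps 3 and 4 can be reproduced with a short script. This sketch assumes the batch statistics from Step 2 and the example values of $\gamma$ and $\beta$; the helper `scale_shift` is a hypothetical name. It also evaluates the alternative convention $\sqrt{\sigma^2 + \epsilon}$ so the two can be compared.

```python
import math

eps = 1e-5
batch = [[4.0, 10.0], [6.0, 12.0]]
mu, var = [5.0, 11.0], [1.0, 1.0]      # batch statistics from Step 2
gamma, beta = [2.0, 2.0], [3.0, 5.0]   # scale and shift from Step 4

def scale_shift(x_hat):
    """Element-wise gamma * x_hat + beta."""
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]

# Convention of the worked example: epsilon is added to the standard deviation.
y_std_eps = [
    scale_shift([(x - m) / (math.sqrt(v) + eps) for x, m, v in zip(sample, mu, var)])
    for sample in batch
]

# Alternative convention: epsilon is added to the variance inside the square root.
y_var_eps = [
    scale_shift([(x - m) / math.sqrt(v + eps) for x, m, v in zip(sample, mu, var)])
    for sample in batch
]

for row in y_std_eps:
    print([round(val, 5) for val in row])
# [1.00002, 3.00002]
# [4.99998, 6.99998]
```

The two conventions agree to within about $10^{-5}$ on these outputs, so the choice has no practical effect here. It only matters when the variance is close to zero.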

### Step 5: Update Running Mean and Variance

We maintain a running average of the batch means and variances for each feature:

$$
\mu_{\text{run}} \leftarrow \alpha \mu_{\text{run}} + (1 - \alpha)\mu_{\text{batch}}
$$

$$
\sigma_{\text{run}}^2 \leftarrow \alpha \sigma_{\text{run}}^2 + (1 - \alpha)\sigma_{\text{batch}}^2
$$

where $\alpha$ is a momentum term (e.g., 0.9).
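
The update rule is a one-line exponential moving average. The sketch below uses a hypothetical helper `update_running`; the initial values ($[0, 0]$ for the mean, $[1, 1]$ for the variance) are assumptions chosen for illustration.

```python
def update_running(running, batch_stat, alpha=0.9):
    """EMA update: keep a fraction alpha of the old value, blend in the new one."""
    return [alpha * r + (1 - alpha) * b for r, b in zip(running, batch_stat)]

# Assumed initial running statistics, updated once with the batch from Step 2.
mu_run = update_running([0.0, 0.0], [5.0, 11.0])
var_run = update_running([1.0, 1.0], [1.0, 1.0])
print([round(v, 3) for v in mu_run])   # [0.5, 1.1]
print([round(v, 3) for v in var_run])  # [1.0, 1.0]
```

Note that frameworks do not all define momentum the same way. In PyTorch, for instance, the `momentum` argument is the weight on the *new* batch statistic (default 0.1), so $\alpha = 0.9$ here corresponds to `momentum=0.1` there.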

## Testing Time

During testing (inference), we typically have a single example or a different batch size. We **do not** compute mean and variance from the test examples. Instead, we use the *stored* (running) statistics from training: $\mu_{\text{run}}$ and $\sigma_{\text{run}}^2$.

For a test example $x_{\text{test}}$ with 2 features, the normalized output is:

$$
\hat{x}_{\text{test}} 
= \frac{x_{\text{test}} - \mu_{\text{run}}}{\sqrt{\sigma_{\text{run}}^2 + \epsilon}}
$$

Then we apply the learned scale and shift:

$$
y_{\text{test}} = \gamma \odot \hat{x}_{\text{test}} + \beta.
$$

Using these running statistics ensures the model behaves consistently, regardless of test-time batch sizes or single-sample inputs.
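
At inference the layer reduces to a fixed per-feature affine transform. A minimal sketch, assuming hypothetical stored statistics (here set to the batch statistics from the training example) and a made-up test input:

```python
import math

def batchnorm_infer(x, mu_run, var_run, gamma, beta, eps=1e-5):
    """Normalize one sample with stored running statistics, then scale and shift."""
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, mu, var, g, b in zip(x, mu_run, var_run, gamma, beta)]

# Hypothetical values: running statistics assumed to have converged to [5, 11] and [1, 1].
y_test = batchnorm_infer([7.0, 9.0],
                         mu_run=[5.0, 11.0], var_run=[1.0, 1.0],
                         gamma=[2.0, 2.0], beta=[3.0, 5.0])
print([round(v, 4) for v in y_test])
```

Because nothing in this function depends on other test samples, the same input always produces the same output, whatever the batch size.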


# Running Statistics Example with Two Features

During each training iteration:
1. Compute the mini-batch mean and variance.
2. Update the running mean and variance using a momentum or smoothing factor.
3. Use the mini-batch mean/variance for that training step’s normalization.

At test time, you simply use the *final* stored running mean and variance (accumulated throughout training), rather than recomputing from the test samples.

Below, we’ll illustrate how BatchNorm’s running mean and variance are updated **across multiple training iterations** when the same mini-batch data is encountered each time. This corresponds to the **two-feature** example:

$$
\begin{aligned}
x^{(1)} &= [4, \, 10],\\
x^{(2)} &= [6, \, 12].
\end{aligned}
$$

---

## Mini-Batch Statistics

For each training iteration, we assume the **same** mini-batch:
- Two samples, each with 2 features:
  - $x^{(1)} = [4, \, 10]$
  - $x^{(2)} = [6, \, 12]$

1. **Batch Mean**  
   For feature 1 (the first component of each vector):
   $$
   \mu_1 = \frac{4 + 6}{2} = 5.
   $$
   For feature 2 (the second component of each vector):
   $$
   \mu_2 = \frac{10 + 12}{2} = 11.
   $$
   So:
   $$
   \mu_{\text{batch}} = [5, \, 11].
   $$

2. **Batch Variance**  
   For feature 1:
   $$
   \sigma_1^2 
   = \frac{(4 - 5)^2 + (6 - 5)^2}{2}
   = \frac{1 + 1}{2}
   = 1.
   $$
   For feature 2:
   $$
   \sigma_2^2 
   = \frac{(10 - 11)^2 + (12 - 11)^2}{2}
   = \frac{1 + 1}{2}
   = 1.
   $$
   Hence:
   $$
   \sigma_{\text{batch}}^2 = [1, \, 1].
   $$

---

## Initialization and Momentum

We’ll track the **running mean** ($\mu_{\text{run}}$) and **running variance** ($\sigma_{\text{run}}^2$) over multiple iterations. Suppose:

- **Initial** running mean: $\mu_{\text{run}}^{(0)} = [0, \, 0]$
- **Initial** running variance: $\sigma_{\text{run}}^{2,(0)} = [1, \, 1]$
- **Momentum** (or smoothing factor): $\alpha = 0.9$

After each iteration, we update:

$$
\mu_{\text{run}}^{(t+1)}
\;\leftarrow\;
\alpha\, \mu_{\text{run}}^{(t)}
\;+\;(1 - \alpha)\,\mu_{\text{batch}},
$$

$$
\sigma_{\text{run}}^{2,(t+1)}
\;\leftarrow\;
\alpha\, \sigma_{\text{run}}^{2,(t)}
\;+\;(1 - \alpha)\,\sigma_{\text{batch}}^2.
$$

---

## Batch #1

- **Given:**  
  - $\mu_{\text{run}}^{(0)} = [0, \, 0]$  
  - $\sigma_{\text{run}}^{2,(0)} = [1, \, 1]$  
  - $\mu_{\text{batch}} = [5, \, 11]$  
  - $\sigma_{\text{batch}}^2 = [1, \, 1]$

1. **Update Running Mean**  
   $$
   \mu_{\text{run}}^{(1)}
   = \alpha \, [0,\,0] \;+\; (1 - \alpha)\,[5,\,11]
   = 0.9 \,[0,\,0] \;+\; 0.1 \,[5,\,11].
   $$
   Therefore:
   $$
   \mu_{\text{run}}^{(1)} = [0.5,\; 1.1].
   $$

2. **Update Running Variance**  
   $$
   \sigma_{\text{run}}^{2,(1)}
   = 0.9 \,[1,\,1]
   \;+\; 0.1 \,[1,\,1]
   = [0.9,\,0.9] + [0.1,\,0.1]
   = [1,\,1].
   $$

So after iteration 1:
$$
\mu_{\text{run}}^{(1)} = [0.5, \, 1.1], 
\quad
\sigma_{\text{run}}^{2,(1)} = [1, \, 1].
$$

---

## Batch #2

- **Given:**  
  - $\mu_{\text{run}}^{(1)} = [0.5, \, 1.1]$  
  - $\sigma_{\text{run}}^{2,(1)} = [1, \, 1]$  
  - (Same mini-batch) $\mu_{\text{batch}} = [5, \, 11]$  
  - (Same mini-batch) $\sigma_{\text{batch}}^2 = [1, \, 1]$

1. **Update Running Mean**  
   $$
   \mu_{\text{run}}^{(2)}
   = 0.9 \,[0.5, \, 1.1]
   \;+\; 0.1 \,[5, \, 11].
   $$
   Calculate each dimension:
   - First feature:
     $$
     0.9 \times 0.5 = 0.45,
     \quad
     0.1 \times 5 = 0.5,
     \quad
     \text{sum} = 0.95.
     $$
   - Second feature:
     $$
     0.9 \times 1.1 = 0.99,
     \quad
     0.1 \times 11 = 1.1,
     \quad
     \text{sum} = 2.09.
     $$
   Hence:
   $$
   \mu_{\text{run}}^{(2)} = [0.95,\; 2.09].
   $$

2. **Update Running Variance**  
   $$
   \sigma_{\text{run}}^{2,(2)}
   = 0.9 \,[1,\, 1]
   + 0.1 \,[1,\, 1]
   = [1,\, 1].
   $$

So after iteration 2:
$$
\mu_{\text{run}}^{(2)} = [0.95, \, 2.09],
\quad
\sigma_{\text{run}}^{2,(2)} = [1, \, 1].
$$

---

## Batch #3

- **Given:**  
  - $\mu_{\text{run}}^{(2)} = [0.95, \, 2.09]$  
  - $\sigma_{\text{run}}^{2,(2)} = [1, \, 1]$  
  - (Same mini-batch) $\mu_{\text{batch}} = [5, \, 11]$  
  - (Same mini-batch) $\sigma_{\text{batch}}^2 = [1, \, 1]$

1. **Update Running Mean**  
   $$
   \mu_{\text{run}}^{(3)}
   = 0.9 \,[0.95,\, 2.09]
   + 0.1 \,[5,\, 11].
   $$
   Calculate each dimension:
   - First feature:
     $$
     0.9 \times 0.95 = 0.855,
     \quad
     0.1 \times 5 = 0.5,
     \quad
     \text{sum} = 1.355.
     $$
   - Second feature:
     $$
     0.9 \times 2.09 = 1.881,
     \quad
     0.1 \times 11 = 1.1,
     \quad
     \text{sum} = 2.981.
     $$
   Hence:
   $$
   \mu_{\text{run}}^{(3)} = [1.355,\, 2.981].
   $$

2. **Update Running Variance**  
   $$
   \sigma_{\text{run}}^{2,(3)}
   = 0.9 \,[1,\,1]
   + 0.1 \,[1,\,1]
   = [1,\,1].
   $$

So after iteration 3:
$$
\mu_{\text{run}}^{(3)} = [1.355, \, 2.981],
\quad
\sigma_{\text{run}}^{2,(3)} = [1, \, 1].
$$
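
The three iterations above can be replayed in a loop. This sketch assumes the same initial values and momentum as the walkthrough; the `history` list is just a convenience for inspecting each step.

```python
alpha = 0.9
mu_batch, var_batch = [5.0, 11.0], [1.0, 1.0]   # same mini-batch every iteration
mu_run, var_run = [0.0, 0.0], [1.0, 1.0]        # initial running statistics

history = []
for t in range(1, 4):
    mu_run = [alpha * r + (1 - alpha) * b for r, b in zip(mu_run, mu_batch)]
    var_run = [alpha * r + (1 - alpha) * b for r, b in zip(var_run, var_batch)]
    history.append((mu_run, var_run))
    print(t, [round(v, 3) for v in mu_run], [round(v, 3) for v in var_run])
# 1 [0.5, 1.1] [1.0, 1.0]
# 2 [0.95, 2.09] [1.0, 1.0]
# 3 [1.355, 2.981] [1.0, 1.0]
```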

---

## Observations

1. **Running Mean Convergence:**  
   Each iteration, $\mu_{\text{run}}$ is nudged closer to the true mini-batch mean $[5,\,11]$. If you continue many more iterations (with the same mini-batch), $\mu_{\text{run}}$ will gradually converge toward $[5,\, 11]$.

2. **Running Variance Remains at [1, 1]:**  
   Because our mini-batch variance is consistently $[1,\,1]$ and our initial running variance was also $[1,\,1]$, the update formula keeps it at $[1,\,1]$. In practice, if the batch variance or initial guess differed, you’d see the variance slowly move toward the actual values.

3. **Practical Use at Inference:**  
   Once training is finished, we use the final stored $\mu_{\text{run}}$ and $\sigma_{\text{run}}^2$ to normalize any new data during testing/inference:
   $$
   \hat{x}_{\text{test}} 
   = \frac{x_{\text{test}} - \mu_{\text{run}}}{\sqrt{\sigma_{\text{run}}^2 + \epsilon}}
   $$
   which avoids computing statistics on potentially small or single test samples.

Thus, this example shows how the running statistics evolve iteration by iteration for **two-feature** data in a consistent mini-batch.
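
The update rule can also be unrolled into a closed form, which makes the convergence claim concrete. This is a short derivation under the example's assumptions (the same mini-batch at every step, $\mu_{\text{run}}^{(0)} = [0, \, 0]$, $\alpha = 0.9$):

$$
\mu_{\text{run}}^{(t)}
= \alpha^{t}\,\mu_{\text{run}}^{(0)} + (1 - \alpha^{t})\,\mu_{\text{batch}}
= (1 - 0.9^{t})\,[5,\, 11].
$$

For $t = 3$ this gives $0.271 \cdot [5, \, 11] = [1.355, \, 2.981]$, matching Batch #3. The remaining gap shrinks by a factor of $\alpha$ per iteration, so it takes about 44 iterations ($0.9^{44} \approx 0.0097$) for the running mean to come within 1% of $[5, \, 11]$.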


# How Testing Works in Batch Normalization

When you train a neural network with Batch Normalization (BatchNorm), **for each mini-batch** of data you:
1. Compute the **mini-batch mean** ($\mu_{\text{batch}}$) and **mini-batch variance** ($\sigma_{\text{batch}}^2$).
2. Use these values to **normalize** your activations.
3. **Update** the *running* (or *moving*) mean $\mu_{\text{run}}$ and variance $\sigma_{\text{run}}^2$.

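However, **at test (inference) time**, you often pass **one** sample at a time (or very small batches). Statistics computed from a single sample or a handful of samples would be too noisy to normalize with, and they would make each prediction depend on whatever else happens to be in the batch.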

### The Key Idea

- **During training:**
  - We keep track of an exponential-moving average (EMA) of the batch mean and variance over all iterations. These become our **running mean** ($\mu_{\text{run}}$) and **running variance** ($\sigma_{\text{run}}^2$).
  - The update rules are:
    $$
    \mu_{\text{run}} \leftarrow \alpha \,\mu_{\text{run}} + (1-\alpha)\,\mu_{\text{batch}}
    $$
    $$
    \sigma_{\text{run}}^2 \leftarrow \alpha \,\sigma_{\text{run}}^2 + (1-\alpha)\,\sigma_{\text{batch}}^2
    $$
    where $\alpha$ is often something like $0.9$ or $0.99$.

- **At test time:**
  - We **do not** compute $\mu_{\text{batch}}$ or $\sigma_{\text{batch}}^2$ from the test sample(s).
  - Instead, we use the final stored $\mu_{\text{run}}$ and $\sigma_{\text{run}}^2$ from training.
  - This keeps the output deterministic and independent of the test batch, because $\mu_{\text{run}}$ and $\sigma_{\text{run}}^2$ approximate the activation statistics seen over the course of training (weighted toward the most recent batches).

Formally, if $x_{\text{test}}$ is a test example (or a small batch), the BatchNorm normalization step at test time is:

$$
\hat{x}_{\text{test}} 
= \frac{x_{\text{test}} - \mu_{\text{run}}}{\sqrt{\sigma_{\text{run}}^2 + \epsilon}}
$$

Then we apply the learned scale and shift:

$$
y_{\text{test}} 
= \gamma \,\hat{x}_{\text{test}} + \beta
$$

Here, $\gamma$ and $\beta$ are the same **trainable parameters** used during training.

### Why Not Use the Test Sample’s Mean and Variance?

- A single test sample (or a very small test batch) won’t give a reliable estimate of the mean or variance.
- If you computed the statistics from a single sample (per feature, as in a fully connected layer), the mean would equal the sample itself and the variance would be zero, so every normalized value would be 0 and the layer would output $\beta$ no matter what the input was.
- Using **running** (EMA) statistics from training gives a far more stable estimate of how the activations are distributed.
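
The single-sample problem is easy to demonstrate. The sketch below applies training-style batch statistics to a "batch" of one sample in a fully connected setting (one value per feature). The function name and the second input are hypothetical; $\gamma$ and $\beta$ reuse the earlier example values.

```python
import math

def batchnorm_with_batch_stats(batch, gamma, beta, eps=1e-5):
    """BatchNorm using statistics computed from the given batch itself."""
    m = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / m for c in cols]
    variances = [sum((v - mu) ** 2 for v in c) / m for c, mu in zip(cols, means)]
    return [[g * (v - mu) / math.sqrt(var + eps) + b
             for v, mu, var, g, b in zip(s, means, variances, gamma, beta)]
            for s in batch]

gamma, beta = [2.0, 2.0], [3.0, 5.0]
# Two very different single-sample "batches".
out_a = batchnorm_with_batch_stats([[4.0, 10.0]], gamma, beta)
out_b = batchnorm_with_batch_stats([[100.0, -7.0]], gamma, beta)
print(out_a)  # [[3.0, 5.0]]
print(out_b)  # [[3.0, 5.0]]
```

Both inputs produce exactly $\beta$: with one sample the mean equals the sample, the numerator is zero, and all information about the input is lost.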

### Summary

- **Training:** Compute mean/variance from the current mini-batch, update the *running average* for future use.
- **Testing:** Use the *stored* running mean/variance (no new computation from the test data).  

That’s how BatchNorm remains consistent even when test batches are small (or just single images, etc.).
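
Putting the two modes together, a BatchNorm layer is essentially a small stateful object with a training flag. The class below is an illustrative sketch only: the name `BatchNorm1D`, its attributes, and the initial values are assumptions rather than any framework's API, and it covers the forward pass only (no gradients, so $\gamma$ and $\beta$ stay at their initial values).

```python
import math

class BatchNorm1D:
    """Minimal forward-only BatchNorm over lists of per-feature values."""

    def __init__(self, num_features, alpha=0.9, eps=1e-5):
        self.gamma = [1.0] * num_features    # scale (would be learned)
        self.beta = [0.0] * num_features     # shift (would be learned)
        self.mu_run = [0.0] * num_features   # running mean
        self.var_run = [1.0] * num_features  # running variance
        self.alpha = alpha
        self.eps = eps
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training: use batch statistics and update the running averages.
            m = len(batch)
            cols = list(zip(*batch))
            mu = [sum(c) / m for c in cols]
            var = [sum((v - u) ** 2 for v in c) / m for c, u in zip(cols, mu)]
            a = self.alpha
            self.mu_run = [a * r + (1 - a) * b for r, b in zip(self.mu_run, mu)]
            self.var_run = [a * r + (1 - a) * b for r, b in zip(self.var_run, var)]
        else:
            # Inference: use the stored running statistics, nothing is updated.
            mu, var = self.mu_run, self.var_run
        return [[g * (v - u) / math.sqrt(s + self.eps) + b
                 for v, u, s, g, b in zip(x, mu, var, self.gamma, self.beta)]
                for x in batch]

bn = BatchNorm1D(2)
for _ in range(200):                  # repeat the same mini-batch many times
    bn([[4.0, 10.0], [6.0, 12.0]])
bn.training = False                   # switch to inference mode
single = bn([[5.0, 11.0]])            # a single test sample
print([round(v, 4) for v in bn.mu_run], [round(v, 4) for v in single[0]])
```

After enough training steps the running mean has converged to the batch mean, so a single test sample at $[5, 11]$ is mapped to approximately $[0, 0]$ without computing any statistics from the test data.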
