# 8 Aug 2024

https://chatgpt.com/c/a9b218dc-221f-4f6c-8fa2-9b6c78c07ab5

The image you provided outlines the algorithm for Batch Normalization specifically used in a Transformer model. Here's a breakdown of the equations and their meaning:

### Forward Propagation
1. **Inputs and Outputs**:
   - **X**: The input matrix \( X \in \mathbb{R}^{B \times d} \), where \( B \) is the batch size and \( d \) is the feature dimension. Each row \( x_i \) is one example.
   - **Y**: The output matrix after batch normalization, with the same shape as \( X \).

2. **Mini-batch Mean (\( \mu_B \))**:
   - Calculated separately for each feature as the mean over the \( B \) examples in the batch, so \( \mu_B \in \mathbb{R}^d \).
   - \( \mu_B = \frac{1}{B} \sum_{i=1}^B x_i \) where \( x_i \) are the input vectors (rows of \( X \)) in the batch.

3. **Mini-batch Variance (\( \sigma_B^2 \))**:
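   - Calculated separately for each feature as the variance over the batch. The square is applied element-wise and the sum is divided by \( B \) (the biased estimate).
   - \( \sigma_B^2 = \frac{1}{B} \sum_{i=1}^B (x_i - \mu_B)^2 \)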

4. **Normalization**:
   - \( \hat{X} = \frac{X - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \) where \( \epsilon \) is a small number added for numerical stability (not shown in the image).

5. **Scale and Shift**:
   - The normalized data is scaled and shifted using parameters \( \gamma \) and \( \beta \), which are learned during training.
   - \( Y = \gamma \hat{X} + \beta \)
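
The forward steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the exact procedure from the image: the function name `batchnorm_forward`, the default `eps=1e-5`, and the test data are assumptions made for the example.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for an input of shape (B, d)."""
    mu_B = X.mean(axis=0)                      # per-feature mean, shape (d,)
    var_B = X.var(axis=0)                      # biased variance (divides by B), shape (d,)
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)  # normalize each feature column
    Y = gamma * X_hat + beta                   # element-wise scale and shift
    return Y, X_hat, mu_B, var_B

# Hypothetical data: 8 examples, 4 features, deliberately off-center and scaled.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
Y, X_hat, mu_B, var_B = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
```

With \( \gamma = 1 \) and \( \beta = 0 \), each column of `Y` has mean close to 0 and variance close to 1, whatever the scale of the original features.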

### Updating Parameters During Training
- **Running Mean (\( \mu \))**:
  - Updated with a momentum \( \alpha \), blending the current mini-batch mean and the previous running mean.
  - \( \mu = \alpha \mu + (1 - \alpha)\mu_B \)

- **Running Variance (\( \sigma^2 \))**:
  - Similarly updated using the current mini-batch variance.
  - \( \sigma^2 = \alpha \sigma^2 + (1 - \alpha)\sigma_B^2 \)
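
As a rough sketch of how the two running statistics evolve, the snippet below applies both updates over many mini-batches. The function name `update_running_stats`, the value `alpha=0.9`, the initial values, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def update_running_stats(running_mean, running_var, mu_B, var_B, alpha=0.9):
    """Exponential moving average: alpha weights the old value, (1 - alpha) the new batch."""
    new_mean = alpha * running_mean + (1 - alpha) * mu_B
    new_var = alpha * running_var + (1 - alpha) * var_B
    return new_mean, new_var

# Hypothetical stream of mini-batches drawn from a distribution with mean 5 and variance 4.
rng = np.random.default_rng(0)
running_mean, running_var = np.zeros(3), np.ones(3)
for _ in range(200):
    X = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    running_mean, running_var = update_running_stats(
        running_mean, running_var, X.mean(axis=0), X.var(axis=0)
    )
```

After enough batches the running values settle near the statistics of the data and no longer depend on their initial values. Some libraries define momentum the other way round: PyTorch's `BatchNorm1d`, for example, applies its `momentum` argument to the new batch statistic rather than to the old running value.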

### Backward Propagation (used during training)
- **Inputs**:
  - \( \frac{\partial L}{\partial Y} \) (the gradient of the loss with respect to the output \( Y \)).
- **Outputs**:
  - \( \frac{\partial L}{\partial X} \) (the gradient of the loss with respect to the input \( X \)).
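- \( \frac{\partial L}{\partial X} \) is passed back to the preceding layer. The learned parameters \( \gamma \) and \( \beta \) are updated with their own gradients, \( \frac{\partial L}{\partial \gamma} \) and \( \frac{\partial L}{\partial \beta} \), which are computed in the same backward pass. The equations for these gradients are not shown here; the image refers to them as Equation 3.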
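
Since the image's Equation 3 is not reproduced here, the sketch below uses the standard closed-form batch normalization gradients obtained from the chain rule. It may differ in form from what the image shows. The function names, the cache layout, and the toy loss \( L = \sum W \odot Y \) are assumptions for the example.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Forward pass that also returns the values the backward pass needs."""
    mu_B = X.mean(axis=0)
    var_B = X.var(axis=0)
    inv_std = 1.0 / np.sqrt(var_B + eps)
    X_hat = (X - mu_B) * inv_std
    return gamma * X_hat + beta, (X_hat, inv_std, gamma)

def batchnorm_backward(dY, cache):
    """Given dL/dY, return dL/dX, dL/dgamma and dL/dbeta."""
    X_hat, inv_std, gamma = cache
    B = dY.shape[0]
    dbeta = dY.sum(axis=0)               # beta is added to every row
    dgamma = (dY * X_hat).sum(axis=0)    # gamma multiplies X_hat
    dX_hat = dY * gamma
    # The mean and variance both depend on every x_i, which gives the two subtracted terms.
    dX = (inv_std / B) * (
        B * dX_hat - dX_hat.sum(axis=0) - X_hat * (dX_hat * X_hat).sum(axis=0)
    )
    return dX, dgamma, dbeta

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
W = rng.normal(size=(6, 3))  # toy loss L = sum(W * Y), so dL/dY = W
Y, cache = batchnorm_forward(X, gamma, beta)
dX, dgamma, dbeta = batchnorm_backward(W, cache)
```

One useful sanity check is that each column of `dX` sums to zero: adding a constant to every example of a feature is removed by the mean subtraction, so it cannot change the loss.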

### Inference
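- Once the model is trained, normalization uses the running mean and variance accumulated during training instead of the statistics of the current mini-batch, so the output for one example no longer depends on the other examples in the batch:
  - \( Y = \gamma \frac{X - \mu}{\sigma} + \beta \), where \( \sigma \) is the square root of the running variance \( \sigma^2 \). In practice \( \epsilon \) is added under the square root, as in the training-time normalization.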
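
A minimal sketch of the inference-time computation, assuming the running statistics have already been accumulated. The function name `batchnorm_inference`, the default `eps=1e-5`, and the numeric values are made up for illustration.

```python
import numpy as np

def batchnorm_inference(X, gamma, beta, running_mean, running_var, eps=1e-5):
    """Normalize with fixed running statistics; no batch statistics are computed."""
    return gamma * (X - running_mean) / np.sqrt(running_var + eps) + beta

# Hypothetical statistics for two features.
running_mean = np.array([5.0, -1.0])
running_var = np.array([4.0, 0.25])
gamma, beta = np.ones(2), np.zeros(2)

x_single = np.array([[7.0, -1.5]])                 # a batch of size 1 works fine
y_single = batchnorm_inference(x_single, gamma, beta, running_mean, running_var)

batch = np.array([[7.0, -1.5], [100.0, 3.0]])      # same example next to an outlier
y_batch = batchnorm_inference(batch, gamma, beta, running_mean, running_var)
```

Here `y_single` is approximately `[[1.0, -1.0]]`, and the first row of `y_batch` is identical to it: the outlier in the second row has no effect, which would not be the case if batch statistics were used.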

This algorithm helps stabilize training by normalizing the inputs to a layer, which makes the model less sensitive to the scale of individual features. Batch Normalization is most useful in deep networks, where it often allows larger learning rates and faster convergence, and in many cases improves final accuracy.