### **Batch Normalization: A Comprehensive Guide**

#### **Why Batch Normalization?**
Batch Normalization (BatchNorm) is a technique used to **speed up training**, **stabilize deep networks**, and **improve generalization**. It targets the following issues:

- **Internal Covariate Shift**: The distribution of hidden activations changes during training as earlier layers are updated, which slows convergence.
- **Vanishing/Exploding Gradients**: Normalizing activations keeps them in a range where gradients stay well scaled.
- **Sensitivity to Initialization**: Training depends less on the choice of initial weights.
- **Overfitting**: BatchNorm has a mild regularization effect, which can reduce the need for dropout and other regularization techniques.

---

### **Mathematical Formulation of Batch Normalization**

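Consider an input **mini-batch** $X = \{x_1, x_2, \dots, x_m\}$ holding the values of a single activation (one feature) from an intermediate layer of a neural network, one value per sample. The batch normalization process follows these steps, applied independently to every feature: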

#### **Step 1: Compute Batch Statistics**
For a given mini-batch, compute the mean and variance:
$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
$$
$$
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
$$
where:
- $\mu_B$ is the batch mean.
- $\sigma_B^2$ is the batch variance.
- $m$ is the batch size.

#### **Step 2: Normalize the Batch**
Normalize each input $x_i$ using:
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$
where $\epsilon$ is a small constant to prevent division by zero.

#### **Step 3: Scale and Shift**
To allow the network to learn optimal representations, introduce two trainable parameters, **scale** $\gamma$ and **shift** $\beta$:
$$
y_i = \gamma \hat{x}_i + \beta
$$

- $\gamma$ (scale) and $\beta$ (shift) allow the model to recover the original activations if needed: setting $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ gives $y_i = x_i$.
- $y_i$ is the final output after batch normalization.
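
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `batchnorm_forward`, the `(m, d)` row-per-sample layout, and the default `eps=1e-5` are assumptions made for the example.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Apply Steps 1-3 to a mini-batch x of shape (m, d), one row per sample."""
    mu = x.mean(axis=0)                    # Step 1: batch mean, per feature
    var = x.var(axis=0)                    # Step 1: batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 2: normalize
    y = gamma * x_hat + beta               # Step 3: scale and shift
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4))  # illustrative data: 8 samples, 4 features

# With gamma = 1 and beta = 0 the output is just the normalized batch.
y, x_hat, mu, var = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print("per-feature mean:", y.mean(axis=0))
print("per-feature std: ", y.std(axis=0))

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization.
identity, *_ = batchnorm_forward(x, gamma=np.sqrt(var + 1e-5), beta=mu)
print("recovers input:", np.allclose(identity, x))
```

The last call shows why the scale and shift parameters do not reduce what the layer can represent: one particular setting of $\gamma$ and $\beta$ reproduces the unnormalized input.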



### **Batch Normalization in Neural Networks**
Batch Normalization can be applied either **before** or **after the activation function** in hidden layers. The steps below show the before-activation placement for a fully connected layer:

1. Compute **linear transformation**: $z = Wx + b$
2. Apply **Batch Normalization**: Normalize $z$ using batch statistics, then scale and shift to obtain $y = \gamma \hat{z} + \beta$.
3. Apply **activation function**: $a = f(y)$

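For Convolutional Neural Networks (CNNs), BatchNorm is applied **per channel**: the mean and variance of each channel are computed over the batch and both spatial dimensions, and each channel has its own $\gamma$ and $\beta$.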
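
As a sketch of the per-channel case, the NumPy example below normalizes a 4D activation tensor. The `(N, C, H, W)` layout and the name `batchnorm2d_forward` are assumptions for illustration; frameworks differ in layout and also track running statistics, which this sketch leaves out.

```python
import numpy as np

def batchnorm2d_forward(x, gamma, beta, eps=1e-5):
    """x has shape (N, C, H, W); gamma and beta have shape (C,)."""
    # One mean and one variance per channel, pooled over batch and spatial axes.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Reshape the per-channel parameters so they broadcast over N, H and W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
# Three channels with very different scales (illustrative data).
scales = np.array([1.0, 10.0, 100.0]).reshape(1, 3, 1, 1)
x = rng.normal(size=(4, 3, 5, 5)) * scales

out = batchnorm2d_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print("per-channel std before:", x.std(axis=(0, 2, 3)))
print("per-channel std after: ", out.std(axis=(0, 2, 3)))
```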



### **Effect on Gradient Descent**
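Batch Normalization changes the gradients that flow backward through a layer: because $\mu_B$ and $\sigma_B^2$ are computed from the whole mini-batch, the gradient with respect to one input depends on every sample in the batch.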

#### **Gradient Updates with BatchNorm**
Using the chain rule, the gradients of the loss function $L$ w.r.t. $x_i$ are modified as:

$$
\frac{\partial L}{\partial x_i} = \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \left( \frac{\partial L}{\partial y_i} \cdot \gamma - \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial y_j} \cdot \gamma - \frac{\hat{x}_i}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial y_j} \cdot \gamma \cdot \hat{x}_j \right)
$$

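The two subtracted terms appear because $\mu_B$ and $\sigma_B^2$ depend on every sample in the batch, so each $x_i$ affects the loss through all $m$ normalized outputs. Together with the normalized forward activations, this stabilizes weight updates and leads to faster training.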
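
One way to gain confidence in a backward-pass formula is to compare it with finite differences. The sketch below implements the standard BatchNorm input gradient, in which both batch-sum terms carry a $\frac{1}{m}$ factor, and checks it numerically. The function names, the toy loss $L = \sum_i w_i y_i$, and the step size `h` are assumptions chosen for the example.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x_hat, var, gamma, eps)

def bn_backward(dy, cache):
    """Return dL/dx, dL/dgamma, dL/dbeta given dL/dy of shape (m, d)."""
    x_hat, var, gamma, eps = cache
    dx_hat = dy * gamma
    # Both batch terms are means over the batch, i.e. sums scaled by 1/m.
    dx = (dx_hat
          - dx_hat.mean(axis=0)
          - x_hat * (dx_hat * x_hat).mean(axis=0)) / np.sqrt(var + eps)
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
w = rng.normal(size=(6, 3))  # toy loss L = sum(w * y), so dL/dy = w

def loss(x_in):
    y, _ = bn_forward(x_in, gamma, beta)
    return np.sum(w * y)

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(w, cache)

# Central finite differences, one input entry at a time.
h = 1e-5
num_dx = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    step = np.zeros_like(x)
    step[idx] = h
    num_dx[idx] = (loss(x + step) - loss(x - step)) / (2 * h)

max_err = np.max(np.abs(dx - num_dx))
print("max abs difference:", max_err)
```

A side effect of the mean-subtraction term is that the input gradients of each feature sum to zero over the batch, which can serve as a quick sanity check.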



### **Example: Applying BatchNorm in a Small ANN**

#### **Consider a Neural Network with One Hidden Layer**
- **Input layer**: 2 neurons
- **Hidden layer**: 3 neurons (ReLU activation)
- **Output layer**: 1 neuron (Sigmoid activation)




```mermaid
graph LR
    subgraph Inputs
        direction LR
        style Inputs fill:#2e7c92,stroke:#64b5f6,stroke-width:2px
        x1[x<sub>11</sub>]
        x2[x<sub>12</sub>]
    end
    
    subgraph "hidden-layer" ["Hidden \nLayer 1"]
        direction LR
        style hidden-layer fill:#2c5c36,stroke:#85ff9f,stroke-width:2px
        h1(b<sub>11</sub>)
        h2(b<sub>12</sub>)
        h3(b<sub>13</sub>)
    end
    
    subgraph Output
        direction LR
        style Output fill:#e78383,stroke:#f8cc52,stroke-width:2px
        y(b<sub>21</sub>)
    end


    x1 --> |w<sub>11</sub><sup>1</sup>| h1
    x1 --> |w<sub>12</sub><sup>1</sup>| h2
    x1 --> |w<sub>13</sub><sup>1</sup>| h3
    x2 --> |w<sub>21</sub><sup>1</sup>| h1
    x2 --> |w<sub>22</sub><sup>1</sup>| h2
    x2 --> |w<sub>23</sub><sup>1</sup>| h3
    h1 --> |w<sub>11</sub><sup>2</sup>| y
    h2 --> |w<sub>21</sub><sup>2</sup>| y
    h3 --> |w<sub>31</sub><sup>2</sup>| y
    y --> Out(["Prediction"])
```

#### **Forward Pass with BatchNorm**
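1. Compute linear transformation: $Z^{(1)} = W^{(1)}X + b^{(1)}$
2. Apply Batch Normalization, with $\mu_B$ and $\sigma_B^2$ computed separately for each hidden neuron over the mini-batch:
   $$
   \hat{Z}^{(1)} = \frac{Z^{(1)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
   $$
   $$
   Y^{(1)} = \gamma \hat{Z}^{(1)} + \beta
   $$
3. Apply activation function: $A^{(1)} = \mathrm{ReLU}(Y^{(1)})$
4. Compute output layer: $\hat{y} = \sigma(W^{(2)} A^{(1)} + b^{(2)})$, where $\sigma$ here denotes the sigmoid function, not the batch standard deviation.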
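
The forward pass of this 2-3-1 network can be sketched as follows. The weights and inputs are random placeholders, and the code stores one sample per row, so the linear step is written as `X @ W1 + b1` rather than $W^{(1)}X + b^{(1)}$; both choices are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
m = 5                                   # mini-batch size (illustrative)
X = rng.normal(size=(m, 2))             # 2 input features per sample

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden (3 neurons)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output (1 neuron)
gamma, beta = np.ones(3), np.zeros(3)           # BatchNorm parameters, one per hidden neuron
eps = 1e-5

Z1 = X @ W1 + b1                                # 1. linear transformation
mu, var = Z1.mean(axis=0), Z1.var(axis=0)       # 2. batch statistics per hidden neuron
Z1_hat = (Z1 - mu) / np.sqrt(var + eps)
Y1 = gamma * Z1_hat + beta
A1 = relu(Y1)                                   # 3. activation
y_hat = sigmoid(A1 @ W2 + b2)                   # 4. output layer

print("hidden activations shape:", A1.shape)
print("predictions shape:", y_hat.shape)
```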



### **Advantages of Batch Normalization**
- **Improved Training Speed**: Allows for higher learning rates without instability.
- **Reduced Sensitivity to Initialization**: Networks train effectively even with suboptimal initial weights.
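- **Regularization Effect**: Each sample is normalized with the statistics of whichever mini-batch it lands in, which adds a small amount of noise to the activations and can reduce the need for dropout.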
- **Better Gradient Flow**: Prevents issues related to vanishing/exploding gradients.
- **Invariance to Input Scaling**: Model becomes less sensitive to data preprocessing choices.

### **Disadvantages and Considerations**
- **Additional Computation**: Each BatchNorm layer adds mean and variance calculations to the forward pass and extra terms to the backward pass.
- **Instability with Small Batch Sizes**: If the batch size is too small, the estimated mean and variance are noisy and the normalization becomes unreliable.
- **Dependence on Batch Size**: Because the statistics are computed per batch, performance can vary significantly with different batch sizes.
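
The effect of batch size on the quality of the statistics can be illustrated with a small simulation. The synthetic population, the batch sizes 2 and 64, and the helper name `spread_of_batch_means` are all assumptions for this sketch; it measures only how much the batch mean fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "activation" values standing in for one feature of one layer.
population = rng.normal(loc=2.0, scale=1.5, size=100_000)

def spread_of_batch_means(batch_size, n_batches=1000):
    """Standard deviation of the batch mean across many random batches."""
    batches = rng.choice(population, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

spread_small = spread_of_batch_means(2)
spread_large = spread_of_batch_means(64)
print("spread of batch mean, m=2: ", spread_small)
print("spread of batch mean, m=64:", spread_large)
```

The spread shrinks roughly with $1/\sqrt{m}$, so very small batches give each layer a noticeably different normalization from one step to the next.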

### **Alternative Normalization Techniques**
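- **Layer Normalization**: Normalizes each sample across its features instead of normalizing each feature across the batch, so it does not depend on the batch size.
- **Instance Normalization**: Normalizes each channel of each sample separately across its spatial dimensions; commonly used in style transfer.
- **Group Normalization**: Divides channels into groups and normalizes within each group, separately for each sample.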
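
These techniques differ mainly in which axes the mean and variance are computed over. The sketch below makes that explicit for an `(N, C, H, W)` tensor; the layout, the helper `normalize`, and the group count `G = 2` are assumptions, and the learnable scale and shift are omitted to keep the comparison short.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6, 5, 5))  # (N, C, H, W)
N, C, H, W = x.shape

batch_norm = normalize(x, (0, 2, 3))      # per channel, over batch and space
layer_norm = normalize(x, (1, 2, 3))      # per sample, over all its features
instance_norm = normalize(x, (2, 3))      # per sample and channel, over space

G = 2                                     # number of channel groups
grouped = x.reshape(N, G, C // G, H, W)
group_norm = normalize(grouped, (2, 3, 4)).reshape(N, C, H, W)  # per sample and group

print("layer norm, mean per sample:", layer_norm.mean(axis=(1, 2, 3)))
```

Only `batch_norm` pools over the batch axis, which is why the other three behave the same regardless of batch size.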


### **Key Takeaways**
- **Batch Normalization normalizes the inputs** to each layer using mini-batch statistics, keeping their distribution stable during training.
- **Reduces dependence on initialization**, allowing deeper networks to train efficiently.
- **Acts as a regularizer**, reducing the need for dropout.
- **Improves gradient flow** by stabilizing updates, preventing vanishing/exploding gradients.

BatchNorm is a powerful tool that significantly enhances training efficiency and generalization in deep learning models.

#### **Understanding Internal Covariate Shift**
Internal Covariate Shift (ICS) refers to the phenomenon where the distribution of inputs to each layer in a deep neural network changes during training. This happens due to weight updates across layers, causing instability and requiring lower learning rates for convergence.

Effects of ICS:
- Slower convergence due to shifting input distributions.
- Requires careful weight initialization to maintain stable learning.
- Increases the risk of vanishing/exploding gradients in deep networks.

Batch Normalization mitigates ICS by ensuring that each layer receives inputs with a stable distribution, enabling faster training and more robust optimization.


### **How Do We Get $\gamma$ and $\beta$?**

#### **1. Initialization**
- $\gamma$ is typically initialized to **1** so that the initial transformation does not scale the normalized values.
- $\beta$ is initialized to **0** to keep the mean centered at zero initially.

#### **2. Learning Process**
- Both $\gamma$ and $\beta$ are **trainable parameters**, meaning they are updated using backpropagation.
- The gradients of the loss function w.r.t. $\gamma$ and $\beta$ are computed, and their values are updated using gradient descent:
  
  $$
  \gamma \leftarrow \gamma - \eta \frac{\partial L}{\partial \gamma}
  $$
  $$
  \beta \leftarrow \beta - \eta \frac{\partial L}{\partial \beta}
  $$
  
  where $\eta$ is the learning rate.
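
The update rule can be tried on a toy problem in which the best $\gamma$ and $\beta$ are known in advance. Everything specific here is an assumption for illustration: the synthetic target `3 * x_hat + 2`, the mean-squared-error loss, the learning rate `eta = 0.1`, and the 200 steps. The gradient expressions $\frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i} \hat{x}_i$ and $\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}$ follow from $y_i = \gamma \hat{x}_i + \beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=4.0, scale=2.0, size=(32, 1))           # one feature, 32 samples
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Synthetic target: the layer fits it exactly when gamma = 3 and beta = 2.
target = 3.0 * x_hat + 2.0

gamma, beta = np.ones(1), np.zeros(1)                       # standard initialization
eta = 0.1                                                   # learning rate

for _ in range(200):
    y = gamma * x_hat + beta
    dy = 2.0 * (y - target) / len(x)        # dL/dy for L = mean((y - target)^2)
    dgamma = (dy * x_hat).sum(axis=0)       # dL/dgamma
    dbeta = dy.sum(axis=0)                  # dL/dbeta
    gamma -= eta * dgamma                   # gradient descent updates
    beta -= eta * dbeta

print("learned gamma:", gamma)
print("learned beta: ", beta)
```

In a real network, $\frac{\partial L}{\partial y_i}$ comes from backpropagation through the later layers rather than from a fixed target.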

#### **3. Effect of $\gamma$ and $\beta$**
- If $\gamma = 1$ and $\beta = 0$, the output equals the normalized activation $\hat{x}_i$, which has zero mean and approximately unit variance over the batch.
- The model learns optimal values of $\gamma$ and $\beta$ that help in achieving the best representations.
