# Layer Normalization: Step-by-Step Explanation and Implementation

Layer Normalization is a deep learning technique that normalizes a layer's inputs across the feature dimension, so that the features of each individual sample have a mean of 0 and a variance of approximately 1.

---

## Mathematical Formulation

Given an input $ x $ of shape $ (\text{batch size}, \text{num features}) $, Layer Normalization works as follows:

### Step 1: Calculate the Mean
For each sample (row), compute the mean across all features:

$$
\text{mean} = \frac{1}{n} \sum_{i=1}^n x_i
$$

Where:
- $ n $: Number of features.
- $ x_i $: Value of the $ i $-th feature.

---

### Step 2: Calculate the Variance
For each sample (row), compute the variance across all features:

$$
\text{variance} = \frac{1}{n} \sum_{i=1}^n (x_i - \text{mean})^2
$$

---

### Step 3: Normalize
Normalize each feature by subtracting the mean and dividing by the square root of the variance (plus a small $ \epsilon $ to avoid division by zero):

$$
\hat{x}_i = \frac{x_i - \text{mean}}{\sqrt{\text{variance} + \epsilon}}
$$

Where:
- $ \hat{x}_i $: The normalized value of $ x_i $.
- $ \epsilon $: A small constant (e.g., $ 1 \times 10^{-5} $) to prevent division by zero.

---
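As a worked example of the three steps, take a single sample $ x = (1, 2, 3) $ with $ \epsilon = 10^{-5} $. The numbers below are rounded to five decimal places:

$$
\begin{aligned}
\text{mean} &= \frac{1 + 2 + 3}{3} = 2 \\
\text{variance} &= \frac{(-1)^2 + 0^2 + 1^2}{3} = \frac{2}{3} \\
\hat{x} &= \frac{(-1,\ 0,\ 1)}{\sqrt{2/3 + 10^{-5}}} \approx (-1.22474,\ 0,\ 1.22474)
\end{aligned}
$$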


```python
def layernorm(x, epsilon=1e-5):
    """
    Implements Layer Normalization from scratch (without learnable parameters).

    Args:
        x (list of lists): Input matrix (batch_size x num_features).
        epsilon (float): Small constant to avoid division by zero.

    Returns:
        list of lists: Layer-normalized matrix (same shape as input).
    """
    # Initialize the output
    output = []

    # Iterate over each sample in the batch
    for sample in x:
        # Step 1: Calculate the mean of the sample
        mean = sum(sample) / len(sample)

        # Step 2: Calculate the variance of the sample
        variance = sum((xi - mean) ** 2 for xi in sample) / len(sample)

        # Step 3: Normalize each element in the sample
        normalized_sample = [(xi - mean) / ((variance + epsilon) ** 0.5) for xi in sample]

        # Append the normalized sample to the output
        output.append(normalized_sample)

    return output

# Example usage
x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]

normalized_x = layernorm(x)

print("Input Matrix:")
for row in x:
    print(row)

print("\nNormalized Matrix:")
for row in normalized_x:
    print(row)
```


```text
Input Matrix:
[1.0, 2.0, 3.0]
[4.0, 5.0, 6.0]
[7.0, 8.0, 9.0]

Normalized Matrix:
[-1.2247356859083902, 0.0, 1.2247356859083902]
[-1.2247356859083902, 0.0, 1.2247356859083902]
[-1.2247356859083902, 0.0, 1.2247356859083902]
```
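To check the claim that every normalized sample ends up with a mean of 0 and a variance close to 1, the short sketch below recomputes the statistics of each output row. It restates `layernorm` compactly so that it runs on its own, and `row_stats` is a name introduced here purely for illustration. The variance should come out slightly below 1, because $ \epsilon $ is added to the variance before taking the square root.

```python
from statistics import fmean, pvariance

def layernorm(x, epsilon=1e-5):
    """Row-wise normalization, restated compactly from the implementation above."""
    output = []
    for sample in x:
        mean = sum(sample) / len(sample)
        variance = sum((xi - mean) ** 2 for xi in sample) / len(sample)
        std = (variance + epsilon) ** 0.5
        output.append([(xi - mean) / std for xi in sample])
    return output

x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]

# (mean, population variance) of every normalized row
row_stats = [(fmean(row), pvariance(row)) for row in layernorm(x)]

for mean, variance in row_stats:
    print(f"mean={mean:.6f}, variance={variance:.6f}")
```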


# Batch Normalization: Step-by-Step Explanation and Implementation

Batch Normalization (BatchNorm) is a technique used to normalize the input of a neural network layer across a batch of data. It helps stabilize training, speeds up convergence, and reduces sensitivity to initialization.

---

## Mathematical Formulation

Given an input matrix $ x $ of shape $ (\text{batch size}, \text{num features}) $, Batch Normalization works as follows:

### Step 1: Calculate the Mean for Each Feature
For each feature $ f $, compute the mean across the batch:

$$
\text{mean}_f = \frac{1}{N} \sum_{i=1}^{N} x_{i,f}
$$

Where:
- $ N $: Number of samples in the batch (batch size).
- $ x_{i,f} $: The value of the $ f $-th feature for the $ i $-th sample.

---

### Step 2: Calculate the Variance for Each Feature
For each feature $ f $, compute the variance across the batch:

$$
\text{variance}_f = \frac{1}{N} \sum_{i=1}^{N} (x_{i,f} - \text{mean}_f)^2
$$

---

### Step 3: Normalize Each Feature
Normalize each feature using the computed mean and variance:

$$
\hat{x}_{i,f} = \frac{x_{i,f} - \text{mean}_f}{\sqrt{\text{variance}_f + \epsilon}}
$$

Where:
- $ \epsilon $: A small constant (e.g., $ 10^{-5} $) added to avoid division by zero.

---

### Step 4: Apply Learnable Parameters $ \gamma $ and $ \beta $
Finally, scale and shift the normalized features using learnable parameters $ \gamma $ and $ \beta $:

$$
y_{i,f} = \gamma_f \cdot \hat{x}_{i,f} + \beta_f
$$

Where:
- $ \gamma_f $: Learnable scaling parameter for feature $ f $.
- $ \beta_f $: Learnable shifting parameter for feature $ f $.

---
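As a worked example of the four steps, take the first feature column of a batch of three samples, $ (x_{1,1}, x_{2,1}, x_{3,1}) = (1, 4, 7) $, with $ \epsilon = 10^{-5} $. The values $ \gamma_1 = 2 $ and $ \beta_1 = 0.5 $ are assumptions chosen only for illustration, and the results are rounded to five decimal places:

$$
\begin{aligned}
\text{mean}_1 &= \frac{1 + 4 + 7}{3} = 4 \\
\text{variance}_1 &= \frac{(-3)^2 + 0^2 + 3^2}{3} = 6 \\
\hat{x}_{\cdot,1} &= \frac{(-3,\ 0,\ 3)}{\sqrt{6 + 10^{-5}}} \approx (-1.22474,\ 0,\ 1.22474) \\
y_{\cdot,1} &= 2 \cdot \hat{x}_{\cdot,1} + 0.5 \approx (-1.94949,\ 0.5,\ 2.94949)
\end{aligned}
$$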

```python
def batchnorm(x, gamma, beta, epsilon=1e-5):
    """
    Implements Batch Normalization from scratch.

    Args:
        x (list of lists or 2D array): Input matrix (batch_size x num_features).
        gamma (list): Learnable scaling parameters (1D array of size num_features).
        beta (list): Learnable shifting parameters (1D array of size num_features).
        epsilon (float): Small constant to avoid division by zero.

    Returns:
        list of lists: Batch-normalized matrix (same shape as input).
    """
    # Transpose the input to operate on features across the batch
    x_transposed = list(zip(*x))  # Shape: (num_features, batch_size)

    # Initialize the output
    normalized_x_transposed = []

    # Iterate over each feature
    for feature_idx, feature_values in enumerate(x_transposed):
        # Step 1: Calculate the mean for the feature
        mean = sum(feature_values) / len(feature_values)

        # Step 2: Calculate the variance for the feature
        variance = sum((xi - mean) ** 2 for xi in feature_values) / len(feature_values)

        # Step 3: Normalize the feature
        normalized_feature = [(xi - mean) / ((variance + epsilon) ** 0.5) for xi in feature_values]

        # Step 4: Apply gamma and beta (scaling and shifting)
        scaled_and_shifted = [gamma[feature_idx] * normalized_value + beta[feature_idx]
                              for normalized_value in normalized_feature]

        # Append the processed feature
        normalized_x_transposed.append(scaled_and_shifted)

    # Transpose back to the original shape
    normalized_x = list(zip(*normalized_x_transposed))

    # Convert tuples back to lists
    return [list(row) for row in normalized_x]

# Example usage
x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]

# Learnable parameters (gamma and beta for scaling and shifting)
gamma = [1.0, 1.0, 1.0]  # Scaling factors
beta = [0.0, 0.0, 0.0]   # Shifting factors

# Apply Batch Normalization
normalized_x = batchnorm(x, gamma, beta)

print("Input Matrix:")
for row in x:
    print(row)

print("\nBatch-Normalized Matrix:")
for row in normalized_x:
    print(row)
```


```text
Input Matrix:
[1.0, 2.0, 3.0]
[4.0, 5.0, 6.0]
[7.0, 8.0, 9.0]

Batch-Normalized Matrix:
[-1.2247438507721387, -1.2247438507721387, -1.2247438507721387]
[0.0, 0.0, 0.0]
[1.2247438507721387, 1.2247438507721387, 1.2247438507721387]
```
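The example above uses $ \gamma = 1 $ and $ \beta = 0 $, which leaves the normalized values unchanged. As a rough illustration of what the learnable parameters do, the sketch below applies arbitrarily chosen `gamma` and `beta` values (assumptions for demonstration, not trained parameters) and then measures each output column. The function is a compact restatement of `batchnorm` so the block runs on its own; `column_stats` is a hypothetical helper name.

```python
def batchnorm(x, gamma, beta, epsilon=1e-5):
    """Column-wise normalization followed by a per-feature scale and shift."""
    out_columns = []
    for f, values in enumerate(zip(*x)):
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        std = (variance + epsilon) ** 0.5
        out_columns.append([gamma[f] * ((v - mean) / std) + beta[f] for v in values])
    return [list(row) for row in zip(*out_columns)]

def column_stats(matrix):
    """Return (mean, standard deviation) for every column of a matrix."""
    stats = []
    for column in zip(*matrix):
        mean = sum(column) / len(column)
        std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
        stats.append((mean, std))
    return stats

x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]
gamma = [1.0, 2.0, 0.5]    # illustrative scaling factors
beta = [0.0, 10.0, -1.0]   # illustrative shifting factors

y = batchnorm(x, gamma, beta)
for f, (mean, std) in enumerate(column_stats(y)):
    print(f"feature {f}: mean={mean:.4f} (beta={beta[f]}), std={std:.4f} (gamma={gamma[f]})")
```

Under these assumptions, each output column should have a mean equal to its $ \beta_f $ and a standard deviation very close to $ \gamma_f $, which is how the network can undo or reshape the normalization if that helps training.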


# Comparison Between BatchNorm and LayerNorm

Batch Normalization (BatchNorm) and Layer Normalization (LayerNorm) are two widely used normalization techniques in deep learning. While they serve similar purposes, they differ significantly in how they normalize data and in their use cases.

---

## Key Differences Between BatchNorm and LayerNorm

| **Aspect**           | **Batch Normalization (BatchNorm)**                                                                     | **Layer Normalization (LayerNorm)**                                                                      |
|-----------------------|--------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| **Normalization Axis**| Normalizes across the batch dimension (for each feature across all samples in the batch).             | Normalizes across the feature dimension (for each sample across all features).                          |
| **Formula**           | $\mu_f = \frac{1}{N} \sum_{i=1}^N x_{i,f}, \sigma_f^2 = \frac{1}{N} \sum_{i=1}^N (x_{i,f} - \mu_f)^2$ | $\mu_s = \frac{1}{n} \sum_{j=1}^n x_{s,j}, \sigma_s^2 = \frac{1}{n} \sum_{j=1}^n (x_{s,j} - \mu_s)^2$ |
| **Use Case**          | Commonly used in Convolutional Neural Networks (CNNs) and feedforward networks during mini-batch training.| Suitable for sequence models (e.g., RNNs, Transformers) or networks with variable batch sizes.          |
| **Dependency on Batch Size**| Depends on batch statistics during training, so it needs reasonably large batches; inference typically uses running averages of the mean and variance collected during training. | Works independently of the batch size, making it suitable for single-sample inference or varying batch sizes. |
| **Learnable Parameters** | Scaling ($\gamma$) and shifting ($\beta$) parameters per feature.                              | Scaling ($\gamma$) and shifting ($\beta$) parameters per feature, shared across all samples.         |
| **When Normalization Occurs** | Normalizes features across samples in the same batch.                                         | Normalizes features within each individual sample.                                                       |
| **Computation**       | Operates on $(\text{batch size} \times \text{num features})$ and normalizes per feature across the batch.| Operates on $(\text{batch size} \times \text{num features})$ and normalizes across features for each sample. |
| **Advantages**        | Stabilizes training and reduces sensitivity to initialization, especially in convolutional layers.     | Useful in NLP and sequence models, where batch sizes vary and the same computation must apply during training and inference. |
| **Disadvantages**     | Batch statistics become noisy for very small batches, and a single-sample batch has zero variance, so training-time normalization breaks down. | Statistics must be recomputed for every sample at inference time, whereas BatchNorm can reuse precomputed running statistics. |

---
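The "Dependency on Batch Size" row can be illustrated with a small experiment: normalize one sample on its own, then as part of a batch of three, and compare. This is a minimal sketch that omits $ \gamma $ and $ \beta $ (equivalent to $ \gamma = 1 $, $ \beta = 0 $) and uses training-style batch statistics for BatchNorm; the helper name `normalize` is an assumption introduced for this example.

```python
def normalize(values, epsilon=1e-5):
    """Normalize one list of numbers using its own mean and variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = (variance + epsilon) ** 0.5
    return [(v - mean) / std for v in values]

def layernorm(x, epsilon=1e-5):
    # Statistics come from each row (sample) separately
    return [normalize(row, epsilon) for row in x]

def batchnorm(x, epsilon=1e-5):
    # Statistics come from each column (feature) across the batch
    columns = [normalize(column, epsilon) for column in zip(*x)]
    return [list(row) for row in zip(*columns)]

batch = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]
single = [batch[0]]  # the first sample as a batch of size 1

ln_in_batch, ln_alone = layernorm(batch)[0], layernorm(single)[0]
bn_in_batch, bn_alone = batchnorm(batch)[0], batchnorm(single)[0]

print("LayerNorm unchanged by batch size:", ln_in_batch == ln_alone)
print("BatchNorm inside the batch:", bn_in_batch)
print("BatchNorm on a single sample:", bn_alone)
```

With a batch of one, every column contains a single value, so its mean equals that value and the BatchNorm output collapses to zeros. LayerNorm never looks at other samples, so its result for a sample is the same in both cases.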

## Mathematical Formulas

### Batch Normalization
For each feature $ f $, BatchNorm operates as:

1. Compute mean:
   $$
   \mu_f = \frac{1}{N} \sum_{i=1}^N x_{i,f}
   $$
2. Compute variance:
   $$
   \sigma_f^2 = \frac{1}{N} \sum_{i=1}^N (x_{i,f} - \mu_f)^2
   $$
3. Normalize:
   $$
   \hat{x}_{i,f} = \frac{x_{i,f} - \mu_f}{\sqrt{\sigma_f^2 + \epsilon}}
   $$
4. Scale and shift:
   $$
   y_{i,f} = \gamma_f \cdot \hat{x}_{i,f} + \beta_f
   $$

Where:
- $ N $: Batch size.
- $ f $: Feature index.

---

### Layer Normalization
For each sample $ s $, LayerNorm operates as:

1. Compute mean:
   $$
   \mu_s = \frac{1}{n} \sum_{j=1}^n x_{s,j}
   $$
2. Compute variance:
   $$
   \sigma_s^2 = \frac{1}{n} \sum_{j=1}^n (x_{s,j} - \mu_s)^2
   $$
3. Normalize:
   $$
   \hat{x}_{s,j} = \frac{x_{s,j} - \mu_s}{\sqrt{\sigma_s^2 + \epsilon}}
   $$
4. Scale and shift:
   $$
   y_{s,j} = \gamma_j \cdot \hat{x}_{s,j} + \beta_j
   $$

Where:
- $ n $: Number of features.
- $ s $: Sample index.

---
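The four LayerNorm steps listed above, including the scale and shift, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `layernorm_affine` and the `gamma` and `beta` values are chosen for this example and are not trained parameters.

```python
def layernorm_affine(x, gamma, beta, epsilon=1e-5):
    """LayerNorm with a per-feature scale (gamma) and shift (beta)."""
    output = []
    for sample in x:
        n = len(sample)
        mean = sum(sample) / n                                  # step 1
        variance = sum((xj - mean) ** 2 for xj in sample) / n   # step 2
        std = (variance + epsilon) ** 0.5
        output.append([
            gamma[j] * ((xj - mean) / std) + beta[j]            # steps 3 and 4
            for j, xj in enumerate(sample)
        ])
    return output

x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]
gamma = [2.0, 1.0, 0.5]   # illustrative scaling factors, one per feature
beta = [1.0, 0.0, -1.0]   # illustrative shifting factors, one per feature

y = layernorm_affine(x, gamma, beta)
for row in y:
    print([round(v, 4) for v in row])
```

Note that `gamma` and `beta` are indexed by feature `j` only, so the same parameters are applied to every sample, while the mean and variance are recomputed per sample.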

## Practical Differences

### BatchNorm Example:
Given an input matrix:

$$
x = \begin{bmatrix}
1.0 & 2.0 & 3.0 \\
4.0 & 5.0 & 6.0 \\
7.0 & 8.0 & 9.0
\end{bmatrix}
$$

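BatchNorm normalizes across the **batch** (i.e., for each column or feature). With $ \gamma = 1 $ and $ \beta = 0 $, each of the columns $ (1, 4, 7) $, $ (2, 5, 8) $, and $ (3, 6, 9) $ maps to approximately $ (-1.2247, 0, 1.2247) $, so all entries within a given output row are equal.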

### LayerNorm Example:
Using the same input matrix:

$$
x = \begin{bmatrix}
1.0 & 2.0 & 3.0 \\
4.0 & 5.0 & 6.0 \\
7.0 & 8.0 & 9.0
\end{bmatrix}
$$

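LayerNorm normalizes across the **features** (i.e., for each row or sample). Each of the rows $ (1, 2, 3) $, $ (4, 5, 6) $, and $ (7, 8, 9) $ maps to approximately $ (-1.2247, 0, 1.2247) $, so all three output rows are identical.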

---
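The two examples above can be run side by side with the sketch below. It omits $ \gamma $ and $ \beta $ and expresses BatchNorm as "transpose, normalize each row, transpose back", which makes the difference in normalization axis explicit. The helper names `normalize` and `transpose` are assumptions introduced for this illustration.

```python
def normalize(values, epsilon=1e-5):
    """Normalize one list of numbers to zero mean and roughly unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = (variance + epsilon) ** 0.5
    return [(v - mean) / std for v in values]

def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]

x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
]

# LayerNorm: statistics per row (sample)
layer_normed = [normalize(row) for row in x]

# BatchNorm: statistics per column (feature)
batch_normed = transpose([normalize(column) for column in transpose(x)])

print("LayerNorm:")
for row in layer_normed:
    print([round(v, 4) for v in row])

print("BatchNorm:")
for row in batch_normed:
    print([round(v, 4) for v in row])
```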

## Summary

### BatchNorm:
- Normalizes **across the batch** for each feature.
- Works best for **CNNs and feedforward networks**.
- Relies on batch statistics, so it needs reasonably large batches to work well.

### LayerNorm:
- Normalizes **within each sample** across features.
- Preferred for **sequence models** (e.g., RNNs, Transformers).
- Independent of batch size, making it robust for single-sample inputs.

By understanding these differences, you can choose the appropriate normalization technique based on your model's requirements.
