# Layer Normalization Equations

Layer Normalization normalizes the activations of a neural network layer across the features of each data sample independently. The mean and variance are computed once per sample, over that sample's features, so that each sample's normalized activations have zero mean and unit variance across the feature dimension. The normalized output is then scaled and shifted using learnable parameters, allowing the network to maintain expressiveness while stabilizing the training process.

The key equations used in Layer Normalization are as follows:

## 1. Mean Calculation

For a given layer with input features $x_i$, the mean $\mu$ of the features is computed as:

$$
\mu = \frac{1}{H} \sum_{i=1}^{H} x_i
$$

where:
- $H$ is the number of features in the layer.
- $x_i$ is the $i$-th feature of the input.
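
As a small illustration of the mean step, the sketch below computes $\mu$ separately for each sample in a batch. The tensor values and the names `x` and `mu` are assumptions chosen for the example, and the feature dimension is assumed to be the last one.

```python
import torch

# Two samples with H = 4 features each (arbitrary illustrative values).
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [10.0, 20.0, 30.0, 40.0]])

# Average over the last (feature) dimension, giving one mean per sample.
mu = x.mean(dim=-1, keepdim=True)

print(mu.shape)       # one row per sample, one column
print(mu.flatten())   # means are 2.5 and 25.0
```

Using `keepdim=True` keeps `mu` at shape `(2, 1)`, so it broadcasts against `x` in the later steps.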

## 2. Variance Calculation

The variance $\sigma^2$ of the features is calculated as:

$$
\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2
$$

where:
- $\sigma^2$ is the variance of the features.
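
The variance above divides by $H$ rather than $H - 1$, so it is the biased (population) estimate. The following sketch, with illustrative values and variable names, compares the formula with `torch.var`, whose default divides by $H - 1$.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])  # one sample, H = 4 (illustrative)
mu = x.mean(dim=-1, keepdim=True)

# Direct translation of the formula: average squared deviation (divide by H).
var_formula = ((x - mu) ** 2).mean(dim=-1, keepdim=True)

# Built-in equivalent: unbiased=False selects the 1/H estimate.
var_biased = x.var(dim=-1, keepdim=True, unbiased=False)

# Default behaviour of torch.var divides by H - 1 instead.
var_unbiased = x.var(dim=-1, keepdim=True)

print(var_formula.item())   # 1.25
print(var_unbiased.item())  # about 1.6667
```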

## 3. Normalization

The normalized output $\hat{x}_i$ for each feature $x_i$ is given by:

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

where:
- $\mu$ is the mean of the features.
- $\sigma^2$ is the variance of the features.
- $\epsilon$ is a small constant added for numerical stability (e.g., $\epsilon = 1 \times 10^{-5}$).
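
The normalization step can be checked numerically. This is a minimal sketch with made-up inputs; `eps` takes the example value given above. The second sample is ten times the first, which shows that both rows normalize to almost the same values.

```python
import torch

eps = 1e-5
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [10.0, 20.0, 30.0, 40.0]])

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

# Subtract the per-sample mean and divide by the per-sample standard deviation.
x_hat = (x - mu) / torch.sqrt(var + eps)

# Each row now has mean close to 0 and variance close to 1
# (slightly below 1 because of eps in the denominator).
row_mean = x_hat.mean(dim=-1)
row_var = x_hat.var(dim=-1, unbiased=False)
print(row_mean, row_var)
```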

## 4. Scaling and Shifting

After normalization, the output is scaled and shifted using learnable parameters $\gamma$ and $\beta$, which hold one value per feature:

$$
y_i = \gamma_i \hat{x}_i + \beta_i
$$

where:
- $\gamma_i$ is the scale parameter for the $i$-th feature.
- $\beta_i$ is the shift parameter for the $i$-th feature.
- $\hat{x}_i$ is the normalized feature.
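
Putting the four steps together, the sketch below implements them in a helper named `layer_norm_manual` (a hypothetical name, not a PyTorch function) and compares the result with `torch.nn.LayerNorm`. The scale and shift values are arbitrary, set only so that the comparison is not trivial.

```python
import torch
import torch.nn as nn


def layer_norm_manual(x, gamma, beta, eps=1e-5):
    """Apply the four equations over the last dimension of x."""
    mu = x.mean(dim=-1, keepdim=True)                   # 1. mean
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # 2. variance (1/H)
    x_hat = (x - mu) / torch.sqrt(var + eps)            # 3. normalization
    return gamma * x_hat + beta                         # 4. scale and shift


torch.manual_seed(0)
x = torch.randn(3, 8)  # 3 samples, H = 8 features

ln = nn.LayerNorm(8, eps=1e-5)
with torch.no_grad():
    # Overwrite the default parameters with arbitrary per-feature values.
    ln.weight.copy_(torch.linspace(0.5, 1.5, 8))
    ln.bias.copy_(torch.linspace(-1.0, 1.0, 8))

y_manual = layer_norm_manual(x, ln.weight, ln.bias)
y_builtin = ln(x)

matches = torch.allclose(y_manual, y_builtin, atol=1e-5)
print(matches)
```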



## LayerNorm

The `LayerNorm` (Layer Normalization) class in PyTorch, `torch.nn.LayerNorm`, normalizes the inputs across the features within each example in a batch. This helps stabilize and accelerate training by keeping the distribution of each layer's outputs stable. The effect is often described as reducing internal covariate shift, which is the change in the distribution of layer inputs during training. A more stable distribution makes learning more efficient and reduces the chances of the network getting stuck in suboptimal regions of the loss landscape.

Benefits of using Layer Normalization:
- Stabilizes Training: By normalizing the activations, LayerNorm helps reduce fluctuations in the gradient during backpropagation, leading to more stable and faster convergence.
- Independence from Batch Size: Unlike Batch Normalization, which normalizes across a batch of examples, LayerNorm normalizes each example independently. This makes it particularly useful when training with very small batch sizes or even with a single example (e.g., in online learning).
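
The batch-size independence can be observed directly. In the sketch below, which uses random inputs and illustrative layer sizes, the LayerNorm output for a sample does not change when other samples are removed from the batch, while the output of `nn.BatchNorm1d` in training mode does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6)  # 4 samples, 6 features

ln = nn.LayerNorm(6)
bn = nn.BatchNorm1d(6)  # newly constructed modules start in training mode

# Compare the first two rows of the full-batch output with the output
# obtained when only those two rows are passed in.
ln_same = torch.allclose(ln(x)[:2], ln(x[:2]), atol=1e-6)
bn_same = torch.allclose(bn(x)[:2], bn(x[:2]), atol=1e-6)

print(ln_same)  # True: statistics come from each sample alone
print(bn_same)  # False: statistics depend on which samples share the batch
```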

How LayerNorm Works:
- Normalization: For each individual example in the batch, LayerNorm computes the mean and variance across the features selected by `normalized_shape` (the trailing dimensions of the input). The layer then normalizes the input by subtracting the mean and dividing by the standard deviation, with a small $\epsilon$ added to the variance for numerical stability.
- Learnable Parameters: LayerNorm has a learnable per-feature scale $\gamma$ and shift $\beta$, much like Batch Normalization. They allow the network to restore some of the representational power that might be lost during normalization. In PyTorch they are initialized to ones and zeros respectively and are learned during training.
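
The learnable parameters can be inspected on the module itself. This short sketch uses 16 features as an arbitrary size. `weight` corresponds to $\gamma$ and `bias` to $\beta$, and passing `elementwise_affine=False` creates a layer without them.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(16)

# One scale and one shift value per normalized feature.
print(ln.weight.shape, ln.bias.shape)

# Total number of learnable values: 16 scales + 16 shifts.
n_params = sum(p.numel() for p in ln.parameters())

# Without the affine transform the layer only normalizes.
ln_plain = nn.LayerNorm(16, elementwise_affine=False)
n_plain = sum(p.numel() for p in ln_plain.parameters())

print(n_params, n_plain)  # 32 0
```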

```python
import torch
import torch.nn as nn


class MyModel(nn.Module):
    """Small classifier that applies LayerNorm after each hidden layer."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 16)    # expects input of shape (batch, 10)
        self.norm1 = nn.LayerNorm(16)   # normalizes the 16 hidden features per sample
        self.fc2 = nn.Linear(16, 10)
        self.norm2 = nn.LayerNorm(10)   # normalizes the 10 hidden features per sample
        self.fc3 = nn.Linear(10, 3)     # 3 output classes
        self.softmax = nn.Softmax(dim=1)  # probabilities over the class dimension

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.norm1(x)
        x = torch.relu(self.fc2(x))
        x = self.norm2(x)
        x = self.fc3(x)
        x = self.softmax(x)
        return x


model = MyModel()
```
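
To see what this model produces, the sketch below runs a forward pass on random inputs. To stay self-contained it rebuilds the same stack of layers with `nn.Sequential` as a stand-in for `MyModel`. The names `net`, `batch`, and `probs` and the batch size of 4 are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer order as MyModel.forward: Linear -> ReLU -> LayerNorm, twice,
# followed by the output layer and a softmax over the class dimension.
net = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(), nn.LayerNorm(16),
    nn.Linear(16, 10), nn.ReLU(), nn.LayerNorm(10),
    nn.Linear(10, 3), nn.Softmax(dim=1),
)

batch = torch.randn(4, 10)  # 4 samples with 10 features each

with torch.no_grad():       # inference only, no gradients needed
    probs = net(batch)

row_sums = probs.sum(dim=1)
print(probs.shape)  # 4 samples, 3 class probabilities each
print(row_sums)     # each row sums to 1 up to rounding
```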