# Layer Normalization Equations

Layer Normalization normalizes the activations of a neural network layer across the features of each data sample independently. The mean and variance are computed per sample, over all features of that sample, so that each sample's normalized activations have zero mean and unit variance across the feature dimension. The normalized output is then scaled and shifted using learnable parameters, allowing the network to retain its expressiveness while stabilizing training.

The key equations used in Layer Normalization are as follows:

## 1. Mean Calculation

For a single data sample with input features $x_i$ at a given layer, the mean $\mu$ of the features is computed as:

$$
\mu = \frac{1}{H} \sum_{i=1}^{H} x_i
$$

where:
- $H$ is the number of features in the layer.
- $x_i$ is the $i$-th feature of the input.
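
As a small worked example, the mean of one sample can be computed directly. The feature values below are made up purely for illustration.

```python
# Hypothetical activations of one sample with H = 4 features.
x = [1.0, 2.0, 3.0, 4.0]

H = len(x)
mu = sum(x) / H  # mean over the feature dimension of this sample

print(mu)  # 2.5
```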

## 2. Variance Calculation

The variance $\sigma^2$ of the features is calculated as:

$$
\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2
$$

where:
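- $\mu$ is the mean computed in the previous step.
- $\sigma^2$ is the variance of the features. The sum is divided by $H$ rather than $H - 1$, so this is the biased (population) variance.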
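
A minimal sketch of the variance formula, using made-up feature values for one sample. Note that the sum is divided by $H$, matching the equation above.

```python
# Hypothetical activations of one sample with H = 4 features.
x = [1.0, 2.0, 3.0, 4.0]

H = len(x)
mu = sum(x) / H

# Divide by H (not H - 1), as in the variance equation.
var = sum((xi - mu) ** 2 for xi in x) / H

print(var)  # 1.25
```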

## 3. Normalization

The normalized output $\hat{x}_i$ for each feature $x_i$ is given by:

$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

where:
- $\mu$ is the mean of the features.
- $\sigma^2$ is the variance of the features.
- $\epsilon$ is a small constant added for numerical stability (e.g., $\epsilon = 1 \times 10^{-5}$).
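
The three steps so far can be sketched for a single sample as follows. The function name `normalize` and the input values are illustrative assumptions, not part of any particular library.

```python
import math

def normalize(x, eps=1e-5):
    """Normalize one sample (a list of H features) to roughly zero mean, unit variance."""
    H = len(x)
    mu = sum(x) / H
    var = sum((xi - mu) ** 2 for xi in x) / H
    inv_std = 1.0 / math.sqrt(var + eps)  # eps keeps the denominator away from zero
    return [(xi - mu) * inv_std for xi in x]

x_hat = normalize([1.0, 2.0, 3.0, 4.0])

# A constant input has zero variance; eps avoids a division by zero here.
flat = normalize([3.0, 3.0, 3.0])
```

Because $\epsilon$ is added inside the square root, the variance of `x_hat` comes out slightly below 1 rather than exactly 1. For the constant input, every normalized value is 0.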

## 4. Scaling and Shifting

After normalization, the output is scaled and shifted using learnable parameters $\gamma$ and $\beta$, which have one entry per feature:

$$
y_i = \gamma_i \hat{x}_i + \beta_i
$$

where:
- $\gamma_i$ is the scale parameter for the $i$-th feature.
- $\beta_i$ is the shift parameter for the $i$-th feature.
- $\hat{x}_i$ is the normalized feature.
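
Putting the four equations together, a minimal NumPy sketch might look like the following. It assumes the features lie along the last axis and that `gamma` and `beta` are vectors of length $H$; the function name `layer_norm` and the sample values are illustrative only.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last axis.

    x: array of shape (..., H); gamma, beta: arrays of shape (H,) (assumed here).
    """
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean
    var = x.var(axis=-1, keepdims=True)    # per-sample biased variance (ddof=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalization
    return gamma * x_hat + beta            # scaling and shifting

# Two hypothetical samples with H = 4 features each.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma = np.ones(4)   # identity scale
beta = np.zeros(4)   # zero shift

y = layer_norm(x, gamma, beta)
```

With identity `gamma` and zero `beta`, the two rows of `y` are nearly identical even though the second input row is ten times larger, since the statistics are computed separately for each sample.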

