In [4]:
import torch
from torch import nn

## Layer Normalization

Layer normalization is a technique used in deep learning to stabilize and accelerate the training of neural networks by normalizing the inputs to a layer across the features for each individual training example. It is an alternative to batch normalization, which normalizes each feature across the batch dimension, and it is particularly useful when batch statistics are unreliable (for example, with small or variable batch sizes).

### Key Characteristics of Layer Normalization


1. **Normalization Across Features**: Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimension for each training example independently.
   $$
   \hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sigma}
   $$
   where:
   - $\mathbf{x}$ is the input to the layer.
   - $\mu = \frac{1}{H} \sum_{i=1}^H x_i$ is the mean across features.
   - $\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^H (x_i - \mu)^2 + \epsilon}$ is the standard deviation across features.
   - $H$ is the number of features.
   - $\epsilon$ is a small constant added for numerical stability.

2. **Learnable Parameters**: Layer normalization includes learnable scale ($\gamma$) and shift ($\beta$) parameters:
   $$
   \mathbf{y} = \gamma \cdot \hat{\mathbf{x}} + \beta
   $$
   These allow the model to restore or adaptively scale the normalized outputs if needed.

3. **Independence from Batch Size**: Since layer normalization computes statistics independently for each example, it works well with small batch sizes and is well-suited for recurrent and transformer architectures.
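
The three characteristics above can be illustrated with a minimal sketch in plain Python (standard library only, so it runs without PyTorch). The helper names `layer_norm` and `batch_norm_train` and the example batches are assumptions made for illustration; the batch-norm helper uses training-mode batch statistics only and omits running averages and learnable parameters.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample (a list of H features) across its own features."""
    H = len(x)
    gamma = gamma if gamma is not None else [1.0] * H
    beta = beta if beta is not None else [0.0] * H
    mu = sum(x) / H
    var = sum((v - mu) ** 2 for v in x) / H  # biased variance (divide by H)
    sigma = math.sqrt(var + eps)
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

def batch_norm_train(batch, eps=1e-5):
    """Normalize each feature column across the batch (batch statistics only)."""
    n = len(batch)
    cols = list(zip(*batch))
    mus = [sum(c) / n for c in cols]
    sigmas = [math.sqrt(sum((v - m) ** 2 for v in c) / n + eps)
              for c, m in zip(cols, mus)]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in batch]

sample = [4.0, 2.0, 8.0]
batch_a = [sample, [1.0, 1.0, 1.0]]     # hypothetical batch companions
batch_b = [sample, [10.0, -3.0, 0.5]]

# Layer norm: the result for `sample` does not depend on the rest of the batch.
ln_a = [layer_norm(row) for row in batch_a][0]
ln_b = [layer_norm(row) for row in batch_b][0]

# Batch norm: the result for `sample` changes with its batch companions.
bn_a = batch_norm_train(batch_a)[0]
bn_b = batch_norm_train(batch_b)[0]

print("layer norm identical across batches:", ln_a == ln_b)
print("batch norm identical across batches:", bn_a == bn_b)
```

Because the layer-norm statistics are taken over each row, changing the other rows leaves the result for `sample` untouched, whereas the column-wise statistics of batch normalization couple every sample in the batch.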

### Advantages of Layer Normalization
- **Stabilized Training**: By normalizing feature activations, it reduces the risk of exploding or vanishing gradients.
- **Independence from Batch Size**: Unlike batch normalization, it does not depend on batch statistics, making it ideal for models trained with small or variable batch sizes.
- **Improved Convergence**: It can accelerate training by ensuring more consistent gradients.

### Use Cases
- Recurrent Neural Networks (RNNs): Layer normalization is often used in RNNs since batch normalization is challenging to apply due to sequential dependencies.
- Transformer Models: Transformers (e.g., BERT, GPT) widely use layer normalization because its statistics are computed independently for each token, so they do not depend on the batch size or the sequence length.

Layer normalization has become a standard component in many modern deep learning architectures.

### Layer Normalization Numerical Example


---

Given a single input vector with 3 features:

$
\mathbf{x} = [4.0, 2.0, 8.0]
$

#### Step 1: Compute the Mean ($\mu$) and Standard Deviation ($\sigma$)

The mean is calculated as:

$
\mu = \frac{1}{H} \sum_{i=1}^H x_i = \frac{4.0 + 2.0 + 8.0}{3} \approx 4.67
$

The standard deviation (omitting $\epsilon$ for simplicity) is:

$
\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^H (x_i - \mu)^2}
$

Substituting the values:

$
\sigma = \sqrt{\frac{1}{3} \left((4.0 - 4.67)^2 + (2.0 - 4.67)^2 + (8.0 - 4.67)^2 \right)}
$

$
\sigma = \sqrt{\frac{1}{3} \left(0.4489 + 7.1289 + 11.0889\right)} = \sqrt{6.22} \approx 2.49
$

---

#### Step 2: Normalize the Features ($\hat{x}_i$)

The normalized features are computed as:

$
\hat{x}_i = \frac{x_i - \mu}{\sigma}
$

For each feature:

$
\hat{x}_1 = \frac{4.0 - 4.67}{2.49} \approx -0.27, \quad
\hat{x}_2 = \frac{2.0 - 4.67}{2.49} \approx -1.07, \quad
\hat{x}_3 = \frac{8.0 - 4.67}{2.49} \approx 1.34
$

Thus:

$
\hat{\mathbf{x}} = [-0.27, -1.07, 1.34]
$

---

#### Step 3: Apply Scale ($\gamma$) and Shift ($\beta$)

Assume the learnable parameters are:

$
\gamma = [1.5, 1.0, 0.5], \quad \beta = [0.5, 0.0, -0.5]
$

The final output is computed as:

$
y_i = \gamma_i \cdot \hat{x}_i + \beta_i
$

For each feature:

$
y_1 = 1.5 \cdot (-0.27) + 0.5 = 0.095, \quad
y_2 = 1.0 \cdot (-1.07) + 0.0 = -1.07, \quad
y_3 = 0.5 \cdot 1.34 - 0.5 = 0.17
$

Thus, the final output is:

$
\mathbf{y} = [0.095, -1.07, 0.17]
$

---

#### Summary

1. **Input**: $[4.0, 2.0, 8.0]$
2. **Normalized**: $[-0.27, -1.07, 1.34]$
3. **Final Output After Scale and Shift**: $[0.095, -1.07, 0.17]$

---
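
As a cross-check on the hand calculation, the three steps can be sketched in plain Python using only the standard library. This is an illustrative sketch rather than a reference implementation: it omits $\epsilon$, as the worked example does, and it keeps full floating-point precision, so its results may differ slightly from values rounded by hand.

```python
import math

x = [4.0, 2.0, 8.0]
gamma = [1.5, 1.0, 0.5]
beta = [0.5, 0.0, -0.5]
H = len(x)

# Step 1: mean and (biased) standard deviation across the features
mu = sum(x) / H
var = sum((v - mu) ** 2 for v in x) / H
sigma = math.sqrt(var)  # epsilon omitted, as in the worked example

# Step 2: normalize each feature
x_hat = [(v - mu) / sigma for v in x]

# Step 3: apply scale (gamma) and shift (beta) element-wise
y = [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]

print(f"mu = {mu:.4f}, sigma = {sigma:.4f}")
print("x_hat =", [round(v, 4) for v in x_hat])
print("y     =", [round(v, 4) for v in y])
```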

### Layer Normalization PyTorch Example

In [15]:
# Input vector
x = torch.tensor([[4.0, 2.0, 8.0]])  # Shape [1, 3] (1 sample, 3 features)

# Define a LayerNorm instance with normalized dimension matching the input
layer_norm = nn.LayerNorm(normalized_shape=x.size(-1))

# Manually set the learnable parameters (gamma and beta) to match the example.
# copy_ writes the values in place; no_grad keeps the assignment out of autograd.
with torch.no_grad():
    layer_norm.weight.copy_(torch.tensor([1.5, 1.0, 0.5]))  # Gamma (scale)
    layer_norm.bias.copy_(torch.tensor([0.5, 0.0, -0.5]))   # Beta (shift)

# Apply the layer normalization
y = layer_norm(x)

print("Input:", x)
print("Normalized Output:", y)


Input: tensor([[4., 2., 8.]])
Normalized Output: tensor([[ 0.0991, -1.0690,  0.1682]], grad_fn=<NativeLayerNormBackward0>)


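PyTorch computes the biased variance (dividing by $H$) and adds a small $\epsilon$ (default `1e-5`) inside the square root for numerical stability. At full precision, $\mu \approx 4.6667$ and $\sigma \approx 2.4944$, which gives the output above. Any small difference from a hand calculation comes from rounding the intermediate values, since the effect of $\epsilon$ is negligible here.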
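
To see how much each implementation detail matters, the sketch below recomputes the example in plain Python under three settings: no $\epsilon$, $\epsilon = 10^{-5}$ (the `nn.LayerNorm` default), and an unbiased variance that divides by $H - 1$. The helper `layer_norm` and its `unbiased` flag are hypothetical names introduced only for this comparison; they are not part of PyTorch's API.

```python
import math

def layer_norm(x, gamma, beta, eps=0.0, unbiased=False):
    """Layer norm for one sample; `unbiased` switches the variance divisor."""
    H = len(x)
    mu = sum(x) / H
    denom = H - 1 if unbiased else H
    var = sum((v - mu) ** 2 for v in x) / denom
    sigma = math.sqrt(var + eps)
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

x = [4.0, 2.0, 8.0]
gamma = [1.5, 1.0, 0.5]
beta = [0.5, 0.0, -0.5]

y_no_eps = layer_norm(x, gamma, beta)                    # textbook formula
y_eps = layer_norm(x, gamma, beta, eps=1e-5)             # biased variance + eps
y_unbiased = layer_norm(x, gamma, beta, unbiased=True)   # divides by H - 1

def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

print("effect of eps:           ", max_abs_diff(y_no_eps, y_eps))
print("effect of unbiased (H-1):", max_abs_diff(y_no_eps, y_unbiased))
```

In this example the $\epsilon$ term shifts the result far below the displayed precision, while switching to the unbiased estimator changes it visibly, so the choice of variance estimator is the detail most likely to explain a mismatch with a manual calculation.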

### Main Functions of Layer Normalization
1. **Stabilizing Activations**:  
   By normalizing the input features for each sample to have a consistent scale and mean, layer normalization ensures that the activations passed to subsequent layers remain in a stable range. This makes optimization smoother, as the learning dynamics are less affected by variations in feature magnitudes.

2. **Reducing Internal Covariate Shift**:  
   Internal covariate shift refers to the change in the distribution of layer inputs during training as the parameters of preceding layers change. Layer normalization mitigates this by keeping the inputs to each layer in a more predictable range, reducing the amount of "re-adjustment" needed by subsequent layers.

3. **Independence from Batch Size**:  
   Unlike batch normalization, which relies on statistics computed across a batch, layer normalization computes normalization statistics per sample. This makes it particularly useful in models with:
   - Small or variable batch sizes, as is common when training recurrent neural networks and transformers.
   - Applications like reinforcement learning or language modeling where batch normalization may not be practical.
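
The stabilizing effect described above can be seen in a small sketch: inputs that differ only by a large scale factor or a constant offset map to (almost) the same normalized vector. The code uses plain Python, and the helper name `normalize` and the sample values are assumptions chosen for illustration; $\gamma$ and $\beta$ are left out so that only the normalization step is visible.

```python
import math

def normalize(x, eps=1e-5):
    """Per-sample normalization across features (no scale or shift)."""
    H = len(x)
    mu = sum(x) / H
    var = sum((v - mu) ** 2 for v in x) / H
    return [(v - mu) / math.sqrt(var + eps) for v in x]

base = [1.0, 2.0, 3.0]
scaled = [1000.0 * v for v in base]   # same pattern, much larger magnitude
shifted = [v + 100.0 for v in base]   # same pattern, large constant offset

results = {
    "base": normalize(base),
    "scaled x1000": normalize(scaled),
    "shifted +100": normalize(shifted),
}

for name, out in results.items():
    print(f"{name:>13}:", [round(v, 3) for v in out])
```

The tiny residual differences between the three outputs come from $\epsilon$, whose relative weight shrinks as the variance of the input grows.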

---

### Secondary Benefits
1. **Mitigating Gradient Problems**:
   - **Gradient Explosion**: Normalization ensures that extremely large activations are scaled down, indirectly reducing the risk of exploding gradients.
   - **Gradient Vanishing**: By maintaining consistent feature scales, layer norm helps prevent activations from becoming too small to propagate meaningful gradients.

2. **Faster Convergence**:  
   The smoother learning dynamics result in faster convergence during training, often requiring fewer iterations to achieve good performance.

---

### When to Use Layer Normalization
Layer normalization is particularly effective in models where:
- **Sequences or temporal structures** are critical (e.g., RNNs, Transformers).  
- Batch normalization doesn't work well due to **variable batch sizes** or **dependencies across samples**.  

In these scenarios, layer normalization not only stabilizes training but also ensures consistent and robust gradients, indirectly avoiding numerical issues like gradient explosion.

### Conclusion

Layer normalization essentially normalizes the inputs with the z-score for the features of each sample, plus learnable scale and shift parameters. The learnable $\gamma$ and $\beta$ provide flexibility so the network can learn optimal representations rather than being constrained by strict normalization.