# **Layer Normalization in Transformers**

`Layer Normalization (LayerNorm)` is a normalization technique applied across the features of a single example (not across the batch, as in BatchNorm). For each token representation, it rescales the activations to have `mean` `0` and `variance` `1`, then applies a learned affine transformation.


In Transformers, it stabilizes training by keeping the scale of hidden states consistent as they pass through many stacked attention and feedforward layers.

### **Intuition**

- Transformers stack dozens or hundreds of layers; without normalization, the distribution of hidden states can shift from layer to layer as the weights are updated (an effect often called internal covariate shift).

- LayerNorm ensures that, at each layer, every token's hidden state has a controlled scale and distribution.

- Unlike BatchNorm, LayerNorm doesn’t depend on batch statistics, so it works well for:
  - Variable-length sequences

  - Small batch sizes

  - Autoregressive models, where only one token is processed at a time during inference.

Think of it as re-centering and rescaling each `token’s feature vector` independently to keep training stable.
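The batch-independence point above can be checked with a small experiment. This is a minimal sketch, assuming a toy hidden size of 8 and a batch of 4 (both chosen only for illustration): the first example is kept fixed while the other three are replaced, and the output for that first example is compared under LayerNorm and under BatchNorm in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 8  # toy size, chosen only for illustration
layer_norm = nn.LayerNorm(hidden_dim)
batch_norm = nn.BatchNorm1d(hidden_dim)  # training mode by default: uses batch statistics

x = torch.randn(4, hidden_dim)                # batch of 4 examples
x_other = x.clone()
x_other[1:] = torch.randn(3, hidden_dim) * 5  # replace examples 1-3, keep example 0

# Does the output for example 0 change when the rest of the batch changes?
ln_same = torch.allclose(layer_norm(x)[0], layer_norm(x_other)[0], atol=1e-6)
bn_same = torch.allclose(batch_norm(x)[0], batch_norm(x_other)[0], atol=1e-6)

print(ln_same)  # True: LayerNorm only looks at the example's own features
print(bn_same)  # False: BatchNorm statistics depend on the other examples
```

LayerNorm computes its statistics from each example's own feature vector, so the rest of the batch has no effect on it, whereas BatchNorm in training mode mixes information across the batch.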

### **Use Cases**

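- `Transformers (BERT, GPT, etc.):` almost every block applies LayerNorm around `attention/FFN`, either after each sublayer's residual connection (Post-LN, as in BERT) or before each sublayer (Pre-LN, as in GPT-2).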

- `RNNs / NLP models:` stabilizes hidden state dynamics.

- `Reinforcement Learning:` reduces instability in policy/value function training.

- `Small-batch or online training:` unlike BatchNorm, the computation does not depend on batch size, so behavior is the same for any batch and at both training and inference time.

### **Mathematical Representation**

For a hidden state vector $x \in \mathbb{R}^d$ (features of a token at one layer):

1. **Compute mean and variance** across the feature dimension:

$$\mu = \frac{1}{d} \sum_{i=1}^d x_i$$

$$\sigma^2 = \frac{1}{d} \sum_{i=1}^d (x_i - \mu)^2$$


2. **Normalize:**

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$


3. **Scale and shift** with learned parameters $\gamma, \beta \in \mathbb{R}^d$:

$$\mathrm{LayerNorm}(x) = \gamma \odot \hat{x} + \beta$$


Where:

- $d$ = hidden dimension

- $\epsilon$ = small constant for numerical stability (avoids division by zero when the variance is close to zero)

- $\odot$ = elementwise multiplication
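As a worked example of the three steps, consider a toy vector with $d = 4$. The values are chosen only for illustration, and the calculation assumes $\gamma = 1$, $\beta = 0$ and a negligible $\epsilon$:

$$x = (1, 2, 3, 4)$$

$$\mu = \frac{1 + 2 + 3 + 4}{4} = 2.5$$

$$\sigma^2 = \frac{(-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2}{4} = 1.25$$

$$\hat{x} = \frac{x - \mu}{\sqrt{1.25}} \approx (-1.342, -0.447, 0.447, 1.342)$$

The normalized vector has mean $0$ and variance $1$; with $\gamma = 1$ and $\beta = 0$, $\mathrm{LayerNorm}(x) = \hat{x}$.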

**Basic PyTorch LayerNorm**

In [1]:
import torch
import torch.nn as nn

# Example: hidden dim = 512
layer_norm = nn.LayerNorm(512)

x = torch.randn(2, 10, 512)  # (batch, seq_len, hidden_dim)
out = layer_norm(x)

print(out.shape)  # (2, 10, 512)

torch.Size([2, 10, 512])
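To connect the built-in layer to the formulas in the previous section, the sketch below reimplements the three steps by hand and compares the result with `nn.LayerNorm`. The helper name `manual_layer_norm` is hypothetical (it is not part of PyTorch), and the comparison tolerance is an assumption suitable for `float32`.

```python
import torch
import torch.nn as nn

def manual_layer_norm(x, gamma, beta, eps=1e-5):
    """Hypothetical helper: LayerNorm over the last dimension, following the formulas."""
    mu = x.mean(dim=-1, keepdim=True)                  # mean over features
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance (divide by d)
    x_hat = (x - mu) / torch.sqrt(var + eps)           # normalize
    return gamma * x_hat + beta                        # scale and shift

torch.manual_seed(0)
x = torch.randn(2, 10, 512)  # (batch, seq_len, hidden_dim)
layer_norm = nn.LayerNorm(512)

manual_out = manual_layer_norm(x, layer_norm.weight, layer_norm.bias, eps=layer_norm.eps)
builtin_out = layer_norm(x)

matches = torch.allclose(manual_out, builtin_out, atol=1e-5)
print(matches)  # True
```

Note the `unbiased=False`: the formula divides by $d$, not $d - 1$, so the biased variance is the one that matches the built-in layer.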


In [4]:
# Statistics over the whole tensor (all tokens pooled together), not per token
out.mean(), out.var()

(tensor(-1.8626e-09, grad_fn=<MeanBackward0>),
 tensor(1.0001, grad_fn=<VarBackward0>))
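The check above pools all elements of the output. Since LayerNorm normalizes each token separately, it may be more informative to compute the statistics over the last dimension only. The sketch below does this on an input that is deliberately shifted and scaled (the `* 3 + 7` is an arbitrary choice for illustration), using the biased variance to match the formula.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(512)

# Deliberately off-center, large-scale input (arbitrary values for illustration)
x = torch.randn(2, 10, 512) * 3 + 7
out = layer_norm(x)

token_mean = out.mean(dim=-1)                # one mean per token
token_var = out.var(dim=-1, unbiased=False)  # one (biased) variance per token

print(token_mean.shape)  # torch.Size([2, 10])
print(token_mean.abs().max().item())       # close to 0
print((token_var - 1).abs().max().item())  # close to 0
```

Every one of the 20 token vectors ends up with mean near `0` and variance near `1`, regardless of the scale and offset of the input.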

**Using LayerNorm in a Transformer Block**

In [None]:
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, hidden_dim, num_heads, ff_dim):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.layerNorm1 = nn.LayerNorm(hidden_dim)
        self.feedforward = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, hidden_dim)
        )
        self.layerNorm2 = nn.LayerNorm(hidden_dim)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention + residual + norm (Post-LN: normalize after the residual add)
        attn_out, _ = self.attention(x, x, x)
        x = self.layerNorm1(x + attn_out)

        # Feed-forward + residual + norm
        feedforward_output = self.feedforward(x)
        x = self.layerNorm2(x + feedforward_output)
        return x
    
x = torch.randn(2, 10, 512)  # (batch, seq_len, hidden_dim)
block = TransformerBlock(hidden_dim=512, num_heads=8, ff_dim=2048)
out = block(x)
print(out.shape)  # (2, 10, 512)

torch.Size([2, 10, 512])


In [12]:
out

tensor([[[-0.6516,  0.3798,  0.1604,  ...,  0.4044,  1.4573,  0.9570],
         [ 1.3824,  0.8147,  2.1083,  ...,  0.5233, -0.6967, -0.1114],
         [-0.5544, -0.0451,  0.2103,  ...,  1.1650,  0.8155,  0.4347],
         ...,
         [-0.5664, -0.4424,  1.0642,  ...,  1.2199,  0.2132,  0.1240],
         [-0.3173,  0.2094,  0.7723,  ...,  0.3141,  0.0453, -0.8503],
         [-0.0166, -1.3009,  1.5293,  ...,  1.0549,  1.1779, -0.1229]],

        [[-1.6596, -0.6192,  1.4623,  ...,  2.2082, -0.5622, -0.3565],
         [-1.3105, -2.4407, -0.8630,  ..., -0.5712, -0.5010, -0.4595],
         [-0.7096, -0.2949,  0.6589,  ..., -1.1970,  2.0164,  2.2247],
         ...,
         [ 0.0427,  2.2328,  1.0938,  ...,  0.7347, -0.8213,  0.6817],
         [ 0.4875,  0.1426,  0.5580,  ...,  1.4440,  0.0025,  0.0137],
         [ 0.2994,  0.0944,  0.0350,  ..., -1.1914,  0.4151, -0.7845]]],
       grad_fn=<NativeLayerNormBackward0>)

In [13]:
block

TransformerBlock(
  (attention): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
  )
  (layerNorm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (feedforward): Sequential(
    (0): Linear(in_features=512, out_features=2048, bias=True)
    (1): ReLU()
    (2): Linear(in_features=2048, out_features=512, bias=True)
  )
  (layerNorm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
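The block above normalizes after each residual addition (Post-LN). For comparison, here is a minimal sketch of the Pre-LN arrangement, where LayerNorm is applied to the input of each sublayer and the residual path is left untouched. The class name `PreLNTransformerBlock` is hypothetical, and the layer sizes simply mirror the example above.

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Hypothetical Pre-LN variant: normalize before each sublayer."""

    def __init__(self, hidden_dim, num_heads, ff_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.feedforward = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Norm -> self-attention -> residual add
        h = self.norm1(x)
        attn_out, _ = self.attention(h, h, h)
        x = x + attn_out

        # Norm -> feed-forward -> residual add
        x = x + self.feedforward(self.norm2(x))
        return x

x = torch.randn(2, 10, 512)  # (batch, seq_len, hidden_dim)
pre_ln_block = PreLNTransformerBlock(hidden_dim=512, num_heads=8, ff_dim=2048)
pre_ln_out = pre_ln_block(x)
print(pre_ln_out.shape)  # torch.Size([2, 10, 512])
```

Because the output of a Pre-LN block is a sum of unnormalized residuals, models that use this layout (GPT-2, for example) typically add one final LayerNorm after the last block.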

**Hugging Face Transformers (BERT example)**

In [15]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
print(model.encoder.layer[0].output.LayerNorm)  # LayerNorm applied after the feed-forward sublayer of the first encoder layer

LayerNorm((768,), eps=1e-12, elementwise_affine=True)
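The printed representation lists three settings: the normalized shape, `eps`, and `elementwise_affine`. The sketch below shows how these map onto attributes and parameters of a plain `nn.LayerNorm`. It builds a fresh layer with the same settings as the BERT output above instead of loading the pretrained model, so its parameter values are the PyTorch defaults, not BERT's trained ones.

```python
import torch
import torch.nn as nn

# Same settings as the BERT layer printed above, but freshly initialized
layer_norm = nn.LayerNorm(768, eps=1e-12)

print(layer_norm.normalized_shape)  # (768,)
print(layer_norm.eps)               # 1e-12
print(layer_norm.weight.shape, layer_norm.bias.shape)  # gamma and beta, one value per feature

# At initialization gamma is all ones and beta is all zeros,
# so the affine step starts out as the identity.
starts_as_identity_affine = bool(
    torch.all(layer_norm.weight == 1) and torch.all(layer_norm.bias == 0)
)
num_params = sum(p.numel() for p in layer_norm.parameters())
print(starts_as_identity_affine, num_params)  # True 1536
```

With `elementwise_affine=True`, the layer holds `2 * d` learnable parameters: the `weight` ($\gamma$) and `bias` ($\beta$) vectors from the formula.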


### **Summary**

- `Definition:` Normalizes per-token features for stability.

- `Intuition:` Keeps hidden states well-scaled across deep stacks.

- `Use cases:` Transformers, NLP, RL, small-batch training.

- `Mathematics:` Normalize → scale → shift with learned $\gamma, \beta$.

- `Code:` Built-in as `nn.LayerNorm` in PyTorch and used in almost every Transformer block.