
In [1]:
import torch
import torch.nn as nn

In [9]:
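class LayerNormalization(nn.Module):
  def __init__(self, emb_dim):
    super().__init__()
    # Small constant added to the variance to avoid division by zero
    self.eps = 1e-6
    # Trainable per-feature scale and shift, initialized to the identity transform
    self.scale = nn.Parameter(torch.ones(emb_dim))
    self.shift = nn.Parameter(torch.zeros(emb_dim))

  def forward(self, x):
    # Statistics are computed over the last (embedding) dimension
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    norm_x = (x - mean) / torch.sqrt(var + self.eps)
    return self.scale * norm_x + self.shift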

<div class="alert alert-block alert-warning">

This implementation of layer normalization operates on the last dimension of the
input tensor x, which is the embedding dimension (emb_dim).

The variable eps is a
small constant (epsilon) added to the variance to prevent division by zero during
normalization.

The scale and shift are two trainable parameters, each a vector of length emb_dim
(the size of the input's last dimension), that are updated during training whenever
adjusting them reduces the loss on the training task.

This allows the model
to learn the scaling and shifting that best suit the data it is processing.

</div>
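
One way to sanity-check the description above is to compare the custom class against PyTorch's built-in `nn.LayerNorm`. The following is a minimal sketch, not part of the original notebook: the class is repeated so the cell runs on its own, the input shape `(2, 4, 8)` is an arbitrary choice for illustration, and `eps` is passed explicitly because `nn.LayerNorm` defaults to `1e-5` rather than the `1e-6` used here.

```python
import torch
import torch.nn as nn


class LayerNormalization(nn.Module):
    # Same logic as the class defined above, repeated so this cell is self-contained.
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-6
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # illustrative (batch, tokens, emb_dim) input

custom_ln = LayerNormalization(emb_dim=8)
# eps is set to match the custom class; nn.LayerNorm would otherwise use 1e-5.
builtin_ln = nn.LayerNorm(8, eps=1e-6)

# Largest element-wise difference between the two outputs
max_diff = (custom_ln(x) - builtin_ln(x)).abs().max().item()
param_shapes = [tuple(p.shape) for p in custom_ln.parameters()]

print("max abs difference:", max_diff)
print("trainable parameter shapes:", param_shapes)
```

If the two modules agree up to floating-point rounding, the difference should be tiny, and the parameter shapes confirm that scale and shift are vectors of length emb_dim.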

_A small note on biased variance_

<div class="alert alert-block alert-info">

In our variance calculation, we set unbiased=False.

This means that the variance formula divides the sum of squared deviations
by the number of inputs n.

This approach does not apply Bessel's correction, which uses n-1 instead of n in
the denominator to adjust for bias in sample variance estimation.

The result is a so-called biased estimate of the variance.

For large language
models (LLMs), where the embedding dimension n is large, the
difference between using n and n-1 is practically negligible.

We chose this approach to ensure compatibility with the GPT-2 model's normalization layers and because it
matches the default behavior of TensorFlow, which was used to implement the original GPT-2 model.
</div>
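
To make the effect of Bessel's correction concrete, the sketch below compares the biased and unbiased variance for a very small n and for n = 768 (the embedding size of the smallest GPT-2 model). The random vectors and the `ratios` dictionary are illustrative assumptions; the point is only that the two estimates differ by the factor (n-1)/n.

```python
import torch

torch.manual_seed(0)

ratios = {}
for n in (3, 768):
    # float64 keeps rounding error well below the effect being measured
    v = torch.randn(n, dtype=torch.float64)
    biased = v.var(unbiased=False)   # divides by n
    unbiased = v.var(unbiased=True)  # divides by n - 1 (Bessel's correction)
    ratios[n] = (biased / unbiased).item()
    print(f"n={n}: biased/unbiased = {ratios[n]:.6f}, (n-1)/n = {(n - 1) / n:.6f}")
```

For n = 3 the two estimates differ by a factor of 2/3, whereas for n = 768 the factor is 767/768, which is why the choice hardly matters at realistic embedding sizes.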

<div class="alert alert-block alert-success">

Let's now try the LayerNormalization module in practice and apply it to a batch input:
</div>

In [10]:
# Batch of 5 sequences, 10 tokens each, with a 3-dimensional embedding per token
x = torch.rand(5, 10, 3)
emb_dim = x.shape[-1]
layer_norm = LayerNormalization(emb_dim)

In [12]:
result=layer_norm(x)

In [14]:
print("mean",result.mean(dim=-1,keepdim=True))
print("var",result.var(dim=-1,unbiased=False,keepdim=True))

mean tensor([[[ 5.9605e-08],
         [-3.9736e-08],
         [ 0.0000e+00],
         [ 1.9868e-08],
         [ 3.9736e-08],
         [-9.9341e-08],
         [ 0.0000e+00],
         [ 1.1921e-07],
         [-1.5895e-07],
         [ 0.0000e+00]],

        [[-3.9736e-08],
         [-5.9605e-08],
         [ 2.1855e-07],
         [ 1.6888e-07],
         [-1.1921e-07],
         [ 1.1921e-07],
         [ 1.9868e-08],
         [ 9.9341e-09],
         [ 7.9473e-08],
         [ 1.9868e-08]],

        [[-1.1921e-07],
         [ 0.0000e+00],
         [ 7.9473e-08],
         [ 7.9473e-08],
         [ 6.3578e-07],
         [ 1.9868e-08],
         [ 0.0000e+00],
         [-1.1921e-07],
         [ 0.0000e+00],
         [ 3.9736e-08]],

        [[ 0.0000e+00],
         [-1.1921e-07],
         [ 0.0000e+00],
         [-3.9736e-08],
         [-7.4506e-08],
         [ 0.0000e+00],
         [ 0.0000e+00],
         [ 0.0000e+00],
         [ 3.9736e-08],
         [ 1.1921e-07]],

        [[ 0.0000e+00],
   

As the results show, the layer normalization code works as expected: each of the 50 embedding vectors (5 sequences of 10 tokens) is normalized so that it has a mean of 0, up to floating-point rounding error, and a variance of approximately 1.
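
Instead of reading the printed tensors by eye, the same check can be sketched programmatically with `torch.allclose`. This is a minimal sketch under stated assumptions: it applies the same normalization formula directly (equivalent to the module with scale 1 and shift 0), and it uses a hypothetical input with a wider embedding dimension of 768 so that every row has a variance well above eps. The tolerances are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 10, 768)  # assumed shape: (batch, tokens, emb_dim)
eps = 1e-6

# Same normalization as LayerNormalization.forward with scale=1 and shift=0
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
result = (x - mean) / torch.sqrt(var + eps)

# Per-token statistics of the normalized output
out_mean = result.mean(dim=-1)
out_var = result.var(dim=-1, unbiased=False)

mean_ok = torch.allclose(out_mean, torch.zeros_like(out_mean), atol=1e-5)
var_ok = torch.allclose(out_var, torch.ones_like(out_var), atol=1e-4)

print("all means close to 0:", mean_ok)
print("all variances close to 1:", var_ok)
```

Because eps is added to the variance before taking the square root, the output variance is always slightly below 1; the gap grows when a row's own variance is very small, which is one reason to use a tolerance rather than an exact comparison.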