### Why are we using Layer Normalisation?

- While training a transformer with backpropagation, a point can come where the gradients start to vanish and the model eventually stops learning. To prevent this we use an Add & Norm layer, which takes the output of multi-head attention, adds a residual (skip) connection carrying the input of that sub-layer (for the first layer, the matrix obtained just after applying positional encoding), and then normalises the sum.
  > Layer Normalisation: what and why?
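
To make the Add & Norm idea concrete, here is a minimal sketch. It assumes an embedding size of 3 and uses a plain `nn.Linear` as a hypothetical stand-in for multi-head attention; the names `sublayer` and `add_norm_out` are illustrative only and are not part of this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random values

embed_dim = 3  # assumed embedding size
sublayer = nn.Linear(embed_dim, embed_dim)  # hypothetical stand-in for multi-head attention
layer_norm = nn.LayerNorm(embed_dim)

x = torch.randn(1, 2, embed_dim)  # (batch, sequence, embedding)

# "Add": the residual connection adds the sub-layer input back to its output.
# "Norm": layer normalisation is then applied to the sum.
add_norm_out = layer_norm(x + sublayer(x))
print(add_norm_out.shape)
```

The residual path gives gradients a direct route around the sub-layer, which is why it helps against vanishing gradients.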

> What is Normalisation?

- After an activation function is applied, the resulting values can be very large in magnitude, either positive or negative. Applying `Normalisation` rescales the values into a much smaller range, typically centred around zero (`0`) with unit variance.
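
The computation that the rest of this notebook walks through can be written compactly. In this sketch, $x_i$ are the $N$ values that are normalised together, and $\gamma$, $\beta$ and $\epsilon$ correspond to `Gamma`, `Beta` and the small stabilising constant used in the code:

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$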


# Import Dependencies


In [2]:
import torch
import torch.nn as nn

In [7]:
inputs = torch.Tensor([
    [
        [0.2, 0.1, 0.3],
        [0.5, 0.1, 0.1]
    ]
]) # the outermost dimension is the batch dimension; batching lets several sequences be processed in parallel

B,S,E=inputs.size()
# B=Batch size, S=Sequence length, E=Embedding size
inputs= inputs.permute(1,0,2) # swap the axes to (S,B,E); reshape would only give the same result because B=1
inputs.size()


torch.Size([2, 1, 3])
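
Moving from `(B, S, E)` to `(S, B, E)` is an axis swap, not a reshape. The two only agree when one of the swapped axes has size 1, as it does here. The sketch below uses a hypothetical tensor with `B=2, S=3, E=2` to make the difference visible:

```python
import torch

# Hypothetical toy tensor with B=2, S=3, E=2 (values 0..11)
x = torch.arange(12.0).reshape(2, 3, 2)

permuted = x.permute(1, 0, 2)  # true axis swap: permuted[s, b] == x[b, s]
reshaped = x.reshape(3, 2, 2)  # same shape, but the values are only re-chunked in memory order

print(torch.equal(permuted, reshaped))  # False
print(permuted[0, 1])  # the row x[1, 0]
print(reshaped[0, 1])  # a different row
```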

In [17]:
parameter_shape=inputs.size()[-2:]
print(len(parameter_shape))
Gamma=nn.Parameter(torch.ones(parameter_shape))
Beta=nn.Parameter(torch.zeros(parameter_shape))
Gamma.size(),Beta.size()

2


(torch.Size([1, 3]), torch.Size([1, 3]))

In [18]:
dimension=[-(i+1) for i in range(len(parameter_shape))] # axes to normalise over: the last len(parameter_shape) axes, here the embedding (-1) and batch (-2) axes
dimension

[-1, -2]
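
Layer normalisation is more commonly applied over the embedding axis only, so that each token is normalised independently (this is what `nn.LayerNorm(E)` does). Because the batch size here is 1, both choices give the same statistics. The sketch below checks this and also shows what changes with a hypothetical batch of 2; the variable names are illustrative.

```python
import torch

# Same values as the notebook's `inputs`, laid out as (S=2, B=1, E=3)
inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])

mean_batch_and_embed = inputs.mean(dim=[-1, -2], keepdim=True)
mean_embed_only = inputs.mean(dim=-1, keepdim=True)
same_for_batch_of_one = torch.allclose(mean_batch_and_embed, mean_embed_only)
print(same_for_batch_of_one)

# Hypothetical layout with both rows in a single batch: (S=1, B=2, E=3)
batched = inputs.reshape(1, 2, 3)
mixed_mean = batched.mean(dim=[-1, -2], keepdim=True)  # one mean shared by both rows
per_token_mean = batched.mean(dim=-1, keepdim=True)    # one mean per row
print(mixed_mean.shape, per_token_mean.shape)
```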

In [20]:
## calculating mean
mean=inputs.mean(dimension,keepdim=True)
print(mean.size())
print(mean)

torch.Size([2, 1, 1])
tensor([[[0.2000]],

        [[0.2333]]])


In [25]:
variance=(inputs-mean).pow(2).mean(dimension,keepdim=True)
epsilon=1e-5 # small constant to ensure that the denominator is never zero
std=torch.sqrt(variance+epsilon)
std.size(),std

(torch.Size([2, 1, 1]),
 tensor([[[0.0817]],
 
         [[0.1886]]]))
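
Note that the variance above divides by `N` (the biased estimate), whereas `Tensor.var` applies Bessel's correction by default and divides by `N - 1`. A small check, under the assumption that the same toy values are used:

```python
import torch

inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])  # (S=2, B=1, E=3)
dims = [-1, -2]

mean = inputs.mean(dim=dims, keepdim=True)
manual_var = (inputs - mean).pow(2).mean(dim=dims, keepdim=True)  # divides by N
biased_var = inputs.var(dim=dims, unbiased=False, keepdim=True)   # divides by N
unbiased_var = inputs.var(dim=dims, unbiased=True, keepdim=True)  # divides by N - 1

print(torch.allclose(manual_var, biased_var))
print(manual_var.flatten(), unbiased_var.flatten())
```

With only three values per group, the two estimates differ by a factor of 3/2, so the choice is visible in the normalised output.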

In [27]:
y=(inputs-mean)/std # normalised values (zero mean, unit variance)
output=Gamma*y+Beta # layer norm output after the learnable scale (Gamma) and shift (Beta)
output.size(),output

(torch.Size([2, 1, 3]),
 tensor([[[ 0.0000, -1.2238,  1.2238]],
 
         [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>))
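
As a sanity check, the manual steps can be compared with PyTorch's built-in `torch.nn.functional.layer_norm`. This is a sketch that rebuilds the same toy input. Omitting `weight` and `bias` corresponds to `Gamma` being all ones and `Beta` being all zeros.

```python
import torch
import torch.nn.functional as F

inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])  # (S=2, B=1, E=3)
normalized_shape = tuple(inputs.shape[-2:])  # (1, 3), mirrors `parameter_shape`
eps = 1e-5

# Manual computation, following the cells above
mean = inputs.mean(dim=[-1, -2], keepdim=True)
var = (inputs - mean).pow(2).mean(dim=[-1, -2], keepdim=True)
manual = (inputs - mean) / torch.sqrt(var + eps)

# Built-in computation without a learnable scale or shift
builtin = F.layer_norm(inputs, normalized_shape, eps=eps)

matches = torch.allclose(manual, builtin, atol=1e-5)
print(matches)
```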

In [30]:
class LayerNormalisation(nn.Module):
    def __init__(self,parameter_shape,epsilon=1e-5)->None:
        super().__init__() # needed so that Gamma and Beta are registered as parameters
        self.Gamma=nn.Parameter(torch.ones(parameter_shape)) # learnable scale
        self.Beta=nn.Parameter(torch.zeros(parameter_shape)) # learnable shift
        self.parameter_shape=parameter_shape
        self.epsilon=epsilon

    def forward(self,inputs:torch.Tensor)->torch.Tensor:
        # normalise over the last len(parameter_shape) axes
        dimension=[-(i+1) for i in range(len(self.parameter_shape))]
        mean=inputs.mean(dimension,keepdim=True)
        print(f"Mean: \n {mean} \n Mean Size:  {mean.size()}")
        # biased variance (divides by N)
        variance=(inputs-mean).pow(2).mean(dimension,keepdim=True)
        print(f"Variance: \n {variance} \n Variance Size:  {variance.size()}")
        # epsilon keeps the denominator away from zero
        std=torch.sqrt(variance+self.epsilon)
        print(f"Standard Deviation: \n {std} \n Standard Deviation Size:  {std.size()}")
        y=(inputs-mean)/std
        print(f"Y:  {y} \n Y Size:  {y.size()}")
        # apply the learnable scale and shift
        output=self.Gamma*y+self.Beta
        print(f"Output:  {output} \n Output Size:  {output.size()}")
        return output


layer_norm=LayerNormalisation(parameter_shape)
output=layer_norm.forward(inputs)
output.size(),output

Mean: 
 tensor([[[0.2000]],

        [[0.2333]]]) 
 Mean Size:  torch.Size([2, 1, 1])
Variance: 
 tensor([[[0.0067]],

        [[0.0356]]]) 
 Variance Size:  torch.Size([2, 1, 1])
Standard Deviation: 
 tensor([[[0.0817]],

        [[0.1886]]]) 
 Standard Deviation Size:  torch.Size([2, 1, 1])
Y:  tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]]) 
 Y Size:  torch.Size([2, 1, 3])
Output:  tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>) 
 Output Size:  torch.Size([2, 1, 3])


(torch.Size([2, 1, 3]),
 tensor([[[ 0.0000, -1.2238,  1.2238]],
 
         [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>))
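
For training, the learnable scale and shift have to be visible to an optimiser. In PyTorch this happens when the class that owns them subclasses `nn.Module`. The sketch below contrasts the two cases with hypothetical helper classes (`PlainHolder` and `ModuleHolder` are illustrative names, not part of this notebook):

```python
import torch
import torch.nn as nn


class PlainHolder:
    """Plain Python class: the parameter exists but PyTorch does not track it."""

    def __init__(self, shape):
        self.gamma = nn.Parameter(torch.ones(shape))


class ModuleHolder(nn.Module):
    """nn.Module subclass: assigning an nn.Parameter registers it automatically."""

    def __init__(self, shape):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(shape))


registered = [name for name, _ in ModuleHolder((1, 3)).named_parameters()]
print(registered)  # ['gamma']

plain_has_parameters = hasattr(PlainHolder((1, 3)), "parameters")
print(plain_has_parameters)  # False
```

With the `nn.Module` version, `module.parameters()` can be passed directly to an optimiser such as `torch.optim.SGD`.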