<a href="https://colab.research.google.com/github/RCortez25/PhD/blob/main/LLM/5.%20LLM%20architecture/Layer_Normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example

Training deep neural networks is prone to vanishing and exploding gradients. Layer normalization improves the stability and efficiency of training: it helps keep the gradients well behaved and helps the network converge.

This layer is implemented as a separate class because it is used both inside and outside the transformer block. Typically, it is applied before and after the multi-head attention module (inside the block) and before the final output layer (outside the block).

This layer normalizes the outputs, i.e., it rescales them to have a mean of 0 and a variance of 1. The statistics are computed separately for each sample, across its features. This speeds up convergence.
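As a sketch, the normalization of one sample with features $x_1, \dots, x_d$ can be written as follows. The symbols $d$, $\mu$, $\sigma^2$ and $\epsilon$ are notation introduced here for illustration; $\epsilon$ is assumed to be a small constant that guards against division by zero, like the `eps` argument in the `LayerNormPlaceholder` constructor.

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

Some implementations divide by $d - 1$ instead of $d$ when computing the variance (PyTorch's `var` does so by default). For large $d$ the difference is negligible.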

The goal is to replace the `LayerNormPlaceholder` class with a working `LayerNormalization` class.

Let's make a first simple example.

In [2]:
import torch
import torch.nn as nn

torch.manual_seed(123)
batch_example = torch.randn(2, 5)
batch_example

tensor([[-0.1115,  0.1204, -0.3696, -0.2404, -1.1969],
        [ 0.2093, -0.9724, -0.7550,  0.3239, -0.1085]])

In [5]:
# Create a sequential layer that maps the 5 input features of each sample
# to 6 output features, then applies a ReLU activation function
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())

# Calculate the outputs
outputs = layer(batch_example)
outputs

tensor([[0.0000, 0.1898, 0.4165, 0.0000, 0.5927, 0.0000],
        [0.2917, 0.0000, 0.7161, 0.0000, 0.3986, 0.0000]],
       grad_fn=<ReluBackward0>)

Now, let's check the mean and variance of the outputs before applying layer normalization.

In [8]:
# Note that this calculates the mean and variance over the whole tensor
outputs.mean()
outputs.var()

# However, we want the mean and variance of each sample, that is, each row
mean_batch = outputs.mean(dim=1, keepdim=True)
var_batch = outputs.var(dim=1, keepdim=True)

print(mean_batch)
print(var_batch)

tensor([[0.1998],
        [0.2344]], grad_fn=<MeanBackward1>)
tensor([[0.0642],
        [0.0854]], grad_fn=<VarBackward0>)


In [9]:
# Note what happens when we don't use keepdim=True

mean_batch_ = outputs.mean(dim=1)
var_batch_ = outputs.var(dim=1)

print(mean_batch_)
print(var_batch_)

tensor([0.1998, 0.2344], grad_fn=<MeanBackward1>)
tensor([0.0642, 0.0854], grad_fn=<VarBackward0>)


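So using `keepdim=True` keeps the reduced dimension: the result has shape `(2, 1)` instead of `(2,)`. It has the same number of dimensions as the input, each row holds the statistic of one sample, and it broadcasts correctly against the `(2, 6)` outputs.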
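To see why the shape matters, the sketch below normalizes the outputs by hand. To stay deterministic, it hard-codes the output values printed earlier instead of re-running the linear layer. The variable names `normalized`, `flat_mean` and `broadcast_ok` are illustrative only.

```python
import torch

# Outputs of the Linear + ReLU layer, copied from the cell output above
outputs = torch.tensor([[0.0000, 0.1898, 0.4165, 0.0000, 0.5927, 0.0000],
                        [0.2917, 0.0000, 0.7161, 0.0000, 0.3986, 0.0000]])

# With keepdim=True the statistics have shape (2, 1),
# which broadcasts row by row against the (2, 6) outputs
mean = outputs.mean(dim=1, keepdim=True)
var = outputs.var(dim=1, keepdim=True)
normalized = (outputs - mean) / torch.sqrt(var)

torch.set_printoptions(sci_mode=False)
print(normalized.mean(dim=1, keepdim=True))  # each row is close to 0
print(normalized.var(dim=1, keepdim=True))   # each row is close to 1

# Without keepdim the mean has shape (2,), which cannot be
# broadcast against (2, 6), so the subtraction raises an error
flat_mean = outputs.mean(dim=1)
try:
    outputs - flat_mean
    broadcast_ok = True
except RuntimeError:
    broadcast_ok = False
print(broadcast_ok)
```

With a square input, the subtraction without `keepdim=True` would not fail. It would silently subtract the means along the wrong axis, which is harder to notice.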

# Creation of the layer

In [None]:
class LayerNormPlaceholder(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        return x
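
One possible way to fill in the placeholder is sketched below. It is an illustration under assumptions, not necessarily the final implementation: the class name `LayerNormalization` comes from the text above, while the learnable `scale` and `shift` parameters, the normalization over the last dimension, and the biased variance (`unbiased=False`) are choices borrowed from how `torch.nn.LayerNorm` behaves.

```python
import torch
import torch.nn as nn


class LayerNormalization(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # Small constant that avoids division by zero
        self.eps = eps
        # Learnable scale and shift (assumed here, mirroring nn.LayerNorm)
        self.scale = nn.Parameter(torch.ones(normalized_shape))
        self.shift = nn.Parameter(torch.zeros(normalized_shape))

    def forward(self, x):
        # Statistics over the last (feature) dimension of each sample
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * x_hat + self.shift


# Usage on a random batch of 2 samples with 6 features each
torch.manual_seed(123)
x = torch.randn(2, 6)
layer_norm = LayerNormalization(6)
out = layer_norm(x)

print(out.mean(dim=-1, keepdim=True))                  # close to 0
print(out.var(dim=-1, keepdim=True, unbiased=False))   # close to 1
```

Right after initialization, `scale` is all ones and `shift` is all zeros, so the layer only normalizes. During training the network can adjust both if a different scale or offset works better.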