<a href="https://colab.research.google.com/github/RCortez25/PhD/blob/main/LLM/6.%20Layer%20Normalization/Layer_Normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example

Training deep neural networks is difficult because of vanishing and exploding gradients. Layer normalization improves the stability and efficiency of training: it helps prevent gradient problems and helps with convergence.

This layer is created as a separate class because it is used both inside and outside the transformer block. Normally, it is applied before and after the multi-head attention module (inside the block) and before the final output layer (outside the block).

This layer normalizes the outputs of the preceding layer, i.e., rescales them to have a mean of 0 and a variance of 1. This speeds up convergence.

Now, the `LayerNormPlaceholder` becomes `LayerNormalization`.

Let's make a first simple example.

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(123)
batch_example = torch.randn(2, 5)
batch_example

tensor([[-0.1115,  0.1204, -0.3696, -0.2404, -1.1969],
        [ 0.2093, -0.9724, -0.7550,  0.3239, -0.1085]])

In [None]:
# Create a sequential layer that maps the 5 inputs of each row to a 6D space,
# then applies a ReLU activation function
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())

# Calculate the outputs
outputs = layer(batch_example)
outputs

tensor([[0.0000, 0.1898, 0.4165, 0.0000, 0.5927, 0.0000],
        [0.2917, 0.0000, 0.7161, 0.0000, 0.3986, 0.0000]],
       grad_fn=<ReluBackward0>)

Now, let's check the mean and variance of the outputs before applying the layer normalization.

In [None]:
# Note that this calculates the mean and variance over the whole tensor
outputs.mean()
outputs.var()

# However, we want the mean and variance of each example in the batch, that is, each row
mean_batch = outputs.mean(dim=1, keepdim=True)
var_batch = outputs.var(dim=1, keepdim=True)

print(mean_batch)
print(var_batch)

tensor([[0.1998],
        [0.2344]], grad_fn=<MeanBackward1>)
tensor([[0.0642],
        [0.0854]], grad_fn=<VarBackward0>)


In [None]:
# Note what happens when we don't use keepdim=True

mean_batch_ = outputs.mean(dim=1)
var_batch_ = outputs.var(dim=1)

print(mean_batch_)
print(var_batch_)

tensor([0.1998, 0.2344], grad_fn=<MeanBackward1>)
tensor([0.0642, 0.0854], grad_fn=<VarBackward0>)


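Using `keepdim=True` keeps the reduced dimension with size 1, so the result has shape `(2, 1)` instead of `(2,)`. It has the same number of dimensions as the input, with one row per example in the batch, which makes it easier to read and lets it broadcast against `outputs`.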
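
The shape difference matters for the arithmetic as well as for readability. The following is a minimal sketch, using a made-up tensor `x` (an assumption, not the `outputs` tensor above), of how subtracting the mean behaves with and without `keepdim=True`:

```python
import torch

# Hypothetical 2x6 tensor standing in for the layer outputs
x = torch.arange(12, dtype=torch.float32).reshape(2, 6)

mean_keep = x.mean(dim=1, keepdim=True)   # shape (2, 1)
mean_flat = x.mean(dim=1)                 # shape (2,)

print(mean_keep.shape, mean_flat.shape)

# (2, 6) - (2, 1) broadcasts row-wise, so each row is centered on its own mean
centered = x - mean_keep
print(centered.mean(dim=1))

# (2, 6) - (2,) cannot broadcast because the trailing sizes 6 and 2 differ
try:
    x - mean_flat
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)
```

With a square input the subtraction without `keepdim=True` would not raise an error but would silently subtract along the wrong axis, which is another reason to keep the dimension.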

Now, let's apply normalization to the outputs.

In [None]:
normalized_outputs = (outputs - mean_batch) / torch.sqrt(var_batch + 1e-5)
normalized_outputs

tensor([[-0.7884, -0.0395,  0.8549, -0.7884,  1.5499, -0.7884],
        [ 0.1960, -0.8019,  1.6481, -0.8019,  0.5618, -0.8019]],
       grad_fn=<DivBackward0>)

In [None]:
# Check mean and variance
print(f'Means:\n {normalized_outputs.mean(dim=1, keepdim=True)}')
print(f'Variances:\n {normalized_outputs.var(dim=1, keepdim=True)}')

Means:
 tensor([[-6.9539e-08],
        [-1.9868e-08]], grad_fn=<MeanBackward1>)
Variances:
 tensor([[0.9998],
        [0.9999]], grad_fn=<VarBackward0>)


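The results now have a mean of approximately 0 and a variance of approximately 1 for each row. The variances fall slightly short of 1 because of the `1e-5` added to the variance before taking the square root. If we turn off scientific notation, the results are even clearer.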

In [None]:
torch.set_printoptions(sci_mode=False)
print(f'Means:\n {normalized_outputs.mean(dim=1, keepdim=True)}')
print(f'Variances:\n {normalized_outputs.var(dim=1, keepdim=True)}')

Means:
 tensor([[    -0.0000],
        [    -0.0000]], grad_fn=<MeanBackward1>)
Variances:
 tensor([[0.9998],
        [0.9999]], grad_fn=<VarBackward0>)


# Creation of the layer

In [None]:
class LayerNormalization(nn.Module):
    def __init__(self, embedding_dimension):
        super().__init__()
        # Small number to prevent division by zero
        self.epsilon = 1e-5
        # gamma scales
        self.gamma = nn.Parameter(torch.ones(embedding_dimension))
        # beta shifts
        self.beta = nn.Parameter(torch.zeros(embedding_dimension))

    def forward(self, x):
        # Normalize over the last dimension (the embedding dimension)
        mean = x.mean(dim=-1, keepdim=True)
        # Biased variance (divides by n)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normalized_x = (x - mean) / torch.sqrt(var + self.epsilon)
        return self.gamma * normalized_x + self.beta

In the code, $\gamma$ and $\beta$ are trainable parameters that the model learns during training. They scale and shift the normalized values, which helps improve the model's performance.
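
Written out, the `forward` method computes the following for each row $x$ of the input. Here $\mu$ and $\sigma^2$ are the mean and variance over the last dimension, $\epsilon$ is the small constant `self.epsilon`, and $\hat{x}$ and $y$ are names introduced only for this formula:

$$
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \odot \hat{x} + \beta
$$

Since $\gamma$ is initialized to ones and $\beta$ to zeros, the layer initially returns $\hat{x}$ unchanged.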

Setting `unbiased=False` computes the biased variance, which divides by $n$. The unbiased variance divides by $n-1$ instead.
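
To see the two divisors side by side, here is a small sketch on a made-up 4-element tensor (the values are an assumption chosen to keep the arithmetic simple). It compares `torch.var` with each setting against the sum of squared deviations divided by $n$ and by $n-1$:

```python
import torch

# Hypothetical 4-element sample, chosen so the arithmetic is easy to follow
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
n = x.numel()

# Sum of squared deviations from the mean
squared_deviations = ((x - x.mean()) ** 2).sum()

biased = x.var(unbiased=False)    # divides by n
unbiased = x.var(unbiased=True)   # divides by n - 1

print(biased, squared_deviations / n)
print(unbiased, squared_deviations / (n - 1))
```

The gap between the two shrinks as $n$ grows, so for large embedding dimensions the choice has little practical effect.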

Now, let's use the class to process the example above.

In [None]:
oLayerNormalization = LayerNormalization(embedding_dimension=5)
outputs_layerNormalization = oLayerNormalization(batch_example)
print(f'Outputs:\n {outputs_layerNormalization}')

Outputs:
 tensor([[ 0.5528,  1.0693, -0.0223,  0.2656, -1.8654],
        [ 0.9087, -1.3767, -0.9564,  1.1304,  0.2940]], grad_fn=<AddBackward0>)


In [None]:
mean = outputs_layerNormalization.mean(dim=1, keepdim=True)
var = outputs_layerNormalization.var(dim=1, keepdim=True, unbiased=False)
print(f'Means:\n {mean}')
print(f'Variances:\n {var}')

Means:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variances:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
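
As an additional sanity check, the same computation can be compared against PyTorch's built-in `nn.LayerNorm`. The sketch below is self-contained, so it repeats the normalization by hand instead of reusing the class above. It assumes the built-in layer's default `eps` of `1e-5` and its default initialization (weight of ones, bias of zeros):

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
x = torch.randn(2, 5)

# Manual normalization over the last dimension, using the biased variance
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# PyTorch's built-in layer, normalizing over the last dimension of size 5
builtin = nn.LayerNorm(5)
reference = builtin(x)

# True if the two results agree within a small tolerance
print(torch.allclose(manual, reference, atol=1e-6))
```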
