# Layer Normalization

Neuron activations can span a wide range of values.
Normalization maps them into a small range, typically centered around zero, which makes training more stable. With activations on a consistent scale, gradients are better behaved, so we can usually take larger steps when updating the neural network.

![Layer normalization](imgs/layer-normalization.png)

To apply layer normalization, we follow these steps:

1. Normalize the inputs
$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \quad \text{(normalization)}
$$

2. Scale and shift
$$
y_i = \gamma \hat{x}_i + \beta \quad \text{(scale and shift)}
$$


Where $\mu$ is the mean and $\sigma^2$ is the variance of the inputs, both computed over the $m$ features of a single input (not over the batch):
$$
\begin{aligned}
\mu &= \frac{1}{m} \sum_{i=1}^{m} x_i && \text{(mean)} \\
\sigma^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2 && \text{(variance)}
\end{aligned}
$$

$\epsilon$ is a small number to avoid division by zero, $\gamma$ is the scale parameter, and $\beta$ is the shift parameter.
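
The formulas above can be sketched as a small function. This is a minimal illustration, not a reference implementation: the name `layer_norm_sketch` is hypothetical, and it assumes normalization over the last dimension only. The result is compared against PyTorch's `torch.nn.functional.layer_norm` as a sanity check.

```python
import torch
import torch.nn.functional as F


def layer_norm_sketch(x, gamma, beta, eps=1e-5):
    """Hypothetical helper: normalize over the last dimension, then scale and shift."""
    mu = x.mean(dim=-1, keepdim=True)                      # mean of the features
    sigma2 = ((x - mu) ** 2).mean(dim=-1, keepdim=True)    # biased variance (divide by m)
    x_hat = (x - mu) / torch.sqrt(sigma2 + eps)            # normalization
    return gamma * x_hat + beta                            # scale and shift


x = torch.tensor([[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]])
gamma = torch.ones(3)
beta = torch.zeros(3)

y = layer_norm_sketch(x, gamma, beta)
y_ref = F.layer_norm(x, normalized_shape=(3,), weight=gamma, bias=beta, eps=1e-5)
print(torch.allclose(y, y_ref, atol=1e-5))
```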


In [86]:
import torch
from torch import nn

In [87]:
# Outline of the steps implemented in the cells below:
# X = inputs
# m = X.size()[-1]            # size of the last dimension (3), the number of features per word

# # Mean
# mu = (1 / m) * torch.sum(X, dim=-1, keepdim=True)

# # Variance
# sigma2 = (1 / m) * torch.sum((X - mu) ** 2, dim=-1, keepdim=True)

# # Normalization
# X_norm = (X - mu) / torch.sqrt(sigma2 + epsilon)

# # Scale and Shift
# Y = gamma * X_norm + beta

### Initializing the inputs

In [135]:
inputs = torch.tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])
batch_size, sequence_size, embedding_size = inputs.size()
# Swap the batch and sequence axes: (batch, sequence, embedding) -> (sequence, batch, embedding)
inputs = inputs.permute(1, 0, 2)

print("Input shape:", inputs.size())
sequence_size, batch_size, embedding_size

Input shape: torch.Size([2, 1, 3])


(2, 1, 3)

Here 2 is the number of inputs (words), 1 is the batch size, and 3 is the embedding dimension of each word.
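
The cell above moves the sequence axis to the front. With a batch size of 1, `reshape` and `permute` happen to give the same tensor, but for larger batches they differ: `reshape` reinterprets the memory order, while `permute` actually swaps the axes. A small sketch with a hypothetical toy tensor (not the notebook's data):

```python
import torch

# Hypothetical batch: 2 sequences, 3 tokens each, embedding size 1
x = torch.arange(6).reshape(2, 3, 1)      # (batch, sequence, embedding)

reshaped = x.reshape(3, 2, 1)             # keeps memory order, so tokens get mixed across sequences
permuted = x.permute(1, 0, 2)             # swaps the batch and sequence axes

print(reshaped.squeeze(-1))
print(permuted.squeeze(-1))
print(torch.equal(reshaped, permuted))

# With batch size 1 the two operations agree
single = torch.arange(3).reshape(1, 3, 1)
print(torch.equal(single.reshape(3, 1, 1), single.permute(1, 0, 2)))
```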

### Parameters of Layer Normalization

In [136]:
parameter_shape = inputs.size()[-2:]
gamma = nn.Parameter(torch.ones(parameter_shape))
beta = nn.Parameter(torch.zeros(parameter_shape))

print("Parameter shape:", parameter_shape)
print(gamma)
print(beta)

Parameter shape: torch.Size([1, 3])
Parameter containing:
tensor([[1., 1., 1.]], requires_grad=True)
Parameter containing:
tensor([[0., 0., 0.]], requires_grad=True)



### Setting the dimensions

In [144]:
# We can dynamically get the last two dimensions of the tensor
# dims = [-(i+1) for i in range(len(parameter_shape))]
# dims

[-1, -2]

In [145]:
dims = (-2, -1)     # -1 also works, but I do not know the potential issues yet
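
One potential difference between the two choices shows up once the batch size is larger than 1. Reducing over `(-2, -1)` pools statistics across the batch for each word position, while reducing over `-1` keeps one statistic per word and batch element (which is what `nn.LayerNorm(embedding_size)` does). The sketch below uses a hypothetical input with a batch of 2 to make this visible:

```python
import torch

# Hypothetical input: 2 words, batch of 2, embedding size 3
x = torch.tensor([[[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]],
                  [[0.0, 2.0, 4.0], [20.0, 22.0, 24.0]]])

mean_last = x.mean(dim=-1, keepdim=True)          # one mean per word and batch element
mean_both = x.mean(dim=(-2, -1), keepdim=True)    # one mean per word, shared across the batch

print(mean_last.shape, mean_last.flatten())
print(mean_both.shape, mean_both.flatten())
```

With the notebook's batch size of 1 both reductions give the same numbers, which is why either setting appears to work here.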

### Mean

In [147]:
print(inputs.shape)
inputs

torch.Size([2, 1, 3])


tensor([[[0.2000, 0.1000, 0.3000]],

        [[0.5000, 0.1000, 0.1000]]])

In [148]:
# mean = inputs.mean(dims, keepdim=True)
# print(mean.size()) 
# mean

torch.Size([2, 1, 1])


tensor([[[0.2000]],

        [[0.2333]]])

A more detailed way to calculate the mean

In [156]:
X = inputs
m = X.size()[-1]
mu = (1 / m) * torch.sum(X, dim=-1, keepdim=True)

print(mu.size()) 
mu

torch.Size([2, 1, 1])


tensor([[[0.2000]],

        [[0.2333]]])

### Variance

We first compute the variance. Epsilon is added to it later, before taking the square root, to avoid dividing by zero.

In [154]:
# variance = (
#     ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
# )
# variance

tensor([[[0.0067]],

        [[0.0356]]])

A more detailed way to calculate the variance

In [153]:
# Variance (reduce over the same dimension as the mean, since m counts only the last dimension)
sigma2 = (1 / m) * torch.sum( (X - mu) ** 2, dim=-1, keepdim=True )
sigma2

tensor([[[0.0067]],

        [[0.0356]]])
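
As a cross-check, the manual computation can be compared with `torch.var`. Layer normalization uses the biased estimate (divide by $m$), so `unbiased=False` is needed; the default divides by $m - 1$. The tensor below repeats the notebook's inputs so the snippet runs on its own:

```python
import torch

X = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])
mu = X.mean(dim=-1, keepdim=True)
sigma2_manual = ((X - mu) ** 2).mean(dim=-1, keepdim=True)

# unbiased=False divides by m, matching the formula; unbiased=True divides by m - 1
sigma2_builtin = torch.var(X, dim=-1, unbiased=False, keepdim=True)
sigma2_unbiased = torch.var(X, dim=-1, unbiased=True, keepdim=True)

print(torch.allclose(sigma2_manual, sigma2_builtin))
print(sigma2_unbiased / sigma2_builtin)   # ratio m / (m - 1)
```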

### Denominator

In [None]:
# std = epsilon + torch.sqrt(sigma2)

In [157]:
epsilon = 1e-5
denominator = torch.sqrt(sigma2 + epsilon)
denominator

tensor([[[0.0817]],

        [[0.1886]]])
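
The commented-out alternative `epsilon + torch.sqrt(sigma2)` also avoids a zero denominator, but the two placements behave differently when the variance is exactly zero (for example, a row of identical values). The sketch below is an illustration under that assumption: with epsilon inside the square root the gradient stays finite, while with epsilon outside the gradient of `sqrt` at zero is infinite.

```python
import torch

epsilon = 1e-5

# Epsilon inside the square root, as in the formula
v = torch.tensor(0.0, requires_grad=True)
torch.sqrt(v + epsilon).backward()
print(v.grad)

# Epsilon added after the square root
w = torch.tensor(0.0, requires_grad=True)
(torch.sqrt(w) + epsilon).backward()
print(w.grad)
```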

### Normalize the inputs

In [166]:
# y = (inputs - mean) / std
X.shape, mu.shape, denominator.shape

(torch.Size([2, 1, 3]), torch.Size([2, 1, 1]), torch.Size([2, 1, 1]))

In [172]:
X_norm = (X - mu) / denominator
X_norm

tensor([[[    -0.0000,     -1.2238,      1.2238]],

        [[     1.4140,     -0.7070,     -0.7070]]])

### Scale and shift

In [162]:
# output = gamma * y + beta

In [164]:
Y = gamma * X_norm + beta
Y

tensor([[[-1.8236e-07, -1.2238e+00,  1.2238e+00]],

        [[ 1.4140e+00, -7.0701e-01, -7.0701e-01]]], grad_fn=<AddBackward0>)
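
As a final check, the step-by-step result can be compared with PyTorch's built-in `nn.LayerNorm`. This sketch assumes normalization over the embedding dimension only (`normalized_shape=3`) and relies on the module's default initialization of the scale to ones and the shift to zeros. The inputs are repeated so the snippet is self-contained.

```python
import torch
from torch import nn

inputs = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])   # (sequence, batch, embedding)

layer_norm = nn.LayerNorm(normalized_shape=3, eps=1e-5)   # weight starts at 1, bias at 0
out = layer_norm(inputs)

print(out)
print(out.mean(dim=-1))                    # close to 0 for each word
print(out.var(dim=-1, unbiased=False))     # close to 1 for each word
```

The variance is slightly below 1 because epsilon is added under the square root.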