## **Layer normalisation**
Layer normalisation (LN) is a deep learning technique that stabilises training by normalising the activations of each sample across its features, so that the next layer receives inputs with zero mean and unit variance before a learnable scale and shift are applied. It is commonly motivated by the internal covariate shift (ICS) problem, where the distribution of activations within a layer changes during training, making it difficult for the network to learn effectively.

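- More stable training: each layer sees inputs on a consistent scale.
- Faster convergence: training usually needs fewer steps to reach the same loss.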

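$x' = f(W^{T} x + b)$

$y = \gamma \cdot \frac{x' - \text{mean}}{\text{sd}} + \beta$

Here $x$ is the layer input, $W$ and $b$ are the layer's weights and bias, $f$ is the activation function, and $\gamma$ and $\beta$ are the learnable scale and shift parameters.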
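Written out in full, the mean and sd in the formula above can be stated explicitly. This is a sketch that assumes the statistics are taken over the $H$ features of a single sample; the symbols $H$, $\mu$, $\sigma$ and $\epsilon$ are notation introduced here, with $\mu$ and $\sigma$ standing for the mean and sd and $\epsilon$ a small constant for numerical stability:

$$
\mu = \frac{1}{H}\sum_{i=1}^{H} x'_i, \qquad
\sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (x'_i - \mu)^2 + \epsilon}, \qquad
y_i = \gamma_i \, \frac{x'_i - \mu}{\sigma} + \beta_i
$$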

Example:
```
X = [[0.2, 0.1, 0.3],
     [0.5, 0.1, 0.1]]

mean1 = (0.2 + 0.1 + 0.3) / 3 = 0.6 / 3 = 0.2
mean2 = (0.5 + 0.1 + 0.1) / 3 = 0.7 / 3 = 0.233

sd1 = sqrt(1/3 * ((0.2 - 0.2)^2 + (0.1 - 0.2)^2 + (0.3 - 0.2)^2))
    = 0.0816

sd2 = sqrt(1/3 * ((0.5 - 0.233)^2 + (0.1 - 0.233)^2 + (0.1 - 0.233)^2))
    = 0.1886

mean = [mean1,
        mean2]
     = [0.2,
        0.233]

sd = [sd1,
      sd2]
   = [0.0816,
      0.1886]

Y = (X - mean) / sd        (each row of Y has mean = 0, sd = 1)
out = gamma * Y + beta
```
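The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch, not part of the notebook: the helper name `row_stats` is made up for illustration, and no epsilon is added, so the normalised values come out marginally larger in magnitude than the tensor outputs shown further down.

```python
import math

# Same 2 x 3 input as the worked example above.
X = [[0.2, 0.1, 0.3],
     [0.5, 0.1, 0.1]]

def row_stats(row):
    """Return the mean and population standard deviation of one row."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)  # divide by N, not N - 1
    return mean, math.sqrt(var)

stats = [row_stats(row) for row in X]

# Normalise every row with its own mean and sd.
normalised = [[(v - mean) / sd for v in row]
              for row, (mean, sd) in zip(X, stats)]

for (mean, sd), row in zip(stats, normalised):
    print(f"mean={mean:.4f}  sd={sd:.4f}  normalised={[round(v, 4) for v in row]}")
```

The standard deviations should come out as roughly 0.0816 and 0.1886, matching the corrected values in the example.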

In [1]:
import torch
import torch.nn as nn

In [2]:
input = torch.tensor([[[0.2, 0.1, 0.3],[0.5, 0.1, 0.1]]])
B,S,E = input.size()
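# (B, S, E) -> (S, B, E): permute swaps the axes; reshape would only give the same result when B == 1
input = input.permute(1, 0, 2)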
input.shape

torch.Size([2, 1, 3])
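When moving from (B, S, E) to (S, B, E), `reshape` and `permute` agree only when one of the two swapped axes has size 1, as it does here. The sketch below uses a made-up batch with B = 2 to show the difference; the tensor `x` and its sizes are assumptions chosen purely for illustration.

```python
import torch

# Hypothetical batch with B = 2, S = 3, E = 2 so that the difference is visible.
x = torch.arange(12).reshape(2, 3, 2)   # shape (B, S, E)

swapped = x.permute(1, 0, 2)    # true axis swap: swapped[s, b] == x[b, s]
reshaped = x.reshape(3, 2, 2)   # same memory order, merely cut into new chunks

print(swapped.shape == reshaped.shape)        # True: the shapes match
print(torch.equal(swapped[1, 0], x[0, 1]))    # True: token 1 of sample 0 is preserved
print(torch.equal(swapped, reshaped))         # False: the contents differ when B > 1
```

Because the shapes match, a `reshape` used in place of an axis swap fails silently, mixing tokens from different samples.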

In [3]:
# shape of the trailing dims that are normalised together: here (B, E) = (1, 3)
params_shape = input.size()[-2 :]

# initialising gamma and beta
gamma = nn.Parameter(torch.ones(params_shape))
beta = nn.Parameter(torch.zeros(params_shape))

In [4]:
# dims across which mean and sd are calculated
dims = [-(i + 1) for i in range(len(params_shape))]
dims

[-1, -2]

In [5]:
mean = input.mean(dim = dims, keepdim = True)
mean

tensor([[[0.2000]],

        [[0.2333]]])

In [6]:
# If the variance is 0, the sd is 0 and the division gives NaN/inf, so we add a small epsilon to the variance before taking the square root
epsilon = 1e-5
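A constant row makes the role of epsilon concrete: its variance is exactly 0, so the division has a zero denominator. The following is a small sketch in plain Python, where the helper `normalise` and the input `constant_row` are hypothetical names used only for this illustration. In plain Python the failure appears as a `ZeroDivisionError`; with tensors the same 0/0 would produce `nan` instead.

```python
import math

def normalise(row, eps=0.0):
    """Normalise one row with population variance; eps is added under the square root."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    sd = math.sqrt(var + eps)
    return [(v - mean) / sd for v in row]

constant_row = [0.5, 0.5, 0.5]   # hypothetical input with zero variance

try:
    normalise(constant_row)      # sd == 0, so this divides by zero
except ZeroDivisionError as err:
    print("without epsilon:", err)

print("with epsilon:", normalise(constant_row, eps=1e-5))   # [0.0, 0.0, 0.0]
```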

In [7]:
var = ((input - mean) ** 2).mean(dim = dims, keepdim = True)
sd = (var + epsilon).sqrt()
sd

tensor([[[0.0817]],

        [[0.1886]]])

In [8]:
X_dash = (input - mean) / sd
X_dash

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])

In [9]:
Y = gamma * X_dash + beta
Y

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
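As a sanity check on the steps above, the normalised tensor should have a mean of about 0 and an sd of about 1 over the normalised dims. The sketch below repeats the computation on the same input in a self-contained form; the names `x`, `x_hat`, `row_mean` and `row_sd` are introduced here for illustration only.

```python
import torch

x = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])   # shape (S, B, E) = (2, 1, 3)
dims = [-1, -2]
eps = 1e-5

mean = x.mean(dim=dims, keepdim=True)
var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
x_hat = (x - mean) / (var + eps).sqrt()

# Statistics of the normalised output, one value per sequence position.
row_mean = x_hat.mean(dim=dims)
row_sd = ((x_hat - x_hat.mean(dim=dims, keepdim=True)) ** 2).mean(dim=dims).sqrt()

print(row_mean)   # close to 0
print(row_sd)     # just below 1, because eps slightly inflates the denominator
```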

### **Summary**

In [10]:
class LayerNorm(nn.Module):
  def __init__(self, params_shape, eps = 1e-5):
    super().__init__()  # lets the module register gamma and beta as parameters
    self.params_shape = params_shape
    self.eps = eps
    self.gamma = nn.Parameter(torch.ones(params_shape))   # learnable scale
    self.beta = nn.Parameter(torch.zeros(params_shape))   # learnable shift

  def forward(self, input):
    # normalise over the trailing len(params_shape) dims
    dims = [-(i + 1) for i in range(len(self.params_shape))]
    mean = input.mean(dim = dims, keepdim = True)
    # population variance (divide by N), as in the worked example
    var = ((input - mean) ** 2).mean(dim = dims , keepdim = True)
    sd = (var + self.eps).sqrt()
    X_dash = (input - mean) / sd
    Y = self.gamma * X_dash + self.beta

    return Y

In [11]:
input = torch.tensor([[[0.2, 0.1, 0.3],[0.5, 0.1, 0.1]]])
B,S,E = input.size()
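# (B, S, E) -> (S, B, E): permute swaps the axes; reshape would only give the same result when B == 1
input = input.permute(1, 0, 2)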
params_shape = input.size()[-2 :]

In [12]:
norm = LayerNorm(params_shape)
norm.forward(input)


tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
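One way to gain confidence in the result is to compare it with PyTorch's built-in `nn.LayerNorm`, whose `normalized_shape` argument plays the same role as `params_shape`. The sketch below repeats the manual steps inline so that it runs on its own; the variable names `reference` and `manual` are assumptions made for this example.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])   # shape (S, B, E) = (2, 1, 3)

# Built-in layer: weight starts at ones and bias at zeros, like gamma and beta above.
reference = nn.LayerNorm(normalized_shape=x.shape[-2:], eps=1e-5)

# Manual computation following the same steps as the class above.
mean = x.mean(dim=[-1, -2], keepdim=True)
var = ((x - mean) ** 2).mean(dim=[-1, -2], keepdim=True)
manual = (x - mean) / (var + 1e-5).sqrt()

print(torch.allclose(reference(x), manual, atol=1e-5))
```

Agreement here depends on both versions using the population variance and adding epsilon inside the square root.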

