# Layer Normalization

Here we focus on the Add & Norm parts.

Normalization:

The activations of neurons can span a wide range of positive and negative values. Normalization maps these values into a much smaller range, typically centered around zero. This makes training more stable: during backpropagation, the gradient steps we take are more even, so the network learns faster and reaches good parameter values more consistently.

Layer normalization applies this idea inside a neural network: the activation values of the neurons in each layer are normalized so that, taken together, the activations of a layer have a mean of zero and a standard deviation of 1.

Let x, y, z and o be the activation vectors of the successive layers of the network.

x' = f[w_1.T x + b_1] (w_1 = weights, b_1 = bias, f = activation function)

y = γ_1[(x' - μ_1)/σ_1] + β_1 (μ_1 = mean of the activation values x' in that layer, σ_1 = standard deviation of the activation values x' in that layer, γ_1, β_1 = learnable parameters)
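The two formulas above can be sketched in PyTorch as follows. This is only a minimal illustration: the sizes (4 inputs, 6 neurons), the choice of tanh for the activation f, and the names `W1`, `b1`, `gamma1` and `beta1` are assumptions made for the example, not part of the original notes.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 4 input features feeding a layer of 6 neurons.
x = torch.randn(4)
W1 = torch.randn(4, 6)   # weights w_1
b1 = torch.randn(6)      # bias b_1

# x' = f(w_1^T x + b_1), with tanh assumed as the activation f.
x_prime = torch.tanh(W1.T @ x + b1)

# mu_1 and sigma_1: mean and (population) standard deviation of the layer's activations.
mu1 = x_prime.mean()
sigma1 = ((x_prime - mu1) ** 2).mean().sqrt()

# gamma_1 and beta_1 would be learned in practice; they start at 1 and 0.
gamma1 = torch.ones(6)
beta1 = torch.zeros(6)

# y = gamma_1 * (x' - mu_1) / sigma_1 + beta_1
y = gamma1 * (x_prime - mu1) / sigma1 + beta1

y_mean = y.mean()
y_std = ((y - y_mean) ** 2).mean().sqrt()
print(y_mean.item(), y_std.item())  # close to 0 and 1 respectively
```

With γ_1 = 1 and β_1 = 0 the output is exactly the standardized activations; training can then move γ_1 and β_1 away from these starting values if a different scale or offset works better.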

Example:

x = [ 0.2 0.1 0.3
      0.5 0.1 0.1 ] -> 2 words, 3 dimensions

μ_11 = 1/3 [0.2 + 0.1 + 0.3] = 0.2
μ_21 = 1/3 [0.5 + 0.1 + 0.1] = 0.233

σ_11 = (1/3 {[0.2 - 0.2]^2 + [0.1 - 0.2]^2 + [0.3 - 0.2]^2})^(1/2)
     = 0.0816
σ_21 = (1/3 {[0.5 - 0.233]^2 + [0.1 - 0.233]^2 + [0.1 - 0.233]^2})^(1/2)
     = 0.1886

y = (x - μ)/σ
y = [ 0      -1.2247  1.2247
      1.4142 -0.7071 -0.7071 ]

out = γ * y + β
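The hand calculation can be checked with a few lines of plain Python. This is a small sketch using only the standard library; the helper name `normalize_row` is made up for this example.

```python
import math

rows = [[0.2, 0.1, 0.3],
        [0.5, 0.1, 0.1]]  # 2 words, 3 dimensions

def normalize_row(row):
    """Return (mean, std, normalized values) for one word vector."""
    mu = sum(row) / len(row)
    # Population variance: divide by N, as in the formulas above.
    var = sum((v - mu) ** 2 for v in row) / len(row)
    sigma = math.sqrt(var)
    return mu, sigma, [(v - mu) / sigma for v in row]

for row in rows:
    mu, sigma, y = normalize_row(row)
    print(f"mu={mu:.4f}  sigma={sigma:.4f}  y={[round(v, 4) for v in y]}")
```

Each word is normalized on its own here, so every row of y ends up with mean 0 and standard deviation 1.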

In [1]:
import torch
from torch import nn

In [2]:
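inputs = torch.tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])  # (batch, sequence, embedding)
B, S, E = inputs.size()
# Swap the batch and sequence axes to get (sequence, batch, embedding).
inputs = inputs.permute(1, 0, 2)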
inputs.size()

torch.Size([2, 1, 3])
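A caution on reordering axes: for the single-sentence input used here (B = 1), `reshape` and `permute` produce the same tensor, but with a larger batch they do not. The sketch below uses a hypothetical integer tensor with B = 2 so the difference is easy to see.

```python
import torch

# Hypothetical batch: 2 sentences, 3 tokens each, embedding size 2.
x = torch.arange(12).reshape(2, 3, 2)   # (B, S, E)
B, S, E = x.size()

reshaped = x.reshape(S, B, E)   # keeps memory order, so tokens get regrouped
permuted = x.permute(1, 0, 2)   # true axis swap to (S, B, E)

print(torch.equal(reshaped, permuted))        # False: the two differ when B > 1
print(torch.equal(permuted[0, 1], x[1, 0]))   # True: token 0 of sentence 1 is preserved

# With a batch of one, both approaches agree.
single = x[:1]
print(torch.equal(single.reshape(S, 1, E), single.permute(1, 0, 2)))  # True
```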

In [3]:
parameter_shape = inputs.size()[-2:]
gamma = nn.Parameter(torch.ones(parameter_shape))
beta = nn.Parameter(torch.zeros(parameter_shape))

In [4]:
gamma.size(), beta.size()

(torch.Size([1, 3]), torch.Size([1, 3]))

In [5]:
# Dimensions over which the mean and variance are computed: here the last two,
# i.e. the batch dimension and the embedding dimension.
dims = [-(i+1) for i in range(len(parameter_shape))]
dims

[-1, -2]
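The list comprehension yields one negative axis index per entry of `parameter_shape`, so the slice taken from `inputs.size()` decides which values are averaged together. A small sketch with a hypothetical random tensor; the helper name `reduction_dims` is invented for this example.

```python
import torch

x = torch.randn(5, 3, 8)   # hypothetical (sentence_length, batch, embedding)

def reduction_dims(parameter_shape):
    # One negative index per trailing dimension being normalized.
    return [-(i + 1) for i in range(len(parameter_shape))]

last_two = reduction_dims(x.size()[-2:])   # [-1, -2]: batch and embedding
last_one = reduction_dims(x.size()[-1:])   # [-1]: embedding only

print(x.mean(dim=last_two, keepdim=True).shape)  # torch.Size([5, 1, 1])
print(x.mean(dim=last_one, keepdim=True).shape)  # torch.Size([5, 3, 1])
```

Passing only the last dimension gives one mean per token, which is the setting used with the class in the second half of this notebook.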

In [6]:
mean = inputs.mean(dim=dims, keepdim=True)
mean.size()

torch.Size([2, 1, 1])

In [7]:
var = ((inputs-mean) ** 2).mean(dim=dims, keepdim=True)
epsilon = 1e-5
std = (var + epsilon).sqrt()
std

tensor([[[0.0817]],

        [[0.1886]]])

In [8]:
y = (inputs - mean) / std
y

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])
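The values here differ slightly from the hand calculation (1.2238 instead of roughly 1.2247) because `epsilon` is added to the variance before the square root. The sketch below reproduces both numbers for the last entry of the first word, using only the standard library.

```python
import math

row = [0.2, 0.1, 0.3]
mu = sum(row) / len(row)
var = sum((v - mu) ** 2 for v in row) / len(row)

eps = 1e-5
exact = (row[2] - mu) / math.sqrt(var)              # no epsilon
stabilized = (row[2] - mu) / math.sqrt(var + eps)   # epsilon inside the square root

print(f"{exact:.4f} {stabilized:.4f}")  # 1.2247 1.2238
```

The epsilon term guards against division by zero when all values in a row are identical; its effect is only visible here because the variance of this toy example is itself small.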

In [9]:
out = gamma * y + beta
out

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
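As a sanity check, the step-by-step result can be compared with PyTorch's built-in `torch.nn.functional.layer_norm`. This is a sketch under the assumption that normalizing over the last two dimensions, shape `(1, 3)`, is what is intended, as in the cells above; no weight or bias is passed, which corresponds to γ = 1 and β = 0.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])  # shape (2, 1, 3)
eps = 1e-5

# Manual computation over the last two dimensions, as in the cells above.
mean = x.mean(dim=[-1, -2], keepdim=True)
var = ((x - mean) ** 2).mean(dim=[-1, -2], keepdim=True)
manual = (x - mean) / (var + eps).sqrt()

# Built-in version; without weight and bias it applies no scale or shift.
builtin = F.layer_norm(x, normalized_shape=(1, 3), eps=eps)

print(torch.allclose(manual, builtin, atol=1e-5))
```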

## Layer Normalization Class

In [10]:
import torch
from torch import nn

class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape = parameters_shape
        self.eps = eps
        # Learnable scale (gamma) and shift (beta), starting as the identity transform.
        self.gamma = nn.Parameter(torch.ones(parameters_shape))
        self.beta = nn.Parameter(torch.zeros(parameters_shape))

    def forward(self, inputs):
        # Normalize over the trailing dimensions covered by parameters_shape.
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]
        mean = inputs.mean(dim=dims, keepdim=True)
        print(f"Mean \n ({mean.size()}): \n {mean}")
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
        # eps keeps the division stable when the variance is close to zero.
        std = (var + self.eps).sqrt()
        print(f"Standard Deviation \n ({std.size()}): \n {std}")
        y = (inputs - mean) / std
        print(f"y \n ({y.size()}) = \n {y}")
        out = self.gamma * y + self.beta
        print(f"out \n ({out.size()}) = \n {out}")
        return out
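
Whether γ and β are actually trained depends on how the class is declared: an optimizer is usually given `model.parameters()`, and that method only exists on `nn.Module` subclasses. A minimal sketch, using two hypothetical classes `PlainNorm` and `ModuleNorm` that hold the same parameters:

```python
import torch
from torch import nn

class PlainNorm:
    # Plain Python class: the parameters exist, but nothing collects them.
    def __init__(self, shape):
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))

class ModuleNorm(nn.Module):
    # nn.Module subclass: assigned nn.Parameter attributes are registered automatically.
    def __init__(self, shape):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))

plain = PlainNorm(8)
module = ModuleNorm(8)

print(hasattr(plain, "parameters"))                      # False
print([name for name, _ in module.named_parameters()])   # ['gamma', 'beta']
```

Deriving from `nn.Module` also means the layer can be nested inside a larger model, so that its parameters are picked up along with everything else.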

In [11]:
batch_size = 3
sentence_length = 5
embedding_dim = 8 
inputs = torch.randn(sentence_length, batch_size, embedding_dim)

print(f"input \n ({inputs.size()}) = \n {inputs}")

input 
 (torch.Size([5, 3, 8])) = 
 tensor([[[-0.0179,  0.4530,  0.4010, -0.2802,  0.5412,  1.4781, -0.2056,
          -0.3623],
         [ 2.0039, -0.1214,  0.7030, -0.8229, -1.9977,  0.7847, -0.1746,
          -0.4260],
         [ 0.3596, -1.4512,  0.7291,  0.1631, -0.8912,  0.8370,  0.0927,
          -0.9296]],

        [[ 0.8030, -1.3887,  0.7416, -0.1297,  1.3427,  0.2086,  0.6719,
           1.2836],
         [ 1.1494, -0.0777, -1.4635, -0.4438,  0.4408,  0.9419,  0.7639,
          -1.0528],
         [-0.6064, -0.2414, -0.5432,  1.1113,  0.2556,  0.3448,  1.6990,
          -0.8284]],

        [[ 1.2076, -1.2376,  1.0048,  0.4390,  0.4904, -0.5646,  1.7604,
          -0.4008],
         [-0.8284,  0.4402, -1.9090,  0.8574,  0.2157, -0.4522, -1.1233,
           0.7318],
         [ 0.7573,  0.0518, -1.3538, -1.3700, -0.6946,  0.9383,  0.0460,
           0.4487]],

        [[-1.8776, -0.7117, -0.6100, -1.7194, -2.0448, -0.0688,  0.3078,
          -0.1063],
         [ 0.9363, -1.4924, 

In [12]:
layer_norm = LayerNormalization(inputs.size()[-1:])

In [13]:
out = layer_norm.forward(inputs)

Mean 
 (torch.Size([5, 3, 1])): 
 tensor([[[ 0.2509],
         [-0.0064],
         [-0.1363]],

        [[ 0.4416],
         [ 0.0323],
         [ 0.1489]],

        [[ 0.3374],
         [-0.2585],
         [-0.1470]],

        [[-0.8538],
         [-0.4610],
         [-0.3893]],

        [[-0.2523],
         [-0.1055],
         [ 0.2719]]])
Standard Deviation 
 (torch.Size([5, 3, 1])): 
 tensor([[[0.5693],
         [1.1192],
         [0.7916]],

        [[0.8311],
         [0.8966],
         [0.8318]],

        [[0.9421],
         [0.9202],
         [0.8425]],

        [[0.8529],
         [0.9186],
         [0.7790]],

        [[1.0001],
         [0.5438],
         [0.6662]]])
y 
 (torch.Size([5, 3, 8])) = 
 tensor([[[-0.4722,  0.3551,  0.2637, -0.9331,  0.5099,  2.1557, -0.8019,
          -1.0773],
         [ 1.7962, -0.1028,  0.6339, -0.7296, -1.7793,  0.7069, -0.1503,
          -0.3749],
         [ 0.6265, -1.6612,  1.0933,  0.3782, -0.9537,  1.2296,  0.2894,
          -1.0022]],

