# Layer Normalization & Residual Connections

In this notebook we take a look at Layer Normalization and related ideas. We start with an input tensor whose shape is described by three values:
- B: The batch size
- S: The number of input words (sequence length)
- E: The input word embedding size

In [1]:
import torch
from torch import nn

# Let's take some small, hand-picked inputs
inputs = torch.tensor([[[0.2, 0.1, 0.3], [0.5, 0.1, 0.1]]])
B, S, E = inputs.size()
inputs = inputs.reshape(S, B, E)
inputs.size()

torch.Size([2, 1, 3])
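
One caveat on the `reshape` call above: it gives the same result as swapping the batch and sequence axes only because `B = 1` here. The sketch below uses a made-up tensor `x` with `B = 2` (not from the notebook) to show how `reshape` and `transpose` differ in the general case.

```python
import torch

# Hypothetical batch: B=2 sequences, S=3 tokens, E=2 features (illustrative values only).
x = torch.arange(12, dtype=torch.float32).reshape(2, 3, 2)
B, S, E = x.size()

reshaped = x.reshape(S, B, E)    # reinterprets the flat memory order; mixes tokens of one sequence
transposed = x.transpose(0, 1)   # swaps the batch and sequence axes

print(transposed[0])  # first token of each sequence
print(reshaped[0])    # first two tokens of sequence 0
print(torch.equal(reshaped, transposed))
```

For batches larger than one, `transpose` (or `permute`) is probably the safer choice when the goal is a `(S, B, E)` layout.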

## Learnable Parameters

Layer Normalization has two learnable parameters, $\gamma$ and $\beta$. They are learned as the model trains: backpropagation computes their gradients and gradient descent updates them. For initialization we set $\gamma$ to a tensor of ones and $\beta$ to a tensor of zeros, so the layer starts out as a plain normalization.

In [2]:
# Shape of gamma/beta: the last two dims of inputs (batch and embedding) in this walkthrough
parameter_shape = inputs.size()[-2:]
gamma = nn.Parameter(torch.ones(parameter_shape))
beta = nn.Parameter(torch.zeros(parameter_shape))

gamma.size(), beta.size()     

(torch.Size([1, 3]), torch.Size([1, 3]))
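
For comparison, PyTorch's built-in `nn.LayerNorm` uses the same initialization. The sketch below assumes normalization over the embedding dimension only (`E = 3`), which differs from the `(1, 3)` shape used in this walkthrough; `ln` is just an illustrative name.

```python
import torch
from torch import nn

# Built-in layer norm over a last dimension of size 3 (matching E = 3 above).
ln = nn.LayerNorm(3)

print(ln.weight)  # plays the role of gamma, initialized to ones
print(ln.bias)    # plays the role of beta, initialized to zeros
```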

Compute the dimensions over which we want to normalize: one trailing dimension for each entry in `parameter_shape`.

In [3]:
dims = [-(i + 1) for i in range(len(parameter_shape))]
dims

[-1, -2]
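
To make the effect of `parameter_shape` concrete, the list comprehension above can be wrapped in a small helper. `normalized_dims` is a hypothetical name introduced here for illustration only.

```python
def normalized_dims(parameter_shape):
    """Return the trailing dims to reduce over (hypothetical helper, mirrors the cell above)."""
    return [-(i + 1) for i in range(len(parameter_shape))]

print(normalized_dims((3,)))    # a 1-entry shape reduces over the embedding axis only
print(normalized_dims((1, 3)))  # a 2-entry shape reduces over batch and embedding axes
```

A one-entry shape normalizes each token independently, while the two-entry shape used in this walkthrough also pools over the batch axis.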

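Now we take the mean over those dimensions, which here are the batch dimension and the embedding dimension. With `keepdim=True` the reduced dimensions are kept with size 1, so the result broadcasts against `inputs`.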

In [4]:
mean = inputs.mean(dim=dims, keepdim=True)
print(f"The size of this mean matrix is : {mean.size()}")
print(f"The mean is : {mean}")

The size of this mean matrix is : torch.Size([2, 1, 1])
The mean is : tensor([[[0.2000]],

        [[0.2333]]])


Calculate the standard deviation. We add a small epsilon to the variance before taking the square root so that the result can never be exactly zero. The standard deviation is the denominator of the normalization, so a zero value would cause a division by zero.

In [5]:
var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)
epsilon = 1e-5
std = (var + epsilon).sqrt()
(f"The Standard Deviation is : {std}")

'The Standard Deviation is : tensor([[[0.0817]],\n\n        [[0.1886]]])'
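
The role of epsilon is easiest to see on an input with zero variance. The sketch below uses a constant vector chosen for illustration (it is not part of the notebook's data) and normalizes over the last dimension only.

```python
import torch

# A constant vector has zero variance (values are illustrative, not from the notebook).
x = torch.tensor([[2.0, 2.0, 2.0]])
mean = x.mean(dim=-1, keepdim=True)
var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)

without_eps = (x - mean) / var.sqrt()        # 0 / 0
with_eps = (x - mean) / (var + 1e-5).sqrt()  # 0 / small positive number

print(without_eps)
print(with_eps)
```

Without epsilon the division is `0 / 0` and yields `nan`; with epsilon the output is a well-defined tensor of zeros.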

Plug the numerator `inputs - mean` and the denominator `std` into the formula for $y$ to get the normalized tensor.

In [6]:
y = (inputs - mean) / std
y

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]])
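
Written out, the steps so far correspond to the usual layer normalization formulas, where $x_i$ ranges over the $N$ values that are normalized together (here, the entries selected by `dims`):

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
y_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

For the first row $[0.2, 0.1, 0.3]$ this gives $\mu = 0.2$, $\sigma^2 = 0.02 / 3 \approx 0.00667$ and $\sqrt{\sigma^2 + \epsilon} \approx 0.0817$, so the entry $0.1$ maps to $-0.1 / 0.0817 \approx -1.2238$, matching the output above.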

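Apply the learnable parameters: scale by $\gamma$ and shift by $\beta$. Since they are still at their initial values (ones and zeros), the output equals `y`.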

In [7]:
out = gamma * y + beta
out

tensor([[[ 0.0000, -1.2238,  1.2238]],

        [[ 1.4140, -0.7070, -0.7070]]], grad_fn=<AddBackward0>)
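
As a sanity check, the manual steps can be compared against PyTorch's functional `layer_norm`. This is a minimal sketch that re-creates the notebook's input as `x` and assumes the same `normalized_shape` of `(1, 3)` and `eps` of `1e-5`; the names `manual` and `builtin` are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[0.2, 0.1, 0.3]], [[0.5, 0.1, 0.1]]])  # shape (S=2, B=1, E=3)

# Manual computation over the last two dims, as in the cells above.
mean = x.mean(dim=[-1, -2], keepdim=True)
var = ((x - mean) ** 2).mean(dim=[-1, -2], keepdim=True)
manual = (x - mean) / (var + 1e-5).sqrt()

# Built-in functional form; weight and bias default to None (no scale or shift).
builtin = F.layer_norm(x, normalized_shape=(1, 3), eps=1e-5)

print(torch.allclose(manual, builtin, atol=1e-5))
```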

# Layer Normalization Class 

In [8]:
import torch
from torch import nn

class LayerNormalization(nn.Module):
    #-----------Constructor-----------#
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()                                                  #----Register gamma and beta as module parameters
        self.parameters_shape = parameters_shape
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(parameters_shape))
        self.beta = nn.Parameter(torch.zeros(parameters_shape))

    def forward(self, inputs):
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]        #----Calculate Dims
        mean = inputs.mean(dim=dims, keepdim=True)                          #----Calculate the mean for (dims)
        print(f"Mean \n ({mean.size()}): \n {mean}")
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True)           #----Calculate the var for (dims)
        std = (var + self.eps).sqrt()                                       #----Calculate the std for (dims)
        print(f"Standard Deviation \n ({std.size()}): \n {std}")
        y = (inputs - mean) / std                                           #----Calculate the Layer Normalization
        print(f"y \n ({y.size()}) = \n {y}")
        out = self.gamma * y + self.beta                                    #----Apply the learnable scale and shift
        print(f"out \n ({out.size()}) = \n {out}")
        return out
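
To check this kind of implementation against the library, a print-free variant can be compared with `nn.LayerNorm`. `QuietLayerNorm` below is a hypothetical stand-in written for this comparison, and the random input is an assumption, not the notebook's data.

```python
import torch
from torch import nn

class QuietLayerNorm(nn.Module):
    """Hypothetical print-free variant of the class above, for comparison only."""

    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape = tuple(parameters_shape)
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(self.parameters_shape))
        self.beta = nn.Parameter(torch.zeros(self.parameters_shape))

    def forward(self, x):
        # Reduce over one trailing dim per entry in parameters_shape.
        dims = [-(i + 1) for i in range(len(self.parameters_shape))]
        mean = x.mean(dim=dims, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
        y = (x - mean) / (var + self.eps).sqrt()
        return self.gamma * y + self.beta

torch.manual_seed(0)
x = torch.randn(5, 3, 8)                   # (sentence_length, batch_size, embedding_dim)
ours = QuietLayerNorm((8,))(x)
reference = nn.LayerNorm(8, eps=1e-5)(x)   # built-in layer norm over the last dim
print(torch.allclose(ours, reference, atol=1e-5))
```

Subclassing `nn.Module` also means `gamma` and `beta` show up in `.parameters()`, so an optimizer can update them.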

### Example

----------------------Sentences----------------------
1. The sun sets behind mountains.
2. He sings sweet songs every night.
3. We ride bikes through the park.
4. She reads books by the river.
5. They eat lunch in the park.
6. I dance under the bright lights.
7. The wind blows through the trees.
8. He plays guitar on the beach.
9. She writes poems about love.

---------------------- Calculations ----------------------

We take the sentence length to be `5`, assume an embedding dimension of `8`, and use a batch size of `3`. The inputs are random values standing in for real word embeddings.

In [9]:
batch_size = 3
sentence_length = 5
embedding_dim = 8

inputs = torch.randn(sentence_length, batch_size, embedding_dim)
print(f"input \n ({inputs.size()}) = \n {inputs}")

input 
 (torch.Size([5, 3, 8])) = 
 tensor([[[ 0.4467, -1.3007, -0.0920, -1.2152,  0.3026, -0.2864, -1.5821,
          -0.0724],
         [-0.7923,  0.0367,  0.8204,  0.7569,  0.8571,  0.5695,  0.5892,
          -1.6977],
         [ 1.1445, -0.0371, -0.3484, -1.1498, -0.2298,  0.0535, -0.1654,
          -0.3413]],

        [[-0.1922, -0.5485, -0.3247, -0.2460,  0.4074,  1.9384,  0.2171,
          -1.8889],
         [ 0.8976, -0.0413,  0.6761, -0.9226,  0.3660,  0.5053,  0.1961,
           0.9786],
         [-1.4371, -0.3621,  1.8867, -0.0622, -0.3469,  0.6589, -1.1966,
          -1.2185]],

        [[ 0.1490, -0.2083, -0.2071, -1.5242,  1.3496, -1.3662,  0.5302,
          -0.3606],
         [-0.8799,  0.9166, -0.2566, -0.3860,  0.0735, -1.9301,  0.0702,
           0.0742],
         [-0.2577,  0.0575,  1.2190, -0.1572,  0.9279,  0.9861, -0.1931,
          -2.2979]],

        [[ 0.9427, -0.1144,  0.8744,  1.3006,  0.7610, -0.2281,  0.5894,
           0.2228],
         [ 2.1817, -2.0896, 

In [10]:
layer_norm = LayerNormalization(inputs.size()[-1:])

out = layer_norm.forward(inputs)

Mean 
 (torch.Size([5, 3, 1])): 
 tensor([[[-0.4749],
         [ 0.1425],
         [-0.1342]],

        [[-0.0797],
         [ 0.3320],
         [-0.2597]],

        [[-0.2047],
         [-0.2898],
         [ 0.0356]],

        [[ 0.5435],
         [ 0.3628],
         [ 0.2544]],

        [[-0.5417],
         [-0.5236],
         [-0.4032]]])
Standard Deviation 
 (torch.Size([5, 3, 1])): 
 tensor([[[0.7292],
         [0.8661],
         [0.5927]],

        [[0.9995],
         [0.5722],
         [1.0423]],

        [[0.8800],
         [0.7825],
         [1.0441]],

        [[0.5028],
         [1.3625],
         [1.1690]],

        [[0.8930],
         [0.8031],
         [0.9016]]])
y 
 (torch.Size([5, 3, 8])) = 
 tensor([[[ 1.2639, -1.1324,  0.5252, -1.0152,  1.0662,  0.2586, -1.5183,
           0.5520],
         [-1.0793, -0.1222,  0.7828,  0.7094,  0.8252,  0.4931,  0.5158,
          -2.1246],
         [ 2.1576,  0.1638, -0.3613, -1.7135, -0.1612,  0.3167, -0.0527,
          -0.3495]],



In [11]:
out[0].mean(), out[0].std()

(tensor(-1.4901e-08, grad_fn=<MeanBackward0>),
 tensor(1.0215, grad_fn=<StdBackward0>))
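
The standard deviation above is about `1.02` rather than exactly `1` because `Tensor.std()` uses the sample estimator (dividing by `n - 1`) by default, while the normalization divides by `n`. The sketch below reproduces this on a random stand-in tensor of the same shape (an assumption, since the notebook's random values are not fixed by a seed) and normalizes over the last dimension.

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(5, 3, 8)  # random stand-in with the same shape as the example

mean = x.mean(dim=-1, keepdim=True)
var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
out = (x - mean) / (var + 1e-5).sqrt()

n = out[0].numel()                   # 3 * 8 = 24 values
print(out[0].std())                  # sample std, divides by n - 1
print(out[0].std(unbiased=False))    # population std, divides by n
print(math.sqrt(n / (n - 1)))        # ratio between the two estimators
```

With `n = 24` the ratio is `sqrt(24 / 23)`, roughly `1.0215`, which is the value reported in the cell above.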