# How does Layer Normalization Work?

Layer normalization was originally proposed for recurrent neural networks (RNNs), because the result of batch normalization depends on the mini-batch size and it is not clear how to apply it to RNNs. It only became widely used after "Attention Is All You Need" introduced the Transformer architecture.
The authors of the Transformer chose it as their normalization method throughout the model, and it performs very well, especially in NLP tasks.

**But what exactly is layer normalization, and why should we normalize our data? Let’s begin with the latter question.**

## Benefits of Normalization

<div align="center">
  <img width="460" height="300" src="https://i.ibb.co/0p0r6VW/Capture.png">
</div>

Normalization is good for your model. It reduces training time, keeps the model from being biased toward features with larger values, and restricts the weights to a certain range so that they don’t explode.
All in all, it is undesirable to train a model with gradient descent on non-normalized features.
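
To make the point about higher-valued features concrete, here is a minimal sketch with made-up data (the `features` array and its two columns are hypothetical, not taken from any dataset). One column has values in single digits and the other in the tens of thousands. After standardizing each column, both end up on the same scale:

```python
import numpy as np

# Hypothetical data: column 0 is small, column 1 is in the tens of thousands.
features = np.array([
    [1.0, 20000.0],
    [2.0, 35000.0],
    [3.0, 50000.0],
    [4.0, 65000.0],
])

# Standardize each column (feature) separately.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print("Spread before:", features.std(axis=0))
print("Spread after: ", standardized.std(axis=0))
```

Before standardizing, the second column has a spread thousands of times larger than the first. Afterwards both columns have a mean of 0 and a standard deviation of 1, so neither dominates purely because of its units.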

There is more than one way to perform normalization; two of them are presented here:

<div align="center">
  <img width="460" height="300" src="https://i.ibb.co/BfgTjGQ/Picture1.png">
</div>

The main difference between these normalization methods is the way we calculate the average and variance used to normalize our data.
You are probably familiar with the one on the right, **batch norm**.


## Batch Normalization

<div align="center">
  <img width="320" height="360" src="https://i.ibb.co/60m5Dm3/Picture1.png">
</div>

In batch norm, we take all sentences in a batch, and for each feature in these sentences, we can find an average and a variance, which will be used to normalize the data in that feature. 

For example, imagine that we have a batch of two sentences: “`Popcorn popped.`” and “`Tea steeped.`” Each sentence is represented by a matrix in which each row represents a word:

<div align="center">
  <img width="320" height="360" src="https://i.ibb.co/4YSxCqr/Picture1.png">
</div>

In batch norm, we take one feature and calculate the average and variance of it. 

<div align="center">
  <img width="320" src="https://i.ibb.co/TPr8bM2/Picture1.png">
</div>

We then normalize the data so that the average is near zero and the variance is about one. Here is the formula:

$$x_{norm}=\frac{x-avg(x)}{\sqrt{var(x)}}$$

<div align="center">
  <img width="320" src="https://i.ibb.co/gDKgrLk/Picture1.png">
</div>

Of course, we should repeat this for other features as well.
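
The batch norm steps above can be sketched in NumPy. This is a minimal sketch, not a full batch norm layer: it assumes the statistics for each feature are pooled over both sentences and all word positions, and it leaves out the small epsilon and the learnable scale and shift that real implementations add. The values are the same dummy matrices used in the code section of this post.

```python
import numpy as np

sentence1 = np.array([[0.31, 0.14, 0.93], [0.14, 0.88, 0.98]])  # "Popcorn popped."
sentence2 = np.array([[0.85, 0.20, 0.14], [0.46, 0.61, 0.49]])  # "Tea steeped."

# Stack into one batch of shape (sentences, words, features) = (2, 2, 3).
batch = np.stack([sentence1, sentence2])

# One average and one variance per feature, taken across the whole batch.
feature_mean = batch.mean(axis=(0, 1))
feature_var = batch.var(axis=(0, 1))

batch_normalized = (batch - feature_mean) / np.sqrt(feature_var)

print("Per-feature average:", feature_mean)
print("Average after normalization:", batch_normalized.mean(axis=(0, 1)))
print("Variance after normalization:", batch_normalized.var(axis=(0, 1)))
```

Note that the statistics have shape `(3,)`, one entry per feature, so every sentence in the batch is shifted and scaled by the same numbers.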

## Layer Normalization

<div align="center">
  <img width="320" height="360" src="https://i.ibb.co/tpDnBBg/Picture1.png">
</div>

In layer norm, we take the average and variance over all of the features of a single sentence.
Let’s see what it means using the same two sentences: 

<div align="center">
  <img width="320" height="360" src="https://i.ibb.co/4YSxCqr/Picture1.png">
</div>

Here we don’t care that these two sentences are from the same batch. To obtain the average and variance for a sentence, we simply use all of the features in that sentence:

<div align="center">
  <img width="320" src="https://i.ibb.co/2ytx9Jh/Picture1.png">
</div>

And again, after normalization, we’ll have matrices with an average of 0 and a variance of 1:

<div align="center">
  <img width="320" src="https://i.ibb.co/42b38Q8/Picture1.png">
</div>
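
As a worked example of the arithmetic, take the first sentence and assume its matrix holds the dummy values 0.31, 0.14, 0.93 in the first row and 0.14, 0.88, 0.98 in the second (the same values used in the code section). All six numbers go into a single average and a single variance:

$$avg(x)=\frac{0.31+0.14+0.93+0.14+0.88+0.98}{6}=\frac{3.38}{6}\approx 0.5633$$

$$var(x)=\frac{1}{6}\sum_{i=1}^{6}\left(x_i-avg(x)\right)^2\approx 0.1385,\qquad \sqrt{var(x)}\approx 0.3721$$

The first entry of the normalized matrix is then:

$$x_{norm}=\frac{0.31-0.5633}{0.3721}\approx -0.68$$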

## Layer Norm in code

Now let’s implement what I just described in code. I create two NumPy arrays, `sentence1` and `sentence2`, which are the same dummy matrices that we used in the illustrations:

In [1]:
import numpy as np

sentence1 = np.array([[0.31,0.14,0.93],[0.14,0.88,0.98]]) # "Popcorn Popped."
sentence2 = np.array([[0.85,0.2,0.14],[0.46,0.61,0.49]]) # "Tea Steeped."

Now we calculate the average and variance for each of these sentences.

In [2]:
# Average and Variance for First Sentence:
average1 = sentence1.mean()
variance1 = sentence1.var()

# Average and Variance for Second Sentence:
average2 = sentence2.mean()
variance2 = sentence2.var()

Now we normalize by applying the following equation to our matrices:
$$x_{norm}=\frac{x-avg(x)}{\sqrt{var(x)}}$$

In [3]:
# Sentence 1 Normalization:
sentence1_norm = (sentence1 - average1)/(np.sqrt(variance1))
print(f"Sentence1:\n{sentence1}\n\n Sentence1 (Normalized):\n{sentence1_norm}\n")

# Sentence 2 Normalization:
sentence2_norm = (sentence2 - average2)/(np.sqrt(variance2))
print(f"Sentence2:\n{sentence2}\n\n Sentence2 (Normalized):\n{sentence2_norm}")


Sentence1:
[[0.31 0.14 0.93]
 [0.14 0.88 0.98]]

 Sentence1 (Normalized):
[[-0.68074565 -1.1375618   0.98528975]
 [-1.1375618   0.85093206  1.11964744]]

Sentence2:
[[0.85 0.2  0.14]
 [0.46 0.61 0.49]]

 Sentence2 (Normalized):
[[ 1.63221997 -1.07657062 -1.32661282]
 [ 0.00694562  0.63205114  0.13196672]]
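
The per-sentence steps above repeat the same three lines for each sentence. As a minimal sketch, they can be folded into one helper that handles a whole batch at once. The name `layer_norm` and its interface are my own choices for illustration, and the helper assumes the first axis indexes sentences:

```python
import numpy as np

def layer_norm(batch):
    """Normalize each item along the first axis using its own average and variance."""
    # Reduce over every axis except the first, so each sentence gets its own statistics.
    axes = tuple(range(1, batch.ndim))
    mean = batch.mean(axis=axes, keepdims=True)
    var = batch.var(axis=axes, keepdims=True)
    return (batch - mean) / np.sqrt(var)

sentence1 = np.array([[0.31, 0.14, 0.93], [0.14, 0.88, 0.98]])  # "Popcorn popped."
sentence2 = np.array([[0.85, 0.20, 0.14], [0.46, 0.61, 0.49]])  # "Tea steeped."

batch = np.stack([sentence1, sentence2])  # shape (2, 2, 3)
normalized = layer_norm(batch)

print(normalized[0])  # same values as Sentence1 (Normalized) above
print(normalized[1])  # same values as Sentence2 (Normalized) above
```

Because of `keepdims=True`, the statistics have shape `(2, 1, 1)`, one pair per sentence, and broadcast back over the words and features of that sentence.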


You could also simply use the [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) implementation from the PyTorch library:

In [4]:
import torch

torch1 = torch.from_numpy(sentence1) # "Popcorn Popped."
torch2 = torch.from_numpy(sentence2) # "Tea Steeped."

layer_norm = torch.nn.LayerNorm(torch1.size())

# Sentence 1 Normalization:
torch1_norm = layer_norm(torch1.float())
print(f"Sentence1:\n{torch1}\n\n Sentence1 (Normalized):\n{torch1_norm}\n")

# Sentence 2 Normalization:
torch2_norm = layer_norm(torch2.float())
print(f"Sentence2:\n{torch2}\n\n Sentence2 (Normalized):\n{torch2_norm}")

Sentence1:
tensor([[0.3100, 0.1400, 0.9300],
        [0.1400, 0.8800, 0.9800]], dtype=torch.float64)

 Sentence1 (Normalized):
tensor([[-0.6807, -1.1375,  0.9853],
        [-1.1375,  0.8509,  1.1196]], grad_fn=<NativeLayerNormBackward>)

Sentence2:
tensor([[0.8500, 0.2000, 0.1400],
        [0.4600, 0.6100, 0.4900]], dtype=torch.float64)

 Sentence2 (Normalized):
tensor([[ 1.6321, -1.0765, -1.3265],
        [ 0.0069,  0.6320,  0.1320]], grad_fn=<NativeLayerNormBackward>)
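
If you compare the two outputs closely, the PyTorch values differ from the NumPy ones in the fourth decimal place (for example, -1.1375 instead of -1.1376). A likely reason is that `torch.nn.LayerNorm` adds a small `eps` (1e-5 by default) to the variance before taking the square root. It also has a learnable scale and shift, but these start at 1 and 0, so they change nothing here. The sketch below reproduces the effect in plain NumPy, under the assumption that `eps` is the only difference:

```python
import numpy as np

sentence1 = np.array([[0.31, 0.14, 0.93], [0.14, 0.88, 0.98]])  # "Popcorn popped."

eps = 1e-5  # default value of `eps` in torch.nn.LayerNorm

mean = sentence1.mean()
var = sentence1.var()

without_eps = (sentence1 - mean) / np.sqrt(var)        # formula used in this post
with_eps = (sentence1 - mean) / np.sqrt(var + eps)     # formula with eps added

print("Without eps:\n", np.round(without_eps, 4))
print("With eps:\n", np.round(with_eps, 4))
```

With `eps` added, the values agree with the printed PyTorch output for the first sentence to about four decimal places.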
