## About
Notes on, and an implementation of, the transformer architecture.

## Self Attention

1. W = [X_transpose.X]
2. Y = W_Transpose.X

* It can be regarded as a set operation with zero parameters that can be tweaked during training.

* Scaled self-attention divides the weights by the square root of the embedding dimension, because the magnitude of the dot products in W grows with that dimension.

where X can be numericalised text, for example "Hi There, Let's have a look" --> [0, 5, 4, 7, 6]
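A minimal sketch of the two steps above in PyTorch. It assumes each row of `X` holds one token's embedding vector, so the weights are computed as `X @ X.T` (with tokens stored as columns the transposes swap). The embedding table, its sizes, and the row-wise softmax are illustrative assumptions and not part of the formulas above.

```python
import math
import torch
import torch.nn.functional as F

def basic_self_attention(x, scaled=False):
    """x: (t, k) tensor with one row per token. Returns (y, w)."""
    k = x.size(-1)
    w = x @ x.T                    # (t, t) raw dot-product weights
    if scaled:
        w = w / math.sqrt(k)       # scaled variant: divide by sqrt of embedding dim
    w = F.softmax(w, dim=-1)       # assumption: normalise each row to sum to 1
    y = w @ x                      # (t, k) weighted sums of the input vectors
    return y, w

torch.manual_seed(0)
token_ids = torch.tensor([0, 5, 4, 7, 6])   # numericalised text from the example
embed = torch.nn.Embedding(10, 8)           # hypothetical vocab of 10, embedding dim 8
x = embed(token_ids)
y, w = basic_self_attention(x, scaled=True)
print(y.shape, w.shape)
```

Note that the function itself has no learnable parameters; only the embedding table that produces `x` would be trained.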

* Multi-head attention: different words in an input sequence relate to each other through different relations.

For example, "Hi" relates directly to "There" and "us", whereas the intent is to look there. Such subtle features are extracted better with multi-head attention.

- It is self-attention applied in parallel, once for each such relation.

- For fixed weights W, self-attention is a linear operation from X to Y and thus doesn't suffer from vanishing or exploding gradients.

- Self-attention can be regarded as a sequence-to-sequence layer, as used in machine translation, that offers parallel computation and perfect long-term memory.

- Self-attention layers can be stacked on top of each other to build powerful models known as transformers.
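As a rough illustration of multi-head attention, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in. The sizes (4 heads, embedding dimension 16, batch of 2, sequence length 5) are arbitrary assumptions. Unlike the parameter-free basic self-attention, this module also learns query, key, and value projections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, num_heads = 16, 4          # assumed sizes; emb_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim=emb_dim, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 5, emb_dim)      # (batch, seq_len, emb_dim)
# Self-attention: the same tensor is used as query, key and value.
out, attn_weights = mha(x, x, x)

print(out.shape)           # one output vector per input token
print(attn_weights.shape)  # attention weights, averaged over the heads by default
```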




## Transformer
1. A transformer is a sequence-to-sequence model that uses self-attention to propagate information along the time dimension. When self-attention is applied across pixels it becomes an image transformer, and when it is applied across graph nodes it becomes a graph transformer.


The basic transformer block, written as a modular class, is outlined below.

1. Pass the input through layer normalisation, which is similar to batch normalisation.
2. Pass the output through self-attention.
3. Add the output of step 2 to the original input over a residual connection.
4. Repeat the same pattern (layer normalisation, a layer, residual connection) with a linear layer in place of self-attention.

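```
import torch.nn as nn

class Block(nn.Module):
    # The constructor is an assumed sketch; the original only showed forward().
    def __init__(self, self_attention, emb_dim):
        super().__init__()
        self.self_attention = self_attention      # any module mapping (b, t, k) -> (b, t, k)
        self.layernorm1 = nn.LayerNorm(emb_dim)   # one layer norm per sub-block
        self.layernorm2 = nn.LayerNorm(emb_dim)
        self.linear = nn.Linear(emb_dim, emb_dim)

    def forward(self, x):
        # Sub-block 1: layer norm -> self-attention -> residual add
        y = self.layernorm1(x)
        y = self.self_attention(y)
        x = x + y

        # Sub-block 2: layer norm -> linear -> residual add
        y = self.layernorm2(x)
        y = self.linear(y)   # apply the linear layer to the normalised y, not to x
        return x + y
```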

- A basic sequence-to-label transformer consists of input embeddings extracted from a sequence of characters, fed to a stack of transformer blocks, with the output sequence pooled into the output label.
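The sequence-to-label pipeline described above could be sketched roughly as follows. To keep the example self-contained it uses PyTorch's `nn.TransformerEncoderLayer` as a stand-in for the transformer block; the class name `SequenceClassifier`, mean pooling, and all sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Hypothetical sequence-to-label model: embed -> blocks -> pool -> label."""
    def __init__(self, vocab_size, emb_dim, num_heads, depth, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Stand-in for a stack of transformer blocks.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(emb_dim, num_heads,
                                       dim_feedforward=4 * emb_dim,
                                       batch_first=True)
            for _ in range(depth)
        ])
        self.to_label = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)      # (b, t) -> (b, t, k)
        for block in self.blocks:
            x = block(x)               # shape stays (b, t, k)
        x = x.mean(dim=1)              # pool over the sequence: (b, k)
        return self.to_label(x)        # (b, num_classes) logits

torch.manual_seed(0)
model = SequenceClassifier(vocab_size=100, emb_dim=32, num_heads=4,
                           depth=2, num_classes=3)
logits = model(torch.randint(0, 100, (8, 20)))   # batch of 8 sequences, 20 tokens each
print(logits.shape)
```

Mean pooling is only one option; max pooling or a dedicated classification token would fit the same structure.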

Limitations

1. With this approach, the positional information of the words is lost, so the model struggles to capture the difference between sentences such as

Yesterday, the car gave a worse riding experience than the bike.
The bike doesn't really give a nicer riding experience than the car.

2. To fix this, we introduce one of
- a. Position embeddings
- b. Position encodings
- c. Relative positions
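As an illustration of option a, the sketch below learns one embedding vector per position and adds it to the token embedding. The class name `TokenAndPositionEmbedding`, the maximum length, and the sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Hypothetical layer: token embedding plus a learned position embedding."""
    def __init__(self, vocab_size, max_len, emb_dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(max_len, emb_dim)   # one vector per position

    def forward(self, token_ids):
        t = token_ids.size(1)
        positions = torch.arange(t, device=token_ids.device)   # 0, 1, ..., t-1
        # pos_emb(positions) is (t, k) and broadcasts over the batch dimension.
        return self.token_emb(token_ids) + self.pos_emb(positions)

torch.manual_seed(0)
layer = TokenAndPositionEmbedding(vocab_size=10, max_len=32, emb_dim=16)
a = torch.tensor([[1, 2, 3]])
b = torch.tensor([[3, 2, 1]])   # same tokens, reversed order
out_a, out_b = layer(a), layer(b)
print(out_a.shape)
```

With this layer, token 1 at position 0 (in `a`) and token 1 at position 2 (in `b`) receive different vectors, so word order is no longer invisible to the model.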


- To build transformers from self-attention layers, we need the following ingredients.
1. Define a transformer block.
2. Mask the self-attention block.
3. Stack several transformer blocks.
4. Add positional information to the input vectors.
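The masking step can be sketched as below, assuming the mask meant here is the causal mask used in autoregressive models, where each position may attend only to itself and to earlier positions. The function name and tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x):
    """x: (t, k). Each position attends only to itself and earlier positions."""
    t, k = x.shape
    scores = x @ x.T / k ** 0.5                      # (t, t) scaled dot products
    # True above the diagonal marks the "future" positions to hide.
    future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    w = F.softmax(scores, dim=-1)                    # masked entries become exactly 0
    return w @ x, w

torch.manual_seed(0)
x = torch.randn(4, 8)
y, w = masked_self_attention(x)
print(w)   # lower-triangular attention weights
```

Because the first position can only see itself, its output equals its input vector.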


- Transformer models such as BERT typically have a pretraining phase and a fine-tuning phase.



### BERT 

* In the pretraining phase, BERT is trained on the corpus with two tasks: masking and next sentence prediction.

1. Masking: a few tokens are randomly corrupted on purpose and BERT is asked to predict them, similar to a fill-in-the-blanks exercise.

2. Next sentence prediction: a CLS token is put at the start of each sequence and a SEP token is used to separate the concatenated sequences. In this task, BERT predicts whether two sequences are contiguous or come from different parts of the corpus.

In the fine-tuning phase, the pretrained model is fine-tuned for the desired task.
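A simplified sketch of how inputs for the two pretraining tasks might be prepared. The helper names `build_pair` and `mask_tokens` are hypothetical, the sentences are made up, and the 15% masking rate follows the BERT paper. As a simplification, every selected token is replaced by `[MASK]`, whereas BERT sometimes keeps the token or swaps in a random one.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def build_pair(sent_a, sent_b):
    """Join two token lists in the layout used for next sentence prediction."""
    return [CLS] + sent_a + [SEP] + sent_b + [SEP]

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Randomly replace ordinary tokens with [MASK]; labels hold the originals."""
    masked, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)      # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)     # no prediction needed at this position
    return masked, labels

rng = random.Random(0)
tokens = build_pair(["the", "car", "is", "fast"], ["it", "won", "the", "race"])
masked, labels = mask_tokens(tokens, rng=rng)
print(masked)
print(labels)
```

For next sentence prediction, the label would be whether `sent_b` really followed `sent_a` in the corpus or was sampled from elsewhere.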

For more information, refer to the blogs

1. <a href="https://suraj52.medium.com/the-transformers-985bfe679001" >Medium Blog1 - Suraj </a>
2. <a href="https://suraj52.medium.com/types-of-transformers-i-bert-aa38e04f2458" >Medium Blog2 - Suraj </a>

## Getting started with the Implementation of a Transformer
<img src="/home/suraj/ClickUp/Jan-Feb/nlp-basics-to-advanced/22_deep-nlp-starter-04/1.png" alt="Alt text" title="BERT">


```
# importing libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math
```


#### A. Input Embeddings.
We need to convert each word in the input sequence to an embedding vector. Embeddings replace one-hot encodings with denser vectors that carry more semantic information.

Dimension understanding

> Suppose each embedding vector has dimension 256 and the vocabulary size is 1000; the embedding matrix is then 1000 × 256. The matrix is learned during training, or one can use pretrained embeddings such as Word2Vec or GloVe.

> During the forward pass, each word in the sequence is mapped to its corresponding 256-dimensional vector.

> If the batch size is 128 and the sequence length is 20 words, the output is 128 × 20 × 256.
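The dimensions quoted above can be checked with a short sketch using `nn.Embedding` directly; the random token ids are placeholders for real numericalised text.

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 1000, 256
batch_size, seq_len = 128, 20

embedding = nn.Embedding(vocab_size, embedding_size)
print(embedding.weight.shape)   # torch.Size([1000, 256])

# Placeholder batch of token ids in the range [0, vocab_size).
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
out = embedding(token_ids)
print(out.shape)                # torch.Size([128, 20, 256])
```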

```
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        # lookup table of shape (vocab_size, embedding_size)
        self.embedding_layer = nn.Embedding(vocab_size, embedding_size)

    def forward(self, x):
        # x holds integer token ids; each id is replaced by its embedding vector
        out = self.embedding_layer(x)
        return out
```


```
x = ["Hi, There ! This is your first demo"]
# toy vocabulary; "first" has no entry here, so it is left out of seq
seq_dict = {"Hi": 0, "There": 4, "This": 3, "is": 2, "your": 1, "demo": 5}
seq = [0, 4, 3, 2, 1, 5]
vocab_size = 100
embedding_size = 256
seq = torch.LongTensor(seq)
embedding_layer = EmbeddingLayer(vocab_size, embedding_size)
print(embedding_layer(seq))  # call the module rather than .forward() directly
```

Output:

```
tensor([[-0.0330,  0.9593,  0.1704,  ...,  0.1247,  0.0717,  0.5052],
        [ 0.6873, -0.4152, -1.7856,  ...,  0.1738,  0.0799, -0.4244],
        [ 0.8449, -0.0975,  0.5561,  ..., -2.1078, -1.1632,  1.7847],
        [ 0.6218,  0.2450, -0.6984,  ...,  0.4097, -0.8713, -0.7066],
        [ 1.3078,  0.3746, -2.0494,  ..., -0.2825, -0.5389,  3.4960],
        [ 0.0629, -0.4317, -0.3321,  ..., -0.5504,  1.6873, -0.5489]],
       grad_fn=<EmbeddingBackward0>)
```


#### B. Positional Encoding

> To understand a sentence better, the model needs to know two things about each word: its meaning and its position in the sentence.

> In the "Attention Is All You Need" paper, a sine function is used for the even embedding dimensions and a cosine function for the odd embedding dimensions.

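$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where pos is the position of the token in the sequence, i indexes pairs of dimensions along the embedding vector, and d is the embedding dimension.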

> Positional encoding generates a matrix similar to the embedding matrix, of size seq_len × embedding dimension. Each position in the sequence gets a 1 × 256 encoding vector, matching the 1 × 256 embedding vector of the token at that position.

> So, if the batch size is 128 and seq_len is 20, the output of the positional encoding step is 128 × 20 × 256. At this stage the positional encoding is added element-wise to the token embedding (not concatenated), as shown in the block diagram in the paper.
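One possible implementation of the sinusoidal encoding is sketched below. The class name `PositionalEncoding` and the `max_seq_len` parameter are assumptions, and the sketch requires an even embedding dimension. Following the paper, the encoding is added element-wise to the token embeddings.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding (sketch; assumes emb_dim is even)."""
    def __init__(self, max_seq_len, emb_dim):
        super().__init__()
        pos = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (max_seq_len, 1)
        two_i = torch.arange(0, emb_dim, 2, dtype=torch.float32)           # 0, 2, 4, ...
        div = torch.exp(-math.log(10000.0) * two_i / emb_dim)              # 1 / 10000^(2i/d)
        pe = torch.zeros(max_seq_len, emb_dim)
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions use sine
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions use cosine
        self.register_buffer("pe", pe)       # fixed values, not trained

    def forward(self, x):
        # x: (batch, seq_len, emb_dim); pe is broadcast over the batch
        return x + self.pe[: x.size(1)]

pos_enc = PositionalEncoding(max_seq_len=20, emb_dim=256)
embeddings = torch.zeros(128, 20, 256)   # placeholder for real token embeddings
out = pos_enc(embeddings)
print(out.shape)   # torch.Size([128, 20, 256])
```

Registering the table as a buffer keeps it out of the optimiser while still moving it to the right device together with the module.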