<h1 align="center" style="color:green;font-size: 3em;" >Implementing Transformers From Scratch Using Pytorch</h1>




* [1. Introduction](#section1)
* [2. Import libraries](#section2)
* [3. Basic components](#section3)
  - [Create Word Embeddings](#section4)
  - [Positional Encoding](#section5)
  - [Self Attention](#section6)
* [4. Encoder](#section7)
* [5. Decoder](#section8)
* [6. Testing our code](#section9)
* [7. Some useful resources](#section10)


<img src="https://theaisummer.com/static/6122618d7e1466853e88473ba375cdc7/40ffe/transformer.png">


<a class="anchor" id="section1"></a>
<h2 style="color:green;font-size: 2em;">1. Introduction</h2>

In this tutorial, we will try to implement the transformer from the "Attention Is All You Need" paper from scratch using PyTorch. A transformer has an encoder-decoder architecture, which is common for language translation models.



Note: We are not going to give an in-depth explanation of transformers here. For that, please refer to this [blog](http://jalammar.github.io/illustrated-transformer/) by Jay Alammar, which explains the inner workings of transformers in depth. We will focus on the coding part.


<img src = "https://jalammar.github.io/images/t/The_transformer_encoders_decoders.png" width=600 height=400>

The above image shows a language translation model from French to English. In practice we can use a stack of encoders (one on top of the other) and a stack of decoders, as shown below:


<img src = "https://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png" width=600 height=400>

Before going further, let us look at a full-fledged image of our attention model.

<img src = "https://miro.medium.com/max/760/1*2vyKzFlzIHfSmOU_lnQE4A.png" width=350 height=200>

<a class="anchor" id="section2"></a>
<h2 style="color:green;font-size: 2em;">2. Import Libraries</h2>

In [1]:
# importing required libraries
import torch.nn as nn
import torch
import torch.nn.functional as F
import math,copy
import warnings
warnings.simplefilter("ignore")
print(torch.__version__)

We know that the transformer has an encoder-decoder architecture for language translation. Before getting into the encoder or decoder, let us discuss some common components.


<a class="anchor" id="section3"></a>
<h2 style="color:green;font-size: 2em;"> 3. Basic components</h2>

<a class="anchor" id="section4"></a>
<h2 style="color:green;"> Create Word Embeddings</h2>

First of all, we need to convert each word in the input sequence to an embedding vector. Embedding vectors give a more semantic representation of each word.

Suppose each embedding vector has 512 dimensions and our vocab size is 100. Then our embedding matrix will be of size 100x512. This matrix is learned during training, and during inference each word is mapped to its corresponding 512-d vector. Suppose we have a batch size of 32 and a sequence length of 10 (10 words). Then the output will be 32x10x512.
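The shapes described above can be checked with a small sketch. It uses `nn.Embedding` directly with the sizes from the text; the random token ids are made up for illustration only.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 512      # sizes taken from the text above
batch_size, seq_len = 32, 10

embedding = nn.Embedding(vocab_size, embed_dim)   # learnable 100 x 512 matrix

# Hypothetical input: random token ids in [0, vocab_size)
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
vectors = embedding(token_ids)        # each id is replaced by its row in the matrix

print(embedding.weight.shape)  # torch.Size([100, 512])
print(vectors.shape)           # torch.Size([32, 10, 512])
```

Each token id simply selects one row of the embedding matrix, which is why the output gains a trailing dimension of 512.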



In [2]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        """
        Args:
            vocab_size: size of vocabulary
            embed_dim: dimension of embeddings
        """
        super(Embedding, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
    def forward(self, x):
        """
        Args:
            x: input vector
        Returns:
            out: embedding vector
        """
        out = self.embed(x)
        return out

<a class="anchor" id="section5"></a>
<h3 style="color:green">Positional Encoding</h3>


The next step is to generate the positional encoding. In order for the model to make sense of the sentence, it needs to know two things about each word:
* What does the word mean?
* What is the position of the word in the sentence?

In the "Attention Is All You Need" paper, the authors used the following functions to create the positional encoding. A sine function is used at even indices along the embedding dimension and a cosine function at odd indices.

<img src="https://miro.medium.com/max/524/1*yWGV9ck-0ltfV2wscUeo7Q.png">

<img src="https://miro.medium.com/max/564/1*SgNlyFaHH8ljBbpCupDhSQ.png">

```
pos -> refers to order in the sentence
i -> refers to position along embedding vector dimension
```

The positional encoding generates a matrix similar to the embedding matrix, of dimension sequence length x embedding dimension. For each token (word) in the sequence, we take its embedding vector of dimension 1 x 512 and add the corresponding positional vector of dimension 1 x 512, which gives a 1 x 512 output for each word/token.

For example, if we have a batch size of 32, a sequence length of 10 and an embedding dimension of 512, then the embedding output has dimension 32 x 10 x 512. The positional encoding has dimension 10 x 512 (stored as 1 x 10 x 512) and is broadcast over the batch when we add the two, so the result is again 32 x 10 x 512.

<img src="https://miro.medium.com/max/906/1*B-VR6R5vJl3Y7jbMNf5Fpw.png" height=200 width=400>
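As a rough illustration of the formulas above, here is a vectorized sketch of the sinusoidal encoding. The function name `sinusoidal_encoding` is hypothetical and the sketch assumes an even embedding dimension; the random `embeddings` tensor only stands in for a real embedding output.

```python
import math
import torch

def sinusoidal_encoding(max_seq_len: int, embed_dim: int) -> torch.Tensor:
    """Return a (max_seq_len, embed_dim) table; assumes embed_dim is even."""
    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    # 1 / 10000^(2i / d) for the even dimension indices 2i = 0, 2, 4, ...
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embed_dim)
    )
    pe = torch.zeros(max_seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_encoding(max_seq_len=10, embed_dim=512)
embeddings = torch.randn(32, 10, 512)    # stand-in for the embedding output
out = embeddings + pe.unsqueeze(0)       # (1, 10, 512) broadcasts over the batch
print(out.shape)                         # torch.Size([32, 10, 512])
```

Each sine/cosine pair shares the same frequency, so position 0 is encoded as alternating 0 and 1 values.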

In [3]:
# register_buffer in PyTorch ->
# If your model has tensors which should be saved and restored in the state_dict,
# but not trained by the optimizer, you should register them as buffers.


class PositionalEmbedding(nn.Module):
    def __init__(self,max_seq_len,embed_model_dim):
        """
        Args:
            max_seq_len: maximum length of input sequence
            embed_model_dim: dimension of embedding
        """
        super(PositionalEmbedding, self).__init__()
        self.embed_dim = embed_model_dim

        pe = torch.zeros(max_seq_len,self.embed_dim)
        for pos in range(max_seq_len):
            for i in range(0,self.embed_dim,2):
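                # i is already the even dimension index (the paper's 2i),
                # so the exponent is i / embed_dim for both members of the pair
                pe[pos, i] = math.sin(pos / (10000 ** (i / self.embed_dim)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** (i / self.embed_dim)))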
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)


    def forward(self, x):
        """
        Args:
            x: input vector
        Returns:
            x: output
        """
      
        # make embeddings relatively larger
        x = x * math.sqrt(self.embed_dim)
        #add constant to embedding
        seq_len = x.size(1)
        # pe is a buffer, so it carries no gradient and needs no Variable wrapper
        x = x + self.pe[:, :seq_len]
        return x
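
The `register_buffer` call used above can be illustrated with a toy module. `WithBuffer` is a hypothetical class written only for this sketch: the buffer shows up in the `state_dict` but not among the trainable parameters.

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):   # hypothetical toy module, not part of the tutorial
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # saved with the model, moved by .to(device), but never trained
        self.register_buffer("pe", torch.zeros(1, 10, 4))

m = WithBuffer()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]

print(param_names)              # ['linear.weight', 'linear.bias']
print(buffer_names)             # ['pe']
print("pe" in m.state_dict())   # True
```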
               


<a class="anchor" id="section6"></a>
<h2 style="color:green"> Self Attention</h2>

Let me give a glimpse of self attention and multi-head attention.

***What is self attention?***

Suppose we have the sentence "Dog is crossing the street because it saw the kitchen". What does "it" refer to here? It is easy for humans to understand that it is the dog, but not for machines.

As the model processes each word, self attention allows it to look at other positions in the input sequence for clues. It creates a vector based on the dependency of each word on the others.


Let us go through a step by step illustration of self attention.

* **Step 1:** The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. Each of these vectors will be of dimension 1x64.

Since we have multi-head attention, we will have 8 self-attention heads. I will explain the code with 8 attention heads in mind.

**How can keys, queries and values be created?**

We will have a key matrix, a query matrix and a value matrix to generate the key, query and value.
These matrices are learned during training.

```
code hint:
Suppose we have batch_size=32, sequence_length=10 and embedding dimension=512. After embedding and positional encoding, our output will be of dimension 32x10x512.
We will reshape it to 32x10x8x64. (8 is the number of heads in multi-head attention and 64 = 512/8 is the dimension per head. Don't worry, it will become clear once you go through the code.)

```


* **Step 2:** The second step is to calculate the score, i.e. we multiply the query matrix with the transpose of the key matrix. [Q x K.t]

```
code hint:
Suppose our key, query and value have dimension 32x10x8x64. Before proceeding further, we transpose each of them for multiplication convenience (32x8x10x64). Now multiply the query matrix with the transposed key matrix, i.e. (32x8x10x64) x (32x8x64x10) -> (32x8x10x10).
```


* **Step 3:** Now divide the output matrix by the square root of the key dimension (sqrt(64) = 8) and then apply softmax over it.


* **Step 4:** The result is then multiplied with the value matrix.

```
code hint:
After step 3 our output will be of dimension 32x8x10x10. Now multiply it with the value matrix (32x8x10x64) to get an output of dimension (32x8x10x64). Here 8 is the number of attention heads and 10 is the sequence length. Thus for each word in each head we have a 64-dim vector.
```

* **Step 5:** Once we have this, we concatenate the heads and pass the result through a linear layer. This forms the output of multi-head attention.

```
code hint:
The (32x8x10x64) tensor gets transposed to (32x10x8x64) and then reshaped to (32x10x512). Then it is passed through a linear layer to get an output of (32x10x512).
```
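
The five steps above can be traced with plain tensor operations. This is a minimal sketch with a random stand-in input; the learned key, query and value projections and the final linear layer are left out to keep it short, so `q`, `k` and `v` are just the raw per-head splits.

```python
import math
import torch
import torch.nn.functional as F

batch_size, seq_len, embed_dim, n_heads = 32, 10, 512, 8
head_dim = embed_dim // n_heads                   # 64

x = torch.randn(batch_size, seq_len, embed_dim)   # stand-in for the embedded input

# Split into heads and move the head axis forward: (32, 10, 8, 64) -> (32, 8, 10, 64)
q = k = v = x.view(batch_size, seq_len, n_heads, head_dim).transpose(1, 2)

scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_dim)  # (32, 8, 10, 10)
weights = F.softmax(scores, dim=-1)               # each row sums to 1
context = torch.matmul(weights, v)                # (32, 8, 10, 64)

# Merge the heads back: (32, 8, 10, 64) -> (32, 10, 8, 64) -> (32, 10, 512)
merged = context.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
print(scores.shape, context.shape, merged.shape)
```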


Now you have an idea of how multi-head attention works. It will become clearer once you go through the implementation.

In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, n_heads=8):
        """
        Args:
            embed_dim: dimension of embedding vector output
            n_heads: number of self attention heads
        """
        super(MultiHeadAttention, self).__init__()

        self.embed_dim = embed_dim    # 512 dim
        self.n_heads = n_heads   # 8
        self.single_head_dim = self.embed_dim // self.n_heads   # 512 / 8 = 64; each key, query and value will be 64-d

        # key, query and value projections, each 64 x 64;
        # the same projection is shared by all 8 heads
        self.query_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
        self.key_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
        self.value_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
        self.out = nn.Linear(self.n_heads * self.single_head_dim, self.embed_dim)

    def forward(self,key,query,value,mask=None):    #batch_size x sequence_length x embedding_dim    # 32 x 10 x 512
        
        """
        Args:
           key : key vector
           query : query vector
           value : value vector
           mask: mask for decoder
        
        Returns:
           output vector from multihead attention
        """
        batch_size = key.size(0)
        seq_length = key.size(1)

        # the query may have a different length than the key and value
        seq_length_query = query.size(1)

        # 32x10x512
        key = key.view(batch_size, seq_length, self.n_heads, self.single_head_dim)  # batch_size x sequence_length x n_heads x single_head_dim = (32x10x8x64)
        query = query.view(batch_size, seq_length_query, self.n_heads, self.single_head_dim)  # (32x10x8x64)
        value = value.view(batch_size, seq_length, self.n_heads, self.single_head_dim)  # (32x10x8x64)

        k = self.key_matrix(key)       # (32x10x8x64)
        q = self.query_matrix(query)
        v = self.value_matrix(value)

        q = q.transpose(1, 2)  # (batch_size, n_heads, seq_len, single_head_dim)    # (32 x 8 x 10 x 64)
        k = k.transpose(1, 2)  # (batch_size, n_heads, seq_len, single_head_dim)
        v = v.transpose(1, 2)  # (batch_size, n_heads, seq_len, single_head_dim)

        # compute attention
        # transpose the key for matrix multiplication
        k_adjusted = k.transpose(-1, -2)  # (batch_size, n_heads, single_head_dim, seq_len)  # (32 x 8 x 64 x 10)
        product = torch.matmul(q, k_adjusted)  # (32 x 8 x 10 x 64) x (32 x 8 x 64 x 10) = (32 x 8 x 10 x 10)

        # fill masked positions with a large negative number so they vanish after softmax
        if mask is not None:
            product = product.masked_fill(mask == 0, float("-1e20"))

        # divide by the square root of the key dimension
        product = product / math.sqrt(self.single_head_dim)  # / sqrt(64)

        # apply softmax over the key positions
        scores = F.softmax(product, dim=-1)

        # multiply with the value matrix
        scores = torch.matmul(scores, v)  # (32 x 8 x 10 x 10) x (32 x 8 x 10 x 64) = (32 x 8 x 10 x 64)

        # concatenate the heads
        concat = scores.transpose(1, 2).contiguous().view(batch_size, seq_length_query, self.single_head_dim * self.n_heads)  # (32x8x10x64) -> (32x10x8x64) -> (32,10,512)
        
        output = self.out(concat) #(32,10,512) -> (32,10,512)
       
        return output


Now a question may come to mind: what is this mask used for? Don't worry, we will go through it when we discuss the decoder.

<a class="anchor" id="section7"></a>
<h2 style="color:green;font-size: 2em;"> 4. Encoder</h2>

<img src="https://www.researchgate.net/profile/Ehsan-Amjadian/publication/352239001/figure/fig1/AS:1033334390013952@1623377525434/Detailed-view-of-a-transformer-encoder-block-It-first-passes-the-input-through-an.jpg" width=300 height=200>



In the encoder section -

**Step 1:** First, the input (padded tokens corresponding to the sentence) gets passed through the embedding layer and the positional encoding layer.

```
code hint
Suppose we have an input of 32x10 (batch size=32 and sequence length=10). Once it passes through the embedding layer it becomes 32x10x512. Then the corresponding positional encoding is added to it, producing an output of 32x10x512. This gets passed to the multi-head attention.
```

**Step 2:** As discussed above, it is passed through the multi-head attention layer, which creates a useful representation matrix as output.

```
code hint
The input to multi-head attention is a 32x10x512 tensor, from which key, query and value vectors are generated as above, finally producing a 32x10x512 output.
```

**Step 3:** Next we have a normalization and residual connection. The output from multi-head attention is added to its input and then normalized.

```
code hint
The output of multi-head attention, which is 32x10x512, gets added to the 32x10x512 input (the output of the embedding and positional encoding) and then layer normalization is applied.

```

**Step 4:** Next we have a feed forward layer followed by a normalization layer with a residual connection from the input of the feed forward layer. The normalized output of step 3 is passed through it, which finally gives the output of the encoder block.

```
code hint
The normalized output will be of dimension 32x10x512. This gets passed through 2 linear layers: 32x10x512 -> 32x10x2048 -> 32x10x512. Finally, the residual connection is added to the output and layer normalization is applied. Thus a 32x10x512 tensor is created as the output of the encoder.

```
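
Steps 3 and 4 can be sketched in isolation as follows. The two random tensors are stand-ins for the sub-layer input and the attention output, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

embed_dim, expansion_factor = 512, 4
x = torch.randn(32, 10, embed_dim)               # stand-in for the sub-layer input
attention_out = torch.randn(32, 10, embed_dim)   # stand-in for the attention output

norm1, norm2 = nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)
feed_forward = nn.Sequential(
    nn.Linear(embed_dim, expansion_factor * embed_dim),   # 512 -> 2048
    nn.ReLU(),
    nn.Linear(expansion_factor * embed_dim, embed_dim),   # 2048 -> 512
)

h = norm1(x + attention_out)        # step 3: residual connection, then layer norm
out = norm2(h + feed_forward(h))    # step 4: same pattern around the feed forward layer
print(out.shape)                    # torch.Size([32, 10, 512])
```

Because every sub-layer keeps the 32x10x512 shape, blocks of this kind can be stacked any number of times.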

In [5]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(TransformerBlock, self).__init__()
        
        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor which determines the hidden dimension of the feed forward layer
           n_heads: number of attention heads
        
        """
        self.attention = MultiHeadAttention(embed_dim, n_heads)
        
        self.norm1 = nn.LayerNorm(embed_dim) 
        self.norm2 = nn.LayerNorm(embed_dim)
        
        self.feed_forward = nn.Sequential(
                          nn.Linear(embed_dim, expansion_factor*embed_dim),
                          nn.ReLU(),
                          nn.Linear(expansion_factor*embed_dim, embed_dim)
        )

        self.dropout1 = nn.Dropout(0.2)
        self.dropout2 = nn.Dropout(0.2)

    def forward(self,key,query,value,mask=None):
        
        """
        Args:
           key: key vector
           query: query vector
           value: value vector
           mask: mask to be given for multi head attention (used only for the decoder)
        Returns:
           norm2_out: output of transformer block
        
        """
        
        attention_out = self.attention(key,query,value,mask)  #32x10x512
        attention_residual_out = attention_out + value  #32x10x512
        norm1_out = self.dropout1(self.norm1(attention_residual_out)) #32x10x512

        feed_fwd_out = self.feed_forward(norm1_out) #32x10x512 -> #32x10x2048 -> 32x10x512
        feed_fwd_residual_out = feed_fwd_out + norm1_out #32x10x512
        norm2_out = self.dropout2(self.norm2(feed_fwd_residual_out)) #32x10x512

        return norm2_out



class TransformerEncoder(nn.Module):
    """
    Args:
        seq_len : length of input sequence
        vocab_size: size of the source vocabulary
        embed_dim: dimension of embedding
        num_layers: number of encoder layers
        expansion_factor: factor which determines the hidden dimension of the feed forward layer
        n_heads: number of heads in multihead attention
        
    Returns:
        out: output of the encoder
    """
    def __init__(self, seq_len, vocab_size, embed_dim, num_layers=2, expansion_factor=4, n_heads=8):
        super(TransformerEncoder, self).__init__()
        
        self.embedding_layer = Embedding(vocab_size, embed_dim)
        self.positional_encoder = PositionalEmbedding(seq_len, embed_dim)

        self.layers = nn.ModuleList([TransformerBlock(embed_dim, expansion_factor, n_heads) for i in range(num_layers)])
    
    def forward(self, x):
        embed_out = self.embedding_layer(x)
        out = self.positional_encoder(embed_out)
        for layer in self.layers:
            out = layer(out,out,out)

        return out  #32x10x512


<a class="anchor" id="section8"></a>
<h2 style="color:green;font-size: 2em;"> 5. Decoder</h2>

<img src="https://discuss.pytorch.org/uploads/default/optimized/3X/8/e/8e5d039948b8970e6b25395cb207febc82ba320a_2_177x500.png" height=100 width=250>


Now we have gone through most parts of the encoder. Let us get into the components of the decoder. We will use the output of the encoder to generate key and query vectors for the decoder. There are two kinds of multi-head attention in the decoder: one is the decoder attention and the other is the encoder-decoder attention. Don't worry, we will go step by step.

Let us explain with respect to the training phase.

**Step 1:**

First, the output (the target sequence) gets passed through the embedding and positional encoding layers to create an embedding vector of dimension 1x512 for each word in the target sequence.

```
code hint
Suppose we have a sequence length of 10, a batch size of 32 and an embedding dimension of 512. We have an input of size 32x10 to the embedding matrix, which produces an output of dimension 32x10x512. The positional encoding of the same dimension is added to it, producing a 32x10x512 output.

```

**Step 2:**

The embedding output gets passed through a multi-head attention layer as before (creating key, query and value matrices from the target input) and produces an output vector. This time the major difference is that we use a mask with the multi-head attention.

**Why mask?**

A mask is used because, while computing attention over the target words, a word must not look at future words to check its dependencies. We compute attention because we need to know how much each word contributes to every other word. Since we are computing attention for words in the target sequence, a particular word should not see the words that come after it. For example, in the sentence "I am a student", the word "a" should not look at the word "student".


```
code hint
For the masked attention we create a triangular matrix of 1s and 0s. For example, the triangular matrix for sequence length 5 looks as below:

1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1

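After the key gets multiplied with the query, we fill all positions where the mask is zero with negative infinity. In code we fill them with a very large negative number (-1e20) instead of actual infinity to avoid numerical errors, so that these positions become zero after softmax.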


```
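
The effect of the mask can be seen in a small sketch. The all-zero `scores` matrix is a stand-in for the query-key product, chosen so that the resulting weights are easy to read.

```python
import torch
import torch.nn.functional as F

seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len))   # lower-triangular matrix of 1s and 0s
scores = torch.zeros(seq_len, seq_len)            # stand-in for Q x K^T (all scores equal)

masked = scores.masked_fill(mask == 0, float("-1e20"))
weights = F.softmax(masked, dim=-1)
print(weights)
# Row 0 attends only to position 0: [1.0, 0.0, 0.0, 0.0, 0.0]
# Row 1 splits evenly over 0 and 1: [0.5, 0.5, 0.0, 0.0, 0.0]
```

Every position above the diagonal gets zero weight, so a word never attends to words that come after it.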

**Step 3:**

As before, we have an add and norm layer, where the output of the embedding is added to the attention output and then normalized.


**Step 4:**


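Next we have another multi-head attention followed by an add and norm layer. This multi-head attention is called encoder-decoder multi-head attention. In this implementation, the key and query vectors are created from the encoder output and the value is created from the output of the previous decoder layer. Note that the original paper wires this differently: there the queries come from the previous decoder layer, while the keys and values come from the encoder output.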

```
code hint:
Thus we have a 32x10x512 output from the encoder. Key and query for all words are generated from it. Similarly, the value matrix is generated from the output of the previous decoder layer (32x10x512).

```

It is then passed through a multi-head attention (we used number of heads = 8) and then through an add and norm layer. Here the output of the previous decoder sub-layer (i.e. the previous add and norm layer) gets added to the encoder-decoder attention output and is then normalized.
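
For comparison, here is a minimal sketch of encoder-decoder attention wired as in the original paper, with queries from the decoder and keys and values from the encoder. It uses PyTorch's built-in `nn.MultiheadAttention` rather than the `MultiHeadAttention` class of this tutorial, and the random tensors and the target length of 7 are assumptions made only to show that the two lengths may differ.

```python
import torch
import torch.nn as nn

embed_dim, n_heads = 512, 8
enc_out = torch.randn(32, 10, embed_dim)   # stand-in encoder output, source length 10
dec_x = torch.randn(32, 7, embed_dim)      # stand-in decoder state, target length 7

# batch_first=True makes the layer accept (batch, seq, embed) tensors
cross_attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
out, weights = cross_attention(query=dec_x, key=enc_out, value=enc_out)

print(out.shape)       # torch.Size([32, 7, 512])
print(weights.shape)   # torch.Size([32, 7, 10])
```

The output has one vector per target position, and the weights (averaged over heads by default) say how much each target position attends to each source position.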

**Step 5:**
Next we have a feed forward layer (linear layers) with add and norm, similar to the one present in the encoder.


**Step 6:**
Finally, we create a linear layer whose output size equals the number of words in the target vocabulary, followed by a softmax function to get the probability of each word.
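
This last step can be sketched on its own. The random `decoder_out` tensor is a stand-in for the real decoder output, and the vocabulary size of 10 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, target_vocab_size = 512, 10
decoder_out = torch.randn(32, 10, embed_dim)     # stand-in for the decoder output

fc_out = nn.Linear(embed_dim, target_vocab_size)
probs = F.softmax(fc_out(decoder_out), dim=-1)   # normalize over the vocabulary axis
predicted_ids = probs.argmax(dim=-1)             # most likely word id per position

print(probs.shape)           # torch.Size([32, 10, 10])
print(predicted_ids.shape)   # torch.Size([32, 10])
```

Passing `dim=-1` explicitly matters: it makes the probabilities of each position sum to 1 over the vocabulary rather than over some other axis.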

In [6]:
class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(DecoderBlock, self).__init__()

        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor which determines the hidden dimension of the feed forward layer
           n_heads: number of attention heads
        
        """
        self.attention = MultiHeadAttention(embed_dim, n_heads=n_heads)
        self.norm = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(0.2)
        self.transformer_block = TransformerBlock(embed_dim, expansion_factor, n_heads)
        
    
    def forward(self, key, query, x, mask):
        
        """
        Args:
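           key: key vector (from the encoder output)
           query: query vector (from the encoder output)
           x: decoder input (target embedding or output of the previous decoder block)
           mask: mask to be given for multi head attention 
        Returns:
           out: output of the decoder block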
    
        """
        attention = self.attention(x,x,x,mask) #32x10x512
        value = self.dropout(self.norm(attention + x))
        out = self.transformer_block(key, query, value, mask)

        
        return out


class TransformerDecoder(nn.Module):
    def __init__(self, target_vocab_size, embed_dim, seq_len, num_layers=2, expansion_factor=4, n_heads=8):
        super(TransformerDecoder, self).__init__()
        """  
        Args:
           target_vocab_size: vocabulary size of target
           embed_dim: dimension of embedding
           seq_len : length of input sequence
           num_layers: number of decoder layers
           expansion_factor: factor which determines the hidden dimension of the feed forward layer
           n_heads: number of heads in multihead attention
        
        """
        self.word_embedding = nn.Embedding(target_vocab_size, embed_dim)
        self.position_embedding = PositionalEmbedding(seq_len, embed_dim)

        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_dim, expansion_factor=expansion_factor, n_heads=n_heads)
                for _ in range(num_layers)
            ]

        )
        self.fc_out = nn.Linear(embed_dim, target_vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x, enc_out, trg_mask):
        
        """
        Args:
            x: input vector from target
            enc_out : output from encoder layer
            trg_mask: mask for decoder self attention
        Returns:
            out: output vector
        """
        batch_size, seq_length = x.shape[0],x.shape[1]  #32x10

        x = self.word_embedding(x)  #32x10x512
        x = self.position_embedding(x) #32x10x512
        x = self.dropout(x)
     
        for layer in self.layers:
            x = layer(enc_out, enc_out, x, trg_mask) 

        # softmax over the vocabulary dimension to get per-word probabilities
        out = F.softmax(self.fc_out(x), dim=-1)

        return out


Finally, we will arrange all submodules and create the entire transformer architecture.

In [7]:


class Transformer(nn.Module):
    def __init__(self, embed_dim, src_vocab_size, target_vocab_size, seq_length,num_layers=2, expansion_factor=4, n_heads=8):
        super(Transformer, self).__init__()
        
        """  
        Args:
           embed_dim:  dimension of embedding 
           src_vocab_size: vocabulary size of source
           target_vocab_size: vocabulary size of target
           seq_length : length of input sequence
           num_layers: number of encoder and decoder layers
           expansion_factor: factor which determines the hidden dimension of the feed forward layer
           n_heads: number of heads in multihead attention
        
        """
        
        

        self.encoder = TransformerEncoder(seq_length, src_vocab_size, embed_dim, num_layers=num_layers, expansion_factor=expansion_factor, n_heads=n_heads)
        self.decoder = TransformerDecoder(target_vocab_size, embed_dim, seq_length, num_layers=num_layers, expansion_factor=expansion_factor, n_heads=n_heads)

    
    def make_trg_mask(self, trg):
        """
        Args:
            trg: target sequence
        Returns:
            trg_mask: target mask
        """
        batch_size, trg_len = trg.shape
        # lower triangular matrix of ones (on the same device as trg), expanded over the batch
        trg_mask = torch.tril(torch.ones((trg_len, trg_len), device=trg.device)).expand(
            batch_size, 1, trg_len, trg_len
        )
        return trg_mask    

    def forward(self, src, trg):
        """
        Args:
            src: input to encoder 
            trg: input to decoder
        Returns:
            out: final vector which returns probabilities of each target word
        """
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src)
        out = self.decoder(trg, enc_src, trg_mask)
        return out




<a class="anchor" id="section9"></a>
<h2 style="color:green;font-size: 2em;"> 6. Testing Our code </h2>

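Suppose we have an input sequence of length 9 and a target sequence of length 10. The decoder is fed the target without its last token, so both inputs to the model have length 9.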

In [8]:
x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]])
target = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0, 1,9], [1, 5, 6, 2, 4, 7, 6, 2, 0,2]])

In [9]:
src_vocab_size = 10
target_vocab_size = 10
num_layers = 6
seq_length= 9

model = Transformer(embed_dim=512, src_vocab_size=src_vocab_size, target_vocab_size=target_vocab_size, seq_length=seq_length, num_layers=num_layers, expansion_factor=4, n_heads=8)

In [10]:
model

In [11]:
out = model(x, target[:, :-1])
print(x.shape,target.shape)
print(out.shape)
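
The `target[:, :-1]` slice above can be illustrated on its own. In this sketch the decoder input is the target without its last token; using the target shifted by one position as training labels is an assumption here, since the tutorial itself does not train the model.

```python
import torch

target = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0, 1, 9],
                       [1, 5, 6, 2, 4, 7, 6, 2, 0, 2]])

decoder_input = target[:, :-1]   # what the decoder sees: tokens 0..8
labels = target[:, 1:]           # assumed training labels: tokens 1..9

print(decoder_input.shape, labels.shape)   # torch.Size([2, 9]) torch.Size([2, 9])
print(decoder_input[0].tolist())           # [1, 7, 4, 3, 5, 9, 2, 0, 1]
print(labels[0].tolist())                  # [7, 4, 3, 5, 9, 2, 0, 1, 9]
```

With this shift, the prediction at each position would be compared against the next token of the target.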


<a class="anchor" id="section10"></a>
<h2 style="color:green;font-size: 2em;"> 7.  Some useful resources </h2>

* Understanding transformers
  - https://theaisummer.com/transformer/
  - https://jalammar.github.io/illustrated-transformer/
* Pytorch implementation
  - https://www.youtube.com/watch?v=U0s0f995w14