Here I've attempted to build a Transformer from scratch, with reference to the article https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb and the research paper https://arxiv.org/abs/1706.03762

Let me walk you through everything I've learned from these references.

We'll start by importing all the necessary libraries. 

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

## Architecture of Transformer model

The Transformer model uses a self-attention mechanism, which lets it weigh every other word in a sequence when processing a given word (in the decoder, only the preceding words). This is in contrast to traditional RNNs, which process one word at a time in a sequential manner. Self-attention allows the Transformer to capture long-range dependencies more efficiently.
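
As a rough illustration of the idea, here is a minimal sketch of self-attention on a toy sequence. The sizes are made up, and for simplicity the queries, keys, and values are all the raw input (the real model uses learned projections, as in the MultiHeadAttention class later on).

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_k = 4, 8                       # hypothetical toy sizes
x = torch.randn(seq_len, d_k)             # one vector per token

# Simplifying assumption: Q = K = V = x (no learned projections).
scores = x @ x.T / math.sqrt(d_k)         # (seq_len, seq_len) pairwise similarities
weights = torch.softmax(scores, dim=-1)   # each row is a distribution over all tokens
output = weights @ x                      # each output is a weighted sum of all tokens

print(weights.shape)                      # torch.Size([4, 4])
print(weights.sum(dim=-1))                # every row sums to 1 (up to rounding)
```

Every token gets a weight for every other token in a single matrix product, which is why no step-by-step recurrence is needed.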
 

The Transformer model consists of an encoder and a decoder. The encoder takes an input sequence and produces a sequence of hidden states, while the decoder takes the encoder output and generates an output sequence. Both the encoder and decoder are made up of multiple layers, each of which includes a self-attention mechanism and a feed-forward neural network.

## 1. Multi-Head Attention

Multi-Head Attention is a key component of the Transformer model. The idea behind it is to compute the attention mechanism multiple times in parallel, with different sets of weights, in order to allow the model to attend to different parts of the input representation simultaneously.
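
To make the "multiple times in parallel" part concrete, here is a small sketch of how one tensor can be reshaped into heads and back. The sizes are toy values chosen for illustration; the reshaping mirrors the view/transpose pattern used in the class that follows.

```python
import torch

batch_size, seq_length, d_model, num_heads = 2, 5, 16, 4   # toy sizes (assumed)
d_k = d_model // num_heads                                  # features per head

x = torch.randn(batch_size, seq_length, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, d_k)
heads = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

# Combine: undo the transpose and merge the heads back into d_model
combined = heads.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

print(heads.shape)                # torch.Size([2, 4, 5, 4])
print(torch.equal(combined, x))   # True: splitting and combining is lossless
```

Each head sees a d_k-sized slice of the features, so attention can be computed for all heads in one batched matrix multiplication.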

In [2]:
## Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self,d_model,num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads ==0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model ,d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q,K,V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0,-1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1,2)

    def combine_heads(self,x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1,2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q,K,V,mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        attn_output = self.scaled_dot_product_attention(Q, K, V,mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

## 2. Position-wise Feed Forward Networks

In a Transformer model, a position-wise feed-forward network (FFN) is applied to each position of the encoder and decoder. It consists of two linear transformations with a ReLU activation in between them:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

where $x$ is the input tensor of shape (seq_len, embed_dim), $W_1$ and $W_2$ are the learnable weight matrices of the two linear transformations, and $b_1$ and $b_2$ are the bias terms. The output of the FFN has the same shape as the input.
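
A quick way to see what "position-wise" means is to check that running the network on the whole sequence gives the same result as running it on each position separately. This is only a sketch with made-up sizes, using nn.Sequential rather than the class defined in the next cell.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, embed_dim, hidden_dim = 3, 4, 8   # toy sizes (assumed)

# Two linear layers with a ReLU in between, as in the FFN formula
ffn = nn.Sequential(
    nn.Linear(embed_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, embed_dim),
)

x = torch.randn(seq_len, embed_dim)

whole = ffn(x)                                                    # all positions at once
per_position = torch.stack([ffn(x[i]) for i in range(seq_len)])   # one position at a time

same_result = torch.allclose(whole, per_position, atol=1e-5)
print(whole.shape)    # torch.Size([3, 4])
print(same_result)    # True
```

The same weights are shared across positions, and no position influences another inside this sub-layer.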

In [3]:
## Position-wise Feed Forward Networks
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward,self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self,x):
        return self.fc2(self.relu(self.fc1(x)))

## 3. Positional Encoding

The self-attention mechanism operates on the input sequence as a whole, rather than on individual words or phrases. It calculates a weight for each pair of tokens based on their similarity, so the output for each token is a weighted sum over all tokens in the sequence. Nothing in this computation depends on where a token sits in the sequence. *So the self-attention mechanism is not sensitive to word ordering.*

**Does it matter?**
Let us consider an example:
* All humans are smart and some are dumb.
* All humans are dumb and some are smart.

Do these two sentences have the same meaning? (Of course not!)

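The solution to this problem is to add a positional encoding to the input word embedding. We can encode the absolute and relative positions of the words in such a way that the semantics of the input are not altered. The positional encoding has the same dimension as the word embedding, so the two can be added together; the implementation below uses the fixed sine and cosine functions from the paper.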
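
The order problem can be checked numerically. The sketch below uses a simplified self-attention without learned projections (an assumption made for brevity) and toy sizes: shuffling the input only shuffles the output, until sinusoidal position vectors are added.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(x):
    # Simplified scaled dot-product self-attention (no learned projections)
    scores = x @ x.T / math.sqrt(x.size(-1))
    return torch.softmax(scores, dim=-1) @ x

seq_len, d_model = 4, 8                     # toy sizes (assumed)
x = torch.randn(seq_len, d_model)           # stand-in word embeddings
perm = torch.tensor([2, 0, 3, 1])           # a reordering of the "words"

# Without positions: shuffled input gives the same outputs, just shuffled.
order_blind = torch.allclose(self_attention(x)[perm], self_attention(x[perm]), atol=1e-5)

# Sinusoidal position vectors, one row per position
position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# With positions: the same words in a different order now give different outputs.
still_blind = torch.allclose(self_attention(x + pe)[perm], self_attention(x[perm] + pe), atol=1e-5)

print(order_blind)   # True
print(still_blind)   # False
```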

In [4]:
## Positional encoding
class PositionalEncoding(nn.Module):
    def __init__(self,d_model,max_seq_length):
        super(PositionalEncoding,self).__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0,d_model,2).float() * -(math.log(10000.0) / d_model))

        pe[:,0::2] = torch.sin(position*div_term)
        pe[:,1::2] = torch.cos(position*div_term)

        self.register_buffer('pe',pe.unsqueeze(0))

    def forward(self,x):
        return x+self.pe[:,:x.size(1)]

## 4. a) Encoder Layer

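The encoder layer is a core building block of the Transformer model used in natural language processing tasks. It consists of two main components: a multi-head self-attention mechanism and a position-wise feed-forward network. In the implementation below, each component is followed by dropout, a residual connection, and layer normalization.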

The Transformer model typically stacks multiple encoder layers on top of each other, allowing the model to capture increasingly complex relationships between the input sequence elements. The output of the final encoder layer is then passed to the decoder layers, which use it to generate the final output sequence.
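
Stacking works because every layer maps a (batch, seq, d_model) tensor to a tensor of the same shape. As a sketch, this can be seen with PyTorch's built-in nn.TransformerEncoderLayer and nn.TransformerEncoder; these are not the classes written in this notebook, and the sizes are toy values.

```python
import torch
import torch.nn as nn

d_model, num_heads, d_ff, num_layers = 32, 4, 64, 3   # toy sizes (assumed)

# PyTorch's built-in encoder layer, used here only to illustrate stacking
layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=num_heads,
    dim_feedforward=d_ff,
    dropout=0.1,
    batch_first=True,          # inputs are (batch, seq, d_model)
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

x = torch.randn(2, 10, d_model)
out = encoder(x)

print(out.shape)               # torch.Size([2, 10, 32])
```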

In [5]:
## Encoder layer
class EncoderLayer(nn.Module):
    def __init__(self,d_model,num_heads,d_ff,dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model,num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self,x,mask):
        attn_output = self.self_attn(x,x,x,mask)
        x = self.norm1(x+self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x+self.dropout(ff_output))
        return x

## 4. b) Decoder Layer

The decoder layer of a transformer model is similar to the encoder layer, but with some differences to enable it to perform the task of language generation. The decoder layer has three sub-layers: masked multi-head attention, multi-head attention, and position-wise feedforward network.

The masked multi-head attention layer is similar to the self-attention layer in the encoder, but with a mask to prevent the decoder from attending to future tokens. This mask ensures that at each step, the decoder attends only to the tokens generated in the previous steps.

The second sub-layer is the multi-head attention layer, where the decoder attends to the encoder's output to obtain context information for the current step. This attention mechanism allows the decoder to focus on the relevant parts of the encoder output and suppress irrelevant information.

The final sub-layer of the decoder layer is the position-wise feed-forward network, which is the same as the one in the encoder layer. This network applies two linear transformations with a ReLU activation in between to each position in the sequence independently.
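
The masking used in the first sub-layer can be sketched on its own. The example assumes a sequence of four tokens and all-zero attention scores, so that the effect of the mask is easy to read off.

```python
import torch

seq_length = 4                                     # toy length (assumed)

# Lower-triangular mask: position i may look at positions 0..i only
causal_mask = torch.tril(torch.ones(seq_length, seq_length)).bool()

scores = torch.zeros(seq_length, seq_length)       # pretend all scores are equal
masked = scores.masked_fill(~causal_mask, -1e9)    # block future positions
probs = torch.softmax(masked, dim=-1)

print(causal_mask.int())
print(probs)   # row i spreads attention evenly over positions 0..i, zero elsewhere
```

Filling the blocked scores with a large negative number makes their softmax weight effectively zero, which is the same trick scaled_dot_product_attention uses above.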

In [6]:
## Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self,d_model,num_heads,d_ff,dropout):
        super(DecoderLayer,self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self,x,enc_output,src_mask,tgt_mask):
        attn_output = self.self_attn(x,x,x,tgt_mask)
        x = self.norm1(x+self.dropout(attn_output))
        attn_output = self.cross_attn(x,enc_output,enc_output,src_mask)
        x = self.norm2(x+self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x+self.dropout(ff_output))
        return x

## 5. Transformer Architecture

In [7]:
##  Transformer Model
class Transformer(nn.Module):
    def __init__(self,src_vocab_size,tgt_vocab_size,d_model,num_heads,num_layers,d_ff,max_seq_length,dropout):
        super(Transformer,self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size,d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size,d_model)
        self.positional_encoding = PositionalEncoding(d_model,max_seq_length)
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model,num_heads,d_ff,dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model,num_heads,d_ff, dropout) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self,src,tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)
        seq_length = tgt.size(1)
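        # Build the look-ahead mask on the same device as the input so this also works on GPU
        nopeak_mask = (1-torch.triu(torch.ones(1,seq_length,seq_length,device=tgt.device),diagonal=1)).bool()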
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self,src,tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output,src_mask)
        
        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output,enc_output,src_mask,tgt_mask)

        output = self.fc(dec_output)
        return output
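
To see what generate_mask produces for the target side, here is a small standalone sketch with two made-up target sequences, assuming (as the code above does) that token id 0 is padding.

```python
import torch

# Two toy target sequences; 0 is assumed to be the padding id
tgt = torch.tensor([[5, 7, 9, 0],
                    [3, 4, 0, 0]])

pad_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)     # (batch, 1, 1, seq)
seq_length = tgt.size(1)
nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()  # (1, seq, seq)

# Broadcasting combines both constraints into (batch, 1, seq, seq)
tgt_mask = pad_mask & nopeak_mask

print(tgt_mask.shape)        # torch.Size([2, 1, 4, 4])
print(tgt_mask[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0]], dtype=torch.int32)
```

A position is visible only if it is not padding and not in the future; the singleton dimension lets the same mask be shared across all heads.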

## Testing the model on sample data

In [8]:
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

In [9]:
transformer = Transformer(src_vocab_size, tgt_vocab_size,d_model,num_heads,num_layers,d_ff,max_seq_length,dropout)

In [10]:
## Generate some random data
src_data = torch.randint(1,src_vocab_size, (64,max_seq_length)) #(batch_size,seq_length)
tgt_data = torch.randint(1,tgt_vocab_size, (64,max_seq_length)) #(batch_size,seq_length)

## Loss function (ignoring padding index 0) and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(),lr=0.0001,betas=(0.9,0.98),eps=1e-9)

In [11]:
transformer.train()

## training Loop
for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data,tgt_data[:,:-1])
    loss = criterion(output.contiguous().view(-1,tgt_vocab_size),tgt_data[:,1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch:{epoch+1}, Loss:{loss.item()}")

Epoch:1, Loss:8.677156448364258
Epoch:2, Loss:8.606093406677246
Epoch:3, Loss:8.556018829345703
Epoch:4, Loss:8.513888359069824
Epoch:5, Loss:8.479804039001465
Epoch:6, Loss:8.448040008544922
Epoch:7, Loss:8.418245315551758
Epoch:8, Loss:8.39105224609375
Epoch:9, Loss:8.367931365966797
Epoch:10, Loss:8.345005989074707
Epoch:11, Loss:8.31788158416748
Epoch:12, Loss:8.290118217468262
Epoch:13, Loss:8.266304016113281
Epoch:14, Loss:8.239670753479004
Epoch:15, Loss:8.215306282043457
Epoch:16, Loss:8.186251640319824
Epoch:17, Loss:8.159092903137207
Epoch:18, Loss:8.132688522338867
Epoch:19, Loss:8.105985641479492
Epoch:20, Loss:8.077210426330566
Epoch:21, Loss:8.049870491027832
Epoch:22, Loss:8.021150588989258
Epoch:23, Loss:7.990198135375977
Epoch:24, Loss:7.961597919464111
Epoch:25, Loss:7.932947158813477
Epoch:26, Loss:7.90463924407959
Epoch:27, Loss:7.871876239776611
Epoch:28, Loss:7.840559959411621
Epoch:29, Loss:7.809232234954834
Epoch:30, Loss:7.779618263244629
Epoch:31, Loss:7.74614