## Transformer Model

![Transformer Model](transformer1.png) 

This notebook implements the Transformer model from the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani et al., using the [PyTorch](https://pytorch.org/) framework.

### 1. Importing the libraries

In [1]:
import math  # needed for math.log in the positional encoding
import time

import torch
import torch.nn as nn
import torch.optim as optim

# Use the GPU if one is available (for this model a GPU is less an option than a must :D)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 2. Defining the hyperparameters

In [2]:
d_model= 256 # AKA embedding size: the size of the token embedding vector
nhead = 4 # the number of heads in the MultiHead Attention layers
num_encoder_layers = 1 # the number of sub-encoder-layers in the encoder
num_decoder_layers = 1 # the number of sub-decoder-layers in the decoder
forward_expansion = 4 # the expansion factor of the hidden layer in the feed-forward network of the Transformer encoder/decoder
learning_rate = 3e-4 # the learning rate of the Adam optimizer
block_size = 128 # the (max) length of the input sequence
vocab_size = 30000 # the size of the vocabulary (the number of tokens known by the tokenizer)
dropout = 0.25 # the dropout rate of the dropout layers

### 3. Implementing blocks of the Transformer

#### 3.1. Positional Encoding
 
In this part we use the sinusoidal positional encoding described in the paper. It is defined by the following formulas, where $pos$ is the position in the sequence and $i$ indexes the pairs of embedding dimensions:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})$$

In [3]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model

    def forward(self):
        position = torch.arange(self.max_sequence_length, device=device, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, self.d_model, 2, device=device, dtype=torch.float32) * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(self.max_sequence_length, self.d_model, device=device, dtype=torch.float32)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe
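
As a small worked example of the formulas, the sketch below evaluates them directly in plain Python for a toy table. The function name `positional_encoding` and the sizes `max_len=3`, `d_model=4` are chosen only for illustration and are not part of the notebook.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table (list of lists) of sinusoidal encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # pair index: columns 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # even columns use sin, odd columns use cos
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 gives alternating 0 and 1 (the sine and cosine of zero), and each later column pair oscillates more slowly than the previous one because its divisor grows with $i$.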

#### 3.2. Embedding Block
 
In this part we create the embedding block, which converts the input token ids into vector representations.

The block consists of two parts:
1. Token Embedding
2. Positional Encoding

The Positional Encoding was implemented as a separate class just above, so the Embedding Block combines that class with PyTorch's nn.Embedding for the Token Embedding.

In [4]:
class EmbeddingLayer(nn.Module):
    def __init__(self, d_model, vocab_size, block_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, block_size)
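    def forward(self, x):
        # x: (batch_size, seq_len) token ids. Keep only the first seq_len rows of the
        # positional table so that inputs shorter than block_size also work
        out = self.token_embedding(x) + self.positional_encoding()[: x.shape[1]]
        return out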
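
The sketch below illustrates how the two parts combine in terms of shapes. It is a simplified stand-in, not the class above: the toy sizes are arbitrary, and `pos_table` is a random tensor used in place of the sinusoidal table, since only the broadcasting behaviour is being shown.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, block_size = 50, 8, 16   # toy sizes, chosen for illustration

token_embedding = nn.Embedding(vocab_size, d_model)
# Stand-in for the sinusoidal table: any (block_size, d_model) tensor broadcasts the same way
pos_table = torch.randn(block_size, d_model)

tokens = torch.randint(0, vocab_size, (2, 10))   # (batch_size=2, seq_len=10)
tok = token_embedding(tokens)                    # (2, 10, 8)
out = tok + pos_table[: tokens.shape[1]]         # (10, 8) broadcasts over the batch
print(out.shape)
```

Every sequence in the batch receives the same positional offsets, and slicing the table to the sequence length lets the addition work for inputs shorter than `block_size`.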

#### 3.3. Multi Head Attention

In this part, we will implement the multi-head attention layer. The multi-head attention layer consists of $h$ attention heads. Each attention head has its own query, key, and value matrices.

In [5]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads

        # We check if d_model is divisible by num_heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        # The size of each head is d_model divided by num_heads. This size is used for both d_k and d_v
        self.d_qkv = d_model // num_heads

        # We use nn.Linear to project the queries, keys, and values. We used the same linear projection for all heads.
        # The reason for this is that it allows us to use a single matrix multiplication to project the queries, keys and values
        # This is more efficient than using separate matrices for each
        self.W_keys = nn.Linear(d_model, d_model)
        self.W_queries = nn.Linear(d_model, d_model)
        self.W_values = nn.Linear(d_model, d_model)

        # We use a single linear projection to project the output of the attention heads
        self.linear_proj = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout) 

    def forward(self, key_src, query_src, value_src, mask=None):
        
        # We get the shapes of the inputs. In cross attention the queries can have a different
        # length than the keys/values, so the two sequence lengths are read separately
        B, T, C = query_src.shape # (batch_size, query_len, d_model)
        S = key_src.shape[1] # key/value sequence length

        # We project the queries, keys and values using their respective weight matrices
        keys = self.W_keys(key_src) # (batch_size, key_len, d_model)
        queries = self.W_queries(query_src) # (batch_size, query_len, d_model)
        values = self.W_values(value_src) # (batch_size, key_len, d_model)

        # We reshape the queries, keys and values so that we can split them into multiple heads
        keys = keys.view(B, S, self.num_heads, self.d_qkv) # (batch_size, key_len, num_heads, d_qkv)
        queries = queries.view(B, T, self.num_heads, self.d_qkv) # (batch_size, query_len, num_heads, d_qkv)
        values = values.view(B, S, self.num_heads, self.d_qkv) # (batch_size, key_len, num_heads, d_qkv)



        # We transpose the queries, keys and values so that the shape of the tensor becomes (batch_size, num_heads, seq_len, d_qkv)

        keys = keys.transpose(1,2) # (batch_size, num_heads, seq_len, d_qkv)
        queries = queries.transpose(1,2) # (batch_size, num_heads, seq_len, d_qkv)
        values = values.transpose(1,2) # (batch_size, num_heads, seq_len, d_qkv)

        # We compute the attention scores.
        atn_scr = queries @ keys.transpose(-2,-1) # (batch_size, num_heads, seq_len, seq_len)
        # We scale the attention scores and apply the mask (if provided)
        scaled_atn_scr = atn_scr / self.d_qkv**0.5
        if mask is not None:
            scaled_atn_scr = scaled_atn_scr.masked_fill(mask==0,float('-inf'))
        
        # We apply the softmax activation to compute the attention weights
        attention_weights = torch.softmax(scaled_atn_scr, dim=-1)
        attention_weights = self.dropout(attention_weights)  # Applying dropout
        # Lastly we multiply the attention weights with the values
        out = attention_weights @ values
        out = out.transpose(1, 2)
        # Reshape the matrix to (batch_size, seq_len, d_model) in order to be able to feed it to the next layer
        out = out.reshape(B, T, C)
        # Apply one last linear projection
        out = self.linear_proj(out)
        return out
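
To make the head splitting and the scaled dot-product step easier to follow, here is a minimal sketch on a toy tensor. It leaves out the learned projections, the mask and dropout, and uses the same tensor as queries, keys and values; the sizes are arbitrary illustration values.

```python
import torch

torch.manual_seed(0)
B, T, d_model, num_heads = 2, 5, 8, 2
d_head = d_model // num_heads

x = torch.randn(B, T, d_model)
# Split the last dimension into heads: (B, T, d_model) -> (B, num_heads, T, d_head)
heads = x.view(B, T, num_heads, d_head).transpose(1, 2)

# Scaled dot-product attention, computed independently for every head
scores = heads @ heads.transpose(-2, -1) / d_head ** 0.5   # (B, num_heads, T, T)
weights = torch.softmax(scores, dim=-1)                    # each row sums to 1
out = weights @ heads                                      # (B, num_heads, T, d_head)

# Merge the heads back: (B, num_heads, T, d_head) -> (B, T, d_model)
merged = out.transpose(1, 2).reshape(B, T, d_model)
print(weights.shape, merged.shape)
```

The split and the merge are pure reshapes, so undoing the split on `heads` recovers `x` exactly; only the attention step in between mixes information across positions.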

#### 3.4. Feed Forward Neural Network

In this part we implement the FFN as described in the paper: two linear layers with a ReLU activation between them. The input and output sizes of the network are the same, but the hidden layer is widened by a factor we call "Forward Expansion".

In [6]:
class FeedForwardNet(nn.Module):
    def __init__(self, d_model, forward_expansion, dropout=0.1):  # Added dropout argument
        super(FeedForwardNet, self).__init__()
        # The output size of the first linear layer is forward_expansion time d_model (d_model*forward_expansion)
        self.fc1 = nn.Linear(d_model, d_model * forward_expansion)
        self.relu = nn.ReLU()
        # The input size of the second linear layer is d_model * forward_expansion and the output is just d_model 
        # in order to remain the same size as the input to the feed forward net and be able to use residual connection 
        self.fc2 = nn.Linear(d_model * forward_expansion, d_model)

        self.dropout = nn.Dropout(dropout)  # Added dropout layer

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.dropout(out)  # Applying dropout
        out = self.fc2(out)
        return out
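
As a rough check of the sizes involved, the sketch below builds the same two-layer structure with `nn.Sequential` (dropout omitted for brevity) using the notebook's values `d_model = 256` and `forward_expansion = 4`, then counts the parameters. The variable names are illustrative only.

```python
import torch
import torch.nn as nn

d_model, forward_expansion = 256, 4
ffn = nn.Sequential(
    nn.Linear(d_model, d_model * forward_expansion),   # 256 -> 1024
    nn.ReLU(),
    nn.Linear(d_model * forward_expansion, d_model),   # 1024 -> 256
)

x = torch.randn(2, 10, d_model)   # (batch_size, seq_len, d_model)
y = ffn(x)                        # same shape as the input

# weights plus biases: (256*1024 + 1024) + (1024*256 + 256)
n_params = sum(p.numel() for p in ffn.parameters())
print(y.shape, n_params)
```

Because `nn.Linear` acts on the last dimension only, the network is applied to every position independently, which is why the output keeps the `(batch_size, seq_len, d_model)` shape needed for the residual connection.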

#### 3.5. Encoder Stack

In this part we are going to build the encoder stack. The encoder stack is composed of 1 MHA and 1 FFN. The MHA is followed by a residual connection and a layer normalization. The FFN is also followed by a residual connection and a layer normalization.

In the following section we will stack this block to build an Encoder with multiple layers.

In [7]:
class EncoderStack(nn.Module):
    def __init__(self, d_model, num_heads, forward_expansion, dropout=0.1):  # Added dropout argument
        super().__init__()
        self.MHA = MultiHeadAttention(d_model=d_model, num_heads=num_heads, dropout=dropout)
        self.FFN = FeedForwardNet(d_model=d_model, forward_expansion=forward_expansion, dropout=dropout)
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)  # Added dropout layer
        self.dropout2 = nn.Dropout(dropout)  # Added dropout layer

    def forward(self, x):
        out = x + self.dropout1(self.MHA(x, x, x))  # Applying dropout
        norm_out = self.layer_norm1(out)
        out = norm_out + self.dropout2(self.FFN(norm_out))  # Applying dropout
        norm_out = self.layer_norm2(out)
        return norm_out
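
The "residual connection, then layer normalization" pattern can be sketched in isolation as follows. Here `sublayer` is a hypothetical stand-in (a plain `nn.Linear`) for the MHA or the FFN, and the sizes are toy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # hypothetical stand-in for the MHA or the FFN

x = torch.randn(2, 5, d_model)
out = layer_norm(x + sublayer(x))        # residual add first, then normalize

# With the default affine parameters (weight=1, bias=0) every token vector ends up
# with roughly zero mean and unit variance across its d_model features
print(out.mean(dim=-1).abs().max().item())
print(out.var(dim=-1, unbiased=False).mean().item())
```

Normalization is applied per token over the feature dimension, so it does not depend on the batch size or on the other positions in the sequence.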

#### 3.6. Encoder

In this part we build the Encoder block. It contains one embedding layer followed by N encoder layers (N = num_layers).

In [8]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, block_size, d_model, num_heads, forward_expansion, num_layers):
        super().__init__()
        self.block_size = block_size
        self.d_model = d_model
        self.embeding_layer = EmbeddingLayer(d_model, vocab_size, block_size)
        self.layers = nn.ModuleList([EncoderStack(d_model, num_heads, forward_expansion) for _ in range(num_layers)])
    
    def forward(self, x):
        x = self.embeding_layer(x)
        for layer in self.layers:
            x = layer(x)
        return x

#### 3.7. Decoder Stack

In this part we build the decoder stack. The decoder stack is composed of 2 MHA (one for masked self-attention and one for cross attention) and 1 FFN. Each sublayer is followed by a residual connection and a layer normalization.

In the following section we will stack this block to build a Decoder with multiple layers.

In [9]:
class DecoderStack(nn.Module):
    def __init__(self, d_model, num_heads, forward_expansion, dropout=0.1):  # Added dropout argument
        super(DecoderStack, self).__init__()
        self.Masked_MHA = MultiHeadAttention(d_model=d_model, num_heads=num_heads, dropout=dropout)
        self.Crossed_MHA = MultiHeadAttention(d_model=d_model, num_heads=num_heads, dropout=dropout)
        self.FFN = FeedForwardNet(d_model=d_model, forward_expansion=forward_expansion, dropout=dropout)
        self.LayerNorm1 = nn.LayerNorm(d_model)
        self.LayerNorm2 = nn.LayerNorm(d_model)
        self.LayerNorm3 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)  # Added dropout layer
        self.dropout2 = nn.Dropout(dropout)  # Added dropout layer
        self.dropout3 = nn.Dropout(dropout)  # Added dropout layer

    def forward(self, x, encoder_out, trg_mask):
        masked_att_out = self.dropout1(self.Masked_MHA(x, x, x, trg_mask))  # Applying dropout
        masked_att_out = self.LayerNorm1(masked_att_out + x)
        crossed_att_out = self.dropout2(self.Crossed_MHA(encoder_out, masked_att_out, encoder_out))  # Applying dropout
        crossed_att_out = self.LayerNorm2(crossed_att_out + masked_att_out)
        ffn_out = self.dropout3(self.FFN(crossed_att_out))  # Applying dropout
        ffn_out = self.LayerNorm3(ffn_out + crossed_att_out)
        return ffn_out
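
The effect of the target mask in the masked attention can be seen on a small example. The sketch below builds a lower-triangular mask with `torch.tril` and applies it to random scores in the same way as the attention layer does (`masked_fill` with `-inf`, then softmax); the sequence length of 4 is an arbitrary toy value.

```python
import torch

torch.manual_seed(0)
T = 4
mask = torch.tril(torch.ones(T, T))   # 1 = may attend, 0 = future position
scores = torch.randn(T, T)            # stand-in for scaled attention scores

# Future positions get -inf, so softmax assigns them a weight of exactly zero
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(mask)
print(weights)
```

Row `t` of `weights` is non-zero only in columns `0..t`, so each position can attend to itself and to earlier positions but never to later ones.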

#### 3.8. Decoder

In this part we build the Decoder block. It contains one embedding layer followed by N decoder layers (N = num_layers).

In [10]:
class Decoder(nn.Module):
    def __init__(self,vocab_size, block_size, d_model, num_heads, forward_expansion, num_layers):
        super().__init__()
        self.block_size = block_size
        self.d_model = d_model
        self.embeding_layer = EmbeddingLayer(d_model, vocab_size, block_size)
        self.layers = nn.ModuleList([DecoderStack(d_model, num_heads, forward_expansion) for _ in range(num_layers)])

    def forward(self, x, encoder_output, trg_mask):
        x = self.embeding_layer(x)
        for layer in self.layers:
            x = layer(x, encoder_output, trg_mask)
        return x

#### 3.9. Transformer

Finally we build the Transformer model, which is composed of the encoder and the decoder. The encoder consists of an embedding layer and N encoder layers; the decoder consists of an embedding layer and N decoder layers. To make predictions, a final linear layer maps the decoder output to the size of the target vocabulary.

In [12]:
class Transformer(nn.Module):
    def __init__(self, vocab_size, block_size, d_model, nhead, num_encoder_layers, num_decoder_layers,
                 forward_expansion, learning_rate, dropout=0.1):  # Added dropout argument
        super(Transformer, self).__init__()
        self.encoder = Encoder(vocab_size, block_size, d_model, nhead, forward_expansion, num_encoder_layers)
        self.decoder = Decoder(vocab_size, block_size, d_model, nhead, forward_expansion, num_decoder_layers)
        
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.block_size = block_size
        # We defined the mask for the decoder in the init function
        self.mask = torch.tril(torch.ones((block_size, block_size))).to(device)

        # We defined the linear head in order to get the output of the decoder
        self.linear_head = nn.Linear(d_model, vocab_size)

        self.batch_loss = []
        self.train_loss = []
        self.test_loss = []

        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.criterion = nn.CrossEntropyLoss()

        self.dropout = nn.Dropout(dropout)  # Added dropout layer

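    def forward(self, src, trg):
        B, T = trg.shape
        enc_src = self.dropout(self.encoder(src))  # Applying dropout
        # Crop the causal mask to the target length so targets shorter than block_size also work
        out = self.dropout(self.decoder(trg, enc_src, self.mask[:T, :T]))  # Applying dropout
        # Average the decoder output over the sequence dimension, then project to vocabulary logits
        out = self.linear_head(torch.mean(out, dim=1))  # (batch_size, vocab_size)
        out = out.reshape(B, self.vocab_size)
        return out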
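
The last step of the forward pass (mean over the sequence dimension, linear head, cross-entropy loss) can be tried on its own. In the sketch below `decoder_out` is a random tensor standing in for the real decoder output, and the sizes and targets are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, d_model, vocab_size = 3, 6, 8, 20

decoder_out = torch.randn(B, T, d_model)   # stand-in for the decoder output
linear_head = nn.Linear(d_model, vocab_size)
criterion = nn.CrossEntropyLoss()

pooled = decoder_out.mean(dim=1)           # (B, d_model): average over the sequence
logits = linear_head(pooled)               # (B, vocab_size): one prediction per sequence
targets = torch.randint(0, vocab_size, (B,))
loss = criterion(logits, targets)          # scalar loss over the batch

print(logits.shape, loss.item())
```

With this head the model emits one distribution over the vocabulary per input sequence, so the targets passed to `nn.CrossEntropyLoss` have shape `(batch_size,)`.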

## The End

### If you have any questions, please contact me.

Mail: i_konak@hotmail.com

Linkedin: https://www.linkedin.com/in/ismail-konak/

GitHub: https://github.com/IsmailKonak
