## GPT (Generative Pre-Trained Transformer) 

<img src="images/gpt_stable_dif.png" width="20%" height="20%">

GPT is a transformer-based language model that was first introduced in the 2018 paper [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) by Alec Radford et al.

In this notebook we are going to implement the model in PyTorch according to the paper.

Note: Beyond its architecture, the model is especially notable for how it is trained. The training method deserves a separate look, so this notebook focuses only on the architecture of the model. I plan to examine the training method in the future.

## Importing the libraries

In [1]:
import torch 
import torch.nn as nn
import torch.nn.functional as F

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")

## Defining Hyperparameters

In [2]:
# All of the hyperparameters are set according to the values in the paper
embed_dim = 768
head_nums = 12
layer_num = 12
batch_size = 64
block_size = 512
forward_expansion = 4
vocab_size = 40000

attn_dropout = 0.1
dropout_rate = 0.1
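
As a quick sanity check on these values, the sketch below derives the per-head dimension, the feed-forward inner dimension, and a rough parameter count for one transformer block. The layer layout it assumes (a bias-free fused QKV projection, an output projection, two feed-forward linear layers, two LayerNorms) is the one used later in this notebook, and the variable names are illustrative.

```python
# Hyperparameter values copied from the cell above.
embed_dim = 768
head_nums = 12
forward_expansion = 4

# Each attention head works on an equal slice of the embedding.
head_dim = embed_dim // head_nums                 # 768 / 12 = 64
# Inner dimension of the feed-forward network.
ffn_inner_dim = embed_dim * forward_expansion     # 768 * 4 = 3072

# Parameter count of one block, assuming the layout used in this notebook.
qkv_params = embed_dim * (3 * embed_dim)                    # fused QKV projection, no bias
proj_params = embed_dim * embed_dim + embed_dim             # output projection with bias
ffn_params = (embed_dim * ffn_inner_dim + ffn_inner_dim) \
           + (ffn_inner_dim * embed_dim + embed_dim)        # two linear layers with bias
layer_norm_params = 2 * (2 * embed_dim)                     # two LayerNorms (weight + bias each)
block_params = qkv_params + proj_params + ffn_params + layer_norm_params

print(f"head_dim={head_dim}, ffn_inner_dim={ffn_inner_dim}, block_params={block_params:,}")
```

Most of a block's parameters sit in the feed-forward network, which is why `forward_expansion` has a large effect on model size.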

## Implementing GPT Model blocks

### Multi Head Attention

In this block we are going to implement the multi head attention block. This block is the core of the transformer model.

Note that this block does not implement scaled dot-product attention from scratch. Instead it calls PyTorch's `F.scaled_dot_product_attention` (available in PyTorch >= 2.0), which can dispatch to a fused [FlashAttention](https://arxiv.org/abs/2205.14135) kernel on supported hardware and is more efficient than a naive implementation.

In [3]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, attn_drop_rate, proj_drop_rate):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"

        self.qkv = nn.Linear(embed_dim, embed_dim * 3, bias=False)
        self.attn_drop_rate = attn_drop_rate
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(proj_drop_rate)

    def forward(self, x):
        B,T,C = x.shape # x: (batch_size, block_size, embedding_dim)
        qkv = self.qkv(x) # qkv: (batch_size, block_size, embedding_dim*3)
        qkv = qkv.reshape(B,T,3,self.num_heads,self.head_dim).permute(2,0,3,1,4) # qkv: (3, batch_size, num_heads, block_size, head_dim)
        query,key,value = qkv[0],qkv[1],qkv[2] # query, key, value: (batch_size, num_heads, block_size, head_dim)
        dropout_p = self.attn_drop_rate if self.training else 0.0 # the functional API applies dropout even in eval mode, so disable it explicitly
        attn = F.scaled_dot_product_attention(query,key,value, dropout_p=dropout_p, is_causal=True) # attn: (batch_size, num_heads, block_size, head_dim)
        attn = attn.transpose(1, 2).contiguous().view(B, T, C) # attn: (batch_size, block_size, embedding_dim)
        out = self.proj(attn) # out: (batch_size, block_size, embedding_dim)
        out = self.proj_drop(out)
        return out
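
To make clear what the fused call computes, here is a minimal sketch of causal scaled dot-product attention written out by hand and compared against `F.scaled_dot_product_attention`. The helper name `naive_causal_attention` and the small tensor sizes are illustrative assumptions, not part of the model above.

```python
import math
import torch
import torch.nn.functional as F

def naive_causal_attention(query, key, value):
    # query, key, value: (batch_size, num_heads, seq_len, head_dim)
    seq_len, head_dim = query.shape[-2], query.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(head_dim).
    scores = query @ key.transpose(-2, -1) / math.sqrt(head_dim)  # (B, H, T, T)
    # Lower-triangular mask: position i may only attend to positions <= i.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=query.device))
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ value                   # (B, H, T, head_dim)

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))

reference = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
naive = naive_causal_attention(q, k, v)
max_abs_diff = (reference - naive).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The two results should agree up to floating-point error. The naive version materializes the full (T, T) score matrix, which is the memory cost that fused kernels avoid.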

### Feed Forward Network

This block is a simple position-wise feed-forward network made of two linear layers with an activation between them. For the feed-forward network, the paper uses 3072-dimensional inner states, which we express through the term "forward expansion" (embedding_dim * forward_expansion = 768 * 4 = 3072). Additionally, we use the GELU activation function instead of ReLU, following the paper.

In [4]:
class FFN(nn.Module):
    def __init__(self, embedding_dim, forward_expansion, drop_rate):
        super().__init__()
        self.linear_1 = nn.Linear(embedding_dim, embedding_dim*forward_expansion)
        self.gelu = nn.GELU()
        self.linear_2 = nn.Linear(embedding_dim*forward_expansion, embedding_dim)
        self.dropout = nn.Dropout(drop_rate)
    
    def forward(self, x):
        x = self.linear_1(x) # (batch_size, seq_len, embedding_dim*forward_expansion)
        x = self.gelu(x) # (batch_size, seq_len, embedding_dim*forward_expansion)
        x = self.linear_2(x) # (batch_size, seq_len, embedding_dim)
        out = self.dropout(x)
        return out
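
For reference, the exact GELU is x · Φ(x), where Φ is the standard normal CDF. The sketch below checks a hand-written version against `nn.GELU()` with its default settings and puts ReLU next to it for comparison. The helper `manual_gelu` is an illustrative name.

```python
import math
import torch
import torch.nn as nn

def manual_gelu(x):
    # Exact GELU: x * Phi(x), with the normal CDF Phi expressed through erf.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3.0, 3.0, steps=13)
builtin = nn.GELU()(x)   # default uses the exact (erf-based) formulation
manual = manual_gelu(x)
relu = torch.relu(x)

gelu_matches = torch.allclose(builtin, manual, atol=1e-6)
print("manual GELU matches nn.GELU:", gelu_matches)
print("GELU(-3) =", builtin[0].item(), "| ReLU(-3) =", relu[0].item())
```

Unlike ReLU, GELU is smooth and lets small negative values through instead of clamping them to zero.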


### GPT Block

In [5]:
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, forward_expansion, attn_drop_rate, drop_rate):
        super().__init__()
        self.MHA = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads,
                                       attn_drop_rate=attn_drop_rate, proj_drop_rate=drop_rate)
        
        self.FFN = FFN(embedding_dim=embed_dim, forward_expansion=forward_expansion, drop_rate=drop_rate)
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
    
    def forward(self, x):
        # x: (batch_size, seq_len, embed_dim)
        mha_out = self.MHA(x) # (batch_size, seq_len, embed_dim)
        x = self.layer_norm1(mha_out + x)
        ffn_out = self.FFN(x) # (batch_size, seq_len, embed_dim)
        out = self.layer_norm2(ffn_out + x)
        return out


### GPT 

This is the final block, where we put everything together to create the GPT model. Apart from the differences seen in the previous blocks, GPT also differs from the original Transformer in its positional embedding: the original Transformer uses a fixed sinusoidal positional encoding, whereas GPT uses a learned positional embedding, as described in the paper.

In [10]:
class GPT(nn.Module):
    def __init__(self, vocab_size ,block_size, embed_dim, num_heads, num_layers, forward_expansion, attn_drop_rate=0., drop_rate=0.):
        super().__init__()

        self.token_embed = nn.Embedding(vocab_size, embed_dim) #  Defining the token embedding layer
        self.pos_embed = nn.Embedding(block_size, embed_dim) # Defining the positional embedding layer

        self.blocks = nn.ModuleList([GPTBlock(embed_dim, num_heads, forward_expansion, attn_drop_rate, drop_rate) for _ in range(num_layers)]) # Defining the GPT blocks (12 layers in the original paper)

        self.linear_head = nn.Linear(embed_dim, vocab_size) # Defining the linear layer that maps hidden states to vocabulary logits

        self.dropout = nn.Dropout(drop_rate)

        self.apply(self._init_weights) # Weights initialization

    # We initialize the weights of the model with a normal distribution, with mean 0 and standard deviation 0.02 as in the original paper.
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, x):
        B,T = x.shape
        token_embeddings = self.token_embed(x) # shape (B,T,C)
        pos = torch.arange(0, T, dtype=torch.long, device=x.device) # shape (T), created on the same device as the input
        pos_embeddings = self.pos_embed(pos) # shape (T,C)
        x = self.dropout(token_embeddings+pos_embeddings) # shape (B,T,C)
        for block in self.blocks:
            x = block(x) # shape (B,T,C)
        out = self.linear_head(x) # shape (B,T,V) where V is the vocab size
        return out
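
The learned positional embedding step can be sketched in isolation, as below. The sizes are deliberately tiny and purely illustrative. The point is that the (T, C) positional embeddings broadcast across the batch dimension when added to the (B, T, C) token embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, embed_dim = 100, 16, 8   # small illustrative sizes

token_embed = nn.Embedding(vocab_size, embed_dim)  # one vector per token id
pos_embed = nn.Embedding(block_size, embed_dim)    # one learned vector per position

tokens = torch.randint(0, vocab_size, (2, 5))      # (B, T) batch of token ids
B, T = tokens.shape
pos = torch.arange(T, device=tokens.device)        # (T,) positions 0..T-1

tok = token_embed(tokens)   # (B, T, C)
pe = pos_embed(pos)         # (T, C)
x = tok + pe                # (B, T, C): pe is broadcast over the batch

print(tok.shape, pe.shape, x.shape)
```

Because `pos_embed` holds only `block_size` rows, a sequence longer than `block_size` raises an index error, so inputs have to be truncated to at most `block_size` tokens.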

### The End 

If you have any questions, please contact me:

- Email: [i_konak@hotmail.com](mailto:i_konak@hotmail.com)
- Linkedin: [Ismail Konak](https://www.linkedin.com/in/ismail-konak/)