<a href="https://colab.research.google.com/github/AnilKumarSingh9856/GPT2_CONFIG_355M_MODEL/blob/main/GPT2_CONFIG_355M_MODEL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## IMPLEMENTING A GPT2_CONFIG_355M MODEL FROM SCRATCH TO GENERATE TEXT

In [None]:
GPT2_CONFIG_355M = {
  "vocab_size": 50257,    # Vocabulary size
  "context_length": 1024, # Context length
  "emb_dim": 1024,         # Embedding dimension
  "n_heads": 16,          # Number of attention heads
  "n_layers": 24,         # Number of layers
  "drop_rate": 0.1,       # Dropout rate
  "qkv_bias": False       # Query-Key-Value bias
}
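
As a quick sanity check on this configuration, the sketch below derives the per-head dimension from `emb_dim` and `n_heads`. The MultiHeadAttention class later in this notebook asserts that the output dimension is divisible by the number of heads, so it helps to confirm that up front. The `cfg` dictionary is a hypothetical stand-in that copies only the two values needed here.

```python
# Minimal sketch: per-head dimension implied by the config values above.
cfg = {"emb_dim": 1024, "n_heads": 16}  # copied from GPT2_CONFIG_355M

# The embedding dimension must split evenly across the attention heads.
assert cfg["emb_dim"] % cfg["n_heads"] == 0

head_dim = cfg["emb_dim"] // cfg["n_heads"]
print(head_dim)  # 64
```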

### Reading in the Harry Potter text as a text sample into Python

In [None]:
import os
import urllib.request

file_path = "The-Harry-Potter.txt"
url = "https://raw.githubusercontent.com/AnilKumarSingh9856/Complete_Harry_Potter_txt_file/refs/heads/main/Harry_Potter_complete_dataset.txt"

if not os.path.exists(file_path):
  with urllib.request.urlopen(url) as response:
    text_data = response.read().decode('utf-8')
  with open(file_path, "w", encoding="utf-8") as file:
    file.write(text_data)
else:
  with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

The print commands below output the total number of characters, followed by the first 99 characters of this file for illustration purposes.

In [None]:
with open(file_path, "r", encoding="utf-8") as f:
  raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 6285613
M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly nor


### Step 2: Creating Token IDs with OpenAI's tiktoken library

In [None]:
! pip3 install tiktoken



In [None]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.11.0


Once installed, we can instantiate the BPE tokenizer from tiktoken as follows:

In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
text = (
  "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
  "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [None]:
string = tokenizer.decode(integers)

In [None]:
print(string)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


In [None]:
import tiktoken

# Initialize the encodings for GPT-2, GPT-3, and GPT-4
encodings = {
  "gpt2": tiktoken.get_encoding("gpt2"),
  "gpt3": tiktoken.get_encoding("p50k_base"),  # Commonly associated with GPT-3 models
  "gpt4": tiktoken.get_encoding("cl100k_base")  # Used for GPT-3.5-turbo and GPT-4
}

# Get the vocabulary size for each encoding
vocab_sizes = {model: encoding.n_vocab for model, encoding in encodings.items()}

# Print the vocabulary sizes
for model, size in vocab_sizes.items():
  print(f"The vocabulary size for {model.upper()} is: {size}")

The vocabulary size for GPT2 is: 50257
The vocabulary size for GPT3 is: 50281
The vocabulary size for GPT4 is: 100277


### IMPLEMENTING A DATA LOADER

For an efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.

* Step1: Tokenize the entire text
* Step2: Use a sliding window to chunk the book into overlapping sequences of max_length
* Step3: Return the total number of rows in the dataset
* Step4: Return a single row from the dataset

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
  def __init__(self, text, tokenizer, max_length, stride):
    self.input_ids = []
    self.target_ids = []

    # Tokenize the entire text
    token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

    # Use a sliding window to chunk the book into overlapping sequences of max_length
    for i in range(0, len(token_ids)-max_length, stride):
      input_chunk = token_ids[i:i+max_length]
      # The targets are the inputs shifted one token to the right
      target_chunk = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]
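
The sliding-window step in the class above can be illustrated on a toy sequence. The following is a minimal sketch that uses plain Python lists instead of tensors; `sliding_windows` and `toy_ids` are hypothetical names introduced only for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs
pairs = sliding_windows(toy_ids, max_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Because the stride (2) is smaller than max_length (4), consecutive windows overlap by two tokens. Setting the stride equal to max_length would produce non-overlapping windows.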

The GPTDataset class above follows the PyTorch Dataset interface.

It defines how individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.

I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader; this will bring additional intuition and clarity.

The following code uses the GPTDataset class to load the inputs in batches via a PyTorch DataLoader.

* Step1: Initialize the tokenizer
* Step2: Create the dataset
* Step3: Set drop_last=True to drop the last batch if it is shorter than the specified batch_size, which prevents loss spikes during training
* Step4: Set num_workers, the number of CPU processes to use for preprocessing

In [None]:
def create_dataloader_v1(text, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=1):
  # Initialize the tokenizer
  tokenizer = tiktoken.get_encoding('gpt2')

  # Create dataset
  dataset = GPTDataset(text, tokenizer, max_length, stride)

  # Create dataloader
  dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=shuffle,
    drop_last=drop_last,
    num_workers=num_workers
  )

  return dataloader
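
The effect of drop_last can be checked on a small example. The sketch below assumes a toy dataset of 10 samples wrapped in a TensorDataset (not the text dataset from this notebook) and compares the number of batches with and without drop_last.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy = TensorDataset(torch.arange(10))  # 10 samples, each a single number

keep_last = DataLoader(toy, batch_size=4, drop_last=False)
drop_last = DataLoader(toy, batch_size=4, drop_last=True)

print(len(keep_last))  # 3 batches: sizes 4, 4, 2
print(len(drop_last))  # 2 batches: sizes 4, 4

# Collect the actual batch sizes when the short final batch is kept.
sizes = [batch[0].shape[0] for batch in keep_last]
print(sizes)  # [4, 4, 2]
```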

Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function:

In [None]:
dataloader = create_dataloader_v1(
  raw_text, batch_size = 8, max_length=4, stride=4, shuffle=False
)

data_iter = iter(dataloader)
intputs_pair, target_pair = next(data_iter)
print(f'Input pair \n{intputs_pair}')
print(f'Output pair \n{target_pair}')

Input pair 
tensor([[   44,   374,    13,   290],
        [ 9074,    13,   360,  1834],
        [ 1636,    11,   286,  1271],
        [ 1440,    11,  4389, 16809],
        [ 9974,    11,   547,  6613],
        [  284,   910,   326,   484],
        [  547,  7138,  3487,    11],
        [ 5875,   345,   845,   881]])
Output pair 
tensor([[  374,    13,   290,  9074],
        [   13,   360,  1834,  1636],
        [   11,   286,  1271,  1440],
        [   11,  4389, 16809,  9974],
        [   11,   547,  6613,   284],
        [  910,   326,   484,   547],
        [ 7138,  3487,    11,  5875],
        [  345,   845,   881,    13]])



The batch returned by next(data_iter) contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

Since max_length is set to 4, each row of the two tensors contains 4 token IDs, and the batch size of 8 gives 8 such rows.

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.

### Creating Token Embeddings and Positional Embeddings

Previously, we focused on very small embedding sizes in this chapter for illustration purposes.

We now consider more realistic and useful embedding sizes and encode the input tokens into a 1024-dimensional vector representation.

This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation.

Furthermore, we assume that the token IDs were created by the BPE tokenizer that we instantiated earlier, which has a vocabulary size of 50,257:

In [None]:
vocab_sizes = 50257
output_dim = 1024

token_embedding_layer = torch.nn.Embedding(vocab_sizes, output_dim)


Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 1024-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 1024 tensor.
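
To make the lookup behavior concrete, here is a minimal sketch with deliberately tiny, assumed sizes (a vocabulary of 6 and 3 embedding dimensions, rather than 50,257 and 1024). It shows that an embedding layer returns the weight-matrix row indexed by each token ID; `toy_embedding` is a hypothetical name used only here.

```python
import torch

torch.manual_seed(0)
toy_embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)

token_ids = torch.tensor([[2, 5], [0, 2]])  # batch of 2 samples, 2 tokens each
vectors = toy_embedding(token_ids)

print(vectors.shape)  # torch.Size([2, 2, 3])

# Each output vector is simply the weight-matrix row indexed by the token ID.
print(torch.equal(vectors[0, 0], toy_embedding.weight[2]))  # True
```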

Let's first instantiate the data loader, which samples the data with a sliding window:

In [None]:
max_length = 4
dataloader = create_dataloader_v1(
  raw_text, batch_size = 8, max_length=max_length, stride=max_length, shuffle=False
)

data_iter = iter(dataloader)
intputs_pair, target_pair = next(data_iter)
print(f'Input pair shape \n{intputs_pair.shape}')
print(f'Output pair shape \n{target_pair.shape}')

Input pair shape 
torch.Size([8, 4])
Output pair shape 
torch.Size([8, 4])


As we can see, the token ID tensor is 8 x 4 dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.

Let's now use the embedding layer to embed these token IDs into 1024-dimensional vectors:

In [None]:
token_embeddings = token_embedding_layer(intputs_pair)
print(token_embeddings.shape)

torch.Size([8, 4, 1024])


For a GPT model's absolute position embedding approach, we just need to create another embedding layer with the same embedding dimension as token_embedding_layer to hold the positional embeddings:

In [None]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [None]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 1024])



As we can see, the positional embedding tensor consists of four 1024-dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will broadcast the 4x1024-dimensional pos_embeddings tensor across the batch dimension and add it to the 4x1024-dimensional token embedding tensor of each of the 8 samples in the batch:

In [None]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings[0]) # printing the first sample in the batch
print(input_embeddings.shape)

tensor([[ 1.7253, -0.7782, -0.9055,  ...,  0.3064,  0.2535,  0.8952],
        [-0.2899,  1.2833, -0.1203,  ...,  1.1181, -0.7650,  0.0185],
        [ 0.8057, -0.6638,  1.1007,  ...,  0.5396,  0.2012, -1.9123],
        [-0.5348,  0.1790, -0.1225,  ...,  0.4496,  0.9809, -0.4332]],
       grad_fn=<SelectBackward0>)
torch.Size([8, 4, 1024])
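
The addition above relies on PyTorch broadcasting. The following sketch uses small, assumed shapes (2 samples, 3 tokens, 4 dimensions) to check that adding a (tokens, dim) tensor to a (batch, tokens, dim) tensor applies the same positional values to every sample; `tok` and `pos` are hypothetical stand-ins for the real embeddings.

```python
import torch

tok = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)  # (batch, tokens, dim)
pos = torch.ones(3, 4) * 100.0                                 # (tokens, dim)

combined = tok + pos  # pos is broadcast across the batch dimension

print(combined.shape)  # torch.Size([2, 3, 4])

# Every sample in the batch receives the same positional tensor.
print(torch.equal(combined[0], tok[0] + pos))  # True
print(torch.equal(combined[1], tok[1] + pos))  # True
```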


### **IMPLEMENTING MULTI-HEAD ATTENTION WITH WEIGHT SPLITS**

Instead of maintaining two separate classes, MultiHeadAttentionWrapper and CausalAttention, we can combine both of these concepts into a single MultiHeadAttention class.

Also, in addition to just merging the MultiHeadAttentionWrapper with the CausalAttention code, we will make some other modifications to implement multi-head attention more efficiently.

In the MultiHeadAttentionWrapper, multiple heads are implemented by creating a list of CausalAttention objects (self.heads), each representing a separate attention head.

The CausalAttention class independently performs the attention mechanism, and the results from each head are concatenated.

In contrast, the following MultiHeadAttention class integrates the multi-head functionality within a single class.

It splits the input into multiple heads by reshaping the projected query, key, and value tensors and then combines the results from these heads after computing attention.
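
The reshaping idea can be sketched on a tiny tensor. The example below assumes d_out = 4 and num_heads = 2 (so head_dim = 2) and uses a hypothetical `projected` tensor in place of real query, key, or value projections. It checks that view followed by transpose gives each head a contiguous slice of the last dimension.

```python
import torch

b, num_tokens, d_out, num_heads = 1, 3, 4, 2
head_dim = d_out // num_heads

# Stand-in for a projected query/key/value tensor of shape (b, num_tokens, d_out)
projected = torch.arange(b * num_tokens * d_out, dtype=torch.float32)
projected = projected.reshape(b, num_tokens, d_out)

# (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
#                        -> (b, num_heads, num_tokens, head_dim)
heads = projected.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([1, 2, 3, 2])

# Head 0 sees the first head_dim columns, head 1 the next head_dim columns.
print(torch.equal(heads[0, 0], projected[0, :, :head_dim]))  # True
print(torch.equal(heads[0, 1], projected[0, :, head_dim:]))  # True
```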

Let's take a look at the MultiHeadAttention class before we discuss it further:

In [None]:
import torch.nn as nn

class MultiHeadAttention(nn.Module):
  def __init__(self, d_in, d_out, context_length, dropout, num_heads,
               qkv_bias=False):
    super().__init__()
    assert (d_out % num_heads == 0), \
      "d_out must be divisible by num_heads"

    self.d_out = d_out
    self.num_heads = num_heads
    self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

    self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
    self.dropout = nn.Dropout(dropout)
    self.register_buffer(
        "mask",
        torch.triu(torch.ones(context_length, context_length),
                    diagonal=1)
    )

  def forward(self, x):
    b, num_tokens, d_in = x.shape

    keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
    queries = self.W_query(x)
    values = self.W_value(x)

    # We implicitly split the matrix by adding a `num_heads` dimension
    # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
    keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
    values = values.view(b, num_tokens, self.num_heads, self.head_dim)
    queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

    # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
    keys = keys.transpose(1, 2)
    queries = queries.transpose(1, 2)
    values = values.transpose(1, 2)

    # Compute scaled dot-product attention (aka self-attention) with a causal mask
    attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

    # Original mask truncated to the number of tokens and converted to boolean
    mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

    # Use the mask to fill attention scores
    attn_scores.masked_fill_(mask_bool, -torch.inf)

    attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
    attn_weights = self.dropout(attn_weights)

    # Shape: (b, num_tokens, num_heads, head_dim)
    context_vec = (attn_weights @ values).transpose(1, 2)

    # Combine heads, where self.d_out = self.num_heads * self.head_dim
    context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
    context_vec = self.out_proj(context_vec) # optional projection

    return context_vec
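
The causal masking step inside forward can be examined in isolation. The sketch below assumes three tokens and dummy attention scores that are all zero, so the effect of the mask is easy to read off: after the softmax, each token attends only to itself and to earlier tokens.

```python
import torch

num_tokens = 3
scores = torch.zeros(num_tokens, num_tokens)  # dummy scores, all equal

# Ones above the diagonal mark the "future" positions to hide.
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

masked = scores.masked_fill(mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
# Row 0 puts all weight on token 0; row 1 splits it over tokens 0 and 1;
# row 2 spreads it evenly over all three tokens.
```

Positions filled with negative infinity receive exactly zero weight after the softmax, and every row still sums to 1.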

The MultiHeadAttention class can be used similarly to the SelfAttention and CausalAttention classes we implemented earlier:

In [None]:
# Attention operates on the embedded inputs, which are 3-dimensional
print(input_embeddings.shape)
batch_size, context_length, d_in = input_embeddings.shape

