In this assignment, you will implement a GPT model and train it with the causal language modeling (CLM) objective.
 * If you get stuck or need more clarification, you may refer to: https://github.com/karpathy/minGPT/blob/master/mingpt/model.py

 * We will use the ReLU activation function instead of GELU.

 * As usual, let us install the required libraries.

 * **Note** that it is absolutely fine if you do not get the exact loss values shown in this notebook. Just check whether your implementation overfits the given tiny toy paragraph!

# Installation


In [1]:
!pip install torchdata==0.6.0 # to be compatible with torch 2.0
!pip install portalocker==2.0.0

Collecting torchdata==0.6.0
  Downloading torchdata-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (919 bytes)
Collecting torch==2.0.0 (from torchdata==0.6.0)
  Downloading torch-2.0.0-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch==2.0.0->torchdata==0.6.0)
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch==2.0.0->torchdata==0.6.0)
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cuda-cupti-cu11==11.7.101 (from torch==2.0.0->torchdata==0.6.0)
  Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu11==8.5.0.96 (from torch==2.0.0->torchdata==0.6.0)
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu11==11.1

* See [here](https://github.com/pytorch/text) for compatibility

In [2]:
!pip install -U torchtext==0.15.1

Collecting torchtext==0.15.1
  Downloading torchtext-0.15.1-cp310-cp310-manylinux1_x86_64.whl.metadata (7.4 kB)
Downloading torchtext-0.15.1-cp310-cp310-manylinux1_x86_64.whl (2.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 27.4 MB/s eta 0:00:00
Installing collected packages: torchtext
Successfully installed torchtext-0.15.1


# Imports

In [3]:
import torch
from torch import Tensor

import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.nn.functional import one_hot

import torch.optim as optim

#text lib
import torchtext

# tokenizer
from torchtext.data.utils import get_tokenizer

#build vocabulary
from torchtext.vocab import vocab
from torchtext.vocab import build_vocab_from_iterator

# get input_ids (numericalization)
from torchtext.transforms import VocabTransform

# get embeddings
from torch.nn import Embedding

from  pprint import pprint
from yaml import safe_load
import copy
import numpy as np

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Load the dataset for LM modeling

 * We use a simple tokenizer and put

In [5]:
batch_size = 10

In [6]:
class Tokenizer(object):

  def __init__(self,text):
    self.text = text
    self.word_tokenizer = get_tokenizer(tokenizer="basic_english",language='en')
    self.vocab_size = None

  def get_tokens(self):
    for sentence in self.text.strip().split('\n'):
      yield self.word_tokenizer(sentence)

  def build_vocab(self):
    v = build_vocab_from_iterator(self.get_tokens(),
                                  min_freq=1,specials=['<unk>','<start>','<end>'])
    v.set_default_index(v['<unk>']) # index of OOV
    self.vocab_size = len(v)
    self.vocab = v  # Store the vocab in the object

    return v

  def token_ids(self):
    v = self.build_vocab()
    vt = VocabTransform(v)
    tokens = self.word_tokenizer(self.text)
    # Number of tokens per row; the reshape below requires that
    # len(tokens) is an exact multiple of batch_size
    max_seq_len = int(np.ceil(len(tokens)/batch_size))
    data = torch.tensor(vt(tokens),dtype=torch.int64)
    return data.reshape(batch_size,max_seq_len)



In [7]:
text = """Best known for the invention of Error Correcting Codes, he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines.
Reflecting on the significant benefits I received from Hamming, I decided to develop a tribute to his legacy. There has not been a previous biography of Hamming, and the few articles about him restate known facts and assumptions and leave us with open questions.
One thought drove me as I developed this legacy project: An individual's legacy is more than a list of their attempts and accomplishments. Their tribute should also reveal the succeeding generations they inspired and enabled and what each attempted and achieved.
This book is a unique genre containing my version of a biography that intertwines the story "of a life" and a multi-player memoir with particular events and turning points recalled by those, including me, who he inspired and enabled.
Five years of research uncovered the people, places, opportunities, events, and influences that shaped Hamming. I discovered unpublished information, stories, photographs, videos, and personal remembrances to chronicle his life, which helped me put Hamming's
legacy in the context I wanted.The result demonstrates many exceptional qualities, including his noble pursuit of excellence and helping others. Hamming paid attention to the details, his writings continue to influence, and his guidance is a timeless gift to the world.
This biography is part of """

In [8]:
Tk = Tokenizer(text)

In [9]:
x_raw = Tk.token_ids()
print(x_raw.shape)

torch.Size([10, 26])
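
The shape above follows from how `token_ids` lays out the data: the flat stream of token ids is cut into `batch_size` rows of equal length, so row boundaries need not coincide with sentence boundaries. A minimal sketch with a made-up stream of 12 ids and 3 rows (the names `flat_ids`, `num_rows`, and `rows` are illustrative and not part of the notebook):

```python
import torch

flat_ids = torch.arange(12)            # stand-in for the tokenized text
num_rows = 3                           # plays the role of batch_size
rows = flat_ids.reshape(num_rows, -1)  # each row holds 12 / 3 = 4 consecutive ids
print(rows)
```

If the number of tokens were not an exact multiple of the number of rows, `reshape` would raise an error, which is why the toy paragraph has to tokenize to a multiple of `batch_size`.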


In [10]:
# let us display the first 10 tokens of the vocabulary
v = Tk.build_vocab()
pprint(v.vocab.get_itos()[0:10])

['<unk>', '<start>', '<end>', ',', 'and', '.', 'the', 'a', 'of', 'to']


* Create the input_ids and Labels from the raw input sequence

In [11]:
bs,raw_seq_len = x_raw.shape
x = torch.empty(size=(bs,raw_seq_len+2),dtype=torch.int64)
x[:,1:-1] =x_raw

# insert the index of special tokens
x[:,0] = torch.full(size=(1,batch_size),fill_value=v.vocab.get_stoi()['<start>'])
x[:,-1] = torch.full(size=(1,batch_size),fill_value=v.vocab.get_stoi()['<end>'])

# Quick check of the implementation
v = Tk.build_vocab()
words = []
for idx in x[0,:]:
  words.append(v.vocab.get_itos()[idx.item()])
print(' '.join(words))

<start> best known for the invention of error correcting codes , he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines . <end>


In [12]:
# labels are the input_ids shifted left by one position (next-token targets)
bs,seq_len = x.shape
y = torch.empty(size=(bs,seq_len),dtype=torch.int64)
y[:,0:-1] = copy.deepcopy(x[:,1:])

# the last position has no next token; -100 is the default ignore_index of nn.CrossEntropyLoss
y[:,-1] = torch.full(size=(1,batch_size),fill_value=-100)
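
To make the label construction concrete, here is a small sketch on a toy sequence. It assumes random logits in place of a real model output, and the names `toy_ids`, `toy_labels`, and `toy_vocab_size` are made up for illustration. It checks that a target of `-100` contributes nothing to the loss:

```python
import torch
import torch.nn as nn

# Toy batch: one sequence of 4 token ids (values are arbitrary)
toy_ids = torch.tensor([[1, 7, 9, 2]])

# Labels are the inputs shifted left by one; the last position has no target
toy_labels = torch.full_like(toy_ids, -100)
toy_labels[:, :-1] = toy_ids[:, 1:]

# Random logits stand in for a model output of shape (batch, seq_len, vocab)
torch.manual_seed(0)
toy_vocab_size = 10
logits = torch.randn(1, 4, toy_vocab_size)

criterion = nn.CrossEntropyLoss()  # ignore_index defaults to -100
loss_all = criterion(logits.view(-1, toy_vocab_size), toy_labels.view(-1))

# The same loss computed over the first three positions only
loss_first3 = criterion(logits[:, :-1].reshape(-1, toy_vocab_size),
                        toy_ids[:, 1:].reshape(-1))
print(toy_labels.tolist(), loss_all.item(), loss_first3.item())
```

Both losses agree because the mean is taken only over positions whose target is not `-100`.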

# Configuration

In [13]:
vocab_size = Tk.vocab_size
seq_len = x.shape[1]
embed_dim = 32
dmodel = embed_dim
# plain Python ints: these values are only used as tensor sizes
dq = 4
dk = 4
dv = 4
heads = 8
d_ff = 4*dmodel

* Define all the sub-layers (MHMA, FFN) of the transformer block
* Seeds for $W_Q, W_K, W_V, W_O$: 43, 44, 45, and 46, respectively
* Seeds for the FFN weights $W_1, W_2$: 47 and 48. There are no biases
* Seed for the output layer: 49
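
The fixed seeds matter because `torch.randn` draws from the global random number generator: reseeding immediately before each draw makes every weight matrix reproducible on its own. A minimal sketch, assuming a hypothetical helper `seeded_randn` that is not part of the assignment:

```python
import torch

def seeded_randn(seed, *shape):
    """Return a standard-normal tensor drawn right after reseeding the global RNG."""
    torch.manual_seed(seed)
    return torch.randn(*shape)

w_a = seeded_randn(43, 32, 32)
w_b = seeded_randn(43, 32, 32)   # same seed and shape -> identical values
w_c = seeded_randn(44, 32, 32)   # different seed -> different values
print(torch.equal(w_a, w_b), torch.equal(w_a, w_c))
```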

In [14]:
def create_mask(batch_size, head, seq1, seq2):
  # Additive causal mask of shape (batch_size, head, seq1, seq2):
  # 0 on and below the diagonal (visible positions) and
  # -inf strictly above it (future positions)
  tensor = torch.full((batch_size, int(head), seq1, seq2), float('-inf'))
  # triu with diagonal=1 keeps -inf strictly above the diagonal of every
  # (seq1, seq2) slice and sets the remaining entries to zero
  return torch.triu(tensor, diagonal=1)
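
To see what an additive mask of this form does, the sketch below builds a single 4x4 slice of it and applies softmax to uniform scores. The sizes and the names `causal` and `weights` are chosen only for illustration:

```python
import torch
import torch.nn.functional as F

seq = 4
# Additive causal mask: 0 on and below the diagonal, -inf strictly above it
causal = torch.triu(torch.full((seq, seq), float('-inf')), diagonal=1)

scores = torch.zeros(seq, seq)   # uniform scores make the effect easy to read
weights = F.softmax(scores + causal, dim=-1)
print(weights)
```

Row `i` of `weights` spreads its probability mass evenly over positions `0..i` and assigns exactly zero to later positions, so no token can attend to the future.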

In [15]:
class MHMA(nn.Module):
    def __init__(self, embed_dim, heads, dq, dk, dv):
        super(MHMA, self).__init__()
        self.heads = heads
        self.dq, self.dk, self.dv = dq, dk, dv
        self.embed_dim = embed_dim

        # Seeds for reproducibility
        torch.manual_seed(43)
        self.WQ = nn.Parameter(torch.randn(embed_dim, heads * dq))

        torch.manual_seed(44)
        self.WK = nn.Parameter(torch.randn(embed_dim, heads * dk))

        torch.manual_seed(45)
        self.WV = nn.Parameter(torch.randn(embed_dim, heads * dv))

        torch.manual_seed(46)
        self.WO = nn.Parameter(torch.randn(heads * dv, embed_dim))

    def forward(self, x, mask=None):
        # Compute Q, K, V
        Q = x @ self.WQ
        K = x @ self.WK
        V = x @ self.WV

        # Split into multiple heads
        Q = Q.view(Q.shape[0], Q.shape[1], self.heads, self.dq).transpose(1, 2)
        K = K.view(K.shape[0], K.shape[1], self.heads, self.dk).transpose(1, 2)
        V = V.view(V.shape[0], V.shape[1], self.heads, self.dv).transpose(1, 2)

        # Compute scaled dot-product attention scores
        scores = (Q @ K.transpose(-2, -1)) / (self.dk ** 0.5)

        # Add the causal mask so that each position attends only to itself and earlier positions
        scores = scores + create_mask(x.shape[0],self.heads,x.shape[1],x.shape[1])
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        attention = weights @ V

        # Combine attention heads
        attention = attention.transpose(1, 2).contiguous().view(x.shape[0], x.shape[1], -1)
        out = attention @ self.WO

        return out
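
The `view`/`transpose` pair in `forward` only rearranges the projected features into per-head slices, and the final `transpose`/`view` undoes it. A small shape check with toy sizes (all names here are illustrative):

```python
import torch

batch, seq, n_heads, d_head = 2, 5, 8, 4      # toy sizes; n_heads * d_head = 32
Q = torch.randn(batch, seq, n_heads * d_head)

# (batch, seq, n_heads*d_head) -> (batch, n_heads, seq, d_head)
Q_split = Q.view(batch, seq, n_heads, d_head).transpose(1, 2)

# Inverse: (batch, n_heads, seq, d_head) -> (batch, seq, n_heads*d_head)
Q_merged = Q_split.transpose(1, 2).contiguous().view(batch, seq, -1)
print(Q_split.shape, Q_merged.shape)
```

Head `h` therefore sees columns `h*d_head` to `(h+1)*d_head` of the projection, and merging the heads restores the original layout.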


In [16]:
class FFN(nn.Module):
    def __init__(self, embed_dim, d_ff):
        super(FFN, self).__init__()

        # Seeds for reproducibility
        torch.manual_seed(47)
        self.W1 = nn.Parameter(torch.randn(embed_dim, d_ff))

        torch.manual_seed(48)
        self.W2 = nn.Parameter(torch.randn(d_ff, embed_dim))

    def forward(self, x):
        # Apply ReLU activation instead of GELU
        x = F.relu(x @ self.W1)
        x = x @ self.W2
        return x


In [17]:
class PredictionHead(nn.Module):
    def __init__(self, dmodel, vocab_size):
        super(PredictionHead, self).__init__()

        # Seed for reproducibility
        torch.manual_seed(49)
        self.W_out = nn.Parameter(torch.randn(dmodel, vocab_size))

    def forward(self, x):
        return x @ self.W_out


In [18]:
class PositionalEncoding(nn.Module):
    def __init__(self, seq_len, embed_dim):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(seq_len, embed_dim)

        position = torch.arange(0, seq_len, dtype=torch.float32).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float32) * (-torch.log(torch.tensor(10000.0)) / embed_dim))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Register as a buffer: it is not trained, but it moves with the module across devices
        self.register_buffer("positional_encoding", pe.unsqueeze(0))  # Shape: (1, seq_len, embed_dim)

    def forward(self, x):
        return x + self.positional_encoding[:, :x.size(1), :]
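
The class above computes the sinusoidal encoding $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$, where $d$ is `embed_dim`; `div_term` is just $10000^{-2i/d}$ written with `exp` and `log`. A quick sanity check on toy sizes (the sizes are arbitrary assumptions):

```python
import math
import torch

toy_seq_len, toy_dim = 4, 6
position = torch.arange(toy_seq_len, dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(torch.arange(0, toy_dim, 2, dtype=torch.float32)
                     * (-math.log(10000.0) / toy_dim))

pe = torch.zeros(toy_seq_len, toy_dim)
pe[:, 0::2] = torch.sin(position * div_term)   # even columns
pe[:, 1::2] = torch.cos(position * div_term)   # odd columns
print(pe[0])
```

At position 0 every sine column is 0 and every cosine column is 1, which is an easy property to verify in your own implementation.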


In [19]:
class DecoderLayer(nn.Module):
    def __init__(self, dmodel, dq, dk, dv, d_ff, heads, mask=None):
        super(DecoderLayer, self).__init__()
        self.mhma = MHMA(dmodel, heads, dq, dk, dv)
        self.layer_norm_1 = nn.LayerNorm(dmodel)
        self.layer_norm_2 = nn.LayerNorm(dmodel)
        self.ffn = FFN(dmodel, d_ff)

    def forward(self, dec_rep):
        """
        Args:
            dec_rep: Tensor of shape (batch_size, seq_len, dmodel)

        Returns:
            out: Tensor of shape (batch_size, seq_len, dmodel)
        """
        # Multi-Head Attention + Residual + LayerNorm
        attention_out = self.mhma(dec_rep)
        residual_1 = dec_rep + attention_out
        norm_1 = self.layer_norm_1(residual_1)

        # Feed-Forward Network + Residual + LayerNorm
        ffn_out = self.ffn(norm_1)
        residual_2 = norm_1 + ffn_out
        out = self.layer_norm_2(residual_2)

        return out


In [20]:
class Embed(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(Embed, self).__init__()

        # Seed for reproducibility
        torch.manual_seed(70)

        # Embedding layer
        self.embed = nn.Embedding(vocab_size, embed_dim)

        # Positional Encoding
        self.pe = PositionalEncoding(seq_len=seq_len, embed_dim=embed_dim)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len)

        Returns:
            out: Tensor of shape (batch_size, seq_len, embed_dim)
        """
        # Apply embedding and positional encoding
        out = self.pe(self.embed(x))
        return out


In [21]:
class Decoder(nn.Module):

  def __init__(self,vocab_size,dmodel,dq,dk,dv,d_ff,heads,mask,num_layers=1):
    super(Decoder,self).__init__()
    self.embed_lookup = Embed(vocab_size,dmodel)
    self.dec_layers = nn.ModuleList(DecoderLayer(dmodel,dq,dk,dv,d_ff,heads,mask) for i in range(num_layers))
    self.predict = PredictionHead(dmodel,vocab_size)

  def forward(self,input_ids):
    out = self.embed_lookup(input_ids)
    for dec_layer in self.dec_layers:
      out = dec_layer(out)
    out = self.predict(out)

    return out

In [22]:
model = Decoder(
    vocab_size=vocab_size,
    dmodel=dmodel,
    dq=dq,
    dk=dk,
    dv=dv,
    d_ff=d_ff,
    heads=heads,
    mask=None,
    num_layers=1  # Single decoder layer for simplicity
)


In [23]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

In [24]:
def train(input_ids, labels, epochs=1000):
    for epoch in range(epochs):
        # Forward pass
        out = model(input_ids)

        # Compute loss
        loss = criterion(out.view(-1, vocab_size), labels.view(-1))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Log progress every 100 epochs
        if (epoch + 1) % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")


In [25]:
# train the model for 10K epochs
train(x, y, epochs=10000)



Epoch [100/10000], Loss: 7.2947
Epoch [200/10000], Loss: 5.5397
Epoch [300/10000], Loss: 4.9794
Epoch [400/10000], Loss: 4.7600
Epoch [500/10000], Loss: 4.6498
Epoch [600/10000], Loss: 4.5809
Epoch [700/10000], Loss: 4.5280
Epoch [800/10000], Loss: 4.4791
Epoch [900/10000], Loss: 4.4358
Epoch [1000/10000], Loss: 4.3865
Epoch [1100/10000], Loss: 4.3469
Epoch [1200/10000], Loss: 4.3046
Epoch [1300/10000], Loss: 4.2617
Epoch [1400/10000], Loss: 4.2132
Epoch [1500/10000], Loss: 4.1678
Epoch [1600/10000], Loss: 4.1188
Epoch [1700/10000], Loss: 4.0695
Epoch [1800/10000], Loss: 4.0171
Epoch [1900/10000], Loss: 3.9620
Epoch [2000/10000], Loss: 3.9044
Epoch [2100/10000], Loss: 3.8441
Epoch [2200/10000], Loss: 3.7843
Epoch [2300/10000], Loss: 3.7204
Epoch [2400/10000], Loss: 3.6585
Epoch [2500/10000], Loss: 3.5967
Epoch [2600/10000], Loss: 3.5325
Epoch [2700/10000], Loss: 3.4675
Epoch [2800/10000], Loss: 3.4036
Epoch [2900/10000], Loss: 3.3333
Epoch [3000/10000], Loss: 3.2678
Epoch [3100/10000],

The loss is about 0.09 after 10K epochs
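
One way to read this number (an added illustration, not part of the assignment) is to convert the mean cross-entropy, which PyTorch reports in nats, into a perplexity:

```python
import math

final_loss = 0.09                  # approximate value quoted above
perplexity = math.exp(final_loss)  # exp of the mean cross-entropy in nats
print(f"{perplexity:.3f}")         # 1.094
```

A perplexity close to 1 means the model is almost certain about each next token, which is what overfitting a tiny paragraph should look like.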

# **Generate text**

In [26]:
@torch.inference_mode()
def generate(model, prompt=['<start>'], max_words=10):
    # Initialize the tokenizer and vocabulary
    Tk = Tokenizer(text)  # Assuming `text` contains your data
    vocab = Tk.build_vocab()
    stoi = vocab.vocab.get_stoi()
    itos = vocab.vocab.get_itos()

    # Accept a single token as well as a list of tokens
    if isinstance(prompt, str):
        prompt = [prompt]

    # Ensure every prompt token is in the vocabulary
    for token in prompt:
        if token not in stoi:
            raise ValueError(f"Token '{token}' not found in vocabulary.")

    # Convert prompt tokens to input IDs
    input_ids = torch.tensor([[stoi[token] for token in prompt]], dtype=torch.long)  # Shape: (1, len(prompt))

    # Initialize the generated sequence
    generated = input_ids.clone()

    for _ in range(max_words):
        # Forward pass through the model
        out = model(generated)  # Output shape: (batch_size, seq_len, vocab_size)

        # Get logits for the last token
        logits = out[:, -1, :]  # Shape: (1, vocab_size)

        # Predict the next token
        next_token_id = torch.argmax(logits, dim=-1, keepdim=True)  # Shape: (1, 1)

        # Append the predicted token to the sequence
        generated = torch.cat((generated, next_token_id), dim=1)

        # Stop generation if <end> token is generated
        if next_token_id.squeeze().item() == stoi['<end>']:
            break

    # Convert token IDs to words
    words = [itos[token_id] for token_id in generated.squeeze().tolist()]

    return ' '.join(words)
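
The core of `generate` is a greedy decoding loop: take the logits at the last position, pick the argmax, append it, and stop at the end token. The sketch below isolates that loop using a stand-in model. `toy_model`, `greedy_decode`, and `END_ID` are hypothetical names, and the "model" is a fixed rule rather than a trained network:

```python
import torch

END_ID = 2   # hypothetical id of the end token

def toy_model(ids):
    """Stand-in model: always predicts (last id + 1) mod 5 as the next token."""
    vocab = 5
    logits = torch.zeros(ids.shape[0], ids.shape[1], vocab)
    next_ids = (ids + 1) % vocab
    logits.scatter_(2, next_ids.unsqueeze(-1), 1.0)
    return logits

def greedy_decode(model, start_ids, max_new_tokens=10):
    generated = start_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(generated)[:, -1, :]             # logits at the last position
        next_id = logits.argmax(dim=-1, keepdim=True)   # greedy choice
        generated = torch.cat((generated, next_id), dim=1)
        if next_id.item() == END_ID:                    # stop at the end token
            break
    return generated

out = greedy_decode(toy_model, torch.tensor([[4]]))
print(out.tolist())   # [[4, 0, 1, 2]]
```

Because the choice is always the argmax, the output is deterministic for a given prompt, which is why a model that has memorized the text reproduces it verbatim.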


In [27]:
print(Tk.vocab.vocab.get_stoi())  # Check stoi mappings
print(Tk.vocab.vocab.get_itos()[:10])  # Check first 10 itos mappings


{'years': 154, 'which': 151, 'what': 150, 'videos': 147, 'unique': 143, 'turning': 141, 'timeless': 139, 'than': 134, 'succeeding': 133, 'stories': 131, 'skills': 130, 'containing': 51, 'problem-solving': 113, 'who': 30, 'research': 123, 'has': 76, 'best': 46, 'paid': 103, 'benefits': 45, 'been': 44, 'version': 146, 'assumptions': 40, 'attention': 43, 'us': 145, 'part': 104, 'attempts': 42, '<start>': 1, ',': 3, 'attempted': 41, 'project': 114, 'as': 39, 'was': 149, 'those': 137, 'applied': 37, 'an': 36, 'uncovered': 142, 'also': 35, 'drove': 62, 'chronicle': 49, 'opportunities': 101, 'correcting': 54, 'about': 32, 'qualities': 117, 'tribute': 29, 'continue': 53, 'this': 17, 'error': 64, 'that': 27, 'multi-player': 93, 'should': 128, 'codes': 50, 'with': 31, 'by': 48, 'a': 7, 'questions': 118, 'me': 16, 'writings': 153, 'i': 12, 'accomplishments': 33, '.': 5, 'individual': 81, '<end>': 2, 'demonstrates': 56, 'few': 68, 'wanted': 148, 'is': 13, 'true': 140, 'he': 21, 'decided': 55, 'the

In [28]:
generate(model, prompt=['<start>'], max_words=25)


'<start> biography of hamming , and the few articles about him restate known facts and assumptions and leave us with open questions . one thought drove'

* Note that the model has memorized the training text. Given only the start token, if your implementation reproduces a span of the training text verbatim, it is likely to be correct.
* Suppose the prompt is `<start> best known`; we then expect the model to reproduce the first sentence verbatim.

In [29]:
generate(model,prompt=['<start>','best','known'],max_words=25)


'<start> best known for the invention of error correcting codes , he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines . <end>'

* Change the prompt

In [30]:
generate(model,prompt=['<start>','reflecting','on'],max_words=25)


'<start> reflecting on the significant benefits i received from hamming , i decided to develop a tribute to his legacy . there has not been a previous <end>'