Language Model Implementation with Transformer Decoder

This notebook demonstrates the implementation of a simple language model using a Transformer decoder architecture. Here's a summary of the key components and steps:

1. Installation and Imports  
Installs necessary libraries like torchdata, portalocker, and torchtext.  
Imports required modules from PyTorch and other libraries.  

2. Data Preparation  
Implements a custom Tokenizer class to process text data.  
Builds vocabulary and converts text to token IDs.  
Prepares input sequences and labels for training.  

3. Model Architecture  
Implements key components of the Transformer architecture:  
- Multi-Head Masked Attention (MHMA)  
- Feed-Forward Network (FFN)  
- Positional Encoding  
- Embedding layer  
Combines these components into a Decoder layer and full Decoder model.  

4. Training  
Defines loss function (CrossEntropyLoss) and optimizer (SGD).  
Implements a training loop that runs for 10,000 epochs.  

5. Text Generation  
Implements a generate function for text generation using the trained model.  
Demonstrates text generation with different prompts.  

6. Results  
Shows that the model has memorized sentences from the training set.  
Generates text based on given prompts, reproducing training sentences accurately.  

The notebook effectively demonstrates the process of building, training, and using a simple language model based on the Transformer architecture for text generation tasks.


# Installation


In [2]:
!pip install torchdata==0.6.0
!pip install portalocker==2.0.0



* See [here](https://github.com/pytorch/text) for version compatibility

In [3]:
!pip install -U torchtext==0.15.1



# Imports

In [4]:
import torch
from torch import Tensor

import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.nn.functional import one_hot

import torch.optim as optim

#text lib
import torchtext

# tokenizer
from torchtext.data.utils import get_tokenizer

#build vocabulary
from torchtext.vocab import vocab
from torchtext.vocab import build_vocab_from_iterator

# get input_ids (numericalization)
from torchtext.transforms import VocabTransform

# get embeddings
from torch.nn import Embedding

from  pprint import pprint
from yaml import safe_load
import copy
import math  # used by MHMA and PositionalEncoding below
import numpy as np

In [6]:
!pip install nltk 

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp310-cp310-win_amd64.whl.metadata (41 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.11.6-cp310-cp310-win_amd64.whl (274 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.9.1 regex-2024.11.6


In [7]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\BalaM\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

# Load the dataset for LM modeling

 * We use a simple tokenizer and put

In [8]:
batch_size = 10

In [9]:
class Tokenizer(object):

  def __init__(self,text):
    self.text = text
    self.word_tokenizer = get_tokenizer(tokenizer="basic_english",language='en')
    self.vocab_size = None

  def get_tokens(self):
    for sentence in self.text.strip().split('\n'):
      yield self.word_tokenizer(sentence)

  def build_vocab(self):
    v = build_vocab_from_iterator(self.get_tokens(),
                                  min_freq=1,specials=['<unk>','<start>','<end>'])
    v.set_default_index(v['<unk>']) # index of OOV
    self.vocab_size = len(v)
    return v

  def token_ids(self):
    v = self.build_vocab()
    vt = VocabTransform(v)
    # numericalize the whole text at once
    data = torch.tensor(vt(self.word_tokenizer(self.text)),dtype=torch.int64)
    # relies on the global batch_size; the token count must be divisible by it
    max_seq_len = int(np.ceil(data.numel()/batch_size))
    return data.reshape(batch_size,max_seq_len)
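
The tokenizer above follows a common pattern: split text into tokens, give each distinct token an integer ID with the special tokens first, and map unknown tokens to `<unk>`. A minimal dependency-free sketch of that idea is shown here. It uses a whitespace split instead of torchtext's `basic_english` tokenizer, the helper names `build_vocab` and `encode` are hypothetical, and the frequency-based ordering only approximates what `build_vocab_from_iterator` produces.

```python
from collections import Counter

SPECIALS = ["<unk>", "<start>", "<end>"]

def build_vocab(sentences, specials=SPECIALS):
    """Return (stoi, itos): specials first, then tokens by descending frequency."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # Sort by frequency (descending), breaking ties alphabetically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + ordered
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(sentence, stoi):
    """Map tokens to IDs; out-of-vocabulary tokens fall back to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in sentence.lower().split()]

stoi, itos = build_vocab(["the cat sat", "the dog sat down"])
ids = encode("the bird sat", stoi)  # "bird" was never seen, so it maps to <unk>
print(itos)
print(ids)
```

Reserving the first indices for special tokens keeps their IDs stable no matter which corpus the vocabulary is built from.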



In [10]:
text = """Best known for the invention of Error Correcting Codes, he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines.
Reflecting on the significant benefits I received from Hamming, I decided to develop a tribute to his legacy. There has not been a previous biography of Hamming, and the few articles about him restate known facts and assumptions and leave us with open questions.
One thought drove me as I developed this legacy project: An individual's legacy is more than a list of their attempts and accomplishments. Their tribute should also reveal the succeeding generations they inspired and enabled and what each attempted and achieved.
This book is a unique genre containing my version of a biography that intertwines the story "of a life" and a multi-player memoir with particular events and turning points recalled by those, including me, who he inspired and enabled.
Five years of research uncovered the people, places, opportunities, events, and influences that shaped Hamming. I discovered unpublished information, stories, photographs, videos, and personal remembrances to chronicle his life, which helped me put Hamming's
legacy in the context I wanted.The result demonstrates many exceptional qualities, including his noble pursuit of excellence and helping others. Hamming paid attention to the details, his writings continue to influence, and his guidance is a timeless gift to the world.
This biography is part of """

In [11]:
Tk = Tokenizer(text)

In [12]:
x_raw = Tk.token_ids()
print(x_raw.shape)

torch.Size([10, 26])


In [13]:
# let us display the first 10 tokens of the vocabulary
v = Tk.build_vocab()
pprint(v.vocab.get_itos()[0:10])

['<unk>', '<start>', '<end>', ',', 'and', '.', 'the', 'a', 'of', 'to']


* Create the input_ids and Labels from the raw input sequence

In [14]:
bs,raw_seq_len = x_raw.shape
x = torch.empty(size=(bs,raw_seq_len+2),dtype=torch.int64)
x[:,1:-1] =x_raw

# insert the index of special tokens (a scalar is broadcast over the column)
x[:,0] = v['<start>']
x[:,-1] = v['<end>']

# Quick check of the implementation
v = Tk.build_vocab()
words = []
for idx in x[0,:]:
  words.append(v.vocab.get_itos()[idx.item()])
print(' '.join(words))

<start> best known for the invention of error correcting codes , he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines . <end>


In [15]:
# labels are the input_ids shifted left by one position (next-token targets)
bs,seq_len = x.shape
y = torch.empty(size=(bs,seq_len),dtype=torch.int64)
y[:,0:-1] = x[:,1:]

# the last position has no next token; -100 is ignored by CrossEntropyLoss
y[:,-1] = -100
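
To make the shift concrete, here is a small plain-Python sketch of how next-token labels relate to the inputs and how a label of -100 drops a position from the loss. The helper names `make_labels` and `cross_entropy` are hypothetical, and the token IDs are toy values. It mirrors the default `ignore_index=-100` of `nn.CrossEntropyLoss` but is not the PyTorch implementation.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss

def make_labels(input_ids):
    """Labels are the inputs shifted left by one; the last position has no target."""
    return input_ids[1:] + [IGNORE_INDEX]

def cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # this position contributes nothing to the loss
        log_z = math.log(sum(math.exp(v) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count

input_ids = [1, 5, 7, 2]          # <start>, two word IDs, <end> (toy values)
labels = make_labels(input_ids)   # position t is trained to predict token t+1
uniform_logits = [[0.0] * 8 for _ in input_ids]  # untrained model, 8-token vocab
loss = cross_entropy(uniform_logits, labels)
print(labels, loss)
```

With uniform logits over 8 tokens, the loss equals ln(8), averaged over the three positions that have a target.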

# Configuration

In [16]:
vocab_size = Tk.vocab_size
seq_len = x.shape[1]
embed_dim = 32
dmodel = embed_dim
# plain ints are sufficient for layer sizes
dq = 4
dk = 4
dv = 4
heads = 8
d_ff = 4*dmodel

* Define all the sub-layers (mhma,ffn) in the transformer blocks
* Seeds for $W_Q, W_K, W_V, W_O$: 43, 44, 45, and 46, respectively
* Seeds for the ffn weights $W_1, W_2$: 47 and 48. There are no biases
* Seed for output layer 49

In [17]:
class MHMA(nn.Module):
    def __init__(self, dmodel, dq, dk, dv, heads, mask=None):
        super(MHMA, self).__init__()
        self.heads = heads
        self.dmodel = dmodel
        self.dk = dk
        self.dv = dv
        self.Wq = nn.Linear(dmodel, dq * heads)
        self.Wk = nn.Linear(dmodel, dk * heads)
        self.Wv = nn.Linear(dmodel, dv * heads)
        self.linear = nn.Linear(dv * heads, dmodel)
        self.mask = mask

    def forward(self, x):
        batch_size, seq_len, _ = x.size()
        q = self.Wq(x).view(batch_size, seq_len, self.heads, self.dk).transpose(1, 2)
        k = self.Wk(x).view(batch_size, seq_len, self.heads, self.dk).transpose(1, 2)
        v = self.Wv(x).view(batch_size, seq_len, self.heads, self.dv).transpose(1, 2)

        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.dk)
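        if self.mask is not None:
            # crop the mask so inputs shorter than the mask's size also work
            mask = self.mask[..., :seq_len, :seq_len]
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)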
        attn_weights = F.softmax(attn_scores, dim=-1)

        attn_output = torch.matmul(attn_weights, v).transpose(1, 2).contiguous().view(batch_size, seq_len, self.heads * self.dv)
        out = self.linear(attn_output)
        return out

class FFN(nn.Module):
    def __init__(self, dmodel, d_ff):
        super(FFN, self).__init__()
        self.linear1 = nn.Linear(dmodel, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, dmodel)

    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        return out

class PredictionHead(nn.Module):
    def __init__(self, dmodel, vocab_size):
        super(PredictionHead, self).__init__()
        self.linear = nn.Linear(dmodel, vocab_size)

    def forward(self, x):
        out = self.linear(x)
        return out

class PositionalEncoding(nn.Module):
    def __init__(self, dmodel, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, dmodel)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, dmodel, 2).float() * (-math.log(10000.0) / dmodel))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1), :]
        return x
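
The `PositionalEncoding` class above computes the standard sinusoidal encoding in vectorized form. A minimal plain-Python sketch of the same formula, written with explicit loops so individual values can be inspected by hand (the function name `positional_encoding` is hypothetical):

```python
import math

def positional_encoding(max_len, dmodel):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * dmodel for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, dmodel, 2):
            # same quantity as position * div_term in the vectorized version
            angle = pos / (10000.0 ** (i / dmodel))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dmodel:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=4, dmodel=8)
print([round(v, 4) for v in pe[0]])  # position 0: sin(0) and cos(0) alternate
print([round(v, 4) for v in pe[1]])
```

Low dimensions oscillate quickly with position and high dimensions slowly, which gives every position a distinct pattern.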



In [18]:


class DecoderLayer(nn.Module):
    def __init__(self, dmodel, dq, dk, dv, d_ff, heads, mask=None):
        super(DecoderLayer, self).__init__()
        self.mhma = MHMA(dmodel, dq, dk, dv, heads, mask=mask)
        self.layer_norm_1 = torch.nn.LayerNorm(dmodel)
        self.layer_norm_2 = torch.nn.LayerNorm(dmodel)
        self.ffn = FFN(dmodel, d_ff)

    def forward(self, dec_rep):
        attn_output = self.mhma(dec_rep)
        attn_output = self.layer_norm_1(attn_output + dec_rep)
        ffn_output = self.ffn(attn_output)
        out = self.layer_norm_2(ffn_output + attn_output)
        return out
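
The `mask` argument is where a causal (lower-triangular) mask is normally supplied, so that position t attends only to positions up to t and cannot look at the token it is being trained to predict. A minimal plain-Python sketch of how such a mask changes the attention weights for a single head with toy scores (`causal_mask` and `masked_softmax` are hypothetical helper names):

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: 1 where a query may attend to a key, else 0."""
    return [[1 if k <= q else 0 for k in range(seq_len)] for q in range(seq_len)]

def masked_softmax(scores, mask, fill=-1e9):
    """Row-wise softmax after replacing masked-out scores with a large negative value."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        filled = [s if m else fill for s, m in zip(score_row, mask_row)]
        top = max(filled)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in filled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 2.0, 2.0]]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but all weight above the diagonal is zero. In PyTorch, an equivalent lower-triangular mask can be built with `torch.tril(torch.ones(seq_len, seq_len))`.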


In [None]:

class Embed(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(Embed, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pe = PositionalEncoding(embed_dim)

    def forward(self,x):
        out = self.pe(self.embed(x))
        return out



In [21]:
class Decoder(nn.Module):

  def __init__(self,vocab_size,dmodel,dq,dk,dv,d_ff,heads,mask,num_layers=1):
    super(Decoder,self).__init__()
    self.embed_lookup = Embed(vocab_size,dmodel)
    self.dec_layers = nn.ModuleList(copy.deepcopy(DecoderLayer(dmodel,dq,dk,dv,d_ff,heads,mask)) for i in range(num_layers))
    self.predict = PredictionHead(dmodel,vocab_size)

  def forward(self,input_ids):
    out = self.embed_lookup(input_ids)
    for dec_layer in self.dec_layers:
      out = dec_layer(out)
    out = self.predict(out)

    return out
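
As a sanity check on model size, here is a rough parameter count for one `DecoderLayer` under the configuration used in this notebook (`dmodel=32`, 8 heads of size 4, `d_ff=128`). It assumes every `nn.Linear` keeps its default bias term and each `LayerNorm` has a weight and a bias vector. `linear_params` is a hypothetical helper, not a PyTorch function.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

dmodel, dq, dk, dv, heads = 32, 4, 4, 4, 8
d_ff = 4 * dmodel

mhma = (linear_params(dmodel, dq * heads)      # Wq
        + linear_params(dmodel, dk * heads)    # Wk
        + linear_params(dmodel, dv * heads)    # Wv
        + linear_params(dv * heads, dmodel))   # output projection
ffn = linear_params(dmodel, d_ff) + linear_params(d_ff, dmodel)
layer_norms = 2 * (2 * dmodel)  # each LayerNorm has a weight and a bias vector

print(mhma, ffn, layer_norms, mhma + ffn + layer_norms)
```

The embedding table and the prediction head add further parameters that scale with the vocabulary size.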

In [36]:
import math

In [61]:
model = Decoder(vocab_size,dmodel,dq,dk,dv,d_ff,heads,mask=None)

In [62]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

In [63]:
def train(input_ids,labels,epochs=1000):
  loss_trace = []
  for epoch in range(epochs):
    out = model(input_ids)
    # flatten to (batch*seq_len, vocab_size) logits and (batch*seq_len,) labels
    loss = criterion(out.view(-1, out.size(-1)), labels.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    loss_trace.append(loss.item())
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item()}")
  return loss_trace


In [64]:
# run the model for 10K epochs
loss_trace = train(x,y,10000)

Epoch 0, Loss: 5.224754810333252
Epoch 100, Loss: 4.919401168823242
Epoch 200, Loss: 4.673630237579346
Epoch 300, Loss: 4.4935302734375
Epoch 400, Loss: 4.3542914390563965
Epoch 500, Loss: 4.227426528930664
Epoch 600, Loss: 4.097816467285156
Epoch 700, Loss: 3.96120285987854
Epoch 800, Loss: 3.8177390098571777
Epoch 900, Loss: 3.6685128211975098
Epoch 1000, Loss: 3.515084743499756
Epoch 1100, Loss: 3.35856294631958
Epoch 1200, Loss: 3.2001712322235107
Epoch 1300, Loss: 3.0410921573638916
Epoch 1400, Loss: 2.8822875022888184
Epoch 1500, Loss: 2.724494218826294
Epoch 1600, Loss: 2.5684170722961426
Epoch 1700, Loss: 2.4146533012390137
Epoch 1800, Loss: 2.2636215686798096
Epoch 1900, Loss: 2.1160006523132324
Epoch 2000, Loss: 1.972361445426941
Epoch 2100, Loss: 1.8333595991134644
Epoch 2200, Loss: 1.6995997428894043
Epoch 2300, Loss: 1.5718235969543457
Epoch 2400, Loss: 1.4505717754364014
Epoch 2500, Loss: 1.336242437362671
Epoch 2600, Loss: 1.2293504476547241
Epoch 2700, Loss: 1.129964947

The loss is about 0.09 after 10K epochs
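
One way to read these loss values is through perplexity, the exponential of the mean cross-entropy. A model that guesses uniformly over V tokens has loss ln(V), so exp(loss) can be read as the effective number of equally likely choices per token. A short sketch using the two loss values reported above:

```python
import math

initial_loss = 5.224754810333252   # loss printed at epoch 0 above
final_loss = 0.09                  # approximate loss reported after 10K epochs

# exp(mean cross-entropy) is the perplexity: the effective number of
# equally likely next-token choices the model is hesitating between.
print(f"initial perplexity ~ {math.exp(initial_loss):.1f}")
print(f"final perplexity   ~ {math.exp(final_loss):.2f}")
```

A final perplexity close to 1 means the model is almost certain about every next token in the training data, which is what memorization looks like.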

# Generate text

In [72]:
@torch.inference_mode()
def generate(model, prompt, max_words=25):
    """
    Generate text by sampling one token at a time from the model.

    Uses the global vocabulary `v` (from Tk.build_vocab()).

    Args:
        model: The trained model.
        prompt: Initial list of tokens (list of strings, not a single string).
        max_words: Maximum number of words to generate.

    Returns:
        Generated text (string).
    """
    model.eval()
    device = next(model.parameters()).device

    # Get stoi (string-to-index) and itos (index-to-string) mappings
    stoi = v.vocab.get_stoi()
    itos = v.vocab.get_itos()

    # Tokenize the prompt (convert list of tokens to indices)
    tokens = torch.tensor(
        [stoi.get(token, stoi['<unk>']) for token in prompt],
        dtype=torch.long, device=device
    )

    for _ in range(max_words):
        # Forward pass through the model
        out = model(tokens.unsqueeze(0))  # Shape: (1, seq_len, vocab_size)

        # Get the logits for the last token
        logits = out[:, -1, :]  # Shape: (1, vocab_size)

        # Sample the next token
        next_token = torch.multinomial(logits.softmax(dim=-1), 1).item()

        # Stop generation if the '<end>' token is generated
        if itos[next_token] == '<end>':
            break

        # Append the new token to the sequence
        tokens = torch.cat((tokens, torch.tensor([next_token], device=device)))

    # Detokenize the sequence (convert indices back to tokens)
    generated_text = ' '.join([itos[token.item()] for token in tokens])
    return generated_text
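
The `generate` function above samples the next token from the softmax distribution, so repeated calls can return different text. The alternative is greedy decoding, which always picks the highest-scoring token and is deterministic. A minimal plain-Python sketch contrasting the two on toy logits (the helpers `softmax`, `sample_next`, and `greedy_next` are hypothetical names):

```python
import math
import random

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw a token ID in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_next(logits):
    """Always pick the most likely token ID."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.5, 2.0, 1.5, -1.0]   # toy scores for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_next(logits, rng) for _ in range(1000)]
print("greedy :", greedy_next(logits))
print("sampled:", {i: samples.count(i) for i in range(len(logits))})
```

In `generate`, greedy decoding would correspond to replacing the `torch.multinomial(...)` call with `logits.argmax(dim=-1).item()`.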


In [82]:
generate(model,prompt=['<start>'],max_words=10)


'<start> accomplishments . this legacy inspired and enabled and personal remembrances'

In [79]:
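# note: the prompt here is a plain string, not a list, so it is iterated
# character by character and most characters map to <unk>
generate(model,prompt='<start>',max_words=25)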

'<unk> s <unk> a <unk> <unk> <unk> and assumptions and a thought drove me . this book the world . . this biography that shaped hamming .'

* Note that the model has memorized the sentences from the training set. Given the start token, if your implementation reproduces a sentence exactly as it appears in the training set, then your implementation is likely to be correct.
* Suppose the prompt is `<start> best known`; we then expect the model to produce the first sentence as is

In [70]:
generate(model,prompt=['<start>','best','known'],max_words=25)


'<start> best known for the invention of research uncovered the result particular events and achieved . . this book is a helping others . hamming .'

In [None]:
generate(model,prompt=['<start>','best','known'],max_words=25)

'for the invention of error correcting codes , he was a true polymath who applied his mathematical and problem-solving skills to numerous disciplines'

* Change the prompt

In [73]:
generate(model,prompt=['<start>','reflecting','on'],max_words=25)


'<start> reflecting on the significant and enabled and enabled and turning points recalled by those timeless gift to details of and helping others . one thought drove me'

In [None]:
generate(model,prompt=['<start>','reflecting','on'],max_words=25)

'the significant benefits i received from hamming , i decided to develop a tribute to his legacy . there has not been a'