<a href="https://colab.research.google.com/github/Mohammad-Amin-Jenadele/Shakespeare-Small-GPT-Training/blob/dev/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preprocessing

In [None]:
import numpy as np
import nltk
import torch
import torch.nn as nn
from torch.nn import functional as F

import utils  # project helper module providing TextTokenizer and get_batch

In [None]:
# Opening the Shakespeare.txt file
with open('Shakespeare.txt', 'r') as file:
    # Read the contents of the file
    text = file.read()

In [None]:
print(f'Length of the text : {len(text)}\n')
print(f'First 1000 characters of the text : \n{text[:1000]}')

Length of the text : 1115394

First 1000 characters of the text : 
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods 

In [None]:
# Making a dictionary for the text
nltk.download('punkt')
repetition_threshold = 10  # Set your desired repetition threshold
tokenizer = utils.TextTokenizer(repetition_threshold)
tokenizer.process_text(text)
vocab_size = len(tokenizer.get_token_dict())
# Example text and tokenized text
example_text = "First Citizen:\nLet us kill him, and we'll have corn at our own price.\nIs't a verdict?"
tokenized_text = tokenizer.text_to_tokens(example_text)
print("Tokenized Text:", tokenized_text)

# Convert tokenized text back to original text
original_text = tokenizer.tokens_to_text(tokenized_text)
print("Original Text:", original_text)
print(f'The length the tokenizer : {vocab_size}')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokenized Text: [0, 3, 4, 5, 6, 37, 38, 39, 40, 12, 41, 42, 43, 44, 45, 46, 47, 1, 16, 6, 48, 49, 1, 27, 2]
Original Text: First Citizen : 
 Let us kill him , and we'll have corn at our own <UNK> . 
 Is't a <UNK> ?
The length the tokenizer : 1979
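
The `<UNK>` entries in the output above appear to come from the repetition threshold: words that occur too rarely are left out of the vocabulary. `utils.TextTokenizer` itself is not shown in this notebook, so the following is only a minimal sketch of the idea. The class name `ThresholdTokenizer`, the regex-based splitting (instead of NLTK), the `>=` comparison, and the id assigned to `<UNK>` are all assumptions made for illustration.

```python
import re
from collections import Counter


class ThresholdTokenizer:
    """Hypothetical stand-in for utils.TextTokenizer (not the author's code)."""

    def __init__(self, repetition_threshold):
        self.repetition_threshold = repetition_threshold
        # The id used for <UNK> is an assumption made for this sketch.
        self.token_dict = {"<UNK>": 0}
        self.id_to_token = {0: "<UNK>"}

    @staticmethod
    def split(text):
        # Words (with inner apostrophes), newlines, or single punctuation marks.
        return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|\n|[^\sA-Za-z]", text)

    def process_text(self, text):
        # Keep only tokens that occur at least `repetition_threshold` times.
        counts = Counter(self.split(text))
        for token, count in counts.items():
            if count >= self.repetition_threshold:
                self.token_dict[token] = len(self.token_dict)
        self.id_to_token = {i: t for t, i in self.token_dict.items()}

    def text_to_tokens(self, text):
        unk = self.token_dict["<UNK>"]
        return [self.token_dict.get(t, unk) for t in self.split(text)]

    def tokens_to_text(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


toy = ThresholdTokenizer(repetition_threshold=2)
toy.process_text("to be or not to be , that is the question")
ids = toy.text_to_tokens("to be or not")
print(ids)                      # [1, 2, 0, 0]
print(toy.tokens_to_text(ids))  # to be <UNK> <UNK>
```

Lowering the threshold keeps more words, which is consistent with the vocabulary growing from 1979 to 3225 entries when the threshold drops from 10 to 5 later in the notebook.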


In [None]:
# tokenizing the entire Shakespeare text
data = torch.tensor(tokenizer.text_to_tokens(text))
print(data.shape)
print(data[:1000])

torch.Size([290403])
tensor([  0,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
         16,   6,   6,  17,   5,   6,  18,  12,  15,  16,   6,   6,   3,   4,
          5,   6,  19,  20,  21,  22,  23,  24,  25,  26,  24,   1,  27,   6,
          6,  17,   5,   6,   1,  16,  22,  16,   6,   6,   3,   4,   5,   6,
          3,  12,  28,  29,  30,  31,  32,   1,  33,  24,  34,  35,  16,   6,
          6,  17,   5,   6,  36,   1,  12,   8,   1,  16,   6,   6,   3,   4,
          5,   6,  37,  38,  39,  40,  12,  41,  42,  43,  44,  45,  46,  47,
          1,  16,   6,  48,  49,   1,  27,   6,   6,  17,   5,   6,  50,  51,
          1,  52,  53,  54,  55,  56,  57,   5,  58,  12,  58,  59,   6,   6,
         60,   4,   5,   6,  61,  62,  12,  63,  64,  16,   6,   6,   3,   4,
          5,   6,  36,  20,   1,  65,  64,  12,  34,   1,  63,  16,   6,  66,
         67,   1,  68,  69,   1,  38,   5,  70,  71,   6,  69,  72,  38,  73,
         34,   1,  12,  74,  55,  75,   6, 

In [None]:
# Splitting the validation and training data
n = int(0.9 * len(data))
training_set = data[:n]
validation_set = data[n:]

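In the cell below, we set `block_size`, the length of each training chunk. A single chunk packs several training examples: every prefix of the chunk is a context, and the token that follows it is the target. The cell below illustrates this with a chunk of `block_size` tokens, which yields `block_size - 1` context/target pairs.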

In [None]:
block_size = 8
x = training_set[:block_size]
print(f'an example of a data : {x}')
for t in range(1,block_size):
    context = x[:t]
    target = x[t]
    print(f'input : {context} , target : {target}')

an example of a data : tensor([0, 3, 4, 5, 6, 7, 8, 9])
input : tensor([0]) , target : 3
input : tensor([0, 3]) , target : 4
input : tensor([0, 3, 4]) , target : 5
input : tensor([0, 3, 4, 5]) , target : 6
input : tensor([0, 3, 4, 5, 6]) , target : 7
input : tensor([0, 3, 4, 5, 6, 7]) , target : 8
input : tensor([0, 3, 4, 5, 6, 7, 8]) , target : 9


In [None]:
batch_size = 4
block_size = 8  # maximum context length
device = 'cuda' if torch.cuda.is_available() else 'cpu'

x_b , y_b = utils.get_batch(training_set , batch_size = batch_size , block_size = block_size)
print(f'inputs =\n{x_b}')
print(f'outputs =\n{y_b}')

# An Example
print('------------------------ EXAMPLE ------------------------')

for b in range(batch_size):
    for t in range(block_size):
        context = x_b[b , : t+1]
        target = y_b[b  , t]
        print(f'input : {context.tolist()} , target : {target}')


inputs =
tensor([[  12,   24,  122,    6,   34,  712,   12,   96],
        [  28,   12, 1072,    6,   28,   41,  180,   91],
        [   1,    5,    6,    1, 1334,  313,   41,    1],
        [  25,   27,    6,  150, 1137,   82,  844,   32]])
outputs =
tensor([[  24,  122,    6,   34,  712,   12,   96,  185],
        [  12, 1072,    6,   28,   41,  180,   91,   28],
        [   5,    6,    1, 1334,  313,   41,    1,  844],
        [  27,    6,  150, 1137,   82,  844,   32,  206]])
------------------------ EXAMPLE ------------------------
input : [12] , target : 24
input : [12, 24] , target : 122
input : [12, 24, 122] , target : 6
input : [12, 24, 122, 6] , target : 34
input : [12, 24, 122, 6, 34] , target : 712
input : [12, 24, 122, 6, 34, 712] , target : 12
input : [12, 24, 122, 6, 34, 712, 12] , target : 96
input : [12, 24, 122, 6, 34, 712, 12, 96] , target : 185
input : [28] , target : 12
input : [28, 12] , target : 1072
input : [28, 12, 1072] , target : 6
input : [28, 12, 1072, 6] ,
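
In the batch printed above, each row of `outputs` is the matching row of `inputs` shifted one token to the right. `utils.get_batch` is not shown in this notebook, so the following is only a sketch of how such a helper might work. The name `get_batch_sketch`, the `device` argument, and the uniform random sampling of start offsets are assumptions.

```python
import torch


def get_batch_sketch(data, batch_size, block_size, device="cpu"):
    """Hypothetical stand-in for utils.get_batch (not the author's code)."""
    # Random start offsets; subtracting block_size leaves room for the shifted targets.
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x.to(device), y.to(device)


toy_data = torch.arange(100)
xb, yb = get_batch_sketch(toy_data, batch_size=4, block_size=8)
print(xb.shape, yb.shape)                   # torch.Size([4, 8]) torch.Size([4, 8])
print(torch.equal(xb[:, 1:], yb[:, :-1]))   # True: targets are inputs shifted by one
```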

## Bigram Language Model Implementation

The Bigram language model is the simplest type of language model, predicting the next token based solely on the previous one. In this section, we implement such a model, and you can see the results at the end.
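
To make "based solely on the previous one" concrete, here is a small sketch with a toy vocabulary of 5 tokens (an assumption for illustration; the notebook's vocabulary is much larger). The embedding table acts as a lookup: row `i` holds the logits for whatever follows token `i`, so two sequences with different histories but the same last token get identical predictions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab_size = 5  # hypothetical tiny vocabulary, for illustration only
table = nn.Embedding(toy_vocab_size, toy_vocab_size)  # row i = logits for the token after i

a = torch.tensor([[0, 1, 2, 3]])
b = torch.tensor([[4, 4, 4, 3]])  # different history, same last token

logits_a = table(a)[:, -1, :]  # logits at the last position of sequence a
logits_b = table(b)[:, -1, :]  # logits at the last position of sequence b
print(torch.equal(logits_a, logits_b))  # True: only the last token matters
```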

In [None]:
class BigramLanguageModel(nn.Module):
    def __init__(self , vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(num_embeddings = vocab_size , embedding_dim = vocab_size)

    def forward(self , idx , targets = None):
        logits = self.token_embedding_table(idx)    # (Batch_size, Time or block_size , Channels = Embedding size)

        if targets is None :
            loss = None
        else :
            B , T , C = logits.shape
            logits = logits.reshape(B * T , C)
            targets = targets.reshape(B * T)
            loss = F.cross_entropy(logits , targets)
        return logits , loss

    def generate(self , idx , max_new_tokens):
        # Sample new tokens one at a time ; each prediction depends only on the last token
        for _ in range(max_new_tokens):
            logits , loss = self(idx)
            logits = logits[: , -1 , :]  # keep only the logits at the last time step
            probs = F.softmax(logits , dim = 1)
            idx_next = torch.multinomial(probs , num_samples = 1)
            idx = torch.cat([idx , idx_next] , dim = 1)
        return idx


m =  BigramLanguageModel(vocab_size = vocab_size).to(device)
logits , loss = m(x_b , y_b)
print(loss)
print(tokenizer.tokens_to_text(m.generate(torch.zeros((1 , 1) , dtype = torch.long , device = device) , max_new_tokens = 40)[0].tolist()))  # showing the results of untrained model

tensor(8.1455, grad_fn=<NllLossBackward0>)
duty ye See touch O nine Faith home Lest ways glass ladies pray towards minister Art seal go needful foes lady's BUSHY died something was Page rest from Pray woman's and could knave evil hath rage act shame field Tell
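
A quick sanity check on the starting loss of about 8.1: a model that guesses uniformly over the 1979 tokens would score ln(1979) ≈ 7.59, and logits drawn from a standard normal (the default `nn.Embedding` initialization) add roughly 0.5 on top of that. The sketch below measures this on random token ids. The 0.5 correction is a large-vocabulary approximation, not an exact result, and the sample of 4096 tokens is an arbitrary choice.

```python
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 1979  # vocabulary size reported earlier in the notebook

uniform_loss = math.log(vocab_size)      # loss of a uniform guess
expected_init_loss = uniform_loss + 0.5  # approximation for N(0, 1) logits

table = nn.Embedding(vocab_size, vocab_size)  # default init: weights ~ N(0, 1)
idx = torch.randint(vocab_size, (4096,))      # random "current" tokens
targets = torch.randint(vocab_size, (4096,))  # random "next" tokens
measured = F.cross_entropy(table(idx), targets).item()

print(f"uniform : {uniform_loss:.2f}")
print(f"expected at init : {expected_init_loss:.2f}")
print(f"measured : {measured:.2f}")
```

A starting loss well above this range would suggest a problem with the data pipeline or the loss computation rather than with the model.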


In [None]:
# model training
optimizer = torch.optim.AdamW(m.parameters() , lr = 1e-2)
batch_size = 32
# training loop
for iteration in range(4000):
    x_b , y_b = utils.get_batch(training_set ,block_size = block_size , batch_size = batch_size)
    logits , loss = m(x_b , y_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f'iteration : {iteration} , loss : {loss}')


iteration : 0 , loss : 7.999255180358887
iteration : 1 , loss : 8.000387191772461
iteration : 2 , loss : 8.10175609588623
iteration : 3 , loss : 8.054214477539062
iteration : 4 , loss : 8.084694862365723
iteration : 5 , loss : 8.066719055175781
iteration : 6 , loss : 8.02578067779541
iteration : 7 , loss : 8.013591766357422
iteration : 8 , loss : 7.935837268829346
iteration : 9 , loss : 7.968321800231934
iteration : 10 , loss : 7.990190505981445
iteration : 11 , loss : 8.023605346679688
iteration : 12 , loss : 8.142383575439453
iteration : 13 , loss : 8.10863971710205
iteration : 14 , loss : 8.037957191467285
iteration : 15 , loss : 8.018180847167969
iteration : 16 , loss : 8.065342903137207
iteration : 17 , loss : 7.953418731689453
iteration : 18 , loss : 7.97348690032959
iteration : 19 , loss : 7.878756523132324
iteration : 20 , loss : 7.963230133056641
iteration : 21 , loss : 7.919926166534424
iteration : 22 , loss : 7.884855270385742
iteration : 23 , loss : 7.787797927856445
iterat

In [None]:
print(tokenizer.tokens_to_text(m.generate(torch.zeros((1 , 1) , dtype = torch.long , device = device) , max_new_tokens = 40)[0].tolist()))  # Showing the result of trained model

Those easy <UNK> ! one <UNK> , whiles forgot . 
 Third Citizen : 
 out there lies such men depart plant receive some <UNK> 
 Volsce : 
 And wouldst thou ? 
 HORTENSIO consent Hast MERCUTIO : 



It is evident that the model has learned to predict the next token based on the appearance of the current one. At this stage, we notice a hint of Shakespearean style in the results. However, since the model only considers the previous token and not all preceding tokens, it falls short of generating true Shakespeare-like phrases.

## Transformer model

Unlike the bigram model, which only considers the previous token to predict the next one, the transformer model takes into account all preceding tokens within its context window (up to `block_size` tokens). This wider context enables the model to determine the next token more accurately.
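
The "preceding tokens only" constraint is enforced with a lower-triangular mask. As a minimal sketch with a toy length of 4 (an assumption for illustration), the code below masks a matrix of equal attention scores and applies softmax: position `t` ends up spreading its attention evenly over positions `0..t` and gives zero weight to the future.

```python
import torch
from torch.nn import functional as F

T = 4  # toy sequence length
scores = torch.zeros(T, T)            # pretend all query-key scores are equal
mask = torch.tril(torch.ones(T, T))   # lower-triangular: position t sees 0..t
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)   # each row sums to 1
print(weights)
```

Row 0 attends only to itself, row 1 splits its weight 0.5 / 0.5 over the first two positions, and the last row assigns 0.25 to each of the four positions.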

In [111]:
head_size = 32
block_size = 32
embedding_size = 64
lr = 1e-3
max_iter = 4000
batch_size = 32
number_of_heads = 32
number_of_attention_blocks = 4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [114]:
repetition_threshold = 5  # Set your desired repetition threshold
tokenizer = utils.TextTokenizer(repetition_threshold)
tokenizer.process_text(text)
vocab_size = len(tokenizer.get_token_dict())
print(vocab_size)

3225


In [115]:
# tokenizing the entire Shakespeare text
data = torch.tensor(tokenizer.text_to_tokens(text))
print(data.shape)
print(data[:1000])

torch.Size([290403])
tensor([  0,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
         16,   6,   6,  17,   5,   6,  18,  12,  15,  16,   6,   6,   3,   4,
          5,   6,  19,  20,  21,  22,  23,  24,  25,  26,  24,   1,  27,   6,
          6,  17,   5,   6,   1,  16,  22,  16,   6,   6,   3,   4,   5,   6,
          3,  12,  28,  29,  30,  31,  32,  33,  34,  24,  35,  36,  16,   6,
          6,  17,   5,   6,  37,  38,  12,   8,  38,  16,   6,   6,   3,   4,
          5,   6,  39,  40,  41,  42,  12,  43,  44,  45,  46,  47,  48,  49,
         50,  16,   6,  51,  52,   1,  27,   6,   6,  17,   5,   6,  53,  54,
          1,  55,  56,  57,  58,  59,  60,   5,  61,  12,  61,  62,   6,   6,
         63,   4,   5,   6,  64,  65,  12,  66,  67,  16,   6,   6,   3,   4,
          5,   6,  37,  20,   1,  68,  67,  12,  35,  69,  66,  16,   6,  70,
         71,   1,  72,  73,   1,  40,   5,  74,  75,   6,  73,  76,  40,  77,
         35,   1,  12,  78,  58,  79,   6, 

In [116]:
# Splitting the validation and training data
n = int(0.9 * len(data))
training_set = data[:n]
validation_set = data[n:]

In [117]:
x_b , y_b = utils.get_batch(training_set , block_size = block_size , batch_size = batch_size)

In [118]:
# Implementing a single head of transformer
class Head(nn.Module):
    """ One head of self-attention module """
    def __init__(self , head_size , embedding_size , block_size):
        super().__init__()
        self.key = nn.Linear(embedding_size , head_size , bias = False)
        self.query = nn.Linear(embedding_size , head_size , bias = False)
        self.value = nn.Linear(embedding_size , head_size , bias = False)
        # register_buffer stores the lower-triangular causal mask of shape (block_size , block_size) on the module :
        # it is not a trainable parameter , but it moves with .to(device) and is saved in the state_dict
        self.register_buffer('tril' , torch.tril(torch.ones(block_size , block_size)))
        self.dropout = nn.Dropout(0.2)

    def forward(self , x):
        B , T , C = x.shape
        k = self.key(x) # (B , T , head_size)
        q = self.query(x) # (B , T , head_size)
        v = self.value(x) # (B , T , head_size)
        product =  q @ k.transpose(1 , 2)   # (B , T , head_size) @ (B , head_size , T) --> (B , T , T)
        product = product * (C ** -0.5)   # scaled by the embedding size C ; the usual choice is head_size ** -0.5
        product = product.masked_fill(self.tril[:T , :T] == 0 , float('-inf'))  # hide future positions
        product = F.softmax(product , dim = 2) # (B , T , T) , each row sums to 1
        product = self.dropout(product)
        out = product @ v # (B , T , T) @ (B , T , head_size) --> (B , T , head_size)
        return out
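
As a cross-check on the masking and softmax steps in `Head.forward`, the sketch below compares a manual causal attention against PyTorch's built-in `F.scaled_dot_product_attention`. It assumes PyTorch 2.0 or newer, uses toy sizes, and leaves out dropout and the learned projections. The built-in scales scores by the size of the query's last dimension (the head size), so the manual version here does the same.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, head_size = 2, 8, 16  # toy sizes chosen for this sketch
q, k, v = (torch.randn(B, T, head_size) for _ in range(3))

# Manual causal attention, scaled by head_size ** -0.5.
scores = (q @ k.transpose(1, 2)) * head_size ** -0.5   # (B, T, T)
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float("-inf"))
manual = F.softmax(scores, dim=-1) @ v                 # (B, T, head_size)

# Built-in equivalent (assumes PyTorch 2.0 or newer).
builtin = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(manual, builtin, atol=1e-5))
```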

In [119]:
# Concatenating several single-head attentions to form a multi-head attention
class MultiHeadAttention(nn.Module):

    def __init__(self , number_of_heads , head_size , embedding_size , block_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size , embedding_size , block_size) for _ in range(number_of_heads)])
        self.dropout = nn.Dropout(0.2)

    def forward(self , x):
        out = torch.cat([head(x) for head in self.heads] , dim = -1)
        out = self.dropout(out)
        return out
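
A short shape check for the concatenation, using the `number_of_heads` and `head_size` values from the hyperparameter cell and random tensors as stand-ins for the per-head outputs (the batch size and sequence length are arbitrary). With 32 heads of size 32, the concatenated output is 1024 wide, which is much larger than the 64-dimensional embedding and has to be projected back down afterwards.

```python
import torch

number_of_heads, head_size = 32, 32  # values from the hyperparameter cell
B, T = 2, 8                          # arbitrary batch size and sequence length

# Stand-ins for the output of each attention head: (B, T, head_size).
head_outputs = [torch.randn(B, T, head_size) for _ in range(number_of_heads)]
out = torch.cat(head_outputs, dim=-1)  # concatenate along the feature dimension
print(out.shape)  # torch.Size([2, 8, 1024])
```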

In [120]:
# A simple feed-forward layer with a ReLU non-linearity
class FeedForward(nn.Module):
    def __init__(self , n):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n , n) , nn.ReLU() , nn.Dropout(0.2))
    def forward(self , x):
        return  self.net(x)

In [121]:
# Implementing one block of attention
class Block(nn.Module):
    def __init__(self , number_of_heads , head_size , embedding_size , block_size):
        super().__init__()
        self.self_attention_heads = MultiHeadAttention(number_of_heads = number_of_heads , head_size = head_size , embedding_size = embedding_size , block_size = block_size) # (B , T , number_of_heads * head_size)
        self.linear_1 = nn.Linear(embedding_size , number_of_heads * head_size )
        self.linear_2 = nn.Linear(number_of_heads * head_size , embedding_size)
        self.relu = nn.ReLU()
        self.feed_forward = FeedForward(number_of_heads * head_size)
        self.norm_1 = nn.LayerNorm(number_of_heads * head_size)
        self.norm_2 = nn.LayerNorm(number_of_heads * head_size)
    def forward(self , x):
        x_1 = self.self_attention_heads(x) # (B , T , number_of_heads * head_size)
        x_2 = self.linear_1(x) # (B , T , number_of_heads * head_size)
        x_2 = self.relu(x_2) # (B , T , number_of_heads * head_size)
        x = x_1 + x_2 # (B , T , number_of_heads * head_size)
        x = self.norm_1(x) # (B , T , number_of_heads * head_size)
        x_1 = self.feed_forward(x)  # (B , T , number_of_heads * head_size)
        x = x + x_1 # (B , T , number_of_heads * head_size)
        x = self.norm_2(x) # (B , T , number_of_heads * head_size)
        out = self.linear_2(x) # (B , T , embedding_size)
        return  out

In [122]:
# Implementing the transformer module
class Transformer(nn.Module):
    def __init__(self , vocab_size , embedding_size , block_size , number_of_heads , number_of_attention_blocks):
        super().__init__()
        self.number_of_attention_blocks = number_of_attention_blocks
        self.token_embedding_table = nn.Embedding(num_embeddings = vocab_size , embedding_dim = embedding_size)
        # Only positions 0 .. block_size - 1 are ever looked up , so block_size rows would be enough here
        self.positional_embedding = nn.Embedding(num_embeddings = vocab_size , embedding_dim = embedding_size)
        self.linear_head = nn.Linear(embedding_size , vocab_size)
        # head_size is read from the notebook-level hyperparameter cell
        self.block = Block(number_of_heads , head_size , embedding_size , block_size)
    def forward(self , idx , targets = None):
        B , T = idx.shape
        token_embedding = self.token_embedding_table(idx)  # (B , T , C)
        positional_embedding = self.positional_embedding(torch.arange(T , device = idx.device))  # (T , C)
        x = token_embedding + positional_embedding # (B , T , C)
        # The same block (shared weights) is applied number_of_attention_blocks times
        for _ in range(self.number_of_attention_blocks):
            x = self.block(x)  # (B , T , C)
        x = self.linear_head(x)

        if targets is None :
            loss = None
        else :
            B , T , vocab_size = x.shape
            x = x.reshape(B * T , vocab_size)
            targets = targets.reshape(B * T)
            loss = F.cross_entropy(x ,targets)

        return x  , loss



    def generate(self , idx , max_new_tokens):
        # Sample new tokens one at a time , conditioning on at most the last block_size tokens
        for _ in range(max_new_tokens):
            idx_cond = idx[: , -block_size:]  # crop to the most recent block_size tokens
            logits , loss = self(idx_cond)
            logits = logits[: , -1 , :]  # keep only the logits at the last time step
            probs = F.softmax(logits , dim = 1)
            idx_next = torch.multinomial(probs , num_samples = 1)
            idx = torch.cat([idx , idx_next] , dim = 1)
            if idx_next == tokenizer.token_dict['<EOS>']: # If <EOS> was generated , stop the generation
                break
        return idx
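
Once the running sequence grows beyond `block_size`, the context passed to the model has to be the most recent `block_size` tokens, since the last position of that window is the one whose logits are sampled. The sketch below, with a toy `block_size` of 4 (an assumption for illustration), shows the difference between slicing from the front and slicing from the back.

```python
import torch

block_size = 4                        # toy context length for this sketch
idx = torch.arange(10).unsqueeze(0)   # a (1, 10) sequence of token ids 0..9

first_tokens = idx[:, :block_size]    # oldest tokens: ignores everything generated later
last_tokens = idx[:, -block_size:]    # most recent tokens: ends at the newest token
print(first_tokens.tolist())  # [[0, 1, 2, 3]]
print(last_tokens.tolist())   # [[6, 7, 8, 9]]
```

For sequences no longer than `block_size`, both slices return the whole sequence, so the difference only shows up when `max_new_tokens` exceeds the context length.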

In [123]:
model = Transformer(vocab_size=vocab_size , embedding_size = embedding_size , block_size = block_size ,number_of_heads = number_of_heads , number_of_attention_blocks = number_of_attention_blocks).to(device)

In [None]:
# Loading the pretrained weights of the model . If you have changed the hyperparameters , skip this cell
path = './Shakespeare.pth'  # same file the training cell below writes to
model.load_state_dict(torch.load(path , map_location = device))

In [212]:
# model training
optimizer = torch.optim.AdamW(model.parameters() , lr = lr)
# training loop
for iteration in range(max_iter):
    x_b , y_b = utils.get_batch(training_set , batch_size = batch_size , block_size = block_size)
    logits , loss = model(x_b , y_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f'iteration : {iteration} , loss : {loss}')
path = './Shakespeare.pth'
torch.save(model.state_dict(), path)

iteration : 0 , loss : 2.939119338989258
iteration : 1 , loss : 3.0620498657226562
iteration : 2 , loss : 3.0294766426086426
iteration : 3 , loss : 2.991055488586426
iteration : 4 , loss : 3.1230552196502686
iteration : 5 , loss : 3.0152688026428223
iteration : 6 , loss : 3.028444766998291
iteration : 7 , loss : 3.024608612060547
iteration : 8 , loss : 3.1718361377716064
iteration : 9 , loss : 3.1131813526153564
iteration : 10 , loss : 3.0460281372070312
iteration : 11 , loss : 3.1370849609375
iteration : 12 , loss : 3.123985528945923
iteration : 13 , loss : 3.2197506427764893
iteration : 14 , loss : 3.0359132289886475
iteration : 15 , loss : 2.8142285346984863
iteration : 16 , loss : 3.2012112140655518
iteration : 17 , loss : 2.922044038772583
iteration : 18 , loss : 2.9342970848083496
iteration : 19 , loss : 2.879683256149292
iteration : 20 , loss : 3.0925698280334473
iteration : 21 , loss : 3.170656204223633
iteration : 22 , loss : 3.0177347660064697
iteration : 23 , loss : 3.207362

In [265]:
# Showing the result of trained model
for _ in range(5):
  print(tokenizer.tokens_to_text(model.generate(torch.zeros((1 , 1) , dtype = torch.long).to(device) , max_new_tokens = 32)[0].tolist()))
  print('------------------------')

here banished : the leaves the <UNK> , 
 Are sick that they have said to get it so . 
 
 QUEEN ELIZABETH : 
 What , will we make our
------------------------
with ease sense 
 As is a quarrel . 
 
 Officer : 
 I say , would you <UNK> you not at our wants ? 
 
 BENVOLIO : 
 Thou
------------------------
<UNK> , 
 Each thing in marriage , <UNK> is now in my tent ; 
 And for grief <UNK> <UNK> , nine , <UNK> and <UNK> me ? 
 
 KING
------------------------
true . 
 
 PETRUCHIO : 
 Peace is tired : what are you toward the father 
 That we may do <UNK> . Lord Angelo , 
 Was now but to
------------------------
one self nor thought to seek 
 Upon the danger : so it is but a list 
 To <UNK> my tongue <UNK> to your <UNK> . 
 
 LADY ANNE :
------------------------


As can be seen, the loss has decreased significantly, the output is closer to Shakespearean style, and the model produces more meaningful phrases.

Feel free to train the model longer and change the hyperparameters to get better results.
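
When comparing hyperparameter settings, it helps to track the loss on `validation_set`, which is split off earlier but not used in the training loops. The following is a sketch of one way to do that: the helper name `estimate_loss`, the `eval_iters` parameter, and the `TinyModel` / `toy_get_batch` stand-ins are all hypothetical and exist only so the block runs on its own. The helper switches the model to evaluation mode so that dropout is disabled while measuring.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


@torch.no_grad()
def estimate_loss(model, data, get_batch, batch_size, block_size, eval_iters=20):
    """Average the loss over several random batches with dropout disabled."""
    was_training = model.training
    model.eval()
    losses = torch.zeros(eval_iters)
    for i in range(eval_iters):
        x_b, y_b = get_batch(data, batch_size=batch_size, block_size=block_size)
        _, loss = model(x_b, y_b)
        losses[i] = loss.item()
    model.train(was_training)  # restore the previous mode
    return losses.mean().item()


# --- Hypothetical stand-ins so the sketch runs on its own ---
class TinyModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        return logits, F.cross_entropy(logits.reshape(B * T, C), targets.reshape(B * T))


def toy_get_batch(data, batch_size, block_size):
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x, y


toy_data = torch.randint(10, (200,))
toy_model = TinyModel(vocab_size=10)
val_loss = estimate_loss(toy_model, toy_data, toy_get_batch, batch_size=4, block_size=8)
print(f"validation loss : {val_loss:.4f}")
```

In the notebook, the same helper could be called every few hundred iterations with `model`, `validation_set`, and `utils.get_batch`, assuming `utils.get_batch` accepts the keyword arguments used in the training cells. A validation loss that rises while the training loss keeps falling is a sign of overfitting.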