## <b>Bi-Gram Language Model</b>


In this notebook, we build a character-level "Bi-Gram" language model: a model that predicts the next character from the current character alone.

In [93]:
'''
Text taken into consideration : The Adventures of Sherlock Holmes (Project Gutenberg)

LINK : https://www.gutenberg.org/ebooks/1661
'''


In [94]:
# Switching from CPU to GPU

import torch
import torch.nn as nn
from torch.nn import functional as F


device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

'''
Training on a CPU is best avoided here: a CPU has only a handful of cores, so the large
tensor operations involved run with little parallelism.
With a huge amount of data (text), training on a CPU takes a long time 👻

A GPU runs thousands of lightweight threads in parallel, which suits these operations.
'''

cuda



In [95]:
# Opening the text file (the book)

with open("SherlockHolms.txt", "r", encoding = 'utf-8') as f: # Character encoding = 'utf-8'
    text = f.read()

print(len(text))

562465


In [96]:
print(text[:200])

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release date: March 1, 1999 [eBook #1661]
                Most recently updated: October 10, 2023

Language: English

Credits: an


In [97]:
# Making a vocabulary set (of unique characters)

chars = sorted(set(text))
print(chars)
vocab_size = len(chars)

['\n', ' ', '!', '#', '&', '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '£', '½', 'à', 'â', 'æ', 'è', 'é', 'œ', '—', '‘', '’', '“', '”', '\ufeff']


In [98]:
# Encoder - encoding every character shown above by assigning it a unique number

# Number of unique characters above
print(len(chars))

92


In [99]:
# Encode - Decode

string_to_int = {ch:i for i,ch in enumerate(chars)} # Dictionary of encoded character values
int_to_string = {i:ch for i,ch in enumerate(chars)} # Dictionary of decoded character values

encode = lambda s : [string_to_int[c] for c in s]
decode = lambda l : ''.join([int_to_string[i] for i in l])

# Encoding
encoded_hello = encode("hello")
print(encoded_hello)

[59, 56, 63, 63, 66]


In [100]:
# Decoding
decoded_hello = decode([59, 56, 63, 63, 66])
print(decoded_hello)

hello
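
A quick sanity check on the two mappings is that decoding an encoded string returns the original. The sketch below rebuilds the same kind of lookup tables on a tiny made-up string (`sample_text` and the `toy_` names are illustrative stand-ins, not part of the notebook), so the integer ids differ from the ones printed above.

```python
# Minimal sketch: character-level encode/decode on a toy string.
sample_text = "hello world"  # hypothetical stand-in for the book text

toy_chars = sorted(set(sample_text))                    # unique characters, sorted
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}    # character -> integer id
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}    # integer id -> character

def toy_encode(s):
    # Look up the id of every character in the string
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    # Map every id back to its character and join them
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)
print(toy_decode(ids))
```

Any character that is missing from the vocabulary raises a `KeyError` in this scheme, so the vocabulary has to be built from the same text that is later encoded.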


In [101]:
'''
Tokenization can happen at the word level or at the character level.
At the word level, the vocabulary can be humongous !!! At the character level it stays small
(92 unique characters here), but every text becomes a much longer sequence of tokens.
'''
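
To make the trade-off concrete, the sketch below compares how the two vocabularies behave when more text is added. The two strings are made up for illustration: the longer one introduces new words but no new letters, so the character vocabulary stays the same size while the word vocabulary grows.

```python
# Minimal sketch: character-level vs word-level vocabulary growth.
small_text = "the cat sat"                               # hypothetical sample
large_text = "the cat sat at the seat the chat teaches"  # same letters, more words

def char_vocab(text):
    # Unique characters (including the space)
    return sorted(set(text))

def word_vocab(text):
    # Unique whitespace-separated words
    return sorted(set(text.split()))

for name, text in [("small", small_text), ("large", large_text)]:
    print(name,
          "| char vocab:", len(char_vocab(text)),
          "| word vocab:", len(word_vocab(text)),
          "| char tokens:", len(text),
          "| word tokens:", len(text.split()))
```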

In [102]:
# Let's do the above using tensors (PyTorch)
'''
Put everything we saw above inside tensors, so that PyTorch can easily work with it
'''


In [103]:
# Encoding the whole text and storing it in a tensor of 64-bit integers (torch.long)

data = torch.tensor(encode(text), dtype = torch.long)
print(data[:100])

tensor([91, 42, 60, 71, 63, 56, 20,  1, 42, 59, 56,  1, 23, 55, 73, 56, 65, 71,
        72, 69, 56, 70,  1, 66, 57,  1, 41, 59, 56, 69, 63, 66, 54, 62,  1, 30,
        66, 63, 64, 56, 70,  0,  0, 23, 72, 71, 59, 66, 69, 20,  1, 23, 69, 71,
        59, 72, 69,  1, 25, 66, 65, 52, 65,  1, 26, 66, 76, 63, 56,  0,  0, 40,
        56, 63, 56, 52, 70, 56,  1, 55, 52, 71, 56, 20,  1, 35, 52, 69, 54, 59,
         1, 11,  7,  1, 11, 19, 19, 19,  1, 49])


In [104]:
'''
Tensors are similar to NumPy arrays, but they can also live on a GPU and track gradients for automatic differentiation
'''


In [105]:
'''
Validation and Training Splits
'''

n = int(0.8*len(data)) # Training Data Size
train_data = data[:n]
val_data = data[n:]

# Block size : length of each training sequence (context length)
block_size = 8

# Batch size : how many blocks are processed in parallel
batch_size = 4

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # print(ix)
    X = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    X, y = X.to(device), y.to(device) # Putting the data components in currently selected device (here, GPU)
    return X, y

X, y = get_batch('train')
print('inputs : ')
print(X)
print("\n")
print('target : ')
print(y)

inputs : 
tensor([[ 1, 68, 72, 60, 54, 62,  1, 70],
        [60, 65,  1, 59, 60, 70,  1, 56],
        [59, 60, 54, 59,  1, 74, 66, 72],
        [52, 70,  1, 73, 60, 70, 60, 53]], device='cuda:0')


target : 
tensor([[68, 72, 60, 54, 62,  1, 70, 71],
        [65,  1, 59, 60, 70,  1, 56, 73],
        [60, 54, 59,  1, 74, 66, 72, 63],
        [70,  1, 73, 60, 70, 60, 53, 63]], device='cuda:0')
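
The key property of `get_batch` is that each target row is the input row shifted one position to the right in the text. The sketch below mirrors that logic with plain Python lists; `toy_data`, `toy_get_batch` and the sizes are made-up names and values for illustration, and `random.Random(0)` stands in for `torch.randint`.

```python
import random

# Minimal sketch: sampling (input, target) blocks from a token sequence.
toy_data = list(range(20))   # hypothetical token ids 0..19
toy_block_size = 4
toy_batch_size = 3

def toy_get_batch(data, block_size, batch_size, rng):
    # Valid start positions leave room for block_size inputs plus one extra target
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]
    return xs, ys

rng = random.Random(0)       # fixed seed so repeated runs sample the same blocks
xs, ys = toy_get_batch(toy_data, toy_block_size, toy_batch_size, rng)
for x_row, y_row in zip(xs, ys):
    print(x_row, "->", y_row)
```

Because `toy_data` is just `0..19`, every target row is visibly the input row plus one, which makes the one-step shift easy to check by eye.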


In [106]:
'''
Worked example with block size = 4 (the notebook itself uses block_size = 8)

Say the word "hello" is encoded as the numerical array below🔻
text = [5, 16, 89, 66, 34]

Then, for training and validation, the inputs (X) are text[:block_size] = [5, 16, 89, 66]
and the targets (y) are text[1:block_size+1] = [16, 89, 66, 34], i.e. X shifted by one position.
Each target is the character that follows the matching input, so the bi-gram model learns which character is likely to come next.
'''

X = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = X[:t+1]
    target = y[t]
    print("When Input is", context, "--> Target is : ", target)

When Input is tensor([91]) --> Target is :  tensor(42)
When Input is tensor([91, 42]) --> Target is :  tensor(60)
When Input is tensor([91, 42, 60]) --> Target is :  tensor(71)
When Input is tensor([91, 42, 60, 71]) --> Target is :  tensor(63)
When Input is tensor([91, 42, 60, 71, 63]) --> Target is :  tensor(56)
When Input is tensor([91, 42, 60, 71, 63, 56]) --> Target is :  tensor(20)
When Input is tensor([91, 42, 60, 71, 63, 56, 20]) --> Target is :  tensor(1)
When Input is tensor([91, 42, 60, 71, 63, 56, 20,  1]) --> Target is :  tensor(42)
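
A bigram model conditions only on the most recent character, so although the loop above prints growing contexts, only the last entry of each context influences the prediction. The same idea can be sketched without a neural network by counting which character follows which. This is a count-based stand-in for illustration, not the model this notebook trains, and `sample`, `follow_counts` and `next_char_probs` are hypothetical names.

```python
from collections import Counter, defaultdict

# Minimal sketch: a count-based character bigram table.
sample = "hello hello help"  # hypothetical toy corpus

# follow_counts[a][b] = how often character b appears right after character a
follow_counts = defaultdict(Counter)
for current_char, next_char in zip(sample, sample[1:]):
    follow_counts[current_char][next_char] += 1

def next_char_probs(ch):
    # Normalise the counts for one character into probabilities
    counts = follow_counts[ch]
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}

print(next_char_probs("h"))
print(next_char_probs("l"))
```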


In [107]:
'''
For estimating the loss
'''

@torch.no_grad() # Disables gradient tracking inside this function (no backward pass is needed here)
def estimate_loss():
    out = {}
    model.eval() # Model put to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, y = get_batch(split)
            logits, loss = model(X, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Model put back to training mode
    return out

In [108]:
'''
Creating the Bi-Gram Language Model
'''

class BiGramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # Embedding Matrix (vocab_size X vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index) # logits is 3-dimensional

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # (B, T, C) ==> (B*T, C) : one row of C class scores per position
            targets = targets.view(B*T) # (B, T) ==> (B*T) : one target class index per position
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, index, max_new_tokens):
        # index is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # getting the predictions
            logits, loss = self(index)
            # focussing only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # Applying Softmax to get probabilities
            probs = F.softmax(logits, dim = -1) # (B, C) ==> dim = -1 (as we are focussing on the last dimension)
            index_next = torch.multinomial(probs, num_samples = 1) # (B, 1)
            # Append sampled index to the running sequence
            index = torch.cat((index, index_next), dim = 1) # (B, T+1)
        return index
    

model = BiGramLanguageModel(vocab_size)
m = model.to(device)

context = torch.zeros((1,1), dtype = torch.long, device=device) # torch.long ==> int64
generated_chars = decode(m.generate(context, max_new_tokens = 500)[0].tolist())
print(generated_chars)


7hjCB-Q]mUpBhpe2.hâ£Do7at6fM﻿y﻿,W—HJ2﻿7’dqJ(TDP5œWl[,æd3t
ZZ”M,‘UàKr[è)Aq—à7âuD(—0y
aàcIn6n(eqyTgYpeK&bBæ[èœeNàz”bE;0x10âTANàufomG5kfqéibœPk﻿4x”F﻿cI13,“âj£Z﻿]UaAa8#IMKhF﻿ga9pekè﻿iY3f£é’dz9j7TIq6Pœ#y‘—.&jzAhF_t
4œD(âb]‘(vWovNâb½XvGiml[WKLqI)QA#yDw﻿s,(Luà”XIt pQ?6﻿Fsé36‘GklkyMDg£H”[?QA_lP8y!Eb #aaé_1B?)O75Z﻿âàP
2Pf.“C8âc7NrdB,#)’5élé“3Ps5Lâ,#n;vZ)ivna8
lcLzklo”NM£kâ;œà1nZI;BIPKT”i.IP﻿jJ;v
—WojzkFsl½f£w4G
7w﻿4.MTl_Go‘Szi#Nobæssaa½’:iMaewq,b½u£èbæCr13MqJ‘âCVb_o
jiln)#ywWqe836tBfF?_½M,’dD:’‘nxd d éLU
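
The two steps at the heart of `generate`, turning scores into probabilities and then drawing the next index at random, can be sketched in plain Python. The logits below are made-up numbers, `softmax` is a hand-written helper, and `random.choices` plays the role that `torch.multinomial` plays in the notebook.

```python
import math
import random

def softmax(logits):
    # Subtract the maximum first for numerical stability, then normalise
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-character vocabulary
probs = softmax(toy_logits)

rng = random.Random(0)         # fixed seed so the draw is repeatable
# Draw one index, with each index chosen in proportion to its probability
next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_id)
```

Sampling, rather than always taking the most probable index, is what lets the generated text vary from run to run. It is also why an untrained model produces the near-random characters shown above.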


In [109]:
# Creating a PyTorch Optimizer

learning_rate = 3e-4
max_iters = 10000 # Total number of training iterations
eval_iters = 250 # Used twice: losses are printed every 250 iterations, and estimate_loss() averages over 250 batches
# dropout = 0.2

optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

for iter in range(max_iters):
    if iter % eval_iters == 0:
        losses = estimate_loss()
        print(f"Step : {iter} ==> Training Loss : {losses['train']:.4f}, Validation Loss : {losses['val']:.4f}")

    # Sampling a batch of data
    Xb, yb = get_batch('train')

    # Forward pass : computing the loss
    logits, loss = model(Xb, yb)
    optimizer.zero_grad(set_to_none = True) # Clear old gradients so the previous step does not affect the current one
    loss.backward() # Backward Pass
    optimizer.step()
print(loss.item())

Step : 0 ==> Training Loss : 4.9886, Validation Loss : 4.9844
Step : 250 ==> Training Loss : 4.9241, Validation Loss : 4.9130
Step : 500 ==> Training Loss : 4.8519, Validation Loss : 4.8489
Step : 750 ==> Training Loss : 4.7930, Validation Loss : 4.7912
Step : 1000 ==> Training Loss : 4.7459, Validation Loss : 4.7232
Step : 1250 ==> Training Loss : 4.6703, Validation Loss : 4.6647
Step : 1500 ==> Training Loss : 4.5898, Validation Loss : 4.5835
Step : 1750 ==> Training Loss : 4.5282, Validation Loss : 4.5352
Step : 2000 ==> Training Loss : 4.4737, Validation Loss : 4.4880
Step : 2250 ==> Training Loss : 4.4386, Validation Loss : 4.4361
Step : 2500 ==> Training Loss : 4.3775, Validation Loss : 4.3695
Step : 2750 ==> Training Loss : 4.3031, Validation Loss : 4.2857
Step : 3000 ==> Training Loss : 4.2473, Validation Loss : 4.2692
Step : 3250 ==> Training Loss : 4.1839, Validation Loss : 4.1959
Step : 3500 ==> Training Loss : 4.1612, Validation Loss : 4.1619
Step : 3750 ==> Training Loss :
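
One way to read the loss values logged above is against the loss of a model that assigns equal probability to all 92 characters, for which the cross-entropy is -ln(1/92). The short calculation below gives that reference point only; `vocab_size_assumed` is an illustrative name that reuses the vocabulary size printed earlier.

```python
import math

# Minimal sketch: cross-entropy of a uniform guess over the vocabulary.
vocab_size_assumed = 92  # matches len(chars) printed earlier in the notebook

# Every character gets probability 1/92, so the loss is -ln(1/92) = ln(92)
uniform_loss = -math.log(1.0 / vocab_size_assumed)
print(round(uniform_loss, 4))
```

The logged starting loss of about 4.99 sits somewhat above this figure, which is consistent with randomly initialised logits being non-uniform. The values logged by step 3500 (about 4.16) are below it, which suggests the model has started to pick up character statistics.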

In [110]:
# Generating text again with the partially trained model (compare with the untrained output above)

context = torch.zeros((1,1), dtype = torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens = 500)[0].tolist())
print(generated_chars)



VœJ]‘pm—;”toté-”ed uQbæ”eto leKA_13
R5âTrr!ge&N)P0HâGK£(edmmDtl[ liâoky
avirs—OYLed nn  rpt wYp£’ps,LpuTg9u—1z”9Zjà
J;Z’Itosd skLâIItv
y l7£jIagh[QR5qG3]Ypœto9-b3TB?si‘Fk
æAmamesswcCb_âlqllP5Vcote,#_beqashfativ
T[Q4vir d
ai#gpas”’l[XAmrsBD’£1,_Tlda:8Cd #wzkGwq6&Fyhal.r,
d sX!!æ”RJF?Wu?7-”Wzle-pes]UœJ”UCœGJ3
shâ”[50æ“BF wB.6f£œ
wGkà(voyq#
WœJO”Wswnd _GIj
Hm4:œImbBhàY17zn I ½3PœCwL(L0;I1(Yé)gUKk”NTBurc th.Ct ivesXâf!ySlàl-mXItwsoQ9﻿N﻿â‘Be k;vimy5â‘G]FJc!p1﻿qx)câTèuchair”FJqUO to 9èirt£mrend _cXDo


In [111]:
'''
For learning more about PyTorch Optimizers🔽

https://pytorch.org/docs/stable/optim.html
'''
