## <b>Bi-Gram Language Model</b>


In this notebook, we build a character-level "Bi-Gram" language model in PyTorch: a model that predicts the next character from the current character alone.

In [1]:
'''
Text taken into consideration : The Adventures of Sherlock Holmes (Project Gutenberg)

LINK : https://www.gutenberg.org/ebooks/1661
'''

'\nText taken into consideration : The Adventures of Sherlock Holmes (Project Gutenberg)\n\nLINK : https://www.gutenberg.org/ebooks/1661\n'

In [27]:
# Use the GPU when one is available, otherwise fall back to the CPU

import torch
import torch.nn as nn
from torch.nn import functional as F


device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

'''
Training on a CPU is best avoided: a CPU has only a few cores, so it works through the many tensor operations largely one after another.
With a huge amount of data (text), training on a CPU takes a long time 👻

A GPU has many more cores and runs these operations in parallel
'''

cuda


'\nTraining on a CPU is best avoided: a CPU has only a few cores, so it works through the many tensor operations largely one after another.\nWith a huge amount of data (text), training on a CPU takes a long time 👻\n\nA GPU has many more cores and runs these operations in parallel\n'

In [2]:
# Opening the text file (the book)

with open("SherlockHolms.txt", "r", encoding = 'utf-8') as f: # Character encoding = 'utf-8'
    text = f.read()

print(len(text))

562465


In [3]:
print(text[:200])

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release date: March 1, 1999 [eBook #1661]
                Most recently updated: October 10, 2023

Language: English

Credits: an


In [26]:
# Making a vocabulary set (of unique characters)

chars = sorted(set(text))
print(chars)
vocab_size = len(chars)

['\n', ' ', '!', '#', '&', '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '£', '½', 'à', 'â', 'æ', 'è', 'é', 'œ', '—', '‘', '’', '“', '”', '\ufeff']


In [5]:
# Encoder - encoding every character shown above by assigning it a unique number

# Number of unique characters above
print(len(chars))

92


In [6]:
# Encode - Decode

string_to_int = {ch:i for i,ch in enumerate(chars)} # Dictionary of encoded character values
int_to_string = {i:ch for i,ch in enumerate(chars)} # Dictionary of decoded character values

encode = lambda s : [string_to_int[c] for c in s]
decode = lambda l : ''.join([int_to_string[i] for i in l])

# Encoding
encoded_hello = encode("hello")
print(encoded_hello)

[59, 56, 63, 63, 66]


In [7]:
# Decoding
decoded_hello = decode([59, 56, 63, 63, 66])
print(decoded_hello)

hello
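
The two dictionaries above are inverses of each other, so decoding an encoded string should give back the original. The sketch below checks that round trip on a small sample string. `sample_text` and the other `sample_*` names are assumptions made for illustration and are not part of the notebook. It also shows that a character missing from the vocabulary raises a `KeyError`, which is one reason to build the vocabulary from the full text.

```python
# Minimal sketch: a character-level encoder/decoder on a tiny sample string.
sample_text = "hello world"  # hypothetical stand-in for the book text
sample_chars = sorted(set(sample_text))

sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}  # character -> ID
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}  # ID -> character


def sample_encode(s):
    """Map each character of s to its integer ID."""
    return [sample_stoi[c] for c in s]


def sample_decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(sample_itos[i] for i in ids)


ids = sample_encode("hello")
round_trip = sample_decode(ids)
print(ids, round_trip)  # [3, 2, 4, 4, 5] hello

# A character that never occurs in the text has no ID, so encoding it fails.
try:
    sample_encode("z")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print(unknown_ok)  # False
```

Note that the repeated letter "l" maps to the same ID both times, which is what a per-character encoding should do.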


In [8]:
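'''
Tokenization can happen at the word level or at the character level.
At the word level, the vocabulary can be humongous, because every distinct word becomes a token.
At the character level (used here), the vocabulary stays small (92 characters for this book), but each sequence gets longer.
'''

'\nTokenization can happen at the word level or at the character level.\nAt the word level, the vocabulary can be humongous, because every distinct word becomes a token.\nAt the character level (used here), the vocabulary stays small (92 characters for this book), but each sequence gets longer.\n'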

In [21]:
# Let's do the above using tensors - PyTorch
'''
Put everything we saw above inside tensors, so that PyTorch can work with it easily
'''

'\nPut everything we saw above inside tensors, so that PyTorch can work with it easily\n'

In [10]:
# Encode the whole text and store the resulting IDs in a tensor.
# dtype = torch.long stores each character ID as a 64-bit integer

data = torch.tensor(encode(text), dtype = torch.long)
print(data[:100])

tensor([91, 42, 60, 71, 63, 56, 20,  1, 42, 59, 56,  1, 23, 55, 73, 56, 65, 71,
        72, 69, 56, 70,  1, 66, 57,  1, 41, 59, 56, 69, 63, 66, 54, 62,  1, 30,
        66, 63, 64, 56, 70,  0,  0, 23, 72, 71, 59, 66, 69, 20,  1, 23, 69, 71,
        59, 72, 69,  1, 25, 66, 65, 52, 65,  1, 26, 66, 76, 63, 56,  0,  0, 40,
        56, 63, 56, 52, 70, 56,  1, 55, 52, 71, 56, 20,  1, 35, 52, 69, 54, 59,
         1, 11,  7,  1, 11, 19, 19, 19,  1, 49])


In [11]:
'''
Tensors are similar to NumPy arrays, but they are the data structure that PyTorch uses: they can live on a GPU and support automatic differentiation
'''

'\nTensors are similar to NumPy arrays, but they are the data structure that PyTorch uses: they can live on a GPU and support automatic differentiation\n'
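
As a rough illustration of how close tensors and NumPy arrays are, the sketch below converts an array to a tensor and back. It assumes NumPy is installed alongside PyTorch, and the variable names (`arr`, `t`, `dev`) are chosen for this example only. `torch.from_numpy` shares memory with the source array, so a change to the array is visible in the tensor.

```python
import numpy as np
import torch

# The IDs for "hello" from the encoder above, stored as 64-bit integers.
arr = np.array([59, 56, 63, 63, 66], dtype=np.int64)

# from_numpy creates a tensor that shares memory with the array.
t = torch.from_numpy(arr)
print(t.dtype == torch.long)  # torch.long is an alias for torch.int64

# Because memory is shared, writing to the array changes the tensor too.
arr[0] = 0
shared = t[0].item() == 0

# Unlike a NumPy array, a tensor can be moved to a GPU when one is present.
dev = 'cuda' if torch.cuda.is_available() else 'cpu'
t_dev = t.to(dev)

# To get back to NumPy, the tensor must be on the CPU first.
back = t_dev.cpu().numpy()
print(back.tolist())
```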

In [23]:
'''
Validation and Training Splits
'''

n = int(0.8*len(data)) # Training data size: the first 80% of the text
train_data = data[:n]
val_data = data[n:] # The remaining 20% is kept for validation

# Block size: number of characters in each training sequence
block_size = 8

# Batch size: how many blocks get processed in parallel
batch_size = 4

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Pick batch_size random start positions, leaving room for a full block plus one target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    print(ix) # Show the sampled start positions
    X = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # Same blocks, shifted one character to the right
    X, y = X.to(device), y.to(device) # Putting the data components in currently selected device (here, GPU)
    return X, y

X, y = get_batch('train')
print('inputs : ')
print(X)
print("\n")
print('target : ')
print(y)

tensor([ 63393, 160256,  59711, 414738])
inputs : 
tensor([[ 8, 59, 56, 52, 55, 70,  1, 52],
        [31,  1, 55, 60, 55,  1, 65, 66],
        [71,  1, 41, 67, 52, 72, 63, 55],
        [ 1, 74, 60, 71, 59, 66, 72, 71]], device='cuda:0')


target : 
tensor([[59, 56, 52, 55, 70,  1, 52, 70],
        [ 1, 55, 60, 55,  1, 65, 66, 71],
        [ 1, 41, 67, 52, 72, 63, 55, 60],
        [74, 60, 71, 59, 66, 72, 71,  1]], device='cuda:0')
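
In the output above, each row of the target is the matching input row shifted one position to the right. The sketch below checks that property on toy data. `toy_data`, `toy_get_batch`, and the other `toy_*` names are hypothetical and only mirror the structure of `get_batch`. Using `torch.arange` as data makes the shift easy to verify, because every target must equal its input plus one.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sampled offsets are repeatable

toy_data = torch.arange(20)  # hypothetical data: 0, 1, 2, ..., 19
toy_block_size = 8
toy_batch_size = 4


def toy_get_batch(data, block_size, batch_size):
    """Sample random blocks and their one-step-shifted targets."""
    # Start positions are chosen so that i + block_size + 1 stays in range.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    X = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return X, y


X_toy, y_toy = toy_get_batch(toy_data, toy_block_size, toy_batch_size)

# With arange data, "shifted by one position" means "larger by one".
print(torch.equal(y_toy, X_toy + 1))

# The overlap between inputs and targets is identical.
print(torch.equal(X_toy[:, 1:], y_toy[:, :-1]))
```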


In [25]:
'''
To see how training pairs are formed, take block size = 4 as an example

Say that the word "hello" is represented by the numerical array below 🔻
text = [59, 56, 63, 63, 66]

Then, for training and validation, the inputs (X) are text[:block_size] and the targets (y) are text[1:block_size+1]
Each position t gives one example: the characters X[:t+1] are the context and y[t] is the character that follows
In this way the bi-gram model learns which character is most probable next in the text

The code below applies the same idea to the start of train_data, with block_size = 8
'''

X = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = X[:t+1]
    target = y[t]
    print("When Input is", context, "--> Target is : ", target)

When Input is tensor([91]) --> Target is :  tensor(42)
When Input is tensor([91, 42]) --> Target is :  tensor(60)
When Input is tensor([91, 42, 60]) --> Target is :  tensor(71)
When Input is tensor([91, 42, 60, 71]) --> Target is :  tensor(63)
When Input is tensor([91, 42, 60, 71, 63]) --> Target is :  tensor(56)
When Input is tensor([91, 42, 60, 71, 63, 56]) --> Target is :  tensor(20)
When Input is tensor([91, 42, 60, 71, 63, 56, 20]) --> Target is :  tensor(1)
When Input is tensor([91, 42, 60, 71, 63, 56, 20,  1]) --> Target is :  tensor(42)


In [None]:
'''
Creating the Bi-Gram Language Model
'''

class BiGramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Embedding matrix (vocab_size X vocab_size): row i holds the next-character logits for character i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index) # Shape (B, T, C): batch, time (block), classes (vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # logits.view(a, b) ==> a = B*T positions ; b = no. of classes
            targets = targets.view(B*T) # targets.view(a) ==> a = B*T positions, one class index each
            loss = F.cross_entropy(logits, targets)

        return logits, loss
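
To show how a model of this shape might be called, here is a minimal sketch that runs one forward pass on random token IDs. `TinyBigram` is a hypothetical stand-in with the same structure as the class above, and the random `index` and `targets` tensors are assumptions for illustration, not real training data. The last check illustrates the bi-gram property: the logits at a position depend only on the character at that position, so a repeated character yields identical logits.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)


class TinyBigram(nn.Module):
    """Hypothetical stand-in with the same structure as the model above."""

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index)  # (B, T, C)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        logits = logits.view(B * T, C)   # one row of class scores per position
        targets = targets.view(B * T)    # one class index per position
        return logits, F.cross_entropy(logits, targets)


toy_vocab_size = 92  # matches the vocabulary size printed earlier
model = TinyBigram(toy_vocab_size)

# Random IDs standing in for a batch of 4 blocks of 8 characters each.
index = torch.randint(toy_vocab_size, (4, 8))
targets = torch.randint(toy_vocab_size, (4, 8))

logits, loss = model(index, targets)
print(logits.shape, loss.item())  # flattened logits and a scalar loss

# Without targets, the logits keep their (B, T, C) shape and the loss is None.
repeated = torch.tensor([[5, 5, 7]])
rep_logits, rep_loss = model(repeated)

# The same character (ID 5) at two positions gives the same logits.
same_rows = torch.equal(rep_logits[0, 0], rep_logits[0, 1])
print(same_rows)
```

The loss of an untrained model is expected to be high, since the embedding table starts from random values.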