### Baseline language modeling and code setup

In [1]:
# Read the dataset into a single string
with open('cleaned_dataset.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
print("Length of dataset (in characters): ", len(text))

Length of dataset (in characters):  6199345


In [3]:
# Print the first 1000 characters
print(text[:1000])

M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people youd expect to be involved in anything strange or mysterious, because they just didnt hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didnt think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursleys sister, but they hadnt met for several years; 

In [4]:
# List all the unique characters that occur in the dataset
characters = sorted(set(text))
vocab_size = len(characters)
print(''.join(characters))
print(vocab_size)


 !"&'()*,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrstuvwxyz{|}
86


Now we need a strategy to tokenize the input text. To tokenize means to convert the raw text string into a sequence of integers according to some vocabulary of possible elements.


In our case we are building a character-level language model, so we will be translating individual characters into integers.


We will implement an encoder and a decoder, but very simple ones, as that should be enough for our use case.

There are many other tokenizers (which encode text into integers and decode it back) that use different schemes and different vocabularies:

- Google uses [sentencepiece](https://github.com/google/sentencepiece): This tokenizer works on sub-word units, meaning a token is neither a whole word nor a single character. Sub-word tokenization is what is usually adopted in practice.

- OpenAI uses [tiktoken](https://github.com/openai/tiktoken): This is a BPE (Byte Pair Encoding) tokenizer, and it is what the GPT models use. Here the vocabulary is much larger, around 50,000 tokens for GPT-2 (50,257 to be exact).

So there is a tradeoff:
- A small vocabulary gives very long sequences of integers.
- A very large vocabulary gives short sequences of integers.
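
To make this tradeoff concrete, here is a minimal sketch in plain Python of the pair-merging idea behind BPE. It is a toy version and not how tiktoken or sentencepiece are actually implemented (real BPE works on bytes and learns its merges over a whole corpus). The helper `merge_most_frequent_pair` and the string `sample` are made up for this illustration. Starting from characters, every merge adds one entry to the vocabulary and shortens the sequence.

```python
from collections import Counter

def merge_most_frequent_pair(tokens):
    """Replace every occurrence of the most frequent adjacent pair with one merged token."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    (left, right), _count = pair_counts.most_common(1)[0]
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)  # the pair becomes a single token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, left + right

sample = "harry potter and harry potter"  # hypothetical sample text
tokens = list(sample)   # start at character level
vocab = set(tokens)     # base vocabulary: the distinct characters
history = [(len(vocab), len(tokens))]

for _ in range(5):
    tokens, new_token = merge_most_frequent_pair(tokens)
    vocab.add(new_token)  # each merge adds one entry to the vocabulary
    history.append((len(vocab), len(tokens)))

for vocab_len, seq_len in history:
    print(f"vocabulary size: {vocab_len:2d}   sequence length: {seq_len:2d}")
```

Reading the printed rows top to bottom, the vocabulary grows while the sequence shrinks, and joining the tokens back together always gives the original text.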

We will stick to a character-level tokenizer with a simple encoder and decoder. Our vocabulary is pretty small, just `86` characters, so the tradeoff is that the encoded text will be a long sequence of integers.

In [5]:
# Create the mappings between characters and integers

stoi = {ch: i for i, ch in enumerate(characters)}  # string to integer
itos = {i: ch for i, ch in enumerate(characters)}  # integer to string

encode = lambda s: [stoi[c] for c in s]           # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string

# Example showing how encoding and decoding work
# print(encode("harry potter"))
# print(decode(encode("harry potter")))

# Output:
# [64, 57, 74, 74, 81, 1, 72, 71, 76, 76, 61, 74]
# harry potter

In [None]:
# Now we encode the entire dataset

import torch  # Installed with `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121` (my CUDA version is 12.6)
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([6199345]) torch.int64
tensor([38,  1, 74, 11,  1, 57, 70, 60,  1, 38, 74, 75, 11,  1, 29, 77, 74, 75,
        68, 61, 81,  9,  1, 71, 62,  1, 70, 77, 69, 58, 61, 74,  1, 62, 71, 77,
        74,  9,  1, 41, 74, 65, 78, 61, 76,  1, 29, 74, 65, 78, 61,  9,  1, 79,
        61, 74, 61,  1, 72, 74, 71, 77, 60,  1, 76, 71,  1, 75, 57, 81,  1, 76,
        64, 57, 76,  1, 76, 64, 61, 81,  1, 79, 61, 74, 61,  1, 72, 61, 74, 62,
        61, 59, 76, 68, 81,  1, 70, 71, 74, 69, 57, 68,  9,  1, 76, 64, 57, 70,
        67,  1, 81, 71, 77,  1, 78, 61, 74, 81,  1, 69, 77, 59, 64, 11,  1, 45,
        64, 61, 81,  1, 79, 61, 74, 61,  1, 76, 64, 61,  1, 68, 57, 75, 76,  1,
        72, 61, 71, 72, 68, 61,  1, 81, 71, 77, 60,  1, 61, 80, 72, 61, 59, 76,
         1, 76, 71,  1, 58, 61,  1, 65, 70, 78, 71, 68, 78, 61, 60,  1, 65, 70,
         1, 57, 70, 81, 76, 64, 65, 70, 63,  1, 75, 76, 74, 57, 70, 63, 61,  1,
        71, 74,  1, 69, 81, 75, 76, 

Now we get to the interesting part (at least for me lol): splitting the data into train and validation sets. In our case we take the first 90% for training and the remainder for validation. The reason is that we don't want our model to simply memorise the dataset; we want it to generate 'Harry Potter'-like text. So we withhold some of the data and use it at the end to check for overfitting.

In [8]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
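
As a quick sanity check on the sizes involved, the same arithmetic can be done in plain Python without touching the tensor. The length 6199345 is the character count printed earlier; the names `total_chars`, `train_len` and `val_len` are only illustrative.

```python
# Sizes of the 90/10 split, using the dataset length printed earlier.
total_chars = 6199345

n = int(0.9 * total_chars)   # int() truncates, so the boundary index is rounded down
train_len = n                # length of data[:n]
val_len = total_chars - n    # length of data[n:]

print("train:", train_len)
print("val:  ", val_len)
print("val fraction:", round(val_len / total_chars, 4))
```

Note that the split is contiguous: the validation set is the last tenth of the text, not a random sample of it.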

Okay, so we never feed the entire dataset into the model at once, as that would be computationally prohibitive. Instead we sample small chunks (blocks) of the data and group several of them into a batch for each training step. The sequences in a batch are processed independently and do not communicate with each other.

In [None]:
torch.manual_seed(3007) # My dataset is different from what sensei is using, so I am using my own seed here :)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of inputs x and targets y
    data = train_data if split == 'train' else val_data  # pick the training or the validation split
    ix = torch.randint(len(data) - block_size, (batch_size,))  # random starting offsets, one per sequence in the batch
    x = torch.stack([data[i:i+block_size] for i in ix])  # inputs: block_size tokens starting at each offset
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets: the same windows shifted by one (the next character to predict)
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[79, 57, 68, 67, 65, 70, 63,  1],
        [64,  1, 57, 70, 63, 74, 81,  1],
        [ 1, 69, 65, 60, 57, 65, 74,  9],
        [ 1, 60, 65, 60,  1, 65, 76,  1]])
targets:
torch.Size([4, 8])
tensor([[57, 68, 67, 65, 70, 63,  1, 57],
        [ 1, 57, 70, 63, 74, 81,  1, 57],
        [69, 65, 60, 57, 65, 74,  9,  1],
        [60, 65, 60,  1, 65, 76,  1, 65]])
----
when input is [79] the target: 57
when input is [79, 57] the target: 68
when input is [79, 57, 68] the target: 67
when input is [79, 57, 68, 67] the target: 65
when input is [79, 57, 68, 67, 65] the target: 70
when input is [79, 57, 68, 67, 65, 70] the target: 63
when input is [79, 57, 68, 67, 65, 70, 63] the target: 1
when input is [79, 57, 68, 67, 65, 70, 63, 1] the target: 57
when input is [64] the target: 1
when input is [64, 1] the target: 57
when input is [64, 1, 57] the target: 70
when input is [64, 1, 57, 70] the target: 63
when input is [64, 1, 57, 70, 63] the target: 74
when input is [

The explanation for the above is rather simple. The first array is the batch of inputs, and each row is one block of data.
The second array holds the targets: for each position in an input row, it gives the character that comes next.

For example, taking the first row:
- When the context is [79], the target is 57.
- When the context is [79, 57], the target is 68.
- And so on, up to the full block of 8 characters.
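
The same idea can be shown with readable characters instead of integer ids. This is a minimal sketch in plain Python (no PyTorch), where `chunk` is a made-up sample string and `block_size` matches the value used above. A chunk of `block_size + 1` characters yields `block_size` training examples.

```python
block_size = 8
chunk = "Mr. Dursl"            # hypothetical sample: block_size + 1 characters

x = chunk[:block_size]         # inputs
y = chunk[1:block_size + 1]    # targets: the inputs shifted left by one position

pairs = []
for t in range(block_size):
    context = x[:t + 1]        # everything up to and including position t
    target = y[t]              # the character that follows the context
    pairs.append((context, target))
    print(f"when input is {context!r} the target is {target!r}")
```

So a single block trains the model on every context length from 1 up to `block_size`, which is why it can later start generating from as little as one character.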