### LLM Fundamentals

This notebook walks through the fundamentals of a large language model (**LLM**) by building a character-level bigram model.

The steps are as follows:
- Load the data
- Encode the data
- Convert the text to tensors
- Split into train and test sets
- Start with a bigram model

In [33]:
## first the imports
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import matplotlib.pyplot as plt
from pathlib import Path

### Loading the data

In [8]:
DATA_PATH = '../data/'
PATH = Path(DATA_PATH)
FILE_NAME = 'wizardOfOz.txt'
FILE_PATH = PATH / FILE_NAME

In [12]:
## opening the file
with open(FILE_PATH, 'r', encoding = 'utf-8') as f:
    text = f.read()
## checking the first 100 char
print(text[:100])

﻿ Dorothy and the Wizard in Oz


  A Faithful Record of Their Amazing Adventures
    in an Undergrou


### Encoding the data

In [24]:
## first we want to create a sorted list of the unique characters
chars = sorted(set(text))
## checking how many distinct characters are in the text
vocab_size = len(chars)
print(vocab_size)
## and then assign a number to each char
int_to_string = {i:l for i, l in enumerate(chars)}
string_to_int = {l:i for i, l in enumerate(chars)}
## and then move on to creating our encoding and decoding functions
encode = lambda l:[string_to_int[x] for x in l]
decode = lambda i:''.join([int_to_string[x] for x in i])

76


In [18]:
## we can test our encoder and decoder now
print(encode('Dorothy'))
print(decode(encode('Dorothy')))

[27, 63, 66, 63, 68, 56, 73]
Dorothy
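
A quick way to check that `encode` and `decode` are inverses is to repeat the same construction on a toy string. The sketch below assumes a made-up sample (`toy_text`) and uses plain functions instead of lambdas; all `toy_*` names are hypothetical. Note that a character missing from the vocabulary raises a `KeyError`, which is also true of the `encode` defined above.

```python
# Toy corpus (an assumption for illustration, not the notebook's data)
toy_text = "hello world"

# Sorting the set gives a stable, reproducible id for every character
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[ch] for ch in s]

def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)
print(ids, toy_decode(ids))

# Characters outside the vocabulary are not handled
try:
    toy_encode("z")
except KeyError as err:
    print("unknown character:", err)
```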


### Converting our text to tensors

In [25]:
text_tensor = torch.tensor(encode(text), dtype=torch.long)
print(text_tensor.size(), type(text_tensor))
text_tensor[:100]

torch.Size([230550]) <class 'torch.Tensor'>


tensor([75,  1, 27, 63, 66, 63, 68, 56, 73,  1, 49, 62, 52,  1, 68, 56, 53,  1,
        46, 57, 74, 49, 66, 52,  1, 57, 62,  1, 38, 74,  0,  0,  0,  1,  1, 24,
         1, 29, 49, 57, 68, 56, 54, 69, 60,  1, 41, 53, 51, 63, 66, 52,  1, 63,
        54,  1, 43, 56, 53, 57, 66,  1, 24, 61, 49, 74, 57, 62, 55,  1, 24, 52,
        70, 53, 62, 68, 69, 66, 53, 67,  0,  1,  1,  1,  1, 57, 62,  1, 49, 62,
         1, 44, 62, 52, 53, 66, 55, 66, 63, 69])

In [26]:
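## we're simply splitting the data into train and test sets (80/20)
## plain slicing keeps everything in torch instead of going through numpy
split_inx = int(.8*len(text_tensor))
train_data, test_data = text_tensor[:split_inx], text_tensor[split_inx:]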
train_data.shape, test_data.shape

(torch.Size([184440]), torch.Size([46110]))

In [28]:
## next we have to define a block size for our model
block_size = 8
## this means the model will see a context of at most 8 tokens
## when predicting the next one
x = train_data[:block_size]
y = train_data[1:block_size+1]
for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print(f'When the context is {context} the target will be {target}')

When the context is tensor([75]) the target will be 1
When the context is tensor([75,  1]) the target will be 27
When the context is tensor([75,  1, 27]) the target will be 63
When the context is tensor([75,  1, 27, 63]) the target will be 66
When the context is tensor([75,  1, 27, 63, 66]) the target will be 63
When the context is tensor([75,  1, 27, 63, 66, 63]) the target will be 68
When the context is tensor([75,  1, 27, 63, 66, 63, 68]) the target will be 56
When the context is tensor([75,  1, 27, 63, 66, 63, 68, 56]) the target will be 73


We loop over the `block_size` range so that the model learns to predict the next character from contexts of every length, from `1` up to `block_size` characters.
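
Put differently, a single chunk of `block_size + 1` tokens contains `block_size` training examples. A minimal sketch with plain lists, assuming made-up token ids (`tokens`) and a shorter context length (`block`) than the notebook uses:

```python
# Hypothetical token ids for illustration (not taken from the notebook)
tokens = [10, 11, 12, 13, 14]
block = 4  # context length; the chunk must hold block + 1 tokens

x = tokens[:block]
y = tokens[1:block + 1]

# One chunk of block + 1 tokens yields `block` (context, target) pairs
pairs = [(x[:i + 1], y[i]) for i in range(block)]
for context, target in pairs:
    print(context, "->", target)
```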

In [42]:
## we also need to break our data into batches for faster computations
block_size = 8
batch_size = 4

def get_batch(split):
    data = train_data if split == 'train' else test_data
    random_inx = torch.randint(high=len(data)-block_size, size = (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in random_inx])
    y = torch.stack([data[i+1:i+block_size+1] for i in random_inx])
    return x, y

X_train, y_train = get_batch('train')
X_test, y_test = get_batch('test')
print(X_train.shape, y_train.shape)

torch.Size([4, 8]) torch.Size([4, 8])
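
The upper bound `len(data)-block_size` in `get_batch` matters: it guarantees that the target slice, which ends one position later than the input slice, never runs past the end of the data. The sketch below mirrors the same indexing with plain lists, using `random.randrange` as a stand-in for `torch.randint` (both exclude the upper bound); `data`, `block`, and `batch` are illustrative stand-ins.

```python
import random

data = list(range(20))   # stand-in for the encoded text (assumption)
block = 8
batch = 4

random.seed(0)  # fixed seed so the sketch is repeatable
# randrange excludes its upper bound, like the `high` argument of torch.randint
starts = [random.randrange(len(data) - block) for _ in range(batch)]

xb = [data[i:i + block] for i in starts]
yb = [data[i + 1:i + block + 1] for i in starts]

# Every row has full length, and y is x shifted one position to the right
for xs, ys in zip(xb, yb):
    assert len(xs) == len(ys) == block
    assert xs[1:] == ys[:-1]
print(len(xb), "rows of length", block)
```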


In [43]:
## and we can loop through the batches in our set
for b in range(batch_size):
    for i in range(block_size):
        context = X_train[b,:i+1]
        target = y_train[b, i]
        print(f'Batch {b}: When context is {context} the target is {target}')

Batch 0: When context is tensor([49]) the target is 68
Batch 0: When context is tensor([49, 68]) the target is 0
Batch 0: When context is tensor([49, 68,  0]) the target is 56
Batch 0: When context is tensor([49, 68,  0, 56]) the target is 57
Batch 0: When context is tensor([49, 68,  0, 56, 57]) the target is 67
Batch 0: When context is tensor([49, 68,  0, 56, 57, 67]) the target is 1
Batch 0: When context is tensor([49, 68,  0, 56, 57, 67,  1]) the target is 54
Batch 0: When context is tensor([49, 68,  0, 56, 57, 67,  1, 54]) the target is 53
Batch 1: When context is tensor([53]) the target is 1
Batch 1: When context is tensor([53,  1]) the target is 67
Batch 1: When context is tensor([53,  1, 67]) the target is 69
Batch 1: When context is tensor([53,  1, 67, 69]) the target is 52
Batch 1: When context is tensor([53,  1, 67, 69, 52]) the target is 52
Batch 1: When context is tensor([53,  1, 67, 69, 52, 52]) the target is 53
Batch 1: When context is tensor([53,  1, 67, 69, 52, 52, 53])

### The Bigram Model

In [63]:
## we will be inheriting from the nn.Module
## and then use the embedding from nn to build our class
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        ## we're basically creating a wrapper
        ## around a tensor of vocab_size x vocab_size
        ## and each index that's passed to the model
        ## will look up its row in that table
        self.embedding_table = nn.Embedding(num_embeddings=vocab_size,
                                           embedding_dim=vocab_size)
    def forward(self, x, target=None):
        ## the embedding lookup returns a (Batch, Time, Channel) tensor
        ## where batch is the batch_size, time is the block_size
        ## and channel is the vocab_size
        ## so with batch_size = 4 it will be (4, 8, 76)
        logits = self.embedding_table(x)
        ## we also need the loss,
        ## for which we use the negative log-likelihood (cross entropy)
        ## F.cross_entropy expects the class dimension second,
        ## so we flatten the logits to (Batch*Time, Channel)
        ## and the targets to (Batch*Time)
        if target is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            target = target.view(B*T)
            loss = F.cross_entropy(logits, target)
        return logits, loss

    def generate(self, x, num_max_token):
        for _ in range(num_max_token):
            ## we want to get the predictions again
            logits, loss = self(x)
            ## and we only want the last time step of (Batch, Time, Vocab)
            logits = logits[:, -1, :] ## (Batch, Vocab)
            ## and then we apply the softmax to get the probabilities
            probs = torch.softmax(logits, dim=-1) ## still (Batch, Vocab)
            ## and then draw a sample from the probability distribution
            next_inx = torch.multinomial(probs,num_samples=1) ## (B, 1)
            ## and then append the next index to the x
            x = torch.cat((x, next_inx), dim=1) ## (Batch, Time + 1)
        return x

model = BigramLanguageModel(vocab_size)
output, loss = model(X_train, y_train)
print(output.shape)
## with uniform guesses we'd expect the initial loss to be
## -ln(1/vocab_size)
print(f'expected loss {-np.log(1/vocab_size):.2f}')
print(f'calculated loss {loss.item():.2f}')

torch.Size([256, 76])
expected loss 4.33
calculated loss 4.76


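The initial loss (4.76) is higher than the 4.33 we would get from uniform guesses. This means the untrained model's predictions are not uniform: its randomly initialized logits assign high probability to some wrong characters, which pushes the cross-entropy above the uniform baseline.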
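
The gap can be reproduced without the model. The sketch below uses only the standard library and assumes the untrained logits behave like independent draws from a standard normal distribution (the default initialization of `nn.Embedding` weights) with targets chosen at random; `n_samples` is an arbitrary choice. The simulated loss should land noticeably above the uniform baseline, in the same region as the value printed above.

```python
import math
import random

vocab = 76          # vocabulary size reported earlier in the notebook
n_samples = 2000    # number of simulated predictions (arbitrary choice)
random.seed(0)

# Uniform prediction: every character gets probability 1/vocab
uniform_loss = -math.log(1 / vocab)

# Simulated untrained model: logits drawn from N(0, 1), random targets
total = 0.0
for _ in range(n_samples):
    logits = [random.gauss(0.0, 1.0) for _ in range(vocab)]
    target = random.randrange(vocab)
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    total += log_sum_exp - logits[target]   # cross-entropy of one sample
random_loss = total / n_samples

print(f"uniform baseline:      {uniform_loss:.2f}")
print(f"random N(0, 1) logits: {random_loss:.2f}")
```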

In [64]:
## checking the generate method
## which is completely random at the moment
## because we haven't yet trained the model
print(decode(model.generate(X_test, num_max_token=100)[0].tolist()))

ster toweo(o6aoyga﻿.mI7ku(;ViEdvnhh;BeoPYP"sbx8-yIfltBpJRG-pOqYpwkb"O7;AYWpC:MErhh:veRORfv3y&prFeNFlBcr:NiMA
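
The sampling step inside `generate` (softmax followed by a multinomial draw) can be sketched in plain Python. This is an illustration under assumptions, not the PyTorch implementation: the four logits are made up, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-character vocabulary
logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax(logits)

random.seed(0)
# Draw 1000 next-token ids in proportion to their probabilities
samples = random.choices(range(len(probs)), weights=probs, k=1000)
counts = [samples.count(i) for i in range(len(probs))]

print([round(p, 3) for p in probs])
print(counts)
```

Sampling, rather than always taking the most likely id, is what lets the same prompt produce different continuations on each call.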


### Training the model

In [65]:
## like any other ML model
## we need an optimizer
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-3)

In [66]:
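## the next step is to train the model
## note: each "epoch" below is one optimization step
## on a single random batch, not a full pass over the data
batch_size = 32
epochs = 20000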
for e in range(epochs):
    ## get the X and y
    X_train, y_train = get_batch('train')
    logits, loss = model(X_train, y_train)
    ## zeroing the gradient
    optimizer.zero_grad(set_to_none=True)
    ## and then backpropagation
    loss.backward()
    ## and then taking a step
    optimizer.step()
    if e%1000==0:
        print(f'Epoch {e} Loss is {loss.item():.4f}')

Epoch 0 Loss is 4.8435
Epoch 1000 Loss is 3.7858
Epoch 2000 Loss is 3.2759
Epoch 3000 Loss is 2.7724
Epoch 4000 Loss is 2.6515
Epoch 5000 Loss is 2.5495
Epoch 6000 Loss is 2.4680
Epoch 7000 Loss is 2.3957
Epoch 8000 Loss is 2.5189
Epoch 9000 Loss is 2.5125
Epoch 10000 Loss is 2.4334
Epoch 11000 Loss is 2.3051
Epoch 12000 Loss is 2.3299
Epoch 13000 Loss is 2.3017
Epoch 14000 Loss is 2.4648
Epoch 15000 Loss is 2.4087
Epoch 16000 Loss is 2.3207
Epoch 17000 Loss is 2.4632
Epoch 18000 Loss is 2.4763
Epoch 19000 Loss is 2.3227
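
The plateau in the log above is consistent with what a model that only sees the previous character can achieve. A count-based table makes the same kind of prediction without gradient descent. The sketch below, on a made-up string (`toy_text`, an assumption), counts character pairs and reports the average negative log-likelihood of that table on its own text, next to the uniform baseline.

```python
import math
from collections import Counter, defaultdict

# Made-up corpus for illustration (not the notebook's data)
toy_text = "the cat sat on the mat. the rat sat on the hat."

# Count how often each character follows each other character
pair_counts = defaultdict(Counter)
for prev, nxt in zip(toy_text, toy_text[1:]):
    pair_counts[prev][nxt] += 1

def next_char_probs(prev):
    """Maximum-likelihood estimate of P(next | prev) from the counts."""
    counts = pair_counts[prev]
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

# Average negative log-likelihood over all adjacent pairs in the text
n_pairs = len(toy_text) - 1
nll = 0.0
for prev, nxt in zip(toy_text, toy_text[1:]):
    nll -= math.log(next_char_probs(prev)[nxt])
nll /= n_pairs

uniform = math.log(len(set(toy_text)))
print(f"bigram NLL {nll:.2f} vs uniform {uniform:.2f}")
```

Going below this kind of floor requires a model that looks further back than one character.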


In [68]:
## now let's check what our model generates after training
print(decode(model.generate(X_test, num_max_token=300)[0].tolist()))

ster toweabadmoom pngal,"Bu f t ne I wandns n's. ppedes 3.

ey engiousece

rothe,"ckithe besatourskevoly n bleant athy wncleancaros aut



Thacecet thout Yon tantt, her averesogond bely stowathe ie, ce angaive. d.

anly."At at, idins torerowadvo s mamers Zere aplor sinss I'sa, angen at tatheyous he a w t ig
