### LLM Fundamentals

In this notebook we will go through the fundamentals of a large language model (**LLM**).

The steps are as follows:
- Load the data
- Encode the data
- Convert the text to tensors
- Split the data into train and test sets
- Start with a bigram model

In [33]:
## first the imports
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import matplotlib.pyplot as plt
from pathlib import Path

### Loading the data

In [8]:
DATA_PATH = '../data/'
PATH = Path(DATA_PATH)
FILE_NAME = 'wizardOfOz.txt'
FILE_PATH = PATH / FILE_NAME

In [12]:
## opening the file
with open(FILE_PATH, 'r', encoding = 'utf-8') as f:
    text = f.read()
## checking the first 100 char
print(text[:100])

﻿ Dorothy and the Wizard in Oz


  A Faithful Record of Their Amazing Adventures
    in an Undergrou


### Encoding the data

In [24]:
## first we want to create a set of char
chars = sorted(set(text))
## checking how many distinct char are in the text
vocab_size = len(chars)
print(vocab_size)
## and then assign a number to each char
int_to_string = {i:l for i, l in enumerate(chars)}
string_to_int = {l:i for i, l in enumerate(chars)}
## and then move on to creating our encoding and decoding functions
encode = lambda l:[string_to_int[x] for x in l]
decode = lambda i:''.join([int_to_string[x] for x in i])

76


In [18]:
## we can test our encoder and decoder now
print(encode('Dorothy'))
print(decode(encode('Dorothy')))

[27, 63, 66, 63, 68, 56, 73]
Dorothy
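
To make the mapping easier to see, here is a minimal sketch of the same character-level scheme on a tiny string. The names `sample`, `stoi`, `itos`, `toy_encode` and `toy_decode` are made up for this illustration and are not part of the notebook.

```python
# Toy corpus, assumed for illustration only
sample = "hello oz"

# The sorted unique characters form the vocabulary
toy_chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(toy_chars)}  # char -> id
itos = {i: ch for i, ch in enumerate(toy_chars)}  # id -> char


def toy_encode(s):
    """Map each character to its integer id."""
    return [stoi[ch] for ch in s]


def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = toy_encode("hello")
print(toy_chars)        # [' ', 'e', 'h', 'l', 'o', 'z']
print(ids)              # [2, 1, 3, 3, 4]
print(toy_decode(ids))  # hello
```

Note that a character missing from the vocabulary raises a `KeyError` in this scheme, so the vocabulary has to be built from the same text we plan to encode.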


### Converting our text to tensors

In [25]:
text_tensor = torch.tensor(encode(text), dtype=torch.long)
print(text_tensor.size(), type(text_tensor))
text_tensor[:100]

torch.Size([230550]) <class 'torch.Tensor'>


tensor([75,  1, 27, 63, 66, 63, 68, 56, 73,  1, 49, 62, 52,  1, 68, 56, 53,  1,
        46, 57, 74, 49, 66, 52,  1, 57, 62,  1, 38, 74,  0,  0,  0,  1,  1, 24,
         1, 29, 49, 57, 68, 56, 54, 69, 60,  1, 41, 53, 51, 63, 66, 52,  1, 63,
        54,  1, 43, 56, 53, 57, 66,  1, 24, 61, 49, 74, 57, 62, 55,  1, 24, 52,
        70, 53, 62, 68, 69, 66, 53, 67,  0,  1,  1,  1,  1, 57, 62,  1, 49, 62,
         1, 44, 62, 52, 53, 66, 55, 66, 63, 69])

In [26]:
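## we split the data into train (80%) and test (20%) sets;
## plain slicing keeps everything a torch tensor without going through NumPy
split_idx = int(.8*len(text_tensor))
train_data, test_data = text_tensor[:split_idx], text_tensor[split_idx:]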
train_data.shape, test_data.shape

(torch.Size([184440]), torch.Size([46110]))

In [28]:
## next we have to define a block size for our model
block_size = 8
## this means the model will see a context of at most 8 tokens
## (here, characters) when predicting the next one
x = train_data[:block_size]
y = train_data[1:block_size+1]
for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print(f'When the context is {context} the target will be {target}')

When the context is tensor([75]) the target will be 1
When the context is tensor([75,  1]) the target will be 27
When the context is tensor([75,  1, 27]) the target will be 63
When the context is tensor([75,  1, 27, 63]) the target will be 66
When the context is tensor([75,  1, 27, 63, 66]) the target will be 63
When the context is tensor([75,  1, 27, 63, 66, 63]) the target will be 68
When the context is tensor([75,  1, 27, 63, 66, 63, 68]) the target will be 56
When the context is tensor([75,  1, 27, 63, 66, 63, 68, 56]) the target will be 73


We loop over the `block_size` range so that the model sees contexts of every length from `1` up to `block_size` characters. A single chunk of `block_size + 1` characters therefore yields `block_size` training examples.
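
The same idea can be written as a single list of (context, target) pairs. This is only a sketch with plain Python lists; `chunk` holds the first nine token ids from the tensor printed earlier, and `pairs` is an illustrative name.

```python
block_size = 8

# First block_size + 1 token ids of the text, copied from the output above
chunk = [75, 1, 27, 63, 66, 63, 68, 56, 73]

# One chunk gives block_size examples: the context grows by one token each time
pairs = [(chunk[:i + 1], chunk[i + 1]) for i in range(block_size)]

for context, target in pairs:
    print(f"context={context} -> target={target}")
```

The shortest context has one token and the longest has `block_size` tokens, which is why the model can later start generating from a single character.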

In [29]:
## we also need to group our data into batches so that
## several sequences are processed in parallel
block_size = 8
batch_size = 4

def get_batch(split):
    data = train_data if split == 'train' else test_data
    random_inx = torch.randint(high=len(data)-block_size, size = (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in random_inx])
    y = torch.stack([data[i+1:i+block_size+1] for i in random_inx])
    return x, y

X_train, y_train = get_batch('train')
print(X_train.shape, y_train.shape)

torch.Size([4, 8]) torch.Size([4, 8])
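
A quick way to convince ourselves that the targets are the inputs shifted by one position is to sample from data where that is obvious. The sketch below is an assumption-laden variant of the function above: `sample_batch` is a hypothetical name, it takes the data and sizes as arguments, and it uses a seeded `torch.Generator` so the draw is repeatable.

```python
import torch


def sample_batch(data, block_size, batch_size, generator=None):
    """Draw batch_size random windows of length block_size from a 1-D tensor."""
    # high is exclusive, so the largest start still leaves room for the shifted target
    starts = torch.randint(0, len(data) - block_size, (batch_size,),
                           generator=generator).tolist()
    x = torch.stack([data[i:i + block_size] for i in starts])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])
    return x, y


# With data = 0, 1, 2, ... every target must equal its input plus one
data = torch.arange(100)
gen = torch.Generator().manual_seed(0)
xb, yb = sample_batch(data, block_size=8, batch_size=4, generator=gen)

print(xb.shape, yb.shape)
print(torch.equal(yb, xb + 1))
```

Passing the generator explicitly keeps the sampling reproducible without changing the global random state.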


In [31]:
## and we can loop through each sequence in the batch
## (b is the index of the sequence inside the batch)
for b in range(batch_size):
    for i in range(block_size):
        context = X_train[b,:i+1]
        target = y_train[b, i]
        print(f'Batch {b}: When context is {context} the target is {target}')

Batch 0: When context is tensor([66]) the target is 67
Batch 0: When context is tensor([66, 67]) the target is 53
Batch 0: When context is tensor([66, 67, 53]) the target is 10
Batch 0: When context is tensor([66, 67, 53, 10]) the target is 3
Batch 0: When context is tensor([66, 67, 53, 10,  3]) the target is 0
Batch 0: When context is tensor([66, 67, 53, 10,  3,  0]) the target is 0
Batch 0: When context is tensor([66, 67, 53, 10,  3,  0,  0]) the target is 3
Batch 0: When context is tensor([66, 67, 53, 10,  3,  0,  0,  3]) the target is 24
Batch 1: When context is tensor([67]) the target is 1
Batch 1: When context is tensor([67,  1]) the target is 44
Batch 1: When context is tensor([67,  1, 44]) the target is 62
Batch 1: When context is tensor([67,  1, 44, 62]) the target is 51
Batch 1: When context is tensor([67,  1, 44, 62, 51]) the target is 60
Batch 1: When context is tensor([67,  1, 44, 62, 51, 60]) the target is 53
Batch 1: When context is tensor([67,  1, 44, 62, 51, 60, 53]) t

### The Bigram Model

In [None]:
## we inherit from nn.Module and use nn.Embedding as a lookup table:
## row i holds the next-character logits for character i
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding_table = nn.Embedding(num_embeddings=vocab_size,
                                            embedding_dim=vocab_size)

    def forward(self, x, target=None):
        ## target is accepted but not used yet; the output is the logits
        ## with shape (batch_size, block_size, vocab_size)
        logits = self.embedding_table(x)
        return logits
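
One possible way to take this class further is to use the targets for a cross-entropy loss and to sample new characters from the predicted distribution. The sketch below is not the notebook's final model: `BigramSketch`, its `generate` method and the random `xb`/`yb` batch are assumptions made for illustration, with `vocab_size=76` taken from the output above.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class BigramSketch(nn.Module):
    """Bigram model: the next-token logits depend only on the current token."""

    def __init__(self, vocab_size):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, target=None):
        logits = self.embedding_table(x)  # (B, T, vocab_size)
        if target is None:
            return logits, None
        B, T, C = logits.shape
        # cross_entropy expects (N, C) logits and (N,) class indices
        loss = F.cross_entropy(logits.reshape(B * T, C), target.reshape(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, x, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(x)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # last time step only
            next_id = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_id], dim=1)
        return x


torch.manual_seed(0)
model = BigramSketch(vocab_size=76)
xb = torch.randint(0, 76, (4, 8))  # random stand-in for a real batch
yb = torch.randint(0, 76, (4, 8))
logits, loss = model(xb, yb)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=10)
print(logits.shape, loss.item(), out.shape)
```

For reference, a model that predicts all 76 characters with equal probability would have a loss of ln(76) ≈ 4.33, so an untrained model should land in that neighbourhood or somewhat above it.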

