* Stage 1
    * 1. Data preparation and sampling
    * 2. Attention mechanism
    * 3. LLM Architecture
4. Pre-training
* Stage 2
    * 5. Training Loop
    * 6. Model evaluation
    * 7. Load pre-trained weights
5. Fine-tuning
* Stage 3
    * 8. Classifier
    * 9. Personal Assistant

## 2.1 Tokenizing Text

In [3]:
# Open The Verdict in Python
with open("the-verdict.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()
#print("Total number of characters:", len(raw_text))
#print(raw_text[:99])

In [4]:
# Download a file from a url
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict2.txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict2.txt', <http.client.HTTPMessage at 0x1f1bc2a0d40>)

In [5]:
# Open The Verdict in Python
with open("the-verdict2.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()

In [18]:
# Split the text using regular expression
import re
input_text = 'Good, morning! We are bulding, out own llm.'
result = re.split(r'([,.:;?!_"()\']|--\s)', input_text) # split takes 2 parameters: the re pattern and the text
result = [res for res in result if res.strip()]
print(result)

['Good', ',', ' morning', '!', ' We are bulding', ',', ' out own llm', '.']


In [6]:
import re
preprocessed = re.split(r'([,.:;?!_"()\']|--|\s)', raw_text)
preprocessed = [res for res in preprocessed if res.strip()]
print(len(preprocessed))

4690


In [7]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.2 Converting Tokens into Token IDs
## 2.3 Adding Special Context Tokens
We first need to build a vocabulary.

In [8]:
# Create and populate a vocab dictionary
tokens_clean_sorted = sorted(list(set(preprocessed)))
tokens_clean_sorted.extend(["<|endoftext|>", "<|unk|>"])
vocab = {item: index for index, item in enumerate(tokens_clean_sorted)}
print(len(vocab.items()))

1132


In [46]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


## Create a Tokenizer to encode and decode text using the Vocabulary
* Encode - Convert text to token ids
* Decode - Convert token ids to text
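
The `TokenizerV1` class used in the next cell comes from a local `Tokenizer` module whose code is not shown here. The following is a minimal sketch of what such a class might look like, assuming it reuses the regex split from section 2.1 and maps out-of-vocabulary words to `<|unk|>`. The class name `SketchTokenizer` and the toy vocabulary are hypothetical and are not the notebook's actual implementation.

```python
import re


class SketchTokenizer:
    """Hypothetical stand-in for a vocabulary-based tokenizer."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Same split pattern as in section 2.1: punctuation, "--", or whitespace
        tokens = re.split(r'([,.:;?!_"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        # Assumption: words missing from the vocabulary fall back to "<|unk|>"
        tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() placed in front of punctuation
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)


# Toy vocabulary, made up for this sketch
toy_vocab = {tok: i for i, tok in enumerate(
    [",", ".", "?", "do", "like", "tea", "you", "<|endoftext|>", "<|unk|>"])}
sketch = SketchTokenizer(toy_vocab)
ids = sketch.encode("Hello, do you like tea?")
print(ids)
print(sketch.decode(ids))
```

Because "Hello" is not in the toy vocabulary, it is encoded as the ID of `<|unk|>` and cannot be recovered when decoding.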

In [14]:
from Tokenizer import TokenizerV1
tokenizer = TokenizerV1(vocab)
text = """
The height of his glory"--that was what the women called it. 
I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication.
"""
ids = tokenizer.encode(text)
print(ids)

[93, 538, 722, 549, 496, 1, 6, 987, 1077, 1089, 988, 1112, 242, 585, 7, 53, 244, 535, 67, 7, 37, 100, 6, 549, 602, 25, 897, 6, 326, 549, 1042, 116, 7]


In [10]:
print(tokenizer.decode(ids))

The height of his glory" -- that was what the women called it. I can hear Mrs. Gideon Thwing -- his last Chicago sitter -- deploring his unaccountable abdication.


In [19]:
## Handling Unknown Words
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [20]:
for index, value in enumerate(vocab.items()):
    print(value)
    if index >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


In [21]:
ids = tokenizer.encode(text)
print(ids)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [22]:
print(tokenizer.decode(ids))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


## 2.4 Byte Pair Encoding (BPE)

In [23]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.8.0


In [None]:
tokenizer = tiktoken.get_encoding("gpt2")
text = ("Hello, do you like tea? <|endoftext|> In the sunlit terraces "
        "of someunknownPlace")
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

print(tokenizer.decode(integers))

## 2.6 Data Sampling With a Sliding Window

In [24]:
with open("the-verdict.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()

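# Tokenize the text with the current `tokenizer`.
# The output recorded below (4690 tokens, IDs below 1132) matches the vocabulary-based
# TokenizerV1 from section 2.3 rather than the GPT-2 BPE tokenizer.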
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
enc_sample = enc_text[50:]

4690


In [25]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:    {y}")

x: [568, 115, 1066, 727]
y:    [115, 1066, 727, 988]


In [None]:
# Print the input/target pairs that can now be used to train the model
for i in range(1, context_size + 1):
    x = enc_sample[:i]
    y = enc_sample[i]
    print(tokenizer.decode(x), "--->", tokenizer.decode([y]))

## 2.6.1 Working With a DataLoader

Implement a data loader that iterates over the training set and returns the inputs and targets as tensors.

In [None]:
import tiktoken
from Dataset import GPTDatasetV1
from torch.utils.data import DataLoader

def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  # drop the last batch if it is shorter than batch_size to avoid loss spikes during training
        num_workers=num_workers
    )
    return dataloader
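
`GPTDatasetV1` comes from a local `Dataset` module whose code is not shown here. Assuming it slides a window of `max_length` tokens over the token IDs in steps of `stride`, and uses the same window shifted by one token as the target, the chunking logic can be sketched without PyTorch as follows. `sliding_windows` is a hypothetical helper, and plain lists stand in for tensors.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the shifted target chunk still has max_length tokens
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs

# stride == max_length: consecutive chunks do not overlap
for inputs, targets in sliding_windows(toy_ids, max_length=4, stride=4):
    print(inputs, "->", targets)

# stride == 1: consecutive chunks overlap in all but one token
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
print(len(overlapping))
```

A smaller `stride` produces more training samples from the same text, at the cost of heavier overlap between them.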

In [None]:
# Convert the dataloader into a Python iterator and fetch the next entry with Python's built-in next()
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
second_batch = next(data_iter)
print(second_batch)
# Each of the tensors contains 4 token IDs since max_length=4

In [None]:
# Assignment 1: max_length=2, stride=2
dataloader1 = create_dataloader_v1(
    raw_text, batch_size=1, max_length=2, stride=2, shuffle=False)
data_iter = iter(dataloader1)
first_batch = next(data_iter)
print(first_batch)
second_batch = next(data_iter)
print(second_batch)

In [None]:
# Assignment 2: max_length=8, stride=2
dataloader2 = create_dataloader_v1(
    raw_text, batch_size=1, max_length=8, stride=2, shuffle=False)
data_iter = iter(dataloader2)
first_batch = next(data_iter)
print(first_batch)

In [None]:
second_batch = next(data_iter)
print(second_batch)

In [None]:
# Batch size greater than 1
dataloader3 = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter3 = iter(dataloader3)
inputs, targets = next(data_iter3)    
print("Inputs:\n", inputs)
print("\nTargets: \n", targets)

In [None]:
inputs2, targets2 = next(data_iter3)
print("Inputs:\n", inputs2)
print("\nTargets: \n", targets2)

## 2.7 Creating Token Embeddings 

We will now create an embedding layer that maps each token index (0 to vocab_size-1) to a learnable vector of size output_dim.
This layer initializes a weight matrix of shape (vocab_size, output_dim).
In the example below, vocab_size is 6 and output_dim is 3, so the weight matrix has 6 rows and 3 columns:
* One row for each of the 6 possible tokens in the vocabulary
* One column for each of the 3 embedding dimensions
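
An embedding lookup amounts to selecting rows from this weight matrix: token ID `i` returns row `i`. The following dependency-free sketch illustrates that idea, using a hypothetical 6 by 3 matrix of made-up values in place of randomly initialized weights. The `embed` helper is illustrative only.

```python
# Hypothetical 6 x 3 weight matrix: one row per token ID, one column per dimension
weights = [
    [0.0, 0.1, 0.2],  # token ID 0
    [1.0, 1.1, 1.2],  # token ID 1
    [2.0, 2.1, 2.2],  # token ID 2
    [3.0, 3.1, 3.2],  # token ID 3
    [4.0, 4.1, 4.2],  # token ID 4
    [5.0, 5.1, 5.2],  # token ID 5
]


def embed(token_ids, weight_matrix):
    """Look up one row of the weight matrix per token ID."""
    return [weight_matrix[i] for i in token_ids]


vectors = embed([2, 3, 5, 1], weights)
for token_id, vector in zip([2, 3, 5, 1], vectors):
    print(token_id, vector)
```

In a trained model the rows are learned parameters, but the lookup itself stays this simple.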

In [None]:
import torch
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3

# we now instantiate an embedding layer using the vocab size and output dim
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

In [None]:
print(embedding_layer(torch.tensor([3])))
print(embedding_layer(input_ids))

## 2.8 Encoding Word Positions

The issue with the embedding layer introduced above is that the self-attention mechanism of LLMs
does not have a notion of position or order for the tokens within a sequence.
Because the embedding layer always maps the same token ID to the same vector representation, regardless of where the
token ID is positioned in the input sequence, position information has to be added separately.

### Relative Positional Embedding
This approach is based on relative position, that is, the distance between tokens.

### Absolute Positional Embedding
A unique positional embedding for each position in the input sequence is added to the token embedding vector. It has the same dimension as the token embedding.
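
Here is a small sketch of the absolute approach with plain Python lists, assuming made-up 2-dimensional vectors. The same token at two positions starts from an identical token vector, and adding a per-position vector makes the resulting inputs differ. The names `token_table`, `pos_table`, and `add_positions` are illustrative only.

```python
# Hypothetical embeddings: token ID -> vector, position index -> vector
token_table = {7: [1.0, 2.0], 9: [3.0, 4.0]}
pos_table = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]


def add_positions(token_ids):
    """Add the positional vector of each slot to that slot's token vector."""
    result = []
    for pos, tok in enumerate(token_ids):
        tok_vec = token_table[tok]
        pos_vec = pos_table[pos]
        result.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return result


out = add_positions([7, 9, 7])  # token 7 appears at positions 0 and 2
print(out)
```

The first and last output vectors differ even though both come from token 7, which is the information the model would otherwise lack.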

In [None]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [None]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token Ids \n", inputs)
print("\n Input Shape: \n", inputs.shape)

Here we can see that the tensor of token IDs has a shape of 8 by 4. This means that each data batch consists
of 8 text samples with 4 tokens each.

In [None]:
# Use the embedding layer to embed each token ID into a 256-dimensional vector
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

In [None]:
# To implement absolute positional embeddings, we create another embedding layer
# that has the same embedding dimension (output_dim) as the token_embedding_layer
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

In [None]:
# Now we can add the positional embeddings to the token embeddings
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)