In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
	raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [2]:
"""import re

text = "Hello, world. This is a test."
result = re.split(r' ', text)
print(result)"""

'import re\n\ntext = "Hello, world. This is a test."\nresult = re.split(r\' \', text)\nprint(result)'


The following splits on whitespace characters (\s). Because the pattern is wrapped in a capturing group, the whitespace separators are kept as items in the resulting list:

(I believe this matters because an LLM may need spacing information to understand sentence structure.)
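To see what the capturing group changes, here is a small sketch that runs the same split with and without parentheses around \s. It only uses re.split from the standard library; the variable names without_group and with_group are made up for this comparison.

```python
import re

text = "Hello, world. This is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept as its own list item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)

# Because nothing was thrown away, joining the pieces restores the original text.
print("".join(with_group) == text)
```

Keeping the separators is what makes the split lossless: the second list can be joined back into the exact input, the first cannot.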

In [4]:
import re

text = "Hello, world. This is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']


Let's split on punctuation (commas and periods) as well as whitespace. Where two separators are adjacent, the result contains empty strings:

In [5]:
import re

text = "Hello, world. This is a test."
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


Now let's remove the empty and whitespace-only items from the list:

In [6]:
import re

text = "Hello, world. This is a test."
result = re.split(r'([,.]|\s)', text)

result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']


Removing whitespace tokens reduces memory and compute requirements. However, whitespace may be needed when training a model that is sensitive to the exact structure of the text.
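A rough sketch of that trade-off, assuming a two-line Python snippet as the input (the snippet and the names kept and dropped are my own, not from the book). Keeping whitespace tokens lets us rebuild the text exactly; dropping them gives a shorter list but loses the indentation.

```python
import re

# Made-up input where layout matters: the second line is indented.
snippet = "if x:\n    y = 1"

pieces = re.split(r'([,.:]|\s)', snippet)

# Keep every non-empty piece, whitespace included.
kept = [p for p in pieces if p != ""]

# Drop whitespace-only pieces, as in the cell above.
dropped = [p for p in pieces if p.strip()]

print(len(kept), len(dropped))
print("".join(kept) == snippet)   # the layout can be restored from kept
print(dropped)                    # the newline and indentation are gone
```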

Now let's extend the pattern to cover more punctuation:

In [7]:
import re

text = "Hello, world! Is this-- a test?"
result = re.split(r'([.,:?_!"()\']|--|\s)', text)

result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '!', 'Is', 'this', '--', 'a', 'test', '?']


Going back to the verdict text, the list comprehension in the next cell works as follows:

1.	Iterate over each element (item) in preprocessed.
2.	Apply strip() to item: item.strip() removes any leading and trailing whitespace from the string. For example:
	•	"   hello   " becomes "hello"
3.	Check whether item.strip() is non-empty: if it is an empty string (the original string was empty or consisted only of whitespace), the item is excluded from the new list.
4.	Otherwise (item.strip() is not an empty string), include the stripped version of item in the new list.
5.	The result is a new list that contains only non-empty, stripped tokens.

In [8]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
	raw_text = f.read()
preprocessed = re.split(r'([,.:?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
print(preprocessed[:30])

4669
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
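The list comprehension used above can also be written as an explicit loop, which makes each step visible. This is only an equivalent sketch on a short made-up sample string; the names sample, pieces and cleaned are illustrative.

```python
import re

sample = "I HAD always thought, no?  "
pieces = re.split(r'([,.:?_!"()\']|--|\s)', sample)

# Explicit-loop version of:
#   [item.strip() for item in pieces if item.strip()]
cleaned = []
for item in pieces:
    stripped = item.strip()      # remove leading and trailing whitespace
    if stripped:                 # an empty string is falsy, so it is skipped
        cleaned.append(stripped)

print(cleaned)
```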


Building a sorted list of the unique tokens (the vocabulary):

In [9]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
	raw_text = f.read()
preprocessed = re.split(r'([,.:?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1143


The set() function converts the list preprocessed into a set. A set is a collection of unique elements, so duplicate entries are removed automatically.
•	So, if the preprocessed list contains repeated words or items, each one appears only once in the resulting set. sorted() then returns the unique items as an ordered list.
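A tiny example of the same two steps on a made-up token list (the names tokens, unique_tokens and ordered_tokens are illustrative):

```python
tokens = ["the", "cat", ",", "the", "Dog", "cat", "."]

unique_tokens = set(tokens)             # duplicates removed, order not guaranteed
ordered_tokens = sorted(unique_tokens)  # strings are ordered by character code

print(len(tokens), len(unique_tokens))
print(ordered_tokens)
```

Punctuation and capitalised words sort before lowercase words, which is why the vocabulary listing starts with '!' and then runs through the capitalised tokens first.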

In [10]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
	raw_text = f.read()
preprocessed = re.split(r'([,.:?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
all_words = sorted(set(preprocessed))
vocab = {token:integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
	print(item)
	if i > 50:
		break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindles', 44)
('HAD', 45)
('Had', 46)
('Hang', 47)
('Has', 48)
('He', 49)
('Her', 50)
('Hermia', 51)


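We have now turned the list of unique tokens into a dictionary that maps each token to an integer ID. Note that 'Carlo;' appears as a single token because the split pattern used in these cells does not include the semicolon.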
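As a quick sanity check of the idea, here is a sketch on a handful of made-up tokens: build the token-to-ID dictionary, build its inverse, and confirm that mapping to IDs and back returns the original tokens. The names vocab_demo and inverse_demo are mine and are kept separate from the real vocab.

```python
tokens = ["a", "good", "fellow", ",", "a", "cheap", "genius"]

# Token -> integer ID, assigned in sorted order.
vocab_demo = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Integer ID -> token, needed to turn IDs back into text.
inverse_demo = {idx: token for token, idx in vocab_demo.items()}

ids = [vocab_demo[t] for t in tokens]
restored = [inverse_demo[i] for i in ids]

print(vocab_demo)
print(ids)
print(restored == tokens)
```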

I'm now going to use these notes to test a simple text tokenizer:

In [11]:
import re
class SimpleTokenizerV1:
    def __init__(self,vocab):
        self.str_to_int = vocab #stores the vocab as a class attribute for access in the encode and decode methods
        self.int_to_str = {i:s for s, i in vocab.items()} # creates an inverse vocab that maps token ids back to original text tokens
    def encode(self, text):
        preprocessed = re.split(r'([,.:?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
            ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids= tokenizer.encode(text)
print(ids)

[1, 57, 2, 861, 999, 610, 538, 754, 5, 1139, 603, 5, 1, 68, 7, 39, 862, 1121, 764, 803, 7]


In [12]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.

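The decoded text above still has a space after the opening quotation mark and after the apostrophe. The sketch below isolates the re.sub call from decode() on a few tokens to show why: the pattern only removes whitespace that comes before a punctuation character, never whitespace that comes after one. The token list is typed in by hand for illustration.

```python
import re

tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', ',', '"']
joined = " ".join(tokens)

# Same substitution as in decode(): delete whitespace that precedes punctuation.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```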

Now let's create a new text and see how the tokenizer handles it:

In [13]:
text = "Hello, do you like Tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

This raises a KeyError because "Hello" never appears in The Verdict and is therefore not in the vocabulary. It highlights the need for large and diverse training sets.
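One way to see which tokens would trigger the error before calling encode() is to list everything that is missing from the vocabulary. This is a sketch only: find_unknown_tokens is a hypothetical helper and toy_vocab is a small stand-in for the real vocab.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in text that have no entry in vocab (hypothetical helper)."""
    pieces = re.split(r'([,.:?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [t for t in tokens if t not in vocab]

# Stand-in vocabulary, not the one built from The Verdict.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

missing = find_unknown_tokens("Hello, do you like Tea?", toy_vocab)
print(missing)
```

The lookup is case-sensitive, so in this toy vocabulary "Tea" is reported as missing even though "tea" is present.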

In [14]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab ={token:integer for integer, token in enumerate(all_tokens)}

print(len(vocab.items()))

1145


Printing the last five entries of the updated vocab dictionary shows the two new special tokens:

In [15]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1140)
('your', 1141)
('yourself', 1142)
('<|endoftext|>', 1143)
('<|unk|>', 1144)


In the class below, str_to_int stands for "string to integer": it stores the dictionary that converts token strings into integer IDs. int_to_str is the inverse mapping.

In [16]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab 
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
    


Unlike the first SimpleTokenizer, this version replaces unknown words with <|unk|> through the else branch of the conditional expression:
else "<|unk|>" for item in preprocessed
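The same fallback can also be expressed with dict.get, which returns a default value when the key is missing. This is an alternative sketch on a toy vocabulary, not the code used in SimpleTokenizerV2.

```python
# Stand-in vocabulary that already contains the <|unk|> token.
toy_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3, "<|unk|>": 4}
tokens = ["Hello", "do", "you", "like", "tea"]

unk_id = toy_vocab["<|unk|>"]

# Look up each token; fall back to the <|unk|> ID when it is not in the vocabulary.
ids = [toy_vocab.get(tok, unk_id) for tok in tokens]
print(ids)
```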


We join two unrelated text samples with the <|endoftext|> token:

In [17]:
text1 = "Hello, do you like tea?"
text2 = "in the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> in the sunlit terraces of the palace.


In [18]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1144, 5, 360, 1139, 636, 986, 10, 1143, 574, 999, 967, 995, 730, 999, 1144, 7]


In [19]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> in the sunlit terraces of the <|unk|>.


GPT models do not use an <|unk|> token for out-of-vocabulary words; instead, they use a byte pair encoding (BPE) tokenizer.


Byte pair encoding!

Implementing BPE is fairly complex, so we will use an existing open-source library called tiktoken.

I've already installed tiktoken with pip in my LLM environment for this project.

In [21]:
from importlib.metadata import version

import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.9.0


In [22]:
tokenizer = tiktoken.get_encoding("gpt2")

In [29]:
text = ("Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.")
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


Converting back to text:

In [30]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


<|endoftext|> is assigned a very large integer... 50256
The GPT-2 BPE tokenizer has a vocabulary of 50,257 tokens, so 50256 is the last ID in the vocabulary.

BPE tokenizers break unknown words down into subwords or individual characters, so they can encode and decode them without needing <|unk|>.

This means the tokenizer can process any text, even if a word is not part of its vocabulary.
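To get a feel for where subwords come from, here is a toy sketch of the core BPE training idea: repeatedly find the most frequent pair of adjacent symbols and merge it into one new symbol. This is a simplified character-level illustration on a made-up word list, not tiktoken's implementation (GPT-2's tokenizer works on bytes and ships with an already learned merge table). The function names and the frequencies are assumptions of mine.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with a made-up frequency.
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 3}

merges = []
for _ in range(2):                      # learn two merges
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After two merges the frequent prefix "low" has become a single symbol, while the rarer endings are still split into characters. A word that was never seen can always be spelled out from smaller pieces, which is why no <|unk|> token is needed.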


Exercise 2.1

In [34]:
text2_1 = ("Akwirw ier")

integers2_1 = tokenizer.encode(text2_1)

print(integers2_1)

[33901, 86, 343, 86, 220, 959]


That worked. Now let's decode the IDs back into text:

In [41]:
string2_1 = tokenizer.decode(integers2_1)

print(string2_1)

print(f"Encoding my made-up text using the open-source tiktoken BPE tokenizer produces the following integer list: {integers2_1}.\nDecoding it brings the text back by using a subword mechanism: {string2_1}.")

Akwirw ier
Encoding my made-up text using the open-source tiktoken BPE tokenizer produces the following integer list: [33901, 86, 343, 86, 220, 959].
Decoding it brings the text back by using a subword mechanism: Akwirw ier.


We are now going to tokenize the whole of The Verdict using BPE:

In [42]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


We remove the first 50 tokens from the dataset to make a point that I will clarify later:

In [44]:
enc_sample = enc_text[50:]

In [45]:
context_size = 4 # determines how many tokens are included in the input
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [46]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


Converting the token IDs into text:

In [50]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


We have now created input-target pairs to use in LLM training.
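The same pairing can be wrapped in a small function that slides a window over the token IDs. This is a sketch under my own assumptions: make_pairs and its stride parameter are hypothetical, and the IDs are the first five values of enc_sample printed above, typed in as a literal list.

```python
def make_pairs(token_ids, context_size, stride=1):
    """Return (input, target) windows; each target is the input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - context_size, stride):
        x = token_ids[start:start + context_size]
        y = token_ids[start + 1:start + context_size + 1]
        pairs.append((x, y))
    return pairs

# First IDs of enc_sample from the output above, used as a literal list.
sample_ids = [290, 4920, 2241, 287, 257]

for x, y in make_pairs(sample_ids, context_size=4):
    print(x, "---->", y)
```

A larger stride moves the window further each step, which reduces the overlap between consecutive inputs.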

What's left is a data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.


We will use PyTorch's built-in Dataset and DataLoader classes.

In [52]:
import torch
print(torch.__version__)

2.6.0


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
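
One way these two classes might be combined with the sliding-window idea is sketched below. It is a minimal illustration, not a finished implementation: SlidingWindowDataset is a hypothetical name, the context_size and stride values are arbitrary, and toy_ids stands in for the real enc_text.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):  # hypothetical name
    """Holds (input, target) tensors cut from one sequence of token IDs."""

    def __init__(self, token_ids, context_size, stride):
        self.inputs = []
        self.targets = []
        for start in range(0, len(token_ids) - context_size, stride):
            window = token_ids[start:start + context_size + 1]
            self.inputs.append(torch.tensor(window[:-1]))   # first context_size IDs
            self.targets.append(torch.tensor(window[1:]))   # same IDs shifted by one

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Stand-in token IDs; in the notebook this would be enc_text.
toy_ids = list(range(20))

dataset = SlidingWindowDataset(toy_ids, context_size=4, stride=4)
loader = DataLoader(dataset, batch_size=2, shuffle=False, drop_last=True)

inputs, targets = next(iter(loader))
print(inputs)
print(targets)
```

With shuffle=False the batches come out in order, which makes it easy to check by eye that each target row is the input row shifted by one position.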