# Step 1: Creating Tokens

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()  # store the contents of the-verdict.txt

print("total number of characters: ", len(raw_text))
print(raw_text[:99])  # prints the first 99 characters of the file

total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Our goal is to split these 20479 characters into individual words and punctuation marks (tokens) that we can later turn into embeddings for LLM training.


Now the question is: how can we best split this text to obtain a list of tokens?
For this we will use Python's regular expression library (re) and split the text on whitespace and punctuation into individual tokens.

In [4]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)  # split wherever whitespace occurs; the capture group keeps the whitespace in the result

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


From the above output we can see that the result is a list of words and whitespace characters, with the punctuation still attached to the words (for example, 'Hello,'). Now let's modify the regular expression so that it splits on commas and periods as well as whitespace (\s).

In [5]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


A small issue here is that the list still includes whitespace characters and empty strings. We remove them as follows.

In [8]:
result = [item for item in result if item.strip()]  # keep only the items that are non-empty after stripping whitespace
# item.strip() returns the item without leading/trailing whitespace; an empty string is falsy, so empty and whitespace-only items are dropped
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


REMOVING WHITESPACES OR NOT?

When developing a simple tokenizer, whether we should encode whitespace as separate tokens or just remove it depends on our application and its requirements.

The advantage of removing whitespace is that it reduces memory and computing requirements. However, keeping it can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing).
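
As a minimal sketch of this trade-off, the snippet below tokenizes a short piece of Python source twice with the same kind of regular expression: once dropping whitespace and once keeping it. The sample string `code` and the names `stripped` and `kept` are illustrative and not part of the original notebook.

```python
import re

code = "if x:\n    return 1"
pattern = r'([,.:;?_!"()\']|--|\s)'

parts = re.split(pattern, code)

# Variant 1: drop empty and whitespace-only items.
stripped = [item.strip() for item in parts if item.strip()]

# Variant 2: keep whitespace tokens and drop only the empty strings
# that re.split produces between adjacent delimiters.
kept = [item for item in parts if item != ""]

print(stripped)
print(kept)
print("".join(kept) == code)  # the original text can be rebuilt exactly
```

With whitespace kept, the newline and the four indentation spaces survive as tokens, so the source can be reconstructed exactly; with whitespace removed, that structure is lost but the token list is shorter.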

The tokenization scheme used above works well, but the input text can contain other punctuation such as question marks, quotation marks, and double dashes. So we will modify the splitting criteria again to handle them.

In [9]:
text = "Hello, world!. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '!', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [10]:
# filter out empty and whitespace-only strings (a no-op here, because result was already cleaned in the previous cell)

result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '!', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [11]:
text = "Hello, world!. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

result = [item.strip() for item in result if item.strip()] 
# the leading item.strip() removes whitespace around each token; the trailing `if item.strip()` drops empty and whitespace-only items
print(result)

['Hello', ',', 'world', '!', '.', 'Is', 'this', '--', 'a', 'test', '?']


### Now let's apply this tokenizer to our raw data.

We apply this tokenizer to our data and store the result in a variable named preprocessed.

In [14]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [15]:
print(len(preprocessed))  # total number of tokens in preprocessed

4690


We have successfully tokenized the entire dataset. Now we proceed to the second step, where we assign IDs to the tokens: machines cannot work with the token strings directly, so each token needs a numeric ID.

# Step 2: Creating Token IDs

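In this step we collect the unique tokens in the preprocessed variable, sort them, and then determine the vocabulary size. Python sorts strings by Unicode code point, so punctuation and uppercase words come before lowercase words.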

In [16]:
all_words = sorted(set(preprocessed))  # convert to a set to remove duplicates, then sort the unique tokens
vocab_size = len(all_words)

print(vocab_size)

1130


Here the number is smaller than the token count (4690) because the vocabulary size counts only the unique tokens present in the preprocessed variable.
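
To illustrate the difference between the token count and the vocabulary size, here is a small sketch on a made-up sentence (`sample` is an assumption for illustration, not taken from the dataset). It uses `collections.Counter` from the standard library to count how often each token occurs.

```python
import re
from collections import Counter

sample = "the cat saw the dog, and the dog saw the cat."
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
tokens = [t.strip() for t in tokens if t.strip()]

counts = Counter(tokens)

print(len(tokens))            # total number of tokens
print(len(counts))            # number of unique tokens (vocabulary size)
print(counts.most_common(3))  # the most frequent tokens
```

Repeated tokens such as "the" are counted once in the vocabulary, which is why the vocabulary is smaller than the token list.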


Now we build the vocabulary, a dictionary that maps each token to its token ID.

In [17]:
vocab = {token:integer for integer, token in enumerate(all_words)}
# this assigns a unique integer ID to every unique token

In [18]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i>=50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


In [21]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}  # reverse mapping, used by decode to turn token IDs back into tokens

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed] #converting tokens into token IDs.
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])  # use the reverse dictionary to convert token IDs to tokens
        # remove the spaces before the specified punctuation marks so the text reads naturally
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [22]:
# try the tokenizer class we created on a short passage from the dataset
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
            Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


Our tokenizer has converted the tokens into token IDs. Now let's test whether our decoder can convert the token IDs back into text.

In [23]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

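From the above result we can see that we have converted the text to token IDs and the token IDs back to text for a passage from the training data. The round trip is not exact: the decoder leaves extra spaces after the opening quotation mark and around the apostrophe (" It' s instead of "It's). Now let's go a step further. What if we give it a sentence containing words that are not present in the dataset?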


In [24]:
# test it with words that are not present in the dataset

text = "Hello , do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

We get an error for the above sentence because the word Hello is not in our vocabulary. This shows that we need large and diverse training sets to extend the vocabulary when working on LLMs.
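
One way to see which tokens would trigger the KeyError before calling encode is to compare the tokenized text against the vocabulary. The helper `find_unknown_tokens` and the tiny `toy_vocab` below are hypothetical names introduced for this sketch; with the real notebook you would pass the `vocab` built above instead.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in `text` that are missing from `vocab`."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Toy vocabulary for illustration only; it is not the vocabulary of the-verdict.txt.
toy_vocab = {tok: i for i, tok in enumerate(sorted({"do", "you", "like", "tea", "?", ","}))}

unknown = find_unknown_tokens("Hello , do you like tea?", toy_vocab)
print(unknown)
```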

# Adding Special Context Tokens

In the previous section we implemented a simple tokenizer, and it raised an error when asked to tokenize a word that was not present in the training data. So in this section we will modify the tokenizer to handle unknown words. In particular, we will modify the vocabulary and the tokenizer from the previous section and implement version 2 of SimpleTokenizer to handle unknown tokens.

We can modify the tokenizer to use an <|unk|> token whenever it encounters a word that is not part of the vocabulary. Furthermore, we add an <|endoftext|> token between unrelated texts.
For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.


We modify the existing vocabulary to include these two special tokens, <|endoftext|> and <|unk|>. Previously the size of the vocabulary was 1130; after adding these two tokens it becomes 1132.

In [25]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [26]:
len(vocab.items())

1132

In [27]:
# to check our modification, print the last 5 entries of the vocabulary
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [35]:
# now we extend the simple tokenizer class to handle unknown tokens

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed  # replace any token that is not in the vocabulary with <|unk|>
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        #replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [36]:
tokenizer = SimpleTokenizerV2(vocab)

text1= "Hello, do you like tea?"
text2= "In the sunlit terraces of the palace."

text = "<|endoftext|>".join((text1, text2))
print(text)

Hello, do you like tea?<|endoftext|>In the sunlit terraces of the palace.


In [37]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1131, 988, 956, 984, 722, 988, 1131, 7]

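From the above result we can see that, since Hello is not in our vocabulary, it is encoded as 1131, the token ID of <|unk|>. The second 1131 (right after 10, the ID of ?) is also <|unk|>: text1 and text2 were joined without surrounding spaces and the regular expression does not split on <, | or >, so the string <|endoftext|>In reaches the lookup as a single token that is not in the vocabulary. Joining the texts with " <|endoftext|> " (with spaces) would keep the marker as a separate token, which maps to ID 1130.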

In [38]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|unk|> the sunlit terraces of the <|unk|>.'

Based on the detokenized text above, we can see that the words "Hello" and "palace" were not present in our vocabulary, so they were replaced with <|unk|>. The <|unk|> in the middle stands for the string "<|endoftext|>In", which was tokenized as one unknown token.
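
A minimal sketch of how the join string affects tokenization, reusing the regular expression from the tokenizer class. The helper `tokenize` and the names `fused` and `spaced` are illustrative additions, not part of the original notebook.

```python
import re

pattern = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    """Split on punctuation and whitespace, dropping empty items."""
    return [t.strip() for t in re.split(pattern, text) if t.strip()]

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

fused = "<|endoftext|>".join((text1, text2))     # no spaces around the marker
spaced = " <|endoftext|> ".join((text1, text2))  # marker surrounded by spaces

print(tokenize(fused)[6:9])
print(tokenize(spaced)[6:10])
```

In the fused variant the marker sticks to the following word, while in the spaced variant it is a token of its own and can be looked up in the vocabulary.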

Apart from these two tokens, there are other special tokens that people commonly use:
1. [BOS] (beginning of sequence): marks the start of a text. It signals to the LLM where a piece of content begins.
2. [EOS] (end of sequence): positioned at the end of a text; it is useful when concatenating multiple unrelated texts.
3. [PAD] (padding): when training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are padded with the [PAD] token up to the length of the longest text in the batch.
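
The padding step can be sketched as follows. The function `pad_batch`, the example token IDs, and the choice of 0 as the padding ID are all assumptions made for illustration; the vocabulary built in this notebook does not contain a [PAD] token.

```python
def pad_batch(batch, pad_id):
    """Right-pad every ID sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Hypothetical token IDs; pad_id=0 stands in for the ID of a [PAD] token.
batch = [[5, 7, 9], [3], [8, 2]]
padded = pad_batch(batch, pad_id=0)
print(padded)
```

After padding, every sequence in the batch has the same length, so the batch can be stacked into a single rectangular array.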