## Step 1. Creating Tokens



<div class="alert alert-block alert-success">
Here you can find the contents of [The Verdict](https://)

Let's read the file and print the total number of characters, followed by the first 99
characters for illustration purposes. </div>

In [16]:
with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div class="alert alert-block alert-success">
Using Python's regular expression library we will split the text to obtain a list of tokens.

We treat each individual word and punctuation mark as a token. </div>
<div class="alert alert-block alert-warning">
For this example, we ignore whitespace. For certain applications, you will need to keep it.</div>

In [11]:
import re
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
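
The warning above notes that some applications need whitespace preserved. As a minimal sketch (the `sample` string and the variable names are illustrative assumptions, not part of the notebook), the same pattern can keep whitespace tokens by dropping only the empty strings that `re.split` returns:

```python
import re

sample = "Hello, world. Is this--a test?"

# Same pattern as in the cell above; the capturing group keeps delimiters.
pattern = r'([,.:;?_!"()\']|--|\s)'

# Variant 1 (as in the notebook): strip pieces and drop whitespace-only ones.
without_ws = [item.strip() for item in re.split(pattern, sample) if item.strip()]

# Variant 2 (assumption: whitespace matters): drop only the empty strings
# that re.split produces between adjacent delimiters.
with_ws = [item for item in re.split(pattern, sample) if item]

print(without_ws)
print(with_ws)

# With whitespace kept, the split is lossless: joining restores the input.
print("".join(with_ws) == sample)
```

Keeping whitespace makes the token list longer, but the original string can then be reconstructed exactly, which can matter for whitespace-sensitive text such as source code.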


## Step 2. Creating Token IDs

<div class="alert alert-block alert-success">
Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size: </div>

In [59]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print("Size of our vocabulary:",vocab_size)

Size of our vocabulary: 1130


<div class="alert alert-block alert-success">
Create a dictionary for our vocabulary and print the first 15 entries. </div>

In [60]:
vocab = {token:id for id, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 14:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
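
Note that `sorted` orders strings by Unicode code point, which is why punctuation and capitalized words come first in the listing above. The following sketch uses a small, made-up token list (`toy_tokens` is a hypothetical name, not taken from the story) to show the ordering and the token-to-ID mapping in both directions:

```python
# Toy token list (hypothetical, not taken from the story).
toy_tokens = ["the", "A", "!", "apple", "Zebra", "--", "the"]

# set() removes the duplicate "the"; sorted() orders by Unicode code point,
# so punctuation precedes uppercase letters, which precede lowercase letters.
toy_words = sorted(set(toy_tokens))
print(toy_words)

# Token -> ID mapping and its inverse (ID -> token).
toy_vocab = {token: idx for idx, token in enumerate(toy_words)}
toy_inverse = {idx: token for token, idx in toy_vocab.items()}
print(toy_vocab["apple"], toy_inverse[0])
```

Because the IDs depend on the sort order, adding or removing tokens changes the IDs of the other tokens, so a vocabulary should be fixed before it is used for encoding.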


## Step 3. Creating a Tokenizer Class

<div class="alert alert-block alert-success">

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits
text into tokens and carries out the string-to-integer mapping to produce token IDs via the
vocabulary. 

In addition, we implement a decode method that carries out the reverse
integer-to-string mapping to convert the token IDs back into text.

</div>

<div class="alert alert-block alert-info">
    
Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods
    
Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Remove spaces before the specified punctuation

</div>

In [61]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.encoder = vocab
        self.decoder = {id:token for token, id in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.encoder[item] for item in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.decoder[id] for id in ids])
        # Remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

<div class="alert alert-block alert-success">
Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a
passage from Edith Wharton's short story to try it out in practice:
</div>

In [62]:
tokenizerv1 = SimpleTokenizerV1(vocab)

text = """"Oh, by Jove!" I said.

It was a sketch of a donkey--an old tired donkey, standing in the rain under a wall.

"By Jove--a Stroud!" I cried."""
ids = tokenizerv1.encode(text)
print(ids)

[1, 74, 5, 241, 58, 0, 1, 53, 851, 7, 56, 1077, 115, 899, 722, 115, 361, 6, 156, 726, 1015, 361, 5, 923, 568, 988, 815, 1044, 115, 1072, 7, 1, 23, 58, 6, 115, 89, 0, 1, 53, 300, 7]


<div class="alert alert-block alert-info">
Next, let's see if we can turn these token IDs back into text using the decode method:
</div>

In [63]:
print(tokenizerv1.decode(ids))

" Oh, by Jove!" I said. It was a sketch of a donkey -- an old tired donkey, standing in the rain under a wall." By Jove -- a Stroud!" I cried.

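
The decoded text is close to the input but not identical: spaces appear around `--` and after opening quotation marks, and the line breaks are gone. The sketch below reproduces this with two standalone helper functions (`split_tokens` and `join_tokens` are hypothetical names that mirror the regular expressions used in `encode` and `decode`, without the ID lookup):

```python
import re

def split_tokens(text):
    """Split text the same way SimpleTokenizerV1.encode does."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

def join_tokens(tokens):
    """Join tokens the same way SimpleTokenizerV1.decode does."""
    text = " ".join(tokens)
    return re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)

original = '"By Jove--a Stroud!" I cried.'
restored = join_tokens(split_tokens(original))

print(restored)
print(restored == original)                              # strings differ
print(split_tokens(restored) == split_tokens(original))  # tokens match
```

In other words, the token sequence survives the round trip, but the exact spacing does not, because whitespace was discarded during tokenization.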

<div class="alert alert-block alert-success">
Let's see what happens if we try the same with a text not included in our sample text.
</div>

In [64]:
text = "Hello, how are you?"
ids = tokenizerv1.encode(text)
print(ids)

KeyError: 'Hello'

<div class="alert alert-block alert-info">
    
The problem is that the word "Hello" was not used in The Verdict short story.

Hence, it
is not contained in the vocabulary. 

This highlights the need to consider large and diverse
training sets to extend the vocabulary when working on LLMs.

</div>
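
One way to anticipate this error is to check which tokens are missing from the vocabulary before encoding. The following is a minimal sketch under stated assumptions: `split_tokens`, `find_unknown`, and `toy_vocab` are hypothetical names introduced here for illustration, and the toy vocabulary stands in for the one built from the story.

```python
import re

def split_tokens(text):
    """Split text with the same pattern used by SimpleTokenizerV1.encode."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

def find_unknown(text, vocab):
    """Return the tokens of `text` missing from `vocab`, in order of first appearance."""
    unknown = []
    for token in split_tokens(text):
        if token not in vocab and token not in unknown:
            unknown.append(token)
    return unknown

# Hypothetical toy vocabulary; in the notebook, `vocab` is built from the story.
toy_vocab = {",": 0, "?": 1, "are": 2, "how": 3, "you": 4}

missing = find_unknown("Hello, how are you?", toy_vocab)
print(missing)  # ['Hello']
```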

## Step 4. Adding Special Context Tokens

In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. 

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will extend the vocabulary and the tokenizer from the previous section, SimpleTokenizerV1, into a new SimpleTokenizerV2 that supports two new tokens, <|unk|> and <|endoftext|>.

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary. 

Furthermore, we add a token between
unrelated texts. 

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>
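
As a small illustration of the separator idea, the sketch below joins two unrelated texts with `<|endoftext|>` and splits the result using the same pattern as before. The two sample strings are made up for this example. Since none of the characters in `<|endoftext|>` appear in the pattern's punctuation set, the separator survives as a single token:

```python
import re

# Two unrelated, made-up text sources.
doc1 = "Hello, do you like tea?"
doc2 = "In the sunlit terraces of the palace."

# Insert the separator between the documents.
joined = " <|endoftext|> ".join((doc1, doc2))

parts = re.split(r'([,.:;?_!"()\']|--|\s)', joined)
tokens = [p.strip() for p in parts if p.strip()]
print(tokens)
```

For the separator to be encoded into an ID, it still has to be present in the vocabulary.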



In [66]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [67]:
len(vocab)

1132

## Step 5. Creating a Tokenizer Class with Context Tokens

<div class="alert alert-block alert-success">
A simple text tokenizer that handles unknown words</div>



<div class="alert alert-block alert-info">
Step 1: Replace unknown words with <|unk|> tokens
    
Step 2: Remove spaces before the specified punctuation
</div>

In [71]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.encoder = vocab
        self.decoder = {id:token for token, id in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.;:?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.encoder
            else "<|unk|>" for item in preprocessed
        ]
        ids = [self.encoder[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.decoder[id] for id in ids])
        # Remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.;:?_!"()\'])', r'\1', text)
        return text

In [73]:
text = "Hello, how are you?"
tokenizerv2 = SimpleTokenizerV2(vocab)
ids = tokenizerv2.encode(text)
print(ids)
print(tokenizerv2.decode(ids))

[1131, 5, 560, 169, 1126, 10]
<|unk|>, how are you?


<div class="alert alert-block alert-info">
Based on the detokenized output, we can tell that "Hello" is not part of the vocabulary.
</div>

<div class="alert alert-block alert-warning">
    
Depending on the LLM, some researchers also consider additional special tokens such as the following.

* [BOS] (beginning of sequence): This token marks the start of a text.

* [EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>.

* [PAD] (padding): To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token.
</div>
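
To make the roles of these tokens concrete, here is a minimal sketch that wraps token ID sequences in [BOS] and [EOS] and right-pads them with [PAD] to a common length. The ID values and the `pad_batch` helper are assumptions made for illustration; real tokenizers assign their own IDs and may pad on the left instead.

```python
# Hypothetical special-token IDs, chosen for illustration only.
BOS, EOS, PAD = 0, 1, 2

def pad_batch(sequences, bos=BOS, eos=EOS, pad=PAD):
    """Wrap each ID sequence in BOS/EOS and right-pad to the longest length."""
    wrapped = [[bos] + list(seq) + [eos] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad] * (max_len - len(seq)) for seq in wrapped]

batch = pad_batch([[5, 6, 7], [8]])
for row in batch:
    print(row)
# [0, 5, 6, 7, 1]
# [0, 8, 1, 2, 2]
```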

<div class="alert alert-block alert-warning">

Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; for simplicity, it uses only an <|endoftext|> token. GPT models also do not use an <|unk|> token for out-of-vocabulary words. Instead, they use a byte pair encoding tokenizer, which breaks words down into subword units.

</div>
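
To give a rough idea of how byte pair encoding arrives at subword units, the sketch below runs two merge steps on a toy corpus: it repeatedly finds the most frequent pair of adjacent symbols and merges it into one symbol. This is a simplified, character-level illustration with made-up word counts and hypothetical function names. It is not the tokenizer used by GPT models, which operates on bytes and relies on a much larger learned merge table.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical word counts), each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    merges.append(pair)

print(merges)
print(list(words))
```

After two merges, the frequent stem `low` has become a single symbol, while the rarer endings remain split into characters. An unseen word such as "lowly" could still be represented from these pieces, which is why no <|unk|> token is needed.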