<div class="alert alert-block alert-info"> Reading a short story into Python as a text sample. </div>



In [1]:
with open("text_data/the-verdict.txt","r",encoding="utf-8") as f:
  raw_text = f.read()



## Step 1: Creating Tokens

<div class="alert alert-block alert-warning">
The print command prints the total number of characters followed by the first 99 characters of this file for illustration purposes.
</div>

In [2]:
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div class="alert alert-block alert-success">The goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training. </div>

<div class="alert alert-block alert-info"> Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace </div>

In [3]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)',text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


<div class="alert alert-block alert-warning">The result is a list of individual words, whitespace characters, and punctuation characters. Note that the punctuation is still attached to the words at this point. </div>
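The parentheses in the pattern matter: they form a capturing group, and `re.split` keeps whatever a capturing group matches as separate list entries. The following minimal sketch contrasts the two behaviours on the same sample text; the variable names `without_group` and `with_group` are only illustrative.

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the whitespace delimiters are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched delimiter is kept as its own list entry.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the delimiters is what later allows punctuation characters to survive as tokens instead of being thrown away by the split.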

In [4]:
# Let's modify the regular expression so that it splits on whitespace (\s), commas, and full stops
result = re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


We can see that the words and punctuation characters are now separate list entries.

<div class="alert alert-block alert-danger"> A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we can remove these redundant entries safely as follows: </div>

In [5]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
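The empty strings in the earlier output appear wherever two delimiters sit next to each other (for example, a comma followed by a space) or where a delimiter ends the text. A small sketch, using a shortened sample string chosen here for illustration, shows why a single `item.strip()` test removes both kinds of unwanted entries:

```python
import re

text = "Hello, world."

# Adjacent delimiters (", " here) and a trailing delimiter produce empty strings.
parts = re.split(r'([,.]|\s)', text)
print(parts)

# "".strip() and " ".strip() are both the empty string, which is falsy,
# so one condition drops empty strings and whitespace-only strings alike.
filtered = [item for item in parts if item.strip()]
print(filtered)
```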


#### Removing whitespaces or not
<div class="alert alert-block alert-info">
When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.

</div>
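To make the trade-off concrete, here is a minimal sketch on a two-line Python snippet (the string `code` is a made-up example). With whitespace tokens kept, the original text can be rebuilt exactly; once they are dropped, the indentation and the line break cannot be recovered.

```python
import re

code = "if x:\n    return 1"

# Keep every piece, including the whitespace delimiters.
with_ws = [t for t in re.split(r'(\s)', code) if t != ""]

# Drop whitespace-only pieces, as done in this section.
without_ws = [t for t in with_ws if t.strip()]

print("".join(with_ws) == code)   # the original string can be rebuilt exactly
print(" ".join(without_ws))       # indentation and the line break are lost
```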

In [6]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)',text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [7]:
# Filter out empty and whitespace-only strings (a no-op here, since the previous cell already removed them).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<div class="alert alert-block alert-warning">Testing with a sample text </div>

In [8]:
text = "Hello, world. Is this-- a test?"

In [9]:
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<div class="alert alert-block alert-success"> Now that we have a basic tokenizer working, let's apply it to Edith Wharton's entire short story: </div>

In [10]:
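# Note: this pattern is a single character class. It splits on , . : ; ? _ ' " ( ) [ ] | and whitespace,
# but not on "--" or "!", so tokens such as 'genius--though' stay in one piece.
preprocessed = re.split(r'([,.:;?_\'"()\[\]|\s])', raw_text)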
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius--though', 'a', 'good', 'fellow', 'enough--so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his']


In [11]:
print(len(preprocessed))

4481
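The pattern applied to the full story is not the same as the one used in the earlier sample cells, and the difference shows up in tokens such as `'genius--though'`. The sketch below compares the two patterns on a short phrase; the helper `tokenize` and the variable names are introduced here only for illustration.

```python
import re

sample = "genius--though a good fellow!"

# Pattern from the earlier cells: a punctuation class, OR "--", OR whitespace.
alternation = r'([,.:;?_!"()\']|--|\s)'

# Pattern applied to the full story: one character class, in which "|" is a
# literal character and neither "--" nor "!" is listed.
char_class = r'([,.:;?_\'"()\[\]|\s])'

def tokenize(pattern, text):
    """Split on the pattern and drop empty or whitespace-only pieces."""
    return [t.strip() for t in re.split(pattern, text) if t.strip()]

alt_tokens = tokenize(alternation, sample)
cls_tokens = tokenize(char_class, sample)
print(alt_tokens)
print(cls_tokens)
```

Which pattern is used affects both the token count and the vocabulary built from it, so the same pattern should be used when building the vocabulary and when encoding new text.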


## Step 2: Creating Token IDs

<div class="alert alert-block alert-success">
In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed.

Let's create a list of unique tokens and sort them alphabetically to determine the vocabulary size </div>

In [12]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1217


After determining that the vocabulary size is 1,217 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes.

In [13]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [14]:
for i,item in enumerate(vocab.items()):
  print(item)
  if i >=50:
    break

('"', 0)
("'", 1)
('(', 2)
(')', 3)
(',', 4)
('--and', 5)
('--it', 6)
('--oh', 7)
('--she', 8)
('--that', 9)
('.', 10)
(':', 11)
(';', 12)
('?', 13)
('A', 14)
('Ah', 15)
('Ah--I', 16)
('Among', 17)
('And', 18)
('Are', 19)
('Arrt', 20)
('As', 21)
('At', 22)
('Be', 23)
('Begin', 24)
('Burlington', 25)
('But', 26)
('By', 27)
('Carlo', 28)
('Chicago', 29)
('Claude', 30)
('Come', 31)
('Croft', 32)
('Destroyed', 33)
('Devonshire', 34)
('Don', 35)
('Dubarry', 36)
('Emperors', 37)
('Florence', 38)
('For', 39)
('Gallery', 40)
('Gideon', 41)
('Gisburn', 42)
('Gisburn!', 43)
('Gisburn--as', 44)
('Gisburn--fond', 45)
('Gisburns', 46)
('Grafton', 47)
('Greek', 48)
('Grindle', 49)
('Grindles', 50)


<div class="alert alert-block alert-success"> As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels. </div>


<div class="alert alert-block alert-success"> Later, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs back into text.

For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens. </div>

<div class="alert alert-block alert-success"> Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs.

In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text. </div>

In [15]:
class SimpleTokenizerV1:
  def __init__(self,vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}


  def encode(self,text):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)

    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids


  def decode(self,ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # Replace spaces before the specified punctuations

    text = re.sub(r'\s+([,.?!"()\'])',r'\1',text)
    return text


Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice.

In [16]:
tokenizer = SimpleTokenizerV1(vocab)

text = """
"It's the last he painted,you know,"
Mrs. Gisburn said with pardonable pride."""



In [17]:
ids = tokenizer.encode(text)
print(ids)

[0, 63, 1, 920, 1065, 646, 563, 808, 4, 1212, 640, 4, 0, 78, 10, 42, 921, 1191, 818, 863, 10]


<div class="alert alert-block alert-success"> The code above prints the token IDs of the passage. Next, let's see if we can turn these token IDs back into text using the decode method: </div>

In [18]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

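<div class="alert alert-block alert-success"> Based on the output above, we can see that the decode method converted the token IDs back into text that matches the original apart from whitespace: the line breaks are gone, and a space now follows the opening quote and the apostrophe. </div>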

<div class="alert alert-block alert-success"> So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set: </div>

In [19]:
text = "Hello, do you like phone?"
print(tokenizer.encode(text))

KeyError: 'Hello'

<div class="alert alert-block alert-danger"> The problem is that the word <b> "Hello" </b> was not used in the short story <b>The Verdict</b>.

Hence, it is not contained in the vocabulary.

This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs. </div>
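Before encoding new text, it can help to check which of its tokens are missing from the vocabulary, since the `KeyError` only reports the first one. The following is a minimal sketch; the function `find_unknown_tokens` and the tiny `toy_vocab` are hypothetical and stand in for the real vocabulary built from the story.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that have no entry in `vocab`."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Hypothetical toy vocabulary, used only to keep the example self-contained.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "you": 4}

unknown = find_unknown_tokens("Hello, do you like phone?", toy_vocab)
print(unknown)
```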

## ADDING SPECIAL CONTEXT TOKENS
<div class="alert alert-block alert-info">
In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will extend the vocabulary and turn the tokenizer from the previous section, SimpleTokenizerV1, into SimpleTokenizerV2, which supports two new tokens, <b><|unk|></b> and <b><|endoftext|></b> </div>




We can modify the tokenizer to use an `<|unk|>` token if it encounters a word that is not part of the vocabulary.

Furthermore, we add an `<|endoftext|>` token between unrelated texts.

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert an `<|endoftext|>` token before each document or book that follows a previous text source.

#### Let's now modify the vocabulary to include these two special tokens, `<|unk|>` and `<|endoftext|>`, by adding them to the list of all unique tokens from the previous section

In [24]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>","<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [25]:
len(vocab.items())

1219

<div class="alert alert-block alert-success"> Based on the output above, the new vocabulary size is 1219 </div>

<div class="alert alert-block alert-warning"> As an additional quick check, let's print the last 5 entries of the updated vocabulary </div>

In [26]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1214)
('your', 1215)
('yourself', 1216)
('<|endoftext|>', 1217)
('<|unk|>', 1218)


#### A simple text tokenizer that handles unknown words

Step 1: Replace unknown words by <|unk|> tokens

Step 2: Replace spaces before the specified punctuations

In [27]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_\'"()\[\]|\s])', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [28]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = "<|endoftext|>".join((text1,text2))

In [29]:
print(text)

Hello, do you like tea?<|endoftext|>In the sunlit terraces of the palace.


In [30]:
tokenizer.encode(text)

[1218,
 4,
 381,
 1212,
 673,
 1052,
 13,
 1218,
 1218,
 1218,
 1218,
 1218,
 1065,
 1031,
 1061,
 778,
 1065,
 1218,
 10]

In [31]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|unk|> <|unk|> <|unk|> <|unk|> <|unk|> the sunlit terraces of the <|unk|>.'

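<div class="alert alert-block alert-info">Comparing the de-tokenized text above with the original input text, we can see that the training dataset, Edith Wharton's short story The Verdict, did not contain the words <b> "Hello"</b> and <b> "palace"</b>. The five consecutive unknown tokens in the middle come from the separator: the pattern in <b>encode</b> treats the vertical bar as a split character, and the two texts were joined without surrounding spaces, so the separator and the following word "In" are broken into five fragments, none of which is in the vocabulary. </div>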
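One way to keep the separator intact is to isolate it before splitting the ordinary text. The sketch below is one possible variant under stated assumptions, not the notebook's tokenizer: the class name `SpecialTokenTokenizer`, the small `words` list, and the choice to look up unknown tokens with `dict.get` are all introduced here for illustration.

```python
import re

class SpecialTokenTokenizer:  # hypothetical name, not part of the notebook
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # First isolate the special token, then split the ordinary text around it.
        pieces = re.split(r'(<\|endoftext\|>)', text)
        tokens = []
        for piece in pieces:
            if piece == "<|endoftext|>":
                tokens.append(piece)
            else:
                parts = re.split(r'([,.:;?_!"()\']|--|\s)', piece)
                tokens.extend(p.strip() for p in parts if p.strip())
        unk_id = self.str_to_int["<|unk|>"]
        return [self.str_to_int.get(t, unk_id) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join placed before punctuation characters.
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Hypothetical toy vocabulary so that the example runs on its own.
words = [",", ".", "?", "In", "do", "like", "of", "sunlit", "tea",
         "terraces", "the", "you", "<|endoftext|>", "<|unk|>"]
toy_vocab = {token: i for i, token in enumerate(words)}

tok = SpecialTokenTokenizer(toy_vocab)
text = "Hello, do you like tea?<|endoftext|>In the sunlit terraces of the palace."
ids = tok.encode(text)
decoded = tok.decode(ids)
print(decoded)
```

With this toy vocabulary, only "Hello" and "palace" map to the unknown token, and the separator survives as a single token even though it was joined without spaces.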

So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

- **[BOS] (beginning of sequence):** This token marks the start of a text. It signifies to the LLM where a piece of content begins.

- **[EOS] (end of sequence):** This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`. For instance, when combining two different Wikipedia articles or books, the `[EOS]` token indicates where one article ends and the next one begins.

- **[PAD] (padding):** When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the `[PAD]` token, up to the length of the longest text in the batch.
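The padding step described in the last bullet can be sketched in a few lines. The function `pad_batch`, the example batch, and the value of `PAD_ID` are assumptions made for illustration; a real tokenizer would reserve a specific ID for `[PAD]` in its vocabulary.

```python
def pad_batch(batch, pad_id):
    """Extend every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

PAD_ID = 0  # hypothetical ID reserved for the [PAD] token
batch = [[5, 7, 9], [3], [8, 2]]  # made-up token ID sequences of varying length

padded = pad_batch(batch, PAD_ID)
print(padded)
```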


<div class="alert alert-block alert-info"> Note that the tokenizer used for GPT models does not need any of these tokens mentioned above but only uses an <b> <|endoftext|> </b> token for simplicity </div>

<div class="alert alert-block alert-info"> The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a <b> byte pair encoding tokenizer </b>, which breaks down words into subword units </div>

# BYTE PAIR ENCODING

<div class="alert alert-block alert-success"> 
We have implemented a <b> simple tokenization </b> scheme in the previous sections for illustration purposes.

This section covers a more sophisticated tokenization scheme based on a concept called <b>byte pair encoding (BPE).</b>

The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

</div>

<div class="alert alert-block alert-warning"> 
Since implementing BPE can be relatively complicated, we will use an existing open-source Python library called tiktoken, which OpenAI also uses.

This library implements the BPE algorithm very efficiently, based on source code in Rust.
</div>
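To give a feel for how BPE arrives at subword units, here is a toy sketch of its core training loop: repeatedly find the most frequent adjacent pair of symbols and merge it into one new symbol. This is not tiktoken's implementation, which operates on bytes and uses a pretrained merge table; the corpus, the function names, and the number of merge steps below are assumptions chosen for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(2):  # two merge steps are enough to form the subword "low"
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After the two merges, the shared stem "low" has become a single symbol, while the rarer endings remain split into characters. Applying the learned merges in order to a new word is what lets a BPE tokenizer represent unseen words as a sequence of known subword units instead of an unknown token.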