In [None]:
with open('the-verdict.txt','r',encoding="utf-8") as f:
    raw_text = f.read()


print("Total characters in the text: ", len(raw_text))
print(raw_text[:99])

'''
Our goal is to tokenize these 20,479 characters into individual words
that we can later turn into embeddings for LLM training.

'''

'''
Note that it's common to process millions of articles and hundreds of thousands of books -- many gigabytes of text -- when working with LLMs. However, for educational purposes, it's sufficient to work with smaller text samples like a single book to illustrate the main ideas behind the text processing steps and to make it possible to run it in reasonable time on consumer hardware.

How can we best split this text to obtain a list of tokens? 
For this, we go on a small excursion and use Python's regular expression library re for illustration purposes. (Note that you don't have to learn or memorize any regular expression syntax since we will transition to a pre-built tokenizer later in this chapter.)
'''

In [None]:
# Creating Tokens


import re 
# re is Python's regular expression library

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text) 
# \s matches a single whitespace character, so the text is split
# wherever there is whitespace

print(result)


# The result is a list of words (with punctuation still attached) and whitespace characters.
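The whitespace items appear in the result only because the pattern is wrapped in parentheses. As a small sketch of that behaviour (the variable names `without_group` and `with_group` are ours, chosen for illustration), compare the same split with and without a capturing group:

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the matched whitespace is discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched whitespace is kept as its own list item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)

# Because nothing is discarded, joining the items restores the original text.
print("".join(with_group) == text)
```

Keeping the separators matters later, when punctuation is added to the pattern and should survive as tokens of its own.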

In [None]:
# Let's modify the regular expression so that it splits on whitespace (\s), commas,
# and periods ([,.]):


result = re.split(r'([,.]|\s)', text)

print(result)
print(len(result))


'''Now the comma and the period are also split off as separate tokens.'''

In [None]:
'''Whitespace characters (and empty strings) are also included in the list of tokens.
We can remove these redundant tokens.'''

result = [item for item in result if item.strip()]
print(result)

REMOVING WHITESPACES OR NOT

When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.
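To make the trade-off concrete, here is a minimal sketch on a line of Python-like code. The sample string `code_line` and the names `kept` and `dropped` are our own illustrative choices, not part of the tokenizer built in this notebook.

```python
import re

code_line = "    return x"  # the leading indentation is meaningful in Python

pieces = re.split(r'(\s)', code_line)

# Option 1: keep whitespace tokens, so the original string can be rebuilt exactly.
kept = [p for p in pieces if p != ""]

# Option 2: drop whitespace tokens, giving a shorter list but losing the indentation.
dropped = [p for p in pieces if p.strip()]

print(kept)
print(dropped)
print("".join(kept) == code_line)
```

The first list is longer but lossless; the second is compact but can no longer tell an indented line from an unindented one.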

The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters:

In [None]:
'''
Currently, only . and , are split into separate tokens.
What if we also want ?, !, etc. as separate tokens?
We need to include them in the regular expression used for tokenization.
'''

text = "Hello, world. Is this-- a test?"
print(len(text))
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
print(len(result))

'''
If you look closely, we built our first tokenizer in these 2 lines of code:
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]

1. We split the text into tokens using a regular expression that matches whitespace, commas,
periods, colons, semicolons, question marks, underscores, exclamation marks,
double quotes, parentheses, single quotes, and double dashes (--).
2. We strip whitespace from each item and drop the items that are empty or whitespace-only.

For building LLMs, a subword tokenization technique such as Byte Pair Encoding (BPE) or WordPiece is preferred.
'''

# The sample text is 31 characters long.
# After splitting the text into tokens and removing whitespace, we have 10 tokens.

In [None]:
# To count the space characters in the text, we can use the following code:
text = "Hello, world. Is this-- a test?"
whitespace_count = text.count(' ')
print(f"Number of whitespaces: {whitespace_count}")  # Output: Number of whitespaces: 5

In [None]:
'''
Now we will apply these 2 lines

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]

to the entire raw text, that is, the input data of the book.

'''

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) # list of all tokens, separators included
preprocessed = [item.strip() for item in preprocessed if item.strip()] # drop empty and whitespace-only items
print(len(preprocessed)) # number of tokens: 4690 (punctuation such as -- and , also counts as tokens)
print(preprocessed[:30])

In [None]:
# Convert Tokens into Token IDs

'''
Once the tokens are created, we need to assign token IDs.
But before assigning token IDs, we need to build a vocabulary of all unique tokens.

The vocabulary is the set of all unique tokens in the text.
These tokens are first sorted alphabetically.
Then each unique token is mapped to a unique integer, which is referred to as its token ID.

For example,

Input Data : My name is Chirantan Lonkar

Vocabulary (in the order produced by Python's sorted(), which places uppercase letters before lowercase ones):

Chirantan   0
Lonkar      1
My          2
is          3
name        4

PS :
VOCABULARY ONLY CONTAINS UNIQUE TOKENS.

IF THE TOKEN "THE" APPEARS TWICE OR THRICE, IT ONLY APPEARS ONCE IN THE VOCABULARY.
'''
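As a quick sketch of the idea, we can build a vocabulary for the example sentence. The token list below repeats "is" on purpose (our addition) to show that duplicates collapse to a single entry. Note that Python's `sorted()` compares strings by code point, so capitalized tokens come before lowercase ones.

```python
# Toy token list; "is" is repeated deliberately to show de-duplication.
tokens = ["My", "name", "is", "Chirantan", "Lonkar", "is"]

# set() removes duplicates; sorted() orders the tokens by Unicode code point.
unique_tokens = sorted(set(tokens))

# Each unique token receives its position in the sorted list as its token ID.
toy_vocab = {token: token_id for token_id, token in enumerate(unique_tokens)}

print(len(tokens), "tokens,", len(toy_vocab), "vocabulary entries")
print(toy_vocab)
```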

In [None]:
# In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable 
# called preprocessed. Let's now create a list of all unique tokens 
# and sort them alphabetically to determine the vocabulary size:

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)
# vocabulary only contains unique words, hence the size is less than the total number of tokens.

In [None]:
'''
After determining that the vocabulary size is 1,130 via the above code, 
we create the vocabulary and print its first 51 entries for illustration purposes:
'''

vocab = {token:integer for integer,token in enumerate(all_words)}

In [None]:
for i, item in enumerate(vocab.items()):
    # i counts the entries so that we can stop after 51 of them;
    # here it also equals the token ID, since vocab was built with enumerate.
    print(item)
    if i >= 50:
        break

# Punctuation such as !, " and ' comes first because these characters sort before letters.
# After that, you can see the tokens arranged in alphabetical order.

# PS:
# THIS PROCESS CAN ALSO BE THOUGHT OF AS ENCODING (TOKENS INTO TOKEN IDS).
# DECODING IS THE REVERSE MAPPING: CONVERTING A TOKEN ID BACK INTO ITS TOKEN.

As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels.

Later, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.

For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
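A minimal sketch of such an inverse mapping, using a hypothetical three-token vocabulary (`toy_vocab`) rather than the real one:

```python
# Hypothetical three-token vocabulary, used only for illustration.
toy_vocab = {"dog": 0, "fox": 1, "the": 2}

# Swap keys and values to map token IDs back to tokens.
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

ids = [2, 1]
decoded = " ".join(inverse_vocab[i] for i in ids)
print(decoded)
```

This works because every token ID is unique, so no two tokens collide when keys and values are swapped.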

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.

In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.

Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Remove spaces before the specified punctuation

In [None]:
class SimpleTokenizerV1:
    '''Tokenizer class that encodes and decodes text; it takes a vocabulary as input'''
    def __init__(self,vocab):

        '''When you create an instance of the tokenizer class, you pass in a vocabulary.
        The vocabulary is a mapping from tokens to token IDs:
        vocab = {token:integer for integer,token in enumerate(all_words)}
        '''
        self.str_to_int = vocab # mapping from token (str) to token ID (int)
        self.int_to_str = {i:s for s,i in vocab.items()} # reverse mapping from token ID to token
        # in the line above, s is the token and i is the token ID


    # the encode and decode methods are defined below

        
    def encode(self, text):
        '''
        Sample input text -> Tokens -> Token IDs
        '''
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) # split into individual tokens

        preprocessed = [
            item.strip() for item in preprocessed if item.strip() # drop empty and whitespace-only items
        ] 
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    
    def decode(self, ids):
        '''Use the reverse dict (int to str) to convert token IDs into individual tokens,
          and then join the individual tokens together

          Token IDs -> Tokens -> Sample input text
          '''

        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation, for example: The dog chased .  -> The dog chased.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
'''
Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class 
and tokenize a passage from Edith Wharton's short story to try it out in practice:
'''

tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
           
# testing encode method : take in text and convert it into token IDs
ids = tokenizer.encode(text)

print(ids)

# result : token IDS of the text

In [None]:
'''Now we can decode the token IDs back into text using the decode method.
It takes the token IDs as input and returns the reconstructed text.'''

tokens = tokenizer.decode(ids)
print(tokens)

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
Based on the output above, we can see that the decode method converted the token IDs back into the original text, apart from an extra space after the opening quotation mark and after the apostrophe.

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set.

Let's now apply it to a new text sample that is not contained in the training set:

In [None]:
# This sentence contains a word that is not in the vocabulary:
# "Hello" does not appear in the short story,
# so encode raises a KeyError here.


text = "Hello, do you like tea?"
print(tokenizer.encode(text))


The problem is that the word "Hello" was not used in The Verdict short story.

Hence, it is not contained in the vocabulary.

This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
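The failure can be reproduced on a small scale. The sketch below assumes a hypothetical toy vocabulary (`toy_vocab`) that covers every token of the sentence except "Hello"; a plain dictionary lookup then raises a `KeyError` for the missing token.

```python
# Hypothetical toy vocabulary; it deliberately lacks the token "Hello".
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}
tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]

try:
    ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    missing = err.args[0]  # the key that was not found
    print(f"Token not in vocabulary: {missing!r}")

# Listing all out-of-vocabulary tokens up front avoids failing on the first one.
unknown = [t for t in tokens if t not in toy_vocab]
print(unknown)
```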

In addition, LLM tokenizers such as the one behind ChatGPT use so-called "special context tokens", which we look at next.

ADDING SPECIAL CONTEXT TOKENS
In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will modify the vocabulary and the tokenizer from the previous section, SimpleTokenizerV1, to create SimpleTokenizerV2, which supports two new tokens, <|unk|> and <|endoftext|>.

PS :
We can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary.

Furthermore, we add an <|endoftext|> token between unrelated texts.

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.
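Both ideas can be sketched on a toy vocabulary before we change the real tokenizer. Everything below (the four-word vocabulary, the two short documents, and the resulting IDs) is a hypothetical illustration; only the two special token strings come from the text above.

```python
# Toy vocabulary; the IDs are illustrative, not those of the real vocabulary.
toy_vocab = {"chased": 0, "dog": 1, "fox": 2, "the": 3}

# Append the two special tokens after the regular tokens.
for special in ("<|endoftext|>", "<|unk|>"):
    toy_vocab[special] = len(toy_vocab)

doc1 = ["the", "fox", "chased", "the", "dog", "quickly"]  # "quickly" is unknown
doc2 = ["the", "dog", "chased", "the", "fox"]

# Join the independent documents with the separator token between them.
stream = doc1 + ["<|endoftext|>"] + doc2

# Map every token that is not in the vocabulary to the <|unk|> ID.
unk_id = toy_vocab["<|unk|>"]
ids = [toy_vocab.get(tok, unk_id) for tok in stream]
print(ids)
```

Here `dict.get` with a default replaces the failing lookup: unknown tokens all share one ID, and the separator ID marks where the first document stops.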

"""
Adding Special Context Tokens:

1. <|unk|> : unknown token
2. <|endoftext|> : end of the text token

----------------------------------------------------------------------------------------------------------
1.

For example, take a sentence and tokenize it

Input Sentence : The fox chased the dog

Tokenized : "The" "fox" "chased" "the" "dog"

Existing Vocabulary (arranged in alphabetical order with token IDs) :

"chased" 0
"dog" 1
"fox" 2
"the" 3

PS : To this existing vocabulary we add 2 more tokens and assign them token IDs. In this illustration the full vocabulary is assumed to be larger than the four entries listed above, so the new IDs are <|unk|> : 783 & <|endoftext|> : 784

They are the last 2 tokens in the vocabulary.

Now let's say that we add an UNKNOWN word called "quickly" to the input sentence.

Input Sentence : The fox chased the dog quickly.

We have token IDs for all the tokens except "quickly", as it is not present in the vocabulary.
In that case, we assign token ID 783 to the token "quickly" because it is unknown. This is exactly what the <|unk|> token represents.

The <|unk|> token stands in for any token that is not in the vocabulary, so every unknown token receives the same token ID.

----------------------------------------------------------------------------------------------------------
2. 
The <|endoftext|> token is inserted between 2 text sources to mark where one text ends and the next one begins.

What does this do?

Say, for example, you have 4 text documents:

Text 1 : Book
Text 2 : Newspaper Article
Text 3 : Research Paper
Text 4 : Blog Post

When you train a model and pass it the corpus, these documents are not simply stacked together into one continuous text.

The <|endoftext|> token is an indicator that Text 1 has ended and Text 2 starts.

So this token is added between the documents as a marker, signalling the end of one text and the start of the next.


Without <|endoftext|>, the LLM would treat all of these documents as one continuous text. Marking the boundaries helps the data to be processed better.

"""

In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"]) # add 2 more tokens using extend method in lists

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [None]:
len(vocab.items())

# the original size of the vocabulary was 1,130 tokens, and now it's 1,132 tokens
# as we added 2 more tokens: <|endoftext|> and <|unk|>

In [None]:
# As an additional quick check, let's print the last 5 entries of the updated vocabulary:

for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

In [None]:
class SimpleTokenizerV2:
    '''Creating Tokenizer version 2'''
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        '''The first 2 lines of code are the same as in version 1.
        If a token is not present in the vocab, it is replaced by the <|unk|> token.
        '''
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed] # encoding the text
        return ids
        
    def decode(self, ids):
        '''Same as version 1, except that the punctuation pattern below also covers : and ; . We convert token IDs back to tokens and then join them together'''
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [None]:
# Trying version 2 of the tokenizer on the same text

tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

In [None]:
tokenizer.encode(text)
# The first token ID is 1131 because "Hello" is not present in the vocabulary.
# The original vocabulary had 1,130 tokens, with IDs 0 to 1129.
# The next ID, 1130, was assigned to the <|endoftext|> token,
# and 1131 was assigned to the <|unk|> token, so the unknown word "Hello" is encoded as 1131.

In [None]:
# Now we will use the decode function:
# we pass the encoded text (token IDs)
# into tokenizer.decode

tokenizer.decode(tokenizer.encode(text))


# "Hello" and "palace" were not present in the vocabulary, so the token ID for both is 1131,
# and both are decoded as <|unk|>.

# This means the tokenizer can now handle unknown words without raising an error.

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
Based on comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words "Hello" and "palace."

So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins. It is the counterpart of an end-of-text token.

[EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.
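As a small sketch of padding, assume three hypothetical token-ID sequences and an ID of 0 reserved for [PAD] (both are our assumptions for illustration). Each shorter sequence is extended to the length of the longest one:

```python
# Hypothetical token-ID sequences of different lengths.
batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]
pad_id = 0  # assumed ID reserved for the [PAD] token

# Pad every sequence on the right up to the longest length in the batch.
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)
```

After padding, all rows have the same length, so the batch can be stored as a single rectangular array.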

Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity.

The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words.

Instead, GPT models use a byte pair encoding tokenizer, which breaks down words into subword units.
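To give a feel for how subword units can arise, here is a toy sketch of the core merge step behind byte pair encoding: repeatedly find the most frequent adjacent pair of symbols and fuse it into one symbol. This is a simplified illustration under our own assumptions (character-level start, three sample words, two merges, helper names `most_frequent_pair` and `merge_pair`), not the actual tokenizer used by GPT models.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from single characters (real BPE tokenizers start from bytes).
words = [list("low"), list("lower"), list("lowest")]

for _ in range(2):  # two merge steps, chosen arbitrarily for the demo
    pair = most_frequent_pair(words)
    words = [merge_pair(w, pair) for w in words]

print(words)
```

After two merges the shared stem "low" has become a single symbol, while the rarer endings are still split into characters. This is why an unseen word can still be represented: it falls back to smaller known pieces instead of an <|unk|> token.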