# TOKENIZATION

# Reading a short story as a text sample into Python

## Step 1: Creating Tokens

The `print` calls below output the total number of characters, followed by the first 99 characters of the text file.

In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()
print("Total number of characters in the text:", len(raw_text))
print(raw_text[:99])  # Displaying the first 99 characters

Total number of characters in the text: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


**Our goal is to tokenize this 20,480-character short story into individual words and special characters that we can then turn into embeddings for LLM training.**

Note that it's common to process large numbers of articles and hundreds of thousands of books -- many gigabytes of text -- when working with LLMs. However, for educational purposes, it's sufficient to work with smaller text samples, such as a single book, to illustrate the main ideas and to make it possible to run the code in reasonable time on consumer hardware.

**How can we best split this text to obtain a list of tokens? For this, we go on a small excursion and use Python's regular expression module `re` to split the text into tokens.**

In [2]:
import re
text = "Hello, world! This is a test. Let's see how it works."
result = re.split(r'(\s)', text) # \s matches any whitespace character (space, tab, newline, etc.)
print("Tokens:", result)

Tokens: ['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.', ' ', "Let's", ' ', 'see', ' ', 'how', ' ', 'it', ' ', 'works.']
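The parentheses in the pattern matter: they form a capturing group, and `re.split` keeps captured separators in its result. A minimal sketch on a made-up string (the variable names are illustrative and not part of the notebook) comparing both variants:

```python
import re

sample = "Hello, world."

# With a capturing group, re.split keeps the matched separators in the result.
with_group = re.split(r'(\s)', sample)

# Without a capturing group, the separators are discarded.
without_group = re.split(r'\s', sample)

print(with_group)     # ['Hello,', ' ', 'world.']
print(without_group)  # ['Hello,', 'world.']
```

Keeping the separators is what later allows punctuation marks to survive as tokens of their own.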


**Let's modify the regular expression so that it splits on whitespace (\s), commas (,), and periods (.) to create a list of tokens.**

In [3]:
result = re.split(r'([,.]|\s)', text)  # Splitting on whitespace, commas, and periods
print("Tokens with custom regex:", result)

Tokens with custom regex: ['Hello', ',', '', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '', ' ', "Let's", ' ', 'see', ' ', 'how', ' ', 'it', ' ', 'works', '.', '']


We can see that words, commas, and periods are now separate list entries, just as we wanted. (The `!` in `world!` is still attached to the word because it is not yet part of the pattern.)

**A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we can safely remove these redundant entries as follows.**

In [4]:
result = [item for item in result if item.strip()]
print("Tokens without whitespace:", result)

Tokens without whitespace: ['Hello', ',', 'world!', 'This', 'is', 'a', 'test', '.', "Let's", 'see', 'how', 'it', 'works', '.']
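The empty strings in the unfiltered output appear whenever two separators are adjacent, such as a comma followed by a space. The following sketch, on a made-up string with illustrative variable names, shows where they come from and why the `item.strip()` filter removes them together with the whitespace entries:

```python
import re

sample = "a, b"

# The comma and the space are adjacent separators, so re.split
# reports the empty string that sits between them.
parts = re.split(r'([,.]|\s)', sample)
print(parts)    # ['a', ',', '', ' ', 'b']

# strip() turns whitespace-only strings into '', and '' is falsy,
# so both kinds of entries are dropped by the condition.
cleaned = [item for item in parts if item.strip()]
print(cleaned)  # ['a', ',', 'b']
```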


In [5]:
text = "Hello, world! Is this -- a test? Yes, it is."
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)  # Including punctuation and whitespace
result = [item.strip() for item in result if item.strip()]
print("Tokens with punctuation and whitespace removed:", result)

Tokens with punctuation and whitespace removed: ['Hello', ',', 'world', '!', 'Is', 'this', '--', 'a', 'test', '?', 'Yes', ',', 'it', 'is', '.']
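It can help to check how this pattern treats contractions and hyphenated words. The sketch below wraps the same pattern in a helper named `tokenize` (a hypothetical name introduced only for this illustration) and applies it to a made-up sentence:

```python
import re

def tokenize(text):
    """Split text with the same pattern as above and drop empty/whitespace entries."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [item.strip() for item in parts if item.strip()]

tokens = tokenize("It's well-known -- isn't it?")
print(tokens)
# ['It', "'", 's', 'well-known', '--', 'isn', "'", 't', 'it', '?']
```

The apostrophe is a separator, so a contraction such as `It's` becomes three tokens, while a single hyphen is not in the pattern and `well-known` stays intact.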


In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print("Preprocessed tokens:", preprocessed[:30])  # Displaying the first 30 tokens

Preprocessed tokens: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [7]:
print("Total number of tokens:", len(preprocessed))

Total number of tokens: 4690


## Step 2: Creating Token IDs

In [8]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print("Vocabulary size:", vocab_size)

Vocabulary size: 1130


In [9]:
vocab = {token: integer for integer, token in enumerate(all_words)}

In [10]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i>=50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
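The same vocabulary construction can be traced by hand on a tiny token list. This is a minimal sketch with made-up tokens; `toy_tokens`, `toy_vocab`, and `toy_inverse` are illustrative names, not part of the notebook:

```python
toy_tokens = ['the', 'cat', ',', 'the', 'dog', '.']

# set() removes duplicates; sorted() orders by Unicode code point,
# which places punctuation before letters (and uppercase before lowercase).
toy_words = sorted(set(toy_tokens))
print(toy_words)  # [',', '.', 'cat', 'dog', 'the']

# Map each unique token to its position in the sorted list.
toy_vocab = {token: idx for idx, token in enumerate(toy_words)}

# The inverse mapping is what a decoder needs to turn IDs back into text.
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

toy_ids = [toy_vocab[token] for token in toy_tokens]
print(toy_ids)                            # [4, 2, 0, 4, 3, 1]
print([toy_inverse[i] for i in toy_ids])  # ['the', 'cat', ',', 'the', 'dog', '.']
```

The code point ordering explains why the listing above starts with punctuation marks and capitalized words.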


In [11]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int.get(token) for token in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space before the specified punctuation marks
        text = re.sub(r'\s([,.:;?_!"()\'])', r'\1', text)
        return text
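The `decode` step joins tokens with spaces and then removes the space in front of punctuation, so it is not an exact inverse of `encode`. A small sketch of just that step, using a hand-written token list (an assumption for illustration, not output from the notebook):

```python
import re

tokens = ['"', 'It', "'", 's', 'fine', ',', '"', 'she', 'said', '.']

# Step 1: join all tokens with single spaces.
joined = " ".join(tokens)

# Step 2: drop the whitespace character directly before a punctuation mark.
cleaned = re.sub(r'\s([,.:;?_!"()\'])', r'\1', joined)

print(cleaned)  # " It' s fine," she said.
```

The space after an opening quote or an apostrophe is kept, which is why a contraction such as `It's` comes back as `It' s`.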

In [12]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he panited, you know,"
       Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print("Encoded IDs:", ids)

Encoded IDs: [1, 56, 2, 850, 988, 602, 533, None, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [13]:
# Filter out None values before decoding to avoid KeyError
filtered_ids = [i for i in ids if i is not None]
tokenizer.decode(filtered_ids)

'" It\' s the last he, you know," Mrs. Gisburn said with pardonable pride.'

In [14]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

[None, 5, 355, 1126, 628, 975, 10]
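A `None` in the result marks a token that is not in the vocabulary. The sketch below uses a small hand-written vocabulary (the IDs are made up and do not match the notebook's) to show how `dict.get` produces `None` and how the missing tokens can be listed:

```python
toy_vocab = {',': 0, '?': 1, 'do': 2, 'like': 3, 'tea': 4, 'you': 5}
tokens = ['Hello', ',', 'do', 'you', 'like', 'tea', '?']

# dict.get returns None instead of raising KeyError for unknown keys.
ids = [toy_vocab.get(token) for token in tokens]
print(ids)      # [None, 0, 2, 5, 3, 4, 1]

# Collect the tokens that have no ID.
missing = [token for token in tokens if token not in toy_vocab]
print(missing)  # ['Hello']
```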


## Adding Special Context Tokens

In [15]:
all_token = sorted(list(set(preprocessed)))
all_token.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_token)}

In [16]:
len(vocab.items())

1132

In [17]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [18]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            token if token in self.str_to_int
            else "<|unk|>" for token in preprocessed
        ]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space before the specified punctuation marks
        text = re.sub(r'\s([,.:;?_!"()\'])', r'\1', text)
        return text
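Two details of this approach are easy to miss: unknown tokens fall back to `<|unk|>`, and a special token is only recognized when the split pattern isolates it. The following sketch uses a tiny hand-written vocabulary and the hypothetical helpers `split_tokens` and `toy_encode` (not part of the notebook), with `dict.get` as an equivalent way to express the fallback:

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    """Split with the tokenizer's pattern and drop empty/whitespace entries."""
    return [part.strip() for part in re.split(PATTERN, text) if part.strip()]

toy_vocab = {token: idx for idx, token in enumerate(
    ['?', 'do', 'like', 'tea', 'you', '<|endoftext|>', '<|unk|>'])}

def toy_encode(text):
    """Map tokens to IDs, using the <|unk|> ID for anything not in the vocabulary."""
    unk_id = toy_vocab['<|unk|>']
    return [toy_vocab.get(token, unk_id) for token in split_tokens(text)]

# Separated by spaces, the special token is split off and keeps its own ID (5).
spaced = toy_encode("do you like tea? <|endoftext|> do you")
print(spaced)  # [1, 4, 2, 3, 0, 5, 1, 4]

# Glued to the next word, it becomes part of an unknown token and maps to <|unk|> (6).
glued = toy_encode("do you like tea?<|endoftext|>do you")
print(glued)   # [1, 4, 2, 3, 0, 6, 4]
```

None of the characters in `<|endoftext|>` is a separator in the pattern, so whitespace around it is what makes it a standalone token.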

In [19]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

# Surround the separator with spaces so that the split pattern isolates it as its own token
text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [20]:
tokenizer.encode(text)

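KeyError: '<|unk|>'

Note: this `KeyError` means that the `vocab` passed to the tokenizer has no `<|unk|>` entry, which suggests the tokenizer was built from the earlier 1,130-entry vocabulary (for example, after re-running In [9] out of order). With the 1,132-entry vocabulary from In [15], the lookup of `<|unk|>` succeeds, so re-running In [15] and In [19] before this cell should resolve the error.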

In [None]:
tokenizer.decode(tokenizer.encode(text))