In [1]:
import re

In [2]:
with open("Harry Potter and the Sorcerers Stone.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()

In [3]:
print("Total number of characters:", len(raw_text))
print(raw_text[:100])

Total number of characters: 263976
M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly norm


# Classic Tokenizer

In [7]:
class Tokenizer:

    TOKEN_PATTERN = r'([,.:;?\-!"()\']|\s)'
    END_OF_TEXT = "<|endoftext|>"
    UNKNOWN_TOKEN = "<|unk|>"

    def __init__(self, raw_text):
        self.raw_text = raw_text
        self.tokens = self.get_tokens(self.raw_text)

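    def get_tokens(self, text):
        # Split on punctuation and whitespace, dropping empty pieces.
        tokens = re.split(self.TOKEN_PATTERN, text)
        tokens = [t.strip() for t in tokens if t.strip()]
        self.tokens = tokens + [self.END_OF_TEXT, self.UNKNOWN_TOKEN]
        # Ids are positions in the token list; a repeated token maps to its last position.
        self.idx_to_token = {i: t for i, t in enumerate(self.tokens)}
        self.token_to_idx = {t: i for i, t in enumerate(self.tokens)}
        # Return the list so the assignment in __init__ does not overwrite it with None.
        return self.tokens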

    def encode(self, text):
        tokens = re.split(self.TOKEN_PATTERN, text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [
            self.token_to_idx.get(t, self.token_to_idx[self.UNKNOWN_TOKEN])
            for t in tokens
        ] + [self.token_to_idx[self.END_OF_TEXT]]

    def decode(self, indices):
        tokens = [self.idx_to_token[i] for i in indices]
        text = " ".join(tokens)
        text = re.sub(r'\s([,.:;?\-!"()\'])', r"\1", text)
        return text

In [8]:
tokenizer = Tokenizer(raw_text)

text = "Harry Potter is a wizard."
encoded = tokenizer.encode(text)

print("Encoded text:", encoded)

decoded = tokenizer.decode(encoded)
print("Decoded text:", decoded)

Encoded text: [54275, 50599, 51951, 54345, 40688, 54362, 54363]
Decoded text: Harry Potter is a wizard. <|endoftext|>


In [14]:
text = "Harry Potter is in the palace."
encoded = tokenizer.encode(text)
print("Encoded text:", encoded)

decoded = tokenizer.decode(encoded)
print("Decoded text:", decoded)

Encoded text: [54275, 50599, 51951, 54263, 54354, 54364, 54362, 54363]
Decoded text: Harry Potter is in the <|unk|>. <|endoftext|>
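
The `Tokenizer` class builds its id tables from every token occurrence in the text, so ids grow with the length of the book rather than with the number of distinct tokens. A possible variant, shown here only as a minimal sketch, deduplicates the tokens first. The helper name `build_vocab` and the toy sentence are assumptions made for illustration and are not part of the notebook.

```python
import re

# Same split pattern as Tokenizer.TOKEN_PATTERN above.
TOKEN_PATTERN = r'([,.:;?\-!"()\']|\s)'

def build_vocab(text, specials=("<|endoftext|>", "<|unk|>")):
    """Hypothetical helper: map each distinct token to one id."""
    tokens = [t.strip() for t in re.split(TOKEN_PATTERN, text) if t.strip()]
    # Sort the distinct tokens so ids are reproducible, then append the specials.
    unique = sorted(set(tokens)) + list(specials)
    return {tok: i for i, tok in enumerate(unique)}

vocab = build_vocab("the cat sat on the mat. the end.")
print(len(vocab))    # 7 distinct tokens + 2 special tokens
print(vocab["the"])
```

With this variant the encoded ids would differ from the outputs shown above, since those come from the position-based tables.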


# Byte pair encoding (BPE)

There are three main types of tokenization:

**1. Word-level tokenization:** Splitting text into words based on spaces and punctuation. The problems with this approach are out-of-vocabulary (OOV) words and the fact that closely related word forms, for example `[play, played]` or `[boy, boys]`, get unrelated tokens, so their shared meaning is not captured.

**2. Character-based tokenization:** One of the advantages of this approach is a small vocabulary size (~256 for English). However, we lose the meaning associated with words, and the tokenized sequence is much longer.

**3. Subword-based tokenization:** This approach starts from two rules:
- Rule 1: Do not split frequently used words into smaller subwords.
- Rule 2: Split rare words into smaller meaningful subwords (e.g. `played` is split into `play` and `ed`).

In general, subword tokenization helps the model learn that words sharing the same root are similar in meaning, e.g. `token`, `tokens` and `tokenization`.

It also helps the model learn that `colonization` and `specialization` are built from different root words but share the suffix `ization` and are used in similar syntactic situations.
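
The trade-offs of the first two approaches can be illustrated with a small sketch. The two toy sentences and the helper names `word_tokens` and `char_tokens` are assumptions chosen for illustration; the split pattern is the one used by the `Tokenizer` class earlier in the notebook.

```python
import re

def word_tokens(text):
    # Split on whitespace and punctuation, keeping punctuation marks as tokens.
    return [t for t in re.split(r'([,.:;?\-!"()\']|\s)', text) if t.strip()]

def char_tokens(text):
    # Every character, including spaces, becomes a token.
    return list(text)

train = "the boys played"
test = "the boy plays"

# Word level: related forms that were not seen in training are out of vocabulary.
word_vocab = set(word_tokens(train))
oov_words = [w for w in word_tokens(test) if w not in word_vocab]
print(oov_words)

# Character level: nothing is out of vocabulary, but the sequence is longer.
char_vocab = set(char_tokens(train))
oov_chars = [c for c in char_tokens(test) if c not in char_vocab]
print(oov_chars)
print(len(word_tokens(test)), len(char_tokens(test)))
```

Subword tokenization aims for a middle ground: pieces such as `boy`, `play`, `s` and `ed` would cover both sentences with short sequences.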

**Byte pair encoding (BPE)** is a subword tokenization algorithm.

*Example*: 
Let's say we have a dataset consisting of these words and their frequencies:
`{"old": 7, "older": 3, "lowest": 4, "finest": 9}`
- For the preprocessing step, we add `<\w>` at the end of each word (the frequencies stay the same):
    - `{"old<\w>": 7, "older<\w>": 3, "lowest<\w>": 4, "finest<\w>": 9}`
    - Then we build a character count table:
    
    <img src="token-table-1.png" width="400">

    Then we look for the most frequent pair of adjacent tokens and merge it into a new token. We repeat this step until we reach the token limit or the iteration limit. In the image below, the pair `es` occurs 13 times (9 in `finest` and 4 in `lowest`), which no other pair exceeds, so we add it as a new token and update the counts.

    <img src="token-table-2.png" width="400">

    Next, we find that `est` (`es` followed by `t`) is the most common pair:

    <img src="token-table-3.png" width="400">

    Then `est<\w>`. *(Note that `<\w>` helps the algorithm tell the difference between the `est` in **est**imate and the one in high**est**.)*

    <img src="token-table-4.png" width="400">  
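
The merge steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the end-of-word marker is spelled as in the example, and ties between equally frequent pairs are broken by taking the first pair encountered.

```python
from collections import Counter

END = r"<\w>"  # end-of-word marker, spelled as in the example above

def symbol_counts(vocab):
    """Count each symbol, weighted by word frequency (the first table)."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            counts[s] += freq
    return counts

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    # Start from single characters plus the end-of-word marker.
    vocab = {tuple(w) + (END,): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # assumption: first pair seen wins ties
        vocab = {merge_symbols(s, best): f for s, f in vocab.items()}
        merges.append(best)
    return merges, vocab

def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols

word_freqs = {"old": 7, "older": 3, "lowest": 4, "finest": 9}
initial = {tuple(w) + (END,): f for w, f in word_freqs.items()}
print(symbol_counts(initial))

merges, vocab = train_bpe(word_freqs, num_merges=3)
print(merges)
print(segment("highest", merges))
print(segment("estimate", merges))
```

With three merges, the sketch learns `es`, `est` and `est<\w>` in that order. Segmenting `highest` then ends in the single token `est<\w>`, while `estimate` starts with plain `est`, which is the distinction the end-of-word marker is meant to preserve.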

