# Exercise 2: Create a word-level tokenizer with different splitting rules

### Advantages of Word-Level Tokenization:
1. Preserves word-level semantic meaning.
2. Shorter sequences compared to character-level tokenization.

### Disadvantages of Word-Level Tokenization:
1. Larger vocabulary size: every unique word form needs its own entry.
2. Different splitting rules can significantly affect tokenization results (for example, whether "Hello," stays one token or becomes "Hello" and ",").
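
To make the sequence-length point concrete, here is a minimal sketch that tokenizes one sentence at character level and at word level. The sentence is an illustrative assumption, not taken from the exercise text.

```python
# Illustrative sentence (an assumption for this sketch, not from the-verdict.txt)
sample = "the cat sat on the mat and the dog sat on the log"

char_tokens = list(sample)    # character-level: one token per character
word_tokens = sample.split()  # word-level: one token per whitespace-separated word

print("Character-level sequence length:", len(char_tokens))
print("Word-level sequence length:", len(word_tokens))
print("First word tokens:", word_tokens[:4])
```

Each word token carries a whole unit of meaning, which is why the word-level sequence is several times shorter than the character-level one.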

## Implementation

### Step 1: Load the text from the file

In [6]:
# Load the text
with open(
    "/Users/sadiahzahoor/Desktop/AI Research/LLMs /LLM's from Scratch/the-verdict.txt",
    "r",
) as file:
    text = file.read()

print("Total number of characters in the text: ", len(text))
print(text[:200])

Total number of characters in the text:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a


### Step 2: Define splitting rules as regex patterns

#### Regex Patterns for Splitting Rules
1. None (Whitespace)
- This is not a regex pattern but a placeholder.
- When the pattern is None, the tokenizer falls back to text.split(), which splits on runs of whitespace and leaves punctuation attached to words (e.g., "Hello,").
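
As a small sketch of what the whitespace rule does in practice, the snippet below compares split() with no argument against split(" "). The sample string is an assumption chosen to include mixed whitespace.

```python
sample = "Hello,  world!\nThis is\ta test."  # two spaces, a newline, and a tab

# split() with no argument splits on any run of whitespace
default_split = sample.split()
# split(" ") splits on single spaces only, so an empty string and "\n"/"\t" survive
space_split = sample.split(" ")

print(default_split)
print(space_split)
```

Only the no-argument form gives clean tokens here, and punctuation stays attached to the neighbouring word in both cases.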

---

2. r'\b[a-zA-Z]+\b' (Word Only)
- Matches words containing only *alphabetic characters* (a-z, A-Z).
- \b → Ensures the word is *bounded* (i.e., it starts and ends at a word boundary).
- [a-zA-Z]+ → Matches *one or more* (+) alphabetic characters.
- *Example Matches*:
  - ✅ "hello"
  - ✅ "Test"
  - ✅ "World"
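- *Does Not Match* (as a single token):
  - ❌ "123" (no letters)
  - ❌ "hello123" (contains digits, so no word boundary follows the letters)
  - ❌ "email@domain.com" (special characters split it; re.findall returns "email", "domain", "com")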
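
The examples for this pattern can be checked with re.findall, which is the call the tokenizer in this exercise relies on. Note that re.findall scans for matches anywhere in the string, so an input that is not matched as a whole may still yield partial tokens. A minimal sketch (the variable names are assumptions):

```python
import re

WORD_ONLY = r"\b[a-zA-Z]+\b"
examples = ["hello", "123", "hello123", "email@domain.com"]

# Map each example string to the list of tokens the pattern extracts from it
word_only_results = {s: re.findall(WORD_ONLY, s) for s in examples}
for s, found in word_only_results.items():
    print(f"{s!r:22} -> {found}")
```

"hello123" produces no tokens at all because there is no word boundary between "o" and "1", while "email@domain.com" is broken into three letter-only pieces.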

---

3. r'\b[a-zA-Z0-9]+\b' (Word or Number)
- Similar to the previous pattern but allows *numbers*.
- [a-zA-Z0-9]+ → Matches *one or more* letters (a-z, A-Z) or digits (0-9).
- *Example Matches*:
  - ✅ "hello"
  - ✅ "test123"
  - ✅ "2024"
- *Does Not Match* (as a single token):
  - ❌ "hello-123" (the hyphen splits it into "hello" and "123")
  - ❌ "email@domain.com" (special characters split it into "email", "domain", "com")
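
A quick sketch of the alphanumeric pattern under re.findall. The "snake_case" input is an added assumption to illustrate one caveat: \b treats the underscore as a word character, so the pattern finds no boundary inside such identifiers.

```python
import re

ALPHANUMERIC = r"\b[a-zA-Z0-9]+\b"
examples = ["test123", "2024", "hello-123", "snake_case"]

# Map each example string to the list of tokens the pattern extracts from it
alnum_results = {s: re.findall(ALPHANUMERIC, s) for s in examples}
for s, found in alnum_results.items():
    print(f"{s!r:14} -> {found}")
```

The hyphenated input is split in two, and the underscore input yields nothing, which may be surprising when tokenizing source code or file names.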

---

4. r'\b[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\b' (Words with Hyphens)
- Allows *hyphenated words*.
- \b → Ensures word boundary.
- [a-zA-Z0-9]+ → Matches *a word with letters or numbers*.
- (?:-[a-zA-Z0-9]+)* → Allows *hyphenated parts* (-word) *zero or more times* (*).
- *Example Matches*:
  - ✅ "high-quality"
  - ✅ "multi-purpose"
  - ✅ "user-friendly"
  - ✅ "123-456" (digits are allowed after a hyphen)
- *Does Not Match* (as a single token):
  - ❌ "hello-" (trailing hyphen; only "hello" is matched)
  - ❌ "-hello" (leading hyphen; only "hello" is matched)
  - ❌ "high--quality" (double hyphen; split into "high" and "quality")
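
The sketch below runs the hyphenated pattern as it is defined in the patterns dictionary of this exercise, which allows letters and digits after a hyphen. The example inputs are assumptions chosen to cover leading, trailing, and repeated hyphens.

```python
import re

HYPHENATED = r"\b[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\b"
examples = ["high-quality", "hello-", "-hello", "123-456", "bigger-than-the-other"]

# Map each example string to the list of tokens the pattern extracts from it
hyphen_results = {s: re.findall(HYPHENATED, s) for s in examples}
for s, found in hyphen_results.items():
    print(f"{s!r:26} -> {found}")
```

A leading or trailing hyphen is simply dropped, while internal hyphens keep the whole compound together as one token.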

---

5. r'\b[a-zA-Z0-9]+\b|[.,!?;:]' (Word or Punctuation as Separate Tokens)
- Matches *either*:
  - Words (\b[a-zA-Z0-9]+\b)
  - | → OR
  - Punctuation characters ([.,!?;:])
- *Example Matches*:
  - ✅ "hello"
  - ✅ "world"
  - ✅ "123"
  - ✅ "!", ".", ";"
- *Does Not Match* (as a single token):
  - ❌ "email@domain.com" (the @ is dropped; the result is "email", "domain", ".", "com")
  - ❌ "hello-world" (the hyphen is not in the punctuation set, so it is dropped)
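
A minimal sketch of the word-or-punctuation pattern, showing that listed punctuation marks become their own tokens while characters outside the set (such as "@", "-", or quotes) silently disappear. The example inputs are assumptions.

```python
import re

WORD_AND_PUNCT = r"\b[a-zA-Z0-9]+\b|[.,!?;:]"
examples = ["Hello, world!", "hello-world", "email@domain.com", 'He said "hi".']

# Map each example string to the list of tokens the pattern extracts from it
punct_results = {s: re.findall(WORD_AND_PUNCT, s) for s in examples}
for s, found in punct_results.items():
    print(f"{s!r:20} -> {found}")
```

If quotes or hyphens matter for the task, they would have to be added to the character class explicitly.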

---

In [43]:
# Define a function to tokenize text using a given regex pattern

import re

def tokenize_with_pattern(text, pattern=None):
    if pattern is None:
        return text.split()
    else:
        return re.findall(pattern, text)
    
# Example:
text = "Hello, world! This is a test."

# Tokenize with whitespace pattern (default)
tokens = tokenize_with_pattern(text)
print(tokens)


['Hello,', 'world!', 'This', 'is', 'a', 'test.']


In [44]:
# Tokenize with a particular pattern

# Define patterns for different splitting rules

patterns = {
    "whitespace": None,  # Use basic whitespace splitting
    "word_only": r"\b[a-zA-Z]+\b",  # Word only (letters)
    "alphanumeric": r"\b[a-zA-Z0-9]+\b",  # Word or number
    "hyphenated": r"\b[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\b",  # Words with hyphens
    "word_and_punct": r"\b[a-zA-Z0-9]+\b|[.,!?;:]",  # Word or punctuation as separate tokens
}


# Example:
text = "Hello, world! This is a test."
pattern = patterns.get("word_only")
tokens = tokenize_with_pattern(text, pattern)
print(tokens)

# We see no punctuation is included in the tokens

['Hello', 'world', 'This', 'is', 'a', 'test']
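
To see how much the splitting rule matters, the following sketch applies all five rules to one sentence and prints the token count for each. The names demo_patterns and demo_tokenize and the sample sentence are assumptions, chosen so the snippet does not overwrite the variables defined above.

```python
import re

demo_patterns = {
    "whitespace": None,
    "word_only": r"\b[a-zA-Z]+\b",
    "alphanumeric": r"\b[a-zA-Z0-9]+\b",
    "hyphenated": r"\b[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\b",
    "word_and_punct": r"\b[a-zA-Z0-9]+\b|[.,!?;:]",
}

def demo_tokenize(text, pattern=None):
    # None means plain whitespace splitting; otherwise collect all regex matches
    return text.split() if pattern is None else re.findall(pattern, text)

sample = "Hello, well-known world of 2024!"
comparison = {name: demo_tokenize(sample, p) for name, p in demo_patterns.items()}
for name, tokens in comparison.items():
    print(f"{name:15} {len(tokens)} tokens: {tokens}")
```

The same sentence yields between 5 and 8 tokens depending on the rule: word_only drops "2024", only hyphenated keeps "well-known" intact without trailing punctuation, and word_and_punct is the only regex rule that retains "," and "!".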


### Step 3: Define a general tokenizer class that handles any regex pattern.

In [46]:
import re  # Needed by PatternTokenizer._tokenize (also imported in an earlier cell)

# Define tokenization patterns outside the class; keep adding more if needed.
TOKENIZATION_PATTERNS = {
    "whitespace": None,  # Use basic whitespace splitting
    "word_only": r"\b[a-zA-Z]+\b",  # Word only (letters)
    "alphanumeric": r"\b[a-zA-Z0-9]+\b",  # Word or number
    "hyphenated": r"\b[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\b",  # Words with hyphens
    "word_and_punct": r"\b[a-zA-Z0-9]+\b|[.,!?;:]",  # Word or punctuation as separate tokens
}

class PatternTokenizer:
    def __init__(self, text, pattern_type=None):
        self.text = text
        self.pattern_type = pattern_type
        self.pattern = TOKENIZATION_PATTERNS.get(pattern_type, None)

        # Step 1: Split the text into tokens based on the pattern
        self.tokens = self._tokenize(text)
        # Sort the tokens and remove duplicates
        self.unique_tokens = sorted(set(self.tokens))

        # Step 2: Create the token-to-ID mapping dictionary
        self.token_to_id = {token: i for i, token in enumerate(self.unique_tokens)}

    def _tokenize(self, text):
        if self.pattern is None:
            return text.split()
        else:
            return re.findall(self.pattern, text)

    def encode(self, text):
        tokens = self._tokenize(text)
        token_ids = []

        # Handle tokens not seen during initialization and add them to the vocabulary
        for token in tokens:
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.token_to_id)
            token_ids.append(self.token_to_id[token])

        return token_ids

    def decode(self, token_ids):
        # Rebuild the reverse mapping so it reflects any tokens added by encode()
        self.id_to_token = {
            token_id: token for token, token_id in self.token_to_id.items()
        }

        # Decode the token ids back to tokens
        tokens = [self.id_to_token[token_id] for token_id in token_ids]

        # Join tokens with spaces (original spacing and dropped characters are not restored)
        return " ".join(tokens)


# Pattern 1: Word Only
# Example 1:
text = "Hello, world! This is a test."
tokenizer = PatternTokenizer(text, "word_only")
print(tokenizer.tokens)
print(tokenizer.unique_tokens)

# Encode the text
encoded = tokenizer.encode(text)
print(encoded)
print(tokenizer.token_to_id)

# Decode the text
decoded = tokenizer.decode(encoded)
print("Encoding of Example 1: ", encoded)
print("Decoding of Example 1: ", decoded)

# Example 2:
text_2 = "This is the best way to tokenize text."
encoded_2 = tokenizer.encode(text_2)
print("Encoding of Example 2: ", encoded_2)

decoded_2 = tokenizer.decode(encoded_2)
print("Decoding of Example 2: ", decoded_2)

['Hello', 'world', 'This', 'is', 'a', 'test']
['Hello', 'This', 'a', 'is', 'test', 'world']
[0, 5, 1, 3, 2, 4]
{'Hello': 0, 'This': 1, 'a': 2, 'is': 3, 'test': 4, 'world': 5}
Encoding of Example 1:  [0, 5, 1, 3, 2, 4]
Decoding of Example 1:  Hello world This is a test
Encoding of Example 2:  [1, 3, 6, 7, 8, 9, 10, 11]
Decoding of Example 2:  This is the best way to tokenize text
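
Example 2 receives the new IDs 6 to 11 because encode adds unseen tokens to the vocabulary on the fly. The sketch below isolates that behaviour and contrasts it with one common alternative, a frozen vocabulary with a shared unknown token. The helper names and the "<unk>" token are assumptions for illustration and are not part of the PatternTokenizer class.

```python
# Build a small vocabulary the same way the class does: sorted unique tokens
vocab = {tok: i for i, tok in enumerate(sorted(set("This is a test".split())))}

def encode_growing(tokens, vocab):
    """Mirror of the exercise's behaviour: unseen tokens get the next free ID."""
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
        ids.append(vocab[tok])
    return ids

def encode_frozen(tokens, vocab, unk_id):
    """Hypothetical alternative: unseen tokens all map to one unknown ID."""
    return [vocab.get(tok, unk_id) for tok in tokens]

new_tokens = "This is the best test".split()

# Frozen variant: copy the vocabulary and reserve one ID for unknown tokens
frozen_vocab = dict(vocab)
frozen_vocab["<unk>"] = len(frozen_vocab)
frozen_ids = encode_frozen(new_tokens, frozen_vocab, frozen_vocab["<unk>"])

# Growing variant: mutates vocab in place, as PatternTokenizer.encode does
growing_ids = encode_growing(new_tokens, vocab)

print("frozen: ", frozen_ids)
print("growing:", growing_ids, "vocabulary size is now", len(vocab))
```

The growing variant can always decode what it encoded, but its vocabulary size depends on every text it has ever seen; the frozen variant keeps the size fixed at the cost of losing unseen words.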


In [50]:
# Pattern 2: Word and Punctuation

# Get new tokenizer for word and punctuation pattern
text = "Hello, world! This is a test."
tokenizer_2 = PatternTokenizer(text, "word_and_punct")
print(tokenizer_2.tokens)
print(tokenizer_2.unique_tokens)

# Example 1:

# Encode the text
encoded = tokenizer_2.encode(text)
print("Encoding of Example 1: ", encoded)
print("Token to ID of Example 1: ", tokenizer_2.token_to_id)

# Decode the text
decoded = tokenizer_2.decode(encoded)
print("Decoding of Example 1: ", decoded)

# Example 2:

# Encode the text
encoded_2 = tokenizer_2.encode(text_2)
print("Encoding of Example 2: ", encoded_2)
print("Token to ID of Example 2: ", tokenizer_2.token_to_id)

# Decode the text
decoded_2 = tokenizer_2.decode(encoded_2)
print("Decoding of Example 2: ", decoded_2)


['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
['!', ',', '.', 'Hello', 'This', 'a', 'is', 'test', 'world']
Encoding of Example 1:  [3, 1, 8, 0, 4, 6, 5, 7, 2]
Token to ID of Example 1:  {'!': 0, ',': 1, '.': 2, 'Hello': 3, 'This': 4, 'a': 5, 'is': 6, 'test': 7, 'world': 8}
Decoding of Example 1:  Hello , world ! This is a test .
Encoding of Example 2:  [4, 6, 9, 10, 11, 12, 13, 14, 2]
Token to ID of Example 2:  {'!': 0, ',': 1, '.': 2, 'Hello': 3, 'This': 4, 'a': 5, 'is': 6, 'test': 7, 'world': 8, 'the': 9, 'best': 10, 'way': 11, 'to': 12, 'tokenize': 13, 'text': 14}
Decoding of Example 2:  This is the best way to tokenize text .
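
The decoded strings above have a space before every punctuation mark because decode joins all tokens with single spaces. One possible post-processing step, shown here as a sketch that is not part of the PatternTokenizer class, removes whitespace in front of the punctuation characters this pattern captures.

```python
import re

decoded = "Hello , world ! This is a test ."

# Drop any whitespace that directly precedes one of the captured punctuation marks
cleaned = re.sub(r"\s+([.,!?;:])", r"\1", decoded)

print(cleaned)
```

This restores the original sentence for simple cases like this one, although characters the pattern never captured (quotes, hyphens, extra spaces) still cannot be recovered.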


In [52]:
# Pattern 3: Hyphenated

# Get new tokenizer for hyphenated pattern
text = "This is a hyphenated-word."
tokenizer_3 = PatternTokenizer(text, "hyphenated")
print(tokenizer_3.tokens)
print(tokenizer_3.unique_tokens)

# Example 1:

# Encode the text
encoded = tokenizer_3.encode(text)
print("Encoding of Example 1: ", encoded)
print("Token to ID of Example 1: ", tokenizer_3.token_to_id)

# Decode the text
decoded = tokenizer_3.decode(encoded)
print("Decoding of Example 1: ", decoded)

# Example 2:

text_2 = "This has many hyphenated-words. Like one is bigger-than-the-other, and another one is smaller-than-the-other."
encoded_2 = tokenizer_3.encode(text_2)
print("Encoding of Example 2: ", encoded_2)
print("Token to ID of Example 2: ", tokenizer_3.token_to_id)

decoded_2 = tokenizer_3.decode(encoded_2)
print("Decoding of Example 2: ", decoded_2)

['This', 'is', 'a', 'hyphenated-word']
['This', 'a', 'hyphenated-word', 'is']
Encoding of Example 1:  [0, 3, 1, 2]
Token to ID of Example 1:  {'This': 0, 'a': 1, 'hyphenated-word': 2, 'is': 3}
Decoding of Example 1:  This is a hyphenated-word
Encoding of Example 2:  [0, 4, 5, 6, 7, 8, 3, 9, 10, 11, 8, 3, 12]
Token to ID of Example 2:  {'This': 0, 'a': 1, 'hyphenated-word': 2, 'is': 3, 'has': 4, 'many': 5, 'hyphenated-words': 6, 'Like': 7, 'one': 8, 'bigger-than-the-other': 9, 'and': 10, 'another': 11, 'smaller-than-the-other': 12}
Decoding of Example 2:  This has many hyphenated-words Like one is bigger-than-the-other and another one is smaller-than-the-other


In [56]:
# Pattern 4: Alphanumeric

# Get new tokenizer for alphanumeric pattern
text = "This is a subword-like tokenization. It has 1, 23, 456, 7890 and 1234567890"
tokenizer_4 = PatternTokenizer(text, "alphanumeric")
print(tokenizer_4.tokens)
print(tokenizer_4.unique_tokens)

# Example 1:

# Encode the text
encoded = tokenizer_4.encode(text)
print("Encoding of Example 1: ", encoded)
print("Token to ID of Example 1: ", tokenizer_4.token_to_id)

# Decode the text
decoded = tokenizer_4.decode(encoded)
print("Decoding of Example 1: ", decoded)

# Example 2:

text_2 = "This is the best way we can do this 234567890"
encoded_2 = tokenizer_4.encode(text_2)
print("Encoding of Example 2: ", encoded_2)
print("Token to ID of Example 2: ", tokenizer_4.token_to_id)

decoded_2 = tokenizer_4.decode(encoded_2)
print("Decoding of Example 2: ", decoded_2)

['This', 'is', 'a', 'subword', 'like', 'tokenization', 'It', 'has', '1', '23', '456', '7890', 'and', '1234567890']
['1', '1234567890', '23', '456', '7890', 'It', 'This', 'a', 'and', 'has', 'is', 'like', 'subword', 'tokenization']
Encoding of Example 1:  [6, 10, 7, 12, 11, 13, 5, 9, 0, 2, 3, 4, 8, 1]
Token to ID of Example 1:  {'1': 0, '1234567890': 1, '23': 2, '456': 3, '7890': 4, 'It': 5, 'This': 6, 'a': 7, 'and': 8, 'has': 9, 'is': 10, 'like': 11, 'subword': 12, 'tokenization': 13}
Decoding of Example 1:  This is a subword like tokenization It has 1 23 456 7890 and 1234567890
Encoding of Example 2:  [6, 10, 14, 15, 16, 17, 18, 19, 20, 21]
Token to ID of Example 2:  {'1': 0, '1234567890': 1, '23': 2, '456': 3, '7890': 4, 'It': 5, 'This': 6, 'a': 7, 'and': 8, 'has': 9, 'is': 10, 'like': 11, 'subword': 12, 'tokenization': 13, 'the': 14, 'best': 15, 'way': 16, 'we': 17, 'can': 18, 'do': 19, 'this': 20, '234567890': 21}
Decoding of Example 2:  This is the best way we can do this 234567890