# Exercise 2: Create a word-level tokenizer with different splitting rules

### Advantages of Word-Level Tokenization:
1. Preserves word-level semantic meaning.
2. Shorter sequences compared to character-level tokenization.

### Disadvantages of Word-Level Tokenization:
1. Larger vocabulary: every distinct word form needs its own entry.
2. Sensitivity to splitting rules: different rules can produce noticeably different tokens from the same text.
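
To make the sequence-length point concrete, here is a minimal sketch that tokenizes the opening words of the text at character level and at word level. Whitespace splitting stands in for a word-level rule here, and the variable names are illustrative assumptions rather than part of the exercise.

```python
# Opening words of the text used in this exercise
sample = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough"

char_tokens = list(sample)    # character-level: one token per character
word_tokens = sample.split()  # word-level stand-in: split on whitespace

# The word-level sequence is much shorter than the character-level one
print("characters:", len(char_tokens), "| words:", len(word_tokens))

# Vocabulary = set of distinct tokens. On a snippet this short the two sizes
# are close; over a full corpus the word vocabulary keeps growing while the
# character vocabulary stays small.
print("char vocab:", len(set(char_tokens)), "| word vocab:", len(set(word_tokens)))
```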

## Implementation

### Step 1: Load the text from the file

In [3]:
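# Load the text (adjust the path to wherever the-verdict.txt is stored on your machine)
with open(
    "/Users/sadiahzahoor/Desktop/AI Research/LLMs /LLM's from Scratch/the-verdict.txt",
    "r",
    encoding="utf-8",
) as file:
    text = file.read()

# Report the size of the text and preview the first 200 characters
print("Total number of characters in the text: ", len(text))
print(text[:200])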

Total number of characters in the text:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a


### Step 2: Define the splitting rules as regex patterns

#### Regex Patterns for Splitting Rules
1. None (Whitespace)
- This is not a regex pattern but a placeholder (Python's `None`).
- It stands for the whitespace rule: the tokenizer is expected to treat `None` as "split on whitespace" instead of applying a regex.
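
As a minimal sketch of what the whitespace rule produces, assuming `None` is handled by calling `str.split()` (the handling code is not shown in this section, so that is an assumption):

```python
sample = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough"

# str.split() with no arguments splits on any run of whitespace
whitespace_tokens = sample.split()
print(whitespace_tokens)
```

Punctuation stays attached to its neighbours under this rule, so "genius--though" comes out as a single token.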

---

2. r'\b[a-zA-Z]+\b' (Word Only)
- Matches words containing only *alphabetic characters* (a-z, A-Z).
- \b → Ensures the word is *bounded* (i.e., it starts and ends at a word boundary).
- [a-zA-Z]+ → Matches *one or more* (+) alphabetic characters.
- *Example Matches*:
  - ✅ "hello"
  - ✅ "Test"
  - ✅ "World"
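- *Does Not Match*:
  - ❌ "123" (no letters)
  - ❌ "hello123" (letters and digits form one run of word characters, so there is no word boundary after "hello")
  - ❌ "email@domain.com" (not matched as a whole; a search finds "email", "domain", and "com" separately)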
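
The behaviour above can be checked with a short sketch. It assumes tokens are extracted with `re.findall`, which this section does not state explicitly; the name `word_only` is illustrative.

```python
import re

word_only = re.compile(r'\b[a-zA-Z]+\b')

print(word_only.findall("hello Test World"))   # ['hello', 'Test', 'World']
print(word_only.findall("123 hello123"))       # []
print(word_only.findall("email@domain.com"))   # ['email', 'domain', 'com']
```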

---

3. r'\b[a-zA-Z0-9]+\b' (Word or Number)
- Similar to the previous pattern but allows *numbers*.
- [a-zA-Z0-9]+ → Matches *one or more* letters (a-z, A-Z) or digits (`0-9`).
- *Example Matches*:
  - ✅ "hello"
  - ✅ "test123"
  - ✅ "2024"
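- *Does Not Match*:
  - ❌ "hello-123" (not matched as a whole; the hyphen splits it into "hello" and "123")
  - ❌ "email@domain.com" (not matched as a whole; "@" and "." split it into "email", "domain", and "com")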
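
A quick sketch of this pattern in use, again assuming extraction with `re.findall` (the name `word_or_number` is illustrative). The last line shows an edge case worth knowing: `\b` treats the underscore as a word character, so an identifier such as "snake_case" yields no tokens at all.

```python
import re

word_or_number = re.compile(r'\b[a-zA-Z0-9]+\b')

print(word_or_number.findall("hello test123 2024"))  # ['hello', 'test123', '2024']
print(word_or_number.findall("hello-123"))           # ['hello', '123']
print(word_or_number.findall("snake_case"))          # []
```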

---

4. r'\b[a-zA-Z0-9]+(?:-[a-zA-Z]+)*+\b' (Words with Hyphens)
- Allows *hyphenated words*.
- \b → Ensures word boundary.
- [a-zA-Z0-9]+ → Matches *a word with letters or numbers*.
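- (?:-[a-zA-Z]+)*+ → Allows *hyphenated parts* (-word) *zero or more times*. The *+ is a *possessive* quantifier: it behaves like * but never gives back what it has matched. Python's re module supports it only from version 3.11; older versions raise re.error for this pattern.
- *Example Matches*:
  - ✅ "high-quality"
  - ✅ "multi-purpose"
  - ✅ "user-friendly"
- *Does Not Match*:
  - ❌ "hello-" (the trailing hyphen is left out; only "hello" is matched)
  - ❌ "-hello" (the leading hyphen is left out; only "hello" is matched)
  - ❌ "123-456" (digits are not allowed after a hyphen, so "123" and "456" are matched separately)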
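
The sketch below tries the hyphen pattern on a few inputs, assuming extraction with `re.findall`. Because possessive quantifiers need Python 3.11 or later, it falls back to the plain greedy `*` on older interpreters. That fallback is an addition made here for portability, not part of the exercise, and it can give different results on some edge cases even though it agrees on the inputs shown.

```python
import re
import sys

POSSESSIVE = r'\b[a-zA-Z0-9]+(?:-[a-zA-Z]+)*+\b'  # pattern from the exercise
GREEDY = r'\b[a-zA-Z0-9]+(?:-[a-zA-Z]+)*\b'       # fallback (assumption) for Python < 3.11

hyphen_words = re.compile(POSSESSIVE if sys.version_info >= (3, 11) else GREEDY)

print(hyphen_words.findall("a user-friendly, high-quality tool"))
# ['a', 'user-friendly', 'high-quality', 'tool']
print(hyphen_words.findall("hello- -hello"))  # ['hello', 'hello']
print(hyphen_words.findall("123-456"))        # ['123', '456']
```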

---

5. r'\b[a-zA-Z0-9]+\b|[.,!?;;:]' (Word or Punctuation as Separate Tokens)
- Matches *either*:
  - Words (\b[a-zA-Z0-9]+\b)
  - OR punctuation characters ([.,!?;;:]); the repeated ; inside the class is redundant and matches the same characters as [.,!?;:]
- *Example Matches*:
  - ✅ "hello"
  - ✅ "world"
  - ✅ "123"
  - ✅ "!", ".", ";"
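- *Does Not Match*:
  - ❌ "email@domain.com" (not matched as a whole; "@" is skipped, while "email", "domain", ".", and "com" are matched separately)
  - ❌ "hello-world" (the hyphen is not in the punctuation class, so it is dropped and "hello" and "world" are matched separately)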
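
A short sketch of this rule on a sentence with punctuation, assuming extraction with `re.findall` (the name `word_or_punct` is illustrative). Characters outside both alternatives, such as the apostrophe, are silently dropped.

```python
import re

word_or_punct = re.compile(r'\b[a-zA-Z0-9]+\b|[.,!?;;:]')

print(word_or_punct.findall("Hello, world! It's 2024."))
# ['Hello', ',', 'world', '!', 'It', 's', '2024', '.']
print(word_or_punct.findall("hello-world"))
# ['hello', 'world']
```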

---

6. r'\b[a-zA-Z]{2,}|[a-zA-Z]' (Subword-Like: At Least 2 Letters)
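- Matches *runs of at least 2 letters*, but also allows *single letters*.
- \b[a-zA-Z]{2,} → Matches a run of *at least two* ({2,}) letters that starts at a word boundary.
- |[a-zA-Z] → If the first alternative fails at a position, matches a *single letter* instead.
- There is no closing \b, so a run of letters followed by digits is still matched ("hello123" yields "hello").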
- *Example Matches*:
  - ✅ "hello"
  - ✅ "AI"
  - ✅ "I"
  - ✅ "A"
- *Does Not Match*:
  - ❌ "123" (numbers not included)
  - ❌ "@#" (no letters)
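
The sketch below shows both alternatives at work, assuming extraction with `re.findall` (the name `letters` is illustrative). The last two inputs illustrate the effect of the missing closing `\b` and the single-letter fallback.

```python
import re

letters = re.compile(r'\b[a-zA-Z]{2,}|[a-zA-Z]')

print(letters.findall("I saw AI in 2024"))  # ['I', 'saw', 'AI', 'in']
print(letters.findall("hello123"))          # ['hello']
print(letters.findall("x1y"))               # ['x', 'y']
```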

---

#### Summary

| Pattern | Description | Example Matches | Does Not Match |
|---------|-------------|-----------------|----------------|
| `None` | Placeholder for whitespace | " " | N/A |
| `\b[a-zA-Z]+\b` | Words only (letters) | "hello", "World" | "123", "hello123" |
| `\b[a-zA-Z0-9]+\b` | Words and numbers | "test123", "42" | "hello-123", "email@domain.com" |
| `\b[a-zA-Z0-9]+(?:-[a-zA-Z]+)*+\b` | Words with hyphens | "high-quality", "multi-purpose" | "123-456", "-hello" |
| `\b[a-zA-Z0-9]+\b\|[.,!?;;:]` | Words and punctuation as separate tokens | "hello", "!", ";" | "email@domain.com" |
| `\b[a-zA-Z]{2,}\|[a-zA-Z]` | Words with 2+ letters, or single letters if needed | "hello", "I", "AI" | "123", "@#" |

In [4]:
# Define the splitting rules (regex patterns)
patterns = [
    None,  # Whitespace (placeholder, not a regex)
    r'\b[a-zA-Z]+\b',  # Word only
    r'\b[a-zA-Z0-9]+\b',  # Word or number
    r'\b[a-zA-Z0-9]+(?:-[a-zA-Z]+)*+\b',  # Words with hyphens (possessive *+ needs Python 3.11+)
    r'\b[a-zA-Z0-9]+\b|[.,!?;;:]',  # Word or punctuation as separate tokens
    r'\b[a-zA-Z]{2,}|[a-zA-Z]',  # Subword-like: at least 2 letters, else a single letter
]
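
One way these rules could be applied is sketched below. The `tokenize` helper is hypothetical: it assumes `None` means "split on whitespace" and that regex rules are applied with `re.findall`, neither of which is confirmed by this section. The `patterns` list is repeated so the sketch runs on its own.

```python
import re

def tokenize(text, pattern):
    """Hypothetical helper: whitespace split for None, regex matches otherwise."""
    if pattern is None:
        return text.split()
    return re.findall(pattern, text)

patterns = [
    None,
    r'\b[a-zA-Z]+\b',
    r'\b[a-zA-Z0-9]+\b',
    r'\b[a-zA-Z0-9]+(?:-[a-zA-Z]+)*+\b',
    r'\b[a-zA-Z0-9]+\b|[.,!?;;:]',
    r'\b[a-zA-Z]{2,}|[a-zA-Z]',
]

sample = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough"

for pattern in patterns:
    try:
        tokens = tokenize(sample, pattern)
    except re.error as exc:
        # The possessive quantifier *+ is rejected before Python 3.11
        print(f"{pattern!r}: not supported on this Python version ({exc})")
        continue
    print(f"{pattern!r}: {len(tokens)} tokens, {len(set(tokens))} unique")
```

Comparing the token counts and unique-token counts per rule gives a quick view of how much the choice of splitting rule changes the result on the same text.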