
---


### ✅ Tokenizers:

- **Tokenization is the first step** in any Transformer-based Natural Language Processing (NLP) pipeline.
- **Neural networks cannot process raw text**, so tokenization converts human-readable text into **numerical token IDs** that models can understand.
- An effective tokenizer must balance:
  - **Preserving the semantic meaning** of the text.
  - **Keeping the vocabulary size manageable** for efficient training and generalization.
- The choice of tokenizer has a **direct impact on the model’s performance**, especially in tasks like text classification, translation, and question answering.
- **Subword-based tokenizers** (e.g., BPE, WordPiece, Unigram LM) are preferred over **word-level** or **character-level** approaches because they:
  - Reduce the out-of-vocabulary (OOV) problem.
  - Handle **rare and compound words** more effectively.
  - Support **multilingual** and domain-specific tasks better.
  - Produce **compact and generalizable representations**.

---


---

## 🧩 Why Tokenization Matters

1. **Efficiency** – Subword tokens balance sequence length and vocabulary size.
2. **Generalization** – Models can handle rare/unseen words by breaking them down.
3. **Multilingual Support** – Subword tokenization helps handle multiple languages with shared vocab.

---

In [7]:
## This is how tokenization is done

from transformers import AutoTokenizer

# Load the tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sample sentence
text = "Transformers are revolutionizing natural language processing."

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Alternatively, use the encode method to get IDs directly (includes special tokens like [CLS], [SEP])
encoded = tokenizer.encode(text)
print("Encoded with special tokens:", encoded)

# Decode back to text to see how the model interprets the ID sequence
decoded = tokenizer.decode(encoded)
print("Decoded text:", decoded)


Tokens: ['transformers', 'are', 'revolution', '##izing', 'natural', 'language', 'processing', '.']
Token IDs: [19081, 2024, 4329, 6026, 3019, 2653, 6364, 1012]
Encoded with special tokens: [101, 19081, 2024, 4329, 6026, 3019, 2653, 6364, 1012, 102]
Decoded text: [CLS] transformers are revolutionizing natural language processing. [SEP]


---
---
### ❓ **Question:**  
Earlier, for generating embeddings from text, we had models like Word2Vec (CBOW, Skip-Gram) and GloVe. Were tokenizers used at that time as well, or did tokenizers come into existence only after transformer models?

---
---
### ✅ **Answer:**  
Yes, tokenizers were used even before transformers — they are not new. However, the **type and complexity** of tokenization used in models like Word2Vec and GloVe were much **simpler** compared to those used in modern transformer models.

- **Pre-transformer models** (like Word2Vec, with its CBOW and Skip-Gram variants, and GloVe) generally used **basic word-level tokenization**, such as splitting text on whitespace or punctuation.
- These models treated each **word** as a distinct unit and learned **static embeddings** for them.
- If a word wasn’t in the vocabulary, it was considered **out-of-vocabulary (OOV)** and either ignored or replaced with a special token like `[UNK]`.

In contrast, **transformer-based models** (like BERT and GPT) required more advanced tokenization methods due to:
- The need to handle **rare and unseen words**
- **Multilingual and domain-specific** text
- **Longer context windows** and **contextual embeddings**

Subword tokenizers such as **Byte Pair Encoding (BPE)**, **WordPiece**, and the **Unigram Language Model** meet these needs by tokenizing at the **subword** level, which gives a more **robust, flexible vocabulary**.

So, tokenization **did exist before transformers**. The **learned subword tokenization** techniques appeared in the years just before transformers (WordPiece dates to 2012, and BPE was applied to neural machine translation in 2015) and became the default choice with transformer models.

--- 


---

## 🕰️ **Early Tokenization Approaches (Pre-Transformer Era)**

### 🔹 Word-level Tokenization
- Example: `"I love machine learning"` → `["I", "love", "machine", "learning"]`
- Each word gets a unique ID.
- **Problem**: Vocabulary size is huge, and unseen words (out-of-vocabulary or OOV) are a big problem.

### 🔹 Character-level Tokenization
- Example: `"hello"` → `["h", "e", "l", "l", "o"]`
- Helps with OOV, but **loses semantic meaning** at word/subword level.
- Input sequences become longer, slowing training and inference.
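
To make the two approaches concrete, here is a minimal sketch in plain Python. The two-sentence corpus, the `<unk>` token name, and the helper names `encode_words` and `encode_chars` are illustrative assumptions, not part of any library.

```python
# Toy word-level and character-level tokenizers (illustrative only).
corpus = ["I love machine learning", "I love deep learning"]

# Word-level: one ID per distinct whitespace-separated word, plus an unknown token.
word_vocab = {"<unk>": 0}
for sentence in corpus:
    for word in sentence.split():
        word_vocab.setdefault(word, len(word_vocab))

def encode_words(text):
    """Map each word to its ID; unseen words fall back to <unk>."""
    return [word_vocab.get(word, word_vocab["<unk>"]) for word in text.split()]

def encode_chars(text):
    """Character-level: every character is its own token."""
    return list(text)

word_ids = encode_words("I love machine translation")  # "translation" is unseen
char_tokens = encode_chars("hello")

print(word_ids)
print(char_tokens)
```

The unseen word collapses to the `<unk>` ID, while the character-level version never fails but needs five tokens for a single short word.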

---

### 🚫 **Limitations of Word-Based Tokenizers**

- **Massive Vocabulary Requirement:**
  - To fully cover a language like English, a word-based tokenizer needs an **identifier for every possible word**.
  - English alone has over **500,000 words**, leading to a **huge vocabulary size** and memory footprint.

- **No Generalization Across Similar Words:**
  - Words like `"dog"` and `"dogs"` or `"run"` and `"running"` are treated as **completely separate tokens**.
  - The model has **no inherent understanding** of their similarity, unless it learns it during training.
  - This leads to **inefficient learning**, especially for morphological variants of the same root word.

- **Out-of-Vocabulary (OOV) Problem:**
  - Words not present in the tokenizer's vocabulary are replaced with a special **“unknown” token**, commonly `[UNK]` or `<unk>`.
  - This results in a **loss of information**, as the model receives **no meaningful representation** of the word.
  - If many `[UNK]` tokens appear, it's a sign that the tokenizer is **failing to capture the input effectively**.

- **Vocabulary Design Trade-off:**
  - A larger vocabulary reduces `[UNK]` usage but **increases memory and computation**.
  - A smaller vocabulary leads to more `[UNK]` tokens, causing **information loss**.

---

### ⚠️ **Limitations of Character-Level Tokenization**

- **Reduced Semantic Meaning (Especially in Latin-Script Languages):**
  - Characters carry **less intuitive meaning** on their own compared to full words.
  - For example, the letter `"e"` has little meaning by itself, unlike a word like `"eat"`.

- **Language Dependency:**
  - In languages like **Chinese**, each character often represents a full word or concept, making **character-level tokenization more meaningful**.
  - In contrast, for **languages written in the Latin script** (like English), individual characters provide **less useful context**.

- **Longer Sequences:**
  - A single word that would normally be **one token** using a word-based or subword-based tokenizer can become **10 or more tokens** at the character level.
  - This leads to:
    - **Increased computational cost**
    - **Higher memory usage**
    - Potential for **longer training and inference times**

---




---

## 🧬 **Subword Tokenization — The Game Changer**

Subword tokenization is a **middle ground** between **word-level** & **character-level** tokenization. It breaks words into smaller **frequent units**.

---

### 🔠 **Why Subword Tokenization Works Well**

- **Core Principle:**
  - Subword tokenization relies on the idea that:
    - **Frequent words** should be kept **as whole tokens**.
    - **Rare words** should be **split into meaningful subword units**.

- **Example:**
  - The word `"annoyingly"` might be split into:
    - `"annoying"` and `"ly"`
    - Both are common subwords with individual meanings.
    - The **composite meaning** is preserved while improving token efficiency.

---

### 🧪 **Tokenization Example:**

Given the sentence:  
**"Let’s do tokenization!"**

- A subword tokenizer might split it as:
  ```
  ["Let", "’", "s", "do", "token", "ization", "!"]
  ```

- Key observations:
  - `"token"` and `"ization"` are semantically meaningful.
  - These **subwords use far fewer tokens than a character-level split** while preserving meaning.
  - **Long words get efficiently represented** without adding `[UNK]`.

---

### ✅ **Advantages of Subword Tokenization:**

- **Efficient Vocabulary Coverage:**
  - Enables **compact vocabularies** while still representing a vast number of word forms.
  - Reduces the occurrence of **unknown tokens** (e.g., `[UNK]`).

- **Preserves Semantic Information:**
  - Splits maintain **interpretable pieces** of meaning (e.g., `"token"` + `"ization"`).
  - Helps the model **learn better contextual representations**.

- **Highly Effective in Agglutinative Languages:**
  - Languages like **Turkish** or **Finnish** allow long, complex words formed by stringing smaller parts.
  - Subword tokenization handles these gracefully without needing massive vocabularies.

---



---

## 🧩 **Understanding Normalization and Pre-tokenization in Transformers**

Before diving into subword tokenization algorithms like **Byte Pair Encoding (BPE)**, **WordPiece**, and **Unigram**, it's important to understand two essential preprocessing steps that all tokenizers perform:

---

### 🧼 **1. Normalization**

**Purpose:**  
Cleans up and standardizes the raw text (**before it is split**) so that different textual variations map to the same base form.

**Common operations include:**
- Lowercasing (in uncased models)
- Removing accents (diacritics)
- Unicode normalization (e.g., NFC, NFKC)
- Stripping extra whitespace

---

#### 🔠 **1.1. Lowercasing (in uncased models)**  
Converts all characters to lowercase for case-insensitive models.

- **Input:** `"Transformers are AWESOME!"`  
- **Output:** `"transformers are awesome!"`  
- ✅ Helps reduce vocabulary size and treats `"Dog"` and `"dog"` as the same.

---

#### 🇦🇨 **1.2. Removing Accents (Diacritics)**  
Strips accents from letters to handle accented and unaccented forms as equivalent.

- **Input:** `"Héllo, hów àré yöu?"`  
- **Output:** `"Hello, how are you?"`  
- ✅ Useful in multilingual settings where accents may vary or be inconsistently typed.

---

#### 🔡 **1.3. Unicode Normalization (e.g., NFC, NFKC)**  
Ensures that characters with multiple Unicode representations are treated as equivalent.

> 💡 **Every character in digital text is represented using a unique Unicode code point.** However, some characters can be represented in **multiple ways** in Unicode.

For example, the character **"é"** can be represented as:
- A **single code point**: `U+00E9` (é)
- Or as **two combined code points**: `U+0065` (e) + `U+0301` (combining acute accent)

- **Input:** `"école"` (using `e` + accent)
- **Output after NFC normalization:** `"école"` (single-character `é`)

✅ Unicode normalization ensures both forms are treated as the same during processing.
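
The two representations of `"é"` can be checked with Python's standard `unicodedata` module. This is a small sketch; the variable names are only for illustration.

```python
import unicodedata

decomposed = "e\u0301cole"  # "e" followed by U+0301 (combining acute accent)
composed = "\u00e9cole"     # precomposed U+00E9

# NFC composes base character + combining mark into a single code point.
nfc = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed), len(nfc))
print(decomposed == composed, nfc == composed)
```

Before normalization the two strings differ in length and compare unequal even though they render identically; after NFC they are the same string.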

---

#### 🧹 **1.4. Stripping Extra Whitespace**  
Removes leading/trailing whitespace and reduces multiple spaces to a single one.

- **Input:** `"   Hello     world   "`  
- **Output:** `"Hello world"`  
- ✅ Keeps the text clean and avoids misleading token boundaries.
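
The four operations can be combined into one small function. This is a hand-written sketch using only the standard library, assuming an uncased, accent-stripping setup similar in spirit to `bert-base-uncased`; it is not the normalizer that any particular model ships with, and the function name `normalize` is hypothetical.

```python
import unicodedata

def normalize(text, lowercase=True, strip_accents=True):
    """Apply lowercasing, accent stripping, and whitespace cleanup."""
    if lowercase:
        text = text.lower()
    if strip_accents:
        # NFD splits "é" into "e" + a combining mark; drop the marks (category "Mn").
        text = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

normalized = normalize("  Héllo,   hów àré   yöu?  ")
print(normalized)
```

Passing `lowercase=False, strip_accents=False` leaves only the whitespace cleanup, which mirrors what a cased model would want.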

---


In [8]:
## This is how normalizer of a Tokenizer can be used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?



### ✂️ **2. Pre-tokenization**

**Purpose:**  
Breaks the normalized text into preliminary segments (like words, punctuation, or space tokens) before applying subword algorithms.

**Why it matters:**  
Most subword algorithms learn from word-level statistics, so the text is first split into word-like units. Pre-tokenization defines the **boundaries** within which subwords are learned.

🔹 **Key Behavior:**
- Splits on **whitespace and punctuation**
- Collapses extra spaces
- Keeps **offsets**, which are useful for alignment (e.g., in question answering)
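
A whitespace-and-punctuation pre-tokenizer with offsets can be approximated with a regular expression. This is a rough sketch of the behavior described above, not the implementation used by any library; the function name `pre_tokenize` is an assumption.

```python
import re

def pre_tokenize(text):
    """Split on whitespace and punctuation, keeping (start, end) character offsets."""
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark.
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

pieces = pre_tokenize("Hello, how are  you?")
print(pieces)
```

The double space before `you` produces no token of its own, but it is still visible in the offsets, which jump from 14 to 16.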

---

In [9]:
## ***BERT Tokenizer*** ##

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))

[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]



### 🔠 **3. Tokenizer Differences in Pre-tokenization**

#### 🔸 **GPT-2 Tokenizer**
- Keeps whitespace by encoding it as a **special character `Ġ`**
- This allows reconstructing exact text formatting during decoding
- Does **not ignore** double spaces


In [10]:
## GPT2 Tokenizer 
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")


[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġhow', (6, 10)),
 ('Ġare', (10, 14)),
 ('Ġ', (14, 15)),
 ('Ġyou', (15, 19)),
 ('?', (19, 20))]

#### 🔸 **T5 Tokenizer (SentencePiece-based)**
- Uses **`▁` (U+2581)** to represent space
- Only splits on **whitespace**, not punctuation
- Prepends a space automatically at the beginning


In [11]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('▁Hello,', (0, 6)),
 ('▁how', (7, 10)),
 ('▁are', (11, 14)),
 ('▁you?', (16, 20))]

---

### ✅ **Summary: Why These Steps Matter**

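These preprocessing stages help keep tokenization:
- **Consistent**: the same text always maps to the same segments
- **Reversible where needed**: GPT-2 and T5 keep whitespace information, while BERT's lowercasing and accent stripping are lossy
- **Adaptable across languages and domains**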

---


---

## 🔧 **Byte-Pair Encoding (BPE) Tokenization**
(HuggingFace Code Link: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section5.ipynb )
### 📘 **1. Input Preprocessing and Pre-tokenization**

- The raw text is first **pre-tokenized** — usually split into individual words using whitespace and punctuation.
- **Example Input Corpus:**
  ```
  "hug", "pug", "pun", "bun", "hugs"
  ```

- After pre-tokenization, the algorithm counts the **frequency** of each word in the corpus:
  ```
  ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
  ```

---

### 📚 **2. Initial Vocabulary Construction**

- Each word is then **split into individual characters**:
  ```
  "hug" → ["h", "u", "g"]
  "hugs" → ["h", "u", "g", "s"]
  ```

- All **unique characters** across the corpus are added to the **base vocabulary**:
  ```
  ["b", "g", "h", "n", "p", "s", "u", "[UNK]"] 
  ```

- This base vocabulary usually contains:
  - All **ASCII characters**
  - Possibly **Unicode characters**, depending on the dataset

🛑 **Note:**  
If, during inference, a character not present in the training corpus is encountered (like an emoji or unseen character), it will be converted to the **unknown token** (`[UNK]`).

---

### 🔄 **3. Learning Merge Rules**

- The goal is to gradually **merge the most frequent pairs of adjacent tokens**.
- At each step:
  1. Identify the **most frequent consecutive pair** of tokens in the corpus.
  2. Merge that pair into a new token.
  3. Add this new token to the vocabulary.

🧠 **Why?**  
This allows the tokenizer to build more meaningful subwords over time, starting from characters → two-character units → full subwords.

---

### 🔁 **4. Example – Step-by-Step BPE Merge**

**Initial Splits:**

| Word    | Split Characters     | Frequency |
|---------|----------------------|-----------|
| "hug"   | ["h", "u", "g"]      | 10        |
| "pug"   | ["p", "u", "g"]      | 5         |
| "pun"   | ["p", "u", "n"]      | 12        |
| "bun"   | ["b", "u", "n"]      | 4         |
| "hugs"  | ["h", "u", "g", "s"] | 5         |

**Count all adjacent pairs across corpus:**

| Pair       | Frequency |
|------------|-----------|
| ("u", "g") | 20        |
| ("p", "u") | 17        |
| ("u", "n") | 16        |
| ("h", "u") | 15        |
| ...        | ...       |

**Step 1:**  
- Most frequent: `("u", "g") → "ug"`
- Vocabulary after merge: `["b", "g", "h", "n", "p", "s", "u", "ug"]`
- Updated corpus:
  ```
  ("h", "ug"), ("p", "ug"), ("p", "u", "n"), ("b", "u", "n"), ("h", "ug", "s")
  ```

**Step 2:**  
- Most frequent: `("u", "n") → "un"`
- Vocabulary after merge: `["b", "g", "h", "n", "p", "s", "u", "ug", "un"]`
- Updated corpus:
  ```
  ("h", "ug"), ("p", "ug"), ("p", "un"), ("b", "un"), ("h", "ug", "s")
  ```

**Step 3:**  
- Most frequent: `("h", "ug") → "hug"`
- Vocabulary after merge: `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`
- Updated corpus:
  ```
  ("hug"), ("p", "ug"), ("p", "un"), ("b", "un"), ("hug", "s")
  ```

And so on, until the desired **vocabulary size** (e.g., 30,000 or 50,000 tokens) is reached.
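
The three merge steps above can be reproduced with a short training loop. This is a minimal sketch on the toy corpus, assuming character-level initial splits and no special tokens; the helper names `count_pairs` and `merge_pair` are illustrative.

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in word_freqs}

def count_pairs(splits):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, splits):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    left, right = pair
    for word, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged
    return splits

merges = []
for _ in range(3):  # three merges, as in the walkthrough
    pairs = count_pairs(splits)
    best = max(pairs, key=pairs.get)
    splits = merge_pair(best, splits)
    merges.append(best)

print(merges)
print(splits)
```

In a real tokenizer the loop would run until the target vocabulary size is reached rather than for a fixed three steps.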

---

### 🧩 **5. Inference / Tokenization Process**

**Tokenization of new inputs follows these steps:**

5.1. **Normalization**  
   - Input text is cleaned (e.g., lowercased, Unicode normalized).

5.2. **Pre-tokenization**  
   - Text is split into words or symbols (e.g., using whitespace, punctuation).

5.3. **Character Splitting**  
   - Each word is split into its individual characters.  
     Example: `"bug"` → `["b", "u", "g"]`

5.4. **Apply Merge Rules**  
   - Use the **learned merge rules** (from training) in order to combine frequent character pairs.

---

### 🧠 Example: Learned Merge Rules  
Assume the following merge rules were learned during training:
- `("u", "g") → "ug"`
- `("u", "n") → "un"`
- `("h", "ug") → "hug"`

---

### 📦 Tokenizing Examples

- **"bug"**  
  - Start: `["b", "u", "g"]`  
  - Apply: `"u" + "g" → "ug"`  
  - Result: `["b", "ug"]`  
  - ✅ Both tokens are known → Valid tokens

- **"mug"**  
  - Start: `["m", "u", "g"]`  
  - Apply: `"u" + "g" → "ug"`  
  - Result: `["m", "ug"]`  
  - ❌ "m" not in base vocabulary → replaced with `[UNK]`  
  - Final: `["[UNK]", "ug"]`

- **"thug"**  
  - Start: `["t", "h", "u", "g"]`  
  - Apply: `"u" + "g" → "ug"` → `["t", "h", "ug"]`  
  - Then: `"h" + "ug" → "hug"` → `["t", "hug"]`  
  - ❌ "t" not in base vocabulary → replaced with `[UNK]`  
  - Final: `["[UNK]", "hug"]`
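
The same three examples can be run through a small function that applies the merge rules in order. This sketch assumes the base vocabulary and merges from the toy corpus; it replaces unknown characters with `[UNK]` before merging, which gives the same result here as replacing them afterwards. The name `bpe_tokenize` is hypothetical.

```python
merges = [("u", "g"), ("u", "n"), ("h", "ug")]
base_vocab = {"b", "g", "h", "n", "p", "s", "u"}

def bpe_tokenize(word):
    """Split into characters, map unknown characters to [UNK], then apply merges in order."""
    symbols = [ch if ch in base_vocab else "[UNK]" for ch in word]
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge in place
            else:
                i += 1
    return symbols

for word in ["bug", "mug", "thug"]:
    print(word, bpe_tokenize(word))
```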

---


---

## 🧠 **What Is Byte-Level BPE (Used in GPT-2 and RoBERTa)?**

### 📌 The Problem:
Traditional subword tokenizers (like WordPiece or standard BPE) treat **characters as Unicode symbols**. But this can cause problems:
- Some characters (like emojis, special symbols, accented letters, rare scripts) may not be in the training corpus.
- These characters end up as **`[UNK]` tokens**, which means the model can’t understand or use them meaningfully.

---

### 💡 The GPT-2 & RoBERTa Solution: Byte-Level BPE

Instead of treating text as a sequence of **Unicode characters**, Byte-Level BPE treats it as a sequence of **raw bytes**.

### 🔍 What does that mean?

- Any text — regardless of language or symbols — can be converted to a sequence of **bytes** (integers from `0` to `255`).
- Since every character is stored as one or more bytes, a **base vocabulary of just 256 byte values** covers **all possible characters**, including:
  - Emojis (🙂)
  - Accented letters (é, ñ)
  - Currency symbols (₹, ¥)
  - Control characters and whitespace (like `\n`, `\t`)
  
🛡️ **Result:** No `[UNK]` tokens — every possible character is representable.

---

### 🔧 Example:
Let’s take the word:  
```text
"café☕"
```

- Standard tokenizer might fail with `é` or `☕` if not in vocab → `[UNK]`
- Byte-level BPE encodes each character into its **byte value**:
  - `"c"` → 99
  - `"a"` → 97
  - `"f"` → 102
  - `"é"` → 195, 169 (multi-byte)
  - `"☕"` → 226, 152, 149 (multi-byte)
  
Then BPE learns merge rules **on byte sequences**, not characters.

---

### 🔁 So What’s the BPE Part?

Just like standard BPE:
- Byte-level BPE starts with individual **bytes**.
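- It merges **frequent byte sequences** into longer tokens:
  - The two bytes of `"é"` (195, 169) might be merged first, followed by longer sequences such as `"ca"` and `"fé"`. The `##` prefix is a WordPiece convention and is not used here.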
  
Eventually, the tokenizer might learn to represent `"café"` as one token and `"☕"` as another.

---

### ✅ Key Benefits of Byte-Level BPE:
- **Robustness:** Can handle **any character** — no need for `[UNK]`.
- **Compact Base Vocabulary:** Only 256 initial tokens (the 256 byte values).
- **Cross-language friendly:** Works for **multilingual and noisy text**.
- **Precise recovery:** Can **reconstruct the original text exactly** from tokens.

---

In [12]:
text = "café ☕"
byte_sequence = list(text.encode("utf-8"))
print(byte_sequence)

[99, 97, 102, 195, 169, 32, 226, 152, 149]



---

## 💡 What Are Bytes and Why Are They 0–255?

### 🔢 1. **What is a Byte?**
- A **byte** is a unit of digital data that consists of **8 bits**.
- Each **bit** is a binary digit: either `0` or `1`.
- So, a byte is something like `01001101`, `11111111`, etc.

---

### 📊 2. **How Many Values Can a Byte Represent?**

Each bit can be:
- `0` or `1` → **2 possibilities**

So, with 8 bits, the total number of unique combinations is:
```
2^8 = 256
```

That means:
- A single byte can represent **256 distinct values**.
- These values range from:
  - `00000000` (in binary) = **0**
  - to `11111111` (in binary) = **255**

---

### 🔍 3. **Why 0 to 255?**

Because binary counting starts from 0:
- `00000000` → 0  
- `00000001` → 1  
- `00000010` → 2  
- ...  
- `11111111` → 255  

So all byte values fall in the range **[0, 255]**
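
These facts are easy to confirm in Python. The snippet below is a quick check using only built-ins; the variable names are illustrative.

```python
num_values = 2 ** 8               # number of distinct 8-bit patterns
lowest = int("00000000", 2)       # smallest byte value
highest = int("11111111", 2)      # largest byte value
example = int("01001101", 2)      # the sample bit pattern from above

print(num_values, lowest, highest, example)

# A value outside the range cannot be stored in a single byte.
try:
    bytes([256])
    fits_in_one_byte = True
except ValueError:
    fits_in_one_byte = False

print(fits_in_one_byte)
```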

---

### 🌐 4. **Why is This Used in NLP Tokenizers Like GPT-2?**

When you treat text as a sequence of **raw bytes**, you ensure:
- Every character — even emojis, accented letters, or symbols — can be represented.
- No per-language character vocabulary is needed: once the text is UTF-8 encoded, the input is just numbers from 0 to 255.
- This allows **robust and universal tokenization**.

---


---

## 🔤 **What is Unicode?**

- **Unicode** is a **standard** — it defines a **universal set of characters** and assigns each character a **unique code point** (just a number).
- Example characters and their Unicode code points:
  - `"A"` → U+0041
  - `"é"` → U+00E9
  - `"☕"` → U+2615

💡 **Unicode ≠ Encoding**  
Unicode only **assigns IDs to characters**, it does **not say how to store or transmit them** in bytes.

---

## 📦 **What is UTF-8 (and How is it Different)?**

- **UTF-8** is an **encoding format** — it tells the computer **how to store Unicode characters as bytes**.
- UTF-8 is the most common encoding used on the web and in modern software.

### 🔢 How UTF-8 Works:
UTF-8 stores Unicode code points using **1 to 4 bytes**, depending on the character:

| Unicode Range        | UTF-8 Byte Length |
|----------------------|-------------------|
| U+0000 to U+007F     | 1 byte (ASCII)    |
| U+0080 to U+07FF     | 2 bytes           |
| U+0800 to U+FFFF     | 3 bytes           |
| U+10000 to U+10FFFF  | 4 bytes           |

### 🧪 Example:
Let’s take the string: `"café ☕"`

| Character | Unicode Code Point | UTF-8 Bytes (decimal) | UTF-8 Bytes (hex) |
|-----------|--------------------|------------------------|-------------------|
| `c`       | U+0063             | [99]                   | 0x63              |
| `a`       | U+0061             | [97]                   | 0x61              |
| `f`       | U+0066             | [102]                  | 0x66              |
| `é`       | U+00E9             | [195, 169]             | 0xC3 0xA9         |
| space     | U+0020             | [32]                   | 0x20              |
| `☕`       | U+2615             | [226, 152, 149]        | 0xE2 0x98 0x95    |

So the UTF-8 encoded byte sequence is:
```python
[99, 97, 102, 195, 169, 32, 226, 152, 149]
```
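
The table can be regenerated with a few lines of Python, which also shows that the byte sequence decodes back to the original string. This is a small sketch using built-in string methods; the variable names are only for illustration.

```python
text = "café ☕"

# One row per character: the character, its code point, and its UTF-8 bytes.
rows = [(ch, f"U+{ord(ch):04X}", list(ch.encode("utf-8"))) for ch in text]
for row in rows:
    print(row)

all_bytes = list(text.encode("utf-8"))
round_trip = bytes(all_bytes).decode("utf-8")  # bytes back to text, losslessly

print(all_bytes)
print(round_trip == text)
```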

---

## ✅ **Why UTF-8 Is Useful**

- **Backward compatible with ASCII** (first 128 characters use just 1 byte)
- **Compact for English text**, while still supporting all of Unicode
- Handles **all global scripts**, emojis, and symbols
- Avoids `[UNK]` in byte-level tokenization (like in GPT-2)

---


---

## 🔤 WordPiece Tokenization Algorithm 

---

(HuggingFace Notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section6.ipynb)

### 🛠️ **Training Phase**

- **1. Normalization**
  - Clean the training corpus text.
  - Examples: convert to lowercase, strip accents, normalize Unicode, etc.

- **2. Pre-tokenization**
  - Split the normalized text into words (usually using whitespace and punctuation).
  - Example:  
    ```
    "hugging is fun" → ["hugging", "is", "fun"]
    ```

- **3. Initial Vocabulary Construction**
  - Start with:
    - A few **special tokens** (like `[CLS]`, `[SEP]`, `[UNK]`, etc.)
    - All **unique characters** seen in the training corpus.
  - Characters are treated differently depending on their position:
    - **First character** of a word is kept as-is.
    - **Subsequent characters** are prefixed with `##`.
  - Example for word `"word"`:
    ```
    Split as: ["w", "##o", "##r", "##d"]
    ```
  - This creates a vocabulary like:
    ```
    ["w", "##o", "##r", "##d", ...]
    ```

- **4. Corpus Representation (Example)**  
  Let’s say our corpus is:
  ```
  ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
  ```
  We represent them as:
  - `"hug"` → ["h", "##u", "##g"]
  - `"pug"` → ["p", "##u", "##g"]
  - `"pun"` → ["p", "##u", "##n"]
  - `"bun"` → ["b", "##u", "##n"]
  - `"hugs"` → ["h", "##u", "##g", "##s"]

- **5. Merge Rule Selection (Scoring)**  
  - WordPiece selects which pairs to merge based on this score:
    \[
    \text{score} = \frac{\text{frequency of pair}}{\text{freq of first element} \times \text{freq of second element}}
    \]
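  - This favors pairs whose parts **rarely appear apart** over pairs that are merely frequent.
  - Example:
    - `("##u", "##g")` is the most frequent pair (20 occurrences), but its score is only 20 / (36 × 20) ≈ 0.028 because `"##u"` is so common.
    - `("##g", "##s")` occurs just 5 times, yet it scores 5 / (20 × 5) = 0.05, so it is merged first.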

- **6. Vocabulary Growth**  
  - Merge pairs iteratively until the desired vocabulary size is reached.
  - Each merge adds a new token to the vocabulary and updates the corpus.
  - Example merge steps:
    1. `("##g", "##s")` → `"##gs"`
    2. `("h", "##u")` → `"hu"`
    3. `("hu", "##g")` → `"hug"`
    4. ...
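
The first merge can be verified by computing the score for every pair in the toy corpus. This is a minimal sketch of the scoring step only, assuming the `##`-prefixed character splits described above; it does not perform the full training loop.

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# First character as-is, later characters prefixed with "##".
splits = {w: [w[0]] + ["##" + ch for ch in w[1:]] for w in word_freqs}

token_freqs, pair_freqs = Counter(), Counter()
for word, freq in word_freqs.items():
    symbols = splits[word]
    for symbol in symbols:
        token_freqs[symbol] += freq
    for left, right in zip(symbols, symbols[1:]):
        pair_freqs[(left, right)] += freq

# score = freq(pair) / (freq(left) * freq(right))
scores = {
    pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}
best_pair = max(scores, key=scores.get)

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 4))
print("first merge:", best_pair)
```

Every pair that contains `"##u"` ends up with the same score of 1/36, which is why the choice of the second merge is a tie that has to be broken arbitrarily.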


---

## 🧩 Tokenization Algorithm (WordPiece vs BPE)

### 🚀 WordPiece Tokenization Logic

- **WordPiece does not store merge rules**, only the **final vocabulary** built during training.
- Tokenization is based on a **greedy longest-match-first strategy**:
  - At each step, find the **longest substring (subword)** from the current position that exists in the vocabulary.
  - If no such subword exists, the entire word is marked as `[UNK]`.

---

### 📘 Example 1: Tokenizing `"hugs"` (WordPiece)

- Assume `"hug"` and `"##s"` are in the vocabulary.
- Longest match from beginning: `"hug"` → split
- Remaining: `"##s"` → also in vocabulary
- ✅ Final tokenization: `["hug", "##s"]`

#### 🔁 Comparison with BPE:
- BPE would apply merges in order, possibly resulting in:  
  `["hu", "##gs"]`  
- So **the token sequence is different** between WordPiece and BPE.

---

### 📘 Example 2: Tokenizing `"bugs"`

- Step-by-step:
  1. `"b"` is the longest match → keep `"b"`, remaining is `"##ugs"`
  2. `"##u"` is the longest subword match in `"##ugs"` → keep `"##u"`, remaining is `"##gs"`
  3. `"##gs"` is in the vocabulary → keep it
- ✅ Final tokenization: `["b", "##u", "##gs"]`

---

### 📘 Example 3: Tokenizing `"mug"`

- `"m"` never appeared in the training corpus, so it is not in the vocabulary.
- No vocabulary entry matches at the start of the word, so no valid split exists.
- ❌ Final tokenization: `["[UNK]"]`

---

### 📘 Example 4: Tokenizing `"bum"`

- Even though `"b"` and `"##u"` might be in vocabulary,
  - `"##m"` is **not** in vocabulary
- So, tokenization is **not partial** — **entire word is marked as unknown**
- ❌ Final tokenization: `["[UNK]"]`

---

### ✏️ Try This: Tokenize `"pugs"` (WordPiece)

Let’s say the following are in vocabulary:
- `"p"`, `"##u"`, `"##g"`, `"##s"`, `"##gs"`, and `"pug"`

**Tokenization steps:**
1. `"pug"` is the longest valid match from beginning → keep `"pug"`
2. Remaining: `"##s"` → in vocabulary
3. ✅ Final tokenization: `["pug", "##s"]`
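
The greedy longest-match-first procedure fits in a few lines. This is a sketch under the assumption that the vocabulary is the toy one built during training (the base characters plus `"##gs"`, `"hu"`, and `"hug"`); the function name `wordpiece_tokenize` is hypothetical.

```python
toy_vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"}

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first; the whole word becomes [UNK] if any part fails."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens

for word in ["hugs", "bugs", "mug", "bum"]:
    print(word, wordpiece_tokenize(word, toy_vocab))

# With "pug" added to the vocabulary, as assumed in the exercise above:
print(wordpiece_tokenize("pugs", toy_vocab | {"pug"}))
```

Without `"pug"` in the vocabulary, the same function splits `"pugs"` as `["p", "##u", "##gs"]`, which shows how strongly the result depends on what training happened to add.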


---

## ✅ Advantages of WordPiece over BPE

### 1. **Better Handling of Subword Composition**
- **WordPiece uses a scoring function** (closely related to pointwise mutual information) instead of just frequency.
  \[
  \text{score} = \frac{\text{freq(pair)}}{\text{freq(left)} \times \text{freq(right)}}
  \]
- This **penalizes common subwords** and favors merges that create **meaningful units** (e.g., medical terms, rare suffixes).
- 🟢 **Result**: WordPiece builds more semantically useful subwords.

---

### 2. **Lower Risk of Ambiguity in Frequent Subword Reuse**
- BPE may repeatedly merge frequent chunks (like `##ing`, `##tion`) even when they’re not optimal.
- WordPiece avoids merging overly common units just because they appear often, reducing **ambiguity and overfitting to common patterns**.
- Because it ranks pairs by score rather than raw frequency, WordPiece penalizes merging frequent parts unless the pair appears together more often than chance would predict.

---

### 3. **More Controlled Vocabulary Growth**
- Because WordPiece merges based on information content, it often forms **more diverse and useful subwords** within the same vocab size.
- 🟢 Leads to better **vocabulary efficiency**.

---

### 4. **More Consistent and Compact Tokens**
- For some morphologically complex or rare words, WordPiece can produce **fewer tokens** per word, although this depends on the corpus and vocabulary size.
- When it does, the **shorter sequences** are beneficial for:
  - **Training speed**
  - **Memory efficiency**
  - **Model performance** (e.g., in attention mechanisms)

---

### 5. **Better Performance on Downstream Tasks**
- Models using WordPiece (like **BERT**) achieve strong results on tasks like NER, sentiment analysis, and QA.
- Most of that comes from BERT's architecture and pre-training; WordPiece may contribute by giving **cleaner, more meaningful token splits**, but its share is hard to isolate.

---


## 🔤 Unigram Tokenization Algorithm

HuggingFace Link: (https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section7.ipynb)
### 🧠 Core Idea:
- Unlike **BPE** and **WordPiece** which build vocabulary by merging subwords,
- **Unigram** starts with a **large vocabulary** of possible subwords and **removes** the least useful tokens until it reaches a desired size.

---

### 🛠️ Training Phase

#### 1. **Initial Vocabulary Creation**
- Start with a **very large vocabulary** of candidate subwords.
- These subwords can be:
  - All **strict substrings** of the corpus words.
  - Or generated using BPE with a **very large** initial vocab size.

##### 🔎 Example Corpus:
```text
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

##### 🎒 Initial Vocabulary (All strict substrings):
```
["h", "u", "g", "hu", "ug", "hug",
 "p", "pu",
 "n", "un",
 "b", "bu",
 "s", "gs", "ugs"]
```

---

#### 2. **Tokenization & Loss Computation**
- For each word, the algorithm tries to **tokenize it in all possible ways** using the current vocabulary.
- It assigns a **probability** to each token and computes the **loss** (the negative log-likelihood of the corpus) under the current vocab.

---

#### 3. **Prune Vocabulary Iteratively**
- For each token, compute how much the **total loss would increase** if we removed that token.
- Tokens that **increase the loss the least** are considered **least important**.
- Instead of removing one token at a time (slow!), remove the **lowest `p%`** of tokens (e.g., 10–20%) in each iteration.
- 🔁 **Repeat** until the desired vocabulary size is reached.

---

#### 4. **Preserve Base Characters**
- To ensure any word can be tokenized, the **base characters (a–z, digits, etc.) are never removed.**

---

### 🧮 Key Features of Unigram Tokenization

| Feature | Unigram |
|---------|---------|
| Direction | Starts from large vocab → prunes down |
| Core Mechanism | Removes least important tokens by measuring loss impact |
| Probabilistic? | ✅ Yes — considers all valid segmentations |
| Final Vocabulary | Subwords that contribute most to corpus likelihood |
| Unknown Token Handling | Rare (base chars always kept) |

---

### ✅ Advantages Over BPE/WordPiece

- Allows **multiple valid tokenizations** of a word and picks the best.
- Produces **more compact and semantically meaningful** token sets.
- Supports **sampling alternative segmentations** during training (subword regularization).
- Commonly used through **SentencePiece** (by models like T5 and XLNet).

---



## 🔤 Unigram Language Model Tokenization

---

### 📘 What is a Unigram Language Model?

- A **Unigram Language Model** treats each token as **independent** of the previous tokens.
- That means:
  \[
  P(w_1, w_2, ..., w_n) = P(w_1) \times P(w_2) \times ... \times P(w_n)
  \]
- If we **generated** text with this model by always picking the most likely next token, it would output the **most frequent token** every time, ignoring context. That makes it the simplest possible language model.
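
As a small illustration of the independence assumption, the sketch below multiplies per-token probabilities to score a sequence. The probabilities in `token_probs` are made up for the example and do not come from any corpus.

```python
import math

# Hypothetical token probabilities, chosen only for illustration.
token_probs = {"the": 0.5, "cat": 0.3, "sat": 0.2}


def sequence_probability(tokens, probs):
    """P(w_1, ..., w_n) under a unigram model: the product of token probabilities."""
    return math.prod(probs[t] for t in tokens)


p = sequence_probability(["the", "cat", "sat"], token_probs)

# With no context, the most likely next token is always the same one.
most_likely_next = max(token_probs, key=token_probs.get)

print(p, most_likely_next)
```

Because the score is a plain product, reordering the tokens does not change it, which is exactly what "ignoring context" means here.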

---

### 🛠️ Vocabulary & Token Probabilities

- In Unigram tokenization, each **subword** in the vocabulary is assigned a probability based on its **frequency in the corpus**:
  \[
  P(token) = \frac{\text{frequency of token}}{\text{sum of frequencies of all tokens}}
  \]

#### 🔎 Example Corpus:
```text
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

#### 📦 Subword Frequencies (from all words):
```
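("h", 15)   ("u", 36)   ("g", 20)
("hu", 15)  ("ug", 20)  ("hug", 15)
("p", 17)   ("pu", 17)
("n", 16)   ("un", 16)
("b", 4)    ("bu", 4)
("s", 5)    ("gs", 5)   ("ugs", 5)
```

- 🔢 **Total frequency sum** = **210**. Whole words that never occur inside a longer word (`"pug"`, `"pun"`, `"bun"`) are not part of this vocabulary; `"hug"` is, because it occurs inside `"hugs"`.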
- Example:  
  \[
  P("ug") = \frac{20}{210} \approx 0.0952
  \]
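
The counts and probabilities can be recomputed with a few lines of Python. This sketch assumes the vocabulary consists of strict substrings of the corpus words, which is the reading under which the counts sum to 210; the variable names are illustrative.

```python
from collections import Counter

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Vocabulary: every substring that is shorter than the word it comes from.
vocab = set()
for word in corpus:
    n = len(word)
    vocab |= {word[i:j] for i in range(n) for j in range(i + 1, n + 1) if j - i < n}

# Frequency: count every occurrence of a vocabulary token, weighted by word frequency.
subword_freqs = Counter()
for word, freq in corpus.items():
    n = len(word)
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = word[i:j]
            if piece in vocab:
                subword_freqs[piece] += freq

total = sum(subword_freqs.values())
probs = {tok: f / total for tok, f in subword_freqs.items()}

print(total)                  # 210
print(round(probs["ug"], 4))  # 0.0952
```

Note that `"hug"` gets a count of 15 because it is counted both as the word `"hug"` (10) and inside `"hugs"` (5).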

---

### ✂️ Tokenizing a Word: Try All Possible Segmentations

To tokenize a word:
- Try **all possible ways** of splitting it using valid subwords.
- Compute the **probability** of each segmentation by multiplying the probabilities of individual subwords.

#### 📌 Example: Tokenizing `"pug"`

**Option 1:** `["p", "u", "g"]`
\[
P = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} \approx 0.0013217
\]

**Option 2:** `["pu", "g"]`
\[
P = \frac{17}{210} \times \frac{20}{210} \approx 0.0077098
\]

**Option 3:** `["p", "ug"]`
\[
P = \frac{17}{210} \times \frac{20}{210} \approx 0.0077098
\]

🟢 **Best Tokenization**: Either `["p", "ug"]` or `["pu", "g"]` (equal score in this case)
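
The brute-force comparison above can be reproduced with a short recursive enumeration. This is a sketch for small words only; the helper names `segmentations` and `score` are made up for the example, and the token counts are assumed to be the strict-substring counts that sum to 210.

```python
token_freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
    "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(token_freqs.values())


def segmentations(word, vocab):
    """Yield every way to split `word` into tokens from `vocab`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in vocab:
            for rest in segmentations(word[end:], vocab):
                yield [head] + rest


def score(tokens):
    """Product of the unigram probabilities of the tokens."""
    p = 1.0
    for t in tokens:
        p *= token_freqs[t] / total
    return p


scored = {tuple(seg): score(seg) for seg in segmentations("pug", token_freqs)}
for seg, p in sorted(scored.items(), key=lambda kv: -kv[1]):
    print(list(seg), f"{p:.7f}")
```

The number of segmentations grows exponentially with word length, which is why this approach is only practical for short words.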

---

### 💡 Why Shorter Tokenizations Are Preferred

- Each additional token multiplies the score by another probability, which is always less than 1.
- So **fewer tokens usually means a higher total probability**, unless the longer tokens are themselves very rare.
- This matches our intuition: **split into as few tokens as possible**.

---

### 🤖 Efficient Tokenization with Viterbi Algorithm

- Trying all segmentations becomes slow for long words.
- So we use the **Viterbi algorithm** — a dynamic programming approach.

#### 🧩 How Viterbi Works:

1. For each character position in the word, store the **best segmentation** ending there.
2. For each possible subword ending at a position, look back at the best segmentation that ends at the start of that subword.
3. Multiply their probabilities to find the best path.
4. After reaching the last character, **backtrack** to recover the best segmentation.

---

### 📘 Example: Tokenizing `"unhug"` using Viterbi

Given subword probabilities (frequency / 210):
- `"u"`: 0.1714
- `"n"`: 0.0762
- `"un"`: 0.0762
- `"h"`: 0.0714
- `"hu"`: 0.0714
- `"hug"`: 0.0714
- `"g"`: 0.0952
- `"ug"`: 0.0952

**Step-by-step Segmentation & Scores:**

| Position | Best Subword Ending | Score | Full Segmentation |
|----------|---------------------|-------|-------------------|
| 0 (u) | `"u"` | 0.1714 | `["u"]` |
| 1 (n) | `"un"` | 0.0762 | `["un"]` |
| 2 (h) | `"h"` | 0.0762 × 0.0714 ≈ 0.0054 | `["un", "h"]` |
| 3 (u) | `"hu"` | 0.0762 × 0.0714 ≈ 0.0054 | `["un", "hu"]` |
| 4 (g) | `"hug"` | 0.0762 × 0.0714 ≈ 0.0054 | ✅ `["un", "hug"]` |

🟢 Final Tokenization: `["un", "hug"]` (highest score path)
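
The Viterbi steps and the table above can be sketched as a short dynamic program. This is an illustrative implementation, not the code of any particular library: `viterbi_segment` is a made-up name, the token counts are assumed to be the strict-substring counts that sum to 210, and log-probabilities are added instead of multiplying raw probabilities (equivalent, but safer against underflow on long inputs).

```python
import math

token_freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
    "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(token_freqs.values())
log_probs = {tok: math.log(f / total) for tok, f in token_freqs.items()}


def viterbi_segment(word, log_probs):
    """Return (tokens, probability) of the highest-scoring segmentation."""
    n = len(word)
    # best_score[i]: best log-probability of word[:i]
    # back[i]: start index of the last token in that best segmentation
    best_score = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                candidate = best_score[start] + log_probs[piece]
                if candidate > best_score[end]:
                    best_score[end] = candidate
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    # Backtrack from the end of the word to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(word[start:end])
        end = start
    tokens.reverse()
    return tokens, math.exp(best_score[n])


tokens, prob = viterbi_segment("unhug", log_probs)
print(tokens, round(prob, 4))  # ['un', 'hug'] 0.0054
```

The two nested loops make the cost quadratic in the word length, compared with the exponential cost of enumerating every segmentation.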

---

## ✅ Summary: Why Use Unigram?

| Feature                            | Unigram Tokenizer                          |
|-----------------------------------|--------------------------------------------|
| Training Direction                | Start with large vocab → prune down        |
| Probabilistic Tokenization        | ✅ Yes                                      |
| Token Scoring                     | Based on token frequencies in corpus       |
| Preferred Segmentation            | Highest probability (usually the fewest tokens) |
| Algorithm for Efficiency          | Viterbi algorithm                          |
| Unknown Token Handling            | Base characters are always kept (fallback) |

---

## HuggingFace Link for Building Tokenizer Block by Block: [Colab notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section8.ipynb)
---