
## **What is a Token?**

* A **token** is the smallest unit of text that a model processes.
* Depending on the tokenization method, a token can be:

  * A **character** → `"h"`, `"e"`, `"l"`, `"l"`, `"o"`
  * A **word** → `"hello"`, `"world"`
  * A **subword** → `"hel"`, `"lo"`, `"world"`
  * A **symbol/punctuation** → `"."`, `","`, `"?"`
  * Sometimes even **bytes**

👉 In LLMs, we usually don’t work with raw words, but with **subword tokens** because:

* Vocabulary stays smaller (manageable embedding matrix).
* Can handle rare words by breaking them down.
* Helps with morphologically rich languages (e.g., German, Hindi).
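
The granularities listed above can be compared on a single string. This is a minimal sketch using only built-in Python; the variable names are illustrative and not part of any tokenizer library.

```python
text = "hello world"

char_tokens = list(text)                  # one token per character, spaces included
word_tokens = text.split()                # split on whitespace
byte_tokens = list(text.encode("utf-8"))  # one integer (0-255) per UTF-8 byte

print(len(char_tokens), char_tokens[:5])
print(len(word_tokens), word_tokens)
print(len(byte_tokens), byte_tokens[:5])
```

Word-level splitting gives the shortest sequence but needs the largest vocabulary; character and byte splitting give the opposite trade-off. Subword methods sit between the two.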

---

## **Tokenization Approaches & Algorithms**

### **1. Word-level Tokenization**

* Splits text into words using spaces/punctuation.
* Example: `"I love AI!"` → `["I", "love", "AI", "!"]`
* Issues:

  * Huge vocabulary size (millions of words).
  * Out-of-vocabulary (OOV) problem → model can’t handle unseen words.
* Used in early NLP models (pre-transformers).

---

### **2. Character-level Tokenization**

* Splits text into characters.
* Example: `"I love AI"` → `["I", " ", "l", "o", "v", "e", " ", "A", "I"]`
* Pros:

  * Small vocabulary (\~100 chars).
  * No OOV issue.
* Cons:

  * Sequences are very long → harder to model dependencies.
* Used in early RNN/CNN text models, and some modern LLM experiments (Charformer).

---

### **3. Subword Tokenization (Most Common in LLMs)**

Balances between word-level and character-level.

#### **a) Byte Pair Encoding (BPE)**

* Start with characters.
* Iteratively merge most frequent character pairs into bigger units.
* Example:

  * `"l o w e r"` → merge `"l"` + `"o"` → `"lo w e r"` → merge `"e"` + `"r"` → `"lo w er"`
  * With the same two merges, `"lowest"` → `"lo w e s t"`
* Used in GPT-2, GPT-3, LLaMA.

#### **b) WordPiece (used in BERT, DistilBERT)**

* Similar to BPE, but uses a **likelihood-based objective** instead of greedy frequency merges.
* Example: `"unhappiness"` → `["un", "##happiness"]`

#### **c) Unigram Language Model (used in SentencePiece, XLNet, T5)**

* Start with a large vocabulary, then **prune** tokens that don’t improve likelihood.
* Probabilistic: chooses the best segmentation for a word.
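
"Chooses the best segmentation" can be made concrete with a small Viterbi search. The sketch below assumes a toy table of token probabilities (`probs`, invented for illustration) and returns the split whose product of probabilities is highest. It shows the idea only and is not SentencePiece's implementation.

```python
import math

def best_segmentation(word, probs):
    """Viterbi search for the highest-probability split of `word` into known tokens."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    # Walk the back-pointers from the end of the word to recover the tokens
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Hypothetical unigram probabilities, chosen only for this example
probs = {"un": 0.1, "happi": 0.05, "ness": 0.08, "happiness": 0.002,
         "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01,
         "p": 0.01, "i": 0.01, "e": 0.01, "s": 0.01}

print(best_segmentation("unhappiness", probs))
```

With these numbers, `un + happi + ness` (0.1 × 0.05 × 0.08) beats `un + happiness` (0.1 × 0.002), so the three-piece split is selected.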

---

### **4. Byte-Level Tokenization**

* Treats raw text as UTF-8 bytes (0–255).
* Example: `"hello"` → `[104, 101, 108, 108, 111]`
* Pros:

  * Universal for all languages (no OOV issue).
  * Supports emojis, rare symbols.
* Used in GPT-2 (Byte-Level BPE).

---

### **5. SentencePiece**

* Framework from Google (used in T5, XLNet).
* Doesn’t rely on whitespace.
* Can handle languages without spaces (like Chinese, Japanese).
* Supports both **BPE** and **Unigram LM**.

---

### **6. Modern Tokenizer Approaches**

* **SentencePiece Unigram LM** → T5, XLNet.
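* **Byte-Level BPE** → GPT-2, GPT-3, LLaMA 3 (LLaMA 1/2 use SentencePiece BPE with byte fallback).
* **tiktoken** (OpenAI’s tokenizer library for GPT-3.5/4) → A fast byte-level BPE implementation.
* **Hugging Face `tokenizers`** (e.g., its `BPE` model and `ByteLevelBPETokenizer`) → Commonly used for training custom tokenizers.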

---

## **Comparison of Tokenization Approaches**

| Method          | Pros                                   | Cons                                  | Used In             |
| --------------- | -------------------------------------- | ------------------------------------- | ------------------- |
| Word-level      | Simple, intuitive                      | OOV problem, huge vocab               | Early NLP           |
| Character-level | Small vocab, no OOV                    | Long sequences, harder training       | Char-CNN, CharRNN   |
| BPE             | Handles rare words, compact vocab      | Greedy merges, deterministic          | GPT-2, GPT-3, LLaMA |
| WordPiece       | Probabilistic, robust                  | More complex training                 | BERT, DistilBERT    |
| Unigram LM      | Probabilistic, flexible                | Requires careful pruning              | T5, XLNet           |
| Byte-level      | Universal (supports emojis, all langs) | Sequences longer than subword methods | GPT-2, GPT-3        |
| SentencePiece   | Language-agnostic, works on raw text   | Slightly slower                       | T5, XLNet           |

---

⚡ In **modern LLMs (GPT, LLaMA, Mistral, Falcon, etc.)**, the standard choice is **Byte-Level BPE or SentencePiece (BPE or Unigram LM)**.



## **What is Byte Pair Encoding (BPE)?**

Byte Pair Encoding (BPE) is a **subword tokenization algorithm**.

* Goal: represent text using a manageable vocabulary while handling rare/unseen words.
* Key idea: Start from characters, then repeatedly merge the **most frequent pairs** of symbols to form bigger units (subwords).

So instead of storing `"antidisestablishmentarianism"` as a single word (huge vocab), BPE breaks it into **subwords** like `"anti"`, `"dis"`, `"establish"`, `"ment"`, `"arian"`, `"ism"`.

---

## **How BPE Works – Step by Step**

Let’s tokenize `"lower lowest"` as an example.

### **Step 1: Initialization**

* Start with characters (plus a special end-of-word marker, like `</w>`).
* Corpus, written as symbol sequences:
  `"l o w e r </w>"`
  `"l o w e s t </w>"`

So the sequences look like:

* `"lower"` → `["l", "o", "w", "e", "r", "</w>"]`
* `"lowest"` → `["l", "o", "w", "e", "s", "t", "</w>"]`

---

### **Step 2: Count symbol pairs**

* Count all adjacent pairs:

  * `"l o"`, `"o w"`, `"w e"`, `"e r"`, `"e s"`, `"s t"`, `"t </w>"`, etc.
* Find the **most frequent pair**.

---

### **Step 3: Merge most frequent pair**

* Suppose `"l o"` occurs most often → merge into `"lo"`.
* Corpus after the merge:
  `"lo w e r </w>"`,
  `"lo w e s t </w>"`

---

### **Step 4: Repeat merges**

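* Next frequent: `"w e"` → `"we"`
  `"lo we r </w>"`,
  `"lo we s t </w>"`

* Next merge: `"we r"` → `"wer"`
  `"lo wer </w>"`,
  `"lo we s t </w>"`

* Next merge: `"s t"` → `"st"`
  `"lo wer </w>"`,
  `"lo we st </w>"`

* Next merge: `"we st"` → `"west"`
  `"lo wer </w>"`,
  `"lo west </w>"`

On a corpus this small many pairs tie, so the merge order above is illustrative: a strict frequency count would merge `"lo we"` (2 occurrences) before `"we r"` (1 occurrence).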

---

### **Step 5: Final tokens**

Now we have a vocabulary:
`["lo", "wer", "west", "</w>"]`

So:

* `"lower"` → `[lo, wer]`
* `"lowest"` → `[lo, west]`
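
The training loop behind these steps can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names are made up for this example, and ties between equally frequent pairs are broken by first occurrence, so the learned merge order can differ from the hand-worked steps above.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(text, num_merges):
    # Each word starts as a tuple of characters plus the end-of-word marker
    words = dict(Counter(tuple(w) + ("</w>",) for w in text.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first pair with the highest count
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

merges, words = train_bpe("lower lowest", num_merges=4)
print(merges)
for symbols in words:
    print(" ".join(symbols))
```

The first learned merge is `("l", "o")`, matching Step 3. Later merges depend on how ties are resolved, which is why real tokenizers are trained on large corpora where frequency differences are meaningful.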

---

## **Properties of BPE**

✅ Handles **rare words**: `"unhappiness"` → `"un" + "happiness"`
✅ Keeps **frequent words** whole: `"the"` stays `"the"`
✅ Keeps **vocab size manageable** (roughly 32k for LLaMA 1/2 and 50k for GPT-2; newer models use 100k or more).
✅ Easy to train and implement.

❌ Greedy merges: doesn’t always produce the *linguistically best* segmentation.
❌ Doesn’t adapt dynamically → fixed vocab once trained.
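
Once the merge table is fixed, encoding a word means replaying the merges in the order they were learned. The sketch below hard-codes the merges from the `"lower lowest"` walkthrough; `bpe_encode` is a hypothetical helper written for this example, and `"slower"` stands in for a word the tokenizer never saw during training.

```python
def bpe_encode(word, merges):
    """Split `word` into characters, then apply each learned merge in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merge table taken from the walkthrough (order matters)
merges = [("l", "o"), ("w", "e"), ("we", "r"), ("s", "t"), ("we", "st")]

for word in ["lower", "lowest", "slower"]:
    print(word, "->", bpe_encode(word, merges))
```

The unseen word `"slower"` still encodes without error: it reuses `lo` and `wer` and keeps the leading `s` as a single-character token. This is how BPE avoids out-of-vocabulary failures even though the vocabulary is fixed.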

---

## **Why BPE for GPT-2/3/LLaMA?**

* GPT models use **Byte-Level BPE**:

  * Works on raw **UTF-8 bytes** (so it can handle emojis, non-English scripts).
  * Then applies BPE merges on top of bytes.
  * Example: `"😊"` is just a byte sequence, so no OOV issue.

This makes GPT universal across languages + symbols.

---

## **Mini Example of BPE Merge Table**

If we train BPE on `"banana bandana"` (no end-of-word marker, ties broken by first occurrence):

| Iteration | Most Frequent Pair | New Token  |
| --------- | ------------------ | ---------- |
| 1         | `"a n"`            | `"an"`     |
| 2         | `"b an"`           | `"ban"`    |
| 3         | `"an a"`           | `"ana"`    |
| 4         | `"ban ana"`        | `"banana"` |

So `"banana"` becomes a single token after four merges, while `"bandana"` is still `"ban d ana"` at that point.




## **What is WordPiece?**

* **WordPiece** is a **subword tokenization algorithm** (like BPE).
* Developed by Google (for Neural Machine Translation, then BERT).
* Goal: Balance between **small vocabulary** and **coverage of rare words**.

### Difference from BPE:

* **BPE** merges pairs of symbols based on **highest frequency**.
* **WordPiece** merges based on **likelihood of the training corpus** (using a probabilistic language model).

So instead of being greedy with frequency, WordPiece asks:
👉 *“Does merging this pair make my model better at predicting words?”*

---

## **How WordPiece Works – Step by Step**

Let’s say we have the corpus:

```
"unwanted", "unwatched", "undo", "redo"
```

### **Step 1: Start with characters**

* `"u", "##n", "##w", "##a", "##n", "##t", "##e", "##d"`, etc. (non-initial characters carry the `##` continuation prefix).
* Plus special tokens like `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`.

---

### **Step 2: Iteratively add new subwords**

* WordPiece tries to maximize the **likelihood** of training data under a simple **unigram language model**.
* At each step, it evaluates candidate merges and picks the one that most increases the log-likelihood.
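
A common way to describe this criterion is the score `count(pair) / (count(first) * count(second))`: a pair scores high when its parts rarely occur apart. That formula is an assumption brought in for illustration and is not stated in the text above. The sketch compares it with plain frequency on the four-word corpus, and the helper names are made up.

```python
from collections import Counter

def split_word(word):
    """First character as-is, the rest with the '##' continuation prefix."""
    return tuple([word[0]] + ["##" + ch for ch in word[1:]])

def wordpiece_scores(words):
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return scores, pair_counts

corpus = ["unwanted", "unwatched", "undo", "redo"]
words = Counter(split_word(w) for w in corpus)
scores, pair_counts = wordpiece_scores(words)

most_frequent = max(pair_counts, key=pair_counts.get)  # what BPE would pick
best_scored = max(scores, key=scores.get)              # what this score picks

print("most frequent pair:", most_frequent, pair_counts[most_frequent])
print("best-scoring pair: ", best_scored, scores[best_scored])
```

On this corpus the most frequent pair is `u` + `##n`, but the best-scoring pair is `##c` + `##h`, because those two symbols only ever occur together. The score rewards pairs whose parts are strongly associated, not just pairs that are common.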

Example merges:

* `"un"` (common across `unwanted`, `unwatched`, `undo`)
* `"re"`, `"##want"`, `"##watch"`, `"##ed"`, `"##do"`

---

### **Step 3: Subword segmentation**

* Vocabulary after merges might look like:
  `["u", "r", "##n", "##e", "##d", "##o", "un", "re", "##want", "##watch", "##ed", "##do"]`

* WordPiece uses `"##"` to indicate that a token is a **continuation** of a previous token (not word start).

Example:

* `"unwanted"` → `["un", "##want", "##ed"]`
* `"redo"` → `["re", "##do"]`
* `"unknown"` → `[UNK]`, because `"k"` never appears in this corpus; in BERT's WordPiece, a word that cannot be fully segmented is replaced by `[UNK]` as a whole

---

## **WordPiece vs BPE**

| Feature        | BPE (GPT, LLaMA)             | WordPiece (BERT)                   |
| -------------- | ---------------------------- | ---------------------------------- |
| Merge strategy | Greedy frequency merges      | Maximizes likelihood of corpus     |
| Token notation | Plain subwords (`lo`, `wer`) | Uses `##` prefix for continuations |
| OOV handling   | Breaks into smaller subwords | `[UNK]` token for unknown pieces   |
| Typical usage  | Autoregressive LMs (GPT)     | Masked LMs (BERT)                  |

---

## **Why WordPiece in BERT?**

* BERT is **bidirectional** and used for **masked language modeling (MLM)**.
* Needs good handling of rare words → `##ed`, `##ing`, `##tion`, etc.
* `[UNK]` token acts as a fallback (though in practice, rare since subwords cover most words).

---

## **Example**

Sentence:

```
"playing football"
```

WordPiece tokenization:

```
["play", "##ing", "football"]
```

If `"football"` wasn’t in vocab, it might become:

```
["foot", "##ball"]
```
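
At inference time, WordPiece tokenizers such as BERT's segment each word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below reproduces the example with a four-entry toy vocabulary; `wordpiece_encode` is an illustrative helper, not the Hugging Face API.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation with '##' continuation pieces."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "foot", "##ball"}  # toy vocabulary without "football"

for word in ["playing", "football", "xyz"]:
    print(word, "->", wordpiece_encode(word, vocab))
```

Adding `"football"` to `vocab` would make the longest match succeed immediately, so the word would come back as a single token.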

---

⚡ So, **WordPiece is like BPE with a likelihood-based merge criterion**: it picks the merge that most increases the likelihood of the corpus rather than simply the most frequent pair.



---

## **What is Byte-Level Tokenization?**

* Instead of working with words, characters, or subwords directly,
* Text is broken down into **raw bytes (0–255)** → every possible character, symbol, or emoji is covered.

👉 This means the tokenizer can handle **any language, special symbols, punctuation, and even emojis** without OOV (Out-of-Vocabulary) issues.

---

## **How Byte-Level Tokenization Works**

### **Step 1: Encode text as UTF-8 bytes**

Example:
Text → `"Hello 😊"`
UTF-8 bytes:

```
H = 72
e = 101
l = 108
l = 108
o = 111
(space) = 32
😊 = 240, 159, 152, 138
```

So the text becomes a sequence of numbers:

```
[72, 101, 108, 108, 111, 32, 240, 159, 152, 138]
```
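
The byte values above can be reproduced with Python's built-in `str.encode`, and the sequence decodes back to the original text without loss:

```python
text = "Hello 😊"

byte_values = list(text.encode("utf-8"))
print(byte_values)
print(len(text), "characters ->", len(byte_values), "bytes")

# Round trip: bytes back to the original string
restored = bytes(byte_values).decode("utf-8")
print(restored == text)
```

The emoji is one character but four bytes, which is why byte-level sequences grow quickly for non-ASCII text until BPE merges compress them again.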

---

### **Step 2: Apply Subword Algorithm on Bytes**

* Once text is turned into bytes, we can apply a **subword algorithm** (like BPE).
* Instead of merging character pairs, BPE now merges **byte pairs**.
* Frequent byte sequences form new tokens.

Example:

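* `"lo"` (`[108, 111]`) becomes a token.
* `"😊"` (`[240, 159, 152, 138]`) can become a single token if it is frequent enough in the training data; otherwise it stays split across several byte-level tokens.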

---

### **Step 3: Vocabulary**

* Start: 256 base tokens (0–255, all byte values).
* Train: Merge frequent byte sequences until reaching target vocab size (e.g., 50,000).

Result:

* `"the"` might be a single token.
* `"ing"` might be a single token.
* `"😊"` might be a single token.
* Rare words are broken into smaller byte-based pieces.

---

## **Why Byte-Level Tokenization?**

### ✅ Advantages

1. **Universal**: Works for any language, script, or emoji.
2. **No OOV problem**: Any text can always be broken down into bytes.
3. **Efficient**: Frequent words still become single tokens (through BPE merges).
4. **Space Handling**: Even whitespace, tabs, newlines are tokens.

### ❌ Disadvantages

1. Longer sequences for rare text: anything not covered by learned merges (rare scripts, unusual emojis) falls back to several byte-level tokens.
2. Slightly more complex than word/subword methods.

---

## **Example**

Let’s tokenize `"unicorns 🦄"`:

1. Convert to bytes:
   `"u"=117, "n"=110, "i"=105, "c"=99, "o"=111, "r"=114, "n"=110, "s"=115, " " = 32, "🦄" = [240, 159, 166, 132]`

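2. Apply BPE merges (illustrative; the real split depends on the learned merge table):

   * `"uni"`, `"corns"`, `" 🦄"`

3. Final tokens (the space byte is kept, here attached to the emoji):

   ```
   ["uni", "corns", " 🦄"]
   ```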

---

## **Byte-Level in GPT-2 and GPT-3**

* GPT-2 introduced **Byte-Level BPE**.
* Process:

  1. Encode text into bytes.
  2. Apply BPE merges.
  3. Build \~50k vocab tokens.

That’s why GPT-2/3 can handle:

* Multilingual text (`日本語`, `हिंदी`)
* Emojis (`😂👍`)
* Special symbols (`∑, π, √`)

All without retraining or special handling.

---

## **Byte-Level vs WordPiece vs BPE**

| Method         | Handles Unknown Words    | Emoji Support | Languages                      | Used In             |
| -------------- | ------------------------ | ------------- | ------------------------------ | ------------------- |
| WordPiece      | `[UNK]` for OOV          | ❌             | Limited                        | BERT, DistilBERT    |
| Subword BPE    | Break into smaller parts | Limited       | Good for languages with spaces | GPT-1               |
| Byte-Level BPE | Always possible (bytes)  | ✅             | Universal                      | GPT-2, GPT-3, LLaMA 3 |

---

⚡ So, Byte-Level Tokenization is the reason GPT-family models don’t choke on weird inputs like `"#@&😊💯🔥日本語"`.



In [4]:
from transformers import GPT2Tokenizer

# Load pretrained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Example text
text = "Hello world! 👋🌍 I love AI 🤖"

# Encode: text -> token IDs
token_ids = tokenizer.encode(text)
print("Token IDs:", token_ids)

# Decode: token IDs -> text
decoded = tokenizer.decode(token_ids)
print("Decoded text:", decoded)

# Show tokens with mapping
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print("\nTokens:")
for t in tokens:
    print(t)




Token IDs: [15496, 995, 0, 50169, 233, 8582, 234, 235, 314, 1842, 9552, 12520, 97, 244]
Decoded text: Hello world! 👋🌍 I love AI 🤖

Tokens:
Hello
Ġworld
!
ĠðŁĳ
ĭ
ðŁ
Į
į
ĠI
Ġlove
ĠAI
ĠðŁ
¤
ĸ
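
The odd-looking tokens (`Ġworld`, `ðŁĳ`, ...) are not corrupted output. GPT-2 displays each byte as a printable stand-in character, so a leading space shows up as `Ġ` and the four bytes of an emoji show up as characters like `ð`, `Ł`, `ĳ`, `ĭ`. The sketch below rebuilds that byte-to-character table as it is understood from the GPT-2 reference code; treat it as an illustration rather than the library's exact source.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable character for display."""
    # Bytes that are already printable, non-space characters keep their own code point
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # All remaining bytes (controls, space, ...) are shifted to code points 256 and up
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

byte_to_char = bytes_to_unicode()

def show(text):
    """Render `text` the way byte-level token strings are displayed."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

print(show(" world"))  # space followed by ASCII letters
print(show(" 👋"))     # space followed by the 4 bytes of the emoji
```

Under this mapping, `" world"` renders as `Ġworld`, and `" 👋"` renders as the concatenation of the two tokens `ĠðŁĳ` and `ĭ` from the output above. The emoji was split into more than one token because its full byte sequence was not merged during training.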


# Let's build our own tokenizer

In [5]:
with open("the_verdict.txt",'r',encoding='utf-8') as f:
    raw_text = f.read()

In [6]:
raw_text

'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)\n\n"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it\'s going to send the value of my picture \'way up; but I don\'t think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing\'s lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn\'s "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its like again"?\n\nWell!--even 

In [7]:
print("Total number of characters:", len(raw_text))

Total number of characters: 20479


<div class="alert alert-block alert-success">

Our goal is to tokenize this 20,479-character short story into individual words and special
characters that we can then turn into embeddings for LLM training  </div>

<div class="alert alert-block alert-success">

How can we best split this text to obtain a list of tokens? For this, we go on a small
excursion and use Python's regular expression library re for illustration purposes. (Note
that you don't have to learn or memorize any regular expression syntax since we will
transition to a pre-built tokenizer later in this chapter.) </div>

In [8]:
import re

text = "Hello, I'm Tesla. Speking from past telling you about future!"
res = re.split(r'(\s)',text)

print(res)

['Hello,', ' ', "I'm", ' ', 'Tesla.', ' ', 'Speking', ' ', 'from', ' ', 'past', ' ', 'telling', ' ', 'you', ' ', 'about', ' ', 'future!']


In [9]:
for r in res:
    print(r)

Hello,
 
I'm
 
Tesla.
 
Speking
 
from
 
past
 
telling
 
you
 
about
 
future!


<div class="alert alert-block alert-warning">

Let's modify the regular expression so that it splits on whitespace (\s), commas, and periods
([,.]):</div>

In [10]:
res = re.split(r'([,.]|\s)',text)

print(res)

['Hello', ',', '', ' ', "I'm", ' ', 'Tesla', '.', '', ' ', 'Speking', ' ', 'from', ' ', 'past', ' ', 'telling', ' ', 'you', ' ', 'about', ' ', 'future!']


<div class="alert alert-block alert-warning">

A small remaining issue is that the list still includes whitespace characters. Optionally, we
can remove these redundant characters safely as follows:</div>

In [11]:
res = [item for item in res if item.strip()]
print(res)

['Hello', ',', "I'm", 'Tesla', '.', 'Speking', 'from', 'past', 'telling', 'you', 'about', 'future!']


<div class="alert alert-block alert-success">

REMOVING WHITESPACES OR NOT


When developing a simple tokenizer, whether we should encode whitespaces as
separate characters or just remove them depends on our application and its
requirements. Removing whitespaces reduces the memory and computing
requirements. However, keeping whitespaces can be useful if we train models that
are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
that includes whitespaces.

</div>
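
To make the trade-off concrete, here is a small sketch on an indented line of Python code (the sample string is made up for this example). Keeping whitespace tokens preserves the indentation, while dropping them loses it:

```python
import re

code_line = "    return x"

pieces = re.split(r'(\s)', code_line)

# Keep whitespace tokens: only drop the empty strings produced by re.split
with_whitespace = [p for p in pieces if p]

# Remove whitespace tokens entirely
without_whitespace = [p for p in pieces if p.strip()]

print(with_whitespace)
print(without_whitespace)
```

The first list still contains the four leading spaces as separate tokens, so the original line can be reconstructed exactly; the second list is shorter but no longer records how the line was indented.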

In [12]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [13]:
# Filter out empty or whitespace-only items (a no-op here, since the previous cell already did this).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [14]:
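# Note: unlike the pattern above, this one does not split on double quotes or parentheses,
# so tokens such as '"Ah' keep their leading quote in the vocabulary built below.
preprocessed = re.split(r'([,.:;?_!\']|--|\s)',raw_text)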
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:40])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his']


In [15]:
print(len(preprocessed))

4601


## Step 2: Creating token IDs

In [16]:
all_word = sorted(set(preprocessed))
vocab_size = len(all_word)

print(vocab_size)

1159


In [17]:
all_word[0]

'!'

<div class="alert alert-block alert-success">

After determining that the vocabulary size is 1,159 via the above code, we create the
vocabulary and print its first 51 entries for illustration purposes:

</div>

In [18]:
vocab = {token:integer for integer,token in enumerate(all_word)}

In [19]:
for i, item in enumerate(vocab.items()):
    # Each item is a (token, token_id) pair
    print(item)

    if i >= 50:
        break

('!', 0)
('"', 1)
('"Ah', 2)
('"Be', 3)
('"Begin', 4)
('"By', 5)
('"Come', 6)
('"Destroyed', 7)
('"Don', 8)
('"Gisburns"', 9)
('"Grindles', 10)
('"Hang', 11)
('"Has', 12)
('"How', 13)
('"I', 14)
('"If', 15)
('"It', 16)
('"Jack', 17)
('"Money', 18)
('"Moon-dancers"', 19)
('"Mr', 20)
('"Mrs', 21)
('"My', 22)
('"Never', 23)
('"Of', 24)
('"Oh', 25)
('"Once', 26)
('"Only', 27)
('"Or', 28)
('"That', 29)
('"The', 30)
('"Then', 31)
('"There', 32)
('"This', 33)
('"We', 34)
('"Well', 35)
('"What', 36)
('"When', 37)
('"Why', 38)
('"Yes', 39)
('"You', 40)
('"but', 41)
('"deadening', 42)
('"dragged', 43)
('"effects"', 44)
('"interesting"', 45)
('"lift', 46)
('"obituary"', 47)
('"strongest', 48)
('"strongly"', 49)
('"sweetly"', 50)


<div class="alert alert-block alert-success">

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits
text into tokens and carries out the string-to-integer mapping to produce token IDs via the
vocabulary. 

In addition, we implement a decode method that carries out the reverse
integer-to-string mapping to convert the token IDs back into text.

</div>

<div class="alert alert-block alert-info">
    
Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods
    
Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Replace spaces before the specified punctuation

</div>



In [20]:
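import re


class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Step 1: keep the token -> ID vocabulary
        self.str_to_int = vocab
        # Step 2: inverse mapping, ID -> token
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Step 3: split on punctuation, "--", and whitespace, keeping the delimiters
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Raises KeyError for tokens that are not in the vocabulary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Step 4: map IDs back to tokens and join them with spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Step 5: remove the space before the listed punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text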

In [21]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 97, 51, 881, 1016, 634, 565, 777, 55, 1155, 628, 55, 1, 106, 57, 84, 882, 1137, 785, 824, 57]


In [22]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

<div class="alert alert-block alert-success">

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing
text based on a snippet from the training set. 

Let's now apply it to a new text sample that
is not contained in the training set:
</div>

In [23]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

<div class="alert alert-block alert-info">
    
The problem is that the word "Hello" was not used in the short story "The Verdict".

Hence, it
is not contained in the vocabulary. 

This highlights the need to consider large and diverse
training sets to extend the vocabulary when working on LLMs.

</div>
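
Besides enlarging the training set, a common stopgap is to reserve a special token for anything the vocabulary does not contain. The sketch below is one possible variant of the tokenizer class. The class name, the `<|unk|>` token string, and the toy vocabulary are assumptions made for this example, not part of the notebook above.

```python
import re

class SimpleTokenizerWithUnk:
    """Like SimpleTokenizerV1, but maps unseen tokens to a reserved ID."""

    PATTERN = r'([,.:;?_!"()\']|--|\s)'

    def __init__(self, vocab, unk_token="<|unk|>"):
        self.str_to_int = dict(vocab)
        if unk_token not in self.str_to_int:
            # Assumes existing IDs are 0..len(vocab)-1, so the next free ID is len(vocab)
            self.str_to_int[unk_token] = len(self.str_to_int)
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}
        self.unk_id = self.str_to_int[unk_token]

    def encode(self, text):
        tokens = [t.strip() for t in re.split(self.PATTERN, text) if t.strip()]
        # dict.get falls back to the unknown-token ID instead of raising KeyError
        return [self.str_to_int.get(t, self.unk_id) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Toy vocabulary that, like the story's vocabulary, does not contain "Hello"
toy_vocab = {tok: i for i, tok in enumerate(sorted({"do", "you", "like", "tea", ",", "?"}))}
unk_tokenizer = SimpleTokenizerWithUnk(toy_vocab)

ids = unk_tokenizer.encode("Hello, do you like tea?")
print(ids)
print(unk_tokenizer.decode(ids))
```

Encoding no longer fails, but the round trip is lossy: `"Hello"` comes back as `<|unk|>`. Subword and byte-level tokenizers avoid this loss because they can always fall back to smaller pieces.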