# **Level 1: The Origins – Intro to LLMs & Chatbots**

## **Section 2: Introduction to Language Models**

### **Part 3: Tokens & Tokenization**

---

Before we can understand how AI models like ChatGPT process language, we need to appreciate a simple but crucial fact: **computers don't understand human language the way we do**.

We see language as sentences, ideas, and meaning. Computers, on the other hand, deal with numbers and symbols. To bridge that gap, the first step in building modern AI systems that understand text is **breaking down language into smaller, manageable pieces**. These pieces are called **tokens**.

---

### **What are Tokens?**

In simple terms, a **token** is a unit of text that the model processes. Depending on the model and its design, a token can be:

* A full word (e.g., "cat")
* Part of a word (e.g., "inter" and "national" from "international")
* Punctuation (e.g., ".")
* Special symbols (e.g., `<|endoftext|>`)

Tokens are the building blocks of language for AI models.

---

### **Why Not Just Use Whole Words?**

Language is complex. Words can be long, short, combined, or made-up. If we treated only whole words as units, the model would struggle with:

* Rare words
* Misspellings
* New words never seen before

Instead, breaking text into smaller chunks (tokens) allows the model to handle language flexibly. Even if it has never seen the exact word "antidisestablishmentarianism," it can process its tokens and still understand parts of it.

---

### **How is Text Broken into Tokens?**

This process is called **tokenization**. A tokenizer breaks text into tokens using a fixed vocabulary and rules that were learned from a large text corpus before the model was trained.

Different models use different tokenization strategies:

* Some use **WordPiece** (common in BERT models)
* Others use **Byte Pair Encoding (BPE)** (common in GPT models)
* Some use **SentencePiece**, a toolkit that applies BPE or a Unigram model directly to raw text (common in multilingual models)

These methods aim to balance efficiency and flexibility.

---

**Illustration Example:**

Take the sentence:
*"I love international collaborations."*

A tokenization algorithm might break it down like this:

\[`I`, `love`, `inter`, `national`, `collaborations`, `.`]

Notice how:

* "international" becomes two tokens: "inter" and "national"
* Punctuation is kept as its own token

Alternatively, depending on the tokenizer, it might also look like:
\[`I`, `love`, `international`, `collaborations`, `.`]

The key takeaway: tokenization isn't always perfectly intuitive to humans, but it's optimized for the model to handle language efficiently.
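
To make the two splits above concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabularies `VOCAB_A` and `VOCAB_B` and the function `tokenize` are hypothetical, chosen only to reproduce the splits shown. Real tokenizers learn their vocabularies from data and treat whitespace differently.

```python
import re

def tokenize(text, vocab):
    """Split text into words and punctuation, then greedily match the longest vocab piece."""
    tokens = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        start = 0
        while start < len(chunk):
            # Try the longest candidate first and shrink until one is in the vocab.
            for end in range(len(chunk), start, -1):
                if chunk[start:end] in vocab:
                    tokens.append(chunk[start:end])
                    start = end
                    break
            else:
                # Nothing matched: fall back to a single character.
                tokens.append(chunk[start])
                start += 1
    return tokens

# Hypothetical vocabularies, for illustration only.
VOCAB_A = {"I", "love", "inter", "national", "collaborations", "."}
VOCAB_B = VOCAB_A | {"international"}

sentence = "I love international collaborations."
print(tokenize(sentence, VOCAB_A))  # "international" is split in two
print(tokenize(sentence, VOCAB_B))  # "international" stays whole
```

The same sentence yields different token lists purely because the vocabulary differs, which is why token counts vary from one model to another.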

---

### **Why Does Token Count Matter?**

Modern AI models convert each token into a numerical representation, read all the tokens of the input together, and then generate their output one token at a time. However, they have a **maximum token limit**, known as the **context window**, which typically covers the prompt and the generated output combined.

This limit defines how much text the model can handle at once. For example:

* The original GPT-3.5 Turbo has a limit of **4,096 tokens**
* GPT-4 launched with **8,192 tokens**, and GPT-4 Turbo can handle up to **128,000 tokens**

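If your text exceeds this limit, one of the following typically happens:

* The request is rejected with an error, so nothing is processed
* The application truncates the beginning or end of the text to make it fit
* The model loses the context that was cut off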

This is why understanding tokens is important, especially when building applications or chatbots that work with long text.
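
As a rough sketch of what an application might do before sending text to a model, the snippet below trims a list of token IDs to a budget. The function `fit_to_context` and its `keep` parameter are hypothetical names, and the integers stand in for real token IDs.

```python
def fit_to_context(token_ids, max_tokens, keep="end"):
    """Return at most max_tokens IDs, dropping tokens from the start or the end."""
    if max_tokens <= 0:
        return []
    if len(token_ids) <= max_tokens:
        return list(token_ids)
    if keep == "end":
        return list(token_ids[-max_tokens:])  # keep the most recent tokens
    return list(token_ids[:max_tokens])       # keep the earliest tokens

conversation = list(range(10))  # stand-in for 10 token IDs
print(fit_to_context(conversation, 4))                # [6, 7, 8, 9]
print(fit_to_context(conversation, 4, keep="start"))  # [0, 1, 2, 3]
```

Keeping the end suits chat history, where recent turns matter most. Keeping the start suits documents whose key information comes first.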

---

**Real-World Implication:**

Imagine you're building a chatbot to summarize legal contracts. If the contract is too long and exceeds the token limit, the chatbot won't see the entire document, leading to incomplete or inaccurate responses.
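
One common workaround is to split the document into chunks that each fit the limit and process them separately. The sketch below assumes the contract has already been tokenized. `chunk_tokens` and its parameters are hypothetical, and the small overlap is one possible design choice so that a clause cut at a boundary appears whole in at least one chunk.

```python
def chunk_tokens(token_ids, chunk_size, overlap=0):
    """Split token IDs into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end
    return chunks

contract = list(range(10))  # stand-in for a tokenized contract
print(chunk_tokens(contract, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```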

---

### **Quick Clarifications:**

* **Tokens ≠ Characters.** A single token might contain multiple characters, or a single character might be its own token.
* **Token count ≠ Word count.** A sentence with five words may have 5, 7, or more tokens depending on the tokenizer.
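
A quick way to see the difference is to count all three for the same string. The regular expression below is a hypothetical stand-in for a tokenizer (runs of letters, runs of digits, or single punctuation marks), so a real tokenizer will give different counts.

```python
import re

text = "Tokenization isn't magic."

characters = len(text)
words = len(text.split())
# Hypothetical tokenizer: letter runs, digit runs, or single punctuation marks.
tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

print(characters, words, len(tokens))  # 25 3 6
print(tokens)  # ['Tokenization', 'isn', "'", 't', 'magic', '.']
```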

---

### **Summary:**

* Tokens are the basic units of text that AI models process.
* Tokenization breaks text into these chunks.
* The token limit defines how much information a model can process at once.
* Understanding tokens helps you design better AI applications and prevents errors due to exceeding context limits.

---

In the next part, we'll build on this by exploring the **Transformer**, the engine that processes these tokens and enables models to understand language.

### **Part 3: Tokens & Tokenization (Technical Subpart)**

To fully grasp how AI models handle language, it's not enough to know that text gets broken into "tokens"; we must also understand **how** this happens under the hood. This process, known as **tokenization**, is fundamental to how language models process and generate text.

---

### **Formal Definition of Tokenization:**

**Tokenization** is the process of mapping raw text input into discrete, machine-understandable units called **tokens**, using a deterministic algorithm. These tokens serve as indices or inputs to the model's embedding layer.

Mathematically, we can express tokenization as:

$$
T(x) = [t_1, t_2, t_3, \ldots, t_n]
$$

Where:

* $x$ = raw text input (string)
* $T(x)$ = list of tokens produced
* $t_1, t_2, \ldots, t_n$ = individual tokens

The output tokens are not raw text: they are mapped to integer IDs using a vocabulary, ready for model consumption.
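
The mapping from tokens to integer IDs can be sketched with a plain dictionary. The vocabulary `VOCAB` and the functions `encode` and `decode` below are hypothetical. A real vocabulary holds tens of thousands of entries.

```python
# Hypothetical vocabulary: token string -> integer ID.
VOCAB = {"<unk>": 0, "I": 1, "love": 2, "inter": 3, "national": 4, ".": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(tokens):
    """Map each token to its ID, using <unk> for tokens outside the vocabulary."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]

def decode(ids):
    """Map each ID back to its token string."""
    return [ID_TO_TOKEN[token_id] for token_id in ids]

tokens = ["I", "love", "inter", "national", "."]
ids = encode(tokens)
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # ['I', 'love', 'inter', 'national', '.']
```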

---

### **Types of Tokenization Algorithms:**

Different tokenizers use different rules for breaking text into tokens. Here are the most common types:

---

#### **1. Word-Level Tokenization**

* Each word is treated as a token.
* Simple but inefficient for rare words.
* Example:
  *"The quick brown fox"* → \[`The`, `quick`, `brown`, `fox`]

**Limitation:**

* New words, typos, or rare vocabulary lead to unknown tokens (`<UNK>`), reducing model robustness.
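
The limitation is easy to reproduce. The sketch below builds a word-level vocabulary from one training sentence and then encodes a sentence containing a typo and an unseen word. `build_vocab` and `encode` are hypothetical helper names.

```python
def build_vocab(corpus):
    """Assign an ID to every distinct whitespace-separated word, reserving 0 for <UNK>."""
    vocab = {"<UNK>": 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Map each word to its ID, or to <UNK> when the word was never seen."""
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]

vocab = build_vocab(["The quick brown fox"])
print(encode("The quick brown fox", vocab))   # [1, 2, 3, 4]
print(encode("The quikc brown wolf", vocab))  # [1, 0, 3, 0]
```

Both the typo "quikc" and the unseen word "wolf" collapse to the same ID, so the model cannot tell them apart.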

---

#### **2. Subword Tokenization (Most Common Today)**

Subword tokenizers break text into **frequent subword units**, which balances vocabulary size and flexibility. This allows the model to handle rare or unseen words by decomposing them.

Popular subword tokenizers:

* **Byte-Pair Encoding (BPE)**
* **WordPiece**
* **Unigram Language Model**
* **SentencePiece** (a toolkit that implements BPE and Unigram on raw text, rather than a separate algorithm)

---

##### **Byte-Pair Encoding (BPE) – Conceptual Explanation:**

BPE starts with individual characters as tokens and iteratively merges the most frequent pair of adjacent tokens into a new token. This continues until the vocabulary reaches a fixed target size.

**Illustrative Example:**
Suppose we have the word "internationalization" seen repeatedly during training.

BPE might produce tokens like:
\[`inter`, `national`, `ization`]

This allows the model to handle variations like:

* "international" → \[`inter`, `national`]
* "nationalism" → \[`national`, `ism`]

**Mathematical Perspective:**

* Initialize token set: all characters
* For $N$ iterations:

  * Find the most frequent pair of consecutive tokens
  * Merge them into a new token
* Result: a compact, efficient token vocabulary
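
The loop above can be sketched in a few lines of Python. This is a toy version under simplifying assumptions: the corpus is a hypothetical `{word: count}` dictionary, merges never cross word boundaries, and ties are broken by whichever pair was counted first. Production tokenizers add pre-tokenization rules and are far more optimized.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} dict. Returns (merges, segmented words)."""
    # Start with every word split into single characters.
    words = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, replacing the best pair with one merged symbol.
        new_words = {}
        for symbols, count in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return merges, words

# Hypothetical word counts for illustration.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(corpus, num_merges=3)
print(merges)
print(segmented)
```

With these counts, the pairs `e`+`s` and then `es`+`t` are merged first, because "newest" and "widest" together make that ending the most frequent pattern.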

---

##### **WordPiece and Unigram Models:**

These use similar ideas but differ in the specifics of how merges or probabilities are selected. For example:

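* **WordPiece:** Popular in BERT; builds its vocabulary much like BPE, but picks each merge by how much it improves the likelihood of the training data rather than by raw pair frequency.
* **Unigram Model:** Available in SentencePiece; starts from a large candidate vocabulary, assigns each subword a probability, and prunes the units that contribute least to the likelihood of the training text.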
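
To give a feel for the probabilistic view, the sketch below shows only the segmentation step of a Unigram-style tokenizer: given subword probabilities, it finds the split with the highest total probability using a Viterbi-style search. The probabilities in `PROBS` are invented for illustration, and training (how those probabilities are estimated and the vocabulary pruned) is not shown.

```python
import math

def best_segmentation(word, log_probs):
    """Find the split of `word` into known subwords with the highest total log-probability."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best score for word[:i]
    back = [0] * (n + 1)                  # back[i]: start index of the last piece in word[:i]
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical subword probabilities, for illustration only.
PROBS = {"inter": 0.05, "national": 0.04, "nation": 0.03, "al": 0.02, "in": 0.06, "ter": 0.01}
LOG_PROBS = {piece: math.log(p) for piece, p in PROBS.items()}

print(best_segmentation("international", LOG_PROBS))  # ['inter', 'national']
```

Several splits are possible here (for example `inter` + `nation` + `al`), but the two-piece split wins because its product of probabilities is largest.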

---

#### **3. Byte-Level Tokenization (e.g., GPT-2, GPT-3, GPT-4 Tokenizers)**

For maximum robustness, modern LLMs like GPT-3.5 and GPT-4 use **byte-level** BPE: the merges are learned over raw UTF-8 bytes rather than characters, so any string can be encoded without an unknown token. This enables them to handle:

* Misspellings
* Emojis
* Non-English scripts
* Special characters

**Illustration:**
The string `"Hello 👋"` gets tokenized as a combination of regular text tokens and byte-level tokens for the emoji, which occupies four bytes in UTF-8.
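
Python's built-in UTF-8 encoding shows the raw bytes a byte-level tokenizer starts from. This snippet only inspects the bytes. It does not reproduce how any particular GPT tokenizer groups them into tokens.

```python
text = "Hello 👋"
raw = text.encode("utf-8")

print(len(text), len(raw))  # 7 characters, 10 bytes
print(list(raw))            # [72, 101, 108, 108, 111, 32, 240, 159, 145, 139]
print(raw.decode("utf-8") == text)  # True: the bytes round-trip losslessly
```

The six ASCII characters take one byte each, while the emoji takes four, and every one of those byte values is something a byte-level vocabulary can represent.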

---

### **Token IDs and Embeddings:**

After tokenization, each token is mapped to a unique **token ID**, which corresponds to a row in the model's **embedding matrix**.

Formally, given:

* A vocabulary $V$ of size $|V|$
* An embedding matrix $E \in \mathbb{R}^{|V| \times d}$ where $d$ is the embedding dimension

Each token $t_i$ is mapped to an ID $id_i$, and the model retrieves:

$$
\mathbf{e}_i = E[id_i]
$$

Where $\mathbf{e}_i$ is the vector representation of token $t_i$.

These embeddings are the numerical inputs to the Transformer model, which we'll explore in the next part.
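
The lookup $\mathbf{e}_i = E[id_i]$ is just row indexing. The sketch below uses nested Python lists and random numbers as a stand-in for learned weights. The sizes `VOCAB_SIZE` and `EMBED_DIM` are tiny hypothetical values, whereas real models use vocabularies of tens of thousands of tokens and hundreds or thousands of dimensions.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 6, 4  # |V| and d, kept tiny for illustration

# E has one row per token ID; random values stand in for learned weights.
E = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [E[token_id] for token_id in token_ids]

vectors = embed([1, 2, 3])
print(len(vectors), len(vectors[0]))  # 3 4: three tokens, each a 4-dimensional vector
```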

---

### **Summary of Key Technical Points:**

* ✔ Tokenization converts raw text into discrete, machine-readable units (tokens).
* ✔ Modern LLMs use subword or byte-level tokenization for flexibility and efficiency.
* ✔ The tokenization process produces token IDs that map directly to learned embeddings.
* ✔ The choice of tokenizer affects a model's ability to generalize to rare, novel, or non-standard text.

