## Tokenization

### **What is Tokenization?**

`Tokenization` is the process of breaking down a raw text string into smaller, meaningful units called `tokens`. These tokens are the basic `"words"` or `"sub-words"` that a model can understand and process. It's the bridge between human language and machine numerical representation.

Think of it like this: you can't do math with the word `"seven"`; you need the number `7`. Similarly, an AI model can't process the word `"hello"` directly; it needs a numerical ID, say `42`. Tokenization is the process that maps `"hello"` to `42`.
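The mapping described above can be sketched as a plain dictionary lookup. The vocabulary and its IDs here are made up for illustration; a real tokenizer learns its table from a large corpus.

```python
# Toy vocabulary: token string -> integer ID (all IDs are invented for this example).
vocab = {"[UNK]": 0, "hello": 42, "world": 7}

def to_ids(tokens):
    """Look up each token, falling back to the [UNK] ID for unseen tokens."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

# Reverse table for turning IDs back into token strings.
id_to_token = {i: tok for tok, i in vocab.items()}

ids = to_ids(["hello", "world"])
print(ids)                            # [42, 7]
print([id_to_token[i] for i in ids])  # ['hello', 'world']
```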



**Why is it Necessary?**

- `Numerical Representation:` Machine Learning models are mathematical functions; they require numbers as input, not strings. Tokenization converts text into integer indices.

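- `Handling Vocabulary:` It defines the model's vocabulary, the fixed set of tokens the model knows. Text outside that vocabulary has to be handled somehow; modern subword tokenizers do this by splitting unfamiliar words into smaller pieces that are in the vocabulary.

- `Efficiency:` Tokens keep sequences manageable: far shorter than a character-by-character sequence, while the vocabulary stays far smaller than one entry per possible word.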


### **Levels of Tokenization**

There are three main approaches, each with pros and cons.

**1. Word-based Tokenization**

This splits text into words based on spaces and punctuation.

- **Example:** `"Don't hesitate to ask!"` becomes `["Don", "'", "t", "hesitate", "to", "ask", "!"]` (the exact split depends on how the rules treat punctuation)

- **Pros:** Intuitive, tokens have clear meaning.

- **Cons:**
  - **Large Vocabulary:** A model must know every possible word, inflection, and misspelling, leading to huge vocabularies (100k+ tokens).
  - **Out-of-Vocabulary (OOV) Problem:** What happens with the word `"Supercalifragilisticexpialidocious"`? It's unseen, so it becomes an `[UNK]` (unknown) token, losing all meaning.
  - **No semantic similarity:** Words like "run" and "running" are treated as entirely different tokens.
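A minimal sketch of a word-level tokenizer makes the OOV problem concrete. The regular expression and the tiny vocabulary are assumptions chosen for illustration; this variant keeps punctuation marks as separate tokens.

```python
import re

def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical tiny vocabulary; real word-level vocabularies hold 100k+ entries.
vocab = {"[UNK]": 0, "to": 1, "ask": 2, "hesitate": 3}

def encode(text):
    """Map each word to its ID, or to the [UNK] ID when the word is unseen."""
    return [vocab.get(tok.lower(), vocab["[UNK]"]) for tok in word_tokenize(text)]

print(word_tokenize("Don't hesitate to ask!"))
# ['Don', "'", 't', 'hesitate', 'to', 'ask', '!']
print(encode("hesitate to ask Supercalifragilisticexpialidocious"))
# [3, 1, 2, 0]
```

The last ID is `0`: the long word collapses to `[UNK]`, and nothing about its spelling survives.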


**2. Character-based Tokenization**

This splits text into individual characters.

- **Example:** `"ask"` becomes `["a", "s", "k"]`

- **Pros:**
  - **Tiny Vocabulary:** For English text, only `~26 letters` plus case variants, digits, and punctuation, so the vocabulary is typically on the order of a hundred symbols.
  - **No OOV Problem:** Any word can be built from characters.

- **Cons:**
  - **Loss of Meaning:** Individual characters carry very little semantic meaning.
  - **Long Sequences:** "hello" is 1 word but 5 tokens, making processing much longer and more expensive for the model.
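The trade-off can be seen in a few lines. This is a sketch that builds the character vocabulary from a single example string, which is an assumption made for brevity; a real character vocabulary would be built from a whole corpus.

```python
text = "hello world"

# Character-level: every character (including the space) is a token.
char_tokens = list(text)
# Word-level baseline for comparison: split on whitespace.
word_tokens = text.split()

# Build a vocabulary from the characters actually seen.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), len(char_tokens))  # 2 11
print(len(char_vocab))                     # 8
print(char_ids)
```

Eight vocabulary entries are enough to encode the text, but the sequence is more than five times longer than the word-level one.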


**3. Subword-based Tokenization (The Modern Standard)**

This approach combines the strengths of the other two. It splits words into frequent subword units (such as prefixes, suffixes, and roots): common words remain a single token, while rare words are split into smaller pieces that the model already knows.


- **Example:** The word `"unhappily"` might be split into `["un", "happ", "ily"]`. The model has likely seen each of these pieces before in other words such as `"untrue"`, `"happy"`, and `"easily"`.

- **Pros:**
  - **Balanced Vocabulary:** Vocabulary size is a chosen compromise (e.g., 30k-100k tokens).
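  - **Largely solves the OOV Problem:** Unknown tokens become rare, and with byte-level vocabularies they disappear entirely. Even a new word like `"tokenization"` can be split into known subwords: `["token", "ization"]`.
  - **Broad coverage:** The same vocabulary handles inflections, compounds, and misspellings by reusing pieces it already contains.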

- **Cons:** Less intuitive for humans to read than whole words.
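One simple way to split a word against a fixed subword vocabulary is greedy longest-match-first, which is similar in spirit to how WordPiece segments words at inference time (without the `##` continuation prefix). The sketch below assumes a hand-written toy vocabulary; it is not how BPE applies its merge rules.

```python
def split_subwords(word, vocab):
    """Greedy longest-match-first split; returns None if some span is uncovered."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return None  # no vocabulary entry starts at this position
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary for illustration.
vocab = {"un", "happ", "ily", "token", "ization", "true", "quick", "ly"}

print(split_subwords("unhappily", vocab))     # ['un', 'happ', 'ily']
print(split_subwords("tokenization", vocab))  # ['token', 'ization']
print(split_subwords("untrue", vocab))        # ['un', 'true']
```

Real tokenizers avoid the `None` case by including every single character (or byte) in the vocabulary as a last-resort piece.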

`Algorithms that use this:`
- `Byte-Pair Encoding` (BPE), used by GPT models
- `WordPiece`, used by BERT
- `SentencePiece`, a library that implements BPE and unigram tokenization directly on raw text; used by Llama and common in multilingual models
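To give a feel for how BPE learns its vocabulary, here is a minimal training sketch: start from single characters and repeatedly merge the most frequent adjacent pair. The word counts are a made-up toy corpus, and production implementations add details (byte-level fallback, end-of-word markers, efficient pair counting) that are left out here.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} dict; returns the ordered merge list."""
    # Start with each word as a tuple of single characters.
    corpus = {tuple(w): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + count
        corpus = new_corpus
    return merges

# Toy corpus: word -> number of occurrences.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The frequent suffix `est` is assembled within two merges, which is how common word endings end up as single tokens.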

### The Tokenization Pipeline: A Step-by-Step Example

Let's tokenize the sentence `"I don't like sand."` using a subword tokenizer (GPT-2's).

**Step 1: Pre-tokenization**

Split the text into rough, word-like chunks based on rules and punctuation. GPT-2's pre-tokenizer keeps the leading space attached to each word and splits off contractions such as `'t`:
`["I", " don", "'t", " like", " sand", "."]`


**Step 2: Apply the Tokenization Algorithm (e.g., BPE)**

The tokenizer applies its pre-learned merge rules to each chunk and looks up the resulting pieces in its vocabulary. In this sentence every chunk ends up as a single vocabulary entry:

- `"I"` -> Becomes token **40**.

- `" don"` -> Becomes token **836**.

- `"'t"` -> Becomes token **470**.

- `" like"` -> Becomes token **588**.

- `" sand"` -> Becomes token **6450**.

- `"."` -> Becomes token **13**.

A rarer chunk would instead be broken into several subword pieces, each with its own ID.


**Step 3: Add Special Tokens (Model-Specific)**

Many models add special tokens to signify the start, end, or other information. BERT, for example, wraps each sequence in `[CLS]` and `[SEP]`. GPT-2's tokenizer adds nothing by default; its one special token, `<|endoftext|>` (ID `50256`), is typically used to mark document boundaries.

**Final Result:** The original string is now a list of integers: `[40, 836, 470, 588, 6450, 13]`
This list is what is fed directly into the model's embedding layer.
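The pre-tokenization step can be approximated with a regular expression. The pattern below is an ASCII-only simplification of a GPT-2-style rule, written as an assumption so that it runs with the standard-library `re` module; the real pattern relies on Unicode letter and number classes. A GPT-2-style pre-tokenizer keeps the leading space attached to each word and splits off contractions such as `'t`.

```python
import re

# ASCII-only simplification of a GPT-2-style pre-tokenization pattern (an assumption
# for illustration; the real pattern uses Unicode letter/number classes).
PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pre_tokenize(text):
    """Split text into word-like chunks, keeping leading spaces attached."""
    return PATTERN.findall(text)

chunks = pre_tokenize("I don't like sand.")
print(chunks)  # ['I', ' don', "'t", ' like', ' sand', '.']

# The chunks concatenate back to the original text: nothing is lost.
assert "".join(chunks) == "I don't like sand."
```

Because no characters are dropped, decoding can reconstruct the original string exactly, spaces included.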

### Key Concepts to Remember


- **Vocabulary:** The fixed lookup table that maps token strings (e.g., `" like"`) to their integer IDs (e.g., `588` in GPT-2's vocabulary). It's created when the tokenizer is trained on a large corpus.

- **Special Tokens:** Tokens that don't correspond to text but have a functional purpose:

  - `[CLS] (BERT):` Classification token. Its final hidden state is used for classification tasks.

  - `[SEP] (BERT):` Separator token, used to separate two sentences.

  - `[UNK]:` Represents an `unknown` word that couldn't be tokenized. Good tokenizers minimize this.

  - `[PAD]:` Padding token, used to make all sequences in a batch the same length.

- **Attention to Spaces:** In GPT-2's vocabulary, the token for `"like"` in the middle of a sentence is `" like"` (with a leading space, displayed as `Ġlike`). This is common in BPE-based tokenizers, which use the space to mark where words start. Because such conventions differ between models, always use the exact tokenizer a model was trained with.
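The `Ġ` symbol comes from a byte-to-character table: byte-level BPE tokenizers remap every byte to a printable character so that spaces and control bytes can appear inside token strings. The sketch below is modeled on the table used by GPT-2's byte-level BPE; treat it as an illustration rather than a reference implementation.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

byte_map = bytes_to_unicode()
visible = "".join(byte_map[b] for b in " like".encode("utf-8"))
print(visible)  # Ġlike
```

The space byte (32) is the 33rd non-printable byte, so it lands on U+0120, which is `Ġ`.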

### Python Code Example with `transformers`

```python
from transformers import AutoTokenizer

# Load the tokenizer for a specific model (e.g., GPT-2)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Our text
text = "I don't like sand."

# Tokenization pipeline: text -> tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# The 'Ġ' character stands for a leading space.

# Convert text to IDs (what the model actually sees)
input_ids = tokenizer.encode(text)
print("Input IDs:", input_ids)
# These numbers are specific to GPT-2's vocabulary.

# Decode back to text to verify
decoded_text = tokenizer.decode(input_ids)
print("Decoded:", decoded_text)
```

```text
Tokens: ['I', 'Ġdon', "'t", 'Ġlike', 'Ġsand', '.']
Input IDs: [40, 836, 470, 588, 6450, 13]
Decoded: I don't like sand.
```


In summary, `tokenization` is a non-trivial but crucial process that transforms raw human text into the integer sequences that machine learning models consume, and `subword tokenization` is the approach that most modern LLMs rely on.

### More Examples

**Step 1: Tokenization**

```python
# Example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Learning tokenization is essential in NLP."
tokens = tokenizer.tokenize(sequence)
print(tokens)
```

```text
['Learning', 'token', '##ization', 'is', 'essential', 'in', 'NL', '##P', '.']
```


**Step 2: Conversion to Input IDs**

Each token is mapped to a unique ID from the tokenizer’s vocabulary.

```python
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```

```text
[9681, 22559, 2734, 1110, 6818, 1107, 21239, 2101, 119]
```


**Step 3: Decoding**

The decode method reverses the process, converting IDs back into readable text.

```python
decoded_text = tokenizer.decode(ids)
print(decoded_text)
```

```text
Learning tokenization is essential in NLP.
```
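Part of what decoding does for a WordPiece vocabulary is glue `##` continuation pieces back onto the preceding token. The helper below is a hypothetical sketch of that single step, not the library's implementation; the real `decode` also cleans up spacing around punctuation.

```python
def join_wordpieces(tokens):
    """Merge '##'-prefixed continuation pieces into the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and append to the last word
        else:
            words.append(tok)
    return words

tokens = ['Learning', 'token', '##ization', 'is', 'essential', 'in', 'NL', '##P', '.']
print(join_wordpieces(tokens))
# ['Learning', 'tokenization', 'is', 'essential', 'in', 'NLP', '.']
```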


#### Using Pretrained Tokenizers

Hugging Face’s `transformers` library simplifies working with tokenizers.

```python
# Loading a Tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Saving a Tokenizer
tokenizer.save_pretrained("my_tokenizer_directory")
```

```text
('my_tokenizer_directory/tokenizer_config.json',
 'my_tokenizer_directory/special_tokens_map.json',
 'my_tokenizer_directory/vocab.txt',
 'my_tokenizer_directory/added_tokens.json',
 'my_tokenizer_directory/tokenizer.json')
```

```python
# Combining Tokenization and Conversion
encoded = tokenizer("Tokenization is key to NLP tasks.")
print(encoded)
```

```text
{'input_ids': [101, 1706, 6378, 2734, 1110, 2501, 1106, 21239, 2101, 8249, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

Calling the tokenizer directly also adds BERT's special tokens: `101` is `[CLS]` and `102` is `[SEP]`.
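The `attention_mask` above is all ones because a single sequence needs no padding. When several sequences of different lengths are batched, the shorter ones are padded and the mask marks which positions are real. The function below is a hand-rolled sketch of that idea with a hypothetical name and an assumed pad ID of `0`; in practice the library handles this when padding is requested.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 1706, 6378, 102], [101, 119, 102]])
print(batch["input_ids"])       # [[101, 1706, 6378, 102], [101, 119, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```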
