Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.

In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text:

In [None]:
# "Jim Henson was a puppeteer"

Because models work on numbers, the raw text has to be converted into numerical form. That is the tokenizer's job, and there are several ways to do it. The goal is to find the most meaningful representation (the one that makes the most sense to the model) and, if possible, the smallest one.

**Word-based (word tokenization): splits text at whitespace or punctuation.**

Word-based tokenizers assign a unique ID to every word, which creates large vocabularies (e.g., 500,000+ words in English). The approach also struggles with inflected forms such as “dog” vs. “dogs” or “run” vs. “running,” treating them as unrelated words. It needs a special [UNK] token for words that are not in the vocabulary, and information is lost every time that token is used. One way to reduce unknown tokens is character-based tokenization, which splits text into individual characters; subword tokenization, described next, sits between the two extremes.

In [None]:
# There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python’s split() function:

tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
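
To make the [UNK] problem concrete, here is a minimal sketch of a word-level vocabulary. The two-sentence corpus, the `vocab` dictionary, and the `encode` helper are all made up for illustration; a real tokenizer would build its vocabulary from a much larger corpus.

```python
# Toy word-level tokenizer; corpus, vocabulary, and helper names are illustrative.
corpus = ["Jim Henson was a puppeteer", "Jim was a writer"]

# Reserve ID 0 for the unknown token, then assign IDs in order of first appearance.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each whitespace-separated word to its ID, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

known_ids = encode("Jim was a puppeteer")
unknown_ids = encode("Jim was a puppeteers")  # plural form was never seen

print(known_ids)
print(unknown_ids)
```

The plural “puppeteers” is not in the vocabulary, so it collapses to the same ID as any other unseen word, even though it differs from a known word by one letter.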


# Subword Tokenization

Breaks rare or unknown words into smaller known subword units. Popular in LLMs.

Variants:

- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece

***a. Byte-Pair Encoding (BPE)***

In [None]:
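# Used in: GPT, RoBERTa, GPT-Neo, GPT-J
# Builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols.
# Illustrative splits; the actual pieces depend on the merges learned from the training corpus.
# "unhappiness" → ['un', 'happiness']
# "disappointed" → ['dis', 'appoint', 'ed']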
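
The merge procedure behind BPE can be sketched in a few lines. This is a simplified character-level version on a made-up four-word corpus, not the byte-level implementation that GPT-2 ships; the helper names `get_pair_counts` and `merge_pair` are assumptions chosen for this example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (word -> frequency); every word starts as a tuple of characters.
word_freqs = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(4):  # number of merges is a hyperparameter; 4 is arbitrary here
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print(merges)
print(list(word_freqs))
```

On this toy corpus the four merges end up building the symbols `est` and `low`, so “newest” and “widest” share a suffix piece while “low” becomes a single token.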



***b. WordPiece***

In [None]:
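# Used in: BERT, DistilBERT
# Splits a word into the longest pieces found in the vocabulary; "##" marks a piece that continues a word rather than starting one.
# Illustrative splits; the actual pieces depend on the learned vocabulary.
# "playing" → ['play', '##ing']
# "unaffordable" → ['un', '##afford', '##able']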
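
The splitting step can be sketched as a greedy longest-match-first search. This is a simplified version of what BERT-style tokenizers do at inference time; the `wordpiece_tokenize` function and the tiny `toy_vocab` are hypothetical and exist only for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "un", "##afford", "##able", "[UNK]"}

playing = wordpiece_tokenize("playing", toy_vocab)
unaffordable = wordpiece_tokenize("unaffordable", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)

print(playing)
print(unaffordable)
print(unknown)
```

With this vocabulary the first two calls reproduce the splits listed above, and “xyz” falls back to `[UNK]` because no vocabulary entry matches its first character.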




***c. SentencePiece***

In [None]:
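# Used in: T5, ALBERT, mT5
# Treats the input as a raw character stream, whitespace included, so it does not need pre-split words.
# Works well for multilingual tasks, including languages written without spaces.
# Uses "▁" to mark where a space was.
# Illustrative split: "MachineLearning" → ['▁Machine', 'Learning']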
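
The role of the "▁" marker can be shown without the SentencePiece library. The sketch below only illustrates the whitespace convention; the `mark_spaces` and `detokenize` helpers and the hand-picked segmentation are assumptions, and a trained model would choose its own cuts.

```python
SPACE_MARK = "\u2581"  # the "▁" character

def mark_spaces(text):
    """Prefix the text with the marker and replace every space with it."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def detokenize(pieces):
    """Join the pieces and turn markers back into spaces."""
    return "".join(pieces).replace(SPACE_MARK, " ").lstrip(" ")

marked = mark_spaces("Machine Learning is fun")

# A hypothetical segmentation of the marked string, chosen by hand.
pieces = ["▁Machine", "▁Learn", "ing", "▁is", "▁fun"]
assert "".join(pieces) == marked  # the pieces cover the marked text exactly

restored = detokenize(pieces)
print(restored)
```

Because spaces are stored inside the tokens themselves, concatenating the pieces recovers the original text without any language-specific rules about where spaces belong.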

***Compare how different tokenizers split the same text***

In [None]:
text = "Transformers are amazing!"


| Model   | Tokenizer Type          |
| ------- | ----------------------- |
| BERT    | WordPiece               |
| GPT-2   | Byte-level BPE          |
| T5      | SentencePiece (Unigram) |
| GPT-Neo | Byte-level BPE          |


We'll show how each processes the same input, including:

- Tokens
- Token IDs
- Special tokens
- How unknown words are handled



In [None]:
from transformers import AutoTokenizer


In [None]:
# Load one pretrained tokenizer per model checkpoint
tokenizers = {
    "bert-base-uncased": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "gpt2": AutoTokenizer.from_pretrained("gpt2"),
    "t5-small": AutoTokenizer.from_pretrained("t5-small"),
    "EleutherAI/gpt-neo-125M": AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M"),
}

text = "Transformers are amazing!"

In [None]:
for model_name, tokenizer in tokenizers.items():
    print(f"\n🔹 Tokenizer: {model_name}")

    tokens = tokenizer.tokenize(text)                    # text → subword strings
    token_ids = tokenizer.convert_tokens_to_ids(tokens)  # subword strings → vocabulary IDs
    encoded = tokenizer.encode(text)                     # IDs plus any special tokens the model expects
    decoded = tokenizer.decode(encoded)                  # IDs → text, special tokens included

    print("Tokens:", tokens)
    print("Token IDs:", token_ids)
    print("With Special Tokens:", encoded)
    print("Decoded:", decoded)



🔹 Tokenizer: bert-base-uncased
Tokens: ['transformers', 'are', 'amazing', '!']
Token IDs: [19081, 2024, 6429, 999]
With Special Tokens: [101, 19081, 2024, 6429, 999, 102]
Decoded: [CLS] transformers are amazing! [SEP]

🔹 Tokenizer: gpt2
Tokens: ['Transform', 'ers', 'Ġare', 'Ġamazing', '!']
Token IDs: [41762, 364, 389, 4998, 0]
With Special Tokens: [41762, 364, 389, 4998, 0]
Decoded: Transformers are amazing!

🔹 Tokenizer: t5-small
Tokens: ['▁Transformer', 's', '▁are', '▁amazing', '!']
Token IDs: [31220, 7, 33, 1237, 55]
With Special Tokens: [31220, 7, 33, 1237, 55, 1]
Decoded: Transformers are amazing!</s>

🔹 Tokenizer: EleutherAI/gpt-neo-125M
Tokens: ['Transform', 'ers', 'Ġare', 'Ġamazing', '!']
Token IDs: [41762, 364, 389, 4998, 0]
With Special Tokens: [41762, 364, 389, 4998, 0]
Decoded: Transformers are amazing!


# *1. BERT (Bidirectional Encoder Representations from Transformers)*

What is BERT?

BERT is a transformer-based encoder model that reads both the left and right context of a word (bidirectional). It was designed for understanding language, not generating it.

Architecture:

- Uses only the encoder part of the transformer.
- Trained using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
- Input format: `[CLS] sentence A [SEP] sentence B [SEP]`.
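
The input layout can be sketched as follows. The `build_bert_inputs` helper is hypothetical and works on already-tokenized word pieces; in practice the Hugging Face tokenizer assembles this layout for you when you pass it a sentence pair.

```python
def build_bert_inputs(tokens_a, tokens_b=None):
    """Assemble [CLS] A [SEP] (B [SEP]) plus one segment ID per token."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment 0 covers [CLS], sentence A, and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment 1 covers sentence B and its [SEP]
    return tokens, segment_ids

tokens, segment_ids = build_bert_inputs(
    ["transformers", "are", "amazing", "!"],
    ["i", "agree"],
)

print(tokens)
print(segment_ids)
```

The segment IDs tell the model which sentence each token belongs to, which matters for pair tasks such as NSP and question answering.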

| Task                               | Example                            |
| ---------------------------------- | ---------------------------------- |
| **Text Classification**            | Sentiment analysis, spam detection |
| **Named Entity Recognition (NER)** | Highlighting names, places, etc.   |
| **Question Answering**             | SQuAD-style tasks                  |
| **Text Similarity**                | Sentence pairs                     |


Why Use BERT?

Strong at understanding contextual meaning of words.

Bidirectional attention gives it deep understanding of syntax and semantics.

# *2. GPT-2 (Generative Pretrained Transformer 2)*

What is GPT-2?

GPT-2 is an auto-regressive language model that generates text from left to right. It is trained to predict the next token given all the previous ones.

Architecture:

- Uses only the decoder part of the transformer.
- Trained using Causal Language Modeling (CLM).
- A causal attention mask stops each position from attending to later tokens, so text is generated strictly left to right.
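
The "no attention to future tokens" rule can be sketched as a lower-triangular mask applied before the softmax. This is a pure-Python illustration with uniform scores, not GPT-2's actual attention code; `causal_mask` and `masked_softmax` are names invented for the example.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; disallowed positions get zero weight."""
    masked = [s if allowed else float("-inf") for s, allowed in zip(scores, mask_row)]
    peak = max(masked)  # finite, because every position may attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)

# Uniform raw scores make the effect of the mask easy to read.
weights = [masked_softmax([0.0] * 4, row) for row in mask]

for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its attention only over itself and earlier positions, so the first token attends only to itself and the last token attends to all four.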



| Task                             | Example                            |
| -------------------------------- | ---------------------------------- |
| **Text Generation**              | Story, article, or code generation |
| **Dialogue Bots**                | Chatbots, assistants               |
| **Summarization** *(fine-tuned)* | Generate abstract summaries        |
| **Creative Tasks**               | Poems, jokes, completions          |


Why Use GPT-2?

Great at free-form generation and language modeling.

Can be fine-tuned for various generative NLP applications.

# *3. T5 (Text-to-Text Transfer Transformer)*

What is T5?

T5 reformulates every NLP task as a text-to-text problem, i.e., both input and output are strings.

For example:

Input: "summarize: The news article is ..."

Output: "The article talks about..."

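Architecture:

- Uses both encoder and decoder (full transformer).
- Pretrained on the C4 dataset (Colossal Clean Crawled Corpus) using a span-corruption objective: contiguous spans of tokens are replaced by sentinel tokens, and the decoder reconstructs them.
- Also trained on a multitask mixture of translation, QA, summarization, and other tasks, each identified by a text prefix.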
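
The span-masking objective can be sketched on whole words. In real pretraining the spans are sampled at random over subword tokens; here they are fixed by hand so the result is easy to follow. `span_corrupt` is a hypothetical helper, and the `<extra_id_N>` names follow the sentinel tokens in the Hugging Face T5 vocabulary.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    corrupted, target, cursor = [], [], 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted += tokens[cursor:start] + [sentinel]  # keep text before the span
        target += [sentinel] + tokens[start:end]        # target holds the removed span
        cursor = end
    corrupted += tokens[cursor:]
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return " ".join(corrupted), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
model_input, model_target = span_corrupt(tokens, [(2, 4), (8, 9)])

print(model_input)
print(model_target)
```

The encoder sees the sentence with gaps, and the decoder is trained to emit only the missing spans, each introduced by its sentinel.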

| Task                   | Example                      |
| ---------------------- | ---------------------------- |
| **Summarization**      | News article → short summary |
| **Translation**        | English → German             |
| **Question Answering** | Text → Answer                |
| **Classification**     | Sentence → label (as text)   |


Why Use T5?

Extremely flexible (same model for many tasks).

Ideal when you want one model to do it all (multi-task).

| Model     | Architecture    | Direction                                   | Strength               | Common Use                     |
| --------- | --------------- | ------------------------------------------- | ---------------------- | ------------------------------ |
| **BERT**  | Encoder-only    | Bidirectional                               | Understanding          | Classification, QA             |
| **GPT-2** | Decoder-only    | Left-to-right                               | Generation             | Text gen, chatbots             |
| **T5**    | Encoder-Decoder | Bidirectional encoder, left-to-right decoder | Text-to-text multitask | Summarization, translation, QA |
