<div align="center">
<img src="https://poorit.in/image.png" alt="Poorit" width="40" style="vertical-align: middle;"> <b>AI SYSTEMS ENGINEERING 1</b>

## Unit 3: Practical Tokenization

**CV Raman Global University, Bhubaneswar**  
*AI Center of Excellence*

---

</div>


### What You'll Learn

In this notebook, you will:

1. **Understand why tokenization matters** — cost, context length, and performance
2. **Use HuggingFace tokenizers** to encode and decode text
3. **Compare tokenizers** across different models and languages

**Duration:** ~30 minutes

---

## 1. Environment Setup

In [None]:
# Install required packages
!pip install -q transformers tokenizers

In [None]:
from transformers import AutoTokenizer

---

## 2. Why Tokenization Matters

LLMs don't process raw text. They work with **tokens**: pieces of text (words, sub-words, or bytes), each mapped to an integer ID.

Tokenization affects:
- **Cost**: hosted LLM APIs typically bill per token
- **Context length**: models have token limits (e.g., GPT-4 Turbo accepts up to 128k tokens)
- **Performance**: tokenizers split languages differently, so the same sentence can need far more tokens in one language than in another
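
To make the cost and context-length points concrete, here is a minimal sketch that turns a token count into an estimated cost and checks it against a context limit. The function names, the price per 1,000 tokens, and the limit are hypothetical placeholders, not any provider's actual figures.

```python
# Hypothetical helpers: price and limit values are illustrative assumptions.

def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Return the estimated cost of processing num_tokens tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


def fits_in_context(num_tokens: int, context_limit: int) -> bool:
    """Return True if the token count is within the context limit."""
    return num_tokens <= context_limit


prompt_tokens = 1_500      # in practice: len(tokenizer.encode(prompt))
assumed_price = 0.01       # assumed price per 1,000 tokens (placeholder)
assumed_limit = 128_000    # assumed context limit in tokens (placeholder)

print(f"Estimated cost: {estimate_cost(prompt_tokens, assumed_price):.4f}")
print(f"Fits in context: {fits_in_context(prompt_tokens, assumed_limit)}")
```

Because both quantities depend only on the token count, the same text can be cheaper or more expensive depending on which tokenizer the model uses.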

---

## 3. Loading Tokenizers

Each model has its own tokenizer trained on specific data. Let's load two popular ones.

In [None]:
# Load tokenizers for GPT-2 and BERT
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
# Check vocabulary sizes
print(f"GPT-2 vocabulary size: {len(gpt2_tokenizer):,}")
print(f"BERT vocabulary size:  {len(bert_tokenizer):,}")

---

## 4. Basic Encoding and Decoding

**Encoding** converts text to token IDs. **Decoding** converts token IDs back to text.
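
Before using a real tokenizer, it may help to see the idea in miniature. The sketch below uses an invented four-entry vocabulary and whitespace splitting; `toy_vocab`, `toy_encode`, and `toy_decode` are hypothetical names for illustration and are far simpler than what a real tokenizer does.

```python
# Toy vocabulary invented for illustration; real vocabularies have tens of thousands of entries.
toy_vocab = {"hello": 0, "world": 1, "!": 2, "<unk>": 3}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def toy_encode(text: str) -> list[int]:
    """Lowercase, split on whitespace, and map each piece to its ID."""
    unk_id = toy_vocab["<unk>"]
    return [toy_vocab.get(piece, unk_id) for piece in text.lower().split()]


def toy_decode(ids: list[int]) -> str:
    """Map IDs back to token strings and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = toy_encode("Hello world !")
print(ids)              # [0, 1, 2]
print(toy_decode(ids))  # hello world !
```

Note that the round trip is lossy here: the capital `H` is gone after decoding. Some real tokenizers (uncased ones, for example) are lossy in the same way, while others reproduce the input exactly.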

In [None]:
text = "Hello, I am studying at CV Raman University!"

# Encode text to token IDs
tokens = gpt2_tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Number of tokens: {len(tokens)}")

In [None]:
# Decode back to text
decoded = gpt2_tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

In [None]:
# See individual tokens
token_strings = gpt2_tokenizer.convert_ids_to_tokens(tokens)

print("Token breakdown:")
for token_id, token_str in zip(tokens, token_strings):
    print(f"  {token_id:6d} → '{token_str}'")

Notice the `Ġ` prefix on many tokens: it marks a leading space, so `Ġstudying` is the word "studying" preceded by a space, kept as a single token. Words that are not in the vocabulary as a whole (rare names, for instance) are split into several sub-word pieces instead.
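
The `Ġ` symbol comes from the way GPT-2's byte-level BPE displays raw bytes: every byte value is assigned a printable character, and bytes that are not already printable (such as the space) are shifted past code point 255. The sketch below is a re-implementation written for this notebook, modeled on that mapping; `bytes_to_unicode` and `readable` are names chosen here, and in practice the tokenizer does this internally.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (space, control characters, ...) are shifted past 255.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


byte_map = bytes_to_unicode()
print(byte_map[ord(" ")])   # Ġ
print(byte_map[ord("\n")])  # Ċ


def readable(token: str) -> str:
    """Undo the mapping for one token string so it reads like normal text."""
    reverse = {ch: b for b, ch in byte_map.items()}
    return bytes(reverse[ch] for ch in token).decode("utf-8", errors="replace")


print(repr(readable("Ġstudying")))  # ' studying'
```

This is why decoding the token IDs gives back ordinary text with spaces, while `convert_ids_to_tokens` shows the `Ġ` markers.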

---

## 5. Comparing Tokenizers

Different models tokenize the same text differently. Let's compare GPT-2 and BERT on English, Hindi, and code.

In [None]:
def compare_tokenizers(text, tokenizers):
    """Compare how different tokenizers handle the same text."""
    print(f"Text: {text}\n")

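    for name, tokenizer in tokenizers.items():
        # Skip special tokens ([CLS]/[SEP] for BERT) so the counts are comparable
        tokens = tokenizer.encode(text, add_special_tokens=False)
        token_strings = tokenizer.convert_ids_to_tokens(tokens)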

        print(f"{name}:")
        print(f"  Tokens: {len(tokens)}")
        print(f"  Breakdown: {token_strings}")
        print()

In [None]:
tokenizers = {
    "GPT-2": gpt2_tokenizer,
    "BERT": bert_tokenizer
}

# English text
compare_tokenizers("Artificial Intelligence is transforming industries.", tokenizers)

In [None]:
# Hindi text — notice how many more tokens are needed
compare_tokenizers("भारत एक महान देश है।", tokenizers)

In [None]:
# Code — tokenizers handle code differently too
compare_tokenizers("def hello_world(): print('Hello!')", tokenizers)

**Key observations:**
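- English text is tokenized efficiently by both models
- Hindi text needs far more tokens with GPT-2: its byte-level BPE merges were learned mostly from English text, so Devanagari characters break into several byte-level pieces. BERT's uncased English vocabulary covers little Devanagari, so it may fall back to single-character pieces or `[UNK]`
- Code tokenization varies: GPT-2 preserves case and whitespace, while `bert-base-uncased` lowercases the input and discards spacing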
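
One likely reason for the Hindi result with GPT-2 is that its BPE operates on UTF-8 bytes, and each Devanagari code point takes three bytes where an ASCII letter takes one. The sketch below uses only Python's built-in `str.encode` to compare the raw material the tokenizer starts from before any merges apply; `utf8_profile` is a hypothetical helper and the two words are chosen for illustration.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for word in ["India", "भारत"]:
    chars, num_bytes = utf8_profile(word)
    print(f"{word}: {chars} code points, {num_bytes} UTF-8 bytes")
# India: 5 code points, 5 UTF-8 bytes
# भारत: 4 code points, 12 UTF-8 bytes
```

With few learned merges covering Devanagari byte sequences, most of those bytes end up as separate tokens, which is also why the GPT-2 breakdown for Hindi looks like unreadable symbols.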

---

### A Note on Tokenization Strategies

Different models use different algorithms to build their vocabulary:

| Strategy | Used By | Key Idea |
|----------|---------|----------|
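| **BPE** (Byte Pair Encoding) | GPT-2, GPT-4 | Starts with small units (characters, or bytes in GPT-2) and repeatedly merges the most frequent adjacent pair |
| **WordPiece** | BERT | Similar to BPE, but picks the merge that most increases the likelihood of the training data |
| **SentencePiece** (a library implementing BPE and Unigram) | Llama 2, T5 | Language-agnostic; works directly on raw text and treats whitespace as an ordinary symbol (no pre-tokenization) |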

All three approaches learn sub-word vocabularies — they split rare words into pieces while keeping common words whole. You don't need to choose a strategy yourself; it's baked into each model's tokenizer.
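
The BPE idea from the table (start from small units, repeatedly merge the most frequent adjacent pair) can be sketched in a few lines. This is a toy trainer written for illustration, not the implementation used by any library: `learn_bpe_merges` and the word list are hypothetical, it starts from characters rather than bytes, and ties are broken by first occurrence.

```python
from collections import Counter


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Represent each distinct word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
print(learn_bpe_merges(toy_words, num_merges=3))
```

After two merges the frequent word `low` has become a single symbol, while rarer words such as `lowest` are still made of several pieces. That is the "common words whole, rare words split" behavior at small scale.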

---

## 6. Exercise

Pick 3 sentences — one in English, one in Hindi (or another non-English language), and one code snippet. Tokenize each with GPT-2 and count the tokens. Which type of content is the most token-efficient?

In [None]:
# Exercise: Token counting
# Replace the strings below with your own text

english_text = ""  # e.g. "The quick brown fox jumps over the lazy dog."
hindi_text = ""    # e.g. "नमस्ते, आप कैसे हैं?"
code_text = ""     # e.g. "for i in range(10): print(i)"

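for label, text in [("English", english_text), ("Hindi", hindi_text), ("Code", code_text)]:
    tokens = gpt2_tokenizer.encode(text)
    if not tokens:
        # An empty string produces no tokens; skip it to avoid dividing by zero
        print(f"{label}: no text provided yet")
        continue
    print(f"{label}: {len(tokens)} tokens for {len(text)} characters (ratio: {len(text)/len(tokens):.1f} chars/token)")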

---

## Key Takeaways

1. **Tokenizers are model-specific** — always use the matching tokenizer for a model

2. **Token counts vary by content** — non-English text and code often use more tokens

3. **Sub-word tokenization** — rare words are split into pieces, common words stay whole
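
The third takeaway can be seen in miniature with a greedy longest-match split, in the style of BERT's WordPiece (continuation pieces carry a `##` prefix). This is a simplified sketch under stated assumptions: `split_into_subwords` is a hypothetical helper and the five-entry vocabulary is invented for illustration.

```python
def split_into_subwords(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split; non-initial pieces get a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"the", "token", "##ization", "##izer", "##s"}
print(split_into_subwords("the", toy_vocab))           # ['the']
print(split_into_subwords("tokenization", toy_vocab))  # ['token', '##ization']
print(split_into_subwords("tokenizers", toy_vocab))    # ['token', '##izer', '##s']
print(split_into_subwords("xyz", toy_vocab))           # ['[UNK]']
```

The common word `the` stays whole because it is in the vocabulary, while the longer words are assembled from smaller pieces.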

### What's Next?

In the next notebook, we'll go beyond pipelines and **load and run open-source LLMs** directly, exploring generation parameters like temperature and sampling.

---

## Additional Resources

- [Tokenizers Documentation](https://huggingface.co/docs/tokenizers)
- [Understanding Tokenization](https://huggingface.co/docs/transformers/tokenizer_summary)

---

**Course Information:**
- **Institution:** CV Raman Global University, Bhubaneswar
- **Program:** AI Center of Excellence
- **Course:** AI Systems Engineering 1
- **Developed by:** [Poorit Technologies](https://poorit.in) — *Transform Graduates into Industry-Ready Professionals*

---