# 🔎 Tokenization in practice with tiktoken
In the `fun_fact`, we explored tokenization using the [OpenAI web tokenizer](https://platform.openai.com/tokenizer), which provides a visual, interactive way to see how text is split into tokens.

In this notebook, we replicate the same process programmatically, using the Python library [tiktoken](https://github.com/openai/tiktoken).

This allows us to understand what happens under the hood and to work with tokens directly in code.

---
## 📦 What is tiktoken?

`tiktoken` is OpenAI's open-source tokenization library, a fast byte pair encoding (BPE) tokenizer for use with OpenAI models.

It implements the same tokenization logic used internally by GPT models, including:
- subword tokenization
- vocabulary lookup
- mapping between text ↔ token IDs

> Importantly, no API call is made: apart from a one-time download of the vocabulary file on first use, tokenization happens entirely locally.
> We are not querying a model, only applying a deterministic text → tokens mapping.

This makes `tiktoken` fast, free to run, and well suited to experiments and analysis.

---

## 🔁 Encoding vs decoding
Tokenization always involves two complementary operations:

### ✏️ Encoding
Encoding converts raw text into a sequence of token IDs:
```text
Text → Tokens → Token IDs
```
Each token ID is an integer that represents a specific token in the model's vocabulary.

### 🔄 Decoding
Decoding converts token IDs back into human-readable text fragments:
```text
Token IDs → Tokens → Text
```

Decoding is useful for inspecting what a token actually represents, verifying how words are split into subword fragments, and debugging or reasoning about token boundaries.
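The two operations above can be illustrated without any library. The sketch below uses a made-up six-entry vocabulary and greedy longest-prefix matching; both are assumptions chosen for readability. `tiktoken` itself uses byte pair encoding over a much larger vocabulary, so this is only a toy model of the text ↔ token ID mapping, not a description of how `tiktoken` works internally.

```python
# Hypothetical toy vocabulary: token string -> token ID (not tiktoken's real vocabulary).
TOY_VOCAB = {"token": 0, "ization": 1, "izer": 2, " is": 3, " fun": 4, "s": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(text: str) -> list[int]:
    """Text -> tokens -> token IDs, using greedy longest-prefix matching."""
    ids = []
    position = 0
    while position < len(text):
        # Try the longest candidate first so "ization" wins over shorter pieces.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                position = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[position:]!r}")
    return ids


def toy_decode(ids: list[int]) -> str:
    """Token IDs -> tokens -> text, by concatenating the token strings."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = toy_encode("tokenization is fun")
print(ids)              # [0, 1, 3, 4]
print(toy_decode(ids))  # tokenization is fun
```

Note that the space belongs to the token that follows it (`" is"`, `" fun"`), which is why decoding by plain concatenation restores the original text exactly.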

---

```python
import tiktoken

# Look up the encoding associated with the model name
# (the installed tiktoken version must recognize this model)
encoding = tiktoken.encoding_for_model("gpt-4.1-mini")

# Encoding text into token IDs
print("Encode Results:")
tokens = encoding.encode("This sentence contains only very common English words.")
print(tokens)
print("------")

# Inspecting individual tokens: decode each ID on its own
print("Decode Results:")
for token_id in tokens:
    token_text = encoding.decode([token_id])
    print(f"{token_id} = {token_text}")
print("------")
```

```text
Encode Results:
[2500, 21872, 8895, 1606, 1869, 5355, 7725, 6391, 13]
------
Decode Results:
2500 = This
21872 =  sentence
8895 =  contains
1606 =  only
1869 =  very
5355 =  common
7725 =  English
6391 =  words
13 = .
------
```
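The double space after `=` in the output is easy to miss: most fragments start with a space, because the space is part of the token that follows it. A small helper can make the boundaries explicit. This is a minimal sketch: `show_boundaries` is a hypothetical helper name, and the lookup table is copied from the output above so that the snippet runs without `tiktoken` installed.

```python
# Token IDs and decoded fragments copied from the notebook output above.
# The dict stands in for `encoding.decode([token_id])`.
DECODED = {
    2500: "This", 21872: " sentence", 8895: " contains", 1606: " only",
    1869: " very", 5355: " common", 7725: " English", 6391: " words", 13: ".",
}
token_ids = [2500, 21872, 8895, 1606, 1869, 5355, 7725, 6391, 13]


def show_boundaries(ids, decode_one):
    """Return the decoded fragments joined by '|' to mark token boundaries."""
    return "|".join(decode_one(token_id) for token_id in ids)


print(show_boundaries(token_ids, DECODED.__getitem__))
# This| sentence| contains| only| very| common| English| words|.

# Concatenating the fragments without a separator restores the sentence.
print("".join(DECODED[token_id] for token_id in token_ids))
# This sentence contains only very common English words.
```

With a real encoding object, the same helper can be called as `show_boundaries(tokens, lambda token_id: encoding.decode([token_id]))`.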


## ✅ Why this matters
Understanding tokenization at this level helps to:
- estimate prompt and context size
- reason about costs (priced per token)
- design better prompts
- understand why some inputs are split "strangely"
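The first two points reduce to simple arithmetic once a token count is available. The sketch below assumes a hypothetical helper, `estimate_prompt`, and placeholder values for the price and the context window; neither figure is a real number for any model, so substitute the values published for the model in use.

```python
def estimate_prompt(token_count: int, price_per_million_tokens: float, context_window: int) -> dict:
    """Estimate input cost and context usage for a prompt of `token_count` tokens."""
    return {
        "tokens": token_count,
        # Cost scales linearly with the number of tokens.
        "cost": token_count * price_per_million_tokens / 1_000_000,
        # Fraction of the context window consumed by the prompt.
        "context_used": token_count / context_window,
        "fits": token_count <= context_window,
    }


# 9 tokens is the length of the encoded example sentence in this notebook.
# The price and context window are placeholders, not real figures for any model.
report = estimate_prompt(token_count=9, price_per_million_tokens=1.00, context_window=8_000)
print(report)
```

In practice, `token_count` would come from `len(encoding.encode(prompt))`.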

Even experienced practitioners benefit from occasionally inspecting tokenization manually: it reveals a lot about how LLMs actually process text.