# Introduction to Tokens with `tiktoken`

## Learning Goals
- Understand how text is represented as **tokens** for LLMs.
- Use the **tiktoken** library to encode and decode tokens.
- See how token count affects **context windows** and **costs**.

This notebook connects with the discussion of **context windows** in Section 1.2 of the lecture notes.

## Step 1: Install and import `tiktoken`
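`tiktoken` is OpenAI's open-source library for **tokenization**. Each model family (GPT-3.5, GPT-4, GPT-5, etc.) uses a specific encoding scheme. If the package is not installed yet, run `pip install tiktoken` before importing it.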

In [6]:
import tiktoken

## Step 2: Load an encoding
- For GPT-3.5 / GPT-4, the common encoding is `cl100k_base`.
- For newer models such as GPT-4o, it's `o200k_base`.
- For Codex models, it's `p50k_base`.
- For the older GPT-2, it's `gpt2`.

In [7]:
enc = tiktoken.get_encoding('cl100k_base')
sample_text = 'The art of Computer Programming by Donald E. Knuth'

tokens = enc.encode(sample_text)
print('Tokens:', tokens)
print('Number of tokens:', len(tokens))

decoded = enc.decode(tokens)
print('Decoded back:', decoded)

Tokens: [791, 1989, 315, 17863, 39524, 555, 9641, 469, 13, 13934, 952]
Number of tokens: 11
Decoded back: The art of Computer Programming by Donald E. Knuth


Colored visualization with `rich`:

In [8]:
from rich.console import Console
from rich.text import Text

console = Console()

def visualize_tokens(text, encoding):
    tokens = encoding.encode(text)
    decoded_tokens = [encoding.decode([t]) for t in tokens]

    rich_text = Text()
    colors = ["red", "green", "blue", "yellow", "magenta", "cyan"]

    for i, token_str in enumerate(decoded_tokens):
        color = colors[i % len(colors)]
        rich_text.append(token_str, style=color)
        rich_text.append(" ")  # space between tokens

    console.rule("[bold blue] Token Visualization [/bold blue]")
    console.print(rich_text)
    console.print(f"[bold]Number of tokens:[/bold] {len(tokens)}")

# Example usage
visualize_tokens(sample_text, enc)


The following code snippet demonstrates why “tokens are not equal to words”: common words may map to a single token, while less frequent or morphologically complex words are decomposed into several subword units.

In [9]:
# Example: see how words split into multiple tokens
examples = ["art", "database", "artificially", "tokenization"]

for word in examples:
    tokens = enc.encode(word)
    parts = [enc.decode([t]) for t in tokens]
    print(f"Word: {word}")
    print(f"Tokens: {tokens}")
    print(f"Split: {parts}\n")


Word: art
Tokens: [472]
Split: ['art']

Word: database
Tokens: [12494]
Split: ['database']

Word: artificially
Tokens: [472, 16895, 398]
Split: ['art', 'ificial', 'ly']

Word: tokenization
Tokens: [5963, 2065]
Split: ['token', 'ization']



### Reflection
- **Tokens are not equal to words.** Sometimes a word maps to a single token; sometimes it splits into several.
- **Subword tokenization is statistical, not semantic.** The splits reflect patterns learned from frequency in the training corpus, not linguistic structure. For example, “artificially” is broken into parts that are efficient for the model, not necessarily meaningful for humans.
- **Token counts can be counterintuitive.** A common word like “database” is a single token, while a less frequent derived form like “artificially” becomes three tokens, even though both read as one word.
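
To make the "statistical, not semantic" point concrete, here is a minimal sketch of byte-pair-encoding-style merge training on a tiny made-up corpus. It is a toy illustration only, not `tiktoken`'s actual implementation (which operates on bytes and ships with pre-trained merge tables). The function names `most_frequent_pair`, `merge_pair`, and `train_toy_bpe`, as well as the word counts, are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} corpus."""
    # Start with every word split into single characters.
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical word counts: "database" is frequent, "artificially" is rare.
corpus_counts = {"database": 50, "data": 30, "artificially": 1}
merges, words = train_toy_bpe(corpus_counts, num_merges=7)
for symbols, freq in words.items():
    print(freq, list(symbols))
```

With these made-up counts, seven merges are enough for the frequent words "database" and "data" to collapse into single symbols, while the rare "artificially" is still spelled out character by character. The merges follow frequency alone; nothing in the procedure knows about prefixes, roots, or suffixes.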

## Step 3: Estimate token usage
You can use `tiktoken` to calculate how many tokens are in a prompt. This is useful to:
- Avoid exceeding the **context window**.
- Estimate API **costs** (OpenAI charges per token).

In [10]:
long_text = 'Artificial intelligence will transform databases.' * 20
print('Characters:', len(long_text))
print('Tokens:', len(enc.encode(long_text)))

Characters: 980
Tokens: 121
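
The two uses listed above (staying inside the context window and estimating cost) reduce to simple arithmetic once the token count is known. The sketch below reuses the counts printed in the previous cell. The helper names `estimate_cost` and `fits_in_context`, the price of 0.50 per million tokens, and the 4096-token window are placeholder assumptions, not real pricing or limits for any specific model.

```python
def estimate_cost(num_tokens, price_per_million_tokens):
    """Estimated cost of `num_tokens` at a given price per 1M tokens."""
    return num_tokens / 1_000_000 * price_per_million_tokens


def fits_in_context(prompt_tokens, context_window, reserved_for_output=0):
    """True if the prompt plus the tokens reserved for the reply fit the window."""
    return prompt_tokens + reserved_for_output <= context_window


# Counts taken from the output of the previous cell.
num_characters = 980
num_tokens = 121

# Placeholder values: check the current pricing and model documentation.
PRICE_PER_MILLION = 0.50
CONTEXT_WINDOW = 4096

print(f"Characters per token: {num_characters / num_tokens:.2f}")
print(f"Estimated cost: {estimate_cost(num_tokens, PRICE_PER_MILLION):.7f}")
print("Fits in context:", fits_in_context(num_tokens, CONTEXT_WINDOW, reserved_for_output=1000))
```

Reserving part of the window for the model's reply matters because the prompt and the generated output usually share the same context window.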


### Exercises
1. Try different encodings: `gpt2`, `p50k_base`.
2. Count tokens for a long paragraph (e.g., from your notes).
3. Compare `len(text)` (characters) with number of tokens.
4. What happens if you try to encode a very long text (close to 4000 tokens)?
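
As a possible starting point for exercise 4, the sketch below splits a text into pieces that each stay under a token budget. The helper name `split_into_chunks` is hypothetical; it assumes only that `encoding` exposes `encode` and `decode`, as the `enc` object used throughout this notebook does.

```python
def split_into_chunks(text, encoding, max_tokens):
    """Split `text` into pieces of at most `max_tokens` tokens each.

    `encoding` can be any object with `encode(str)` and `decode(list)`
    methods, for example the result of tiktoken.get_encoding(...).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    token_ids = encoding.encode(text)
    return [
        encoding.decode(token_ids[start:start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]


# Usage with the notebook's encoder (assumes `enc` and `long_text` from above):
# chunks = split_into_chunks(long_text, enc, max_tokens=50)
# print(len(chunks))
```

Cutting strictly at token boundaries is the simplest choice, but it may split a sentence, or occasionally a multi-byte character, across two chunks. Splitting on paragraph or sentence boundaries first and then checking the token count of each piece is a gentler alternative.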