# Lab 1: Tokenization & Cost Analysis

Three ideas to take away:
1. **Tokens ≠ words** — LLMs see subword units, and counts vary by text type
2. **Tokens cost money** — pricing differs dramatically across models
3. **Context windows scale badly** — attention is O(n²)

Run every cell top-to-bottom and observe the outputs.

In [None]:
!uv pip install tiktoken plotly -q

---
## Part 1: Tokens ≠ Words

The tokenizer breaks text into **subword units** — not characters, not whole words.
Watch how the same encoder splits English prose, Python code, mixed-language text, and Arabic differently.

In [None]:
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")

samples = {
    "English prose": "The transformer architecture revolutionized natural language processing.",
    "Python code":   "def calculate_sum(a, b):\n    return a + b",
    "Multilingual":  "Bonjour! 你好! Today we're discussing tokens.",
    "Arabic":        "اسم المستخدم أحمد وعمره 30 سنة ويسكن في الرياض.",
}

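print(f"{'Sample':<20} {'Chars':>6} {'Tokens':>7} {'Tok/char':>8}   Breakdown")
print("-" * 80)
for name, text in samples.items():
    tokens = encoder.encode(text)
    # Decode each token on its own to show the pieces. A token that holds only
    # part of a multi-byte UTF-8 character prints as the replacement character "�".
    pieces = [encoder.decode([t]) for t in tokens]
    ratio  = len(tokens) / len(text)  # tokens per character
    print(f"{name:<20} {len(text):>6} {len(tokens):>7} {ratio:>8.2f}   {pieces}")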

**Key observation:** Arabic and other non-English text usually produces more tokens per character than plain English, because the encoder's BPE vocabulary was learned from mostly English text and has fewer long merges for other scripts.
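
One way to see why non-Latin scripts cost more: tiktoken's encoders run BPE over UTF-8 bytes, and Arabic letters take 2 bytes each while common Chinese characters take 3. The sketch below uses only the standard library to count bytes per character. The helper name `utf8_bytes_per_char` is made up for illustration, and the code does not reproduce the tokenizer; it only shows the raw material the tokenizer starts from before any learned merges.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in text."""
    return len(text.encode("utf-8")) / len(text)

# Short illustrative strings, one per script.
examples = {
    "English": "tokens",
    "French":  "Bonjour",
    "Chinese": "你好",
    "Arabic":  "الرياض",
}

for label, text in examples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{label:<8} {len(text):>2} chars  {n_bytes:>2} bytes  "
          f"{utf8_bytes_per_char(text):.1f} bytes/char")
```

A byte-level tokenizer needs learned merges to pack several of those bytes into one token, so a script with fewer merges in the vocabulary ends up with more tokens for the same number of characters.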

---
## Part 2: Tokens Cost Money

Every token in and out costs real money. Let's see what processing a 10-page document costs across different models.

In [None]:
# Pricing reference (USD per 1M tokens, approximate early 2025)
PRICING = {
    "GPT-4-Turbo":      {"input": 10.00, "output": 30.00},
    "GPT-4o":           {"input":  2.50, "output": 10.00},
    "GPT-3.5-Turbo":    {"input":  0.50, "output":  1.50},
    "Claude-3.5-Sonnet":{"input":  3.00, "output": 15.00},
    "Gemini-1.5-Pro":   {"input":  1.25, "output":  5.00},
}

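def estimate_cost(text, model_name, expected_output_tokens=500):
    # Approximation: every model is counted with the GPT-4 tokenizer. GPT-4o,
    # Claude, and Gemini use different tokenizers, so their real counts will differ.
    enc = tiktoken.encoding_for_model("gpt-4")
    input_tokens = len(enc.encode(text))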
    prices = PRICING[model_name]
    input_cost  = (input_tokens          / 1_000_000) * prices["input"]
    output_cost = (expected_output_tokens / 1_000_000) * prices["output"]
    return {
        "model":        model_name,
        "input_tokens": input_tokens,
        "input_cost":   input_cost,
        "output_cost":  output_cost,
        "total_cost":   input_cost + output_cost,
    }

# Simulate a 10-page document (~4,000 words: 24 words × 170 repeats)
doc = ("The rapid advancement of artificial intelligence has transformed "
       "industries across the globe. Organizations are increasingly adopting "
       "machine learning systems for automation, analysis, and decision-making. ") * 170

print(f"Document: {len(doc):,} chars, ~{len(doc.split()):,} words\n")
print(f"{'Model':<22} {'Input tokens':>13} {'Input $':>9} {'Output $':>9} {'Total $':>9}")
print("-" * 66)
for model in PRICING:
    r = estimate_cost(doc, model, expected_output_tokens=500)
    print(f"{r['model']:<22} {r['input_tokens']:>13,} {r['input_cost']:>9.4f} {r['output_cost']:>9.4f} {r['total_cost']:>9.4f}")

**Key observation:** GPT-4-Turbo costs 20× as much as GPT-3.5-Turbo for the same document, since both its input and output rates in the table above are 20× larger. Model choice matters enormously at scale.
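
To make "at scale" concrete, here is a small worked calculation. The workload numbers (10,000 documents per day, 5,000 input tokens and 500 output tokens each, a 30-day month) are hypothetical. The two prices are copied from the `PRICING` table above, and `monthly_cost` is an illustrative helper, not part of the lab code.

```python
# Prices in USD per 1M tokens, copied from the PRICING table in this lab.
PRICES = {
    "GPT-4-Turbo":   {"input": 10.00, "output": 30.00},
    "GPT-3.5-Turbo": {"input":  0.50, "output":  1.50},
}

def monthly_cost(model, docs_per_day, input_tokens, output_tokens, days=30):
    """Estimated monthly spend in USD for a fixed per-document workload."""
    p = PRICES[model]
    per_doc = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return per_doc * docs_per_day * days

# Hypothetical workload: 10,000 documents/day, 5,000 tokens in, 500 tokens out.
for model in PRICES:
    cost = monthly_cost(model, docs_per_day=10_000, input_tokens=5_000, output_tokens=500)
    print(f"{model:<14} ${cost:>10,.2f} per month")
```

Under these assumptions the bill is about $19,500 per month on GPT-4-Turbo versus about $975 on GPT-3.5-Turbo: the same 20× ratio, now measured in thousands of dollars.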

---
## Part 3: Context Windows & O(n²) Scaling

Context windows have grown from about 1K tokens (GPT-2, 2019) to 1M tokens (Gemini 1.5 Pro, 2024), but self-attention cost grows **quadratically** with context length.

In [None]:
import plotly.graph_objects as go

models = [
    ("GPT-2",         2019,    1_024),
    ("GPT-3",         2020,    2_048),
    ("GPT-3.5",       2022,    4_096),
    ("Claude 1",      2023,    9_000),
    ("GPT-4",         2023,    8_192),
    ("GPT-4-32K",     2023,   32_768),
    ("Claude 2",      2023,  100_000),
    ("GPT-4-Turbo",   2023,  128_000),
    ("Claude 3.5",    2024,  200_000),
    ("Gemini 1.5 Pro",2024,1_000_000),
]

names = [m[0] for m in models]
years = [m[1] for m in models]
sizes = [m[2] for m in models]

fig = go.Figure(go.Bar(
    x=names,
    y=sizes,
    marker_color=["#9B8EC0" if y < 2023 else "#00C9A7" if y == 2023 else "#FF7A5C" for y in years],
    text=[f"{s:,}" for s in sizes],
    textposition="outside",
))
fig.update_layout(
    title="Context Window Evolution (log scale)",
    yaxis_title="Max Context Length (tokens)",
    yaxis_type="log",
    template="plotly_white",
    height=480,
)
fig.show()

In [None]:
# O(n²): doubling context → 4× more attention computations
context_lengths = [1_024, 4_096, 8_192, 32_768, 128_000]

print(f"{'Tokens':>10}  {'Attention ops':>18}  {'Relative cost':>14}")
print("-" * 48)
base = context_lengths[0] ** 2
for n in context_lengths:
    ops = n ** 2
    print(f"{n:>10,}  {ops:>18,}  {ops/base:>13.0f}×")

**Key observation:** Going from 1,024 to 128,000 tokens makes self-attention **15,625× more expensive** to compute (125², matching the last row of the table). Bigger isn't always better.
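
The quadratic term comes from the attention score matrix, which has one entry per (query, key) pair. The sketch below counts those entries and estimates the memory needed if one such matrix were stored in full at 2 bytes per entry (for example float16). It is a back-of-the-envelope model for a single head in a single layer; the function names are illustrative, and real implementations may avoid materializing the whole matrix.

```python
def attention_scores(n_tokens: int) -> int:
    """Number of (query, key) pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens

def score_matrix_gib(n_tokens: int, bytes_per_entry: int = 2) -> float:
    """Size in GiB of one n x n score matrix (assumed 2 bytes per entry)."""
    return attention_scores(n_tokens) * bytes_per_entry / 2**30

# Compare each context length against the 1,024-token baseline.
for n in (1_024, 2_048, 128_000):
    ratio = attention_scores(n) / attention_scores(1_024)
    print(f"{n:>8,} tokens  {attention_scores(n):>15,} scores  "
          f"{ratio:>8,.0f}x  {score_matrix_gib(n):>8.3f} GiB")
```

Doubling the context from 1,024 to 2,048 tokens quadruples the number of scores, which is the pattern the table above reports for every doubling.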

---
## Bonus: Tokens vs Characters Across Formats

The same information expressed in different formats tokenizes very differently.

In [None]:
formats = {
    "English": "The user's name is Ahmed and he is 30 years old and lives in Riyadh.",
    "Python":  'user = {"name": "Ahmed", "age": 30, "city": "Riyadh"}',
    "JSON":    '{"name": "Ahmed", "age": 30, "city": "Riyadh"}',
    "Arabic":  "اسم المستخدم أحمد وعمره 30 سنة ويسكن في الرياض.",
}

labels = list(formats.keys())
token_counts = [len(encoder.encode(t)) for t in formats.values()]
char_counts  = [len(t) for t in formats.values()]

fig = go.Figure()
fig.add_trace(go.Bar(name="Tokens",     x=labels, y=token_counts, marker_color="#1C355E",
                     text=token_counts, textposition="outside"))
fig.add_trace(go.Bar(name="Characters", x=labels, y=char_counts,  marker_color="#00C9A7",
                     text=char_counts,  textposition="outside"))
fig.update_layout(
    title="Token vs Character Counts Across Formats",
    yaxis_title="Count",
    barmode="group",
    template="plotly_white",
    height=430,
)
fig.show()

print("\nTokens per character:")
for label, tc, cc in zip(labels, token_counts, char_counts):
    print(f"  {label:<10} {tc/cc:.2f} tokens/char   ({tc} tokens, {cc} chars)")