# Lab 1: Tokenization & Cost Analysis

## Learning Objectives
By the end of this lab, you will:
- Understand how LLMs break text into tokens using byte pair encoding (BPE)
- Analyze how different text types (code, prose, multilingual) tokenize differently
- Estimate API costs based on token counts
- Visualize context window evolution and understand O(n²) scaling implications

## Setup
Run the cell below to install required libraries.

In [2]:
!pip install tiktoken plotly -q

---
## Part 1: The Tokenizer

Tokens are not characters or words; they're **subword units** learned during training. Let's see how the same text breaks apart under the tokenizers of two OpenAI models.
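To make "learned subword units" concrete, here is a minimal sketch of the BPE training loop on a toy word-frequency table. It is an illustration only, not how tiktoken is implemented: production tokenizers work on bytes and ship with pretrained merge tables. The names `most_frequent_pair`, `merge_pair`, `learn_bpe`, and `toy_corpus` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common pair."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    if not pair_counts:
        return None
    return max(pair_counts, key=pair_counts.get)


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy pair merges."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy corpus (made-up frequencies): word -> number of occurrences
toy_corpus = {"low": 5, "lower": 2, "lowest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=2)
print("Merges learned:", merges)
print("Segmentation:", segmented)
```

After two merges the frequent stem `low` has become a single symbol, while the rarer suffixes are still split into characters. That is the same pattern you will see in real tokenizer output: common strings cost one token, uncommon ones cost several.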

In [6]:
import tiktoken


def analyze_tokenization(text):
    """Demonstrate how text breaks into tokens across different models."""
    # tiktoken ships OpenAI tokenizers; both models below map to the
    # cl100k_base encoding, so expect identical token counts for them
    encoders = {
        "GPT-4": tiktoken.encoding_for_model("gpt-4"),
        "GPT-3.5-turbo": tiktoken.encoding_for_model("gpt-3.5-turbo"),
    }

    results = {}
    for model_name, encoder in encoders.items():
        tokens = encoder.encode(text)
        decoded_tokens = [encoder.decode([token]) for token in tokens]

        results[model_name] = {
            "token_count": len(tokens),
            "tokens": decoded_tokens,
            # Flat illustrative rate of $1.50 per 1M tokens, not model-specific
            "cost_estimate": len(tokens) * 0.0000015,
        }

    return results

In [8]:
# Run on three different text types — observe how token counts vary
sample_code = "def calculate_sum(a, b):\n    return a + b"
sample_essay = "The transformer architecture revolutionized natural language processing."
sample_multilingual = "Bonjour! 你好! Today we're discussing tokens."

for sample_name, sample_text in [
    ("Code", sample_code),
    ("Essay", sample_essay),
    ("Multilingual", sample_multilingual),
]:
    print(f"\n{'=' * 50}")
    print(f"  {sample_name}")
    print(f"{'=' * 50}")
    print(f"Text: {sample_text}")
    print(f"Character count: {len(sample_text)}")
    token_info = analyze_tokenization(sample_text)

    for model, info in token_info.items():
        print(f"\n  {model}:")
        print(f"    Tokens: {info['token_count']}")
        print(f"    Breakdown: {info['tokens']}")
        print(f"    Approx. cost: ${info['cost_estimate']:.6f}")


  Code
Text: def calculate_sum(a, b):
    return a + b
Character count: 41

  GPT-4:
    Tokens: 12
    Breakdown: ['def', ' calculate', '_sum', '(a', ',', ' b', '):\n', '   ', ' return', ' a', ' +', ' b']
    Approx. cost: $0.000018

  GPT-3.5-turbo:
    Tokens: 12
    Breakdown: ['def', ' calculate', '_sum', '(a', ',', ' b', '):\n', '   ', ' return', ' a', ' +', ' b']
    Approx. cost: $0.000018

  Essay
Text: The transformer architecture revolutionized natural language processing.
Character count: 72

  GPT-4:
    Tokens: 9
    Breakdown: ['The', ' transformer', ' architecture', ' revolution', 'ized', ' natural', ' language', ' processing', '.']
    Approx. cost: $0.000013

  GPT-3.5-turbo:
    Tokens: 9
    Breakdown: ['The', ' transformer', ' architecture', ' revolution', 'ized', ' natural', ' language', ' processing', '.']
    Approx. cost: $0.000013

  Multilingual
Text: Bonjour! 你好! Today we're discussing tokens.
Character count: 43

  GPT-4:
    Tokens: 12
    Breakdown: ['Bo

### Exercise 1.1: Tokenize Your Own Samples

Choose **3 samples of your own** — try different formats to explore how tokenization varies:
- A paragraph from a news article
- A JSON or YAML snippet
- A sentence mixing Arabic and English

Run them through `analyze_tokenization()` and note which produces the most tokens per character.

In [11]:
# TODO: Define your 3 custom samples
my_sample_1 = """
Apple announced the release of its new iPhone 16 with advanced AI features 
yesterday. The device is expected to retail at $1,200 and will be available 
in stores starting next month. CEO Tim Cook emphasized the breakthrough 
camera technology and extended battery life.
"""  # Replace with your first sample
my_sample_2 = """{
    "user": {
        "id": 12345,
        "name": "John Smith",
        "email": "john@example.com",
        "preferences": {
            "language": "en",
            "notifications": true,
            "theme": "dark"
        }
    }
}"""  # Replace with your second sample
my_sample_3 = "I'm learning العربية and it's challenging but مفيد جداً for my career!"  # Replace with your third sample

for name, text in [
    ("Sample 1", my_sample_1),
    ("Sample 2", my_sample_2),
    ("Sample 3", my_sample_3),
]:
    if not text:
        print(f"\n⚠️  {name} is empty — add your text above!")
        continue
    print(f"\n{'=' * 50}")
    print(f"  {name}")
    print(f"{'=' * 50}")
    print(f"Text: {text[:80]}{'...' if len(text) > 80 else ''}")
    print(f"Characters: {len(text)}")
    info = analyze_tokenization(text)
    for model, data in info.items():
        ratio = data['token_count'] / len(text) if text else 0
        print(f"  {model}: {data['token_count']} tokens (ratio: {ratio:.2f} tokens/char)")


  Sample 1
Text: 
Apple announced the release of its new iPhone 16 with advanced AI features 
yes...
Characters: 272
  GPT-4: 55 tokens (ratio: 0.20 tokens/char)
  GPT-3.5-turbo: 55 tokens (ratio: 0.20 tokens/char)

  Sample 2
Text: {
    "user": {
        "id": 12345,
        "name": "John Smith",
        "emai...
Characters: 239
  GPT-4: 61 tokens (ratio: 0.26 tokens/char)
  GPT-3.5-turbo: 61 tokens (ratio: 0.26 tokens/char)

  Sample 3
Text: I'm learning العربية and it's challenging but مفيد جداً for my career!
Characters: 70
  GPT-4: 26 tokens (ratio: 0.37 tokens/char)
  GPT-3.5-turbo: 26 tokens (ratio: 0.37 tokens/char)
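
The ratios printed above can be ranked without re-running the tokenizer. The sketch below takes the token and character counts from that sample output and sorts them by density; `rank_by_token_density` and `example_counts` are hypothetical names, and the labels are shorthand for Samples 1 to 3.

```python
def rank_by_token_density(samples):
    """Sort (name, tokens, chars) records from most to fewest tokens per character."""
    ranked = []
    for name, tokens, chars in samples:
        density = tokens / chars if chars else 0.0
        ranked.append((name, round(density, 2)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Counts copied from the sample run above
example_counts = [
    ("News paragraph", 55, 272),
    ("JSON snippet", 61, 239),
    ("Arabic + English", 26, 70),
]

ranking = rank_by_token_density(example_counts)
for name, density in ranking:
    print(f"{name:<18} {density:.2f} tokens/char")
```

In this run the mixed Arabic and English sentence is the most token-dense and the news paragraph the least. Your own samples may rank differently.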


---
## Part 2: Cost Calculator

Now let's build a function to estimate real API costs. Here's current pricing data (approximate, as of early 2025):

In [12]:
# Pricing reference (USD per 1M tokens)
PRICING = {
    "GPT-4-Turbo": {"input": 10.00, "output": 30.00},
    "GPT-4o": {"input": 2.50, "output": 10.00},
    "GPT-3.5-Turbo": {"input": 0.50, "output": 1.50},
    "Claude-3.5-Sonnet": {"input": 3.00, "output": 15.00},
    "Gemini-1.5-Pro": {"input": 1.25, "output": 5.00},
}

print("Model Pricing (per 1M tokens):")
print(f"{'Model':<20} {'Input':>10} {'Output':>10}")
print("-" * 42)
for model, prices in PRICING.items():
    print(f"{model:<20} ${prices['input']:>8.2f}  ${prices['output']:>8.2f}")

Model Pricing (per 1M tokens):
Model                     Input     Output
------------------------------------------
GPT-4-Turbo          $   10.00  $   30.00
GPT-4o               $    2.50  $   10.00
GPT-3.5-Turbo        $    0.50  $    1.50
Claude-3.5-Sonnet    $    3.00  $   15.00
Gemini-1.5-Pro       $    1.25  $    5.00


In [None]:
def estimate_cost(text, model_name, expected_output_tokens=500):
    """
    Estimate the cost of processing text through a given model.
    
    Args:
        text: The input text to be processed
        model_name: Key from the PRICING dictionary
        expected_output_tokens: Estimated number of output tokens
    
    Returns:
        dict with input_tokens, output_tokens, input_cost, output_cost, total_cost
    
    Hints:
        - Use tiktoken with gpt-4 encoding to count input tokens
        - Look up prices from the PRICING dict
        - Cost = (token_count / 1_000_000) * price_per_million
    """
    # ✅ Count input tokens using tiktoken. The gpt-4 encoder is only a proxy:
    # GPT-4o, Claude, and Gemini use different tokenizers, so their counts are approximate.
    encoder = tiktoken.encoding_for_model("gpt-4")
    input_tokens = len(encoder.encode(text))

    # ✅ Look up pricing for the model
    prices = PRICING[model_name]

    # ✅ Calculate costs
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (expected_output_tokens / 1_000_000) * prices["output"]

    return {
        "model": model_name,
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

### Test Your Cost Function

**Scenario**: Estimate the cost of processing a 10-page document (~5,000 words ≈ ~6,500 tokens) across GPT-4-Turbo vs GPT-3.5-Turbo.
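
As a sanity check on the scenario, the arithmetic can be done by hand from the stated figures: 6,500 input tokens and an assumed 500 output tokens. The sketch below repeats the two relevant rows of the pricing table so it runs on its own; `SCENARIO_PRICES` and `scenario_cost` are hypothetical names.

```python
# USD per 1M tokens, copied from the pricing table in Part 2
SCENARIO_PRICES = {
    "GPT-4-Turbo": {"input": 10.00, "output": 30.00},
    "GPT-3.5-Turbo": {"input": 0.50, "output": 1.50},
}


def scenario_cost(model_name, input_tokens=6_500, output_tokens=500):
    """Cost = (tokens / 1M) * price per 1M, summed over input and output."""
    prices = SCENARIO_PRICES[model_name]
    input_cost = input_tokens / 1_000_000 * prices["input"]
    output_cost = output_tokens / 1_000_000 * prices["output"]
    return input_cost + output_cost


gpt4_turbo_total = scenario_cost("GPT-4-Turbo")
gpt35_total = scenario_cost("GPT-3.5-Turbo")
print(f"GPT-4-Turbo:   ${gpt4_turbo_total:.4f}")
print(f"GPT-3.5-Turbo: ${gpt35_total:.4f}")
print(f"Ratio:         {gpt4_turbo_total / gpt35_total:.0f}x")
```

At these prices the document costs about $0.08 on GPT-4-Turbo and about $0.004 on GPT-3.5-Turbo, a 20x difference. The simulated document will give slightly different numbers because its actual token count is not exactly 6,500.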

In [None]:
# Simulate a 10-page document (~5,000 words: 24 words per repetition x 208)
ten_page_doc = ("The rapid advancement of artificial intelligence has transformed "
                "industries across the globe. Organizations are increasingly adopting "
                "machine learning systems for automation, analysis, and decision-making. ") * 208

print(f"Document length: {len(ten_page_doc)} characters, ~{len(ten_page_doc.split())} words")
print()

for model in ["GPT-4-Turbo", "GPT-3.5-Turbo"]:
    result = estimate_cost(ten_page_doc, model, expected_output_tokens=500)
    if result["total_cost"] is not None:
        print(f"{model}:")
        print(f"  Input tokens:  {result['input_tokens']:,}")
        print(f"  Input cost:    ${result['input_cost']:.4f}")
        print(f"  Output cost:   ${result['output_cost']:.4f}")
        print(f"  Total cost:    ${result['total_cost']:.4f}")
        print()
    else:
        print(f"{model}: ⚠️  Complete the estimate_cost() function above first!")
        print()

---
## Part 3: Context Window Visualization

Context windows have grown dramatically. Let's visualize this evolution and understand the computational implications.

In [None]:
import plotly.graph_objects as go

# Context window sizes over time
models = [
    ("GPT-2", 2019, 1024),
    ("GPT-3", 2020, 2048),
    ("GPT-3.5", 2022, 4096),
    ("Claude 1", 2023, 9000),
    ("GPT-4", 2023, 8192),
    ("GPT-4-32K", 2023, 32768),
    ("Claude 2", 2023, 100000),
    ("GPT-4-Turbo", 2023, 128000),
    ("Gemini 1.5 Pro", 2024, 1000000),
    ("Claude 3.5", 2024, 200000),
]

names = [m[0] for m in models]
years = [m[1] for m in models]
sizes = [m[2] for m in models]

fig = go.Figure()
fig.add_trace(go.Bar(
    x=names,
    y=sizes,
    marker_color=["#9B8EC0" if y < 2023 else "#00C9A7" if y == 2023 else "#FF7A5C" for y in years],
    text=[f"{s:,}" for s in sizes],
    textposition="outside",
))

fig.update_layout(
    title="Context Window Evolution (tokens)",
    yaxis_title="Max Context Length (tokens)",
    yaxis_type="log",
    template="plotly_white",
    height=500,
)
fig.show()

In [None]:
def estimate_computation_requirements(context_length, batch_size=1):
    """
    Illustrate how computational requirements scale with context length.
    These numbers show relative growth patterns, not actual hardware specs.
    """
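    # Every token attends to every token, so the score matrix has n * n entries
    attention_computations = context_length ** 2
    # 4 bytes per float32 score, 8 attention heads, converted to GiB
    memory_gb = (attention_computations * 4 * 8) / (1024**3)
    # Arbitrary scale (1 ms per million entries), useful only for relative comparison
    latency_ms = attention_computations / 1_000_000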

    return {
        "context_tokens": context_length,
        "attention_computations": f"{attention_computations:,}",
        "estimated_memory_gb": round(memory_gb, 2),
        "approximate_latency_ms": round(latency_ms, 2),
    }


context_lengths = [1024, 4096, 8192, 32768, 128000]
print("Computation Requirements by Context Length (O(n²) scaling):")
print("-" * 72)
print(f"{'Tokens':>10}  {'Computations':>18}  {'Memory (GB)':>12}  {'Latency (ms)':>14}")
print("-" * 72)
for length in context_lengths:
    reqs = estimate_computation_requirements(length)
    print(
        f"{length:>10,}  {reqs['attention_computations']:>18}  "
        f"{reqs['estimated_memory_gb']:>12}  {reqs['approximate_latency_ms']:>14}"
    )
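
The quadratic growth in the table above can be checked directly by comparing two context lengths. This is a small sketch with hypothetical helper names (`attention_pairs`, `growth_factor`); like the function above, it counts only the n x n attention score matrix and ignores every other cost in a real model.

```python
def attention_pairs(context_length):
    """Number of query-key pairs in full self-attention: n squared."""
    return context_length ** 2


def growth_factor(shorter, longer):
    """How many times more pairs the longer context needs than the shorter one."""
    return attention_pairs(longer) / attention_pairs(shorter)


doubling = growth_factor(4096, 8192)       # context length x2
quadrupling = growth_factor(8192, 32768)   # context length x4
print(f"4,096 -> 8,192 tokens:  {doubling:.0f}x the attention pairs")
print(f"8,192 -> 32,768 tokens: {quadrupling:.0f}x the attention pairs")
```

Multiplying the context length by k multiplies the pair count by k², which is why a 2x longer window costs 4x and a 4x longer window costs 16x in this simplified model.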

---
## Reflection Questions

Answer these in a markdown cell below (double-click to edit):

1. **Token efficiency**: Which text format produced the fewest tokens per character? Why do you think that is?
2. **O(n²) scaling**: If you double the context window, by what factor do attention computations increase? What does this mean for real-time applications?
3. **Practical choice**: When would you choose a smaller context window (e.g., 4K) even though a larger one (128K) is available?

*Your answers here:*

1.  Token efficiency:
   The English essay had the fewest tokens per character (~0.125: 9 tokens 
   for 72 characters) because common English words usually map to a single 
   token. The mixed Arabic-English sentence had the most (~0.37), followed 
   by JSON (~0.26), where brackets, quotes, and indentation add tokens on 
   top of the data itself.
2.  O(n²) scaling:
   Doubling the context means 4x more attention computations (2² = 4).
   Example: 4K→8K goes from ~16.8M to ~67.1M operations.
   Result: longer contexts are slower and more expensive per request, which 
   matters most for latency-sensitive apps like chatbots.
3. Practical choice:
   Use 4K when you need speed (chatbots, autocomplete) or low cost 
   (high-traffic apps). Use 128K only when the task truly needs it 
   (analyzing full documents, research papers).

---
## Bonus: Token Count Comparison Across Formats

Create a bar chart comparing token counts for the **same information** expressed in different formats: English prose, Python code, JSON, and Arabic.

In [None]:
import plotly.graph_objects as go
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4")

# Same information in different formats
formats = {
    "English": "The user's name is Ahmed and he is 30 years old and lives in Riyadh.",
    "Python": 'user = {"name": "Ahmed", "age": 30, "city": "Riyadh"}',
    "JSON": '{"name": "Ahmed", "age": 30, "city": "Riyadh"}',
    "Arabic": "اسم المستخدم أحمد وعمره 30 سنة ويسكن في الرياض.",
}

labels = list(formats.keys())
token_counts = [len(encoder.encode(text)) for text in formats.values()]
char_counts = [len(text) for text in formats.values()]

fig = go.Figure()
fig.add_trace(go.Bar(
    name="Tokens",
    x=labels,
    y=token_counts,
    marker_color="#1C355E",
    text=token_counts,
    textposition="outside",
))
fig.add_trace(go.Bar(
    name="Characters",
    x=labels,
    y=char_counts,
    marker_color="#00C9A7",
    text=char_counts,
    textposition="outside",
))

fig.update_layout(
    title="Token vs Character Counts Across Formats",
    yaxis_title="Count",
    barmode="group",
    template="plotly_white",
    height=450,
)
fig.show()

# Print the ratio
print("\nTokens per character ratio:")
for label, tc, cc in zip(labels, token_counts, char_counts):
    print(f"  {label}: {tc/cc:.2f} tokens/char ({tc} tokens, {cc} chars)")