In [None]:
#@title 🎧 Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1K3QJmxvgc0_ZwXawSeV_oQll-6WeibaO", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/01_00_intro.mp3"))


In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

In [None]:
#@title 🎧 Listen: Notebook Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_02_notebook_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


# 🚀 Context Windows & Token Budgeting: Understanding the LLM's Working Desk

*Part 1 of the Vizuara series on Context Engineering for LLMs*
*Estimated time: 30 minutes*

Let us start with an analogy. Imagine you are sitting in an open-book exam. You have a desk of fixed size: say, enough room for exactly 10 pages of notes. You brought your textbook, your formula sheet, some practice problems, and a few blank pages to write your answers on.

Here is the catch: **if you bring too many pages, they fall off the desk.** The desk does not grow. You have to choose what goes on it, and what you leave behind.

This is *exactly* the situation a Large Language Model faces every single time it processes a request. The desk is the **context window**. The pages are **tokens**. And your job, as the developer, is to decide what goes on that desk.

Andrej Karpathy put it best:

> *"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."*

By the end of this notebook, you will understand what fills that desk, how to measure it, and how to budget it wisely, for any model and any application.

In [None]:
#@title 🎧 Listen: AI Assistant
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_03_ai_assistant.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


# 🤖 AI Teaching Assistant

Need help with this notebook? Open the **AI Teaching Assistant**. It has already read this entire notebook and can help with concepts, code, and exercises.

**[👉 Open AI Teaching Assistant](https://pods.vizuara.ai/courses/context-engineering-for-llms/practice/1/assistant)**

*Tip: Open it in a separate tab and work through this notebook side-by-side.*


## 1. 🤔 Why Does This Matter?

Every LLM has a **context window**: a hard limit on how many tokens it can process in a single call. Go over that limit and your request simply fails. Stay well under it and you are leaving capability on the table.

But here is the subtle part: the context window is not just "your prompt." It is shared across **six distinct components**, each competing for the same finite space:

| Component | Typical Budget | What It Contains |
|-----------|---------------|-----------------|
| System Prompt | ~2K tokens | Personality, rules, format instructions |
| User Message | ~1K tokens | The current question or request |
| Conversation History | ~20K tokens | Previous turns in the chat |
| Retrieved Context (RAG) | ~60K tokens | Documents fetched from a knowledge base |
| Tool Results | ~10K tokens | Outputs from function calls, APIs |
| Reserved for Output | ~35K tokens | Space the model needs to generate its answer |

Think of it this way: if you stuff 100K tokens of retrieved documents into a 128K window, you have left almost no room for the model to actually *think* and *respond*.
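To put numbers on that claim, here is a small arithmetic sketch. It assumes the other five components keep the "typical" budgets from the table; the variable names are only for illustration.

```python
# Typical per-component budgets from the table above (tokens).
WINDOW = 128_000
typical = {
    "system": 2_000,
    "user": 1_000,
    "history": 20_000,
    "tools": 10_000,
    "output": 35_000,
}

rag_tokens = 100_000  # the oversized retrieval from the example

left_for_everything_else = WINDOW - rag_tokens
needed_by_everything_else = sum(typical.values())
shortfall = needed_by_everything_else - left_for_everything_else

print(f"Left after RAG:     {left_for_everything_else:,}")
print(f"Needed by the rest: {needed_by_everything_else:,}")
print(f"Shortfall:          {shortfall:,}")
```

With 100K of retrieved text, the remaining 28K has to cover components that normally take 68K, so something (usually history or the output reservation) must shrink.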

Let us make this concrete with code.

In [None]:
#@title 🎧 Code Walkthrough: Imports Setup
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_05_imports_setup.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# First, let's install tiktoken, OpenAI's fast tokenizer library
# This works for GPT-style tokenizers and gives us ground truth token counts
!pip install tiktoken matplotlib numpy -q

In [None]:
import tiktoken
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
from typing import Dict, Optional, Tuple, List

# Use a clean style for all our visualizations
plt.rcParams.update({
    'figure.facecolor': 'white',
    'axes.facecolor': '#fafafa',
    'axes.grid': True,
    'grid.alpha': 0.3,
    'font.size': 11,
    'figure.dpi': 100,
})

print("✅ All imports ready. Let's explore context windows!")

In [None]:
#@title 🎧 Listen: What Is A Token
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_06_what_is_a_token.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 2. 💡 Building Intuition: What Is a Token, Really?

Before we can budget tokens, we need to understand what they are. A **token** is not a word. It is not a character. It is a *subword unit*: a chunk of text that the model's tokenizer has learned to treat as a single piece.

A rough rule of thumb: **1 token ≈ 4 characters** of English text, or about **¾ of a word**.

But this is only an approximation. The actual count depends on the specific tokenizer and the text itself. Let us see the difference.

In [None]:
#@title 🎧 Code Walkthrough: Token Estimation Code
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_07_token_estimation_code.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def rough_token_estimate(text: str) -> int:
    """Estimate tokens using the ~4 characters per token rule of thumb."""
    return len(text) // 4

def exact_token_count(text: str, model: str = "cl100k_base") -> int:
    """Count exact tokens using tiktoken (OpenAI's tokenizer).

    `model` is a tiktoken encoding name, not a model name. cl100k_base is
    used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002.
    """
    encoder = tiktoken.get_encoding(model)
    return len(encoder.encode(text))

# Let's test with different types of text
samples = {
    "Simple English": "The cat sat on the mat and looked out the window.",
    "Technical": "The transformer architecture uses multi-head self-attention mechanisms.",
    "Code snippet": "def forward(self, x): return self.linear(self.relu(self.norm(x)))",
    "JSON data": '{"name": "Alice", "age": 30, "scores": [95, 87, 92]}',
    "Repeated text": "buffalo " * 20,
}

print(f"{'Text Type':<20} {'Chars':>6} {'Rough Est.':>10} {'Exact (tiktoken)':>16} {'Ratio':>8}")
print("-" * 65)

for label, text in samples.items():
    chars = len(text)
    rough = rough_token_estimate(text)
    exact = exact_token_count(text)
    ratio = chars / exact if exact > 0 else 0
    print(f"{label:<20} {chars:>6} {rough:>10} {exact:>16} {ratio:>7.1f}:1")

In [None]:
#@title 🎧 Listen: Token Estimation Results
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_08_token_estimation_results.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


🤔 **Think about it:** Notice how the ratio of characters-to-tokens varies by text type. Code and JSON tend to use *more* tokens per character than plain English. This matters when you are budgeting: a 10K-character JSON blob from a tool call uses more of your context window than 10K characters of natural language.
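One practical consequence for tool results: how the JSON is serialized changes how much of the window it takes. The sketch below compares pretty-printed and compact output using only the standard library. The `tool_result` payload is invented for illustration, and the token figures use the 4-characters rule of thumb, so treat them as a proxy and re-measure with an exact tokenizer before relying on the saving.

```python
import json

def rough_tokens(text: str) -> int:
    """Rule-of-thumb estimate: about 4 characters per token."""
    return len(text) // 4

# Hypothetical tool result, invented for illustration.
tool_result = {
    "rows": [{"id": i, "name": f"user_{i}", "active": i % 2 == 0} for i in range(50)]
}

# Same data, two serializations: indented for humans vs. no extra whitespace.
pretty = json.dumps(tool_result, indent=2)
compact = json.dumps(tool_result, separators=(",", ":"))

print(f"pretty:  {len(pretty):>5} chars, ~{rough_tokens(pretty)} tokens")
print(f"compact: {len(compact):>5} chars, ~{rough_tokens(compact)} tokens")
```

Both strings decode to the same object, so the compact form costs nothing in information; it only drops whitespace that the model does not need.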

In [None]:
#@title 🎧 What to Look For: Viz Checkpoint Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_09_viz_checkpoint_1_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 3. 📊 Visualization Checkpoint 1: Seeing Tokenization in Action

Let us actually *see* how a tokenizer breaks text into pieces. This builds intuition for why token counts vary so much.

In [None]:
#@title 🎧 What to Look For: Visualize Tokens
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_10_visualize_tokens.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def visualize_tokens(text: str, max_display: int = 40):
    """Show how tiktoken breaks a string into individual tokens."""
    encoder = tiktoken.get_encoding("cl100k_base")
    token_ids = encoder.encode(text)
    tokens = [encoder.decode([tid]) for tid in token_ids]

    # Color each token differently for visibility
    colors = plt.cm.Set3(np.linspace(0, 1, min(len(tokens), max_display)))

    fig, ax = plt.subplots(figsize=(12, 2.5))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')
    ax.set_title(f'Tokenization: {len(tokens)} tokens from {len(text)} characters',
                 fontsize=13, fontweight='bold', pad=10)

    x_pos = 0.02
    y_pos = 0.7
    for i, (token, color) in enumerate(zip(tokens[:max_display], colors)):
        label = repr(token)[1:-1]  # Show whitespace clearly (named `label` to avoid shadowing IPython's display)
        text_width = max(len(label) * 0.015, 0.03)

        rect = mpatches.FancyBboxPatch(
            (x_pos, y_pos - 0.15), text_width + 0.01, 0.3,
            boxstyle="round,pad=0.005", facecolor=color, edgecolor='gray', alpha=0.8
        )
        ax.add_patch(rect)
        ax.text(x_pos + (text_width + 0.01) / 2, y_pos, label,
                ha='center', va='center', fontsize=8, fontfamily='monospace')

        x_pos += text_width + 0.015
        if x_pos > 0.95:
            x_pos = 0.02
            y_pos -= 0.4
            if y_pos < 0:
                break

    if len(tokens) > max_display:
        ax.text(0.5, 0.05, f'... and {len(tokens) - max_display} more tokens',
                ha='center', va='center', fontsize=10, style='italic', color='gray')

    plt.tight_layout()
    plt.show()

# Visualize tokenization of a technical sentence
visualize_tokens("Context engineering is the delicate art of filling the context window with just the right information.")

In [None]:
#@title 🎧 What to Look For: Visualize Code Vs Prose
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_11_visualize_code_vs_prose.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Now compare: how does the tokenizer handle code vs prose?
visualize_tokens('def calculate_budget(system=2000, user=1000, rag=60000): return sum([system, user, rag])')

In [None]:
#@title 🎧 Listen: Token Viz Results
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_12_token_viz_results.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


Notice how the tokenizer handles code differently: variable names get split into subwords, punctuation often gets its own token, and numbers may be chunked in unexpected ways. This is why **exact token counting matters** when you are working close to the limits of a context window.

In [None]:
#@title 🎧 Listen: The Math
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_13_the_math.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 4. The Mathematics: Token Budget Equation

Now that we have intuition, let us formalize. The total context window of an LLM is divided as:

$$T_{\text{total}} = T_{\text{system}} + T_{\text{history}} + T_{\text{RAG}} + T_{\text{tools}} + T_{\text{user}} + T_{\text{reserved}}$$

For a 128K-token model (like GPT-4-Turbo), a typical allocation looks like:

$$128{,}000 = 2{,}000 + 20{,}000 + 60{,}000 + 10{,}000 + 1{,}000 + 35{,}000$$

What does each term mean computationally?

- **$T_{\text{system}}$**: The instructions you give the model. These are present in *every* API call, so keeping them concise directly saves budget.
- **$T_{\text{history}}$**: Past conversation turns. In a chatbot, this grows with each exchange, so you must decide when to summarize or truncate.
- **$T_{\text{RAG}}$**: Retrieved documents. This is typically the *largest* consumer. If your vector search returns too much, you blow the budget here.
- **$T_{\text{tools}}$**: Results from function calls (e.g., a database query returning JSON). Often overlooked in budgeting.
- **$T_{\text{user}}$**: The current user message. Usually small, but can be large if the user pastes a document.
- **$T_{\text{reserved}}$**: Space for the model's response. **If you don't reserve enough, the output gets truncated mid-sentence.**

The key insight: **these components compete for a fixed resource.** Increasing one means decreasing another.
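Because the total is fixed, the budget equation can be rearranged to give the ceiling on any single term. As a worked example with the 128K allocation above, solving for the retrieval budget:

$$T_{\text{RAG}} \le T_{\text{total}} - \left(T_{\text{system}} + T_{\text{history}} + T_{\text{tools}} + T_{\text{user}} + T_{\text{reserved}}\right) = 128{,}000 - 68{,}000 = 60{,}000$$

If the expected response is short enough that the reservation can drop from 35,000 to 15,000 (an illustrative figure, not a recommendation), the ceiling rises by the same 20,000:

$$T_{\text{RAG}} \le 128{,}000 - 48{,}000 = 80{,}000$$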

In [None]:
#@title 🎧 Code Walkthrough: Define Budget
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_14_define_budget.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Let's define our context window components as a clean data structure

# Standard budget for a 128K model (in tokens)
DEFAULT_BUDGET = {
    "System Prompt":       2_000,
    "Conversation History": 20_000,
    "Retrieved Context (RAG)": 60_000,
    "Tool Results":        10_000,
    "User Message":        1_000,
    "Reserved for Output": 35_000,
}

COMPONENT_COLORS = {
    "System Prompt":       "#4C72B0",
    "Conversation History": "#55A868",
    "Retrieved Context (RAG)": "#C44E52",
    "Tool Results":        "#8172B2",
    "User Message":        "#CCB974",
    "Reserved for Output": "#64B5CD",
}

MODEL_LIMITS = {
    "GPT-3.5 (4K)":       4_096,
    "GPT-4 (32K)":        32_768,
    "GPT-4 Turbo (128K)": 128_000,
    "Gemini 1.5 Pro (1M)": 1_000_000,
}

def total_usage(budget: Dict[str, int]) -> int:
    """Sum all token allocations in a budget."""
    return sum(budget.values())

budget_total = total_usage(DEFAULT_BUDGET)
print(f"Total budget: {budget_total:,} tokens")
print(f"Target model: 128K")
print(f"Match: {'✅ Exact fit!' if budget_total == 128_000 else '❌ Mismatch!'}")

## 5. 🔧 Let's Build It: Your Turn (TODO #1)

Time to get your hands dirty. Implement the `estimate_tokens` function below. It should:

1. Accept a text string and an optional `method` parameter
2. When `method="rough"`, return the ~4 chars/token estimate
3. When `method="exact"`, use tiktoken
4. When `method="both"`, return a dictionary with both estimates

This is the kind of utility function you will use in every context engineering project.

In [None]:
#@title 🎧 Before You Start: Todo After
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_16_todo_1_after.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
#@title 🎧 Before You Start: Todo After
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_31_todo_2_after.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# ============ TODO ============
# Implement the estimate_tokens function below.
#
# Requirements:
#   - method="rough" → return int (len(text) // 4)
#   - method="exact" → return int (tiktoken count using cl100k_base)
#   - method="both"  → return dict {"rough": ..., "exact": ...}
#   - Handle empty strings gracefully (return 0 or {"rough": 0, "exact": 0})
#
# Hint: You already saw rough_token_estimate() and exact_token_count() above.
# ============ TODO ============

def estimate_tokens(text: str, method: str = "both") -> int | Dict[str, int]:
    """Estimate the token count of a text string.

    Args:
        text: The input text to tokenize.
        method: One of "rough", "exact", or "both".

    Returns:
        Token count (int) for "rough"/"exact", or dict for "both".
    """
    # YOUR CODE HERE (a reference solution is filled in below)
    if not text:
        if method == "both":
            return {"rough": 0, "exact": 0}
        return 0

    rough = len(text) // 4

    if method == "rough":
        return rough

    encoder = tiktoken.get_encoding("cl100k_base")
    exact = len(encoder.encode(text))

    if method == "exact":
        return exact

    # method == "both"
    return {"rough": rough, "exact": exact}

# ============ VERIFICATION ============
# Run this cell to check your implementation

test_text = "The transformer architecture revolutionized natural language processing in 2017."

rough_result = estimate_tokens(test_text, method="rough")
exact_result = estimate_tokens(test_text, method="exact")
both_result = estimate_tokens(test_text, method="both")

assert isinstance(rough_result, int), "rough should return an int"
assert isinstance(exact_result, int), "exact should return an int"
assert isinstance(both_result, dict), "both should return a dict"
assert "rough" in both_result and "exact" in both_result, "both dict must have 'rough' and 'exact' keys"
assert estimate_tokens("", method="rough") == 0, "empty string should return 0"
assert estimate_tokens("", method="both") == {"rough": 0, "exact": 0}, "empty string both should return zeros"

print("✅ All assertions passed!")
print(f"   Test text: '{test_text}'")
print(f"   Rough estimate: {rough_result} tokens")
print(f"   Exact count:    {exact_result} tokens")
print(f"   Both:           {both_result}")

In [None]:
#@title 🎧 What to Look For: Viz Checkpoint Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_17_viz_checkpoint_2_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 6. 📊 Visualization Checkpoint 2: The Context Window Budget

Now let us visualize *where* your tokens go. This is the single most important diagram in context engineering: a stacked bar chart showing how each component fills (or overfills) the context window.

In [None]:
#@title 🎧 What to Look For: Plot Context Budget
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_18_plot_context_budget.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def plot_context_budget(budget: Dict[str, int], model_limit: int,
                         title: str = "Context Window Budget Breakdown"):
    """Visualize a token budget as a stacked horizontal bar chart.

    Shows each component's allocation and whether the total exceeds the model limit.
    """
    components = list(budget.keys())
    values = list(budget.values())
    colors = [COMPONENT_COLORS.get(c, '#999999') for c in components]
    total = sum(values)

    fig, ax = plt.subplots(figsize=(12, 4))

    # Draw stacked horizontal bars
    left = 0
    bars = []
    for comp, val, color in zip(components, values, colors):
        bar = ax.barh(0, val, left=left, color=color, edgecolor='white',
                      linewidth=1.5, height=0.5)
        bars.append(bar)

        # Label inside the bar if there's room
        if val / model_limit > 0.05:
            pct = val / total * 100
            ax.text(left + val / 2, 0, f"{val / 1000:g}K\n({pct:.0f}%)",
                    ha='center', va='center', fontsize=9, fontweight='bold', color='white')
        left += val

    # Model limit line
    ax.axvline(x=model_limit, color='red', linewidth=2, linestyle='--', label=f'Model Limit ({model_limit // 1000}K)')

    # Overflow zone
    if total > model_limit:
        overflow = total - model_limit
        ax.axvspan(model_limit, total, alpha=0.15, color='red')
        ax.text(model_limit + overflow / 2, 0.35, f'⚠️ OVERFLOW\n{overflow / 1000:g}K tokens',
                ha='center', va='center', fontsize=10, color='red', fontweight='bold')

    ax.set_xlim(0, max(total, model_limit) * 1.05)
    ax.set_yticks([])
    ax.set_xlabel('Tokens', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold', pad=15)

    # Legend
    legend_patches = [mpatches.Patch(color=color, label=comp)
                      for comp, color in zip(components, colors)]
    legend_patches.append(mpatches.Patch(facecolor='white', edgecolor='red',
                                          linestyle='--', label=f'Model Limit'))
    ax.legend(handles=legend_patches, loc='upper center', bbox_to_anchor=(0.5, -0.15),
              ncol=3, fontsize=9, frameon=True)

    plt.tight_layout()
    plt.show()

    # Print summary
    status = "✅ Within budget" if total <= model_limit else f"❌ Over budget by {total - model_limit:,} tokens"
    print(f"\nTotal: {total:,} / {model_limit:,} tokens | {status}")

# Visualize the default 128K budget
plot_context_budget(DEFAULT_BUDGET, model_limit=128_000,
                    title="Standard Budget: GPT-4 Turbo (128K)")

In [None]:
#@title 🎧 Listen: Plot Context Budget Results
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_19_plot_context_budget_results.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


💡 **Insight:** Look at how RAG dominates the budget: nearly half of the entire context window. This is why retrieval quality matters so much. If your retriever returns irrelevant documents, you are wasting the most valuable real estate in the window.

Also notice that the output reservation (35K) is substantial. A shorter expected response means more room for input context.

In [None]:
#@title 🎧 Transition: What Goes Wrong Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_20_what_goes_wrong_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 7. What Happens When Things Go Wrong?

Let us simulate a realistic scenario: your RAG pipeline returns **80K tokens** instead of the budgeted 60K. What happens?
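A common guard against this kind of overflow is to cap the retrieved chunks at the RAG budget before the prompt is assembled. The following is a minimal sketch of that idea, not a prescribed method: `fit_chunks_to_budget` is a hypothetical helper, the chunks are assumed to arrive sorted most-relevant first, and the token counter is passed in so that either a rough or an exact counter can be used.

```python
from typing import Callable, List, Tuple

def fit_chunks_to_budget(
    chunks: List[str],
    budget: int,
    count_tokens: Callable[[str], int],
) -> Tuple[List[str], int]:
    """Keep chunks in ranked order until the next one would exceed the budget.

    Assumes `chunks` is already sorted most-relevant first.
    Returns the kept chunks and the number of tokens they use.
    """
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept, used

def rough(text: str) -> int:
    """Stand-in counter: the ~4 characters per token rule of thumb."""
    return len(text) // 4

# Hypothetical retrieval results, most relevant first (100, 200, 300, 100 tokens).
retrieved = ["a" * 400, "b" * 800, "c" * 1200, "d" * 400]
kept, used = fit_chunks_to_budget(retrieved, budget=350, count_tokens=rough)
print(f"kept {len(kept)} of {len(retrieved)} chunks, {used} tokens")
```

Stopping at the first chunk that does not fit preserves the ranking. Replacing `break` with `continue` would pack the budget more tightly, at the price of skipping a higher-ranked chunk in favor of a lower-ranked one.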

In [None]:
#@title 🎧 Code Walkthrough: Rag Overflow Scenario
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_21_rag_overflow_scenario.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Scenario: RAG returns more than expected
overflow_budget = DEFAULT_BUDGET.copy()
overflow_budget["Retrieved Context (RAG)"] = 80_000  # 20K more than planned

plot_context_budget(overflow_budget, model_limit=128_000,
                    title="⚠️ Overflow Scenario: RAG Returns 80K Tokens")

In [None]:
#@title 🎧 Code Walkthrough: Smaller Model Scenario
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_22_smaller_model_scenario.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Let's also see: what if we try to cram this into a smaller model?
small_budget = {
    "System Prompt":       1_000,
    "Conversation History": 1_500,
    "Retrieved Context (RAG)": 500,
    "Tool Results":        200,
    "User Message":        500,
    "Reserved for Output": 396,
}

plot_context_budget(small_budget, model_limit=4_096,
                    title="Tight Budget: GPT-3.5 (4K), Every Token Counts")

In [None]:
#@title 🎧 Narration: Model Size Reflection
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_23_model_size_reflection.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


🤔 **Think about it:** With only 4K tokens, you can barely fit a system prompt and a short conversation. This is why the jump from 4K to 128K models was so transformative: it unlocked RAG, multi-turn conversations, and tool use all at once.

But even with 1M tokens (Gemini 1.5 Pro), budgeting still matters. More context means more latency, more cost, and more chances for the model to get confused by irrelevant information. **Bigger is not always better.**

In [None]:
#@title 🎧 What to Look For: Viz Checkpoint Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_24_viz_checkpoint_3_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 8. 📊 Visualization Checkpoint 3: Multi-Model Comparison

Let us see how the same application's needs map onto different model sizes. This is a question every developer faces: "Which model do I need for my use case?"
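One simple way to frame that question is to pick the smallest window that holds the total budget. The sketch below is only illustrative: `smallest_model_that_fits` is a hypothetical helper, and `EXAMPLE_LIMITS` repeats the four context limits used in this notebook. It ignores cost, latency, and quality, which matter in a real decision.

```python
from typing import Dict, Optional

# Context limits for the four models discussed in this notebook (tokens).
EXAMPLE_LIMITS: Dict[str, int] = {
    "GPT-3.5 (4K)": 4_096,
    "GPT-4 (32K)": 32_768,
    "GPT-4 Turbo (128K)": 128_000,
    "Gemini 1.5 Pro (1M)": 1_000_000,
}

def smallest_model_that_fits(required_tokens: int,
                             limits: Dict[str, int]) -> Optional[str]:
    """Return the model with the smallest window that holds the budget, or None."""
    candidates = [(limit, name) for name, limit in limits.items()
                  if limit >= required_tokens]
    # Tuples sort by limit first, so min() picks the tightest fit.
    return min(candidates)[1] if candidates else None

for need in (3_500, 40_000, 128_000, 2_000_000):
    print(f"{need:>9,} tokens -> {smallest_model_that_fits(need, EXAMPLE_LIMITS)}")
```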

In [None]:
#@title 🎧 What to Look For: Multi Model Comparison
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_25_multi_model_comparison.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def scale_budget_to_model(base_budget: Dict[str, int], base_limit: int,
                           target_limit: int) -> Dict[str, int]:
    """Scale a budget proportionally to a different model size.

    Keeps the same relative proportions, scaled by target_limit / base_limit.
    The system prompt, user message, and output reservation have minimums to
    stay functional, so on very small windows the scaled total can slightly
    exceed target_limit.
    """
    ratio = target_limit / base_limit
    scaled = {}
    for component, tokens in base_budget.items():
        scaled_val = int(tokens * ratio)
        # Enforce minimums for critical components
        if component == "System Prompt":
            scaled_val = max(scaled_val, 200)
        elif component == "User Message":
            scaled_val = max(scaled_val, 100)
        elif component == "Reserved for Output":
            scaled_val = max(scaled_val, 200)
        scaled[component] = scaled_val
    return scaled


def plot_multi_model_comparison(base_budget: Dict[str, int],
                                 model_limits: Dict[str, int]):
    """Compare token budgets across multiple model sizes."""
    fig, ax = plt.subplots(figsize=(14, 6))

    model_names = list(model_limits.keys())
    y_positions = range(len(model_names))
    bar_height = 0.5

    for i, (model_name, limit) in enumerate(model_limits.items()):
        budget = scale_budget_to_model(base_budget, 128_000, limit)

        left = 0
        for comp, val in budget.items():
            color = COMPONENT_COLORS.get(comp, '#999999')
            ax.barh(i, val, left=left, color=color, edgecolor='white',
                    linewidth=1, height=bar_height)
            left += val

        # Model limit marker
        ax.plot(limit, i, 'r|', markersize=20, markeredgewidth=2)

        # Label total
        total = sum(budget.values())
        ax.text(total + limit * 0.02, i, f'{total:,.0f} tokens',
                va='center', fontsize=9, color='#333')

    ax.set_yticks(y_positions)
    ax.set_yticklabels(model_names, fontsize=11)
    ax.set_xlabel('Tokens', fontsize=12)
    ax.set_title('Context Budget Across Model Sizes', fontsize=14, fontweight='bold')
    ax.set_xscale('log')
    ax.set_xlim(100, 2_000_000)

    # Legend
    legend_patches = [mpatches.Patch(color=color, label=comp)
                      for comp, color in COMPONENT_COLORS.items()]
    ax.legend(handles=legend_patches, loc='upper center', bbox_to_anchor=(0.5, -0.12),
              ncol=3, fontsize=9, frameon=True)

    plt.tight_layout()
    plt.show()

plot_multi_model_comparison(DEFAULT_BUDGET, MODEL_LIMITS)

In [None]:
#@title 🎧 Listen: Multi Model Takeaway
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_26_multi_model_takeaway.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


💡 **Key takeaway from this visualization:** The log scale is what keeps the 4K model visible at all; on a linear axis it would barely register next to 1M tokens. The jump from 32K to 128K is where RAG-heavy applications become feasible. And 1M tokens? That is enough to fit an entire codebase or book, but you pay for it in latency and cost.

In [None]:
#@title 🎧 Transition: Budget Calculator Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_27_budget_calculator_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 9. Let's Build It: The Budget Calculator

Now let us build a proper budget calculator: a function that takes your desired allocations and tells you exactly where you stand.

In [None]:
#@title 🎧 Code Walkthrough: Budget Calculator Code
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_28_budget_calculator_code.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def budget_calculator(budget: Dict[str, int], model_limit: int) -> Dict:
    """Analyze a token budget against a model's context limit.

    Returns a detailed analysis including:
    - Total usage and remaining capacity
    - Per-component percentages
    - Warnings for over-budget or tight allocations
    - Suggestions for rebalancing
    """
    total = sum(budget.values())
    remaining = model_limit - total
    utilization = total / model_limit * 100

    analysis = {
        "model_limit": model_limit,
        "total_used": total,
        "remaining": remaining,
        "utilization_pct": round(utilization, 1),
        "is_over_budget": total > model_limit,
        "components": {},
        "warnings": [],
        "suggestions": [],
    }

    # Analyze each component
    for comp, tokens in budget.items():
        pct_of_total = tokens / total * 100 if total > 0 else 0
        pct_of_limit = tokens / model_limit * 100
        analysis["components"][comp] = {
            "tokens": tokens,
            "pct_of_total": round(pct_of_total, 1),
            "pct_of_limit": round(pct_of_limit, 1),
        }

    # Generate warnings
    if total > model_limit:
        analysis["warnings"].append(
            f"⚠️ OVER BUDGET by {total - model_limit:,} tokens! "
            f"Reduce allocations by at least {(total - model_limit) / total * 100:.1f}%."
        )

    output_budget = budget.get("Reserved for Output", 0)
    if output_budget < model_limit * 0.1:
        analysis["warnings"].append(
            f"⚠️ Output reservation ({output_budget:,}) is less than 10% of the window. "
            f"Responses may be truncated."
        )

    rag_budget = budget.get("Retrieved Context (RAG)", 0)
    if rag_budget > model_limit * 0.6:
        analysis["warnings"].append(
            f"⚠️ RAG allocation ({rag_budget:,}) exceeds 60% of the window. "
            f"Consider more aggressive retrieval filtering."
        )

    if remaining > model_limit * 0.2 and not analysis["is_over_budget"]:
        analysis["suggestions"].append(
            f"üí° You have {remaining:,} tokens unused ({remaining / model_limit * 100:.0f}%). "
            f"Consider expanding RAG or history allocation."
        )

    if utilization > 90 and not analysis["is_over_budget"]:
        analysis["suggestions"].append(
            f"üí° Running at {utilization:.0f}% utilization. Leave a small buffer for "
            f"variable-length inputs."
        )

    return analysis


def print_budget_report(analysis: Dict):
    """Pretty-print a budget analysis report."""
    print("=" * 60)
    print(f"  CONTEXT WINDOW BUDGET REPORT")
    print(f"  Model limit: {analysis['model_limit']:,} tokens")
    print("=" * 60)

    print(f"\n{'Component':<30} {'Tokens':>8} {'% Total':>8} {'% Limit':>8}")
    print("-" * 56)
    for comp, info in analysis["components"].items():
        print(f"{comp:<30} {info['tokens']:>8,} {info['pct_of_total']:>7.1f}% {info['pct_of_limit']:>7.1f}%")

    print("-" * 56)
    status = "🔴 OVER" if analysis["is_over_budget"] else "🟢 OK"
    print(f"{'TOTAL':<30} {analysis['total_used']:>8,} {'100.0':>7}% {analysis['utilization_pct']:>7.1f}%  {status}")
    print(f"{'Remaining':<30} {analysis['remaining']:>8,}")

    if analysis["warnings"]:
        print("\nWARNINGS")
        for w in analysis["warnings"]:
            print(f"  {w}")

    if analysis["suggestions"]:
        print("\nSUGGESTIONS")
        for s in analysis["suggestions"]:
            print(f"  {s}")

    print()

# Test with our default budget
report = budget_calculator(DEFAULT_BUDGET, 128_000)
print_budget_report(report)

In [None]:
#@title 🎧 Code Walkthrough: Budget Calculator Overflow
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_29_budget_calculator_overflow.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Now test with the overflow scenario
overflow_budget = DEFAULT_BUDGET.copy()
overflow_budget["Retrieved Context (RAG)"] = 80_000

report_overflow = budget_calculator(overflow_budget, 128_000)
print_budget_report(report_overflow)

## 10. 🔧 Your Turn (TODO #2): The Budget Optimizer

Here is the real challenge. Given a set of **constraints** (minimums and maximums for each component) and a total token budget, find the **optimal allocation**.

For example: "I need at least 40K for RAG and at least 20K for output. How should I distribute the rest?"

This is a constrained optimization problem. The simplest approach is to satisfy all minimums first, then distribute the remaining tokens in proportion to each component's priority, skipping components that have already reached their maximum.
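To see why a single proportional pass is not always enough, here is a minimal sketch of one pass on a made-up two-component budget. The component names `A` and `B` and all the numbers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical two-component budget with 100 tokens in total.
constraints = {
    "A": {"min": 10, "max": 30, "priority": 0.5},
    "B": {"min": 20, "max": 90, "priority": 0.5},
}
max_tokens = 100

# Give every component its minimum first.
allocation = {name: c["min"] for name, c in constraints.items()}
remaining = max_tokens - sum(allocation.values())  # 100 - 30 = 70

# One proportional pass, capped at each component's max.
total_priority = sum(c["priority"] for c in constraints.values())
handed_out = 0
for name, c in constraints.items():
    share = int(remaining * c["priority"] / total_priority)  # 35 for each
    room = c["max"] - allocation[name]
    extra = min(share, room)
    allocation[name] += extra
    handed_out += extra
remaining -= handed_out

print(allocation, remaining)  # {'A': 30, 'B': 55} 15
```

Component `A` reaches its cap after taking only 20 of its 35-token share, so 15 tokens are left over. A second pass would hand them to `B`, the only component that still has room.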

In [None]:
# ============ TODO ============
# Implement the budget_optimizer function.
#
# Input:
#   constraints: Dict[str, Dict] where each key is a component name and value is:
#       {"min": int, "max": int, "priority": float}
#       - min: minimum tokens needed (hard constraint)
#       - max: maximum tokens useful (soft cap)
#       - priority: how much this component benefits from extra tokens (0.0 to 1.0)
#   max_tokens: int, the total context window size
#
# Algorithm:
#   1. Start by giving each component its minimum
#   2. Calculate remaining tokens after all minimums
#   3. If remaining < 0, raise ValueError (constraints are infeasible)
#   4. Distribute remaining tokens proportionally by priority,
#      but never exceed a component's max
#   5. If there are still tokens left after all components hit max,
#      add them to "Reserved for Output"
#
# Return: Dict[str, int] ‚Äî the optimized allocation
# ============ TODO ============

def budget_optimizer(
    constraints: Dict[str, Dict],
    max_tokens: int
) -> Dict[str, int]:
    """Find the optimal token allocation given constraints and a total budget.

    Args:
        constraints: Per-component constraints with min, max, and priority.
        max_tokens: Total context window size.

    Returns:
        Optimized allocation as {component_name: token_count}.
    """
    # Step 1: Assign minimums
    allocation = {comp: info["min"] for comp, info in constraints.items()}

    # Step 2: Calculate remaining
    remaining = max_tokens - sum(allocation.values())

    # Step 3: Check feasibility
    if remaining < 0:
        raise ValueError(
            f"Infeasible! Minimums sum to {sum(allocation.values()):,} "
            f"but model limit is {max_tokens:,}. "
            f"Over by {-remaining:,} tokens."
        )

    # Step 4: Distribute remaining by priority, respecting maxes
    # We may need multiple passes because when one component hits its max,
    # leftover tokens redistribute to others.
    components_with_room = {
        comp: info for comp, info in constraints.items()
        if allocation[comp] < info["max"]
    }

    while remaining > 0 and components_with_room:
        total_priority = sum(info["priority"] for info in components_with_room.values())
        if total_priority == 0:
            break

        distributed_this_round = 0
        newly_maxed = []

        for comp, info in components_with_room.items():
            share = int(remaining * (info["priority"] / total_priority))
            room = info["max"] - allocation[comp]
            addition = min(share, room)
            allocation[comp] += addition
            distributed_this_round += addition

            if allocation[comp] >= info["max"]:
                newly_maxed.append(comp)

        remaining -= distributed_this_round

        # Remove maxed-out components
        for comp in newly_maxed:
            del components_with_room[comp]

        # Safety: if no progress was made, break to avoid infinite loop
        if distributed_this_round == 0:
            break

    # Step 5: Any leftover goes to output reservation.
    # "max" is a soft cap, so this can push the output reservation above its max.
    if remaining > 0:
        allocation["Reserved for Output"] = allocation.get("Reserved for Output", 0) + remaining

    return allocation


# ============ VERIFICATION ============

test_constraints = {
    "System Prompt":       {"min": 1_000, "max": 3_000,  "priority": 0.1},
    "Conversation History": {"min": 5_000, "max": 30_000, "priority": 0.2},
    "Retrieved Context (RAG)": {"min": 40_000, "max": 70_000, "priority": 0.4},
    "Tool Results":        {"min": 2_000, "max": 15_000, "priority": 0.1},
    "User Message":        {"min": 500,   "max": 2_000,  "priority": 0.05},
    "Reserved for Output": {"min": 20_000,"max": 50_000, "priority": 0.15},
}

optimized = budget_optimizer(test_constraints, max_tokens=128_000)

print("✅ Optimized Budget Allocation:")
print(f"{'Component':<30} {'Tokens':>10} {'Min':>8} {'Max':>8}")
print("-" * 58)
for comp, tokens in optimized.items():
    info = test_constraints[comp]
    status = "✓" if info["min"] <= tokens <= info["max"] else "✗"
    print(f"{comp:<30} {tokens:>10,} {info['min']:>8,} {info['max']:>8,}  {status}")

total = sum(optimized.values())
print("-" * 58)
print(f"{'Total':<30} {total:>10,}")
print(f"\n{'Budget used:':<30} {total:,} / 128,000 ({total/128_000*100:.1f}%)")
assert total <= 128_000, f"Over budget! {total:,} > 128,000"
assert all(optimized[c] >= test_constraints[c]["min"] for c in test_constraints), "Minimums violated!"
print("✅ All constraints satisfied!")

In [None]:
#@title 🎧 Transition: Optimized Viz Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_32_optimized_viz_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


Excellent! Let us visualize the optimized allocation alongside the default one to see the difference.

In [None]:
#@title 🎧 What to Look For: Plot Budget Comparison
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_33_plot_budget_comparison.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def plot_budget_comparison(budgets: Dict[str, Dict[str, int]], model_limit: int):
    """Compare multiple budget allocations side by side."""
    fig, axes = plt.subplots(len(budgets), 1, figsize=(13, 3 * len(budgets)),
                             sharex=True)
    if len(budgets) == 1:
        axes = [axes]

    for ax, (budget_name, budget) in zip(axes, budgets.items()):
        total = sum(budget.values())
        left = 0

        for comp, val in budget.items():
            color = COMPONENT_COLORS.get(comp, '#999999')
            ax.barh(0, val, left=left, color=color, edgecolor='white',
                    linewidth=1.5, height=0.5)
            if val / model_limit > 0.04:
                ax.text(left + val / 2, 0, f"{val // 1000}K",
                        ha='center', va='center', fontsize=9,
                        fontweight='bold', color='white')
            left += val

        ax.axvline(x=model_limit, color='red', linewidth=2, linestyle='--')
        ax.set_yticks([])
        ax.set_title(f"{budget_name} | Total: {total:,} tokens ({total/model_limit*100:.0f}%)",
                     fontsize=12, fontweight='bold')
        ax.set_xlim(0, model_limit * 1.05)

    axes[-1].set_xlabel('Tokens', fontsize=12)

    legend_patches = [mpatches.Patch(color=color, label=comp)
                      for comp, color in COMPONENT_COLORS.items()]
    fig.legend(handles=legend_patches, loc='upper center',
               bbox_to_anchor=(0.5, -0.02), ncol=3, fontsize=9, frameon=True)

    plt.tight_layout()
    plt.show()

plot_budget_comparison({
    "Default (Hand-Tuned)": DEFAULT_BUDGET,
    "Optimizer (RAG-Heavy Constraints)": optimized,
}, model_limit=128_000)

In [None]:
#@title 🎧 Listen: Optimized Viz Results
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_34_optimized_viz_results.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


💡 **Notice the difference?** The optimizer respected our constraint that RAG needs at least 40K tokens and distributed the remaining budget according to priorities, whereas the hand-tuned budget used fixed allocations. In practice, you would run the optimizer separately for each use case: a chatbot needs more history, a RAG system needs more retrieval space, and a code generator needs more output reservation.
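One way to make "run the optimizer for each use case" concrete is to keep a constraints profile per application. The sketch below is only an illustration: the three profiles, their numbers, and the helper names `make_profile`, `headroom`, and `top_priority` are all hypothetical, not recommendations.

```python
from typing import Dict, List, Tuple

Constraints = Dict[str, Dict[str, float]]

COMPONENTS = [
    "System Prompt", "Conversation History", "Retrieved Context (RAG)",
    "Tool Results", "User Message", "Reserved for Output",
]

def make_profile(rows: List[Tuple[int, int, float]]) -> Constraints:
    """Build a constraints dict from (min, max, priority) tuples, one per component."""
    return {
        name: {"min": lo, "max": hi, "priority": pri}
        for name, (lo, hi, pri) in zip(COMPONENTS, rows)
    }

# Hypothetical profiles: same components, different minimums and priorities.
PROFILES = {
    "chatbot": make_profile([
        (1_000, 3_000, 0.05), (20_000, 60_000, 0.40), (10_000, 40_000, 0.20),
        (2_000, 10_000, 0.10), (500, 2_000, 0.05), (10_000, 30_000, 0.20),
    ]),
    "rag_system": make_profile([
        (1_000, 3_000, 0.05), (5_000, 20_000, 0.10), (40_000, 80_000, 0.50),
        (2_000, 10_000, 0.10), (500, 2_000, 0.05), (15_000, 30_000, 0.20),
    ]),
    "code_generator": make_profile([
        (1_000, 3_000, 0.05), (5_000, 20_000, 0.15), (15_000, 40_000, 0.20),
        (3_000, 15_000, 0.10), (500, 4_000, 0.05), (30_000, 60_000, 0.45),
    ]),
}

def headroom(constraints: Constraints, max_tokens: int) -> int:
    """Tokens left to distribute once every minimum is satisfied."""
    return max_tokens - sum(c["min"] for c in constraints.values())

def top_priority(constraints: Constraints) -> str:
    """The component that benefits most from extra tokens."""
    return max(constraints, key=lambda name: constraints[name]["priority"])

for use_case, profile in PROFILES.items():
    print(f"{use_case:<15} favors {top_priority(profile):<25} "
          f"headroom: {headroom(profile, 128_000):,}")
```

Each profile has the same shape as `test_constraints`, so it could be passed directly to `budget_optimizer(profile, max_tokens=128_000)`.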

## 11. 📊 Putting It All Together: The Final Dashboard

Let us build the capstone visualization: a multi-panel dashboard that shows everything we have learned in one view.

In [None]:
#@title 🎧 What to Look For: Create Dashboard
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_36_create_dashboard.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def create_dashboard(base_budget: Dict[str, int], model_limits: Dict[str, int]):
    """Create a comprehensive context engineering dashboard.

    Panel 1: Pie chart of budget components (proportions)
    Panel 2: What happens when RAG overflows (bar chart progression)
    Panel 3: Optimal vs suboptimal allocation comparison
    Panel 4: Multi-model scaling (how budget maps to different models)
    """
    fig = plt.figure(figsize=(16, 14))
    fig.suptitle('Context Engineering Dashboard', fontsize=18, fontweight='bold', y=0.98)

    # ── Panel 1: Budget Proportions (Pie) ──
    ax1 = fig.add_subplot(2, 2, 1)
    components = list(base_budget.keys())
    values = list(base_budget.values())
    colors = [COMPONENT_COLORS.get(c, '#999') for c in components]

    wedges, texts, autotexts = ax1.pie(
        values, labels=None, autopct='%1.0f%%', colors=colors,
        startangle=90, pctdistance=0.75,
        wedgeprops=dict(width=0.5, edgecolor='white', linewidth=2)
    )
    for t in autotexts:
        t.set_fontsize(9)
        t.set_fontweight('bold')

    ax1.set_title('Budget Proportions (128K Model)', fontsize=12, fontweight='bold', pad=15)
    ax1.legend(components, loc='center left', bbox_to_anchor=(-0.3, 0.5), fontsize=8)

    # ── Panel 2: RAG Overflow Progression ──
    ax2 = fig.add_subplot(2, 2, 2)

    rag_scenarios = [40_000, 60_000, 80_000, 100_000]
    scenario_labels = ['40K (Light)', '60K (Normal)', '80K (Heavy)', '100K (Extreme)']

    for i, (rag_val, label) in enumerate(zip(rag_scenarios, scenario_labels)):
        scenario = base_budget.copy()
        scenario["Retrieved Context (RAG)"] = rag_val

        left = 0
        for comp, val in scenario.items():
            color = COMPONENT_COLORS.get(comp, '#999')
            ax2.barh(i, val, left=left, color=color, edgecolor='white',
                     linewidth=0.5, height=0.6)
            left += val

    ax2.axvline(x=128_000, color='red', linewidth=2, linestyle='--', label='128K Limit')
    ax2.set_yticks(range(len(scenario_labels)))
    ax2.set_yticklabels(scenario_labels, fontsize=9)
    ax2.set_xlabel('Tokens', fontsize=10)
    ax2.set_title('RAG Overflow Scenarios', fontsize=12, fontweight='bold')
    ax2.legend(fontsize=9)
    ax2.set_xlim(0, 180_000)

    # ── Panel 3: Optimal vs Suboptimal ──
    ax3 = fig.add_subplot(2, 2, 3)

    # "Suboptimal" = naive equal split
    equal_split = {comp: 128_000 // len(base_budget) for comp in base_budget}
    # Adjust to sum exactly to 128K
    diff = 128_000 - sum(equal_split.values())
    first_key = list(equal_split.keys())[0]
    equal_split[first_key] += diff

    allocations = {"Optimized\n(Proportional)": base_budget, "Naive\n(Equal Split)": equal_split}

    for j, (alloc_name, alloc) in enumerate(allocations.items()):
        left = 0
        for comp, val in alloc.items():
            color = COMPONENT_COLORS.get(comp, '#999')
            ax3.barh(j, val, left=left, color=color, edgecolor='white',
                     linewidth=0.5, height=0.5)
            left += val

    ax3.axvline(x=128_000, color='red', linewidth=2, linestyle='--')
    ax3.set_yticks(range(len(allocations)))
    ax3.set_yticklabels(list(allocations.keys()), fontsize=10)
    ax3.set_xlabel('Tokens', fontsize=10)
    ax3.set_title('Optimized vs Naive Allocation', fontsize=12, fontweight='bold')
    ax3.set_xlim(0, 140_000)

    # ── Panel 4: Multi-Model Scaling ──
    ax4 = fig.add_subplot(2, 2, 4)

    model_names = list(model_limits.keys())
    model_totals = list(model_limits.values())

    # For each model, show how much of the default budget fits
    for i, (name, limit) in enumerate(model_limits.items()):
        scaled = scale_budget_to_model(base_budget, 128_000, limit)
        left = 0
        for comp, val in scaled.items():
            color = COMPONENT_COLORS.get(comp, '#999')
            ax4.barh(i, val, left=left, color=color, edgecolor='white',
                     linewidth=0.5, height=0.5)
            left += val

        ax4.text(sum(scaled.values()) * 1.05, i, f'{limit:,}',
                va='center', fontsize=8, color='gray')

    ax4.set_yticks(range(len(model_names)))
    ax4.set_yticklabels(model_names, fontsize=9)
    ax4.set_xlabel('Tokens', fontsize=10)
    ax4.set_title('Budget Scaling by Model Size', fontsize=12, fontweight='bold')
    ax4.set_xscale('symlog', linthresh=1000)

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

create_dashboard(DEFAULT_BUDGET, MODEL_LIMITS)

In [None]:
#@title 🎧 Narration: Final Scenario Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_37_final_scenario_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 12. 🎯 Training / Results: What We Built

Let us run one final end-to-end example that ties everything together. Imagine you are building a customer support chatbot with RAG. Let us plan its token budget.

In [None]:
#@title 🎧 Code Walkthrough: Chatbot Scenario Code
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_38_chatbot_scenario_code.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Real-world scenario: Customer Support Chatbot
print("=" * 60)
print("  SCENARIO: Customer Support Chatbot with RAG")
print("  Model: GPT-4 Turbo (128K context window)")
print("=" * 60)

# Step 1: Define our system prompt and measure it
system_prompt = """You are a helpful customer support agent for TechCorp.

Rules:
- Always be polite and professional
- If you don't know the answer, say so honestly
- Reference specific documentation when possible
- Never make up product features or pricing
- For billing issues, always recommend contacting billing@techcorp.com
- Format responses in clear, numbered steps when giving instructions"""

system_tokens = estimate_tokens(system_prompt, method="both")
print(f"\n1. System prompt measured: {system_tokens}")

# Step 2: Estimate typical conversation history (10 turns)
sample_turn = "Customer: I'm having trouble resetting my password. Agent: I'd be happy to help! Let me walk you through the password reset process step by step."
turn_tokens = estimate_tokens(sample_turn, method="exact")
history_estimate = turn_tokens * 10  # 10 turns of conversation
print(f"2. History estimate (10 turns): ~{history_estimate:,} tokens")

# Step 3: Define constraints based on our application
chatbot_constraints = {
    "System Prompt":       {"min": system_tokens["exact"], "max": system_tokens["exact"] + 500, "priority": 0.05},
    "Conversation History": {"min": 5_000,  "max": 25_000, "priority": 0.25},
    "Retrieved Context (RAG)": {"min": 30_000, "max": 65_000, "priority": 0.35},
    "Tool Results":        {"min": 3_000,  "max": 12_000, "priority": 0.1},
    "User Message":        {"min": 500,    "max": 3_000,  "priority": 0.05},
    "Reserved for Output": {"min": 15_000, "max": 40_000, "priority": 0.2},
}

# Step 4: Optimize
chatbot_budget = budget_optimizer(chatbot_constraints, max_tokens=128_000)

print(f"\n3. Optimized budget:")
for comp, tokens in chatbot_budget.items():
    print(f"   {comp:<30} {tokens:>8,} tokens")
print(f"   {'─' * 40}")
print(f"   {'TOTAL':<30} {sum(chatbot_budget.values()):>8,} tokens")

# Step 5: Analyze
print()
report = budget_calculator(chatbot_budget, 128_000)
print_budget_report(report)

In [None]:
#@title 🎧 What to Look For: Chatbot Scenario Viz
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_39_chatbot_scenario_viz.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Final visualization: our chatbot's budget
plot_context_budget(chatbot_budget, model_limit=128_000,
                    title="🎯 Optimized Budget: Customer Support Chatbot (128K)")

## 13. 🤔 Reflection

Let us step back and consolidate what we have learned.

**The Big Ideas:**

1. **A context window is a fixed-size desk.** Everything (system instructions, conversation history, retrieved documents, tool outputs, the user's question, and the model's response) must fit on that desk. If it does not fit, something falls off.

2. **Tokens are the unit of measurement, not characters or words.** The rough rule of ~4 characters per token (for typical English text) is useful for quick estimates, but for production systems, always use the actual tokenizer (tiktoken, sentencepiece, etc.).

3. **The six components compete for the same budget:**
   $$T_{\text{total}} = T_{\text{system}} + T_{\text{history}} + T_{\text{RAG}} + T_{\text{tools}} + T_{\text{user}} + T_{\text{reserved}}$$

4. **RAG is the biggest budget consumer** in most applications, and also the most variable. Over-retrieval is the most common cause of context overflow.

5. **Always reserve output space.** Forgetting to budget for the model's response is a surprisingly common mistake that leads to truncated outputs.

6. **Budget optimization is a constrained resource allocation problem.** Define your minimums, set priorities, and let the math distribute the rest.
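As a quick sanity check on ideas 2, 3, and 5, the sketch below sums the six components and verifies the output reservation. It uses the typical 128K allocation figures from this notebook's cheat sheet; the helper name `rough_tokens` and the variable names are ours, and the 10% threshold is the same rule of thumb used by `budget_calculator`.

```python
def rough_tokens(text: str) -> int:
    """Quick estimate using the ~4 characters per token rule of thumb."""
    return len(text) // 4

window_size = 128_000

# Typical 128K allocation (illustrative figures, not a recommendation).
budget = {
    "System Prompt": 2_000,
    "Conversation History": 20_000,
    "Retrieved Context (RAG)": 60_000,
    "Tool Results": 10_000,
    "User Message": 1_000,
    "Reserved for Output": 35_000,
}

# Idea 3: the six components share one budget.
total = sum(budget.values())
fits = total <= window_size

# Idea 5: keep at least 10% of the window for the model's response.
output_share = budget["Reserved for Output"] / window_size
enough_output = output_share >= 0.10

print(f"Total: {total:,} / {window_size:,} (fits: {fits})")
print(f"Output share: {output_share:.1%} (enough: {enough_output})")

# Idea 2: a rough estimate is fine for planning, not for enforcement.
print(rough_tokens("Always be polite and professional"))
```

For anything that enforces a hard limit, swap `rough_tokens` for a count from the model's actual tokenizer.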

**What is next?** In Part 2 of this series, we will dive into **prompt architecture**: how to structure the *content* within each budget slot to maximize the model's performance. Knowing *how much* space you have is only half the battle; knowing *what to put in that space* is where context engineering truly becomes an art.

In [None]:
#@title 🎧 Wrap-Up: Closing
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_41_closing.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Final summary: one function to rule them all
def context_engineering_summary():
    """Print a quick-reference summary of context engineering principles."""
    print("""
╔══════════════════════════════════════════════════════════════╗
║               CONTEXT ENGINEERING CHEAT SHEET                ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  Token Estimation:                                           ║
║    • Rough: len(text) // 4                                   ║
║    • Exact: tiktoken.get_encoding("cl100k_base").encode(t)   ║
║                                                              ║
║  The Budget Equation:                                        ║
║    T_total = T_sys + T_hist + T_rag                          ║
║            + T_tools + T_user + T_out                        ║
║                                                              ║
║  Typical 128K Allocation:                                    ║
║    System:  2K  │ History: 20K │ RAG: 60K                    ║
║    Tools:  10K  │ User:     1K │ Output: 35K                 ║
║                                                              ║
║  Key Rules:                                                  ║
║    1. Always reserve output space (≥10% of window)           ║
║    2. RAG is your biggest lever: control retrieval volume    ║
║    3. System prompts are taxed on every call: keep concise   ║
║    4. Measure, don't guess: use exact token counts           ║
║    5. Bigger window ≠ better: irrelevant context hurts       ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝
    """)

context_engineering_summary()
print("✅ Notebook complete! You now have the tools to budget any context window.")
print("📊 Next up: Part 2, Prompt Architecture & Information Ordering")