# 🎯 Context Engineering: Optimizing LLM Context Windows

Welcome to this interactive lesson! You'll learn how to strategically structure and optimize context for Large Language Models.

**Duration:** ~30 minutes  
**Difficulty:** Intermediate  
**Approach:** 100% local, no API costs

---

## What You'll Build

By the end of this lesson, you will:
- ✅ Understand token budgets and context constraints
- ✅ Implement 4 different context assembly strategies
- ✅ Measure and compare their performance quantitatively
- ✅ Apply optimization techniques to improve quality or reduce costs

**Let's get started!** 🚀

In [None]:
# Cell 2: Import required libraries
import os
import json
import warnings
from pathlib import Path

import torch
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Import lesson modules
import sys
sys.path.append('../src')

from token_manager import count_tokens, fits_in_budget, TokenBudgetManager
from helpers import load_documents, load_questions, calculate_similarity
from evaluation import evaluate_answer, LLMEvaluator

print("✅ All imports successful!")
print(f"📊 PyTorch version: {torch.__version__}")
print(f"🖥️  Device available: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")

In [None]:
# Cell 3: Load models (this will download on first run)
print("Loading models... (first run downloads ~6.6 GB, please be patient)")
print("Subsequent runs load from the local cache and start much faster.\n")

from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

# Load LLM for generation
MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
print(f"Loading {MODEL_NAME}...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)
print("✅ LLM loaded!")

# Load embedding model for similarity
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
print(f"\nLoading {EMBED_MODEL}...")
embedder = SentenceTransformer(EMBED_MODEL)
print("✅ Embedding model loaded!")

print("\n🎉 All models ready! Let's start learning.")

In [None]:
# Cell 4: Load lesson data
print("Loading lesson data...")

# Load documents
documents = load_documents('../data/source_documents.json')
print(f"✅ Loaded {len(documents)} documents")

# Load evaluation questions
questions = load_questions('../data/evaluation_questions.json')
print(f"✅ Loaded {len(questions)} evaluation questions")

# Preview first document
print("\n📄 Sample Document:")
print(f"Title: {documents[0]['title']}")
print(f"Tokens: {documents[0]['tokens']}")
print(f"Preview: {documents[0]['content'][:200]}...")

---

## 📋 Lesson Roadmap

### Phase 1: Understanding Context Windows (7 min)
Learn about token limits and budget constraints

### Phase 2: Baseline Implementation (8 min)
Build a naive context assembly function

### Phase 3: Strategic Placement (10 min)
Implement and compare three placement strategies:
- **Primacy:** Important info at the start
- **Recency:** Important info at the end
- **Sandwich:** Important info at both ends

### Phase 4: Optimization (5 min)
Choose and implement one advanced optimization

### Phase 5: Results & Evaluation
Compare all strategies and see your improvements!

---

**Ready? Let's dive into Phase 1!** ⬇️

# Phase 1: Understanding Context Windows (7 minutes)

## What is a Context Window?

A **context window** is the maximum amount of text (measured in tokens) that an LLM can process at once. This includes:
- Your prompt/instructions
- Any retrieved documents or context
- The user's question
- The model's response

## Why Does This Matter?

Every token costs:
- **Money:** API providers charge per token
- **Time:** More tokens = slower inference
- **Attention:** Models struggle with very long contexts ("lost in the middle")

## Your Challenge

You have 10 documents and need to answer questions about them. But they don't all fit in the context window at once!

**Let's see what we're working with...**

In [None]:
# Cell 7: Analyze token counts
print("📊 Document Token Analysis\n")

# Calculate statistics
total_tokens = sum(doc['tokens'] for doc in documents)
avg_tokens = total_tokens / len(documents)
min_tokens = min(doc['tokens'] for doc in documents)
max_tokens = max(doc['tokens'] for doc in documents)

print(f"Total tokens across all documents: {total_tokens:,}")
print(f"Average tokens per document: {avg_tokens:.0f}")
print(f"Smallest document: {min_tokens} tokens")
print(f"Largest document: {max_tokens} tokens")

# Visualize distribution
token_counts = [doc['tokens'] for doc in documents]
plt.figure(figsize=(10, 5))
plt.bar(range(len(token_counts)), token_counts, color='steelblue', alpha=0.7)
plt.xlabel('Document Index')
plt.ylabel('Token Count')
plt.title('Token Distribution Across Documents')
plt.axhline(y=avg_tokens, color='r', linestyle='--', label=f'Average ({avg_tokens:.0f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\n💡 Key Insight: All documents together = {total_tokens:,} tokens")
print("    This lesson works with 2K-8K token budgets, so we need to be selective!")

## 🧮 Exercise: What Fits in Different Windows?

Given our total of ~8,500 tokens across all documents, let's see what fits in common context window sizes.

Remember: You also need room for:
- The question (~50 tokens)
- The response (~200 tokens)
- System instructions (~50 tokens)

So subtract ~300 tokens from each window for overhead!
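Here is the overhead arithmetic as a minimal sketch. The constants come from the list above; the function name `available_for_documents` is a hypothetical helper, not part of the lesson modules.

```python
# Overhead figures taken from the list above; adjust them for your own prompts.
QUESTION_TOKENS = 50
RESPONSE_TOKENS = 200
SYSTEM_TOKENS = 50


def available_for_documents(window_size: int) -> int:
    """Tokens left for documents after reserving the fixed overhead."""
    overhead = QUESTION_TOKENS + RESPONSE_TOKENS + SYSTEM_TOKENS
    return max(window_size - overhead, 0)


for window in (2048, 4096, 8192):
    remaining = available_for_documents(window)
    print(f"{window:>5}-token window -> {remaining:,} tokens for documents")
```

The `max(..., 0)` guard simply keeps the budget from going negative if a window is smaller than the overhead.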

In [None]:
# Cell 9: TODO - Calculate what fits in different windows
# This is your first coding task!

def calculate_fit_analysis(documents, window_sizes=(2048, 4096, 8192)):
    """
    TODO: For each window size, determine:
    1. How many documents fit (accounting for 300 token overhead)
    2. What percentage of total tokens can be included
    3. Which specific documents fit (in order, until limit reached)
    
    Args:
        documents: List of document dicts with 'tokens' field
        window_sizes: List of context window sizes to analyze
    
    Returns:
        Dictionary with results for each window size
    """
    results = {}
    overhead = 300  # tokens for question + response + instructions
    
    # TODO: Your implementation here
    # Hint: Iterate through documents in order, track cumulative tokens
    # Hint: Stop when adding next doc would exceed (window_size - overhead)
    
    for window_size in window_sizes:
        available_tokens = window_size - overhead
        # TODO: Calculate how many docs fit
        # TODO: Calculate percentage of total
        # TODO: Track which specific docs
        
        results[window_size] = {
            'docs_fit': 0,  # TODO: Replace with actual count
            'tokens_used': 0,  # TODO: Replace with actual sum
            'percentage': 0,  # TODO: Replace with actual percentage
            'doc_indices': []  # TODO: Replace with actual indices
        }
    
    return results

# Test your implementation
fit_analysis = calculate_fit_analysis(documents)

# Display results
print("📊 Context Window Fit Analysis\n")
for window_size, stats in fit_analysis.items():
    print(f"Window Size: {window_size:,} tokens")
    print(f"  Documents that fit: {stats['docs_fit']}/{len(documents)}")
    print(f"  Tokens used: {stats['tokens_used']:,}")
    print(f"  Coverage: {stats['percentage']:.1f}%")
    print()

# 💡 HINT: If you're stuck, uncomment the next line
# %load ../src/hints/hint_calculate_fit.py

## ✅ Phase 1 Complete!

**What you learned:**
- Context windows have hard token limits
- Not all information can fit at once
- Strategic selection is crucial

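**Key Takeaway:** Even an 8K window leaves only 8,192 - 300 = 7,892 tokens for documents, which is less than the ~8,500-token total, so some content has to be left out. **Which documents should you choose? And where should you put them?**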

That's what we'll explore next! ⬇️

---

# Phase 2: Baseline Context Assembly (8 minutes)

## The Naive Approach

The simplest strategy: concatenate documents in order until you run out of space.

**No intelligence, no optimization, just raw concatenation.**

This will be our **baseline** for comparison. Every other strategy must beat this!

## Your Task

Implement `naive_context_assembly()` that:
1. Takes documents and a query
2. Concatenates documents in order
3. Stops when approaching the token limit
4. Returns the assembled context string
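
The four steps can be sketched on toy data before you tackle the real cell. This is one possible shape, not the lesson's reference solution: `greedy_concat`, the `###` separator format, and the toy documents are all assumptions made for illustration.

```python
def greedy_concat(docs, token_limit, reserve=50):
    """Concatenate docs in their given order until the budget is used up.

    Hypothetical helper for illustration; `reserve` holds back tokens for the query.
    """
    budget = token_limit - reserve
    parts, used = [], 0
    for doc in docs:
        if used + doc["tokens"] > budget:
            break  # naive: stop at the first document that does not fit
        parts.append(f"### {doc['title']}\n{doc['content']}")
        used += doc["tokens"]
    return "\n\n".join(parts), used


toy_docs = [
    {"title": "Doc A", "content": "alpha", "tokens": 400},
    {"title": "Doc B", "content": "beta", "tokens": 500},
    {"title": "Doc C", "content": "gamma", "tokens": 300},
]

context, used = greedy_concat(toy_docs, token_limit=1000)
print(used)  # 900: A and B fit within the 950-token budget, C would not
```

Note that the `tokens` field only counts document content, so titles and separators add a few uncounted tokens. Leaving a small safety margin is a reasonable precaution.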

Let's build it! 💪

In [None]:
# Cell 12: TODO - Implement naive context assembly

def naive_context_assembly(documents, query, token_limit=4000):
    """
    Naive context assembly: concatenate documents in order until token limit.
    
    Args:
        documents: List of document dicts with 'content' and 'tokens' fields
        query: The question being asked (string)
        token_limit: Maximum tokens for context (int)
    
    Returns:
        Assembled context string
    """
    # TODO: Implement naive concatenation
    # Hint 1: Reserve some tokens for the query itself (~50)
    # Hint 2: Iterate through documents in order
    # Hint 3: Keep track of cumulative tokens
    # Hint 4: Stop when adding next doc would exceed limit
    # Hint 5: Format nicely with document separators
    
    context_parts = []
    used_tokens = 0
    available_tokens = token_limit - 50  # Reserve for query
    
    # TODO: Your implementation here
    
    return "\n\n".join(context_parts)

# Test your implementation
test_query = questions[0]['question']
test_context = naive_context_assembly(documents, test_query, token_limit=4000)

print(f"✅ Naive context assembled!")
print(f"📏 Length: {len(test_context)} characters")
print(f"🔢 Tokens: ~{count_tokens(test_context)}")
print(f"\n📄 Preview:\n{test_context[:300]}...")

# 💡 HINT: Stuck? Uncomment for solution skeleton
# %load ../src/hints/hint_naive_assembly.py

In [None]:
# Cell 13: Set up evaluator
print("Setting up evaluation system...")

evaluator = LLMEvaluator(model, tokenizer)
print("✅ Evaluator ready!")

# Test on one question
print("\n🧪 Testing evaluator with one question...")
test_context = naive_context_assembly(documents, questions[0]['question'])
test_answer = evaluator.generate_answer(test_context, questions[0]['question'])
test_score = evaluator.score_answer(
    test_answer, 
    questions[0]['ground_truth_answer']
)

print(f"\nQuestion: {questions[0]['question']}")
print(f"Generated Answer: {test_answer}")
print(f"Score: {test_score:.2f}")

In [None]:
# Cell 14: Evaluate naive strategy on all questions
print("🔬 Evaluating naive strategy on all questions...")
print("This may take 2-3 minutes...\n")

naive_results = []

for q in tqdm(questions, desc="Evaluating"):
    # Assemble context
    context = naive_context_assembly(documents, q['question'], token_limit=4000)
    
    # Generate answer
    answer = evaluator.generate_answer(context, q['question'])
    
    # Score answer
    score = evaluator.score_answer(answer, q['ground_truth_answer'])
    
    naive_results.append({
        'question_id': q['id'],
        'question': q['question'],
        'answer': answer,
        'score': score,
        'tokens_used': count_tokens(context)
    })

# Calculate metrics
naive_accuracy = np.mean([r['score'] for r in naive_results])
naive_tokens = np.mean([r['tokens_used'] for r in naive_results])

print(f"\n📊 Naive Strategy Results:")
print(f"   Average Accuracy: {naive_accuracy:.2%}")
print(f"   Average Tokens: {naive_tokens:.0f}")
print(f"   Token Efficiency: {(naive_accuracy / naive_tokens * 1000):.3f} (accuracy per 1K tokens)")

# Save for later comparison
baseline_metrics = {
    'strategy': 'naive',
    'accuracy': naive_accuracy,
    'avg_tokens': naive_tokens,
    'all_results': naive_results
}

## 📈 Baseline Established!

You've now measured the **naive approach** performance. This is your baseline.

**Typical Results:**
- Accuracy: 65-72%
- Token usage: ~3800/4000
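
Cell 14 also reports token efficiency, defined there as accuracy per 1,000 context tokens. Below is a small sketch of that metric, using the typical figures above as illustrative inputs rather than measured results. The zero-token guard is an addition, not part of the original cell.

```python
def token_efficiency(accuracy: float, avg_tokens: float) -> float:
    """Accuracy per 1,000 context tokens; returns 0.0 when no tokens were used."""
    if avg_tokens <= 0:
        return 0.0
    return accuracy / avg_tokens * 1000


# Illustrative inputs from the typical range above, not measured results.
low = token_efficiency(0.65, 3800)
high = token_efficiency(0.72, 3800)
print(f"{low:.3f} to {high:.3f} accuracy per 1K tokens")
```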

## What's Wrong with Naive?

1. **No relevance ranking** - Treats all documents equally
2. **Order dependency** - First documents always included, last ones never are
3. **Ignores the query** - Doesn't consider what's actually being asked
4. **Wastes attention** - Model must process irrelevant info

**Can we do better? Absolutely!** ⬇️

---

# Phase 3: Strategic Context Placement (10 minutes)

## The "Lost in the Middle" Problem

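Research on long-context retrieval (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") shows that LLMs often have **positional bias**: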
- ✅ **Strong recall** for information at the START of context
- ✅ **Strong recall** for information at the END of context
- ❌ **Weak recall** for information in the MIDDLE

This is called the **"lost in the middle"** phenomenon.

## Three Strategies to Test

### 1. Primacy Placement
Place most relevant documents at the **beginning**

### 2. Recency Placement  
Place most relevant documents at the **end**

### 3. Sandwich Placement
Place relevant documents at **both ends**, less relevant in middle
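
To make the three orderings concrete, here is a minimal sketch that rearranges a list already sorted from most to least relevant. The function names and the `top_fraction` parameter are hypothetical, and the sandwich split shown is just one reasonable reading of "both ends".

```python
import math


def primacy_order(ranked):
    """Most relevant first."""
    return list(ranked)


def recency_order(ranked):
    """Most relevant last."""
    return list(reversed(ranked))


def sandwich_order(ranked, top_fraction=0.5):
    """Top-ranked items split across both ends, the rest in the middle.

    `top_fraction` is an assumed tuning knob, not a value fixed by the lesson.
    """
    n_top = math.ceil(len(ranked) * top_fraction)
    top, rest = ranked[:n_top], ranked[n_top:]
    n_front = math.ceil(len(top) / 2)
    front, back = top[:n_front], top[n_front:]
    # Reverse the back group so its strongest item sits at the very end.
    return front + rest + back[::-1]


ranked = ["A", "B", "C", "D", "E", "F"]  # already sorted, most relevant first
print(primacy_order(ranked))   # ['A', 'B', 'C', 'D', 'E', 'F']
print(recency_order(ranked))   # ['F', 'E', 'D', 'C', 'B', 'A']
print(sandwich_order(ranked))  # ['A', 'B', 'D', 'E', 'F', 'C']
```

When a token limit applies, it is safer to first pick the documents that fit by relevance and only then reorder them. Otherwise an ordering that starts with the least relevant documents can use up the budget before the best ones are added.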

## Your Challenge

Implement all three strategies and measure which performs best!

**First, we need a way to rank document relevance...**
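
The ranking cell relies on `calculate_similarity` from the lesson's `helpers` module, whose implementation isn't shown here. Assuming it computes cosine similarity (a common choice for sentence embeddings), the idea can be sketched in plain Python with toy vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_close": [0.9, 0.1, 0.8],
    "doc_far": [0.0, 1.0, 0.0],
}

ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)  # ['doc_close', 'doc_far']
```

Scores near 1 mean the vectors point in nearly the same direction, and scores near 0 mean they are unrelated, which is why sorting in descending order puts the most relevant document first.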

In [None]:
# Cell 17: Implement document ranking by relevance
def rank_documents_by_relevance(documents, query, embedder):
    """
    Rank documents by semantic similarity to the query.
    
    Args:
        documents: List of document dicts
        query: Question string
        embedder: SentenceTransformer model
    
    Returns:
        List of (doc, similarity_score) tuples, sorted by score descending
    """
    # Encode query
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    
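    # Encode all documents and calculate similarity
    # Note: all-MiniLM-L6-v2 truncates long inputs (256 word pieces by default),
    # so only the beginning of a long document contributes to its embedding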
    ranked = []
    for doc in documents:
        doc_embedding = embedder.encode(doc['content'], convert_to_tensor=True)
        similarity = calculate_similarity(query_embedding, doc_embedding)
        ranked.append((doc, similarity.item()))
    
    # Sort by similarity (highest first)
    ranked.sort(key=lambda x: x[1], reverse=True)
    
    return ranked

# Test ranking
test_query = "What is the lost in the middle problem?"
ranked_docs = rank_documents_by_relevance(documents, test_query, embedder)

print("📊 Document Ranking for Query:", test_query)
print("\nTop 3 most relevant:")
for i, (doc, score) in enumerate(ranked_docs[:3], 1):
    print(f"{i}. {doc['title'][:50]}... (similarity: {score:.3f})")

print("\nBottom 3 least relevant:")
for i, (doc, score) in enumerate(ranked_docs[-3:], 1):
    print(f"{i}. {doc['title'][:50]}... (similarity: {score:.3f})")

In [None]:
# Cell 18: TODO - Implement primacy placement strategy

def primacy_context_assembly(documents, query, token_limit=4000, embedder=None):
    """
    Primacy placement: Most relevant documents at the START.
    
    Args:
        documents: List of document dicts
        query: Question string
        token_limit: Max tokens
        embedder: SentenceTransformer for ranking
    
    Returns:
        Assembled context string
    """
    # TODO: Implement primacy strategy
    # Step 1: Rank documents by relevance to query
    # Step 2: Place highest-ranked docs first
    # Step 3: Continue adding until token limit
    # Step 4: Return formatted context
    
    # TODO: Your implementation here
    # Hint: Use rank_documents_by_relevance() from above
    # Hint: Similar to naive, but with sorted order
    
    pass

# Test your implementation
test_primacy_context = primacy_context_assembly(
    documents, 
    questions[0]['question'], 
    embedder=embedder
)

print(f"✅ Primacy context assembled!")
print(f"📏 Tokens: ~{count_tokens(test_primacy_context)}")

# 💡 HINT: Stuck?
# %load ../src/hints/hint_primacy.py

In [None]:
# Cell 19: TODO - Implement recency placement strategy

def recency_context_assembly(documents, query, token_limit=4000, embedder=None):
    """
    Recency placement: Most relevant documents at the END.
    
    Args:
        documents: List of document dicts
        query: Question string  
        token_limit: Max tokens
        embedder: SentenceTransformer for ranking
    
    Returns:
        Assembled context string
    """
    # TODO: Implement recency strategy
    # Step 1: Rank documents by relevance
    # Step 2: Select the highest-ranked docs that fit within the token limit
    # Step 3: Reverse the selection so the most relevant doc comes LAST
    # Step 4: Return formatted context
    
    # TODO: Your implementation here
    # Hint: Very similar to primacy, but reverse the selected docs!
    # Hint: Select before reversing; filling least-relevant-first can drop the best docs
    
    pass

# Test
test_recency_context = recency_context_assembly(
    documents,
    questions[0]['question'],
    embedder=embedder
)

print(f"✅ Recency context assembled!")
print(f"📏 Tokens: ~{count_tokens(test_recency_context)}")

# 💡 HINT: Stuck?
# %load ../src/hints/hint_recency.py

In [None]:
# Cell 20: TODO - Implement sandwich placement strategy

def sandwich_context_assembly(documents, query, token_limit=4000, embedder=None):
    """
    Sandwich placement: Relevant docs at BOTH ends, less relevant in middle.
    
    Strategy:
    - Top 50% of relevant docs → split into two groups
    - First group at START
    - Second group at END
    - Remaining docs in MIDDLE
    
    Args:
        documents: List of document dicts
        query: Question string
        token_limit: Max tokens
        embedder: SentenceTransformer for ranking
    
    Returns:
        Assembled context string
    """
    # TODO: Implement sandwich strategy
    # Step 1: Rank documents by relevance
    # Step 2: Identify top-ranked docs (most relevant)
    # Step 3: Split top docs into two groups
    # Step 4: Assemble: [group1] + [middle docs] + [group2]
    # Step 5: Respect token limit throughout
    
    # TODO: Your implementation here
    # Hint: This is the most complex strategy!
    # Hint: The docstring suggests sandwiching the top 50%; also try 40% and compare
    
    pass

# Test
test_sandwich_context = sandwich_context_assembly(
    documents,
    questions[0]['question'],
    embedder=embedder
)

print(f"✅ Sandwich context assembled!")
print(f"📏 Tokens: ~{count_tokens(test_sandwich_context)}")

# 💡 HINT: Stuck? This one is tricky!
# %load ../src/hints/hint_sandwich.py

In [None]:
# Cell 21: Evaluate primacy, recency, and sandwich strategies
print("🔬 Evaluating all three strategic placement approaches...")
print("This will take 5-8 minutes total...\n")

strategies = {
    'primacy': primacy_context_assembly,
    'recency': recency_context_assembly,
    'sandwich': sandwich_context_assembly
}

all_results = {'naive': baseline_metrics}  # Include baseline

for strategy_name, strategy_func in strategies.items():
    print(f"\n📊 Evaluating {strategy_name.upper()} strategy...")
    
    results = []
    for q in tqdm(questions, desc=f"  {strategy_name}"):
        # Assemble context using this strategy
        context = strategy_func(
            documents, 
            q['question'], 
            token_limit=4000,
            embedder=embedder
        )
        
        # Generate and score answer
        answer = evaluator.generate_answer(context, q['question'])
        score = evaluator.score_answer(answer, q['ground_truth_answer'])
        
        results.append({
            'question_id': q['id'],
            'score': score,
            'tokens_used': count_tokens(context)
        })
    
    # Calculate metrics
    accuracy = np.mean([r['score'] for r in results])
    avg_tokens = np.mean([r['tokens_used'] for r in results])
    
    all_results[strategy_name] = {
        'strategy': strategy_name,
        'accuracy': accuracy,
        'avg_tokens': avg_tokens,
        'all_results': results
    }
    
    print(f"   ✅ Accuracy: {accuracy:.2%}")
    print(f"   📏 Avg Tokens: {avg_tokens:.0f}")

print("\n🎉 All strategies evaluated!")

In [None]:
# Cell 22: Visualize strategy comparison
# Create comparison dataframe
comparison_df = pd.DataFrame([
    {
        'Strategy': name.capitalize(),
        'Accuracy': metrics['accuracy'],
        'Avg Tokens': metrics['avg_tokens'],
        'Improvement': (metrics['accuracy'] - all_results['naive']['accuracy']) / all_results['naive']['accuracy']
    }
    for name, metrics in all_results.items()
])

# Sort by accuracy
comparison_df = comparison_df.sort_values('Accuracy', ascending=False)

# Display table
print("📊 STRATEGY COMPARISON\n")
print(comparison_df.to_string(index=False))
print()

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
ax1.barh(comparison_df['Strategy'], comparison_df['Accuracy'] * 100, color='steelblue')
ax1.set_xlabel('Accuracy (%)')
ax1.set_title('Strategy Accuracy Comparison')
ax1.grid(True, alpha=0.3)

# Improvement over baseline (filter by name: after sorting, naive is not guaranteed to be row 0)
improved = comparison_df[comparison_df['Strategy'] != 'Naive']
ax2.barh(
    improved['Strategy'],
    improved['Improvement'] * 100,
    color='green'
)
ax2.set_xlabel('Improvement over Baseline (%)')
ax2.set_title('Relative Improvement')
ax2.axvline(x=0, color='r', linestyle='--', alpha=0.5)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## ✅ Phase 3 Complete!

**What you discovered:**
- Position in context matters significantly
- Different strategies perform differently
- Strategic placement typically improves accuracy by about 5-20% over the naive baseline, depending on the strategy

### Typical Results

| Strategy | Expected Accuracy | Relative Improvement |
|----------|------------------|----------------------|
| Naive    | 65-72%           | Baseline    |
| Primacy  | 70-77%           | +5-8%       |
| Recency  | 75-82%           | +10-15%     |
| Sandwich | 78-85%           | +15-20%     |
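
The improvement figures are easiest to read as relative change over the naive baseline, which is how Cell 22 computes its `Improvement` column. Here is a minimal sketch of the difference between absolute points and relative change, using the lower ends of the ranges above as illustrative inputs only. `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(accuracy: float, baseline: float) -> float:
    """Fractional change over the baseline, e.g. 0.20 means 20% better."""
    if baseline == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return (accuracy - baseline) / baseline


# Lower ends of the typical ranges in the table, used purely as illustrative inputs.
baseline = 0.65
for name, acc in [("Primacy", 0.70), ("Recency", 0.75), ("Sandwich", 0.78)]:
    absolute_points = (acc - baseline) * 100
    relative = relative_improvement(acc, baseline)
    print(f"{name}: +{absolute_points:.0f} points absolute, {relative:+.1%} relative")
```

Keeping the two measures apart matters when comparing your own runs: a 13-point gain on a 65% baseline is a 20% relative improvement.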

**Key Insight:** The sandwich strategy usually wins by avoiding the "lost in the middle" problem!

**But can we do even better?** Let's find out! ⬇️

---