# 1.4 Context and Conversation Management

This notebook focuses on **context engineering** - managing information for AI applications.

**Key Concepts:**
- Token limits and budgeting
- Conversation management strategies
- Stateful conversations (automatic state)
- CLAUDE.md pattern for project context
- R&D: Reduce and Delegate

**Why this matters:** Effective context management is critical for building robust AI applications.

<a target="_blank" href="https://githubtocolab.com/IT-HUSET/ai-agenter-2025/blob/main/exercises/openai/1.5-context-management.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setup

In [None]:
%pip install openai~=2.1 python-dotenv~=1.0 tiktoken~=0.8 --upgrade --quiet

In [None]:
import os
from openai import OpenAI
import tiktoken

# Check if running in Google Colab
try:
    from google.colab import userdata
    IN_COLAB = True
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
    print("✅ Running in Google Colab - API key loaded from secrets")
except ImportError:
    IN_COLAB = False
    try:
        from dotenv import load_dotenv, find_dotenv
        load_dotenv(find_dotenv())
        print("✅ Running locally - API key loaded from .env file")
    except ImportError:
        print("⚠️ python-dotenv not installed")

if not os.getenv("OPENAI_API_KEY"):
    print("❌ OPENAI_API_KEY not found!")
    if IN_COLAB:
        print("   → Click the key icon (🔑) in the left sidebar and add 'OPENAI_API_KEY'")
else:
    # Create the client only after confirming the key exists;
    # OpenAI() raises an error when no API key is available.
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    print("✅ Setup complete")

---

## Part 1: Understanding Token Limits

Before we manage context, we need to understand what we're managing.

### Token Counting

In [None]:
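def count_tokens(text: str, model: str = "gpt-5") -> int:
    """Count tokens in a text string"""
    try:
        # Look up the tokenizer that tiktoken associates with this model
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Model name unknown to this tiktoken version: fall back to o200k_base,
        # the encoding used by GPT-4o and later OpenAI models
        encoding = tiktoken.get_encoding("o200k_base")
    return len(encoding.encode(text))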

# Test with different texts
texts = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Supercalifragilisticexpialidocious",
    "A" * 100,
]

for text in texts:
    tokens = count_tokens(text)
    print(f"Text: {text[:50]}...")
    print(f"Length: {len(text)} chars, {tokens} tokens\n")

### Model Context Windows

Different models have different context window sizes:

In [None]:
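# Context window sizes (as of 2025; illustrative, check the current model documentation)
# Note: dict keys must be unique, otherwise later entries silently overwrite earlier ones
context_windows = {
    "gpt-5": 200_000,
    "gpt-5-mini": 200_000,
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "o3-mini": 200_000,
}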

print("Model Context Windows:")
for model, window in context_windows.items():
    print(f"  {model}: {window:,} tokens")

# Calculate how many pages of text fit
avg_tokens_per_page = 500  # Rough estimate
print(f"\nRoughly {context_windows['gpt-5'] // avg_tokens_per_page:,} pages in GPT-5")
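
A context window has to hold both the prompt and the tokens reserved for the response. The following is a minimal sketch of that check; `fits_in_context` and its `safety_margin` parameter are hypothetical names introduced here for illustration, not part of any library.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int, safety_margin: int = 1_000) -> bool:
    """Return True if the prompt plus the reserved output fits in the window.

    safety_margin (in tokens) absorbs small counting errors; the default
    of 1,000 is an arbitrary illustrative value.
    """
    usable = context_window - safety_margin
    return prompt_tokens + max_output_tokens <= usable


# 100,000 + 4,000 = 104,000 tokens, within the 127,000 usable tokens
print(fits_in_context(100_000, 4_000, 128_000))

# 124,000 + 4,000 = 128,000 tokens, which exceeds the 127,000 usable tokens
print(fits_in_context(124_000, 4_000, 128_000))
```

Running a check like this before each request turns a hard API error into a decision point where you can trim or summarize instead.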

### 🎯 Exercise 1: Token Budgeting

**Task:** You have a 128K token context window. Budget tokens for:
- System instructions
- Conversation history
- Retrieved documents (RAG)
- Response generation

Calculate how many messages and documents you can fit.

In [None]:
# YOUR CODE HERE

CONTEXT_WINDOW = 128_000

# Estimate token allocations
system_instructions = 500  # tokens
max_response = 4_000  # tokens
doc_size = 1_000  # tokens per document
message_size = 100  # tokens per message

# TODO: Calculate:
# - How many documents can you retrieve?
# - How many conversation turns can you keep?
# - What's your safety margin?


---

## Part 2: Conversation Context Management

Managing conversation history is critical for agents: every message kept in the history consumes tokens again on each subsequent request.

### Strategy 1: Sliding Window

In [None]:
class SlidingWindowContext:
    """Keep only the N most recent messages"""
    
    def __init__(self, max_messages: int = 10):
        self.max_messages = max_messages
        self.messages = []
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        
        # Keep only recent messages
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]
    
    def get_messages(self):
        return self.messages
    
    def get_token_count(self):
        text = "\n".join([f"{m['role']}: {m['content']}" for m in self.messages])
        return count_tokens(text)

# Test it
window = SlidingWindowContext(max_messages=5)

for i in range(10):
    window.add_message("user", f"Question {i}")
    window.add_message("assistant", f"Answer to question {i}")

print(f"Messages kept: {len(window.get_messages())}")
print(f"Total tokens: {window.get_token_count()}")
print("\nMessages:")
for msg in window.get_messages():
    print(f"  {msg['role']}: {msg['content']}")
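
One caveat with the plain sliding window: it treats every message alike, so a system message added at the start would eventually be evicted along with old turns. A possible variation, sketched below under the assumption that there is exactly one system prompt, keeps that prompt outside the window. `PinnedSlidingWindow` is a hypothetical name used only for this illustration.

```python
from collections import deque


class PinnedSlidingWindow:
    """Sliding window that never evicts the system message (illustrative helper)."""

    def __init__(self, system_prompt: str, max_messages: int = 10):
        self.system = {"role": "system", "content": system_prompt}
        # deque(maxlen=...) discards the oldest item automatically on append
        self.recent = deque(maxlen=max_messages)

    def add_message(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        # The system message always comes first, followed by the recent turns
        return [self.system, *self.recent]


pinned = PinnedSlidingWindow("You are a helpful tutor.", max_messages=4)
for i in range(10):
    pinned.add_message("user", f"Question {i}")
    pinned.add_message("assistant", f"Answer to question {i}")

for msg in pinned.get_messages():
    print(f"{msg['role']}: {msg['content']}")
```

Using `deque(maxlen=...)` also removes the need for manual slicing on every append.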

### Strategy 2: Summarization

Instead of dropping old messages, summarize them.

In [None]:
class SummarizingContext:
    """Summarize old messages when window gets full"""
    
    def __init__(self, max_messages: int = 10, summarize_threshold: int = 8):
        self.max_messages = max_messages
        self.summarize_threshold = summarize_threshold
        self.messages = []
        self.summary = ""
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        
        # Trigger summarization when threshold is reached
        if len(self.messages) >= self.summarize_threshold:
            self._summarize_old_messages()
    
    def _summarize_old_messages(self):
        """Summarize oldest half of messages"""
        split_point = len(self.messages) // 2
        old_messages = self.messages[:split_point]
        
        # Create summary prompt
        conversation = "\n".join(
            [f"{m['role']}: {m['content']}" for m in old_messages]
        )
        
        prompt = f"""Summarize this conversation in 2-3 sentences. 
Focus on key topics, decisions, and context needed for future messages.

{conversation}

Summary:"""
        
        # Note: no temperature argument; GPT-5 family models reject that parameter
        response = client.responses.create(
            model="gpt-5-mini",
            input=prompt
        )
        
        # Update summary and keep recent messages
        new_summary = response.output_text
        self.summary = f"{self.summary}\n\n{new_summary}" if self.summary else new_summary
        self.messages = self.messages[split_point:]
        
        print(f"📝 Summarized {split_point} messages")
    
    def get_full_context(self):
        """Get summary + recent messages"""
        context = ""
        if self.summary:
            context = f"Previous conversation summary:\n{self.summary}\n\n"
        
        recent = "\n".join([f"{m['role']}: {m['content']}" for m in self.messages])
        return context + "Recent conversation:\n" + recent

# Test it
summarizing = SummarizingContext(max_messages=10, summarize_threshold=6)

topics = ["Python", "Data structures", "Algorithms", "Machine learning", 
          "Neural networks", "Transformers", "LLMs", "Agents"]

for topic in topics:
    summarizing.add_message("user", f"Tell me about {topic}")
    summarizing.add_message("assistant", f"Here's information about {topic}...")

print("\nFull context:")
print(summarizing.get_full_context())
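
The folding logic can be tested without spending API calls if the summarizer is passed in as a function. The sketch below assumes that design; `PluggableSummarizingContext` and `naive_summarizer` are hypothetical names, and the stand-in summarizer only lists user messages rather than producing a real summary.

```python
from typing import Callable


def naive_summarizer(messages: list[dict]) -> str:
    """Stand-in for an LLM call: lists the user messages it was given."""
    asked = [m["content"] for m in messages if m["role"] == "user"]
    return f"User asked {len(asked)} question(s): " + "; ".join(asked)


class PluggableSummarizingContext:
    """Folds the oldest half of the history into a summary at a threshold."""

    def __init__(self, summarize: Callable[[list[dict]], str], threshold: int = 6):
        self.summarize = summarize
        self.threshold = threshold
        self.messages: list[dict] = []
        self.summary = ""

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) >= self.threshold:
            split = len(self.messages) // 2
            new_summary = self.summarize(self.messages[:split])
            # Append to any existing summary, then drop the summarized messages
            self.summary = f"{self.summary}\n{new_summary}" if self.summary else new_summary
            self.messages = self.messages[split:]


ctx = PluggableSummarizingContext(naive_summarizer, threshold=4)
for topic in ["Python", "Algorithms", "Agents"]:
    ctx.add_message("user", f"Tell me about {topic}")
    ctx.add_message("assistant", f"Here's information about {topic}...")

print(ctx.summary)
print(len(ctx.messages), "recent messages kept")
```

In the notebook, the same class could be given a function that wraps the model call instead of `naive_summarizer`.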

### Strategy 3: Token-Based Budgeting

The most precise approach: manage based on actual token counts.

In [None]:
class TokenBudgetContext:
    """Manage context based on token budget"""
    
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim_to_budget()
    
    def _trim_to_budget(self):
        """Remove oldest messages until within budget"""
        while self.get_token_count() > self.max_tokens and len(self.messages) > 1:
            # Remove oldest message
            removed = self.messages.pop(0)
            print(f"🗑️ Removed message: {removed['role']}: {removed['content'][:50]}...")
    
    def get_token_count(self):
        text = "\n".join([f"{m['role']}: {m['content']}" for m in self.messages])
        return count_tokens(text)
    
    def get_messages(self):
        return self.messages

# Test with long messages
token_budget = TokenBudgetContext(max_tokens=500)

for i in range(5):
    long_text = f"This is message {i}. " * 50  # a few hundred tokens each
    token_budget.add_message("user", long_text)

print(f"\nFinal message count: {len(token_budget.get_messages())}")
print(f"Total tokens: {token_budget.get_token_count()}")
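
`TokenBudgetContext` re-encodes the whole history every time it checks the budget, which gets slow as conversations grow. A possible refinement, sketched here, counts each message once and keeps a running total. `CachedTokenBudget` and `rough_token_estimate` are hypothetical names; the 4-characters-per-token estimator is a crude assumption so the example runs without a tokenizer.

```python
from collections import deque
from typing import Callable


def rough_token_estimate(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


class CachedTokenBudget:
    """Token budget that counts each message once and keeps a running total."""

    def __init__(self, max_tokens: int,
                 counter: Callable[[str], int] = rough_token_estimate):
        self.max_tokens = max_tokens
        self.counter = counter
        self.messages: deque = deque()  # entries are (message, token_count)
        self.total = 0

    def add_message(self, role: str, content: str) -> None:
        cost = self.counter(f"{role}: {content}")
        self.messages.append(({"role": role, "content": content}, cost))
        self.total += cost
        # Evict from the oldest end until within budget; the newest message always stays
        while self.total > self.max_tokens and len(self.messages) > 1:
            _, removed_cost = self.messages.popleft()
            self.total -= removed_cost

    def get_messages(self) -> list[dict]:
        return [message for message, _ in self.messages]


budget = CachedTokenBudget(max_tokens=100)
for i in range(5):
    # "user: " plus 154 characters is 160 characters, estimated at 40 tokens
    budget.add_message("user", "x" * 154)

print(len(budget.get_messages()), "messages,", budget.total, "tokens")
```

Passing the notebook's `count_tokens` as `counter` would give real token counts with the same caching behavior.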

### 🎯 Exercise 2: Hybrid Context Manager

**Task:** Combine all three strategies:
1. Use token budgeting as primary constraint
2. Summarize when you hit 80% of budget
3. Keep last N messages regardless (sliding window minimum)

**Bonus:** Add message priority (mark important messages to always keep)

In [None]:
# YOUR CODE HERE

class HybridContextManager:
    def __init__(self, max_tokens: int = 4000, min_messages: int = 5):
        self.max_tokens = max_tokens
        self.min_messages = min_messages
        self.messages = []
        self.summary = ""
    
    def add_message(self, role: str, content: str, priority: bool = False):
        # TODO: Implement hybrid strategy
        pass
    
    def _should_summarize(self) -> bool:
        # TODO: Check if at 80% of budget
        pass
    
    def _summarize(self):
        # TODO: Summarize non-priority messages
        pass

# Test your implementation


---

## Part 3: The CLAUDE.md Pattern

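**Context engineering** is about structuring your entire project for AI understanding. A `CLAUDE.md` file is a Markdown document at the project root that coding agents such as Claude Code read for project context; the same content can be passed to any model as instructions.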

### Example CLAUDE.md Structure

In [None]:
claude_md_template = """
# Project Context

## Project Description
This is a task management API built with FastAPI. It provides CRUD operations 
for tasks, users, and projects with role-based access control.

## Architecture
- **Backend**: FastAPI (Python 3.11+)
- **Database**: PostgreSQL via SQLAlchemy ORM
- **Auth**: JWT tokens with refresh mechanism
- **Testing**: pytest with async support
- **Deployment**: Docker + Kubernetes

## Key Files
- `app/main.py` - FastAPI application entry point
- `app/models/` - SQLAlchemy models
- `app/routers/` - API endpoints
- `app/services/` - Business logic
- `tests/` - Test suite

## Development Guidelines
1. **Code Style**: Follow PEP 8, use black for formatting
2. **Testing**: Write tests for all endpoints (>80% coverage)
3. **Type Hints**: Use type hints for all functions
4. **Async**: Use async/await for all I/O operations
5. **Error Handling**: Use HTTPException with appropriate status codes

## Common Tasks

### Adding a New Endpoint
```python
# 1. Define Pydantic schema in app/schemas/
class TaskCreate(BaseModel):
    title: str
    description: str | None = None

# 2. Add route in app/routers/
@router.post("/tasks/", response_model=Task)
async def create_task(task: TaskCreate, db: AsyncSession = Depends(get_db)):
    return await task_service.create(db, task)

# 3. Write tests in tests/
async def test_create_task(client):
    response = await client.post("/tasks/", json={"title": "Test"})
    assert response.status_code == 200
```

## Constraints & Guardrails
- ❌ Never commit secrets or API keys
- ❌ Don't modify the database schema without migrations
- ❌ Don't use sync I/O in async functions
- ✅ Always validate input with Pydantic
- ✅ Use dependency injection for database sessions
- ✅ Log all errors with context

## Running the Project
```bash
# Development
uv run uvicorn app.main:app --reload

# Tests
uv run pytest

# Database migrations
uv run alembic upgrade head
```
"""

print(claude_md_template)
print(f"\nContext size: {count_tokens(claude_md_template)} tokens")

### Using Context in API Calls

In [None]:
# Example: Use CLAUDE.md context for a coding task
task = """Add a new endpoint to assign a task to a user. 
The endpoint should:
- Accept task_id and user_id
- Verify the user has permission
- Update the task's assigned_to field
- Return the updated task
"""

# With context, the model understands project structure
# Note: no temperature argument; GPT-5 family models reject that parameter
response = client.responses.create(
    model="gpt-5-mini",
    instructions=claude_md_template,  # Project context
    input=task
)

print("Generated code:")
print(response.output_text)
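
In a real project the context would usually live in a file rather than a string literal. The sketch below shows one way to load it with a size guard; `load_project_context` and its `max_chars` limit are hypothetical, and the character limit is only a rough proxy for a token limit.

```python
import tempfile
from pathlib import Path


def load_project_context(project_root: str, filename: str = "CLAUDE.md",
                         max_chars: int = 8_000) -> str:
    """Read the project context file, or return "" if it does not exist.

    max_chars is a crude size guard (assuming roughly 4 characters per token).
    """
    path = Path(project_root) / filename
    if not path.is_file():
        return ""
    text = path.read_text(encoding="utf-8")
    if len(text) > max_chars:
        raise ValueError(f"{filename} has {len(text)} chars; limit is {max_chars}")
    return text


# Demo against a temporary directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CLAUDE.md").write_text(
        "# Project Context\nFastAPI task API.\n", encoding="utf-8"
    )
    context = load_project_context(tmp)
    missing = load_project_context(tmp, filename="MISSING.md")

print(repr(context))
print(repr(missing))
```

The returned string could then be passed as `instructions` in the same way as `claude_md_template`.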

### 🎯 Exercise 3: Create Your CLAUDE.md

**Task:** Write a CLAUDE.md for a project you're working on (or make one up).

**Requirements:**
- Include all sections from the template
- Add at least 2 code examples
- List 5+ constraints/guardrails
- Keep under 2000 tokens

**Test it:** Ask the model to generate code using your context.

In [None]:
# YOUR CODE HERE

your_claude_md = """
# TODO: Write your project context
"""

# Test it
print(f"Token count: {count_tokens(your_claude_md)}")

# Ask the model to do something
# response = client.responses.create(
#     model="gpt-5-mini",
#     instructions=your_claude_md,
#     input="Generate a README.md for this project"
# )
# print(response.output_text)

---

## Part 4: R&D - Reduce and Delegate

Two principles for managing context effectively.

### Reduce: Be Specific and Focused

In [None]:
# BAD: Too much irrelevant context
bad_context = """
I have a Python project. It uses FastAPI. And SQLAlchemy. We also use pytest.
The team likes to use async. We deploy on AWS. Sometimes we use Docker.
Our database is PostgreSQL but we're thinking of switching to MongoDB.
We have 10 developers. The project started in 2023...
(continues for 2000 more tokens)
"""

# GOOD: Focused, relevant context
good_context = """
FastAPI + SQLAlchemy (async) + PostgreSQL. 
Task: Add new endpoint for task assignment.
Pattern: See app/routers/tasks.py for existing endpoints.
"""

print("Bad context:", count_tokens(bad_context), "tokens")
print("Good context:", count_tokens(good_context), "tokens")
print(f"\nSavings: {count_tokens(bad_context) - count_tokens(good_context)} tokens")
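
Reducing context can also be partly automated. As a rough sketch, the function below ranks candidate snippets by how many words they share with the task and keeps only the best matches. `select_relevant` is a hypothetical helper; plain word overlap is a deliberately simple stand-in for a real relevance measure such as embeddings, and it does not filter out common words.

```python
import re


def select_relevant(snippets: list[str], task: str, top_k: int = 2) -> list[str]:
    """Keep the top_k snippets that share the most distinct words with the task."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9_]+", text.lower()))

    task_words = words(task)
    scored = [(len(words(s) & task_words), s) for s in snippets]
    # sorted() is stable, so snippets with equal scores keep their original order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [s for score, s in ranked[:top_k] if score > 0]


snippets = [
    "Routers live in app/routers; each endpoint is an async function.",
    "The team has 10 developers and started in 2023.",
    "Task model: id, title, assigned_to (nullable user id).",
    "We deploy on AWS using Docker images.",
]
task = "Add an endpoint that updates the assigned_to field of a task"

for snippet in select_relevant(snippets, task):
    print("-", snippet)
```

Here the router and task-model snippets are kept, while the team-size and deployment notes are dropped.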

### Delegate: Use Multiple Focused Sessions

In [None]:
# Instead of one huge context, use separate focused sessions

# Session 1: Design the API
design_task = "Design an endpoint to assign tasks to users. Return the API spec."

# Note: no temperature argument; GPT-5 family models reject that parameter
design_response = client.responses.create(
    model="gpt-5-mini",
    input=design_task
)

api_spec = design_response.output_text
print("API Spec:")
print(api_spec)
print("\n" + "="*50 + "\n")

# Session 2: Implement based on spec (separate context)
impl_task = f"""Implement this API spec using FastAPI:

{api_spec}

Follow FastAPI best practices. Use async. Include error handling.
"""

# Fresh request with no previous_response_id, so only the spec is carried over
impl_response = client.responses.create(
    model="gpt-5-mini",
    input=impl_task
)

print("Implementation:")
print(impl_response.output_text[:500] + "...")
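
The two-session pattern above generalizes to any number of steps: each session receives its own instruction plus only the previous step's output. The sketch below illustrates that chaining with an offline stand-in for the model call; `run_pipeline` and `fake_session` are hypothetical names.

```python
from typing import Callable


def run_pipeline(steps: list[str], run_session: Callable[[str], str]) -> list[str]:
    """Run each step in a fresh session, passing along only the previous output."""
    outputs: list[str] = []
    previous = ""
    for step in steps:
        if previous:
            prompt = f"{step}\n\nInput from previous step:\n{previous}"
        else:
            prompt = step
        previous = run_session(prompt)
        outputs.append(previous)
    return outputs


def fake_session(prompt: str) -> str:
    """Offline stand-in for a model call; echoes the first line of the prompt."""
    return f"[result of: {prompt.splitlines()[0]}]"


results = run_pipeline(
    ["Design the endpoint", "Implement the spec", "Write tests"],
    fake_session,
)
for result in results:
    print(result)
```

In the notebook, `run_session` could be a small function that calls `client.responses.create` and returns `output_text`.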

---

## Part 5: Stateful Conversations

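The Responses API can keep conversation state on the server: pass `previous_response_id` and the earlier turns are attached to the new request for you.

**Benefits:**
- No manual message tracking
- Simpler code
- Conversation state stored server-side

**Caveat:** Earlier turns in the chain still occupy the context window and are billed as input tokens, so the budgeting strategies from Part 2 still apply to long conversations.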

### Automatic vs Manual State Management

In [None]:
# Chat Completions: Manual state management
messages = [{"role": "system", "content": "You are helpful"}]

# Turn 1
messages.append({"role": "user", "content": "What's 2+2?"})
response = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Turn 2
messages.append({"role": "user", "content": "Multiply that by 3"})
response = client.chat.completions.create(model="gpt-4o", messages=messages)

print("Chat Completions result:", response.choices[0].message.content)
print(f"Messages tracked: {len(messages)}")

In [None]:
# Responses API: Automatic state management
# Turn 1
response_1 = client.responses.create(
    model="gpt-4o",
    input="What's 2+2?"
)

# Turn 2 - just reference previous!
response_2 = client.responses.create(
    model="gpt-4o",
    input="Multiply that by 3",
    previous_response_id=response_1.id  # ✨ Automatic!
)

print("Responses API result:", response_2.output_text)
print("\n✅ No manual tracking needed!")
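
Whichever approach tracks the state, the model still processes the earlier turns on each request, so input size grows with conversation length. The arithmetic can be sketched as follows, assuming the full history is included on every turn; `cumulative_input_tokens` is a hypothetical helper and the token figures are made up for illustration.

```python
def cumulative_input_tokens(turns: list[tuple[int, int]]) -> list[int]:
    """Given (user_tokens, assistant_tokens) per turn, return each request's input size.

    Assumes the full history is included (client-side or server-side) on every turn.
    """
    sizes = []
    history = 0
    for user_tokens, assistant_tokens in turns:
        request_input = history + user_tokens
        sizes.append(request_input)
        # The assistant's reply becomes part of the history for the next turn
        history = request_input + assistant_tokens
    return sizes


turns = [(50, 200), (30, 150), (40, 300)]
per_request = cumulative_input_tokens(turns)
print(per_request)       # [50, 280, 470]
print(sum(per_request))  # 800
```

Three short user messages totalling 120 tokens end up costing 800 input tokens across the conversation, which is why long chats benefit from trimming or summarization.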

### Combining with Context Management Strategies

You can combine stateful conversations with the strategies from earlier:

In [None]:
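class StatefulContextManager:
    """Combines automatic state with a simple token budget"""
    
    def __init__(self, max_tokens: int = 8000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.model = model
        self.last_response_id = None
        self.last_input_tokens = 0  # input tokens reported for the latest turn
    
    def send(self, message: str) -> str:
        """Send message with automatic state"""
        # Start a fresh conversation once the stored history exceeds the budget
        if self.last_input_tokens > self.max_tokens:
            print("🔄 Token budget exceeded - starting a new conversation")
            self.reset()
        
        response = client.responses.create(
            model=self.model,
            input=message,
            previous_response_id=self.last_response_id
        )
        
        self.last_response_id = response.id
        self.last_input_tokens = response.usage.input_tokens
        return response.output_text
    
    def reset(self):
        """Start new conversation"""
        self.last_response_id = None
        self.last_input_tokens = 0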

# Test it
manager = StatefulContextManager()
print(manager.send("What's the capital of France?"))
print("\n" + "="*50 + "\n")
print(manager.send("What's its population?"))  # Knows context!

### When to Use Which Approach

| Scenario | Use |
|----------|-----|
| **New projects** | Responses API (automatic) |
| **Need fine control** | Chat Completions (manual) |
| **Building agents** | Responses API |
| **Existing codebase** | Chat Completions |
| **Simple conversations** | Responses API |
| **Complex state logic** | Chat Completions |

### 🎯 Exercise 4: Reduce and Delegate in Practice

**Task:** You need to add a complete feature: user authentication with email/password.

Instead of one huge prompt, break it into focused sessions:
1. Design the auth flow
2. Design the database schema
3. Implement password hashing
4. Implement login endpoint
5. Implement token refresh
6. Write tests

**Measure:** Compare total tokens used vs. putting everything in one prompt.

In [None]:
# YOUR CODE HERE

def multi_session_approach():
    """Break into focused sessions"""
    total_tokens = 0
    
    # Session 1: Design
    # Session 2: Schema
    # Session 3: Hashing
    # ...
    
    return total_tokens

def single_session_approach():
    """Everything in one huge prompt"""
    massive_prompt = """TODO: Everything at once"""
    return count_tokens(massive_prompt)

# Compare
# multi_tokens = multi_session_approach()
# single_tokens = single_session_approach()
# print(f"Multi-session: {multi_tokens} tokens")
# print(f"Single-session: {single_tokens} tokens")
# print(f"Difference: {single_tokens - multi_tokens} tokens")

---

## Summary

In this notebook, you learned:

✅ **Token budgeting**: Understanding and managing context limits  
✅ **Conversation management**: Sliding window, summarization, token budgets  
✅ **Stateful conversations**: Automatic state with `previous_response_id`  
✅ **CLAUDE.md pattern**: Structuring project context  
✅ **R&D principle**: Reduce and Delegate for efficiency  

**Key Takeaways:**
- Token limits are real constraints - plan accordingly
- Use Responses API for automatic state management
- Different strategies for different needs (sliding window vs. summarization)
- CLAUDE.md provides consistent context for AI coding agents
- Break complex tasks into focused sessions (Delegate)
- Only include relevant information (Reduce)

**Next Steps:**
- Notebook 1.6: Agentic Applications (combine everything)
- Apply these patterns in your LangGraph agents
- Create CLAUDE.md for your projects

**Resources:**
- [OpenAI Responses API Docs](https://platform.openai.com/docs/api-reference/responses)
- [Token Counting with tiktoken](https://github.com/openai/tiktoken)