# Token Management

This notebook explores token management techniques for working with Large Language Models (LLMs). Topics include:

- Understanding and counting tokens
- Context window optimization
- Chunking strategies for long inputs
- Token usage estimation and cost calculation

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment:

In [None]:
import os
import json
import sys
import math
import pandas as pd
import matplotlib.pyplot as plt
import tiktoken
from IPython.display import display, HTML, clear_output, Markdown
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Import our custom utilities
sys.path.append('.')
from api_utils import (
    call_openrouter,
    extract_text_response,
    estimate_cost
)
from token_counter import (
    count_tokens,
    count_message_tokens,
    chunk_text,
    get_context_window,
    truncate_to_token_limit,
    optimize_context,
    optimize_messages_for_context
)

## 2. Understanding Tokens

Tokens are the basic units that LLMs process. A token is not the same as a word or character. Instead, it's a piece of text determined by the model's tokenizer. Understanding how text is tokenized is essential for efficiently using LLMs.

### 2.1 What is a Token?

Let's explore how different text gets tokenized:

In [None]:
def show_tokenization(text, model="gpt-4o"):
    """Show how a piece of text is tokenized."""
    # Look up the encoding tiktoken associates with this model name.
    # (gpt-4o models use a different encoding than gpt-4, so avoid prefix matching.)
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to cl100k_base
        encoding = tiktoken.get_encoding("cl100k_base")
    
    # Encode the text
    tokens = encoding.encode(text)
    
    # Decode each token to show the breakdown
    token_texts = [encoding.decode([token]) for token in tokens]
    
    # Create a DataFrame to display the tokens
    df = pd.DataFrame({
        'Token ID': tokens,
        'Token Text': token_texts,
        'Bytes': [len(t.encode('utf-8')) for t in token_texts],
        'Chars': [len(t) for t in token_texts]
    })
    
    display(Markdown(f"**Total tokens:** {len(tokens)}"))
    return df

# Test with a simple example
simple_text = "Hello, world! This is a test of tokenization."
show_tokenization(simple_text)

Now let's try with some more complex examples to see how different types of text get tokenized:

In [None]:
# Load some examples from our test cases file
with open('token_test_cases.txt', 'r', encoding='utf-8', errors='replace') as f:
    token_test_cases = f.read()

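# Split the test cases on markdown headers, matching "#" only at the start of a line
# so that a "# " inside a line (e.g. a trailing code comment) does not split a case
import re
test_cases = re.split(r'^#+ .*\n', token_test_cases, flags=re.MULTILINE)[1:]
test_case_names = re.findall(r'^#+ (.*)\n', token_test_cases, flags=re.MULTILINE)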

# Create a dictionary of test cases
test_case_dict = {name.strip(): case.strip() for name, case in zip(test_case_names, test_cases)}

# Display the test cases we have
content = "## Available test cases:\n\n"
for name in test_case_dict.keys():
    content += f"- {name}\n"

display(Markdown(content))

In [None]:
# Test with different types of content
test_examples = [
    ("Short Text", test_case_dict["Short Text"]),
    ("Code Sample (first 100 chars)", test_case_dict["Code Sample"][:100]),
    ("Mathematical Text (first 100 chars)", test_case_dict["Mathematical Text"][:100]),
    ("Multilingual Text (first 100 chars)", test_case_dict["Multilingual Text"][:100])
]

for name, text in test_examples:
    content = f"## {name}:\n\n{text}\n\n"
    display(Markdown(content))
    display(show_tokenization(text))

### 2.2 Tokenization Rules

Based on the examples above, we can observe several patterns in how text is tokenized:

1. **Common Words**: Frequent words like "the", "and", "is" often get their own tokens
2. **Whitespace**: A leading space is usually merged into the token that follows it, while newlines and runs of spaces often form tokens of their own
3. **Capitalization**: "Hello" and "hello" may be tokenized differently
4. **Punctuation**: Punctuation marks often get their own tokens or are grouped with adjacent characters
5. **Numbers**: Long numbers are split into short digit groups (up to three digits each in `cl100k_base`) rather than kept as a single token
6. **Special Characters**: Special characters and symbols may get their own tokens

Let's compare token counts for various text patterns:

In [None]:
# Create a set of test phrases to see tokenization patterns
test_phrases = [
    "Hello",
    "hello",  # case difference
    "hello world",
    "helloworld",  # spacing difference
    "Hello, world!",  # punctuation
    "1234567890",  # numbers
    "ChatGPT",  # mixed case
    "chat gpt",  # space vs. no space
    "deeplearning",
    "deep learning",
    "⭐😊🔥",  # emojis
    "https://example.com",  # URL
    "const x = 10;",  # code
    "e=mc²",  # special characters
    "नमस्ते",  # non-latin script (Hindi)
    "こんにちは"  # non-latin script (Japanese)
]

# Count tokens for each phrase
results = []
for phrase in test_phrases:
    tokens = count_tokens(phrase)
    results.append({
        "phrase": phrase,
        "characters": len(phrase),
        "tokens": tokens,
        "chars_per_token": round(len(phrase) / tokens, 2) if tokens > 0 else 0
    })

# Display the results
df = pd.DataFrame(results)
display(df)

# Visualize the relationship between characters and tokens
plt.figure(figsize=(10, 6))
plt.scatter(df['characters'], df['tokens'])
plt.xlabel('Characters')
plt.ylabel('Tokens')
plt.title('Characters vs. Tokens')
for i, row in df.iterrows():
    plt.annotate(row['phrase'], (row['characters'], row['tokens']), textcoords="offset points", 
                 xytext=(0, 10), ha='center')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
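
One reason the emoji and non-Latin phrases above tend to need more tokens per character is that the byte-level BPE tokenizers in `tiktoken` operate on UTF-8 bytes, and these characters take three or four bytes each. The sketch below uses only the standard library; `utf8_profile` is a hypothetical helper introduced for illustration, and byte counts are only a rough proxy for token counts.

```python
def utf8_profile(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

samples = ["hello", "नमस्ते", "こんにちは", "🔥"]
for s in samples:
    chars, nbytes = utf8_profile(s)
    print(f"{s!r}: {chars} chars, {nbytes} bytes, {nbytes / chars:.1f} bytes/char")
```

ASCII text uses one byte per character, so scripts that need three or four bytes per character give the tokenizer more raw material to split.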

### 2.3 Token Counts for Different Content Types

Now let's analyze token counts for longer pieces of content from our test cases:

In [None]:
# Count tokens for each test case
token_counts = {}
char_counts = {}

for name, text in test_case_dict.items():
    tokens = count_tokens(text)
    chars = len(text)
    token_counts[name] = tokens
    char_counts[name] = chars

# Create a DataFrame for visualization
token_df = pd.DataFrame({
    'Test Case': list(token_counts.keys()),
    'Tokens': list(token_counts.values()),
    'Characters': list(char_counts.values())
})

token_df['Chars per Token'] = token_df['Characters'] / token_df['Tokens']
token_df['Tokens per Char'] = token_df['Tokens'] / token_df['Characters']

# Sort by tokens count
token_df = token_df.sort_values('Tokens', ascending=False)

# Display the results
display(token_df)

# Visualize token counts
plt.figure(figsize=(12, 6))
plt.bar(token_df['Test Case'], token_df['Tokens'], color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Test Case')
plt.ylabel('Number of Tokens')
plt.title('Token Counts for Different Content Types')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Visualize characters per token
plt.figure(figsize=(12, 6))
plt.bar(token_df['Test Case'], token_df['Chars per Token'], color='coral')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Test Case')
plt.ylabel('Characters per Token')
plt.title('Characters per Token for Different Content Types')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Define a sample chat conversation to analyze
chat_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is artificial intelligence?"},
    {"role": "assistant", "content": "Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It encompasses various technologies including machine learning, natural language processing, computer vision, and robotics. AI systems can perform tasks that typically require human intelligence such as visual perception, speech recognition, decision-making, and language translation."},
    {"role": "user", "content": "How is AI different from machine learning?"}
]

In [None]:
def analyze_chat_messages(messages):
    """Analyze token usage in chat messages."""
    # Calculate overall token count
    total_tokens = count_message_tokens(messages)
    
    # Calculate token count for each message individually
    individual_counts = []
    for msg in messages:
        content_tokens = count_tokens(msg["content"])
        role_tokens = count_tokens(msg["role"])
        individual_counts.append({
            "role": msg["role"],
            "content_chars": len(msg["content"]),
            "content_tokens": content_tokens,
            "role_tokens": role_tokens,
            "message_tokens": content_tokens + role_tokens
        })
    
    # Create DataFrame
    df = pd.DataFrame(individual_counts)
    
    # Add message format overhead
    sum_of_parts = df["message_tokens"].sum()
    format_overhead = total_tokens - sum_of_parts
    
    content = f"""## Chat Message Analysis

**Total tokens in the entire message structure:** {total_tokens}
**Sum of individual message tokens:** {sum_of_parts}
**Format overhead tokens:** {format_overhead}
**Overhead tokens per message:** {format_overhead / len(messages):.1f}"""
    
    display(Markdown(content))
    return df

# Analyze our chat messages
chat_analysis = analyze_chat_messages(chat_messages)
display(chat_analysis)

As we can see, there's overhead in the chat message format beyond just the content tokens. This includes tokens for marking message boundaries, role indicators, and other structural elements.
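
To make the overhead concrete, here is a minimal sketch of one way it can be counted. It follows the scheme published in the OpenAI Cookbook for recent chat models (about 3 framing tokens per message, plus 3 tokens that prime the assistant's reply). Those constants are assumptions that may not match `count_message_tokens` or other providers, and `rough_count` is a hypothetical whitespace-based stand-in for a real tokenizer.

```python
def rough_count(text):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())

def count_chat_tokens(messages, count=rough_count,
                      tokens_per_message=3, reply_priming=3):
    """Estimate tokens for a chat payload, including format overhead."""
    total = reply_priming  # tokens that prime the assistant's reply
    for msg in messages:
        total += tokens_per_message     # per-message framing
        total += count(msg["role"])     # role label
        total += count(msg["content"])  # message text
    return total

demo = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is artificial intelligence?"},
]
content_only = sum(rough_count(m["content"]) for m in demo)
total = count_chat_tokens(demo)
print(f"total={total}, content={content_only}, overhead={total - content_only}")
```

Passing a real tokenizer-backed function as `count` would give figures comparable to the analysis above.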

Let's create a function to estimate total tokens for a chat conversation with a certain number of messages:

In [None]:
def estimate_chat_tokens(avg_message_length, num_messages, avg_overhead=4):
    """Estimate total tokens for a chat conversation.
    
    Args:
        avg_message_length: Average number of tokens per message content
        num_messages: Number of messages in the conversation
        avg_overhead: Average overhead tokens per message (typically 3-5)
        
    Returns:
        Estimated total tokens
    """
    return num_messages * (avg_message_length + avg_overhead)

# Create a table of token estimates for different conversation sizes
conversation_sizes = [
    {"messages": 5, "avg_length": 50},  # Small conversation
    {"messages": 10, "avg_length": 100},  # Medium conversation
    {"messages": 20, "avg_length": 150},  # Large conversation
    {"messages": 50, "avg_length": 100},  # Very large conversation
    {"messages": 100, "avg_length": 80},  # Extremely large conversation
]

conversation_estimates = []
for size in conversation_sizes:
    tokens = estimate_chat_tokens(size["avg_length"], size["messages"])
    conversation_estimates.append({
        "num_messages": size["messages"],
        "avg_message_length": size["avg_length"],
        "estimated_tokens": tokens,
        "fits_in_8k_context": tokens <= 8000,
        "fits_in_16k_context": tokens <= 16000,
        "fits_in_128k_context": tokens <= 128000
    })

# Display the estimates
estimate_df = pd.DataFrame(conversation_estimates)
display(estimate_df)

## 3. Context Window Optimization

The context window is the amount of text (measured in tokens) that a model can process at once. Different models have different context window sizes. Let's explore how to optimize within these constraints.

### 3.1 Context Window Sizes

Let's first look at the context window sizes for different models:

In [None]:
# Define a list of popular models (updated for 2025)
models = [
    "openai/gpt-4o-2024-08-06",
    "openai/gpt-4o-mini-2024-07-18", 
    "openai/gpt-4-turbo",
    "anthropic/claude-3-5-sonnet-20241022",
    "anthropic/claude-3-5-haiku-20241022",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-2.0-flash-exp",
    "google/gemini-1.5-pro",
    "google/gemini-1.5-flash",
    "deepseek/deepseek-r1",
    "deepseek/deepseek-chat-v3"
]

# Get context window sizes
context_windows = {model: get_context_window(model) for model in models}

# Create a DataFrame for visualization
context_df = pd.DataFrame({
    'Model': list(context_windows.keys()),
    'Context Window (tokens)': list(context_windows.values())
})

# Sort by context window size for better visualization
context_df = context_df.sort_values('Context Window (tokens)', ascending=False)

# Display the results
display(context_df)

# Visualize context window sizes
plt.figure(figsize=(14, 8))
bars = plt.barh(context_df['Model'], context_df['Context Window (tokens)'], color='skyblue')
plt.xscale('log')  # Log scale for better visualization of the wide range
plt.xlabel('Context Window Size (tokens)')
plt.title('Context Window Sizes for Latest LLM Models (2025)')
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Add labels to the bars
for bar in bars:
    width = bar.get_width()
    label_x_pos = width * 1.1  # Position label slightly to the right of the bar
    plt.text(label_x_pos, bar.get_y() + bar.get_height()/2, f'{int(width):,}',
             va='center', ha='left', fontweight='bold', fontsize=9)

plt.tight_layout()
plt.show()

### 3.2 Context Window Best Practices

When working with the context window, consider these best practices:

1. **Reserve tokens for the response**: Don't fill the entire context window with input
2. **Use a buffer**: Leave some tokens as a safety margin
3. **Prioritize recent messages**: In a conversation, recent messages are often more important
4. **Truncate or summarize older content**: For long conversations, summarize earlier exchanges
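
The first two practices amount to simple arithmetic: the input budget is the context window minus the tokens reserved for the response minus a safety margin. A minimal sketch, where `input_budget`, `fits`, and the default reservation sizes are illustrative assumptions rather than part of `token_counter`:

```python
def input_budget(context_window, reserved_output=500, safety_margin=100):
    """Tokens left for the prompt after reserving output space and a buffer."""
    budget = context_window - reserved_output - safety_margin
    if budget <= 0:
        raise ValueError("Reservations exceed the context window")
    return budget

def fits(prompt_tokens, context_window, **reservations):
    """True if a prompt of this size leaves room for the reservations."""
    return prompt_tokens <= input_budget(context_window, **reservations)

for window in (4000, 8000, 128000):
    print(f"{window:>7} token window -> {input_budget(window):>7} tokens for input")
```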

Let's check whether a long prompt fits within several context window sizes, then truncate it for a smaller window:

In [None]:
# Load a long piece of text from our test cases
long_text = test_case_dict["Long Text"]

# Count the tokens in the long text
long_text_tokens = count_tokens(long_text)
content = f"""## Processing Long Text

**Long text contains:** {long_text_tokens} tokens and {len(long_text)} characters."""

# Let's simulate a prompt that includes this long text
prompt = f"Summarize the following text in 3-5 bullet points:\n\n{long_text}"
prompt_tokens = count_tokens(prompt)
content += f"\n**Full prompt contains:** {prompt_tokens} tokens."

# Check if it fits in different context windows
content += "\n\n### Context Window Compatibility:\n\n"
for window_size in [8000, 16000, 32000, 64000, 128000]:
    fits_with_buffer = prompt_tokens + 500 <= window_size  # 500 tokens reserved for response
    status = "✅" if fits_with_buffer else "❌"
    content += f"- **{window_size} token window** with response buffer: {status}\n"

display(Markdown(content))

In [None]:
# Optimize the long text to fit in a smaller window
smaller_window = 4000  # Simulate a smaller context window
reserved_tokens = 500  # Reserve tokens for the response

# Check if we need optimization
base_prompt = "Summarize the following text in 3-5 bullet points:\n\n"
base_prompt_tokens = count_tokens(base_prompt)
available_tokens = smaller_window - base_prompt_tokens - reserved_tokens

content = f"""## Context Window Optimization

**Base prompt uses:** {base_prompt_tokens} tokens
**Available tokens for content:** {available_tokens}
**Original content tokens:** {long_text_tokens}"""

if long_text_tokens > available_tokens:
    content += f"\n**Need to truncate content by:** {long_text_tokens - available_tokens} tokens"
    optimized_text = truncate_to_token_limit(long_text, available_tokens)
    optimized_tokens = count_tokens(optimized_text)
    content += f"\n**Optimized content tokens:** {optimized_tokens}"
    content += f"\n\n**Truncated text (showing first 200 chars):**\n\n{optimized_text[:200]}..."
    
    # Double-check the total
    optimized_prompt = f"{base_prompt}{optimized_text}"
    optimized_prompt_tokens = count_tokens(optimized_prompt)
    content += f"\n\n**Final prompt uses:** {optimized_prompt_tokens} tokens"
    fits = optimized_prompt_tokens + reserved_tokens <= smaller_window
    status = "✅" if fits else "❌"
    content += f"\n**Fits in {smaller_window} token window with response buffer:** {status}"
else:
    content += "\n\n**Content already fits in the available space.**"

display(Markdown(content))

### 3.3 Optimize Chat History

For multi-turn conversations, we often need to optimize the chat history to fit within the context window. Let's implement a function to manage chat history:

In [None]:
def create_sample_conversation(num_turns, message_length=100):
    """Create a sample conversation with the specified number of turns."""
    conversation = [
        {"role": "system", "content": "You are a helpful AI assistant."}
    ]
    
    # Each turn consists of a user message and an assistant response
    for i in range(num_turns):
        # Add user message
        conversation.append({
            "role": "user", 
            "content": f"This is user message {i+1}. " + "A " * message_length
        })
        
        # Add assistant response
        conversation.append({
            "role": "assistant", 
            "content": f"This is assistant response {i+1}. " + "B " * message_length
        })
    
    return conversation

# Create a long conversation
long_conversation = create_sample_conversation(15, message_length=50)

# Count the total tokens
conversation_tokens = count_message_tokens(long_conversation)

content = f"""## Sample Long Conversation

**Long conversation has:** {len(long_conversation)} messages and uses {conversation_tokens} tokens.

### First 2 messages:"""

for msg in long_conversation[:2]:
    content += f"\n**[{msg['role']}]:** {msg['content'][:50]}..."
    
content += "\n\n### Last 2 messages:"
for msg in long_conversation[-2:]:
    content += f"\n**[{msg['role']}]:** {msg['content'][:50]}..."

display(Markdown(content))

In [None]:
# Optimize the conversation for different context windows
def optimize_for_target_window(messages, target_window, reserved_tokens=500):
    """Optimize messages to fit within a specific target context window."""
    available_tokens = target_window - reserved_tokens
    
    # If messages already fit, return as is
    current_tokens = count_message_tokens(messages)
    if current_tokens <= available_tokens:
        return messages
    
    # Keep removing the oldest messages until we fit, always preserving a leading system message
    optimized = messages.copy()
    has_system = bool(optimized) and optimized[0]["role"] == "system"
    first_removable = 1 if has_system else 0
    while len(optimized) > first_removable and count_message_tokens(optimized) > available_tokens:
        optimized.pop(first_removable)  # Drop the oldest non-system message
    
    return optimized

results_content = []

for window_size in [2000, 4000, 8000]:
    # Reserve tokens for the response
    reserved_tokens = 500
    
    # Optimize the conversation for this specific window size
    try:
        optimized_conversation = optimize_for_target_window(
            long_conversation, 
            target_window=window_size,
            reserved_tokens=reserved_tokens
        )
        
        # Count tokens in the optimized conversation
        optimized_tokens = count_message_tokens(optimized_conversation)
        
        result = f"""### Context window: {window_size} tokens

**Original conversation:** {len(long_conversation)} messages, {conversation_tokens} tokens
**Optimized conversation:** {len(optimized_conversation)} messages, {optimized_tokens} tokens
**Messages removed:** {len(long_conversation) - len(optimized_conversation)}
**Fits in context window with response buffer:** {"✅" if optimized_tokens + reserved_tokens <= window_size else "❌"}

"""
        results_content.append(result)
        
    except Exception as e:
        result = f"""### Context window: {window_size} tokens

**Error optimizing conversation:** {str(e)}

"""
        results_content.append(result)

# Display all results
content = "## Conversation Optimization Results\n\n" + "\n".join(results_content)
display(Markdown(content))
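
Dropping the oldest messages discards their content entirely. The fourth best practice in Section 3.2 suggests summarizing older exchanges instead. The sketch below shows the shape of that approach. `compact_history` is a hypothetical helper, the default summarizer is a placeholder (a real one would call a model), and putting the summary in a `system` message is an assumption that not every API accepts.

```python
def compact_history(messages, keep_recent=4, summarize=None):
    """Replace older messages with one summary message; keep recent ones verbatim."""
    if summarize is None:
        # Placeholder summarizer: a real implementation would call an LLM here.
        summarize = lambda msgs: f"[Summary of {len(msgs)} earlier messages]"

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return list(messages)  # Nothing old enough to summarize

    split = len(rest) - keep_recent
    older, recent = rest[:split], rest[split:]
    summary = {"role": "system", "content": summarize(older)}
    return system + [summary] + recent

history = [{"role": "system", "content": "You are a helpful AI assistant."}]
for i in range(5):
    history.append({"role": "user", "content": f"Question {i + 1}"})
    history.append({"role": "assistant", "content": f"Answer {i + 1}"})

compacted = compact_history(history, keep_recent=4)
print(len(history), "->", len(compacted), "messages")
```

The trade-off is one extra model call per compaction in exchange for keeping some memory of the earlier turns.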

## 4. Chunking Strategies for Long Inputs

When dealing with very long texts that exceed the context window, we need strategies to chunk the content and process it in parts.

### 4.1 Basic Chunking

Let's demonstrate a basic chunking strategy using the `chunk_text` helper imported from `token_counter`:

In [None]:
# Test our chunking function on the long text
chunks = chunk_text(long_text, chunk_size=500, overlap=50)

content = f"""## Basic Text Chunking

**Divided text into {len(chunks)} chunks with 50 token overlap**"""

for i, chunk in enumerate(chunks[:3]):  # Show first 3 chunks
    chunk_tokens = count_tokens(chunk)
    content += f"\n\n### Chunk {i+1} ({chunk_tokens} tokens):\n\n{chunk[:100]}..."

# If there are more than 3 chunks, show the last one too
if len(chunks) > 3:
    last_chunk = chunks[-1]
    last_chunk_tokens = count_tokens(last_chunk)
    content += f"\n\n### Last chunk ({last_chunk_tokens} tokens):\n\n{last_chunk[:100]}..."

display(Markdown(content))
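
The implementation of `chunk_text` lives in `token_counter` and is not shown in this notebook. A fixed-size chunker with overlap is commonly written as a sliding window over the token sequence, as in the sketch below. This is an assumption about the general technique, not the module's actual code, and it uses whitespace-separated words as stand-in tokens.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With a real tokenizer, the same window would run over token IDs, and each window would be decoded back to text.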

### 4.2 Intelligent Chunking

Basic chunking splits text based solely on token count. Let's implement a more intelligent chunking strategy that tries to respect document structure:

In [None]:
def chunk_by_structure(text, max_chunk_size=500, overlap=50):
    """Chunk text based on structural elements like paragraphs."""
    # Split by paragraph breaks first
    paragraphs = re.split(r'\n\s*\n', text)
    
    chunks = []
    current_chunk = ""
    current_chunk_tokens = 0
    
    for paragraph in paragraphs:
        paragraph_tokens = count_tokens(paragraph)
        
        # If the paragraph itself is too large, we need to split it
        if paragraph_tokens > max_chunk_size:
            # Process the current chunk if it's not empty
            if current_chunk_tokens > 0:
                chunks.append(current_chunk)
                current_chunk = ""
                current_chunk_tokens = 0
            
            # Split the large paragraph and add as separate chunks
            paragraph_chunks = chunk_text(paragraph, max_chunk_size, overlap)
            chunks.extend(paragraph_chunks)
        
        # If adding this paragraph would exceed the limit, finalize the current chunk
        elif current_chunk_tokens + paragraph_tokens > max_chunk_size:
            chunks.append(current_chunk)
            
            # Start a new chunk with overlap
            if overlap > 0 and current_chunk_tokens > 0:
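                # Approximate the last ~`overlap` tokens by taking trailing words.
                # English text averages roughly 0.75 words per token, so this is a
                # heuristic; the overlap can push the new chunk past max_chunk_size.
                words = current_chunk.split()
                overlap_words = max(1, int(overlap * 0.75))
                overlap_text = " ".join(words[-overlap_words:])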
                current_chunk = overlap_text + "\n\n" + paragraph
                current_chunk_tokens = count_tokens(current_chunk)
            else:
                current_chunk = paragraph
                current_chunk_tokens = paragraph_tokens
        else:
            # Add paragraph to the current chunk
            if current_chunk:
                current_chunk += "\n\n" + paragraph
            else:
                current_chunk = paragraph
            current_chunk_tokens = count_tokens(current_chunk)
    
    # Don't forget to add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

# Test the structural chunking
structure_chunks = chunk_by_structure(long_text, max_chunk_size=500, overlap=50)

content = f"""## Intelligent Structural Chunking

**Divided text into {len(structure_chunks)} chunks based on structure**"""

for i, chunk in enumerate(structure_chunks[:3]):  # Show first 3 chunks
    chunk_tokens = count_tokens(chunk)
    content += f"\n\n### Chunk {i+1} ({chunk_tokens} tokens):\n\n{chunk[:100]}..."

# If there are more than 3 chunks, show the last one too
if len(structure_chunks) > 3:
    last_chunk = structure_chunks[-1]
    last_chunk_tokens = count_tokens(last_chunk)
    content += f"\n\n### Last chunk ({last_chunk_tokens} tokens):\n\n{last_chunk[:100]}..."

display(Markdown(content))

### 4.3 Processing Strategies for Chunked Text

Once we've chunked the text, there are several strategies for processing it:

1. **Process Each Chunk Independently**: Good for tasks like summarization or classification
2. **Sequential Refinement**: Process chunks in order, carrying a running summary or state forward to the next chunk
3. **Map-Reduce**: Process each chunk independently, then combine the results
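
The second strategy, processing chunks sequentially while carrying information forward, can be sketched as a loop that folds each chunk into a running summary. `refine_summarize` and `fake_llm` are hypothetical names. In practice `llm` would wrap a real API call, and the prompt wording here is only illustrative.

```python
def refine_summarize(chunks, llm):
    """Fold chunks into a running summary, one model call per chunk."""
    summary = ""
    for chunk in chunks:
        if summary:
            prompt = (
                "Here is the summary so far:\n"
                f"{summary}\n\n"
                "Update it to include the following text:\n"
                f"{chunk}"
            )
        else:
            prompt = f"Summarize the following text:\n{chunk}"
        summary = llm(prompt)  # the new summary replaces the old one
    return summary

# Stand-in model that records the prompts it receives
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"summary #{len(calls)}"

result = refine_summarize(["part one", "part two", "part three"], fake_llm)
print(result, "after", len(calls), "calls")
```

Unlike map-reduce, the calls cannot run in parallel, but each call sees what earlier chunks contributed.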

Let's implement a simple map-reduce approach for summarization:

In [None]:
def map_reduce_summarize(text, max_chunk_size=500, model="openai/gpt-4o-mini-2024-07-18"):
    """Summarize long text using a map-reduce approach."""
    # Map phase: Split the text into chunks and summarize each chunk
    chunks = chunk_by_structure(text, max_chunk_size)
    
    display(Markdown(f"## Map-Reduce Summarization\n\n**Divided text into {len(chunks)} chunks**"))
    
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        display(Markdown(f"**Processing chunk {i+1}/{len(chunks)}...**"))
        
        # Call the API to summarize this chunk
        map_prompt = f"Summarize the following text in 2-3 sentences, capturing key points:\n\n{chunk}"
        response = call_openrouter(
            prompt=map_prompt,
            model=model,
            temperature=0.3,  # Lower temperature for more focused summaries
            max_tokens=150
        )
        
        if response.get("success", False):
            chunk_summary = extract_text_response(response)
            chunk_summaries.append(chunk_summary)
        else:
            error_msg = f"[Error processing chunk {i+1}]"
            display(Markdown(f"❌ Error summarizing chunk {i+1}: {response.get('error', 'Unknown error')}"))
            chunk_summaries.append(error_msg)
    
    # Reduce phase: Combine the summaries into a single summary
    display(Markdown("**Combining summaries...**"))
    combined_summaries = "\n\n".join([f"Chunk {i+1}: {summary}" for i, summary in enumerate(chunk_summaries)])
    
    reduce_prompt = f"""You have been given summaries of different chunks of a larger text. 
    Based on these summaries, create a cohesive summary of the entire text in 4-5 bullet points.
    
    Chunk summaries:
    {combined_summaries}"""
    
    final_response = call_openrouter(
        prompt=reduce_prompt,
        model=model,
        temperature=0.3,
        max_tokens=300
    )
    
    if final_response.get("success", False):
        final_summary = extract_text_response(final_response)
        return {
            "chunk_summaries": chunk_summaries,
            "final_summary": final_summary
        }
    else:
        display(Markdown(f"❌ Error in reduce phase: {final_response.get('error', 'Unknown error')}"))
        return {
            "chunk_summaries": chunk_summaries,
            "final_summary": "[Error generating final summary]"
        }

# Test the map-reduce summarization
summary_results = map_reduce_summarize(long_text, max_chunk_size=600)

content = "## Individual Chunk Summaries\n\n"
for i, summary in enumerate(summary_results["chunk_summaries"]):
    content += f"### Chunk {i+1}:\n\n{summary}\n\n"

content += "---\n\n## Final Combined Summary\n\n"
content += summary_results["final_summary"]

display(Markdown(content))

## 5. Token Usage Estimation and Cost Calculation

Understanding and managing token usage is important both for ensuring content fits within context windows and for managing costs.

### 5.1 Token Usage Patterns

Let's analyze how tokens are used in different scenarios:

In [None]:
# Define common tasks and their token usage patterns
tasks = [
    {"name": "Simple Question", "input_tokens": 10, "output_tokens": 50},
    {"name": "Detailed Question", "input_tokens": 50, "output_tokens": 200},
    {"name": "Text Summarization (1 page)", "input_tokens": 500, "output_tokens": 100},
    {"name": "Text Summarization (5 pages)", "input_tokens": 2500, "output_tokens": 200},
    {"name": "Essay Generation", "input_tokens": 100, "output_tokens": 1000},
    {"name": "Code Generation", "input_tokens": 200, "output_tokens": 500},
    {"name": "Chat Conversation (10 turns)", "input_tokens": 1000, "output_tokens": 1000},
    {"name": "Document Analysis", "input_tokens": 4000, "output_tokens": 300},
]

# Create a DataFrame for visualization
task_df = pd.DataFrame(tasks)
task_df["total_tokens"] = task_df["input_tokens"] + task_df["output_tokens"]

# Visualize token distribution
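# Create the axes first and pass them to pandas; otherwise DataFrame.plot opens
# its own figure and the one created here is left empty
fig, ax = plt.subplots(figsize=(12, 6))
task_df.plot(x="name", y=["input_tokens", "output_tokens"], kind="bar", stacked=True,
             color=["skyblue", "coral"], ax=ax)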
plt.title("Token Usage Patterns for Different Tasks")
plt.xlabel("Task")
plt.ylabel("Number of Tokens")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Token Type")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

### 5.2 Cost Calculation

Now, let's calculate the cost of these tasks for different models:

In [None]:
# Calculate costs for each task across different models
model_costs = []

for task in tasks:
    task_name = task["name"]
    input_tokens = task["input_tokens"]
    output_tokens = task["output_tokens"]
    
    for model in models[:5]:  # Use first 5 models to keep the table manageable
        cost = estimate_cost(model, input_tokens, output_tokens)
        model_costs.append({
            "task": task_name,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
            "cost_usd": cost
        })

# Create a DataFrame for analysis
cost_df = pd.DataFrame(model_costs)

# Pivot the table to show costs by task and model
pivot_df = cost_df.pivot(index="task", columns="model", values="cost_usd")

# Format the pivot table for display as currency with 6 decimal places.
# (Column-wise Series.map avoids DataFrame.applymap, which is deprecated in pandas 2.1+.)
formatted_pivot = pivot_df.apply(lambda col: col.map(lambda x: f"${x:.6f}"))
display(formatted_pivot)
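
`estimate_cost` comes from `api_utils`, and its implementation is not shown here. Per-token pricing is usually computed as input tokens times the input rate plus output tokens times the output rate, with rates quoted per million tokens. The sketch below illustrates that arithmetic. The model names and prices are placeholders invented for the example, not real rates.

```python
# Placeholder prices in USD per 1M tokens. These are NOT real rates.
EXAMPLE_PRICES = {
    "example/small-model": {"input": 0.50, "output": 1.50},
    "example/large-model": {"input": 5.00, "output": 15.00},
}

def sketch_cost(model, input_tokens, output_tokens, prices=EXAMPLE_PRICES):
    """Cost in USD for one request under per-million-token pricing."""
    rates = prices[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

# The "Document Analysis" task above: 4000 input tokens, 300 output tokens
for model in EXAMPLE_PRICES:
    print(f"{model}: ${sketch_cost(model, 4000, 300):.6f}")
```

Because output tokens are typically priced higher than input tokens, generation-heavy tasks can cost more than their total token count suggests.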

In [None]:
# Visualize cost comparison for a specific task
def visualize_task_costs(task_name):
    """Visualize costs for a specific task across models."""
    task_costs = cost_df[cost_df["task"] == task_name]
    
    plt.figure(figsize=(12, 6))
    plt.bar(task_costs["model"], task_costs["cost_usd"], color="purple")
    plt.title(f"Cost Comparison for '{task_name}' Across Models")
    plt.xlabel("Model")
    plt.ylabel("Cost (USD)")
    plt.xticks(rotation=45, ha="right")
    plt.grid(axis="y", linestyle="--", alpha=0.7)
    
    # Add cost labels above the bars
    for i, cost in enumerate(task_costs["cost_usd"]):
        plt.text(i, cost + 0.00001, f"${cost:.6f}", ha="center", va="bottom", fontsize=9)
    
    plt.tight_layout()
    plt.show()

# Test with a couple of tasks
visualize_task_costs("Document Analysis")
visualize_task_costs("Simple Question")

### 5.3 Cost Optimization Strategies

Let's explore some strategies for optimizing token usage and reducing costs:

In [None]:
# Define cost optimization strategies
optimization_strategies = [
    {
        "strategy": "Use smaller models for simpler tasks",
        "description": "Use GPT-4o mini or Gemini Flash for straightforward tasks, reserving larger models for complex reasoning.",
        "token_savings": "0%",
        "cost_savings": "50-90%"
    },
    {
        "strategy": "Prompt engineering to reduce verbosity",
        "description": "Request concise responses and specify exactly what you need.",
        "token_savings": "20-50%",
        "cost_savings": "10-25%"
    },
    {
        "strategy": "Pre-process and filter inputs",
        "description": "Remove irrelevant content before sending to the model, especially for document processing.",
        "token_savings": "30-70%",
        "cost_savings": "30-70%"
    },
    {
        "strategy": "Reuse responses for repeated questions",
        "description": "Implement caching for common queries to avoid redundant API calls.",
        "token_savings": "Varies",
        "cost_savings": "Up to 90%"
    },
    {
        "strategy": "Use map-reduce for large documents",
        "description": "Process large documents in chunks then combine, rather than exceeding context limits.",
        "token_savings": "Varies",
        "cost_savings": "20-50%"
    },
    {
        "strategy": "Truncate chat history",
        "description": "Maintain only the most relevant conversation history, summarizing older messages.",
        "token_savings": "40-80%",
        "cost_savings": "40-80%"
    },
    {
        "strategy": "Use embeddings for retrieval",
        "description": "For large knowledge bases, use embeddings to retrieve relevant content instead of including everything.",
        "token_savings": "80-95%",
        "cost_savings": "80-95%"
    }
]

# Display the strategies
strategy_df = pd.DataFrame(optimization_strategies)
display(strategy_df)

Let's implement one of these strategies to demonstrate token savings:

In [None]:
# Example: Prompt Engineering for Verbosity Control

verbose_prompt = "Tell me about the history of artificial intelligence and its major developments over time."
concise_prompt = "List the 5 most important milestones in AI history with their dates. Keep it brief."

# Model to use
test_model = "openai/gpt-4o-mini-2024-07-18"

# Get verbose response
verbose_response = call_openrouter(
    prompt=verbose_prompt,
    model=test_model,
    temperature=0.7,
    max_tokens=500
)

# Get concise response
concise_response = call_openrouter(
    prompt=concise_prompt,
    model=test_model,
    temperature=0.7,
    max_tokens=500
)

# Compare results
if verbose_response.get("success", False) and concise_response.get("success", False):
    verbose_text = extract_text_response(verbose_response)
    concise_text = extract_text_response(concise_response)
    
    verbose_tokens = count_tokens(verbose_text)
    concise_tokens = count_tokens(concise_text)
    
    verbose_cost = estimate_cost(test_model, count_tokens(verbose_prompt), verbose_tokens)
    concise_cost = estimate_cost(test_model, count_tokens(concise_prompt), concise_tokens)
    
    print(f"Verbose prompt: {verbose_prompt}")
    print(f"Concise prompt: {concise_prompt}\n")
    
    print(f"Verbose response tokens: {verbose_tokens}")
    print(f"Concise response tokens: {concise_tokens}")
    print(f"Token reduction: {(verbose_tokens - concise_tokens) / verbose_tokens:.1%}\n")
    
    print(f"Verbose response cost: ${verbose_cost:.6f}")
    print(f"Concise response cost: ${concise_cost:.6f}")
    print(f"Cost savings: ${verbose_cost - concise_cost:.6f} ({(verbose_cost - concise_cost) / verbose_cost:.1%})\n")
    
    print("Verbose Response:")
    print(verbose_text)
    
    print("\n" + "-" * 50 + "\n")
    
    print("Concise Response:")
    print(concise_text)
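else:
    # Report which call failed and why
    for label, response in [("verbose", verbose_response), ("concise", concise_response)]:
        if not response.get("success", False):
            print(f"Error getting {label} response: {response.get('error', 'Unknown error')}")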
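
The caching strategy from the table ("Reuse responses for repeated questions") can be sketched in a few lines. This is a minimal in-memory sketch, not part of the notebook's utilities: `ResponseCache` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real call such as `call_openrouter` so the cell runs offline. It assumes responses are dictionaries with a `success` key, as in the cells above.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed on model, prompt, and call parameters (hypothetical helper)."""

    def __init__(self, fetch_fn):
        # fetch_fn is any callable that performs the real API call
        self.fetch_fn = fetch_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, params):
        # Serialize deterministically so identical requests map to the same key
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model, prompt, **params):
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]

        self.misses += 1
        response = self.fetch_fn(prompt=prompt, model=model, **params)
        # Only cache successful responses so transient errors are retried
        if response.get("success", False):
            self._store[key] = response
        return response


def fake_fetch(prompt, model, **params):
    """Stand-in for the real API call (assumption: returns a dict with 'success')."""
    return {"success": True, "text": f"answer to: {prompt}"}


cache = ResponseCache(fake_fetch)
first = cache.get("demo-model", "What is a token?", temperature=0.0)
second = cache.get("demo-model", "What is a token?", temperature=0.0)
print(f"hits={cache.hits}, misses={cache.misses}")  # hits=1, misses=1
```

Caching fits best with deterministic settings such as `temperature=0`. With sampling enabled, a cached entry returns the same text every time, which may or may not be what you want.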

## 6. Practical Application: Token-Aware Document Processor

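Let's build a practical application that demonstrates token-aware processing of documents. The class below reuses `map_reduce_summarize` and `chunk_by_structure`, so those functions must already be defined in the notebook before this cell runs.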

In [None]:
class TokenAwareDocumentProcessor:
    """A processor that intelligently handles documents with token awareness."""
    
    def __init__(self, model="openai/gpt-4o-mini-2024-07-18", max_tokens_per_chunk=700):
        self.model = model
        self.max_tokens_per_chunk = max_tokens_per_chunk
        self.context_window = get_context_window(model)
        self.reserved_tokens = 300  # Reserve tokens for system prompts and responses
    
    def _estimate_processing_cost(self, document):
        """Estimate the cost of processing the document."""
        doc_tokens = count_tokens(document)
        chunks = math.ceil(doc_tokens / self.max_tokens_per_chunk)
        
        # Estimate input tokens (chunks + overhead)
        estimated_input_tokens = doc_tokens + (chunks * 50)  # 50 tokens overhead per chunk
        
        # Estimate output tokens (rough, assumed ratios; rounded down to whole tokens)
        operations = {
            "summarize": int(doc_tokens * 0.2),  # Assumed: summary is ~20% of original length
            "analyze": int(doc_tokens * 0.3),    # Assumed: analysis is ~30% of original length
            "extract": int(doc_tokens * 0.1)     # Assumed: extraction is ~10% of original length
        }
        
        # Calculate costs for each operation
        costs = {}
        for op, output_tokens in operations.items():
            costs[op] = estimate_cost(self.model, estimated_input_tokens, output_tokens)
        
        return {
            "document_tokens": doc_tokens,
            "estimated_chunks": chunks,
            "estimated_input_tokens": estimated_input_tokens,
            "estimated_output_tokens": operations,
            "estimated_costs": costs
        }
    
    def _determine_processing_strategy(self, document):
        """Determine the best processing strategy based on document length."""
        doc_tokens = count_tokens(document)
        available_tokens = self.context_window - self.reserved_tokens
        
        if doc_tokens <= available_tokens:
            return "single_pass"
        else:
            return "map_reduce"
    
    def summarize(self, document):
        """Summarize the document."""
        strategy = self._determine_processing_strategy(document)
        
        if strategy == "single_pass":
            print("Using single pass processing strategy")
            system_prompt = "You are an expert summarizer. Create a concise, accurate summary of the text."
            prompt = f"Please summarize the following document in a few paragraphs, highlighting the main points:\n\n{document}"
            
            response = call_openrouter(
                prompt=prompt,
                model=self.model,
                system_prompt=system_prompt,
                temperature=0.3,
                max_tokens=self.reserved_tokens
            )
            
            if response.get("success", False):
                return extract_text_response(response)
            else:
                return f"Error: {response.get('error', 'Unknown error')}"
        else:
            print("Using map-reduce processing strategy")
            result = map_reduce_summarize(document, self.max_tokens_per_chunk, self.model)
            return result["final_summary"]
    
    def analyze_document(self, document):
        """Analyze the document."""
        strategy = self._determine_processing_strategy(document)
        
        if strategy == "single_pass":
            print("Using single pass processing strategy")
            system_prompt = "You are an expert document analyst. Provide clear, insightful analysis."
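            # Build the prompt line by line so the code's indentation
            # is not sent to the model as extra whitespace tokens
            prompt = (
                "Please analyze the following document and provide insights on:\n"
                "1. Main themes and topics\n"
                "2. Key arguments or points\n"
                "3. Tone and style\n"
                "4. Intended audience\n\n"
                f"Document:\n{document}"
            )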
            
            response = call_openrouter(
                prompt=prompt,
                model=self.model,
                system_prompt=system_prompt,
                temperature=0.3,
                max_tokens=self.reserved_tokens
            )
            
            if response.get("success", False):
                return extract_text_response(response)
            else:
                return f"Error: {response.get('error', 'Unknown error')}"
        else:
            print("Using map-reduce processing strategy")
            # Split into chunks
            chunks = chunk_by_structure(document, self.max_tokens_per_chunk)
            
            # Analyze each chunk
            chunk_analyses = []
            for i, chunk in enumerate(chunks):
                print(f"Processing chunk {i+1}/{len(chunks)}...")
                
                # Build the prompt without the code's indentation to avoid wasted tokens
                chunk_prompt = (
                    "Analyze this document segment and identify:\n"
                    "1. Main themes and topics\n"
                    "2. Key points or arguments\n"
                    "3. Tone and style\n\n"
                    f"Document segment:\n{chunk}"
                )
                
                response = call_openrouter(
                    prompt=chunk_prompt,
                    model=self.model,
                    temperature=0.3,
                    max_tokens=200
                )
                
                if response.get("success", False):
                    chunk_analysis = extract_text_response(response)
                    chunk_analyses.append(chunk_analysis)
                else:
                    chunk_analyses.append(f"[Error analyzing chunk {i+1}]")
            
            # Combine the analyses
            print("Combining analyses...")
            combined_analyses = "\n\n".join([f"Segment {i+1} Analysis:\n{analysis}" for i, analysis in enumerate(chunk_analyses)])
            
            # Build the prompt without the code's indentation to avoid wasted tokens
            final_prompt = (
                "You have been given analyses of different segments of a document.\n"
                "Based on these segment analyses, provide a comprehensive analysis "
                "of the entire document, covering:\n"
                "1. Main themes and topics across the entire document\n"
                "2. Key arguments or points and how they develop\n"
                "3. Overall tone and style\n"
                "4. Likely intended audience\n\n"
                f"Segment analyses:\n{combined_analyses}"
            )
            
            final_response = call_openrouter(
                prompt=final_prompt,
                model=self.model,
                temperature=0.3,
                max_tokens=300
            )
            
            if final_response.get("success", False):
                return extract_text_response(final_response)
            else:
                return f"Error in final analysis: {final_response.get('error', 'Unknown error')}"
    
    def process_document(self, document):
        """Process the document with cost estimation and strategy selection."""
        # Get cost estimates
        estimates = self._estimate_processing_cost(document)
        
        # Display processing information
        print(f"Document size: {estimates['document_tokens']} tokens")
        print(f"Processing strategy: {self._determine_processing_strategy(document)}")
        print(f"Estimated chunks needed: {estimates['estimated_chunks']}")
        print(f"Estimated cost for summarization: ${estimates['estimated_costs']['summarize']:.6f}")
        print(f"Estimated cost for analysis: ${estimates['estimated_costs']['analyze']:.6f}")
        print("\n" + "-" * 50 + "\n")
        
        # Process the document
        results = {
            "summary": self.summarize(document),
            "analysis": self.analyze_document(document)
        }
        
        return results

In [None]:
# Test the TokenAwareDocumentProcessor with our long text
processor = TokenAwareDocumentProcessor(model="openai/gpt-4o-mini-2024-07-18")
results = processor.process_document(long_text)

print("\n" + "=" * 80)
print("DOCUMENT SUMMARY:")
print(results["summary"])

print("\n" + "=" * 80)
print("DOCUMENT ANALYSIS:")
print(results["analysis"])
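
To make the arithmetic behind `_estimate_processing_cost` concrete, here is a standalone sketch of the same calculation. The function names and the per-million-token prices are placeholders chosen for illustration, not values returned by `estimate_cost`; substitute your provider's current rates.

```python
import math


def estimate_processing_tokens(doc_tokens, max_tokens_per_chunk=700,
                               overhead_per_chunk=50, output_ratio=0.2):
    """Mirror the estimate above: chunk count, input tokens, output tokens."""
    chunks = math.ceil(doc_tokens / max_tokens_per_chunk)
    input_tokens = doc_tokens + chunks * overhead_per_chunk
    output_tokens = round(doc_tokens * output_ratio)
    return chunks, input_tokens, output_tokens


def cost_usd(input_tokens, output_tokens,
             input_price_per_million, output_price_per_million):
    """Convert token counts to dollars using per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# A 2,000-token document with the default 700-token chunks:
#   ceil(2000 / 700) = 3 chunks
#   2000 + 3 * 50    = 2150 input tokens
#   2000 * 0.2       = 400 output tokens (summarization ratio)
chunks, input_tokens, output_tokens = estimate_processing_tokens(2000)
print(chunks, input_tokens, output_tokens)  # 3 2150 400

# Placeholder prices (assumption): $0.15 per million input tokens,
# $0.60 per million output tokens
estimated = cost_usd(input_tokens, output_tokens, 0.15, 0.60)
print(f"Estimated cost: ${estimated:.6f}")
```

Because the per-chunk overhead is added to the input side only, the estimate grows slightly faster than the document length once chunking is involved.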

## 7. Exercises

Here are some exercises to practice token management techniques:

1. **Token Counter**: Create a tool that counts tokens for common file formats (TXT, PDF, Markdown, etc.) and provides a cost estimate for processing them with different models

2. **Context Window Optimizer**: Develop a function that dynamically decides whether to use a whole-document approach or a chunking approach based on token counts

3. **Chat History Manager**: Implement a system that maintains chat history while keeping token count under control (summarizing old messages or removing less relevant ones)

4. **Token-Efficient Embeddings**: Create a system that generates embeddings for document chunks while minimizing token usage and maximizing information retention

5. **Cost Comparison Tool**: Build a tool that compares the cost of running the same task across different models, helping users choose the most cost-effective option for their needs
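
As a possible starting point for exercise 3, the sketch below keeps system messages and then the most recent turns that fit a token budget. `trim_history` and `word_count` are hypothetical names; the word-based counter is only a stand-in so the example runs without a tokenizer, and in the notebook you would pass `count_tokens` instead. Per-message formatting overhead is ignored.

```python
def trim_history(messages, max_tokens, count_fn):
    """Keep system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # System messages are always kept, so they come out of the budget first
    budget = max_tokens - sum(count_fn(m["content"]) for m in system)

    kept = []
    # Walk backwards so the most recent messages are considered first
    for message in reversed(others):
        cost = count_fn(message["content"])
        if cost > budget:
            break  # Stop at the first message that no longer fits
        kept.append(message)
        budget -= cost

    # Restore chronological order
    return system + list(reversed(kept))


def word_count(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A token is a piece of text."},
    {"role": "user", "content": "How many tokens fit in a context window?"},
]

trimmed = trim_history(history, max_tokens=20, count_fn=word_count)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

A fuller solution might summarize the dropped messages instead of discarding them, as the exercise suggests.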

## 8. Summary

In this notebook, we've explored token management strategies for working with LLMs, including:

- Understanding how text is tokenized and how different content types affect token counts
- Implementing techniques to optimize content for different context window sizes
- Developing chunking strategies for processing long documents
- Creating map-reduce approaches for handling content that exceeds context limits
- Estimating token usage and calculating costs across different models
- Building a token-aware document processor that selects optimal strategies based on content length

Effective token management is essential for getting the most out of LLMs while controlling costs. By applying these techniques, you can process documents that exceed a single context window and keep your API usage in check.