# Context Windows Lab: Experiment Analysis

**Course:** LLMs and Multi-Agent Systems  
**Team:** OmerAndYogever  
**Assignment:** 5 - Context Windows in Practice

This notebook demonstrates and analyzes four experiments exploring LLM context window behavior:

1. **Needle in Haystack** - Lost in the Middle phenomenon
2. **Context Size Impact** - Accuracy degradation with larger contexts
3. **RAG Impact** - Comparing retrieval strategies
4. **Context Engineering** - Managing context in multi-step agents

## Setup and Configuration

In [None]:
# Install dependencies if needed
# !pip install -e ..

import sys
sys.path.insert(0, '../src')

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Project imports
from context_windows_lab.config import Config
from context_windows_lab.experiments import (
    NeedleInHaystackExperiment,
    ContextSizeExperiment,
    RAGImpactExperiment,
    ContextEngineeringExperiment,
)
from context_windows_lab.utils.document_generator import DocumentGenerator, FactPosition
from context_windows_lab.utils.token_counter import TokenCounter
from context_windows_lab.utils.statistics import StatisticalAnalyzer
from context_windows_lab.utils.visualization import Visualizer

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
%matplotlib inline

print("✓ Setup complete!")

In [None]:
# Load configuration
config = Config.from_env()

# Validate
errors = config.validate()
if errors:
    print("⚠️ Configuration warnings:")
    for e in errors:
        print(f"  - {e}")
else:
    print("✓ Configuration valid")
    print(f"  Model: {config.claude_model}")
    print(f"  Trials per experiment: {config.num_runs}")

---
## Experiment 1: Needle in Haystack (Lost in the Middle)

### Hypothesis
LLMs have difficulty retrieving information placed in the middle of a long context, compared to information at the beginning or end.

### Methodology
- Generate synthetic documents with embedded facts
- Place the critical fact at start, middle, or end positions
- Measure retrieval accuracy for each position
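
The placement step can be sketched as follows. This is a minimal illustration rather than the lab's `DocumentGenerator`; the `embed_fact` helper, the filler sentences, and the fact string are all hypothetical.

```python
def embed_fact(filler_sentences, fact, position):
    """Return a document with `fact` inserted at 'start', 'middle', or 'end'."""
    indices = {
        "start": 0,
        "middle": len(filler_sentences) // 2,
        "end": len(filler_sentences),
    }
    if position not in indices:
        raise ValueError(f"unknown position: {position!r}")
    sentences = list(filler_sentences)  # copy so the input is not mutated
    sentences.insert(indices[position], fact)
    return " ".join(sentences)


# Hypothetical filler and fact, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(6)]
fact = "The access code is 7421."

docs = {pos: embed_fact(filler, fact, pos) for pos in ("start", "middle", "end")}
for pos, doc in docs.items():
    print(f"{pos:>6}: fact begins at character {doc.index(fact)}")
```

Each resulting document contains the same filler and the same fact, so any difference in retrieval accuracy can be attributed to position alone.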

In [None]:
# Create the experiment
needle_exp = NeedleInHaystackExperiment(
    config=config,
    verbose=True,
    num_documents=5,
    words_per_doc=200,
)

print(f"Experiment: {needle_exp.NAME}")
print(f"Description: {needle_exp.DESCRIPTION}")
print(f"Estimated duration: {needle_exp.ESTIMATED_DURATION}")

In [None]:
# Run the experiment
output_dir = Path('../outputs')
output_dir.mkdir(parents=True, exist_ok=True)

needle_result = needle_exp.run(num_trials=5, output_dir=output_dir)

In [None]:
# Display results
print("\n" + "="*60)
print("EXPERIMENT 1 RESULTS: Needle in Haystack")
print("="*60)

analysis = needle_result.analysis
print(f"\nMain Finding: {analysis['main_finding']}")

print("\nKey Metrics:")
for metric, value in analysis['key_metrics'].items():
    print(f"  {metric}: {value}")

print(f"\nStatistical Significance: {analysis['statistical_significance']}")

In [None]:
# Visualize results
from IPython.display import Image, display

if needle_result.visualizations:
    display(Image(filename=needle_result.visualizations[0]))

---
## Experiment 2: Context Window Size Impact

### Hypothesis
As the amount of text placed in the context window grows, LLM accuracy decreases and latency increases.

### Methodology
- Gradually increase the number of documents: 2, 5, 10, 20, 50
- Measure accuracy, latency, and token usage at each level
- Analyze correlation between size and performance
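
The correlation step can be illustrated with a plain-Python Pearson coefficient. This is a sketch under stated assumptions: `pearson_r` is a hypothetical helper, not the lab's `StatisticalAnalyzer`, and the accuracy values are placeholders constructed to fall linearly, not measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


doc_counts = [2, 5, 10, 20, 50]
# Placeholder values that decrease linearly with document count;
# they are NOT measurements from this lab.
placeholder_accuracy = [1.0 - 0.01 * n for n in doc_counts]

r = pearson_r(doc_counts, placeholder_accuracy)
print(f"r = {r:.3f}")
```

A value of `r` near -1 would indicate that accuracy falls steadily as documents are added. The sketch does not compute a p-value; `scipy.stats.pearsonr` returns both the coefficient and the p-value if SciPy is available.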

In [None]:
# Create the experiment
size_exp = ContextSizeExperiment(
    config=config,
    verbose=True,
    doc_counts=[2, 5, 10, 20, 50],
    words_per_doc=200,
)

print(f"Experiment: {size_exp.NAME}")
print(f"Description: {size_exp.DESCRIPTION}")

In [None]:
# Run the experiment
size_result = size_exp.run(num_trials=5, output_dir=output_dir)

In [None]:
# Display results
print("\n" + "="*60)
print("EXPERIMENT 2 RESULTS: Context Size Impact")
print("="*60)

analysis = size_result.analysis
print(f"\nMain Finding: {analysis['main_finding']}")

print("\nKey Metrics:")
for metric, value in analysis['key_metrics'].items():
    print(f"  {metric}: {value}")

print(f"\nAccuracy Correlation: r={analysis['accuracy_correlation']['r']:.3f}, p={analysis['accuracy_correlation']['p_value']:.4f}")
print(f"Latency Correlation: r={analysis['latency_correlation']['r']:.3f}, p={analysis['latency_correlation']['p_value']:.4f}")

In [None]:
# Visualize results
if size_result.visualizations:
    display(Image(filename=size_result.visualizations[0]))

---
## Experiment 3: RAG Impact

### Hypothesis
RAG (Retrieval-Augmented Generation) improves accuracy and reduces latency compared to placing the full corpus in the context.

### Methodology
- Create a corpus of 20 documents
- Compare two approaches:
  - **Full Context**: All documents in the context
  - **RAG**: Only relevant documents retrieved via similarity search
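
The retrieval step can be sketched with a bag-of-words cosine similarity. This is an assumption made for illustration: the lab may use embeddings or a vector store instead, and `tokenize`, `cosine_similarity`, `retrieve`, and the toy corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(doc))), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy corpus, for illustration only.
corpus = [
    "The cafeteria serves lunch from noon until two.",
    "The server room access code is stored in the vault.",
    "Parking permits are renewed every January.",
]
top = retrieve("What is the access code for the server room?", corpus, k=1)
print(top[0])
```

Only the documents returned by `retrieve` would be placed in the prompt, which is what reduces token count relative to the full-context condition.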

In [None]:
# Create the experiment
rag_exp = RAGImpactExperiment(
    config=config,
    verbose=True,
    total_documents=20,
    relevant_documents=3,
    words_per_doc=200,
)

print(f"Experiment: {rag_exp.NAME}")
print(f"Description: {rag_exp.DESCRIPTION}")

In [None]:
# Run the experiment
rag_result = rag_exp.run(num_trials=5, output_dir=output_dir)

In [None]:
# Display results
print("\n" + "="*60)
print("EXPERIMENT 3 RESULTS: RAG Impact")
print("="*60)

analysis = rag_result.analysis
print(f"\nMain Finding: {analysis['main_finding']}")

print("\nKey Metrics:")
for metric, value in analysis['key_metrics'].items():
    print(f"  {metric}: {value}")

print("\nImprovements:")
print(f"  Accuracy: {analysis['improvements']['accuracy']:.1%}")
print(f"  Latency: {analysis['improvements']['latency']:.2f}s faster")
print(f"  Tokens: {analysis['improvements']['tokens']:.0f} fewer")

In [None]:
# Visualize results
if rag_result.visualizations:
    display(Image(filename=rag_result.visualizations[0]))

---
## Experiment 4: Context Engineering Strategies

### Hypothesis
Different context management strategies (Select, Compress, Write) maintain performance better than naive accumulation.

### Methodology
- Simulate a multi-step agent executing 10 sequential actions
- Compare four strategies:
  - **Baseline**: Accumulate all history
  - **Select**: Keep only recent items (RAG-like)
  - **Compress**: Summarize when too long
  - **Write**: Use external scratchpad
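
The four strategies can be sketched as functions over a list of action outputs. This is a simplified illustration, not the lab's `ContextEngineeringExperiment`: the function names and defaults are hypothetical, budgets are counted in words rather than tokens, and truncation stands in for a model-generated summary.

```python
def baseline(history):
    """Accumulate everything: the context is the full history."""
    return list(history)


def select(history, keep_last=3):
    """Keep only the most recent `keep_last` items."""
    return list(history[-keep_last:])


def compress(history, max_words=40):
    """Collapse older items into one short summary line when over budget.

    Taking the first two words of each item is a stand-in for real
    summarization, which would typically call a model.
    """
    total = sum(len(item.split()) for item in history)
    if total <= max_words:
        return list(history)
    older, latest = history[:-1], history[-1:]
    summary = "Summary: " + "; ".join(" ".join(item.split()[:2]) for item in older)
    return [summary] + list(latest)


def write(history, scratchpad):
    """Move older items to an external scratchpad; keep only the latest in context."""
    scratchpad.extend(history[:-1])
    return list(history[-1:])


# Hypothetical action outputs, ten words each.
history = [f"step {i} output " + "word " * 7 for i in range(1, 11)]
scratchpad = []

contexts = {
    "baseline": baseline(history),
    "select": select(history),
    "compress": compress(history),
    "write": write(history, scratchpad),
}
for name, ctx in contexts.items():
    words = sum(len(item.split()) for item in ctx)
    print(f"{name:>8}: {len(ctx)} items, {words} words")
```

The printout makes the trade-off visible: the baseline context grows with every action, while the other three stay bounded at the cost of discarding, condensing, or externalizing earlier outputs.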

In [None]:
# Create the experiment
eng_exp = ContextEngineeringExperiment(
    config=config,
    verbose=True,
    num_actions=10,
    action_output_words=100,
    max_context_tokens=2000,
)

print(f"Experiment: {eng_exp.NAME}")
print(f"Description: {eng_exp.DESCRIPTION}")

In [None]:
# Run the experiment
eng_result = eng_exp.run(num_trials=5, output_dir=output_dir)

In [None]:
# Display results
print("\n" + "="*60)
print("EXPERIMENT 4 RESULTS: Context Engineering Strategies")
print("="*60)

analysis = eng_result.analysis
print(f"\nMain Finding: {analysis['main_finding']}")

print("\nKey Metrics:")
for metric, value in analysis['key_metrics'].items():
    print(f"  {metric}: {value}")

print(f"\nBest Strategy: {analysis['best_strategy'].upper()}")
print("\nANOVA Results:")
print(f"  F-statistic: {analysis['anova']['statistic']:.2f}")
print(f"  p-value: {analysis['anova']['p_value']:.4f}")
print(f"  Effect size (η²): {analysis['anova']['effect_size']:.3f}")

In [None]:
# Visualize results
if eng_result.visualizations:
    display(Image(filename=eng_result.visualizations[0]))
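
For reference, a one-way ANOVA F-statistic and the η² effect size are conventionally derived from between-group and within-group sums of squares, with η² = SS_between / SS_total. The sketch below assumes that convention; the lab's `StatisticalAnalyzer` may compute these differently, `one_way_anova` is a hypothetical helper, and the scores are placeholders rather than results.

```python
def one_way_anova(groups):
    """Return (F statistic, eta squared) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    n_total, k = len(all_values), len(groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Placeholder scores for two made-up groups; NOT results from this lab.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
f_stat, eta_squared = one_way_anova(groups)
print(f"F = {f_stat:.2f}, eta squared = {eta_squared:.3f}")
```

The sketch omits the p-value, which requires the F distribution; `scipy.stats.f_oneway` returns both the statistic and the p-value if SciPy is available.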

---
## Summary and Conclusions

### Summary Table

In [None]:
# Create summary table
summary_data = [
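    {
        "Experiment": "1. Needle in Haystack",
        "Key Finding": needle_result.analysis.get('main_finding', 'N/A')[:100] + "...",
        # Heuristic: "Yes" if any analysis value mentions "significant".
        # Note that this also matches phrases such as "not significant".
        "Significant": "Yes" if any('significant' in str(v).lower() for v in needle_result.analysis.values()) else "No",
    },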
    {
        "Experiment": "2. Context Size Impact",
        "Key Finding": size_result.analysis.get('main_finding', 'N/A')[:100] + "...",
        "Significant": "Yes" if size_result.analysis.get('accuracy_correlation', {}).get('significant', False) else "No",
    },
    {
        "Experiment": "3. RAG Impact",
        "Key Finding": rag_result.analysis.get('main_finding', 'N/A')[:100] + "...",
        "Significant": "Yes" if rag_result.analysis.get('comparisons', {}).get('accuracy', {}).get('significant', False) else "No",
    },
    {
        "Experiment": "4. Context Engineering",
        "Key Finding": eng_result.analysis.get('main_finding', 'N/A')[:100] + "...",
        "Significant": "Yes" if eng_result.analysis.get('anova', {}).get('significant', False) else "No",
    },
]

summary_df = pd.DataFrame(summary_data)
display(summary_df)

### Key Takeaways

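1. **Lost in the Middle**: Information placed in the middle of a long context tends to be harder to retrieve than information at the start or end. This positional bias has been reported for transformer-based LLMs, though its strength varies with the model and the context length.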

2. **Context Size Trade-offs**: Larger contexts provide more information but can cost accuracy and latency. For a given task, the aim is the smallest context that still contains the information needed.

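3. **RAG Benefits**: Retrieval-Augmented Generation can improve both accuracy and efficiency by restricting the context to relevant documents. Whether the gain is statistically significant in this run is reported in the summary table above.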

4. **Context Engineering**: Strategic management of context (Select, Compress, Write) helps maintain performance in multi-step agent workflows.

### Recommendations

- Place critical information at the **beginning or end** of prompts
- Use **RAG** for large document collections
- Implement **context management strategies** for agents with long conversations
- Monitor **token usage** to optimize cost and performance
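
The first recommendation can be sketched as a small prompt builder that states the critical fact up front and repeats it just before the question. The `build_prompt` helper, its labels, and the example strings are hypothetical and only illustrate the ordering.

```python
def build_prompt(critical, background, question):
    """Assemble a prompt with `critical` at the start and repeated near the end."""
    parts = [
        f"Key fact: {critical}",
        *background,
        f"Reminder of the key fact: {critical}",
        f"Question: {question}",
    ]
    return "\n\n".join(parts)


prompt = build_prompt(
    critical="The access code is 7421.",
    background=["Background paragraph one.", "Background paragraph two."],
    question="What is the access code?",
)
print(prompt)
```

Repeating the fact costs a few extra tokens, so this ordering is best reserved for the small amount of information the answer actually depends on.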

In [None]:
print("\n" + "="*60)
print("ALL EXPERIMENTS COMPLETE")
print("="*60)
print(f"\nAll results saved to: {output_dir.absolute()}")
print("\nFiles generated:")
# List plots first, then JSON results, each in a stable sorted order
for pattern in ("*.png", "*.json"):
    for f in sorted(output_dir.glob(pattern)):
        print(f"  - {f.name}")