# AIME GEPA Benchmarking Suite

A benchmarking suite for GEPA prompt optimization on AIME math problems.
It provides plug-and-play model configuration, automated evaluation, data export, and visualization.

## Features
- 🔌 **Plug-and-Play Models**: Easy model switching (OpenAI, Gemini, Claude, etc.)
- 📊 **Comprehensive Analysis**: Score evolution, prompt length tracking, correlation analysis
- 💾 **Data Export**: JSON export with evolution trees and candidate metadata
- 📈 **Rich Visualizations**: 4-panel evolution plots inspired by index.ipynb
- 🎯 **AIME Specialized**: Optimized for mathematical competition problems

In [None]:
# Setup and Dependencies
import os
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any, TypedDict
import warnings
warnings.filterwarnings('ignore')

# AnyMaths Adapter Dependencies
import litellm
from pydantic import BaseModel, Field

# Set up plotting style (inspired by index.ipynb)
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300

print("✅ Dependencies loaded successfully")

In [None]:
# AnyMaths Adapter Implementation (from GEPA repository)
class EvaluationBatch(TypedDict):
    """Results from evaluating a batch of examples."""
    outputs: List[Any]
    scores: List[float]
    trajectories: Optional[List[Any]]

class AnyMathsDataInst(TypedDict):
    input: str
    additional_context: dict[str, str]
    answer: str

class AnyMathsTrajectory(TypedDict):
    data: AnyMathsDataInst
    full_assistant_response: str

class AnyMathsRolloutOutput(TypedDict):
    full_assistant_response: str

class AnyMathsStructuredOutput(BaseModel):
    final_answer: str = Field(
        ..., description="The final answer to the mathematical problem (i.e., no units, no other text)"
    )
    solution_pad: str = Field(..., description="The solution pad containing the step-by-step solution to the problem.")

class AnyMathsAdapter:
    """AnyMaths Adapter for mathematical word problems using LiteLLM."""
    
    def __init__(
        self,
        model: str,
        failure_score: float = 0.0,
        api_base: str | None = None,
        max_litellm_workers: int = 4,
    ) -> None:
        self.model = model
        self.failure_score = failure_score
        self.litellm = litellm
        self.max_litellm_workers = max_litellm_workers
        self.api_base = api_base

        # Ollama runs locally; fall back to its default endpoint when none is given
        if self.model.startswith("ollama") and not self.api_base:
            self.api_base = "http://localhost:11434"

        # Normalize an empty string to None so LiteLLM uses the provider default
        if not self.api_base:
            self.api_base = None
    
    def evaluate(
        self,
        batch: List[AnyMathsDataInst],
        candidate: dict[str, str],
        capture_traces: bool = False,
    ) -> EvaluationBatch:
        import ast

        outputs: List[AnyMathsRolloutOutput] = []
        scores: List[float] = []
        trajectories: List[AnyMathsTrajectory] | None = [] if capture_traces else None

        if not candidate:
            raise ValueError("Candidate must contain at least one component text.")

        system_content = next(iter(candidate.values()))

        litellm_requests = []

        for data in batch:
            user_content = f"{data['input']}"

            messages = [
                {"role": "system", "content": system_content},
                {"role": "user", "content": user_content},
            ]

            litellm_requests.append(messages)

        try:
            responses = self.litellm.batch_completion(
                model=self.model,
                messages=litellm_requests,
                api_base=self.api_base,
                max_workers=self.max_litellm_workers,
                format=AnyMathsStructuredOutput.model_json_schema(),
                response_format={
                    "type": "json_object",
                    "response_schema": AnyMathsStructuredOutput.model_json_schema(),
                    "enforce_validation": True,
                },
            )
        except litellm.exceptions.JSONSchemaValidationError as e:
            raise e

        for data, response in zip(batch, responses, strict=False):
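            correct_output_format = True
            try:
                content = response.choices[0].message.content.strip()
                try:
                    # The reply is requested as JSON, so try a JSON parser first
                    assistant_response = json.loads(content)
                except json.JSONDecodeError:
                    # Fall back to Python-literal syntax (e.g. single-quoted keys)
                    assistant_response = ast.literal_eval(content)
                if not isinstance(assistant_response, dict):
                    raise ValueError("Structured response is not a JSON object.")
            except Exception:
                assistant_response = "Assistant failed to respond with the correct answer or format."
                correct_output_format = False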

            if correct_output_format:
                structured_assistant_response = f"Assistant's Solution: {assistant_response['solution_pad']}\n"
                structured_assistant_response += f"Final Answer: {assistant_response['final_answer']}"
                output = {"full_assistant_response": structured_assistant_response}
                score = 1.0 if data["answer"] in assistant_response["final_answer"] else self.failure_score
            else:
                output = {"full_assistant_response": assistant_response}
                score = self.failure_score

            outputs.append(output)
            scores.append(score)

            if capture_traces:
                trajectories.append({"data": data, "full_assistant_response": output["full_assistant_response"]})
        
        return EvaluationBatch(outputs=outputs, scores=scores, trajectories=trajectories)
    
    def make_reflective_dataset(
        self,
        candidate: dict[str, str],
        eval_batch: EvaluationBatch,
        components_to_update: list[str],
    ) -> dict[str, list[dict[str, Any]]]:
        ret_d: dict[str, list[dict[str, Any]]] = {}

        assert len(components_to_update) == 1
        comp = components_to_update[0]

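        items: list[dict[str, Any]] = []
        if eval_batch['trajectories'] is None:
            raise ValueError("Trajectories are required; call evaluate(..., capture_traces=True) first.")
        trace_instances = list(zip(eval_batch['trajectories'], eval_batch['scores'], eval_batch['outputs'], strict=False))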

        for trace_instance in trace_instances:
            traj, score, _ = trace_instance
            data = traj["data"]
            generated_outputs = traj["full_assistant_response"]

            if score > 0.0:
                feedback = f"The generated response is correct. The final answer is: {data['answer']}."
            else:
                additional_context_str = "\n".join(f"{k}: {v}" for k, v in data["additional_context"].items())
                if additional_context_str:
                    feedback = (
                        f"The generated response is incorrect. The correct answer is: {data['answer']}. "
                        "Ensure that the correct answer is included in the response exactly as it is. "
                        f"Here is some additional context that might be helpful:\n{additional_context_str}"
                    )
                else:
                    feedback = (
                        f"The generated response is incorrect. The correct answer is: {data['answer']}. "
                        "Ensure that the correct answer is included in the response exactly as it is."
                    )

            d = {"Inputs": data["input"], "Generated Outputs": generated_outputs, "Feedback": feedback}
            items.append(d)

        ret_d[comp] = items

        if len(items) == 0:
            raise Exception("No valid predictions found for any module.")

        return ret_d

print("✅ AnyMaths Adapter implemented")
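
The scoring step in `AnyMathsAdapter.evaluate` can be exercised without any API call. The sketch below is a minimal, self-contained approximation: `parse_structured_response` and `score_response` are hypothetical helper names (not part of GEPA or LiteLLM), and it assumes the model replies with a JSON object carrying `final_answer` and `solution_pad`, as the response schema requests.

```python
import json
from typing import Optional


def parse_structured_response(raw: str) -> Optional[dict]:
    """Parse the model's JSON reply; return None when it is malformed.

    Hypothetical helper for illustration only.
    """
    try:
        parsed = json.loads(raw.strip())
    except (json.JSONDecodeError, AttributeError):
        return None
    required = {"final_answer", "solution_pad"}
    if not isinstance(parsed, dict) or not required.issubset(parsed):
        return None
    return parsed


def score_response(raw: str, answer: str, failure_score: float = 0.0) -> float:
    """Mirror the adapter's rule: 1.0 when the reference answer appears in final_answer."""
    parsed = parse_structured_response(raw)
    if parsed is None:
        return failure_score
    return 1.0 if answer in str(parsed["final_answer"]) else failure_score


good = '{"final_answer": "204", "solution_pad": "Sum the cases: 204."}'
bad = "The answer is 204"  # not JSON, so it earns the failure score

print(score_response(good, "204"))  # 1.0
print(score_response(bad, "204"))   # 0.0
```

Because the rule is a containment check, a `final_answer` of `1204` would also score 1.0 against a reference of `204`; that is worth keeping in mind when reading benchmark numbers.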

In [None]:
def load_aime_datasets():
    """Load AIME datasets for training, validation, and testing."""
    
    # Load AIMO validation set for training/validation
    train_split = load_dataset("AI-MO/aimo-validation-aime")['train']
    train_data = []
    for x in train_split:
        # Convert to AnyMaths format
        aime_data = {
            "input": x['problem'],
            "additional_context": {"solution": x.get('solution', '')},
            "answer": str(x['answer']),
        }
        train_data.append(aime_data)
    
    # Shuffle and split
    import random
    random.Random(42).shuffle(train_data)
    tot_num = len(train_data)

    # Load AIME 2025 for testing
    test_split = load_dataset("MathArena/aime_2025")['train']
    test_data = []
    for x in test_split:
        # Convert to AnyMaths format
        aime_data = {
            "input": x['problem'],
            "additional_context": {},
            "answer": str(x['answer']),
        }
        test_data.append(aime_data)

    # Create splits
    train_set = train_data[:int(0.5 * tot_num)]
    val_set = train_data[int(0.5 * tot_num):]
    test_set = test_data

    return train_set, val_set, test_set

def extract_int(answer: str):
    """Try to parse an integer directly, or extract from \\boxed{}."""
    # Try direct parse
    try:
        return int(answer)
    except ValueError:
        pass
    
    # Try regex extraction (matches a literal \boxed{...})
    match = re.search(r'\\boxed\{([^}]*)\}', answer)
    if match:
        try:
            return int(match.group(1))
        except ValueError:
            pass
    
    # Failed both ways
    return None

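def metric(data_inst: AnyMathsDataInst, output: AnyMathsRolloutOutput):
    """Exact-match metric: compare integer answers, falling back to string equality."""
    correct_answer = data_inst['answer']
    try:
        # Extract answer from structured response
        response = output['full_assistant_response']
        if "Final Answer:" in response:
            llm_answer = response.split("Final Answer:")[-1].strip()
        else:
            llm_answer = response

        # Compare as integers so that "5" does not match "15"
        expected, predicted = extract_int(correct_answer), extract_int(llm_answer)
        if expected is not None and predicted is not None:
            return 1.0 if expected == predicted else 0.0
        return 1.0 if correct_answer.strip() == llm_answer.strip() else 0.0
    except (KeyError, ValueError, TypeError):
        return 0.0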

print("✅ Dataset loading and metrics defined")
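
The adapter's containment check (`answer in final_answer`) can over-credit a response: a reference answer of `5` is contained in `15`. A stricter comparison, sketched below under the assumption that every AIME answer is an integer, normalizes both sides to `int` first. `parse_answer` and `exact_match` are hypothetical names introduced only for this illustration.

```python
import re
from typing import Optional

# Matches a literal \boxed{...} and captures the contents of the braces.
BOXED = re.compile(r"\\boxed\{([^}]*)\}")


def parse_answer(text: str) -> Optional[int]:
    """Return the integer in `text`, looking inside a boxed expression if present."""
    candidate = text.strip()
    match = BOXED.search(candidate)
    if match:
        candidate = match.group(1).strip()
    try:
        return int(candidate)
    except ValueError:
        return None


def exact_match(reference: str, prediction: str) -> float:
    """Score 1.0 only when both sides parse to the same integer."""
    ref, pred = parse_answer(reference), parse_answer(prediction)
    return 1.0 if ref is not None and ref == pred else 0.0


print("5" in "15")                        # True: containment over-credits
print(exact_match("5", "15"))             # 0.0
print(exact_match("73", r"\boxed{73}"))   # 1.0
print(exact_match("73", "073"))           # 1.0: leading zeros are ignored
```

Comparing parsed integers also treats `073` and `73` as the same answer, which matches how AIME answers are conventionally written with leading zeros.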

In [None]:
# Model Configuration for AnyMaths Adapter
class ModelConfig:
    """Model configuration for AnyMaths adapter with LiteLLM support."""
    
    MODELS = {
        "openai-gpt4": {
            "model": "openai/gpt-4o-mini",
            "api_base": None,  # Uses OpenAI default
            "requires_api_key": "OPENAI_API_KEY"
        },
        "gemma-27b": {
            "model": "openrouter/google/gemma-3-27b-it", 
            "api_base": "https://openrouter.ai/api/v1",
            "requires_api_key": "OPENROUTER_API_KEY"
        },
        "qwen-72b": {
            "model": "openrouter/qwen/qwen-2.5-72b-instruct",
            "api_base": "https://openrouter.ai/api/v1",
            "requires_api_key": "OPENROUTER_API_KEY"
        },
        "claude-sonnet": {
            "model": "openrouter/anthropic/claude-3.5-sonnet",
            "api_base": "https://openrouter.ai/api/v1",
            "requires_api_key": "OPENROUTER_API_KEY"
        },
        "gemini-pro": {
            "model": "vertex_ai/gemini-2.5-pro",
            "api_base": None,
            "requires_api_key": "GOOGLE_APPLICATION_CREDENTIALS"  # or GEMINI_API_KEY
        },
        "ollama-qwen": {
            "model": "ollama/qwen2.5:7b",
            "api_base": "http://localhost:11434",
            "requires_api_key": None
        }
    }
    
    @classmethod
    def setup_model(cls, model_key: str) -> Tuple[str, Optional[str], int]:
        """Setup model configuration for AnyMaths adapter."""
        if model_key not in cls.MODELS:
            raise ValueError(f"Unknown model: {model_key}. Available: {list(cls.MODELS.keys())}")
        
        config = cls.MODELS[model_key]
        
        # Check API key if required
        if config["requires_api_key"]:
            api_key = os.getenv(config["requires_api_key"])
            if not api_key:
                raise ValueError(f"API key not found for {config['requires_api_key']}. Set environment variable.")
        
        return config["model"], config["api_base"], 4  # model, api_base, max_workers
    
    @classmethod
    def list_models(cls):
        """List all available models with their status."""
        print("Available models for AnyMaths adapter:")
        for model_key, config in cls.MODELS.items():
            if config["requires_api_key"]:
                api_status = "✅" if os.getenv(config["requires_api_key"]) else "❌"
                print(f"  {api_status} {model_key}: {config['model']} (needs {config['requires_api_key']})")
            else:
                print(f"  ✅ {model_key}: {config['model']} (local)")

# GEPA Optimization Support
def gepa_optimize_anymaths(
    adapter: AnyMathsAdapter,
    train_data: List[AnyMathsDataInst],
    val_data: List[AnyMathsDataInst],
    seed_prompt: str,
    reflection_lm_model: str = "vertex_ai/gemini-2.5-pro",
    max_metric_calls: int = 200,
    reflection_minibatch_size: int = 3
):
    """Run GEPA optimization using the AnyMaths adapter."""
    try:
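        # Try to use GEPA optimize function directly
        from gepa import optimize

        # gepa.optimize accepts the reflection LM as a callable mapping a prompt to text.
        # LiteLLM reads provider credentials (e.g. GEMINI_API_KEY, or Google
        # application-default credentials for vertex_ai models) from the environment.
        def reflection_lm(prompt: str) -> str:
            response = litellm.completion(
                model=reflection_lm_model,
                messages=[{"role": "user", "content": prompt}],
                temperature=1.0,
                max_tokens=32000,
            )
            return response.choices[0].message.content

        # Run GEPA optimization. Keyword names follow gepa.optimize in the GEPA
        # repository; verify them against the installed version.
        result = optimize(
            seed_candidate={"system_prompt": seed_prompt},
            trainset=train_data,
            valset=val_data,
            adapter=adapter,
            reflection_lm=reflection_lm,
            max_metric_calls=max_metric_calls,
            reflection_minibatch_size=reflection_minibatch_size,
        )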
        
        return result
        
    except ImportError:
        print("⚠️ GEPA library not available. Using mock optimization for demonstration.")
        
        # Mock optimization for demonstration
        class MockResult:
            def __init__(self):
                self.best_candidate = {"system_prompt": seed_prompt + "\n\n[OPTIMIZED] Use step-by-step reasoning for AIME problems."}
                self.candidates = [{"system_prompt": seed_prompt}]
                self.val_aggregate_scores = [0.3]
                self.best_idx = 0
                
        return MockResult()

# Test model configuration
ModelConfig.list_models()
print("✅ Model configuration and GEPA integration ready")
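
The credential check in `ModelConfig.setup_model` reduces to a lookup against the environment. A minimal sketch follows, with a hypothetical `missing_credentials` helper and a trimmed-down registry. The variable names are taken from the configuration above; the `env` argument is an assumption added so the function can be tested without touching `os.environ`.

```python
import os
from typing import Mapping, Optional

# Trimmed copy of the registry: model key -> required environment variable.
REQUIRED_ENV: dict[str, Optional[str]] = {
    "openai-gpt4": "OPENAI_API_KEY",
    "gemma-27b": "OPENROUTER_API_KEY",
    "ollama-qwen": None,  # local model, no key needed
}


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the model keys whose required variable is unset or empty."""
    env = os.environ if env is None else env
    return [
        key
        for key, var in REQUIRED_ENV.items()
        if var is not None and not env.get(var)
    ]


# Passing a plain dict keeps the example independent of the real environment.
print(missing_credentials({"OPENAI_API_KEY": "sk-example"}))  # ['gemma-27b']
```

Running such a check before a long benchmark avoids discovering a missing key only after the datasets have been downloaded.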

In [None]:
class AIMEBenchmarker:
    """Complete AIME benchmarking suite with AnyMaths adapter and GEPA optimization."""
    
    def __init__(self, model_key: str, output_dir: str = "./benchmark_results"):
        self.model_key = model_key
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Setup model configuration for AnyMaths adapter
        model_name, api_base, max_workers = ModelConfig.setup_model(model_key)
        
        # Initialize AnyMaths adapter
        self.adapter = AnyMathsAdapter(
            model=model_name,
            api_base=api_base,
            max_litellm_workers=max_workers,
            failure_score=0.0
        )
        
        # Load datasets
        self.train_set, self.val_set, self.test_set = load_aime_datasets()
        
        # Results storage
        self.baseline_score = None
        self.optimized_score = None
        self.candidate_data = []
        self.evolution_tree = None
        self.optimized_prompt = None
        
        # Default seed prompt for AIME problems
        self.seed_prompt = """You are an AI assistant that solves mathematical competition problems from AIME (American Invitational Mathematics Examination). 

For each problem:
1. Read the problem carefully and identify what is being asked
2. Work through the solution step-by-step with clear mathematical reasoning
3. Show all calculations and intermediate steps
4. Provide the final numerical answer

The following fields are required in your response:
- solution_pad: Your complete step-by-step solution with all work shown
- final_answer: Only the numerical answer (no units, explanations, or extra text)

For AIME problems, answers are always integers between 0 and 999."""
        
        print(f"✅ AIME Benchmarker initialized for {model_key}")
        print(f"   Model: {model_name}")
        print(f"   Dataset sizes: Train={len(self.train_set)}, Val={len(self.val_set)}, Test={len(self.test_set)}")
        print(f"   Output directory: {self.output_dir}")
        
    def evaluate_baseline(self, subset_size: int = 5) -> float:
        """Evaluate baseline performance using seed prompt."""
        print(f"\n🔍 Evaluating baseline performance on {subset_size} problems...")
        
        test_subset = self.test_set[:subset_size]
        candidate = {"system_prompt": self.seed_prompt}
        
        # Evaluate using adapter
        eval_batch = self.adapter.evaluate(test_subset, candidate, capture_traces=False)
        
        # Calculate accuracy
        correct = sum(eval_batch['scores'])
        total = len(eval_batch['scores'])
        
        for i, score in enumerate(eval_batch['scores']):
            status = "✅" if score > 0 else "❌"
            print(f"  Problem {i+1}/{total}: {status}")
        
        self.baseline_score = correct / total
        print(f"📊 Baseline Score: {self.baseline_score:.4f} ({int(correct)}/{total})")
        return self.baseline_score
    
    def run_gepa_optimization(self, 
                            train_subset_size: int = 10,
                            val_subset_size: int = 5, 
                            max_metric_calls: int = 200,
                            reflection_minibatch_size: int = 3) -> str:
        """Run GEPA optimization and return optimized prompt."""
        print(f"\n🚀 Starting GEPA optimization...")
        print(f"   Train subset: {train_subset_size}, Val subset: {val_subset_size}")
        print(f"   Max metric calls: {max_metric_calls}")
        
        # Create subsets
        train_subset = self.train_set[:train_subset_size]
        val_subset = self.val_set[:val_subset_size]
        
        # Run GEPA optimization
        result = gepa_optimize_anymaths(
            adapter=self.adapter,
            train_data=train_subset,
            val_data=val_subset,
            seed_prompt=self.seed_prompt,
            max_metric_calls=max_metric_calls,
            reflection_minibatch_size=reflection_minibatch_size
        )
        
        # Extract optimized prompt
        self.optimized_prompt = result.best_candidate.get("system_prompt", self.seed_prompt)
        
        # Create placeholder candidate data for visualization (scores are hard-coded, not read from `result`)
        self._create_mock_candidate_data(result)
        
        print(f"✅ GEPA optimization completed")
        print(f"   Optimized prompt length: {len(self.optimized_prompt)} characters")
        
        return self.optimized_prompt
    
    def evaluate_optimized(self, subset_size: int = 5) -> float:
        """Evaluate optimized prompt performance."""
        if not self.optimized_prompt:
            raise ValueError("Must run GEPA optimization first")
            
        print(f"\n🔍 Evaluating optimized performance on {subset_size} problems...")
        
        test_subset = self.test_set[:subset_size]
        candidate = {"system_prompt": self.optimized_prompt}
        
        # Evaluate using adapter
        eval_batch = self.adapter.evaluate(test_subset, candidate, capture_traces=False)
        
        # Calculate accuracy
        correct = sum(eval_batch['scores'])
        total = len(eval_batch['scores'])
        
        for i, score in enumerate(eval_batch['scores']):
            status = "✅" if score > 0 else "❌"
            print(f"  Problem {i+1}/{total}: {status}")
        
        self.optimized_score = correct / total
        improvement = self.optimized_score - (self.baseline_score or 0)
        print(f"📊 Optimized Score: {self.optimized_score:.4f} ({int(correct)}/{total})")
        print(f"📈 Improvement: {improvement:+.4f}")
        return self.optimized_score
    
    def _create_mock_candidate_data(self, result):
        """Create mock candidate data for visualization."""
        # This creates sample data for demonstration
        # In a real GEPA run, this would come from the optimization process
        
        self.candidate_data = []
        prompts = [self.seed_prompt, self.optimized_prompt]
        scores = [0.3, 0.5]  # Mock scores
        
        for i, (prompt, score) in enumerate(zip(prompts, scores)):
            candidate_entry = {
                'candidate_idx': i,
                'score': score,
                'prompt': prompt,
                'prompt_length_chars': len(prompt),
                'prompt_length_words': len(prompt.split()),
                'predictor_name': 'system_prompt',
                'parent_idx': i-1 if i > 0 else None,
                'is_best': i == 1  # Assume optimized is best
            }
            self.candidate_data.append(candidate_entry)
        
        # Create simple evolution tree
        self.evolution_tree = {
            'nodes': [
                {
                    'id': 0, 'candidate_idx': 0, 'parent_idx': None, 'score': 0.3,
                    'prompt': self.seed_prompt, 'prompt_length_chars': len(self.seed_prompt),
                    'prompt_length_words': len(self.seed_prompt.split()), 'level': 0, 'is_best': False
                },
                {
                    'id': 1, 'candidate_idx': 1, 'parent_idx': 0, 'score': 0.5,
                    'prompt': self.optimized_prompt, 'prompt_length_chars': len(self.optimized_prompt),
                    'prompt_length_words': len(self.optimized_prompt.split()), 'level': 1, 'is_best': True
                }
            ],
            'edges': [{'from': 0, 'to': 1, 'score_delta': 0.2}],
            'levels': {0: [0], 1: [1]},
            'roots': [0]
        }
    
    def plot_evolution_analysis(self, save_plot: bool = True) -> Optional[plt.Figure]:
        """Generate comprehensive evolution analysis plots (adapted for AnyMaths)."""
        if not self.candidate_data:
            print("❌ No candidate data available. Run GEPA optimization first.")
            return None
            
        # Extract data for plotting
        prompt_chars = [c['prompt_length_chars'] for c in self.candidate_data]
        prompt_words = [c['prompt_length_words'] for c in self.candidate_data]
        scores = [c['score'] for c in self.candidate_data]
        candidate_nums = [c['candidate_idx'] for c in self.candidate_data]
        
        # Find best candidate
        best_candidate = max(self.candidate_data, key=lambda x: x['score'])
        best_idx = best_candidate['candidate_idx']
        best_score = best_candidate['score']
        best_char_count = best_candidate['prompt_length_chars']
        best_word_count = best_candidate['prompt_length_words']
        
        # Create the plots
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
        
        # Add main title
        fig.suptitle(f'AIME AnyMaths Evolution Analysis: {self.model_key}', 
                     fontsize=16, fontweight='bold', y=0.98)
        
        # Plot 1: Score vs Character Count
        scatter1 = ax1.scatter(prompt_chars, scores, c=candidate_nums, cmap='viridis', alpha=0.7, s=60)
        ax1.set_xlabel('Prompt Length (characters)')
        ax1.set_ylabel('AIME Accuracy Score')
        ax1.set_title('Score vs Prompt Character Count')
        ax1.grid(True, alpha=0.3)
        
        # Add correlation coefficient
        if len(prompt_chars) > 1:
            correlation_chars = np.corrcoef(prompt_chars, scores)[0, 1]
            ax1.text(0.05, 0.95, f'Correlation: {correlation_chars:.3f}', 
                    transform=ax1.transAxes, 
                    bbox=dict(boxstyle="round", facecolor='wheat', alpha=0.8))
        
        # Mark best candidate
        ax1.scatter([best_char_count], [best_score], c='red', s=100, marker='*', 
                   label=f'Best (#{best_idx})', edgecolor='black', linewidth=1)
        ax1.legend()
        
        # Plot 2: Score vs Word Count
        scatter2 = ax2.scatter(prompt_words, scores, c=candidate_nums, cmap='viridis', alpha=0.7, s=60)
        ax2.set_xlabel('Prompt Length (words)')
        ax2.set_ylabel('AIME Accuracy Score')
        ax2.set_title('Score vs Prompt Word Count')
        ax2.grid(True, alpha=0.3)
        
        # Add correlation coefficient
        if len(prompt_words) > 1:
            correlation_words = np.corrcoef(prompt_words, scores)[0, 1]
            ax2.text(0.05, 0.95, f'Correlation: {correlation_words:.3f}', 
                    transform=ax2.transAxes, 
                    bbox=dict(boxstyle="round", facecolor='wheat', alpha=0.8))
        
        # Mark best candidate
        ax2.scatter([best_word_count], [best_score], c='red', s=100, marker='*', 
                   label=f'Best (#{best_idx})', edgecolor='black', linewidth=1)
        ax2.legend()
        
        # Plot 3: Evolution Timeline - Character Count
        ax3.plot(candidate_nums, prompt_chars, 'b-o', markersize=4, alpha=0.7)
        ax3.scatter([best_idx], [best_char_count], c='red', s=100, marker='*', 
                   label=f'Best (#{best_idx})', edgecolor='black', linewidth=1)
        ax3.set_xlabel('Candidate Number (Evolution Order)')
        ax3.set_ylabel('Prompt Length (characters)')
        ax3.set_title('Prompt Length Evolution Over Time')
        ax3.grid(True, alpha=0.3)
        ax3.legend()
        
        # Plot 4: Evolution Timeline - Scores
        ax4.plot(candidate_nums, scores, 'g-o', markersize=4, alpha=0.7)
        ax4.scatter([best_idx], [best_score], c='red', s=100, marker='*', 
                   label=f'Best (#{best_idx})', edgecolor='black', linewidth=1)
        ax4.set_xlabel('Candidate Number (Evolution Order)')
        ax4.set_ylabel('AIME Accuracy Score')
        ax4.set_title('Score Evolution Over Time')
        ax4.grid(True, alpha=0.3)
        ax4.legend()
        
        # Add colorbars
        plt.colorbar(scatter1, ax=ax1, label='Candidate #')
        plt.colorbar(scatter2, ax=ax2, label='Candidate #')
        
        # Adjust layout
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        
        if save_plot:
            plot_path = self.output_dir / f"aime_anymaths_evolution_{self.model_key}.png"
            plt.savefig(plot_path, dpi=300, bbox_inches='tight')
            print(f"📊 Evolution plot saved to {plot_path}")
        
        plt.show()
        return fig
    
    def print_evolution_statistics(self):
        """Print detailed evolution statistics."""
        if not self.candidate_data:
            print("❌ No candidate data available")
            return
            
        prompt_chars = [c['prompt_length_chars'] for c in self.candidate_data]
        prompt_words = [c['prompt_length_words'] for c in self.candidate_data]
        scores = [c['score'] for c in self.candidate_data]
        
        best_candidate = max(self.candidate_data, key=lambda x: x['score'])
        
        print("\n" + "="*60)
        print(f"AIME ANYMATHS EVOLUTION ANALYSIS - {self.model_key}")
        print("="*60)
        
        print(f"\n📊 Performance Summary:")
        if self.baseline_score is not None and self.optimized_score is not None:
            improvement = self.optimized_score - self.baseline_score
            print(f"Baseline Score: {self.baseline_score:.4f}")
            print(f"Optimized Score: {self.optimized_score:.4f}")
            print(f"Improvement: {improvement:+.4f} ({improvement/self.baseline_score*100:+.1f}%)" if self.baseline_score > 0 else f"Improvement: {improvement:+.4f}")
        
        print(f"\n📏 Prompt Length Statistics:")
        print(f"Character count range: {min(prompt_chars)} - {max(prompt_chars)} chars")
        print(f"Word count range: {min(prompt_words)} - {max(prompt_words)} words")
        print(f"Average character count: {np.mean(prompt_chars):.1f} chars")
        print(f"Average word count: {np.mean(prompt_words):.1f} words")
        
        print(f"\n🎯 Score Statistics:")
        print(f"Score range: {min(scores):.4f} - {max(scores):.4f}")
        print(f"Average score: {np.mean(scores):.4f}")
        print(f"Score std dev: {np.std(scores):.4f}")
        
        print(f"\n🏆 Best Candidate:")
        print(f"Candidate #{best_candidate['candidate_idx']}: Score={best_candidate['score']:.4f}")
        print(f"Length: {best_candidate['prompt_length_chars']} chars, {best_candidate['prompt_length_words']} words")
        
        if len(prompt_chars) > 1:
            correlation_chars = np.corrcoef(prompt_chars, scores)[0, 1]
            correlation_words = np.corrcoef(prompt_words, scores)[0, 1]
            
            print(f"\n📈 Correlations:")
            print(f"Length (chars) vs Score: {correlation_chars:.3f}")
            print(f"Length (words) vs Score: {correlation_words:.3f}")
    
    def export_candidate_data(self, include_prompts: bool = True) -> Optional[str]:
        """Export candidate data with evolution tree."""
        if not self.candidate_data:
            print("❌ No candidate data to export")
            return None
        
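        # Create export data structure (deep-copied so that stripping prompts
        # below does not mutate the benchmarker's own state)
        import copy
        export_data = {
            'candidate_data': copy.deepcopy(self.candidate_data),
            'evolution_tree': copy.deepcopy(self.evolution_tree),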
            'best_candidate_idx': max(range(len(self.candidate_data)), key=lambda i: self.candidate_data[i]['score']),
            'optimized_prompt': self.optimized_prompt,
            'seed_prompt': self.seed_prompt,
            'metadata': {
                'model_key': self.model_key,
                'adapter_type': 'AnyMaths',
                'total_candidates': len(self.candidate_data),
                'max_level': max(self.evolution_tree['levels'].keys()) if self.evolution_tree else 0,
                'num_roots': len(self.evolution_tree['roots']) if self.evolution_tree else 0,
                'baseline_score': self.baseline_score,
                'optimized_score': self.optimized_score,
                'timestamp': datetime.now().isoformat()
            }
        }
        
        # Add performance info
        if self.baseline_score is not None and self.optimized_score is not None:
            export_data['metadata']['improvement'] = self.optimized_score - self.baseline_score
        
        # Remove prompts if requested (for smaller file size)
        if not include_prompts:
            for candidate in export_data['candidate_data']:
                candidate.pop('prompt', None)
            if export_data['evolution_tree']:
                for node in export_data['evolution_tree']['nodes']:
                    node.pop('prompt', None)
            export_data.pop('optimized_prompt', None)
            export_data.pop('seed_prompt', None)
        
        # Save files
        candidate_file = self.output_dir / f"anymaths_data_{self.model_key}.json"
        frontend_file = self.output_dir / f"anymaths_frontend_data_{self.model_key}.json"
        
        with open(candidate_file, 'w') as f:
            json.dump(export_data, f, indent=2)
        
        with open(frontend_file, 'w') as f:
            json.dump(export_data, f, indent=2)
        
        print(f"📁 AnyMaths data exported to:")
        print(f"   {candidate_file}")
        print(f"   {frontend_file}")
        print(f"   Total candidates: {len(self.candidate_data)}")
        print(f"   File size: {candidate_file.stat().st_size / 1024:.1f} KB")
        
        return str(candidate_file)

print("✅ AIMEBenchmarker class updated for AnyMaths adapter")
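
The plots and export in `AIMEBenchmarker` are driven by placeholder scores (0.3 and 0.5) set in `_create_mock_candidate_data`. When the optimizer's result exposes its candidates, the same records could be built from real values instead. The sketch below assumes a result object with `candidates` and `val_aggregate_scores` attributes, mirroring the mock result used in this notebook; whether the installed GEPA version exposes exactly these names should be checked. `build_candidate_records` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Any


def build_candidate_records(result: Any, component: str = "system_prompt") -> list[dict]:
    """Turn a GEPA-style result into the records used for plotting and export."""
    scores = list(result.val_aggregate_scores)
    # Index of the highest validation score (first one wins on ties).
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    records = []
    for idx, (candidate, score) in enumerate(zip(result.candidates, scores)):
        prompt = candidate[component]
        records.append({
            "candidate_idx": idx,
            "score": score,
            "prompt": prompt,
            "prompt_length_chars": len(prompt),
            "prompt_length_words": len(prompt.split()),
            "predictor_name": component,
            "is_best": idx == best_idx,
        })
    return records


# Stand-in result with made-up values, used only to exercise the helper.
demo = SimpleNamespace(
    candidates=[
        {"system_prompt": "Solve it."},
        {"system_prompt": "Solve it step by step."},
    ],
    val_aggregate_scores=[0.2, 0.4],
)
for record in build_candidate_records(demo):
    print(record["candidate_idx"], record["score"],
          record["prompt_length_words"], record["is_best"])
```

The helper raises `ValueError` on an empty score list, so a caller would want to guard against a result with no candidates.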

In [None]:
def run_complete_benchmark(model_key: str, 
                         output_dir: str = "./benchmark_results",
                         baseline_subset: int = 5,
                         train_subset: int = 10, 
                         val_subset: int = 5,
                         test_subset: int = 5,
                         max_metric_calls: int = 200) -> Optional[AIMEBenchmarker]:
    """
    Run complete AIME benchmark using AnyMaths adapter.
    
    Args:
        model_key: Model to benchmark (e.g., 'openai-gpt4', 'gemma-27b', 'ollama-qwen')
        output_dir: Directory to save results
        baseline_subset: Number of test problems for baseline evaluation
        train_subset: Number of training examples for GEPA
        val_subset: Number of validation examples for GEPA
        test_subset: Number of test problems for optimized evaluation
        max_metric_calls: Maximum GEPA optimization calls
        
    Returns:
        AIMEBenchmarker instance with all results, or None if the benchmark failed
    """
    print(f"🚀 Starting complete AIME benchmark with AnyMaths adapter for {model_key}")
    print("=" * 70)
    
    try:
        # Initialize benchmarker
        benchmarker = AIMEBenchmarker(model_key, output_dir)
        
        # 1. Evaluate baseline
        baseline_score = benchmarker.evaluate_baseline(baseline_subset)
        
        # 2. Run GEPA optimization
        optimized_prompt = benchmarker.run_gepa_optimization(
            train_subset_size=train_subset,
            val_subset_size=val_subset,
            max_metric_calls=max_metric_calls
        )
        
        # 3. Evaluate optimized
        optimized_score = benchmarker.evaluate_optimized(test_subset)
        
        # 4. Generate analysis and exports
        benchmarker.print_evolution_statistics()
        benchmarker.plot_evolution_analysis()
        benchmarker.export_candidate_data()
        
        print("\n🎯 BENCHMARK COMPLETE!")
        print("=" * 30)
        improvement = optimized_score - baseline_score
        print(f"Model: {model_key}")
        print(f"Baseline: {baseline_score:.4f}")
        print(f"Optimized: {optimized_score:.4f}")
        # Relative improvement is undefined when the baseline is zero
        if baseline_score > 0:
            print(f"Improvement: {improvement:+.4f} ({improvement / baseline_score * 100:+.1f}%)")
        else:
            print(f"Improvement: {improvement:+.4f}")
        
        return benchmarker
        
    except Exception as e:
        print(f"❌ Error during benchmark: {e}")
        import traceback
        traceback.print_exc()
        return None

def compare_models(model_keys: List[str], 
                  output_dir: str = "./model_comparison",
                  **benchmark_kwargs) -> Dict[str, AIMEBenchmarker]:
    """
    Compare multiple models on AIME benchmark using AnyMaths adapter.
    
    Args:
        model_keys: List of model keys to compare
        output_dir: Directory to save comparison results
        **benchmark_kwargs: Arguments passed to run_complete_benchmark
        
    Returns:
        Dictionary mapping model keys to their benchmarker instances
    """
    print(f"🔬 Comparing {len(model_keys)} models on AIME with AnyMaths adapter")
    print("=" * 60)
    
    results = {}
    comparison_data = []
    
    for model_key in model_keys:
        print(f"\n📊 Benchmarking {model_key}...")
        model_dir = Path(output_dir) / model_key
        
        try:
            benchmarker = run_complete_benchmark(
                model_key=model_key,
                output_dir=str(model_dir),
                **benchmark_kwargs
            )
            
            if benchmarker:
                results[model_key] = benchmarker
                comparison_data.append({
                    'model': model_key,
                    'baseline_score': benchmarker.baseline_score,
                    'optimized_score': benchmarker.optimized_score,
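                    # Compare against None explicitly so a 0.0 baseline still yields a real difference
                    'improvement': (benchmarker.optimized_score - benchmarker.baseline_score
                                    if benchmarker.baseline_score is not None
                                    and benchmarker.optimized_score is not None else 0),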
                    'num_candidates': len(benchmarker.candidate_data)
                })
        except Exception as e:
            print(f"❌ Failed to benchmark {model_key}: {e}")
    
    # Generate comparison plot
    if comparison_data:
        _plot_model_comparison(comparison_data, output_dir)
        
        # Save comparison summary
        comparison_file = Path(output_dir) / "anymaths_comparison_summary.json"
        with open(comparison_file, 'w') as f:
            json.dump({
                'timestamp': datetime.now().isoformat(),
                'adapter_type': 'AnyMaths',
                'models': comparison_data,
                'benchmark_settings': benchmark_kwargs
            }, f, indent=2)
        
        print(f"\n📊 Comparison results saved to {output_dir}")
    
    return results

def _plot_model_comparison(comparison_data: List[Dict], output_dir: str):
    """Generate model comparison plot for AnyMaths results."""
    models = [d['model'] for d in comparison_data]
    baseline_scores = [d['baseline_score'] for d in comparison_data]
    optimized_scores = [d['optimized_score'] for d in comparison_data] 
    improvements = [d['improvement'] for d in comparison_data]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Plot 1: Baseline vs Optimized
    x = np.arange(len(models))
    width = 0.35
    
    ax1.bar(x - width/2, baseline_scores, width, label='Baseline', alpha=0.8)
    ax1.bar(x + width/2, optimized_scores, width, label='Optimized', alpha=0.8)
    
    ax1.set_xlabel('Models')
    ax1.set_ylabel('AIME Accuracy')
    ax1.set_title('AnyMaths: Baseline vs Optimized Performance')
    ax1.set_xticks(x)
    ax1.set_xticklabels(models, rotation=45, ha='right')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Improvement analysis
    colors = ['green' if imp > 0 else 'red' for imp in improvements]
    ax2.bar(models, improvements, color=colors, alpha=0.7)
    ax2.set_xlabel('Models')
    ax2.set_ylabel('Improvement (Optimized - Baseline)')
    ax2.set_title('AnyMaths: Performance Improvement by Model')
    ax2.tick_params(axis='x', rotation=45)
    ax2.grid(True, alpha=0.3)
    ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    
    plt.tight_layout()
    
    plot_path = Path(output_dir) / "anymaths_model_comparison.png"
    plt.savefig(plot_path, dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"📊 Comparison plot saved to {plot_path}")

print("✅ AnyMaths benchmarking functions ready")

## 🚀 Quick Start Examples

Choose one of the examples below to get started with AIME benchmarking.

In [None]:
# Example 1: Quick single model benchmark with AnyMaths adapter
# Available models: 'openai-gpt4', 'gemma-27b', 'qwen-72b', 'claude-sonnet', 'gemini-pro', 'ollama-qwen'

MODEL_TO_TEST = "gemini-pro"  # Change this to your preferred model

# Check model configuration and API key requirements
print(f"🔍 Checking configuration for {MODEL_TO_TEST}...")
try:
    model_name, api_base, max_workers = ModelConfig.setup_model(MODEL_TO_TEST)
    print(f"✅ Model configured: {model_name}")
    print(f"   API Base: {api_base or 'Default'}")
    print(f"   Max Workers: {max_workers}")
except ValueError as e:
    print(f"❌ Configuration error: {e}")

# Show all available models
print("\n" + "="*50)
ModelConfig.list_models()
print("="*50)

print("\n💡 To run the benchmark, uncomment the line below and ensure you have the required API keys.")
print("For GEPA reflection, you also need: GEMINI_API_KEY or GOOGLE_APPLICATION_CREDENTIALS")

# Run the benchmark (uncomment the line below after setting up API keys)
# benchmarker = run_complete_benchmark(MODEL_TO_TEST, baseline_subset=3, train_subset=5, val_subset=3, test_subset=3, max_metric_calls=50)

In [None]:
# Example 2: Compare multiple models with AnyMaths adapter
# This will run a comprehensive comparison across different models

models_to_compare = [
    "gemini-pro",     # Google Gemini (if you have GEMINI_API_KEY or GOOGLE_APPLICATION_CREDENTIALS)
    "openai-gpt4",    # OpenAI GPT-4 (if you have OPENAI_API_KEY) 
    # "gemma-27b",      # Uncomment if you have OPENROUTER_API_KEY
    # "qwen-72b",       # Uncomment if you have OPENROUTER_API_KEY
    # "claude-sonnet",  # Uncomment if you have OPENROUTER_API_KEY
    # "ollama-qwen",    # Uncomment if you have Ollama running locally
]

print("🔬 Model Comparison Setup")
print("Available models for comparison:")
for model in models_to_compare:
    try:
        model_name, api_base, _ = ModelConfig.setup_model(model)
        print(f"  ✅ {model}: {model_name}")
    except ValueError as e:
        print(f"  ❌ {model}: {e}")

print(f"\n💡 To run comparison, uncomment the code below after setting up API keys:")

# Run comparison (uncomment after setting up API keys)
# comparison_results = compare_models(
#     models_to_compare,
#     output_dir="./aime_anymaths_comparison",
#     baseline_subset=3,    # Smaller subsets for faster comparison
#     train_subset=5,
#     val_subset=3, 
#     test_subset=3,
#     max_metric_calls=50   # Reduced for faster testing
# )

In [None]:
# Example 3: Custom benchmark with specific settings
# For when you want full control over the benchmarking process

def run_custom_benchmark():
    """Run a custom benchmark with specific settings."""
    
    model_key = "gemma-27b"  # Change as needed
    
    # Initialize benchmarker
    benchmarker = AIMEBenchmarker(model_key, output_dir="./custom_benchmark")
    
    # Run each step manually with custom parameters
    print("Step 1: Baseline evaluation...")
    baseline_score = benchmarker.evaluate_baseline(subset_size=3)
    
    print("\nStep 2: GEPA optimization...")
    optimized_program = benchmarker.run_gepa_optimization(
        train_subset_size=8,
        val_subset_size=4,
        max_metric_calls=150,
        reflection_minibatch_size=2
    )
    
    print("\nStep 3: Optimized evaluation...")
    # Called with subset_size only, matching the other evaluate_optimized calls in this notebook
    optimized_score = benchmarker.evaluate_optimized(subset_size=3)
    
    print("\nStep 4: Analysis and export...")
    benchmarker.print_evolution_statistics()
    benchmarker.plot_evolution_analysis()
    benchmarker.export_candidate_data()
    
    return benchmarker

# Run custom benchmark (uncomment after setting up API keys)
# custom_result = run_custom_benchmark()

In [None]:
# Example 4: Load and analyze existing results
# Use this to re-analyze results from previous benchmark runs

def analyze_existing_results(json_file_path: str):
    """Load and analyze previously exported benchmark results."""
    
    try:
        with open(json_file_path, 'r') as f:
            data = json.load(f)
        
        candidate_data = data['candidate_data']
        metadata = data['metadata']
        
        print(f"📊 Analysis of {metadata.get('model_key', 'Unknown Model')}")
        print("=" * 50)
        print(f"Total candidates: {len(candidate_data)}")
        print(f"Baseline score: {metadata.get('baseline_score', 'N/A')}")
        print(f"Optimized score: {metadata.get('optimized_score', 'N/A')}")
        print(f"Improvement: {metadata.get('improvement', 'N/A')}")
        
        # Extract data for plotting
        scores = [c['score'] for c in candidate_data]
        lengths = [c['prompt_length_chars'] for c in candidate_data]
        
        # Simple analysis plot
        plt.figure(figsize=(12, 4))
        
        plt.subplot(1, 3, 1)
        plt.plot(scores, 'g-o', markersize=4, alpha=0.7)
        plt.xlabel('Candidate Number')
        plt.ylabel('Score')
        plt.title('Score Evolution')
        plt.grid(True, alpha=0.3)
        
        plt.subplot(1, 3, 2)
        plt.scatter(lengths, scores, alpha=0.6)
        plt.xlabel('Prompt Length (chars)')
        plt.ylabel('Score')
        plt.title('Length vs Score')
        plt.grid(True, alpha=0.3)
        
        plt.subplot(1, 3, 3)
        # Keep at least one bin so an empty candidate list does not raise
        plt.hist(scores, bins=max(1, min(10, len(scores))), alpha=0.7, edgecolor='black')
        plt.xlabel('Score')
        plt.ylabel('Frequency')
        plt.title('Score Distribution')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        return data
        
    except Exception as e:
        print(f"❌ Error loading results: {e}")
        return None

# Example usage (uncomment and provide path to your results file):
# results_data = analyze_existing_results("./benchmark_results/anymaths_data_openai-gpt4.json")

## 📚 AnyMaths AIME Benchmarking Usage Guide

### 🔧 Setting Up API Keys
```bash
# For OpenAI models
export OPENAI_API_KEY="your_openai_key_here"

# For OpenRouter models (Gemma, Qwen, Claude)
export OPENROUTER_API_KEY="your_openrouter_key_here"

# For Google Gemini (and GEPA reflection)
export GEMINI_API_KEY="your_gemini_key_here"
# OR for service account:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account.json"

# For local Ollama (no API key needed)
# Just ensure Ollama is running: ollama serve
```

### 🤖 Available Models
- `openai-gpt4`: OpenAI GPT-4o Mini (requires OPENAI_API_KEY)
- `gemma-27b`: Google Gemma 3 27B via OpenRouter (requires OPENROUTER_API_KEY)
- `qwen-72b`: Qwen 2.5 72B Instruct via OpenRouter (requires OPENROUTER_API_KEY)
- `claude-sonnet`: Claude 3.5 Sonnet via OpenRouter (requires OPENROUTER_API_KEY)
- `gemini-pro`: Google Gemini 2.5 Pro direct (requires GEMINI_API_KEY)
- `ollama-qwen`: Local Qwen via Ollama (requires Ollama running locally)
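
Before launching a run, it can help to confirm that the credentials a model needs are actually set. The sketch below is illustrative only: `REQUIRED_ENV` and `missing_credentials` are hypothetical names, not part of the notebook's `ModelConfig`, and the mapping is transcribed by hand from the setup notes and model list above.

```python
import os
from typing import Dict, List, Mapping, Optional

# Hypothetical mapping, transcribed from the setup notes and model list above.
# Each inner list holds alternatives: any one of them is enough.
REQUIRED_ENV: Dict[str, List[str]] = {
    "openai-gpt4": ["OPENAI_API_KEY"],
    "gemma-27b": ["OPENROUTER_API_KEY"],
    "qwen-72b": ["OPENROUTER_API_KEY"],
    "claude-sonnet": ["OPENROUTER_API_KEY"],
    "gemini-pro": ["GEMINI_API_KEY", "GOOGLE_APPLICATION_CREDENTIALS"],
    "ollama-qwen": [],  # local Ollama, no key needed
}


def missing_credentials(model_key: str,
                        env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the env vars to set for model_key, or [] if credentials look fine."""
    env = os.environ if env is None else env
    if model_key not in REQUIRED_ENV:
        raise ValueError(f"Unknown model key: {model_key}")
    alternatives = REQUIRED_ENV[model_key]
    # Satisfied when nothing is required or at least one alternative is non-empty
    if not alternatives or any(env.get(name) for name in alternatives):
        return []
    return alternatives


for key in REQUIRED_ENV:
    missing = missing_credentials(key)
    status = "ok" if not missing else "set one of: " + ", ".join(missing)
    print(f"{key}: {status}")
```

This only checks that a variable is present and non-empty; it does not verify that the key is valid with the provider.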

### 🚀 Key Functions
- `run_complete_benchmark(model_key)`: Complete AIME benchmark with AnyMaths adapter
- `compare_models(model_list)`: Compare multiple models side-by-side
- `AIMEBenchmarker(model_key)`: Full control benchmarker class with AnyMaths integration

### 📁 Output Files
- `anymaths_data_{model}.json`: Complete candidate data with evolution tree
- `anymaths_frontend_data_{model}.json`: The same export written under a second filename for a visualization frontend
- `aime_anymaths_evolution_{model}.png`: Evolution analysis plots
- `anymaths_model_comparison.png`: Multi-model comparison plots
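
For a quick look at an exported JSON file without plotting, a small summary helper may be enough. This is a minimal sketch: `summarize_export` is a hypothetical helper, and it assumes the file has the `candidate_data` and `metadata` keys (with a `score` per candidate) that `analyze_existing_results` reads. The demo payload is synthetic, not real benchmark output.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def summarize_export(path: str) -> Dict[str, Any]:
    """Summarize an exported results file without plotting anything."""
    data = json.loads(Path(path).read_text())
    candidates = data.get("candidate_data", [])
    metadata = data.get("metadata", {})
    # Candidates without a score are skipped rather than treated as 0
    scores = [c["score"] for c in candidates if c.get("score") is not None]
    return {
        "num_candidates": len(candidates),
        "best_score": max(scores) if scores else None,
        "baseline_score": metadata.get("baseline_score"),
        "optimized_score": metadata.get("optimized_score"),
        # False when the file was exported with prompts stripped
        "has_prompts": any("prompt" in c for c in candidates),
    }


# Synthetic payload for illustration only
demo_payload = {
    "candidate_data": [{"score": 0.2, "prompt": "seed"}, {"score": 0.4, "prompt": "v2"}],
    "metadata": {"baseline_score": 0.2, "optimized_score": 0.4},
}
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "anymaths_data_demo.json"
    demo_file.write_text(json.dumps(demo_payload))
    print(summarize_export(str(demo_file)))
```

The `has_prompts` flag is a cheap way to tell whether a file still contains prompt text or was written with prompts removed to save space.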

### ⚙️ Customization Parameters
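Adjust benchmark parameters in function calls:
- `baseline_subset`: Number of test problems for baseline evaluation (default: 5)
- `train_subset` / `val_subset`: Number of GEPA training / validation examples (default: 10 / 5)
- `test_subset`: Number of test problems for the optimized evaluation (default: 5)
- `max_metric_calls`: GEPA optimization budget (default: 200)
- `reflection_minibatch_size`: Reflection batch size (default: 3); set it on `AIMEBenchmarker.run_gepa_optimization`, since `run_complete_benchmark` does not accept it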
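
One way to keep these settings consistent across runs is to name them once and unpack them into the call. The sketch below is an assumption-laden convenience, not part of the notebook: `PRESETS` and `benchmark_settings` are hypothetical names, and the numbers are the function defaults plus the reduced values used in the Quick Start cells.

```python
from typing import Dict

# Values taken from this notebook: the function defaults and the reduced
# settings used in the Quick Start examples. The preset names are hypothetical.
PRESETS: Dict[str, Dict[str, int]] = {
    "default": dict(baseline_subset=5, train_subset=10, val_subset=5,
                    test_subset=5, max_metric_calls=200),
    "smoke": dict(baseline_subset=3, train_subset=5, val_subset=3,
                  test_subset=3, max_metric_calls=50),
}


def benchmark_settings(preset: str = "default", **overrides: int) -> Dict[str, int]:
    """Return keyword arguments for run_complete_benchmark, with optional overrides."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    settings = dict(PRESETS[preset])  # copy so the preset itself is not mutated
    unknown = set(overrides) - set(settings)
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")
    settings.update(overrides)
    return settings


print(benchmark_settings("smoke", max_metric_calls=80))
# Intended use, once API keys are configured:
# benchmarker = run_complete_benchmark("gemini-pro", **benchmark_settings("smoke"))
```

Rejecting unknown parameter names early avoids a `TypeError` from deep inside a long benchmark call.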

### 🎯 AnyMaths Adapter Features
- **Structured Output**: Uses JSON schema validation for consistent responses
- **LiteLLM Integration**: Works with 100+ LLM providers
- **Mathematical Focus**: Optimized for mathematical word problems and reasoning
- **AIME Specialized**: Custom prompts and evaluation for AIME competition problems
- **Feedback-Driven**: Rich feedback generation for GEPA optimization
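
The structured-output idea can be illustrated without calling a model. The sketch below uses only the standard library rather than the notebook's pydantic `AnyMathsStructuredOutput`; the field names `final_answer` and `solution_pad` come from that class, while `parse_structured_response`, `score_response`, and the exact-match scoring rule are assumptions made for illustration. The sample responses are synthetic.

```python
import json
from typing import Dict, Optional

REQUIRED_FIELDS = ("final_answer", "solution_pad")


def parse_structured_response(raw: str) -> Optional[Dict[str, str]]:
    """Parse a JSON response; return None if it is malformed or misses a field."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # Both fields must be present and must be strings
    if not all(isinstance(payload.get(name), str) for name in REQUIRED_FIELDS):
        return None
    return {name: payload[name] for name in REQUIRED_FIELDS}


def score_response(raw: str, expected_answer: str) -> float:
    """Assumed scoring rule: 1.0 on an exact match of the stripped answer, else 0.0."""
    parsed = parse_structured_response(raw)
    if parsed is None:
        return 0.0
    return 1.0 if parsed["final_answer"].strip() == expected_answer.strip() else 0.0


# Synthetic responses for illustration only
well_formed = '{"final_answer": "204", "solution_pad": "Worked steps go here."}'
wrong_type = '{"final_answer": 204}'
print(score_response(well_formed, "204"))
print(score_response(wrong_type, "204"))
```

Treating a malformed response as a score of zero keeps the scoring function total, so a single bad completion cannot abort an evaluation batch.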

### 📊 Example Usage
```python
# Quick benchmark
benchmarker = run_complete_benchmark("gemini-pro", baseline_subset=3, max_metric_calls=50)

# Custom benchmark
benchmarker = AIMEBenchmarker("openai-gpt4")
benchmarker.evaluate_baseline(subset_size=5)
benchmarker.run_gepa_optimization(max_metric_calls=100)
benchmarker.evaluate_optimized(subset_size=5)
benchmarker.plot_evolution_analysis()

# Model comparison
results = compare_models(["gemini-pro", "openai-gpt4"], max_metric_calls=50)
```
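
After a comparison run, the summary that `compare_models` writes to `anymaths_comparison_summary.json` can be ranked without re-running anything. This is a minimal sketch: `rank_models` is a hypothetical helper, and it assumes the `models` list with `model`, `baseline_score`, `optimized_score`, and `improvement` fields that `compare_models` saves. The path shown is the function's default output directory.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def rank_models(summary_path: str) -> List[Dict[str, Any]]:
    """Return the per-model entries sorted by improvement, best first."""
    summary = json.loads(Path(summary_path).read_text())
    models = summary.get("models", [])
    # Treat a missing or null improvement as 0 when sorting
    return sorted(models, key=lambda m: m.get("improvement") or 0, reverse=True)


summary_file = Path("./model_comparison/anymaths_comparison_summary.json")
if summary_file.exists():
    for entry in rank_models(str(summary_file)):
        improvement = entry.get("improvement") or 0
        print(f"{entry['model']:<15} "
              f"baseline={entry.get('baseline_score')} "
              f"optimized={entry.get('optimized_score')} "
              f"improvement={improvement:+.4f}")
else:
    print(f"No summary found at {summary_file}; run compare_models first.")
```

With subsets of only a few problems, small differences in improvement are likely to be noise, so the ranking is best read as a rough ordering.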

The AnyMaths adapter gives GEPA a math-focused evaluation pipeline for optimizing prompts on AIME problems. 🧮✨