# LLM Model Comparison using OpenRouter Rankings

This notebook evaluates and compares the top language models from [OpenRouter Rankings](https://openrouter.ai/rankings).

## What This Notebook Does

1. **Fetches Top Models** - Scrapes the current rankings from OpenRouter to identify the most popular models
2. **Generates Questions** - Each model creates a challenging reasoning question
3. **Answers Questions** - Each model answers every question, including its own
4. **Evaluates Responses** - Each model rates the quality of all answers on a 10-point scale
5. **Aggregates Results** - Produces comparison metrics and cross-model rating matrices

## Requirements

- **OpenRouter API Key**: Set the `OPENROUTER_API_KEY` environment variable
- **Dependencies**: pandas, requests, beautifulsoup4, playwright, openai (automatically installed with `uv sync`)

In [None]:
# Core imports for the notebook
import os  # Environment variable access
from typing import Any  # Type hints

# Data manipulation and analysis
import pandas as pd  # DataFrames for structured data

# Web scraping and API calls
import requests  # HTTP requests for OpenRouter API
from bs4 import BeautifulSoup  # HTML parsing for rankings page

# Note: Other modules (re, time, playwright, openai) are imported
# in the cells that use them

## Evaluation Pipeline Flow

Here's how the evaluation works, step by step:

```
┌─────────────────────────────────────────────────────────┐
│  Step 1: Fetch Top Models from OpenRouter Rankings     │
│  → Scrapes live usage data OR uses API sorting         │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│  Step 2: Question Generation (N models)                │
│  → Each model creates 1 challenging question           │
│  → Total: N questions                                  │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│  Step 3: Answer Generation (N × N combinations)        │
│  → Each model answers every question                   │
│  → Total: N × N answers                                │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│  Step 4: Answer Evaluation (N × N ratings)             │
│  → Each model rates every answer (1-10 scale)          │
│  → Total: N × N ratings                                │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│  Step 5: Aggregate Results                             │
│  → Average ratings per model                           │
│  → Cross-model rating matrix                           │
│  → Performance insights                                │
└─────────────────────────────────────────────────────────┘
```

**Example with 5 models:**
- 5 questions generated
- 25 answers generated (5 models × 5 questions)
- 25 ratings collected (5 models × 5 answers)
- Total API calls: 55 (5 + 25 + 25)
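
The counts above follow directly from the number of models N. A minimal sketch of that arithmetic, assuming exactly one API call per question, answer, and rating (the helper name `count_api_calls` is hypothetical and not part of the notebook):

```python
def count_api_calls(num_models: int) -> dict[str, int]:
    """Return the number of API calls per phase for N models."""
    questions = num_models             # one question per model
    answers = num_models * num_models  # every model answers every question
    ratings = num_models * num_models  # N x N ratings, as in the diagram above
    return {
        "questions": questions,
        "answers": answers,
        "ratings": ratings,
        "total": questions + answers + ratings,
    }


print(count_api_calls(5))
# {'questions': 5, 'answers': 25, 'ratings': 25, 'total': 55}
```

Because the total grows as N + 2N², doubling the number of models roughly quadruples the number of calls.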

## Setup and Configuration

Before running this notebook, you need to:

1. **Install Playwright browsers** (one-time setup):
```bash
uv run playwright install chromium
```

2. **Set your OpenRouter API Key**:
   - Get a free API key from [OpenRouter](https://openrouter.ai/)
   - Set it as an environment variable:
     - **Windows (PowerShell)**: `$env:OPENROUTER_API_KEY="your-key-here"`
     - **Mac/Linux**: `export OPENROUTER_API_KEY="your-key-here"`
   - Or add it to a `.env` file in the project root

The notebook will check for this API key before making requests.
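
The cells shown in this notebook read the key with `os.getenv` and do not appear to load a `.env` file themselves, so something has to export the file's contents first. The sketch below is one dependency-free way to do that. The helper name `load_env_file` is hypothetical, and it only handles simple `KEY=VALUE` lines; a package such as `python-dotenv` is a common alternative, but it is not in the dependency list above.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and export them if not already set."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # Nothing to load; rely on the real environment

    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # Drop optional surrounding quotes
        loaded[key] = value
        # Variables already set in the environment take precedence
        os.environ.setdefault(key, value)
    return loaded


load_env_file()  # Run this before the cell that calls os.getenv
```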

## Fetch Top Models from OpenRouter

This section scrapes the OpenRouter rankings page to get real-time popularity data based on actual usage.

### How It Works

The `fetch_openrouter_rankings()` function uses **Playwright** (a browser automation tool) to:
1. Launch a headless browser
2. Navigate to the OpenRouter rankings page
3. Wait for JavaScript content to load
4. Extract model data including rank, name, token usage, and usage trends
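
Step 4 includes turning display strings such as "1.04T tokens" into integers. The sketch below isolates that conversion so it can be tried without a browser. The function name `parse_token_count` is hypothetical, and it uses `round` rather than truncation to sidestep floating-point edge cases, which is a choice made for this sketch.

```python
import re
from typing import Optional

# Unit suffix -> multiplier (K=thousand, M=million, B=billion, T=trillion)
MULTIPLIERS = {
    "K": 1_000,
    "M": 1_000_000,
    "B": 1_000_000_000,
    "T": 1_000_000_000_000,
}


def parse_token_count(text: str) -> Optional[int]:
    """Convert text like '801B tokens' to an integer token count."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([KMBT])\s*tokens", text, re.IGNORECASE)
    if not match:
        return None  # Text did not look like a token count
    value = float(match.group(1))
    unit = match.group(2).upper()
    return round(value * MULTIPLIERS[unit])


print(parse_token_count("801B tokens"))
print(parse_token_count("no usage data"))
```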

**Why Playwright?** The rankings page uses JavaScript to render content dynamically, so we need a real browser to see the data.

**Windows Event Loop Note:** Jupyter uses an asyncio event loop that conflicts with Playwright on Windows. The function runs Playwright in a separate thread with its own event loop to avoid this issue.
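
The thread-plus-own-loop pattern can be tried in isolation. The sketch below is a simplified variant, not the notebook's exact code: it uses `asyncio.run` inside the worker thread instead of creating the loop by hand, and `fetch_placeholder` and `run_in_fresh_loop` are hypothetical names standing in for the real scraping coroutine and its runner.

```python
import asyncio
import concurrent.futures


async def fetch_placeholder() -> str:
    """Stand-in for the Playwright coroutine; it only yields control briefly."""
    await asyncio.sleep(0.01)
    return "done"


def run_in_fresh_loop(coro_factory, timeout: float = 30.0):
    """Run a coroutine on a worker thread that owns its own event loop."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        # asyncio.run creates and closes a new loop inside the worker thread,
        # so it never touches the loop that Jupyter is already running
        future = executor.submit(lambda: asyncio.run(coro_factory()))
        return future.result(timeout=timeout)


print(run_in_fresh_loop(fetch_placeholder))
```

Passing a factory (the function itself) rather than a coroutine object means the coroutine is created inside the thread that runs it.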

### Available Sorting Options

The `fetch_top_openrouter_models()` function supports different sorting criteria:
- **popularity** - Models ranked by actual usage on OpenRouter (default, uses Playwright scraping)
- **price_low** - Cheapest models first
- **price_high** - Most expensive (often most capable) models first  
- **context_length** - Longest context window first
- **newest** - Most recently added models first

**Note:** Only the popularity ranking requires Playwright. Other sorting options use the OpenRouter API directly.
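
The API-based sorting options come down to choosing a sort key and a direction. A minimal sketch of that idea, using a lookup table instead of an if/elif chain; the sample records and the names `SORTERS` and `sort_models` are hypothetical, and the field names mirror the dictionaries that `fetch_top_openrouter_models()` returns:

```python
from typing import Any

# Hypothetical records shaped like the notebook's model dictionaries
SAMPLE_MODELS = [
    {"id": "vendor/a", "avg_cost": 3.0, "context_length": 128_000, "created": 100},
    {"id": "vendor/b", "avg_cost": 0.5, "context_length": 1_000_000, "created": 300},
    {"id": "vendor/c", "avg_cost": 9.0, "context_length": 32_000, "created": 200},
]

# sort_by value -> (field to sort on, descending?)
SORTERS = {
    "price_low": ("avg_cost", False),
    "price_high": ("avg_cost", True),
    "context_length": ("context_length", True),
    "newest": ("created", True),
}


def sort_models(models: list[dict[str, Any]], sort_by: str) -> list[dict[str, Any]]:
    """Return a new list sorted by the requested criterion."""
    if sort_by not in SORTERS:
        raise ValueError(f"Unknown sort_by: {sort_by!r}")
    field, descending = SORTERS[sort_by]
    return sorted(models, key=lambda m: m[field], reverse=descending)


print([m["id"] for m in sort_models(SAMPLE_MODELS, "price_low")])
# ['vendor/b', 'vendor/a', 'vendor/c']
```

Unlike the notebook's function, which returns the unsorted list for an unrecognized `sort_by`, this sketch raises an error so that typos surface early.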

In [None]:
# Fallback model list used when API calls or web scraping fails
FALLBACK_MODELS = [
    {
        "id": "anthropic/claude-3.5-sonnet",
        "name": "Claude 3.5 Sonnet",
        "description": "Anthropic Claude 3.5 Sonnet",
        "context_length": 200000,
    },
    {
        "id": "openai/gpt-4o",
        "name": "GPT-4o",
        "description": "OpenAI GPT-4o",
        "context_length": 128000,
    },
    {
        "id": "google/gemini-pro-1.5",
        "name": "Gemini Pro 1.5",
        "description": "Google Gemini Pro 1.5",
        "context_length": 1_000_000,
    },
    {
        "id": "meta-llama/llama-3.1-405b-instruct",
        "name": "Llama 3.1 405B",
        "description": "Meta Llama 3.1 405B Instruct",
        "context_length": 128000,
    },
    {
        "id": "anthropic/claude-3.5-haiku",
        "name": "Claude 3.5 Haiku",
        "description": "Anthropic Claude 3.5 Haiku",
        "context_length": 100000,
    },
]


In [None]:
def fetch_openrouter_rankings() -> pd.DataFrame:
    """
    Scrape the OpenRouter rankings page to get the current top models by actual usage.
    Uses Playwright async API run in a separate thread to avoid event loop conflicts.

    Returns:
        DataFrame with columns: rank, model_id, model_name, tokens, token_change
    """
    import asyncio
    import concurrent.futures
    import sys

    async def _async_fetch():
        """Internal async function to fetch rankings using browser automation"""
        try:
            import re

            from playwright.async_api import async_playwright

            print("Launching browser to fetch rankings...")

            async with async_playwright() as p:
                # Launch browser in headless mode (no visible window)
                browser = await p.chromium.launch(headless=True)
                page = await browser.new_page()

                try:
                    # Navigate to rankings page with short timeout (user can retry if it fails)
                    await page.goto(
                        "https://openrouter.ai/rankings",
                        wait_until="domcontentloaded",
                        timeout=15000,
                    )

                    # Wait for the leaderboard section to appear in the DOM
                    await page.wait_for_selector(
                        "#leaderboard", timeout=10000, state="attached"
                    )

                    # Give JavaScript time to populate the content (5 seconds)
                    await page.wait_for_timeout(5000)

                    # Get the fully rendered HTML after JavaScript execution
                    html = await page.content()

                finally:
                    await browser.close()

            # Parse the rendered HTML with BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            leaderboard = soup.find(id="leaderboard")

            if not leaderboard:
                print("⚠ Leaderboard section not found in rendered page")
                return pd.DataFrame(
                    columns=["rank", "model_id", "model_name", "tokens", "token_change"]
                )

            # Extract model information using CSS selectors
            rankings_data: list[dict[str, object]] = []

            # Find all leaderboard entries (each entry is a grid container with 12 columns)
            entries = leaderboard.select("div.grid.grid-cols-12.items-center")

            for entry in entries[:30]:  # Limit to top 30 models
                try:
                    # Column 1: Extract rank number (e.g., "1.", "2.")
                    rank_elem = entry.select_one("div.col-span-1")
                    rank = (
                        int(rank_elem.get_text(strip=True).replace(".", ""))
                        if rank_elem
                        else None
                    )

                    # Column 2: Extract model name and ID from the link
                    model_link = entry.select_one("div.col-span-7 a.font-medium")
                    if not model_link:
                        continue

                    model_name = model_link.get_text(strip=True)
                    href = model_link.get("href", "")
                    # Remove leading "/" from href to get model_id
                    model_id = (
                        href[1:] if isinstance(href, str) and href.startswith("/") else href
                    )

                    # Column 3: Extract token count and change percentage
                    tokens: int | None = None
                    token_change: str | None = None
                    token_container = entry.select_one("div.col-span-4")

                    if token_container:
                        divs = token_container.select("div")

                        # First div contains the token count (e.g., "1.04T tokens", "801B tokens")
                        if divs:
                            token_text = divs[0].get_text(strip=True)
                            # Parse token count with units (K=thousand, M=million, B=billion, T=trillion)
                            token_match = re.search(
                                r"([\d.]+)([KMBT])\s*tokens", token_text, re.IGNORECASE
                            )
                            if token_match:
                                value = float(token_match.group(1))
                                unit = token_match.group(2).upper()
                                multipliers = {
                                    "K": 1_000,
                                    "M": 1_000_000,
                                    "B": 1_000_000_000,
                                    "T": 1_000_000_000_000,
                                }
                                tokens = int(value * multipliers.get(unit, 1))

                        # Extract percentage change (usage trend)
                        percent_div = token_container.select_one("div.mt-1")
                        if percent_div:
                            svg_elem = percent_div.select_one("svg")
                            full_text = percent_div.get_text(strip=True)
                            percent_match = re.search(r"([\d.]+)%", full_text)
                            if percent_match:
                                percentage_value = percent_match.group(1)
                                svg_class = ""
                                if svg_elem:
                                    svg_class_raw = svg_elem.get("class", [])
                                    if isinstance(svg_class_raw, list):
                                        svg_class = " ".join(svg_class_raw)
                                    else:  # str
                                        svg_class = str(svg_class_raw)
                                # Red SVG indicates decrease, green indicates increase
                                if svg_class and "text-red" in svg_class:
                                    token_change = f"-{percentage_value}%"
                                else:
                                    token_change = f"{percentage_value}%"

                    rankings_data.append(
                        {
                            "rank": rank,
                            "model_id": model_id,
                            "model_name": model_name,
                            "tokens": tokens,
                            "token_change": token_change,
                        }
                    )

                except (ValueError, AttributeError):
                    # Skip entries that don't match expected format
                    continue

            if rankings_data:
                print(
                    f"✓ Successfully extracted {len(rankings_data)} models from rankings page"
                )
                return pd.DataFrame(rankings_data)
            
            print("⚠ No models found, using fallback")
            raise ValueError("No models extracted")

        except Exception as err:  # noqa: BLE001
            print(f"Error with Playwright: {err}")
            import traceback

            traceback.print_exc()
            raise

    def _run_in_thread():
        """
        Run async function in a new event loop in a separate thread.
        This avoids conflicts with Jupyter's event loop on Windows.
        """
        # On Windows, use ProactorEventLoop which supports subprocesses
        if sys.platform == "win32":
            loop = asyncio.WindowsProactorEventLoopPolicy().new_event_loop()
        else:
            loop = asyncio.new_event_loop()

        asyncio.set_event_loop(loop)
        try:
            return loop.run_until_complete(_async_fetch())
        finally:
            loop.close()

    try:
        # Run in a thread pool to avoid event loop conflicts with Jupyter
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(_run_in_thread)
            return future.result(timeout=30)

    except Exception as err:  # noqa: BLE001
        print(f"Error fetching rankings: {err}")

        # Fallback to hardcoded list if scraping fails
        print("Using fallback: hardcoded top models list")
        fallback_rankings = [
            {
                "rank": i + 1,
                "model_id": model["id"],
                "model_name": model["name"],
                "tokens": None,
                "token_change": None,
            }
            for i, model in enumerate(FALLBACK_MODELS)
        ]
        return pd.DataFrame(fallback_rankings)


def fetch_top_openrouter_models(top_n: int = 5, sort_by: str = "popularity") -> list[dict]:
    """
    Fetch the top N models from OpenRouter using various sorting criteria.

    Args:
        top_n: Number of top models to return
        sort_by: Sorting criterion - 'popularity', 'price_low', 'price_high',
                'context_length', 'newest'

    Returns:
        List of dictionaries containing model information with keys:
        id, name, description, context_length, pricing, created, avg_cost
    """
    url = "https://openrouter.ai/api/v1/models"

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()

        # Extract model data from API response
        models = data.get("data", [])

        # Filter out models without proper pricing or context info
        valid_models: list[dict[str, object]] = []
        for model in models:
            pricing = model.get("pricing", {})
            # Only include models with valid pricing information
            if pricing and pricing.get("prompt") and pricing.get("completion"):
                try:
                    prompt_cost = float(pricing.get("prompt", "0"))
                    completion_cost = float(pricing.get("completion", "0"))
                except (TypeError, ValueError):
                    continue
                model_info = {
                    "id": model.get("id", ""),
                    "name": model.get("name", model.get("id", "")),
                    "description": model.get(
                        "description", "No description available"
                    ),
                    "context_length": model.get("context_length", 0),
                    "pricing": pricing,
                    "created": model.get("created", 0),
                    # Calculate average cost per 1M tokens for easy comparison
                    "avg_cost": (prompt_cost + completion_cost) / 2 * 1_000_000,
                }
                valid_models.append(model_info)

        # Sort models based on the specified criterion
        if sort_by == "popularity":
            print(
                "Fetching current model rankings from OpenRouter (using Playwright)..."
            )
            rankings_df = fetch_openrouter_rankings()
            if rankings_df is not None and not rankings_df.empty:
                # Create a mapping of model_id to rank
                popularity_order = {
                    row["model_id"]: row["rank"] for _, row in rankings_df.iterrows()
                }
                # Sort by rank (lower rank = more popular), default to 999 for unranked
                sorted_models = sorted(
                    valid_models, key=lambda x: popularity_order.get(x["id"], 999)
                )
                print(
                    f"✓ Successfully ranked {len(rankings_df)} models by current usage data"
                )
            else:
                print("⚠ Could not fetch rankings, using default sort")
                sorted_models = valid_models
        elif sort_by == "price_low":
            sorted_models = sorted(valid_models, key=lambda x: x["avg_cost"])
        elif sort_by == "price_high":
            sorted_models = sorted(valid_models, key=lambda x: x["avg_cost"], reverse=True)
        elif sort_by == "context_length":
            sorted_models = sorted(
                valid_models, key=lambda x: x["context_length"], reverse=True
            )
        elif sort_by == "newest":
            sorted_models = sorted(valid_models, key=lambda x: x["created"], reverse=True)
        else:
            sorted_models = valid_models

        return sorted_models[:top_n]

    except Exception as err:  # noqa: BLE001
        print(f"Error fetching models: {err}")
        # Fallback to a hardcoded list of popular models
        return [
            {**model, "pricing": {}, "created": 0, "avg_cost": 0}
            for model in FALLBACK_MODELS
        ][:top_n]


In [None]:
# Test: Fetch and display the current rankings
rankings_df = fetch_openrouter_rankings()

print(f"Successfully fetched {len(rankings_df)} ranked models\n")
print("Top 10 Models by Usage on OpenRouter:")
print("=" * 80)

display_df = rankings_df.head(10).copy()

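def format_tokens(tokens: float | None) -> str:
    """Format a raw token count with a T/B/M suffix."""
    # pandas stores missing values as NaN when a column mixes numbers and None
    if tokens is None or pd.isna(tokens):
        return "N/A"
    if tokens >= 1_000_000_000_000:
        return f"{tokens / 1_000_000_000_000:.2f}T"
    if tokens >= 1_000_000_000:
        return f"{tokens / 1_000_000_000:.1f}B"
    if tokens >= 1_000_000:
        return f"{tokens / 1_000_000:.1f}M"
    return f"{int(tokens):,}"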

display_df["tokens_formatted"] = display_df["tokens"].apply(format_tokens)
display_df[["rank", "model_name", "model_id", "tokens_formatted", "token_change"]]

### Understanding the Rankings Output

The table above shows:

- **rank**: Current position on OpenRouter (lower = more popular)
- **model_name**: Human-readable name of the model
- **model_id**: Unique identifier used for API calls (format: `provider/model-name`)
- **tokens_formatted**: Total tokens processed (T=trillion, B=billion, M=million)
- **token_change**: Usage trend as a percentage change (a leading `-` marks a decrease; no sign means an increase)

**What does this tell us?**
Models at the top are being used most heavily by real users on OpenRouter, which often (but not always) correlates with quality, speed, or value. This ranking updates in real time based on actual API usage.
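
Because `token_change` is stored as a string, it has to be converted before it can be sorted or averaged. A minimal sketch, assuming values shaped like `"12%"` or `"-3.5%"` as produced by the scraper; the helper name `parse_token_change` is hypothetical:

```python
from typing import Optional


def parse_token_change(change: Optional[str]) -> Optional[float]:
    """Convert a string like '-3.5%' to a signed float, or None if unusable."""
    if not change:
        return None  # Missing value, e.g. when the fallback list was used
    try:
        return float(change.strip().rstrip("%"))
    except ValueError:
        return None  # Text was not a percentage


print(parse_token_change("12%"))
print(parse_token_change("-3.5%"))
```

With pandas this could be applied column-wise, for example `display_df["token_change"].apply(parse_token_change)`.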

In [None]:
# Remove the ":free" suffix that OpenRouter sometimes appends to free-tier model IDs.
# top_models is built from display_df below, so these stripped IDs are also
# the ones passed to the API.
display_df["model_id"] = display_df["model_id"].str.replace(r":free$", "", regex=True)

In [None]:
# Select top 5 models from the rankings for our evaluation
# Alternative approach (commented out): fetch via API with different sorting
# top_models = fetch_top_openrouter_models(5, sort_by="popularity")
# df_models = pd.DataFrame(top_models)

top_models = display_df.head(5)

In [None]:
# Validate that we have the required data for the evaluation
print("Data Validation:")
print("=" * 60)

# Check we have models
if len(display_df) < 3:
    print("⚠️  Warning: Fewer than 3 models available")
    print("   Results may be less meaningful with fewer models")
else:
    print(f"✓ {len(display_df)} models available for evaluation")

# Check for required columns
required_cols = ['model_id', 'model_name']
missing_cols = [col for col in required_cols if col not in display_df.columns]
if missing_cols:
    print(f"❌ Missing required columns: {missing_cols}")
else:
    print(f"✓ All required columns present: {required_cols}")

# Check for duplicate models
duplicates = display_df['model_id'].duplicated().sum()
if duplicates > 0:
    print(f"⚠️  Warning: {duplicates} duplicate model IDs found")
else:
    print("✓ No duplicate models")

print("\nReady to proceed with evaluation! 🚀")

In [None]:
top_models

In [None]:
# Print a list of models as a Markdown table

def display_top_models(models):
    table = "| Model | Score |\n|-------|-------|\n"
    for model in models:
        table += f"| {model['name']} | {model['score']} |\n"
    print(table)

In [None]:
# Convert top_models DataFrame to the format expected by display_top_models
models_for_display = []
for _, model in top_models.iterrows():
    models_for_display.append({
        'name': model['model_name'],
        'score': f"{model['tokens_formatted']} tokens ({model['token_change']} change)",
    })

display_top_models(models_for_display)

In [None]:
# Configure the OpenRouter API client
# OpenRouter uses an OpenAI-compatible API, so we use the OpenAI Python client
from openai import OpenAI

# Get API key from environment variable
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

if not OPENROUTER_API_KEY:
    error_msg = """
    ❌ OPENROUTER_API_KEY not found!
    
    To fix this issue:
    
    1. Get a free API key from: https://openrouter.ai/
       (Sign up with GitHub or email)
    
    2. Set the environment variable:
       
       Windows PowerShell:
         $env:OPENROUTER_API_KEY="sk-or-v1-xxxxx"
       
       Windows CMD:
         set OPENROUTER_API_KEY=sk-or-v1-xxxxx
       
       Mac/Linux:
         export OPENROUTER_API_KEY="sk-or-v1-xxxxx"
    
    3. Restart this notebook kernel (Kernel → Restart)
    
    Alternative: Create a .env file in the project root:
       OPENROUTER_API_KEY=sk-or-v1-xxxxx
    """
    raise ValueError(error_msg)

# Validate API key format
if not OPENROUTER_API_KEY.startswith("sk-or-"):
    print("⚠️  Warning: API key doesn't start with 'sk-or-'")
    print("   This might not be a valid OpenRouter API key")
    print("   Expected format: sk-or-v1-xxxxx")

# Create client with OpenRouter's base URL
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=OPENROUTER_API_KEY)

print("✓ OpenRouter client configured successfully!")
print(f"  API key: {OPENROUTER_API_KEY[:15]}...{OPENROUTER_API_KEY[-4:]}")
print("  Base URL: https://openrouter.ai/api/v1")

## Model Comparison Pipeline

The evaluation consists of three phases:

1. **Question Generation** - Each model creates one challenging reasoning question
2. **Answer Generation** - Each model answers every question, including its own
3. **Answer Evaluation** - Each model rates every answer on a 10-point scale

This creates a comprehensive cross-comparison where models evaluate each other's performance.
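
To make the cross-comparison concrete, the sketch below averages individual ratings into a rater-by-answerer matrix and a per-answerer score. The sample ratings and the function names are hypothetical, and this is not the notebook's own aggregation code; in the notebook itself a pandas pivot would be the natural tool, while plain Python keeps this example dependency-free.

```python
from collections import defaultdict

# Hypothetical ratings: (rater, answerer, score on the 10-point scale)
SAMPLE_RATINGS = [
    ("model-a", "model-a", 8),
    ("model-a", "model-b", 6),
    ("model-a", "model-b", 8),
    ("model-b", "model-a", 9),
    ("model-b", "model-b", 7),
]


def build_rating_matrix(ratings):
    """Average score that each rater gave to each answerer."""
    buckets = defaultdict(list)
    for rater, answerer, score in ratings:
        buckets[(rater, answerer)].append(score)

    matrix: dict[str, dict[str, float]] = {}
    for (rater, answerer), scores in buckets.items():
        matrix.setdefault(rater, {})[answerer] = sum(scores) / len(scores)
    return matrix


def average_by_answerer(ratings):
    """Average score that each answerer received across all raters."""
    received = defaultdict(list)
    for _rater, answerer, score in ratings:
        received[answerer].append(score)
    return {name: sum(scores) / len(scores) for name, scores in received.items()}


print(build_rating_matrix(SAMPLE_RATINGS))
print(average_by_answerer(SAMPLE_RATINGS))
```

The diagonal of the matrix (a model rating its own answers) is worth inspecting separately, since self-ratings may be biased.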

In [None]:
import re
import time

# Display how many models we're using for the evaluation pipeline
num_models = len(top_models)
total_questions = num_models
total_answers = num_models * num_models
total_ratings = num_models * num_models
total_api_calls = total_questions + total_answers + total_ratings

print(f"Using {num_models} models for the evaluation pipeline:")
print("  - Each model will generate 1 question")
print(f"  - Each model will answer {num_models} questions")
print(f"  - Each model will rate {num_models} answers")
print(f"\nTotal API calls: {total_api_calls}")
print(f"  • {total_questions} question generation calls")
print(f"  • {total_answers} answer generation calls")
print(f"  • {total_ratings} rating calls")

# Estimate time based on typical API response times
avg_question_time = 3  # seconds
avg_answer_time = 5    # seconds
avg_rating_time = 2    # seconds

estimated_time = (
    total_questions * avg_question_time +
    total_answers * avg_answer_time +
    total_ratings * avg_rating_time
)

print(f"\n⏱️  Estimated total time: ~{estimated_time // 60} minutes {estimated_time % 60} seconds")
print(f"   (assuming {avg_question_time}s/question, {avg_answer_time}s/answer, {avg_rating_time}s/rating)")
print("\n💡 Tip: Actual time may vary based on model speed and API load")

In [None]:
def generate_questions(
    models: pd.DataFrame, client: Any
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Generate evaluation questions using the provided models.
    
    Each model generates one challenging question designed to test reasoning depth.
    
    Args:
        models: DataFrame with model information (must have 'model_id' and 'model_name' columns)
        client: OpenRouter API client (OpenAI-compatible)
        
    Returns:
        Tuple of (all_questions_df, valid_questions_df)
        - all_questions_df: All questions including any that had errors
        - valid_questions_df: Only successfully generated questions (errors filtered out)
        
    Example:
        >>> all_q, valid_q = generate_questions(top_models, client)
        >>> print(f"Generated {len(valid_q)} valid questions out of {len(all_q)} attempts")
        Generated 5 valid questions out of 5 attempts
        
    Notes:
        - Uses temperature=0.8 for creative/diverse questions
        - The prompt asks for questions under 400 tokens; max_tokens is set to
          10,000 because Gemini 2.5 Pro needed the extra headroom
        - Errors are captured in the DataFrame but don't stop execution
        - Preview of each question is printed during generation
    """
    question_generation_prompt = (
        "Please craft ONE challenging, original, nuanced question that can effectively "
        "discriminate between language models of varying reasoning depth. The question should: "
        "(1) require multi-step reasoning, (2) avoid simple trivia, (3) be answerable without external "
        "browsing, (4) not be purely opinion, (5) allow partial credit, and (6) be less than 400 tokens. Provide only the question text."
    )

    generated_questions: list[dict[str, Any]] = []
    
    # Each model generates one question
    for _, model in models.iterrows():
        mid = model["model_id"]
        mname = model.get("model_name", mid)
        print(f"\n[Generation] {mname} generating a question...")
        
        try:
            start = time.time()
            completion = client.chat.completions.create(
                model=mid,
                messages=[{"role": "user", "content": question_generation_prompt}],
                max_tokens=10000,  # Generous cap (raised for Gemini 2.5 Pro); the prompt itself asks for under 400 tokens
                temperature=0.8,  # Higher temperature for creative/diverse questions
            )
            q_text = completion.choices[0].message.content.strip()
            elapsed = time.time() - start
            
            generated_questions.append(
                {
                    "question_model_id": mid,
                    "question_model_name": mname,
                    "question": q_text,
                    "gen_time_s": round(elapsed, 2),
                }
            )
            # Show preview of generated question
            print(
                f"✓ Question from {mname}: {q_text[:110]}{'...' if len(q_text) > 110 else ''}"
            )
        except Exception as err:  # noqa: BLE001
            # Record errors but continue with other models
            generated_questions.append(
                {
                    "question_model_id": mid,
                    "question_model_name": mname,
                    "question": f"Error generating question: {err}",
                    "gen_time_s": None,
                }
            )
            print(f"✗ {mname} failed: {err}")

    questions_df = pd.DataFrame(generated_questions)

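    # Filter out errored questions for downstream phases.
    # Match the full error prefix so a real question that happens to start
    # with "Error" is not dropped.
    valid_questions_df = questions_df[
        ~questions_df["question"].str.startswith("Error generating question:")
    ].copy()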

    return questions_df, valid_questions_df

In [None]:
questions_df, valid_questions_df = generate_questions(top_models, client)

In [None]:
# Display the question from the 4th row (index 3) - Gemini 2.5 Pro in this case, the model for which we had to increase max_tokens
if len(questions_df) >= 4:
    fourth_question = questions_df.iloc[3]
    print(f"Question from {fourth_question['question_model_name']}:")
    print("=" * 60)
    print(fourth_question['question'])
    print("=" * 60)
    print(f"Generation time: {fourth_question['gen_time_s']}s")
else:
    print(f"Only {len(questions_df)} questions available. Cannot display 4th question.")

In [None]:
valid_questions_df

In [None]:
def answer_questions(
    questions_df: pd.DataFrame, models: pd.DataFrame, client: Any
) -> pd.DataFrame:
    """
    Generate answers to questions using the provided models.

    Each model answers every question from every model (including its own question).

    Args:
        questions_df: DataFrame with questions (must have 'question' and 'question_model_name' columns)
        models: DataFrame with model information (must have 'model_id' and 'model_name' columns)
        client: OpenRouter API client (OpenAI-compatible)

    Returns:
        DataFrame with answers containing columns: question_model_name, question,
        answer_model_id, answer_model_name, answer, answer_time_s
    """
    answers: list[dict[str, Any]] = []
    answer_instructions = (
        "You will be given a question designed to evaluate reasoning depth. Provide a thorough, "
        "structured answer. Show reasoning explicitly if helpful, but keep it concise and logical."
    )

    # Each model answers each question
    for _, qrow in questions_df.iterrows():
        q_text = qrow["question"]
        origin_model = qrow["question_model_name"]

        for _, model in models.iterrows():
            mid = model["model_id"]
            mname = model.get("model_name", mid)
            print(f"\n[Answer] {mname} answering question from {origin_model}...")

            try:
                start = time.time()
                completion = client.chat.completions.create(
                    model=mid,
                    messages=[
                        {"role": "system", "content": answer_instructions},
                        {"role": "user", "content": q_text},
                    ],
                    max_tokens=10000,  # Allow longer responses for thorough answers (especially for Gemini Pro)
                    temperature=0.5,  # Moderate temperature for balanced responses
                    timeout=30.0,  # 30 second timeout to prevent hanging
                )
                ans_text = completion.choices[0].message.content.strip()
                elapsed = time.time() - start

                answers.append(
                    {
                        "question_model_name": origin_model,
                        "question": q_text,
                        "answer_model_id": mid,
                        "answer_model_name": mname,
                        "answer": ans_text,
                        "answer_time_s": round(elapsed, 2),
                    }
                )
                print(f"✓ Answer length: {len(ans_text)} chars")
            except Exception as err:  # noqa: BLE001
                # Record errors but continue with other models
                answers.append(
                    {
                        "question_model_name": origin_model,
                        "question": q_text,
                        "answer_model_id": mid,
                        "answer_model_name": mname,
                        "answer": f"Error: {err}",
                        "answer_time_s": None,
                    }
                )
                print(f"✗ {mname} failed to answer: {err}")

    answers_df = pd.DataFrame(answers)
    print("\n✓ Answers DataFrame ready (answers_df)")

    return answers_df


In [None]:
answers_df_full = answer_questions(valid_questions_df, top_models, client)

In [None]:
answers_df_full

In [None]:
def evaluate_answers(
    answers_df: pd.DataFrame, models: pd.DataFrame, client: Any
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Evaluate answers using the provided models as raters.
    
    Each model rates every answer on a 10-point scale across multiple criteria.
    
    Args:
        answers_df: DataFrame with answers (must have 'question', 'answer', and 'answer_model_name' columns)
        models: DataFrame with model information (must have 'model_id' and 'model_name' columns)
        client: OpenRouter API client (OpenAI-compatible)
        
    Returns:
        Tuple of (ratings_df, summary_by_model, summary_by_pair)
        - ratings_df: All individual ratings
        - summary_by_model: Average rating per answer model
        - summary_by_pair: Cross-model rating matrix (how each rater rated each answerer)
    """
    ratings: list[dict[str, Any]] = []
    rating_prompt_template = (
        "You are evaluating an answer to a reasoning-focused question. Score 1-10 (10 = outstanding).\n"
        "Criteria (roughly equal weight):\n"
        "1. Clarity & organization\n"
        "2. Depth & correctness\n"
        "3. Completeness\n"
        "4. Insight/originality (if applicable)\n"
        "\nReturn ONLY the integer score (1-10)."
    )

    # Each model rates each answer
    for _, arow in answers_df.iterrows():
        ans_text = arow["answer"]
        # Skip error responses
        if ans_text.startswith("Error:"):
            continue
            
        q_text = arow["question"]
        ans_model = arow["answer_model_name"]
        
        for _, model in models.iterrows():
            mid = model["model_id"]
            mname = model.get("model_name", mid)
            print(f"\n[Rating] {mname} rating answer from {ans_model}...")
            
            # Combine question, answer, and rating instructions
            rating_input = (
                f"Question: {q_text}\n\nAnswer: {ans_text}\n\n{rating_prompt_template}"
            )
            
            try:
                completion = client.chat.completions.create(
                    model=mid,
                    messages=[{"role": "user", "content": rating_input}],
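                    max_tokens=1000,  # Only a number is expected; the cap leaves headroom for verbose replies
                    temperature=0.0,  # Near-deterministic rating
                )
                # content can be None; treat that as an empty reply (score stays None)
                raw = (completion.choices[0].message.content or "").strip()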
                
                # Extract numeric score (1-10) from response
                match = re.search(r"\b(10|[1-9])\b", raw)
                score = int(match.group(1)) if match else None
            except Exception as err:  # noqa: BLE001
                raw = f"Error: {err}"
                score = None
                
            ratings.append(
                {
                    "question": q_text,
                    "answer_model_name": ans_model,
                    "rater_model_id": mid,
                    "rater_model_name": mname,
                    "raw_rating_text": raw,
                    "rating": score,
                }
            )

    ratings_df = pd.DataFrame(ratings)
    print("\n✓ Ratings DataFrame ready (ratings_df)")

    # Generate aggregation summaries
    summary_model = pd.DataFrame()
    summary_pair = pd.DataFrame()

    if not ratings_df.empty:
        # Calculate average rating for each model's answers
        summary_model = (
            ratings_df.groupby("answer_model_name")["rating"]
            .mean()
            .reset_index()
            .rename(columns={"rating": "avg_rating"})
        )
        
        # Calculate average rating for each (answerer, rater) pair
        summary_pair = (
            ratings_df.groupby(["answer_model_name", "rater_model_name"])["rating"]
            .mean()
            .reset_index()
            .rename(columns={"rating": "avg_rating"})
        )
        
        print("\n" + "="*80)
        print("Average rating per answer model:")
        print("="*80)
        display(summary_model.sort_values("avg_rating", ascending=False))
        
        print("\n" + "="*80)
        print("Cross-model rating matrix (long form):")
        print("="*80)
        display(summary_pair.head(20))
    else:
        print("⚠ No ratings captured.")

    return ratings_df, summary_model, summary_pair

### Understanding the Rating System

Each rater model returns a **single integer score from 1 to 10** for every answer. The rating prompt asks it to weigh four criteria roughly equally; no per-criterion sub-scores are collected. The prompt names only the criteria, so the questions under each one are a reading guide rather than text sent to the rater:

1. **Clarity & Organization**
   - Is the answer well-structured and easy to follow?
   - Are concepts explained clearly without unnecessary jargon?

2. **Depth & Correctness**
   - Is the reasoning sound and logically valid?
   - Are facts accurate and relevant?

3. **Completeness**
   - Does the answer address all parts of the question?
   - Are edge cases or caveats mentioned when appropriate?

4. **Insight & Originality** (if applicable)
   - Does the answer provide novel perspectives or connections?
   - Is there evidence of deeper understanding beyond surface-level knowledge?

**Why use models as raters?**
- Consistent evaluation criteria across all answers
- Faster than human evaluation for large-scale comparisons
- Tests if models can accurately judge reasoning quality (meta-evaluation)

**Limitations:**
- Models may have biases (e.g., preferring similar styles to their own)
- Some subjective criteria may be interpreted differently
- Aggregating ratings across multiple model-raters reduces these effects but does not remove them
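
To make the scoring step concrete, here is a minimal sketch of how a rater's raw reply is turned into an integer, using the same regular expression as `evaluate_answers`. The helper name `extract_score` and the sample replies are assumptions made for illustration; they are not part of the notebook.

```python
import re
from typing import Optional


def extract_score(raw: str) -> Optional[int]:
    """Return the first standalone integer from 1 to 10 in a rater reply, else None."""
    # Same pattern as evaluate_answers: "10" or a single digit 1-9 on word boundaries
    match = re.search(r"\b(10|[1-9])\b", raw)
    return int(match.group(1)) if match else None


# Hypothetical rater replies, for illustration only
samples = ["8", "Score: 7", "9/10", "I cannot rate this answer."]
scores = [extract_score(s) for s in samples]
print(scores)
```

Because the pattern stops at the first standalone digit, a reply such as `7.5` is read as 7, and a reply with no number at all is stored as a missing rating.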

In [None]:
ratings_df_full, summary_model_full, summary_pair_full = evaluate_answers(answers_df_full, top_models, client)

In [None]:
summary_pair_full

In [None]:
# Extract results for README documentation
print("## Generated Questions\n")
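# Number questions by position; the DataFrame index may have gaps after filtering
for num, (_, row) in enumerate(valid_questions_df.iterrows(), start=1):
    print(f"### {num}. {row['question_model_name']}")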
    print(f"**Question:** {row['question']}\n")
    print(f"**Generation Time:** {row['gen_time_s']}s\n")

print("\n## Model Performance Summary\n")
print("### Overall Average Ratings (10-point scale)\n")
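# Use enumerate for the rank: after sorting, the index no longer reflects position
ranked_models = summary_model_full.sort_values('avg_rating', ascending=False)
for rank, (_, row) in enumerate(ranked_models.iterrows(), start=1):
    print(f"{rank}. **{row['answer_model_name']}**: {row['avg_rating']:.2f}/10")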

print("\n### Cross-Model Rating Matrix")
print("\nThis shows how each model (rater/columns) rated each model's answers (answerer/rows):\n")

# Create a pivot table for easier viewing
pivot_table = summary_pair_full.pivot_table(
    index='answer_model_name',
    columns='rater_model_name',
    values='avg_rating'
)

# Convert pivot table to markdown format
def pivot_to_markdown(pivot_df):
    """Convert a pivot table DataFrame to markdown table format."""
    if pivot_df.empty:
        return "No data available"
    
    # Start with header row
    headers = ['Model'] + list(pivot_df.columns)
    markdown_lines = []
    
    # Create header
    header_line = '| ' + ' | '.join(headers) + ' |'
    markdown_lines.append(header_line)
    
    # Create separator line
    separator = '| ' + ' | '.join(['---'] * len(headers)) + ' |'
    markdown_lines.append(separator)
    
    # Add data rows
    for index, row in pivot_df.iterrows():
        row_data = [str(index)]
        for col in pivot_df.columns:
            value = row[col]
            if pd.isna(value):
                row_data.append('N/A')
            else:
                row_data.append(f'{value:.2f}')
        row_line = '| ' + ' | '.join(row_data) + ' |'
        markdown_lines.append(row_line)
    
    return '\n'.join(markdown_lines)

print(pivot_to_markdown(pivot_table))

## Interpretation & Next Steps

### What Do These Results Tell Us?

1. **Average Ratings** - Higher average ratings suggest models that consistently provide well-reasoned, complete answers
2. **Cross-Model Matrix** - Shows agreement/disagreement between raters. High variance may indicate:
   - Different evaluation standards between models
   - Genuine quality differences in answers
   - Bias toward certain answer styles

3. **Self-Ratings vs Peer-Ratings** - Reveals whether models are overconfident, overly critical, or fair
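
As a rough illustration of the self-versus-peer comparison in point 3, the sketch below computes both averages per model from a table shaped like `ratings_df_full`. The toy data and the `rating_kind` column are assumptions made for this example.

```python
import pandas as pd

# Toy ratings in the same long format as ratings_df_full (values are made up)
toy_ratings = pd.DataFrame(
    {
        "answer_model_name": ["A", "A", "B", "B"],
        "rater_model_name": ["A", "B", "A", "B"],
        "rating": [9, 7, 6, 8],
    }
)

# Label each row as a self-rating or a peer-rating
is_self = toy_ratings["answer_model_name"] == toy_ratings["rater_model_name"]
toy_ratings["rating_kind"] = is_self.map({True: "self_avg", False: "peer_avg"})

# Mean rating per answering model, with one column per rating kind
self_vs_peer = (
    toy_ratings.groupby(["answer_model_name", "rating_kind"])["rating"]
    .mean()
    .unstack("rating_kind")
)
self_vs_peer["self_minus_peer"] = self_vs_peer["self_avg"] - self_vs_peer["peer_avg"]
print(self_vs_peer)
```

A positive `self_minus_peer` value suggests a model scores its own answers more generously than its peers do; with only a handful of questions, small differences should not be over-read.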

### Limitations to Consider

- **Sample Size**: Each model contributes one question, so N models yield only N questions; averages over so few items are noisy
- **Question Quality**: The evaluation quality depends on how good the generated questions are
- **Rater Bias**: Models may have inherent biases in how they evaluate responses
- **Domain Coverage**: Questions may cluster in certain domains based on model training
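
One possible way to probe the rater-bias limitation is to mean-center each rater's scores before averaging, so that generous and strict raters share a baseline. The notebook itself does not do this; the sketch below uses made-up data and a hypothetical `centered_rating` column.

```python
import pandas as pd

# Toy ratings: one lenient and one strict rater (values are made up)
toy_ratings = pd.DataFrame(
    {
        "answer_model_name": ["A", "B", "A", "B"],
        "rater_model_name": ["lenient", "lenient", "strict", "strict"],
        "rating": [9, 10, 5, 6],
    }
)

# Subtract each rater's own mean so only relative preferences remain
rater_mean = toy_ratings.groupby("rater_model_name")["rating"].transform("mean")
toy_ratings["centered_rating"] = toy_ratings["rating"] - rater_mean

# Average the centered scores per answering model
centered_by_model = toy_ratings.groupby("answer_model_name")["centered_rating"].mean()
print(centered_by_model)
```

In this toy case both raters prefer B by the same margin even though their raw scores differ by four points, which is the pattern the centered averages expose.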

### Ideas for Extension

1. **Increase Sample Size**: Run with more questions per model (requires API credits)
2. **Domain-Specific Evaluation**: Test models on specific domains (math, coding, creative writing)
3. **Human Validation**: Compare model ratings with human expert ratings
4. **Consistency Testing**: Run the same evaluation multiple times to check stability
5. **Cost Analysis**: Track token usage and costs to compute value-per-dollar
6. **Response Time Analysis**: Compare speed vs quality tradeoffs
7. **Temperature Experiments**: Test how different temperature settings affect question/answer quality
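
For the response-time idea (item 6), a minimal sketch could join the mean `answer_time_s` per model onto the average ratings. The toy frames below stand in for `answers_df_full` and `summary_model_full`; the column name `avg_time_s` is an assumption for this example.

```python
import pandas as pd

# Toy stand-ins for answers_df_full and summary_model_full (values are made up)
toy_answers = pd.DataFrame(
    {
        "answer_model_name": ["A", "A", "B", "B"],
        "answer_time_s": [4.0, 6.0, 10.0, None],  # None mimics a failed request
    }
)
toy_summary = pd.DataFrame({"answer_model_name": ["A", "B"], "avg_rating": [8.0, 9.0]})

# Mean latency per model; missing times from failed requests are skipped by mean()
latency = (
    toy_answers.groupby("answer_model_name")["answer_time_s"]
    .mean()
    .reset_index()
    .rename(columns={"answer_time_s": "avg_time_s"})
)

# One row per model with both quality and speed
speed_vs_quality = toy_summary.merge(latency, on="answer_model_name", how="left")
print(speed_vs_quality.sort_values("avg_rating", ascending=False))
```

Note that `answer_time_s` is wall-clock time for the whole request, so it also reflects network and provider queueing, not just model speed.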

### Saving Results

You can export the results for further analysis:

```python
# Export to CSV
summary_model_full.to_csv('model_ratings.csv', index=False)
summary_pair_full.to_csv('cross_model_ratings.csv', index=False)
questions_df_full.to_csv('generated_questions.csv', index=False)
answers_df_full.to_csv('model_answers.csv', index=False)
ratings_df_full.to_csv('all_ratings.csv', index=False)
```

## Run Full Evaluation Pipeline

Execute the complete pipeline: question generation → answering → rating

In [None]:
# Run the complete evaluation pipeline
print(f"Starting evaluation pipeline with {len(top_models)} models...\n")

# Step 1: Generate questions
questions_df_full, valid_questions_df_full = generate_questions(top_models, client)

# Step 2: Generate answers
answers_df_full = answer_questions(valid_questions_df_full, top_models, client)

# Step 3: Evaluate answers
ratings_df_full, summary_model_full, summary_pair_full = evaluate_answers(answers_df_full, top_models, client)

print("\n" + "="*80)
print("PIPELINE COMPLETE")
print("="*80)
print("\nGenerated artifacts:")
print("  - questions_df_full: All generated questions")
print("  - valid_questions_df_full: Successfully generated questions only")
print("  - answers_df_full: All model answers to all questions")
print("  - ratings_df_full: All ratings from all rater models")
print("  - summary_model_full: Average rating per model")
print("  - summary_pair_full: Cross-model rating matrix")

In [None]:
# Calculate and display interesting statistics about the ratings in markdown format
def generate_rating_statistics_markdown():
    """Generate rating statistics in markdown format."""
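    if ratings_df_full.empty or 'rating' not in ratings_df_full.columns:
        return "⚠️ **No valid ratings data available**"

    # Overall statistics (failed or unparseable ratings are stored as None and dropped here)
    valid_ratings = ratings_df_full['rating'].dropna()
    if valid_ratings.empty:
        return "⚠️ **No valid ratings data available**"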
    
    markdown_lines = []
    markdown_lines.append("## Rating Statistics Summary")
    markdown_lines.append("")
    
    markdown_lines.append("### 📊 Overall Rating Distribution")
    markdown_lines.append("")
    markdown_lines.append(f"- **Total ratings collected**: {len(valid_ratings)}")
    markdown_lines.append(f"- **Average rating**: {valid_ratings.mean():.2f} / 10")
    markdown_lines.append(f"- **Median rating**: {valid_ratings.median():.1f} / 10")
    markdown_lines.append(f"- **Standard deviation**: {valid_ratings.std():.2f}")
    markdown_lines.append(f"- **Range**: {valid_ratings.min():.0f} - {valid_ratings.max():.0f}")
    markdown_lines.append("")
    
    # Rating distribution
    markdown_lines.append("### 📈 Rating Frequency")
    markdown_lines.append("")
    rating_counts = valid_ratings.value_counts().sort_index(ascending=False)
    for score, count in rating_counts.items():
        bar = '█' * int(count / len(valid_ratings) * 20)  # Shorter bars for markdown
        percentage = count/len(valid_ratings)*100
        markdown_lines.append(f"- **{score:2.0f}/10**: `{bar}` ({count} ratings, {percentage:.1f}%)")
    markdown_lines.append("")
    
    # Self-rating analysis (do models rate themselves higher?)
    if 'answer_model_name' in ratings_df_full.columns and 'rater_model_name' in ratings_df_full.columns:
        self_ratings = ratings_df_full[
            ratings_df_full['answer_model_name'] == ratings_df_full['rater_model_name']
        ]['rating'].dropna()
        
        other_ratings = ratings_df_full[
            ratings_df_full['answer_model_name'] != ratings_df_full['rater_model_name']
        ]['rating'].dropna()
        
        if len(self_ratings) > 0 and len(other_ratings) > 0:
            markdown_lines.append("### 🤔 Self-Rating vs. Peer-Rating")
            markdown_lines.append("")
            markdown_lines.append(f"- **Average when rating own answers**: {self_ratings.mean():.2f}")
            markdown_lines.append(f"- **Average when rating others' answers**: {other_ratings.mean():.2f}")
            diff = self_ratings.mean() - other_ratings.mean()
            
            if abs(diff) < 0.5:
                bias_text = f"Models are fairly unbiased (difference: {diff:+.2f})"
            elif diff > 0:
                bias_text = f"Models tend to rate themselves higher (difference: {diff:+.2f})"
            else:
                bias_text = f"Models tend to rate themselves lower (difference: {diff:+.2f})"
            
            markdown_lines.append(f"- **Bias Analysis**: {bias_text}")
            markdown_lines.append("")
    
    # Most generous and strictest raters
    if 'rater_model_name' in ratings_df_full.columns:
        # Drop raters with no valid ratings so a NaN mean is never reported as "strictest"
        avg_by_rater = (
            ratings_df_full.groupby('rater_model_name')['rating']
            .mean()
            .dropna()
            .sort_values(ascending=False)
        )

        if not avg_by_rater.empty:
            markdown_lines.append("### 🎭 Rater Characteristics")
            markdown_lines.append("")
            markdown_lines.append(f"- **🎁 Most Generous Rater**: {avg_by_rater.index[0]} (avg: {avg_by_rater.iloc[0]:.2f})")
            markdown_lines.append(f"- **🔍 Strictest Rater**: {avg_by_rater.index[-1]} (avg: {avg_by_rater.iloc[-1]:.2f})")
            markdown_lines.append(f"- **📏 Rater Spread**: {avg_by_rater.iloc[0] - avg_by_rater.iloc[-1]:.2f} points")
    
    return '\n'.join(markdown_lines)

# Generate and print the markdown statistics
print(generate_rating_statistics_markdown())

### 🧪 Quick Test (Optional)

Before running the full evaluation, you can test with a single model to verify everything works:

```python
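# Test with just one model
test_model = top_models.head(1)
print(f"Testing with: {test_model.iloc[0]['model_name']}")

# Generate one question
test_q, test_valid_q = generate_questions(test_model, client)
question_ok = len(test_valid_q) > 0
if question_ok:
    print("✓ Question generation works!")

# Generate one answer (failed calls are stored as rows whose text starts with "Error:")
test_a = answer_questions(test_valid_q, test_model, client)
answer_ok = len(test_a) > 0 and not test_a["answer"].str.startswith("Error:").any()
if answer_ok:
    print("✓ Answer generation works!")

# Generate one rating (failed or unparseable ratings are stored with rating=None)
test_r, _, _ = evaluate_answers(test_a, test_model, client)
rating_ok = len(test_r) > 0 and test_r["rating"].notna().any()
if rating_ok:
    print("✓ Rating works!")

if question_ok and answer_ok and rating_ok:
    print("\n🎉 All systems go! Ready for full evaluation.")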
```

**Tip:** Run this test first if you're unsure about your API setup.