# LLM Model Comparison using OpenRouter Rankings

This notebook fetches the most-used models from [OpenRouter Rankings](https://openrouter.ai/rankings), keeps the top 5, and compares them: each model writes a question, answers every question, and rates every answer.

In [1]:
import os

from typing import Any  # used in type hints for the API client and result records

import pandas as pd
import requests
from bs4 import BeautifulSoup

## Fetch Top Models

Available sorting options:
- **popularity** - Models ranked by actual usage on OpenRouter (scraped using Playwright)
- **price_low** - Cheapest models first
- **price_high** - Most expensive (often most capable) models first  
- **context_length** - Longest context window first
- **newest** - Most recently added models first
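
The non-scraped options above each reduce to a sort key plus a direction. A minimal sketch, assuming each model is a dict with `avg_cost`, `context_length`, and `created` fields (the shape built by `fetch_top_openrouter_models` in the next cell). `SORT_KEYS` and `sort_models` are hypothetical names used only for illustration; `popularity` is left out because it depends on scraped ranks.

```python
from typing import Any, Callable

# Hypothetical lookup table: criterion -> (key function, sort descending?)
SORT_KEYS: dict[str, tuple[Callable[[dict[str, Any]], Any], bool]] = {
    "price_low": (lambda m: m["avg_cost"], False),
    "price_high": (lambda m: m["avg_cost"], True),
    "context_length": (lambda m: m["context_length"], True),
    "newest": (lambda m: m["created"], True),
}


def sort_models(models: list[dict[str, Any]], sort_by: str) -> list[dict[str, Any]]:
    """Return a new list ordered by the criterion; unknown criteria keep the input order."""
    if sort_by not in SORT_KEYS:
        return list(models)
    key, descending = SORT_KEYS[sort_by]
    return sorted(models, key=key, reverse=descending)


# Made-up sample data, not real OpenRouter models
sample = [
    {"id": "a", "avg_cost": 2.0, "context_length": 8_000, "created": 100},
    {"id": "b", "avg_cost": 0.5, "context_length": 128_000, "created": 300},
    {"id": "c", "avg_cost": 9.0, "context_length": 32_000, "created": 200},
]
print([m["id"] for m in sort_models(sample, "price_low")])  # ['b', 'a', 'c']
```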

**Note:** The popularity ranking uses Playwright in a separate thread to avoid Windows event loop limitations in Jupyter.
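
The threading pattern in the note can be shown without Playwright. Jupyter already runs an event loop in the main thread, so the sketch below hands the coroutine to a worker thread that creates its own loop. `fetch_value` is a hypothetical stand-in for the real scraping coroutine, and the helper names are illustrative only.

```python
import asyncio
import concurrent.futures


async def fetch_value() -> str:
    """Hypothetical stand-in for the Playwright scraping coroutine."""
    await asyncio.sleep(0.01)
    return "done"


def run_in_fresh_loop() -> str:
    """Create a private event loop in the worker thread and run the coroutine on it."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(fetch_value())
    finally:
        loop.close()


def run_blocking(timeout_s: float = 5.0) -> str:
    """Run the coroutine from synchronous code, even if the calling thread already has a running loop."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(run_in_fresh_loop).result(timeout=timeout_s)


result = run_blocking()
print(result)  # done
```

On Windows the notebook additionally selects a Proactor event loop for the worker thread, since Playwright needs to start a browser subprocess.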


In [2]:
def fetch_openrouter_rankings() -> pd.DataFrame:
    """
    Scrape the OpenRouter rankings page to get the current top models by actual usage.
    Uses Playwright async API run in a separate thread to avoid event loop conflicts.
    Returns a DataFrame with columns: rank, model_id, model_name, tokens, token_change.
    Falls back to a hardcoded list of models if scraping fails.
    """
    import asyncio
    import concurrent.futures
    import sys

    async def _async_fetch():
        """Internal async function to fetch rankings"""
        try:
            import re

            from playwright.async_api import async_playwright

            print("Launching browser to fetch rankings...")

            async with async_playwright() as p:
                # Launch browser in headless mode
                browser = await p.chromium.launch(headless=True)
                page = await browser.new_page()

                try:
                    # Navigate to rankings page (short timeout; user can retry)
                    await page.goto(
                        "https://openrouter.ai/rankings",
                        wait_until="domcontentloaded",
                        timeout=15000,
                    )

                    # Wait for the leaderboard section
                    await page.wait_for_selector(
                        "#leaderboard", timeout=10000, state="attached"
                    )

                    # Allow JS content to populate
                    await page.wait_for_timeout(5000)

                    # Get the rendered HTML
                    html = await page.content()

                finally:
                    await browser.close()

            # Parse the rendered HTML with BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")
            leaderboard = soup.find(id="leaderboard")

            if not leaderboard:
                print("⚠ Leaderboard section not found in rendered page")
                return pd.DataFrame(
                    columns=["rank", "model_id", "model_name", "tokens", "token_change"]
                )

            # Extract model information using CSS selectors
            rankings_data: list[dict[str, object]] = []

            # Find all leaderboard entries (grid containers)
            entries = leaderboard.select("div.grid.grid-cols-12.items-center")

            for entry in entries[:30]:  # Limit to top 30
                try:
                    # Extract rank (first column)
                    rank_elem = entry.select_one("div.col-span-1")
                    rank = (
                        int(rank_elem.get_text(strip=True).replace(".", ""))
                        if rank_elem
                        else None
                    )

                    # Extract model name and link (second column)
                    model_link = entry.select_one("div.col-span-7 a.font-medium")
                    if not model_link:
                        continue

                    model_name = model_link.get_text(strip=True)
                    href = model_link.get("href", "")
                    model_id = (
                        href[1:] if isinstance(href, str) and href.startswith("/") else href
                    )

                    # Extract token count and change percentage (third column)
                    tokens: int | None = None
                    token_change: str | None = None
                    token_container = entry.select_one("div.col-span-4")

                    if token_container:
                        divs = token_container.select("div")

                        # First div contains the token count
                        if divs:
                            token_text = divs[0].get_text(strip=True)
                            # Pattern examples: 1.04T tokens, 801B tokens
                            token_match = re.search(
                                r"([\d.]+)([KMBT])\s*tokens", token_text, re.IGNORECASE
                            )
                            if token_match:
                                value = float(token_match.group(1))
                                unit = token_match.group(2).upper()
                                multipliers = {
                                    "K": 1_000,
                                    "M": 1_000_000,
                                    "B": 1_000_000_000,
                                    "T": 1_000_000_000_000,
                                }
                                tokens = int(value * multipliers.get(unit, 1))

                        # Extract percentage change
                        percent_div = token_container.select_one("div.mt-1")
                        if percent_div:
                            svg_elem = percent_div.select_one("svg")
                            full_text = percent_div.get_text(strip=True)
                            percent_match = re.search(r"([\d.]+)%", full_text)
                            if percent_match:
                                percentage_value = percent_match.group(1)
                                svg_class = ""
                                if svg_elem:
                                    svg_class_raw = svg_elem.get("class", [])
                                    if isinstance(svg_class_raw, list):
                                        svg_class = " ".join(svg_class_raw)
                                    else:  # str
                                        svg_class = str(svg_class_raw)
                                # Red indicates decrease, green indicates increase
                                if svg_class and "text-red" in svg_class:
                                    token_change = f"-{percentage_value}%"
                                else:
                                    token_change = f"{percentage_value}%"

                    rankings_data.append(
                        {
                            "rank": rank,
                            "model_id": model_id,
                            "model_name": model_name,
                            "tokens": tokens,
                            "token_change": token_change,
                        }
                    )

                except (ValueError, AttributeError):
                    # Skip entries that don't match expected format
                    continue

            if rankings_data:
                print(
                    f"✓ Successfully extracted {len(rankings_data)} models from rankings page"
                )
                return pd.DataFrame(rankings_data)
            print("⚠ No models found, using fallback")
            raise ValueError("No models extracted")

        except Exception as err:  # noqa: BLE001
            print(f"Error with Playwright: {err}")
            import traceback

            traceback.print_exc()
            raise

    def _run_in_thread():
        """Run async function in a new event loop in a separate thread"""
        # On Windows, use ProactorEventLoop which supports subprocesses
        if sys.platform == "win32":
            loop = asyncio.WindowsProactorEventLoopPolicy().new_event_loop()
        else:
            loop = asyncio.new_event_loop()

        asyncio.set_event_loop(loop)
        try:
            return loop.run_until_complete(_async_fetch())
        finally:
            loop.close()

    try:
        # Run in a thread pool to avoid event loop conflicts
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(_run_in_thread)
            return future.result(timeout=30)

    except Exception as err:  # noqa: BLE001
        print(f"Error fetching rankings: {err}")

        # Fallback to hardcoded list
        print("Using fallback: hardcoded top models list")
        fallback_rankings = [
            {
                "rank": 1,
                "model_id": "google/gemini-2.0-flash-exp:free",
                "model_name": "Gemini 2.0 Flash",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 2,
                "model_id": "anthropic/claude-3.5-sonnet",
                "model_name": "Claude 3.5 Sonnet",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 3,
                "model_id": "openai/gpt-4o",
                "model_name": "GPT-4o",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 4,
                "model_id": "x-ai/grok-beta",
                "model_name": "Grok Beta",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 5,
                "model_id": "google/gemini-pro-1.5",
                "model_name": "Gemini Pro 1.5",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 6,
                "model_id": "anthropic/claude-3.5-haiku",
                "model_name": "Claude 3.5 Haiku",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 7,
                "model_id": "deepseek/deepseek-chat",
                "model_name": "DeepSeek Chat",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 8,
                "model_id": "openai/gpt-4o-mini",
                "model_name": "GPT-4o Mini",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 9,
                "model_id": "meta-llama/llama-3.1-405b-instruct",
                "model_name": "Llama 3.1 405B",
                "tokens": None,
                "token_change": None,
            },
            {
                "rank": 10,
                "model_id": "qwen/qwen-2.5-72b-instruct",
                "model_name": "Qwen 2.5 72B",
                "tokens": None,
                "token_change": None,
            },
        ]
        return pd.DataFrame(fallback_rankings)


def fetch_top_openrouter_models(top_n: int = 5, sort_by: str = "popularity") -> list[dict]:
    """
    Fetch the top N models from OpenRouter.

    Args:
        top_n: Number of top models to return
        sort_by: Sorting criterion - 'popularity', 'price_low', 'price_high', 'context_length', 'newest'

    Returns a list of dictionaries containing model information.
    """
    url = "https://openrouter.ai/api/v1/models"

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()

        # Extract model data
        models = data.get("data", [])

        # Keep only models whose prompt and completion prices are present and parseable
        valid_models: list[dict[str, object]] = []
        for model in models:
            pricing = model.get("pricing", {})
            if pricing and pricing.get("prompt") and pricing.get("completion"):
                try:
                    prompt_cost = float(pricing.get("prompt", "0"))
                    completion_cost = float(pricing.get("completion", "0"))
                except (TypeError, ValueError):
                    continue
                model_info = {
                    "id": model.get("id", ""),
                    "name": model.get("name", model.get("id", "")),
                    "description": model.get(
                        "description", "No description available"
                    ),
                    "context_length": model.get("context_length", 0),
                    "pricing": pricing,
                    "created": model.get("created", 0),
                    # Average cost per 1M tokens
                    "avg_cost": (prompt_cost + completion_cost) / 2 * 1_000_000,
                }
                valid_models.append(model_info)

        # Sort models based on the specified criterion
        if sort_by == "popularity":
            print(
                "Fetching current model rankings from OpenRouter (using Playwright)..."
            )
            rankings_df = fetch_openrouter_rankings()
            if rankings_df is not None and not rankings_df.empty:
                popularity_order = {
                    row["model_id"]: row["rank"] for _, row in rankings_df.iterrows()
                }
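                # Entries in valid_models store the identifier under "id" (not "model_id");
                # models missing from the rankings sort last via the 999 default.
                sorted_models = sorted(
                    valid_models, key=lambda x: popularity_order.get(x["id"], 999)
                )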
                print(
                    f"✓ Successfully ranked {len(rankings_df)} models by current usage data"
                )
            else:
                print("⚠ Could not fetch rankings, using default sort")
                sorted_models = valid_models
        elif sort_by == "price_low":
            sorted_models = sorted(valid_models, key=lambda x: x["avg_cost"])
        elif sort_by == "price_high":
            sorted_models = sorted(valid_models, key=lambda x: x["avg_cost"], reverse=True)
        elif sort_by == "context_length":
            sorted_models = sorted(
                valid_models, key=lambda x: x["context_length"], reverse=True
            )
        elif sort_by == "newest":
            sorted_models = sorted(valid_models, key=lambda x: x["created"], reverse=True)
        else:
            sorted_models = valid_models

        return sorted_models[:top_n]

    except Exception as err:  # noqa: BLE001
        print(f"Error fetching models: {err}")
        # Fallback models
        return [
            {
                "id": "openai/gpt-4o",
                "name": "GPT-4o",
                "description": "OpenAI GPT-4o",
                "context_length": 128000,
                "pricing": {},
                "avg_cost": 0,
            },
            {
                "id": "anthropic/claude-3.5-sonnet",
                "name": "Claude 3.5 Sonnet",
                "description": "Anthropic Claude 3.5 Sonnet",
                "context_length": 200000,
                "pricing": {},
                "avg_cost": 0,
            },
            {
                "id": "google/gemini-pro-1.5",
                "name": "Gemini Pro 1.5",
                "description": "Google Gemini Pro 1.5",
                "context_length": 1_000_000,
                "pricing": {},
                "avg_cost": 0,
            },
            {
                "id": "meta-llama/llama-3.1-405b-instruct",
                "name": "Llama 3.1 405B",
                "description": "Meta Llama 3.1 405B Instruct",
                "context_length": 128000,
                "pricing": {},
                "avg_cost": 0,
            },
            {
                "id": "mistralai/mistral-large",
                "name": "Mistral Large",
                "description": "Mistral Large",
                "context_length": 128000,
                "pricing": {},
                "avg_cost": 0,
            },
        ][:top_n]
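
`fetch_openrouter_rankings` turns abbreviated counts such as `1.04T tokens` into integers. The same conversion can be checked in isolation, as in this sketch. `parse_token_count` is a hypothetical helper name. It uses `round()` where the notebook uses `int()`, a choice made here to avoid floating-point truncation, and it looks the unit up directly where the notebook uses `.get(unit, 1)`.

```python
import re
from typing import Optional

# Multipliers for the abbreviated units that the scraper's regex accepts
UNIT_MULTIPLIERS = {
    "K": 1_000,
    "M": 1_000_000,
    "B": 1_000_000_000,
    "T": 1_000_000_000_000,
}
TOKEN_RE = re.compile(r"([\d.]+)([KMBT])\s*tokens", re.IGNORECASE)


def parse_token_count(text: str) -> Optional[int]:
    """Convert text like '801B tokens' to an integer, or return None if it does not match."""
    match = TOKEN_RE.search(text)
    if match is None:
        return None
    try:
        value = float(match.group(1))
    except ValueError:  # malformed number such as "1.2.3"
        return None
    return round(value * UNIT_MULTIPLIERS[match.group(2).upper()])


print(parse_token_count("1.04T tokens"))  # 1040000000000
print(parse_token_count("801B tokens"))   # 801000000000
print(parse_token_count("no count here")) # None
```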

In [3]:
# Test: Fetch and display the current rankings
rankings_df = fetch_openrouter_rankings()

print(f"Successfully fetched {len(rankings_df)} ranked models\n")
print("Top 10 Models by Usage on OpenRouter:")
print("=" * 80)

display_df = rankings_df.head(10).copy()

def format_tokens(tokens: int | None) -> str:
    if pd.isna(tokens):  # covers None and the NaN pandas uses for missing numbers
        return "N/A"
    if tokens >= 1_000_000_000_000:
        return f"{tokens / 1_000_000_000_000:.2f}T"
    if tokens >= 1_000_000_000:
        return f"{tokens / 1_000_000_000:.1f}B"
    if tokens >= 1_000_000:
        return f"{tokens / 1_000_000:.1f}M"
    return f"{tokens:,}"

display_df["tokens_formatted"] = display_df["tokens"].apply(format_tokens)
display_df[["rank", "model_name", "model_id", "tokens_formatted", "token_change"]]

Launching browser to fetch rankings...
✓ Successfully extracted 10 models from rankings page
Successfully fetched 10 ranked models

Top 10 Models by Usage on OpenRouter:


| | rank | model_name | model_id | tokens_formatted | token_change |
|---|---|---|---|---|---|
| 0 | 1 | Grok Code Fast 1 | x-ai/grok-code-fast-1 | 1.07T | -4% |
| 1 | 2 | Claude Sonnet 4.5 | anthropic/claude-sonnet-4.5 | 359.0B | 502% |
| 2 | 3 | Gemini 2.5 Flash | google/gemini-2.5-flash | 351.0B | -5% |
| 3 | 4 | Grok 4 Fast (free) | x-ai/grok-4-fast:free | 331.0B | -68% |
| 4 | 5 | Claude Sonnet 4 | anthropic/claude-sonnet-4 | 245.0B | -55% |
| 5 | 6 | DeepSeek V3.1 (free) | deepseek/deepseek-chat-v3.1:free | 226.0B | 2% |
| 6 | 7 | GPT-4.1 Mini | openai/gpt-4.1-mini | 144.0B | 26% |
| 7 | 8 | Gemini 2.5 Pro | google/gemini-2.5-pro | 136.0B | -2% |
| 8 | 9 | Gemini 2.0 Flash | google/gemini-2.0-flash-001 | 135.0B | -16% |
| 9 | 10 | Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 130.0B | 48% |


In [8]:
# Strip the trailing ":free" suffix from OpenRouter model IDs
display_df["model_id"] = display_df["model_id"].str.replace(r":free$", "", regex=True)
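
The `$` anchor in the pattern limits the replacement to a trailing `:free`, so IDs without the suffix pass through unchanged. A small standalone sketch of the same rule using plain `re` (`strip_free_suffix` is a hypothetical helper name). The stripped ID presumably selects the standard variant of the model, which may be billed; treat that as an assumption to verify against your account.

```python
import re

FREE_SUFFIX = re.compile(r":free$")


def strip_free_suffix(model_id: str) -> str:
    """Remove a trailing ':free' marker, leaving other IDs untouched."""
    return FREE_SUFFIX.sub("", model_id)


ids = [
    "x-ai/grok-4-fast:free",
    "anthropic/claude-sonnet-4",
    "deepseek/deepseek-chat-v3.1:free",
]
print([strip_free_suffix(i) for i in ids])
# ['x-ai/grok-4-fast', 'anthropic/claude-sonnet-4', 'deepseek/deepseek-chat-v3.1']
```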

In [9]:
# Fetch top 5 models (sort_by can be: 'price_low', 'price_high', 'context_length', 'newest', 'popularity')
# top_models = fetch_top_openrouter_models(5, sort_by="popularity")

# df_models = pd.DataFrame(top_models)
# print("\nTop 5 Models from OpenRouter (sorted by popularity):")
# print("=" * 80)
# df_models[["name", "id", "context_length", "avg_cost"]].head()
top_models = display_df.head(5)

In [10]:
top_models

| | rank | model_id | model_name | tokens | token_change | tokens_formatted |
|---|---|---|---|---|---|---|
| 0 | 1 | x-ai/grok-code-fast-1 | Grok Code Fast 1 | 1070000000000 | -4% | 1.07T |
| 1 | 2 | anthropic/claude-sonnet-4.5 | Claude Sonnet 4.5 | 359000000000 | 502% | 359.0B |
| 2 | 3 | google/gemini-2.5-flash | Gemini 2.5 Flash | 351000000000 | -5% | 351.0B |
| 3 | 4 | x-ai/grok-4-fast | Grok 4 Fast (free) | 331000000000 | -68% | 331.0B |
| 4 | 5 | anthropic/claude-sonnet-4 | Claude Sonnet 4 | 245000000000 | -55% | 245.0B |


In [11]:
# Set up OpenRouter API client
from openai import OpenAI

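OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
if not OPENROUTER_API_KEY:
    # Fail early with a clear message instead of at the first API call
    raise RuntimeError("Set the OPENROUTER_API_KEY environment variable before running this cell.")
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=OPENROUTER_API_KEY)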
print("OpenRouter client configured!")

OpenRouter client configured!


## Model Comparison

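Now let's compare the selected models in three phases: each model writes one evaluation question, every model answers every question, and every model rates every answer on a 1-10 scale.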
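
Because every model answers every question and rates every answer (its own included), the number of API calls grows cubically with the number of models. A quick estimate, assuming no call fails; `estimate_calls` is a hypothetical helper, not part of the notebook.

```python
def estimate_calls(n_models: int) -> dict[str, int]:
    """Count API calls per phase when all n models take part in every phase."""
    questions = n_models            # one question per model
    answers = questions * n_models  # every model answers every question
    ratings = answers * n_models    # every model rates every answer
    return {
        "questions": questions,
        "answers": answers,
        "ratings": ratings,
        "total": questions + answers + ratings,
    }


print(estimate_calls(5))
# {'questions': 5, 'answers': 25, 'ratings': 125, 'total': 155}
```

With the 5 models used here that is 155 requests, which is worth keeping in mind before raising the model count.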

In [12]:
import re
import time

print(f"Using {len(top_models)} models for generation/answers/ratings.")

Using 5 models for generation/answers/ratings.


In [13]:
def generate_questions(
    models: pd.DataFrame, client: Any
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Generate evaluation questions using the provided models.
    
    Args:
        models: DataFrame with model information (must have 'model_id' and 'model_name' columns)
        client: OpenRouter API client (OpenAI-compatible)
        
    Returns:
        Tuple of (all_questions_df, valid_questions_df)
    """
    question_generation_prompt = (
        "Please craft ONE challenging, original, nuanced question that can effectively "
        "discriminate between language models of varying reasoning depth. The question should: "
        "(1) require multi-step reasoning, (2) avoid simple trivia, (3) be answerable without external "
        "browsing, (4) not be purely opinion, and (5) allow partial credit. Provide only the question text."
    )

    generated_questions: list[dict[str, Any]] = []
    for _, model in models.iterrows():
        mid = model["model_id"]
        mname = model.get("model_name", mid)
        print(f"\n[Generation] {mname} generating a question...")
        try:
            start = time.time()
            completion = client.chat.completions.create(
                model=mid,
                messages=[{"role": "user", "content": question_generation_prompt}],
                max_tokens=180,
                temperature=0.8,
            )
            q_text = completion.choices[0].message.content.strip()
            elapsed = time.time() - start
            generated_questions.append(
                {
                    "question_model_id": mid,
                    "question_model_name": mname,
                    "question": q_text,
                    "gen_time_s": round(elapsed, 2),
                }
            )
            print(
                f"✓ Question from {mname}: {q_text[:110]}{'...' if len(q_text) > 110 else ''}"
            )
        except Exception as err:  # noqa: BLE001
            generated_questions.append(
                {
                    "question_model_id": mid,
                    "question_model_name": mname,
                    "question": f"Error generating question: {err}",
                    "gen_time_s": None,
                }
            )
            print(f"✗ {mname} failed: {err}")

    questions_df = pd.DataFrame(generated_questions)

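    # Filter out errored questions for downstream phases; match the exact error
    # prefix so a genuine question that starts with "Error" is not dropped
    valid_questions_df = questions_df[
        ~questions_df["question"].str.startswith("Error generating question:")
    ].copy()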

    return questions_df, valid_questions_df
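
`generate_questions` only relies on `client.chat.completions.create(...)` returning an object with `.choices[0].message.content`, so it can be exercised offline with a stub. The sketch below is a hypothetical stand-in built with `SimpleNamespace`. `FakeClient`, `FakeCompletions`, and the model IDs are invented for illustration and mimic only the attributes this notebook touches.

```python
from types import SimpleNamespace


class FakeCompletions:
    """Hypothetical stand-in that mimics client.chat.completions.create()."""

    def create(self, *, model: str, messages: list, **kwargs) -> SimpleNamespace:
        if model == "broken/model":
            # Lets callers exercise the error-handling branch
            raise RuntimeError("simulated API failure")
        text = f"  Question written by {model}?  "
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


class FakeClient:
    """Exposes the same attribute path as the real client: client.chat.completions."""

    def __init__(self) -> None:
        self.chat = SimpleNamespace(completions=FakeCompletions())


fake = FakeClient()
reply = fake.chat.completions.create(
    model="demo/model",
    messages=[{"role": "user", "content": "Write a question."}],
    max_tokens=180,
    temperature=0.8,
)
print(reply.choices[0].message.content.strip())  # Question written by demo/model?
```

Passing `FakeClient()` as `client`, together with a small DataFrame that includes a `broken/model` row, should walk through both the success and the failure path without network access.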

In [14]:
questions_df, valid_questions_df = generate_questions(top_models, client)


[Generation] Grok Code Fast 1 generating a question...
✓ Question from Grok Code Fast 1: In a hypothetical society, all inhabitants are either Knights who always tell the truth or Knaves who always l...

[Generation] Claude Sonnet 4.5 generating a question...
✓ Question from Claude Sonnet 4.5: A rectangular swimming pool is being filled by two pipes. Pipe A fills at a constant rate, while Pipe B's flow...

[Generation] Gemini 2.5 Flash generating a question...
✓ Question from Gemini 2.5 Flash: You are presented with a series of nested, opaque containers. Container A contains either 

In [15]:
questions_df

| | question_model_id | question_model_name | question | gen_time_s |
|---|---|---|---|---|
| 0 | x-ai/grok-code-fast-1 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | 5.6 |
| 1 | anthropic/claude-sonnet-4.5 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | 5.26 |
| 2 | google/gemini-2.5-flash | Gemini 2.5 Flash | You are presented with a series of nested, opa... | 1.63 |
| 3 | x-ai/grok-4-fast | Grok 4 Fast (free) | Suppose you have a rectangular garden that mea... | 9.74 |
| 4 | anthropic/claude-sonnet-4 | Claude Sonnet 4 | A rectangular garden has a perimeter of 100 me... | 3.54 |


In [16]:
valid_questions_df

| | question_model_id | question_model_name | question | gen_time_s |
|---|---|---|---|---|
| 0 | x-ai/grok-code-fast-1 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | 5.6 |
| 1 | anthropic/claude-sonnet-4.5 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | 5.26 |
| 2 | google/gemini-2.5-flash | Gemini 2.5 Flash | You are presented with a series of nested, opa... | 1.63 |
| 3 | x-ai/grok-4-fast | Grok 4 Fast (free) | Suppose you have a rectangular garden that mea... | 9.74 |
| 4 | anthropic/claude-sonnet-4 | Claude Sonnet 4 | A rectangular garden has a perimeter of 100 me... | 3.54 |


In [19]:
def answer_questions(
    questions_df: pd.DataFrame, models: pd.DataFrame, client: Any
) -> pd.DataFrame:
    """
    Generate answers to questions using the provided models.
    
    Args:
        questions_df: DataFrame with questions (must have 'question' and 'question_model_name' columns)
        models: DataFrame with model information (must have 'model_id' and 'model_name' columns)
        client: OpenRouter API client (OpenAI-compatible)
        
    Returns:
        DataFrame with answers
    """
    answers: list[dict[str, Any]] = []
    answer_instructions = (
        "You will be given a question designed to evaluate reasoning depth. Provide a thorough, "
        "structured answer. Show reasoning explicitly if helpful, but keep it concise and logical."
    )

    for _, qrow in questions_df.iterrows():
        q_text = qrow["question"]
        origin_model = qrow["question_model_name"]
        for _, model in models.iterrows():
            mid = model["model_id"]
            mname = model.get("model_name", mid)
            print(f"\n[Answer] {mname} answering question from {origin_model}...")
            try:
                start = time.time()
                completion = client.chat.completions.create(
                    model=mid,
                    messages=[
                        {"role": "system", "content": answer_instructions},
                        {"role": "user", "content": q_text},
                    ],
                    max_tokens=500,
                    temperature=0.5,
                    timeout=30.0,  # 30 second timeout
                )
                ans_text = completion.choices[0].message.content.strip()
                elapsed = time.time() - start
                answers.append(
                    {
                        "question_model_name": origin_model,
                        "question": q_text,
                        "answer_model_id": mid,
                        "answer_model_name": mname,
                        "answer": ans_text,
                        "answer_time_s": round(elapsed, 2),
                    }
                )
                print(f"✓ Answer length: {len(ans_text)} chars")
            except Exception as err:  # noqa: BLE001
                answers.append(
                    {
                        "question_model_name": origin_model,
                        "question": q_text,
                        "answer_model_id": mid,
                        "answer_model_name": mname,
                        "answer": f"Error: {err}",
                        "answer_time_s": None,
                    }
                )
                print(f"✗ {mname} failed to answer: {err}")

    answers_df = pd.DataFrame(answers)
    print("\nAnswers DataFrame ready (answers_df)")

    return answers_df

In [20]:
answers_df_full = answer_questions(valid_questions_df, top_models, client)


[Answer] Grok Code Fast 1 answering question from Grok Code Fast 1...
✓ Answer length: 1652 chars

[Answer] Claude Sonnet 4.5 answering question from Grok Code Fast 1...
✓ Answer length: 1580 chars

[Answer] Gemini 2.5 Flash answering question from Grok Code Fast 1...
✓ Answer length: 1605 chars

[Answer] Grok 4 Fast (free) answering question from Grok Code Fast 1...
✓ Answer length: 1906 chars

[Answer] Claude Sonnet 4 answering question from Grok Code Fast 1...
✓ Answer length: 1649 chars

[Ans

In [21]:
answers_df_full

| | question_model_name | question | answer_model_id | answer_model_name | answer | answer_time_s |
|---|---|---|---|---|---|---|
| 0 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | x-ai/grok-code-fast-1 | Grok Code Fast 1 | `### Step-by-Step Reasoning\n\n1. **Establish P...` | 21.44 |
| 1 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | anthropic/claude-sonnet-4.5 | Claude Sonnet 4.5 | `# Solution: Determining Knights and Knaves\n\n...` | 10.02 |
| 2 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | google/gemini-2.5-flash | Gemini 2.5 Flash | `Let's break down this classic logic puzzle ste...` | 2.59 |
| 3 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | x-ai/grok-4-fast | Grok 4 Fast (free) | `### Step-by-Step Reasoning\n\nIn this Knights ...` | 87.78 |
| 4 | Grok Code Fast 1 | In a hypothetical society, all inhabitants are... | anthropic/claude-sonnet-4 | Claude Sonnet 4 | `I'll work through this step-by-step, analyzing...` | 9.39 |
| 5 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | x-ai/grok-code-fast-1 | Grok Code Fast 1 | `### Step 1: Define Variables and Rates\nLet th...` | 17.56 |
| 6 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | anthropic/claude-sonnet-4.5 | Claude Sonnet 4.5 | `I'll solve this step-by-step, tracking the rat...` | 9.45 |
| 7 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | google/gemini-2.5-flash | Gemini 2.5 Flash | `Let V be the total volume of the swimming pool...` | 2.38 |
| 8 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | x-ai/grok-4-fast | Grok 4 Fast (free) | `10.69\n\n### Key Reasoning Steps\n- Pipe A's c...` | 91.58 |
| 9 | Claude Sonnet 4.5 | A rectangular swimming pool is being filled by... | anthropic/claude-sonnet-4 | Claude Sonnet 4 | `I need to find the rates of both pipes and the...` | 9.11 |


In [22]:
def evaluate_answers(
    answers_df: pd.DataFrame, models: pd.DataFrame, client: Any
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Evaluate answers using the provided models as raters.
    
    Args:
        answers_df: DataFrame with answers (must have 'question', 'answer', and 'answer_model_name' columns)
        models: DataFrame with model information (must have 'model_id' and 'model_name' columns)
        client: OpenRouter API client (OpenAI-compatible)
        
    Returns:
        Tuple of (ratings_df, summary_by_model, summary_by_pair)
    """
    ratings: list[dict[str, Any]] = []
    rating_prompt_template = (
        "You are evaluating an answer to a reasoning-focused question. Score 1-10 (10 = outstanding).\n"
        "Criteria (roughly equal weight):\n1. Clarity & organization\n2. Depth & correctness\n"
        "3. Completeness\n4. Insight/originality (if applicable)\nReturn ONLY the integer score."
    )

    for _, arow in answers_df.iterrows():
        ans_text = arow["answer"]
        if ans_text.startswith("Error:"):
            continue
        q_text = arow["question"]
        ans_model = arow["answer_model_name"]
        for _, model in models.iterrows():
            mid = model["model_id"]
            mname = model.get("model_name", mid)
            print(f"\n[Rating] {mname} rating answer from {ans_model}...")
            rating_input = (
                f"Question: {q_text}\nAnswer: {ans_text}\n\n{rating_prompt_template}"
            )
            try:
                completion = client.chat.completions.create(
                    model=mid,
                    messages=[{"role": "user", "content": rating_input}],
                    max_tokens=8,
                    temperature=0.0,
                )
                raw = completion.choices[0].message.content.strip()
                match = re.search(r"\b(10|[1-9])\b", raw)
                score = int(match.group(1)) if match else None
            except Exception as err:  # noqa: BLE001
                raw = f"Error: {err}"
                score = None
            ratings.append(
                {
                    "question": q_text,
                    "answer_model_name": ans_model,
                    "rater_model_id": mid,
                    "rater_model_name": mname,
                    "raw_rating_text": raw,
                    "rating": score,
                }
            )

    ratings_df = pd.DataFrame(ratings)
    print("\nRatings DataFrame ready (ratings_df)")

    # Generate aggregation summaries
    summary_model = pd.DataFrame()
    summary_pair = pd.DataFrame()

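    if not ratings_df.empty:
        # Failed ratings are stored as None; coerce to numeric so the means skip them as NaN
        ratings_df["rating"] = pd.to_numeric(ratings_df["rating"], errors="coerce")
        summary_model = (
            ratings_df.groupby("answer_model_name")["rating"]
            .mean()
            .reset_index()
            .rename(columns={"rating": "avg_rating"})
        )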
        summary_pair = (
            ratings_df.groupby(["answer_model_name", "rater_model_name"])["rating"]
            .mean()
            .reset_index()
            .rename(columns={"rating": "avg_rating"})
        )
        print("\nAverage rating per answer model:")
        display(summary_model.sort_values("avg_rating", ascending=False))
        print("\nCross-model rating matrix (long form):")
        display(summary_pair.head(20))
    else:
        print("No ratings captured.")

    return ratings_df, summary_model, summary_pair
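
The score extraction inside `evaluate_answers` relies on the pattern `\b(10|[1-9])\b`, which reads a reply such as `7.5` as `7`. The sketch below is a stricter variant that rejects decimals and longer numbers. `parse_score` is a hypothetical helper, not part of the notebook, and the stricter behavior is an assumption about what is wanted rather than the notebook's own method.

```python
import re
from typing import Optional

# Hypothetical helper (not in the notebook): extract a whole-number score in 1-10.
# The lookarounds reject digits that belong to a longer number ("100") or to a
# decimal ("7.5"), while still accepting a sentence-final period ("8.").
_SCORE_RE = re.compile(r"(?<!\d)(?<!\d\.)(10|[1-9])(?!\.?\d)")


def parse_score(raw: Optional[str]) -> Optional[int]:
    """Return the first standalone integer 1-10 found in raw, or None."""
    if not raw:
        return None
    match = _SCORE_RE.search(raw)
    return int(match.group(1)) if match else None


for reply in ["8", "Score: 8/10", "The score is 8.", "7.5", "100", ""]:
    print(repr(reply), "->", parse_score(reply))
```

Returning `None` for ambiguous replies keeps them out of the averages instead of silently truncating them; whether that is preferable depends on how often raters answer with decimals.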

In [23]:
ratings_df_full, summary_model_full, summary_pair_full = evaluate_answers(answers_df_full, top_models, client)


[Rating] Grok Code Fast 1 rating answer from Grok Code Fast 1...

[Rating] Claude Sonnet 4.5 rating answer from Grok Code Fast 1...

[Rating] Gemini 2.5 Flash rating answer from Grok Code Fast 1...

[Rating] Grok 4 Fast (free) rating answer from Grok Code Fast 1...

[Rating] Claude Sonnet 4 rating answer from Grok Code Fast 1...

[Rating] Grok Code Fast 1 rating answer from Claude Sonnet 4.5...

[Rating] Claude Sonnet 4.5 rating answer from Claude Sonnet 4.5...

[Rating] Gemini 2.5 Flash rating answer from Claude Sonnet 4.5...

|   | answer_model_name | avg_rating |
|---|---|---|
| 3 | Grok 4 Fast (free) | 8.666667 |
| 4 | Grok Code Fast 1 | 7.208333 |
| 1 | Claude Sonnet 4.5 | 6.92 |
| 0 | Claude Sonnet 4 | 6.4 |
| 2 | Gemini 2.5 Flash | 6.375 |



Cross-model rating matrix (long form):


|   | answer_model_name | rater_model_name | avg_rating |
|---|---|---|---|
| 0 | Claude Sonnet 4 | Claude Sonnet 4 | 6.4 |
| 1 | Claude Sonnet 4 | Claude Sonnet 4.5 | 6.6 |
| 2 | Claude Sonnet 4 | Gemini 2.5 Flash | 6.8 |
| 3 | Claude Sonnet 4 | Grok 4 Fast (free) | 5.6 |
| 4 | Claude Sonnet 4 | Grok Code Fast 1 | 6.6 |
| 5 | Claude Sonnet 4.5 | Claude Sonnet 4 | 7.0 |
| 6 | Claude Sonnet 4.5 | Claude Sonnet 4.5 | 6.8 |
| 7 | Claude Sonnet 4.5 | Gemini 2.5 Flash | 8.0 |
| 8 | Claude Sonnet 4.5 | Grok 4 Fast (free) | 6.4 |
| 9 | Claude Sonnet 4.5 | Grok Code Fast 1 | 6.4 |


In [25]:
summary_pair_full

|   | answer_model_name | rater_model_name | avg_rating |
|---|---|---|---|
| 0 | Claude Sonnet 4 | Claude Sonnet 4 | 6.4 |
| 1 | Claude Sonnet 4 | Claude Sonnet 4.5 | 6.6 |
| 2 | Claude Sonnet 4 | Gemini 2.5 Flash | 6.8 |
| 3 | Claude Sonnet 4 | Grok 4 Fast (free) | 5.6 |
| 4 | Claude Sonnet 4 | Grok Code Fast 1 | 6.6 |
| 5 | Claude Sonnet 4.5 | Claude Sonnet 4 | 7.0 |
| 6 | Claude Sonnet 4.5 | Claude Sonnet 4.5 | 6.8 |
| 7 | Claude Sonnet 4.5 | Gemini 2.5 Flash | 8.0 |
| 8 | Claude Sonnet 4.5 | Grok 4 Fast (free) | 6.4 |
| 9 | Claude Sonnet 4.5 | Grok Code Fast 1 | 6.4 |
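
Because every model also rates its own answers, the long-form matrix can be used to compare a model's self-rating with the average its peers gave it. The sketch below does this in plain Python for the ten rows shown above only (the rest of the matrix is not reproduced here); `pair_avg` and `self_vs_peers` are hypothetical names, not part of the notebook.

```python
from statistics import mean

# (answer_model, rater_model) -> avg_rating, copied from the rows shown above.
pair_avg = {
    ("Claude Sonnet 4", "Claude Sonnet 4"): 6.4,
    ("Claude Sonnet 4", "Claude Sonnet 4.5"): 6.6,
    ("Claude Sonnet 4", "Gemini 2.5 Flash"): 6.8,
    ("Claude Sonnet 4", "Grok 4 Fast (free)"): 5.6,
    ("Claude Sonnet 4", "Grok Code Fast 1"): 6.6,
    ("Claude Sonnet 4.5", "Claude Sonnet 4"): 7.0,
    ("Claude Sonnet 4.5", "Claude Sonnet 4.5"): 6.8,
    ("Claude Sonnet 4.5", "Gemini 2.5 Flash"): 8.0,
    ("Claude Sonnet 4.5", "Grok 4 Fast (free)"): 6.4,
    ("Claude Sonnet 4.5", "Grok Code Fast 1"): 6.4,
}


def self_vs_peers(pairs):
    """Return {answer_model: (self_rating, mean_peer_rating)}."""
    result = {}
    for answerer in sorted({a for a, _ in pairs}):
        self_rating = pairs.get((answerer, answerer))
        peer_ratings = [
            value
            for (a, rater), value in pairs.items()
            if a == answerer and rater != answerer
        ]
        result[answerer] = (self_rating, mean(peer_ratings) if peer_ratings else None)
    return result


for model, (own, peers) in self_vs_peers(pair_avg).items():
    print(f"{model}: self={own:.2f}, peers={peers:.2f}")
```

On these rows, Claude Sonnet 4 gives itself the same score its peers average (6.40), and Claude Sonnet 4.5 rates itself slightly below its peer average (6.80 versus 6.95). With only a handful of questions per model, such gaps are indicative at best.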


In [27]:
# Extract data for README
print("## Generated Questions\n")
# Number by position: the index of a filtered DataFrame may have gaps
for num, (_, row) in enumerate(valid_questions_df.iterrows(), start=1):
    print(f"### {num}. {row['question_model_name']}")
    print(f"**Question:** {row['question']}\n")
    print(f"**Generation Time:** {row['gen_time_s']}s\n")

print("\n## Model Performance Summary\n")
print("### Overall Average Ratings (10-point scale)\n")
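ranked = summary_model_full.sort_values('avg_rating', ascending=False)
for rank, (_, row) in enumerate(ranked.iterrows(), start=1):
    # Number by sorted position; the DataFrame index still reflects the pre-sort order
    print(f"{rank}. **{row['answer_model_name']}**: {row['avg_rating']:.2f}/10")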

print("\n### Cross-Model Rating Matrix")
print("\nThis shows how each model (rater) rated each model's answers (answerer):\n")
pivot_table = summary_pair_full.pivot_table(
    index='answer_model_name',
    columns='rater_model_name',
    values='avg_rating'
)
print(pivot_table.to_string())

## Generated Questions

### 1. Grok Code Fast 1
**Question:** In a hypothetical society, all inhabitants are either Knights who always tell the truth or Knaves who always lie. You encounter three people: A, B, and C. A says, "We are all Knaves." B says, "Exactly one of us is a Knight." C says, "B and I are different types." Determine the type of each person (Knight or Knave) and provide a step-by-step explanation of your reasoning, including why any assumptions or contradictions arise.

**Generation Time:** 5.6s

### 2. Claude Sonnet 4.5
**Question:** A rectangular swimming pool is being filled by two pipes. Pipe A fills at a constant rate, while Pipe B's flow rate decreases by 10% each hour (compared to its previous hour's rate). When both pipes run together from the start, the pool fills in exactly 6 hours. When only Pipe A runs, it takes 10 hours to fill the pool. 

If the pool is empty and you run only Pipe B for 3 hours, then turn it off and finish filling the pool with only Pipe 

## Run Full Evaluation Pipeline

Execute the complete pipeline: question generation → answering → rating
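
Each stage fans out over every model, so the number of API calls grows quickly. The sketch below estimates an upper bound, assuming each model contributes one question (the generated-questions output suggests this, but the generation code is not shown here), every model answers every question, and every model rates every answer. `pipeline_call_counts` is a hypothetical helper, not part of the notebook.

```python
def pipeline_call_counts(n_models: int, questions_per_model: int = 1) -> dict[str, int]:
    """Upper bound on API calls per stage, assuming no call fails or is skipped."""
    n_questions = n_models * questions_per_model  # one generation call per question
    n_answers = n_questions * n_models            # every model answers every question
    n_ratings = n_answers * n_models              # every model rates every answer
    return {
        "questions": n_questions,
        "answers": n_answers,
        "ratings": n_ratings,
        "total": n_questions + n_answers + n_ratings,
    }


print(pipeline_call_counts(5))
```

For the five models used here this gives 5 question calls, 25 answer calls, and 125 rating calls, 155 in total. Actual counts can be lower, since failed questions are filtered out and answers beginning with `Error:` are not rated.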

In [None]:
# Run the complete evaluation pipeline
print(f"Starting evaluation pipeline with {len(top_models)} models...\n")

# Step 1: Generate questions
questions_df_full, valid_questions_df_full = generate_questions(top_models, client)

# Step 2: Generate answers
answers_df_full = answer_questions(valid_questions_df_full, top_models, client)

# Step 3: Evaluate answers
ratings_df_full, summary_model_full, summary_pair_full = evaluate_answers(answers_df_full, top_models, client)

print("\n" + "="*80)
print("PIPELINE COMPLETE")
print("="*80)
print("\nGenerated artifacts:")
print("  - questions_df_full: All generated questions")
print("  - valid_questions_df_full: Successfully generated questions only")
print("  - answers_df_full: All model answers to all questions")
print("  - ratings_df_full: All ratings from all rater models")
print("  - summary_model_full: Average rating per model")
print("  - summary_pair_full: Cross-model rating matrix")