This notebook evaluates different LLM-as-a-judge strategies for LM Compass.

Reference material: https://docs.google.com/document/d/1vKkgJj6Tj-gSZ-1LUBvNQQ34gatz0RaRCWAMDegFozU/edit?tab=t.0#heading=h.jo95wu3e9n0z (Prompt-based Rubric Scoring, Multi-Agent Self Reflection, Rationale-Based Self-Critique Loops)

Also see our proposed algorithm for judging: https://docs.google.com/document/d/1oDZiobHY0ze7zyKv1oRim8qLS9VL1oiLWeWElbhV6RI/edit?usp=sharing

The goal is to compare these judging methods against each other and against the baseline of simply using a single model's output.

General prompt -> evaluation flow:
0. Select n candidate models M (n = 2 to 4)
1. Send the initial input query Q to every candidate in M through the OpenRouter API (in parallel, which probably requires an async function)
2. Store all responses R_1..R_n
3. Pick an evaluation method
4. Initialize judge(s) based on the evaluation method
5. Compare the judges' evaluation to a baseline LLM (e.g. Base GPT-4o vs. GPT-4o Judge)

Example of evaluation comparison:
1. User submits query
2. Query gets passed to GPT-4o and DeepSeek (A & B)
3. We pick our proposed algorithm for evaluation (see above)
3.1 Response A gets sent to Judge B. Response B gets sent to Judge A.
3.2 Given a generic judging prompt, each judge assigns a score
3.3 The returned response is the one with the higher score (as long as it passes the threshold; see the linked document above)
4. Return the 'winning' response
5. Find metrics or reasons for effectiveness of this approach
6. Repeat for other methods
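
Steps 3.2 to 4 of the example can be sketched as a small selection function. This is only an illustration under assumptions: `pick_winner`, `SCORE_THRESHOLD`, and the value 50 are hypothetical and not taken from the linked document, and "passes threshold" is read here as score >= threshold.

```python
from typing import Optional

# Hypothetical cutoff; the real threshold is defined in the linked document.
SCORE_THRESHOLD = 50.0

def pick_winner(responses: dict[str, str],
                scores: dict[str, float],
                threshold: float = SCORE_THRESHOLD) -> Optional[tuple[str, str]]:
    """Return (candidate, response) for the top score, or None if it is below the threshold."""
    if not scores:
        return None
    # On a tie, max() keeps the first candidate in insertion order.
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None
    return best, responses[best]

# Made-up responses and scores, for illustration only.
responses = {"A": "Grass is typically green.", "B": "Grass is blue."}
scores = {"A": 92.0, "B": 15.0}
winner = pick_winner(responses, scores)
```

Returning None when no response passes the threshold leaves the fallback behavior (for example, re-querying) up to the caller.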

In [None]:
from dotenv import load_dotenv
from openai import AsyncOpenAI
import os
import asyncio
import json
import pprint

load_dotenv()

In [None]:
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
if not OPENROUTER_API_KEY:
    raise ValueError("OPENROUTER_API_KEY not found in .env file or environment variables.")

In [None]:
client = AsyncOpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=OPENROUTER_API_KEY
)

In [None]:
# ALL AVAILABLE MODELS FOR TESTING
# *to add a model, use its display name from OpenRouter as the key and
#  the model ID used in the API as the value

candidate_models = {
    # Free Models
    "MiniMax: MiniMax M2 (free)"          : "minimax/minimax-m2:free",
    "TNG: DeepSeek R1T2 Chimera (free)"   : "tngtech/deepseek-r1t2-chimera:free",
    "Meta: Llama 3.3 70B Instruct (free)" : "meta-llama/llama-3.3-70b-instruct:free",
    "OpenAI: gpt-oss-20b (free)"          : "openai/gpt-oss-20b:free"
}

In [None]:
# QUERY FUNCTIONS

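async def query_model(model: str, query: str, role: str = "user"):
    """
    Queries a single model and returns (model, content).

    'model' is a display-name key of 'candidate_models'. If the request fails,
    the error message is returned in place of the content.
    """
    try:
        response = await client.chat.completions.create(
            model=candidate_models[model],
            messages=[{"role": role, "content": query}],
            temperature=1
        )
        content = response.choices[0].message.content
        return model, content
    except Exception as e:
        # NOTE: the error text is returned as if it were the answer, so callers
        # cannot tell a failure from a real response without inspecting the string.
        return model, str(e)

async def query_models(models: list[str], queries: list[str], role: str = "user"):
    """
    Queries multiple models concurrently; models[i] receives queries[i].
    """
    if len(models) != len(queries):
        raise ValueError("models and queries must have the same length.")
    coroutines = [query_model(model, query, role=role) for model, query in zip(models, queries)]
    results = await asyncio.gather(*coroutines)
    return results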

In [None]:
# QUERY EACH MODEL FOR ITS ANSWER TO THE USER PROMPT

user_query = "What color is grass?"
models_to_use = ["MiniMax: MiniMax M2 (free)", "Meta: Llama 3.3 70B Instruct (free)", "TNG: DeepSeek R1T2 Chimera (free)"]
result = await query_models(models_to_use, [user_query]*len(models_to_use))
pprint.pprint(result)

In [None]:
# BUILDING THE QUERIES FOR EACH MODEL TO EVALUATE EACH OTHER

def scoring_query(answer: str) -> str:
    """Builds the judging prompt for one candidate answer to 'user_query'."""
    return f"""\
Question: {user_query}
Answer: {answer}

Please rate the answer to the question and give it a score between 0 and 100. End your output with just the score on a new line.\
"""

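# Pair every judge with every answer it did not write itself.
# The two lists are parallel: new_models_to_use[i] judges new_queries_to_use[i].
new_models_to_use = []
new_queries_to_use = []
for model1 in models_to_use:        # model1 acts as the judge
    for model2, answer in result:   # model2 wrote 'answer'
        if model1 != model2:
            new_models_to_use.append(model1)
            new_queries_to_use.append(scoring_query(answer))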
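
The pairing above keeps the judge for each query but not the author of the answer being judged. A minimal sketch of one way to keep both is shown here; `build_judging_pairs` and the shortened prompt text are hypothetical and only illustrate the idea.

```python
def build_judging_pairs(answers: dict[str, str], question: str) -> list[tuple[str, str, str]]:
    """Return (judge, author, prompt) triples in which no model judges its own answer."""
    pairs = []
    for judge in answers:
        for author, answer in answers.items():
            if judge != author:
                # Shortened stand-in for the notebook's scoring prompt.
                prompt = (
                    f"Question: {question}\n"
                    f"Answer: {answer}\n\n"
                    "Rate the answer from 0 to 100. End with just the score on a new line."
                )
                pairs.append((judge, author, prompt))
    return pairs

# Made-up answers, for illustration only.
example_answers = {"model-a": "Green.", "model-b": "Usually green.", "model-c": "Blue."}
pairs = build_judging_pairs(example_answers, "What color is grass?")
```

With n models this produces n * (n - 1) triples, so each answer is scored by every other model exactly once.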

In [None]:
# QUERY EACH MODEL TO EVALUATE EACH OTHER MODEL'S ANSWER

scoring_results = await query_models(new_models_to_use, new_queries_to_use)
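
The scoring prompt asks each judge to end its output with the score on its own line, so the numeric score still has to be extracted from the raw text. The following is a rough sketch of that step, not part of the original notebook: `parse_score` and `average_scores` are hypothetical helpers, and the sample judge outputs are made up.

```python
import re
from collections import defaultdict
from typing import Optional

def parse_score(judge_output: str) -> Optional[float]:
    """Extract a 0-100 score from the last non-empty line of a judge's reply."""
    lines = [line.strip() for line in judge_output.splitlines() if line.strip()]
    if not lines:
        return None
    # Take the first number on the last line, so "85/100" is read as 85.
    match = re.search(r"\d+(?:\.\d+)?", lines[-1])
    if match is None:
        return None
    score = float(match.group())
    return score if 0 <= score <= 100 else None

def average_scores(judgements: list[tuple[str, str]]) -> dict[str, float]:
    """Average the parsed scores per answer author, skipping unparseable replies."""
    collected = defaultdict(list)
    for author, output in judgements:
        score = parse_score(output)
        if score is not None:
            collected[author].append(score)
    return {author: sum(vals) / len(vals) for author, vals in collected.items()}

# Made-up (author, judge_output) pairs, for illustration only.
sample = [
    ("model-a", "Accurate and concise.\n90"),
    ("model-a", "Correct.\n\n80"),
    ("model-b", "Incorrect.\nScore: 20/100"),
    ("model-b", "I cannot rate this."),
]
averages = average_scores(sample)
```

Replies without a usable score are dropped rather than counted as zero, which avoids penalizing an answer for a judge's formatting failure.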