Problem
In the configuration as shipped, the agent under test and the LLM judge end up being the same model — so the judge is grading its own outputs. This is self-preference bias [1] applied directly.
Concrete path:
- src/agent/claude_agent/runner.py:36 and cli.py:16 default the candidate model to litellm_proxy/aws/claude-opus-4-6.
- src/agent/deep_agent/runner.py:39 and cli.py:16 default to the same.
- INSTRUCTIONS.md:244 and docs/evaluation.md:90 use --judge-model litellm_proxy/aws/claude-opus-4-6 as the canonical evaluate example.
A user following the README runs claude-agent (claude-opus-4-6) and evaluates with --judge-model litellm_proxy/aws/claude-opus-4-6 — claude-opus-4-6 judging claude-opus-4-6. No user error required; this is the documented path.
The paper has the same issue at one row of the leaderboard: llama-4-maverick is both a candidate model and the chosen judge model.
Why this is distinct from #281 and #296
Reference
[1] Panickssery, A., Bowman, S. R., Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. arXiv:2404.13076.
Proposed solution
One principle: the judge is held constant across the entire leaderboard, and the judge is not a candidate. Pick a side — the same model cannot be both. Today's defaults make it both.
The concrete changes follow from that:
1. Decide which side claude-opus-4-6 is on. Either:
   - Keep it as the judge → remove it as the default candidate for claude-agent and deep-agent (swap to a different model, e.g. litellm_proxy/azure/gpt-5.4 like openai-agent), and document that claude-opus-4-6 is reserved as the leaderboard judge and is not benchmarked as a candidate, OR
   - Keep it as a candidate → switch the documented judge in INSTRUCTIONS.md:244 and docs/evaluation.md:90 to a model that is not in the candidate pool.
2. Add a guard in the eval CLI. When --judge-model resolves to the same identifier as the trajectory's model field for a row, abort that row with a clear error (not a silent warning). This catches the violation regardless of how the user configured things: config drift, future model swaps, custom runners.
3. Shrink the LLM judge's surface area. Whether the agent called the right tool, in the right order, on the right asset / time window is deterministically checkable against the persisted trajectory and the execution_steps / execution_links ground truth. src/evaluation/scorers/code_based.py currently raises NotImplementedError. Wiring it up lets llm_judge keep only the genuinely subjective parts (reasoning quality, justification), so any residual self-eval bias rides on a much smaller fraction of the score.
(1) and (2) together solve the self-judging problem outright. (3) is independent and compounds — smaller LLM-judge surface means smaller impact from any residual single-model bias, including this one.
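To make (3) concrete, here is a rough sketch of what a deterministic tool-sequence check could look like. The schema is assumed, not taken from the repo: each agent call and each expected step is modeled as a dict with hypothetical "tool" and "args" keys, and score_tool_sequence is an illustrative name. The real execution_steps / execution_links format may differ.

```python
from typing import Any


def score_tool_sequence(
    actual_calls: list[dict[str, Any]],
    expected_steps: list[dict[str, Any]],
) -> float:
    """Return the fraction of expected steps the agent matched, in order.

    Assumed (hypothetical) schema: {"tool": str, "args": dict}.
    A call matches a step when the tool name is equal and every
    expected argument (e.g. asset, time window) has the same value.
    Extra calls between expected steps are tolerated.
    """
    if not expected_steps:
        return 1.0

    matched = 0
    cursor = 0  # index of the first call not yet consumed by a match
    for step in expected_steps:
        expected_args = step.get("args", {})
        for i in range(cursor, len(actual_calls)):
            call = actual_calls[i]
            call_args = call.get("args", {})
            same_tool = call.get("tool") == step["tool"]
            same_args = all(call_args.get(k) == v for k, v in expected_args.items())
            if same_tool and same_args:
                matched += 1
                cursor = i + 1  # later steps must match later calls
                break
    return matched / len(expected_steps)


expected = [
    {"tool": "get_sensor_data", "args": {"asset": "pump-1"}},
    {"tool": "detect_anomaly"},
]
in_order = [
    {"tool": "get_sensor_data", "args": {"asset": "pump-1"}},
    {"tool": "list_assets"},
    {"tool": "detect_anomaly"},
]
wrong_asset = [
    {"tool": "get_sensor_data", "args": {"asset": "pump-2"}},
    {"tool": "detect_anomaly"},
]
print(score_tool_sequence(in_order, expected))     # 1.0
print(score_tool_sequence(wrong_asset, expected))  # 0.5
```

No model is involved in this part of the score, so it cannot carry self-preference bias. The tool names in the example are made up for illustration.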
Happy to send the PR for (2) — the eval-CLI guard is the smallest patch and prevents regression regardless of which way you go on (1).
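For reference, a minimal sketch of what the guard in (2) might look like. Everything here is illustrative: SelfJudgingError, normalize_model_id and check_not_self_judged are hypothetical names, the trajectory is modeled as a plain dict with a "model" key, and comparing only the last path segment of the identifier is an assumption about what "resolves to the same identifier" should mean.

```python
class SelfJudgingError(ValueError):
    """Raised when the judge model also produced the trajectory being scored."""


def normalize_model_id(model_id: str) -> str:
    """Reduce a routed identifier to its bare model name.

    Assumption: the last path segment names the model, so
    "litellm_proxy/aws/claude-opus-4-6" -> "claude-opus-4-6".
    """
    return model_id.strip().lower().rsplit("/", 1)[-1]


def check_not_self_judged(judge_model: str, trajectory: dict) -> None:
    """Abort scoring of one row if judge and candidate are the same model."""
    candidate = trajectory.get("model")
    if candidate is None:
        # Assumption: rows without a recorded model are not checked here;
        # a stricter guard could reject them as unverifiable.
        return
    if normalize_model_id(judge_model) == normalize_model_id(candidate):
        raise SelfJudgingError(
            f"judge model {judge_model!r} is also the candidate model "
            f"{candidate!r} for this trajectory; pick a different --judge-model"
        )


# Passes silently: judge and candidate differ.
check_not_self_judged(
    "litellm_proxy/aws/claude-opus-4-6",
    {"model": "litellm_proxy/azure/gpt-5.4"},
)

# Raises: this is the currently documented path.
try:
    check_not_self_judged(
        "litellm_proxy/aws/claude-opus-4-6",
        {"model": "litellm_proxy/aws/claude-opus-4-6"},
    )
except SelfJudgingError as exc:
    print(f"aborted row: {exc}")
```

Normalizing before comparing means the same model routed through two providers still counts as self-judging. Whether that is the right call is open for discussion in the PR.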