## Objective
Add an optional oracle-run workflow that applies a reference (gold) solution inside the same eval environment, then runs the normal verifier. This validates that benchmark cases are solvable and that test harnesses are sound.
## Problem
When authoring benchmark cases, a common failure mode is:
- Writing a task prompt and test script
- Running the agent against it
- The agent fails — but is it because the agent is bad, or because the test script is broken?
Without an oracle solution, you can't distinguish "agent failed the task" from "the benchmark case is broken." This is especially important for large benchmark suites where manual validation of every case is impractical.
## Design

### Oracle solution asset

Add an optional `oracle/` directory to workspace templates:

```
workspace/
  prompt.md
  setup.sh
  tests/
    verify.sh
  oracle/
    solve.sh        # reference solution
```
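A reference solution is just a script that applies the known-good edit directly. Below is a minimal authoring sketch for a hypothetical `fix-null-check` case; the `src/parser.ts` path and the `sed` edit are illustrative placeholders, not part of this spec:

```shell
#!/usr/bin/env bash
# Authoring sketch: seed a workspace template with a reference solution.
# The parser.ts path and the sed edit are illustrative placeholders.
set -euo pipefail

mkdir -p workspace/oracle
cat > workspace/oracle/solve.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Apply the known-good fix directly; no agent involved.
sed -i 's/node\.value\.trim()/node.value?.trim()/g' src/parser.ts
EOF
chmod +x workspace/oracle/solve.sh
```

Keeping the solution this small is deliberate: the oracle only needs to reach a passing state, not to demonstrate good agent behavior.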
### Oracle-run command

```sh
# Run the oracle solution instead of the agent, then execute the verifier
agentv eval my-benchmark.eval.yaml --oracle

# Oracle-validate specific test cases
agentv eval my-benchmark.eval.yaml --oracle --test-id fix-null-check
```
### Behavior

- Skip the agent target entirely
- Execute `oracle/solve.sh` in the eval workspace
- Run the normal verifier (code-grader, assertions, etc.)
- Report pass/fail — a failing oracle means the benchmark case is broken
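The steps above can be sketched end to end. This is illustrative only (the real logic would live inside `agentv`), and it assumes the workspace-template layout shown earlier:

```shell
#!/usr/bin/env bash
# Sketch of the oracle-run flow: copy the workspace template, apply the
# reference solution in place of the agent, then run the verifier.
set -euo pipefail

run_oracle_case() {
  local case_dir="$1"
  local work
  work="$(mktemp -d)"

  # 1. Materialize the workspace template, as a normal eval run would.
  cp -R "$case_dir/." "$work"

  # 2. Skip the agent target; apply the reference solution instead.
  (cd "$work" && bash oracle/solve.sh)

  # 3. Run the normal verifier. A non-zero exit means the case is broken.
  (cd "$work" && bash tests/verify.sh)
}
```

Running solve and verify in a fresh copy of the workspace mirrors an ordinary agent run, so a passing oracle implies the harness itself is sound.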
### EVAL.yaml integration

```yaml
workspace: ./cases/fix-null-check/

oracle:
  command: ./oracle/solve.sh   # reference solution
  timeout: 60

tests:
  - id: fix-null-check
    input: "Fix the null pointer exception in parser.ts"
    execution:
      evaluators:
        - type: code-grader
          command: ./tests/verify.sh
```
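For completeness, here is a sketch of a `tests/verify.sh` that the code-grader entry could point at. The exit-code contract (0 = pass, non-zero = fail) is an assumption here, and the grep check is an illustrative stand-in for a real test suite:

```shell
#!/usr/bin/env bash
# Authoring sketch: write a verifier the code-grader evaluator could invoke.
# Assumed contract: exit 0 = pass, non-zero = fail.
set -euo pipefail

mkdir -p workspace/tests
cat > workspace/tests/verify.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Fail while the unguarded dereference from the prompt is still present.
# (A real case would run the project's test suite instead of grep.)
if grep -q 'node\.value\.trim()' src/parser.ts; then
  echo "unguarded dereference still present" >&2
  exit 1
fi
EOF
chmod +x workspace/tests/verify.sh
```

Because the same script grades both the agent and the oracle, a broken verifier shows up immediately as an oracle failure.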
## Use cases
- Benchmark authoring QA — validate all cases are solvable before publishing
- Test harness debugging — when an agent fails, run oracle to check if the verifier works
- Regression detection — if oracle starts failing, the benchmark env changed, not the agent
- Benchmark suite CI — run oracle on all cases as a pre-merge check for benchmark PRs
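The CI use case might look like the sweep below. The `cases/*/` layout and the one-`*.eval.yaml`-per-case convention are assumptions; `agentv eval … --oracle` is the command proposed above:

```shell
#!/usr/bin/env bash
# Sketch: pre-merge oracle sweep over every benchmark case. Any case
# whose oracle fails the verifier fails the whole sweep.
set -euo pipefail

oracle_sweep() {
  local failed=0 case_dir eval_file
  for case_dir in cases/*/; do
    [ -d "$case_dir" ] || continue
    for eval_file in "$case_dir"*.eval.yaml; do
      [ -f "$eval_file" ] || continue
      if ! agentv eval "$eval_file" --oracle; then
        echo "BROKEN CASE: $eval_file" >&2
        failed=1
      fi
    done
  done
  return "$failed"
}
```

Running this on every benchmark PR keeps unsolvable cases and broken verifiers from being merged in the first place.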
## Non-goals
- Not requiring oracle for all evals — this is opt-in for benchmark authoring
- Not comparing agent output to oracle output — oracle validates the harness, not the agent
## Acceptance signals
- `--oracle` flag skips the agent and runs the oracle solution