feat: oracle-run workflow for benchmark case validation #1136

@christso

Description

Objective

Add an optional oracle-run workflow that applies a reference (gold) solution inside the same eval environment, then runs the normal verifier. This validates that benchmark cases are solvable and that test harnesses are sound.

Problem

When authoring benchmark cases, a common failure mode is:

  1. Writing a task prompt and test script
  2. Running the agent against it
  3. The agent fails — but is it because the agent is bad, or because the test script is broken?

Without an oracle solution, you can't distinguish "agent failed the task" from "the benchmark case is broken." This is especially important for large benchmark suites where manual validation of every case is impractical.

Design

Oracle solution asset

Add an optional oracle/ directory to workspace templates:

workspace/
  prompt.md
  setup.sh
  tests/
    verify.sh
  oracle/
    solve.sh       # reference solution
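A reference solution is just a deterministic script that performs the task. A minimal sketch of what `oracle/solve.sh` might contain for a null-check case (the file name, fixture, and patch below are illustrative, not part of the proposal; a real solve script would edit files the workspace template provides):

```shell
#!/usr/bin/env sh
set -eu
# Demo fixture standing in for a file the workspace template would provide.
printf 'return token.value.trim();\n' > parser.ts
# The "solution": guard the dereference with optional chaining.
sed -i.bak 's/token\.value/token?.value/g' parser.ts
rm -f parser.ts.bak
```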

Oracle-run command

# Run the oracle solution instead of the agent, then execute the verifier
agentv eval my-benchmark.eval.yaml --oracle

# Oracle-validate specific test cases
agentv eval my-benchmark.eval.yaml --oracle --test-id fix-null-check

Behavior

  1. Skip the agent target entirely
  2. Execute oracle/solve.sh in the eval workspace
  3. Run the normal verifier (code-grader, assertions, etc.)
  4. Report pass/fail — a failing oracle means the benchmark case or its verifier is broken
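The four steps above amount to a small wrapper around the existing run flow. A rough sketch, assuming a POSIX shell and the workspace layout shown earlier (`oracle_run` is a hypothetical helper, not part of agentv):

```shell
# Hypothetical helper mirroring the oracle-run behavior described above.
oracle_run() {
  workspace=$1
  (
    set -e
    cd "$workspace"
    # Steps 1-2: skip the agent target; apply the reference solution instead.
    timeout 60 ./oracle/solve.sh
    # Step 3: run the normal verifier against the solved workspace.
    ./tests/verify.sh
  )
  # Step 4: a non-zero exit status means the case or its verifier is broken.
}
```

If `oracle_run` fails on a case believed to be solvable, the verifier or the environment is the likely culprit rather than the task itself.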

EVAL.yaml integration

workspace: ./cases/fix-null-check/

oracle:
  command: ./oracle/solve.sh    # reference solution
  timeout: 60

tests:
  - id: fix-null-check
    input: "Fix the null pointer exception in parser.ts"
    execution:
      evaluators:
        - type: code-grader
          command: ./tests/verify.sh
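The code-grader command referenced above is an ordinary script whose exit status decides pass/fail, which is why the same `tests/verify.sh` can grade both agent runs and oracle runs. A minimal sketch (the checked file and pattern are illustrative):

```shell
#!/usr/bin/env sh
set -eu
# Demo fixture standing in for a workspace file after the fix was applied.
printf 'return token?.value.trim();\n' > parser.ts
# Pass only if the null guard is present; the exit status is the verdict.
grep -q 'token?.value' parser.ts
echo "verify: PASS"
```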

Use cases

  1. Benchmark authoring QA — validate all cases are solvable before publishing
  2. Test harness debugging — when an agent fails, run oracle to check if the verifier works
  3. Regression detection — if oracle starts failing, the benchmark env changed, not the agent
  4. Benchmark suite CI — run oracle on all cases as a pre-merge check for benchmark PRs
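Use case 4 could be wired up as a pre-merge job. A sketch assuming GitHub Actions and that `agentv` is already on the PATH (the workflow name, trigger paths, and setup are assumptions, not part of the proposal):

```yaml
# Hypothetical pre-merge check: oracle-validate the benchmark before merging.
name: benchmark-oracle-check
on:
  pull_request:
    paths: ["cases/**"]
jobs:
  oracle:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Oracle-run all benchmark cases
        run: agentv eval my-benchmark.eval.yaml --oracle
```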

Non-goals

  • Not requiring oracle for all evals — this is opt-in for benchmark authoring
  • Not comparing agent output to oracle output — oracle validates the harness, not the agent

Acceptance signals

  • --oracle flag skips agent and runs oracle solution
  • Oracle pass = benchmark case is valid
  • Oracle fail = benchmark case or verifier is broken
  • Works with existing workspace templates and code-grader evaluators
