feat: oracle-run workflow for benchmark case validation #1136

@christso

Description

Objective

Add an optional oracle-run workflow that applies a reference (gold) solution inside the same eval environment, then runs the normal verifier. This validates that benchmark cases are solvable and that test harnesses are sound.

Problem

When authoring benchmark cases, a common failure mode is:

  1. Writing a task prompt and test script
  2. Running the agent against it
  3. The agent fails — but is it because the agent is bad, or because the test script is broken?

Without an oracle solution, you can't distinguish "agent failed the task" from "the benchmark case is broken." This is especially important for large benchmark suites where manual validation of every case is impractical.

Design

Oracle solution asset

Add an optional oracle/ directory to workspace templates:

workspace/
  prompt.md
  setup.sh
  tests/
    verify.sh
  oracle/
    solve.sh       # reference solution
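A reference solution is just a deterministic script that performs the task. A minimal sketch of what `oracle/solve.sh` might contain for a null-check case (the file name, fixture, and patch below are illustrative, not part of the proposal; a real solve script would edit files the workspace template provides):

```shell
#!/usr/bin/env sh
set -eu
# Demo fixture standing in for a file the workspace template would provide.
printf 'return token.value.trim();\n' > parser.ts
# The "solution": guard the dereference with optional chaining.
sed -i.bak 's/token\.value/token?.value/g' parser.ts
rm -f parser.ts.bak
```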

Oracle-run command

# Run the oracle solution instead of the agent, then execute the verifier
agentv eval my-benchmark.eval.yaml --oracle

# Oracle-validate specific test cases
agentv eval my-benchmark.eval.yaml --oracle --test-id fix-null-check

Behavior

  1. Skip the agent target entirely
  2. Execute oracle/solve.sh in the eval workspace
  3. Run the normal verifier (code-grader, assertions, etc.)
  4. Report pass/fail — a failing oracle means the benchmark case or its verifier is broken
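The four steps above amount to a small wrapper around the existing run flow. A rough sketch, assuming a POSIX shell and the workspace layout shown earlier (`oracle_run` is a hypothetical helper, not part of agentv):

```shell
# Hypothetical helper mirroring the oracle-run behavior described above.
oracle_run() {
  workspace=$1
  (
    set -e
    cd "$workspace"
    # Steps 1-2: skip the agent target; apply the reference solution instead.
    timeout 60 ./oracle/solve.sh
    # Step 3: run the normal verifier against the solved workspace.
    ./tests/verify.sh
  )
  # Step 4: a non-zero exit status means the case or its verifier is broken.
}
```

If `oracle_run` fails on a case believed to be solvable, the verifier or the environment is the likely culprit rather than the task itself.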

EVAL.yaml integration

workspace: ./cases/fix-null-check/

oracle:
  command: ./oracle/solve.sh    # reference solution
  timeout: 60

tests:
  - id: fix-null-check
    input: "Fix the null pointer exception in parser.ts"
    execution:
      evaluators:
        - type: code-grader
          command: ./tests/verify.sh
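The code-grader command referenced above is an ordinary script whose exit status decides pass/fail, which is why the same `tests/verify.sh` can grade both agent runs and oracle runs. A minimal sketch (the checked file and pattern are illustrative):

```shell
#!/usr/bin/env sh
set -eu
# Demo fixture standing in for a workspace file after the fix was applied.
printf 'return token?.value.trim();\n' > parser.ts
# Pass only if the null guard is present; the exit status is the verdict.
grep -q 'token?.value' parser.ts
echo "verify: PASS"
```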

Use cases

  1. Benchmark authoring QA — validate all cases are solvable before publishing
  2. Test harness debugging — when an agent fails, run oracle to check if the verifier works
  3. Regression detection — if oracle starts failing, the benchmark env changed, not the agent
  4. Benchmark suite CI — run oracle on all cases as a pre-merge check for benchmark PRs
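Use case 4 could be wired up as a pre-merge job. A sketch assuming GitHub Actions and that `agentv` is already on the PATH (the workflow name, trigger paths, and setup are assumptions, not part of the proposal):

```yaml
# Hypothetical pre-merge check: oracle-validate the benchmark before merging.
name: benchmark-oracle-check
on:
  pull_request:
    paths: ["cases/**"]
jobs:
  oracle:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Oracle-run all benchmark cases
        run: agentv eval my-benchmark.eval.yaml --oracle
```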

Non-goals

  • Not requiring oracle for all evals — this is opt-in for benchmark authoring
  • Not comparing agent output to oracle output — oracle validates the harness, not the agent

Acceptance signals

  • --oracle flag skips agent and runs oracle solution
  • Oracle pass = benchmark case is valid
  • Oracle fail = benchmark case or verifier is broken
  • Works with existing workspace templates and code-grader evaluators
