Objective
Support `tests: ./cases/`, where a directory path auto-discovers test cases from subdirectories, each containing a `case.yaml` with standard test case fields.
Problem
Benchmark authoring with many cases is verbose. Today you either:
- List every case manually in EVAL.yaml or cases.yaml (50 cases = 50 entries)
- Write a script to generate cases.yaml from a directory (bad DX — codegen before every eval)
Design
Convention
```
my-benchmark/
  benchmark.eval.yaml     # shared: targets, workspace, hooks
  cases/                  # auto-discovered
    fix-null-check/
      case.yaml           # per-case: input, criteria, assertions
      workspace/          # optional per-case workspace override
    add-auth/
      case.yaml
    refactor-parser/
      case.yaml
      workspace/
```
case.yaml
Standard test case schema — each case declares its own input and assertions:
```yaml
input: "Fix the null pointer in parser.ts"
criteria: Agent should identify root cause and fix without breaking existing tests.
assertions:
  - type: code-grader
    command: ["bun", "test", "tests/parser.test.ts"]
  - Agent should not break existing tests
  - type: execution-metrics
    max_tool_calls: 20
metadata:
  difficulty: medium
  category: debugging
```
Per-case assertions are explicit and declarative — no shared grader with implicit coupling to workspace contents. Each case says exactly what it checks. This is cleaner than pipeline-based approaches (like Margin's test.sh → config.json → run_script.sh → parser.py → evaluate.py indirection chain) because the full verification contract is readable in one file.
EVAL.yaml
Only truly shared config lives at the suite level:
```yaml
workspace: ./shared-workspace/

execution:
  targets:
    - name: claude-baseline
      use_target: ${{ AGENT_TARGET }}
    - name: claude-superpowers
      use_target: ${{ AGENT_TARGET }}
      hooks:
        before_each:
          command: ["bash", "./scripts/setup-variant.sh", "superpowers"]

# Suite-level assertions apply to ALL discovered cases (in addition to per-case assertions)
assertions:
  - Response does not include harmful content

tests: ./cases/  # directory path → auto-discover
```
What goes where
| Concern | Where | Why |
| --- | --- | --- |
| Targets and target hooks | EVAL.yaml | Shared across all cases |
| Default workspace template | EVAL.yaml | Shared base environment |
| Suite-level assertions (apply to all cases) | EVAL.yaml | Shared quality gates |
| Task prompt (`input:`) | case.yaml | Different per case |
| Per-case assertions and graders | case.yaml | Each case declares its own verification contract |
| Per-case workspace overrides | case directory `workspace/` | Different repo state per case |
| Case metadata | case.yaml | Per-case difficulty, category, tags |
Discovery rules
- Each immediate subdirectory of `./cases/` is a test case
- Subdirectory name becomes the `id` (unless `id:` is specified in `case.yaml`)
- `case.yaml` is required; directories without it are skipped with a warning
- `case.yaml` uses the existing test case schema (no new fields)
- Suite-level `assertions:` from EVAL.yaml are merged with per-case `assertions:`
- Subdirectories are sorted alphabetically for deterministic ordering
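A minimal sketch of these discovery rules, assuming a Node/Bun-style runtime; `CaseRef` and `discoverCases` are illustrative names, not the real loader API:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical shape for a discovered case reference.
interface CaseRef {
  id: string;       // defaults to the subdirectory name
  caseFile: string; // path to the case's case.yaml
}

function discoverCases(casesDir: string): CaseRef[] {
  // Immediate subdirectories only, sorted for deterministic ordering.
  const subdirs = fs
    .readdirSync(casesDir, { withFileTypes: true })
    .filter((e) => e.isDirectory())
    .map((e) => e.name)
    .sort();

  const refs: CaseRef[] = [];
  for (const name of subdirs) {
    const caseFile = path.join(casesDir, name, "case.yaml");
    if (!fs.existsSync(caseFile)) {
      // Directories without case.yaml are skipped with a warning.
      console.warn(`skipping ${name}: no case.yaml`);
      continue;
    }
    refs.push({ id: name, caseFile });
  }
  return refs;
}
```

The sort happens on directory names before any filtering, so adding or removing an unrelated case never reorders the rest of the suite.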
Core change
The only new behavior is in the test case loader:
- Detect that the `tests:` value is a directory path (not a file)
- Scan immediate subdirectories for `case.yaml`
- Load each `case.yaml` using the existing test case schema
- Set `id` from the directory name if not specified
- Merge suite-level assertions as normal
No new schema fields. No new file formats.
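The id-defaulting and merge steps can be illustrated roughly as follows (`finalizeCase` is a hypothetical helper; the real loader reuses the existing schema types):

```typescript
// A string assertion is a natural-language criterion; an object assertion
// names a grader type. Both shapes appear in the case.yaml example above.
type Assertion = string | { type: string; [key: string]: unknown };

interface LoadedCase {
  id: string;
  input: string;
  assertions: Assertion[];
}

// Hypothetical post-processing step: default the id from the directory name
// and merge suite-level assertions with the case's own.
function finalizeCase(
  dirName: string,
  parsed: { id?: string; input: string; assertions?: Assertion[] },
  suiteAssertions: Assertion[],
): LoadedCase {
  return {
    id: parsed.id ?? dirName, // directory name unless case.yaml sets id:
    input: parsed.input,
    // Assumed order: suite-level first, then per-case; the spec only
    // requires that both sets apply to every discovered case.
    assertions: [...suiteAssertions, ...(parsed.assertions ?? [])],
  };
}
```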
Backward compatibility
- `tests: ./cases.yaml` (file path) continues to work as today
- `tests: ./cases/` (directory path) triggers auto-discovery
- Detection: if the path is a directory, auto-discover; if it is a file, parse it as YAML/JSONL
- Inline tests continue to work unchanged
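The detection rule above comes down to a single stat call; a sketch (function name hypothetical):

```typescript
import * as fs from "node:fs";

// Directory → auto-discover; file → parse as YAML/JSONL, as today.
function resolveTestsMode(testsPath: string): "auto-discover" | "parse-file" {
  return fs.statSync(testsPath).isDirectory() ? "auto-discover" : "parse-file";
}
```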
Non-goals
- Not changing the EVAL.yaml format; auto-discovery is just a new resolution mode for `tests:`
- Not adding new manifest formats or directory conventions beyond `case.yaml`
- Not adding shared grader scripts; per-case verification belongs in `case.yaml` assertions
Related
Acceptance signals
- `tests: ./cases/` discovers subdirectories as test cases
- Each `case.yaml` is loaded using the existing test case schema
- Directory name becomes the `id` when not specified
- Suite-level `assertions:` merge with per-case `assertions:`
- Directories without `case.yaml` are skipped with a warning
- `tests: ./file.yaml` and inline tests continue to work unchanged