feat: auto-discover test cases from directory structure #1141

@christso

Objective

Support tests: ./cases/, where a directory path given for tests: triggers auto-discovery of test cases from its subdirectories. Each subdirectory contains a case.yaml with the standard test case fields.

Problem

Benchmark authoring with many cases is verbose. Today you either:

  1. List every case manually in EVAL.yaml or cases.yaml (50 cases = 50 entries)
  2. Write a script to generate cases.yaml from a directory (bad DX — codegen before every eval)

Design

Convention

my-benchmark/
  benchmark.eval.yaml       # shared: targets, workspace, hooks
  cases/                    # auto-discovered
    fix-null-check/
      case.yaml             # per-case: input, criteria, assertions
      workspace/            # optional per-case workspace override
    add-auth/
      case.yaml
    refactor-parser/
      case.yaml
      workspace/

case.yaml

Standard test case schema — each case declares its own input and assertions:

input: "Fix the null pointer in parser.ts"
criteria: Agent should identify root cause and fix without breaking existing tests.
assertions:
  - type: code-grader
    command: ["bun", "test", "tests/parser.test.ts"]
  - Agent should not break existing tests
  - type: execution-metrics
    max_tool_calls: 20
metadata:
  difficulty: medium
  category: debugging

Per-case assertions are explicit and declarative: there is no shared grader with implicit coupling to workspace contents. Each case says exactly what it checks. This is cleaner than pipeline-based approaches (like Margin's test.sh → config.json → run_script.sh → parser.py → evaluate.py indirection chain) because the full verification contract is readable in one file.

EVAL.yaml

Only truly shared config lives at the suite level:

workspace: ./shared-workspace/

execution:
  targets:
    - name: claude-baseline
      use_target: ${{ AGENT_TARGET }}
    - name: claude-superpowers
      use_target: ${{ AGENT_TARGET }}
      hooks:
        before_each:
          command: ["bash", "./scripts/setup-variant.sh", "superpowers"]

# Suite-level assertions apply to ALL discovered cases (in addition to per-case assertions)
assertions:
  - Response does not include harmful content

tests: ./cases/                   # directory path → auto-discover

What goes where

| Concern | Where | Why |
| --- | --- | --- |
| Targets and target hooks | EVAL.yaml | Shared across all cases |
| Default workspace template | EVAL.yaml | Shared base environment |
| Suite-level assertions (apply to all cases) | EVAL.yaml | Shared quality gates |
| Task prompt (input:) | case.yaml | Different per case |
| Per-case assertions and graders | case.yaml | Each case declares its own verification contract |
| Per-case workspace overrides | case directory workspace/ | Different repo state per case |
| Case metadata | case.yaml | Per-case difficulty, category, tags |

Discovery rules

  1. Each immediate subdirectory of ./cases/ is a test case
  2. Subdirectory name becomes the id (unless id: is specified in case.yaml)
  3. case.yaml is required — directories without it are skipped with a warning
  4. case.yaml uses the existing test case schema (no new fields)
  5. Suite-level assertions from EVAL.yaml are merged with per-case assertions
  6. Subdirectories are sorted alphabetically for deterministic ordering
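
The discovery rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the loader's actual code: the names discoverCases and DiscoveredCase are hypothetical, the warning text is made up, and "alphabetical" is approximated with JavaScript's default string sort. YAML parsing, an explicit id: override (rule 2), and assertion merging (rule 5) are left out.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical result shape; not part of the proposal's schema.
interface DiscoveredCase {
  id: string; // directory name, used as the default id (rule 2)
  caseFile: string; // path to the case.yaml inside that directory
}

function discoverCases(
  casesDir: string,
  warn: (msg: string) => void = console.warn,
): DiscoveredCase[] {
  // Rule 1: only immediate subdirectories count; plain files are ignored.
  const dirNames = fs
    .readdirSync(casesDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name)
    .sort(); // Rule 6: deterministic order (default string sort, an assumption)

  const cases: DiscoveredCase[] = [];
  for (const dirName of dirNames) {
    const caseFile = path.join(casesDir, dirName, "case.yaml");
    if (!fs.existsSync(caseFile)) {
      // Rule 3: a directory without case.yaml is skipped with a warning.
      warn(`Skipping ${dirName}: no case.yaml found`);
      continue;
    }
    cases.push({ id: dirName, caseFile });
  }
  return cases;
}

// Usage: build a throwaway layout and list what would be discovered.
const tmpRoot = fs.mkdtempSync(path.join(os.tmpdir(), "cases-"));
for (const dirName of ["refactor-parser", "add-auth", "notes"]) {
  fs.mkdirSync(path.join(tmpRoot, dirName));
}
fs.writeFileSync(path.join(tmpRoot, "add-auth", "case.yaml"), 'input: "Add auth"\n');
fs.writeFileSync(path.join(tmpRoot, "refactor-parser", "case.yaml"), 'input: "Refactor"\n');

const warnings: string[] = [];
const discovered = discoverCases(tmpRoot, (msg) => warnings.push(msg));
console.log(discovered.map((c) => c.id).join(", ")); // add-auth, refactor-parser
```

The notes/ directory has no case.yaml, so it produces one warning and no test case.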

Core change

The only new behavior is in the test case loader:

  1. Detect that tests: value is a directory path (not a file)
  2. Scan immediate subdirectories for case.yaml
  3. Load each case.yaml using the existing test case schema
  4. Set id from directory name if not specified
  5. Merge suite-level assertions as normal

No new schema fields. No new file formats.
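
Steps 4 and 5 amount to a small pure function once a case.yaml has been parsed. The sketch below assumes hypothetical names (Assertion, RawCase, resolveCase) and types inferred only from the case.yaml example in this issue. It also assumes suite-level assertions come before per-case ones; the proposal only says they are merged, so the real order may differ.

```typescript
// Assertion entries as they appear in the case.yaml example:
// either a plain-language string or an object with a type field.
type Assertion = string | { type: string; [key: string]: unknown };

// Hypothetical subset of the test case fields, enough for this sketch.
interface RawCase {
  id?: string;
  input: string;
  assertions?: Assertion[];
}

interface ResolvedCase extends RawCase {
  id: string;
  assertions: Assertion[];
}

function resolveCase(
  dirName: string,
  raw: RawCase,
  suiteAssertions: Assertion[],
): ResolvedCase {
  return {
    ...raw,
    // Step 4: fall back to the directory name when case.yaml has no id.
    id: raw.id ?? dirName,
    // Step 5: merge suite-level and per-case assertions (order is an assumption).
    assertions: [...suiteAssertions, ...(raw.assertions ?? [])],
  };
}

// Usage with values taken from the examples in this issue.
const suiteLevel: Assertion[] = ["Response does not include harmful content"];
const resolved = resolveCase(
  "fix-null-check",
  {
    input: "Fix the null pointer in parser.ts",
    assertions: [
      { type: "code-grader", command: ["bun", "test", "tests/parser.test.ts"] },
      "Agent should not break existing tests",
    ],
  },
  suiteLevel,
);
console.log(resolved.id, resolved.assertions.length); // fix-null-check 3
```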

Backward compatibility

  • tests: ./cases.yaml (file path) continues to work as today
  • tests: ./cases/ (directory path) triggers auto-discovery
  • Detection: if the path is a directory, auto-discover; if it's a file, parse as YAML/JSONL
  • Inline tests continue to work unchanged
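
The detection rule can be illustrated with a small classifier. This is a sketch under assumptions: classifyTests and TestsSource are hypothetical names, inline tests are assumed to arrive as an already-parsed list, and resolving the path relative to the EVAL.yaml location is not handled here.

```typescript
import * as fs from "node:fs";

// Hypothetical tagged union describing how the tests: value should be loaded.
type TestsSource =
  | { kind: "inline"; tests: unknown[] }
  | { kind: "directory"; path: string }
  | { kind: "file"; path: string };

function classifyTests(value: unknown): TestsSource {
  // Inline tests: the value is already a list of cases.
  if (Array.isArray(value)) {
    return { kind: "inline", tests: value };
  }
  if (typeof value !== "string") {
    throw new Error("tests: must be a path or an inline list");
  }
  // statSync throws if the path does not exist, which surfaces typos early.
  const stat = fs.statSync(value);
  return stat.isDirectory()
    ? { kind: "directory", path: value } // auto-discover
    : { kind: "file", path: value }; // parse as YAML/JSONL, as today
}

console.log(classifyTests(".").kind); // directory
console.log(classifyTests([]).kind); // inline
```

Because the decision is based on what exists on disk rather than on the spelling of the path, a trailing slash is not required for a directory to be detected.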

Non-goals

  • Not changing the EVAL.yaml format — auto-discover is just a new resolution mode for tests:
  • Not adding new manifest formats or directory conventions beyond case.yaml
  • Not adding shared grader scripts — per-case verification belongs in case.yaml assertions

Acceptance signals

  • tests: ./cases/ discovers subdirectories as test cases
  • case.yaml is loaded using existing test case schema
  • Directory name is used as id when not specified
  • Suite-level assertions merge with per-case assertions
  • Directories without case.yaml are skipped with a warning
  • Existing tests: ./file.yaml and inline tests continue to work unchanged
  • At least one showcase example demonstrating the pattern
