Objective
Support `tests: ./cases/`, where a directory path auto-discovers test cases from subdirectories, each containing a `case.yaml` with standard test case fields.
Problem
Benchmark authoring with many cases is verbose. Today you either:
- List every case manually in EVAL.yaml or cases.yaml (50 cases = 50 entries)
- Write a script to generate cases.yaml from a directory (bad DX — codegen before every eval)
Design
Convention
```
my-benchmark/
  benchmark.eval.yaml     # shared: targets, workspace, hooks
  cases/                  # auto-discovered
    fix-null-check/
      case.yaml           # per-case: input, criteria, assertions
      workspace/          # optional per-case workspace override
    add-auth/
      case.yaml
    refactor-parser/
      case.yaml
      workspace/
```
case.yaml
Standard test case schema — each case declares its own input and assertions:
```yaml
input: "Fix the null pointer in parser.ts"
criteria: Agent should identify root cause and fix without breaking existing tests.
assertions:
  - type: code-grader
    command: ["bun", "test", "tests/parser.test.ts"]
  - Agent should not break existing tests
  - type: execution-metrics
    max_tool_calls: 20
metadata:
  difficulty: medium
  category: debugging
```
Per-case assertions are explicit and declarative — no shared grader with implicit coupling to workspace contents. Each case says exactly what it checks. This is cleaner than pipeline-based approaches (like Margin's test.sh → config.json → run_script.sh → parser.py → evaluate.py indirection chain) because the full verification contract is readable in one file.
EVAL.yaml
Only truly shared config lives at the suite level:
```yaml
workspace: ./shared-workspace/

execution:
  targets:
    - name: claude-baseline
      use_target: ${{ AGENT_TARGET }}
    - name: claude-superpowers
      use_target: ${{ AGENT_TARGET }}
      hooks:
        before_each:
          command: ["bash", "./scripts/setup-variant.sh", "superpowers"]

# Suite-level assertions apply to ALL discovered cases (in addition to per-case assertions)
assertions:
  - Response does not include harmful content

tests: ./cases/  # directory path → auto-discover
```
What goes where
| Concern | Where | Why |
| --- | --- | --- |
| Targets and target hooks | EVAL.yaml | Shared across all cases |
| Default workspace template | EVAL.yaml | Shared base environment |
| Suite-level assertions (apply to all cases) | EVAL.yaml | Shared quality gates |
| Task prompt (`input:`) | case.yaml | Different per case |
| Per-case assertions and graders | case.yaml | Each case declares its own verification contract |
| Per-case workspace overrides | case directory `workspace/` | Different repo state per case |
| Case metadata | case.yaml | Per-case difficulty, category, tags |
Discovery rules
- Each immediate subdirectory of `./cases/` is a test case
- Subdirectory name becomes the `id` (unless `id:` is specified in `case.yaml`)
- `case.yaml` is required; directories without it are skipped with a warning
- `case.yaml` uses the existing test case schema (no new fields)
- Suite-level `assertions:` from EVAL.yaml are merged with per-case `assertions:`
- Subdirectories are sorted alphabetically for deterministic ordering
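A minimal sketch of these discovery rules, assuming a Node/Bun-style runtime; `CaseRef` and `discoverCases` are illustrative names, not the real loader API:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical shape for a discovered case reference.
interface CaseRef {
  id: string;       // defaults to the subdirectory name
  caseFile: string; // path to the case's case.yaml
}

function discoverCases(casesDir: string): CaseRef[] {
  // Immediate subdirectories only, sorted for deterministic ordering.
  const subdirs = fs
    .readdirSync(casesDir, { withFileTypes: true })
    .filter((e) => e.isDirectory())
    .map((e) => e.name)
    .sort();

  const refs: CaseRef[] = [];
  for (const name of subdirs) {
    const caseFile = path.join(casesDir, name, "case.yaml");
    if (!fs.existsSync(caseFile)) {
      // Directories without case.yaml are skipped with a warning.
      console.warn(`skipping ${name}: no case.yaml`);
      continue;
    }
    refs.push({ id: name, caseFile });
  }
  return refs;
}
```

The sort happens on directory names before any filtering, so adding or removing an unrelated case never reorders the rest of the suite.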
Core change
The only new behavior is in the test case loader:
- Detect that the `tests:` value is a directory path (not a file)
- Scan immediate subdirectories for `case.yaml`
- Load each `case.yaml` using the existing test case schema
- Set `id` from the directory name if not specified
- Merge suite-level assertions as normal
No new schema fields. No new file formats.
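The id-defaulting and merge steps can be illustrated roughly as follows (`finalizeCase` is a hypothetical helper; the real loader reuses the existing schema types):

```typescript
// A string assertion is a natural-language criterion; an object assertion
// names a grader type. Both shapes appear in the case.yaml example above.
type Assertion = string | { type: string; [key: string]: unknown };

interface LoadedCase {
  id: string;
  input: string;
  assertions: Assertion[];
}

// Hypothetical post-processing step: default the id from the directory name
// and merge suite-level assertions with the case's own.
function finalizeCase(
  dirName: string,
  parsed: { id?: string; input: string; assertions?: Assertion[] },
  suiteAssertions: Assertion[],
): LoadedCase {
  return {
    id: parsed.id ?? dirName, // directory name unless case.yaml sets id:
    input: parsed.input,
    // Assumed order: suite-level first, then per-case; the spec only
    // requires that both sets apply to every discovered case.
    assertions: [...suiteAssertions, ...(parsed.assertions ?? [])],
  };
}
```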
Backward compatibility
- `tests: ./cases.yaml` (file path) continues to work as today
- `tests: ./cases/` (directory path) triggers auto-discovery
- Detection: if the path is a directory, auto-discover; if it is a file, parse it as YAML/JSONL
- Inline tests continue to work unchanged
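The detection rule above comes down to a single stat call; a sketch (function name hypothetical):

```typescript
import * as fs from "node:fs";

// Directory → auto-discover; file → parse as YAML/JSONL, as today.
function resolveTestsMode(testsPath: string): "auto-discover" | "parse-file" {
  return fs.statSync(testsPath).isDirectory() ? "auto-discover" : "parse-file";
}
```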
Non-goals
- Not changing the EVAL.yaml format; auto-discovery is just a new resolution mode for `tests:`
- Not adding new manifest formats or directory conventions beyond `case.yaml`
- Not adding shared grader scripts; per-case verification belongs in `case.yaml` assertions
Related
Acceptance signals
- `tests: ./cases/` discovers subdirectories as test cases
- Each `case.yaml` is loaded using the existing test case schema
- Directory name becomes the `id` when not specified
- Suite-level `assertions:` merge with per-case `assertions:`
- Directories without `case.yaml` are skipped with a warning
- `tests: ./file.yaml` and inline tests continue to work unchanged