Build a capability fixture corpus by adrienlacombe · Pull Request #156 · AbdelStark/worldforge

adrienlacombe · 2026-05-04T06:22:30Z

Refs #135 (WF-B2). Builds on the provenance envelope shipped in #155.

Summary

Adds a packaged corpus of canonical input fixtures for the seven WorldForge provider capabilities (predict, reason, embed, generate, transfer, score, policy). Each capability ships exactly one valid_baseline.json plus at least two invalid_<reason> fixtures with distinct expected_error_pattern entries — 21 fixtures total.
Each fixture is a small JSON envelope (schema_version: 1) carrying an id, capability, data_class (synthetic | captured | host-supplied), expected outcome, expected error pattern, description, and a payload whose keys map directly onto the matching assert_*_conformance() keyword arguments — fixtures can be passed straight through without intermediate translation.
The corpus lives under src/worldforge/testing/fixtures/ so it ships with the wheel and is reachable through the public testing API.
Wires the corpus into tests/test_provider_contracts.py (mock-provider conformance for predict, reason, embed, generate) and adds 29 dedicated tests in tests/test_capability_fixtures.py that exercise every fixture: valid baselines run through the framework facade; invalid fixtures the framework already rejects trip WorldForgeError with their declared pattern; fixtures whose contract claim is not yet enforced assert their structural invariants and document the pin for future validators.
Documentation: new public page docs/src/fixtures.md, contributor charter at src/worldforge/testing/fixtures/README.md, mkdocs nav and SUMMARY entries, CHANGELOG, and the WF-B2 acceptance checklist in docs/src/roadmap-continuation.md ticked.
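As a rough illustration of the envelope described in the summary, the sketch below builds one fixture and checks its shape. The field list follows the summary; the key spelling `expected_outcome`, the fixture id, the payload keys, and the treatment of `expected_error_pattern` as a regular expression are all assumptions made for this example, not the packaged loader's behavior.

```python
import json
import re

# Key names follow the PR summary. "expected_outcome" is an assumed spelling
# of the "expected outcome" field; the payload contents are invented.
REQUIRED_KEYS = {
    "schema_version", "id", "capability", "data_class",
    "expected_outcome", "expected_error_pattern", "description", "payload",
}
CAPABILITIES = ("predict", "reason", "embed", "generate", "transfer", "score", "policy")
DATA_CLASSES = ("synthetic", "captured", "host-supplied")


def check_envelope(envelope: dict) -> list:
    """Return a list of problems; an empty list means the envelope looks well formed."""
    problems = []
    missing = REQUIRED_KEYS - envelope.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if envelope.get("schema_version") != 1:
        problems.append("schema_version must be 1")
    if envelope.get("capability") not in CAPABILITIES:
        problems.append("unknown capability")
    if envelope.get("data_class") not in DATA_CLASSES:
        problems.append("unknown data_class")
    if not isinstance(envelope.get("payload"), dict):
        problems.append("payload must be an object")
    pattern = envelope.get("expected_error_pattern")
    if pattern is not None:
        try:
            re.compile(pattern)  # assumption: patterns are regular expressions
        except (re.error, TypeError):
            problems.append("expected_error_pattern is not a valid regex")
    return problems


# A hypothetical invalid fixture for the embed capability.
example = json.loads("""
{
  "schema_version": 1,
  "id": "embed/invalid_empty_text",
  "capability": "embed",
  "data_class": "synthetic",
  "expected_outcome": "error",
  "expected_error_pattern": "must not be empty",
  "description": "Empty text should be rejected before reaching the provider.",
  "payload": {"text": ""}
}
""")
print(check_envelope(example))
```

Returning a list of problems rather than raising on the first one keeps the check easy to reuse in a test that reports every defect in a fixture at once.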

Acceptance criteria (from #135)

Each capability has at least one valid fixture and two invalid boundary fixtures.
Fixtures are used by conformance or evaluation tests (tests/test_provider_contracts.py, tests/test_capability_fixtures.py).
Fixture docs state whether data is synthetic, captured, or host-supplied.
Package contract remains small and installable (no pyproject.toml change required; fixtures ship via the existing src/worldforge/testing/ package layout).
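The way a conformance test might consume a fixture can be sketched as follows. Everything here is a stand-in so the example runs on its own: the `WorldForgeError` class, the `assert_embed_conformance` signature, and the `text` payload key are hypothetical and only mirror the names used in this description.

```python
import re


class WorldForgeError(Exception):
    """Stand-in for the framework's error type, defined here so the sketch runs alone."""


def assert_embed_conformance(provider, *, text):
    """Hypothetical helper: rejects blank text, otherwise calls the provider."""
    if not text.strip():
        raise WorldForgeError("embed text must not be empty")
    vector = provider(text)
    assert len(vector) > 0


def run_fixture(fixture: dict, provider) -> str:
    """Run one fixture and return 'passed' or 'rejected' per its declared expectation."""
    pattern = fixture["expected_error_pattern"]
    try:
        # Payload keys are passed straight through as keyword arguments.
        assert_embed_conformance(provider, **fixture["payload"])
    except WorldForgeError as exc:
        if pattern is None or not re.search(pattern, str(exc)):
            raise AssertionError(f"unexpected rejection: {exc}") from exc
        return "rejected"
    if pattern is not None:
        raise AssertionError("fixture was expected to be rejected")
    return "passed"


def mock_provider(text):
    """Toy provider returning a one-element vector."""
    return [float(len(text))]


valid = {"payload": {"text": "hello"}, "expected_error_pattern": None}
invalid = {"payload": {"text": "  "}, "expected_error_pattern": "must not be empty"}

print(run_fixture(valid, mock_provider))
print(run_fixture(invalid, mock_provider))
```

Matching the error message against the fixture's declared pattern, rather than only checking the exception type, is what lets two invalid fixtures for the same capability pin two different rejection reasons.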

Public testing surface

New helpers re-exported from worldforge.testing:

| Symbol | Purpose |
| --- | --- |
| `CAPABILITY_FIXTURE_NAMES` | Tuple of capability names covered by the corpus. |
| `CapabilityFixture` | Frozen dataclass returned by the loader. |
| `FIXTURE_SCHEMA_VERSION` | Currently `1`; bump only with a loader migration. |
| `list_fixture_names(capability)` | Sorted file stems for a capability. |
| `load_capability_fixture(capability, name)` | Validate and return one fixture. |
| `iter_capability_fixtures(capability)` | Iterate validated fixtures for one capability. |
| `iter_all_fixtures()` | Iterate every fixture across the corpus. |
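To make the shape of these helpers concrete, here is a minimal stand-in loader. It is not the `worldforge.testing` implementation: the `FixtureSketch` fields, the `<root>/<capability>/<name>.json` layout, and the validation rules are assumptions, and the function names are deliberately different from the real ones.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FixtureSketch:
    """Stand-in for CapabilityFixture; the real dataclass's fields may differ."""

    id: str
    capability: str
    data_class: str
    payload: dict


def list_names(root: Path, capability: str) -> list:
    """Sorted file stems for one capability directory (assumed layout)."""
    return sorted(path.stem for path in (root / capability).glob("*.json"))


def load(root: Path, capability: str, name: str) -> FixtureSketch:
    """Read one fixture file, validate the basics, and return a frozen record."""
    raw = json.loads((root / capability / f"{name}.json").read_text(encoding="utf-8"))
    if raw.get("schema_version") != 1:
        raise ValueError(f"{capability}/{name}: unsupported schema_version")
    if raw.get("capability") != capability:
        raise ValueError(f"{capability}/{name}: capability does not match its directory")
    return FixtureSketch(raw["id"], raw["capability"], raw["data_class"], raw["payload"])


def iter_fixtures(root: Path, capability: str):
    """Yield validated fixtures for one capability in sorted name order."""
    for name in list_names(root, capability):
        yield load(root, capability, name)


def demo() -> list:
    """Write two throwaway fixtures to a temp directory and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "embed").mkdir()
        for name in ("valid_baseline", "invalid_empty_text"):
            envelope = {
                "schema_version": 1,
                "id": f"embed/{name}",
                "capability": "embed",
                "data_class": "synthetic",
                "payload": {"text": "hello"},
            }
            (root / "embed" / f"{name}.json").write_text(json.dumps(envelope), encoding="utf-8")
        return [fixture.id for fixture in iter_fixtures(root, "embed")]


print(demo())
```

Sorting the file stems gives a stable iteration order, which matters when fixtures are used as parametrized test ids.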

Test plan

uv run ruff check src tests examples scripts: clean
uv run ruff format --check src tests examples scripts: clean
uv lock --check
uv run python scripts/generate_provider_docs.py --check
uv run pytest — 679 passed / 2 skipped
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90 — 90.16 percent
bash scripts/test_package.sh — fixtures ship in the wheel automatically
uv run mkdocs build --strict

Migration

Pure addition. No public API was renamed or removed. Existing tests, conformance helpers, and provider fixtures under tests/fixtures/providers/ are untouched. Downstream issues (WF-B5 evidence bundles, WF-B7 failure-case galleries) can now consume worldforge.testing.iter_all_fixtures() instead of re-deriving inputs.

Refs AbdelStark#135 (WF-B2). Adds a packaged corpus of canonical input fixtures for the seven WorldForge provider capabilities (predict, reason, embed, generate, transfer, score, and policy) so conformance tests, evaluation suites, and provider authors can reuse public-input shapes instead of re-deriving payloads in each test file.

Each capability ships exactly one valid_baseline.json plus at least two invalid_<reason> fixtures with distinct expected_error_pattern entries. Fixtures are JSON envelopes (schema_version: 1) carrying an id, capability, data_class (synthetic | captured | host-supplied), expected outcome, expected error pattern, description, and a payload whose keys match the keyword arguments of the corresponding assert_*_conformance() helper. The corpus lives under src/worldforge/testing/fixtures/ so it ships with the wheel and is reachable through the public testing API.

The new worldforge.testing.capability_fixtures module exposes load_capability_fixture(), iter_capability_fixtures(), iter_all_fixtures(), list_fixture_names(), and CapabilityFixture, all re-exported from worldforge.testing.

tests/test_capability_fixtures.py exercises every fixture (valid baselines run through the mock provider; invalid fixtures the framework already rejects trip WorldForgeError with their declared pattern; fixtures whose contract claim is not yet enforced assert their structural invariants). tests/test_provider_contracts.py feeds the corpus through assert_*_conformance(), wiring it into the existing conformance flow.

Documentation lives in docs/src/fixtures.md and src/worldforge/testing/fixtures/README.md (contributor charter); both link from the docs nav and SUMMARY. CHANGELOG and docs/src/roadmap-continuation.md acceptance criteria are updated.

Refs #136 (WF-B3). Builds on the provenance envelope (#155) and the capability fixture corpus (#156). Ships five named benchmark presets so maintainers can run release-regression workloads without re-deriving inputs and budgets:

- mock-smoke (checkout-safe): fast every-branch CI smoke check on the mock provider with a tight success-rate gate.
- parser-overhead (checkout-safe): twenty-iteration latency-bounded run that measures WorldForge adapter-path overhead through the mock.
- remote-media-dryrun (remote-media): single-iteration dry-run for cosmos or runway. Skips with a typed reason when neither env is configured.
- prepared-host (prepared-host): three-iteration evidence run for any configured prepared-host runtime (leworldmodel for score; lerobot or gr00t for policy). Skips with a typed reason when no eligible runtime is available.
- release-evidence (release): release-gated regression check across every mock-supported operation with strict latency, throughput, and success-rate budgets, intended to be paired with --run-workspace for release attestation.

Each preset is a frozen BenchmarkPreset dataclass under worldforge.benchmark_presets, validated at construction (category, providers, operations, iterations, concurrency, failure_tolerance, runtime profiles). Preset inputs and budgets live under src/worldforge/benchmark_presets/_data/ as JSON files matching the existing --input-file and --budget-file wire formats and ship in the wheel through the existing src/worldforge package layout.

CLI: three new flags on the existing benchmark subcommand. --preset NAME overrides --provider, --operation, --iterations, --concurrency, --input-file, and --budget-file. --list-presets and --show-preset NAME print the catalogue or one preset's details (markdown, json, or csv) and exit. The preset name is recorded in run_metadata.preset and threaded into the provenance envelope's command argv and notes when the report runs.
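The "frozen dataclass validated at construction" pattern mentioned above can be sketched like this. `PresetSketch` is an illustrative stand-in, not the real `BenchmarkPreset`: the field types, defaults, and validation rules are assumptions, and runtime profiles are left out. The category and failure_tolerance values are the ones named in this commit message.

```python
from dataclasses import dataclass

CATEGORIES = ("checkout-safe", "remote-media", "prepared-host", "release")
FAILURE_TOLERANCES = ("fail-on-violation", "skip-when-env-missing")


@dataclass(frozen=True)
class PresetSketch:
    """Illustrative stand-in for BenchmarkPreset; real fields and rules may differ."""

    name: str
    category: str
    providers: tuple
    operations: tuple
    iterations: int = 1
    concurrency: int = 1
    failure_tolerance: str = "fail-on-violation"

    def __post_init__(self) -> None:
        # __post_init__ runs even on frozen dataclasses, so an invalid preset
        # raises here and never exists as an object.
        if self.category not in CATEGORIES:
            raise ValueError(f"{self.name}: unknown category {self.category!r}")
        if not self.providers or not self.operations:
            raise ValueError(f"{self.name}: providers and operations must be non-empty")
        if self.iterations < 1 or self.concurrency < 1:
            raise ValueError(f"{self.name}: iterations and concurrency must be >= 1")
        if self.failure_tolerance not in FAILURE_TOLERANCES:
            raise ValueError(f"{self.name}: unknown failure_tolerance")


parser_overhead = PresetSketch(
    name="parser-overhead",
    category="checkout-safe",
    providers=("mock",),
    operations=("predict",),  # illustrative; this preset's operations are not listed here
    iterations=20,
)
print(parser_overhead)
```

Using tuples for providers and operations keeps the whole record immutable, which pairs naturally with `frozen=True`.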
Skip semantics: presets that gate on a provider runtime profile call worldforge.testing.runtime_profiles.provider_profile_skip_reason; when the preset's failure_tolerance is "skip-when-env-missing" the CLI prints a typed reason and exits 0 (treated by release CI as "evidence not available on this host"). fail-on-violation presets always run; budget violations exit non-zero with the standard violation table.

Tests cover the registry (cardinality, validation), runtime gating across remote-media and prepared-host categories, and the CLI list/show/run paths in markdown, json, and csv formats. The benchmark help snapshot is updated for the three new flags. CHANGELOG and the WF-B3 acceptance checklist in docs/src/roadmap-continuation.md are updated.

Co-authored-by: WorldForge contributor <contributor@worldforge.local>
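The skip semantics can be read as a small decision rule, sketched below. Both helpers are hypothetical: `missing_env_reason` only stands in for `provider_profile_skip_reason` (whose real signature is not shown here), the environment variable name is invented, and the exit code 1 is just one possible non-zero value.

```python
from typing import Mapping, Optional, Sequence


def missing_env_reason(required_env: Sequence[str], environ: Mapping[str, str]) -> Optional[str]:
    """Return a reason string when a required variable is unset, else None."""
    missing = [name for name in required_env if not environ.get(name)]
    if missing:
        return "missing environment: " + ", ".join(missing)
    return None


def resolve_exit_code(failure_tolerance: str, reason: Optional[str], violations: int) -> int:
    """Map tolerance, skip reason, and budget violations to a process exit code."""
    if reason is not None and failure_tolerance == "skip-when-env-missing":
        # Evidence is not available on this host: report why and exit cleanly.
        print(f"skipped: {reason}")
        return 0
    # fail-on-violation presets always run; any budget violation is a failure.
    return 1 if violations else 0


reason = missing_env_reason(("EXAMPLE_API_KEY",), {})  # invented variable name
print(resolve_exit_code("skip-when-env-missing", reason, violations=0))
print(resolve_exit_code("fail-on-violation", None, violations=2))
```

Exiting 0 on a skip lets release CI distinguish "no evidence on this host" from a genuine budget regression, which is the distinction the commit message draws.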

adrienlacombe requested a review from AbdelStark as a code owner May 4, 2026 06:22

AbdelStark merged commit 2aa3571 into AbdelStark:main May 4, 2026
7 checks passed

adrienlacombe mentioned this pull request May 4, 2026

Add benchmark preset suites for release regressions #157

Merged

12 tasks

AbdelStark merged 1 commit into AbdelStark:main from adrienlacombe:capability-fixture-corpus
