Build a capability fixture corpus #156
Merged
AbdelStark merged 1 commit into AbdelStark:main from May 4, 2026
Refs AbdelStark#135 (WF-B2). Adds a packaged corpus of canonical input fixtures for the seven WorldForge provider capabilities (predict, reason, embed, generate, transfer, score, and policy) so conformance tests, evaluation suites, and provider authors can reuse public-input shapes instead of re-deriving payloads in each test file. Each capability ships exactly one valid_baseline.json plus at least two invalid_<reason> fixtures with distinct expected_error_pattern entries.

Fixtures are JSON envelopes (schema_version: 1) carrying an id, capability, data_class (synthetic | captured | host-supplied), expected outcome, expected error pattern, description, and a payload whose keys match the keyword arguments of the corresponding assert_*_conformance() helper. The corpus lives under src/worldforge/testing/fixtures/ so it ships with the wheel and is reachable through the public testing API.

The new worldforge.testing.capability_fixtures module exposes load_capability_fixture(), iter_capability_fixtures(), iter_all_fixtures(), list_fixture_names(), and CapabilityFixture, all re-exported from worldforge.testing.

tests/test_capability_fixtures.py exercises every fixture: valid baselines run through the mock provider; invalid fixtures that the framework already rejects raise WorldForgeError with their declared pattern; fixtures whose contract claim is not yet enforced assert their structural invariants. tests/test_provider_contracts.py feeds the corpus through assert_*_conformance(), which wires it into the existing conformance flow.

Documentation lives in docs/src/fixtures.md and src/worldforge/testing/fixtures/README.md (contributor charter); both are linked from the docs nav and SUMMARY. CHANGELOG and the acceptance criteria in docs/src/roadmap-continuation.md are updated.
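The envelope described above can be sketched as a small validating parser. This is only an illustration, not the worldforge loader: FixtureEnvelope and parse_fixture are hypothetical names, the expected_outcome key and its "valid" value are assumed spellings of the "expected outcome" field, and the payload keys in the example are made up.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Values taken from the PR description.
CAPABILITIES = ("predict", "reason", "embed", "generate", "transfer", "score", "policy")
DATA_CLASSES = ("synthetic", "captured", "host-supplied")
SCHEMA_VERSION = 1


@dataclass(frozen=True)
class FixtureEnvelope:
    """Hypothetical stand-in for the real CapabilityFixture type."""

    id: str
    capability: str
    data_class: str
    expected_outcome: str  # assumed key name
    expected_error_pattern: Optional[str]
    description: str
    payload: Dict[str, Any]


def parse_fixture(text: str) -> FixtureEnvelope:
    """Parse one JSON envelope and reject obviously malformed ones."""
    raw = json.loads(text)
    if raw.get("schema_version") != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {raw.get('schema_version')!r}")
    if raw.get("capability") not in CAPABILITIES:
        raise ValueError(f"unknown capability: {raw.get('capability')!r}")
    if raw.get("data_class") not in DATA_CLASSES:
        raise ValueError(f"unknown data_class: {raw.get('data_class')!r}")
    if not isinstance(raw.get("payload"), dict):
        raise ValueError("payload must be a JSON object")
    return FixtureEnvelope(
        id=raw["id"],
        capability=raw["capability"],
        data_class=raw["data_class"],
        expected_outcome=raw["expected_outcome"],
        expected_error_pattern=raw.get("expected_error_pattern"),
        description=raw["description"],
        payload=raw["payload"],
    )


# Illustrative envelope; the id format and payload keys are invented.
example = json.dumps(
    {
        "schema_version": 1,
        "id": "embed/valid_baseline",
        "capability": "embed",
        "data_class": "synthetic",
        "expected_outcome": "valid",
        "expected_error_pattern": None,
        "description": "Smallest input the embed capability should accept.",
        "payload": {"text": "hello"},
    }
)
fixture = parse_fixture(example)
print(fixture.id, sorted(fixture.payload))
```

Because the payload keys are meant to line up with the conformance helper's keyword arguments, a caller would presumably unpack fixture.payload straight into that helper rather than translating it first.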
12 tasks
AbdelStark pushed a commit that referenced this pull request on May 4, 2026
Refs #136 (WF-B3). Builds on the provenance envelope (#155) and the capability fixture corpus (#156).

Ships five named benchmark presets so maintainers can run release-regression workloads without re-deriving inputs and budgets:

- mock-smoke (checkout-safe): fast every-branch CI smoke check on the mock provider with a tight success-rate gate.
- parser-overhead (checkout-safe): twenty-iteration latency-bounded run that measures WorldForge adapter-path overhead through the mock.
- remote-media-dryrun (remote-media): single-iteration dry-run for cosmos or runway. Skips with a typed reason when neither env is configured.
- prepared-host (prepared-host): three-iteration evidence run for any configured prepared-host runtime (leworldmodel for score; lerobot or gr00t for policy). Skips with a typed reason when no eligible runtime is available.
- release-evidence (release): release-gated regression check across every mock-supported operation with strict latency, throughput, and success-rate budgets, intended to be paired with --run-workspace for release attestation.

Each preset is a frozen BenchmarkPreset dataclass under worldforge.benchmark_presets, validated at construction (category, providers, operations, iterations, concurrency, failure_tolerance, runtime profiles). Preset inputs and budgets live under src/worldforge/benchmark_presets/_data/ as JSON files matching the existing --input-file and --budget-file wire formats and ship in the wheel through the existing src/worldforge package layout.

CLI: three new flags on the existing benchmark subcommand. --preset NAME overrides --provider, --operation, --iterations, --concurrency, --input-file, and --budget-file. --list-presets and --show-preset NAME print the catalogue or one preset's details (markdown, json, or csv) and exit. The preset name is recorded in run_metadata.preset and threaded into the provenance envelope's command argv and notes when the report runs.
Skip semantics: presets that gate on a provider runtime profile call worldforge.testing.runtime_profiles.provider_profile_skip_reason; when the preset's failure_tolerance is "skip-when-env-missing" the CLI prints a typed reason and exits 0 (treated by release CI as "evidence not available on this host"). fail-on-violation presets always run; budget violations exit non-zero with the standard violation table.

Tests cover the registry (cardinality, validation), runtime gating across remote-media and prepared-host categories, and the CLI list/show/run paths in markdown, json, and csv formats. The benchmark help snapshot is updated for the three new flags. CHANGELOG and the WF-B3 acceptance checklist in docs/src/roadmap-continuation.md are updated.

Co-authored-by: WorldForge contributor <contributor@worldforge.local>
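The construction-time validation and the skip/exit behaviour described in this commit message could look roughly like the sketch below. It is not the actual BenchmarkPreset: PresetSketch and exit_code are hypothetical names, only a few of the listed fields are modelled, the skip reason is passed in as a plain string instead of coming from provider_profile_skip_reason, and the failure_tolerance assigned to parser-overhead is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# The two tolerance values named in the commit message.
FAILURE_TOLERANCES = ("fail-on-violation", "skip-when-env-missing")
# The four preset categories named in the commit message.
CATEGORIES = ("checkout-safe", "remote-media", "prepared-host", "release")


@dataclass(frozen=True)
class PresetSketch:
    """Hypothetical, trimmed-down stand-in for BenchmarkPreset."""

    name: str
    category: str
    iterations: int
    failure_tolerance: str

    def __post_init__(self) -> None:
        # Validate at construction so a bad preset never reaches the registry.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.iterations < 1:
            raise ValueError("iterations must be >= 1")
        if self.failure_tolerance not in FAILURE_TOLERANCES:
            raise ValueError(f"unknown failure_tolerance: {self.failure_tolerance!r}")


def exit_code(preset: PresetSketch, skip_reason: Optional[str], budget_violations: int) -> int:
    """Return the process exit code for one preset run.

    skip_reason is None when the required runtime is available.
    """
    if skip_reason is not None and preset.failure_tolerance == "skip-when-env-missing":
        # Evidence is not available on this host: report why and succeed.
        print(f"skipped {preset.name}: {skip_reason}")
        return 0
    # fail-on-violation presets always run; any budget violation fails the run.
    return 1 if budget_violations > 0 else 0


remote = PresetSketch("remote-media-dryrun", "remote-media", 1, "skip-when-env-missing")
# Tolerance for parser-overhead is assumed here for illustration.
strict = PresetSketch("parser-overhead", "checkout-safe", 20, "fail-on-violation")

print(exit_code(remote, "no cosmos or runway environment configured", 0))
print(exit_code(strict, None, 2))
```

Keeping the skip decision in one pure function makes the "exit 0 means no evidence on this host" rule easy to test without invoking the CLI.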
Refs #135 (WF-B2). Builds on the provenance envelope shipped in #155.
Summary
- Adds a packaged corpus of canonical input fixtures for the seven WorldForge provider capabilities (predict, reason, embed, generate, transfer, score, policy). Each capability ships exactly one valid_baseline.json plus at least two invalid_<reason> fixtures with distinct expected_error_pattern entries, for 21 fixtures total.
- Fixtures are JSON envelopes (schema_version: 1) carrying an id, capability, data_class (synthetic | captured | host-supplied), expected outcome, expected error pattern, description, and a payload whose keys map directly onto the matching assert_*_conformance() keyword arguments, so fixtures can be passed straight through without intermediate translation.
- The corpus lives under src/worldforge/testing/fixtures/ so it ships with the wheel and is reachable through the public testing API.
- Wires the corpus into tests/test_provider_contracts.py (mock-provider conformance for predict, reason, embed, generate) and adds 29 dedicated tests in tests/test_capability_fixtures.py that exercise every fixture: valid baselines run through the framework facade; invalid fixtures the framework already rejects raise WorldForgeError with their declared pattern; fixtures whose contract claim is not yet enforced assert their structural invariants and document the pin for future validators.
- Documentation: docs/src/fixtures.md, contributor charter at src/worldforge/testing/fixtures/README.md, mkdocs nav and SUMMARY entries, CHANGELOG, and the WF-B2 acceptance checklist in docs/src/roadmap-continuation.md ticked.

Acceptance criteria (from #135)
- … (tests/test_provider_contracts.py, tests/test_capability_fixtures.py).
- … (… pyproject.toml change required; fixtures ship via the existing src/worldforge/testing/ package layout).

Public testing surface
New helpers re-exported from worldforge.testing:

- CAPABILITY_FIXTURE_NAMES
- CapabilityFixture
- FIXTURE_SCHEMA_VERSION: 1; bump only with a loader migration.
- list_fixture_names(capability)
- load_capability_fixture(capability, name)
- iter_capability_fixtures(capability)
- iter_all_fixtures()

Test plan
- uv run ruff check src tests examples scripts: clean
- uv run ruff format --check src tests examples scripts: clean
- uv lock --check
- uv run python scripts/generate_provider_docs.py --check
- uv run pytest: 679 passed / 2 skipped
- uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90: 90.16 percent
- bash scripts/test_package.sh: fixtures ship in the wheel automatically
- uv run mkdocs build --strict

Migration
Pure addition. No public API was renamed or removed. Existing tests, conformance helpers, and provider fixtures under tests/fixtures/providers/ are untouched. Downstream issues (WF-B5 evidence bundles, WF-B7 failure-case galleries) can now consume worldforge.testing.iter_all_fixtures() instead of re-deriving inputs.
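As a rough idea of what downstream consumption might involve, the sketch below groups fixture records by capability and re-checks the corpus invariants from the summary (exactly one valid_baseline, at least two invalid_<reason> fixtures, distinct error patterns). It does not import worldforge: FixtureRecord is a hypothetical stand-in for whatever iter_all_fixtures() yields, its attribute names are assumptions, and the invalid fixture names and patterns are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class FixtureRecord(NamedTuple):
    """Hypothetical stand-in for one item yielded by iter_all_fixtures()."""

    capability: str
    name: str
    expected_error_pattern: Optional[str]


def check_corpus(fixtures: Iterable[FixtureRecord]) -> Dict[str, List[str]]:
    """Return invariant violations per capability; empty dict means all good."""
    by_capability: Dict[str, List[FixtureRecord]] = defaultdict(list)
    for record in fixtures:
        by_capability[record.capability].append(record)

    problems: Dict[str, List[str]] = {}
    for capability, items in by_capability.items():
        issues: List[str] = []
        baselines = [r for r in items if r.name == "valid_baseline"]
        invalid = [r for r in items if r.name.startswith("invalid_")]
        patterns = [r.expected_error_pattern for r in invalid]
        if len(baselines) != 1:
            issues.append("expected exactly one valid_baseline")
        if len(invalid) < 2:
            issues.append("expected at least two invalid_<reason> fixtures")
        if len(set(patterns)) != len(patterns):
            issues.append("expected_error_pattern entries must be distinct")
        if issues:
            problems[capability] = issues
    return problems


# Invented sample data: "embed" satisfies the invariants, "score" does not.
sample = [
    FixtureRecord("embed", "valid_baseline", None),
    FixtureRecord("embed", "invalid_empty_text", "must not be empty"),
    FixtureRecord("embed", "invalid_wrong_type", "must be a string"),
    FixtureRecord("score", "valid_baseline", None),
    FixtureRecord("score", "invalid_a", "same pattern"),
    FixtureRecord("score", "invalid_b", "same pattern"),
]
problems = check_corpus(sample)
print(problems)
```

In a real consumer the sample list would be replaced by the iterator from worldforge.testing, with the attribute access adjusted to the actual CapabilityFixture fields.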