
Build a capability fixture corpus #156

Merged
AbdelStark merged 1 commit into AbdelStark:main from adrienlacombe:capability-fixture-corpus
May 4, 2026

Conversation

@adrienlacombe
Contributor

Refs #135 (WF-B2). Builds on the provenance envelope shipped in #155.

Summary

  • Adds a packaged corpus of canonical input fixtures for the seven WorldForge provider capabilities (predict, reason, embed, generate, transfer, score, policy). Each capability ships exactly one valid_baseline.json plus at least two invalid_<reason> fixtures with distinct expected_error_pattern entries — 21 fixtures total.
  • Each fixture is a small JSON envelope (schema_version: 1) carrying an id, capability, data_class (synthetic | captured | host-supplied), expected outcome, expected error pattern, description, and a payload whose keys map directly onto the matching assert_*_conformance() keyword arguments — fixtures can be passed straight through without intermediate translation.
  • The corpus lives under src/worldforge/testing/fixtures/ so it ships with the wheel and is reachable through the public testing API.
  • Wires the corpus into tests/test_provider_contracts.py (mock-provider conformance for predict, reason, embed, generate) and adds 29 dedicated tests in tests/test_capability_fixtures.py that exercise every fixture: valid baselines run through the framework facade; invalid fixtures the framework already rejects trip WorldForgeError with their declared pattern; fixtures whose contract claim is not yet enforced assert their structural invariants and document the pin for future validators.
  • Documentation: new public page docs/src/fixtures.md, contributor charter at src/worldforge/testing/fixtures/README.md, mkdocs nav and SUMMARY entries, CHANGELOG, and the WF-B2 acceptance checklist in docs/src/roadmap-continuation.md ticked.
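
As a rough illustration of the envelope shape and the "pass straight through" claim above, here is a minimal sketch. The key spelling `expected_outcome`, the payload keys, the fixture id format, and the helper `assert_predict_conformance_stub` are assumptions made for this example; they are not copied from the corpus or from the real conformance helpers.

```python
import json

# Hypothetical envelope: "expected_outcome", the id format, and the payload
# contents are assumptions for illustration, not copied from the corpus.
RAW_FIXTURE = """
{
  "schema_version": 1,
  "id": "predict/valid_baseline",
  "capability": "predict",
  "data_class": "synthetic",
  "expected_outcome": "valid",
  "expected_error_pattern": null,
  "description": "Smallest input a predict provider should accept.",
  "payload": {"action": "move", "steps": 1}
}
"""

ALLOWED_DATA_CLASSES = {"synthetic", "captured", "host-supplied"}


def assert_predict_conformance_stub(provider, *, action, steps):
    """Stand-in for a conformance helper; the real signature may differ."""
    return provider(action=action, steps=steps)


def echo_provider(**kwargs):
    """Toy provider that returns the keyword arguments it received."""
    return kwargs


envelope = json.loads(RAW_FIXTURE)
assert envelope["schema_version"] == 1
assert envelope["data_class"] in ALLOWED_DATA_CLASSES

# The payload keys line up with the helper's keyword arguments, so the
# fixture can be unpacked directly without an intermediate translation layer.
result = assert_predict_conformance_stub(echo_provider, **envelope["payload"])
print(result)
```

The point of the design is the last call: because payload keys mirror keyword arguments, a test only needs `**payload` to drive a helper.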

Acceptance criteria (from #135)

  • Each capability has at least one valid fixture and two invalid boundary fixtures.
  • Fixtures are used by conformance or evaluation tests (tests/test_provider_contracts.py, tests/test_capability_fixtures.py).
  • Fixture docs state whether data is synthetic, captured, or host-supplied.
  • Package contract remains small and installable (no pyproject.toml change required; fixtures ship via the existing src/worldforge/testing/ package layout).

Public testing surface

New helpers re-exported from worldforge.testing:

| Symbol | Purpose |
| --- | --- |
| CAPABILITY_FIXTURE_NAMES | Tuple of capability names covered by the corpus. |
| CapabilityFixture | Frozen dataclass returned by the loader. |
| FIXTURE_SCHEMA_VERSION | Currently 1; bump only with a loader migration. |
| list_fixture_names(capability) | Sorted file stems for a capability. |
| load_capability_fixture(capability, name) | Validate and return one fixture. |
| iter_capability_fixtures(capability) | Iterate validated fixtures for one capability. |
| iter_all_fixtures() | Iterate every fixture across the corpus. |
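
To make the loader surface concrete, the sketch below reimplements a few of these helpers with the standard library only. It is not the worldforge implementation: the explicit `root` argument, the one-directory-per-capability layout, the reduced field set on `CapabilityFixture`, and the fixture name `invalid_empty_text` are all assumptions chosen so the example runs on its own.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Iterator

FIXTURE_SCHEMA_VERSION = 1


@dataclass(frozen=True)
class CapabilityFixture:
    """Stand-in for the frozen dataclass; the real field set may differ."""

    id: str
    capability: str
    data_class: str
    payload: dict[str, Any]


def list_fixture_names(root: Path, capability: str) -> list[str]:
    """Return sorted file stems for one capability directory (assumed layout)."""
    return sorted(p.stem for p in (root / capability).glob("*.json"))


def load_capability_fixture(root: Path, capability: str, name: str) -> CapabilityFixture:
    """Read one envelope, check its schema version, and wrap it."""
    path = root / capability / f"{name}.json"
    raw = json.loads(path.read_text(encoding="utf-8"))
    if raw.get("schema_version") != FIXTURE_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version in {capability}/{name}")
    return CapabilityFixture(raw["id"], raw["capability"], raw["data_class"], raw["payload"])


def iter_capability_fixtures(root: Path, capability: str) -> Iterator[CapabilityFixture]:
    """Yield validated fixtures for one capability in name order."""
    for name in list_fixture_names(root, capability):
        yield load_capability_fixture(root, capability, name)


def _demo() -> list[str]:
    """Write two toy fixtures to a temp directory and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "embed").mkdir()
        for name in ("valid_baseline", "invalid_empty_text"):
            envelope = {
                "schema_version": 1,
                "id": f"embed/{name}",
                "capability": "embed",
                "data_class": "synthetic",
                "payload": {"text": "" if name.startswith("invalid") else "hello"},
            }
            (root / "embed" / f"{name}.json").write_text(json.dumps(envelope), encoding="utf-8")
        return [fixture.id for fixture in iter_capability_fixtures(root, "embed")]


fixture_ids = _demo()
print(fixture_ids)
```

In the real package the helpers take only `capability` (and `name`), since the corpus location is fixed inside the wheel.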

Test plan

  • uv run ruff check src tests examples scripts: clean
  • uv run ruff format --check src tests examples scripts: clean
  • uv lock --check
  • uv run python scripts/generate_provider_docs.py --check
  • uv run pytest: 679 passed / 2 skipped
  • uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90: 90.16 percent coverage
  • bash scripts/test_package.sh: fixtures ship in the wheel automatically
  • uv run mkdocs build --strict

Migration

Pure addition. No public API was renamed or removed. Existing tests, conformance helpers, and provider fixtures under tests/fixtures/providers/ are untouched. Downstream issues (WF-B5 evidence bundles, WF-B7 failure-case galleries) can now consume worldforge.testing.iter_all_fixtures() instead of re-deriving inputs.

Refs AbdelStark#135 (WF-B2). Adds a packaged corpus of canonical input fixtures for the seven
WorldForge provider capabilities — predict, reason, embed, generate, transfer, score,
and policy — so conformance tests, evaluation suites, and provider authors can reuse
public-input shapes instead of re-deriving payloads in each test file.

Each capability ships exactly one valid_baseline.json plus at least two invalid_<reason>
fixtures with distinct expected_error_pattern entries. Fixtures are JSON envelopes
(schema_version: 1) carrying an id, capability, data_class (synthetic | captured |
host-supplied), expected outcome, expected error pattern, description, and a payload
whose keys map onto the corresponding assert_*_conformance() keyword arguments. The corpus lives
under src/worldforge/testing/fixtures/ so it ships with the wheel and is reachable
through the public testing API.

The new worldforge.testing.capability_fixtures module exposes
load_capability_fixture(), iter_capability_fixtures(), iter_all_fixtures(),
list_fixture_names(), and CapabilityFixture, all re-exported from
worldforge.testing. tests/test_capability_fixtures.py exercises every fixture (valid
baselines run through the mock provider; invalid fixtures the framework already rejects
trip WorldForgeError with their declared pattern; fixtures whose contract claim is not
yet enforced assert their structural invariants). tests/test_provider_contracts.py picks
up the corpus through assert_*_conformance(), wiring it into the existing conformance
flow.
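
The three fixture outcomes described above (accepted, rejected with the declared pattern, or pinned because the contract is not enforced yet) could be dispatched roughly as follows. This is a sketch under stated assumptions: `WorldForgeErrorStub`, `embed_stub`, the fixture ids, and the error messages are invented so the example runs without the package.

```python
import re


class WorldForgeErrorStub(Exception):
    """Stand-in for WorldForgeError so the sketch runs without the package."""


def embed_stub(*, text):
    """Toy capability call that rejects empty input only."""
    if not text:
        raise WorldForgeErrorStub("text must be a non-empty string")
    return [float(len(text))]


def check_fixture(fixture: dict) -> str:
    """Run one fixture and report which branch it exercised."""
    pattern = fixture["expected_error_pattern"]
    try:
        embed_stub(**fixture["payload"])
    except WorldForgeErrorStub as exc:
        if pattern is None:
            raise AssertionError(f"valid fixture was rejected: {exc}") from exc
        # The declared pattern must match the error the framework raised.
        assert re.search(pattern, str(exc)), f"{exc} does not match {pattern!r}"
        return "rejected-as-declared"
    if pattern is None:
        return "accepted"
    # Contract claim not enforced yet: pin a structural invariant only.
    assert fixture["id"].split("/")[-1].startswith("invalid_")
    return "pinned-for-future-validator"


# Hypothetical fixtures; ids, payloads, and patterns are illustrative.
FIXTURES = [
    {"id": "embed/valid_baseline", "expected_error_pattern": None,
     "payload": {"text": "hello"}},
    {"id": "embed/invalid_empty_text", "expected_error_pattern": "non-empty",
     "payload": {"text": ""}},
    {"id": "embed/invalid_oversized_text", "expected_error_pattern": "too long",
     "payload": {"text": "x" * 10}},
]

outcomes = [check_fixture(fixture) for fixture in FIXTURES]
print(outcomes)
```

In a pytest suite the rejected branch would more commonly be written with `pytest.raises(..., match=pattern)`, which applies `re.search` to the error message in the same way.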

Documentation lives in docs/src/fixtures.md and src/worldforge/testing/fixtures/README.md
(contributor charter); both link from the docs nav and SUMMARY. CHANGELOG and
docs/src/roadmap-continuation.md acceptance criteria are updated.
@adrienlacombe adrienlacombe requested a review from AbdelStark as a code owner May 4, 2026 06:22
@AbdelStark AbdelStark merged commit 2aa3571 into AbdelStark:main May 4, 2026
7 checks passed
AbdelStark pushed a commit that referenced this pull request May 4, 2026
Refs #136 (WF-B3). Builds on the provenance envelope (#155) and the capability
fixture corpus (#156).

Ships five named benchmark presets so maintainers can run release-regression
workloads without re-deriving inputs and budgets:

- mock-smoke (checkout-safe): fast every-branch CI smoke check on the mock
  provider with a tight success-rate gate.
- parser-overhead (checkout-safe): twenty-iteration latency-bounded run that
  measures WorldForge adapter-path overhead through the mock.
- remote-media-dryrun (remote-media): single-iteration dry-run for cosmos or
  runway. Skips with a typed reason when neither env is configured.
- prepared-host (prepared-host): three-iteration evidence run for any
  configured prepared-host runtime (leworldmodel for score; lerobot or gr00t
  for policy). Skips with a typed reason when no eligible runtime is
  available.
- release-evidence (release): release-gated regression check across every
  mock-supported operation with strict latency, throughput, and success-rate
  budgets, intended to be paired with --run-workspace for release attestation.

Each preset is a frozen BenchmarkPreset dataclass under
worldforge.benchmark_presets, validated at construction (category, providers,
operations, iterations, concurrency, failure_tolerance, runtime profiles).
Preset inputs and budgets live under
src/worldforge/benchmark_presets/_data/ as JSON files matching the existing
--input-file and --budget-file wire formats and ship in the wheel through
the existing src/worldforge package layout.
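
A frozen dataclass that validates itself at construction might look like the sketch below. The class name `BenchmarkPresetSketch`, the exact field types, the defaults, the error messages, and the `predict` operation are assumptions; only the category names, the two failure_tolerance values, and the twenty-iteration parser-overhead run come from the description above. Runtime profiles are omitted.

```python
from dataclasses import dataclass

CATEGORIES = {"checkout-safe", "remote-media", "prepared-host", "release"}
FAILURE_TOLERANCES = {"fail-on-violation", "skip-when-env-missing"}


@dataclass(frozen=True)
class BenchmarkPresetSketch:
    """Illustrative preset record; not the worldforge BenchmarkPreset class."""

    name: str
    category: str
    providers: tuple[str, ...]
    operations: tuple[str, ...]
    iterations: int = 1
    concurrency: int = 1
    failure_tolerance: str = "fail-on-violation"

    def __post_init__(self) -> None:
        # Validation runs once, when the frozen instance is constructed.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not self.providers or not self.operations:
            raise ValueError("providers and operations must be non-empty")
        if self.iterations < 1 or self.concurrency < 1:
            raise ValueError("iterations and concurrency must be >= 1")
        if self.failure_tolerance not in FAILURE_TOLERANCES:
            raise ValueError(f"unknown failure_tolerance: {self.failure_tolerance!r}")


parser_overhead = BenchmarkPresetSketch(
    name="parser-overhead",
    category="checkout-safe",
    providers=("mock",),
    operations=("predict",),  # assumed operation, for illustration only
    iterations=20,
)

try:
    BenchmarkPresetSketch(
        name="broken",
        category="checkout-safe",
        providers=("mock",),
        operations=("predict",),
        iterations=0,
    )
except ValueError as exc:
    rejection = str(exc)
else:
    rejection = None

print(parser_overhead.name, rejection)
```

Validating in `__post_init__` means an invalid preset can never exist as an object, so registry consumers do not need to re-check fields.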

CLI: three new flags on the existing benchmark subcommand. --preset NAME
overrides --provider, --operation, --iterations, --concurrency,
--input-file, and --budget-file. --list-presets and --show-preset NAME
print the catalogue or one preset's details (markdown, json, or csv) and
exit. The preset name is recorded in run_metadata.preset and threaded into
the provenance envelope's command argv and notes when the report runs.
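
The override behaviour of --preset can be sketched with argparse as shown here. The parser layout, the default values, and the preset's `operation` and `concurrency` values are assumptions; --input-file, --budget-file, --list-presets, and --show-preset are left out to keep the example short.

```python
import argparse

# Hypothetical preset table; only the name and the twenty iterations
# come from the description, the other values are invented.
PRESETS = {
    "parser-overhead": {
        "provider": "mock",
        "operation": "predict",
        "iterations": 20,
        "concurrency": 1,
    },
}

OVERRIDABLE = ("provider", "operation", "iterations", "concurrency")


def build_parser() -> argparse.ArgumentParser:
    """Build a cut-down benchmark parser with the preset flag."""
    parser = argparse.ArgumentParser(prog="benchmark")
    parser.add_argument("--provider", default="mock")
    parser.add_argument("--operation", default="predict")
    parser.add_argument("--iterations", type=int, default=1)
    parser.add_argument("--concurrency", type=int, default=1)
    parser.add_argument("--preset", choices=sorted(PRESETS))
    return parser


def resolve(argv: list[str]) -> dict:
    """Parse argv and let a named preset win over the individual flags."""
    args = build_parser().parse_args(argv)
    settings = {key: getattr(args, key) for key in OVERRIDABLE}
    if args.preset is not None:
        settings.update(PRESETS[args.preset])
    # Record the preset name alongside the resolved settings.
    return {"settings": settings, "run_metadata": {"preset": args.preset}}


resolved = resolve(["--preset", "parser-overhead", "--iterations", "3"])
print(resolved)
```

Here the explicit `--iterations 3` is ignored because the preset supplies its own value, which matches the "overrides" wording above.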

Skip semantics: presets that gate on a provider runtime profile call
worldforge.testing.runtime_profiles.provider_profile_skip_reason; when the
preset's failure_tolerance is "skip-when-env-missing" the CLI prints a
typed reason and exits 0 (treated by release CI as "evidence not available
on this host"). fail-on-violation presets always run; budget violations
exit non-zero with the standard violation table.
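
The skip and failure rules above reduce to a small exit-code decision, sketched below. The function `run_preset`, its callable parameters, and the concrete exit code 1 are assumptions; the real code calls worldforge.testing.runtime_profiles.provider_profile_skip_reason, whose signature is not shown here.

```python
from typing import Callable, Optional


def run_preset(
    failure_tolerance: str,
    skip_reason: Callable[[], Optional[str]],
    run: Callable[[], bool],
) -> int:
    """Return a process exit code for one preset run.

    skip_reason stands in for the runtime-profile check and returns None
    when the host is ready; run returns True when every budget is met.
    """
    if failure_tolerance == "skip-when-env-missing":
        reason = skip_reason()
        if reason is not None:
            # Evidence is not available on this host: report and succeed.
            print(f"skipped: {reason}")
            return 0
    # fail-on-violation presets always reach this point.
    return 0 if run() else 1


calls = []


def failing_run() -> bool:
    """Toy workload that records the call and violates its budget."""
    calls.append("ran")
    return False


skipped = run_preset("skip-when-env-missing", lambda: "runtime env not configured", failing_run)
violated = run_preset("fail-on-violation", lambda: "runtime env not configured", failing_run)
ready = run_preset("skip-when-env-missing", lambda: None, lambda: True)
print(skipped, violated, ready, calls)
```

Exiting 0 on a skip keeps release CI green on hosts without the runtime, while a budget violation still fails the job.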

Tests cover the registry (cardinality, validation), runtime gating across
remote-media and prepared-host categories, and the CLI list/show/run paths
in markdown, json, and csv formats. The benchmark help snapshot is updated
for the three new flags. CHANGELOG and the WF-B3 acceptance checklist in
docs/src/roadmap-continuation.md are updated.

Co-authored-by: WorldForge contributor <contributor@worldforge.local>