data: build v0.9 cross-benchmark passfail packs #394

Merged
AbdelStark merged 1 commit into main from issue-387-v0-9-cross-benchmark-pack on Jun 6, 2026

AbdelStark (Owner) commented Jun 6, 2026

Summary

  • Add multi-source pass/fail pack inputs for HumanEval + MBPP-Plus style completion-label artifacts while keeping legacy single-source mode.
  • Split by benchmark/problem, add optional held-out split coverage gates, and fail with typed split_coverage_blocker errors when required labels are not evaluable.
  • Enrich pack reports/manifests with benchmark counts, output-magnitude coverage, held-out coverage, readiness gates, and explicit benchmark_id row metadata.
  • Document the v0.9 multi-source script mode and data-model row contract.
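
The split and coverage-gate behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: only `benchmark_id` and the `split_coverage_blocker` error name come from the summary, while `problem_id`, `label`, the `"pass"`/`"fail"` label values, the `"train"`/`"heldout"` split names, the hash-based assignment, and all function names are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


class SplitCoverageBlocker(Exception):
    """Hypothetical typed error: a split lacks one or more required labels."""

    code = "split_coverage_blocker"

    def __init__(self, split, missing_labels):
        super().__init__(f"split {split!r} is missing labels: {sorted(missing_labels)}")
        self.split = split
        self.missing_labels = set(missing_labels)


def assign_split(benchmark_id, problem_id, heldout_fraction=0.2):
    """Deterministically map a (benchmark, problem) pair to a split name."""
    key = f"{benchmark_id}/{problem_id}".encode("utf-8")
    # Integer bucket in [0, 10000) derived from a stable hash of the key.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 10_000
    return "heldout" if bucket < heldout_fraction * 10_000 else "train"


def split_rows(rows, heldout_fraction=0.2):
    """Group rows by split; rows sharing a benchmark/problem never straddle splits."""
    splits = defaultdict(list)
    for row in rows:
        split = assign_split(row["benchmark_id"], row["problem_id"], heldout_fraction)
        splits[split].append(row)
    return dict(splits)


def check_split_coverage(splits, required_labels=("pass", "fail"), split="heldout"):
    """Raise SplitCoverageBlocker if the named split lacks any required label."""
    seen = {row["label"] for row in splits.get(split, [])}
    missing = set(required_labels) - seen
    if missing:
        raise SplitCoverageBlocker(split, missing)
```

Keying the assignment on the benchmark/problem pair rather than on individual rows keeps every completion for a given problem in one split, so a held-out problem cannot leak into training through a sibling completion.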

Validation

  • uv run pytest tests/data/execution_pack/test_passfail_pack.py
  • uv run pytest tests/data/execution_pack tests/training/test_execution_pack_loader.py tests/docs/test_usage_docs.py tests/docs/test_execution_substrate_docs.py
  • uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests scripts/build-passfail-pack
  • uv run scripts/build-passfail-pack --help
  • git diff --check
  • uv run pytest tests/ (941 passed, 8 skipped, 1 warning)

Closes #387

AbdelStark merged commit 777eb64 into main on Jun 6, 2026
9 checks passed
AbdelStark deleted the issue-387-v0-9-cross-benchmark-pack branch on June 6, 2026 09:12

Development

Successfully merging this pull request may close these issues.

v0.9 data: build cross-benchmark pass/fail execution pack with stratified labels

1 participant