data: build v0.9 cross-benchmark passfail packs by AbdelStark · Pull Request #394 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-06T09:10:03Z

Summary

Add multi-source pass/fail pack inputs for HumanEval + MBPP-Plus style completion-label artifacts while keeping legacy single-source mode.
Split by benchmark/problem, add optional held-out split coverage gates, and fail with typed split_coverage_blocker errors when required labels are not evaluable.
Enrich pack reports/manifests with benchmark counts, output-magnitude coverage, held-out coverage, readiness gates, and explicit benchmark_id row metadata.
Document the v0.9 multi-source script mode and data-model row contract.
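
For reviewers unfamiliar with the split behavior, here is a minimal sketch of the idea, not the code in this PR. It assumes rows carry benchmark_id, problem_id, and a pass/fail label, that splits are assigned by hashing the benchmark/problem key, and that a held-out gate raises a typed error when a required label is absent. The names Row, assign_split, split_rows, check_coverage, and SplitCoverageBlocker, along with the 0.2 held-out fraction, are hypothetical illustrations.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


class SplitCoverageBlocker(ValueError):
    """Hypothetical typed error standing in for split_coverage_blocker."""

    def __init__(self, split, missing_labels):
        self.split = split
        self.missing_labels = sorted(missing_labels)
        super().__init__(
            f"split_coverage_blocker: split {split!r} "
            f"is missing labels {self.missing_labels}"
        )


@dataclass(frozen=True)
class Row:
    # Assumed row shape; the real row contract is in the data-model docs.
    benchmark_id: str
    problem_id: str
    label: str  # "pass" or "fail"


def assign_split(row, heldout_fraction=0.2):
    """Pick a split from the benchmark/problem key only, so every
    completion for one problem lands in the same split."""
    key = f"{row.benchmark_id}/{row.problem_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "heldout" if bucket < heldout_fraction else "train"


def split_rows(rows, heldout_fraction=0.2):
    """Group rows by their assigned split."""
    splits = defaultdict(list)
    for row in rows:
        splits[assign_split(row, heldout_fraction)].append(row)
    return dict(splits)


def check_coverage(splits, split, required_labels=("pass", "fail")):
    """Raise the typed blocker if a required label is absent from a split."""
    present = {row.label for row in splits.get(split, [])}
    missing = set(required_labels) - present
    if missing:
        raise SplitCoverageBlocker(split, missing)
```

Hashing the benchmark/problem key rather than the individual row keeps all completions for a problem together, which avoids leaking a problem across train and held-out data. The gate is kept as a separate optional call so that legacy single-source packs can skip it.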

uv run pytest tests/data/execution_pack/test_passfail_pack.py
uv run pytest tests/data/execution_pack tests/training/test_execution_pack_loader.py tests/docs/test_usage_docs.py tests/docs/test_execution_substrate_docs.py
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests scripts/build-passfail-pack
uv run scripts/build-passfail-pack --help
git diff --check
uv run pytest tests/ (941 passed, 8 skipped, 1 warning)

Closes #387

data: build v0.9 cross-benchmark passfail packs

6c6b4d6

AbdelStark merged commit 777eb64 into main Jun 6, 2026
9 checks passed

AbdelStark deleted the issue-387-v0-9-cross-benchmark-pack branch June 6, 2026 09:12