Publish v0.8 execution evaluation results by AbdelStark · Pull Request #384 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-05T21:15:05Z

Summary

publish the v0.8 correctness-aware execution results report and public artifact index
add tracked eval manifests, reports, and score rows under docs/benchmark/v0_8/ for both seeds
add v0.8 dataset/model cards that state the claim boundary: HumanEval positive, overall claim closed
add docs regression coverage so future edits keep the v0.8 claim boundary explicit
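
The docs regression coverage in the last item could take roughly the following shape. This is a minimal sketch, not the tests that live under tests/docs: the helper name `missing_boundary_phrases`, the required phrases, and the sample card text are all hypothetical, and the real cards may word the boundary differently.

```python
"""Hypothetical sketch of a docs regression check for the v0.8 claim boundary."""

# Assumed phrases; the real cards and report may word the boundary differently.
REQUIRED_PHRASES = (
    "humaneval",
    "overall v0.8 downstream claim remains closed",
)


def missing_boundary_phrases(text: str, required=REQUIRED_PHRASES) -> list[str]:
    """Return the required phrases that do not appear in `text` (case-insensitive)."""
    # Collapse whitespace so a phrase wrapped across lines still matches.
    normalized = " ".join(text.lower().split())
    return [phrase for phrase in required if phrase not in normalized]


def test_card_keeps_claim_boundary():
    # In a real test this text would be read from a tracked card or report file.
    card = """
    HumanEval WS-D rerank passes on both seeds.
    The overall v0.8 downstream claim remains
    closed.
    """
    assert missing_boundary_phrases(card) == []
```

A check of this kind fails as soon as an edit drops or rewords the boundary statement, which forces the change to be deliberate.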

seed 42 HF job 6a2278d2e6aa50b87b9eba56: COMPLETED
seed 1729 HF job 6a227a6ce52fdd2a02ed9005: COMPLETED
both jobs uploaded verified run artifacts; the checkpoint mirrors were downloaded again and verified

HumanEval WS-D rerank passes on both seeds
MBPP-Plus WS-D rerank does not clear the claim gate
pass/fail latent probe does not consistently beat controls
output-magnitude probe is not evaluable because the v0.8 pack has zero validation labels for that target
overall v0.8 downstream claim remains closed
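
One way to read the results above is as a set of gates that must all pass before the overall claim opens. The sketch below encodes that reading. The all-gates-must-pass rule, the gate names, and the `GateStatus`, `probe_status`, and `overall_claim_open` helpers are assumptions made for illustration; this PR does not spell out the project's actual gate logic.

```python
from enum import Enum


class GateStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_EVALUABLE = "not_evaluable"


def probe_status(n_val_labels: int, beats_controls: bool) -> GateStatus:
    """A probe with no validation labels cannot be scored, so it is neither pass nor fail."""
    if n_val_labels == 0:
        return GateStatus.NOT_EVALUABLE
    return GateStatus.PASS if beats_controls else GateStatus.FAIL


def overall_claim_open(gates: dict[str, GateStatus]) -> bool:
    """Assumed rule: the claim opens only if there is at least one gate and every gate passes."""
    return bool(gates) and all(status is GateStatus.PASS for status in gates.values())


# Statuses transcribed from the results listed above; the gate names are illustrative.
v0_8_gates = {
    "humaneval_wsd_rerank": GateStatus.PASS,
    "mbpp_plus_wsd_rerank": GateStatus.FAIL,
    "pass_fail_latent_probe": GateStatus.FAIL,
    "output_magnitude_probe": probe_status(n_val_labels=0, beats_controls=False),
}

print(overall_claim_open(v0_8_gates))  # False
```

Keeping "not evaluable" distinct from "fail" means a missing-label target cannot be mistaken for a negative result, while still preventing it from counting toward an open claim.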

uv run pytest tests/ -> 939 passed, 8 skipped, 1 warning
uv run pytest tests/docs -> 122 passed
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests
git diff --cached --check
parent-aware manifest verification for all 10 copied docs/benchmark/v0_8 eval manifests
uv run codelewm secret-scan docs/benchmark/v0_8 docs/benchmark/EXECUTION_V0_8_RESULTS_2026-06-05.md docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-05.md docs/cards/codelewm-v0-8-execution-dataset-2026-06-05.md docs/cards/codelewm-v0-8-execution-model-seed-42-2026-06-05.md docs/cards/codelewm-v0-8-execution-model-seed-1729-2026-06-05.md --json

Closes #371.

docs: publish v0.8 execution evaluation

2491ae3

AbdelStark merged commit 1748ca2 into main Jun 5, 2026
9 checks passed

AbdelStark deleted the issue-371-v0-8-eval-results branch June 5, 2026 21:17

AbdelStark mentioned this pull request Jun 5, 2026

[TRACKER] v0.8 correctness-aware execution world model #364
