Publish v0.8 execution evaluation results #384

Merged
AbdelStark merged 1 commit into main from issue-371-v0-8-eval-results on Jun 5, 2026

@AbdelStark
Owner

Summary

  • publish the v0.8 correctness-aware execution results report and public artifact index
  • add tracked docs/benchmark/v0_8/ eval manifests/reports/score rows for both seeds
  • add v0.8 dataset/model cards with the HumanEval-positive but overall-claim-closed boundary
  • add docs regression coverage so future edits keep the v0.8 claim boundary explicit

Job status

  • seed 42 HF job 6a2278d2e6aa50b87b9eba56: COMPLETED
  • seed 1729 HF job 6a227a6ce52fdd2a02ed9005: COMPLETED
  • both jobs uploaded verified run artifacts; checkpoint mirrors were downloaded back and verified

Result boundary

  • HumanEval WS-D rerank passes on both seeds
  • MBPP-Plus WS-D rerank does not clear the claim gate
  • pass/fail latent probe does not beat controls consistently
  • output-magnitude probe is not evaluable because the v0.8 pack has zero validation labels for that target
  • overall v0.8 downstream claim remains closed
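
One way to read the boundary above is as a conjunction of gates, where a target that cannot be evaluated does not count as a pass. The sketch below is illustrative only: `GateResult` and `overall_claim_open` are hypothetical names, and the all-gates-must-pass rule is an assumption, not the gate logic used by codelewm. The pass/fail values are the ones listed in this PR.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GateResult:
    """One component of the result boundary (hypothetical structure)."""
    name: str
    passed: Optional[bool]  # None means "not evaluable"


def overall_claim_open(results: Iterable[GateResult]) -> bool:
    # Assumption: the overall claim opens only if every gate passes.
    # A non-evaluable gate (None) is treated the same as a failure.
    return all(r.passed is True for r in results)


# Outcomes as reported in this PR's "Result boundary" list.
v0_8_results = [
    GateResult("HumanEval WS-D rerank", True),
    GateResult("MBPP-Plus WS-D rerank", False),
    GateResult("pass/fail latent probe vs. controls", False),
    GateResult("output-magnitude probe", None),
]

if __name__ == "__main__":
    state = "open" if overall_claim_open(v0_8_results) else "closed"
    print(f"overall v0.8 downstream claim: {state}")
```

Under that assumption, a single positive benchmark (HumanEval) is not enough to open the overall claim, which matches the "HumanEval-positive but overall-claim-closed" wording in the summary.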

Validation

  • uv run pytest tests/ -> 939 passed, 8 skipped, 1 warning
  • uv run pytest tests/docs -> 122 passed
  • uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests
  • git diff --cached --check
  • parent-aware manifest verification for all 10 copied docs/benchmark/v0_8 eval manifests
  • uv run codelewm secret-scan docs/benchmark/v0_8 docs/benchmark/EXECUTION_V0_8_RESULTS_2026-06-05.md docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-05.md docs/cards/codelewm-v0-8-execution-dataset-2026-06-05.md docs/cards/codelewm-v0-8-execution-model-seed-42-2026-06-05.md docs/cards/codelewm-v0-8-execution-model-seed-1729-2026-06-05.md --json
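
The `-x` option in the `compileall` step takes a regular expression that is searched against each file path; matching files are skipped. The sketch below shows which paths the pattern from that command would exclude. Only the pattern comes from this PR; the sample paths are made up for illustration.

```python
import re

# Pattern copied from the compileall command in the Validation list.
EXCLUDE = re.compile(r"tests/fixtures/codestate/invalid_(before|after)\.py$")

# Hypothetical paths, chosen only to exercise the pattern.
sample_paths = [
    "tests/fixtures/codestate/invalid_before.py",
    "tests/fixtures/codestate/invalid_after.py",
    "tests/fixtures/codestate/valid_before.py",
    "tests/fixtures/codestate/invalid_before.pyc",
    "codelewm/__init__.py",
]

# compileall uses a regex search (not a full match) on the path,
# so the pattern may match anywhere up to the `$` anchor.
skipped = [p for p in sample_paths if EXCLUDE.search(p)]
compiled = [p for p in sample_paths if not EXCLUDE.search(p)]

if __name__ == "__main__":
    print("skipped:", skipped)
    print("compiled:", compiled)
```

The exclusion presumably exists because those two fixtures are intentionally invalid Python and would otherwise fail the compile check.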

Closes #371.

@AbdelStark merged commit 1748ca2 into main on Jun 5, 2026
9 checks passed
@AbdelStark deleted the issue-371-v0-8-eval-results branch on June 5, 2026 21:17

Development

Successfully merging this pull request may close these issues.

v0.8 eval: correctness + WS-D rerank gate + results report