Publish v0.9 claim audit by AbdelStark · Pull Request #400 · AbdelStark/CodeLeWM

AbdelStark · 2026-06-07T03:15:16Z

Summary

publishes the v0.9 final claim audit, public artifact index, dataset card, and seed-specific model cards
adds the checked-in v0.9 eval artifact tree for seed 42 and seed 1729 across retrieval, surprise, probe, HumanEval/MBPP-Plus rerank, and p-pass calibration
fixes rerank completion score rows to retain benchmark_id and split metadata so combined p-pass calibration does not collapse benchmark slices to unknown
updates roadmap/goal prompt state so that #385 ([TRACKER] v0.9 data/eval repair for cross-benchmark correctness claims) and #386 (v0.9 hygiene: reconcile stale trackers and roadmap queue state) through #392 (v0.9 eval/report: run full gate suite and publish claim audit) are recorded as historical completed diagnostic work, not an active queue
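
The rerank metadata fix in the summary can be illustrated with a minimal sketch. It assumes score rows are plain dictionaries and that a missing key falls back to the label "unknown"; only `benchmark_id` and `split` come from this PR, while `task_id`, `score`, `passed`, the helper names, and the row values are hypothetical and are not taken from the v0.9 artifacts.

```python
from collections import defaultdict


def slice_key(row):
    """Return the (benchmark_id, split) slice a score row belongs to.

    Assumption: rows that lack the metadata fall back to "unknown",
    which is the collapse the fix is meant to prevent.
    """
    return (row.get("benchmark_id", "unknown"), row.get("split", "unknown"))


def group_by_slice(rows):
    """Group completion score rows by benchmark slice."""
    groups = defaultdict(list)
    for row in rows:
        groups[slice_key(row)].append(row)
    return dict(groups)


# Hypothetical rows written without slice metadata.
rows_without_metadata = [
    {"task_id": "a", "score": 0.9, "passed": True},
    {"task_id": "b", "score": 0.2, "passed": False},
]

# The same rows with benchmark_id and split retained.
rows_with_metadata = [
    {"task_id": "a", "score": 0.9, "passed": True,
     "benchmark_id": "humaneval", "split": "test"},
    {"task_id": "b", "score": 0.2, "passed": False,
     "benchmark_id": "mbpp_plus", "split": "test"},
]

collapsed = group_by_slice(rows_without_metadata)
separated = group_by_slice(rows_with_metadata)
```

In this sketch `collapsed` holds a single `("unknown", "unknown")` slice, while `separated` keeps one slice per benchmark, so a combined calibration pass can still report each benchmark separately.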

Claim Boundary

v0.9 stays overall claim-closed. Both seeds clear HumanEval WS-D reranking, but MBPP-Plus has zero lift over no-action, broad semantic-decoy and representation gates remain closed, and p-pass calibration is reported from completion-score baselines rather than a standalone serialized p_pass key.
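
One way to read the claim boundary is as a conjunction of gates. The sketch below assumes the overall claim opens only when every gate is open and that "lift" is the difference between the reranked and no-action pass rates; both rules, the gate names, and the placeholder rates are assumptions for illustration, not the project's actual gate definitions or measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    is_open: bool


def lift_over_no_action(reranked_pass_rate, no_action_pass_rate):
    """Assumed definition of lift: reranked pass rate minus the baseline."""
    return reranked_pass_rate - no_action_pass_rate


def overall_claim_open(gates):
    """Assumed rule: the overall claim opens only if every gate is open."""
    gates = list(gates)
    return bool(gates) and all(gate.is_open for gate in gates)


# Placeholder rates chosen only to show a zero-lift case.
mbpp_plus_lift = lift_over_no_action(0.5, 0.5)

gates = [
    GateResult("humaneval_rerank", True),
    GateResult("mbpp_plus_rerank", mbpp_plus_lift > 0),
    GateResult("semantic_decoy", False),
    GateResult("representation", False),
]

claim_open = overall_claim_open(gates)
```

Under these assumptions a single open gate, such as HumanEval reranking, is not enough to open the overall claim.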

Validation

uv run pytest tests/docs/test_hf_ml_intern_training.py -> 8 passed
uv run pytest tests/ -> 967 passed, 8 skipped, 1 torch nested-tensor warning
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests
uv run codelewm --help
uv run codelewm openrouter byok-register --dry-run --json
uv run scripts/llm-world-model-demo -> manifest verify ok, artifact/html secret scans ok, claim gate closed
parent-aware manifest verification for the v0.9 pack, both training runs, and all 12 checked-in v0.9 eval reports -> ok
uv run codelewm secret-scan docs/benchmark/v0_9 --json -> ok, no findings
git diff --check
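
The command-line checks above could be replayed in order with a small runner. This is a sketch under stated assumptions: the `run_checks` helper is hypothetical and not part of the repository, it only covers a subset of the commands listed in this PR, and it assumes `uv` and `git` are on the PATH.

```python
import subprocess

# Commands copied from the validation list above, split into argv form.
VALIDATION_COMMANDS = [
    ["uv", "run", "pytest", "tests/docs/test_hf_ml_intern_training.py"],
    ["uv", "run", "pytest", "tests/"],
    ["uv", "run", "codelewm", "--help"],
    ["uv", "run", "codelewm", "secret-scan", "docs/benchmark/v0_9", "--json"],
    ["git", "diff", "--check"],
]


def run_checks(commands):
    """Run each command in order and stop at the first non-zero exit code.

    Returns a list of (command, returncode) pairs for the commands that ran.
    """
    results = []
    for command in commands:
        completed = subprocess.run(command)
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break
    return results
```

Calling `run_checks(VALIDATION_COMMANDS)` from the repository root would return one pair per executed command; stopping at the first failure keeps the failing command as the last entry.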

Closes #392.

Publish v0.9 claim audit

4934c89

AbdelStark merged commit c1e1372 into main Jun 7, 2026
9 checks passed

AbdelStark deleted the issue-392-v0-9-claim-audit branch June 7, 2026 03:18

AbdelStark mentioned this pull request Jun 7, 2026

[TRACKER] v0.9 data/eval repair for cross-benchmark correctness claims #385

Closed

15 tasks
