forked from lucas-maes/le-wm
[TRACKER] v0.9 data/eval repair for cross-benchmark correctness claims #385
Closed
Labels
- area:data (Area: data)
- area:evaluation (Area: evaluation)
- area:model (Area: model)
- area:results (Benchmark results, reports, and research evidence)
- effort:l (Large multi-file implementation change)
- priority:p1 (Required for v1.0 or core follow-through)
- spec:rfc-0015 (RFC-0015 v0.7 execution-substrate improvements)
- tracking (Subsystem tracking issue)
- type:tracking (Tracking issue for a spec subsystem)
Goal
Complete the v0.9 data/eval repair cycle end to end. v0.8 proved that the correctness-aware scorer can win on HumanEval WS-D, but it stayed claim-closed because the evidence was not cross-benchmark and several gates were not actually evaluable.
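One way to make "not actually evaluable" explicit is to have each metric return a typed status instead of a bare number. The sketch below is illustrative only and is not taken from the repository: `MetricResult`, its status strings, and the `roc_auc` helper are hypothetical names, and it assumes binary 0/1 pass labels paired with `p_pass`-style scores. It computes ROC-AUC with the rank-sum formulation and reports `not_evaluable` when the split has zero labels or only one class.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MetricResult:
    """Hypothetical container: a metric value plus an evaluability status."""
    status: str                     # "ok" or "not_evaluable"
    value: Optional[float] = None   # set only when status == "ok"
    reason: str = ""                # why the metric could not be computed


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> MetricResult:
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation.

    Assumes labels are 0/1; any label other than 1 counts as negative.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if len(labels) == 0:
        return MetricResult("not_evaluable", reason="zero labels in split")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return MetricResult("not_evaluable", reason="only one class present")

    # Assign 1-based average ranks so that tied scores share the same rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return MetricResult("ok", value=u_stat / (n_pos * n_neg))
```

Returning a status rather than raising or emitting `NaN` lets a report distinguish "the gate failed" from "the gate could not be measured", which is the distinction this tracker draws.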
This tracker turns the v0.8 blockers into an implementation queue before any new GPU run:
- [...] `passed`, `output_magnitude_bucket`, and semantic-decoy categories;
- `p_pass` ROC-AUC and calibration reporting;
Final outcome
Completed on 2026-06-07 by PR #400, merged at `c1e1372a4b3ae19b391a7cda7ce15c775c734485`.
Final reports:
- `docs/benchmark/EXECUTION_V0_9_RESULTS_2026-06-07.md`
- `docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-07.md`
- `docs/cards/codelewm-v0-9-execution-dataset-2026-06-07.md`
- `docs/cards/codelewm-v0-9-execution-model-seed-42-2026-06-07.md`
- `docs/cards/codelewm-v0-9-execution-model-seed-1729-2026-06-07.md`
Claim boundary: v0.9 remains overall claim-closed. Both seeds clear HumanEval WS-D reranking, but MBPP-Plus has zero lift over no-action, broad semantic-decoy and representation gates remain closed, and p-pass calibration is reported from completion-score baselines rather than a standalone serialized `p_pass` key.
Evidence starting point
- `docs/benchmark/EXECUTION_V0_8_RESULTS_2026-06-05.md`.
- `output_magnitude_bucket` was not evaluable because the v0.8 pack val split had zero labels for that target.
- `p_pass` ROC-AUC was not emitted as a first-class eval metric.
Workstreams
- `p_pass` ROC-AUC and calibration reports.
Dependency graph
#387 was the data preflight root. #391 launched only after #387, #388, #389, and #390 landed and showed that the v0.8 blocked gates were evaluable or typed as artifact-backed blockers.
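The launch rule above can be sketched as a small dependency check. This is a minimal illustration using the standard-library `graphlib` module, not the project's actual tooling; `DEPENDS_ON`, `can_launch`, and `launch_order` are hypothetical names. The only edges stated in this tracker are that #391 waits on #387, #388, #389, and #390; treating #387 as a prerequisite of #388 through #390 is an assumption read from "data preflight root".

```python
from graphlib import TopologicalSorter

# Maps each issue number to the set of issues it waits on.
# Stated in the tracker: #391 waits on #387, #388, #389, and #390.
# Assumed (not stated): #388-#390 each wait on the preflight root #387.
DEPENDS_ON = {
    391: {387, 388, 389, 390},
    388: {387},
    389: {387},
    390: {387},
}


def can_launch(issue: int, landed: set[int]) -> bool:
    """Return True when every direct prerequisite of `issue` has landed."""
    return DEPENDS_ON.get(issue, set()) <= landed


def launch_order() -> list[int]:
    """Return one valid ordering with prerequisites before dependents."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

For example, `can_launch(391, {387, 388, 389})` is false because #390 has not landed, and `launch_order()` always places #387 first and #391 last under the assumed edges.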
Global acceptance criteria
- [...] `passed` and `output_magnitude_bucket`.
- `p_pass` ROC-AUC/calibration is emitted as a first-class report and copied into results docs.
- [...] `reports/job_progress.jsonl`, status summarizer, manifest verification, checkpoint inspection, and secret scans.
- [...] `main` with green CI and local validation evidence.
Claim boundary
Do not claim CodeLeWM generally improves coding unless both benchmark-level downstream rerank gates and the required representation/regression gates pass with the required seeds and artifact verification. v0.9 did not clear that bar, so it is published as negative/diagnostic evidence with the HumanEval-only positive slice called out narrowly.
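The all-gates, all-seeds rule above can be sketched as a conjunction over (gate, seed) pairs. The code below is a hypothetical illustration, not the repository's gate logic: the gate names, `GateResult`, `claim_open`, and `closed_gates` are invented for the example, and the required-gate set is an assumption. The seeds 42 and 1729 come from the model card filenames, and the pass/fail pattern follows the v0.9 outcome described in this issue; artifact verification is left out because its status is not stated per gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    gate: str     # hypothetical gate name
    seed: int
    passed: bool


def claim_open(results: list[GateResult], required_gates: set[str],
               required_seeds: set[int]) -> bool:
    """The claim opens only if every required gate passes for every required seed."""
    passed = {(r.gate, r.seed) for r in results if r.passed}
    return all((g, s) in passed for g in required_gates for s in required_seeds)


def closed_gates(results: list[GateResult], required_gates: set[str],
                 required_seeds: set[int]) -> list[str]:
    """List the required gates that are missing a pass for at least one seed."""
    passed = {(r.gate, r.seed) for r in results if r.passed}
    return sorted(g for g in required_gates
                  if any((g, s) not in passed for s in required_seeds))


# Illustrative encoding of the v0.9 outcome; gate names are assumptions.
SEEDS = {42, 1729}
REQUIRED = {"humaneval_wsd_rerank", "mbpp_plus_rerank",
            "semantic_decoy", "representation"}
V0_9_RESULTS = (
    [GateResult("humaneval_wsd_rerank", s, True) for s in sorted(SEEDS)]
    + [GateResult(g, s, False)
       for g in ("mbpp_plus_rerank", "semantic_decoy", "representation")
       for s in sorted(SEEDS)]
)
```

Under this encoding `claim_open(V0_9_RESULTS, REQUIRED, SEEDS)` is false even though the HumanEval gate passes for both seeds, which mirrors why the positive slice is reported narrowly rather than as a general claim.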