[TRACKER] v0.9 data/eval repair for cross-benchmark correctness claims #385

@AbdelStark

Goal

Complete the v0.9 data/eval repair cycle end to end. v0.8 showed that the correctness-aware scorer can win on HumanEval WS-D, but the release stayed claim-closed because the evidence was not cross-benchmark and several gates were not actually evaluable.

This tracker turns the v0.8 blockers into an implementation queue before any new GPU run:

  • cross-benchmark pass/fail training/eval data, not HumanEval-only;
  • stratified split coverage for passed, output_magnitude_bucket, and semantic-decoy categories;
  • first-class held-out p_pass ROC-AUC and calibration reporting;
  • repaired semantic-decoy alignment/coverage;
  • a guarded 2-seed HF Jobs run only after data/eval preflight passes;
  • artifact-backed reports, cards, claim audit, and tracker cleanup.

Final outcome

Completed on 2026-06-07 by PR #400, merged at c1e1372a4b3ae19b391a7cda7ce15c775c734485.

Final reports:

  • docs/benchmark/EXECUTION_V0_9_RESULTS_2026-06-07.md
  • docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-07.md
  • docs/cards/codelewm-v0-9-execution-dataset-2026-06-07.md
  • docs/cards/codelewm-v0-9-execution-model-seed-42-2026-06-07.md
  • docs/cards/codelewm-v0-9-execution-model-seed-1729-2026-06-07.md

Claim boundary: v0.9 remains overall claim-closed. Both seeds clear HumanEval WS-D reranking, but MBPP-Plus has zero lift over no-action, the broad semantic-decoy and representation gates remain closed, and p_pass calibration is reported from completion-score baselines rather than from a standalone serialized p_pass key.

Evidence starting point

  • v0.8 jobs completed and artifacts verified: docs/benchmark/EXECUTION_V0_8_RESULTS_2026-06-05.md.
  • HumanEval WS-D rerank passed both seeds.
  • MBPP-Plus WS-D rerank did not clear.
  • output_magnitude_bucket was not evaluable because the v0.8 pack val split had zero labels for that target.
  • broad semantic-decoy surprise was pair-coverage blocked after aligning the v0.6 decoy pack with the v0.8 pass/fail pack.
  • held-out training-pack p_pass ROC-AUC was not emitted as a first-class eval metric.
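
The output_magnitude_bucket blocker above is the kind of problem a label-coverage preflight can catch before a GPU run. The following is a minimal sketch, not the project's actual preflight: it assumes each pack record is a dict with a hypothetical `split` field, and the bucket values (`small`, `large`) are invented for illustration. Only the target names `passed` and `output_magnitude_bucket` come from this issue.

```python
from collections import Counter


def split_label_coverage(records, targets, splits=("train", "val", "test")):
    """Count non-null label values for every (split, target) pair."""
    counts = {(s, t): Counter() for s in splits for t in targets}
    for rec in records:
        split = rec.get("split")  # hypothetical field name
        if split not in splits:
            continue
        for target in targets:
            value = rec.get(target)
            if value is not None:
                counts[(split, target)][value] += 1
    return counts


def unevaluable_targets(counts, min_classes=2):
    """Return (split, target) pairs with fewer than `min_classes` distinct labels."""
    return sorted(key for key, c in counts.items() if len(c) < min_classes)


# Illustrative records only; the val split has no output_magnitude_bucket labels.
records = [
    {"split": "train", "passed": True, "output_magnitude_bucket": "small"},
    {"split": "train", "passed": False, "output_magnitude_bucket": "large"},
    {"split": "val", "passed": True, "output_magnitude_bucket": None},
    {"split": "val", "passed": False, "output_magnitude_bucket": None},
]
counts = split_label_coverage(
    records, ["passed", "output_magnitude_bucket"], splits=("train", "val")
)
blocked = unevaluable_targets(counts)
print(blocked)  # [('val', 'output_magnitude_bucket')]
```

Requiring at least two distinct label values per split is a design choice in this sketch: a split with one class cannot support ROC-AUC or per-class accuracy, so it is treated the same as a split with no labels.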

Workstreams

Dependency graph

#386 -> #387 -> (#388, #389, #390) -> #391 -> #392

#387 was the data preflight root. #391 launched only after #387, #388, #389, and #390 landed and showed that the v0.8 blocked gates were evaluable or typed as artifact-backed blockers.
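
The launch rule for #391 can be sketched as a check over the dependency graph. This is an illustration only: the edge list is one reading of the arrow notation above, and `transitive_blockers` is a hypothetical helper, not part of the repository.

```python
# Edges read from "#386 -> #387 -> (#388, #389, #390) -> #391 -> #392".
DEPENDS_ON = {
    386: [],
    387: [386],
    388: [387],
    389: [387],
    390: [387],
    391: [388, 389, 390],
    392: [391],
}


def transitive_blockers(issue, landed):
    """Return every upstream issue of `issue` that has not landed yet."""
    pending, seen = set(), set()
    stack = list(DEPENDS_ON[issue])
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep not in landed:
            pending.add(dep)
        stack.extend(DEPENDS_ON[dep])
    return sorted(pending)


# With only #386 to #388 landed, the guarded run is still blocked.
print(transitive_blockers(391, landed={386, 387, 388}))  # [389, 390]
# Once all upstream work has landed, nothing blocks #391.
print(transitive_blockers(391, landed={386, 387, 388, 389, 390}))  # []
```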

Global acceptance criteria

  • stale tracker/roadmap state no longer points agents at closed issues #210 (data: build public-safe 100-example downstream reranking set), #211 (eval: run scaled downstream reranking comparison and claim gate), and #212 ([Tracking] v1.2 next positive-model research hypothesis), or at superseded v0.7 work, as if they were current.
  • v0.9 data/eval preflight produces manifest-backed evidence before GPU launch.
  • cross-benchmark correctness pack covers at least HumanEval and MBPP-Plus and has stratified val/test labels for passed and output_magnitude_bucket.
  • semantic-decoy eval has enough aligned scorable pairs to make the broad surprise gate evaluable or records a typed, artifact-backed blocker.
  • held-out p_pass ROC-AUC/calibration is emitted as a first-class report and copied into results docs.
  • any HF training jobs use the existing observability path: structured job events, reports/job_progress.jsonl, status summarizer, manifest verification, checkpoint inspection, and secret scans.
  • final report states whether v0.9 opens or keeps closed the downstream claim gate; no public claim exceeds the evidence.
  • all PRs are merged to main with green CI and local validation evidence.
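
For the held-out p_pass ROC-AUC/calibration criterion, the two metrics can be computed with nothing beyond the standard library. This is a minimal sketch under stated assumptions: scores are probabilities in [0, 1], labels are 0/1 pass flags, calibration is summarized as expected calibration error (ECE) with equal-width bins, and the function names are hypothetical. The project's actual report format is not specified in this issue.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: P(score of a pass > score of a fail), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return None  # not evaluable: the split holds a single class
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels, scores, n_bins=10):
    """Weighted mean gap between observed pass rate and mean score per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((y, s))
    total = len(labels)
    ece = 0.0
    for bucket in bins:
        if bucket:
            pass_rate = sum(1 for y, _ in bucket if y) / len(bucket)
            mean_score = sum(s for _, s in bucket) / len(bucket)
            ece += len(bucket) / total * abs(pass_rate - mean_score)
    return ece


# Illustrative values only, not results from any run.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7]
print(roc_auc(labels, scores))  # 0.75
print(expected_calibration_error(labels, scores))
```

Returning `None` for a single-class split keeps "not evaluable" distinct from a low score, which matches the distinction this tracker draws between a failed gate and a blocked one. The pairwise AUC is quadratic in the number of examples; a rank-based version would be preferable for large packs.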

Claim boundary

Do not claim CodeLeWM generally improves coding unless both benchmark-level downstream rerank gates and the required representation/regression gates pass with the required seeds and artifact verification. v0.9 did not clear that bar, so it is published as negative/diagnostic evidence with the HumanEval-only positive slice called out narrowly.
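
The boundary above amounts to a conjunction over gates and seeds. The sketch below shows one way to encode it. The gate names, the `GateResult` type, and the `artifact_verified` values are hypothetical. The pass/fail pattern mirrors the outcome described in this issue (HumanEval clears on both seeds, MBPP-Plus and the representation gates do not), and the seeds 42 and 1729 are taken from the model card filenames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    seed: int
    passed: bool
    artifact_verified: bool


def claim_gate_open(results, required_gates, required_seeds):
    """Open only if every required gate passed, verified, on every required seed."""
    ok = {(r.name, r.seed) for r in results if r.passed and r.artifact_verified}
    missing = sorted(
        (gate, seed)
        for gate in required_gates
        for seed in required_seeds
        if (gate, seed) not in ok
    )
    return (not missing), missing


SEEDS = (42, 1729)
GATES = ("humaneval_rerank", "mbpp_plus_rerank", "representation")  # hypothetical names

# Illustrative encoding of the v0.9 outcome described in this issue.
results = [
    GateResult("humaneval_rerank", 42, True, True),
    GateResult("humaneval_rerank", 1729, True, True),
    GateResult("mbpp_plus_rerank", 42, False, True),
    GateResult("mbpp_plus_rerank", 1729, False, True),
    GateResult("representation", 42, False, True),
    GateResult("representation", 1729, False, True),
]
is_open, missing = claim_gate_open(results, GATES, SEEDS)
print(is_open)  # False
print(missing)
```

Returning the list of missing (gate, seed) pairs alongside the boolean makes it straightforward to report a narrow positive slice without opening the overall claim.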

Metadata


Labels

  • area:data (Area: data)
  • area:evaluation (Area: evaluation)
  • area:model (Area: model)
  • area:results (Benchmark results, reports, and research evidence)
  • effort:l (Large multi-file implementation change)
  • priority:p1 (Required for v1.0 or core follow-through)
  • spec:rfc-0015 (RFC-0015 v0.7 execution-substrate improvements)
  • tracking (Subsystem tracking issue)
  • type:tracking (Tracking issue for a spec subsystem)
