[TRACKER] v0.9 data/eval repair for cross-benchmark correctness claims #385

## Goal

Complete the v0.9 data/eval repair cycle end to end. v0.8 showed that the correctness-aware scorer can win on HumanEval WS-D, but the release stayed claim-closed because the evidence covered a single benchmark and several gates could not be evaluated.

This tracker turns the v0.8 blockers into an implementation queue before any new GPU run:

- cross-benchmark pass/fail training/eval data, not HumanEval-only;
- stratified split coverage for `passed`, `output_magnitude_bucket`, and semantic-decoy categories;
- first-class held-out `p_pass` ROC-AUC and calibration reporting;
- repaired semantic-decoy alignment/coverage;
- a guarded 2-seed HF Jobs run only after data/eval preflight passes;
- artifact-backed reports, cards, claim audit, and tracker cleanup.

## Final outcome

Completed on 2026-06-07 by PR #400, merged at `c1e1372a4b3ae19b391a7cda7ce15c775c734485`.

Final reports:

- `docs/benchmark/EXECUTION_V0_9_RESULTS_2026-06-07.md`
- `docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-06-07.md`
- `docs/cards/codelewm-v0-9-execution-dataset-2026-06-07.md`
- `docs/cards/codelewm-v0-9-execution-model-seed-42-2026-06-07.md`
- `docs/cards/codelewm-v0-9-execution-model-seed-1729-2026-06-07.md`

Claim boundary: v0.9 remains overall claim-closed. Both seeds clear HumanEval WS-D reranking, but MBPP-Plus has zero lift over no-action, the broad semantic-decoy and representation gates remain closed, and `p_pass` calibration is reported from completion-score baselines rather than a standalone serialized `p_pass` key.

## Evidence starting point

- v0.8 jobs completed and artifacts verified: `docs/benchmark/EXECUTION_V0_8_RESULTS_2026-06-05.md`.
- HumanEval WS-D rerank passed both seeds.
- MBPP-Plus WS-D rerank did not clear.
- `output_magnitude_bucket` was not evaluable because the v0.8 pack val split had zero labels for that target.
- the broad semantic-decoy surprise gate was blocked on pair coverage after aligning the v0.6 decoy pack with the v0.8 pass/fail pack.
- held-out training-pack `p_pass` ROC-AUC was not emitted as a first-class eval metric.
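The zero-label blocker above is the kind of problem a label-coverage preflight can catch before any GPU time is spent. The following is a minimal sketch, not the project's actual preflight: the record layout (`split`, `passed`, `output_magnitude_bucket` as dictionary keys) and the helper names `label_coverage` and `unevaluable_targets` are assumptions made for illustration.

```python
from collections import Counter


def label_coverage(records, targets, splits=("train", "val", "test")):
    """Count non-missing labels per split, target, and label value."""
    counts = {s: {t: Counter() for t in targets} for s in splits}
    for rec in records:
        split = rec["split"]
        if split not in counts:
            continue  # ignore splits we were not asked about
        for target in targets:
            value = rec.get(target)
            if value is not None:
                counts[split][target][value] += 1
    return counts


def unevaluable_targets(counts, min_classes=2):
    """Return (split, target) pairs with fewer than min_classes distinct labels."""
    return sorted(
        (split, target)
        for split, per_target in counts.items()
        for target, seen in per_target.items()
        if len(seen) < min_classes
    )


# Hypothetical toy pack: the val split carries no bucket labels at all.
records = [
    {"split": "train", "passed": True, "output_magnitude_bucket": "small"},
    {"split": "train", "passed": False, "output_magnitude_bucket": "large"},
    {"split": "val", "passed": True, "output_magnitude_bucket": None},
    {"split": "val", "passed": False, "output_magnitude_bucket": None},
]
counts = label_coverage(
    records, ["passed", "output_magnitude_bucket"], splits=("train", "val")
)
blocked = unevaluable_targets(counts)
print(blocked)
```

A non-empty `blocked` list would be a reason to stop before launch, since a probe or gate on that target cannot be scored on that split.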

## Workstreams

- [x] #386 - v0.9 hygiene: reconcile stale trackers and roadmap queue state.
- [x] #387 - v0.9 data: build cross-benchmark pass/fail execution pack with stratified labels.
- [x] #388 - v0.9 eval: emit held-out `p_pass` ROC-AUC and calibration reports.
- [x] #389 - v0.9 eval: repair semantic-decoy alignment and coverage gates.
- [x] #390 - v0.9 eval: enforce probe-label coverage and representation gates.
- [x] #391 - v0.9 train: guarded 2-seed HF Jobs run after data/eval preflight.
- [x] #392 - v0.9 eval/report: run full gate suite and publish claim audit.

## Dependency graph

```
#386 -> #387 -> (#388, #389, #390) -> #391 -> #392
```

#387 was the data preflight root. #391 launched only after #387, #388, #389, and #390 landed and showed that each gate blocked in v0.8 was either evaluable or recorded as a typed, artifact-backed blocker.
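The ordering above can be checked mechanically. As an illustrative sketch (not part of the project tooling), the standard-library `graphlib.TopologicalSorter` groups the issues into waves that can proceed in parallel; the `deps` mapping is transcribed from the graph above, and the `waves` name is an assumption for this example.

```python
from graphlib import TopologicalSorter

# Map each issue number to the set of issues it depends on.
deps = {
    387: {386},
    388: {387},
    389: {387},
    390: {387},
    391: {388, 389, 390},
    392: {391},
}

sorter = TopologicalSorter(deps)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # issues whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)

for wave in waves:
    print(" , ".join(f"#{n}" for n in wave))
```

The training run (#391) appears in a wave of its own, after the three eval workstreams, which matches the launch rule stated above.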

## Global acceptance criteria

- [x] stale tracker/roadmap state no longer points agents at closed #210/#211/#212 or superseded v0.7 work as if it were current.
- [x] v0.9 data/eval preflight produces manifest-backed evidence before GPU launch.
- [x] cross-benchmark correctness pack covers at least HumanEval and MBPP-Plus and has stratified val/test labels for `passed` and `output_magnitude_bucket`.
- [x] semantic-decoy eval has enough aligned scorable pairs to make the broad surprise gate evaluable or records a typed, artifact-backed blocker.
- [x] held-out `p_pass` ROC-AUC/calibration is emitted as a first-class report and copied into results docs.
- [x] any HF training jobs use the existing observability path: structured job events, `reports/job_progress.jsonl`, status summarizer, manifest verification, checkpoint inspection, and secret scans.
- [x] final report states whether v0.9 opens or keeps closed the downstream claim gate; no public claim exceeds the evidence.
- [x] all PRs are merged to `main` with green CI and local validation evidence.
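For readers unfamiliar with the two held-out `p_pass` metrics named in the criteria, here is a dependency-free sketch of ROC-AUC and one common calibration summary, expected calibration error (ECE). It is an assumption that ECE with equal-width bins resembles what the reports use; the tracker does not specify the calibration metric, and the function names and toy inputs below are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a passing sample outscores a failing one."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    # Ties count as half a win (Mann-Whitney U formulation).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels, probs, n_bins=10):
    """Weighted mean gap between observed pass rate and mean predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        pass_rate = sum(y for y, _ in members) / len(members)
        mean_prob = sum(p for _, p in members) / len(members)
        ece += len(members) / total * abs(pass_rate - mean_prob)
    return ece


# Hypothetical held-out labels (1 = passed) and predicted pass probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
auc = roc_auc(labels, probs)
ece = expected_calibration_error(labels, probs, n_bins=2)
print(f"roc_auc={auc:.3f} ece={ece:.3f}")
```

The explicit `ValueError` mirrors the coverage concern in this tracker: a split with only one class makes ROC-AUC undefined, so the metric should fail loudly rather than emit a placeholder.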

## Claim boundary

Do not claim CodeLeWM generally improves coding unless both benchmark-level downstream rerank gates and the required representation/regression gates pass with the required seeds and artifact verification. v0.9 did not clear that bar, so it is published as negative/diagnostic evidence with the HumanEval-only positive slice called out narrowly.
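The rule above is a conjunction over required gates, which the sketch below makes explicit. It is illustrative only: the `GateResult` type, the `claim_status` helper, and the gate names are hypothetical, and the pass/fail values simply restate the v0.9 outcome described in this tracker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    required: bool = True


def claim_status(gates):
    """Return ("open" | "closed", names of required gates that did not pass)."""
    if not gates:
        # No evidence at all must never open a claim.
        return "closed", ["no gates evaluated"]
    failing = [g.name for g in gates if g.required and not g.passed]
    return ("closed" if failing else "open"), failing


# Hypothetical gate names; outcomes follow the v0.9 summary in this tracker.
gates = [
    GateResult("humaneval_wsd_rerank", True),
    GateResult("mbpp_plus_wsd_rerank", False),
    GateResult("semantic_decoy_surprise", False),
    GateResult("representation", False),
]
status, failing = claim_status(gates)
print(status, failing)
```

A single passing benchmark cannot open the claim under this rule, which is why the HumanEval result is reported only as a narrow positive slice.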

