v0.9 eval/report: run full gate suite and publish claim audit #392

## Parent

#385

## What to build

Evaluate the verified v0.9 checkpoints across the full gate suite, publish tracked eval artifacts, and write the final report/cards/artifact index. This is the issue that decides whether the v0.9 claim opens or remains diagnostic.

## Acceptance criteria

- [ ] retrieval, surprise, probe, held-out `p_pass`, and WS-D rerank evals run for both seeds with required parent manifests.
- [ ] HumanEval and MBPP-Plus downstream tables include rows for CodeLeWM, no-action, shuffled-action, lexical/static, LLM-order, and random, plus confidence intervals and claim decisions.
- [ ] tracked `docs/benchmark/v0_9/` artifacts include manifests, reports, and score rows required to reproduce the report.
- [ ] dataset/model cards and public artifact index list exact artifact ids, checkpoint SHA-256 values, run ids, and claim posture.
- [ ] final report explicitly states whether v0.9 opens the public downstream claim gate or keeps it closed.
- [ ] validation includes full local tests, CI, parent-aware manifest verification, checkpoint inspection, and secret scans.
- [ ] tracker #385 is updated and closed only after this issue merges to `main`.
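
For the checkpoint SHA-256 values listed in the cards and artifact index, a minimal sketch of how a checkpoint digest could be computed and compared is shown below. The function names `sha256_file` and `verify_checkpoint` are hypothetical and not taken from this repository; the project's actual manifest tooling may work differently.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Streaming keeps memory use flat for large checkpoint files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Compare a file's digest to a recorded value (hypothetical helper)."""
    return sha256_file(path) == expected_sha256.strip().lower()
```

Recording the digest alongside the run id in the card lets a reader confirm that a downloaded checkpoint is the one the report evaluated.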

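The issue does not say how the confidence intervals in the downstream tables are computed. As one possible approach, the sketch below uses a percentile bootstrap over per-task pass/fail outcomes; the function name `bootstrap_ci`, the resample count, and the 95% level are all assumptions for illustration.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for a mean pass rate.

    outcomes: list of 0/1 values, one per task (assumed input format).
    Returns (point_estimate, lower, upper).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper
```

Fixing the seed and storing it with the score rows would keep the reported intervals reproducible from the tracked artifacts.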
## Blocked by

#391.
