v0.9 eval/report: run full gate suite and publish claim audit #392

@AbdelStark

Parent

#385

What to build

Evaluate the verified v0.9 checkpoints across the full gate suite, publish tracked eval artifacts, and write the final report, dataset/model cards, and public artifact index. This issue decides whether the v0.9 public downstream claim opens or remains diagnostic.
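
The downstream tables compare CodeLeWM against several baselines and must carry confidence intervals alongside the claim decisions. As a rough illustration only, a paired percentile bootstrap over per-task scores could look like the sketch below. The function name `paired_bootstrap_ci`, its parameters, and the "interval excludes zero" reading are assumptions made for this example; the issue does not specify the statistical procedure or the gate rule the report will use.

```python
import random


def paired_bootstrap_ci(treatment, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (treatment - baseline).

    Hypothetical helper: both inputs are per-task scores in the same task order.
    """
    if len(treatment) != len(baseline) or not treatment:
        raise ValueError("score lists must be non-empty and of equal length")

    # Work on per-task differences so each resample keeps the pairing intact.
    diffs = [t - b for t, b in zip(treatment, baseline)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)

    # Resample tasks with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def interval_excludes_zero(interval):
    """Assumed reading of a 'positive' result: the whole interval lies above zero."""
    lower, _upper = interval
    return lower > 0
```

Under this assumed reading, a comparison against no-action or shuffled-action would count as positive only when the lower bound is above zero. The real claim decision should follow whatever rule the v0.9 report defines.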

Acceptance criteria

  • retrieval, surprise, probe, held-out p_pass, and WS-D rerank evals run for both seeds with required parent manifests.
  • HumanEval and MBPP-Plus downstream tables include CodeLeWM, no-action, shuffled-action, lexical/static, LLM-order, random, confidence intervals, and claim decisions.
  • tracked docs/benchmark/v0_9/ artifacts include manifests, reports, and score rows required to reproduce the report.
  • dataset/model cards and public artifact index list exact artifact ids, checkpoint SHA-256 values, run ids, and claim posture.
  • final report explicitly states whether v0.9 opens or keeps closed the public downstream claim gate.
  • validation includes full local tests, CI, parent-aware manifest verification, checkpoint inspection, and secret scans.
  • tracker #385 ([TRACKER] v0.9 data/eval repair for cross-benchmark correctness claims) is updated, and is closed only after the work for this issue merges to main.
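
Several of the criteria above depend on exact checkpoint SHA-256 values being recorded and rechecked. A minimal sketch of that check, using only the Python standard library, is shown below. The helper names and the `expected` mapping of relative path to hex digest are hypothetical; the project's actual manifest format and verification tooling are not described in this issue.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory flat even for multi-gigabyte checkpoints.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected: dict[str, str], root: Path) -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    `expected` is an assumed mapping of relative path -> hex SHA-256 digest.
    """
    mismatches = []
    for relative_path, expected_digest in sorted(expected.items()):
        candidate = root / relative_path
        if not candidate.is_file():
            mismatches.append(relative_path)
        elif sha256_of_file(candidate) != expected_digest.lower():
            mismatches.append(relative_path)
    return mismatches
```

An empty list from `find_mismatches` means every listed file exists and matches its recorded digest. Any non-empty result would be a reason to stop before publishing cards or the artifact index.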

Blocked by

#391.

Metadata

Assignees

No one assigned

    Labels

    • area:evaluation (Area: evaluation)
    • area:release (Area: release)
    • area:results (Benchmark results, reports, and research evidence)
    • effort:l (Large multi-file implementation change)
    • priority:p1 (Required for v1.0 or core follow-through)
    • spec:rfc-0015 (RFC-0015 v0.7 execution-substrate improvements)
    • type:docs (Documentation work)

