feat(datasets): held-out validation split + runner gate (refs #259) #264

Merged: openclaw-dv merged 2 commits into main from feat/held-out-validation-issue-259 on May 14, 2026

@openclaw-dv
Collaborator

Summary

Adds the foundations for the held-out / private leaderboard cut tracked in #259:

  • synthbench.datasets.split: make_split / is_held_out produce a deterministic, seeded partition over loaded items (sha256 of item_id + ":" + seed, order-independent). Default 25% held-out fraction; canonical pinned seed.
  • CLI gate: synthbench run --held-out evaluates against the held-out cut. The flag requires SYNTHBENCH_HELD_OUT_AUTH to be set; the gate is enforced before any provider/dataset is loaded, so a CI job missing the secret fails in <100ms.
  • Per-dataset allow-list: the runner-side filter applies only to pewtech (Pew ATP) and globalopinionqa, matching the scope in #259. Other datasets behave exactly as before.
  • Packaging gate: env-var lock (not MANIFEST exclude), because the raw eval data is downloaded at runtime by the dataset adapters and was never in the wheel to begin with. pyproject.toml carries a comment locking in that decision.
  • Tests: tests/datasets/test_split.py covers determinism, seed sensitivity, order-independence, coverage tolerance, env-gate failure + success paths, and CLI integration. 13/13 pass; full suite 815 passed + 10 skipped, no regressions.
  • Docs: docs/held-out.md covers the split mechanics, the contributor-verification sketch (score + cell counts, not raw items), the env-gate model, and a migration note for shipped leaderboard rows on Pew ATP / GlobalOpinionQA (the semantics of the default run shift from "all" → "public" for those two datasets only; historical rows are not retroactively rescored).
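
For reviewers who want a mental model of the split without opening the diff, here is a minimal sketch of a hash-based partition in the spirit of the first bullet. It is not the code in synthbench.datasets.split: the signatures, the `DEFAULT_SEED` value, and the way the digest is mapped to a bucket (first 8 bytes compared against a threshold) are all assumptions made for illustration. Only the sha256 of item_id + ":" + seed input and the 25% default come from this PR.

```python
import hashlib

# Hypothetical values for illustration; the real pinned seed is not shown in this PR.
DEFAULT_SEED = "example-seed"
DEFAULT_FRACTION = 0.25


def is_held_out(item_id, seed=DEFAULT_SEED, fraction=DEFAULT_FRACTION):
    """Decide membership from the item's own id and the seed only."""
    digest = hashlib.sha256(f"{item_id}:{seed}".encode("utf-8")).digest()
    # Assumption: compare the first 8 digest bytes (as an unsigned integer)
    # against fraction * 2**64, so roughly `fraction` of ids land in the cut.
    return int.from_bytes(digest[:8], "big") < int(fraction * 2**64)


def make_split(item_ids, seed=DEFAULT_SEED, fraction=DEFAULT_FRACTION):
    """Return (public_ids, held_out_ids), preserving input order within each list."""
    public, held_out = [], []
    for item_id in item_ids:
        target = held_out if is_held_out(item_id, seed, fraction) else public
        target.append(item_id)
    return public, held_out
```

Because each decision depends only on the item id and the seed, shuffling or re-loading the items cannot move an item across the boundary, which is what makes the partition order-independent. The held-out share is only approximately 25% on any finite dataset, hence the coverage-tolerance test.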

Choice: env-gate vs MANIFEST-exclude

Picked env-gate. Reasoning in pyproject.toml + docs/held-out.md:

The raw data values are NOT in this package — dataset adapters download Pew ATP / GlobalOpinionQA at runtime into the user data dir. So there are no files to exclude from the wheel; the held-out items don't get distributed by pip install either way. The env-var lock (SYNTHBENCH_HELD_OUT_AUTH) is the meaningful gate, and it carries forward to the future shared-secret model where the periodic re-eval worker authenticates against the leaderboard server.

If we ever move raw data into the package (currently we don't), the MANIFEST exclude becomes relevant — the doc calls this out as the right escalation path.
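
A rough sketch of the fail-fast gate described above, for readers skimming the description. The function name require_held_out_auth and the env var name come from this PR; the signature, the `HeldOutAuthError` class, the `run` wrapper, and the choice to treat an empty value as unset are assumptions for illustration only.

```python
import os

HELD_OUT_AUTH_ENV = "SYNTHBENCH_HELD_OUT_AUTH"


class HeldOutAuthError(RuntimeError):
    """Hypothetical error type; the real module may raise something else."""


def require_held_out_auth(environ=None):
    """Return the auth value, or raise if the env var is missing or blank."""
    env = os.environ if environ is None else environ
    token = env.get(HELD_OUT_AUTH_ENV, "").strip()
    if not token:
        raise HeldOutAuthError(
            f"--held-out requires {HELD_OUT_AUTH_ENV} to be set (see #259)"
        )
    return token


def run(held_out, environ=None):
    """Hypothetical stand-in for the CLI entry point."""
    if held_out:
        # The check runs first, before any provider or dataset is loaded,
        # so a job without the secret fails without doing expensive work.
        require_held_out_auth(environ)
    # Provider and dataset loading would happen here.
    return "held-out" if held_out else "public"
```

Passing `environ` explicitly is just a convenience for testing the sketch; a default run never touches the gate.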

What's TODO'd

Listed as follow-ups in docs/held-out.md:

  • Server-side periodic re-eval cron.
  • Trust-badge UI (✓/⚠/✗) on leaderboard rows.
  • Δ-threshold calibration for red vs yellow (placeholder of 5% absolute borrowed from private_holdout.SPS_DIVERGENCE_THRESHOLD).
  • Threading the partition through BenchmarkRunner so the CLI doesn't have to do an extra ds.load() pre-flight.
  • The contributor-facing audit UI for "verify my score was computed correctly".
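
To make the Δ-threshold follow-up concrete, here is one possible shape for the badge derivation, written in Python for consistency with the rest of this PR. Everything except the 0.05 placeholder is an assumption: the function name, the use of an absolute difference, and in particular the boundary between ✓ and ⚠, which this PR does not define and which is therefore left as a required argument rather than given a default.

```python
# Placeholder from this PR (5% absolute); calibration is a listed follow-up.
RED_DELTA_THRESHOLD = 0.05


def held_out_badge(sps_published, sps_held_out, yellow_threshold,
                   red_threshold=RED_DELTA_THRESHOLD):
    """Hypothetical badge derivation from the published vs held-out score gap."""
    if sps_held_out is None:
        # No held-out re-eval has run yet, so there is nothing to badge.
        return None
    # Assumption: the gap is measured as an absolute difference.
    delta = abs(sps_published - sps_held_out)
    if delta >= red_threshold:
        return "✗"
    if delta >= yellow_threshold:
        return "⚠"
    return "✓"
```

Keeping both thresholds as parameters means the calibration work can swap in per-dataset values without touching the derivation itself.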

Backward-compat concerns

One semantic change worth highlighting on review:

For pewtech and globalopinionqa, the default behaviour of synthbench run now evaluates the public ~75% cut rather than 100%. Shipped leaderboard rows from before this PR were computed on the full set and remain valid as historical records, but new submissions on those two datasets are not directly comparable. The leaderboard UI should annotate the cutoff; the existing publish pipeline doesn't distinguish "all" vs "public" today, so this is a near-term follow-up. The other seven datasets are unchanged.
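
The behaviour change can be sketched as a runner-side filter. The HELD_OUT_ENABLED_DATASETS name and its two members come from this PR; `select_items`, `_in_held_out`, the seed value, and the bucket mapping are hypothetical. How --held-out behaves on a dataset outside the allow-list is not stated here, so the sketch simply returns every item in that case as an assumption.

```python
import hashlib

HELD_OUT_ENABLED_DATASETS = frozenset({"pewtech", "globalopinionqa"})


def _in_held_out(item_id, seed, fraction=0.25):
    """Hypothetical membership check; see the split module for the real one."""
    digest = hashlib.sha256(f"{item_id}:{seed}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(fraction * 2**64)


def select_items(dataset, item_ids, held_out=False, seed="example-seed"):
    """Pick the items a run would evaluate for the given dataset."""
    if dataset not in HELD_OUT_ENABLED_DATASETS:
        # Datasets outside the allow-list are not partitioned at all.
        return list(item_ids)
    # Default run keeps the public cut; a held-out run keeps the complement.
    return [i for i in item_ids if _in_held_out(i, seed) == held_out]
```

Under this model a default run on pewtech sees roughly three quarters of the items, while the same call on any dataset outside the allow-list still sees all of them, which is why only rows for the two listed datasets stop being directly comparable.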

Closes the implementation half of #259.

Test plan

  • pytest tests/datasets/test_split.py — 13/13 pass
  • pytest tests/ (excluding adversarial) — 815 passed, 10 skipped, no regressions
  • Smoke: from synthbench.datasets.split import make_split, is_held_out; ... round-trip
  • Manual: synthbench run --held-out without env var should exit fast with a clear error pointing at #259 (covered by test_cli_held_out_without_auth_exits_error)

Refs #259

Agent Mani added 2 commits May 1, 2026 21:05
Adds the foundations for the held-out / private leaderboard cut called
out in issue #259:

* New module `synthbench.datasets.split` with `make_split`,
  `is_held_out`, `require_held_out_auth`, and a per-dataset allow-list
  (`HELD_OUT_ENABLED_DATASETS` = pewtech + globalopinionqa).
* CLI flag `synthbench run --held-out` that gates on
  `SYNTHBENCH_HELD_OUT_AUTH` env var and routes the runner to evaluate
  only the held-out cut; default behaviour evaluates the public cut.
* Determinism via `sha256(item_id + ":" + seed)` with a pinned default
  seed; 25% held-out fraction by default.
* Tests covering determinism, reproducibility, coverage tolerance, env
  gate failure + success paths, and CLI integration.
* pyproject note locking in the env-gate distribution model (raw data
  is downloaded at runtime, never bundled in the wheel).
* docs/held-out.md explaining the split, the contributor verification
  sketch, the migration vs shipped scores, and what's a follow-up.

Out of scope (tracked as follow-ups on #259): server-side periodic
re-eval cron, the trust-badge UI, and the Δ-threshold calibration for
red/yellow/green badging.

Refs #259

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Executed-By: mayor
openclaw-dv merged commit c4b6763 into main May 14, 2026
8 of 10 checks passed
openclaw-dv deleted the feat/held-out-validation-issue-259 branch May 14, 2026 16:57
openclaw-dv added a commit that referenced this pull request May 14, 2026
Post-merge fix: PRs #263 and #264 introduced these files unformatted.
CI ruff format --check fails on every PR against main as a result.

semver: patch

Co-authored-by: mayor <mani@Wesleys-Mini.localdomain>
openclaw-dv pushed a commit that referenced this pull request May 14, 2026
Surfaces the leaderboard slot the future server-side held-out re-eval cron
will populate. Distinct from the existing verification_badge (publish-time
private-holdout cheat-detector): this one compares the published SPS
against a server-recomputed SPS over the held-out cut from PR #264.

- LeaderboardEntry gains sps_held_out, sps_held_out_delta,
  held_out_last_run, held_out_badge fields.
- LEADERBOARD_HELD_OUT_DELTA_THRESHOLD constant (0.05 placeholder, borrowed
  from private_holdout.SPS_DIVERGENCE_THRESHOLD) lives in site/src/lib/heldOut.ts
  with the badge-state derivation. Per-dataset calibration is sb-ejk.
- LeaderboardTable.astro renders the badge in desktop + mobile when the
  fields are present; absent fields → no badge.
- HeldOutValidationSection.astro explains the mechanism on the methodology
  page and distinguishes it from the sibling private-holdout section.

Larger follow-ups split into separate beads: sb-thn (cron), sb-8xo
(BenchmarkRunner partition threading), sb-7zi (verify-my-score audit UI),
sb-ejk (Δ-threshold calibration from real data). Foundation (PR #264) is
still unmerged so this PR is deliberately UI-only and works against an
empty data path.

Refs #259
openclaw-dv added a commit that referenced this pull request May 14, 2026
…268)

Co-authored-by: dementus <mani@Wesleys-Mini.localdomain>