feat(datasets): held-out validation split + runner gate (refs #259) by openclaw-dv · Pull Request #264 · DataViking-Tech/SynthBench

openclaw-dv · 2026-05-14T06:14:53Z

Summary

Adds the foundations for the held-out / private leaderboard cut tracked in #259:

synthbench.datasets.split — make_split / is_held_out produce a deterministic, seeded partition over loaded items (sha256 of item_id + ":" + seed, order-independent). Default 25% held-out fraction; canonical pinned seed.
CLI gate — synthbench run --held-out evaluates against the held-out cut. The flag requires SYNTHBENCH_HELD_OUT_AUTH to be set; the gate is enforced before any provider/dataset is loaded so a CI job missing the secret fails in <100ms.
Per-dataset allow-list — runner-side filter applies only to pewtech (Pew ATP) and globalopinionqa, matching the scope in Held-out validation set + private leaderboard (Move 4 — integrity) #259. Other datasets behave exactly as before.
Packaging gate — env-var lock (not MANIFEST exclude) because the raw eval data is downloaded at runtime by the dataset adapters and was never in the wheel to begin with. pyproject.toml carries a comment locking in that decision.
Tests — tests/datasets/test_split.py covers determinism, seed sensitivity, order-independence, coverage tolerance, env-gate failure + success paths, and CLI integration. 13/13 pass; full suite 815 passed + 10 skipped, no regressions.
Docs — docs/held-out.md covers the split mechanics, the contributor-verification sketch (score + cell counts, not raw items), the env-gate model, and a migration note for shipped leaderboard rows on Pew ATP / GlobalOpinionQA (semantics of default run shifts from "all" → "public" for those two datasets only; historical rows are not retroactively rescored).
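To make the split mechanics above concrete, here is a minimal sketch of a hash-based partition. It is not the code from synthbench.datasets.split: the function bodies, the seed value "example-seed", and the way the digest is mapped to a bucket are assumptions for illustration. Only the sha256(item_id + ":" + seed) keying and the 25% default fraction come from the description.

```python
import hashlib

# Hypothetical values: the PR pins a canonical seed but does not state it here.
EXAMPLE_SEED = "example-seed"
DEFAULT_FRACTION = 0.25  # 25% held-out, per the PR description


def is_held_out(item_id, seed=EXAMPLE_SEED, fraction=DEFAULT_FRACTION):
    """Decide membership from the item's own id, so input order is irrelevant."""
    digest = hashlib.sha256(f"{item_id}:{seed}".encode("utf-8")).digest()
    # Assumption: use the first 8 bytes as an unsigned integer and compare it
    # against fraction * 2**64 to avoid float rounding at the boundaries.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)


def make_split(item_ids, seed=EXAMPLE_SEED, fraction=DEFAULT_FRACTION):
    """Return (public, held_out) lists; each item lands in exactly one of them."""
    public, held_out = [], []
    for item_id in item_ids:
        target = held_out if is_held_out(item_id, seed, fraction) else public
        target.append(item_id)
    return public, held_out
```

Because each item is hashed independently, shuffling or extending the input never moves an existing item across the boundary. The realized held-out share only approximates the requested fraction, which is presumably why the tests check coverage within a tolerance.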

Choice: env-gate vs MANIFEST-exclude

Picked env-gate. Reasoning in pyproject.toml + docs/held-out.md:

The raw data values are NOT in this package — dataset adapters download Pew ATP / GlobalOpinionQA at runtime into the user data dir. So there are no files to exclude from the wheel; the held-out items don't get distributed by pip install either way. The env-var lock (SYNTHBENCH_HELD_OUT_AUTH) is the meaningful gate, and it carries forward to the future shared-secret model where the periodic re-eval worker authenticates against the leaderboard server.

If we ever move raw data into the package (currently we don't), the MANIFEST exclude becomes relevant — the doc calls this out as the right escalation path.
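The env-gate described above could look roughly like the following. This is a sketch under stated assumptions, not the shipped implementation: the exception class, the exit code 2, the run() wrapper, and the choice to treat an empty value as unset are all hypothetical. Only the variable name SYNTHBENCH_HELD_OUT_AUTH, the function name require_held_out_auth, and the "check before loading anything" ordering come from the PR text.

```python
import os
import sys

AUTH_ENV_VAR = "SYNTHBENCH_HELD_OUT_AUTH"


class HeldOutAuthError(RuntimeError):
    """Hypothetical error type raised when the held-out gate is not satisfied."""


def require_held_out_auth(environ=None):
    """Return the auth value, or raise if the variable is missing."""
    env = os.environ if environ is None else environ
    # Assumption: a blank value counts as "not set".
    token = env.get(AUTH_ENV_VAR, "").strip()
    if not token:
        raise HeldOutAuthError(
            f"--held-out requires {AUTH_ENV_VAR} to be set (see #259)"
        )
    return token


def run(held_out, environ=None):
    """Hypothetical CLI entry point: the gate runs before any expensive work."""
    if held_out:
        try:
            require_held_out_auth(environ)
        except HeldOutAuthError as exc:
            print(f"error: {exc}", file=sys.stderr)
            return 2  # assumed exit code
    # Providers and datasets would be loaded only after this point.
    return 0
```

Checking the variable first is what lets a CI job without the secret fail quickly, since no provider client or dataset download is started on that path.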

What's TODO'd

Listed as follow-ups in docs/held-out.md:

Server-side periodic re-eval cron.
Trust-badge UI (✓/⚠/✗) on leaderboard rows.
Δ-threshold calibration for red vs yellow (placeholder of 5% absolute borrowed from private_holdout.SPS_DIVERGENCE_THRESHOLD).
Threading the partition through BenchmarkRunner so the CLI doesn't have to do an extra ds.load() pre-flight.
The contributor-facing audit UI for "verify my score was computed correctly".

Backward-compat concerns

One semantic change worth highlighting on review:

For pewtech and globalopinionqa, the default behaviour of synthbench run now evaluates the public ~75% cut rather than 100%. Shipped leaderboard rows from before this PR were computed on the full set and remain valid as historical records, but new submissions on those two datasets are not directly comparable. The leaderboard UI should annotate the cutoff; the existing publish pipeline doesn't distinguish "all" vs "public" today, so this is a near-term follow-up. The other seven datasets are unchanged.
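The semantic change can be illustrated with a small sketch of the runner-side filter. The helper select_items, the seed value, and the bucket mapping are hypothetical. The allow-list name HELD_OUT_ENABLED_DATASETS and its two members come from the commit message. How the real runner treats --held-out on a dataset outside the allow-list is not stated in the PR, so this sketch simply assumes such datasets are passed through unfiltered.

```python
import hashlib

# From the commit message: only these two datasets are partitioned.
HELD_OUT_ENABLED_DATASETS = frozenset({"pewtech", "globalopinionqa"})


def is_held_out(item_id, seed="example-seed", fraction=0.25):
    """Hash-based membership test; seed and bucket mapping are assumptions."""
    digest = hashlib.sha256(f"{item_id}:{seed}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(fraction * 2**64)


def select_items(dataset_name, item_ids, held_out=False):
    """Pick the items a run evaluates for the given dataset and mode."""
    if dataset_name not in HELD_OUT_ENABLED_DATASETS:
        # Assumption: datasets outside the allow-list are never filtered.
        return list(item_ids)
    # Default run keeps the public cut; a held-out run keeps the complement.
    return [i for i in item_ids if is_held_out(i) == held_out]
```

Under this sketch a default run on pewtech sees roughly three quarters of the items, which is why scores computed before and after the change are not directly comparable.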

Closes the implementation half of #259.

Test plan

pytest tests/datasets/test_split.py — 13/13 pass
pytest tests/ (excluding adversarial) — 815 passed, 10 skipped, no regressions
Smoke: from synthbench.datasets.split import make_split, is_held_out; ... round-trip
Manual: synthbench run --held-out without env var should exit fast with a clear error pointing at Held-out validation set + private leaderboard (Move 4 — integrity) #259 (covered by test_cli_held_out_without_auth_exits_error)

Refs #259

Adds the foundations for the held-out / private leaderboard cut called out in issue #259:

* New module `synthbench.datasets.split` with `make_split`, `is_held_out`, `require_held_out_auth`, and a per-dataset allow-list (`HELD_OUT_ENABLED_DATASETS` = pewtech + globalopinionqa).
* CLI flag `synthbench run --held-out` that gates on `SYNTHBENCH_HELD_OUT_AUTH` env var and routes the runner to evaluate only the held-out cut; default behaviour evaluates the public cut.
* Determinism via `sha256(item_id + ":" + seed)` with a pinned default seed; 25% held-out fraction by default.
* Tests covering determinism, reproducibility, coverage tolerance, env gate failure + success paths, and CLI integration.
* pyproject note locking in the env-gate distribution model (raw data is downloaded at runtime, never bundled in the wheel).
* docs/held-out.md explaining the split, the contributor verification sketch, the migration vs shipped scores, and what's a follow-up.

Out of scope (tracked as follow-ups on #259): server-side periodic re-eval cron, the trust-badge UI, and the Δ-threshold calibration for red/yellow/green badging.

Refs #259

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Executed-By: mayor

Post-merge fix: PRs #263 and #264 introduced these files unformatted. CI ruff format --check fails on every PR against main as a result.

semver: patch

Co-authored-by: mayor <mani@Wesleys-Mini.localdomain>

Surfaces the leaderboard slot the future server-side held-out re-eval cron will populate. Distinct from the existing verification_badge (publish-time private-holdout cheat-detector): this one compares the published SPS against a server-recomputed SPS over the held-out cut from PR #264.

- LeaderboardEntry gains sps_held_out, sps_held_out_delta, held_out_last_run, held_out_badge fields.
- LEADERBOARD_HELD_OUT_DELTA_THRESHOLD constant (0.05 placeholder, borrowed from private_holdout.SPS_DIVERGENCE_THRESHOLD) lives in site/src/lib/heldOut.ts with the badge-state derivation. Per-dataset calibration is sb-ejk.
- LeaderboardTable.astro renders the badge in desktop + mobile when the fields are present; absent fields → no badge.
- HeldOutValidationSection.astro explains the mechanism on the methodology page and distinguishes it from the sibling private-holdout section.

Larger follow-ups split into separate beads: sb-thn (cron), sb-8xo (BenchmarkRunner partition threading), sb-7zi (verify-my-score audit UI), sb-ejk (Δ-threshold calibration from real data).

Foundation (PR #264) is still unmerged so this PR is deliberately UI-only and works against an empty data path.

Refs #259

Co-authored-by: dementus <mani@Wesleys-Mini.localdomain>

Agent Mani added 2 commits May 1, 2026 21:05

bd init: initialize beads issue tracking

dafb389

openclaw-dv merged commit c4b6763 into main May 14, 2026
8 of 10 checks passed

openclaw-dv deleted the feat/held-out-validation-issue-259 branch May 14, 2026 16:57

This was referenced May 14, 2026

feat(leaderboard): held-out re-eval trust badge UI scaffold (sb-qu1, Move 4 followup) #268

Merged

fix: ruff format on main (unblocks #265-#268 CI) #269

Merged

openclaw-dv merged 2 commits into main from feat/held-out-validation-issue-259
