feat(datasets): held-out validation split + runner gate (refs #259) #264
Merged
added 2 commits May 1, 2026 21:05

Adds the foundations for the held-out / private leaderboard cut called out in issue #259:

* New module `synthbench.datasets.split` with `make_split`, `is_held_out`, `require_held_out_auth`, and a per-dataset allow-list (`HELD_OUT_ENABLED_DATASETS` = pewtech + globalopinionqa).
* CLI flag `synthbench run --held-out` that gates on `SYNTHBENCH_HELD_OUT_AUTH` env var and routes the runner to evaluate only the held-out cut; default behaviour evaluates the public cut.
* Determinism via `sha256(item_id + ":" + seed)` with a pinned default seed; 25% held-out fraction by default.
* Tests covering determinism, reproducibility, coverage tolerance, env gate failure + success paths, and CLI integration.
* pyproject note locking in the env-gate distribution model (raw data is downloaded at runtime, never bundled in the wheel).
* docs/held-out.md explaining the split, the contributor verification sketch, the migration vs shipped scores, and what's a follow-up.

Out of scope (tracked as follow-ups on #259): server-side periodic re-eval cron, the trust-badge UI, and the Δ-threshold calibration for red/yellow/green badging.

Refs #259

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Executed-By: mayor
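A minimal sketch of how a hash-based split of this kind can work, assuming the `sha256(item_id + ":" + seed)` digest is mapped to a number in [0, 1) and compared against the held-out fraction. The function names `sketch_is_held_out` and `sketch_make_split`, the seed string, and the digest-to-bucket mapping are all hypothetical; the real `make_split` / `is_held_out` signatures and the pinned seed are not shown in this PR text.

```python
import hashlib
from typing import Iterable, List, Tuple

# Hypothetical placeholder: the actual pinned seed is not quoted in the PR.
SKETCH_SEED = "example-seed"
SKETCH_FRACTION = 0.25  # 25% held-out by default, as stated in the commit message


def sketch_is_held_out(item_id: str, seed: str = SKETCH_SEED,
                       fraction: float = SKETCH_FRACTION) -> bool:
    """Return True if the item falls in the held-out cut.

    The decision depends only on (item_id, seed), so it is independent of
    the order in which items are loaded.
    """
    digest = hashlib.sha256(f"{item_id}:{seed}".encode("utf-8")).digest()
    # Assumption: use the first 8 bytes as a uniform value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def sketch_make_split(item_ids: Iterable[str], seed: str = SKETCH_SEED,
                      fraction: float = SKETCH_FRACTION) -> Tuple[List[str], List[str]]:
    """Partition item ids into (public, held_out) lists."""
    public: List[str] = []
    held_out: List[str] = []
    for item_id in item_ids:
        target = held_out if sketch_is_held_out(item_id, seed, fraction) else public
        target.append(item_id)
    return public, held_out
```

Because each item is assigned on its own hash, adding or removing other items never moves an existing item between cuts, and the realised held-out share only approximates 25%, which is presumably why the tests check coverage within a tolerance.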
This was referenced May 14, 2026
openclaw-dv added a commit that referenced this pull request May 14, 2026
openclaw-dv pushed a commit that referenced this pull request May 14, 2026

Surfaces the leaderboard slot the future server-side held-out re-eval cron will populate. Distinct from the existing verification_badge (publish-time private-holdout cheat-detector): this one compares the published SPS against a server-recomputed SPS over the held-out cut from PR #264.

- LeaderboardEntry gains sps_held_out, sps_held_out_delta, held_out_last_run, held_out_badge fields.
- LEADERBOARD_HELD_OUT_DELTA_THRESHOLD constant (0.05 placeholder, borrowed from private_holdout.SPS_DIVERGENCE_THRESHOLD) lives in site/src/lib/heldOut.ts with the badge-state derivation. Per-dataset calibration is sb-ejk.
- LeaderboardTable.astro renders the badge in desktop + mobile when the fields are present; absent fields → no badge.
- HeldOutValidationSection.astro explains the mechanism on the methodology page and distinguishes it from the sibling private-holdout section.

Larger follow-ups split into separate beads: sb-thn (cron), sb-8xo (BenchmarkRunner partition threading), sb-7zi (verify-my-score audit UI), sb-ejk (Δ-threshold calibration from real data).

Foundation (PR #264) is still unmerged so this PR is deliberately UI-only and works against an empty data path.

Refs #259
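The badge-state derivation mentioned above lives in TypeScript (`site/src/lib/heldOut.ts`) and is not reproduced in this thread. As a rough illustration of the idea only, here is a Python sketch that compares a published SPS with a recomputed held-out SPS. The function name, the use of the absolute difference, and the yellow band at twice the threshold are assumptions made for the example; only the 0.05 placeholder, the red/yellow/green states, and the "absent fields → no badge" rule come from the commit messages.

```python
from typing import Optional

# 0.05 is the placeholder value quoted in the commit message; it is expected
# to be replaced by per-dataset calibration later.
SKETCH_DELTA_THRESHOLD = 0.05


def sketch_held_out_badge(sps_published: Optional[float],
                          sps_held_out: Optional[float],
                          threshold: float = SKETCH_DELTA_THRESHOLD) -> Optional[str]:
    """Derive a badge state from the gap between published and held-out SPS.

    Returns None when either score is absent, mirroring "absent fields -> no badge".
    """
    if sps_published is None or sps_held_out is None:
        return None
    delta = abs(sps_published - sps_held_out)
    if delta <= threshold:
        return "green"
    # Assumption: a hypothetical warning band up to 2x the threshold.
    if delta <= 2 * threshold:
        return "yellow"
    return "red"
```

For example, `sketch_held_out_badge(0.80, None)` yields no badge, which matches a leaderboard row the re-eval cron has not yet populated.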
openclaw-dv added a commit that referenced this pull request May 14, 2026

…268)

Surfaces the leaderboard slot the future server-side held-out re-eval cron will populate. Distinct from the existing verification_badge (publish-time private-holdout cheat-detector): this one compares the published SPS against a server-recomputed SPS over the held-out cut from PR #264.

- LeaderboardEntry gains sps_held_out, sps_held_out_delta, held_out_last_run, held_out_badge fields.
- LEADERBOARD_HELD_OUT_DELTA_THRESHOLD constant (0.05 placeholder, borrowed from private_holdout.SPS_DIVERGENCE_THRESHOLD) lives in site/src/lib/heldOut.ts with the badge-state derivation. Per-dataset calibration is sb-ejk.
- LeaderboardTable.astro renders the badge in desktop + mobile when the fields are present; absent fields → no badge.
- HeldOutValidationSection.astro explains the mechanism on the methodology page and distinguishes it from the sibling private-holdout section.

Larger follow-ups split into separate beads: sb-thn (cron), sb-8xo (BenchmarkRunner partition threading), sb-7zi (verify-my-score audit UI), sb-ejk (Δ-threshold calibration from real data).

Foundation (PR #264) is still unmerged so this PR is deliberately UI-only and works against an empty data path.

Refs #259

Co-authored-by: dementus <mani@Wesleys-Mini.localdomain>
Summary
Adds the foundations for the held-out / private leaderboard cut tracked in #259:
- `synthbench.datasets.split`: `make_split` / `is_held_out` produce a deterministic, seeded partition over loaded items (sha256 of `item_id + ":" + seed`, order-independent). Default 25% held-out fraction; canonical pinned seed.
- `synthbench run --held-out` evaluates against the held-out cut. The flag requires `SYNTHBENCH_HELD_OUT_AUTH` to be set; the gate is enforced before any provider/dataset is loaded, so a CI job missing the secret fails in <100ms.
- The held-out split is enabled only for `pewtech` (Pew ATP) and `globalopinionqa`, matching the scope in "Held-out validation set + private leaderboard (Move 4 — integrity)" #259. Other datasets behave exactly as before.
- Distribution follows the env-gate model (raw data is downloaded at runtime, never bundled in the wheel); `pyproject.toml` carries a comment locking in that decision.
- `tests/datasets/test_split.py` covers determinism, seed sensitivity, order-independence, coverage tolerance, env-gate failure + success paths, and CLI integration. 13/13 pass; full suite 815 passed + 10 skipped, no regressions.
- `docs/held-out.md` covers the split mechanics, the contributor-verification sketch (score + cell counts, not raw items), the env-gate model, and a migration note for shipped leaderboard rows on Pew ATP / GlobalOpinionQA (semantics of default `run` shifts from "all" → "public" for those two datasets only; historical rows are not retroactively rescored).

Choice: env-gate vs MANIFEST-exclude
Picked env-gate. The reasoning is in `pyproject.toml` + `docs/held-out.md`.

If we ever move raw data into the package (currently we don't), the MANIFEST exclude becomes relevant; the doc calls this out as the right escalation path.
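To illustrate the env-gate choice, here is a minimal sketch of a fail-fast check that runs before any expensive loading. The names `sketch_require_auth`, `sketch_run`, and `HeldOutAuthError` are hypothetical, and treating an empty value as "not set" is an assumption; the real `require_held_out_auth` signature and error message are not shown in this PR text.

```python
import os
from typing import Mapping, Optional

AUTH_ENV_VAR = "SYNTHBENCH_HELD_OUT_AUTH"  # env var name taken from the PR description


class HeldOutAuthError(RuntimeError):
    """Hypothetical error type raised when the held-out gate is not satisfied."""


def sketch_require_auth(environ: Optional[Mapping[str, str]] = None) -> None:
    """Raise unless the auth variable is present and non-empty (assumption)."""
    env = os.environ if environ is None else environ
    if not env.get(AUTH_ENV_VAR):
        raise HeldOutAuthError(
            f"--held-out requires {AUTH_ENV_VAR} to be set (see #259)"
        )


def sketch_run(held_out: bool, environ: Optional[Mapping[str, str]] = None) -> str:
    """Check the gate first, then pick which cut to evaluate."""
    if held_out:
        # Gate runs before any provider or dataset would be loaded,
        # so a job missing the secret fails immediately.
        sketch_require_auth(environ)
    # Provider/dataset loading would happen here in a real runner.
    return "held-out" if held_out else "public"
```

Putting the check ahead of all loading is what keeps the failure cheap: nothing is downloaded or instantiated when the secret is missing.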
What's TODO'd

Listed as follow-ups in `docs/held-out.md`:

- … `private_holdout.SPS_DIVERGENCE_THRESHOLD`).
- … `BenchmarkRunner` so the CLI doesn't have to do an extra `ds.load()` pre-flight.

Backward-compat concerns
One semantic change worth highlighting on review:

For `pewtech` and `globalopinionqa`, the default behaviour of `synthbench run` now evaluates the public ~75% cut rather than 100%. Shipped leaderboard rows from before this PR were computed on the full set and remain valid as historical records, but new submissions on those two datasets are not directly comparable. The leaderboard UI should annotate the cutoff; the existing publish pipeline doesn't distinguish "all" vs "public" today, so this is a near-term follow-up. The other seven datasets are unchanged.

Closes the implementation half of #259.
Test plan
- `pytest tests/datasets/test_split.py`: 13/13 pass
- `pytest tests/` (excluding adversarial): 815 passed, 10 skipped, no regressions
- `from synthbench.datasets.split import make_split, is_held_out; ...` round-trip
- `synthbench run --held-out` without env var should exit fast with a clear error pointing at "Held-out validation set + private leaderboard (Move 4 — integrity)" #259 (covered by `test_cli_held_out_without_auth_exits_error`)

Refs #259