Site tier-1: open-set banner, sensitivity selector, bootstrap intervals #9
Open
Adds the credibility-tightening tier-1 leaderboard changes from
docs/site_improvements_scope.md, plus a shared sticky header that the
paper page now reuses.
Header
- Extract SiteHeader from Hero. The new component owns the sticky
brand + nav + view-selector + action-link layout and supports an
alwaysExpanded mode for pages without an in-page hero.
- Hero refactored to wrap SiteHeader and pass the country-aware
subtitle, stat strip, and snapshot pill as expandedContent. Drop the
"Top score" stat and the "Leading: <model>" sidebar; the leaderboard
itself is the canonical source for both.
- /paper uses SiteHeader with alwaysExpanded, no view selector, and a
Benchmark action link. The page body keeps its eyebrow/buttons/iframe.
Open-set banner + snapshot pill
- Above the leaderboard, a warning-tinted note states that scenarios
and reference outputs are public, so the public preview is open-set.
- Snapshot date pill (Snapshot 2026-05-01) appears in the hero stat row
on the home page and next to the Manuscript eyebrow on /paper.
Sensitivity-view selector
- New segmented control with five views: Main, Amount only, Binary
only, Positive cases, Zero cases. Selecting a view rescores models
client-side from scenarioPredictions and reorders the leaderboard;
the description for the active view appears next to the selector.
- New utilities under app/src/lib/:
- scoring.ts ports score_single_prediction (mean of exact, within-1%,
within-5%, within-10% for amount; classification accuracy for
binary; output-group resolution for person-expanded variables).
Verified against the canonical analysis.py on the snapshot for
both US and UK headline scopes.
- sensitivity.ts builds the per-row score table from a DashboardBundle
and aggregates output-group means -> country -> global, preserving
the country-equal weighting. Sensitivity views filter rows before
aggregation.
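The scoring and aggregation rules listed above can be sketched roughly as follows. This is an illustration, not the contents of scoring.ts or sensitivity.ts: the names ScoredRow, scoreAmount, scoreBinary, and aggregateGlobal are hypothetical, "within N%" is assumed to mean relative error against the reference value, the zero-reference handling is an assumption, and output-group resolution for person-expanded variables is left out.

```typescript
// Hypothetical row shape; the real DashboardBundle layout is not shown in this PR.
interface ScoredRow {
  country: string;
  outputGroup: string;
  score: number; // between 0 and 1
}

// Amount score: mean of four hit indicators (exact, within 1%, 5%, 10%).
// Assumption: tolerance is relative error against the reference, and a zero
// reference only counts as a hit when the prediction is also zero.
function scoreAmount(predicted: number, reference: number): number {
  const exact = predicted === reference;
  const relErr =
    reference === 0
      ? exact ? 0 : Infinity
      : Math.abs(predicted - reference) / Math.abs(reference);
  const hits = [exact, relErr <= 0.01, relErr <= 0.05, relErr <= 0.1];
  return hits.filter(Boolean).length / hits.length;
}

// Binary score: plain classification accuracy for a single prediction.
function scoreBinary(predicted: boolean, reference: boolean): number {
  return predicted === reference ? 1 : 0;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// rows -> output-group means -> country means -> global mean.
// Every country gets the same weight regardless of how many rows it has.
function aggregateGlobal(rows: ScoredRow[]): number {
  const byCountry: Record<string, Record<string, number[]>> = {};
  for (const r of rows) {
    const groups = (byCountry[r.country] = byCountry[r.country] || {});
    const scores = (groups[r.outputGroup] = groups[r.outputGroup] || []);
    scores.push(r.score);
  }
  const countryMeans = Object.keys(byCountry).map((c) => {
    const groups = byCountry[c];
    return mean(Object.keys(groups).map((g) => mean(groups[g])));
  });
  return mean(countryMeans);
}
```

Under these assumptions a prediction of 103 against a reference of 100 misses the exact and 1% bands but lands in the 5% and 10% bands, so it scores 0.5. A sensitivity view would filter the rows before they reach aggregateGlobal.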
Bootstrap rank intervals
- bootstrap.ts implements the household-resampling bootstrap with a
deterministic mulberry32 RNG (seed 42, 400 draws) and reports the
95% score interval and the rank range for each model under the
active sensitivity view.
- ModelLeaderboard renders Rank N(-M) - 95% L-U next to each model's
point estimate, with a tooltip naming the bootstrap parameters.
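A minimal sketch of a seeded household-resampling bootstrap of the kind described above. The mulberry32 generator is the widely circulated public-domain routine; everything else is assumed for illustration: the function name sketchBootstrap, the flat scores[model][household] layout (the real code aggregates output groups inside each draw), the nearest-rank percentile rule, and using the same percentiles for the rank range.

```typescript
const DEFAULT_DRAWS = 400; // draw count named in this PR
const SEED = 42; // seed named in this PR

// mulberry32: small deterministic 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Interval {
  lo: number;
  hi: number;
  rankLo: number;
  rankHi: number;
}

// Assumed layout: scores[model][household] is that model's score on that household.
function sketchBootstrap(
  scores: Record<string, number[]>,
  draws: number = DEFAULT_DRAWS,
  seed: number = SEED,
): Record<string, Interval> {
  const models = Object.keys(scores);
  const n = scores[models[0]].length;
  const rand = mulberry32(seed);
  const drawScores: Record<string, number[]> = {};
  const drawRanks: Record<string, number[]> = {};
  for (const m of models) {
    drawScores[m] = [];
    drawRanks[m] = [];
  }

  for (let d = 0; d < draws; d++) {
    // Resample household indices with replacement; every model sees the same sample.
    const idx: number[] = [];
    for (let i = 0; i < n; i++) idx.push(Math.floor(rand() * n));
    const means = models.map((m) => idx.reduce((s, i) => s + scores[m][i], 0) / n);
    models.forEach((m, k) => {
      // Rank = 1 + number of models with a strictly higher mean in this draw.
      drawRanks[m].push(1 + means.filter((x) => x > means[k]).length);
      drawScores[m].push(means[k]);
    });
  }

  // Nearest-rank percentile on a sorted copy (an assumption, not the PR's rule).
  const q = (xs: number[], p: number): number => {
    const sorted = xs.slice().sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  };

  const out: Record<string, Interval> = {};
  for (const m of models) {
    out[m] = {
      lo: q(drawScores[m], 0.025),
      hi: q(drawScores[m], 0.975),
      rankLo: q(drawRanks[m], 0.025),
      rankHi: q(drawRanks[m], 0.975),
    };
  }
  return out;
}
```

Because the generator is seeded, two calls with the same inputs return identical intervals, which keeps the rendered Rank N(-M) values stable across page loads.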
Repo
- Move the python wheel-artifact lib/ rule in .gitignore to /lib/ and
/lib64/ (top-level only) so app/src/lib/ is tracked.
Verification
- bun run lint - clean
- bun run build - clean (Next.js 16 production build)
- bun run start - SSR render of / contains the open-set banner, the
snapshot pill, the five sensitivity selector chips, and per-model
Rank/95% interval rows for all 12 models. /paper renders SiteHeader
with the snapshot pill and Benchmark action link, no view selector.
- bootstrap.ts now sums per-row sums and counts directly when aggregating output-group means inside each draw, so the bootstrap estimator matches the canonical headline scoring rule (each row contributes equally to the output-group mean instead of being collapsed to a per-scenario mean first).
- modelScoresForView and bootstrapIntervals require every required country to have rows under the active sensitivity slice before returning a global ranking. ModelLeaderboard falls back to Main when a slice has no rows in one country (e.g. "Binary only" with no UK binary outputs) and surfaces a notice; sensitivity buttons that cannot apply globally are aria-disabled with a tooltip.
- Sensitivity selector and country view selector now expose role, aria-label, and aria-pressed state.
- SiteHeader collapsed nav items and action link are no longer keyboard-focusable while hidden (tabIndex=-1, aria-hidden).
- useScrollProgress no longer subscribes to scroll when alwaysExpanded, and DEFAULT_DRAWS is exported and used as the single source for the bootstrap draw count (400).
- .gitignore restores the Python lib/ blanket ignore and adds an explicit !app/src/lib/ + !app/src/lib/** allowlist so app/src/lib is tracked while nested lib/ directories elsewhere stay ignored.
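The estimator change described above (adding sums and counts directly instead of averaging per-scenario means) matters whenever scenarios contribute different numbers of rows. A small sketch of the difference, using a hypothetical Bucket shape and function names that are not taken from bootstrap.ts:

```typescript
// Hypothetical per-scenario bucket for one output group:
// the sum of its row scores and the number of rows.
interface Bucket {
  sum: number;
  count: number;
}

// Row-equal: add sums and counts across sampled scenarios, then divide once.
function pooledMean(buckets: Bucket[]): number {
  let sum = 0;
  let count = 0;
  for (const b of buckets) {
    sum += b.sum;
    count += b.count;
  }
  return count === 0 ? NaN : sum / count;
}

// Scenario-equal: collapse each scenario to its own mean first, then average.
// This is the behaviour the change moves away from.
function meanOfMeans(buckets: Bucket[]): number {
  const nonEmpty = buckets.filter((b) => b.count > 0);
  if (nonEmpty.length === 0) return NaN;
  return nonEmpty.reduce((s, b) => s + b.sum / b.count, 0) / nonEmpty.length;
}

// One scenario with a single row scoring 1.0, one with three rows scoring 0.0.
const sample: Bucket[] = [
  { sum: 1, count: 1 },
  { sum: 0, count: 3 },
];
const rowEqual = pooledMean(sample); // 1 / 4 = 0.25
const scenarioEqual = meanOfMeans(sample); // (1 + 0) / 2 = 0.5
```

With equal row counts per scenario the two estimators agree; they diverge only when person-expanded outputs give some scenarios more rows than others.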
- Sensitivity buttons that are aria-disabled for the Global view no longer fire onClick, and aria-pressed is force-cleared on those buttons so they cannot simultaneously claim selected and disabled state. Cursor changes to not-allowed too.
- Auto-fallback notice quotes the slice label (e.g. "Binary only") instead of inlining a lower-cased phrase, so proper-noun feel is preserved.
Sensitivity buttons that are unavailable for the Global view now use the native disabled attribute (which removes them from the tab order and lets the browser ignore Enter/Space presses), instead of relying on aria-disabled + an undefined onClick. The aria-pressed force-clear is preserved so the button never claims selected and disabled simultaneously.
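One way to express the disabled/pressed rule described above is a pure helper that derives the button attributes, which a component could then spread onto a native button element. The view identifiers, the helper names, and the tooltip wording below are hypothetical, not taken from the PR.

```typescript
// Hypothetical identifiers for the five sensitivity views.
type ViewId = "main" | "amount_only" | "binary_only" | "positive" | "zero";

interface SensitivityButtonProps {
  disabled: boolean;
  "aria-pressed": boolean;
  title?: string;
}

// Derive attributes for one sensitivity button.
// An unavailable view is natively disabled and never reports itself as pressed.
function sensitivityButtonProps(
  view: ViewId,
  active: ViewId,
  availableGlobally: boolean,
): SensitivityButtonProps {
  if (!availableGlobally) {
    return {
      disabled: true,
      "aria-pressed": false, // force-cleared so selected and disabled never coexist
      title: "Not available for the Global view", // assumed tooltip text
    };
  }
  return { disabled: false, "aria-pressed": view === active };
}

// The leaderboard scores with Main whenever the requested view cannot apply globally.
function effectiveView(requested: ViewId, availableGlobally: boolean): ViewId {
  return availableGlobally ? requested : "main";
}
```

Using the native disabled attribute means no click handler needs to be guarded: the browser drops the button from the tab order and ignores Enter and Space on its own.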
navVisible was set unconditionally on the home page even while the in-page nav had opacity:0 / max-width:0, leaving the nav links keyboard-focusable in the collapsed state. Tie navVisible to the same navOpacity > 0.05 threshold the visual hide uses, so the links stay out of the tab order until they are actually visible. Also dedupes navOpacity (was declared twice).
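The fix above amounts to deriving focusability from the same opacity value that drives the visual hide. A sketch under that reading, with a hypothetical navState helper and return shape (only the 0.05 threshold comes from the commit message):

```typescript
const NAV_VISIBLE_THRESHOLD = 0.05; // threshold named in the commit message

interface NavState {
  navVisible: boolean;
  tabIndex: 0 | -1;
  ariaHidden: boolean;
}

// Links only join the tab order once the nav has faded in past the threshold,
// so the collapsed (opacity 0, max-width 0) state is not keyboard-focusable.
function navState(navOpacity: number): NavState {
  const navVisible = navOpacity > NAV_VISIBLE_THRESHOLD;
  return {
    navVisible,
    tabIndex: navVisible ? 0 : -1,
    ariaHidden: !navVisible,
  };
}
```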
Summary
Implements the tier-1 leaderboard improvements from
docs/site_improvements_scope.md, plus a shared sticky header that the /paper page now reuses (matching the home page chrome, without the Global/US/UK view selector). Round-2 review feedback is incorporated: the bootstrap estimator now matches the canonical row-equal scoring rule, and the global view automatically falls back to Main when an active sensitivity slice has no rows in one country.
Header
- Extract SiteHeader from Hero. The new component owns the sticky brand + nav + view-selector + action-link layout and supports an alwaysExpanded mode for pages that don't drive their own collapse.
- Hero refactored to wrap SiteHeader and pass the country-aware subtitle, stat strip, and snapshot pill as expandedContent. Drops the Top score stat and the Leading: <model> sidebar; the leaderboard itself is the canonical source for both.
- /paper uses SiteHeader with alwaysExpanded, no view selector, and a Benchmark action link. The page body keeps its eyebrow/buttons/iframe; the inline H1 is gone since the header carries the brand.
- Collapsed nav items and the action link are no longer keyboard-focusable while hidden (tabIndex={-1}, aria-hidden); the scroll listener short-circuits on alwaysExpanded.
Open-set banner and snapshot pill
- A role="note" warning-tinted banner above the leaderboard.
- Snapshot date pill (Snapshot 2026-05-01) on the home stat row and next to the Manuscript eyebrow on /paper.
Sensitivity-view selector
- New segmented control with five views: Main / Amount only / Binary only / Positive cases / Zero cases. Selecting a view rescores models client-side from scenarioPredictions and reorders the table; the description for the active view appears inline. The control is role="group" with aria-labelledby; each button has aria-pressed.
- When viewSupportsGlobal returns false, the leaderboard falls back to the Main view with a role="note" notice, and the offending button becomes aria-disabled with the click handler short-circuited so it cannot simultaneously be pressed and disabled.
- New utilities under app/src/lib/:
- scoring.ts ports score_single_prediction (mean of exact / within-1% / within-5% / within-10% for amount outputs; classification accuracy for binary; output-group resolution for person-expanded variables). Verified against canonical analysis.py for both countries' top models.
- sensitivity.ts builds the per-row score table from a DashboardBundle and aggregates output-group means → country → global, preserving country-equal weighting. Exposes viewSupportsGlobal for global-validity checks.
Bootstrap rank intervals
- bootstrap.ts implements household-resampling with a deterministic mulberry32 RNG (seed 42, DEFAULT_DRAWS = 400). Inside each draw it adds bucket sum and count directly across sampled scenarios so each row contributes equally to the output-group mean, matching the headline modelScoresForView estimator instead of collapsing each scenario to a per-scenario mean first.
- ModelLeaderboard renders Rank N(-M) · 95% L-U next to each model's point estimate. Sample current Global view: Rank 1 has 95% CI ~74.7–79.6; Rank 2-3 cluster ~73.5–78.7. Tooltip names the bootstrap parameters.
- app/src/data.json: US gpt-5.5 90.03 / grok-4.20 89.29 / gemini-3.1-pro-preview 88.21; UK gpt-5.5 77.18 / gemini-3.1-pro-preview 76.18 / grok-4.20 75.07; Global gpt-5.5 83.60 / gemini-3.1-pro-preview 82.20 / grok-4.20 82.18.
Repo
- .gitignore keeps the Python lib/ blanket ignore and adds an explicit !app/src/lib/ + !app/src/lib/** allowlist so app/src/lib/ is tracked while nested lib/ directories elsewhere stay ignored.
Verification
- bun run lint - clean
- bun run build - clean (Next.js 16 production build)
- bun run start - SSR HTML on / contains the open-set banner with role="note", the snapshot pill, the sensitivity selector with aria-pressed × 5 and role="group", the country view selector with aria-pressed × 3 and role="group", per-model bootstrap intervals, and the "by PolicyEngine" pill. /paper renders the same chrome without the view selector.
- Scoring verified against policybench/analysis.py for the snapshot's top-5 models per country.
Test plan
- /: confirm header drops "Top score" / "Leading" and the snapshot pill is visible
- Positive cases → leaderboard reorders and intervals refresh
- Binary only while on Global → leaderboard falls back to Main with a notice; the Binary-only button is aria-disabled
- /paper → same sticky header style without a view selector
🤖 Generated with Claude Code