Skip to content

M1.3 — Emit schema-valid results and render certificates on the live leaderboard #3

Description

@GiggleLiu

Background

New here? Read #9 first — it explains the project and defines every term below.

When a bug-hunting session (#2) finishes, it has a set of verified bugs — each a counterexample certificate (a JSON describing one bug) — plus cost/token accounting. This issue turns that into (a) a saved results file in a fixed, checkable format and (b) a leaderboard web page that shows real numbers and lets a visitor inspect each bug. Right now the leaderboard displays hardcoded fake sample data (DeepSeek, GPT-4o, etc. with made-up numbers); that must be replaced with real data loaded from the results files.

Objective

Save each run as one format-checked JSON file, build the index file the website reads, and make the leaderboard render each bug as a readable, re-checkable counterexample — replacing today's mock data.

Interface (Input → Output)

  • Input: the per-bug results of a model run (rule, verdict, certificate, cost/tokens).
  • Output: results/<model>-<version>.json (conforms to benchmark/results.schema.json) + results/index.json (the list the site loads), rendered by leaderboard/index.html.

Technical recommendations (suggestions)

  • Define benchmark/results.schema.json: model, library version, summary metrics (bugs found, total cost, total tokens, bugs/1000-tokens, bugs/$), and per-bug rows that include the certificate.
  • benchmark/build_index.py scans results/*.json → writes results/index.json.
  • In leaderboard/index.html, delete the hardcoded sample array (~lines 61–68) and instead fetch('../results/index.json'); add a per-bug view that shows the certificate (source → target → solutions → verdict).
  • make demo runs a small real session (M1.2 — Build the bug-hunting session: explore → certify → verify, under budget #2) and produces all of the above.

Verification (how a reviewer confirms this is done)

  1. Run make demo → it finishes cleanly and creates results/index.json.
  2. Open the leaderboard page next to the JSON file → the bug shown has the same numbers and details as the JSON.
  3. Prove the page uses real data, not hardcoded values: rename results/index.json and reload → the rows disappear (a hardcoded page wouldn't change).
  4. Prove the format check actually catches problems: take a results file, delete one required field, run make validate-results → it fails and names the missing field.

Dependencies

Depends on #2 (which produces the certificates + accounting).

Out of scope

The full public website with two ranking metrics and trajectory views (#6); automatic deployment (#7).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions