New here? Read #9 first — it explains the project and defines every term below.
Background
When a bug-hunting session (#2) finishes, it has a set of verified bugs — each a counterexample certificate (a JSON describing one bug) — plus cost/token accounting. This issue turns that into (a) a saved results file in a fixed, checkable format and (b) a leaderboard web page that shows real numbers and lets a visitor inspect each bug. Right now the leaderboard displays hardcoded fake sample data (DeepSeek, GPT-4o, etc. with made-up numbers); that must be replaced with real data loaded from the results files.
Objective
Save each run as one format-checked JSON file, build the index file the website reads, and make the leaderboard render each bug as a readable, re-checkable counterexample — replacing today's mock data.
Interface (Input → Output)
Input: the per-bug results of a model run (rule, verdict, certificate, cost/tokens).
Output: results/<model>-<version>.json (conforms to benchmark/results.schema.json) + results/index.json (the list the site loads), rendered by leaderboard/index.html.
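To make the input → output mapping concrete, here is a minimal sketch of assembling one run document and writing it to results/<model>-<version>.json. The field names (`library_version`, `summary`, `bugs_found`, `total_cost_usd`, `bugs_per_1k_tokens`, `bugs_per_usd`) and the helpers `build_run` and `write_run` are hypothetical; the real names are whatever benchmark/results.schema.json ends up defining. The sketch also assumes the `bugs` list already contains only verified bugs, and treats `<version>` in the filename as an opaque string.

```python
import json
from pathlib import Path


def build_run(model, library_version, bugs, total_cost_usd, total_tokens):
    """Assemble one run document (hypothetical field names)."""
    found = len(bugs)  # assumes `bugs` holds verified bugs only
    summary = {
        "bugs_found": found,
        "total_cost_usd": total_cost_usd,
        "total_tokens": total_tokens,
        # Guard against division by zero for empty or free runs.
        "bugs_per_1k_tokens": found * 1000 / total_tokens if total_tokens else 0.0,
        "bugs_per_usd": found / total_cost_usd if total_cost_usd else 0.0,
    }
    return {
        "model": model,
        "library_version": library_version,
        "summary": summary,
        "bugs": bugs,  # each row: rule, verdict, certificate
    }


def write_run(results_dir, version, run):
    """Write the run to <results_dir>/<model>-<version>.json and return the path."""
    path = Path(results_dir) / f"{run['model']}-{version}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(run, indent=2) + "\n", encoding="utf-8")
    return path
```

Storing the derived ratios next to the raw totals lets the page display them without recomputing, at the cost of having to keep them consistent; the format check could recompute and compare them.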
Technical recommendations (suggestions)
Define benchmark/results.schema.json: model, library version, summary metrics (bugs found, total cost, total tokens, bugs/1000-tokens, bugs/$), and per-bug rows that include the certificate.
benchmark/build_index.py scans results/*.json → writes results/index.json.
In leaderboard/index.html, delete the hardcoded sample array (~lines 61–68) and instead fetch('../results/index.json'); add a per-bug view that shows the certificate (source → target → solutions → verdict).
make demo runs a small real session (#2) and produces all of the above.
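The index-building step could look roughly like the sketch below. It is only an illustration: the function name `build_index`, the `{"runs": [...]}` layout of index.json, and the field names `model`, `library_version`, and `summary` are assumptions, not something this issue fixes. One detail worth getting right in any version is that a results/*.json glob also matches index.json itself, so the index has to be skipped.

```python
import json
from pathlib import Path


def build_index(results_dir):
    """Scan <results_dir>/*.json and write <results_dir>/index.json."""
    results_dir = Path(results_dir)
    entries = []
    for path in sorted(results_dir.glob("*.json")):
        if path.name == "index.json":
            continue  # never index the index itself
        run = json.loads(path.read_text(encoding="utf-8"))
        # Copy only what the leaderboard table needs (hypothetical fields);
        # the per-bug view can load the full file via "file".
        entries.append({
            "file": path.name,
            "model": run["model"],
            "library_version": run["library_version"],
            "summary": run["summary"],
        })
    index = {"runs": entries}
    (results_dir / "index.json").write_text(
        json.dumps(index, indent=2) + "\n", encoding="utf-8"
    )
    return index
```

Sorting the paths keeps the output deterministic, so running the script twice on the same results produces an identical index.json.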
Verification (how a reviewer confirms this is done)
Run make demo → it finishes cleanly and creates results/index.json.
Open the leaderboard page next to the JSON file → the bug shown has the same numbers and details as the JSON.
Prove the page uses real data, not hardcoded values: rename results/index.json and reload → the rows disappear (a hardcoded page wouldn't change).
Prove the format check actually catches problems: take a results file, delete one required field, run make validate-results → it fails and names the missing field.
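The last check above requires the validator to name the missing field, not just fail. A minimal standard-library sketch of that behaviour follows; the required-field lists and the function `missing_fields` are hypothetical, and a real make validate-results would more likely validate against benchmark/results.schema.json with a JSON Schema validator.

```python
# Hypothetical required fields; the real list comes from results.schema.json.
REQUIRED_TOP = ("model", "library_version", "summary", "bugs")
REQUIRED_SUMMARY = ("bugs_found", "total_cost_usd", "total_tokens")
REQUIRED_BUG = ("rule", "verdict", "certificate")


def missing_fields(run):
    """Return dotted paths of required fields absent from a run document."""
    missing = [key for key in REQUIRED_TOP if key not in run]
    # Only descend into sections that exist, so one missing section
    # is reported once instead of once per nested field.
    if "summary" in run:
        missing += [
            f"summary.{key}" for key in REQUIRED_SUMMARY if key not in run["summary"]
        ]
    for i, bug in enumerate(run.get("bugs", [])):
        missing += [f"bugs[{i}].{key}" for key in REQUIRED_BUG if key not in bug]
    return missing
```

A command-line wrapper could print each returned path and exit non-zero when the list is non-empty, which is what the reviewer step looks for.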
Dependencies
Depends on #2 (which produces the certificates + accounting).
Out of scope
The full public website with two ranking metrics and trajectory views (#6); automatic deployment (#7).