M1.3 — Emit schema-valid results and render certificates on the live leaderboard

## Background
> New here? Read **#9** first — it explains the project and defines every term below.

When a bug-hunting session (#2) finishes, it has a set of verified bugs — each a **counterexample certificate** (a JSON describing one bug) — plus cost/token accounting. This issue turns that into (a) a saved results file in a fixed, checkable format and (b) a leaderboard web page that shows **real** numbers and lets a visitor inspect each bug. Right now the leaderboard displays hardcoded *fake* sample data (DeepSeek, GPT-4o, etc. with made-up numbers); that must be replaced with real data loaded from the results files.

## Objective
Save each run as one format-checked JSON file, build the index file the website reads, and make the leaderboard render each bug as a readable, re-checkable counterexample — replacing today's mock data.

## Interface (Input → Output)
- **Input:** the per-bug results of a model run (rule, verdict, certificate, cost/tokens).
- **Output:** `results/<model>-<version>.json` (conforms to `benchmark/results.schema.json`) + `results/index.json` (the list the site loads), rendered by `leaderboard/index.html`.

## Technical recommendations (suggestions)
- Define `benchmark/results.schema.json`: model, library version, summary metrics (bugs found, total cost, total tokens, bugs/1000-tokens, bugs/$), and per-bug rows that include the certificate.
- `benchmark/build_index.py` scans `results/*.json` → writes `results/index.json`.
- In `leaderboard/index.html`, delete the hardcoded sample array (~lines 61–68) and instead `fetch('../results/index.json')`; add a per-bug view that shows the certificate (source → target → solutions → verdict).
- `make demo` runs a small real session (#2) and produces all of the above.

## Verification (how a reviewer confirms this is done)
1. Run `make demo` → it finishes cleanly and creates `results/index.json`.
2. Open the leaderboard page next to the JSON file → the bug shown has the **same numbers and details as the JSON**.
3. **Prove the page uses real data, not hardcoded values:** rename `results/index.json` and reload → the rows **disappear** (a hardcoded page wouldn't change).
4. **Prove the format check actually catches problems:** take a results file, delete one required field, run `make validate-results` → it **fails and names the missing field**.

## Dependencies
Depends on #2 (which produces the certificates + accounting).

## Out of scope
The full public website with two ranking metrics and trajectory views (#6); automatic deployment (#7).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

M1.3 — Emit schema-valid results and render certificates on the live leaderboard #3

Background

Objective

Interface (Input → Output)

Technical recommendations (suggestions)

Verification (how a reviewer confirms this is done)

Dependencies

Out of scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

M1.3 — Emit schema-valid results and render certificates on the live leaderboard #3

Description

Background

Objective

Interface (Input → Output)

Technical recommendations (suggestions)

Verification (how a reviewer confirms this is done)

Dependencies

Out of scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions