AI CSS Battle Benchmark
Measures how well LLMs can reproduce pixel-perfect CSS targets from CSS Battle. Run multiple models against the same targets and compare scores, match rates, and cost on the dashboard.
- Docker Desktop (running, Linux containers mode)
- API key for at least one provider (OpenRouter, OpenAI, or Ollama)
```
cp .env.example .env
# Add your API key(s) to .env
npm run dev
```

Open http://localhost:5173 for the dashboard.
The easiest way is the + Run tab in the dashboard — pick a model and provider, then hit Start. You can launch multiple runs in parallel, and the model field autocompletes previously used models, filtered by the selected provider.
Alternatively via CLI:
```
docker compose run runner \
  --model openai/gpt-4o \
  --provider openrouter \
  --attempts 3
```

CLI options:
| Flag | Default | Description |
|---|---|---|
| `--model` | — | Model ID (required), e.g. `openai/gpt-4o` |
| `--provider` | `openrouter` | `openrouter` \| `openai` \| `ollama` |
| `--targets` | `battle` | `battle` \| `daily` |
| `--target-id` | — | Run a single target by ID |
| `--attempts` | `3` | Attempts per target (best score counts) |
| `--prompt` | `v1`\* | Prompt version (`v1`, `v2`, …) |
| `--concurrency` | `1` | Run N targets in parallel |
| `--retries` | `0` | Retry a target if all attempts error |
| `--reasoning` | — | Reasoning effort for o-series models: `low` \| `medium` \| `high` |

\*Set `PROMPT_VERSION=v2` in `.env` to change the default.
Resume and target-range controls are available in the dashboard (+ Run tab).
You can force OpenRouter provider routing per model via a local config file:
```
cp config/openrouter.providers.example.json config/openrouter.providers.json
```

Default lookup path: `./config/openrouter.providers.json`
Optional override: `OPENROUTER_PROVIDER_CONFIG_PATH=...`
Config shape:
```json
{
  "modelProviderOverrides": {
    "openai/gpt-5-mini": "openai",
    "moonshotai/*": "io.net",
    "anthropic/claude-3.7-sonnet": {
      "order": ["anthropic"],
      "allow_fallbacks": false
    }
  }
}
```

Rules:
- `"<provider>"` forces a single provider (`allow_fallbacks: false`).
- `["a", "b"]` sets provider order (`allow_fallbacks: false`).
- `{ ... }` passes a raw OpenRouter `provider` object through unchanged.
- `"vendor/*"` applies to all models with that prefix (for example `moonshotai/*`).
- Matching priority is: exact model > longest `vendor/*` prefix.
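The matching rules above could be sketched as follows (`resolveOverride` is a hypothetical helper for illustration, not part of the repo's API):

```javascript
// Exact model id wins; otherwise the longest matching "vendor/*" key applies.
function resolveOverride(overrides, modelId) {
  if (modelId in overrides) return overrides[modelId]; // exact match first
  let bestKey = null;
  for (const key of Object.keys(overrides)) {
    if (!key.endsWith('/*')) continue;
    const prefix = key.slice(0, -1); // drop the "*", keep the trailing "/"
    if (modelId.startsWith(prefix) && (!bestKey || key.length > bestKey.length)) {
      bestKey = key;
    }
  }
  return bestKey ? overrides[bestKey] : null; // null: default routing applies
}
```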
- The model receives the target image + canvas size + colors as context
- It generates an HTML/CSS solution (no JS, SVG, or external resources)
- The solution is rendered in headless Chromium at the exact canvas size (Quirks Mode)
- The render is pixel-diffed against the target using pixelmatch (threshold 0.01)
- A score is calculated from pixel match rate and code length
Score formula (CSS Battle): `399.99725 × 0.9905144^charCount + 599.9987`
For imperfect matches the score is multiplied by match³:
| Match | Multiplier |
|---|---|
| 100 % | 1.000× — full score |
| 99 % | 0.970× |
| 95 % | 0.857× |
| 80 % | 0.512× |
| 50 % | 0.125× |
Color accuracy matters far more than code length. Only 100 % pixel matches count as perfect.
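Putting the base formula and the multiplier together, a sketch in JavaScript (`scoreFor` is a hypothetical name; `match` is the pixel match rate as a fraction between 0 and 1):

```javascript
function scoreFor(charCount, match) {
  const base = 399.99725 * Math.pow(0.9905144, charCount) + 599.9987; // CSS Battle formula
  return base * Math.pow(match, 3); // match³ multiplier; 1.0 leaves the score intact
}
```

Note that 0.99³ ≈ 0.970 and 0.5³ = 0.125, which is where the multiplier column above comes from.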
```
packages/
  core/       Renderer (Puppeteer) + Scorer (pixelmatch) + LLM adapters
  runner/     CLI benchmark orchestrator
  api/        Express REST API + SSE progress stream
  dashboard/  React + Vite dashboard (local + public build)
  db/         SQLite adapter (built-in node:sqlite) + Supabase sync
targets/
  images/       PNG reference images (battle + daily)
  definitions/  Target metadata (colors, dimensions)
baselines/
  human.json  Human expert top scores (reference baseline)
prompts/
  v1/  Original benchmark prompt
  v2/  Improved prompt (better color accuracy guidance)
scripts/
  upload-results.js      Upload local SQLite results → Supabase (done rows only)
  download-results.js    Download results Supabase → local SQLite
  upload-targets.js      Seed battle/daily targets in Supabase
  sync-targets.js        Sync target definitions + images from Supabase
  export-human-stats.js  Export compact human baseline stats from Supabase leaderboard rows
  recalculate-scores.js  Recompute match% + scores for all stored runs
```
Results can be synced bidirectionally between local SQLite and Supabase via the ⇅ Sync tab or CLI scripts:
```
npm run upload              # local SQLite → Supabase (only rows with status='done')
npm run download            # Supabase → local SQLite
npm run upload-targets      # seed battle_targets / daily_targets in Supabase
npm run sync                # sync targets + images from Supabase locally
npm run export-human-stats  # export baselines/human_stats.json from Supabase leaderboard rows
```

Generate `baselines/human_stats.json` from a Supabase leaderboard relation:

```
npm run export-human-stats
```

Optional overrides:

```
node --env-file=.env scripts/export-human-stats.js \
  --source=battle_target_leaderboard_current_entries \
  --output=baselines/human_stats.json \
  --max-per-target=100
```

Queue state (pending / running / waiting / paused / error attempts) never leaves the local process — only completed `done` rows are synced.
Configure `SUPABASE_RESULTS_URL` and `SUPABASE_RESULTS_KEY` in `.env`. Run `packages/db/schema.sql` once in your Supabase project to set up the schema.
A read-only public variant (Leaderboard, Targets, Insights, About) is automatically built and deployed to GitHub Pages on every version tag (v*.*.*).
To trigger a deployment, push a tag:
```
git tag v1.0.0
git push --tags
```

Required GitHub Secrets: `VITE_SUPABASE_URL`, `VITE_SUPABASE_ANON_KEY`.
GitHub Pages source must be set to GitHub Actions (repo Settings → Pages).
To build locally:
```
cd packages/dashboard
# Add to .env.public.local:
# VITE_SUPABASE_URL=https://xxx.supabase.co
# VITE_SUPABASE_ANON_KEY=eyJ...
npm run build:public   # output → dist-public/
```

The benchmark runner is built around a single table (`runs`) that doubles as a persistent attempt queue. Every `(run_id, target_id, attempt)` combination is pre-inserted before any work starts and moves through these statuses:
```
waiting → pending → running → done | error | paused
```
- `waiting` — follow-up attempt (n ≥ 2), blocked until the previous attempt for the same target finishes `done`.
- `pending` — claim-ready. A worker may pick it up.
- `running` — claimed by a worker; protected with a `claim_token` so a pause or re-claim can't be overwritten by a stale worker.
- `done` — complete. Only `done` rows appear in the leaderboard (grouped by `model + reasoning_effort`), model-level insights, the History view and the Supabase upload.
- `error` — non-terminal. The row stays visible in the Queue view with a Retry button per attempt, plus Reset-all-errors per run.
- `paused` — set by Cancel. The original status is saved in `paused_from` so Resume restores the row exactly.
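As a rough illustration, the legal transitions could be encoded as a lookup table. The exact edge set is an assumption pieced together from the status descriptions; in particular, `paused` rows are restored from `paused_from` rather than via a fixed edge:

```javascript
// Hypothetical transition table; not taken verbatim from the codebase.
const TRANSITIONS = {
  waiting: ['pending', 'paused'],                  // unblocked by previous attempt, or cancelled
  pending: ['running', 'paused'],                  // claimed by a worker, or cancelled
  running: ['done', 'error', 'paused', 'pending'], // pending = crash recovery on restart
  error:   ['pending'],                            // retry / reset-errors
  paused:  [],                                     // resumed via paused_from, not a fixed edge
};

const canTransition = (from, to) => (TRANSITIONS[from] ?? []).includes(to);
```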
A `runs_summary` view aggregates per-run status with priority `paused > running > error > queued > done` and powers the Queue / History split in the dashboard.
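That priority collapse can be sketched in a few lines (assuming the view buckets `waiting`/`pending` attempts as `queued`; `summarize` is a hypothetical helper):

```javascript
// Highest-priority status present wins: paused > running > error > queued > done.
const PRIORITY = ['paused', 'running', 'error', 'queued', 'done'];

function summarize(statuses) {
  for (const s of PRIORITY) {
    if (statuses.includes(s)) return s;
  }
  return 'done'; // empty input: treat as complete
}
```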
Workers claim the next `pending` row atomically with `BEGIN IMMEDIATE` +
`UPDATE ... RETURNING`. Ordering is FIFO over `(enqueued_at, id)` across
all runs, so a resumed run re-enters at the back of the queue. Each active
worker pool is run-scoped and only claims rows for its own `run_id`.
On server startup, any leftover `running` rows (from a crashed process)
are flipped back to `pending` and their claim tokens are cleared.
API surface
- `POST /api/runs/start` — new run or fill-run. Pre-enqueues all attempts.
- `POST /api/runs/:runId/cancel` — pauses the run (abort + `paused_from`).
- `POST /api/runs/:runId/resume` — restores the pre-pause state and bumps `enqueued_at` to now; accepts optional JSON body `{ "concurrency": <n> }`.
- `POST /api/runs/attempts/:id/retry` — single `error` → `pending`.
- `POST /api/runs/:runId/reset-errors` — bulk `error` → `pending`.
- `GET /api/runs/queue` — everything not-yet-done, with attempts nested.
- `GET /api/runs/history` — done-only runs, newest finish first.
- `GET /api/runs/:runId/progress` — SSE, used by the Run tab.
Queue state is local to each runner process — only done rows are ever
synced to Supabase.
```
# Single file
node --test packages/db/adapters/sqlite/runs.test.js

# All tests
node --test packages/**/*.test.js
```