Memebench is a public preference benchmark for LLM-generated memes. Models are given a real news headline, asked to make a classic image-macro meme, and ranked from blind public A/B votes.
Live benchmark: memebench.net
Memebench is not a scientific benchmark. It measures one narrow thing: whether a model can make a news meme that survives contact with public voters.
- Recent English-language headlines are collected from public RSS feeds and filtered for meme potential.
- Eligible models receive the same headline context and the same meme-template snapshot.
- Each model has to inspect templates, render drafts through the benchmark tools (which use imgflip for meme rendering), and submit exactly one meme.
- Visitors vote blind between two memes for the same headline.
- Accepted votes are converted into pairwise comparisons and fitted into a rolling leaderboard.
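As a rough illustration of the last step, the sketch below turns a list of votes into a table of pairwise win counts. The `Vote` shape, the outcome names, and the choice to split a tie as half a win per side and to skip “both bad” votes are assumptions made for this example; they are not the benchmark's actual schema or rules, which are described in docs/BENCHMARK.md.

```typescript
// Hypothetical vote shape for illustration; the real schema may differ.
type VoteOutcome = "left" | "right" | "tie" | "both_bad";

interface Vote {
  leftModel: string;
  rightModel: string;
  outcome: VoteOutcome;
}

// wins[a][b] = (possibly fractional) count of times model a was preferred over model b.
type WinTable = Record<string, Record<string, number>>;

function addWin(wins: WinTable, winner: string, loser: string, weight: number): void {
  const row = wins[winner] ?? (wins[winner] = {});
  row[loser] = (row[loser] ?? 0) + weight;
}

function tallyVotes(votes: Vote[]): WinTable {
  const wins: WinTable = {};
  for (const vote of votes) {
    switch (vote.outcome) {
      case "left":
        addWin(wins, vote.leftModel, vote.rightModel, 1);
        break;
      case "right":
        addWin(wins, vote.rightModel, vote.leftModel, 1);
        break;
      case "tie":
        // Assumption: a tie is split as half a win for each side.
        addWin(wins, vote.leftModel, vote.rightModel, 0.5);
        addWin(wins, vote.rightModel, vote.leftModel, 0.5);
        break;
      case "both_bad":
        // Assumption: skipped in this sketch; the benchmark handles these votes separately.
        break;
    }
  }
  return wins;
}
```

The resulting table is the kind of input a pairwise preference model can be fitted to.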
The leaderboard uses a Bradley-Terry preference model over recent votes, with separate handling for ties, “both bad” votes, and model-accountable generation failures.
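For readers unfamiliar with Bradley-Terry, here is a minimal sketch of fitting such a model to pairwise win counts with the standard iterative (minorization-maximization) update. It is not the benchmark's implementation: it ignores the rolling window, ties, “both bad” votes, and generation failures. The function names, the `WinTable` shape, and the small `prior` (a virtual opponent of fixed strength 1 that keeps strengths finite for models with all wins or all losses) are assumptions for illustration.

```typescript
// wins[a][b] = number of times model a was preferred over model b (hypothetical input shape).
type WinTable = Record<string, Record<string, number>>;

function winsOf(wins: WinTable, a: string, b: string): number {
  return wins[a]?.[b] ?? 0;
}

// Returns a positive strength per model; a higher value means the model is preferred more often.
// `prior` must be positive: it adds `prior` wins and `prior` losses against a virtual
// opponent of strength 1 (an assumption of this sketch, not part of the source).
function fitBradleyTerry(wins: WinTable, iterations = 500, prior = 0.1): Record<string, number> {
  const strength: Record<string, number> = {};
  for (const a of Object.keys(wins)) {
    strength[a] = 1;
    for (const b of Object.keys(wins[a] ?? {})) strength[b] = 1;
  }
  const models = Object.keys(strength);

  for (let step = 0; step < iterations; step++) {
    const next: Record<string, number> = {};
    for (const i of models) {
      const si = strength[i] ?? 1;
      let won = prior;
      let denom = (2 * prior) / (si + 1);
      for (const j of models) {
        if (i === j) continue;
        const games = winsOf(wins, i, j) + winsOf(wins, j, i);
        if (games === 0) continue;
        won += winsOf(wins, i, j);
        denom += games / (si + (strength[j] ?? 1));
      }
      // MM update: total wins divided by the expected-games term.
      next[i] = won / denom;
    }
    // Apply all updates at once so every model uses the same previous strengths.
    for (const m of models) strength[m] = next[m] ?? 1;
  }
  return strength;
}

// Under Bradley-Terry, the probability that a model with strength a beats one with strength b.
function winProbability(a: number, b: number): number {
  return a / (a + b);
}
```

Sorting models by fitted strength gives a ranking that accounts for opponent strength: a win against a strong opponent raises a model's strength more than a win against a weak one.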
See docs/BENCHMARK.md for the full methodology.
A higher rank means voters recently preferred that model's submitted memes more often after accounting for opponent strength and model-accountable non-submissions.
It does not mean the model is better at general reasoning, coding, factual accuracy, safety, long-form writing, or anything else outside this benchmark.
Content note: generated memes may be crude, political, dark, rude, wrong, tasteless, or simply unfunny. “Both bad” is part of the voting interface for a reason.
- apps/web: Public Svelte app and admin UI
- apps/api: HTTP API
- apps/worker: Daily headline, generation, and ranking jobs
- packages/core: Shared server-side config, DB, repositories, and jobs
- packages/contracts: Shared API schemas and TypeScript types
- packages/e2e: Mock-stack end-to-end tests
- packages/e2e-mocks: Mock provider services for E2E and local E2E runs
- docs/: Project documentation
Requirements: Node.js 24.x, pnpm 10.x, and Docker with Docker Compose.
cp .env.example .env
./local.sh up
Then open:
- public site: http://localhost:5173
- leaderboard: http://localhost:5173/leaderboard
- API health check: http://localhost:3000/healthz
The local stack can run without real production secrets. Live meme generation requires API credentials for OpenRouter and imgflip.
For local code checks, the main commands are:
See docs/DEVELOPMENT.md for setup details, local stack modes, and test commands.
Memebench is currently a one-person side project. Issues and small focused pull requests are welcome, especially for bugs, documentation, tests, UI polish, and benchmark-methodology clarity.
Please keep changes scoped and run the relevant checks before opening a pull request.
Memebench has recurring upkeep costs: server hosting, storage, and daily model inference.
Support is possible through Buy Me a Coffee. You don't get anything for your money, though, beyond me continuing to be able to run this page :)
This repository is licensed under the MIT License, except for the vendored font files under apps/web/static/fonts; see the font-specific license files there.
Memebench is developed with the help of agentic AI coding tools. All code is reviewed, edited, tested, and maintained by me. All decisions and any mistakes are mine.