Memebench


Memebench is a public preference benchmark for LLM-generated memes. Models are given a real news headline, asked to make a classic image-macro meme, and ranked from blind public A/B votes.

Live benchmark: memebench.net

Memebench is not a scientific benchmark. It measures one narrow thing: whether a model can make a news meme that survives contact with public voters.

How it works

  1. Recent English-language headlines are collected from public RSS feeds and filtered for meme potential.
  2. Eligible models receive the same headline context and the same meme-template snapshot.
  3. Each model inspects the available templates, renders drafts through the benchmark tools (which use imgflip for meme rendering), and submits exactly one meme.
  4. Visitors vote blind between two memes for the same headline.
  5. Accepted votes are converted into pairwise comparisons and fitted into a rolling leaderboard.
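The vote handling in steps 4 and 5 can be sketched as follows. This is a hypothetical standalone sketch, not the repository's implementation; the type and field names are illustrative assumptions.

```typescript
// Hypothetical sketch: turning one raw blind A/B vote into a pairwise
// comparison record, covering the tie and "both bad" cases.
type Vote = {
  left: string;   // model id shown on the left
  right: string;  // model id shown on the right
  choice: "left" | "right" | "tie" | "both_bad";
};

type Pairwise = {
  winner: string | null;
  loser: string | null;
  tie: boolean;
  bothBad: boolean;
};

function toPairwise(v: Vote): Pairwise {
  switch (v.choice) {
    case "left":
      return { winner: v.left, loser: v.right, tie: false, bothBad: false };
    case "right":
      return { winner: v.right, loser: v.left, tie: false, bothBad: false };
    case "tie":
      return { winner: null, loser: null, tie: true, bothBad: false };
    case "both_bad":
      return { winner: null, loser: null, tie: false, bothBad: true };
  }
}
```

Keeping ties and "both bad" as distinct outcomes, rather than discarding them, is what allows the ranking step to treat them separately.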

The leaderboard uses a Bradley-Terry preference model over recent votes, with separate handling for ties, “both bad” votes, and model-accountable generation failures.
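For illustration, a basic Bradley-Terry fit can be done with the classic minorization-maximization update. This is a minimal sketch of the general technique only; it ignores the benchmark's handling of ties, "both bad" votes, and generation failures.

```typescript
// Minimal Bradley-Terry strength fit via MM iteration (illustrative sketch,
// not the repository's implementation).
// wins[i][j] = number of times model i beat model j.
function fitBradleyTerry(wins: number[][], iters = 200): number[] {
  const n = wins.length;
  let p: number[] = new Array(n).fill(1);
  for (let t = 0; t < iters; t++) {
    const next = p.slice();
    for (let i = 0; i < n; i++) {
      let wi = 0;    // total wins for model i
      let denom = 0; // sum over opponents j of n_ij / (p_i + p_j)
      for (let j = 0; j < n; j++) {
        if (i === j) continue;
        wi += wins[i][j];
        denom += (wins[i][j] + wins[j][i]) / (p[i] + p[j]);
      }
      next[i] = denom > 0 ? wi / denom : p[i];
    }
    const sum = next.reduce((a, b) => a + b, 0);
    p = next.map((v) => (v * n) / sum); // normalize so strengths average 1
  }
  return p;
}
```

Under this model, the probability that model i beats model j is p_i / (p_i + p_j), so fitted strengths translate directly into head-to-head win probabilities.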

See docs/BENCHMARK.md for the full methodology.

Reading the leaderboard

A higher rank means voters recently preferred that model's submitted memes more often after accounting for opponent strength and model-accountable non-submissions.

It does not mean the model is better at general reasoning, coding, factual accuracy, safety, long-form writing, or anything else outside this benchmark.

Content note: generated memes may be crude, political, dark, rude, wrong, tasteless, or simply unfunny. “Both bad” is part of the voting interface for a reason.

Repository layout

apps/web              Public Svelte app and admin UI
apps/api              HTTP API
apps/worker           Daily headline, generation, and ranking jobs
packages/core         Shared server-side config, DB, repositories, and jobs
packages/contracts    Shared API schemas and TypeScript types
packages/e2e          Mock-stack end-to-end tests
packages/e2e-mocks    Mock provider services for E2E and local E2E runs
docs/                 Project documentation

Running locally

Requirements: Node.js 24.x, pnpm 10.x, and Docker with Docker Compose.

cp .env.example .env
./local.sh up

Then open the local web app in your browser.

The local stack can run without real production secrets. Live meme generation requires API credentials for OpenRouter and imgflip.

For local code checks, setup details, local stack modes, and test commands, see docs/DEVELOPMENT.md.

Documentation

Project documentation lives in docs/; see docs/BENCHMARK.md for the benchmark methodology and docs/DEVELOPMENT.md for development setup.

Contributing

Memebench is currently a one-person side project. Issues and small focused pull requests are welcome, especially for bugs, documentation, tests, UI polish, and benchmark-methodology clarity.

Please keep changes scoped and run the relevant checks before opening a pull request.

Support

Memebench has recurring upkeep costs: server hosting, storage, and daily model inference.

Support is possible through Buy Me a Coffee. You don't get anything for your money, though, beyond me being able to keep running this site :)

License

This repository is licensed under the MIT License, except for the vendored font files under apps/web/static/fonts; see the font-specific license files there.

AI usage transparency note

Memebench is developed with the help of agentic AI coding tools. All code is reviewed, edited, tested, and maintained by me. All decisions and any mistakes are mine.
