Local LLM N-vs-N Benchmark (LM Studio GGUFs via llama.cpp)

A small harness that serves GGUF models from ~/.lmstudio/models/ through a llama-swap proxy in front of llama.cpp podman containers, then benchmarks them on quality, latency, and cost (estimated energy). It supports any number of models, ranked either by round-robin tournament (pairwise_all) or by absolute rubric (scored), and emits JSON, Markdown, and a single-file HTML dashboard.

Matrix: tasks × prompt_variants × models.
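
As a rough illustration of how the matrix grows, the sketch below enumerates the cells and counts tournament pairings. The prompt-variant and model names are hypothetical placeholders, and the assumption of one judge call per model pair per (task, prompt_variant) slice is ours; the real counts depend on config.yaml and the number of dataset items.

```python
from itertools import combinations, product

# Task families named in this README; variant and model names are hypothetical.
tasks = ["qa", "code", "summarize", "classify"]
prompt_variants = ["plain", "terse"]
models = ["model-a", "model-b", "model-c"]

# One generation cell per (task, prompt_variant, model) triple.
cells = list(product(tasks, prompt_variants, models))

# A round-robin tournament compares every unordered pair of models.
pairs = list(combinations(models, 2))

# Assumed here: one judge call per pair per (task, prompt_variant) slice.
judge_calls = len(tasks) * len(prompt_variants) * len(pairs)

print(len(cells))   # 24
print(len(pairs))   # 3
print(judge_calls)  # 24
```

The pair count is N(N-1)/2, which is why the tournament suits small N while the rubric mode, with one judge call per answer, grows linearly.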

Documentation

Start here:

HUMANS.md — operators & developers: prerequisites, install, run, configure, troubleshoot, clean up.
AGENTS.md — LLMs & contributors: design invariants, hardware caveats, judge-mode selection, editing conventions. CLAUDE.md is a symlink here.
CONTRIBUTING.md — opening a PR: pre-commit checklist, commit style, what not to change without discussing.

Reference:

config.yaml — single source of truth for server, models, prompts, dataset, judge, cost, output. Inline comments describe every knob.

Design choices (explicit)

| Concern | Choice | Reason |
| --- | --- | --- |
| Serving | `llama-swap` proxy in front of `podman` + `ghcr.io/ggml-org/llama.cpp:server-vulkan` | `llama-swap` owns model lifecycle (boot/unload on demand). The Vulkan image works on AMD (Strix Halo tested), NVIDIA, Intel without per-backend wrangling. No `llama-server` binary ships in LM Studio's own `~/.lmstudio/extensions/backends/*/`, only `.so` libraries for LM Studio's internal runtime. |
| One model at a time | `llama-swap` unloads current backend before starting next; runner iterates per-model-sequentially on top of that | Unified-memory APUs (and modest-VRAM discrete GPUs) can't hold A + B + judge concurrently. Each model pays exactly one swap, absorbed by warmup. |
| Transport | OpenAI-compatible `/v1/chat/completions` | llama.cpp server exposes it; one client class works for Ollama, vLLM, LM Studio, llama.cpp. |
| Quality scorer | `pairwise_all` tournament (default) or `scored` 1-5 rubric via LLM judge; plus heuristic (`exact` / `contains` / `regex`) for structured tasks | Tournament gives sharp ranking on small N (2-4); rubric scales linearly for larger N. Heuristics catch hard ground-truth items without a judge round trip. |
| Pairwise positional-bias mitigation | Order randomized per call (seeded from `run.seed`); swapped verdicts inverted before counting | Judges show a 5-15% preference for slot A; flipping per call averages it out across the matrix. `order: "AB" \| "BA"` is stored on every judgement. |
| Default context | `4096` | Benchmark prompts are short; keeping `ctx` small cuts load time + memory. Override per model or in `server.ctx`. |
| "Cost" for local models | Energy estimate via `nvidia-smi --query-gpu=power.draw` or `rocm-smi --showpower` sampled at call start + end; average × wall time × `$/kWh` | No per-token price for local. Energy is the honest cost axis. |
| Cost fallback | `energy_wh` / `cost_usd` = `null` when neither tool works | No silent substitution with latency. On Strix Halo, `rocm-smi` often fails on `libdrm_amdgpu.so`; expect `null`. |
| Dataset | Generated from seeded templates across `qa` / `code` / `summarize` / `classify` | Simple stack, no external corpus. Deterministic via `run.seed`. |
| Stack | Python + `httpx` + `pyyaml`, stdlib for everything else | No promptfoo / lm-eval / framework. |
| Dashboard | One static HTML file with Chart.js via CDN, reads embedded run JSON | No build step, opens from disk. |
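
The positional-bias row can be pictured with a minimal sketch. This is not the harness's actual code: `judge_pair`, `always_slot_a`, and the model names are hypothetical, and the stand-in judge is deliberately biased so the effect is visible. Mapping the slot verdict back to the model that occupied the slot has the same effect as inverting swapped verdicts before counting.

```python
import random
from collections import Counter

def judge_pair(model_x, model_y, judge_fn, rng):
    """Show the pair in a random slot order, then map the verdict back to a model."""
    order = "AB" if rng.random() < 0.5 else "BA"
    slot_a, slot_b = (model_x, model_y) if order == "AB" else (model_y, model_x)
    verdict = judge_fn(slot_a, slot_b)  # "A", "B", or "tie", relative to the slots
    if verdict == "tie":
        winner = None
    else:
        winner = slot_a if verdict == "A" else slot_b
    # The order is kept on the record so the judgement can be audited later.
    return {"order": order, "verdict": verdict, "winner": winner}

def always_slot_a(slot_a, slot_b):
    """Hypothetical worst-case judge that always prefers whatever sits in slot A."""
    return "A"

rng = random.Random(42)  # stands in for a generator seeded from run.seed
judgements = [
    judge_pair("model-x", "model-y", always_slot_a, rng) for _ in range(1000)
]
wins = Counter(j["winner"] for j in judgements)
print(wins)  # the two counts come out close to an even split
```

Even with a judge that is fully biased toward slot A, neither model gains an advantage, because each one occupies slot A in roughly half of the calls.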
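
The energy arithmetic from the cost rows can be worked through as follows. This is a sketch under stated assumptions: `estimate_energy` and its parameter names are hypothetical, and the two power readings are taken as already-parsed watt values (or `None` when neither tool produced one) rather than read from `nvidia-smi` or `rocm-smi` here.

```python
def estimate_energy(power_start_w, power_end_w, wall_time_s, usd_per_kwh):
    """Estimate energy (Wh) and cost (USD) for one call from two power samples."""
    # No usable reading: report null rather than substituting another metric.
    if power_start_w is None or power_end_w is None:
        return {"energy_wh": None, "cost_usd": None}
    avg_power_w = (power_start_w + power_end_w) / 2
    energy_wh = avg_power_w * wall_time_s / 3600  # W x s -> Wh
    cost_usd = energy_wh / 1000 * usd_per_kwh     # Wh -> kWh, then price
    return {"energy_wh": energy_wh, "cost_usd": cost_usd}

# 100 W at call start, 140 W at call end, 30 s wall time, 0.30 USD per kWh.
result = estimate_energy(100.0, 140.0, 30.0, 0.30)
print(result["energy_wh"])  # 1.0
print(result["cost_usd"])   # about 0.0003

# When neither sampling tool works, both fields stay None.
print(estimate_energy(None, None, 30.0, 0.30))
```

Two samples give only a coarse average for a call with bursty power draw, so the result is best read as an estimate for comparing models rather than a metered figure.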

Layout

config.yaml          the contract
bin/llama-swap.sh    proxy launcher (up / down / sweep / wait)
bench/               clients · dataset · download · llama_swap · metrics · runner · report
run.sh               uv venv + install + pinned llama-swap bootstrap + run
.cache/              vendored llama-swap binary + generated proxy config (gitignored)
datasets/            generated inputs (gitignored)
results/             run-<ts>.json / .md / .html (gitignored)

Per-module breakdown with behavior notes: AGENTS.md § Layout.

Quick start

./run.sh fetch    # pull GGUFs referenced in config.yaml from Hugging Face
./run.sh          # dataset + all phases + reports

For prerequisites, configuration, and troubleshooting see HUMANS.md.
