bellwether

The cost-and-failure-mode benchmark for LLM agents. Methodology plus Python package for honest, reproducible cross-provider agent evaluation.

Live leaderboard · Methodology

Why I built this. I kept running into the same procurement question (which provider for THIS task, at THIS cost, with THESE failure modes) and the existing public benchmarks didn't answer it. So I built the toolkit and methodology I wished existed.

Why

Cross-provider LLM benchmarks today rank capability ("which model is smarter on average"). HELM and Chatbot Arena own that ground.

Practitioners building production systems need a different answer: which provider for THIS task, at THIS cost when retries and failures are accounted for, with THESE failure modes that map to my product's tolerance.

bellwether answers the procurement question and ships the toolkit anyone can run on their own prompts.

What it measures

  • effective_TCoT: total cost per successfully completed task, including the cost of failed retries. The procurement-question metric, not the average-quality one.
  • Failure-mode taxonomy: classify how models fail, not just whether (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). Maps to product-tolerance decisions.
  • Machine-checkable ground truth only. No LLM-as-judge. Sidesteps the well-documented judge-bias issue.
  • Prompt portability. Headline numbers use one canonical prompt across providers; portability cost (tuned vs canonical) is a v1 promise with a real contract.
  • Critique-Pass track (optional, methodology v0.2+). Wraps each attempt with a single self-revision step under a locked canonical prompt, then reports the per-model delta in effective_TCoT and success_rate. This turns the cost-vs-accuracy tradeoff of "ask the model to review itself" into a procurement-grade number rather than folklore. See METHODOLOGY.md s13.
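To make the first metric concrete, here is a minimal sketch. It assumes effective_TCoT is total spend (failed attempts included) divided by the number of tasks that eventually succeeded. The `Attempt` record and the `effective_tcot` function are hypothetical names for illustration, not bellwether's API; METHODOLOGY.md holds the authoritative formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One model call for one task instance (hypothetical record, for illustration)."""
    task_id: str
    cost_usd: float
    succeeded: bool


def effective_tcot(attempts: list[Attempt]) -> float:
    """Total spend, failed retries included, divided by the number of solved tasks."""
    total_cost = sum(a.cost_usd for a in attempts)
    solved = {a.task_id for a in attempts if a.succeeded}
    if not solved:
        return float("inf")  # nothing completed: cost per success is unbounded
    return total_cost / len(solved)


attempts = [
    Attempt("t1", 0.010, False),  # a failed first try still costs money
    Attempt("t1", 0.010, True),   # the retry succeeds
    Attempt("t2", 0.010, True),
    Attempt("t3", 0.010, False),  # t3 never succeeds
    Attempt("t3", 0.010, False),
]
print(f"{effective_tcot(attempts):.3f}")  # prints 0.025
```

Five calls at $0.01 each buy two completed tasks, so the effective cost is $0.025 per success, 2.5 times the per-call price.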

See METHODOLOGY.md for formulas, retry policy, validator contract, and reproducibility caveats.
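The failure-mode taxonomy and the machine-checkable ground truth listed under "What it measures" can be sketched together. The enum below carries the eight modes named there. The `classify` function and its rules are simplified assumptions for a JSON extraction task and cover only four of the eight modes; the real validator contract is the one in METHODOLOGY.md.

```python
import json
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    # The eight modes named in the taxonomy; the string values are illustrative.
    REFUSAL = "refusal"
    CONFABULATION = "confabulation"
    SCHEMA_BREAK = "schema_break"
    TRUNCATION = "truncation"
    PARTIAL = "partial"
    OFF_TASK = "off_task"
    TIMEOUT = "timeout"
    ERROR = "error"


def classify(raw: str, expected: dict, hit_token_limit: bool = False) -> Optional[FailureMode]:
    """Return None on success, else a failure mode. Rules are simplified assumptions."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable output: blame truncation only if the call hit its token limit.
        return FailureMode.TRUNCATION if hit_token_limit else FailureMode.SCHEMA_BREAK
    if not isinstance(parsed, dict) or set(parsed) - set(expected):
        return FailureMode.SCHEMA_BREAK   # wrong shape or unexpected keys
    if set(expected) - set(parsed):
        return FailureMode.PARTIAL        # some required fields are missing
    if parsed != expected:
        return FailureMode.CONFABULATION  # well-formed but wrong values
    return None


expected = {"invoice_id": "A-17", "total": 42.5}
print(classify('{"invoice_id": "A-17", "total": 42.5}', expected))  # None
print(classify('{"invoice_id": "A-17"}', expected))                 # FailureMode.PARTIAL
print(classify('{"invoice_id": "A-17", "total": 99.0}', expected))  # FailureMode.CONFABULATION
```

Every branch is a deterministic comparison against known ground truth, so no judge model is involved.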

Install

From source:

git clone https://github.com/cartesianxr7/bellwether
cd bellwether
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env       # add provider API keys (see below)
pre-commit install         # optional, gates secret leaks
pytest                     # 161 tests; all should pass

Or from PyPI:

pip install bellwether

Provider keys are read from .env. Each provider is independent; you only need keys for the ones you intend to bench.

| Provider | Env var | Required for |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Claude models (Sonnet 4.6, Haiku 4.5, Opus 4.7) |
| OpenAI | OPENAI_API_KEY | GPT-4o, GPT-4o-mini, o3, o3-mini, o4-mini |
| Google | GOOGLE_API_KEY | Gemini 2.5 Flash-Lite, Flash, Pro |
| xAI | XAI_API_KEY | Grok 4, Grok 4-fast, Grok 3, Grok 3-mini |
| Perplexity | PERPLEXITY_API_KEY | Sonar, Sonar Pro, Sonar Reasoning (+ Pro) |
| OpenRouter | OPENROUTER_API_KEY | Llama 4 Scout/Maverick, Llama 3.3 70B, DeepSeek Chat/R1, Mistral Large, Cohere Command R+, Qwen 3 235B |
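Before a run, it can help to check which providers actually have a key set. The sketch below assumes the keys from .env are already in the process environment (for example, exported by your shell or loaded by a dotenv tool). The env var names come from the table above; the `configured_providers` helper is hypothetical and not part of the bellwether package.

```python
import os

# Env var names are taken from the table above; the helper itself is hypothetical.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
    "perplexity": "PERPLEXITY_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def configured_providers(env=os.environ) -> list[str]:
    """Return the providers whose API key is present and non-empty in `env`."""
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var, "").strip()]


# A plain dict stands in for the environment so the example is reproducible.
print(configured_providers({"OPENAI_API_KEY": "sk-example", "XAI_API_KEY": ""}))
# prints ['openai']
```

An empty value counts as missing, which catches the common case of an unfilled line copied from .env.example.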

Run

bellwether list providers           # show all registered model entries (27 across 6 providers)
bellwether list tasks               # show registered tasks (3 in v0.x)

# Smoke test: 2 instances, 1 run each, default provider, $1 cap, ~10s and ~$0.01:
bellwether run --instances 2 --n 1 --max-cost 1

# Standard bench: 5 instances, 3 runs per instance, all 27 models across 6
# providers, $5 cap (v0.4 full sweep at --instances 3 --n 2 cost about $0.52):
bellwether run --provider all --instances 5 --n 3 --max-cost 5

# Critique-Pass evaluation track: wraps each attempt with a locked single-shot
# self-revision pass; reports per-model delta against the critique-off baseline.
# Costs about 2x baseline TCoT on average (median +120% on the v0.2 pass).
bellwether run --critique-pass --provider all --instances 3 --n 2 --max-cost 5

# Re-render the leaderboard from existing results without re-running:
bellwether report results

The cost guardrail (--max-cost USD) is a hard cap on total spend per invocation. Strongly recommended.
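The idea behind a hard spend cap can be sketched as a running total that refuses any charge that would cross the limit. This is an illustration under assumptions, not bellwether's implementation: `CostGuard` and `BudgetExceeded` are hypothetical names, and whether the real runner checks before or after each call is not stated here.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push total spend past the cap."""


class CostGuard:
    """Tracks cumulative spend for one invocation against a hard USD cap."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a call's cost, refusing it if the cap would be crossed."""
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"cap ${self.max_cost_usd:.2f} reached after ${self.spent_usd:.2f}"
            )
        self.spent_usd += cost_usd


guard = CostGuard(max_cost_usd=1.00)
completed = 0
try:
    for _ in range(10):
        guard.charge(0.30)  # pretend each call costs $0.30
        completed += 1
except BudgetExceeded as exc:
    print(f"stopped after {completed} calls: {exc}")
# prints: stopped after 3 calls: cap $1.00 reached after $0.90
```

Because a call's true cost is only known once it returns, a pre-call check like this one needs a cost estimate; a post-call check can overshoot the cap by at most one call.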

Status

Current: package v0.4.0, methodology v0.2. 3 tasks and 27 models across 6 providers. 161 tests passing. CI runs on Python 3.11, 3.12, and 3.13. See ROADMAP.md for the detailed list.

Shipped

  • v0.1 (2026-05-06): methodology, package, CLI, structured_extraction task across 3 providers (Claude Sonnet 4.6 / GPT-4o / Gemini 2.5 Flash-Lite). effective_TCoT formula, eight-mode failure-mode taxonomy, schema-only retry feedback, cost guardrail, dirty-tree gate.
  • v0.3 (2026-05-07): added function_call_routing and synthetic_rag tasks; 8 provider models; mean ± std reporting per s7; tied-rank marker; per-task drill-down pages; cost calculator widget; glossary.
  • v0.4 (2026-05-08): coverage of 27 models across 6 providers (anthropic, openai, google, xai, perplexity, openrouter); model_class field separating reasoning and search models from standard chat models; OpenAI-compatible adapter generalized so that xAI, Perplexity, and OpenRouter share one adapter class.
  • v0.2 methodology (2026-05-16, methodology bump alongside package v0.4): Critique-Pass evaluation track (--critique-pass). Optional single-shot self-revision under a locked canonical prompt; twin-leaderboard reporting with per-model effective_TCoT and success_rate deltas. Real-document OCR explicitly deferred to v1 per s10.

Next (v0.5)

  • Async runner (parallelize adapter calls across providers).
  • BYO task config (TOML schema, --task-config PATH).
  • Plugin loader (--plugins-dir).
  • Bootstrap CIs on effective_TCoT, replacing the v0.3 tied-rank heuristic.
  • Default N raised to 5.
  • Search-cost accounting for Perplexity Sonar entries.
  • Tuned-prompt track formalization.
  • Historical leaderboard with trend lines.

v1

  • Real-document document_extraction_ocr task on a redistribution-compatible open corpus (handwriting, multi-column layouts, scan artifacts, JPEG compression, multi-page).
  • Image-input pricing in pricing.py, with a modality field on the Task protocol.
  • Code-generation task with sandboxing (HumanEval+ tier-easy).
  • Real-dataset replacements for the synthetic v0.x tasks (BFCL, FinanceBench, GAIA validation, GovReport).
  • Hosted run service built on the plugin loader.

Repository

Contributing

See CONTRIBUTING.md. Adding a task or a provider adapter is a single PR; the contract is documented and small. Architecture overview in ARCHITECTURE.md; roadmap in ROADMAP.md; community standards in CODE_OF_CONDUCT.md.

Citation

If you use bellwether or its methodology in your work, please cite it. BibTeX:

@software{bellwether2026,
  author = {Hedrick, Stephen},
  title = {bellwether: cost-and-failure-mode benchmark for LLM agents},
  year = {2026},
  version = {0.4.0},
  url = {https://github.com/cartesianxr7/bellwether},
  license = {MIT}
}

CITATION.cff is the machine-readable form (GitHub renders a "Cite this repository" button from it).

License

MIT. See LICENSE.

Author

Stephen Hedrick.
