Real-world LLM benchmark for AI agent tool calling on consumer GPUs (RTX 3090 24GB).
Unlike coding benchmarks, this tests what actually matters for an autonomous AI agent: tool calling, multi-step workflows, instruction following, bilingual FR/EN support, and JSON reliability.
| # | Model | Quant | Score | Tools | Multi | Instr | BiLi | JSON | tok/s | Verdict |
|---|---|---|---|---|---|---|---|---|---|---|
| ref | Claude Sonnet 4.5 (API) | — | 8.6 | 8.2 | 9.0 | 7.5 | 10.0 | 10.0 | 34.6* | VIABLE |
| 1 | Qwen 2.5 Coder 32B | Q4_K_M | 9.3 | 10.0 | 10.0 | 7.5 | 10.0 | 10.0 | 15.2 | VIABLE |
| 2 | Qwen 2.5 Instruct 32B | Q4_K_M | 9.3 | 10.0 | 9.0 | 8.3 | 10.0 | 10.0 | 17.5 | VIABLE |
| 3 | Magistral Small 2509 24B | Q6_K | 8.2 | 6.2 | 9.0 | 7.5 | 10.0 | 10.0 | 16.2 | VIABLE |
| 3 | Falcon-H1 34B Instruct | Q4_K_M | 8.2 | 10.0 | 6.7 | 7.5 | 10.0 | 10.0 | 16.9 | VIABLE |
| 5 | Hermes 4.3 36B | Q3_K_M | 8.0 | 8.2 | 8.0 | 5.8 | 10.0 | 10.0 | 14.0 | VIABLE |
| 6 | Mistral Small 3.2 24B | Q6_K | 7.9 | 9.0 | 5.7 | 7.5 | 10.0 | 10.0 | 16.9 | VIABLE |
| 7 | Qwen3.5 35B-A3B (MoE) | Q4_K_M | 7.9 | 8.2 | 10.0 | 5.8 | 3.5 | 4.6 | 84.9 | FAIL |
| 8 | Qwen3 32B Dense | Q4_K_M | 7.7 | 8.2 | 6.7 | 5.8 | 8.8 | 10.0 | 16.0 | VIABLE |
| 9 | Qwen3 30B-A3B (MoE) | Q4_K_M | 7.6 | 8.2 | 4.7 | 7.5 | 8.3 | 10.0 | 125.6 | VIABLE |
| 10 | Qwen3-Coder 30B-A3B (MoE) | Q4_K_M | 7.5 | 6.2 | 4.7 | 7.5 | 10.0 | 10.0 | 128.2 | VIABLE |
| 11 | Devstral Small 2 24B | Q6_K | 7.5 | 8.2 | 4.7 | 7.5 | 10.0 | 10.0 | 15.9 | VIABLE |
| 12 | QwQ 32B | Q4_K_M | 7.3 | 8.2 | 4.7 | 7.5 | 7.0 | 10.0 | 15.5 | VIABLE |
| 13 | Granite 4.0-H 32B (MoE) | Q4_K_M | 7.2 | 8.2 | 4.7 | 5.8 | 10.0 | 10.0 | 53.3 | VIABLE |
| 14 | Qwen3.5 27B Dense | Q4_K_M | 7.1 | 8.2 | 6.7 | 8.3 | 3.5 | 6.6 | 17.9 | VIABLE |
| 15 | GLM-4.7 Flash (MoE) | Q4_K_M | 6.6 | 8.2 | 2.3 | 8.3 | 5.3 | 7.4 | 87.8 | VIABLE |
| 16 | Devstral Small v1 24B | Q6_K | 5.6 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 16.8 | VIABLE |
| 17 | Aya Expanse 32B | Q4_K_M | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 14.8 | VIABLE |
| 17 | Gemma 3 27B IT | Q4_K_M | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 8.0 | 18.2 | VIABLE |
| 19 | Phi-4 14B | Q8_0 | 4.6 | 2.4 | 0.0 | 5.8 | 10.0 | 10.0 | 21.2 | VIABLE |
| 20 | EXAONE 4.0 32B | Q4_K_M | 4.2 | 1.0 | 0.0 | 7.5 | 8.8 | 6.6 | 15.1 | VIABLE |
| 21 | R1 Distill Qwen 32B | Q4_K_M | 4.0 | 1.0 | 0.0 | 6.5 | 6.5 | 9.4 | 15.3 | FAIL |
| 22 | GPT-OSS 20B (MoE) | Q4_K_M | 3.5 | 2.8 | 0.0 | 5.8 | 5.3 | 1.4 | 121.8 | FAIL |
| 23 | OLMo 3.1 32B Think | Q4_K_M | 3.4 | 3.2 | 0.0 | 5.0 | 7.5 | 0.0 | 14.4 | FAIL |
* Claude tok/s estimated from API wall time (not comparable with local inference)
7 categories, 23 tests:
| Category | Weight | Tests | What we measure |
|---|---|---|---|
| T1 Tool Calling | 25% | 5 | Single tool calls: exec, read, edit, web_search, browser |
| T2 Multi-step | 25% | 3 | Chain 3+ tools: email→HARO→CRM, KB→syndication, analytics→LinkedIn |
| T3 Instructions | 20% | 4 | Ask confirmation, respond in FR, verify CRM, link-in-comment rule |
| T4 Bilingual FR/EN | 10% | 4 | Pure EN, pure FR, FR→EN switch, FR stability in long context |
| T5 JSON Reliability | 10% | 5 | Parseable JSON, correct types, nested structures, consistency (3x) |
| T6 Speed | 5% | 1 | tok/s on long generation (400 words) |
| T7 Prefix Cache | 5% | 1 | Speedup on identical system prompt (3 calls, measure prompt_ms) |
Each test scored 0-10. Final score = weighted average. Hard-fail if: tool calling = 0, JSON < 5, speed < 8 tok/s, or prefix cache = 0.
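The scoring rule above can be sketched in Node.js (the weights come from the table; the function and field names are illustrative, not the actual `benchmark.mjs` internals):

```javascript
// Weighted average over the seven category scores (each 0-10),
// plus the hard-fail gate. Illustrative sketch only.
const WEIGHTS = { tools: 0.25, multi: 0.25, instr: 0.20, bili: 0.10, json: 0.10, speed: 0.05, cache: 0.05 };

function finalScore(scores, tokPerSec, cacheSpeedup) {
  const total = Object.entries(WEIGHTS)
    .reduce((sum, [cat, w]) => sum + w * scores[cat], 0);
  const hardFail =
    scores.tools === 0 ||  // never produced a valid tool call
    scores.json < 5 ||     // JSON too unreliable for agent use
    tokPerSec < 8 ||       // too slow to be usable
    cacheSpeedup === 0;    // prefix cache broken
  return { score: Math.round(total * 10) / 10, verdict: hardFail ? "FAIL" : "VIABLE" };
}
```

This is why a model can rank mid-table on raw score yet still be marked FAIL: the gate is applied after the weighted average (e.g. a JSON score below 5 fails the model regardless of its total).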
- GPU: NVIDIA RTX 3090 24GB
- Runtime: llama.cpp (CUDA, flash attention enabled)
- Context: 65,536 tokens
- KV Cache: q4_0
- Quantization: Q4_K_M for 32B+ models, Q6_K for 24B models, Q8_0 for 14B
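A llama-server invocation matching this configuration might look like the following (model path and port are placeholders; `-ngl 99` offloads all layers, which assumes the model fits in 24GB at the listed quant):

```shell
# Sketch of a llama-server launch matching the setup above:
# 64K context, flash attention, q4_0 KV cache, full GPU offload.
llama-server \
  -m models/my-model-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --port 8080
```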
```bash
node benchmark.mjs --model-id my-model --output results/my-model.json
```

```bash
# Single model
./run-benchmark.sh qwen2.5-coder-32b-q4km

# All models in models.json
./run-benchmark.sh

# Skip model swap (test whatever is currently loaded)
./run-benchmark.sh --skip-swap
```

Generate a scorecard from saved results:

```bash
node benchmark.mjs --scorecard results/
```

Set your API key in models.json or the ANTHROPIC_API_KEY env var, then:

```bash
node benchmark.mjs --model-id claude-sonnet-4-5 --output results/claude.json
```

| File | Description |
|---|---|
| `benchmark.mjs` | Main benchmark script (Node.js, zero dependencies) |
| `models.json` | Model registry with download URLs and configs |
| `model-swap.sh` | Download model + restart llama-server |
| `run-benchmark.sh` | Orchestrate: swap → benchmark → next → restore baseline |
| `results/all/` | Raw JSON results for all 24 models |
Add an entry to models.json:

```json
{
  "id": "my-model-q4km",
  "name": "My Model 32B Q4_K_M",
  "filename": "my-model-Q4_K_M.gguf",
  "download": {
    "repo": "username/My-Model-GGUF",
    "file": "my-model-Q4_K_M.gguf"
  },
  "sizeGB": 20,
  "ctxSize": 65536,
  "extraFlags": "",
  "cacheType": "q4_0"
}
```

Then run: `./run-benchmark.sh my-model-q4km`
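Before kicking off a run, it can be worth sanity-checking a new entry. A minimal validator, using the field names from the example above (this helper is illustrative and not part of the repo):

```javascript
// Minimal sanity check for a models.json entry.
// Field names match the example entry above; benchmark.mjs may do its own checks.
function validateEntry(e) {
  const errors = [];
  for (const key of ["id", "name", "filename", "download", "ctxSize"]) {
    if (e[key] === undefined) errors.push(`missing ${key}`);
  }
  if (e.download && (!e.download.repo || !e.download.file)) {
    errors.push("download needs repo and file");
  }
  if (typeof e.ctxSize === "number" && e.ctxSize > 65536) {
    errors.push("ctxSize exceeds the 65536 used in this benchmark");
  }
  return errors; // empty array means the entry looks usable
}
```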
- Qwen 2.5 Coder 32B (Oct 2024) is still the best agent model — newer Qwen 3/3.5 regressed on tool calling
- Reasoning models make terrible agents — R1 Distill, QwQ, OLMo Think waste tokens on chain-of-thought instead of calling tools
- MoE with ~3B active params is too shallow for reliable multi-step agent workflows
- Magistral Small 2509 is the best French-friendly agent — 9/10 multi-step, perfect bilingual, runs at Q6_K in 19GB
- Q4_K_M is the max viable quant for 32B models at 65K context on 24GB VRAM
- Node.js 20+
- llama.cpp with CUDA (for local models)
- `huggingface-cli` (for auto-download)
- ~24GB VRAM (RTX 3090/4090 or equivalent)
MIT