Real-world LLM benchmark for AI agent tool calling on consumer GPUs (RTX 3090 24GB).
Unlike coding benchmarks, this tests what actually matters for an autonomous AI agent: tool calling, multi-step workflows, instruction following, bilingual FR/EN support, and JSON reliability.
| # | Model | Quant | Score | Tools | Multi | Instr | BiLi | JSON | tok/s | Verdict |
|---|---|---|---|---|---|---|---|---|---|---|
| ref | Claude Sonnet 4.5 (API) | — | 8.6 | 8.2 | 9.0 | 7.5 | 10.0 | 10.0 | 34.6* | VIABLE |
| 1 | Qwen 2.5 Coder 32B | Q4_K_M | 9.3 | 10.0 | 10.0 | 7.5 | 10.0 | 10.0 | 15.2 | VIABLE |
| 2 | Qwen 2.5 Instruct 32B | Q4_K_M | 9.3 | 10.0 | 9.0 | 8.3 | 10.0 | 10.0 | 17.5 | VIABLE |
| 3 | Magistral Small 2509 24B | Q6_K | 8.2 | 6.2 | 9.0 | 7.5 | 10.0 | 10.0 | 16.2 | VIABLE |
| 3 | Falcon-H1 34B Instruct | Q4_K_M | 8.2 | 10.0 | 6.7 | 7.5 | 10.0 | 10.0 | 16.9 | VIABLE |
| 5 | Hermes 4.3 36B | Q3_K_M | 8.0 | 8.2 | 8.0 | 5.8 | 10.0 | 10.0 | 14.0 | VIABLE |
| 6 | Mistral Small 3.2 24B | Q6_K | 7.9 | 9.0 | 5.7 | 7.5 | 10.0 | 10.0 | 16.9 | VIABLE |
| 7 | Qwen3.5 35B-A3B (MoE) | Q4_K_M | 7.9 | 8.2 | 10.0 | 5.8 | 3.5 | 4.6 | 84.9 | FAIL |
| 8 | Qwen3 32B Dense | Q4_K_M | 7.7 | 8.2 | 6.7 | 5.8 | 8.8 | 10.0 | 16.0 | VIABLE |
| 9 | Qwen3 30B-A3B (MoE) | Q4_K_M | 7.6 | 8.2 | 4.7 | 7.5 | 8.3 | 10.0 | 125.6 | VIABLE |
| 10 | Qwen3-Coder 30B-A3B (MoE) | Q4_K_M | 7.5 | 6.2 | 4.7 | 7.5 | 10.0 | 10.0 | 128.2 | VIABLE |
| 11 | Devstral Small 2 24B | Q6_K | 7.5 | 8.2 | 4.7 | 7.5 | 10.0 | 10.0 | 15.9 | VIABLE |
| 12 | QwQ 32B | Q4_K_M | 7.3 | 8.2 | 4.7 | 7.5 | 7.0 | 10.0 | 15.5 | VIABLE |
| 13 | Granite 4.0-H 32B (MoE) | Q4_K_M | 7.2 | 8.2 | 4.7 | 5.8 | 10.0 | 10.0 | 53.3 | VIABLE |
| 14 | Qwen3.5 27B Dense | Q4_K_M | 7.1 | 8.2 | 6.7 | 8.3 | 3.5 | 6.6 | 17.9 | VIABLE |
| 15 | GLM-4.7 Flash (MoE) | Q4_K_M | 6.6 | 8.2 | 2.3 | 8.3 | 5.3 | 7.4 | 87.8 | VIABLE |
| 16 | Devstral Small v1 24B | Q6_K | 5.6 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 16.8 | VIABLE |
| 17 | Aya Expanse 32B | Q4_K_M | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 14.8 | VIABLE |
| 17 | Gemma 3 27B IT | Q4_K_M | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 8.0 | 18.2 | VIABLE |
| 19 | Phi-4 14B | Q8_0 | 4.6 | 2.4 | 0.0 | 5.8 | 10.0 | 10.0 | 21.2 | VIABLE |
| 20 | EXAONE 4.0 32B | Q4_K_M | 4.2 | 1.0 | 0.0 | 7.5 | 8.8 | 6.6 | 15.1 | VIABLE |
| 21 | R1 Distill Qwen 32B | Q4_K_M | 4.0 | 1.0 | 0.0 | 6.5 | 6.5 | 9.4 | 15.3 | FAIL |
| 22 | GPT-OSS 20B (MoE) | Q4_K_M | 3.5 | 2.8 | 0.0 | 5.8 | 5.3 | 1.4 | 121.8 | FAIL |
| 23 | OLMo 3.1 32B Think | Q4_K_M | 3.4 | 3.2 | 0.0 | 5.0 | 7.5 | 0.0 | 14.4 | FAIL |
* Claude tok/s estimated from API wall time (not comparable with local inference)
7 categories, 23 tests:
| Category | Weight | Tests | What we measure |
|---|---|---|---|
| T1 Tool Calling | 25% | 5 | Single tool calls: exec, read, edit, web_search, browser |
| T2 Multi-step | 25% | 3 | Chain 3+ tools: email→HARO→CRM, KB→syndication, analytics→LinkedIn |
| T3 Instructions | 20% | 4 | Ask confirmation, respond in FR, verify CRM, link-in-comment rule |
| T4 Bilingual FR/EN | 10% | 4 | Pure EN, pure FR, FR→EN switch, FR stability in long context |
| T5 JSON Reliability | 10% | 5 | Parseable JSON, correct types, nested structures, consistency (3x) |
| T6 Speed | 5% | 1 | tok/s on long generation (400 words) |
| T7 Prefix Cache | 5% | 1 | Speedup on identical system prompt (3 calls, measure prompt_ms) |
Each test scored 0-10. Final score = weighted average. Hard-fail if: tool calling = 0, JSON < 5, speed < 8 tok/s, or prefix cache = 0.
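The scoring rule above can be sketched in Node.js (the weights come from the table; the function and field names are illustrative, not the actual `benchmark.mjs` internals):

```javascript
// Weighted average over the seven category scores (each 0-10),
// plus the hard-fail gate. Illustrative sketch only.
const WEIGHTS = { tools: 0.25, multi: 0.25, instr: 0.20, bili: 0.10, json: 0.10, speed: 0.05, cache: 0.05 };

function finalScore(scores, tokPerSec, cacheSpeedup) {
  const total = Object.entries(WEIGHTS)
    .reduce((sum, [cat, w]) => sum + w * scores[cat], 0);
  const hardFail =
    scores.tools === 0 ||  // never produced a valid tool call
    scores.json < 5 ||     // JSON too unreliable for agent use
    tokPerSec < 8 ||       // too slow to be usable
    cacheSpeedup === 0;    // prefix cache broken
  return { score: Math.round(total * 10) / 10, verdict: hardFail ? "FAIL" : "VIABLE" };
}
```

This is why a model can rank mid-table on raw score yet still be marked FAIL: the gate is applied after the weighted average (e.g. a JSON score below 5 fails the model regardless of its total).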
- GPU: NVIDIA RTX 3090 24GB
- Runtime: llama.cpp (CUDA, flash attention enabled)
- Context: 65,536 tokens
- KV Cache: q4_0
- Quantization: Q4_K_M for 32B+ models, Q6_K for 24B models, Q8_0 for 14B
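A llama-server invocation matching this configuration might look like the following (model path and port are placeholders; `-ngl 99` offloads all layers, which assumes the model fits in 24GB at the listed quant):

```shell
# Sketch of a llama-server launch matching the setup above:
# 64K context, flash attention, q4_0 KV cache, full GPU offload.
llama-server \
  -m models/my-model-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --port 8080
```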
```bash
node benchmark.mjs --model-id my-model --output results/my-model.json
```

```bash
# Single model
./run-benchmark.sh qwen2.5-coder-32b-q4km

# All models in models.json
./run-benchmark.sh

# Skip model swap (test whatever is currently loaded)
./run-benchmark.sh --skip-swap
```

Generate a scorecard from saved results:

```bash
node benchmark.mjs --scorecard results/
```

Set your API key in models.json or the ANTHROPIC_API_KEY env var, then:

```bash
node benchmark.mjs --model-id claude-sonnet-4-5 --output results/claude.json
```

| File | Description |
|---|---|
| `benchmark.mjs` | Main benchmark script (Node.js, zero dependencies) |
| `models.json` | Model registry with download URLs and configs |
| `model-swap.sh` | Download model + restart llama-server |
| `run-benchmark.sh` | Orchestrate: swap → benchmark → next → restore baseline |
| `results/all/` | Raw JSON results for all 24 models |
Add an entry to models.json:

```json
{
  "id": "my-model-q4km",
  "name": "My Model 32B Q4_K_M",
  "filename": "my-model-Q4_K_M.gguf",
  "download": {
    "repo": "username/My-Model-GGUF",
    "file": "my-model-Q4_K_M.gguf"
  },
  "sizeGB": 20,
  "ctxSize": 65536,
  "extraFlags": "",
  "cacheType": "q4_0"
}
```

Then run: `./run-benchmark.sh my-model-q4km`
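Before kicking off a run, it can be worth sanity-checking a new entry. A minimal validator, using the field names from the example above (this helper is illustrative and not part of the repo):

```javascript
// Minimal sanity check for a models.json entry.
// Field names match the example entry above; benchmark.mjs may do its own checks.
function validateEntry(e) {
  const errors = [];
  for (const key of ["id", "name", "filename", "download", "ctxSize"]) {
    if (e[key] === undefined) errors.push(`missing ${key}`);
  }
  if (e.download && (!e.download.repo || !e.download.file)) {
    errors.push("download needs repo and file");
  }
  if (typeof e.ctxSize === "number" && e.ctxSize > 65536) {
    errors.push("ctxSize exceeds the 65536 used in this benchmark");
  }
  return errors; // empty array means the entry looks usable
}
```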
- Qwen 2.5 Coder 32B (Oct 2024) is still the best agent model — newer Qwen 3/3.5 regressed on tool calling
- Reasoning models make terrible agents — R1 Distill, QwQ, OLMo Think waste tokens on chain-of-thought instead of calling tools
- MoE with ~3B active params is too shallow for reliable multi-step agent workflows
- Magistral Small 2509 is the best French-friendly agent — 9/10 multi-step, perfect bilingual, runs at Q6_K in 19GB
- Q4_K_M is the max viable quant for 32B models at 65K context on 24GB VRAM
- Node.js 20+
- llama.cpp with CUDA (for local models)
- `huggingface-cli` (for auto-download)
- ~24GB VRAM (RTX 3090/4090 or equivalent)
MIT