Shad107/openclaw-benchmark

OpenClaw Agent Benchmark

Real-world LLM benchmark for AI agent tool calling on consumer GPUs (RTX 3090 24GB).

Unlike coding benchmarks, this one tests what actually matters for an autonomous AI agent: tool calling, multi-step workflows, instruction following, bilingual FR/EN support, and JSON reliability.

Results — 24 models tested

| # | Model | Quant | Score | Tools | Multi | Instr | Bili | JSON | tok/s | Verdict |
|---|-------|-------|-------|-------|-------|-------|------|------|-------|---------|
| ref | Claude Sonnet 4.5 (API) | - | 8.6 | 8.2 | 9.0 | 7.5 | 10.0 | 10.0 | 34.6* | VIABLE |
| 1 | Qwen 2.5 Coder 32B | Q4_K_M | 9.3 | 10.0 | 10.0 | 7.5 | 10.0 | 10.0 | 15.2 | VIABLE |
| 2 | Qwen 2.5 Instruct 32B | Q4_K_M | 9.3 | 10.0 | 9.0 | 8.3 | 10.0 | 10.0 | 17.5 | VIABLE |
| 3 | Magistral Small 2509 24B | Q6_K | 8.2 | 6.2 | 9.0 | 7.5 | 10.0 | 10.0 | 16.2 | VIABLE |
| 3 | Falcon-H1 34B Instruct | Q4_K_M | 8.2 | 10.0 | 6.7 | 7.5 | 10.0 | 10.0 | 16.9 | VIABLE |
| 5 | Hermes 4.3 36B | Q3_K_M | 8.0 | 8.2 | 8.0 | 5.8 | 10.0 | 10.0 | 14.0 | VIABLE |
| 6 | Mistral Small 3.2 24B | Q6_K | 7.9 | 9.0 | 5.7 | 7.5 | 10.0 | 10.0 | 16.9 | VIABLE |
| 7 | Qwen3.5 35B-A3B (MoE) | Q4_K_M | 7.9 | 8.2 | 10.0 | 5.8 | 3.5 | 4.6 | 84.9 | FAIL |
| 8 | Qwen3 32B Dense | Q4_K_M | 7.7 | 8.2 | 6.7 | 5.8 | 8.8 | 10.0 | 16.0 | VIABLE |
| 9 | Qwen3 30B-A3B (MoE) | Q4_K_M | 7.6 | 8.2 | 4.7 | 7.5 | 8.3 | 10.0 | 125.6 | VIABLE |
| 10 | Qwen3-Coder 30B-A3B (MoE) | Q4_K_M | 7.5 | 6.2 | 4.7 | 7.5 | 10.0 | 10.0 | 128.2 | VIABLE |
| 11 | Devstral Small 2 24B | Q6_K | 7.5 | 8.2 | 4.7 | 7.5 | 10.0 | 10.0 | 15.9 | VIABLE |
| 12 | QwQ 32B | Q4_K_M | 7.3 | 8.2 | 4.7 | 7.5 | 7.0 | 10.0 | 15.5 | VIABLE |
| 13 | Granite 4.0-H 32B (MoE) | Q4_K_M | 7.2 | 8.2 | 4.7 | 5.8 | 10.0 | 10.0 | 53.3 | VIABLE |
| 14 | Qwen3.5 27B Dense | Q4_K_M | 7.1 | 8.2 | 6.7 | 8.3 | 3.5 | 6.6 | 17.9 | VIABLE |
| 15 | GLM-4.7 Flash (MoE) | Q4_K_M | 6.6 | 8.2 | 2.3 | 8.3 | 5.3 | 7.4 | 87.8 | VIABLE |
| 16 | Devstral Small v1 24B | Q6_K | 5.6 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 16.8 | VIABLE |
| 17 | Aya Expanse 32B | Q4_K_M | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 14.8 | VIABLE |
| 17 | Gemma 3 27B IT | Q4_K_M | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 8.0 | 18.2 | VIABLE |
| 19 | Phi-4 14B | Q8_0 | 4.6 | 2.4 | 0.0 | 5.8 | 10.0 | 10.0 | 21.2 | VIABLE |
| 20 | EXAONE 4.0 32B | Q4_K_M | 4.2 | 1.0 | 0.0 | 7.5 | 8.8 | 6.6 | 15.1 | VIABLE |
| 21 | R1 Distill Qwen 32B | Q4_K_M | 4.0 | 1.0 | 0.0 | 6.5 | 6.5 | 9.4 | 15.3 | FAIL |
| 22 | GPT-OSS 20B (MoE) | Q4_K_M | 3.5 | 2.8 | 0.0 | 5.8 | 5.3 | 1.4 | 121.8 | FAIL |
| 23 | OLMo 3.1 32B Think | Q4_K_M | 3.4 | 3.2 | 0.0 | 5.0 | 7.5 | 0.0 | 14.4 | FAIL |

* Claude tok/s estimated from API wall time (not comparable with local inference)

Test Protocol

7 categories, 25 tests:

| Category | Weight | Tests | What we measure |
|----------|--------|-------|-----------------|
| T1 Tool Calling | 25% | 5 | Single tool calls: exec, read, edit, web_search, browser |
| T2 Multi-step | 25% | 3 | Chain 3+ tools: email→HARO→CRM, KB→syndication, analytics→LinkedIn |
| T3 Instructions | 20% | 4 | Ask confirmation, respond in FR, verify CRM, link-in-comment rule |
| T4 Bilingual FR/EN | 10% | 4 | Pure EN, pure FR, FR→EN switch, FR stability in long context |
| T5 JSON Reliability | 10% | 5 | Parseable JSON, correct types, nested structures, consistency (3x) |
| T6 Speed | 5% | 1 | tok/s on long generation (400 words) |
| T7 Prefix Cache | 5% | 1 | Speedup on identical system prompt (3 calls, measure prompt_ms) |
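The T7 prefix-cache measurement above can be sketched as follows. This is a hypothetical helper, not the actual benchmark.mjs code; it assumes `prompt_ms` values have already been collected from the three calls:

```javascript
// Sketch of the T7 prefix-cache check: with an identical system prompt,
// calls 2 and 3 should reprocess far less of the prompt than call 1.
function prefixCacheSpeedup(promptMs) {
  // promptMs: prompt-processing times (ms) for 3 calls, first call cold.
  const [first, ...rest] = promptMs;
  const warm = rest.reduce((sum, ms) => sum + ms, 0) / rest.length;
  return first / warm; // > 1 means the cached prefix sped up later calls
}
```

A speedup near 1.0 would indicate the server is not reusing the cached prefix at all, which is one of the hard-fail conditions.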

Each test is scored 0-10, and the final score is the weighted average across categories. A model hard-fails if: tool calling = 0, JSON < 5, speed < 8 tok/s, or prefix cache = 0.
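The scoring rule can be sketched like this (a hypothetical helper; the actual logic lives in benchmark.mjs and may be structured differently):

```javascript
// Category weights from the protocol table (tools/multi/instr/bili/json/speed/cache).
const WEIGHTS = { tools: 0.25, multi: 0.25, instr: 0.20, bili: 0.10, json: 0.10, speed: 0.05, cache: 0.05 };

// Weighted average of the seven category scores (each 0-10).
function finalScore(scores) {
  return Object.entries(WEIGHTS).reduce((sum, [cat, w]) => sum + w * scores[cat], 0);
}

// Hard-fail conditions apply independently of the weighted score.
function hardFail(scores, tokPerSec) {
  return scores.tools === 0 || scores.json < 5 || tokPerSec < 8 || scores.cache === 0;
}
```

Note that a model can post a decent weighted average and still fail outright, which is why the results table shows a 7.9 with a FAIL verdict.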

Hardware

  • GPU: NVIDIA RTX 3090 24GB
  • Runtime: llama.cpp (CUDA, flash attention enabled)
  • Context: 65,536 tokens
  • KV Cache: q4_0
  • Quantization: Q4_K_M for 32B+ models, Q6_K for 24B models, Q8_0 for 14B
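A back-of-envelope check shows why q4_0 KV cache is needed to fit 65K context next to a 32B Q4_K_M model on 24GB. The formula is standard KV-cache sizing; the architecture numbers below (64 layers, 8 KV heads, head dim 128) are illustrative values typical of a Qwen2.5-32B-class model:

```javascript
// KV cache size: K and V tensors per layer, one entry per token per KV head.
function kvCacheBytes({ layers, kvHeads, headDim, ctx, bytesPerElem }) {
  return 2 * layers * ctx * kvHeads * headDim * bytesPerElem;
}

// q4_0 packs 32 values into 18 bytes (16 B data + 2 B fp16 scale) = 0.5625 B/value.
const qwen32bLike = { layers: 64, kvHeads: 8, headDim: 128, ctx: 65536, bytesPerElem: 0.5625 };
const gb = kvCacheBytes(qwen32bLike) / 1024 ** 3; // ≈ 4.5 GiB
```

Roughly 4.5 GiB of KV cache on top of ~20 GB of Q4_K_M weights is already tight on a 24 GB card; f16 KV (2 bytes/value) would not fit at this context length.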

Quick Start

Run on your current model (llama-server must be running on port 8080):

```bash
node benchmark.mjs --model-id my-model --output results/my-model.json
```

Run with automatic model swap:

```bash
# Single model
./run-benchmark.sh qwen2.5-coder-32b-q4km

# All models in models.json
./run-benchmark.sh

# Skip model swap (test whatever is currently loaded)
./run-benchmark.sh --skip-swap
```

Generate scorecard from results:

```bash
node benchmark.mjs --scorecard results/
```

Test with Anthropic API (Claude):

Set your API key in models.json or ANTHROPIC_API_KEY env var, then:

```bash
node benchmark.mjs --model-id claude-sonnet-4-5 --output results/claude.json
```

Files

| File | Description |
|------|-------------|
| benchmark.mjs | Main benchmark script (Node.js, zero dependencies) |
| models.json | Model registry with download URLs and configs |
| model-swap.sh | Download model + restart llama-server |
| run-benchmark.sh | Orchestrate: swap → benchmark → next → restore baseline |
| results/all/ | Raw JSON results for all 24 models |

Adding Your Own Model

Add an entry to models.json:

```json
{
  "id": "my-model-q4km",
  "name": "My Model 32B Q4_K_M",
  "filename": "my-model-Q4_K_M.gguf",
  "download": {
    "repo": "username/My-Model-GGUF",
    "file": "my-model-Q4_K_M.gguf"
  },
  "sizeGB": 20,
  "ctxSize": 65536,
  "extraFlags": "",
  "cacheType": "q4_0"
}
```

Then run: `./run-benchmark.sh my-model-q4km`
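A quick sanity check can catch typos in a new entry before a long benchmark run. This is a minimal sketch; the field names come from the example above, but the required-field set is an assumption (run-benchmark.sh may expect more):

```javascript
// Hypothetical validation for a models.json entry (not part of the repo).
function validateModelEntry(entry) {
  const required = ["id", "name", "filename", "download", "ctxSize"];
  const missing = required.filter((key) => !(key in entry));
  if (missing.length > 0) {
    throw new Error(`models.json entry missing: ${missing.join(", ")}`);
  }
  if (!entry.download.repo || !entry.download.file) {
    throw new Error("download.repo and download.file are both required");
  }
  return true;
}
```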

Key Findings

  1. Qwen 2.5 Coder 32B (Oct 2024) is still the best agent model — newer Qwen 3/3.5 regressed on tool calling
  2. Reasoning models make terrible agents — R1 Distill, QwQ, OLMo Think waste tokens on chain-of-thought instead of calling tools
  3. MoE models with only ~3B active parameters are too shallow for reliable multi-step agent workflows
  4. Magistral Small 2509 is the best French-friendly agent — 9/10 multi-step, perfect bilingual, runs at Q6_K in 19GB
  5. Q4_K_M is the max viable quant for 32B models at 65K context on 24GB VRAM

Requirements

  • Node.js 20+
  • llama.cpp with CUDA (for local models)
  • huggingface-cli (for auto-download)
  • ~24GB VRAM (RTX 3090/4090 or equivalent)

License

MIT
