QuantClaw is a plug-and-play OpenClaw plugin that routes requests to quantized models by task type. It classifies each incoming request, maps it to a precision tier (4bit, 8bit, or 16bit), and routes it to the matching model target, so you can balance quality, latency, and cost without asking users to choose a precision manually.
QuantClaw is built on quantization studies of OpenClaw workloads rather than on intuition. We evaluate quantized and high-precision variants of 6 models, from 9B to 744B parameters, on 104 tasks spanning 24 task types.
Results on Claw-Eval (release v0.0.0):
| Model | Params (B) | BF16 / FP8 | NVFP4 |
|---|---|---|---|
| GLM-4.7-Flash | 30 | 0.6370 | 0.6034 |
| GLM-5 | 744 | 0.7130 | 0.7229 |
| MiniMax-M2.5 | 229 | 0.6760 | 0.6823 |
| Qwen3.5-9B | 9 | 0.4267 | 0.4107 |
| Qwen3.5-35B-A3B | 35 | 0.6686 | 0.6549 |
| Qwen3.5-397B-A17B | 397 | 0.7048 | 0.6937 |
- High-sensitivity tasks such as coding, safety, and complex workflows benefit from higher precision.
- Low-sensitivity tasks such as research, multimodal understanding, comprehension, knowledge lookup, office QA, and data analysis can often run well on lower precision.
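The gap between the two columns of the Claw-Eval table is easier to read as a per-model delta. The sketch below copies the scores from the table and computes the absolute and relative change of NVFP4 against the BF16 / FP8 score; the function name `nvfp4_delta` is hypothetical and is not part of QuantClaw. On these numbers, NVFP4 trails on four models and is slightly ahead on GLM-5 and MiniMax-M2.5.

```python
# Scores copied from the Claw-Eval table: (BF16 / FP8, NVFP4).
SCORES = {
    "GLM-4.7-Flash": (0.6370, 0.6034),
    "GLM-5": (0.7130, 0.7229),
    "MiniMax-M2.5": (0.6760, 0.6823),
    "Qwen3.5-9B": (0.4267, 0.4107),
    "Qwen3.5-35B-A3B": (0.6686, 0.6549),
    "Qwen3.5-397B-A17B": (0.7048, 0.6937),
}


def nvfp4_delta(model):
    """Return (absolute delta, relative delta in percent) of NVFP4 vs. high precision."""
    high, low = SCORES[model]
    absolute = low - high
    return absolute, 100.0 * absolute / high


if __name__ == "__main__":
    for name in SCORES:
        absolute, relative = nvfp4_delta(name)
        print(f"{name:<20} {absolute:+.4f} ({relative:+.2f}%)")
```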
| Automatic Adaptation | Intelligent Routing | Full Customizability | Built-in Observability |
|---|---|---|---|
| Rules first, then a judge model for requests. | Map each query to 4bit, 8bit, or 16bit targets. | Tune task types, patterns, targets, pricing, and backends. | Track routing, tokens, cost, sessions, and live config changes. |
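The "rules first, then a judge model" flow can be pictured as an ordered chain of detectors. The following is a minimal sketch under stated assumptions, not QuantClaw's implementation: it assumes the first detector that returns a task type wins, and every name in it (`route`, `rule_detector`, `make_judge_detector`, `stub_judge`, `PRECISION_BY_TASK`) is hypothetical. The `standard` to `8bit` mapping is illustrative only.

```python
from typing import Callable, List, Optional, Tuple

# A detector takes the request text and returns a task-type id, or None to pass.
Detector = Callable[[str], Optional[str]]

# Illustrative task-type -> precision map (assumed values, not shipped defaults).
PRECISION_BY_TASK = {"coding": "16bit", "research": "4bit", "standard": "8bit"}


def rule_detector(text: str) -> Optional[str]:
    """Cheap first pass: a keyword hit decides the task type."""
    lowered = text.lower()
    if any(word in lowered for word in ("debug", "bug")):
        return "coding"
    return None


def make_judge_detector(judge: Callable[[str], str]) -> Detector:
    """Wrap a judge call (embedding router or LLM) as a detector."""
    def detect(text: str) -> Optional[str]:
        label = judge(text)
        # Ignore labels that are not known task types.
        return label if label in PRECISION_BY_TASK else None
    return detect


def route(text: str, detectors: List[Detector],
          default_task: str = "standard") -> Tuple[str, str]:
    """Run detectors in order; the first non-None answer wins."""
    for detect in detectors:
        task = detect(text)
        if task is not None:
            return task, PRECISION_BY_TASK[task]
    return default_task, PRECISION_BY_TASK[default_task]


def stub_judge(text: str) -> str:
    """Stand-in for a real judge endpoint; always answers 'research'."""
    return "research"


CHAIN = [rule_detector, make_judge_detector(stub_judge)]

if __name__ == "__main__":
    print(route("Please debug this function", CHAIN))      # decided by the rule
    print(route("Summarize recent papers on MoE", CHAIN))  # decided by the judge
```

Keeping the rule pass in front means a request with an obvious keyword never pays for a judge call.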
Install
# Prerequisite: OpenClaw is already installed.
# Install from Clawhub (recommended)
openclaw plugins install clawhub:@sparkengineai/quantclaw
# If OpenClaw is running from a source checkout and the CLI is not on PATH:
cd /path/to/openclaw
node openclaw.mjs plugins install @sparkengineai/quantclaw
# Or install from source
git clone https://github.com/SparkEngineAI/QuantClaw-plugin.git ./quantclaw
openclaw plugins install ./quantclaw
# If the OpenClaw CLI is not on PATH:
cd /path/to/openclaw
node openclaw.mjs plugins install /path/to/quantclaw
Create or bootstrap the runtime config
QuantClaw reads its runtime config from:
~/.openclaw/quantclaw.json
If the file does not exist, starting OpenClaw with the plugin enabled will generate a default quantclaw.json. If you are working from this repository directly, you can also start from the provided example:
cp config.example.json ~/.openclaw/quantclaw.json
Edit the detector chain and targets
{
"quant": {
"enabled": true,
"detectors": ["ruleDetector", "loadModelDetector"],
"judge": {
"endpoint": "http://127.0.0.1:8000",
"model": "BAAI/bge-m3",
"providerType": "openai-compatible",
"apiKey": "",
"cacheTtlMs": 300000
}
}
}
Start OpenClaw and open the dashboard
http://127.0.0.1:18789/plugins/quantclaw/stats
The runtime schema supports:
- ordered detectors: ruleDetector, loadModelDetector
- per-task-type id, description, precision, keywords, and patterns
- per-tier model targets with independent provider, model, endpoint, api key, and pricing
- model-level pricing overrides for cost reporting
- hot reload when ~/.openclaw/quantclaw.json changes
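To make the keywords and patterns fields concrete, here is a minimal sketch of a rule matcher over a trimmed taskTypes entry that mirrors the example configuration in this section. The matching semantics are assumptions, not documented behavior: keywords are treated as case-insensitive substrings and patterns as case-insensitive regular expressions, and `match_task_type` is a hypothetical name. A pattern of the form `(?=.*A)(?=.*B).*` uses two lookaheads, so on a single line it matches only when both A and B occur somewhere in the text, in either order.

```python
import re

# Trimmed taskTypes entry in the shape the schema list describes.
CONFIG = {
    "taskTypes": [
        {
            "id": "coding",
            "precision": "16bit",
            "keywords": ["code", "debug", "bug", "Python", "CUDA"],
            "patterns": [
                "fix the bug in this repository",
                "(?=.*(?:refactor|重构))(?=.*(?:typescript|ts|node)).*",
            ],
        }
    ],
    "defaultTaskType": "standard",
}


def match_task_type(text, config):
    """Return the id of the first task type with a keyword or pattern hit."""
    lowered = text.lower()
    for task in config["taskTypes"]:
        # Assumption: keywords are case-insensitive substrings.
        if any(k.lower() in lowered for k in task.get("keywords", [])):
            return task["id"]
        # Assumption: patterns are regexes searched case-insensitively.
        if any(re.search(p, text, re.IGNORECASE) for p in task.get("patterns", [])):
            return task["id"]
    # Nothing matched: fall back to the configured default.
    return config["defaultTaskType"]


if __name__ == "__main__":
    print(match_task_type("Refactor the Node service", CONFIG))
    print(match_task_type("Refactor this paragraph of prose", CONFIG))
```

Note that a short alternative such as `ts` matches any text containing those two letters in sequence, so adding word boundaries (`\bts\b`) may be worth considering when writing patterns.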
Example taskTypes config:
{
"taskTypes": [
{
"id": "coding",
"precision": "16bit",
"description": "code review, bug analysis, implementation, debugging, kernels, async behavior, web development",
"keywords": ["code", "debug", "bug", "Python", "CUDA", "编程", "代码"],
"patterns": [
"fix the bug in this repository",
"(?=.*(?:refactor|重构))(?=.*(?:typescript|ts|node)).*"
]
}
],
"defaultTaskType": "standard"
}
Example targets config:
{
"targets": {
"4bit": {
"provider": "quantclaw-4bit",
"model": "glm-4.7-flash-int4-autoround",
"endpoint": "https://api.example.com/v1",
"apiKey": "${QC_4BIT_API_KEY}",
"displayName": "4-bit Target",
"pricing": {
"inputPer1M": 0.051,
"outputPer1M": 0.34
}
},
"16bit": {
"provider": "quantclaw-16bit",
"model": "glm-4.7-flash",
"endpoint": "https://api.openai.com/v1",
"apiKey": "${QC_16BIT_API_KEY}",
"displayName": "16-bit Target",
"pricing": {
"inputPer1M": 0.06,
"outputPer1M": 0.4
}
}
}
}
Example modelPricing overrides:
{
"modelPricing": {
"glm-4.7-flash": {
"inputPer1M": 0.06,
"outputPer1M": 0.4
},
"glm-4.7-flash-int4-autoround": {
"inputPer1M": 0.051,
"outputPer1M": 0.34
}
}
}
Target-level pricing is used first for that precision tier. If it is absent, QuantClaw falls back to modelPricing for cost reporting.
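The fallback order can be sketched as follows. This is an illustration under assumptions rather than the plugin's code: it assumes inputPer1M and outputPer1M are prices per one million tokens, that modelPricing is keyed by the target's model name, and that an unpriced model is reported at zero cost. The helper names `resolve_pricing` and `request_cost` are hypothetical.

```python
# Trimmed config: the 16bit target has no target-level pricing,
# so it exercises the modelPricing fallback.
CONFIG = {
    "targets": {
        "4bit": {
            "model": "glm-4.7-flash-int4-autoround",
            "pricing": {"inputPer1M": 0.051, "outputPer1M": 0.34},
        },
        "16bit": {"model": "glm-4.7-flash"},
    },
    "modelPricing": {
        "glm-4.7-flash": {"inputPer1M": 0.06, "outputPer1M": 0.4},
    },
}


def resolve_pricing(config, tier):
    """Target-level pricing first, then modelPricing keyed by model name."""
    target = config["targets"][tier]
    pricing = target.get("pricing")
    if pricing is None:
        pricing = config.get("modelPricing", {}).get(target["model"])
    return pricing


def request_cost(config, tier, input_tokens, output_tokens):
    """Cost of one request, assuming prices are per one million tokens."""
    pricing = resolve_pricing(config, tier)
    if pricing is None:
        return 0.0  # Assumption: unpriced models are reported at zero cost.
    total = (input_tokens * pricing["inputPer1M"]
             + output_tokens * pricing["outputPer1M"])
    return total / 1_000_000


if __name__ == "__main__":
    print(request_cost(CONFIG, "4bit", 2000, 500))
    print(request_cost(CONFIG, "16bit", 2000, 500))
```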
loadModelDetector supports either a local embedding-based router exposed through an OpenAI-compatible API or a regular OpenAI-compatible LLM judge.
Build a local embedding router index:
python router/embedding_task_router.py --model-name BAAI/bge-m3 --device cuda --config-path ~/.openclaw/quantclaw.json --output-dir ./embedding_router_index-bge-m3 build --print-summary
Serve that router as an OpenAI-compatible endpoint:
python router/embedding_task_router_server.py --model-name BAAI/bge-m3 --device cuda --output-dir ./embedding_router_index-bge-m3 --port 8012
If your machine does not have a GPU, change --device cuda to --device cpu.
If you do not want to run the local embedding router, you can point quant.judge.endpoint at any OpenAI-compatible LLM endpoint instead.
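For intuition about what an embedding-based router does, one common approach is nearest-prototype classification by cosine similarity: embed the query, compare it with one stored vector per task type, and fall back to a default when nothing is close enough. The sketch below illustrates that general technique with toy three-dimensional vectors. It is not taken from router/embedding_task_router.py, whose internals are not described here, and the names `cosine`, `nearest_task_type`, `PROTOTYPES`, and the 0.5 threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy prototype vectors; a real index would hold model embeddings
# (for example from BAAI/bge-m3) for each task type.
PROTOTYPES = {
    "coding": [0.9, 0.1, 0.0],
    "research": [0.1, 0.9, 0.2],
}


def nearest_task_type(query_vec, prototypes, min_score=0.5, default="standard"):
    """Return the task type whose prototype is most similar to the query."""
    best_id, best_score = None, -1.0
    for task_id, vec in prototypes.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_id, best_score = task_id, score
    # Below the threshold the router abstains and the default applies.
    if best_id is None or best_score < min_score:
        return default
    return best_id


if __name__ == "__main__":
    print(nearest_task_type([0.8, 0.2, 0.0], PROTOTYPES))
    print(nearest_task_type([0.0, 0.0, 1.0], PROTOTYPES))
```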
We especially acknowledge:
Manyi Zhang, Ji-Fu Li*, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai (Project Lead), Xiaobo Xia
Follow SparkEngineAI on WeChat. We hope to share cutting-edge progress in AI Infra, light up stars in the AI field, and help everyone learn and draw inspiration.
If QuantClaw helps your research, engineering work, or benchmark studies, please cite:
@article{zhang2026quantclaw,
title={QuantClaw: Precision Where It Matters for OpenClaw},
author={Zhang, Manyi and Li, Ji-Fu and Sun, Zhongao and Liu, Xiaohao and Dong, Zhenhua and Yu, Xianzhi and Bai, Haoli and Xia, Xiaobo},
journal={arXiv preprint arXiv:2604.22577},
year={2026}
}





