The tool-call detective for small models on Apple Silicon.
When a small model botches a tool call, Toolhound tells you who did it:
the chat template, the framework parser, or the model itself.
Everyone benchmarks tool-calling with a single number: "Model X gets 71% of function calls right." That number is a lie of omission. It can't tell you why the other 29% failed, and the why is the only thing that tells you what to do next.
Two models can score the same accuracy for completely different reasons:
| Model | Same task, its dominant failure is… | So the fix is… |
|---|---|---|
| Qwen2.5-0.5B | 🧩 the chat template mangles tool tokens | File a bug upstream; the model was never given a fair chance |
| Qwen2.5-1.5B | 🧠 valid JSON, wrong tool/args | Better model or better prompt; grammar can't fix judgment |
| Llama-3.2-3B | 🧠 valid JSON, wrong tool/args | Better model or better prompt |
One of those is not the model's fault. A plain accuracy score hides that. Toolhound doesn't.
Toolhound is a reproducible diagnostic harness that runs entirely on your Mac (via MLX) and attributes every single failure to one of four causes, with bootstrap confidence intervals on every metric.
Every failed tool call gets pinned on exactly one culprit:
| Cause | Whose fault | Reportable? |
|---|---|---|
| `framework_template_bug` | The chat template / tokenizer mangled the tool tokens | ✅ Upstream bug |
| `framework_parser_gap` | The model emitted a rescuable call; the framework parser missed it | ✅ Upstream bug |
| `model_format_failure` | The model can't emit a parseable call at all | The model |
| `model_decision_failure` | Valid format, but wrong tool or wrong arguments | The model |
The trick that makes this attribution valid: the parser is lenient, the scorer is strict. We decouple "could any reasonable parser rescue this output?" from "is this the correct answer?", so a format failure is never confused with a judgment failure, and an upstream parser gap is never blamed on the model.
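To make the lenient-parser / strict-scorer split concrete, here is a minimal sketch. The function names (`parse_strict`, `parse_lenient`, `score_strict`) and the `{"name": ..., "arguments": ...}` call shape are assumptions for illustration, not Toolhound's actual parser or scorer.

```python
import json
import re


def parse_strict(text):
    """Hypothetical strict tier: the whole output must be exactly one JSON call object."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and "name" in obj else None


def parse_lenient(text):
    """Hypothetical lenient tier: rescue the first JSON object with a 'name' key,
    even when it is surrounded by chatter."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None


def score_strict(call, expected):
    """Strict scorer: the tool name and the arguments must both match exactly."""
    return (
        call is not None
        and call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )


output = 'Sure! {"name": "get_weather", "arguments": {"city": "Paris"}} Hope that helps.'
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}

print(parse_strict(output))                          # None: the strict tier sees no call
print(score_strict(parse_lenient(output), expected)) # True: rescuable and correct
```

In this sketch the output is rescuable but invisible to the strict tier, which is the pattern the table above files under `framework_parser_gap` rather than blaming the model.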
Toolhound is a measuring stick, not another schema-adaptation method. Its value is honest measurement:
- ✅ It finds chat-template bugs and parser gaps, and gives you a minimal repro to file upstream.
- ✅ It separates "the model can't format" from "the model can't decide", because grammar-constrained decoding fixes the first and can never fix the second.
- ✅ It quantifies quantization damage (bf16 vs. q4) without confounding it with template differences.
- ✅ In v2, it benchmarks existing zero-training fixes (e.g. PA-Tool) on a held-out test set, never claiming an improvement unless its confidence interval is disjoint from baseline.
We are not the first to notice that chat templates break tool tokens, and we don't claim to be. Toolhound's contribution is making that failure legible, attributable, and reproducible on consumer Apple hardware.
Requires an Apple Silicon Mac (M1 or newer), macOS 14+, Python 3.11+. MLX runs only on Apple Silicon, so your conda env must be arm64, not Rosetta (run `conda info`; the platform should read `osx-arm64`).
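If you would rather check from inside Python, the following sketch tests the same requirements. The helper name `environment_problems` is hypothetical and not part of Toolhound; it relies on `platform.machine()` reporting `arm64` for a native interpreter and `x86_64` for one running under Rosetta.

```python
import platform
import sys


def environment_problems(system, machine, version_info):
    """Return a list of human-readable problems; an empty list means the env looks fine.

    Hypothetical helper: arguments are passed in so the logic is testable on any machine.
    """
    problems = []
    if system != "Darwin":
        problems.append(f"expected macOS (Darwin), found {system}")
    if machine != "arm64":
        problems.append(f"expected a native arm64 interpreter, found {machine}")
    if tuple(version_info[:2]) < (3, 11):
        problems.append("expected Python 3.11 or newer")
    return problems


if __name__ == "__main__":
    found = environment_problems(platform.system(), platform.machine(), sys.version_info)
    for problem in found:
        print("problem:", problem)
    if not found:
        print("environment looks fine for MLX")
```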

```bash
git clone https://github.com/Code-byte404/toolhound.git
cd toolhound
conda create -n toolhound python=3.11 && conda activate toolhound
pip install -e ".[dev]"

# Smoke test: MLX loads a tiny non-gated model and generates one call
python scripts/smoke.py
```

Then run the detective on a model:

```bash
# 1) Reliability report: how often does each model get tool calls right?
toolprobe run \
  --model qwen2.5-1.5b \
  --quant bf16,q4 \
  --cases cases/default.jsonl \
  --out reports/

# 2) Attribution: for every failure, name the culprit (run under strict + lenient parsers)
toolprobe attribute --model qwen2.5-1.5b

# 3) Compare a zero-training fix against baseline (v2)
toolprobe run --model qwen2.5-1.5b --cases cases/test.jsonl --method baseline,pa_tool
```

Both commands write matching `*.json` (machine-readable) and `*.md` (human-readable) reports into `reports/`, each stamped with a full reproducibility header: chip, RAM, macOS version, exact `mlx` / `mlx-lm` versions, model repo + revision, and the injected date.
(The bundled model keys `qwen2.5-0.5b`, `qwen2.5-1.5b`, and `llama-3.2-3b` are registered in `src/toolprobe/backend.py`; add your own there.)
Reliability: layered scoring, so you see exactly where each model drops off:

```text
Model: qwen2.5-1.5b (q4)        95% bootstrap CI · Apple M2 Pro
────────────────────────────────────────────────────────────────
parse_ok        0.96  [0.92, 0.99]
schema_valid    0.96  [0.92, 0.99]
tool_correct    0.96  [0.92, 0.99]
args_correct    0.71  [0.63, 0.79]
```

Attribution: every failure pinned to a suspect, shown under both parser tiers so you can see the conclusion doesn't flip when the parser gets more lenient:

```text
Failure attribution (strict parser)          Failure attribution (lenient parser)
───────────────────────────────────          ────────────────────────────────────
framework_template_bug     4                 framework_template_bug     4
framework_parser_gap       6  ← rescuable!   framework_parser_gap       1
model_format_failure       9                 model_format_failure       8
model_decision_failure    22                 model_decision_failure    22
```

(Reliability numbers above are from a real bundled 3-model run on an Apple M2 Pro; the attribution counts show the two-tier layout. Run `toolprobe attribute` for your own CI-backed figures.)
When you benchmark a method against baseline, Toolhound doesn't just print a delta: it flags each metric as credible only when the method's bootstrap CI is disjoint from baseline's. In a bundled 3-model demo run, PA-Tool (a real zero-training tool-renaming method) didn't clear that bar on any metric, and on one model it measurably hurt argument accuracy:

```text
Method comparison: qwen2.5-1.5b (q4) · pa_tool vs. baseline
────────────────────────────────────────────────────────────
metric          baseline   pa_tool   delta   credible
tool_correct    0.96       0.96      +0.00   no
args_correct    0.71       0.43      -0.28   no   ← caught, not rubber-stamped
```

That's the entire point of a measuring stick: it tells you when a fix doesn't work, with the statistics to back it up. (Demonstration on the exploratory `default.jsonl`; real method selection uses the held-out dev / test split so a gain has to generalize to unseen slots.)
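The "disjoint confidence interval" rule can be sketched in a few lines. This is an illustrative percentile bootstrap over per-case pass/fail outcomes, not Toolhound's own implementation; the function names, the resample count, and the seed are assumptions.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case outcomes (1 = pass, 0 = fail).

    Resamples the case set with replacement, which mirrors "bootstrap over cases,
    not seed variance". Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_index], means[hi_index]


def cis_disjoint(ci_a, ci_b):
    """True when the two intervals share no point, in either direction."""
    return ci_a[0] > ci_b[1] or ci_a[1] < ci_b[0]


# Made-up outcomes for 100 cases: 71 passes for baseline, 74 for a candidate method.
baseline = [1] * 71 + [0] * 29
candidate = [1] * 74 + [0] * 26

ci_base = bootstrap_ci(baseline)
ci_cand = bootstrap_ci(candidate)
print("baseline CI:", ci_base)
print("candidate CI:", ci_cand)
print("disjoint:", cis_disjoint(ci_base, ci_cand))
```

A three-point gain on 100 cases sits well inside the resampling noise, so the intervals overlap and the gain would not be flagged; that is the behaviour the rule is designed to produce.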

```text
raw model output
  └─ template_sanity check: tokens survived round-trip?
       ├─ no  → framework_template_bug
       └─ yes → parse_framework (strict): did the framework see a call?
            ├─ no  → parse_rescue (lenient)
            │          ├─ rescued → framework_parser_gap
            │          └─ garbage → model_format_failure
            └─ yes → scorer (strict): right tool? right args?
                       └─ if not → model_decision_failure
```

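The decision flow above reduces to a small pure function. The sketch below is one way to write it; the `Cause` enum, the `attribute` name, and the boolean inputs are illustrative assumptions, while the four cause labels come from the table earlier in this README.

```python
from enum import Enum
from typing import Optional


class Cause(str, Enum):
    TEMPLATE_BUG = "framework_template_bug"
    PARSER_GAP = "framework_parser_gap"
    FORMAT_FAILURE = "model_format_failure"
    DECISION_FAILURE = "model_decision_failure"


def attribute(template_ok: bool, framework_parsed: bool,
              rescued: bool, correct: bool) -> Optional[Cause]:
    """Map one case's stage results to exactly one cause, or None for a pass.

    Hypothetical signature: each flag stands for the outcome of one stage
    in the flow above.
    """
    if not template_ok:
        return Cause.TEMPLATE_BUG        # tool tokens did not survive the round-trip
    if not framework_parsed:
        # The strict framework parser saw nothing; ask the lenient tier.
        return Cause.PARSER_GAP if rescued else Cause.FORMAT_FAILURE
    if not correct:
        return Cause.DECISION_FAILURE    # parseable call, wrong tool or arguments
    return None                          # parsed and correct


print(attribute(template_ok=True, framework_parsed=False, rescued=True, correct=False))
```

Because the function takes only booleans, it can be unit-tested exhaustively over all 16 input combinations without loading a model.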
The pipeline is a clean, testable data flow; each stage is a pure function with one job:

```text
case → templates → runner → parser → scorer → attribution → report
```

Only one module (`backend.py`) is allowed to import `mlx` / `mlx_lm`; a hygiene test enforces it. That keeps the parser, scorer, and attribution logic 100% pure and unit-testable on any machine (no Mac required for the logic tier; 101 tests run in <1s in CI).
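An import-hygiene check of this kind can be written with the standard `ast` module. The sketch below is an assumption about how such a test might look, not the project's actual test; `forbidden_imports` and `offending_files` are hypothetical names.

```python
import ast
from pathlib import Path

FORBIDDEN = {"mlx", "mlx_lm"}      # top-level packages only the backend may import
ALLOWED_FILES = {"backend.py"}     # the single module allowed to touch MLX


def forbidden_imports(source):
    """Return the forbidden top-level packages imported by a piece of source code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]   # absolute "from x import y" only
        else:
            continue
        found |= {name.split(".")[0] for name in names} & FORBIDDEN
    return found


def offending_files(package_dir):
    """Scan a package directory and report every file that breaks the rule."""
    offenders = {}
    for path in sorted(Path(package_dir).rglob("*.py")):
        if path.name in ALLOWED_FILES:
            continue
        hits = forbidden_imports(path.read_text(encoding="utf-8"))
        if hits:
            offenders[str(path)] = hits
    return offenders


print(forbidden_imports("import mlx.core as mx\nimport json\n"))
```

A pytest wrapper would then be a one-liner such as `assert offending_files("src/toolprobe") == {}`, assuming that package path.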
- v1: diagnostic harness + four-cause attribution + bootstrap CIs (this release)
- v2 (in progress): 304-case dev/test dataset with slot-disjoint splits; `PA-Tool` method integration
- v1.1 seam: grammar-constrained decoding (Outlines / XGrammar); the abstention-safe grammar hook is already reserved in `backend.py`
- More methods: TSCG, constrained decoding benchmarks (integrate & measure existing work, not invent new)
- More models: expand the registry beyond Qwen / Llama
- PNG report export for dropping straight into issues and blog posts
This is a young project with a clear mission and a lot of well-scoped, self-contained ways to help. If any of these sound fun, open an issue and say hi:
- Add a model to the registry and file the template/parser bugs Toolhound finds upstream.
- Add test cases, especially tricky abstention traps (utterances that look like tool requests but aren't).
- Implement the constrained-decoding seam (`backend.generate(grammar=...)` is stubbed and waiting).
- Add a method: wire an existing zero-training tool-calling fix into the `methods/` framework and let the benchmark judge it fairly.
- Docs: help port the methodology notes to English.
Every metric in this project has a confidence interval and every logic path has a test. Please keep it that way: `ruff check . && pytest` is the bar, and any claimed improvement must show non-overlapping CIs vs. baseline.
New here? CONTRIBUTING.md walks you from clone to merged PR (dev setup, the hard rules, the bar), and the good first issues are concrete places to start.
- `temperature=0`, `top_p=1`, fixed seed: deterministic generation.
- Confidence intervals come from bootstrap resampling over the case set, not seed variance.
- bf16 vs. q4 comparisons assert an identical tokenizer + chat template first, so quantization damage is never confounded with template differences.
- Every model is tested with its own chat template's tool format โ never a hand-rolled one.
- Dates like "today / Friday" are pinned to a fixed injected date so runs are reproducible forever.
The whole report table is re-runnable with a single command.
This project is actively looking for collaborators. Whether you want to add a model, contribute test cases, port a method, or just compare notes on small-model tool-calling reliability, I'd love to hear from you.
- Issues & ideas: open a GitHub issue
- Reach the maintainer: frankfish1984@gmail.com
Released under the Apache License 2.0. See LICENSE.
Built on MLX and mlx-lm by Apple.
Method integrations credit their original authors (see each file in src/toolprobe/methods/).
Toolhound ships as the `mlx-toolprobe` package with the `toolprobe` CLI. 🐾
