Independent measurement across 5 models: output reduction is real (~-31%) but total cost never dropped; the 65-75% claim does not reproduce #520

cipherfoxie · 2026-06-12T18:40:14Z

I measured the skill with deterministic gates on five models, N=3 per cell. Two were self-hosted (Qwen3.6-35B and Mistral-Small-4, run via opencode headless with the skill injected verbatim as project rules; injection was verified with a canary instruction) and three were Claude models via API (Sonnet 4.6, Opus 4.8, Fable 5). Chat answers were scored against frozen fact checklists; coding tasks were gated by typecheck/rename verification.
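For readers unfamiliar with checklist gating, here is a minimal sketch of what a deterministic fact-checklist gate can look like. It is not taken from the linked harness: the function name, the regex-based matching, and the example checklist are all assumptions for illustration.

```python
import re

# Hypothetical frozen checklist: every pattern must appear in the answer.
CHECKLIST = [r"\bTCP\b", r"three[- ]way handshake"]


def passes_checklist(answer: str, checklist: list[str]) -> bool:
    """Return True only if every required fact pattern is found in the answer."""
    return all(re.search(p, answer, flags=re.IGNORECASE) for p in checklist)


# Illustrative answers (made up, not benchmark data).
verbose = "TCP opens a connection with a three way handshake: SYN, SYN-ACK, ACK."
terse = "TCP: three-way handshake. SYN, SYN-ACK, ACK."
off_topic = "UDP is connectionless."

print(passes_checklist(verbose, CHECKLIST))
print(passes_checklist(terse, CHECKLIST))
print(passes_checklist(off_topic, CHECKLIST))
```

The point of a gate like this is that a shorter answer only counts as a win if it still clears the same frozen checklist as the baseline answer.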

What reproduces: output-token reduction on chat-style answers, consistently around -31% on four of the five models (best case -33% on Opus). Technical accuracy held: all checklists passed in both arms.

What does not reproduce: the 65-75% token-reduction claim, on any of the five models. Two reasons fall out of the data. First, the instruction itself rides along as ~1k input tokens on every request, and on coding tasks input dominates, so total tokens often went up (Qwen ts-rename: 89k baseline vs 111k with the skill, about +25%). Second, measured in dollars on the Claude models, the caveman arm was never cheaper (e.g. Opus $0.554 vs $0.555; on Fable 5, outputs got 18% longer and cost more). One model moving in the wrong direction entirely suggests the effect is also model-dependent.
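The arithmetic behind the first reason can be sketched as below. Only the 89k/111k Qwen figures come from the measurements above; the request count, per-request token sizes, and per-million-token prices are hypothetical placeholders chosen to illustrate an input-dominated agentic session, not measured values or real price lists.

```python
def relative_change(baseline: float, treatment: float) -> float:
    """Signed relative change of treatment versus baseline, as a fraction."""
    return (treatment - baseline) / baseline


def session_cost(requests: int, input_per_request: int, output_per_request: int,
                 price_in_per_mtok: float, price_out_per_mtok: float,
                 overhead_per_request: int = 0) -> float:
    """Dollar cost of a session; overhead is extra input tokens on every request."""
    input_tokens = requests * (input_per_request + overhead_per_request)
    output_tokens = requests * output_per_request
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


# Measured totals from the post: Qwen ts-rename, 89k baseline vs 111k with the skill.
qwen_change = relative_change(89_000, 111_000)

# Hypothetical session: 20 requests, 30k input and 400 output tokens each,
# at assumed prices of $3 (input) and $15 (output) per million tokens.
baseline_cost = session_cost(20, 30_000, 400, 3.0, 15.0)
# Same session with ~1k tokens of skill text per request and ~31% shorter output.
skill_cost = session_cost(20, 30_000, 276, 3.0, 15.0, overhead_per_request=1_000)

print(f"Qwen total-token change: {qwen_change:+.1%}")
print(f"baseline ${baseline_cost:.4f} vs skill ${skill_cost:.4f}")
```

Under these assumed numbers the skill arm costs slightly more, because the per-request input overhead outweighs the output saving. The saving only wins when output tokens saved times output price exceeds the added input tokens times input price.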

Suggestion: qualify the claim toward "up to ~33% shorter chat outputs, model-dependent, with no total-cost saving measured on agentic coding workloads", or scope it to the workloads where it holds. Happy to share raw runs: harness https://github.com/cipherfoxie/agent-bench, full writeup https://sovgrid.org/blog/caveman-local-benchmark/
