Independent measurement across 5 models: output reduction is real (~-31%) but total cost never dropped; the 65-75% claim does not reproduce #520
cipherfoxie started this conversation in General
Measured the skill with deterministic gates on two self-hosted models (Qwen3.6-35B, Mistral-Small-4, via opencode headless, skill injected verbatim as project rules, injection verified with a canary instruction) and three Claude models via API (Sonnet 4.6, Opus 4.8, Fable 5). N=3 per cell, chat answers scored against frozen fact checklists, coding tasks gated by typecheck/rename verification.
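For readers who want to picture what a deterministic gate looks like, here is a minimal sketch. It is not the agent-bench harness code: the function names, the use of regex patterns for checklist items, and the canary string are all assumptions made for illustration.

```python
import re


def passes_checklist(answer: str, required_facts: list[str]) -> bool:
    """Return True only if every checklist pattern matches the answer.

    Assumption: each checklist item is a regex, matched case-insensitively.
    """
    return all(re.search(p, answer, re.IGNORECASE) for p in required_facts)


def canary_present(answer: str, canary: str) -> bool:
    """Return True if the canary token appears in the answer.

    Assumption: the injected rules carry an extra instruction such as
    "end every reply with <canary>", so seeing the token in the output
    shows the rules actually reached the model.
    """
    return canary in answer


# Hypothetical checklist and canary, for illustration only.
checklist = [r"three-way handshake", r"syn-ack"]
reply = "TCP opens with a three-way handshake: SYN, SYN-ACK, ACK. ZEBRA-7"
print(passes_checklist(reply, checklist), canary_present(reply, "ZEBRA-7"))
```

Because both checks are pure string tests, the same answer always scores the same way, which is what makes a small N per cell comparable across arms.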
What reproduces: output-token reduction on chat-style answers, consistently around -31% on four of five models (best case -33% on Opus). Technical accuracy held: all checklists passed in both arms.
What does not reproduce: the 65-75% token-reduction claim, on any of the five models. Two reasons fall out of the data. First, the instruction itself rides along as ~1k input tokens on every request, and on coding tasks input dominates, so total tokens often went up (Qwen ts-rename: 89k baseline vs 111k with the skill). Second, measured in dollars on the Claude models, the caveman arm was never cheaper (e.g. Opus $0.554 vs $0.555; Fable 5 outputs got 18% longer and cost more). One model going the wrong direction entirely suggests the effect is also model-dependent.
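The first reason is plain arithmetic, sketched below. The ~1k overhead and the 31% output reduction come from the measurements above; the per-token prices and the two workload shapes are placeholder assumptions, not measured values or any vendor's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests: int                   # model calls in the task
    input_tokens_per_request: int
    output_tokens_per_request: int


def total_tokens(w: Workload, overhead_in: int = 0,
                 output_reduction: float = 0.0) -> tuple[float, float]:
    """Return (input, output) token totals for one arm.

    The instruction overhead is paid on every request; the reduction
    applies only to output tokens.
    """
    inp = w.requests * (w.input_tokens_per_request + overhead_in)
    out = w.requests * w.output_tokens_per_request * (1.0 - output_reduction)
    return inp, out


def cost_usd(inp: float, out: float,
             price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost given prices per million tokens."""
    return (inp * price_in_per_mtok + out * price_out_per_mtok) / 1_000_000


def break_even_output_tokens(overhead_in: int, output_reduction: float,
                             price_in: float, price_out: float) -> float:
    """Baseline output tokens per request above which the skill saves money."""
    return overhead_in * price_in / (output_reduction * price_out)


# Placeholder prices (USD per million tokens) and hypothetical workloads.
PRICE_IN, PRICE_OUT = 3.0, 15.0
chat = Workload(requests=1, input_tokens_per_request=200,
                output_tokens_per_request=800)
coding = Workload(requests=20, input_tokens_per_request=4000,
                  output_tokens_per_request=150)

for name, w in (("chat", chat), ("coding", coding)):
    base = cost_usd(*total_tokens(w), PRICE_IN, PRICE_OUT)
    skill = cost_usd(*total_tokens(w, overhead_in=1000, output_reduction=0.31),
                     PRICE_IN, PRICE_OUT)
    print(f"{name}: baseline ${base:.4f}, with skill ${skill:.4f}")
```

Under these placeholder prices the break-even sits at roughly 645 baseline output tokens per request. A long chat answer clears that bar; an agentic coding loop with many short, input-heavy turns does not, so the skill arm comes out more expensive there even though its outputs are shorter.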
Suggestion: qualify the claim toward "up to ~33% shorter chat outputs, model-dependent, with no total-cost saving measured on agentic coding workloads", or scope it to the workloads where it holds. Happy to share raw runs: harness https://github.com/cipherfoxie/agent-bench, full writeup https://sovgrid.org/blog/caveman-local-benchmark/