Fix reasoning-token pricing semantics (#32)#73
Conversation
…tariffs (#32) Two pricing-correctness bugs that distorted reported spend whenever reasoning tokens were involved: 1. Codex `usage.reasoning` was being double-billed at the output rate even though Codex's `output_tokens` already includes reasoning. On the issue's 10-turn sample (660k input / 53k output / 29k reasoning / 5.6M cacheRead) this overstates cost by $0.43 / 11.3% of the slice. 2. `cost.reasoning` from the `models.dev` snapshot was discarded during `flatten()`, so any model with a distinct reasoning tariff (Alibaba Qwen reasoning models, etc.) couldn't be priced correctly. Fix: - Extend `ModelCost` with `reasoning?: number` and `reasoningMode: 'included_in_output' | 'separate' | 'same_as_output'`. - `flatten()` preserves `cost.reasoning` and tags the entry `separate`. Models without a distinct reasoning tariff default to `same_as_output` (preserves existing Claude billing). - `costForUsage` branches on the resolved mode. `costForTurn` infers `included_in_output` for `source: 'codex'` so reasoning is recorded but not billed on top of output. - `usage.reasoning` is still preserved in `TurnRecord` for observability. The bug was in pricing, not data capture. Tests cover all four acceptance criteria from the issue: Codex input=1M/output=500k/reasoning=200k bills 10.0 (not 13.0); the synthetic `separate` model bills 13 for 1M of each bucket; the documented Codex regression scenario; `flatten` preserves reasoning; the builtin snapshot retains at least one separate-tariff model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Devin Review found 1 potential issue.
🐛 1 issue in files not directly in the diff
🐛 costForTurnLocal in waste.ts still double-bills reasoning tokens for Codex and ignores separate reasoning tariffs (packages/analyze/src/waste.ts:351-362)
waste.ts has its own local cost computation (costForTurnLocal at line 351) that was not updated with the new reasoning-mode semantics introduced by this PR. It unconditionally bills reasoning at the output rate ((u.reasoning / PER_MILLION) * rate.output at line 358), regardless of the turn's source or the model's reasoningMode. This means:
- Codex turns are still double-billed in waste attribution — the exact bug this PR is supposed to fix. Codex
output_tokensalready includes reasoning, butcostForTurnLocaladds reasoning cost on top. - Models with a separate reasoning tariff (e.g. Alibaba Qwen) are billed at the output rate instead of their distinct reasoning tariff.
costForTurnLocal is called at packages/analyze/src/waste.ts:93 to compute sessionGrand (total session cost for waste attribution), so waste-attribution totals will be inconsistent with costForTurn from cost.ts for any session involving reasoning tokens.
View 4 additional findings in Devin Review.
…iew on #73) `packages/analyze/src/waste.ts` had a private `costForTurnLocal` that duplicated `costForUsage`'s arithmetic but predated the reasoning-mode work in this PR. It unconditionally billed `usage.reasoning` at the output rate, so: 1. Codex turns were still double-billed in `sessionGrand` / `grandCost` / `unattributedCost` — the exact bug #32 was supposed to fix, just in a different code path. 2. Models with a separate reasoning tariff (e.g. Alibaba Qwen) were billed at the output rate instead of `rate.reasoning`. The fix is to delegate to the canonical `costForTurn`, which already threads `reasoningModeForSource` (Codex -> `included_in_output`) and honors `ModelCost.reasoningMode` per model. That keeps waste totals consistent with `cost.ts` / `costForTurn` for any session involving reasoning tokens. Adds a regression test that constructs a Codex turn with input/output/reasoning and asserts `attributeWaste(...).grandTotal` does not include `reasoning x output_rate`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Re: Devin review — Valid finding. Fixed in d1ad409.
Added a regression test in The four "additional findings" mentioned in the Devin badge are not visible from the GitHub PR view (only the one inlined finding is in the review body), so I cannot address them here. If a reviewer can surface them as line-anchored comments or paste them into this thread, happy to take another pass. |
# Conflicts: # CHANGELOG.md # packages/analyze/CHANGELOG.md # packages/analyze/src/cost.ts # packages/analyze/src/waste.ts
Fixes #32.
Summary
Two pricing-correctness bugs that distorted reported spend whenever reasoning tokens were involved:
costForUsagealways addedusage.reasoning * rate.outputon top ofusage.output, but Codex'soutput_tokensalready includes reasoning (perccusage's data-loader semantics). Reasoning was billed twice.cost.reasoningfrommodels.devwas being silently discarded.flatten()only keptinput/output/cache_read/cache_write, so any model with a distinct reasoning tariff (Alibaba Qwen reasoning models, ~70 entries in the current snapshot) couldn't be priced correctly.User-visible impact (cost numbers will change)
This is a pricing correction. Reported
burn summary/burn by-tool/burn comparetotals will move downward for any session with non-zero reasoning tokens:reasoning_tokens × output_rateper turn. On the issue's documented 10-turn sample (input=660,698,output=52,676,reasoning=29,070,cacheRead=5,618,688):reasoningMode: 'same_as_output'preserves the existing billing path.cost.reasoning(Qwen reasoning, etc.): now priced at the published reasoning tariff instead of falling through at the output rate.Design
ModelCostgainsreasoning?: numberandreasoningMode: 'included_in_output' | 'separate' | 'same_as_output'.flatten()preservescost.reasoningand tags entries withreasoningMode: 'separate'. Models without a reasoning tariff default tosame_as_output.costForUsage(usage, model, pricing, { reasoningMode? })branches on the resolved mode.costForTurninfersincluded_in_outputforsource: 'codex'so reasoning is preserved onTurnRecordfor observability but not billed twice.usage.reasoningis unchanged inTurnRecord— the bug is in pricing, not data capture (per issue non-goals).Test plan
pnpm install && pnpm run test:ts— 346/346 tests passpackages/analyze/src/cost.test.tscover all four acceptance criteria from the issue:input=1M / output=500k / reasoning=200kbills10.0(not13.0)separate-tariff model withinput=1, output=4, reasoning=8and 1M of each bucket bills13flatten()preservescost.reasoningand tagsreasoningMode: 'separate'models.devsnapshot has at least oneseparate-tariff model after flatteningsame_as_outputsemantics)costForTurnselects the right mode automatically bysource; explicitreasoningModeoption overridesPricingTableliterals inmcpandplan-usagetests updated for the new required fieldNotes for reviewers
packages/analyze/CHANGELOG.md[Unreleased]documents the fix in detail; rootCHANGELOG.md[Unreleased]calls out the user-visible cost change because totals will move on next pull.flattenis now exported from@relayburn/analyzefor callers that want to build aPricingTablefrom in-memorymodels.devpayloads (used by the new tests).