Skip to content

Fix reasoning-token pricing semantics (#32)#73

Merged
willwashburn merged 3 commits intomainfrom
fix/reasoning-pricing-32
Apr 26, 2026
Merged

Fix reasoning-token pricing semantics (#32)#73
willwashburn merged 3 commits intomainfrom
fix/reasoning-pricing-32

Conversation

@willwashburn
Copy link
Copy Markdown
Member

@willwashburn willwashburn commented Apr 25, 2026

Fixes #32.

Summary

Two pricing-correctness bugs that distorted reported spend whenever reasoning tokens were involved:

  1. Codex reasoning was being double-billed. costForUsage always added usage.reasoning * rate.output on top of usage.output, but Codex's output_tokens already includes reasoning (per ccusage's data-loader semantics). Reasoning was billed twice.
  2. cost.reasoning from models.dev was being silently discarded. flatten() only kept input / output / cache_read / cache_write, so any model with a distinct reasoning tariff (Alibaba Qwen reasoning models, ~70 entries in the current snapshot) couldn't be priced correctly.

User-visible impact (cost numbers will change)

This is a pricing correction. Reported burn summary / burn by-tool / burn compare totals will move downward for any session with non-zero reasoning tokens:

  • Codex sessions: drops by reasoning_tokens × output_rate per turn. On the issue's documented 10-turn sample (input=660,698, output=52,676, reasoning=29,070, cacheRead=5,618,688):
    • Before: $4.282607
    • After: $3.846557
    • Delta: −$0.43605 (~11.3% of the Codex slice)
  • Anthropic Claude sessions: unchanged. Default reasoningMode: 'same_as_output' preserves the existing billing path.
  • Models with cost.reasoning (Qwen reasoning, etc.): now priced at the published reasoning tariff instead of falling through at the output rate.

Design

  • ModelCost gains reasoning?: number and reasoningMode: 'included_in_output' | 'separate' | 'same_as_output'.
  • flatten() preserves cost.reasoning and tags entries with reasoningMode: 'separate'. Models without a reasoning tariff default to same_as_output.
  • costForUsage(usage, model, pricing, { reasoningMode? }) branches on the resolved mode. costForTurn infers included_in_output for source: 'codex' so reasoning is preserved on TurnRecord for observability but not billed twice.
  • usage.reasoning is unchanged in TurnRecord — the bug is in pricing, not data capture (per issue non-goals).

Test plan

  • pnpm install && pnpm run test:ts — 346/346 tests pass
  • New tests in packages/analyze/src/cost.test.ts cover all four acceptance criteria from the issue:
    • Codex input=1M / output=500k / reasoning=200k bills 10.0 (not 13.0)
    • Codex regression: the documented 11.3% overstatement scenario reproduces and matches the corrected number
    • Synthetic separate-tariff model with input=1, output=4, reasoning=8 and 1M of each bucket bills 13
    • flatten() preserves cost.reasoning and tags reasoningMode: 'separate'
    • Builtin models.dev snapshot has at least one separate-tariff model after flattening
  • Existing Claude reasoning behavior unchanged (test renamed/retargeted to same_as_output semantics)
  • costForTurn selects the right mode automatically by source; explicit reasoningMode option overrides
  • Other PricingTable literals in mcp and plan-usage tests updated for the new required field

Notes for reviewers

  • README gains a short "Reasoning-token pricing semantics" subsection under Packages.
  • packages/analyze/CHANGELOG.md [Unreleased] documents the fix in detail; root CHANGELOG.md [Unreleased] calls out the user-visible cost change because totals will move on next pull.
  • flatten is now exported from @relayburn/analyze for callers that want to build a PricingTable from in-memory models.dev payloads (used by the new tests).

Open in Devin Review

…tariffs (#32)

Two pricing-correctness bugs that distorted reported spend whenever
reasoning tokens were involved:

1. Codex `usage.reasoning` was being double-billed at the output rate
   even though Codex's `output_tokens` already includes reasoning. On
   the issue's 10-turn sample (660k input / 53k output / 29k reasoning /
   5.6M cacheRead) this overstates cost by $0.43 / 11.3% of the slice.

2. `cost.reasoning` from the `models.dev` snapshot was discarded during
   `flatten()`, so any model with a distinct reasoning tariff (Alibaba
   Qwen reasoning models, etc.) couldn't be priced correctly.

Fix:
- Extend `ModelCost` with `reasoning?: number` and `reasoningMode:
  'included_in_output' | 'separate' | 'same_as_output'`.
- `flatten()` preserves `cost.reasoning` and tags the entry `separate`.
  Models without a distinct reasoning tariff default to `same_as_output`
  (preserves existing Claude billing).
- `costForUsage` branches on the resolved mode. `costForTurn` infers
  `included_in_output` for `source: 'codex'` so reasoning is recorded
  but not billed on top of output.
- `usage.reasoning` is still preserved in `TurnRecord` for observability.
  The bug was in pricing, not data capture.

Tests cover all four acceptance criteria from the issue: Codex
input=1M/output=500k/reasoning=200k bills 10.0 (not 13.0); the
synthetic `separate` model bills 13 for 1M of each bucket; the
documented Codex regression scenario; `flatten` preserves reasoning;
the builtin snapshot retains at least one separate-tariff model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

🐛 1 issue in files not directly in the diff

🐛 costForTurnLocal in waste.ts still double-bills reasoning tokens for Codex and ignores separate reasoning tariffs (packages/analyze/src/waste.ts:351-362)

waste.ts has its own local cost computation (costForTurnLocal at line 351) that was not updated with the new reasoning-mode semantics introduced by this PR. It unconditionally bills reasoning at the output rate ((u.reasoning / PER_MILLION) * rate.output at line 358), regardless of the turn's source or the model's reasoningMode. This means:

  1. Codex turns are still double-billed in waste attribution — the exact bug this PR is supposed to fix. Codex output_tokens already includes reasoning, but costForTurnLocal adds reasoning cost on top.
  2. Models with a separate reasoning tariff (e.g. Alibaba Qwen) are billed at the output rate instead of their distinct reasoning tariff.

costForTurnLocal is called at packages/analyze/src/waste.ts:93 to compute sessionGrand (total session cost for waste attribution), so waste-attribution totals will be inconsistent with costForTurn from cost.ts for any session involving reasoning tokens.

View 4 additional findings in Devin Review.

Open in Devin Review

…iew on #73)

`packages/analyze/src/waste.ts` had a private `costForTurnLocal` that
duplicated `costForUsage`'s arithmetic but predated the reasoning-mode
work in this PR. It unconditionally billed `usage.reasoning` at the
output rate, so:

1. Codex turns were still double-billed in `sessionGrand` /
   `grandCost` / `unattributedCost` — the exact bug #32 was supposed
   to fix, just in a different code path.
2. Models with a separate reasoning tariff (e.g. Alibaba Qwen) were
   billed at the output rate instead of `rate.reasoning`.

The fix is to delegate to the canonical `costForTurn`, which already
threads `reasoningModeForSource` (Codex -> `included_in_output`) and
honors `ModelCost.reasoningMode` per model. That keeps waste totals
consistent with `cost.ts` / `costForTurn` for any session involving
reasoning tokens.

Adds a regression test that constructs a Codex turn with
input/output/reasoning and asserts `attributeWaste(...).grandTotal`
does not include `reasoning x output_rate`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@willwashburn
Copy link
Copy Markdown
Member Author

Re: Devin reviewcostForTurnLocal in waste.ts (lines 351-362) double-billing reasoning for Codex and ignoring separate reasoning tariffs:

Valid finding. Fixed in d1ad409.

packages/analyze/src/waste.ts had its own private cost arithmetic that predated the reasoning-mode work in this PR. It now delegates to the canonical costForTurn from cost.ts, so waste-attribution sessionGrand / grandCost / unattributedCost honor:

  • Codex source -> included_in_output (reasoning is not billed on top of output).
  • Per-model ModelCost.reasoningMode (e.g. Qwen reasoning models with separate use their distinct rate.reasoning).

Added a regression test in waste.test.ts (session grand total honors source-aware reasoning semantics ...) that constructs a Codex turn with input=1000 / output=500 / reasoning=200 and asserts attributeWaste(...).grandTotal does not include reasoning x output_rate. Updated packages/analyze/CHANGELOG.md [Unreleased] to call out the waste-attribution change. All 347 TS tests pass.

The four "additional findings" mentioned in the Devin badge are not visible from the GitHub PR view (only the one inlined finding is in the review body), so I cannot address them here. If a reviewer can surface them as line-anchored comments or paste them into this thread, happy to take another pass.

# Conflicts:
#	CHANGELOG.md
#	packages/analyze/CHANGELOG.md
#	packages/analyze/src/cost.ts
#	packages/analyze/src/waste.ts
@willwashburn willwashburn merged commit a016b54 into main Apr 26, 2026
@willwashburn willwashburn deleted the fix/reasoning-pricing-32 branch April 26, 2026 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix reasoning-token pricing semantics and preserve models.dev reasoning tariffs

1 participant