Fix reasoning-token pricing semantics (#32) by willwashburn · Pull Request #73 · AgentWorkforce/burn

willwashburn · 2026-04-25T01:51:19Z

Fixes #32.

Summary

Two pricing-correctness bugs that distorted reported spend whenever reasoning tokens were involved:

Codex reasoning was being double-billed. costForUsage always added usage.reasoning * rate.output on top of usage.output, but Codex's output_tokens already includes reasoning (per ccusage's data-loader semantics). Reasoning was billed twice.
cost.reasoning from models.dev was being silently discarded. flatten() only kept input / output / cache_read / cache_write, so any model with a distinct reasoning tariff (Alibaba Qwen reasoning models, ~70 entries in the current snapshot) couldn't be priced correctly.

User-visible impact (cost numbers will change)

This is a pricing correction. Reported burn summary / burn by-tool / burn compare totals will move downward for any session with non-zero reasoning tokens:

Codex sessions: drops by reasoning_tokens × output_rate per turn. On the issue's documented 10-turn sample (input=660,698, output=52,676, reasoning=29,070, cacheRead=5,618,688):
- Before: $4.282607
- After: $3.846557
- Delta: −$0.43605 (~11.3% of the Codex slice)
Anthropic Claude sessions: unchanged. Default reasoningMode: 'same_as_output' preserves the existing billing path.
Models with cost.reasoning (Qwen reasoning, etc.): now priced at the published reasoning tariff instead of falling through at the output rate.

Design

ModelCost gains reasoning?: number and reasoningMode: 'included_in_output' | 'separate' | 'same_as_output'.
flatten() preserves cost.reasoning and tags entries with reasoningMode: 'separate'. Models without a reasoning tariff default to same_as_output.
costForUsage(usage, model, pricing, { reasoningMode? }) branches on the resolved mode. costForTurn infers included_in_output for source: 'codex' so reasoning is preserved on TurnRecord for observability but not billed twice.
usage.reasoning is unchanged in TurnRecord — the bug is in pricing, not data capture (per issue non-goals).

Test plan

pnpm install && pnpm run test:ts — 346/346 tests pass
New tests in packages/analyze/src/cost.test.ts cover all four acceptance criteria from the issue:
- Codex input=1M / output=500k / reasoning=200k bills 10.0 (not 13.0)
- Codex regression: the documented 11.3% overstatement scenario reproduces and matches the corrected number
- Synthetic separate-tariff model with input=1, output=4, reasoning=8 and 1M of each bucket bills 13
- flatten() preserves cost.reasoning and tags reasoningMode: 'separate'
- Builtin models.dev snapshot has at least one separate-tariff model after flattening
Existing Claude reasoning behavior unchanged (test renamed/retargeted to same_as_output semantics)
costForTurn selects the right mode automatically by source; explicit reasoningMode option overrides
Other PricingTable literals in mcp and plan-usage tests updated for the new required field

Notes for reviewers

README gains a short "Reasoning-token pricing semantics" subsection under Packages.
packages/analyze/CHANGELOG.md [Unreleased] documents the fix in detail; root CHANGELOG.md [Unreleased] calls out the user-visible cost change because totals will move on next pull.
flatten is now exported from @relayburn/analyze for callers that want to build a PricingTable from in-memory models.dev payloads (used by the new tests).

…tariffs (#32) Two pricing-correctness bugs that distorted reported spend whenever reasoning tokens were involved: 1. Codex `usage.reasoning` was being double-billed at the output rate even though Codex's `output_tokens` already includes reasoning. On the issue's 10-turn sample (660k input / 53k output / 29k reasoning / 5.6M cacheRead) this overstates cost by $0.43 / 11.3% of the slice. 2. `cost.reasoning` from the `models.dev` snapshot was discarded during `flatten()`, so any model with a distinct reasoning tariff (Alibaba Qwen reasoning models, etc.) couldn't be priced correctly. Fix: - Extend `ModelCost` with `reasoning?: number` and `reasoningMode: 'included_in_output' | 'separate' | 'same_as_output'`. - `flatten()` preserves `cost.reasoning` and tags the entry `separate`. Models without a distinct reasoning tariff default to `same_as_output` (preserves existing Claude billing). - `costForUsage` branches on the resolved mode. `costForTurn` infers `included_in_output` for `source: 'codex'` so reasoning is recorded but not billed on top of output. - `usage.reasoning` is still preserved in `TurnRecord` for observability. The bug was in pricing, not data capture. Tests cover all four acceptance criteria from the issue: Codex input=1M/output=500k/reasoning=200k bills 10.0 (not 13.0); the synthetic `separate` model bills 13 for 1M of each bucket; the documented Codex regression scenario; `flatten` preserves reasoning; the builtin snapshot retains at least one separate-tariff model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devin-ai-integration

Devin Review found 1 potential issue.

🐛 1 issue in files not directly in the diff

🐛 `costForTurnLocal` in waste.ts still double-bills reasoning tokens for Codex and ignores separate reasoning tariffs (`packages/analyze/src/waste.ts:351-362`)

waste.ts has its own local cost computation (costForTurnLocal at line 351) that was not updated with the new reasoning-mode semantics introduced by this PR. It unconditionally bills reasoning at the output rate ((u.reasoning / PER_MILLION) * rate.output at line 358), regardless of the turn's source or the model's reasoningMode. This means:

Codex turns are still double-billed in waste attribution — the exact bug this PR is supposed to fix. Codex output_tokens already includes reasoning, but costForTurnLocal adds reasoning cost on top.
Models with a separate reasoning tariff (e.g. Alibaba Qwen) are billed at the output rate instead of their distinct reasoning tariff.

costForTurnLocal is called at packages/analyze/src/waste.ts:93 to compute sessionGrand (total session cost for waste attribution), so waste-attribution totals will be inconsistent with costForTurn from cost.ts for any session involving reasoning tokens.

View 4 additional findings in Devin Review.

…iew on #73) `packages/analyze/src/waste.ts` had a private `costForTurnLocal` that duplicated `costForUsage`'s arithmetic but predated the reasoning-mode work in this PR. It unconditionally billed `usage.reasoning` at the output rate, so: 1. Codex turns were still double-billed in `sessionGrand` / `grandCost` / `unattributedCost` — the exact bug #32 was supposed to fix, just in a different code path. 2. Models with a separate reasoning tariff (e.g. Alibaba Qwen) were billed at the output rate instead of `rate.reasoning`. The fix is to delegate to the canonical `costForTurn`, which already threads `reasoningModeForSource` (Codex -> `included_in_output`) and honors `ModelCost.reasoningMode` per model. That keeps waste totals consistent with `cost.ts` / `costForTurn` for any session involving reasoning tokens. Adds a regression test that constructs a Codex turn with input/output/reasoning and asserts `attributeWaste(...).grandTotal` does not include `reasoning x output_rate`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

willwashburn · 2026-04-25T02:20:02Z

Re: Devin review — costForTurnLocal in waste.ts (lines 351-362) double-billing reasoning for Codex and ignoring separate reasoning tariffs:

Valid finding. Fixed in d1ad409.

packages/analyze/src/waste.ts had its own private cost arithmetic that predated the reasoning-mode work in this PR. It now delegates to the canonical costForTurn from cost.ts, so waste-attribution sessionGrand / grandCost / unattributedCost honor:

Codex source -> included_in_output (reasoning is not billed on top of output).
Per-model ModelCost.reasoningMode (e.g. Qwen reasoning models with separate use their distinct rate.reasoning).

Added a regression test in waste.test.ts (session grand total honors source-aware reasoning semantics ...) that constructs a Codex turn with input=1000 / output=500 / reasoning=200 and asserts attributeWaste(...).grandTotal does not include reasoning x output_rate. Updated packages/analyze/CHANGELOG.md [Unreleased] to call out the waste-attribution change. All 347 TS tests pass.

The four "additional findings" mentioned in the Devin badge are not visible from the GitHub PR view (only the one inlined finding is in the review body), so I cannot address them here. If a reviewer can surface them as line-anchored comments or paste them into this thread, happy to take another pass.

# Conflicts: # CHANGELOG.md # packages/analyze/CHANGELOG.md # packages/analyze/src/cost.ts # packages/analyze/src/waste.ts

devin-ai-integration Bot reviewed Apr 25, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into fix/reasoning-pricing-32

8381e90

# Conflicts: # CHANGELOG.md # packages/analyze/CHANGELOG.md # packages/analyze/src/cost.ts # packages/analyze/src/waste.ts

willwashburn merged commit a016b54 into main Apr 26, 2026

willwashburn deleted the fix/reasoning-pricing-32 branch April 26, 2026 00:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix reasoning-token pricing semantics (#32)#73

Fix reasoning-token pricing semantics (#32)#73
willwashburn merged 3 commits intomainfrom
fix/reasoning-pricing-32

willwashburn commented Apr 25, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

willwashburn commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

willwashburn commented Apr 25, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

User-visible impact (cost numbers will change)

Design

Test plan

Notes for reviewers

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

🐛 costForTurnLocal in waste.ts still double-bills reasoning tokens for Codex and ignores separate reasoning tariffs (packages/analyze/src/waste.ts:351-362)

Uh oh!

willwashburn commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

willwashburn commented Apr 25, 2026 •

edited by devin-ai-integration Bot

Loading

🐛 `costForTurnLocal` in waste.ts still double-bills reasoning tokens for Codex and ignores separate reasoning tariffs (`packages/analyze/src/waste.ts:351-362`)