Skip to content

v0.7.1 — reasoning + billable output normalization (ADR-0016)

Latest

Choose a tag to compare

@dratner dratner released this 31 May 21:25
6d9a7aa

Patch release. Investigated as a Gemini bug, shipped as the cross-provider normalization the toolkit was missing.

The bug

A chat to gemini-3.1-pro-preview-customtools with MaxTokens=1024 returned OutputTokens=36, StopReason=MAX_TOKENS. Confusing because 36 is nowhere near 1024 — the model had silently burned ~988 tokens on thinking and the toolkit dropped them. Worse, the bug wasn't Gemini-specific: every provider's wire output_tokens had a different meaning, so Usage.OutputTokens was silently inconsistent across providers.

Fix (ADR-0016)

Two new fields on llms.Usage:

ReasoningTokens      int  // separately-metered thinking; ADDITIVE to OutputTokens
BillableOutputTokens int  // what the provider charges as \"output\"

OutputTokens is redefined as visible-only across every adapter. Budget math holds: a length truncation fires when Input + Output + Reasoning approaches the cap. Cost math: read BillableOutputTokens (one field, one semantic, no per-provider branching).

Provider OutputTokens ReasoningTokens BillableOutputTokens
Gemini CandidatesTokenCount ThoughtsTokenCount Candidates + Thoughts
OpenAI Responses wire − OutputTokensDetails.ReasoningTokens OutputTokensDetails.ReasoningTokens wire.OutputTokens
OpenAI Chat Completions (vLLM) same subtraction via CompletionTokensDetails same wire.CompletionTokens
Anthropic wire.OutputTokens (includes thinking when on; not separable) 0 (SDK has no field) mirrors OutputTokens
Ollama eval_count 0 mirrors OutputTokens

Breaking semantic for OpenAI o-series

Pre-v0.7.1, Usage.OutputTokens was the OpenAI wire output_tokens, which for o-series silently included reasoning. Now it's visible-only. Migration: code reading OutputTokens as the billing-relevant number should read BillableOutputTokens instead (carries the exact same value the old field did). Non-reasoning OpenAI calls are unaffected — reasoning_tokens=0 makes the subtraction a no-op.

Pre-1.0 is the right time to make this consistent; carrying the inconsistency to v1.0 would have been worse.

Demo bonus (examples/chat)

The bubble footer surfaces the breakdown so a max_tokens stop on a small visible output is self-explanatory:

gemini-3.1-pro-preview-customtools · max tokens · 12 in / 102 out / 918 reasoning · 1020 billable · 10875 ms

Non-reasoning models keep the terse <in> in / <out> out form (billable would just duplicate out).

Compatibility

Additive new fields. Existing fields keep their names. The only semantic shift is OpenAI o-series OutputTokens. No MAESTRO_DIVERGENCES.md row — Maestro never surfaced either field; nothing to diverge from.

Pre-1.0; v0.x minor versions may break.

🤖 Generated with Claude Code