feat: track reasoning token usage #670
Capture provider-reported reasoning-token breakdowns alongside output tokens without changing output token totals. Carry the field through model usage aggregation and add coverage for parsing, facade tracking, and deltas. Refs #665
Include reasoning token counts in the local model usage summary while preserving output and total token semantics. Telemetry remains unchanged. Refs #665
When providers return reasoning content without a numeric usage breakdown, estimate reasoning tokens from that content while preserving provider-reported output and total token counts. Refs #665
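The provider-reported capture described in the first commit can be sketched roughly as below. This is a minimal illustration, not the PR's code: the detail-object names come from the PR description, while the lookup order, the `_DETAIL_KEYS` constant, and the handling of non-integer values are assumptions.

```python
from collections.abc import Mapping
from typing import Any, Optional

# Assumed lookup order; the key names are the ones listed in the PR description.
_DETAIL_KEYS = ("completion_tokens_details", "output_tokens_details")


def extract_reasoning_token_count(usage: Optional[Mapping[str, Any]]) -> Optional[int]:
    """Return the provider-reported reasoning token count, or None if absent."""
    if not usage:
        return None
    # Check the nested detail objects first, then the top-level variant.
    containers = [usage.get(key) for key in _DETAIL_KEYS]
    containers.append(usage)
    for container in containers:
        if not isinstance(container, Mapping):
            continue
        value = container.get("reasoning_tokens")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
            return value
    return None
```

A reported count of zero is returned as `0` rather than `None`, so "the provider said zero" stays distinguishable from "the provider said nothing".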
PR #670 Review - feat: track reasoning token usage
Summary
Adds reasoning-token tracking to model usage stats while preserving provider output-token semantics:
The change is well-scoped, the tests are thorough, and the "declare, don't orchestrate" contract is preserved (no engine flow changes).
Findings
Type inconsistency from
Greptile Summary
This PR adds first-class reasoning token tracking to the model usage pipeline. Provider-reported counts are parsed from OpenAI-compatible fields (
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/models/usage.py | Adds reasoning_tokens and reasoning_token_count_source to TokenUsageStats with Pydantic validator, use_enum_values=True, and correct merge logic in extend(). |
| packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py | Adds extract_reasoning_token_count (reads provider fields) and fill_reasoning_token_count_from_content (tiktoken estimation); estimation is correctly skipped when provider count already exists or usage is None. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/types.py | Adds reasoning_tokens/reasoning_token_count_source to the Usage dataclass with post_init pair-consistency validation; clean change. |
| packages/data-designer-engine/src/data_designer/engine/models/registry.py | Extends delta computation and log formatting to include reasoning token counts; the ESTIMATED label logic and delta source attribution are correct. |
| packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py | New shared helper that lazy-loads tiktoken via the facade and caches the cl100k_base encoder; replaces the inline _get_tokenizer that was duplicated in column_statistics_calculations.py. |
| packages/data-designer-engine/src/data_designer/engine/models/facade.py | Passes reasoning_tokens and reasoning_token_count_source through from Usage to TokenUsageStats; straightforward propagation with no semantic changes. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/adapters/anthropic_translation.py | Calls fill_reasoning_token_count_from_content after extract_usage so Anthropic thinking blocks trigger estimation when the provider omits a count; correct and symmetric with the OpenAI-compatible path. |
| packages/data-designer-engine/tests/engine/models/clients/test_parsing.py | New tests cover provider-reported, estimated, missing, zero-count, no-usage, and source-preference cases; monkeypatches count_text_tokens to isolate tiktoken from unit tests. |
| packages/data-designer-engine/tests/engine/utils/test_token_counting.py | Verifies correctness, caching behaviour, and that importing the module does not eagerly trigger tiktoken initialization (subprocess-based lazy-import test). |
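The pair-consistency check on `Usage` and the source merge in `extend()` mentioned in the table might look something like the sketch below. The enum name `TokenCountSource`, the field set on `Usage`, and the merge rule (any estimated contribution makes the aggregate estimated) are assumptions for illustration; the PR's actual definitions may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TokenCountSource(str, Enum):  # hypothetical enum name
    PROVIDER = "provider"
    ESTIMATED = "estimated"


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int
    reasoning_tokens: Optional[int] = None
    reasoning_token_count_source: Optional[TokenCountSource] = None

    def __post_init__(self) -> None:
        # The count and its source must be set together or omitted together.
        has_count = self.reasoning_tokens is not None
        has_source = self.reasoning_token_count_source is not None
        if has_count != has_source:
            raise ValueError(
                "reasoning_tokens and reasoning_token_count_source must be set together"
            )


def merge_token_count_sources(
    a: Optional[TokenCountSource], b: Optional[TokenCountSource]
) -> Optional[TokenCountSource]:
    """Assumed rule: a total that includes any estimate is itself an estimate."""
    if a is None:
        return b
    if b is None:
        return a
    if TokenCountSource.ESTIMATED in (a, b):
        return TokenCountSource.ESTIMATED
    return TokenCountSource.PROVIDER
```

Under this assumed rule, a cumulative total is labelled as provider-reported only when every contribution to it was provider-reported.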
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Provider response JSON] --> B[extract_usage]
B --> C{reasoning_tokens\nin provider fields?}
C -->|Yes| D["source = PROVIDER\nreasoning_tokens = N"]
C -->|No| E["reasoning_tokens = None\nsource = None"]
D --> F[Usage dataclass]
E --> F
F --> G[fill_reasoning_token_count_from_content]
G --> H{usage is None?}
H -->|Yes| I[return None]
H -->|No| J{reasoning_tokens\nalready set?}
J -->|Yes| K[return usage as-is]
J -->|No| L{reasoning_content\npresent?}
L -->|No| K
L -->|Yes| M[count_text_tokens via tiktoken]
M --> N["source = ESTIMATED\nreasoning_tokens = estimated_count"]
N --> O[new Usage via dataclasses.replace]
K --> P[ChatCompletionResponse]
O --> P
P --> Q[ModelFacade.completion]
Q --> R["TokenUsageStats.extend()\nmerge_token_count_sources"]
R --> S[Cumulative TokenUsageStats]
S --> T{log_model_usage}
T -->|source == ESTIMATED| U["reasoning=N (estimated)\n+ tiktoken note"]
T -->|source == PROVIDER| V["reasoning=N"]
T -->|reasoning_tokens is None| W[no reasoning field]
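The estimation branch of the flowchart (nodes G through O) can be approximated in a few lines. This is a sketch under stated assumptions: the `Usage` class here is a trimmed stand-in, the source is stored as a plain string, and the token counter is passed in as a parameter so the example runs without tiktoken. The real helper's signature is not shown in this PR page.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Usage:  # trimmed stand-in for the canonical Usage dataclass
    output_tokens: int
    reasoning_tokens: Optional[int] = None
    reasoning_token_count_source: Optional[str] = None


def fill_reasoning_token_count_from_content(
    usage: Optional[Usage],
    reasoning_content: Optional[str],
    count_text_tokens: Callable[[str], int],  # injected here; an assumption
) -> Optional[Usage]:
    if usage is None:
        return None  # no provider usage at all: nothing to annotate
    if usage.reasoning_tokens is not None:
        return usage  # a provider-reported count always wins
    if not reasoning_content:
        return usage  # nothing to estimate from
    # Only the reasoning fields change; output_tokens is left untouched.
    return replace(
        usage,
        reasoning_tokens=count_text_tokens(reasoning_content),
        reasoning_token_count_source="estimated",
    )


# Usage example with a whitespace word counter standing in for tiktoken.
estimated = fill_reasoning_token_count_from_content(
    Usage(output_tokens=40), "think step by step", lambda text: len(text.split())
)
```

Because `dataclasses.replace` returns a new object, the provider-reported `output_tokens` on the original `Usage` is never mutated.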
Reviews (6): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..."
eric-tramel
left a comment
Is it really necessary to make these estimates with an alternate tokenizer? This is added overhead that can add up quickly for large tasks.
What are the contexts in which token usage isn't reported? Main public APIs (OAI/Ant/OR) should already be reporting token usage on request responses. Private APIs might not account exactly because the thinking is hidden, but they should be providing the correct token count for $ accounting.
In the case of self-host, we should also expect to get token response counts.
Where are we seeing this happen and can we push this to an uncounted edge case and not take on added effort of tiktoken etc in such cases?
Anthropic and Nvidia build hosted open models don't report reasoning token counts. We already have the tiktoken overhead used for column statistics.
andreatgretel
left a comment
Approved after a1b8d864. My review threads are resolved, and the no-provider-usage behavior is now explicit and tested. Looks good from my side.
📋 Summary
Adds reasoning token tracking to model usage stats while preserving provider output token semantics. Provider-reported reasoning counts are recorded exactly when available; when a provider returns reasoning/thinking content and normal usage but omits the reasoning-token breakdown, DD estimates the reasoning count using the shared tiktoken cl100k_base helper and labels it as estimated.
🔗 Related Issue
Fixes #665
🔄 Changes
- reasoning_tokens and reasoning_token_count_source to token usage stats.
- completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, and top-level variants.
- usage entirely.
- Usage objects and aggregated TokenUsageStats.
- (estimated) labels when applicable.
- reasoning_token_count terminology and return int | None where applicable.
🔍 Attention Areas
- usage.py: TokenUsageStats.reasoning_tokens is now nullable and accompanied by reasoning_token_count_source.
- types.py: canonical Usage now validates reasoning count/source consistency when clients construct responses.
- parsing.py: provider-reported reasoning counts are parsed separately from estimated counts derived from reasoning content.
- registry.py: the usage summary omits reasoning counts when providers return neither a count nor reasoning content.
- token_counting.py: tiktoken is lazy-loaded and only accessed when token estimation is actually needed.
🧪 Testing
- PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer-engine/tests/engine/models packages/data-designer-engine/tests/engine/analysis packages/data-designer/tests/test_lazy_imports.py -q (589 passed)
- PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer/tests/test_lazy_imports.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py -q (159 passed)
- uv run --group dev ruff check packages/data-designer-config/src/data_designer/lazy_heavy_imports.py packages/data-designer-engine/src/data_designer/engine/models/clients/types.py packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py
- uv run --group dev ruff format --check packages/data-designer-config/src/data_designer/lazy_heavy_imports.py packages/data-designer-engine/src/data_designer/engine/models/clients/types.py packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py
✅ Checklist