
feat: track reasoning token usage #670

Merged
nabinchha merged 11 commits into main from nmulepati/feat-665-reasoning-token-usage
May 18, 2026

Conversation


@nabinchha nabinchha commented May 15, 2026

📋 Summary

Adds reasoning token tracking to model usage stats while preserving provider output token semantics. Provider-reported reasoning counts are recorded exactly when available; when a provider returns reasoning/thinking content and normal usage but omits the reasoning-token breakdown, DD estimates the reasoning count using the shared tiktoken cl100k_base helper and labels it as estimated.

🔗 Related Issue

Fixes #665

🔄 Changes

  • Add reasoning_tokens and reasoning_token_count_source to token usage stats.
  • Parse provider-reported reasoning counts from OpenAI-compatible usage fields, including completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, and top-level variants.
  • Estimate missing reasoning counts from returned reasoning/thinking content when provider usage exists but no reasoning count is reported.
  • Do not synthesize usage stats when a provider omits usage entirely.
  • Validate the reasoning count/source pair on both canonical Usage objects and aggregated TokenUsageStats.
  • Lazy-load tiktoken through the shared lazy import facade so normal chat parsing does not eagerly import tokenizer code.
  • Reuse the same tiktoken-based helper for reasoning estimates and column statistics.
  • Update model usage summary logging to show reasoning counts only when known, with (estimated) labels when applicable.
  • Clarify internal helper names and annotations so count-oriented helpers use reasoning_token_count terminology and return int | None where applicable.
  • Add tests for provider-reported, estimated, missing, zero, no-usage, and source-validation cases.
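
The field lookup described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `find_reasoning_token_count` and the lookup order are assumptions, and only the field names come from the description.

```python
from typing import Any, Mapping, Optional

# Nested detail objects named in the PR description; the order is an assumption.
_NESTED_DETAIL_KEYS = ("completion_tokens_details", "output_tokens_details")


def find_reasoning_token_count(usage: Optional[Mapping[str, Any]]) -> Any:
    """Return the raw provider-reported reasoning count, or None if absent."""
    if not usage:
        return None
    for key in _NESTED_DETAIL_KEYS:
        details = usage.get(key)
        # Providers may send null for the details object, so check the type first.
        if isinstance(details, Mapping) and details.get("reasoning_tokens") is not None:
            return details["reasoning_tokens"]
    # Fall back to the top-level variant; a missing key yields None.
    return usage.get("reasoning_tokens")
```

A reported count of 0 is returned as 0 rather than None, which keeps "provider said zero" distinct from "provider said nothing".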

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • usage.py — TokenUsageStats.reasoning_tokens is now nullable and accompanied by reasoning_token_count_source.
  • types.py — canonical Usage now validates reasoning count/source consistency when clients construct responses.
  • parsing.py — provider-reported reasoning counts are parsed separately from estimated counts derived from reasoning content.
  • registry.py — the usage summary omits reasoning counts when providers return neither a count nor reasoning content.
  • token_counting.py — tiktoken is lazy-loaded and only accessed when token estimation is actually needed.
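
For reviewers looking at the lazy-loading point, one way such a helper might look is sketched below. The function names match those mentioned elsewhere in this PR, but the bodies are assumptions, as is the optional `tokenizer` parameter, which is a hypothetical seam for tests. It also assumes tiktoken is installed at call time.

```python
import importlib
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=1)
def get_cl100k_base_tokenizer() -> Any:
    """Import tiktoken on first use and cache the cl100k_base encoder."""
    # Importing inside the function keeps module import free of tokenizer code.
    tiktoken = importlib.import_module("tiktoken")
    return tiktoken.get_encoding("cl100k_base")


def count_text_tokens(text: str, tokenizer: Any = None) -> int:
    """Count tokens in text; `tokenizer` is a hypothetical override for tests."""
    if not text:
        return 0
    encoder = tokenizer if tokenizer is not None else get_cl100k_base_tokenizer()
    return len(encoder.encode(text))
```

Because the import happens inside the cached function, code paths that never estimate a count never load the tokenizer.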

🧪 Testing

  • PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer-engine/tests/engine/models packages/data-designer-engine/tests/engine/analysis packages/data-designer/tests/test_lazy_imports.py -q (589 passed)
  • PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer/tests/test_lazy_imports.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py -q (159 passed)
  • uv run --group dev ruff check packages/data-designer-config/src/data_designer/lazy_heavy_imports.py packages/data-designer-engine/src/data_designer/engine/models/clients/types.py packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py
  • uv run --group dev ruff format --check packages/data-designer-config/src/data_designer/lazy_heavy_imports.py packages/data-designer-engine/src/data_designer/engine/models/clients/types.py packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py
  • Unit tests added/updated
  • E2E tests added/updated (N/A — usage accounting covered by unit/model tests)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A — no architecture docs needed)

nabinchha added 6 commits May 15, 2026 15:33
Capture provider-reported reasoning-token breakdowns alongside output tokens without changing output token totals. Carry the field through model usage aggregation and add coverage for parsing, facade tracking, and deltas.

Refs #665
Include reasoning token counts in the local model usage summary while preserving output and total token semantics. Telemetry remains unchanged.

Refs #665
When providers return reasoning content without a numeric usage breakdown, estimate reasoning tokens from that content while preserving provider-reported output and total token counts.

Refs #665
@nabinchha nabinchha requested a review from a team as a code owner May 15, 2026 21:37
@github-actions

PR #670 Review — feat: track reasoning token usage

Summary

Adds reasoning-token tracking to model usage stats while preserving provider output-token semantics:

  • New TokenCountSource enum (PROVIDER / ESTIMATED) and two new fields on TokenUsageStats / Usage: reasoning_tokens and reasoning_token_count_source.
  • extract_usage now reads provider counts from completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, and a top-level reasoning_tokens variant.
  • New fill_reasoning_tokens_from_content estimates a count via tiktoken (cl100k_base) when the provider returns reasoning content but no count.
  • New shared data_designer.engine.utils.token_counting module collapses two duplicate tokenizer helpers (one in column_statistics_calculations.py, one new) into a single count_text_tokens / get_cl100k_base_tokenizer pair.
  • log_model_usage shows reasoning=N (or reasoning=N (estimated)) only when known, plus a one-line note when the count was estimated.
  • Delta computation in get_usage_deltas propagates reasoning counts and their source.
  • Tests cover provider-reported, estimated, missing, zero, and merge-precedence cases. 16 files / +485 / −47.

The change is well-scoped, the tests are thorough, and the "declare, don't orchestrate" contract is preserved (no engine flow changes).

Findings

Type inconsistency from use_enum_values=True — code smell

packages/data-designer-engine/src/data_designer/engine/models/usage.py:20

TokenUsageStats sets model_config = ConfigDict(use_enum_values=True), so once a TokenUsageStats is instantiated, reasoning_token_count_source is the string value, not the enum. This forces helpers to defensively compare against both:

  • registry.py:31: if source == TokenCountSource.ESTIMATED or source == TokenCountSource.ESTIMATED.value: — the or branch reads as a workaround.
  • merge_token_count_sources accepts TokenCountSource | str | None and returns str | None. Asymmetric signatures are easy to misuse from new call sites.

Because TokenCountSource subclasses both str and Enum, a member compares equal to its string value (for example, TokenCountSource.PROVIDER == "provider" is True), so the or branch is technically redundant. Either remove use_enum_values=True and keep the field typed as the enum end-to-end, or drop the enum side of the comparisons and treat the field as a string, with enum constants used only for writing. Picking one shape will simplify three call sites and make future call sites easier to type-check.
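
The equality behavior behind this finding can be checked with the standard library alone. The sketch below assumes the member values are "provider" and "estimated"; `is_estimated` is a hypothetical helper, not a function from the PR.

```python
from enum import Enum


class TokenCountSource(str, Enum):
    # Values assumed from the "provider" literal quoted in this review.
    PROVIDER = "provider"
    ESTIMATED = "estimated"


def is_estimated(source) -> bool:
    """One comparison covers both an enum member and its plain string value."""
    return source == TokenCountSource.ESTIMATED
```

A single comparison therefore behaves the same whether the field holds the enum member or the string that use_enum_values=True leaves behind.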

Cross-module import widens clients/types.py's dependency footprint

packages/data-designer-engine/src/data_designer/engine/models/clients/types.py:9

clients/types.py was previously a leaf module (stdlib only). It now imports TokenCountSource from models/usage.py, which transitively imports Pydantic. This is not an architectural violation per AGENTS.md (both stay in engine), but it does push Pydantic into the import path for any module that only wanted the lightweight Usage / AssistantMessage dataclasses. Two cheaper alternatives:

  • Define TokenCountSource in clients/types.py (or a small engine.models.token_sources module) and import it from usage.py instead.
  • Type the field as str | None on the Usage dataclass and reserve the enum for the Pydantic boundary.

Worth a sentence of justification in the PR if the current direction is intentional.

Usage is mutated in fill_reasoning_tokens_from_content

packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py:326

usage.reasoning_tokens = count_text_tokens(reasoning_content)
...
usage.reasoning_token_count_source = TokenCountSource.ESTIMATED

Usage is a plain @dataclass (not frozen), so this is legal, but it's a subtle shift from the rest of parsing.py, where Usage is constructed once and returned. If any caller ever holds a reference to the pre-fill Usage (e.g., logging, retries, batch dedup), they'll observe the mutation. A dataclasses.replace(usage, …) returning a new instance would match the construct-once style of extract_usage and remove the partial-mutation hazard in the except branch (where reasoning_tokens is set but the source label might not be — this is currently dodged because the assignment to reasoning_tokens happens inside the try, but it's a fragile arrangement).
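
A non-mutating variant along the lines suggested here might look like the sketch below. The `Usage` class is a reduced stand-in with most fields omitted, and the `count_tokens` parameter is a hypothetical injection point used to keep the example self-contained; the real function's signature may differ.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass
class Usage:
    # Reduced stand-in for the real Usage dataclass; other fields omitted.
    output_tokens: int
    reasoning_tokens: Optional[int] = None
    reasoning_token_count_source: Optional[str] = None


def fill_reasoning_tokens_from_content(
    usage: Optional[Usage],
    reasoning_content: Optional[str],
    count_tokens: Callable[[str], int],
) -> Optional[Usage]:
    """Return a new Usage with an estimated count, or the input unchanged."""
    if usage is None or usage.reasoning_tokens is not None or not reasoning_content:
        return usage
    try:
        estimated = count_tokens(reasoning_content)
    except Exception:
        # Estimation failed: neither field is touched, so no partial state exists.
        return usage
    return replace(
        usage,
        reasoning_tokens=estimated,
        reasoning_token_count_source="estimated",
    )
```

Both fields are set in a single `replace` call, so a caller holding the original instance never sees a count without its source label.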

extract_reasoning_tokens return type is Any

packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py:309

The function returns whatever providers shipped (commonly int or stringified int, both seen in the new tests). Callers immediately funnel through coerce_to_int_or_none. The honest signature is int | str | None; Any here is broader than the actual contract and weakens type-checking at the (single) call site.
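
To make the narrower contract concrete, here is a sketch of a coercion step that accepts exactly `int | str | None`. The name `coerce_to_int_or_none` appears in this review, but the body below is an assumption about its behavior, including the choice to reject booleans.

```python
from typing import Optional, Union


def coerce_to_int_or_none(value: Union[int, str, None]) -> Optional[int]:
    """Assumed behavior: ints and integer strings pass, anything else is None."""
    if isinstance(value, bool):
        # bool is an int subclass; treat it as malformed in this sketch.
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return None
    return None
```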

Redundant invariant enforcement

packages/data-designer-engine/src/data_designer/engine/models/usage.py:27 and :43

The Pydantic model_validator already guarantees that reasoning_tokens and reasoning_token_count_source are both set or both unset. The runtime check at the top of extend() re-verifies one half of that invariant. The duplication is harmless but the validator alone is sufficient — calls that violate the rule would fail when the next validation runs (e.g., dump/copy). If you want a guard at the API boundary too, fine; otherwise drop it.

Edge case: negative or vanishing reasoning deltas

packages/data-designer-engine/src/data_designer/engine/models/registry.py:181

get_token_delta returns None when current is None, even if prev had a non-zero count. Combined with the delta_reasoning is not None and delta_reasoning > 0 filter, a model whose provider stops reporting reasoning tokens mid-run will have its reasoning delta silently dropped. This is consistent with how delta_input > 0 works for the other counts, so probably fine — but there's no test pinning the "previously had reasoning, now None" case. Worth one small parametrize.
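
The dropped-delta case can be reproduced with a small sketch. The signature of `get_token_delta` and the helper `should_report` are assumptions made for illustration; only the behavior described in this finding is taken from the review.

```python
from typing import Optional


def get_token_delta(current: Optional[int], previous: Optional[int]) -> Optional[int]:
    """Assumed behavior: no delta when the current count is unknown."""
    if current is None:
        return None
    return current - (previous or 0)


def should_report(delta: Optional[int]) -> bool:
    # Mirrors the `delta_reasoning is not None and delta_reasoning > 0` filter.
    return delta is not None and delta > 0
```

With a snapshot of 40 and a current value of None, the delta is None and nothing is reported, which is the case the suggested parametrize would pin.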

Minor

  • packages/data-designer-engine/src/data_designer/engine/utils/__init__.py is a new empty (header-only) file. Per AGENTS.md, namespace packages avoid __init__.py. Confirm this utils/ is intended as a regular package (it sits below data_designer.engine, which is itself a regular subpackage, so this should be fine — but worth being intentional).
  • format_reasoning_tokens and get_token_delta are module-level helpers in registry.py but are only called from ModelRegistry. No real harm, but they have no other consumers and could be _format_reasoning_tokens / _get_token_delta to signal that. The "estimated" label string could also be a module-level constant rather than rebuilt via f-string on each call.
  • The PR description's checklist shows DCO sign-off unchecked. Confirm before merge if your workflow requires it.
  • No release-notes / CHANGELOG entry. The log format visible to users is changing (tokens: input=…, output=…, reasoning=… (estimated), total=…) — a one-line user-facing note is cheap.
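
The log line shape quoted in the last item, together with the constant suggested for the label, could be sketched as below. `format_token_summary` and `ESTIMATED_LABEL` are hypothetical names, and the source is compared as a plain string for simplicity.

```python
from typing import Optional

# Module-level constant for the label, as suggested in the review.
ESTIMATED_LABEL = " (estimated)"


def format_token_summary(
    input_tokens: int,
    output_tokens: int,
    total_tokens: int,
    reasoning_tokens: Optional[int] = None,
    source: Optional[str] = None,
) -> str:
    """Build the usage line, including reasoning only when a count is known."""
    parts = [f"input={input_tokens}", f"output={output_tokens}"]
    if reasoning_tokens is not None:
        label = ESTIMATED_LABEL if source == "estimated" else ""
        parts.append(f"reasoning={reasoning_tokens}{label}")
    parts.append(f"total={total_tokens}")
    return "tokens: " + ", ".join(parts)
```

Calling `format_token_summary(100, 50, 150, 20, "estimated")` yields `tokens: input=100, output=50, reasoning=20 (estimated), total=150`, while omitting the reasoning arguments drops the field entirely.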

Test coverage

Strong. New tests cover: OpenAI chat-completions and Responses-API field shapes, top-level reasoning_tokens, missing-count + reasoning-content estimation, provider-vs-estimated precedence, zero-count display, merge precedence (estimated wins), extend() invariant errors, delta propagation with TokenCountSource.PROVIDER. Gaps worth one parametrize each:

  • Snapshot has reasoning_tokens != None, current is None (delta drop case above).
  • extract_reasoning_tokens with malformed provider payloads (e.g., completion_tokens_details: None, reasoning_tokens: "abc" — the latter currently flows to coerce_to_int_or_none, which presumably returns None; pin it).
  • fill_reasoning_tokens_from_content when the provider supplied reasoning_tokens=0 (no-op since 0 is not None, but worth confirming).
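
The "estimated wins" merge precedence mentioned in the coverage summary can be illustrated with a short sketch. The function name comes from this review, but the body and the use of plain strings are assumptions.

```python
from typing import Optional


def merge_token_count_sources(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Assumed rule: any estimated input makes the merged total estimated."""
    if left is None and right is None:
        return None
    if left == "estimated" or right == "estimated":
        return "estimated"
    return "provider"
```

The reasoning is that a sum containing even one estimated term is no longer an exact provider figure, so the weaker label has to win.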

Verdict

Approvable with minor revisions. The functionality, layering direction, and tests are sound. Recommended pre-merge fixes:

  1. Resolve the TokenCountSource enum-vs-string ambiguity (drop use_enum_values=True or stop comparing against the enum identity at call sites). Highest-leverage change.
  2. Tighten extract_reasoning_tokens return type to int | str | None.
  3. Either justify or rework the clients/types.py → models/usage.py import.
  4. Optional: switch fill_reasoning_tokens_from_content to a dataclasses.replace style and add the two missing parametrize cases.

@greptile-apps

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR adds first-class reasoning token tracking to the model usage pipeline. Provider-reported counts are parsed from OpenAI-compatible fields (completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, top-level reasoning_tokens), and when a provider returns reasoning/thinking content but omits a count, the token total is estimated using a shared lazy-loaded cl100k_base tiktoken helper and labelled ESTIMATED in the aggregated stats.

  • parsing.py: extract_reasoning_token_count reads all known provider field locations; fill_reasoning_token_count_from_content estimates from content only when no provider count exists and skips estimation when usage is None (provider omitted the block entirely).
  • usage.py / types.py: Both TokenUsageStats (Pydantic) and Usage (dataclass) enforce a pair-consistency invariant — reasoning_tokens and reasoning_token_count_source must be set together or not at all.
  • token_counting.py: Consolidates the previously duplicated _get_tokenizer helper from column_statistics_calculations.py into a single lazy-loaded, lru_cache-backed utility, with a subprocess-based test verifying tiktoken is never eagerly imported.
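
The pair-consistency invariant on the dataclass side might be enforced along these lines. The `Usage` class here is a reduced stand-in with only the two relevant fields, and the error message is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    # Reduced stand-in; the real dataclass carries more fields.
    reasoning_tokens: Optional[int] = None
    reasoning_token_count_source: Optional[str] = None

    def __post_init__(self) -> None:
        has_count = self.reasoning_tokens is not None
        has_source = self.reasoning_token_count_source is not None
        # Both fields must be set together or left unset together.
        if has_count != has_source:
            raise ValueError(
                "reasoning_tokens and reasoning_token_count_source must be set together"
            )
```

Checking `is not None` rather than truthiness matters here, since a count of 0 with a source is a valid pair.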

Confidence Score: 5/5

Safe to merge — the change is additive, all new fields are nullable with validated invariants, and the estimation fallback is guarded so it never synthesizes a count when the provider omitted usage entirely.

The reasoning token fields are introduced as nullable with pair-consistency guards at both the dataclass and Pydantic model layers, so existing callers that don't supply them continue to work unchanged. The estimation path is strictly opt-in (only reached when usage is present but the count is absent), and use_enum_values=True on TokenUsageStats is handled correctly in all comparison sites. Delta computation and log formatting handle the None case cleanly. Test coverage is thorough across provider-reported, estimated, zero, and missing cases.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/models/usage.py | Adds reasoning_tokens and reasoning_token_count_source to TokenUsageStats with Pydantic validator, use_enum_values=True, and correct merge logic in extend(). |
| packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py | Adds extract_reasoning_token_count (reads provider fields) and fill_reasoning_token_count_from_content (tiktoken estimation); estimation is correctly skipped when provider count already exists or usage is None. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/types.py | Adds reasoning_tokens/reasoning_token_count_source to the Usage dataclass with __post_init__ pair-consistency validation; clean change. |
| packages/data-designer-engine/src/data_designer/engine/models/registry.py | Extends delta computation and log formatting to include reasoning token counts; the ESTIMATED label logic and delta source attribution are correct. |
| packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py | New shared helper that lazy-loads tiktoken via the facade and caches the cl100k_base encoder; replaces the inline _get_tokenizer that was duplicated in column_statistics_calculations.py. |
| packages/data-designer-engine/src/data_designer/engine/models/facade.py | Passes reasoning_tokens and reasoning_token_count_source through from Usage to TokenUsageStats; straightforward propagation with no semantic changes. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/adapters/anthropic_translation.py | Calls fill_reasoning_token_count_from_content after extract_usage so Anthropic thinking blocks trigger estimation when the provider omits a count; correct and symmetric with the OpenAI-compatible path. |
| packages/data-designer-engine/tests/engine/models/clients/test_parsing.py | New tests cover provider-reported, estimated, missing, zero-count, no-usage, and source-preference cases; monkeypatches count_text_tokens to isolate tiktoken from unit tests. |
| packages/data-designer-engine/tests/engine/utils/test_token_counting.py | Verifies correctness, caching behaviour, and that importing the module does not eagerly trigger tiktoken initialization (subprocess-based lazy-import test). |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Provider response JSON] --> B[extract_usage]
    B --> C{reasoning_tokens\nin provider fields?}
    C -->|Yes| D["source = PROVIDER\nreasoning_tokens = N"]
    C -->|No| E["reasoning_tokens = None\nsource = None"]
    D --> F[Usage dataclass]
    E --> F
    F --> G[fill_reasoning_token_count_from_content]
    G --> H{usage is None?}
    H -->|Yes| I[return None]
    H -->|No| J{reasoning_tokens\nalready set?}
    J -->|Yes| K[return usage as-is]
    J -->|No| L{reasoning_content\npresent?}
    L -->|No| K
    L -->|Yes| M[count_text_tokens via tiktoken]
    M --> N["source = ESTIMATED\nreasoning_tokens = estimated_count"]
    N --> O[new Usage via dataclasses.replace]
    K --> P[ChatCompletionResponse]
    O --> P
    P --> Q[ModelFacade.completion]
    Q --> R["TokenUsageStats.extend()\nmerge_token_count_sources"]
    R --> S[Cumulative TokenUsageStats]
    S --> T{log_model_usage}
    T -->|source == ESTIMATED| U["reasoning=N (estimated)\n+ tiktoken note"]
    T -->|source == PROVIDER| V["reasoning=N"]
    T -->|reasoning_tokens is None| W[no reasoning field]

Reviews (6): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..."


@eric-tramel eric-tramel left a comment


Is it really necessary to make these estimates with an alternate tokenizer? This is added overhead that can add up quick for large tasks.

What are the contexts in which token usage isn't reported? Main public APIs (OAI/Ant/OR) should already be reporting token usage on request responses. Private APIs might not account exactly because the thinking is hidden, but they should be providing the correct token count for $ accounting.

In the case of self-host, we should also expect to get token response counts.

Where are we seeing this happen and can we push this to an uncounted edge case and not take on added effort of tiktoken etc in such cases?

@nabinchha

> Is it really necessary to make these estimates with an alternate tokenizer? This is added overhead that can add up quick for large tasks.
>
> What are the contexts in which token usage isn't reported? Main public APIs (OAI/Ant/OR) should already be reporting token usage on request responses. Private APIs might not account exactly because the thinking is hidden, but they should be providing the correct token count for $ accounting.
>
> In the case of self-host, we should also expect to get token response counts.
>
> Where are we seeing this happen and can we push this to an uncounted edge case and not take on added effort of tiktoken etc in such cases?

Anthropic and Nvidia Build hosted open models don't report reasoning token counts. We already have the tiktoken overhead from column statistics.

Comment thread packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py Outdated
Comment thread packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py Outdated
Comment thread packages/data-designer-engine/tests/engine/models/clients/test_parsing.py Outdated

@andreatgretel andreatgretel left a comment


Approved after a1b8d864. My review threads are resolved, and the no-provider-usage behavior is now explicit and tested. Looks good from my side.

@nabinchha nabinchha merged commit 7199762 into main May 18, 2026
49 checks passed
@nabinchha nabinchha deleted the nmulepati/feat-665-reasoning-token-usage branch May 18, 2026 18:15

Development

Successfully merging this pull request may close these issues.

Track reasoning token counts separately from output tokens

3 participants