feat: track reasoning token usage by nabinchha · Pull Request #670 · NVIDIA-NeMo/DataDesigner

nabinchha · 2026-05-15T21:37:34Z

📋 Summary

Adds reasoning token tracking to model usage stats while preserving provider output token semantics. Provider-reported reasoning counts are recorded exactly when available. When a provider returns reasoning/thinking content and normal usage but omits the reasoning-token breakdown, DataDesigner (DD) estimates the reasoning count using the shared tiktoken cl100k_base helper and labels it as estimated.

🔗 Related Issue

Fixes #665

🔄 Changes

Add reasoning_tokens and reasoning_token_count_source to token usage stats.
Parse provider-reported reasoning counts from OpenAI-compatible usage fields, including completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, and top-level variants.
Estimate missing reasoning counts from returned reasoning/thinking content when provider usage exists but no reasoning count is reported.
Do not synthesize usage stats when a provider omits usage entirely.
Validate the reasoning count/source pair on both canonical Usage objects and aggregated TokenUsageStats.
Lazy-load tiktoken through the shared lazy import facade so normal chat parsing does not eagerly import tokenizer code.
Reuse the same tiktoken-based helper for reasoning estimates and column statistics.
Update model usage summary logging to show reasoning counts only when known, with (estimated) labels when applicable.
Clarify internal helper names and annotations so count-oriented helpers use reasoning_token_count terminology and return int | None where applicable.
Add tests for provider-reported, estimated, missing, zero, no-usage, and source-validation cases.
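
The lookup order described in the list above can be sketched as follows. This is a minimal illustration rather than the PR's code: the function name `find_reasoning_token_count` and the plain-dict payload shape are assumptions, and only the three field locations come from the description.

```python
from __future__ import annotations

from typing import Any

# Nested locations named in the PR description, checked in order.
_NESTED_KEYS = ("completion_tokens_details", "output_tokens_details")


def find_reasoning_token_count(usage: dict[str, Any] | None) -> int | str | None:
    """Return the raw provider-reported reasoning count, or None if absent."""
    if not usage:
        return None
    for key in _NESTED_KEYS:
        details = usage.get(key)
        # Providers may send null or a non-dict here; skip those quietly.
        if isinstance(details, dict) and details.get("reasoning_tokens") is not None:
            return details["reasoning_tokens"]
    # Fall back to the top-level variant.
    return usage.get("reasoning_tokens")
```

The sketch returns the value exactly as shipped (an int or a stringified int), so a count of 0 is kept distinct from a missing count and coercion stays a separate step.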

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

usage.py — TokenUsageStats.reasoning_tokens is now nullable and accompanied by reasoning_token_count_source.
types.py — canonical Usage now validates reasoning count/source consistency when clients construct responses.
parsing.py — provider-reported reasoning counts are parsed separately from estimated counts derived from reasoning content.
registry.py — the usage summary omits reasoning counts when providers return neither a count nor reasoning content.
token_counting.py — tiktoken is lazy-loaded and only accessed when token estimation is actually needed.
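
To illustrate the lazy-loading point for token_counting.py, here is a rough sketch. It uses a function-local import in place of the project's lazy import facade, which is not shown in this PR page, and the `encoder` parameter is a hypothetical injection point added only so the example can be exercised without tiktoken.

```python
from __future__ import annotations

from functools import lru_cache
from typing import Any


@lru_cache(maxsize=1)
def get_cl100k_base_tokenizer() -> Any:
    """Build the cl100k_base encoder on first use and cache it."""
    # Importing inside the function keeps tiktoken out of module import time.
    import tiktoken

    return tiktoken.get_encoding("cl100k_base")


def count_text_tokens(text: str, encoder: Any | None = None) -> int:
    """Count tokens in text; `encoder` is a hypothetical hook for tests."""
    if not text:
        return 0
    enc = encoder if encoder is not None else get_cl100k_base_tokenizer()
    return len(enc.encode(text))
```

With this shape, code paths that never estimate a count never pay for the tokenizer import, and repeated estimates reuse one cached encoder.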

🧪 Testing

PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer-engine/tests/engine/models packages/data-designer-engine/tests/engine/analysis packages/data-designer/tests/test_lazy_imports.py -q (589 passed)
PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer/tests/test_lazy_imports.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py -q (159 passed)
uv run --group dev ruff check packages/data-designer-config/src/data_designer/lazy_heavy_imports.py packages/data-designer-engine/src/data_designer/engine/models/clients/types.py packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py
uv run --group dev ruff format --check packages/data-designer-config/src/data_designer/lazy_heavy_imports.py packages/data-designer-engine/src/data_designer/engine/models/clients/types.py packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py packages/data-designer-engine/tests/engine/models/clients/test_anthropic_translation.py packages/data-designer-engine/tests/engine/models/clients/test_parsing.py packages/data-designer-engine/tests/engine/models/clients/test_types.py packages/data-designer-engine/tests/engine/models/test_facade.py packages/data-designer-engine/tests/engine/utils/test_token_counting.py
Unit tests added/updated
E2E tests added/updated (N/A — usage accounting covered by unit/model tests)

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (N/A — no architecture docs needed)

Capture provider-reported reasoning-token breakdowns alongside output tokens without changing output token totals. Carry the field through model usage aggregation and add coverage for parsing, facade tracking, and deltas. Refs #665

Include reasoning token counts in the local model usage summary while preserving output and total token semantics. Telemetry remains unchanged. Refs #665

When providers return reasoning content without a numeric usage breakdown, estimate reasoning tokens from that content while preserving provider-reported output and total token counts. Refs #665

github-actions · 2026-05-15T21:40:49Z

PR #670 Review — feat: track reasoning token usage

Summary

Adds reasoning-token tracking to model usage stats while preserving provider output-token semantics:

New TokenCountSource enum (PROVIDER / ESTIMATED) and two new fields on TokenUsageStats / Usage: reasoning_tokens and reasoning_token_count_source.
extract_usage now reads provider counts from completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, and a top-level reasoning_tokens variant.
New fill_reasoning_tokens_from_content estimates a count via tiktoken (cl100k_base) when the provider returns reasoning content but no count.
New shared data_designer.engine.utils.token_counting module collapses two duplicate tokenizer helpers (one in column_statistics_calculations.py, one new) into a single count_text_tokens / get_cl100k_base_tokenizer pair.
log_model_usage shows reasoning=N (or reasoning=N (estimated)) only when known, plus a one-line note when the count was estimated.
Delta computation in get_usage_deltas propagates reasoning counts and their source.
Tests cover provider-reported, estimated, missing, zero, and merge-precedence cases. 16 files / +485 / −47.
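
The logging rule summarized above (show reasoning=N only when known, with an estimated label when applicable) might look roughly like this. The function name, the parameters, and the way the total is computed are assumptions for illustration, not the code in registry.py.

```python
from __future__ import annotations


def format_token_summary(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int | None = None,
    reasoning_source: str | None = None,
) -> str:
    """Render a usage line, adding reasoning=N only when the count is known."""
    parts = [f"input={input_tokens}", f"output={output_tokens}"]
    if reasoning_tokens is not None:
        label = " (estimated)" if reasoning_source == "estimated" else ""
        parts.append(f"reasoning={reasoning_tokens}{label}")
    # Assumption: total is input + output; reasoning is a breakdown, not an addend.
    parts.append(f"total={input_tokens + output_tokens}")
    return "tokens: " + ", ".join(parts)
```

Checking `is not None` rather than truthiness is what lets a provider-reported zero still appear in the line.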

The change is well-scoped, the tests are thorough, and the "declare, don't orchestrate" contract is preserved (no engine flow changes).

Findings

Type inconsistency from `use_enum_values=True` — code smell

packages/data-designer-engine/src/data_designer/engine/models/usage.py:20

TokenUsageStats sets model_config = ConfigDict(use_enum_values=True), so once a TokenUsageStats is instantiated, reasoning_token_count_source is the string value, not the enum. This forces helpers to defensively compare against both:

registry.py:31 — if source == TokenCountSource.ESTIMATED or source == TokenCountSource.ESTIMATED.value: — the or branch reads as a workaround.
merge_token_count_sources accepts TokenCountSource | str | None and returns str | None. Asymmetric signatures are easy to misuse from new call sites.

Because TokenCountSource subclasses both str and Enum, a comparison such as TokenCountSource.PROVIDER == "provider" evaluates to True, so the or branch is technically redundant. Either remove use_enum_values=True and keep the field typed as the enum end-to-end, or drop the enum side of the comparisons and treat the field as a string, with enum constants used only for writing. Picking one shape will simplify three call sites and make future call sites easier to type-check.
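
A small stdlib-only sketch of the equality behaviour this finding relies on. The member values are assumptions (the review only confirms "provider"), and `is_estimated` is a hypothetical helper, not one from the PR.

```python
from enum import Enum


class TokenCountSource(str, Enum):
    # Assumed values; only "provider" is quoted in the review.
    PROVIDER = "provider"
    ESTIMATED = "estimated"


def is_estimated(source: "TokenCountSource | str | None") -> bool:
    """One comparison covers both the enum member and its stored string value."""
    return source == TokenCountSource.ESTIMATED
```

Since the member is itself a str, the single comparison holds whether the field carries the enum or the plain string that use_enum_values=True leaves behind.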

Cross-module import widens `clients/types.py`'s dependency footprint

packages/data-designer-engine/src/data_designer/engine/models/clients/types.py:9

clients/types.py was previously a leaf module (stdlib only). It now imports TokenCountSource from models/usage.py, which transitively imports Pydantic. This is not an architectural violation per AGENTS.md (both stay in engine), but it does push Pydantic into the import path for any module that only wanted the lightweight Usage / AssistantMessage dataclasses. Two cheaper alternatives:

Define TokenCountSource in clients/types.py (or a small engine.models.token_sources module) and import it from usage.py instead.
Type the field as str | None on the Usage dataclass and reserve the enum for the Pydantic boundary.

Worth a sentence of justification in the PR if the current direction is intentional.

`Usage` is mutated in `fill_reasoning_tokens_from_content`

packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py:326

usage.reasoning_tokens = count_text_tokens(reasoning_content)
...
usage.reasoning_token_count_source = TokenCountSource.ESTIMATED

Usage is a plain @dataclass (not frozen), so this is legal, but it's a subtle shift from the rest of parsing.py, where Usage is constructed once and returned. If any caller ever holds a reference to the pre-fill Usage (e.g., logging, retries, batch dedup), they'll observe the mutation. A dataclasses.replace(usage, …) returning a new instance would match the construct-once style of extract_usage and remove the partial-mutation hazard in the except branch (where reasoning_tokens is set but the source label might not be — this is currently dodged because the assignment to reasoning_tokens happens inside the try, but it's a fragile arrangement).
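
The dataclasses.replace style suggested here could be sketched as below. The `Usage` fields `input_tokens` and `output_tokens`, the function name, and the injected `count_tokens` callable are placeholders chosen for the example; the real dataclass and helper may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int | None = None
    reasoning_token_count_source: str | None = None


def fill_reasoning_token_count(
    usage: Usage | None,
    reasoning_content: str | None,
    count_tokens: Callable[[str], int],
) -> Usage | None:
    """Return usage unchanged, or a copy carrying an estimated reasoning count."""
    # No usage block, an existing provider count, or no content: nothing to estimate.
    if usage is None or usage.reasoning_tokens is not None or not reasoning_content:
        return usage
    try:
        estimated = count_tokens(reasoning_content)
    except Exception:
        # Estimation is best-effort; a tokenizer failure leaves usage untouched.
        return usage
    # Both fields are set together on a new instance; the input is never mutated.
    return replace(usage, reasoning_tokens=estimated, reasoning_token_count_source="estimated")
```

Because the count and its source are written in one `replace` call, there is no window in which a caller can observe one field set without the other.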

`extract_reasoning_tokens` return type is `Any`

packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py:309

The function returns whatever providers shipped (commonly int or stringified int, both seen in the new tests). Callers immediately funnel through coerce_to_int_or_none. The honest signature is int | str | None; Any here is broader than the actual contract and weakens type-checking at the (single) call site.

Redundant invariant enforcement

packages/data-designer-engine/src/data_designer/engine/models/usage.py:27 and :43

The Pydantic model_validator already guarantees that reasoning_tokens and reasoning_token_count_source are both set or both unset. The runtime check at the top of extend() re-verifies one half of that invariant. The duplication is harmless but the validator alone is sufficient — calls that violate the rule would fail when the next validation runs (e.g., dump/copy). If you want a guard at the API boundary too, fine; otherwise drop it.

Edge case: negative or vanishing reasoning deltas

packages/data-designer-engine/src/data_designer/engine/models/registry.py:181

get_token_delta returns None when current is None, even if prev had a non-zero count. Combined with the delta_reasoning is not None and delta_reasoning > 0 filter, a model whose provider stops reporting reasoning tokens mid-run will have its reasoning delta silently dropped. This is consistent with how delta_input > 0 works for the other counts, so probably fine — but there's no test pinning the "previously had reasoning, now None" case. Worth one small parametrize.
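
The vanishing-delta case can be made concrete with a short sketch. The signature of `get_token_delta` and the `should_report` helper are assumptions inferred from this finding, not copied from registry.py.

```python
from __future__ import annotations


def get_token_delta(current: int | None, previous: int | None) -> int | None:
    """Difference between two cumulative counts; None when current is unknown."""
    if current is None:
        # Mirrors the behaviour described above: a missing current count yields no delta.
        return None
    return current - (previous or 0)


def should_report(delta: int | None) -> bool:
    """Only strictly positive deltas are reported."""
    return delta is not None and delta > 0
```

Under these assumptions, a snapshot that previously held 10 reasoning tokens followed by a current value of None produces no reported delta, which is the case the suggested parametrize would pin.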

Minor

packages/data-designer-engine/src/data_designer/engine/utils/__init__.py is a new empty (header-only) file. Per AGENTS.md, namespace packages avoid __init__.py. Confirm this utils/ is intended as a regular package (it sits below data_designer.engine, which is itself a regular subpackage, so this should be fine — but worth being intentional).
format_reasoning_tokens and get_token_delta are module-level helpers in registry.py but are only called from ModelRegistry. No real harm, but they have no other consumers and could be _format_reasoning_tokens / _get_token_delta to signal that. The "estimated" label string could also be a module-level constant rather than rebuilt via an f-string on each call.
The PR description's checklist shows DCO sign-off unchecked. Confirm before merge if your workflow requires it.
No release-notes / CHANGELOG entry. The log format visible to users is changing (tokens: input=…, output=…, reasoning=… (estimated), total=…) — a one-line user-facing note is cheap.

Test coverage

Strong. New tests cover: OpenAI chat-completions and Responses-API field shapes, top-level reasoning_tokens, missing-count + reasoning-content estimation, provider-vs-estimated precedence, zero-count display, merge precedence (estimated wins), extend() invariant errors, delta propagation with TokenCountSource.PROVIDER. Gaps worth one parametrize each:

Snapshot has reasoning_tokens != None, current is None (delta drop case above).
extract_reasoning_tokens with malformed provider payloads (e.g., completion_tokens_details: None, reasoning_tokens: "abc" — the latter currently flows to coerce_to_int_or_none, which presumably returns None; pin it).
fill_reasoning_tokens_from_content when the provider supplied reasoning_tokens=0 (no-op since 0 is not None, but worth confirming).
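
For the malformed-payload gap, the expected coercion behaviour might be pinned along these lines. The body of `coerce_to_int_or_none` below is a guess at a reasonable implementation, since the review itself only presumes that unparseable values return None.

```python
from __future__ import annotations

from typing import Any


def coerce_to_int_or_none(value: Any) -> int | None:
    """Best-effort conversion of a provider value to int; None when it cannot parse."""
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        # None, "abc", NaN, and infinities all land here.
        return None
```

A stringified count such as "12" parses, "abc" and None fall through to None, and 0 stays 0, so a provider-reported zero still counts as a known value and suppresses estimation.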

Verdict

Approvable with minor revisions. The functionality, layering direction, and tests are sound. Recommended pre-merge fixes:

Resolve the TokenCountSource enum-vs-string ambiguity (drop use_enum_values=True or stop comparing against the enum identity at call sites). Highest-leverage change.
Tighten extract_reasoning_tokens return type to int | str | None.
Either justify or rework the clients/types.py → models/usage.py import.
Optional: switch fill_reasoning_tokens_from_content to a dataclasses.replace style and add the two missing parametrize cases.

greptile-apps · 2026-05-15T21:43:33Z

Greptile Summary

This PR adds first-class reasoning token tracking to the model usage pipeline. Provider-reported counts are parsed from OpenAI-compatible fields (completion_tokens_details.reasoning_tokens, output_tokens_details.reasoning_tokens, top-level reasoning_tokens), and when a provider returns reasoning/thinking content but omits a count, the token total is estimated using a shared lazy-loaded cl100k_base tiktoken helper and labelled ESTIMATED in the aggregated stats.

parsing.py: extract_reasoning_token_count reads all known provider field locations; fill_reasoning_token_count_from_content estimates from content only when no provider count exists and skips estimation when usage is None (provider omitted the block entirely).
usage.py / types.py: Both TokenUsageStats (Pydantic) and Usage (dataclass) enforce a pair-consistency invariant — reasoning_tokens and reasoning_token_count_source must be set together or not at all.
token_counting.py: Consolidates the previously duplicated _get_tokenizer helper from column_statistics_calculations.py into a single lazy-loaded, lru_cache-backed utility, with a subprocess-based test verifying tiktoken is never eagerly imported.
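
The pair-consistency invariant on the dataclass side can be sketched as follows. This is a trimmed, hypothetical `Usage` with only the two reasoning fields, and the error message is invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Usage:
    reasoning_tokens: int | None = None
    reasoning_token_count_source: str | None = None

    def __post_init__(self) -> None:
        # Exactly one of the two being set means the pair is inconsistent.
        has_count = self.reasoning_tokens is not None
        has_source = self.reasoning_token_count_source is not None
        if has_count != has_source:
            raise ValueError(
                "reasoning_tokens and reasoning_token_count_source must be set together"
            )
```

Constructing `Usage()` or `Usage(3, "provider")` succeeds, while supplying only one of the two fields raises ValueError at construction time.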

Confidence Score: 5/5

Safe to merge — the change is additive, all new fields are nullable with validated invariants, and the estimation fallback is guarded so it never synthesizes a count when the provider omitted usage entirely.

The reasoning token fields are introduced as nullable with pair-consistency guards at both the dataclass and Pydantic model layers, so existing callers that don't supply them continue to work unchanged. The estimation path is strictly opt-in (only reached when usage is present but the count is absent), and use_enum_values=True on TokenUsageStats is handled correctly in all comparison sites. Delta computation and log formatting handle the None case cleanly. Test coverage is thorough across provider-reported, estimated, zero, and missing cases.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/models/usage.py | Adds reasoning_tokens and reasoning_token_count_source to TokenUsageStats with Pydantic validator, use_enum_values=True, and correct merge logic in extend(). |
| packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py | Adds extract_reasoning_token_count (reads provider fields) and fill_reasoning_token_count_from_content (tiktoken estimation); estimation is correctly skipped when provider count already exists or usage is None. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/types.py | Adds reasoning_tokens/reasoning_token_count_source to the Usage dataclass with post_init pair-consistency validation; clean change. |
| packages/data-designer-engine/src/data_designer/engine/models/registry.py | Extends delta computation and log formatting to include reasoning token counts; the ESTIMATED label logic and delta source attribution are correct. |
| packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py | New shared helper that lazy-loads tiktoken via the facade and caches the cl100k_base encoder; replaces the inline _get_tokenizer that was duplicated in column_statistics_calculations.py. |
| packages/data-designer-engine/src/data_designer/engine/models/facade.py | Passes reasoning_tokens and reasoning_token_count_source through from Usage to TokenUsageStats; straightforward propagation with no semantic changes. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/adapters/anthropic_translation.py | Calls fill_reasoning_token_count_from_content after extract_usage so Anthropic thinking blocks trigger estimation when the provider omits a count; correct and symmetric with the OpenAI-compatible path. |
| packages/data-designer-engine/tests/engine/models/clients/test_parsing.py | New tests cover provider-reported, estimated, missing, zero-count, no-usage, and source-preference cases; monkeypatches count_text_tokens to isolate tiktoken from unit tests. |
| packages/data-designer-engine/tests/engine/utils/test_token_counting.py | Verifies correctness, caching behaviour, and that importing the module does not eagerly trigger tiktoken initialization (subprocess-based lazy-import test). |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Provider response JSON] --> B[extract_usage]
    B --> C{reasoning_tokens\nin provider fields?}
    C -->|Yes| D["source = PROVIDER\nreasoning_tokens = N"]
    C -->|No| E["reasoning_tokens = None\nsource = None"]
    D --> F[Usage dataclass]
    E --> F
    F --> G[fill_reasoning_token_count_from_content]
    G --> H{usage is None?}
    H -->|Yes| I[return None]
    H -->|No| J{reasoning_tokens\nalready set?}
    J -->|Yes| K[return usage as-is]
    J -->|No| L{reasoning_content\npresent?}
    L -->|No| K
    L -->|Yes| M[count_text_tokens via tiktoken]
    M --> N["source = ESTIMATED\nreasoning_tokens = estimated_count"]
    N --> O[new Usage via dataclasses.replace]
    K --> P[ChatCompletionResponse]
    O --> P
    P --> Q[ModelFacade.completion]
    Q --> R["TokenUsageStats.extend()\nmerge_token_count_sources"]
    R --> S[Cumulative TokenUsageStats]
    S --> T{log_model_usage}
    T -->|source == ESTIMATED| U["reasoning=N (estimated)\n+ tiktoken note"]
    T -->|source == PROVIDER| V["reasoning=N"]
    T -->|reasoning_tokens is None| W[no reasoning field]
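
The merge step in the flowchart, where cumulative stats combine the sources of two counts, might be sketched like this. It assumes string-valued sources and the "estimated wins" precedence mentioned in the test-coverage notes; how the real merge_token_count_sources treats a missing side is not shown on this page, so that part is a guess.

```python
from __future__ import annotations


def merge_token_count_sources(left: str | None, right: str | None) -> str | None:
    """Combine the sources of two counts that are being summed."""
    # Assumption: a missing side contributes nothing, so the other side's source stands.
    if left is None:
        return right
    if right is None:
        return left
    # A sum that includes any estimated part is itself only an estimate.
    return "estimated" if "estimated" in (left, right) else "provider"
```

The precedence reflects that adding an estimated count to an exact one cannot yield an exact total.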

Reviews (6): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..."

eric-tramel

Is it really necessary to make these estimates with an alternate tokenizer? This is added overhead that can add up quick for large tasks.

What are the contexts in which token usage isn't reported? Main public APIs (OAI/Ant/OR) should already be reporting token usage on request responses. Private APIs might not account exactly because the thinking is hidden, but they should be providing the correct token count for $ accounting.

In the case of self-host, we should also expect to get token response counts.

Where are we seeing this happen and can we push this to an uncounted edge case and not take on added effort of tiktoken etc in such cases?

nabinchha · 2026-05-18T15:39:28Z

> Is it really necessary to make these estimates with an alternate tokenizer? This is added overhead that can add up quick for large tasks.
>
> What are the contexts in which token usage isn't reported? Main public APIs (OAI/Ant/OR) should already be reporting token usage on request responses. Private APIs might not account exactly because the thinking is hidden, but they should be providing the correct token count for $ accounting.
>
> In the case of self-host, we should also expect to get token response counts.
>
> Where are we seeing this happen and can we push this to an uncounted edge case and not take on added effort of tiktoken etc in such cases?

Anthropic and NVIDIA Build-hosted open models don't report reasoning token counts. We already have the tiktoken overhead, since it is used for column statistics.

andreatgretel

Approved after a1b8d864. My review threads are resolved, and the no-provider-usage behavior is now explicit and tested. Looks good from my side.

nabinchha added 6 commits May 15, 2026 15:33

feat: track reasoning token usage

c1cf4a8

Capture provider-reported reasoning-token breakdowns alongside output tokens without changing output token totals. Carry the field through model usage aggregation and add coverage for parsing, facade tracking, and deltas. Refs #665

fix: show reasoning tokens in usage summary

7b5d361

Include reasoning token counts in the local model usage summary while preserving output and total token semantics. Telemetry remains unchanged. Refs #665

fix: estimate missing reasoning token counts

6c6aa2b

When providers return reasoning content without a numeric usage breakdown, estimate reasoning tokens from that content while preserving provider-reported output and total token counts. Refs #665

fix: track reasoning token count source

a0a6a19

fix: simplify reasoning token source

1af324d

fix: omit unknown reasoning tokens from logs

7bfa4bb

nabinchha requested a review from a team as a code owner May 15, 2026 21:37

nabinchha temporarily deployed to agentic-ci May 15, 2026 21:37 — with GitHub Actions Inactive

nabinchha added 3 commits May 15, 2026 15:44

refactor: clarify reasoning token count helpers

0737ae9

test: move token counting tests

a2b7f8d

fix: enforce reasoning token source

cf20e42

github-actions Bot mentioned this pull request May 18, 2026

Agentic CI: Issue & PR Triage Tracker #562

Open

eric-tramel reviewed May 18, 2026

andreatgretel reviewed May 18, 2026

Comment thread packages/data-designer-engine/src/data_designer/engine/models/clients/parsing.py Outdated

andreatgretel reviewed May 18, 2026

Comment thread packages/data-designer-engine/src/data_designer/engine/models/clients/types.py

andreatgretel reviewed May 18, 2026

Comment thread packages/data-designer-engine/src/data_designer/engine/utils/token_counting.py Outdated

andreatgretel reviewed May 18, 2026

Comment thread packages/data-designer-engine/tests/engine/models/clients/test_parsing.py Outdated

nabinchha added 2 commits May 18, 2026 11:10

fix: address reasoning usage review

a1b8d86

Merge branch 'main' into nmulepati/feat-665-reasoning-token-usage

f692612

nabinchha requested review from andreatgretel and eric-tramel May 18, 2026 17:36

andreatgretel approved these changes May 18, 2026

nabinchha merged commit 7199762 into main May 18, 2026
49 checks passed

nabinchha deleted the nmulepati/feat-665-reasoning-token-usage branch May 18, 2026 18:15
