feat(core): add reasoning tokens and eval_run aggregate metrics#634
Merged
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `reasoning` field to the `TokenUsage` and `ProviderTokenUsage` interfaces
- Extract `reasoning_tokens` from the Claude CLI provider's usage response
- Extract `reasoningTokens` from the AI SDK provider's usage response
- Add `candidateDurationMs` to `EvaluationResult` (agent-only time, excludes grading)
- Override `durationMs` with total case time (includes grading) in the orchestrator
- Update `TimingArtifact` to include reasoning token accumulation
- Fix artifact-writer tests for the new `reasoning` field

Closes #633
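The Claude CLI extraction described above might look roughly like the sketch below. This is an assumption: the CLI usage shape (`input_tokens`, `output_tokens`, `reasoning_tokens`) and the helper name `fromClaudeCliUsage` are inferred only from the field names mentioned in the change list, not taken from the actual codebase.

```typescript
// Hypothetical shape of the Claude CLI's usage JSON (snake_case fields).
interface ClaudeCliUsage {
  input_tokens?: number;
  output_tokens?: number;
  reasoning_tokens?: number;
}

// Simplified stand-in for the PR's TokenUsage interface.
interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number;
}

// Map the CLI's snake_case usage fields onto the internal TokenUsage shape.
function fromClaudeCliUsage(usage: ClaudeCliUsage): TokenUsage {
  const result: TokenUsage = {
    input: usage.input_tokens ?? 0,
    output: usage.output_tokens ?? 0,
  };
  // Only attach `reasoning` when the CLI actually reported it.
  if (usage.reasoning_tokens != null) {
    result.reasoning = usage.reasoning_tokens;
  }
  return result;
}
```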
- Revert `candidateDurationMs` and the `durationMs` override; `durationMs` stays as the candidate-only duration from the provider
- Add an `evalRun` field to `EvaluationResult` with the total `durationMs` (candidate + grading) and aggregated `tokenUsage` (candidate + all evaluators)
- Add an `aggregateEvaluatorTokenUsage` helper that recursively sums token usage from evaluator results, including nested children

Addresses review feedback on #633: `durationMs` is already candidate-only, so keep it as-is and add a separate total eval-run field instead.
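A minimal sketch of what a recursive aggregation like `aggregateEvaluatorTokenUsage` could look like. The evaluator-result shape (`tokenUsage`, `children`) is assumed here for illustration; the real interfaces in the codebase may differ.

```typescript
// Simplified stand-in for the PR's TokenUsage interface.
interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number;
  cached?: number;
}

// Assumed evaluator-result shape: usage plus optionally nested children.
interface EvaluatorResultLike {
  tokenUsage?: TokenUsage;
  children?: EvaluatorResultLike[];
}

// Recursively sum token usage across an evaluator tree.
function aggregateEvaluatorTokenUsage(results: EvaluatorResultLike[]): TokenUsage {
  const total: TokenUsage = { input: 0, output: 0 };
  const visit = (r: EvaluatorResultLike): void => {
    const u = r.tokenUsage;
    if (u != null) {
      total.input += u.input;
      total.output += u.output;
      // Use != null (not ||) so an explicit count of 0 is still carried through.
      if (u.reasoning != null) total.reasoning = (total.reasoning ?? 0) + u.reasoning;
      if (u.cached != null) total.cached = (total.cached ?? 0) + u.cached;
    }
    for (const child of r.children ?? []) visit(child);
  };
  for (const r of results) visit(r);
  return total;
}
```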
Use the standard `outputTokenDetails.reasoningTokens` and `inputTokenDetails.cacheReadTokens` paths instead of the deprecated top-level `reasoningTokens` field. This correctly extracts reasoning tokens from OpenAI, OpenRouter, Azure, Anthropic, and Gemini providers that go through the Vercel AI SDK.
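The switch to the detail paths could look roughly like this sketch. `SdkUsageLike` and `extractTokenUsage` are hypothetical names, and the structural type mirrors only the field paths named above, not the SDK's full usage type.

```typescript
// Structural sketch of the usage shape the PR text describes.
interface SdkUsageLike {
  inputTokens?: number;
  outputTokens?: number;
  reasoningTokens?: number; // deprecated top-level field; intentionally not read
  outputTokenDetails?: { reasoningTokens?: number };
  inputTokenDetails?: { cacheReadTokens?: number };
}

// Simplified stand-in for the PR's ProviderTokenUsage interface.
interface ProviderTokenUsage {
  input: number;
  output: number;
  reasoning?: number;
  cached?: number;
}

function extractTokenUsage(usage: SdkUsageLike): ProviderTokenUsage {
  const result: ProviderTokenUsage = {
    input: usage.inputTokens ?? 0,
    output: usage.outputTokens ?? 0,
  };
  // Prefer the detail paths; ignore the deprecated top-level reasoningTokens.
  const reasoning = usage.outputTokenDetails?.reasoningTokens;
  if (reasoning != null) result.reasoning = reasoning;
  const cached = usage.inputTokenDetails?.cacheReadTokens;
  if (cached != null) result.cached = cached;
  return result;
}
```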
… error paths

- Replace `||` with `!= null` for reasoning/cached checks in `evalRun` aggregation to correctly include fields when the value is 0
- Add `evalRun.durationMs` to the evaluator error catch path so consumers always get timing even when grading fails
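The `||` vs `!= null` distinction matters because `0` is falsy in JavaScript. A minimal illustration of the pitfall the fix addresses, using hypothetical helper names:

```typescript
// Hypothetical field name; illustrates the falsy-zero pitfall described above.
type Usage = { reasoning?: number };

// Buggy variant: `u.reasoning || ...` treats an explicit 0 as "missing".
function reasoningWithOr(u: Usage): number | undefined {
  return u.reasoning || undefined;
}

// Fixed variant: only null/undefined count as "missing"; 0 is a real value.
function reasoningWithNullCheck(u: Usage): number | undefined {
  return u.reasoning != null ? u.reasoning : undefined;
}
```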
Summary
Closes #633
- Add `reasoning` field to the `TokenUsage` and `ProviderTokenUsage` interfaces. Extract `reasoning_tokens` from the Claude CLI provider and `reasoningTokens` from the AI SDK provider.
- `eval_run` aggregate field: new `evalRun` field on `EvaluationResult` with total `durationMs` (candidate + grading wall-clock time) and aggregated `tokenUsage` (candidate + all evaluator tokens, including nested children).
- `durationMs` semantics: the existing top-level `durationMs` remains the candidate/agent-only duration from the provider. Updated JSDoc to make this explicit.
- `TimingArtifact`: `token_usage` now includes the `reasoning` count.

JSONL output example (snake_case via `toSnakeCaseDeep`):

```json
{
  "duration_ms": 8000,
  "token_usage": { "input": 5648, "output": 3473, "reasoning": 2880, "cached": 0 },
  "eval_run": {
    "duration_ms": 11037,
    "token_usage": { "input": 6200, "output": 4100, "reasoning": 2880 }
  }
}
```

Test plan
- `bun run typecheck` passes
- `bun test packages/core/test/evaluation/`: 1052 tests pass
- `bun test apps/cli/test/commands/eval/artifact-writer.test.ts`: 29 tests pass
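For reference, a deep snake_case key conversion like the `toSnakeCaseDeep` mentioned in the summary could be sketched as follows. This is an assumption about its behavior inferred from the name and the JSONL example, not the project's actual implementation.

```typescript
// Recursively convert camelCase object keys to snake_case; arrays are walked,
// primitives pass through unchanged.
function toSnakeCaseDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(toSnakeCaseDeep);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
      // Insert an underscore before each upper-case letter, then lower-case.
      const snake = key.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
      out[snake] = toSnakeCaseDeep(v);
    }
    return out;
  }
  return value;
}
```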