feat(core): add reasoning tokens and eval_run aggregate metrics#634

Merged
christso merged 9 commits into main from feat/633-reasoning-tokens-candidate-duration
Mar 17, 2026
Conversation


@christso christso commented Mar 17, 2026

Summary

Closes #633

  • Reasoning tokens: Add a reasoning field to the TokenUsage and ProviderTokenUsage interfaces. Extract reasoning_tokens from the Claude CLI provider and reasoningTokens from the AI SDK provider.
  • eval_run aggregate field: New evalRun field on EvaluationResult with total durationMs (candidate + grading wall-clock time) and aggregated tokenUsage (candidate + all evaluator tokens, including nested children).
  • Clarified durationMs semantics: The existing top-level durationMs remains the candidate/agent-only duration from the provider. Updated JSDoc to make this explicit.
  • Timing artifact: TimingArtifact.token_usage now includes reasoning count.
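Taken together, the bullets above suggest shapes roughly like the following. This is a TypeScript sketch based only on the field names in this PR; the actual interfaces in packages/core may differ:

```typescript
// Sketch of the extended interfaces, reconstructed from the PR summary.
interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number; // new: reasoning tokens, when the provider reports them
  cached?: number;
}

interface EvalRunMetrics {
  durationMs: number;     // candidate + grading wall-clock time
  tokenUsage: TokenUsage; // candidate + all evaluator tokens, incl. nested children
}

interface EvaluationResult {
  durationMs: number;     // candidate/agent-only duration from the provider
  tokenUsage: TokenUsage;
  evalRun?: EvalRunMetrics; // new aggregate field
}

// Sample value matching the JSONL example below (pre-snake_case).
const sample: EvaluationResult = {
  durationMs: 8000,
  tokenUsage: { input: 5648, output: 3473, reasoning: 2880, cached: 0 },
  evalRun: {
    durationMs: 11037,
    tokenUsage: { input: 6200, output: 4100, reasoning: 2880 },
  },
};
console.log(sample.evalRun?.durationMs); // 11037
```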

JSONL output example (snake_case via toSnakeCaseDeep)

{
  "duration_ms": 8000,
  "token_usage": { "input": 5648, "output": 3473, "reasoning": 2880, "cached": 0 },
  "eval_run": {
    "duration_ms": 11037,
    "token_usage": { "input": 6200, "output": 4100, "reasoning": 2880 }
  }
}
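For illustration, a toSnakeCaseDeep-style transform can be sketched as a recursive key rewrite. This is a hypothetical stand-in, not the repo's implementation:

```typescript
// Minimal sketch of a recursive camelCase -> snake_case key transform,
// analogous to the toSnakeCaseDeep used for JSONL output.
function toSnakeCaseDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(toSnakeCaseDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        k.replace(/([A-Z])/g, "_$1").toLowerCase(), // durationMs -> duration_ms
        toSnakeCaseDeep(v),
      ]),
    );
  }
  return value;
}

console.log(
  JSON.stringify(toSnakeCaseDeep({ durationMs: 8000, evalRun: { durationMs: 11037 } })),
);
// {"duration_ms":8000,"eval_run":{"duration_ms":11037}}
```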

Test plan

  • bun run typecheck passes
  • bun test packages/core/test/evaluation/ — 1052 tests pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts — 29 tests pass
  • Pre-push hooks (build, typecheck, lint, test) all pass

christso and others added 5 commits March 17, 2026 10:38
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `reasoning` field to TokenUsage and ProviderTokenUsage interfaces
- Extract reasoning_tokens from Claude CLI provider's usage response
- Extract reasoningTokens from AI SDK provider's usage response
- Add `candidateDurationMs` to EvaluationResult (agent-only time, excludes grading)
- Override `durationMs` with total case time (includes grading) in orchestrator
- Update TimingArtifact to include reasoning token accumulation
- Fix artifact-writer tests for new reasoning field

Closes #633
@christso christso marked this pull request as ready for review March 17, 2026 00:41

cloudflare-workers-and-pages bot commented Mar 17, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 2c92efd
Status: ⚡️ Build in progress...


- Revert candidateDurationMs and durationMs override — durationMs stays
  as the candidate-only duration from the provider
- Add evalRun field to EvaluationResult with total durationMs (candidate
  + grading) and aggregated tokenUsage (candidate + all evaluators)
- Add aggregateEvaluatorTokenUsage helper that recursively sums token
  usage from evaluator results including nested children

Addresses review feedback on #633: durationMs is already candidate-only,
so keep it as-is and add a separate total eval run field instead.
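The aggregation helper described above might look roughly like this. Names and shapes are assumptions drawn from the commit message, not the actual code:

```typescript
// Hypothetical sketch of recursively summing evaluator token usage,
// including nested children.
interface Usage { input: number; output: number; reasoning?: number; cached?: number; }
interface EvaluatorResult { tokenUsage?: Usage; children?: EvaluatorResult[]; }

function aggregateEvaluatorTokenUsage(results: EvaluatorResult[]): Usage {
  const total: Usage = { input: 0, output: 0 };
  const add = (r: EvaluatorResult): void => {
    if (r.tokenUsage) {
      total.input += r.tokenUsage.input;
      total.output += r.tokenUsage.output;
      // != null rather than || so an explicit 0 is still counted
      // (a later commit in this PR fixes exactly this).
      if (r.tokenUsage.reasoning != null)
        total.reasoning = (total.reasoning ?? 0) + r.tokenUsage.reasoning;
      if (r.tokenUsage.cached != null)
        total.cached = (total.cached ?? 0) + r.tokenUsage.cached;
    }
    r.children?.forEach(add); // recurse into nested children
  };
  results.forEach(add);
  return total;
}
```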
@christso christso changed the title from "feat(core): add reasoning tokens and candidate duration metrics" to "feat(core): add reasoning tokens and eval_run aggregate metrics" Mar 17, 2026
Use the standard outputTokenDetails.reasoningTokens and
inputTokenDetails.cacheReadTokens paths instead of the deprecated
top-level reasoningTokens field. This correctly extracts reasoning
tokens from OpenAI, OpenRouter, Azure, Anthropic, and Gemini providers
that go through the Vercel AI SDK.
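The extraction paths named above could be read roughly like this. Only the two detail paths (outputTokenDetails.reasoningTokens, inputTokenDetails.cacheReadTokens) come from the commit; the surrounding usage shape is an assumption:

```typescript
// Rough sketch of pulling reasoning/cached counts out of an AI SDK
// usage object via the standard detail paths, not the deprecated
// top-level reasoningTokens field.
interface SdkUsage {
  inputTokens?: number;
  outputTokens?: number;
  outputTokenDetails?: { reasoningTokens?: number };
  inputTokenDetails?: { cacheReadTokens?: number };
}

function extractTokenUsage(usage: SdkUsage) {
  return {
    input: usage.inputTokens ?? 0,
    output: usage.outputTokens ?? 0,
    reasoning: usage.outputTokenDetails?.reasoningTokens,
    cached: usage.inputTokenDetails?.cacheReadTokens,
  };
}

console.log(
  extractTokenUsage({
    inputTokens: 10,
    outputTokens: 5,
    outputTokenDetails: { reasoningTokens: 3 },
  }),
);
```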
… error paths

- Replace || with != null for reasoning/cached checks in evalRun
  aggregation to correctly include fields when value is 0
- Add evalRun.durationMs to evaluator error catch path so consumers
  always get timing even when grading fails
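The || vs != null distinction in that fix can be shown in a couple of lines:

```typescript
// A reported value of 0 is real data, but 0 is falsy, so || drops it.
const reasoning: number | undefined = 0;

const viaOr = reasoning || undefined;                           // 0 treated as "absent"
const viaNullCheck = reasoning != null ? reasoning : undefined; // only null/undefined are "absent"

console.log(viaOr, viaNullCheck); // undefined 0
```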
@christso christso merged commit ff954c7 into main Mar 17, 2026
1 check was pending
@christso christso deleted the feat/633-reasoning-tokens-candidate-duration branch March 17, 2026 02:11


Development

Successfully merging this pull request may close these issues.

bug: show duration_ms for agent being evaluated
