feat(core): add reasoning tokens and eval_run aggregate metrics#634
Merged
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `reasoning` field to the `TokenUsage` and `ProviderTokenUsage` interfaces
- Extract `reasoning_tokens` from the Claude CLI provider's usage response
- Extract `reasoningTokens` from the AI SDK provider's usage response
- Add `candidateDurationMs` to `EvaluationResult` (agent-only time, excludes grading)
- Override `durationMs` with total case time (includes grading) in the orchestrator
- Update `TimingArtifact` to include reasoning token accumulation
- Fix artifact-writer tests for the new `reasoning` field

Closes #633
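The Claude CLI extraction described above might look roughly like the sketch below. This is an assumption: the CLI usage shape (`input_tokens`, `output_tokens`, `reasoning_tokens`) and the helper name `fromClaudeCliUsage` are inferred only from the field names mentioned in the change list, not taken from the actual codebase.

```typescript
// Hypothetical shape of the Claude CLI's usage JSON (snake_case fields).
interface ClaudeCliUsage {
  input_tokens?: number;
  output_tokens?: number;
  reasoning_tokens?: number;
}

// Simplified stand-in for the PR's TokenUsage interface.
interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number;
}

// Map the CLI's snake_case usage fields onto the internal TokenUsage shape.
function fromClaudeCliUsage(usage: ClaudeCliUsage): TokenUsage {
  const result: TokenUsage = {
    input: usage.input_tokens ?? 0,
    output: usage.output_tokens ?? 0,
  };
  // Only attach `reasoning` when the CLI actually reported it.
  if (usage.reasoning_tokens != null) {
    result.reasoning = usage.reasoning_tokens;
  }
  return result;
}
```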
- Revert `candidateDurationMs` and the `durationMs` override; `durationMs` stays as the candidate-only duration from the provider
- Add an `evalRun` field to `EvaluationResult` with the total `durationMs` (candidate + grading) and aggregated `tokenUsage` (candidate + all evaluators)
- Add an `aggregateEvaluatorTokenUsage` helper that recursively sums token usage from evaluator results, including nested children

Addresses review feedback on #633: `durationMs` is already candidate-only, so keep it as-is and add a separate total eval-run field instead.
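A minimal sketch of what a recursive aggregation like `aggregateEvaluatorTokenUsage` could look like. The evaluator-result shape (`tokenUsage`, `children`) is assumed here for illustration; the real interfaces in the codebase may differ.

```typescript
// Simplified stand-in for the PR's TokenUsage interface.
interface TokenUsage {
  input: number;
  output: number;
  reasoning?: number;
  cached?: number;
}

// Assumed evaluator-result shape: usage plus optionally nested children.
interface EvaluatorResultLike {
  tokenUsage?: TokenUsage;
  children?: EvaluatorResultLike[];
}

// Recursively sum token usage across an evaluator tree.
function aggregateEvaluatorTokenUsage(results: EvaluatorResultLike[]): TokenUsage {
  const total: TokenUsage = { input: 0, output: 0 };
  const visit = (r: EvaluatorResultLike): void => {
    const u = r.tokenUsage;
    if (u != null) {
      total.input += u.input;
      total.output += u.output;
      // Use != null (not ||) so an explicit count of 0 is still carried through.
      if (u.reasoning != null) total.reasoning = (total.reasoning ?? 0) + u.reasoning;
      if (u.cached != null) total.cached = (total.cached ?? 0) + u.cached;
    }
    for (const child of r.children ?? []) visit(child);
  };
  for (const r of results) visit(r);
  return total;
}
```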
Use the standard `outputTokenDetails.reasoningTokens` and `inputTokenDetails.cacheReadTokens` paths instead of the deprecated top-level `reasoningTokens` field. This correctly extracts reasoning tokens from OpenAI, OpenRouter, Azure, Anthropic, and Gemini providers that go through the Vercel AI SDK.
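The switch to the detail paths could look roughly like this sketch. `SdkUsageLike` and `extractTokenUsage` are hypothetical names, and the structural type mirrors only the field paths named above, not the SDK's full usage type.

```typescript
// Structural sketch of the usage shape the PR text describes.
interface SdkUsageLike {
  inputTokens?: number;
  outputTokens?: number;
  reasoningTokens?: number; // deprecated top-level field; intentionally not read
  outputTokenDetails?: { reasoningTokens?: number };
  inputTokenDetails?: { cacheReadTokens?: number };
}

// Simplified stand-in for the PR's ProviderTokenUsage interface.
interface ProviderTokenUsage {
  input: number;
  output: number;
  reasoning?: number;
  cached?: number;
}

function extractTokenUsage(usage: SdkUsageLike): ProviderTokenUsage {
  const result: ProviderTokenUsage = {
    input: usage.inputTokens ?? 0,
    output: usage.outputTokens ?? 0,
  };
  // Prefer the detail paths; ignore the deprecated top-level reasoningTokens.
  const reasoning = usage.outputTokenDetails?.reasoningTokens;
  if (reasoning != null) result.reasoning = reasoning;
  const cached = usage.inputTokenDetails?.cacheReadTokens;
  if (cached != null) result.cached = cached;
  return result;
}
```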
… error paths

- Replace `||` with `!= null` for reasoning/cached checks in `evalRun` aggregation to correctly include fields when the value is 0
- Add `evalRun.durationMs` to the evaluator error catch path so consumers always get timing even when grading fails
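The `||` vs `!= null` distinction matters because `0` is falsy in JavaScript. A minimal illustration of the pitfall the fix addresses, using hypothetical helper names:

```typescript
// Hypothetical field name; illustrates the falsy-zero pitfall described above.
type Usage = { reasoning?: number };

// Buggy variant: `u.reasoning || ...` treats an explicit 0 as "missing".
function reasoningWithOr(u: Usage): number | undefined {
  return u.reasoning || undefined;
}

// Fixed variant: only null/undefined count as "missing"; 0 is a real value.
function reasoningWithNullCheck(u: Usage): number | undefined {
  return u.reasoning != null ? u.reasoning : undefined;
}
```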
Summary
Closes #633
- Add `reasoning` field to the `TokenUsage` and `ProviderTokenUsage` interfaces. Extract `reasoning_tokens` from the Claude CLI provider and `reasoningTokens` from the AI SDK provider.
- `eval_run` aggregate field: new `evalRun` field on `EvaluationResult` with total `durationMs` (candidate + grading wall-clock time) and aggregated `tokenUsage` (candidate + all evaluator tokens, including nested children).
- `durationMs` semantics: the existing top-level `durationMs` remains the candidate/agent-only duration from the provider. Updated JSDoc to make this explicit.
- `TimingArtifact`: `token_usage` now includes the `reasoning` count.

JSONL output example (snake_case via `toSnakeCaseDeep`):

```json
{
  "duration_ms": 8000,
  "token_usage": { "input": 5648, "output": 3473, "reasoning": 2880, "cached": 0 },
  "eval_run": {
    "duration_ms": 11037,
    "token_usage": { "input": 6200, "output": 4100, "reasoning": 2880 }
  }
}
```

Test plan
- `bun run typecheck` passes
- `bun test packages/core/test/evaluation/`: 1052 tests pass
- `bun test apps/cli/test/commands/eval/artifact-writer.test.ts`: 29 tests pass
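For reference, a deep snake_case key conversion like the `toSnakeCaseDeep` mentioned in the summary could be sketched as follows. This is an assumption about its behavior inferred from the name and the JSONL example, not the project's actual implementation.

```typescript
// Recursively convert camelCase object keys to snake_case; arrays are walked,
// primitives pass through unchanged.
function toSnakeCaseDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(toSnakeCaseDeep);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
      // Insert an underscore before each upper-case letter, then lower-case.
      const snake = key.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
      out[snake] = toSnakeCaseDeep(v);
    }
    return out;
  }
  return value;
}
```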