CLI: capture eval-runner failures and timeouts in result #3330
Merged
Conversation
Extends the eval-runner result JSON with two new fields that any
consumer of `EVAL_RUNNER_RESULT_FILE` can read:
- `error: string | null` — set when an exception is thrown inside
the message loop (auth blip, MCP crash, network error, SDK throw,
etc.). Today those exceptions propagate out of `runEval()` to
`main()`'s top-level catch, which emits a stripped-down
`{ success: false, error }` JSON without the timings, tools, turns,
or partial state that were captured before the throw. With this
change, the exception is caught inside `runEval()`, so consumers
get the full structured result alongside the error message.
- `timedOut: boolean` — flipped by the timeout callback before it
calls `query.interrupt()`. Today timeouts are indistinguishable
from clean model failures: both surface as `success: false` with
no other differentiator. Consumers can now tell them apart and
attribute time-budget exhaustion correctly.
Together these address the `finalError` ask in #3262
section 1 ("First actionable failure") and complete the failure-
visibility story started by #3273 (which added `firstToolError`,
`toolEvents`, `phaseTimingsMs`).
The change is generic — it benefits any consumer of the eval-runner's
structured output (Studio's own scripts, `npm run eval`, future
internal CI, third-party benchmark harnesses).
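As a sketch of what a consumer-side read of the new contract might look like (the interface below is illustrative — only `success`, `error`, and `timedOut` are named in this PR; the rest of the captured state is elided):

```typescript
// Sketch of the extended result contract. `error` and `timedOut` are the
// two new fields; `success` already existed. Timings, tools, and turns
// are omitted here for brevity.
interface EvalRunnerResult {
  success: boolean;
  error: string | null; // exception caught inside the message loop
  timedOut: boolean;    // set by the timeout callback before query.interrupt()
}

// The three failure modes are now distinguishable from the JSON alone.
function classifyFailure(result: EvalRunnerResult): string {
  if (result.success) return "success";
  if (result.timedOut) return "timeout";
  if (result.error !== null) return `exception: ${result.error}`;
  return "model failure";
}

console.log(classifyFailure({ success: false, error: null, timedOut: true }));
// -> "timeout"
```

Before this change, all three non-success branches above collapsed into the same `success: false` signal.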
Originally drafted on the experimental Static Site Importer branch
(#3309); split out for upstream review because the eval-runner
changes are independently valuable and shouldn't ship gated on
that experiment.
Refs: #3262, #3273, #3309
## AI assistance
- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the catch wrapper, the `timedOut` flag,
and the result-shape extension under Chris's direction. Chris
reviewed the diff, the issue framing, and the split rationale.
chubes4 added a commit to chubes4/homeboy-rigs that referenced this pull request on May 4, 2026
Pairs with Automattic/studio#3330, which adds two new fields to the eval-runner result JSON:

- `result.error: string | null` — set when an exception is caught inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.)
- `result.timedOut: boolean` — set when the timeout callback fires before `query.interrupt()`

Bench changes:

- studio-agent-runtime + studio-agent-site-info: failure-detail message now distinguishes 'timed out after Nms', 'exception: ...', and 'exit=N'. Lifts the timeout literal into a named constant so the message stays in sync with the actual budget.
- studio-agent-site-build: `agentSucceeded` gate now also requires `!result.timedOut`.

Adds two new metrics:

- `timed_out` (1 when the run hit the time budget)
- `agent_runner_error` (1 when an exception landed in `result.error`)

These categorize regressions that today look like generic agent failures.

Backwards-compatible: on Studio versions older than #3330, both fields are nullish and the bench behaves exactly as it did before.

Refs: Automattic/studio#3330, Automattic/studio#3262, Automattic/studio#3273

## AI assistance

- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the gate-tightening, the constants, and the failure-detail messages under Chris's direction.
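The backwards-compatibility claim above can be sketched as follows (the type and function names are assumptions based on this commit message, not the actual bench code):

```typescript
// Hypothetical shape of a parsed eval-runner result. On Studio versions
// older than #3330, the two new fields are simply absent (undefined).
interface RunResult {
  success: boolean;
  error?: string | null;
  timedOut?: boolean;
}

// Gate-tightening: a run that hit the time budget no longer counts as a
// success. With `timedOut` undefined (old Studio), `!result.timedOut` is
// true, so the gate reduces to the original `result.success` check.
function agentSucceeded(result: RunResult): boolean {
  return result.success && !result.timedOut;
}

// The two new metrics, nullish-safe for older result files.
function newMetrics(result: RunResult): { timed_out: number; agent_runner_error: number } {
  return {
    timed_out: result.timedOut ? 1 : 0,
    agent_runner_error: result.error != null ? 1 : 0,
  };
}
```

Because both checks treat a nullish field as "not set", old result files produce the same gate outcome and zero-valued metrics, matching the "behaves exactly as it did before" guarantee.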
📊 Performance Test Results — comparing f4ab2be vs trunk

- app-size
- site-editor
- site-startup

Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)
youknowriad approved these changes on May 4, 2026
## Summary
Extends the eval-runner result JSON with two new fields that any consumer of `EVAL_RUNNER_RESULT_FILE` can read:

- `error: string | null` — set when an exception is thrown inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.).
- `timedOut: boolean` — flipped by the timeout callback before it calls `query.interrupt()`.

Together these address the `finalError` ask in #3262 section 1 ("First actionable failure") and complete the failure-visibility story started by #3273 (`firstToolError`, `toolEvents`, `phaseTimingsMs`).

## Why
Three failure modes today, three different visibility outcomes:

| Failure mode | Today | After this change |
| --- | --- | --- |
| Clean model failure | surfaces as `success: false` cleanly | unchanged |
| Timeout (`query.interrupt()` after `timeoutMs`) | `success: false`, indistinguishable from a clean model failure | `timedOut: true` distinguishes it |
| Exception in the message loop | propagates to `main()`'s top-level catch → emits stripped-down `{ success: false, error }` JSON, losing all timings/tools/turns | caught inside `runEval()` → full structured result with `error` set alongside everything else |

The third row is the most consequential: today, a run that completes meaningful work and then hits a late exception loses all the diagnostic state that was already captured. The new `try { … } catch` keeps the structured result intact.

## What this enables
A consumer can now answer:

- Did the run hit the time budget? (After: read `result.timedOut`.)
- Did an exception land even though `success` was `true`? (Today: not representable — `success: true` and the exception path are mutually exclusive. After: both fields can coexist.)
- What state was captured before the exception? (Today: lost to the `main()`-level fallback JSON. After: preserved in the structured result.)

This is the same shape as #3273's contribution — producer-side observability improvements that any downstream consumer benefits from. Studio's own `npm run eval`, future internal CI, promptfoo configurations, and external benchmark harnesses all read the same JSON contract.

## Diff
8 lines:

- `let error: string | null = null` and `let timedOut = false` declarations
- `timedOut = true` before calling `query.interrupt()`
- `try { … } catch ( caught ) { error = … }` wrapper around the message loop

The catch only changes behavior on exception paths; the existing successful and clean-failure paths are unchanged.
## Origin
Originally drafted on the experimental Static Site Importer branch (#3309), where it was being used by an out-of-tree benchmark harness. Split out for upstream review because the eval-runner changes are independently valuable and shouldn't ship gated on that experiment.

The SSI draft PR will rebase on top of this once it lands.
## Tests
- `npm run typecheck` (all workspaces) — clean
- `npx eslint apps/cli/ai/eval-runner.ts` — clean

The change is small enough that it's covered by the existing eval-runner exercise paths (`npm run eval`, etc.). If reviewers want explicit unit tests around the new fields I'm happy to add them.

## Refs

#3262, #3273, #3309
## AI assistance
- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the catch wrapper, the `timedOut` flag, and the result-shape extension under Chris's direction. Chris reviewed the diff, the issue framing, and the split rationale.