## Problem
In eval outputs, low scores can come from non-LLM causes (missing toolchain/runtime, repo/environment constraints), but today these are easy to misread as model-quality or judge-quality problems.
Example from CargoWise customs evals:
- Criteria required "Builds successfully with all tests passing"
- Agent made correct code/test edits
- Environment lacked `dotnet`, so build verification could not run
- Score dropped (e.g., 0.67) even though core coding behavior was largely correct
## Industry Research
Surveyed how Inspect AI (UK AISI), Braintrust, W&B Weave, DeepEval, OpenAI Evals, and LangSmith handle this separation. Key findings:
- All mature frameworks keep scores and errors structurally separate.
- Inspect AI has the most architecturally complete model: `EvalError`, `EvalSampleLimit` with typed enums, and `ToolCallError` with a typed `type` for recoverable tool errors. Limits and errors are tracked as separate fields on `EvalSample`.
- W&B Weave uses a discriminated union for evaluation status, plus `TraceStatus` at the call level.
- DeepEval tracks errors at the metric level, separate from `score` and `success`.
- Braintrust keeps `error` and scores as separate top-level fields.
## Implementation (PR #436)

### Result-level fields
```ts
// Required: present on every result record
executionStatus: 'ok' | 'quality_failure' | 'execution_error';

// Optional: only set when executionStatus is 'execution_error'
failureStage?: 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';
failureReasonCode?: string; // e.g. 'provider_error', 'template_error', 'script_error', 'clone_error', 'budget_exceeded'

// Structured error detail
executionError?: { message: string; stage: FailureStage };
```
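For illustration, a result record for the `dotnet` example above might look like the following. The `executionStatus`, `failureStage`, `failureReasonCode`, and `executionError` fields are from the PR; `testId` and the overall record shape are assumptions:

```ts
// Hypothetical execution-error result (record shape assumed, fields per PR #436).
const result = {
  testId: 'cargowise-customs-build', // illustrative id, not from the PR
  score: 0,                          // stays 0 for backward compatibility
  executionStatus: 'execution_error' as const,
  failureStage: 'setup' as const,
  failureReasonCode: 'script_error',
  executionError: {
    message: "setup script failed: 'dotnet' not found",
    stage: 'setup' as const,
  },
};
```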
### Classification logic

- Errors are classified at the catch site in the orchestrator with the appropriate stage/reason (see the sketch after this list):
  - Workspace creation errors → `setup` / `template_error`
  - Repo materialization errors → `repo_setup` / `clone_error`
  - Script errors → `setup` / `script_error`
  - Provider errors → `agent` / `provider_error`
  - Evaluator errors → `evaluator` / `evaluator_error`
  - Budget exceeded → `setup` / `budget_exceeded`
- Successful results use `QUALITY_PASS_THRESHOLD` (0.8): `score >= 0.8` → `ok`, below → `quality_failure`
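A minimal sketch of this mapping, assuming hypothetical error-type names; the real orchestrator classifies at each catch site, and only the stage/reason pairs above come from the PR:

```ts
type FailureStage = 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';

const QUALITY_PASS_THRESHOLD = 0.8;

// Illustrative classifier: the err.name checks stand in for the real
// catch-site error types, which are not shown here.
function classifyError(err: Error): { stage: FailureStage; reason: string } {
  switch (err.name) {
    case 'WorkspaceCreationError':   return { stage: 'setup',      reason: 'template_error' };
    case 'RepoMaterializationError': return { stage: 'repo_setup', reason: 'clone_error' };
    case 'ScriptError':              return { stage: 'setup',      reason: 'script_error' };
    case 'ProviderError':            return { stage: 'agent',      reason: 'provider_error' };
    case 'EvaluatorError':           return { stage: 'evaluator',  reason: 'evaluator_error' };
    case 'BudgetExceededError':      return { stage: 'setup',      reason: 'budget_exceeded' };
    default:                         return { stage: 'setup',      reason: 'unknown_error' }; // extensible, not exhaustive
  }
}

// Successful runs are split on the quality threshold.
function classifyScore(score: number): 'ok' | 'quality_failure' {
  return score >= QUALITY_PASS_THRESHOLD ? 'ok' : 'quality_failure';
}
```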
### Score handling

Scores remain `0` for execution errors (backward-compatible), but execution errors are excluded from quality metrics (mean, median, histogram, top/bottom results) in summary aggregation.
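For example, a mean computed under this rule might look like the following sketch (result shape assumed):

```ts
// Sketch: quality metrics ignore execution errors even though those
// results still carry score 0.
interface ResultLike {
  score: number;
  executionStatus: 'ok' | 'quality_failure' | 'execution_error';
}

function meanQualityScore(results: ResultLike[]): number | undefined {
  const quality = results.filter(r => r.executionStatus !== 'execution_error');
  if (quality.length === 0) return undefined; // everything errored; no quality signal
  return quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
}
```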
### Trial aggregation

When a test runs multiple trials, the aggregate status is derived from the per-trial statuses (see the sketch after this list):

- Any trial `ok` → aggregate `ok`
- All trials `execution_error` → aggregate `execution_error`
- Otherwise → `quality_failure`
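A direct translation of these rules (function and type names assumed):

```ts
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

// Fold per-trial statuses into one aggregate status for the test.
function aggregateTrialStatus(trials: ExecutionStatus[]): ExecutionStatus {
  if (trials.some(t => t === 'ok')) return 'ok';
  if (trials.every(t => t === 'execution_error')) return 'execution_error';
  return 'quality_failure';
}
```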
### Summary output

```
Total tests: 10
Passed: 5
Quality failures: 3
Execution errors: 2
Mean score: 0.750 (8 quality tests, 2 execution errors excluded)

Execution errors by stage:
  agent: 1
  setup: 1

Execution errors by reason:
  provider_error: 1
  template_error: 1
```
## Follow-up Issues

- `--retry-errors` CLI flag to re-run only execution errors
- `fail_on_error` tolerance config for eval runs
## Acceptance Signals

- `executionStatus` field on every result
- Execution errors classified as `execution_error` with the correct stage/reason
## Non-Goals
- Classifying every possible error: `failureReasonCode` is extensible, not exhaustive
- Automatic root-cause analysis of errors
- Changing how quality failures (assertion/scoring issues) are evaluated
- Setting score to `null` for execution errors (kept as `0` for backward compatibility)