## Problem
In eval outputs, low scores can come from non-LLM causes (missing toolchain/runtime, repo/environment constraints), but today these are easy to misread as model-quality or judge-quality problems.
Example from CargoWise customs evals:
- Criteria required "Builds successfully with all tests passing"
- Agent made correct code/test edits
- Environment lacked `dotnet`, so build verification could not run
- Score dropped (e.g., 0.67) even though core coding behavior was largely correct
## Industry Research
Surveyed how Inspect AI (UK AISI), Braintrust, W&B Weave, DeepEval, OpenAI Evals, and LangSmith handle this separation. Key findings:
- All mature frameworks keep scores and errors structurally separate.
- Inspect AI has the most architecturally complete model: `EvalError`, `EvalSampleLimit` with typed enums, and `ToolCallError` with a typed `type` for recoverable tool errors. Limits and errors are tracked as separate fields on `EvalSample`.
- W&B Weave uses a discriminated union for evaluation status, plus `TraceStatus` at the call level.
- DeepEval tracks errors at the metric level, separate from `score` and `success`.
- Braintrust keeps `error` and scores as separate top-level fields.
## Implementation (PR #436)

### Result-level fields
```ts
// Required: present on every result record
executionStatus: 'ok' | 'quality_failure' | 'execution_error';

// Optional: only set when executionStatus is 'execution_error'
failureStage?: 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';
failureReasonCode?: string; // e.g. 'provider_error', 'template_error', 'script_error', 'clone_error', 'budget_exceeded'

// Structured error detail
executionError?: { message: string; stage: FailureStage };
```
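For illustration, a result record for the `dotnet` example above might look like the following. The `executionStatus`, `failureStage`, `failureReasonCode`, and `executionError` fields are from the PR; `testId` and the overall record shape are assumptions:

```ts
// Hypothetical execution-error result (record shape assumed, fields per PR #436).
const result = {
  testId: 'cargowise-customs-build', // illustrative id, not from the PR
  score: 0,                          // stays 0 for backward compatibility
  executionStatus: 'execution_error' as const,
  failureStage: 'setup' as const,
  failureReasonCode: 'script_error',
  executionError: {
    message: "setup script failed: 'dotnet' not found",
    stage: 'setup' as const,
  },
};
```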
### Classification logic

- Errors are classified at the catch site in the orchestrator with the appropriate stage/reason (see the sketch after this list):
  - Workspace creation errors → `setup` / `template_error`
  - Repo materialization errors → `repo_setup` / `clone_error`
  - Script errors → `setup` / `script_error`
  - Provider errors → `agent` / `provider_error`
  - Evaluator errors → `evaluator` / `evaluator_error`
  - Budget exceeded → `setup` / `budget_exceeded`
- Successful results use `QUALITY_PASS_THRESHOLD` (0.8): `score >= 0.8` → `ok`, below → `quality_failure`
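A minimal sketch of this mapping, assuming hypothetical error-type names; the real orchestrator classifies at each catch site, and only the stage/reason pairs above come from the PR:

```ts
type FailureStage = 'setup' | 'repo_setup' | 'agent' | 'evaluator' | 'teardown';

const QUALITY_PASS_THRESHOLD = 0.8;

// Illustrative classifier: the err.name checks stand in for the real
// catch-site error types, which are not shown here.
function classifyError(err: Error): { stage: FailureStage; reason: string } {
  switch (err.name) {
    case 'WorkspaceCreationError':   return { stage: 'setup',      reason: 'template_error' };
    case 'RepoMaterializationError': return { stage: 'repo_setup', reason: 'clone_error' };
    case 'ScriptError':              return { stage: 'setup',      reason: 'script_error' };
    case 'ProviderError':            return { stage: 'agent',      reason: 'provider_error' };
    case 'EvaluatorError':           return { stage: 'evaluator',  reason: 'evaluator_error' };
    case 'BudgetExceededError':      return { stage: 'setup',      reason: 'budget_exceeded' };
    default:                         return { stage: 'setup',      reason: 'unknown_error' }; // extensible, not exhaustive
  }
}

// Successful runs are split on the quality threshold.
function classifyScore(score: number): 'ok' | 'quality_failure' {
  return score >= QUALITY_PASS_THRESHOLD ? 'ok' : 'quality_failure';
}
```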
### Score handling

Scores remain `0` for execution errors (backward-compatible), but execution errors are excluded from quality metrics (mean, median, histogram, top/bottom results) in summary aggregation.
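For example, a mean computed under this rule might look like the following sketch (result shape assumed):

```ts
// Sketch: quality metrics ignore execution errors even though those
// results still carry score 0.
interface ResultLike {
  score: number;
  executionStatus: 'ok' | 'quality_failure' | 'execution_error';
}

function meanQualityScore(results: ResultLike[]): number | undefined {
  const quality = results.filter(r => r.executionStatus !== 'execution_error');
  if (quality.length === 0) return undefined; // everything errored; no quality signal
  return quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
}
```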
### Trial aggregation

When a test runs multiple trials, the aggregate status is derived from the per-trial statuses (see the sketch after this list):

- Any trial `ok` → aggregate `ok`
- All trials `execution_error` → aggregate `execution_error`
- Otherwise → `quality_failure`
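A direct translation of these rules (function and type names assumed):

```ts
type ExecutionStatus = 'ok' | 'quality_failure' | 'execution_error';

// Fold per-trial statuses into one aggregate status for the test.
function aggregateTrialStatus(trials: ExecutionStatus[]): ExecutionStatus {
  if (trials.some(t => t === 'ok')) return 'ok';
  if (trials.every(t => t === 'execution_error')) return 'execution_error';
  return 'quality_failure';
}
```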
### Summary output

```
Total tests: 10
Passed: 5
Quality failures: 3
Execution errors: 2
Mean score: 0.750 (8 quality tests, 2 execution errors excluded)

Execution errors by stage:
  agent: 1
  setup: 1

Execution errors by reason:
  provider_error: 1
  template_error: 1
```
## Follow-up Issues

- `--retry-errors` CLI flag to re-run only execution errors
- `fail_on_error` tolerance config for eval runs
## Acceptance Signals

- `executionStatus` field on every result
- Execution errors classified as `execution_error` with the correct stage/reason
## Non-Goals
- Classifying every possible error: `failureReasonCode` is extensible, not exhaustive
- Automatic root-cause analysis of errors
- Changing how quality failures (assertion/scoring issues) are evaluated
- Setting score to `null` for execution errors (kept as `0` for backward compatibility)