fix(cloud-eval): lift grader execution errors into RowMetric.error #201

Merged
placerda merged 1 commit into develop from fix/cloud-eval-surface-evaluator-errors
May 29, 2026

@placerda
Contributor

When a Foundry `azure_ai_evaluator` grader fails to execute (most commonly because the evaluator service principal lacks Cognitive Services OpenAI User on the target model deployment), the per-metric score comes back `null` and the real cause is buried in `result.sample.error.message`. Operators see only `actual=missing` in the threshold table, and the CI gate fires as `threshold_failed` for the wrong reason.

## Root cause (real shape from PO's run)

```json
{
  "name": "coherence",
  "score": null,
  "passed": null,
  "status": "error",
  "sample": {
    "error": {
      "code": "FAILED_EXECUTION",
      "message": "OpenAI API hits AuthenticationError: 401 PermissionDenied: The principal lacks the required data action chat/completions/action"
    }
  }
}
```

The parser only probed `error` as a top-level string. It never reached `sample.error.message`.

## Fix

- `cloud_results._metric_from_result` now calls a new `_extract_grader_error` that probes (in order): top-level `error`, `sample.error`, and `status == 'error'` as last resort. Accepts both string and `{code, message}` dict shapes; prefixes the error code when present.
- `orchestrator` 0-usable-scores warning now quotes the first per-metric error so CI logs carry the actionable cause without operators having to download the raw artifact.
- 3 new unit tests in `test_cloud_results.py` covering: `sample.error.message` dict lift, top-level `error` dict, and the case where a real score must not be shadowed by a stale `sample.error: null`.

## Verification

- `python -m pytest tests/unit/test_cloud_results.py -x -q` → 14 passed.
- `python -m pytest tests/ -x -q` → 792 passed, 3 skipped.
- The real artifact from `placerda/agentops-prompt-quickstart` run #26618645299, shown above, is the source of the fixture.

When a Foundry azure_ai_evaluator grader fails to execute (e.g., the evaluator service principal lacks Cognitive Services OpenAI User on the model deployment), the per-metric score returns null and the real cause is buried in result.sample.error.message. Without surfacing it, operators see only actual=missing in the threshold table and have to dig into cloud_output_items.json to find the RBAC failure.

The parser now extracts sample.error.message (and top-level error dicts), prefixing the error code when present. The orchestrator's 0-usable-scores warning quotes the first grader error so CI logs carry the actionable cause.
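
The orchestrator side of the change might look something like the sketch below. Only `RowMetric.error` is named in this PR; the `name` and `value` fields, the function name `zero_scores_warning`, and the warning wording are hypothetical and stand in for whatever the real orchestrator uses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RowMetric:
    # Hypothetical minimal shape; the real RowMetric likely has more fields.
    name: str
    value: Optional[float] = None
    error: Optional[str] = None


def zero_scores_warning(metrics: Iterable[RowMetric]) -> Optional[str]:
    """Return a warning when no metric has a usable score, quoting the first error."""
    metrics = list(metrics)
    # Any real score means the run is usable, so there is nothing to warn about.
    if any(m.value is not None for m in metrics):
        return None
    base = "cloud evaluation produced 0 usable scores"  # assumed wording
    # Quote only the first error so the CI log stays short but actionable.
    first = next((m for m in metrics if m.error), None)
    if first is None:
        return base
    return f"{base}; first grader error ({first.name}): {first.error}"
```

Quoting a single error rather than all of them keeps the log line readable; when every grader fails for the same RBAC reason, the first message is usually enough to point at the cause.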

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@placerda merged commit 6bea832 into develop May 29, 2026
12 checks passed
@placerda deleted the fix/cloud-eval-surface-evaluator-errors branch May 29, 2026 05:13
placerda added a commit that referenced this pull request May 29, 2026
) (#202)
