fix(cloud-eval): lift grader execution errors into RowMetric.error by placerda · Pull Request #201 · Azure/agentops

placerda · 2026-05-29T05:08:38Z

When a Foundry `azure_ai_evaluator` grader fails to execute (most commonly because the evaluator service principal lacks the Cognitive Services OpenAI User role on the target model deployment), the per-metric score comes back `null` and the real cause is buried in `result.sample.error.message`. Operators see only `actual=missing` in the threshold table, and the CI gate fires as `threshold_failed` for the wrong reason.

## Root cause (real shape from PO's run)

```json
{
  "name": "coherence",
  "score": null,
  "passed": null,
  "status": "error",
  "sample": {
    "error": {
      "code": "FAILED_EXECUTION",
      "message": "OpenAI API hits AuthenticationError: 401 PermissionDenied: The principal lacks the required data action chat/completions/action"
    }
  }
}
```

The parser only probed `error` as a top-level string. It never reached `sample.error.message`.
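
To illustrate the gap, here is a minimal sketch of a probe that accepts only a top-level string `error`, run against the payload above. The name `old_probe` is hypothetical and stands in for the pre-fix behavior; it is not the project's actual code.

```python
import json
from typing import Optional

# Payload shape copied from the failing run shown above.
RESULT = json.loads("""
{
  "name": "coherence",
  "score": null,
  "passed": null,
  "status": "error",
  "sample": {
    "error": {
      "code": "FAILED_EXECUTION",
      "message": "OpenAI API hits AuthenticationError: 401 PermissionDenied: The principal lacks the required data action chat/completions/action"
    }
  }
}
""")


def old_probe(result: dict) -> Optional[str]:
    """Hypothetical stand-in for the pre-fix probe: top-level string only."""
    error = result.get("error")
    return error if isinstance(error, str) else None


# There is no top-level "error" key, so the probe finds nothing.
top_level = old_probe(RESULT)
# The actionable message sits two levels down and is never consulted.
nested = RESULT["sample"]["error"]["message"]

print(top_level)
print(nested)
```

With nothing returned for the metric, the only signal left downstream is the missing score.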

## Fix

- `cloud_results._metric_from_result` now calls a new `_extract_grader_error` that probes (in order): top-level `error`, `sample.error`, and `status == 'error'` as last resort. Accepts both string and `{code, message}` dict shapes; prefixes the error code when present.
- `orchestrator` 0-usable-scores warning now quotes the first per-metric error so CI logs carry the actionable cause without operators having to download the raw artifact.
- 3 new unit tests in `test_cloud_results.py` covering: `sample.error.message` dict lift, top-level `error` dict, and the case where a real score must not be shadowed by a stale `sample.error: null`.
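
The probe order and the warning described above could look roughly like the sketch below. It is not the code merged in this PR: the function names, the `code: message` prefix format, the fallback text for a bare `status == 'error'`, and the warning wording are all assumptions made for illustration.

```python
from typing import Any, List, Optional


def format_error(value: Any) -> Optional[str]:
    """Normalize a string or {code, message} dict into one line, else None."""
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, dict):
        code = value.get("code")
        message = value.get("message")
        if code and message:
            # Assumed prefix format; the PR only says the code is prefixed.
            return f"{code}: {message}"
        text = message or code
        return str(text) if text else None
    # None (e.g. a stale `sample.error: null`) and other types carry no error.
    return None


def extract_grader_error_sketch(result: dict) -> Optional[str]:
    """Probe top-level error, then sample.error, then status == 'error'."""
    top = format_error(result.get("error"))
    if top:
        return top
    sample = result.get("sample")
    if isinstance(sample, dict):
        nested = format_error(sample.get("error"))
        if nested:
            return nested
    if result.get("status") == "error":
        # Last resort: no details available, so report the status itself.
        return "grader reported status=error without details"
    return None


def zero_scores_warning(errors: List[Optional[str]]) -> str:
    """Build a warning that quotes the first non-empty per-metric error."""
    first = next((e for e in errors if e), None)
    base = "cloud evaluation returned 0 usable scores"
    return f"{base}; first grader error: {first}" if first else base


failed = {
    "score": None,
    "status": "error",
    "sample": {"error": {"code": "FAILED_EXECUTION", "message": "401 PermissionDenied"}},
}
healthy = {"score": 4.0, "status": "completed", "sample": {"error": None}}

print(extract_grader_error_sketch(failed))
print(extract_grader_error_sketch(healthy))
print(zero_scores_warning([None, extract_grader_error_sketch(failed)]))
```

In this sketch the `healthy` row yields no error because `sample.error` is `null` and the status is not `error`, which mirrors the third test case listed above.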

## Verification

- `python -m pytest tests/unit/test_cloud_results.py -x -q` → 14 passed.
- `python -m pytest tests/ -x -q` → 792 passed, 3 skipped.
- The fixture is taken from the real artifact shown above, from `placerda/agentops-prompt-quickstart` run #26618645299.

When a Foundry azure_ai_evaluator grader fails to execute (e.g., the evaluator service principal lacks Cognitive Services OpenAI User on the model deployment), the per-metric score returns null and the real cause is buried in result.sample.error.message. Without surfacing it, operators see only actual=missing in the threshold table and have to dig into cloud_output_items.json to find the RBAC failure.

The parser now extracts sample.error.message (and top-level error dicts), prefixing the error code when present. The orchestrator's 0-usable-scores warning quotes the first grader error so CI logs carry the actionable cause.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


placerda merged commit 6bea832 into develop May 29, 2026
12 checks passed

placerda deleted the fix/cloud-eval-surface-evaluator-errors branch May 29, 2026 05:13

placerda mentioned this pull request May 29, 2026

fix(cloud-eval): lift grader execution errors into RowMetric.error #202

Merged

placerda mentioned this pull request May 29, 2026

docs(skill): require Cognitive Services OpenAI User as prereq RBAC role #203

Merged

placerda merged 1 commit into develop from fix/cloud-eval-surface-evaluator-errors
