fix(cloud-eval): lift grader execution errors into RowMetric.error #201
Merged
When a Foundry azure_ai_evaluator grader fails to execute (e.g., the evaluator service principal lacks Cognitive Services OpenAI User on the model deployment), the per-metric score returns null and the real cause is buried in result.sample.error.message. Without surfacing it, operators see only actual=missing in the threshold table and have to dig into cloud_output_items.json to find the RBAC failure.

The parser now extracts sample.error.message (and top-level error dicts), prefixing the error code when present. The orchestrator's 0-usable-scores warning quotes the first grader error so CI logs carry the actionable cause.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
placerda added a commit that referenced this pull request May 29, 2026
) (#202)
When a Foundry `azure_ai_evaluator` grader fails to execute (most commonly because the evaluator service principal lacks `Cognitive Services OpenAI User` on the target model deployment), the per-metric `score` comes back `null` and the real cause is buried in `result.sample.error.message`. Operators see only `actual=missing` in the threshold table and the CI gate fires as `threshold_failed` for the wrong reason.

## Root cause (real shape from PO's run)

```json
{
  "name": "coherence",
  "score": null,
  "passed": null,
  "status": "error",
  "sample": {
    "error": {
      "code": "FAILED_EXECUTION",
      "message": "OpenAI API hits AuthenticationError: 401 PermissionDenied: The principal lacks the required data action chat/completions/action"
    }
  }
}
```

The parser only probed `error` as a top-level string. It never reached `sample.error.message`.

## Fix

- `cloud_results._metric_from_result` now calls a new `_extract_grader_error` that probes (in order): top-level `error`, `sample.error`, and `status == 'error'` as last resort. Accepts both string and `{code, message}` dict shapes; prefixes the error code when present.
- `orchestrator` 0-usable-scores warning now quotes the first per-metric error so CI logs carry the actionable cause without operators having to download the raw artifact.
- 3 new unit tests in `test_cloud_results.py` covering: `sample.error.message` dict lift, top-level `error` dict, and the case where a real score must not be shadowed by a stale `sample.error: null`.

## Verification

- `python -m pytest tests/unit/test_cloud_results.py -x -q` → 14 passed.
- `python -m pytest tests/ -x -q` → 792 passed, 3 skipped.
- Real artifact from `placerda/agentops-prompt-quickstart` run #26618645299 shown above as the source of the fixture.
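
For reviewers who want the probe order from the Fix section in concrete form, here is a minimal sketch. It is not the code merged in this PR. The names `extract_grader_error` and `_format_error`, the `CODE: message` prefix format, and the fallback text for a bare `status == 'error'` are all assumptions made for illustration.

```python
from typing import Any, Mapping, Optional


def _format_error(err: Any) -> Optional[str]:
    """Render a string or {code, message} error as one line; None if empty.

    Hypothetical helper: the "CODE: message" format is an assumption.
    """
    if isinstance(err, str):
        return err.strip() or None
    if isinstance(err, Mapping):
        code = err.get("code")
        message = err.get("message")
        if code and message:
            return f"{code}: {message}"
        value = message or code
        return str(value) if value else None
    # None, numbers, lists, etc. are treated as "no error information".
    return None


def extract_grader_error(result: Mapping[str, Any]) -> Optional[str]:
    """Probe a grader result for an execution error, in the order the PR describes."""
    # 1. Top-level `error` (string or dict).
    text = _format_error(result.get("error"))
    if text:
        return text

    # 2. Nested `sample.error` (string or dict); a null value is ignored.
    sample = result.get("sample")
    if isinstance(sample, Mapping):
        text = _format_error(sample.get("error"))
        if text:
            return text

    # 3. Last resort: the grader flagged an error but gave no details.
    if result.get("status") == "error":
        return "grader reported status=error without details"  # assumed wording
    return None


if __name__ == "__main__":
    # Payload shape taken from the PR description above.
    failed = {
        "name": "coherence",
        "score": None,
        "passed": None,
        "status": "error",
        "sample": {
            "error": {
                "code": "FAILED_EXECUTION",
                "message": (
                    "OpenAI API hits AuthenticationError: 401 PermissionDenied: "
                    "The principal lacks the required data action "
                    "chat/completions/action"
                ),
            }
        },
    }
    print(extract_grader_error(failed))
```

Returning `None` for a null `sample.error` mirrors the third unit test mentioned above: a result that carries a real score is not marked as failed just because the `sample.error` key is present.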