fix(cloud-eval): lift grader execution errors into RowMetric.error#202
Merged
Conversation
) When a Foundry azure_ai_evaluator grader fails to execute (e.g., the evaluator service principal lacks Cognitive Services OpenAI User on the model deployment), the per-metric score returns null and the real cause is buried in result.sample.error.message. Without surfacing it, operators see only actual=missing in the threshold table and have to dig into cloud_output_items.json to find the RBAC failure. The parser now extracts sample.error.message (and top-level error dicts), prefixing the error code when present. The orchestrator's 0-usable-scores warning quotes the first grader error so CI logs carry the actionable cause. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hotfix mirror of #201 onto
main.Source-controlled trace of the same fix already merged to
developas commit6bea832.Summary
When a Foundry
azure_ai_evaluatorgrader fails to execute (most commonly because the evaluator service principal lacksCognitive Services OpenAI Useron the target model deployment), the per-metricscorecomes backnulland the real cause is buried inresult.sample.error.message. The cloud-results parser now lifts that message intoRowMetric.error(including the errorcodeprefix when present), so the actionable error appears inresults.jsonandreport.mdinstead of operators only seeingactual=missingin the threshold table. The orchestrator's "0 usable metric scores" warning also quotes the first grader error so CI logs carry the signal without operators having to download the raw artifact.Files
src/agentops/pipeline/cloud_results.py— new_extract_grader_error+_normalize_error_payloadhelpers;_metric_from_resultnow lifts grader errors.src/agentops/pipeline/orchestrator.py— new_first_metric_errorhelper; the "0 usable scores" warning quotes the first error.tests/unit/test_cloud_results.py— 3 new tests covering the RBAC failure shape, top-levelerrordict, and "real score wins over stalesample.error".CHANGELOG.md— entry under[Unreleased]▸Fixed.Verification
pytest tests/unit/test_cloud_results.py→ 14 passed (locally on develop before merge).pytest tests/→ 792 passed, 3 skipped (locally on develop before merge).Notes
main's existing convention (no.gitattributeshere yet).develop ← main(no-op for code; merges only the branch pointer).Closes the gap left by #195/#196 (parser widening) and #197/#199 (artifact upload): those probed for scores that never existed because the grader had errored. This PR finally surfaces the error.