hotfix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape (cherry-pick #195) #196
Merged
…result shape (#195)

The cloud-eval parser was returning value=null for every metric in real Foundry runs even when graders completed successfully, causing the PR / deploy gate to fire 'Threshold status: FAILED' with all thresholds showing actual=missing on the very first tutorial pass.

Root cause: _metric_from_result only probed {score|value|result|passed} at the top level. The real azure_ai_evaluator shape (verified against Azure/azure-sdk-for-python fixture evaluation_util_convert_expected_output.json) emits {type, name, metric, score, label, reason, threshold, passed, sample, status}, and some custom prompt-based graders nest the score under sample.score / details.score.

Fix: widen the probe to (score, value, result, metric_value, rating, grader_score, numeric_value), then passed (bool), then label ('pass'/'fail'), then descend into sample/details. Treat score: 0 as a legitimate value (was being lost). When still nothing found, record a structured error pointing at the new raw-items artifact.

Also: always persist the raw Foundry output_items as cloud_output_items.json next to results.json so future parser regressions are debuggable from the artifact bundle alone, and emit an explicit progress warning when a cloud run yields zero usable scores despite returning rows.

Tests: +5 new tests covering the real Foundry shape, score=0 boundary, label-only fallback, nested sample.score, and the diagnostic error path. Full suite: 789 passed, 3 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
(cherry picked from commit 0fe6b00)
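The probe order described in the commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the code from #195: the names `extract_score`, `SCORE_KEYS`, `NESTED_KEYS`, and `LABEL_VALUES`, the returned dict layout, and the mapping of `passed` / `label` to 1.0 / 0.0 are all assumptions made for the example. Only the key names and their order come from the commit message.

```python
from typing import Any, Dict, Optional, Tuple

# Key names and probe order are taken from the commit message.
SCORE_KEYS = ("score", "value", "result", "metric_value",
              "rating", "grader_score", "numeric_value")
NESTED_KEYS = ("sample", "details")
# Assumption: how 'pass'/'fail' (and passed=True/False) map to numbers.
LABEL_VALUES = {"pass": 1.0, "fail": 0.0}


def _as_number(raw: Any) -> Optional[float]:
    """Return raw as a float if it is a real number, else None."""
    # bool is a subclass of int, so exclude it; `passed` is handled separately.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    return None


def _first_number(mapping: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (key, number) for the first numeric score key in mapping."""
    for key in SCORE_KEYS:
        number = _as_number(mapping.get(key))
        # Compare against None rather than truthiness so a score of 0 survives.
        if number is not None:
            return key, number
    return None


def extract_score(item: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the widened probe in _metric_from_result."""
    found = _first_number(item)  # 1. numeric keys at the top level
    if found is not None:
        return {"value": found[1], "source": found[0]}

    passed = item.get("passed")  # 2. boolean pass/fail
    if isinstance(passed, bool):
        return {"value": 1.0 if passed else 0.0, "source": "passed"}

    label = item.get("label")  # 3. 'pass' / 'fail' label
    if isinstance(label, str) and label.strip().lower() in LABEL_VALUES:
        return {"value": LABEL_VALUES[label.strip().lower()], "source": "label"}

    for parent in NESTED_KEYS:  # 4. descend into sample / details
        nested = item.get(parent)
        if isinstance(nested, dict):
            found = _first_number(nested)
            if found is not None:
                return {"value": found[1], "source": f"{parent}.{found[0]}"}

    # 5. nothing usable: structured error pointing at the raw-items artifact
    return {"value": None, "source": None,
            "error": "no score found in result item; see cloud_output_items.json"}
```

The explicit `is not None` check is one way to keep `score: 0` from being dropped; whether the original bug was a truthiness check is not stated in the commit message.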
placerda added a commit that referenced this pull request May 29, 2026
…ones) # Conflicts: # CHANGELOG.md
Cherry-picks #195 onto main so the fix ships to PyPI / Marketplace for v0.3.0 users without waiting for the next release cycle.
What this is

Verbatim cherry-pick of the develop merge commit 0fe6b00. CHANGELOG.md conflict resolved by re-applying the same `[Unreleased]` entries on top of main's current changelog (which had the same shape).

Why ship to main directly

Same argument as #194 (PyYAML hotfix shipped to main last week): every tutorial user that hits the prompt-agent PR / deploy workflow today fires this bug, gets a red gate on first pass, and loses the "first pass is green" learning moment that the tutorial depends on. Cannot wait for the next minor.

Validation

- `git diff --ignore-cr-at-eol --stat origin/main..HEAD` → +239 / -7 across 4 files (matches the develop PR exactly).
- `python -m pytest tests/unit/test_cloud_results.py -x -q` → 11 passed.
- Full suite: 789 passed, 3 skipped.

Background

See #195 for the full root-cause writeup, real Foundry on-the-wire schema, and the +5 new tests covering the real `azure_ai_evaluator` shape, `score: 0` boundary, `label`-only graders, nested `sample.score`, and the diagnostic-error path.
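The diagnostic-error path depends on the raw-items artifact that the commit message says is written next to results.json. A minimal sketch of that step and of the zero-usable-scores warning, assuming hypothetical helper names (`persist_raw_items`, `zero_scores_warning`) and message wording; only the file name `cloud_output_items.json` and its location beside `results.json` come from the PR:

```python
import json
from pathlib import Path
from typing import Any, List, Optional


def persist_raw_items(output_items: List[Any], results_path: Path) -> Path:
    """Write the raw output items beside results.json and return the new path."""
    raw_path = results_path.parent / "cloud_output_items.json"
    # default=str keeps the dump from failing on non-JSON types such as datetimes.
    raw_path.write_text(
        json.dumps(output_items, indent=2, default=str), encoding="utf-8"
    )
    return raw_path


def zero_scores_warning(row_count: int, usable_scores: int) -> Optional[str]:
    """Return a warning message when rows came back but none yielded a score."""
    if row_count > 0 and usable_scores == 0:
        return (
            f"cloud run returned {row_count} rows but zero usable scores; "
            "inspect cloud_output_items.json"
        )
    return None
```

Returning the message instead of printing it leaves the caller free to route it to whatever progress reporter the workflow uses.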