hotfix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape (cherry-pick #195) by placerda · Pull Request #196 · Azure/agentops

placerda · 2026-05-29T04:18:42Z

Cherry-picks #195 onto main so the fix ships to PyPI / Marketplace for v0.3.0 users without waiting for the next release cycle.

What this is

Verbatim cherry-pick of the develop merge commit 0fe6b00. CHANGELOG.md conflict resolved by re-applying the same [Unreleased] entries on top of main's current changelog (which had the same shape).

Why ship to main directly

Same argument as #194 (PyYAML hotfix shipped to main last week): every tutorial user who hits the prompt-agent PR / deploy workflow today triggers this bug, gets a red gate on the first pass, and loses the "first pass is green" learning moment that the tutorial depends on. This cannot wait for the next minor release.

Validation

git diff --ignore-cr-at-eol --stat origin/main..HEAD → +239 / -7 across 4 files (matches the develop PR exactly).
python -m pytest tests/unit/test_cloud_results.py -x -q → 11 passed.
Full suite on develop pre-merge: 789 passed, 3 skipped.

Background

See #195 for the full root-cause writeup, real Foundry on-the-wire schema, and the +5 new tests covering the real azure_ai_evaluator shape, score: 0 boundary, label-only graders, nested sample.score, and the diagnostic-error path.

…result shape (#195)

The cloud-eval parser was returning value=null for every metric in real Foundry runs even when graders completed successfully, causing the PR / deploy gate to fire 'Threshold status: FAILED' with all thresholds showing actual=missing on the very first tutorial pass.

Root cause: _metric_from_result only probed {score|value|result|passed} at the top level. The real azure_ai_evaluator shape (verified against Azure/azure-sdk-for-python fixture evaluation_util_convert_expected_output.json) emits {type, name, metric, score, label, reason, threshold, passed, sample, status}, and some custom prompt-based graders nest the score under sample.score / details.score.

Fix: widen the probe to (score, value, result, metric_value, rating, grader_score, numeric_value), then passed (bool), then label ('pass'/'fail'), then descend into sample/details. Treat score: 0 as a legitimate value (was being lost). When still nothing found, record a structured error pointing at the new raw-items artifact.

Also: always persist the raw Foundry output_items as cloud_output_items.json next to results.json so future parser regressions are debuggable from the artifact bundle alone, and emit an explicit progress warning when a cloud run yields zero usable scores despite returning rows.

Tests: +5 new tests covering the real Foundry shape, score=0 boundary, label-only fallback, nested sample.score, and the diagnostic error path. Full suite: 789 passed, 3 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

(cherry picked from commit 0fe6b00)
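The probe order described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual _metric_from_result implementation: the function name extract_value, the depth limit, and the mapping of passed / label to 1.0 / 0.0 are assumptions made for the example, and the sample payloads are invented. Only the key names and their order come from the commit message.

```python
from collections.abc import Mapping
from typing import Any, Optional

# Key names and probe order are taken from the commit message.
NUMERIC_KEYS = ("score", "value", "result", "metric_value",
                "rating", "grader_score", "numeric_value")
NESTED_KEYS = ("sample", "details")


def _is_number(x: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly;
    # otherwise True/False would be picked up as 1/0 by the numeric probe.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def extract_value(result: Mapping[str, Any], depth: int = 0) -> Optional[float]:
    """Hypothetical helper: return a numeric score, or None if nothing usable."""
    # 1. Numeric keys. Compare by type, not truthiness, so score: 0 survives.
    for key in NUMERIC_KEYS:
        candidate = result.get(key)
        if _is_number(candidate):
            return float(candidate)

    # 2. Boolean 'passed' (assumed mapping: True -> 1.0, False -> 0.0).
    passed = result.get("passed")
    if isinstance(passed, bool):
        return 1.0 if passed else 0.0

    # 3. Label-only graders: 'pass' / 'fail' (same assumed mapping).
    label = result.get("label")
    if isinstance(label, str):
        normalized = label.strip().lower()
        if normalized in ("pass", "fail"):
            return 1.0 if normalized == "pass" else 0.0

    # 4. Descend into sample / details (depth limit is an assumption).
    if depth < 2:
        for key in NESTED_KEYS:
            nested = result.get(key)
            if isinstance(nested, Mapping):
                found = extract_value(nested, depth + 1)
                if found is not None:
                    return found

    # Caller would record a structured error for this row.
    return None


if __name__ == "__main__":
    # Invented payloads for illustration only.
    print(extract_value({"name": "coherence", "score": 0, "label": "fail"}))  # 0.0
    print(extract_value({"name": "custom", "sample": {"score": 4}}))          # 4.0
    print(extract_value({"name": "safety", "label": "pass"}))                 # 1.0
    print(extract_value({"name": "broken", "status": "completed"}))           # None
```

The type check in step 1 is what keeps a score of 0 from being dropped. A truthiness test such as `if candidate:` would skip it, although the source does not say that this was how the value was originally lost.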

placerda merged commit 812887a into main May 29, 2026
1 check passed

placerda deleted the hotfix/cloud-results-null-scores branch May 29, 2026 04:19

placerda added a commit that referenced this pull request May 29, 2026

chore: merge main into develop (cloud-results hotfix PR #196 + tombst…

fef1e35

…ones) # Conflicts: # CHANGELOG.md

placerda mentioned this pull request May 29, 2026

fix(cloud-eval): lift grader execution errors into RowMetric.error #202

Merged

placerda merged 1 commit into main from hotfix/cloud-results-null-scores
