Skip to content

hotfix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape (cherry-pick #195)#196

Merged
placerda merged 1 commit into
mainfrom
hotfix/cloud-results-null-scores
May 29, 2026
Merged

hotfix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape (cherry-pick #195)#196
placerda merged 1 commit into
mainfrom
hotfix/cloud-results-null-scores

Conversation

@placerda
Copy link
Copy Markdown
Contributor

Cherry-picks #195 onto main so the fix ships to PyPI / Marketplace alongside v0.3.0 users without waiting for the next release cycle.

What this is

Verbatim cherry-pick of the develop merge commit 0fe6b00. CHANGELOG.md conflict resolved by re-applying the same [Unreleased] entries on top of main's current changelog (which had the same shape).

Why ship to main directly

Same argument as #194 (PyYAML hotfix shipped to main last week): every tutorial user that hits the prompt-agent PR / deploy workflow today fires this bug, gets a red gate on first pass, and loses the first pass is green learning moment that the tutorial depends on. Cannot wait for the next minor.

Validation

  • git diff --ignore-cr-at-eol --stat origin/main..HEAD+239 / -7 across 4 files (matches the develop PR exactly).
  • python -m pytest tests/unit/test_cloud_results.py -x -q11 passed.
  • Full suite on develop pre-merge: 789 passed, 3 skipped.

Background

See #195 for the full root-cause writeup, real Foundry on-the-wire schema, and the +5 new tests covering the real azure_ai_evaluator shape, score: 0 boundary, label-only graders, nested sample.score, and the diagnostic-error path.

…result shape (#195)

The cloud-eval parser was returning value=null for every metric in real Foundry runs even when graders completed successfully, causing the PR / deploy gate to fire 'Threshold status: FAILED' with all thresholds showing actual=missing on the very first tutorial pass.

Root cause: _metric_from_result only probed {score|value|result|passed} at the top level. The real azure_ai_evaluator shape (verified against Azure/azure-sdk-for-python fixture evaluation_util_convert_expected_output.json) emits {type, name, metric, score, label, reason, threshold, passed, sample, status}, and some custom prompt-based graders nest the score under sample.score / details.score.

Fix: widen the probe to (score, value, result, metric_value, rating, grader_score, numeric_value), then passed (bool), then label ('pass'/'fail'), then descend into sample/details. Treat score: 0 as a legitimate value (was being lost). When still nothing found, record a structured error pointing at the new raw-items artifact.

Also: always persist the raw Foundry output_items as cloud_output_items.json next to results.json so future parser regressions are debuggable from the artifact bundle alone, and emit an explicit progress warning when a cloud run yields zero usable scores despite returning rows.

Tests: +5 new tests covering the real Foundry shape, score=0 boundary, label-only fallback, nested sample.score, and the diagnostic error path. Full suite: 789 passed, 3 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
(cherry picked from commit 0fe6b00)
@placerda placerda merged commit 812887a into main May 29, 2026
1 check passed
@placerda placerda deleted the hotfix/cloud-results-null-scores branch May 29, 2026 04:19
placerda added a commit that referenced this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant