fix: distinguish grader execution errors from quality-gate failures by placerda · Pull Request #226 · Azure/agentops

placerda · 2026-06-01T19:53:56Z

What

Fixes the confusing Threshold status: FAILED that the prompt-agent tutorial (step 17) hit on the first agentops eval run after granting data-plane RBAC, even though every threshold was green.

Root cause

items_passed_all counts rows where every grader returned a score. When evaluator workers error out on a subset of rows (auth/RBAC/timeout), no row passes all graders, so items_passed_all = 0 flips overall_passed to false while the aggregate thresholds (which skip None values) all pass. The result: a phantom quality failure that is actually a grader execution failure. The most common trigger is a Cognitive Services OpenAI User role granted moments earlier that is still propagating to the independent, per-call evaluator workers.

Changes

agentops eval run now detects errored graders combined with all-thresholds-passed and prints a Warning explaining it is an execution error (not a quality regression), names the RBAC-propagation cause, surfaces the first underlying grader error, and advises waiting a few minutes before re-running. Exit-code contract unchanged.
New _grader_error_summary helper + focused unit tests (tests/unit/test_eval_run_grader_errors.py).
Corrected RBAC propagation guidance (several minutes, intermittent FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent and end-to-end tutorials and the agentops-eval skill; re-synced the plugin copy.

Testing

python -m pytest tests/ -x -q -> 836 passed, 1 skipped.

When cloud/local evaluator workers error out on a subset of rows (most commonly data-plane RBAC that is still propagating), no dataset row has every grader return a score, so items_passed_all is 0 and �gentops eval run reports Threshold status: FAILED even though every computable threshold passed. This produced confusing phantom quality failures on the first run after granting the Cognitive Services OpenAI User role. - CLI now detects errored graders combined with all-thresholds-passed and prints a Warning clarifying this is an execution failure, names the RBAC-propagation cause, surfaces the first grader error, and advises waiting + re-running. Exit-code contract unchanged. - Added _grader_error_summary helper + focused unit tests. - Corrected RBAC propagation guidance (several minutes, intermittent FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent and end-to-end tutorials and the agentops-eval skill; re-synced plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

placerda merged commit faf48cf into develop Jun 1, 2026
12 checks passed

placerda deleted the fix/grader-error-warning branch June 1, 2026 19:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: distinguish grader execution errors from quality-gate failures#226

fix: distinguish grader execution errors from quality-gate failures#226
placerda merged 1 commit into
developfrom
fix/grader-error-warning

placerda commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

placerda commented Jun 1, 2026

What

Root cause

Changes

Testing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant