Skip to content

fix: distinguish grader execution errors from quality-gate failures#226

Merged
placerda merged 1 commit into
developfrom
fix/grader-error-warning
Jun 1, 2026
Merged

fix: distinguish grader execution errors from quality-gate failures#226
placerda merged 1 commit into
developfrom
fix/grader-error-warning

Conversation

@placerda
Copy link
Copy Markdown
Contributor

@placerda placerda commented Jun 1, 2026

What

Fixes the confusing Threshold status: FAILED that the prompt-agent tutorial (step 17) hit on the first agentops eval run after granting data-plane RBAC, even though every threshold was green.

Root cause

items_passed_all counts rows where every grader returned a score. When evaluator workers error out on a subset of rows (auth/RBAC/timeout), no row passes all graders, so items_passed_all = 0 flips overall_passed to false while the aggregate thresholds (which skip None values) all pass. The result: a phantom quality failure that is actually a grader execution failure. The most common trigger is a Cognitive Services OpenAI User role granted moments earlier that is still propagating to the independent, per-call evaluator workers.

Changes

  • agentops eval run now detects errored graders combined with all-thresholds-passed and prints a Warning explaining it is an execution error (not a quality regression), names the RBAC-propagation cause, surfaces the first underlying grader error, and advises waiting a few minutes before re-running. Exit-code contract unchanged.
  • New _grader_error_summary helper + focused unit tests (tests/unit/test_eval_run_grader_errors.py).
  • Corrected RBAC propagation guidance (several minutes, intermittent FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent and end-to-end tutorials and the agentops-eval skill; re-synced the plugin copy.

Testing

  • python -m pytest tests/ -x -q -> 836 passed, 1 skipped.

When cloud/local evaluator workers error out on a subset of rows (most
commonly data-plane RBAC that is still propagating), no dataset row has
every grader return a score, so items_passed_all is 0 and �gentops eval
run reports Threshold status: FAILED even though every computable
threshold passed. This produced confusing phantom quality failures on
the first run after granting the Cognitive Services OpenAI User role.

- CLI now detects errored graders combined with all-thresholds-passed
  and prints a Warning clarifying this is an execution failure, names the
  RBAC-propagation cause, surfaces the first grader error, and advises
  waiting + re-running. Exit-code contract unchanged.
- Added _grader_error_summary helper + focused unit tests.
- Corrected RBAC propagation guidance (several minutes, intermittent
  FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent
  and end-to-end tutorials and the agentops-eval skill; re-synced plugin.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@placerda placerda merged commit faf48cf into develop Jun 1, 2026
12 checks passed
@placerda placerda deleted the fix/grader-error-warning branch June 1, 2026 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant