fix: distinguish grader execution errors from quality-gate failures#226
Merged
Conversation
When cloud/local evaluator workers error out on a subset of rows (most commonly data-plane RBAC that is still propagating), no dataset row has every grader return a score, so items_passed_all is 0 and �gentops eval run reports Threshold status: FAILED even though every computable threshold passed. This produced confusing phantom quality failures on the first run after granting the Cognitive Services OpenAI User role. - CLI now detects errored graders combined with all-thresholds-passed and prints a Warning clarifying this is an execution failure, names the RBAC-propagation cause, surfaces the first grader error, and advises waiting + re-running. Exit-code contract unchanged. - Added _grader_error_summary helper + focused unit tests. - Corrected RBAC propagation guidance (several minutes, intermittent FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent and end-to-end tutorials and the agentops-eval skill; re-synced plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes the confusing
Threshold status: FAILEDthat the prompt-agent tutorial (step 17) hit on the firstagentops eval runafter granting data-plane RBAC, even though every threshold was green.Root cause
items_passed_allcounts rows where every grader returned a score. When evaluator workers error out on a subset of rows (auth/RBAC/timeout), no row passes all graders, soitems_passed_all = 0flipsoverall_passedtofalsewhile the aggregate thresholds (which skipNonevalues) all pass. The result: a phantom quality failure that is actually a grader execution failure. The most common trigger is aCognitive Services OpenAI Userrole granted moments earlier that is still propagating to the independent, per-call evaluator workers.Changes
agentops eval runnow detects errored graders combined with all-thresholds-passed and prints aWarningexplaining it is an execution error (not a quality regression), names the RBAC-propagation cause, surfaces the first underlying grader error, and advises waiting a few minutes before re-running. Exit-code contract unchanged._grader_error_summaryhelper + focused unit tests (tests/unit/test_eval_run_grader_errors.py).agentops-evalskill; re-synced the plugin copy.Testing
python -m pytest tests/ -x -q-> 836 passed, 1 skipped.