diff --git a/CHANGELOG.md b/CHANGELOG.md index 0b3a8bb..d67d95a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,28 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres ## [Unreleased] +### Changed +- **`agentops eval run` now distinguishes a grader *execution* failure from a + quality-gate failure.** When evaluator workers error out on a subset of rows + (auth/RBAC/timeout), no row has every grader return a score, so + `items_passed_all` is `0` and the run reports `Threshold status: FAILED` even + though every threshold that *could* be computed passed. The CLI now detects + this case (errored graders combined with all thresholds passing) and prints a + `Warning` explaining that this is an execution error, not a quality + regression, names the most common cause (data-plane RBAC granted moments + earlier that is still propagating to the evaluator workers), surfaces the + first underlying grader error, and advises waiting a few minutes before + re-running. The exit-code contract is unchanged. Added the + `_grader_error_summary` helper plus focused unit tests. +- **Corrected the RBAC propagation guidance in the tutorials and the + `agentops-eval` skill.** Data-plane role assignments on Cognitive Services + accounts can take several minutes (not 30-120 seconds) to reach the + independent, per-row evaluator workers, which can produce an *intermittent* + `FAILED` with otherwise-green thresholds on the first run after granting + access. The prompt-agent, hosted-agent, and end-to-end tutorials and the + skill now describe this symptom and tell readers to wait and re-run rather + than lower thresholds. 
+ ## [0.3.5] - 2026-06-01 ### Changed diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md index 4074a9d..e338baf 100644 --- a/docs/tutorial-end-to-end.md +++ b/docs/tutorial-end-to-end.md @@ -312,7 +312,15 @@ az role assignment create ` --scope /subscriptions//resourceGroups/ ``` -Propagation usually completes within 30–120 seconds. +> **Give the assignment a few minutes to propagate.** Data-plane role +> assignments on the AI Services account do **not** take effect +> instantly — propagation to the evaluator workers can take several +> minutes (occasionally up to ~15). Evaluators authenticate per call, so +> the **first eval right after granting the role may show intermittent +> `AuthenticationError` on a subset of graders and report +> `Threshold status: FAILED` even when every threshold is green**. This +> is a grader execution failure, not a quality regression — wait a few +> minutes and re-run the eval. ## 2. Create the travel eval dataset diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md index 9c7ae2e..188f076 100644 --- a/docs/tutorial-hosted-agent-quickstart.md +++ b/docs/tutorial-hosted-agent-quickstart.md @@ -334,7 +334,15 @@ az role assignment create ` --scope /subscriptions//resourceGroups/ ``` -Propagation usually completes within 30–120 seconds. +> **Give the assignment a few minutes to propagate.** Data-plane role +> assignments on the AI Services account do **not** take effect +> instantly — propagation to the local/Foundry evaluator workers can +> take several minutes (occasionally up to ~15). Evaluators authenticate +> per call, so the **first eval right after granting the role may show +> intermittent `AuthenticationError` on a subset of graders and report +> `Threshold status: FAILED` even when every threshold is green**. This +> is a grader execution failure, not a quality regression — wait a few +> minutes and re-run the eval. ## 5. 
Initialize AgentOps interactively diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md index d622b25..b2843d7 100644 --- a/docs/tutorial-prompt-agent-quickstart.md +++ b/docs/tutorial-prompt-agent-quickstart.md @@ -270,10 +270,23 @@ az role assignment create ` ``` Repeat the command with the `travel-agent-dev` resource group if the dev -project lives in a different RG. The assignment usually propagates within -30–120 seconds. AgentOps Doctor will detect the missing assignment in a -future release, but until then this is a manual one-time setup step per -new environment. +project lives in a different RG. + +> **Give the assignment a few minutes to propagate.** Data-plane role +> assignments on the AI Services account do **not** take effect +> instantly — propagation to the Foundry evaluator workers can take +> several minutes (occasionally up to ~15). The cloud eval runs each +> grader as an independent worker that authenticates separately, so the +> **first run right after granting the role may show intermittent +> `AuthenticationError` on a subset of graders and report +> `Threshold status: FAILED` even when every threshold is green** (no +> single row had all graders succeed). This is a grader execution +> failure, not a quality regression. Wait a few minutes and re-run +> `agentops eval run` — once propagation finishes, every grader scores +> and the gate passes. + +AgentOps Doctor will detect the missing assignment in a future release, +but until then this is a manual one-time setup step per new environment. ## 4. 
Seed `travel-agent` in the sandbox project diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md index 662fb53..b5b2701 100644 --- a/plugins/agentops/skills/agentops-eval/SKILL.md +++ b/plugins/agentops/skills/agentops-eval/SKILL.md @@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already assigned, or if a previous `agentops eval run` succeeded against the same Foundry account. +**Propagation:** data-plane role assignments do not take effect +instantly — allow several minutes (occasionally up to ~15) before the +first eval. The cloud/local graders authenticate per call, so if the +user runs an eval immediately after this preflight and sees intermittent +`AuthenticationError` on a subset of graders plus +`Threshold status: FAILED` while the visible thresholds are green, that +is propagation lag (a grader **execution** failure), not a quality +regression. Tell the user to wait a few minutes and re-run +`agentops eval run`; do not treat it as a failing gate or start changing +thresholds. + ## Step 1 - Analyze evaluation setup Run the deterministic local triage first: diff --git a/src/agentops/cli/app.py b/src/agentops/cli/app.py index edeff00..dafdc3e 100644 --- a/src/agentops/cli/app.py +++ b/src/agentops/cli/app.py @@ -2055,10 +2055,57 @@ def _run_flat_schema_eval( if result.summary.overall_passed: typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}") return + + # Distinguish a genuine quality-gate failure from grader *execution* + # errors. When evaluator workers error (auth/RBAC/timeout) on a subset of + # rows, no row has every grader succeed, so `items_passed_all` is 0 and the + # gate reports FAILED even though every threshold that *could* be computed + # passed. 
Surfacing this prevents users from chasing a phantom quality + # regression - the most common cause is data-plane RBAC granted moments + # earlier that is still propagating to the evaluator workers. + errored, total, first_error = _grader_error_summary(result) + all_thresholds_passed = ( + result.summary.thresholds_total > 0 + and result.summary.thresholds_passed == result.summary.thresholds_total + ) + if errored and all_thresholds_passed: + typer.echo( + f"{_cli_warn('Warning')}: {errored} of {total} grader execution(s) " + "errored, so no dataset row had every grader return a score. This is " + "a grader execution failure, not a quality regression - every " + "threshold that could be computed passed. The most common cause is " + "data-plane RBAC granted recently that is still propagating to the " + "evaluator workers; wait a few minutes and re-run `agentops eval run`.", + err=True, + ) + if first_error: + typer.echo(f"{_cli_warn('Warning')}: first grader error: {first_error}", err=True) + typer.echo(f"{_cli_label('Threshold status')}: {style('FAILED', 'bold', 'red')}") raise typer.Exit(code=exit_code_from(result)) +def _grader_error_summary(result) -> tuple[int, int, Optional[str]]: + """Return ``(errored_metric_count, total_metric_count, first_error)``. + + Walks every per-row metric in the run so the CLI can tell a grader + *execution* failure (auth/RBAC/timeout) apart from a quality-gate failure. + The first non-empty error string is lifted out as the actionable cause. 
+ """ + errored = 0 + total = 0 + first_error: Optional[str] = None + for row in result.rows: + for metric in row.metrics: + total += 1 + err = getattr(metric, "error", None) + if isinstance(err, str) and err.strip(): + errored += 1 + if first_error is None: + first_error = err.strip() + return errored, total, first_error + + def _default_flat_output_dir(config_path: Path) -> Path: base = config_path.parent / ".agentops" / "results" timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ") diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md index 662fb53..b5b2701 100644 --- a/src/agentops/templates/skills/agentops-eval/SKILL.md +++ b/src/agentops/templates/skills/agentops-eval/SKILL.md @@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already assigned, or if a previous `agentops eval run` succeeded against the same Foundry account. +**Propagation:** data-plane role assignments do not take effect +instantly — allow several minutes (occasionally up to ~15) before the +first eval. The cloud/local graders authenticate per call, so if the +user runs an eval immediately after this preflight and sees intermittent +`AuthenticationError` on a subset of graders plus +`Threshold status: FAILED` while the visible thresholds are green, that +is propagation lag (a grader **execution** failure), not a quality +regression. Tell the user to wait a few minutes and re-run +`agentops eval run`; do not treat it as a failing gate or start changing +thresholds. + ## Step 1 - Analyze evaluation setup Run the deterministic local triage first: diff --git a/tests/unit/test_eval_run_grader_errors.py b/tests/unit/test_eval_run_grader_errors.py new file mode 100644 index 0000000..565e53c --- /dev/null +++ b/tests/unit/test_eval_run_grader_errors.py @@ -0,0 +1,150 @@ +"""CLI behaviour when graders *execute* but a subset errors out. 
+ +A grader execution error (auth/RBAC/timeout) is not a quality regression, but +because ``items_passed_all`` requires every grader on a row to succeed, a single +errored grader flips ``overall_passed`` to ``False`` and the run reports +``Threshold status: FAILED`` even though every computable threshold passed. + +The CLI must surface that distinction loudly so users (the most common trigger +is data-plane RBAC that is still propagating) do not chase a phantom quality +failure or start lowering thresholds. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +from typer.testing import CliRunner + +from agentops.cli.app import _grader_error_summary, app +from agentops.core.results import ( + RowMetric, + RowResult, + RunResult, + RunSummary, + TargetInfo, + ThresholdEvaluation, +) + +runner = CliRunner() + +_AUTH_ERROR = ( + "FAILED_EXECUTION: (UserError) OpenAI API hits AuthenticationError: " + "Principal does not have access to API/Operation." +) + + +def _result_with_partial_grader_errors() -> RunResult: + """One row where coherence scored but similarity errored on auth.""" + row = RowResult( + row_index=0, + input="plan a trip", + expected="an itinerary", + response="here is an itinerary", + metrics=[ + RowMetric(name="coherence", value=5.0), + RowMetric(name="similarity", value=None, error=_AUTH_ERROR), + ], + ) + summary = RunSummary( + items_total=1, + items_passed_all=0, # the errored grader means no row passed all + items_pass_rate=0.0, + thresholds_total=1, + thresholds_passed=1, # every computable threshold passed + threshold_pass_rate=1.0, + overall_passed=False, + ) + return RunResult( + started_at="2026-06-01T00:00:00+00:00", + finished_at="2026-06-01T00:01:00+00:00", + duration_seconds=60.0, + target=TargetInfo(kind="foundry_prompt", raw="travel-agent:2"), + dataset_path="dataset.jsonl", + evaluators=["CoherenceEvaluator", "SimilarityEvaluator"], + rows=[row], + aggregate_metrics={"coherence": 5.0}, + thresholds=[ + 
ThresholdEvaluation( + metric="coherence", + criteria=">=", + expected=">=3", + actual="5", + passed=True, + ) + ], + summary=summary, + ) + + +def test_grader_error_summary_counts_and_lifts_first_error() -> None: + errored, total, first_error = _grader_error_summary( + _result_with_partial_grader_errors() + ) + assert (errored, total) == (1, 2) + assert first_error is not None + assert "AuthenticationError" in first_error + + +def _write_minimal_config(tmp_path: Path) -> Path: + dataset = tmp_path / "dataset.jsonl" + dataset.write_text(json.dumps({"input": "hi", "expected": "hi"}), encoding="utf-8") + config = tmp_path / "agentops.yaml" + config.write_text( + json.dumps( + {"version": 1, "agent": "model:gpt-4o", "dataset": str(dataset)} + ), + encoding="utf-8", + ) + return config + + +def test_eval_run_warns_on_partial_grader_errors(tmp_path, monkeypatch) -> None: + config = _write_minimal_config(tmp_path) + output = tmp_path / "out" + output.mkdir() + + crafted = _result_with_partial_grader_errors() + import agentops.pipeline.orchestrator as orch + + monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: crafted) + + result = runner.invoke( + app, + ["eval", "run", "--config", str(config), "--output", str(output)], + ) + + # A grader-execution failure keeps the gate-failed exit code... + assert result.exit_code == 2, result.output + # ...but the user is told it is an execution error, not a quality failure. + assert "grader execution(s) errored" in result.output + assert "propagating" in result.output + assert "AuthenticationError" in result.output + assert "FAILED" in result.output + + +def test_eval_run_no_warning_when_no_grader_errors(tmp_path, monkeypatch) -> None: + config = _write_minimal_config(tmp_path) + output = tmp_path / "out" + output.mkdir() + + clean = _result_with_partial_grader_errors() + # Drop the errored grader so the row is clean and the gate genuinely passes. 
+ clean.rows[0].metrics = [RowMetric(name="coherence", value=5.0)] + clean.summary.items_passed_all = 1 + clean.summary.items_pass_rate = 1.0 + clean.summary.overall_passed = True + + import agentops.pipeline.orchestrator as orch + + monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: clean) + + result = runner.invoke( + app, + ["eval", "run", "--config", str(config), "--output", str(output)], + ) + + assert result.exit_code == 0, result.output + assert "PASSED" in result.output + assert "grader execution(s) errored" not in result.output
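
Review note: the warning condition introduced in `_run_flat_schema_eval` can be exercised without the CLI or the real result models. The sketch below mirrors the counting loop from `_grader_error_summary` and the `errored and all_thresholds_passed` check from the diff. `Metric`, `Row`, `grader_error_summary`, and `should_warn` are hypothetical stand-ins written for illustration; they are not the `RowMetric`/`RowResult` classes or any function in `agentops`, and the real models may carry more fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metric:
    """Hypothetical stand-in for a per-row grader result."""
    name: str
    value: Optional[float] = None
    error: Optional[str] = None


@dataclass
class Row:
    """Hypothetical stand-in for one dataset row and its grader results."""
    metrics: list[Metric] = field(default_factory=list)


def grader_error_summary(rows: list[Row]) -> tuple[int, int, Optional[str]]:
    """Return (errored_count, total_count, first_error) across all rows."""
    errored = 0
    total = 0
    first_error: Optional[str] = None
    for row in rows:
        for metric in row.metrics:
            total += 1
            err = getattr(metric, "error", None)
            # Only non-empty strings count; None and whitespace are ignored.
            if isinstance(err, str) and err.strip():
                errored += 1
                if first_error is None:
                    first_error = err.strip()
    return errored, total, first_error


def should_warn(errored: int, thresholds_passed: int, thresholds_total: int) -> bool:
    """Warn only when graders errored AND every computed threshold passed."""
    all_passed = thresholds_total > 0 and thresholds_passed == thresholds_total
    return errored > 0 and all_passed


rows = [
    Row([Metric("coherence", 5.0),
         Metric("similarity", error="  AuthenticationError: no access ")]),
    Row([Metric("coherence", 4.0, error="   ")]),  # whitespace-only: not an error
]
errored, total, first_error = grader_error_summary(rows)
print(errored, total, first_error)
```

Two properties follow from the conditions as written in the diff. A whitespace-only `error` value is not counted as an errored grader. The warning also stays silent when `thresholds_total` is `0`, so a run in which every grader errored and no threshold could be computed falls through to the plain `FAILED` line without the propagation hint.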