Azure · placerda · Jun 1, 2026 · Jun 1, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,28 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 
 ## [Unreleased]
 
+### Changed
+- **`agentops eval run` now distinguishes a grader *execution* failure from a
+  quality-gate failure.** When evaluator workers error out on some grader
+  calls (auth/RBAC/timeout), a run can end with no row on which every grader
+  returned a score, so `items_passed_all` is `0` and the run reports
+  `Threshold status: FAILED` even though every threshold that *could* be
+  computed passed. The CLI now detects this case (errored graders combined
+  with all thresholds passing) and prints a `Warning` that explains this is an
+  execution error, not a quality regression; names the most common cause
+  (data-plane RBAC granted moments earlier that is still propagating to the
+  evaluator workers); surfaces the first underlying grader error; and advises
+  waiting a few minutes before re-running. The exit-code contract is
+  unchanged. Added the `_grader_error_summary` helper plus focused unit tests.
+- **Corrected the RBAC propagation guidance in the tutorials and the
+  `agentops-eval` skill.** Data-plane role assignments on Cognitive Services
+  accounts can take several minutes (not the previously documented 30–120
+  seconds) to reach the independent, per-row evaluator workers, which can
+  produce an *intermittent* `FAILED` with otherwise-green thresholds on the
+  first run after granting access. The prompt-agent, hosted-agent, and
+  end-to-end tutorials and the skill now describe this symptom and tell
+  readers to wait and re-run rather than lower thresholds.
+
 ## [0.3.5] - 2026-06-01
 
 ### Changed

diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md
@@ -312,7 +312,15 @@ az role assignment create `
   --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
 ```
 
-Propagation usually completes within 30–120 seconds.
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the evaluator workers can take several
+> minutes (occasionally up to ~15). Evaluators authenticate per call, so
+> the **first eval right after granting the role may show intermittent
+> `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green**. This
+> is a grader execution failure, not a quality regression — wait a few
+> minutes and re-run the eval.
 
 ## 2. Create the travel eval dataset
 

diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md
@@ -334,7 +334,15 @@ az role assignment create `
   --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
 ```
 
-Propagation usually completes within 30–120 seconds.
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the local/Foundry evaluator workers can
+> take several minutes (occasionally up to ~15). Evaluators authenticate
+> per call, so the **first eval right after granting the role may show
+> intermittent `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green**. This
+> is a grader execution failure, not a quality regression — wait a few
+> minutes and re-run the eval.
 
 ## 5. Initialize AgentOps interactively
 

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
@@ -270,10 +270,23 @@ az role assignment create `
 ```
 
 Repeat the command with the `travel-agent-dev` resource group if the dev
-project lives in a different RG. The assignment usually propagates within
-30–120 seconds. AgentOps Doctor will detect the missing assignment in a
-future release, but until then this is a manual one-time setup step per
-new environment.
+project lives in a different RG.
+
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the Foundry evaluator workers can take
+> several minutes (occasionally up to ~15). The cloud eval runs each
+> grader as an independent worker that authenticates separately, so the
+> **first run right after granting the role may show intermittent
+> `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green** (no
+> single row had all graders succeed). This is a grader execution
+> failure, not a quality regression. Wait a few minutes and re-run
+> `agentops eval run` — once propagation finishes, every grader scores
+> and the gate passes.
+
+AgentOps Doctor will detect the missing assignment in a future release,
+but until then this is a manual one-time setup step per new environment.
 
 ## 4. Seed `travel-agent` in the sandbox project
 

diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md
@@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already
 assigned, or if a previous `agentops eval run` succeeded against the
 same Foundry account.
 
+**Propagation:** data-plane role assignments do not take effect
+instantly — allow several minutes (occasionally up to ~15) before the
+first eval. The cloud/local graders authenticate per call, so if the
+user runs an eval immediately after this preflight and sees intermittent
+`AuthenticationError` on a subset of graders plus
+`Threshold status: FAILED` while the visible thresholds are green, that
+is propagation lag (a grader **execution** failure), not a quality
+regression. Tell the user to wait a few minutes and re-run
+`agentops eval run`; do not treat it as a failing gate or start changing
+thresholds.
+
 ## Step 1 - Analyze evaluation setup
 
 Run the deterministic local triage first:

diff --git a/src/agentops/cli/app.py b/src/agentops/cli/app.py
@@ -2055,10 +2055,57 @@ def _run_flat_schema_eval(
     if result.summary.overall_passed:
         typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}")
         return
+
+    # Distinguish a genuine quality-gate failure from grader *execution*
+    # errors. When evaluator workers error (auth/RBAC/timeout) on some grader
+    # calls, a run can end with no row where every grader succeeded, so
+    # `items_passed_all` is 0 and the gate reports FAILED even though every
+    # computable threshold passed. Surfacing this keeps users from chasing a
+    # phantom quality regression; the most common cause is data-plane RBAC
+    # granted moments earlier that is still propagating to the evaluator workers.
+    errored, total, first_error = _grader_error_summary(result)
+    all_thresholds_passed = (
+        result.summary.thresholds_total > 0
+        and result.summary.thresholds_passed == result.summary.thresholds_total
+    )
+    if errored and all_thresholds_passed:
+        typer.echo(
+            f"{_cli_warn('Warning')}: {errored} of {total} grader execution(s) "
+            "errored, so one or more dataset rows are missing a grader score. "
+            "This is a grader execution failure, not a quality regression - every "
+            "threshold that could be computed passed. The most common cause is "
+            "data-plane RBAC granted recently that is still propagating to the "
+            "evaluator workers; wait a few minutes and re-run `agentops eval run`.",
+            err=True,
+        )
+        if first_error:
+            typer.echo(f"{_cli_warn('Warning')}: first grader error: {first_error}", err=True)
+
     typer.echo(f"{_cli_label('Threshold status')}: {style('FAILED', 'bold', 'red')}")
     raise typer.Exit(code=exit_code_from(result))
 
 
+def _grader_error_summary(result) -> tuple[int, int, Optional[str]]:
+    """Return ``(errored_metric_count, total_metric_count, first_error)``.
+
+    Walks every per-row metric in the run so the CLI can tell a grader
+    *execution* failure (auth/RBAC/timeout) apart from a quality-gate failure.
+    The first non-empty error string is lifted out as the actionable cause.
+    """
+    errored = 0
+    total = 0
+    first_error: Optional[str] = None
+    for row in result.rows:
+        for metric in row.metrics:
+            total += 1
+            err = getattr(metric, "error", None)
+            if isinstance(err, str) and err.strip():
+                errored += 1
+                if first_error is None:
+                    first_error = err.strip()
+    return errored, total, first_error
+
+
 def _default_flat_output_dir(config_path: Path) -> Path:
     base = config_path.parent / ".agentops" / "results"
     timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
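
Reviewer note: the counting rule in `_grader_error_summary` can be tried in isolation with the sketch below. It copies the helper's loop into a standalone function and feeds it `SimpleNamespace` stand-ins; the `metric` factory, the sample error text, and the two-row layout are hypothetical and are not the project's `RowMetric`/`RunResult` types.

```python
from types import SimpleNamespace
from typing import Optional


def grader_error_summary(result) -> tuple[int, int, Optional[str]]:
    """Standalone copy of the helper's counting loop, for illustration only."""
    errored = 0
    total = 0
    first_error: Optional[str] = None
    for row in result.rows:
        for m in row.metrics:
            total += 1
            err = getattr(m, "error", None)
            # Only non-empty strings count as errors; None and blanks do not.
            if isinstance(err, str) and err.strip():
                errored += 1
                if first_error is None:
                    first_error = err.strip()
    return errored, total, first_error


def metric(name, value=None, error=None):
    # Hypothetical stand-in for a per-row metric; only `error` is inspected.
    return SimpleNamespace(name=name, value=value, error=error)


result = SimpleNamespace(
    rows=[
        # Row 0: one grader scored, one errored (error text padded with spaces).
        SimpleNamespace(
            metrics=[
                metric("coherence", 5.0),
                metric("similarity", error="  AuthenticationError: no access "),
            ]
        ),
        # Row 1: both graders scored; the whitespace-only error is ignored.
        SimpleNamespace(
            metrics=[
                metric("coherence", 4.0, error="   "),
                metric("similarity", 4.5),
            ]
        ),
    ]
)

summary = grader_error_summary(result)
print(summary)
```

In this sample one of four grader executions errors while the second row is fully scored, so a non-zero errored count on its own does not establish that every row is incomplete.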

diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md
@@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already
 assigned, or if a previous `agentops eval run` succeeded against the
 same Foundry account.
 
+**Propagation:** data-plane role assignments do not take effect
+instantly — allow several minutes (occasionally up to ~15) before the
+first eval. The cloud/local graders authenticate per call, so if the
+user runs an eval immediately after this preflight and sees intermittent
+`AuthenticationError` on a subset of graders plus
+`Threshold status: FAILED` while the visible thresholds are green, that
+is propagation lag (a grader **execution** failure), not a quality
+regression. Tell the user to wait a few minutes and re-run
+`agentops eval run`; do not treat it as a failing gate or start changing
+thresholds.
+
 ## Step 1 - Analyze evaluation setup
 
 Run the deterministic local triage first:
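
Reviewer note: the "wait a few minutes and re-run" advice in the propagation paragraph could be scripted roughly as follows. This is a minimal sketch, not part of AgentOps: `rerun_while_graders_error`, the injected `run_eval` callable, and the 180-second default are all assumptions chosen for illustration, and the caller is expected to wrap its own eval invocation and report `(errored_count, all_thresholds_passed)`.

```python
import time
from typing import Callable


def rerun_while_graders_error(
    run_eval: Callable[[], tuple[int, bool]],
    max_attempts: int = 4,
    wait_seconds: float = 180.0,  # illustrative value, not a documented default
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, bool, int]:
    """Re-run only while graders error with every computable threshold green.

    Returns (errored_count, all_thresholds_passed, attempts_used).
    """
    attempts = 0
    errored, thresholds_ok = 0, False
    while attempts < max_attempts:
        attempts += 1
        errored, thresholds_ok = run_eval()
        # A real threshold failure, or a clean run, is final: stop immediately.
        if not (errored and thresholds_ok):
            break
        # Propagation-like outcome: wait before the next attempt, if any remain.
        if attempts < max_attempts:
            sleep(wait_seconds)
    return errored, thresholds_ok, attempts


# Simulated runs: grader errors shrink as the role assignment propagates.
outcomes = iter([(3, True), (1, True), (0, True)])
waits: list[float] = []
final = rerun_while_graders_error(lambda: next(outcomes), sleep=waits.append)
print(final, waits)
```

Injecting `sleep` keeps the sketch testable without real delays. The loop deliberately does not retry when a threshold fails, which matches the guidance not to treat propagation lag and quality regressions the same way.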

diff --git a/tests/unit/test_eval_run_grader_errors.py b/tests/unit/test_eval_run_grader_errors.py
@@ -0,0 +1,150 @@
+"""CLI behaviour when graders *execute* but a subset errors out.
+
+A grader execution error (auth/RBAC/timeout) is not a quality regression, but
+because ``items_passed_all`` requires every grader on a row to succeed, a single
+errored grader flips ``overall_passed`` to ``False`` and the run reports
+``Threshold status: FAILED`` even though every computable threshold passed.
+
+The CLI must surface that distinction loudly so that users do not chase a
+phantom quality failure or start lowering thresholds. The most common trigger
+is data-plane RBAC that is still propagating.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from typer.testing import CliRunner
+
+from agentops.cli.app import _grader_error_summary, app
+from agentops.core.results import (
+    RowMetric,
+    RowResult,
+    RunResult,
+    RunSummary,
+    TargetInfo,
+    ThresholdEvaluation,
+)
+
+runner = CliRunner()
+
+_AUTH_ERROR = (
+    "FAILED_EXECUTION: (UserError) OpenAI API hits AuthenticationError: "
+    "Principal does not have access to API/Operation."
+)
+
+
+def _result_with_partial_grader_errors() -> RunResult:
+    """One row where coherence scored but similarity errored on auth."""
+    row = RowResult(
+        row_index=0,
+        input="plan a trip",
+        expected="an itinerary",
+        response="here is an itinerary",
+        metrics=[
+            RowMetric(name="coherence", value=5.0),
+            RowMetric(name="similarity", value=None, error=_AUTH_ERROR),
+        ],
+    )
+    summary = RunSummary(
+        items_total=1,
+        items_passed_all=0,  # the errored grader means no row passed all
+        items_pass_rate=0.0,
+        thresholds_total=1,
+        thresholds_passed=1,  # every computable threshold passed
+        threshold_pass_rate=1.0,
+        overall_passed=False,
+    )
+    return RunResult(
+        started_at="2026-06-01T00:00:00+00:00",
+        finished_at="2026-06-01T00:01:00+00:00",
+        duration_seconds=60.0,
+        target=TargetInfo(kind="foundry_prompt", raw="travel-agent:2"),
+        dataset_path="dataset.jsonl",
+        evaluators=["CoherenceEvaluator", "SimilarityEvaluator"],
+        rows=[row],
+        aggregate_metrics={"coherence": 5.0},
+        thresholds=[
+            ThresholdEvaluation(
+                metric="coherence",
+                criteria=">=",
+                expected=">=3",
+                actual="5",
+                passed=True,
+            )
+        ],
+        summary=summary,
+    )
+
+
+def test_grader_error_summary_counts_and_lifts_first_error() -> None:
+    errored, total, first_error = _grader_error_summary(
+        _result_with_partial_grader_errors()
+    )
+    assert (errored, total) == (1, 2)
+    assert first_error is not None
+    assert "AuthenticationError" in first_error
+
+
+def _write_minimal_config(tmp_path: Path) -> Path:
+    dataset = tmp_path / "dataset.jsonl"
+    dataset.write_text(json.dumps({"input": "hi", "expected": "hi"}), encoding="utf-8")
+    config = tmp_path / "agentops.yaml"
+    config.write_text(
+        json.dumps(
+            {"version": 1, "agent": "model:gpt-4o", "dataset": str(dataset)}
+        ),
+        encoding="utf-8",
+    )
+    return config
+
+
+def test_eval_run_warns_on_partial_grader_errors(tmp_path, monkeypatch) -> None:
+    config = _write_minimal_config(tmp_path)
+    output = tmp_path / "out"
+    output.mkdir()
+
+    crafted = _result_with_partial_grader_errors()
+    import agentops.pipeline.orchestrator as orch
+
+    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: crafted)
+
+    result = runner.invoke(
+        app,
+        ["eval", "run", "--config", str(config), "--output", str(output)],
+    )
+
+    # A grader-execution failure keeps the gate-failed exit code...
+    assert result.exit_code == 2, result.output
+    # ...but the user is told it is an execution error, not a quality failure.
+    assert "grader execution(s) errored" in result.output
+    assert "propagating" in result.output
+    assert "AuthenticationError" in result.output
+    assert "FAILED" in result.output
+
+
+def test_eval_run_no_warning_when_no_grader_errors(tmp_path, monkeypatch) -> None:
+    config = _write_minimal_config(tmp_path)
+    output = tmp_path / "out"
+    output.mkdir()
+
+    clean = _result_with_partial_grader_errors()
+    # Drop the errored grader so the row is clean and the gate genuinely passes.
+    clean.rows[0].metrics = [RowMetric(name="coherence", value=5.0)]
+    clean.summary.items_passed_all = 1
+    clean.summary.items_pass_rate = 1.0
+    clean.summary.overall_passed = True
+
+    import agentops.pipeline.orchestrator as orch
+
+    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: clean)
+
+    result = runner.invoke(
+        app,
+        ["eval", "run", "--config", str(config), "--output", str(output)],
+    )
+
+    assert result.exit_code == 0, result.output
+    assert "PASSED" in result.output
+    assert "grader execution(s) errored" not in result.output