Merged
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,28 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres

## [Unreleased]

### Changed
- **`agentops eval run` now distinguishes a grader *execution* failure from a
quality-gate failure.** When evaluator workers error out (auth/RBAC/timeout)
on enough rows that no row has every grader return a score,
`items_passed_all` is `0` and the run reports `Threshold status: FAILED` even
though every threshold that *could* be computed passed. The CLI now detects
this case (errored graders combined with all thresholds passing) and prints a
`Warning` that explains this is an execution error rather than a quality
regression, names the most common cause (data-plane RBAC granted moments
earlier that is still propagating to the evaluator workers), surfaces the
first underlying grader error, and advises waiting a few minutes before
re-running. The exit-code contract is unchanged. Added the
`_grader_error_summary` helper plus focused unit tests.
- **Corrected the RBAC propagation guidance in the tutorials and the
`agentops-eval` skill.** Data-plane role assignments on Cognitive Services
accounts can take several minutes (not 30-120 seconds) to reach the
independent, per-row evaluator workers, which can produce an *intermittent*
`FAILED` with otherwise-green thresholds on the first run after granting
access. The prompt-agent, hosted-agent, and end-to-end tutorials and the
skill now describe this symptom and tell readers to wait and re-run rather
than lower thresholds.

## [0.3.5] - 2026-06-01

### Changed
10 changes: 9 additions & 1 deletion docs/tutorial-end-to-end.md
@@ -312,7 +312,15 @@ az role assignment create `
--scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
```

-Propagation usually completes within 30–120 seconds.
> **Give the assignment a few minutes to propagate.** Data-plane role
> assignments on the AI Services account do **not** take effect
> instantly — propagation to the evaluator workers can take several
> minutes (occasionally up to ~15). Evaluators authenticate per call, so
> the **first eval right after granting the role may show intermittent
> `AuthenticationError` on a subset of graders and report
> `Threshold status: FAILED` even when every threshold is green**. This
> is a grader execution failure, not a quality regression — wait a few
> minutes and re-run the eval.
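
The wait-and-re-run advice can be sketched as a small retry loop. This is a hypothetical helper, not part of AgentOps: the name `rerun_until_graders_succeed`, the `run_eval` callable, and the default wait values are assumptions for illustration. It relies only on the two signals described above, a non-zero exit code and `AuthenticationError` appearing in the output.

```python
import time
from typing import Callable, Tuple


# Hypothetical helper, not part of AgentOps. `run_eval` is any callable that
# runs the eval once and returns (exit_code, combined_output).
def rerun_until_graders_succeed(
    run_eval: Callable[[], Tuple[int, str]],
    attempts: int = 4,
    wait_seconds: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Re-run the eval while the output still shows grader auth errors."""
    exit_code, output = run_eval()
    for _ in range(attempts - 1):
        # Stop as soon as the run no longer looks like propagation lag.
        # A failure without auth errors is returned as-is, never retried.
        if exit_code == 0 or "AuthenticationError" not in output:
            break
        sleep(wait_seconds)
        exit_code, output = run_eval()
    return exit_code, output
```

In practice `run_eval` might wrap a `subprocess.run` call to `agentops eval run` and return its exit code and captured output. The loop deliberately gives up on anything that is not an authentication error, so a genuine quality regression still fails on the first attempt.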

## 2. Create the travel eval dataset

10 changes: 9 additions & 1 deletion docs/tutorial-hosted-agent-quickstart.md
@@ -334,7 +334,15 @@ az role assignment create `
--scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
```

-Propagation usually completes within 30–120 seconds.
> **Give the assignment a few minutes to propagate.** Data-plane role
> assignments on the AI Services account do **not** take effect
> instantly — propagation to the local/Foundry evaluator workers can
> take several minutes (occasionally up to ~15). Evaluators authenticate
> per call, so the **first eval right after granting the role may show
> intermittent `AuthenticationError` on a subset of graders and report
> `Threshold status: FAILED` even when every threshold is green**. This
> is a grader execution failure, not a quality regression — wait a few
> minutes and re-run the eval.

## 5. Initialize AgentOps interactively

21 changes: 17 additions & 4 deletions docs/tutorial-prompt-agent-quickstart.md
@@ -270,10 +270,23 @@ az role assignment create `
```

Repeat the command with the `travel-agent-dev` resource group if the dev
-project lives in a different RG. The assignment usually propagates within
-30–120 seconds. AgentOps Doctor will detect the missing assignment in a
-future release, but until then this is a manual one-time setup step per
-new environment.
project lives in a different RG.

> **Give the assignment a few minutes to propagate.** Data-plane role
> assignments on the AI Services account do **not** take effect
> instantly — propagation to the Foundry evaluator workers can take
> several minutes (occasionally up to ~15). The cloud eval runs each
> grader as an independent worker that authenticates separately, so the
> **first run right after granting the role may show intermittent
> `AuthenticationError` on a subset of graders and report
> `Threshold status: FAILED` even when every threshold is green** (no
> single row had all graders succeed). This is a grader execution
> failure, not a quality regression. Wait a few minutes and re-run
> `agentops eval run` — once propagation finishes, every grader scores
> and the gate passes.

AgentOps Doctor will detect the missing assignment in a future release,
but until then this is a manual one-time setup step per new environment.
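
To make the "no single row had all graders succeed" case concrete, here is a small self-contained illustration. The `Metric` and `Row` stand-ins and the `row_passed_all` rule are assumptions made for the example. They are not the AgentOps result models, and the real gate logic may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metric:  # hypothetical stand-in for one grader result on one row
    name: str
    value: Optional[float] = None
    error: Optional[str] = None


@dataclass
class Row:  # hypothetical stand-in for one dataset row
    metrics: List[Metric] = field(default_factory=list)


def row_passed_all(row: Row, minimum: float = 3.0) -> bool:
    """Assumed rule: a row passes only if every grader scored >= minimum."""
    return all(
        m.error is None and m.value is not None and m.value >= minimum
        for m in row.metrics
    )


def mean_of_scored(rows: List[Row], name: str) -> float:
    """Average a metric over the rows where that grader returned a score."""
    scores = [
        m.value
        for r in rows
        for m in r.metrics
        if m.name == name and m.value is not None
    ]
    return sum(scores) / len(scores)


AUTH = "AuthenticationError: Principal does not have access to API/Operation."
rows = [
    # Each row loses a different grader to an auth error.
    Row([Metric("coherence", 5.0), Metric("similarity", error=AUTH)]),
    Row([Metric("coherence", error=AUTH), Metric("similarity", 4.0)]),
]

items_passed_all = sum(row_passed_all(r) for r in rows)  # 0 rows pass
```

Here `items_passed_all` is 0 even though the averages over the scored rows (5.0 for coherence, 4.0 for similarity) would clear a threshold of 3. That is the shape of result the note describes: every computable threshold is green, yet no row is complete.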

## 4. Seed `travel-agent` in the sandbox project

11 changes: 11 additions & 0 deletions plugins/agentops/skills/agentops-eval/SKILL.md
@@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already
assigned, or if a previous `agentops eval run` succeeded against the
same Foundry account.

**Propagation:** data-plane role assignments do not take effect
instantly — allow several minutes (occasionally up to ~15) before the
first eval. The cloud/local graders authenticate per call, so if the
user runs an eval immediately after this preflight and sees intermittent
`AuthenticationError` on a subset of graders plus
`Threshold status: FAILED` while the visible thresholds are green, that
is propagation lag (a grader **execution** failure), not a quality
regression. Tell the user to wait a few minutes and re-run
`agentops eval run`; do not treat it as a failing gate or start changing
thresholds.

## Step 1 - Analyze evaluation setup

Run the deterministic local triage first:
47 changes: 47 additions & 0 deletions src/agentops/cli/app.py
@@ -2055,10 +2055,57 @@ def _run_flat_schema_eval(
    if result.summary.overall_passed:
        typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}")
        return

    # Distinguish a genuine quality-gate failure from grader *execution*
    # errors. When evaluator workers error (auth/RBAC/timeout) on a subset of
    # rows, no row has every grader succeed, so `items_passed_all` is 0 and the
    # gate reports FAILED even though every threshold that *could* be computed
    # passed. Surfacing this prevents users from chasing a phantom quality
    # regression - the most common cause is data-plane RBAC granted moments
    # earlier that is still propagating to the evaluator workers.
    errored, total, first_error = _grader_error_summary(result)
    all_thresholds_passed = (
        result.summary.thresholds_total > 0
        and result.summary.thresholds_passed == result.summary.thresholds_total
    )
    if errored and all_thresholds_passed:
        typer.echo(
            f"{_cli_warn('Warning')}: {errored} of {total} grader execution(s) "
            "errored, so no dataset row had every grader return a score. This is "
            "a grader execution failure, not a quality regression - every "
            "threshold that could be computed passed. The most common cause is "
            "data-plane RBAC granted recently that is still propagating to the "
            "evaluator workers; wait a few minutes and re-run `agentops eval run`.",
            err=True,
        )
        if first_error:
            typer.echo(f"{_cli_warn('Warning')}: first grader error: {first_error}", err=True)

    typer.echo(f"{_cli_label('Threshold status')}: {style('FAILED', 'bold', 'red')}")
    raise typer.Exit(code=exit_code_from(result))


def _grader_error_summary(result) -> tuple[int, int, Optional[str]]:
    """Return ``(errored_metric_count, total_metric_count, first_error)``.

    Walks every per-row metric in the run so the CLI can tell a grader
    *execution* failure (auth/RBAC/timeout) apart from a quality-gate failure.
    The first non-empty error string is lifted out as the actionable cause.
    """
    errored = 0
    total = 0
    first_error: Optional[str] = None
    for row in result.rows:
        for metric in row.metrics:
            total += 1
            err = getattr(metric, "error", None)
            if isinstance(err, str) and err.strip():
                errored += 1
                if first_error is None:
                    first_error = err.strip()
    return errored, total, first_error


def _default_flat_output_dir(config_path: Path) -> Path:
    base = config_path.parent / ".agentops" / "results"
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
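
The helper and the warning condition in this hunk can be exercised in isolation. The sketch below restates `_grader_error_summary` as a standalone `grader_error_summary` and uses `types.SimpleNamespace` objects in place of the real `RunResult`. The names `should_warn` and `make_result` are hypothetical, introduced only for illustration; in the diff the condition is written inline rather than as a function.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def grader_error_summary(result) -> Tuple[int, int, Optional[str]]:
    """Standalone copy of the helper: (errored, total, first_error)."""
    errored = 0
    total = 0
    first_error: Optional[str] = None
    for row in result.rows:
        for metric in row.metrics:
            total += 1
            err = getattr(metric, "error", None)
            # Only non-blank strings count as errors.
            if isinstance(err, str) and err.strip():
                errored += 1
                if first_error is None:
                    first_error = err.strip()
    return errored, total, first_error


def should_warn(result) -> bool:
    """Hypothetical name for the inline check: errors plus all-green thresholds."""
    errored, _, _ = grader_error_summary(result)
    summary = result.summary
    return (
        errored > 0
        and summary.thresholds_total > 0
        and summary.thresholds_passed == summary.thresholds_total
    )


def make_result(errors, thresholds_total, thresholds_passed):
    """Build a one-row stand-in result; `errors` has one entry per grader."""
    metrics = [SimpleNamespace(error=e) for e in errors]
    summary = SimpleNamespace(
        thresholds_total=thresholds_total,
        thresholds_passed=thresholds_passed,
    )
    return SimpleNamespace(rows=[SimpleNamespace(metrics=metrics)], summary=summary)


lagging = make_result([None, " AuthenticationError "], 1, 1)
regression = make_result([None, "AuthenticationError"], 2, 1)
no_thresholds = make_result(["timeout"], 0, 0)
```

Only `lagging` satisfies the condition. A run with a genuinely failed threshold (`regression`) or with no thresholds at all (`no_thresholds`) stays silent, so the warning cannot mask a real quality failure.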
11 changes: 11 additions & 0 deletions src/agentops/templates/skills/agentops-eval/SKILL.md
@@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already
assigned, or if a previous `agentops eval run` succeeded against the
same Foundry account.

**Propagation:** data-plane role assignments do not take effect
instantly — allow several minutes (occasionally up to ~15) before the
first eval. The cloud/local graders authenticate per call, so if the
user runs an eval immediately after this preflight and sees intermittent
`AuthenticationError` on a subset of graders plus
`Threshold status: FAILED` while the visible thresholds are green, that
is propagation lag (a grader **execution** failure), not a quality
regression. Tell the user to wait a few minutes and re-run
`agentops eval run`; do not treat it as a failing gate or start changing
thresholds.

## Step 1 - Analyze evaluation setup

Run the deterministic local triage first:
150 changes: 150 additions & 0 deletions tests/unit/test_eval_run_grader_errors.py
@@ -0,0 +1,150 @@
"""CLI behaviour when graders *execute* but a subset errors out.

A grader execution error (auth/RBAC/timeout) is not a quality regression, but
because ``items_passed_all`` requires every grader on a row to succeed, a single
errored grader flips ``overall_passed`` to ``False`` and the run reports
``Threshold status: FAILED`` even though every computable threshold passed.

The CLI must surface that distinction loudly so users (the most common trigger
is data-plane RBAC that is still propagating) do not chase a phantom quality
failure or start lowering thresholds.
"""

from __future__ import annotations

import json
from pathlib import Path

from typer.testing import CliRunner

from agentops.cli.app import _grader_error_summary, app
from agentops.core.results import (
RowMetric,
RowResult,
RunResult,
RunSummary,
TargetInfo,
ThresholdEvaluation,
)

runner = CliRunner()

_AUTH_ERROR = (
"FAILED_EXECUTION: (UserError) OpenAI API hits AuthenticationError: "
"Principal does not have access to API/Operation."
)


def _result_with_partial_grader_errors() -> RunResult:
    """One row where coherence scored but similarity errored on auth."""
    row = RowResult(
        row_index=0,
        input="plan a trip",
        expected="an itinerary",
        response="here is an itinerary",
        metrics=[
            RowMetric(name="coherence", value=5.0),
            RowMetric(name="similarity", value=None, error=_AUTH_ERROR),
        ],
    )
    summary = RunSummary(
        items_total=1,
        items_passed_all=0,  # the errored grader means no row passed all
        items_pass_rate=0.0,
        thresholds_total=1,
        thresholds_passed=1,  # every computable threshold passed
        threshold_pass_rate=1.0,
        overall_passed=False,
    )
    return RunResult(
        started_at="2026-06-01T00:00:00+00:00",
        finished_at="2026-06-01T00:01:00+00:00",
        duration_seconds=60.0,
        target=TargetInfo(kind="foundry_prompt", raw="travel-agent:2"),
        dataset_path="dataset.jsonl",
        evaluators=["CoherenceEvaluator", "SimilarityEvaluator"],
        rows=[row],
        aggregate_metrics={"coherence": 5.0},
        thresholds=[
            ThresholdEvaluation(
                metric="coherence",
                criteria=">=",
                expected=">=3",
                actual="5",
                passed=True,
            )
        ],
        summary=summary,
    )


def test_grader_error_summary_counts_and_lifts_first_error() -> None:
    errored, total, first_error = _grader_error_summary(
        _result_with_partial_grader_errors()
    )
    assert (errored, total) == (1, 2)
    assert first_error is not None
    assert "AuthenticationError" in first_error


def _write_minimal_config(tmp_path: Path) -> Path:
    dataset = tmp_path / "dataset.jsonl"
    dataset.write_text(json.dumps({"input": "hi", "expected": "hi"}), encoding="utf-8")
    config = tmp_path / "agentops.yaml"
    config.write_text(
        json.dumps(
            {"version": 1, "agent": "model:gpt-4o", "dataset": str(dataset)}
        ),
        encoding="utf-8",
    )
    return config


def test_eval_run_warns_on_partial_grader_errors(tmp_path, monkeypatch) -> None:
    config = _write_minimal_config(tmp_path)
    output = tmp_path / "out"
    output.mkdir()

    crafted = _result_with_partial_grader_errors()
    import agentops.pipeline.orchestrator as orch

    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: crafted)

    result = runner.invoke(
        app,
        ["eval", "run", "--config", str(config), "--output", str(output)],
    )

    # A grader-execution failure keeps the gate-failed exit code...
    assert result.exit_code == 2, result.output
    # ...but the user is told it is an execution error, not a quality failure.
    assert "grader execution(s) errored" in result.output
    assert "propagating" in result.output
    assert "AuthenticationError" in result.output
    assert "FAILED" in result.output


def test_eval_run_no_warning_when_no_grader_errors(tmp_path, monkeypatch) -> None:
    config = _write_minimal_config(tmp_path)
    output = tmp_path / "out"
    output.mkdir()

    clean = _result_with_partial_grader_errors()
    # Drop the errored grader so the row is clean and the gate genuinely passes.
    clean.rows[0].metrics = [RowMetric(name="coherence", value=5.0)]
    clean.summary.items_passed_all = 1
    clean.summary.items_pass_rate = 1.0
    clean.summary.overall_passed = True

    import agentops.pipeline.orchestrator as orch

    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: clean)

    result = runner.invoke(
        app,
        ["eval", "run", "--config", str(config), "--output", str(output)],
    )

    assert result.exit_code == 0, result.output
    assert "PASSED" in result.output
    assert "grader execution(s) errored" not in result.output