Azure · placerda · Apr 13, 2026 · Apr 7, 2026
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -54,7 +54,9 @@ Only the following commands are in scope:
 
 - `agentops init`
 - `agentops eval run --config <run.yaml> [--output <dir>]`
+- `agentops eval compare --runs <ID1>,<ID2>[,ID3,...] [--output <dir>]`
 - `agentops report --in <results.json> [--out <report.md>]`
+- `agentops config cicd [--force] [--dir <path>]`
 
 Do not add new commands or flags unless explicitly discussed.
 
@@ -80,7 +82,7 @@ See `docs/how-it-works.md` for the full source-code map and architecture diagram
 - Keep CLI command handlers **thin** (`cli/app.py`) — only parse args and call `services/`
 - Place business logic in:
   - `core/` — config loading, Pydantic models, thresholds, report generation. **Must have zero Azure SDK imports and zero network calls.**
-  - `services/` — orchestration (runner), Foundry publishing, workspace init, report regen
+  - `services/` — orchestration (runner), comparison, CI/CD workflow generation, Foundry publishing, workspace init, report regen
   - `backends/` — execution backends (Foundry, subprocess). Each implements the `Backend` protocol from `base.py`.
 - Use `pathlib.Path` everywhere (no raw string paths)
 - No side effects at import time
@@ -130,6 +132,7 @@ The Foundry backend (`backends/foundry_backend.py`) is the largest and most comp
 - Auto-derive Azure OpenAI endpoint from the project endpoint via `_derive_openai_endpoint_from_project()` — users should not need to set `AZURE_OPENAI_ENDPOINT` manually.
 - Agent invocation supports both reference-based and threads-based API calls.
 - Evaluator names map from class names to builtins: `SimilarityEvaluator` → `builtin.similarity`.
+- Cloud evaluator routing uses frozensets: `_EVALUATORS_NEEDING_GROUND_TRUTH`, `_EVALUATORS_NEEDING_CONTEXT`, `_EVALUATORS_NEEDING_TOOL_CALLS`, `_EVALUATORS_NEEDING_TOOL_DEFS_ONLY`, `_EVALUATORS_NEEDING_OUTPUT_ITEMS`. NLP evaluators with required init params use `_NLP_DEFAULT_INIT_PARAMS`.
 
 ### Environment Variables
 
@@ -208,6 +211,10 @@ When cloud evaluation is used, a `cloud_evaluation.json` is also produced contai
   - Foundry backend helpers (`test_foundry_backend.py`)
   - Subprocess backend (`test_subprocess_backend.py`)
   - Initializer (`test_initializer.py`)
+  - CI/CD workflow generation (`test_cicd.py`)
+  - CLI command behavior (`test_cli_commands.py`)
+  - Eval comparison logic (`test_comparison.py`)
+  - OTLP telemetry instrumentation (`test_telemetry.py`)
 - Integration test for:
   - `agentops eval run` end-to-end using a fake subprocess backend (`test_eval_run_integration.py`)
 - Tests must assert correct **exit codes**
@@ -248,7 +255,7 @@ When generating or modifying code:
 - Azure SDK imports must be **lazy** (inside functions, not top-level)
 - Never hardcode Azure API versions — let the SDK handle versioning
 - Keep user-facing log output clean — no warning cascades or retry noise
-- When adding evaluator support, update both cloud (`_cloud_evaluator_data_mapping` + `_cloud_evaluator_needs_model`) and local paths
+- When adding evaluator support, add the builtin name to the correct frozenset in `foundry_backend.py` (`_EVALUATORS_NEEDING_GROUND_TRUTH`, `_EVALUATORS_NEEDING_CONTEXT`, `_EVALUATORS_NEEDING_TOOL_CALLS`, `_EVALUATORS_NEEDING_TOOL_DEFS_ONLY`, or `_EVALUATORS_NEEDING_OUTPUT_ITEMS`), update `_NLP_DEFAULT_INIT_PARAMS` if init params are required, and update both cloud (`_cloud_evaluator_data_mapping` + `_cloud_evaluator_needs_model`) and local paths
 - All new logic must have corresponding unit tests in `tests/unit/`
 - Always mock Azure SDK calls in tests — tests must run without credentials
 - The `core/` package must remain free of Azure imports and I/O

diff --git a/AGENTS.md b/AGENTS.md
@@ -17,17 +17,18 @@ Primary capabilities:
 
 Public CLI contract:
 - `agentops init`
-- `agentops eval run --config <run.yaml> [--output <dir>]`
-- `agentops eval compare --runs <baseline>,<current>`
-- `agentops report --in <results.json> [--out <report.md>]`
+- `agentops eval run --config <run.yaml> [--output <dir>] [--format md|html|all]`
+- `agentops eval compare --runs <ID1>,<ID2>[,ID3,...] [--output <dir>]`
+- `agentops report --in <results.json> [--out <report.md>] [--format md|html|all]`
+- `agentops config cicd [--force] [--dir <path>]`
 
 Planned CLI stubs (not implemented in this release):
 - `agentops run list|show`
 - `agentops run view <id> [--entry N]`
 - `agentops report show|export`
 - `agentops bundle list|show`
 - `agentops dataset validate|describe|import`
-- `agentops config validate|show|cicd`
+- `agentops config validate|show`
 - `agentops trace init`
 - `agentops monitor setup|dashboard|alert`
 - `agentops model list`
@@ -114,6 +115,8 @@ src/
     │   ├── runner.py                  # Main evaluation orchestration
     │   ├── initializer.py             # `.agentops/` workspace scaffolding
     │   ├── reporting.py               # `results.json` -> `report.md`
+    │   ├── comparison.py              # `agentops eval compare` logic
+    │   ├── cicd.py                    # CI/CD workflow generation
     │   └── foundry_evals.py           # Foundry evaluation publishing helpers
     │
     ├── backends/
@@ -129,10 +132,13 @@ src/
     └── templates/
         ├── config.yaml                # Seed workspace config
         ├── run.yaml                   # Seed run config
+        ├── run-agent.yaml             # Seed agent run config
+        ├── run-rag.yaml               # Seed RAG run config
         ├── .gitignore                 # Seed `.agentops/.gitignore`
         ├── bundles/                   # Starter bundle YAML files
         ├── datasets/                  # Starter dataset YAML configs
-        └── data/                      # Starter dataset JSONL rows
+        ├── data/                      # Starter dataset JSONL rows
+        └── workflows/                 # CI/CD workflow templates
 ```
 
 ### Tests
@@ -149,7 +155,11 @@ tests/
     ├── test_reporter.py               # Report generation and threshold output
     ├── test_foundry_backend.py        # Foundry backend helpers
     ├── test_subprocess_backend.py     # Subprocess backend behavior
-    └── test_initializer.py            # `.agentops/` scaffold behavior
+    ├── test_initializer.py            # `.agentops/` scaffold behavior
+    ├── test_cicd.py                   # CI/CD workflow generation
+    ├── test_cli_commands.py           # CLI command behavior
+    ├── test_comparison.py             # Eval comparison logic
+    └── test_telemetry.py              # OTLP telemetry instrumentation
 ```
 
 ### Documentation
@@ -242,6 +252,7 @@ Key sections:
 - `format.type`
 - `format.input_field`
 - `format.expected_field`
+- `format.context_field`
 
 Dataset rows live separately in `.agentops/data/*.jsonl`.
 
@@ -351,7 +362,9 @@ Common derived run metrics:
 ### Agent with Tools
 - Target: Foundry agent
 - Bundle: `agent_tools_baseline.yaml`
-- Current status: placeholder baseline ready for expansion
+- Evaluators: `TaskCompletionEvaluator`, `ToolCallAccuracyEvaluator`, `avg_latency_seconds`
+- Typical row fields: `input`, `expected`, `tool_definitions`
+- Primary evaluator pattern: task completion + tool accuracy + latency
 
 ---
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,20 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 ## [Unreleased]
 
 ### Added
+- Extend Foundry cloud evaluation to support 22 built-in evaluators (up from 8), covering quality, agent, safety, RAG, tool, and NLP evaluator categories. Verified end-to-end with live Foundry cloud evaluation.
+  - Quality: `CoherenceEvaluator`, `FluencyEvaluator`, `RelevanceEvaluator`
+  - Agent: `IntentResolutionEvaluator`, `TaskCompletionEvaluator`, `TaskAdherenceEvaluator`
+  - Similarity: `ResponseCompletenessEvaluator`
+  - RAG: `GroundednessProEvaluator`, `RetrievalEvaluator`
+  - Safety: `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`
+  - Tool: `ToolSelectionEvaluator`, `ToolInputAccuracyEvaluator`, `ToolOutputUtilizationEvaluator`, `ToolCallSuccessEvaluator`
+- Add dynamic `item_schema` building — automatically includes `tool_definitions` and `context` fields when the enabled evaluators require them.
+- Add CI/CD integration models documentation: PR quality gate, scheduled regression, post-deployment validation, multi-environment promotion, Azure DevOps pipeline.
+- Add gating best practices: threshold design, scenario-specific evaluator selection, comparison-based regression detection.
+- Add supported evaluators reference table to CI/CD documentation.
+- Improve error messages when evaluators return no score (e.g. safety evaluators in unsupported regions) — surface the service error and suggest `--verbose`.
+- Fix NLP evaluator names in frozensets to match `_to_builtin_evaluator_name` conversion (`bleu_score`, `rouge_score`, `gleu_score`, `meteor_score` instead of `bleu`, `rouge`, `gleu`, `meteor`).
+- Add default `initialization_parameters` for `RougeScoreEvaluator` (`rouge_type: rouge1`).
 - Add optional OTLP tracing for evaluation runs — set `AGENTOPS_OTLP_ENDPOINT` to emit OpenTelemetry spans.
   - Three-layer schema: CICD semconv (pipeline run/task), GenAI semconv (agent invocation), and `agentops.eval.*` (evaluator scores/thresholds).
   - Per-row item spans with evaluator child spans showing score, threshold, and pass/fail.
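
The changelog entry on dynamic `item_schema` building says that `tool_definitions` and `context` are included only when the enabled evaluators require them. A minimal sketch of that idea follows. The function `build_item_schema`, the set names and contents, and the JSON-Schema-like shape are all assumptions made for illustration; the real schema format used by the Foundry backend is not shown in this diff.

```python
from typing import Any

# Hypothetical routing sets; names and contents are placeholders.
NEEDS_CONTEXT = frozenset({"groundedness_pro", "retrieval"})
NEEDS_TOOL_DEFS = frozenset({"tool_selection", "tool_input_accuracy"})


def build_item_schema(enabled: set[str]) -> dict[str, Any]:
    """Build a schema whose optional fields depend on the enabled evaluators."""
    properties: dict[str, Any] = {
        "query": {"type": "string"},
        "response": {"type": "string"},
    }
    # Add optional fields only if at least one enabled evaluator needs them.
    if enabled & NEEDS_CONTEXT:
        properties["context"] = {"type": "string"}
    if enabled & NEEDS_TOOL_DEFS:
        properties["tool_definitions"] = {"type": "array"}
    return {
        "type": "object",
        "properties": properties,
        "required": ["query", "response"],
    }


print(sorted(build_item_schema({"coherence", "retrieval"})["properties"]))
```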

diff --git a/README.md b/README.md
@@ -156,7 +156,7 @@ Starter bundles created by `agentops init`:
 |---|---|---|
 | `model_direct_baseline` (default) | `SimilarityEvaluator` + `avg_latency_seconds` | Model-direct QA checks |
 | `rag_retrieval_baseline` | `GroundednessEvaluator` + `avg_latency_seconds` | RAG groundedness checks |
-| `agent_tools_baseline` | `SimilarityEvaluator` + `avg_latency_seconds` | Agent-with-tools baseline (placeholder) |
+| `agent_tools_baseline` | `TaskCompletionEvaluator` + `ToolCallAccuracyEvaluator` + `avg_latency_seconds` | Agent-with-tools baseline |
 
 `datasets/` stores YAML dataset definitions.
 `data/` stores JSONL rows referenced by dataset definitions.
@@ -168,7 +168,7 @@ Starter bundles created by `agentops init`:
 | Command | Description | Status |
 |---|---|---|
 | `agentops --version` | Show installed version | ✅ |
-| `agentops init [--path DIR]` | Scaffold project workspace and starter files | ✅ |
+| `agentops init [--dir DIR]` | Scaffold project workspace and starter files | ✅ |
 | `agentops eval run` | Evaluate a dataset against a bundle | ✅ |
 | `agentops eval compare --runs ID1,ID2` | Compare two past runs | ✅ |
 | `agentops run list\|show` | List or inspect past runs | 🚧 |
@@ -188,9 +188,10 @@ Implemented command usage:
 
 ```bash
 agentops --version
-agentops init [--path <dir>]
-agentops eval run [--config <path>] [--output <dir>]
-agentops report [--in <results.json>] [--out <report.md>]
+agentops init [--dir <dir>]
+agentops eval run [--config <path>] [--output <dir>] [--format md|html|all]
+agentops eval compare --runs ID1,ID2 [--output <dir>] [--format md|html|all]
+agentops report [--in <results.json>] [--out <report.md>] [--format md|html|all]
 agentops config cicd [--force] [--dir <path>]
 ```
 
@@ -237,13 +238,13 @@ Skills are distributed from this GitHub repository. Install them in VS Code:
 1. Open **VS Code** with **GitHub Copilot Chat** enabled.
 2. Use the Copilot skill install command and point to this repository:
    - Source: `Azure/agentops`
-   - Skills are located under `.github/plugins/agentops/skills/`
+     - Skills are located under `plugins/agentops/skills/`
 3. Once installed, Copilot will automatically use the skills when you ask about AgentOps evaluation, regressions, or observability.
 
 Alternatively, you can copy the skill files manually:
 ```bash
 # Copy skills to your user-level skills directory
-cp -r .github/plugins/agentops/skills/* ~/.agents/skills/
+cp -r plugins/agentops/skills/* ~/.agents/skills/
 ```
 
 ### For Repository Contributors
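
The `--runs <ID1>,<ID2>[,ID3,...]` syntax documented in this PR implies a comma-separated list with at least two run IDs. The sketch below shows one way such a value could be validated. `parse_run_ids` is a hypothetical helper and the error message is invented; neither is taken from `cli/app.py` or `services/comparison.py`.

```python
def parse_run_ids(raw: str) -> list[str]:
    """Split a comma-separated --runs value and require at least two IDs."""
    # Strip whitespace around each ID and drop empty segments such as "a,,b".
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    if len(ids) < 2:
        raise ValueError("--runs expects at least two comma-separated run IDs")
    return ids


print(parse_run_ids("run-001, run-002,run-003"))
```

Keeping this parsing in a small pure function would fit the guidance that CLI handlers stay thin and that logic is unit-tested without credentials.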