Merged
11 changes: 9 additions & 2 deletions .github/copilot-instructions.md
@@ -54,7 +54,9 @@ Only the following commands are in scope:

- `agentops init`
- `agentops eval run --config <run.yaml> [--output <dir>]`
- `agentops eval compare --runs <ID1>,<ID2>[,ID3,...] [--output <dir>]`
- `agentops report --in <results.json> [--out <report.md>]`
- `agentops config cicd [--force] [--dir <path>]`

Do not add new commands or flags unless explicitly discussed.

@@ -80,7 +82,7 @@ See `docs/how-it-works.md` for the full source-code map and architecture diagram
- Keep CLI command handlers **thin** (`cli/app.py`) — only parse args and call `services/`
- Place business logic in:
- `core/` — config loading, Pydantic models, thresholds, report generation. **Must have zero Azure SDK imports and zero network calls.**
- `services/` — orchestration (runner), Foundry publishing, workspace init, report regen
- `services/` — orchestration (runner), comparison, CI/CD workflow generation, Foundry publishing, workspace init, report regen
- `backends/` — execution backends (Foundry, subprocess). Each implements the `Backend` protocol from `base.py`.
- Use `pathlib.Path` everywhere (no raw string paths)
- No side effects at import time
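
As a rough illustration of the layering rules above, a thin handler might only parse arguments and delegate to a service function. This sketch uses the standard-library `argparse` purely to stay self-contained; the project's actual CLI framework is not shown in this diff, and the names `run_evaluation`, `main`, and the exit-code values are hypothetical.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def run_evaluation(config: Path, output: Optional[Path]) -> int:
    """Hypothetical stand-in for a function in `services/`; returns an exit code."""
    if not config.is_file():
        return 1  # assumed failure code for a missing config file
    # Real orchestration (backends, thresholds, reports) would live here, not in the CLI.
    return 0


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Thin handler: parse arguments as `pathlib.Path` values and call the service."""
    parser = argparse.ArgumentParser(prog="agentops-sketch")
    parser.add_argument("--config", type=Path, required=True)
    parser.add_argument("--output", type=Path, default=None)
    args = parser.parse_args(argv)
    return run_evaluation(args.config, args.output)
```

Nothing runs at import time here; a caller would invoke `main([...])` explicitly, which also makes exit codes easy to assert in tests.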
@@ -130,6 +132,7 @@ The Foundry backend (`backends/foundry_backend.py`) is the largest and most comp
- Auto-derive Azure OpenAI endpoint from the project endpoint via `_derive_openai_endpoint_from_project()` — users should not need to set `AZURE_OPENAI_ENDPOINT` manually.
- Agent invocation supports both reference-based and threads-based API calls.
- Evaluator names map from class names to builtins: `SimilarityEvaluator` → `builtin.similarity`.
- Cloud evaluator routing uses frozensets: `_EVALUATORS_NEEDING_GROUND_TRUTH`, `_EVALUATORS_NEEDING_CONTEXT`, `_EVALUATORS_NEEDING_TOOL_CALLS`, `_EVALUATORS_NEEDING_TOOL_DEFS_ONLY`, `_EVALUATORS_NEEDING_OUTPUT_ITEMS`. NLP evaluators with required init params use `_NLP_DEFAULT_INIT_PARAMS`.
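
The name mapping and frozenset routing described above could look roughly like the following. This is a minimal sketch, not the code in `foundry_backend.py`: the helper names `to_snake_name`, `to_builtin_name`, and `extra_inputs` are hypothetical, and the set memberships are invented for illustration.

```python
import re

# Hypothetical membership; the real frozensets in foundry_backend.py may differ.
EVALUATORS_NEEDING_GROUND_TRUTH = frozenset({"similarity", "bleu_score", "rouge_score"})
EVALUATORS_NEEDING_CONTEXT = frozenset({"groundedness"})
EVALUATORS_NEEDING_TOOL_CALLS = frozenset({"tool_call_accuracy"})


def to_snake_name(class_name: str) -> str:
    """Drop a trailing 'Evaluator' and convert CamelCase to snake_case."""
    suffix = "Evaluator"
    stem = class_name[: -len(suffix)] if class_name.endswith(suffix) else class_name
    # Insert an underscore before every capital letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", stem).lower()


def to_builtin_name(class_name: str) -> str:
    """Map a class name such as SimilarityEvaluator to builtin.similarity."""
    return "builtin." + to_snake_name(class_name)


def extra_inputs(class_name: str) -> set:
    """Return the extra data an evaluator needs, based on set membership."""
    name = to_snake_name(class_name)
    needs = set()
    if name in EVALUATORS_NEEDING_GROUND_TRUTH:
        needs.add("ground_truth")
    if name in EVALUATORS_NEEDING_CONTEXT:
        needs.add("context")
    if name in EVALUATORS_NEEDING_TOOL_CALLS:
        needs.add("tool_calls")
    return needs
```

Frozensets keep the routing declarative: supporting a new evaluator means adding one name to a set rather than another branch.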

### Environment Variables

@@ -208,6 +211,10 @@ When cloud evaluation is used, a `cloud_evaluation.json` is also produced contai
- Foundry backend helpers (`test_foundry_backend.py`)
- Subprocess backend (`test_subprocess_backend.py`)
- Initializer (`test_initializer.py`)
- CI/CD workflow generation (`test_cicd.py`)
- CLI command behavior (`test_cli_commands.py`)
- Eval comparison logic (`test_comparison.py`)
- OTLP telemetry instrumentation (`test_telemetry.py`)
- Integration test for:
- `agentops eval run` end-to-end using a fake subprocess backend (`test_eval_run_integration.py`)
- Tests must assert correct **exit codes**
@@ -248,7 +255,7 @@ When generating or modifying code:
- Azure SDK imports must be **lazy** (inside functions, not top-level)
- Never hardcode Azure API versions — let the SDK handle versioning
- Keep user-facing log output clean — no warning cascades or retry noise
- When adding evaluator support, update both cloud (`_cloud_evaluator_data_mapping` + `_cloud_evaluator_needs_model`) and local paths
- When adding evaluator support, add the builtin name to the correct frozenset in `foundry_backend.py` (`_EVALUATORS_NEEDING_GROUND_TRUTH`, `_EVALUATORS_NEEDING_CONTEXT`, `_EVALUATORS_NEEDING_TOOL_CALLS`, `_EVALUATORS_NEEDING_TOOL_DEFS_ONLY`, or `_EVALUATORS_NEEDING_OUTPUT_ITEMS`), update `_NLP_DEFAULT_INIT_PARAMS` if init params are required, and update both cloud (`_cloud_evaluator_data_mapping` + `_cloud_evaluator_needs_model`) and local paths
- All new logic must have corresponding unit tests in `tests/unit/`
- Always mock Azure SDK calls in tests — tests must run without credentials
- The `core/` package must remain free of Azure imports and I/O
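
The lazy-import and mocking rules above can be sketched together. `DefaultAzureCredential` is a real class in `azure.identity`; the function names `get_credential` and `check_get_credential_without_sdk` are hypothetical, and the check substitutes a fake module so it runs without the SDK or credentials.

```python
import sys
import types
from unittest import mock


def get_credential():
    """Build a credential, importing the Azure SDK only when actually called."""
    # Lazy import: module import stays free of Azure dependencies and side effects.
    from azure.identity import DefaultAzureCredential

    return DefaultAzureCredential()


def check_get_credential_without_sdk() -> None:
    """Exercise get_credential() against a fake `azure.identity` module."""
    fake_identity = types.ModuleType("azure.identity")
    fake_identity.DefaultAzureCredential = lambda: "fake-credential"
    fake_modules = {"azure": types.ModuleType("azure"), "azure.identity": fake_identity}
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, fake_modules):
        assert get_credential() == "fake-credential"
```

Because the import sits inside the function, patching `sys.modules` is enough to intercept it; a top-level import would already have been resolved before the test could replace it.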
27 changes: 20 additions & 7 deletions AGENTS.md
@@ -17,17 +17,18 @@ Primary capabilities:

Public CLI contract:
- `agentops init`
- `agentops eval run --config <run.yaml> [--output <dir>]`
- `agentops eval compare --runs <baseline>,<current>`
- `agentops report --in <results.json> [--out <report.md>]`
- `agentops eval run --config <run.yaml> [--output <dir>] [--format md|html|all]`
- `agentops eval compare --runs <ID1>,<ID2>[,ID3,...] [--output <dir>]`
- `agentops report --in <results.json> [--out <report.md>] [--format md|html|all]`
- `agentops config cicd [--force] [--dir <path>]`

Planned CLI stubs (not implemented in this release):
- `agentops run list|show`
- `agentops run view <id> [--entry N]`
- `agentops report show|export`
- `agentops bundle list|show`
- `agentops dataset validate|describe|import`
- `agentops config validate|show|cicd`
- `agentops config validate|show`
- `agentops trace init`
- `agentops monitor setup|dashboard|alert`
- `agentops model list`
@@ -114,6 +115,8 @@ src/
│ ├── runner.py # Main evaluation orchestration
│ ├── initializer.py # `.agentops/` workspace scaffolding
│ ├── reporting.py # `results.json` -> `report.md`
│ ├── comparison.py # `agentops eval compare` logic
│ ├── cicd.py # CI/CD workflow generation
│ └── foundry_evals.py # Foundry evaluation publishing helpers
├── backends/
@@ -129,10 +132,13 @@ src/
└── templates/
├── config.yaml # Seed workspace config
├── run.yaml # Seed run config
├── run-agent.yaml # Seed agent run config
├── run-rag.yaml # Seed RAG run config
├── .gitignore # Seed `.agentops/.gitignore`
├── bundles/ # Starter bundle YAML files
├── datasets/ # Starter dataset YAML configs
└── data/ # Starter dataset JSONL rows
├── data/ # Starter dataset JSONL rows
└── workflows/ # CI/CD workflow templates
```

### Tests
@@ -149,7 +155,11 @@ tests/
├── test_reporter.py # Report generation and threshold output
├── test_foundry_backend.py # Foundry backend helpers
├── test_subprocess_backend.py # Subprocess backend behavior
└── test_initializer.py # `.agentops/` scaffold behavior
├── test_initializer.py # `.agentops/` scaffold behavior
├── test_cicd.py # CI/CD workflow generation
├── test_cli_commands.py # CLI command behavior
├── test_comparison.py # Eval comparison logic
└── test_telemetry.py # OTLP telemetry instrumentation
```

### Documentation
@@ -242,6 +252,7 @@ Key sections:
- `format.type`
- `format.input_field`
- `format.expected_field`
- `format.context_field`

Dataset rows live separately in `.agentops/data/*.jsonl`.
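
To make the `format.*` keys concrete, here is one way a JSONL row might be normalized using them. This is a sketch under assumptions: the `DatasetFormat` class, its default values, the `parse_row` helper, and the treatment of `context_field` as optional are all hypothetical and not taken from the project's Pydantic models.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DatasetFormat:
    """Hypothetical mirror of the `format.*` keys in a dataset YAML definition."""

    type: str = "jsonl"
    input_field: str = "input"
    expected_field: str = "expected"
    context_field: Optional[str] = None  # assumed optional


def parse_row(line: str, fmt: DatasetFormat) -> Dict[str, Any]:
    """Map one JSONL line onto canonical keys using the configured field names."""
    raw = json.loads(line)
    row = {
        "input": raw[fmt.input_field],            # the input field is required
        "expected": raw.get(fmt.expected_field),  # None when the field is absent
    }
    if fmt.context_field is not None:
        row["context"] = raw.get(fmt.context_field)
    return row
```

For example, a dataset whose rows use `question`, `answer`, and `docs` would set those names in `format` and still produce rows keyed by `input`, `expected`, and `context`.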

@@ -351,7 +362,9 @@ Common derived run metrics:
### Agent with Tools
- Target: Foundry agent
- Bundle: `agent_tools_baseline.yaml`
- Current status: placeholder baseline ready for expansion
- Evaluators: `TaskCompletionEvaluator`, `ToolCallAccuracyEvaluator`, `avg_latency_seconds`
- Typical row fields: `input`, `expected`, `tool_definitions`
- Primary evaluator pattern: task completion + tool accuracy + latency

---

14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,20 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
## [Unreleased]

### Added
- Extend Foundry cloud evaluation to support 22 built-in evaluators (up from 8), covering quality, agent, safety, RAG, tool, and NLP evaluator categories. Verified end-to-end with live Foundry cloud evaluation.
- Quality: `CoherenceEvaluator`, `FluencyEvaluator`, `RelevanceEvaluator`
- Agent: `IntentResolutionEvaluator`, `TaskCompletionEvaluator`, `TaskAdherenceEvaluator`
- Similarity: `ResponseCompletenessEvaluator`
- RAG: `GroundednessProEvaluator`, `RetrievalEvaluator`
- Safety: `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`
- Tool: `ToolSelectionEvaluator`, `ToolInputAccuracyEvaluator`, `ToolOutputUtilizationEvaluator`, `ToolCallSuccessEvaluator`
- Add dynamic `item_schema` building — automatically includes `tool_definitions` and `context` fields when the enabled evaluators require them.
- Add CI/CD integration models documentation: PR quality gate, scheduled regression, post-deployment validation, multi-environment promotion, Azure DevOps pipeline.
- Add gating best practices: threshold design, scenario-specific evaluator selection, comparison-based regression detection.
- Add supported evaluators reference table to CI/CD documentation.
- Improve error messages when evaluators return no score (e.g. safety evaluators in unsupported regions) — surface the service error and suggest `--verbose`.
- Fix NLP evaluator names in frozensets to match `_to_builtin_evaluator_name` conversion (`bleu_score`, `rouge_score`, `gleu_score`, `meteor_score` instead of `bleu`, `rouge`, `gleu`, `meteor`).
- Add default `initialization_parameters` for `RougeScoreEvaluator` (`rouge_type: rouge1`).
- Add optional OTLP tracing for evaluation runs — set `AGENTOPS_OTLP_ENDPOINT` to emit OpenTelemetry spans.
- Three-layer schema: CICD semconv (pipeline run/task), GenAI semconv (agent invocation), and `agentops.eval.*` (evaluator scores/thresholds).
- Per-row item spans with evaluator child spans showing score, threshold, and pass/fail.
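
The dynamic `item_schema` entry in this changelog hunk could be realized along these lines. This is only a sketch: the function name `build_item_schema`, the base field names, the JSON-Schema-like shape, and the set memberships are assumptions rather than the project's implementation.

```python
from typing import Any, Dict, Iterable

# Hypothetical membership, keyed by builtin evaluator name.
NEEDS_CONTEXT = frozenset({"groundedness"})
NEEDS_TOOL_DEFS = frozenset({"tool_call_accuracy"})


def build_item_schema(enabled: Iterable[str]) -> Dict[str, Any]:
    """Build a schema that only includes fields the enabled evaluators require."""
    enabled_set = set(enabled)
    # Assumed base fields that every row carries.
    properties: Dict[str, Any] = {
        "query": {"type": "string"},
        "response": {"type": "string"},
    }
    if enabled_set & NEEDS_CONTEXT:
        properties["context"] = {"type": "string"}
    if enabled_set & NEEDS_TOOL_DEFS:
        properties["tool_definitions"] = {"type": "array"}
    return {"type": "object", "properties": properties}
```

Deriving the schema from the enabled evaluators keeps rows minimal: a similarity-only run never has to declare `context` or `tool_definitions`.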
15 changes: 8 additions & 7 deletions README.md
@@ -156,7 +156,7 @@ Starter bundles created by `agentops init`:
|---|---|---|
| `model_direct_baseline` (default) | `SimilarityEvaluator` + `avg_latency_seconds` | Model-direct QA checks |
| `rag_retrieval_baseline` | `GroundednessEvaluator` + `avg_latency_seconds` | RAG groundedness checks |
| `agent_tools_baseline` | `SimilarityEvaluator` + `avg_latency_seconds` | Agent-with-tools baseline (placeholder) |
| `agent_tools_baseline` | `TaskCompletionEvaluator` + `ToolCallAccuracyEvaluator` + `avg_latency_seconds` | Agent-with-tools baseline |

`datasets/` stores YAML dataset definitions.
`data/` stores JSONL rows referenced by dataset definitions.
@@ -168,7 +168,7 @@ Starter bundles created by `agentops init`:
| Command | Description | Status |
|---|---|---|
| `agentops --version` | Show installed version | ✅ |
| `agentops init [--path DIR]` | Scaffold project workspace and starter files | ✅ |
| `agentops init [--dir DIR]` | Scaffold project workspace and starter files | ✅ |
| `agentops eval run` | Evaluate a dataset against a bundle | ✅ |
| `agentops eval compare --runs ID1,ID2` | Compare two past runs | ✅ |
| `agentops run list\|show` | List or inspect past runs | 🚧 |
Expand All @@ -188,9 +188,10 @@ Implemented command usage:

```bash
agentops --version
agentops init [--path <dir>]
agentops eval run [--config <path>] [--output <dir>]
agentops report [--in <results.json>] [--out <report.md>]
agentops init [--dir <dir>]
agentops eval run [--config <path>] [--output <dir>] [--format md|html|all]
agentops eval compare --runs ID1,ID2 [--output <dir>] [--format md|html|all]
agentops report [--in <results.json>] [--out <report.md>] [--format md|html|all]
agentops config cicd [--force] [--dir <path>]
```

@@ -237,13 +238,13 @@ Skills are distributed from this GitHub repository. Install them in VS Code:
1. Open **VS Code** with **GitHub Copilot Chat** enabled.
2. Use the Copilot skill install command and point to this repository:
- Source: `Azure/agentops`
- Skills are located under `.github/plugins/agentops/skills/`
- Skills are located under `plugins/agentops/skills/`
3. Once installed, Copilot will automatically use the skills when you ask about AgentOps evaluation, regressions, or observability.

Alternatively, you can copy the skill files manually:
```bash
# Copy skills to your user-level skills directory
cp -r .github/plugins/agentops/skills/* ~/.agents/skills/
cp -r plugins/agentops/skills/* ~/.agents/skills/
```

### For Repository Contributors