Implement evaluation comparison features and update documentation by Dongbumlee · Pull Request #43 · Azure/agentops

Dongbumlee · 2026-03-24T17:48:51Z

This pull request introduces a new release management skill and improves the structure and discoverability of AgentOps workflow skills for Copilot, while also updating related documentation and dependency requirements. The most significant change is the addition of a comprehensive release management guide and skill, ensuring contributors follow consistent branching, versioning, and release processes. Additionally, an extension is added to automatically inject workflow skill context into Copilot sessions, and the skills documentation is refactored for clarity and maintainability.

AgentOps Workflow Skills & Extension

Added .github/extensions/agentops-skills/extension.mjs, an extension that detects relevant user prompts and injects context from AgentOps workflow skills (run-evals, investigate-regression, observability-triage) into Copilot sessions for improved guidance.
Refactored .github/skills/investigate-regression/SKILL.md and .github/skills/observability-triage/SKILL.md by removing detailed markdown files in favor of extension-driven context, consolidating operational guidance into the extension. [1] [2]

Release Management Skill

Added .github/skills/release-management/SKILL.md, providing a detailed guide for branching, versioning, changelog management, and PyPI publishing. This includes workflow diagrams, commit guidelines, and guardrails to prevent common mistakes.

Documentation Updates

Updated .github/copilot-instructions.md to clarify Azure SDK dependency versions and expand the scope of operational guidance to include release management workflows. Also, emphasized the CLI as the source of truth and reinforced workflow consistency. [1] [2]

These changes collectively improve contributor onboarding, operational clarity, and Copilot’s ability to provide actionable, context-aware workflow guidance.

…kills (#13)

- Implement agentops eval compare --runs for baseline comparison
  - Pydantic models: ComparisonResult, MetricDelta, ThresholdDelta, ItemDelta
  - Comparison service with run discovery (timestamps, latest, paths)
  - Comparison markdown report generator
  - Exit codes: 0=no regressions, 2=regressions, 1=error
  - Metric polarity: lower-is-better metrics (<=) correctly show improved
- Fix Foundry cloud evaluation to use Project Evals API
  - Use {project_endpoint}/openai/evals?api-version=2025-11-15-preview
  - Supports azure_ai_evaluator testing criteria (New Foundry Experience)
  - Replaces OpenAI SDK path that lacked azure_ai_evaluator support
- Add distributable Copilot skills under .github/plugins/agentops/skills/
  - agentops-run-evals, agentops-investigate-regression, agentops-observability-triage
  - GitHub-based distribution (Channel 1) matching azure-skills pattern
- Remove .github/skills/ internal folder (superseded by plugins)
- Align azure-ai-projects version to >=2.0.1 across all files
- Update README, AGENTS.md, how-it-works.md, CHANGELOG
- 87 unit tests passing
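The exit-code contract and metric polarity described in this commit can be sketched as below. This is an illustration only, not the AgentOps implementation: `is_regression`, `exit_code_for`, and the `lower_is_better` flag are hypothetical names, and the plain numeric comparison is a simplification (a later commit in this PR narrows regression detection to threshold flips). Only the exit codes 0, 2, and 1 and the rule that lower-is-better metrics improve when they decrease come from the commit message.

```python
from typing import Iterable, Tuple

# Exit codes as stated in the commit message.
EXIT_OK = 0          # no regressions
EXIT_ERROR = 1       # the comparison itself failed
EXIT_REGRESSION = 2  # at least one regression detected


def is_regression(baseline: float, current: float, lower_is_better: bool) -> bool:
    """Return True when `current` is worse than `baseline` for the metric's polarity."""
    if lower_is_better:
        return current > baseline  # e.g. latency went up
    return current < baseline      # e.g. a quality score went down


def exit_code_for(deltas: Iterable[Tuple[float, float, bool]]) -> int:
    """Map (baseline, current, lower_is_better) triples to a process exit code."""
    try:
        regressed = any(is_regression(b, c, low) for b, c, low in deltas)
    except (TypeError, ValueError):
        # Hypothetical error handling: malformed values count as an error run.
        return EXIT_ERROR
    return EXIT_REGRESSION if regressed else EXIT_OK
```

A CI job could pass the returned value to `sys.exit` so that a regression (2) is distinguishable from a tooling failure (1).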

- docs/tutorial-baseline-comparison.md: step-by-step comparison workflow, CI patterns, regression investigation guide
- docs/tutorial-copilot-skills.md: skill installation (GitHub, manual, project), usage examples, skill quality evaluation with AgentOps
- Update README docs section with new tutorial links

- tutorial-baseline-comparison.md: add model-direct vs agent evaluation target section with when-to-use, pros/cons, expected score differences, cross-target comparison guidance, detailed regression investigation patterns, and baseline management strategies
- tutorial-copilot-skills.md: add context on why skills matter, detailed usage examples showing before/after skill behavior, skill quality evaluation workflow using AgentOps itself

…ferences

- tutorial-model-direct.md: add when/why to use model-direct, how scores differ from agent, dataset writing guidance, transitioning to agent eval
- tutorial-basic-foundry-agent.md: add model-vs-agent decision guide, score expectations, named vs legacy agents, why both agent_id and model are needed, cross-scenario comparison guidance, evaluation scenarios table

…d updated Copilot skills

- Unified ComparisonResult model supporting 2+ runs
- HTML report format with modern light theme, visual indicators (dots, arrows, badges)
- --format md|html|all flag on eval run, eval compare, and report commands
- Comparison dimension detection (Model/Agent/Dataset Coverage/General)
- Conditions section showing fixed vs varying parameters
- Merged Evaluators table with dual evaluation (Met/Missed + direction)
- Row Details with per-row evaluator scores and Met/Missed
- Smart number formatting (integers without decimals)
- Met/Missed threshold terminology
- Status with pass rate (PASS 100% 5/5 / FAIL 80% 4/5)
- Regression detection based on threshold flips only (not numeric noise)
- Informational metrics (samples_evaluated) shown as plain values
- Foundry backend command strings enriched with target + model
- Updated all 3 Copilot skills with N-run workflows, HTML guide, model benchmarking
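Three of the reporting rules listed in this commit (integers without decimals, the pass-rate status string, and regressions defined as threshold flips) can be sketched as follows. All function names are hypothetical, and the sketch makes assumptions the commit message does not state: non-integers are shown with two decimals, PASS means every row met its threshold, and `total` is greater than zero.

```python
def format_number(value: float) -> str:
    """Render whole numbers without decimals; other values with two places (assumed)."""
    if float(value).is_integer():
        return str(int(value))
    return f"{value:.2f}"


def status_line(met: int, total: int) -> str:
    """Build a status such as 'PASS 100% 5/5' or 'FAIL 80% 4/5'.

    Assumes total > 0 and that PASS requires every row to be Met.
    """
    label = "PASS" if met == total else "FAIL"
    rate = format_number(100 * met / total)
    return f"{label} {rate}% {met}/{total}"


def is_threshold_flip_regression(baseline_met: bool, current_met: bool) -> bool:
    """Count a regression only when a threshold flips from Met to Missed.

    Numeric drift that stays on the same side of the threshold is ignored.
    """
    return baseline_met and not current_met
```

Under this rule a score that drops from 4.8 to 4.6 against a threshold of 4.0 is not reported as a regression, because both runs still meet the threshold.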

- Migrate versioning from static pyproject.toml to setuptools-scm (version derived automatically from git tags, no manual bumps)
- Split release workflow into 3 files with reusable build:
  - _build.yml: reusable build workflow (test + package)
  - staging.yml: release/* branch -> TestPyPI + verify
  - release.yml: v* tag -> TestPyPI + verify -> PyPI (approval) -> GitHub Release
- CLI smoke test: agentops --version, --help, init in temp directory
- Fix secret reference PIPY_TOKEN -> PYPI_TOKEN, add TEST_PYPI_TOKEN
- Two GitHub environments: staging (TestPyPI) and release (PyPI, approval gate)
- Add consistent workflow index header across all CI/CD files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
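The trigger split between the two publishing workflows can be pictured with a small routing function. This is a hypothetical sketch for readers, not code from the repository: `pipeline_for_ref` does not exist in AgentOps, the prefix check is looser than GitHub's `release/*` glob, and it covers only the two triggers named in this commit message.

```python
from typing import Optional


def pipeline_for_ref(ref: str) -> Optional[str]:
    """Return which publishing workflow a pushed git ref would trigger, if any."""
    if ref.startswith("refs/heads/release/"):
        return "staging.yml"   # release/* branch -> TestPyPI + verify
    if ref.startswith("refs/tags/v"):
        return "release.yml"   # v* tag -> TestPyPI -> PyPI (approval) -> GitHub Release
    return None                # other refs trigger neither publishing workflow
```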

Step-by-step guide covering staging (TestPyPI) and production (PyPI) release workflows, setuptools-scm versioning, environment setup, release checklist, and troubleshooting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: GitOps release pipeline with TestPyPI staging and setuptools-scm

fix: move mid-file import to top to resolve ruff E402 lint error

…arison feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills

…arison ci: remove duplicate test runs on release branches, add pre-commit hooks

…arison ci: add cut-release workflow, remove duplicate CI, pre-commit hooks, docs updates

ci: auto-publish dev builds to TestPyPI on develop push

The default output path was hardcoded to report.md regardless of the --format flag. When -f html was passed, the file was still named report.md and the returned path did not reflect the actual format.

- Default output filename now uses the correct suffix based on report_format (report.html when html, report.md otherwise)
- Returned output_report_path now tracks which file was actually written
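The fix described here amounts to deriving the default filename from the requested format. A minimal sketch, assuming a hypothetical helper named `default_report_path` (the actual function and its signature are not shown in this PR page):

```python
from pathlib import Path
from typing import Optional


def default_report_path(
    output_dir: Path,
    report_format: str,
    explicit: Optional[Path] = None,
) -> Path:
    """Pick the report path: an explicit path wins, otherwise derive it from the format."""
    if explicit is not None:
        return explicit
    # Per the commit message: report.html when html, report.md otherwise.
    suffix = ".html" if report_format == "html" else ".md"
    return output_dir / f"report{suffix}"
```

Returning the computed path, rather than a constant, is what lets the caller report which file was actually written.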

…arison fix: report command respects --format html parameter

ReportResult now carries an optional html_report_path so the CLI can display both output paths when report_format is 'all'.
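One way to picture this change is a result object with an optional second path. The sketch below uses a plain dataclass as a stand-in: the real ReportResult may be a Pydantic model with more fields, and `paths_to_display` is a hypothetical helper. Only the names ReportResult, output_report_path, and html_report_path come from the commit messages.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class ReportResult:
    """Stand-in for the real model; only the two path fields are taken from the PR."""
    output_report_path: Path
    html_report_path: Optional[Path] = None  # set when both formats are written


def paths_to_display(result: ReportResult) -> List[Path]:
    """Collect every report path the CLI should print."""
    paths = [result.output_report_path]
    if result.html_report_path is not None:
        paths.append(result.html_report_path)
    return paths
```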

…arison fix: -f all now generates both md and html reports

Relocate agentops skill plugin folder to the repo root for better discoverability. Also includes updates to CI docs, release process docs, reporter, and foundry backend tests.

…ill-baseline-comparison

# Conflicts:
#	docs/release-process.md

…arison refactor: move agent skill plugins from .github/plugins to plugins/

- SKILL.md: Remove incorrect pyproject.toml version editing instruction (project uses setuptools-scm with dynamic versioning)
- SKILL.md: Fix branch naming from release/x.y.z to release/vx.y.z to match cut-release.yml automation
- SKILL.md: Fix release trigger description (tag triggers production release, not merge to main)
- SKILL.md: Add Cut Release workflow as preferred one-click method
- release-process.md: Fix section numbering (9.x -> 10.x under section 10)
- _build.yml: Update header comment to include cut-release.yml (5th workflow)
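The corrected branch convention (release/vx.y.z, not release/x.y.z) could be checked with a small validator like the one below. This is a hypothetical sketch: the repository's cut-release.yml may enforce the pattern differently, and the regex assumes three purely numeric components with no pre-release suffix.

```python
import re

# Assumed shape: release/v<major>.<minor>.<patch>, numeric parts only.
RELEASE_BRANCH = re.compile(r"release/v\d+\.\d+\.\d+")


def is_valid_release_branch(name: str) -> bool:
    """Return True when the branch name follows the release/vx.y.z convention."""
    return RELEASE_BRANCH.fullmatch(name) is not None
```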

docs: align release-management skill and docs with actual workflows

PRs to main only come from release/* branches, which are already fully validated by the staging pipeline. Running CI again on the same code wastes CI minutes (28 duplicate jobs per PR). Removes 'main' from the pull_request.branches list in ci.yml.

ci: remove main from PR triggers to prevent duplicate CI runs

Dongbumlee and others added 30 commits March 19, 2026 12:18

evaluation

60e4c1e

Merge branch 'develop' into feature/gitops-release-pipeline

93cb885

Merge pull request #30 from Azure/feature/gitops-release-pipeline

5164836

feat: GitOps release pipeline with TestPyPI staging and setuptools-scm

fix: move mid-file import to top to resolve ruff E402 lint error

3a41a42

chore: add pre-commit with ruff lint and format hooks

da92b4f

ci: remove macOS from test matrix to avoid queue delays

8c07ac7

Merge pull request #31 from Azure/feature/gitops-release-pipeline

1327dee

fix: move mid-file import to top to resolve ruff E402 lint error

merge: resolve conflicts with develop branch

677f770

Merge pull request #32 from Azure/feature/copilot-skill-baseline-comp…

9603575

…arison feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills

ci: remove duplicate test runs on release branches

cf73554

Merge pull request #33 from Azure/feature/copilot-skill-baseline-comp…

b2df5ae

…arison ci: remove duplicate test runs on release branches, add pre-commit hooks

ci: add cut-release workflow and update release docs

4b59925

Merge pull request #34 from Azure/feature/copilot-skill-baseline-comp…

1e7584f

…arison ci: add cut-release workflow, remove duplicate CI, pre-commit hooks, docs updates

ci: auto-publish dev builds to TestPyPI on develop push

08df77b

Merge pull request #35 from Azure/feature/ci-dev-publish

d99aec2

ci: auto-publish dev builds to TestPyPI on develop push

Merge pull request #36 from Azure/feature/copilot-skill-baseline-comp…

8e2f428

…arison fix: report command respects --format html parameter

fix: -f all now generates both md and html reports

4c62cd3

ReportResult now carries an optional html_report_path so the CLI can display both output paths when report_format is 'all'.

Merge pull request #37 from Azure/feature/copilot-skill-baseline-comp…

fea1e64

…arison fix: -f all now generates both md and html reports

refactor: move agent skill plugins from .github/plugins to plugins/

6849ea7

Relocate agentops skill plugin folder to the repo root for better discoverability. Also includes updates to CI docs, release process docs, reporter, and foundry backend tests.

Merge remote-tracking branch 'origin/develop' into feature/copilot-sk…

ed578d2

…ill-baseline-comparison

# Conflicts:
#	docs/release-process.md

Merge pull request #40 from Azure/feature/copilot-skill-baseline-comp…

14836be

…arison refactor: move agent skill plugins from .github/plugins to plugins/

docs: update RELEASE.md to match actual workflow conventions

0bbb623

Dongbumlee and others added 5 commits March 24, 2026 08:57

Merge pull request #42 from Azure/fix/release-docs-alignment

5c26d66

docs: align release-management skill and docs with actual workflows

removed RELEASE.md

f04eae1

removed RELEASE.md

f61bb2e

Dongbumlee mentioned this pull request Mar 24, 2026

ci: remove main from PR triggers to prevent duplicate CI runs #44

Merged

Merge pull request #44 from Azure/fix/ci-duplicate-runs

afcaf7a

ci: remove main from PR triggers to prevent duplicate CI runs

Dongbumlee temporarily deployed to staging with GitHub Actions on March 24, 2026 at 18:11 (deployment now inactive)

placerda merged commit d8e2ec0 into main Mar 24, 2026
11 checks passed

placerda merged 36 commits into main from develop
