feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills by Dongbumlee · Pull Request #32 · Azure/agentops

Dongbumlee · 2026-03-23T21:42:28Z

Summary

Adds N-run baseline comparison capabilities, enhanced reporting, and distributable Copilot skills for AgentOps evaluation workflows.

Changes

N-run comparison: Implement agentops eval compare for comparing multiple evaluation runs
HTML reports: Enhanced report generation with HTML output support
Smart comparison conditions: Configurable comparison logic for evaluation metrics
Updated Copilot skills: Distributable skills for run-evals, investigate-regression, and observability-triage
Tutorial enhancements: Enriched model-direct and agent tutorials with deeper guidance and cross-references
Cloud eval fixes: Bug fixes for cloud evaluation flow
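
To make the "smart comparison conditions" item concrete, here is a minimal sketch of how parameters that stay fixed across runs could be separated from those that vary. It is not taken from the PR: the `split_conditions` helper, the run dictionaries, and the model names are all hypothetical.

```python
from typing import Any


def split_conditions(
    runs: list[dict[str, Any]],
) -> tuple[dict[str, Any], dict[str, list[Any]]]:
    """Separate parameters shared by every run from those that differ.

    Hypothetical helper for illustration; not the AgentOps implementation.
    """
    keys = sorted({key for run in runs for key in run})
    fixed: dict[str, Any] = {}
    varying: dict[str, list[Any]] = {}
    for key in keys:
        values = [run.get(key) for run in runs]
        if all(value == values[0] for value in values):
            fixed[key] = values[0]      # same value in every run
        else:
            varying[key] = values       # one value per run, in run order
    return fixed, varying


# Three made-up runs that differ only in the model under test.
runs = [
    {"model": "model-a", "dataset": "qa.jsonl", "target": "model"},
    {"model": "model-b", "dataset": "qa.jsonl", "target": "model"},
    {"model": "model-c", "dataset": "qa.jsonl", "target": "model"},
]
fixed, varying = split_conditions(runs)
```

When only `model` varies, as here, a report could plausibly label the comparison dimension as Model; how AgentOps actually detects the dimension is not shown in this PR description.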

Commits

d563243 feat: implement eval compare, cloud eval fix, distributable Copilot skills
0f66f64 docs: add baseline comparison and Copilot skills tutorials
5bb5079 docs: enrich tutorials with model-vs-agent guidance and practical depth
457f43d docs: enrich model-direct and agent tutorials with depth and cross-references
5db66f7 feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills

…kills (#13)
- Implement agentops eval compare --runs for baseline comparison
- Pydantic models: ComparisonResult, MetricDelta, ThresholdDelta, ItemDelta
- Comparison service with run discovery (timestamps, latest, paths)
- Comparison markdown report generator
- Exit codes: 0=no regressions, 2=regressions, 1=error
- Metric polarity: lower-is-better metrics (<=) correctly show improved
- Fix Foundry cloud evaluation to use Project Evals API
- Use {project_endpoint}/openai/evals?api-version=2025-11-15-preview
- Supports azure_ai_evaluator testing criteria (New Foundry Experience)
- Replaces OpenAI SDK path that lacked azure_ai_evaluator support
- Add distributable Copilot skills under .github/plugins/agentops/skills/
- agentops-run-evals, agentops-investigate-regression, agentops-observability-triage
- GitHub-based distribution (Channel 1) matching azure-skills pattern
- Remove .github/skills/ internal folder (superseded by plugins)
- Align azure-ai-projects version to >=2.0.1 across all files
- Update README, AGENTS.md, how-it-works.md, CHANGELOG
- 87 unit tests passing
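
The exit-code and metric-polarity rules above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `MetricChange` is a hypothetical stand-in for the real `MetricDelta` model (whose fields are not shown here), and the metric names in the usage lines are invented.

```python
from dataclasses import dataclass


@dataclass
class MetricChange:
    """Hypothetical stand-in; not the PR's MetricDelta model."""
    name: str
    baseline: float
    current: float
    operator: str  # ">=" means higher is better, "<=" means lower is better


def direction(change: MetricChange) -> str:
    """Classify a change while respecting the metric's polarity."""
    delta = change.current - change.baseline
    if delta == 0:
        return "unchanged"
    higher_is_better = change.operator == ">="
    improved = delta > 0 if higher_is_better else delta < 0
    return "improved" if improved else "regressed"


def exit_code(has_regressions: bool, had_error: bool = False) -> int:
    """Map an outcome to the documented codes: 0 clean, 2 regressions, 1 error."""
    if had_error:
        return 1
    return 2 if has_regressions else 0


# A lower-is-better metric that went down counts as an improvement.
latency = direction(MetricChange("latency_seconds", 2.0, 1.5, "<="))
quality = direction(MetricChange("groundedness", 4.0, 3.5, ">="))
```

Keeping 2 distinct from 1 lets a CI job tell "the comparison ran and found regressions" apart from "the comparison itself failed".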

- docs/tutorial-baseline-comparison.md: step-by-step comparison workflow, CI patterns, regression investigation guide
- docs/tutorial-copilot-skills.md: skill installation (GitHub, manual, project), usage examples, skill quality evaluation with AgentOps
- Update README docs section with new tutorial links

- tutorial-baseline-comparison.md: add model-direct vs agent evaluation target section with when-to-use, pros/cons, expected score differences, cross-target comparison guidance, detailed regression investigation patterns, and baseline management strategies
- tutorial-copilot-skills.md: add context on why skills matter, detailed usage examples showing before/after skill behavior, skill quality evaluation workflow using AgentOps itself

…ferences
- tutorial-model-direct.md: add when/why to use model-direct, how scores differ from agent, dataset writing guidance, transitioning to agent eval
- tutorial-basic-foundry-agent.md: add model-vs-agent decision guide, score expectations, named vs legacy agents, why both agent_id and model are needed, cross-scenario comparison guidance, evaluation scenarios table

…d updated Copilot skills
- Unified ComparisonResult model supporting 2+ runs
- HTML report format with modern light theme, visual indicators (dots, arrows, badges)
- --format md|html|all flag on eval run, eval compare, and report commands
- Comparison dimension detection (Model/Agent/Dataset Coverage/General)
- Conditions section showing fixed vs varying parameters
- Merged Evaluators table with dual evaluation (Met/Missed + direction)
- Row Details with per-row evaluator scores and Met/Missed
- Smart number formatting (integers without decimals)
- Met/Missed threshold terminology
- Status with pass rate (PASS 100% 5/5 / FAIL 80% 4/5)
- Regression detection based on threshold flips only (not numeric noise)
- Informational metrics (samples_evaluated) shown as plain values
- Foundry backend command strings enriched with target + model
- Updated all 3 Copilot skills with N-run workflows, HTML guide, model benchmarking
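
A rough sketch of three of the reporting rules listed above: threshold-flip regression detection, integer-friendly number formatting, and the status line. The function names, the two-decimal rounding, and the rule that PASS requires every item to be Met are assumptions made for illustration. They are consistent with the examples in the commit message but are not confirmed by it.

```python
def meets(value: float, operator: str, threshold: float) -> bool:
    """Return True when the value is Met under a '>=' or '<=' threshold."""
    return value >= threshold if operator == ">=" else value <= threshold


def is_regression(
    baseline: float, current: float, operator: str, threshold: float
) -> bool:
    """Flag only a Met -> Missed flip, ignoring movement on one side of the threshold."""
    return meets(baseline, operator, threshold) and not meets(
        current, operator, threshold
    )


def format_number(value: float) -> str:
    """Render whole numbers without decimals; two decimals otherwise (assumed)."""
    if float(value).is_integer():
        return str(int(value))
    return f"{value:.2f}"


def status_line(met: int, total: int) -> str:
    """Build a status string such as 'PASS 100% 5/5'; assumes total > 0."""
    rate = format_number(100 * met / total)
    label = "PASS" if met == total else "FAIL"
    return f"{label} {rate}% {met}/{total}"


# A drop from 4.2 to 3.9 stays above a >= 3.0 threshold, so it is not flagged.
noise = is_regression(4.2, 3.9, ">=", 3.0)
# A drop from 3.1 to 2.9 crosses the same threshold, so it is flagged.
flip = is_regression(3.1, 2.9, ">=", 3.0)
```

Judging regressions by flips rather than raw deltas means small run-to-run score variation does not fail a comparison on its own.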

Dongbumlee added 6 commits March 19, 2026 12:18

merge: resolve conflicts with develop branch

677f770

Dongbumlee merged commit 9603575 into develop Mar 23, 2026
8 checks passed

Dongbumlee merged 6 commits into develop from feature/copilot-skill-baseline-comparison
