feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills #32

Merged
Dongbumlee merged 6 commits into develop from feature/copilot-skill-baseline-comparison on Mar 23, 2026

Conversation

@Dongbumlee
Collaborator

Summary

Adds N-run baseline comparison capabilities, enhanced reporting, and distributable Copilot skills for AgentOps evaluation workflows.

Changes

  • N-run comparison: Implement agentops eval compare for comparing multiple evaluation runs
  • HTML reports: Enhanced report generation with HTML output support
  • Smart comparison conditions: Configurable comparison logic for evaluation metrics
  • Updated Copilot skills: Distributable skills for run-evals, investigate-regression, and observability-triage
  • Tutorial enhancements: Enriched model-direct and agent tutorials with deeper guidance and cross-references
  • Cloud eval fixes: Bug fixes for cloud evaluation flow

Commits

  • d563243 feat: implement eval compare, cloud eval fix, distributable Copilot skills
  • 0f66f64 docs: add baseline comparison and Copilot skills tutorials
  • 5bb5079 docs: enrich tutorials with model-vs-agent guidance and practical depth
  • 457f43d docs: enrich model-direct and agent tutorials with depth and cross-references
  • 5db66f7 feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills

…kills (#13)

- Implement agentops eval compare --runs for baseline comparison
  - Pydantic models: ComparisonResult, MetricDelta, ThresholdDelta, ItemDelta
  - Comparison service with run discovery (timestamps, latest, paths)
  - Comparison markdown report generator
  - Exit codes: 0=no regressions, 2=regressions, 1=error
  - Metric polarity: lower-is-better metrics (<=) are now correctly reported as improved when they decrease
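
The exit-code contract and metric polarity listed above can be illustrated with a minimal sketch. This is not the AgentOps implementation: `classify_delta` and `exit_code_for` are hypothetical names, and the sketch assumes that a `<=` (or `<`) threshold operator marks a lower-is-better metric while any other operator means higher is better.

```python
# Exit codes as listed in the commit message.
EXIT_OK = 0          # no regressions
EXIT_ERROR = 1       # error while comparing
EXIT_REGRESSION = 2  # at least one regression


def classify_delta(baseline: float, current: float, operator: str) -> str:
    """Label a metric change as improved, regressed, or unchanged.

    Assumption: "<=" or "<" means lower is better; anything else
    is treated as higher-is-better.
    """
    if current == baseline:
        return "unchanged"
    lower_is_better = operator in ("<=", "<")
    went_down = current < baseline
    # A decrease is good only for lower-is-better metrics, and vice versa.
    return "improved" if went_down == lower_is_better else "regressed"


def exit_code_for(regression_count: int) -> int:
    """Map the number of detected regressions to a process exit code."""
    return EXIT_REGRESSION if regression_count > 0 else EXIT_OK
```

Under these assumptions, a latency-style metric that drops from 2.0 to 1.5 with a `<=` threshold is labeled improved, and a CI job can branch on exit code 2 to tell regressions apart from tool errors (exit code 1).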

- Fix Foundry cloud evaluation to use Project Evals API
  - Use {project_endpoint}/openai/evals?api-version=2025-11-15-preview
  - Supports azure_ai_evaluator testing criteria (New Foundry Experience)
  - Replaces OpenAI SDK path that lacked azure_ai_evaluator support
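
As a rough illustration of the URL shape named above, the following sketch assembles the Project Evals endpoint from a project endpoint. The helper name `evals_url` and the example endpoint are hypothetical; only the path and the `api-version` value come from the commit message.

```python
API_VERSION = "2025-11-15-preview"  # value quoted in the commit message


def evals_url(project_endpoint: str, api_version: str = API_VERSION) -> str:
    """Build {project_endpoint}/openai/evals?api-version=... (hypothetical helper)."""
    base = project_endpoint.rstrip("/")  # tolerate a trailing slash
    return f"{base}/openai/evals?api-version={api_version}"


# Placeholder endpoint for illustration only.
print(evals_url("https://example.invalid/api/projects/demo/"))
```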

- Add distributable Copilot skills under .github/plugins/agentops/skills/
  - agentops-run-evals, agentops-investigate-regression, agentops-observability-triage
  - GitHub-based distribution (Channel 1) matching azure-skills pattern
  - Remove .github/skills/ internal folder (superseded by plugins)

- Align azure-ai-projects version to >=2.0.1 across all files
- Update README, AGENTS.md, how-it-works.md, CHANGELOG
- 87 unit tests passing
- docs/tutorial-baseline-comparison.md: step-by-step comparison workflow,
  CI patterns, regression investigation guide
- docs/tutorial-copilot-skills.md: skill installation (GitHub, manual, project),
  usage examples, skill quality evaluation with AgentOps
- Update README docs section with new tutorial links
- tutorial-baseline-comparison.md: add model-direct vs agent evaluation
  target section with when-to-use, pros/cons, expected score differences,
  cross-target comparison guidance, detailed regression investigation
  patterns, and baseline management strategies
- tutorial-copilot-skills.md: add context on why skills matter, detailed
  usage examples showing before/after skill behavior, skill quality
  evaluation workflow using AgentOps itself
…ferences

- tutorial-model-direct.md: add when/why to use model-direct, how scores
  differ from agent, dataset writing guidance, transitioning to agent eval
- tutorial-basic-foundry-agent.md: add model-vs-agent decision guide,
  score expectations, named vs legacy agents, why both agent_id and model
  are needed, cross-scenario comparison guidance, evaluation scenarios table
…d updated Copilot skills

- Unified ComparisonResult model supporting 2+ runs
- HTML report format with modern light theme, visual indicators (dots, arrows, badges)
- --format md|html|all flag on eval run, eval compare, and report commands
- Comparison dimension detection (Model/Agent/Dataset Coverage/General)
- Conditions section showing fixed vs varying parameters
- Merged Evaluators table with dual evaluation (Met/Missed + direction)
- Row Details with per-row evaluator scores and Met/Missed
- Smart number formatting (integers without decimals)
- Met/Missed threshold terminology
- Status with pass rate (PASS 100% 5/5 / FAIL 80% 4/5)
- Regression detection based on threshold flips only (not numeric noise)
- Informational metrics (samples_evaluated) shown as plain values
- Foundry backend command strings enriched with target + model
- Updated all 3 Copilot skills with N-run workflows, HTML guide, model benchmarking
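
Several of the rules above (Met/Missed thresholds, regression detection on threshold flips only, smart number formatting, and the pass-rate status) can be sketched together. This is an illustrative sketch, not the code in this PR: every function name is hypothetical, the operator set is assumed, rounding non-integers to two places is an assumption, and PASS is assumed to mean that every row passed.

```python
import operator as op

# Comparison operators a threshold may use (assumed set).
_OPS = {">=": op.ge, "<=": op.le, ">": op.gt, "<": op.lt, "==": op.eq}


def met(value: float, operator: str, threshold: float) -> bool:
    """Return True when the value satisfies its threshold (Met), else False (Missed)."""
    return _OPS[operator](value, threshold)


def is_regression(baseline: float, current: float, operator: str, threshold: float) -> bool:
    """Flag a regression only when a threshold flips from Met to Missed."""
    return met(baseline, operator, threshold) and not met(current, operator, threshold)


def format_number(value: float) -> str:
    """Show whole numbers without decimals; round everything else to 2 places."""
    if float(value).is_integer():
        return str(int(value))
    return f"{value:.2f}"


def status_line(passed: int, total: int) -> str:
    """Build a status such as 'PASS 100% 5/5' (assumes PASS means every row passed)."""
    if total <= 0:
        raise ValueError("total must be positive")
    verdict = "PASS" if passed == total else "FAIL"
    rate = format_number(100 * passed / total)
    return f"{verdict} {rate}% {passed}/{total}"
```

With a `>=` 0.8 threshold, a drop from 0.85 to 0.81 is not a regression in this sketch because both runs still meet the threshold, whereas a drop to 0.75 flips the metric from Met to Missed and is flagged.
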
@Dongbumlee merged commit 9603575 into develop on Mar 23, 2026
8 checks passed