feat: extend Foundry cloud evaluator coverage to 22 built-in evaluators (#51) by Dongbumlee · Pull Request #57 · Azure/agentops

Dongbumlee · 2026-04-07T17:04:23Z

Summary

Extends AgentOps Foundry cloud evaluation to support 22 built-in evaluators (up from 8), covering all evaluator categories: quality, agent, safety, RAG, tool, and NLP. Adds CI/CD integration documentation with integration models, gating best practices, and evaluator reference.

Closes #51

Changes

Code (`foundry_backend.py`, `runner.py`)

Expanded evaluator frozensets: response_completeness, groundedness_pro, retrieval, tool_selection added to existing sets
New frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy, tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS (task_adherence — uses {{sample.output_items}} instead of {{sample.output_text}})
Fixed NLP evaluator names: bleu_score, rouge_score, gleu_score, meteor_score to match _to_builtin_evaluator_name conversion
Added default init params: RougeScoreEvaluator requires rouge_type — defaults to rouge1
Dynamic item_schema: Automatically includes tool_definitions and context fields when evaluators require them
Refactored _default_foundry_input_mapping to frozenset-based routing (covers all evaluators, not just 4 hardcoded names)
Improved error handling: Logs evaluator errors when score: null (e.g., safety evaluators in unsupported regions), improved runner error message with --verbose hint
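The frozenset-based routing and dynamic item_schema described above can be sketched roughly as follows. This is a minimal illustration, not the code in `foundry_backend.py`: the function names `default_input_mapping` and `build_item_schema`, the `_EVALUATORS_NEEDING_CONTEXT` set, the `{{item.*}}` templates, and the schema shape are all assumptions. Only the two frozenset names, their members, and the `{{sample.output_items}}` / `{{sample.output_text}}` templates come from this PR description.

```python
from typing import Dict, FrozenSet, Iterable

# Names and members as listed in the PR description.
_EVALUATORS_NEEDING_TOOL_DEFS_ONLY: FrozenSet[str] = frozenset(
    {"tool_input_accuracy", "tool_output_utilization", "tool_call_success"}
)
_EVALUATORS_NEEDING_OUTPUT_ITEMS: FrozenSet[str] = frozenset({"task_adherence"})

# Hypothetical set for illustration; the real grouping may differ.
_EVALUATORS_NEEDING_CONTEXT: FrozenSet[str] = frozenset(
    {"groundedness", "groundedness_pro"}
)


def default_input_mapping(name: str) -> Dict[str, str]:
    """Pick a data mapping by set membership instead of hardcoded names."""
    # "{{item.query}}" is an assumed template, not taken from the PR.
    mapping = {"query": "{{item.query}}"}
    if name in _EVALUATORS_NEEDING_OUTPUT_ITEMS:
        mapping["response"] = "{{sample.output_items}}"
    else:
        mapping["response"] = "{{sample.output_text}}"
    if name in _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        mapping["tool_definitions"] = "{{item.tool_definitions}}"
    if name in _EVALUATORS_NEEDING_CONTEXT:
        mapping["context"] = "{{item.context}}"
    return mapping


def build_item_schema(evaluators: Iterable[str]) -> dict:
    """Add optional fields only when a selected evaluator needs them."""
    names = set(evaluators)
    properties = {"query": {"type": "string"}}
    if names & _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        properties["tool_definitions"] = {"type": "array"}
    if names & _EVALUATORS_NEEDING_CONTEXT:
        properties["context"] = {"type": "string"}
    return {"type": "object", "properties": properties, "required": ["query"]}
```

Routing by set membership means a newly supported evaluator only needs to be added to the right set, rather than requiring another branch on its name.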

Documentation (`ci-github-actions.md`)

CI/CD Integration Models: PR quality gate, scheduled regression, post-deployment validation, multi-environment promotion, Azure DevOps pipeline
Gating Best Practices: Threshold design, scenario-specific evaluator selection, comparison-based regression detection
Supported Evaluators Reference: Complete table of 22 evaluators by category with inputs and requirements
Troubleshooting: Safety evaluator region requirements, missing scores diagnosis
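As a rough illustration of threshold gating combined with comparison-based regression detection, a CI step could apply a check like the one below. This is a sketch under stated assumptions, not the AgentOps implementation: the `gate` function, its parameters, and the score dictionaries are hypothetical. A missing score is treated as a failure, in line with the troubleshooting note about evaluators that return `score: null`.

```python
from typing import Dict, List, Optional


def gate(
    scores: Dict[str, Optional[float]],
    thresholds: Dict[str, float],
    baseline: Optional[Dict[str, float]] = None,
    max_drop: float = 0.0,
) -> List[str]:
    """Return human-readable gate failures; an empty list means pass."""
    failures: List[str] = []

    # Absolute thresholds: every gated evaluator must report a score
    # at or above its minimum.
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            failures.append(f"{name}: no score (evaluator may have errored)")
        elif score < minimum:
            failures.append(f"{name}: {score} is below threshold {minimum}")

    # Regression check: compare against a previous run, allowing a
    # small tolerated drop.
    for name, previous in (baseline or {}).items():
        current = scores.get(name)
        if current is not None and previous - current > max_drop:
            failures.append(f"{name}: dropped from {previous} to {current}")

    return failures


if __name__ == "__main__":
    problems = gate(
        scores={"coherence": 4.2, "violence": None},
        thresholds={"coherence": 4.0, "violence": 0.0},
    )
    for problem in problems:
        print(problem)
    # A CI job would typically exit non-zero when problems is non-empty.
```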

Tests

~20 new unit tests for all evaluator data_mapping patterns
All 96 tests pass

Live Verification

All 22 evaluators verified end-to-end against live Foundry cloud evaluation (East US 2):

| Category | Evaluators | Result |
| --- | --- | --- |
| Quality | Coherence, Fluency, Relevance | ✅ |
| Agent | IntentResolution, TaskCompletion, TaskAdherence | ✅ |
| Similarity | Similarity, ResponseCompleteness | ✅ |
| RAG | Groundedness | ✅ |
| Safety | Violence, Sexual, SelfHarm, HateUnfairness | ✅ |
| Tool | ToolCallAccuracy, ToolSelection, ToolInputAccuracy, ToolOutputUtilization | ✅ |
| NLP | F1Score, BleuScore, GleuScore, RougeScore, MeteorScore | ✅ |
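The table uses class-style names, while the code changes refer to snake_case names such as bleu_score and rouge_score. One plausible shape for that conversion, together with the rouge_type default, is sketched below. The function `to_builtin_name` is a hypothetical stand-in for `_to_builtin_evaluator_name`, whose actual logic is not shown in this PR. The structure of `NLP_DEFAULT_INIT_PARAMS` is likewise assumed; only its name and the rouge1 default are mentioned here.

```python
import re

_SUFFIX = "Evaluator"

# Assumed structure: builtin evaluator name -> default init parameters.
NLP_DEFAULT_INIT_PARAMS = {"rouge_score": {"rouge_type": "rouge1"}}


def to_builtin_name(class_name: str) -> str:
    """Convert a class-style name to snake_case, e.g. for bleu_score."""
    base = class_name[: -len(_SUFFIX)] if class_name.endswith(_SUFFIX) else class_name
    # Insert an underscore before every uppercase letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", base).lower()


def init_params_for(class_name: str, overrides=None) -> dict:
    """Merge user-supplied parameters over the assumed defaults."""
    params = dict(NLP_DEFAULT_INIT_PARAMS.get(to_builtin_name(class_name), {}))
    params.update(overrides or {})
    return params
```

With this sketch, `init_params_for("RougeScoreEvaluator")` falls back to `rouge1`, while an explicit `rouge_type` override takes precedence.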

Note

docs/analysis-issue-51-*.md are internal research/analysis documents created during the issue investigation. They should be removed before release — they are included in this PR for team review only.

…rs (#51)

- Expand evaluator frozensets: add response_completeness, groundedness_pro, retrieval, tool_selection to existing sets
- Add new frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy, tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS (task_adherence)
- Fix NLP evaluator names (bleu_score, rouge_score, etc.) to match _to_builtin_evaluator_name conversion
- Add default initialization_parameters for RougeScoreEvaluator (rouge_type)
- Build item_schema dynamically: include tool_definitions and context_field when evaluators need them
- Refactor _default_foundry_input_mapping to frozenset-based routing
- Improve error handling: log evaluator errors when score is null, improve runner error message with --verbose hint
- Add CI/CD integration models documentation: PR gate, scheduled, post-deploy, multi-env promotion, Azure DevOps pipeline
- Add gating best practices: threshold design, evaluator selection by scenario
- Add supported evaluators reference table (22 evaluators by category)
- Add ~20 unit tests for all new evaluator data_mapping patterns
- All 22 evaluators verified end-to-end with live Foundry cloud evaluation

Closes #51

- Fix skill paths: plugins/agentops/skills/ (not .github/plugins/) across README, tutorial-copilot-skills (6 instances)
- Fix CLI contract: add eval compare and config cicd as implemented commands in AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix source tree listings: add cicd.py, comparison.py, telemetry.py, workflows/ across AGENTS.md, how-it-works.md
- Fix test listings: add test_cicd, test_cli_commands, test_comparison, test_telemetry across AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix agent_tools_baseline: TaskCompletionEvaluator + ToolCallAccuracyEvaluator (not SimilarityEvaluator placeholder) in README, AGENTS.md, how-it-works.md
- Fix JSONL path: data/<name>.jsonl (not datasets/) in ci-github-actions.md
- Fix init flag: --dir (not --path) in README
- Fix evaluator guidance: add frozenset names and NLP_DEFAULT_INIT_PARAMS to copilot-instructions.md
- Add context_field to dataset format docs in AGENTS.md
- Add rouge_type default note to evaluator reference doc
- Update planned command message to list all 5 available commands
- Add --format flag to CLI usage examples

Dongbumlee added 2 commits April 7, 2026 10:03

merge: resolve CHANGELOG conflict with develop (OTLP tracing)

500966d

Dongbumlee mentioned this pull request Apr 7, 2026

Improve OTLP telemetry span names with dataset context #58

Closed

Dongbumlee requested a review from placerda April 7, 2026 18:50

placerda merged commit ce9b628 into develop Apr 13, 2026
11 checks passed

placerda merged 3 commits into develop from feature/issue-51-extend-evaluators
