
feat: extend Foundry cloud evaluator coverage to 22 built-in evaluators (#51) #57

Merged: placerda merged 3 commits into develop from feature/issue-51-extend-evaluators on Apr 13, 2026

@Dongbumlee
Collaborator

Summary

Extends AgentOps Foundry cloud evaluation to support 22 built-in evaluators (up from 8), covering all evaluator categories: quality, agent, safety, RAG, tool, and NLP. Adds CI/CD integration documentation covering integration models, gating best practices, and an evaluator reference.

Closes #51

Changes

Code (foundry_backend.py, runner.py)

  • Expanded evaluator frozensets: response_completeness, groundedness_pro, retrieval, tool_selection added to existing sets
  • New frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy, tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS (task_adherence — uses {{sample.output_items}} instead of {{sample.output_text}})
  • Fixed NLP evaluator names (bleu_score, rouge_score, gleu_score, meteor_score) so they match the output of the _to_builtin_evaluator_name conversion
  • Added default init params: RougeScoreEvaluator requires rouge_type — defaults to rouge1
  • Dynamic item_schema: Automatically includes tool_definitions and context fields when evaluators require them
  • Refactored _default_foundry_input_mapping to frozenset-based routing (covers all evaluators, not just 4 hardcoded names)
  • Improved error handling: Logs evaluator errors when an evaluator returns score: null (e.g., safety evaluators in unsupported regions), and the runner error message now includes a --verbose hint
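
The frozenset-based routing and dynamic item_schema described above can be pictured roughly as follows. This is a minimal sketch, not the code in foundry_backend.py: the two frozensets named in this PR are reproduced as described, while `_HYPOTHETICAL_NEEDS_CONTEXT`, the function names, the `{{item.*}}` templates, and the schema shape are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable

# Sets named in this PR description.
_EVALUATORS_NEEDING_TOOL_DEFS_ONLY: FrozenSet[str] = frozenset(
    {"tool_input_accuracy", "tool_output_utilization", "tool_call_success"}
)
_EVALUATORS_NEEDING_OUTPUT_ITEMS: FrozenSet[str] = frozenset({"task_adherence"})

# Hypothetical set: the real name and membership may differ.
_HYPOTHETICAL_NEEDS_CONTEXT: FrozenSet[str] = frozenset({"groundedness"})

# Default init params; the PR states rouge_type defaults to rouge1.
NLP_DEFAULT_INIT_PARAMS: Dict[str, Dict[str, str]] = {
    "rouge_score": {"rouge_type": "rouge1"},
}


def default_input_mapping(name: str) -> Dict[str, str]:
    """Route an evaluator name to a data mapping via set membership."""
    if name in _EVALUATORS_NEEDING_OUTPUT_ITEMS:
        response = "{{sample.output_items}}"
    else:
        response = "{{sample.output_text}}"
    # The {{item.*}} field names below are assumed for illustration.
    mapping = {"query": "{{item.query}}", "response": response}
    if name in _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        mapping["tool_definitions"] = "{{item.tool_definitions}}"
    if name in _HYPOTHETICAL_NEEDS_CONTEXT:
        mapping["context"] = "{{item.context}}"
    return mapping


def build_item_schema(evaluators: Iterable[str]) -> dict:
    """Add optional fields only when a selected evaluator requires them."""
    names = set(evaluators)
    properties = {"query": {"type": "string"}}
    if names & _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        properties["tool_definitions"] = {"type": "array"}
    if names & _HYPOTHETICAL_NEEDS_CONTEXT:
        properties["context"] = {"type": "string"}
    return {"type": "object", "properties": properties, "required": ["query"]}


def init_params_for(name: str) -> Dict[str, str]:
    """Return a copy of the default init params, or an empty dict."""
    return dict(NLP_DEFAULT_INIT_PARAMS.get(name, {}))
```

Routing by set membership means a newly supported evaluator only needs to be added to the right set, rather than requiring another hardcoded branch.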

Documentation (ci-github-actions.md)

  • CI/CD Integration Models: PR quality gate, scheduled regression, post-deployment validation, multi-environment promotion, Azure DevOps pipeline
  • Gating Best Practices: Threshold design, scenario-specific evaluator selection, comparison-based regression detection
  • Supported Evaluators Reference: Complete table of 22 evaluators by category with inputs and requirements
  • Troubleshooting: Safety evaluator region requirements, missing scores diagnosis
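
As a rough illustration of the gating ideas listed above, the sketch below checks scores against minimum thresholds and flags regressions against a baseline run. It is not taken from ci-github-actions.md or the AgentOps CLI: the function names, the score dictionaries, and the assumption that higher scores are better are all hypothetical.

```python
from typing import Dict, List, Optional


def gate_failures(
    scores: Dict[str, Optional[float]], thresholds: Dict[str, float]
) -> List[str]:
    """Return one message per evaluator that misses its minimum threshold.

    A missing or null score is treated as a failure so that a silent
    evaluator error cannot pass the gate. Assumes higher is better.
    """
    failures = []
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            failures.append(f"{name}: no score returned")
        elif score < minimum:
            failures.append(f"{name}: {score:.2f} < {minimum:.2f}")
    return failures


def regressions(
    current: Dict[str, Optional[float]],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return evaluators whose score dropped by more than `tolerance`."""
    return sorted(
        name
        for name, base in baseline.items()
        if current.get(name) is not None and base - current[name] > tolerance
    )


if __name__ == "__main__":
    # Illustrative numbers only, not results from this PR.
    run = {"coherence": 4.2, "fluency": 3.1, "groundedness": None}
    limits = {"coherence": 4.0, "fluency": 3.5, "groundedness": 3.0}
    problems = gate_failures(run, limits)
    for line in problems:
        print(line)
    raise SystemExit(1 if problems else 0)
```

A non-zero exit code is what lets a CI job fail the pull request; treating a null score as a failure is a design choice that keeps region-related evaluator errors from passing unnoticed.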

Tests

  • ~20 new unit tests for all evaluator data_mapping patterns
  • All 96 tests pass

Live Verification

All 22 evaluators verified end-to-end against live Foundry cloud evaluation (East US 2):

| Category | Evaluators | Result |
| --- | --- | --- |
| Quality | Coherence, Fluency, Relevance | |
| Agent | IntentResolution, TaskCompletion, TaskAdherence | |
| Similarity | Similarity, ResponseCompleteness | |
| RAG | Groundedness | |
| Safety | Violence, Sexual, SelfHarm, HateUnfairness | |
| Tool | ToolCallAccuracy, ToolSelection, ToolInputAccuracy, ToolOutputUtilization | |
| NLP | F1Score, BleuScore, GleuScore, RougeScore, MeteorScore | |
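
The evaluators listed here use class-style names, while the code changes refer to snake_case names such as bleu_score and rouge_score. The sketch below shows one way such a conversion could work. It is an assumption about the behavior of _to_builtin_evaluator_name, not its actual implementation, and `to_snake_case_name` is a hypothetical helper.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case_name(class_name: str) -> str:
    """Convert e.g. 'RougeScoreEvaluator' to 'rouge_score' (Python 3.9+)."""
    base = class_name.removesuffix("Evaluator")
    return _BOUNDARY.sub("_", base).lower()


for example in ("BleuScoreEvaluator", "F1Score", "HateUnfairness"):
    print(example, "->", to_snake_case_name(example))
```

Under this assumed rule, an NLP evaluator must be registered as bleu_score rather than bleu for lookups to succeed, which is consistent with the naming fix described in this PR.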

Note

docs/analysis-issue-51-*.md are internal research/analysis documents created during the issue investigation. They should be removed before release — they are included in this PR for team review only.

…rs (#51)

- Expand evaluator frozensets: add response_completeness, groundedness_pro,
  retrieval, tool_selection to existing sets
- Add new frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy,
  tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS
  (task_adherence)
- Fix NLP evaluator names (bleu_score, rouge_score, etc.) to match
  _to_builtin_evaluator_name conversion
- Add default initialization_parameters for RougeScoreEvaluator (rouge_type)
- Build item_schema dynamically: include tool_definitions and context_field when
  evaluators need them
- Refactor _default_foundry_input_mapping to frozenset-based routing
- Improve error handling: log evaluator errors when score is null, improve
  runner error message with --verbose hint
- Add CI/CD integration models documentation: PR gate, scheduled, post-deploy,
  multi-env promotion, Azure DevOps pipeline
- Add gating best practices: threshold design, evaluator selection by scenario
- Add supported evaluators reference table (22 evaluators by category)
- Add ~20 unit tests for all new evaluator data_mapping patterns
- All 22 evaluators verified end-to-end with live Foundry cloud evaluation

Closes #51
- Fix skill paths: plugins/agentops/skills/ (not .github/plugins/)
  across README, tutorial-copilot-skills (6 instances)
- Fix CLI contract: add eval compare and config cicd as implemented
  commands in AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix source tree listings: add cicd.py, comparison.py, telemetry.py,
  workflows/ across AGENTS.md, how-it-works.md
- Fix test listings: add test_cicd, test_cli_commands, test_comparison,
  test_telemetry across AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix agent_tools_baseline: TaskCompletionEvaluator + ToolCallAccuracyEvaluator
  (not SimilarityEvaluator placeholder) in README, AGENTS.md, how-it-works.md
- Fix JSONL path: data/<name>.jsonl (not datasets/) in ci-github-actions.md
- Fix init flag: --dir (not --path) in README
- Fix evaluator guidance: add frozenset names and NLP_DEFAULT_INIT_PARAMS
  to copilot-instructions.md
- Add context_field to dataset format docs in AGENTS.md
- Add rouge_type default note to evaluator reference doc
- Update planned command message to list all 5 available commands
- Add --format flag to CLI usage examples
@Dongbumlee Dongbumlee requested a review from placerda April 7, 2026 18:50
@placerda placerda merged commit ce9b628 into develop Apr 13, 2026
11 checks passed