feat: extend Foundry cloud evaluator coverage to 22 built-in evaluators (#51) #57
Merged
…rs (#51)
- Expand evaluator frozensets: add response_completeness, groundedness_pro, retrieval, tool_selection to existing sets
- Add new frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy, tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS (task_adherence)
- Fix NLP evaluator names (bleu_score, rouge_score, etc.) to match _to_builtin_evaluator_name conversion
- Add default initialization_parameters for RougeScoreEvaluator (rouge_type)
- Build item_schema dynamically: include tool_definitions and context_field when evaluators need them
- Refactor _default_foundry_input_mapping to frozenset-based routing
- Improve error handling: log evaluator errors when score is null, improve runner error message with --verbose hint
- Add CI/CD integration models documentation: PR gate, scheduled, post-deploy, multi-env promotion, Azure DevOps pipeline
- Add gating best practices: threshold design, evaluator selection by scenario
- Add supported evaluators reference table (22 evaluators by category)
- Add ~20 unit tests for all new evaluator data_mapping patterns
- All 22 evaluators verified end-to-end with live Foundry cloud evaluation

Closes #51
- Fix skill paths: plugins/agentops/skills/ (not .github/plugins/) across README, tutorial-copilot-skills (6 instances)
- Fix CLI contract: add eval compare and config cicd as implemented commands in AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix source tree listings: add cicd.py, comparison.py, telemetry.py, workflows/ across AGENTS.md, how-it-works.md
- Fix test listings: add test_cicd, test_cli_commands, test_comparison, test_telemetry across AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix agent_tools_baseline: TaskCompletionEvaluator + ToolCallAccuracyEvaluator (not SimilarityEvaluator placeholder) in README, AGENTS.md, how-it-works.md
- Fix JSONL path: data/<name>.jsonl (not datasets/) in ci-github-actions.md
- Fix init flag: --dir (not --path) in README
- Fix evaluator guidance: add frozenset names and NLP_DEFAULT_INIT_PARAMS to copilot-instructions.md
- Add context_field to dataset format docs in AGENTS.md
- Add rouge_type default note to evaluator reference doc
- Update planned command message to list all 5 available commands
- Add --format flag to CLI usage examples
Summary
Extends AgentOps Foundry cloud evaluation to support 22 built-in evaluators (up from 8), covering all evaluator categories: quality, agent, safety, RAG, tool, and NLP. Adds CI/CD integration documentation with integration models, gating best practices, and evaluator reference.
Closes #51
Changes
Code (foundry_backend.py, runner.py)
- Expanded evaluator frozensets: response_completeness, groundedness_pro, retrieval, tool_selection added to existing sets
- New frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy, tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS (task_adherence, which uses {{sample.output_items}} instead of {{sample.output_text}})
- Fixed NLP evaluator names: bleu_score, rouge_score, gleu_score, meteor_score, to match the _to_builtin_evaluator_name conversion
- Default initialization_parameters: RougeScoreEvaluator requires rouge_type, which defaults to rouge1
- Dynamic item_schema: includes tool_definitions and context fields when evaluators require them
- Refactored _default_foundry_input_mapping to frozenset-based routing (covers all evaluators, not just 4 hardcoded names)
- Improved error handling: evaluator errors are logged when score: null (e.g., safety evaluators in unsupported regions); improved runner error message with --verbose hint
Documentation (ci-github-actions.md)
Tests
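As a rough illustration of what a data_mapping check against the frozenset-based routing might look like, here is a minimal sketch. The two set names and their members, and the {{sample.output_items}} / {{sample.output_text}} templates, come from this PR. Everything else is an assumption made for illustration: the function name sketch_input_mapping, the {{item.query}} and {{item.tool_definitions}} templates, the mapping keys, and the rule that only the tool-defs set receives tool_definitions. It is not the actual code in foundry_backend.py.

```python
from __future__ import annotations

# Set names and members are taken from the PR description; the rest of this
# module is a hypothetical sketch, not the real foundry_backend.py.
_EVALUATORS_NEEDING_TOOL_DEFS_ONLY = frozenset(
    {"tool_input_accuracy", "tool_output_utilization", "tool_call_success"}
)
_EVALUATORS_NEEDING_OUTPUT_ITEMS = frozenset({"task_adherence"})


def sketch_input_mapping(evaluator_name: str) -> dict[str, str]:
    """Return an illustrative data_mapping for one built-in evaluator name."""
    # Assumed template for the query field; the real key names may differ.
    mapping = {"query": "{{item.query}}"}

    # Per the PR, task_adherence reads structured output items instead of text.
    if evaluator_name in _EVALUATORS_NEEDING_OUTPUT_ITEMS:
        mapping["response"] = "{{sample.output_items}}"
    else:
        mapping["response"] = "{{sample.output_text}}"

    # Assumed: evaluators in the tool-defs set also get tool definitions.
    if evaluator_name in _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        mapping["tool_definitions"] = "{{item.tool_definitions}}"

    return mapping


if __name__ == "__main__":
    for name in ("task_adherence", "tool_call_success", "coherence"):
        print(name, sketch_input_mapping(name))
```

Routing by set membership means a new evaluator is supported by adding its name to the right frozenset, rather than by adding another hardcoded branch.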
Live Verification
All 22 evaluators verified end-to-end against live Foundry cloud evaluation (East US 2):
Note