Normalize parameter types in TaskNavigationEfficiency comparison #46227
Merged
Port fix from Azure/azureml-assets#4901.

Adds `_normalize_param_value` static method for consistent string comparison of parameter values (int, float, bool, dict, list) between agent and ground truth. Updates `_extract_tool_names_and_params_from_response` to preserve original value types instead of premature `str()` conversion.

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/4888d4d1-bd21-46b6-a733-231b3ffefddd
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Copilot created this pull request from a session on behalf of m7md7sien on April 9, 2026 17:21
…ine-length (#46232)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1c88d810-2e80-47a9-ad09-40adb6529219
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
m7md7sien approved these changes on Apr 9, 2026
m7md7sien approved these changes on May 13, 2026
YoYoJa approved these changes on May 15, 2026
Pull request overview
This PR ports a fix to reduce false negatives when comparing tool-call parameters against ground truth in TaskNavigationEfficiencyEvaluator by normalizing parameter values before comparison.
Changes:
- Added `_normalize_param_value` to stringify/JSON-serialize parameter values for consistent comparisons.
- Updated `_prepare_steps_for_comparison` to normalize both agent and ground truth parameters before matching.
- Updated `_extract_tool_names_and_params_from_response` to preserve raw argument values (avoid premature `str()` conversion) and added unit tests for type normalization scenarios.
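The normalization and step preparation listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `normalize_param_value` and `prepare_steps` are hypothetical stand-ins for `_normalize_param_value` and `_prepare_steps_for_comparison`, the `(tool_name, params)` step shape and the `search`/`filters` names are assumptions, and the `except` fallback to `str()` is a guess at how unserializable values might be handled.

```python
import json
from typing import Any, Dict, List, Tuple

Step = Tuple[str, Dict[str, Any]]  # assumed shape: (tool name, parameters)


def normalize_param_value(value: Any) -> str:
    """Hypothetical stand-in for the PR's _normalize_param_value."""
    if isinstance(value, str):
        return value
    if isinstance(value, (dict, list)):
        try:
            # sort_keys=True makes dict output independent of key order.
            return json.dumps(value, sort_keys=True)
        except (TypeError, ValueError):
            # Assumed fallback for values that json cannot serialize.
            return str(value)
    # int, float, bool and anything else fall back to str().
    return str(value)


def prepare_steps(steps: List[Step]) -> List[Tuple[str, Tuple[Tuple[str, str], ...]]]:
    """Normalize every parameter value so both sides compare as strings."""
    prepared = []
    for name, params in steps:
        normalized = tuple(sorted((k, normalize_param_value(v)) for k, v in params.items()))
        prepared.append((name, normalized))
    return prepared


agent_steps = [("search", {"count": 1, "filters": {"b": 2, "a": 1}})]
ground_truth = [("search", {"count": "1", "filters": '{"a": 1, "b": 2}'})]

# Both sides are normalized before matching, so the types no longer matter.
steps_match = prepare_steps(agent_steps) == prepare_steps(ground_truth)
print(steps_match)
```

Normalizing at comparison time, rather than when arguments are extracted, keeps the raw values available to any other consumer of the extracted tool calls.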
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py | Adds parameter normalization and applies it during step preparation for matching. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_eval.py | Preserves raw tool argument values so normalization can happen at comparison time. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_task_navigation_efficiency_evaluators.py | Adds unit tests to validate parameter type normalization behavior. |
Comment on lines +163 to +166

            return value
        if isinstance(value, (dict, list)):
            try:
                return json.dumps(value, sort_keys=True)
| ), | ||
| ) | ||
| assert result["task_navigation_efficiency_result"] == "pass" | ||
|
|
ninghu pushed a commit to ninghu/azure-sdk-for-python that referenced this pull request on May 22, 2026

…re#46227)

* Normalize parameter types in TaskNavigationEfficiency comparison

  Port fix from Azure/azureml-assets#4901. Adds `_normalize_param_value` static method for consistent string comparison of parameter values (int, float, bool, dict, list) between agent and ground truth. Updates `_extract_tool_names_and_params_from_response` to preserve original value types instead of premature `str()` conversion.

  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/4888d4d1-bd21-46b6-a733-231b3ffefddd
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Fix black formatting: collapse expressions that fit within 120 char line-length (Azure#46232)

  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1c88d810-2e80-47a9-ad09-40adb6529219
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Fix black issue

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: mohessie <mohessie@microsoft.com>
Description
Port of fix from Azure/azureml-assets#4901 to this repository.

Problem
When comparing agent tool call parameters against ground truth in `TaskNavigationEfficiencyEvaluator`, parameter type mismatches caused false comparison failures. For example, an agent returning `{"count": 1}` (int) would not match a ground truth of `{"count": "1"}` (str), even though they represent the same value. Similarly, dict/list values were converted to non-canonical string representations that didn't match JSON-formatted strings.

Fix
- Added `_normalize_param_value` static method to `_TaskNavigationEfficiencyEvaluator`: normalizes parameter values to strings using `json.dumps` (with `sort_keys=True`) for dicts/lists, and `str()` for other types (int, float, bool).
- Updated `_prepare_steps_for_comparison`: applies normalization on both agent and ground truth parameter values before comparison.
- Updated `_extract_tool_names_and_params_from_response` in the base evaluator: removed premature `str(v)` conversion so raw values are preserved and normalization happens at comparison time instead.

Files Changed
- sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py
- sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_eval.py
- sdk/evaluation/azure-ai-evaluation/tests/unittests/test_task_navigation_efficiency_evaluators.py (13 new test cases)

Testing
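The two failure modes described under Problem can be reproduced with a few plain assertions. This is a minimal sketch that uses bare dicts rather than the evaluator or the unit tests added in this PR. The `filters` parameter is a hypothetical example, and only `count` comes from the description above.

```python
import json

# Agent arguments with native types; ground truth stored as strings.
agent_args = {"count": 1, "filters": {"b": 2, "a": 1}}
ground_truth = {"count": "1", "filters": '{"a": 1, "b": 2}'}

# Failure mode 1: raw comparison treats int 1 and str "1" as different.
raw_count_matches = agent_args["count"] == ground_truth["count"]

# Failure mode 2: str() on a dict yields a Python repr (single quotes,
# insertion order), which is not the JSON text in the ground truth.
premature = str(agent_args["filters"])
premature_matches = premature == ground_truth["filters"]

# Normalizing at comparison time: str() for scalars, json.dumps with
# sort_keys=True for dicts and lists.
count_matches = str(agent_args["count"]) == ground_truth["count"]
canonical = json.dumps(agent_args["filters"], sort_keys=True)
canonical_matches = canonical == ground_truth["filters"]

print(raw_count_matches, premature_matches, count_matches, canonical_matches)
```

Note that `json.dumps` with default separators only matches ground truth strings that use the same spacing, so a compactly serialized JSON string would still compare as different in this sketch.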