Normalize parameter types in TaskNavigationEfficiency comparison #46227

Merged
m7md7sien merged 10 commits into main from copilot/duplicate-fix-from-azureml-assets
May 15, 2026

Copilot AI commented Apr 9, 2026

Description

Port of fix from Azure/azureml-assets#4901 to this repository.

Problem

When comparing agent tool call parameters against ground truth in TaskNavigationEfficiencyEvaluator, parameter type mismatches caused false comparison failures. For example, an agent returning {"count": 1} (int) would not match a ground truth of {"count": "1"} (str), even though they represent the same value. Similarly, dict/list values were converted to non-canonical string representations that didn't match JSON-formatted strings.
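The mismatch can be reproduced with plain Python, independent of the evaluator. The snippet below is only an illustration: the parameter names `count` and `filters` are made up for the example and are not taken from the SDK.

```python
import json

# An int from the agent never equals the string form in the ground truth.
agent_params = {"count": 1}
truth_params = {"count": "1"}
int_vs_str_equal = agent_params == truth_params

# str() on a dict yields a Python repr (single quotes, insertion order),
# which is not the JSON text a ground-truth file would typically contain.
filters = {"b": 2, "a": 1}
python_repr = str(filters)
json_text = json.dumps(filters, sort_keys=True)
repr_matches_json = python_repr == json_text

print(int_vs_str_equal, python_repr, json_text, repr_matches_json)
```

Both comparisons come out unequal, which is the false failure the description refers to.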

Fix

  1. Added _normalize_param_value static method to _TaskNavigationEfficiencyEvaluator — normalizes parameter values to strings using json.dumps (with sort_keys=True) for dicts/lists, and str() for other types (int, float, bool).
  2. Updated _prepare_steps_for_comparison — applies normalization on both agent and ground truth parameter values before comparison.
  3. Updated _extract_tool_names_and_params_from_response in the base evaluator — removed premature str(v) conversion so raw values are preserved and normalization happens at comparison time instead.
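A minimal sketch of the normalization described in the list above, written as standalone functions rather than the actual SDK methods. The names `normalize_param_value` and `prepare_step`, the pass-through for values that are already strings, and the `str()` fallback when serialization fails are assumptions for illustration; the real `_normalize_param_value` and `_prepare_steps_for_comparison` may differ in detail.

```python
import json
from typing import Any, Dict, FrozenSet, Tuple


def normalize_param_value(value: Any) -> str:
    """Hypothetical stand-in for the PR's normalization helper."""
    if isinstance(value, str):
        # Assumption: strings are compared as given.
        return value
    if isinstance(value, (dict, list)):
        try:
            # sort_keys=True makes key order irrelevant for dicts.
            return json.dumps(value, sort_keys=True)
        except (TypeError, ValueError):
            # Assumption: fall back to str() for non-serializable content.
            return str(value)
    # int, float, bool and anything else.
    return str(value)


def prepare_step(name: str, params: Dict[str, Any]) -> Tuple[str, FrozenSet[Tuple[str, str]]]:
    """Hypothetical: build a hashable (tool name, normalized params) pair."""
    normalized = frozenset((key, normalize_param_value(val)) for key, val in params.items())
    return name, normalized


# The same normalization is applied to the agent side and the ground truth.
agent_step = prepare_step("search", {"count": 1, "filters": {"b": 2, "a": 1}})
truth_step = prepare_step("search", {"count": "1", "filters": '{"a": 1, "b": 2}'})
steps_match = agent_step == truth_step
print(steps_match)
```

Under these assumptions a ground-truth JSON string only matches when it uses the same spacing as `json.dumps` output; a compact string such as `'{"a":1,"b":2}'` would still differ.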

Files Changed

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_eval.py
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_task_navigation_efficiency_evaluators.py (13 new test cases)

Testing

  • 13 new unit tests covering: int vs int, int vs str, str vs int, bool vs bool, bool vs str, dict vs dict, dict vs JSON string, JSON string vs dict, list vs list, list vs JSON string, stringified args vs dict, float vs float, float vs str
  • All normalization logic verified via standalone tests (full pytest suite requires network-dependent dependencies not available in sandbox)
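The pairings listed above can be approximated as a small table-driven check. This is not the PR's test file: `normalize` is a hypothetical helper that follows the rule stated in the description (`json.dumps` with `sort_keys=True` for dicts and lists, `str()` otherwise), and the concrete values are invented. The "stringified args vs dict" case is left out because it depends on how the evaluator parses tool-call arguments.

```python
import json
from typing import Any


def normalize(value: Any) -> str:
    # Hypothetical helper mirroring the rule stated in the PR description.
    if isinstance(value, str):
        return value
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return str(value)


# (label, agent value, ground-truth value); each pair is expected to match.
CASES = [
    ("int vs int", 1, 1),
    ("int vs str", 1, "1"),
    ("str vs int", "1", 1),
    ("bool vs bool", True, True),
    ("bool vs str", True, "True"),
    ("dict vs dict", {"b": 2, "a": 1}, {"a": 1, "b": 2}),
    ("dict vs JSON string", {"b": 2, "a": 1}, '{"a": 1, "b": 2}'),
    ("JSON string vs dict", '{"a": 1, "b": 2}', {"a": 1, "b": 2}),
    ("list vs list", [1, 2], [1, 2]),
    ("list vs JSON string", [1, 2], "[1, 2]"),
    ("float vs float", 1.5, 1.5),
    ("float vs str", 1.5, "1.5"),
]

failures = [label for label, agent, truth in CASES if normalize(agent) != normalize(truth)]
print(failures)
```

With `str()` as the scalar rule, a boolean matches the Python spelling `"True"` rather than the JSON spelling `"true"`.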

Port fix from Azure/azureml-assets#4901. Adds _normalize_param_value
static method for consistent string comparison of parameter values
(int, float, bool, dict, list) between agent and ground truth.
Updates _extract_tool_names_and_params_from_response to preserve
original value types instead of premature str() conversion.

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/4888d4d1-bd21-46b6-a733-231b3ffefddd

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Copilot AI and others added 2 commits April 9, 2026 22:40
…ine-length (#46232)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1c88d810-2e80-47a9-ad09-40adb6529219

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
m7md7sien marked this pull request as ready for review May 15, 2026 03:03
Copilot AI review requested due to automatic review settings May 15, 2026 03:03
m7md7sien requested a review from a team as a code owner May 15, 2026 03:03
m7md7sien merged commit f502345 into main May 15, 2026
22 checks passed
m7md7sien deleted the copilot/duplicate-fix-from-azureml-assets branch May 15, 2026 03:03

Copilot AI left a comment

Pull request overview

This PR ports a fix to reduce false negatives when comparing tool-call parameters against ground truth in TaskNavigationEfficiencyEvaluator by normalizing parameter values before comparison.

Changes:

  • Added _normalize_param_value to stringify/JSON-serialize parameter values for consistent comparisons.
  • Updated _prepare_steps_for_comparison to normalize both agent and ground truth parameters before matching.
  • Updated _extract_tool_names_and_params_from_response to preserve raw argument values (avoid premature str() conversion) and added unit tests for type normalization scenarios.
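Why the extraction step has to keep raw values can be seen by running the same normalization after two different extraction strategies. Everything here is a simplified stand-in: `extract_with_early_str`, `extract_raw`, and `normalize` are hypothetical names, and the argument shapes are invented for the example.

```python
import json
from typing import Any, Dict


def normalize(value: Any) -> str:
    # Hypothetical comparison-time normalization.
    if isinstance(value, str):
        return value
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return str(value)


def extract_with_early_str(arguments: Dict[str, Any]) -> Dict[str, str]:
    # Old behaviour as described in the PR: values stringified at extraction.
    return {key: str(val) for key, val in arguments.items()}


def extract_raw(arguments: Dict[str, Any]) -> Dict[str, Any]:
    # New behaviour as described in the PR: original types are preserved.
    return dict(arguments)


arguments = {"tags": ["x", "y"], "limit": 5}
truth = {"tags": '["x", "y"]', "limit": "5"}

early = {k: normalize(v) for k, v in extract_with_early_str(arguments).items()}
late = {k: normalize(v) for k, v in extract_raw(arguments).items()}
normalized_truth = {k: normalize(v) for k, v in truth.items()}

print(early == normalized_truth, late == normalized_truth)
```

Once a list has been turned into its Python repr, the normalizer sees only a string and has nothing left to canonicalize, so the early-`str()` path still fails while the raw path matches.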

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py | Adds parameter normalization and applies it during step preparation for matching. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_eval.py | Preserves raw tool argument values so normalization can happen at comparison time. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_task_navigation_efficiency_evaluators.py | Adds unit tests to validate parameter type normalization behavior. |

Comment on lines +163 to +166
        return value
    if isinstance(value, (dict, list)):
        try:
            return json.dumps(value, sort_keys=True)
),
)
assert result["task_navigation_efficiency_result"] == "pass"

ninghu pushed a commit to ninghu/azure-sdk-for-python that referenced this pull request May 22, 2026
…re#46227)

* Normalize parameter types in TaskNavigationEfficiency comparison

Port fix from Azure/azureml-assets#4901. Adds _normalize_param_value
static method for consistent string comparison of parameter values
(int, float, bool, dict, list) between agent and ground truth.
Updates _extract_tool_names_and_params_from_response to preserve
original value types instead of premature str() conversion.

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/4888d4d1-bd21-46b6-a733-231b3ffefddd

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Fix black formatting: collapse expressions that fit within 120 char line-length (Azure#46232)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1c88d810-2e80-47a9-ad09-40adb6529219

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Fix black issue

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: mohessie <mohessie@microsoft.com>

Labels

Evaluation Issues related to the client library for Azure AI Evaluation

4 participants