Fix TaskNavigationEfficiencyEvaluator threshold defaulting to 3.0 for binary metric #46542

Merged
m7md7sien merged 4 commits into main from copilot/fix-task-navigation-efficiency-threshold
Apr 27, 2026
Conversation

Contributor

Copilot AI commented Apr 26, 2026

TaskNavigationEfficiencyEvaluator returns a _score key, which triggers EvaluatorBase._real_call to auto-compute _result/_threshold. Because super().__init__() was called without a threshold, it defaulted to 3.0, causing a score of 1.0 to always be evaluated as fail.
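
The failure mode can be reproduced with a toy model. This is a minimal sketch, not the SDK code: `ToyEvaluatorBase` and `BuggyBinaryEvaluator` are hypothetical stand-ins, and the only behavior assumed is what is described above (a `_score` key triggers derived `_result`/`_threshold` keys, with a pass when score >= threshold and a default threshold of 3.0).

```python
from typing import Any, Dict


class ToyEvaluatorBase:
    """Hypothetical stand-in for EvaluatorBase; not the SDK implementation."""

    def __init__(self, threshold: float = 3.0) -> None:
        self._threshold = threshold

    def _do_eval(self) -> Dict[str, Any]:
        raise NotImplementedError

    def __call__(self) -> Dict[str, Any]:
        result = self._do_eval()
        # For every key ending in "_score", derive sibling "_result"
        # and "_threshold" keys, overwriting any existing values.
        for key in list(result):
            if key.endswith("_score"):
                base = key[: -len("_score")]
                passed = result[key] >= self._threshold
                result[f"{base}_result"] = "pass" if passed else "fail"
                result[f"{base}_threshold"] = self._threshold
        return result


class BuggyBinaryEvaluator(ToyEvaluatorBase):
    def __init__(self) -> None:
        super().__init__()  # inherits the 3.0 default

    def _do_eval(self) -> Dict[str, Any]:
        # A perfect match on a binary (0.0/1.0) metric.
        return {"task_navigation_efficiency_score": 1.0}


buggy_output = BuggyBinaryEvaluator()()
print(buggy_output["task_navigation_efficiency_result"])  # fail
```

With the default threshold, the best possible score of 1.0 can never reach 3.0, so every evaluation is reported as a failure.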

Changes

  • __init__: Pass threshold=1.0 to super().__init__() so EvaluatorBase correctly classifies this binary (0.0/1.0) metric as passing when score >= 1.0
  • _do_eval return dict: Use float(self._threshold) instead of a hard-coded value so the returned task_navigation_efficiency_threshold always stays in sync with the base class threshold and cannot drift if the threshold changes
  • Tests: Added assertions in test_exact_match_scenario for task_navigation_efficiency_result == "pass" and task_navigation_efficiency_threshold == 1.0 to prevent regressions of the original issue
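
A rough sketch of the first two changes, under stated assumptions: `ToyBase` and `ToyNavigationEvaluator` are hypothetical names, and the `matched` parameter is a simplification of the real evaluator's inputs. It only illustrates passing `threshold=1.0` to the base initializer and reading the returned threshold from `self._threshold` instead of repeating a literal.

```python
from typing import Any, Dict


class ToyBase:
    """Hypothetical base class: it only stores the threshold."""

    def __init__(self, threshold: float = 3.0) -> None:
        self._threshold = threshold


class ToyNavigationEvaluator(ToyBase):
    def __init__(self) -> None:
        # Binary metric: only a score of 1.0 should pass.
        super().__init__(threshold=1.0)

    def _do_eval(self, matched: bool) -> Dict[str, Any]:
        score = 1.0 if matched else 0.0
        return {
            "task_navigation_efficiency_score": score,
            # Read from the instance rather than hard-coding 1.0 again.
            "task_navigation_efficiency_threshold": float(self._threshold),
        }


evaluator = ToyNavigationEvaluator()
hit = evaluator._do_eval(matched=True)
miss = evaluator._do_eval(matched=False)
```

Because the returned threshold is derived from the instance attribute, a later change to the value passed to `super().__init__()` is reflected in the output without a second edit.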
Original prompt

In sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py, the evaluator returns a task_navigation_efficiency_score key (ending with _score), which triggers EvaluatorBase._real_call's automatic threshold/result computation. Since the evaluator calls super().__init__() with the default threshold=3.0, task_navigation_efficiency_result/_threshold will be overwritten to fail/3.0 even when match_result is True (score=1.0).

Two changes are needed:

  1. Line 140: Change super().__init__() to super().__init__(threshold=1.0) so the base class uses the correct threshold for this binary (0.0/1.0) metric.

  2. Line 348: Change "task_navigation_efficiency_threshold": None to "task_navigation_efficiency_threshold": 1.0 so the return dict explicitly returns the correct threshold value.

No other changes are needed.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: This completed PR: https://github.com/Azure/azure-sdk-for-python/pull/46474/changes#top
Introduced a bug:
Returning a new task_navigation_efficiency_score key (ending with _score) triggers EvaluatorBase._real_call's automatic threshold/result computation. Since this evaluator calls super().__init__() with the default threshold=3.0, task_navigation_efficiency_result/_threshold will be overwritten to fail/3.0 even when match_result is True (score=1.0). Fix by initializing the base class with an appropriate threshold for a binary metric (e.g., 1.0 or 0.5) and/or avoid duplicating _result/_threshold here and let the base compute them consistently.

Fix it by passing threshold=1.0 to __init__:
super().__init__(threshold=1.0)

Also update return to:
"task_navigation_efficiency_threshold": 1.0,

Assistant: Now let me get the current file to find the __init__ method and the _do_eval return statement:

Assistant: Now I need to fix two things: change super().__init__() to super().__init__(threshold=1.0) on line 140, and change "task_navigation_efficiency_threshold": None to "task_navigation_efficiency_threshold": 1.0 on line 348.

Assistant: The fix has been committed. Two changes were made:

  1. Line 140: super().__init__() → super().__init__(threshold=1.0): ensures EvaluatorBase uses threshold 1.0 instead of the default 3.0, so its automatic _result/_threshold computation is consistent with this binary (0.0/1.0) metric.

  2. Line 348: "task_navigation_efficiency_threshold": None → "task_navigation_efficiency_threshold": 1.0: explicitly returns the correct threshold value instead of None, keeping the return dict consistent with the base class.

User: create a PR in a new branch from main

Assistant: A pull request is being created in the Azure/azure-sdk-for-python repository. The coding agent is working on a new branch from main with the two fixes:

  1. Changing super().__init__() to super().__init__(threshold=1.0) on line 140
  2. Changing "task_navigation_efficiency_threshold": None to "task_navigation_efficiency_threshold": 1.0 on line 348

You'll receive an update once the PR is ready.

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Fix threshold value for task navigation efficiency evaluator" to "Fix TaskNavigationEfficiencyEvaluator threshold defaulting to 3.0 for binary metric" Apr 26, 2026
Copilot AI requested a review from m7md7sien April 26, 2026 17:51
m7md7sien previously approved these changes Apr 26, 2026
@m7md7sien m7md7sien marked this pull request as ready for review April 26, 2026 18:06
@m7md7sien m7md7sien requested a review from a team as a code owner April 26, 2026 18:06
Copilot AI review requested due to automatic review settings April 26, 2026 18:06
@m7md7sien m7md7sien enabled auto-merge (squash) April 26, 2026 18:08
Contributor

Copilot AI left a comment

Pull request overview

Fixes incorrect default thresholding for the TaskNavigationEfficiencyEvaluator’s binary *_score output, ensuring EvaluatorBase._real_call computes consistent *_result/*_threshold values.

Changes:

  • Set EvaluatorBase threshold to 1.0 for task_navigation_efficiency_score by passing threshold=1.0 to super().__init__().
  • Make the evaluator’s explicit task_navigation_efficiency_threshold output consistent with the base-class threshold (1.0).
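
To illustrate why the choice of threshold matters for a binary score, here is a small sketch. The `classify` helper is hypothetical and assumes the pass rule stated in the PR description (score >= threshold). It compares the two values suggested earlier in the conversation (0.5 and 1.0) with the old default of 3.0.

```python
def classify(score: float, threshold: float) -> str:
    """Assumed pass rule: a score passes when it is >= the threshold."""
    return "pass" if score >= threshold else "fail"


binary_scores = (0.0, 1.0)

# Map each candidate threshold to the outcomes for scores 0.0 and 1.0.
table = {
    threshold: [classify(score, threshold) for score in binary_scores]
    for threshold in (0.5, 1.0, 3.0)
}

for threshold, outcomes in table.items():
    print(threshold, outcomes)
```

Under this rule, 0.5 and 1.0 classify a 0.0/1.0 score identically, while 3.0 fails both values.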

…tors/_task_navigation_efficiency/_task_navigation_efficiency.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions
Contributor

Hi @copilot. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

@github-actions github-actions Bot dismissed m7md7sien’s stale review April 26, 2026 18:20

Hi @m7md7sien. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

Contributor Author

Copilot AI commented Apr 26, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • pypi.org
    • Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/lib/python3.9/site-packages/pip/__pip-REDACTED__.py install --ignore-installed --no-user --prefix /tmp/pip-build-env-wxmzb075/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i REDACTED -- setuptools>=40.8.0 (dns block)
    • Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/pip pip install httpx -q (dns block)
  • scanning-api.github.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI requested a review from m7md7sien April 26, 2026 18:24
aprilk-ms previously approved these changes Apr 26, 2026
m7md7sien previously approved these changes Apr 26, 2026
@github-actions
Contributor

Hi @copilot. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

@github-actions github-actions Bot dismissed stale reviews from aprilk-ms and m7md7sien April 26, 2026 18:30

Hi @aprilk-ms. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

@m7md7sien m7md7sien marked this pull request as draft April 26, 2026 21:08
auto-merge was automatically disabled April 26, 2026 21:08

Pull request was converted to draft

@m7md7sien m7md7sien marked this pull request as ready for review April 27, 2026 16:12
@m7md7sien m7md7sien enabled auto-merge (squash) April 27, 2026 16:12
@m7md7sien m7md7sien merged commit ffd4b61 into main Apr 27, 2026
21 checks passed
@m7md7sien m7md7sien deleted the copilot/fix-task-navigation-efficiency-threshold branch April 27, 2026 17:09
fafhrd91 pushed a commit to fafhrd91/azure-sdk-for-python that referenced this pull request Apr 28, 2026
… binary metric (Azure#46542)

* Initial plan

* Fix TaskNavigationEfficiencyEvaluator threshold: use 1.0 instead of default 3.0

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e376f26a-4cd6-44a9-b271-81eb2b6e86d9

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Use self._threshold in return dict and add result/threshold test assertions

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/28756174-3e26-4ea2-849c-9d5c0a28d6c3

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: Mohamed Hessien <mohessie@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
4 participants