Fix TaskNavigationEfficiencyEvaluator threshold defaulting to 3.0 for binary metric by Copilot · Pull Request #46542 · Azure/azure-sdk-for-python

Copilot · 2026-04-26T17:49:44Z

TaskNavigationEfficiencyEvaluator returns a key ending in _score (task_navigation_efficiency_score), which triggers EvaluatorBase._real_call to auto-compute the matching _result/_threshold keys. Because super().__init__() was called without a threshold, the base class defaulted to 3.0, so a score of 1.0, the highest value this binary metric can produce, was always evaluated as fail.
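
The failure mode can be reproduced with a minimal sketch. `ToyEvaluatorBase` below is a hypothetical stand-in, not the real `EvaluatorBase`; it assumes only what this description states: a default threshold of 3.0, automatic `_result`/`_threshold` computation for any key ending in `_score`, and a pass rule of `score >= threshold`.

```python
from typing import Any, Dict


class ToyEvaluatorBase:
    """Hypothetical stand-in for EvaluatorBase (not the real SDK class)."""

    def __init__(self, threshold: float = 3.0) -> None:
        # Assumed default threshold, as described in this PR.
        self._threshold = threshold

    def _do_eval(self) -> Dict[str, Any]:
        raise NotImplementedError

    def __call__(self) -> Dict[str, Any]:
        result = self._do_eval()
        # For every "<name>_score" key, derive "<name>_result" and "<name>_threshold".
        for key in list(result):
            if key.endswith("_score"):
                name = key[: -len("_score")]
                passed = float(result[key]) >= self._threshold
                result[f"{name}_result"] = "pass" if passed else "fail"
                result[f"{name}_threshold"] = self._threshold
        return result


class BuggyBinaryEvaluator(ToyEvaluatorBase):
    """Binary metric that forgets to pass a threshold to the base class."""

    def __init__(self) -> None:
        super().__init__()  # no threshold passed, so the 3.0 default applies

    def _do_eval(self) -> Dict[str, Any]:
        # 1.0 is the best score a binary (0.0/1.0) metric can return.
        return {"task_navigation_efficiency_score": 1.0}


buggy_output = BuggyBinaryEvaluator()()
print(buggy_output["task_navigation_efficiency_result"])  # prints "fail"
```

Under these assumptions no binary score can ever reach the threshold, which matches the behavior reported here.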

Changes

__init__: Pass threshold=1.0 to super().__init__() so EvaluatorBase correctly classifies this binary (0.0/1.0) metric as passing when score >= 1.0
_do_eval return dict: Use float(self._threshold) instead of a hard-coded value so the returned task_navigation_efficiency_threshold always stays in sync with the base class threshold and cannot drift if the threshold changes
Tests: Added assertions in test_exact_match_scenario for task_navigation_efficiency_result == "pass" and task_navigation_efficiency_threshold == 1.0 to prevent regressions of the original issue
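
The three changes above can be sketched together as follows. This is a simplified illustration, not the code from the PR: `ToyEvaluatorBase`, `ToyNavigationEvaluator`, and the `evaluate` method are hypothetical names, and the result is computed inline here whereas the description says the base class computes it.

```python
from typing import Any, Dict


class ToyEvaluatorBase:
    """Hypothetical stand-in for EvaluatorBase; it only stores the threshold."""

    def __init__(self, threshold: float = 3.0) -> None:
        self._threshold = threshold


class ToyNavigationEvaluator(ToyEvaluatorBase):
    """Hypothetical binary (0.0/1.0) evaluator illustrating the fix."""

    def __init__(self) -> None:
        # Change 1: pass the binary-metric threshold to the base class.
        super().__init__(threshold=1.0)

    def evaluate(self, exact_match: bool) -> Dict[str, Any]:
        score = 1.0 if exact_match else 0.0
        passed = score >= self._threshold
        return {
            "task_navigation_efficiency_score": score,
            "task_navigation_efficiency_result": "pass" if passed else "fail",
            # Change 2: read the threshold from the base class instead of
            # hard-coding it, so the two values cannot drift apart.
            "task_navigation_efficiency_threshold": float(self._threshold),
        }


def test_exact_match_scenario() -> None:
    # Change 3: regression assertions for the result and the threshold.
    output = ToyNavigationEvaluator().evaluate(exact_match=True)
    assert output["task_navigation_efficiency_result"] == "pass"
    assert output["task_navigation_efficiency_threshold"] == 1.0


test_exact_match_scenario()
```

Reading `self._threshold` rather than repeating the literal means a later change to the constructor argument is reflected in the returned dict without a second edit.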

Original prompt

In sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py, the evaluator returns a task_navigation_efficiency_score key (ending with _score), which triggers EvaluatorBase._real_call's automatic threshold/result computation. Since the evaluator calls super().__init__() with the default threshold=3.0, task_navigation_efficiency_result/_threshold will be overwritten to fail/3.0 even when match_result is True (score=1.0).

Two changes are needed:

Line 140: Change super().__init__() to super().__init__(threshold=1.0) so the base class uses the correct threshold for this binary (0.0/1.0) metric.

Line 348: Change "task_navigation_efficiency_threshold": None to "task_navigation_efficiency_threshold": 1.0 so the return dict explicitly returns the correct threshold value.

No other changes are needed.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: This completed PR: https://github.com/Azure/azure-sdk-for-python/pull/46474/changes#top
Introduce a bug:
Returning a new task_navigation_efficiency_score key (ending with _score) triggers EvaluatorBase._real_call's automatic threshold/result computation. Since this evaluator calls super().__init__() with the default threshold=3.0, task_navigation_efficiency_result/_threshold will be overwritten to fail/3.0 even when match_result is True (score=1.0). Fix by initializing the base class with an appropriate threshold for a binary metric (e.g., 1.0 or 0.5) and/or avoid duplicating _result/_threshold here and let the base compute them consistently.

Fix it by passing to __init__ threshold 1
super().__init__(threshold=1.0)

Also update return to:
"task_navigation_efficiency_threshold": 1.0,
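
As a small illustration of why either 1.0 or 0.5 would work for a binary metric, the sketch below compares verdicts under three thresholds. The `classify` helper is hypothetical and assumes the pass rule is `score >= threshold`.

```python
def classify(score: float, threshold: float) -> str:
    """Assumed pass rule: a score passes when it is at least the threshold."""
    return "pass" if score >= threshold else "fail"


# A binary metric can only produce these two scores.
BINARY_SCORES = (0.0, 1.0)

# Map each candidate threshold to the verdicts for scores 0.0 and 1.0.
verdicts = {
    threshold: [classify(score, threshold) for score in BINARY_SCORES]
    for threshold in (0.5, 1.0, 3.0)
}

for threshold, results in verdicts.items():
    print(threshold, results)
```

Thresholds 0.5 and 1.0 classify both possible scores identically, while 3.0 fails both, so the choice between 0.5 and 1.0 only affects the threshold value that is reported.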

Assistant: Now let me get the current file to find the __init__ method and the _do_eval return statement:

Assistant: Now I need to fix two things: change super().__init__() to super().__init__(threshold=1.0) on line 140, and change "task_navigation_efficiency_threshold": None to "task_navigation_efficiency_threshold": 1.0 on line 348.

Assistant: The fix has been committed. Two changes were made:

Line 140: super().__init__() → super().__init__(threshold=1.0) — ensures EvaluatorBase uses threshold 1.0 instead of the default 3.0, so its automatic _result/_threshold computation is consistent with this binary (0.0/1.0) metric.

Line 348: "task_navigation_efficiency_threshold": None → "task_navigation_efficiency_threshold": 1.0 — explicitly returns the correct threshold value instead of None, keeping the return dict consistent with the base class.

User: create a PR in a new branch from main

Assistant: A pull request is being created in the Azure/azure-sdk-for-python repository. The coding agent is working on a new branch from main with the two fixes:

Changing super().__init__() to super().__init__(threshold=1.0) on line 140

Changing "task_navigation_efficiency_threshold": None to "task_navigation_efficiency_threshold": 1.0 on line 348

You'll receive an update once the PR is ready.

This pull request was created from Copilot chat.

Copilot

Pull request overview

Fixes incorrect default thresholding for the TaskNavigationEfficiencyEvaluator’s binary *_score output, ensuring EvaluatorBase._real_call computes consistent *_result/*_threshold values.

Changes:

Set EvaluatorBase threshold to 1.0 for task_navigation_efficiency_score by passing threshold=1.0 to super().__init__().
Make the evaluator’s explicit task_navigation_efficiency_threshold output consistent with the base-class threshold (1.0).

github-actions · 2026-04-26T18:20:58Z

Hi @copilot. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

@m7md7sien

Hi @m7md7sien. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

Copilot · 2026-04-26T18:24:33Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

pypi.org
- Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/lib/python3.9/site-packages/pip/__pip-REDACTED__.py install --ignore-installed --no-user --prefix /tmp/pip-build-env-wxmzb075/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i REDACTED -- setuptools>=40.8.0 (dns block)
- Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/pip pip install httpx -q (dns block)
scanning-api.github.com
- Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

github-actions · 2026-04-26T18:30:08Z

Hi @copilot. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

@aprilk-ms

Hi @aprilk-ms. We've noticed that new changes have been pushed to this pull request. Because it is set to automatically merge, we've reset the approvals to allow the opportunity to review the updates.

… binary metric (Azure#46542)

* Initial plan

* Fix TaskNavigationEfficiencyEvaluator threshold: use 1.0 instead of default 3.0
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e376f26a-4cd6-44a9-b271-81eb2b6e86d9
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Use self._threshold in return dict and add result/threshold test assertions
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/28756174-3e26-4ea2-849c-9d5c0a28d6c3
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: Mohamed Hessien <mohessie@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Initial plan

02317b4

Copilot AI assigned Copilot and m7md7sien Apr 26, 2026

Copilot started work on behalf of m7md7sien April 26, 2026 17:49

Fix TaskNavigationEfficiencyEvaluator threshold: use 1.0 instead of d…

173ddcd

…efault 3.0 Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e376f26a-4cd6-44a9-b271-81eb2b6e86d9 Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix threshold value for task navigation efficiency evaluator~~ Fix TaskNavigationEfficiencyEvaluator threshold defaulting to 3.0 for binary metric Apr 26, 2026

Copilot AI requested a review from m7md7sien April 26, 2026 17:51

Copilot finished work on behalf of m7md7sien April 26, 2026 17:51

m7md7sien previously approved these changes Apr 26, 2026

m7md7sien marked this pull request as ready for review April 26, 2026 18:06

m7md7sien requested a review from a team as a code owner April 26, 2026 18:06

Copilot AI review requested due to automatic review settings April 26, 2026 18:06

Copilot started reviewing on behalf of m7md7sien April 26, 2026 18:07

m7md7sien enabled auto-merge (squash) April 26, 2026 18:08

Copilot AI reviewed Apr 26, 2026

Copilot started work on behalf of m7md7sien April 26, 2026 18:20

Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evalua…

9d9465c

…tors/_task_navigation_efficiency/_task_navigation_efficiency.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Use self._threshold in return dict and add result/threshold test asse…

fc419ab

…rtions Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/28756174-3e26-4ea2-849c-9d5c0a28d6c3 Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

Copilot finished work on behalf of m7md7sien April 26, 2026 18:24

Copilot AI requested a review from m7md7sien April 26, 2026 18:24

aprilk-ms previously approved these changes Apr 26, 2026

m7md7sien previously approved these changes Apr 26, 2026

m7md7sien approved these changes Apr 26, 2026

m7md7sien marked this pull request as draft April 26, 2026 21:08

auto-merge was automatically disabled April 26, 2026 21:08
Pull request was converted to draft

m7md7sien marked this pull request as ready for review April 27, 2026 16:12

m7md7sien enabled auto-merge (squash) April 27, 2026 16:12

aprilk-ms approved these changes Apr 27, 2026

m7md7sien merged commit ffd4b61 into main Apr 27, 2026
21 checks passed

m7md7sien deleted the copilot/fix-task-navigation-efficiency-threshold branch April 27, 2026 17:09

Fix TaskNavigationEfficiencyEvaluator threshold defaulting to 3.0 for binary metric #46542
m7md7sien merged 4 commits into main from copilot/fix-task-navigation-efficiency-threshold

4 participants
