Tool Call Accuracy V2 #41740
Conversation
Thank you for your contribution @salma-elshafey! We will review the pull request and get back to you soon.
Pull Request Overview
This PR updates the Tool Call Accuracy Evaluator to use a scoring rubric ranging from 1 to 5 instead of a binary score and evaluates all tool calls in a single turn collectively. Key changes include:
- Transition from a binary scoring system (0/1) to a detailed 1–5 rubric.
- Consolidation of tool call evaluations per turn with enhanced output details.
- Updates to test cases, sample notebooks, and documentation to align with the new evaluation logic.
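For orientation, here is a minimal usage sketch under the new behavior. The model configuration values, `agent_response`, and `tool_definitions` are placeholders, and the output keys are drawn from the fields discussed later in this thread; they are illustrative, not a spec.

```python
from azure.ai.evaluation import ToolCallAccuracyEvaluator

# Placeholder model configuration; fill in your own endpoint and deployment.
model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

result = evaluator(
    query="What is the weather in Seattle?",
    response=agent_response,            # full agent turn, including its tool calls
    tool_definitions=tool_definitions,  # definitions of the tools available to the agent
)

# Illustrative shape of a per-turn result under the 1-5 rubric:
# {
#     "tool_call_accuracy": 4,               # single rubric score for all tool calls in the turn
#     "tool_call_accuracy_result": "pass",   # "pass" when the score meets the threshold (default 3)
#     "tool_call_accuracy_threshold": 3,
#     "tool_call_accuracy_reason": "...",
# }
```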
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updated unit tests to verify new scoring and output details. |
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_agent_evaluators.py | Modified tests for missing input cases and tool definition validations. |
sdk/evaluation/azure-ai-evaluation/samples/agent_evaluators/tool_call_accuracy.ipynb | Revised sample to demonstrate updated evaluator usage and scoring. |
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Core evaluator logic modified to support the new scoring rubric and input handling. |
sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Changelog updated to reflect improvements to the evaluator. |
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py:150
- The current logic overrides a provided 'tool_calls' parameter with those parsed from 'response' when present, which may not align with the documented behavior; consider preserving the explicitly provided 'tool_calls' when both are supplied.
tool_calls = parsed_tool_calls
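One possible fix along the lines suggested, assuming `tool_calls` is an optional caller-supplied argument defaulting to None and `parsed_tool_calls` holds the calls extracted from `response` (the helper name is hypothetical):

```python
def _resolve_tool_calls(tool_calls, parsed_tool_calls):
    """Prefer explicitly provided tool_calls; fall back to those parsed from the response."""
    return tool_calls if tool_calls else parsed_tool_calls
```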
The evaluator uses a scoring rubric of 1 to 5:
- Score 1: The tool calls are irrelevant
- Score 2: The tool calls are partially relevant, but not enough tools were called or the parameters were not correctly passed
- Score 3: The tool calls are relevant, but there were unncessary, excessive tool calls made
There is a spelling error in the description for score 3; 'unncessary' should be corrected to 'unnecessary'.
Suggested change: - Score 3: The tool calls are relevant, but there were unnecessary, excessive tool calls made
@microsoft-github-policy-service agree [company="Microsoft"]
@microsoft-github-policy-service agree company="Microsoft"
raise EvaluationException(
    message="Tool call accuracy evaluator: Invalid score returned from LLM.",
if isinstance(llm_output, dict):
    score = llm_output.get("tool_calls_success_level", None)
Please add a constant for it.
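For example, the key could be lifted into a class-level constant; the class here is a simplified stand-in and the constant name is hypothetical:

```python
class ToolCallAccuracyEvaluator:
    # Name of the score field expected in the LLM's JSON output.
    _LLM_SCORE_KEY = "tool_calls_success_level"

    def _parse_score(self, llm_output: dict):
        # Use the constant instead of the raw string literal.
        return llm_output.get(self._LLM_SCORE_KEY, None)
```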
f"{self._result_key}_result": score_result, | ||
f"{self._result_key}_threshold": self.threshold, | ||
f"{self._result_key}_reason": reason, | ||
'applicable': True, |
What does this field signify?
Whether we ran evaluation on this turn or not. We don't run evaluations in the cases of:
- No tool calls happened in the turn.
- No tool definitions were provided in the turn.
- All/Some of the tool calls were of built-in tools.
However, this can be deduced from the score field, whether it's an int value or "not applicable". Do you suggest we remove this one?
If we can deduce this from other fields please remove it.
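If the flag is dropped, applicability could be derived from the score field on the consumer side; a sketch, assuming the score is either an integer 1-5 or the string "not applicable" (key name taken from this thread, function name hypothetical):

```python
def is_applicable(result: dict, result_key: str = "tool_call_accuracy") -> bool:
    # A turn was evaluated only if the score field holds a numeric rubric value.
    return result.get(result_key) != "not applicable"
```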
self._EXCESS_TOOL_CALLS_KEY: llm_output.get(self._EXCESS_TOOL_CALLS_KEY, {}),
self._MISSING_TOOL_CALLS_KEY: llm_output.get(self._MISSING_TOOL_CALLS_KEY, {}),
Is there any spec which defines what fields should be added?
No, but these fields have been approved by the PM
f"{self._result_key}_result": 'pass', | ||
f"{self._result_key}_threshold": self.threshold, | ||
f"{self._result_key}_reason": error_message, | ||
"applicable": False, |
Please remove this field
tool_results_map = {}
if isinstance(response, list):
    for message in response:
        print(message)
Please remove the print commands. If you would like to log, please use a logger with the correct log level.
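For instance, a sketch of the excerpt above with the print replaced by a module-level logger at debug level (the helper name is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def _build_tool_results_map(response):
    # Same loop as above, but logging each message instead of printing to stdout.
    tool_results_map = {}
    if isinstance(response, list):
        for message in response:
            logger.debug("Processing response message: %s", message)
    return tool_results_map
```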
Your output should consist only of a JSON object, as provided in the examples, that has the following keys:
- chain_of_thought: a string that explains your thought process to decide on the tool call accuracy level. Start this string with 'Let's think step by step:', and think deeply and precisely about which level should be chosen based on the agent's tool calls and how they were able to address the user's query.
- tool_calls_success_level: a integer value between 1 and 5 that represents the level of tool call success, based on the level definitions mentioned before. You need to be very precise when deciding on this level. Ensure you are correctly following the rating system based on the description of each level.
- tool_calls_sucess_result: 'pass' or 'fail' based on the evaluation level of the tool call accuracy. Levels 1 and 2 are a 'fail', levels 3, 4 and 5 are a 'pass'.
Should this be based on the threshold the customer passes? Or does the spec define this?
In the evaluator code, we parse the score/level and generate the 'pass' and 'fail' based on the threshold defined by the user, whose default value is 3.
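In other words, something along these lines in the evaluator code; the default of 3 is taken from this thread and the function name is hypothetical:

```python
def score_to_result(score: int, threshold: int = 3) -> str:
    # Map the 1-5 rubric score to pass/fail using the user-supplied threshold.
    return "pass" if score >= threshold else "fail"
```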
What is tool_calls_success_result used for? How do we expose it to customers?
I used it previously in the quality analysis but it's now unused in the SDK code, removed.
- excess_tool_calls: a dictionary with the following keys:
  - total: total number of excess, unnecessary tool calls made by the agent
  - details: a list of dictionaries, each containing:
    - tool_name: name of the tool
    - excess_count: number of excess calls made for this query
- missing_tool_calls: a dictionary with the following keys:
  - total: total number of missing tool calls that should have been made by the agent to be able to answer the query
  - details: a list of dictionaries, each containing:
    - tool_name: name of the tool
Do we provide any instructions on how to come up with excess_tool_calls or missing_tool_calls?
No, but I believe the definitions in their subfields explain them.
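For illustration, a filled-in example of the two structures as a Python dict; the tool names and counts are fabricated:

```python
example_details = {
    "excess_tool_calls": {
        "total": 2,
        "details": [
            {"tool_name": "search_weather", "excess_count": 2},
        ],
    },
    "missing_tool_calls": {
        "total": 1,
        "details": [
            {"tool_name": "get_user_location"},
        ],
    },
}
```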
"""Return a result indicating that the tool call is not applicable for evaluation. | ||
|
||
pr |
Is this a typo?
yes, removed.
assert result[key] == "not applicable"
assert result[f"{key}_result"] == "not applicable"
assert result[key] == ToolCallAccuracyEvaluator._NOT_APPLICABLE_RESULT
assert result[f"{key}_result"] == "pass"
Should we have the result as 'pass' if it was not applicable?
That was agreed on with the PM.
@changliu2
Please confirm your approval of the above behavior.
- Score 1: The tool call is relevant with properly extracted parameters from the conversation
The evaluator uses a scoring rubric of 1 to 5:
- Score 1: The tool calls are irrelevant
- Score 2: The tool calls are partially relevant, but not enough tools were called or the parameters were not correctly passed
Please update the description to include the JTBD that were discussed in the tool accuracy doc.
Description
This PR introduces a new version of the Tool Call Accuracy Evaluator with lower intra- and inter-model variance compared to V1.
It introduces:
With V2, we achieved an 11% improvement in human-alignment scores compared to V1, as shown in the table below:
