# Tool Input Accuracy Evaluator

## Objective
This sample demonstrates how to use the Tool Input Accuracy evaluator on agent data. The supported input formats include:
- simple data such as strings and `dict` objects describing tool calls;
- user-agent conversations in the form of a list of agent messages.

## Time

You should expect to spend about 20 minutes running this notebook. 

## Before you begin
For quality evaluation, you need to deploy a `gpt` model supporting JSON mode. We recommend using `gpt-4o`.    

### Prerequisite
```bash
pip install azure-ai-projects azure-identity openai
```
Set these environment variables with your own values:
1) **AZURE_AI_PROJECT_ENDPOINT** - Your Azure AI project endpoint, in the format `https://<account_name>.services.ai.azure.com/api/projects/<project_name>`
2) **AZURE_AI_MODEL_DEPLOYMENT_NAME** - The deployment name of the model used by this AI-assisted evaluator (e.g., `gpt-4o-mini`)
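
A missing variable otherwise surfaces only later as a `KeyError` in the initialization cell. The sketch below checks both variables up front; the helper name `missing_env_vars` is hypothetical and not part of any SDK.

```python
import os

# The two variables this notebook expects (names taken from the list above).
REQUIRED_ENV_VARS = ("AZURE_AI_PROJECT_ENDPOINT", "AZURE_AI_MODEL_DEPLOYMENT_NAME")


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables before running the notebook: " + ", ".join(missing))
```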

The Tool Input Accuracy evaluator performs a strict binary evaluation (PASS/FAIL) of parameters passed to tool calls. It ensures that ALL parameters meet ALL criteria:

- Parameter grounding: All parameters must be derived from conversation history/query
- Type compliance: All parameters must match exact types specified in tool definitions
- Format compliance: All parameters must follow exact format and structure requirements
- Completeness: All required parameters must be provided
- No unexpected parameters: Only defined parameters are allowed

The resulting score is binary:

- 1 (PASS): Only when ALL criteria are satisfied perfectly for ALL parameters
- 0 (FAIL): When ANY criterion fails for ANY parameter

This evaluation focuses on ensuring tool call parameters are completely correct without any tolerance for partial correctness.
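
To make the all-or-nothing rule concrete, here is a minimal local sketch of the three structural criteria (completeness, unexpected parameters, type compliance). It is not the evaluator's implementation: the evaluator is AI-assisted, and grounding and format compliance are not covered here. The names `check_arguments`, `binary_score`, and `weather_tool` are hypothetical.

```python
# Map JSON Schema type names to Python types (a small illustrative subset).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_arguments(arguments, tool_definition):
    """Return a list of problems found in the arguments of one tool call."""
    schema = tool_definition.get("parameters", {})
    properties = schema.get("properties", {})
    problems = []

    # Completeness: every required parameter must be present.
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required parameter: {name}")

    for name, value in arguments.items():
        # No unexpected parameters: only defined names are allowed.
        if name not in properties:
            problems.append(f"unexpected parameter: {name}")
            continue
        declared = properties[name].get("type")
        expected = JSON_TYPES.get(declared)
        if expected is None:
            continue  # unknown or unspecified type: nothing to check
        # bool is a subclass of int in Python, so reject it explicitly for other types.
        is_stray_bool = isinstance(value, bool) and declared != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(f"wrong type for parameter: {name}")
    return problems


def binary_score(arguments, tool_definition):
    """Strict scoring: 1 only if no problem was found, otherwise 0."""
    return 0 if check_arguments(arguments, tool_definition) else 1


weather_tool = {
    "name": "fetch_weather",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

print(binary_score({"location": "Seattle"}, weather_tool))  # 1
print(check_arguments({"city": "Seattle"}, weather_tool))   # two problems, so the score is 0
```

A single wrong parameter name fails two criteria at once: the required parameter is missing and an undefined one is present.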

Tool Input Accuracy requires the following inputs:
- Query - The original task request from the user. This can be a single query or a list of messages (the conversation history with the agent).
- Response - The response from the agent (or any GenAI app). This can be a single text response or a list of messages generated as part of the agent response. The evaluator extracts tool calls from the response.
- Tool Definitions - The definition(s) of the tool(s) available to the agent for answering the query. Required to validate parameter types and structures.
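
How the evaluator extracts tool calls internally is not shown in this notebook. As a rough illustration, assuming the message shape used in the samples (assistant messages whose `content` is a list of items with `"type": "tool_call"`), a hypothetical `extract_tool_calls` helper could look like this:

```python
def extract_tool_calls(response):
    """Collect tool_call content items from a response.

    Assumes the message shape used in this notebook's samples; a plain-string
    response carries no structured tool calls, so it yields an empty list.
    """
    if isinstance(response, str):
        return []
    calls = []
    for message in response:
        if message.get("role") != "assistant":
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue  # text-only message
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                calls.append(item)
    return calls


sample_response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_weather_123",
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"},
            }
        ],
    }
]

for call in extract_tool_calls(sample_response):
    print(call["name"], call["arguments"])
```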


### Initialize Tool Input Accuracy Evaluator


In [None]:
import os
from openai.types.evals.create_eval_jsonl_run_data_source_param import SourceFileContentContent
from pprint import pprint
from agent_utils import run_evaluator

# Get environment variables
deployment_name = os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"]

# Data source configuration (defines the schema for evaluation inputs)
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {
                "anyOf": [
                    {"type": "string"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            },
            "response": {
                "anyOf": [
                    {"type": "string"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            },
            "tool_definitions": {
                "anyOf": [
                    {"type": "object"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            }
        },
        "required": ["query", "response", "tool_definitions"]
    },
    "include_sample_schema": True
}

# Data mapping (maps evaluation inputs to evaluator parameters)
data_mapping = {
    "query": "{{item.query}}",
    "response": "{{item.response}}",
    "tool_definitions": "{{item.tool_definitions}}"
}

# Initialization parameters for the evaluator
initialization_parameters = {
    "deployment_name": deployment_name
}

# Initialize the evaluation_contents list - we'll append all test cases here
evaluation_contents = []
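
Every sample below repeats the same append boilerplate. One option is a small hypothetical helper, `make_item`, that mirrors the `anyOf` types from `item_schema` and returns a plain dict with the same `item` key. This assumes `SourceFileContentContent` is a `TypedDict`, so a plain dict has the same runtime shape; swap in `SourceFileContentContent(item=...)` if you prefer the typed form.

```python
def make_item(query, response, tool_definitions):
    """Build one evaluation entry after a basic type check against item_schema."""
    item = {
        "query": query,
        "response": response,
        "tool_definitions": tool_definitions,
    }
    # query and response: string or array, as declared in item_schema.
    for name in ("query", "response"):
        if not isinstance(item[name], (str, list)):
            raise TypeError(f"{name} must be a string or a list of messages")
    # tool_definitions: object or array, as declared in item_schema.
    if not isinstance(tool_definitions, (dict, list)):
        raise TypeError("tool_definitions must be a dict or a list of dicts")
    return {"item": item}


entry = make_item(
    "How is the weather in Seattle?",
    "Calling fetch_weather with location=\"Seattle\".",
    [{"name": "fetch_weather", "parameters": {"type": "object", "properties": {}}}],
)
print(sorted(entry["item"].keys()))
```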

### Samples

#### Evaluating Correct Tool Input Parameters

In [None]:
# Test Case 1: Correct tool input parameters
query = "How is the weather in Seattle?"
response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_weather_123",
                "name": "fetch_weather",
                "arguments": {"location": "Seattle"},
            }
        ],
    }
]

tool_definitions = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
            "required": ["location"]
        },
    }
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query,
            "response": response,
            "tool_definitions": tool_definitions
        }
    )
)

#### Response as String (str)

In [None]:
# Test Case 2: Response as string
query_str = "Check the weather in Miami"

# Response as a simple string containing tool call information
response_str = "I'll check the weather for you. Calling fetch_weather with location=\"Miami\"."

tool_definition = [{
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        "required": ["location"]
    },
}]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_str,
            "response": response_str,
            "tool_definitions": tool_definition
        }
    )
)

#### Response as a List and Tool Definition as Single Dict

In [None]:
# Test Case 3: Response as a list of messages; the single tool definition dict is wrapped in a one-element list here
query_dict = "What's the temperature in Boston?"
response_dict = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_boston_weather",
                "name": "fetch_weather",
                "arguments": {"location": "Boston"},
            }
        ],
    }
]

tool_definition_dict = [{
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        "required": ["location"]
    },
}]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_dict,
            "response": response_dict,
            "tool_definitions": tool_definition_dict
        }
    )
)

#### Complex Tool Parameters with Multiple Fields

In [None]:
# Test Case 4: Complex tool parameters with multiple fields
query_complex = "Can you send an email to john@example.com with the subject 'Weather Update' and tell him the weather in Seattle is 15°C and partly cloudy?"
response_complex = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_email_456",
                "name": "send_email",
                "arguments": {
                    "recipient": "john@example.com",
                    "subject": "Weather Update",
                    "body": "The weather in Seattle is 15°C and partly cloudy."
                },
            }
        ],
    }
]

tool_definitions_complex = [
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
            "required": ["recipient", "subject", "body"]
        },
    }
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_complex,
            "response": response_complex,
            "tool_definitions": tool_definitions_complex
        }
    )
)

#### Query as Conversation History (List of Messages)
The evaluator also accepts the query as a list of messages representing the conversation history. This helps validate that parameters are grounded in the conversation context.

In [None]:
# Test Case 5: Query as conversation history with multiple tools
query_conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can fetch weather information and send emails."
    },
    {
        "role": "user", 
        "content": "Hi, can you check the weather in Seattle for me?"
    },
    {
        "role": "user",
        "content": "Please send the results to admin@company.com with the subject 'Daily Weather Report'."
    }
]

response_conversation = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_weather_123",
                "name": "fetch_weather", 
                "arguments": {"location": "Seattle"},
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_email_456",
                "name": "send_email",
                "arguments": {
                    "recipient": "admin@company.com",
                    "subject": "Daily Weather Report",
                    "body": "Weather information for Seattle as requested."
                },
            }
        ],
    }
]

tool_definitions_conversation = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
            "required": ["location"]
        },
    },
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
            "required": ["recipient", "subject", "body"]
        },
    },
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_conversation,
            "response": response_conversation,
            "tool_definitions": tool_definitions_conversation
        }
    )
)
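
To illustrate what "grounded in the conversation" means, the sketch below runs a deliberately naive literal check: a string argument counts as grounded only if it appears verbatim (case-insensitively) in the conversation text. This is not how the evaluator works; the helper names are hypothetical, and the example mainly shows why literal matching is too strict for composed values.

```python
def conversation_text(query):
    """Flatten a query (string or list of messages) into one text blob."""
    if isinstance(query, str):
        return query
    parts = []
    for message in query:
        content = message.get("content")
        if isinstance(content, str):
            parts.append(content)
    return "\n".join(parts)


def ungrounded_arguments(arguments, query):
    """Return names of string arguments that do not appear literally in the query."""
    text = conversation_text(query).lower()
    return [
        name
        for name, value in arguments.items()
        if isinstance(value, str) and value.lower() not in text
    ]


history = [
    {"role": "user", "content": "Hi, can you check the weather in Seattle for me?"},
    {
        "role": "user",
        "content": "Please send the results to admin@company.com with the subject 'Daily Weather Report'.",
    },
]

print(ungrounded_arguments({"location": "Seattle"}, history))
print(
    ungrounded_arguments(
        {
            "recipient": "admin@company.com",
            "subject": "Daily Weather Report",
            "body": "Weather information for Seattle as requested.",
        },
        history,
    )
)
```

The recipient and subject are found verbatim, while the composed email body is flagged even though it is a reasonable paraphrase of the request. A literal check cannot judge such values, which is where a model-based judgment is needed.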

#### Example of Incorrect Tool Parameters

In [None]:
# Test Case 6: Incorrect tool parameters (should score 0)
query_incorrect = "How is the weather in Seattle?"
# The tool call passes an undefined parameter ("city") and omits the required "location"
response_incorrect = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_weather_123",
                "name": "fetch_weather",
                "arguments": {"city": "Seattle"},  # Wrong parameter name (should be "location")
            }
        ],
    }
]

tool_definitions_incorrect = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
            "required": ["location"]
        },
    }
]

# Append to evaluation_contents - This should score 0 due to incorrect parameter name
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_incorrect,
            "response": response_incorrect,
            "tool_definitions": tool_definitions_incorrect
        }
    )
)

### Execute Batch Evaluation
Run all test cases together using the batch evaluation approach.

In [None]:
results = run_evaluator(
    evaluator_name="tool_input_accuracy",
    evaluation_contents=evaluation_contents,
    data_source_config=data_source_config,
    initialization_parameters=initialization_parameters,
    data_mapping=data_mapping
)

### Results Display
Display the evaluation results for each test case.

In [None]:
pprint(results)