# Task Completion Evaluator

## Objective
This sample demonstrates how to use the Task Completion evaluator on agent data. The supported input formats include:
- simple data such as strings and `dict` objects describing task responses;
- user-agent conversations in the form of a list of agent messages.

## Time

You should expect to spend about 20 minutes running this notebook. 

## Before you begin
For quality evaluation, you need a deployed GPT model that supports JSON mode. We recommend using `gpt-4.1` or `gpt-4.1-mini`.

### Prerequisite
```bash
pip install azure-ai-projects azure-identity openai
```
Set these environment variables with your own values:
1) **AZURE_AI_PROJECT_ENDPOINT** - Your Azure AI project endpoint in format: `https://<account_name>.services.ai.azure.com/api/projects/<project_name>`
2) **AZURE_AI_MODEL_DEPLOYMENT_NAME** - The deployment name of the model used by this AI-assisted evaluator (e.g., `gpt-4.1-mini`)
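
Before running the cells below, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of any SDK.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("AZURE_AI_PROJECT_ENDPOINT", "AZURE_AI_MODEL_DEPLOYMENT_NAME")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables before continuing: " + ", ".join(missing))
```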

The Task Completion evaluator assesses whether an AI agent successfully completes the requested task by examining:
- Whether the task was fully completed
- Quality of task execution
- Appropriateness of the response to the original request

The evaluator uses a binary scoring system (0 or 1):

- Score 0: The task was not completed or only partially completed
- Score 1: The task was successfully and fully completed

This evaluation focuses on measuring whether the agent's response indicates successful completion of the user's request, regardless of the specific methods or tools used to achieve the task.
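
Because each score is either 0 or 1, the mean of the scores over a batch can be read as a task completion rate. The sketch below illustrates that arithmetic on hypothetical scores; `completion_rate` is an illustrative helper, and the list is not real evaluator output.

```python
def completion_rate(scores):
    """Fraction of test cases scored 1, given binary (0 or 1) scores."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    if any(score not in (0, 1) for score in scores):
        raise ValueError("scores must be 0 or 1")
    return sum(scores) / len(scores)


# Hypothetical scores for four test cases, not real evaluator output.
example_scores = [1, 1, 0, 1]
print(f"Task completion rate: {completion_rate(example_scores):.0%}")
```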

Task Completion requires the following inputs:
- Query - The original task request from the user. This can be a single query or a list of messages (the conversation history with the agent).
- Response - The response from the agent (or any GenAI app). This can be a single text response or a list of messages generated as part of the agent's response.
- Tool Definitions - (Optional) Definitions of the tools available to the agent for answering the query. Providing tool definitions helps the evaluator better understand the context and capabilities available to the agent.
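
The input shapes listed above can be checked locally before an item is submitted. This is a rough sketch, not the service's own validation: `validate_item` is a hypothetical helper, and treating `None` as "no tool definitions provided" is an assumption.

```python
def _is_text_or_messages(value):
    """True for a string or a list whose elements are all dicts."""
    if isinstance(value, str):
        return True
    return isinstance(value, list) and all(isinstance(m, dict) for m in value)


def validate_item(item):
    """Return a list of problems found in one evaluation item (empty if none)."""
    problems = []
    for field in ("query", "response"):
        if field not in item:
            problems.append(f"missing required field: {field}")
        elif not _is_text_or_messages(item[field]):
            problems.append(f"{field} must be a string or a list of message dicts")
    tools = item.get("tool_definitions")
    # Assumption: None means "no tool definitions provided".
    if tools is not None and not (
        isinstance(tools, dict)
        or (isinstance(tools, list) and all(isinstance(t, dict) for t in tools))
    ):
        problems.append("tool_definitions must be a dict or a list of dicts")
    return problems


print(validate_item({"query": "How is the weather in Seattle?", "response": "Partly cloudy."}))
print(validate_item({"query": "How is the weather in Seattle?"}))
```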


### Initialize Task Completion Evaluator


In [None]:
import os
from openai.types.evals.create_eval_jsonl_run_data_source_param import SourceFileContentContent
from pprint import pprint
from agent_utils import run_evaluator

# Get environment variables
deployment_name = os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"]

# Data source configuration (defines the schema for evaluation inputs)
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {
                "anyOf": [
                    {"type": "string"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            },
            "response": {
                "anyOf": [
                    {"type": "string"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            },
            "tool_definitions": {
                "anyOf": [
                    {"type": "object"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            }
        },
        "required": ["query", "response"]
    },
    "include_sample_schema": True
}

# Data mapping (maps evaluation inputs to evaluator parameters)
data_mapping = {
    "query": "{{item.query}}",
    "response": "{{item.response}}",
    "tool_definitions": "{{item.tool_definitions}}"
}

# Initialization parameters for the evaluator
initialization_parameters = {
    "deployment_name": deployment_name
}

# Initialize the evaluation_contents list - we'll append all test cases here
evaluation_contents = []

### Samples

#### Evaluating Simple Task Completion

In [None]:
# Test Case 1: Simple Task Completion - Basic weather query
query_simple = "How is the weather in Seattle?"
response_simple = "The current weather in Seattle is partly cloudy with a temperature of 15°C (59°F). There's a light breeze from the northwest at 8 mph, and the humidity is at 68%. No precipitation is expected for the rest of the day."

# Append to evaluation_contents - Basic evaluation without tool definitions
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_simple,
            "response": response_simple,
            "tool_definitions": None
        }
    )
)

#### Task Completion with Tool Context

In [None]:
# Test Case 2: Task Completion with Tool Context
query_tool_context = "How is the weather in Seattle?"
response_tool_context = "I've checked the weather for Seattle and found that it's currently partly cloudy with a temperature of 15°C. There's a light breeze and no rain expected today."

tool_definitions_context = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        },
    }
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_tool_context,
            "response": response_tool_context,
            "tool_definitions": tool_definitions_context
        }
    )
)

#### Task Completion with Tool Definition as Single Dict

In [None]:
# Test Case 3: Tool Definition as Single Dict
query_dict = "What's the current temperature in Boston?"
response_dict = "The current temperature in Boston is 22°C (72°F) with clear skies. It's a beautiful day with low humidity and a gentle breeze."

# A single tool definition passed as one dict rather than a list of dicts
tool_definition_dict = {
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
    },
}

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_dict,
            "response": response_dict,
            "tool_definitions": tool_definition_dict
        }
    )
)

#### Complex Task with Multiple Steps

In [None]:
# Test Case 4: Complex Task with Multiple Steps
query_complex = "Can you send me an email with weather information for Seattle?"
response_complex = [
    {
        "createdAt": "2025-03-26T17:27:35Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "I'll get the current weather information for Seattle and then send you an email with the details.",
            }
        ],
    },
    {
        "createdAt": "2025-03-26T17:27:42Z",
        "run_id": "run_zblZyGCNyx6aOYTadmaqM4QN",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "I have successfully sent you an email with the weather information for Seattle. The current weather is partly cloudy with a temperature of 15°C. You should receive the email shortly at your registered email address.",
            }
        ],
    },
]

tool_definitions_complex = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        },
    },
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
        },
    },
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_complex,
            "response": response_complex,
            "tool_definitions": tool_definitions_complex
        }
    )
)

#### Query as Conversation History (List of Messages)
The evaluator also accepts the query as a list of messages representing the conversation history. This lets it evaluate task completion in the context of a full conversation.
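
When the conversation is held in another form, such as `(role, content)` pairs, it can be converted into a message list of this shape. The helper below is a hypothetical sketch; the set of allowed roles is an assumption for illustration, not a documented constraint of the evaluator.

```python
# Assumption for this sketch: only these roles are expected.
ALLOWED_ROLES = {"system", "user", "assistant"}


def to_messages(turns):
    """Convert (role, content) pairs into a list of message dicts."""
    messages = []
    for role, content in turns:
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        messages.append({"role": role, "content": content})
    return messages


history = to_messages([
    ("system", "You are a helpful assistant that can fetch weather information."),
    ("user", "Can you check the weather in Seattle for me?"),
])
print(history)
```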

In [None]:
# Test Case 5: Query as Conversation History
query_conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can fetch weather information and send emails."
    },
    {
        "role": "user",
        "content": "Hi, I need to plan my day. Can you check the weather in Seattle for me?"
    },
    {
        "role": "user",
        "content": "Also, please send me an email summary of the weather so I can reference it later."
    }
]

response_conversation = "I've checked the weather in Seattle for you. It's currently 15°C and partly cloudy with light winds. I've also sent you an email summary with these details so you can reference them throughout your day. The task has been completed successfully."

tool_definitions_conversation = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "The location to fetch weather for."}},
        },
    },
    {
        "name": "send_email",
        "description": "Sends an email with the specified subject and body to the recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "recipient": {"type": "string", "description": "Email address of the recipient."},
                "subject": {"type": "string", "description": "Subject of the email."},
                "body": {"type": "string", "description": "Body content of the email."},
            },
        },
    },
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_conversation,
            "response": response_conversation,
            "tool_definitions": tool_definitions_conversation
        }
    )
)

#### Example of Incomplete Task

In [None]:
# Test Case 6: Incomplete Task Example
query_incomplete = "Can you send me an email with weather information for Seattle?"
response_incomplete = "I can see that you want weather information for Seattle. The weather there is usually quite nice this time of year, with temperatures ranging from mild to warm depending on the season."

# Append to evaluation_contents - This response doesn't complete the email sending task
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query_incomplete,
            "response": response_incomplete,
            "tool_definitions": None,
        }
    )
)

### Execute Batch Evaluation
Run all test cases together using the batch evaluation approach.
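
Before submitting the batch, it may be useful to inspect the items as JSON Lines, one test case per line. The sketch below assumes each entry behaves like a dict with an `item` key and uses hypothetical plain-dict stand-ins for `evaluation_contents`; it does not describe what `run_evaluator` does internally.

```python
import json


def to_jsonl(contents):
    """Serialize each entry's "item" payload as one JSON object per line."""
    return "\n".join(json.dumps(entry["item"], ensure_ascii=False) for entry in contents)


# Hypothetical stand-in for evaluation_contents, using plain dicts.
sample_contents = [
    {"item": {"query": "How is the weather in Seattle?", "response": "Partly cloudy, 15°C.", "tool_definitions": None}},
    {"item": {"query": "Send me an email.", "response": "I cannot send emails.", "tool_definitions": None}},
]
print(to_jsonl(sample_contents))
```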

In [None]:
results = run_evaluator(
    evaluator_name="task_completion",
    evaluation_contents=evaluation_contents,
    data_source_config=data_source_config,
    initialization_parameters=initialization_parameters,
    data_mapping=data_mapping
)

### Results Display
Display the evaluation results for each test case.

In [None]:
pprint(results)