# Task Adherence Evaluator

## Objective
This sample demonstrates how to use the Task Adherence evaluator on agent data. The supported input formats include:
- simple data such as strings;
- user-agent conversations in the form of a list of agent messages.

## Time

You should expect to spend about 10 minutes running this notebook. 

## Before you begin
For quality evaluation, you need a deployed GPT model that supports JSON mode. We recommend using `gpt-4o` or `gpt-4.1`.

### Prerequisite
```bash
pip install azure-ai-projects azure-identity openai
```
Set these environment variables with your own values:
1) **AZURE_AI_PROJECT_ENDPOINT** - Your Azure AI project endpoint in the format: `https://<account_name>.services.ai.azure.com/api/projects/<project_name>`
2) **AZURE_AI_MODEL_DEPLOYMENT_NAME** - The deployment name of the model for this AI-assisted evaluator (e.g., gpt-4o-mini)
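
Before running the cells below, it can help to confirm that both variables are actually set. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of the sample or any SDK.

```python
import os

# The two variables this notebook expects, as listed above.
REQUIRED_ENV_VARS = ("AZURE_AI_PROJECT_ENDPOINT", "AZURE_AI_MODEL_DEPLOYMENT_NAME")


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

Failing early here gives a clearer message than the `KeyError` that `os.environ[...]` would raise later in the notebook.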

The Task Adherence evaluator measures how well the agent adheres to its assigned task or predefined goal.

The scoring is on a 1-5 integer scale and is as follows:

  - Score 1: Fully Inadherent
  - Score 2: Barely Adherent
  - Score 3: Moderately Adherent
  - Score 4: Mostly Adherent
  - Score 5: Fully Adherent
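
When reading results, the scale above can be kept at hand as a simple lookup table. This is only an illustrative sketch: `SCORE_LABELS` and `describe_score` are hypothetical names, and the shape of the evaluator's actual output is not assumed here.

```python
# Labels copied from the 1-5 scale described above.
SCORE_LABELS = {
    1: "Fully Inadherent",
    2: "Barely Adherent",
    3: "Moderately Adherent",
    4: "Mostly Adherent",
    5: "Fully Adherent",
}


def describe_score(score):
    """Return a readable label for an integer score in the range 1-5."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in SCORE_LABELS:
        raise ValueError(f"expected an integer from 1 to 5, got {score!r}")
    return f"Score {score}: {SCORE_LABELS[score]}"
```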

The evaluation requires the following inputs:

  - Query    : The user query. Either a string with a user request or a list of messages with previous requests from the user and responses from the assistant, potentially including a system message.
  - Response : The response to be evaluated. Either a string or a list of messages with the agent's response to the last user query.

There is a third optional parameter:
  - ToolDefinitions : The list of tool definitions the agent can call. This may be useful for the evaluator to better assess if the right tool was called to adhere to user intent.
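
The input requirements above can be sketched as a small local check to run on an item before submitting it. This is not part of the evaluator or any SDK: `check_item` is a hypothetical helper, and the accepted types (string or list for `query` and `response`, dict or list for `tool_definitions`) are assumptions based on the sample data in this notebook.

```python
def check_item(item):
    """Return a list of problems found in an evaluation item (empty if none)."""
    problems = []

    # query and response are required; each is a string or a list of messages.
    for key in ("query", "response"):
        value = item.get(key)
        if value is None:
            problems.append(f"missing required field: {key}")
        elif not isinstance(value, (str, list)):
            problems.append(f"{key} must be a string or a list of messages")

    # tool_definitions is optional; None means no tools were supplied.
    tools = item.get("tool_definitions")
    if tools is not None and not isinstance(tools, (dict, list)):
        problems.append("tool_definitions must be a dict or a list of dicts")

    return problems
```

For example, `check_item({"query": "q", "response": "r"})` returns an empty list, while an item without `response` reports the missing field.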

### Initialize Task Adherence Evaluator


In [None]:
import os
from openai.types.evals.create_eval_jsonl_run_data_source_param import SourceFileContentContent
from pprint import pprint
from agent_utils import run_evaluator

# Get environment variables
deployment_name = os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"]

# Data source configuration (defines the schema for evaluation inputs)
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "query": {
                "anyOf": [
                    {"type": "string"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            },
            "response": {
                "anyOf": [
                    {"type": "string"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            },
            "tool_definitions": {
                "anyOf": [
                    {"type": "object"},
                    {
                        "type": "array",
                        "items": {
                            "type": "object"
                        }
                    }
                ]
            }
        },
        "required": ["query", "response"]
    },
    "include_sample_schema": True
}

# Data mapping (maps evaluation inputs to evaluator parameters)
data_mapping = {
    "query": "{{item.query}}",
    "response": "{{item.response}}",
    "tool_definitions": "{{item.tool_definitions}}"
}

# Initialization parameters for the evaluator
initialization_parameters = {
    "deployment_name": deployment_name
}

# Initialize the evaluation_contents list - we'll append all test cases here
evaluation_contents = []

### Samples

#### Evaluating query and response as string

In [None]:
# Test Case 1: Vague adherence example (should score low)
query1 = "What are the best practices for maintaining a healthy rose garden during the summer?"
response1 = "Make sure to water your roses regularly and trim them occasionally."

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query1,
            "response": response1,
            "tool_definitions": None,
        }
    )
)

In [None]:
# Test Case 2: Full adherence example (should score high)
query2 = "What are the best practices for maintaining a healthy rose garden during the summer?"
response2 = "For optimal summer care of your rose garden, start by watering deeply early in the morning to ensure the roots are well-hydrated without encouraging fungal growth. Apply a 2-3 inch layer of organic mulch around the base of the plants to conserve moisture and regulate soil temperature. Fertilize with a balanced rose fertilizer every 4 to 6 weeks to support healthy growth. Prune away any dead or diseased wood to promote good air circulation, and inspect regularly for pests such as aphids or spider mites, treating them promptly with an appropriate organic insecticidal soap. Finally, ensure that your roses receive at least 6 hours of direct sunlight daily for robust flowering."

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query2,
            "response": response2,
            "tool_definitions": None,
        }
    )
)

#### Query and Response as String with Tool Definition as Single Dict

In [None]:
# Test Case 3: Book recommendation with tool definition as single dict
query3 = "Can you recommend a science fiction book for me?"
response3 = "I found 'Dune' by Frank Herbert for you. It's a classic science fiction novel set in a distant future."

# A single tool definition passed as a dict rather than a list;
# the item schema above accepts either form.
tool_definition_dict = {
    "name": "get_book",
    "description": "Retrieve a book recommendation for a specified genre.",
    "parameters": {
        "type": "object",
        "properties": {
            "genre": {"type": "string", "description": "The genre for which a book recommendation is requested."}
        },
    },
}

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query3,
            "response": response3,
            "tool_definitions": tool_definition_dict
        }
    )
)

#### Evaluating query and response as list of messages

In [None]:
# Test Case 4: Historical fiction recommendation with conversation messages and tool usage
query4 = [
    {"role": "system", "content": "You are an expert in literature and can provide book recommendations."},
    {
        "createdAt": "2025-03-14T08:00:00Z",
        "role": "user",
        "content": [
            {"type": "text", "text": "I love historical fiction. Can you recommend a good book from that genre?"}
        ],
    },
]

response4 = [
    {
        "createdAt": "2025-03-14T08:00:05Z",
        "role": "assistant",
        "content": [{"type": "text", "text": "Let me fetch a recommendation for historical fiction."}],
    },
    {
        "createdAt": "2025-03-14T08:00:10Z",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "tool_call_20250314_001",
                "name": "get_book",
                "arguments": {"genre": "historical fiction"},
            }
        ],
    },
    {
        "createdAt": "2025-03-14T08:00:15Z",
        "role": "tool",
        "tool_call_id": "tool_call_20250314_001",
        "content": [
            {
                "type": "tool_result",
                "tool_result": '{ "book": { "title": "The Pillars of the Earth", "author": "Ken Follett", "summary": "A captivating tale set in medieval England that weaves historical events with personal drama." } }',
            }
        ],
    },
    {
        "createdAt": "2025-03-14T08:00:20Z",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Based on our records, I recommend 'The Pillars of the Earth' by Ken Follett. This novel is an excellent example of historical fiction with a rich narrative and well-developed characters. Would you like more details or another suggestion?",
            }
        ],
    },
]

tool_definitions4 = [
    {
        "name": "get_book",
        "description": "Retrieve a book recommendation for a specified genre.",
        "parameters": {
            "type": "object",
            "properties": {
                "genre": {"type": "string", "description": "The genre for which a book recommendation is requested."}
            },
        },
    }
]

# Append to evaluation_contents
evaluation_contents.append(
    SourceFileContentContent(
        item={
            "query": query4,
            "response": response4,
            "tool_definitions": tool_definitions4
        }
    )
)

### Run Evaluation on All Test Cases

Now that we've defined all test cases, let's run the evaluation once on all of them.

In [None]:
results = run_evaluator(
    evaluator_name="task_adherence",
    evaluation_contents=evaluation_contents,
    data_source_config=data_source_config,
    initialization_parameters=initialization_parameters,
    data_mapping=data_mapping
)

### Display Results

View the evaluation results for each test case.

In [None]:
pprint(results)