# Task Adherence Evaluator

## Objective
This sample demonstrates how to use the Task Adherence evaluator on agent data. The supported input formats include:
- simple data such as strings;
- user-agent conversations in the form of a list of agent messages.

## Time

You should expect to spend about 10 minutes running this notebook. 

## Before you begin
For quality evaluation, you need to deploy a `gpt` model that supports JSON mode. We recommend `gpt-4o` or `gpt-4o-mini` for their strong reasoning capabilities.

### Prerequisite
```bash
pip install azure-ai-projects azure-identity azure-ai-evaluation
```
Set these environment variables with your own values:
1) **PROJECT_CONNECTION_STRING** - The project connection string, as found in the overview page of your Azure AI Foundry project.
2) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for this AI-assisted evaluator, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
3) **AZURE_OPENAI_ENDPOINT** - The Azure OpenAI endpoint to be used for evaluation.
4) **AZURE_OPENAI_API_KEY** - The Azure OpenAI key to be used for evaluation.
5) **AZURE_OPENAI_API_VERSION** - The Azure OpenAI API version to be used for evaluation.
6) **AZURE_SUBSCRIPTION_ID** - The Azure subscription ID of the Azure AI project.
7) **PROJECT_NAME** - The Azure AI project name.
8) **RESOURCE_GROUP_NAME** - The resource group name of the Azure AI project.
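
A missing variable otherwise only surfaces later as a `KeyError` in one of the cells, so it can help to check the environment up front. The following is a minimal sketch and not part of the original notebook: the helper name `missing_env_vars` is hypothetical, and the names in `REQUIRED_ENV_VARS` are simply copied from the list above.

```python
import os

# Variable names copied from the list above.
REQUIRED_ENV_VARS = [
    "PROJECT_CONNECTION_STRING",
    "MODEL_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_SUBSCRIPTION_ID",
    "PROJECT_NAME",
    "RESOURCE_GROUP_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_ENV_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

The check reports only variable names, never their values, so it is safe to run in a shared notebook.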


### Getting Started

The Task Adherence evaluator measures how well the agent adheres to its assigned task or predefined goal.

The scoring is on a 1-5 integer scale and is as follows:

  - Score 1: Fully Inadherent
  - Score 2: Barely Adherent
  - Score 3: Moderately Adherent
  - Score 4: Mostly Adherent
  - Score 5: Fully Adherent
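
To make the scale concrete, here is a small sketch that maps a score to its label and to a pass/fail verdict. It is an illustration rather than the evaluator's own logic: the helper name `describe_score` is hypothetical, the default threshold of 3 is taken from the sample outputs in this notebook, and the rule "pass when the score is at least the threshold" is an assumption inferred from those outputs (2 fails, 4 and 5 pass).

```python
# Labels copied from the scale above.
SCORE_LABELS = {
    1: "Fully Inadherent",
    2: "Barely Adherent",
    3: "Moderately Adherent",
    4: "Mostly Adherent",
    5: "Fully Adherent",
}


def describe_score(score, threshold=3):
    """Return (label, verdict) for a 1-5 score.

    Assumption: a score passes when it is greater than or equal to the threshold.
    """
    level = int(score)
    if level != score or level not in SCORE_LABELS:
        raise ValueError(f"expected an integer score from 1 to 5, got {score!r}")
    verdict = "pass" if level >= threshold else "fail"
    return SCORE_LABELS[level], verdict


print(describe_score(2.0))  # ('Barely Adherent', 'fail')
print(describe_score(4.0))  # ('Mostly Adherent', 'pass')
```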

The evaluation requires the following inputs:

  - `query`: The user query. Either a string with a user request or a list of messages with previous requests from the user and responses from the assistant, potentially including a system message.
  - `response`: The response to be evaluated. Either a string or a list of messages with the response from the agent to the last user query.

There is a third, optional input:
  - `tool_definitions`: The list of tool definitions the agent can call. This can help the evaluator assess whether the right tool was called to adhere to the user's intent.

In [1]:
%pip install azure-ai-projects azure-identity azure-ai-evaluation python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [2]:
from dotenv import load_dotenv
load_dotenv()

True

### Initialize Task Adherence Evaluator


In [3]:
import os
from azure.ai.evaluation import TaskAdherenceEvaluator, AzureOpenAIModelConfiguration
from pprint import pprint

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)
task_adherence_evaluator = TaskAdherenceEvaluator(model_config)

Class TaskAdherenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [4]:
print("AZURE_OPENAI_ENDPOINT=" + os.environ["AZURE_OPENAI_ENDPOINT"])
# Never print the key itself; only report whether it is set.
print("AZURE_OPENAI_API_KEY=" + ("<set>" if os.environ.get("AZURE_OPENAI_API_KEY") else "<not set>"))
print("AZURE_OPENAI_API_VERSION=" + os.environ["AZURE_OPENAI_API_VERSION"])
print("MODEL_DEPLOYMENT_NAME=" + os.environ["MODEL_DEPLOYMENT_NAME"])

AZURE_OPENAI_ENDPOINT=https://cvi-aie-wf-swedencentra-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=<set>
AZURE_OPENAI_API_VERSION=2024-12-01-preview
MODEL_DEPLOYMENT_NAME=gpt-4o


### Samples

#### Evaluating query and response as string

In [5]:
# Failure example, there's only a vague adherence to the task
result = task_adherence_evaluator(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response="Make sure to water your roses regularly and trim them occasionally.",
)
pprint(result)

{'task_adherence': 2.0,
 'task_adherence_reason': 'The response partially aligns with the query by '
                          'mentioning two relevant actions but fails to '
                          'provide sufficient detail or a comprehensive list '
                          'of best practices, making it barely adherent.',
 'task_adherence_result': 'fail',
 'task_adherence_threshold': 3}


In [6]:
# Success example: a detailed response that adheres closely to the task
result = task_adherence_evaluator(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response="For optimal summer care of your rose garden, start by watering deeply early in the morning to ensure the roots are well-hydrated without encouraging fungal growth. Apply a 2-3 inch layer of organic mulch around the base of the plants to conserve moisture and regulate soil temperature. Fertilize with a balanced rose fertilizer every 4 to 6 weeks to support healthy growth. Prune away any dead or diseased wood to promote good air circulation, and inspect regularly for pests such as aphids or spider mites, treating them promptly with an appropriate organic insecticidal soap. Finally, ensure that your roses receive at least 6 hours of direct sunlight daily for robust flowering.",
)
pprint(result)

{'task_adherence': 4.0,
 'task_adherence_reason': 'The response is highly accurate, clear, and adheres '
                          'closely to the instructions, with only minor room '
                          'for improvement.',
 'task_adherence_result': 'pass',
 'task_adherence_threshold': 3}
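
The result is a plain dictionary, so its fields can be read directly. As a minimal sketch (the helper name `summarize_result` is hypothetical; the key names are the ones shown in the outputs above), the following condenses a result into a single line:

```python
def summarize_result(result, metric="task_adherence"):
    """Format the score, verdict, and reason of one evaluator result as one line."""
    score = result[metric]
    verdict = result[f"{metric}_result"]
    reason = result[f"{metric}_reason"]
    return f"{metric}={score:g} ({verdict}): {reason}"


# Abbreviated copy of the failing result shown above.
example_result = {
    "task_adherence": 2.0,
    "task_adherence_reason": "The response partially aligns with the query.",
    "task_adherence_result": "fail",
    "task_adherence_threshold": 3,
}
print(summarize_result(example_result))
```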


#### Evaluating query and response as list of messages

In [7]:
query = [
    {"role": "system", "content": "You are an expert in literature and can provide book recommendations."},
    {
        "createdAt": "2025-03-14T08:00:00Z",
        "role": "user",
        "content": [
            {"type": "text", "text": "I love historical fiction. Can you recommend a good book from that genre?"}
        ],
    },
]

response = [
    {
        "createdAt": "2025-03-14T08:00:05Z",
        "role": "assistant",
        "content": [{"type": "text", "text": "Let me fetch a recommendation for historical fiction."}],
    },
    {
        "createdAt": "2025-03-14T08:00:10Z",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "tool_call_20250314_001",
                "name": "get_book",
                "arguments": {"genre": "historical fiction"},
            }
        ],
    },
    {
        "createdAt": "2025-03-14T08:00:15Z",
        "role": "tool",
        "tool_call_id": "tool_call_20250314_001",
        "content": [
            {
                "type": "tool_result",
                "tool_result": '{ "book": { "title": "The Pillars of the Earth", "author": "Ken Follett", "summary": "A captivating tale set in medieval England that weaves historical events with personal drama." } }',
            }
        ],
    },
    {
        "createdAt": "2025-03-14T08:00:20Z",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Based on our records, I recommend 'The Pillars of the Earth' by Ken Follett. This novel is an excellent example of historical fiction with a rich narrative and well-developed characters. Would you like more details or another suggestion?",
            }
        ],
    },
]

tool_definitions = [
    {
        "name": "get_book",
        "description": "Retrieve a book recommendation for a specified genre.",
        "parameters": {
            "type": "object",
            "properties": {
                "genre": {"type": "string", "description": "The genre for which a book recommendation is requested."}
            },
        },
    }
]

result = task_adherence_evaluator(
    query=query,
    response=response,
    tool_definitions=tool_definitions,
)
pprint(result)

{'task_adherence': 5.0,
 'task_adherence_reason': 'The response is clear, accurate, and fulfills the '
                          'query by providing a specific book recommendation '
                          'along with relevant details. It also offers to '
                          'provide further assistance, making it highly '
                          'adherent to the instructions.',
 'task_adherence_result': 'pass',
 'task_adherence_threshold': 3}
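
Before sending a conversation to the evaluator, it can be useful to confirm that every tool the agent called is actually declared in `tool_definitions`. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the message layout used in the cell above, where tool calls appear as content parts with `"type": "tool_call"` and a `"name"` field.

```python
def called_tool_names(messages):
    """Collect the names of all tool calls found in a list of messages."""
    names = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no tool calls
        for part in content:
            if isinstance(part, dict) and part.get("type") == "tool_call":
                names.append(part["name"])
    return names


def undeclared_tool_calls(messages, tool_definitions):
    """Return the called tool names that have no matching tool definition."""
    declared = {tool["name"] for tool in tool_definitions}
    return [name for name in called_tool_names(messages) if name not in declared]


# Trimmed-down versions of the response and tool definitions shown above.
example_response = [
    {"role": "assistant", "content": [{"type": "text", "text": "Let me fetch a recommendation."}]},
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "tool_call_20250314_001",
                "name": "get_book",
                "arguments": {"genre": "historical fiction"},
            }
        ],
    },
]
example_tools = [{"name": "get_book", "description": "Retrieve a book recommendation for a specified genre."}]

print(undeclared_tool_calls(example_response, example_tools))  # []
```

An empty list means every call has a definition; any name in the list points to a tool definition that is missing from the evaluator input.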


## Batch evaluate and visualize results on Azure AI Foundry
Use batch evaluation to run the evaluator asynchronously over a dataset.

Optionally, open the AI Foundry URL printed at the end of the run to visualize the results in Azure AI Foundry. There you can inspect the evaluation scores and reasoning to quickly identify issues with your agent and fix them. Make sure to authenticate to Azure using `az login` in your terminal before running the batch evaluation cell.
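
The batch run reads its data from a JSONL file, that is, one JSON object per line. The sketch below shows one way such a file could be produced and read back. It is not the converter used to create the sample file: the helper names are hypothetical, and the assumption that each row carries `query` and `response` keys (matching the evaluator's keyword arguments) should be checked against your own data.

```python
import json


def rows_to_jsonl(rows):
    """Serialize a list of dicts to JSONL text, one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def jsonl_to_rows(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Assumed row shape: keys named after the evaluator inputs.
example_rows = [
    {
        "query": "What are the best practices for maintaining a healthy rose garden during the summer?",
        "response": "Make sure to water your roses regularly and trim them occasionally.",
    },
]

jsonl_text = rows_to_jsonl(example_rows)
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file gives a dataset in the same line-per-run layout that the batch evaluation cell expects.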

In [8]:
from azure.ai.evaluation import evaluate

# This sample file contains the evaluation data in JSONL format, where each line is one agent run.
# The data was saved from an agent thread using a converter.
file_name = "evaluation_data.jsonl"

response = evaluate(
    data=file_name,
    evaluation_name="Task Adherence Evaluation",
    evaluators={
        "task_adherence": task_adherence_evaluator,
    },
    azure_ai_project={
        # These names match the environment variables listed under Prerequisite.
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    },
)
print(f'AI Foundry URL: {response.get("studio_url")}')

[2025-06-04 14:26:40 -0700][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_task_adherence_20250604_142640_832291, log path: /Users/cv/.promptflow/.runs/azure_ai_evaluation_evaluators_task_adherence_20250604_142640_832291/logs.txt


2025-06-04 14:26:40 -0700   81861 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-06-04 14:26:43 -0700   81861 execution.bulk     INFO     Finished 1 / 3 lines.
2025-06-04 14:26:43 -0700   81861 execution.bulk     INFO     Average execution time for completed lines: 2.21 seconds. Estimated time for incomplete lines: 4.42 seconds.
2025-06-04 14:26:43 -0700   81861 execution.bulk     INFO     Finished 2 / 3 lines.
2025-06-04 14:26:43 -0700   81861 execution.bulk     INFO     Average execution time for completed lines: 1.15 seconds. Estimated time for incomplete lines: 1.15 seconds.
2025-06-04 14:26:43 -0700   81861 execution.bulk     INFO     Finished 3 / 3 lines.
2025-06-04 14:26:43 -0700   81861 execution.bulk     INFO     Average execution time for completed lines: 0.84 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_task_adherence_20250604_142640_832291"
Run stat