# Evaluate Semantic Kernel AI (ChatCompletion) Agents in Azure AI Foundry

## Objective

This sample demonstrates how to evaluate Semantic Kernel AI ChatCompletionAgents in Azure AI Foundry. It provides a step-by-step guide to set up the environment, create an agent, and evaluate its performance.

## Time
You can expect to complete this sample in approximately 20 minutes.

## Prerequisites
### Packages
- `semantic-kernel` installed (`pip install semantic-kernel`)
- `azure-ai-evaluation` SDK installed *(install the latest version with `pip install --upgrade azure-ai-evaluation` so that the SK converter is included)*
- An Azure OpenAI resource with a deployment configured

### Environment Variables
- For `AzureChatCompletion`: *(used by the SK agent to generate the conversation)*
  - `api_key` The API key to access your Azure OpenAI resource.
  - `chat_deployment_name` The name of the chat model deployment (e.g., gpt-4-32k) used by your agent.
  - `endpoint` The full endpoint URL of your Azure OpenAI resource (e.g., https://your-resource.openai.azure.com).
- For evaluating agents:
  - `AZURE_OPENAI_ENDPOINT` The base endpoint for Azure OpenAI (e.g., https://<resource>.openai.azure.com).
  - `AZURE_OPENAI_API_KEY` The API key for your Azure OpenAI resource.
  - `AZURE_OPENAI_API_VERSION` The API version to use (e.g., 2024-05-01-preview).
  - `MODEL_DEPLOYMENT_NAME` The model deployment name used by the agent or evaluation flow (e.g., gpt-4).
- For Azure AI Foundry (Bonus):
  - `AZURE_SUBSCRIPTION_ID` Your Azure subscription ID.
  - `PROJECT_NAME` The name of your Azure AI Foundry project.
  - `RESOURCE_GROUP_NAME` The name of the resource group containing your Azure AI project.
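
Before running the cells below, it can help to confirm that the variables are actually set. The following is a minimal sketch, not part of the original sample: it checks only the upper-case names listed above for evaluation and for Azure AI Foundry, and the helper name `find_missing_env_vars` is hypothetical.

```python
import os

# Variable names copied from the lists above, grouped by the step that needs them.
REQUIRED_ENV_VARS = {
    "evaluation": [
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_VERSION",
        "MODEL_DEPLOYMENT_NAME",
    ],
    "foundry": [
        "AZURE_SUBSCRIPTION_ID",
        "PROJECT_NAME",
        "RESOURCE_GROUP_NAME",
    ],
}


def find_missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return {group: [names]} for every group with unset or empty variables."""
    environ = os.environ if environ is None else environ
    missing = {}
    for group, names in required.items():
        absent = [name for name in names if not environ.get(name)]
        if absent:
            missing[group] = absent
    return missing


# Report anything that is missing in the current process environment.
for group, names in find_missing_env_vars().items():
    print(f"Missing for {group}: {', '.join(names)}")
```

If you keep the values in a `.env` file, call `dotenv.load_dotenv()` first so the check sees them.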

### Create an AzureChatCompletion service - [reference](https://learn.microsoft.com/en-us/semantic-kernel/concepts/ai-services/chat-completion/?tabs=csharp-AzureOpenAI%2Cpython-AzureOpenAI%2Cjava-AzureOpenAI&pivots=programming-language-python)

In [2]:
import dotenv
dotenv.load_dotenv()

True

In [3]:
import os

print(os.getenv("AZURE_OPENAI_API_VERSION"))

2024-12-01-preview


In [4]:
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

# You can do the following if you have set the necessary environment variables or created a .env file
chat_completion_service = AzureChatCompletion(service_id="my-service-id")

### Create a ChatCompletionAgent - [reference](https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-types/chat-completion-agent?pivots=programming-language-python)

In [5]:
from sk_tools import AgentToolsPlugin
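
The `sk_tools` module is a local file that this sample does not show. As a rough illustration only, the sketch below guesses at the logic behind three of the tools named in the examples further down. The class name `SketchToolsPlugin` and all method signatures are hypothetical. In an actual Semantic Kernel plugin, each method would typically also carry the `kernel_function` decorator from `semantic_kernel.functions`, which is omitted here so the sketch runs without extra packages.

```python
class SketchToolsPlugin:
    """Hypothetical stand-in for AgentToolsPlugin; plain Python, no SK decorators."""

    def calculate_sum(self, a: int, b: int) -> int:
        """Return the sum of two integers."""
        return a + b

    def convert_temperature(self, celsius: float) -> float:
        """Convert a temperature from Celsius to Fahrenheit."""
        return celsius * 9 / 5 + 32

    def merge_dicts(self, first: dict, second: dict) -> dict:
        """Merge two dictionaries; values from `second` win on key conflicts."""
        return {**first, **second}


tools = SketchToolsPlugin()
print(tools.calculate_sum(103, 47))
print(tools.convert_temperature(30))
print(tools.merge_dicts({"x": 10, "y": 20}, {"y": 30, "z": 40}))
```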

In [10]:
from semantic_kernel.agents import ChatCompletionAgent

# Create the agent by directly providing the chat completion service
agent = ChatCompletionAgent(
    service=chat_completion_service,
    name="Assistant",
    instructions="You are a helpful assistant",
    plugins=[AgentToolsPlugin()],
)

In [11]:
# Examples
bugbash_examples = {
    "fetch_current_datetime": [
        "What is the current date and time?",
        "Give me the current time in format '%A, %d %B %Y %I:%M %p'."
    ],
    "fetch_weather": [
        "What's the weather like in Tokyo?",
        "Can you provide the weather information for London?",
        "Check the weather for Cairo."
    ],
    "send_email": [
        "Send an email to jane.doe@example.com with the subject 'Lunch Plans' and body 'Shall we grab lunch at 12:30 today?'",
        "Email to manager@company.com with subject 'Weekly Update' and body 'The sprint is on track, and all tasks are green.'"
    ],
    "send_email_using_recipient_name": [
        "Send an email to Ahmed with subject 'Bugbash Follow-up' and body 'Thanks for your participation!'",
        "Email Sarah with subject 'Meeting Notes' and body 'I've attached the minutes of today's meeting.'"
    ],
    "calculate_sum": [
        "What's the sum of 103 and 47?",
        "Add 999 and 1."
    ],
    "convert_temperature": [
        "Convert 30 degrees Celsius to Fahrenheit.",
        "What's 0°C in Fahrenheit?"
    ],
    "toggle_flag": [
        "Toggle the flag True.",
        "If a value is False, what will it be if toggled?"
    ],
    "merge_dicts": [
        "Merge these two dictionaries: {'team': 'AI'} and {'members': 5}.",
        "Combine {'x': 10, 'y': 20} with {'y': 30, 'z': 40}."
    ],
    "get_user_info": [
        "Retrieve user information for user ID 2.",
        "What are the details for user ID 99?"
    ],
    "longest_word_in_sentences": [
        "Find the longest word in each of these sentences: ['The sky was clear and blue', 'Programming is fun'].",
        "Give me the longest word per sentence: ['Azure is scalable', 'Pythonic code is clean and readable']."
    ],
    "process_records": [
        "Process the following records: [{'math': 95, 'science': 85}, {'art': 70, 'music': 80}].",
        "Sum the values in each record: [{'a': 5, 'b': 10, 'c': 20}, {'x': 1, 'y': 2}]"
    ]
}

bugbash_topics = list(bugbash_examples.keys())

In [12]:
import random
topic = random.choice(bugbash_topics)
print(f"Selected topic: {topic}")
examples = bugbash_examples[topic]

Selected topic: fetch_weather


In [None]:
thread = None

user_inputs = ["Hello"] + examples + ["Thank you"]

for user_input in user_inputs:
    response = await agent.get_response(messages=user_input, thread=thread)
    print(f"## User: {user_input}")
    print(f"## {response.name}: {response}\n")
    thread = response.thread

## User: Hello
## Assistant: Hello! How can I assist you today?

## User: What's the weather like in Tokyo?
## Assistant: The weather in Tokyo is rainy, with a temperature of 22°C.

## User: Can you provide the weather information for London?
## Assistant: The weather in London is cloudy, with a temperature of 18°C.

## User: Check the weather for Cairo.
## Assistant: I'm sorry, but the weather data is not available for Cairo at the moment.

## User: Thank you
## Assistant: You're welcome! If you have any more questions or need further assistance, feel free to ask.



### Converter

In [14]:
from azure.ai.evaluation import SKAgentConverter

# Get the available turn indices for the thread,
# useful for selecting a specific turn for evaluation
turn_indices = await SKAgentConverter._get_thread_turn_indices(thread=thread)
print(f"Available turn indices: {turn_indices}")

Available turn indices: [0, 1, 2, 3, 4]


In [15]:
converter = SKAgentConverter()

# Get a single agent run data
evaluation_data_single_run = await converter.convert(
    thread=thread,
    turn_index=1, # Specify the turn index you want to evaluate
    agent=agent # Pass it to include the instructions and plugins in the evaluation data
)

Class SKAgentConverter: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [17]:
import json

file_name = "evaluation_data.jsonl"
# Save the agent thread data to a JSONL file (all turns)
evaluation_data = await converter.prepare_evaluation_data(threads=[thread], filename=file_name, agent=agent)
# print(json.dumps(evaluation_data, indent=4))
len(evaluation_data) # number of turns in the thread

5
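
To check what was written to the JSONL file, you can read it back with the standard library. This is a minimal sketch: the helper names `load_jsonl` and `summarize_rows` are hypothetical, and the only assumption about the file is that each non-empty line holds one JSON object (the single-run data above has the keys `query`, `response`, and `tool_definitions`).

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def summarize_rows(rows):
    """Return the sorted top-level keys of each row."""
    return [sorted(row.keys()) for row in rows]


# Only inspect the file if an earlier cell has already created it.
if Path("evaluation_data.jsonl").exists():
    for index, keys in enumerate(summarize_rows(load_jsonl("evaluation_data.jsonl"))):
        print(f"row {index}: {keys}")
```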

### Setting up evaluator

We will select the following evaluators to assess the different aspects relevant for agent quality: 

- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent to which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools and to pass the correct parameters from previous steps. Scale: float 0-1. Higher is better.
- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.


In [18]:
import os
from pprint import pprint

from azure.ai.evaluation import (
    ToolCallAccuracyEvaluator,
    AzureOpenAIModelConfiguration,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
)

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

Class IntentResolutionEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Class ToolCallAccuracyEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class TaskAdherenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


In [19]:
# Test a single evaluation run
evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

# evaluation_data_single_run.keys() # query, response, tool_definitions
res = evaluator(**evaluation_data_single_run)
print(json.dumps(res, indent=4))

{
    "tool_call_accuracy": 1.0,
    "tool_call_accuracy_result": "pass",
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's inquiry about the weather in Tokyo, uses appropriate parameters that match the TOOL DEFINITION, and the parameter values are correctly inferred from the CONVERSATION. Thus, it is likely to be very useful in advancing the conversation.",
            "tool_call_id": "call_UNFfTMKX4PdWYzDDp0PLYhh5"
        }
    ]
}
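
When a result contains many tool calls, a one-line summary is easier to scan than the full JSON. The sketch below works purely on the dictionary shape printed above. The helper names `failed_tool_calls` and `describe` are hypothetical, and the sample dictionary repeats the output shown here with the reason text shortened.

```python
def failed_tool_calls(result):
    """Return (tool_call_id, reason) pairs for tool calls judged inaccurate."""
    return [
        (detail.get("tool_call_id"), detail.get("tool_call_accurate_reason"))
        for detail in result.get("per_tool_call_details", [])
        if not detail.get("tool_call_accurate")
    ]


def describe(result):
    """Summarize score, threshold, verdict, and number of failed tool calls."""
    score = result.get("tool_call_accuracy")
    threshold = result.get("tool_call_accuracy_threshold")
    verdict = result.get("tool_call_accuracy_result")
    failed = len(failed_tool_calls(result))
    return f"score={score} threshold={threshold} result={verdict} failed_calls={failed}"


# Same shape as the output above; the reason text is shortened for brevity.
sample = {
    "tool_call_accuracy": 1.0,
    "tool_call_accuracy_result": "pass",
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": True,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant...",
            "tool_call_id": "call_UNFfTMKX4PdWYzDDp0PLYhh5",
        }
    ],
}
print(describe(sample))
```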


#### Bonus - run on previously saved file for all turns

In [None]:
from azure.ai.evaluation import evaluate

response = evaluate(
    data=file_name,
    evaluators={
        "tool_call_accuracy": tool_call_accuracy,
        "intent_resolution": intent_resolution,
        "task_adherence": task_adherence,
    },
    azure_ai_project={
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    },
)

print(f'AI Foundry URL: {response.get("studio_url")}')

## Inspect results on Azure AI Foundry

Open the AI Foundry URL printed above to explore the evaluation results in Azure AI Foundry. The visualizations show each evaluator's scores and reasoning, which helps you quickly identify issues in your agent and fix them.

In [None]:
# alternatively, you can use the following to get the evaluation results in memory

# average scores across all runs
pprint(response["metrics"])
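
For a more compact view than `pprint`, the metrics can be printed as an aligned, sorted table. This is a minimal sketch that assumes only that `response["metrics"]` is a dictionary mapping metric names to values. The helper name `format_metrics` and the sample metric names and numbers are hypothetical, chosen only to show the layout.

```python
def format_metrics(metrics):
    """Return aligned 'name  value' lines for numeric metrics, sorted by name."""
    numeric = {
        name: value
        for name, value in metrics.items()
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    if not numeric:
        return []
    width = max(len(name) for name in numeric)
    return [f"{name:<{width}}  {numeric[name]:.3f}" for name in sorted(numeric)]


# Hypothetical metric names and values, used only to demonstrate the formatting.
example_metrics = {"b.score": 4.0, "a.score": 0.75, "note": "n/a"}
for line in format_metrics(example_metrics):
    print(line)
```

In the notebook you would call `format_metrics(response["metrics"])` instead. Keep in mind that the evaluators use different scales (1-5 and 0-1), so the values are not directly comparable across metrics.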