# Evaluate Semantic Kernel AI (ChatCompletion) Agents in Azure AI Foundry

## Objective

This sample demonstrates how to evaluate Semantic Kernel AI ChatCompletionAgents in Azure AI Foundry. It provides a step-by-step guide to set up the environment, create an agent, and evaluate its performance.

## Time
You can expect to complete this sample in approximately 20 minutes.

## Prerequisites
### Packages
- `semantic-kernel` installed (`pip install semantic-kernel`)
- `azure-ai-evaluation` SDK installed
- An Azure OpenAI resource with a deployment configured

Before running the sample:
```bash
pip install semantic-kernel azure-ai-projects azure-identity azure-ai-evaluation
```

### Environment Variables
- For **AzureChatService** (Semantic Kernel Agent):
  - **`api_key`** – Azure OpenAI API key used by the agent.
  - **`chat_deployment_name`** – Name of the deployed chat model (e.g., `gpt-35-turbo`) used by the agent.
  - **`endpoint`** – Azure OpenAI endpoint URL (e.g., `https://<your-resource>.openai.azure.com/`).
- For **LLM Evaluation**:
  - **`AZURE_OPENAI_ENDPOINT`** – Azure OpenAI endpoint to be used by the evaluation LLM.
  - **`AZURE_OPENAI_API_KEY`** – Azure OpenAI API key for evaluation.
  - **`AZURE_OPENAI_API_VERSION`** – API version (e.g., `2024-05-01-preview`) for the evaluation LLM.
  - **`MODEL_DEPLOYMENT_NAME`** – Deployment name of the model used for evaluation, as found under the "Name" column in the "Models + endpoints" tab in your Azure AI Foundry project.
- For **Azure AI Foundry** (Bonus):
  - **`AZURE_SUBSCRIPTION_ID`** – Your Azure subscription ID where the AI Foundry project is hosted.
  - **`PROJECT_NAME`** – Name of the Azure AI Foundry project.
  - **`RESOURCE_GROUP_NAME`** – Resource group containing your AI Foundry project.
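
Before running the cells below, it can help to confirm that the variables are actually set. The following is a minimal sketch, not part of the sample itself: `REQUIRED_ENV_VARS` and `find_missing_env_vars` are hypothetical helper names, and the agent settings are left out because how they are supplied depends on how you configure `AzureChatCompletion`.

```python
import os

# Names copied from the lists above; the grouping only makes the messages readable.
# REQUIRED_ENV_VARS and find_missing_env_vars are illustrative helpers, not SDK APIs.
REQUIRED_ENV_VARS = {
    "evaluation LLM": [
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_VERSION",
        "MODEL_DEPLOYMENT_NAME",
    ],
    "Azure AI Foundry (bonus)": [
        "AZURE_SUBSCRIPTION_ID",
        "PROJECT_NAME",
        "RESOURCE_GROUP_NAME",
    ],
}


def find_missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return {group: [missing names]} for variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    missing = {}
    for group, names in required.items():
        absent = [name for name in names if not environ.get(name)]
        if absent:
            missing[group] = absent
    return missing


# Report anything missing in the current process environment.
for group, names in find_missing_env_vars().items():
    print(f"Missing for {group}: {', '.join(names)}")
```

If nothing is printed, every variable in both groups is set to a non-empty value.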

### Create an AzureChatCompletion service - [reference](https://learn.microsoft.com/en-us/semantic-kernel/concepts/ai-services/chat-completion/?tabs=csharp-AzureOpenAI%2Cpython-AzureOpenAI%2Cjava-AzureOpenAI&pivots=programming-language-python)

In [None]:
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

# You can do the following if you have set the necessary environment variables or created a .env file
chat_completion_service = AzureChatCompletion(service_id="my-service-id")

### Create a ChatCompletionAgent - [reference](https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-types/chat-completion-agent?pivots=programming-language-python)

In [None]:
from semantic_kernel.functions import kernel_function
from typing import Annotated


# This is a sample plugin that provides tools
class MenuPlugin:
    """A sample Menu Plugin used for the concept sample."""

    @kernel_function(description="Provides a list of specials from the menu.")
    def get_specials(self) -> Annotated[str, "Returns the specials from the menu."]:
        return """
        Special Soup: Clam Chowder
        Special Salad: Cobb Salad
        Special Drink: Chai Tea
        """

    @kernel_function(description="Provides the price of the requested menu item.")
    def get_item_price(
        self, menu_item: Annotated[str, "The name of the menu item."]
    ) -> Annotated[str, "Returns the price of the menu item."]:
        _ = menu_item  # This is just to simulate a function that uses the input.
        return "$9.99"

In [None]:
from semantic_kernel.agents import ChatCompletionAgent

# Create the agent by directly providing the chat completion service
agent = ChatCompletionAgent(
    service=chat_completion_service,
    name="Chef",
    instructions="Answer questions about the menu.",
    plugins=[MenuPlugin()],
)

In [None]:
thread = None

user_inputs = [
    "Hello",
    "What is the special drink today?",
    "What does that cost?",
    "Thank you",
]

for user_input in user_inputs:
    response = await agent.get_response(messages=user_input, thread=thread)
    print(f"## User: {user_input}")
    print(f"## {response.name}: {response}\n")
    thread = response.thread

### Converter

In [None]:
from azure.ai.evaluation import SKAgentConverter

# Get the available turn indices for the thread,
# useful for selecting a specific turn for evaluation
turn_indices = await SKAgentConverter._get_thread_turn_indices(thread=thread)
print(f"Available turn indices: {turn_indices}")

In [None]:
converter = SKAgentConverter()

# Get a single agent run data
evaluation_data_single_run = await converter.convert(
    thread=thread,
    turn_index=2,  # Specify the turn index you want to evaluate
    agent=agent,  # Pass it to include the instructions and plugins in the evaluation data
)

In [None]:
import json

file_name = "evaluation_data.jsonl"
# Save the agent thread data to a JSONL file (all turns)
evaluation_data = await converter.prepare_evaluation_data(threads=[thread], filename=file_name, agent=agent)
# print(json.dumps(evaluation_data, indent=4))
len(evaluation_data)  # number of turns in the thread
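
To check what was written to the JSONL file before passing it to an evaluator, you can read it back with the standard library. This is a minimal sketch: `load_jsonl` and `summarize_records` are hypothetical helpers, and it assumes only that each line of the file is one JSON object (the sample's comments suggest keys such as `query`, `response`, and `tool_definitions`, but the sketch does not depend on them).

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_records(records):
    """Return the sorted top-level keys of each record, one list per record."""
    return [sorted(record.keys()) for record in records]


# Same file name as `file_name` in the cell above; skipped if the file is absent.
path = Path("evaluation_data.jsonl")
if path.exists():
    for index, keys in enumerate(summarize_records(load_jsonl(path))):
        print(f"record {index}: {keys}")
```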

### Setting up evaluator

We will select the following evaluators to assess the different aspects relevant for agent quality: 

- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent to which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.
- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools and to process the correct parameters from previous steps. Scale: float 0-1. Higher is better.
- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.
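
Because the evaluators listed above report on different scales (1-5 and 0-1), comparing them side by side may be easier after rescaling. The sketch below is one possible approach and is not part of the evaluation SDK: `normalize_likert` is a hypothetical helper that linearly maps a 1-5 score onto 0-1.

```python
def normalize_likert(score, low=1, high=5):
    """Linearly map a score on [low, high] to a float on [0, 1]."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}-{high}")
    return (score - low) / (high - low)


# Illustrative scores only, not real evaluator output.
for raw in (1, 3, 5):
    print(f"{raw} -> {normalize_likert(raw):.2f}")
```

A linear mapping is the simplest choice; it treats the gap between adjacent scores as equal, which may or may not match how you interpret the rubric.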


In [None]:
import os
from pprint import pprint

from azure.ai.evaluation import (
    ToolCallAccuracyEvaluator,
    AzureOpenAIModelConfiguration,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
)

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)

intent_resolution = IntentResolutionEvaluator(model_config=model_config)

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

task_adherence = TaskAdherenceEvaluator(model_config=model_config)

In [None]:
# Test a single evaluation run
evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

# evaluation_data_single_run.keys() # query, response, tool_definitions
res = evaluator(**evaluation_data_single_run)
print(json.dumps(res, indent=4))

#### Bonus - run on the previously saved file for all turns

In [None]:
from azure.ai.evaluation import evaluate

response = evaluate(
    data=file_name,
    evaluators={
        "tool_call_accuracy": tool_call_accuracy,
        "intent_resolution": intent_resolution,
        "task_adherence": task_adherence,
    },
    azure_ai_project={
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    },
)

print(f'AI Foundry URL: {response.get("studio_url")}')

## Inspect results on Azure AI Foundry

Open the AI Foundry URL printed above to inspect the evaluation scores and reasoning in Azure AI Foundry's visualizations. They help you quickly identify issues with your agent so you can fix and improve it.

In [None]:
# alternatively, you can use the following to get the evaluation results in memory

# average scores across all runs
pprint(response["metrics"])
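
When several evaluators run together, the aggregated metrics can be easier to read grouped by evaluator. The sketch below assumes the metrics dictionary is flat and that its keys look like `<evaluator name>.<metric name>`; that key shape is an assumption, so check it against your own `response["metrics"]`. `group_metrics_by_evaluator` is a hypothetical helper, and the sample values are made up.

```python
from collections import defaultdict


def group_metrics_by_evaluator(metrics):
    """Group a flat {"evaluator.metric": value} dict into {evaluator: {metric: value}}."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        evaluator, _, metric = key.partition(".")
        # Keys without a dot are kept under their own name.
        grouped[evaluator][metric or evaluator] = value
    return dict(grouped)


# Made-up values for illustration only; not real evaluation output.
sample_metrics = {
    "intent_resolution.intent_resolution": 4.5,
    "task_adherence.task_adherence": 4.0,
    "tool_call_accuracy.tool_call_accuracy": 0.75,
}

for evaluator, values in group_metrics_by_evaluator(sample_metrics).items():
    print(evaluator, values)
```

To use it with the real results, pass `response["metrics"]` in place of `sample_metrics`.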