# Quality Evaluators with the Azure AI Evaluation SDK
This sample shows the basic way to evaluate a generative AI application from your development environment with the Azure AI Evaluation SDK.

> ✨ ***Note*** <br>
> Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk

## 🔨 Current Support and Limitations (as of 2025-01-14) 
- Check the region support for the Azure AI Evaluation SDK. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support

### Region support for evaluations
| Region              | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA, ECI (Text) | Groundedness (Text) | Protected Material (Text) | Hate and Unfairness, Sexual, Violent, Self-Harm, Protected Material (Image) |
|---------------------|------------------------------------------------------------------|---------------------|----------------------------|----------------------------------------------------------------------------|
| North Central US    | no                                                               | no                  | no                         | yes                                                                        |
| East US 2           | yes                                                              | yes                 | yes                        | yes                                                                        |
| Sweden Central      | yes                                                              | yes                 | yes                        | yes                                                                        |
| US North Central    | yes                                                              | no                  | yes                        | yes                                                                        |
| France Central      | yes                                                              | yes                 | yes                        | yes                                                                        |
| Switzerland West    | yes                                                              | no                  | no                         | yes                                                                        |

### Region support for adversarial simulation
| Region            | Adversarial Simulation (Text) | Adversarial Simulation (Image) |
|-------------------|-------------------------------|---------------------------------|
| UK South          | yes                           | no                              |
| East US 2         | yes                           | yes                             |
| Sweden Central    | yes                           | yes                             |
| US North Central  | yes                           | yes                             |
| France Central    | yes                           | no                              |


## ✔️ Pricing and billing
- Effective 1/14/2025, Azure AI Safety Evaluations are no longer free in public preview. Usage is billed based on consumption as follows:

| Service Name              | Safety Evaluations       | Price Per 1K Tokens (USD) |
|---------------------------|--------------------------|---------------------------|
| Azure Machine Learning    | Input pricing for 3P     | $0.02                     |
| Azure Machine Learning    | Output pricing for 3P    | $0.06                     |
| Azure Machine Learning    | Input pricing for 1P     | $0.012                    |
| Azure Machine Learning    | Output pricing for 1P    | $0.012                    |
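To get a feel for what these rates mean in practice, the following is a minimal sketch that turns token counts into an estimated charge. The function name `estimate_safety_eval_cost` and the workload numbers are hypothetical; the prices are copied from the table above. How the service actually counts tokens for a given evaluation is not covered here, so treat the result as a rough estimate only.

```python
# Prices in USD per 1K tokens, copied from the pricing table above.
PRICE_PER_1K_TOKENS = {
    ("3P", "input"): 0.02,
    ("3P", "output"): 0.06,
    ("1P", "input"): 0.012,
    ("1P", "output"): 0.012,
}


def estimate_safety_eval_cost(input_tokens: int, output_tokens: int, tier: str) -> float:
    """Estimate the USD cost of a safety evaluation from token counts.

    `tier` is "1P" or "3P", matching the rows of the pricing table.
    """
    input_cost = input_tokens / 1000 * PRICE_PER_1K_TOKENS[(tier, "input")]
    output_cost = output_tokens / 1000 * PRICE_PER_1K_TOKENS[(tier, "output")]
    return round(input_cost + output_cost, 6)


# Hypothetical workload: 50,000 input tokens and 10,000 output tokens.
cost_1p = estimate_safety_eval_cost(50_000, 10_000, tier="1P")
cost_3p = estimate_safety_eval_cost(50_000, 10_000, tier="3P")
print(f"1P estimate: ${cost_1p:.2f}")
print(f"3P estimate: ${cost_3p:.2f}")
```

For this hypothetical workload the estimate comes to $0.72 at the 1P rates and $1.60 at the 3P rates.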


In [3]:
import json
import os
import pathlib
from pprint import pprint

import pandas as pd
from azure.ai.evaluation import (
    evaluate,
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    GroundednessProEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    F1ScoreEvaluator,
    RetrievalEvaluator,
)
from azure.ai.ml import MLClient
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    Evaluation,
    Dataset,
    EvaluatorConfiguration,
    ConnectionType,
    EvaluationSchedule,
    RecurrenceTrigger,
    ApplicationInsightsConfiguration,
)
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

# Load the Azure OpenAI and project settings from the shared .env file
load_dotenv("../.env")

True

In [None]:
credential = DefaultAzureCredential()

# At the moment, the connection string should be in the format
# '<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>'
# Ex: eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2
# Read it from the environment rather than hard-coding subscription details in the notebook.
azure_ai_project_client = AIProjectClient.from_connection_string(
    credential=credential,
    conn_str=os.environ["AZURE_AI_PROJECT_CONN_STR"],
)



model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
    "type": "azure_openai",
}
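Because both the project connection string and `model_config` are assembled from environment variables, a missing or malformed value tends to surface only later as an opaque service error. The sketch below shows one way to check them up front. The helper names `parse_project_conn_str` and `find_missing_settings` are hypothetical and not part of any Azure SDK; the four-part format is the one described in the comment above.

```python
from typing import NamedTuple


class ProjectConnection(NamedTuple):
    host: str                 # e.g. eastus2.api.azureml.ms
    subscription_id: str
    resource_group: str
    project_or_hub_name: str


def parse_project_conn_str(conn_str: str) -> ProjectConnection:
    """Split '<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>'."""
    parts = [part.strip() for part in conn_str.split(";")]
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            "Expected 4 non-empty ';'-separated fields: "
            "'<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>'"
        )
    return ProjectConnection(*parts)


def find_missing_settings(config: dict) -> list:
    """Return the keys whose values are None or empty (e.g. unset environment variables)."""
    return sorted(key for key, value in config.items() if not value)


# Usage with placeholder values only (no real subscription details).
conn = parse_project_conn_str("eastus2.api.azureml.ms;<subscription-id>;rg-sample;sample-project-eastus2")
print(conn.resource_group)

example_config = {"azure_endpoint": "https://example.openai.azure.com/", "api_key": None, "type": "azure_openai"}
print(find_missing_settings(example_config))
```

Running `find_missing_settings(model_config)` right after the cell above is a cheap way to catch an incomplete `.env` file before any evaluation is submitted.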

## 🚀 Run Evaluators in Azure Cloud

### Generate synthetic dataset with Azure OpenAI
- Use your own model deployment to generate custom utterances in your target language for your scenario. These utterances serve as the seed data for the evaluation dataset. By adjusting the prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

In [5]:
from openai import AzureOpenAI


aoai_api_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
aoai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
aoai_api_version = os.getenv("AZURE_OPENAI_API_VERSION")
aoai_deployment_name = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")

try:
    client = AzureOpenAI(
        azure_endpoint = aoai_api_endpoint,
        api_key        = aoai_api_key,
        api_version    = aoai_api_version,
    )

    print("=== Initialized AzureOpenAI client ===")
    print(f"AZURE_OPENAI_ENDPOINT={aoai_api_endpoint}")
    print(f"AZURE_OPENAI_API_VERSION={aoai_api_version}")
    print(f"AZURE_OPENAI_DEPLOYMENT_NAME={aoai_deployment_name}")

except (ValueError, TypeError) as e:
    print(e)

=== Initialized AzureOpenAI client ===
AZURE_OPENAI_ENDPOINT=https://aoai-services1.openai.azure.com/
AZURE_OPENAI_API_VERSION=2025-01-01-preview
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini


In [25]:
NUM_SAMPLES = 20

LANGUAGE = "English"

topic = f"""
Expected spoken utterances for a Contoso Electronics call center Q&A in the {LANGUAGE} language. Create {NUM_SAMPLES} lines of JSONL text related to this topic.
"""

example = """
Here are examples of the expected format:
{"query":"How do you create a run?","context":"AML API only","response":"To create a run using the Azure Machine Learning API, you first need to create an Experiment. Once you have an experiment, you can create a Run object that is associated with that experiment. Here is some Python code that demonstrates this process:\\n\\n```\\nfrom azureml.core import Experiment, Run\\nfrom azureml.core.workspace import Workspace\\n\\n# Define workspace and experiment\\nws = Workspace.from_config()\\nexp = Experiment(workspace=ws, name='my_experiment')\\n\\n# Create a new run\\nrun = exp.start_logging()\\n```\\n\\nIn this code, the `from_config()` method reads the configuration file that you created when you set up your Azure Machine Learning workspace. The `Experiment` constructor creates an Experiment object that is associated with your workspace, and the `start_logging()` method creates a new Run object that is associated with the Experiment. Now you can use the `run` object to log metrics, upload files, and track other information related to your machine learning experiment.","ground_truth":"Paris is the capital of France."}
{"query":"How do you log a model?","context":"Logging can be done using any OSS Sdk","response":"There are a few ways to log models in Azure Machine Learning. \\n\\nOne way is to use the `register_model()` method of the `Run` object. The `register_model()` method logs a model file in the Azure Machine Learning service workspace and makes it available for deployment. Here's an example:\\n\\n```python\\nfrom azureml.core import Model\\n\\nmodel_path = './outputs/my_model.pkl'\\nmodel = Model.register(workspace=ws, model_path=model_path, model_name='my_model')\\n```\\n\\nThis code registers the model file located at `model_path` to the Azure Machine Learning service workspace with the name `my_model`. \\n\\nAnother way to log a model is to save it as an output of a `Run`. If your model generation code is part of a script or Jupyter notebook that runs as an Azure Machine Learning experiment, you can save the model file as an output of the `Run` object. Here's an example:\\n\\n```python\\nfrom sklearn.linear_model import LogisticRegression\\nfrom azureml.core.run import Run\\n\\n# Initialize a run object\\nrun = Run.get_context()\\n\\n# Train your model\\nX_train, y_train = ...\\nlog_reg = LogisticRegression().fit(X_train, y_train)\\n\\n# Save the model to the Run object's outputs directory\\nmodel_path = 'outputs/model.pkl'\\njoblib.dump(value=log_reg, filename=model_path)\\n\\n# Log the model as a run artifact\\nrun.upload_file(name=model_path, path_or_stream=model_path)\\n```\\n\\nIn this code, `Run.get_context()` retrieves the current run context object, which you can use to track metadata and metrics for the run. After training your model, you can use `joblib.dump()` to save the model to a file, and then log the file as an artifact of the run using `run.upload_file()`.","ground_truth":"Paris is the capital of France."}
{"query":"What is the capital of France?","context":"France is in Europe","response":"Paris is the capital of France.","ground_truth":"Paris is the capital of France."}
"""

system_message = """
Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized. 
Use text data that's close to the expected spoken utterances. The number of utterances per line should be 1.
JSONL format is required. Use 'no' as number, 'query' as string, 'context' as string, 'response' as string, and 'ground_truth' as string.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result. 

"""

user_message = f"""
#topic#: {topic}
Example: {example}
"""

# Simple API Call
response = client.chat.completions.create(
    model=aoai_deployment_name,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.8,
    top_p=0.1
)

content = response.choices[0].message.content
print(content)
print("Usage Information:")
#print(f"Cached Tokens: {response.usage.prompt_tokens_details.cached_tokens}") #only o1 models support this
print(f"Completion Tokens: {response.usage.completion_tokens}")
print(f"Prompt Tokens: {response.usage.prompt_tokens}")
print(f"Total Tokens: {response.usage.total_tokens}")

{"no":"1","query":"What are the latest products from Contoso Electronics?","context":"Product catalog inquiry","response":"The latest products from Contoso Electronics include the Contoso Smart Speaker, Contoso Ultra HD TV, and the Contoso Fitness Tracker. Each product features cutting-edge technology and user-friendly interfaces.","ground_truth":"The latest products from Contoso Electronics include the Contoso Smart Speaker, Contoso Ultra HD TV, and the Contoso Fitness Tracker."}
{"no":"2","query":"How can I reset my Contoso device?","context":"Device troubleshooting","response":"To reset your Contoso device, locate the reset button on the back or bottom of the device. Press and hold the button for about ten seconds until the device powers off and restarts.","ground_truth":"To reset your Contoso device, locate the reset button on the back or bottom of the device."}
{"no":"3","query":"What warranty does Contoso offer?","context":"Warranty information","response":"Contoso offers a one-y

In [26]:
synthetic_text_file = "../data/sythetic_evaluation_data.jsonl"
with open(synthetic_text_file, 'w', encoding='utf-8') as f:
    for line in content.split('\n'):
        if line.strip():  # Check if the line is not empty
            f.write(line + '\n')

%store synthetic_text_file

Stored 'synthetic_text_file' (str)
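The cell above writes every non-empty line of the model output to the file as-is. Generated JSONL can contain truncated or incomplete lines, so it may be worth validating each line before uploading. The following is a minimal sketch of such a check; the helper name `split_valid_jsonl` and the sample records are hypothetical, and the required keys are the ones requested in the system message.

```python
import json

# Columns that the system message above asks the model to produce.
REQUIRED_KEYS = {"query", "context", "response", "ground_truth"}


def split_valid_jsonl(text: str, required_keys=REQUIRED_KEYS):
    """Return (valid_records, rejected) where rejected is a list of (line_no, reason)."""
    valid, rejected = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, as the original cell does
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            rejected.append((line_no, "not a JSON object"))
            continue
        missing = required_keys - record.keys()
        if missing:
            rejected.append((line_no, f"missing keys: {sorted(missing)}"))
            continue
        valid.append(record)
    return valid, rejected


# Hypothetical model output: one good line, one truncated line, one incomplete record.
sample_output = "\n".join([
    json.dumps({"no": "1", "query": "q1", "context": "c1", "response": "r1", "ground_truth": "g1"}),
    '{"no":"2","query":"q2","context":"c2","response":"Contoso offers a one-y',
    json.dumps({"no": "3", "query": "q3", "context": "c3", "response": "r3"}),
])

valid_records, rejected_lines = split_valid_jsonl(sample_output)
print(f"{len(valid_records)} valid, {len(rejected_lines)} rejected")
for line_no, reason in rejected_lines:
    print(f"line {line_no}: {reason}")
```

In the notebook, `split_valid_jsonl(content)` could be called before the file is written, and only the valid records serialized with `json.dumps`.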


In [27]:
# Upload data for evaluation
data_id, _ = azure_ai_project_client.upload_file("../data/sythetic_evaluation_data.jsonl")
# To use an existing dataset instead, replace the line above with one of the following:
# data_id = "azureml://registries/<registry>/data/<dataset>/versions/<version>"
# data_id = "<dataset_id>"

Overriding of current TracerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented
Attempting to instrument while already instrumented


Uploading sythetic_evaluation_data.jsonl (< 1 MB): 100%|██████████| 7.17k/7.17k [00:00<00:00,

### Configure Evaluators to Run
- The code below demonstrates how to configure the evaluators you want to run. This example configures F1ScoreEvaluator, RelevanceEvaluator, GroundednessEvaluator, CoherenceEvaluator, FluencyEvaluator, and SimilarityEvaluator, but every evaluator supported by the Azure AI Evaluation SDK is also supported by cloud evaluation and can be configured here. You can either import the classes from the SDK and reference them with the .id property, or look up the fully formed id of the evaluator in the AI Studio evaluator registry and use it directly.
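Unlike the model-assisted evaluators, F1ScoreEvaluator needs no `model_config` because it is a purely lexical metric based on the words shared between the response and the ground truth. The sketch below illustrates that idea with a simple lowercase, whitespace-based tokenization. It is an illustration of token-overlap F1 in general, not the SDK's implementation, whose text normalization may differ; the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a response and its ground truth (simplified)."""
    predicted = response.lower().split()
    reference = ground_truth.lower().split()
    if not predicted or not reference:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Paris is the capital of France.", "Paris is the capital of France.")
partial = token_f1("Paris is in France.", "Paris is the capital of France.")
print(f"exact match: {exact:.2f}, partial match: {partial:.2f}")
```

The second call shares three of four response tokens with a six-token reference, so precision is 0.75, recall is 0.5, and the F1 score is 0.6.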

In [28]:
# id for each evaluator can be found in your AI Studio registry - please see documentation for more information
# init_params is the configuration for the model to use to perform the evaluation
# data_mapping is used to map the output columns of your query to the names required by the evaluator
# Evaluator parameter format - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk#evaluator-parameter-format
evaluators_cloud = {
    "f1_score": EvaluatorConfiguration(
        id=F1ScoreEvaluator.id,
    ),
    "relevance": EvaluatorConfiguration(
        id=RelevanceEvaluator.id,
        init_params={"model_config": model_config},
        data_mapping={"query": "${data.query}", "context": "${data.context}", "response": "${data.response}"},
    ),
    "groundedness": EvaluatorConfiguration(
        id=GroundednessEvaluator.id,
        init_params={"model_config": model_config},
        data_mapping={"query": "${data.query}", "context": "${data.context}", "response": "${data.response}"},
    ),
    # "retrieval": EvaluatorConfiguration(
    #     # Disabled for now: this evaluator fails with "ModuleNotFoundError: No module named 'azure.ai.evaluation._evaluators._common.math'"
    #     # (raised by: from azure.ai.evaluation._evaluators._common.math import list_mean_nan_safe)
    #     #id=RetrievalEvaluator.id,
    #     id="azureml://registries/azureml/models/Retrieval-Evaluator/versions/2",
    #     init_params={"model_config": model_config},
    #     data_mapping={"query": "${data.query}", "context": "${data.context}", "response": "${data.response}"},
    # ),
    "coherence": EvaluatorConfiguration(
        id=CoherenceEvaluator.id,
        init_params={"model_config": model_config},
        data_mapping={"query": "${data.query}", "response": "${data.response}"},
    ),
    "fluency": EvaluatorConfiguration(
        id=FluencyEvaluator.id,
        init_params={"model_config": model_config},
        data_mapping={"query": "${data.query}", "context": "${data.context}", "response": "${data.response}"},
    ),
     "similarity": EvaluatorConfiguration(
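        # SimilarityEvaluator.id is currently affected by an SDK bug, so the registry id below is used instead
        # id=SimilarityEvaluator.id,
        id="azureml://registries/azureml/models/Similarity-Evaluator/versions/3",
        init_params={"model_config": model_config},
        # SimilarityEvaluator compares the response against the ground truth, so map that column as well
        data_mapping={"query": "${data.query}", "response": "${data.response}", "ground_truth": "${data.ground_truth}"},
    ),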

}
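The `data_mapping` entries above connect each evaluator input to a column of the uploaded dataset through `${data.<column>}` placeholders. The following sketch mimics that lookup locally for a single row, which can help spot a mapping that points at a column the JSONL file does not contain. It is only an illustration of the idea: the resolver `resolve_data_mapping` is hypothetical, and the service's real resolution logic is not shown in this notebook.

```python
import re

# Matches placeholders of the form ${data.<column>} as used in data_mapping above.
_PLACEHOLDER = re.compile(r"^\$\{data\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_data_mapping(data_mapping: dict, row: dict) -> dict:
    """Build the keyword arguments an evaluator would receive for one dataset row."""
    resolved = {}
    for param, expression in data_mapping.items():
        match = _PLACEHOLDER.match(expression)
        if match is None:
            resolved[param] = expression  # treat anything else as a literal value
            continue
        column = match.group(1)
        if column not in row:
            raise KeyError(f"column '{column}' needed by '{param}' is not in the row")
        resolved[param] = row[column]
    return resolved


row = {
    "query": "How can I reset my Contoso device?",
    "context": "Device troubleshooting",
    "response": "Press and hold the reset button for about ten seconds.",
}
mapping = {"query": "${data.query}", "response": "${data.response}"}
evaluator_inputs = resolve_data_mapping(mapping, row)
print(evaluator_inputs)
```

A mapping that references a column missing from the row, such as `${data.ground_truth}` here, raises a `KeyError` instead of silently passing an empty value.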


In [29]:
evaluation = Evaluation(
    display_name="Cloud Evaluation",
    description="Cloud Evaluation of dataset",
    data=Dataset(id=data_id),
    evaluators=evaluators_cloud,
)

# Create evaluation
evaluation_response = azure_ai_project_client.evaluations.create(
    evaluation=evaluation,
)

In [30]:
from tqdm import tqdm
import time

# Poll the evaluation run until it completes or fails, showing coarse progress
def monitor_status(project_client: AIProjectClient, evaluation_response_id: str):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = project_client.evaluations.get(evaluation_response_id).status
        if status == "Queued":
            pbar.update(1)
        while status not in ("Completed", "Failed"):
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = project_client.evaluations.get(evaluation_response_id).status
        # Fill the remaining steps once a terminal state is reached
        while pbar.n < 3:
            pbar.update(1)
        # Report the real outcome instead of always printing "Completed"
        print(f"Operation finished with status: {status}")
    return status
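The polling loop above has no upper bound on how long it waits. A variant with a timeout might look like the sketch below. It takes any zero-argument callable that returns the current status, so it can be tried without an Azure connection. The function name `wait_for_terminal_status`, the default timeout, and the `"Canceled"` state are assumptions for illustration and are not taken from the SDK.

```python
import time
from typing import Callable, FrozenSet

# "Completed" and "Failed" come from the loop above; "Canceled" is an assumed extra state.
TERMINAL_STATES: FrozenSet[str] = frozenset({"Completed", "Failed", "Canceled"})


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    terminal_states: FrozenSet[str] = TERMINAL_STATES,
) -> str:
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Local usage with a fake status source instead of a real evaluation run.
fake_statuses = iter(["Starting", "Queued", "Running", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0.0, timeout_seconds=5.0)
print(final_status)
```

With the project client from this notebook, the callable could be `lambda: azure_ai_project_client.evaluations.get(evaluation_response.id).status`.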

In [None]:
monitor_status(azure_ai_project_client, evaluation_response.id)

Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: Starting
Current Status: Queued


### Check the evaluation result in Azure AI Foundry 
- After running the evaluation, you can check the evaluation results in Azure AI Foundry. You can find the evaluation results in the Evaluation tab of your project.

In [None]:
# Get evaluation
get_evaluation_response = azure_ai_project_client.evaluations.get(evaluation_response.id)

print("----------------------------------------------------------------")
print("Created evaluation, evaluation ID: ", get_evaluation_response.id)
print("Evaluation status: ", get_evaluation_response.status)
print("AI Foundry Portal URI: ", get_evaluation_response.properties["AiStudioEvaluationUri"])
print("----------------------------------------------------------------")

----------------------------------------------------------------
Created evaluation, evaluation ID:  04ee9cf1-317a-4e8c-b9c2-2d61dcc4755e
Evaluation status:  Completed
AI Foundry Portal URI:  https://ai.azure.com/build/evaluation/04ee9cf1-317a-4e8c-b9c2-2d61dcc4755e?wsid=/subscriptions/3d4d3dd0-79d4-40cf-a94e-b4154812c6ca/resourceGroups/AOAI-group3/providers/Microsoft.MachineLearningServices/workspaces/aoai-pjt1
----------------------------------------------------------------


![Cloud Evaluation Result](../images/cloud_evaluation_result.png)