# Evaluate model endpoints using Prompt Flow Eval APIs

## Objective

This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoints deployed on Azure AI Platform or non Azure AI platforms. 

This guide uses Python Class as an application target which is passed to Evaluate API provided by PromptFlow SDK to evaluate results generated by LLM models against provided prompts. 

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Time

You should expect to spend 30 minutes running this sample. 

## About this example

This example demonstrates evaluating model endpoints responses against provided prompts using azure-ai-evaluation

## Before you begin

### Installation

Install the following packages required to execute this notebook. 

In [1]:
%pip install azure-ai-evaluation
%pip install promptflow-azure

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\sydneylister\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\sydneylister\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


### Parameters and imports

In [2]:
from pprint import pprint

import pandas as pd
import random
from openai import AzureOpenAI

## Target Application

We will use Evaluate API provided by Prompt Flow SDK. It requires a target Application or python Function, which handles a call to LLMs and retrieve responses. 

In the notebook, we will use an Application Target `ModelEndpoints` to get answers from multiple model endpoints against provided question aka prompts. 

This application target requires list of model endpoints and their authentication keys. For simplicity, we have provided them in the `env_var` variable which is passed into init() function of `ModelEndpoints`.


Please provide Azure AI Project details so that traces and eval results are pushing in the project in Azure AI Studio.

In [6]:
azure_ai_project = {
    "subscription_id": "<your-subscription-id>",
    "resource_group_name": "<your-resource-group-name>",
    "project_name": "<your-project-name>",
}

In [7]:
import os

# Use the following code to set the environment variables if not already set. If set, you can skip this step.

os.environ["AZURE_OPENAI_API_KEY"] = "<your-api-key>"
os.environ["AZURE_OPENAI_API_VERSION"] = "<api version>"
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "<your-deployment>"
os.environ["AZURE_OPENAI_ENDPOINT"] = "<your-endpoint>"

## Data

Following code reads Json file "data.jsonl" which contains inputs to the Application Target function. It provides question, context and ground truth on each line. 

In [9]:
df = pd.read_json("data.jsonl", lines=True)
print(df.head())

                                           query  \
0                 What is the capital of France?   
1             Which tent is the most waterproof?   
2           Which camping table is the lightest?   
3  How much does TrailWalker Hiking Shoes cost?    

                                             context  \
0                   France is the country in Europe.   
1  #TrailMaster X4 Tent, price $250,## BrandOutdo...   
2  #BaseCamp Folding Table, price $60,## BrandCam...   
3  #TrailWalker Hiking Shoes, price $110## BrandT...   

                                        ground_truth  
0                                              Paris  
1  The TrailMaster X4 tent has a rainfly waterpro...  
2  The BaseCamp Folding Table has a weight of 15 lbs  
3    The TrailWalker Hiking Shoes are priced at $110  


## Configuration
To use Relevance and Cohenrence Evaluator, we will Azure Open AI model details as a Judge that can be passed as model config.

In [10]:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

## Run the evaluation

The Following code runs Evaluate API and uses Content Safety, Relevance and Coherence Evaluator to evaluate results from different models.

The following are the few parameters required by Evaluate API. 

+   Data file (Prompts): It represents data file 'data.jsonl' in JSON format. Each line contains question, context and ground truth for evaluators.     

+   Application Target: It is name of python class which can route the calls to specific model endpoints using model name in conditional logic.  

+   Model Name: It is an identifier of model so that custom code in the App Target class can identify the model type and call respective LLM model using endpoint URL and auth key.  

+   Evaluators: List of evaluators is provided, to evaluate given prompts (questions) as input and output (answers) from LLM models. 

In [16]:
from endpoint_target import ModelEndpoint
import pathlib

from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)


content_safety_evaluator = ContentSafetyEvaluator(azure_ai_project)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)

path = str(pathlib.Path(pathlib.Path.cwd())) + "/data.jsonl"

results = evaluate(
    evaluation_name="Eval-Run-" + "-" + model_config["azure_deployment"].title(),
    data=path,
    target=ModelEndpoint(model_config),
    evaluators={
        "content_safety": content_safety_evaluator,
        "coherence": coherence_evaluator,
        "relevance": relevance_evaluator,
        "groundedness": groundedness_evaluator,
        "fluency": fluency_evaluator,
        "similarity": similarity_evaluator,
    },
    evaluator_config={
        "content_safety": {"query": "${data.query}", "response": "${target.response}"},
        "coherence": {"response": "${target.response}", "query": "${data.query}"},
        "relevance": {"response": "${target.response}", "context": "${data.context}", "query": "${data.query}"},
        "groundedness": {
            "response": "${target.response}",
            "context": "${data.context}",
            "query": "${data.query}",
        },
        "fluency": {"response": "${target.response}", "context": "${data.context}", "query": "${data.query}"},
        "similarity": {"response": "${target.response}", "context": "${data.context}", "query": "${data.query}"},
    },
)

Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=evaluate_model_endpoints_20241003_135400_792011


[2024-10-03 13:54:09 -0700][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run evaluate_model_endpoints_20241003_135400_792011, log path: C:\Users\sydneylister\.promptflow\.runs\evaluate_model_endpoints_20241003_135400_792011\logs.txt


2024-10-03 13:54:09 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:54:09 -0700   32640 execution.bulk     INFO     Current system's available memory is 12436.453125MB, memory consumption of current process is 351.0859375MB, estimated available worker count is 12436.453125/351.0859375 = 35
2024-10-03 13:54:09 -0700   32640 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 35}.
2024-10-03 13:54:15 -0700   32640 execution.bulk     INFO     Process name(SpawnProcess-4)-Process id(36544)-Line number(0) start execution.
2024-10-03 13:54:15 -0700   32640 execution.bulk     INFO     Process name(SpawnProcess-6)-Process id(13464)-Line number(1) start execution.
2024-10-03 13:54:15 -0700   32640 execution.bulk     INFO     Process name(SpawnProcess-5)-Process id(31



Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_content_safety_content_safety_contentsafetyevaluator_7oxgfzyb_20241003_135430_353548
Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_coherence_coherence_asynccoherenceevaluator_ah3k8481_20241003_135430_366469
Prompt flow service has started...
Prompt flow service has started...
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_similarity_similarity_asyncsimilarityevaluator_nm65mz8b_20241003_135430_371035
You can view the traces in local from http://127.0.0.1:23333/v1.0/ui/traces/?#run=azure_ai_evaluation_evaluators_groundedness_groundedness_asyncgroundednessevaluator_wlti5wr_20241003_135430_366469
Prompt flow service has started...
You can view the traces in local from http://127.

[2024-10-03 13:54:30 -0700][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_content_safety_content_safety_contentsafetyevaluator_7oxgfzyb_20241003_135430_353548, log path: C:\Users\sydneylister\.promptflow\.runs\azure_ai_evaluation_evaluators_content_safety_content_safety_contentsafetyevaluator_7oxgfzyb_20241003_135430_353548\logs.txt
[2024-10-03 13:54:30 -0700][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_similarity_similarity_asyncsimilarityevaluator_nm65mz8b_20241003_135430_371035, log path: C:\Users\sydneylister\.promptflow\.runs\azure_ai_evaluation_evaluators_similarity_similarity_asyncsimilarityevaluator_nm65mz8b_20241003_135430_371035\logs.txt
[2024-10-03 13:54:31 -0700][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_groundedness_groundedness_asyncgroundednessevaluator_wlti5wr_20241003_135430_366469, log path: C:\Users\sydne

2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:55:29 -0700   32640 execution.bulk     INFO     Finished 3 / 4 lines.
2024-10-03 13:55:29 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 19.33 seconds. Estimated time for incomplete lines: 19.33 seconds.
2024-10-03 13:55:30 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:30 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 14.81 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:30 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [2,3,1], exception of index 2: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI

Run name: "azure_ai_evaluation_evaluators_groundedness_groundedness_asyncgroundednessevaluator_wlti5wr_20241003_135430_366469"
Run s



2024-10-03 13:55:33 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:33 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 15.51 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:33 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [3,1,2], exception of index 3: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI


 Please check out C:/Users/sydneylister/.promptflow/.runs/azure_ai_evaluation_evaluators_similarity_similarity_asyncsimilarityevaluator_nm65mz8b_20241003_135430_371035 for more details.
[2024-10-03 13:55:33 -0700][promptflow.core._prompty_utils][ERROR] - Exception occurs: CredentialUnavailableError: Failed to invoke the Azure CLI


2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:55:32 -0700   32640 execution.bulk     INFO     Finished 3 / 4 lines.
2024-10-03 13:55:32 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 20.36 seconds. Estimated time for incomplete lines: 20.36 seconds.
2024-10-03 13:55:33 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:33 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 15.51 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:33 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [3,1,2], exception of index 3: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI

Run name: "azure_ai_evaluation_evaluators_similarity_similarity_asyncsimilarityevaluator_nm65mz8b_20241003_135430_371035"
Run status

[2024-10-03 13:55:34 -0700][promptflow.core._prompty_utils][ERROR] - Exception occurs: CredentialUnavailableError: Failed to invoke the Azure CLI


2024-10-03 13:55:34 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:34 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 15.88 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:34 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [1,2,3], exception of index 1: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI
2024-10-03 13:55:35 -0700   32640 execution.bulk     INFO     Finished 3 / 4 lines.
2024-10-03 13:55:35 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 21.23 seconds. Estimated time for incomplete lines: 21.23 seconds.


 Please check out C:/Users/sydneylister/.promptflow/.runs/azure_ai_evaluation_evaluators_relevance_relevance_asyncrelevanceevaluator_mrcw1my2_20241003_135430_371035 for more details.
[2024-10-03 13:55:35 -0700][promptflow.core._prompty_utils][ERROR] - Exception occurs: CredentialUnavailableError: Failed to invoke the Azure CLI


2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:55:33 -0700   32640 execution.bulk     INFO     Finished 3 / 4 lines.
2024-10-03 13:55:33 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 20.84 seconds. Estimated time for incomplete lines: 20.84 seconds.
2024-10-03 13:55:34 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:34 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 15.88 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:34 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [1,2,3], exception of index 1: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI

Run name: "azure_ai_evaluation_evaluators_relevance_relevance_asyncrelevanceevaluator_mrcw1my2_20241003_135430_371035"
Run status: "



2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 16.18 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:36 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [2,1,3], exception of index 2: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI


 Please check out C:/Users/sydneylister/.promptflow/.runs/azure_ai_evaluation_evaluators_coherence_coherence_asynccoherenceevaluator_ah3k8481_20241003_135430_366469 for more details.


2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:55:35 -0700   32640 execution.bulk     INFO     Finished 3 / 4 lines.
2024-10-03 13:55:35 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 21.23 seconds. Estimated time for incomplete lines: 21.23 seconds.
2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 16.18 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:36 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [2,1,3], exception of index 2: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI

Run name: "azure_ai_evaluation_evaluators_coherence_coherence_asynccoherenceevaluator_ah3k8481_20241003_135430_366469"
Run status: "



2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 16.29 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:36 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [1,3,2], exception of index 1: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI


 Please check out C:/Users/sydneylister/.promptflow/.runs/azure_ai_evaluation_evaluators_fluency_fluency_asyncfluencyevaluator_1tk9glvt_20241003_135430_385178 for more details.


2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:55:35 -0700   32640 execution.bulk     INFO     Finished 3 / 4 lines.
2024-10-03 13:55:35 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 21.39 seconds. Estimated time for incomplete lines: 21.39 seconds.
2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Finished 4 / 4 lines.
2024-10-03 13:55:36 -0700   32640 execution.bulk     INFO     Average execution time for completed lines: 16.29 seconds. Estimated time for incomplete lines: 0.0 seconds.
2024-10-03 13:55:36 -0700   32640 execution          ERROR    3/4 flow run failed, indexes: [1,3,2], exception of index 1: OpenAI API hits exception: CredentialUnavailableError: Failed to invoke the Azure CLI

Run name: "azure_ai_evaluation_evaluators_fluency_fluency_asyncfluencyevaluator_1tk9glvt_20241003_135430_385178"
Run status: "Comple

 Please check out C:/Users/sydneylister/.promptflow/.runs/azure_ai_evaluation_evaluators_content_safety_content_safety_contentsafetyevaluator_7oxgfzyb_20241003_135430_353548 for more details.


2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     The timeout for the batch run is 3600 seconds.
2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Current system's available memory is 12363.328125MB, memory consumption of current process is 354.421875MB, estimated available worker count is 12363.328125/354.421875 = 34
2024-10-03 13:54:31 -0700   32640 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 34}.
2024-10-03 13:54:43 -0700   32640 execution.bulk     INFO     Process name(SpawnProcess-9)-Process id(33796)-Line number(0) start execution.
2024-10-03 13:54:43 -0700   32640 execution.bulk     INFO     Process name(SpawnProcess-10)-Process id(14820)-Line number(1) start executi

  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
  outputs.fillna(value="(Failed)", inplace=True)  # replace nan with explicit prompt
  result_df.replace("(Failed)", np.nan, inplace=True)
ERROR:azure.ai.evaluation._evaluate._utils:Unable to log traces as trace destination was not defined.


View the results

In [17]:
pprint(results)

{'metrics': {'coherence.gpt_coherence': 5.0,
             'fluency.gpt_fluency': 5.0,
             'groundedness.gpt_groundedness': 1.0,
             'relevance.gpt_relevance': 5.0,
             'similarity.gpt_similarity': 5.0},
 'rows': [{'inputs.context': 'France is the country in Europe.',
           'inputs.ground_truth': 'Paris',
           'inputs.query': 'What is the capital of France?',
           'outputs.coherence.gpt_coherence': 5.0,
           'outputs.fluency.gpt_fluency': 5.0,
           'outputs.groundedness.gpt_groundedness': 1.0,
           'outputs.query': 'What is the capital of France?',
           'outputs.relevance.gpt_relevance': 5.0,
           'outputs.response': 'The capital of France is Paris.',
           'outputs.similarity.gpt_similarity': 5.0},
          {'inputs.context': '#TrailMaster X4 Tent, price $250,## '
                             'BrandOutdoorLiving## CategoryTents## Features- '
                             'Polyester material for durability- S

In [18]:
pd.DataFrame(results["rows"])

Unnamed: 0,outputs.query,outputs.response,inputs.query,inputs.context,inputs.ground_truth,outputs.coherence.gpt_coherence,outputs.relevance.gpt_relevance,outputs.groundedness.gpt_groundedness,outputs.fluency.gpt_fluency,outputs.similarity.gpt_similarity
0,What is the capital of France?,The capital of France is Paris.,What is the capital of France?,France is the country in Europe.,Paris,5.0,5.0,1.0,5.0,5.0
1,Which tent is the most waterproof?,"When looking for the most waterproof tent, con...",Which tent is the most waterproof?,"#TrailMaster X4 Tent, price $250,## BrandOutdo...",The TrailMaster X4 tent has a rainfly waterpro...,,,,,
2,Which camping table is the lightest?,"When looking for the lightest camping table, m...",Which camping table is the lightest?,"#BaseCamp Folding Table, price $60,## BrandCam...",The BaseCamp Folding Table has a weight of 15 lbs,,,,,
3,How much does TrailWalker Hiking Shoes cost?,The cost of TrailWalker hiking shoes can vary ...,How much does TrailWalker Hiking Shoes cost?,"#TrailWalker Hiking Shoes, price $110## BrandT...",The TrailWalker Hiking Shoes are priced at $110,,,,,
