# Evaluate multiple models with quantitative NLP evaluators

## Objective
This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:
 - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
 - Evaluate a dataset with these evaluators across multiple models.

## Time
You should expect to spend about 10 minutes running this notebook.

## Before you begin

### Installation
Install the following packages required to execute this notebook.

In [None]:
# Install the packages
%pip install azure-ai-evaluation

In [1]:
import os
from pprint import pprint
from dotenv import load_dotenv
load_dotenv("../.credentials.env")

True

## NLP Evaluators

In [2]:
# Initialize the Azure AI project details from your environment variables
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}

## Set up env vars for model endpoints and keys

In [4]:
env_var = {
    "gpt-35-turbo": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT35_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT35_API_KEY"),
    },
    "gpt-4": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT4_API_KEY"),
    },
    "gpt-4o": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4o_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT4o_API_KEY"),
    },
    "gpt-4o-mini": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4o-mini_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT4o-mini_API_KEY"),
    },
}
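
Because `os.environ.get` returns `None` for an unset variable, a missing endpoint or key would only surface later, when a request is sent. The sketch below is a hypothetical helper (not part of this notebook or the SDK) that lists the missing entries in a nested dictionary shaped like `env_var`. The sample values are placeholders.

```python
def find_missing_settings(config: dict) -> list[str]:
    """Return 'model.field' names whose value is None or empty."""
    missing = []
    for model_name, settings in config.items():
        for field, value in settings.items():
            if not value:
                missing.append(f"{model_name}.{field}")
    return missing


# Hypothetical sample shaped like `env_var`; the values are placeholders.
sample_config = {
    "gpt-4": {"endpoint": "https://example.invalid/gpt-4", "key": None},
    "gpt-4o": {"endpoint": "https://example.invalid/gpt-4o", "key": "placeholder"},
}
print(find_missing_settings(sample_config))
```

Calling `find_missing_settings(env_var)` before the evaluation loop would show which models are not fully configured.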

In [5]:
with open("target_nlp_api/target_nlp_api.py") as fin:
    print(fin.read())

import requests
from typing_extensions import Self
from typing import TypedDict
from promptflow.tracing import trace


class ModelEndpoints:
    def __init__(self: Self, env: dict, model_type: str) -> str:
        self.env = env
        self.model_type = model_type

    class Response(TypedDict):
        query: str
        response: str

    @trace
    def __call__(self: Self, query: str) -> Response:
        if self.model_type == "gpt-4":
            output = self.call_gpt4_endpoint(query)
        elif self.model_type == "gpt-35-turbo":
            output = self.call_gpt35_turbo_endpoint(query)
        elif self.model_type == "gpt-4o":
            output = self.call_gpt4o_endpoint(query)
        elif self.model_type == "gpt-4o-mini":
            output = self.call_gpt4o_mini_endpoint(query)
        else:
            output = self.call_default_endpoint(query)

        return output

    def query(self: Self, endpoint: str, headers: str, payload: str) -> str:
        response = requests

In [6]:
from target_nlp_api.target_nlp_api import ModelEndpoints
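
The printed source suggests that a target is a callable that takes a `query` and returns a dictionary with `query` and `response` keys. For a dry run without network calls, a stand-in with the same call shape could look like the sketch below. `EchoTarget` is a hypothetical name and is not part of `target_nlp_api`.

```python
from typing import TypedDict


class Response(TypedDict):
    query: str
    response: str


class EchoTarget:
    """Hypothetical offline stand-in that mirrors the call shape of ModelEndpoints."""

    def __init__(self, canned_response: str) -> None:
        self.canned_response = canned_response

    def __call__(self, query: str) -> Response:
        # A real target would call a model endpoint here.
        return {"query": query, "response": self.canned_response}


target = EchoTarget("London")
print(target("What is the capital of United Kingdom?"))
```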

In [8]:
from azure.ai.evaluation import BleuScoreEvaluator
from azure.ai.evaluation import GleuScoreEvaluator
from azure.ai.evaluation import MeteorScoreEvaluator
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

bleu = BleuScoreEvaluator()
gleu = GleuScoreEvaluator()
meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)
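
To build intuition for what `RougeType.ROUGE_1` measures, the following is a simplified unigram-overlap sketch. It is not the library's implementation. The tokenizer (lowercased alphanumeric runs) is an assumption, and the function name `rouge_1` is hypothetical. For the first row of the results shown later in this notebook, it yields precision 0.125, recall 1.0, and F1 of about 0.222, which matches the ROUGE values reported for that row.

```python
import re
from collections import Counter


def rouge_1(response: str, ground_truth: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    # Assumption: lowercase alphanumeric tokens; real tokenizers may differ.
    response_tokens = Counter(re.findall(r"[a-z0-9]+", response.lower()))
    truth_tokens = Counter(re.findall(r"[a-z0-9]+", ground_truth.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((response_tokens & truth_tokens).values())
    precision = overlap / max(sum(response_tokens.values()), 1)
    recall = overlap / max(sum(truth_tokens.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("The capital of the United Kingdom is London.", "London")
print(scores)
```

A long response that contains the short reference answer gets high recall but low precision, which is the pattern visible in the evaluation results.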

In [9]:
from azure.ai.evaluation import evaluate
import random

models = ["gpt-35-turbo", "gpt-4", "gpt-4o", "gpt-4o-mini"]

for model in models:
    print(" Evaluating NLP metrics - ", model)
    print("-----------------------------------")
    # Random suffix so that repeated runs get distinct evaluation names
    random_num = random.randint(1111, 9999)
    result = evaluate(
        azure_ai_project=azure_ai_project,
        data="ai_data.jsonl",
        evaluation_name="NLP-" + model.title() + "_Run-" + str(random_num),
        target=ModelEndpoints(env_var, model),
        evaluators={
            "bleu": bleu,
            "gleu": gleu,
            "meteor": meteor,
            "rouge": rouge,
        },
        evaluator_config={
            "bleu": {
                "column_mapping": {
                    "ground_truth": "${data.ground_truth}",
                    "response": "${target.response}",
                },
            },
        },
    )
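
The loop above rebinds `result` on every iteration, so only the last model's output remains afterwards. One way to keep every run for comparison is sketched below. `run_evaluation` is a hypothetical callable that would wrap the `evaluate(...)` call for a given model, and the numbers produced by `fake_run` are made up for illustration.

```python
from typing import Callable


def collect_metrics(models: list[str], run_evaluation: Callable[[str], dict]) -> dict:
    """Run one evaluation per model and keep each run's aggregate metrics."""
    return {model: run_evaluation(model)["metrics"] for model in models}


def format_comparison(metrics_by_model: dict, metric_name: str) -> str:
    """Render one metric per model as aligned text lines."""
    lines = []
    for model, metrics in metrics_by_model.items():
        value = metrics.get(metric_name)
        shown = "n/a" if value is None else f"{value:.4f}"
        lines.append(f"{model:<15}{shown}")
    return "\n".join(lines)


# Placeholder runner with made-up numbers, for illustration only.
def fake_run(model: str) -> dict:
    return {"metrics": {"bleu.bleu_score": 0.01 * len(model)}, "rows": []}


collected = collect_metrics(["gpt-4", "gpt-4o"], fake_run)
print(format_comparison(collected, "bleu.bleu_score"))
```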

 Evaluating NLP metrics -  gpt-35-turbo
-----------------------------------


[2025-04-21 18:10:20 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_None_20250421_181016_091185, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_None_20250421_181016_091185\logs.txt
[2025-04-21 18:10:31 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_gleu_20250421_181031_697032, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_gleu_20250421_181031_697032\logs.txt
[2025-04-21 18:10:31 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_meteor_20250421_181031_697032, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_meteor_20250421_181031_697032\logs.txt
[2025-04-21 18:10:31 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_bleu_20250421_181031_697032, log path: C:\Users\sumoh

2025-04-21 18:10:20 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:10:20 +0100   43612 execution.bulk     INFO     Current system's available memory is 5601.21484375MB, memory consumption of current process is 211.26953125MB, estimated available worker count is 5601.21484375/211.26953125 = 26
2025-04-21 18:10:20 +0100   43612 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 26}.
2025-04-21 18:10:24 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-5)-Process id(28904)-Line number(0) start execution.
2025-04-21 18:10:24 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-6)-Process id(35680)-Line number(1) start execution.
2025-04-21 18:10:24 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-7)-Process i

[2025-04-21 18:10:31 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_rouge_20250421_181031_698021, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_rouge_20250421_181031_698021\logs.txt


2025-04-21 18:10:33 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:10:33 +0100   43612 execution.bulk     INFO     Average execution time for completed lines: 0.5 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-04-21 18:10:32 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:10:32 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:10:32 +0100   43612 execution.bulk     INFO     Average execution time for completed lines: 0.02 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_rouge_20250421_181031_698021"
Run status: "Completed"
Start time: "2025-04-21 18:10:31.713531+01:00"
Duration: "0:00:01.714263"
Output path: "C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_rouge_20250421_181031_698021"

2025-04-21 18:10:31 +0100   43612 execution.bulk     INFO     

[2025-04-21 18:10:50 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_None_20250421_181047_864708, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_None_20250421_181047_864708\logs.txt


2025-04-21 18:10:50 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:10:50 +0100   43612 execution.bulk     INFO     Current system's available memory is 5419.6484375MB, memory consumption of current process is 380.0625MB, estimated available worker count is 5419.6484375/380.0625 = 14
2025-04-21 18:10:50 +0100   43612 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 14}.
2025-04-21 18:10:54 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-11)-Process id(27664)-Line number(0) start execution.
2025-04-21 18:10:54 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-12)-Process id(12592)-Line number(1) start execution.
2025-04-21 18:10:54 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-14)-Process id(4728)

[2025-04-21 18:11:06 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_gleu_20250421_181106_764370, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_gleu_20250421_181106_764370\logs.txt
[2025-04-21 18:11:07 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_meteor_20250421_181106_764750, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_meteor_20250421_181106_764750\logs.txt
[2025-04-21 18:11:07 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_rouge_20250421_181106_765916, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_rouge_20250421_181106_765916\logs.txt
[2025-04-21 18:11:07 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_bleu_20250421_181106_764370, log path: C:\Users\sum

2025-04-21 18:11:07 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:11:08 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:11:08 +0100   43612 execution.bulk     INFO     Average execution time for completed lines: 0.25 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_gleu_20250421_181106_764370"
Run status: "Completed"
Start time: "2025-04-21 18:11:06.784841+01:00"
Duration: "0:00:01.435969"
Output path: "C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_gleu_20250421_181106_764370"

2025-04-21 18:11:07 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:11:08 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:11:08 +0100   43612 execution.bulk     INFO     Average execution time fo

[2025-04-21 18:11:24 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_None_20250421_181120_288463, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_None_20250421_181120_288463\logs.txt


2025-04-21 18:11:24 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:11:24 +0100   43612 execution.bulk     INFO     Current system's available memory is 4454.9453125MB, memory consumption of current process is 380.28515625MB, estimated available worker count is 4454.9453125/380.28515625 = 11
2025-04-21 18:11:24 +0100   43612 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 11}.
2025-04-21 18:11:28 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-18)-Process id(7672)-Line number(0) start execution.
2025-04-21 18:11:28 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-19)-Process id(19072)-Line number(1) start execution.
2025-04-21 18:11:28 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-20)-Process i

[2025-04-21 18:11:40 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_meteor_20250421_181140_286160, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_meteor_20250421_181140_286160\logs.txt
[2025-04-21 18:11:40 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_rouge_20250421_181140_286160, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_rouge_20250421_181140_286160\logs.txt
[2025-04-21 18:11:40 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_bleu_20250421_181140_286160, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_bleu_20250421_181140_286160\logs.txt
[2025-04-21 18:11:40 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_gleu_20250421_181140_286160, log path: C:\Users\sum

2025-04-21 18:11:40 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:11:41 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:11:41 +0100   43612 execution.bulk     INFO     Average execution time for completed lines: 0.18 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_meteor_20250421_181140_286160"
Run status: "Completed"
Start time: "2025-04-21 18:11:40.333264+01:00"
Duration: "0:00:01.656069"
Output path: "C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_meteor_20250421_181140_286160"

2025-04-21 18:11:40 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:11:41 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:11:41 +0100   43612 execution.bulk     INFO     Average execution tim

[2025-04-21 18:12:00 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_None_20250421_181155_669346, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_None_20250421_181155_669346\logs.txt


2025-04-21 18:12:00 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:12:00 +0100   43612 execution.bulk     INFO     Current system's available memory is 4231.23828125MB, memory consumption of current process is 380.81640625MB, estimated available worker count is 4231.23828125/380.81640625 = 11
2025-04-21 18:12:00 +0100   43612 execution.bulk     INFO     Set process count to 4 by taking the minimum value among the factors of {'default_worker_count': 4, 'row_count': 4, 'estimated_worker_count_based_on_memory_usage': 11}.
2025-04-21 18:12:04 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-25)-Process id(7224)-Line number(0) start execution.
2025-04-21 18:12:04 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-26)-Process id(35408)-Line number(1) start execution.
2025-04-21 18:12:04 +0100   43612 execution.bulk     INFO     Process name(SpawnProcess-27)-Process

[2025-04-21 18:12:16 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_gleu_20250421_181216_089870, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_gleu_20250421_181216_089870\logs.txt
[2025-04-21 18:12:16 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_bleu_20250421_181216_089347, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_bleu_20250421_181216_089347\logs.txt
[2025-04-21 18:12:16 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_rouge_20250421_181216_091460, log path: C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_rouge_20250421_181216_091460\logs.txt
[2025-04-21 18:12:16 +0100][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_meteor_20250421_181216_090403, log path: C:\Users\sumoh

2025-04-21 18:12:16 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:12:17 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:12:17 +0100   43612 execution.bulk     INFO     Average execution time for completed lines: 0.15 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_gleu_20250421_181216_089870"
Run status: "Completed"
Start time: "2025-04-21 18:12:16.112036+01:00"
Duration: "0:00:01.492040"
Output path: "C:\Users\sumohammed\.promptflow\.runs\azure_ai_evaluation_evaluators_gleu_20250421_181216_089870"

2025-04-21 18:12:16 +0100   43612 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-21 18:12:16 +0100   43612 execution.bulk     INFO     Finished 4 / 4 lines.
2025-04-21 18:12:16 +0100   43612 execution.bulk     INFO     Average execution time fo

{'metrics': {'bleu.bleu_score': 0.010399929600000002,
             'bleu.bleu_threshold': 0.5,
             'gleu.gleu_score': 0.015433388000000003,
             'gleu.gleu_threshold': 0.5,
             'meteor.meteor_score': 0.220047109375,
             'meteor.meteor_threshold': 0.5,
             'rouge.rouge_f1_score': 0.12047072645000001,
             'rouge.rouge_f1_score_threshold': 0.5,
             'rouge.rouge_precision': 0.0671799517,
             'rouge.rouge_precision_threshold': 0.5,
             'rouge.rouge_recall': 0.7045454545500001,
             'rouge.rouge_recall_threshold': 0.5},
 'rows': [{'inputs.context': 'United Kingdom is a country in Europe.',
           'inputs.ground_truth': 'London',
           'inputs.query': 'What is the capital of United Kingdom?',
           'line_number': 0,
           'outputs.bleu.bleu_result': 'fail',
           'outputs.bleu.bleu_score': 0.025732850300000002,
           'outputs.bleu.bleu_threshold': 0.5,
           'outputs.gleu.
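
The `inputs.*` keys in this output suggest that each line of `ai_data.jsonl` is a JSON object with `query`, `context`, and `ground_truth` fields. A minimal sketch for checking a file of that shape before running an evaluation follows. The required field set is an assumption inferred from the output, and `check_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = {"query", "context", "ground_truth"}  # assumed from the output


def check_jsonl(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means every line passed."""
    problems = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: expected a JSON object")
                continue
            missing = REQUIRED_FIELDS - set(record)
            if missing:
                problems.append(f"line {line_number}: missing {sorted(missing)}")
    return problems


# First record copied from the output above; the second is deliberately incomplete.
sample = {
    "query": "What is the capital of United Kingdom?",
    "context": "United Kingdom is a country in Europe.",
    "ground_truth": "London",
}
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.jsonl"
    sample_path.write_text(
        json.dumps(sample) + "\n" + '{"query": "incomplete"}\n', encoding="utf-8"
    )
    problems = check_jsonl(sample_path)
    print(problems)
```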

View the results below. Alternatively, you can view them in AI Foundry. Note that `result` holds only the output of the last model evaluated in the loop.

In [None]:
import pandas as pd

pd.DataFrame(result["rows"])

Unnamed: 0,outputs.query,outputs.response,inputs.query,inputs.context,inputs.ground_truth,outputs.bleu.bleu_score,outputs.bleu.bleu_result,outputs.bleu.bleu_threshold,outputs.gleu.gleu_score,outputs.gleu.gleu_result,...,outputs.rouge.rouge_precision,outputs.rouge.rouge_recall,outputs.rouge.rouge_f1_score,outputs.rouge.rouge_precision_result,outputs.rouge.rouge_recall_result,outputs.rouge.rouge_f1_score_result,outputs.rouge.rouge_precision_threshold,outputs.rouge.rouge_recall_threshold,outputs.rouge.rouge_f1_score_threshold,line_number
0,What is the capital of United Kingdom?,The capital of the United Kingdom is London.,What is the capital of United Kingdom?,United Kingdom is a country in Europe.,London,0.025733,fail,0.5,0.033333,fail,...,0.125,1.0,0.222222,fail,pass,fail,0.5,0.5,0.5,0
1,Which tent is the most waterproof?,"When considering the most waterproof tents, se...",Which tent is the most waterproof?,"#TrailMaster X4 Tent, price $250,## BrandOutdo...",The TrailMaster X4 tent has a rainfly waterpro...,0.006217,fail,0.5,0.01016,fail,...,0.032609,0.818182,0.062718,fail,pass,fail,0.5,0.5,0.5,1
2,Which camping table is the lightest?,As of my last knowledge update in October 2023...,Which camping table is the lightest?,"#BaseCamp Folding Table, price $60,## BrandCam...",The BaseCamp Folding Table has a weight of 15 lbs,0.002801,fail,0.5,0.00542,fail,...,0.03268,0.5,0.06135,fail,pass,fail,0.5,0.5,0.5,2
3,How much does TrailWalker Hiking Shoes cost?,The price of TrailWalker hiking shoes can vary...,How much does TrailWalker Hiking Shoes cost?,"#TrailWalker Hiking Shoes, price $110## BrandT...",The TrailWalker Hiking Shoes are priced at $110,0.006848,fail,0.5,0.012821,fail,...,0.078431,0.5,0.135593,fail,pass,fail,0.5,0.5,0.5,3
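
Comparing the per-row scores in this table with the `metrics` block in the earlier output suggests that each aggregate metric is the mean of its per-row scores. The sketch below checks this for two metrics, using the values as displayed (the table values are rounded, so a small tolerance is used).

```python
from statistics import mean

# Per-row values copied from the table above.
rouge_recall_rows = [1.0, 0.818182, 0.5, 0.5]
bleu_score_rows = [0.025733, 0.006217, 0.002801, 0.006848]

# Aggregate values copied from the `metrics` block of the printed result.
reported_rouge_recall = 0.7045454545500001
reported_bleu_score = 0.010399929600000002

TOLERANCE = 1e-6  # allows for rounding in the displayed table

recall_matches = abs(mean(rouge_recall_rows) - reported_rouge_recall) < TOLERANCE
bleu_matches = abs(mean(bleu_score_rows) - reported_bleu_score) < TOLERANCE
print("rouge_recall mean matches:", recall_matches)
print("bleu_score mean matches:", bleu_matches)
```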
