# Evaluate with math evaluators

## Objective
This notebook demonstrates how to use math-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:
 - Understand different math evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
 - Evaluate a dataset using these evaluators.

## Time
You should expect to spend about 10 minutes running this notebook.

## Before you begin

### Installation
Install the package required to run this notebook.

In [None]:
# Install the packages
%pip install azure-ai-evaluation

Set the following environment variables for use in this notebook:

In [1]:
import os

os.environ["AZURE_SUBSCRIPTION_ID"] = ""
os.environ["AZURE_RESOURCE_GROUP"] = ""
os.environ["AZURE_PROJECT_NAME"] = ""

## Math Evaluators

### BleuScoreEvaluator

The BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine
translation, and is also widely applied to text summarization and text generation. It measures how closely the
generated text matches the reference text based on n-gram overlap. The BLEU score ranges from 0 to 1, with higher
scores indicating better quality.

In [1]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu = BleuScoreEvaluator()


In [None]:
result = bleu(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")

print(result)
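
To make the idea of n-gram matching concrete, here is a minimal pure-Python sketch of the classic BLEU recipe: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and a brevity penalty. It is an illustration only and not the implementation behind `BleuScoreEvaluator`; the helper names (`tokenize`, `modified_precision`, `simple_bleu`) are hypothetical, the tokenizer is a simplifying assumption, and no smoothing is applied, so its numbers will generally differ from the evaluator's output.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(response_tokens, reference_tokens, n):
    # Clip each response n-gram count by its count in the reference.
    response_counts = Counter(ngrams(response_tokens, n))
    reference_counts = Counter(ngrams(reference_tokens, n))
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in response_counts.items())
    total = sum(response_counts.values())
    return clipped / total if total else 0.0


def simple_bleu(response, ground_truth, max_n=4):
    resp, ref = tokenize(response), tokenize(ground_truth)
    precisions = [modified_precision(resp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0, precisions  # this sketch applies no smoothing
    # Brevity penalty: penalize responses shorter than the reference.
    bp = 1.0 if len(resp) >= len(ref) else math.exp(1 - len(ref) / len(resp))
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return score, precisions


sketch_score, sketch_precisions = simple_bleu(
    "Tokyo is the capital of Japan.",
    "The capital of Japan is Tokyo.",
)
print(sketch_precisions, sketch_score)
```

Every word of the response appears in the reference, so the unigram precision is perfect, but the different word order lowers the higher-order precisions and therefore the combined score.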

### GleuScoreEvaluator

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by
evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for
sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for
use cases such as machine translation, text summarization, and text generation.

In [2]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu = GleuScoreEvaluator()

In [None]:
result = gleu(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")

print(result)
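
The "precision and recall" balance described above can be sketched in a few lines. The following is a simplified illustration of the sentence-level GLEU idea (pool all 1- to 4-grams, compute overlap-based precision and recall, take the smaller of the two), not the code used by `GleuScoreEvaluator`. The function names and the tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def all_ngrams(tokens, min_n=1, max_n=4):
    # Count every n-gram of length min_n..max_n in one multiset.
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def simple_gleu(response, ground_truth):
    resp_counts = all_ngrams(tokenize(response))
    ref_counts = all_ngrams(tokenize(ground_truth))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((resp_counts & ref_counts).values())
    precision = overlap / max(sum(resp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return min(precision, recall)


sketch_gleu = simple_gleu(
    "Tokyo is the capital of Japan.",
    "The capital of Japan is Tokyo.",
)
print(sketch_gleu)
```

Taking the minimum means a response cannot score well by being either very short (high precision, low recall) or very long (high recall, low precision).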

### MeteorScoreEvaluator

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluator compares generated text
to reference text, focusing on precision, recall, and content alignment. It addresses limitations of other metrics
such as BLEU by taking synonyms, word stems, and paraphrasing into account, which lets it capture meaning and
language variation more accurately. In addition to machine translation and text summarization, paraphrase
detection is a good use case for the METEOR score.

In [3]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor = MeteorScoreEvaluator(
    alpha=0.9,
    beta=3.0,
    gamma=0.5
)

In [None]:
result = meteor(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")

print(result)
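
The `alpha`, `beta`, and `gamma` arguments are not explained in this notebook. Assuming they follow the common METEOR formulation (the one used by NLTK's `meteor_score`), `alpha` weights precision against recall in the harmonic mean, while `gamma` and `beta` set the size and shape of the penalty for fragmented word order. The sketch below applies that formula to hand-counted values; the function name `meteor_from_counts` is hypothetical, and the counts assume exact word matches with punctuation ignored, so the result is not expected to equal the evaluator's output.

```python
def meteor_from_counts(matches, response_len, reference_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine match statistics into a METEOR-style score (illustrative)."""
    if matches == 0:
        return 0.0
    precision = matches / response_len
    recall = matches / reference_len
    # Weighted harmonic mean; alpha close to 1 emphasizes recall.
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks means more scrambled word order.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# "Tokyo is the capital of Japan" vs. "The capital of Japan is Tokyo":
# all 6 words match, in 3 contiguous chunks
# ("tokyo", "is", "the capital of japan").
sketch_meteor = meteor_from_counts(
    matches=6, response_len=6, reference_len=6, chunks=3
)
print(sketch_meteor)
```

With these counts, precision and recall are both perfect, so the only deduction comes from the word-order penalty.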

### RougeScoreEvaluator

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
summarization and machine translation. It measures the overlap between generated text and reference summaries.
ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
summarization and document comparison are among the best use cases for ROUGE, particularly in scenarios where
text coherence and relevance are critical.


In [4]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

In [None]:
result = rouge(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")

print(result)
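
`RougeType.ROUGE_1` selects the unigram variant of ROUGE. As a rough illustration of what that measures, the sketch below computes unigram precision, recall, and F1 by hand. It is not the evaluator's implementation; the function name `simple_rouge_1` and the tokenizer are assumptions for this example. The sentence pair is the first row of the dataset evaluated later in this notebook.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase word tokens, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def simple_rouge_1(response, ground_truth):
    resp_counts = Counter(tokenize(response))
    ref_counts = Counter(tokenize(ground_truth))
    # Shared unigrams, each counted at most as often as it occurs in both texts.
    overlap = sum((resp_counts & ref_counts).values())
    precision = overlap / max(sum(resp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


sketch_rouge = simple_rouge_1(
    "The cat sits on the mat.",
    "A cat is sitting on the mat.",
)
print(sketch_rouge)
```

Four of the six response words appear in the seven-word reference, giving a precision of 4/6, a recall of 4/7, and an F1 of 8/13. These agree with the `rouge_precision`, `rouge_recall`, and `rouge_f1_score` values reported for that row in the results at the end of this notebook.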

## Evaluate using math evaluators on a dataset
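
The contents of `data.jsonl` are not shown in this notebook. Judging from the `inputs.response` and `inputs.ground_truth` keys in the results, each line is presumably a JSON object with a `response` field and a `ground_truth` field. The sketch below writes a small file in that shape, using only the two rows that are fully visible in the output; the file name `sample_data.jsonl` is a hypothetical choice, and the original dataset contains more rows than are reproduced here.

```python
import json
from pathlib import Path

# Rows copied from the results shown at the end of this notebook.
rows = [
    {
        "response": "The cat sits on the mat.",
        "ground_truth": "A cat is sitting on the mat.",
    },
    {
        "response": "She enjoys reading books.",
        "ground_truth": "She loves to read books.",
    },
]

sample_path = Path("sample_data.jsonl")  # hypothetical file name

# JSON Lines: one JSON object per line.
with sample_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm that every line parses on its own.
loaded = [
    json.loads(line)
    for line in sample_path.read_text(encoding="utf-8").splitlines()
]
print(loaded)
```

The field names match the keyword arguments used when calling each evaluator directly (`response` and `ground_truth`), which is how the columns are mapped to evaluator inputs in this example.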

The following code calls the `evaluate` API with the BLEU, GLEU, METEOR, and ROUGE evaluators to score every row of a dataset.

In [6]:
import os
from azure.ai.evaluation import evaluate

azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}

result = evaluate(
    data="data.jsonl",
    evaluators={
        "bleu": bleu,
        "gleu": gleu,
        "meteor": meteor,
        "rouge": rouge,
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    azure_ai_project=azure_ai_project
)

ERROR:azure.ai.evaluation._evaluate._utils:Unable to log traces as trace destination was not defined.


View the results:

In [7]:
from pprint import pprint
pprint(result)

{'metrics': {'bleu.bleu_score': 0.27619794053333335,
             'gleu.gleu_score': 0.34843304843333334,
             'meteor.meteor_score': 0.7349908339666668,
             'rouge.rouge_f1_score': 0.5913715913666667,
             'rouge.rouge_precision': 0.6666666666666666,
             'rouge.rouge_recall': 0.5321428571333334},
 'rows': [{'inputs.ground_truth': 'A cat is sitting on the mat.',
           'inputs.response': 'The cat sits on the mat.',
           'outputs.bleu.bleu_score': 0.37684991640000004,
           'outputs.gleu.gleu_score': 0.4230769231,
           'outputs.meteor.meteor_score': 0.7454289733,
           'outputs.rouge.rouge_f1_score': 0.6153846154,
           'outputs.rouge.rouge_precision': 0.6666666667000001,
           'outputs.rouge.rouge_recall': 0.5714285714},
          {'inputs.ground_truth': 'She loves to read books.',
           'inputs.response': 'She enjoys reading books.',
           'outputs.bleu.bleu_score': 0.1098261402,
           'outputs.gleu.g