# Evaluate Using Performance & Quality Metrics

Contoso Home Furnishings is developing an app that generates product descriptions for their selection of furniture. The app aims to generate engaging product descriptions based on the manufacturer's specification of the furniture.

In this exercise, you will evaluate the model output for a generated product description using performance and quality metrics. Below is an example row of data for the description generated for the Contoso Home Furnishings Dining Chair:

`context`

Dining chair. Wooden seat. Four legs. Backrest. Brown. 18" wide, 20" deep, 35" tall. Holds 250 lbs.

`query`

Given the product specification for the Contoso Home Furnishings Dining Chair, provide a product description.

`ground_truth`

The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18" wide, 20" deep, 35" tall. The dining chair has a weight capacity of 250 lbs.

`response`

Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18" wide, 20" deep, and 35" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option.


## Install the package

The evaluator classes for assessing performance and quality are in the Azure AI Evaluation SDK. We'll begin by installing the package.

In [None]:
%pip install azure-ai-evaluation

## Import packages

We'll import `os` so that you can access the environment variables that you'll set in the next step.

In [None]:
import os

## Set environment variables to create an instance of the evaluators

We'll now set the environment variables required to create an instance of the evaluators. You'll need the following:

- Azure OpenAI endpoint
- Azure OpenAI API Key

You can locate your **Azure OpenAI endpoint** and **Azure OpenAI API Key** by navigating to **Connections** and copying the respective credentials for your Azure OpenAI resource.

In [None]:
os.environ['AZURE_OPENAI_ENDPOINT'] = 'Your Azure OpenAI endpoint'
os.environ['AZURE_OPENAI_API_KEY'] = 'Your Azure OpenAI API key'
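
Before moving on, it can help to confirm that both variables are actually set. The following is a minimal sketch, not part of the SDK: the helper name `missing_settings` is hypothetical, and it only checks that each value is present and non-empty (it cannot tell whether a credential is valid).

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")


def missing_settings(env, names=REQUIRED_VARS):
    """Return the names whose values are absent or blank in `env`."""
    return [name for name in names if not (env.get(name) or "").strip()]


# Hypothetical usage: report any variable that still needs a value.
missing = missing_settings(os.environ)
if missing:
    print("Set these variables before continuing:", ", ".join(missing))
```

Passing the environment in as a mapping keeps the check easy to try with a plain dictionary.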

## Configure the model_config

The `model_config` is a required parameter when creating an instance of an AI-assisted evaluator class. Let's configure the `model_config` with the following:

- Azure OpenAI endpoint
- Azure OpenAI API key

In [None]:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY")
}

## Create variables for the evaluation data

Since we'll be using the same context, query, response, and ground truth for the exercises, we'll create a variable to store each string and pass the variables into our evaluations.

In [None]:
context = "Dining chair. Wooden seat. Four legs. Backrest. Brown. 18\" wide, 20\" deep, 35\" tall. Holds 250 lbs."
query = "Given the product specification for the Contoso Home Furnishings Dining Chair, provide an engaging marketing product description."
ground_truth = "The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18\" wide, 20\" deep, 35\" tall. The dining chair has a weight capacity of 250 lbs."
response = "Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18\" wide, 20\" deep, and 35\" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option."

## Evaluate for Groundedness

Create an instance of the `GroundednessEvaluator` and run the evaluation.



In [None]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness_eval = GroundednessEvaluator(model_config)

groundedness_score = groundedness_eval(
    response=response,
    context=context,
)

print(groundedness_score)

## Evaluate for Relevance

Create an instance of the `RelevanceEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config)

relevance_score = relevance_eval(
    response=response,
    context=context,
    query=query
)

print(relevance_score)

## Evaluate for Coherence

Create an instance of the `CoherenceEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import CoherenceEvaluator

coherence_eval = CoherenceEvaluator(model_config)

coherence_score = coherence_eval(
    response=response,
    query=query
)

print(coherence_score)

## Evaluate for Fluency

Create an instance of the `FluencyEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import FluencyEvaluator

fluency_eval = FluencyEvaluator(model_config)

fluency_score = fluency_eval(
    response=response,
    query=query
)

print(fluency_score)

## Evaluate for Similarity

Create an instance of the `SimilarityEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import SimilarityEvaluator

similarity_eval = SimilarityEvaluator(model_config)

similarity_score = similarity_eval(
    response=response,
    query=query,
    ground_truth=ground_truth
)

print(similarity_score)

## Evaluate for F1 Score

Create an instance of the `F1ScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_eval = F1ScoreEvaluator()

f1_score = f1_eval(
    response=response,
    ground_truth=ground_truth
)

print(f1_score)
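
To make the metric concrete, here is a minimal sketch of a token-level F1 calculation: the harmonic mean of precision and recall over the tokens shared by the response and the ground truth. The function name `token_f1` and the lowercase, whitespace-based tokenization are assumptions for illustration; the evaluator may normalize text differently, so its values need not match this sketch.

```python
from collections import Counter


def token_f1(response, ground_truth):
    """Harmonic mean of token precision and recall (lowercased whitespace tokens)."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Multiset intersection: a shared token counts only as often as it
    # appears in both texts.
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("a brown wooden chair", "the chair is brown"))  # 0.5
```

In the example, two of the four tokens on each side are shared, so precision and recall are both 0.5.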

## Evaluate with the QA Evaluator

The `QAEvaluator` is a composite evaluator that runs the quality evaluators together and returns their combined metrics in a single output for each query and response pair. This composite evaluator contains the following evaluators: `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, and the `F1ScoreEvaluator`.

Create an instance of the `QAEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config)

qa_score = qa_eval(
    query=query,
    response=response,
    context=context,
    ground_truth=ground_truth
)

print(qa_score)
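
The composite idea itself can be sketched in a few lines: run several evaluators on the same inputs and merge their dictionary results. This is only an illustration of the pattern, not how `QAEvaluator` is implemented; `make_composite` and the two toy evaluators are hypothetical names.

```python
def make_composite(evaluators):
    """Return a callable that runs every evaluator and merges their dict results."""
    def run(**inputs):
        combined = {}
        for evaluate_one in evaluators:
            combined.update(evaluate_one(**inputs))
        return combined
    return run


# Hypothetical toy evaluators; each accepts keyword inputs and returns a dict.
def length_ratio(response, ground_truth, **_):
    return {"length_ratio": len(response.split()) / len(ground_truth.split())}


def exact_match(response, ground_truth, **_):
    return {"exact_match": float(response == ground_truth)}


toy_eval = make_composite([length_ratio, exact_match])
print(toy_eval(response="a brown chair", ground_truth="a brown wooden chair"))
# {'length_ratio': 0.75, 'exact_match': 0.0}
```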

## Evaluate for ROUGE

There are several types of ROUGE metrics: `ROUGE_1`, `ROUGE_2`, `ROUGE_3`, `ROUGE_4`, `ROUGE_5`, and `ROUGE_L`.

The first five types are **ROUGE-N** metrics, which measure the overlap of n-grams (contiguous sequences of 'n' words) between the generated summary and the reference summary. For example, `ROUGE_1` measures the overlap of unigrams (single words), and `ROUGE_2` measures the overlap of bigrams (two-word sequences). We provide up to 5-grams.
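
As a rough illustration of n-gram overlap, the sketch below computes precision and recall for a chosen `n`. The helper names `ngrams` and `rouge_n` and the lowercase, whitespace-based tokenization are assumptions; the evaluator's own tokenization may differ, so treat the numbers as illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of contiguous n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(response, ground_truth, n=1):
    """Return (precision, recall) of n-gram overlap, using whitespace tokens."""
    response_grams = ngrams(response.lower().split(), n)
    truth_grams = ngrams(ground_truth.lower().split(), n)
    overlap = sum((response_grams & truth_grams).values())
    precision = overlap / max(sum(response_grams.values()), 1)
    recall = overlap / max(sum(truth_grams.values()), 1)
    return precision, recall


print(rouge_n("the brown chair", "the chair is brown", n=1))  # (1.0, 0.75)
print(rouge_n("the brown chair", "the chair is brown", n=2))  # (0.0, 0.0)
```

The same pair of texts shares every unigram of the response but no bigram, which shows how larger `n` values become sensitive to word order.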

`ROUGE_L` measures the longest common subsequence (LCS) between the generated and reference summaries. LCS takes into account sequence similarity while maintaining word order, which makes `ROUGE_L` effective in capturing sentence-level structure.
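
The LCS idea can be sketched with a standard dynamic-programming routine over tokens. The function name `lcs_length` and the whitespace tokenization are assumptions for illustration; dividing the LCS length by each text's length gives an LCS-based precision and recall.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # previous[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


response_tokens = "the chair is sturdy and brown".split()
truth_tokens = "the chair is brown".split()
lcs = lcs_length(response_tokens, truth_tokens)
# LCS-based precision and recall for this pair.
print(lcs, lcs / len(response_tokens), lcs / len(truth_tokens))
```

Unlike n-gram overlap, the matched words do not have to be adjacent, only in the same order.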

Create an instance of the `RougeScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge_eval = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

rouge_score = rouge_eval(
    response=response,
    ground_truth=ground_truth,
)

print(rouge_score)

## Evaluate for BLEU

Create an instance of the `BleuScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_eval = BleuScoreEvaluator()

bleu_score = bleu_eval(
    response=response,
    ground_truth=ground_truth
)

print(bleu_score)

## Evaluate for METEOR

The METEOR metric takes `alpha`, `beta`, and `gamma` parameters, which control the balance between precision, recall, and the penalty for incorrect word order (fragmentation penalty). These parameters influence how the final METEOR score is calculated, helping fine-tune its sensitivity to different aspects of the translation or summary quality.
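
To show how the three parameters interact, here is a minimal sketch of a commonly used METEOR formulation. It assumes the match statistics are already known: the number of matched words, the length of each text, and the number of contiguous matched chunks. The function name `meteor_from_counts` is hypothetical, and word matching (exact, stem, or synonym) is left out entirely, so this is not a substitute for the evaluator.

```python
def meteor_from_counts(matches, response_len, truth_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine match statistics into a METEOR-style score."""
    if matches == 0:
        return 0.0
    precision = matches / response_len
    recall = matches / truth_len
    # alpha sets how strongly recall is weighted against precision.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # chunks / matches is the fragmentation: more chunks means more reordering.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# Same four matched words: in order (1 chunk) versus fully scrambled (4 chunks).
print(meteor_from_counts(matches=4, response_len=4, truth_len=4, chunks=1))
print(meteor_from_counts(matches=4, response_len=4, truth_len=4, chunks=4))
```

With every word matched, only the fragmentation penalty separates the two calls: `gamma` caps the penalty at one half of the score, and `beta` controls how quickly it grows as the matches break into more chunks.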

Create an instance of the `MeteorScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor_eval = MeteorScoreEvaluator(
    alpha=0.9,
    beta=3.0,
    gamma=0.5
)

meteor_score = meteor_eval(
    response=response,
    ground_truth=ground_truth,
)

print(meteor_score)

## Evaluate for GLEU

Create an instance of the `GleuScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu_eval = GleuScoreEvaluator()

gleu_score = gleu_eval(
    response=response,
    ground_truth=ground_truth,
)

print(gleu_score)

## Evaluate on a test dataset

We can run an evaluation for a dataset with the `evaluate` function. In addition, we can run the evaluation using multiple evaluators. In our case, we're going to run an evaluation using a few evaluators on the product description dataset within the `performance-quality-data.jsonl` file.

Let's run an evaluation using the `Relevance`, `Groundedness`, and `Fluency` evaluators.

In [None]:
from azure.ai.evaluation import evaluate

path = "performance-quality-data.jsonl"

result = evaluate(
    data=path, # provide your data here
    evaluators={
        "relevance": relevance_eval,
        "groundedness": groundedness_eval,
        "fluency": fluency_eval
    },
    # column mapping
    evaluator_config={
        "default": {
            "query": "${data.query}",
            "response": "${data.response}",
            "context": "${data.context}",
            "ground_truth": "${data.ground_truth}"
        }
    },
)

print(result)
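
Because the column mapping refers to `query`, `response`, `context`, and `ground_truth`, it can be useful to check that every row of the JSON Lines data has those keys before running a long evaluation. The sketch below is an illustration only: `parse_jsonl` and `missing_columns` are hypothetical helpers, and the sample lines are made up.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_columns(rows, required=("query", "response", "context", "ground_truth")):
    """Map each row index to the required keys that the row lacks."""
    problems = {}
    for index, row in enumerate(rows):
        absent = [key for key in required if key not in row]
        if absent:
            problems[index] = absent
    return problems


# Hypothetical sample lines; with a real file, pass the open file object instead.
sample = [
    '{"query": "q", "response": "r", "context": "c", "ground_truth": "g"}',
    '{"query": "q", "response": "r"}',
]
print(missing_columns(parse_jsonl(sample)))  # {1: ['context', 'ground_truth']}
```

An empty dictionary means every row has the columns that the mapping expects.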

## Delete resources

If you've finished exploring Azure AI Services, delete the Azure resource that you created during the workshop.

**Note**: You may be prompted to delete your deployed model(s) before deleting the resource group.