# Evaluate Using Performance & Quality Metrics

Contoso Home Furnishings is developing an app that generates product descriptions for their selection of furniture. The app aims to generate engaging product descriptions based on the manufacturer's specification of the furniture.

In this exercise, you will evaluate the model output for the generated product description using performance and quality metrics. Below is an example row of data for the description generated for the Contoso Home Furnishings Dining Chair:

`context`

Dining chair. Wooden seat. Four legs. Backrest. Brown. 18" wide, 20" deep, 35" tall. Holds 250 lbs.

`query`

Given the product specification for the Contoso Home Furnishings Dining Chair, provide a product description.

`ground_truth`

The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18" wide, 20" deep, 35" tall. The dining chair has a weight capacity of 250 lbs.

`response`

Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18" wide, 20" deep, and 35" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option.


## Add environment variables to the .env file

In the root of the **Evaluation and Data Generation Workshop** folder is a `.env` file. Fill in the value for each environment variable in that file. You can find the values in the following locations of the [Azure AI Foundry](https://ai.azure.com) portal:

- `AZURE_SUBSCRIPTION_ID` - On the **Overview** page of your project within **Project details**.
- `AZURE_AI_PROJECT_NAME` - At the top of the **Overview** page for your project.
- `AZURE_OPENAI_RESOURCE_GROUP` - On the **Overview** page of the **Management Center** within **Project properties**.
- `AZURE_OPENAI_SERVICE` - On the **Overview** page of your project in the **Included capabilities** tab for **Azure OpenAI Service**.
- `AZURE_OPENAI_API_VERSION` - On the [API version lifecycle](https://learn.microsoft.com/azure/ai-services/openai/api-version-deprecation#latest-ga-api-release) webpage within the **Latest GA API release** section.
- `AZURE_OPENAI_ENDPOINT` - On the **Details** tab of your model deployment within **Endpoint** (i.e., **Target URI**).
- `AZURE_OPENAI_DEPLOYMENT_NAME` - On the **Details** tab of your model deployment within **Deployment info**.

## Sign in to Azure

As a security best practice, we'll use [keyless authentication](https://learn.microsoft.com/azure/developer/ai/keyless-connections?tabs=csharp%2Cazure-cli) to authenticate to Azure OpenAI with Microsoft Entra ID. Before you can do so, you'll first need to install the **Azure CLI** per the [installation instructions](https://learn.microsoft.com/cli/azure/install-azure-cli) for your operating system.

Next, open a terminal and run `az login` to sign in to your Azure account.

## Install the package

The evaluator classes for assessing performance and quality are in the Azure AI Evaluation SDK. We'll begin by installing the package.

In [None]:
%pip install azure-ai-evaluation

## Access the environment variables

We'll import `os` and `load_dotenv` so that you can access the environment variables.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()
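
Before moving on, it can help to confirm that every variable from the `.env` file was actually loaded. The following is a minimal sketch rather than part of the SDK: the helper name `find_missing_variables` is hypothetical, and the list simply mirrors the variable names described earlier in this notebook.

```python
import os

# Variable names listed in the ".env" section of this notebook.
REQUIRED_VARIABLES = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_AI_PROJECT_NAME",
    "AZURE_OPENAI_RESOURCE_GROUP",
    "AZURE_OPENAI_SERVICE",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
]


def find_missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper: defaults to os.environ when no mapping is passed.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_variables(REQUIRED_VARIABLES)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All environment variables are set.")
```

If any names are reported as missing, revisit the `.env` file and re-run the cell that calls `load_dotenv()`.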

## Set up keyless authentication

Rather than hardcode your **key**, we'll use a keyless connection with Azure OpenAI.

In [None]:
import azure.identity

credential = azure.identity.DefaultAzureCredential()
token_provider = azure.identity.get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

token = token_provider()

## Configure the model_config

The `model_config` is a required parameter when you create an instance of an evaluator class. Let's configure the `model_config` with the following:

- Azure OpenAI endpoint
- Azure OpenAI API key (here, the Microsoft Entra ID token retrieved in the previous step, passed in the `api_key` field)
- Azure deployment

In [None]:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": token,
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
}

## Create variables for the evaluation data

Since we'll be using the same context, query, response, and ground truth for the exercises, we'll create a variable to store each string and pass the variables into our evaluations.

In [None]:
context = "Dining chair. Wooden seat. Four legs. Backrest. Brown. 18\" wide, 20\" deep, 35\" tall. Holds 250 lbs."
query = "Given the product specification for the Contoso Home Furnishings Dining Chair, provide an engaging marketing product description."
ground_truth = "The dining chair is brown and wooden with four legs and a backrest. The dimensions are 18\" wide, 20\" deep, 35\" tall. The dining chair has a weight capacity of 250 lbs."
response = "Introducing our timeless wooden dining chair, designed for both comfort and durability. Crafted with a solid wood seat and sturdy four-legged base, this chair offers reliable support for up to 250 lbs. The smooth brown finish adds a touch of rustic elegance, while the ergonomically shaped backrest ensures a comfortable dining experience. Measuring 18\" wide, 20\" deep, and 35\" tall, it's the perfect blend of form and function, making it a versatile addition to any dining space. Elevate your home with this beautifully simple yet sophisticated seating option."

## Evaluate for Groundedness

Create an instance of the `GroundednessEvaluator` and run the evaluation.



In [None]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness_eval = GroundednessEvaluator(model_config)

groundedness_score = groundedness_eval(
    response=response,
    context=context,
)

print(groundedness_score)

## Evaluate for Relevance

Create an instance of the `RelevanceEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config)

relevance_score = relevance_eval(
    response=response,
    context=context,
    query=query
)

print(relevance_score)

## Evaluate for Coherence

Create an instance of the `CoherenceEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import CoherenceEvaluator

coherence_eval = CoherenceEvaluator(model_config)

coherence_score = coherence_eval(
    response=response,
    query=query
)

print(coherence_score)

## Evaluate for Fluency

Create an instance of the `FluencyEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import FluencyEvaluator

fluency_eval = FluencyEvaluator(model_config)

fluency_score = fluency_eval(
    response=response,
    query=query
)

print(fluency_score)

## Evaluate for Similarity

Create an instance of the `SimilarityEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import SimilarityEvaluator

similarity_eval = SimilarityEvaluator(model_config)

similarity_score = similarity_eval(
    response=response,
    query=query,
    ground_truth=ground_truth
)

print(similarity_score)

## Evaluate for F1 Score

Create an instance of the `F1ScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_eval = F1ScoreEvaluator()

f1_score = f1_eval(
    response=response,
    ground_truth=ground_truth
)

print(f1_score)

## Evaluate for ROUGE
There are several types of ROUGE metrics: `ROUGE_1`, `ROUGE_2`, `ROUGE_3`, `ROUGE_4`, `ROUGE_5`, and `ROUGE_L`.

The first five types are **ROUGE-N** metrics, which measure the overlap of n-grams (contiguous sequences of 'n' words) between the generated summary and the reference summary. For example, `ROUGE_1` measures the overlap of unigrams (single words), and `ROUGE_2` measures the overlap of bigrams (two-word sequences). We provide up to 5-grams.
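
To make the n-gram overlap concrete, here is a simplified sketch of a ROUGE-N style calculation in plain Python. It assumes lowercase whitespace tokenization, and the function names `ngrams` and `rouge_n` are hypothetical; the `RougeScoreEvaluator` may tokenize and normalize text differently, so its scores will not necessarily match.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(response, ground_truth, n=1):
    """Compute n-gram overlap precision, recall, and F1 (simplified ROUGE-N)."""
    response_ngrams = ngrams(response.lower().split(), n)
    reference_ngrams = ngrams(ground_truth.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((response_ngrams & reference_ngrams).values())
    precision = overlap / max(sum(response_ngrams.values()), 1)
    recall = overlap / max(sum(reference_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Unigram overlap, then bigram overlap, on a toy pair of sentences.
print(rouge_n("the chair is brown", "the dining chair is brown", n=1))
print(rouge_n("the chair is brown", "the dining chair is brown", n=2))
```

In the toy example, every unigram of the response appears in the reference, while only two of its three bigrams do, so the bigram scores are lower.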

`ROUGE_L` measures the longest common subsequence (LCS) between the generated and reference summaries. LCS takes into account sequence similarity while maintaining word order, which makes `ROUGE_L` effective in capturing sentence-level structure.
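
The LCS idea can be sketched with a short dynamic-programming routine. As before, this is an illustration under assumptions (lowercase whitespace tokens, hypothetical function names `lcs_length` and `rouge_l`), not the evaluator's actual implementation.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two token lists."""
    # previous[j] is the LCS length of the rows of `a` seen so far and b[:j].
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(response, ground_truth):
    """Compute LCS-based precision, recall, and F1 (simplified ROUGE-L)."""
    response_tokens = response.lower().split()
    reference_tokens = ground_truth.lower().split()
    lcs = lcs_length(response_tokens, reference_tokens)
    precision = lcs / max(len(response_tokens), 1)
    recall = lcs / max(len(reference_tokens), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_l("brown chair with four legs", "the chair is brown with four legs"))
```

Unlike n-gram overlap, the matched words do not need to be adjacent; they only need to appear in the same order in both texts.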

Create an instance of the `RougeScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge_eval = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

rouge_score = rouge_eval(
    response=response,
    ground_truth=ground_truth,
)

print(rouge_score)

## Evaluate for BLEU

Create an instance of the `BleuScoreEvaluator` and run the evaluation.

**Note**: The initial run may install a package. If this occurs, run the cell once more to receive the BLEU score.

In [None]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_eval = BleuScoreEvaluator()

bleu_score = bleu_eval(
    response=response,
    ground_truth=ground_truth
)

print(bleu_score)

## Evaluate for METEOR

The `MeteorScoreEvaluator` takes `alpha`, `beta`, and `gamma` parameters, which control the balance between precision and recall and the penalty for incorrect word order (the fragmentation penalty). These parameters influence how the final METEOR score is calculated, letting you tune its sensitivity to different aspects of translation or summary quality.
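
As a rough illustration of how the three parameters interact, the sketch below follows the commonly used METEOR formulation (the one found in NLTK). It is an assumption that the evaluator's result is equivalent. The function name `meteor_from_counts` is hypothetical, and the match and chunk counts are supplied by hand, whereas a real implementation derives them by aligning words (including stems and synonyms) between the two texts.

```python
def meteor_from_counts(matches, chunks, response_length, reference_length,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine match statistics into a METEOR-style score.

    matches: number of aligned words between response and reference
    chunks:  number of contiguous, in-order runs those matches form
    """
    if matches == 0:
        return 0.0
    precision = matches / response_length
    recall = matches / reference_length
    # alpha weights precision against recall in the harmonic mean.
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # gamma caps the penalty; beta shapes how fast it grows with fragmentation.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# Same six matched words: once in a single run, once scattered across three runs.
print(meteor_from_counts(matches=6, chunks=1, response_length=6, reference_length=6))
print(meteor_from_counts(matches=6, chunks=3, response_length=6, reference_length=6))
```

With identical precision and recall, the more fragmented alignment scores lower, which is the word-order effect that `beta` and `gamma` control.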

Create an instance of the `MeteorScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor_eval = MeteorScoreEvaluator(
    alpha=0.9,
    beta=3.0,
    gamma=0.5
)

meteor_score = meteor_eval(
    response=response,
    ground_truth=ground_truth,
)

print(meteor_score)

## Evaluate for GLEU

Create an instance of the `GleuScoreEvaluator` and run the evaluation.

In [None]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu_eval = GleuScoreEvaluator()

gleu_score = gleu_eval(
    response=response,
    ground_truth=ground_truth,
)

print(gleu_score)

## Evaluate on a test dataset

We can run an evaluation on a whole dataset with the `evaluate` function, which can also apply multiple evaluators in a single run. In our case, we'll run a few evaluators on the product description dataset in the `performance-quality-data.jsonl` file. We'll also output the results to a new `evaluation_results.json` file.
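
The dataset is a JSON Lines file: one JSON object per line. As a rough sketch of that shape, the snippet below writes a single abbreviated row to a temporary file and reads it back. The file name `sample-rows.jsonl` and the shortened strings are illustrative assumptions, not the contents of the workshop dataset; the field names follow the example row shown at the start of this notebook.

```python
import json
import os
import tempfile

# One abbreviated, illustrative row using the four fields from the example.
rows = [
    {
        "context": "Dining chair. Wooden seat. Four legs. Backrest. Brown.",
        "query": "Provide a product description for the dining chair.",
        "ground_truth": "The dining chair is brown and wooden with four legs.",
        "response": "Introducing our timeless wooden dining chair.",
    }
]

# Write each row as a single line of JSON (hypothetical file name).
sample_path = os.path.join(tempfile.mkdtemp(), "sample-rows.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back, parsing one JSON object per non-empty line.
with open(sample_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(sorted(loaded[0].keys()))
```

Each key in a row corresponds to a `${data.<name>}` reference in the column mapping passed to `evaluate`.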

Let's run an evaluation using the `Relevance`, `Groundedness`, and `Fluency` evaluators.

In [None]:
from azure.ai.evaluation import evaluate

path = "performance-quality-data.jsonl"

result = evaluate(
    data=path, # provide your data here
    evaluators={
        "relevance": relevance_eval,
        "groundedness": groundedness_eval,
        "fluency": fluency_eval
    },
    # Map each evaluator input to a column in the dataset
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "context": "${data.context}",
                "ground_truth": "${data.ground_truth}"
            }
        }
    },
    # Write the results to a JSON file
    output_path="./evaluation_results.json"
)

## Print the results with Pretty Print

Now that we've run the evaluation, let's print the results using Pretty Print, which displays data in a structured and visually appealing way, making it easier to read and understand.

In [None]:
from pprint import pprint
pprint(result)

## Print the results as a table

We can also display the results as a table using `pandas`.

In [None]:
import pandas as pd
pd.DataFrame(result["rows"])

## Delete resources

If you've finished exploring Azure AI Services, delete the Azure resources that you created during the workshop.

**Note**: You may be prompted to delete your deployed model(s) before deleting the resource group.