# Evaluation with Data
In this notebook, we introduce built-in evaluators and guide you through creating your own custom evaluators. We'll cover both code-based and prompt-based custom evaluators. Finally, we'll demonstrate how to use the `evaluate` API to assess data using these evaluators.


In [None]:
# Clear any old installation.
# This matters because older versions of promptflow shipped as a single package,
# which has since been split into several.
! pip uninstall -y promptflow promptflow-cli promptflow-azure promptflow-core promptflow-devkit promptflow-tools promptflow-evals

# Install the evaluation package
! pip install promptflow-evals

## Evaluate the eval dataset using the fine-tuned model

In [None]:
import os

experiment_name="vampire-bats"
experiment_dir=f"dataset/{experiment_name}-files"

dataset_path_hf_eval = f"{experiment_dir}/{experiment_name}-hf.eval.jsonl"
dataset_path_hf_eval_answer = f"{experiment_dir}/{experiment_name}-hf.eval.answer.jsonl"
dataset_path_hf_eval_answer_baseline = f"{experiment_dir}/{experiment_name}-hf.eval.answer.baseline.jsonl"

dataset_path_ft_eval = f"{experiment_dir}/{experiment_name}-ft.eval.jsonl"
dataset_path_ft_eval_baseline = f"{experiment_dir}/{experiment_name}-ft.eval.baseline.jsonl"
dataset_path_ft_eval_score = f"{experiment_dir}/{experiment_name}-ft.eval.score.jsonl"

EVAL_OPENAI_BASE_URL_BASE = os.getenv('EVAL_OPENAI_BASE_URL_BASE')
EVAL_OPENAI_API_KEY_BASE = os.getenv('EVAL_OPENAI_API_KEY_BASE')
EVAL_OPENAI_DEPLOYMENT_BASE = os.getenv('EVAL_OPENAI_DEPLOYMENT_BASE')

EVAL_OPENAI_BASE_URL_FT = os.getenv('EVAL_OPENAI_BASE_URL_FT')
EVAL_OPENAI_API_KEY_FT = os.getenv('EVAL_OPENAI_API_KEY_FT')
EVAL_OPENAI_DEPLOYMENT_FT = os.getenv('EVAL_OPENAI_DEPLOYMENT_FT')

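def obfuscate(secret):
    # Mask all but the last four characters; tolerate an unset variable.
    if not secret:
        return "<not set>"
    return '.' * max(len(secret) - 4, 0) + secret[-4:]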

print(f"EVAL_OPENAI_BASE_URL_BASE={EVAL_OPENAI_BASE_URL_BASE}")
print(f"EVAL_OPENAI_API_KEY_BASE={obfuscate(EVAL_OPENAI_API_KEY_BASE)}")
print(f"EVAL_OPENAI_DEPLOYMENT_BASE={EVAL_OPENAI_DEPLOYMENT_BASE}")

print(f"EVAL_OPENAI_BASE_URL_FT={EVAL_OPENAI_BASE_URL_FT}")
print(f"EVAL_OPENAI_API_KEY_FT={obfuscate(EVAL_OPENAI_API_KEY_FT)}")
print(f"EVAL_OPENAI_DEPLOYMENT_FT={EVAL_OPENAI_DEPLOYMENT_FT}")
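
The cell above reads six environment variables, and any that is unset only surfaces later as a failed API call. As a minimal sketch (the helper name `missing_env_vars` is hypothetical and not part of the notebook), the variables can be checked up front:

```python
import os

# Variable names taken from the cell above; adjust if your environment differs.
REQUIRED_VARS = [
    "EVAL_OPENAI_BASE_URL_BASE",
    "EVAL_OPENAI_API_KEY_BASE",
    "EVAL_OPENAI_DEPLOYMENT_BASE",
    "EVAL_OPENAI_BASE_URL_FT",
    "EVAL_OPENAI_API_KEY_FT",
    "EVAL_OPENAI_DEPLOYMENT_FT",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All evaluation endpoints are configured.")
```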


In [None]:

a = input()
print(a)

### Baseline

In [None]:
!unset AZURE_OPENAI_ENDPOINT && \
unset AZURE_OPENAI_API_KEY && \
unset OPENAI_API_VERSION && \
OPENAI_BASE_URL=$EVAL_OPENAI_BASE_URL_BASE \
OPENAI_API_KEY=$EVAL_OPENAI_API_KEY_BASE \
python ../eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer_baseline \
    --model $EVAL_OPENAI_DEPLOYMENT_BASE

### Fine-tuned model

In [None]:
!unset AZURE_OPENAI_ENDPOINT && \
unset AZURE_OPENAI_API_KEY && \
unset OPENAI_API_VERSION && \
OPENAI_BASE_URL=$EVAL_OPENAI_BASE_URL_FT \
OPENAI_API_KEY=$EVAL_OPENAI_API_KEY_FT \
python ../eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer \
    --model $EVAL_OPENAI_DEPLOYMENT_FT

## 0. Prepare eval dataset

In [None]:
! python ../format.py \
    --input $dataset_path_hf_eval_answer \
    --input-type jsonl \
    --output $dataset_path_ft_eval \
    --output-format eval

In [None]:
! python ../format.py \
    --input $dataset_path_hf_eval_answer_baseline \
    --input-type jsonl \
    --output $dataset_path_ft_eval_baseline \
    --output-format eval

In [None]:
import pandas as pd

In [None]:
df = pd.read_json(dataset_path_ft_eval, lines=True)
df.head()

In [None]:
pd.read_json(dataset_path_ft_eval_baseline, lines=True).head()
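
Later cells read the `question`, `context`, `gold_answer`, and `final_answer` columns from this file, so it can help to confirm every row has them before running an evaluator. The following is a rough sketch using only the standard library; the helper `find_incomplete_rows` and the two sample rows are hypothetical stand-ins for the real eval file.

```python
import io
import json

# Column names used later in this notebook when calling the evaluator.
EXPECTED_COLUMNS = {"question", "context", "gold_answer", "final_answer"}

def find_incomplete_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, missing_columns) for each JSONL row lacking a column."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        missing = sorted(set(expected) - set(row))
        if missing:
            problems.append((number, missing))
    return problems

# Hypothetical two-row sample; the second row has no final_answer.
sample_file = io.StringIO(
    '{"question": "q1", "context": "c1", "gold_answer": "g1", "final_answer": "a1"}\n'
    '{"question": "q2", "context": "c2", "gold_answer": "g2"}\n'
)
print(find_incomplete_rows(sample_file))
```

To check the real file, pass an open file handle, for example `find_incomplete_rows(open(dataset_path_ft_eval))`.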

## 1. Built-in Evaluators

The table below lists all the built-in evaluators we support. In the following sections, we will select a few of these evaluators to demonstrate how to use them.

| Category       | Namespace                                        | Evaluator Class           | Notes                                             |
|----------------|--------------------------------------------------|---------------------------|---------------------------------------------------|
| Quality        | promptflow.evals.evaluators                      | GroundednessEvaluator     | Measures how well the answer is entailed by the context and is not hallucinated. |
|                |                                                  | RelevanceEvaluator        | How well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. |
|                |                                                  | CoherenceEvaluator        | How well all the sentences fit together and sound natural as a whole. |
|                |                                                  | FluencyEvaluator          | Quality of individual sentences in the answer, and whether they are well-written and grammatically correct. |
|                |                                                  | SimilarityEvaluator       | Measures the similarity between the predicted answer and the correct answer. |
|                |                                                  | F1ScoreEvaluator          | F1 score between the predicted answer and the correct answer. |
| Content Safety | promptflow.evals.evaluators.content_safety       | ViolenceEvaluator         |                                                   |
|                |                                                  | SexualEvaluator           |                                                   |
|                |                                                  | SelfHarmEvaluator         |                                                   |
|                |                                                  | HateUnfairnessEvaluator   |                                                   |
| Composite      | promptflow.evals.evaluators                      | QAEvaluator               | Built on top of individual quality evaluators.    |
|                |                                                  | ChatEvaluator             | Similar to QAEvaluator but designed for evaluating chat messages. |
|                |                                                  | ContentSafetyEvaluator    | Built on top of individual content safety evaluators. |
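
To make the F1 row in the table more concrete, here is a minimal sketch of a token-overlap F1 score as it is commonly defined for question answering. It illustrates the metric only and is not necessarily how `F1ScoreEvaluator` is implemented; the function name `token_f1` and the whitespace tokenization are assumptions.

```python
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Precision is 3/3 and recall is 3/4 for this pair.
print(token_f1("bats drink blood", "vampire bats drink blood"))
```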



### 1.1 Quality Evaluator

In [None]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

azure_endpoint=os.environ.get("EVAL_AZURE_OPENAI_ENDPOINT_EVALUATORS")
api_key=os.environ.get("EVAL_AZURE_OPENAI_API_KEY_EVALUATORS")
azure_deployment=os.environ.get("EVAL_AZURE_OPENAI_DEPLOYMENT_EVALUATORS")
api_version=os.environ.get("EVAL_OPENAI_API_VERSION_EVALUATORS")

# f-strings avoid a TypeError when a variable is unset (None)
print(f"azure_endpoint={azure_endpoint}")
print(f"azure_deployment={azure_deployment}")
print(f"api_version={api_version}")

# Initialize Azure OpenAI Connection
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    azure_deployment=azure_deployment,
    api_version=api_version,
)

In [None]:
from promptflow.evals.evaluators import RelevanceEvaluator

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)

In [None]:
sample=df.iloc[1]
sample

In [None]:
# Running Relevance Evaluator on single input row
relevance_score = relevance_eval(
    question=sample['question'],
    answer=sample['final_answer'],
    context=sample['context'],
    ground_truth=sample['gold_answer'],
)
print(relevance_score)
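
The introduction also mentions code-based custom evaluators. As a minimal sketch, assuming that a custom evaluator only needs to be a callable that takes keyword arguments and returns a dictionary of metrics (as the built-in evaluators do), one could look like this. The class `AnswerLengthEvaluator` and its `answer_length` metric are hypothetical examples, not part of the library.

```python
class AnswerLengthEvaluator:
    """Hypothetical code-based evaluator: reports the answer length in words."""

    def __call__(self, *, answer: str, **kwargs):
        # Extra keyword arguments (question, context, ...) are accepted and ignored.
        return {"answer_length": len(answer.split())}

answer_length_eval = AnswerLengthEvaluator()
print(answer_length_eval(answer="Vampire bats feed on blood."))
```

Accepting `**kwargs` keeps the call signature compatible with rows that carry more columns than the evaluator needs.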

## 2. Batch evaluate

In [None]:
df = pd.read_json(dataset_path_ft_eval, lines=True)
df.head()

In [None]:
!AZURE_OPENAI_ENDPOINT=$azure_endpoint \
    AZURE_OPENAI_API_KEY=$api_key \
    AZURE_OPENAI_DEPLOYMENT=$azure_deployment \
    OPENAI_API_VERSION=$api_version \
    python ../pfeval.py \
    --input $dataset_path_ft_eval \
    --output $dataset_path_ft_eval_score

In [None]:
df = pd.read_json(dataset_path_ft_eval_score, lines=True)
df.head()

In [None]:
df.describe()
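
Beyond `describe()`, a single averaged metric is often what gets compared between the baseline and the fine-tuned model. The sketch below averages one metric over JSONL rows with the standard library. The metric name `gpt_relevance` and the sample rows are assumptions made for illustration; the actual columns depend on what `pfeval.py` writes.

```python
import json
from statistics import mean

def mean_score(jsonl_lines, metric):
    """Average a numeric metric over JSONL rows, skipping rows without it."""
    values = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        value = json.loads(line).get(metric)
        if isinstance(value, (int, float)):
            values.append(value)
    return mean(values) if values else None

# Hypothetical score rows for the two models.
fine_tuned_rows = ['{"gpt_relevance": 4}', '{"gpt_relevance": 5}']
baseline_rows = ['{"gpt_relevance": 3}', '{"gpt_relevance": 4}']

print("fine-tuned:", mean_score(fine_tuned_rows, "gpt_relevance"))
print("baseline:  ", mean_score(baseline_rows, "gpt_relevance"))
```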

## 3. Using Evaluate API to evaluate with data

In previous sections, we walked you through how to use built-in evaluators to evaluate a single row and how to define your own custom evaluators. Now, we will show you how to use these evaluators with the powerful `evaluate` API to assess an entire dataset.

First, let's take a peek at what the data looks like.

In [None]:
df.head()

Now, we will invoke the `evaluate` API using the relevance evaluator that we already initialized.

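Additionally, a column mapping tells `evaluate` which dataset column supplies an evaluator input when the names differ. Here it fills the evaluator's `answer` input from a dataset column; inputs such as `question` and `context` already match their column names.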
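
To illustrate the idea behind a `${data.<column>}` reference, here is a toy resolver that builds the keyword arguments for one row. It is a simplified model written for this explanation, not the library's implementation; `resolve_inputs` and the sample row are hypothetical.

```python
import re

# Matches references of the form ${data.column_name}.
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")

def resolve_inputs(row, mapping, input_names):
    """Build evaluator keyword arguments for one row (toy model of column mapping)."""
    inputs = {}
    for name in input_names:
        reference = mapping.get(name)
        if reference is None:
            column = name  # no mapping: use the column with the same name
        else:
            match = _DATA_REF.match(reference)
            if match is None:
                raise ValueError(f"Unsupported reference: {reference!r}")
            column = match.group(1)
        inputs[name] = row[column]
    return inputs

row = {"question": "q", "answer": "a", "gold_answer": "g"}
mapping = {"ground_truth": "${data.gold_answer}"}
print(resolve_inputs(row, mapping, ["question", "answer", "ground_truth"]))
```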

In [None]:
from promptflow.evals.evaluate import evaluate

result = evaluate(
    data=dataset_path_ft_eval,
    evaluators={
        "relevance": relevance_eval
    },
    # Column mapping: score the model's answer (the `final_answer` column),
    # matching the single-row example above
    evaluator_config={
        "default": {
            "answer": "${data.final_answer}"
        }
    }
)

from IPython.display import display, JSON
display(JSON(result))


Finally, let's check the results produced by the evaluate API.

In [None]:
# Check the results using Azure AI Studio UI
print(result["studio_url"])
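
The result can also be inspected without the UI. The sketch below assumes the returned dictionary has `metrics`, `rows`, and `studio_url` keys and that `studio_url` may be empty when no Azure AI project is configured. These are assumptions to verify against your installed version, and the sample values are hypothetical.

```python
def summarize_result(result):
    """Return printable summary lines for an evaluate-style result dictionary."""
    lines = []
    for name, value in sorted(result.get("metrics", {}).items()):
        lines.append(f"{name}: {value:.2f}")
    lines.append(f"rows evaluated: {len(result.get('rows', []))}")
    url = result.get("studio_url")
    lines.append(f"studio_url: {url}" if url else "studio_url: not available")
    return lines

# Hypothetical result standing in for the value returned by evaluate().
example_result = {
    "metrics": {"relevance.gpt_relevance": 4.5},
    "rows": [{}, {}],
    "studio_url": None,
}
for line in summarize_result(example_result):
    print(line)
```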