# AOAI Evaluators POC
The goal of this proof of concept is to help us understand how to integrate graders into the evaluation SDK and identify potential gaps.


## Grader Architecture
The diagram below is based on what we learned from the grader code [here](https://msdata.visualstudio.com/Vienna/_git/azureml-asset?path=/assets/training/evaluation/src/graders/graders.py&version=GBmain).

![Grader Arch Image](grader_arch.png)


Three major concepts:
- *Model Sampler*: takes the evaluation input prompt and generates a completion from the user's requested deployed model.
- *Graders*: compare and score the evaluation completion against the provided expected outcome.
- *Aggregators*: take the results of the grading step and aggregate them over the evaluation job.
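
The three concepts can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: `sample`, `grade`, and `run_eval` are hypothetical names, the sampler is a stub that never calls a model, and the exact-match grader stands in for any real grader.

```python
from statistics import mean
from typing import Callable, Dict, List

Row = Dict[str, str]


def sample(row: Row) -> str:
    """Stand-in model sampler (hypothetical stub).

    A real sampler would send row["prompt"] to the deployed model;
    this one always answers "A" so the example runs offline.
    """
    return "A"


def grade(completion: str, expected: str) -> float:
    """Grader: score one completion against its expected outcome."""
    return 1.0 if completion.strip() == expected.strip() else 0.0


def run_eval(
    rows: List[Row],
    aggregate: Callable[[List[float]], float] = mean,
) -> float:
    """Sample and grade every row, then aggregate over the whole job."""
    scores = [grade(sample(row), row["expected"]) for row in rows]
    return aggregate(scores)


rows = [
    {"prompt": "Capital of France? A: Paris B: Berlin", "expected": "A"},
    {"prompt": "Capital of Germany? A: Paris B: Berlin", "expected": "B"},
]
print(run_eval(rows))       # mean of the per-row scores
print(run_eval(rows, sum))  # sum of the per-row scores
```

Keeping the aggregator as a plain callable means a different aggregation can be chosen per run without touching the grader.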



## Grader List

| Category               | Name                             | Model Based | Priority | Notes                                                                                                             |
|------------------------|----------------------------------|-------------|----------|-------------------------------------------------------------------------------------------------------------------|
| **Text Matching**      | StringCheckGrader                | No          |          | Need to check with the PF team whether they can support formatted strings, e.g.<br> ```lhs```: {`"type"`: `"tstr"`, `"value"`: `"My name is {item.col2}"`}<br> ```rhs```: {`"type"`: `"tstr"`, `"value"`: `"My name is {sample.col2}"`} |
|                        | SetMembershipGrader              | No          |          |                                                                                                                   |
|                        | SetComparisionGrader             | No          |          |                                                                                                                   |
|                        | StringCountGrader                | No          |          |                                                                                                                   |
| **Classification Quality** | ModelGrader                      | Yes         |          | Built on top of the model sampler. Needs to support formatted chat messages. Comparable to our composite evaluator and to what the evaluate API does. |
|                        | MulticlassClassificationGrader   | No          |          |                                                                                                                   |
|                        | DiscreteClassificationModelGrader| Yes         |          |                                                                                                                   |
| **Text Similarity**    | RougeScoreGrader                 | No          |          |                                                                                                                   |
|                        | BleuScoreGrader                  | No          |          |                                                                                                                   |
|                        | MeteorScoreGrader                | No          |          |                                                                                                                   |
|                        | GleuScoreGrader                  | No          |          |                                                                                                                   |
| **Conversational**     | ClosedQAModelGrader              | Yes         |          | Prompt based. Can be implemented using Prompty.                                                                   |
|                        | ChatCriteriaModelGrader          | Yes         |          | Prompt based. Can be implemented using Prompty.                                                                   |
| **Criteria Based**     | FactualityModelGrader            | Yes         |          | Prompt based. Can be implemented using Prompty.                                                                   |


## Key Differences Between AOAI Graders and pf-evals Evaluators

|                      | **AOAI Graders**                                   | **pf-evals SDK**                                                 |
|----------------------|----------------------------------------------------|------------------------------------------------------------------|
| **Input/Output**     | Multi-rows                                         | Single row <br> Multi-rows through evaluate API                  |
| **Data Access**      | Through Expression Config. Supports three types: <br> 1. Constant Value <br> 2. Column Reference <br> 3. Template String | Similar to column mapping in the evaluation SDK. <br> Most scenarios can be supported through column mapping, but we have a few limitations requiring support from the PF side: <br> 1. Template String <br> ```Question: ${data.question} A: ${data.option_a} B: ${data.option_b} C: ${data.option_c} D: ${data.option_d}``` <br> 2. Column Reference in an array. For example: <br> ```"element": "${data.ground_truth}", "set": ["A", "B", "C", "D"]``` |
| **Aggregation**      | Per Grader Aggregation <br> Multiple Aggregation Types Supported | Single Aggregation applied to all evaluators <br> Currently only supports mean |
| **Model Sampler**    | Generates samples based on formatted chat messages | This concept is not present in pf-evals today, but it functions like a generic target function and can be implemented through prompty. |
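
To make the two column-mapping gaps concrete, here is a minimal sketch of a resolver that handles template strings and column references nested inside arrays. It is not the PF or AOAI implementation: `resolve`, `_REF`, and the `namespaces` layout are assumptions made for illustration, and only the `${namespace.column}` syntax is taken from the examples in the table.

```python
import re
from typing import Any, Dict

# Matches references such as ${data.question} or ${target.sample}.
_REF = re.compile(r"\$\{(\w+)\.(\w+)\}")


def resolve(expr: Any, namespaces: Dict[str, Dict[str, Any]]) -> Any:
    """Resolve one expression against a row (hypothetical helper)."""
    if isinstance(expr, list):
        # Column references inside an array are resolved element by element.
        return [resolve(item, namespaces) for item in expr]
    if isinstance(expr, dict):
        return {key: resolve(value, namespaces) for key, value in expr.items()}
    if not isinstance(expr, str):
        return expr  # constant value: passed through unchanged
    whole = _REF.fullmatch(expr)
    if whole:
        # Pure column reference: return the raw value and keep its type.
        return namespaces[whole.group(1)][whole.group(2)]
    # Template string: substitute every reference as text.
    return _REF.sub(lambda m: str(namespaces[m.group(1)][m.group(2)]), expr)


row = {
    "data": {
        "question": "What is the capital of France?",
        "option_a": "Paris",
        "ground_truth": "A",
    },
    "target": {"sample": "A"},
}
print(resolve("Question: ${data.question} A: ${data.option_a}", row))
print(resolve({"element": "${data.ground_truth}", "set": ["A", "B", "C", "D"]}, row))
print(resolve(["${target.sample}", "B"], row))
```

Treating a string that is exactly one reference differently from a template keeps non-string column values (numbers, lists) intact instead of stringifying them.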


## Graders in Evaluation SDK Demo
We have done the following in this POC:

1. Migrated three graders (Bleu, String Count and Set Membership) to the evaluation SDK to showcase the primary scenario.
2. Implemented ModelSampler based on prompty, with support for formatted chat messages.
3. Added support for using graders with the evaluate API.
4. Added support for per-evaluator aggregation.
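
Per-evaluator aggregation (item 4) can be pictured with the sketch below. It is an assumption-laden illustration, not the SDK code: the `AGGREGATORS` registry and the `aggregate` function are hypothetical, and the `<evaluator>.<score>_<aggregation>` key format simply mirrors the metric names printed by the batch run in this notebook.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical registry of supported aggregation types.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "sum": sum,
    "min": min,
    "max": max,
}


def aggregate(
    rows: List[Dict[str, Dict[str, float]]],
    aggregation_types: Dict[str, str],
) -> Dict[str, float]:
    """Aggregate per-row scores with one aggregation type per evaluator."""
    metrics: Dict[str, float] = {}
    for evaluator, agg_type in aggregation_types.items():
        fn = AGGREGATORS[agg_type]
        # Every row is assumed to carry the same score names per evaluator.
        for score_name in rows[0][evaluator]:
            values = [row[evaluator][score_name] for row in rows]
            metrics[f"{evaluator}.{score_name}_{agg_type}"] = fn(values)
    return metrics


# Illustrative per-row results, not real evaluation output.
rows = [
    {"membership": {"set_membership": 1}, "string_count": {"string_count": 1}},
    {"membership": {"set_membership": 1}, "string_count": {"string_count": 0}},
    {"membership": {"set_membership": 1}, "string_count": {"string_count": 1}},
]
print(aggregate(rows, {"membership": "sum", "string_count": "mean"}))
```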

In [1]:
import os
from pprint import pprint as print  # pretty-print dict results; shadows the built-in print in this notebook

from promptflow.evals.evaluators import BleuScoreEvaluator, SetMembershipEvaluator, StringCountEvaluator
from promptflow.evals.evaluators import BleuScoreConfig, SetMembershipConfig, StringCountConfig
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluate import evaluate, ModelSampler

[2024-08-07 12:50:07 -0700][promptflow][DEBUG] - preparing home directory with default value.


### 1. Single line test

In [2]:
# BleuScoreEvaluator
bleu = BleuScoreEvaluator()
score = bleu(
    reference="this is a book", hypothesis="this is not a book"
)
print(score)

{'bleu_score': 0.17141814854755813}


In [3]:
# StringCountEvaluator
string_count = StringCountEvaluator(
    StringCountConfig(case_sensitive=False)
)
score = string_count(
    reference="A A A", hypothesis="A"
)
print(score)

{'string_count': 3}


In [4]:
# SetMembershipEvaluator
membership = SetMembershipEvaluator(
    SetMembershipConfig(present_grade=1, absent_grade=0, aggregation_type="sum")
)
score = membership(
    element="A", set=["A", "B"]
)
print(score)

{'set_membership': 1}
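
For readers without the SDK installed, the two non-model evaluators above can be approximated in plain Python. This is a sketch of the apparent semantics only, inferred from the inputs and outputs shown in cells 3 and 4; the real evaluators may differ (for example in how overlapping matches are counted), and both function names are hypothetical.

```python
from typing import Any, Iterable


def string_count(reference: str, hypothesis: str, case_sensitive: bool = True) -> int:
    """Count non-overlapping occurrences of `hypothesis` in `reference`."""
    if not hypothesis:
        raise ValueError("hypothesis must be a non-empty string")
    if not case_sensitive:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    return reference.count(hypothesis)


def set_membership(
    element: Any,
    candidates: Iterable[Any],
    present_grade: float = 1,
    absent_grade: float = 0,
) -> float:
    """Return `present_grade` if `element` is among `candidates`, else `absent_grade`."""
    return present_grade if element in candidates else absent_grade


print(string_count("A A A", "a", case_sensitive=False))
print(set_membership("A", ["A", "B"]))
```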


### 2. Try out ModelSampler

In [5]:
# Model Sampler
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_KEY"),
    azure_deployment="gpt-35-turbo",
)

sampling_params = {
    "max_tokens": 1,
}

# use formatted string in chat messages
trajectory = [
    {"role": "system", "content": "You are a useful bot. Given a question and some possible answers, reply with the letter of the correct answer."},
    {"role": "user", "content": "Question: ${data.question} A: ${data.option_a} B: ${data.option_b} C: ${data.option_c} D: ${data.option_d}"},
]
sampler = ModelSampler(model_config=model_config, trajectory=trajectory, sampling_params=sampling_params)

inputs = {
    "question": "What is the capital of France?",
    "option_a": "Paris",
    "option_b": "Berlin",
    "option_c": "London",
    "option_d": "Madrid"
}

print(sampler(line_data=inputs))

{'sample': 'A'}


### 3. Batch Run with Evaluate API

In [None]:
# Batch Run
path = "aoai_evaluator_data.jsonl"
result = evaluate(
    data=path,
    target=sampler,
    evaluators={
        "bleu": bleu,
        "membership": membership,
        "string_count": string_count
    },
    evaluator_config={
        "bleu": {"reference": "${data.ground_truth}", "hypothesis": "${target.sample}"},
        "string_count": {"reference": "${data.ground_truth}", "hypothesis": "A"},
        "membership": {"element": "${data.ground_truth}", "set": ["A", "B", "C", "D"]},
    },
    azure_ai_project={
        "subscription_id": "2d385bf4-0756-4a76-aa95-28bf9ed3b625",
        "resource_group_name": "rg-ninhuai",
        "project_name": "ninhu-3593",
    }
)

In [8]:
print(result['metrics'])
print(result['studio_url'])

{'bleu.bleu_score_mean': 0.6666666666666666,
 'membership.set_membership_sum': 3,
 'string_count.string_count_mean': 0.6666666666666666}
'https://ai.azure.com/build/evaluation/promptflow_evals_evaluate_model_sampler_model_sampler_modelsampler_9z7resds_20240807_125153_249924?wsid=/subscriptions/2d385bf4-0756-4a76-aa95-28bf9ed3b625/resourceGroups/rg-ninhuai/providers/Microsoft.MachineLearningServices/workspaces/ninhu-3593'
