#### Generative AI Metrics

1) GPT Star Metrics - Question Answering
2) RAG Evaluation Metrics - Chat Completion

#### Appendix:

3) Computation of other Question Answering Metrics
4) Computation of other Chat Completion Metrics

For more details please refer to the quick start guide here: https://aka.ms/azureml-metrics-quick-start

#### To run this notebook

* Creating a conda environment

```$ conda create --name <env_name> python=3.8```

* Deleting a conda environment

```$ conda env remove -n <env_name>```

* Activating the environment

```$ conda activate <env_name> ```

* Installing the azureml-metrics package

```$ pip install azureml-metrics```

- The above command installs the numpy, pandas, and psutil dependencies.

- To compute GPT-Star and RAG-based metrics, run:

```$ pip install azureml-metrics[generative-ai]```

- The above command installs the following dependencies: ("openai", "tenacity", "evaluate", "rtoml", "azure-keyvault", "azure-identity", "requests", "aiohttp")

##### 1) Computing GPT-Star Metrics

Computing GPT-Star metrics requires the openai and tenacity libraries. To install the required dependencies, run:

```$ pip install azureml-metrics[generative-ai]```

In [1]:
from azureml.metrics import list_metrics, constants

qa_metrics = list_metrics(task_type=constants.Tasks.QUESTION_ANSWERING)
print("Question Answering metrics:", qa_metrics)

Question Answering metrics: {'llm_fluency', 'llm_relevance', 'gpt_fluency', 'gpt_groundedness', 'f1_score', 'bertscore', 'llm_similarity', 'gpt_coherence', 'llm_coherence', 'llm_groundedness', 'gpt_relevance', 'gpt_similarity', 'exact_match', 'ada_similarity'}


In [4]:
from azureml.metrics import compute_metrics, constants
from pprint import pprint
import os

context = "In 2018, a group of scientists discovered a new type of deep-sea fish that has a transparent head. The fish, named Barreleye, has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
question = "What is the name of the deep-sea fish discovered by scientists in 2018, and what is unique about its head?"
coherent_answer = "The deep-sea fish discovered by scientists in 2018 is called Barreleye, and it has a transparent head. The fish has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
incoherent_answer = "The scientists who made the discovery in 2018 were actually studying coral reefs, not deep-sea fish. However, they did come across an unusual creature that they couldn't identify. It turned out to be a type of sea cucumber that has a strange, tube-like shape."

# This dictionary is propagated to the OpenAI completion or chat completion API.
# Add only keys that the OpenAI API accepts directly.
openai_params = {
    "api_version": os.environ["OPENAI_API_VERSION"],
    "api_base": os.environ["OPENAI_API_BASE"],
    "api_type": os.environ["OPENAI_API_TYPE"],
    "api_key": os.environ["OPENAI_API_KEY"],
    "deployment_id": "<deployment_id>"
}

metrics_config = {
     "questions" : [question, question],
     "contexts" : [context, context],
     "openai_params" : openai_params,
     # To compute gpt-star metrics
     "metrics" : ["gpt_coherence", "gpt_fluency", "gpt_groundedness", "gpt_relevance", "gpt_similarity"]
}

# Note: y_test, y_pred, questions, and contexts must all have the same length
result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING, 
                         y_test=[coherent_answer, coherent_answer],
                         y_pred=[coherent_answer, incoherent_answer],
                         **metrics_config)
pprint(result)

LLM related metrics need llm_params to be computed. Computing metrics for ['gpt_fluency', 'gpt_groundedness', 'gpt_coherence', 'gpt_relevance', 'gpt_similarity']
100%|██████████| 2/2 [00:02<00:00,  1.08s/it]
100%|██████████| 2/2 [00:01<00:00,  1.64it/s]
100%|██████████| 2/2 [00:01<00:00,  1.79it/s]
100%|██████████| 2/2 [00:01<00:00,  1.56it/s]
100%|██████████| 2/2 [00:01<00:00,  1.59it/s]

{'artifacts': {'gpt_coherence': [5, 3],
               'gpt_fluency': [5, 3],
               'gpt_groundedness': [5, 3],
               'gpt_relevance': [5, 1],
               'gpt_similarity': [5, 1]},
 'metrics': {'mean_gpt_coherence': 4.0,
             'mean_gpt_fluency': 4.0,
             'mean_gpt_groundedness': 4.0,
             'mean_gpt_relevance': 3.0,
             'mean_gpt_similarity': 3.0,
             'median_gpt_coherence': 4.0,
             'median_gpt_fluency': 4.0,
             'median_gpt_groundedness': 4.0,
             'median_gpt_relevance': 3.0,
             'median_gpt_similarity': 3.0}}
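
The `metrics` block in the output above appears to be a plain mean and median over the per-sample `artifacts` lists. The following is a minimal sketch under that assumption; the helper name `aggregate_scores` is hypothetical and not part of azureml-metrics.

```python
from statistics import mean, median


def aggregate_scores(artifacts):
    """Return mean_/median_ aggregates for each list of per-sample scores."""
    aggregated = {}
    for name, scores in artifacts.items():
        aggregated[f"mean_{name}"] = float(mean(scores))
        aggregated[f"median_{name}"] = float(median(scores))
    return aggregated


# Per-sample scores copied from the output above.
artifacts = {
    "gpt_coherence": [5, 3],
    "gpt_relevance": [5, 1],
}
summary = aggregate_scores(artifacts)
print(summary)
```

With only two samples per metric, the mean and the median coincide, which matches the identical `mean_*` and `median_*` values reported above.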





##### 2) Computing RAG based metrics

Computing RAG-based metrics requires the openai, tenacity, requests, aiohttp, rtoml, azure-identity, and azure-keyvault libraries. To install the required dependencies, run:

```$ pip install azureml-metrics[generative-ai]```

In [5]:
from azureml.metrics import list_metrics, constants

rag_metrics = list_metrics(task_type=constants.Tasks.RAG_EVALUATION)
print("RAG Evaluation based metrics:", rag_metrics)

RAG Evaluation based metrics: {'gpt_groundedness', 'gpt_relevance', 'gpt_retrieval_score'}


In [6]:
from azureml.metrics import list_metrics, constants

chat_metrics = list_metrics(task_type=constants.Tasks.CHAT_COMPLETION)
print("Chat Completion metrics:", chat_metrics)

Chat Completion metrics: {'bleu_4', 'rouge1', 'bleu_2', 'gpt_groundedness', 'rougeL', 'rougeLsum', 'conversation_groundedness_score', 'bleu_1', 'rouge2', 'bleu_3', 'gpt_retrieval_score', 'gpt_relevance', 'perplexity'}
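
The two listings above suggest that every RAG evaluation metric is also available under the chat completion task. The sketch below checks this with plain set operations. The sets are copied from the printed outputs above; other package versions may list different metrics.

```python
# Metric names copied from the two outputs above.
rag_metrics = {"gpt_groundedness", "gpt_relevance", "gpt_retrieval_score"}
chat_metrics = {
    "bleu_1", "bleu_2", "bleu_3", "bleu_4",
    "rouge1", "rouge2", "rougeL", "rougeLsum",
    "gpt_groundedness", "gpt_relevance", "gpt_retrieval_score",
    "conversation_groundedness_score", "perplexity",
}

# True when every RAG metric is also a chat completion metric.
is_subset = rag_metrics.issubset(chat_metrics)
# Metrics offered only by the chat completion task.
chat_only = sorted(chat_metrics - rag_metrics)

print(is_subset)
print(chat_only)
```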


In [7]:
%%time
# gpt4-32k model
from azureml.metrics import compute_metrics, constants
from pprint import pprint
import os

y_test = [["4", "2 + 2 = 4"], ["Agra", "Agra, India"]]

y_pred = [
    [
        {"role": "user", "content": "What is the value of 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4",
         "context": {
             "citations": [{'id': 'math_document1.md',
                            'content': 'Information about additions: ' \
                            '1 + 2 = 3, 2 + 2 = 4'}]
            }
        }
    ],
    [
        {"role": "user", "content": "Where is Taj Mahal located?"},
        {"role": "assistant", "content": "Taj Mahal is located in Agra, India",
         "context": {
             "citations": [{'id': 'taj_mahal_document1.md',
                            'content': 'Taj Mahal is located in Agra, India ' \
                                        'and is one of the seven wonders of the world.'}]
            }
        }
    ]
]

openai_params = {
    "api_version": os.environ["OPENAI_API_VERSION"],
    "api_base": os.environ["OPENAI_API_BASE"],
    "api_type": os.environ["OPENAI_API_TYPE"],
    "api_key": os.environ["OPENAI_API_KEY"],
    "deployment_id": "<deployment_id>"
}

metrics_config = {
    "openai_params": openai_params,
    "score_version": "v1",
    "use_chat_completion_api": True,
    # To compute RAG based metrics
    "metrics": ["gpt_relevance", "gpt_groundedness", "gpt_retrieval_score"]
}

# The metrics above can also be computed with task_type set to constants.Tasks.RAG_EVALUATION
result = compute_metrics(task_type=constants.Tasks.CHAT_COMPLETION,
                         y_test=y_test, 
                         y_pred=y_pred,
                         **metrics_config)
pprint(result)

Computing gpt relevance score: 100%|██████████| 2/2 [00:02<00:00,  1.23s/it]
Computing gpt groundedness score: 100%|██████████| 2/2 [00:03<00:00,  1.57s/it]
Computing gpt retrieval score: 100%|██████████| 2/2 [00:03<00:00,  1.95s/it]

{'artifacts': {'gpt_groundedness': {'reason': [['<Quality reasoning:> The '
                                                'chatbot\'s response "What is '
                                                'the value of 2 + 2?" is a '
                                                'direct paraphrase of the '
                                                'question "2 + 2 = 4". The '
                                                'factual information in the '
                                                "chatbot's response, which is "
                                                'the equation "2 + 2 = 4", is '
                                                'directly taken from the '
                                                'retrieved document '
                                                '"math_document1.md" where it '
                                                'states "2 + 2 = 4". '
                                                "Therefore, the chatbot's "
    




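
Each `y_pred` entry in the cell above is a conversation: a list of message dictionaries in which the assistant turn carries its retrieved documents under `context["citations"]`. The sketch below is a hypothetical helper (`extract_turns` is not part of azureml-metrics) that flattens such a conversation into question, answer, and cited-document records, which can help when inspecting inputs before scoring.

```python
def extract_turns(conversation):
    """Pair each user message with the assistant reply that follows it."""
    turns = []
    question = None
    for message in conversation:
        if message["role"] == "user":
            question = message["content"]
        elif message["role"] == "assistant":
            # Citations are optional, so fall back to an empty list.
            citations = message.get("context", {}).get("citations", [])
            turns.append({
                "question": question,
                "answer": message["content"],
                "documents": [c["content"] for c in citations],
            })
    return turns


# One conversation in the same shape as the y_pred entries above.
conversation = [
    {"role": "user", "content": "Where is Taj Mahal located?"},
    {"role": "assistant", "content": "Taj Mahal is located in Agra, India",
     "context": {
         "citations": [{"id": "taj_mahal_document1.md",
                        "content": "Taj Mahal is located in Agra, India "
                                   "and is one of the seven wonders of the world."}]
     }},
]
turns = extract_turns(conversation)
print(turns)
```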
##### Appendix:

3) Computing all Question Answering metrics

| Metrics | Extra Dependencies Needed | Can be installed with | 
| :--: | -- | -- |
| ExactMatch, F1 Score | No other dependencies are needed | -- |
| bertscore | bert_score, evaluate | pip install azureml-metrics[bert-score] |

To install all question-answering dependencies at once please run:

```$ pip install azureml-metrics[qa]```

In [8]:
# computing exact_match, f1 score for question answering task
from azureml.metrics import compute_metrics, constants
from pprint import pprint

y_pred = ["hello there general kenobi 123","foo bar foobar", "ram 234", "sid"]
y_test = ["hello there general kenobi san", "foo bar foobar", "ram 23", "sid$"]

metrics_config = {
    "metrics": ["exact_match", "f1_score"],
}

result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING, y_test=y_test, y_pred=y_pred,
                         **metrics_config)
pprint(result)


GPT related metrics need openai_params to be computed. Computing metrics for ['f1_score', 'exact_match']
LLM related metrics need llm_params to be computed. Computing metrics for ['f1_score', 'exact_match']
GPT related metrics need openai_params in a dictionary.
GPT related metrics need openai_params in a dictionary.


{'artifacts': {'exact_match': [False, True, False, False],
               'f1_score': [0.8000000000000002, 1.0, 0.5, 1.0]},
 'metrics': {'mean_exact_match': 0.25,
             'mean_f1_score': 0.8250000000000001,
             'median_exact_match': 0.0,
             'median_f1_score': 0.9000000000000001}}
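
The `f1_score` values above are consistent with a SQuAD-style token-overlap F1, in which answers are lowercased, stripped of punctuation and articles, and compared as bags of tokens. The sketch below assumes that normalization (it is not taken from the azureml-metrics source) and reproduces the four per-sample scores.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


y_pred = ["hello there general kenobi 123", "foo bar foobar", "ram 234", "sid"]
y_test = ["hello there general kenobi san", "foo bar foobar", "ram 23", "sid$"]
scores = [token_f1(p, t) for p, t in zip(y_pred, y_test)]
print([round(s, 3) for s in scores])
```

Under this normalization, "sid" and "sid$" reach an F1 of 1.0, while the output above reports `exact_match` as False for the same pair. This suggests that exact match does not strip punctuation by default.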


For computing bert-score please run:

```$ pip install azureml-metrics[bert-score]```

In [2]:
# computing bertscore for question answering task
from azureml.metrics import compute_metrics, constants
from pprint import pprint

y_pred = ["hello there general kenobi 123","foo bar foobar", "ram 234", "sid"]
y_test = ["hello there general kenobi san", "foo bar foobar", "ram 23", "sid$"]

metrics_config = {
    "metrics": ["bertscore"],
}

result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING, y_test=y_test, y_pred=y_pred,
                         **metrics_config)
pprint(result)

GPT related metrics need openai_params to be computed. Computing metrics for ['bertscore']
LLM related metrics need llm_params to be computed. Computing metrics for ['bertscore']
GPT related metrics need openai_params in a dictionary.


{'artifacts': {'bertscore': {'f1': [0.693062424659729,
                                    1.0,
                                    0.7792337536811829,
                                    0.435146301984787],
                             'hashcode': 'microsoft/deberta-large_L16_no-idf_version=0.3.12(hug_trans=4.31.0)-rescaled',
                             'precision': [0.6916525363922119,
                                           1.0,
                                           0.7782196998596191,
                                           0.5183156728744507],
                             'recall': [0.6915604472160339,
                                        1.0,
                                        0.7781534790992737,
                                        0.3549974262714386]}},
 'metrics': {'mean_bertscore_f1': 0.7268606200814247,
             'mean_bertscore_precision': 0.7470469772815704,
             'mean_bertscore_recall': 0.7061778381466866}}


Computing ada-cosine similarity requires the 'openai', 'plotly', and 'scipy' dependencies. To install them, run:

```$ pip install azureml-metrics[ada-cosine-similarity]```

**Note: To compute ada-cosine-similarity, pass OpenAI credentials for a resource that has a "text-embedding-ada-002" model deployment.**

In [10]:
from azureml.metrics import compute_metrics, constants
from pprint import pprint
import os

y_pred = ["hello there general kenobi 123","foo bar foobar", "ram 234", "sid"]
y_test = ["hello there general kenobi san", "foo bar foobar", "ram 23", "sid$"]

# This dictionary is propagated to the OpenAI completion or chat completion API.
# Add only keys that the OpenAI API accepts directly.
openai_params = {
    "api_version": os.environ["OPENAI_API_VERSION"],
    "api_base": os.environ["OPENAI_API_BASE"],
    "api_type": os.environ["OPENAI_API_TYPE"],
    "api_key": os.environ["OPENAI_API_KEY"],
    "deployment_id": "<deployment_id>"
}

metrics_config = {
     "openai_params" : openai_params,
     # To compute ada cosine similarity
     "metrics" : ["ada_cosine_similarity"]
}

# Note: y_test and y_pred must have the same length
result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING, 
                         y_test=y_test,
                         y_pred=y_pred,
                         **metrics_config)
pprint(result)

{'artifacts': {'ada_cosine_similarity': [0.9526062492711195,
                                         1.0000000000000002,
                                         0.9100055776481804,
                                         0.8961406378903477]},
 'metrics': {'mean_ada_cosine_similarity': 0.9396881162024119}}
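
The scores above are cosine similarities between the embedding of each prediction and the embedding of its reference. The sketch below shows the cosine computation on toy vectors. The real embeddings come from the "text-embedding-ada-002" deployment and have many more dimensions, and `cosine_similarity` here is an illustrative helper rather than the package's own function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings.
same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

Identical inputs give a value of 1 only up to floating-point rounding, which is why the second score above prints as 1.0000000000000002.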


##### Compute all supported QA metrics:

Now that all required QA dependencies are installed, we can compute all supported QA metrics.

In [4]:
from azureml.metrics import compute_metrics, constants
from pprint import pprint
import os

context = "In 2018, a group of scientists discovered a new type of deep-sea fish that has a transparent head. The fish, named Barreleye, has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
question = "What is the name of the deep-sea fish discovered by scientists in 2018, and what is unique about its head?"
coherent_answer = "The deep-sea fish discovered by scientists in 2018 is called Barreleye, and it has a transparent head. The fish has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
incoherent_answer = "The scientists who made the discovery in 2018 were actually studying coral reefs, not deep-sea fish. However, they did come across an unusual creature that they couldn't identify. It turned out to be a type of sea cucumber that has a strange, tube-like shape."

# This dictionary is propagated to the OpenAI completion or chat completion API.
# Add only keys that the OpenAI API accepts directly.
openai_params = {
    "api_version": os.environ["OPENAI_API_VERSION"],
    "api_base": os.environ["OPENAI_API_BASE"],
    "api_type": os.environ["OPENAI_API_TYPE"],
    "api_key": os.environ["OPENAI_API_KEY"],
    "deployment_id": "<deployment_id>"
}


metrics_config = {
     "questions" : [question, question],
     "contexts" : [context, context],
     "openai_params" : openai_params,
}

# Note: y_test, y_pred, questions, and contexts must all have the same length
result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING, 
                         y_test=[coherent_answer, coherent_answer],
                         y_pred=[coherent_answer, incoherent_answer],
                         **metrics_config)
pprint(result)

LLM related metrics need llm_params to be computed. Computing metrics for ['ada_similarity', 'bertscore', 'gpt_groundedness', 'gpt_relevance', 'f1_score', 'gpt_similarity', 'gpt_fluency', 'gpt_coherence', 'exact_match']
Using the engine text-embedding-ada-002 for computing ada similarity. Please ensure to have valid deployment for text-embedding-ada-002 model
Could not compute metric because of the following exception : Error code: 404 - {'error': {'code': 'DeploymentNotFound', 'message': 'The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.'}}
100%|██████████| 2/2 [00:01<00:00,  1.19it/s]
100%|██████████| 2/2 [00:01<00:00,  1.13it/s]
100%|██████████| 2/2 [00:01<00:00,  1.12it/s]
100%|██████████| 2/2 [00:01<00:00,  1.35it/s]
100%|██████████| 2/2 [00:01<00:00,  1.18it/s]


{'artifacts': {'ada_similarity': ['notfounderror', 'notfounderror'],
               'bertscore': {'f1': [1.0, 0.29999807476997375],
                             'hashcode': 'microsoft/deberta-large_L16_no-idf_version=0.3.12(hug_trans=4.31.0)-rescaled',
                             'precision': [1.0, 0.31754738092422485],
                             'recall': [1.0, 0.27638280391693115]},
               'exact_match': [True, False],
               'f1_score': [1.0, 0.25],
               'gpt_coherence': [5, 3],
               'gpt_fluency': [5, 3],
               'gpt_groundedness': [5, 3],
               'gpt_relevance': [5, 1],
               'gpt_similarity': [5, 1]},
 'metrics': {'mean_ada_similarity': nan,
             'mean_bertscore_f1': 0.6499990373849869,
             'mean_bertscore_precision': 0.6587736904621124,
             'mean_bertscore_recall': 0.6381914019584656,
             'mean_exact_match': 0.5,
             'mean_f1_score': 0.625,
             'mean_gpt_coherence
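
In the output above, the missing embedding deployment leaves the string `'notfounderror'` in the `ada_similarity` artifacts and `nan` as its mean. When post-processing results, it can help to separate numeric scores from such error markers. The sketch below uses a hypothetical helper, `split_scores`, which is not part of azureml-metrics.

```python
from numbers import Number


def split_scores(values):
    """Separate numeric scores from string error markers."""
    scores = [v for v in values
              if isinstance(v, Number) and not isinstance(v, bool)]
    errors = [v for v in values if isinstance(v, str)]
    return scores, errors


# Subset of the artifacts shown in the output above.
artifacts = {
    "ada_similarity": ["notfounderror", "notfounderror"],
    "gpt_coherence": [5, 3],
}

# Collect only the metrics that reported at least one error marker.
failed = {}
for name, values in artifacts.items():
    scores, errors = split_scores(values)
    if errors:
        failed[name] = errors
print(failed)
```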

##### To compute all chat completion metrics, run:

```$ pip install azureml-metrics[chat-completion]```

- The above command installs the following dependencies: ("evaluate", "rouge-score", "torch", "transformers", "nltk", "rtoml", "azure-keyvault", "azure-identity", "requests", "aiohttp", "openai", "tenacity")

In [6]:
%%time
# gpt4-32k model
from azureml.metrics import compute_metrics, constants
from pprint import pprint
import os

y_test = [["4", "2 + 2 = 4"], ["Agra", "Agra, India"]]

y_pred = [
    [
        {"role": "user", "content": "What is the value of 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4",
         "context": {
             "citations": [{'id': 'math_document1.md',
                            'content': 'Information about additions: ' \
                            '1 + 2 = 3, 2 + 2 = 4'}]
            }
        }
    ],
    [
        {"role": "user", "content": "Where is Taj Mahal located?"},
        {"role": "assistant", "content": "Taj Mahal is located in Agra, India",
         "context": {
             "citations": [{'id': 'taj_mahal_document1.md',
                            'content': 'Taj Mahal is located in Agra, India ' \
                                        'and is one of the seven wonders of the world.'}]
            }
        }
    ]
]

openai_params = {
    "api_version": os.environ["OPENAI_API_VERSION"],
    "api_base": os.environ["OPENAI_API_BASE"],
    "api_type": os.environ["OPENAI_API_TYPE"],
    "api_key": os.environ["OPENAI_API_KEY"],
    "deployment_id": "<deployment_id>"
}


metrics_config = {
    "openai_params": openai_params,
    "score_version": "v1",
    "use_chat_completion_api": True,
    # To compute GPT based RAG metrics
    # "metrics": ["generation_score", "grounding_score", "retrieval_score"]
}

# No "metrics" key is set above, so all supported chat completion metrics are computed
result = compute_metrics(task_type=constants.Tasks.CHAT_COMPLETION,
                         y_test=y_test, 
                         y_pred=y_pred,
                         **metrics_config)
pprint(result)

Using pad_token, but it is not set yet.
100%|██████████| 1/1 [00:00<00:00,  7.71it/s]
Computing gpt groundedness score: 100%|██████████| 2/2 [00:03<00:00,  1.82s/it]
Computing gpt relevance score: 100%|██████████| 2/2 [00:02<00:00,  1.34s/it]
Computing gpt retrieval score: 100%|██████████| 2/2 [00:04<00:00,  2.33s/it]
0it [00:00, ?it/s]

{'artifacts': {'conversation_groundedness_score': [],
               'gpt_groundedness': {'reason': [['<Quality reasoning:> The '
                                                'chatbot\'s response "What is '
                                                'the value of 2 + 2?" is a '
                                                'direct paraphrase of the '
                                                'question "2 + 2 = 4". The '
                                                'factual information in the '
                                                "chatbot's response, which is "
                                                'the equation "2 + 2 = 4", is '
                                                'directly taken from the '
                                                'retrieved document '
                                                '"math_document1.md" where it '
                                                'states "2 + 2 = 4". '
                          


