# Demo on Azureml-Metrics Package
This notebook demonstrates how to use the azureml-metrics package. It covers the following:

1. **List Tasks**: Provides an overview of all tasks supported by the azureml-metrics package.

2. **List Metrics** :

    a. Lists all metrics for all available tasks.

    b. Lists metrics specific to a given task.

3. **List Prompt** : Displays the prompt associated with a metric supported by a given task.

4. **Score API** : Illustrates how to compute metrics directly using the model and test data.

5. **Compute Metrics API**:

    a. Demonstrates computing task-specific metrics.

    b. Shows how to compute metrics without specifying the task type.

6. **Computing Custom Prompt Metrics**:

    a. Guides you through computing custom prompt metrics alongside task-supported metrics.

    b. Explains how to compute custom prompt metrics without specifying the task type.


After working through it, you should be able to list the supported tasks and metrics, inspect metric prompts, and compute both built-in and custom prompt metrics with the azureml-metrics package.

#### Prerequisites
 
Install the latest version of the azureml-metrics package (including the text-based requirements) using the following command:
 
``` $ pip install --upgrade azureml-metrics[all] ```
 
For more details on the azureml-metrics package, see: https://aka.ms/azureml-metrics-quick-start

## List Tasks

In [2]:
from azureml.metrics import list_tasks
supported_tasks = list_tasks()
supported_tasks

{'chat-completion',
 'classification',
 'code-generation',
 'custom-prompt-metric',
 'fill-mask',
 'forecasting',
 'image-classification',
 'image-classification-multilabel',
 'image-instance-segmentation',
 'image-multi-labeling',
 'image-object-detection',
 'qa',
 'qa_multiple_ground_truth',
 'rag-evaluation',
 'regression',
 'summarization',
 'text-classification',
 'text-classification-multilabel',
 'text-generation',
 'text-ner',
 'translation'}

## List Metrics

### a) List metrics for all the tasks

In [3]:
from azureml.metrics import list_metrics
metrics = list_metrics()

# to display the metrics as a dataframe
import pandas as pd
pd.set_option('display.max_colwidth', None)
flattened_data = [(key, values) for item in metrics for key, values in item.items()]
metrics_df = pd.DataFrame(flattened_data, columns=['TASKS','SUPPORTED METRICS']).set_index(pd.RangeIndex(start=1, stop=len(flattened_data)+1))

# styling the dataframe
metrics_df =  metrics_df.style.set_table_styles([
                {'selector': 'td', 'props': [('text-align', 'justify')]},
                {'selector': 'th', 'props': [('text-align', 'center')]}
            ])

metrics_df

Unnamed: 0,TASKS,SUPPORTED METRICS
1,classification,"{'f1_score_weighted', 'iou_macro', 'precision_score_binary', 'f1_score_macro', 'average_precision_score_weighted', 'iou', 'accuracy_table', 'iou_micro', 'average_precision_score_binary', 'norm_macro_recall', 'f1_score_binary', 'recall_score_macro', 'precision_score_weighted', 'precision_score_macro', 'recall_score_micro', 'confusion_matrix', 'AUC_macro', 'matthews_correlation', 'iou_weighted', 'AUC_classwise', 'f1_score_micro', 'balanced_accuracy', 'average_precision_score_micro', 'weighted_accuracy', 'accuracy', 'AUC_micro', 'average_precision_score_macro', 'recall_score_weighted', 'classification_report', 'precision_score_micro', 'recall_score_classwise', 'recall_score_binary', 'average_precision_score_classwise', 'AUC_weighted', 'precision_score_classwise', 'AUC_binary', 'log_loss', 'iou_classwise', 'f1_score_classwise'}"
2,regression,"{'spearman_correlation', 'predicted_true', 'r2_score', 'mean_absolute_percentage_error', 'explained_variance', 'median_absolute_error', 'root_mean_squared_log_error', 'normalized_root_mean_squared_error', 'mean_absolute_error', 'root_mean_squared_error', 'normalized_median_absolute_error', 'normalized_root_mean_squared_log_error', 'residuals', 'normalized_mean_absolute_error'}"
3,text-classification,"{'f1_score_weighted', 'iou_macro', 'precision_score_binary', 'f1_score_macro', 'average_precision_score_weighted', 'iou', 'accuracy_table', 'iou_micro', 'average_precision_score_binary', 'norm_macro_recall', 'f1_score_binary', 'recall_score_macro', 'precision_score_weighted', 'precision_score_macro', 'recall_score_micro', 'confusion_matrix', 'AUC_macro', 'matthews_correlation', 'iou_weighted', 'AUC_classwise', 'f1_score_micro', 'balanced_accuracy', 'average_precision_score_micro', 'weighted_accuracy', 'accuracy', 'AUC_micro', 'average_precision_score_macro', 'recall_score_weighted', 'classification_report', 'precision_score_micro', 'recall_score_classwise', 'recall_score_binary', 'average_precision_score_classwise', 'AUC_weighted', 'precision_score_classwise', 'AUC_binary', 'log_loss', 'iou_classwise', 'f1_score_classwise'}"
4,text-classification-multilabel,"{'f1_score_weighted', 'iou_macro', 'precision_score_binary', 'f1_score_macro', 'average_precision_score_weighted', 'iou', 'accuracy_table', 'iou_micro', 'average_precision_score_binary', 'norm_macro_recall', 'f1_score_binary', 'recall_score_macro', 'precision_score_weighted', 'precision_score_macro', 'recall_score_micro', 'confusion_matrix', 'AUC_macro', 'matthews_correlation', 'iou_weighted', 'AUC_classwise', 'f1_score_micro', 'balanced_accuracy', 'average_precision_score_micro', 'weighted_accuracy', 'accuracy', 'AUC_micro', 'average_precision_score_macro', 'recall_score_weighted', 'classification_report', 'precision_score_micro', 'recall_score_classwise', 'recall_score_binary', 'average_precision_score_classwise', 'AUC_weighted', 'precision_score_classwise', 'AUC_binary', 'log_loss', 'iou_classwise', 'f1_score_classwise'}"
5,text-ner,"{'f1_score_weighted', 'recall_score_weighted', 'f1_score_macro', 'precision_score_weighted', 'precision_score_micro', 'accuracy', 'recall_score_macro', 'precision_score_macro', 'recall_score_micro', 'f1_score_micro'}"
6,translation,"{'bleu_2', 'bleu_3', 'bleu_4', 'bleu_1'}"
7,summarization,"{'rouge1', 'rouge2', 'rougeL', 'rougeLsum'}"
8,qa,"{'llm_coherence', 'bertscore', 'llm_groundedness', 'gpt_groundedness', 'gpt_coherence', 'ada_similarity', 'gpt_similarity', 'gpt_fluency', 'llm_similarity', 'llm_relevance', 'exact_match', 'gpt_relevance', 'llm_fluency', 'f1_score'}"
9,qa_multiple_ground_truth,"{'llm_coherence', 'bertscore', 'llm_groundedness', 'gpt_groundedness', 'gpt_coherence', 'ada_similarity', 'gpt_similarity', 'gpt_fluency', 'llm_similarity', 'llm_relevance', 'exact_match', 'gpt_relevance', 'llm_fluency', 'f1_score'}"
10,fill-mask,{'perplexity'}
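
The flattening comprehension in the cell above only works if `list_metrics()` returns a list of dictionaries, each mapping a task name to its set of metrics. The sketch below reproduces that reshaping on a small stand-in list, without pandas. The `sample` data is hypothetical and only illustrates the assumed structure; it is not actual package output.

```python
# Hypothetical stand-in for the structure the comprehension above expects:
# a list of dicts, each mapping one task name to a set of metric names.
sample = [
    {"translation": {"bleu_1", "bleu_2"}},
    {"fill-mask": {"perplexity"}},
]

# Same comprehension shape as in the cell above: one (task, metrics) pair per entry.
flattened = [(task, names) for item in sample for task, names in item.items()]

# Print a 1-based listing, mirroring the RangeIndex(start=1, ...) used above.
for row, (task, names) in enumerate(flattened, start=1):
    print(row, task, sorted(names))
```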


### b) List metrics for a specific Task

In [4]:
from azureml.metrics import list_metrics
metrics = list_metrics('qa')
metrics

{'ada_similarity',
 'bertscore',
 'exact_match',
 'f1_score',
 'gpt_coherence',
 'gpt_fluency',
 'gpt_groundedness',
 'gpt_relevance',
 'gpt_similarity',
 'llm_coherence',
 'llm_fluency',
 'llm_groundedness',
 'llm_relevance',
 'llm_similarity'}
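
Because the result is a plain set of names, it can be used to check a list of requested metrics before running any computation. The sketch below copies the names from the output above into a literal set so that it runs without the package; the `requested` list is a hypothetical example.

```python
# Metric names copied from the `list_metrics('qa')` output above.
qa_metrics = {
    "ada_similarity", "bertscore", "exact_match", "f1_score",
    "gpt_coherence", "gpt_fluency", "gpt_groundedness", "gpt_relevance",
    "gpt_similarity", "llm_coherence", "llm_fluency", "llm_groundedness",
    "llm_relevance", "llm_similarity",
}

# Hypothetical list of metrics a user would like to compute.
requested = ["accuracy", "f1_score", "gpt_relevance"]

# Split the request into names the task lists and names it does not.
supported = [name for name in requested if name in qa_metrics]
unsupported = [name for name in requested if name not in qa_metrics]

print("Listed for qa:", supported)
print("Not listed for qa:", unsupported)
```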

## List Prompt
Displays the prompt for a metric supported by the given task.

In [5]:
from azureml.metrics import list_prompts

coherence_prompt = list_prompts(task_type='qa', metric='gpt_coherence')
print(coherence_prompt)

Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:
One star: the answer completely lacks coherence
Two stars: the answer mostly lacks coherence
Three stars: the answer is partially coherent
Four stars: the answer is mostly coherent
Five stars: the answer has perfect coherency

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

question: What is your favorite indoor activity and why do you enjoy it?
answer: I like pizza. The sun is shining.
stars: 1

question: Can you describe your favorite movie without giving away any spoilers?
answer: It is a science fiction movie. There are dinosaurs. The actors eat cake. People must stop the villain.
stars: 2

question: What are some b

## Score API
Computing metrics directly with a model and test data

In [6]:
from azureml.metrics import score
import numpy as np
from pprint import pprint
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
    
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)

metrics = score(task_type='classification',
                model=clf,
                X_test=X,
                y_test=y)

pprint(metrics)

Computing classification metrics: 100%|██████████████████████████████████████████████| 28/28 [00:00<00:00, 1374.65it/s]
Metrics skipped due to missing y_pred_proba:
 ['average_precision_score_micro', 'average_precision_score_weighted', 'accuracy_table', 'average_precision_score_binary', 'AUC_micro', 'norm_macro_recall', 'average_precision_score_macro', 'AUC_macro', 'AUC_weighted', 'AUC_binary', 'log_loss']


{'artifacts': {'confusion_matrix': {'data': {'class_labels': ['0', '1', '2'],
                                             'matrix': [[50, 0, 0],
                                                        [0, 47, 3],
                                                        [0, 1, 49]]},
                                    'schema_type': 'confusion_matrix',
                                    'schema_version': '1.0.0'}},
 'metrics': {'accuracy': 0.9733333333333334,
             'balanced_accuracy': 0.9733333333333333,
             'f1_score_binary': nan,
             'f1_score_macro': 0.9733226623982927,
             'f1_score_micro': 0.9733333333333334,
             'f1_score_weighted': 0.9733226623982927,
             'matthews_correlation': 0.9602561024455323,
             'precision_score_binary': nan,
             'precision_score_macro': 0.9738247863247862,
             'precision_score_micro': 0.9733333333333334,
             'precision_score_weighted': 0.9738247863247864,
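
Several of the reported values can be recomputed by hand from the confusion matrix in the output above, which is a useful sanity check. The sketch below uses the standard definitions (accuracy as the diagonal over the total, balanced accuracy as mean per-class recall, macro precision as mean per-class precision) and assumes rows are true classes and columns are predicted classes. It does not use the package itself.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true classes, columns are predicted classes.
matrix = [[50, 0, 0],
          [0, 47, 3],
          [0, 1, 49]]

n_classes = len(matrix)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n_classes))

# Accuracy: correctly classified samples over all samples.
accuracy = correct / total

# Per-class recall: diagonal entry over its row sum; balanced accuracy is their mean.
recalls = [matrix[i][i] / sum(matrix[i]) for i in range(n_classes)]
balanced_accuracy = sum(recalls) / n_classes

# Per-class precision: diagonal entry over its column sum; macro precision is their mean.
col_sums = [sum(matrix[r][c] for r in range(n_classes)) for c in range(n_classes)]
precisions = [matrix[i][i] / col_sums[i] for i in range(n_classes)]
precision_macro = sum(precisions) / n_classes

print(accuracy, balanced_accuracy, precision_macro)
```

The three values agree with `accuracy`, `balanced_accuracy`, and `precision_score_macro` in the output above up to floating-point rounding.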
          

## Compute Metrics API

In [7]:
openai_params = {
    "api_version": '<placeholder>',
    "api_base": '<placeholder>',
    "api_type": '<placeholder>',
    "api_key": '<placeholder>',
    "deployment_id": '<placeholder>'
}
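
To keep credentials out of the notebook, the same dictionary can be filled from environment variables. This is a minimal sketch: the dictionary keys are the ones used above, but the environment variable names and the `env_openai_params` name are hypothetical and should be adapted to your own setup.

```python
import os

# Keys match the `openai_params` dictionary above.
# The environment variable names are hypothetical, not defined by the package.
env_openai_params = {
    "api_version": os.environ.get("OPENAI_API_VERSION", "<placeholder>"),
    "api_base": os.environ.get("OPENAI_API_BASE", "<placeholder>"),
    "api_type": os.environ.get("OPENAI_API_TYPE", "<placeholder>"),
    "api_key": os.environ.get("OPENAI_API_KEY", "<placeholder>"),
    "deployment_id": os.environ.get("OPENAI_DEPLOYMENT_ID", "<placeholder>"),
}

# Report which entries are still unset, without printing any secret values.
unset = [key for key, value in env_openai_params.items() if value == "<placeholder>"]
print("Parameters still unset:", unset)
```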

In [8]:
# preparing QA dataset
coherent_answer = "The deep-sea fish discovered by scientists in 2018 is called Barreleye, and it has a transparent head. The fish has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
incoherent_answer = "The scientists who made the discovery in 2018 were actually studying coral reefs, not deep-sea fish. However, they did come across an unusual creature that they couldn't identify. It turned out to be a type of sea cucumber that has a strange, tube-like shape."
context = "In 2018, a group of scientists discovered a new type of deep-sea fish that has a transparent head. The fish, named Barreleye, has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
question = "What is the name of the deep-sea fish discovered by scientists in 2018, and what is unique about its head?"

y_test = [coherent_answer, coherent_answer]
y_pred = [coherent_answer, incoherent_answer]
contexts = [context, context]
questions = [question, question]

### a) Computing Task supported Metrics

In [9]:
from azureml.metrics import compute_metrics

metrics = compute_metrics(task_type='qa', y_test=y_test, y_pred=y_pred, questions=questions, contexts=contexts, openai_params=openai_params)
metrics

LLM related metrics need llm_params to be computed. Computing metrics for ['bertscore', 'gpt_groundedness', 'gpt_coherence', 'ada_similarity', 'gpt_similarity', 'gpt_fluency', 'exact_match', 'gpt_relevance', 'f1_score']
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.14s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.36it/s]
Using the engine text-embedding-ada-002 for computing ada similarity. Please ensure to have valid deployment for text-embedding-ada-002 model
Could not compute metric because of the following exception : RetryError[<Future at 0x2404c1fa7a0 state=finished raised InvalidRequestError>]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.18it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.25it/s]
100%|███████████

{'metrics': {'mean_bertscore_precision': 0.6587736904621124,
  'mean_bertscore_recall': 0.6381914019584656,
  'mean_bertscore_f1': 0.6499990373849869,
  'mean_gpt_groundedness': 4.0,
  'median_gpt_groundedness': 4.0,
  'mean_gpt_coherence': 4.0,
  'median_gpt_coherence': 4.0,
  'mean_ada_similarity': nan,
  'median_ada_similarity': nan,
  'mean_gpt_similarity': 3.0,
  'median_gpt_similarity': 3.0,
  'mean_gpt_fluency': 4.0,
  'median_gpt_fluency': 4.0,
  'mean_exact_match': 0.5,
  'median_exact_match': 0.5,
  'mean_gpt_relevance': 3.0,
  'median_gpt_relevance': 3.0,
  'mean_f1_score': 0.625,
  'median_f1_score': 0.625},
 'artifacts': {'bertscore': {'precision': [1.0, 0.31754738092422485],
   'recall': [1.0, 0.27638280391693115],
   'f1': [1.0, 0.29999807476997375],
   'hashcode': 'microsoft/deberta-large_L16_no-idf_version=0.3.12(hug_trans=4.28.1)-rescaled'},
  'gpt_groundedness': [5, 3],
  'gpt_coherence': [5, 3],
  'ada_similarity': ['retryerror', 'retryerror'],
  'gpt_similarity': [
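
For intuition about the non-LLM metrics `exact_match` and `f1_score`, the sketch below implements the widely used SQuAD-style definitions: normalize both strings, then compare them exactly or by token overlap. Whether azureml-metrics applies exactly this normalization is an assumption, so the numbers here are not guaranteed to match the package's output. The example strings are hypothetical, shortened stand-ins for the answers above.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(reference, prediction):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(reference) == normalize(prediction))


def token_f1(reference, prediction):
    """Harmonic mean of token-level precision and recall after normalization."""
    ref_tokens = normalize(reference).split()
    pred_tokens = normalize(prediction).split()
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, shortened stand-ins for the answers used in this notebook.
reference = "The fish is called Barreleye."
prediction = "It is a sea cucumber, not a fish."

print(exact_match(reference, reference), exact_match(reference, prediction))
print(token_f1(reference, prediction))
```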

### b) Computing Metrics without the Task Type
Here we compute a subset of metrics without passing `task_type`. For example, `accuracy` is not a supported metric for the QA task, but it can still be requested and computed.

In [10]:
from azureml.metrics import compute_metrics

metrics_config = {
    "metrics": ['accuracy','f1_score', 'gpt_relevance']
}

metrics = compute_metrics(y_test=y_test, y_pred=y_pred, questions=questions, contexts=contexts, openai_params=openai_params, **metrics_config)
metrics

We have unused keyword arguments : ['questions', 'contexts', 'openai_params']
Applicable keyword arguments for text-ner are ['metrics'].
Assertion Failed. Invalid Operation. Target: y_test_value. Reference Code: validate_ner. Details: y_test_value must be a list
Assertion Failed. Invalid Operation. Target: y_test_value. Reference Code: validate_ner. Details: y_test_value must be a list
Skipping the computation of ['accuracy'] for text-ner task due to the following exception : Assertion Failed. Invalid Operation. Target: y_test_value. Reference Code: validate_ner. Details: y_test_value must be a list
LLM related metrics need llm_params to be computed. Computing metrics for ['f1_score', 'gpt_relevance']
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.43it/s]
Assertion Failed. Invalid Operation. Target: y_test_value. Reference Code: validate_chat_completion. Details: y_test_value must be a list
Assertion Failed. Invalid Opera

{'metrics': {'mean_f1_score': 0.625,
  'median_f1_score': 0.625,
  'mean_gpt_relevance': 3.0,
  'median_gpt_relevance': 3.0,
  'accuracy': 0.5},
 'artifacts': {'f1_score': [1.0, 0.25], 'gpt_relevance': [5, 1]}}
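
The `mean_*` and `median_*` entries appear to be simple aggregates of the per-sample lists under `artifacts`. The sketch below recomputes them from the artifact values shown above using only the standard library. That the package aggregates in exactly this way is an assumption based on the numbers in this output.

```python
from statistics import mean, median

# Per-sample scores copied from the 'artifacts' entry in the output above.
artifacts = {"f1_score": [1.0, 0.25], "gpt_relevance": [5, 1]}

# Build mean_<name> and median_<name> entries for each metric.
aggregated = {}
for name, values in artifacts.items():
    aggregated[f"mean_{name}"] = float(mean(values))
    aggregated[f"median_{name}"] = float(median(values))

print(aggregated)
```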

## Computing Custom Prompt Metrics

Define custom prompt templates for the metrics to be computed.

In [11]:
custom_coherence_prompt = 'Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:\nOne star: the answer completely lacks coherence\nTwo stars: the answer mostly lacks coherence\nThree stars: the answer is partially coherent\nFour stars: the answer is mostly coherent\nFive stars: the answer has perfect coherency\n\nThis rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.\n\nquestion: What is your favorite indoor activity and why do you enjoy it?\nanswer: I like pizza. The sun is shining.\nstars: 1\n\nquestion: Can you describe your favorite movie without giving away any spoilers?\nanswer: It is a science fiction movie. There are dinosaurs. The actors eat cake. People must stop the villain.\nstars: 2\n\nquestion: What are some benefits of regular exercise?\nanswer: Regular exercise improves your mood. A good workout also helps you sleep better. Trees are green.\nstars: 3\n\nquestion: How do you cope with stress in your daily life?\nanswer: I usually go for a walk to clear my head. Listening to music helps me relax as well. Stress is a part of life, but we can manage it through some activities.\nstars: 4\n\nquestion: What can you tell me about climate change and its effects on the environment?\nanswer: Climate change has far-reaching effects on the environment. Rising temperatures result in the melting of polar ice caps, contributing to sea-level rise. Additionally, more frequent and severe weather events, such as hurricanes and heatwaves, can cause disruption to ecosystems and human societies alike.\nstars: 5\n\nquestion: {{questions}}\nanswer: {{predictions}}\nstars:'
print("Custom coherence prompt : \n", custom_coherence_prompt)

Custom coherence prompt : 
 Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:
One star: the answer completely lacks coherence
Two stars: the answer mostly lacks coherence
Three stars: the answer is partially coherent
Four stars: the answer is mostly coherent
Five stars: the answer has perfect coherency

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

question: What is your favorite indoor activity and why do you enjoy it?
answer: I like pizza. The sun is shining.
stars: 1

question: Can you describe your favorite movie without giving away any spoilers?
answer: It is a science fiction movie. There are dinosaurs. The actors eat cake. People must stop the villain.
stars: 
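
The template ends with `{{questions}}` and `{{predictions}}` placeholders, which are presumably filled once per sample before the prompt is sent to the model. The sketch below shows one simple way such substitution could work. `render_prompt` is a hypothetical helper written for illustration, not the package's actual rendering code, and the example row is made up.

```python
import re


def render_prompt(template, values):
    """Replace each {{name}} placeholder with values[name]; unknown names stay as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Tail of the coherence template above, with one hypothetical sample.
template = "question: {{questions}}\nanswer: {{predictions}}\nstars:"
row = {
    "questions": "What is unique about the Barreleye?",
    "predictions": "It has a transparent head.",
}

print(render_prompt(template, row))
```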

In [12]:
custom_equivalence_metric = """Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale: One star: the predicted answer is not at all similar to the correct answer Two stars: the predicted answer is mostly not similar to the correct answer Three stars: the predicted answer is somewhat similar to the correct answer Four stars: the predicted answer is mostly similar to the correct answer Five stars: the predicted answer is completely similar to the correct answer This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. The examples below show the Equivalence score for a question, a correct answer, and a predicted answer. question: What is the role of ribosomes? correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins. predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules. stars: 1 question: Why did the Titanic sink? correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life. predicted answer: The sinking of the Titanic was a result of a large iceberg collision. 
This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts. stars: 2 question: What causes seasons on Earth? correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns. predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions. stars: 3 question: How does photosynthesis work? correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions. predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions. stars: 4 question: What are the health benefits of regular exercise? correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood. predicted answer: Routine physical activity can contribute to  maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood. 
\nstars: 5\n\n question: {{questions}}\n correct answer: {{ground_truths}}\n predicted answer: {{predictions}}\n stars:"""

print("Custom equivalence metric : \n", custom_equivalence_metric)

Custom equivalence metric : 
 Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale: One star: the predicted answer is not at all similar to the correct answer Two stars: the predicted answer is mostly not similar to the correct answer Three stars: the predicted answer is somewhat similar to the correct answer Four stars: the predicted answer is mostly similar to the correct answer Five stars: the predicted answer is completely similar to the correct answer This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. The examples below show the Equivalence score for a q

Here we prepare the prompts using the `AzureMLCustomPromptMetric` class and build the list of extra metrics.

In [13]:
from azureml.metrics import AzureMLCustomPromptMetric

custom_metric_names = ["custom_coherence_class_api", "custom_equivalence_class_api"]
custom_metric_descriptions = ["Custom coherence metric with class based implementation",
                              "Custom equivalence metric with class based implementation"]

user_prompt_templates = [custom_coherence_prompt, custom_equivalence_metric]
input_vars = [["questions", "predictions"], ["questions", "ground_truths", "predictions"]]

custom_coherence_prompt_config = {
    "input_vars" : input_vars[0],
    "openai_params" : openai_params,
    "metric_name" : custom_metric_names[0],
    "metric_description" : custom_metric_descriptions[0],
    "user_prompt_template" : user_prompt_templates[0],
}

custom_equivalence_prompt_config = {
    "input_vars" : input_vars[1],
    "openai_params" : openai_params,
    "metric_name" : custom_metric_names[1],
    "metric_description" : custom_metric_descriptions[1],
    "user_prompt_template" : user_prompt_templates[1],
}

custom_coherence_class_based = AzureMLCustomPromptMetric(**custom_coherence_prompt_config)
custom_equivalence_class_based = AzureMLCustomPromptMetric(**custom_equivalence_prompt_config)

extra_metrics = [custom_coherence_class_based, custom_equivalence_class_based]
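
Each entry in `input_vars` has to line up with the `{{...}}` placeholders in its template, and a mismatch is easy to miss in a long prompt string. The sketch below is a small consistency check that could be run before constructing the metric objects. `check_input_vars` is a hypothetical helper, not part of azureml-metrics, and the template shown is only the tail of the equivalence prompt.

```python
import re


def placeholder_names(template):
    """Return the set of names that appear as {{name}} in the template."""
    return set(re.findall(r"\{\{(\w+)\}\}", template))


def check_input_vars(template, input_vars):
    """Compare declared input variables against the placeholders in the template."""
    found = placeholder_names(template)
    declared = set(input_vars)
    return {
        "missing_from_input_vars": sorted(found - declared),
        "unused_input_vars": sorted(declared - found),
    }


# Tail of the equivalence prompt defined above.
equivalence_tail = (
    "question: {{questions}}\n correct answer: {{ground_truths}}\n"
    " predicted answer: {{predictions}}\n stars:"
)

print(check_input_vars(equivalence_tail, ["questions", "ground_truths", "predictions"]))
print(check_input_vars(equivalence_tail, ["questions", "predictions"]))
```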

### a) Computing custom prompt metrics along with task supported metrics

In [14]:
from azureml.metrics import compute_metrics

data_input = {
    # additional data required for question answering metrics
    "contexts": contexts,
    "y_pred": y_pred,
    "y_test": y_test,

    # data required for coherence, equivalence prompt template
    "questions": questions,
    "predictions": y_pred,
    "ground_truths": y_test,
}

result = compute_metrics(task_type='qa',
                        metrics=extra_metrics,
                        openai_params=openai_params,
                        **data_input)
result

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.31it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.28it/s]
We have unused keyword arguments : ['predictions', 'ground_truths']
Applicable keyword arguments for qa are ['metrics', 'tokenizer', 'regexes_to_ignore', 'ignore_case', 'ignore_punctuation', 'ignore_numbers', 'lang', 'model_type', 'questions', 'openai_params', 'idf', 'rescale_with_baseline', 'contexts', 'openai_api_batch_size', 'use_chat_completion_api', 'openai_embedding_engine', 'llm_params', 'llm_api_batch_size'].
LLM related metrics need llm_params to be computed. Computing metrics for ['bertscore', 'gpt_groundedness', 'gpt_coherence', 'ada_similarity', 'gpt_similarity', 'gpt_fluency', 'exact_match', 'gpt_relevance', 'f1_score']
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.33it/s]

{'metrics': {'mean_custom_coherence_class_api': 4.0,
  'median_custom_coherence_class_api': 4.0,
  'mean_custom_equivalence_class_api': 3.0,
  'median_custom_equivalence_class_api': 3.0,
  'mean_bertscore_precision': 0.6587736904621124,
  'mean_bertscore_recall': 0.6381914019584656,
  'mean_bertscore_f1': 0.6499990373849869,
  'mean_gpt_groundedness': 4.0,
  'median_gpt_groundedness': 4.0,
  'mean_gpt_coherence': 4.0,
  'median_gpt_coherence': 4.0,
  'mean_ada_similarity': nan,
  'median_ada_similarity': nan,
  'mean_gpt_similarity': 3.0,
  'median_gpt_similarity': 3.0,
  'mean_gpt_fluency': 4.0,
  'median_gpt_fluency': 4.0,
  'mean_exact_match': 0.5,
  'median_exact_match': 0.5,
  'mean_gpt_relevance': 3.0,
  'median_gpt_relevance': 3.0,
  'mean_f1_score': 0.625,
  'median_f1_score': 0.625},
 'artifacts': {'custom_coherence_class_api': ['5', '3'],
  'custom_equivalence_class_api': ['5', '1'],
  'bertscore': {'precision': [1.0, 0.31754738092422485],
   'recall': [1.0, 0.276382803916931

### b) Computing custom prompt metrics without a task type

In [15]:
from azureml.metrics import compute_metrics

custom_prompt_data_input = {
    # data required for coherence prompt template
    "questions" : questions,
    "predictions": y_pred,
    # additional data required for equivalence prompt template
    "ground_truths": y_test,
}

result = compute_metrics(metrics=extra_metrics,
                        **custom_prompt_data_input)
result

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.24it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.28it/s]


{'metrics': {'mean_custom_coherence_class_api': 4.0,
  'median_custom_coherence_class_api': 4.0,
  'mean_custom_equivalence_class_api': 3.0,
  'median_custom_equivalence_class_api': 3.0},
 'artifacts': {'custom_coherence_class_api': ['5', '3'],
  'custom_equivalence_class_api': ['5', '1']}}
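
Note that the custom metric artifacts are returned as strings (`'5'`, `'3'`), and a failed call can leave a non-numeric marker such as `'retryerror'` in the list, as seen earlier for `ada_similarity`. The sketch below shows one way to post-process such lists into numeric summaries. Skipping unparseable entries and returning `nan` when nothing is left is an assumption made for this example, not documented package behavior.

```python
import math
from statistics import mean, median


def parse_scores(raw):
    """Convert raw artifact entries to floats, skipping entries that are not numeric."""
    scores = []
    for item in raw:
        try:
            scores.append(float(item))
        except (TypeError, ValueError):
            continue
    return scores


def summarize(raw):
    """Return mean and median of the parseable scores, or nan if there are none."""
    scores = parse_scores(raw)
    if not scores:
        return {"mean": math.nan, "median": math.nan}
    return {"mean": mean(scores), "median": median(scores)}


# Artifact lists copied from the outputs in this notebook.
print(summarize(["5", "3"]))
print(summarize(["5", "1"]))
print(summarize(["retryerror", "retryerror"]))
```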