#### Computing prompt-based (LLM-as-a-judge/LLM star) metrics in the AzureML SDK

This sample notebook demonstrates how to compute LLM-as-a-judge metrics with the AzureML SDK. It covers two scenarios:

<pre>
1) Calculate prompt-based metrics using OpenAI GPT models
2) Calculate prompt-based metrics using any LLM endpoint deployed from the AzureML Model Catalog
</pre>

#### Prerequisites

1) Install the latest version of the azureml-metrics package (with its text-based requirements) using the following command:

``` $ pip install --upgrade azureml-metrics[text] ```

For more details on the azureml-metrics package, see https://aka.ms/azureml-metrics-quick-start
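The OpenAI example in this notebook reads its settings from four environment variables (`OPENAI_API_VERSION`, `OPENAI_API_BASE`, `OPENAI_API_TYPE`, `OPENAI_API_KEY`). A minimal sketch of a pre-flight check is shown here; the helper `missing_env_vars` is a hypothetical name introduced for illustration and is not part of azureml-metrics.

```python
import os

# Environment variable names read by the OpenAI example in this notebook.
REQUIRED_VARS = (
    "OPENAI_API_VERSION",
    "OPENAI_API_BASE",
    "OPENAI_API_TYPE",
    "OPENAI_API_KEY",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this before the first cell turns a `KeyError` from `os.environ[...]` into a readable list of what still needs to be set.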

#### 1) Calculate prompt-based metrics using OpenAI GPT models

In [2]:
import os
from azureml.metrics import compute_metrics, constants

context = "In 2018, a group of scientists discovered a new type of deep-sea fish that has a transparent head. The fish, named Barreleye, has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
question = "What is the name of the deep-sea fish discovered by scientists in 2018, and what is unique about its head?"
coherent_answer = "The deep-sea fish discovered by scientists in 2018 is called Barreleye, and it has a transparent head. The fish has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
incoherent_answer = "The scientists who made the discovery in 2018 were actually studying coral reefs, not deep-sea fish. However, they did come across an unusual creature that they couldn't identify. It turned out to be a type of sea cucumber that has a strange, tube-like shape."

# Note: set the OPENAI_* environment variables and replace <deployment_name> with your own deployment name.
openai_params = {
    "api_version" : os.environ["OPENAI_API_VERSION"],
    "api_base" : os.environ["OPENAI_API_BASE"],
    "api_type" : os.environ["OPENAI_API_TYPE"],
    "api_key" : os.environ["OPENAI_API_KEY"],
    "deployment_id" : "<deployment_name>"
}

metrics_config = {
    "questions" : [question, question],
    "contexts" : [context, context],
    "openai_params" : openai_params,
    "metrics" : ["gpt_coherence", "gpt_fluency", "gpt_relevance", "gpt_groundedness", "gpt_similarity"]
}

result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING,
                         y_test=[coherent_answer, coherent_answer],
                         y_pred=[coherent_answer, incoherent_answer],
                         **metrics_config,)
result

LLM related metrics need llm_params to be computed. Computing metrics for ['gpt_relevance', 'gpt_groundedness', 'gpt_coherence', 'gpt_similarity', 'gpt_fluency']
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.65it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.67it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.21it/s]


{'metrics': {'mean_gpt_relevance': 3.0,
  'median_gpt_relevance': 3.0,
  'mean_gpt_groundedness': 4.0,
  'median_gpt_groundedness': 4.0,
  'mean_gpt_coherence': 4.0,
  'median_gpt_coherence': 4.0,
  'mean_gpt_similarity': 3.0,
  'median_gpt_similarity': 3.0,
  'mean_gpt_fluency': 4.0,
  'median_gpt_fluency': 4.0},
 'artifacts': {'gpt_relevance': [5, 1],
  'gpt_groundedness': [5, 3],
  'gpt_coherence': [5, 3],
  'gpt_similarity': [5, 1],
  'gpt_fluency': [5, 3]}}
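In the output above, each `mean_*` and `median_*` value matches the mean and median of the per-sample scores listed under `artifacts`. The sketch below recomputes the aggregates from those scores to make that relationship explicit. This is an observation from this particular output, not a statement about how the library is implemented, and `aggregate` is a hypothetical helper.

```python
from statistics import mean, median

# Per-sample scores copied from the 'artifacts' entry of the output above.
artifacts = {
    "gpt_relevance": [5, 1],
    "gpt_groundedness": [5, 3],
    "gpt_coherence": [5, 3],
    "gpt_similarity": [5, 1],
    "gpt_fluency": [5, 3],
}


def aggregate(scores_by_metric):
    """Recompute mean_<metric> and median_<metric> from per-sample scores."""
    summary = {}
    for name, scores in scores_by_metric.items():
        summary[f"mean_{name}"] = float(mean(scores))
        summary[f"median_{name}"] = float(median(scores))
    return summary


summary = aggregate(artifacts)
for key, value in summary.items():
    print(key, value)
```

With only two samples the mean and median coincide, so the per-sample `artifacts` are the more informative part of the result: they show that the incoherent answer scored 1 on relevance and similarity and 3 on the other three metrics.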

In [3]:
# View the prompt template used for the coherence metric
from azureml.metrics import list_prompts, constants

coherence_prompt = list_prompts(task_type=constants.Tasks.QUESTION_ANSWERING,
                               metric="gpt_coherence")
print(coherence_prompt)

Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:
One star: the answer completely lacks coherence
Two stars: the answer mostly lacks coherence
Three stars: the answer is partially coherent
Four stars: the answer is mostly coherent
Five stars: the answer has perfect coherency

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

question: What is your favorite indoor activity and why do you enjoy it?
answer: I like pizza. The sun is shining.
stars: 1

question: Can you describe your favorite movie without giving away any spoilers?
answer: It is a science fiction movie. There are dinosaurs. The actors eat cake. People must stop the villain.
stars: 2

question: What are some b
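The template asks the judge model to answer with an integer from 1 to 5. As a rough illustration of what validating such a reply could look like, the sketch below extracts a rating from free-form text. It is a hypothetical helper written for this note and makes no claim about how azureml-metrics parses the model's reply.

```python
import re

# Matches a single digit 1-5 that is not part of a longer number
# (so "10", "42" and "3.5" are rejected).
_RATING = re.compile(r"(?<![\d.])[1-5](?!\d|\.\d)")


def parse_star_rating(reply):
    """Return the first valid 1-5 rating found in the reply, or None."""
    match = _RATING.search(reply)
    return int(match.group(0)) if match else None


print(parse_star_rating("stars: 4"))
print(parse_star_rating("I cannot rate this answer."))
```

Returning `None` instead of a default score keeps unparseable replies visible, so they can be counted or retried rather than silently skewing the averages.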

#### 2) Calculate prompt-based metrics using any LLM endpoint deployed from the AzureML Model Catalog

In [4]:
from azureml.metrics import compute_metrics, constants

context = "In 2018, a group of scientists discovered a new type of deep-sea fish that has a transparent head. The fish, named Barreleye, has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
question = "What is the name of the deep-sea fish discovered by scientists in 2018, and what is unique about its head?"
coherent_answer = "The deep-sea fish discovered by scientists in 2018 is called Barreleye, and it has a transparent head. The fish has tubular eyes that can rotate to look either upward or forward, allowing it to see potential prey and predators in the dark depths of the ocean."
incoherent_answer = "The scientists who made the discovery in 2018 were actually studying coral reefs, not deep-sea fish. However, they did come across an unusual creature that they couldn't identify. It turned out to be a type of sea cucumber that has a strange, tube-like shape."

# Note: Please replace the values for the following variables with your own values.
llm_params = {
    "llm_url": "<rest_endpoint_after_deployment_from_azureml_model_catalog>",
    "llm_api_key": "<api_key_for_endpoint>",
    "azureml_model_deployment": "<deployment_name>",
}

metrics_config = {
    "questions" : [question, question],
    "contexts" : [context, context],
    "llm_params" : llm_params,
    "metrics": ["llm_coherence", "llm_fluency", "llm_relevance", "llm_groundedness", "llm_similarity"]
}

result = compute_metrics(task_type=constants.Tasks.QUESTION_ANSWERING,
                         y_test=[coherent_answer, coherent_answer],
                         y_pred=[coherent_answer, incoherent_answer],
                         **metrics_config,)
result

GPT related metrics need openai_params to be computed. Computing metrics for ['llm_relevance', 'llm_similarity', 'llm_coherence', 'llm_fluency', 'llm_groundedness']
GPT related metrics need openai_params in a dictionary.
GPT related metrics need openai_params in a dictionary.
GPT related metrics need openai_params in a dictionary.
GPT related metrics need openai_params in a dictionary.
GPT related metrics need openai_params in a dictionary.


{'metrics': {'mean_llm_relevance': 3.0,
  'median_llm_relevance': 3.0,
  'mean_llm_similarity': 3.0,
  'median_llm_similarity': 3.0,
  'mean_llm_coherence': 3.0,
  'median_llm_coherence': 3.0,
  'mean_llm_fluency': 1.0,
  'median_llm_fluency': 1.0,
  'mean_llm_groundedness': 1.0,
  'median_llm_groundedness': 1.0},
 'artifacts': {'llm_relevance': [5, 1],
  'llm_similarity': [5, 1],
  'llm_coherence': [5, 1],
  'llm_fluency': [1, 1],
  'llm_groundedness': [1, 1]}}
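The two judges do not agree on every sample: in the outputs shown in this notebook, the deployed endpoint rates even the coherent answer 1 on fluency and groundedness, while the GPT judge rates it 5. The sketch below lists where the per-sample scores differ. The score dictionaries are copied from the two `artifacts` outputs with the `gpt_`/`llm_` prefixes removed, and `disagreements` is a hypothetical helper; it only reports the differences and does not explain them.

```python
# Per-sample scores copied from the two 'artifacts' outputs in this notebook.
gpt_scores = {
    "relevance": [5, 1],
    "groundedness": [5, 3],
    "coherence": [5, 3],
    "similarity": [5, 1],
    "fluency": [5, 3],
}
llm_scores = {
    "relevance": [5, 1],
    "similarity": [5, 1],
    "coherence": [5, 1],
    "fluency": [1, 1],
    "groundedness": [1, 1],
}


def disagreements(a, b):
    """Return {metric: [(sample_index, score_a, score_b), ...]} where scores differ."""
    diffs = {}
    for metric in sorted(a.keys() & b.keys()):
        rows = [
            (i, x, y)
            for i, (x, y) in enumerate(zip(a[metric], b[metric]))
            if x != y
        ]
        if rows:
            diffs[metric] = rows
    return diffs


for metric, rows in disagreements(gpt_scores, llm_scores).items():
    print(metric, rows)
```

A comparison like this is a quick way to decide whether a candidate judge model needs a different prompt or further tuning before its scores are trusted.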

#### How can LLM-based metrics be utilized?

<pre>
1) Use any GPT-based model by passing Azure OpenAI credentials to AzureML Model Evaluation.
2) Use any other LLM in the AzureML Model Catalog by deploying it to a real-time endpoint.
3) Pick any base model:
    - perform supervised fine-tuning with instructions based on computing the relevant prompt-based metrics.
    - onboard it into the AzureML platform using the Import framework.
    - deploy the fine-tuned model to a real-time endpoint.
    - compute prompt-based (LLM-as-a-judge/LLM star) metrics using AzureML Model Evaluation.
</pre>
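In the two examples in this notebook, options 1 and 2 differ only in the credentials key (`openai_params` versus `llm_params`) and the metric name prefix (`gpt_` versus `llm_`). A minimal sketch of a helper that builds either configuration follows. `build_metrics_config` and `BASE_METRICS` are hypothetical names introduced here, not part of azureml-metrics, and the helper only assembles the dictionary; it does not call any service.

```python
# Metric suffixes used in both examples in this notebook.
BASE_METRICS = ("coherence", "fluency", "relevance", "groundedness", "similarity")


def build_metrics_config(backend, params, questions, contexts):
    """Build the keyword arguments used in this notebook for either judge.

    backend: "gpt" for Azure OpenAI credentials, "llm" for a deployed endpoint.
    """
    if backend not in ("gpt", "llm"):
        raise ValueError("backend must be 'gpt' or 'llm'")
    params_key = "openai_params" if backend == "gpt" else "llm_params"
    return {
        "questions": list(questions),
        "contexts": list(contexts),
        params_key: params,
        "metrics": [f"{backend}_{name}" for name in BASE_METRICS],
    }


config = build_metrics_config(
    "llm",
    {"llm_url": "<endpoint>", "llm_api_key": "<key>"},  # placeholders
    questions=["<question>"],
    contexts=["<context>"],
)
print(config["metrics"])
```

The resulting dictionary can be unpacked into `compute_metrics(..., **config)` in the same way as `metrics_config` in the cells above.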