# Metrics computation using IBM watsonx.governance for RAG task

This notebook computes reference-free Answer Quality metrics such as **Faithfulness**, **Answer relevance** and **Unsuccessful requests** for the RAG task type.

**Faithfulness** measures how faithful the model output or generated text is to the context sent to the LLM input. The faithfulness score is a value between 0 and 1. A value closer to 1 indicates that the output is more faithful or grounded and less hallucinated. A value closer to 0 indicates that the output is less faithful or grounded and more hallucinated.

**Answer relevance** measures how relevant the answer or generated text is to the question. This is one of the ways to determine the quality of your model. The answer relevance score is a value between 0 and 1. A value closer to 1 indicates that the answer is more relevant to the given question. A value closer to 0 indicates that the answer is less relevant to the question.

**Unsuccessful requests** measures the ratio of questions answered unsuccessfully to the total number of questions. The unsuccessful requests score is a value between 0 and 1. A value closer to 0 indicates that the model is successfully answering the questions. A value closer to 1 indicates that the model is not able to answer the questions.

**Prerequisites**: Provide a data frame that contains the question, the context and the model-generated response for each question to compute the reference-free metrics. Include a reference answer as well to compute the reference-based metrics.
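Before computing anything, it can help to confirm that the data frame really has the columns described above. The following is a minimal sketch, not part of the watsonx.governance SDK: the helper name `missing_columns` and the `reference_answer` column name are hypothetical. It takes a plain list of column names, so with pandas one would pass `list(data.columns)`.

```python
def missing_columns(available, context_columns, question_column,
                    answer_column, reference_column=None):
    """Return the required column names that are absent from `available`.

    Hypothetical helper for illustration only.
    """
    required = list(context_columns) + [question_column, answer_column]
    # A reference column is only needed for reference-based metrics.
    if reference_column is not None:
        required.append(reference_column)
    present = set(available)
    return [name for name in required if name not in present]


columns = ["context1", "context2", "question", "answer"]

# Reference-free metrics: context, question and answer are enough.
print(missing_columns(columns, ["context1", "context2"], "question", "answer"))

# Reference-based metrics: the (hypothetical) reference column is missing here.
print(missing_columns(columns, ["context1", "context2"], "question", "answer",
                      reference_column="reference_answer"))
```

An empty list means all required columns are present.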

**Please run the notebook in an environment with memory greater than 6GB**


## Install required libraries

In [None]:
!pip install -U "ibm-metrics-plugin>5.0.0.0" | tail -n 1
!pip install -U "ibm-watson-openscale>=3.0.38" | tail -n 1
!pip install nest_asyncio unitxt torch==2.1.0 | tail -n 1

import warnings
warnings.filterwarnings("ignore")

#### Restart kernel

## Initialize watsonx.governance client

In [None]:
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

# For authentication against a CPD cluster with version 5.0.0 or higher
WOS_CREDENTIALS = {
    "url": "<server_url>",
    "username": "<username>",
    "password": "<password>"
}
authenticator = CloudPakForDataAuthenticator(
        url=WOS_CREDENTIALS["url"],
        username=WOS_CREDENTIALS["username"],
        password=WOS_CREDENTIALS["password"],
        disable_ssl_verification=True, 
    )

client = APIClient(service_url=WOS_CREDENTIALS["url"],authenticator=authenticator)
client.version

## Read the data

Create a data frame containing the context, the question and the answer generated by the LLM. If the context contains multiple chunks, each chunk can be sent in a separate column.
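If the retrieved chunks are held as a list per question, they can be spread over separate context columns. This is a sketch under assumptions: the helper `chunks_to_columns` is hypothetical, and padding missing chunks with empty strings (or dropping chunks beyond `num_columns`) is an illustrative choice, not a documented requirement of the service.

```python
def chunks_to_columns(chunks, num_columns, prefix="context"):
    """Map a list of context chunks onto columns named context1..contextN.

    Hypothetical helper: extra chunks are dropped and missing ones are
    padded with empty strings so that every row has the same columns.
    """
    kept = list(chunks[:num_columns])
    padded = kept + [""] * (num_columns - len(kept))
    return {f"{prefix}{i + 1}": text for i, text in enumerate(padded)}


# Made-up chunks for one question.
row = chunks_to_columns(["first chunk", "second chunk"], num_columns=4)
row["question"] = "What is ARPA-H?"
print(sorted(row))
```

A list of such row dictionaries can then be passed to `pd.DataFrame(rows)` to build the input data.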

In [None]:
# Download rag data
!rm -f rag_state_union.csv
!wget https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/rag_state_union.csv

In [2]:
import pandas as pd

test_data_file = "rag_state_union.csv"
data = pd.read_csv(test_data_file)
data.head()

# Run the code snippet below only if you are running the notebook in Watson Studio
# from ibm_watson_studio_lib import access_project_or_space
# wslib = access_project_or_space()
# wslib.download_file(test_data_file)

Unnamed: 0,context1,context2,context3,context4,question,answer
0,"Last month, I announced our plan to supercharg...",For that purpose we’ve mobilized American grou...,"If you travel 20 miles east of Columbus, Ohio,...",But cancer from prolonged exposure to burn pit...,What is ARPA-H?,ARPA-H is the Advanced Research Projects Agenc...
1,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is the investment of Ford and GM to build...,Ford is investing $11 billion to build electri...
2,My plan will cut the cost in half for most fam...,We got more than 130 countries to agree on a g...,And unlike the $2 Trillion tax cut passed in t...,We’re going after the criminals who stole bill...,What is the proposed tax rate for corporations?,The proposed tax rate for corporations is a 15...
3,"If you travel 20 miles east of Columbus, Ohio,...",So let’s not wait any longer. Send it to my de...,When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is Intel going to build?,Intel is going to build a $20 billion semicond...
4,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",It is going to transform America and put us on...,Vice President Harris and I ran for office wit...,How many electric vehicle charging stations ar...,The document does not provide information on t...


### Configuration for RAG

### Common parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| context_columns | The list of context column names in the input data frame. |  |  |
| question_column | The name of the question column in the input data frame. |  |  |
| answer_column | The name of the answer column in the input data frame. |  |  |
| record_level | The flag to return the record level metrics values. Set the flag under configuration to generate record level metrics for all the metrics. Set the flag under specific metric to generate record level metrics for that metric alone. | False | True, False |
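The two places where `record_level` can be set can be pictured with a small sketch. It only illustrates the behaviour described in the table and is not the plugin's code: the helper `record_level_enabled` is hypothetical, and the plain string keys `"rag"`, `"faithfulness"` and `"answer_relevance"` are placeholders for the enum values used later in this notebook.

```python
def record_level_enabled(configuration, metric_group, metric_name):
    """Return True if record-level values are requested for `metric_name`.

    Hypothetical helper: the flag under the configuration applies to all
    metrics, the flag under a metric applies to that metric alone.
    """
    global_flag = bool(configuration.get("record_level", False))
    metric_settings = configuration.get(metric_group, {}).get(metric_name, {})
    metric_flag = bool(metric_settings.get("record_level", False))
    return global_flag or metric_flag


# Placeholder keys; the real notebook uses enum values instead of strings.
example_configuration = {
    "context_columns": ["context1"],
    "question_column": "question",
    "rag": {
        "faithfulness": {"record_level": True},
        "answer_relevance": {},
    },
}

print(record_level_enabled(example_configuration, "rag", "faithfulness"))
print(record_level_enabled(example_configuration, "rag", "answer_relevance"))
```

With this example configuration, only faithfulness would return record-level values; adding `"record_level": True` at the top level would enable it for both metrics.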


### Faithfulness parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| attributions_count [Optional]| Source attributions are computed for each sentence in the generated answer. Source attribution for a sentence is the set of sentences in the context which contributed to the LLM generating that sentence in the answer.  The attributions_count parameter specifies the number of sentences in the context which need to be identified for attributions.  E.g., if the value is set to 2, then we will find the top 2 sentences from the context as source attributions. | 3 |  |
| ngrams [Optional]| The number of sentences to be grouped from the context when computing the faithfulness score. These grouped sentences will be shown in the attributions. A very high value of ngrams might lead to lower faithfulness scores due to dispersion of data and inclusion of unrelated sentences in the attributions. A very low value might increase the metric computation time and lead to attributions that do not capture all the aspects of the answer. | 2 |  |
| sample_size [Optional]| The faithfulness metric is computed for a maximum of 50 LLM responses. If you wish to compute it for a smaller number of responses, set the sample_size value to a lower number. If the number of records in the input data frame is more than the sample size, a uniform random sample will be taken for computation. | 50 | Integer between 0 and 50. Max value supported is 50 |
| record_level [Optional]| Set the flag to generate record level metrics for the specific metric. | False | True, False |
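To make the `ngrams` parameter more concrete, the sketch below groups context sentences into windows of `ngrams` consecutive sentences. It is an illustration only: the sentence splitter is naive, and the overlapping windows with a step of one sentence are an assumption, not a description of how the service groups sentences internally.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def group_sentences(sentences, ngrams=2):
    """Group consecutive sentences into overlapping windows of size `ngrams`.

    Assumption for illustration: windows advance by one sentence.
    """
    if ngrams < 1:
        raise ValueError("ngrams must be at least 1")
    if len(sentences) <= ngrams:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + ngrams])
            for i in range(len(sentences) - ngrams + 1)]


# Made-up context text.
context = "First sentence. Second sentence. Third sentence."
for group in group_sentences(split_sentences(context), ngrams=2):
    print(group)
```

A larger `ngrams` value produces fewer, longer groups, which matches the trade-off described in the table: long groups may pull in unrelated sentences, while short groups mean more units to score.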

### Unsuccessful requests parameters
| Parameter | Description | Default Value |
|:-|:-|:-|
| unsuccessful_phrases [Optional]| The list of phrases to be used for comparing the model output to determine whether the request is unsuccessful or not. | ["i don't know", "i do not know", "i'm not sure", "i am not sure", "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain", "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill"] |
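The ratio described for unsuccessful requests can be sketched in a few lines. This is a simplified illustration, assuming a case-insensitive substring match against the phrase list; the actual matching logic in the metrics plugin may differ. The function name `unsuccessful_ratio` and the sample answers are made up.

```python
# A subset of the default phrases listed in the table above.
PHRASES = ["i don't know", "i do not know", "i'm not sure", "i am not sure"]


def unsuccessful_ratio(answers, phrases=PHRASES):
    """Fraction of answers that contain at least one unsuccessful phrase.

    Assumption: case-insensitive substring matching.
    """
    if not answers:
        return 0.0
    lowered = [p.lower() for p in phrases]
    hits = sum(1 for a in answers if any(p in a.lower() for p in lowered))
    return hits / len(answers)


# Made-up model answers.
answers = [
    "The plant will be built in Ohio.",
    "I don't know the answer to that question.",
    "I am not sure about that.",
    "The proposed rate is 15 percent.",
]
print(unsuccessful_ratio(answers))
```

Two of the four sample answers contain a listed phrase, so the ratio is 0.5.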

In [None]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup, LLMCommonMetrics, LLMQAMetrics

# Edit below values based on the input data
context_columns = ["context1", "context2", "context3", "context4"]
question_column = "question"
answer_column = "answer"

config_json = {
    "configuration": {
        # "record_level": True,  # uncomment to return record-level values for every metric
        "context_columns": context_columns,
        "question_column": question_column,
        LLMTextMetricGroup.RAG.value: {
            LLMCommonMetrics.FAITHFULNESS.value: {
                "record_level": True,
                # "attributions_count": 3,
                # "ngrams": 2,
                # "sample_size": 10,
            },
            LLMCommonMetrics.ANSWER_RELEVANCE.value: {
                "record_level": True,
            },
            LLMQAMetrics.UNSUCCESSFUL_REQUESTS.value: {
                # "unsuccessful_phrases": [],
            },
        },
    }
}

### Create the input and output data frames to send as input to the metrics computation

In [4]:
from copy import deepcopy

input_columns = deepcopy(context_columns)
input_columns.append(question_column)
df_input = data[input_columns].copy()
df_output = data[[answer_column]].copy()

### Compute the metrics.

The faithfulness metric is computed on the server (Cloud or CPD watsonx.governance instance) asynchronously by default. The faithfulness metric response contains the details of the submitted computation tasks and their responses. The `get_metrics_result` method shown in the next cells can be used to get the response from the server.

To execute the faithfulness metric synchronously, pass the parameter `background_mode=False` to the `compute_metrics` method.

**Note:** Each computation of faithfulness metric will consume 1 Resource Unit when the user cloud watsonx.governance instance is used.
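Because the computation runs asynchronously by default, the results have to be fetched repeatedly until the tasks finish. Instead of rerunning a cell by hand, a generic polling loop could be used. The sketch below is not part of the SDK: `poll_until` is a hypothetical helper, and the completion check `is_done` must be supplied by the user because it depends on the shape of the response.

```python
import time


def poll_until(fetch, is_done, interval_seconds=5.0, max_attempts=60,
               sleep=time.sleep):
    """Call `fetch` repeatedly until `is_done(result)` is true.

    Hypothetical helper. Raises TimeoutError after `max_attempts` tries.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_done(result):
            return result
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"not finished after {max_attempts} attempts")


# Demo with a fake fetch function that finishes on the third call.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    return {"finished": calls["count"] >= 3}


result = poll_until(fake_fetch, lambda r: r["finished"],
                    interval_seconds=0, sleep=lambda seconds: None)
print(result, calls["count"])
```

In this notebook, `fetch` could wrap the `client.llm_metrics.get_metrics_result(configuration=config_json, metrics_result=metrics_result)` call, with `is_done` written against whatever status information the response contains.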

In [5]:
import json
metrics_result = client.llm_metrics.compute_metrics(config_json, 
                                                    sources=df_input, 
                                                    predictions=df_output,
                                                    references=None,
                                                    # background_mode=False,  # uncomment to run synchronously
                                                   )

print(json.dumps(metrics_result, indent=2))

Artifact metrics.reward.deberta_v3_large_v2 is fetched from LocalCatalog(type='local_catalog', __description__=None, __tags__={}, __id__=None, is_local=True, name='local', location='/root/prv/github/metric-plugins/venv310_nb/lib/python3.10/site-packages/unitxt/catalog')
Artifact metrics.rag.answer_reward is fetched from LocalCatalog(type='local_catalog', __description__=None, __tags__={}, __id__=None, is_local=True, name='local', location='/root/prv/github/metric-plugins/venv310_nb/lib/python3.10/site-packages/unitxt/catalog')
{
  "answer_relevance": {
    "metric_value": 0.6308,
    "mean": 0.6308,
    "min": 0.2322,
    "max": 0.9672,
    "total_records": 5,
    "record_level_metrics": [
      {
        "answer_relevance": 0.8572,
        "record_id": "99884fd3-1c3b-432d-aa7e-acda18401930"
      },
      {
        "answer_relevance": 0.9672,
        "record_id": "c9f6fd6c-bbb7-4b3a-9466-17737abe08c1"
      },
      {
        "answer_relevance": 0.5256,
        "record_id": "10b90644-

#### Rerun the below cell until all the computation tasks are finished and results returned.

In [7]:
final_results = client.llm_metrics.get_metrics_result(configuration=config_json, metrics_result=metrics_result)
print(json.dumps(final_results, indent=2))

{
  "answer_relevance": {
    "metric_value": 0.6308,
    "mean": 0.6308,
    "min": 0.2322,
    "max": 0.9672,
    "total_records": 5,
    "record_level_metrics": [
      {
        "answer_relevance": 0.8572,
        "record_id": "99884fd3-1c3b-432d-aa7e-acda18401930"
      },
      {
        "answer_relevance": 0.9672,
        "record_id": "c9f6fd6c-bbb7-4b3a-9466-17737abe08c1"
      },
      {
        "answer_relevance": 0.5256,
        "record_id": "10b90644-a33a-4bb6-b50a-4a130f006553"
      },
      {
        "answer_relevance": 0.5719,
        "record_id": "d3dc6c74-4937-4e20-98f4-68c70774184a"
      },
      {
        "answer_relevance": 0.2322,
        "record_id": "0718468f-468a-4cf8-8a29-7fd3af193826"
      }
    ]
  },
  "faithfulness": {
    "max": 0.9902,
    "mean": 0.6894399999999999,
    "metric_value": 0.6894399999999999,
    "min": 0.0023,
    "total_records": 5,
    "record_level_metrics": [
      {
        "faithfulness": 0.8524,
        "faithfulness_attributions"

### Metric results for all the records

In [8]:
# Execute this cell only after the above faithfulness metric computation tasks are complete
results_df = data.copy()
for k, v in final_results.items():
    if v.get("record_level_metrics"):
        results_df[k] = [r.get(k) for r in v.get("record_level_metrics")]
results_df

Unnamed: 0,context1,context2,context3,context4,question,answer,answer_relevance,faithfulness
0,"Last month, I announced our plan to supercharg...",For that purpose we’ve mobilized American grou...,"If you travel 20 miles east of Columbus, Ohio,...",But cancer from prolonged exposure to burn pit...,What is ARPA-H?,ARPA-H is the Advanced Research Projects Agenc...,0.8572,0.8524
1,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is the investment of Ford and GM to build...,Ford is investing $11 billion to build electri...,0.9672,0.6144
2,My plan will cut the cost in half for most fam...,We got more than 130 countries to agree on a g...,And unlike the $2 Trillion tax cut passed in t...,We’re going after the criminals who stole bill...,What is the proposed tax rate for corporations?,The proposed tax rate for corporations is a 15...,0.5256,0.9902
3,"If you travel 20 miles east of Columbus, Ohio,...",So let’s not wait any longer. Send it to my de...,When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is Intel going to build?,Intel is going to build a $20 billion semicond...,0.5719,0.9879
4,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",It is going to transform America and put us on...,Vice President Harris and I ran for office wit...,How many electric vehicle charging stations ar...,The document does not provide information on t...,0.2322,0.0023
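In the output above, the aggregate fields of each metric appear to be simple statistics over the record-level values. The sketch below recomputes them for answer relevance from the five record-level scores shown in this notebook. The helper `summarize` is hypothetical, and rounding the mean to four decimals is an assumption made to match the displayed value.

```python
from statistics import mean


def summarize(record_level_metrics, metric_name):
    """Recompute aggregate statistics from record-level metric values."""
    values = [record[metric_name] for record in record_level_metrics]
    return {
        "mean": round(mean(values), 4),
        "min": min(values),
        "max": max(values),
        "total_records": len(values),
    }


# Record-level answer relevance scores copied from the results above.
answer_relevance_records = [
    {"answer_relevance": 0.8572},
    {"answer_relevance": 0.9672},
    {"answer_relevance": 0.5256},
    {"answer_relevance": 0.5719},
    {"answer_relevance": 0.2322},
]

summary = summarize(answer_relevance_records, "answer_relevance")
print(summary)
```

The recomputed mean, minimum and maximum agree with the `mean`, `min` and `max` fields returned for `answer_relevance`.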


### Show the attributions for the first record

Attributions are the most important sentences in the context which contributed to the faithfulness score.

In [9]:
record_idx = 0
attributions, attribution_scores = [], []
for i in final_results.get("faithfulness")["record_level_metrics"][record_idx]["faithfulness_attributions"]:
    for attr in i["attributions"]:
        attributions.extend(attr.get("feature_values"))
        attribution_scores.extend(attr.get("faithfulness_scores"))

attributions_df = pd.DataFrame({"faithfulness attribution": attributions, "attribution score": attribution_scores})
pd.set_option('display.max_colwidth', 0)
attributions_df.sort_values(by=["attribution score"], inplace=True, ascending=False)
print("Question: {}".format(results_df[question_column].iloc[record_idx]))
print("Answer: {}".format(results_df[answer_column].iloc[record_idx]))
print("Attributions: ")
attributions_df

Question: What is ARPA-H?
Answer: ARPA-H is the Advanced Research Projects Agency for Health, which is an agency that aims to drive breakthroughs in cancer, Alzheimer's, diabetes, and more. It was proposed by the U.S. President to supercharge the Cancer Moonshot and cut the cancer death rate by at least 50% over the next 25 years.
Attributions: 


Unnamed: 0,faithfulness attribution,attribution score
3,"Last month, I announced our plan to supercharge the Cancer Moonshot that President Obama asked me to lead six years ago. Our goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases.",0.9633
0,"It’s based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more. ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more.",0.7415
1,"ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more. A unity agenda for the nation.",0.4274
2,"Tonight, Danielle—we are. The VA is pioneering new ways of linking toxic exposures to diseases, already helping more veterans get benefits.",0.0023
4,The Internet. Technology we have yet to invent.,0.002
5,"Tonight, Danielle—we are. The VA is pioneering new ways of linking toxic exposures to diseases, already helping more veterans get benefits.",0.0017


Author: <a href="mailto:pvemulam@in.ibm.com">Pratap Kishore Varma V</a>