# Retrieval and Answer Quality Metrics computation using fine-tuned models in IBM watsonx.governance for RAG tasks

This notebook demonstrates how to compute reference-free answer quality metrics, such as **Faithfulness** and **Answer Relevance**, and retrieval quality metrics, such as **Context Relevance**, **Retrieval Precision**, **Average Precision**, **Reciprocal Rank**, **Hit Rate** and **Normalized Discounted Cumulative Gain**, using fine-tuned models.

To compute the above metrics using the IBM fine-tuned models, you must create a custom evaluator for each metric and provide it in the monitor configuration. A custom evaluator is a Python function, created on the runtime-24.1 base and deployed in a space, that runs the models and computes the metrics.
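To make the retrieval quality metric names more concrete, the following is a minimal sketch of their textbook definitions over a ranked list of binary relevance labels. It is an illustration only: the function names and the binary-relevance input are assumptions, and the fine-tuned evaluators derive relevance from model scores, so their exact formulas may differ from these.

```python
import math
from typing import Sequence


def hit_rate(relevant: Sequence[int]) -> int:
    """1 if at least one retrieved context is relevant, else 0."""
    return int(any(relevant))


def reciprocal_rank(relevant: Sequence[int]) -> float:
    """1 / rank of the first relevant context (ranks start at 1), or 0.0."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def retrieval_precision(relevant: Sequence[int]) -> float:
    """Fraction of the retrieved contexts that are relevant."""
    return sum(relevant) / len(relevant) if relevant else 0.0


def average_precision(relevant: Sequence[int]) -> float:
    """Mean of precision@k over every rank k that holds a relevant context."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg(gains: Sequence[float]) -> float:
    """DCG of the given order divided by the DCG of the ideal (sorted) order."""

    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical ranking of four contexts in which the 1st and 3rd are relevant
labels = [1, 0, 1, 0]
print(hit_rate(labels), reciprocal_rank(labels), retrieval_precision(labels))
print(round(average_precision(labels), 4), round(ndcg(labels), 4))
```

All five functions are reference-free in the sense used above: they need only a relevance judgement per retrieved context, not a ground-truth answer.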

## Learning goals

- Create custom evaluators to run the fine tuned models and compute Retrieval and Answer Quality metrics

## Prerequisites

- IBM Cloud credentials
- A `.csv` file containing test data to be evaluated
- ID of the space in which you want to create the Python function deployments

## Contents

- [Step 1 - Setup](#setup)
- [Step 2 - Read data](#data)
- [Step 3 - Custom evaluators creation](#evaluator)
- [Step 4 - Configure retrieval and answer quality metrics](#config)
- [Step 5 - Compute the retrieval and answer quality metrics](#compute)
- [Step 6 - Display the results](#results)

## Step 1 - Setup <a id="setup"></a>

### Install necessary libraries

In [None]:
!pip install "ibm-watson-openscale>=3.0.43"
!pip install ibm-metrics-plugin~=3.0.11
!pip install ibm-watsonx-ai==1.1.11

Looking in indexes: https://test.pypi.org/simple/


**Note**: you may need to restart the kernel to use updated libraries.

### Configure your credentials

In [2]:
# Cloud credentials
IAM_URL = "https://iam.cloud.ibm.com"
CLOUD_API_KEY = "<EDIT THIS>"  # YOUR_CLOUD_API_KEY
SERVICE_URL = "https://aiopenscale.cloud.ibm.com"

WX_AI_CREDENTIALS = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": CLOUD_API_KEY,
}
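If you prefer not to paste the API key into the notebook, one option is to read it from an environment variable. The sketch below assumes a hypothetical variable name, `IBM_CLOUD_API_KEY`, and a hypothetical helper, `read_api_key`; neither is part of the IBM SDKs.

```python
import os


def read_api_key(var_name: str = "IBM_CLOUD_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice made for this sketch.
    """
    api_key = os.environ.get(var_name, "").strip()
    # Reject an unset variable and the unedited notebook placeholder
    if not api_key or api_key == "<EDIT THIS>":
        raise ValueError(
            f"Set the {var_name} environment variable to your IBM Cloud API key."
        )
    return api_key
```

With this helper, the credentials cell could use `CLOUD_API_KEY = read_api_key()` instead of a hard-coded string.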

### Verify client version

In [3]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

service_instance_id = None  # Update this to refer to a particular service instance
authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url=IAM_URL)
wos_client = APIClient(
    authenticator=authenticator,
    service_url=SERVICE_URL,
    service_instance_id=service_instance_id,
)
print(wos_client.version)

3.0.41.12


In [4]:
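# Alias the import so it does not shadow the APIClient from ibm_watson_openscale
from ibm_watsonx_ai import APIClient as WatsonxAIClient

watsonx_ai_client = WatsonxAIClient(WX_AI_CREDENTIALS)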
print(watsonx_ai_client.version)

1.1.11


## Step 2 - Read data<a id="data"></a>

In [5]:
# Download rag data
!rm -f rag_state_union.csv
!wget https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/rag_state_union.csv

--2024-12-17 11:06:34--  https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/rag_state_union.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20259 (20K) [text/plain]
Saving to: ‘rag_state_union.csv’


2024-12-17 11:06:34 (19.4 MB/s) - ‘rag_state_union.csv’ saved [20259/20259]



In [6]:
import pandas as pd
import json

data = pd.read_csv("rag_state_union.csv")

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   context1  5 non-null      object
 1   context2  5 non-null      object
 2   context3  5 non-null      object
 3   context4  5 non-null      object
 4   question  5 non-null      object
 5   answer    5 non-null      object
dtypes: object(6)
memory usage: 372.0+ bytes


In [8]:
# Edit below values based on the input data
context_columns = ["context1", "context2", "context3", "context4"]
question_column = "question"
answer_column = "answer"
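Because these names are edited by hand, it can help to confirm that they exist in the data before deploying anything. The helper below is a small sketch with a hypothetical name, `missing_columns`; it works on plain lists of column names so that it can be tried without the CSV file.

```python
from typing import Iterable, List


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


# Column names as reported by data.info() above
available = ["context1", "context2", "context3", "context4", "question", "answer"]
required = ["context1", "context2", "context3", "context4", "question", "answer"]
print(missing_columns(available, required))
```

In the notebook, `missing_columns(data.columns, context_columns + [question_column, answer_column])` would perform the same check against the loaded data frame.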

## Step 3 - Custom evaluators creation <a id="evaluator"></a>

Each custom evaluator is created as a Python function that is deployed in a custom runtime environment built on `RT24.1`.

#### Evaluator parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| `metric_type` | The name of the metric for which evaluator is created. |  | `retrieval_quality`, `faithfulness`, `answer_relevance` |
| `wx_ai_credentials` | Watsonx AI credentials. |  |  |
| `space_id` | ID of the space in which you want to create python function deployment. |  |  |
| `context_columns` [Optional]| The list of context column names in the input data frame. |  |  |
| `question_column` | The name of the question column in the input data frame. |  |  |
| `hardware_spec` [Optional]| Hardware specification for deploying the Python function. A larger hardware specification can improve metrics computation performance. | `M` | `M`, `L`, `XL` |
| `metric_parameters` [Optional]| Additional parameters specific to each metric.  |  |  |
| `func_name` [Optional]| The name of the Python function. | `<metric_type>_with_nlp` |  |
| `create_integrated_system` [Optional]| Flag for restricting creation of associated integrated system. | `True` | `True` or `False` |

##### Faithfulness parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| attributions_count [Optional]| Source attributions are computed for each sentence in the generated answer. Source attribution for a sentence is the set of sentences in the context which contributed to the LLM generating that sentence in the answer.  The attributions_count parameter specifies the number of sentences in the context which need to be identified for attributions.  E.g., if the value is set to 2, then we will find the top 2 sentences from the context as source attributions. | `3` |  |
| ngrams [Optional]| The number of sentences to be grouped from the context when computing the faithfulness score. These grouped sentences are shown in the attributions. A very high value of ngrams might lower the faithfulness scores because of dispersion of data and the inclusion of unrelated sentences in the attributions. A very low value might increase the metric computation time and produce attributions that do not capture all aspects of the answer. | `2` |  |

##### Context relevance parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| ngrams [Optional]| The number of sentences to be grouped from the context when computing context relevance score. Having a very high value of ngrams might lead to having lower context relevance scores due to dispersion of data and inclusion of unrelated sentences. | `5` |  |
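To picture what grouping sentences by `ngrams` means, here is a rough sketch. It assumes a naive regular-expression sentence splitter and overlapping (sliding) windows; the tables above do not say how the evaluators split or window the context, so treat both choices, and the function names, as illustrative assumptions.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def group_sentences(sentences: List[str], ngrams: int) -> List[str]:
    """Join each run of `ngrams` consecutive sentences (sliding window)."""
    if ngrams < 1:
        raise ValueError("ngrams must be a positive integer")
    if len(sentences) <= ngrams:
        # Fewer sentences than the window size: keep them as a single group
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i : i + ngrams])
        for i in range(len(sentences) - ngrams + 1)
    ]


# Hypothetical context text
context = "A first sentence. A second one! Is this the third? Yes."
for group in group_sentences(split_sentences(context), ngrams=2):
    print(group)
```

Under these assumptions, a larger `ngrams` yields fewer, longer groups, which is consistent with the trade-off described in the tables: long groups can pull in unrelated sentences, while short groups mean more pieces to score.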

In [9]:
space_id = "<EDIT_THIS>"  # space in which python function should be deployed

### Evaluator creation for retrieval quality

In [None]:
rq_evaluator_dtls = wos_client.llm_metrics.evaluators.add(
    metric_type="retrieval_quality",
    wx_ai_credentials=WX_AI_CREDENTIALS,
    space_id=space_id,
    question_column=question_column,
    context_columns=context_columns,
    hardware_spec="M",
    metric_parameters={
        # The metrics computed for retrieval quality are context_relevance, retrieval_precision, average_precision, reciprocal_rank, hit_rate, normalized_discounted_cumulative_gain
        # "context_relevance": {
        #     "ngrams": 5
        # },
    },
    func_name="retrieval_quality_with_nlp",
    create_integrated_system=False,
)
print(json.dumps(rq_evaluator_dtls, indent=2))
rq_func_deployment_uid = rq_evaluator_dtls["deployment_uid"]



######################################################################################

Synchronous deployment creation for id: '4fa7fac2-6cf3-4058-be02-26f91d6bb21c' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
............
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='7710830f-0c7d-4cba-ab87-cd273d066ec2'
-----------------------------------------------------------------------------------------------


{
  "deployment_uid": "7710830f-0c7d-4cba-ab87-cd273d066ec2",
  "scoring_url": "https://us-south.ml.cloud.ibm.com/ml/v4/deployments/7710830f-0c7d-4cba-ab87-cd273d066ec2/predictions?version=2021-05-01"
}


### Evaluator creation for faithfulness

In [None]:
faith_evaluator_dtls = wos_client.llm_metrics.evaluators.add(
    metric_type="faithfulness",
    wx_ai_credentials=WX_AI_CREDENTIALS,
    space_id=space_id,
    question_column=question_column,
    context_columns=context_columns,
    hardware_spec="XL",
    metric_parameters={
        # "attributions_count": 3,
        # "ngrams": 2
    },
    func_name="faithfulness_with_nlp",
    create_integrated_system=False,
)
print(json.dumps(faith_evaluator_dtls, indent=2))
faith_func_deployment_uid = faith_evaluator_dtls["deployment_uid"]



######################################################################################

Synchronous deployment creation for id: 'ebbf53bf-4d2f-4f4a-82e3-708d249d2019' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
............
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='0702d244-7964-4eb1-829d-88142a0b25ec'
-----------------------------------------------------------------------------------------------


{
  "deployment_uid": "0702d244-7964-4eb1-829d-88142a0b25ec",
  "scoring_url": "https://us-south.ml.cloud.ibm.com/ml/v4/deployments/0702d244-7964-4eb1-829d-88142a0b25ec/predictions?version=2021-05-01"
}


### Evaluator creation for answer relevance

In [None]:
ar_evaluator_dtls = wos_client.llm_metrics.evaluators.add(
    metric_type="answer_relevance",
    wx_ai_credentials=WX_AI_CREDENTIALS,
    space_id=space_id,
    question_column=question_column,
    hardware_spec="M",
    metric_parameters={
    },
    func_name="answer_relevance_with_nlp",
    create_integrated_system=False,
)
print(json.dumps(ar_evaluator_dtls, indent=2))
ar_func_deployment_uid = ar_evaluator_dtls["deployment_uid"]



######################################################################################

Synchronous deployment creation for id: '34302bd4-f278-43c4-bc98-0bb631060121' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
...........
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='c467860b-340e-443a-8941-4930dd6350fd'
-----------------------------------------------------------------------------------------------


{
  "deployment_uid": "c467860b-340e-443a-8941-4930dd6350fd",
  "scoring_url": "https://us-south.ml.cloud.ibm.com/ml/v4/deployments/c467860b-340e-443a-8941-4930dd6350fd/predictions?version=2021-05-01"
}


## Step 4 - Configure retrieval and answer quality metrics <a id="config"></a>

In [13]:
from ibm_metrics_plugin.metrics.llm.config.entities import (
    LLMTaskType,
    LLMMetricType,
)

### Scoring function for retrieval quality 

In [14]:
def rq_scoring_fn(input_data):
    """Score a data frame with the retrieval quality deployment."""
    # Send the column names and row values in the deployment scoring format
    scoring_payload = {
        "input_data": [
            {
                "fields": input_data.columns.tolist(),
                "values": input_data.values.tolist(),
            }
        ]
    }
    watsonx_ai_client.set.default_space(space_id)
    deployment_response = watsonx_ai_client.deployments.score(
        rq_func_deployment_uid, scoring_payload
    )
    # Values returned by the evaluator function
    results = deployment_response["predictions"][0]["values"]
    return pd.DataFrame({"record_level_metrics": results})

### Scoring function for faithfulness 

In [15]:
def faith_scoring_fn(input_data):
    """Score a data frame with the faithfulness deployment."""
    # Send the column names and row values in the deployment scoring format
    scoring_payload = {
        "input_data": [
            {
                "fields": input_data.columns.tolist(),
                "values": input_data.values.tolist(),
            }
        ]
    }
    watsonx_ai_client.set.default_space(space_id)
    deployment_response = watsonx_ai_client.deployments.score(
        faith_func_deployment_uid, scoring_payload
    )
    # Values returned by the evaluator function
    results = deployment_response["predictions"][0]["values"]
    return pd.DataFrame({"record_level_metrics": results})

### Scoring function for answer relevance

In [16]:
def ar_scoring_fn(input_data):
    """Score a data frame with the answer relevance deployment."""
    # Send the column names and row values in the deployment scoring format
    scoring_payload = {
        "input_data": [
            {
                "fields": input_data.columns.tolist(),
                "values": input_data.values.tolist(),
            }
        ]
    }
    watsonx_ai_client.set.default_space(space_id)
    deployment_response = watsonx_ai_client.deployments.score(
        ar_func_deployment_uid, scoring_payload
    )
    # Values returned by the evaluator function
    results = deployment_response["predictions"][0]["values"]
    return pd.DataFrame({"record_level_metrics": results})
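The three scoring functions differ only in the deployment ID they call; the payload they send and the way they read the response are identical. The sketch below isolates those two shared steps in plain Python so the shapes are easy to inspect. The helper names and the sample values are hypothetical, and the response is a hand-written stand-in with the same nesting the scoring functions read, not real service output.

```python
from typing import Any, Dict, List, Sequence


def build_scoring_payload(
    fields: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Dict[str, Any]:
    """Build the {"input_data": [{"fields": ..., "values": ...}]} payload."""
    return {
        "input_data": [
            {"fields": list(fields), "values": [list(row) for row in rows]}
        ]
    }


def extract_record_metrics(response: Dict[str, Any]) -> List[Any]:
    """Return the values of the first prediction block in the response."""
    return response["predictions"][0]["values"]


# Hypothetical one-row input
payload = build_scoring_payload(
    ["context1", "question"], [["some context text", "What is ARPA-H?"]]
)
print(payload)

# Hand-written stand-in for a deployment response (placeholder values)
stub_response = {"predictions": [{"values": ["metrics-for-record-0"]}]}
print(extract_record_metrics(stub_response))
```

In the notebook, `build_scoring_payload(input_data.columns.tolist(), input_data.values.tolist())` would produce the same payload as the scoring functions above.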

### Configure metrics parameters

#### Common parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| context_columns | The list of context column names in the input data frame. |  |  |
| question_column | The name of the question column in the input data frame. |  |  |
| answer_column | The name of the answer column in the input data frame. |  |  |
| record_level | The flag to return the record level metrics values. Set the flag under configuration to generate record level metrics for all the metrics. Set the flag under specific metric to generate record level metrics for that metric alone. | `False` | `True`, `False` |

##### Faithfulness, Retrieval quality and Answer relevance parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| sample_size [Optional]| The metrics are computed for a maximum of 50 LLM responses. To compute them for fewer responses, set sample_size to a lower number. If the input data frame has more records than the sample size, a uniform random sample is taken for computation. | `50` | Integer between 0 and 50 (50 is the maximum supported value) |
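As an illustration of what a uniform random sample capped at `sample_size` looks like, here is a minimal sketch using the standard library. It is not the sampling code used by the metrics plugin; the function name, the `seed` argument and the sorted output are assumptions made for the example.

```python
import random
from typing import List, Optional

MAX_SAMPLE_SIZE = 50  # upper limit stated in the table above


def sample_record_indices(
    total_records: int,
    sample_size: int = MAX_SAMPLE_SIZE,
    seed: Optional[int] = None,
) -> List[int]:
    """Pick record indices uniformly at random, without replacement."""
    if not 0 <= sample_size <= MAX_SAMPLE_SIZE:
        raise ValueError(f"sample_size must be between 0 and {MAX_SAMPLE_SIZE}")
    if total_records <= sample_size:
        # Nothing to sample: every record is used
        return list(range(total_records))
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_records), sample_size))


# The sample data set has 5 records, so all of them are kept
print(sample_record_indices(5))
# A hypothetical larger data set is reduced to 50 records
print(len(sample_record_indices(200, seed=0)))
```

Every record has the same chance of being selected, which is what "uniform" means here; a fixed `seed` only makes the illustration repeatable.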

In [None]:
metric_config = {
    "configuration": {
        "record_level": True,
        "context_columns": context_columns,
        "question_column": question_column,
        LLMTaskType.RAG.value: {
            LLMMetricType.RETRIEVAL_QUALITY.value: {
                "scoring_fn": rq_scoring_fn,
                "is_watson_nlp_fn": True,
                # "sample_size": 2,
            },
            LLMMetricType.FAITHFULNESS.value: {
                "scoring_fn": faith_scoring_fn,
                "is_watson_nlp_fn": True,
                # "sample_size": 2,
            },
            LLMMetricType.ANSWER_RELEVANCE.value: {
                "scoring_fn": ar_scoring_fn,
                "is_watson_nlp_fn": True,
                # "sample_size": 2,
            },
        },
    },
    # Uncomment this to activate sampling logic 
    # "total_records": len(data),
}

### Create the input and output data frames and send them as input to compute metrics

In [22]:
from copy import deepcopy

input_columns = deepcopy(context_columns)
input_columns.append(question_column)
df_input = data[input_columns].copy()
df_output = data[[answer_column]].copy()


## Step 5 - Compute the retrieval and answer quality metrics <a id="compute"></a>

In [23]:
metrics_result = wos_client.llm_metrics.compute_metrics(
    metric_config, sources=df_input, predictions=df_output
)

print(json.dumps(metrics_result, indent=2))

{
  "faithfulness": {
    "metric_value": 0.7072,
    "mean": 0.7072,
    "min": 0.0138,
    "max": 0.9951,
    "total_records": 5,
    "record_level_metrics": [
      {
        "record_id": "fb035ee8-6ebf-4d3f-b252-eca2ff63fbc4",
        "faithfulness": 0.5906,
        "faithfulness_attributions": [
          {
            "output_text": "ARPA-H is the Advanced Research Projects Agency for Health, which is an agency that aims to drive breakthroughs in cancer, Alzheimer's, diabetes, and more.",
            "faithfulness_score": 0.2397,
            "attributions": [
              {
                "faithfulness_scores": [
                  0.2397,
                  0.2165
                ],
                "feature_name": "context1",
                "feature_values": [
                  "It\u2019s based on DARPA\u2014the Defense Department project that led to the Internet, GPS, and so much more.  ARPA-H will have a singular purpose\u2014to drive breakthroughs in cancer, Alzheimer\u2019s

## Step 6 - Display the results <a id="results"></a>

In [24]:
results_df = data.copy()

# Record-level fields to leave out of the results table
excluded_keys = {"record_id", "faithfulness_attributions", "context_relevances"}

for metric_name, metric_data in metrics_result.items():
    vals = {}
    for rm in metric_data.get("record_level_metrics", []):
        for m, mv in rm.items():
            if m not in excluded_keys:
                vals.setdefault(m, []).append(mv)

    # Add one column per metric; needs one value per row of `data`, in row order
    for k, v in vals.items():
        results_df[k] = v

results_df

|  | context1 | context2 | context3 | context4 | question | answer | faithfulness | answer_relevance | context_relevance | retrieval_precision | average_precision | reciprocal_rank | hit_rate | ndcg |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | Last month, I announced our plan to supercharg... | For that purpose we’ve mobilized American grou... | If you travel 20 miles east of Columbus, Ohio,... | But cancer from prolonged exposure to burn pit... | What is ARPA-H? | ARPA-H is the Advanced Research Projects Agenc... | 0.5906 | 0.964 | 0.8103 | 0.5 | 0.8 | 1.0 | 1 | 0.9828 |
| 1 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | When we use taxpayer dollars to rebuild Americ... | It is going to transform America and put us on... | What is the investment of Ford and GM to build... | Ford is investing $11 billion to build electri... | 0.9452 | 0.9877 | 0.2005 | 0.0 | 0.0 | 0.0 | 0 | 1.0 |
| 2 | My plan will cut the cost in half for most fam... | We got more than 130 countries to agree on a g... | And unlike the $2 Trillion tax cut passed in t... | We’re going after the criminals who stole bill... | What is the proposed tax rate for corporations? | The proposed tax rate for corporations is a 15... | 0.9951 | 0.9953 | 0.1992 | 0.0 | 0.0 | 0.0 | 0 | 0.9997 |
| 3 | If you travel 20 miles east of Columbus, Ohio,... | So let’s not wait any longer. Send it to my de... | When we use taxpayer dollars to rebuild Americ... | It is going to transform America and put us on... | What is Intel going to build? | Intel is going to build a $20 billion semicond... | 0.9911 | 0.9935 | 0.8737 | 0.25 | 1.0 | 1.0 | 1 | 1.0 |
| 4 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | It is going to transform America and put us on... | Vice President Harris and I ran for office wit... | How many electric vehicle charging stations ar... | The document does not provide information on t... | 0.0138 | 0.9823 | 0.5984 | 0.0 | 0.0 | 0.0 | 0 | 0.6561 |
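Alongside the record-level values, the faithfulness entry printed in Step 5 also carries aggregate fields (`metric_value`, `mean`, `min`, `max` and `total_records`). The sketch below collects those fields into one row per metric. It assumes the other metrics expose the same aggregate keys, which the truncated output does not confirm, so any key that is absent is simply skipped; the helper name is hypothetical.

```python
from typing import Any, Dict, List

AGGREGATE_KEYS = ("metric_value", "mean", "min", "max", "total_records")


def summarize_metrics(metrics_result: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one row per metric with only the aggregate fields that are present."""
    rows = []
    for metric_name, metric_data in metrics_result.items():
        row = {"metric": metric_name}
        row.update({k: metric_data[k] for k in AGGREGATE_KEYS if k in metric_data})
        rows.append(row)
    return rows


# Aggregate values copied from the faithfulness entry in the Step 5 output
example_result = {
    "faithfulness": {
        "metric_value": 0.7072,
        "mean": 0.7072,
        "min": 0.0138,
        "max": 0.9951,
        "total_records": 5,
        "record_level_metrics": [],
    }
}
print(summarize_metrics(example_result))
```

In the notebook, `pd.DataFrame(summarize_metrics(metrics_result))` would show the aggregates as a compact table next to the record-level results.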
