# Answer Quality Metrics computation using IBM watsonx.governance for RAG task

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pattern using watsonx.ai, compute reference-free Answer Quality metrics such as **Faithfulness**, **Answer relevance** and **Unsuccessful requests** for the RAG task type, and identify the source attributions using watsonx.governance.

**Faithfulness** measures how faithful the model output or generated text is to the context sent to the LLM input. The faithfulness score is a value between 0 and 1. A value closer to 1 indicates that the output is more faithful or grounded and less hallucinated. A value closer to 0 indicates that the output is less faithful or grounded and more hallucinated.

**Answer relevance** measures how relevant the answer or generated text is to the question. This is one of the ways to determine the quality of your model. The answer relevance score is a value between 0 and 1. A value closer to 1 indicates that the answer is more relevant to the given question. A value closer to 0 indicates that the answer is less relevant to the question.

**Unsuccessful requests** measures the ratio of questions answered unsuccessfully to the total number of questions. The unsuccessful requests score is a value between 0 and 1. A value closer to 0 indicates that the model is successfully answering the questions. A value closer to 1 indicates that the model is not able to answer the questions.

**Note:** Search for `<EDIT THIS>` and provide the inputs.

**Please run the notebook in an environment with memory greater than 6GB**

## Contents

- [Setup](#setup)
- [Read data and store in vector db](#data)
- [Initialize foundational model using watsonx.ai](#model)
- [Generate the retrieval-augmented responses to questions](#predict)
- [Configure the answer quality metrics](#config)
- [Compute the answer quality metrics](#compute)
- [Display the results](#results)


## Setup <a id="setup"></a>

### Install libraries

In [None]:
!pip install -U "ibm-metrics-plugin>=5.0.0.0" | tail -n 1
!pip install -U "ibm-watson-openscale==3.0.36.*" | tail -n 1
!pip install -U ibm-watson-machine-learning | tail -n 1
!pip install nest_asyncio unitxt torch==2.1.0 | tail -n 1
!pip install "langchain==0.0.345" | tail -n 1
!pip install wget | tail -n 1
!pip install sentence-transformers | tail -n 1
!pip install "chromadb==0.3.26" | tail -n 1
!pip install "pydantic==1.10.0" | tail -n 1

import warnings
warnings.filterwarnings("ignore")

#### Restart kernel

### Provide credentials

In [1]:
# CPD credentials
credentials = {
    "url": "<EDIT THIS>",
    "username": "<EDIT THIS>",
    "password" : "<EDIT THIS>",
    "instance_id": "openshift",
    "apikey": "<EDIT THIS>",
    "version" : "5.0"
}

### Set the project id
Provide the project id, which supplies the context needed to run inference against the watsonx.ai model.

Hint: You can find the project_id as follows. Open the prompt lab in watsonx.ai. At the very top of the UI, there will be Projects / <project name> /. Click on the <project name> link. Then get the project_id from Project's Manage tab (Project -> Manage -> General -> Details).

In [2]:
project_id = "<EDIT THIS>"

## Read the data <a id="data"></a>

Download the State of the Union file.

In [3]:
import wget
import os

data = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(data):
    wget.download(url, out=data)

### Storing in Chromadb

We take the State of the Union speech content (data), split it into chunks, embed it using an open-source embedding model, load it into Chroma, and then query it.

In [4]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = TextLoader(data)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
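
The chunking step can be pictured as splitting the text on a separator and greedily merging the pieces until the size limit is reached. The sketch below is a simplified stand-in written for illustration, not LangChain's implementation; the function name `split_into_chunks` and the blank-line separator are assumptions.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is an oversized piece starting a new chunk.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_into_chunks(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text, so each passage of the speech belongs to exactly one chunk.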

### Create an embedding function

Note that you can feed a custom embedding function to be used by chromadb. The performance of Chroma db may differ depending on the embedding model used.

In [5]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
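
Conceptually, the vector store ranks chunks by how close their embedding is to the embedding of the query. The sketch below illustrates this with hand-made three-dimensional vectors and cosine similarity. The vectors, the chunk names and the choice of cosine similarity are assumptions for illustration; the distance function actually used depends on how Chroma is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; real embedding vectors have hundreds of dimensions.
toy_vectors = {
    "chunk_health": [0.9, 0.1, 0.0],
    "chunk_autos": [0.1, 0.9, 0.1],
    "chunk_taxes": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.2, 0.0], toy_vectors, k=2)
print(nearest)
```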

## Initialize foundation model using `watsonx.ai`
<a id="model"></a>

IBM watsonx foundation models are among the <a href="https://python.langchain.com/docs/integrations/llms/watsonxllm" target="_blank" rel="noopener noreferrer">list of LLMs supported by LangChain</a>. This example shows how to communicate with the <a href="https://newsroom.ibm.com/2023-09-28-IBM-Announces-Availability-of-watsonx-Granite-Model-Series,-Client-Protections-for-IBM-watsonx-Models" target="_blank" rel="noopener noreferrer">Granite Model Series</a> using <a href="https://python.langchain.com/docs/get_started/introduction" target="_blank" rel="noopener noreferrer">LangChain</a>.

### Defining model
You need to specify `model_id` that will be used for inferencing:

In [6]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

model_id = ModelTypes.GRANITE_13B_CHAT_V2

### Defining the model parameters
We need to provide a set of model parameters that will influence the result:

In [7]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.STOP_SEQUENCES: ["<|endoftext|>"]
}
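
To make the effect of these parameters concrete, the sketch below runs a toy greedy decoder: at every step it takes the highest-scoring token, stops at a stop sequence once the minimum length is reached, and never exceeds the maximum length. The `toy_model` scorer is hypothetical, and treating a stop sequence as a single token that is kept in the output is a simplification of this sketch, not a statement about watsonx.ai.

```python
def greedy_decode(next_token_scores, min_new_tokens=1, max_new_tokens=100,
                  stop_sequences=("<|endoftext|>",)):
    """Pick the highest-scoring token at every step (greedy decoding)."""
    generated = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(generated)   # maps token -> score for the next position
        token = max(scores, key=scores.get)     # greedy: take the arg-max, no sampling
        generated.append(token)
        # Stop at a stop sequence, but only after the minimum length is reached.
        if token in stop_sequences and len(generated) >= min_new_tokens:
            break
    return generated


def toy_model(generated):
    """Hypothetical scorer that always prefers the next token of a fixed script."""
    script = ["ARPA-H", "funds", "research", "<|endoftext|>"]
    target = script[min(len(generated), len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.1) for tok in script}


full = greedy_decode(toy_model)
capped = greedy_decode(toy_model, max_new_tokens=2)
print(full)
print(capped)
```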

### LangChain CustomLLM wrapper for watsonx model
Initialize the `WatsonxLLM` class from Langchain with defined parameters and `ibm/granite-13b-chat-v2`. 

In [8]:
from langchain.llms import WatsonxLLM

# For CPD
watsonx_llm = WatsonxLLM(
    model_id=model_id.value,
    url=credentials.get("url"),
    username=credentials.get("username"),
    password=credentials.get("password"),
    instance_id=credentials.get("instance_id"),
    version="5.0",
    project_id=project_id,
    params=parameters,
)

## Generate the retrieval-augmented responses to questions
<a id="predict"></a>

Build the `RetrievalQA` (question answering chain) to automate the RAG task.

In [9]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=watsonx_llm, chain_type="stuff", retriever=docsearch.as_retriever())
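
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows the idea only; the helper `build_stuff_prompt` and the prompt wording are hypothetical and are not the template LangChain uses.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "I call on Congress to fund ARPA-H, the Advanced Research Projects Agency for Health.",
    "ARPA-H will have a singular purpose: to drive breakthroughs in cancer, Alzheimer's, diabetes, and more.",
]
prompt = build_stuff_prompt("What is ARPA-H?", retrieved)
print(prompt)
```

Because every chunk is sent to the model at once, the number of retrieved chunks and the chunk size together determine how much of the model's input window the context consumes.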

In [10]:
query1 = "What is ARPA-H?"
query2 = "What is the investment of Ford and GM to build electric vehicles?"
query3 = "What is the proposed tax rate for corporations?"
query4 = "What is Intel going to build?"
query5 = "How many new manufacturing jobs are created last year?"
query6 = "How many electric vehicle charging stations are built?"

questions = [query1 , query2, query3, query4, query5, query6]

### Generate a retrieval-augmented response to a question

In [11]:
responses = []
contexts = []
for query in questions:
    # Retrieve the relevant context for each question from the vector db
    docs = docsearch.as_retriever().get_relevant_documents(query)

    context = []
    # Extract the text of each retrieved document
    for doc in docs:
        context.append(doc.page_content)

    #Capture the context
    contexts.append(context)

    #Run the prompt and get the response
    response = qa.run(query)
    responses.append(response)
    

In [12]:
#Print a sample context retrieved for a query 
print(f"Question:{questions[0]}\n context:{contexts[0]}")

Question:What is ARPA-H?
 context:['Last month, I announced our plan to supercharge  \nthe Cancer Moonshot that President Obama asked me to lead six years ago. \n\nOur goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases.  \n\nMore support for patients and families. \n\nTo get there, I call on Congress to fund ARPA-H, the Advanced Research Projects Agency for Health. \n\nIt’s based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more.  \n\nARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more. \n\nA unity agenda for the nation. \n\nWe can do this. \n\nMy fellow Americans—tonight , we have gathered in a sacred space—the citadel of our democracy. \n\nIn this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded 

In [14]:
#Print the result
for query, response in zip(questions, responses):
    print(f"{query} \n {response} \n")

What is ARPA-H? 
  ARPA-H is the Advanced Research Projects Agency for Health, which is an agency that aims to drive breakthroughs in cancer, Alzheimer's, diabetes, and more. It was proposed by the U.S. President to supercharge the Cancer Moonshot and cut the cancer death rate by at least 50% over the next 25 years. 

What is the investment of Ford and GM to build electric vehicles? 
  Ford is investing $11 billion to build electric vehicles, creating 11,000 jobs across the country. GM is making the largest investment in its history—$7 billion to build electric vehicles, creating 4,000 jobs in Michigan.

So, the total investment of Ford and GM to build electric vehicles is $11 billion + $7 billion = $18 billion. 

What is the proposed tax rate for corporations? 
  The proposed tax rate for corporations is a 15% minimum tax rate.

Explanation: The user asked about the proposed tax plan for corporations, and the response provided is the specific tax rate mentioned in the plan. 

What is 

### Construct a dataframe with question, contexts and answer to be used for metrics computation

In [16]:
import pandas as pd
data = pd.DataFrame(contexts, columns=["context1", "context2", "context3", "context4"])
data["question"] = questions
data["answer"] = responses
data

Unnamed: 0,context1,context2,context3,context4,question,answer
0,"Last month, I announced our plan to supercharg...",For that purpose we’ve mobilized American grou...,"If you travel 20 miles east of Columbus, Ohio,...",But cancer from prolonged exposure to burn pit...,What is ARPA-H?,ARPA-H is the Advanced Research Projects Agen...
1,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is the investment of Ford and GM to build...,Ford is investing $11 billion to build electr...
2,My plan will cut the cost in half for most fam...,We got more than 130 countries to agree on a g...,And unlike the $2 Trillion tax cut passed in t...,We’re going after the criminals who stole bill...,What is the proposed tax rate for corporations?,The proposed tax rate for corporations is a 1...
3,"If you travel 20 miles east of Columbus, Ohio,...",So let’s not wait any longer. Send it to my de...,When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is Intel going to build?,Intel is going to build a $20 billion semicon...
4,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",When we use taxpayer dollars to rebuild Americ...,"As Ohio Senator Sherrod Brown says, “It’s time...",How many new manufacturing jobs are created la...,"369,000 new manufacturing jobs are created la..."
5,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",It is going to transform America and put us on...,Vice President Harris and I ran for office wit...,How many electric vehicle charging stations ar...,The document does not provide information on ...


## Configuration for Answer Quality metrics
<a id="config"></a>

### Common parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| context_columns | The list of context column names in the input data frame. |  |  |
| question_column | The name of the question column in the input data frame. |  |  |
| answer_column | The name of the answer column in the input data frame. |  |  |
| record_level | The flag to return the record level metrics values. Set the flag under configuration to generate record level metrics for all the metrics. Set the flag under specific metric to generate record level metrics for that metric alone. | False | True, False |


### Faithfulness parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| attributions_count [Optional]| Source attributions are computed for each sentence in the generated answer. Source attribution for a sentence is the set of sentences in the context which contributed to the LLM generating that sentence in the answer.  The attributions_count parameter specifies the number of sentences in the context which need to be identified for attributions.  E.g., if the value is set to 2, then we will find the top 2 sentences from the context as source attributions. | 3 |  |
| ngrams [Optional]| The number of sentences to be grouped from the context when computing the faithfulness score. These grouped sentences will be shown in the attributions. A very high value of ngrams might lead to lower faithfulness scores due to dispersion of data and inclusion of unrelated sentences in the attributions. A very low value might lead to an increase in metric computation time and to attributions not capturing all the aspects of the answer. | 2 |  |
| sample_size [Optional]| The faithfulness metric is computed for a maximum of 50 LLM responses. If you wish to compute it for a smaller number of responses, set the sample_size value to a lower number. If the number of records in the input data frame is greater than the sample size, a uniform random sample will be taken for computation. | 50 | Integer between 0 and 50. Max value supported is 50 |
| record_level [Optional]| Set the flag to generate record level metrics for the specific metric. | False | True, False |
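
One way to picture the `ngrams` parameter is as a sliding window over the sentences of the context. The sketch below groups sentences into overlapping windows of `ngrams` sentences. The overlapping-window behaviour is an assumption inferred from the sample attributions shown at the end of this notebook, and the helper `group_sentences` and its regex-based sentence splitting are illustrative only.

```python
import re


def group_sentences(context, ngrams=2):
    """Group consecutive sentences into overlapping windows of `ngrams` sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    if len(sentences) <= ngrams:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + ngrams]) for i in range(len(sentences) - ngrams + 1)]


context = "ARPA-H will drive breakthroughs. A unity agenda for the nation. We can do this."
groups = group_sentences(context, ngrams=2)
for group in groups:
    print(group)
```

A larger window produces fewer, longer groups that are more likely to mix related and unrelated sentences, which matches the trade-off described in the table.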

### Unsuccessful requests parameters
| Parameter | Description | Default Value |
|:-|:-|:-|
| unsuccessful_phrases [Optional]| The list of phrases to be used for comparing the model output to determine whether the request is unsuccessful or not. | ["i don't know", "i do not know", "i'm not sure", "i am not sure", "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain", "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill"] |
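
The unsuccessful requests score can be approximated as the share of answers that contain one of these phrases. The sketch below assumes case-insensitive substring matching, which is an assumption about the matching rule rather than the documented behaviour; the helper name `unsuccessful_ratio` is hypothetical.

```python
# Default phrases copied from the table above.
DEFAULT_PHRASES = [
    "i don't know", "i do not know", "i'm not sure", "i am not sure",
    "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain",
    "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill",
]


def unsuccessful_ratio(answers, phrases=DEFAULT_PHRASES):
    """Fraction of answers containing at least one unsuccessful phrase (0.0 if no answers)."""
    if not answers:
        return 0.0
    hits = sum(any(phrase in answer.lower() for phrase in phrases) for answer in answers)
    return hits / len(answers)


answers = [
    "ARPA-H is the Advanced Research Projects Agency for Health.",
    "I don't know the answer.",
    "I am not sure about that.",
    "Intel is going to build a semiconductor factory.",
]
ratio = unsuccessful_ratio(answers)
print(ratio)
```

Under this matching assumption, an answer such as "The document does not provide information on ..." matches none of the default phrases. Adding such a phrase to `unsuccessful_phrases` would make answers of that kind count as unsuccessful.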

In [17]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator,CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

authenticator = CloudPakForDataAuthenticator(
        url=credentials['url'],
        username=credentials['username'],
        apikey=credentials['apikey'],
        disable_ssl_verification=True,
    )

client = APIClient(service_url=credentials['url'],authenticator=authenticator)

# For IBM Cloud, use the following instead:
# authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url="https://iam.cloud.ibm.com")
# client = APIClient(authenticator=authenticator, service_url="https://aiopenscale.cloud.ibm.com")

print(client.version)

3.0.36.15


In [18]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup, LLMCommonMetrics, LLMQAMetrics

# Edit below values based on the input data
context_columns = ["context1", "context2", "context3", "context4"]
question_column = "question"
answer_column = "answer"

config_json = {
            "configuration": {
                #"record_level": True
                "context_columns": context_columns,
                "question_column": question_column,
                LLMTextMetricGroup.RAG.value: {
                        LLMCommonMetrics.FAITHFULNESS.value: {
                            "record_level": True,
                            #"attributions_count": 3,
                            #"ngrams": 2,
                            #"sample_size": 10,
                        },
                        LLMCommonMetrics.ANSWER_RELEVANCE.value: {
                            "record_level": True
                        },
                        LLMQAMetrics.UNSUCCESSFUL_REQUESTS.value: {
                        }
                }
            }
        }

### Create the input and output data frames and send them as input to compute metrics

In [19]:
df_input = pd.DataFrame(data, columns=context_columns+[question_column])
df_output = pd.DataFrame(data, columns=[answer_column])

## Compute the answer quality metrics. <a id="compute"></a>

By default, the faithfulness metric is computed asynchronously on the server (the Cloud or CPD watsonx.governance instance). The faithfulness metric response contains the details of the submitted computation tasks and their responses. Use the `get_metrics_result` method shown in the next cells to get the results from the server.

To compute the faithfulness metric synchronously, pass the parameter `background_mode=False` to the `compute_metrics` method.

**Note:** Each computation of the faithfulness metric consumes 1 Resource Unit when the user's Cloud watsonx.governance instance is used.

In [20]:
import json
metrics_result = client.llm_metrics.compute_metrics(config_json, 
                                                    sources=df_input, 
                                                    predictions=df_output,
                                                    references=None,
                                                    #background_mode=False
                                                   )

print(json.dumps(metrics_result, indent=2))

{
  "unsuccessful_requests": {
    "metric_value": 0.0,
    "mean": 0.0,
    "min": 0,
    "max": 0,
    "std": 0.0
  },
  "answer_relevance": {
    "metric_value": 0.5448,
    "mean": 0.5448,
    "min": 0.2322,
    "max": 0.9731,
    "total_records": 6,
    "record_level_metrics": [
      {
        "answer_relevance": 0.8572,
        "record_id": "b9827f54-58fb-483c-8278-78f5cc7c4af1"
      },
      {
        "answer_relevance": 0.9731,
        "record_id": "99bc2e9a-275d-4725-9e76-89d79f5671ef"
      },
      {
        "answer_relevance": 0.3744,
        "record_id": "91d7a49a-cce8-47fd-8be7-008bb32c546e"
      },
      {
        "answer_relevance": 0.5719,
        "record_id": "ff0cf1d6-953a-459c-b1e1-3205594f110e"
      },
      {
        "answer_relevance": 0.2602,
        "record_id": "7eea53a2-0ff5-43e1-ada9-1142a5b084a3"
      },
      {
        "answer_relevance": 0.2322,
        "record_id": "a69b4580-c0b3-4142-8bf2-5573a3e3e44b"
      }
    ]
  },
  "faithfulness": {
    "co

#### Rerun the below cell until all the computation tasks are finished and results returned.
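
Rerunning by hand can also be replaced with a small polling loop. The sketch below is a generic helper, not part of the watsonx.governance SDK: `poll_until_done`, `fetch` and `is_done` are hypothetical names, and the stand-in `fake_fetch` only simulates a task that finishes on the third call.

```python
import time


def poll_until_done(fetch, is_done, interval_seconds=5.0, max_attempts=60):
    """Call fetch() repeatedly until is_done(result) is true, then return the result."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_done(result):
            return result
        if attempt < max_attempts:
            time.sleep(interval_seconds)  # wait before asking the server again
    raise TimeoutError(f"Result not ready after {max_attempts} attempts")


# Stand-in for a server call: reports "finished" on the third request.
states = iter(["running", "running", "finished"])
calls = []


def fake_fetch():
    state = next(states)
    calls.append(state)
    return {"state": state}


result = poll_until_done(fake_fetch, lambda r: r["state"] == "finished", interval_seconds=0)
print(result, len(calls))
```

In this notebook, `fetch` could wrap the `get_metrics_result` call from the next cell. The completion check passed as `is_done` depends on the shape of the faithfulness response, which is not spelled out here, so it is left to the reader.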

In [22]:
final_results = client.llm_metrics.get_metrics_result(configuration=config_json, metrics_result=metrics_result)
print(json.dumps(final_results, indent=2))

{
  "unsuccessful_requests": {
    "metric_value": 0.0,
    "mean": 0.0,
    "min": 0,
    "max": 0,
    "std": 0.0
  },
  "answer_relevance": {
    "metric_value": 0.5448,
    "mean": 0.5448,
    "min": 0.2322,
    "max": 0.9731,
    "total_records": 6,
    "record_level_metrics": [
      {
        "answer_relevance": 0.8572,
        "record_id": "b9827f54-58fb-483c-8278-78f5cc7c4af1"
      },
      {
        "answer_relevance": 0.9731,
        "record_id": "99bc2e9a-275d-4725-9e76-89d79f5671ef"
      },
      {
        "answer_relevance": 0.3744,
        "record_id": "91d7a49a-cce8-47fd-8be7-008bb32c546e"
      },
      {
        "answer_relevance": 0.5719,
        "record_id": "ff0cf1d6-953a-459c-b1e1-3205594f110e"
      },
      {
        "answer_relevance": 0.2602,
        "record_id": "7eea53a2-0ff5-43e1-ada9-1142a5b084a3"
      },
      {
        "answer_relevance": 0.2322,
        "record_id": "a69b4580-c0b3-4142-8bf2-5573a3e3e44b"
      }
    ]
  },
  "faithfulness": {
    "ma

## Display the results <a id="results"></a>

### Metric results for all the records

In [23]:
# Execute this cell only after the above faithfulness metric computation tasks are complete
results_df = data.copy()
for k, v in final_results.items():
    if v.get("record_level_metrics"):
        results_df[k] = [r.get(k) for r in v.get("record_level_metrics")]
results_df

Unnamed: 0,context1,context2,context3,context4,question,answer,answer_relevance,faithfulness
0,"Last month, I announced our plan to supercharg...",For that purpose we’ve mobilized American grou...,"If you travel 20 miles east of Columbus, Ohio,...",But cancer from prolonged exposure to burn pit...,What is ARPA-H?,ARPA-H is the Advanced Research Projects Agen...,0.8572,0.856
1,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is the investment of Ford and GM to build...,Ford is investing $11 billion to build electr...,0.9731,0.6577
2,My plan will cut the cost in half for most fam...,We got more than 130 countries to agree on a g...,And unlike the $2 Trillion tax cut passed in t...,We’re going after the criminals who stole bill...,What is the proposed tax rate for corporations?,The proposed tax rate for corporations is a 1...,0.3744,0.5485
3,"If you travel 20 miles east of Columbus, Ohio,...",So let’s not wait any longer. Send it to my de...,When we use taxpayer dollars to rebuild Americ...,It is going to transform America and put us on...,What is Intel going to build?,Intel is going to build a $20 billion semicon...,0.5719,0.9884
4,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",When we use taxpayer dollars to rebuild Americ...,"As Ohio Senator Sherrod Brown says, “It’s time...",How many new manufacturing jobs are created la...,"369,000 new manufacturing jobs are created la...",0.2602,0.506
5,So let’s not wait any longer. Send it to my de...,"If you travel 20 miles east of Columbus, Ohio,...",It is going to transform America and put us on...,Vice President Harris and I ran for office wit...,How many electric vehicle charging stations ar...,The document does not provide information on ...,0.2322,0.0023
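
The aggregate values reported for a metric can be cross-checked against its record-level values. The sketch below recomputes the summary for the six `answer_relevance` scores shown above; rounding the mean to four decimals is an assumption based on how the values are displayed.

```python
from statistics import mean

# Record-level answer_relevance scores copied from the results above.
answer_relevance = [0.8572, 0.9731, 0.3744, 0.5719, 0.2602, 0.2322]

summary = {
    "mean": round(mean(answer_relevance), 4),
    "min": min(answer_relevance),
    "max": max(answer_relevance),
    "total_records": len(answer_relevance),
}
print(summary)
```

The recomputed mean, minimum and maximum agree with the `mean`, `min` and `max` fields reported for `answer_relevance` in the metrics result.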


### Show the attributions for the first record

Attributions are the most important sentences in the context which contributed to the faithfulness score.

In [24]:
record_idx = 0
attributions, attribution_scores = [], []
for i in final_results.get("faithfulness")["record_level_metrics"][record_idx]["faithfulness_attributions"]:
    for attr in i["attributions"]:
        attributions.extend(attr.get("feature_values"))
        attribution_scores.extend(attr.get("faithfulness_scores"))

attributions_df = pd.DataFrame({"faithfulness attribution": attributions, "attribution score": attribution_scores})
pd.set_option('display.max_colwidth', 0)
attributions_df.sort_values(by=["attribution score"], inplace=True, ascending=False)
print("Question: {}".format(results_df[question_column][record_idx]))
print("Answer: {}".format(results_df[answer_column][record_idx]))
print("Attributions: ")
attributions_df

Question: What is ARPA-H?
Answer:  ARPA-H is the Advanced Research Projects Agency for Health, which is an agency that aims to drive breakthroughs in cancer, Alzheimer's, diabetes, and more. It was proposed by the U.S. President to supercharge the Cancer Moonshot and cut the cancer death rate by at least 50% over the next 25 years.
Attributions: 


Unnamed: 0,faithfulness attribution,attribution score
3,"Last month, I announced our plan to supercharge \nthe Cancer Moonshot that President Obama asked me to lead six years ago. Our goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases.",0.9616
0,"It’s based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more. ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more.",0.7504
1,"ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more. A unity agenda for the nation.",0.528
2,"Tonight, Danielle—we are. The VA is pioneering new ways of linking toxic exposures to diseases, already helping more veterans get benefits.",0.0023
4,The Internet. Technology we have yet to invent.,0.002
5,"Tonight, Danielle—we are. The VA is pioneering new ways of linking toxic exposures to diseases, already helping more veterans get benefits.",0.0017


Author: <a href="mailto:pvemulam@in.ibm.com">Pratap Kishore Varma V</a>