# Retrieval and Answer Quality Metrics computation using LLM as Judge in IBM watsonx.governance for RAG task

This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) pattern using watsonx.ai and how to compute the following metrics for the RAG task type: the reference-free Answer Quality metrics **Faithfulness** and **Answer relevance**, the reference-based **Answer similarity** metric, and the Retrieval Quality metrics **Context Relevance**, **Retrieval Precision**, **Average Precision**, **Reciprocal Rank**, **Hit Rate** and **Normalized Discounted Cumulative Gain**. It also identifies the source attribution using watsonx.governance.

- **Faithfulness** measures how faithful the model output or generated text is to the context sent to the LLM input. The faithfulness score is a value between 0 and 1. A value closer to 1 indicates that the output is more faithful - or grounded - and less hallucinated. A value closer to 0 indicates that the output is less faithful and more hallucinated.

- **Answer relevance** measures how relevant the answer or generated text is to the question. This is one of the ways to determine the quality of your model. The answer relevance score is a value between 0 and 1. A value closer to 1 indicates that the answer is more relevant to the given question. A value closer to 0 indicates that the answer is less relevant to the question.

- **Answer similarity** measures how similar the answer or generated text is to the ground truth or reference answer. This is one of the ways to determine the quality of your model. The answer similarity score is a value between 0 and 1. A value closer to 1 indicates that the answer is more similar to the reference value. A value closer to 0 indicates that the answer is less similar to the reference value.

- **Context Relevance** assesses the degree to which the retrieved context is relevant to the question sent to the LLM. This is one of the ways to determine the quality of your retrieval system. The context relevance score is a value between 0 and 1. A value closer to 1 indicates that the context is more relevant to your question in the prompt. A value closer to 0 indicates that the context is less relevant to your question in the prompt.

- **Retrieval Precision** measures the proportion of relevant contexts among all the contexts retrieved. The retrieval precision is a value between 0 and 1. A value of 1 indicates that all the retrieved contexts are relevant. A value of 0 indicates that none of the retrieved contexts are relevant.

- **Average Precision** evaluates whether the relevant contexts are ranked above the non-relevant ones. It is the mean of the precision scores computed at the rank of each relevant context. The average precision is a value between 0 and 1. A value of 1 indicates that all the relevant contexts are ranked at the top. A value of 0 indicates that none of the retrieved contexts are relevant.

- **Reciprocal Rank** is the reciprocal of the rank of the first relevant context. The reciprocal rank is a value between 0 and 1. A value of 1 indicates that the first relevant context is in the first position. A value of 0 indicates that none of the relevant contexts are retrieved.

- **Hit Rate** measures whether there is at least one relevant context among the retrieved contexts. The hit rate value is either 0 or 1. A value of 1 indicates that there is at least one relevant context. A value of 0 indicates that there is no relevant context in the retrieved contexts.

- **Normalized Discounted Cumulative Gain** (NDCG) measures the ranking quality of the retrieved contexts. The NDCG is a value between 0 and 1. A value of 1 indicates that the retrieved contexts are ranked in the correct order.
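The ranking metrics above can be illustrated with a small worked example. The sketch below uses the standard textbook formulas with binary relevance flags (1 = relevant, 0 = not relevant), listed in retrieval order. It is not the watsonx.governance implementation, which derives relevance from the judge model's scores; the function names and the handling of edge cases (for example, returning 0 when no context is relevant) are assumptions made for illustration.

```python
import math

def retrieval_precision(relevance):
    """Fraction of retrieved contexts that are relevant."""
    return sum(relevance) / len(relevance) if relevance else 0.0

def average_precision(relevance):
    """Mean of precision@rank taken at the rank of each relevant context."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(relevance):
    """1 / rank of the first relevant context, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def hit_rate(relevance):
    """1 if at least one retrieved context is relevant, else 0."""
    return 1 if any(relevance) else 0

def dcg(relevance):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))

def ndcg(relevance):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0

# Four retrieved contexts; the 2nd and 4th are relevant.
relevance = [0, 1, 0, 1]
scores = {
    "retrieval_precision": retrieval_precision(relevance),
    "average_precision": average_precision(relevance),
    "reciprocal_rank": reciprocal_rank(relevance),
    "hit_rate": hit_rate(relevance),
    "ndcg": round(ndcg(relevance), 4),
}
print(scores)
```

Here two of four contexts are relevant, so the retrieval precision is 0.5; the first relevant context sits at rank 2, so the reciprocal rank is also 0.5.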

## Learning goals

- Read data into a vector database
- Initialize foundation model
- Generate RAG responses
- Configure and compute Retrieval and Answer Quality metrics

## Contents

- [Step 1 - Setup](#setup)
- [Step 2 - Read data and store in vector db](#data)
- [Step 3 - Initialize foundation model using watsonx.ai](#model)
- [Step 4 - Generate the retrieval-augmented responses to questions](#predict)
- [Step 5 - Configure the retrieval and answer quality metrics](#config)
- [Step 6 - Compute the retrieval and answer quality metrics](#compute)
- [Step 7 - Display the results](#results)

## Step 1 - Setup <a id="setup"></a>

### Install necessary libraries

In [None]:
!pip install -U "ibm-metrics-plugin~=3.0.9" | tail -n 1
!pip install -U ibm-watson-openscale | tail -n 1
!pip install -U ibm-watson-machine-learning | tail -n 1
!pip install "langchain==0.0.345" | tail -n 1
!pip install wget | tail -n 1
!pip install sentence-transformers | tail -n 1
!pip install "chromadb==0.3.26" | tail -n 1
!pip install "pydantic==1.10.0" | tail -n 1

import warnings
warnings.filterwarnings("ignore")

**Note**: you may need to restart the kernel to use updated libraries.

### Configure your credentials

In [9]:
# Cloud credentials
IAM_URL = "https://iam.cloud.ibm.com"
DATAPLATFORM_URL = "https://api.dataplatform.cloud.ibm.com"
SERVICE_URL = "https://aiopenscale.cloud.ibm.com"
CLOUD_API_KEY = "<EDIT THIS>" # YOUR_CLOUD_API_KEY

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": CLOUD_API_KEY
}
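To keep the API key out of the notebook itself, one option is to read it from an environment variable. This is a minimal sketch, not part of the original notebook: the helper `load_api_key` and the variable name `WATSONX_APIKEY` are hypothetical, and the placeholder is returned when the variable is not set.

```python
import os

def load_api_key(env_var="WATSONX_APIKEY", placeholder="<EDIT THIS>"):
    """Return the API key from the environment, or the placeholder if unset."""
    # WATSONX_APIKEY is a hypothetical variable name chosen for this sketch.
    return os.environ.get(env_var, placeholder)

CLOUD_API_KEY = load_api_key()

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": CLOUD_API_KEY,
}
```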

### Configure your project id
Provide the project id, which supplies the context needed to run inference against the watsonx.ai model.

***Hint***: You can find the `project_id` as follows. Open the prompt lab in watsonx.ai. At the very top of the UI, there will be "Projects / *project name* /". Click on the "*project name*" link, then get the `project_id` from the project's "Manage" tab ("Project -> Manage -> General -> Details").

In [2]:
project_id = "<EDIT THIS>"

## Step 2 - Read and store data in a vector database <a id="data"></a>

### Read the data

Download the sample "State of the Union" file.

In [3]:
import wget
import os

data = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(data):
    wget.download(url, out=data)

### Prepare the data for the vector database

Take the `state_of_the_union.txt` speech content data and split it into chunks. 

In [4]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = TextLoader(data)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
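To make the chunking step concrete, the sketch below shows the general idea behind a separator-based splitter: split the text on a separator, then greedily merge the pieces until the next one would exceed the chunk size. It is a simplified illustration under that assumption, not LangChain's `CharacterTextSplitter` implementation; `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this piece would overflow the chunk, so start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample_text = "aaaa\n\nbbbb\n\ncccc"
sample_chunks = split_into_chunks(sample_text, chunk_size=10)
print(sample_chunks)
```

With `chunk_size=10`, the first two paragraphs fit together in one chunk and the third starts a new one.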

### Create an embedding function to store the data in a vector database

Embed the chunked data using an open-source embedding model and load it into Chromadb, a vector database.

**Note**: You can also provide a custom embedding function to be used by Chromadb; the performance of Chromadb may differ depending on the embedding model used.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
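Conceptually, a vector store answers a query by embedding it and returning the stored chunks whose vectors are closest to the query vector. The sketch below illustrates this with cosine similarity over toy two-dimensional vectors. The vectors and helper names are made up for illustration, and the distance function that Chroma actually uses is configurable and may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]

# Toy embeddings standing in for the output of an embedding model.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]
nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```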

## Step 3 - Initialize a foundation model using `watsonx.ai`
<a id="model"></a>

IBM watsonx foundation models are among the <a href="https://python.langchain.com/docs/integrations/llms/watsonxllm" target="_blank" rel="noopener no referrer">list of LLM models supported by Langchain</a>. This example shows how to communicate with <a href="https://newsroom.ibm.com/2023-09-28-IBM-Announces-Availability-of-watsonx-Granite-Model-Series,-Client-Protections-for-IBM-watsonx-Models" target="_blank" rel="noopener no referrer">Granite Model Series</a> using <a href="https://python.langchain.com/docs/get_started/introduction" target="_blank" rel="noopener no referrer">Langchain</a>.

### Define a model
Specify a `model_id` that will be used for inferencing:

In [6]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

model_id = ModelTypes.GRANITE_13B_CHAT_V2

### Define the model parameters
Provide a set of model parameters that will influence the result:

In [7]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.STOP_SEQUENCES: ["<|endoftext|>"]
}

### Set LangChain custom LLM wrapper for watsonx model
Initialize the `WatsonxLLM` class from LangChain with the parameters defined above, using `ibm/granite-13b-chat-v2`.

In [10]:
from langchain.llms import WatsonxLLM

watsonx_llm = WatsonxLLM(
    model_id=model_id.value,
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params=parameters
)

## Step 4 - Generate retrieval-augmented responses to questions
<a id="predict"></a>

### Build a `RetrievalQA` (question answering chain) to automate the RAG task

In [11]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=watsonx_llm, chain_type="stuff", retriever=docsearch.as_retriever())
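With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows that idea in plain Python. The helper `build_stuff_prompt` and the template wording are illustrative assumptions, not necessarily the prompt that LangChain sends.

```python
def build_stuff_prompt(question, contexts):
    """Concatenate all retrieved contexts and the question into one prompt."""
    context_block = "\n\n".join(contexts)
    # The template wording below is an assumption made for illustration.
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

example_prompt = build_stuff_prompt(
    "What is ARPA-H?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(example_prompt)
```

Because every chunk goes into one prompt, the combined length of the retrieved contexts has to fit within the model's context window.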

In [12]:
query1 = "What is ARPA-H?"
query2 = "What is the investment of Ford and GM to build electric vehicles?"
query3 = "What is the proposed tax rate for corporations?"
query4 = "What is Intel going to build?"
query5 = "How many new manufacturing jobs are created last year?"
query6 = "How many electric vehicle charging stations are built?"

questions = [query1, query2, query3, query4, query5, query6]

### Generate retrieval-augmented responses to the questions

In [13]:
responses = []
contexts = []
retriever = docsearch.as_retriever()
for query in questions:
    # Retrieve the relevant context for each question from the vector db
    docs = retriever.get_relevant_documents(query)

    # Keep only the text of each retrieved document
    context = [doc.page_content for doc in docs]

    # Capture the context
    contexts.append(context)

    # Run the prompt and get the response
    response = qa.run(query)
    responses.append(response)

In [14]:
#Print a sample context retrieved for a query 
print(f"Question:{questions[0]}\n context:{contexts[0]}")

Question:What is ARPA-H?
 context:['Last month, I announced our plan to supercharge  \nthe Cancer Moonshot that President Obama asked me to lead six years ago. \n\nOur goal is to cut the cancer death rate by at least 50% over the next 25 years, turn more cancers from death sentences into treatable diseases.  \n\nMore support for patients and families. \n\nTo get there, I call on Congress to fund ARPA-H, the Advanced Research Projects Agency for Health. \n\nIt’s based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more.  \n\nARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more. \n\nA unity agenda for the nation. \n\nWe can do this. \n\nMy fellow Americans—tonight , we have gathered in a sacred space—the citadel of our democracy. \n\nIn this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded 

In [15]:
# Print the result
for query, response in zip(questions, responses):
    print(f"{query} \n {response} \n")

What is ARPA-H? 
  ARPA-H is the Advanced Research Projects Agency for Health, which is an agency that aims to drive breakthroughs in cancer, Alzheimer's, diabetes, and more. It was proposed by the U.S. President to supercharge the Cancer Moonshot and cut the cancer death rate by at least 50% over the next 25 years. 

What is the investment of Ford and GM to build electric vehicles? 
  Ford is investing $11 billion to build electric vehicles, creating 11,000 jobs across the country. GM is making the largest investment in its history—$7 billion to build electric vehicles, creating 4,000 jobs in Michigan.

So, the total investment of Ford and GM to build electric vehicles is $11 billion + $7 billion = $18 billion. 

What is the proposed tax rate for corporations? 
  The proposed tax rate for corporations is a 15% minimum tax rate.

Explanation: The user asked about the proposed tax plan for corporations, and the response provided is the specific tax rate mentioned in the plan. 

What is 

### Construct a dataframe with question, contexts and answer to be used for metrics computation

In [16]:
import pandas as pd
data = pd.DataFrame(contexts, columns=["context1", "context2", "context3", "context4"])
data["question"] = questions
data["answer"] = responses
data

| | context1 | context2 | context3 | context4 | question | answer |
|:-|:-|:-|:-|:-|:-|:-|
| 0 | Last month, I announced our plan to supercharg... | For that purpose we’ve mobilized American grou... | If you travel 20 miles east of Columbus, Ohio,... | But cancer from prolonged exposure to burn pit... | What is ARPA-H? | ARPA-H is the Advanced Research Projects Agen... |
| 1 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | When we use taxpayer dollars to rebuild Americ... | It is going to transform America and put us on... | What is the investment of Ford and GM to build... | Ford is investing $11 billion to build electr... |
| 2 | My plan will cut the cost in half for most fam... | We got more than 130 countries to agree on a g... | And unlike the $2 Trillion tax cut passed in t... | We’re going after the criminals who stole bill... | What is the proposed tax rate for corporations? | The proposed tax rate for corporations is a 1... |
| 3 | If you travel 20 miles east of Columbus, Ohio,... | So let’s not wait any longer. Send it to my de... | When we use taxpayer dollars to rebuild Americ... | It is going to transform America and put us on... | What is Intel going to build? | Intel is going to build a $20 billion semicon... |
| 4 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | When we use taxpayer dollars to rebuild Americ... | As Ohio Senator Sherrod Brown says, “It’s time... | How many new manufacturing jobs are created la... | 369,000 new manufacturing jobs are created la... |
| 5 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | It is going to transform America and put us on... | Vice President Harris and I ran for office wit... | How many electric vehicle charging stations ar... | The document does not provide information on ... |


## Step 5 - Configure Retrieval and Answer Quality metrics
<a id="config"></a>

### Parameters

#### Common parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| context_columns | The list of context column names in the input data frame. |  |  |
| question_column | The name of the question column in the input data frame. |  |  |
| answer_column | The name of the answer column in the input data frame. |  |  |
| record_level [Optional] | The flag to return record-level metric values. Set the flag under `configuration` to generate record-level values for all metrics. Set it under a specific metric to generate record-level values for that metric alone. | False | True, False |
| scoring_fn | The scoring function that takes the prompts as an input data frame, scores them with the LLM acting as judge, and returns the output as a data frame. The input data frame has a single column, "prompt". The output data frame can have a single column; if it has multiple columns, the model output text must be in the "generated_text" column. | | |

#### Metric parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| record_level [Optional] | The flag to return record-level metric values. Setting the flag under a specific metric overrides the value provided at the configuration level. | False | True, False |
| metric_prompt_template [Optional] | The prompt template used to compute the metric value. Use this parameter to override the prompt template that watsonx.governance uses to compute the metric. The template should use the variables {context}, {question}, {answer} and {reference_answer} as needed; these are filled with the actual data when the scoring function is called. The prompt response should return the metric value in the range 1-10 for the respective metric, in one of the formats ["4", "7 star", "star: 8", "stars: 9"]. | | |
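As an illustration of the `metric_prompt_template` contract described above, the sketch below fills a custom template with the {context}, {question} and {answer} variables and extracts a 1-10 rating from the four accepted response formats. The template wording and the `parse_rating` helper are hypothetical; how watsonx.governance itself parses and scales the judge's response is not shown here.

```python
import re

# Hypothetical template; only the variable names come from the parameter description.
faithfulness_template = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "On a scale of 1 to 10, how well is the answer supported by the context? "
    "Reply with the number only."
)

def parse_rating(text):
    """Extract the first integer from a judge response; None if absent or outside 1-10."""
    match = re.search(r"\d+", text)
    if not match:
        return None
    value = int(match.group())
    return value if 1 <= value <= 10 else None

filled_prompt = faithfulness_template.format(
    context="ARPA-H is the Advanced Research Projects Agency for Health.",
    question="What is ARPA-H?",
    answer="It is the Advanced Research Projects Agency for Health.",
)

ratings = [parse_rating(r) for r in ["4", "7 star", "star: 8", "stars: 9"]]
print(ratings)
```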

### Verify client version

In [18]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

# Reuse the URLs defined in the credentials cell
authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url=IAM_URL)
client = APIClient(authenticator=authenticator, service_url=SERVICE_URL)

print(client.version)

3.0.37


### Define the scoring function that invokes the LLM acting as judge while computing the metrics

The scoring function is implemented using a model hosted on watsonx.ai on IBM Cloud. FLAN_T5_XXL is used as the judge here. Other watsonx.ai models that can be used are FLAN_UL2, FLAN_T5_XL and MIXTRAL_8X7B_INSTRUCT_V01_Q.

The function can be changed as needed to invoke external models as well. The retrieval and answer quality metric values can vary with the model used as the judge.

In [19]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes
import pandas as pd

generate_params = {
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.MIN_NEW_TOKENS: 10,
    GenParams.TEMPERATURE: 0.0
}

model = Model(
    model_id=ModelTypes.FLAN_T5_XXL,
    params=generate_params,
    credentials={
        "apikey": credentials.get("apikey"),
        "url": credentials.get("url")
    },
    project_id=project_id
)

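def scoring_fn(data):
    # The first column of the input data frame holds the judge prompts
    results = []

    for prompt_text in data.iloc[:, 0].values.tolist():
        model_response = model.generate_text(prompt=prompt_text)
        results.append(model_response)
    # Return the judge responses in the "generated_text" column
    return pd.DataFrame({"generated_text": results})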
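Before spending model calls, it can help to check the scoring function contract with a stub that needs no credentials. The sketch below assumes only what the parameter table states: the input data frame has the prompts in its first column and the output carries the text in a "generated_text" column. `stub_scoring_fn` and its constant reply are hypothetical, so any metric values computed with it are meaningless.

```python
import pandas as pd

def stub_scoring_fn(data):
    """Stand-in judge that answers every prompt with the same rating."""
    prompts = data.iloc[:, 0].tolist()
    # A constant "5" keeps the output in one of the accepted rating formats.
    return pd.DataFrame({"generated_text": ["5" for _ in prompts]})

sample_prompts = pd.DataFrame({"prompt": ["Rate this answer.", "Rate that answer."]})
stub_output = stub_scoring_fn(sample_prompts)
print(stub_output)
```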

### Configure context relevance, faithfulness, answer relevance and answer similarity parameters

In [20]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup, LLMCommonMetrics, LLMRAGMetrics, RetrievalQualityMetric

# Edit below values based on the input data
context_columns = ["context1", "context2", "context3", "context4"]
question_column = "question"
answer_column = "answer"

config_json = {
    "configuration": {
        "context_columns": context_columns,
        "question_column": question_column,
        "scoring_fn": scoring_fn,
        "record_level": True,
        LLMTextMetricGroup.RAG.value: {
            LLMRAGMetrics.RETRIEVAL_QUALITY.value: {
                # RetrievalQualityMetric.CONTEXT_RELEVANCE.value: {
                #     "record_level": True,
                #     "metric_prompt_template": ""
                # },
                # RetrievalQualityMetric.RETRIEVAL_PRECISION.value: {},
                # RetrievalQualityMetric.AVERAGE_PRECISION.value: {},
                # RetrievalQualityMetric.RECIPROCAL_RANK.value: {},
                # RetrievalQualityMetric.HIT_RATE.value: {},
                # RetrievalQualityMetric.NDCG.value: {}
            },
            LLMCommonMetrics.FAITHFULNESS.value: {
                # "record_level": True,
                # "metric_prompt_template": ""
            },
            LLMCommonMetrics.ANSWER_RELEVANCE.value: {
                # "record_level": True,
                # "metric_prompt_template": ""
            },
            LLMCommonMetrics.ANSWER_SIMILARITY.value: {
                # "record_level": True,
                # "metric_prompt_template": ""
            }
        }
    }
}

### Create the input, output and reference data frames and send them as input to compute metrics

In [21]:
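df_input = pd.DataFrame(data, columns=context_columns+[question_column])
df_output = pd.DataFrame(data, columns=[answer_column])
# This sample has no separate ground-truth column, so the generated answers are
# reused as the reference. Supply real reference answers for a meaningful
# answer similarity score.
df_reference = pd.DataFrame(data, columns=[answer_column])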

## Step 6 - Compute Retrieval and Answer Quality metrics <a id="compute"></a>

In [None]:
import json
metrics_result = client.llm_metrics.compute_metrics(config_json, 
                                                    sources=df_input, 
                                                    predictions=df_output,
                                                    references=df_reference)

print(json.dumps(metrics_result, indent=2))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


{
  "answer_relevance": {
    "metric_value": 0.7667,
    "mean": 0.7667,
    "min": 0.6,
    "max": 1.0,
    "total_records": 6,
    "record_level_metrics": [
      {
        "answer_relevance": 0.8,
        "record_id": "98ff8481-3330-4e85-ba97-904084e7c6b9"
      },
      {
        "answer_relevance": 0.8,
        "record_id": "5e75f578-7d83-46c3-a028-79d3b5ba7f7a"
      },
      {
        "answer_relevance": 0.8,
        "record_id": "b5a19265-6ae1-4c86-a03d-43be150190a8"
      },
      {
        "answer_relevance": 1.0,
        "record_id": "75b70936-72f8-4359-9ab9-c7023b5ab54d"
      },
      {
        "answer_relevance": 0.6,
        "record_id": "5b688272-baaf-40a2-a273-1833aaaba1d9"
      },
      {
        "answer_relevance": 0.6,
        "record_id": "5dafe740-5635-473a-9d7f-1bfc08fd1ca7"
      }
    ]
  },
  "answer_similarity": {
    "metric_value": 0.8,
    "mean": 0.8,
    "min": 0.6,
    "max": 1.0,
    "total_records": 6,
    "record_level_metrics": [
      {
        "

## Step 7 - Display the results <a id="results"></a>

### Get metric results for all records

In [None]:
results_df = data.copy()

# Keys in a record-level entry that are not per-record metric scores
excluded_keys = {"record_id", "faithfulness_attributions", "context_relevances"}

for metric_name, metric_data in metrics_result.items():
    vals = {}
    if "record_level_metrics" in metric_data:
        for record in metric_data["record_level_metrics"]:
            for key, value in record.items():
                if key not in excluded_keys:
                    vals.setdefault(key, []).append(value)

    # Add one column per metric, with one value per record
    for column, values in vals.items():
        results_df[column] = values

results_df

| | context1 | context2 | context3 | context4 | question | answer | answer_relevance | answer_similarity | faithfulness | context_relevance | retrieval_precision | average_precision | reciprocal_rank | hit_rate | ndcg |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | Last month, I announced our plan to supercharg... | For that purpose we’ve mobilized American grou... | If you travel 20 miles east of Columbus, Ohio,... | But cancer from prolonged exposure to burn pit... | What is ARPA-H? | ARPA-H is the Advanced Research Projects Agen... | 0.8 | 1.0 | 1.0 | 0.2 | 0.0 | 0 | 0 | 0 | 1.0 |
| 1 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | When we use taxpayer dollars to rebuild Americ... | It is going to transform America and put us on... | What is the investment of Ford and GM to build... | Ford is investing $11 billion to build electr... | 0.8 | 0.6 | 0.6 | 0.2 | 0.0 | 0 | 0 | 0 | 1.0 |
| 2 | My plan will cut the cost in half for most fam... | We got more than 130 countries to agree on a g... | And unlike the $2 Trillion tax cut passed in t... | We’re going after the criminals who stole bill... | What is the proposed tax rate for corporations? | The proposed tax rate for corporations is a 1... | 0.8 | 0.6 | 1.0 | 0.2 | 0.0 | 0 | 0 | 0 | 1.0 |
| 3 | If you travel 20 miles east of Columbus, Ohio,... | So let’s not wait any longer. Send it to my de... | When we use taxpayer dollars to rebuild Americ... | It is going to transform America and put us on... | What is Intel going to build? | Intel is going to build a $20 billion semicon... | 1.0 | 1.0 | 1.0 | 0.2 | 0.0 | 0 | 0 | 0 | 1.0 |
| 4 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | When we use taxpayer dollars to rebuild Americ... | As Ohio Senator Sherrod Brown says, “It’s time... | How many new manufacturing jobs are created la... | 369,000 new manufacturing jobs are created la... | 0.6 | 0.6 | 0.6 | 0.2 | 0.0 | 0 | 0 | 0 | 1.0 |
| 5 | So let’s not wait any longer. Send it to my de... | If you travel 20 miles east of Columbus, Ohio,... | It is going to transform America and put us on... | Vice President Harris and I ran for office wit... | How many electric vehicle charging stations ar... | The document does not provide information on ... | 0.6 | 1.0 | 0.6 | 0.2 | 0.0 | 0 | 0 | 0 | 1.0 |
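One way to act on the record-level values is to flag the records that fall below a chosen threshold for manual review. The sketch below applies this to the faithfulness values shown in the results above. The helper `flag_low_scores` and the 0.7 threshold are assumptions for illustration, not recommended settings.

```python
def flag_low_scores(values, threshold=0.7):
    """Return the indices of records whose score is below the threshold."""
    return [idx for idx, value in enumerate(values) if value < threshold]

# Faithfulness values for records 0-5, copied from the results above.
faithfulness = [1.0, 0.6, 1.0, 1.0, 0.6, 0.6]
flagged = flag_low_scores(faithfulness)
print(flagged)  # [1, 4, 5]
```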


Author: <a href="mailto:pvemulam@in.ibm.com">Pratap Kishore Varma V</a>

Copyright © 2024 IBM.