# Evaluating Answer Quality and Retrieval Quality of an Agentic RAG System

This notebook demonstrates the following:

* Create an agent with two tools:
  * One tool retrieves, for a given question, context from the EU AI Act summary document at https://artificialintelligenceact.eu/high-level-summary/
  * The other tool retrieves, for a given question, context from HR FAQs.
* Run a few questions about the EU AI Act and the HR FAQs against the agent, using the system prompt and human prompt described in the notebook.
* While doing so, collect the question, the context returned by either the EU AI Act tool or the HR FAQs tool, and the corresponding answer.
* Create a Detached Prompt Template from the combination of the system prompt and the human prompt.
* Use a watsonx.ai foundation model (`meta-llama/llama-3-3-70b-instruct` in the evaluator code of this notebook) as the LLM-as-a-Judge evaluator for the RAG metrics.
* Log and evaluate the metrics.
* Visualise the metrics in the watsonx.governance UI.

## Setup <a name="settingup"></a>

### Install the necessary packages

In [96]:
!pip install -U ibm-watson-openscale | tail -n 1
!pip install --upgrade ibm-watsonx-ai | tail -n 1
!pip install langchain | tail -n 1
!pip install langchain-ibm | tail -n 1
!pip install langchain-community | tail -n 1
!pip install ibm_watson_machine_learning | tail -n 1
!pip install chromadb | tail -n 1
!pip install tiktoken | tail -n 1
!pip install --upgrade ibm-aigov-facts-client | tail -n 1



### Restart the kernel

In [97]:
import warnings
warnings.filterwarnings("ignore")

## Imports <a name="Necessary Imports"></a>

In [98]:
# imports
import os

#from dotenv import load_dotenv
from langchain_ibm import WatsonxEmbeddings, WatsonxLLM
from langchain.vectorstores import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.prompts import PromptTemplate
from langchain.tools import tool
from langchain.tools.render import render_text_description_and_args
from langchain.agents.output_parsers import JSONAgentOutputParser
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents import AgentExecutor
from langchain.memory import ConversationBufferMemory
from langchain_core.runnables import RunnablePassthrough
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes

### Configure your credentials

In [99]:
IAM_URL = "https://iam.cloud.ibm.com"
DATAPLATFORM_URL = "https://api.dataplatform.cloud.ibm.com"
FACTSHEET_URL = "https://dataplatform.cloud.ibm.com"
SERVICE_URL = "https://aiopenscale.cloud.ibm.com"
CLOUD_API_KEY = "*********"

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": CLOUD_API_KEY,
}
CREDENTIALS = credentials
project_id = "***********"

## watsonx LLM for Prompt Generation <a name="Prompt Generation"></a>

In [100]:
llm = WatsonxLLM(
    model_id="ibm/granite-3-8b-instruct", 
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 250,
        GenParams.STOP_SEQUENCES: ["Human:", "Observation"],
    },
)

## Slate Model for Embeddings Generation <a name="Embeddings Generation"></a>

In [101]:
def get_embeddings():
    embeddings = WatsonxEmbeddings(
        model_id=EmbeddingTypes.IBM_SLATE_30M_ENG.value,
        url=credentials["url"],
        apikey=credentials["apikey"],
        project_id=project_id,
    )
    return embeddings

In [102]:
def get_text_splitter():
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=250, chunk_overlap=0
    )
    return text_splitter

## Doc URLS <a name="Doc URLS"></a>

In [103]:
hr_faq_urls = [
    'https://github.com/ChaitanyaC22/HR_Policy_Query_Resolution_with_Retrieval_Augmented_Generation_RAG/blob/main/data_files/jp-morgan-chase-code-of-conduct-policy.pdf'
]

ai_act_urls = [
    'https://artificialintelligenceact.eu/high-level-summary/'
]


## Vector Store Retriever against a given doc source<a name="Vector Store Retriever"></a>

In [104]:
def get_retriever(urls, collection_name):
    docs = [WebBaseLoader(url).load() for url in urls]
    docs_list = [item for sublist in docs for item in sublist]
    text_splitter = get_text_splitter()
    doc_splits = text_splitter.split_documents(docs_list)
    vectorstore = Chroma.from_documents(
        documents=doc_splits,
        collection_name=collection_name,
        embedding=get_embeddings(),
    )
    retriever = vectorstore.as_retriever()
    return retriever

## Retrievers for the HR FAQs and EU AI Act documents

In [105]:
hr_faqs_retriever = get_retriever(urls=hr_faq_urls, collection_name='hr_faqs')
ai_act_retriever = get_retriever(urls=ai_act_urls, collection_name='ai_act')

## Agentic AI tools: one wrapping the HR FAQs and the other wrapping the EU AI Act summary document

In [106]:

hr_faqs_context = ""
ai_act_context = ""

@tool
def get_HR_FAQs_Context(question: str):
    """Get context from Hr documents related Frequently asked questions."""
    global hr_faqs_context
    hr_faqs_context = hr_faqs_retriever.invoke(question)
    return hr_faqs_context

@tool
def get_AI_Act_Summary_Context(question: str):
    """Get context from High-level summary of the AI Act."""
    global ai_act_context
    ai_act_context = ai_act_retriever.invoke(question)
    return ai_act_context

tools = [get_AI_Act_Summary_Context, get_HR_FAQs_Context]

## Standard LangChain-based System Prompt working as a Generative AI Agent

Next, we will set up a new prompt template to ask multiple questions. This template is more complex. It is referred to as a [structured chat prompt](https://api.python.langchain.com/en/latest/agents/langchain.agents.structured_chat.base.create_structured_chat_agent.html#langchain-agents-structured-chat-base-create-structured-chat-agent) and can be used for creating agents that have multiple tools available. In our case, the tools are the two retrieval tools defined above. The structured chat prompt will be made up of a `system_prompt`, a `human_prompt` and our RAG tools.

First, we will set up the `system_prompt`. This prompt instructs the agent to print its "thought process," which involves the agent's subtasks, the tools that were used and the final output. This gives us insight into the agent's function calling. The prompt also instructs the agent to return its responses in JSON Blob format.

In [107]:
system_prompt = """Respond to the human as helpfully and accurately as possible. You have access to the following tools: {tools}
Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).
Valid "action" values: "Final Answer" or {tool_names}
Provide only ONE action per $JSON_BLOB, as shown:"
```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Follow this format:
Question: input question to answer
Thought: consider previous and subsequent steps
Action:
```
$JSON_BLOB
```
Observation: action result
... (repeat Thought/Action/Observation N times)
Thought: I know what to respond
Action:
```
{{
  "action": "Final Answer",
  "action_input": "Final response to human"
}}
Begin! Reminder to ALWAYS respond with a valid json blob of a single action.
Respond directly if appropriate. Format is Action:```$JSON_BLOB```then Observation"""

## The human prompt

In the following code, we are establishing the `human_prompt`. This prompt tells the agent to display the user input followed by the intermediate steps taken by the agent as part of the `agent_scratchpad`.

In [108]:
human_prompt = """{input}
{agent_scratchpad}
(reminder to always respond in a JSON blob)"""

Next, we establish the order of our newly defined prompts in the prompt template. We create this new template to feature the `system_prompt` followed by an optional list of messages collected in the agent's memory, if any, and finally, the `human_prompt` which includes both the human input and `agent_scratchpad`.

In [109]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", human_prompt),
    ]
)

Now, let's finalize our prompt template by adding the tool names, descriptions and arguments using a [partial prompt template](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/partial/). This allows the agent to access the information pertaining to each tool including the intended use cases and also means we can add and remove tools without altering our entire prompt template.

In [110]:
prompt = prompt.partial(
    tools=render_text_description_and_args(list(tools)),
    tool_names=", ".join([t.name for t in tools]),
)

## Set up the agent's memory and chain

An important feature of AI agents is their memory. Agents are able to store past conversations and past findings in their memory to improve the accuracy and relevance of their responses going forward. In our case, we will use LangChain's `ConversationBufferMemory()` as a means of memory storage. 

In [111]:
memory = ConversationBufferMemory()

In [112]:
prompt_input = prompt.messages[0].prompt.template + '\n\n' + prompt.messages[2].prompt.template
print(prompt_input)

Respond to the human as helpfully and accurately as possible. You have access to the following tools: {tools}
Use a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).
Valid "action" values: "Final Answer" or {tool_names}
Provide only ONE action per $JSON_BLOB, as shown:"
```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Follow this format:
Question: input question to answer
Thought: consider previous and subsequent steps
Action:
```
$JSON_BLOB
```
Observation: action result
... (repeat Thought/Action/Observation N times)
Thought: I know what to respond
Action:
```
{{
  "action": "Final Answer",
  "action_input": "Final response to human"
}}
Begin! Reminder to ALWAYS respond with a valid json blob of a single action.
Respond directly if appropriate. Format is Action:```$JSON_BLOB```then Observation

{input}
{agent_scratchpad}
(reminder to always respond in a JSON blob)


And now we can set up a chain with our agent's scratchpad, memory, prompt and the LLM. The AgentExecutor class is used to execute the agent. It takes the agent, its tools, error handling approach, verbose parameter and memory as parameters.

In [113]:
chain = (
    RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_log_to_str(x["intermediate_steps"]),
        chat_history=lambda x: memory.chat_memory.messages,
    )
    | prompt
    | llm
    | JSONAgentOutputParser()
)

agent_executor = AgentExecutor(
    agent=chain, tools=tools, handle_parsing_errors=True, verbose=True, memory=memory
)
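
The `JSONAgentOutputParser` step in the chain turns the model's text reply into an action the executor can run. As a rough illustration of that idea only, the sketch below pulls the JSON blob out of a fenced reply and returns the action name and input. It is a simplified stand-in, not LangChain's implementation (which returns `AgentAction` or `AgentFinish` objects); `parse_action`, `FENCE`, `BLOB_PATTERN` and the sample reply are hypothetical names introduced here.

```python
import json
import re

# Build the triple-backtick marker programmatically so it does not clash with this code fence.
FENCE = "`" * 3
BLOB_PATTERN = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_action(reply: str):
    """Extract (action, action_input) from a reply holding one fenced JSON blob."""
    match = BLOB_PATTERN.search(reply)
    # Fall back to treating the whole reply as JSON when no fence is present.
    blob = match.group(1) if match else reply.strip()
    data = json.loads(blob)
    return data["action"], data["action_input"]


# Hypothetical model reply in the format requested by the system prompt.
sample_reply = (
    "Thought: I should look this up.\n"
    "Action:\n"
    f"{FENCE}\n"
    '{\n  "action": "get_AI_Act_Summary_Context",\n'
    '  "action_input": "What are considered as Prohibited AI systems?"\n}\n'
    f"{FENCE}\n"
)

action, action_input = parse_action(sample_reply)
is_final = action == "Final Answer"  # an agent loop would stop when this is True
```

With `handle_parsing_errors=True`, as set on the `AgentExecutor` above, a reply that cannot be parsed is fed back to the model instead of raising an exception.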

## Utility method to construct the context as retrieved from respective vector store

In [114]:
def construct_context(relevant_context):
    context_str = ''
    for doc in relevant_context:
        context_str = context_str + doc.page_content + '\n\n'
    return context_str    

In [115]:
# Store the complete context
complete_content = []

## Generate responses with the agentic RAG system

We are now able to ask the agent questions. With both RAG tools available, the agent can decide for each question whether to retrieve context from the HR FAQs or from the EU AI Act summary.

### Question 1. Related to FAQs

In [116]:
response = agent_executor.invoke({"input": "What is compliance with law and firm policies for HR process?"})



> Entering new AgentExecutor chain...

Question: What is compliance with law and firm policies for HR process?
Thought: The user is asking about the compliance requirements for HR processes. This involves both legal compliance and adherence to the firm's policies. I need to provide information from both the AI Act summary and HR FAQs.

Action:
```
{
  "action": "get_AI_Act_Summary_Context",
  "action_input": "What are the compliance requirements for HR processes according to the AI Act?"
}
```

Observation[0m[36;1m[1;3m[Document(metadata={'language': 'en-US', 'source': 'https://artificialintelligenceact.eu/high-level-summary/', 'title': 'High-level summary of the AI Act | EU Artificial Intelligence Act'}, page_content='All GPAI model providers may demonstrate compliance with their obligations if they voluntarily adhere to a code of practice until European harmonised standards are published, compliance with which will lead to a presumption of conformity. Provi

## Capture the output for logging with watsonx.governance

In [117]:
prompt_result = {}
prompt_result['query'] = response['input']
prompt_result['context'] = construct_context(hr_faqs_context)
prompt_result['generated_text'] = response['output']
complete_content.append(prompt_result)

### Question 2. Related to EU AI Act

In [118]:
response = agent_executor.invoke(
    {"input": "What are considered as Prohibited AI systems?"}
)



> Entering new AgentExecutor chain...

```
{
  "action": "get_AI_Act_Summary_Context",
  "action_input": "What are considered as Prohibited AI systems?"
}
```
Observation[0m[36;1m[1;3m[Document(metadata={'language': 'en-US', 'source': 'https://artificialintelligenceact.eu/high-level-summary/', 'title': 'High-level summary of the AI Act | EU Artificial Intelligence Act'}, page_content='Users (deployers) of high-risk AI systems have some obligations, though less than providers (developers).\nThis applies to users located in the EU, and third country users where the AI system’s output is used in the EU.\n\nGeneral purpose AI (GPAI):\n\nAll GPAI model providers must provide technical documentation, instructions for use, comply with the Copyright Directive, and publish a summary about the content used for training.\nFree and open licence GPAI model providers only need to comply with copyright and publish the training data summary, unless they present a systemic ri

In [119]:
prompt_result = {}
prompt_result['query'] = response['input']
prompt_result['context'] = construct_context(ai_act_context)
prompt_result['generated_text'] = response['output']
complete_content.append(prompt_result)

### Question 3. Related to EU AI Act

In [120]:
response = agent_executor.invoke(
    {"input": "What is a dynamic and thriving space?"}
)



> Entering new AgentExecutor chain...

AI:
```
{
  "action": "Final Answer",
  "action_input": "A dynamic and thriving space refers to an environment that is constantly evolving, adapting, and growing. It is characterized by innovation, creativity, and continuous improvement. In the context of HR processes, this could mean a workplace that encourages employee development, fosters collaboration, and embraces change. However, the term 'dynamic and thriving space' is not explicitly defined in the AI Act or the JP Morgan Chase Code of Conduct."
}
```
Observation[0m

[1m> Finished chain.[0m


In [121]:
prompt_result = {}
prompt_result['query'] = response['input']
prompt_result['context'] = construct_context(ai_act_context)
prompt_result['generated_text'] = response['output']
complete_content.append(prompt_result)

### Question 4. Related to HR FAQs

In [122]:
response = agent_executor.invoke(
    {"input": "How is work/life balance?"}
)



> Entering new AgentExecutor chain...
```
{
  "action": "Final Answer",
  "action_input": "Work/life balance is a concept that refers to the equilibrium between an individual's professional responsibilities and personal life. It is not explicitly defined in the AI Act or the JP Morgan Chase Code of Conduct. However, JP Morgan Chase promotes work/life balance through various initiatives, such as flexible work arrangements, paid time off, and employee wellness programs. These initiatives aim to support employees in managing their professional and personal commitments effectively."
}
```
Observation[0m

[1m> Finished chain.[0m


In [123]:
prompt_result = {}
prompt_result['query'] = response['input']
prompt_result['context'] = construct_context(hr_faqs_context)
prompt_result['generated_text'] = response['output']
complete_content.append(prompt_result)
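
The capture cells above repeat the same pattern and read the retrieved documents from module-level globals. One consequence is that Questions 3 and 4, which the agent answered with a direct `Final Answer` and no tool call, are logged with whatever context an earlier question left in `ai_act_context` or `hr_faqs_context`. A minimal sketch of one way to avoid this is to reset a per-question recorder before every invocation. Everything here (`Doc`, `ContextRecorder`, `capture`, `fake_invoke`) is hypothetical and stands in for the LangChain `Document`, the tool functions and `agent_executor.invoke`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used."""
    page_content: str


@dataclass
class ContextRecorder:
    """Collects the documents retrieved during a single agent invocation."""
    docs: list = field(default_factory=list)

    def reset(self):
        self.docs = []

    def record(self, retrieved):
        # A tool would call this with its retriever output before returning it.
        self.docs.extend(retrieved)
        return retrieved


def capture(recorder, invoke, question):
    """Run one question and build the row that is logged for evaluation."""
    recorder.reset()  # drop context left over from the previous question
    response = invoke({"input": question})
    context = "".join(doc.page_content + "\n\n" for doc in recorder.docs)
    return {
        "query": response["input"],
        "context": context,
        "generated_text": response["output"],
    }


recorder = ContextRecorder()


def fake_invoke(payload):
    """Hypothetical agent: retrieves only for AI-related questions."""
    if "AI" in payload["input"]:
        recorder.record([Doc("Prohibited AI systems ...")])
    return {"input": payload["input"], "output": "stub answer"}


questions = ["What are considered as Prohibited AI systems?", "How is work/life balance?"]
rows = [capture(recorder, fake_invoke, q) for q in questions]
```

In this sketch the second row carries an empty context, which makes it visible that no retrieval took place for that question.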

In [124]:
import pandas as pd
llm_data = pd.DataFrame(complete_content)

In [125]:
llm_data['reference'] = llm_data['generated_text']

In [126]:
llm_data.head()

|  | query | context | generated_text | reference |
|:-|:-|:-|:-|:-|
| 0 | What is compliance with law and firm policies ... | HR_Policy_Query_Resolution_with_Retrieval_Augm... | Compliance with law and firm policies for HR p... | Compliance with law and firm policies for HR p... |
| 1 | What are considered as Prohibited AI systems? | Users (deployers) of high-risk AI systems have... | Prohibited AI systems, as per the AI Act, incl... | Prohibited AI systems, as per the AI Act, incl... |
| 2 | What is a dynamic and thriving space? | Users (deployers) of high-risk AI systems have... | A dynamic and thriving space refers to an envi... | A dynamic and thriving space refers to an envi... |
| 3 | How is work/life balance? | HR_Policy_Query_Resolution_with_Retrieval_Augm... | Work/life balance is a concept that refers to ... | Work/life balance is a concept that refers to ... |


# Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for the RAG task

This part of the notebook takes the Retrieval Augmented Generation (RAG) pattern built with watsonx.ai and computes reference-free Answer Quality metrics, such as **Faithfulness**, **Answer relevance** and **Unsuccessful requests**; Content Analysis metrics, such as **Coverage**, **Density** and **Abstractness**; and Retrieval Quality metrics, such as **Context Relevance**, **Retrieval Precision**, **Average Precision**, **Reciprocal Rank**, **Hit Rate** and **Normalized Discounted Cumulative Gain**, for the RAG task type. It also identifies the source attribution using watsonx.governance.

- **Faithfulness** measures how faithful the model output or generated text is to the context sent to the LLM input. The faithfulness score is a value between 0 and 1. A value closer to 1 indicates that the output is more faithful - or grounded - and less hallucinated. A value closer to 0 indicates that the output is less faithful and more hallucinated.

- **Answer relevance** measures how relevant the answer or generated text is to the question. This is one of the ways to determine the quality of your model. The answer relevance score is a value between 0 and 1. A value closer to 1 indicates that the answer is more relevant to the given question. A value closer to 0 indicates that the answer is less relevant to the question.

- **Unsuccessful requests** measures the ratio of questions answered unsuccessfully out of the total number of questions. The unsuccessful requests score is a value between 0 and 1. A value closer to 0 indicates that the model is successfully answering the questions. A value closer to 1 indicates the model is not able to answer the questions.

- **Coverage** quantifies the extent to which the model output is derived from the model input. It measures the percentage of output words that are also in the input text. The coverage score is a value between 0 and 1. A score close to 1 indicates that a higher percentage of output words are within the input text.

- **Density** quantifies how well the summary or answer in the model output can be described as a series of extractions from the model input or context. A lower density value indicates that, on average, the extractive fragments do not closely resemble verbatim extractions from the source or context text, that is, the output is more abstractive. Abstractive summarization generates new, concise sentences that convey the main ideas of the source text without copying them directly, in contrast to extractive summarization, where the summary consists mainly of sentences lifted from the source.

- **Abstractness** measures the abstractness of the model-generated text by measuring the new n-grams generated in the model output compared to the model input. The metric computes the ratio of n-grams in the generated text that do not appear in the source content or context sent to the model. The abstractness score is a value between 0 and 1. A score close to 1 indicates high abstractness in the generated text.

- **Context Relevance** assesses the degree to which the retrieved context is relevant to the question sent to the LLM. This is one of the ways to determine the quality of your retrieval system. The context relevance score is a value between 0 and 1. A value closer to 1 indicates that the context is more relevant to your question in the prompt. A value closer to 0 indicates that the context is less relevant to your question in the prompt.

- **Retrieval Precision** measures the proportion of relevant contexts among the total contexts retrieved. The retrieval precision is a value between 0 and 1. A value of 1 indicates that all the retrieved contexts are relevant. A value of 0 indicates that none of the retrieved contexts are relevant.

- **Average Precision** evaluates whether the relevant contexts are ranked near the top. It is the mean of the precision scores computed at the rank of each relevant context. The average precision is a value between 0 and 1. A value of 1 indicates that all the relevant contexts are ranked ahead of the irrelevant ones. A value of 0 indicates that none of the retrieved contexts are relevant.

- **Reciprocal Rank** is the reciprocal of the rank of the first relevant context. The retrieval reciprocal rank is a value between 0 and 1. A value of 1 indicates that the first relevant context is at the first position. A value of 0 indicates that none of the relevant contexts are retrieved.

- **Hit Rate** measures whether there is at least one relevant context among the retrieved contexts. The hit rate value is either 0 or 1. A value of 1 indicates that there is at least one relevant context. A value of 0 indicates that there is no relevant context in the retrieved contexts.

- **Normalized Discounted Cumulative Gain** (NDCG) measures the ranking quality of the retrieved contexts. The NDCG is a value between 0 and 1. A value of 1 indicates that the retrieved contexts are ranked in the correct order.
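
To make the retrieval quality definitions above concrete, the following sketch computes them from a list of binary relevance labels given in retrieved order. It uses the textbook formulas and assumes the relevance labels are already known. In watsonx.governance the relevance judgement comes from the configured evaluator, and its exact computation may differ. The function name `retrieval_metrics` is hypothetical.

```python
import math


def retrieval_metrics(relevance):
    """Ranking metrics for one question; relevance[i] is 1 if the i-th retrieved context is relevant."""
    n = len(relevance)
    ranked = list(enumerate(relevance, start=1))  # (rank, label) pairs, ranks start at 1

    # Retrieval precision: relevant contexts over all retrieved contexts.
    precision = sum(relevance) / n if n else 0.0

    # Average precision: mean of precision@k taken at each rank k that holds a relevant context.
    seen, precisions = 0, []
    for k, rel in ranked:
        if rel:
            seen += 1
            precisions.append(seen / k)
    average_precision = sum(precisions) / len(precisions) if precisions else 0.0

    # Reciprocal rank: 1 / rank of the first relevant context, 0 if there is none.
    first = next((k for k, rel in ranked if rel), None)
    reciprocal_rank = 1.0 / first if first else 0.0

    # NDCG: discounted gain of the actual order divided by that of the ideal order.
    dcg = sum(rel / math.log2(k + 1) for k, rel in ranked)
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(k + 1) for k, rel in enumerate(ideal, start=1))
    ndcg = dcg / idcg if idcg else 0.0

    return {
        "retrieval_precision": precision,
        "average_precision": average_precision,
        "reciprocal_rank": reciprocal_rank,
        "hit_rate": 1 if first else 0,
        "ndcg": ndcg,
    }


# Three retrieved contexts; only the second and third are relevant.
scores = retrieval_metrics([0, 1, 1])
```

Moving the relevant contexts to the front, as in `retrieval_metrics([1, 1, 0])`, raises average precision, reciprocal rank and NDCG to 1 while retrieval precision stays at 2/3.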

In [127]:
import json
from ibm_watsonx_ai import APIClient

wml_client = APIClient(credentials)
wml_client.version

'1.3.31'

### Function to create the access token
This function generates an IAM access token using the provided credentials. The API calls for creating and scoring prompt template assets utilize the token generated by this function.

In [128]:
import requests, json
def generate_access_token():
    """Exchange the IBM Cloud API key for an IAM bearer token."""
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json",
    }
    data = {
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": CLOUD_API_KEY,
        "response_type": "cloud_iam",
    }
    response = requests.post(IAM_URL + "/identity/token", data=data, headers=headers)
    # Fail early with a clear HTTP error instead of a KeyError on "access_token"
    response.raise_for_status()
    return response.json()["access_token"]

iam_access_token = generate_access_token()

In [129]:
test_data_path = "RAG_data.csv"
llm_data.to_csv(test_data_path)

In [130]:
llm_data

|  | query | context | generated_text | reference |
|:-|:-|:-|:-|:-|
| 0 | What is compliance with law and firm policies ... | HR_Policy_Query_Resolution_with_Retrieval_Augm... | Compliance with law and firm policies for HR p... | Compliance with law and firm policies for HR p... |
| 1 | What are considered as Prohibited AI systems? | Users (deployers) of high-risk AI systems have... | Prohibited AI systems, as per the AI Act, incl... | Prohibited AI systems, as per the AI Act, incl... |
| 2 | What is a dynamic and thriving space? | Users (deployers) of high-risk AI systems have... | A dynamic and thriving space refers to an envi... | A dynamic and thriving space refers to an envi... |
| 3 | How is work/life balance? | HR_Policy_Query_Resolution_with_Retrieval_Augm... | Work/life balance is a concept that refers to ... | Work/life balance is a concept that refers to ... |


In [131]:
from ibm_aigov_facts_client import AIGovFactsClient

facts_client = AIGovFactsClient(
    api_key=CLOUD_API_KEY,
    container_id=project_id,
    container_type="project",
    disable_tracing=True
)

### Create Detached Prompt template

Create a prompt template for a retrieval augmented generation task

In [132]:
from ibm_aigov_facts_client import DetachedPromptTemplate, PromptTemplate

detached_information = DetachedPromptTemplate(
    prompt_id="detached_prompt",
    model_id="ibm/granite-3-8b-instruct",
    model_provider="IBM",
    model_name="granite-3-8b-instruct",
    model_url="model_url",
    prompt_url="prompt_url",
    prompt_additional_info={"IBM Cloud Region": "us-east1"}
)

task_id = "retrieval_augmented_generation"
name = "Agentic RAG Testing"
description = "Agentic RAG Testing"
model_id = "ibm/granite-3-8b-instruct"

# define parameters for PromptTemplate
prompt_variables = {"context": "", "query": ""}
input_prefix = ""
output_prefix = ""

prompt_template = PromptTemplate(
    input=prompt_input,  # pass the combined prompt directly; avoids shadowing the built-in input()
    prompt_variables=prompt_variables,
    input_prefix=input_prefix,
    output_prefix=output_prefix,
)

pta_details = facts_client.assets.create_detached_prompt(
    model_id=model_id,
    task_id=task_id,
    name=name,
    description=description,
    prompt_details=prompt_template,
    detached_information=detached_information)
project_pta_id = pta_details.to_dict()["asset_id"]

2025/08/20 13:23:14 INFO : ------------------------------ Detached Prompt Creation Started ------------------------------
2025/08/20 13:23:15 INFO : The detached prompt with ID ee5a83cc-0ec9-4f0b-98f6-efd43263ad8b was created successfully in container_id ad0092bf-2578-4e0e-9e6e-36dcf20781d0.


### Configure watsonx.governance

In [133]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator, CloudPakForDataAuthenticator

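# NOTE: this star import rebinds `APIClient` to the OpenScale client,
# shadowing the ibm_watsonx_ai `APIClient` imported earlier.
from ibm_watson_openscale import *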
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

service_instance_id = None # Update this to refer to a particular service instance
authenticator = IAMAuthenticator(
    apikey=CLOUD_API_KEY,
    url=IAM_URL
)
wos_client = APIClient(
    authenticator=authenticator,
    service_url=SERVICE_URL,
    service_instance_id=service_instance_id
)
data_mart_id = wos_client.service_instance_id
print(wos_client.version)



3.0.50


### Configure LLM-as-a-Judge for answer quality and retrieval quality metrics

To compute metrics using LLM-as-a-Judge, a `generative_ai_evaluator` integrated system must be created and provided during prompt setup.

#### Create a Generative AI evaluator
The Generative AI Evaluator can be any model from watsonx.ai or a custom endpoint invoking external models.

Supported evaluator types:
* watsonx.ai (for connecting to watsonx.ai in local Cloud)
* custom (for connecting to any external models)

#### Integrated System parameters
| Parameter | Description |
|:-|:-|
| `name` | Name for the evaluator. |
| `description` | Description for the evaluator. |
| `type` | The type of integrated system. Provide `generative_ai_evaluator`. |
| `parameters` | The evaluator configuration details like `evaluator_type` and `model_id`. |
| `credentials` | The user credentials |
| `connection` [Optional]| The scoring endpoint details when the evaluator is of type `custom`. |

As an example, an evaluator using the `meta-llama/llama-3-3-70b-instruct` model from the watsonx.ai instance in the same Cloud is created below. Other watsonx.ai models that can be used include FLAN_T5_XXL, FLAN_UL2, FLAN_T5_XL and MIXTRAL_8X7B_INSTRUCT_V01.
For more details on the parameters, and to create the other supported evaluators, refer to [Generative AI evaluator templates](https://github.ibm.com/aiopenscale/notebooks/wiki/Generative-AI-Evaluator-templates#for-ibm-watsonxgovernance-in-cloud).

In [144]:
from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes

gen_ai_evaluator = wos_client.integrated_systems.add(
    name="RAG Metrics Evaluator",
    description="RAG Metrics Evaluator",
    type="generative_ai_evaluator",
    parameters={"evaluator_type": "watsonx.ai", "model_id": "meta-llama/llama-3-3-70b-instruct"},
    credentials={
        "wml_location": "cloud",
        "apikey": CLOUD_API_KEY,
    },
)

# get evaluator integrated system ID
result = gen_ai_evaluator.result._to_dict()
evaluator_id = result["metadata"]["id"]
evaluator_id

'0198c7b5-d696-7458-bef7-50316d53065f'

#### Generative AI monitor parameters

##### Generative AI evaluator parameters
The generative_ai_evaluator details can be provided at the global level under `generative_ai_quality.parameters` to use the same evaluator for all the Answer quality metrics (Faithfulness, Answer relevance, Answer similarity) and Retrieval quality metrics (Context relevance, Retrieval precision, Average precision, Reciprocal rank, Hit rate, Normalized Discounted Cumulative Gain).

The generative AI evaluator can also be specified at the metric level to use a different evaluator for each metric. The metrics under which the generative_ai_evaluator parameter can be specified are faithfulness, answer_relevance, answer_similarity and retrieval_quality. The generative_ai_evaluator at the metric level takes precedence over the generative_ai_evaluator at the global level.

| Parameter | Description | Default Value |
|:-|:-|:-|
| `enabled`| The flag to enable generative ai evaluator |  |
| `evaluator_id`| The id of the generative ai evaluator integrated system. |  |
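
The precedence rule described above can be sketched in plain Python. The dictionary layout is an assumption made for illustration: the text names `generative_ai_evaluator`, `enabled`, `evaluator_id` and the metric names, but the `metrics_configuration` key, the evaluator IDs and the helper `resolve_evaluator` are hypothetical and not taken from the watsonx.governance API.

```python
def resolve_evaluator(parameters, metric):
    """Return the evaluator ID that applies to a metric, or None if none is enabled."""
    # Assumed layout: per-metric settings live under a "metrics_configuration" key.
    metric_cfg = parameters.get("metrics_configuration", {}).get(metric, {})
    # A metric-level evaluator, when present, wins over the global one.
    evaluator = metric_cfg.get("generative_ai_evaluator") or parameters.get("generative_ai_evaluator")
    if evaluator and evaluator.get("enabled"):
        return evaluator["evaluator_id"]
    return None


# Hypothetical parameters: one global evaluator, overridden for faithfulness only.
example_parameters = {
    "generative_ai_evaluator": {"enabled": True, "evaluator_id": "global-evaluator"},
    "metrics_configuration": {
        "faithfulness": {
            "generative_ai_evaluator": {"enabled": True, "evaluator_id": "faithfulness-evaluator"}
        },
        "answer_relevance": {},
    },
}

faithfulness_evaluator = resolve_evaluator(example_parameters, "faithfulness")
answer_relevance_evaluator = resolve_evaluator(example_parameters, "answer_relevance")
```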

##### Faithfulness, Context relevance, Answer relevance and Answer similarity parameters

| Parameter | Description | Default Value |
|:-|:-|:-|
| `metric_prompt_template` [Optional]| The prompt template used to compute each metric value. Users can override the prompt template that watsonx.governance uses to compute the metric with this parameter. The prompt template should use the variables {context}, {question}, {answer}, {reference_answer} as needed; these variables are filled with the actual data when the scoring function is called. The prompt response should return the metric value in the range 1-10 for the respective metric, in one of the formats ["4", "7 star", "star: 8", "stars: 9"]. |  |
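
As an illustration of the answer formats listed in the table, the sketch below extracts a 1-10 rating from a judge reply. It is an assumption about how such replies could be read, not the parser that watsonx.governance uses; `parse_rating` is a hypothetical helper.

```python
import re

# Matches a standalone integer from 1 to 10 anywhere in the reply.
RATING_PATTERN = re.compile(r"\b(10|[1-9])\b")


def parse_rating(reply):
    """Return the first standalone 1-10 integer in the reply, or None if there is none."""
    match = RATING_PATTERN.search(reply)
    return int(match.group(1)) if match else None


# The four example formats from the parameter description.
ratings = [parse_rating(text) for text in ["4", "7 star", "star: 8", "stars: 9"]]
```

A custom `metric_prompt_template` that keeps its answer to one of these short forms avoids ambiguity when the reply contains other numbers.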

##### Unsuccessful requests parameters

| Parameter | Description | Default Value |
|:-|:-|:-|
| `unsuccessful_phrases` [Optional]| The list of phrases to be used for comparing the model output to determine whether the request is unsuccessful or not. | `["i don't know", "i do not know", "i'm not sure", "i am not sure", "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain", "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill"]` |
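
The sketch below shows how a ratio of unsuccessful requests could be derived from the default phrase list. Case-insensitive substring matching is an assumption made for illustration, and the service's actual comparison may differ; `unsuccessful_ratio` is a hypothetical helper.

```python
DEFAULT_UNSUCCESSFUL_PHRASES = [
    "i don't know", "i do not know", "i'm not sure", "i am not sure",
    "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain",
    "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill",
]


def unsuccessful_ratio(outputs, phrases=DEFAULT_UNSUCCESSFUL_PHRASES):
    """Fraction of model outputs that contain at least one unsuccessful phrase."""
    if not outputs:
        return 0.0
    flagged = sum(
        any(phrase in output.lower() for phrase in phrases) for output in outputs
    )
    return flagged / len(outputs)


# Hypothetical model outputs: one of the two signals an unsuccessful request.
ratio = unsuccessful_ratio([
    "I don't know the answer to that question.",
    "Prohibited AI systems include social scoring.",
])
```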


## Configure Answer Quality metrics & Retrieval Quality metrics
<a id="config"></a>

### Parameters

#### Common parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| context_columns | The list of context column names in the input data frame. |  |  |
| question_column | The name of the question column in the input data frame. |  |  |
| answer_column | The name of the answer column in the input data frame. |  |  |
| record_level | The flag to return the record level metrics values. Set the flag under configuration to generate record level metrics for all the metrics. Set the flag under specific metric to generate record level metrics for that metric alone. | `False` | `True`, `False` |

#### Faithfulness parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| attributions_count [Optional]| Source attributions are computed for each sentence in the generated answer. Source attribution for a sentence is the set of sentences in the context which contributed to the LLM generating that sentence in the answer.  The attributions_count parameter specifies the number of sentences in the context which need to be identified for attributions.  E.g., if the value is set to 2, then we will find the top 2 sentences from the context as source attributions. | `3` |  |
| ngrams [Optional]| The number of sentences to be grouped from the context when computing the faithfulness score. These grouped sentences will be shown in the attributions. A very high value of ngrams might lead to lower faithfulness scores due to dispersion of data and inclusion of unrelated sentences in the attributions. A very low value might increase the metric computation time and lead to attributions that do not capture all the aspects of the answer. | `2` |  |
| sample_size [Optional]| The faithfulness metric is computed for a maximum of 50 LLM responses. If you wish to compute it for a smaller number of responses, set the sample_size value to a lower number. If the number of records in the input data frame is greater than the sample size, a uniform random sample will be taken for computation. | `50` | Integer between 0 and 50. The maximum supported value is 50. |
| record_level [Optional]| Set the flag to generate record level metrics for the specific metric. | `False` | `True`, `False` |
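
The effect of `sample_size` can be pictured with a short sketch. It assumes nothing beyond what the table states (a uniform random sample, capped at 50 records); the `sample_records` helper and its `seed` argument are hypothetical and only illustrate the behaviour, not the service's implementation.

```python
import random

MAX_SAMPLE_SIZE = 50  # upper bound stated in the table above


def sample_records(records, sample_size=MAX_SAMPLE_SIZE, seed=None):
    """Return at most `sample_size` records, drawn uniformly without replacement.

    `seed` is a hypothetical convenience for reproducible runs.
    """
    if not 0 <= sample_size <= MAX_SAMPLE_SIZE:
        raise ValueError("sample_size must be an integer between 0 and 50")
    records = list(records)
    if len(records) <= sample_size:
        # Fewer records than the sample size: every record is evaluated.
        return records
    return random.Random(seed).sample(records, sample_size)


responses = [f"answer {i}" for i in range(120)]
print(len(sample_records(responses)))                  # capped at the default of 50
print(len(sample_records(responses, sample_size=10)))  # smaller sample requested
```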


#### Context relevance parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| ngrams [Optional]| The number of sentences to be grouped from the context when computing context relevance score. Having a very high value of ngrams might lead to having lower context relevance scores due to dispersion of data and inclusion of unrelated sentences. | `5` |  |
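
The sketch below illustrates what grouping context sentences by `ngrams` could look like. The notebook does not say whether the groups overlap, so the sliding window used here is an assumption, and both helper functions are hypothetical. The sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def group_sentences(context, ngrams):
    """Group consecutive sentences into windows of `ngrams` sentences.

    Assumption: overlapping (sliding) windows; the service may group differently.
    """
    if ngrams < 1:
        raise ValueError("ngrams must be at least 1")
    sentences = split_sentences(context)
    if not sentences:
        return []
    if len(sentences) <= ngrams:
        return [" ".join(sentences)]
    return [
        " ".join(sentences[i:i + ngrams])
        for i in range(len(sentences) - ngrams + 1)
    ]


context = "A one. B two. C three. D four."
for n in (1, 2, 5):
    print(n, group_sentences(context, n))
```

With a small value each group is a narrow slice of the context and there are more groups to score; with a large value each group mixes more, possibly unrelated, sentences, which matches the trade-off described in the table.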


#### Unsuccessful requests parameters
| Parameter | Description | Default Value |
|:-|:-|:-|
| unsuccessful_phrases [Optional]| The list of phrases to be used for comparing the model output to determine whether the request is unsuccessful or not. | `["i don't know", "i do not know", "i'm not sure", "i am not sure", "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain", "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill"]` |
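
As a rough illustration of how a phrase list can flag unsuccessful requests, the following sketch applies a case-insensitive substring match. The matching rule and both helper functions are assumptions made for this example; the notebook only states that the phrases are compared against the model output.

```python
# Default phrases copied from the table above.
DEFAULT_UNSUCCESSFUL_PHRASES = [
    "i don't know", "i do not know", "i'm not sure", "i am not sure",
    "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain",
    "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill",
]


def is_unsuccessful(model_output, phrases=DEFAULT_UNSUCCESSFUL_PHRASES):
    """True if any phrase occurs in the output (assumed: case-insensitive substring)."""
    text = model_output.lower()
    return any(phrase in text for phrase in phrases)


def unsuccessful_ratio(outputs, phrases=DEFAULT_UNSUCCESSFUL_PHRASES):
    """Fraction of outputs flagged as unsuccessful; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(is_unsuccessful(o, phrases) for o in outputs) / len(outputs)


answers = [
    "I don't know the answer based on the provided context.",
    "High-risk AI systems must undergo a conformity assessment.",
]
print(unsuccessful_ratio(answers))
```

Passing a custom `unsuccessful_phrases` list is useful when the system prompt instructs the model to decline with a specific sentence that is not among the defaults.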

### Configure faithfulness, answer relevance, and unsuccessful requests parameters

In [145]:
label_column = "reference"
context_fields = ["context"]
question_field = "query"
operational_space_id = "development"
problem_type = "retrieval_augmented_generation"
input_data_type = "unstructured_text"

monitors = {
    "generative_ai_quality": {
        "parameters": {
            "generative_ai_evaluator": { # global LLM as judge configuration
               "enabled": True,
               "evaluator_id": evaluator_id,
            },
            "min_sample_size": 2,
            "metrics_configuration": {
                "faithfulness": {
                    # "metric_prompt_template": "", # adding custom template
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric.
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    # "generative_ai_evaluator": {  # metric specific LLM as judge configuration
                    #     "enabled": True,
                    #     "evaluator_id": evaluator_id,
                    # },
                },
                "answer_relevance": {
                    # "metric_prompt_template": "", # adding custom template
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    # "generative_ai_evaluator": {  # metric specific LLM as judge configuration
                    #     "enabled": True,
                    #     "evaluator_id": evaluator_id,
                    # },
                },
                "rouge_score": {},
                "exact_match": {},
                "bleu": {},
                "unsuccessful_requests": {
                    # "unsuccessful_phrases": []
                },
                "hap_input_score": {},
                "hap_score": {},
                "pii": {},
                "pii_input": {},
                "retrieval_quality": {
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    # "generative_ai_evaluator": {  # metric specific LLM as judge configuration
                    #     "enabled": True,
                    #     "evaluator_id": evaluator_id,
                    # },
                    # The metrics computed for retrieval quality are context_relevance, retrieval_precision, average_precision, reciprocal_rank, hit_rate, normalized_discounted_cumulative_gain
                    # "context_relevance": {
                    #     "metric_prompt_template": "", # adding custom template
                    # }
                },
                # Answer similarity metric is supported only when LLM as judge is configured. Keep this entry only when using LLM as judge.
                "answer_similarity": {
                    # "metric_prompt_template": "", # adding custom template
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    # "generative_ai_evaluator": {  # metric specific LLM as judge configuration
                    #     "enabled": True,
                    #     "evaluator_id": evaluator_id,
                    # },
                },
            },
        }
    }
}

response = wos_client.monitor_instances.mrm.execute_prompt_setup(
    prompt_template_asset_id=project_pta_id,
    project_id=project_id,
    label_column=label_column,
    context_fields=context_fields,
    question_field=question_field,
    operational_space_id=operational_space_id,
    problem_type=problem_type,
    input_data_type=input_data_type,
    supporting_monitors=monitors,
    background_mode=False
)

result = response.result
result.to_dict()

This method will be deprecated in the next release and be replaced by wos_client.wos.execute_prompt_setup() method

Waiting for end of adding prompt setup ee5a83cc-0ec9-4f0b-98f6-efd43263ad8b

finished

---------------------------------------------------------------
 Successfully finished setting up prompt template subscription 
---------------------------------------------------------------

{'prompt_template_asset_id': 'ee5a83cc-0ec9-4f0b-98f6-efd43263ad8b',
 'project_id': 'ad0092bf-2578-4e0e-9e6e-36dcf20781d0',
 'deployment_id': 'c1fb152c-c7d2-4b8c-85e3-c0326f471ad8',
 'service_provider_id': '0198c1d2-6e80-7e5e-b63a-b5b3f7f90398',
 'subscription_id': '0198c7a6-0c9d-709a-b035-0c9a345d8c38',
 'mrm_monitor_instance_id': '0198c7a6-3559-72c5-9405-fd9266ff5b33',
 'start_time': '2025-08-20T13:40:53.834854Z',
 'end_time': '2025-08-20T13:40:59.316416Z',
 'status': {'state': 'FINISHED'}}

The following cell retrieves the prompt setup task and checks its status.

In [146]:
response = wos_client.monitor_instances.mrm.get_prompt_setup(
    prompt_template_asset_id=project_pta_id,
    project_id=project_id
)

result = response.result
result_json = result.to_dict()

if result_json["status"]["state"] == "FINISHED":
    print("Finished prompt setup. The response is {}".format(result_json))
else:
    print("Prompt setup did not finish. The response is {}".format(result_json))

This method will be deprecated in the next release and be replaced by wos_client.wos.get_prompt_setup() method
Finished prompt setup. The response is {'prompt_template_asset_id': 'ee5a83cc-0ec9-4f0b-98f6-efd43263ad8b', 'project_id': 'ad0092bf-2578-4e0e-9e6e-36dcf20781d0', 'deployment_id': 'c1fb152c-c7d2-4b8c-85e3-c0326f471ad8', 'service_provider_id': '0198c1d2-6e80-7e5e-b63a-b5b3f7f90398', 'subscription_id': '0198c7a6-0c9d-709a-b035-0c9a345d8c38', 'mrm_monitor_instance_id': '0198c7a6-3559-72c5-9405-fd9266ff5b33', 'start_time': '2025-08-20T13:40:53.834854Z', 'end_time': '2025-08-20T13:40:59.316416Z', 'status': {'state': 'FINISHED'}}


In [147]:
subscription_id = result_json["subscription_id"]
mrm_monitor_instance_id = result_json["mrm_monitor_instance_id"]

### Show all monitor instances in the development subscription
The following cell lists the monitors present in the development subscription, along with their respective statuses and other details. Please wait for all the monitors to be in an active state before proceeding further.

In [148]:
wos_client.monitor_instances.show(target_target_id=subscription_id)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|:-|:-|:-|:-|:-|:-|:-|
| 67f4a470-900f-4a28-beee-af3f9b02ed11 | active | 0198c7a6-0c9d-709a-b035-0c9a345d8c38 | subscription | mrm | 2025-08-20 13:23:36.252000+00:00 | 0198c7a6-3559-72c5-9405-fd9266ff5b33 |
| 67f4a470-900f-4a28-beee-af3f9b02ed11 | active | 0198c7a6-0c9d-709a-b035-0c9a345d8c38 | subscription | model_health | 2025-08-20 13:23:34.467000+00:00 | 0198c7a6-2c0d-70bb-96b2-3ebb91631f80 |
| 67f4a470-900f-4a28-beee-af3f9b02ed11 | active | 0198c7a6-0c9d-709a-b035-0c9a345d8c38 | subscription | generative_ai_quality | 2025-08-20 13:23:32.362000+00:00 | 0198c7a6-2624-7969-94d3-e40010ded1df |
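
Waiting for every monitor to become active can be automated with a small polling loop. The sketch below is generic on purpose: `wait_until` and `all_active` are hypothetical helpers, and the statuses are fed from a fake source because the layout of the monitor list response is not shown in this notebook. In practice, the `check` callable would fetch the current statuses from the service.

```python
import time


def all_active(statuses):
    """True when there is at least one monitor and every status is 'active'."""
    return bool(statuses) and all(status == "active" for status in statuses)


def wait_until(check, timeout=300.0, interval=5.0):
    """Call `check()` repeatedly until it returns True or `timeout` seconds pass.

    Returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Fake status feed standing in for repeated calls to the service.
status_feed = iter([
    ["preparing", "active", "active"],
    ["active", "active", "active"],
])
ready = wait_until(lambda: all_active(next(status_feed)), timeout=10, interval=0)
print("All monitors active:", ready)
```

Returning a boolean instead of raising on timeout leaves the decision to the caller, for example to print the current statuses and stop the notebook run.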


### Risk evaluations for the PTA subscription - Evaluate the prompt template subscription

For risk assessment of a `development`-type subscription, you must have an evaluation dataset. The risk assessment function takes the evaluation dataset path as a parameter when evaluating the configured metrics. If the column names in the uploaded CSV file differ from the feature columns in the subscription, you can supply a mapping JSON file that associates the CSV column names with the feature column names in the subscription.
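
A quick header check before uploading can catch column mismatches early. The sketch below is a hypothetical pre-flight helper: it reads only the CSV header and reports which expected feature columns are missing, optionally after applying a rename mapping. The mapping here is a plain Python dict used for illustration; it is not the format of the mapping JSON file accepted by the service.

```python
import csv
import io


def missing_columns(csv_text, expected, mapping=None):
    """Return the expected feature columns that the CSV header does not provide.

    `mapping` (hypothetical) maps CSV column names to subscription feature names.
    """
    mapping = mapping or {}
    header = next(csv.reader(io.StringIO(csv_text)), [])
    available = {mapping.get(name, name) for name in header}
    return [name for name in expected if name not in available]


# Feature columns configured earlier: question, context and label columns.
expected = ["query", "context", "reference"]

csv_text = "question,context,generated_text,reference\nWhat is X?,X is Y.,Y,Y\n"
print(missing_columns(csv_text, expected))                         # "query" is missing
print(missing_columns(csv_text, expected, {"question": "query"}))  # resolved by mapping
```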

**Note**: If you are running this notebook from Watson Studio, you may first need to upload your test data to Watson Studio, then run the code snippet below to download the feedback data file from the project to a local directory.

In [None]:
test_data_set_name = "data"
content_type = "multipart/form-data"
body = {}

# Preparing the test data, removing extra columns
cols_to_remove = ["uid", "doc", "title", "id"]
for col in cols_to_remove:
    if col in llm_data:
        del llm_data[col]
llm_data.to_csv(test_data_path, index=False)

response = wos_client.monitor_instances.mrm.evaluate_risk(
    monitor_instance_id=mrm_monitor_instance_id,
    test_data_set_name=test_data_set_name, 
    test_data_path=test_data_path,
    content_type=content_type,
    body=body,
    project_id=project_id,
    includes_model_output=True,
    background_mode=False
)

Waiting for risk evaluation of MRM monitor 0198c7a6-3559-72c5-9405-fd9266ff5b33

upload_in_progress

In [None]:
response = wos_client.monitor_instances.mrm.get_risk_evaluation(mrm_monitor_instance_id, project_id=project_id)
response.result.to_dict()

### Display the Model Risk metrics

Now that the measurements for the Foundation Model subscription have been calculated, you can review the Model Risk metrics generated for this subscription:

In [None]:
wos_client.monitor_instances.show_metrics(monitor_instance_id=mrm_monitor_instance_id, project_id=project_id)

In [None]:
monitor_definition_id = "generative_ai_quality"
result = wos_client.monitor_instances.list(
    data_mart_id=data_mart_id,
    monitor_definition_id=monitor_definition_id,
    target_target_id=subscription_id,
    project_id=project_id
).result
result_json = result.to_dict()
genaiquality_monitor_id = result_json["monitor_instances"][0]["metadata"]["id"]
genaiquality_monitor_id

### Display the Generative AI quality metrics

In [None]:
wos_client.monitor_instances.show_metrics(monitor_instance_id=genaiquality_monitor_id, project_id=project_id)

## Congratulations!

You have completed this notebook. You can now navigate to the prompt template asset in your watsonx.governance project / space and click on the `Evaluate` tab to visualize the results in the UI.