# Source attribution detection for RAG-based natural language question responses using watsonx

#### Disclaimers

- Use only Projects and Spaces that are available in watsonx context.

## Notebook content
This notebook contains the steps and code to demonstrate support for Retrieval Augmented Generation (RAG) in watsonx.ai and to identify source attribution for the generated responses. It introduces commands for data retrieval, knowledge base building and querying, and model testing.

Some familiarity with Python is helpful. This notebook uses Python 3.10.

### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index knowledge base passages (once)
- Retrieve relevant passage(s) from knowledge base (for every user query)
- Generate a response by feeding retrieved passage into a large language model (for every user query)
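
The three steps above can be sketched in plain Python. This is only a toy illustration under stated assumptions: the passages are hypothetical, word overlap stands in for a real embedding model, and the final call to a large language model is left out, so the sketch stops at the prompt that would be sent.

```python
import re

# Hypothetical knowledge base, for illustration only.
PASSAGES = [
    "ARPA-H is the Advanced Research Projects Agency for Health.",
    "Chroma stores embedded passages for similarity search.",
    "Greedy decoding picks the most likely next token.",
]

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))

# Step 1: index the knowledge base once.
INDEX = [(passage, tokenize(passage)) for passage in PASSAGES]

def retrieve(query, k=1):
    """Step 2: return the k passages sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(INDEX, key=lambda item: len(query_tokens & item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def build_prompt(query, passages):
    """Step 3 (first half): place the retrieved passages into the prompt."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

top_passages = retrieve("What is ARPA-H?")
prompt = build_prompt("What is ARPA-H?", top_passages)
print(prompt)
```

The rest of this notebook replaces each toy piece with a real component: an embedding model and Chroma for indexing and retrieval, and a watsonx.ai foundation model for generation.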

#### Source Attribution Detection
Source attribution detection identifies the part(s) of the retrieved context that most likely contributed to the response from the foundation model.

#### The flow of this notebook is as follows:
1. Build a knowledge base.
2. Retrieve the relevant context from the vector database for each question the user wants answered.
3. Construct the prompt from each question and its relevant context.
4. Generate the retrieval-augmented response to each question using the foundation models hosted on watsonx.ai.
5. Initialize the Watson OpenScale (WOS) client and supply the configuration needed for identifying source attribution.
6. Identify the source attribution for the RAG-based responses.

**Note:** Search for `<EDIT THIS>` and provide the inputs

## Contents

This notebook contains the following parts:

- [Setup](#setup)
- [Document data loading](#data)
- [Build up knowledge base](#build_base)
- [Foundation Models on watsonx](#models)
- [Generate a retrieval-augmented response to a question](#predict)
- [Source Attribution detection using protodash](#sourceattribution)



<a id="setup"></a>
##  Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

-  Create a <a href="https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/" target="_blank" rel="noopener no referrer">Watson Machine Learning (WML) Service</a> instance (a free plan is offered and information about how to create the instance can be found <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-service-instance.html?context=analytics" target="_blank" rel="noopener no referrer">here</a>).


### Install and import the dependencies

In [None]:
!pip install "langchain==0.0.345" | tail -n 1
!pip install wget | tail -n 1
!pip install sentence-transformers | tail -n 1
!pip install "chromadb==0.3.26" | tail -n 1
!pip install "ibm-watson-machine-learning>=1.0.335" | tail -n 1
!pip install "pydantic==1.10.0" | tail -n 1
!pip install --upgrade ibm-metrics-plugin  --no-cache | tail -n 1
!pip install --upgrade ibm-watson-openscale --no-cache | tail -n 1
!pip install --upgrade pyspark==3.3.1 | tail -n 1

# If you are working in Watson Studio, make sure to also install the package below
#!pip install blanc

### Action: Restart the Kernel!

### watsonx API connection
This cell defines the credentials required to work with watsonx API for Foundation
Model inferencing.

**Action:** Provide the IBM Cloud user API key. For details, see <a href="https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui" target="_blank" rel="noopener no referrer">documentation</a>.

In [1]:
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "<EDIT THIS>"
}
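
To avoid pasting the API key into the notebook, the same dictionary could be built from environment variables. The sketch below is one possible approach, not part of the watsonx SDK: the variable names `WATSONX_APIKEY` and `WATSONX_URL` and the helper `load_credentials` are hypothetical names chosen for this example.

```python
import os

def load_credentials(env=None):
    """Build the credentials dict from environment variables.

    WATSONX_APIKEY and WATSONX_URL are hypothetical variable names
    used only in this sketch.
    """
    env = os.environ if env is None else env
    apikey = env.get("WATSONX_APIKEY")
    if not apikey:
        raise ValueError("Set WATSONX_APIKEY before running the notebook")
    return {
        # Fall back to the us-south endpoint used in this notebook.
        "url": env.get("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
        "apikey": apikey,
    }
```

Calling `credentials = load_credentials()` would then stand in for the hard-coded dictionary above.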

### Defining the project id
The API requires a project id that provides the context for the call. We will obtain the id from the project in which this notebook runs. Otherwise, please provide the project id.

**Hint**: You can find the `project_id` as follows. Open the prompt lab in watsonx.ai. At the very top of the UI, there will be `Projects / <project name> /`. Click on the `<project name>` link. Then get the `project_id` from Project's Manage tab (Project -> Manage -> General -> Details).


In [2]:
project_id = "<EDIT THIS>"

<a id="data"></a>
## Document data loading

Download the text file containing the State of the Union speech.

In [3]:
import wget
import os

filename = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
    wget.download(url, out=filename)

<a id="build_base"></a>
## Build up knowledge base

The most common approach in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.
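
As a rough illustration of semantic similarity between dense vectors, the sketch below ranks two passages against a query by cosine similarity. The three-dimensional vectors and the passage names are hypothetical; a real embedding model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, for illustration only.
passage_vectors = {
    "passage_a": [0.9, 0.1, 0.0],
    "passage_b": [0.0, 1.0, 0.2],
}
query_vector = [1.0, 0.0, 0.0]

# Rank passages from most to least similar to the query.
ranked = sorted(
    passage_vectors,
    key=lambda name: cosine_similarity(query_vector, passage_vectors[name]),
    reverse=True,
)
print(ranked)
```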

In this basic example, we take the State of the Union speech content (filename), split it into chunks, embed it using an open-source embedding model, load it into <a href="https://www.trychroma.com/" target="_blank" rel="noopener no referrer">Chroma</a>, and then query it.

In [4]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = TextLoader(filename)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

The speech is now split into self-contained passages of at most roughly 1,000 characters each, which can be ingested by Chroma.
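
To make the chunking step more concrete, here is a simplified sketch of separator-based splitting: the text is cut at blank lines and the pieces are merged greedily until the size limit is reached. It approximates the idea behind `CharacterTextSplitter` and is not the library's implementation; the function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole rather than cut in the middle."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa\n\nbbbb\n\ncccc"
print(split_into_chunks(sample, chunk_size=10))
```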

### Create an embedding function

Note that you can feed a custom embedding function to be used by chromadb. The performance of Chroma db may differ depending on the embedding model used.

In [1]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

<a id="models"></a>
## Foundation Models on `watsonx.ai`

IBM watsonx foundation models are among the <a href="https://python.langchain.com/docs/integrations/llms/watsonxllm" target="_blank" rel="noopener no referrer">list of LLM models supported by Langchain</a>. This example shows how to communicate with <a href="https://newsroom.ibm.com/2023-09-28-IBM-Announces-Availability-of-watsonx-Granite-Model-Series,-Client-Protections-for-IBM-watsonx-Models" target="_blank" rel="noopener no referrer">Granite Model Series</a> using <a href="https://python.langchain.com/docs/get_started/introduction" target="_blank" rel="noopener no referrer">Langchain</a>.

### Defining model
You need to specify `model_id` that will be used for inferencing:

In [6]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

model_id = ModelTypes.GRANITE_13B_CHAT_V2

### Defining the model parameters
We need to provide a set of model parameters that will influence the result:

In [7]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.STOP_SEQUENCES: ["<|endoftext|>"]
}
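
For intuition about what these parameters control, the toy sketch below mimics greedy decoding: at every step the most probable next token is chosen, and generation ends at a stop sequence or after `max_new_tokens` tokens. The `toy_model` table is hypothetical, `MIN_NEW_TOKENS` is ignored, and the way the hosted model treats stop sequences may differ from this simplification.

```python
def greedy_decode(next_token_probs, max_new_tokens=100, stop_sequences=("<|endoftext|>",)):
    """Pick the highest-probability token at each step until a stop
    sequence is produced or max_new_tokens tokens have been generated."""
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(generated)
        token = max(probs, key=probs.get)
        if token in stop_sequences:
            break
        generated.append(token)
    return generated

def toy_model(tokens_so_far):
    """Hypothetical next-token distributions, indexed by position."""
    table = [
        {"Hello": 0.7, "Hi": 0.3},
        {"world": 0.6, "there": 0.4},
        {"<|endoftext|>": 0.9, "!": 0.1},
    ]
    return table[min(len(tokens_so_far), len(table) - 1)]

print(greedy_decode(toy_model))
```

Because greedy decoding always takes the top token, the same prompt yields the same output on every run, which keeps the source attribution results reproducible.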

### LangChain CustomLLM wrapper for watsonx model
Initialize the `WatsonxLLM` class from Langchain with defined parameters and `ibm/granite-13b-chat-v2`. 

In [8]:
from langchain.llms import WatsonxLLM

watsonx_granite = WatsonxLLM(
    model_id=model_id.value,
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params=parameters
)

<a id="predict"></a>
## Generate a retrieval-augmented response to a question

Build the `RetrievalQA` (question answering chain) to automate the RAG task.

In [9]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=watsonx_granite, chain_type="stuff", retriever=docsearch.as_retriever())
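
The `"stuff"` chain type places all retrieved passages into a single prompt. The sketch below shows the idea only; the prompt wording and the helper name `stuff_prompt` are assumptions for illustration, not LangChain's exact template.

```python
def stuff_prompt(question, documents, separator="\n\n"):
    """Concatenate every retrieved passage into one context block and
    append the question. The wording is illustrative only."""
    context = separator.join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Hypothetical passages standing in for retriever output.
example = stuff_prompt("What is ARPA-H?", ["First passage.", "Second passage."])
print(example)
```

Since every passage goes into one prompt, this approach is only practical when the retrieved passages fit within the model's context window.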

In [10]:
query0 = "What did the president say about Ketanji Brown Jackson?"
query1 = "What is ARPA-H?"
query2 = "How much does it cost to make  a vial of Insulin?"
query3 = "What is the investment of Ford and GM to build electric vehicles?"
query4 = "What is the proposed tax rate for corporations?"
query5 = "What did the president say about the Bipartisan Infrastructure Act?"
query6 = "What is Intel going to build?"
query7 = "How many new manufacturing jobs were created last year?"
query8 = "What did the president say about cancer death rate?"
query9 = "What are the dangers faced by troops in Iraq and Afghanistan?"
query10 = "How many electric vehicle charging stations are built?"

questions_list = [query0, query1, query2, query3, query4, query5, query6, query7, query8, query9, query10]

### Select questions

Select a subset of the questions defined above. The context retrieved for each selected question is retained in the next step.

In [18]:
# Select a subset of the questions from the list above.
questions = questions_list[0:2]
for question in questions:
    print(question)


What did the president say about Ketanji Brown Jackson?
What is ARPA-H?


### Generate a retrieval-augmented response to a question

In [19]:
responses = []
contexts = []
for query in questions:
    # Retrieve the relevant context for each question from the vector db
    docs = docsearch.as_retriever().get_relevant_documents(query)

    # Extract the passage text from each retrieved document
    context = [doc.page_content for doc in docs]

    # Capture the context
    contexts.append(context)

    # Run the prompt and get the response
    response = qa.run(query)
    responses.append(response)

In [20]:
#Print a sample context retrieved for a query 
print(f"Question:{questions[0]}\n context:{contexts[0]}")

Question:What did the president say about Ketanji Brown Jackson?
 context:['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', 'A former top litigator in private practice. A former federal public defender. And from a family of public school educators and 

In [21]:
# Print the result
for query, response in zip(questions, responses):
    print(f"{query} \n {response} \n")

What did the president say about Ketanji Brown Jackson? 
  The president said that Ketanji Brown Jackson is a "top legal mind" and that she will "continue Justice Breyer's legacy of excellence." He also said that she has received a "broad range of support" from various groups and individuals. 

What is ARPA-H? 
  ARPA-H is the Advanced Research Projects Agency for Health. It is based on DARPA, the Defense Department project that led to the Internet, GPS, and so much more. ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more. 



<a id="sourceattribution"></a>
### Source attribution detection for RAG-based responses from LLMs

Source attribution for a RAG-based response is computed using the ProtoDash explainer. The computation needs the following information:
1. The response data for which source attribution has to be identified. This is treated as the input data.
2. The context information retained during RAG. This is treated as the reference data.

Using this information, prototypes of the input are identified within the reference data. In this way, the passages in the context that contributed to the response are identified.
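
To convey the shape of the result, the sketch below scores each context paragraph against a response and normalizes the scores into weights that sum to 1. It is a deliberately simplified stand-in based on word overlap, not the ProtoDash optimization, and the function names and sample texts are hypothetical.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def attribute(response, paragraphs):
    """Weight each paragraph by the number of response tokens it contains,
    normalized so the weights sum to 1. This is a stand-in for ProtoDash,
    which instead selects prototypes by solving an optimization problem."""
    response_tokens = tokens(response)
    scores = [len(response_tokens & tokens(p)) for p in paragraphs]
    total = sum(scores)
    if total == 0:
        return [0.0] * len(paragraphs)
    return [score / total for score in scores]

# Hypothetical response and context paragraphs, for illustration only.
weights = attribute(
    "ARPA-H will drive breakthroughs in cancer",
    [
        "ARPA-H will have a singular purpose to drive breakthroughs in cancer",
        "Pass the Freedom to Vote Act",
    ],
)
print(weights)
```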

### Construct a dataframe with the responses and contexts to supply for source attribution
- generated_text: Response from the foundation model
- context: Relevant context retrieved from the vector db (chromadb in this example) for each question. In this notebook, the two questions selected above are considered for source attribution.

In [23]:
import pandas as pd
data = pd.DataFrame({"generated_text":responses,"context":contexts})
data.head()

| | generated_text | context |
|---|---|---|
| 0 | The president said that Ketanji Brown Jackson... | [Tonight. I call on the Senate to: Pass the Fr... |
| 1 | ARPA-H is the Advanced Research Projects Agen... | [Last month, I announced our plan to superchar... |


In [24]:
data.columns

Index(['generated_text', 'context'], dtype='object')

# Invocation via SDK
### Set up the OpenScale client

In [25]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator,BearerTokenAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *


authenticator = IAMAuthenticator(apikey=credentials.get("apikey"))
client = APIClient(authenticator=authenticator)
client.version

'3.0.33'

### Update the configuration needed for source attribution

In [26]:
from ibm_metrics_plugin.common.utils.constants import ExplainabilityMetricType
from ibm_metrics_plugin.metrics.explainability.entity.explain_config import ExplainConfig
from ibm_metrics_plugin.common.utils.constants import InputDataType,ProblemType

config_json = {
    "configuration": {
        "input_data_type": InputDataType.TEXT.value,
        "problem_type": ProblemType.QA.value,
        "feature_columns": ["context"],
        "prediction": "generated_text",  # Column that holds the prompt response from the foundation model
        "context_column": "context",
        "explainability": {
            "metrics_configuration": {
                ExplainabilityMetricType.PROTODASH.value: {
                    # Supply the embedding function; otherwise TfidfVectorizer is used
                    "embedding_fn": embeddings.embed_documents
                }
            }
        }
    }
}

### Run protodash explainer to identify source attribution for the RAG based responses

In [27]:
import warnings

warnings.filterwarnings("ignore")
results = client.ai_metrics.compute_metrics(configuration=config_json,data_frame=data)

Please install evaluate package from Huggingface
Please install watson_nlp package to compute HAP score

In [28]:
metrics = results.get("metrics_result")
results = metrics.get("explainability").get("protodash")

In [29]:
import json
for idx, entry in enumerate(results):
    print(f"====idx:{idx}: Question:{questions[idx]} Response:{data['generated_text'][idx]}====")
    print(json.dumps(entry,indent=4))

====idx:0: Question:What did the president say about Ketanji Brown Jackson? Response: The president said that Ketanji Brown Jackson is a "top legal mind" and that she will "continue Justice Breyer's legacy of excellence." He also said that she has received a "broad range of support" from various groups and individuals.====
{
    "prototypes": {
        "fields": [
            "weights",
            "prototypes"
        ],
        "values": [
            [
                0.624389135128992,
                "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you\u2019re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I\u2019d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer\u2014an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the mos

### Explanation
Source attribution can be read from the weights (the attribution or contribution factor) and the prototypes (the relevant context or source passages) that contributed to the foundation model's response. For example, a single weight of 1.0 indicates that one paragraph of the context accounts for the response. Likewise, weights of 0.6, 0.3, and 0.1 indicate that three paragraphs contributed, in decreasing order of importance. The prototype values are the paragraphs supplied as part of the relevant context.
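
A small helper can turn one result entry into a ranked list of (weight, passage preview) pairs. The sketch assumes the entry has the `prototypes` / `fields` / `values` layout shown in the output above; the helper name `summarize_prototypes` and the sample entry are hypothetical.

```python
def summarize_prototypes(entry, preview_chars=60):
    """Return (weight, passage preview) pairs sorted by descending weight."""
    proto = entry["prototypes"]
    fields = proto["fields"]
    weight_idx = fields.index("weights")
    text_idx = fields.index("prototypes")
    rows = [(row[weight_idx], row[text_idx]) for row in proto["values"]]
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(round(weight, 3), text[:preview_chars]) for weight, text in rows]

# Hypothetical entry that mirrors the structure of the output above.
sample_entry = {
    "prototypes": {
        "fields": ["weights", "prototypes"],
        "values": [
            [0.3, "Second most relevant paragraph."],
            [0.6, "Most relevant paragraph."],
            [0.1, "Least relevant paragraph."],
        ],
    }
}

for weight, preview in summarize_prototypes(sample_entry):
    print(f"{weight:.3f}  {preview}")
```

Looking up the column positions through `fields` rather than hard-coding them keeps the helper working if the order of the two columns changes.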