# Building and Evaluating a Q&A Application with Knowledge Bases for Amazon Bedrock and the RAG Assessment (RAGAS) Framework

### Context

In this notebook, we will dive deep into building a Q&A application using the Retrieve API provided by Knowledge Bases for Amazon Bedrock, along with LangChain and RAGAS for evaluating the responses. Here, we will query the knowledge base to get the desired number of document chunks based on similarity search, prompt Anthropic Claude with the query and the retrieved chunks, and then evaluate the responses using Ragas evaluation metrics such as faithfulness, answer relevancy, and context precision.

### Knowledge Bases for Amazon Bedrock Introduction

With knowledge bases, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant, context-specific, and accurate responses without continuously retraining the FM. All information retrieved from knowledge bases comes with source attribution to improve transparency and minimize hallucinations. For more information on creating a knowledge base using the console, refer to the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html).

### Pattern

We can implement the solution using the Retrieval Augmented Generation (RAG) pattern. RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context. Here, we perform RAG on the knowledge base created in the previous notebook or through the console.
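The retrieve-then-augment idea can be sketched without any AWS service. The example below is a toy illustration only: the in-memory `documents` list, the word-overlap ranking, and the helper names `retrieve` and `augment` are all assumptions made for this sketch, whereas the notebook itself relies on the knowledge base for embedding-based retrieval.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


def augment(query, chunks):
    """Place the retrieved chunks in the prompt as context, followed by the question."""
    context = "\n\n".join(chunks)
    return f"<context>\n{context}\n</context>\n\n<question>\n{query}\n</question>"


# Hypothetical one-line documents standing in for the knowledge base content.
documents = [
    "Rejection code 22 relates to the Dispense As Written field.",
    "Rejection code 12 relates to the place of service field.",
    "Rejection code 88 means DUR override codes are needed.",
]

query = "What does rejection code 12 mean?"
top_chunks = retrieve(query, documents, k=1)
rag_prompt = augment(query, top_chunks)
print(rag_prompt)
```

The language model only ever sees `rag_prompt`, so the quality of the answer depends directly on which chunks the retriever selects.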

### Pre-requisite

Before being able to answer the questions, the documents must be processed and stored in Knowledge Bases for Amazon Bedrock.

1. Load the documents into the knowledge base by connecting your S3 bucket (data source).
2. Ingestion: the knowledge base splits the documents into smaller chunks (based on the chunking strategy selected), generates embeddings, and stores them in the associated vector store.

![data_ingestion.png](./images/data_ingestion.png)
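To make the chunking step more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the implementation used by Knowledge Bases for Amazon Bedrock: it counts whitespace-separated words rather than model tokens, and the function name `chunk_text` and the sizes used are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=300, overlap=60):
    """Split text into chunks of up to `chunk_size` words, sharing `overlap` words between neighbours."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with ten placeholder words.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one of the two neighbouring chunks, at the cost of storing some text twice.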


#### Notebook Walkthrough



For our notebook we will use the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.
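As a rough illustration of that output, the sketch below flattens a Retrieve API style response into simple records. The `sample_response` dictionary is hand-written for this example (its bucket name, text, and scores are made up), and `flatten_results` is a hypothetical helper, so treat the exact field layout as an assumption to verify against the API reference.

```python
# Hand-written stand-in for a Retrieve API response; values are illustrative only.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Rejection: Code 12 Message: M/I Place of Service Field C7"},
            "location": {
                "type": "S3",
                "s3Location": {"uri": "s3://example-bucket/Rejection-Code-12.pdf"},
            },
            "score": 0.62,
        },
        {
            "content": {"text": "Rejection: Code 88 Message: DUR Reject Error Fields D1, D7"},
            "location": {
                "type": "S3",
                "s3Location": {"uri": "s3://example-bucket/Rejection-Code-88.pdf"},
            },
            "score": 0.41,
        },
    ]
}


def flatten_results(response):
    """Extract the text chunk, location type, source URI, and score of each result."""
    rows = []
    for item in response.get("retrievalResults", []):
        location = item.get("location", {})
        rows.append(
            {
                "text": item["content"]["text"],
                "location_type": location.get("type"),
                "uri": location.get("s3Location", {}).get("uri"),
                "score": item.get("score"),
            }
        )
    return rows


rows = flatten_results(sample_response)
for row in rows:
    print(row["score"], row["uri"])
```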


We will then augment the original prompt with the retrieved text chunks and pass it to the `anthropic.claude-instant-v1` model, using prompt engineering patterns suited to your use case.

Finally, we will evaluate the generated responses with RAGAS, using metrics such as faithfulness, answer relevancy, and context precision. For evaluation, we will use `anthropic.claude-v2:1`.

### Ask question


![retrieveapi.png](./images/retrieveAPI.png)


#### Evaluation
1. Use Ragas to evaluate the following metrics:
    1. **Faithfulness:** This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.
    2. **Answer Relevancy:** This metric assesses how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. It is computed using the question and the answer, with values ranging between 0 and 1, where higher scores indicate better relevancy.
    3. **Context Precision:** This metric evaluates whether the ground-truth relevant items present in the contexts are ranked at the top. Ideally, all the relevant chunks appear at the top ranks. It is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
    4. **Context Recall:** This metric measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed from the ground truth and the retrieved context, with values ranging between 0 and 1, where higher values indicate better performance.
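The arithmetic behind these scores can be illustrated with a small sketch. In Ragas the counts and relevance verdicts come from LLM judgments; here they are hand-written, and the helper names `ratio_score` and `average_precision` are assumptions for this example. The rank-weighted formula shows the general idea behind context precision and may differ from what the pinned Ragas version computes internally.

```python
def ratio_score(supported, total):
    """Fraction of statements that are supported; used here for faithfulness and context recall."""
    return supported / total if total else 0.0


def average_precision(verdicts):
    """Rank-weighted precision: relevant chunks near the top contribute more to the score."""
    hits = 0
    weighted_sum = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            weighted_sum += hits / rank  # precision at this rank
    return weighted_sum / hits if hits else 0.0


# Faithfulness: 3 of the 4 statements in the answer are supported by the context.
faithfulness_example = ratio_score(3, 4)

# Context recall: 1 of the 3 ground-truth sentences can be attributed to the context.
recall_example = ratio_score(1, 3)

# Context precision: the same two relevant chunks, ranked first versus ranked last.
precision_top = average_precision([1, 1, 0, 0])
precision_bottom = average_precision([0, 0, 1, 1])

print(faithfulness_example, round(recall_example, 6), precision_top, round(precision_bottom, 6))
```

Ranking the two relevant chunks first yields a precision of 1.0, while pushing them to the bottom lowers it to roughly 0.42, even though the same chunks were retrieved.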
    

### USE CASE:

#### Dataset

In this example, you will use several years of Amazon's Letter to Shareholders as a text corpus to perform Q&A on. This data is already ingested into the knowledge base. You will need the `knowledge base id` to run this example.
In your own use case, you can sync different files for different domain topics and run this notebook in the same manner to evaluate model responses using the Retrieve API from knowledge bases.


### Python 3.10

⚠  For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠

### Setup

To run this notebook, you need to install the dependencies `langchain` and `ragas`, along with the updated `boto3` and `botocore` packages.


In [5]:
%pip install --upgrade pip
%pip install boto3==1.33.2 --force-reinstall --quiet
%pip install botocore==1.33.2 --force-reinstall --quiet
%pip install langchain==0.0.342 --force-reinstall --quiet
%pip install ragas==0.0.20 --force-reinstall --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
awscli 1.31.9 requires botocore==1.33.9, but you have botocore 1.33.13 which is incompatible.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.4 which is incompatible.
jupyterlab 3.4.4 requires jupyter-server~=1.16, but you have jupyter-server 2.12.1 which is incompatible.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.12.1 which is incompatible.
notebook 6.5.6 requires jupyter-client<8,>=5.3.4, but you have jupyter-client 8.6.0 which is incompatible.
notebook 6.5.6 requires pyzmq<25,>=17, but you have pyzmq 25.1.2 which is incompatible.
panel 0.13.1 requir

#### Restart the kernel with the updated packages that are installed through the dependencies above

In [6]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
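After the restart, it can be useful to confirm which versions the kernel actually sees. The sketch below uses only the standard library; the `PINNED` dictionary mirrors the versions requested in the install cell above, and `check_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions requested by the install cell in this notebook.
PINNED = {
    "boto3": "1.33.2",
    "botocore": "1.33.2",
    "langchain": "0.0.342",
    "ragas": "0.0.20",
}


def check_versions(pinned):
    """Return {package: (installed_version_or_None, matches_pin)} for each pinned package."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == expected)
    return report


for name, (installed, matches) in check_versions(PINNED).items():
    status = "OK" if matches else "MISMATCH"
    print(f"{name}: installed={installed} expected={PINNED[name]} [{status}]")
```

A mismatch is not necessarily fatal, but it is worth knowing about before debugging unexpected API errors.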

## Data Preparation
Let's first download some of the files to build our document store. For this example we will be using public IRS documents from [here](https://www.irs.gov/publications).

In [16]:
from urllib.request import urlretrieve
import os
os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

## Create a KnowledgeBase
Using the downloaded data, create a knowledge base in Amazon Bedrock and copy its knowledge base ID. See the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-create.html) for the steps.

### Follow the steps below to set up necessary packages

1. Import the libraries needed to create the `bedrock-runtime` client for invoking foundation models and the `bedrock-agent-runtime` client for using the Retrieve API provided by Knowledge Bases for Amazon Bedrock.
2. Import LangChain to:
   1. Initialize the Bedrock model `anthropic.claude-instant-v1` as the large language model that performs query completions using the RAG pattern.
   2. Initialize the Bedrock model `anthropic.claude-v2:1` as the large language model that performs RAG evaluation.
   3. Initialize the LangChain retriever integrated with knowledge bases.
   4. Later in the notebook, we will chain the retriever, the prompt, and the LLM into a RAG pipeline for our Q&A application.

In [7]:
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever

pp = pprint.PrettyPrinter(indent=2)

kb_id = "UDDPZH86JK" # replace it with your Knowledge base id.


bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000
}

llm_for_text_generation = Bedrock(model_id="anthropic.claude-instant-v1",
              model_kwargs=model_kwargs_claude,
              streaming=True,
              client = bedrock_client,)

llm_for_evaluation = Bedrock(model_id="anthropic.claude-v2:1",
              model_kwargs=model_kwargs_claude,
              streaming=True,
              client = bedrock_client,)

bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",client=bedrock_client)
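LangChain builds the model request for you, but it can help to see roughly what is sent. The sketch below assembles a request body in the Anthropic text-completions format, assuming that this is the format these model IDs expect: a prompt that starts with `\n\nHuman:` and ends with `\n\nAssistant:`. The helper name `build_claude_body` is hypothetical, and the default parameters simply mirror `model_kwargs_claude` above.

```python
import json


def build_claude_body(user_text, temperature=0, top_k=10, max_tokens_to_sample=3000):
    """Build a JSON request body in the Human/Assistant text-completions format."""
    prompt = f"\n\nHuman: {user_text.strip()}\n\nAssistant:"
    return json.dumps(
        {
            "prompt": prompt,
            "temperature": temperature,
            "top_k": top_k,
            "max_tokens_to_sample": max_tokens_to_sample,
        }
    )


body = build_claude_body("Explain claim rejection code 12.")
print(body)
```

Such a body could then be passed to `bedrock_client.invoke_model(modelId="anthropic.claude-instant-v1", body=body)`; that call is left out of the block so the sketch runs without AWS credentials.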

### Retrieve API: Process flow 

Create an `AmazonKnowledgeBasesRetriever` object from LangChain, which calls the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock. The API converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.

In [8]:

retriever = AmazonKnowledgeBasesRetriever(
        knowledge_base_id=kb_id,
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
        # endpoint_url=endpoint_url,
        # region_name="us-east-1",
        # credentials_profile_name="<profile_name>",
    )
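For comparison, the same retrieval can be requested from the `bedrock-agent-runtime` client directly. The sketch below only builds the request arguments so that it runs without AWS access; `build_retrieve_request` is a hypothetical helper and `"KBID123456"` is a placeholder ID, not a real knowledge base.

```python
def build_retrieve_request(knowledge_base_id, query, number_of_results=4):
    """Assemble keyword arguments for a Retrieve API call."""
    if number_of_results < 1:
        raise ValueError("number_of_results must be at least 1")
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": number_of_results}
        },
    }


# Placeholder knowledge base ID; replace it with your own.
request = build_retrieve_request("KBID123456", "Explain about Claim Rejection code 12?")
print(request)

# With valid credentials, the call would look like this (not executed here):
# response = bedrock_agent_client.retrieve(**request)
```

The `retrievalConfiguration` value has the same shape as the `retrieval_config` argument passed to `AmazonKnowledgeBasesRetriever` above.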

`score`: Each returned text chunk has an associated score that indicates how closely it matches the query.
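One possible use of the score is to drop weak matches before building the prompt. The sketch below works on plain dictionaries with made-up scores; the helper name `filter_by_score` and the `0.5` threshold are assumptions, and a sensible threshold depends on your embedding model and data.

```python
def filter_by_score(chunks, min_score=0.5):
    """Keep chunks scoring at least `min_score`, ordered from best to worst match."""
    kept = [chunk for chunk in chunks if chunk["score"] >= min_score]
    return sorted(kept, key=lambda chunk: chunk["score"], reverse=True)


# Illustrative chunks; the texts and scores are invented for this example.
retrieved_chunks = [
    {"text": "chunk about code 12", "score": 0.71},
    {"text": "unrelated chunk", "score": 0.32},
    {"text": "another chunk about code 12", "score": 0.84},
]

best_chunks = filter_by_score(retrieved_chunks, min_score=0.5)
for chunk in best_chunks:
    print(chunk["score"], chunk["text"])
```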

### Prompt specific to the model to personalize responses 

Here, we will use the prompt below so that the model acts as a claims assistant that answers questions about claim rejections using only the retrieved information. We will provide the `Retrieve API` results as the `{context}` in the prompt for the model to refer to, along with the user's `{question}`.

In [9]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

PROMPT_TEMPLATE = """
        Human: You are Claims assistant with access to various company claims rejection and reasons
        <context>
        {context}
        </context>

        <question>
        {question}
        </question>
        The response is accurate and doesn’t contain any information not directly supported by the 
                    {context}, don't hallucinate.
        Assistant:"""

prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm_for_text_generation
    | StrOutputParser() 
)
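When a list of document objects is interpolated into a prompt template as-is, it is rendered with its Python representation, including the class name and metadata. A common alternative is to join only the text of each chunk. The sketch below uses a small stand-in class, `Doc`, instead of LangChain's `Document` so that it runs on its own; `format_docs` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)


docs = [
    Doc("Rejection: Code 12 Message: M/I Place of Service Field C7 ", {"score": 0.62}),
    Doc("The place of service is set on the Patient F4 screen.", {"score": 0.55}),
]

context = format_docs(docs)
print(context)
```

In the chain above, this helper could be applied as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, assuming the installed LangChain version supports piping a retriever into a plain function.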

## Preparing the Evaluation Data

As RAGAS aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You will need to prepare `question` and `ground_truths` pairs from which you can prepare the remaining information through inference as shown below. If you are not interested in the `context_recall` metric, you don’t need to provide the `ground_truths` information. In this case, all you need to prepare are the `questions`.

In [10]:
from datasets import Dataset

questions = ["Give me the reason and resolution for my claims rejection 22?", 
             "What's the reason and resolution recommended for Claim rejection code 88?",
             "Explain about Claim Rejection code 12?",
            ]
ground_truths = [["M/I Dispense As Written (DAW)/ Product Selection Code-Field D8.To resolve, go to Edit Rx> TP Detail> Claim Detail> Segment 7 and verify the Dispense As Written (Field D8). To change the DAW, go to the Edit Rx screen."],
                ["DUR Reject Error Fields D1, D7.If a plan rejects for a DUR (such as DD above), a DUR override will need to be sent. This can be done within Edit F11> TP Resend> DUR. Any DUR Conflict codes that the insurance rejected for (DD in screenshot above) will automatically populate the first line in the DUR tab.The pharmacy must then populate the Intervention Code, Outcome Code, and Level of Effort. If there were multiple DUR Conflict Codes in the rejection, the same number of DUR tabs will need to be populated with overrides - regardless of if they are duplicates."],
                ["M/I Place of Service Field C7.Code C7 means that the claim is rejecting because the place of service is invalid or not sending. The place of service is set on the Patient F4 screen> Insurance tab> Place of Service (C7) field."]]

answers = []
contexts = []

# Run the RAG chain for each question and keep the retrieved chunks for evaluation
for query in questions:
    answers.append(rag_chain.invoke(query))
    contexts.append([doc.page_content for doc in retriever.get_relevant_documents(query)])

# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)
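Before handing the data to the evaluation step, a quick shape check can catch mismatched columns early. The sketch below is a hypothetical helper, `validate_eval_data`, that assumes the four column names used in this notebook and that `contexts` and `ground_truths` hold one list of strings per question.

```python
def validate_eval_data(data):
    """Check column presence, equal lengths, and list-of-strings cells; return the row count."""
    required = ["question", "answer", "contexts", "ground_truths"]
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {key: len(data[key]) for key in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")

    for column in ("contexts", "ground_truths"):
        for index, cell in enumerate(data[column]):
            if not isinstance(cell, list) or not all(isinstance(s, str) for s in cell):
                raise ValueError(f"{column}[{index}] must be a list of strings")

    return lengths["question"]


# Minimal illustrative record; the texts are placeholders.
example_data = {
    "question": ["Explain about Claim Rejection code 12?"],
    "answer": ["Rejection code 12 indicates an issue with the place of service field."],
    "contexts": [["Rejection: Code 12 Message: M/I Place of Service Field C7"]],
    "ground_truths": [["M/I Place of Service Field C7."]],
}

row_count = validate_eval_data(example_data)
print(f"{row_count} row(s) ready for evaluation")
```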


## Evaluating the RAG application
First, import all the metrics you want to use from `ragas.metrics`. Then, you can use the `evaluate()` function and simply pass in the relevant metrics and the prepared dataset.

In [11]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from ragas.llms import LangchainLLM

ragas_bedrock_model = LangchainLLM(llm_for_evaluation)

#set embeddings model for evaluating answer relevancy metric
answer_relevancy.embeddings = bedrock_embeddings

#specify the metrics here
metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]

#set llm for metric evaluation
for m in metrics:
    m.llm = ragas_bedrock_model

result = evaluate(
    dataset = dataset, 
    metrics=metrics,
)

df = result.to_pandas()

evaluating with [faithfulness]


100%|██████████| 1/1 [00:40<00:00, 40.32s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:31<00:00, 31.59s/it]


evaluating with [context_precision]


100%|██████████| 1/1 [01:18<00:00, 78.88s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:26<00:00, 26.71s/it]


Below, you can see the resulting RAGAS scores for the examples:

In [12]:
import pandas as pd
pd.options.display.max_colwidth = 800
df

| | question | contexts | answer | ground_truths | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Give me the reason and resolution for my claims rejection 22? | [If the claim continues to reject, make sure the DUR Conflict codes have not changed or a new Conflict code has been added. (For example: DD DD TD on the first rejection and the latest rejection shows DD DD TD DC.) If DUR overrides transmitting (can be found in TP Detail> Claim Detail> Segment 8) match all DUR Conflict codes being received, refer the pharmacy to contact the insurance for additional information on what DUR codes they are looking for. For additional questions or information contact Computer-Rx Customer Support at (800) 647- 5288., Rejection\tCode\t22\t This article is a guide to assist in the resolution of third party rejection code 22. Rejection: Code 22 Message: M/I Dispense As Written (DAW)/ Product Selection Code Field D8 WinRx defaults to DAW of ... | Rejection code 22 indicates an issue with the Dispense As Written (DAW) or Product Selection Code in field D8 of the claim. \n\nThe resolution is to:\n\n1. Go to Edit Rx > TP Detail > Claim Detail > Segment 7 and verify the DAW is correct. \n\n2. If needed, change the DAW by going to the Edit Rx screen. \n\n3. Verify the DAW is set up properly in the Drug screen (F5 screen) Pricing tab.\n\n4. Resend the prescription to the insurance after making any corrections. \n\nIf the DAW is already correct, verify it is set up properly in the insurance claim sending setup. Then resend the claim to the insurance. | [M/I Dispense As Written (DAW)/ Product Selection Code-Field D8.To resolve, go to Edit Rx> TP Detail> Claim Detail> Segment 7 and verify the Dispense As Written (Field D8). To change the DAW, go to the Edit Rx screen.] | 1.0 | 0.587996 | 0.0 | 0.333333 |
| 1 | What's the reason and resolution recommended for Claim rejection code 88? | [Rejection\tCode\t88\t This article is a guide to assist in the resolution of third party rejection code 88. Rejection: Code 88 Message: DUR Reject Error Fields D1, D7 Code 88 means that DUR override codes are needed or additional DUR override codes are necessary. Question: How do you send a DUR override? HELP: If a plan rejects for a DUR (such as DD above), a DUR override will need to be sent. This can be done within Edit F11> TP Resend> DUR. Any DUR Conflict codes that the insurance rejected for (DD in screenshot above) will automatically populate the first line in the DUR tab. The pharmacy must then populate the Intervention Code, Outcome Code, and Level of Effort. If there were multiple DUR Conflict Codes in the rejection, the same number of D... | The reason for claim rejection code 88 is that DUR (Drug Utilization Review) override codes are needed or additional DUR override codes are necessary. \n\nThe resolution recommended is:\n\n1. Send a DUR override by going to Edit F11 > TP Resend > DUR. Any DUR Conflict codes that the insurance rejected will automatically populate in the DUR tab. \n\n2. Populate the Intervention Code, Outcome Code, and Level of Effort in the DUR tab. If there were multiple DUR Conflict Codes in the rejection, populate overrides in the same number of DUR tabs.\n\n3. Make sure the DUR Conflict codes have not changed or a new Conflict code has been added in subsequent rejections. \n\n4. If the claim continues to reject after sending DUR overrides, refer the pharmacy to contact the insurance for additional ... | [DUR Reject Error Fields D1, D7.If a plan rejects for a DUR (such as DD above), a DUR override will need to be sent. This can be done within Edit F11> TP Resend> DUR. Any DUR Conflict codes that the insurance rejected for (DD in screenshot above) will automatically populate the first line in the DUR tab.The pharmacy must then populate the Intervention Code, Outcome Code, and Level of Effort. If there were multiple DUR Conflict Codes in the rejection, the same number of DUR tabs will need to be populated with overrides - regardless of if they are duplicates.] | 1.0 | 0.764595 | 0.0 | 0.384615 |
| 2 | Explain about Claim Rejection code 12? | [Rejection\tCode\t12\t Updated 1 year ago FollowNot yet followed by anyone This article is a guide to assist in the resolution of third party rejection code 12. Rejection: Code 12 Message: M/I Place of Service Field C7 Code C7 means that the claim is rejecting because the place of service is invalid or not sending. The place of service is set on the Patient F4 screen> Insurance tab> Place of Service (C7) field. To resolve: verify the correct place of service is selected. Then, go to Cardholder Information> Edit Co> Claim Sending Setup> Segment 1> Make sure the "Do Not Send Place of Service" option is not checked. (If checked, the place of service field will not send on the claim). If the reject 12 continues after verifying th... | Rejection code 12 indicates an issue with the place of service field on the claim. Specifically:\n\n- Code 12 means the place of service is invalid or not being sent correctly on the claim. \n\n- The place of service is set on the patient insurance tab. \n\n- To resolve, verify the correct place of service is selected. \n\n- Also check the claim sending setup to ensure "Do not send place of service" is not checked, as that would prevent it from transmitting.\n\n- If it's still rejecting after verifying place of service, the pharmacy may need to contact the insurance for further assistance.\n\nThe key things are that rejection 12 relates to an invalid or missing place of service field on the claim. The place of service needs to be correctly set up on the patient insurance information a... | [M/I Place of Service Field C7.Code C7 means that the claim is rejecting because the place of service is invalid or not sending. The place of service is set on the Patient F4 screen> Insurance tab> Place of Service (C7) field.] | 1.0 | 0.657334 | 0.0 | 0.222222 |


> Note: The scores above give a relative idea of the performance of your RAG application; use them with caution and not as standalone scores. Also note that we used only 3 question/answer pairs for evaluation. As a best practice, use enough data to cover different aspects of your documents when evaluating the model.

Based on the scores, you can review other components of your RAG workflow to further optimize them. A few recommended options are to review your chunking strategy, refine the prompt instructions, and increase `numberOfResults` to provide additional context.
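To decide where to start, it can help to average each metric across the evaluated questions and look at the weakest one. The sketch below copies the per-question scores from the results table above into plain dictionaries; the helper name `summarize` is an assumption, and with `df` available the same averages could be computed with pandas instead.

```python
from statistics import mean

# Per-question scores copied from the results table in this notebook.
scores = [
    {"faithfulness": 1.0, "answer_relevancy": 0.587996, "context_precision": 0.0, "context_recall": 0.333333},
    {"faithfulness": 1.0, "answer_relevancy": 0.764595, "context_precision": 0.0, "context_recall": 0.384615},
    {"faithfulness": 1.0, "answer_relevancy": 0.657334, "context_precision": 0.0, "context_recall": 0.222222},
]


def summarize(rows):
    """Average each metric over all evaluated questions."""
    metric_names = list(rows[0].keys())
    return {name: mean(row[name] for row in rows) for name in metric_names}


summary = summarize(scores)
weakest_metric = min(summary, key=summary.get)

for name, value in summary.items():
    print(f"{name}: {value:.6f}")
print(f"weakest metric: {weakest_metric}")
```

Here the retrieval-side metrics (context precision and context recall) are the lowest, which points toward the chunking strategy and `numberOfResults` rather than the generation prompt as the first things to revisit.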