# Retrieval Augmented Question & Answering with Amazon Bedrock using LangChain

> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

### Context
Previously, we saw that the model could tell us how to change a tire, but we had to manually provide the relevant data as context ourselves. We explored leveraging the models available in Bedrock to answer questions based on the knowledge they learned during training, as well as on manually provided context. While that approach works for short documents or one-off applications, it fails to scale to enterprise-level question answering, where large enterprise documents cannot all fit into the prompt sent to the model.

### Pattern
We can improve upon this process by implementing an architecture called Retrieval Augmented Generation (RAG). RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context.

In this notebook, we show how to apply this pattern to question answering: finding the documents relevant to a user's question and using them to produce an answer.

### Challenges
- How to manage large document(s) that exceed the token limit
- How to find the document(s) relevant to the question being asked

### Proposal
To address the above challenges, this notebook proposes the following strategy:
#### Prepare documents
![Embeddings](./images/Embeddings_lang.png)

Before the questions can be answered, the documents must be processed and stored in a document store index:
- Load the documents
- Process and split them into smaller chunks
- Create a numerical vector representation of each chunk using Amazon Bedrock Titan Embeddings model
- Create an index using the chunks and the corresponding embeddings
#### Ask question
![Question](./images/Chatbot_lang.png)

Once the document index is prepared, you are ready to ask questions, and relevant documents will be fetched based on the question being asked. The following steps are executed:
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model under Amazon Bedrock
- Get the contextual answer based on the documents retrieved
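The retrieval steps above can be sketched in plain Python. This is a minimal illustration, not the LangChain or Bedrock implementation: the three-dimensional vectors are hypothetical stand-ins for Titan embeddings (which have 1,536 dimensions), cosine similarity is assumed as the comparison measure, and the helper names are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_chunks(query_vec: list[float], index: list[tuple[str, list[float]]], n: int) -> list[str]:
    """Compare the question embedding with every chunk embedding and keep the n best."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Add the retrieved chunks to the prompt as context for the model."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical toy index: (chunk text, embedding) pairs.
index = [
    ("Form 8300 reports cash payments over $10,000.", [0.9, 0.1, 0.0]),
    ("Form W-2 reports wages paid to employees.", [0.1, 0.9, 0.0]),
    ("OID is the excess of stated redemption price at maturity over issue price.", [0.0, 0.1, 0.9]),
]
question_vec = [0.8, 0.2, 0.0]  # hypothetical embedding of the question below

chunks = top_n_chunks(question_vec, index, n=2)
print(build_prompt("When must I file Form 8300?", chunks))
```

In the notebook itself, the comparison and top-N selection are handled by the FAISS vector store and its retriever, and the prompt assembly by the LangChain chain.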

## Use Case
#### Dataset
To explain this architecture pattern, we are using documents from the IRS. These documents explain topics such as:
- Original Issue Discount (OID) Instruments
- Reporting Cash Payments of Over $10,000 to IRS
- Employer's Tax Guide

#### Persona
Let's assume the persona of a layperson who doesn't understand how the IRS works or whether certain actions have tax implications.

The model will try to answer from the documents in plain language.


## Implementation
To follow the RAG approach, this notebook uses the LangChain framework, which integrates with different services and tools that make it efficient to build patterns such as RAG. We will be using the following tools:

- **LLM (Large Language Model)**: Anthropic Claude v2 (`anthropic.claude-v2`) available through Amazon Bedrock

  This model will be used to understand the document chunks and provide an answer in a human-friendly manner.
- **Embeddings Model**: Amazon Titan Embeddings available through Amazon Bedrock

  This model will be used to generate a numerical representation of the textual documents
- **Document Loader**: PDF Loader available through LangChain

  This loader loads the documents from a source; for the sake of this notebook, we load the sample files from a local path. It could easily be replaced with a loader that pulls documents from internal enterprise systems.

- **Vector Store**: FAISS available through LangChain

  In this notebook, we use this in-memory vector store to hold both the embeddings and the documents. In an enterprise context, it could be replaced with a persistent store such as Amazon OpenSearch Service, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone, or Weaviate.
- **Index**: VectorIndex

  The index helps compare the input embedding with the document embeddings to find relevant documents.
- **Wrapper**: wraps index, vector store, embeddings model and the LLM to abstract away the logic from the user.

In [21]:
!pip uninstall ipywidgets -y

Found existing installation: ipywidgets 7.6.5
Uninstalling ipywidgets-7.6.5:
  Successfully uninstalled ipywidgets-7.6.5

## Setup



In [2]:
%pip install  \
    "langchain>=0.0.350" \
    "faiss-cpu>=1.7,<2" \
    "pypdf>=3.8,<4" \
    matplotlib==3.8.2 \
    anthropic==0.9.0


Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install datasets==2.15.0

Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install numexpr==2.8.8

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install pypdf

Note: you may need to restart the kernel to use updated packages.


In [6]:
# restart the kernel so that the newly installed packages are picked up
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
import json
import os
import sys

import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww


# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

# os.environ["AWS_DEFAULT_REGION"] = "<REGION_NAME>"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

Create new client
  Using region: us-west-2
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-west-2.amazonaws.com)


## Configure LangChain

We begin by instantiating the LLM and the embeddings model. Here we use Anthropic Claude for text generation and Amazon Titan for text embedding.

Note: It is possible to choose other models available with Bedrock. You can replace the `model_id` as follows to change the model.

`llm = Bedrock(model_id="amazon.titan-text-express-v1")`

Check the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids-arns.html) for the IDs of the text generation and embedding models available in Amazon Bedrock.

In [9]:
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

# - create the Anthropic model (Claude v2) used for text generation
llm = Bedrock(
    model_id="anthropic.claude-v2",
    client=boto3_bedrock,
    streaming=True,
    model_kwargs={"max_tokens_to_sample": 200},
)
# - create the Titan Embeddings model used to embed document chunks and queries
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=boto3_bedrock)

## Data Preparation
Let's first download some of the files to build our document store. For this example we will be using public IRS documents from [here](https://www.irs.gov/publications).

In [10]:
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = [
    "https://www.irs.gov/pub/irs-pdf/p1544.pdf",
    "https://www.irs.gov/pub/irs-pdf/p15.pdf",
    "https://www.irs.gov/pub/irs-pdf/p1212.pdf",
]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After downloading, we can load the documents with the help of [PyPDFDirectoryLoader, available in LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html), and split them into smaller chunks.

Note: Each retrieved chunk of text should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt. Also, the embeddings model limits input to 8,192 tokens, which roughly translates to ~32,000 characters. For this use case, we create chunks of roughly 1,000 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
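To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window character splitter. It is only a sketch of the overlap arithmetic, and `split_with_overlap` is a hypothetical helper: the real `RecursiveCharacterTextSplitter` first tries to break on separators such as blank lines, newlines, and spaces, so its chunks generally vary in length.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reaches the end
            break
    return chunks


# Stand-in for one ~2,500-character PDF page (digits make the overlap visible).
page = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(page, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])
```

This prints `[1000, 1000, 700]`: each window advances by 900 characters, and the last 100 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.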

In [11]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# - RecursiveCharacterTextSplitter tries separators in order (blank lines, newlines,
#   spaces) to keep each chunk at most chunk_size characters long
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)

In [12]:
def avg_doc_length(documents):
    return sum(len(doc.page_content) for doc in documents) // len(documents)

avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents, up from the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

Average length among 81 documents loaded is 5889 characters.
After the split we have 560 documents, up from the original 81.
Average length among 560 documents (after split) is 912 characters.


We had 3 PDF files, loaded as 81 page-level documents, which have been split into 560 smaller chunks.

Now we can see what a sample embedding looks like for one of those chunks.

In [13]:
try:
    
    sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

except ValueError as error:
    if  "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
         \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
         \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")      
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution        
    else:
        raise error

Sample embedding of a document chunk:  [ 0.11621094  0.06494141 -0.23730469 ...  0.11962891 -0.29882812
 -0.27929688]
Size of the embedding:  (1536,)


Following the same pattern, embeddings can be generated for the entire corpus and stored in a vector store.

This can easily be done using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input and creates the entire vector store. Using the index wrapper, we can abstract away most of the heavy lifting, such as creating the prompt, getting embeddings of the query, sampling the relevant documents, and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.
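Conceptually, the vector store keeps every chunk embedding and, at query time, returns the chunks closest to the query embedding. The toy class below sketches exact (brute-force) nearest-neighbour search by Euclidean distance, which, to our understanding, is what LangChain's FAISS wrapper uses by default through a flat L2 index. `FlatL2Index` is a hypothetical name; this is not the FAISS API.

```python
import math


class FlatL2Index:
    """Toy exact nearest-neighbour index: stores every vector and, at search time,
    compares the query against all of them by Euclidean (L2) distance."""

    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.texts.append(text)
        self.vectors.append(vector)

    def search(self, query: list[float], k: int) -> list[tuple[float, str]]:
        """Return up to k (distance, text) pairs, closest first."""
        scored = [(math.dist(query, vec), text) for text, vec in zip(self.texts, self.vectors)]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


# Hypothetical 2-dimensional vectors standing in for real chunk embeddings.
store = FlatL2Index()
store.add("chunk about Form 8300", [1.0, 0.0])
store.add("chunk about Form W-2", [0.0, 1.0])
store.add("chunk about OID", [-1.0, 0.0])

for distance, text in store.search([0.9, 0.1], k=2):
    print(f"{distance:.3f}  {text}")
```

Exact search compares the query against every stored vector, so its cost grows linearly with the corpus; that is fine for a few hundred chunks, while larger corpora typically use approximate indexes or one of the persistent vector stores listed earlier.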

**⚠️⚠️⚠️ NOTE: it might take a few minutes to run the following cell ⚠️⚠️⚠️**

In [16]:
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

In [15]:
%pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.0/27.0 MB 32.5 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0
Note: you may need to restart the kernel to use updated packages.


## Question Answering

Now that we have our vector store in place, we can start asking questions.

### Prompt specific to the model to personalize responses 

Here, we use the prompt below, which instructs the model to give a concise answer based only on the provided context and to say that it doesn't know when the context does not contain the answer. The chunks returned by the FAISS retriever (the `k=3` most similar chunks) are inserted into the prompt as `{context}`, along with the user's question as `{question}`.

In [17]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    )
 
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
# query = "Is it possible that I get sentenced to jail due to failure in filings?"
# result = qa({"query": query})
# print_ww(result['result'])

## Preparing the Evaluation Data

As RAGAS aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You will need to prepare `question` and `ground_truths` pairs from which you can prepare the remaining information through inference as shown below. If you are not interested in the `context_recall` metric, you don’t need to provide the `ground_truths` information. In this case, all you need to prepare are the `questions`.
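The evaluation dataset is column-oriented: one row per question, with `contexts` (and, if used, `ground_truths`) holding a list of strings per row. The hypothetical helper `check_eval_columns` below checks that shape before the `Dataset` is built; it reflects the columns this notebook constructs and is not part of RAGAS.

```python
def check_eval_columns(data: dict) -> int:
    """Validate the evaluation columns and return the number of rows.

    `ground_truths` is optional; it is only needed for the context_recall metric.
    """
    for col in ("question", "answer", "contexts"):
        if col not in data:
            raise ValueError(f"missing column: {col!r}")
    n_rows = len(data["question"])
    list_cols = [c for c in ("contexts", "ground_truths") if c in data]
    for col in ["answer", *list_cols]:
        if len(data[col]) != n_rows:
            raise ValueError(f"column {col!r} has {len(data[col])} rows, expected {n_rows}")
    for col in list_cols:
        for row in data[col]:
            if not isinstance(row, list) or not all(isinstance(s, str) for s in row):
                raise ValueError(f"each {col!r} entry must be a list of strings")
    return n_rows


gt = ("Any person in a trade or business who receives more than $10,000 in cash "
      "in a single transaction or in related transactions must file Form 8300.")
example = {
    "question": ["Who must file Form 8300?"],
    "answer": ["A person in a trade or business who receives over $10,000 in cash."],  # hypothetical
    "contexts": [[gt]],
    "ground_truths": [[gt]],
}
print(check_eval_columns(example))
```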

In [18]:
from datasets import Dataset

questions = ["Is it possible that I get sentenced to jail due to failure in filings?", 
             "What is the difference between market discount and qualified stated interest?",
             "Who must file the fillings"
            ]
ground_truths = [["Yes,If you willfully fail to file Form 8300, you can be fined up to $250,000 for individuals RECORDS($500,000 for corporations) or sentenced to upto 5 years in prison, or both. "],
                ["Market discount. A debt instrument is generally acquired with market discount if its stated redemption price at maturity is greater than its basis after its acquisition.Qualified stated interest. In general, qualified state interest is stated interest that is unconditionally payable in cash or property (other than debt instruments of the issuer) at least annually over the term of the debt instrument at a single fixed rate."],
                ["Any person in a trade or business who receives more than $10,000 in cash in a single transaction or in related transactions must file Form 8300."]]
answers = []
contexts = []

for query in questions:
    result = qa({"query": query})
    answers.append(result['result'])
    # use the chunks the chain actually retrieved, so contexts match what the LLM saw
    contexts.append([doc.page_content for doc in result['source_documents']])

# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

  warn_deprecated(


## Evaluating the RAG application
First, import all the metrics you want to use from `ragas.metrics`. Then, you can use the `evaluate()` function and simply pass in the relevant metrics and the prepared dataset.
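Two of these metrics reduce to simple ratios once the individual judgements are made. As defined in the RAGAS documentation, faithfulness is the fraction of claims in the answer that can be inferred from the retrieved context, and context recall is the fraction of ground-truth sentences that can be attributed to the retrieved context. In RAGAS an LLM produces those per-claim and per-sentence verdicts; in the sketch below they are hard-coded, hypothetical booleans, and `supported_fraction` is an illustrative helper.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Share of items judged True (e.g. claims supported by the context)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


# Faithfulness: hypothetical verdicts for 4 claims extracted from an answer.
claims_supported = [True, True, True, False]
# Context recall: hypothetical verdicts for 2 ground-truth sentences.
gt_sentences_attributed = [True, False]

faithfulness_score = supported_fraction(claims_supported)
context_recall_score = supported_fraction(gt_sentences_attributed)
print(faithfulness_score, context_recall_score)
```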

In [19]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from ragas.llms import LangchainLLM

ragas_bedrock_model = LangchainLLM(llm)

# Set the embeddings model used to evaluate the answer relevancy metric
answer_relevancy.embeddings = bedrock_embeddings

# Specify the metrics here
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
]

# Set the Bedrock-backed LLM used for metric evaluation
for m in metrics:
    m.llm = ragas_bedrock_model

result = evaluate(
    dataset = dataset, 
    metrics=metrics,
)

df = result.to_pandas()

evaluating with [faithfulness]


100%|██████████| 1/1 [00:31<00:00, 31.25s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:13<00:00, 13.19s/it]


evaluating with [context_precision]


100%|██████████| 1/1 [00:28<00:00, 28.88s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:13<00:00, 13.39s/it]


Below, you can see the resulting RAGAS scores for the examples:

In [20]:
import pandas as pd
pd.options.display.max_colwidth = 800
df

| | question | contexts | answer | ground_truths | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Is it possible that I get sentenced to jail due to failure in filings? | [There are civil penalties for failure to:\nFile a correct Form 8300 by the date it is \ndue, and\nProvide the required statement to those \nnamed in the Form 8300.\nIf you intentionally disregard the requirement \nto file a correct Form 8300 by the date it is due, \nthe penalty is the greater of:\n1.$25,000, or\n2.The amount of cash you received and \nwere required to report (up to $100,000).\nThere are criminal penalties for:\nWillful failure to file Form 8300,\nWillfully filing a false or fraudulent Form \n8300,\nStopping or trying to stop Form 8300 from \nbeing filed, and\nSetting up, helping to set up, or trying to \nset up a transaction in a way that would \nmake it seem unnecessary to file Form \n8300.\nIf you willfully fail to file Form 8300, you can \nbe fined up to $250,000 f... | Yes, based on the information provided in the context, it is possible to be sentenced to jail for up to 5 years for willful failure to file Form 8300 or for willfully filing a false or fraudulent Form 8300. The context states that there are criminal penalties for willful failure to file, willfully filing a false or fraudulent form, stopping or trying to stop the form from being filed, or setting up a transaction to avoid filing. The penalties can include fines up to $250,000 for individuals ($500,000 for corporations) or up to 5 years in prison, or both. So jail time is a possible criminal penalty for willful failures related to Form 8300. | [Yes,If you willfully fail to file Form 8300, you can be fined up to $250,000 for individuals RECORDS($500,000 for corporations) or sentenced to upto 5 years in prison, or both. ] | 1.0 | 0.601696 | 0.0 | 0.2 |
| 1 | What is the difference between market discount and qualified stated interest? | [was less than the debt instrument's issue price \nplus the total OID that accrued before you ac-\nquired it. The market discount is the difference \nbetween the issue price plus accrued OID and \nyour adjusted basis.\nPremium. A debt instrument is purchased at a \npremium if its adjusted basis immediately after \npurchase is greater than the total of all amounts \npayable on the debt instrument after the pur-\nchase date, other than qualified stated interest. \nThe premium is the excess of the adjusted ba-\nsis over the payable amounts.\nPremium will generally eliminate the future \nreporting of OID in income by the purchaser, as \ndiscussed under Information for Owners of OID \nDebt Instruments , later. See Pub. 550 for more \ninformation on the tax treatment of bond pre-\nmium.\nQu... | Based on the context provided:\n\nMarket discount is the difference between the issue price plus accrued OID and the adjusted basis if the adjusted basis was less than the issue price plus accrued OID when the debt instrument was acquired. \n\nQualified stated interest is stated interest that is unconditionally payable in cash or property at least annually.\n\nSo the key difference is that market discount refers to a discount in price/basis compared to issue price plus accrued OID, while qualified stated interest refers to a specific type of stated interest that meets certain criteria. | [Market discount. A debt instrument is generally acquired with market discount if its stated redemption price at maturity is greater than its basis after its acquisition.Qualified stated interest. In general, qualified state interest is stated interest that is unconditionally payable in cash or property (other than debt instruments of the issuer) at least annually over the term of the debt instrument at a single fixed rate.] | 1.0 | 0.982184 | 0.0 | 0.4 |
| 2 | Who must file the fillings | [Page 30 of 57 Fileid: … ations/p15/2024/a/xml/cycle06/source 14:07 - 19-Dec-2023\nThe type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.\ngiven on time, you must give the employee Notice 797 or \nyour written statement by the date Form W -2 is required to \nbe given. If Form W -2 isn't required, you must notify the \nemployee by February 7, 2024.\n11. Depositing Taxes\nGenerally, you must deposit federal income tax withheld \nand both the employer and employee social security and \nMedicare taxes. Y ou must use EFT to make all federal tax \ndeposits. See How T o Deposit , later in this section, for in-\nformation on electronic deposit requirements.\nPayment with return. Y ou may make a payment with a \ntimely filed F... | Based on the context provided:\n- Employers must file Form W-2 to report wages paid to employees.\n- Employers may also need to file information returns like Form 1099-NEC to report payments of $600 or more made during the year to non-employees (e.g. independent contractors) for services performed for the business.\nSo the answer is that employers must file Form W-2 for employees and may need to file forms like 1099-NEC for non-employee compensation payments. | [Any person in a trade or business who receives more than $10,000 in cash in a single transaction or in related transactions must file Form 8300.] | 1.0 | 0.364663 | 0.0 | 0.0 |
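With only three questions, per-metric averages are easy to compute by hand, and a single row moves them considerably. The sketch below copies the per-question scores from the results table and averages them with Python's `statistics.mean`.

```python
from statistics import mean

# Per-question scores copied from the results table (rows 0, 1, 2).
scores = {
    "faithfulness": [1.0, 1.0, 1.0],
    "answer_relevancy": [0.601696, 0.982184, 0.364663],
    "context_precision": [0.0, 0.0, 0.0],
    "context_recall": [0.2, 0.4, 0.0],
}

summary = {metric: round(mean(values), 4) for metric, values in scores.items()}
print(summary)
```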


## Conclusion
> Note: The scores above give a relative idea of the performance of your RAG application; use them with caution and not as standalone scores. Also note that we used only 3 question/answer pairs for evaluation; as a best practice, you should use enough data to cover different aspects of your documents when evaluating the model.

Based on the scores, you can review other components of your RAG workflow to further improve them. A few recommended options are to review your chunking strategy and prompt instructions, or to retrieve more chunks for additional context (the `k` value in the retriever's `search_kwargs`).
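Retrieving more chunks adds context at the cost of a longer prompt. The sketch below gives a rough size estimate, assuming the 912-character average chunk length reported after splitting and about 4 characters per token (the same ratio behind the 8,192-token to ~32,000-character figure earlier). In the notebook, `k` is set through `search_kwargs` in `vectorstore_faiss.as_retriever(...)`; `approx_context_tokens` is an illustrative helper.

```python
AVG_CHUNK_CHARS = 912  # average chunk length after splitting, from the notebook output
CHARS_PER_TOKEN = 4    # rough heuristic (assumption); real tokenization varies


def approx_context_tokens(k: int) -> int:
    """Approximate number of tokens that k retrieved chunks add to the prompt."""
    return k * AVG_CHUNK_CHARS // CHARS_PER_TOKEN


for k in (3, 5, 10):
    print(f"k={k}: ~{approx_context_tokens(k)} context tokens")
```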

# Thank You