# Using the Retrieval-Augmented Generation (RAG) Approach for Question Answering

### Introduction

In this notebook we explore the **Retrieval-Augmented Generation (RAG)** approach, in which a Large Language Model (LLM) answers questions based on a reference file. In our case, we use a PDF file that the model has not seen before and test its ability to generate responses based on the content of that file.

This notebook is based on a template hosted on [GitHub](https://www.youtube.com/watch?v=0xyXYHMrAP0&t=1173s) and a tutorial hosted on [YouTube](https://www.youtube.com/watch?v=0xyXYHMrAP0&t=1173s) by James Calam. We have customized portions of the code to fit our particular needs.

In this project we use Flan T5 XL as the LLM, MiniLM as the embedding model, and Pinecone as the vector database. All code is executed on SageMaker, with S3 as the storage for the reference file.

In [1]:
!pip install -qU \
    sagemaker==2.173.0 \
    pinecone-client==2.2.1 \
    ipywidgets==7.0.0

### Deploy Flan-T5-XL from HuggingFace 

The first step in building this project is to deploy the LLM. FLAN (Fine-tuned LAnguage Net) T5 XL is a model produced by Google Research. Its predecessor is the T5 model, developed in 2019 ([Jacob, 2023](https://exemplary.ai/blog/flan-t5)). FLAN-T5 is a text-to-text transformer model that can be used for language tasks including translation, classification, and question answering. As this was the model used in the tutorial, we chose it for this project as well because of its size and performance. It is smaller than many other LLMs and is known for its speed and efficiency, while offering comparable performance ([Jacob, 2023](https://exemplary.ai/blog/flan-t5)).

We will deploy the model from the HuggingFace Models library.

In [2]:
import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel,
    get_huggingface_llm_image_uri
)

role = sagemaker.get_execution_role() # IAM role to use by SageMaker

hub_config = {
    'HF_MODEL_ID':'google/flan-t5-xl', # model_id from hf.co/models
    'HF_TASK':'text-generation' # Specifies the task
}

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# create the model
huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role, 
    image_uri=llm_image
)

Next, we deploy the model. Deployment time varies with the instance type; ours took no longer than 10 minutes.

In [3]:
llm = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.4xlarge",
    endpoint_name="flan-t5-demo"
)

----------------!

### Asking the LLM a question without providing it with the source knowledge

At this point we have configured and deployed our model. Now we will experiment with asking the LLM a question without providing it with any context and see how it performs. 

In [4]:
question = "What is Diana's education?"

out = llm.predict({"inputs": question})
out

[{'generated_text': 'University of Cambridge'}]

That is a nice answer, but unfortunately it is not true. This example illustrates why RAG is important. With the RAG approach, we supply extra information (the context) and send it to the model together with a prompt and the question. For example:

In [5]:
context = """Diana attended Carleton Univesrsity in 2017 and she graduated in 2021."""

Because FLAN T5 XL is "instruction finetuned", we can give the model a template that describes a specific language task, and so direct it to use the context we provided to answer our question ([Jacob, 2023](https://exemplary.ai/blog/flan-t5), [Bosma & Wei, 2021](https://blog.research.google/2021/10/introducing-flan-more-generalizable.html)).

In [6]:
prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

out = llm.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {question}\n[Output]: {generated_text}")

[Input]: What is Diana's education?
[Output]: Carleton Univesrsity


Now, that is better! We can also ask a question that the context does not answer and see how the model performs.

In [7]:
unanswerable_question = "What was my major?"

text_input = prompt_template.replace("{context}", context).replace("{question}", unanswerable_question)

out = llm.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {unanswerable_question}\n[Output]: {generated_text}")

[Input]: What was my major?
[Output]: I don't know


Great, it follows the instruction. This approach works; however, as James Calam explains in the [tutorial](https://www.youtube.com/watch?v=0xyXYHMrAP0&t=1173s), it is not a good idea to feed the model whatever context you have and start asking it questions. First, it defeats the purpose of having an LLM in the first place. Second, in the real world we would be working with hundreds if not thousands of documents, and feeding the contents of all these documents to an LLM as input actually has a negative impact on its performance. The article James mentions that explains this issue is ["Lost in the Middle: How Language Models Use Long Contexts"](https://arxiv.org/abs/2307.03172).

What we will do instead is store a large body of information and retrieve only the context relevant to our question. This is the basis of **RAG**. In the next section we deploy the embedding model. An embedding model creates embeddings for our context document (a PDF file), which we then store in a vector database. We then create an embedding for our question using the same embedding model, retrieve the most relevant contents from the embedding space, and feed these to the LLM to generate the correct answer.
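The retrieval step described above can be sketched without any external service. The snippet below is a minimal illustration, assuming made-up 3-dimensional vectors in place of real embeddings; `cosine_top_k`, `toy_docs`, `toy_vecs`, and `toy_query` are hypothetical names that are not part of the original notebook, and a vector database such as Pinecone performs this ranking for us at scale.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices and scores of the k document vectors closest to the query."""
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # cosine similarity is the dot product of L2-normalised vectors
    docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = docs_norm @ query_norm
    # sort by descending score and keep the first k
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# toy "embeddings", invented purely for illustration
toy_docs = ["education section", "work experience", "computer skills"]
toy_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
toy_query = [1.0, 0.2, 0.0]

top_idx, top_scores = cosine_top_k(toy_query, toy_vecs, k=2)
retrieved = [toy_docs[i] for i in top_idx]
print(retrieved)
```

The retrieved texts are what would then be placed into the `{context}` slot of the prompt template.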

### Deploy MiniLM-L6-V2 - an embedding model from HuggingFace 

We will use MiniLM-L6-V2 from the HuggingFace library as the embedding model.  

In [8]:
hub_config = {
    'HF_MODEL_ID': 'sentence-transformers/all-MiniLM-L6-v2', # model_id from hf.co/models
    'HF_TASK': 'feature-extraction'
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36", # python version of the DLC
)

In [9]:
encoder = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.large",
    endpoint_name="minilm-demo"
)

------!

In [73]:
out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

In [77]:
import numpy as np

embeddings = np.mean(np.array(out), axis=1)
embeddings.shape

(2, 384)

In [78]:
from typing import List

def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({'inputs': docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()
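The `np.mean(..., axis=1)` call above appears to rely on the endpoint returning one vector per token for each input text, which the `(2, 384)` shape after averaging suggests. Under that assumption, averaging over the token axis (often called mean pooling) collapses those token vectors into a single vector per text. A small sketch with a made-up array (`toy_out` is hypothetical and much smaller than the real output):

```python
import numpy as np

# toy encoder output: 2 texts, 3 tokens each, 4-dimensional token vectors
toy_out = np.arange(24, dtype=float).reshape(2, 3, 4)

# average over the token axis (axis=1) to get one vector per text
toy_embeddings = toy_out.mean(axis=1)
print(toy_embeddings.shape)
print(toy_embeddings[0])
```

With the real model the last dimension would be 384 rather than 4, but the pooling step is the same.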

Now that we have configured the embedding model, we will proceed to load the PDF file and extract its contents.

### Download a PDF file from S3 storage

For this project, we would like to use something that the model could not have been exposed to during training. Therefore we will be using a resume I created back in 2019 and updated recently. To retrieve it from S3, we will specify the bucket and the file name.

In [None]:
import boto3

# replace the placeholder credentials below with your own
s3 = boto3.client('s3', aws_access_key_id='your_access_key', aws_secret_access_key='your_secret_key')
# download the resume from the bucket into a local file
with open('Resume_2019.pdf', 'wb') as f:
    s3.download_fileobj('sagemaker-us-east-2-249342478614', 'Resume_2019.pdf', f)

### Examining the PDF's contents and transforming them to feed into the model

To read and transform the contents of the file, we will use the PyPDF2 library.

In [80]:
!pip install PyPDF2



In [81]:
from PyPDF2 import PdfReader

In [82]:
pdf = PdfReader('Resume_2019.pdf')
print("Number of pages in pdf file:", len(pdf.pages))

Number of pages in pdf file: 2


In [83]:
raw_text = ''
for page in pdf.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

raw_text

# code source: https://colab.research.google.com/drive/1Fk9um3Af_aV0WvavD01gVljPHAxzQNLp?usp=sharing

'Diana\nRogachova\n(2019\nresume)\nOBJECTIVE\nI\nam\na\nhighly\ndependable\nand\nwell-or ganized\nindividual\nseeking\npart-time\nemployment\nin\nthe\ncustomer\nservice\nindustry .\nI\nam\neager\nto\nprovide\npositive\nand\nprofessional\nservice\nto\nensure\nclient\nsatisfaction\nand\nrecurring\nbusiness\nopportunities.\nWORK\nEXPERIENCE\nCashier\nThe\nRiv’s\nSnack\nBar,\nSparks\nSt,\nOttawa,\nON\nJune\n2019\n–\nCurr ent\n●\nDemonstrated\nability\nto\nprovide\noutstanding\ncustomer\nservice\nby\naccurately\ntaking\norders\nand\naddressing\ncustomer\ninquiries\nwith\na\nfriendly\nand\nprofessional\ndemeanor .\n●\nProven\ntrack\nrecord\nof\ncollaborating\neffectively\nwith\nbartenders,\ncooks,\nmanagement\nand\nother\nstaff\nstreamlining\nthe\noperations.\n●\nProcessed\ntransactions\nwith\naccuracy\nand\nefficiency ,\nincluding\ncash,\ncredit\ncards,\nand\ndigital\npayments,\nensuring\na\nseamless\nand\nefficient\ncheckout\nprocess\nfor\nguests.\nSales\nAssociate\n(Sunday\nonly\nposition

This text is quite messy, and it is unlikely that the model will be able to create appropriate embeddings from it. As such, we need to clean it up.

In [84]:
# filter the text for \n and ● and split into sentences
raw_text = raw_text.replace('\n', ' ') 
raw_text = raw_text.replace('●', ' ')

raw_text = raw_text.split('.')

raw_text

['Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry ',
 ' I am eager to provide positive and professional service to ensure client satisfaction and recurring business opportunities',
 ' WORK EXPERIENCE Cashier The Riv’s Snack Bar, Sparks St, Ottawa, ON June 2019 – Curr ent   Demonstrated ability to provide outstanding customer service by accurately taking orders and addressing customer inquiries with a friendly and professional demeanor ',
 '   Proven track record of collaborating effectively with bartenders, cooks, management and other staff streamlining the operations',
 '   Processed transactions with accuracy and efficiency , including cash, credit cards, and digital payments, ensuring a seamless and efficient checkout process for guests',
 ' Sales Associate (Sunday only position) The Unrefined Olive, Bank St, Ottawa, ON September 2018 – Curr ent   Strengthened interpersonal 

This looks cleaner and more distinguishable for the embedding model. Next we will check the type and length of the result.

In [85]:
type(raw_text)

list

In [86]:
len(raw_text)

25
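The split on `'.'` above leaves leading spaces on most chunks and can produce empty strings when the text ends with a period. A slightly more defensive variant is sketched below; `clean_and_split` and `sample` are hypothetical names, and this helper is not what the notebook actually runs (using it would likely change the chunk count of 25 reported above).

```python
import re

def clean_and_split(text: str) -> list:
    """Drop bullet characters, collapse whitespace, and split on periods."""
    text = text.replace('●', ' ')
    # collapse newlines and repeated spaces into single spaces
    text = re.sub(r'\s+', ' ', text)
    # split on '.', trim each piece, and discard empty pieces
    return [piece.strip() for piece in text.split('.') if piece.strip()]

sample = 'I\nam\neager .\n●\nProven\ntrack\nrecord.\n'
print(clean_and_split(sample))
```

Like the original cell, this still splits naively on every period, so abbreviations would be cut in two.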

### Generating Embeddings for the document and storing vectors in Pinecone 

Now we need to initialize our connection to Pinecone through a free API key.

In [None]:
import pinecone
import os

# add Pinecone API key from app.pinecone.io
api_key = os.environ.get("PINECONE_API_KEY") or "your_api_key"
# set Pinecone environment - find next to API key in console
env = os.environ.get("PINECONE_ENVIRONMENT") or "gcp-starter"

pinecone.init(
    api_key=api_key,
    environment=env
)

In [None]:
pinecone.list_indexes() 

['retrieval-augmentation-aws']

In [89]:
import time

index_name = 'retrieval-augmentation-aws' # name of the index

if index_name in pinecone.list_indexes(): # delete index if it already exists
    pinecone.delete_index(index_name)
    
pinecone.create_index( # create index
    name=index_name,
    dimension=embeddings.shape[1],
    metric='cosine' # cosine similarity
)
# wait for index to finish initialization
while not pinecone.describe_index(index_name).status['ready']:
    time.sleep(1) # wait 1 second

In [90]:
pinecone.list_indexes()

['retrieval-augmentation-aws']

In [95]:
from tqdm.auto import tqdm

batch_size = 2 # batch size for upsert
vector_limit = 100 # limit the number of vectors to upload

texts = raw_text[:vector_limit] # keep at most vector_limit text chunks
index = pinecone.Index(index_name) 

for i in tqdm(range(0, len(texts), batch_size)): # iterate over batches
    # find end of batch
    i_end = min(i+batch_size, len(texts)) 
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in texts[i:i_end]]
    # create embeddings
    embeddings = embed_docs(texts[i:i_end]) # embed_docs is a function that embeds a list of documents
    # create records list for upsert
    records = zip(ids, embeddings, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records) # upsert vectors

100%|██████████| 13/13 [00:02<00:00,  6.17it/s]
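The 13 iterations in the progress bar follow from the batching arithmetic: 25 text chunks in batches of 2 give 12 full batches plus one final batch of a single chunk. The sketch below reproduces only the index ranges, without calling Pinecone; `batch_ranges` is a hypothetical helper, not part of the original notebook.

```python
def batch_ranges(n_items: int, batch_size: int):
    """Yield (start, end) index pairs covering n_items in steps of batch_size."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

ranges = list(batch_ranges(25, 2))
print(len(ranges))   # number of batches
print(ranges[-1])    # index range of the final, shorter batch
```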


In [96]:
# check number of records in the index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00025,
 'namespaces': {'': {'vector_count': 25}},
 'total_vector_count': 25}

### Combining the retrieved pdf, prompt, and question to query the LLM

In this section we will finally be able to ask the LLM a question based on the document we uploaded. The first step is to embed the question. As you might remember from an earlier section, we had a question about my education.

In [97]:
question

"What is Diana's education?"

In [98]:
# extract embeddings for the questions
query_vec = embed_docs(question)[0] # embedding a single question

# query pinecone
res = index.query(query_vec, top_k=5, include_metadata=True)

# show the results
res

{'matches': [{'id': '0',
              'metadata': {'text': 'Diana Rogachova (2019 resume) OBJECTIVE I '
                                   'am a highly dependable and well-or ganized '
                                   'individual seeking part-time employment in '
                                   'the customer service industry '},
              'score': 0.336022854,
              'values': []},
             {'id': '20',
              'metadata': {'text': '   Highly responsible and able to work '
                                   'with minimal supervision'},
              'score': 0.294900119,
              'values': []},
             {'id': '21',
              'metadata': {'text': '   Proven ability to work under pressure '
                                   'with a great degree of focus and attention '
                                   'to details'},
              'score': 0.249704123,
              'values': []},
             {'id': '16',
              'metadata': {'text': '   

Above, we retrieved the contents most relevant to the question. As the results show, these are several separate contexts. We can combine them into a single context and feed it to our LLM.

In [99]:
contexts = [match.metadata['text'] for match in res.matches] 

In [100]:
max_section_len = 10000 # maximum length of the concatenated document
separator = "\n" 

def construct_context(contexts: List[str]) -> str: 
    chosen_sections = []
    chosen_sections_len = 0

    for text in contexts:
        text = text.strip() # remove leading and trailing whitespace
        # Add contexts until we run out of space.
        chosen_sections_len += len(text) + 2 
        if chosen_sections_len > max_section_len:
            break
        chosen_sections.append(text)
    concatenated_doc = separator.join(chosen_sections)
    print(
        f"With maximum sequence length {max_section_len}, selected top {len(chosen_sections)} document sections: \n{concatenated_doc}"
    )
    return concatenated_doc

In [101]:
context_str = construct_context(contexts=contexts)

With maximum sequence length 10000, selected top 5 document sections: 
Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry
Highly responsible and able to work with minimal supervision
Proven ability to work under pressure with a great degree of focus and attention to details
Clear and strong communication skills
EDUCA TION Bachelor of Arts Honours in Criminology and Criminal Justice System: 2017-2021 Psychology Concentration Carleton University , Ottawa, ON   Current CGPA: 11


Now we feed this concatenated context into the LLM.

In [102]:
text_input = prompt_template.replace("{context}", context_str).replace("{question}", question)

out = llm.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {question}\n[Output]: {generated_text}")

[Input]: What is Diana's education?
[Output]: Bachelor of Arts Honours in Criminology and Criminal Justice System: 2017-2021 Psychology Concentr
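The answer above stops mid-word ("Concentr"), which suggests the endpoint reached its generation length limit; the notebook does not confirm the cause. Assuming the Hugging Face LLM container accepts a `parameters` dictionary with `max_new_tokens` alongside `inputs` (worth checking against the documentation for the container version in use), a request body with a larger limit could be built as sketched below. `build_payload` is a hypothetical helper, and the value 64 is an arbitrary choice.

```python
def build_payload(prompt: str, max_new_tokens: int = 64) -> dict:
    """Build a request body with an explicit generation length limit."""
    return {
        "inputs": prompt,
        # assumed generation parameter; verify for your container version
        "parameters": {"max_new_tokens": max_new_tokens},
    }

payload = build_payload("What is Diana's education?", max_new_tokens=64)
print(payload)
# With the deployed endpoint, this could then be sent as: llm.predict(payload)
```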


In [103]:
def rag_query(question: str) -> str: # function to query the index and return the generated answer
    # create query vec
    query_vec = embed_docs(question)[0]
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata['text'] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    text_input = prompt_template.replace("{context}", context_str).replace("{question}", question)
    # make prediction
    out = llm.predict({"inputs": text_input})
    return out[0]["generated_text"]

In [104]:
rag_query("What is Diana's Education?")

With maximum sequence length 10000, selected top 5 document sections: 
Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry
Highly responsible and able to work with minimal supervision
Proven ability to work under pressure with a great degree of focus and attention to details
Clear and strong communication skills
EDUCA TION Bachelor of Arts Honours in Criminology and Criminal Justice System: 2017-2021 Psychology Concentration Carleton University , Ottawa, ON   Current CGPA: 11


'Bachelor of Arts Honours in Criminology and Criminal Justice System: 2017-2021 Psychology Concentr'

In [105]:
rag_query("Where did she work in 2017-2018?")

With maximum sequence length 10000, selected top 5 document sections: 
Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry
Cashier and Food Server Kettleman’ s Bagel Co, Bank St, Ottawa, ON April 2017 – August 2018   Demonstrated an excellent ability to multitask and strengthened priority management skills while working in a high volume, fast paced café
WORK EXPERIENCE Cashier The Riv’s Snack Bar, Sparks St, Ottawa, ON June 2019 – Curr ent   Demonstrated ability to provide outstanding customer service by accurately taking orders and addressing customer inquiries with a friendly and professional demeanor
Sales Associate (Sunday only position) The Unrefined Olive, Bank St, Ottawa, ON September 2018 – Curr ent   Strengthened interpersonal and presentation skills by providing an informative service to each individual customer
Highly responsible and able to work with minimal supervision

'Kettleman’ s Bagel Co'

In [106]:
rag_query("Which city does she live in?")

With maximum sequence length 10000, selected top 5 document sections: 
Fluent in English, Russian and Ukrainian
Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry
Highly responsible and able to work with minimal supervision
EDUCA TION Bachelor of Arts Honours in Criminology and Criminal Justice System: 2017-2021 Psychology Concentration Carleton University , Ottawa, ON   Current CGPA: 11
Cashier and Food Server Kettleman’ s Bagel Co, Bank St, Ottawa, ON April 2017 – August 2018   Demonstrated an excellent ability to multitask and strengthened priority management skills while working in a high volume, fast paced café


'Ottawa'

In [107]:
rag_query("How many employees did she supervise?")

With maximum sequence length 10000, selected top 5 document sections: 
Often supervised 4-6 employees during busy overnight shifts with the goal of completion of main tasks (wholesale orders, baking for morning shifts, cleaning up, etc)
Highly responsible and able to work with minimal supervision
Proven track record of collaborating effectively with bartenders, cooks, management and other staff streamlining the operations
Further developed organizational and time-management skills by performing most of the store’ s operations without supervision, such as opening/closing, operating a cash register , answering phone calls, etc
Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry


'4-6'

In [108]:
rag_query("Which industry she wants to find work in?")

With maximum sequence length 10000, selected top 5 document sections: 
Diana Rogachova (2019 resume) OBJECTIVE I am a highly dependable and well-or ganized individual seeking part-time employment in the customer service industry
Highly responsible and able to work with minimal supervision
WORK EXPERIENCE Cashier The Riv’s Snack Bar, Sparks St, Ottawa, ON June 2019 – Curr ent   Demonstrated ability to provide outstanding customer service by accurately taking orders and addressing customer inquiries with a friendly and professional demeanor
Proven track record of collaborating effectively with bartenders, cooks, management and other staff streamlining the operations
Further developed organizational and time-management skills by performing most of the store’ s operations without supervision, such as opening/closing, operating a cash register , answering phone calls, etc


'customer service'

In [109]:
rag_query("What are her computer skills?")

With maximum sequence length 10000, selected top 5 document sections: 
Computer skills: Advanced in Microsoft Word, Outlook, PowerPoint, intermediate in Excel
Clear and strong communication skills
Highly responsible and able to work with minimal supervision
Coached new employees in the computer system and improved their performance which increased the efficiency of their training
Proven ability to work under pressure with a great degree of focus and attention to details


'Advanced in Microsoft Word, Outlook, PowerPoint, intermediate in Excel'