# Retrieval Augmented Generation (RAG) for Question Answering

This notebook demonstrates RAG mechanisms for pre-trained models for question-answering and text-generation tasks. The RAG model is a combination of a retriever and a generator. The retriever is responsible for finding relevant passages from a large corpus of text, and the generator is responsible for generating the answer based on the retrieved passages.

In [None]:
%pip install langchain-huggingface
%pip install transformers
%pip install chromadb
%pip install llama-index-vector-stores-chroma
%pip install llama-index-embeddings-huggingface


In [18]:
import torch
torch.cuda.set_device(0)

In [19]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

question_answering_roberta = "deepset/roberta-base-squad2"
mistral_text_generation = "mistralai/Mistral-Nemo-Instruct-2407"

# deepset/roberta-base-squad2

question_answering_pipeline = pipeline('question-answering', model=question_answering_roberta, tokenizer=question_answering_roberta)
question_answering = AutoModelForQuestionAnswering.from_pretrained(question_answering_roberta)
tokenizer = AutoTokenizer.from_pretrained(question_answering_roberta)

# mistralai/Mistral-Nemo-Instruct-2407
mistral_pipeline = pipeline("text-generation", model="mistralai/Mistral-Nemo-Instruct-2407", max_new_tokens=154)







Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407.
401 Client Error. (Request ID: Root=1-67646194-4de9ca7002c614571aee17d8;d64a54a5-6602-4cfa-b486-4b5095034c19)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/resolve/main/config.json.
Access to model mistralai/Mistral-Nemo-Instruct-2407 is restricted. You must have access to it and be authenticated to access it. Please log in.

## Document Chunking
Reading the entire document at once can be computationally expensive. To address this issue, we can split the document into smaller chunks and retrieve relevant chunks based on the question.

_NOTE:_ The document chunking is done rule based and is not perfect. It is just a simple way to split the document into smaller chunks. However, for the rules itself it is quite good.

In [20]:
from utils import extract_rules_from_pdf
import random

pdf_path = "FS-Rules_2024_v1.1.pdf"
extracted_rules = extract_rules_from_pdf(pdf_path)

print(f"Extracted {len(extracted_rules)} rules from the PDF.\n")
print(f"For example, a random chunk in the rules is:\n {extracted_rules[random.randint(0, len(extracted_rules))]}\n\n")

Extracted 1447 rules from the PDF.

For example, a random chunk in the rules is:
 D7.2.3 Staging - The vehicle is staged at a staging line prior to the starting line. The timer starts
only after the vehicle crosses the start line.






## Vector Database

In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Create a ChromaDB collection
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("rules")


# Create a vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)


Storing the document chunks in a vector database can help in faster retrieval of relevant chunks. For simplicity we use the TfidfVectorizer to convert the chunks into vectors and store them in a chromadb database.

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents to get embeddings
embeddings = vectorizer.fit_transform(extracted_rules).toarray()

# Store the documents and their embeddings in ChromaDB
for i, (doc, embedding) in enumerate(zip(extracted_rules, embeddings)):
    chroma_collection.add(
        ids=[str(i)],
        embeddings=[embedding.tolist()],
        metadatas=[{"text": doc}]
    )

# Define the query context
query_context = "The fuel tank must be securely attached."

# Transform the query context to get its embedding
query_embedding = vectorizer.transform([query_context]).toarray()


Insert of existing embedding ID: 0
Add of existing embedding ID: 0
Insert of existing embedding ID: 1
Add of existing embedding ID: 1
Insert of existing embedding ID: 2
Add of existing embedding ID: 2
Insert of existing embedding ID: 3
Add of existing embedding ID: 3
Insert of existing embedding ID: 4
Add of existing embedding ID: 4
Insert of existing embedding ID: 5
Add of existing embedding ID: 5
Insert of existing embedding ID: 6
Add of existing embedding ID: 6
Insert of existing embedding ID: 7
Add of existing embedding ID: 7
Insert of existing embedding ID: 8
Add of existing embedding ID: 8
Insert of existing embedding ID: 9
Add of existing embedding ID: 9
Insert of existing embedding ID: 10
Add of existing embedding ID: 10
Insert of existing embedding ID: 11
Add of existing embedding ID: 11
Insert of existing embedding ID: 12
Add of existing embedding ID: 12
Insert of existing embedding ID: 13
Add of existing embedding ID: 13
Insert of existing embedding ID: 14
Add of existing em

## Top-k Retrieval and Context Generation

For creating the context we can use the top-k retrieved chunks and concatenate them to form the context. Below, there is an example of how the context can be generated.

In [23]:
def get_top_k_retrieved_chunks(query, chroma_collection, top_k=5, display_results=False):
    # 1. embed the query
    query_embedding = vectorizer.transform([query]).toarray()

    # 2. query the collection
    top_k = 5
    results = chroma_collection.query(query_embeddings=query_embedding, n_results=top_k)

    # 3. get the results
    top_k_retrieved_chunks = []
    if display_results:
        print(f"The top {top_k} retrieved chunks are:\n\n")
    for metadata in results["metadatas"][0]:
        top_k_retrieved_chunks.append(metadata["text"])
        if display_results:
            print(metadata["text"]+ "\n")

    return top_k_retrieved_chunks


_ = get_top_k_retrieved_chunks("What is DNF?", chroma_collection, display_results=True)



The top 5 retrieved chunks are:


D9.1.7 Acceleration Skidpad Autocross Endurance Trackdrive
DOO 2 s 0.2 s 2 s 2 s 2 s
OC DNF DNF 10 s 10 s 10 s
USS DNF DNF DNF n/a −50 points

D9.1.8 DNF equals zero points for that run.

S2.9.2 The judges will not evaluate any vehicle that is presented at the cost and manufacturing event,
in what they consider to be an unfinished state and will award zero points for the entire event.

S3.5.2 The judges will not evaluate any vehicle that is presented at the design event in what they
consider to be an unfinished state and will award zero points for the entire design event.

D9.1.11 Each run with an incorrect number of laps at skidpad is classified as DNF.



## Prompt-based Generation

Now, putting it all together, we can create a Retrieval Augmented Generation (RAG) pipeline for question answering. We can use the prompt-based generation to generate the answer based on the context generated from the top-k retrieved chunks.

In [24]:
QUERY = "What are the requirements for the emergency break system regarding opening of the SDC and start of deceleration? You can give multiple answers."

# Get the top k retrieved chunks
top_k_retrieved_chunks = get_top_k_retrieved_chunks(QUERY, chroma_collection, top_k=10)

# extend the context with the top k retrieved chunks
# and get the answer
QA_input = {
    'question': QUERY,
    # context with \n separator
    'context': "\n".join(top_k_retrieved_chunks)
}
res = question_answering_pipeline(QA_input)

print(res)





{'score': 0.14702357351779938, 'start': 106, 'end': 128, 'answer': 'must not exceed 200 ms'}


In [None]:
# extend the context with the top k retrieved chunks
# and get the answer


print(res)
messages = [
    {"role": "system", "content": "Here is some information about Formula Student rules. : " + "\n".join(top_k_retrieved_chunks)},
    {"role": "user", "content": QUERY},
]
res = mistral_pipeline(messages)
print(res)
