## Short-form QnA

In this notebook, we construct a Q-RAG agent and a RAG agent and evaluate their performance. In a realistic use case, developers would need to set up a graph-like database to build the core elements of Q-RAG (e.g., question-chunk connections); this notebook shows that Q-RAG can be replicated easily using a vectorstore and document metadata. We use LangChain, a library that provides wrappers for common frameworks and is quick to set up, which makes it convenient for demo purposes.

### Set-up

In [1]:
import pandas as pd
import numpy as np
from copy import deepcopy
import json
import re
import random

from typing import List

from datasets import load_dataset, Dataset

from nltk import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents.base import Document
from langchain_openai import ChatOpenAI

Some important technical details before we start:

- LLM: `gpt-4o-mini`
- Vectorstore: `FAISS`
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2`
- Evaluation Dataset: `rajpurkar/squad` (HuggingFace)


In [2]:
OPENAI_API_KEY = ""
OPENAI_MODEL = "gpt-4o-mini"

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

DATASET = "rajpurkar/squad"

FAISS_PATH = "./vectorstore/squad"
RESULTS_PATH = "./results/squad"

### Embedding Model

Here we load our embedding model `all-MiniLM-L6-v2`, which is a good option because it is lightweight and therefore enables fast inference. To keep the notebook accessible to everyone, the model is loaded on the CPU by default.

In [3]:
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL, 
    model_kwargs={"device": "cpu"}
)

### Dataset

Here, we take the validation split of the dataset for evaluation. Additionally, we only sample 1,000 datapoints out of the dataset, using a fixed random seed.

In [4]:
full_dataset = load_dataset(DATASET, split="validation")

random.seed(13)
rand_indices = random.sample(range(0, len(full_dataset)), 1000)
dataset = full_dataset.select(rand_indices)

dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 1000
})
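
Because the sample is drawn with a fixed seed, re-running the cell selects the same rows. The following is a minimal standard-library sketch of that property; the helper name `sample_indices` and the population size are illustrative assumptions, not part of the notebook.

```python
import random
from typing import List


def sample_indices(population_size: int, k: int, seed: int = 13) -> List[int]:
    """Draw k distinct row indices reproducibly (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(range(population_size), k)


# Assumed population size, used only for illustration.
first = sample_indices(10_570, 1000)
second = sample_indices(10_570, 1000)
print(first == second, len(set(first)))
```

Using a local `random.Random` instance instead of `random.seed` is a design choice: it keeps the sampling reproducible even if other cells also consume random numbers.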

Next, we create a corpus, grouping the existing `context` chunks provided in the original dataset together.

In [5]:
document_df = pd.DataFrame({
    "id": full_dataset["id"],
    "title": full_dataset["title"],
    "text": full_dataset["context"],
    "type": "chunk"
})

document_df = document_df.drop_duplicates(["text"])
document_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2067 entries, 0 to 10565
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2067 non-null   object
 1   title   2067 non-null   object
 2   text    2067 non-null   object
 3   type    2067 non-null   object
dtypes: object(4)
memory usage: 80.7+ KB
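
In SQuAD, several questions share the same context passage, which is why deduplicating on `text` shrinks the table to one row per passage. The sketch below mimics that step in plain Python on made-up rows (the helper `drop_duplicate_texts` is hypothetical). It also shows a side effect worth keeping in mind: the surviving `id` is the one of the first question that referenced the passage.

```python
from typing import List


def drop_duplicate_texts(rows: List[dict]) -> List[dict]:
    """Keep the first row for each distinct `text` value (hypothetical helper)."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique_rows.append(row)
    return unique_rows


# Toy rows: two questions point at the same passage.
rows = [
    {"id": "q1", "text": "Passage A"},
    {"id": "q2", "text": "Passage A"},
    {"id": "q3", "text": "Passage B"},
]
unique_rows = drop_duplicate_texts(rows)
print([row["id"] for row in unique_rows])
```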


The documents are then further split into small chunks (around 300 characters each, with an overlap of 100 characters) for better retrieval. This chunk corpus is then loaded into a FAISS instance, which is stored locally.

In [6]:
loader = DataFrameLoader(document_df, page_content_column="text")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 100
)

document_chunks = loader.load()
document_chunks = text_splitter.split_documents(document_chunks)
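
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only a sketch under stated assumptions: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its real chunk boundaries will differ, and the function name `sliding_window_chunks` is made up for this illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(700))  # 700 characters of toy data
chunks = sliding_window_chunks(text)
print(len(chunks), [len(chunk) for chunk in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks, at the cost of indexing some text twice.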

In [7]:
vectorstore_db = FAISS.from_documents(
    documents=document_chunks, 
    embedding=embedding_model, 
    normalize_L2=True
)

vectorstore_db.save_local(f"{FAISS_PATH}/chunk-vectorstore")

### Retriever

Now, we initialize our vectorstore retrievers. For this design, we have two main retrievers serving different purposes:

- Chunk vectorstore: used to obtain relevant chunks given a query, as in standard RAG. We only need to load the indexed vectors from our local storage.
- Question vectorstore: used to retrieve similar "model questions". We need to create a new FAISS vectorstore instance with a default "mock question" inside (simply because it is not possible to create an empty database).

In [8]:
chunks_vectorstore_db = FAISS.load_local(
    folder_path=f"{FAISS_PATH}/chunk-vectorstore", 
    embeddings=embedding_model, 
    normalize_L2=True,
    allow_dangerous_deserialization=True,
)

We should pay attention to the structure of the "Mock Question" template. Its metadata has a list-type attribute called `connections`, which later can be used to store IDs of relevant documents. This helps to replicate the model question-chunk connections in Q-RAG. 

In [9]:
question_vectorstore_db = FAISS.from_documents(
    documents=[
        Document(page_content="Mock Question", metadata={
            "id": 0, 
            "type": "question", 
            "connections": []
        })],
    embedding=embedding_model
)
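
The role of the `connections` attribute can be sketched without any vectorstore. In the toy example below, all IDs, texts, and the helper `chunks_for_questions` are assumptions made for illustration; the point is only that a matched question is resolved to chunks by following its stored IDs.

```python
from typing import Dict, List

# Toy chunk corpus keyed by chunk ID (IDs and texts are made up).
chunks: Dict[str, str] = {
    "c1": "The Normans settled in Normandy in the 10th century.",
    "c2": "Super Bowl 50 was played in Santa Clara, California.",
}

# Each model question carries the IDs of the chunks it is connected to.
questions: List[dict] = [
    {"id": 0, "text": "Mock Question", "connections": []},
    {"id": "q1", "text": "Where was Super Bowl 50 held?", "connections": ["c2"]},
]


def chunks_for_questions(matched_questions: List[dict]) -> List[str]:
    """Follow `connections` from matched questions to chunk texts, without duplicates."""
    chunk_ids = {cid for question in matched_questions for cid in question["connections"]}
    return [chunks[cid] for cid in sorted(chunk_ids)]


print(chunks_for_questions(questions))
```

Because the mock question has an empty `connections` list, matching it contributes no chunks, so it is harmless as a placeholder.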

### LLM Client

This cell loads OpenAI's `gpt-4o-mini` as our default LLM client.

In [10]:
llm_agent = ChatOpenAI(
    api_key=OPENAI_API_KEY,
    model=OPENAI_MODEL,
    temperature=0.0
)

### RAG Agent

A RAG Agent class template is created here, unifying the retrievers and the generator into one pipeline. This agent has the following functionality:

- Use the chunk database or the question database to perform retrieval. Besides the generic chunk retrieval process, Q-RAG also retrieves relevant questions to obtain their connected context chunks.
- Construct the context string from the raw retrieval results to augment the user query. The complete prompt is then processed by the LLM generator to obtain an answer.

As a side note, the SQuAD dataset focuses on short-form question answering, meaning that the answers are very short and concise. Here, we engineer our prompt template to produce the same behavior.

In [11]:
class RAGAgent:
    def __init__(
        self,
        client: ChatOpenAI,
        chunk_retriever: FAISS,
        question_retriever: FAISS,
        dataframe: pd.DataFrame
    ):
        self.dataframe = dataframe
        self.client = client
        self.chunk_retriever = chunk_retriever
        self.question_retriever = question_retriever

    
    def retrieve_context(
        self,
        query: str,
        top_k: int,
        distance_threshold: float,
        retrieve_questions: bool = False,
        **kwargs
    ):
        """
        Retrieve relevant chunks/questions given query
        """
        retriever = self.chunk_retriever if not retrieve_questions else self.question_retriever
        
        docs_with_metadata = retriever.similarity_search_with_score(
            query=query,
            k=top_k,
            **kwargs
        )
        filtered_docs = [doc for doc, score in docs_with_metadata if score <= distance_threshold] 
        
        if retrieve_questions:
            doc_ids = [doc for relevant_docs in filtered_docs for doc in relevant_docs.metadata["connections"]]
        else:
            doc_ids = [doc.metadata["id"] for doc in filtered_docs]

        filtered_docs = [DataFrameLoader(pd.DataFrame(self.dataframe[self.dataframe["id"] == doc_id])).load()[0] for doc_id in set(doc_ids)]
        
        return filtered_docs
    

    def generate_response(
        self,
        question: str,
        retrieved_docs: List[Document]
    ):
        """
        Generate response based on input query and raw context documents
        """
        documents = ["Title:" + str(doc.metadata["title"]) + "\n" + str(doc.page_content) for doc in retrieved_docs]
        context_str = "\n\n".join(documents)
    
        prompt = ChatPromptTemplate.from_messages([
             ("system", """
                Find the answer to the user's QUESTION by using relevant keywords or phrases from the CONTEXT above. 
                Keep the answer extremely short. 
                You are not allowed to include any of your own words. You are only allowed to use words from the CONTEXT. 
              """),
            ("human", "### CONTEXT\n{context}\n### QUESTION\n{question}")
        ])

        chain = prompt | self.client
        response = chain.invoke({"question": question, "context": context_str})

        return response.content
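
The `distance_threshold` used in `retrieve_context` is easier to interpret as a cosine similarity. The sketch below assumes that the FAISS index returns squared Euclidean distances over unit-normalized vectors (the behaviour of a flat L2 index with `normalize_L2=True`); under a different distance strategy the mapping would not hold. For unit vectors, the squared distance equals 2 - 2·cos, so a threshold of 1.5 corresponds to a cosine similarity of at least 0.25. The function names are illustrative.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(squared_distance: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - squared_distance / 2.0


a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))  # unit vector 60 degrees from a

print(cosine_from_squared_l2(squared_l2(a, b)))  # close to cos(60 degrees) = 0.5
print(cosine_from_squared_l2(1.5))               # similarity implied by a threshold of 1.5
```

Seen this way, a threshold of 1.5 is permissive: it only discards candidates that are nearly unrelated to the query, so `top_k` does most of the filtering.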
        

### Helper functions

Before we start, we define two evaluation metrics:

- BLEU score: measures the n-gram precision of a candidate text against a list of references. Here only unigrams are weighted, so it gives a rough idea of how closely the generated answers match the ground truth answers.
- Retrieval score: an ad hoc binary score that is 1 if the ground truth context is among the retrieved documents and 0 otherwise.

In [12]:
def compute_bleu_score(ref_list: List[str], cand: str):
    """
    BLEU score
    """
    smoothing_func = SmoothingFunction().method1

    reference = [word_tokenize(re.sub(r'[^\w\s]','', ref.lower())) for ref in ref_list]
    candidate = word_tokenize(re.sub(r'[^\w\s]','', cand.lower()))

    weight_configs = (1, 0, 0, 0)

    bleu_score = sentence_bleu(
        references=reference, 
        hypothesis=candidate, 
        weights=weight_configs, 
        smoothing_function=smoothing_func
    )

    return bleu_score


def compute_retrieval_score(ground_truth_context: str, retrieved_context: List[Document]):
    """
    Whether the ground truth context document is retrieved 
    """
    retrieved_context_str = [doc.page_content for doc in retrieved_context]
    return 1 if ground_truth_context in retrieved_context_str else 0
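
With weights `(1, 0, 0, 0)`, BLEU reduces to clipped unigram precision multiplied by a brevity penalty. The following pure-Python sketch reproduces that core computation so the scores are easier to read. It is an approximation under stated assumptions: it uses whitespace tokenization and no smoothing, so it can differ from the NLTK result (for example when there is no overlap at all), and `unigram_bleu` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def unigram_bleu(references: List[str], candidate: str) -> float:
    """Clipped unigram precision times brevity penalty (no smoothing)."""
    cand_tokens = candidate.lower().split()
    ref_token_lists = [ref.lower().split() for ref in references]
    if not cand_tokens:
        return 0.0

    # Clip each candidate token count by its maximum count in any single reference.
    max_ref_counts: Counter = Counter()
    for ref_tokens in ref_token_lists:
        for token, count in Counter(ref_tokens).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(
        min(count, max_ref_counts[token])
        for token, count in Counter(cand_tokens).items()
    )
    precision = clipped / len(cand_tokens)

    # The brevity penalty uses the reference length closest to the candidate length.
    cand_len = len(cand_tokens)
    ref_len = min(
        (len(ref_tokens) for ref_tokens in ref_token_lists),
        key=lambda n: (abs(n - cand_len), n),
    )
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return precision * brevity_penalty


print(unigram_bleu(["Denver Broncos"], "Denver Broncos"))
print(unigram_bleu(["Denver Broncos"], "the Denver Broncos"))
print(unigram_bleu(["the Denver Broncos"], "Broncos"))
```

The second call shows why the prompt asks for extremely short answers: every extra word that is not in a reference lowers the precision, even when the answer is correct.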

This cell creates an evaluation loop, which iterates through each row in the dataset to generate the results and evaluate them.

In [13]:
def evaluation(
    agent: RAGAgent, 
    test_set: Dataset, 
    top_k: int, 
    q_top_k: int,
    distance_threshold: float,
    q_distance_threshold: float
):
    results = []

    for i in range(len(test_set)):
        id = test_set[i]["id"]
        question = test_set[i]["question"]
        ground_truth_answer = test_set[i]["answers"]["text"]
        ground_truth_context = test_set[i]["context"]

        relevant_chunks = agent.retrieve_context(
            query=question, 
            top_k=top_k, 
            distance_threshold=distance_threshold,
            retrieve_questions=False
        )

        relevant_question_chunks = agent.retrieve_context(
            query=question, 
            top_k=q_top_k, 
            distance_threshold=q_distance_threshold,
            retrieve_questions=True
        )

        context = relevant_question_chunks + relevant_chunks
        response = agent.generate_response(question, context)

        bleu_score = compute_bleu_score(ground_truth_answer, response)
        retrieval_score = compute_retrieval_score(ground_truth_context, context)

        results.append({
            "question": question,
            "answer": response,
            "ground_truth_answer": ground_truth_answer,
            "context": [chunk.page_content for chunk in relevant_chunks],
            "q_context": [chunk.page_content for chunk in relevant_question_chunks],
            "ground_truth_context": ground_truth_context,
            "bleu_score": bleu_score,
            "retrieval_score": retrieval_score,
            "id": id
        })
        
    return results

### Demo #1 

In this demo, we evaluate the agent before and after the *corrective loop*.

#### Evaluation Dataset

In [43]:
dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 1000
})

#### Naive RAG Agent Evaluation

We first initiate our baseline RAG Agent with access to the chunk database and the (empty) question database.

In [14]:
rag_agent = RAGAgent(
    client=llm_agent,
    chunk_retriever=chunks_vectorstore_db,
    question_retriever=question_vectorstore_db,
    dataframe=document_df
)

In [15]:
results = evaluation(
    agent=rag_agent,
    test_set=dataset,
    top_k=5,
    q_top_k=1,
    distance_threshold=1.5,
    q_distance_threshold=1.5
)

In [16]:
with open(f"{RESULTS_PATH}/run-0-demo-1.json", "w") as file:
    file.write(json.dumps(results))

In [17]:
bleu_scores = [result["bleu_score"] for result in results] 
np.mean(bleu_scores)

0.7147777497184254

In [18]:
retrieval_scores = [result["retrieval_score"] for result in results] 
np.mean(retrieval_scores)

0.902

#### Q-RAG Corrective Loop

In this section, we are going to perform the corrective loop. This requires an external source to provide the ground truth answers. In reality, we can have human evaluators going through poor responses (e.g., responses reported by users, responses with low metric values) to correct them.

We start by creating a copy of the initial question database (currently empty).

In [19]:
new_question_vectorstore_db = deepcopy(question_vectorstore_db)

We then iterate through the results to detect poor answers, which are defined as those with BLEU < 0.6. For each of these instances, we use the ground truth answer to retrieve the most relevant chunks. The IDs of these chunks are then put into the `connections` attribute of the corresponding question. These question documents are then appended to the question database.

In the code, one can observe that we use both the original question and the ground truth answer to retrieve relevant chunks (Line 9). This is because SQuAD is a short-form QnA dataset, meaning that the expected answers are extremely short and concise, consisting of just one or two keywords. The answer alone is therefore inappropriate for context retrieval, and concatenating the original question with the ground truth answer is a reasonable option for this task. In most cases, including just a few keywords from the corrected answers is enough to significantly improve the retrieval results.

In [20]:
for i, score in enumerate(bleu_scores):
    if score < 0.6:
        id = dataset[i]["id"]
        question = dataset[i]["question"]
        ground_truth_answer_list = sorted(dataset[i]["answers"]["text"], key=len)
        ground_truth_answer = ground_truth_answer_list[-1]

        relevant_chunk_list = rag_agent.retrieve_context(
            query=question + " " + ground_truth_answer,
            top_k=2,
            distance_threshold=1.5
        )

        relevant_chunk_id_list = [chunk.metadata["id"] for chunk in relevant_chunk_list]

        question_document = Document(
            page_content=question,
            metadata={
                "id": id,
                "type": "question", 
                "connections": relevant_chunk_id_list
            }
        )

        new_question_vectorstore_db.add_documents([question_document])
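
The corrective loop can be summarized independently of FAISS and LangChain. The sketch below works on plain dictionaries and takes the retrieval step as a callable; `build_question_records`, `fake_retriever`, and the toy data are assumptions introduced for illustration only.

```python
from typing import Callable, List


def build_question_records(
    results: List[dict],
    retrieve_chunk_ids: Callable[[str], List[str]],
    bleu_threshold: float = 0.6,
) -> List[dict]:
    """Create one question record per poorly answered result."""
    records = []
    for result in results:
        if result["bleu_score"] >= bleu_threshold:
            continue  # answer was good enough; nothing to correct
        # Use the longest ground-truth answer, since it carries the most keywords.
        longest_answer = max(result["ground_truth_answer"], key=len)
        query = result["question"] + " " + longest_answer
        records.append({
            "id": result["id"],
            "question": result["question"],
            "connections": retrieve_chunk_ids(query),
        })
    return records


def fake_retriever(query: str) -> List[str]:
    """Stand-in for the chunk retriever (hypothetical)."""
    return ["c2"] if "Santa Clara" in query else []


toy_results = [
    {"id": "q1", "question": "Where was Super Bowl 50 held?",
     "ground_truth_answer": ["Santa Clara", "Santa Clara, California"], "bleu_score": 0.2},
    {"id": "q2", "question": "Who won Super Bowl 50?",
     "ground_truth_answer": ["Denver Broncos"], "bleu_score": 1.0},
]
records = build_question_records(toy_results, fake_retriever)
print(records)
```

Only the poorly answered question produces a record, and its `connections` come from a query that includes the corrected answer rather than the question alone.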

In [21]:
new_question_vectorstore_db.save_local(f"{FAISS_PATH}/question-vectorstore")

In [22]:
new_question_vectorstore_db = FAISS.load_local(
    folder_path=f"{FAISS_PATH}/question-vectorstore", 
    embeddings=embedding_model, 
    normalize_L2=True,
    allow_dangerous_deserialization=True,
)

#### Q-RAG Agent Evaluation

We create a new rag agent with access to the newly indexed question database.

In [23]:
new_rag_agent = RAGAgent(
    client=llm_agent,
    chunk_retriever=chunks_vectorstore_db,
    question_retriever=new_question_vectorstore_db,
    dataframe=document_df
)

A new evaluation process takes place for this RAG agent. 

In [24]:
new_results = evaluation(
    agent=new_rag_agent,
    test_set=dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=1.5
)

In [25]:
with open(f"{RESULTS_PATH}/run-1-demo-1.json", "w") as file:
    file.write(json.dumps(new_results))

In [26]:
new_bleu_scores = [result["bleu_score"] for result in new_results] 
np.mean(new_bleu_scores)

0.7225012975337309

In [27]:
new_retrieval_scores = [result["retrieval_score"] for result in new_results] 
np.mean(new_retrieval_scores)

0.958

As one can observe, the average BLEU score improved slightly (from 0.715 to 0.723), while the context retrieval score improved substantially (from 0.902 to 0.958). This suggests that Q-RAG addressed the poor context retrieval cases quite well.

### Demo #2

The previous demo evaluated Q-RAG on the same questions that were ingested into the question database, which may not be fair since each question can be matched directly with itself. Therefore, in this demo, we build a new question dataset consisting of rephrased versions of the questions in the original dataset. This allows us to observe how Q-RAG performs when variants of the complex questions are asked.

#### New Evaluation Dataset

Here, we select only the "complex questions": those with BLEU < 0.6 when evaluated using vanilla RAG. These are the same questions that were ingested into the question vector database earlier.

This approach allows us to observe the retrieval performance of Q-RAG's chunk correction mechanism.

In [28]:
complex_question_dataset = Dataset.from_list([row for index, row in enumerate(dataset) if bleu_scores[index] < 0.6])
complex_question_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 344
})
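
The selection above relies on `dataset` and `bleu_scores` being aligned by position, and it keeps 344 of the 1,000 questions (34.4%). A minimal sketch of the same filter with an explicit alignment check follows; `select_below_threshold` and the toy rows are hypothetical.

```python
from typing import List, Sequence


def select_below_threshold(
    rows: Sequence[dict],
    scores: Sequence[float],
    threshold: float = 0.6,
) -> List[dict]:
    """Keep the rows whose positionally aligned score is below the threshold."""
    if len(rows) != len(scores):
        raise ValueError("rows and scores must be aligned one-to-one")
    return [row for row, score in zip(rows, scores) if score < threshold]


toy_rows = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]
toy_scores = [0.2, 1.0, 0.59]
selected = select_below_threshold(toy_rows, toy_scores)
print([row["id"] for row in selected])
```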

This cell rephrases the questions in the dataset. This will help us to evaluate the performance of Q-RAG's question-based retrieval more realistically.

In [29]:
def rephrase_question(row: dict):
    prompt = ChatPromptTemplate.from_messages([
        ("system", """
            You are provided with a QUESTION. 
            Your task is to transform the QUESTION into a REPHRASED QUESTION by changing only the wording. 
            Make sure to leave enough information in the REPHRASED QUESTION so that the answer does not change.
        """),
        ("user", "QUESTION: {question}\nREPHRASED QUESTION:")
    ])

    chain = prompt | llm_agent

    rephrased_question = chain.invoke({"question": row["question"]}).content
    row["question"] = re.sub("rephrased question:", "", rephrased_question.lower()).strip()
    
    return row
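
The post-processing step in `rephrase_question` can be checked in isolation. The sketch below isolates it in a hypothetical helper, `clean_rephrased`, and applies it to a made-up model output. Note that it lower-cases the whole question, so the rephrased dataset differs from the original in casing as well as wording.

```python
import re


def clean_rephrased(raw: str) -> str:
    """Lower-case the model output and drop any echoed 'rephrased question:' label."""
    return re.sub("rephrased question:", "", raw.lower()).strip()


# Made-up model output that echoes the label from the prompt.
print(clean_rephrased("REPHRASED QUESTION: In which city was Super Bowl 50 played?"))
```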

In [30]:
rephrased_dataset = complex_question_dataset.map(rephrase_question)




#### Evaluation

We evaluate 3 different scenarios:

- Vanilla RAG on complex question dataset: This serves as a baseline for comparison.
- Q-RAG on complex question dataset: This shows us how ingestion of the complex questions can improve the retrieval.
- Q-RAG on rephrased complex question dataset: This shows us the performance of Q-RAG when met with variants of the complex questions. 

##### Vanilla RAG on complex questions (Baseline)

In [31]:
rag_results_complex = evaluation(
    agent=rag_agent,
    test_set=complex_question_dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=1.5
)

In [32]:
with open(f"{RESULTS_PATH}/run-0-demo-2.json", "w") as file:
    file.write(json.dumps(rag_results_complex))

In [33]:
np.mean([result["bleu_score"] for result in rag_results_complex])

0.2501785283062391

In [34]:
np.mean([result["retrieval_score"] for result in rag_results_complex])

0.7383720930232558

##### Q-RAG on complex questions

In [35]:
qrag_results_complex = evaluation(
    agent=new_rag_agent,
    test_set=complex_question_dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=1.5
)

In [36]:
with open(f"{RESULTS_PATH}/run-1-demo-2.json", "w") as file:
    file.write(json.dumps(qrag_results_complex))

In [37]:
np.mean([result["bleu_score"] for result in qrag_results_complex])

0.39212412498484117

In [38]:
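# The original cell averaged over `results` (the Demo #1 baseline, 0.902); this run's scores are in `qrag_results_complex`.
np.mean([result["retrieval_score"] for result in qrag_results_complex])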

##### Q-RAG on rephrased complex questions 

In [39]:
qrag_results_rephrased = evaluation(
    agent=new_rag_agent,
    test_set=rephrased_dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=1.5
)

In [40]:
with open(f"{RESULTS_PATH}/run-2-demo-2.json", "w") as file:
    file.write(json.dumps(qrag_results_rephrased))

In [41]:
np.mean([result["bleu_score"] for result in qrag_results_rephrased])

0.39325279700091936

In [42]:
np.mean([result["retrieval_score"] for result in qrag_results_rephrased])

0.8924418604651163

Depending on how the example questions are rephrased, the results may vary. Nevertheless, the Q-RAG agent achieved nearly the same BLEU score on the rephrased questions (0.393) as on the original complex questions (0.392), and its retrieval score on the rephrased questions (0.892) is well above the vanilla RAG baseline (0.738). This suggests that the question-based retrieval mechanism of Q-RAG is stable in practice.