## Short-form QnA

In this notebook, we construct and evaluate a Q-RAG agent and a baseline RAG agent. In a realistic use case, developers would need to set up a graph-like database to build the core elements of Q-RAG (e.g., question-chunk connections); this notebook shows that Q-RAG can be replicated easily using only a vectorstore and document metadata. We use LangChain, a library whose framework wrappers are quick and easy to set up, which makes it well suited to demo purposes.

### Set-up

In [1]:
import pandas as pd
import numpy as np
from copy import deepcopy
import json
import re
import random

from typing import List

from datasets import load_dataset, Dataset

from nltk import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents.base import Document
from langchain_openai import ChatOpenAI

Some important technical details before we start:

- LLM: `gpt-3.5-turbo`
- Vectorstore: `FAISS`
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2`
- Evaluation Dataset: `rajpurkar/squad` (HuggingFace)


In [11]:
OPENAI_API_KEY = ""
OPENAI_MODEL = "gpt-3.5-turbo"

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

DATASET = "rajpurkar/squad"

FAISS_PATH = "./vectorstore/squad"
RESULTS_PATH = "./results/squad"

### Embedding Model

Here we load our embedding model, `all-MiniLM-L6-v2`, a good option because it is lightweight and enables fast inference. To keep the notebook accessible to everyone, the model is loaded on the CPU by default.

In [3]:
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL, 
    model_kwargs={"device": "cpu"}
)

### Dataset

Here, we take the validation split of the dataset for evaluation and randomly sample 200 datapoints from it, using a fixed seed for reproducibility.

In [4]:
full_dataset = load_dataset(DATASET, split="validation")

random.seed(13)
rand_indices = random.sample(range(0, len(full_dataset)), 200)
dataset = full_dataset.select(rand_indices)

dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 200
})

Next, we build the corpus from the `context` passages of the full validation split, dropping duplicate passages.

In [5]:
document_df = pd.DataFrame({
    "id": full_dataset["id"],
    "title": full_dataset["title"],
    "text": full_dataset["context"],
    "type": "chunk"
})

document_df = document_df.drop_duplicates(["text"])
document_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2067 entries, 0 to 10565
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2067 non-null   object
 1   title   2067 non-null   object
 2   text    2067 non-null   object
 3   type    2067 non-null   object
dtypes: object(4)
memory usage: 80.7+ KB


The documents are then split further into smaller chunks (around 300 characters each, with a 100-character overlap) for better retrieval. This chunk corpus is then loaded into a FAISS instance, which is stored locally.

In [6]:
loader = DataFrameLoader(document_df, page_content_column="text")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 100
)

document_chunks = loader.load()
document_chunks = text_splitter.split_documents(document_chunks)
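
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line boundaries); `sliding_window_chunks` is a hypothetical helper that only illustrates how a 100-character overlap repeats text across neighbouring 300-character chunks.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # with the defaults, a new window starts every 200 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(700))  # 700-character dummy passage
chunks = sliding_window_chunks(sample)
print([len(chunk) for chunk in chunks])  # [300, 300, 300]
```

The last 100 characters of each chunk reappear at the start of the next one, so a sentence that straddles a chunk boundary is still likely to be fully contained in at least one chunk.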

In [7]:
vectorstore_db = FAISS.from_documents(
    documents=document_chunks, 
    embedding=embedding_model, 
    normalize_L2=True
)

vectorstore_db.save_local(FAISS_PATH)

### Retriever

Now we initialize our vectorstore retrievers. This design uses two retrievers serving different purposes:

- Chunk vectorstore: used to obtain relevant chunks given a query, as in standard RAG. We only need to load the indexed vectors from local storage.
- Question vectorstore: used to retrieve similar "model questions". We need to create a new FAISS vectorstore instance containing a default "mock question" (simply because it is not possible to create an empty database).

In [8]:
chunks_vectorstore_db = FAISS.load_local(
    folder_path=FAISS_PATH, 
    embeddings=embedding_model, 
    normalize_L2=True,
    allow_dangerous_deserialization=True,
)

We should pay attention to the structure of the "Mock Question" template. Its metadata has a list-type attribute called `connections`, which later can be used to store IDs of relevant documents. This helps to replicate the model question-chunk connections in Q-RAG. 

In [9]:
question_vectorstore_db = FAISS.from_documents(
    documents=[
        Document(page_content="Mock Question", metadata={
            "id": 0, 
            "type": "question", 
            "connections": []
        })],
    embedding=embedding_model
)
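
The role of the `connections` attribute can be sketched without any vectorstore. The snippet below uses plain dictionaries in place of LangChain `Document` objects, and the second question and its chunk IDs are made up for illustration; `connected_chunk_ids` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List

# Hypothetical stand-ins for question documents: plain dicts instead of
# LangChain `Document` objects, with made-up question text and chunk IDs.
model_questions: List[Dict] = [
    {"page_content": "Mock Question",
     "metadata": {"id": 0, "type": "question", "connections": []}},
    {"page_content": "An example model question",
     "metadata": {"id": "q-17", "type": "question", "connections": ["c-4", "c-9"]}},
]


def connected_chunk_ids(questions: List[Dict]) -> List[str]:
    """Flatten the `connections` lists of the given questions into chunk IDs."""
    return [chunk_id for q in questions for chunk_id in q["metadata"]["connections"]]


print(connected_chunk_ids(model_questions))  # ['c-4', 'c-9']
```

Because the mock question has an empty `connections` list, it contributes no chunks even if it happens to be retrieved.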

### LLM Client

This cell loads OpenAI's `gpt-3.5-turbo` as our default LLM client.

In [12]:
llm_agent = ChatOpenAI(
    api_key=OPENAI_API_KEY,
    model=OPENAI_MODEL,
    temperature=0.0
)

### RAG Agent

A RAG Agent class template is created here, unifying the retrievers and the generator into one pipeline. This agent has the following functionality:

- Use the chunk database or the question database to perform retrieval. Recall that, besides the generic chunk retrieval process, Q-RAG also retrieves relevant questions in order to obtain the context chunks connected to them.
- Construct the context string from the raw retrieval results to augment the input user query. The complete prompt is then processed by the LLM generator to obtain an answer.

As a side note, the SQuAD dataset focuses on short-form question-answering, meaning that the answers will be very short and concise. Here, we engineer our prompt template to have the same behavior. 

In [102]:
class RAGAgent:
    def __init__(
        self,
        client: ChatOpenAI,
        chunk_retriever: FAISS,
        question_retriever: FAISS,
        dataframe: pd.DataFrame
    ):
        self.dataframe = dataframe
        self.client = client
        self.chunk_retriever = chunk_retriever
        self.question_retriever = question_retriever

    
    def retrieve_context(
        self,
        query: str,
        top_k: int,
        distance_threshold: float,
        retrieve_questions: bool = False,
        **kwargs
    ):
        """
        Retrieve relevant chunks/questions given query
        """
        retriever = self.chunk_retriever if not retrieve_questions else self.question_retriever
        
        docs_with_metadata = retriever.similarity_search_with_score(
            query=query,
            k=top_k,
            **kwargs
        )
        filtered_docs = [doc for doc, score in docs_with_metadata if score <= distance_threshold] 
        
        if retrieve_questions:
            doc_ids = [doc for relevant_docs in filtered_docs for doc in relevant_docs.metadata["connections"]]
        else:
            doc_ids = [doc.metadata["id"] for doc in filtered_docs]

        filtered_docs = [DataFrameLoader(pd.DataFrame(self.dataframe[self.dataframe["id"] == doc_id])).load()[0] for doc_id in set(doc_ids)]
        
        return filtered_docs
    

    def generate_response(
        self,
        question: str,
        retrieved_docs: List[Document]
    ):
        """
        Generate response based on input query and raw context documents
        """
        documents = ["Title:" + str(doc.metadata["title"]) + "\n" + str(doc.page_content) for doc in retrieved_docs]
        context_str = "\n\n".join(documents)
    
        prompt = ChatPromptTemplate.from_messages([
             ("system", """
                ### INSTRUCTION
                Answer the users QUESTION by extracting from the CONTEXT text above. 
                You should only use keywords from the provided CONTEXT to form your answer.
                Keep your answer concise and short, just in a few words.
              """),
            ("human", "### CONTEXT\n{context}\n### QUESTION\n{question}")
        ])

        chain = prompt | self.client
        response = chain.invoke({"question": question, "context": context_str})

        return response.content
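
The filtering step in `retrieve_context` can be illustrated on toy data. The class above deduplicates document IDs with `set(doc_ids)`, which does not guarantee any particular order; the sketch below shows one possible alternative that keeps the ranking order. The `(id, distance)` pairs and the helper name `filter_and_dedupe` are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical (document_id, distance) pairs, as a similarity search might return
# them, ordered from closest to farthest. Several chunks can share one parent ID.
scored_hits: List[Tuple[str, float]] = [
    ("doc-3", 0.42),
    ("doc-1", 0.77),
    ("doc-3", 0.91),
    ("doc-8", 1.62),
]


def filter_and_dedupe(hits: List[Tuple[str, float]], distance_threshold: float) -> List[str]:
    """Keep hits within the distance threshold, then drop repeated IDs in rank order."""
    kept = [doc_id for doc_id, score in hits if score <= distance_threshold]
    # dict.fromkeys keeps the first occurrence of each key, so ranking order survives
    return list(dict.fromkeys(kept))


print(filter_and_dedupe(scored_hits, distance_threshold=1.5))  # ['doc-3', 'doc-1']
```

Keeping the closest documents first can matter when the context string is long, since the order of `retrieved_docs` determines the order in which passages appear in the prompt.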
        

### Evaluation

In this section, we evaluate the agent before and after the *corrective loop*, the process that allows poorly formed responses to be corrected.

We first initiate our baseline RAG Agent with access to the chunk database and the (empty) question database.

In [103]:
rag_agent = RAGAgent(
    client=llm_agent,
    chunk_retriever=chunks_vectorstore_db,
    question_retriever=question_vectorstore_db,
    dataframe=document_df
)

Two evaluation metrics are utilized:

- BLEU score: measures the n-gram precision between a candidate text and a list of references. Here only unigrams are weighted, so the score reflects word overlap with the ground truth answers and gives a rough idea of answer quality.
- Retrieval score: an ad hoc binary score that checks whether the ground truth context has been retrieved.

In [15]:
def compute_bleu_score(ref_list: List[str], cand: str):
    """
    BLEU score
    """
    smoothing_func = SmoothingFunction().method1

    reference = [word_tokenize(re.sub(r'[^\w\s]','', ref.lower())) for ref in ref_list]
    candidate = word_tokenize(re.sub(r'[^\w\s]','', cand.lower()))

    weight_configs = (1, 0, 0, 0)

    bleu_score = sentence_bleu(
        references=reference, 
        hypothesis=candidate, 
        weights=weight_configs, 
        smoothing_function=smoothing_func
    )

    return bleu_score


def compute_retrieval_score(ground_truth_context: str, retrieved_context: List[Document]):
    """
    Whether the ground truth context document is retrieved 
    """
    retrieved_context_str = [doc.page_content for doc in retrieved_context]
    return 1 if ground_truth_context in retrieved_context_str else 0
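
To see what a unigram-only BLEU score rewards, the following is a simplified, unsmoothed re-implementation in plain Python. It is a sketch rather than NLTK's `sentence_bleu`: `tokenize` is a whitespace stand-in for `word_tokenize`, and with no smoothing a candidate without any matching word scores exactly 0.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase, strip punctuation, split on whitespace (stand-in for word_tokenize)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def unigram_bleu(references: List[str], candidate: str) -> float:
    """Unsmoothed BLEU-1: clipped unigram precision times the brevity penalty."""
    cand_tokens = tokenize(candidate)
    if not cand_tokens:
        return 0.0
    ref_token_lists = [tokenize(ref) for ref in references]

    # Clip each candidate token count by its maximum count in any single reference
    max_ref_counts = Counter()
    for ref_tokens in ref_token_lists:
        for token, count in Counter(ref_tokens).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token])
                  for token, count in Counter(cand_tokens).items())
    precision = clipped / len(cand_tokens)

    # Brevity penalty against the reference length closest to the candidate length
    c = len(cand_tokens)
    r = min((len(ref) for ref in ref_token_lists), key=lambda n: (abs(n - c), n))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * precision


refs = ["narcotic drugs"]
print(unigram_bleu(refs, "narcotic drugs"))  # 1.0
print(round(unigram_bleu(refs, "Controlled substances, such as narcotic drugs."), 3))  # 0.333
```

The second call shows why the prompt asks for short answers: a correct but wordy response matches only 2 of its 6 tokens, so its precision drops to 1/3 even though it contains the full ground truth answer.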

This cell creates an evaluation loop, which iterates through each row in the dataset to generate the results and evaluate them.

In [16]:
def evaluation(
    agent: RAGAgent, 
    test_set: Dataset, 
    top_k: int, 
    q_top_k: int,
    distance_threshold: float,
    q_distance_threshold: float
):
    results = []

    for i in range(len(test_set)):
        id = test_set[i]["id"]
        question = test_set[i]["question"]
        ground_truth_answer = test_set[i]["answers"]["text"]
        ground_truth_context = test_set[i]["context"]

        relevant_chunks = agent.retrieve_context(
            query=question, 
            top_k=top_k, 
            distance_threshold=distance_threshold,
            retrieve_questions=False
        )

        relevant_question_chunks = agent.retrieve_context(
            query=question, 
            top_k=q_top_k, 
            distance_threshold=q_distance_threshold,
            retrieve_questions=True
        )

        context = relevant_question_chunks + relevant_chunks
        response = agent.generate_response(question, context)

        bleu_score = compute_bleu_score(ground_truth_answer, response)
        retrieval_score = compute_retrieval_score(ground_truth_context, context)

        results.append({
            "question": question,
            "answer": response,
            "ground_truth_answer": ground_truth_answer,
            "context": [chunk.page_content for chunk in relevant_chunks],
            "q_context": [chunk.page_content for chunk in relevant_question_chunks],
            "ground_truth_context": ground_truth_context,
            "bleu_score": bleu_score,
            "retrieval_score": retrieval_score,
            "id": id
        })
        
    return results

In [17]:
results = evaluation(
    agent=rag_agent,
    test_set=dataset,
    top_k=5,
    q_top_k=1,
    distance_threshold=1.5,
    q_distance_threshold=0.25
)

In [18]:
import os
os.makedirs(RESULTS_PATH, exist_ok=True)  # make sure the output folder exists
with open(f"{RESULTS_PATH}/run-0.json", "w") as file:
    file.write(json.dumps(results))

In [19]:
bleu_scores = [result["bleu_score"] for result in results] 
np.mean(bleu_scores)

0.7302206150734558

In [20]:
retrieval_scores = [result["retrieval_score"] for result in results] 
np.mean(retrieval_scores)

0.925

### Q-RAG Corrective Loop

In this section, we perform the corrective loop. This requires an external source to provide the ground truth answers. In practice, human evaluators can go through poor responses (e.g., responses reported by users or responses with low metric values) and correct them.

We start by creating a copy of the initial question database (currently empty apart from the mock question).

In [21]:
new_question_vectorstore_db = deepcopy(question_vectorstore_db)

We then iterate through the results to detect poor answers, defined as those with BLEU < 0.5. For each of these instances, we use the ground truth answer to retrieve the most relevant chunks. The IDs of these chunks are then stored in the `connections` attribute of the corresponding question. These question documents are then appended to the question database.

In the code, one can observe that we use both the original question and the ground truth answer to retrieve relevant chunks (Line 9). This is simply because SQuAD is a short-form QnA dataset, meaning that the expected answers are extremely short and concise, often consisting of just one or two keywords. The answer alone is therefore inappropriate as a query for context retrieval, and concatenating the original question with the ground truth answer is a reasonable option for this task. In most cases, including just a few keywords from the corrected answer is enough to significantly improve the retrieval results.

In [22]:
for i, score in enumerate(bleu_scores):
    if score < 0.5:
        id = dataset[i]["id"]
        question = dataset[i]["question"]
        ground_truth_answer_list = sorted(dataset[i]["answers"]["text"], key=len)
        ground_truth_answer = ground_truth_answer_list[-1]

        relevant_chunk_list = rag_agent.retrieve_context(
            query=question + " " + ground_truth_answer,
            top_k=2,
            distance_threshold=1.5
        )

        relevant_chunk_id_list = [chunk.metadata["id"] for chunk in relevant_chunk_list]

        question_document = Document(
            page_content=question,
            metadata={
                "id": id,
                "type": "question", 
                "connections": relevant_chunk_id_list
            }
        )

        new_question_vectorstore_db.add_documents([question_document])
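
The logic of this loop can be sketched independently of FAISS and the LLM. In the snippet below, `build_model_questions` and `toy_retriever` are hypothetical names, the result rows are made up, and the retriever is a keyword check standing in for `rag_agent.retrieve_context`; only the selection rule (BLEU < 0.5) and the query construction (question plus longest ground truth answer) follow the cell above.

```python
from typing import Callable, Dict, List


def build_model_questions(
    results: List[Dict],
    retrieve_chunk_ids: Callable[[str], List[str]],
    bleu_threshold: float = 0.5,
) -> List[Dict]:
    """Create one question record per poorly answered item (illustrative only)."""
    model_questions = []
    for item in results:
        if item["bleu_score"] >= bleu_threshold:
            continue  # answer is considered good enough, nothing to correct
        # Query with the question plus the longest ground truth answer
        longest_answer = sorted(item["ground_truth_answer"], key=len)[-1]
        query = item["question"] + " " + longest_answer
        model_questions.append({
            "page_content": item["question"],
            "metadata": {
                "id": item["id"],
                "type": "question",
                "connections": retrieve_chunk_ids(query),
            },
        })
    return model_questions


def toy_retriever(query: str) -> List[str]:
    """Made-up retriever: returns chunk 'c-1' only when the answer keyword is present."""
    return ["c-1"] if "narcotic" in query else ["c-0"]


toy_results = [
    {"id": "a1", "question": "What was controlled?",
     "ground_truth_answer": ["drugs", "narcotic drugs"], "bleu_score": 0.2},
    {"id": "a2", "question": "Who refused to pay?",
     "ground_truth_answer": ["Mr Costa"], "bleu_score": 0.9},
]
records = build_model_questions(toy_results, toy_retriever)
print([(r["metadata"]["id"], r["metadata"]["connections"]) for r in records])  # [('a1', ['c-1'])]
```

In this toy setup the question alone would have returned `c-0`; appending the answer keyword is what steers the retriever to `c-1`, which mirrors the motivation for concatenating question and answer.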

We create a new RAG agent with access to the newly indexed question database.

In [104]:
new_rag_agent = RAGAgent(
    client=llm_agent,
    chunk_retriever=chunks_vectorstore_db,
    question_retriever=new_question_vectorstore_db,
    dataframe=document_df
)

A new evaluation run takes place for this RAG agent. Note that the threshold for the question distance, `q_distance_threshold`, is quite tight, ensuring that only the most similar questions are retrieved. This prevents the agent from drawing context chunks from irrelevant model questions.

In [24]:
new_results = evaluation(
    agent=new_rag_agent,
    test_set=dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=0.25
)
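
To get a feel for how tight a threshold of 0.25 is compared with 1.5, it can help to translate the distances into cosine similarities. The sketch below rests on two assumptions that should be verified for your setup: that the embeddings are unit length, and that the score returned by the vectorstore is the squared Euclidean distance. Under those assumptions, the squared distance equals 2 - 2·cos, so a distance threshold d corresponds to a minimum cosine similarity of 1 - d/2. The helper `min_cosine_for` is hypothetical.

```python
import math

# Two unit-length toy vectors (2-D for readability)
u = (0.6, 0.8)
v = (1.0, 0.0)

squared_l2 = sum((a - b) ** 2 for a, b in zip(u, v))
cosine = sum(a * b for a, b in zip(u, v))

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
assert math.isclose(squared_l2, 2 - 2 * cosine)


def min_cosine_for(distance_threshold: float) -> float:
    """Cosine similarity implied by a squared-L2 threshold on unit vectors."""
    return 1 - distance_threshold / 2


print(min_cosine_for(1.5), min_cosine_for(0.25))  # 0.25 0.875
```

Read this way, the chunk threshold of 1.5 admits loosely related passages, while the question threshold of 0.25 only admits questions that are near-paraphrases of the query.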

In [25]:
with open(f"{RESULTS_PATH}/run-1.json", "w") as file:
    file.write(json.dumps(new_results))

In [26]:
new_bleu_scores = [result["bleu_score"] for result in new_results] 
np.mean(new_bleu_scores)

0.77242408767743

In [27]:
new_retrieval_scores = [result["retrieval_score"] for result in new_results] 
np.mean(new_retrieval_scores)

0.97

As one can observe, the average BLEU score improved modestly (from about 0.730 to 0.772) and the context retrieval score improved more clearly (from 0.925 to 0.97). While this evaluation is too simplistic to truly assess the improvements, it demonstrates the overall Q-RAG process. One can continue to experiment with the agents by testing how much better the Q-RAG agent handles similar complex questions than the generic one.

Let us do a sanity check with one example in the dataset:

In [106]:
example = [row for row in dataset if row["id"] == "5726c3da708984140094d0db"][0]
example

{'id': '5726c3da708984140094d0db',
 'title': 'European_Union_law',
 'context': 'The "freedom to provide services" under TFEU article 56 applies to people who give services "for remuneration", especially commercial or professional activity. For example, in Van Binsbergen v Bestuur van de Bedrijfvereniging voor de Metaalnijverheid a Dutch lawyer moved to Belgium while advising a client in a social security case, and was told he could not continue because Dutch law said only people established in the Netherlands could give legal advice. The Court of Justice held that the freedom to provide services applied, it was directly effective, and the rule was probably unjustified: having an address in the member state would be enough to pursue the legitimate aim of good administration of justice. The Court of Justice has held that secondary education falls outside the scope of article 56, because usually the state funds it, though higher education does not. Health care generally counts as a servic

Here, we would like to rephrase the original complex question into something else. This will help us to evaluate more realistically whether Q-RAG is able to retrieve the correct model question.

In [107]:
reformat_question_prompt = ChatPromptTemplate.from_messages([
    ("system", """
        Your task is to make changes to the original question while maintaining its original objective and meaning. 
        You may even choose to completely remove small details from the original question to make it a bit more different.  
     """),
    ("user", "\n{question}")
])

chain = reformat_question_prompt | llm_agent
response = chain.invoke({"question": example["question"]}).content
response

'What were the key aspects controlled in all member states in the case of Josemans v Burgemeester van Maastricht according to the Court of Justice?'

In [108]:
example["question"] = response
example_dataset = Dataset.from_list([example])

In [109]:
evaluation(
    agent=rag_agent,
    test_set=example_dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=0.25
)

[{'question': 'What were the key aspects controlled in all member states in the case of Josemans v Burgemeester van Maastricht according to the Court of Justice?',
  'answer': 'Freedom of establishment and the right to provide services.',
  'ground_truth_answer': ['narcotic drugs',
   'narcotic drugs',
   'narcotic drugs'],
  'context': ['Since its founding, the EU has operated among an increasing plurality of national and globalising legal systems. This has meant both the European Court of Justice and the highest national courts have had to develop principles to resolve conflicts of laws between different systems. Within the EU itself, the Court of Justice\'s view is that if EU law conflicts with a provision of national law, then EU law has primacy. In the first major case in 1964, Costa v ENEL, a Milanese lawyer, and former shareholder of an energy company, named Mr Costa refused to pay his electricity bill to Enel, as a protest against the nationalisation of the Italian energy corpo

In [110]:
evaluation(
    agent=new_rag_agent,
    test_set=example_dataset,
    top_k=5,
    q_top_k=2,
    distance_threshold=1.5,
    q_distance_threshold=0.25
)

[{'question': 'What were the key aspects controlled in all member states in the case of Josemans v Burgemeester van Maastricht according to the Court of Justice?',
  'answer': 'Controlled substances, such as narcotic drugs.',
  'ground_truth_answer': ['narcotic drugs',
   'narcotic drugs',
   'narcotic drugs'],
  'context': ['Since its founding, the EU has operated among an increasing plurality of national and globalising legal systems. This has meant both the European Court of Justice and the highest national courts have had to develop principles to resolve conflicts of laws between different systems. Within the EU itself, the Court of Justice\'s view is that if EU law conflicts with a provision of national law, then EU law has primacy. In the first major case in 1964, Costa v ENEL, a Milanese lawyer, and former shareholder of an energy company, named Mr Costa refused to pay his electricity bill to Enel, as a protest against the nationalisation of the Italian energy corporations. He c

Depending on how the example question is rephrased, the results may vary. You may have to re-run the example multiple times to clearly observe the effect. Nevertheless, in most runs, the vanilla RAG agent fails to retrieve the chunk that contains the right answer, while the Q-RAG agent handles the complex question easily.