# DocuAsk - Stage 2 Interview Task - RAG System

## Pre-Requisites

The OCR step in the **Refined RAG Pipeline** section requires both [poppler](https://github.com/oschwartz10612/poppler-windows/releases) and [Tesseract-OCR](https://tesseract-ocr.github.io/) to be installed on the machine and added to PATH (System Properties > Environment Variables). Restart the terminal afterwards so the updated PATH takes effect.
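
A quick way to confirm that both tools are reachable before running the OCR cells is to look them up on PATH. This is a minimal sketch; the executable names `pdftoppm` (shipped with Poppler) and `tesseract` are assumptions about what the OCR step invokes, and `find_missing_tools` is a hypothetical helper, not part of any library.

```python
import shutil

# Assumed executable names: Poppler ships `pdftoppm`, Tesseract ships `tesseract`.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print(f"Not found on PATH: {', '.join(missing)}")
else:
    print("Poppler and Tesseract are both on PATH.")
```

If a tool is reported missing right after installation, the terminal (or notebook kernel) is probably still using the old PATH.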

In [1]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.





In [2]:
# Allow nested event loops so asynchronous calls can run inside the notebook
import nest_asyncio
nest_asyncio.apply()

### Configuration

In [33]:
# LLM & Embedding Settings

llm = "gemini-2.0-flash" 
import os
api_key = os.environ["GOOGLE_API_KEY"]  # Gemini API key, read from the environment (variable name is an assumption); never hard-code keys
delay = 7  # Delay in seconds between requests to prevent rate limiting - suggested 7 seconds for gemini-2.0-flash

embedding_model = "BAAI/bge-small-en-v1.5"  # HuggingFace Embedding Model
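
The `delay` setting spaces requests out with a fixed sleep. As an alternative sketch, the helper below only sleeps for whatever part of the interval has not already elapsed, so time spent waiting on the model counts toward the delay. `Throttle` is a hypothetical class written for illustration; the `clock` and `sleep` parameters exist only so it can be tested without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be `throttle = Throttle(delay)` once, then `throttle.wait()` immediately before each request to the LLM.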

## Basic RAG Pipeline with LlamaIndex


In [4]:
from llama_index.core import Settings

# Configure the LLM
from llama_index.llms.google_genai import GoogleGenAI
Settings.llm = GoogleGenAI(model=llm, api_key=api_key, generate_kwargs={"max_output_tokens": 1}) # only one answer "A, B, C, D"

# Configure the Embedding Model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
Settings.embed_model = HuggingFaceEmbedding(model_name=embedding_model)

# Google GenAI Embeddings (usually more accurate)
# from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
# Settings.embed_model = GoogleGenAIEmbedding(api_key=api_key)

### Ingestion
 1. Load all PDFs
 2. Create a VectorStoreIndex using the HuggingFace Embedding Model
 3. Save to `index` directory


In [5]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

PDF_DIR = 'data/5_estate_planning/Lessons/'
INDEX_DIR = 'index'

# 1. Load all PDF files from the specified directory
try:
    documents = SimpleDirectoryReader(PDF_DIR).load_data()
    print(f"Loaded {len(documents)} documents from {PDF_DIR}")
except Exception as e:
    print(f"Error loading documents: {e}")
    documents = [] # Ensure documents is an empty list if loading fails

# 2. Create a VectorStoreIndex from the loaded documents
if documents:
    print("Creating VectorStoreIndex...")
    index = VectorStoreIndex.from_documents(documents)
    print("VectorStoreIndex created successfully.")

    # 3. Save the index to disk
    print(f"Saving index to disk in directory '{INDEX_DIR}'...")
    index.storage_context.persist(persist_dir=INDEX_DIR)
    print("Index saved successfully.")
else:
    print("No documents loaded, index creation skipped.")

Loaded 140 documents from data/5_estate_planning/Lessons/
Creating VectorStoreIndex...
VectorStoreIndex created successfully.
Saving index to disk in directory 'index'...
Index saved successfully.


### Retrieval

- Load the indexes from storage
- Read the multiple-choice questions from the JSON file

In [6]:
import json
import os
from llama_index.core import StorageContext, load_index_from_storage
import re
import time
from llama_index.core.response import Response


# Define the directory where the index is saved
INDEX_DIR = 'index'

# Define the path to the questions JSON file
QUESTIONS_FILE = 'data/processed/5_estate_planning_questions.json'


# Load the index
print(f"Attempting to load index from {INDEX_DIR}")
if not os.path.exists(INDEX_DIR):
    # Fail loudly instead of silently skipping the load
    raise FileNotFoundError(f"Index directory '{INDEX_DIR}' not found. Run the ingestion cell first.")
try:
    # Rebuild the storage context from the persisted files, then load the index
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_context)
    print(f"Successfully loaded index from {INDEX_DIR} using StorageContext")
except Exception as e:
    print(f"An unexpected error occurred while loading index: {e}")
    print("Could not load the index. Please ensure the index is created and saved correctly.")
    raise


# Read the questions and expected answers from the JSON file
try:
    with open(QUESTIONS_FILE, 'r') as f:
        questions_data = json.load(f)
    print(f"Successfully loaded questions from {QUESTIONS_FILE}")
except FileNotFoundError:
    print(f"Error: Questions file not found at {QUESTIONS_FILE}")
    exit()
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from {QUESTIONS_FILE}")
    exit()

# index - contains the loaded index
# questions_data - contains the list of questions and expected answers.

Attempting to load index from index
Loading llama_index.core.storage.kvstore.simple_kvstore from index\docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from index\index_store.json.
Successfully loaded index from index using StorageContext
Successfully loaded questions from data/processed/5_estate_planning_questions.json
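
The later cells iterate over `questions_data.items()` and read the `question` and `answer` keys of each entry, so the file is presumably a mapping from a chapter key to a list of question entries. The sketch below checks that shape before any LLM calls are made. `validate_questions` is a hypothetical helper, and the sample data (including the `chapter_1` key) is made up for illustration.

```python
VALID_ANSWERS = {"A", "B", "C", "D"}


def validate_questions(data):
    """Return a list of problems found in a chapter -> question-list mapping."""
    problems = []
    for chapter_key, entries in data.items():
        for i, entry in enumerate(entries):
            missing = [f for f in ("question", "answer") if f not in entry]
            if missing:
                problems.append(f"{chapter_key}[{i}]: missing {', '.join(missing)}")
            elif entry["answer"] not in VALID_ANSWERS:
                problems.append(
                    f"{chapter_key}[{i}]: answer {entry['answer']!r} is not one of A-D"
                )
    return problems


# Hypothetical sample in the assumed layout
sample = {
    "chapter_1": [
        {"question": "1. Example question? A. ... B. ... C. ... D. ...", "answer": "B"},
    ]
}
print(validate_questions(sample))  # prints []
```

Running `validate_questions(questions_data)` right after loading would surface malformed entries early, rather than partway through a rate-limited evaluation loop.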


### Synthesis
Use the `query_engine` to:
1. Retrieve relevant content from the indexed documents -> context
2. Send both the context and the multiple-choice question to the LLM

In [7]:
# Process each question using the index and an LLM.
query_engine = index.as_query_engine()

correct_answers = 0

for chapter_key, questions_list in questions_data.items():
    for question_entry in questions_list:

        question_text = question_entry['question']
        expected_answer = question_entry['answer']

        # Construct the query with explicit instructions to only return the letter answer
        query_text = f"""\
        You are an agent designed to answer a multiple choice question over a set of given documents.
        You must respond with ONLY the letter of the correct answer: A, B, C, or D.
        Do NOT include any explanation, reasoning, or extra text.
        If you include anything other than A, B, C, or D, your answer will be considered invalid. \n
        Question: {question_text}
        """

        # Get response from the RAG system
        response: Response = query_engine.query(query_text)
        
        # Add a small delay to avoid hitting API rate limits
        time.sleep(delay)

        # Print details for analysis
        print("-" * 20)
        print(f"Question: {question_text}")
        print(f"Expected Answer: {expected_answer}")
        # print("Retrieved Context:")
        # for node in response.source_nodes:
        #     print(node.text)
        #     print("-" * 10)
        print(f"Raw LLM Response: {str(response)}")
        print("-" * 20) # Ensure this is consistent

        # ANSWER SELECTION: Use regex to extract the predicted answer (a standalone letter A-D)
        # Word boundaries stop letters inside words (e.g. the "A" in "ANSWER") from matching
        match = re.search(r'\b[A-D]\b', str(response).strip().upper())
        predicted_answer = match.group(0) if match else None

        # Compare predicted and expected answers
        if predicted_answer == expected_answer:
            correct_answers += 1

--------------------
Question: 1. In Estate Planning there are many views and approaches taken. Which of the following is not true? 

A. It is common for many Malaysians to avoid estate planning 

B. Writing a Will is adequate for estate planning purposes 

C. People generally tend to avoid estate planning because it is a complex subject. 

D. Preservation of the estate is a key objective in estate planning
Expected Answer: B
Raw LLM Response: B

--------------------
--------------------
Question: 2. The various steps in the process of estate planning ensure that all issues are covered. Which of the following sequence in the process is correct? 

A. Set Objectives – Gather information – Implement – Develop a plan – Review 

B. Set Objectives – Gather information – Review – Implement – Develop a plan 

C. Gather information – Set Objectives – Review – Implement – Develop a plan 

D. Set Objectives – Gather information – Develop a plan – Implement – Review
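
The answer-selection step searches the raw response for a letter A-D. Once the text is upper-cased, a pattern without word boundaries can match a letter inside a word if the model ever returns more than a bare letter. The sketch below shows one way to guard against that; `extract_answer` is a hypothetical helper, not part of the pipeline.

```python
import re

# Match a lone A-D that is not part of a longer word.
ANSWER_PATTERN = re.compile(r"\b[A-D]\b")


def extract_answer(raw_response):
    """Return the first standalone letter A-D in the response, or None."""
    match = ANSWER_PATTERN.search(raw_response.strip().upper())
    return match.group(0) if match else None


print(extract_answer("B"))                # B
print(extract_answer("The answer is C"))  # C
print(extract_answer("I don't know"))     # None

# For comparison, the unanchored pattern picks up the "A" in "ANSWER":
print(re.search(r"[A-D]", "THE ANSWER IS C").group(0))  # A
```

This is still a heuristic: a response that begins with the article "A" would be read as answer A, so a strict prompt remains the main safeguard.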

### Evaluation

In [8]:
# Calculate and print accuracy
total_questions = sum(len(questions_list) for questions_list in questions_data.values())
accuracy = (correct_answers / total_questions) * 100 if total_questions > 0 else 0

print(f"\nValidation Results:")
print(f"Total Questions: {total_questions}")
print(f"Correct Answers: {correct_answers}")
print(f"Accuracy: {accuracy:.2f}%")


Validation Results:
Total Questions: 80
Correct Answers: 61
Accuracy: 76.25%


## Refined RAG Pipeline - Unstructured & LangChain

### Using Unstructured to Process PDFs

In [9]:
from unstructured.partition.pdf import partition_pdf
import os

pdf_elements = []
pdf_dir = "data/5_estate_planning/Lessons"

for filename in os.listdir(pdf_dir):
    if filename.lower().endswith(".pdf"):
        file_path = os.path.join(pdf_dir, filename)
        print(f"Processing file: {file_path}")
        
        # Returns a List[Element] present in the pages of the parsed pdf document
        # Uses the English and Malay language packs for OCR. OCR is only applied when the PDF has no extractable text.
        elements = partition_pdf(file_path, languages=["eng", "msa"])

        pdf_elements.extend(elements)


Processing file: data/5_estate_planning/Lessons\1. Chapter 1  The Concepts and Fundamentals of Estate Planning.pdf
Processing file: data/5_estate_planning/Lessons\2. Chapter 2  Testacy and Intestacy.pdf
Processing file: data/5_estate_planning/Lessons\3. Chapter 3  Estate of Muslims.pdf
Processing file: data/5_estate_planning/Lessons\4. Chapter 4  Trusts.pdf
Processing file: data/5_estate_planning/Lessons\5. Chapter 5  Powers of Attorney.pdf
Processing file: data/5_estate_planning/Lessons\6. Chapter 6  Personal Representatives Duties and Powers.pdf
Processing file: data/5_estate_planning/Lessons\7. Chapter 7  Life Insurance and Estate Planning.pdf
Processing file: data/5_estate_planning/Lessons\8. Chapter 8  Estate Planning for Business Owners.pdf
Processing file: data/5_estate_planning/Lessons\Estate Planning-13-15.pdf


In [10]:
for element in pdf_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

HEADER: RFP Programme - Module 5
HEADER: Chapter 1 : The Concepts and Fundamentals of Estate Planning
TITLE: Chapter 1
TITLE: The Concepts and Fundamentals of Estate Planning
TITLE: Chapter Objectives
NARRATIVETEXT: On completion of this chapter you should have a basic knowledge on:
LISTITEM: The Purpose and Importance of Estate Planning
LISTITEM: The Major Steps in the Estate Planning Process
LISTITEM: The Legal Rules and Principles Applicable in the Estate Planning Process
LISTITEM: The Principal Tools and Legal Instruments Use in Estate Planning


#### Understanding the elements in the PDF

In [26]:
import collections

pdf_categories = [el.category for el in pdf_elements]
collections.Counter(pdf_categories)

Counter({'NarrativeText': 981,
         'Title': 776,
         'ListItem': 236,
         'UncategorizedText': 222,
         'Footer': 210,
         'Header': 206})

#### Removing Noise (Header & Footer)

In [27]:
# Filter out elements that are likely headers or footers
# Assuming headers/footers are categorized as 'Header' or 'Footer' in element.category

filtered_elements = [
    el for el in pdf_elements
    if el.category.lower() not in ['header', 'footer']
]

print(f"Filtered elements count: {len(filtered_elements)}")

Filtered elements count: 2215


In [28]:
pdf_categories = [el.category for el in filtered_elements]
collections.Counter(pdf_categories)

Counter({'NarrativeText': 981,
         'Title': 776,
         'ListItem': 236,
         'UncategorizedText': 222})
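
As a sanity check on the filtering step, the filtered count should equal the total number of elements minus the headers and footers. The sketch below redoes that arithmetic using the category counts copied from the `Counter` output printed before filtering.

```python
from collections import Counter

# Category counts copied from the Counter output printed before filtering.
counts = Counter({
    "NarrativeText": 981,
    "Title": 776,
    "ListItem": 236,
    "UncategorizedText": 222,
    "Footer": 210,
    "Header": 206,
})

noise = {"Header", "Footer"}
total = sum(counts.values())
removed = sum(n for category, n in counts.items() if category in noise)
kept = total - removed

print(f"Removed {removed} of {total} elements, kept {kept}")
# Removed 416 of 2631 elements, kept 2215
```

The result matches the reported `Filtered elements count: 2215`, so roughly one element in six was page furniture rather than content.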

### Indexing with LangChain

In [29]:
from langchain_core.documents import Document

documents = []
for element in filtered_elements:
    metadata = element.metadata.to_dict()
    # Drop non-scalar values (a list and a dict), which the vector store cannot store as metadata
    metadata.pop("languages", None)
    metadata.pop("coordinates", None)
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))
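
The cell above drops two specific metadata keys by name. A more general sketch, assuming the vector store accepts only `str`, `int`, `float`, and `bool` metadata values, is to keep every scalar field and discard the rest. `keep_scalar_metadata` is a hypothetical helper, and the example dictionary is made up to resemble the parser output.

```python
SCALAR_TYPES = (str, int, float, bool)


def keep_scalar_metadata(metadata):
    """Return a copy of metadata containing only scalar values."""
    return {k: v for k, v in metadata.items() if isinstance(v, SCALAR_TYPES)}


# Hypothetical metadata shaped like a parsed PDF element
example = {
    "filename": "4. Chapter 4  Trusts.pdf",
    "page_number": 3,
    "languages": ["eng"],
    "coordinates": {"points": [], "system": "PixelSpace"},
}
print(keep_scalar_metadata(example))
# {'filename': '4. Chapter 4  Trusts.pdf', 'page_number': 3}
```

Filtering by type avoids having to list each non-scalar key by name, which matters if a different parsing strategy adds new list or dict fields.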

In [30]:
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
vectorstore = Chroma.from_documents(documents, embeddings)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6}
)



In [31]:
from langchain.prompts.prompt import PromptTemplate

# ANSWER SELECTION: Construct the query with explicit instructions to only return the letter answer
template = """
        You are an agent designed to answer a multiple choice question from the given context.
        You must respond with ONLY the letter of the correct answer: A, B, C, or D.
        Do NOT include any explanation, reasoning, or extra text.
        If you include anything other than A, B, C, or D, your answer will be considered invalid. \n
        Question: {question}
        =========
        {context}
        =========
        """
prompt = PromptTemplate(template=template, input_variables=["question", "context"])

In [34]:
from langchain_google_genai import ChatGoogleGenerativeAI 
from langchain.chains import RetrievalQA

# Use a separate name so the `llm` model-name string from the configuration cell is not overwritten
chat_llm = ChatGoogleGenerativeAI(
    model=llm,
    api_key=api_key,
    generate_kwargs={"max_output_tokens": 1}  # only one answer "A, B, C, D"
)

qa_chain = RetrievalQA.from_chain_type(
    llm=chat_llm,
    retriever=retriever,
    return_source_documents=False,
    chain_type_kwargs={"prompt": prompt}
)

In [35]:
correct_answers = 0

for chapter_key, questions_list in questions_data.items():
    for question_entry in questions_list:
        question_text = question_entry['question']
        expected_answer = question_entry['answer']

        # Run the QA chain
        result = qa_chain.invoke({"query": question_text})
        raw_response = result["result"] if isinstance(result, dict) and "result" in result else str(result)

        print("-" * 20)
        print(f"Question: {question_text}")
        print(f"Expected Answer: {expected_answer}")
        print(f"Raw LLM Response: {raw_response}")
        print("-" * 20)

        # ANSWER SELECTION: Use regex to extract the predicted answer (a standalone letter A-D)
        # Word boundaries stop letters inside words (e.g. the "A" in "ANSWER") from matching
        match = re.search(r'\b[A-D]\b', raw_response.strip().upper())
        predicted_answer = match.group(0) if match else None

        if predicted_answer == expected_answer:
            correct_answers += 1

        time.sleep(delay) # Add a small delay to avoid hitting API rate limits


--------------------
Question: 1. In Estate Planning there are many views and approaches taken. Which of the following is not true? 

A. It is common for many Malaysians to avoid estate planning 

B. Writing a Will is adequate for estate planning purposes 

C. People generally tend to avoid estate planning because it is a complex subject. 

D. Preservation of the estate is a key objective in estate planning
Expected Answer: B
Raw LLM Response: B
--------------------
--------------------
Question: 2. The various steps in the process of estate planning ensure that all issues are covered. Which of the following sequence in the process is correct? 

A. Set Objectives – Gather information – Implement – Develop a plan – Review 

B. Set Objectives – Gather information – Review – Implement – Develop a plan 

C. Gather information – Set Objectives – Review – Implement – Develop a plan 

D. Set Objectives – Gather information – Develop a plan – Implement – Review


### Evaluation

In [36]:
# Calculate and print accuracy
total_questions = sum(len(questions_list) for questions_list in questions_data.values())
accuracy = (correct_answers / total_questions) * 100 if total_questions > 0 else 0

print(f"\nValidation Results:")
print(f"Total Questions: {total_questions}")
print(f"Correct Answers: {correct_answers}")
print(f"Accuracy: {accuracy:.2f}%")


Validation Results:
Total Questions: 80
Correct Answers: 65
Accuracy: 81.25%
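
To put the two validation results side by side, the sketch below recomputes both accuracies from the reported counts (61 and 65 correct out of 80) using the same formula as the evaluation cells.

```python
def accuracy(correct, total):
    """Percentage of correct answers, or 0.0 when there are no questions."""
    return (correct / total) * 100 if total > 0 else 0.0


total_questions = 80
basic_correct = 61    # Basic RAG pipeline (LlamaIndex)
refined_correct = 65  # Refined RAG pipeline (Unstructured & LangChain)

basic = accuracy(basic_correct, total_questions)
refined = accuracy(refined_correct, total_questions)

print(f"Basic: {basic:.2f}%  Refined: {refined:.2f}%")
print(f"Gain: {refined - basic:.2f} points ({refined_correct - basic_correct} questions)")
# Basic: 76.25%  Refined: 81.25%
# Gain: 5.00 points (4 questions)
```

With 80 questions, each question is worth 1.25 percentage points, so the 5-point gain corresponds to four additional correct answers from a single run of each pipeline.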
