# Retrieval-Augmented Generation (RAG) over PDF
##### Goal: Build a RAG chatbot over a chosen PDF using open-source LLMs.
---
### Ingestion
Parse the PDF and split it into chunks that make semantic sense.

To install all required libraries, run:
```bash
pip install pymupdf4llm llama-index llama-index-embeddings-google-genai llama-index-llms-google-genai marker-pdf
```

In [17]:
import os
import pymupdf4llm
import pathlib
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.llms import MessageRole


The following snippet asks for the name of the PDF file to load (without the `.pdf` extension).

It also asks for one of 3 parsing modes:
1. **Quick**: Best for text-based PDFs (e.g. `test_easy.pdf`)
2. **Balanced**: Best for PDFs with tables, equations etc. (e.g. `test_hard.pdf`)
3. **Detailed**: Best for research papers, scanned PDFs and complex documents (e.g. `test_extreme.pdf`)

In [15]:
file_name = input("Enter name of file to load (don't write file extension): ")
print("1. Quick: Best for text-based pdfs, \n2. Balanced: Best for pdfs with tables, equations, images etc. \n3. Detailed: Best for research papers, scanned pdfs and complex documents.")
mode = input("Enter mode (1/2/3): ")



For Quick mode, PyMuPDF4LLM is used. It parses the PDF with heuristics such as bounding boxes.

For Balanced mode, marker-pdf is used. It combines heuristic and OCR-based methods, giving more accurate output and support for LaTeX, tables and code snippets.

For Detailed mode, marker-pdf is run with the `--force_ocr` flag, which is very accurate but also time-consuming. This works well for newspaper-style PDFs, scientific papers where accuracy is paramount, and scanned PDFs where the text cannot be extracted normally.

In [18]:
import subprocess

pdf_path = f"{file_name}.pdf"

if mode == "1":
    # Quick: heuristic extraction with PyMuPDF4LLM
    md_text = pymupdf4llm.to_markdown(pdf_path)
    os.makedirs(file_name, exist_ok=True)
    pathlib.Path(f"{file_name}/{file_name}.md").write_text(md_text, encoding="utf-8")
elif mode in ("2", "3"):
    # Balanced / Detailed: marker-pdf; Detailed additionally forces OCR
    cmd = ["marker_single", pdf_path, "--output_format", "markdown",
           "--output_dir", "./", "--disable_image_extraction"]
    if mode == "3":
        cmd += ["--force_ocr", "--format_lines"]
    # Passing a list avoids shell-quoting problems with unusual file names
    subprocess.run(cmd, check=True)
    # Drop marker's metadata file so that only the markdown gets indexed
    pathlib.Path(f"{file_name}/{file_name}_meta.json").unlink(missing_ok=True)
else:
    # Stop here: the later cells need the markdown output to exist
    raise ValueError("Invalid mode selected.")

The previous snippet saves the output as Markdown, since it captures the structure of the document: equations, tables, code, links etc.

Next, we load the document into LlamaIndex, an orchestration framework for knowledge assistants.

In [19]:
reader = SimpleDirectoryReader(input_dir=file_name)
documents = reader.load_data()


The next step is chunking the document text into sensible semantic units. Since we use semantic chunking, a text embedding model must be used.

We use Google's `text-embedding-004` text embedding model due to its speed and reliability. Replace the `GOOGLE_API_KEY` placeholder in the `api_key` parameter with your Google Cloud API key.

In [20]:
embed_model = GoogleGenAIEmbedding(
    model_name="text-embedding-004",
    api_key="GOOGLE_API_KEY",
)


There are broadly 5 levels of chunking:
1. Fixed size
2. Recursive: split using certain characters like `\n` or `.`
3. Document based: split based on document structure like headings, code snippets, tables etc.
4. Semantic: split based on semantic meaning of the text, using a text embedding model (**this is what we use**)
5. Agentic: Uses an LLM to determine the best way to chunk the text based on context

The first three methods split on size or layout rather than meaning, so a chunk can mix unrelated topics or cut a single idea in two. The last method is computationally expensive, since it calls an LLM to decide the splits. Semantic chunking is the sweet spot: chunks stay topically coherent, and the only extra cost is embedding the text.
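To make the contrast between the first two levels concrete, here is a minimal sketch in plain Python. The helper names `fixed_size_chunks` and `recursive_chunks`, the separator list and the sample text are all illustrative assumptions; this is not how LlamaIndex implements its splitters.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Level 1: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_chunks(text: str, size: int, separators=("\n\n", "\n", ". ")) -> list[str]:
    """Level 2: split on the coarsest separator first, recursing on oversized pieces.

    If a piece is still too long once the separators run out, it is returned as-is.
    """
    if len(text) <= size or not separators:
        return [text]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        if piece:  # skip empty strings produced by consecutive separators
            chunks.extend(recursive_chunks(piece, size, tuple(rest)))
    return chunks


# Hypothetical sample text, only for illustration
sample = "RAG retrieves context. Then it generates an answer.\nChunking decides what can be retrieved."
fixed = fixed_size_chunks(sample, 40)
recursive = recursive_chunks(sample, 40)

for chunk in fixed:
    print("fixed    :", repr(chunk))
for chunk in recursive:
    print("recursive:", repr(chunk))
```

The fixed-size version cuts in the middle of words, while the recursive version keeps whole sentences together. Neither looks at meaning, which is the gap semantic chunking is meant to close.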

In [21]:
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
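
The role of `breakpoint_percentile_threshold` can be illustrated with a small, self-contained sketch: measure the distance between the embeddings of adjacent sentences, then start a new chunk wherever that distance exceeds the chosen percentile of all distances. The code below is an approximation of that idea under stated assumptions, not the library's actual implementation; the 2-D vectors are made-up stand-ins for real embeddings, and `semantic_split` and `percentile` are hypothetical helpers.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_split(sentences, embeddings, threshold_pct=95):
    """Start a new chunk wherever adjacent sentences are unusually far apart."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data (assumption): two sentences per topic, with made-up 2-D "embeddings"
sentences = ["Cats purr.", "Cats nap often.", "Stocks fell today.", "Markets closed lower."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = semantic_split(sentences, embeddings)
print(chunks)
```

A lower threshold produces more, smaller chunks; a higher one produces fewer, larger chunks.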

### Indexing

The next step is to store the chunks in a vector database. Using top-k similarity search, we can retrieve the most relevant chunks from this database based on the user query, instead of parsing the entire PDF again.

The last line in the snippet persists the index to disk. In a Jupyter notebook this is not strictly necessary, since objects from previous cells stay in memory; in a standalone script or a new session it is needed.

In [22]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node

vector_index = VectorStoreIndex(nodes, embed_model=embed_model)

vector_index.storage_context.persist(persist_dir="<persist_dir>")

### Retrieval

Now we can load the index from disk. The top-k value can be adjusted based on the size of the PDF; 10 is good enough for longer PDFs.

In [23]:
# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")

# load index
index = load_index_from_storage(storage_context, embed_model=embed_model)



retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
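
To make `similarity_top_k` concrete, the following sketch ranks stored chunk vectors by cosine similarity to a query vector and keeps the best `k`. It is a plain-Python illustration of top-k similarity search in general, not the retriever's internals; `toy_index`, the vectors and the `top_k` helper are assumptions made up for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    )
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]


# Toy index (assumption): chunk id -> made-up 2-D embedding
toy_index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
best = top_k([1.0, 0.1], toy_index, k=2)
print(best)
```

If `k` is larger than the number of stored chunks, all chunks are returned, so a generous value such as 10 is harmless for short documents.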




### Generation

Google's Gemini 2.0 Flash is used as the response generation model due to its speed, reliability and requests-per-day (RPD) limits.

The query engine uses a similarity cutoff: retrieved chunks that score below it are discarded. If an empty response is generated, reduce this hyperparameter.

Enter your Google Cloud API key in the `api_key` parameter.

In [24]:
llm = GoogleGenAI(
    model="gemini-2.0-flash",
    api_key="GOOGLE_API_KEY",  # replace with your key; if omitted, the GOOGLE_API_KEY env var is used
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
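
The effect of the similarity cutoff can be sketched without any library: chunks scoring below the cutoff are dropped before the LLM sees them, so a cutoff that is too strict leaves nothing to answer from. The scores and the `apply_cutoff` helper below are hypothetical, and the sketch assumes that a chunk scoring exactly at the cutoff is kept.

```python
def apply_cutoff(scored_chunks, similarity_cutoff):
    """Keep only the chunks whose score reaches the cutoff."""
    return [(text, score) for text, score in scored_chunks if score >= similarity_cutoff]


# Made-up retrieval results: (chunk text, similarity score)
retrieved = [("chunk A", 0.62), ("chunk B", 0.41), ("chunk C", 0.18)]

for cutoff in (0.7, 0.5, 0.1):
    kept = apply_cutoff(retrieved, cutoff)
    print(f"cutoff={cutoff}: {len(kept)} chunk(s) kept")
```

With these numbers a cutoff of 0.7 keeps nothing, which is the situation that leads to an empty response, while 0.1 keeps all three chunks.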

The following snippet starts a chat loop: it reads the user's query, retrieves the relevant chunks and generates a response, keeping the conversation history in memory.

In [26]:
# Initialize chat memory for conversation history
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

# Create a history-aware chat engine
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=retriever,
    memory=memory,
    llm=llm,
    context_prompt=(
        "You are a helpful AI assistant that can answer questions about the provided PDF document. "
        "Use the conversation history and the relevant document context to provide accurate and helpful responses.\n"
        "Here are the relevant documents for the context:\n"
        "{context_str}\n"
        "Instruction: Use the previous chat history and context to answer the question. "
        "If the question refers to something mentioned earlier in the conversation, use that context appropriately."
    ),
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.1)],
)

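print("Type 'quit' to exit, 'history' to see conversation history, or 'clear' to reset history.\n")

while True:
    user_input = input("You: ").strip()
    command = user_input.lower()

    if not user_input:
        continue  # ignore empty input instead of sending it to the LLM
    if command == 'quit':
        print("Goodbye!")
        break
    elif command == 'clear':
        memory.reset()
        print("Conversation history cleared!\n")
        continue
    elif command == 'history':
        # Print every stored message with its role (user/assistant)
        for message in memory.get_all():
            print(f"{message.role.value}: {message.content}")
        print()
        continue

    response = chat_engine.chat(user_input)
    print(f"\nAssistant: {response}\n")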


Type 'quit' to exit, 'history' to see conversation history, or 'clear' to reset history.


Assistant: Okay, I'm ready to answer questions about the provided documents. I understand the context includes information about downtrends, price movements, candle patterns, and criteria related to trend changes. I will do my best to provide accurate and helpful responses based on the information given.


Conversation history cleared!


Assistant: ```tool_code
print(file_path)
```

Goodbye!
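
The `token_limit=3000` passed to `ChatMemoryBuffer` caps how much conversation history is sent back to the model. A rough sketch of that idea, assuming a keep-the-most-recent-messages policy and a naive whitespace token count (both are simplifications, and `trim_history` is a hypothetical helper, not the library's code):

```python
from collections import deque


def trim_history(messages, token_limit, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits the limit."""
    kept = deque()
    used = 0
    for role, content in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(content)
        if used + cost > token_limit:
            break  # everything older than this is dropped as well
        kept.appendleft((role, content))
        used += cost
    return list(kept)


# Made-up conversation: (role, content)
history = [
    ("user", "one two three"),
    ("assistant", "four five"),
    ("user", "six seven eight nine"),
]
print(trim_history(history, token_limit=6))
```

With a limit of 6 "tokens" the oldest message no longer fits, so only the last two are kept. This is why very long chats can lose track of early questions.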
