## ASSIGNMENT 1 - MULTIMODAL RAG

Take multiple PDFs containing text, images, and tables.

1. Fetch the data from the PDFs.
2. The source material should total at least 200 pages.
3. If chunking is required, use the semantic chunking technique, then embed the chunks.
4. Store the embeddings in a vector database (use any one of:
    1. mongodb
    2. astradb
    3. opensearch
    4. milvus)
5. Create an index with each of the three index mechanisms (Flat, HNSW, IVF).
6. Create a retriever pipeline.
7. Measure the retrieval time of each one (which is fastest?).
8. Print the accuracy score of every similarity search.
9. Perform re-ranking using either BM25 or MMR ## i g...
10. Then write a prompt template.
11. Generate an output through the LLM.
12. Render that output to a DOCX file.
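
Steps 5, 7, and 8 ask for a comparison of index types by speed and accuracy. The following is a minimal, database-independent sketch of that idea, assuming random vectors in place of real embeddings: a Flat index scans everything, while an IVF index clusters the vectors and scans only the `n_probe` closest clusters. All function names here (`flat_search`, `build_ivf`, `ivf_search`, `recall_at_k`) are illustrative, not part of any vector database API, and HNSW is left out to keep the example short.

```python
import math
import random
import time


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, k):
    """Exact search: compare the query with every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:k]


def assign(vectors, centroids):
    """Put each vector index into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def build_ivf(vectors, n_lists, iterations=5, seed=0):
    """Run a few k-means steps; each cluster becomes one inverted list."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(iterations):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*(vectors[i] for i in members))]
    return centroids, assign(vectors, centroids)


def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Approximate search: scan only the n_probe lists closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]


def recall_at_k(approx, exact):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx) & set(exact)) / len(exact)


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]  # stand-in embeddings
query = [rng.random() for _ in range(8)]

centroids, lists = build_ivf(vectors, n_lists=10)
exact = flat_search(query, vectors, k=5)

for n_probe in (1, 3, 10):
    start = time.perf_counter()
    approx = ivf_search(query, vectors, centroids, lists, k=5, n_probe=n_probe)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"n_probe={n_probe}: recall@5={recall_at_k(approx, exact):.2f}, {elapsed_ms:.2f} ms")
```

Using the Flat result as ground truth gives one possible definition of the "accuracy score" in step 8: recall@k of the approximate index. Probing every list makes IVF equivalent to Flat, so the interesting numbers are the ones at small `n_probe`.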

In [38]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv()

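# load_dotenv() already copied OPENAI_API_KEY from .env into os.environ;
# assigning os.getenv(...) back would raise a TypeError when the key is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your .env file")
llm = ChatOpenAI(model="gpt-4o")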


## Fetch data from PDF (more than 200 pages doc)

### 📄 PyPDFLoader (uses pypdf)
🔧 Backend:
Uses the pypdf Python library (previously PyPDF2).

✅ Pros:
1. Pure Python (no binary dependencies)
2. Very lightweight and stable for text-based PDFs
3. Works well for most structured PDFs with selectable text

❌ Cons:
1. Can miss text from PDFs with complex layouts
2. Cannot extract images or tables
3. No native support for scanned PDFs or OCR

### 🖼️ PyMuPDFLoader (uses PyMuPDF / fitz)
🔧 Backend:
Uses the PyMuPDF library (fitz module)

✅ Pros:
Much more powerful overall:
1. Handles complex layouts
2. Extracts text, images, and even vector objects
3. Supports bounding boxes, font styles, etc.
4. Can extract text from scanned or image-based PDFs (OCR with extra config)

❌ Cons:
1. Heavier dependency (requires C bindings)
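2. AGPL/commercial dual license, which may matter for closed-source use (speed is not the issue: the C backend is usually faster than pypdf)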


| Feature              | `PyPDFLoader` | `PyMuPDFLoader`          |
| -------------------- | ------------- | ------------------------ |
| Backend              | `pypdf`       | `PyMuPDF` (`fitz`)       |
| Text extraction      | ✅ Simple PDFs | ✅ All layouts            |
| Table extraction     | ❌             | ❌ (needs custom parsing) |
| Image extraction     | ❌             | ✅                        |
| Handles scanned PDFs | ❌             | ⚠️ (with OCR manually)   |
| Metadata             | ✅             | ✅                        |
| Speed                | 🐢 Slower (pure Python) | ⚡ Usually faster (C backend) |
| Dependency size      | 🟢 Light      | 🔶 Medium-heavy          |


In [30]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("The Bhagavad Gita.pdf")

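# --- Read Pages Asynchronously ---
# Document loaders implement lazy_load and its async variant, alazy_load,
# which return iterators of Document objects. We use the async variant below.
# Note: a top-level `async for` works in Jupyter/IPython; in a plain script,
# wrap it in a coroutine and run it with asyncio.run().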

pages = []
async for page in loader.alazy_load():
    pages.append(page)

print("Total pages in the pdf - ", len(pages))
print("Sample metadata:", pages[0].metadata)
print("Sample content:\n", pages[1].page_content)

Total pages in the pdf -  447
Sample metadata: {'producer': 'doPDF Ver 7.2 Build 376 (Windows 7 Business Edition - Version: 6.1.7600 (x86))', 'creator': 'Adobe Acrobat 8.0 Combine Files', 'creationdate': '2023-09-28T13:07:43+05:30', 'source': 'The Bhagavad Gita.pdf', 'file_path': 'The Bhagavad Gita.pdf', 'total_pages': 447, 'format': 'PDF 1.6', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2023-09-28T13:07:43+05:30', 'trapped': '', 'modDate': "D:20230928130743+05'30'", 'creationDate': "D:20230928130743+05'30'", 'page': 0}
Sample content:
 iijkhh 
 
 
 
 
The Bhagavad Gita 
(Based on HH Sri Raghavendra Teertha’s Gita Vivruti) 
 
 
Compiled By 
Dr. Giridhar Boray 
 
T.T.D. Religious Publications Series No.1451 
 All Rights Reserved 
 
First Edition : 2023 
 
Copies : 500 
 
Published  by : 
Sri A.V. Dharma Reddy, IDES 
Executive Officer, 
Tirumala Tirupati Devasthanams, 
Tirupati. 
 
D.T.P: 
Publications Division, 
T.T.D, Tirupati. 
 
Printed at : 
Tirumala Tirupa

## Semantic Chunking and Embedding the docs

### 🧠 What is Semantic Chunking?
Semantic chunking splits long documents into smaller parts (chunks) based on meaning and context rather than on size alone.

It tries to split along natural boundaries such as:

- Paragraphs
- Headings
- Sentence boundaries
- Logical/semantic topics

It produces coherent chunks that retain complete thoughts or concepts.
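
To make the idea concrete, here is a minimal sketch of embedding-based chunking. It assumes a toy bag-of-words "embedding" (`toy_embed`) and a fixed distance threshold, both hypothetical stand-ins chosen so the example runs without an API key; a real splitter would use a proper embedding model and typically derives the threshold from the distribution of distances.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(text, threshold=0.8):
    """Split text into sentences, then start a new chunk wherever the
    distance between neighbouring sentences exceeds the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, vec) > threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "Arjuna asks Krishna about duty. Krishna explains duty to Arjuna. "
    "The printer uses black ink. Black ink is cheap for the printer."
)
for chunk in semantic_chunks(sample):
    print("-", chunk)
```

The two sentences about Arjuna share vocabulary and stay together, while the jump to the printer sentences shares no words with what came before, so a new chunk starts there.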

### 🔧 How to Implement It?
Using `SemanticChunker` from `langchain_experimental` together with OpenAI embeddings.

In [31]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

texts = [p.page_content for p in pages]
metadata = [p.metadata for p in pages]

embedding = OpenAIEmbeddings(model="text-embedding-3-large")

splitter = SemanticChunker(embeddings=embedding, breakpoint_threshold_type="interquartile")
docs = splitter.create_documents(texts=texts, metadatas=metadata)
print("Total chunks:", len(docs))

Total chunks: 752


### Connect to Astra DB

In [32]:
from langchain_astradb import AstraDBVectorStore

# load_dotenv() has already populated os.environ; fail early if a setting is missing
# instead of hitting a TypeError from os.environ[...] = None.
for name in ("ASTRA_DB_ID", "ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_API_ENDPOINT", "ASTRA_DB_KEYSPACE"):
    if not os.getenv(name):
        raise EnvironmentError(f"{name} is not set; add it to your .env file")


vector_store = AstraDBVectorStore(
    collection_name="mmrag_db",
    embedding=embedding,
)

vector_store.add_documents(docs)


['deed421912ca43a88e4b4360affe64af',
 '91a982cfc63440c59c15205fa187c5e9',
 '5ce9cbdd56ef448d9b10ccbaa75379aa',
 '271defea075640cf983ebf19c1fd4bda',
 'f8d57385fc9d47afbcf32ffb51c24c04',
 'bca60bbd5a8e469bbbc8511ea6786bc0',
 '87cb4ab402cf47e68b10cd3a35f50765',
 '8e67bc3a44af4e81ada5c1eeb50163f9',
 'd3d7c70a51ae49de928d6415571e7298',
 'a88e1fc602834d2d8d446913756eb25c',
 '885623e8bbe746dbb11c3f7826b254d2',
 'f4d3d38282f14d328d23bbf2d24c5bbf',
 '8174cdb5cb7b43b5b6628b9e837de9c2',
 '1756d97b4d864c05acc573a1fa1a5726',
 '45f55249080d49d29cddfe22d13116bd',
 '19de20efd8cc441e96381fd45f840ae3',
 '8392045bfb93498fa6b0d08979ec7345',
 'fe066e41c6554e56b079d21a777d54ac',
 '5784875a682a48498cb3913c5fbd0bda',
 '84b9e3dd0a1a45ee9b77c985689e8b28',
 '9ad6de77f7894668a45ce4f13b7bdd19',
 '081402cc234d48ad868b4aaaa5f95aa0',
 '7bfee346ff3543d693dd707119948ac7',
 '0b0a8f1fc3664a54b2cc6110a98c805a',
 '58484222ebb742458ec81799aaba11ae',
 'cfffec04be2d44a9a5b5d28610571dd8',
 'aa47455f945a4fdcaf8f279c2e3007aa',
 

### Basic Semantic Retrieval

In [33]:
retriever = vector_store.as_retriever()
result = retriever.invoke("Krishna is the Source of All Incarnations")  # invoke() replaces the deprecated get_relevant_documents()
result



[Document(id='db947d625e17403f8d815dc341561b32', metadata={'producer': 'doPDF Ver 7.2 Build 376 (Windows 7 Business Edition - Version: 6.1.7600 (x86))', 'creator': 'Adobe Acrobat 8.0 Combine Files', 'creationdate': '2023-09-28T13:07:43+05:30', 'source': 'The Bhagavad Gita.pdf', 'file_path': 'The Bhagavad Gita.pdf', 'total_pages': 447, 'format': 'PDF 1.6', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2023-09-28T13:07:43+05:30', 'trapped': '', 'modDate': "D:20230928130743+05'30'", 'creationDate': "D:20230928130743+05'30'", 'page': 131}, page_content='116  \n                                                                                                                           The Bhagavad Gita \n                       \nThe Lord’s Incarnations \n \nlr^JdmZwdmM Ÿ& \n~hÿ{Z _o ì`VrVm{Z OÝ_m{Z Vd MmOw©Z Ÿ& \nVmÝ`h§ doX gdm©{U Z Ëd§ doËW na§Vn Ÿ&& 5 Ÿ&& \n \nThe Lord said: O Arjuna! Many manifestations of Mine and many \nbirths of yours have come and gone. O!'),
 Do

### MMR Retrieval
✅ When to use:
- BM25: if you're doing basic keyword search (e.g., FAQs, simple search engines).
- MMR: if you use LLMs or semantic embeddings and want more informative, diverse answers.

🔁 MMR: Maximal Marginal Relevance

📌 What it is:
MMR is a re-ranking strategy that balances relevance to the query against diversity among the selected documents. In the retriever below, `lambda_mult` sets the trade-off: values near 1 favor relevance, values near 0 favor diversity.

🧠 Why it's useful:
When retrieving multiple documents (e.g., top 5), you don’t want them all to say the same thing. MMR selects documents that are diverse yet still relevant, avoiding redundancy.

✅ Use Case:
Useful in RAG pipelines when you want a diverse yet relevant context window for the LLM.
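
The selection rule behind MMR can be sketched in a few lines. This is an illustrative greedy implementation on hand-made 2-D vectors, not the code a vector store runs internally; the function names and the sample vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, each time maximizing
    lambda * sim(query, doc) - (1 - lambda) * max sim(doc, already selected)."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.5, -0.87],  # 2: less relevant, but covers a different direction
]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # balanced: skips the near-duplicate
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance: keeps the near-duplicate
```

With `lambda_mult=0.5` the second pick is document 2, because document 1 is penalized for being almost identical to the already selected document 0. With `lambda_mult=1.0` the penalty disappears and the result is plain top-k by similarity.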

📚 BM25: Best Match 25 (Ranking Function)

📌 What it is:
BM25 is a statistical ranking algorithm from traditional information retrieval (e.g., search engines). It scores documents based on:

1. How often query terms appear in the document (term frequency)
2. How rare the terms are across all documents (inverse document frequency)

🧠 Why it's useful:
It prioritizes documents that contain more query terms, especially rare ones, and adjusts for document length.

✅ Use Case:
1. Works well for keyword-heavy queries
2. Needs no embedding model, so it is typically cheaper to run than vector search
3. Often used as a first-pass retriever before semantic re-ranking (hybrid search)
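
As an illustration of how term frequency, inverse document frequency, and length normalization combine, here is a small sketch of the Okapi BM25 formula. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; the sample documents are made up for the example, and a production setup would rely on an existing BM25 implementation instead.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "krishna speaks to arjuna about duty",
    "the lord takes many incarnations",
    "arjuna asks about the incarnations of krishna",
]
scores = bm25_scores("krishna incarnations", documents)
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

The third document ranks first because it is the only one containing both query terms. Between the other two, the shorter document scores slightly higher for a single match, which is the length adjustment at work.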

In [37]:
import time

query = "Krishna is the Source of All Incarnations"
mmr_retriever = vector_store.as_retriever(
    search_type = "mmr",
    search_kwargs={"k": 5, "lambda_mult": 0.5}
)

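start = time.perf_counter()  # perf_counter() is the recommended clock for measuring elapsed time
retrieved_docs = mmr_retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()
print(f"\nMMR Retrieval Time: {time.perf_counter() - start:.2f}s")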

print("\nMMR Top documents:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"Doc {i}:", doc.page_content[:300], "...")


MMR Retrieval Time: 6.78s

MMR Top documents:
Doc 1: 116  
                                                                                                                           The Bhagavad Gita 
                       
The Lord’s Incarnations 
 
lr^JdmZwdmM Ÿ& 
~hÿ{Z _o ì`VrVm{Z OÝ_m{Z Vd MmOw©Z Ÿ& 
VmÝ`h§ doX gdm©{U Z Ëd§ doËW na§Vn Ÿ&& 5 Ÿ&& 
 ...
Doc 2: These 3 interpretations provide answers to the three questions that arise 
here. Question 1: What does the statement that the Lord takes birth at a 
specific time and place when His body and soul are eternal mean? For 
example, the birth of Krishna to Vasudeva and Devaki. Answer 1: Inert nature is u ...
Doc 3: Chapter 15                                                                                                                                       351 
 
Comments: After having described the two dependent, sentient 
entities in the universe, namely kshara and akshara, the Lord declares that 
there is  ...
Doc 4: 414  
    

### RetrievalQAChain

LangChain’s `RetrievalQA` class quickly creates a question-answering chain that combines a retriever, which fetches relevant documents from a knowledge source (like your vector store), with a language model chain that generates answers based on the retrieved documents.
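
Conceptually, such a chain does three things: retrieve passages, place them in a prompt, and call the model. The sketch below shows that flow with plain functions. Everything in it is a hypothetical stand-in: `keyword_retriever` replaces the vector store, `echo_llm` replaces the chat model, and the prompt wording is invented for the example.

```python
def keyword_retriever(question, corpus, k=2):
    """Hypothetical retriever: rank passages by the number of shared lowercase words."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Place all retrieved passages into a single prompt."""
    context = "\n\n".join(passages)
    return (
        "Answer using only this context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def echo_llm(prompt):
    """Hypothetical stand-in for a chat model: only reports the prompt size."""
    return f"(model would answer here; prompt has {len(prompt)} characters)"


def retrieval_qa(question, corpus, retriever=keyword_retriever, llm=echo_llm):
    """Retrieve, build the prompt, call the model, and return answer plus sources."""
    passages = retriever(question, corpus)
    return llm(build_prompt(question, passages)), passages


corpus = [
    "Vedic knowledge is learned from scriptures and teachers.",
    "Chapter 16 describes anger as a demoniac tendency.",
    "The Lord describes His many incarnations.",
]
answer, sources = retrieval_qa("What is Vedic knowledge?", corpus)
print(answer)
print("Top source:", sources[0])
```

Swapping `keyword_retriever` for a vector-store retriever and `echo_llm` for a real model call gives the same shape as the chain built in the next cell.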

In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm = llm,
    retriever = vector_store.as_retriever()
)

response = qa.invoke({"query": "What is Vedic knowledge?"})  # invoke() replaces the deprecated run()
print("\n[QA Answer]:", response["result"])




[QA Answer]: Vedic knowledge refers to the ancient and authoritative body of knowledge found in the Vedas, which are the oldest sacred texts of Hinduism. This knowledge encompasses a wide range of topics including spiritual, philosophical, and ritual knowledge. In the context provided, Vedic knowledge, or jnana, is described as general knowledge obtained through the study of scriptures and learning from teachers. It is considered indirect knowledge, as it is acquired before attaining self-realization. It encompasses teachings about the individual souls, various deities, and the attributes of the Lord Almighty, aiming to instill devotion in seekers and guide them on the spiritual path.


### Creating the RAG Chain

In [45]:
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = """
You are a Bhagavad Gita assistant who answers questions based on the following documents.

Context:
{context}

Question:
{question}

Answer:
"""

prompt_template = PromptTemplate(
    template = prompt,
    input_variables=["context","question"],
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm
)

response = rag_chain.invoke("When you feel angry which chapter to read?")
print(response.content)


When you feel angry, it's helpful to read Chapter 16 of the Bhagavad Gita. This chapter discusses demoniac tendencies, including anger, as afflictions that lead to destructive behavior and ultimate downfall. The chapter also provides guidance on how to overcome these tendencies by following scriptural injunctions and focusing on prescribed duties.
