# Multi-Document QA System

## Project Overview
This project builds a Question-Answering system that can query across multiple PDF documents.
It allows you to:
- Load multiple PDF files simultaneously
- Query information across all documents
- Get answers with source attribution (which document and page)
- Compare information from different sources

## Use Cases
- Research: Compare information across multiple research papers
- Legal: Search through multiple contracts or legal documents
- Business: Analyze multiple reports or policies

## What You Will Learn
1. Loading multiple documents with metadata
2. Combining documents from different sources in one vector store
3. Source attribution and tracking
4. Cross-document information retrieval

## Step 1: Environment Setup

In [1]:
# Load environment variables
from dotenv import load_dotenv
import os

load_dotenv()
print("✅ Environment loaded")
print(f"OpenAI API Key found: {'OPENAI_API_KEY' in os.environ}")

✅ Environment loaded
OpenAI API Key found: True
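
The membership check above reports `True` even when the variable exists but is empty. A minimal sketch of a stricter check follows; `require_env` is a hypothetical helper name, not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: treats a missing variable and an empty or
    whitespace-only value the same way.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it before running."
        )
    return value
```

After `load_dotenv()`, calling `require_env("OPENAI_API_KEY")` would fail early with a readable message instead of a `KeyError` later in the notebook.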


## Step 2: Import Required Libraries

In [3]:
# Document loading and processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Embeddings and vector store
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# LLM and chains
from langchain_openai import ChatOpenAI
from langchain_classic.chains.retrieval_qa.base import RetrievalQA

# Utilities
from typing import List
import glob

print("✅ All libraries imported successfully")

✅ All libraries imported successfully


## Step 3: Load Multiple PDF Documents

For this demo, we load PDF files from an explicit list of paths.
Each document will retain its source file information in metadata.
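
If the PDFs all live in one directory, the list of paths can be built with the standard-library `glob` module instead of being typed by hand. This is a minimal sketch; `find_pdfs` is a hypothetical helper name, and the pattern only matches a lowercase `.pdf` extension on case-sensitive file systems.

```python
import glob
import os
from typing import List


def find_pdfs(directory: str, recursive: bool = False) -> List[str]:
    """Return a sorted list of PDF paths found in `directory`.

    With recursive=True, subdirectories are searched as well.
    """
    if recursive:
        pattern = os.path.join(directory, "**", "*.pdf")
    else:
        pattern = os.path.join(directory, "*.pdf")
    # Sorting makes the load order stable across runs and platforms.
    return sorted(glob.glob(pattern, recursive=recursive))
```

The result can be passed wherever a list of PDF paths is expected, for example `find_pdfs(".")` for the current working directory.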

In [5]:
def load_multiple_pdfs(pdf_paths: List[str]):
    """
    Load multiple PDF files and combine their documents.
    
    Args:
        pdf_paths: List of file paths to PDF documents
        
    Returns:
        List of document objects with metadata
    """
    all_documents = []
    
    for pdf_path in pdf_paths:
        # Load each PDF
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()
        
        # Add custom metadata to track source file
        for doc in documents:
            # Extract just the filename (not full path)
            doc.metadata['source_file'] = os.path.basename(pdf_path)
        
        all_documents.extend(documents)
        print(f"✅ Loaded {len(documents)} pages from {os.path.basename(pdf_path)}")
    
    return all_documents

# Example: PDF files located in the current working directory
# Modify this list to point at your own PDF files
pdf_files = [
    "llm_fundamentals.pdf",
    # Add more PDF paths here
    "Cover_Letter.pdf",
    "prakyath_resume.pdf",
]

# Load all documents
all_documents = load_multiple_pdfs(pdf_files)
print(f"\n📚 Total documents loaded: {len(all_documents)} pages from {len(pdf_files)} files")

✅ Loaded 8 pages from llm_fundamentals.pdf
✅ Loaded 1 pages from Cover_Letter.pdf
✅ Loaded 1 pages from prakyath_resume.pdf

📚 Total documents loaded: 10 pages from 3 files


## Step 4: Split Documents into Chunks

We split the documents while preserving the source metadata.

In [7]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Size of each chunk
    chunk_overlap=50,      # Overlap between chunks to maintain context
    length_function=len,   # Function to measure chunk length
    separators=["\n\n", "\n", " ", ""]  # Try paragraphs first, then lines, then words, then characters
)

# Split all documents
chunks = text_splitter.split_documents(all_documents)

print(f"✅ Split {len(all_documents)} pages into {len(chunks)} chunks")
print("\nSample chunk with metadata:")
print(f"Source File: {chunks[0].metadata.get('source_file', 'Unknown')}")
print(f"Page: {chunks[0].metadata.get('page', 'Unknown')}")
print(f"Content Preview: {chunks[0].page_content[:100]}...")

✅ Split 10 pages into 50 chunks

Sample chunk with metadata:
Source File: llm_fundamentals.pdf
Page: 0
Content Preview: @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Co...
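
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at the listed separators and treats `chunk_size` as an upper bound); `sliding_window_chunks` is a hypothetical name used only for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width chunks that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With `chunk_size=500` and `chunk_overlap=50`, each window advances 450 characters, so the last 50 characters of one chunk reappear at the start of the next.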


## Step 5: Create Embeddings and Vector Store

Store all chunks in a single vector store for unified search.

In [8]:
# Initialize embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Create vector store from all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="multi_document_collection"
)

print(f"✅ Vector store created with {vectorstore._collection.count()} chunks")
print(f"   Chunks from {len(pdf_files)} different documents")

✅ Vector store created with 50 chunks
   Chunks from 3 different documents
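
Conceptually, a vector store ranks chunks by how close their embedding vectors are to the query's embedding. The sketch below illustrates that idea with cosine similarity over plain Python lists. It is an assumption for illustration only: the distance metric Chroma actually applies depends on how the collection is configured, and `top_k` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because every chunk from every PDF sits in the same index, a single ranking like this covers all loaded documents at once.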


## Step 6: Initialize LLM

In [9]:
# Initialize OpenAI LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,  # Moderate randomness; lower values (e.g. 0) give more deterministic answers
    api_key=os.environ["OPENAI_API_KEY"]
)

print("✅ LLM initialized")

✅ LLM initialized


## Step 7: Create Multi-Document QA Chain

This chain will retrieve relevant chunks from ANY of the loaded documents.

In [10]:
# Create QA chain with MMR retrieval for diverse results
multi_doc_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Put all retrieved context in the prompt
    retriever=vectorstore.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance for diverse results
        search_kwargs={
            "k": 5,          # Return top 5 chunks
            "fetch_k": 20    # Consider top 20 for diversity selection
        }
    ),
    return_source_documents=True  # Return source chunks for attribution
)

print("✅ Multi-Document QA Chain created")

✅ Multi-Document QA Chain created
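
Maximum Marginal Relevance first gathers a pool of candidates (the `fetch_k` chunks most similar to the query) and then picks `k` of them one at a time, trading relevance against similarity to what is already selected. The sketch below shows that selection loop on plain vectors. It is an illustration under assumptions, not LangChain's internal code; `mmr_select` and the `lambda_mult` default of 0.5 are chosen here for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen by Maximum Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a multi-document setting this matters because near-duplicate chunks from one PDF would otherwise crowd out relevant chunks from the others.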


## Step 8: Query Across Multiple Documents

Now we can ask questions and get answers from any of the loaded documents.

In [11]:
def query_multi_documents(question: str):
    """
    Query the multi-document QA system and display results with source attribution.
    
    Args:
        question: The question to ask
    """
    # Get answer from the chain
    result = multi_doc_qa_chain.invoke({"query": question})
    
    # Display question and answer
    print(f"\n{'='*80}")
    print(f"QUESTION: {question}")
    print(f"{'='*80}\n")
    print(f"ANSWER:\n{result['result']}\n")
    
    # Display sources with file and page information
    print(f"{'='*80}")
    print(f"SOURCES ({len(result['source_documents'])} chunks):")
    print(f"{'='*80}\n")
    
    for i, doc in enumerate(result['source_documents'], 1):
        source_file = doc.metadata.get('source_file', 'Unknown')
        page = doc.metadata.get('page', 'Unknown')
        
        print(f"Source {i}: {source_file} (Page {page})")
        print(f"   Content: {doc.page_content[:150]}...")
        print()

# Example query
query_multi_documents("What is LoRA and how does it work?")


QUESTION: What is LoRA and how does it work?

ANSWER:
LoRA (Low-Rank Adaptation) is a method used to fine-tune large language models efficiently. It works by adding low-rank matrices to the original model's weights, allowing for the update of only a small portion of the model during training. This approach reduces the computational resources required for fine-tuning by focusing on a smaller set of parameters, making it possible to adapt large models even on modest hardware.

SOURCES (5 chunks):

Source 1: llm_fundamentals.pdf (Page 2)
   Content: 9. QLoRA → LoRA + quantization, enabling fine-tuning of huge models on modest hardware
10. PEFT → Family of methods (e.g., LoRA, QLoRA, adapters) upd...

Source 2: llm_fundamentals.pdf (Page 0)
   Content: @genieincodebottle 
Instagram | GitHub | Medium | YouTube 
How to Be Better Than Most in GenAI 
 
Contents 
 
Core LLM Building Blocks ..................

Source 3: prakyath_resume.pdf (Page 0)
   Content: Calling, LangGraph, Multi-Age
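
Several retrieved chunks often come from the same file and page, so the raw source list can be condensed into one citation per file. The sketch below works on plain metadata dictionaries; `format_citations` is a hypothetical helper. It also converts page indices to 1-based numbers, on the assumption (consistent with the `Page: 0` shown for the first page above) that the loader's `page` values are zero-based.

```python
from typing import Any, Dict, List


def format_citations(metadatas: List[Dict[str, Any]]) -> List[str]:
    """Collapse chunk metadata into one citation string per source file."""
    pages_by_file: Dict[str, List[int]] = {}
    for meta in metadatas:
        name = meta.get("source_file", "Unknown")
        pages = pages_by_file.setdefault(name, [])
        page = meta.get("page")
        # Assumed zero-based index: add 1 for a human-readable page number.
        if isinstance(page, int) and page + 1 not in pages:
            pages.append(page + 1)

    citations = []
    for name, pages in pages_by_file.items():
        if pages:
            page_list = ", ".join(str(p) for p in sorted(pages))
            citations.append(f"{name} (p. {page_list})")
        else:
            citations.append(name)
    return citations
```

It could be called as `format_citations([doc.metadata for doc in result["source_documents"]])` inside `query_multi_documents`.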

## Step 9: Try More Questions

Test with different types of questions.

In [12]:
# Question about a specific topic
query_multi_documents("What are the different types of attention mechanisms?")


QUESTION: What are the different types of attention mechanisms?

ANSWER:
The different types of attention mechanisms are:

1. **Attention** - Highlights the most relevant tokens in context.
2. **Self-Attention** - Each token attends to every other token for context.
3. **Cross-Attention** - Connects the encoder and decoder in encoder-decoder models.
4. **Multi-Head Attention** - Several attention heads capture different patterns in parallel.

SOURCES (5 chunks):

Source 1: llm_fundamentals.pdf (Page 1)
   Content: 5. Attention → Highlights the most relevant tokens in context
6. Self-Attention → Each token attends to every other token for context
7. Cross-Atten...

Source 2: llm_fundamentals.pdf (Page 6)
   Content: 11. Self-Consistency → Compare multiple reasoning paths → pick best answer
12. Tree of Thoughts (ToT) → Explore many reasoning branches before decidi...

Source 3: llm_fundamentals.pdf (Page 7)
   Content: 3. Guardrails → Rule-based or learned filters to prev

In [13]:
# Comparative question (if you have multiple documents)
query_multi_documents("Compare the approaches to fine-tuning discussed in the documents")


QUESTION: Compare the approaches to fine-tuning discussed in the documents

ANSWER:
The documents discuss several approaches to fine-tuning, each with its unique characteristics:

1. **Quantization-Aware Training (QAT)**: This approach involves fine-tuning models while considering quantization to retain accuracy. It aims to prepare models for deployment in environments with limited resources by adjusting the model's weights to minimize the loss of performance when quantized.

2. **QLoRA**: This is a combination of LoRA (Low-Rank Adaptation) and quantization, which enables the fine-tuning of large models on modest hardware. It allows for efficient adaptation of models by updating only a small number of parameters, making it suitable for situations where hardware resources are limited.

3. **PEFT (Parameter-Efficient Fine-Tuning)**: This method includes techniques like LoRA, QLoRA, and adapters that focus on updating only small parts of the model rather than the entire model. This appro

## Step 10: Filter by Source Document (Optional)

Query specific documents only.

In [14]:
def query_specific_document(question: str, source_file: str):
    """
    Query a specific document by filtering on source_file metadata.
    
    Args:
        question: The question to ask
        source_file: The filename to search in
    """
    # Create retriever with metadata filter
    filtered_retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={
            "k": 5,
            "fetch_k": 20,
            "filter": {"source_file": source_file}  # Filter by source file
        }
    )
    
    # Create temporary chain with filtered retriever
    filtered_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=filtered_retriever,
        return_source_documents=True
    )
    
    # Get answer
    result = filtered_chain.invoke({"query": question})
    
    print(f"\n{'='*80}")
    print(f"FILTERED QUERY (Document: {source_file})")
    print(f"QUESTION: {question}")
    print(f"{'='*80}\n")
    print(f"ANSWER:\n{result['result']}\n")

# Example: Query only the llm_fundamentals.pdf
query_specific_document(
    "What is attention mechanism?",
    "llm_fundamentals.pdf"
)


FILTERED QUERY (Document: llm_fundamentals.pdf)
QUESTION: What is attention mechanism?

ANSWER:
The attention mechanism is a technique used in machine learning, particularly in natural language processing and computer vision, to focus on specific parts of the input data that are most relevant for a given task. It allows models to weigh different tokens or elements differently based on their importance in the context, enhancing the model's ability to capture relevant information and dependencies. This mechanism helps improve the performance of various models, especially in tasks involving sequences, such as translation or summarization.
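
The same idea extends to querying a subset of documents rather than just one. The sketch below only builds the filter dictionary. It assumes the installed Chroma version supports the `$in` operator in metadata filters, which is worth verifying against your version; `build_source_filter` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Sequence


def build_source_filter(source_files: Sequence[str]) -> Dict[str, Any]:
    """Build a metadata filter that matches one or several source files."""
    # Remove duplicates while keeping the caller's order.
    names: List[str] = list(dict.fromkeys(source_files))
    if not names:
        raise ValueError("at least one source file is required")
    if len(names) == 1:
        # Same shape as the single-file filter used above.
        return {"source_file": names[0]}
    # Assumption: the vector store accepts Chroma-style "$in" filters.
    return {"source_file": {"$in": names}}
```

The returned dictionary would take the place of the `"filter"` value in `search_kwargs`.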



## Summary

### What Was Built:
- ✅ Multi-document loading system with metadata tracking
- ✅ Unified vector store for cross-document search
- ✅ QA system with source attribution (file + page)
- ✅ Document-specific filtering capability

### Key Concepts Learned:
1. **Metadata Management**: Tracking source files and pages
2. **Document Combination**: Merging multiple sources in one vector store
3. **Source Attribution**: Showing which document provided the answer
4. **Filtered Retrieval**: Querying specific documents
