# 📚 PDF Question-Answer System with LangChain

Complete implementation of a PDF Q&A system using:
- Vector Database (FAISS/ChromaDB)
- Semantic Search with Embeddings
- LLM Integration (Gemini/OpenAI)
- Source Citations from PDF Pages

## Step 1: Installation

In [1]:
%pip install langchain langchain-community langchain-google-genai langchain-openai -q
%pip install pypdf chromadb faiss-cpu sentence-transformers -q
%pip install tiktoken reportlab -q

print("✅ All packages installed!")

Note: you may need to restart the kernel to use updated packages.
✅ All packages installed!

## Step 2: Import Libraries

In [1]:
import os
import json
import warnings
from getpass import getpass
from datetime import datetime
warnings.filterwarnings('ignore')

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## Step 3: API Configuration

In [2]:
# Choose the LLM backend: 'gemini' or 'openai'
PROVIDER = 'gemini'

if PROVIDER == 'gemini':
    # getpass keeps the key out of the notebook output
    api_key = getpass("Enter your Google Gemini API key: ")
    os.environ['GOOGLE_API_KEY'] = api_key
    print("✅ Gemini API configured")
elif PROVIDER == 'openai':
    api_key = getpass("Enter your OpenAI API key: ")
    os.environ['OPENAI_API_KEY'] = api_key
    print("✅ OpenAI API configured")
else:
    raise ValueError(f"Unknown PROVIDER: {PROVIDER!r}")

✅ Gemini API configured
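
If the key is already exported in the shell, prompting for it again is unnecessary. A minimal sketch of that idea, assuming a hypothetical helper `get_api_key` (not part of this notebook) that checks the environment first and only then falls back to a prompt function:

```python
import os
from getpass import getpass


def get_api_key(env_var, prompt_text, prompt_fn=getpass):
    """Return the key stored in env_var, prompting only if it is missing.

    get_api_key and prompt_fn are hypothetical names used for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Nothing in the environment: ask the user and remember the answer
        key = prompt_fn(prompt_text).strip()
        if not key:
            raise ValueError(f"No API key provided for {env_var}")
        os.environ[env_var] = key
    return key
```

In the cell above this could be called as `get_api_key('GOOGLE_API_KEY', "Enter your Google Gemini API key: ")`; passing a different `prompt_fn` makes the helper easy to exercise without a terminal.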


## Step 4: PDF Processor Class

In [3]:
class PDFProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )
    
    def load_pdf(self, pdf_path):
        print(f"📖 Loading PDF: {pdf_path}")
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()  # one Document per PDF page
        print(f"✅ Loaded {len(documents)} pages")
        return documents
    
    def split_documents(self, documents):
        chunks = self.text_splitter.split_documents(documents)
        print(f"✅ Created {len(chunks)} text chunks")
        return chunks
    
    def get_document_stats(self, documents, chunks):
        total_chars = sum(len(doc.page_content) for doc in documents)
        total_words = sum(len(doc.page_content.split()) for doc in documents)
        # Measure the chunks themselves: overlapping chunks repeat text,
        # so page characters divided by chunk count would understate the size
        chunk_chars = sum(len(chunk.page_content) for chunk in chunks)
        return {
            'pages': len(documents),
            'chunks': len(chunks),
            'total_characters': total_chars,
            'total_words': total_words,
            'avg_chunk_size': chunk_chars // len(chunks) if chunks else 0
        }

pdf_processor = PDFProcessor()
print("✅ PDF Processor initialized")

✅ PDF Processor initialized
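
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It assumes a plain sliding window over characters; `sliding_window_chunks` is a hypothetical name, and the real `RecursiveCharacterTextSplitter` additionally tries to break on the configured separators rather than cutting mid-word.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it ignores separators and may cut words in half.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(text, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one. That shared tail is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk.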


## Step 5: Create Sample PDF

In [4]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_sample_pdf(filename="sample_ml_document.pdf"):
    c = canvas.Canvas(filename, pagesize=letter)
    width, height = letter
    
    c.setFont("Helvetica-Bold", 16)
    c.drawString(50, height - 50, "Machine Learning Guide")
    c.setFont("Helvetica", 12)
    y = height - 100
    
    text1 = [
        "Machine learning is a subset of AI that enables systems to learn",
        "from experience without being explicitly programmed.",
        "",
        "Types of Machine Learning:",
        "1. Supervised Learning: Uses labeled data",
        "2. Unsupervised Learning: Finds patterns in unlabeled data",
        "3. Reinforcement Learning: Learns through trial and error"
    ]
    
    for line in text1:
        c.drawString(50, y, line)
        y -= 20
    
    c.showPage()
    
    c.setFont("Helvetica-Bold", 14)
    c.drawString(50, height - 50, "Popular ML Algorithms")
    c.setFont("Helvetica", 12)
    y = height - 100
    
    text2 = [
        "Linear Regression: Predicts continuous values.",
        "Decision Trees: Tree-like model for classification.",
        "Neural Networks: Complex pattern recognition.",
        "Random Forest: Ensemble of decision trees.",
        "SVM: Finds optimal hyperplane for classification."
    ]
    
    for line in text2:
        c.drawString(50, y, line)
        y -= 20
    
    c.save()
    print(f"✅ Sample PDF created: {filename}")
    return filename

pdf_path = create_sample_pdf()

✅ Sample PDF created: sample_ml_document.pdf


## Step 6: Upload PDF

In [5]:
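try:
    # Only available when running inside Google Colab
    from google.colab import files
    print("📤 Upload your PDF:")
    uploaded = files.upload()
    pdf_path = list(uploaded.keys())[0]
    print(f"✅ Uploaded: {pdf_path}")
except (ImportError, IndexError):
    # Not in Colab, or no file was selected: fall back to the sample PDF
    print(f"Using: {pdf_path}")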

Using: sample_ml_document.pdf


## Step 7: Load and Process PDF

In [6]:
documents = pdf_processor.load_pdf(pdf_path)
chunks = pdf_processor.split_documents(documents)
stats = pdf_processor.get_document_stats(documents, chunks)

print("\n📊 Document Statistics:")
print("=" * 50)
print(f"Pages: {stats['pages']}")
print(f"Chunks: {stats['chunks']}")
print(f"Words: {stats['total_words']:,}")
print(f"Characters: {stats['total_characters']:,}")

print("\n📝 First Chunk:")
print(chunks[0].page_content[:200] + "...")

📖 Loading PDF: sample_ml_document.pdf
✅ Loaded 2 pages
✅ Created 2 text chunks

📊 Document Statistics:
Pages: 2
Chunks: 2
Words: 78
Characters: 585

📝 First Chunk:
Machine Learning Guide
Machine learning is a subset of AI that enables systems to learn
from experience without being explicitly programmed.
Types of Machine Learning:
1. Supervised Learning: Uses lab...


## Step 8: Create Vector Store

In [7]:
VECTOR_DB = 'faiss'

print("🔄 Creating embeddings...")

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

if VECTOR_DB == 'faiss':
    vectorstore = FAISS.from_documents(chunks, embeddings)
    print("✅ FAISS vector store created")
else:
    vectorstore = Chroma.from_documents(
        chunks, embeddings, persist_directory="./pdf_chroma_db"
    )
    print("✅ ChromaDB vector store created")

print(f"📊 Indexed {len(chunks)} chunks")

🔄 Creating embeddings...
✅ FAISS vector store created
📊 Indexed 2 chunks


## Step 9: Initialize LLM

In [8]:
if PROVIDER == 'gemini':
    llm = ChatGoogleGenerativeAI(model="gemini-flash-latest", temperature=0)
    print("✅ Gemini LLM initialized")
else:
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    print("✅ OpenAI LLM initialized")

✅ Gemini LLM initialized


## Step 10: Create Q&A Chain

In [9]:
prompt_template = """You are a helpful AI assistant answering questions based on a PDF.

Use the context from the PDF to answer the question.
If you cannot find the answer, say so.
Always cite the page number.

Context: {context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

print("✅ Q&A chain created!")

✅ Q&A chain created!
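
The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of the prompt. The prompt asks the model to cite page numbers, but the model can only do that reliably if the page numbers appear in the context text itself. The sketch below illustrates the idea with plain dicts standing in for LangChain `Document` objects; `build_context` is a hypothetical helper, not something the chain above calls. In LangChain a comparable effect is usually achieved with a custom `document_prompt` passed through `chain_type_kwargs`, which is worth checking against the installed version.

```python
PROMPT_TEMPLATE = (
    "Use the context from the PDF to answer the question.\n"
    "Always cite the page number.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_context(docs):
    """Join retrieved chunks, prefixing each with a 1-based page label."""
    parts = []
    for doc in docs:
        page = doc["metadata"].get("page")  # PyPDFLoader pages are 0-based
        label = page + 1 if isinstance(page, int) else "unknown"
        parts.append(f"[Page {label}]\n{doc['page_content']}")
    return "\n\n".join(parts)


# Plain dicts stand in for LangChain Document objects (illustration only).
docs = [
    {"page_content": "Machine learning is a subset of AI.", "metadata": {"page": 0}},
    {"page_content": "Random Forest: Ensemble of decision trees.", "metadata": {"page": 1}},
]
prompt = PROMPT_TEMPLATE.format(
    context=build_context(docs),
    question="What is machine learning?",
)
print(prompt)
```

With the labels embedded in the context, a citation such as "(Page 1)" can be traced back to a specific chunk instead of relying on the model to guess.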


## Step 11: Ask Questions Function

In [10]:
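def ask_pdf(question, show_sources=True):
    """Run the Q&A chain on a question and print the answer with its sources."""
    print(f"\n{'='*70}")
    print(f"❓ Question: {question}")
    print(f"{'='*70}")
    
    result = qa_chain.invoke({"query": question})
    
    print("\n💡 Answer:")
    print(result['result'])
    
    if show_sources and result.get('source_documents'):
        print("\n📚 Sources:")
        for i, doc in enumerate(result['source_documents'], 1):
            # PyPDFLoader stores 0-based page indices; show them 1-based.
            # Guard against missing metadata instead of adding 1 to a string.
            page = doc.metadata.get('page')
            page_label = page + 1 if isinstance(page, int) else 'Unknown'
            print(f"\n[{i}] Page {page_label}:")
            print(doc.page_content[:200] + "...")
    
    return result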

## Step 12: Test Questions

In [11]:
_ = ask_pdf("What is machine learning?")  # assign the result so the notebook does not echo the raw dict


❓ Question: What is machine learning?

💡 Answer:
Machine learning is a subset of AI that enables systems to learn from experience without being explicitly programmed (Page 1).

📚 Sources:

[1] Page 1:
Machine Learning Guide
Machine learning is a subset of AI that enables systems to learn
from experience without being explicitly programmed.
Types of Machine Learning:
1. Supervised Learning: Uses lab...

[2] Page 2:
Popular ML Algorithms
Linear Regression: Predicts continuous values.
Decision Trees: Tree-like model for classification.
Neural Networks: Complex pattern recognition.
Random Forest: Ensemble of decisi...

In [12]:
_ = ask_pdf("What are the types of machine learning?")


❓ Question: What are the types of machine learning?

💡 Answer:
The types of machine learning are:
1. Supervised Learning: Uses labeled data
2. Unsupervised Learning: Finds patterns in unlabeled data
3. Reinforcement Learning: Learns through trial and error (Page 1)

📚 Sources:

[1] Page 1:
Machine Learning Guide
Machine learning is a subset of AI that enables systems to learn
from experience without being explicitly programmed.
Types of Machine Learning:
1. Supervised Learning: Uses lab...

[2] Page 2:
Popular ML Algorithms
Linear Regression: Predicts continuous values.
Decision Trees: Tree-like model for classification.
Neural Networks: Complex pattern recognition.
Random Forest: Ensemble of decisi...

In [13]:
_ = ask_pdf("Name some machine learning algorithms")


❓ Question: Name some machine learning algorithms

💡 Answer:
Some machine learning algorithms include:

*   Linear Regression
*   Decision Trees
*   Neural Networks
*   Random Forest
*   SVM

(Context)

📚 Sources:

[1] Page 2:
Popular ML Algorithms
Linear Regression: Predicts continuous values.
Decision Trees: Tree-like model for classification.
Neural Networks: Complex pattern recognition.
Random Forest: Ensemble of decisi...

[2] Page 1:
Machine Learning Guide
Machine learning is a subset of AI that enables systems to learn
from experience without being explicitly programmed.
Types of Machine Learning:
1. Supervised Learning: Uses lab...

## Step 13: Similarity Search Test

In [14]:
def test_similarity_search(query, k=4):
    print(f"\n🔍 Search: '{query}'")
    print("=" * 70)
    docs = vectorstore.similarity_search(query, k=k)
    for i, doc in enumerate(docs, 1):
        page = doc.metadata.get('page', 0)
        print(f"\n[{i}] Page {page + 1}:")
        print(doc.page_content[:200] + "...")
    return docs

_ = test_similarity_search("algorithms")  # assign the result to suppress the echoed Document list


🔍 Search: 'algorithms'

[1] Page 2:
Popular ML Algorithms
Linear Regression: Predicts continuous values.
Decision Trees: Tree-like model for classification.
Neural Networks: Complex pattern recognition.
Random Forest: Ensemble of decisi...

[2] Page 1:
Machine Learning Guide
Machine learning is a subset of AI that enables systems to learn
from experience without being explicitly programmed.
Types of Machine Learning:
1. Supervised Learning: Uses lab...
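
The search above ranks chunks by how close their embedding is to the query embedding. Because the embeddings were created with `normalize_embeddings=True`, every vector has unit length, so cosine similarity reduces to a dot product, and ranking by Euclidean distance gives the same order (for unit vectors, squared distance equals 2 minus twice the dot product). A minimal pure-Python sketch of that ranking, using made-up three-dimensional vectors and hypothetical helper names `normalize` and `top_k`:

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k highest dot products."""
    q = normalize(query_vec)
    scored = []
    for i, vec in enumerate(doc_vecs):
        d = normalize(vec)
        scored.append((i, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real MiniLM vectors have 384 dimensions.
doc_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
print(top_k([0.9, 0.1, 0.0], doc_vecs, k=2))
```

Asking for more results than there are vectors simply returns all of them, which matches the run above: `k=4` was requested, but the index holds only two chunks.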

## Step 14: Interactive Q&A Session

In [15]:
def interactive_qa():
    print("\n" + "="*70)
    print("📚 Interactive PDF Q&A")
    print("="*70)
    print("Type 'quit' to exit\n")
    
    while True:
        question = input("\n❓ Question: ").strip()
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("\n👋 Goodbye!")
            break
        
        if not question:
            continue
        
        result = qa_chain.invoke({"query": question})
        print(f"\n💡 {result['result']}")
        
        if result.get('source_documents'):
            pages = set(d.metadata.get('page', 0) + 1 for d in result['source_documents'])
            print(f"📄 Pages: {', '.join(map(str, sorted(pages)))}")

## Step 15: Q&A History Tracker

In [16]:
class QAHistory:
    def __init__(self):
        self.history = []
    
    def add_qa(self, question, answer, sources=None):
        entry = {
            'timestamp': datetime.now().isoformat(),
            'question': question,
            'answer': answer,
            # De-duplicate: several retrieved chunks can come from the same page
            'pages': sorted({d.metadata.get('page', 0) + 1 for d in sources}) if sources else []
        }
        self.history.append(entry)
    
    def export_json(self, filename='qa_history.json'):
        # Explicit UTF-8 so the file is written the same way on every platform
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.history, f, indent=2)
        print(f"✅ Exported to {filename}")
    
    def export_text(self, filename='qa_history.txt'):
        # Answers may contain non-ASCII text, which the platform default
        # encoding (for example cp1252 on Windows) cannot always write
        with open(filename, 'w', encoding='utf-8') as f:
            f.write("PDF Q&A History\n" + "="*70 + "\n\n")
            for i, e in enumerate(self.history, 1):
                f.write(f"Q{i}: {e['question']}\n")
                f.write(f"A{i}: {e['answer']}\n")
                f.write(f"Pages: {', '.join(map(str, e['pages']))}\n")
                f.write("-"*70 + "\n\n")
        print(f"✅ Exported to {filename}")

qa_history = QAHistory()
print("✅ History tracker ready")

✅ History tracker ready
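
Once a history file has been exported as JSON, it can be read back for simple analysis. The sketch below assumes entries shaped like the ones `add_qa` builds; the sample entries and the helper `pages_cited` are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def pages_cited(history):
    """Count how often each page number appears across history entries."""
    counts = Counter()
    for entry in history:
        counts.update(entry.get("pages", []))
    return counts


# Hypothetical entries shaped like the ones QAHistory.add_qa builds.
sample_history = [
    {"timestamp": "2025-01-01T00:00:00", "question": "Q1", "answer": "A1", "pages": [1, 2]},
    {"timestamp": "2025-01-01T00:01:00", "question": "Q2", "answer": "A2", "pages": [1]},
]

# Write to a temporary directory so the example leaves the working folder alone
path = os.path.join(tempfile.mkdtemp(), "qa_history.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_history, f, indent=2)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(pages_cited(loaded).most_common())
```

A count like this gives a quick view of which pages the retriever leans on most across a session.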


## Step 16: Complete System Class

In [20]:
class PDFQuestionAnswerSystem:
    def __init__(self, llm_provider='gemini', vector_db='faiss'):
        self.llm_provider = llm_provider
        self.vector_db = vector_db
        self.pdf_processor = PDFProcessor()
        self.vectorstore = None
        self.qa_chain = None
    
    def load_pdf(self, pdf_path):
        self.documents = self.pdf_processor.load_pdf(pdf_path)
        self.chunks = self.pdf_processor.split_documents(self.documents)
        return self.pdf_processor.get_document_stats(self.documents, self.chunks)
    
    def create_vectorstore(self):
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        if self.vector_db == 'faiss':
            self.vectorstore = FAISS.from_documents(self.chunks, embeddings)
        else:
            self.vectorstore = Chroma.from_documents(
                self.chunks, embeddings, persist_directory="./pdf_db"
            )
        print("✅ Vector store created")
    
    def initialize_qa(self):
        if self.llm_provider == 'gemini':
            llm = ChatGoogleGenerativeAI(model="gemini-flash-latest", temperature=0)
        else:
            llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        
        prompt = PromptTemplate(
            template="Context: {context}\n\nQuestion: {question}\n\nAnswer:",
            input_variables=["context", "question"]
        )
        
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 4}),
            return_source_documents=True,
            chain_type_kwargs={"prompt": prompt}
        )
        print("✅ Q&A system ready!")
    
    def ask(self, question):
        result = self.qa_chain.invoke({"query": question})
        return result['result'], result.get('source_documents', [])
    
    def setup(self, pdf_path):
        stats = self.load_pdf(pdf_path)
        self.create_vectorstore()
        self.initialize_qa()
        return stats

print("✅ Complete system class defined")

✅ Complete system class defined


## Step 17: Quick Start

In [21]:
# Quick start example
qa_system = PDFQuestionAnswerSystem(llm_provider='gemini')
stats = qa_system.setup(pdf_path)

print(f"\n📊 Loaded: {stats['pages']} pages, {stats['total_words']:,} words")

questions = [
    "What is this document about?",
    "What are the main topics?"
]

for q in questions:
    print(f"\nQ: {q}")
    answer, sources = qa_system.ask(q)
    print(f"A: {answer}")

📖 Loading PDF: sample_ml_document.pdf
✅ Loaded 2 pages
✅ Created 2 text chunks
✅ Vector store created
✅ Q&A system ready!

📊 Loaded: 2 pages, 78 words

Q: What is this document about?
A: This document is about **Machine Learning**, specifically covering its definition, main types (Supervised, Unsupervised, Reinforcement), and popular algorithms.

Q: What are the main topics?
A: The main topics covered in the provided text are:

1.  **Popular Machine Learning Algorithms** (and their primary functions, e.g., Linear Regression, Decision Trees, SVM).
2.  **The Definition and Core Concepts of Machine Learning** (as a subset of AI).
3.  **The Three Main Types of Machine Learning** (Supervised, Unsupervised, and Reinforcement Learning).


## Summary

### Built:
- ✅ PDF processing
- ✅ Vector database (FAISS/ChromaDB)
- ✅ Semantic search
- ✅ LLM integration
- ✅ Q&A with citations
- ✅ History tracking
- ✅ Interactive mode