Ask your PDFs questions. Get precise, source-backed answers. No hallucinations.
This project implements an end-to-end Retrieval-Augmented Generation (RAG) pipeline: users upload a PDF, the system semantically indexes its contents, and natural-language questions receive accurate, explainable answers grounded strictly in the document.
Built with BGE-M3 embeddings, semantic chunking, and cosine similarity search, the system is lightweight, transparent, and production-ready, making it well suited to research papers, textbooks, technical manuals, and reports.
- 📘 PDF → Knowledge Base: Turn any PDF into a searchable intelligence layer
- 🧠 Semantic Chunking: Preserves context instead of naive splitting
- 🔢 BGE-M3 Vector Embeddings: High-quality multilingual embeddings
- 📐 Cosine Similarity (scikit-learn): Fast & interpretable retrieval
- 🔍 Source-Aware Answers: Every response cites page & chunk references
- 🚫 No Hallucinations: LLM answers only from retrieved document context
- ⚡ Simple & Modular Pipeline: Easy to extend or swap components
This system follows a classic but powerful RAG architecture:
- Document Ingestion – Read and validate PDFs
- Text Processing – Extract, clean, normalize content
- Chunking Engine – Create semantic chunks with metadata
- Embedding Generator – Convert chunks into vector space
- Vector Store – Persist embeddings + metadata
- Query Pipeline – Embed user question & retrieve top matches
- LLM Reasoning Layer – Generate grounded answers using Gemini
```
read_pdf.py
    ↓
create_chunks.py
    ↓
embed_chunks.py
    ↓
query.py
```
Each stage is decoupled, making the system debuggable, extensible, and production-friendly.
- User provides a PDF document
- Supported formats validated before processing
- Reads PDF using PyPDF
- Extracts raw text from each page
- Handles malformed or scanned PDFs gracefully
- Converts PDF pages into structured plain text
- Preserves page boundaries for traceability
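A minimal sketch of this extraction stage, assuming the names `PageText`, `pages_to_records`, and `extract_pdf_text` (all hypothetical; the actual `read_pdf.py` may differ). The pypdf call is isolated so pages with no extractable text, such as scanned images, simply yield empty strings:

```python
from dataclasses import dataclass

@dataclass
class PageText:
    page_number: int  # 1-based, kept so chunks stay traceable to a page
    text: str

def pages_to_records(page_texts: list[str]) -> list[PageText]:
    """Attach 1-based page numbers to raw page strings."""
    return [PageText(i + 1, t) for i, t in enumerate(page_texts)]

def extract_pdf_text(path: str) -> list[PageText]:
    """Read a PDF with pypdf (`pip install pypdf`); one record per page.
    Pages without extractable text become empty strings instead of failing."""
    from pypdf import PdfReader  # lazy import keeps the helpers stdlib-only
    reader = PdfReader(path)
    return pages_to_records([(page.extract_text() or "") for page in reader.pages])
```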
- Removes headers, footers, noise
- Normalizes whitespace and encoding
- Optional translation to English for consistency
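A normalization helper along these lines covers the whitespace and encoding steps (header/footer removal is document-specific and omitted here; `clean_page_text` is an illustrative name, not necessarily the repo's):

```python
import re
import unicodedata

def clean_page_text(raw: str) -> str:
    """Normalize unicode and whitespace; cap runs of blank lines."""
    text = unicodedata.normalize("NFKC", raw)   # unify unicode forms
    text = re.sub(r"[ \t]+", " ", text)         # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)        # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)      # at most one blank line
    return text.strip()
```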
Instead of fixed-size splitting, the system creates semantic chunks:
- Maintains contextual meaning
- Attaches metadata: `chunk_id`, `page_number`, `chunk_text`

This dramatically improves retrieval accuracy.
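A simplified stand-in for the chunking stage, assuming a `make_chunks` helper (hypothetical; the real `create_chunks.py` may use a more sophisticated semantic boundary detector). It packs whole sentences into chunks and attaches the metadata described above:

```python
import re

def make_chunks(page_number: int, text: str, max_chars: int = 500) -> list[dict]:
    """Pack whole sentences into chunks up to max_chars, with metadata.
    A simplified sketch of semantic chunking: sentence boundaries are
    respected so a chunk never cuts a thought mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return [
        {"chunk_id": i, "page_number": page_number, "chunk_text": c}
        for i, c in enumerate(chunks)
    ]
```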
- Uses the BGE-M3 embedding model
- Converts each chunk into a dense vector
- Saves `embeddings.json` with the corresponding metadata

These vectors form the semantic memory of the document.
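One way to sketch this stage: encode with BGE-M3 via the FlagEmbedding library, then persist vectors next to their metadata. Both function names are illustrative, and the encoding call requires `pip install FlagEmbedding` plus a model download, so it is kept separate from the pure persistence helper:

```python
import json

def embed_chunks(chunks: list[dict]) -> list[list[float]]:
    """Encode chunk texts with BGE-M3 (requires `pip install FlagEmbedding`).
    Hypothetical wrapper; the repo's embed_chunks.py may differ."""
    from FlagEmbedding import BGEM3FlagModel
    model = BGEM3FlagModel("BAAI/bge-m3")
    dense = model.encode([c["chunk_text"] for c in chunks])["dense_vecs"]
    return [vec.tolist() for vec in dense]

def save_vector_store(chunks, vectors, path="embeddings.json"):
    """Persist each vector alongside its metadata so answers stay citable."""
    records = [{**c, "embedding": list(v)} for c, v in zip(chunks, vectors)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    return records
```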
- Lightweight JSON-based vector storage
- Includes:
  - Embedding vectors
  - Chunk text
  - Page & chunk references
Easily replaceable with FAISS / Pinecone / Weaviate later.
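Loading the store back is symmetrical; keeping the embeddings and records separate makes it easy to later hand the vectors to FAISS or another index instead (the function name is illustrative):

```python
import json

def load_vector_store(path="embeddings.json"):
    """Load the JSON store; return full records plus just the vectors,
    ready to hand to a similarity search (or a drop-in FAISS index)."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    vectors = [r["embedding"] for r in records]
    return records, vectors
```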
- User asks a natural-language question
- Example: "What is the composition of white Portland cement?"
- Question is embedded using the same BGE-M3 model
- Ensures vector space consistency
- Uses Cosine Similarity (scikit-learn)
- Compares query vector against all chunk embeddings
- Retrieves top-k most relevant chunks
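The retrieval step maps directly onto scikit-learn's `cosine_similarity`; a minimal sketch, with `retrieve_top_k` as an assumed helper name:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(query_vec, chunk_vecs, k=3):
    """Rank every stored chunk against the query by cosine similarity;
    return (chunk index, score) pairs, best match first."""
    sims = cosine_similarity(
        np.asarray(query_vec, dtype=float).reshape(1, -1),
        np.asarray(chunk_vecs, dtype=float),
    )[0]
    top = np.argsort(sims)[::-1][:k]  # highest-scoring indices first
    return [(int(i), float(sims[i])) for i in top]
```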
- Merges top-ranked chunks
- Builds a strict prompt for the LLM:
  - Answer only from provided context
  - Cite sources explicitly
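A strict prompt of this shape might be assembled as follows (`build_prompt` is an assumed name, and the exact wording of the rules is illustrative, not the repo's verbatim template):

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Assemble a grounded prompt: cited context chunks, then strict rules."""
    context = "\n\n".join(
        f"[Page {c['page_number']}, Chunk {c['chunk_id']}]\n{c['chunk_text']}"
        for c in retrieved
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, say you cannot answer.\n"
        "Cite the page and chunk for every claim.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is then sent to Gemini as-is, so the model's only knowledge source is the retrieved chunks.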
- Uses Google Gemini
- No external knowledge allowed
- Hallucination-resistant by design
- Accurate
- Explainable
- Source-aware
User Question:
What is the composition of white Portland cement?
System Answer:
White Portland cement is composed of dicalcium silicate (C2S, ~60%), tricalcium silicate (C3S, ~20–30%), and tricalcium aluminate (C3A, ~10%), along with the absence of iron oxide.
References:
- Page 18, Chunk 33
| Category | Tools |
|---|---|
| Language | Python 3.10+ |
| LLM | Google Gemini |
| Embeddings | BGE-M3 |
| Vector Search | Cosine Similarity (scikit-learn) |
| PDF Parsing | PyPDF |
| Architecture | Retrieval-Augmented Generation (RAG) |
| Utilities | python-dotenv |
- Research paper Q&A
- Legal / policy document analysis
- Technical manuals
- Educational material assistants
- Internal company knowledge bases