RAG Document Bot
A Retrieval-Augmented Generation (RAG) system that lets you ask questions about PDF documents and get AI-generated answers based on the document content.
Features
PDF Text Extraction: Extracts text from PDF documents using pdfplumber
Text Chunking: Splits documents into manageable chunks for better processing
Semantic Search: Uses sentence transformers to create embeddings and FAISS for efficient similarity search
AI-Powered Answers: Generates contextual answers using Google's FLAN-T5 model
Interactive Q&A: Command-line interface for asking questions about your documents
How It Works
Document Processing: PDF text is extracted and split into chunks
Embedding Creation: Each chunk is converted to a vector embedding using SentenceTransformers
Index Building: A FAISS index is built for fast similarity search
Query Processing: User questions are embedded and matched against document chunks
Answer Generation: The retrieved chunks are used as context for AI answer generation
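
The query-processing step above can be illustrated with a small, dependency-free sketch. It is not the project's code: bag-of-words counts stand in for SentenceTransformers embeddings, a brute-force cosine ranking stands in for the FAISS index, and the names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the indices of the k chunks most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(chunks)]
    # Highest similarity first; ties broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


chunks = [
    "the map shows every corridor in the castle",
    "lunch is served at noon in the hall",
    "the castle has seven secret corridor entrances",
]
retrieved_ids = top_k("secret corridor in the castle", chunks, k=2)
print(retrieved_ids)
```

A real embedding model captures meaning beyond shared words, and FAISS avoids scoring every chunk on each query, but the shape of the step is the same: embed the question, rank the chunks, keep the best `k`.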
Installation
Clone the repository:
    git clone <repository-url>
    cd ragdocbot
Install required dependencies:
    pip install -r requirements.txt
Add your PDF documents to the data/sample_docs/ directory
Usage
Place your PDF file in the data/sample_docs/ directory (the application is currently configured for Marauders.pdf)
Run the application:
    python run.py
When prompted, ask a question about your document:
Ask a question: What is the main topic of this document?
The system will process your query and provide an AI-generated answer based on the document content.
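
As a rough illustration of how retrieved chunks might become context for the generator, the sketch below assembles a prompt string. The template wording and the `build_prompt` helper are assumptions made for this example; the actual prompt used in src/generator.py may differ.

```python
def build_prompt(question, chunks, retrieved_ids):
    """Join the retrieved chunks into a context block followed by the question."""
    context = "\n".join(chunks[i] for i in retrieved_ids)
    # Hypothetical template; adjust the wording to suit the model.
    return (
        "Answer the question using the context.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


chunks = [
    "Chapter one introduces the map.",
    "Chapter two lists the authors.",
    "Chapter three covers the passages.",
]
prompt = build_prompt("Who wrote the map?", chunks, retrieved_ids=[1, 0])
print(prompt)
```

Only the chunks selected by retrieval reach the model, which keeps the prompt short and the answer tied to the document.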
Project Structure
ragdocbot/
├── src/
│   ├── app.py          # Main application logic
│   ├── chunker.py      # PDF text extraction and chunking
│   ├── embedder.py     # Text embedding using SentenceTransformers
│   ├── retriever.py    # FAISS indexing and search
│   └── generator.py    # Answer generation using FLAN-T5
├── data/
│   └── sample_docs/    # Place your PDF documents here
├── requirements.txt    # Python dependencies
├── run.py              # Application entry point
└── README.md           # This file
Dependencies
sentence-transformers - For creating text embeddings
faiss-cpu - For efficient similarity search
openai - OpenAI API client (if needed for future enhancements)
pdfplumber - PDF text extraction
tqdm - Progress bars
numpy - Numerical operations
transformers - Hugging Face transformers for text generation
Configuration
Adjusting Chunk Size
Modify the chunk_size parameter in src/app.py:
    chunks = chunk_text(text, chunk_size=300)  # Adjust as needed
Changing the PDF File
Update the pdf_path in src/app.py:
    pdf_path = "data/sample_docs/your_document.pdf"
Modifying Retrieval Parameters
Adjust the number of retrieved chunks in src/app.py:
    retrieved_ids = search_faiss(index, np.array(query_embedding), k=2)
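
To get a feel for how chunk size affects the number of chunks, here is a minimal sketch. It assumes chunk size is counted in words; the project's own `chunk_text` in src/chunker.py may count characters or tokens instead, so the helper is given a different, hypothetical name, `chunk_words`.

```python
def chunk_words(text, chunk_size=300):
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 700-word document, used only to compare chunk counts.
text = " ".join(f"w{i}" for i in range(700))
for size in (100, 300, 500):
    print(size, len(chunk_words(text, chunk_size=size)))
```

Smaller chunks give more precise matches but less context per chunk, so a lower chunk_size often pairs well with a higher `k` at retrieval time.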