GitHub - Rukhaiya2004/ragdocbot: Retrieval-Augmented QA system for PDFs using Hugging Face and FAISS

RAG Document Bot

A Retrieval-Augmented Generation (RAG) system that lets you ask questions about PDF documents and get AI-generated answers based on the document content.

Features

- PDF Text Extraction: Extracts text from PDF documents using pdfplumber
- Text Chunking: Splits documents into manageable chunks for better processing
- Semantic Search: Uses sentence transformers to create embeddings and FAISS for efficient similarity search
- AI-Powered Answers: Generates contextual answers using Google's FLAN-T5 model
- Interactive Q&A: Command-line interface for asking questions about your documents

How It Works

1. Document Processing: PDF text is extracted and split into chunks
2. Embedding Creation: Each chunk is converted to a vector embedding using SentenceTransformers
3. Index Building: A FAISS index is built for fast similarity search
4. Query Processing: User questions are embedded and matched against document chunks
5. Answer Generation: The retrieved chunks are used as context for AI answer generation
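The query-processing and answer-generation steps can be sketched without the heavy dependencies. The block below is an illustration under stated assumptions, not the repository's code: toy word-count vectors stand in for SentenceTransformers embeddings, a brute-force cosine-similarity scan stands in for the FAISS index, and the names `embed`, `cosine`, `top_k`, and `build_prompt` as well as the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the indices of the k chunks most similar to the query."""
    query_vec = embed(query)
    scores = [cosine(query_vec, embed(chunk)) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, ids):
    """Join the retrieved chunks into a context block for the generator.

    The "Context / Question / Answer" layout is an assumption.
    """
    context = "\n".join(chunks[i] for i in ids)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy chunks, invented for illustration only.
chunks = [
    "FAISS builds an index for similarity search.",
    "pdfplumber extracts text from PDF pages.",
    "FLAN-T5 generates an answer from the retrieved context.",
]
question = "Which tool extracts text from PDF files?"
ids = top_k(question, chunks, k=2)
prompt = build_prompt(question, chunks, ids)
print(prompt)
```

In the real pipeline the prompt would be passed to the FLAN-T5 generator. The brute-force scan here compares the query against every chunk, which is the work a similarity index is meant to speed up on larger documents.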

Installation

Clone the repository:

    git clone <repository-url>
    cd ragdocbot

Install required dependencies:

    pip install -r requirements.txt

Add your PDF documents to the data/sample_docs/ directory

Usage

Place your PDF file in the data/sample_docs/ directory (currently configured for Marauders.pdf).

Run the application:

    python run.py

When prompted, ask a question about your document:

Ask a question: What is the main topic of this document?

The system will process your query and provide an AI-generated answer based on the document content.
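A question prompt like the one above could be driven by a small loop. This is a sketch only: `answer_question` is a hypothetical placeholder for the retrieval and generation pipeline, `qa_loop` is not a function from the repository, and the exit words (`exit`, `quit`, or an empty line) are an assumption rather than documented behaviour of run.py.

```python
def answer_question(question):
    """Hypothetical placeholder for the retrieve-then-generate pipeline."""
    return f"(answer for: {question})"


def qa_loop(read=input, write=print, answer=answer_question):
    """Prompt until the user enters an empty line, 'exit' or 'quit'.

    Returns the number of questions answered.
    """
    asked = 0
    while True:
        try:
            question = read("Ask a question: ").strip()
        except EOFError:
            # End of input (for example Ctrl-D) also ends the session.
            break
        if question.lower() in {"", "exit", "quit"}:
            break
        write(answer(question))
        asked += 1
    return asked


# Scripted run so the example does not wait for keyboard input.
scripted = iter(["What is the main topic of this document?", "quit"])
outputs = []
count = qa_loop(read=lambda prompt: next(scripted), write=outputs.append)
print(count, outputs)
```

Calling `qa_loop()` with no arguments uses the built-in `input` and `print`, so the same function serves both interactive use and scripted checks.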

Project Structure

ragdocbot/
├── src/
│   ├── app.py          # Main application logic
│   ├── chunker.py      # PDF text extraction and chunking
│   ├── embedder.py     # Text embedding using SentenceTransformers
│   ├── retriever.py    # FAISS indexing and search
│   └── generator.py    # Answer generation using FLAN-T5
├── data/
│   └── sample_docs/    # Place your PDF documents here
├── requirements.txt    # Python dependencies
├── run.py              # Application entry point
└── README.md           # This file

Dependencies

- sentence-transformers: for creating text embeddings
- faiss-cpu: for efficient similarity search
- openai: OpenAI API client (if needed for future enhancements)
- pdfplumber: PDF text extraction
- tqdm: progress bars
- numpy: numerical operations
- transformers: Hugging Face transformers for text generation
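After installing, it can help to confirm that each listed package is importable before running the application. The following is a small sketch, not part of the repository; `REQUIRED_MODULES` and `missing_packages` are hypothetical names. Note that two packages are imported under a different name than they are installed with: faiss-cpu as `faiss` and sentence-transformers as `sentence_transformers`.

```python
from importlib.util import find_spec

# pip package name -> module name used in an import statement.
REQUIRED_MODULES = {
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "openai": "openai",
    "pdfplumber": "pdfplumber",
    "tqdm": "tqdm",
    "numpy": "numpy",
    "transformers": "transformers",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose modules cannot be found."""
    return [pkg for pkg, module in modules.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for large libraries.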

Configuration

Adjusting Chunk Size

Modify the chunk_size parameter in src/app.py:

    chunks = chunk_text(text, chunk_size=300)  # Adjust as needed

Changing the PDF File

Update the pdf_path in src/app.py:

    pdf_path = "data/sample_docs/your_document.pdf"

Modifying Retrieval Parameters

Adjust the number of retrieved chunks in src/app.py:

    retrieved_ids = search_faiss(index, np.array(query_embedding), k=2)
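To get a feel for how chunk_size changes the number of chunks, here is a minimal sketch. It assumes chunk_size counts whitespace-separated words; the unit actually used by chunk_text in src/chunker.py is not stated here and may differ (characters or tokens, for example). The function name `chunk_words` is hypothetical.

```python
def chunk_words(text, chunk_size=300):
    """Split text into consecutive chunks of at most chunk_size words.

    Assumption: chunk_size is measured in words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# A synthetic 700-word document, for illustration only.
sample = " ".join(f"word{i}" for i in range(700))
for size in (100, 300):
    pieces = chunk_words(sample, chunk_size=size)
    print(f"chunk_size={size}: {len(pieces)} chunks")
```

Smaller chunks make each retrieved passage more focused but carry less context, and the amount of text passed to the generator grows with both chunk_size and k, so the two settings are best tuned together.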
