Rukhaiya2004/ragdocbot

RAG Document Bot

A Retrieval-Augmented Generation (RAG) system that lets you ask questions about PDF documents and get AI-generated answers based on the document content.

Features

- PDF Text Extraction: Extracts text from PDF documents using pdfplumber
- Text Chunking: Splits documents into manageable chunks for better processing
- Semantic Search: Uses sentence transformers to create embeddings and FAISS for efficient similarity search
- AI-Powered Answers: Generates contextual answers using Google's FLAN-T5 model
- Interactive Q&A: Command-line interface for asking questions about your documents

How It Works

1. Document Processing: PDF text is extracted and split into chunks
2. Embedding Creation: Each chunk is converted to a vector embedding using SentenceTransformers
3. Index Building: A FAISS index is built for fast similarity search
4. Query Processing: User questions are embedded and matched against document chunks
5. Answer Generation: The most relevant retrieved chunks are used as context for AI answer generation
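The retrieval steps above can be illustrated with a small, dependency-free sketch. It is not the project's code: `embed`, `cosine`, and `retrieve` are hypothetical names, word counts stand in for SentenceTransformers embeddings, and a brute-force scan stands in for the FAISS index. The sample chunks are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the question."""
    query_vec = embed(question)
    scores = [cosine(query_vec, embed(chunk)) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Invented sample chunks; in the real pipeline these come from the PDF.
chunks = [
    "The map shows every corridor of the castle.",
    "Dinner is served in the great hall at six.",
    "A password is needed to reveal the map.",
]
top_ids = retrieve("How do I reveal the map?", chunks, k=2)
print([chunks[i] for i in top_ids])
```

The shape of the flow is the same as in the real system (embed the chunks, embed the question, rank by similarity, keep the top k), but learned embeddings match on meaning rather than on shared words.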

Installation

Clone the repository:

    git clone <repository-url>
    cd ragdocbot

Install required dependencies:

    pip install -r requirements.txt

Add your PDF documents to the data/sample_docs/ directory

Usage

Place your PDF file in the data/sample_docs/ directory (the application is currently configured for Marauders.pdf).

Run the application:

    python run.py

When prompted, ask a question about your document:

Ask a question: What is the main topic of this document?

The system will process your query and provide an AI-generated answer based on the document content.
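To make "based on the document content" concrete, here is a minimal sketch of how retrieved chunks might be combined with the question before being passed to a text-generation model. The function name `build_prompt` and the prompt wording are assumptions for illustration; the actual prompt used in src/generator.py is not shown in this README.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks; in practice these are the top-k retrieved passages.
prompt = build_prompt(
    "What is the main topic of this document?",
    ["First retrieved chunk.", "  Second retrieved chunk.  "],
)
print(prompt)
```

Keeping the context ahead of the question and ending with "Answer:" is a common layout for instruction-tuned models, though the best wording depends on the model.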

Project Structure

    ragdocbot/
    ├── src/
    │   ├── app.py          # Main application logic
    │   ├── chunker.py      # PDF text extraction and chunking
    │   ├── embedder.py     # Text embedding using SentenceTransformers
    │   ├── retriever.py    # FAISS indexing and search
    │   └── generator.py    # Answer generation using FLAN-T5
    ├── data/
    │   └── sample_docs/    # Place your PDF documents here
    ├── requirements.txt    # Python dependencies
    ├── run.py              # Application entry point
    └── README.md           # This file

Dependencies

- sentence-transformers - For creating text embeddings
- faiss-cpu - For efficient similarity search
- openai - OpenAI API client (if needed for future enhancements)
- pdfplumber - PDF text extraction
- tqdm - Progress bars
- numpy - Numerical operations
- transformers - Hugging Face transformers for text generation
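If the application fails at import time, a quick check like the one below can show which of the listed packages are not installed. This is a convenience sketch, not part of the repository; `REQUIRED_MODULES` and `missing_packages` are hypothetical names, and the mapping pairs each pip package name with the module name it is imported under.

```python
from importlib.util import find_spec

# pip package name -> importable module name
REQUIRED_MODULES = {
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "openai": "openai",
    "pdfplumber": "pdfplumber",
    "tqdm": "tqdm",
    "numpy": "numpy",
    "transformers": "transformers",
}


def missing_packages(modules: dict[str, str] = REQUIRED_MODULES) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in modules.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are installed.")
```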

Configuration

Adjusting Chunk Size

Modify the chunk_size parameter in src/app.py:

    chunks = chunk_text(text, chunk_size=300)  # Adjust as needed

Changing the PDF File

Update the pdf_path in src/app.py:

    pdf_path = "data/sample_docs/your_document.pdf"

Modifying Retrieval Parameters

Adjust the number of retrieved chunks in src/app.py:

    retrieved_ids = search_faiss(index, np.array(query_embedding), k=2)
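To get a feel for what changing chunk_size does, the sketch below splits a synthetic 1000-word text at several sizes. It assumes chunks are measured in words; the README does not say whether the project's chunk_text counts words or characters, so `chunk_words` is a hypothetical stand-in and not the function in src/chunker.py.

```python
def chunk_words(text: str, chunk_size: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Synthetic 1000-word document used only to show the effect of chunk_size.
sample = " ".join(f"word{i}" for i in range(1000))
for size in (100, 300, 500):
    chunks = chunk_words(sample, chunk_size=size)
    print(f"chunk_size={size}: {len(chunks)} chunks, "
          f"last chunk has {len(chunks[-1].split())} words")
```

Smaller chunks give more precise matches but less context per retrieved passage, while larger chunks do the opposite. The value of k then controls how many of those chunks are handed to the generator, so the two settings are usually tuned together.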

About

Retrieval-Augmented QA system for PDFs using Hugging Face and FAISS
