This project is a Python-based Retrieval-Augmented Generation (RAG) pipeline that intelligently summarizes `.pdf` and `.txt` documents using a lightweight sentence-embedding model and a locally hosted LLM (Ollama). It leverages semantic search over chunked document text to extract the most relevant context before querying the language model.
## Features

- ✅ Supports both `.txt` and `.pdf` inputs
- ✅ Automatically filters out irrelevant sections such as References or Bibliography
- ✅ Splits large documents into context-friendly chunks
- ✅ Uses sentence-transformers (`all-MiniLM-L6-v2`) for vector similarity search
- ✅ Summarizes based on the top-k relevant chunks using Ollama + Gemma3:1b
- ✅ Outputs concise, context-aware answers to a given question
- ✅ Saves results as `.txt`
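Two of the features above, back-matter filtering and chunking, can be sketched with the standard library alone. The function names and the word-window parameters below are illustrative, not the project's actual API:

```python
import re

def strip_back_matter(text: str) -> str:
    """Drop everything from a References/Bibliography heading onward."""
    # Heuristic: cut at the first line consisting only of the heading word.
    match = re.search(r"(?im)^\s*(references|bibliography)\s*$", text)
    return text[: match.start()] if match else text

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows that fit an LLM context."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i : i + chunk_size]) for i in range(0, len(words), step)]
```

Overlapping windows keep sentences that straddle a chunk boundary retrievable from at least one chunk.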
## Tech Stack

- Python
- SentenceTransformers
- PyPDF2
- Ollama
- NumPy
- Pandas
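As a sketch of how this stack handles input, a `read_document` helper (a hypothetical name, not necessarily the script's real function) can use PyPDF2 for `.pdf` files and fall back to plain-text reading for `.txt`:

```python
from pathlib import Path

def read_document(path: str) -> str:
    """Return the raw text of a .pdf or .txt document."""
    p = Path(path)
    if p.suffix.lower() == ".pdf":
        # Third-party dependency; imported lazily so .txt reading works without it.
        from PyPDF2 import PdfReader
        reader = PdfReader(str(p))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return p.read_text(encoding="utf-8")
```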
## Pipeline

- File Reading
- Preprocessing
- Chunking
- Embedding
- Retrieval
- RAG Summarization
- Output
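The Embedding and Retrieval steps reduce to a cosine-similarity ranking once the chunks are embedded. A minimal NumPy sketch, assuming the embeddings were already produced by `all-MiniLM-L6-v2` (the `top_k_chunks` name is illustrative):

```python
import numpy as np

def top_k_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 3) -> list[int]:
    """Indices of the k chunks most cosine-similar to the query embedding."""
    # Normalize so a plain dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q
    # Highest similarity first.
    return np.argsort(sims)[::-1][:k].tolist()
```

The selected chunks are then concatenated into the prompt sent to the LLM for summarization.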
## Usage

```bash
pip install -r requirements.txt
ollama run gemma3:1b
python pdf-summaizer.py
```
Don't forget to star this repo on GitHub and follow me! Thanks :)