An intelligent PDF document Q&A application that uses RAG (Retrieval-Augmented Generation) with LangChain and the Groq API.
- Upload and analyze PDF files (CVs, reports, documents, etc.)
- Ask natural language questions about your documents
- Get accurate, context-aware responses
- Semantic search using embeddings
- FAISS vector store for efficient similarity search
- Multi-document chunk retrieval for comprehensive context
- Groq API for lightning-fast LLM inference
- HuggingFace embeddings for semantic understanding
- Real-time streaming responses
- Maintains chat history within a session
- Automatic reset when uploading a new document
- File-aware conversation management
PDF Upload
↓
[Text Extraction] → PyPDF2
↓
[Chunking] → RecursiveCharacterTextSplitter (1000 chars, 200 overlap)
↓
[Embeddings] → HuggingFace "all-MiniLM-L6-v2"
↓
[Vector Store] → FAISS (4 top results per query)
↓
[RAG Chain] → Retriever | Prompt Template | Groq LLM
↓
Answer
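The chunking step in the pipeline above can be illustrated with a minimal sliding-window sketch. This is a simplified stand-in, not the actual `RecursiveCharacterTextSplitter` (which additionally tries to break on separators such as paragraph and line breaks); the function name `chunk_text` is hypothetical and only the 1000/200 values come from this README.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, a new chunk starts every 800 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, the last 200 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.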
- Python 3.10+
- Groq API key (get one for free at https://groq.com)
# 1. Create virtual environment
python -m venv ml_env
source ml_env/bin/activate # On Windows: ml_env\Scripts\activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Create .env file
echo "api_key=your_groq_api_key_here" > .env
# Activate environment
source ml_env/bin/activate
# Run the application
streamlit run app.py
The app will open at http://localhost:8501
- Enter API Key (optional if in .env)
- Upload PDF via the sidebar
- Ask Questions about the document content
- Get Instant Answers based on document context
api_key=your_groq_api_key_here
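As a rough sketch of how the key might be resolved at runtime: a key typed into the UI wins, otherwise the `api_key` environment variable is used. The function name `resolve_api_key` is hypothetical, and the precedence shown is an assumption based on the "optional if in .env" note above, not a description of what `app.py` actually does.

```python
import os


def resolve_api_key(user_input: str = "", env_var: str = "api_key") -> str:
    """Prefer a key typed into the UI, then fall back to the environment."""
    key = user_input.strip() or os.environ.get(env_var, "").strip()
    if not key:
        raise ValueError(f"No Groq API key found: enter one or set {env_var} in .env")
    return key


# With python-dotenv installed, the .env file is typically loaded into
# os.environ before calling resolve_api_key():
#   from dotenv import load_dotenv
#   load_dotenv()  # reads key=value pairs from .env in the working directory
```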
| Parameter | Value | Description |
|---|---|---|
| Chunk Size | 1000 | Characters per text chunk |
| Chunk Overlap | 200 | Overlap between chunks for context continuity |
| Retrieval K | 4 | Number of chunks retrieved per query |
| Embedding Model | all-MiniLM-L6-v2 | 384-dim, lightweight embeddings |
| LLM | llama-3.3-70b | Llama 3.3 70B, served through the Groq API |
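To make the "Retrieval K" parameter concrete, here is a brute-force sketch of top-k retrieval over toy vectors. In the app this job is done by FAISS over 384-dimensional embeddings; cosine similarity is used here purely for illustration (the FAISS index may use a different metric), and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

A larger `k` gives the LLM more context at the cost of a longer prompt; a smaller `k` keeps the prompt focused but can miss relevant chunks.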
pdf_chatbot/
├── app.py # Main Streamlit application
├── requirements.txt # Python dependencies
├── .env # API keys (git-ignored)
├── ml_env/ # Virtual environment
└── README.md # This file
streamlit # Web UI framework
langchain # LLM orchestration
langchain-core # Core abstractions
langchain-groq # Groq API integration
langchain-text-splitters # Document splitting
langchain-community # Community integrations
sentence-transformers # Semantic embeddings
faiss-cpu # Vector similarity search
PyPDF2 # PDF text extraction
python-dotenv # Environment variable loading
Extracts text from PDF using PyPDF2.
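A minimal sketch of the extraction step, assuming PyPDF2 2.x or later, where `PdfReader(...).pages` yields page objects with an `extract_text()` method. The helper name `join_page_text` is hypothetical; it accepts any iterable of page-like objects so it can be exercised without a real PDF.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    def extract_text(self) -> str: ...


def join_page_text(pages: Iterable[PageLike]) -> str:
    """Join the text of every page, skipping pages that yield nothing."""
    texts = []
    for page in pages:
        text = page.extract_text() or ""  # image-only pages can yield no text
        if text.strip():
            texts.append(text)
    return "\n".join(texts)


# With PyPDF2 installed, the pages would come from a reader:
#   from PyPDF2 import PdfReader
#   text = join_page_text(PdfReader("document.pdf").pages)
```

An empty result for a non-empty PDF is a reasonable hint that the file is image-only and would need OCR.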
Creates RAG chain:
- Splits document into chunks
- Generates embeddings
- Builds FAISS vector store
- Creates LCEL chain combining retrieval + LLM
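The four steps above could be wired together roughly as follows. This is a sketch under assumptions, not the code in `app.py`: the names `build_chain_sketch` and `format_docs`, the prompt wording, and the return shape are all hypothetical, and the model name is left as a parameter because the exact Groq model ID is not stated here.

```python
def format_docs(docs) -> str:
    """Concatenate retrieved chunks into one context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_chain_sketch(text: str, api_key: str, model_name: str):
    # Imports are local so the sketch can be read without the packages installed.
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_groq import ChatGroq
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Split the document into overlapping chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(text)

    # 2-3. Embed the chunks and index them in FAISS (top 4 results per query).
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    retriever = FAISS.from_texts(chunks, embeddings).as_retriever(search_kwargs={"k": 4})

    # 4. Compose retrieval, prompt, and LLM with LCEL pipes.
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only this context:\n\n{context}"),  # placeholder wording
        ("human", "{question}"),
    ])
    llm = ChatGroq(groq_api_key=api_key, model_name=model_name)
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain, retriever
```

A chain built this way takes the question as a plain string: `chain.invoke("...")` returns the full answer, while `chain.stream("...")` yields it piece by piece for streaming output.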
- messages: Chat history
- chain: RAG pipeline instance
- retriever: FAISS retriever
- current_file: Tracks loaded file to avoid reprocessing
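The reset-on-new-upload behavior can be sketched against these four keys. The helper `reset_if_new_file` is hypothetical; it takes any dict-like mapping, so a plain `dict` works for experimentation and Streamlit's `st.session_state`, which supports dict-style access, should work in the app.

```python
from typing import Any, MutableMapping


def reset_if_new_file(state: MutableMapping[str, Any], filename: str) -> bool:
    """Clear chat and pipeline state when a different file is uploaded.

    Returns True when the caller should rebuild the chain for `filename`.
    """
    if state.get("current_file") == filename:
        return False  # same document: keep history and reuse the chain
    state["messages"] = []
    state["chain"] = None
    state["retriever"] = None
    state["current_file"] = filename
    return True
```

Comparing against `current_file` matters because Streamlit reruns the script on every interaction; without the check, the document would be re-embedded on each question.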
- Uses CPU FAISS (install faiss-gpu for GPU acceleration)
- HuggingFace embeddings are cached locally
- Groq API provides sub-second inference
- Keep chunk size between 500 and 2000 characters to balance context against retrieval precision
- Use k=3-5 for retrieval to balance relevance/scope
- Small PDFs (<500KB) work best for instant processing
Your installed LangChain version is incompatible. Update:
pip install --upgrade langchain langchain-core langchain-groq
The prompt template expects specific variables. Ensure build_chain() passes the correct keys.
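One way to spot a key mismatch is to list the placeholders a template contains and compare them with the keys being passed. The sketch below assumes the template uses `{name}` placeholders in Python format-string style (the default for LangChain prompt templates); the helper name `missing_prompt_keys` is hypothetical.

```python
from string import Formatter


def missing_prompt_keys(template: str, supplied: dict) -> set[str]:
    """Return placeholder names in `template` that `supplied` does not provide."""
    expected = {name for _, name, _, _ in Formatter().parse(template) if name}
    return expected - set(supplied)
```

For an actual LangChain prompt object, the `input_variables` attribute gives the same list of expected names.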
- Reduce chunk_size to speed up embedding
- Reduce k (retrieval count) from 4 to 2
- Use faiss-gpu for GPU acceleration
Some PDFs are image-only. They need OCR (not supported in this version).
In build_chain():
llm = ChatGroq(groq_api_key=api_key, model_name="mixtral-8x7b-32768")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,   # Larger chunks
    chunk_overlap=500  # More overlap
)
Modify the system_prompt variable in build_chain() to change LLM behavior.
- Cannot process image-based PDFs (OCR not implemented)
- Very large files (roughly 500MB and up) may fail, depending on available memory
- No support for multi-language documents (English optimized)
- Chat history resets on application restart
- Multi-document Q&A
- Persistent chat history (database)
- OCR for image-based PDFs
- Citation tracking (show source pages)
- Custom prompt templates UI
- Batch document processing
- Export chat to PDF
MIT License - Feel free to modify and share!
- Groq API: https://groq.com/documentation
- LangChain: https://python.langchain.com/
- Streamlit: https://docs.streamlit.io/
Created with ❤️ using LangChain, Streamlit, and Groq API