A sophisticated RAG (Retrieval-Augmented Generation) system that enables intelligent conversations with documents, images, and audio files through a clean ChatGPT-style interface.
```bash
# Start the application
streamlit run chatbot_app.py
```

- Python 3.8+ - Primary language
- Streamlit - Web interface and UI framework
- SQLite3 - File metadata storage and management
- Ollama - Local LLM hosting (Llama 3.1 8B model)
- PyTorch - Deep learning framework
- Transformers - Hugging Face model library
- OpenAI Whisper - Speech-to-text conversion (base model)
- BLIP - Image captioning (Salesforce/blip-image-captioning-base)
- ChromaDB - Vector storage and similarity search
- Nomic Embed Text - Text embeddings via Ollama (768-dim vectors)
- CLIP - Visual embeddings for images (openai/clip-vit-base-patch32)
- FAISS - Alternative vector search (Facebook AI)
- PyPDF2 - PDF text extraction
- python-docx - Word document processing
- pdfplumber - Advanced PDF parsing
- python-pptx - PowerPoint file support
- Pillow (PIL) - Image manipulation
- OpenCV - Computer vision operations
- Tesseract OCR - Text extraction from images
- pytesseract - Python wrapper for Tesseract
- PyDub - Audio file manipulation
- librosa - Audio analysis and processing
- Whisper - Audio transcription
- NumPy - Numerical computations
- PyYAML - Configuration management
- tqdm - Progress bars
- requests - HTTP client
```
smartrag/
├── chatbot_app.py               # Main Streamlit application
├── config.yaml                  # System configuration
├── requirements.txt             # Python dependencies
├── multimodal_rag/              # Core RAG system
│   ├── system.py                # Main RAG orchestrator
│   ├── base.py                  # Base classes and interfaces
│   ├── processors/              # File processors
│   │   ├── document_processor.py   # PDF, DOCX, TXT
│   │   ├── image_processor.py      # Images with OCR
│   │   └── audio_processor.py      # Audio transcription
│   └── vector_stores/           # Vector database implementations
│       ├── chroma_store.py      # ChromaDB integration
│       └── faiss_store.py       # FAISS integration
├── file_storage.db              # SQLite database
├── vector_db/                   # ChromaDB persistence
└── user_data/                   # User session data
```
- Documents: PDF, DOCX, DOC, TXT, MD, RTF
- Images: JPG, PNG, BMP, TIFF, WEBP (with OCR)
- Audio: MP3, WAV, M4A, OGG, FLAC, AAC
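Each upload is routed to one of the processors under `multimodal_rag/processors/` based on its extension. The dispatch below is a simplified sketch for illustration, not the project's exact code:

```python
# Illustrative extension-to-processor dispatch (simplified; see multimodal_rag/processors/)
from pathlib import Path

DOCUMENT_EXTS = {".pdf", ".docx", ".doc", ".txt", ".md", ".rtf"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".aac"}

def pick_processor(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext in DOCUMENT_EXTS:
        return "document_processor"   # PyPDF2 / python-docx / plain text
    if ext in IMAGE_EXTS:
        return "image_processor"      # Tesseract OCR + BLIP captioning
    if ext in AUDIO_EXTS:
        return "audio_processor"      # Whisper transcription
    raise ValueError(f"Unsupported file type: {ext}")
```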
- Local LLM inference with Ollama
- Semantic search with vector embeddings
- Image understanding and captioning
- Speech-to-text transcription
- Context-aware document retrieval
- ChatGPT-style conversation interface
- File upload and management
- Real-time processing feedback
- Document viewer for stored files
- Recent uploads tracking
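For reference, a ChatGPT-style upload-and-chat loop in Streamlit looks roughly like the sketch below. This is illustrative only, not the actual `chatbot_app.py`:

```python
# Illustrative Streamlit skeleton for the chat + upload UI (not the app's exact code)
import streamlit as st

uploaded = st.sidebar.file_uploader(
    "Upload documents, images, or audio",
    type=["pdf", "docx", "txt", "png", "jpg", "mp3", "wav"],
    accept_multiple_files=True,
)

if prompt := st.chat_input("Ask about your files"):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        # In the real app, the answer comes from the RAG system
        st.write("...answer from the RAG system goes here...")
```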
The system uses `config.yaml` for configuration (a short PyYAML loading sketch follows the requirements list below):
```yaml
models:
  llm_model: "llama3.1:8b"                               # Ollama Llama 3.1 8B model
  embedding_model: "nomic-embed-text"                    # Nomic text embeddings (768-dim)
  vision_model: "Salesforce/blip-image-captioning-base"  # BLIP for image captioning
  whisper_model: "base"                                  # Whisper base model for audio

vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "traditional_multimodal_documents"
  embedding_dimension: 768                               # Nomic embed text dimension

processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_image_size: [1024, 1024]
  ocr_enabled: true                                      # Tesseract OCR for images
```

- Python: 3.8 or higher
- Ollama: For local LLM inference
- Tesseract OCR: For image text extraction
- FFmpeg: For audio processing (optional)
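As a rough illustration of how the settings in `config.yaml` might be read with PyYAML, here is a minimal sketch (the helper name is an assumption, not the project's actual loader):

```python
# Minimal sketch: load config.yaml with PyYAML (illustrative, not the project's loader)
import yaml

def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
print(config["models"]["llm_model"])     # e.g. "llama3.1:8b"
print(config["vector_store"]["type"])    # e.g. "chromadb"
```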
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull models:

  ```bash
  # Install Ollama (see ollama.ai)
  ollama pull llama3.1:8b
  ollama pull nomic-embed-text
  ```

- Install Tesseract OCR:
  - Windows: Download from GitHub releases
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`

- Run the application:

  ```bash
  streamlit run chatbot_app.py
  ```
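As an optional sanity check, the pulled models should show up in Ollama's local model list:

```bash
# Optional: confirm the models were pulled successfully
ollama list
```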
```
[User Input] → [Streamlit UI] → [RAG System] → [File Processors]
                                                      ↓
                                 [Document: PyPDF2/python-docx]
                                 [Image: Tesseract OCR + BLIP]
                                 [Audio: Whisper Transcription]
                                                      ↓
[SQLite DB] ← [Text Chunks] → [Nomic Embed Text (Ollama)] → [ChromaDB]
                                                      ↓
[Vector Search] → [Context Retrieval] → [Llama 3.1 8B (Ollama)] → [Response]
```
- Text Documents: Extracted with PyPDF2/python-docx → Chunked → Embedded with Nomic Embed Text
- Images: OCR with Tesseract + Captioning with BLIP → Combined text → Embedded with Nomic Embed Text
- Audio: Transcribed with Whisper → Chunked → Embedded with Nomic Embed Text
- Storage: All embeddings stored in ChromaDB (768-dim vectors) for semantic search
- Generation: Retrieved context fed to Llama 3.1 8B via Ollama for response generation
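The same flow, sketched with plain ChromaDB and Ollama HTTP calls; this is a minimal illustration under the config defaults above, not the project's actual implementation (the real logic lives in `multimodal_rag/system.py`):

```python
# Minimal sketch of the ingest -> embed -> store -> retrieve -> generate flow.
# Illustrative only; the project's orchestrator wraps and extends this logic.
import chromadb
import requests

OLLAMA_URL = "http://localhost:11434"

def embed(text):
    # Nomic Embed Text via Ollama's embeddings endpoint (768-dim vector)
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def chunk(text, size=1000, overlap=200):
    # Simple character-based chunking matching the config defaults
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

client = chromadb.PersistentClient(path="./vector_db")
collection = client.get_or_create_collection("traditional_multimodal_documents")

def ingest(doc_id, text):
    chunks = chunk(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=[embed(c) for c in chunks],
    )

def answer(question, k=4):
    hits = collection.query(query_embeddings=[embed(question)], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    r = requests.post(f"{OLLAMA_URL}/api/generate",
                      json={"model": "llama3.1:8b", "prompt": prompt, "stream": False})
    return r.json()["response"]
```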
- Upload Files: Drag & drop or browse files in the sidebar
- Chat: Ask questions about your uploaded content
- View Files: Use the eye icon to preview stored documents
- Manage Data: Clear chat history or uploaded files as needed
- Fully Offline: All processing happens locally
- No Data Sent: No external API calls for LLM inference
- Local Storage: Files and embeddings stored on your machine
```python
from multimodal_rag.system import MultimodalRAGSystem

system = MultimodalRAGSystem()
system.ingest_file("presentation.pdf")       # Slides
system.ingest_file("screenshot.png")         # Image with text
system.ingest_file("meeting_recording.mp3")  # Audio transcript

response = system.query("What was discussed about the Q4 budget?")
print(response.answer)
```
### Batch Processing
```python
# Process entire directories
results = system.ingest_directory("./company_docs/", recursive=True)
# Get processing summary
successful = sum(1 for r in results.values() if r.success)
total_chunks = sum(len(r.chunks) for r in results.values() if r.success)
print(f"Processed {successful} files, created {total_chunks} chunks")
```
Run the test suite to verify installation:
```bash
# Run all tests
python -m pytest tests/

# Run specific test file
python tests/test_system.py

# Run with coverage
pip install coverage
coverage run tests/test_system.py
coverage report
```

### ChromaDB (Default - Recommended)
```yaml
vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "documents"
  embedding_dimension: 768  # For nomic-embed-text
```

### FAISS (Alternative - High performance)
```yaml
vector_store:
  type: "faiss"
  persist_directory: "./faiss_db"
  embedding_dimension: 768  # Must match nomic-embed-text
```

Run the interactive CLI:

```bash
python cli.py interactive
```

Example Dockerfile for containerized deployment:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "cli.py", "interactive"]
```

Example FastAPI wrapper exposing a query endpoint:

```python
from fastapi import FastAPI
from multimodal_rag.system import MultimodalRAGSystem

app = FastAPI()
system = MultimodalRAGSystem()

@app.post("/query")
async def query_endpoint(query: str):
    response = system.query(query)
    return {"answer": response.answer, "sources": len(response.sources)}
```
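A quick way to exercise this endpoint once the app is served (e.g. with `uvicorn`, assuming the default port 8000; the question shown is just an example):

```python
# Hypothetical client call; assumes the FastAPI app above is running on localhost:8000
import requests

resp = requests.post(
    "http://localhost:8000/query",
    params={"query": "What was discussed about the Q4 budget?"},
)
print(resp.json())
```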
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
python -m pytest

# Format code
black multimodal_rag/ tests/ examples/

# Lint code
flake8 multimodal_rag/ tests/ examples/
```

This project is licensed under the MIT License - see the LICENSE file for details.
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Hugging Face Transformers for language models
- OpenAI Whisper for speech recognition
- Tesseract OCR for text extraction
SmartRAG - Intelligent multimodal document understanding for the modern age.