SmartRAG is a privacy-first multimodal RAG system that lets you chat intelligently with your documents, images, and audio. Upload PDFs, Word files, or recordings and get accurate, context-aware answers, all processed locally on your device with no external APIs.


SmartRAG - Multimodal Document Chat System

A sophisticated RAG (Retrieval-Augmented Generation) system that enables intelligent conversations with documents, images, and audio files through a clean ChatGPT-style interface.

🚀 Quick Start

# Start the application
streamlit run chatbot_app.py

๐Ÿ—๏ธ Tech Stack

Core Framework

  • Python 3.8+ - Primary language
  • Streamlit - Web interface and UI framework
  • SQLite3 - File metadata storage and management

AI/ML Models

  • Ollama - Local LLM hosting (Llama 3.1 8B model)
  • PyTorch - Deep learning framework
  • Transformers - Hugging Face model library
  • OpenAI Whisper - Speech-to-text conversion (base model)
  • BLIP - Image captioning (Salesforce/blip-image-captioning-base)

Vector Database & Embeddings

  • ChromaDB - Vector storage and similarity search
  • Nomic Embed Text - Text embeddings via Ollama (768-dim vectors)
  • CLIP - Visual embeddings for images (openai/clip-vit-base-patch32)
  • FAISS - Alternative vector search (Facebook AI)
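
To make the embedding path concrete, here is a minimal sketch that requests a 768-dim vector from a locally running Ollama server and stores it in ChromaDB. It assumes Ollama is serving on its default port (11434) with nomic-embed-text pulled; the sample text and chunk ID are placeholders.

import chromadb
import requests

# Ask the local Ollama server for an embedding (legacy /api/embeddings endpoint)
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Quarterly revenue grew 12%."},
)
embedding = resp.json()["embedding"]  # list of 768 floats

# Persist the vector in the local ChromaDB collection
client = chromadb.PersistentClient(path="./vector_db")
collection = client.get_or_create_collection("traditional_multimodal_documents")
collection.add(ids=["chunk-0"], embeddings=[embedding],
               documents=["Quarterly revenue grew 12%."])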

Document Processing

  • PyPDF2 - PDF text extraction
  • python-docx - Word document processing
  • pdfplumber - Advanced PDF parsing
  • python-pptx - PowerPoint file support
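
For reference, a minimal sketch of plain-text extraction with two of these libraries (file names are placeholders; the project's document_processor.py may handle edge cases differently):

from PyPDF2 import PdfReader
from docx import Document

# PDF: concatenate the text layer of every page
pdf_text = "\n".join(page.extract_text() or "" for page in PdfReader("report.pdf").pages)

# DOCX: join paragraph text
docx_text = "\n".join(p.text for p in Document("notes.docx").paragraphs)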

Image Processing

  • Pillow (PIL) - Image manipulation
  • OpenCV - Computer vision operations
  • Tesseract OCR - Text extraction from images
  • pytesseract - Python wrapper for Tesseract
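
The image path pairs OCR text with a generated caption before embedding. A hedged sketch of that pairing (the file name is a placeholder, and the BLIP weights download on first use):

import pytesseract
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

image = Image.open("screenshot.png").convert("RGB")

# OCR: pull any readable text out of the image
ocr_text = pytesseract.image_to_string(image)

# Captioning: describe the image with BLIP
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = processor(images=image, return_tensors="pt")
caption = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)

# The combined string is what gets chunked and embedded
combined = f"Caption: {caption}\nOCR text: {ocr_text}"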

Audio Processing

  • PyDub - Audio file manipulation
  • librosa - Audio analysis and processing
  • Whisper - Audio transcription
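
Transcription itself is only a few lines with the openai-whisper package (the file name is a placeholder; FFmpeg must be on the PATH for compressed formats):

import whisper

# Load the base model once (downloaded on first use), then transcribe
model = whisper.load_model("base")
result = model.transcribe("meeting_recording.mp3")
print(result["text"])  # full transcript as a single string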

Utilities

  • NumPy - Numerical computations
  • PyYAML - Configuration management
  • tqdm - Progress bars
  • requests - HTTP client

📁 Project Structure

smartrag/
├── chatbot_app.py              # Main Streamlit application
├── config.yaml                 # System configuration
├── requirements.txt            # Python dependencies
├── multimodal_rag/             # Core RAG system
│   ├── system.py               # Main RAG orchestrator
│   ├── base.py                 # Base classes and interfaces
│   ├── processors/             # File processors
│   │   ├── document_processor.py  # PDF, DOCX, TXT
│   │   ├── image_processor.py     # Images with OCR
│   │   └── audio_processor.py     # Audio transcription
│   └── vector_stores/          # Vector database implementations
│       ├── chroma_store.py     # ChromaDB integration
│       └── faiss_store.py      # FAISS integration
├── file_storage.db             # SQLite database
├── vector_db/                  # ChromaDB persistence
└── user_data/                  # User session data

🎯 Features

Multimodal Support

  • Documents: PDF, DOCX, DOC, TXT, MD, RTF
  • Images: JPG, PNG, BMP, TIFF, WEBP (with OCR)
  • Audio: MP3, WAV, M4A, OGG, FLAC, AAC

AI Capabilities

  • Local LLM inference with Ollama
  • Semantic search with vector embeddings
  • Image understanding and captioning
  • Speech-to-text transcription
  • Context-aware document retrieval

User Interface

  • ChatGPT-style conversation interface
  • File upload and management
  • Real-time processing feedback
  • Document viewer for stored files
  • Recent uploads tracking

⚙️ Configuration

The system uses config.yaml for configuration:

models:
  llm_model: "llama3.1:8b" # Ollama Llama 3.1 8B model
  embedding_model: "nomic-embed-text" # Nomic text embeddings (768-dim)
  vision_model: "Salesforce/blip-image-captioning-base" # BLIP for image captioning
  whisper_model: "base" # Whisper base model for audio

vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "traditional_multimodal_documents"
  embedding_dimension: 768 # Nomic embed text dimension

processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_image_size: [1024, 1024]
  ocr_enabled: true # Tesseract OCR for images
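
To make chunk_size and chunk_overlap concrete, here is a rough sketch of fixed-size chunking with overlap as configured above (1000-character windows overlapping by 200 characters); the actual splitter in multimodal_rag may differ.

from typing import List

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]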

🔧 Requirements

  • Python: 3.8 or higher
  • Ollama: For local LLM inference
  • Tesseract OCR: For image text extraction
  • FFmpeg: For audio processing (optional)

๐Ÿƒโ€โ™‚๏ธ Installation

  1. Install dependencies:

    pip install -r requirements.txt
  2. Install Ollama and pull models:

    # Install Ollama (see ollama.ai)
    ollama pull llama3.1:8b
    ollama pull nomic-embed-text
  3. Install Tesseract OCR:

    • Windows: Download from GitHub releases
    • macOS: brew install tesseract
    • Linux: sudo apt-get install tesseract-ocr
  4. Run the application:

    streamlit run chatbot_app.py

📊 Architecture

[User Input] → [Streamlit UI] → [RAG System] → [File Processors]
                                                      ↓
                                     [Document: PyPDF2/python-docx]
                                     [Image: Tesseract OCR + BLIP]
                                     [Audio: Whisper Transcription]
                                                      ↓
[SQLite DB] ← [Text Chunks] → [Nomic Embed Text (Ollama)] → [ChromaDB]
                                                                ↓
[Vector Search] → [Context Retrieval] → [Llama 3.1 8B (Ollama)] → [Response]

Processing Pipeline

  • Text Documents: Extracted with PyPDF2/python-docx → Chunked → Embedded with Nomic Embed Text
  • Images: OCR with Tesseract + Captioning with BLIP → Combined text → Embedded with Nomic Embed Text
  • Audio: Transcribed with Whisper → Chunked → Embedded with Nomic Embed Text
  • Storage: All embeddings stored in ChromaDB (768-dim vectors) for semantic search
  • Generation: Retrieved context fed to Llama 3.1 8B via Ollama for response generation
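
The retrieval/generation hop can be sketched end to end under the same assumptions as earlier (Ollama on localhost:11434 with both models pulled, ChromaDB persisted in ./vector_db); the prompt wording and n_results value are illustrative, not the system's exact choices.

import chromadb
import requests

OLLAMA = "http://localhost:11434"
question = "What was discussed about the Q4 budget?"

# 1. Embed the query with nomic-embed-text
emb = requests.post(f"{OLLAMA}/api/embeddings",
                    json={"model": "nomic-embed-text", "prompt": question}).json()["embedding"]

# 2. Retrieve the nearest chunks from ChromaDB
collection = chromadb.PersistentClient(path="./vector_db") \
    .get_or_create_collection("traditional_multimodal_documents")
hits = collection.query(query_embeddings=[emb], n_results=5)
context = "\n\n".join(hits["documents"][0])

# 3. Generate a grounded answer with Llama 3.1 8B
answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
}).json()["response"]
print(answer)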

🎮 Usage

  1. Upload Files: Drag & drop or browse files in the sidebar
  2. Chat: Ask questions about your uploaded content
  3. View Files: Use the eye icon to preview stored documents
  4. Manage Data: Clear chat history or uploaded files as needed

🔒 Privacy

  • Fully Offline: All processing happens locally
  • No Data Sent: No external API calls for LLM inference
  • Local Storage: Files and embeddings stored on your machine

Python API

# Set up the RAG system (same import as the API server example below)
from multimodal_rag.system import MultimodalRAGSystem
system = MultimodalRAGSystem()

# Ingest mixed content
system.ingest_file("presentation.pdf")       # Slides
system.ingest_file("screenshot.png")         # Image with text
system.ingest_file("meeting_recording.mp3")  # Audio transcript

# Query across all modalities
response = system.query("What was discussed about the Q4 budget?")

Batch Processing

# Process entire directories
results = system.ingest_directory("./company_docs/", recursive=True)

# Get processing summary
successful = sum(1 for r in results.values() if r.success)
total_chunks = sum(len(r.chunks) for r in results.values() if r.success)
print(f"Processed {successful} files, created {total_chunks} chunks")

🧪 Testing

Run the test suite to verify installation:

# Run all tests
python -m pytest tests/

# Run specific test file
python tests/test_system.py

# Run with coverage
pip install coverage
coverage run tests/test_system.py
coverage report

🔧 Advanced Configuration

Vector Store Options

ChromaDB (Default - Recommended)

vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "documents"
  embedding_dimension: 768 # For nomic-embed-text

FAISS (Alternative - High performance)

vector_store:
  type: "faiss"
  persist_directory: "./faiss_db"
  embedding_dimension: 768 # Must match nomic-embed-text
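
If you switch to FAISS, indexing and search look roughly like this (a flat L2 index with random stand-in vectors; the real faiss_store.py wrapper may add ID mapping and persistence):

import faiss
import numpy as np

dim = 768  # must match nomic-embed-text
index = faiss.IndexFlatL2(dim)

# Add float32 embeddings of shape [n, 768], then find the 5 nearest neighbours
vectors = np.random.rand(100, dim).astype("float32")  # stand-in for real embeddings
index.add(vectors)
distances, ids = index.search(np.random.rand(1, dim).astype("float32"), 5)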

🚀 Deployment

Local Development

python cli.py interactive

Docker Container

FROM python:3.10-slim

WORKDIR /app
COPY . .

# System packages for OCR and audio decoding (Tesseract, FFmpeg)
RUN apt-get update && apt-get install -y --no-install-recommends tesseract-ocr ffmpeg \
    && rm -rf /var/lib/apt/lists/*
RUN pip install -r requirements.txt

# Note: Ollama must be reachable from the container for LLM inference
CMD ["python", "cli.py", "interactive"]

API Server

from fastapi import FastAPI
from multimodal_rag.system import MultimodalRAGSystem

app = FastAPI()
system = MultimodalRAGSystem()

@app.post("/query")
async def query_endpoint(query: str):
    response = system.query(query)
    return {"answer": response.answer, "sources": len(response.sources)}
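
Because query is declared as a plain str parameter, FastAPI reads it from the query string. A quick client-side check, assuming the app is served with uvicorn on port 8000:

import requests

r = requests.post("http://localhost:8000/query",
                  params={"query": "Summarize the Q4 budget discussion"})
print(r.json())  # e.g. {"answer": "...", "sources": 3}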

Development Setup

# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
python -m pytest

# Format code
black multimodal_rag/ tests/ examples/

# Lint code
flake8 multimodal_rag/ tests/ examples/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


SmartRAG - Intelligent multimodal document understanding for the modern age.
