A sophisticated RAG (Retrieval-Augmented Generation) system that enables intelligent conversations with documents, images, and audio files through a clean ChatGPT-style interface.
```bash
# Start the application
streamlit run chatbot_app.py
```

- Python 3.8+ - Primary language
- Streamlit - Web interface and UI framework
- SQLite3 - File metadata storage and management
- Ollama - Local LLM hosting (Llama 3.1 8B model)
- PyTorch - Deep learning framework
- Transformers - Hugging Face model library
- OpenAI Whisper - Speech-to-text conversion (base model)
- BLIP - Image captioning (Salesforce/blip-image-captioning-base)
- ChromaDB - Vector storage and similarity search
- Nomic Embed Text - Text embeddings via Ollama (768-dim vectors)
- CLIP - Visual embeddings for images (openai/clip-vit-base-patch32)
- FAISS - Alternative vector search (Facebook AI)
- PyPDF2 - PDF text extraction
- python-docx - Word document processing
- pdfplumber - Advanced PDF parsing
- python-pptx - PowerPoint file support
- Pillow (PIL) - Image manipulation
- OpenCV - Computer vision operations
- Tesseract OCR - Text extraction from images
- pytesseract - Python wrapper for Tesseract
- PyDub - Audio file manipulation
- librosa - Audio analysis and processing
- Whisper - Audio transcription
- NumPy - Numerical computations
- PyYAML - Configuration management
- tqdm - Progress bars
- requests - HTTP client
```
smartrag/
├── chatbot_app.py               # Main Streamlit application
├── config.yaml                  # System configuration
├── requirements.txt             # Python dependencies
├── multimodal_rag/              # Core RAG system
│   ├── system.py                # Main RAG orchestrator
│   ├── base.py                  # Base classes and interfaces
│   ├── processors/              # File processors
│   │   ├── document_processor.py   # PDF, DOCX, TXT
│   │   ├── image_processor.py      # Images with OCR
│   │   └── audio_processor.py      # Audio transcription
│   └── vector_stores/           # Vector database implementations
│       ├── chroma_store.py      # ChromaDB integration
│       └── faiss_store.py       # FAISS integration
├── file_storage.db              # SQLite database
├── vector_db/                   # ChromaDB persistence
└── user_data/                   # User session data
```
- Documents: PDF, DOCX, DOC, TXT, MD, RTF
- Images: JPG, PNG, BMP, TIFF, WEBP (with OCR)
- Audio: MP3, WAV, M4A, OGG, FLAC, AAC
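Each upload is routed to one of the processors under `multimodal_rag/processors/` based on its extension. The dispatch below is a simplified sketch for illustration, not the project's exact code:

```python
# Illustrative extension-to-processor dispatch (simplified; see multimodal_rag/processors/)
from pathlib import Path

DOCUMENT_EXTS = {".pdf", ".docx", ".doc", ".txt", ".md", ".rtf"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".aac"}

def pick_processor(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext in DOCUMENT_EXTS:
        return "document_processor"   # PyPDF2 / python-docx / plain text
    if ext in IMAGE_EXTS:
        return "image_processor"      # Tesseract OCR + BLIP captioning
    if ext in AUDIO_EXTS:
        return "audio_processor"      # Whisper transcription
    raise ValueError(f"Unsupported file type: {ext}")
```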
- Local LLM inference with Ollama
- Semantic search with vector embeddings
- Image understanding and captioning
- Speech-to-text transcription
- Context-aware document retrieval
- ChatGPT-style conversation interface
- File upload and management
- Real-time processing feedback
- Document viewer for stored files
- Recent uploads tracking
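For reference, a ChatGPT-style upload-and-chat loop in Streamlit looks roughly like the sketch below. This is illustrative only, not the actual `chatbot_app.py`:

```python
# Illustrative Streamlit skeleton for the chat + upload UI (not the app's exact code)
import streamlit as st

uploaded = st.sidebar.file_uploader(
    "Upload documents, images, or audio",
    type=["pdf", "docx", "txt", "png", "jpg", "mp3", "wav"],
    accept_multiple_files=True,
)

if prompt := st.chat_input("Ask about your files"):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        # In the real app, the answer comes from the RAG system
        st.write("...answer from the RAG system goes here...")
```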
The system uses `config.yaml` for configuration (a short PyYAML loading sketch follows the requirements list below):
```yaml
models:
  llm_model: "llama3.1:8b"                               # Ollama Llama 3.1 8B model
  embedding_model: "nomic-embed-text"                    # Nomic text embeddings (768-dim)
  vision_model: "Salesforce/blip-image-captioning-base"  # BLIP for image captioning
  whisper_model: "base"                                  # Whisper base model for audio

vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "traditional_multimodal_documents"
  embedding_dimension: 768                               # Nomic embed text dimension

processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_image_size: [1024, 1024]
  ocr_enabled: true                                      # Tesseract OCR for images
```

- Python: 3.8 or higher
- Ollama: For local LLM inference
- Tesseract OCR: For image text extraction
- FFmpeg: For audio processing (optional)
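As a rough illustration of how the settings in `config.yaml` might be read with PyYAML, here is a minimal sketch (the helper name is an assumption, not the project's actual loader):

```python
# Minimal sketch: load config.yaml with PyYAML (illustrative, not the project's loader)
import yaml

def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
print(config["models"]["llm_model"])     # e.g. "llama3.1:8b"
print(config["vector_store"]["type"])    # e.g. "chromadb"
```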
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull models:

  ```bash
  # Install Ollama (see ollama.ai)
  ollama pull llama3.1:8b
  ollama pull nomic-embed-text
  ```

- Install Tesseract OCR:
  - Windows: Download from GitHub releases
  - macOS: `brew install tesseract`
  - Linux: `sudo apt-get install tesseract-ocr`

- Run the application:

  ```bash
  streamlit run chatbot_app.py
  ```
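As an optional sanity check, the pulled models should show up in Ollama's local model list:

```bash
# Optional: confirm the models were pulled successfully
ollama list
```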
```
[User Input] → [Streamlit UI] → [RAG System] → [File Processors]
                                                      ↓
                                 [Document: PyPDF2/python-docx]
                                 [Image: Tesseract OCR + BLIP]
                                 [Audio: Whisper Transcription]
                                                      ↓
[SQLite DB] ← [Text Chunks] → [Nomic Embed Text (Ollama)] → [ChromaDB]
                                                      ↓
[Vector Search] → [Context Retrieval] → [Llama 3.1 8B (Ollama)] → [Response]
```
- Text Documents: Extracted with PyPDF2/python-docx → Chunked → Embedded with Nomic Embed Text
- Images: OCR with Tesseract + Captioning with BLIP → Combined text → Embedded with Nomic Embed Text
- Audio: Transcribed with Whisper → Chunked → Embedded with Nomic Embed Text
- Storage: All embeddings stored in ChromaDB (768-dim vectors) for semantic search
- Generation: Retrieved context fed to Llama 3.1 8B via Ollama for response generation
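The same flow, sketched with plain ChromaDB and Ollama HTTP calls; this is a minimal illustration under the config defaults above, not the project's actual implementation (the real logic lives in `multimodal_rag/system.py`):

```python
# Minimal sketch of the ingest -> embed -> store -> retrieve -> generate flow.
# Illustrative only; the project's orchestrator wraps and extends this logic.
import chromadb
import requests

OLLAMA_URL = "http://localhost:11434"

def embed(text):
    # Nomic Embed Text via Ollama's embeddings endpoint (768-dim vector)
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def chunk(text, size=1000, overlap=200):
    # Simple character-based chunking matching the config defaults
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

client = chromadb.PersistentClient(path="./vector_db")
collection = client.get_or_create_collection("traditional_multimodal_documents")

def ingest(doc_id, text):
    chunks = chunk(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=[embed(c) for c in chunks],
    )

def answer(question, k=4):
    hits = collection.query(query_embeddings=[embed(question)], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    r = requests.post(f"{OLLAMA_URL}/api/generate",
                      json={"model": "llama3.1:8b", "prompt": prompt, "stream": False})
    return r.json()["response"]
```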
- Upload Files: Drag & drop or browse files in the sidebar
- Chat: Ask questions about your uploaded content
- View Files: Use the eye icon to preview stored documents
- Manage Data: Clear chat history or uploaded files as needed
- Fully Offline: All processing happens locally
- No Data Sent: No external API calls for LLM inference
- Local Storage: Files and embeddings stored on your machine
```python
from multimodal_rag.system import MultimodalRAGSystem

system = MultimodalRAGSystem()
system.ingest_file("presentation.pdf")       # Slides
system.ingest_file("screenshot.png")         # Image with text
system.ingest_file("meeting_recording.mp3")  # Audio transcript

response = system.query("What was discussed about the Q4 budget?")
print(response.answer)
```
### Batch Processing
```python
# Process entire directories
results = system.ingest_directory("./company_docs/", recursive=True)
# Get processing summary
successful = sum(1 for r in results.values() if r.success)
total_chunks = sum(len(r.chunks) for r in results.values() if r.success)
print(f"Processed {successful} files, created {total_chunks} chunks")
```
Run the test suite to verify installation:
```bash
# Run all tests
python -m pytest tests/

# Run specific test file
python tests/test_system.py

# Run with coverage
pip install coverage
coverage run tests/test_system.py
coverage report
```

### ChromaDB (Default - Recommended)
```yaml
vector_store:
  type: "chromadb"
  persist_directory: "./vector_db"
  collection_name: "documents"
  embedding_dimension: 768  # For nomic-embed-text
```

### FAISS (Alternative - High performance)
```yaml
vector_store:
  type: "faiss"
  persist_directory: "./faiss_db"
  embedding_dimension: 768  # Must match nomic-embed-text
```

Run the interactive CLI:

```bash
python cli.py interactive
```

Example Dockerfile for containerized deployment:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "cli.py", "interactive"]
```

Example FastAPI wrapper exposing a query endpoint:

```python
from fastapi import FastAPI
from multimodal_rag.system import MultimodalRAGSystem

app = FastAPI()
system = MultimodalRAGSystem()

@app.post("/query")
async def query_endpoint(query: str):
    response = system.query(query)
    return {"answer": response.answer, "sources": len(response.sources)}
```
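A quick way to exercise this endpoint once the app is served (e.g. with `uvicorn`, assuming the default port 8000; the question shown is just an example):

```python
# Hypothetical client call; assumes the FastAPI app above is running on localhost:8000
import requests

resp = requests.post(
    "http://localhost:8000/query",
    params={"query": "What was discussed about the Q4 budget?"},
)
print(resp.json())
```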
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
python -m pytest

# Format code
black multimodal_rag/ tests/ examples/

# Lint code
flake8 multimodal_rag/ tests/ examples/
```

This project is licensed under the MIT License - see the LICENSE file for details.
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Hugging Face Transformers for language models
- OpenAI Whisper for speech recognition
- Tesseract OCR for text extraction
SmartRAG - Intelligent multimodal document understanding for the modern age.