A complete command-line tool for managing document datasets with AI-powered embeddings and interactive chat.
Powered by Memvid v1 - Turn millions of text chunks into a single, searchable video file. 🎬
- ✅ Self-contained virtual environment management
- ✅ Automatic dependency installation
- ✅ Multiple dataset support
- ✅ Support for 11 file formats: Documents (PDF, DOCX, RTF, TXT, MD), Spreadsheets (XLSX, XLS, CSV), Presentations (PPTX), E-books (EPUB), Web (HTML)
- ✅ Automatic embedding generation with source attribution
- ✅ Interactive AI chat with your documents
- ✅ Dataset versioning (append/rebuild)
- ✅ Simple, intuitive CLI with interactive menu mode
- ✅ Improved context retrieval (10 chunks, up from 5)
- ✅ Source file tracking to prevent data mixing
- ✅ Full environment variable configuration (chunk size, LLM model, temperature, etc.)
- ✅ NEW: Interactive menu system with numbered selection
- ✅ NEW: File management interface
- ✅ NEW: Comprehensive help system
Major improvement to text chunking quality with sentence-boundary-aware splitting:
- `SemanticChunker` class: respects sentence boundaries instead of arbitrary character splits
- Benefits: +25% quality improvement over fixed chunking
- Preserves complete sentences (no mid-sentence splits)
- Better semantic coherence per chunk
- Smart sentence-based overlap for continuity
- Size range control: min_chunk_size (300) to max_chunk_size (700)
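The actual implementation lives in memvid_cli.py; the following is a simplified, hypothetical sketch of the idea (the function and parameter names mirror the defaults above, not the real class):

```python
import re

def semantic_chunk(text, min_chunk_size=300, max_chunk_size=700, overlap_sentences=1):
    """Split text on sentence boundaries, keeping chunks within a size range."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []

    for sentence in sentences:
        candidate_len = sum(len(s) + 1 for s in current) + len(sentence)
        if current and candidate_len > max_chunk_size:
            chunks.append(" ".join(current))
            # Sentence-based overlap: carry the last sentence(s) into the next chunk.
            current = current[-overlap_sentences:]
        current.append(sentence)

    if current:
        tail = " ".join(current)
        # Merge an undersized tail into the previous chunk so no chunk falls below min_chunk_size.
        if chunks and len(tail) < min_chunk_size:
            chunks[-1] += " " + tail
        else:
            chunks.append(tail)
    return chunks
```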
Full version history: See CHANGELOG.md
ChatVid supports 11 file formats across 5 categories:
| Category | Formats | Extensions | Features |
|---|---|---|---|
| Documents | PDF, Word, RTF, Text, Markdown | .pdf, .docx, .doc, .rtf, .txt, .md, .markdown | Full text extraction, metadata |
| Spreadsheets | Excel, CSV | .xlsx, .xls, .csv | Multi-sheet support, markdown tables, configurable row limits |
| Presentations | PowerPoint | .pptx | Slides, speaker notes, tables |
| E-books | EPUB | .epub | EPUB2/EPUB3, chapter extraction, metadata |
| Web Content | HTML | .html, .htm | Clean text extraction, script/style removal |
- Spreadsheets: Configurable row limit via `MAX_SPREADSHEET_ROWS` (default: 10,000) prevents memory issues
- PowerPoint: Only `.pptx` supported (not the legacy `.ppt` format)
- EPUB: DRM-protected files not supported
- All formats: Source attribution included automatically for accurate LLM responses
ChatVid requires Python 3.10, 3.11, 3.12, or 3.13.
The CLI script automatically detects and validates your Python installation:
- First run: Detects a suitable Python command (`python3` or `python`) and saves it
- Subsequent runs: Uses the saved command for fast startup (~10ms overhead)
- Auto-recovery: Re-detects if the saved command becomes invalid
- Smart detection: Prefers `python3` over `python` for better cross-platform compatibility
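cli.sh implements this detection in shell; the sketch below restates the same detect-and-cache logic in Python to make it explicit. The cache-file name and helper functions are hypothetical, and the `CHATVID_PYTHON_CMD` override described next is honored first:

```python
import os
import shutil
import subprocess
from pathlib import Path

CACHE_FILE = Path(".python_cmd")                # hypothetical cache location
SUPPORTED = ("3.10", "3.11", "3.12", "3.13")

def is_compatible(cmd: str) -> bool:
    """Return True if `cmd --version` reports a supported Python version."""
    try:
        out = subprocess.run([cmd, "--version"], capture_output=True, text=True)
        version = out.stdout.strip().split()[-1]  # e.g. "3.12.4"
        return version.startswith(SUPPORTED)
    except (OSError, IndexError):
        return False

def detect_python() -> str:
    # 1. An explicit override always wins.
    override = os.environ.get("CHATVID_PYTHON_CMD")
    if override and is_compatible(override):
        return override
    # 2. Reuse the cached command if it still works (fast path).
    if CACHE_FILE.exists():
        cached = CACHE_FILE.read_text().strip()
        if is_compatible(cached):
            return cached
    # 3. Detect from scratch, preferring python3 over python, and cache the result.
    for cmd in ("python3", "python"):
        if shutil.which(cmd) and is_compatible(cmd):
            CACHE_FILE.write_text(cmd)
            return cmd
    raise RuntimeError("No compatible Python (3.10-3.13) found")
```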
If you need to use a specific Python executable (e.g., for pyenv/asdf users):
export CHATVID_PYTHON_CMD=/path/to/python3.12
./cli.sh
To check your current Python version:
python3 --version   # Should show 3.10.x, 3.11.x, 3.12.x, or 3.13.x
# OR
python --version    # If python3 is not available
If you don't have a compatible Python version:
- macOS: `brew install python@3.12`
- Ubuntu/Debian: `sudo apt install python3.12`
- Windows: `winget install Python.Python.3.12`
- Any OS: Download from python.org
- Version managers: Use `pyenv`, `asdf`, or similar tools
New in v1.2.0: Interactive menu mode! Simply run ./cli.sh and follow the numbered menus - no commands to memorize!
cd ChatVid
./cli.sh
The interactive menu will guide you through:
- Setup - Configure your API key
- Create Dataset - Name your dataset
- Build - Select dataset and process documents
- Chat - Select dataset and start asking questions
- File Management - View and manage files
- Help - Comprehensive documentation
Benefits:
- No command memorization needed
- Numbered selection (just type 1, 2, 3, etc.)
- Visual dataset status indicators
- Guided workflows with validation
- Built-in help and troubleshooting
cd ChatVid
./cli.sh setup
This will:
- Create a virtual environment (`venv/`)
- Install all dependencies
- Prompt for your OpenAI or OpenRouter API key
./cli.sh create my-project
This creates:
datasets/my-project/
├── documents/ # Add your files here
└── metadata.json # Dataset tracking
# Copy your files to the documents folder
cp ~/my-documents/*.pdf datasets/my-project/documents/
cp ~/my-documents/*.txt datasets/my-project/documents/
Supported formats:
- PDF (`.pdf`)
- Text (`.txt`)
- Markdown (`.md`)
- Word (`.docx`, `.doc`)
- HTML (`.html`, `.htm`)
./cli.sh build my-project
This will:
- Extract text from all documents
- Add source attribution to prevent data mixing
- Generate semantic embeddings
- Create a searchable knowledge base (`knowledge.mp4`)
./cli.sh chat my-project
Ask questions about your documents and get AI-powered answers! The AI now correctly distinguishes between different source files.
Start interactive menu with numbered selection
./cli.sh
Menu Options:
- Setup / Configure API
- Create New Dataset
- Build Dataset (Process Documents)
- Chat with Dataset
- Append Documents to Dataset
- Rebuild Dataset
- List All Datasets
- Dataset Info
- Manage Dataset Files - NEW!
- Delete Dataset
- Help & Documentation - NEW!
- Exit
Features:
- Dataset selection from numbered list
- File management (view, remove, open folder)
- Built-in help and tutorials
- Progress tracking and validation
Configure your API key (first-time setup)
./cli.sh setup
Prompts for:
- OpenAI API key (https://platform.openai.com/api-keys)
- OpenRouter API key (https://openrouter.ai/keys)
Saves configuration to .env file.
Show comprehensive help documentation - NEW!
./cli.sh help
Displays:
- Command reference with examples
- Configuration variable guide
- Workflow tutorials
- Troubleshooting tips
- Configuration presets
Create a new dataset
./cli.sh create research-papers
Creates the folder structure at datasets/research-papers/
List all datasets with statistics
./cli.sh list
Shows:
- Dataset names
- Creation dates
- Build status
- Number of chunks and files
Show detailed dataset information
./cli.sh info research-papers
Displays:
- Document list with sizes
- Build timestamps
- Embedding statistics
- File paths
Delete a dataset
./cli.sh delete old-project
Requires confirmation by typing the dataset name.
Build embeddings from documents
./cli.sh build research-papers
Processes all files in datasets/<name>/documents/ and creates:
- `knowledge.mp4` - QR code video with embeddings
- `knowledge_index.json` - Metadata index
- `knowledge_index.faiss` - Vector search index
Add new documents to existing dataset
# 1. Add new files to documents/
cp new-file.pdf datasets/research-papers/documents/
# 2. Append to embeddings
./cli.sh append research-papers
Note: Currently rebuilds the entire dataset (Memvid limitation)
Rebuild embeddings from scratch
./cli.sh rebuild research-papers
Deletes existing embeddings and rebuilds from all documents.
When to rebuild:
- After updating to v1.0.2 (adds source attribution)
- When chat responses seem inaccurate
- After changing chunk size settings
Start interactive chat session
./cli.sh chat research-papers
Features:
- Context-aware responses with 10-chunk retrieval window
- Semantic search across all documents
- Source attribution prevents data mixing
- Conversation history
- Type `quit` or `exit` to end
Example session:
You: What did Company A offer in their proposal?
Assistant: [Source: Proposal_CompanyA_3.11.pdf]
Based on the Company A proposal, they offered...
You: What about B pricing?
Assistant: [Source: Proposal_CompanyB_3.11.pdf]
According to the Company B proposal, their pricing structure...
You: quit
Note: The chat now correctly distinguishes between different source files!
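Under the hood, each question is answered by retrieving the most relevant chunks and passing them to the LLM as context. The sketch below illustrates that flow using Memvid's retriever (assuming its documented `search(query, top_k=...)` method) and the openai>=1.0 client; the file paths, prompt wording, and defaults are illustrative, not ChatVid's exact code:

```python
from memvid import MemvidRetriever
from openai import OpenAI

retriever = MemvidRetriever("datasets/research-papers/knowledge.mp4",
                            "datasets/research-papers/knowledge_index.json")
client = OpenAI()  # reads OPENAI_API_KEY; pass base_url=... when using OpenRouter

def ask(question: str, history: list[dict]) -> str:
    # CONTEXT_CHUNKS=10: retrieve the ten most similar chunks.
    chunks = retriever.search(question, top_k=10)
    context = "\n\n".join(chunks)
    messages = (
        [{"role": "system",
          "content": "Answer using only the provided context and cite the "
                     "[Source: ...] markers embedded in it.\n\n" + context}]
        + history                                   # MAX_HISTORY previous turns
        + [{"role": "user", "content": question}]
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.7,
        max_tokens=1000,
        messages=messages,
    )
    return reply.choices[0].message.content
```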
# 1. Setup (first time only)
./cli.sh setup
# Enter your OpenAI API key
# 2. Create dataset
./cli.sh create quantum-research
# 3. Add research papers
cp ~/Downloads/quantum-*.pdf datasets/quantum-research/documents/
# 4. Build embeddings
./cli.sh build quantum-research
# Processing 15 documents...
# Build complete!
# 5. Start chatting
./cli.sh chat quantum-research
You: What are the key breakthroughs in quantum computing?
Assistant: The research papers highlight several key breakthroughs...
# 6. Later: Add more papers
cp new-paper.pdf datasets/quantum-research/documents/
./cli.sh append quantum-research
# 7. Chat with updated knowledge
./cli.sh chat quantum-research
Project structure:
ChatVid/
├── cli.sh # Main CLI entry point
├── memvid_cli.py # Python implementation
├── requirements.txt # Python dependencies
├── .env.example # API key template
├── .env # Your API key (created by setup)
├── README.md # This file
├── venv/ # Virtual environment (auto-created)
└── datasets/ # All your datasets
├── research-papers/
│ ├── documents/ # Your PDF, TXT, MD files
│ ├── metadata.json # Dataset tracking
│ ├── knowledge.mp4 # Embeddings (QR video)
│ ├── knowledge_index.json
│ └── knowledge_index.faiss
└── meeting-notes/
└── documents/
- Get key: https://platform.openai.com/api-keys
- Run `./cli.sh setup`
- Choose option 1 (OpenAI)
- Enter your key: `sk-...`
- Get key: https://openrouter.ai/keys
- Run `./cli.sh setup`
- Choose option 2 (OpenRouter)
- Enter your key: `sk-or-v1-...`
Or manually edit .env file:
# For OpenAI
OPENAI_API_KEY=sk-your-key
# For OpenRouter
OPENAI_API_BASE=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-v1-your-key
ChatVid is fully configurable via environment variables in the .env file.
| Variable | Range | Default | Description |
|---|---|---|---|
| `CHUNK_SIZE` | 100-1000 | 300 | Size of text chunks in characters |
| `CHUNK_OVERLAP` | 20-200 | 50 | Overlap between consecutive chunks |
Example: For technical documents with complex topics:
CHUNK_SIZE=400
CHUNK_OVERLAP=80
| Variable | Range | Default | Description |
|---|---|---|---|
| `LLM_MODEL` | - | `gpt-4o-mini-2024-07-18` (OpenAI), `openai/gpt-4o` (OpenRouter) | Model to use for chat |
| `LLM_TEMPERATURE` | 0.0-2.0 | 0.7 | Response creativity level |
| `LLM_MAX_TOKENS` | 100-4000 | 1000 | Maximum response length |
| `CONTEXT_CHUNKS` | 1-20 | 10 | Chunks retrieved per query |
| `MAX_HISTORY` | 1-50 | 10 | Conversation turns remembered |
Note: Setup command automatically uses the correct model based on provider choice.
Example: For cost optimization:
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=500
CONTEXT_CHUNKS=7
MAX_HISTORY=5
Example: For maximum quality:
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=2000
CONTEXT_CHUNKS=15
MAX_HISTORY=20
More preset examples:
CHUNK_SIZE=400
CHUNK_OVERLAP=80
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=0.3
CONTEXT_CHUNKS=12

CHUNK_SIZE=300
CHUNK_OVERLAP=50
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_TEMPERATURE=1.0
CONTEXT_CHUNKS=10

CHUNK_SIZE=300
CHUNK_OVERLAP=40
LLM_MODEL=gpt-4o-mini-2024-07-18
LLM_MAX_TOKENS=500
CONTEXT_CHUNKS=7

Ways to apply configuration:
- During setup (recommended): `./cli.sh setup` creates `.env` with all default values
- Manual editing: `nano .env` (or use any text editor)
- Per-project configuration:
  - Copy ChatVid to different directories
  - Each directory can have its own `.env` file
  - Different settings for different use cases
After changing chunking settings (CHUNK_SIZE, CHUNK_OVERLAP), rebuild your datasets:
./cli.sh rebuild <dataset-name>
LLM settings (LLM_MODEL, LLM_TEMPERATURE, etc.) take effect immediately; no rebuild is needed.
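As a rough illustration of how these variables are consumed, the snippet below loads .env with python-dotenv and falls back to the documented defaults; the variable names match the tables above, but the helper itself is illustrative rather than ChatVid's actual loader:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the current directory

def env_int(name: str, default: int) -> int:
    return int(os.getenv(name, default))

CHUNK_SIZE      = env_int("CHUNK_SIZE", 300)
CHUNK_OVERLAP   = env_int("CHUNK_OVERLAP", 50)
CONTEXT_CHUNKS  = env_int("CONTEXT_CHUNKS", 10)
MAX_HISTORY     = env_int("MAX_HISTORY", 10)
LLM_MAX_TOKENS  = env_int("LLM_MAX_TOKENS", 1000)
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", 0.7))
LLM_MODEL       = os.getenv("LLM_MODEL", "gpt-4o-mini-2024-07-18")
```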
| Format | Extension | Support | Notes |
|---|---|---|---|
| PDF | .pdf | ✅ Full | Via PyPDF2 |
| Text | .txt | ✅ Full | Plain text |
| Markdown | .md | ✅ Full | Plain text |
| Word | .docx | ✅ Full | Via python-docx |
| Word (old) | .doc | | May require conversion |
| HTML | .html, .htm | ✅ Full | Via BeautifulSoup4, strips tags/scripts |
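To make the library mapping concrete, here is a simplified extraction dispatcher built on the same libraries; it sketches the general approach rather than the extractor shipped in memvid_cli.py:

```python
from pathlib import Path

from PyPDF2 import PdfReader   # .pdf
from docx import Document      # .docx (python-docx)
from bs4 import BeautifulSoup  # .html / .htm

def extract_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix in (".html", ".htm"):
        html = Path(path).read_text(encoding="utf-8", errors="ignore")
        soup = BeautifulSoup(html, "lxml")
        for tag in soup(["script", "style"]):
            tag.decompose()  # drop scripts and styles before extracting text
        return soup.get_text(separator="\n")
    if suffix in (".txt", ".md"):
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported format: {suffix}")
```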
# Reinstall dependencies
./cli.sh setup

# Run setup to configure
./cli.sh setup
# Or check that the .env file exists and has OPENAI_API_KEY=...
cat .env

# Make sure files are in the right place
ls datasets/my-project/documents/
# Supported extensions: .pdf, .txt, .md, .html

# Build embeddings first
./cli.sh build my-project
# Then try chat again
./cli.sh chat my-project

Solution: Rebuild your dataset to add source attribution (v1.0.2+)
./cli.sh rebuild my-project

Adjust document chunk sizes in the .env file:
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
After modifying these values, rebuild datasets:
./cli.sh rebuild <dataset_name>
Edit .env to define API source and model:
OPENAI_API_KEY=sk-your-key
For OpenRouter integration:
OPENAI_API_BASE=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-v1-your-key
LLM_MODEL=openai/gpt-4o
Examples of alternative models:
LLM_MODEL=anthropic/claude-4.5-haiku
LLM_MODEL=google/gemini-pro-2.5
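For reference, the openai>=1.0 Python client is pointed at OpenRouter by overriding the base URL, which is essentially what these variables configure. The snippet below is a generic illustration; ChatVid itself reads the values from .env rather than hard-coding them:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("OPENAI_API_BASE", "https://openrouter.ai/api/v1"),
    api_key=os.environ["OPENAI_API_KEY"],   # sk-or-v1-... for OpenRouter
)

response = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "openai/gpt-4o"),
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```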
Each dataset is independent:
./cli.sh create work-docs
./cli.sh create personal-notes
./cli.sh create research-papers
# Each has its own embeddings and chat
./cli.sh chat work-docs
./cli.sh chat personal-notes
- Text Extraction: Reads PDF/TXT/MD/HTML files and extracts text
- Source Attribution: Prepends `[Source: filename.pdf]` to each document (v1.0.2+)
- Chunking: Splits text into overlapping chunks (~300 chars with 50-char overlap)
- Embeddings: Generates semantic vectors using sentence-transformers (all-MiniLM-L6-v2)
- QR Encoding: Encodes chunks as QR codes
- Video Creation: Creates MP4 video where each frame is a QR code
- Vector Index: Builds FAISS index for fast similarity search
- Chat: Retrieves 10 most relevant chunks and sends to LLM for contextual answers
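The build side of this pipeline maps closely onto Memvid's encoder API. The sketch below shows its general shape, assuming MemvidEncoder's documented `add_chunks` and `build_video` methods; `extract_text` and `semantic_chunk` are the illustrative helpers sketched earlier in this README, not ChatVid's exact functions, and source attribution is added as a plain-text prefix because Memvid stores no per-chunk metadata:

```python
from pathlib import Path
from memvid import MemvidEncoder

def build_dataset(dataset_dir: str) -> None:
    docs_dir = Path(dataset_dir) / "documents"
    encoder = MemvidEncoder()

    for doc in sorted(docs_dir.iterdir()):
        text = extract_text(str(doc))              # 1. text extraction
        text = f"[Source: {doc.name}]\n{text}"     # 2. source attribution prefix
        chunks = semantic_chunk(text)              # 3. sentence-aware chunking
        encoder.add_chunks(chunks)                 # 4. queue chunks for encoding

    # 5-7. embeddings, QR encoding, and video/index creation
    encoder.build_video(str(Path(dataset_dir) / "knowledge.mp4"),
                        str(Path(dataset_dir) / "knowledge_index.json"))
```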
- 10 pages: ~30 seconds
- 100 pages: ~3 minutes
- 1000 pages: ~20 minutes
- Search: <2 seconds for 1M chunks
- LLM response: 2-5 seconds (depends on model)
- 10K chunks: ~20MB video + ~15MB index
- Text compression: ~10:1 ratio
- Organize by topic: Create separate datasets for different subjects
- Rebuild after updates: Run `./cli.sh rebuild <name>` after updating ChatVid
- Clean documents: Remove headers/footers for better results
- Chunk size: Use larger chunks (400-500) for technical docs, smaller (200-300) for mixed content
- API costs: Use GPT-4o-mini (current default) for cost efficiency
- Backup: Keep original documents separate from datasets/
- File naming: Use descriptive filenames - they appear in source attribution
- Context window: 10 chunks is optimal; increase in `memvid_cli.py` (line 527) if needed
- Append: Currently rebuilds entire dataset (Memvid API limitation)
- Binary files: Only text-based formats supported
- OCR: Scanned PDFs require pre-processing with Tesseract (see TODO.md)
- Large files: Very large PDFs (>100MB) may be slow to process
- API costs: Chat requires API key with credits
- Metadata: memvid doesn't support chunk metadata - workaround: source attribution in text
- Page numbers: Not yet tracked for PDFs (planned in TODO.md)
All automatically installed by ./cli.sh setup:
Core:
- `memvid>=0.1.3` - Memvid v1, the core embedding storage and semantic search engine
Document Processing:
- `PyPDF2>=3.0.1` - PDF text extraction
- `python-docx>=0.8.11` - Word document support
- `beautifulsoup4>=4.12.0` - HTML/web content parsing
- `lxml>=4.9.0` - HTML parser backend
API & Configuration:
- `openai>=1.0.0` - LLM integration (OpenAI and OpenRouter)
- `python-dotenv>=1.0.0` - Environment variable management
Get API Keys:
- OpenAI: https://platform.openai.com/api-keys
- OpenRouter: https://openrouter.ai/keys
Memvid Documentation:
Issues:
- Check `./cli.sh list` to verify datasets
- Check `./cli.sh info <name>` for details
- Ensure the API key is set in `.env`
- Run `./cli.sh setup` to reconfigure
ChatVid is powered by Memvid v1 - an innovative library that turns millions of text chunks into a single, searchable video file.
Memvid compresses an entire knowledge base into MP4 files while keeping millisecond-level semantic search. Think of it as SQLite for AI memory - portable, efficient, and self-contained.
Key Features:
- 📦 50-100× smaller storage than traditional vector databases
- 🎬 Encodes text as QR codes in video frames
- 🚀 Zero infrastructure required - just video files
- 🔍 Millisecond-level semantic search
- 💾 Portable and self-contained
Learn more:
- GitHub: https://github.com/Olow304/memvid
- PyPI: https://pypi.org/project/memvid/
- License: MIT
- Author: Olow304
Why Memvid?
ChatVid leverages Memvid's unique approach to storing embeddings as video files, making your document knowledge bases portable, efficient, and requiring zero database infrastructure. The entire dataset fits in a single .mp4 file!
- Memvid v1 by Olow304 - The core technology that makes ChatVid possible
- OpenAI - API for embeddings and chat completions
- OpenRouter - Alternative API provider supporting multiple models
MIT License
Copyright (c) 2025 Esmaabi (ChatVid)
This project is built upon and complies with the MIT License of the Memvid library:
- Memvid v1: Copyright (c) 2025 Olow304
See LICENSE for full details.
# Interactive Menu (Recommended for beginners)
./cli.sh # Start interactive menu - NEW in v1.2.0!
# Help & Documentation
./cli.sh help # Comprehensive help - NEW!
./cli.sh --help # Quick command reference
# Setup
./cli.sh setup # First-time configuration
# Datasets
./cli.sh create <name> # Create new
./cli.sh list # Show all
./cli.sh info <name> # Details
./cli.sh delete <name> # Remove
# Documents
# → Add files to: datasets/<name>/documents/
# Embeddings
./cli.sh build <name> # Initial build
./cli.sh append <name> # Add more docs
./cli.sh rebuild <name> # Start fresh
# Chat
./cli.sh chat <name> # Interactive Q&A
# Configuration
# → Edit .env to change: chunk size, model, temperature, etc.
Ready to get started?
- Beginners: Run `./cli.sh` for the interactive menu 🎯
- Advanced: Run `./cli.sh setup` for command-line mode 🚀