A fully local, privacy-first voice assistant that answers questions from your documents using Retrieval Augmented Generation (RAG). Speak naturally and get conversational responses based on your ingested PDFs.
- 100% Local: All processing happens on your machine - no cloud APIs required
- Voice Interface: Speak questions naturally and hear spoken responses
- RAG Pipeline: Retrieves relevant context from your documents before generating answers
- Hybrid Search: Combines vector similarity with BM25 keyword search for better results
- Apple Silicon Optimized: MLX-accelerated Whisper for fast speech recognition on Mac
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Microphone │────▶│ Silero VAD │────▶│ MLX Whisper │
│ (Input) │ │ (Detect) │ │ (STT) │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Speaker │◀────│ Coqui TTS │◀────│ LM Studio │
│ (Output) │ │ (Speak) │ │ (LLM) │
└─────────────┘ └─────────────┘ └─────────────┘
▲
│
┌───────────────────┴───────────────────┐
│ Weaviate │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ BGE Vectors │ │ BM25 Index │ │
│ └─────────────┘ └─────────────┘ │
└───────────────────────────────────────┘
| Component | Technology | Purpose |
|---|---|---|
| Audio I/O | sounddevice | Microphone input and speaker output |
| VAD | Silero VAD | Detects when you start/stop speaking |
| STT | MLX Whisper | Converts speech to text (Apple Silicon) |
| Embeddings | BGE-base-en-v1.5 | Converts text to vectors for search |
| Vector DB | Weaviate | Stores and searches document chunks |
| LLM | LM Studio | Generates conversational responses |
| TTS | Coqui TTS | Converts response text to speech |
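For a concrete picture of the hybrid search step, here is roughly what a query against Weaviate can look like with the v4 Python client. This is a sketch, not the repo's code: the `DocumentChunk` collection name and `text` property are assumptions, and the real logic lives in src/rag/search.py.

```python
# Hybrid search sketch: blend BM25 keyword matching with BGE vector similarity.
import weaviate
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
question = "What does the warranty cover?"

client = weaviate.connect_to_local(port=8081)  # HTTP port from docker-compose
try:
    chunks = client.collections.get("DocumentChunk")   # assumed collection name
    results = chunks.query.hybrid(
        query=question,                                 # BM25 keyword side
        vector=embedder.encode(question).tolist(),      # vector similarity side
        alpha=0.5,                                      # 0 = pure BM25, 1 = pure vector
        limit=5,
    )
    for obj in results.objects:
        print(obj.properties.get("text"))               # assumed property name
finally:
    client.close()
```

The `alpha` value here corresponds to the `SEARCH_ALPHA` setting described in the configuration table below.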
- Python 3.10+ - Required for all dependencies
- Docker Desktop - For running Weaviate vector database
- LM Studio - For running local LLMs
- Git - For cloning the repository
- Xcode Command Line Tools: `xcode-select --install`
- Homebrew (recommended): For installing system dependencies
- Visual Studio Build Tools: Required for compiling some Python packages
- CUDA Toolkit (optional): For GPU acceleration with NVIDIA cards
This setup uses MLX for hardware-accelerated speech recognition.
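For reference, transcription with the `mlx-whisper` package looks roughly like the sketch below; the audio file name is a placeholder, and the repo wraps this logic in src/stt/whisper_mlx.py.

```python
# Minimal MLX Whisper transcription sketch (Apple Silicon only).
import mlx_whisper

result = mlx_whisper.transcribe(
    "question.wav",  # placeholder audio file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])
```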
git clone https://github.com/InsightStream-Dev/local-voice-rag.git
cd local-voice-rag
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Note: If you encounter numpy/transformers version conflicts, run:
pip install numpy==1.26.4 transformers==4.49.0
docker compose -f docker/docker-compose.yml up -d
Verify it's running:
curl http://localhost:8081/v1/.well-known/ready
Next, set up LM Studio:
- Download from lmstudio.ai
- Launch LM Studio
- Download a model (recommended: `qwen2.5-3b-instruct` for speed)
- Go to the Developer tab → Start the local server
- Ensure it's running on `http://localhost:1234`
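Before wiring up the assistant, you can sanity-check the server from Python, since LM Studio exposes an OpenAI-compatible API. A sketch using the `openai` package (the repo's own client lives in src/llm/lmstudio.py and may differ):

```python
# Quick LM Studio connectivity check via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# List whatever models LM Studio currently has loaded
for model in client.models.list():
    print(model.id)

# Ask for a short completion to confirm generation works
reply = client.chat.completions.create(
    model="local-model",  # LM Studio auto-detects the loaded model
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(reply.choices[0].message.content)
```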
cp .env.example .env
Edit .env if needed (defaults work for most setups):
# Weaviate
WEAVIATE_URL=http://localhost:8081
# LM Studio
LMSTUDIO_URL=http://localhost:1234/v1
LMSTUDIO_MODEL=local-model
# Models
WHISPER_MODEL=mlx-community/whisper-large-v3-turbo
EMBEDDING_MODEL=BAAI/bge-base-en-v1.5
TTS_MODEL=tts_models/en/ljspeech/tacotron2-DDC
Intel Macs and Linux systems cannot use MLX. Use CPU Whisper instead.
Edit requirements.txt and uncomment the CPU fallback:
# openai-whisper>=20231117  ← Uncomment this line
Then install:
pip install openai-whisper
Edit .env:
WHISPER_MODEL=base.en  # or small.en, medium.en, large-v3
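For reference, the CPU path uses the standard `openai-whisper` API, roughly as below; the audio file name is a placeholder, and the repo wraps this in src/stt/whisper_cpu.py.

```python
# Minimal CPU Whisper transcription sketch.
import whisper

model = whisper.load_model("base.en")       # matches WHISPER_MODEL above
result = model.transcribe("question.wav")   # placeholder audio file
print(result["text"])
```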
Windows setup requires additional steps for audio and some Python packages.
- Python 3.10+: Download from python.org
  - Check "Add Python to PATH" during installation
- Docker Desktop: Download from docker.com
  - Enable WSL 2 backend when prompted
- Visual Studio Build Tools:
  - Download from visualstudio.microsoft.com
  - Install "Desktop development with C++"
- Git: Download from git-scm.com
Open PowerShell or Command Prompt:
git clone https://github.com/InsightStream-Dev/local-voice-rag.git
cd local-voice-rag
python -m venv venv
.\venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
For audio support, you may need to install PortAudio:
# Using pip (try this first)
pip install sounddevice
# If that fails, install manually:
# Download PortAudio from http://www.portaudio.com/download.html
Windows cannot use MLX. Edit requirements.txt:
# Comment out MLX lines:
# mlx>=0.12.0
# mlx-whisper>=0.4.0
# Uncomment CPU whisper:
openai-whisper>=20231117
Install:
pip install openai-whisper
Create .env file:
WEAVIATE_URL=http://localhost:8081
LMSTUDIO_URL=http://localhost:1234/v1
LMSTUDIO_MODEL=local-model
# Use CPU Whisper model
WHISPER_MODEL=base.en
Start Weaviate:
docker compose -f docker/docker-compose.yml up -d
Then set up LM Studio:
- Download from lmstudio.ai
- Install and launch
- Download a model (e.g., `qwen2.5-3b-instruct`)
- Start the local server on port 1234
First, add your PDF documents to the knowledge base:
# Activate virtual environment
source venv/bin/activate # macOS/Linux
.\venv\Scripts\activate # Windows
# Ingest a PDF
python scripts/ingest_document.py /path/to/your/document.pdf
Example output:
Ingesting: /path/to/document.pdf
Extracted 50 pages
Created 120 chunks
Embedded and stored in Weaviate
Done! Document ingested successfully.
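Under the hood, ingestion extracts the PDF text, chunks it, embeds each chunk with BGE, and stores everything in Weaviate. The condensed sketch below illustrates that flow; it assumes pypdf, sentence-transformers, and the Weaviate v4 client, uses a naive word-based chunker, and invents the `DocumentChunk` collection name, so treat it as an outline of src/rag/ingest.py rather than its actual code.

```python
# Condensed ingestion flow: PDF -> text -> chunks -> embeddings -> Weaviate.
import weaviate
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive word-based chunking with overlap (the real chunker works on tokens)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

reader = PdfReader("document.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
chunks = chunk_text(full_text)

client = weaviate.connect_to_local(port=8081)
try:
    collection = client.collections.get("DocumentChunk")  # assumed collection name
    with collection.batch.dynamic() as batch:
        for chunk in chunks:
            batch.add_object(
                properties={"text": chunk, "source": "document.pdf"},
                vector=embedder.encode(chunk).tolist(),
            )
finally:
    client.close()
```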
Verify your documents are searchable:
python scripts/test_search.py "your search query"python -m src.assistant.pipelineOutput:
Warming up voice assistant...
Loading VAD...
Loading STT (Whisper)...
Loading TTS...
Connecting to LLM...
Model: qwen2.5-3b-instruct
Connecting to vector database...
Initializing audio...
Voice assistant ready!
==================================================
Voice Assistant Active
Speak to ask questions about your documents.
Press Ctrl+C to exit.
==================================================
[Listening...]
Now speak your question! The assistant will:
- Detect when you start speaking
- Transcribe your speech
- Search for relevant document chunks
- Generate a response using the LLM
- Speak the answer back to you
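Put together, the stages above form a simple loop. The sketch below shows the shape of that loop with placeholder callables; the actual orchestration lives in src/assistant/pipeline.py and its signatures will differ.

```python
# Simplified conversation loop: listen -> transcribe -> retrieve -> generate -> speak.
from typing import Callable, Iterable

def run_assistant(
    record_until_silence: Callable[[], bytes],      # microphone capture + Silero VAD
    transcribe: Callable[[bytes], str],             # Whisper speech-to-text
    hybrid_search: Callable[[str], Iterable[str]],  # Weaviate hybrid retrieval
    generate: Callable[[str], str],                 # LM Studio chat completion
    speak: Callable[[str], None],                   # Coqui TTS + speaker playback
) -> None:
    while True:
        audio = record_until_silence()
        question = transcribe(audio).strip()
        if not question:
            continue
        context = "\n\n".join(hybrid_search(question))
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        speak(generate(prompt))
```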
local-voice-rag/
├── docker/
│ └── docker-compose.yml # Weaviate configuration
├── docs/ # Your PDF documents
├── scripts/
│ ├── ingest_document.py # PDF ingestion CLI
│ ├── test_search.py # Test vector search
│ ├── test_llm.py # Test LM Studio connection
│ ├── test_stt.py # Test speech-to-text
│ ├── test_tts.py # Test text-to-speech
│ ├── test_vad.py # Test voice activity detection
│ └── test_audio.py # Test audio I/O
├── src/
│ ├── audio/
│ │ ├── microphone.py # Async microphone capture
│ │ ├── speaker.py # Audio playback
│ │ └── vad.py # Silero VAD wrapper
│ ├── stt/
│ │ ├── whisper_mlx.py # MLX Whisper (Apple Silicon)
│ │ └── whisper_cpu.py # CPU Whisper fallback
│ ├── tts/
│ │ └── coqui.py # Coqui TTS wrapper
│ ├── rag/
│ │ ├── embeddings.py # BGE embeddings
│ │ ├── weaviate_client.py # Weaviate connection
│ │ ├── chunker.py # Document chunking
│ │ ├── ingest.py # PDF ingestion pipeline
│ │ └── search.py # Hybrid search
│ ├── llm/
│ │ └── lmstudio.py # LM Studio client
│ ├── assistant/
│ │ ├── pipeline.py # Main voice assistant
│ │ └── prompts.py # System prompts
│ └── config.py # Configuration management
├── requirements.txt
├── .env.example
└── README.md
All settings can be configured via environment variables or the .env file:
| Variable | Default | Description |
|---|---|---|
| `WEAVIATE_URL` | `http://localhost:8081` | Weaviate HTTP endpoint |
| `LMSTUDIO_URL` | `http://localhost:1234/v1` | LM Studio API endpoint |
| `LMSTUDIO_MODEL` | `local-model` | Model identifier (auto-detected) |
| `WHISPER_MODEL` | `mlx-community/whisper-large-v3-turbo` | Speech recognition model |
| `EMBEDDING_MODEL` | `BAAI/bge-base-en-v1.5` | Text embedding model |
| `TTS_MODEL` | `tts_models/en/ljspeech/tacotron2-DDC` | Text-to-speech model |
| `AUDIO_SAMPLE_RATE` | `16000` | Audio sample rate (Hz) |
| `VAD_THRESHOLD` | `0.5` | Voice activity detection sensitivity |
| `VAD_SILENCE_DURATION_MS` | `700` | Silence before speech end (ms) |
| `CHUNK_SIZE` | `500` | Tokens per document chunk |
| `CHUNK_OVERLAP` | `50` | Overlap between chunks (tokens) |
| `SEARCH_LIMIT` | `5` | Number of search results |
| `SEARCH_ALPHA` | `0.5` | Hybrid search balance (0=BM25, 1=Vector) |
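A minimal way to read these settings from Python, loosely mirroring what src/config.py does (a sketch assuming python-dotenv; the real module may load and validate settings differently):

```python
# Minimal settings loader sketch: read .env, fall back to the documented defaults.
import os
from dataclasses import dataclass

from dotenv import load_dotenv  # python-dotenv (assumed dependency)

load_dotenv()  # copies values from a local .env file into the environment

@dataclass(frozen=True)
class Settings:
    weaviate_url: str = os.getenv("WEAVIATE_URL", "http://localhost:8081")
    lmstudio_url: str = os.getenv("LMSTUDIO_URL", "http://localhost:1234/v1")
    whisper_model: str = os.getenv("WHISPER_MODEL", "mlx-community/whisper-large-v3-turbo")
    search_limit: int = int(os.getenv("SEARCH_LIMIT", "5"))
    search_alpha: float = float(os.getenv("SEARCH_ALPHA", "0.5"))

settings = Settings()
print(settings.weaviate_url, settings.search_limit)
```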
Error: Connection refused or Cannot connect to Weaviate
- Check if Docker is running: `docker ps`
- Check Weaviate container status: `docker compose -f docker/docker-compose.yml logs`
- Verify Weaviate is healthy: `curl http://localhost:8081/v1/.well-known/ready`
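You can run the same readiness check from Python with the Weaviate v4 client (a sketch; it assumes the gRPC port from docker-compose is left at its default):

```python
# Weaviate readiness check sketch using the v4 Python client.
import weaviate

client = weaviate.connect_to_local(port=8081)  # raises if Weaviate is unreachable
try:
    print("Weaviate ready:", client.is_ready())
finally:
    client.close()
```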
Error: LM Studio is not available
- Ensure LM Studio is running
- Check the Developer tab - server must be started
- Verify a model is loaded
- Test the connection: `curl http://localhost:1234/v1/models`
Error: No input device found or PortAudio error
macOS:
- Grant microphone permission in System Preferences → Security & Privacy → Microphone
Windows:
- Check Windows audio settings
- Try installing PortAudio manually
- Run as Administrator if permission issues persist
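To see which audio devices Python can actually find (the assistant uses sounddevice for all audio I/O), a quick check:

```python
# List the audio devices visible to sounddevice / PortAudio.
import sounddevice as sd

print(sd.query_devices())                         # every device with its index
print("Default input/output:", sd.default.device)
```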
Error: Various import errors or deprecation warnings
pip install numpy==1.26.4 transformers==4.49.0
Error: MLX is only supported on Apple Silicon
Use CPU Whisper instead:
- Comment out `mlx` and `mlx-whisper` in `requirements.txt`
- Uncomment `openai-whisper`
- Set `WHISPER_MODEL=base.en` in `.env`
If responses take too long (>30s):
- Use a smaller LLM: In LM Studio, try `qwen2.5-1.5b-instruct` or `phi-3-mini`
- Use a smaller Whisper model: Set `WHISPER_MODEL=base.en`
- Reduce search results: Set `SEARCH_LIMIT=3` in `.env`
If you run out of RAM:
- Close other applications
- Use smaller models:
  - LLM: 1-3B parameter models
  - Whisper: `base.en` or `small.en`
- Reduce chunk size: `CHUNK_SIZE=300`
Test each component independently to isolate issues:
# Test audio input/output
python scripts/test_audio.py
# Test voice activity detection
python scripts/test_vad.py
# Test speech-to-text
python scripts/test_stt.py
# Test text-to-speech
python scripts/test_tts.py
# Test LLM connection
python scripts/test_llm.py
# Test document search
python scripts/test_search.py "test query"Typical response times on Apple Silicon M1/M2:
| Stage | Time |
|---|---|
| Speech-to-Text | 1-3s |
| Vector Search | 0.1-0.3s |
| LLM Generation | 5-15s |
| Text-to-Speech | 2-5s |
| Total | 8-23s |
Times vary based on:
- Input audio length
- LLM model size and complexity
- Response length
- Hardware specifications
MIT License - See LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request.
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Commit changes: `git commit -am 'Add my feature'`
- Push to branch: `git push origin feature/my-feature`
- Open a Pull Request