A fully offline, self-contained voice interaction system featuring speech-to-text, text-to-speech with voice cloning, and configurable wake word detection. Designed to run as a standalone service with comprehensive API endpoints.
- Engine: Vosk
- Models: Dual-model system
  - High-accuracy model for transcription
  - Lightweight model for fast wake word detection
- Fully offline: No internet connection required
- Engine: XTTS v2 (Coqui TTS)
- Voice Cloning: Create and manage custom voice profiles
- Multi-voice Support: Switch between cloned voices on demand
- Fully offline: No cloud dependencies
- Configurable wake words to activate listening
- Configurable "start listening" phrases
- Configurable "done talking" phrases
- Low-latency detection using lightweight Vosk model
- Persistent storage of cloned voice profiles
- Add/remove training samples at any time
- Automatic retraining only when samples change
- No manual retraining required for voice usage
- RESTful API endpoints for all features
- WebSocket support for real-time streaming
- Designed for integration with other applications
- Python 3.9+
- 4GB+ RAM (8GB recommended for voice cloning)
- ~5GB disk space for models and dependencies
- Microphone for speech input
- Speakers/audio output for TTS playback
```
talk2me/
├── README.md
├── requirements.txt
├── setup.py
├── pyproject.toml
├── config/
│   ├── default.yaml                  # Default configuration
│   └── voices.yaml                   # Voice profiles configuration
├── models/
│   └── vosk-model-small-en-us-0.15/  # Vosk STT model
├── voices/
│   └── test_voice/
│       └── samples/                  # Audio samples for cloning
├── src/
│   └── talk2me/
│       ├── __init__.py
│       ├── api/
│       │   ├── __init__.py
│       │   └── main.py               # FastAPI server and all endpoints
│       ├── core/
│       │   ├── __init__.py
│       │   └── wake_word.py          # Wake word detection
│       ├── stt/
│       │   ├── __init__.py
│       │   └── engine.py             # Vosk STT implementation
│       ├── tts/
│       │   ├── __init__.py
│       │   └── engine.py             # XTTS v2 implementation
│       └── utils/
│           └── __init__.py
├── scripts/
│   ├── setup.sh                      # Linux/macOS setup
│   ├── setup.bat                     # Windows setup
│   └── download_models.py            # Model download script
├── tests/
│   ├── __init__.py
│   ├── test_api.py
│   ├── test_stt_engine.py
│   ├── test_tts_engine.py
│   └── test_wake_word.py
```
Linux/macOS:

```bash
git clone https://github.com/FatStinkyPanda/talk2me.git
cd talk2me
chmod +x scripts/setup.sh
./scripts/setup.sh
```

Windows:

```bash
git clone https://github.com/FatStinkyPanda/talk2me.git
cd talk2me
scripts\setup.bat
```
- Create virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # Linux/macOS
  # or venv\Scripts\activate  # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download models:

  ```bash
  python scripts/download_models.py
  ```
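Before starting the service, it can help to confirm the models landed where the configuration expects them. A minimal sketch (the helper name and the base-directory layout are illustrative, matching the project structure above):

```python
from pathlib import Path

def models_present(base_dir: str, model_names: list) -> list:
    """Return the names of expected models missing from base_dir/models."""
    models_dir = Path(base_dir) / "models"
    return [name for name in model_names if not (models_dir / name).is_dir()]

if __name__ == "__main__":
    missing = models_present(".", ["vosk-model-small-en-us-0.15"])
    if missing:
        print("Missing models:", ", ".join(missing))
        print("Run: python scripts/download_models.py")
    else:
        print("All models present.")
```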
`config/default.yaml`:

```yaml
stt:
  model_path: "models/vosk-model-small-en-us-0.15"
  wake_word_model_path: "models/vosk-model-small-en-us-0.15"
  sample_rate: 16000

tts:
  model_path: "models/models/xtts/v2"
  default_voice: "default"
  sample_rate: 24000

wake_words:
  activation:
    - "hey talk to me"
    - "hello computer"
  start_listening:
    - "start listening"
    - "listen up"
  done_talking:
    - "done talking"
    - "that's all"
    - "stop listening"

api:
  host: "0.0.0.0"
  port: 8000
  cors_origins:
    - "*"

audio:
  input_device: null   # null = system default
  output_device: null  # null = system default
  chunk_size: 1024
```

`config/voices.yaml`:

```yaml
voices:
  default:
    name: "Default Voice"
    samples_dir: "voices/default/samples"
    language: "en"
  test_voice:
    name: "Test Voice"
    samples_dir: "voices/test_voice/samples"
    language: "en"
```

```bash
# Start with default configuration
talk2me

# Start with custom config
talk2me --config path/to/config.yaml

# Start API server only
talk2me --api-only

# Start with specific port
talk2me --port 9000

# Interactive mode
talk2me --interactive
```

Base URL: `http://localhost:8000/api/v1`
Transcribe audio file to text.
Request:
- Content-Type: `multipart/form-data`
- Body: `audio` (file)

Response:

```json
{
  "text": "transcribed text here",
  "confidence": 0.95,
  "duration": 2.5
}
```

Real-time streaming transcription.
Messages:
- Send: Binary audio chunks
- Receive: JSON with partial/final transcriptions
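On the client side, partial and final transcriptions need to be handled differently: partials may be revised, finals are committed. A sketch of the receiving logic (the `text`/`final` field names are assumed to match the conversation-mode transcription messages shown later in this document):

```python
import json
from typing import Optional

def handle_stt_message(raw: str, transcript: list) -> Optional[str]:
    """Accumulate final transcriptions; return partial text for live display."""
    msg = json.loads(raw)
    if msg.get("final"):
        transcript.append(msg["text"])
        return None
    return msg.get("text")  # partial hypothesis, may still change

# Example message flow from the server:
lines = []
handle_stt_message('{"text": "hel", "final": false}', lines)       # partial
handle_stt_message('{"text": "hello world", "final": true}', lines)
# lines == ["hello world"]
```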
Convert text to speech.
Request:

```json
{
  "text": "Hello, world!",
  "voice": "default",
  "language": "en"
}
```

Response:
- Content-Type: `audio/wav`
- Body: Audio binary data
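The same request can be made from Python with only the standard library; the endpoint path matches the curl examples later in this README. A sketch:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"

def build_tts_request(text: str, voice: str = "default",
                      language: str = "en") -> urllib.request.Request:
    """Build a POST request for the synthesize endpoint."""
    body = json.dumps({"text": text, "voice": voice,
                       "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/tts/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_tts_request("Hello, world!")
    try:
        # Save the WAV response to disk (requires a running server)
        with urllib.request.urlopen(req) as resp, open("output.wav", "wb") as f:
            f.write(resp.read())
    except OSError as e:
        print("Server not reachable:", e)
```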
Stream synthesized audio.
Request:

```json
{
  "text": "Long text to synthesize...",
  "voice": "default"
}
```

Response:
- Content-Type: `audio/wav`
- Transfer-Encoding: chunked
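Because the response is chunked, a client can write audio to disk as it arrives rather than buffering the whole file. A minimal sketch that works with any file-like response body (e.g. the object returned by `urllib.request.urlopen`):

```python
def stream_to_file(resp, path: str, chunk_size: int = 8192) -> int:
    """Incrementally copy an HTTP response body to disk.

    Returns the total number of bytes written; playback tools can begin
    reading the file before synthesis has finished.
    """
    total = 0
    with open(path, "wb") as f:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:  # EOF: server closed the chunked stream
                break
            f.write(chunk)
            total += len(chunk)
    return total
```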
List all available voices.
Response:

```json
{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "samples_count": 3,
      "language": "en"
    }
  ]
}
```

Create a new voice profile.
Request:

```json
{
  "id": "new_voice",
  "name": "My New Voice",
  "language": "en"
}
```

Get voice profile details.

Delete a voice profile.

Add sample audio to voice profile.

Request:
- Content-Type: `multipart/form-data`
- Body: `audio` (file)
List all samples for a voice.
Remove a sample from voice profile.
Get current configuration.
Update configuration.
Request:

```json
{
  "wake_words": {
    "activation": ["hey computer", "wake up"]
  }
}
```

Get wake word configuration.
Update wake word configuration.
Health check endpoint.
Response:

```json
{
  "status": "healthy",
  "stt_loaded": true,
  "tts_loaded": true,
  "version": "1.0.0"
}
```

Detailed system status.
Reload models and configuration.
Full-duplex conversation mode.
Client -> Server Messages:

```json
{"type": "audio", "data": "<base64 audio>"}
{"type": "config", "wake_words": ["hey"]}
{"type": "command", "action": "start_listening"}
```

Server -> Client Messages:

```json
{"type": "transcription", "text": "hello", "final": true}
{"type": "audio", "data": "<base64 audio>"}
{"type": "status", "listening": true}
{"type": "wake_word_detected", "word": "hey talk to me"}
```
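The client-side messages are plain JSON with base64-encoded audio, so they can be built with the standard library and sent over any WebSocket client (e.g. the `websockets` package). A sketch of helpers following the message shapes above:

```python
import base64
import json

def audio_message(pcm_bytes: bytes) -> str:
    """Wrap raw audio bytes in a conversation-mode audio message."""
    return json.dumps({"type": "audio",
                       "data": base64.b64encode(pcm_bytes).decode("ascii")})

def command_message(action: str) -> str:
    """Build a command message, e.g. action='start_listening'."""
    return json.dumps({"type": "command", "action": action})

def decode_audio_message(raw: str) -> bytes:
    """Recover audio bytes from a server-side audio message."""
    msg = json.loads(raw)
    if msg["type"] != "audio":
        raise ValueError(f"not an audio message: {msg['type']}")
    return base64.b64decode(msg["data"])
```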
- Create voice profile:

  ```bash
  curl -X POST http://localhost:8000/api/v1/voices \
    -H "Content-Type: application/json" \
    -d '{"id": "my_voice", "name": "My Voice", "language": "en"}'
  ```

- Add audio samples:

  ```bash
  curl -X POST http://localhost:8000/api/v1/voices/my_voice/samples \
    -F "audio=@sample1.wav"
  ```

- Use the voice:

  ```bash
  curl -X POST http://localhost:8000/api/v1/tts/synthesize \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello!", "voice": "my_voice"}' \
    --output output.wav
  ```
- Format: WAV, MP3, FLAC, or OGG
- Duration: 6-30 seconds per sample (10-15 seconds optimal)
- Quality: Clear audio, minimal background noise
- Content: Natural speech, varied intonation
- Quantity: 1-10 samples (3-5 recommended)
Samples can be added or removed at any time. The voice will automatically use the updated samples on the next synthesis request without requiring manual retraining.
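The duration constraint above is easy to check before uploading a sample. A sketch using only the standard library `wave` module (WAV files only; the 6-30 second bounds come from the requirements listed above, and the helper name is illustrative):

```python
import wave

def check_sample(path: str, min_s: float = 6.0, max_s: float = 30.0) -> str:
    """Return 'ok', or the reason a WAV clip is unsuitable for cloning."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration < min_s:
        return f"too short ({duration:.1f}s < {min_s}s)"
    if duration > max_s:
        return f"too long ({duration:.1f}s > {max_s}s)"
    return "ok"
```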
Talk2Me is designed for complete offline operation:
- All models bundled: STT and TTS models included in distribution
- No API calls: No external service dependencies
- Local processing: All computation runs locally
- Portable: Single folder contains everything needed
On first run, the setup script downloads required models (~4GB). After setup, no internet connection is needed.
```bash
pytest tests/
```

This project uses ruff for code formatting and linting, which combines the functionality of black, isort, and flake8 into a single, fast tool.

```bash
ruff format src/
ruff check src/
```

"No audio input device found"
- Ensure microphone is connected
- Check system audio settings
- Set specific device in config
"Model not found"
- Run `python scripts/download_models.py`
- Check the `models/` directory structure

"CUDA out of memory"
- Reduce batch size in config
- Use CPU mode: `--device cpu`
"Voice cloning quality poor"
- Add more diverse samples
- Ensure samples are clear audio
- Use 10-15 second clips
MIT License - See LICENSE file for details.