A fully offline, self-contained voice interaction system featuring speech-to-text, text-to-speech with voice cloning, and configurable wake word detection. Designed to run as a standalone service with comprehensive API endpoints.
- Engine: Vosk
- Models: Dual-model system
  - High-accuracy model for transcription
  - Lightweight model for fast wake word detection
- Fully offline: No internet connection required
- Engine: XTTS v2 (Coqui TTS)
- Voice Cloning: Create and manage custom voice profiles
- Multi-voice Support: Switch between cloned voices on demand
- Fully offline: No cloud dependencies
- Configurable wake words to activate listening
- Configurable "start listening" phrases
- Configurable "done talking" phrases
- Low-latency detection using lightweight Vosk model
- Persistent storage of cloned voice profiles
- Add/remove training samples at any time
- Automatic retraining only when samples change
- No manual retraining required for voice usage
- RESTful API endpoints for all features
- WebSocket support for real-time streaming
- Designed for integration with other applications
- Python 3.9+
- 4GB+ RAM (8GB recommended for voice cloning)
- ~5GB disk space for models and dependencies
- Microphone for speech input
- Speakers/audio output for TTS playback
```
talk2me/
├── README.md
├── requirements.txt
├── setup.py
├── pyproject.toml
├── config/
│   ├── default.yaml                  # Default configuration
│   └── voices.yaml                   # Voice profiles configuration
├── models/
│   └── vosk-model-small-en-us-0.15/  # Vosk STT model
├── voices/
│   └── test_voice/
│       └── samples/                  # Audio samples for cloning
├── src/
│   └── talk2me/
│       ├── __init__.py
│       ├── api/
│       │   ├── __init__.py
│       │   └── main.py               # FastAPI server and all endpoints
│       ├── core/
│       │   ├── __init__.py
│       │   └── wake_word.py          # Wake word detection
│       ├── stt/
│       │   ├── __init__.py
│       │   └── engine.py             # Vosk STT implementation
│       ├── tts/
│       │   ├── __init__.py
│       │   └── engine.py             # XTTS v2 implementation
│       └── utils/
│           └── __init__.py
├── scripts/
│   ├── setup.sh                      # Linux/macOS setup
│   ├── setup.bat                     # Windows setup
│   └── download_models.py            # Model download script
├── tests/
│   ├── __init__.py
│   ├── test_api.py
│   ├── test_stt_engine.py
│   ├── test_tts_engine.py
│   └── test_wake_word.py
```
Linux/macOS:

```bash
git clone https://github.com/FatStinkyPanda/talk2me.git
cd talk2me
chmod +x scripts/setup.sh
./scripts/setup.sh
```

Windows:

```bash
git clone https://github.com/FatStinkyPanda/talk2me.git
cd talk2me
scripts\setup.bat
```
- Create virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # Linux/macOS
  # or venv\Scripts\activate  # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download models:

  ```bash
  python scripts/download_models.py
  ```
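Before starting the service, it can help to confirm the models landed where the configuration expects them. A minimal sketch (the helper name and the base-directory layout are illustrative, matching the project structure above):

```python
from pathlib import Path

def models_present(base_dir: str, model_names: list) -> list:
    """Return the names of expected models missing from base_dir/models."""
    models_dir = Path(base_dir) / "models"
    return [name for name in model_names if not (models_dir / name).is_dir()]

if __name__ == "__main__":
    missing = models_present(".", ["vosk-model-small-en-us-0.15"])
    if missing:
        print("Missing models:", ", ".join(missing))
        print("Run: python scripts/download_models.py")
    else:
        print("All models present.")
```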
`config/default.yaml`:

```yaml
stt:
  model_path: "models/vosk-model-small-en-us-0.15"
  wake_word_model_path: "models/vosk-model-small-en-us-0.15"
  sample_rate: 16000

tts:
  model_path: "models/models/xtts/v2"
  default_voice: "default"
  sample_rate: 24000

wake_words:
  activation:
    - "hey talk to me"
    - "hello computer"
  start_listening:
    - "start listening"
    - "listen up"
  done_talking:
    - "done talking"
    - "that's all"
    - "stop listening"

api:
  host: "0.0.0.0"
  port: 8000
  cors_origins:
    - "*"

audio:
  input_device: null   # null = system default
  output_device: null  # null = system default
  chunk_size: 1024
```

`config/voices.yaml`:

```yaml
voices:
  default:
    name: "Default Voice"
    samples_dir: "voices/default/samples"
    language: "en"
  test_voice:
    name: "Test Voice"
    samples_dir: "voices/test_voice/samples"
    language: "en"
```

```bash
# Start with default configuration
talk2me

# Start with custom config
talk2me --config path/to/config.yaml

# Start API server only
talk2me --api-only

# Start with specific port
talk2me --port 9000

# Interactive mode
talk2me --interactive
```

Base URL: `http://localhost:8000/api/v1`
Transcribe audio file to text.
Request:
- Content-Type: `multipart/form-data`
- Body: `audio` (file)

Response:

```json
{
  "text": "transcribed text here",
  "confidence": 0.95,
  "duration": 2.5
}
```

Real-time streaming transcription.
Messages:
- Send: Binary audio chunks
- Receive: JSON with partial/final transcriptions
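On the client side, partial and final transcriptions need to be handled differently: partials may be revised, finals are committed. A sketch of the receiving logic (the `text`/`final` field names are assumed to match the conversation-mode transcription messages shown later in this document):

```python
import json
from typing import Optional

def handle_stt_message(raw: str, transcript: list) -> Optional[str]:
    """Accumulate final transcriptions; return partial text for live display."""
    msg = json.loads(raw)
    if msg.get("final"):
        transcript.append(msg["text"])
        return None
    return msg.get("text")  # partial hypothesis, may still change

# Example message flow from the server:
lines = []
handle_stt_message('{"text": "hel", "final": false}', lines)       # partial
handle_stt_message('{"text": "hello world", "final": true}', lines)
# lines == ["hello world"]
```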
Convert text to speech.
Request:

```json
{
  "text": "Hello, world!",
  "voice": "default",
  "language": "en"
}
```

Response:
- Content-Type: `audio/wav`
- Body: Audio binary data
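The same request can be made from Python with only the standard library; the endpoint path matches the curl examples later in this README. A sketch:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"

def build_tts_request(text: str, voice: str = "default",
                      language: str = "en") -> urllib.request.Request:
    """Build a POST request for the synthesize endpoint."""
    body = json.dumps({"text": text, "voice": voice,
                       "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/tts/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_tts_request("Hello, world!")
    try:
        # Save the WAV response to disk (requires a running server)
        with urllib.request.urlopen(req) as resp, open("output.wav", "wb") as f:
            f.write(resp.read())
    except OSError as e:
        print("Server not reachable:", e)
```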
Stream synthesized audio.
Request:

```json
{
  "text": "Long text to synthesize...",
  "voice": "default"
}
```

Response:
- Content-Type: `audio/wav`
- Transfer-Encoding: chunked
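Because the response is chunked, a client can write audio to disk as it arrives rather than buffering the whole file. A minimal sketch that works with any file-like response body (e.g. the object returned by `urllib.request.urlopen`):

```python
def stream_to_file(resp, path: str, chunk_size: int = 8192) -> int:
    """Incrementally copy an HTTP response body to disk.

    Returns the total number of bytes written; playback tools can begin
    reading the file before synthesis has finished.
    """
    total = 0
    with open(path, "wb") as f:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:  # EOF: server closed the chunked stream
                break
            f.write(chunk)
            total += len(chunk)
    return total
```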
List all available voices.
Response:

```json
{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "samples_count": 3,
      "language": "en"
    }
  ]
}
```

Create a new voice profile.
Request:

```json
{
  "id": "new_voice",
  "name": "My New Voice",
  "language": "en"
}
```

Get voice profile details.

Delete a voice profile.

Add sample audio to voice profile.

Request:
- Content-Type: `multipart/form-data`
- Body: `audio` (file)
List all samples for a voice.
Remove a sample from voice profile.
Get current configuration.
Update configuration.
Request:

```json
{
  "wake_words": {
    "activation": ["hey computer", "wake up"]
  }
}
```

Get wake word configuration.
Update wake word configuration.
Health check endpoint.
Response:

```json
{
  "status": "healthy",
  "stt_loaded": true,
  "tts_loaded": true,
  "version": "1.0.0"
}
```

Detailed system status.
Reload models and configuration.
Full-duplex conversation mode.
Client -> Server Messages:

```json
{"type": "audio", "data": "<base64 audio>"}
{"type": "config", "wake_words": ["hey"]}
{"type": "command", "action": "start_listening"}
```

Server -> Client Messages:

```json
{"type": "transcription", "text": "hello", "final": true}
{"type": "audio", "data": "<base64 audio>"}
{"type": "status", "listening": true}
{"type": "wake_word_detected", "word": "hey talk to me"}
```
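The client-side messages are plain JSON with base64-encoded audio, so they can be built with the standard library and sent over any WebSocket client (e.g. the `websockets` package). A sketch of helpers following the message shapes above:

```python
import base64
import json

def audio_message(pcm_bytes: bytes) -> str:
    """Wrap raw audio bytes in a conversation-mode audio message."""
    return json.dumps({"type": "audio",
                       "data": base64.b64encode(pcm_bytes).decode("ascii")})

def command_message(action: str) -> str:
    """Build a command message, e.g. action='start_listening'."""
    return json.dumps({"type": "command", "action": action})

def decode_audio_message(raw: str) -> bytes:
    """Recover audio bytes from a server-side audio message."""
    msg = json.loads(raw)
    if msg["type"] != "audio":
        raise ValueError(f"not an audio message: {msg['type']}")
    return base64.b64decode(msg["data"])
```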
- Create voice profile:

  ```bash
  curl -X POST http://localhost:8000/api/v1/voices \
    -H "Content-Type: application/json" \
    -d '{"id": "my_voice", "name": "My Voice", "language": "en"}'
  ```

- Add audio samples:

  ```bash
  curl -X POST http://localhost:8000/api/v1/voices/my_voice/samples \
    -F "audio=@sample1.wav"
  ```

- Use the voice:

  ```bash
  curl -X POST http://localhost:8000/api/v1/tts/synthesize \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello!", "voice": "my_voice"}' \
    --output output.wav
  ```
- Format: WAV, MP3, FLAC, or OGG
- Duration: 6-30 seconds per sample (10-15 seconds optimal)
- Quality: Clear audio, minimal background noise
- Content: Natural speech, varied intonation
- Quantity: 1-10 samples (3-5 recommended)
Samples can be added or removed at any time. The voice will automatically use the updated samples on the next synthesis request without requiring manual retraining.
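The duration constraint above is easy to check before uploading a sample. A sketch using only the standard library `wave` module (WAV files only; the 6-30 second bounds come from the requirements listed above, and the helper name is illustrative):

```python
import wave

def check_sample(path: str, min_s: float = 6.0, max_s: float = 30.0) -> str:
    """Return 'ok', or the reason a WAV clip is unsuitable for cloning."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration < min_s:
        return f"too short ({duration:.1f}s < {min_s}s)"
    if duration > max_s:
        return f"too long ({duration:.1f}s > {max_s}s)"
    return "ok"
```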
Talk2Me is designed for complete offline operation:
- All models bundled: STT and TTS models included in distribution
- No API calls: No external service dependencies
- Local processing: All computation runs locally
- Portable: Single folder contains everything needed
On first run, the setup script downloads required models (~4GB). After setup, no internet connection is needed.
```bash
pytest tests/
```

This project uses ruff for code formatting and linting, which combines the functionality of black, isort, and flake8 into a single, fast tool.

```bash
ruff format src/
ruff check src/
```

"No audio input device found"
- Ensure microphone is connected
- Check system audio settings
- Set specific device in config
"Model not found"
- Run `python scripts/download_models.py`
- Check the `models/` directory structure

"CUDA out of memory"
- Reduce batch size in config
- Use CPU mode: `--device cpu`
"Voice cloning quality poor"
- Add more diverse samples
- Ensure samples are clear audio
- Use 10-15 second clips
MIT License - See LICENSE file for details.