A modern desktop application for creating audiobooks with advanced text-to-speech and voice cloning capabilities
v1.1.1 - Hotfix for remote backend connectivity. See Release Notes.
v1.1.0 - Docker-based deployment, Remote GPU hosts, Engine variants. See Release Notes.
Audiobook Maker is a powerful Tauri 2.0 desktop application that transforms text into high-quality audiobooks using state-of-the-art text-to-speech technology. Built with a modern tech stack combining React, TypeScript, and Python FastAPI, it offers professional-grade features in an intuitive interface.
- Docker-Based Deployment - One-command setup with prebuilt containers for backend and engines
- Remote GPU Hosts - Offload GPU-intensive engines to dedicated servers via SSH
- Multi-Engine Architecture - 4 engine types (TTS, STT, Text Processing, Audio Analysis)
- Engine Variants - Run engines locally (subprocess), in Docker, or on remote hosts
- Voice Cloning - Create custom voices using XTTS, Chatterbox, or VibeVoice with speaker samples
- Quality Assurance - Whisper-based transcription analysis and Silero-VAD audio quality detection
- Pronunciation Rules - Pattern-based text transformation to fix mispronunciations
- Project Organization - Hierarchical structure with Projects, Chapters, and Segments
- Drag & Drop Interface - Intuitive content organization and reordering
- Multi-Language Support - 17+ languages including English, German, Spanish, French, Chinese, Japanese
- Multiple Export Formats - Export to MP3, M4A, or WAV with quality presets
- Smart Text Segmentation - Automatic text splitting using spaCy NLP engine
- Real-Time Updates - Server-Sent Events for instant UI feedback
- Job Management - Database-backed queue, resume cancelled jobs, track progress
- Markdown or EPUB Import - Import entire projects from structured files
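The Job Management item above describes a database-backed queue in which cancelled jobs can be resumed. A minimal sketch of that idea using Python's built-in `sqlite3` might look like the following. The table name, columns, and status values are assumptions made for illustration and are not the application's actual schema.

```python
import sqlite3

# Hypothetical schema: table, column, and status names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    chapter_id TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    total_segments INTEGER NOT NULL,
    done_segments INTEGER NOT NULL DEFAULT 0
)
"""


def open_queue(path=":memory:"):
    """Open (or create) the queue database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, chapter_id, total_segments):
    """Add a pending job and return its id."""
    cur = conn.execute(
        "INSERT INTO jobs (chapter_id, total_segments) VALUES (?, ?)",
        (chapter_id, total_segments),
    )
    conn.commit()
    return cur.lastrowid


def record_progress(conn, job_id, done_segments):
    """Store how many segments have been generated so far."""
    conn.execute(
        "UPDATE jobs SET status = 'running', done_segments = ? WHERE id = ?",
        (done_segments, job_id),
    )
    conn.commit()


def cancel(conn, job_id):
    """Mark a job as cancelled; its progress stays in the database."""
    conn.execute("UPDATE jobs SET status = 'cancelled' WHERE id = ?", (job_id,))
    conn.commit()


def resume(conn, job_id):
    """Re-queue a cancelled job and return the number of segments still to do.

    Returns None if the job does not exist or was not cancelled.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending' WHERE id = ? AND status = 'cancelled'",
        (job_id,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None
    total, done = conn.execute(
        "SELECT total_segments, done_segments FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return total - done
```

Because progress lives in the database rather than in memory, a resumed job only needs to process the segments that were not finished before cancellation.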
┌─────────────────────────────────────────────────────────────────┐
│                   Audiobook Maker Desktop App                   │
│                    (Tauri + React Frontend)                     │
└───────────────────────────┬─────────────────────────────────────┘
                            │ HTTP/REST API + SSE
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Backend Container (Port 8765)                  │
│            ghcr.io/digijoe79/audiobook-maker/backend            │
├─────────────────────────────────────────────────────────────────┤
│  FastAPI  │  SQLite  │  TTS/Quality Workers  │  Engine Managers │
│           │          │                       │  (Docker Runner) │
└───────────────────────────┬─────────────────────────────────────┘
                            │ Docker API
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ Local Docker  │   │ Local Docker  │   │ Remote Docker │
│    Engines    │   │    Engines    │   │  Host (GPU)   │
│  xtts, spacy  │   │whisper,silero │   │ xtts,whisper  │
└───────────────┘   └───────────────┘   └───────────────┘
Key Architecture Features:
- Backend and engines run as Docker containers
- GPU engines can run on remote hosts via SSH tunnel
- Automatic engine discovery from online catalog
- Engine enable/disable with auto-stop after inactivity
- Real-time updates via Server-Sent Events (SSE)
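To make the SSE item above more concrete, here is a small sketch of how a client could parse a Server-Sent Events stream into `(event, data)` pairs. It follows the standard SSE wire format (`event:` and `data:` fields, a blank line ending each event, lines starting with `:` treated as comments). The event name used in the example is hypothetical; the backend's real event names and stream endpoint are not stated here.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch only if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Hypothetical stream content, for illustration only.
example_stream = [
    "event: segment.completed\n",
    'data: {"segment_id": 1}\n',
    "\n",
    ": keep-alive\n",
    "data: plain message\n",
    "\n",
]

for name, payload in parse_sse(example_stream):
    print(name, payload)
```

In a real client the lines would come from a streaming HTTP response rather than a list, but the parsing logic stays the same.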
| Requirement | Purpose | Installation |
|---|---|---|
| Docker Desktop | Run backend and engines | Download |
| NVIDIA Container Toolkit | GPU support (optional) | Install Guide |
Note: For GPU-accelerated engines (XTTS, Chatterbox, Whisper), you need an NVIDIA GPU with CUDA support and the NVIDIA Container Toolkit installed.
Download the latest Windows release from GitHub Releases:
- Windows: Audiobook-Maker_1.1.1_x64-setup.exe
Linux/macOS: No prebuilt binaries available. See Development Setup to build from source.
docker pull ghcr.io/digijoe79/audiobook-maker/backend:latest
docker run -d \
--name audiobook-maker-backend \
-p 8765:8765 \
--add-host=host.docker.internal:host-gateway \
-e DOCKER_ENGINE_HOST=host.docker.internal \
-v /var/run/docker.sock:/var/run/docker.sock \
-v audiobook-data:/app/data \
-v audiobook-media:/app/media \
ghcr.io/digijoe79/audiobook-maker/backend:latest
Important: The container must be named audiobook-maker-backend. On startup, the backend cleans up orphaned engine containers (prefix audiobook-) from previous sessions. Containers matching this prefix are stopped unless explicitly excluded by name.
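The naming requirement is easier to see with a sketch of the selection logic described above: every container whose name starts with `audiobook-` is a cleanup candidate unless its name is on an exclusion list. The function below is an illustration under that assumption, not the backend's actual code, and the engine container names in the example are made up.

```python
def select_orphans(
    container_names,
    prefix="audiobook-",
    excluded=("audiobook-maker-backend",),
):
    """Return the names that a prefix-based cleanup would stop.

    Assumption: the backend excludes itself by exact name, which is why a
    backend container with any other audiobook-* name would be selected too.
    """
    return [
        name
        for name in container_names
        if name.startswith(prefix) and name not in excluded
    ]


# Hypothetical container names, for illustration only.
running = ["audiobook-maker-backend", "audiobook-xtts", "postgres", "audiobook-backend"]
print(select_orphans(running))
```

Here a backend started as `audiobook-backend` matches the prefix but not the exclusion list, so it would be treated as an orphan.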
- Start the Audiobook Maker desktop app
- Connect to backend: http://localhost:8765
- Go to Settings → Engines and install engines from the catalog
- Create a speaker and start creating audiobooks!
Engines are pulled automatically from the online catalog:
- Open Settings → Engines
- Browse available engines in the catalog
- Click Install to pull the Docker image
- Enable the engine and it starts automatically
See audiobook-maker-engines for the full list of available engines.
Run GPU-intensive engines on a dedicated server:
# On the remote GPU server
# Install Docker and NVIDIA Container Toolkit
curl -fsSL https://get.docker.com | sh
# Follow NVIDIA Container Toolkit installation guide
- Open Settings → Hosts
- Click Add Host
- Enter connection details:
  - Host Name: e.g., "GPU Server"
  - SSH URL: e.g., ssh://user@192.168.1.100
- Click Generate SSH Key
- Copy the displayed install command and run it on the remote host
- Click Test Connection to verify
- Click Save
- Go to Settings → Hosts
- Click on + for your remote host
- Install (GPU) engines (XTTS, Whisper, etc.)
- Engines run on the remote host, audio streams back to your machine
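The SSH URL entered when adding a host has the form `ssh://user@host`, optionally with a port. As a rough sketch of how such a URL can be validated and split into its parts, the standard library's `urlsplit` is sufficient. The function name and the returned dictionary layout are assumptions for illustration; port 22 is used as the default because it is the standard SSH port.

```python
from urllib.parse import urlsplit


def parse_ssh_url(url):
    """Split an ssh:// URL into user, host, and port (default 22)."""
    parts = urlsplit(url)
    if parts.scheme != "ssh" or not parts.hostname:
        raise ValueError(f"not a valid ssh:// URL: {url!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 22,
    }


print(parse_ssh_url("ssh://user@192.168.1.100"))
```

Checking the URL before saving the host gives a clearer error than a failed connection test would.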
- Create a Project - Click "+" in the sidebar
- Add Chapters - Organize your content
- Add Segments - Upload text or type manually
- Configure Voice - Select speaker and language
- Generate Audio - Click "Generate All"
- Export - Download as MP3/M4A/WAV
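The workflow above relies on the Project → Chapter → Segment hierarchy. A minimal sketch of that structure as Python dataclasses is shown below. All field names are assumptions chosen for illustration; the application's real data model may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    text: str
    speaker: Optional[str] = None      # hypothetical field: selected speaker
    language: Optional[str] = None     # hypothetical field: e.g. "en"
    audio_path: Optional[str] = None   # None until audio has been generated


@dataclass
class Chapter:
    title: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Project:
    title: str
    chapters: List[Chapter] = field(default_factory=list)

    def pending_segments(self):
        """Segments that still need audio, in reading order."""
        return [
            segment
            for chapter in self.chapters
            for segment in chapter.segments
            if segment.audio_path is None
        ]


book = Project(
    title="Example Book",
    chapters=[
        Chapter(
            title="Chapter 1",
            segments=[
                Segment("First sentence.", audio_path="seg-1.wav"),
                Segment("Second sentence."),
            ],
        )
    ],
)
print(len(book.pending_segments()))
```

A "Generate All" action can then be thought of as iterating over `pending_segments()` and queueing one generation task per segment.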
- Navigate to Speakers view (Ctrl+3)
- Click Add Speaker
- Upload 1-3 WAV samples (3-30 seconds each)
- Use the speaker in your segments
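The sample requirements above (1-3 WAV files, 3-30 seconds each) can be checked before upload. The sketch below uses the standard `wave` module, which reads uncompressed PCM WAV files only. The function names are hypothetical and the check is an illustration, not the application's own validation.

```python
import wave

MIN_SECONDS = 3.0
MAX_SECONDS = 30.0
MAX_SAMPLES = 3


def wav_duration(source):
    """Return the duration in seconds of a PCM WAV file (path or file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def validate_samples(sources):
    """Check sample count and per-sample duration; return the durations."""
    if not 1 <= len(sources) <= MAX_SAMPLES:
        raise ValueError(f"expected 1-{MAX_SAMPLES} samples, got {len(sources)}")
    durations = [wav_duration(source) for source in sources]
    for index, seconds in enumerate(durations):
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            raise ValueError(
                f"sample {index} is {seconds:.1f}s; "
                f"expected {MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s"
            )
    return durations
```

Calling `validate_samples(["voice1.wav", "voice2.wav"])` with your own file paths returns the durations or raises `ValueError` with the reason.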
- Generate audio for segments
- Click quality indicator or use Analyze Chapter
- Review transcription accuracy and audio metrics
- Re-generate segments with issues
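One simple way to think about "transcription accuracy" is to compare the original segment text with the text an STT engine heard. The sketch below computes a word-level similarity with `difflib` and flags segments below a threshold. The metric and the 0.9 threshold are assumptions for illustration; the scoring the application actually uses is not described here.

```python
import difflib
import re


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())


def transcript_similarity(expected, transcribed):
    """Word-level similarity in [0, 1]; 1.0 means the word sequences match."""
    a, b = normalize(expected), normalize(transcribed)
    if not a and not b:
        return 1.0
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def flag_segments(pairs, threshold=0.9):
    """Return indexes of (expected, transcribed) pairs scoring below threshold."""
    return [
        index
        for index, (expected, transcribed) in enumerate(pairs)
        if transcript_similarity(expected, transcribed) < threshold
    ]


pairs = [
    ("The quick brown fox.", "the quick brown fox"),
    ("The quick brown fox.", "the quick brown box"),
]
print(flag_segments(pairs))
```

Segments returned by `flag_segments` are the candidates for re-generation.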
- Navigate to Pronunciation view (Ctrl+4)
- Create rules for mispronounced words
- Rules are automatically applied during generation
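Pattern-based rules of this kind can be pictured as an ordered list of find-and-replace operations applied to the text before it reaches the TTS engine. The following sketch assumes a hypothetical rule shape (`pattern`, `replacement`, `is_regex`); the rule format used by the application may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str
    replacement: str
    is_regex: bool = False  # hypothetical flag: treat pattern as a regex


def apply_rules(text, rules):
    """Apply each rule in order and return the transformed text."""
    for rule in rules:
        if rule.is_regex:
            # Regex rules may use group references such as \1 in the replacement.
            text = re.sub(rule.pattern, rule.replacement, text)
        else:
            # Literal rules match whole words only and insert the replacement as-is.
            literal = r"\b" + re.escape(rule.pattern) + r"\b"
            text = re.sub(literal, lambda _m, r=rule.replacement: r, text)
    return text


rules = [
    Rule("SQL", "sequel"),
    Rule(r"(\d+)km", r"\1 kilometers", is_regex=True),
]
print(apply_rules("SQL over 5km", rules))
```

Rules run in list order, so a later rule sees the output of earlier ones.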
For contributors who want to develop locally without Docker:
Development Installation
- Node.js 18+ - Download
- Python 3.12+ - Download
- Rust 1.70+ - Install
- FFmpeg - Install Guide
cd backend
python -m venv venv
venv\Scripts\activate # Windows
source venv/bin/activate # Linux/Mac
pip install -r requirements.txt
Clone the engines repository:
git clone https://github.com/DigiJoe79/audiobook-maker-engines backend/engines
Set up individual engines:
cd backend/engines/tts/xtts
setup.bat # Windows
./setup.sh # Linux/Mac

cd frontend
npm install
npm run dev:tauri

audiobook-maker/
├── frontend/ # Tauri + React desktop app
│ ├── src/ # React components, hooks, stores
│ ├── src-tauri/ # Rust backend (Tauri)
│ └── e2e/ # Playwright E2E tests
│
├── backend/ # Python FastAPI backend
│ ├── api/ # REST endpoints
│ ├── core/ # Engine managers, Docker runner
│ ├── services/ # Business logic
│ └── Dockerfile # Backend container definition
│
└── .github/workflows/ # CI/CD for container builds
When the backend is running:
- Swagger UI: http://localhost:8765/docs
- ReDoc: http://localhost:8765/redoc
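Besides the interactive pages, FastAPI applications serve a machine-readable OpenAPI document at `/openapi.json` by default. Assuming the backend keeps that default, the sketch below fetches the document and lists the available operations. The function names are illustrative, and the endpoint paths in the sample specification are invented.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_spec(base_url="http://localhost:8765", timeout=5):
    """Download the OpenAPI document (assumes FastAPI's default URL)."""
    with urlopen(f"{base_url}/openapi.json", timeout=timeout) as response:
        return json.load(response)


def list_operations(spec):
    """Return 'METHOD /path' strings for every operation in the spec."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for method in sorted(item):
            if method.lower() in HTTP_METHODS:
                operations.append(f"{method.upper()} {path}")
    return operations


# Invented sample document, for illustration only.
sample_spec = {
    "paths": {
        "/example/items": {"get": {}, "post": {}},
        "/example/items/{id}": {"delete": {}, "parameters": []},
    }
}
print(list_operations(sample_spec))
```

With the backend running, `list_operations(fetch_spec())` gives a quick overview of the REST API without opening a browser.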
# Check logs
docker logs audiobook-maker-backend
# Verify port is available
docker ps -a | grep 8765
The backend cleans up orphaned engine containers on startup. If your container is named differently than audiobook-maker-backend, it may be stopped as an orphan. Always use the exact name audiobook-maker-backend.
# Verify NVIDIA Container Toolkit
nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
- Check engine logs in Monitoring → Activity
- Verify Docker has enough resources (memory, disk)
- For GPU engines, ensure NVIDIA Container Toolkit is installed
- Verify SSH key is in remote ~/.ssh/authorized_keys
- Check firewall allows SSH (port 22)
- Test manually: ssh user@host
- Tauri 2.9 - Desktop framework
- React 19 + TypeScript 5.9 - UI framework
- Material-UI 7 - Component library
- React Query 5 - Server state
- Zustand 5 - Local state
- Python 3.12 - Runtime
- FastAPI - Web framework
- SQLite 3 - Database
- Docker SDK - Container management
- TTS: XTTS v2, Chatterbox, VibeVoice
- STT: Whisper (5 model sizes)
- Text: spaCy (11 languages)
- Audio: Silero-VAD
This project is licensed under the MIT License - see the LICENSE file for details.
- Coqui TTS - XTTS v2 voice cloning engine
- Chatterbox - Expressive TTS by Resemble AI
- VibeVoice - Long-form multi-speaker TTS by Microsoft
- OpenAI Whisper - Speech recognition
- Silero VAD - Voice activity detection
- spaCy - NLP text segmentation
- Issues: GitHub Issues
Made with care by DigiJoe79





