A modern desktop application built with Tauri 2.0 for creating professional audiobooks using advanced text-to-speech and voice cloning technology (XTTS, Chatterbox, VibeVoice). Features drag & drop organization, multi-language support (17+ languages), smart text segmentation with NLP, and export to MP3/M4A/WAV formats.

DigiJoe79/AudioBook-Maker

Audiobook Maker

A modern desktop application for creating audiobooks with advanced text-to-speech and voice cloning capabilities

v1.1.1 - Hotfix for remote backend connectivity. See Release Notes.

v1.1.0 - Docker-based deployment, Remote GPU hosts, Engine variants. See Release Notes.

Overview

Audiobook Maker is a powerful Tauri 2.0 desktop application that transforms text into high-quality audiobooks using state-of-the-art text-to-speech technology. Built with a modern tech stack combining React, TypeScript, and Python FastAPI, it offers professional-grade features in an intuitive interface.

Key Features

  • Docker-Based Deployment - One-command setup with prebuilt containers for backend and engines
  • Remote GPU Hosts - Offload GPU-intensive engines to dedicated servers via SSH
  • Multi-Engine Architecture - 4 engine types (TTS, STT, Text Processing, Audio Analysis)
  • Engine Variants - Run engines locally (subprocess), in Docker, or on remote hosts
  • Voice Cloning - Create custom voices using XTTS, Chatterbox, or VibeVoice with speaker samples
  • Quality Assurance - Whisper-based transcription analysis and Silero-VAD audio quality detection
  • Pronunciation Rules - Pattern-based text transformation to fix mispronunciations
  • Project Organization - Hierarchical structure with Projects, Chapters, and Segments
  • Drag & Drop Interface - Intuitive content organization and reordering
  • Multi-Language Support - 17+ languages including English, German, Spanish, French, Chinese, Japanese
  • Multiple Export Formats - Export to MP3, M4A, or WAV with quality presets
  • Smart Text Segmentation - Automatic text splitting using spaCy NLP engine
  • Real-Time Updates - Server-Sent Events for instant UI feedback
  • Job Management - Database-backed queue, resume cancelled jobs, track progress
  • Markdown or EPUB Import - Import entire projects from structured files
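As a rough illustration of the Smart Text Segmentation feature, the sketch below groups whole sentences into segments that stay under a character limit. It is a simplified stand-in, not the app's code: the app uses spaCy for sentence detection, while this sketch uses a naive regular expression, and the names `split_into_segments` and `max_chars` are hypothetical.

```python
import re


def split_into_segments(text: str, max_chars: int = 250) -> list[str]:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
    # (spaCy's sentence detection is considerably more robust than this.)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    segments: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


text = "Call me Ishmael. Some years ago I went to sea. It was cold."
print(split_into_segments(text, max_chars=40))
```

Splitting only at sentence boundaries keeps each segment a natural unit for the TTS engine, at the cost of uneven segment lengths.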

Sample audio

Moby Dick Sample (Chatterbox)

Moby Dick Sample (VibeVoice)

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Audiobook Maker Desktop App                   │
│                     (Tauri + React Frontend)                     │
└───────────────────────────┬─────────────────────────────────────┘
                            │ HTTP/REST API + SSE
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                 Backend Container (Port 8765)                    │
│              ghcr.io/digijoe79/audiobook-maker/backend           │
├─────────────────────────────────────────────────────────────────┤
│  FastAPI │ SQLite │ TTS/Quality Workers │ Engine Managers        │
│          │        │                     │ (Docker Runner)        │
└───────────────────────────┬─────────────────────────────────────┘
                            │ Docker API
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Local Docker │   │  Local Docker │   │ Remote Docker │
│    Engines    │   │    Engines    │   │  Host (GPU)   │
│ xtts, spacy   │   │whisper,silero │   │ xtts,whisper  │
└───────────────┘   └───────────────┘   └───────────────┘

Key Architecture Features:

  • Backend and engines run as Docker containers
  • GPU engines can run on remote hosts via SSH tunnel
  • Automatic engine discovery from online catalog
  • Engine enable/disable with auto-stop after inactivity
  • Real-time updates via Server-Sent Events (SSE)
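To make the Server-Sent Events channel more concrete, here is a minimal sketch of how a client could turn the lines of a `text/event-stream` response into events. The event name `job.progress` and the JSON payload are assumptions for illustration; this README does not document the backend's actual event schema, and the parser is a simplification of the full SSE rules.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield one dict per event from the lines of a text/event-stream body."""
    event: dict[str, str] = {}
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data or event:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "data":
                data.append(value)  # multiple data lines are joined with "\n"
            else:
                event[field] = value


# Hypothetical stream contents, for illustration only.
stream = ["event: job.progress\n", 'data: {"done": 3}\n', "\n"]
for evt in parse_sse(stream):
    print(evt)
```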

Quick Start

Prerequisites

Requirement               Purpose                  Installation
Docker Desktop            Run backend and engines  Download
NVIDIA Container Toolkit  GPU support (optional)   Install Guide

Note: For GPU-accelerated TTS (XTTS, Chatterbox, Whisper), you need an NVIDIA GPU with CUDA support and the NVIDIA Container Toolkit installed.

Installation

1. Download the Desktop App

Download the latest Windows release from GitHub Releases:

  • Windows: Audiobook-Maker_1.1.1_x64-setup.exe

Linux/macOS: No prebuilt binaries available. See Development Setup to build from source.

2. Pull the Backend Container

docker pull ghcr.io/digijoe79/audiobook-maker/backend:latest

3. Start the Backend

docker run -d \
  --name audiobook-maker-backend \
  -p 8765:8765 \
  --add-host=host.docker.internal:host-gateway \
  -e DOCKER_ENGINE_HOST=host.docker.internal \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v audiobook-data:/app/data \
  -v audiobook-media:/app/media \
  ghcr.io/digijoe79/audiobook-maker/backend:latest

Important: The container must be named audiobook-maker-backend. On startup, the backend cleans up orphaned engine containers (prefix audiobook-) from previous sessions. Containers matching this prefix are stopped unless explicitly excluded by name.
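The naming requirement is easier to see with a small sketch of the selection rule described above. This is an assumption about the logic, not the backend's actual code: the function name `containers_to_stop` and the engine container names in the example are hypothetical.

```python
def containers_to_stop(
    names: list[str],
    prefix: str = "audiobook-",
    keep: tuple[str, ...] = ("audiobook-maker-backend",),
) -> list[str]:
    """Return the container names that a prefix-based cleanup would stop."""
    # Anything with the prefix is treated as an orphan unless excluded by name.
    return [n for n in names if n.startswith(prefix) and n not in keep]


running = ["audiobook-maker-backend", "audiobook-xtts", "postgres", "audiobook-backend"]
print(containers_to_stop(running))
```

In this sketch a backend started as `audiobook-backend` matches the prefix but not the exclusion list, so it would be stopped along with the real orphans.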

4. Launch the App

  1. Start the Audiobook Maker desktop app
  2. Connect to backend: http://localhost:8765
  3. Go to Settings → Engines and install engines from the catalog
  4. Create a speaker and start creating audiobooks!

Installing Engines

Engines are pulled automatically from the online catalog:

  1. Open Settings → Engines
  2. Browse available engines in the catalog
  3. Click Install to pull the Docker image
  4. Enable the engine and it starts automatically

See audiobook-maker-engines for the full list of available engines.

GPU Offloading to Remote Hosts

Run GPU-intensive engines on a dedicated server:

1. Prepare the Remote Host

# On the remote GPU server
# Install Docker and NVIDIA Container Toolkit
curl -fsSL https://get.docker.com | sh
# Follow NVIDIA Container Toolkit installation guide

2. Add Host in Audiobook Maker

  1. Open Settings → Hosts
  2. Click Add Host
  3. Enter connection details:
    • Host Name: e.g., "GPU Server"
    • SSH URL: e.g., ssh://user@192.168.1.100
  4. Click Generate SSH Key
  5. Copy the displayed install command and run it on the remote host
  6. Click Test Connection to verify
  7. Click Save
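If Test Connection fails, it can help to check that the SSH URL has the expected shape. The sketch below splits a URL like the one in step 3 into its parts with Python's standard library. The helper name `parse_ssh_url` is hypothetical, and the default of port 22 when none is given is an assumption; whether the app accepts an explicit port in the URL is not stated here.

```python
from urllib.parse import urlparse


def parse_ssh_url(url: str) -> dict:
    """Split an ssh:// URL into user, host and port (assumed default: 22)."""
    parsed = urlparse(url)
    if parsed.scheme != "ssh" or not parsed.hostname:
        raise ValueError(f"not a valid ssh:// URL: {url!r}")
    return {
        "user": parsed.username,
        "host": parsed.hostname,
        "port": parsed.port or 22,
    }


print(parse_ssh_url("ssh://user@192.168.1.100"))
```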

3. Install Engines on Remote Host

  1. Go to Settings → Hosts
  2. Click on + for your remote host
  3. Install GPU engines (XTTS, Whisper, etc.)
  4. Engines run on the remote host, and audio streams back to your machine

Usage Guide

Creating Your First Audiobook

  1. Create a Project - Click "+" in the sidebar
  2. Add Chapters - Organize your content
  3. Add Segments - Upload text or type manually
  4. Configure Voice - Select speaker and language
  5. Generate Audio - Click "Generate All"
  6. Export - Download as MP3/M4A/WAV

Voice Cloning

  1. Navigate to Speakers view (Ctrl+3)
  2. Click Add Speaker
  3. Upload 1-3 WAV samples (3-30 seconds each)
  4. Use the speaker in your segments
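Before uploading, the sample constraints from step 3 (1-3 WAV files, 3-30 seconds each) can be checked locally. The following is a minimal sketch using Python's standard `wave` module; the function names are hypothetical and the app may apply additional checks that are not described here.

```python
import wave


def wav_duration_seconds(path_or_file) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_samples(durations: list[float], min_s: float = 3.0,
                  max_s: float = 30.0, max_count: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the samples look fine."""
    problems = []
    if not 1 <= len(durations) <= max_count:
        problems.append(f"expected 1-{max_count} samples, got {len(durations)}")
    for i, seconds in enumerate(durations, start=1):
        if not min_s <= seconds <= max_s:
            problems.append(f"sample {i} is {seconds:.1f}s, expected {min_s:g}-{max_s:g}s")
    return problems
```

A typical call would be `check_samples([wav_duration_seconds(p) for p in paths])`, where `paths` is your own list of sample files.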

Quality Analysis

  1. Generate audio for segments
  2. Click quality indicator or use Analyze Chapter
  3. Review transcription accuracy and audio metrics
  4. Re-generate segments with issues
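One common way to express transcription accuracy as a number is the word error rate (WER) between the segment text and the transcription. This README does not say which metric the app reports, so the sketch below is only an illustration of the idea, with a hypothetical function name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from transcription
                            curr[j - 1] + 1,      # extra word in transcription
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("a b c d", "a b d"))  # one of four words missing
```

A segment whose transcription differs strongly from its text (high WER) is a candidate for re-generation.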

Pronunciation Rules

  1. Navigate to Pronunciation view (Ctrl+4)
  2. Create rules for mispronounced words
  3. Rules are automatically applied during generation
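Conceptually, a pattern-based rule is a search pattern plus a replacement that is applied to the text before it reaches the TTS engine. The sketch below shows that idea with regular expressions; the rule format and the example rules are assumptions for illustration, not the app's own rule syntax.

```python
import re

# Hypothetical rules: (regular expression, spoken replacement).
RULES = [
    (r"\bDr\.", "Doctor"),
    (r"\bGUI\b", "gooey"),
]


def apply_rules(text: str, rules: list[tuple[str, str]]) -> str:
    """Apply each rule in order; later rules see the output of earlier ones."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_rules("Dr. Smith opened the GUI.", RULES))
```

Because rules run in order, a replacement introduced by one rule can be matched by a later one, so rule order matters when patterns overlap.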

Development Setup

For contributors who want to develop locally without Docker:

Development Installation

Prerequisites

Backend Setup

cd backend
python -m venv venv
# Activate the virtual environment (run only the line that matches your OS)
venv\Scripts\activate      # Windows
source venv/bin/activate   # Linux/Mac
pip install -r requirements.txt

Engine Setup (Subprocess Mode)

Clone the engines repository:

git clone https://github.com/DigiJoe79/audiobook-maker-engines backend/engines

Set up individual engines:

cd backend/engines/tts/xtts
# Run only the setup script that matches your OS
setup.bat   # Windows
./setup.sh  # Linux/Mac

Frontend Setup

cd frontend
npm install
npm run dev:tauri

Project Structure

audiobook-maker/
├── frontend/                 # Tauri + React desktop app
│   ├── src/                  # React components, hooks, stores
│   ├── src-tauri/            # Rust backend (Tauri)
│   └── e2e/                  # Playwright E2E tests
│
├── backend/                  # Python FastAPI backend
│   ├── api/                  # REST endpoints
│   ├── core/                 # Engine managers, Docker runner
│   ├── services/             # Business logic
│   └── Dockerfile            # Backend container definition
│
└── .github/workflows/        # CI/CD for container builds

API Documentation

When the backend is running:

Troubleshooting

Backend container won't start

# Check logs
docker logs audiobook-maker-backend

# Check whether another container already publishes port 8765
docker ps -a | grep 8765
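The `docker ps` check only sees containers. As a complementary sketch, the snippet below tests whether any process on the local machine already holds the port by trying to bind to it. The helper name `port_is_free` is hypothetical, and checking `127.0.0.1` only is a simplifying assumption.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # something else is already using the port
        return True


print("port 8765 free:", port_is_free(8765))
```

Run this while the backend container is stopped; if it reports the port as taken, another process is using 8765.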

Backend container stops immediately

The backend cleans up orphaned engine containers on startup. If your container is named differently than audiobook-maker-backend, it may be stopped as an orphan. Always use the exact name audiobook-maker-backend.

GPU not detected in containers

# Verify NVIDIA Container Toolkit
nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Engine fails to start

  • Check engine logs in Monitoring → Activity
  • Verify Docker has enough resources (memory, disk)
  • For GPU engines, ensure NVIDIA Container Toolkit is installed

Remote host connection fails

  • Verify SSH key is in remote ~/.ssh/authorized_keys
  • Check firewall allows SSH (port 22)
  • Test manually: ssh user@host

Tech Stack

Frontend

  • Tauri 2.9 - Desktop framework
  • React 19 + TypeScript 5.9 - UI framework
  • Material-UI 7 - Component library
  • React Query 5 - Server state
  • Zustand 5 - Local state

Backend

  • Python 3.12 - Runtime
  • FastAPI - Web framework
  • SQLite 3 - Database
  • Docker SDK - Container management

Engines

  • TTS: XTTS v2, Chatterbox, VibeVoice
  • STT: Whisper (5 model sizes)
  • Text: spaCy (11 languages)
  • Audio: Silero-VAD

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

TTS Engines

Analysis Engines

Frameworks

  • Tauri - Desktop app framework
  • FastAPI - Python web framework

Support


Made with care by DigiJoe79
