Skip to content

Voice Operation and Design Engine with Reproduction capabilities

License

Notifications You must be signed in to change notification settings

HAKORADev/VODER

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

374 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VODER - Voice Blender

VODER Logo

VODER is a Local, Free, Offline, professional-grade voice processing and transformation tool that enables seamless conversion between speech, text, and music. Built for creators, developers, and audio professionals, VODER delivers high-quality synthesis, voice cloning, and music generation capabilities through an intuitive interface.

🚀 Ready in Colab: Open VODER in Google Colab

🤖 For AI agents and automated tools: See Bots.md


Quick Start

Run from Source

# Clone the repository
git clone https://github.com/HAKORADev/VODER.git
cd VODER

# Install dependencies
pip install -r requirements.txt

# IMPORTANT: After installing requirements, upgrade protobuf to avoid compatibility issues
pip install --upgrade protobuf==5.29.6

# Launch GUI
python src/voder.py

# Or use CLI mode
python src/voder.py cli

Run in Google Colab

Open In Colab

Open the link, connect to a runtime, and press Run All (or run cells one by one until the last one). Once execution completes, VODER is ready to use directly in your browser — no installation required.

Installation Requirements

# Install FFmpeg (required for audio processing)
# Windows: winget install FFmpeg
# macOS: brew install ffmpeg
# Linux: sudo apt install ffmpeg

Core Capabilities

🎤 6 Processing Modes

VODER offers six distinct voice processing modes, each designed for specific audio transformation needs:

Mode Description Input Output
STT+TTS Speech-to-Text then Text-to-Speech Audio Audio
TTS Text-to-Speech with Voice Design Text Audio
TTS+VC Text-to-Speech + Voice Cloning Text + Reference Audio
STS Speech-to-Speech (Voice Conversion) Audio + Reference Audio
TTM Text-to-Music Generation Text Audio
TTM+VC Text-to-Music + Voice Conversion Text + Reference Audio

🎭 Dialogue System

VODER features a powerful row-based dialogue editor designed for creating multi-speaker audio content such as podcasts, AI news broadcasts, audiobooks, and conversational content. This system enables script-based generation where multiple characters speak with distinct voices in a cohesive narrative flow.

GUI Dialogue Input:

  • Each line is a separate row with Character and Dialogue fields.
  • New rows are added automatically when you fill the last row.
  • First row has no delete button; subsequent rows can be deleted individually.
  • Voice prompts or audio assignments appear dynamically for every character found in the script.

Optional Background Music:

  • When generating dialogue (TTS or TTS+VC mode), VODER can automatically add ambient background music that matches the length of the spoken audio.
  • A clean dialog appears before processing, asking: “Enter music description (or press Skip):”
  • If a description is provided (e.g., "soft piano, cinematic strings"), VODER:
    • Generates music via ACE-Step using the description as style prompt and "..." as empty lyrics.
    • Automatically fits the music duration to the exact length of the dialogue.
    • Mixes the music at 35% volume relative to the dialogue.
    • Cleans up temporary files and saves the final result with an _m suffix (e.g., voder_tts_dialogue_..._m.wav).
  • If the user skips, processing proceeds normally without music.

This feature is available in both GUI and CLI modes (interactive and one‑line). It is only triggered for dialogue scripts (i.e., more than one line, or a single line containing a colon).

Example Script (conceptual):

James: Welcome to our podcast! Today we'll explore AI advances.
Sarah: Thanks James! I'm excited to discuss my latest research.
James: Let's dive in. First, tell us about neural networks.

Key Features:

  • Multi-character script support with real-time character extraction
  • Individual voice prompts for each character (TTS mode)
  • Reference audio assignment per character via dropdown numbers (TTS+VC mode)
  • Optional background music – automatically generated, duration‑fitted, volume‑controlled
  • Automatic audio concatenation with proper pacing
  • Ideal for podcasts, news segments, interviews, and storytelling

The dialogue system is available in both TTS (Voice Design) and TTS+VC (Voice Cloning) modes, allowing you to create voices either through descriptive prompts or by cloning from real audio samples.


🔧 AI Model Integration

VODER leverages state-of-the-art open-source models for professional-grade audio processing:


Usage Guide

GUI Mode

  1. Launch: python src/voder.py
  2. Select mode from dropdown (6 available modes)
  3. Load input files based on mode:
    • STT+TTS: Load base audio (content), then load target audio (voice)
    • TTS: Enter dialogue row‑by‑row in the script area, and fill the automatically generated voice prompts for each character
      Optional: Before generation, a dialog will ask if you want background music; enter a description or press Skip.
    • TTS+VC: Enter dialogue rows, load voice reference audio files (each assigned a number), then assign each character an audio number via dropdown
      Optional: The same background music dialog appears before generation.
    • STS: Load base audio and target voice audio
    • TTM: Enter lyrics and style prompt
    • TTM+VC: Enter lyrics, style prompt, and load target voice audio
  4. Click "Generate" (TTS/TTS+VC/TTM/TTM+VC) or "Patch" (STT+TTS/STS)
  5. Listen to output and save results

CLI Mode (Interactive)

python src/voder.py cli

The interactive CLI now supports full dialogue creation:

  • Enter multiple lines (empty line to finish).
  • Lines without a colon → single mode (one text, one voice prompt/audio).
  • Lines with colon (Character: text) → dialogue mode.
  • VODER will ask for a voice prompt (TTS) or audio file path (TTS+VC) for each character, in order.
  • After collecting all voice prompts/assignments, you will be asked:
    Add background music? (y/N):
    If you answer y or yes, you can enter a music description.
    Leaving the description blank or entering empty skips the music.

One-Line Commands

One‑line commands now support dialogue mode through multiple values per parameter, as well as the optional music parameter for background music.

Single mode examples:

python src/voder.py tts script "Hello world" voice "female, cheerful"
python src/voder.py tts+vc script "Hello" target "voice.wav"
python src/voder.py sts base "input.wav" target "voice.wav"
python src/voder.py ttm lyrics "Verse 1: ..." styling "upbeat pop" 30
python src/voder.py ttm+vc lyrics "..." styling "pop" duration 30 target "voice.wav"

Dialogue mode examples (TTS):

python src/voder.py tts script "James: Welcome to the show!" "Sarah: Glad to be here." voice "James: deep male voice, authoritative" "Sarah: bright female voice, energetic"
python src/voder.py tts script "James: Welcome to the show!" "Sarah: Glad to be here." voice "James: deep male voice, authoritative" "Sarah: bright female voice, energetic" music "soft piano, cinematic"

Dialogue mode examples (TTS+VC):

python src/voder.py tts+vc script "James: Let's start with AI." "Sarah: I've been working on this for years." target "James: /path/to/james_voice.wav" "Sarah: /path/to/sarah_voice.wav"
python src/voder.py tts+vc script "James: Let's start with AI." "Sarah: I've been working on this for years." target "James: /path/to/james_voice.wav" "Sarah: /path/to/sarah_voice.wav" music "ambient electronic, chill"

Note: STT+TTS mode is not available in one-line CLI because it requires interactive text editing. If the music parameter is supplied in single‑mode (plain text without colon), it is ignored with a warning.


System Requirements

Minimum Requirements

Component Specification
CPU 4-6 cores
RAM 12GB+ system memory
GPU (CUDA) Optional (CPU-only operation supported)
VRAM 4GB minimum (6GB recommended, 16GB for best performance)
Storage SSD recommended

Note: VODER runs entirely on CPU. No GPU is required for any mode. However, having a GPU with sufficient VRAM can significantly improve processing speed for certain modes.

Recommended Requirements

VODER is designed to maximize output quality rather than speed. Meeting the minimum requirements ensures reliable operation — the focus is on achieving professional-grade audio results, not processing benchmarks. More RAM allows for longer audio generation and more complex workflows.


Technical Highlights

  • Unified Audio Pipeline: Six processing modes in a single interface eliminates the need for multiple tools
  • Intelligent Dialogue Editor: Row‑based script input with automatic character tracking and per‑voice assignment
  • State-of-the-Art Models: Production-quality models from leading AI research organizations
  • Voice Cloning: Extract and replicate voice characteristics from reference audio samples
  • Music Generation: Lyrics-to-music synthesis with style control and voice conversion
  • Cross-Modal Transformation: Speech-to-speech, text-to-speech, and speech-to-text conversions
  • Memory Optimisation: TTM+VC pipeline now releases GPU memory between stages to reduce VRAM usage
  • Background Music for Dialogue: Automatically generated, duration‑fitted, volume‑controlled ambient music – a unique enhancement for narrated content

Documentation

  • Guide.md — Detailed usage guide, technical implementation, and creative techniques
  • CHANGELOG.md — Development history and version changes
  • Bots.md — Guidelines for AI agents and automated systems

Version Information

Note: VODER does not maintain PyPI packages or pre-built binaries. Running from source ensures access to the most recent features and improvements.


Contributing

VODER is open-source (MIT License) and welcomes contributions:

  • New voice processing modes
  • Additional model integrations
  • UI/UX improvements
  • Performance optimizations
  • Documentation and translations
  • Bug reports and feature requests

Please submit pull requests or issues via GitHub.


License

MIT License — See LICENSE for full details.


Acknowledgments

Built with appreciation for the open-source AI voice synthesis community and the amazing models that power VODER.

About

Voice Operation and Design Engine with Reproduction capabilities

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages