Stars
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Long-form streaming TTS system for multi-speaker dialogue generation
A ComfyUI custom node integration for multi-engine multi-language Text-to-Speech and Voice Conversion. Supports: RVC, IndexTTS-2, Chatterbox (classic and multilingual 23-lang), F5-TTS, Higgs Audio …
A single-layer, streaming codec model providing SOTA audio quality and discrete tokens designed for superior downstream modelability.
Fair-code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations.
A Unified Framework for Expressive Speech Synthesis with Voice Cloning
Official Repository of Paper: "Emilia-NV: A Non-Verbal Speech Dataset with Word-Level Annotation for Human-Like Speech Modeling"
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Ultra-low bitrate speech codec (0.27-1 kbps) with cross-modal alignment and real-time capabilities
JimmyMa99 / train-higgs-audio
Forked from boson-ai/higgs-audio
Text-audio foundation model from Boson AI
This is the official implementation of ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
Text-audio foundation model from Boson AI
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
A Foundation Model for Industrial Signal Comprehensive Representation
A toolkit for processing speech data and creating speech datasets
PyTorch implementation of MeanFlow on ImageNet and CIFAR10
SoftVC VITS Singing Voice Conversion
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.
STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation
[NeurIPS 2025] PyTorch implementation of [ThinkSound], a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) reasoning.
This is the code for the paper "XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs". Demos, technical insights and experimental results are presented on
A cross-platform bilibili toolbox that supports downloading videos, anime series, and other kinds of resources.
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization (MELLE)"