Stars
Zero-shot voice conversion and singing voice conversion, with real-time support
No fortress, purely open ground. OpenManus is Coming.
Wan: Open and Advanced Large-Scale Video Generative Models
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS …
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
An Open-Sourced LLM-empowered Foundation TTS System
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
A small seq2seq punctuator tool based on DistilBERT
This is the official repo of our work titled "The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio".
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
Official code of the paper: Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis.
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
Chinese Mandarin Grapheme-to-Phoneme Converter. Converts Chinese text to Zhuyin or Pinyin (INTERSPEECH 2022)
Real-time face swap and one-click video deepfake with only a single image
Multilingual large voice generation model with full-stack support for inference, training, and deployment.
Multilingual Voice Understanding Model
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
A feature-rich command-line audio/video downloader
Simple text-to-phones converter for multiple languages
📖 A curated list of resources dedicated to talking face.