<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/Qwen3_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Qwen3-TTS Colab

## 📄 Description

This Colab notebook runs **Qwen3-TTS-12Hz-0.6B-Base** and **Qwen3-TTS-12Hz-1.7B-Base**, powerful **multilingual, low-latency text-to-speech (TTS)** models from the **Qwen3 TTS family**.
Designed with a **universal end-to-end architecture**, Qwen3-TTS delivers **high-fidelity**, **instruction-controllable**, and **real-time streaming** speech synthesis with strong robustness to noisy or complex text inputs.

**Capabilities:**
Multilingual TTS (10 languages), ultra-low-latency streaming (≈97 ms), instruction-based voice control, rapid 3-second voice cloning, high-fidelity speech reconstruction

---

## How to use

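* Set `MODEL_ID` and `LANGUAGE`, then edit `ref_text` (the transcript of your reference audio) and `input_text` (the text to speak)
* Run the cells in order, upload a reference audio clip when prompted, and generate speech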

---

## ⚙️ Model Highlights

* 🌍 **10-language support** – Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
* ⚡ **Extreme low-latency generation** – first audio packet emitted after a single character input
* 🧠 **Instruction-aware speech synthesis** – adaptive control over tone, emotion, prosody, and speaking rate
* 🧬 **3-second rapid voice cloning** – supported by both Base models
* 🏗 **End-to-end discrete LM architecture** – avoids cascading errors of traditional TTS pipelines

---

## 🧠 Model Details

* **Models Included:** Qwen3-TTS-12Hz-0.6B-Base, Qwen3-TTS-12Hz-1.7B-Base
* **Speech Tokenizer:** Qwen3-TTS-Tokenizer-12Hz
* **Architecture:** Discrete multi-codebook LM (non-DiT)
* **Streaming Support:** Yes (streaming & non-streaming in one model)
* **Latency:** As low as ~97 ms end-to-end
* **Use Cases:** Real-time assistants, voice agents, multilingual narration, TTS fine-tuning

---

## 🔗 Resources

* **Hugging Face (1.7B):** https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base  
* **Hugging Face (0.6B):** https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base  
* **Official Blog:** https://qwen.ai/blog

---

## 🎙️ Explore More TTS Models

Looking for more cutting-edge voice models?
👉 Check out the full collection: [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)


## TTS/Voice Generation with Voice Cloning

In [None]:
!pip -q install -U qwen-tts soundfile

# (Optional, recommended on GPU) FlashAttention 2 for lower memory + faster attention.
# If this fails (GPU not compatible / build issues), the notebook will still run without it.
try:
    import flash_attn  # noqa: F401
    print("flash-attn already installed.")
except Exception:
    !pip -q install -U flash-attn --no-build-isolation

In [None]:
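import torch
from qwen_tts import Qwen3TTSModel

MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"  # or "Qwen/Qwen3-TTS-12Hz-1.7B-Base"
LANGUAGE = "English"  # e.g., Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

# Prefer the GPU with bfloat16; fall back to CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# Request FlashAttention 2 only if the optional install above actually succeeded.
# Otherwise pass None, as the CPU path already does, so loading does not fail.
try:
    import flash_attn  # noqa: F401
    has_flash_attn = True
except Exception:
    has_flash_attn = False

attn_impl = "flash_attention_2" if (torch.cuda.is_available() and has_flash_attn) else None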

model = Qwen3TTSModel.from_pretrained(
    MODEL_ID,
    device_map=device,
    dtype=dtype,
    attn_implementation=attn_impl,
)

print(f"Loaded {MODEL_ID} on {device} with dtype={dtype} attn={attn_impl}")
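
`LANGUAGE` is a free-form string, so a typo such as `"english "` or `"Engish"` may only surface later, at generation time. The sketch below normalizes the value against the ten language names listed in this notebook. The `SUPPORTED_LANGUAGES` tuple and the `normalize_language` helper are hypothetical names introduced for illustration; they are not part of the `qwen-tts` package, and the library may accept spellings not covered here.

```python
# Names copied from the language list in this notebook; this is an assumption,
# not an authoritative list exported by the qwen-tts package.
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)


def normalize_language(name: str) -> str:
    """Return the canonical spelling of `name`, or raise ValueError if unknown."""
    cleaned = name.strip().casefold()
    for language in SUPPORTED_LANGUAGES:
        if language.casefold() == cleaned:
            return language
    raise ValueError(
        f"Unsupported language {name!r}; expected one of: "
        + ", ".join(SUPPORTED_LANGUAGES)
    )
```

Calling `LANGUAGE = normalize_language(LANGUAGE)` right after the configuration lines would catch a misspelled name before any audio is generated.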


In [None]:
from google.colab import files

uploaded = files.upload()  # upload a .wav/.mp3/.flac etc.
ref_audio_path = next(iter(uploaded.keys()))
print("Reference audio:", ref_audio_path)

# Transcript of the uploaded reference audio. The quote below is only a placeholder:
# replace it with the words actually spoken in your clip, matching them as closely
# as possible, for best cloning quality.
ref_text = "I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character."
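
Before cloning, it can help to confirm that the uploaded clip is long enough. The sketch below reads the file header with `soundfile.info` and reports anything that looks off. The 3-second threshold is taken from the "3-second voice cloning" headline figure above and is not a documented hard requirement. The multi-channel warning is likewise only a prompt to double-check, since the library may handle stereo input fine. Both helper names are hypothetical.

```python
def check_reference_clip(duration_s: float, channels: int, min_seconds: float = 3.0) -> list[str]:
    """Return human-readable warnings about a reference clip (empty list means no concerns)."""
    warnings = []
    if duration_s < min_seconds:
        # Threshold is an assumption based on the advertised 3-second cloning figure.
        warnings.append(f"clip is {duration_s:.1f} s; at least {min_seconds:.0f} s is suggested")
    if channels > 1:
        # Assumption: a single-speaker mono clip is the least surprising input.
        warnings.append(f"clip has {channels} channels; consider a mono recording")
    return warnings


def inspect_reference_file(path: str) -> list[str]:
    """Read the audio file header with soundfile and run the checks above."""
    import soundfile as sf  # imported lazily so the pure check above has no dependencies

    info = sf.info(path)  # header metadata, including .duration (seconds) and .channels
    return check_reference_clip(info.duration, info.channels)
```

In this notebook it could be used as `for w in inspect_reference_file(ref_audio_path): print("Warning:", w)`. Whether `soundfile` can read a given MP3 depends on the installed libsndfile version, so WAV or FLAC is the safer choice for this check.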

In [None]:
import soundfile as sf

input_text = "This is a cloned voice demo generated with Qwen3-TTS in Google Colab."

out_clone_path = "output_voice_clone.wav"

wavs, sr = model.generate_voice_clone(
    text=input_text,
    language=LANGUAGE,
    ref_audio=ref_audio_path,
    ref_text=ref_text,
    # If you don't want to provide ref_text, set x_vector_only_mode=True (may reduce quality):
    # x_vector_only_mode=True,
)

sf.write(out_clone_path, wavs[0], sr)
print("Saved:", out_clone_path, "| sr:", sr)
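
The cell above synthesizes one short sentence. For longer passages, one common approach is to split the text at sentence boundaries, generate each chunk separately, and join the audio. The sketch below does this using only the `generate_voice_clone` call already shown in this notebook. The helper names and the 200-character default are illustrative assumptions, and whether chunking is needed at all for Qwen3-TTS is not established here. It also assumes `wavs[0]` is a 1-D array of samples, which is consistent with how it is passed to `sf.write` above.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    Sentences are rejoined with a space, which also applies to CJK text.
    """
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text.strip())
    sentences = [p.strip() for p in parts if p and p.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(model, text, language, ref_audio, ref_text, max_chars=200):
    """Generate each chunk with the notebook's voice-clone call and join the audio."""
    import numpy as np  # lazy import: only needed when audio is actually generated

    pieces, sample_rate = [], None
    for chunk in split_into_chunks(text, max_chars):
        wavs, sample_rate = model.generate_voice_clone(
            text=chunk, language=language, ref_audio=ref_audio, ref_text=ref_text,
        )
        pieces.append(np.asarray(wavs[0]))  # assumes a 1-D sample array
    if not pieces:
        raise ValueError("text is empty")
    return np.concatenate(pieces), sample_rate
```

With the variables defined in this notebook, `audio, sr = synthesize_long_text(model, long_text, LANGUAGE, ref_audio_path, ref_text)` followed by `sf.write("long_output.wav", audio, sr)` would save the joined result, where `long_text` is your own passage. The simple regex will also split after abbreviations such as "Dr. ", so adjust it for text where that matters.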

In [None]:
from IPython.display import Audio, display

display(Audio(out_clone_path))