<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/chatterbox_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Chatterbox TTS Google Colab

## 📄 Description  
This Colab notebook uses Chatterbox TTS to generate speech from text. It supports emotion exaggeration control, voice cloning, and watermarked outputs.

**Capabilities**: Text-to-speech, Emotion Exaggeration Control, Voice Cloning, Watermarked Outputs

---

## How to use

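- Edit the `text` variable in the generation cell to set the text you want spoken
- Run all cells in the section you need, restarting the session after the install cell as its comment instructs
- For the voice cloning section, follow the instructions to upload a reference audio file
- The generated audio is saved to `output.wav` (`voice_clone_output.wav` in the voice cloning section)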

---

## 🔗 Resources

- **GitHub Repository:** https://github.com/resemble-ai/chatterbox
- **Model Availability:** https://huggingface.co/ResembleAI/chatterbox

---

## Special note

- Every audio file generated by Chatterbox includes [Resemble AI's Perth (Perceptual Threshold) Watermarker](https://github.com/resemble-ai/perth): imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
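
To check a generated file for the watermark, a minimal sketch along the lines of the example in the Chatterbox README might look like the following. The helper name `extract_watermark` is hypothetical, and the `perth.PerthImplicitWatermarker` and `get_watermark` calls are taken from that README as best recalled, so treat them as assumptions and verify them against the installed `resemble-perth` version.

```python
def extract_watermark(audio_path):
    """Return the watermark value Perth extracts from an audio file.

    Hypothetical helper; the perth calls below follow the Chatterbox README
    and should be checked against the installed resemble-perth release.
    """
    # Imported lazily so the function can be defined before the install cell runs.
    import librosa
    import perth

    # sr=None keeps the file's native sample rate instead of resampling.
    audio, sr = librosa.load(audio_path, sr=None)
    watermarker = perth.PerthImplicitWatermarker()
    return watermarker.get_watermark(audio, sample_rate=sr)


# Usage, after a generation cell has written output.wav:
# print(extract_watermark("output.wav"))
```

The Chatterbox README describes the extracted value as 0.0 for audio without a watermark and 1.0 for watermarked audio.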

---

## 🎙️ Explore More TTS Models  
Want to try out additional TTS models? Check out the curated collection here:  
👉 [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)


## General text-to-speech

- General TTS: `exaggeration=0.5` and `cfg_weight=0.5` work well for most prompts.
- Expressive or dramatic speech: try a lower `cfg_weight` (around 0.3) and raise `exaggeration` to around 0.7 or higher. Higher exaggeration tends to speed up speech; reducing `cfg_weight` compensates with slower, more deliberate pacing.
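
One way to keep these two recommendations handy is a small preset table. This is only a sketch: the preset names `general` and `expressive` and the helper `params_for` are hypothetical, and the values are simply the ones suggested in the tips above.

```python
# Hypothetical presets; the values come from the tips above.
PRESETS = {
    "general": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
}


def params_for(style):
    """Return a copy of the keyword arguments for the named style."""
    if style not in PRESETS:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}")
    return dict(PRESETS[style])


print(params_for("expressive"))
```

With a loaded model, the result can be unpacked into the generation call, for example `model.generate(text, **params_for("expressive"))`.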

In [None]:
!pip uninstall -y numpy scipy librosa resemble-perth chatterbox-tts
!pip install --no-cache-dir numpy==1.26.4
!pip install --no-cache-dir \
    resampy==0.4.3 \
    librosa==0.11.0 \
    s3tokenizer \
    torch==2.6.0 \
    torchaudio==2.6.0 \
    transformers==4.46.3 \
    diffusers==0.29.0 \
    resemble-perth==1.0.1 \
    omegaconf==2.3.0 \
    conformer==0.3.2 \
    safetensors==0.5.3 \
    scipy
!pip install --no-cache-dir chatterbox-tts==0.1.1

Found existing installation: numpy 1.26.0
Uninstalling numpy-1.26.0:
  Successfully uninstalled numpy-1.26.0
Found existing installation: scipy 1.15.3
Uninstalling scipy-1.15.3:
  Successfully uninstalled scipy-1.15.3
Found existing installation: librosa 0.10.0
Uninstalling librosa-0.10.0:
  Successfully uninstalled librosa-0.10.0
Found existing installation: resemble-perth 1.0.1
Uninstalling resemble-perth-1.0.1:
  Successfully uninstalled resemble-perth-1.0.1
Found existing installation: chatterbox-tts 0.1.1
Uninstalling chatterbox-tts-0.1.1:
  Successfully uninstalled chatterbox-tts-0.1.1
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)

Collecting chatterbox-tts==0.1.1
  Downloading chatterbox_tts-0.1.1-py3-none-any.whl.metadata (5.9 kB)
Collecting numpy==1.26.0 (from chatterbox-tts==0.1.1)
  Downloading numpy-1.26.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Collecting librosa==0.10.0 (from chatterbox-tts==0.1.1)
  Downloading librosa-0.10.0-py3-none-any.whl.metadata (8.3 kB)
Downloading chatterbox_tts-0.1.1-py3-none-any.whl (91 kB)
Downloading librosa-0.10.0-py3-none-any.whl (252 kB)
Downloading numpy-1.26.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)

In [None]:
# Remember to click Runtime > Restart session before running the next cell
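
After the restart, it can help to confirm which package versions the session actually picked up. The captured install log shows `chatterbox-tts==0.1.1` requesting `numpy==1.26.0` and `librosa==0.10.0`, so the final versions may differ from the earlier pins. The sketch below only reports what is installed; the helper name `installed_versions` and the package list are illustrative choices, not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# Illustrative selection of packages from the install cell.
for name, found in installed_versions(
    ["numpy", "librosa", "torch", "torchaudio", "chatterbox-tts"]
).items():
    print(f"{name}: {found if found is not None else 'not installed'}")
```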

In [None]:
import torchaudio as ta
import torch
from chatterbox.tts import ChatterboxTTS
from IPython.display import Audio, display

# Automatically detect the best available device
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Params
exaggeration = 0.5  # higher values give more exaggerated, faster speech
cfg_weight = 0.5  # lower values give slower speech

model = ChatterboxTTS.from_pretrained(device=device)

text = "You're now listening to voice generated by chatterbox TTS. Isn't it great?"
wav = model.generate(text, exaggeration=exaggeration, cfg_weight=cfg_weight)
ta.save("output.wav", wav, model.sr)

display(Audio('output.wav'))

## Voice-clone text-to-speech

- If the reference audio is fast, try lowering cfg_weight to compensate.

In [1]:
!pip uninstall -y numpy scipy librosa resemble-perth chatterbox-tts
!pip install --no-cache-dir numpy==1.26.4
!pip install --no-cache-dir \
    resampy==0.4.3 \
    librosa==0.11.0 \
    s3tokenizer \
    torch==2.6.0 \
    torchaudio==2.6.0 \
    transformers==4.46.3 \
    diffusers==0.29.0 \
    resemble-perth==1.0.1 \
    omegaconf==2.3.0 \
    conformer==0.3.2 \
    safetensors==0.5.3 \
    scipy
!pip install --no-cache-dir chatterbox-tts==0.1.1
!apt-get update
!apt-get install -y ffmpeg portaudio19-dev
!pip install ffmpeg-python sounddevice

Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Found existing installation: scipy 1.15.3
Uninstalling scipy-1.15.3:
  Successfully uninstalled scipy-1.15.3
Found existing installation: librosa 0.11.0
Uninstalling librosa-0.11.0:
  Successfully uninstalled librosa-0.11.0
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the follow

In [2]:
# Remember to click Runtime > Restart session before running the next cell

In [3]:
# Upload an audio file to use as the voice-cloning reference

from google.colab import files
import ffmpeg

# Prompt the user to upload a file
uploaded = files.upload()

# Convert each uploaded file; if several are uploaded, the last one
# converted overwrites reference.wav
for filename in uploaded.keys():
    print(f'User uploaded file "{filename}"')
    output_filename = "reference.wav"

    # Use ffmpeg-python to convert the file to 16 kHz, 16-bit PCM WAV
    try:
        (
            ffmpeg
            .input(filename)
            .output(output_filename, acodec='pcm_s16le', ar='16000')
            # capture_stderr=True is needed so that e.stderr is populated below
            .run(overwrite_output=True, capture_stderr=True)
        )
        print(f'Converted "{filename}" to "{output_filename}"')
    except ffmpeg.Error as e:
        print("Error during conversion:", e.stderr.decode())

Saving trump_promptvn.wav to trump_promptvn.wav
User uploaded file "trump_promptvn.wav"
Converted "trump_promptvn.wav" to "reference.wav"
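
To sanity-check the converted reference before cloning, the WAV header can be read with Python's standard `wave` module. This is a minimal sketch: the helper name `describe_wav` is hypothetical, and it assumes the file is the PCM WAV written by the conversion cell above.

```python
import wave


def describe_wav(path):
    """Return basic header facts for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wf.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


# Usage, after the conversion cell has written reference.wav:
# print(describe_wav("reference.wav"))
```

Given the `ar='16000'` and `pcm_s16le` settings in the conversion cell, the sample rate should read 16000 and the bit depth 16; the channel count follows the uploaded file, since the conversion does not set it.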


In [None]:
import torchaudio as ta
import torch
from chatterbox.tts import ChatterboxTTS
from IPython.display import Audio, display

# Automatically detect the best available device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Params
exaggeration = 0.5  # higher values give more exaggerated, faster speech
cfg_weight = 0.5  # lower values give slower speech
reference_audio = "reference.wav"

model = ChatterboxTTS.from_pretrained(device=device)

text = "You're now listening to voice generated by chatterbox TTS. Isn't it great?"
wav = model.generate(text, exaggeration=exaggeration, cfg_weight=cfg_weight, audio_prompt_path=reference_audio)
ta.save("voice_clone_output.wav", wav, model.sr)

display(Audio('voice_clone_output.wav'))