<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/Dia_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Dia 1.6B TTS Google Colab

## 📄 Description  
This Colab notebook uses the Dia 1.6B TTS model for conversational text-to-speech generation.

**Capabilities**: Text-to-speech, Conversational, Non-verbal sounds, Voice Cloning from Audio

---

## How to use
- Follow the instructions in each cell to enter the text to generate and adjust the parameters.
- Run all cells. The output is saved to `output.mp3`, or to `voice_clone_output.mp3` if you use voice cloning.

---

## 🔗 Resources

- **GitHub Repository:** [nari-labs/dia](https://github.com/nari-labs/dia)
- **Model Availability:** [nari-labs/Dia-1.6B](https://huggingface.co/nari-labs/Dia-1.6B)

---

## 🎙️ Explore More TTS Models  
Want to try out additional TTS models? Check out the curated collection here:  
👉 [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)


## Conversational TTS

In [1]:
!pip install git+https://github.com/nari-labs/dia.git

Collecting git+https://github.com/nari-labs/dia.git
  Cloning https://github.com/nari-labs/dia.git to /tmp/pip-req-build-ts7gidct
  Running command git clone --filter=blob:none --quiet https://github.com/nari-labs/dia.git /tmp/pip-req-build-ts7gidct
  Resolved https://github.com/nari-labs/dia.git to commit 7fb9822b1a976b4fe3ded9961a9b4331bc09eab4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting descript-audio-codec>=1.0.0 (from nari-tts==0.1.0)
  Downloading descript_audio_codec-1.0.0-py3-none-any.whl.metadata (7.8 kB)
Collecting gradio>=5.25.2 (from nari-tts==0.1.0)
  Downloading gradio-5.29.1-py3-none-any.whl.metadata (16 kB)
Collecting numpy>=2.2.4 (from nari-tts==0.1.0)
  Downloading numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)

In [2]:
# Note: Always start with [S1] and alternate between speakers.
# Non-verbals supported are: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
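The note above asks for scripts that start with `[S1]` and alternate between speakers. A minimal sketch of a check for that rule is shown below; `check_speaker_tags` is a hypothetical helper written for this notebook, not part of the Dia package, and it only inspects the tags, not the spoken text.

```python
import re

# Matches speaker tags such as [S1] or [S2] and captures the speaker number.
SPEAKER_TAG = re.compile(r"\[S(\d+)\]")


def check_speaker_tags(script):
    """Return a list of problems found in the speaker tags (empty if none).

    Hypothetical helper: it encodes only the two rules from the note above.
    """
    problems = []
    if not script.lstrip().startswith("[S1]"):
        problems.append("script does not start with [S1]")

    tags = SPEAKER_TAG.findall(script)
    # Compare each tag with the one before it; a repeat means no alternation.
    for position, (previous, current) in enumerate(zip(tags, tags[1:]), start=2):
        if previous == current:
            problems.append(f"tag {position} repeats speaker [S{current}]")
    return problems


example = "[S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
print(check_speaker_tags(example))
```

Calling `check_speaker_tags(text)` before generation and stopping when the returned list is non-empty is one way to catch a malformed script early.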

In [None]:
from dia.model import Dia
from IPython.display import Audio

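# Load the pretrained weights, using float16 for computation
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Generate audio for the script defined in the previous cell
output = model.generate(text, use_torch_compile=True, verbose=True)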

model.save_audio("output.mp3", output)

Audio("output.mp3")

## Conversational TTS with Voice Cloning

In [1]:
!pip install git+https://github.com/nari-labs/dia.git
!apt-get install -y ffmpeg
!pip install ffmpeg-python

Collecting git+https://github.com/nari-labs/dia.git
  Cloning https://github.com/nari-labs/dia.git to /tmp/pip-req-build-nkzrupjv
  Running command git clone --filter=blob:none --quiet https://github.com/nari-labs/dia.git /tmp/pip-req-build-nkzrupjv
  Resolved https://github.com/nari-labs/dia.git to commit 7fb9822b1a976b4fe3ded9961a9b4331bc09eab4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting descript-audio-codec>=1.0.0 (from nari-tts==0.1.0)
  Downloading descript_audio_codec-1.0.0-py3-none-any.whl.metadata (7.8 kB)
Collecting gradio>=5.25.2 (from nari-tts==0.1.0)
  Downloading gradio-5.29.1-py3-none-any.whl.metadata (16 kB)
Collecting numpy>=2.2.4 (from nari-tts==0.1.0)
  Downloading numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [2]:
from google.colab import files
import ffmpeg

# Prompt the user to upload a file
print("Please upload your reference audio file:")
uploaded = files.upload()

# files.upload() returns a dict keyed by file name; report each upload
for file_name in uploaded.keys():
    print(f"Uploaded file: {file_name}")

# Use the first uploaded file as the reference audio
input_file = list(uploaded.keys())[0]

# Convert to reference_audio.mp3 using ffmpeg-python
try:
    (
        ffmpeg
        .input(input_file)
        # vn=None emits a bare -vn flag, which drops any video stream
        .output('reference_audio.mp3', vn=None, ar=44100, ac=2, audio_bitrate='192k')
        # overwrite_output=True passes -y so rerunning the cell replaces the old file
        .run(overwrite_output=True)
    )
    print("Conversion complete! The file has been saved as reference_audio.mp3")
except Exception as e:
    print(f"An error occurred during conversion: {e}")

Please upload your reference audio file:


Saving Catching Up With Friends Audio 2.mp3 to Catching Up With Friends Audio 2.mp3
Uploaded file: Catching Up With Friends Audio 2.mp3
Conversion complete! The file has been saved as reference_audio.mp3
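For readers who prefer to see the underlying command, the sketch below builds an `ffmpeg` argument list that should be roughly equivalent to the ffmpeg-python call above (44.1 kHz, stereo, 192 kbit/s, no video). `build_ffmpeg_command` is a hypothetical helper, and the `-y` flag is an addition that overwrites an existing output file. The function only builds the command; it does not run it.

```python
import shlex


def build_ffmpeg_command(input_file, output_file="reference_audio.mp3"):
    """Build the ffmpeg argument list for converting the reference audio.

    Hypothetical helper mirroring the options used in the cell above.
    """
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", input_file,  # uploaded reference audio
        "-vn",             # drop any video stream
        "-ar", "44100",    # sample rate in Hz
        "-ac", "2",        # number of audio channels
        "-b:a", "192k",    # audio bitrate
        output_file,
    ]


command = build_ffmpeg_command("Catching Up With Friends Audio 2.mp3")
# shlex.join quotes the file name so the printed command can be pasted into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run the conversion and raise an exception if `ffmpeg` exits with an error.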


In [3]:
# Change reference_text to match the transcript of your reference audio; mark speakers with [S1], [S2], etc.
reference_text = "[S1] How are you and Jacky? [S2] Yeah, good thanks. [S1] And your new baby? George, isn't it? [S2] (laughs) You've got a good memory."

reference_audio = "reference_audio.mp3"

# Note: Always start with [S1] and alternate between speakers.
# Non-verbals supported are: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
text_to_generate = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
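Since only the non-verbal tags listed in the comment above are supported, it can help to scan a script for parenthesized tags outside that list before generating. The sketch below is one way to do that; `unsupported_nonverbals` is a hypothetical helper, not part of the Dia package, and it assumes every parenthesized phrase in the script is meant as a non-verbal tag, so ordinary parenthetical text would also be reported.

```python
import re

# Tags copied from the comment in the cell above.
SUPPORTED_NONVERBALS = {
    "laughs", "clears throat", "sighs", "gasps", "coughs", "singing", "sings",
    "mumbles", "beep", "groans", "sniffs", "claps", "screams", "inhales",
    "exhales", "applause", "burps", "humming", "sneezes", "chuckle", "whistles",
}


def unsupported_nonverbals(script):
    """Return parenthesized tags in the script that are not in the supported list."""
    found = re.findall(r"\(([^()]*)\)", script)
    return [tag for tag in found if tag.strip().lower() not in SUPPORTED_NONVERBALS]


print(unsupported_nonverbals("[S1] Wow. Amazing. (laughs) [S2] I am tired. (yawns)"))
```

An empty list means every tag in the script appears in the supported set.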

In [None]:
from dia.model import Dia
from IPython.display import Audio

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Prepend the reference transcript to the new script and pass the reference audio as the prompt
output = model.generate(
    reference_text + text_to_generate, audio_prompt=reference_audio, use_torch_compile=True, verbose=True
)

model.save_audio("voice_clone_output.mp3", output)

Audio("voice_clone_output.mp3")