# Text-to-Speech introduction


Text-to-Speech (TTS) technology has evolved from robotic, monotone computer voices to sophisticated AI models that produce remarkably human-like speech. TTS represents a powerful tool that can transform how users interact with applications across virtually every domain.

## Core Applications

- Accessibility & Inclusion: TTS models break down barriers for visually impaired users, those with reading difficulties like dyslexia, or anyone who benefits from auditory learning. TTS technology can revolutionize how educational content, news articles, or documentation becomes accessible to broader audiences.

- Content Creation & Media: Modern TTS can generate podcasts, audiobooks, or video narrations without human voice actors. This opens possibilities for automated content pipelines, multilingual media production, or personalized storytelling applications that adapt tone and style to individual preferences.

- Conversational Interfaces: Beyond simple chatbots, TTS enables natural voice interactions for customer service, virtual assistants, or interactive learning platforms. Combined with speech recognition, you can build truly conversational applications that feel more human and engaging.

- Real-time Communication: TTS powers live translation services, voice messaging platforms, or accessibility tools that can speak text in real time during meetings or phone calls. This creates opportunities for breaking language barriers or supporting communication needs.

## Example TTS datasets
- [LJ Speech](https://keithito.com/LJ-Speech-Dataset/).
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950).

## Leverage TTS in your project

Consider how TTS might enhance your hackathon project: Could your app benefit from voice feedback? Would audio output make your solution more accessible? How might different voices or languages expand your target audience? The technology is ready - the creative applications are up to you.

# Easy Inferencing with 🐸 TTS ⚡

🐸 Coqui TTS is a library for advanced Text-to-Speech generation.

#### Want to quickly synthesize speech using a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server, open it in your favorite web browser, and 🗣️.

In this notebook, we will:

1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run a multispeaker 🐸 TTS model

So, let's jump right in!


## Install 🐸 TTS ⬇️

In [None]:
! pip install -U pip
! pip install coqui-tts==0.26.0
! sudo apt-get install -y espeak

## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g. TTS, vocoder), languages, training datasets, and architectures.

You can either use your own model or the release models under 🐸TTS.

Use `tts --list_models` to find out the available models.



In [None]:
! tts --list_models


INFO:TTS.utils.manage:
Name format: type/language/dataset/model
INFO:TTS.utils.manage:Name format: type/language/dataset/mod
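
The name format printed above (`type/language/dataset/model`) is easy to take apart in code, which helps when filtering the list by language or dataset. The following is a minimal sketch using only the standard library; `ModelName` and `parse_model_name` are hypothetical helper names for illustration, not part of 🐸 TTS.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four parts of a model name, in the order the CLI prints them."""
    model_type: str
    language: str
    dataset: str
    model: str


def parse_model_name(name: str) -> ModelName:
    """Split a 'type/language/dataset/model' string into its four parts."""
    parts = name.strip().split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected type/language/dataset/model, got {name!r}")
    return ModelName(*parts)


# Example with a model name used later in this notebook.
parsed = parse_model_name("tts_models/en/ljspeech/glow-tts")
print(parsed.language, parsed.dataset, parsed.model)
```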

## ✅ Run a 🐸 TTS model

#### **First things first**: Using a release model and default vocoder:

You can simply copy the full model name from the list above and use it.

Synthesis happens in two stages. The TTS model (in this case, Glow-TTS) converts the input text into a spectrogram (a representation of how the frequency spectrum of a signal varies with time). A vocoder then takes that spectrogram as input and produces the output waveform. Because the command below does not name a vocoder, the default vocoder for the model is used.

In [None]:
!tts --text "hello world" \
--model_name "tts_models/en/ljspeech/glow-tts" \
--out_path output.wav


tts_models/en/ljspeech/glow-tts is already downloaded.
INFO:TTS.utils.manage:tts_models/en/ljspeech/glow-tts is already downl
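
If you would rather drive the same CLI from Python, one option is to build the command as an argument list and hand it to `subprocess`. This is only a sketch: `build_tts_command` and `run_tts` are hypothetical helpers, and the flags are the ones used in the cell above. Passing a list avoids shell quoting problems when the text contains spaces or quotes.

```python
import shlex
import subprocess
from typing import List


def build_tts_command(text: str, model_name: str, out_path: str) -> List[str]:
    """Build the argument list for the `tts` CLI call shown above."""
    return [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--out_path", out_path,
    ]


def run_tts(cmd: List[str]) -> None:
    """Run the command; this needs the `tts` executable on PATH."""
    subprocess.run(cmd, check=True)


cmd = build_tts_command("hello world", "tts_models/en/ljspeech/glow-tts", "output.wav")
# Print a copy-pasteable shell version without running anything.
print(shlex.join(cmd))
```

Call `run_tts(cmd)` once 🐸 TTS is installed to actually synthesize the file.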

## 📣 Listen to the synthesized wave 📣

In [None]:
import IPython
IPython.display.Audio("output.wav")
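
Besides listening, it can be useful to check the basic properties of the generated file. The sketch below uses Python's standard `wave` module and assumes `output.wav` is an uncompressed PCM WAV file; `wav_info` is a hypothetical helper name.

```python
import os
import wave


def wav_info(path: str) -> dict:
    """Return channel count, sample rate and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Only inspect the file if the synthesis step above has produced it.
if os.path.exists("output.wav"):
    print(wav_info("output.wav"))
```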

### **Second things second**:

🔶 A TTS model can be trained on a single speaker's voice or on the voices of multiple speakers. This training choice determines what the model can do at inference time and which speaker voices are available for synthesis.

🔶 If you want to run a multispeaker model from the released models list, first check the available speaker IDs with the `--list_speaker_idxs` flag, then pass one of them to synthesize speech with that voice.

In [None]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs


tts_models/en/vctk/vits is already downloaded.
INFO:TTS.utils.manage:tts_models/en/vctk/vits is already downloaded.
Using mod

## 💬 Synthesize speech using speaker ID 💬

In [None]:
!tts --text "Trying out specific speaker voice" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"

tts_models/en/vctk/vits is already downloaded.
INFO:TTS.utils.manage:tts_models/en/vctk/vits is already downloaded.
Using mod
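
The same multispeaker synthesis can also be done from Python instead of the CLI. The sketch below assumes the `TTS.api.TTS` class and its `tts_to_file` method behave as in the library's documented Python API for the installed version; `synthesize_with_speaker` is a hypothetical wrapper. The import is done inside the function so the input check runs even where the package is missing.

```python
def synthesize_with_speaker(
    text: str,
    speaker: str,
    out_path: str,
    model_name: str = "tts_models/en/vctk/vits",
) -> str:
    """Synthesize `text` with one speaker of a multispeaker model."""
    if not text.strip():
        raise ValueError("text must not be empty")

    # Lazy import: assumes the coqui-tts package from the install step.
    from TTS.api import TTS

    tts = TTS(model_name=model_name)
    # `speaker` is one of the IDs listed by `--list_speaker_idxs`.
    tts.tts_to_file(text=text, speaker=speaker, file_path=out_path)
    return out_path


# Usage (downloads the model on first run):
# synthesize_with_speaker("Trying out specific speaker voice", "p341", "spkr-out.wav")
```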

## 📣 Listen to the synthesized speaker-specific wave 📣

In [None]:
import IPython
IPython.display.Audio("spkr-out.wav")

🔶 If you want to synthesize speech in the voice of an external speaker, you need to supply the `--speaker_wav` flag along with the path to an external speaker encoder and its config file, as follows:

First, we need to get the speaker encoder model, its config, and a reference `speaker_wav`:

In [None]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav


In [None]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to dim the lights so people can see that a bit better?" \
--out_path spkr-out.wav \
--language_idx "en"

Downloading model to /root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts
INFO:TTS.utils.manage:Downloadi
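
Voice cloning from a reference recording can be sketched with the Python API as well. This assumes `tts_to_file` accepts `speaker_wav` and `language` arguments, as in the library's documented API for the installed version. Unlike the CLI call above, the sketch does not pass an encoder path or config, on the assumption that the model download includes what it needs. `clone_voice` is a hypothetical wrapper.

```python
import os


def clone_voice(
    text: str,
    speaker_wav: str,
    out_path: str,
    language: str = "en",
    model_name: str = "tts_models/multilingual/multi-dataset/your_tts",
) -> str:
    """Synthesize `text` in the voice heard in the `speaker_wav` recording."""
    # Fail early with a clear message if the reference recording is missing.
    if not os.path.isfile(speaker_wav):
        raise FileNotFoundError(f"reference recording not found: {speaker_wav}")

    # Lazy import: assumes the coqui-tts package from the install step.
    from TTS.api import TTS

    tts = TTS(model_name=model_name)
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_path,
    )
    return out_path


# Usage, with the reference file downloaded earlier:
# clone_voice("Hello from a cloned voice.", "LJ001-0001.wav", "spkr-out.wav")
```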

## 📣 Listen to the synthesized speaker-specific wave 📣

In [None]:
import IPython
IPython.display.Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech!

# Challenge
- Which languages are supported by the "tts_models/multilingual/multi-dataset/your_tts" model? Use the command-line interface (CLI) to find out.