# Easy Inferencing with 🐸 TTS ⚡

#### Want to quickly synthesize speech using a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server, open it in your favorite web browser, and 🗣️.

In this notebook, we will: 
```
1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run a multispeaker 🐸 TTS model
```
So, let's jump right in!


## Install 🐸 TTS ⬇️

In [2]:
! pip install -U pip
! pip install TTS
! pip install gradio_client


Collecting scipy>=1.11.2 (from TTS)
  Using cached scipy-1.13.1-cp39-cp39-macosx_12_0_arm64.whl.metadata (60 kB)
Collecting numpy==1.22.0 (from TTS)
  Using cached numpy-1.22.0-cp39-cp39-macosx_11_0_arm64.whl.metadata (2.0 kB)
INFO: pip is looking at multiple versions of scipy to determine which version is compatible with other requirements. This could take a while.
Collecting scipy>=1.11.2 (from TTS)
  Using cached scipy-1.13.0-cp39-cp39-macosx_12_0_arm64.whl.metadata (60 kB)
  Using cached scipy-1.12.0-cp39-cp39-macosx_12_0_arm64.whl.metadata (60 kB)
  Using cached scipy-1.11.4-cp39-cp39-macosx_12_0_arm64.whl.metadata (60 kB)
Collecting torch>=2.1 (from TTS)
  Using cached torch-2.4.1-cp39-none-macosx_11_0_arm64.whl.metadata (26 kB)
Using cached numpy-1.22.0-cp39-cp39-macosx_11_0_arm64.whl (12.8 MB)
Using cached scipy-1.11.4-cp39-cp39-macosx_12_0_arm64.whl (29.7 MB)
Using cached torch-2.4.1-cp39-none-macosx_11_0_arm64.whl (62.1 MB)
Installing collected packages: numpy, torch, sci

In [4]:
from gradio_client import Client

Loaded as API: https://qwen-qwen2-5.hf.space ✔
Qwen's Response: Thank you for your kind words! I'm here to support and assist you in any way I can. How can I help you today?


## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g., TTS, vocoder), languages, training datasets, and architectures.
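
These attributes are encoded in each model name, which `tts --list_models` prints in the format `type/language/dataset/model`. The following is a minimal sketch of splitting such a name into its parts. The `ModelName` class and `parse_model_name` helper are hypothetical names introduced here for illustration and are not part of the 🐸TTS API.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Components of a model name (illustrative helper, not part of TTS)."""
    model_type: str  # e.g. "tts_models"
    language: str    # e.g. "en" or "multilingual"
    dataset: str     # e.g. "ljspeech"
    model: str       # e.g. "glow-tts"


def parse_model_name(name: str) -> ModelName:
    """Split a 'type/language/dataset/model' string into its four parts."""
    parts = name.strip().split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected type/language/dataset/model, got {name!r}")
    return ModelName(*parts)


parsed = parse_model_name("tts_models/en/ljspeech/glow-tts")
print(parsed.language, parsed.dataset, parsed.model)
```

A helper like this can be handy for filtering the printed list by language or dataset before picking a model.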

You can use either your own model or one of the released models under 🐸TTS.

Use `tts --list_models` to list the available models.



In [21]:
! tts --list_models

from TTS.api import TTS

# Load the model
model_name = "tts_models/multilingual/multi-dataset/xtts_v2"
tts = TTS(model_name)

# Check for available speaker indices
if hasattr(tts, "speaker_manager") and tts.speaker_manager:
    print("Available speaker IDs:", tts.speaker_manager.speaker_ids)
else:
    print("This model does not expose speaker information directly.")


RuntimeError: module was compiled against NumPy C-API version 0x10 (NumPy 1.23) but the running NumPy has C-API version 0xf. Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem.

 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/xtts_v2 [already downloaded]
 2: tts_models/multilingual/multi-dataset/xtts_v1.1
 3: tts_models/multilingual/multi-dataset/your_tts
 4: tts_models/multilingual/multi-dataset/bark
 5: tts_models/bg/cv/vits
 6: tts_models/cs/cv/vits
 7: tts_models/da/cv/vits
 8: tts_models/et/cv/vits
 9: tts_models/ga/cv/vits
 10: tts_models/en/ek1/tacotron2
 11: tts_models/en/ljspeech/tacotron2-DDC
 12: tts_models/en/ljspeech/tacotron2-DDC_ph
 13: tts_models/en/ljspeech/glow-tts [already downloaded]
 14: tts_models/en/ljspeech/speedy-speech
 15: tts_models/en/ljspeech/tacotron2-DCA
 

RuntimeError: module was compiled against NumPy C-API version 0x10 (NumPy 1.23) but the running NumPy has C-API version 0xf. Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem.

 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts


  self.speakers = torch.load(speaker_file_path)
  return torch.load(f, map_location=map_location, **kwargs)


This model does not expose speaker information directly.
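
The `RuntimeError` in the output above reports a module compiled against the NumPy 1.23 C-API running on an older NumPy, which would be consistent with the `numpy==1.22.0` pin visible in the install log. As a rough diagnostic sketch, the snippet below prints the versions actually installed in the current kernel. The package list and the helper names `installed_versions` and `version_tuple` are assumptions chosen for illustration, and `version_tuple` only handles plain `major.minor[.patch]` strings.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_tuple(version):
    """Turn '1.22.0' into (1, 22) for a coarse major/minor comparison."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# Packages named in the install log above; adjust the list as needed.
for name, version in installed_versions(["numpy", "scipy", "torch", "TTS"]).items():
    print(f"{name}: {version or 'not installed'}")

# Coarse check: is a 1.22.x NumPy older than the 1.23 the error mentions?
print(version_tuple("1.22.0") < (1, 23))
```

If the printed NumPy version is older than the one a compiled dependency was built against, aligning the two versions in the environment is the usual way out.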


## ✅ Run a 🐸 TTS model

#### **First things first**: Using a released model and default vocoder:

You can simply copy the full model name from the list above and pass it to `--model_name`.


In [None]:
client = Client("Qwen/Qwen2.5")
result = client.predict(
    query="This is your first message, introduce yourself to the world!",                # Text message to Qwen
    history=[],                     # Previous conversation history
    system="You are a helpful assistant.",  # System message setting
    radio="72B",                    # Model choice
    api_name="/model_chat"
)
qwen_response = result[1][0][1]['text']  # This is where Qwen's actual response is located
print("Qwen's Response:", qwen_response)

!tts --text "{qwen_response}" \
--model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--language_idx "en" \
--speaker_wav "/Users/ayush/Desktop/XTTS/japanese-woman.wav" \
--out_path output.wav




Loaded as API: https://qwen-qwen2-5.hf.space ✔
Qwen's Response: Hello, everyone! I'm Qwen, a large language model created by Alibaba Cloud. I'm excited to be here and to have the opportunity to interact with all of you. My purpose is to assist, engage, and learn from our conversations, helping to solve problems, generate ideas, and explore new topics. Whether you need help with technical questions, creative writing, or just want to chat about the latest trends in AI, I'm here to help! Looking forward to connecting with you and making this journey both productive and enjoyable. Feel free to ask me anything!
RuntimeError: module was compiled against NumPy C-API version 0x10 (NumPy 1.23) but the running NumPy has C-API version 0xf. Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem.
 > tts_models/multilingual/multi-dataset/xt
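
The cell above interpolates the chat response into a double-quoted shell string, so a reply that itself contains a double quote could break the command. One possible workaround, sketched here under the assumption that the same CLI flags are used, is to build the call as an argument list and hand it to `subprocess`. The `build_tts_command` helper and the `reference.wav` path are hypothetical.

```python
import shlex


def build_tts_command(text, model_name, out_path,
                      language_idx=None, speaker_wav=None, speaker_idx=None):
    """Return the `tts` CLI call as an argument list (no shell quoting needed)."""
    cmd = ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]
    optional = {
        "--language_idx": language_idx,
        "--speaker_wav": speaker_wav,
        "--speaker_idx": speaker_idx,
    }
    # Append only the optional flags that were actually provided.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_tts_command(
    text='She said "hello", didn\'t she?',
    model_name="tts_models/multilingual/multi-dataset/xtts_v2",
    out_path="output.wav",
    language_idx="en",
    speaker_wav="reference.wav",  # hypothetical path to a reference recording
)
print(shlex.join(cmd))  # shell-safe rendering, useful for logging
# To actually run it: subprocess.run(cmd, check=True)
```

Because each argument is a separate list element, quotes and apostrophes in the text reach `tts` unchanged.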

## 📣 Listen to the synthesized wave 📣

In [18]:
import IPython
IPython.display.Audio("output.wav")

### **Second things second**:

🔶 A TTS model can be trained on either a single speaker voice or multiple speaker voices. This training choice determines what the model can do at inference time and which speaker voices are available for synthesis.

🔶 If you want to run a multispeaker model from the released models list, first check the available speaker IDs with the `--list_speaker_idxs` flag, then pass one of them to `--speaker_idx` to synthesize speech.

In [5]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs 


 > Downloading model to /Users/ayush/Library/Application Support/tts/tts_models--en--vctk--vits
 99%|██████████████████████████████████████▌| 146M/148M [00:02<00:00, 66.0MiB/s] > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > bas

## 💬 Synthesize speech using speaker ID 💬

In [6]:
!tts --text "Trying out specific speaker voice" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"

 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
Traceback (most recent call last):
  File "/Users/ayush/Desktop/XTTS/venv/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/Users/ayush/Desktop/XTTS/TTS/bin/synthesize.py", l

## 📣 Listen to the synthesized speaker specific wave 📣

In [7]:
import IPython
IPython.display.Audio("spkr-out.wav")

ValueError: rate must be specified when data is a numpy array or list of audio samples.
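
`IPython.display.Audio` appears to treat a string argument as raw audio data when it is neither a URL nor an existing file, which would explain the `ValueError` above if the synthesis step failed and `spkr-out.wav` was never written. A minimal sketch for checking the file before playing it follows. The `describe_wav` helper is a hypothetical name, and the standard-library `wave` module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return (sample_rate, duration_seconds) for a PCM WAV file.

    Raises FileNotFoundError if the file is missing.
    """
    wav_path = Path(path)
    if not wav_path.is_file():
        raise FileNotFoundError(f"{wav_path} does not exist; did synthesis succeed?")
    with wave.open(str(wav_path), "rb") as wav_file:
        rate = wav_file.getframerate()
        duration = wav_file.getnframes() / rate
    return rate, duration


try:
    rate, duration = describe_wav("spkr-out.wav")
    print(f"spkr-out.wav: {rate} Hz, {duration:.2f} s")
except (FileNotFoundError, wave.Error) as err:
    # Missing file or a format the wave module cannot read.
    print(err)
```

Running a check like this before the playback cell makes a failed synthesis visible as a clear message instead of an unrelated-looking error.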

🔶 If you want to use an external speaker to synthesize speech, you need to supply the `--speaker_wav` flag along with an external speaker encoder path and config file, as follows:

First we need to get the speaker encoder model, its config, and a reference `speaker_wav`.

In [None]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav

In [None]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to dim the lights so people can see that a bit better?" \
--out_path spkr-out.wav \
--language_idx "en"

## 📣 Listen to the synthesized speaker specific wave 📣

In [None]:
import IPython
IPython.display.Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech! 
Follow up with the next tutorials to learn more advanced material.