# Easy Inferencing with 🐸 TTS ⚡

#### Want to quickly synthesize speech using a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server and open it in your favorite web browser to 🗣️.

In this notebook, we will: 
```
1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run multispeaker 🐸 TTS model 
```
So, let's jump right in!


## Install 🐸 TTS ⬇️

In [None]:
! pip install -U pip
! pip install TTS

## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g. TTS, vocoder), languages, training datasets, and architectures.

You can either use your own model or one of the released models under 🐸TTS.

Use `tts --list_models` to find out the available models.



In [1]:
! tts --list_models

 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/your_tts
 2: tts_models/bg/cv/vits
 3: tts_models/cs/cv/vits
 4: tts_models/da/cv/vits
 5: tts_models/et/cv/vits
 6: tts_models/ga/cv/vits
 7: tts_models/en/ek1/tacotron2
 8: tts_models/en/ljspeech/tacotron2-DDC
 9: tts_models/en/ljspeech/tacotron2-DDC_ph
 10: tts_models/en/ljspeech/glow-tts
 11: tts_models/en/ljspeech/speedy-speech
 12: tts_models/en/ljspeech/tacotron2-DCA
 13: tts_models/en/ljspeech/vits
 14: tts_models/en/ljspeech/vits--neon
 15: tts_models/en/ljspeech/fast_pitch
 16: tts_models/en/ljspeech/overflow
 17: tts_models/en/ljspeech/neural_hmm
 18: tts_models/en/vctk/vits
 19: tts_models/en/vctk/fast_pitch
 20: tts_models/en/sam/tacotron-DDC
 21: tts_models/en/blizzard2013/capacitron-t2-c50
 22: tts_models/en/blizzard2013/capacitron-t2-c150_v2
 23: tts_models/es/mai/tacotron2-DDC
 24: tts_models/es/css10/vits
 25: tts_models/fr/mai/tacotron2-DDC
 26: tts_models/fr/css10/vits
 27: tts_model
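
The listing follows the `type/language/dataset/model` format printed at the top of the output. As a minimal sketch, a full model name can be split into those four parts in plain Python. The `ModelName` tuple and the `parse_model_name` helper are hypothetical names introduced here for illustration; they are not part of 🐸 TTS.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four parts of a model name: type/language/dataset/model."""
    model_type: str
    language: str
    dataset: str
    model: str


def parse_model_name(full_name: str) -> ModelName:
    """Split a full model name as printed by `tts --list_models`."""
    parts = full_name.strip().split("/")
    if len(parts) != 4:
        raise ValueError(
            f"expected type/language/dataset/model, got {full_name!r}"
        )
    return ModelName(*parts)


# A few names copied from the listing above.
names = [
    "tts_models/en/ljspeech/glow-tts",
    "tts_models/en/vctk/vits",
    "tts_models/fr/css10/vits",
]

# Keep only the English models.
english = [n for n in names if parse_model_name(n).language == "en"]
print(english)
```

Filtering this way can help when the full list is long and you only need models for one language or dataset.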

## ✅ Run a 🐸 TTS model

#### **First things first**: Using a release model and default vocoder:

You can simply copy the full model name from the list above and use it:


In [2]:
!tts --text "hello world" \
--model_name "tts_models/en/ljspeech/glow-tts" \
--out_path output.wav


 > Downloading model to C:\Users\iambl\AppData\Local\tts\tts_models--en--ljspeech--glow-tts
 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Downloading model to C:\Users\iambl\AppData\Local\tts\vocoder_models--en--ljspeech--multiband-melgan
 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Using model: glow_tts
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.1
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
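
In the log above, the download folder names look like the model name with each `/` replaced by `--` (for example `tts_models--en--ljspeech--glow-tts`). The sketch below reproduces that naming so you can check whether a model is already on disk. This is inferred from the log output only, not from a documented API, and `cache_folder_name` and `is_downloaded` are hypothetical helpers; the cache root differs per machine and must be supplied by you.

```python
from pathlib import Path


def cache_folder_name(model_name: str) -> str:
    """Mirror the folder naming seen in the download log (assumption)."""
    return model_name.replace("/", "--")


def is_downloaded(model_name: str, cache_root: str) -> bool:
    """Return True if a folder for this model exists under cache_root."""
    return (Path(cache_root) / cache_folder_name(model_name)).is_dir()


print(cache_folder_name("tts_models/en/ljspeech/glow-tts"))
print(cache_folder_name("vocoder_models/en/ljspeech/multiband-melgan"))
```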
 



## 📣 Listen to the synthesized wave 📣

In [3]:
import IPython
IPython.display.Audio("output.wav")
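
Besides listening, it can be useful to check basic properties of the generated file, such as whether its sample rate matches the `sample_rate:22050` reported in the log. The following is a minimal sketch using Python's standard `wave` module, assuming the output is a standard PCM WAV file; `describe_wav` is a hypothetical helper name.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return channel count, sample rate, sample width and duration of a PCM WAV."""
    with wave.open(str(path), "rb") as wav_file:
        frame_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": frame_rate,
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_s": n_frames / frame_rate,
        }


# Only inspect the file if the synthesis step above has produced it.
if Path("output.wav").exists():
    print(describe_wav("output.wav"))
```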

### **Second things second**:

🔶 A TTS model can be trained either on a single speaker's voice or on multiple speakers' voices. This training choice is directly reflected in the inference ability and the available speaker voices that can be used to synthesize speech.

🔶 If you want to run a multispeaker model from the released models list, you can first check the speaker IDs using the `--list_speaker_idxs` flag and then use one of these speaker voices to synthesize speech.

In [4]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs 


 > Downloading model to C:\Users\iambl\AppData\Local\tts\tts_models--en--vctk--vits
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.



## 💬 Synthesize speech using speaker ID 💬

In [5]:
!tts --text "Trying out specific speaker voice" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"

 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > Text: Trying out specific speaker voice
 > Text splitted to sentences.
['Trying out specific speaker voice']
 > Processing time: 0.521
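
When calling the CLI from a script rather than a notebook cell, passing the arguments as a list avoids shell quoting and line-continuation pitfalls. The sketch below builds such a list using only the flags shown in this notebook (`--text`, `--model_name`, `--out_path`, `--speaker_idx`). `build_tts_command` is a hypothetical helper, and the `subprocess.run` call is left commented out because it requires 🐸 TTS to be installed.

```python
import shlex


def build_tts_command(text, model_name, out_path, speaker_idx=None):
    """Assemble the `tts` CLI call as an argument list."""
    cmd = [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--out_path", out_path,
    ]
    # Only multispeaker models need a speaker ID.
    if speaker_idx is not None:
        cmd += ["--speaker_idx", speaker_idx]
    return cmd


cmd = build_tts_command(
    text="Trying out specific speaker voice",
    model_name="tts_models/en/vctk/vits",
    out_path="spkr-out.wav",
    speaker_idx="p341",
)

# Show the command with POSIX-style quoting, for display only.
print(shlex.join(cmd))

# To actually run it (requires TTS to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```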



## 📣 Listen to the synthesized speaker specific wave 📣

In [6]:
import IPython
IPython.display.Audio("spkr-out.wav")

🔶 If you want to use an external speaker to synthesize speech, you need to supply the `--speaker_wav` flag along with an external speaker encoder path and config file, as follows:

First we need to get the speaker encoder model, its config, and a reference `speaker_wav`:

In [None]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav
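
Since the synthesis step depends on all three downloads, a quick check that the files are present can save a confusing error later. This is a small sketch using only the standard library; the file names are the ones fetched above, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# The three files fetched by the wget commands above.
REQUIRED_FILES = ["model_se.pth.tar", "config_se.json", "LJ001-0001.wav"]


def missing_files(paths, base_dir="."):
    """Return the subset of paths that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


missing = missing_files(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All speaker encoder files are in place.")
```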

In [None]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to dim the lights so people can see that a bit better?" \
--out_path spkr-out.wav \
--language_idx "en"

## 📣 Listen to the synthesized speaker specific wave 📣

In [None]:
import IPython
IPython.display.Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech! 
Follow up with the next tutorials to learn more advanced material.