# Easy Inferencing with 🐸 TTS ⚡

#### Want to quickly synthesize speech using a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server, open it in your favorite web browser, and 🗣️.

In this notebook, we will:
```
1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run a multispeaker 🐸 TTS model
```
So, let's jump right in!


## Install 🐸 TTS ⬇️

In [None]:
! pip install -U pip
! pip install TTS

Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
Collecting TTS
  Obtaining dependency information for TTS from https://files.pythonhosted.org/packages/67/bf/c3fb7b77c74335a8932f003c1fd8db1bca9ae9328d2951e107207fa7b8fa/TTS-0.17.8-cp310-cp310-manylinux1_x86_64.whl.metadata
  Downloading TTS-0.17.8-cp310-cp310-manylinux1_x86_64.whl.metadata (22 kB)
Collecting cython==0.29.30 (from TTS)
  Downloading Cython-0.29.30-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
Collecting scikit-lear

## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g., TTS, vocoder), languages, training datasets, and architectures.

You can use either your own model or one of the released 🐸TTS models.

Use `tts --list_models` to see the available models.



In [None]:
! tts --list_models

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`


 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/xtts_v1
 2: tts_models/multilingual/multi-dataset/your_tts
 3: tts_models/multilingual/multi-dataset/bark
 4: tts_models/bg/cv/vits
 5: tts_models/cs/cv/vits
 6: tts_models/da/cv/vits
 7: tts_models/et/cv/vits
 8: tts_models/ga/cv/vits
 9: tts_models/en/ek1/tacotron2
 10: tts_models/en/ljspeech/tacotron2-DDC
 11: tts_models/en/ljspeech/tacotron2-DDC_ph
 12: tts_models/en/ljspeech/glow-tts
 13: tts_models/en/ljspeech/speedy-speech
 14: tts_models/en/ljspeech/tacotron2-DCA
 15: tts_models/en/ljspeech/vits
 16: tts_models/en/ljspeech/vits--neon
 17: tts_models/en/ljspeech/fast_pitch
 18: tts_models/en/ljspeech/overflow
 19: tts_models/en/ljspeech/neural_hmm
 20: tts_models/en/vctk/vits
 21: tts_models/en/vctk/fast_pitch
 22: 
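
The listing above states that every name follows the `type/language/dataset/model` format. As a minimal sketch of how that format can be used, the snippet below splits a few names copied from the listing and filters them by language. `ModelName` and `parse_model_name` are hypothetical helpers written for this illustration; they are not part of 🐸 TTS.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    model_type: str  # e.g. "tts_models"
    language: str    # e.g. "en" or "multilingual"
    dataset: str     # e.g. "ljspeech"
    model: str       # e.g. "glow-tts"


def parse_model_name(full_name: str) -> ModelName:
    """Split a 'type/language/dataset/model' string into its four fields."""
    parts = full_name.strip().split("/")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '/'-separated fields, got {len(parts)}: {full_name!r}"
        )
    return ModelName(*parts)


# A few names copied from the `tts --list_models` output above.
names = [
    "tts_models/multilingual/multi-dataset/your_tts",
    "tts_models/bg/cv/vits",
    "tts_models/en/ljspeech/glow-tts",
    "tts_models/en/vctk/vits",
]

# Keep only the English models.
english = [n for n in names if parse_model_name(n).language == "en"]
print(english)
```

The same idea works for picking out all models trained on one dataset or sharing one architecture.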

## ✅ Run a 🐸 TTS model

#### **First things first**: Using a released model and its default vocoder:

Copy the full model name from the list above and pass it to `--model_name`.


In [None]:
!tts --text "hello world" \
--model_name "tts_models/en/ljspeech/glow-tts" \
--out_path output.wav


 > Downloading model to /root/.local/share/tts/tts_models--en--ljspeech--glow-tts
100% 344M/344M [00:07<00:00, 43.2MiB/s]
 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Downloading model to /root/.local/share/tts/vocoder_models--en--ljspeech--multiband-melgan
100% 82.8M/82.8M [00:02<00:00, 35.1MiB/s]
 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Using model: glow_tts
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.1
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_

## 📣 Listen to the synthesized wave 📣

In [None]:
from IPython.display import Audio

# Render an inline audio player for the synthesized file.
Audio("output.wav")
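
Besides listening, it can be useful to check the file's basic properties. The sketch below uses Python's standard `wave` module to read the sample rate, channel count, and duration. It assumes `output.wav` is an uncompressed PCM WAV file, which is the only kind `wave` can read; `wav_info` is a hypothetical helper name.

```python
import os
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / float(rate)
    return rate, channels, duration


if os.path.exists("output.wav"):
    rate, channels, duration = wav_info("output.wav")
    print(f"{rate} Hz, {channels} channel(s), {duration:.2f} s")
else:
    print("output.wav not found; run the synthesis cell first")
```

If synthesis succeeded, the reported sample rate should match the `sample_rate` value printed in the audio processor log above.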

### **Second things second**:

🔶 A TTS model can be trained on a single speaker's voice or on the voices of multiple speakers. This training choice determines what the model can do at inference time and which speaker voices are available for synthesis.

🔶 To run a multispeaker model from the released models list, first check the available speaker IDs with the `--list_speaker_idxs` flag, then pass one of those IDs to `--speaker_idx` to synthesize speech in that voice.
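
The two-step workflow described here (list the speaker IDs, then synthesize with one of them) can also be scripted from Python. The following is a minimal sketch that only assembles the argument lists and prints them. `build_tts_command` is a hypothetical helper, and the flags are limited to the ones used in this notebook.

```python
import shlex
from typing import List, Optional


def build_tts_command(
    model_name: str,
    text: Optional[str] = None,
    out_path: Optional[str] = None,
    speaker_idx: Optional[str] = None,
    list_speakers: bool = False,
) -> List[str]:
    """Assemble an argument list for the `tts` CLI (illustrative helper)."""
    cmd = ["tts", "--model_name", model_name]
    if list_speakers:
        # Only query the speaker IDs; no text is synthesized.
        cmd.append("--list_speaker_idxs")
        return cmd
    if text is None or out_path is None:
        raise ValueError("text and out_path are required for synthesis")
    cmd += ["--text", text, "--out_path", out_path]
    if speaker_idx is not None:
        cmd += ["--speaker_idx", speaker_idx]
    return cmd


list_cmd = build_tts_command("tts_models/en/vctk/vits", list_speakers=True)
synth_cmd = build_tts_command(
    "tts_models/en/vctk/vits",
    text="Trying out specific speaker voice",
    out_path="spkr-out.wav",
    speaker_idx="p341",
)

# Print shell-quoted versions of both commands.
print(shlex.join(list_cmd))
print(shlex.join(synth_cmd))
```

To actually run a command, pass the list to `subprocess.run(synth_cmd, check=True)`; this assumes the `tts` executable is on your `PATH`.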

In [None]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs


 > Downloading model to /root/.local/share/tts/tts_models--en--vctk--vits
100% 148M/148M [00:05<00:00, 29.1MiB/s]
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
Traceback (most 

## 💬 Synthesize speech using speaker ID 💬

In [None]:
!tts --text "Trying out specific speaker voice" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"

 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
Traceback (most recent call last):
  File "/usr/local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/TTS/bin/synthesize.py", line 43

## 📣 Listen to the synthesized speaker-specific wave 📣

In [None]:
from IPython.display import Audio

# Render an inline audio player for the speaker-specific output.
Audio("spkr-out.wav")

ValueError: ignored

🔶 To synthesize speech with an external speaker's voice, supply the `--speaker_wav` flag together with the path to an external speaker encoder model (`--encoder_path`) and its config file (`--encoder_config`), as follows:

First, we need to get the speaker encoder model, its config, and a reference `speaker_wav`:

In [None]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav

--2023-10-15 09:46:59--  https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/2d84c8bc-814b-474c-96e5-282d318667ba?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231015%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231015T094700Z&X-Amz-Expires=300&X-Amz-Signature=57af626c8c912dee0e90de8c06a853c244ec39de6ebfeca87f91e0cbb6a62572&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=265612440&response-content-disposition=attachment%3B%20filename%3Dconfig_se.json&response-content-type=application%2Foctet-stream [following]
--2023-10-15 09:47:00--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/2d84c8bc-814b-474c-96e5-282d3186
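
Before running synthesis with an external speaker, it can help to confirm that all three downloads landed in the working directory. This is a minimal sketch with file names taken from the `wget` URLs above; `missing_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List

# File names as they appear at the end of the wget URLs above.
REQUIRED = ["config_se.json", "model_se.pth.tar", "LJ001-0001.wav"]


def missing_files(names: Iterable[str], directory: str = ".") -> List[str]:
    """Return the names that do not exist as regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


missing = missing_files(REQUIRED)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All speaker encoder files are present.")
```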

In [None]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to speak?" \
--out_path spkr-out.wav \
--language_idx "en"

 > tts_models/multilingual/multi-dataset/your_tts is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-

## 📣 Listen to the synthesized speaker-specific wave 📣

In [None]:
from IPython.display import Audio

# Render an inline audio player for the externally voiced output.
Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech!
Follow up with the next tutorials to learn more advanced material.