# Easy Inferencing with 🐸 TTS ⚡

#### Want to quickly synthesize speech with a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server and open it in your favorite web browser to 🗣️.

In this notebook, we will: 
```
1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run a multispeaker 🐸 TTS model
```
So, let's jump right in!


## Install 🐸 TTS ⬇️

In [1]:
! pip install -U pip
! pip install TTS

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.3.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.3.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting TTS
  Downloading TTS-0.10.1-cp38-cp38-manylinux1_x86_64.whl (590 kB)
Collecting numba==0.55.1
  Downloading numba-0.55.1-1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.4 MB)
Collecting umap-l

## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g., TTS, vocoder), languages, training datasets, and architectures.

You can use either your own model or one of the released models under 🐸TTS.

Use `tts --list_models` to see the available models.



In [2]:
! tts --list_models

 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/your_tts
 2: tts_models/bg/cv/vits
 3: tts_models/cs/cv/vits
 4: tts_models/da/cv/vits
 5: tts_models/et/cv/vits
 6: tts_models/ga/cv/vits
 7: tts_models/en/ek1/tacotron2
 8: tts_models/en/ljspeech/tacotron2-DDC
 9: tts_models/en/ljspeech/tacotron2-DDC_ph
 10: tts_models/en/ljspeech/glow-tts
 11: tts_models/en/ljspeech/speedy-speech
 12: tts_models/en/ljspeech/tacotron2-DCA
 13: tts_models/en/ljspeech/vits
 14: tts_models/en/ljspeech/vits--neon
 15: tts_models/en/ljspeech/fast_pitch
 16: tts_models/en/ljspeech/overflow
 17: tts_models/en/vctk/vits
 18: tts_models/en/vctk/fast_pitch
 19: tts_models/en/sam/tacotron-DDC
 20: tts_models/en/blizzard2013/capacitron-t2-c50
 21: tts_models/en/blizzard2013/capacitron-t2-c150_v2
 22: tts_models/es/mai/tacotron2-DDC
 23: tts_models/es/css10/vits
 24: tts_models/fr/mai/tacotron2-DDC
 25: tts_models/fr/css10/vits
 26: tts_models/uk/mai/glow-tts
 27: tts_models/uk/ma
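
The listing follows the `type/language/dataset/model` name format shown in its first line. As a minimal sketch, the names can be split into those four fields to filter by language. The helpers `parse_model_name` and `filter_by_language` are hypothetical names introduced here for illustration and are not part of 🐸TTS.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Fields of a model name in the type/language/dataset/model format."""
    model_type: str
    language: str
    dataset: str
    model: str


def parse_model_name(full_name: str) -> ModelName:
    """Split a full model name into its four slash-separated fields."""
    parts = full_name.strip().split("/")
    if len(parts) != 4:
        raise ValueError(f"expected type/language/dataset/model, got {full_name!r}")
    return ModelName(*parts)


def filter_by_language(names, language):
    """Keep only the model names whose language field matches."""
    return [n for n in names if parse_model_name(n).language == language]


# A few names copied from the listing above.
names = [
    "tts_models/multilingual/multi-dataset/your_tts",
    "tts_models/en/ljspeech/glow-tts",
    "tts_models/en/vctk/vits",
    "tts_models/es/css10/vits",
]
english = filter_by_language(names, "en")
print(english)
```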

## ✅ Run a 🐸 TTS model

#### **First things first**: Using a released model and the default vocoder:

Simply copy the full model name from the list above and pass it to `--model_name`:


In [8]:
!tts --text "The general goal of this project is to study the processes and outcomes of speech perception training in postlingually deafened adults fitted with cochlear implants" \
--model_name "tts_models/en/ljspeech/glow-tts" \
--out_path output.wav


 > tts_models/en/ljspeech/glow-tts is already downloaded.
 > vocoder_models/en/ljspeech/multiband-melgan is already downloaded.
 > Using model: glow_tts
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.1
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Vocoder Model: multiband_melgan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resam
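
If you prefer to drive the same command from Python instead of a `!` shell cell, one possible sketch is to build the argument list explicitly. It uses only the three flags from the cell above (`--text`, `--model_name`, `--out_path`); the helper name `build_tts_command` and the sample text are assumptions made for illustration.

```python
import shlex


def build_tts_command(text, model_name, out_path):
    """Return the argument list for one `tts` invocation."""
    return [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--out_path", out_path,
    ]


cmd = build_tts_command(
    text="Hello from the notebook.",
    model_name="tts_models/en/ljspeech/glow-tts",
    out_path="output.wav",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would then run the synthesis without any shell quoting concerns, assuming the `tts` executable is on your `PATH`.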

## 📣 Listen to the synthesized wave 📣

In [9]:
import IPython
IPython.display.Audio("output.wav")
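
Besides listening, it can help to check the file's basic properties. The sketch below reads the sample rate, channel count, and duration with the standard `wave` module, assuming the file is an uncompressed PCM WAV. The helpers `wav_info` and `write_silence` are hypothetical, and the silent file is only a stand-in so the example runs without a synthesized `output.wav`.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def write_silence(path, seconds=1.0, rate=22050):
    """Write a mono 16-bit silent WAV, used here only as a stand-in file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


demo_path = os.path.join(tempfile.gettempdir(), "demo-silence.wav")
write_silence(demo_path)
rate, channels, duration = wav_info(demo_path)
print(f"{rate} Hz, {channels} channel(s), {duration:.2f} s")
```

Calling `wav_info("output.wav")` on the synthesized file should report the sample rate printed in the log above (`sample_rate:22050`), provided the file was written as PCM.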

### **Second things second**:

🔶 A TTS model can be trained on a single speaker's voice or on multiple speakers' voices. This choice determines what the model can do at inference time and which speaker voices are available for synthesis.

🔶 To run a multispeaker model from the released models list, first check the available speaker IDs with the `--list_speaker_idxs` flag, then pass one of them to `--speaker_idx` to synthesize speech in that voice.

In [10]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs 


 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
Traceback (most recent call last):
  File "/usr/local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/TTS/bin/synthesize.py", line 316

## 💬 Synthesize speech using speaker ID 💬

In [14]:
!pip install py-espeak-ng

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting py-espeak-ng
  Downloading py_espeak_ng-0.1.8-py2.py3-none-any.whl (6.3 kB)
Installing collected packages: py-espeak-ng
Successfully installed py-espeak-ng-0.1.8

In [37]:
!tts --model_name "tts_models/en/vctk/vits"  --list_speaker_idxs

 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{'ED\n': 0, 'p225': 1, 'p226': 2
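
The printed mapping goes from speaker name to index, and the names must be matched exactly as printed (note that the first key contains a trailing newline). A small sketch of validating a speaker name before passing it to `--speaker_idx` could look like this. The helper `require_speaker` is a hypothetical name, and the dictionary contains only the three entries visible in the truncated output above.

```python
def require_speaker(speaker_to_index, speaker):
    """Return the speaker name if it is a known key, otherwise raise KeyError."""
    if speaker not in speaker_to_index:
        # repr() makes stray whitespace in the known names visible.
        known = ", ".join(repr(name) for name in speaker_to_index)
        raise KeyError(f"unknown speaker {speaker!r}; known speakers: {known}")
    return speaker


# Only the entries visible in the truncated output above.
speaker_to_index = {"ED\n": 0, "p225": 1, "p226": 2}

print(require_speaker(speaker_to_index, "p225"))

try:
    # Fails because the key is stored with a trailing newline.
    require_speaker(speaker_to_index, "ED")
except KeyError as err:
    print(err)
```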

In [23]:
!pip install pyespeak
!pip install speake3  # Python 3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
ERROR: Could not find a version that satisfies the requirement pyespeak (from versions: none)
ERROR: No matching distribution found for pyespeak
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting speake3
  Downloading speake3-0.3.tar.gz (2.9 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: speake3
  Building wheel for speake3 (setup.py) ... done
  Created wheel for speake3: filename=speake3-0.3-py3-none-any.whl size=3571 sha256=bb8930214bc55dce78ae5fed2753d7aef4373d3a0c7ee9604f04c2815fd719f2
  Stored in directory: /root/.cache/pip/wheels/dd/f0/39/385b576e6dbe1845dc82a733169bf5967d78b9ccd4e144161a
Successfully built speake3
Installing collected packages: speake3
Successfully installed speake3-0.3

In [28]:
!pip install espeakng

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting espeakng
  Downloading espeakng-1.0.2-py3-none-any.whl (16 kB)
Installing collected packages: espeakng
Successfully installed espeakng-1.0.2

In [38]:
!tts --text "Trying out specific speaker voice" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"

 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > Text: Trying out specific speaker voice
 > Text splitted to sentences.
['Trying out specific speaker voice']
 > Processing time: 3.857
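
The `| > key:value` lines in these logs describe the audio settings each model was loaded with. As a rough sketch, they can be collected into a dictionary so settings from different runs are easy to compare. The function `parse_audio_settings` is a hypothetical helper, and the sample log is a shortened copy of the output above.

```python
def parse_audio_settings(log_text):
    """Collect `| > key:value` lines from the log into a dict of strings."""
    settings = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("| >"):
            continue
        # Split on the first colon only, so values such as "np.log10" stay intact.
        key, sep, value = line[3:].strip().partition(":")
        if sep:
            settings[key] = value
    return settings


log = """
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > fft_size:1024
 | > hop_length:256
 | > win_length:1024
"""
settings = parse_audio_settings(log)
print(settings["sample_rate"], settings["hop_length"])
```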

In [36]:
from shutil import which
print(which('espeak'))
print(which('espeak-ng'))

/usr/bin/espeak
None
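
The two `which` calls above can be folded into a single check. This is only a sketch: the helper `find_espeak` is a hypothetical name, and trying `espeak-ng` before `espeak` is an assumption made here, not a documented 🐸TTS preference.

```python
from shutil import which


def find_espeak(candidates=("espeak-ng", "espeak"), lookup=which):
    """Return (name, path) of the first candidate binary found on PATH, or None."""
    for name in candidates:
        path = lookup(name)
        if path:
            return name, path
    return None


found = find_espeak()
print(found if found else "no espeak binary found on PATH")
```

The `lookup` parameter exists only so the search can be exercised with a stand-in function instead of the real `PATH`.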


In [34]:
!pip search espeak
!pip install py-espeak-ng
!sudo apt-get install python-espeak
!sudo apt-get update && sudo apt-get install espeak

ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI no longer supports 'pip search' (or XML-RPC search). Please use https://pypi.org/search (via a browser) instead. See https://warehouse.pypa.io/api-reference/xml-rpc.html#deprecated-methods for more information.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0 python-espeak
0 upgraded, 5 newly installed, 0 to remove and 20 not upgraded.
Need to get 1,166 kB of archives.
After this operation, 2,859 kB of additional disk space will be 

## 📣 Listen to the synthesized speaker-specific wave 📣

In [39]:
import IPython
IPython.display.Audio("spkr-out.wav")

🔶 To synthesize speech in the voice of an external speaker, supply the `--speaker_wav` flag along with the path to an external speaker encoder and its config file.

First, we need to get the speaker encoder model, its config, and a reference `speaker_wav`:

In [40]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav

--2022-12-27 13:45:01--  https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/2d84c8bc-814b-474c-96e5-282d318667ba?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221227%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221227T134502Z&X-Amz-Expires=300&X-Amz-Signature=08e5328330c4b3e6c3e1fb253f0871f2804b7678ff0ccad2f17d1ce13f218f20&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=265612440&response-content-disposition=attachment%3B%20filename%3Dconfig_se.json&response-content-type=application%2Foctet-stream [following]
--2022-12-27 13:45:02--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/2d84c8bc-814b-474c-96e5-282d318667ba
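
Before running the synthesis, it may be worth confirming that all three downloads actually landed in the working directory. The following is a minimal sketch; `missing_files` is a hypothetical helper, and the file names are the ones requested by the `wget` commands above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not exist as files or are empty."""
    missing = []
    for path in map(Path, paths):
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing


required = ["config_se.json", "model_se.pth.tar", "LJ001-0001.wav"]
problems = missing_files(required)
if problems:
    print("Missing or empty:", ", ".join(problems))
else:
    print("All speaker-encoder files are in place.")
```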

In [41]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to dim the lights so people can see that a bit better?" \
--out_path spkr-out.wav \
--language_idx "en"

 > Downloading model to /root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts
100% 425M/425M [00:15<00:00, 26.9MiB/s]
 > Model's license - CC BY-NC-ND 4.0
 > Check https://creativecommons.org/licenses/by-nc-nd/4.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > w

## 📣 Listen to the synthesized speaker-specific wave 📣

In [42]:
import IPython
IPython.display.Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech! 
Follow up with the next tutorials to learn more advanced material.