# Easy Inferencing with 🐸 TTS ⚡

#### Want to quickly synthesize speech using a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server, open it in your favorite web browser, and 🗣️.

In this notebook, we will:
```
1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run a multispeaker 🐸 TTS model
```
So, let's jump right in!


## Install 🐸 TTS ⬇️

In [1]:
!pip install -U pip
!pip install TTS
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
!sudo apt-get install -y espeak-ng
!pip install espeakng

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
espeak-ng is already the newest version (1.50+dfsg-10).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.

## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g., TTS, vocoder), languages, training datasets, and architectures.

You can use either your own model or one of the released models under 🐸TTS.

Use `tts --list_models` to find out which models are available.



In [2]:
! tts --list_models

No API token found for 🐸Coqui Studio voices - https://coqui.ai 
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=<token>`


 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/xtts_v1
 2: tts_models/multilingual/multi-dataset/your_tts
 3: tts_models/multilingual/multi-dataset/bark
 4: tts_models/bg/cv/vits
 5: tts_models/cs/cv/vits
 6: tts_models/da/cv/vits
 7: tts_models/et/cv/vits
 8: tts_models/ga/cv/vits
 9: tts_models/en/ek1/tacotron2
 10: tts_models/en/ljspeech/tacotron2-DDC
 11: tts_models/en/ljspeech/tacotron2-DDC_ph
 12: tts_models/en/ljspeech/glow-tts [already downloaded]
 13: tts_models/en/ljspeech/speedy-speech
 14: tts_models/en/ljspeech/tacotron2-DCA
 15: tts_models/en/ljspeech/vits
 16: tts_models/en/ljspeech/vits--neon
 17: tts_models/en/ljspeech/fast_pitch
 18: tts_models/en/ljspeech/overflow
 19: tts_models/en/ljspeech/neural_hmm
 20: tts_models/en/vctk/vits [already downloaded
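
The listing above states the name format as `type/language/dataset/model`. As a minimal sketch, the four fields can be split out in Python to filter the list, for example by language. The names `ModelName`, `parse_model_name`, and `filter_by_language` are hypothetical helpers written for this illustration; they are not part of 🐸 TTS.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four slash-separated fields of a model name."""
    model_type: str
    language: str
    dataset: str
    model: str


def parse_model_name(full_name: str) -> ModelName:
    """Split 'type/language/dataset/model' into its four fields."""
    parts = full_name.strip().split("/")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 slash-separated fields, got {len(parts)}: {full_name!r}"
        )
    return ModelName(*parts)


def filter_by_language(names, language):
    """Keep only the model names whose language field matches."""
    return [n for n in names if parse_model_name(n).language == language]


# A few names copied from the listing above.
listed = [
    "tts_models/multilingual/multi-dataset/your_tts",
    "tts_models/bg/cv/vits",
    "tts_models/en/ljspeech/glow-tts",
    "tts_models/en/vctk/vits",
]
print(filter_by_language(listed, "en"))
```

The same helper makes it easy to group names by dataset or architecture instead of language.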

## ✅ Run a 🐸 TTS model

#### **First things first**: Using a release model and default vocoder:

You can simply copy the full model name from the list above and pass it to `--model_name`:


In [3]:
!tts --text "hello world" \
--model_name "tts_models/en/ljspeech/glow-tts" \
--out_path output.wav


 > tts_models/en/ljspeech/glow-tts is already downloaded.
Traceback (most recent call last):
  File "/usr/local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/TTS/bin/synthesize.py", line 387, in main
    model_path, config_path, model_item = manager.download_model(args.model_name)
  File "/usr/local/lib/python3.10/dist-packages/TTS/utils/manage.py", line 400, in download_model
    output_model_path, output_config_path = self._find_files(output_path)
  File "/usr/local/lib/python3.10/dist-packages/TTS/utils/manage.py", line 423, in _find_files
    raise ValueError(" [!] Model file not found in the output path")
ValueError:  [!] Model file not found in the output path
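
The traceback above says the model counts as already downloaded, yet no model file was found in its folder. One possible cause, which is an assumption and not something the output confirms, is an earlier download that was interrupted. The sketch below maps a model name to a folder name using the `--` convention visible in the download path that a later cell in this notebook prints, then checks whether a folder contains anything that looks like a checkpoint. The helper names and the `.pth` / `.pth.tar` suffix check are assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def cache_folder_name(model_name: str) -> str:
    """Turn 'tts_models/en/ljspeech/glow-tts' into 'tts_models--en--ljspeech--glow-tts'."""
    return model_name.replace("/", "--")


def has_checkpoint(folder: Path) -> bool:
    """Return True if the folder holds at least one file that looks like a checkpoint.

    Assumption: checkpoint files end in '.pth' or '.pth.tar'.
    """
    if not folder.is_dir():
        return False
    return any(
        p.is_file() and p.name.endswith((".pth", ".pth.tar"))
        for p in folder.iterdir()
    )


# Demonstration on a throwaway directory, so the real model cache is not touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / cache_folder_name("tts_models/en/ljspeech/glow-tts")
    folder.mkdir()
    (folder / "config.json").write_text("{}")
    config_only = has_checkpoint(folder)      # only a config file is present
    (folder / "model_file.pth").write_bytes(b"\0")
    with_checkpoint = has_checkpoint(folder)  # a checkpoint-like file now exists

print(config_only, with_checkpoint)
```

If the real cache folder turns out to be incomplete, removing it so that the next `tts` call downloads the model again is worth trying.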


## 📣 Listen to the synthesized wave 📣

In [5]:
from IPython.display import Audio

# Play the file written by the synthesis cell above
Audio("output.wav")

ValueError: ignored
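
The playback cell above fails because the preceding synthesis cell never wrote `output.wav`. A small check before playing can make that easier to spot. The sketch below uses only the standard-library `wave` module and assumes the file is a PCM WAV that `wave` can read; `wav_info` is a hypothetical helper name.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate, duration_seconds) for a PCM WAV file, or None if it is missing."""
    if not os.path.isfile(path):
        return None
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


# Self-contained demo: write one second of 16-bit mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
    wav.setframerate(22050)   # a sample rate that also appears in this notebook's logs
    wav.writeframes(b"\x00\x00" * 22050)

missing_path = os.path.join(tempfile.gettempdir(), "no-such-file-for-demo.wav")
print(wav_info(demo_path))
print(wav_info(missing_path))
```

In the notebook, calling `wav_info("output.wav")` before the playback cell returns `None` when synthesis did not produce a file.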

### **Second things second**:

🔶 A TTS model can be trained on a single speaker's voice or on multiple speakers' voices. This training choice directly determines which speaker voices are available at inference time.

🔶 If you want to run a multispeaker model from the released models list, first check the speaker IDs with the `--list_speaker_idxs` flag, then pass one of them to `--speaker_idx` to synthesize speech in that voice.

In [6]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs


 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{'ED\n': 0, 'p225': 1, 'p226': 2
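
The mapping printed above (truncated here) pairs each speaker name with an integer index, and the first key carries a stray trailing newline (`'ED\n'`). As a minimal sketch, the names can be tidied for display and a chosen speaker validated before synthesis. Both helper names are hypothetical. Whether the CLI accepts the stripped form of that first entry is not something this notebook shows.

```python
def clean_speaker_ids(raw):
    """Strip stray whitespace from speaker names, keeping their integer indices."""
    return {name.strip(): idx for name, idx in raw.items()}


def require_speaker(speakers, name):
    """Return the name if it is a known speaker, else raise with a short hint."""
    if name not in speakers:
        known = ", ".join(sorted(speakers)[:5])
        raise KeyError(f"unknown speaker {name!r}; known ids start with: {known}")
    return name


# Only the entries visible in the truncated output above.
raw_ids = {"ED\n": 0, "p225": 1, "p226": 2}
speakers = clean_speaker_ids(raw_ids)
print(speakers)
print(require_speaker(speakers, "p225"))
```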

## 💬 Synthesize speech using speaker ID 💬

In [19]:
!tts --text "ccfdfgxfgxdxgdfcxzzfdgvbghfhgfhgyjghjuyghuygfdhtfhytryhtrhtreyher5uhedr5uj5eryhtrgrftytretyetyhrewftgdrewftgetgewtgetyethytr5ytretghrefewrw" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"

 > tts_models/en/vctk/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > Text: ccfdfgxfgxdxgdfcxzzfdgvbghfhgfhgyjghjuyghuygfdhtfhytryhtrhtreyher5uhedr5uj5eryhtrgrftytretyetyhrewftgdrewftgetgewtgetyethytr5ytr
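
The two `tts` invocations so far differ only in the model name, output path, and optional `--speaker_idx`. A sketch of assembling the same command from Python follows; it uses only flags that appear in this notebook, and `build_tts_command` is a hypothetical helper, not a 🐸 TTS function. Passing arguments as a list also sidesteps shell quoting and line-continuation issues.

```python
import shlex


def build_tts_command(text, model_name, out_path, speaker_idx=None):
    """Assemble the argument list for the `tts` CLI using flags shown in this notebook."""
    cmd = ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]
    if speaker_idx is not None:
        cmd += ["--speaker_idx", speaker_idx]
    return cmd


cmd = build_tts_command(
    text="hello world",
    model_name="tts_models/en/vctk/vits",
    out_path="spkr-out.wav",
    speaker_idx="p341",
)
# shlex.join quotes each argument, so the printed line can be pasted into a shell.
print(shlex.join(cmd))
# To actually run it (requires the tts CLI): subprocess.run(cmd, check=True)
```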

## 📣 Listen to the synthesized speaker-specific wave 📣

In [20]:
from IPython.display import Audio

# Play the speaker-specific file written by the cell above
Audio("spkr-out.wav")

🔶 To synthesize speech in an external speaker's voice, pass a reference recording with the `--speaker_wav` flag, along with the path to an external speaker encoder and its config file, as follows:

First, we need to get the speaker encoder model, its config, and a reference `speaker_wav`:

In [9]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav

--2023-10-15 09:53:10--  https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/2d84c8bc-814b-474c-96e5-282d318667ba?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231015%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231015T095310Z&X-Amz-Expires=300&X-Amz-Signature=6cf84e673f4f1305094841cd3f5dfc78a1daa971b9469232e55faf9405b683cc&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=265612440&response-content-disposition=attachment%3B%20filename%3Dconfig_se.json&response-content-type=application%2Foctet-stream [following]
--2023-10-15 09:53:10--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/265612440/2d84c8bc-814b-474c-96e5-282d318667ba
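
The `wget` log above is cut off, so it does not show whether all three downloads completed. Before running the next cell, it may help to confirm that each file exists and is not empty. The following is a minimal sketch; `missing_or_empty` is a hypothetical helper, and the demo runs against a temporary folder so it does not depend on the real downloads.

```python
from pathlib import Path
import tempfile


def missing_or_empty(paths):
    """Return the paths (as strings) that do not exist or have zero bytes."""
    bad = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            bad.append(str(p))
    return bad


# Demo in a temporary folder: one good file, one empty file, one absent file.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "config_se.json").write_text("{}")
(demo_dir / "model_se.pth.tar").write_bytes(b"")
demo_paths = [
    demo_dir / name
    for name in ("config_se.json", "model_se.pth.tar", "LJ001-0001.wav")
]
problems = missing_or_empty(demo_paths)
print([Path(p).name for p in problems])
```

In the notebook itself, the call would be `missing_or_empty(["config_se.json", "model_se.pth.tar", "LJ001-0001.wav"])`, and an empty list means all three files are in place.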

In [10]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to dim the lights so people can see that a bit better?" \
--out_path spkr-out.wav \
--language_idx "en"

 > Downloading model to /root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts
100% 425M/425M [00:16<00:00, 25.1MiB/s]
 > Model's license - CC BY-NC-ND 4.0
 > Check https://creativecommons.org/licenses/by-nc-nd/4.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > w

## 📣 Listen to the synthesized speaker-specific wave 📣

In [11]:
from IPython.display import Audio

# Play the file synthesized with the external reference speaker
Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech!
Follow up with the next tutorials to learn more advanced material.