# Easy Inferencing with 🐸 TTS ⚡

#### Want to quickly synthesize speech with a Coqui 🐸 TTS model?

💡: Grab a pre-trained model and use it to synthesize speech using any speaker voice, including yours! ⚡

🐸 TTS comes with a list of pretrained models and speaker voices. You can even start a local demo server, open it in your favorite web browser, and 🗣️.

In this notebook, we will: 
```
1. List available pre-trained 🐸 TTS models
2. Run a 🐸 TTS model
3. Listen to the synthesized wave 📣
4. Run a multispeaker 🐸 TTS model
```
So, let's jump right in!


## Install 🐸 TTS ⬇️

In [None]:
! pip install -U pip
! pip install TTS

## ✅ List available pre-trained 🐸 TTS models

Coqui 🐸TTS comes with a list of pretrained models covering different model types (e.g., TTS, vocoder), languages, training datasets, and architectures.

You can use either your own model or one of the released 🐸TTS models.

Use `tts --list_models` to list the available models.



In [1]:
!tts --list_models

Expecting value: line 1 column 1 (char 0)

 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/xtts_v1 [already downloaded]
 2: tts_models/multilingual/multi-dataset/your_tts [already downloaded]
 3: tts_models/multilingual/multi-dataset/bark [already downloaded]
 4: tts_models/bg/cv/vits
 5: tts_models/cs/cv/vits
 6: tts_models/da/cv/vits
 7: tts_models/et/cv/vits
 8: tts_models/ga/cv/vits
 9: tts_models/en/ek1/tacotron2 [already downloaded]
 10: tts_models/en/ljspeech/tacotron2-DDC
 11: tts_models/en/ljspeech/tacotron2-DDC_ph
 12: tts_models/en/ljspeech/glow-tts [already downloaded]
 13: tts_models/en/ljspeech/speedy-speech [already downloaded]
 14: tts_models/en/ljspeech/tacotron2-DCA [already downloaded]
 15: tts_models/en/ljspeech/vits
 16: tts_models/en/ljspeech/vits--neon
 17: tts_models/en/ljspeech/fast_pitch [already downloaded]
 18: tts_models/en/ljspeech/overflow [already downloaded]
 19: tts_models/en/ljspeech/neural_hmm
 20: tts_models/en/vc
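
The `type/language/dataset/model` name format shown at the top of the listing can be split programmatically, for example to filter models by language. This is a minimal sketch: `parse_model_name` and `ModelName` are hypothetical helpers, not part of 🐸TTS, and the code assumes a name has exactly four `/`-separated fields.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    model_type: str
    language: str
    dataset: str
    model: str


def parse_model_name(full_name: str) -> ModelName:
    """Split a 'type/language/dataset/model' string into its four fields."""
    parts = full_name.strip().split("/")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '/'-separated fields, got {len(parts)}: {full_name!r}"
        )
    return ModelName(*parts)


name = parse_model_name("tts_models/en/ljspeech/glow-tts")
print(name.language, name.model)
```

Names that do not have four fields raise a `ValueError` instead of being guessed at.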

## ✅ Run a 🐸 TTS model

#### **First things first**: Using a release model and default vocoder:

You can simply copy the full model name from the list above and pass it to `--model_name`.


In [28]:
!tts --text "My name is Boris and I am a typical gundam" \
--model_name "tts_models/multilingual/multi-dataset/xtts_v1" \
--out_path output.wav


^C
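
When the command is launched from Python rather than a `!` cell, building the argument list explicitly avoids shell quoting and line-continuation pitfalls. This is a sketch that assumes the `tts` executable is on `PATH`. `build_tts_command` is a hypothetical helper, and it only uses flags that appear in this notebook.

```python
import shlex
from typing import List, Optional


def build_tts_command(
    text: str,
    model_name: str,
    out_path: str,
    speaker_idx: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the `tts` CLI (no shell involved)."""
    cmd = ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]
    if speaker_idx is not None:
        cmd += ["--speaker_idx", speaker_idx]
    return cmd


cmd = build_tts_command(
    "Hello from Coqui TTS",
    "tts_models/en/ljspeech/glow-tts",
    "output.wav",
)
print(shlex.join(cmd))  # shell-safe rendering, useful for logging

# To actually run the command:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` keeps the text as a single argument even when it contains quotes or spaces.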


## 📣 Listen to the synthesized wave 📣

In [1]:
from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/tacotron2-DCA", gpu=True)

# Synthesize speech with the model's default settings and save it to a file
tts.tts_to_file(text="What is the strongest muscle in the human body? [laughs] It's obviously dick!",
                file_path="output.wav")



 > Downloading model to C:\Users\super\AppData\Local\tts\tts_models--en--ljspeech--tacotron2-DCA


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339M/339M [02:02<00:00, 2.77MiB/s]


 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > vocoder_models/en/ljspeech/multiband-melgan is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:C:\Users\super\AppData\Local\tts\tts_models--en--ljspeech--tacotron2-DCA\scale_stats.npy
 | > base:10
 | > hop_length:256
 | > wi



 > Text splitted to sentences.
['What is the strongest muscle in the human body?', "[laughs] It's obviously dick!"]
 > Processing time: 3.5563182830810547
 > Real-time factor: 0.712155061591264


'output.wav'
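
The real-time factor in the log is the processing time divided by the duration of the generated audio, so a value below 1 means synthesis ran faster than playback. The sketch below applies that relationship to the two numbers logged above. The helper names are hypothetical.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time per second of generated audio."""
    return processing_s / audio_s


def audio_duration(processing_s: float, rtf: float) -> float:
    """Invert the real-time factor to recover the audio duration in seconds."""
    return processing_s / rtf


processing_s = 3.5563182830810547  # "Processing time" from the log
rtf = 0.712155061591264            # "Real-time factor" from the log

duration_s = audio_duration(processing_s, rtf)
print(f"about {duration_s:.2f} s of audio")
```

For this run, that works out to roughly five seconds of audio generated in about 3.6 seconds.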

In [2]:
text = """
    What is the strongest muscle in the human body?
    [laughs] It's obviously dick!
"""

from TTS.tts.configs.bark_config import BarkConfig
from TTS.tts.models.bark import Bark

config = BarkConfig()
model = Bark.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="models/bark-gura", eval=True)

# with random speaker
# output_dict = model.synthesize(text, config, speaker_id="random", voice_dirs=None)

# cloning a speaker.
# It assumes that you have a speaker file in `bark_voices/speaker_n/speaker.wav` or `bark_voices/speaker_n/speaker.npz`
output_dict = model.synthesize(text, config, speaker_id="gura2", voice_dirs="bark_voices/")

Some weights of the model checkpoint at facebook/hubert-base-ls960 were not used when initializing HubertModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing HubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HubertModel were not initialized from the model checkpoint at facebook/hubert-base-ls960 and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for pre

In [3]:
import IPython
from scipy.io.wavfile import write

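# Uncomment to save the Bark output to disk first; the sample rate must match the model's output rate.
# write("output.wav", 25050, output_dict['wav'])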
IPython.display.Audio("output.wav")
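
Before playing a file, it can help to confirm its sample rate and duration. This is a minimal sketch using only the standard library. It assumes the file is uncompressed PCM WAV, which the `wave` module requires, and it writes a short test tone to a temporary `tone.wav` (a hypothetical file name) so the example runs without a model.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate


def write_test_tone(path, rate=22050, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone, used here only as stand-in audio."""
    n_samples = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(frames)


with tempfile.TemporaryDirectory() as tmp_dir:
    tone_path = os.path.join(tmp_dir, "tone.wav")
    write_test_tone(tone_path)
    rate, seconds = wav_info(tone_path)

print(rate, round(seconds, 2))
```

Calling `wav_info("output.wav")` on a synthesized file should report the sample rate the model was configured with, provided the file was saved as PCM WAV.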

### **Second things second**:

🔶 A TTS model can be trained on either a single speaker voice or multiple speaker voices. This training choice directly determines what the model can do at inference time and which speaker voices are available for synthesis.

🔶 To run a multispeaker model from the released models list, first check the available speaker IDs with the `--list_speaker_idxs` flag, then pass one of them to synthesize speech.

In [None]:
# list the possible speaker IDs.
!tts --model_name "tts_models/en/vctk/vits" \
--list_speaker_idxs 


## 💬 Synthesize speech using speaker ID 💬

In [None]:
!tts --text "Trying out specific speaker voice" \
--out_path spkr-out.wav --model_name "tts_models/en/vctk/vits" \
--speaker_idx "p341"
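
The same selection can be done from Python. The sketch below assumes the `TTS.api.TTS` object exposes a `speakers` list and that `tts_to_file` accepts a `speaker` argument, which may vary between 🐸TTS releases. `synthesize_with_speaker` is a hypothetical wrapper that fails early when the speaker ID is unknown.

```python
def synthesize_with_speaker(tts, text, speaker, out_path):
    """Synthesize `text` with a named speaker, rejecting unknown speaker IDs."""
    available = tts.speakers or []
    if speaker not in available:
        raise ValueError(
            f"unknown speaker {speaker!r}; {len(available)} speakers available"
        )
    return tts.tts_to_file(text=text, speaker=speaker, file_path=out_path)


# Usage (downloads the model on first run):
# from TTS.api import TTS
# tts = TTS("tts_models/en/vctk/vits")
# synthesize_with_speaker(tts, "Trying out specific speaker voice", "p341", "spkr-out.wav")
```

Checking the ID up front gives a clearer error than letting the model fail on a missing speaker embedding.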

## 📣 Listen to the synthesized speaker specific wave 📣

In [None]:
import IPython
IPython.display.Audio("spkr-out.wav")

🔶 To synthesize speech with an external speaker, supply the `--speaker_wav` flag along with an external speaker encoder path and config file, as follows:

First, we need to get the speaker encoder model, its config, and a reference `speaker_wav`:

In [None]:
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
!wget https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
!wget https://github.com/coqui-ai/TTS/raw/speaker_encoder_model/tests/data/ljspeech/wavs/LJ001-0001.wav
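
If one of these downloads fails, the synthesis command will only fail later, with a less obvious error, so it can be worth checking the files first. This is a sketch: `missing_files` is a hypothetical helper, and the file names are the ones fetched with `wget` above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the entries of `paths` that are absent or empty on disk."""
    missing = []
    for p in paths:
        path = Path(p)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(p)
    return missing


required = ["model_se.pth.tar", "config_se.json", "LJ001-0001.wav"]
print("missing:", missing_files(required))
```

An empty list means all three files are present and non-empty.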

In [None]:
!tts --model_name tts_models/multilingual/multi-dataset/your_tts \
--encoder_path model_se.pth.tar \
--encoder_config config_se.json \
--speaker_wav LJ001-0001.wav \
--text "Are we not allowed to dim the lights so people can see that a bit better?" \
--out_path spkr-out.wav \
--language_idx "en"

## 📣 Listen to the synthesized speaker specific wave 📣

In [None]:
import IPython
IPython.display.Audio("spkr-out.wav")

## 🎉 Congratulations! 🎉 You now know how to use a TTS model to synthesize speech! 
Follow up with the next tutorials to learn more advanced material.