# Lab 5: Text-to-Speech (TTS)


In this lab, we will first study how to train a neural TTS model, Tacotron 2, using SpeechBrain on the LJSpeech corpus.


#### <span style="color:green"> Text-to-Speech (TTS) with Tacotron2 trained on LJSpeech (with a pretrained model) </span>


The pretrained model takes a short text as input and produces a spectrogram as output. The final waveform is obtained by applying a vocoder (e.g., HiFi-GAN) to the generated spectrogram.

#### Perform TTS with a pretrained model




In [1]:
import os
import torch
import torchaudio
import IPython
import matplotlib.pyplot as plt
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

In [2]:
torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)

2.2.1
2.2.1
cpu


In [3]:
# Initialize TTS (Tacotron2) and vocoder (HiFi-GAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("Hello!")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform (22050 Hz is the LJSpeech sampling rate)
torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)

IPython.display.Audio(data=waveforms.squeeze(1), rate=22050)

INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/tts-tacotron2-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/tts-tacotron2-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch model.ckpt: Fetching from HuggingFace Hub 'speechbrain/tts-tacotron2-ljspeech' if not cached
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: model
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/tts-hifigan-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/tts-hifigan-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch generator.ckpt: Fetching from HuggingFace Hub 'speechbrain/tts-hifigan-ljspeech' if not cached
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: generator


#### <span style="color:green"> Train Text-to-Speech (TTS) </span>

The model is trained with SpeechBrain on a small subset of the original corpus.

Complete 3 steps:


1. ##### Reduce the number of files in the training dataset:

In [4]:
! mv LJSpeech-1.1/metadata.csv LJSpeech-1.1/metadata_original.csv
! head -n 128 LJSpeech-1.1/metadata_original.csv > LJSpeech-1.1/metadata.csv
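
Running the shell cell above a second time would move the already truncated `metadata.csv` over the backup. The same step can be sketched in Python so that the full file is backed up only once. This is a minimal sketch using the standard library; `subset_metadata` is a hypothetical helper name, not part of SpeechBrain.

```python
import shutil
from pathlib import Path


def subset_metadata(data_folder, n_lines=128):
    """Keep only the first n_lines of metadata.csv, backing up the full file once."""
    folder = Path(data_folder)
    metadata = folder / "metadata.csv"
    backup = folder / "metadata_original.csv"

    # Back up the full file only on the first run, so a re-run never
    # replaces the backup with an already truncated file.
    if not backup.exists():
        shutil.copyfile(metadata, backup)

    # Always build the subset from the full backup.
    with backup.open(encoding="utf-8") as src:
        lines = [line for _, line in zip(range(n_lines), src)]
    metadata.write_text("".join(lines), encoding="utf-8")
    return len(lines)
```

For this lab the call would be `subset_metadata("LJSpeech-1.1", n_lines=128)`; the return value is the number of lines actually written.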

2. ##### Change params in train.yaml:

    Path: speechbrain/recipes/LJSpeech/TTS/tacotron2/hparams/train.yaml

        - epochs: 8
        - keep_checkpoint_interval: 1
        - batch_size: 8
        - num_workers: 0
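
SpeechBrain recipes also accept hyperparameter overrides on the command line (the training command in this lab already passes `--data_folder` this way), so the four values above could probably be set without editing `train.yaml`. The sketch below only builds the argument list and does not start training. `build_train_command` is a hypothetical helper, and the sketch assumes that the four keys are top-level entries in the YAML file.

```python
def build_train_command(recipe_dir, data_folder, overrides):
    """Return the argv list for a SpeechBrain recipe run with YAML overrides."""
    command = ["python", f"{recipe_dir}/train.py", f"--data_folder={data_folder}"]
    # Each override is passed as --key=value (assumes top-level YAML keys).
    command += [f"--{key}={value}" for key, value in overrides.items()]
    # The hyperparameter file goes last, as in the lab's training command.
    command.append(f"{recipe_dir}/hparams/train.yaml")
    return command


# The four values requested in this lab.
lab_overrides = {
    "epochs": 8,
    "keep_checkpoint_interval": 1,
    "batch_size": 8,
    "num_workers": 0,
}
command = build_train_command(
    "speechbrain/recipes/LJSpeech/TTS/tacotron2", "LJSpeech-1.1", lab_overrides
)
print(" ".join(command))
```

The printed line can be pasted after `!` in a notebook cell, or the list can be passed to `subprocess.run(command, check=True)`.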


3. ##### Run training:

<span style="color:red"> **Exercise:**</span>
<span style="color:orange"> **Text-to-speech training**</span>

Training the full model takes too long, so in this exercise we will run only a few epochs on a subset of the training corpus to speed up the experiment:

1. Train for 8 (or more) epochs with:
 - keep_checkpoint_interval: 1
 - batch_size: 8
 - 128 files of the LJSpeech
 - num_workers: 0

2. Report the logs with the loss values on the training and validation sets. Which losses were used in training?

3. Do you observe overfitting? Please explain your answer.
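
One rough way to approach question 3 is to compare the direction of the two loss curves: overfitting typically shows up as a validation loss that starts rising while the training loss keeps falling. The sketch below is a simple heuristic, not a SpeechBrain utility. `first_overfitting_epoch` is a hypothetical name, and the loss values in the usage lines are made up for illustration; they are not results from this lab.

```python
def first_overfitting_epoch(train_losses, valid_losses, patience=2):
    """Return the 1-based epoch at which validation loss starts rising for
    `patience` consecutive epochs while training loss keeps falling, or None."""
    if len(train_losses) != len(valid_losses):
        raise ValueError("Both loss lists need one value per epoch.")

    streak = 0
    for i in range(1, len(train_losses)):
        train_falls = train_losses[i] < train_losses[i - 1]
        valid_rises = valid_losses[i] > valid_losses[i - 1]
        streak = streak + 1 if (train_falls and valid_rises) else 0
        if streak >= patience:
            # i is 0-based; report the first epoch of the streak, 1-based.
            return i - patience + 2
    return None


# Illustrative values only (not taken from the lab's training logs).
print(first_overfitting_epoch([5.0, 4.0, 3.0, 2.0, 1.0], [5.0, 4.0, 4.5, 5.0, 5.5]))  # 3
print(first_overfitting_epoch([5.0, 4.0, 3.0], [5.0, 4.5, 4.0]))  # None
```

With only a handful of epochs the curves may be too short for this check to be conclusive, so the per-epoch log values should still be reported alongside the answer.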

In [12]:
! python \
  speechbrain/recipes/LJSpeech/TTS/tacotron2/train.py \
  --device=cpu \
  --max_grad_norm=1.0 \
  --data_folder=LJSpeech-1.1 \
  speechbrain/recipes/LJSpeech/TTS/tacotron2/hparams/train.yaml

INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
INFO:speechbrain.utils.seed:Setting seed to 1234
speechbrain.utils.quirks - Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
speechbrain.utils.quirks - Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: ./results/tacotron2/1234
ljspeech_prepare - Skipping preparation, completed in previous run.
speechbrain.core - Gradscaler enabled: False. Using precision: fp32.
speechbrain.core - Tacotron2Brain Model Statistics:
* Total Number of Trainable Parameters: 28.2M
* Total Number of Parameters: 28.2M
* Trainable Parameters represent 100.0000% of the total size.
speechbrain.utils.checkpoints - Would load a c

In [None]:
# Comment: No, I did not observe overfitting during training.
# The model was trained for 8 epochs, and the training loss decreased with each epoch.
# The validation loss also decreased with each epoch.