# Lab 5: Text-to-Speech (TTS)


In this lab, we will first study how to train a neural TTS model, Tacotron 2, using SpeechBrain on the LJSpeech corpus.


#### <span style="color:green"> Text-to-Speech (TTS) with Tacotron2 trained on LJSpeech (with a pretrained model) </span>


The pretrained model takes a short text as input and produces a spectrogram as output. The final waveform is obtained by applying a vocoder (e.g., HiFi-GAN) to the generated spectrogram.
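As a quick sanity check on the spectrogram-to-waveform step, the duration of the vocoder output can be estimated from the number of spectrogram frames. The sketch below assumes that each frame is upsampled to 256 waveform samples (the hop length) and that the sampling rate is 22050 Hz; both values are assumptions and should be checked against the model's `hyperparams.yaml`. The function name `estimate_duration_seconds` is hypothetical.

```python
def estimate_duration_seconds(n_frames, hop_length=256, sample_rate=22050):
    """Estimate audio duration from a spectrogram frame count.

    Assumes each frame corresponds to `hop_length` waveform samples
    (an assumption, not read from the model configuration).
    """
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    n_samples = n_frames * hop_length
    return n_samples / sample_rate


# Illustrative frame count, not taken from a real run.
print(f"{estimate_duration_seconds(100):.3f} s")
```

If the estimate differs strongly from the length of the saved file, the hop length or sampling rate assumed here probably does not match the model.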

#### Perform TTS with a pretrained model




In [1]:
import os
import torch
import torchaudio
import IPython
import matplotlib.pyplot as plt
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [disable_jit_profiling, allow_tf32]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []


In [2]:
torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)

2.5.1+cu118
2.5.1+cu118
cpu


In [3]:
# Initialize TTS (Tacotron2) and vocoder (HiFi-GAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("Hello!")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform
torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)

IPython.display.Audio(data=waveforms.squeeze(1), rate=22050)

INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/tts-tacotron2-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/tts-tacotron2-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch model.ckpt: Fetching from HuggingFace Hub 'speechbrain/tts-tacotron2-ljspeech' if not cached
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: model
  state_dict = torch.load(path, map_location=device)
INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/tts-hifigan-ljspeech' if not cached
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/tts-hifigan-ljspeech' if not cached
  WeightNorm.apply(module, name, dim)
INFO:speechbrain.utils.fetching:Fetch generator.ckpt: Fetching from HuggingFace Hub 'speechbrain/tts-hifigan-ljspeech' if not cached
INFO:speechbrain.utils.parameter_tran

#### <span style="color:green"> Train Text-to-Speech (TTS) </span>

The model is trained with SpeechBrain on a small subset of the original corpus.

Complete the following three steps:


1. ##### Reduce number of files in the train dataset:

In [4]:
! mv LJSpeech-1.1/metadata.csv LJSpeech-1.1/metadata_original.csv
! head -n 128 LJSpeech-1.1/metadata_original.csv > LJSpeech-1.1/metadata.csv
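The same reduction can be sketched in Python, which may be convenient where `mv` and `head` are not available. The file names follow the shell commands above; the function name `truncate_metadata` is hypothetical, and the backup check is an added safeguard rather than part of the original commands.

```python
import shutil
from itertools import islice
from pathlib import Path


def truncate_metadata(data_folder, n_lines=128):
    """Keep only the first `n_lines` entries of metadata.csv.

    The full file is preserved as metadata_original.csv.
    Returns the number of lines written.
    """
    folder = Path(data_folder)
    metadata = folder / "metadata.csv"
    backup = folder / "metadata_original.csv"

    # Back up the full metadata only once, so re-running the cell
    # never replaces the backup with an already truncated file.
    if not backup.exists():
        shutil.move(str(metadata), str(backup))

    # Read at most n_lines lines from the full metadata file.
    with backup.open("r", encoding="utf-8") as src:
        kept = list(islice(src, n_lines))

    metadata.write_text("".join(kept), encoding="utf-8")
    return len(kept)
```

A call such as `truncate_metadata("LJSpeech-1.1", n_lines=128)` would then match the two shell commands.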

2. ##### Change params in train.yaml:

    Path: speechbrain/recipes/LJSpeech/TTS/tacotron2/hparams/train.yaml

        - epochs: 8
        - keep_checkpoint_interval: 1
        - batch_size: 8
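Instead of editing the file by hand, the three values can be changed programmatically. The sketch below is a minimal line-based approach, assuming each parameter appears in `train.yaml` as a top-level `key: value` line; it avoids a YAML parser so that SpeechBrain-specific tags such as `!ref` are left untouched. The function name `set_yaml_values` and the sample text are hypothetical.

```python
import re


def set_yaml_values(yaml_text, overrides):
    """Replace the values of top-level `key: value` lines in YAML text.

    Trailing comments and all other lines are kept unchanged.
    Raises KeyError if a requested key is not found.
    """
    pattern = re.compile(r"^([A-Za-z_]\w*):(\s*)([^\s#]+)(.*)$", re.DOTALL)
    missing = set(overrides)
    out_lines = []
    for line in yaml_text.splitlines(keepends=True):
        match = pattern.match(line)
        if match and match.group(1) in overrides:
            key, gap, _, rest = match.groups()
            line = f"{key}:{gap}{overrides[key]}{rest}"
            missing.discard(key)
        out_lines.append(line)
    if missing:
        raise KeyError(f"keys not found in YAML text: {sorted(missing)}")
    return "".join(out_lines)


# Illustrative snippet with made-up starting values, not the real recipe file.
sample = "epochs: 100\nkeep_checkpoint_interval: 10\nbatch_size: 32  # example\n"
print(set_yaml_values(
    sample, {"epochs": 8, "keep_checkpoint_interval": 1, "batch_size": 8}
))
```

To apply it to the recipe, one would read `train.yaml` into a string, pass it through the function, and write the result back; keeping a copy of the original file first is advisable.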


3. ##### Run training:

In [None]:
! python \
  speechbrain/recipes/LJSpeech/TTS/tacotron2/train.py \
  --device=cpu \
  --max_grad_norm=1.0 \
  --data_folder=LJSpeech-1.1 \
  speechbrain/recipes/LJSpeech/TTS/tacotron2/hparams/train.yaml

INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
INFO:speechbrain.utils.seed:Setting seed to 1234
speechbrain.utils.quirks - Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
speechbrain.utils.quirks - Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: ./results/tacotron2/1234
ljspeech_prepare - Creating json file for ljspeech Dataset..
ljspeech_prepare - preparing ./results/tacotron2/1234/save/train.json.
100%|█████████████████████████████████████| 115/115 [00:00<00:00, 253999.45it/s]
ljspeech_prepare - ./results/tacotron2/1234/save/train.json successfully created!
ljspeech_prepare - preparing ./results/tacotron2/1234/save/valid.json.
100%|████████

<span style="color:red"> **Exercise:**</span>
<span style="color:orange"> **Text-to-speech training**</span>

Training the full model takes too long, so in this exercise we will run only a few epochs on a subset of the training corpus to speed up the experiment:

1. Train for 8 (or more) epochs with:
 - keep_checkpoint_interval: 1
 - batch_size: 8
 - 128 files from LJSpeech

2. Report the logs with loss values on the training and validation sets. Which losses were used during training?

3. Do you observe overfitting? Please explain your answer.
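One simple way to support the answer to question 3 is to compare the per-epoch training and validation losses. The sketch below is a heuristic, not the only possible criterion: it reports the epoch with the best validation loss if the validation loss has not improved for `patience` epochs afterwards while the training loss kept decreasing. The function name `overfitting_onset` and the loss values are hypothetical; real values should come from the training log.

```python
def overfitting_onset(train_losses, valid_losses, patience=2):
    """Return the 1-based epoch with the best validation loss if later
    epochs show no validation improvement while training loss still
    decreases; return None otherwise.
    """
    if len(train_losses) != len(valid_losses):
        raise ValueError("need one train and one valid loss per epoch")
    if not valid_losses:
        return None

    # Index of the first epoch that reaches the lowest validation loss.
    best_idx = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    epochs_since_best = len(valid_losses) - 1 - best_idx
    train_still_improving = train_losses[-1] < train_losses[best_idx]

    if epochs_since_best >= patience and train_still_improving:
        return best_idx + 1
    return None


# Made-up losses for illustration only, not results from this lab.
train = [5.0, 4.0, 3.2, 2.6, 2.1, 1.7, 1.4, 1.2]
valid = [5.1, 4.3, 3.9, 3.8, 3.9, 4.0, 4.2, 4.4]
print(overfitting_onset(train, valid))
```

With only 128 files, the validation set is small, so the validation loss can be noisy; a larger `patience` makes the check less sensitive to single-epoch fluctuations.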

In [None]:
! python \
  speechbrain/recipes/LJSpeech/TTS/tacotron2/train.py \
  --device=cpu \
  --keep_checkpoint_interval=1 \
  --batch_size=8 \
  --max_grad_norm=1.0 \
  --epochs=8 \
  --data_folder=LJSpeech_subset \
  speechbrain/recipes/LJSpeech/TTS/tacotron2/hparams/train.yaml

 





/bin/bash: -c: line 1: syntax error near unexpected token `newline'
/bin/bash: -c: line 1: ` python    speechbrain/recipes/LJSpeech/TTS/tacotron2/train.py    --device=cpu    --max_grad_norm=1.0    --data_folder=LJSpeech-1.1    --hparams=speechbrain/recipes/LJSpeech/TTS/tacotron2/hparams/train.yaml    --epochs=8    --batch_size=8    --keep_checkpoint_interval=1    --logger=<speechbrain.utils.train_logger.FileTrainLogger object at 0x7f6ba474f890>'


AttributeError: 'FileTrainLogger' object has no attribute 'get_metric'