# Lab 4: Text-to-Speech (TTS)

## <span style="color:darkblue"> Introduction </span>


In this lab, first, we will study how to build and evaluate a neural-based TTS system using a pretrained Tacotron 2 in torchaudio and different vocoders. Then, we will use a multi-speaker TTS model for experiments with different voices using SpeechBrain pretrained models.


The TTS pipeline comprises 3 steps:

<img src="tts.png">


#### <span style="color:green"> Step 1. Text processing </span>

The input text is encoded into a list of symbols. We will use English characters and phonemes as the symbols. A phonemizer transforms text into phoneme sequences. Phonemes are textual representations of the pronunciation of words. 

#### <span style="color:green"> Step 2. Spectrogram generation </span>

From the encoded text, a spectrogram is generated. We use the ``Tacotron 2``
model for this. 

#### <span style="color:green"> Step 3. Conversion of the spectrogram into a waveform (speech generation) </span>

The spectrogram is converted into a speech waveform using a vocoder.
In this lab, three different vocoders are used:
``torchaudio.models.WaveRNN``,
``torchaudio.transforms.GriffinLim``, and
[Nvidia's WaveGlow](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/).


The related components are bundled in ``torchaudio.pipelines.Tacotron2TTSBundle``.

<hr/>

#### Block diagram of the Tacotron 2 system architecture

<img src="https://pytorch.org/assets/images/tacotron2_diagram.png" width="500">

The Tacotron 2 and WaveGlow models form a TTS system that enables users to synthesize natural-sounding speech from raw transcripts without any additional prosody information.
The Tacotron 2 model produces mel spectrograms from input text using an encoder-decoder architecture.
WaveGlow is a flow-based model that consumes the mel spectrograms to generate speech.

*Note: 
    This lab is partly based on the [torchaudio tutorial](https://pytorch.org/audio/stable/tutorials/tacotron2_pipeline_tutorial.html) and [SpeechBrain](https://github.com/speechbrain/speechbrain) examples.*

In [None]:
import os
import torch
import torchaudio
import IPython
import matplotlib.pyplot as plt
from torchmetrics.audio import NonIntrusiveSpeechQualityAssessment
from speechbrain.inference.TTS import MSTacotron2
from speechbrain.inference.vocoders import HIFIGAN

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
# Speech and transcripts sub-directories paths
data_dir = "../dataset"
data_speech_dir = os.path.join(data_dir, 'speech')
data_transc_dir = os.path.join(data_dir, 'transcription')

## <span style="color:green"> Step 1. Text processing </span>

### Character-based encoding

In this section, we will learn how character-based encoding works.

Since the pretrained Tacotron 2 model expects a specific symbol
table, the same functionality is available in ``torchaudio``. However,
we will first implement the encoding manually for better understanding.

First, we define a set of symbols
``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we map each character of the input text to the index of the corresponding symbol in the table. Characters that are not in the table are ignored.



In [None]:
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)

def text_to_sequence(text):
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]

text = "We are studying text-to-speech models with different vocoders. What are the differences between these models?"
print(text_to_sequence(text))

As mentioned above, the symbol table and indices must match
what the pretrained Tacotron 2 model expects. ``torchaudio`` provides the same
transform along with the pretrained model. You can
instantiate and use this transform as follows.

In [None]:
processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "We are studying text-to-speech models with different vocoders. What are the differences between these models?"
processed, lengths = processor(text)

print(processed)
print(lengths)

Note: The output of our manual encoding matches the output of the ``torchaudio`` text processor, which means that we correctly re-implemented what the library does internally. The processor takes either a single text or a list of texts as input.
When a list of texts is provided, the returned ``lengths`` variable
holds the valid length of each processed token sequence in the output
batch.
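
To make the role of ``lengths`` more concrete, the following minimal sketch encodes a list of texts with the symbol table defined above, pads the sequences to a common length, and returns the valid length of each sequence. The helper names ``encode`` and ``encode_batch`` are hypothetical, and padding with index 0 is an assumption made for illustration; this is not the ``torchaudio`` implementation.

```python
# Same symbol table as in the manual implementation above.
SYMBOLS = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
LOOK_UP = {s: i for i, s in enumerate(SYMBOLS)}
PAD_ID = 0  # assumption: index 0 ("_") serves as the padding value


def encode(text):
    """Map one text to symbol indices, skipping characters outside the table."""
    return [LOOK_UP[c] for c in text.lower() if c in LOOK_UP]


def encode_batch(texts):
    """Encode several texts, pad them to a common length, and return the lengths."""
    sequences = [encode(t) for t in texts]
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch, lengths = encode_batch(["Hello world!", "Hi."])
print(batch)
print(lengths)
```

Only the first ``lengths[i]`` entries of row ``i`` carry information; the remaining entries are padding.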

The intermediate representation can be retrieved as follows:




In [None]:
print([processor.tokens[i] for i in processed[0, : lengths[0]]])

### Phoneme-based encoding

Phoneme-based encoding is similar to character-based encoding, but it
uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
model.

The details of the G2P model are beyond the scope of this lab; we will
just look at what the conversion looks like.
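
As a rough intuition only, G2P conversion can be pictured as a lookup from words to phoneme symbols. The sketch below uses a tiny hypothetical lexicon in ARPAbet-style notation (stress markers omitted) and spells out unknown words letter by letter; the name ``toy_g2p`` and the lexicon are assumptions for illustration, and a real G2P model predicts pronunciations instead of looking them up.

```python
# Hypothetical mini-lexicon in ARPAbet-style notation (stress markers omitted).
# A real G2P model predicts pronunciations, including for unseen words.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def toy_g2p(text):
    """Convert text to phonemes by dictionary lookup, keeping word boundaries."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?;:")
        if not word:
            continue
        if phonemes:
            phonemes.append(" ")  # word separator
        # Fall back to spelling the word out when it is not in the lexicon.
        phonemes.extend(TOY_LEXICON.get(word, list(word)))
    return phonemes


print(toy_g2p("Hello, world!"))
```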

Similar to the case of character-based encoding, the encoding process is
expected to match what the pretrained Tacotron 2 model was trained on.
``torchaudio`` has an interface to create the processor.

The following code illustrates how to create and use the processor. Behind
the scenes, a G2P model is created using the ``DeepPhonemizer`` package, and
the pretrained weights published by the author of ``DeepPhonemizer`` are
fetched.




In [None]:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

text = "We are studying text-to-speech models with different vocoders. What are the differences between these models?"
with torch.inference_mode():
    processed, lengths = processor(text)

print(processed)
print(lengths)

Notice that the encoded values are different from the example of
character-based encoding.

The intermediate representation looks as follows.


In [None]:
print([processor.tokens[i] for i in processed[0, : lengths[0]]])

## <span style="color:green"> Step 2. Spectrogram generation </span>


``Tacotron 2`` is the model we use to generate a spectrogram from the
encoded text. For more details, please refer to [the
paper](https://arxiv.org/abs/1712.05884).

It is easy to instantiate a Tacotron 2 model with pretrained weights.
Note, however, that the input to Tacotron 2 models needs to be processed
by the matching text processor.

``torchaudio.pipelines.Tacotron2TTSBundle`` bundles the matching
models and processors together so that it is easy to create the pipeline.

For the available bundles and their usage, please refer to the documentation of
``torchaudio.pipelines.Tacotron2TTSBundle``.




In [None]:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "We are studying text-to-speech models with different vocoders. What are the differences between these models?"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)

_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

Note that the ``Tacotron2.infer`` method performs multinomial sampling;
therefore, the process of generating the spectrogram involves randomness.




In [None]:
fig, ax = plt.subplots(3, 1)
for i in range(3):
    with torch.inference_mode():
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    print(spec[0].shape)
    ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

## <span style="color:green"> Step 3. Waveform generation using different vocoders </span>

The obtained spectrogram is used to generate a waveform using a vocoder.

``torchaudio`` provides vocoders based on ``GriffinLim`` and ``WaveRNN``.

### WaveRNN vocoder

Continuing from the previous section, we can instantiate the matching
WaveRNN model from the same bundle.




In [None]:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "We are studying text-to-speech models with different vocoders. What are the differences between these models?"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

In [None]:
def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()

    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)

plot(waveforms, spec, vocoder.sample_rate)

### Griffin-Lim vocoder

The usage of the Griffin-Lim vocoder is similar to that of WaveRNN. You can instantiate the vocoder object with the
``torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder``
method and pass the spectrogram to it.




In [None]:
bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

In [None]:
plot(waveforms, spec, vocoder.sample_rate)

### WaveGlow vocoder

WaveGlow is a vocoder published by Nvidia. The pretrained weights are published on Torch Hub. One can instantiate the model using the ``torch.hub`` module.

In [None]:
# Workaround to load model mapped on GPU
# https://stackoverflow.com/a/61840832
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)
checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)

In [None]:
# Save the waveform
torchaudio.save("synthesized_sample.wav", waveforms.squeeze(1).cpu(), 22050)
# Plot and display the waveform (the audio widget is shown only when it is the last expression in the cell)
plot(waveforms, spec, 22050)

<span style="color:red"> **Exercise 1**</span>

<span style="color:orange"> **Text-to-speech generation and subjective evaluation**</span>


1. Take a fragment of an arbitrary text (a few utterances).
2. Synthesize this text using the Tacotron 2 model and 3 different vocoders (Griffin-Lim, WaveRNN, and WaveGlow).
3. Listen to the synthesized audio samples and rate their naturalness and intelligibility on a scale from 1 to 5. See slides #6-8 *Subjective Evaluation* for more details on the evaluation criteria.
4. Based on these scores, rank the vocoders. For each vocoder, describe the problems you observe (mispronunciation, wrong intonation, etc.) regarding speech naturalness and intelligibility, and provide audio examples (one for each factor is sufficient).
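
One possible way to turn the 1-5 ratings into a ranking is to average them per system into a mean opinion score (MOS) and sort the systems by it. The sketch below is only an illustration: the system names and the ratings are placeholders, not results, and the helper names are hypothetical.

```python
from statistics import mean

# Placeholder ratings on the 1-5 scale for three hypothetical systems;
# replace them with the scores collected in your own listening test.
ratings = {
    "vocoder_A": [4, 5, 4],
    "vocoder_B": [2, 3, 3],
    "vocoder_C": [3, 4, 3],
}


def mean_opinion_scores(ratings):
    """Average the ratings of each system into a mean opinion score (MOS)."""
    return {name: mean(scores) for name, scores in ratings.items()}


def rank_systems(mos):
    """Return the system names sorted from the highest to the lowest MOS."""
    return sorted(mos, key=mos.get, reverse=True)


mos = mean_opinion_scores(ratings)
for name in rank_systems(mos):
    print(f"{name}: MOS = {mos[name]:.2f}")
```

Naturalness and intelligibility can be handled the same way by keeping one ``ratings`` dictionary per criterion.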




## <span style="color:green"> Objective evaluation </span>

### Objective evaluation using ASR

Sometimes, instead of human listeners, an ASR model can be used as a proxy for speech intelligibility assessment.

The word error rate (WER) is computed between the reference text and the output of the ASR system applied to the synthesized speech sample.
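
A common definition is ``WER = (S + D + I) / N``, where ``S``, ``D``, and ``I`` are the numbers of word substitutions, deletions, and insertions needed to turn the reference into the ASR output, and ``N`` is the number of words in the reference. The following minimal sketch computes it with a word-level edit distance. The function name ``word_error_rate`` is hypothetical, and the text normalization (lowercasing and whitespace splitting) is a simplifying assumption; in practice, punctuation is usually removed as well.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER as the word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("we study speech synthesis", "we study speech"))  # 0.25
```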



### Objective evaluation using NISQA (Non Intrusive Speech Quality Assessment)

This part is devoted to the objective evaluation of speech quality/naturalness using NISQA models.
In the current examples, for simplicity, a general NISQA model for speech quality evaluation is used. For the evaluation of TTS results, in practice, a dedicated NISQA-TTS model should be used.
In the output, the first value corresponds to the overall MOS quality estimate, which is the value we suggest using in this lab.

In [None]:
nisqa = NonIntrusiveSpeechQualityAssessment(22050)

waveform, sr = torchaudio.load('synthesized_sample.wav', channels_first=True)
nisqa.update(waveform)
fig, ax_ = nisqa.plot()
fig.savefig("nisqa_synthesized_test.png")

# Float tensor with shape (...,5) 
# corresponding to overall MOS, noisiness, discontinuity, coloration and loudness in that order
print(nisqa(waveform))

### Objective evaluation using wv-MOS

This part is devoted to objective MOS score prediction using a fine-tuned wav2vec 2.0 model.

**The code below is commented out because it does not currently work due to compatibility issues**

In [None]:
#from wvmos import get_wvmos
#model = get_wvmos(cuda=False)

#mos = model.calculate_one(os.path.join(data_speech_dir, "237-126133-0009.wav")) # infer MOS score for one audio
#print(mos)
#mos = model.calculate_dir("path/to/dir/with/wav/files", mean=True) # infer average MOS score across .wav files in directory

<span style="color:red"> **Exercise 2**</span>

<span style="color:orange"> **Objective evaluation** </span>

1. Use the text transcripts from the text files in */asr-dataset/transcription* (directory from the previous ASR labs). Select a subsample of 6-20 files, preferably covering all the speakers of the original dataset (different speakers have different first numbers in the file names, e.g. file 121-121726-0002.txt corresponds to speaker #121, and file 61-70968-0018.txt to speaker #61). Synthesize speech for these files three times, once with each of the three vocoders.
2. Compute MOS naturalness and MOS quality predictions using the neural wv-MOS and NISQA models and compare them.
3. Do you observe a similar or a different ranking of the TTS systems when using subjective (in Exercise 1) and objective scores?
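
Since the speaker is encoded in the first number of each file name, a subsample that covers all speakers can be drawn by grouping the files by that prefix. The sketch below is one possible approach; the helper names ``speaker_id`` and ``select_subsample`` are hypothetical, and the file list is only an example input.

```python
from collections import defaultdict
from itertools import zip_longest


def speaker_id(file_name):
    """Return the speaker ID, i.e. the first number of the file name."""
    return file_name.split("-", 1)[0]


def select_subsample(file_names, max_files=20):
    """Pick up to max_files files, cycling over speakers so that all are covered."""
    by_speaker = defaultdict(list)
    for name in sorted(file_names):
        by_speaker[speaker_id(name)].append(name)

    selected = []
    # Take the first file of every speaker, then the second, and so on.
    for group in zip_longest(*by_speaker.values()):
        for name in group:
            if name is not None and len(selected) < max_files:
                selected.append(name)
    return selected


# Example input; in the lab, the list could come from os.listdir(data_transc_dir).
names = [
    "121-121726-0002.txt",
    "121-127105-0006.txt",
    "61-70968-0018.txt",
    "61-70968-0004.txt",
    "237-126133-0009.txt",
]
subset = select_subsample(names, max_files=3)
print(subset)
```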



## <span style="color:green"> Multi-speaker TTS </span>

This part is devoted to zero-shot multi-speaker TTS with the SpeechBrain toolkit, using a variation of Tacotron 2 extended to incorporate speaker identity information when generating speech. The model is pretrained on the [LibriTTS](https://openslr.org/60/) corpus (a multi-speaker corpus of approximately 585 hours of read English speech).

*Note*:
    The model generates speech at a sampling rate of 22050 Hz, but the input reference signal, which is crucial for capturing the speaker identity, must be sampled at 16 kHz.
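
Before using a reference recording, it can be useful to check its sampling rate. The sketch below does this with Python's standard ``wave`` module, assuming that the reference is an uncompressed PCM WAV file; the helper names ``wav_sample_rate`` and ``needs_resampling`` are hypothetical.

```python
import wave

EXPECTED_SR = 16000  # sampling rate required for the reference signal, in Hz


def wav_sample_rate(path):
    """Return the sampling rate of a PCM WAV file, in Hz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_sr=EXPECTED_SR):
    """Tell whether the file has to be resampled before being used as a reference."""
    return wav_sample_rate(path) != expected_sr


# Usage in the lab (the path is defined in a later cell):
# print(needs_resampling(audio_file_reference_path))
```

If the rate differs, the loaded signal could be converted with ``torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)``.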

In [None]:
# Initialize TTS (MSTacotron2) and vocoder (HiFi-GAN)
ms_tacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir="pretrained_models/tts-mstacotron2-libritts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir="pretrained_models/tts-hifigan-libritts-22050Hz")

In [None]:
# Examples of reference files for different speakers
# audio_file_reference = '121-127105-0006.wav' # female voice
# audio_file_reference = '61-70968-0004.wav' # male voice
audio_file_reference = '237-126133-0009.wav' # female voice
audio_file_reference_path = os.path.join(data_speech_dir, audio_file_reference)
print(f"Audio file reference path: {audio_file_reference_path}")

waveform, sr = torchaudio.load(audio_file_reference_path, channels_first=True)

In [None]:
IPython.display.Audio(data=waveform, rate=sr)

In [None]:
text = "Hello! This is a multi-speaker text-to-speech model. We generate a test sample."

# Running the Zero-Shot Multi-Speaker Tacotron2 model to generate mel-spectrogram
mel_outputs, mel_lengths, alignments = ms_tacotron2.clone_voice(text, audio_file_reference_path)

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_outputs)

# Save and display the waveform
torchaudio.save("synthesized_sample.wav", waveforms.squeeze(1).cpu(), 22050)
IPython.display.Audio(data=waveforms.squeeze(1), rate=22050)

<span style="color:red"> **Exercise 3**</span>

<span style="color:orange"> **Multi-speaker TTS** </span>


1. Use the text transcripts from */dataset/transcription* (directory from the previous ASR labs #1-3). Take a subsample of up to 20 files, the same as you used in Exercise 2. Synthesize speech for these files several times using the multi-speaker SpeechBrain Tacotron 2 model and the HiFi-GAN vocoder, each time providing a reference sample from a different speaker.
2. Listen to the synthesized audio samples and rate their naturalness and intelligibility on a scale from 1 to 5. You do not need to listen to all the synthesized files: for this subjective evaluation (listening test), 1-2 samples per speaker are sufficient.
3. Apply an objective metric (e.g. wv-MOS) to the synthesized files.
4. Are there any differences in speech quality depending on the reference speaker's voice? If yes, which voice(s) are the most difficult and the easiest for TTS synthesis? What are the characteristics of these voices, and what are the main differences between them?
   

