#### TEXT-TO-SPEECH WITH TACOTRON2

The text-to-speech pipeline has three stages:

1. Text preprocessing
* Text is encoded into a list of symbols

2. Spectrogram generation
* A Tacotron2 model generates a spectrogram from the encoded text

3. Time-domain conversion
* The spectrogram is then converted into a waveform. The component that performs this conversion is called a vocoder.

In [None]:
import IPython
import matplotlib
import matplotlib.pyplot as plt

In [None]:
import torch
import torchaudio
device = "cuda" if torch.cuda.is_available() else "cpu"
matplotlib.rcParams["figure.figsize"] = [16.0, 4.8] # Set default figsize

print(torch.__version__)
print(torchaudio.__version__)
print(device)

In [None]:
# Text that will be read
text = "Hello world! Text to speech!"

#### Text Processing

The pre-trained Tacotron2 model expects its input to be encoded with a specific symbol table, the one it was trained with.
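
To make the idea of a symbol table concrete, here is a minimal sketch of character-based encoding in plain Python. The symbol string and the `text_to_sequence` helper are illustrative assumptions, not necessarily the table shipped with the pre-trained model; for real inference, use the text processor that comes with the bundle.

```python
# Illustrative symbol table (an assumption for this sketch): each character's
# position in the string is its integer ID.
SYMBOLS = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
LOOK_UP = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def text_to_sequence(text):
    """Lowercase the text and map each known character to its ID.

    Characters that are not in the table are silently dropped.
    """
    return [LOOK_UP[ch] for ch in text.lower() if ch in LOOK_UP]


sequence = text_to_sequence("Hello world!")
print(sequence)
# [19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2]
```

Dropping unknown characters keeps the sketch short; a production encoder would more likely normalize or report them.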


##### Phoneme-based encoding

Phoneme-based encoding works like character-based encoding, where each symbol maps to an integer ID, but the symbol table is built from phonemes and a G2P (Grapheme-to-Phoneme) model first converts the text into phonemes.
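
Phoneme-based encoding involves two steps: graphemes to phonemes, then phonemes to IDs. The toy sketch below illustrates both. Everything in it is a stand-in chosen for illustration: the tiny `PRONUNCIATIONS` dictionary replaces the learned G2P model, and the phoneme table is built on the fly rather than taken from the pre-trained model.

```python
# Toy pronunciation dictionary (hypothetical): a real G2P model predicts
# phonemes for arbitrary words instead of looking them up.
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def words_to_phonemes(text):
    """Split on whitespace and replace each word with its phonemes.

    Raises KeyError for words missing from the toy dictionary.
    """
    phonemes = []
    for position, word in enumerate(text.lower().split()):
        if position > 0:
            phonemes.append(" ")  # keep word boundaries as a symbol
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


toy_phonemes = words_to_phonemes("Hello world")

# Build a phoneme symbol table: "_" is reserved as ID 0, then sorted symbols.
toy_table = ["_"] + sorted(set(toy_phonemes))
toy_look_up = {symbol: index for index, symbol in enumerate(toy_table)}
toy_ids = [toy_look_up[p] for p in toy_phonemes]

print(toy_phonemes)
print(toy_ids)
```

The real processor used in the next cell performs the same two steps with a trained G2P model and the symbol table the Tacotron2 model expects.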

In [None]:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

with torch.inference_mode():
    processed, lengths = processor(text)

print(processed)
print(lengths)

In [None]:
print([processor.tokens[i] for i in processed[0, : lengths[0]]])
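
The `lengths` value becomes important when several sentences are encoded together: sequences of different length have to be padded to a common size, and `lengths` records where the real symbols end. A minimal plain-Python sketch of that idea follows; the `pad_batch` helper and the choice of `0` as the padding ID are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad integer sequences to the longest one.

    Returns the padded batch and the original length of each sequence.
    """
    seq_lengths = [len(seq) for seq in sequences]
    max_len = max(seq_lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, seq_lengths


padded, seq_lengths = pad_batch([[19, 16, 23, 23, 26], [19, 20]])
print(padded)       # [[19, 16, 23, 23, 26], [19, 20, 0, 0, 0]]
print(seq_lengths)  # [5, 2]

# Slicing with the stored length recovers the unpadded sequence,
# mirroring processed[0, : lengths[0]] in the cell above.
print(padded[1][: seq_lengths[1]])  # [19, 20]
```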

#### Spectrogram Generation

Tacotron2 generates the spectrogram from the encoded text.

`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching models and processors together, so the components of a pipeline are easy to keep consistent.

In [None]:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)


_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

Tacotron2 inference is not deterministic: calling `infer` repeatedly on the same input gives spectrograms that differ, including in length. The next cell runs it three times and prints the shape of each result.

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(16, 4.3 * 3))
for i in range(3):
    with torch.inference_mode():
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    print(spec[0].shape)
    ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
plt.show()

#### Waveform Generation
The last step is to recover the waveform from the spectrogram.

`torchaudio` provides vocoders based on `GriffinLim` and `WaveRNN`.
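
A vocoder has to produce many audio samples for each spectrogram frame. As a rough back-of-the-envelope sketch, the output duration can be estimated from the frame count, the hop length, and the sample rate. The `estimate_duration` helper and the numbers below are placeholder assumptions, not values read from the bundle; in practice, take the sample rate from `vocoder.sample_rate` and the frame count from `spec_lengths`.

```python
def estimate_duration(num_frames, hop_length, sample_rate):
    """Return (approximate sample count, duration in seconds)."""
    num_samples = num_frames * hop_length
    return num_samples, num_samples / sample_rate


# Placeholder values chosen for illustration only.
num_samples, seconds = estimate_duration(
    num_frames=200, hop_length=256, sample_rate=22050
)
print(num_samples)        # 51200
print(round(seconds, 2))  # 2.32
```

The estimate is approximate because a vocoder may trim or pad a few frames at the edges.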

1. WaveRNN

In [None]:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)

2. Griffin-Lim


In [None]:
bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)