# NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

_Chandra Shekhar Pandey_

_Priyanka Bose_

This notebook validates some of the qualitative claims made in the paper

> J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 4779-4783, doi: 10.1109/ICASSP.2018.8461368.

which describes Tacotron 2, a neural network that predicts mel spectrograms from text. A vocoder then synthesizes speech from those spectrograms.

We use NVIDIA's PyTorch implementation of this model, loaded through `torch.hub` together with the WaveGlow vocoder, to check the specific claims on the [authors' website](https://google.github.io/tacotron/publications/tacotron2/) for selected test cases:

* Claim: Tacotron 2 works well on out-of-domain and complex words.
* Claim: Tacotron 2 learns pronunciations based on phrase semantics.
* Claim: Tacotron 2 is somewhat robust to spelling errors.
* Claim: Tacotron 2 is sensitive to punctuation.
* Claim: Tacotron 2 learns stress and intonation.
* Claim: Tacotron 2's prosody changes when turning a statement into a question.
* Claim: Tacotron 2 is good at tongue twisters.

For each test case, we first play the audio from the authors' website, then generate audio with Tacotron 2 (NVIDIA implementation, with the WaveGlow vocoder) for comparison. You can also change the text of the test case to explore the limits of the claim.

# Validating the claims

#### Loading Nvidia Tacotron 2 Model from torch.hub

In [2]:
import torch
from scipy.io.wavfile import write
from IPython.display import Audio
tacotron2_n = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2_n = tacotron2_n.to('cuda')
tacotron2_n.eval()

Using cache found in /home/cc/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (lin

#### Loading the Neural Vocoder "waveglow" by Nvidia from torch.hub

In [3]:
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ckpt_amp/versions/19.09.0/files/nvidia_waveglowpyt_fp16_20190427


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0): WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (1): Conv1d(51

#### Function to generate audio from predicted spectrograms

In [5]:
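def Nvidia_tacotron2_waveglow(text):
  # Text-processing helpers published alongside the NVIDIA models
  utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
  # Convert the text into a padded batch of symbol IDs plus the sequence lengths
  sequences, lengths = utils.prepare_input_sequence([text])
  with torch.no_grad():
    # Tacotron 2 predicts the mel spectrogram; WaveGlow turns it into a waveform
    mel, _, _ = tacotron2_n.infer(sequences, lengths)
    audio = waveglow.infer(mel)
  audio_numpy = audio[0].data.cpu().numpy()
  # The NVIDIA checkpoints generate audio sampled at 22050 Hz
  rate = 22050
  write("audio.wav", rate, audio_numpy)
  return Audio(audio_numpy, rate=rate)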
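The sampling rate passed to `write` and `Audio` determines how fast the generated waveform plays back, so it is worth sanity-checking. The sketch below is a minimal illustration in plain Python. It assumes a hop of 256 samples per mel frame (the `stride=(256,)` of the WaveGlow upsampling layer shown above) and a sampling rate of 22050 Hz for the NVIDIA checkpoints. The function names are hypothetical, and the duration is only an approximation because the vocoder may pad or trim a few samples.

```python
HOP_LENGTH = 256      # assumed samples per mel frame (WaveGlow upsample stride)
SAMPLE_RATE = 22050   # assumed sampling rate of the NVIDIA checkpoints, in Hz


def estimated_duration_seconds(num_mel_frames, hop_length=HOP_LENGTH,
                               sample_rate=SAMPLE_RATE):
    """Approximate length of the audio synthesized from `num_mel_frames` frames."""
    return num_mel_frames * hop_length / sample_rate


def playback_speed_factor(playback_rate, true_rate=SAMPLE_RATE):
    """Speed-up (>1) or slow-down (<1) heard when audio is played at the wrong rate."""
    return playback_rate / true_rate


if __name__ == "__main__":
    # 500 mel frames correspond to a little under six seconds of speech
    print(round(estimated_duration_seconds(500), 2))
    # Playing 22050 Hz audio at 25000 Hz makes it noticeably faster and higher pitched
    print(round(playback_speed_factor(25000), 3))
```

Inside the synthesis function, `mel.shape[-1]` should give the number of mel frames, so the estimate can be compared with `len(audio_numpy) / rate`.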

### Validating the Claims of the Paper

#### Claim 1: Tacotron 2 works well on out-of-domain and complex words.

In [15]:
text = "Basilar membrane and otolaryngology are not auto-correlations."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/basilar.wav")

In [13]:
text = "Basilar membrane and otolaryngology are not auto-correlations."
Nvidia_tacotron2_waveglow(text) 



#### Claim 2: Tacotron 2 learns pronunciations based on phrase semantics.

In [16]:
text = "Don't desert me here in the desert!"
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/desertdesert.wav")

In [17]:

Nvidia_tacotron2_waveglow(text)



#### Claim 3: Tacotron 2 is somewhat robust to spelling errors.

In [18]:
text = "Thisss isrealy awhsome."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/awhsome.wav")

In [19]:
Nvidia_tacotron2_waveglow(text)

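To probe how far this robustness extends, misspelled variants of a clean sentence can be generated automatically and fed to the model. The sketch below is one simple way to do that, assuming letter doubling and letter deletion are reasonable stand-ins for typos such as "Thisss" and "realy". `add_spelling_noise` is a hypothetical helper, not part of any library used in this notebook.

```python
import random


def add_spelling_noise(text, n_edits=2, seed=0):
    """Return a copy of `text` with up to `n_edits` random letter-level edits."""
    rng = random.Random(seed)  # seeded so the same variant can be regenerated
    chars = list(text)
    for _ in range(n_edits):
        letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
        if not letter_positions:
            break  # nothing left to edit
        i = rng.choice(letter_positions)
        if rng.random() < 0.5:
            chars.insert(i, chars[i])  # double the letter, as in "Thisss"
        else:
            del chars[i]               # drop the letter, as in "realy"
    return "".join(chars)


if __name__ == "__main__":
    clean = "This is really awesome."
    for n in (1, 3, 6):
        print(n, add_spelling_noise(clean, n_edits=n, seed=42))
```

Increasing `n_edits` step by step and listening to each result gives a rough sense of where the pronunciation starts to break down.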


#### Claim 4: Tacotron 2 is sensitive to punctuation.

In [20]:
text = "This is your personal assistant, Google Home."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/ghome_comma.wav")

In [21]:

Nvidia_tacotron2_waveglow(text)



In [None]:
text = "This is your personal assistant Google Home."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/ghome_nocomma.wav")

In [22]:

Nvidia_tacotron2_waveglow(text)

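To explore this claim beyond the two sentences above, punctuation variants of a sentence can be generated programmatically. The sketch below is a minimal illustration. `without_commas` and `punctuation_variants` are hypothetical helper names, not part of any library used in this notebook.

```python
import re


def without_commas(text):
    """Remove commas while keeping single spaces between words."""
    return re.sub(r"\s*,\s*", " ", text).strip()


def punctuation_variants(text):
    """Build a few punctuation variants of one sentence for side-by-side listening."""
    base = text.rstrip(".!?")  # sentence without its final punctuation mark
    return {
        "original": text,
        "no_commas": without_commas(text),
        "as_question": base + "?",
        "as_exclamation": base + "!",
    }


if __name__ == "__main__":
    sentence = "This is your personal assistant, Google Home."
    for name, variant in punctuation_variants(sentence).items():
        print(f"{name}: {variant}")
```

Each variant can then be passed to `Nvidia_tacotron2_waveglow` and compared by ear for pauses and pitch movement.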


#### Claim 5: Tacotron 2 learns stress and intonation.

In [23]:
text = "The buses aren't the problem, they actually provide a solution."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/bus_nostress.wav")

In [24]:
Nvidia_tacotron2_waveglow(text)



In [25]:
text = "The buses aren't the PROBLEM, they actually provide a SOLUTION."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/bus_stress.wav")

In [26]:

Nvidia_tacotron2_waveglow(text)

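The stressed test sentence differs from the unstressed one only in capitalization, so other stress patterns can be tried by capitalizing different words. The sketch below shows one way to do that. `stress_words` is a hypothetical helper written for this illustration. Note that some text front ends lowercase the input before synthesis; if the one used here does, capitalization will not change the output.

```python
import re


def stress_words(text, words):
    """Upper-case whole-word occurrences of each word in `words`, ignoring case."""
    for word in words:
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, lambda m: m.group(0).upper(), text,
                      flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    sentence = "The buses aren't the problem, they actually provide a solution."
    # Reproduces the capitalization of the stressed test sentence
    print(stress_words(sentence, ["problem", "solution"]))
    # A different stress pattern to try
    print(stress_words(sentence, ["buses", "actually"]))
```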


#### Claim 6: Tacotron 2's prosody changes when turning a statement into a question.

In [27]:
text = "The quick brown fox jumps over the lazy dog."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/fox_period.wav")

In [28]:

Nvidia_tacotron2_waveglow(text)



In [29]:
text = "Does the quick brown fox jump over the lazy dog?"
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/fox_question.wav")

In [30]:
Nvidia_tacotron2_waveglow(text)



#### Claim 7: Tacotron 2 is good at tongue twisters.

In [31]:
text = "Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?"
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/peterpiper.wav")

In [32]:
Nvidia_tacotron2_waveglow(text)


