# NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

_Chandra Shekhar Pandey, Priyanka Bose_



This notebook validates some of the qualitative claims made in the paper

> J. Shen et al., "Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 4779-4783, doi: 10.1109/ICASSP.2018.8461368.

which describes Tacotron 2, a neural network that predicts mel spectrograms from text. A vocoder can then synthesize speech from these spectrograms.

We use the TensorSpeech implementation of this model ([TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS)) to check the specific claims on the [authors' website](https://google.github.io/tacotron/publications/tacotron2/) for selected test cases:

* Claim: Tacotron 2 works well on out-of-domain and complex words.
* Claim: Tacotron 2 learns pronunciations based on phrase semantics.
* Claim: Tacotron 2 is somewhat robust to spelling errors.
* Claim: Tacotron 2 is sensitive to punctuation.
* Claim: Tacotron 2 learns stress and intonation.
* Claim: Tacotron 2's prosody changes when turning a statement into a question.
* Claim: Tacotron 2 is good at tongue twisters.

For each test case, we play the reference audio from the authors' website. You can then generate audio using Tacotron 2 with different vocoders:

* MelGAN
* MB MelGAN
* MelGAN STFT

for comparison. You can also change the text of the test case to explore the limits of the claim.



# Validating the claims

#### Importing Libraries

In [1]:
import tensorflow as tf

import yaml
import numpy as np
import matplotlib.pyplot as plt

import IPython.display as ipd

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import AutoProcessor



#### Loading pretrained weights of the Tacotron 2 model and vocoders

In [2]:
tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en", name="tacotron2")


In [3]:
melgan = TFAutoModel.from_pretrained("tensorspeech/tts-melgan-ljspeech-en", name="melgan")

In [4]:
melgan_stft_config = AutoConfig.from_pretrained('TensorFlowTTS/examples/melgan_stft/conf/melgan_stft.v1.yaml')
melgan_stft = TFAutoModel.from_pretrained(
    config=melgan_stft_config,
    pretrained_path="melgan.stft-2M.h5",
    name="melgan_stft"
)

In [5]:
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en", name="mb_melgan")

In [6]:
processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")

In [7]:
def do_synthesis(input_text, text2mel_model, vocoder_model, text2mel_name, vocoder_name):
  # Convert the input text into a sequence of symbol IDs.
  input_ids = processor.text_to_sequence(input_text)

  # text2mel part: predict a mel spectrogram from the symbol IDs
  if text2mel_name == "TACOTRON":
    _, mel_outputs, stop_token_prediction, alignment_history = text2mel_model.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),  # add a batch dimension
        tf.convert_to_tensor([len(input_ids)], tf.int32),  # input lengths
        tf.convert_to_tensor([0], dtype=tf.int32)  # speaker ID
    )
  else:
    raise ValueError("Only TACOTRON is supported on text2mel_name")

  # vocoder part: convert the mel spectrogram into a waveform
  if vocoder_name in ("MELGAN", "MELGAN-STFT", "MB-MELGAN"):
    audio = vocoder_model(mel_outputs)[0, :, 0]
  else:
    raise ValueError("Only MELGAN, MELGAN-STFT and MB-MELGAN are supported on vocoder_name")

  return mel_outputs.numpy(), alignment_history.numpy(), audio.numpy()

In [8]:
# Optional: set up the attention window for Tacotron 2 inference
tacotron2.setup_window(win_front=10, win_back=10)

In [9]:
def tts_tacotron2_melgan_stft(input_text):
  mels, alignment_history, audios = do_synthesis(input_text, tacotron2, melgan_stft, "TACOTRON", "MELGAN-STFT")
  return ipd.Audio(audios, rate=22050)  # the LJSpeech models generate 22050 Hz audio

In [10]:
def tts_tacotron2_melgan(input_text):
  mels, alignment_history, audios = do_synthesis(input_text, tacotron2, melgan, "TACOTRON", "MELGAN")
  return ipd.Audio(audios, rate=22050)  # the LJSpeech models generate 22050 Hz audio

In [11]:
def tts_tacotron2_mb_melgan(input_text):
  mels, alignment_history, audios = do_synthesis(input_text, tacotron2, mb_melgan, "TACOTRON", "MB-MELGAN")
  return ipd.Audio(audios, rate=22050)  # the LJSpeech models generate 22050 Hz audio
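
The helpers above only return an in-notebook audio player. To keep the generated samples for side-by-side comparison outside the notebook, the waveform returned by `do_synthesis` can be written to disk. The following is a minimal sketch, not part of TensorFlowTTS: `save_wav` is a hypothetical helper name, and it assumes the audio is a 1-D float array in the range [-1, 1] sampled at 22050 Hz (the LJSpeech sampling rate). Pass a different `sample_rate` if your setup differs.

```python
import wave

import numpy as np


def save_wav(path, audio, sample_rate=22050):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file.

    Hypothetical helper for this notebook; returns the duration in seconds.
    """
    audio = np.asarray(audio, dtype=np.float32)
    # Clip to the valid range so that overshoots do not wrap around.
    audio = np.clip(audio, -1.0, 1.0)
    # Scale to 16-bit integers, stored little-endian as WAV requires.
    pcm = (audio * 32767.0).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate
```

For example, the third value returned by `do_synthesis` could be passed as `audio`, with one output file per vocoder.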


#### Claim 1: Tacotron 2 works well on out-of-domain and complex words.

In [12]:

from IPython.display import Audio

text = "Basilar membrane and otolaryngology are not auto-correlations."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/basilar.wav")

In [13]:
tts_tacotron2_melgan(text)

In [14]:
tts_tacotron2_melgan_stft(text) 

In [15]:
tts_tacotron2_mb_melgan(text)

After listening to all the audio, we consider the claim valid: all the generated samples are close to the authors' reference.

#### Claim 2: Tacotron 2 learns pronunciations based on phrase semantics.

In [16]:
text = "Don't desert me here in the desert!"
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/desertdesert.wav")

In [17]:
tts_tacotron2_melgan(text) 

In [18]:
tts_tacotron2_melgan_stft(text) 

In [19]:
tts_tacotron2_mb_melgan(text)

After listening to all the audio, we consider the claim valid: all the generated samples are close to the authors' reference, and each occurrence of "desert" is pronounced according to its meaning in the phrase.

#### Claim 3: Tacotron 2 is somewhat robust to spelling errors.

In [20]:
text = "Thisss isrealy awhsome."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/awhsome.wav")

In [21]:
tts_tacotron2_melgan(text)

In [22]:
tts_tacotron2_melgan_stft(text)

In [23]:
tts_tacotron2_mb_melgan(text)
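
To explore the limits of this claim with your own inputs, it can help to generate misspelled variants of a clean sentence in a reproducible way. The sketch below is only an illustration: `misspell` is a hypothetical helper, and the three perturbations (doubling a letter, dropping a letter, swapping two adjacent letters) are our own assumption about what counts as a plausible spelling error, not something defined in the paper.

```python
import random


def misspell(text, rate=0.15, seed=0):
    """Return `text` with random letter doublings, drops and adjacent swaps.

    Hypothetical helper: each letter is perturbed with probability `rate`.
    Spaces and punctuation are never changed.
    """
    rng = random.Random(seed)  # fixed seed makes the variant reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["double", "drop", "swap"])
            if op == "double":
                out.extend([c, c])
            elif op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                out.extend([chars[i + 1], c])
                i += 1  # the next letter has already been emitted
            elif op == "drop":
                pass  # emit nothing for this letter
            else:
                out.append(c)  # swap not possible here, keep the letter
        else:
            out.append(c)
        i += 1
    return "".join(out)
```

The result of, for example, `misspell("This is really awesome.", rate=0.3, seed=1)` can be passed to any of the `tts_tacotron2_*` helpers. Increasing `rate` gives progressively harder inputs.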

#### Claim 4: Tacotron 2 is sensitive to punctuation.

In [24]:
text = "This is your personal assistant, Google Home."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/ghome_comma.wav")

In [25]:
tts_tacotron2_melgan(text)

In [26]:
tts_tacotron2_melgan_stft(text)

In [27]:
tts_tacotron2_mb_melgan(text)

In [28]:
text = "This is your personal assistant Google Home."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/ghome_nocomma.wav")

In [29]:
tts_tacotron2_melgan(text)

In [30]:
tts_tacotron2_melgan_stft(text)

In [31]:
tts_tacotron2_mb_melgan(text)

Here we observe that all the Tacotron 2 models work well except the Tacotron 1 model, which does not pause at the punctuation.
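
Whether a sample pauses at the comma can also be checked roughly from the waveform instead of by ear alone. The following sketch looks for interior runs of low-energy frames. `find_pauses` is a hypothetical helper, and the frame size, relative threshold and minimum pause length are assumed values that may need tuning for real speech.

```python
import numpy as np


def find_pauses(audio, sample_rate=22050, frame_ms=20,
                rel_threshold=0.05, min_pause_s=0.15):
    """Return (start, end) times in seconds of interior quiet stretches.

    Hypothetical helper: a frame counts as quiet when its RMS energy is below
    `rel_threshold` times the loudest frame's RMS.
    """
    audio = np.asarray(audio, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return []
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    quiet = rms < rel_threshold * rms.max()

    pauses, start = [], None
    # The appended False acts as a sentinel that closes a trailing quiet run.
    for i, q in enumerate(np.append(quiet, False)):
        if q and start is None:
            start = i
        elif not q and start is not None:
            t0 = start * frame_len / sample_rate
            t1 = i * frame_len / sample_rate
            # Skip leading and trailing silence; keep only pauses inside speech.
            if start > 0 and i < n_frames and t1 - t0 >= min_pause_s:
                pauses.append((t0, t1))
            start = None
    return pauses
```

Comparing the output for the sentence with and without the comma, for the same vocoder, gives a rough indication of whether the punctuation produced a pause.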

#### Claim 5: Tacotron 2 learns stress and intonation.

In [32]:
text = "The buses aren't the problem, they actually provide a solution."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/bus_nostress.wav")

In [33]:
tts_tacotron2_melgan(text)

In [34]:
tts_tacotron2_melgan_stft(text)

In [35]:
tts_tacotron2_mb_melgan(text)

In [36]:
text = "The buses aren't the PROBLEM, they actually provide a SOLUTION."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/bus_stress.wav")

In [37]:
tts_tacotron2_melgan(text)

In [38]:
tts_tacotron2_melgan_stft(text)

In [39]:
tts_tacotron2_mb_melgan(text)

Here none of the models comes close to what the authors assert about stress and intonation: the audio for the capitalized text sounds the same as the audio for the plain text, in which no word is stressed.

#### Claim 6: Tacotron 2's prosody changes when turning a statement into a question.

In [40]:
text = "The quick brown fox jumps over the lazy dog."
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/fox_period.wav")

In [41]:
tts_tacotron2_melgan(text)

In [42]:
tts_tacotron2_melgan_stft(text)

In [43]:
tts_tacotron2_mb_melgan(text)

In [44]:
text = "Does the quick brown fox jump over the lazy dog?"
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/fox_question.wav")

In [45]:
tts_tacotron2_melgan(text)

In [46]:
tts_tacotron2_melgan_stft(text)

In [47]:
tts_tacotron2_mb_melgan(text)

Here we can see that none of the models are able to change prosody when changing a statement to a question.
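
A listening test like this is subjective. One rough way to support it is to compare the pitch at the end of an utterance with the pitch earlier on, since a question typically ends with rising intonation. The sketch below estimates the fundamental frequency per frame from the autocorrelation peak. Both function names are hypothetical, and the method is deliberately simple: it does no voicing detection, so silent or unvoiced frames in real speech will add noise, and a dedicated pitch tracker would be more reliable.

```python
import numpy as np


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Rough F0 estimate (Hz) for one frame from its autocorrelation peak."""
    frame = np.asarray(frame, dtype=np.float64)
    frame = frame - frame.mean()
    if not np.any(frame):
        return 0.0  # silent frame, no pitch
    # Keep only non-negative lags of the autocorrelation.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(corr) - 1)
    if lag_max <= lag_min:
        return 0.0  # frame too short for the requested pitch range
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sample_rate / lag


def final_pitch_ratio(audio, sample_rate=22050, frame_len=2048, tail_fraction=0.2):
    """Median F0 of the last `tail_fraction` of frames over that of the rest.

    Values clearly above 1 suggest a rising pitch at the end of the utterance.
    """
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = len(audio) // frame_len
    f0 = np.array([
        estimate_f0(audio[i * frame_len:(i + 1) * frame_len], sample_rate)
        for i in range(n_frames)
    ])
    f0 = f0[f0 > 0]  # drop frames without a pitch estimate
    if len(f0) < 2:
        raise ValueError("not enough frames with a pitch estimate")
    n_tail = min(len(f0) - 1, max(1, int(round(len(f0) * tail_fraction))))
    return float(np.median(f0[-n_tail:]) / np.median(f0[:-n_tail]))
```

Applying `final_pitch_ratio` to the statement and to the question for the same vocoder, and to the authors' reference files, gives numbers that can be compared directly.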

#### Claim 7: Tacotron 2 is good at tongue twisters.

In [48]:
text = "Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?"
# Reference audio from the authors' website
Audio("/home/cc/Tacotron2/peterpiper.wav")

In [49]:
tts_tacotron2_melgan(text)

In [50]:
tts_tacotron2_melgan_stft(text)

In [51]:
tts_tacotron2_mb_melgan(text)

Here all the models perform close to the authors' reference.

## Overall conclusions

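For some of these test cases, the claim is validated: Claims 1, 2 and 7 held for every vocoder, and Claim 4 held for most configurations. For others, it is not: none of the configurations reproduced the stress (Claim 5) or question prosody (Claim 6) of the authors' samples. The results suggest that the choice of vocoder used with Tacotron 2 may affect the validity of these claims.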