# **DNN Model 6 - Text to Speech Conversion**

**This notebook experiments with the TTS generation section.**

**The TensorFlowTTS package from TensorSpeech provides state-of-the-art (SOTA) models that are readily available for inference as TTS generators. This notebook experiments with two of the most promising models, Tacotron2 and FastSpeech2. Each of them outputs mel spectrograms, which we convert to speech with a MelGAN vocoder. Note that all of these models were trained on the LJSpeech dataset.**

## **Setup**

In [1]:
!pip install TensorFlowTTS
!pip install tensorflow-addons
!pip install git+https://github.com/repodiac/german_transliterate

Collecting TensorFlowTTS
  Downloading TensorFlowTTS-1.8-py3-none-any.whl (128 kB)
Collecting pyworld>=0.2.10
  Downloading pyworld-0.3.0.tar.gz (212 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)
Collecting inflect>=4.1.0
  Downloading inflect-5.3.0-py3-none-any.whl (32 kB)
Collecting g2pM
  Downloading g2pM-0.1.2.5-py3-none-any.whl (1.7 MB)
Collecting tensorflow-gpu==2.6.0
  Downloading tensorflow_gpu-2.6.0-cp37-cp37m-manylinux2010_x86_64.whl (458.3 MB)
Collectin

## **Imports**

In [2]:
import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import AutoProcessor
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoConfig

from IPython.display import Audio

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


## **Load Pretrained Models**

In [3]:
# MelGAN vocoder: converts the predicted mel spectrograms to audio
melgan = TFAutoModel.from_pretrained("tensorspeech/tts-melgan-ljspeech-en")

In [4]:
# Tacotron2
tacotron2_processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")
tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")

In [5]:
# FastSpeech2
fastspeech2_processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

### Prepare the Input pipeline

Each model has its own processor, which we use to convert the input text into the sequence of symbol IDs that the model expects.

In [6]:
# Utility function that returns the symbol IDs for both models
def get_input_ids(text):
  tacotron2_ids = tacotron2_processor.text_to_sequence(text)
  fastspeech2_ids = fastspeech2_processor.text_to_sequence(text)

  return tacotron2_ids, fastspeech2_ids

In [8]:
# Sample text
text = "Recent research at Harvard has shown meditating\
for as little as 8 weeks, can actually increase the grey matter in the \
parts of the brain responsible for emotional regulation, and learning.\
World is will be a better place. We all need to make sure that it does."
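One thing worth checking in the cell above: a backslash at the end of a line inside a string literal continues the string without inserting a space, so the words on either side of the break are fused. The sketch below uses a shortened, illustrative string rather than the full sample. It shows the effect and one possible alternative, on the assumption that a space was intended at each break.

```python
# A trailing backslash inside a string literal joins the next line directly,
# with no space inserted between the two parts.
fused = "has shown meditating\
for as little as 8 weeks"

# Possible alternative (assuming a space was intended): adjacent string
# literals inside parentheses, each part ending with an explicit space.
spaced = (
    "has shown meditating "
    "for as little as 8 weeks"
)

print(fused)
print(spaced)
```

The first string contains `meditatingfor`, the second `meditating for`. This may matter for TTS, since the models receive exactly the characters in the string.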

In [9]:
tacotron2_ids, fastspeech2_ids = get_input_ids(text)

In [10]:
print(tacotron2_ids)
print(fastspeech2_ids)

[55, 42, 40, 42, 51, 57, 11, 55, 42, 56, 42, 38, 55, 40, 45, 11, 38, 57, 11, 45, 38, 55, 59, 38, 55, 41, 11, 45, 38, 56, 11, 56, 45, 52, 60, 51, 11, 50, 42, 41, 46, 57, 38, 57, 46, 51, 44, 43, 52, 55, 11, 38, 56, 11, 49, 46, 57, 57, 49, 42, 11, 38, 56, 11, 42, 46, 44, 45, 57, 11, 60, 42, 42, 48, 56, 6, 11, 40, 38, 51, 11, 38, 40, 57, 58, 38, 49, 49, 62, 11, 46, 51, 40, 55, 42, 38, 56, 42, 11, 57, 45, 42, 11, 44, 55, 42, 62, 11, 50, 38, 57, 57, 42, 55, 11, 46, 51, 11, 57, 45, 42, 11, 53, 38, 55, 57, 56, 11, 52, 43, 11, 57, 45, 42, 11, 39, 55, 38, 46, 51, 11, 55, 42, 56, 53, 52, 51, 56, 46, 39, 49, 42, 11, 43, 52, 55, 11, 42, 50, 52, 57, 46, 52, 51, 38, 49, 11, 55, 42, 44, 58, 49, 38, 57, 46, 52, 51, 6, 11, 38, 51, 41, 11, 49, 42, 38, 55, 51, 46, 51, 44, 7, 60, 52, 55, 49, 41, 11, 46, 56, 11, 60, 46, 49, 49, 11, 39, 42, 11, 38, 11, 39, 42, 57, 57, 42, 55, 11, 53, 49, 38, 40, 42, 7, 11, 60, 42, 11, 38, 49, 49, 11, 51, 42, 42, 41, 11, 57, 52, 11, 50, 38, 48, 42, 11, 56, 58, 55, 42, 11, 57,
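The printed IDs suggest a character-level encoding of the lower-cased text. For example, `11` appears wherever the text has a space, and `6` and `7` line up with the comma and the full stop. The following is a minimal sketch of such a mapping, not the library's implementation. `SYMBOLS` and `toy_text_to_sequence` are hypothetical names, and the table layout is an assumption. Only the IDs for the space, the comma, the full stop and the lower-case letters can be checked against the output above.

```python
import string

# Hypothetical symbol table, reconstructed from the IDs printed above:
# a pad symbol, a dash, punctuation, then upper- and lower-case letters.
# The real processor's table may contain further symbols.
SYMBOLS = (
    ["_", "-"]
    + list("!'(),.:;? ")
    + list(string.ascii_uppercase + string.ascii_lowercase)
)
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def toy_text_to_sequence(text):
    """Lower-case the text and map each known character to its ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


print(toy_text_to_sequence("Recent research"))
```

For `"Recent research"` this reproduces the first fifteen IDs shown above. The sketch silently drops characters it does not know, such as digits. The real processor appears to spell numbers out instead, since the IDs for "8" in the output correspond to the letters of "eight".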

## **Inference**

In [13]:
# Inference on Tacotron2
# Arguments: input IDs with a batch axis added, the sequence length, and the speaker ID
_, tacotron2_mel_outputs, _, _ = tacotron2.inference(
    tf.expand_dims(tf.convert_to_tensor(tacotron2_ids, dtype=tf.int32), 0),
    tf.convert_to_tensor([len(tacotron2_ids)], tf.int32),
    tf.convert_to_tensor([0], dtype=tf.int32),
)
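The call above passes a batch of one: the ID list gets a leading batch axis, and the length tensor holds that single sequence's length. To illustrate why a separate length argument exists, here is a plain-Python sketch of how several sequences could be packed into one rectangular batch. `make_batch` is a hypothetical helper, and padding with ID `0` is an assumption. This notebook does not verify whether the pretrained model handles padded batches well.

```python
def make_batch(sequences, pad_id=0):
    """Pad ID sequences to a common length and record the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# A batch of one, as in the cell above: one row plus a one-element length list.
ids = [55, 42, 40, 42, 51, 57]
batch, lengths = make_batch([ids])
print(batch, lengths)

# Two sequences of different lengths share one rectangular batch;
# the lengths tell the model where the real symbols end.
batch, lengths = make_batch([ids, ids[:3]])
print(batch, lengths)
```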

In [15]:
# Inference on FastSpeech2
fastspeech2_mel_before, fastspeech2_mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(fastspeech2_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
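FastSpeech2 also returns `duration_outputs`, which can be read as the number of mel frames predicted for each input symbol. As a rough worked example, the sketch below converts such frame counts into an audio length. The hop size of 256 samples per frame is an assumption that should be checked against the model configuration, and the durations are made-up values, not model output.

```python
SAMPLE_RATE = 22050  # playback rate used later in this notebook
HOP_SIZE = 256       # assumed samples per mel frame; check the model config


def frames_to_seconds(durations, hop_size=HOP_SIZE, sample_rate=SAMPLE_RATE):
    """Convert per-symbol frame counts to the total audio length in seconds."""
    total_frames = sum(durations)
    return total_frames * hop_size / sample_rate


# Hypothetical durations for six input symbols (not taken from the model).
example_durations = [5, 7, 6, 4, 8, 10]
print(round(frames_to_seconds(example_durations), 3))
```

Under these assumptions, 40 frames correspond to 40 × 256 / 22050 ≈ 0.464 seconds of audio.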

## **Audio Generation**

In [19]:
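# Vocode both mel spectrograms with MelGAN.
# [0, :, 0] selects the first batch item and drops the channel axis.
audio_outputs_tacotron2 = melgan(tacotron2_mel_outputs)[0, :, 0]
# Note: this uses the FastSpeech2 output before the post-net (fastspeech2_mel_after is also available).
audio_outputs_fastspeech2 = melgan(fastspeech2_mel_before)[0, :, 0]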

### Audio outputs

In [20]:
Audio(data=audio_outputs_tacotron2, rate=22050)

In [21]:
Audio(data=audio_outputs_fastspeech2, rate=22050)
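The `Audio` widgets only play the result inside the notebook. To keep a waveform on disk, it can be written as a WAV file. The sketch below does this with the standard library alone and uses a synthetic tone as a stand-in for model output. `write_wav` and the file name are hypothetical.

```python
import math
import struct
import wave


def write_wav(path, samples, sample_rate=22050):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # clip to the valid range
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone instead of model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
write_wav("example_tone.wav", tone)
```

With the model output, something like `write_wav("tacotron2.wav", audio_outputs_tacotron2.numpy())` should work, assuming the vocoder returns values in [-1, 1]. Since `soundfile` is already imported in this notebook, `sf.write("tacotron2.wav", audio_outputs_tacotron2.numpy(), 22050)` would likely be the shorter route.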