[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/tulasiram58827/TTS_TFLite/blob/main/End_to_End_TTS.ipynb)

This notebook provides an end-to-end Text-to-Speech (TTS) pipeline and lets you choose both the TTS model and the vocoder.

## About TTS Architecture

Speech synthesis is commonly called Text-to-Speech (TTS). Synthesizing speech from text generally consists of two steps, with a spectrogram as the intermediate representation.
Spectrograms are one of the most commonly used features for representing speech.
- `TTS model` : The TTS model generates mel spectrograms from text. (Don't worry about what "mel" means for now.)


- `Vocoder` : The vocoder generates audio from spectrograms. There are different types of vocoders:
    - Algorithm-based
        - Griffin-Lim
    - Neural-network-based
        - MelGAN
        - MB-MelGAN
        - Parallel WaveGAN
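
The "mel" in "mel spectrogram" refers to a perceptual frequency scale on which equal steps sound roughly equally far apart to a listener. As a minimal sketch, the snippet below shows how 80 bands between 50 Hz and 7600 Hz (the `num_mels`, `mel_fmin`, and `mel_fmax` values reported by the audio processor in this notebook) would be spaced. It assumes the common HTK-style conversion formula; other variants exist, and the TTS library may use a different one. The function names are illustrative and not part of any library.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(num_mels, fmin, fmax):
    """Return num_mels center frequencies (Hz), equally spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    # num_mels bands need num_mels + 2 edge points; the centers are the interior points
    step = (hi - lo) / (num_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_mels)]

centers = mel_band_centers(num_mels=80, fmin=50.0, fmax=7600.0)
# Neighbouring bands are close together at low frequencies and far apart at high ones.
print(round(centers[1] - centers[0], 1), round(centers[-1] - centers[-2], 1))
```

The point to take away is that a mel spectrogram spends most of its resolution on the lower frequencies, where speech carries most of its information.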

## Setup

In [None]:
!sudo apt-get install espeak
!pip install phonemizer

In [None]:
!git clone https://github.com/mozilla/TTS

%cd TTS
!git checkout c7296b3
!pip install -r requirements.txt
!python setup.py install
!pip install tensorflow==2.3.0rc0
%cd ..

## Imports

In [17]:
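import os
import torch
import time
import IPython
import numpy as np       # used by run_parallel_wavegan
import tensorflow as tf  # used by run_parallel_wavegan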

from TTS.tf.utils.tflite import load_tflite_model
from TTS.tf.utils.io import load_checkpoint
from TTS.utils.io import load_config
from TTS.utils.text.symbols import symbols, phonemes
from TTS.utils.audio import AudioProcessor
from TTS.utils.synthesis import synthesis

The models below are provided by the [Mozilla TTS repository](https://github.com/mozilla/TTS/), and you can use this [notebook](https://colab.research.google.com/github/mozilla/TTS/blob/master/notebooks/DDC_TTS_and_MultiBand_MelGAN_TFLite_Example.ipynb#scrollTo=4dnpE0-kvTsu) to generate the TFLite models.

- These models are trained on the `LJSpeech` dataset.

In [None]:
# Downloading Tacotron2 TFLite Model and its config
!gdown --id 17PYXCmTe0el_SLTwznrt3vOArNGMGo5v -O tts_model.tflite
!gdown --id 18CQ6G6tBEOfvCHlPqP8EBI4xWbrr9dBc -O config.json

# Downloading MB-MelGAN Vocoder TFLite Model and its config
!gdown --id 1aXveT-NjOM1mUr6tM4JfWjshq67GvVIO -O vocoder_model.tflite
!gdown --id 1Rd0R_nRCrbjEdpOwq6XwZAktvugiBvmu -O config_vocoder.json
!gdown --id 11oY3Tv0kQtxK_JPgxrfesa99maVXHNxU -O scale_stats.npy

## TFLite MB MelGAN Inference

In [1]:
def run_mb_melgan(mel_spec):
    vocoder_model = load_tflite_model("vocoder_model.tflite")
    # the config is loaded as in the original notebook but is not used below
    vocoder_config = load_config("config_vocoder.json")
    # add a batch dimension
    vocoder_inputs = mel_spec[None, :, :]
    # resize the input tensor to match this spectrogram's shape
    input_details = vocoder_model.get_input_details()
    vocoder_model.resize_tensor_input(input_details[0]['index'], vocoder_inputs.shape)
    vocoder_model.allocate_tensors()
    vocoder_model.set_tensor(input_details[0]['index'], vocoder_inputs)
    # run the model
    vocoder_model.invoke()
    # collect outputs
    output_details = vocoder_model.get_output_details()
    waveform = vocoder_model.get_tensor(output_details[0]['index'])
    return waveform

## Parallel WaveGAN Inference

In [2]:
def run_parallel_wavegan(melspec):
    # NOTE: 'parallel_wavegan.tflite' is not downloaded by the cells above;
    # it must already exist in the working directory.
    feats = np.expand_dims(melspec, 0)  # add a batch dimension
    interpreter = tf.lite.Interpreter(model_path='parallel_wavegan.tflite')
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # resize the input tensor to match this spectrogram's shape
    interpreter.resize_tensor_input(input_details[0]['index'], [1, feats.shape[1], feats.shape[2]], strict=True)
    interpreter.allocate_tensors()
    interpreter.set_tensor(input_details[0]['index'], feats)
    # run the model
    interpreter.invoke()
    # collect outputs
    output = interpreter.get_tensor(output_details[0]['index'])
    return output

## Tacotron2 Inference

In [1]:
def run_tacotron2(text):
    use_cuda = False
    TTS_MODEL = "tts_model.tflite"
    TTS_CONFIG = load_config("config.json")
    ap = AudioProcessor(**TTS_CONFIG.audio)
    speaker_id = None
    model = load_tflite_model(TTS_MODEL)
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(
        model, text, TTS_CONFIG, use_cuda, ap, speaker_id, style_wav=None,
        truncated=False, enable_eos_bos_chars=TTS_CONFIG.enable_eos_bos_chars,
        backend='tflite')
    # only the post-net mel spectrogram and the sample rate are needed downstream
    return mel_postnet_spec, TTS_CONFIG.audio['sample_rate']

## TTS Inference Helper

In [3]:
def run_tts_inference(text, model_name='Tacotron2', vocoder_name='MB-MelGAN'):
    # step 1: text -> mel spectrogram
    if model_name == 'Tacotron2':
        tac_output, sample_rate = run_tacotron2(text)
    else:
        raise ValueError(f"Unsupported TTS model: {model_name}")
    # step 2: mel spectrogram -> waveform
    if vocoder_name == 'MB-MelGAN':
        waveform = run_mb_melgan(tac_output.T)
        waveform = waveform[0, 0]
    elif vocoder_name == 'PWGAN':
        waveform = run_parallel_wavegan(tac_output.T)
        waveform = waveform[0, :, 0]
    else:
        raise ValueError(f"Unsupported vocoder: {vocoder_name}")

    IPython.display.display(IPython.display.Audio(waveform, rate=sample_rate))
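
The helper above only plays the audio inline. If you also want to keep the result, one option is to write the samples to a WAV file. The sketch below uses only the Python standard library and assumes the waveform is a 1-D sequence of floats in the range [-1.0, 1.0]; that range is an assumption about the vocoder output, and `save_wav` is a hypothetical helper that is not part of the TTS library. The helper above would also need to return `waveform` for this to be used with it.

```python
import struct
import wave

def save_wav(samples, sample_rate, path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))

# Stand-in for a vocoder output: one second of silence at 22050 Hz.
save_wav([0.0] * 22050, 22050, "tts_output.wav")
```

Looping over samples in Python is slow for long clips; it is used here only to keep the sketch dependency-free.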
    

## Choose model

In [4]:
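# Only "Tacotron2" is handled by run_tts_inference; the other TTS options are not wired up in this notebook.
tts_model = 'Tacotron2' #@param ["Tacotron2", "FastSpeech", "FastSpeech2", "Glow-TTS"]

# run_tts_inference handles "MB-MelGAN" and "PWGAN"; "Mel-GAN" is not wired up in this notebook.
vocoder_model = 'MB-MelGAN' #@param ["Mel-GAN", "MB-MelGAN"]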

## Inference

In [44]:
text = "Bill got in the habit of asking himself “Is that thought true?” and if he wasn’t absolutely certain it was, he just let it go."

run_tts_inference(text, tts_model, vocoder_model)

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024
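
The `sample_rate` and `hop_length` values in this log determine how spectrogram frames map to audio time. As a rough sketch, assuming the vocoder produces exactly `hop_length` samples per mel frame (typical for MelGAN-style vocoders, but not verified here for these specific models), the duration of a clip can be estimated from its frame count. The frame count of 500 is a made-up example, and `frames_to_seconds` is an illustrative name.

```python
SAMPLE_RATE = 22050  # sample_rate from the log above
HOP_LENGTH = 256     # hop_length from the log above

def frames_to_seconds(num_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Estimate audio duration, assuming hop_length samples per mel frame."""
    return num_frames * hop_length / sample_rate

# Time covered by a single mel frame, in milliseconds.
frame_ms = 1000.0 * HOP_LENGTH / SAMPLE_RATE
print(f"one frame ~ {frame_ms:.1f} ms")
print(f"500 frames ~ {frames_to_seconds(500):.2f} s")
```

Under that assumption, each frame covers about 11.6 ms of audio, so a 500-frame spectrogram corresponds to roughly 5.8 seconds.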


## Benchmarks

### MB-MelGAN

- Inference Time : 1.9 ms
- Memory Footprint : 15 MB
- Model Size : 9.7 MB

### Tacotron2

- Inference Time :

- Memory Footprint :

- Model Size : 28.67 MB

### Parallel WaveGAN

#### Dynamic Range Quantization

- Inference Time :

- Memory Footprint :

- Model Size : 5.7 MB

#### Float16 Quantization

- Inference Time :

- Memory Footprint :

- Model Size : 3.2 MB
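
The notebook does not state how these figures were measured. As one possible approach, the sketch below times a callable over several runs and reads a model's size from disk. `time_inference` and `file_size_mb` are hypothetical helpers, and the workload used here is a stand-in so that the snippet runs on its own.

```python
import os
import time

def time_inference(fn, *args, warmup=1, runs=5):
    """Return the mean wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) * 1000.0 / runs

def file_size_mb(path):
    """Return the size of a file in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)

# Stand-in workload; in this notebook it could be something like
# time_inference(run_mb_melgan, mel_spec) and file_size_mb("vocoder_model.tflite").
elapsed_ms = time_inference(sum, range(100_000))
print(f"{elapsed_ms:.3f} ms per call")
```

Note that `run_mb_melgan` and `run_parallel_wavegan` load the model on every call, so timing them this way would include model loading as well as the `invoke()` step.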