<a href="https://colab.research.google.com/github/IsinghGitHub/CellStrat/blob/master/TTS_V2_using_Tacotron_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Tacotron-2:**

TensorFlow implementation of Google's Tacotron-2, a deep neural network architecture described in the paper [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/pdf/1712.05884.pdf).

This repository contains additional improvements and experiments beyond the paper. The paper_hparams.py file therefore holds the exact hyperparameters needed to reproduce the paper's results without any extras.

The suggested hparams.py file, which is used by default, holds the hyperparameters with the extras that gave better results in most cases. Feel free to adjust the parameters as needed.

DIFFERENCES WILL BE HIGHLIGHTED IN DOCUMENTATION SHORTLY.

**Steps to run the code:**

Step (0): Get your dataset. The examples here are set up for LJSpeech, en_US and en_UK (from M-AILABS).

Step (1): Preprocess your data. This will give you the training_data folder.

Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.

Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.

Step (4): Train your WaveNet model. Yields the logs-Wavenet folder.

Step (5): Synthesize audio using the Wavenet model. Gives the wavenet_output folder.

Note: Steps 2, 3, and 4 can be done in a single run for both Tacotron and WaveNet (Tacotron-2, step ( * )).
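
As a rough way to see how far along the pipeline a working directory is, the sketch below checks for the output folder that each step above is said to produce. The function names are illustrative and not part of the repository, and a folder's presence only suggests that a step was started, not that it finished successfully.

```python
from pathlib import Path

# Output folder produced by each step, as listed in the steps above.
STEP_OUTPUTS = [
    ("Step 1: preprocess", "training_data"),
    ("Step 2: train Tacotron", "logs-Tacotron"),
    ("Step 3: synthesize with Tacotron", "tacotron_output"),
    ("Step 4: train WaveNet", "logs-Wavenet"),
    ("Step 5: synthesize with WaveNet", "wavenet_output"),
]


def pipeline_status(root="."):
    """Return (step, folder, exists) for each step's output folder under root."""
    root = Path(root)
    return [(step, folder, (root / folder).is_dir()) for step, folder in STEP_OUTPUTS]


def next_step(root="."):
    """Return the first step whose output folder is missing, or None if all exist."""
    for step, _folder, done in pipeline_status(root):
        if not done:
            return step
    return None
```

Calling `next_step()` from the repository root returns the label of the first step that has not yet produced its folder.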

In [0]:
!sudo apt-get update -y

In [0]:
!apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg

In [0]:
!sudo apt-get update

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Change the working directory to the Tacotron-2 directory.

In [0]:
import os

# Path of the Tacotron-2 repository inside the mounted Google Drive.
os.chdir('/content/drive/My Drive/CellStrat/TTS/Tacotron-2')

In [0]:
!ls

## **For now, ignore any errors about unmatched requirements reported by the install step**


In [0]:
!pip install -q -r requirements.txt

Download the LJSpeech dataset and extract it (the command below expects LJSpeech-1.1.tar in the current directory).

In [0]:
!tar -xvf LJSpeech-1.1.tar
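
Before preprocessing, it can help to confirm that the archive extracted completely. The sketch below assumes the standard LJSpeech 1.1 layout: a `LJSpeech-1.1` folder containing a pipe-delimited `metadata.csv` whose first field is the clip id, plus a `wavs` folder with one `<clip id>.wav` per row. The function name `check_ljspeech` is illustrative and not part of the repository.

```python
from pathlib import Path


def check_ljspeech(dataset_dir="LJSpeech-1.1"):
    """Return (number of metadata rows, list of clip ids with no wav file)."""
    dataset_dir = Path(dataset_dir)
    total = 0
    missing = []
    with open(dataset_dir / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Assumed row format: <clip id>|<transcript>|<normalized transcript>
            clip_id = line.split("|", 1)[0]
            total += 1
            if not (dataset_dir / "wavs" / f"{clip_id}.wav").is_file():
                missing.append(clip_id)
    return total, missing
```

An empty `missing` list suggests that every clip named in the metadata is present on disk.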

**Preprocessing**

In [0]:
!python preprocess.py

## Training:

### To train both models sequentially (one after the other):

In [0]:
!python train.py --model='Tacotron-2'

### Feature prediction model can separately be trained using:

In [0]:
!python train.py --model='Tacotron'

Checkpoints are saved every 5000 steps and stored under the logs-Tacotron folder.
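
To see which checkpoints have been written so far, something like the following could be used. It is a sketch under an assumption not stated in this notebook: that checkpoints are saved in the usual TensorFlow 1.x style, with an index file named `<prefix>.ckpt-<step>.index` somewhere under the log folder. The helper name `checkpoint_steps` is hypothetical.

```python
import re
from pathlib import Path

# Assumed TensorFlow-style checkpoint index file name: <prefix>.ckpt-<step>.index
CKPT_PATTERN = re.compile(r"\.ckpt-(\d+)\.index$")


def checkpoint_steps(log_dir="logs-Tacotron"):
    """Return the sorted training steps of checkpoints found anywhere under log_dir."""
    steps = set()
    for path in Path(log_dir).rglob("*.index"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)
```

The last element of the returned list, if any, is the most recent checkpoint step.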

Likewise, the WaveNet model can be trained separately using:

In [0]:
!python train.py --model='WaveNet'

Logs will be stored inside the logs-Wavenet folder.

Note:

1. If the model argument is not provided, training defaults to Tacotron-2 model training (both models).
2. Refer to the training arguments in train.py for the set of options you can use.
3. WaveNet preprocessing can now be run on its own using wavenet_preprocess.py.

## Synthesis

To synthesize audio in an End-to-End (text to audio) manner (both models at work):

In [0]:
!python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network (separately), there are three types of mel spectrogram synthesis:



Evaluation (synthesis on custom sentences). This is what is usually used once a full end-to-end model is available.


In [0]:
!python synthesize.py --model='Tacotron'

Natural synthesis (the model makes predictions on its own, feeding the last decoder output to the next time step).


In [0]:
!python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False

Ground Truth Aligned synthesis (DEFAULT: the model is assisted by the true labels in a teacher-forcing manner). This synthesis method is used when predicting the mel spectrograms that train the WaveNet vocoder; as stated in the paper, it yields better results.

In [0]:
!python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
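
The difference between natural and Ground Truth Aligned synthesis can be illustrated with a toy autoregressive loop. This is only a sketch: `toy_decoder_step` is a hypothetical stand-in for a decoder step with a deliberately imperfect prediction rule, single numbers stand in for mel frames, and none of this is the repository's code.

```python
def toy_decoder_step(prev_frame):
    """Hypothetical decoder step: predicts the next frame from the previous one.

    The true sequence used below grows by 1.0 per step, so this rule is
    slightly wrong on purpose.
    """
    return 0.9 * prev_frame + 1.0


def natural_synthesis(first_frame, n_steps):
    """Free-running: each step is fed the model's own previous prediction."""
    outputs, prev = [], first_frame
    for _ in range(n_steps):
        prev = toy_decoder_step(prev)
        outputs.append(prev)
    return outputs


def gta_synthesis(first_frame, target_frames):
    """Teacher forcing: each step is fed the true previous frame."""
    outputs, prev = [], first_frame
    for target in target_frames:
        outputs.append(toy_decoder_step(prev))
        prev = target  # feed the ground truth, not the prediction
    return outputs


targets = [1.0, 2.0, 3.0, 4.0]
free_running = natural_synthesis(0.0, len(targets))
teacher_forced = gta_synthesis(0.0, targets)
```

In this toy setup the free-running error grows from step to step because each mistake is fed back in, while the teacher-forced output stays closer to the targets and has exactly one prediction per ground-truth frame.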


Synthesizing the waveforms conditioned on previously synthesized mel spectrograms (separately) can be done with:



In [0]:
!python synthesize.py --model='WaveNet'

Note:

1. If the model argument is not provided, synthesis defaults to Tacotron-2 model synthesis (end-to-end TTS).
2. Refer to the synthesis arguments in synthesize.py for the set of options you can use.
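
When several synthesis runs are scripted from Python rather than typed as shell cells, a small helper can assemble the command. The sketch below uses only the `--model`, `--mode` and `--GTA` flags that appear in the cells above; the helper name `synthesize_command` is illustrative and not part of the repository, and it does not validate the `--mode` value.

```python
VALID_MODELS = ("Tacotron-2", "Tacotron", "WaveNet")


def synthesize_command(model="Tacotron-2", mode=None, gta=None):
    """Build the argument list for synthesize.py from the flags shown above."""
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {model!r}")
    cmd = ["python", "synthesize.py", f"--model={model}"]
    if mode is not None:
        cmd.append(f"--mode={mode}")
    if gta is not None:
        cmd.append(f"--GTA={gta}")  # True/False are rendered as 'True'/'False'
    return cmd
```

The resulting list can be passed to `subprocess.run(..., check=True)`, for example `subprocess.run(synthesize_command("Tacotron", mode="synthesis", gta=True), check=True)`.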

## References and Resources:

Natural TTS synthesis by conditioning Wavenet on MEL spectrogram predictions

Original tacotron paper

Attention-Based Models for Speech Recognition

Wavenet: A generative model for raw audio

Fast Wavenet

r9y9/wavenet_vocoder

keithito/tacotron