# Tacotron: Towards End-to-End Speech Synthesis

Link: https://arxiv.org/abs/1703.10135

Authors: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

Institution: Google, Inc.

Publication: arXiv

Date: 6 Apr 2017




## Background Materials

tacos & sushi


## What is this paper about?


This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters.

## What is the motivation of this research?

Traditional TTS pipelines are complex. For example, they typically include a text frontend that extracts various linguistic features, a duration model, an acoustic feature prediction model, and a signal-processing-based vocoder. These components require domain expertise and are laborious to design.

## What makes this paper different from previous research?


Tacotron uses seq2seq with attention and does not require phoneme-level alignment, so there is no need for a pre-trained aligner.

Tacotron directly predicts a raw spectrogram, so it does not rely on a conventional vocoder.

Tacotron is a fully end-to-end model that learns to predict spectrograms directly from `<text, audio>` pairs, so it can be trained completely from scratch.

Tacotron predicts spectrograms at the frame level, so it is significantly faster than sample-level models, at the cost of relatively lower naturalness.

## How does this paper achieve it?

Tacotron consists of:

- character embedding layer
- pre-net Encoder
- CBHG Encoder
- attention RNN
- pre-net Decoder
- Decoder RNN
- CBHG post-processing net

<img src="img/Tacotron-Towards_End-to-End_Speech_Synthesis_Figure1.png" width="600">

### CBHG module

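The CBHG module is inspired by work on machine translation (Lee et al., 2016). It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit (GRU) RNN.

To model local and contextual information, the input sequence is first convolved with K sets of 1-D convolutional filters, where the k-th set contains $C_k$ filters of width $k$ (k = 1, 2, ..., K). This is akin to modeling unigrams, bigrams, and so on up to K-grams.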

To increase local invariance, the convolution outputs are stacked together and max pooled along time, with a stride of 1 so that the original time resolution is preserved.
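
The convolution bank and pooling step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the helper names (`conv1d_same`, `conv_bank_maxpool`), the toy sizes, the ReLU activation, and the pooling width of 2 are assumptions made for the example, and batch normalization is omitted.

```python
import numpy as np

def conv1d_same(x, w):
    """Slide w (width, C_in, C_out) over x (T, C_in), keeping T frames.

    As in most deep learning frameworks, this is a cross-correlation.
    """
    width = w.shape[0]
    left = (width - 1) // 2
    right = width - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))  # zero padding in time
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        # Contract the (width, C_in) window against the filter weights.
        out[t] = np.tensordot(padded[t:t + width], w, axes=([0, 1], [0, 1]))
    return out

def conv_bank_maxpool(x, filters):
    """filters[k - 1] holds the k-th filter set, whose width is k."""
    # Assumed ReLU after each convolution; outputs are stacked on channels.
    bank = [np.maximum(conv1d_same(x, w), 0.0) for w in filters]
    stacked = np.concatenate(bank, axis=1)
    # Max pool along time with width 2 and stride 1, so T is unchanged.
    padded = np.pad(stacked, ((0, 1), (0, 0)), constant_values=-np.inf)
    return np.maximum(padded[:-1], padded[1:])

rng = np.random.default_rng(0)
K, T, C_in, C_k = 4, 10, 8, 6  # toy sizes, chosen only for illustration
filters = [0.1 * rng.standard_normal((k, C_in, C_k)) for k in range(1, K + 1)]
features = conv_bank_maxpool(rng.standard_normal((T, C_in)), filters)
print(features.shape)  # (10, 24)
```

Each filter set contributes `C_k` channels, so the stacked output has `K * C_k` channels while the number of frames stays at `T`.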

The processed sequence is then passed to a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections.

To extract high-level features, the convolution outputs are fed into a multi-layer highway network.
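
A highway layer mixes a transformed version of its input with the untouched input through a learned gate, following the Highway Networks formulation listed under Further Readings. The sketch below is illustrative only: the layer width, the number of layers, the ReLU transform, and the negative gate bias are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, applied to each frame of x."""
    h = np.maximum(x @ w_h + b_h, 0.0)  # candidate transform H(x), assumed ReLU
    t = sigmoid(x @ w_t + b_t)          # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x        # the carry gate is 1 - T(x)

def highway_network(x, layers):
    """Apply several highway layers in sequence."""
    for w_h, b_h, w_t, b_t in layers:
        x = highway_layer(x, w_h, b_h, w_t, b_t)
    return x

rng = np.random.default_rng(0)
dim, n_layers = 16, 4  # toy sizes, chosen only for illustration
layers = [
    (0.1 * rng.standard_normal((dim, dim)), np.zeros(dim),
     0.1 * rng.standard_normal((dim, dim)), np.full(dim, -1.0))
    for _ in range(n_layers)
]
out = highway_network(rng.standard_normal((10, dim)), layers)
print(out.shape)  # (10, 16)
```

Because the gate interpolates between `H(x)` and `x`, the input and output widths of a highway layer must match.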

To extract sequential features from both forward and backward context, the high-level feature outputs are fed into a bidirectional GRU RNN.
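
A bidirectional GRU runs one GRU from left to right and another from right to left, then concatenates the two hidden states at every frame. The sketch below assumes one common GRU gating convention and uses hypothetical helper names (`init_params`, `gru_step`, `run_gru`) and toy sizes; it is not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, in_dim, hidden):
    """Random weights for the update (z), reset (r) and candidate (h) parts."""
    p = {}
    for gate in "zrh":
        p["W" + gate] = 0.1 * rng.standard_normal((in_dim, hidden))
        p["U" + gate] = 0.1 * rng.standard_normal((hidden, hidden))
        p["b" + gate] = np.zeros(hidden)
    return p

def gru_step(x, h, p):
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])           # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])           # reset gate
    cand = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])  # candidate state
    return (1.0 - z) * h + z * cand

def run_gru(xs, p, hidden):
    """Run a GRU over xs (T, in_dim) and return all hidden states (T, hidden)."""
    h = np.zeros(hidden)
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return np.stack(states)

def bidirectional_gru(xs, p_fwd, p_bwd, hidden):
    fwd = run_gru(xs, p_fwd, hidden)
    # Process the reversed sequence, then flip back to the original order.
    bwd = run_gru(xs[::-1], p_bwd, hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
T, in_dim, hidden = 7, 5, 4  # toy sizes, chosen only for illustration
xs = rng.standard_normal((T, in_dim))
p_fwd = init_params(rng, in_dim, hidden)
p_bwd = init_params(rng, in_dim, hidden)
out = bidirectional_gru(xs, p_fwd, p_bwd, hidden)
print(out.shape)  # (7, 8)
```

At each frame the first `hidden` values summarize everything to the left and the remaining values summarize everything to the right.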

<img src="img/Tacotron-Towards_End-to-End_Speech_Synthesis_Figure2.png" width="300">



### Encoder

The goal of the encoder is to extract robust sequential representations of text. A CBHG-based encoder reduces overfitting and makes fewer mispronunciations than a standard multi-layer RNN encoder.

The input to the encoder is a character sequence in which each character is represented as a one-hot vector and embedded into a continuous vector.

A "pre-net" applies a set of non-linear transformations to each embedding. A bottleneck layer with dropout is used as the pre-net in this study.

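As a rough sketch, a bottleneck pre-net can be written as two fully-connected ReLU layers, each followed by dropout, where the second layer is narrower than the first. The layer sizes (256 then 128) and the dropout rate of 0.5 reflect my reading of the paper's hyper-parameter table and should be treated as illustrative; the function and argument names are hypothetical.

```python
import numpy as np

def prenet(x, w1, b1, w2, b2, drop_rate=0.5, training=True, rng=None):
    """Two ReLU layers with dropout; the second layer is the bottleneck."""
    rng = np.random.default_rng() if rng is None else rng

    def dropout(h):
        if not training:
            return h
        # Inverted dropout: zero some units and rescale the survivors.
        keep = rng.random(h.shape) >= drop_rate
        return h * keep / (1.0 - drop_rate)

    h = dropout(np.maximum(x @ w1 + b1, 0.0))
    return dropout(np.maximum(h @ w2 + b2, 0.0))

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((12, 256))  # 12 characters, 256-D embeddings
w1, b1 = 0.05 * rng.standard_normal((256, 256)), np.zeros(256)
w2, b2 = 0.05 * rng.standard_normal((256, 128)), np.zeros(128)
encoded = prenet(embeddings, w1, b1, w2, b2, training=False)
print(encoded.shape)  # (12, 128)
```

The pre-net is applied to each character embedding independently, so the sequence length is unchanged and only the feature width shrinks.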


### Decoder

The purpose of the decoder is to learn the alignment between the speech signal and the text.

A content-based tanh attention decoder (Vinyals et al., 2015) is used, where a stateful recurrent layer produces the attention query at each decoder timestep.
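
One decoder step of content-based tanh attention can be sketched as below, using the common additive form in which the score of encoder state `h_i` for query `d` is `v^T tanh(W1 h_i + W2 d)`. The weight names and toy dimensions are assumptions for illustration.

```python
import numpy as np

def tanh_attention(memory, query, w_mem, w_query, v):
    """Return the context vector and attention weights for one decoder step.

    memory: (T_in, D_mem) encoder outputs
    query:  (D_query,) output of the attention RNN at this step
    """
    # One score per encoder position: v^T tanh(W1 h_i + W2 d).
    scores = np.tanh(memory @ w_mem + query @ w_query) @ v
    scores = scores - scores.max()            # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ memory                # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
T_in, d_mem, d_query, d_att = 9, 6, 5, 4  # toy sizes, for illustration only
memory = rng.standard_normal((T_in, d_mem))
query = rng.standard_normal(d_query)
w_mem = rng.standard_normal((d_mem, d_att))
w_query = rng.standard_normal((d_query, d_att))
v = rng.standard_normal(d_att)
context, weights = tanh_attention(memory, query, w_mem, w_query, v)
print(context.shape, weights.shape)  # (6,) (9,)
```

The attention weights over all encoder positions, collected across decoder steps, form the soft alignment between text and speech frames.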

The input to the decoder RNN is the concatenation of the context vector and the attention RNN cell output.

The decoder itself is a stack of GRUs with vertical residual connections (Wu et al., 2016).

The authors did not choose a raw spectrogram as the decoder target, because it is a highly redundant representation for the purpose of learning the alignment between speech and text. Instead, they chose an 80-band mel-scale spectrogram as the seq2seq target. Because of this redundancy, different targets can be used for seq2seq decoding and for waveform synthesis, which leads to a highly general model that is agnostic to the waveform synthesis method.
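
To illustrate how an 80-band mel-scale target relates to a linear-frequency spectrogram, the sketch below builds a triangular mel filterbank and applies it to one magnitude frame. The mel formula used here (the HTK-style `2595 * log10(1 + f / 700)`), the triangular filter shape, and the sample rate and FFT size in the usage lines are assumptions for the example; the note above does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 points equally spaced on the mel scale give the edges
    # and centers of the triangles.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        low, center, high = hz_points[m:m + 3]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

bank = mel_filterbank(n_mels=80, n_fft=2048, sample_rate=24000)
linear_frame = np.ones(2048 // 2 + 1)  # a dummy linear-magnitude frame
mel_frame = bank @ linear_frame
print(bank.shape, mel_frame.shape)  # (80, 1025) (80,)
```

Each frame shrinks from 1025 linear-frequency bins to 80 mel bands, which gives the decoder a much more compact target.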

A simple fully-connected output layer is used to predict the decoder target.



### Post-processing net and waveform synthesis

The task of the post-processing net is to convert the seq2seq target into a target that can be synthesized into waveforms.

One motivation for the post-processing net is that it can see the full decoded sequence, both forward and backward, to correct prediction errors, whereas seq2seq always runs from left to right.

A CBHG module is used as the post-processing net.

In this work, the Griffin-Lim algorithm is used to synthesize the waveform from the predicted spectrogram. However, the concept of a post-processing net is highly general, so it could be used to predict alternative targets (e.g. vocoder parameters) or serve as a WaveNet-like neural vocoder.
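
The Griffin-Lim idea can be sketched as follows: start from a random phase, then repeatedly invert the spectrogram to a waveform, recompute its STFT, and keep only the new phase while restoring the known magnitude. This is a minimal NumPy sketch under stated assumptions: the `stft` and `istft` helpers, the Hann window, the frame and hop sizes, and the iteration count are all chosen for illustration and are not the paper's settings.

```python
import numpy as np

def stft(signal, n_fft, hop):
    """Return complex frames of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft, hop):
    """Least-squares overlap-add inverse of stft."""
    window = np.hanning(n_fft)
    length = n_fft + hop * (spec.shape[0] - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i, frame in enumerate(frames):
        start = i * hop
        signal[start:start + n_fft] += frame * window
        norm[start:start + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_fft, hop, n_iter=50, seed=0):
    """Estimate a waveform whose STFT magnitude is close to `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        rebuilt = stft(signal, n_fft, hop)
        # Keep the phase of the re-analysed signal, discard its magnitude.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase, n_fft, hop)

n_fft, hop = 256, 64
t = np.arange(n_fft + hop * 20)
tone = np.sin(2.0 * np.pi * 0.05 * t)  # toy input standing in for real speech
magnitude = np.abs(stft(tone, n_fft, hop))
waveform = griffin_lim(magnitude, n_fft, hop, n_iter=20)
print(waveform.shape)  # (1536,)
```

Only magnitudes enter `griffin_lim`, which is why a model that predicts a linear-scale magnitude spectrogram can be paired with it without predicting phase.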

### Experimental results

Generated samples can be found at https://google.github.io/tacotron/ .

## Dataset used in this study

- internal North American English dataset (~ 24.6 hours)


## Implementations

- https://github.com/Kyubyong/tacotron


## Further Readings

- Sequence to sequence learning with neural networks https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
- Highway Networks https://arxiv.org/abs/1505.00387
- Fully Character-Level Neural Machine Translation without Explicit Segmentation https://arxiv.org/abs/1610.03017