# Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Link: http://research.baidu.com/wp-content/uploads/2017/05/Deep-Voice-2-Complete-Arxiv.pdf

Authors: Sercan O. Arık, Andrew Gibiansky, Wei Ping, Gregory Diamos, John Miller, Jonathan Raiman, Kainan Peng, Yanqi Zhou

Institution: Baidu Silicon Valley Artificial Intelligence Lab

Publication: 31st Conference on Neural Information Processing Systems (NIPS 2017)

Date: 2017



## Background Materials

- http://research.baidu.com/deep-voice-2-multi-speaker-neural-text-speech/


## What is this paper about?

Deep Voice 2 is an improved architecture based on Deep Voice 1.

The paper also introduces multi-speaker speech synthesis with trainable speaker embeddings, applied to both Deep Voice 2 and Tacotron.


## What is the motivation of this research?

Most TTS systems are built with a single speaker's voice, and multiple voices are provided by maintaining distinct databases or model parameters for each speaker.
This requires much more data and development effort than a single-voice system.

Deep Voice 2 is designed to share the majority of parameters between speakers and to require significantly less data per speaker.


## What makes this paper different from previous research?

Existing methods for multi-speaker TTS include an average voice model, i-vectors using per-speaker output layers, and speaker adaptation using GANs.

In this paper, trainable speaker embeddings are used instead, which is a novel approach. The speaker embeddings can be trained jointly with the rest of the model and thus can directly learn features relevant to the speech synthesis task.


## How does this paper achieve it?


### Single-Speaker Deep Voice 2

Deep Voice 2 keeps the general structure of Deep Voice 1.

One major difference is the separation of the phoneme duration and frequency models.

<img src="img/Deep_Voice_2_Multi-Speaker_Neural_Text-to-Speech_Figure1.png" width="600">

<img src="img/Deep_Voice_2_Multi-Speaker_Neural_Text-to-Speech_Figure5.png" width="600">

### Multi-Speaker Models with Trainable Speaker Embeddings

Speaker-dependent parameters are stored in a very low-dimensional vector and thus there is near-complete weight sharing between speakers.

Speaker embeddings are used to produce RNN initial states, nonlinearity biases, and multiplicative gating factors.

The following techniques were found to perform well:

- Site-Specific Speaker Embeddings: For every use site in the model architecture, transform the shared speaker embedding to the appropriate dimension and form through an affine projection and a nonlinearity.
- Recurrent Initialization: Initialize recurrent layer hidden states with site-specific speaker embeddings.
- Input Augmentation: Concatenate a site-specific speaker embedding to the input at every timestep of a recurrent layer.
- Feature Gating: Multiply layer activations elementwise with a site-specific speaker embedding to render adaptable information flow.
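
The four techniques above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the dimensions, the random weights, the choice of `tanh` as the nonlinearity, and the helper name `site_specific` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, HIDDEN, N_FEATS, T = 16, 32, 8, 5  # illustrative sizes (assumed)

# One shared low-dimensional embedding per speaker (trainable in a real model).
speaker_embedding = rng.normal(size=EMB_DIM)


def site_specific(embedding, weight, bias):
    """Affine projection followed by a nonlinearity (tanh is an assumption)."""
    return np.tanh(weight @ embedding + bias)


# Each use site owns its own projection, so the shared embedding can take
# whatever dimension that site needs.
W_init, b_init = rng.normal(size=(HIDDEN, EMB_DIM)), np.zeros(HIDDEN)
W_aug, b_aug = rng.normal(size=(N_FEATS, EMB_DIM)), np.zeros(N_FEATS)
W_gate, b_gate = rng.normal(size=(HIDDEN, EMB_DIM)), np.zeros(HIDDEN)

# Recurrent initialization: the initial hidden state comes from the speaker.
h0 = site_specific(speaker_embedding, W_init, b_init)

# Input augmentation: concatenate the embedding to the input at every timestep.
inputs = rng.normal(size=(T, N_FEATS))
g_aug = site_specific(speaker_embedding, W_aug, b_aug)
augmented = np.concatenate([inputs, np.tile(g_aug, (T, 1))], axis=1)

# Feature gating: multiply layer activations elementwise by the embedding.
activations = rng.normal(size=(T, HIDDEN))
g_gate = site_specific(speaker_embedding, W_gate, b_gate)
gated = activations * g_gate
```

Only `speaker_embedding` differs between speakers in this sketch; the projection weights are shared, which is what keeps the per-speaker parameter count low.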

Figure 2: Architecture for the multi-speaker (a) segmentation, (b) duration, and (c) frequency model.
<img src="img/Deep_Voice_2_Multi-Speaker_Neural_Text-to-Speech_Figure2.png" width="600">


### Segmentation model

Estimation of phoneme location is trained as an unsupervised learning problem, similar to Deep Voice 1.

The segmentation model is a convolutional-recurrent architecture with connectionist temporal classification (CTC) loss applied to classify phoneme pairs. The phoneme pairs are then used to extract the boundary between them.

The major changes in Deep Voice 2 are the addition of batch normalization and residual connections in the convolutional layers.

The segmentation model layers compute

$h^{(l)} = \mathrm{relu}(h^{(l-1)} + \mathrm{BN}(W^{(l)} \ast h^{(l-1)}))$

where $\mathrm{BN}$ is batch normalization (Ioffe and Szegedy, 2015), $h^{(l)}$ is the output of the $l$-th layer, $W^{(l)}$ is the convolutional filterbank, and $\ast$ is the convolution operator.


In the multi-speaker model, the batch-normalized activations are multiplied by a site-specific speaker embedding:

$h^{(l)} = \mathrm{relu}(h^{(l-1)} + \mathrm{BN}(W^{(l)} \ast h^{(l-1)}) \cdot g_s)$

where $g_s$ is a site-specific speaker embedding.
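
A rough NumPy sketch of one such layer follows. It assumes a single utterance shaped `(time, channels)`, batch normalization computed over the time axis with no learned scale or shift, and a 'same'-padded convolution that keeps the channel count unchanged so the residual sum is well defined. The function names are hypothetical and not taken from the paper.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each channel over the time axis (no learned scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def conv1d_same(h, W):
    """'Same'-padded 1-D convolution. h: (T, C); W: (K, C, C) with odd K.

    Implemented as cross-correlation, as in most deep learning libraries.
    """
    K = W.shape[0]
    pad = K // 2
    padded = np.pad(h, ((pad, pad), (0, 0)))
    out = np.zeros_like(h)
    for t in range(h.shape[0]):
        window = padded[t:t + K]                    # (K, C)
        out[t] = np.einsum("kc,kcd->d", window, W)  # sum over taps and inputs
    return out


def residual_layer(h, W, g_s=None):
    """relu(h + BN(W * h) . g_s); g_s=None gives the single-speaker form."""
    update = batch_norm(conv1d_same(h, W))
    if g_s is not None:
        update = update * g_s  # speaker-dependent feature gating
    return np.maximum(h + update, 0.0)


rng = np.random.default_rng(0)
h = rng.normal(size=(10, 4))      # 10 frames, 4 channels (illustrative)
W = rng.normal(size=(3, 4, 4))    # filter width 3
g_s = rng.uniform(size=4)         # stand-in for a site-specific embedding

out_single = residual_layer(h, W)
out_multi = residual_layer(h, W, g_s)
```

With `g_s` set to all ones the multi-speaker layer reduces to the single-speaker one, and with `g_s` set to zeros only the residual path `relu(h)` remains.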

### Duration Model

Duration prediction is formulated as a sequence labeling problem. The phoneme durations are discretized into log-scaled buckets, and each input phoneme is assigned the bucket label. The sequence is modeled with a conditional random field (CRF) with pairwise potentials (Lample et al., 2016).

In the multi-speaker model, speaker-dependent recurrent initialization and input augmentation are used. A site-specific embedding is used to initialize the RNN hidden states and to provide input to the first RNN layer.
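
The discretization step can be illustrated with a small sketch. The bucket count and the minimum and maximum durations below are made-up values chosen for the example, not the paper's settings, and the CRF itself is not shown.

```python
import math


def duration_to_bucket(duration_s, n_buckets=16, min_s=0.01, max_s=1.0):
    """Map a duration in seconds to a log-spaced bucket index in [0, n_buckets)."""
    clipped = min(max(duration_s, min_s), max_s)
    # Position of the duration on a log scale, normalized to [0, 1].
    position = math.log(clipped / min_s) / math.log(max_s / min_s)
    return min(int(position * n_buckets), n_buckets - 1)


def bucket_to_duration(index, n_buckets=16, min_s=0.01, max_s=1.0):
    """Return the geometric centre of a bucket, in seconds."""
    position = (index + 0.5) / n_buckets
    return min_s * (max_s / min_s) ** position


durations = [0.03, 0.08, 0.25]  # made-up phoneme durations in seconds
labels = [duration_to_bucket(d) for d in durations]
print(labels)  # [3, 7, 11]
```

Log spacing gives short phonemes finer resolution than long ones, so a fixed number of labels covers a wide range of durations.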

### Frequency Model

The frequency model consists of multiple layers:

Bidirectional gated recurrent unit (GRU) layers (Cho et al., 2014) generate hidden states from the input features. From the hidden states, the model predicts the probability that each frame is voiced. The hidden states are also used to make two separate normalized $F_0$ predictions, $f_{\mathrm{GRU}}$ and $f_{\mathrm{conv}}$.

$f_{\mathrm{GRU}}$ is made with a bidirectional GRU followed by an affine projection.

$f_{\mathrm{conv}}$ is made with multiple convolutions of varying widths.

The hidden states are also used to predict a mixture ratio $\omega$, which is used to weight and combine the two predictions:

$f = \omega f_{\mathrm{GRU}} + (1 - \omega)f_{\mathrm{conv}}$

The normalized prediction $f$ is then converted to the true frequency $F_0$ via

$F_0 = \mu_{F_0} + \sigma_{F_0} \cdot f$

where $\mu_{F_0}$ and $\sigma_{F_0}$ are the mean and standard deviation of $F_0$ for the speaker the model is trained on.



Since $F_0$ values vary greatly between speakers, the multi-speaker model treats the mean and standard deviation as trainable parameters and multiplies them by scaling terms that depend on the speaker embedding:

$F_0 = \mu_{F_0} \cdot (1 + \mathrm{softsign}(V_\mu^T g_f)) + \sigma_{F_0} \cdot (1 + \mathrm{softsign}(V_\sigma^Tg_f)) \cdot f$

where $g_f$ is a site-specific speaker embedding, and $V_\mu$ and $V_\sigma$ are trainable parameter vectors.

The softsign function is defined as
$\mathrm{softsign}(x) = \frac{x}{1 + |x|}$
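
The formulas in this section can be checked on a single frame with plain Python. All numeric values below (predictions, mixture ratio, mean, standard deviation, embedding, and parameter vectors) are invented for illustration, and `predict_f0` is a hypothetical helper name.

```python
def softsign(x):
    return x / (1.0 + abs(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_f0(f_gru, f_conv, omega, mu, sigma, g_f=None, v_mu=None, v_sigma=None):
    """Mix the two normalized predictions and map them back to Hz."""
    f = omega * f_gru + (1.0 - omega) * f_conv
    if g_f is not None:
        # Multi-speaker case: rescale mean and standard deviation per speaker.
        mu = mu * (1.0 + softsign(dot(v_mu, g_f)))
        sigma = sigma * (1.0 + softsign(dot(v_sigma, g_f)))
    return mu + sigma * f


# Single-speaker: f = 0.75 * 0.5 + 0.25 * (-0.5) = 0.25
single = predict_f0(0.5, -0.5, omega=0.75, mu=120.0, sigma=20.0)
print(single)  # 125.0

# Multi-speaker: the mean is scaled by 1.5 and the deviation by 0.5.
multi = predict_f0(
    0.5, -0.5, omega=0.75, mu=120.0, sigma=20.0,
    g_f=[1.0, 0.0], v_mu=[1.0, 0.0], v_sigma=[-1.0, 0.0],
)
print(multi)  # 182.5
```

Because softsign is bounded in $(-1, 1)$, each scaling term stays in $(0, 2)$, so a positive base mean and standard deviation remain positive for every speaker.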

### Vocal Model

Similar to Deep Voice 1, the Deep Voice 2 vocal model uses a WaveNet-based architecture with a two-layer bidirectional QRNN (Bradbury et al., 2016) conditioning network. The 1×1 convolution between the gated tanh nonlinearity and the residual connection is removed.

The multi-speaker vocal model uses only input augmentation, which differs from the global conditioning suggested in Oord et al. (2016).

Without speaker embeddings, the vocal model can still produce distinct voices, but adding speaker embeddings improves audio quality.


### Multi-Speaker Tacotron

The authors extended Tacotron to a multi-speaker model.

Model performance was highly dependent on hyperparameters, and some models often failed to learn the attention mechanism for a small subset of speakers. All initial and final silence in each audio clip was trimmed, because if the speech does not start at the same timestep in every clip, the model is less likely to converge to a meaningful attention curve and recognizable speech.

<img src="img/Deep_Voice_2_Multi-Speaker_Neural_Text-to-Speech_Figure3.png" width="600">


#### Character-to-Spectrogram Model

Incorporating speaker embeddings into the CBHG post-processing network degrades output quality.

Incorporating speaker embeddings into the character encoder is necessary.

A speaker-dependent CBHG encoder is necessary to learn its attention mechanism and generate meaningful output.


The encoder is conditioned on speaker:
- each highway layer: one site-specific embedding as an extra input
- CBHG RNN state: a second site-specific embedding as the initial state

The decoder with speaker embeddings is helpful:
- decoder pre-net: one site-specific embedding as an extra input
- attention RNN: one site-specific embedding as the initial attention context vector
- decoder GRU: one site-specific embedding as the initial hidden state, and one site-specific embedding as a bias to the tanh in the content-based attention mechanism
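
As a rough illustration of the last item, the sketch below adds a speaker-dependent bias inside the tanh of an additive, content-based attention score. It assumes the common form $e_i = v^T \tanh(W_{enc} h_i + W_{dec} d + b_s)$; the weight names, dimensions, and random values are hypothetical and not taken from the paper.

```python
import numpy as np


def attention_weights(encoder_states, query, W_enc, W_dec, v, speaker_bias=None):
    """Additive attention; the optional speaker bias is added inside the tanh."""
    pre = encoder_states @ W_enc.T + query @ W_dec.T  # (T, A)
    if speaker_bias is not None:
        pre = pre + speaker_bias                      # broadcast over time
    scores = np.tanh(pre) @ v                         # (T,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


rng = np.random.default_rng(0)
T, E, D, A = 6, 8, 5, 4                     # illustrative sizes (assumed)
encoder_states = rng.normal(size=(T, E))    # one vector per input character
query = rng.normal(size=D)                  # current decoder state
W_enc = rng.normal(size=(A, E))
W_dec = rng.normal(size=(A, D))
v = rng.normal(size=A)
b_s = rng.normal(size=A)                    # stand-in for a site-specific embedding

plain = attention_weights(encoder_states, query, W_enc, W_dec, v)
speaker = attention_weights(encoder_states, query, W_enc, W_dec, v, b_s)
```

Both outputs are probability distributions over the `T` encoder positions; the bias shifts where the tanh saturates, which lets the alignment differ per speaker.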


#### Spectrogram-to-Waveform Model

With the Griffin-Lim algorithm, minor noise in the input spectrogram causes noticeable estimation errors.

The authors use WaveNet to convert the linear-scaled log-magnitude spectrogram to an audio waveform for higher-quality audio.




## Dataset used in this study

- The VCTK dataset: 44 hours of speech from 108 speakers, with approximately 400 utterances each
- An internal audiobook dataset: 477 speakers with 30 minutes of audio each, about 238 hours in total

## Implementations




## Further Readings

- Tacotron: Towards End-to-End Speech Synthesis https://arxiv.org/abs/1703.10135
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks http://www.cs.toronto.edu/~graves/icml_2006.pdf
- Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks https://arxiv.org/abs/1704.00849
- Deep Speaker: an End-to-End Neural Speaker Embedding System https://arxiv.org/abs/1705.02304
- Quasi-Recurrent Neural Networks https://arxiv.org/abs/1611.01576