# Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN based Statistical Parametric Speech Synthesis

Link: https://research.google.com/pubs/pub45400.html

Authors: Bo Li, Heiga Zen

Institution: Google

Publication: google research

Date: 2016




## Background Materials




## Papers citing this paper



## What is this paper about?


An LSTM-RNN based statistical parametric speech synthesis system that is trained on data from multiple languages and speakers.


## What is the motivation of this research?


The ability to utilize inhomogeneous data is important because it allows a larger amount of data to be used for training.

## What makes this paper different from previous research?

- Training with inhomogeneous data
- Adaptation to a new language with limited data

## How does this paper achieve it?

### Design
For modeling language variation, cluster adaptive training (CAT) (Tan et al., 2015) is used.

For modeling speaker variation, speaker dependent output layers are used.

### Preprocessing
An input sequence of words from language $u$ is first processed to extract a sequence of universal linguistic feature vectors. For simplicity, the union of the linguistic feature sets of all the languages is used instead of the International Phonetic Alphabet (IPA). Feature dimensions that are not available in the current language are zero padded.
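The union-and-zero-padding step can be sketched as follows. This is a minimal illustration, not the paper's feature pipeline: the feature names (`phone=ae`, `stress`, `nasalized`) and the helper functions are hypothetical, and sorting the inventory is just one way to get a stable dimension order.

```python
def build_universal_inventory(per_language_features):
    """Return the sorted union of feature names across all languages."""
    names = set()
    for feature_names in per_language_features.values():
        names.update(feature_names)
    return sorted(names)


def to_universal_vector(features, inventory):
    """Map one language-specific feature dict onto the universal layout.

    Dimensions that the language does not define are zero padded.
    """
    return [float(features.get(name, 0.0)) for name in inventory]


# Hypothetical feature names, for illustration only.
per_language_features = {
    "en-US": ["phone=ae", "stress"],
    "fr": ["phone=ae", "nasalized"],
}
inventory = build_universal_inventory(per_language_features)

# An English unit has no "nasalized" dimension, so that slot stays 0.0.
english_unit = {"phone=ae": 1.0, "stress": 1.0}
universal = to_universal_vector(english_unit, inventory)
```

Every language then shares one input layout, which is what lets a single network consume data from all of them.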

The universal linguistic feature vectors are converted to frame-level linguistic feature vectors $\{x_1, ..., x_t, ..., x_T\}$ by a duration model.
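A rough sketch of the frame-level expansion, assuming the duration model has already predicted an integer number of frames per unit. The function name and the toy inputs are hypothetical, and the sketch only repeats each vector; any extra frame-level features a real system might append are omitted.

```python
def expand_to_frames(unit_vectors, durations_in_frames):
    """Repeat each unit-level vector for its predicted number of frames."""
    if len(unit_vectors) != len(durations_in_frames):
        raise ValueError("need exactly one duration per unit")
    frames = []
    for vector, n_frames in zip(unit_vectors, durations_in_frames):
        for _ in range(n_frames):
            frames.append(list(vector))  # copy so frames do not share state
    return frames


# Two units lasting 2 and 3 frames give T = 5 frame-level vectors.
frames = expand_to_frames([[1.0, 0.0], [0.0, 1.0]], [2, 3])
```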

### Goal
The goal of this system is to output a vocoder parameter feature vector $y_t^{(u, s)}$ for the desired language $u$ and speaker $s$, given, for each frame at time $t$, the feature vector $x_t$ together with a language ID $u$ and a speaker ID $s$ as input.


### Components

1. **mean tower $\mathcal{M}^{\mathrm{mean}}$**: a sub-network which captures the shared knowledge across different training languages.
1. **language basis towers $\mathcal{M}_l^{\mathrm{lang}}$** for $l \in \{1,...,L\}$: a set of sub-networks that capture the different variations among the training languages. $L$ is the dimension of the language space.
1. **language code vector $\lambda^{(u)}$** for each of the training languages $u \in \{1,...,U\}$: normally $L < U$ holds, i.e. the total number of languages $U$ is usually larger than the dimension of the language space $L$. This bottleneck structure forces the clustering of the given languages and enables information sharing between languages.
1. **speaker dependent RNN output layer $\mathcal{M}_s^{\mathrm{spkr}}$**: one for each speaker $s \in \{1,...,S\}$.

#### speaker dependent RNN output layer

The vocoder parameters can be derived as follows.

$\boldsymbol{h}_t^{(u)} = \mathcal{M}^{\mathrm{mean}}(\{x_1, ..., x_t\}) + \sum_{l=1}^L\lambda_l^{(u)}\mathcal{M}_l^{\mathrm{lang}}(\{x_1, ..., x_t\})$

$\boldsymbol{y}_t^{(u, s)} = \mathcal{M}_s^{\mathrm{spkr}}(\boldsymbol{h}_t^{(u)}, \boldsymbol{y}_{t-1}^{(u, s)})$

The language mean tower $\mathcal{M}^{\mathrm{mean}}$ sets up the origin of the language space and each language basis tower $\mathcal{M}_l^{\mathrm{lang}}$ learns one potential direction of variation away from the origin.

The language code vector $\boldsymbol{\lambda}^{(u)}$ locates the language $u$ in the $L$ dimensional space and converts the universal input linguistic feature vector $\boldsymbol{x}_t$ to language dependent hidden activation $\boldsymbol{h}_t^{(u)}$.

When training on data from a $(u, s)$ pair, $\mathcal{M}^{\mathrm{mean}}$, $\mathcal{M}_l^{\mathrm{lang}}$, $\lambda^{(u)}$ and $\mathcal{M}_s^{\mathrm{spkr}}$ are updated.
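The two equations above can be traced for a single frame with a small sketch. It assumes the tower outputs for the frame have already been computed. The linear form of `speaker_output_step`, along with the weight names `w_in` and `w_rec`, is a hypothetical stand-in for the speaker dependent RNN output layer, whose exact form these notes do not specify.

```python
def combine_towers(mean_out, basis_outs, language_code):
    """Compute h = mean + sum_l lambda_l * basis_l elementwise for one frame."""
    if len(basis_outs) != len(language_code):
        raise ValueError("need one weight per language basis tower")
    hidden = list(mean_out)
    for weight, basis in zip(language_code, basis_outs):
        for i, value in enumerate(basis):
            hidden[i] += weight * value
    return hidden


def speaker_output_step(hidden, prev_output, w_in, w_rec):
    """Toy recurrent output step: y_t = W_in h_t + W_rec y_{t-1}.

    Assumption: a plain linear recurrence used only to show the data flow.
    """
    output = []
    for row_in, row_rec in zip(w_in, w_rec):
        total = sum(w * h for w, h in zip(row_in, hidden))
        total += sum(w * y for w, y in zip(row_rec, prev_output))
        output.append(total)
    return output


# L = 2 basis towers, hidden size 2, one output dimension.
mean_out = [1.0, 2.0]
basis_outs = [[1.0, 0.0], [0.0, 1.0]]
language_code = [0.5, -1.0]  # lambda^{(u)} for one language u
hidden = combine_towers(mean_out, basis_outs, language_code)
y_t = speaker_output_step(hidden, [2.0], w_in=[[1.0, 1.0]], w_rec=[[0.5]])
```

Switching language changes only `language_code`, and switching speaker changes only the weights passed to the output step; the towers themselves are shared.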

<img src="img/Multi-Language_Multi-Speaker_Acoustic_Modeling_for_LSTM-RNN_based_Statistical_Parametric_Speech_Synthesis_Figure1.png" width="400">


### Results

#### Adaptation to New Languages

Six training languages were selected, and six basis towers and one mean tower were used to model the language space ($U = 6, L = 6$).

Two additional languages were selected for adaptation with limited training data ($U = 8, L = 6$).

The following adaptation methods were tested:

- v1 : update only language code $\lambda^{(u)}$;
- v2 : update language code $\lambda^{(u)}$ and mean tower $\mathcal{M}^{\mathrm{mean}}$ jointly;
- v3 : start from v1, update mean tower $\mathcal{M}^{\mathrm{mean}}$ alone;
- v4 : start from v3, jointly update language code $\lambda^{(u)}$ and mean tower $\mathcal{M}^{\mathrm{mean}}$.
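The adaptation variants listed above differ in which parameters are updated and which earlier variant they start from. One way to write that down is a small table. This is only a bookkeeping sketch: the key names are hypothetical, `init_from=None` is taken to mean the multilingual model trained on the six languages, and the handling of the speaker dependent output layer is left out because these notes do not describe it.

```python
ADAPTATION_VARIANTS = {
    "v1": {"init_from": None, "updated": {"language_code"}},
    "v2": {"init_from": None, "updated": {"language_code", "mean_tower"}},
    "v3": {"init_from": "v1", "updated": {"mean_tower"}},
    "v4": {"init_from": "v3", "updated": {"language_code", "mean_tower"}},
}


def stage_chain(variant):
    """Return the variants to run, in order, to arrive at `variant`."""
    chain = []
    current = variant
    while current is not None:
        chain.append(current)
        current = ADAPTATION_VARIANTS[current]["init_from"]
    return list(reversed(chain))


# v4 is a three-stage procedure: code only, then mean tower only, then both.
v4_stages = stage_chain("v4")
```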

All methods except v1 yield lower mean squared errors than a baseline system built directly from the limited data.
In preference tests, v2 and v3 are statistically better than the baseline.


## Dataset used in this study

internal data

### training languages

- North American (US) English
- British (UK) English
- French
- Italian
- German
- Spanish

### testing languages for adaptation

- Polish
- Brazilian (BR) Portuguese

## Implementations




## Further Readings


