(Feel free to suggest changes)
- Merging Phoneme and Char representations: https://arxiv.org/pdf/1811.07240.pdf
- Tacotron transfer learning : https://arxiv.org/pdf/1904.06508.pdf
- phoneme timing from attention: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8683827
- SEMI-SUPERVISED TRAINING FOR IMPROVING DATA EFFICIENCY IN END-TO-END SPEECH SYNTHESIS - https://arxiv.org/pdf/1808.10128.pdf
- Listening while Speaking: Speech Chain by Deep Learning - https://arxiv.org/pdf/1707.04879.pdf
- GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION: https://arxiv.org/pdf/1710.10467.pdf
- Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem: https://www.mdpi.com/2078-2489/10/4/131/pdf
- Against Over-Smoothness
- FastSpeech: https://arxiv.org/pdf/1905.09263.pdf
- Learning singing from speech: https://arxiv.org/pdf/1912.10128.pdf
- TTS-GAN: https://arxiv.org/pdf/1909.11646.pdf
- They use duration and linguistic features for end-to-end TTS.
- Close to WaveNet performance.
- DurIAN: https://arxiv.org/pdf/1909.01700.pdf
- Duration aware Tacotron
- MelNet: https://arxiv.org/abs/1906.01083
- AlignTTS: https://arxiv.org/pdf/2003.01950.pdf
- Unsupervised Speech Decomposition via Triple Information Bottleneck
- FlowTron: https://arxiv.org/pdf/2005.05957.pdf
- Inverse Autoregressive Flow on a Tacotron-like architecture
- WaveGlow as vocoder.
- Speech style embedding with Mixture of Gaussian model.
- The model is larger and heavier than vanilla Tacotron.
- MOS values are slightly better than the public Tacotron implementation.
- Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention : https://arxiv.org/pdf/1710.08969.pdf
End-to-End Adversarial Text-to-Speech: http://arxiv.org/abs/2006.03575 (Click to Expand)
- End-to-end feed-forward TTS learning.
- Character alignment is done with a separate aligner module.
- The aligner predicts the length of each character.
- The center location of a char is found wrt the total length of the previous characters.
- Char positions are interpolated with a Gaussian window wrt the real audio length (see the sketch after this list).
- The audio output is computed in the mu-law domain. (I don't have a reasoning for this.)
- Only 2-second audio windows are used for training.
- GAN-TTS generator is used to produce audio signal.
- RWD is used as an audio-level discriminator.
- MelD: They use the BigGAN-deep architecture as a spectrogram-level discriminator, regarding the problem as image reconstruction.
- Spectrogram loss
- Using only adversarial feedback is not enough to learn the char alignments. They use a spectrogram loss b/w predicted spectrograms and ground-truth specs.
- Note that model predicts audio signals. Spectrograms above are computed from the generated audio.
- Dynamic Time Warping is used to compute a minimal-cost alignment b/w generated spectrograms and ground-truth.
- It involves a dynamic programming approach to find a minimal-cost alignment.
- An aligner length loss is used to penalize the aligner for predicting a length different from the real audio length.
- They train the model with a multi-speaker dataset but report results on the best performing speaker.
- Ablation Study importance of each component: (LengthLoss and SpectrogramLoss) > RWD > MelD > Phonemes > MultiSpeakerDataset.
- My 2 cents: It is a feed-forward model which provides end-to-end speech synthesis with no need to train a separate vocoder model. However, it is a very complicated model with a lot of hyperparameters and implementation details. Also, the final result is not close to the state of the art. I think we need to find specific algorithms for learning character alignments which would reduce the need for tuning a combination of different algorithms.
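A minimal sketch of the Gaussian-window interpolation described above, in PyTorch. The function and argument names (`gaussian_upsample`, `token_lengths`, `sigma`) are my own, and rescaling the predicted lengths to the real audio length is omitted for brevity; this is not the paper's implementation.

```python
import torch

def gaussian_upsample(token_feats, token_lengths, out_len, sigma=10.0):
    """Sketch of EATS-style alignment: each token's predicted length gives it
    a center position; token features are spread over the output time axis
    with a Gaussian window around those centers.

    token_feats:   (B, N, D) encoder outputs per character/token
    token_lengths: (B, N)    predicted (positive) lengths per token
    out_len:       number of output frames to produce
    """
    # Center of each token = total length of previous tokens + half its own length.
    ends = torch.cumsum(token_lengths, dim=1)                 # (B, N)
    centers = ends - 0.5 * token_lengths                      # (B, N)

    # Gaussian weights of every output step w.r.t. every token center.
    t = torch.arange(out_len, dtype=token_feats.dtype)[None, :, None]  # (1, T, 1)
    logits = -((t - centers[:, None, :]) ** 2) / (2 * sigma ** 2)      # (B, T, N)
    weights = torch.softmax(logits, dim=-1)                   # normalize over tokens

    return torch.bmm(weights, token_feats)                    # (B, T, D)

# Usage: 2 tokens, the first predicted to be twice as long as the second.
feats = torch.randn(1, 2, 8)
lengths = torch.tensor([[20.0, 10.0]])
print(gaussian_upsample(feats, lengths, out_len=30).shape)    # torch.Size([1, 30, 8])
```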
FastSpeech 2: http://arxiv.org/abs/2006.04558 (Click to Expand)
- Use phoneme durations generated by MFA as labels to train a length regulator.
- They use frame-level F0 and L2 spectrogram norms (variance information) as additional features.
- The variance predictor module predicts the variance information at inference time (see the sketch after this list).
- Ablation study result improvements: model < model + L2_norm < model + L2_norm + F0
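A rough sketch of a variance predictor of this kind: a small conv stack that predicts one scalar (e.g. F0 or energy) per position. Layer sizes, class and argument names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of a FastSpeech 2-style variance predictor (illustrative sizes)."""
    def __init__(self, dim=256, hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (B, T, dim)
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)         # (B, T) predicted F0 or energy

# Training targets are ground-truth F0 / spectrogram-norm values; at inference
# the predictions are embedded and added to the encoder output before decoding.
pred = VariancePredictor()(torch.randn(2, 100, 256))
print(pred.shape)  # torch.Size([2, 100])
```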
Glow-TTS: https://arxiv.org/pdf/2005.11129.pdf (Click to Expand)
- Use Monotonic Alignment Search (MAS) to learn the alignment b/w text and spectrogram (see the sketch after this list).
- This alignment is used to train a Duration Predictor to be used at inference.
- Encoder maps each character to a Gaussian Distribution.
- Decoder maps each spectrogram frame to a latent vector using Normalizing Flow (Glow Layers)
- Encoder and Decoder outputs are aligned with MAS.
- At each iteration, first the most probable alignment is found by MAS, and this alignment is used to update model parameters.
- A duration predictor is trained to predict the number of spectrogram frames for each character.
- At inference only the duration predictor is used instead of MAS
- Encoder has the architecture of the TTS transformer with 2 updates
- Instead of absolute positional encoding, they use relative positional encoding.
- They also use a residual connection for the Encoder Prenet.
- Decoder has the same architecture as the Glow model.
- They train both single and multi-speaker model.
- It is shown experimentally that Glow-TTS is more robust on long sentences than the original Tacotron 2.
- 15x faster than Tacotron2 at inference
- My 2 cents: Their samples do not sound as natural as Tacotron's. I believe normal attention models still generate more natural speech since the attention learns to map characters to model outputs directly. However, using Glow-TTS might be a good alternative for hard datasets.
- Samples: https://github.com/jaywalnut310/glow-tts
- Repository: https://github.com/jaywalnut310/glow-tts
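A small NumPy sketch of what MAS does, assuming a matrix `log_p[i, j]` holding the log-likelihood of spectrogram frame `j` under the Gaussian of text token `i`. Function name and details are my own; the repository linked above has the actual implementation.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Find the monotonic, surjective alignment that maximizes the total
    log-likelihood with dynamic programming, then backtrack the path.

    log_p: (N_tokens, T_frames) -> returns a (N_tokens, T_frames) 0/1 matrix.
    Assumes N_tokens <= T_frames.
    """
    N, T = log_p.shape
    Q = np.full((N, T), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T):
        for i in range(min(j + 1, N)):                  # token i cannot start before frame i
            stay = Q[i, j - 1]                          # frame j also belongs to token i
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # frame j starts token i
            Q[i, j] = log_p[i, j] + max(stay, move)

    # Backtrack from the last token / last frame.
    align = np.zeros((N, T), dtype=np.int64)
    i = N - 1
    for j in range(T - 1, -1, -1):
        align[i, j] = 1
        if i > 0 and (j == i or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return align

align = monotonic_alignment_search(np.random.randn(5, 20))
print(align.sum(axis=0))  # every frame assigned to exactly one token
```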
Non-Autoregressive Neural Text-to-Speech: http://arxiv.org/abs/1905.08459 (Click to Expand)
- A derivation of Deep Voice 3 model using non-causal convolutional layers.
- Teacher-student paradigm to train a non-autoregressive student with multiple attention blocks from an autoregressive teacher model.
- The teacher is used to generate text-to-spectrogram alignments to be used by the student model.
- The model is trained with two loss functions for attention alignment and spectrogram generation.
- Multi attention blocks refine the attention alignment layer by layer.
- The student uses dot-product attention with query, key and value vectors. The queries are only positional encoding vectors. The keys and values are the encoder outputs (see the sketch after this list).
- The proposed model is heavily tied to the positional encoding, which also relies on different constant values.
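A sketch of that attention step, assuming standard sinusoidal positional encodings. The `rate` constant stands in for the position-scaling constants the last bullet refers to, and the layer-by-layer multi-block refinement is omitted.

```python
import math
import torch

def positional_encoding(length, dim, rate=1.0):
    """Standard sinusoidal positional encoding; `rate` is an assumed name
    for the constant position scaling the paper's attention depends on."""
    pos = torch.arange(length, dtype=torch.float32)[:, None] * rate
    i = torch.arange(dim, dtype=torch.float32)[None, :]
    angle = pos / torch.pow(10000.0, 2 * (i // 2) / dim)
    return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))  # (length, dim)

def pe_attention(encoder_out, n_out_frames):
    """Dot-product attention where the query is *only* positional encodings
    over the output frames, and keys/values are the encoder outputs."""
    B, N, D = encoder_out.shape
    q = positional_encoding(n_out_frames, D)                  # (T, D), shared across the batch
    scores = torch.matmul(q, encoder_out.transpose(1, 2)) / math.sqrt(D)  # (B, T, N)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, encoder_out)                 # (B, T, D)

out = pe_attention(torch.randn(2, 40, 128), n_out_frames=200)
print(out.shape)  # torch.Size([2, 200, 128])
```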
Double Decoder Consistency: https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency (Click to Expand)
- The model uses a Tacotron-like architecture but with 2 decoders and a postnet.
- DDC uses two synchronous decoders using different reduction rates.
- The decoders use different reduction rates, thus they compute outputs at different granularities and learn different aspects of the input data.
- The model uses the consistency between these two decoders to increase the robustness of the learned text-to-spectrogram alignment (see the sketch after this list).
- The model also applies a refinement to the final decoder output by applying the postnet iteratively multiple times.
- DDC uses Batch Normalization in the prenet module and drops Dropout layers.
- DDC uses gradual training to reduce the total training time.
- We use a Multi-Band MelGAN generator as a vocoder, trained with Multiple Random Window Discriminators, unlike the original work.
- We are able to train a DDC model in only 2 days with a single GPU, and the final model is able to generate speech faster than real-time on a CPU. Demo page: https://erogol.github.io/ddc-samples/ Code: https://github.com/mozilla/TTS
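A minimal sketch of the consistency idea, assuming it is computed as an L1 difference between the two decoders' attention maps after interpolating the coarse one to the fine decoder's resolution; see the blog post above for the exact formulation.

```python
import torch
import torch.nn.functional as F

def ddc_consistency_loss(attn_fine, attn_coarse):
    """Sketch of a DDC-style consistency term (assumed formulation).

    attn_fine:   (B, T_fine,   N_chars) attention of the small-reduction decoder
    attn_coarse: (B, T_coarse, N_chars) attention of the large-reduction decoder
    """
    # Interpolate along the decoder-step axis to match the fine resolution.
    coarse_up = F.interpolate(
        attn_coarse.transpose(1, 2),        # (B, N_chars, T_coarse)
        size=attn_fine.size(1),
        mode="nearest",
    ).transpose(1, 2)                       # (B, T_fine, N_chars)
    return F.l1_loss(attn_fine, coarse_up)

loss = ddc_consistency_loss(torch.softmax(torch.randn(2, 120, 50), -1),
                            torch.softmax(torch.randn(2, 60, 50), -1))
print(loss.item())
```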
- Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora - https://arxiv.org/abs/1904.00771
- Deep Voice 2 - https://papers.nips.cc/paper/6889-deep-voice-2-multi-speaker-neural-text-to-speech.pdf
- Sample Efficient Adaptive TTS - https://openreview.net/pdf?id=rkzjUoAcFX
- WaveNet + Speaker Embedding approach
- Voice Loop - https://arxiv.org/abs/1707.06588
- MODELING MULTI-SPEAKER LATENT SPACE TO IMPROVE NEURAL TTS QUICK ENROLLING NEW SPEAKER AND ENHANCING PREMIUM VOICE - https://arxiv.org/pdf/1812.05253.pdf
- Transfer learning from speaker verification to multispeaker text-to-speech synthesis - https://arxiv.org/pdf/1806.04558.pdf
- Fitting new speakers based on a short untranscribed sample - https://arxiv.org/pdf/1802.06984.pdf
- Generalized end-to-end loss for speaker verification
Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation: http://arxiv.org/abs/2005.08024
- Train a multi-speaker TTS model with only an hour of paired data (text-to-voice alignment) and more unpaired (voice-only) data.
- It learns a codebook where each code word corresponds to a single phoneme.
- The codebook is aligned to the phonemes using the paired data and the CTC algorithm.
- This codebook functions like a proxy to implicitly estimate the phoneme sequence of the unpaired data (see the sketch below).
- They stack a Tacotron 2 model on top to perform TTS using the code word embeddings generated by the initial part of the model.
- They beat the benchmark methods in the 1-hour paired data setting.
- They don't report full paired data results.
- They don't have a thorough ablation study; it would be interesting to see how different parts of the model contribute to the performance.
- They use Griffin-Lim as the vocoder, so there is room for improvement.
Demo page: https://ttaoretw.github.io/multispkr-semi-tts/demo.html
Code: https://github.com/ttaoREtw/semi-tts
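A sketch of the discrete bottleneck idea: assign each frame-level encoder output to its nearest code word. Names and the hard argmin lookup are illustrative; in the paper, code-word posteriors on the paired data are trained against the phoneme sequence with a CTC loss.

```python
import torch

def quantize(frames, codebook):
    """Assign each encoder frame to its nearest code word.

    frames:   (B, T, D) frame-level encoder outputs
    codebook: (K, D)    learnable code words, K ~= number of phonemes
    """
    # Squared Euclidean distance between every frame and every code word.
    d = (frames.unsqueeze(2) - codebook[None, None]).pow(2).sum(-1)   # (B, T, K)
    ids = d.argmin(-1)                                                # (B, T) pseudo phoneme ids
    return codebook[ids], ids                                         # quantized frames + ids

codebook = torch.randn(40, 64)                # e.g. 40 phoneme-like code words
quantized, ids = quantize(torch.randn(2, 100, 64), codebook)
print(quantized.shape, ids.shape)             # (2, 100, 64) (2, 100)
# Unpaired audio only goes through the quantizer; the quantized embeddings
# then feed the Tacotron 2 stage mentioned above.
```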
Attentron: Few-shot Text-to-Speech Exploiting Attention-based Variable Length Embedding: https://arxiv.org/abs/2005.08484
- Use two encoders to learn speaker-dependent features.
- Coarse encoder learns a global speaker embedding vector based on provided reference spectrograms.
- The fine encoder learns a variable-length embedding, keeping the temporal dimension, in cooperation with an attention module.
- The attention selects important reference spectrogram frames to synthesize target speech.
- Pre-train the model with a single speaker dataset first (LJSpeech for 30k iters.)
- Fine-tune the model with a multi-speaker dataset. (VCTK for 70k iters.)
- It achieves slightly better metrics compared to using x-vectors from a speaker classification model and a VAE-based reference audio encoder.
Demo page: https://hyperconnect.github.io/Attentron/
- LOCATION-RELATIVE ATTENTION MECHANISMS FOR ROBUST LONG-FORM SPEECH SYNTHESIS : https://arxiv.org/pdf/1910.10288.pdf
-
ParallelWaveGAN: https://arxiv.org/pdf/1910.11480.pdf
- Multi-scale STFT loss (see the sketch after this list).
- ~1M model parameters (very small)
- Slightly worse than WaveRNN
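A sketch of a multi-resolution STFT loss of this kind: a spectral-convergence term plus a log-magnitude L1 term at several FFT settings. The resolutions below are illustrative, not necessarily the paper's.

```python
import torch

def stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)                 # (B, n_fft//2+1, frames)

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    """Sum spectral-convergence and log-magnitude L1 terms over several
    (n_fft, hop, win) settings between generated and ground-truth audio."""
    loss = 0.0
    for n_fft, hop, win in resolutions:
        S_hat, S = stft_mag(y_hat, n_fft, hop, win), stft_mag(y, n_fft, hop, win)
        sc = torch.sqrt(((S - S_hat) ** 2).sum()) / torch.sqrt((S ** 2).sum())
        mag = torch.nn.functional.l1_loss(torch.log(S_hat), torch.log(S))
        loss = loss + sc + mag
    return loss / len(resolutions)

loss = multi_resolution_stft_loss(torch.randn(2, 16000), torch.randn(2, 16000))
print(loss.item())
```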
-
Improving FFTNet
-
FFTNet
-
SPEECH WAVEFORM RECONSTRUCTION USING CONVOLUTIONAL NEURAL NETWORKS WITH NOISE AND PERIODIC INPUTS
- 150.162.46.34:8080/icassp2019/ICASSP2019/pdfs/0007045.pdf
-
Towards Achieving Robust Universal Vocoding
-
LPCNet
-
ExciteNet
-
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
-
High Fidelity Speech Synthesis with Adversarial Networks: https://arxiv.org/abs/1909.11646
- GAN-TTS, end-to-end speech synthesis
- Uses duration and linguistic features
- Duration and acoustic features are predicted by additional models.
- Random Window Discriminator: ingests not the whole voice sample but random windows (see the sketch after this list).
- Multiple RWDs, some conditional and some unconditional (conditioned on input features).
- Punchline: Use randomly sampled windows with different window sizes for D.
- The shared results sound mechanical, which shows the limits of non-neural acoustic features.
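A sketch of the random-window cropping only; the discriminator networks themselves are omitted, and the window sizes are placeholders, not necessarily the paper's values.

```python
import torch

def random_windows(audio, window_sizes=(240, 480, 960, 1920, 3600)):
    """Crop one random window per window size from a waveform batch.

    audio: (B, T) -> list of (B, w) crops, one per window size.
    """
    B, T = audio.shape
    crops = []
    for w in window_sizes:
        start = torch.randint(0, T - w + 1, (1,)).item()
        crops.append(audio[:, start:start + w])
    return crops

crops = random_windows(torch.randn(4, 48000))
print([c.shape[-1] for c in crops])
# Each crop would go to its own discriminator; conditional RWDs also receive
# the aligned slice of the conditioning features, unconditional ones do not.
```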
-
Multi-Band MelGAN: https://arxiv.org/abs/2005.05106
- Use PWGAN losses instead of feature-matching loss.
- Using a larger receptive field boosts model performance significantly.
- Generator pretraining for 200k iters.
- Multi-band voice signal prediction. The output is the summation of 4 sub-band predictions combined with PQMF synthesis filters.
- The multi-band model has 1.9M parameters (quite small).
- Claimed to be 7x faster than MelGAN
- On a Chinese dataset: MOS 4.22
-
WaveGlow: https://arxiv.org/abs/1811.00002
- Very large model (268M parameters)
- Hard to train since a 12GB GPU can only take batch size 1.
- Real-time inference due to the use of convolutions.
- Based on Invertible Normalizing Flows. (Great tutorial: https://blog.evjang.com/2018/01/nf1.html )
- The model learns an invertible mapping from audio samples to a latent distribution, conditioned on mel-spectrograms, with a maximum likelihood loss (see the coupling sketch after this list).
- At inference, the network runs in the reverse direction and the given mel-specs are converted to audio samples.
- Training has been done using 8 Nvidia V100s with 32GB RAM, batch size 24. (Expensive)
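A minimal sketch of the invertible affine coupling such flows are built on: half of the channels go through a network conditioned on the mel-spec to predict scale/shift for the other half, so the whole transform inverts exactly at inference. The tiny conv net here stands in for WaveGlow's WN stack.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Illustrative affine coupling layer for a WaveGlow-like flow."""
    def __init__(self, channels, mel_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2 + mel_channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),     # predicts log_s and t
        )

    def forward(self, x, mel):                 # training direction: audio -> latent
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([xa, mel], dim=1)).chunk(2, dim=1)
        zb = xb * torch.exp(log_s) + t         # only xb is transformed
        return torch.cat([xa, zb], dim=1), log_s.sum()   # log-det term for the MLE loss

    def inverse(self, z, mel):                 # inference direction: latent -> audio
        za, zb = z.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([za, mel], dim=1)).chunk(2, dim=1)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=1)

layer = AffineCoupling(channels=8, mel_channels=80)
x, mel = torch.randn(2, 8, 100), torch.randn(2, 80, 100)
z, logdet = layer(x, mel)
print(torch.allclose(layer.inverse(z, mel), x, atol=1e-4))   # True: exactly invertible
```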
-
SqueezeWave: https://arxiv.org/pdf/2001.05685.pdf , code: https://github.com/tianrengao/SqueezeWave
- ~5-13x faster than real-time
- WaveGlow redundancies: long audio samples, upsampled mel-specs, large channel dimensions in the WN function.
- Fixes: reshape the audio input to have a shorter temporal length but more channels (L=2000, C=8 vs L=64, C=256).
- L=64 matches the mel-spec resolution, so no upsampling is necessary.
- Use depthwise separable convolutions in WN modules (see the sketch after this list).
- Use regular convolutions instead of dilated ones since audio samples are shorter.
- Do not split module outputs into residual and network outputs, assuming these vectors are almost identical.
- Training has been done using a Titan RTX (24GB), batch size 96, for 600k iterations.
- MOS on LJSpeech: WaveGlow 4.57, SqueezeWave (L=128, C=256) 4.07, SqueezeWave (L=64, C=256) 3.77.
- The smallest model generates 21K samples per second on a Raspberry Pi 3.
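A quick sketch of the depthwise-separable replacement mentioned above and the parameter saving it buys; channel and kernel sizes here are illustrative.

```python
import torch
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

channels, kernel = 256, 3

# Regular convolution as in WaveGlow's WN blocks (dilation dropped here,
# as SqueezeWave does, since its audio groups are short).
regular = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)

# Depthwise separable replacement: per-channel spatial conv + 1x1 pointwise mix.
separable = nn.Sequential(
    nn.Conv1d(channels, channels, kernel, padding=kernel // 2, groups=channels),
    nn.Conv1d(channels, channels, 1),
)

x = torch.randn(1, channels, 64)
assert regular(x).shape == separable(x).shape
print(n_params(regular), n_params(separable))   # ~197k vs ~67k parameters
```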
WaveGrad: https://arxiv.org/pdf/2009.00713.pdf
- It is based on probability diffusion and Langevin dynamics.
- The basic idea is to learn a function that iteratively maps a known distribution to the target data distribution (see the sketch after this list).
- They report 0.2 real-time factor on a GPU but CPU performance is not shared.
- In the example code below, the author reports that the model converges after 2 days of training on a single GPU.
- MOS scores in the paper are not comprehensive enough but show comparable performance to known models like WaveRNN and WaveNet.
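A conceptual sketch of the iterative refinement loop, written as a generic DDPM-style ancestral sampler rather than the paper's exact update rule; `model` is assumed to predict the added noise given the noisy audio, the mel-spec and the step index.

```python
import torch

def refine(model, mel, betas, n_samples):
    """Start from Gaussian noise and repeatedly apply a learned denoiser,
    conditioned on the mel-spectrogram and the noise schedule."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    y = torch.randn(mel.size(0), n_samples)                   # start from noise
    for t in reversed(range(len(betas))):
        eps = model(y, mel, t)                                 # predicted noise
        y = (y - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                              # add noise except at the last step
            y = y + torch.sqrt(betas[t]) * torch.randn_like(y)
    return y

# Dummy "denoiser" just to show the loop runs; a real model is a conditioned CNN.
fake_model = lambda y, mel, t: torch.zeros_like(y)
audio = refine(fake_model,
               mel=torch.randn(1, 80, 50),
               betas=torch.linspace(1e-4, 0.05, 50),
               n_samples=12800)
print(audio.shape)  # torch.Size([1, 12800])
```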
- Tacotron 2 : https://www.youtube.com/watch?v=2iarxxm-v9w
- End-to-End Text-to-Speech Synthesis, Part 1 : https://www.youtube.com/watch?v=RNKrq26Z0ZQ
- Speech synthesis from neural decoding of spoken sentences | AISC : https://www.youtube.com/watch?v=MNDtMDPmnMo
- Generative Text-to-Speech Synthesis : https://www.youtube.com/watch?v=j4mVEAnKiNg
- SPEECH SYNTHESIS FOR THE GAMING INDUSTRY : https://www.youtube.com/watch?v=aOHAYe4A-2Q
- Modern Text-to-Speech Systems Review : https://www.youtube.com/watch?v=8rXLSc-ZcRY
- Text to Speech Deep Learning Architectures : http://www.erogol.com/text-speech-deep-learning-architectures/