The choice of vocoder (WaveRNN vs WaveGlow) #82

Closed
snakers4 opened this issue Aug 8, 2019 · 12 comments
Comments

@snakers4

snakers4 commented Aug 8, 2019

Hi!

For a change, it is great to see a repo where a real human built something, as opposed to an endless stream of corporate / academic research that essentially cannot be reproduced.

We have published a huge STT dataset and are also planning to extend our TTS dataset with 30-40 voices (at least). Our datasets are in Russian. So if you would like to extend your language support - please stay tuned.

Anyway - I wanted to ask - why did you choose WaveRNN? It seems that WaveGlow / FloWaveNet are the go-to options now. I tested WaveGlow - it trains mostly as promised and the code is really easy to use.

@oytunturk

oytunturk commented Aug 8, 2019 via email

@orbisAI

orbisAI commented Aug 8, 2019

Well, actually WaveGlow's quality doesn't seem too bad compared to WaveRNN.
An important thing to note is that MOS is a subjective and relative measure of audio quality. In the original papers, the MOS reported for the training data is much higher in the WaveNet paper than in the WaveGlow paper, which might indicate that for the same quality WaveGlow's MOS comes out lower.

Also, the WaveGlow authors may have had a worse synthesizer than Google did. As far as the vocoder's own performance goes (spectrogram to waveform), WaveGlow does not seem too bad, and it is much faster.

@CorentinJ
Owner

WaveGlow is both slower and of worse quality than WaveRNN. Do keep in mind that we're talking about the available public implementations, not the papers. But even in the papers it's the case.

@snakers4
Author

snakers4 commented Aug 8, 2019

> WaveGlow is both slower and of worse quality than WaveRNN

I (and some other people) measured the official public implementation of WaveGlow at 4-8x RTS for inference on a single 1080 Ti.
I have not tested WaveRNN yet, but I have heard reports of it being <1x RTS.

What are your benchmarks?

@CorentinJ
Owner

Right, I guess I was wrong about the speed of WaveGlow then. WaveRNN uses batched inference, so its speed is proportional to the length of the spectrogram to synthesize. I've gone up to 20x real-time for WaveRNN, but on short sentences it's going to be around 1x.
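
One way to picture why batched inference makes throughput grow with utterance length is the idealized sketch below. It assumes the waveform is cut into overlapping segments that are all generated in parallel, so the number of sequential sampling steps stays at one segment's length no matter how long the utterance is. The function name `batched_speedup` and the default `target` and `overlap` values are illustrative assumptions, not settings confirmed in this thread, and the model ignores GPU memory limits and per-batch overhead.

```python
import math


def batched_speedup(total_samples: int, target: int = 8000, overlap: int = 800):
    """Idealized model of batched autoregressive generation.

    Assumption: every segment advances by one sample per sequential step,
    and a batch of segments costs the same wall time as a single segment.
    Returns (number of segments, speedup over unbatched generation).
    """
    segment_len = target + 2 * overlap  # each segment carries overlap on both sides
    if total_samples <= segment_len:
        # A short utterance fits in one segment, so batching gains nothing.
        return 1, 1.0
    step = target + overlap  # distance between consecutive segment starts
    num_segments = math.ceil((total_samples - overlap) / step)
    # Unbatched: total_samples sequential steps. Batched: segment_len steps.
    speedup = total_samples / segment_len
    return num_segments, speedup


if __name__ == "__main__":
    for n in (5_000, 96_000, 480_000):
        segments, speedup = batched_speedup(n)
        print(f"{n:>7} samples -> {segments:>3} segments, ~{speedup:.1f}x vs. unbatched")
```

Under these assumptions a short sentence collapses to a single segment with no gain, which is consistent with the roughly 1x figure on short inputs and much higher figures on long ones.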

Anyway, the reason I picked WaveRNN over WaveGlow was due to the quality of the samples each open source implementation presented.

@orbisAI

orbisAI commented Aug 8, 2019

> WaveGlow is both slower and of worse quality than WaveRNN. Do keep in mind that we're talking about the available public implementations, not the papers. But even in the papers it's the case.

Nvidia's implementation of WaveGlow is giving me nearly 2000 kHz on my V100. I tried this repo's vocoder out of the box, and maybe I'm doing something wrong, but it's sub real-time (0.6~0.8 RTS).
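
The two figures above use different units: a generation rate in kHz (samples produced per second) and a real-time factor (seconds of audio per second of compute). A minimal sketch of the conversion follows. The 22050 Hz output sampling rate is an assumption, since the thread does not state it, and the function names are made up for illustration.

```python
def real_time_factor(samples_per_second: float, sample_rate_hz: int = 22050) -> float:
    """Convert a generation rate (samples/s) into a real-time factor.

    sample_rate_hz is an assumed output sampling rate; use your model's own.
    A result above 1.0 means faster than real time.
    """
    return samples_per_second / sample_rate_hz


def real_time_factor_from_timing(
    num_samples: int, elapsed_seconds: float, sample_rate_hz: int = 22050
) -> float:
    """Real-time factor from a measured run: audio duration / wall-clock time."""
    audio_seconds = num_samples / sample_rate_hz
    return audio_seconds / elapsed_seconds


if __name__ == "__main__":
    # "Nearly 2000 kHz" is roughly 2,000,000 samples per second.
    print(f"{real_time_factor(2_000_000):.1f}x real time")
    # 10 s of audio generated in 12.5 s of wall-clock time is sub real-time.
    print(f"{real_time_factor_from_timing(22050 * 10, 12.5):.2f}x real time")
```

When timing GPU inference, the wall-clock measurement should include only the synthesis call and should wait for the device to finish before reading the clock, otherwise the factor is overstated.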

@orbisAI

orbisAI commented Aug 8, 2019

> Right, I guess I was wrong about the speed of WaveGlow then. WaveRNN uses batched inference, so its speed is proportional to the length of the spectrogram to synthesize. I've gone up to 20x real-time for WaveRNN, but on short sentences it's going to be around 1x.
>
> Anyway, the reason I picked WaveRNN over WaveGlow was due to the quality of the samples each open source implementation presented.

Ah yes, I should test on longer sentences. I ran my tests on a single sentence of fewer than 100 characters.

@oytunturk

oytunturk commented Aug 8, 2019 via email

@snakers4
Author

snakers4 commented Aug 8, 2019

I see. Many thanks to everyone who took part in this discussion.
Given all of the above, I guess that for a production-like business setting (i.e. short sentences), WaveGlow and FloWaveNet are the most balanced options for now.

@qo4on

qo4on commented Apr 2, 2020

Is WaveNet still the best vocoder today?

@bryant0918

I've read that WaveGlow is more robust in handling several languages, but WaveRNN is language dependent and quickly degrades when you train on an additional language. If I were to create a multilingual system would it still be better to use WaveRNN and train several different models? Or use a single WaveGlow model that could essentially handle any language? What would my cost be in quality and Speed?

@RuntimeRacer

> I've read that WaveGlow is more robust in handling several languages, but WaveRNN is language dependent and quickly degrades when you train on an additional language. If I were to create a multilingual system would it still be better to use WaveRNN and train several different models? Or use a single WaveGlow model that could essentially handle any language? What would my cost be in quality and Speed?

I have not played around with it yet, but some time ago I came across this repo on multilingual TTS with a single synthesizer: https://github.com/Tomiinek/Multilingual_Text_to_Speech

They're using WaveRNN, so I assume the quality really just depends on whether the vocoder has been trained on good multilingual samples. In the end, the vocoder is used to make a generated voice sound more natural, so it 'should' not matter which language the voice is speaking.
