
Can single GPU get good result? #12

Closed
Cheneng opened this issue Nov 11, 2018 · 18 comments

@Cheneng commented Nov 11, 2018

Has anyone trained this model on a single GPU (1080 Ti) and gotten good results? In this situation I can only run the model with a batch size of 1, because I don't have enough GPU memory...

@will-rice

I've been training with batch size 1 and it is doing pretty well. Definitely takes longer, but it seems to still work.

@belevtsoff

@will-rice how long does it take in your case, how many iterations? I tried running it with bsz=3 while shortening the segment length to 8000. After 80k iterations the speech is barely intelligible (I'm using about 12 hours of male voice, 16kHz).
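For anyone else shrinking the memory footprint for a single GPU, here is a minimal sketch of the kind of config edit involved. It assumes the config.json layout in this repo (a train_config block holding batch_size and a data_config block holding segment_length); key names may differ in your checkout, so verify against your copy.

```python
import json

# Sketch only: reduce batch size / segment length to fit a single 1080 Ti.
# Key names assume NVIDIA/waveglow's config.json layout; adjust as needed.
with open("config.json") as f:
    config = json.load(f)

config["train_config"]["batch_size"] = 3        # e.g. 1-3 on an 11 GB card
config["data_config"]["segment_length"] = 8000  # shorter audio segments per example

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```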

@will-rice

I'm at 140k on LJSpeech. It doesn't sound great, but it continues to improve. On a smaller dataset I'm using, the speech at 165k is noisy, but definitely better than the LJSpeech one. According to the paper (https://arxiv.org/pdf/1811.00002.pdf), their model was trained with a batch size of 24 for 580k iterations. So by extremely rough math you are looking at well over 1 million iterations for results equivalent to the paper.
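A rough sketch of that arithmetic, simply equating the number of training samples seen (it ignores any effect batch size may have on optimization dynamics):

```python
# Paper setup: batch size 24 for 580k iterations.
paper_batch, paper_iters = 24, 580_000
samples_seen = paper_batch * paper_iters  # ~13.9M segments

for bsz in (1, 3, 12):
    print(f"batch size {bsz:2d}: ~{samples_seen // bsz:,} iterations to see as many samples")

# batch size  1: ~13,920,000 iterations
# batch size  3: ~4,640,000 iterations
# batch size 12: ~1,160,000 iterations
```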

@belevtsoff

Oh, I see, I overlooked that they used a batch size of 24! That explains a lot... Wow, the amount of training this thing requires is insane compared to WaveNet. Thanks.

@RPrenger commented Nov 12, 2018

Can I ask what value you're using for sigma during training and sample generation? And can you post a sample? We hear "decent speech" at ~160k iterations (though it definitely improves with more). I haven't seen a huge effect from the larger batch size, but we haven't done a lot of ablative analysis yet.

@will-rice commented Nov 12, 2018

https://soundcloud.com/user-667131267/waveglow-tedlium-150k
Training sigma was sqrt(0.5); sample sigma is 1.0.
Correction: training sigma is also 1.0.

@rafaelvalle (Contributor)

@will-rice try sampling with a smaller sigma, 0.8 or 0.6 for example
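For reference, a hedged sketch of what that looks like at inference time, following the waveglow.infer(mel, sigma=...) pattern used in this repo's inference.py; the checkpoint and mel paths below are placeholders, and the checkpoint layout may differ from yours.

```python
import torch

# Load a trained WaveGlow checkpoint (the repo's checkpoints store the model
# under the 'model' key; adjust if your checkpoint is organized differently).
waveglow = torch.load("waveglow_checkpoint.pt")["model"]
waveglow = waveglow.cuda().eval()

with torch.no_grad():
    mel = torch.load("mel_spectrogram.pt").cuda()  # placeholder mel-spectrogram input
    audio = waveglow.infer(mel, sigma=0.6)         # try 0.6-0.8 instead of 1.0
```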

@belevtsoff commented Nov 12, 2018

@RPrenger I just realized that I had a bug in the code that made the audio output in tensorboardX sound worse than it actually is. Anyway, here's an example at 120k steps (16kHz, bsz=3, segment=8000; 1080 Ti): https://soundcloud.com/belevtsoff/waveglow_120k. Both sigmas are 1.0.

@will-rice

@rafaelvalle Thanks! This is what I'm getting from LJSpeech at 250k now: https://soundcloud.com/user-667131267/in-domain-ljspeech-250k

@rafaelvalle (Contributor) commented Nov 12, 2018

@will-rice Sounds like it's training properly! For generating this LJS sample, what sigma value did you use?

@will-rice

@rafaelvalle sigma 0.85 for that one.

@dchaws commented Nov 13, 2018

@will-rice is that sample (https://soundcloud.com/user-667131267/in-domain-ljspeech-250k) from a model trained on a smaller dataset?

@will-rice

@dchaws That model was trained on the full ljspeech dataset with the default parameters.

@RPrenger

@belevtsoff That sounds reasonable for 120k iterations. It should keep improving with more iterations. Also try doing inference with sigma=0.8 or so.

@G-Wang commented Nov 14, 2018

@will-rice what is the synthesis speed on your setup? Faster than real time?

@will-rice commented Nov 14, 2018

@G-Wang Using a single 1080 Ti, that 9-second clip took about 2 seconds to generate.
Edit: I wanted to add that model inference is not the only thing running on this card.
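From those numbers the real-time factor works out to roughly 4-5x; the timing is approximate and depends on what else is sharing the card.

```python
# Back-of-the-envelope real-time factor from the figures above.
audio_seconds, wall_seconds = 9.0, 2.0
print(f"~{audio_seconds / wall_seconds:.1f}x faster than real time")  # ~4.5x
```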

@rafaelvalle (Contributor)
Closing issue. Please re-open if needed.

@yxt132 commented Jan 30, 2019

> Can I ask what value you're using for sigma during training and sample generation? And can you post a sample? We hear "decent speech" at ~160k iterations (though it definitely improves with more). I haven't seen a huge effect from the larger batch size, but we haven't done a lot of ablative analysis yet.

How many and what GPU(s) did you use? Did you train on LJSpeech or another dataset? Thanks.
