Can single GPU get good result? #12
I've been training with batch size 1 and it is doing pretty well. Definitely takes longer, but it seems to still work.
@will-rice how long does it take in your case, how many iterations? I tried running it with bsz=3, but shortening the segment length to 8000. After 80k iterations the speech is barely intelligible (I'm using about 12 hours of male voice, 16kHz).
I'm at 140k on ljspeech. It doesn't sound great, but it continues to improve. On a smaller dataset I'm using, at 165k the speech is noisy, but definitely better than the ljspeech one. According to the paper https://arxiv.org/pdf/1811.00002.pdf their model was trained with a batch size of 24 for 580k iterations. So by extremely rough math you are looking at well over 1 million iterations for results equivalent to the paper.
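The "extremely rough math" above can be sketched as a sample-throughput comparison. This is a simplification that ignores learning-rate and batch-statistics effects (and the shorter 8000-sample segments mentioned earlier, which would push the number even higher):

```python
# Rough equivalence by total training examples seen, as suggested above.
paper_batch = 24       # batch size from the WaveGlow paper
paper_iters = 580_000  # iterations reported in the paper
my_batch = 3           # batch size that fits on a single 1080 Ti

total_examples = paper_batch * paper_iters   # examples the paper's model saw
equiv_iters = total_examples // my_batch     # iterations needed at bsz=3
print(equiv_iters)  # 4640000
```

So matching the paper's total throughput at batch size 3 would take on the order of 4.6M iterations, consistent with "well over 1 million".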
Oh, I see, I overlooked that they used a batch size of 24! That explains a lot... Wow, the amount of training this thing requires is insane compared to WaveNet. Thanks
Can I ask what value you're using for sigma during training and sample generation? And can you post a sample? We hear "decent speech" at ~160k iterations (though it definitely improves with more). I haven't seen a huge effect from the larger batch size, but we haven't done a lot of ablative analysis yet.
https://soundcloud.com/user-667131267/waveglow-tedlium-150k |
@will-rice try sampling with a smaller sigma, 0.8 or 0.6 for example |
@RPrenger I just realized that I had a bug in the code that made the tensorboardX output audio worse than it is in reality. Anyway, here's an example at 120k steps (16kHz, bsz=3, segment=8000; 1080Ti): https://soundcloud.com/belevtsoff/waveglow_120k. Both sigmas are 1.0
@rafaelvalle Thanks! this is what I'm getting from LJSpeech at 250k now. https://soundcloud.com/user-667131267/in-domain-ljspeech-250k |
@will-rice sounds like it's training properly! for generating this ljs sample, what sigma value did you use? |
@rafaelvalle sigma 0.85 for that one. |
@will-rice is that sample (https://soundcloud.com/user-667131267/in-domain-ljspeech-250k) from a model trained on a smaller dataset? |
@dchaws That model was trained on the full ljspeech dataset with the default parameters. |
@belevtsoff That sounds reasonable for 120k iterations. It should keep improving with more iterations. Also try doing inference with sigma=0.8 or so. |
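For context on the sigma advice above: WaveGlow generates audio by drawing a latent z ~ N(0, sigma^2) and pushing it through the inverse flow, so inferring with sigma=0.8 instead of the training value of 1.0 simply shrinks the injected noise, trading sample diversity for cleaner-sounding audio. A minimal illustration of that knob (not the repo's actual code):

```python
import random
import statistics

def sample_latent(n, sigma=0.8):
    # Sketch only: draw the Gaussian latent that WaveGlow would feed
    # to its inverse flow; smaller sigma -> "cooler", less noisy samples.
    return [random.gauss(0.0, sigma) for _ in range(n)]

z = sample_latent(100_000, sigma=0.8)
print(round(statistics.stdev(z), 2))  # close to 0.8
```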
@will-rice what is the synthesis speed on your set up? Faster than real time? |
@G-Wang Using a single 1080ti that 9 second clip took about 2 seconds to generate. |
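Using the numbers from that comment, the real-time factor works out as follows:

```python
clip_seconds = 9.0       # length of the generated clip
synthesis_seconds = 2.0  # wall-clock time on a single 1080 Ti

rtf = clip_seconds / synthesis_seconds
print(rtf)  # 4.5, i.e. roughly 4.5x faster than real time
```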
Closing issue. Please re-open if needed. |
How many and what GPU(s) did you use? Did you train on LJSpeech or another dataset? Thanks
Does anyone train this model with a single GPU (1080 Ti) and get a good result? In that situation I can only run the model with batch size 1, because I don't have enough GPU memory...