
CUDA out of memory on 8 V100s while fine-tuning on a different dataset. Batch size 6, seg length 12k. #104

Closed
deepconsc opened this issue Mar 14, 2019 · 21 comments


@deepconsc commented Mar 14, 2019

After days of different approaches, we've decided to fine-tune the NVIDIA pre-trained model on a different-language dataset. We're running the model on 8 V100s with 16 GB of VRAM each.
Our dataset was recorded at 48 kHz and then downsampled to 22050 Hz.
The checkpoint has been converted.
After getting out-of-memory errors on 8 GPUs, we decreased the batch size to 6 and the segment length to 12000.
Even now, 3 of the 8 GPUs die and the batches are distributed across the remaining 5. Training time has increased drastically.

Any thoughts from collaborators/contributors?

@batikim09

Hi, I'm using a single Tesla V100 (32 GB). For this much VRAM, the maximum batch size is 8. I haven't hit out-of-memory yet, and I didn't change the architecture.
Do you mean a batch of 6 on each V100, i.e. a total batch of 48?

For reference, the original paper used 3 × 8, i.e. a total batch size of 24.
Since each of your cards has 16 GB of VRAM, a batch of 6 per card sounds too big.

I also found it somewhat difficult to reproduce good quality with a different dataset. Starting from the pre-trained model seems more reasonable than learning from scratch.

@deepconsc (Author)

@batikim09 Hey there!

I mean a batch size of 6 for the whole cluster of 8 GPUs.
Using the pre-trained model certainly seems like the most convenient way, but I'm struggling with the parameters.
I'm trying to get more RAM on the GCP image right now, and will test in a couple of hours.
If it throws errors again, I plan to keep the segment length the same but decrease the channels from 512 to 256; as one of the collaborators mentioned, that cuts training time almost 4x without giving up too much quality.
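
A minimal sketch of where those knobs usually live, assuming this repo's config.json keeps the channel count under waveglow_config → WN_config → n_channels and the batch/segment settings under train_config and data_config (verify the keys against your own copy):

```python
# Hedged sketch: edit config.json before launching training.
# Key names assume this repo's layout; verify them in your own config.json.
import json

with open("config.json") as f:
    config = json.load(f)

config["waveglow_config"]["WN_config"]["n_channels"] = 256  # down from 512
config["train_config"]["batch_size"] = 3                    # per-GPU batch size
config["data_config"]["segment_length"] = 16000

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```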

@deepconsc (Author) commented Mar 15, 2019

Here it goes: I've been debugging on different images with different batch sizes, including Tesla V100s and Tesla K80s. I converted the model and tried to fine-tune it on a different-language dataset.
After launching distributed.py, I monitored nvidia-smi with watch and noted how the VRAM load changed over time.
Every time it threw CUDA out-of-memory errors.

Right now the model is running on 8 Tesla K80s with batch size 1, channels = 512, and segment length 16k.
The K80s on GCP have 12 GB of VRAM (11441 MiB), and with the above config the model is taking 10312 MiB on each of them. The model is running.

My initial thought is that distributed.py multiplies the batch size by the number of GPUs and runs them in parallel, so every GPU carries the full batch size written in config.json.

P.S. The load on each GPU increased to 11336 MiB while I was writing this comment.

@rafaelvalle please have a look at this; maybe it's helpful, but I'm no expert in PyTorch anyway.

[Screenshot: nvidia-smi output, 2019-03-15]
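
For reference, that matches the standard PyTorch distributed pattern: one process per GPU, each building its own DataLoader with the batch size from config.json, so the effective batch is batch_size × world_size. A minimal sketch of that pattern (an assumption about what distributed.py does, not a copy of it):

```python
# Minimal sketch of the one-process-per-GPU pattern. Launch with one process
# per GPU (e.g. via torch.distributed.launch); each rank loads batch_size
# samples locally, so the effective batch is batch_size * world_size.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size):
    # Each rank sees a disjoint shard of the data but the full batch_size locally.
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    dataset = TensorDataset(torch.randn(1024, 16000))  # stand-in for audio segments
    loader = make_loader(dataset, batch_size=1)         # per-GPU batch from config.json
    print("rank %d: effective batch = %d"
          % (dist.get_rank(), 1 * dist.get_world_size()))
```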

@rafaelvalle (Contributor) commented Mar 16, 2019

Hard to say. With this repo we're able to train on 8 V100s.
Have you tried a different version of PyTorch?
Are you able to run with batch size 3 and segment length 16000?

@deepconsc (Author) commented Mar 16, 2019

Hm, no luck. It only worked with batch size 1 and the same segment length (16k).
I haven't tried a different version of PyTorch; would you suggest one specifically?

@rafaelvalle (Contributor)

Are you running on V100s or K80s?

@deepconsc (Author) commented Mar 16, 2019

Right now it's on K80s, retraining on the new dataset with the same parameters but batch size 1; training speed is 14.22 steps/min (8 K80s, 12 GB VRAM each).

@rafaelvalle (Contributor) commented Mar 16, 2019

We recently shared a model trained with 256 channels; it should help you increase your batch size.

@deepconsc (Author) commented Mar 16, 2019

I'll have a look at it.
I'll leave the issue open in case there's a solution for this specific situation.
Thank you!

@rafaelvalle (Contributor)

It's in the README; we just replaced the link:
https://drive.google.com/file/d/1WsibBTsuRg_SF2Z6L6NFRTT-NjEy1oTx/view

@deepconsc (Author)

Yes, I checked the README changes and found it, then edited my comment, haha. Thank you again!

@sravand93

Hi @batikim09,
Is your training completed? May I know how long it took you to train? We're on a single V100 (16 GB) GPU and want to estimate how long training will take.

Also, a single epoch with a batch size of 5 takes around 50 minutes. Do you think that's expected?
Your reply will help us a lot.

Thanks

@batikim09 commented Apr 3, 2019

@deepconsc Hi, your samples sound good. Were they generated using predicted features from your trained Tacotron2, or from ground-truth mel-spectrograms?

@sravand93 Hi, in my case (1x Tesla V100, 32 GB VRAM),
it roughly took 18 days to learn from scratch (~1000 epochs; I prefer that term over "iteration").

Since 32 GB of VRAM is too small to accommodate the original batch size (24), I used gradient accumulation with an effective batch size of 64; each epoch takes roughly 24 min. This matters because if your batch size is too small, your gradient updates can become unstable at some point (if you are unlucky). For gradient accumulation, please see:
https://discuss.pytorch.org/t/how-to-implement-accumulated-gradient/3822/18
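
A minimal sketch of that accumulation pattern (model, loader, criterion, and optimizer are placeholders, not this repo's code), summing scaled gradients over 8 micro-batches of 8 to emulate an effective batch of 64:

```python
# Hedged sketch of gradient accumulation: run several small batches, accumulate
# their scaled gradients, then take a single optimizer step.
accum_steps = 8  # 8 micro-batches of 8 samples -> effective batch of 64

def train_epoch(model, loader, criterion, optimizer):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = criterion(model(inputs), targets) / accum_steps  # scale so the sum matches one big batch
        loss.backward()                                          # gradients add up in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```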

I hope it helps you.

@deepconsc (Author) commented Apr 3, 2019

@batikim09 Hey there, thanks.
It's generated with an inference pipeline we put together: we input text, generate mels with Tacotron2, and use WaveGlow as the vocoder. We split the pipeline into two parts and deployed it on Google Colab, so we can run the checkpoint loading as one cell and then execute the second part for text input, text2mel, and mel2wav.
As additional info, it took 19 minutes to generate a 7-minute audio file on Google Colab.
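
A rough sketch of that two-part flow, following the pattern in the NVIDIA Tacotron2 repo's inference notebook (the checkpoint filenames, input text, and sigma value are assumptions; adapt them to your own setup):

```python
# Hedged sketch of the text -> mel (Tacotron2) -> audio (WaveGlow) pipeline,
# modeled on the Tacotron2 repo's inference example. Paths are placeholders.
import numpy as np
import torch
from hparams import create_hparams      # Tacotron2 repo
from train import load_model            # Tacotron2 repo
from text import text_to_sequence       # Tacotron2 repo
from scipy.io.wavfile import write

# Part 1 (first Colab cell): load both checkpoints once.
hparams = create_hparams()
tacotron2 = load_model(hparams)
tacotron2.load_state_dict(torch.load("tacotron2_finetuned.pt")["state_dict"])
tacotron2.cuda().eval()
waveglow = torch.load("waveglow_finetuned.pt")["model"]
waveglow.cuda().eval()

# Part 2 (second cell): text -> mel -> waveform.
sequence = np.array(text_to_sequence("Hello world.", ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()
with torch.no_grad():
    _, mel_outputs_postnet, _, _ = tacotron2.inference(sequence)
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
write("out.wav", hparams.sampling_rate, audio[0].data.cpu().numpy())
```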

@sharathadavanne

@deepconsc @batikim09 I understand from the comments that you both trained on custom data starting from the pretrained weights. I've been trying to do the same; in my case training works perfectly on a single GPU but freezes without any debug messages on multiple GPUs. Any pointers?

@deepconsc (Author)

@sharathadavanne use torch==1.0; other versions gave us trouble while fine-tuning. Also, you can start with a smaller batch size and debug the model while increasing it.
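
A quick sanity check along those lines (plain environment introspection, nothing repo-specific):

```python
# Check the environment before a multi-GPU run; torch==1.0 is what worked
# for the commenter above.
import torch

print(torch.__version__)            # expect something like 1.0.x
print(torch.cuda.is_available())    # True if CUDA is visible to PyTorch
print(torch.cuda.device_count())    # should match the number of GPUs you launch on
```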

@Aradhya0510

Hey @deepconsc, your synthesis sounds great! May I know the size of the dataset you used to fine-tune the model? Also, have you tried training from scratch on your data?

@deepconsc (Author)

@Aradhya0510 Hey, thanks mate.
It's 11 hours of really rough data, with cars and other noise in the background. We filtered the background noise before feeding it in, which took away some of the frequencies that carried human speech, so I'm really impressed by this result as well.
On the training part, I don't really see the point of training from scratch. I think fine-tuning is the best way to go in terms of compute, dataset size, and time. Once Tacotron2 knows how to generate mels from English text, it won't struggle to re-learn the features of another language, and the same goes for WaveGlow. If you check the checkpoints from time to time while training, you'll see that it yields decent results even below 200k iters from scratch (not realistic yet, but you can actually make out the letters in the generated audio) and then tries to generalize and refine them. Anyway, fine-tuning is the way to go.

@rafaelvalle (Contributor)

You can fine-tune, train from scratch, and compare the results.
Given that the shared models already sound OK on other speakers, I assume fine-tuning to different speakers should be better.

@Aradhya0510 commented Jul 4, 2019

@deepconsc Considering the quality of the data, your results are even more impressive. I understand that having a baseline is always a good idea, but training from scratch gives you a better picture of how the model learns, what the feature hierarchies are, and how you can further exploit the architecture to reach the desired state faster. I'm also curious whether the model carries any features from the original speaker after fine-tuning, as I've seen with some other voice-conditioning approaches. So a comparison (fine-tuned vs. trained from scratch) might come in handy.

@rafaelvalle (Contributor)

Closing due to inactivity.
