CUDA out of memory on 8 V100s while fine-tuning on a different dataset. Batch size 6, segment length 12k. #104
Comments
Hi, I'm using a single Tesla V100 (32GB). For this amount of VRAM, the maximum batch size is 8. I haven't hit out-of-memory yet, and I did not change any of the architecture. As a reference, the original paper used 3 x 8, for a total batch size of 24. I also found it somewhat difficult to reproduce good quality when using a different dataset. Starting from the pre-trained model seems more reasonable than learning from scratch.
@batikim09 Hey there! I mean a batch size of 6 for the whole cluster of 8 GPUs.
Here it goes - I've been debugging on different machine images with different batch sizes, including Tesla V100s and Tesla K80s. I've converted the model and tried to fine-tune it on a different-language dataset. Right now the model is running on 8 Tesla K80s with batch size 1, channels = 512, segment length 16k. My initial thought is that distributed.py multiplies the batch size by the number of GPUs and runs them in parallel, so that every GPU carries the full batch size written in config.json. P.S. The load on each GPU increased to 11336 MiB while I was writing this comment. @rafaelvalle please have a look at this, maybe it's helpful, but I'm not an expert in PyTorch anyway.
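For context, the behavior described above — each worker process using the config.json batch size, so the global batch grows with the GPU count — would look roughly like this in a generic PyTorch DDP data pipeline. This is only a sketch, not this repo's distributed.py; `build_loader` and its arguments are placeholders.

```python
# Sketch (not this repo's distributed.py): in a typical PyTorch DDP setup each
# of the N worker processes builds its own DataLoader with the batch size from
# config.json, so the effective global batch is batch_size * world_size and the
# per-GPU memory corresponds to the full configured batch size.
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, per_gpu_batch_size):
    sampler = DistributedSampler(dataset)            # splits samples across ranks
    return DataLoader(dataset,
                      batch_size=per_gpu_batch_size,  # this is PER GPU, not global
                      sampler=sampler,
                      num_workers=4,
                      pin_memory=True)

# Effective batch per optimizer step:
# effective_batch = per_gpu_batch_size * dist.get_world_size()
```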
Hard to say. With this repo we're able to train on 8 V100s.
Hm, not a chance. It only worked with a batch size of 1, and the segment length was the same (16k).
Are you running on V100s or K80s?
Right now on K80s, retraining on the new dataset with the same parameters but batch size 1; training speed is 14.22 steps/min (8 K80s, 12 GB VRAM each).
We recently shared a model trained with 256 channels. This should help you increase your batch size.
I'll have a look at it.
It's in the readme; we just replaced the link.
Yes, I looked through the readme changes and found it, then edited my comment, haha. Thank you again!
Hi @batikim09, also, a single epoch with a batch size of 5 is taking around 50 minutes. Do you think that is the expected time? Thanks.
@deepconsc Hi, your samples sound good. Were they generated using predicted features from your trained Tacotron2 or from ground-truth mel-spectrograms? @sravand93 Hi, in my case (1x Tesla V100, 32 GB VRAM), since 32 GB of VRAM is too small to accommodate the original batch size (24), I used a gradient accumulation technique. With an effective batch size of 64, each epoch takes roughly 24 minutes. This matters because if your batch size is too small, your gradient updates can become unstable at some point (if you are unlucky). For gradient accumulation, please see: […]. I hope it helps you.
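The gradient-accumulation idea mentioned above, as a rough PyTorch sketch. This is not the repo's train.py; `model`, `criterion`, `optimizer`, `train_loader`, and the `(mel, audio)` batch format are placeholders for whatever your setup actually uses.

```python
# Gradient accumulation sketch: several small micro-batches behave like one
# large batch, e.g. 8 micro-batches of 8 samples ~ an effective batch of 64.
def train_with_accumulation(model, criterion, optimizer, train_loader, accum_steps=8):
    optimizer.zero_grad()
    for step, (mel, audio) in enumerate(train_loader):
        loss = criterion(model(mel, audio))
        (loss / accum_steps).backward()   # scale so the summed gradient matches a big batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()              # one parameter update per accum_steps micro-batches
            optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the update magnitude comparable to a single large-batch step.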
@batikim09 Hey there, thanks.
@deepconsc @batikim09 I understood from the comments that you both trained on custom data starting from the pretrained weights. I have been trying to do the same; in my case training works perfectly on a single GPU, but freezes without any debug messages on multiple GPUs. Any pointers?
@sharathadavanne Get torch==1.0; other versions gave us trouble while fine-tuning. Also, you can decrease the batch size at the beginning and debug the model while gradually increasing it.
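As a generic sanity check before launching a multi-GPU run (not part of this repo, just a quick way to confirm the environment matches the advice above):

```python
# Quick environment check before a multi-GPU fine-tuning run.
import torch
import torch.distributed as dist

print("torch version      :", torch.__version__)        # e.g. pin to 1.0 as suggested above
print("CUDA build         :", torch.version.cuda)
print("visible GPUs       :", torch.cuda.device_count())
print("distributed support:", dist.is_available())
```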
Hey @deepconsc, your synthesis sounds great! May I ask what size of dataset you used to fine-tune the model? Also, have you tried training from scratch on your data?
@Aradhya0510 Hey, thanks mate.
You can fine-tune, train from scratch, and compare the results.
@deepconsc Considering the quality of the data, your results are even more impressive. I understand that having a baseline is always a good idea, but training from scratch gives you a better picture of how the model is learning, what the feature hierarchies are, and how you can further exploit the architecture to reach the desired state faster. I am also curious whether the model carries over any features from the original speaker after fine-tuning, as I have seen happen with some other voice-conditioning approaches. So a comparison (fine-tuned vs. trained from scratch) might come in handy.
Closing due to inactivity. |
After days of trying different approaches, we've decided to fine-tune the NVIDIA pre-trained model on a different-language dataset. We're running the model on 8 V100s with 16 GB of VRAM each.
Our dataset was recorded at 48 kHz and then downsampled to 22,050 Hz.
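For reference, the downsampling step might look like this (one possible approach using librosa and soundfile; the file paths are placeholders, and this is not tied to the repo's own preprocessing):

```python
# Downsample 48 kHz recordings to 22,050 Hz before fine-tuning.
import librosa
import soundfile as sf

def downsample(in_path, out_path, target_sr=22050):
    audio, _ = librosa.load(in_path, sr=target_sr)  # librosa resamples while loading
    sf.write(out_path, audio, target_sr)

downsample("recording_48k.wav", "recording_22k.wav")  # placeholder paths
```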
Checkpoint is converted.
Additionally, because we were getting out-of-memory errors on the 8 GPUs, we decreased the batch size to 6 and the segment length to 12,000.
Even now, 3 of the 8 GPUs die, and the batches end up distributed across the remaining 5. Training time has increased drastically.
Any thoughts from collaborators/contributors?