
CUDA out of memory on 8 V100s while fine-tuning on a different dataset. Batch size 6, seg length 12k. #104

Closed
deepconsc opened this issue Mar 14, 2019 · 21 comments


@deepconsc commented Mar 14, 2019

After days of different approaches, we've decided to fine-tune the NVIDIA pre-trained model on a different-language dataset. We're running the model on 8 V100s with 16 GB of VRAM each.
Our dataset was recorded at 48 kHz and then downsampled to 22050 Hz.
The checkpoint has been converted.
After getting out-of-memory errors on 8 GPUs, we decreased the batch size to 6 and the segment length to 12000.
Even now, 3 of the 8 GPUs die and the batches are distributed across the remaining 5. Training time has increased drastically.

Any thoughts from collaborators/contributors?

@batikim09

Hi, I'm using a single Tesla V100 (32 GB). For this much VRAM, the maximum batch size is 8. I haven't hit out-of-memory yet, and I didn't change the architecture.
Do you mean a batch of 6 on each V100, i.e. a total batch of 48?

For reference, the original paper used 3 × 8, i.e. a total batch size of 24.
Since each of your cards has 16 GB of VRAM, a batch of 6 per card sounds too big.

I also found it somewhat difficult to reproduce good quality with a different dataset. Starting from the pre-trained model seems more reasonable than learning from scratch.

@deepconsc (Author)

@batikim09 Hey there!

I mean a batch size of 6 for the whole cluster of 8 GPUs.
Using the pre-trained model certainly seems like the most convenient way, but I'm struggling with the parameters.
I'm trying to get more RAM on the GCP image right now, and will test in a couple of hours.
If it throws errors again, I plan to keep the segment length the same but decrease the channels from 512 to 256; as one of the collaborators mentioned, that cuts training time almost 4x without giving up too much quality.
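
A minimal sketch of where those knobs usually live, assuming this repo's config.json keeps the channel count under waveglow_config → WN_config → n_channels and the batch/segment settings under train_config and data_config (verify the keys against your own copy):

```python
# Hedged sketch: edit config.json before launching training.
# Key names assume this repo's layout; verify them in your own config.json.
import json

with open("config.json") as f:
    config = json.load(f)

config["waveglow_config"]["WN_config"]["n_channels"] = 256  # down from 512
config["train_config"]["batch_size"] = 3                    # per-GPU batch size
config["data_config"]["segment_length"] = 16000

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```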

@deepconsc (Author) commented Mar 15, 2019

Here it goes: I've been debugging on different images with different batch sizes, including Tesla V100s and Tesla K80s. I converted the model and tried to fine-tune it on a different-language dataset.
After launching distributed.py, I monitored nvidia-smi with watch and noted how the VRAM load changed over time.
Every time it threw CUDA out-of-memory errors.

Right now the model is running on 8 Tesla K80s with batch size 1, channels = 512, and segment length 16k.
The K80s on GCP have 12 GB of VRAM (11441 MiB), and with the above config the model is taking 10312 MiB on each of them. The model is running.

My initial thought is that distributed.py multiplies the batch size by the number of GPUs and runs them in parallel, so every GPU carries the full batch size written in config.json.

P.S. The load on each GPU increased to 11336 MiB while I was writing this comment.

@rafaelvalle please have a look at this; maybe it's helpful, but I'm no expert in PyTorch anyway.

[Screenshot: nvidia-smi output, 2019-03-15]
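
For reference, that matches the standard PyTorch distributed pattern: one process per GPU, each building its own DataLoader with the batch size from config.json, so the effective batch is batch_size × world_size. A minimal sketch of that pattern (an assumption about what distributed.py does, not a copy of it):

```python
# Minimal sketch of the one-process-per-GPU pattern. Launch with one process
# per GPU (e.g. via torch.distributed.launch); each rank loads batch_size
# samples locally, so the effective batch is batch_size * world_size.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size):
    # Each rank sees a disjoint shard of the data but the full batch_size locally.
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    dataset = TensorDataset(torch.randn(1024, 16000))  # stand-in for audio segments
    loader = make_loader(dataset, batch_size=1)         # per-GPU batch from config.json
    print("rank %d: effective batch = %d"
          % (dist.get_rank(), 1 * dist.get_world_size()))
```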

@rafaelvalle (Contributor) commented Mar 16, 2019

Hard to say. With this repo we're able to train on 8 V100s.
Have you tried a different version of PyTorch?
Are you able to run with batch size 3 and segment length 16000?

@deepconsc (Author) commented Mar 16, 2019

Hm, no luck. It only worked with batch size 1 and the same segment length (16k).
I haven't tried a different version of PyTorch; would you suggest one specifically?

@rafaelvalle (Contributor)

Are you running on V100s or K80s?

@deepconsc (Author) commented Mar 16, 2019

Right now it's on K80s, retraining on the new dataset with the same parameters but batch size 1; training speed is 14.22 steps/min (8 K80s, 12 GB VRAM each).

@rafaelvalle (Contributor) commented Mar 16, 2019

We recently shared a model trained with 256 channels; it should help you increase your batch size.

@deepconsc (Author) commented Mar 16, 2019

I'll have a look at it.
I'll leave the issue open in case there's a solution for this specific situation.
Thank you!

@rafaelvalle (Contributor)

It's in the README; we just replaced the link:
https://drive.google.com/file/d/1WsibBTsuRg_SF2Z6L6NFRTT-NjEy1oTx/view

@deepconsc (Author)

Yes, I checked the README changes and found it, then edited my comment, haha. Thank you again!

@sravand93

Hi @batikim09,
Is your training completed? May I know how long it took you to train? We're on a single V100 (16 GB) GPU and want to estimate how long training will take.

Also, a single epoch with a batch size of 5 takes around 50 minutes. Do you think that's expected?
Your reply will help us a lot.

Thanks

@batikim09 commented Apr 3, 2019

@deepconsc Hi, your samples sound good. Were they generated using predicted features from your trained Tacotron2, or from ground-truth mel-spectrograms?

@sravand93 Hi, in my case (1x Tesla V100, 32 GB VRAM),
it roughly took 18 days to learn from scratch (~1000 epochs; I prefer that term over "iteration").

Since 32 GB of VRAM is too small to accommodate the original batch size (24), I used gradient accumulation with an effective batch size of 64; each epoch takes roughly 24 min. This matters because if your batch size is too small, your gradient updates can become unstable at some point (if you are unlucky). For gradient accumulation, please see:
https://discuss.pytorch.org/t/how-to-implement-accumulated-gradient/3822/18
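
A minimal sketch of that accumulation pattern (model, loader, criterion, and optimizer are placeholders, not this repo's code), summing scaled gradients over 8 micro-batches of 8 to emulate an effective batch of 64:

```python
# Hedged sketch of gradient accumulation: run several small batches, accumulate
# their scaled gradients, then take a single optimizer step.
accum_steps = 8  # 8 micro-batches of 8 samples -> effective batch of 64

def train_epoch(model, loader, criterion, optimizer):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = criterion(model(inputs), targets) / accum_steps  # scale so the sum matches one big batch
        loss.backward()                                          # gradients add up in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```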

I hope it helps you.

@deepconsc (Author) commented Apr 3, 2019

@batikim09 Hey there, thanks.
It's generated with an inference pipeline we put together: we input text, generate mels with Tacotron2, and use WaveGlow as the vocoder. We split the pipeline into two parts and deployed it on Google Colab, so we can run the checkpoint loading as one cell and then execute the second part for text input, text2mel, and mel2wav.
As additional info, it took 19 minutes to generate a 7-minute audio file on Google Colab.
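
A rough sketch of that two-part flow, following the pattern in the NVIDIA Tacotron2 repo's inference notebook (the checkpoint filenames, input text, and sigma value are assumptions; adapt them to your own setup):

```python
# Hedged sketch of the text -> mel (Tacotron2) -> audio (WaveGlow) pipeline,
# modeled on the Tacotron2 repo's inference example. Paths are placeholders.
import numpy as np
import torch
from hparams import create_hparams      # Tacotron2 repo
from train import load_model            # Tacotron2 repo
from text import text_to_sequence       # Tacotron2 repo
from scipy.io.wavfile import write

# Part 1 (first Colab cell): load both checkpoints once.
hparams = create_hparams()
tacotron2 = load_model(hparams)
tacotron2.load_state_dict(torch.load("tacotron2_finetuned.pt")["state_dict"])
tacotron2.cuda().eval()
waveglow = torch.load("waveglow_finetuned.pt")["model"]
waveglow.cuda().eval()

# Part 2 (second cell): text -> mel -> waveform.
sequence = np.array(text_to_sequence("Hello world.", ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()
with torch.no_grad():
    _, mel_outputs_postnet, _, _ = tacotron2.inference(sequence)
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
write("out.wav", hparams.sampling_rate, audio[0].data.cpu().numpy())
```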

@sharathadavanne

@deepconsc @batikim09 I understand from the comments that you both trained on custom data starting from the pretrained weights. I've been trying to do the same; in my case training works perfectly on a single GPU but freezes without any debug messages on multiple GPUs. Any pointers?

@deepconsc (Author)

@sharathadavanne use torch==1.0; other versions gave us trouble while fine-tuning. Also, you can start with a smaller batch size and debug the model while increasing it.
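
A quick sanity check along those lines (plain environment introspection, nothing repo-specific):

```python
# Check the environment before a multi-GPU run; torch==1.0 is what worked
# for the commenter above.
import torch

print(torch.__version__)            # expect something like 1.0.x
print(torch.cuda.is_available())    # True if CUDA is visible to PyTorch
print(torch.cuda.device_count())    # should match the number of GPUs you launch on
```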

@Aradhya0510

Hey @deepconsc, your synthesis sounds great! May I know the size of the dataset you used to fine-tune the model? Also, have you tried training from scratch on your data?

@deepconsc (Author)

@Aradhya0510 Hey, thanks mate.
It's 11 hours of really rough data, with cars and other noise in the background. We filtered the background noise before feeding it in, which took away some of the frequencies that carried human speech, so I'm really impressed by this result as well.
On the training part, I don't really see the point of training from scratch. I think fine-tuning is the best way to go in terms of compute, dataset size, and time. Once Tacotron2 knows how to generate mels from English text, it won't struggle to re-learn the features of another language, and the same goes for WaveGlow. If you check the checkpoints from time to time while training, you'll see that it yields decent results even below 200k iters from scratch (not realistic yet, but you can actually make out the letters in the generated audio) and then tries to generalize and refine them. Anyway, fine-tuning is the way to go.

@rafaelvalle (Contributor)

You can fine-tune, train from scratch, and compare the results.
Given that the shared models already sound OK on other speakers, I assume fine-tuning to different speakers should be better.

@Aradhya0510 commented Jul 4, 2019

@deepconsc Considering the quality of the data, your results are even more impressive. I understand that having a baseline is always a good idea, but training from scratch gives you a better picture of how the model learns, what the feature hierarchies are, and how you can further exploit the architecture to reach the desired state faster. I'm also curious whether the model carries any features from the original speaker after fine-tuning, as I've seen with some other voice-conditioning approaches. So a comparison (fine-tuned vs. trained from scratch) might come in handy.

@rafaelvalle (Contributor)

Closing due to inactivity.
