
failed to initialize batched cufft plan with customized allocator #711

Closed
abaddon-moriarty opened this issue Nov 29, 2021 · 2 comments
Labels: bug 🐛 Something isn't working


@abaddon-moriarty

Hello everyone,
I am currently training a phoneme-based HiFi-GAN model and I recently ran into the following issue. It started when I tried using multiple GPUs, but now I can't even train on a single GPU.

I've seen suggestions to reduce the batch size, but these are the settings in my hifigan.v1.yaml file:

[screenshot: batch-size settings from hifigan.v1.yaml]

I saw this issue with the same `failed to initialize batched cufft plan with customized allocator` error, but in that case the GPU ran out of memory, which is not the case for me.
I also saw in this issue that the problem was with `batch_max_steps_valid`, but I've used the same file to train other vocoders and this is the first time the error has arisen. What should the correct value be?

```
INFO:tensorflow:batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:45:53,870 (cross_device_ops:702) INFO: batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:46:15,996 (cross_device_ops:702) INFO: batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:46:53,329 (cross_device_ops:702) INFO: batch_all_reduce: 156 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:47:14,118 (cross_device_ops:702) INFO: batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
2021-11-29 16:48:33.400178: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-11-29 16:48:38.176008: E tensorflow/stream_executor/cuda/cuda_fft.cc:223] failed to make cuFFT batched plan:5
2021-11-29 16:48:38.176052: E tensorflow/stream_executor/cuda/cuda_fft.cc:426] Initialize Params: rank: 1 elem_count: 2048 input_embed: 2048 input_stride: 1 input_distance: 2048 output_embed: 1025 output_stride: 1 output_distance: 1025 batch_count: 480
2021-11-29 16:48:38.176062: F tensorflow/stream_executor/cuda/cuda_fft.cc:435] failed to initialize batched cufft plan with customized allocator: Failed to make cuFFT batched plan.
```
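For what it's worth, a rough back-of-the-envelope estimate from the `Initialize Params` line above (assuming float32 real input and complex64 output, as used by a real-to-complex FFT where `output_embed = elem_count // 2 + 1`) suggests the plan's I/O buffers themselves are only a few MiB, so the buffers alone shouldn't exhaust GPU memory:

```python
# Rough estimate of the I/O buffer sizes implied by the cuFFT
# "Initialize Params" log line. Assumes float32 (4-byte) real input
# and complex64 (8-byte) output for a real-to-complex transform.
elem_count = 2048      # FFT length (rank-1 transform)
output_embed = 1025    # elem_count // 2 + 1 for a real-to-complex FFT
batch_count = 480      # number of transforms in the batch

input_bytes = batch_count * elem_count * 4     # float32 input
output_bytes = batch_count * output_embed * 8  # complex64 output

print(f"input : {input_bytes / 2**20:.2f} MiB")   # ~3.75 MiB
print(f"output: {output_bytes / 2**20:.2f} MiB")  # ~3.75 MiB
```

(cuFFT also allocates internal workspace beyond these buffers, so this is only a lower bound.)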

Any ideas on how to correct this?
Thank you

@dathudeptrai dathudeptrai self-assigned this Dec 5, 2021
@dathudeptrai dathudeptrai added the bug 🐛 Something isn't working label Dec 5, 2021
@dathudeptrai
Collaborator

@ZDisket do you know what the problem is here?

@abaddon-moriarty
Author

I have re-initialised everything and started from scratch; I no longer have this issue.
