CUDA out of memory #52

Closed
GabrielleTse opened this issue Nov 22, 2021 · 5 comments

@GabrielleTse

We run the training with PBR-rendered data in parallel on eight GPUs (NVIDIA 2080 Ti with 12 GB of graphics memory each); it barely starts training with batch size 8 (the original is 24). But when we resume the training process, CUDA runs out of memory.

We'd like to know the author's training configuration...

@wangg12
Member

wangg12 commented Nov 22, 2021

Hello, I usually train the model using a single 2080Ti GPU (11G).

@GabrielleTse
Author

Thank you very much for your fast and kind reply! However, we're still facing the same issue with multiple 2080 Ti GPUs. Our remaining questions are:

1. The rendered PBR data for LM-O is around 200 GB, and the "CUDA out of memory" error arises only when we train on real+pbr. Is this size of the PBR data correct? Is it possible that something went wrong in the PBR data rendering?


2. We find that there are 8 sub-processes running on each GPU. Is there a .py file controlling this parallel computing, so that we can reduce that number in the configuration?


@wangg12 added the help wanted label Nov 23, 2021
@wangg12
Member

wangg12 commented Nov 23, 2021

For training on multiple GPUs with DDP, it seems that the dataloader workers create many subprocesses which occupy a lot of GPU memory. I don't know how to solve this problem, and I almost always train the models on a single GPU. Maybe there is a hidden bug in the current implementation of multi-GPU DDP training.
If anyone can solve the problem, please do let me know.
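
As a rough illustration of the mechanism (a minimal, hypothetical PyTorch sketch, not the GDR-Net dataloader itself): each DDP rank spawns its own dataloader workers, so with 8 GPUs the machine runs 8 × NUM_WORKERS loader subprocesses on top of the 8 training processes, and any worker that touches CUDA allocates its own context on the GPU.

```python
# Minimal sketch (hypothetical names, not the repo's code): shows how the
# process count multiplies under DDP and why workers can hold GPU memory.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # Keep samples on the CPU. If this method (or a transform) created
        # CUDA tensors, every worker process would initialize its own CUDA
        # context and by itself occupy several hundred MB of GPU memory.
        return torch.randn(3, 256, 256)

def build_loader(num_workers: int) -> DataLoader:
    # With 8 DDP ranks, total loader subprocesses = 8 * num_workers,
    # in addition to the 8 training processes themselves.
    return DataLoader(ToyDataset(), batch_size=8, num_workers=num_workers)
```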

@GabrielleTse
Author

We've fixed the problem by changing the training configuration file:
https://github.com/THU-DA-6D-Pose-Group/GDR-Net/blob/main/configs/base/common_base.py

```python
# -----------------------------------------------------------------------------
# DataLoader
# -----------------------------------------------------------------------------
DATALOADER = dict(
    # Number of data loading threads
    NUM_WORKERS=4,
    # ……
)
```

NUM_WORKERS controls the number of data-loading worker processes distributed across the GPUs. When we reduce NUM_WORKERS from 4 to 3, CUDA no longer runs out of memory on multiple GPUs.
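
For reference, the change amounts to editing the dict quoted above in configs/base/common_base.py (excerpt only; the other keys are left untouched):

```python
DATALOADER = dict(
    # Number of data loading threads; lowering this reduces the number of
    # loader subprocesses created per training process under DDP.
    NUM_WORKERS=3,  # reduced from 4 to avoid CUDA OOM on multi-GPU training
)
```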

@wangg12
Member

wangg12 commented Nov 25, 2021

Try the updated code; the memory issue caused by DDP spawn should be resolved.
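
The actual patch isn't quoted in this thread. For readers hitting the same problem on older revisions, one generic PyTorch mitigation, shown purely as a sketch and not necessarily what the repository does, is to give the dataloader workers a "spawn" multiprocessing context so they start as fresh processes rather than inheriting the training process's state:

```python
# Generic sketch (assumption, not the repo's actual fix): use a 'spawn'
# multiprocessing context for dataloader workers under DDP so that each
# worker starts clean instead of forking from a CUDA-initialized parent.
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader() -> DataLoader:
    dataset = TensorDataset(torch.randn(64, 3, 64, 64))
    return DataLoader(
        dataset,
        batch_size=8,
        num_workers=2,
        multiprocessing_context="spawn",  # fresh worker processes
        persistent_workers=True,          # keep workers alive across epochs
    )

if __name__ == "__main__":  # guard is required with the 'spawn' start method
    for (batch,) in build_loader():
        pass  # a training step would go here
```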

@wangg12 closed this as completed Nov 25, 2021
@wangg12 removed the help wanted label Nov 25, 2021