CUDA out of memory #52

Closed
GabrielleTse opened this issue Nov 22, 2021 · 5 comments

@GabrielleTse

We run the training with PBR-rendered data in parallel on eight GPUs (NVIDIA 2080 Ti with 12 GB of graphics memory each); it barely starts training with batch size 8 (the original is 24). But when we resume the training process, CUDA runs out of memory.

We'd like to know the author's training configuration...

@wangg12
Member

wangg12 commented Nov 22, 2021

Hello, I usually train the model using a single 2080Ti GPU (11G).

@GabrielleTse
Author

Thank you very much for your fast and kind reply! However, we're still facing the same issue with multiple 2080 Ti GPUs. Our remaining questions are:

1. The rendered PBR data for LM-O is around 200 GB, and the "CUDA out of memory" error arises only when we train on real+pbr. Is this size of the PBR data correct? Is it possible that something went wrong in the PBR data rendering?


2. We find that there are 8 sub-processes running on each GPU. Is there a .py file controlling this parallel computing, so that we can reduce that number in the configuration?


@wangg12 added the help wanted label Nov 23, 2021
@wangg12
Member

wangg12 commented Nov 23, 2021

For training on multiple GPUs with DDP, it seems that the dataloader workers create many subprocesses which occupy a lot of GPU memory. I don't know how to solve this problem, and I almost always train the models on a single GPU. Maybe there is a hidden bug in the current implementation of multi-GPU DDP training.
If anyone can solve the problem, please do let me know.
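
As a rough illustration of the mechanism (a minimal, hypothetical PyTorch sketch, not the GDR-Net dataloader itself): each DDP rank spawns its own dataloader workers, so with 8 GPUs the machine runs 8 × NUM_WORKERS loader subprocesses on top of the 8 training processes, and any worker that touches CUDA allocates its own context on the GPU.

```python
# Minimal sketch (hypothetical names, not the repo's code): shows how the
# process count multiplies under DDP and why workers can hold GPU memory.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # Keep samples on the CPU. If this method (or a transform) created
        # CUDA tensors, every worker process would initialize its own CUDA
        # context and by itself occupy several hundred MB of GPU memory.
        return torch.randn(3, 256, 256)

def build_loader(num_workers: int) -> DataLoader:
    # With 8 DDP ranks, total loader subprocesses = 8 * num_workers,
    # in addition to the 8 training processes themselves.
    return DataLoader(ToyDataset(), batch_size=8, num_workers=num_workers)
```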

@GabrielleTse
Author

We've fixed the problem by changing the training configuration file:
https://github.com/THU-DA-6D-Pose-Group/GDR-Net/blob/main/configs/base/common_base.py

```python
# -----------------------------------------------------------------------------
# DataLoader
# -----------------------------------------------------------------------------
DATALOADER = dict(
    # Number of data loading threads
    NUM_WORKERS=4,
    # ……
)
```

NUM_WORKERS controls the number of data-loading worker processes distributed across the GPUs. When we reduce NUM_WORKERS from 4 to 3, CUDA no longer runs out of memory on multiple GPUs.
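
For reference, the change amounts to editing the dict quoted above in configs/base/common_base.py (excerpt only; the other keys are left untouched):

```python
DATALOADER = dict(
    # Number of data loading threads; lowering this reduces the number of
    # loader subprocesses created per training process under DDP.
    NUM_WORKERS=3,  # reduced from 4 to avoid CUDA OOM on multi-GPU training
)
```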

@wangg12
Member

wangg12 commented Nov 25, 2021

Try the updated code; the memory issue caused by DDP spawn should be resolved.
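
The actual patch isn't quoted in this thread. For readers hitting the same problem on older revisions, one generic PyTorch mitigation, shown purely as a sketch and not necessarily what the repository does, is to give the dataloader workers a "spawn" multiprocessing context so they start as fresh processes rather than inheriting the training process's state:

```python
# Generic sketch (assumption, not the repo's actual fix): use a 'spawn'
# multiprocessing context for dataloader workers under DDP so that each
# worker starts clean instead of forking from a CUDA-initialized parent.
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader() -> DataLoader:
    dataset = TensorDataset(torch.randn(64, 3, 64, 64))
    return DataLoader(
        dataset,
        batch_size=8,
        num_workers=2,
        multiprocessing_context="spawn",  # fresh worker processes
        persistent_workers=True,          # keep workers alive across epochs
    )

if __name__ == "__main__":  # guard is required with the 'spawn' start method
    for (batch,) in build_loader():
        pass  # a training step would go here
```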

@wangg12 closed this as completed Nov 25, 2021
@wangg12 removed the help wanted label Nov 25, 2021