CUDA out of memory #52
Hello, I usually train the model using a single 2080 Ti GPU (11 GB).
Thank you very much for your fast and kind reply! However, we're still facing the same issue with multiple 2080 Ti GPUs. Our concerns now are:
For training on multiple GPUs with DDP, it seems that the workers in the dataloader create many subprocesses that occupy a lot of GPU memory. I don't know how to solve this problem, and I almost always train the models using a single GPU. Maybe there are some hidden bugs in the current implementation for training on multiple GPUs with DDP.
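For context, here is a minimal sketch of why worker processes multiply under DDP (the function and values are illustrative, not this repo's actual code): each GPU rank builds its own copy of the DataLoader, so the total number of worker subprocesses is num_gpus * num_workers.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, batch_size, num_workers):
    # Assumes torch.distributed has already been initialized by the launcher.
    # Each GPU rank runs this function, so with 8 GPUs and num_workers=4,
    # 32 loader subprocesses exist in total.
    sampler = DistributedSampler(dataset)  # one shard of the data per rank
    return DataLoader(
        dataset,
        batch_size=batch_size,       # per-GPU batch size
        sampler=sampler,
        num_workers=num_workers,     # worker subprocesses forked per rank
        pin_memory=True,
        persistent_workers=False,    # let workers exit between epochs
    )
```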
We've fixed the problem by changing the training configuration file, in the `DATALOADER = dict(...)` section. NUM_WORKERS controls the number of loader processes distributed across the GPUs; when we reduce it, CUDA no longer runs out of memory on multiple GPUs. A sketch of the change is below.
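A sketch of the change described above, following the repo's dict-style config; the reduced value of 2 is our guess at a safe setting, not a confirmed recommendation:

```python
# -------------------------------- DataLoader --------------------------------
DATALOADER = dict(
    # Reducing NUM_WORKERS lowers the number of loader subprocesses spawned
    # per GPU, which in turn reduces the extra CUDA memory they hold.
    NUM_WORKERS=2,  # was a larger value, e.g. 4 or 8
)
```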
Try the updated code; the memory issue caused by the DDP spawn should be resolved.
We run the training process with PBR-rendered data on eight GPUs in parallel (NVIDIA 2080 Ti, 11 GB of graphics memory each); it barely starts training with batch size 8 (the original is 24). But when we resume the training process, CUDA runs out of memory.
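One common cause of OOM appearing only at resume time (not confirmed to be this repo's bug) is that the checkpoint tensors are loaded directly onto the GPU they were saved from, on top of the freshly built model. A hedged sketch of the usual CPU-first workaround, with assumed checkpoint path and key names:

```python
import torch

# Load to CPU first so checkpoint tensors are not materialized on the GPU
# that saved them; "checkpoint.pth" and the key names are illustrative.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"])          # model assumed already built
optimizer.load_state_dict(checkpoint["optimizer"])  # optimizer assumed already built
```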
We'd like to know the author's training configuration...