
Handle GPU memory management #20

Closed
lukeyeager opened this issue Mar 23, 2015 · 7 comments
Comments

@lukeyeager
Member

See #18.

DIGITS should handle GPU memory allocation for the user automatically. This could be done in a few ways:

  1. Calculate how much memory will be required before the training starts and adjust the batch size automatically. This may not be possible - I'd have to dig into the caffe code to figure out whether they even know before running. There is a "Memory required for data" line in caffe's output, but it seems totally unrelated to the amount of memory actually used on the GPU.
  2. Detect out-of-memory failures and automatically scale down the batch size and re-run the job until it fits in memory. This could take a while - even on a fast machine, the VGG network takes a few minutes before taking up its maximum amount of memory on the GPU. So, if we have to wait for caffe to fail 2 or 3 times before getting it right, this could be a major time suck.
  3. Let cuDNN handle the memory management for us automatically. In section 3.11 of the cuDNN user doc, there are options for specifying how to choose the convolution algorithm. The CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT option could be used to say "use the fastest algorithm that fits within the specified memory budget" (see the sketch after this list). This would be a change to caffe, not to DIGITS, and it's not a complete solution since many people will be using caffe without cuDNN and maybe even without CUDA.
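
To make option 3 concrete, here is a minimal sketch (not DIGITS or NVcaffe code) of asking cuDNN for the fastest forward-convolution algorithm that fits a caller-supplied workspace budget. The tensor/filter shapes and the 64 MB limit are illustrative, the descriptor-setter signatures are the cuDNN v6/v7 forms, and `cudnnGetConvolutionForwardAlgorithm` was later removed in cuDNN 8.

```cpp
// Sketch: pick the fastest forward conv algorithm within a workspace budget.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Input batch: N=32, C=3, H=W=224 (illustrative shapes).
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               32, 3, 224, 224);

    // Filters: 64 output channels, 3x3 kernels.
    cudnnFilterDescriptor_t wDesc;
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               64, 3, 3, 3);

    // 3x3 convolution, pad 1, stride 1, no dilation.
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Output shape implied by the input and convolution descriptors.
    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    // "Use the fastest algorithm whose workspace fits in 64 MB."
    const size_t workspaceLimit = 64 << 20;
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
                                        workspaceLimit, &algo);

    // Report how much workspace the chosen algorithm actually needs.
    size_t workspaceBytes = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
                                            algo, &workspaceBytes);
    std::printf("algorithm %d needs %zu bytes of workspace\n",
                (int)algo, workspaceBytes);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```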
@lukeyeager
Member Author

This was handled in NVcaffe v0.13.1 with the CNMeM memory manager. Now Caffe automatically uses ~90% of the available GPU memory.

NVIDIA/caffe#12
https://github.com/NVIDIA/caffe/releases/tag/v0.13.1
https://github.com/NVIDIA/cnmem

It's not a complete solution to the issue reported above - sometimes you still need to lower your batch size manually - but it solves most of the problem.
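
For reference, a rough sketch (an assumption about the setup, not the actual NVcaffe integration) of how a CNMeM pool gets used: reserve a large slice of the currently free memory on a device at startup, then serve allocations from that pool instead of calling cudaMalloc per blob. The function names come from https://github.com/NVIDIA/cnmem; the 90% fraction mirrors the behaviour described above.

```cpp
// Sketch: initialize a CNMeM pool with ~90% of the free memory on device 0.
#include <cnmem.h>
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

int main() {
    cudaSetDevice(0);
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // Describe the pool: device 0, ~90% of what is currently free.
    cnmemDevice_t device;
    std::memset(&device, 0, sizeof(device));
    device.device = 0;
    device.size = static_cast<size_t>(freeBytes * 0.9);

    cnmemInit(1, &device, 0);  // 0 = default flags

    // Allocations are now served from the pre-reserved pool, not cudaMalloc.
    void* buffer = nullptr;
    cnmemMalloc(&buffer, 256 << 20, /*stream=*/0);  // 256 MB from the pool
    std::printf("pool size: %zu bytes\n", device.size);
    cnmemFree(buffer, /*stream=*/0);

    cnmemFinalize();
    return 0;
}
```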

@shaibagon

@lukeyeager thank you for handling this annoying issue. I came across this CUDA memory issue recently and noticed that caffe in fact allocates memory for BOTH the train and test nets. Is there a way to "swap" these nets in and out of GPU memory? That is, during training, only allocate and work on the training net; then, when starting a test phase, swap the training net out of the GPU and allocate the test net.

@lukeyeager
Member Author

@shaibagon that seems like a good idea to me, but I may be unaware of some limitation that makes it hard/impossible. Either way, the DIGITS team wouldn't be involved in making that sort of a change, and we probably wouldn't do it in NVcaffe either. This would probably need to be changed in BVLC/caffe and then we'd pull it into NVcaffe for our next release at some point.

@beniz

beniz commented Feb 16, 2016

FTR, one way to slightly diminish the test net's memory load is to set its batch_size to 1 and to set test_iter to at least the number of test samples so the whole test set is still covered (see the prototxt sketch below). Very inconvenient and it doesn't solve the issue, but this is a trick I use in my own version of Caffe.

Swapping would indeed be great, and could also be done without too much difficulty by dealing directly with the prototxt and loading only the portions needed for the current phase.
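
For anyone wanting to try the batch-size trick above, a sketch of the relevant pieces; the LMDB path is a placeholder and test_iter: 10000 assumes a 10,000-image test set:

```
# net.prototxt: shrink the TEST-phase data layer to one image per batch
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  data_param {
    source: "examples/mnist/mnist_test_lmdb"  # placeholder path
    backend: LMDB
    batch_size: 1   # one image at a time keeps the test net's footprint minimal
  }
}

# solver.prototxt: with batch_size 1, test_iter must match the number of
# test samples for the test phase to still cover the whole test set
test_iter: 10000
test_interval: 1000
```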

@shaibagon

@lukeyeager - thank you for your reply.
@beniz - setting the test batch size to one does help, but at a very high cost: it slows down the test phase. Moreover, for large models, even with a test batch size of one, allocating GPU memory for BOTH the train and test nets still takes a lot of space.

@homah

homah commented Dec 2, 2016

hi :)
I changed the batch size in the optional settings when creating the model. Before, it couldn't create the model; after changing the batch size to 1, it could, without a memory error. But when I test the model and select an image to classify, every class shows 'NAN%' instead of a percentage.
[screenshot from 2016-12-02 16 53 52]
I see that with both LeNet and AlexNet.

Please help me!
How can I create a model without any problems?

@lukeyeager
Member Author

@homah what does your question have to do with this thread? Also, we would prefer for you to use our user group for questions, and to use GitHub only to report bugs and feature requests. This is clearly explained in our README:
https://github.com/NVIDIA/DIGITS/blob/digits-5.0/README.md#get-help
