
Handle GPU memory management #20

Closed
lukeyeager opened this issue Mar 23, 2015 · 7 comments
Comments

@lukeyeager
Member

See #18.

DIGITS should handle GPU memory allocation for the user automatically. This could be done in a few ways:

  1. Calculate how much memory will be required before the training starts and adjust the batch size automatically. This may not be possible - I'd have to dig into the caffe code to figure out whether they even know before running. There is a "Memory required for data" line in caffe's output, but it seems totally unrelated to the amount of memory actually used on the GPU.
  2. Detect out-of-memory failures and automatically scale down the batch size and re-run the job until it fits in memory. This could take a while - even on a fast machine, the VGG network takes a few minutes before taking up its maximum amount of memory on the GPU. So, if we have to wait for caffe to fail 2 or 3 times before getting it right, this could be a major time suck.
  3. Let cuDNN handle the memory management for us automatically. In section 3.11 of the cuDNN user doc, there are options for specifying how to choose the convolution algorithm. The CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT option could be used to say "use the fastest algorithm that fits within the specified memory budget" (see the sketch after this list). This would be a change to caffe, not to DIGITS, and it's not a complete solution since many people will be using caffe without cuDNN and maybe even without CUDA.
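
To make option 3 concrete, here is a minimal sketch (not DIGITS or NVcaffe code) of asking cuDNN for the fastest forward-convolution algorithm that fits a caller-supplied workspace budget. The tensor/filter shapes and the 64 MB limit are illustrative, the descriptor-setter signatures are the cuDNN v6/v7 forms, and `cudnnGetConvolutionForwardAlgorithm` was later removed in cuDNN 8.

```cpp
// Sketch: pick the fastest forward conv algorithm within a workspace budget.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Input batch: N=32, C=3, H=W=224 (illustrative shapes).
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               32, 3, 224, 224);

    // Filters: 64 output channels, 3x3 kernels.
    cudnnFilterDescriptor_t wDesc;
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               64, 3, 3, 3);

    // 3x3 convolution, pad 1, stride 1, no dilation.
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Output shape implied by the input and convolution descriptors.
    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    // "Use the fastest algorithm whose workspace fits in 64 MB."
    const size_t workspaceLimit = 64 << 20;
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
                                        workspaceLimit, &algo);

    // Report how much workspace the chosen algorithm actually needs.
    size_t workspaceBytes = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
                                            algo, &workspaceBytes);
    std::printf("algorithm %d needs %zu bytes of workspace\n",
                (int)algo, workspaceBytes);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```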
@lukeyeager
Member Author

This was handled in NVcaffe v0.13.1 with the CNMeM memory manager. Now Caffe automatically uses ~90% of the available GPU memory.

NVIDIA/caffe#12
https://github.com/NVIDIA/caffe/releases/tag/v0.13.1
https://github.com/NVIDIA/cnmem

It's not a complete solution to the issue reported above - sometimes you still need to lower your batch size manually - but it solves most of the problem.
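
For reference, a rough sketch (an assumption about the setup, not the actual NVcaffe integration) of how a CNMeM pool gets used: reserve a large slice of the currently free memory on a device at startup, then serve allocations from that pool instead of calling cudaMalloc per blob. The function names come from https://github.com/NVIDIA/cnmem; the 90% fraction mirrors the behaviour described above.

```cpp
// Sketch: initialize a CNMeM pool with ~90% of the free memory on device 0.
#include <cnmem.h>
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

int main() {
    cudaSetDevice(0);
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // Describe the pool: device 0, ~90% of what is currently free.
    cnmemDevice_t device;
    std::memset(&device, 0, sizeof(device));
    device.device = 0;
    device.size = static_cast<size_t>(freeBytes * 0.9);

    cnmemInit(1, &device, 0);  // 0 = default flags

    // Allocations are now served from the pre-reserved pool, not cudaMalloc.
    void* buffer = nullptr;
    cnmemMalloc(&buffer, 256 << 20, /*stream=*/0);  // 256 MB from the pool
    std::printf("pool size: %zu bytes\n", device.size);
    cnmemFree(buffer, /*stream=*/0);

    cnmemFinalize();
    return 0;
}
```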

@shaibagon

@lukeyeager thank you for handling this annoying issue. I came across this CUDA memory issue recently and noticed that caffe in fact allocates memory for BOTH the train and test nets. Is there a way to "swap" these nets in and out of GPU memory? That is, during training, only allocate and work on the training net; then, when starting a test phase, swap the training net out of the GPU and allocate the test net.

@lukeyeager
Member Author

@shaibagon that seems like a good idea to me, but I may be unaware of some limitation that makes it hard/impossible. Either way, the DIGITS team wouldn't be involved in making that sort of a change, and we probably wouldn't do it in NVcaffe either. This would probably need to be changed in BVLC/caffe and then we'd pull it into NVcaffe for our next release at some point.

@beniz

beniz commented Feb 16, 2016

FTR, one way to slightly diminish the test net's memory load is to set its batch_size to 1 and to set test_iter to at least the number of test samples so the whole test set is still covered (see the prototxt sketch below). Very inconvenient and it doesn't solve the issue, but this is a trick I use in my own version of Caffe.

Swapping would indeed be great, and could also be done without too much difficulty by dealing directly with the prototxt and loading only the portions needed for the current phase.
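
For anyone wanting to try the batch-size trick above, a sketch of the relevant pieces; the LMDB path is a placeholder and test_iter: 10000 assumes a 10,000-image test set:

```
# net.prototxt: shrink the TEST-phase data layer to one image per batch
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  data_param {
    source: "examples/mnist/mnist_test_lmdb"  # placeholder path
    backend: LMDB
    batch_size: 1   # one image at a time keeps the test net's footprint minimal
  }
}

# solver.prototxt: with batch_size 1, test_iter must match the number of
# test samples for the test phase to still cover the whole test set
test_iter: 10000
test_interval: 1000
```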

@shaibagon

@lukeyeager - thank you for your reply.
@beniz - setting the test batch size to one does help, but at a very high cost: it slows down the test phase. Moreover, for large models, even with a test batch size of one, allocating GPU memory for BOTH the train and test nets still takes a lot of space.

@homah

homah commented Dec 2, 2016

hi :)
I changed the batch size in the optional settings when creating the model. Before, it couldn't create the model; after changing the batch size to 1, it could, without a memory error. But when I test the model and select an image to classify, every class shows 'NAN%' instead of a percentage.
[screenshot from 2016-12-02 16 53 52]
I see that with both LeNet and AlexNet.

Please help me!
How can I create a model without any problems?

@lukeyeager
Member Author

@homah what does your question have to do with this thread? Also, we would prefer for you to use our user group for questions, and to use GitHub only to report bugs and feature requests. This is clearly explained in our README:
https://github.com/NVIDIA/DIGITS/blob/digits-5.0/README.md#get-help
