
make CaffeFreeHost thread safe #3640

Open
anfeng wants to merge 4 commits into BVLC:master from anfeng:master

Conversation

anfeng commented Feb 5, 2016

The current implementation of CaffeFreeHost() assumes that it is invoked by the same thread that called CaffeMallocHost(). This PR removes that constraint. This is useful, for example, when a Caffe blob is released by the JVM GC.

longjon (Contributor) commented Mar 2, 2016

Is there a reason why we don't just call cudaHostAlloc(..., cudaHostAllocPortable)? This current solution is a bit messy, both because the caller has to store the device, and because the caller may be in CPU mode for which the device ID does not make sense...

(By the way, this doesn't really have anything to do with thread safety; it's purely a CUDA multi-device API issue.)

anfeng (Author) commented Mar 4, 2016

@longjon I did try cudaHostAlloc with cudaHostAllocPortable, and got the following error:
F0304 20:53:56.164652 22247 syncedmem.hpp:31] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal

The thread invoking ~SyncedMemory() is the JVM GC thread. It has no clue which GPU device to use if we don't remember the GPU device that allocated the pinned memory. We should not set it to an arbitrary device (say 0) either, since the GPU node may have configured devices to be thread-exclusive. Therefore, such a thread simply cannot set up a CUDA context, and cudaFreeHost() will fail.

longjon (Contributor) commented Mar 5, 2016

Thanks for the explanation, that makes sense. Given that CaffeMallocHost already has a use_cuda output argument, and given that it's internal to SyncedMemory, I guess we can go with this interface. (It feels a little off to me to have a malloc wrapper returning more things than an allocated pointer, but that's the interface we already have. Maybe that logic should be pulled into SyncedMemory, but we don't need to do that here.)

However the allocation logic here, though it may work as used, does not seem to make sense. Note that gpu_device_ is used both for host allocations and device allocations. It's perfectly possible (and not unreasonable) for the caller to ask for host memory, then switch devices and ask for device memory (or, less reasonably, vice versa). If my reading is correct, this code breaks in that case.

So it seems to me the device should be captured exactly once, and we should decide whether that should be at construction time or first allocation time, and we should make sure all future CUDA operations use that fixed device/context.

Re: the second point, I'm not sure without checking things in more detail whether it suffices to fix the device at construction time, although if so that might be the simplest semantics. Note that currently the device is only captured at (lazy) device allocation time. If we also allow it to be captured at host allocation time, this could cause unexpected behavior where we allocate on the host, then switch devices, then perform an initial device allocation on a non-current device!

So I'd suggest thinking about which semantics makes the most sense, and updating this with a patch that won't break even though the caller is allowed to change the device at any time. (Or let me know if I've misread and that is already the case.) You can force push to this PR (and please squash/clean up history before merge).

anfeng (Author) commented Mar 6, 2016

Let's introduce a separate alloc_device_ to record the GPU device used for allocating cpu_ptr_, and keep gpu_device_ solely for the device used for allocating gpu_ptr_. This adds minimal memory overhead but is much more reliable across the various use cases.

anfeng (Author) commented Mar 6, 2016

@longjon please review the revised implementation.

yzk0281 commented Apr 6, 2016

Hi,
I use libcaffe inside an nginx server to extract features from images.
When I run libcaffe in CPU-ONLY mode, it works fine; but when I switch to GPU mode, an error occurs in this CaffeFreeHost function:
Check failed: error == cudaSuccess (3 vs. 0) initialization error.
But when I run caffe alone in GPU mode, it works fine.
I think it is some kind of memory error, but I don't know how to solve it.
I tried to modify src/caffe/syncedmem.cpp and include/caffe/syncedmem.hpp as anfeng suggests, but I still have the same problem.
Does anybody have suggestions?

seanbell commented
@yzk0281 I don't think this issue is related to your problem; you probably want to find another issue or create a new one. I believe that you cannot change the CPU/GPU mode once you have created a net, but someone else should confirm that.

