
make CaffeFreeHost thread safe #3640

Open
anfeng wants to merge 4 commits into BVLC:master from anfeng:master

Conversation

anfeng commented Feb 5, 2016

The current implementation of CaffeFreeHost() assumes that it is invoked by the same thread that called CaffeMallocHost(). This PR removes that constraint. This is useful, for example, when a Caffe blob is released by the JVM GC.

longjon (Contributor) commented Mar 2, 2016

Is there a reason why we don't just call cudaHostAlloc(..., cudaHostAllocPortable)? This current solution is a bit messy, both because the caller has to store the device, and because the caller may be in CPU mode for which the device ID does not make sense...

(By the way, this doesn't really have anything to do with thread safety; it's purely a CUDA multi-device API issue.)

anfeng (Author) commented Mar 4, 2016

@longjon I did try cudaHostAlloc with cudaHostAllocPortable, and got the following error:
F0304 20:53:56.164652 22247 syncedmem.hpp:31] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal

The thread invoking ~SyncedMemory() is the JVM GC thread. It has no clue which GPU device to use if we don't remember the GPU device that allocated the pinned memory. We should not set it to an arbitrary device (say 0) either, since the GPU node may have configured devices to be thread-exclusive. Therefore, such a thread simply cannot set up a CUDA context, and cudaFreeHost() will fail.

longjon (Contributor) commented Mar 5, 2016

Thanks for the explanation, that makes sense. Given that CaffeMallocHost already has a use_cuda output argument, and given that it's internal to SyncedMemory, I guess we can go with this interface. (It feels a little off to me to have a malloc wrapper returning more things than an allocated pointer, but that's the interface we already have. Maybe that logic should be pulled into SyncedMemory, but we don't need to do that here.)

However the allocation logic here, though it may work as used, does not seem to make sense. Note that gpu_device_ is used both for host allocations and device allocations. It's perfectly possible (and not unreasonable) for the caller to ask for host memory, then switch devices and ask for device memory (or, less reasonably, vice versa). If my reading is correct, this code breaks in that case.

So it seems to me the device should be captured exactly once, and we should decide whether that should be at construction time or first allocation time, and we should make sure all future CUDA operations use that fixed device/context.

Re: the second point, I'm not sure without checking things in more detail whether it suffices to fix the device at construction time, although if so that might be the simplest semantics. Note that currently the device is only captured at (lazy) device allocation time. If we also allow it to be captured at host allocation time, this could cause unexpected behavior where we allocate on the host, then switch devices, then perform an initial device allocation on a non-current device!

So I'd suggest thinking about which semantics makes the most sense, and updating this with a patch that won't break even though the caller is allowed to change the device at any time. (Or let me know if I've misread and that is already the case.) You can force push to this PR (and please squash/clean up history before merge).

anfeng (Author) commented Mar 6, 2016

Let's introduce a separate alloc_device_ to record the GPU device used for allocating cpu_ptr_, and keep gpu_device_ solely for the device used for allocating gpu_ptr_. This adds minimal memory overhead but is much more reliable across the various use cases.

anfeng (Author) commented Mar 6, 2016

@longjon please review the revised implementation.

yzk0281 commented Apr 6, 2016

Hi,
I use libcaffe inside an nginx server to extract features from images.
When I run libcaffe in CPU-ONLY mode, it works fine; but when I switch to GPU mode, an error occurs in this CaffeFreeHost function:
Check failed: error == cudaSuccess (3 vs. 0) initialization error.
But when I run caffe alone in GPU mode, it works fine.
I think it is some kind of memory error, but I don't know how to solve it.
I tried to modify src/caffe/syncedmem.cpp and include/caffe/syncedmem.hpp as anfeng suggests, but I still have the same problem.
Does anybody have suggestions?

seanbell commented
@yzk0281 I don't think this issue is related to your problem; you probably want to find another issue or create a new one. I believe that you cannot change the CPU/GPU mode once you have created a net, but someone else should confirm that.

