|
Is there a reason why we don't just call (By the way, this doesn't really have anything to do with thread safety; it's purely a CUDA multi-device API issue.) |
|
@longjon I did try with cudaHostAlloc w/ cudaHostAllocPortable, and get the following error: The thread invoking ~SyncedMemory() is a JVM GC thread. It has no clue which GPU device to use if we don't remember the GPU device that allocated the pinned memory. We should not set it to an arbitrary device (say 0) either, since the GPU node may have set the device to be thread exclusive. Therefore, such a thread simply could not set up a CUDA context, and cudaFreeHost() will fail. |
|
Thanks for the explanation, that makes sense. Given that However the allocation logic here, though it may work as used, does not seem to make sense. Note that So it seems to me the device should be captured exactly once, and we should decide whether that should be at construction time or first allocation time, and we should make sure all future CUDA operations use that fixed device/context. Re: the second point, I'm not sure without checking things in more detail whether it suffices to fix the device at construction time, although if so that might be the simplest semantics. Note that currently the device is only captured at (lazy) device allocation time. If we also allow it to be captured at host allocation time, this could cause unexpected behavior where we allocate on the host, then switch devices, then perform an initial device allocation on a non-current device! So I'd suggest thinking about which semantics makes the most sense, and updating this with a patch that won't break even though the caller is allowed to change the device at any time. (Or let me know if I've misread and that should currently be the case.) You can force push to this PR (and please squash/clean up history before merge). |
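
For illustration only, the "capture the device exactly once and pin every later operation to it" semantics could look roughly like the sketch below. It is not code from this PR: `fake_cuda`, `ScopedDevice` and `PinnedToDevice` are hypothetical names, and the `fake_cuda` functions are stand-ins for `cudaGetDevice`/`cudaSetDevice` so that the sketch builds without a GPU. It assumes the device is fixed at construction time, which is only one of the two options discussed above.

```cpp
#include <cassert>

// Stand-ins for cudaGetDevice/cudaSetDevice (assumption: no real GPU needed).
// As in CUDA, the current device is per-thread state.
namespace fake_cuda {
thread_local int current_device = 0;
inline void GetDevice(int* device) { *device = current_device; }
inline void SetDevice(int device) { current_device = device; }
}  // namespace fake_cuda

// Hypothetical RAII guard: switches to a fixed device for the lifetime of
// the guard, then restores whatever device the caller had selected.
class ScopedDevice {
 public:
  explicit ScopedDevice(int device) {
    fake_cuda::GetDevice(&previous_);
    if (previous_ != device) fake_cuda::SetDevice(device);
  }
  ~ScopedDevice() { fake_cuda::SetDevice(previous_); }
  ScopedDevice(const ScopedDevice&) = delete;
  ScopedDevice& operator=(const ScopedDevice&) = delete;

 private:
  int previous_ = 0;
};

// Hypothetical buffer that captures the device once, at construction time.
class PinnedToDevice {
 public:
  PinnedToDevice() { fake_cuda::GetDevice(&device_); }
  int device() const { return device_; }

  // Runs an "operation" under the captured device and reports which device
  // was current while it ran. The caller's device is restored on return.
  int RunOnOwnDevice() const {
    ScopedDevice guard(device_);
    int active = -1;
    fake_cuda::GetDevice(&active);
    return active;
  }

 private:
  int device_ = 0;
};
```

With a guard like this, the caller can change the current device at any time without affecting which device the buffer's own operations run on, and without the buffer silently changing the caller's device.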
|
Let's introduce a separate alloc_device_ to record the GPU device used for allocating cpu_ptr_, and have gpu_device_ only for the device used for allocating gpu_ptr_. This introduces minimal memory overhead, but is much more reliable for dealing with various use cases. |
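
A minimal sketch of the two-field idea, for readers following along. This is not the PR's implementation: `HostDeviceBuffer` and the `fake_cuda` helpers are hypothetical, and the helpers merely stand in for `cudaGetDevice`, `cudaSetDevice`, `cudaMallocHost` and `cudaFreeHost` so the sketch builds without a GPU. Only the names `alloc_device_`, `gpu_device_`, `cpu_ptr_` and `gpu_ptr_` come from the comment above; here the device allocation is reduced to recording the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-ins for the CUDA runtime (assumption: no real GPU needed).
namespace fake_cuda {
thread_local int current_device = 0;  // per-thread, as in CUDA
int last_free_device = -1;            // records where FreeHost was called
inline int GetDevice() { return current_device; }
inline void SetDevice(int device) { current_device = device; }
inline void* MallocHost(std::size_t size) { return std::malloc(size); }
inline void FreeHost(void* ptr) {
  last_free_device = current_device;
  std::free(ptr);
}
}  // namespace fake_cuda

// Hypothetical buffer that remembers two devices independently:
//   alloc_device_: device current when the pinned host memory was allocated
//   gpu_device_:   device current when the device memory was first allocated
class HostDeviceBuffer {
 public:
  explicit HostDeviceBuffer(std::size_t size) : size_(size) {}

  ~HostDeviceBuffer() {
    if (cpu_ptr_ == nullptr) return;
    // The destructor may run on a thread that never selected a device
    // (e.g. a GC thread), so switch to the recorded device, free, and
    // restore that thread's previous device.
    const int caller_device = fake_cuda::GetDevice();
    fake_cuda::SetDevice(alloc_device_);
    fake_cuda::FreeHost(cpu_ptr_);
    fake_cuda::SetDevice(caller_device);
  }

  // Lazily allocates host memory and records the device exactly once.
  void* mutable_cpu_data() {
    if (cpu_ptr_ == nullptr) {
      alloc_device_ = fake_cuda::GetDevice();
      cpu_ptr_ = fake_cuda::MallocHost(size_);
    }
    return cpu_ptr_;
  }

  // Stands in for the lazy device allocation; only the device is recorded.
  void allocate_gpu() {
    if (!gpu_allocated_) {
      gpu_device_ = fake_cuda::GetDevice();
      gpu_allocated_ = true;
    }
  }

  int alloc_device() const { return alloc_device_; }
  int gpu_device() const { return gpu_device_; }

 private:
  std::size_t size_;
  void* cpu_ptr_ = nullptr;
  bool gpu_allocated_ = false;
  int alloc_device_ = -1;
  int gpu_device_ = -1;
};
```

Because the two fields are set independently, a host allocation on one device followed by a device allocation on another leaves both records intact, and the host free always happens under the device that performed the host allocation.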
|
@longjon please review the revised implementation. |
|
Hi, |
|
@yzk0281 I don't think this issue is related to your problem; you probably want to find another issue or create a new one. I believe that you cannot change the CPU/GPU mode once you have created a net, but someone else should confirm that. |
The current implementation of CaffeFreeHost() assumes that it is invoked by the same thread that called CaffeMallocHost(). This PR removes that constraint. This is useful, for example, when a Caffe blob is released by the JVM GC.