Add pinned memory #9216
Conversation
Force-pushed from c4d0515 to e0156c1
Force-pushed from 999ea6c to ef027b3
paddle/fluid/framework/tensor.h
Outdated
@@ -45,10 +45,11 @@ class Tensor {
   friend struct EigenVector;

  public:
-  Tensor() : offset_(0) {}
+  Tensor() : offset_(0), use_pinned_(false) {}
Curious why we need to add use_pinned_
to the tensor. If we only want to use pinned memory for copying, wouldn't putting this in the CopyTensor
implementation be enough?
just put this in the CopyTensor implement
Do you mean, for CPU->GPU, to copy the data from CPU to pinned memory first and then copy from pinned memory to GPU?
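The staged CPU->GPU path being asked about can be modeled in plain C++ (no CUDA calls). In this hypothetical sketch all three buffers are ordinary host memory and memcpy stands in for the real transfers; in a CUDA build the second copy would be a cudaMemcpyAsync from the pinned staging buffer to device memory, and the function/parameter names are illustrative only.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Model of the staged copy: pageable host -> pinned staging -> device.
// memcpy stands in for both hops; only the second one would be a
// cudaMemcpyAsync (pinned -> device) in a real CUDA build.
void StagedCopyToDevice(const char* src, char* pinned_staging,
                        char* device_dst, size_t n) {
  std::memcpy(pinned_staging, src, n);         // pageable -> pinned
  std::memcpy(device_dst, pinned_staging, n);  // pinned -> "device"
}
```

The point of the staging buffer is that the DMA engine can only transfer asynchronously out of page-locked memory, so pageable data must be moved there first.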
No, of course not. I was thinking about the GPU->CPU copy. For the CPU->GPU case, this seems to be the only way for now.
Why do we use the pinned memory only as a staging area instead of using it directly? CPU computations can access data in pinned memory directly.
In that case, can we make use_pinned_
a global GFLAG, so that all host allocations use pinned memory? Then we could test the overall performance boost.
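The global-flag idea could be sketched as below. This is a hypothetical model: the flag name (FLAGS_use_pinned_memory) is illustrative of a GFLAGS-style switch, and malloc stands in for both allocators; a real CUDA build would call cudaMallocHost when the flag is set.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical process-wide switch deciding whether host allocations
// are page-locked. In Paddle this could be a GFLAGS bool; the name
// here is illustrative only.
static bool FLAGS_use_pinned_memory = false;

// Stand-in host allocator: both branches use malloc here, but the
// first would be cudaMallocHost (pinned) in a CUDA build.
void* AllocHost(size_t size) {
  if (FLAGS_use_pinned_memory) {
    return std::malloc(size);  // placeholder for cudaMallocHost
  }
  return std::malloc(size);    // ordinary pageable allocation
}
```

A single flag like this would let the pinned-allocation path be toggled at process startup without touching every Alloc call site.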
Because memory copies between the GPU and CPU are asynchronous, in the GPU->CPU case we must ensure the copy has completed before reading the data from pinned memory, so a sync operation is needed. Using pinned memory directly is therefore a little complex.
I plan to put only the input data into pinned memory and test the overall performance.
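The sync requirement described above can be modeled without CUDA: the copy into pinned memory runs asynchronously, so the host must wait before reading the destination. In this sketch std::async stands in for cudaMemcpyAsync and future.wait() for cudaStreamSynchronize; the names are illustrative and no real CUDA calls are made.

```cpp
#include <cassert>
#include <cstring>
#include <future>

// Model of an async GPU->CPU copy into pinned host memory. The
// returned future plays the role of the CUDA stream: the caller must
// wait on it (cudaStreamSynchronize) before reading pinned_dst.
std::future<void> AsyncCopyToPinned(const char* device_src,
                                    char* pinned_dst, size_t n) {
  return std::async(std::launch::async,
                    [=] { std::memcpy(pinned_dst, device_src, n); });
}
```

Reading pinned_dst before the wait would be a race, which is exactly the extra complexity the comment refers to.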
Is this still WIP, or can I review and merge it?
Force-pushed from ef027b3 to eaa90d3
paddle/fluid/memory/memory.cc
Outdated
void* Alloc<platform::CUDAPlace>(platform::CUDAPlace place, size_t size,
                                 bool use_pinned) {
  void* ptr;
  if (use_pinned) {
Alloc on CUDAPlace
with use_pinned=false
returns a device pointer, but with use_pinned=true
it returns a host pointer. This is a little confusing.
Yes, maybe we should add a new place (CUDAPinnedPlace).
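The CUDAPinnedPlace suggestion could look like the sketch below: each kind of memory gets its own place type, so the location of the pointer Alloc returns is determined by the place alone rather than by a use_pinned flag. The enum and case names here are illustrative, and malloc stands in for the real allocators (cudaMalloc for the device case, cudaMallocHost for the pinned case).

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical place taxonomy: pinned host memory gets its own place
// instead of a boolean flag on the CUDA place.
enum class PlaceKind { kCPU, kCUDA, kCUDAPinned };

// Stand-in allocator: malloc replaces the real per-place allocators,
// which are noted in the comments.
void* Alloc(PlaceKind place, size_t size) {
  switch (place) {
    case PlaceKind::kCPU:        return std::malloc(size);  // pageable host
    case PlaceKind::kCUDA:       return std::malloc(size);  // cudaMalloc
    case PlaceKind::kCUDAPinned: return std::malloc(size);  // cudaMallocHost
  }
  return nullptr;
}
```

With a dedicated place, callers can tell from the type system alone whether a pointer is device- or host-resident, which removes the ambiguity the reviewer pointed out.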
paddle/fluid/framework/tensor_impl.h
Outdated
  } else if (platform::is_gpu_place(place)) {
#ifndef PADDLE_WITH_CUDA
    PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
  }
#else
    holder_.reset(new PlaceholderImpl<platform::CUDAPlace>(
-       boost::get<platform::CUDAPlace>(place), size, type));
+       boost::get<platform::CUDAPlace>(place), size, type, use_pinned));
The holder_ pointer will be on the host here, which may cause errors.
The host memory is page-locked and directly accessible to the device.
Reference: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gab84100ae1fa1b12eaca660207ef585b
Thank you for this information!
… feature/add_pinned_memory
Force-pushed from ba1178e to 9e99446
LGTM++, we can add the new place type in next PR.
WIP
The current CUDA Runtime documentation states, under Asynchronous (Memcpy):
- For transfers from device memory to pageable host memory, the function will return only once the copy has completed.