
Add CUDAPinnedPlace #9380

Merged

Conversation

chengduoZH
Contributor

@chengduoZH chengduoZH commented Mar 26, 2018

fix #8728
related PR: #9216

CUDA pinned memory differs from both CPU memory and GPU memory: physically it lives on the CPU side, but it can be accessed by both the CPU and the GPU. In the last PR I added an argument, is_pinned_, to Tensor, but if a user is careless, a pinned-memory tensor could end up involved in the model computation, which would slow down training and be very difficult to track down. So, after talking with @typhoonzero, I added the CUDAPinnedPlace.

Note: Currently, pinned memory is only used for memory copying.

This PR's work:

  1. add CUDAPinnedPlace
  2. add CUDAPinnedContext, for compatibility with CPUPlace and CUDAPlace
  3. add Copy cases (CUDAPinnedPlace -> CUDAPlace, CUDAPlace -> CUDAPinnedPlace, CUDAPinnedPlace -> CPUPlace, CPUPlace -> CUDAPinnedPlace, CUDAPinnedPlace -> CUDAPinnedPlace); a usage sketch follows this list
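
For illustration only (not part of the diff), a minimal sketch of how the new copy paths could be used for a staged host-to-device transfer. The memory::Alloc/memory::Copy names follow the existing Fluid API; the trailing stream argument on the device-side copy is an assumption based on the existing CUDAPlace overloads.

```cpp
#include <cuda_runtime.h>

#include "paddle/fluid/memory/memcpy.h"
#include "paddle/fluid/memory/memory.h"
#include "paddle/fluid/platform/place.h"

namespace memory = paddle::memory;
namespace platform = paddle::platform;

// Stage pageable host data in pinned memory, then copy it to the GPU.
void StagedH2DCopy(const float* cpu_src, size_t n, float* gpu_dst,
                   cudaStream_t stream) {
  platform::CPUPlace cpu;
  platform::CUDAPinnedPlace pinned;
  platform::CUDAPlace gpu(0);

  // CPUPlace -> CUDAPinnedPlace: stage the data in page-locked memory.
  auto* staging =
      static_cast<float*>(memory::Alloc(pinned, n * sizeof(float)));
  memory::Copy(pinned, staging, cpu, cpu_src, n * sizeof(float));

  // CUDAPinnedPlace -> CUDAPlace: pinned memory makes this transfer
  // genuinely asynchronous with respect to the host.
  memory::Copy(gpu, gpu_dst, pinned, staging, n * sizeof(float), stream);
}
```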

@chengduoZH chengduoZH force-pushed the feature/add_CUDAPinnedPlace branch 2 times, most recently from 702ada3 to b302feb Compare March 26, 2018 13:01
@chengduoZH chengduoZH changed the title from [WIP] Add CUDAPinnedPlace to Add CUDAPinnedPlace Mar 27, 2018

size_t usable = paddle::platform::GpuMaxAllocSize() - fallback_alloc_size_;
size_t usable =
paddle::platform::CUDAPinnedMaxAllocSize() - cuda_pinnd_alloc_size_;
Contributor

It seems the default pinned-memory max size is determined by system settings; you can run ulimit -l to check the current locked-memory limit. I took a look at our current machines, and the default value seems very small (64KB). To increase this setting, you need to run ulimit -l [new size]. It's probably better to add a document describing this.

Contributor Author

Thanks for your review!
Searching around the internet, I found that:

  • ulimit -l does affect the amount of memory we can memlock(), but it does not affect cudaMallocHost(), because the CUDA pinning allocator doesn't use memlock.
  • under the hood, the pinned allocator in CUDA uses mmap() with MAP_FIXED. The experiment is here.
  • So, theoretically, the max size of pinned memory can be as large as physical memory.
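
Not from the PR, just a minimal standalone experiment (compile with nvcc) showing that the locked-memory limit reported by ulimit -l does not cap cudaMallocHost():

```cpp
#include <sys/resource.h>

#include <cstdio>

#include <cuda_runtime.h>

int main() {
  rlimit lim{};
  getrlimit(RLIMIT_MEMLOCK, &lim);
  std::printf("RLIMIT_MEMLOCK (ulimit -l): %llu bytes\n",
              static_cast<unsigned long long>(lim.rlim_cur));

  // Request far more pinned memory than the mlock limit (256 MB vs. the
  // 64 KB default). This typically succeeds, because the CUDA driver pins
  // pages without going through memlock().
  void* p = nullptr;
  cudaError_t err = cudaMallocHost(&p, 256UL << 20);
  std::printf("cudaMallocHost(256 MB): %s\n", cudaGetErrorString(err));
  if (err == cudaSuccess) cudaFreeHost(p);
  return 0;
}
```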

Contributor

Thanks for the detailed information, quite useful!

auto* buddy_allocator = GetCUDAPinnedBuddyAllocator();
void* ptr = buddy_allocator->Alloc(size);

// if (ptr == nullptr) {
Contributor

Can remove these comments.

Contributor Author

Done

@@ -27,6 +27,10 @@ DEFINE_double(fraction_of_cpu_memory_to_use, 1,
"Default use 100% of CPU memory for PaddlePaddle,"
"reserve the rest for page tables, etc");

DEFINE_double(fraction_of_cuda_pinned_memory_to_use, 0.5,
"Default use 100% of CPU memory for PaddlePaddle,"
Contributor

The flag description needs to be updated.

Contributor Author

Done
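
For reference, one plausible corrected wording (the exact text committed may differ):

```cpp
DEFINE_double(fraction_of_cuda_pinned_memory_to_use, 0.5,
              "Default use 50% of CPU memory as the pinned memory for "
              "PaddlePaddle, reserve the rest for page tables, etc");
```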

}

size_t CUDAPinnedMaxChunkSize() {
// Allow to allocate the maximum chunk size is roughly 0.39% of CUDA_PINNED
Contributor

Just say it's 1/256 of the total size.

Contributor Author

Done
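
The suggested wording maps to a one-liner; this is an assumed shape of the fix, not the committed diff:

```cpp
size_t CUDAPinnedMaxChunkSize() {
  // Allow a maximum chunk size of 1/256 (about 0.39%) of the total
  // CUDA pinned memory budget.
  return CUDAPinnedMaxAllocSize() / 256;
}
```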

Contributor

@gongweibao gongweibao left a comment

Awesome!

// of host pinned allocation. Allocates too much would reduce
// the amount of memory available to the underlying system for paging.
size_t usable =
Contributor

@gongweibao gongweibao Apr 3, 2018

Is FLAGS_use_pinned_memory at line 58 still useful now?

Contributor Author

Yes

}

TEST(CPUANDCUDAPinned, CPUAllocator) {
test_pinned_memory<paddle::platform::CPUPlace>();
Contributor

Do we need to assert that pinned memory is faster than common memory on K-series GPUs?

Contributor Author

Done
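
For context, a standalone sketch (not the test in this diff) of how such a speed comparison is usually measured with CUDA events; asserting a strict pinned-vs-pageable ratio in CI can be flaky, which is worth keeping in mind:

```cpp
#include <cstdio>
#include <vector>

#include <cuda_runtime.h>

// Time a host-to-device copy of `bytes` from `host_src`, in milliseconds.
static float TimeH2D(void* dev_dst, const void* host_src, size_t bytes) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpy(dev_dst, host_src, bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const size_t bytes = 64UL << 20;  // 64 MB payload
  void* dev = nullptr;
  cudaMalloc(&dev, bytes);

  std::vector<char> pageable(bytes, 1);  // ordinary pageable host memory
  void* pinned = nullptr;
  cudaMallocHost(&pinned, bytes);        // page-locked host memory

  std::printf("pageable: %.2f ms, pinned: %.2f ms\n",
              TimeH2D(dev, pageable.data(), bytes),
              TimeH2D(dev, pinned, bytes));

  cudaFreeHost(pinned);
  cudaFree(dev);
  return 0;
}
```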

size_t gpu_alloc_size_ =
0; // TODO(zcd): how to define the upper limit of CUDAPinnedMemory?
size_t fallback_alloc_size_ = 0;
size_t cuda_pinnd_alloc_size_ = 0;
Contributor

Comments at line 24 should be modified.

Contributor Author

Done, thanks!

@chengduoZH chengduoZH force-pushed the feature/add_CUDAPinnedPlace branch 3 times, most recently from fdd5dfc to 0652737 Compare April 3, 2018 11:49
if (WITH_GPU)
nv_test(pinned_memory_test SRCS pinned_memory_test.cu DEPS place paddle_memory)
endif()
# if (WITH_GPU)
Contributor

Why?

gongweibao
gongweibao previously approved these changes Apr 3, 2018
Contributor

@gongweibao gongweibao left a comment

Just a question.

@chengduoZH chengduoZH force-pushed the feature/add_CUDAPinnedPlace branch 2 times, most recently from f7178f0 to 638a8b4 Compare April 4, 2018 02:25
typhoonzero
typhoonzero previously approved these changes Apr 4, 2018
Contributor

@typhoonzero typhoonzero left a comment

LGTM

#include "paddle/fluid/platform/gpu_info.h"
#include "paddle/fluid/platform/place.h"

// This unit test is an example comparing the performance between using pinned
Contributor

Maybe we should move these benchmark tests to https://github.com/google/benchmark later, to save some CI time; a possible shape is sketched below.
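
Assuming google/benchmark is available; BM_PinnedH2D is a hypothetical name, and a pageable twin could be registered the same way for comparison:

```cpp
#include <benchmark/benchmark.h>

#include <cuda_runtime.h>

// Measure host-to-device copy throughput from pinned host memory.
static void BM_PinnedH2D(benchmark::State& state) {
  const size_t bytes = static_cast<size_t>(state.range(0));
  void *host = nullptr, *dev = nullptr;
  cudaMallocHost(&host, bytes);
  cudaMalloc(&dev, bytes);
  for (auto _ : state) {
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
  }
  state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * bytes);
  cudaFree(dev);
  cudaFreeHost(host);
}
BENCHMARK(BM_PinnedH2D)->Arg(1 << 20)->Arg(64 << 20);
BENCHMARK_MAIN();
```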

@chengduoZH chengduoZH merged commit c14305f into PaddlePaddle:develop Apr 4, 2018