-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use checked allocators in CUB catch2 tests #1271
Conversation
35c1c73
to
a6ba0ba
Compare
2036a23
to
393a950
Compare
These replace the device vector allocator with a custom version that checks the amount of free device memory before calling cudaMalloc. Ref issue NVIDIA#1212.
Several ADL functions for `thrust::detail::vector_base` were defined in the `thrust::` namespace, but should be in `thrust::detail`, otherwise custom aliases / subclasses of `vector_base` outside of the `thrust::` namespace will not find them. The `thrust::host_vector` and `thrust::device_vector` classes would find them by happenstance from pulling `thrust::` namespace functions into the ADL overload set. This commit moves these `vector_base` ADL functions (operator==, operator!=, swap) into the appropriate `thrust::detail::` namespace so they can be found reliably.
Also removed benchmarking code rather than porting, since benchmarks are now handled separately from tests.
393a950
to
3bbd937
Compare
3bbd937
to
bc0c868
Compare
The function-scope static approach resulted in cudaErrorSymbolNotFound.
This WARs another batch of cudaErrorSymbolNotFound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried running batched memcpy test on Jetson Orin Nano Developer Kit (6.3 GB) and some test cases were skipped, which indicates that the new logic works. But then, the test went 800MB into swap, took another 20 minutes and then the application was killed by OS. This is a huge progress, compared to the initial hang, and the code looks much better than what I initially suggested. I'd like to try and get the test passing till the end, though. My intuition is that it'd take checking available device memory when allocating host vectors (at least on Tegra). Since it's a small change in code, I think it might make sense to try it as part of this PR, but if you'd like to experiment in a follow-up PR, please, file an issue.
Interesting, I was unable to repro hangs from host allocs on orin. I'll add that check for integrated systems and we can retry. |
@gevtushenko I added a device mem check for host allocations on integrated systems and cleaned up those API examples. Can you test this on your small Orin board again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batched memcpy and copy executed till completion now
This adds new header-only utilities to the
c2h
testing library:c2h::checked_cuda_allocator<T>
: New allocator that checks free device memory before callingcudaMalloc
and throwsbad_alloc
if(alloc_bytes + 16MiB) > free_bytes
. This avoids issues on Tegra and Windows where over-allocating device memory causes slowdowns or even system hangs.c2h::checked_host_allocator<T>
: New host allocator using new/delete, but checks available device memory prior to allocating memory on systems with integrated host/device memory.c2h::device_policy
: Thrust execution policy that usesc2h::checked_cuda_allocator<char>
for temporary storage allocations.c2h::device_vector<T>
: Device vector that usesc2h::checked_cuda_allocator<T>
.c2h::host_vector<T>
: Host vector that usesc2h::checked_host_allocator<T>
.Description
closes #1212
Checklist