Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use checked allocators in CUB catch2 tests #1271

Merged
merged 30 commits into from
Feb 15, 2024

Conversation

alliepiper
Copy link
Collaborator

@alliepiper alliepiper commented Jan 10, 2024

This adds new header-only utilities to the c2h testing library:

  • c2h::checked_cuda_allocator<T>: New allocator that checks free device memory before calling cudaMalloc and throws bad_alloc if (alloc_bytes + 16MiB) > free_bytes. This avoids issues on Tegra and Windows where over-allocating device memory causes slowdowns or even system hangs.
  • c2h::checked_host_allocator<T>: New host allocator using new/delete, but checks available device memory prior to allocating memory on systems with integrated host/device memory.
  • c2h::device_policy: Thrust execution policy that uses c2h::checked_cuda_allocator<char> for temporary storage allocations.
  • c2h::device_vector<T>: Device vector that uses c2h::checked_cuda_allocator<T>.
  • c2h::host_vector<T>: Host vector that uses c2h::checked_host_allocator<T>.

Description

closes #1212

Checklist

  • New or existing tests cover these changes.

@alliepiper alliepiper requested review from a team as code owners January 10, 2024 20:34
@alliepiper alliepiper marked this pull request as draft January 10, 2024 20:34
@alliepiper alliepiper force-pushed the c2h_checked_allocator branch 3 times, most recently from 35c1c73 to a6ba0ba Compare January 12, 2024 19:29
These replace the device vector allocator with a custom version that
checks the amount of free device memory before calling cudaMalloc.

Ref issue NVIDIA#1212.
Several ADL functions for `thrust::detail::vector_base` were
defined in the `thrust::` namespace, but should be in
`thrust::detail`, otherwise custom aliases / subclasses of
`vector_base` outside of the `thrust::` namespace will not
find them.

The `thrust::host_vector` and `thrust::device_vector` classes
would find them by happenstance from pulling `thrust::` namespace
functions into the ADL overload set.

This commit moves these `vector_base` ADL functions (operator==,
operator!=, swap) into the appropriate `thrust::detail::` namespace so
they can be found reliably.
Also removed benchmarking code rather than porting,
since benchmarks are now handled separately from tests.
The function-scope static approach resulted in cudaErrorSymbolNotFound.
This WARs another batch of cudaErrorSymbolNotFound.
@alliepiper alliepiper changed the title EXPERIMENTAL: Replace thrust vectors with c2h::*vector wrappers in new CUB tests. Replace thrust vectors with c2h::*vector wrappers in new CUB tests. Jan 29, 2024
@alliepiper alliepiper marked this pull request as ready for review January 29, 2024 17:52
@alliepiper alliepiper requested a review from a team as a code owner January 29, 2024 17:52
@alliepiper alliepiper changed the title Replace thrust vectors with c2h::*vector wrappers in new CUB tests. Use checked allocators in CUB catch2 tests Jan 29, 2024
cub/test/test_device_batch_memcpy.cu Outdated Show resolved Hide resolved
cub/test/test_device_batch_copy.cu Outdated Show resolved Hide resolved
cub/test/catch2_test_warp_exchange_smem.cu Show resolved Hide resolved
cub/test/catch2_test_device_radix_sort_custom.cu Outdated Show resolved Hide resolved
Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried running batched memcpy test on Jetson Orin Nano Developer Kit (6.3 GB) and some test cases were skipped, which indicates that the new logic works. But then, the test went 800MB into swap, took another 20 minutes and then the application was killed by OS. This is a huge progress, compared to the initial hang, and the code looks much better than what I initially suggested. I'd like to try and get the test passing till the end, though. My intuition is that it'd take checking available device memory when allocating host vectors (at least on Tegra). Since it's a small change in code, I think it might make sense to try it as part of this PR, but if you'd like to experiment in a follow-up PR, please, file an issue.

cub/test/catch2_test_device_for_api.cu Outdated Show resolved Hide resolved
@alliepiper
Copy link
Collaborator Author

Interesting, I was unable to repro hangs from host allocs on orin. I'll add that check for integrated systems and we can retry.

@alliepiper
Copy link
Collaborator Author

@gevtushenko I added a device mem check for host allocations on integrated systems and cleaned up those API examples. Can you test this on your small Orin board again?

Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batched memcpy and copy executed till completion now

@alliepiper alliepiper merged commit 2fd3b8c into NVIDIA:main Feb 15, 2024
538 checks passed
@alliepiper alliepiper deleted the c2h_checked_allocator branch February 15, 2024 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Define unified strategy for tests with large memory allocations
2 participants