Remove recursion from __internal_is_address_from #7561
Merged
davebayer merged 5 commits into NVIDIA:main on Feb 9, 2026
When running on pre-Hopper GPUs, a call to `__internal_is_address_from(ptr, cluster_shared)` would simply make a recursive call to `__internal_is_address_from(ptr, shared)`. The recursion would stop there; there was no infinite recursion and no large stack usage. But when compiling the GPU code with debug information and no optimization (`nvcc -G`), the recursive call would remain in the PTX, and that would leave either ptxas or nvlink unable to calculate the correct stack size for the kernel. That could result in a failed kernel if the default stack size is too small.

Avoid this problem by removing the recursive call in `__internal_is_address_from`. Instead, move the `case address_space::shared:` code to just after `case address_space::cluster_shared:` and have `case address_space::cluster_shared:` `[[fallthrough]]` to `case address_space::shared:` on pre-Hopper GPUs.

This fixes some stdpar tests when compiled with `nvc++ -g -stdpar` on pre-Hopper GPUs. It should also fix some CUDA applications compiled with `nvcc -G`, though I don't have any real-world examples.
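The change described above can be sketched in plain host C++ as follows. This is a simplified illustration, not the actual libcu++ source: `has_cluster_support`, `in_shared_window`, `in_cluster_window`, `check_same_result`, and both `is_address_from_*` function names are hypothetical stand-ins, and the address ranges are made up so that the example can run on the host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the real enum; only the cases needed here.
enum class address_space { global, shared, cluster_shared };

// Assumption: `false` models a pre-Hopper GPU. The real code selects
// the behavior per target architecture instead of using a constant.
constexpr bool has_cluster_support = false;

// Made-up address windows so that the sketch can run on the host.
inline bool in_shared_window(const void* p) {
  const auto a = reinterpret_cast<std::uintptr_t>(p);
  return a >= 0x1000 && a < 0x2000;
}
inline bool in_cluster_window(const void* p) {
  const auto a = reinterpret_cast<std::uintptr_t>(p);
  return a >= 0x1000 && a < 0x3000;
}

// Before: the cluster_shared case calls the function itself again.
inline bool is_address_from_recursive(const void* p, address_space s) {
  switch (s) {
    case address_space::shared:
      return in_shared_window(p);
    case address_space::cluster_shared:
      if constexpr (has_cluster_support) {
        return in_cluster_window(p);
      }
      // Self-call: terminates after one level, but it is still a cycle
      // in the call graph if the compiler does not inline it away.
      return is_address_from_recursive(p, address_space::shared);
    default:
      return false;
  }
}

// After: the shared case sits directly below cluster_shared, and
// control falls through to it instead of making a call.
inline bool is_address_from_flat(const void* p, address_space s) {
  switch (s) {
    case address_space::cluster_shared:
      if constexpr (has_cluster_support) {
        return in_cluster_window(p);
      }
      [[fallthrough]];
    case address_space::shared:
      return in_shared_window(p);
    default:
      return false;
  }
}

// Helper: both variants are expected to agree for any input.
inline void check_same_result(const void* p, address_space s) {
  assert(is_address_from_recursive(p, s) == is_address_from_flat(p, s));
}
```

Both variants return the same answers. The second one simply has no call edge back to itself, so the call graph stays acyclic even when nothing is inlined.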
Here is an example of the problem:

```cpp
#include <stdio.h>
#include <cuda/__memory/address_space.h>

__global__ void kernel(void* p) {
  // p is the host's argv pointer, so this check is expected to be false.
  if (cuda::device::__internal_is_address_from(p, cuda::device::address_space::cluster_shared)) {
    printf("Wrong answer!\n");
    asm("trap;");
  }
}

int main(int argc, char** argv) {
  kernel<<<32, 32>>>(argv);
  if (cudaDeviceSynchronize() != cudaSuccess) {
    printf("Kernel failed\n");
  }
}
```

The warning indicates that the kernel stack size might be wrong. This test program is small enough that the default stack size isn't an actual problem. But we have seen this warning followed by kernel failures due to stack overflow in larger programs.

I know that this change avoids the warning from ptxas or nvlink. But I can't easily do rigorous regression testing, so I would appreciate someone on the CCCL team doing whatever testing is appropriate for this.
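To illustrate why a leftover self-call matters, here is a toy model of a static stack-size calculation over a call graph. It is a sketch under stated assumptions in host C++, not a description of how ptxas or nvlink are implemented; the `Function` and `CallGraph` types, the frame sizes, and the function names are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Toy call-graph node: the function's own frame size and its callees.
struct Function {
  std::size_t frame_bytes;
  std::vector<std::string> callees;
};

using CallGraph = std::map<std::string, Function>;

// Returns the worst-case stack depth in bytes starting at `name`, or
// std::nullopt if a call cycle means no static bound can be computed.
inline std::optional<std::size_t> max_stack_bytes_impl(
    const CallGraph& graph, const std::string& name,
    std::set<std::string>& on_path) {
  assert(graph.count(name) == 1 && "every callee must be declared in the graph");
  if (!on_path.insert(name).second) {
    return std::nullopt;  // `name` is already on the current call chain
  }
  const Function& fn = graph.at(name);
  std::size_t deepest_callee = 0;
  for (const std::string& callee : fn.callees) {
    const std::optional<std::size_t> depth =
        max_stack_bytes_impl(graph, callee, on_path);
    if (!depth) {
      return std::nullopt;  // a cycle somewhere below this function
    }
    deepest_callee = std::max(deepest_callee, *depth);
  }
  on_path.erase(name);
  return fn.frame_bytes + deepest_callee;
}

inline std::optional<std::size_t> max_stack_bytes(const CallGraph& graph,
                                                  const std::string& entry) {
  std::set<std::string> on_path;
  return max_stack_bytes_impl(graph, entry, on_path);
}
```

In this model, a graph in which a function lists itself as a callee yields no bound at all, even if the recursion would stop after one level at run time, while the same graph without that edge yields the sum of the frame sizes along the deepest call chain.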
bernhardmgruber approved these changes Feb 9, 2026
pre-commit.ci autofix

/ok to test 1285379
davebayer approved these changes Feb 9, 2026

davebayer reviewed Feb 9, 2026
/ok to test 2e10217

🥳 CI Workflow Results
🟩 Finished in 3h 31m: Pass: 100%/95 | Total: 4d 02h | Max: 3h 30m | Hits: 39%/249347
github-actions bot pushed a commit that referenced this pull request Feb 9, 2026
* Remove recursion from __internal_is_address_from
* [pre-commit.ci] auto code formatting
* Update libcudacxx/include/cuda/__memory/address_space.h
* Update libcudacxx/include/cuda/__memory/address_space.h
* Update libcudacxx/include/cuda/__memory/address_space.h

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
(cherry picked from commit afd6222)
Successfully created backport PR for
wmaxey pushed a commit that referenced this pull request Feb 11, 2026
* Remove recursion from __internal_is_address_from
* [pre-commit.ci] auto code formatting
* Update libcudacxx/include/cuda/__memory/address_space.h
* Update libcudacxx/include/cuda/__memory/address_space.h
* Update libcudacxx/include/cuda/__memory/address_space.h

(cherry picked from commit afd6222)
Co-authored-by: David Olsen <dolsen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
fbusato pushed a commit to fbusato/cccl that referenced this pull request Feb 19, 2026
* Remove recursion from __internal_is_address_from
* [pre-commit.ci] auto code formatting
* Update libcudacxx/include/cuda/__memory/address_space.h
* Update libcudacxx/include/cuda/__memory/address_space.h
* Update libcudacxx/include/cuda/__memory/address_space.h

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
This PR also fixes NVBug: 5880331