Remove recursion from __internal_is_address_from by dkolsen-pgi · Pull Request #7561 · NVIDIA/cccl

dkolsen-pgi · 2026-02-07T06:01:01Z

Description

On pre-Hopper GPUs, a call to `__internal_is_address_from(ptr, cluster_shared)` would simply make a recursive call to `__internal_is_address_from(ptr, shared)`. The recursion stopped there, so there was no infinite recursion and no large stack usage. But when the GPU code is compiled with debug information and no optimization (`nvcc -G`), the recursive call remains in the PTX, which leaves ptxas or nvlink unable to calculate the correct stack size for the kernel. The kernel can then fail if the default stack size is too small.

Avoid this problem by removing the recursive call in `__internal_is_address_from`. Instead, move the `case address_space::shared:` code to just after `case address_space::cluster_shared:` and have `case address_space::cluster_shared:` `[[fallthrough]]` to `case address_space::shared:` on pre-Hopper GPUs.

This fixes some stdpar tests when compiled with `nvc++ -g -stdpar` on pre-Hopper GPUs. It fixes some CUDA applications compiled with `nvcc -G`, though I don't have any real-world examples.

Also fixes NVBug: 5880331
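
The restructuring described above can be sketched in plain host C++. This is only an illustration of the control-flow change, not the code in `address_space.h`: the `HasClusters` template flag, the `fake_is_*` helpers, and the address ranges are all hypothetical stand-ins for the real architecture check and device intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for cuda::device::address_space (subset only).
enum class address_space { global, shared, cluster_shared };

// Hypothetical classifiers: pretend [0x1000, 0x2000) is this block's shared
// memory and [0x1000, 0x4000) is the cluster-wide shared window.
constexpr bool fake_is_shared(std::uintptr_t a) { return a >= 0x1000 && a < 0x2000; }
constexpr bool fake_is_cluster_shared(std::uintptr_t a) { return a >= 0x1000 && a < 0x4000; }

// Before: without cluster support, the cluster_shared case calls the
// function itself again with address_space::shared.
template <bool HasClusters>
constexpr bool is_address_from_recursive(std::uintptr_t addr, address_space space) {
  switch (space) {
    case address_space::shared:
      return fake_is_shared(addr);
    case address_space::cluster_shared:
      if constexpr (HasClusters) {
        return fake_is_cluster_shared(addr);
      } else {
        return is_address_from_recursive<HasClusters>(addr, address_space::shared);
      }
    default:
      return false; // other address spaces omitted from this sketch
  }
}

// After: the shared case sits directly below cluster_shared, so the
// no-cluster path falls through instead of making a call.
template <bool HasClusters>
constexpr bool is_address_from_flat(std::uintptr_t addr, address_space space) {
  switch (space) {
    case address_space::cluster_shared:
      if constexpr (HasClusters) {
        return fake_is_cluster_shared(addr);
      }
      [[fallthrough]];
    case address_space::shared:
      return fake_is_shared(addr);
    default:
      return false; // other address spaces omitted from this sketch
  }
}

// Checks that both variants give the same answer over a range of addresses.
inline bool variants_agree() {
  const address_space spaces[] = {address_space::global, address_space::shared,
                                  address_space::cluster_shared};
  for (std::uintptr_t addr = 0; addr < 0x5000; addr += 0x400) {
    for (address_space s : spaces) {
      if (is_address_from_recursive<false>(addr, s) != is_address_from_flat<false>(addr, s)) return false;
      if (is_address_from_recursive<true>(addr, s) != is_address_from_flat<true>(addr, s)) return false;
    }
  }
  return true;
}
```

Both variants return the same results; the second simply contains no self-call, so nothing recursive is left for an unoptimized build to emit.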

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-02-07T06:01:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dkolsen-pgi · 2026-02-07T06:02:07Z

Here is an example of the problem:

#include <stdio.h>
#include <cuda/__memory/address_space.h>

__global__ void kernel(void *p) {
  if (cuda::device::__internal_is_address_from(p, cuda::device::address_space::cluster_shared)) {
    printf("Wrong answer!\n");
    asm("trap;");
  }
}

int main(int argc, char** argv) {
  kernel<<<32,32>>>(argv);
  if (cudaDeviceSynchronize() != cudaSuccess) {
    printf("Kernel failed\n");
  }
}

$ nvcc -arch=native -G -I/proj/cuda/cccl/main/libcudacxx/include test.cu
ptxas warning : Stack size for entry function '_Z6kernelPv' cannot be statically determined

The warning indicates that the kernel stack size might be wrong. This test program is small enough that the default stack size isn't an actual problem. But we have seen this warning followed by kernel failures due to stack overflow in larger programs.

I know that this change avoids the warning from ptxas or nvlink. But I can't easily do rigorous regression testing, so I would appreciate someone on the CCCL team doing whatever testing is appropriate for this.

bernhardmgruber · 2026-02-09T08:04:56Z

pre-commit.ci autofix

bernhardmgruber · 2026-02-09T08:08:16Z

/ok to test 1285379

davebayer · 2026-02-09T09:56:51Z

/ok to test 2e10217

github-actions · 2026-02-09T13:30:46Z

🥳 CI Workflow Results

🟩 Finished in 3h 31m: Pass: 100%/95 | Total: 4d 02h | Max: 3h 30m | Hits: 39%/249347

See results here.

github-actions · 2026-02-09T13:31:19Z

Successfully created backport PR for branch/3.2.x:

[Backport branch/3.2.x] Remove recursion from __internal_is_address_from #7573

dkolsen-pgi requested a review from a team as a code owner February 7, 2026 06:01

dkolsen-pgi requested a review from pciolkosz February 7, 2026 06:01

github-project-automation bot added this to CCCL Feb 7, 2026

github-project-automation bot moved this to Todo in CCCL Feb 7, 2026

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 7, 2026

bernhardmgruber approved these changes Feb 9, 2026

[pre-commit.ci] auto code formatting

1285379

davebayer approved these changes Feb 9, 2026

davebayer reviewed Feb 9, 2026

Three review threads on libcudacxx/include/cuda/__memory/address_space.h (outdated, resolved)

davebayer added 3 commits February 9, 2026 10:53

Update libcudacxx/include/cuda/__memory/address_space.h

9214f92

Update libcudacxx/include/cuda/__memory/address_space.h

0d596b1

Update libcudacxx/include/cuda/__memory/address_space.h

2e10217

bernhardmgruber added the backport branch/3.2.x label Feb 9, 2026

davebayer enabled auto-merge (squash) February 9, 2026 11:16

davebayer merged commit afd6222 into NVIDIA:main Feb 9, 2026
113 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Feb 9, 2026

github-actions bot mentioned this pull request Feb 9, 2026

[Backport branch/3.2.x] Remove recursion from __internal_is_address_from #7573

Merged

davebayer merged 5 commits into NVIDIA:main from dkolsen-pgi:bug/no-recursion

The description posted by dkolsen-pgi on Feb 7, 2026 was edited by bernhardmgruber.
