
Remove recursion from __internal_is_address_from #7561

Merged
davebayer merged 5 commits into NVIDIA:main from dkolsen-pgi:bug/no-recursion
Feb 9, 2026

Conversation


@dkolsen-pgi dkolsen-pgi commented Feb 7, 2026

Description

When running on pre-Hopper GPUs, a call to `__internal_is_address_from(ptr, cluster_shared)` would simply make a recursive call to `__internal_is_address_from(ptr, shared)`. The recursion stopped there; there was no infinite recursion or unbounded stack growth. But when the GPU code was compiled with debug information and no optimization (`nvcc -G`), the recursive call remained in the PTX, which left either ptxas or nvlink unable to calculate the correct stack size for the kernel. That could result in a failed kernel if the default stack size is too small.

Avoid this problem by removing the recursive call in `__internal_is_address_from`. Instead, move the `case address_space::shared:` code to just after `case address_space::cluster_shared:` and have `case address_space::cluster_shared:` `[[fallthrough]]` to `case address_space::shared:` on pre-Hopper GPUs.

This fixes some stdpar tests when compiled with `nvc++ -g -stdpar` on pre-Hopper GPUs. It fixes some CUDA applications compiled with `nvcc -G`, though I don't have any real-world examples.

Also fixes NVBug: 5880331

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@dkolsen-pgi dkolsen-pgi requested a review from a team as a code owner February 7, 2026 06:01
@dkolsen-pgi dkolsen-pgi requested a review from pciolkosz February 7, 2026 06:01

copy-pr-bot bot commented Feb 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 7, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 7, 2026
@dkolsen-pgi
Contributor Author

Here is an example of the problem:

```cpp
#include <stdio.h>
#include <cuda/__memory/address_space.h>

__global__ void kernel(void *p) {
  if (cuda::device::__internal_is_address_from(p, cuda::device::address_space::cluster_shared)) {
    printf("Wrong answer!\n");
    asm("trap;");
  }
}

int main(int argc, char** argv) {
  kernel<<<32,32>>>(argv);
  if (cudaDeviceSynchronize() != cudaSuccess) {
    printf("Kernel failed\n");
  }
}
```

```console
$ nvcc -arch=native -G -I/proj/cuda/cccl/main/libcudacxx/include test.cu
ptxas warning : Stack size for entry function '_Z6kernelPv' cannot be statically determined
```

The warning indicates that the kernel stack size might be wrong. This test program is small enough that the default stack size isn't an actual problem. But we have seen this warning followed by kernel failures due to stack overflow in larger programs.

I know that this change avoids the warning from ptxas or nvlink. But I can't easily do rigorous regression testing, so I would appreciate someone on the CCCL team doing whatever testing is appropriate for this.

@bernhardmgruber
Contributor

pre-commit.ci autofix

@bernhardmgruber
Contributor

/ok to test 1285379

@davebayer
Contributor

/ok to test 2e10217


github-actions bot commented Feb 9, 2026

🥳 CI Workflow Results

🟩 Finished in 3h 31m: Pass: 100%/95 | Total: 4d 02h | Max: 3h 30m | Hits: 39%/249347

See results here.

@davebayer davebayer merged commit afd6222 into NVIDIA:main Feb 9, 2026
113 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Feb 9, 2026
github-actions bot pushed a commit that referenced this pull request Feb 9, 2026
* Remove recursion from __internal_is_address_from

* [pre-commit.ci] auto code formatting

* Update libcudacxx/include/cuda/__memory/address_space.h

* Update libcudacxx/include/cuda/__memory/address_space.h

* Update libcudacxx/include/cuda/__memory/address_space.h

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
(cherry picked from commit afd6222)

github-actions bot commented Feb 9, 2026

Successfully created backport PR for branch/3.2.x:

wmaxey pushed a commit that referenced this pull request Feb 11, 2026
* Remove recursion from __internal_is_address_from

* [pre-commit.ci] auto code formatting

* Update libcudacxx/include/cuda/__memory/address_space.h

* Update libcudacxx/include/cuda/__memory/address_space.h

* Update libcudacxx/include/cuda/__memory/address_space.h

---------



(cherry picked from commit afd6222)

Co-authored-by: David Olsen <dolsen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
fbusato pushed a commit to fbusato/cccl that referenced this pull request Feb 19, 2026
* Remove recursion from __internal_is_address_from

* [pre-commit.ci] auto code formatting

* Update libcudacxx/include/cuda/__memory/address_space.h

* Update libcudacxx/include/cuda/__memory/address_space.h

* Update libcudacxx/include/cuda/__memory/address_space.h

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
