[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try by andralex · Pull Request #9150 · NVIDIA/cccl

andralex · 2026-05-28T00:10:47Z

Summary

Second PR in the series migrating production STF headers off cuda_safe_call (abort-on-failure) and onto cuda_try (throw-on-failure), so callers (Python wrappers, exception-aware control flow) can recover from CUDA errors instead of having the process aborted.

Twelve call sites across three headers in cudax/include/cuda/experimental/__stf/utility/. After this PR, zero cuda_safe_call calls remain in that subtree outside of cuda_safe_call.cuh itself (the implementation).

Every pattern here was already demonstrated by PR #9147, so reviewer overhead per site should be small.

Changes

`occupancy.cuh`

compute_occupancy — hoisted the cache write to after the CUDA call. Before, auto& result = occupancy_cache[key] inserted a default-initialized entry into the cache before invoking the CUDA query, so a thrown cuda_try would have left zero-init data behind for any subsequent lookup. Now the result is computed into a stack-local first and assigned into the cache only on success.
compute_kernel_limits — cudaFuncGetAttributes switched to the templated cuda_try<F>(args...) form so the result can be const-initialized.

`memory.cuh`

allocateHostMemory / allocateManagedMemory pool wipes — rewrote the "wipe pool when too full" for-loops as pop-then-free while-loops. With the naive substitution, a cudaFreeHost / cudaFree mid-loop failure would have left the pool full of pointers that were already freed, and the next call would have re-entered the wipe and double-freed them. The new pattern leaks at most the in-flight pointer (already removed from the pool), never causes a double-free.
allocateHostMemory / allocateManagedMemory — cudaMallocHost / cudaMallocManaged switched to templated form, fused with the return.
Three callback launchers (cudaLaunchHostFunc × 2 and cudaGraphAddHostNode) — each previously did new std::pair<size_t, void*>(...) and passed the raw pointer to the launch. If the launch returned an error, cuda_safe_call aborted; if we'd done a naive cuda_try conversion, the launch failure would have leaked the heap pair forever. Rewrote with std::make_unique + args.release() after the cuda_try succeeds, so a failed launch unwinds cleanly.
address_is_pinned — cudaPointerGetAttributes → templated form.
unpin_memory — cudaGetLastError → direct cuda_try.

`constant_logical_data.cuh`

cached_value::get() — between the cached.emplace(...) and the cudaStreamSynchronize, added SCOPE(fail) { cached.erase(it); };. If the sync throws, the partially-fetched cache entry would otherwise persist and corrupt any subsequent get() call (which would find the cache entry and return the unsynchronized value). unordered_map::erase by iterator is noexcept, so the SCOPE body is safe.

Includes added

memory.cuh: <memory> (for std::make_unique) and utility/scope_guard.cuh.
constant_logical_data.cuh: utility/scope_guard.cuh.

(Both scope_guard.cuh includes are explicit rather than transitive, matching the style established in PR #9147.)

Test plan

CI green (/ok to test in comment below)
No behavioral change on the success path

Second PR in the series. Twelve call sites across three headers in ``__stf/utility/``, grouped by pattern: occupancy.cuh: - compute_occupancy: hoist the cache write so that the cuOccupancy / cudaOccupancy CUDA call runs against a stack-local first, then assigns into the cache on success. Previously a thrown cuda_try would have left a default-initialized entry in the cache that any subsequent lookup with the same key would silently return. - compute_kernel_limits: cudaFuncGetAttributes -> templated cuda_try<cudaFuncGetAttributes>(f) for const-initialized attrs. memory.cuh: - allocateHostMemory / allocateManagedMemory pool wipes: rewrite the "wipe pool when too full" loops to pop-then-free, so a thrown cuda_try leaks at most the in-flight pointer (already removed from the pool) instead of leaving the pool full of already-freed pointers and producing a double-free on the next call. - allocateHostMemory / allocateManagedMemory: cudaMallocHost and cudaMallocManaged -> templated cuda_try<F>(sz) (returns void*, pairs naturally with ``return``). - deallocateHostMemory(stream), deallocateManagedMemory(stream), and the graph-overload of deallocateHostMemory: switch the raw ``new std::pair<...>`` to std::make_unique and call args.release() only after the launch / graph-node add succeeds. A failed launch previously leaked the heap pair forever. - address_is_pinned: cudaPointerGetAttributes -> templated form. - unpin_memory: cudaGetLastError -> cuda_try direct. constant_logical_data.cuh: - cached_value::get(): cudaStreamSynchronize -> cuda_try, guarded by SCOPE(fail) { cached.erase(it); } so that a sync failure does not leave a partially-fetched entry that subsequent get() calls would return unsynchronized. unordered_map::erase(it) is noexcept, so the SCOPE body is safe. Added include of utility/scope_guard.cuh in memory.cuh and constant_logical_data.cuh, and <memory> in memory.cuh for unique_ptr. After this PR, zero cuda_safe_call calls remain in ``__stf/utility/`` outside of cuda_safe_call.cuh itself (the implementation).

andralex · 2026-05-28T00:10:49Z

/ok to test 7ba36e4

copy-pr-bot · 2026-05-28T00:10:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-28T00:22:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 226443f9-4a99-4e5a-956c-d91b50bed29f

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba36e4 and fc6e23d.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/utility/memory.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__stf/utility/memory.cuh

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Improved error handling so failed CUDA operations clean up partial state and avoid reuse of invalid entries.
- Prevented resource leaks when stream- or graph-ordered deallocation fails by ensuring callbacks or heap allocations are only released after successful scheduling.
Refactor
- Strengthened exception-safety across memory allocation, deallocation, and occupancy computation paths for more robust error propagation.

suggestion:

Walkthrough

Refactor three STF utility headers to use cuda_try for CUDA calls, add scope-guarded cache cleanup and compute-then-insert caching to avoid leaving partial entries, and gate callback-data ownership with std::unique_ptr until CUDA launch/node-add succeeds.

Changes

STF Utility Exception-Safety Hardening

Layer / File(s)	Summary
Constant logical data cache failure cleanup `cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh`	Add `scope_guard.cuh` include and, in `impl::get`, replace direct `cudaStreamSynchronize` with `cuda_try` and use `SCOPE(fail){ cached.erase(it); }` to erase partially-populated cache entries on sync failure.
Occupancy cache compute-then-insert and attributes retrieval `cudax/include/cuda/experimental/__stf/utility/occupancy.cuh`	On cache miss compute occupancy into a local `compute_occupancy_result` using `cuda_try` (CUfunction and non-`CUfunction` paths) and insert into `occupancy_cache` only on success; retrieve `cudaFuncAttributes` via `cuda_try<cudaFuncGetAttributes>(f)`.
Memory allocation and pool wipe with cuda_try `cudax/include/cuda/experimental/__stf/utility/memory.cuh`	Add `scope_guard.cuh` and `<memory>` includes. `allocateHostMemory` and `allocateManagedMemory` free excess pooled entries individually with `cuda_try` in an erase loop and allocate via runtime-status `cuda_try(cudaMallocHost/ cudaMallocManaged)` before returning.
Stream/graph deallocation callback ownership gating `cudax/include/cuda/experimental/__stf/utility/memory.cuh`	Stream-ordered and graph-ordered deallocation allocate a heap `(size, ptr)` pair under `std::unique_ptr`, call `cuda_try(cudaLaunchHostFunc(...))` or `cuda_try<cudaGraphAddHostNode>(...)`, and release the pointer to callback only on successful operations to avoid leaks on failure.
Memory query helpers cuda_try migration `cudax/include/cuda/experimental/__stf/utility/memory.cuh`	`address_is_pinned` now uses `cuda_try<cudaPointerGetAttributes>(p)`; `unpin_memory` uses `cuda_try(cudaGetLastError())`.

Possibly related PRs

NVIDIA/cccl#8891: Introduced cuda_try return-value inference patterns used in this PR.

Suggested labels

stf

Suggested reviewers

caugonnet
alliepiper
bernhardmgruber

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh (1)

85-85: ⚡ Quick win

suggestion: Fully qualify the free-function calls on Line 85 to match repository rules; e.g. ::cuda::experimental::stf::cuda_try(::cudaStreamSynchronize(stream));.

As per coding guidelines, “All calls to free functions must be fully qualified starting from the global namespace.”

cudax/include/cuda/experimental/__stf/utility/occupancy.cuh (1)

55-61: ⚡ Quick win

suggestion: Qualify these changed free-function calls from global scope for consistency with the cudax style contract, e.g. ::cuda::experimental::stf::cuda_try(::cuOccupancyMaxPotentialBlockSize(...)), ::cuda::experimental::stf::cuda_try(::cudaOccupancyMaxPotentialBlockSize(...)), and ::cuda::experimental::stf::cuda_try<::cudaFuncGetAttributes>(f).

As per coding guidelines, “All calls to free functions must be fully qualified starting from the global namespace.”

Also applies to: 109-109

cudax/include/cuda/experimental/__stf/utility/memory.cuh (1)

31-35: ⚡ Quick win

suggestion: replace the unused scope_guard.cuh include with a direct <utility> include. The new ::std::pair<size_t, void*> instantiations depend on <utility>, and nothing in this header uses a scope-guard symbol.

As per coding guidelines, "Files must include all headers related to symbols they use" and "Remove unused code, variables, functions, types, template parameters, and headers."

Also applies to: 192-193, 215-216, 243-244

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2bf25116-9962-4180-b491-fd7fd58182df

📥 Commits

Reviewing files that changed from the base of the PR and between 740b3c0 and 7ba36e4.

📒 Files selected for processing (3)

cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh
cudax/include/cuda/experimental/__stf/utility/memory.cuh
cudax/include/cuda/experimental/__stf/utility/occupancy.cuh

coderabbitai · 2026-05-28T00:22:53Z

  }
-  cuda_safe_call(cudaMallocManaged(&result, sz));
-  return result;
+  return cuda_try<cudaMallocManaged>(sz);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail fd -a 'cuda_safe_call\.cuh$|cuda_try\.cuh$' . | while read -r file; do rg -n -C4 'cuda_try|cudaMallocManaged' "$file" done

Repository: NVIDIA/cccl

Length of output: 5861

🏁 Script executed:

#!/bin/bash set -euo pipefail # 1) Show the call site in memory.cuh FILE="cudax/include/cuda/experimental/__stf/utility/memory.cuh" if [ -f "$FILE" ]; then echo "---- $FILE ----" nl -ba "$FILE" | sed -n '90,160p' else echo "Missing: $FILE" >&2 fi # 2) Locate cuda_try implementation and show surrounding logic + reserved helpers rg -n "template <auto fun, typename\.\.\. Ps>.*cuda_try" -S cudax/include/cuda/experimental/__stf/utility || true fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | while read -r f; do echo "---- $f ----" nl -ba "$f" | sed -n '300,620p' done # 3) Find reserved::first_param / last_param definitions fd -a 'cuda_safe_call\.cuh$|cuda_try\.cuh$' . | while read -r f; do rg -n "reserved::(first_param|last_param)" "$f" || true rg -n "struct (first_param|last_param)|using (first_param|last_param)" "$f" || true done

Repository: NVIDIA/cccl

Length of output: 183

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cudax/include/cuda/experimental/__stf/utility/memory.cuh" echo "== File info ==" ls -l "$FILE" 2>/dev/null || true wc -l "$FILE" 2>/dev/null || true echo "== Snippet around line 124 ==" sed -n '105,145p' "$FILE" 2>/dev/null || true echo "== cuda_mallocmanaged/cuda_try usage ==" rg -n "cuda_try<\s*cudaMallocManaged\s*>\s*\(" "$FILE" || true rg -n "cudaMallocManaged" "$FILE" || true echo "== Find cuda_try implementation files ==" fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | sort | cat echo "== Show cuda_try definition + reserved first/last param ==" fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | while read -r f; do echo "---- $f ----" rg -n "template <auto fun, typename\.\.\. Ps>.*cuda_try|cuda_try\(" "$f" | head -n 50 || true rg -n "reserved::(first_param|last_param)|struct (first_param|last_param)|using (first_param|last_param)" "$f" || true done

Repository: NVIDIA/cccl

Length of output: 7683

🏁 Script executed:

#!/bin/bash set -euo pipefail # repo root assumed echo "== Other uses of cuda_try<cudaMallocManaged> ==" rg -n "cuda_try<\s*cudaMallocManaged\s*>\s*\(" -S . || true echo "== Any uses of cudaMemAttachGlobal ==" rg -n "cudaMemAttachGlobal" -S cudax/include . || true echo "== Show cuda_try invocation selection logic (cuda_safe_call.cuh) ==" FILE="cudax/include/cuda/experimental/__stf/utility/cuda_safe_call.cuh" sed -n '420,520p' "$FILE" | cat -n echo "== Show first_param/last_param machinery (cuda_safe_call.cuh) ==" sed -n '220,320p' "$FILE" | cat -n

Repository: NVIDIA/cccl

Length of output: 9466

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Search for cudaMallocManaged declaration/signature in repo ==" rg -n "cudaMallocManaged\s*\(" -S . | head -n 50 || true echo "== Search for default flags for cudaMallocManaged ==" rg -n "cudaMemAttachGlobal" -S . | head -n 50 || true echo "== Search for cudaMallocManaged flags parameter usage ==" rg -n "cudaMallocManaged\s*\([^,]+,[^,]+,[^)]+" -S libcudacxx libcudacxx/include cudax 2>/dev/null || true echo "== Search for cuda_runtime.h wrapper declaration references ==" rg -n "cuda_runtime\.h" -S cudax/include | head -n 50 || true

Repository: NVIDIA/cccl

Length of output: 6632

important: In cudax/include/cuda/experimental/__stf/utility/memory.cuh:124, cuda_try<cudaMallocManaged>(sz) selects first_output_form and synthesizes cudaMallocManaged(&result, sz), omitting the 3rd flags argument. Pass the attachment flags explicitly to avoid relying on defaults: return cuda_try<cudaMallocManaged>(sz, cudaMemAttachGlobal);

The previous commit used ``cuda_try<cudaMallocHost>(sz)`` and ``cuda_try<cudaMallocManaged>(sz)`` to return the allocated pointer in a single expression. nvcc rejected this across the CI matrix: both CUDA names are overload sets (the C++ headers in cuda_runtime.h wrap the C API with templates), so ``decltype(fun)`` cannot resolve them. The templated form is documented as unsupported for overload sets in cuda_safe_call.cuh (~lines 432-435). Restore the runtime-status form for both: void* p = nullptr; cuda_try(cudaMallocHost(&p, sz)); // or cudaMallocManaged return p; Added a short comment at each site pointing at the overload-set limitation so future migrators do not repeat the trap. No behavioral change other than fixing the build.

andralex · 2026-05-28T14:19:59Z

/ok to test fc6e23d

coderabbitai · 2026-05-28T14:21:59Z

Actionable comments posted: 0

github-actions · 2026-05-28T14:54:27Z

😬 CI Workflow Results

🟥 Finished in 32m 41s: Pass: 12%/55 | Total: 4h 14m | Max: 32m 35s | Hits: 100%/1673

See results here.

andralex requested a review from a team as a code owner May 28, 2026 00:10

andralex requested a review from caugonnet May 28, 2026 00:10

github-project-automation Bot added this to CCCL May 28, 2026

github-project-automation Bot moved this to Todo in CCCL May 28, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 28, 2026

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

andralex mentioned this pull request May 28, 2026

cudax/stf: migrate stackable/ from cuda_safe_call to cuda_try #9165

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try#9150

[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try#9150
andralex wants to merge 2 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-utility

andralex commented May 28, 2026

Uh oh!

andralex commented May 28, 2026

Uh oh!

copy-pr-bot Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

This comment has been minimized.

andralex commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andralex commented May 28, 2026

Summary

Changes

occupancy.cuh

memory.cuh

constant_logical_data.cuh

Includes added

Test plan

Uh oh!

andralex commented May 28, 2026

Uh oh!

copy-pr-bot Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

andralex commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026

😬 CI Workflow Results

🟥 Finished in 32m 41s: Pass: 12%/55 | Total: 4h 14m | Max: 32m 35s | Hits: 100%/1673

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`occupancy.cuh`

`memory.cuh`

`constant_logical_data.cuh`

coderabbitai Bot commented May 28, 2026 •

edited

Loading