Skip to content

[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try#9150

Open
andralex wants to merge 2 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-utility
Open

[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try#9150
andralex wants to merge 2 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-utility

Conversation

@andralex
Copy link
Copy Markdown
Contributor

Summary

Second PR in the series migrating production STF headers off cuda_safe_call (abort-on-failure) and onto cuda_try (throw-on-failure), so callers (Python wrappers, exception-aware control flow) can recover from CUDA errors instead of having the process aborted.

Twelve call sites across three headers in cudax/include/cuda/experimental/__stf/utility/. After this PR, zero cuda_safe_call calls remain in that subtree outside of cuda_safe_call.cuh itself (the implementation).

Every pattern here was already demonstrated by PR #9147, so reviewer overhead per site should be small.

Changes

occupancy.cuh

  • compute_occupancy — hoisted the cache write to after the CUDA call. Before, auto& result = occupancy_cache[key] inserted a default-initialized entry into the cache before invoking the CUDA query, so a thrown cuda_try would have left zero-init data behind for any subsequent lookup. Now the result is computed into a stack-local first and assigned into the cache only on success.
  • compute_kernel_limitscudaFuncGetAttributes switched to the templated cuda_try<F>(args...) form so the result can be const-initialized.

memory.cuh

  • allocateHostMemory / allocateManagedMemory pool wipes — rewrote the "wipe pool when too full" for-loops as pop-then-free while-loops. With the naive substitution, a cudaFreeHost / cudaFree mid-loop failure would have left the pool full of pointers that were already freed, and the next call would have re-entered the wipe and double-freed them. The new pattern leaks at most the in-flight pointer (already removed from the pool), never causes a double-free.
  • allocateHostMemory / allocateManagedMemorycudaMallocHost / cudaMallocManaged switched to templated form, fused with the return.
  • Three callback launchers (cudaLaunchHostFunc × 2 and cudaGraphAddHostNode) — each previously did new std::pair<size_t, void*>(...) and passed the raw pointer to the launch. If the launch returned an error, cuda_safe_call aborted; if we'd done a naive cuda_try conversion, the launch failure would have leaked the heap pair forever. Rewrote with std::make_unique + args.release() after the cuda_try succeeds, so a failed launch unwinds cleanly.
  • address_is_pinnedcudaPointerGetAttributes → templated form.
  • unpin_memorycudaGetLastError → direct cuda_try.

constant_logical_data.cuh

  • cached_value::get() — between the cached.emplace(...) and the cudaStreamSynchronize, added SCOPE(fail) { cached.erase(it); };. If the sync throws, the partially-fetched cache entry would otherwise persist and corrupt any subsequent get() call (which would find the cache entry and return the unsynchronized value). unordered_map::erase by iterator is noexcept, so the SCOPE body is safe.

Includes added

  • memory.cuh: <memory> (for std::make_unique) and utility/scope_guard.cuh.
  • constant_logical_data.cuh: utility/scope_guard.cuh.

(Both scope_guard.cuh includes are explicit rather than transitive, matching the style established in PR #9147.)

Test plan

  • CI green (/ok to test in comment below)
  • No behavioral change on the success path

Second PR in the series. Twelve call sites across three headers in
``__stf/utility/``, grouped by pattern:

occupancy.cuh:
  - compute_occupancy: hoist the cache write so that the cuOccupancy /
    cudaOccupancy CUDA call runs against a stack-local first, then
    assigns into the cache on success. Previously a thrown cuda_try
    would have left a default-initialized entry in the cache that any
    subsequent lookup with the same key would silently return.
  - compute_kernel_limits: cudaFuncGetAttributes -> templated
    cuda_try<cudaFuncGetAttributes>(f) for const-initialized attrs.

memory.cuh:
  - allocateHostMemory / allocateManagedMemory pool wipes: rewrite
    the "wipe pool when too full" loops to pop-then-free, so a thrown
    cuda_try leaks at most the in-flight pointer (already removed from
    the pool) instead of leaving the pool full of already-freed
    pointers and producing a double-free on the next call.
  - allocateHostMemory / allocateManagedMemory: cudaMallocHost and
    cudaMallocManaged -> templated cuda_try<F>(sz) (returns void*,
    pairs naturally with ``return``).
  - deallocateHostMemory(stream), deallocateManagedMemory(stream),
    and the graph-overload of deallocateHostMemory: switch the
    raw ``new std::pair<...>`` to std::make_unique and call
    args.release() only after the launch / graph-node add succeeds.
    A failed launch previously leaked the heap pair forever.
  - address_is_pinned: cudaPointerGetAttributes -> templated form.
  - unpin_memory: cudaGetLastError -> cuda_try direct.

constant_logical_data.cuh:
  - cached_value::get(): cudaStreamSynchronize -> cuda_try, guarded by
    SCOPE(fail) { cached.erase(it); } so that a sync failure does not
    leave a partially-fetched entry that subsequent get() calls would
    return unsynchronized. unordered_map::erase(it) is noexcept, so the
    SCOPE body is safe.

Added include of utility/scope_guard.cuh in memory.cuh and
constant_logical_data.cuh, and <memory> in memory.cuh for unique_ptr.

After this PR, zero cuda_safe_call calls remain in ``__stf/utility/``
outside of cuda_safe_call.cuh itself (the implementation).
@andralex andralex requested a review from a team as a code owner May 28, 2026 00:10
@andralex andralex requested a review from caugonnet May 28, 2026 00:10
@andralex
Copy link
Copy Markdown
Contributor Author

/ok to test 7ba36e4

@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 28, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 226443f9-4a99-4e5a-956c-d91b50bed29f

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba36e4 and fc6e23d.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/utility/memory.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/utility/memory.cuh

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling so failed CUDA operations clean up partial state and avoid reuse of invalid entries.
    • Prevented resource leaks when stream- or graph-ordered deallocation fails by ensuring callbacks or heap allocations are only released after successful scheduling.
  • Refactor

    • Strengthened exception-safety across memory allocation, deallocation, and occupancy computation paths for more robust error propagation.

suggestion:

Walkthrough

Refactor three STF utility headers to use cuda_try for CUDA calls, add scope-guarded cache cleanup and compute-then-insert caching to avoid leaving partial entries, and gate callback-data ownership with std::unique_ptr until CUDA launch/node-add succeeds.

Changes

STF Utility Exception-Safety Hardening

Layer / File(s) Summary
Constant logical data cache failure cleanup
cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh
Add scope_guard.cuh include and, in impl::get, replace direct cudaStreamSynchronize with cuda_try and use SCOPE(fail){ cached.erase(it); } to erase partially-populated cache entries on sync failure.
Occupancy cache compute-then-insert and attributes retrieval
cudax/include/cuda/experimental/__stf/utility/occupancy.cuh
On cache miss compute occupancy into a local compute_occupancy_result using cuda_try (CUfunction and non-CUfunction paths) and insert into occupancy_cache only on success; retrieve cudaFuncAttributes via cuda_try<cudaFuncGetAttributes>(f).
Memory allocation and pool wipe with cuda_try
cudax/include/cuda/experimental/__stf/utility/memory.cuh
Add scope_guard.cuh and <memory> includes. allocateHostMemory and allocateManagedMemory free excess pooled entries individually with cuda_try in an erase loop and allocate via runtime-status cuda_try(cudaMallocHost/ cudaMallocManaged) before returning.
Stream/graph deallocation callback ownership gating
cudax/include/cuda/experimental/__stf/utility/memory.cuh
Stream-ordered and graph-ordered deallocation allocate a heap (size, ptr) pair under std::unique_ptr, call cuda_try(cudaLaunchHostFunc(...)) or cuda_try<cudaGraphAddHostNode>(...), and release the pointer to callback only on successful operations to avoid leaks on failure.
Memory query helpers cuda_try migration
cudax/include/cuda/experimental/__stf/utility/memory.cuh
address_is_pinned now uses cuda_try<cudaPointerGetAttributes>(p); unpin_memory uses cuda_try(cudaGetLastError()).

Possibly related PRs

  • NVIDIA/cccl#8891: Introduced cuda_try return-value inference patterns used in this PR.

Suggested labels

stf

Suggested reviewers

  • caugonnet
  • alliepiper
  • bernhardmgruber

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh (1)

85-85: ⚡ Quick win

suggestion: Fully qualify the free-function calls on Line 85 to match repository rules; e.g. ::cuda::experimental::stf::cuda_try(::cudaStreamSynchronize(stream));.

As per coding guidelines, “All calls to free functions must be fully qualified starting from the global namespace.”

cudax/include/cuda/experimental/__stf/utility/occupancy.cuh (1)

55-61: ⚡ Quick win

suggestion: Qualify these changed free-function calls from global scope for consistency with the cudax style contract, e.g. ::cuda::experimental::stf::cuda_try(::cuOccupancyMaxPotentialBlockSize(...)), ::cuda::experimental::stf::cuda_try(::cudaOccupancyMaxPotentialBlockSize(...)), and ::cuda::experimental::stf::cuda_try<::cudaFuncGetAttributes>(f).

As per coding guidelines, “All calls to free functions must be fully qualified starting from the global namespace.”

Also applies to: 109-109

cudax/include/cuda/experimental/__stf/utility/memory.cuh (1)

31-35: ⚡ Quick win

suggestion: replace the unused scope_guard.cuh include with a direct <utility> include. The new ::std::pair<size_t, void*> instantiations depend on <utility>, and nothing in this header uses a scope-guard symbol.

As per coding guidelines, "Files must include all headers related to symbols they use" and "Remove unused code, variables, functions, types, template parameters, and headers."

Also applies to: 192-193, 215-216, 243-244


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2bf25116-9962-4180-b491-fd7fd58182df

📥 Commits

Reviewing files that changed from the base of the PR and between 740b3c0 and 7ba36e4.

📒 Files selected for processing (3)
  • cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh
  • cudax/include/cuda/experimental/__stf/utility/memory.cuh
  • cudax/include/cuda/experimental/__stf/utility/occupancy.cuh

}
cuda_safe_call(cudaMallocManaged(&result, sz));
return result;
return cuda_try<cudaMallocManaged>(sz);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

fd -a 'cuda_safe_call\.cuh$|cuda_try\.cuh$' . | while read -r file; do
  rg -n -C4 'cuda_try|cudaMallocManaged' "$file"
done

Repository: NVIDIA/cccl

Length of output: 5861


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Show the call site in memory.cuh
FILE="cudax/include/cuda/experimental/__stf/utility/memory.cuh"
if [ -f "$FILE" ]; then
  echo "---- $FILE ----"
  nl -ba "$FILE" | sed -n '90,160p'
else
  echo "Missing: $FILE" >&2
fi

# 2) Locate cuda_try implementation and show surrounding logic + reserved helpers
rg -n "template <auto fun, typename\.\.\. Ps>.*cuda_try" -S cudax/include/cuda/experimental/__stf/utility || true
fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | while read -r f; do
  echo "---- $f ----"
  nl -ba "$f" | sed -n '300,620p'
done

# 3) Find reserved::first_param / last_param definitions
fd -a 'cuda_safe_call\.cuh$|cuda_try\.cuh$' . | while read -r f; do
  rg -n "reserved::(first_param|last_param)" "$f" || true
  rg -n "struct (first_param|last_param)|using (first_param|last_param)" "$f" || true
done

Repository: NVIDIA/cccl

Length of output: 183


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cudax/include/cuda/experimental/__stf/utility/memory.cuh"

echo "== File info =="
ls -l "$FILE" 2>/dev/null || true
wc -l "$FILE" 2>/dev/null || true

echo "== Snippet around line 124 =="
sed -n '105,145p' "$FILE" 2>/dev/null || true

echo "== cuda_mallocmanaged/cuda_try usage =="
rg -n "cuda_try<\s*cudaMallocManaged\s*>\s*\(" "$FILE" || true
rg -n "cudaMallocManaged" "$FILE" || true

echo "== Find cuda_try implementation files =="
fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | sort | cat

echo "== Show cuda_try definition + reserved first/last param =="
fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | while read -r f; do
  echo "---- $f ----"
  rg -n "template <auto fun, typename\.\.\. Ps>.*cuda_try|cuda_try\(" "$f" | head -n 50 || true
  rg -n "reserved::(first_param|last_param)|struct (first_param|last_param)|using (first_param|last_param)" "$f" || true
done

Repository: NVIDIA/cccl

Length of output: 7683


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# repo root assumed

echo "== Other uses of cuda_try<cudaMallocManaged> =="
rg -n "cuda_try<\s*cudaMallocManaged\s*>\s*\(" -S . || true

echo "== Any uses of cudaMemAttachGlobal =="
rg -n "cudaMemAttachGlobal" -S cudax/include . || true

echo "== Show cuda_try invocation selection logic (cuda_safe_call.cuh) =="
FILE="cudax/include/cuda/experimental/__stf/utility/cuda_safe_call.cuh"
sed -n '420,520p' "$FILE" | cat -n

echo "== Show first_param/last_param machinery (cuda_safe_call.cuh) =="
sed -n '220,320p' "$FILE" | cat -n

Repository: NVIDIA/cccl

Length of output: 9466


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for cudaMallocManaged declaration/signature in repo =="
rg -n "cudaMallocManaged\s*\(" -S . | head -n 50 || true

echo "== Search for default flags for cudaMallocManaged =="
rg -n "cudaMemAttachGlobal" -S . | head -n 50 || true

echo "== Search for cudaMallocManaged flags parameter usage =="
rg -n "cudaMallocManaged\s*\([^,]+,[^,]+,[^)]+" -S libcudacxx libcudacxx/include cudax 2>/dev/null || true

echo "== Search for cuda_runtime.h wrapper declaration references =="
rg -n "cuda_runtime\.h" -S cudax/include | head -n 50 || true

Repository: NVIDIA/cccl

Length of output: 6632


important: In cudax/include/cuda/experimental/__stf/utility/memory.cuh:124, cuda_try<cudaMallocManaged>(sz) selects first_output_form and synthesizes cudaMallocManaged(&result, sz), omitting the 3rd flags argument. Pass the attachment flags explicitly to avoid relying on defaults: return cuda_try<cudaMallocManaged>(sz, cudaMemAttachGlobal);

@github-actions

This comment has been minimized.

The previous commit used ``cuda_try<cudaMallocHost>(sz)`` and
``cuda_try<cudaMallocManaged>(sz)`` to return the allocated pointer in a
single expression. nvcc rejected this across the CI matrix: both CUDA
names are overload sets (the C++ headers in cuda_runtime.h wrap the C
API with templates), so ``decltype(fun)`` cannot resolve them. The
templated form is documented as unsupported for overload sets in
cuda_safe_call.cuh (~lines 432-435).

Restore the runtime-status form for both:

  void* p = nullptr;
  cuda_try(cudaMallocHost(&p, sz));    // or cudaMallocManaged
  return p;

Added a short comment at each site pointing at the overload-set
limitation so future migrators do not repeat the trap.

No behavioral change other than fixing the build.
@andralex
Copy link
Copy Markdown
Contributor Author

/ok to test fc6e23d

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@github-actions
Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 32m 41s: Pass: 12%/55 | Total: 4h 14m | Max: 32m 35s | Hits: 100%/1673

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

1 participant