[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try#9150
[STF] Migrate __stf/utility/ from cuda_safe_call to cuda_try#9150andralex wants to merge 2 commits into
Conversation
Second PR in the series. Twelve call sites across three headers in
``__stf/utility/``, grouped by pattern:
occupancy.cuh:
- compute_occupancy: hoist the cache write so that the cuOccupancy /
cudaOccupancy CUDA call runs against a stack-local first, then
assigns into the cache on success. Previously a thrown cuda_try
would have left a default-initialized entry in the cache that any
subsequent lookup with the same key would silently return.
- compute_kernel_limits: cudaFuncGetAttributes -> templated
cuda_try<cudaFuncGetAttributes>(f) for const-initialized attrs.
memory.cuh:
- allocateHostMemory / allocateManagedMemory pool wipes: rewrite
the "wipe pool when too full" loops to pop-then-free, so a thrown
cuda_try leaks at most the in-flight pointer (already removed from
the pool) instead of leaving the pool full of already-freed
pointers and producing a double-free on the next call.
- allocateHostMemory / allocateManagedMemory: cudaMallocHost and
cudaMallocManaged -> templated cuda_try<F>(sz) (returns void*,
pairs naturally with ``return``).
- deallocateHostMemory(stream), deallocateManagedMemory(stream),
and the graph-overload of deallocateHostMemory: switch the
raw ``new std::pair<...>`` to std::make_unique and call
args.release() only after the launch / graph-node add succeeds.
A failed launch previously leaked the heap pair forever.
- address_is_pinned: cudaPointerGetAttributes -> templated form.
- unpin_memory: cudaGetLastError -> cuda_try direct.
constant_logical_data.cuh:
- cached_value::get(): cudaStreamSynchronize -> cuda_try, guarded by
SCOPE(fail) { cached.erase(it); } so that a sync failure does not
leave a partially-fetched entry that subsequent get() calls would
return unsynchronized. unordered_map::erase(it) is noexcept, so the
SCOPE body is safe.
Added include of utility/scope_guard.cuh in memory.cuh and
constant_logical_data.cuh, and <memory> in memory.cuh for unique_ptr.
After this PR, zero cuda_safe_call calls remain in ``__stf/utility/``
outside of cuda_safe_call.cuh itself (the implementation).
|
/ok to test 7ba36e4 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
suggestion: WalkthroughRefactor three STF utility headers to use ChangesSTF Utility Exception-Safety Hardening
Possibly related PRs
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuh (1)
85-85: ⚡ Quick winsuggestion: Fully qualify the free-function calls on Line 85 to match repository rules; e.g.
::cuda::experimental::stf::cuda_try(::cudaStreamSynchronize(stream));.As per coding guidelines, “All calls to free functions must be fully qualified starting from the global namespace.”
cudax/include/cuda/experimental/__stf/utility/occupancy.cuh (1)
55-61: ⚡ Quick winsuggestion: Qualify these changed free-function calls from global scope for consistency with the cudax style contract, e.g.
::cuda::experimental::stf::cuda_try(::cuOccupancyMaxPotentialBlockSize(...)),::cuda::experimental::stf::cuda_try(::cudaOccupancyMaxPotentialBlockSize(...)), and::cuda::experimental::stf::cuda_try<::cudaFuncGetAttributes>(f).As per coding guidelines, “All calls to free functions must be fully qualified starting from the global namespace.”
Also applies to: 109-109
cudax/include/cuda/experimental/__stf/utility/memory.cuh (1)
31-35: ⚡ Quick winsuggestion: replace the unused
scope_guard.cuhinclude with a direct<utility>include. The new::std::pair<size_t, void*>instantiations depend on<utility>, and nothing in this header uses a scope-guard symbol.As per coding guidelines, "Files must include all headers related to symbols they use" and "Remove unused code, variables, functions, types, template parameters, and headers."
Also applies to: 192-193, 215-216, 243-244
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2bf25116-9962-4180-b491-fd7fd58182df
📒 Files selected for processing (3)
cudax/include/cuda/experimental/__stf/utility/constant_logical_data.cuhcudax/include/cuda/experimental/__stf/utility/memory.cuhcudax/include/cuda/experimental/__stf/utility/occupancy.cuh
| } | ||
| cuda_safe_call(cudaMallocManaged(&result, sz)); | ||
| return result; | ||
| return cuda_try<cudaMallocManaged>(sz); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
fd -a 'cuda_safe_call\.cuh$|cuda_try\.cuh$' . | while read -r file; do
rg -n -C4 'cuda_try|cudaMallocManaged' "$file"
doneRepository: NVIDIA/cccl
Length of output: 5861
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Show the call site in memory.cuh
FILE="cudax/include/cuda/experimental/__stf/utility/memory.cuh"
if [ -f "$FILE" ]; then
echo "---- $FILE ----"
nl -ba "$FILE" | sed -n '90,160p'
else
echo "Missing: $FILE" >&2
fi
# 2) Locate cuda_try implementation and show surrounding logic + reserved helpers
rg -n "template <auto fun, typename\.\.\. Ps>.*cuda_try" -S cudax/include/cuda/experimental/__stf/utility || true
fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | while read -r f; do
echo "---- $f ----"
nl -ba "$f" | sed -n '300,620p'
done
# 3) Find reserved::first_param / last_param definitions
fd -a 'cuda_safe_call\.cuh$|cuda_try\.cuh$' . | while read -r f; do
rg -n "reserved::(first_param|last_param)" "$f" || true
rg -n "struct (first_param|last_param)|using (first_param|last_param)" "$f" || true
doneRepository: NVIDIA/cccl
Length of output: 183
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cudax/include/cuda/experimental/__stf/utility/memory.cuh"
echo "== File info =="
ls -l "$FILE" 2>/dev/null || true
wc -l "$FILE" 2>/dev/null || true
echo "== Snippet around line 124 =="
sed -n '105,145p' "$FILE" 2>/dev/null || true
echo "== cuda_mallocmanaged/cuda_try usage =="
rg -n "cuda_try<\s*cudaMallocManaged\s*>\s*\(" "$FILE" || true
rg -n "cudaMallocManaged" "$FILE" || true
echo "== Find cuda_try implementation files =="
fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | sort | cat
echo "== Show cuda_try definition + reserved first/last param =="
fd -a 'cuda_try\.cuh$|cuda_safe_call\.cuh$' cudax/include/cuda/experimental/__stf/utility . | while read -r f; do
echo "---- $f ----"
rg -n "template <auto fun, typename\.\.\. Ps>.*cuda_try|cuda_try\(" "$f" | head -n 50 || true
rg -n "reserved::(first_param|last_param)|struct (first_param|last_param)|using (first_param|last_param)" "$f" || true
doneRepository: NVIDIA/cccl
Length of output: 7683
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# repo root assumed
echo "== Other uses of cuda_try<cudaMallocManaged> =="
rg -n "cuda_try<\s*cudaMallocManaged\s*>\s*\(" -S . || true
echo "== Any uses of cudaMemAttachGlobal =="
rg -n "cudaMemAttachGlobal" -S cudax/include . || true
echo "== Show cuda_try invocation selection logic (cuda_safe_call.cuh) =="
FILE="cudax/include/cuda/experimental/__stf/utility/cuda_safe_call.cuh"
sed -n '420,520p' "$FILE" | cat -n
echo "== Show first_param/last_param machinery (cuda_safe_call.cuh) =="
sed -n '220,320p' "$FILE" | cat -nRepository: NVIDIA/cccl
Length of output: 9466
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search for cudaMallocManaged declaration/signature in repo =="
rg -n "cudaMallocManaged\s*\(" -S . | head -n 50 || true
echo "== Search for default flags for cudaMallocManaged =="
rg -n "cudaMemAttachGlobal" -S . | head -n 50 || true
echo "== Search for cudaMallocManaged flags parameter usage =="
rg -n "cudaMallocManaged\s*\([^,]+,[^,]+,[^)]+" -S libcudacxx libcudacxx/include cudax 2>/dev/null || true
echo "== Search for cuda_runtime.h wrapper declaration references =="
rg -n "cuda_runtime\.h" -S cudax/include | head -n 50 || trueRepository: NVIDIA/cccl
Length of output: 6632
important: In cudax/include/cuda/experimental/__stf/utility/memory.cuh:124, cuda_try<cudaMallocManaged>(sz) selects first_output_form and synthesizes cudaMallocManaged(&result, sz), omitting the 3rd flags argument. Pass the attachment flags explicitly to avoid relying on defaults: return cuda_try<cudaMallocManaged>(sz, cudaMemAttachGlobal);
This comment has been minimized.
This comment has been minimized.
The previous commit used ``cuda_try<cudaMallocHost>(sz)`` and ``cuda_try<cudaMallocManaged>(sz)`` to return the allocated pointer in a single expression. nvcc rejected this across the CI matrix: both CUDA names are overload sets (the C++ headers in cuda_runtime.h wrap the C API with templates), so ``decltype(fun)`` cannot resolve them. The templated form is documented as unsupported for overload sets in cuda_safe_call.cuh (~lines 432-435). Restore the runtime-status form for both: void* p = nullptr; cuda_try(cudaMallocHost(&p, sz)); // or cudaMallocManaged return p; Added a short comment at each site pointing at the overload-set limitation so future migrators do not repeat the trap. No behavioral change other than fixing the build.
|
/ok to test fc6e23d |
|
Actionable comments posted: 0 |
😬 CI Workflow Results🟥 Finished in 32m 41s: Pass: 12%/55 | Total: 4h 14m | Max: 32m 35s | Hits: 100%/1673See results here. |
Summary
Second PR in the series migrating production STF headers off
cuda_safe_call(abort-on-failure) and ontocuda_try(throw-on-failure), so callers (Python wrappers, exception-aware control flow) can recover from CUDA errors instead of having the process aborted.Twelve call sites across three headers in
cudax/include/cuda/experimental/__stf/utility/. After this PR, zerocuda_safe_callcalls remain in that subtree outside ofcuda_safe_call.cuhitself (the implementation).Every pattern here was already demonstrated by PR #9147, so reviewer overhead per site should be small.
Changes
occupancy.cuhcompute_occupancy— hoisted the cache write to after the CUDA call. Before,auto& result = occupancy_cache[key]inserted a default-initialized entry into the cache before invoking the CUDA query, so a throwncuda_trywould have left zero-init data behind for any subsequent lookup. Now the result is computed into a stack-local first and assigned into the cache only on success.compute_kernel_limits—cudaFuncGetAttributesswitched to the templatedcuda_try<F>(args...)form so the result can beconst-initialized.memory.cuhallocateHostMemory/allocateManagedMemorypool wipes — rewrote the "wipe pool when too full"for-loops as pop-then-freewhile-loops. With the naive substitution, acudaFreeHost/cudaFreemid-loop failure would have left the pool full of pointers that were already freed, and the next call would have re-entered the wipe and double-freed them. The new pattern leaks at most the in-flight pointer (already removed from the pool), never causes a double-free.allocateHostMemory/allocateManagedMemory—cudaMallocHost/cudaMallocManagedswitched to templated form, fused with the return.cudaLaunchHostFunc× 2 andcudaGraphAddHostNode) — each previously didnew std::pair<size_t, void*>(...)and passed the raw pointer to the launch. If the launch returned an error,cuda_safe_callaborted; if we'd done a naivecuda_tryconversion, the launch failure would have leaked the heap pair forever. Rewrote withstd::make_unique+args.release()after thecuda_trysucceeds, so a failed launch unwinds cleanly.address_is_pinned—cudaPointerGetAttributes→ templated form.unpin_memory—cudaGetLastError→ directcuda_try.constant_logical_data.cuhcached_value::get()— between thecached.emplace(...)and thecudaStreamSynchronize, addedSCOPE(fail) { cached.erase(it); };. If the sync throws, the partially-fetched cache entry would otherwise persist and corrupt any subsequentget()call (which would find the cache entry and return the unsynchronized value).unordered_map::eraseby iterator isnoexcept, so the SCOPE body is safe.Includes added
memory.cuh:<memory>(forstd::make_unique) andutility/scope_guard.cuh.constant_logical_data.cuh:utility/scope_guard.cuh.(Both
scope_guard.cuhincludes are explicit rather than transitive, matching the style established in PR #9147.)Test plan
/ok to testin comment below)