[STF] Migrate __stf/allocators/ from cuda_safe_call to cuda_try by andralex · Pull Request #9146 · NVIDIA/cccl

andralex · 2026-05-27T22:03:50Z

Summary

First PR in a series migrating production STF headers off the abort-on-failure cuda_safe_call and onto the throw-on-failure cuda_try, so callers (including Python wrappers and any wider exception-aware control flow) have the option to recover from CUDA errors instead of having the process aborted underneath them.

Scope here is intentionally tiny -- the entire cudax/include/cuda/experimental/__stf/allocators/ subtree, two call sites -- so the migration pattern can be reviewed before it scales up.

Changes

`pooled_allocator.cuh`

Plain cudaGetDeviceProperties query inside the constructor. No CUDA state in flight, no rollback needed -- straight cuda_safe_call -> cuda_try substitution.

`adapters.cuh`

stream_adapter::clear() synchronizes the stream before any blocking deallocations. With cuda_try, that path can now throw. The existing ~stream_adapter() sanity assertion (cleared_or_moved must be true at destruction) would then fire spuriously and report "clear() was not called" -- which is misleading: clear() was called, it just failed mid-way.

Guarded the bool flip with SCOPE(exit) at the top of clear() so the destructor's contract holds whether or not the synchronization throws. Explicit scope_guard.cuh include added rather than relying on transitive inclusion.

void clear()
{
  _CCCL_ASSERT(adapter_state, "Invalid state");
  _CCCL_ASSERT(!cleared_or_moved, "clear() was already called, or the object was moved.");

  SCOPE(exit) { cleared_or_moved = true; };

  // ... cuda_try(cudaStreamSynchronize(stream)); ...
}

The SCOPE(exit) body only writes a bool, which is noexcept -- no risk of std::terminate during unwinding.

Migration pattern notes (for the follow-ups)

SAFE: pure queries with no in-flight CUDA state -> direct cuda_try substitution.
GUARDED: operations that need an undo step on failure -> cuda_try + SCOPE(fail) for the rollback. Inside the SCOPE(fail) body, use cuda_safe_call, not cuda_try -- guard destructors are noexcept, so a thrown exception during unwinding would std::terminate.
KEEP: destructors and CUDA host callbacks remain cuda_safe_call. Same rationale -- those contexts are noexcept.

Test plan

CI green (/ok to test once branch is pushed -- see comment below)
No behavioral change for the success path
Confirmed no @code doxygen blocks in either file reference cuda_safe_call, so docs do not drift

First PR in a series migrating production STF headers away from the abort-on-failure ``cuda_safe_call`` toward the throw-on-failure ``cuda_try``, so callers (including Python wrappers and any wider exception-aware control flow) have the option to recover from CUDA errors instead of having the process aborted underneath them. Two sites in ``__stf/allocators/``: - ``pooled_allocator.cuh``: a plain ``cudaGetDeviceProperties`` query inside the constructor. No CUDA state in flight, no rollback needed -- straight substitution. - ``adapters.cuh``: ``stream_adapter::clear()`` synchronizes the stream before performing blocking deallocations. Substituting ``cuda_try`` means that path can now throw; the existing ``~stream_adapter()`` sanity assertion (``cleared_or_moved`` must be true) would then fire spuriously and report "clear() was not called" -- which is misleading, because clear() *was* called, it just failed mid-way. Guard the bool flip with ``SCOPE(exit)`` at the top of ``clear()`` so the destructor's contract holds whether or not the synchronization throws, and pull in ``scope_guard.cuh`` explicitly rather than relying on transitive include. Pilot PR for the broader migration -- intentionally small so the conversion pattern (including SCOPE-body usage of ``cuda_safe_call`` for the rollback step in future PRs) can be reviewed before scaling up.

copy-pr-bot · 2026-05-27T22:03:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-27T22:07:20Z

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Improved exception safety in memory allocation operations to prevent resource leaks during CUDA operations.
- Enhanced error handling approach for device configuration queries to increase overall system robustness.

Walkthrough

This PR improves exception-safety and error-handling consistency in STF allocators. stream_adapter::clear() now uses scope guards to guarantee the destructor precondition is met even when CUDA operations throw, and switches stream synchronization from cuda_safe_call to cuda_try. Similarly, block_data_pool switches device-properties queries to cuda_try.

Changes

STF Allocator Robustness

Layer / File(s)	Summary
stream_adapter::clear() exception-safety and error-handling `cudax/include/cuda/experimental/__stf/allocators/adapters.cuh`	Includes scope_guard header and adds `SCOPE(exit)` guard in `clear()` to set `cleared_or_moved = true` on all exit paths; replaces `cuda_safe_call` with `cuda_try` for stream synchronization before blocking deallocations.
block_data_pool error-handling consistency `cudax/include/cuda/experimental/__stf/allocators/pooled_allocator.cuh`	Switches `cudaGetDeviceProperties` from `cuda_safe_call` to `cuda_try` wrapper in pool constructor.

Possibly related PRs

NVIDIA/cccl#8891: Implements enhanced cuda_try output-parameter inference that this PR depends on for cudaGetDeviceProperties calls.

Suggested labels

stf

Suggested reviewers

caugonnet
alliepiper
bernhardmgruber

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 963e43f7-56ce-479c-8aca-f1922db75a36

📥 Commits

Reviewing files that changed from the base of the PR and between d2d87ed and 0f22512.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/allocators/adapters.cuh
cudax/include/cuda/experimental/__stf/allocators/pooled_allocator.cuh

coderabbitai · 2026-05-27T22:07:23Z

+    // Flip the bool up-front so the destructor's "clear() was not called"
+    // sanity assertion still holds even if a CUDA call below throws and the
+    // caller catches the exception. From the user's contract perspective,
+    // ``clear()`` *was* called; whether every individual deallocation
+    // succeeded is communicated via the thrown ``cuda_exception``.
+    SCOPE(exit)
+    {
+      cleared_or_moved = true;
+    };


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: Do not mark clear() as completed when it exits via exception.

If cuda_try throws on Line 195, adapter_state->to_free is still populated, but the SCOPE(exit) on Lines 179-182 sets cleared_or_moved = true anyway. That makes a retry hit the precondition on Line 172 and lets the destructor stop asserting even though buffers were never released, so this exception path becomes a silent leak. Only flip the state after all frees succeed, or split this into separate "clear attempted" and "clear completed" flags.

Also applies to: 195-195

@andralex Would not that go through SCOPE(fail) instead of SCOPE(exit) ?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

andralex · 2026-05-27T22:14:27Z

/ok to test 0f22512

andralex · 2026-05-27T22:17:47Z

Closing to let the author review the diff locally before pushing. Will reopen (or a successor PR) once approved.

github-actions · 2026-05-27T23:15:40Z

🥳 CI Workflow Results

🟩 Finished in 59m 28s: Pass: 100%/55 | Total: 23h 04m | Max: 59m 27s | Hits: 15%/150797

See results here.

andralex requested a review from a team as a code owner May 27, 2026 22:03

andralex requested a review from caugonnet May 27, 2026 22:03

github-project-automation Bot added this to CCCL May 27, 2026

github-project-automation Bot moved this to Todo in CCCL May 27, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

andralex closed this May 27, 2026

github-project-automation Bot moved this from In Review to Done in CCCL May 27, 2026

andralex deleted the andralex/stf-cuda-try-allocators branch May 27, 2026 22:17

andralex mentioned this pull request May 28, 2026

cudax/stf: migrate stackable/ from cuda_safe_call to cuda_try #9165

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[STF] Migrate __stf/allocators/ from cuda_safe_call to cuda_try#9146

[STF] Migrate __stf/allocators/ from cuda_safe_call to cuda_try#9146
andralex wants to merge 1 commit into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-allocators

andralex commented May 27, 2026

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Summary by CodeRabbit

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 27, 2026 •

edited

Loading

Uh oh!

caugonnet May 28, 2026

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

andralex commented May 27, 2026

Uh oh!

andralex commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andralex commented May 27, 2026

Summary

Changes

pooled_allocator.cuh

adapters.cuh

Migration pattern notes (for the follow-ups)

Test plan

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Summary by CodeRabbit

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

caugonnet May 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

andralex commented May 27, 2026

Uh oh!

andralex commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

🥳 CI Workflow Results

🟩 Finished in 59m 28s: Pass: 100%/55 | Total: 23h 04m | Max: 59m 27s | Hits: 15%/150797

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`pooled_allocator.cuh`

`adapters.cuh`

coderabbitai Bot May 27, 2026 •

edited

Loading