[STF] Migrate __stf/allocators/ from cuda_safe_call to cuda_try#9146
[STF] Migrate __stf/allocators/ from cuda_safe_call to cuda_try#9146andralex wants to merge 1 commit into
Conversation
First PR in a series migrating production STF headers away from the abort-on-failure ``cuda_safe_call`` toward the throw-on-failure ``cuda_try``, so callers (including Python wrappers and any wider exception-aware control flow) have the option to recover from CUDA errors instead of having the process aborted underneath them. Two sites in ``__stf/allocators/``: - ``pooled_allocator.cuh``: a plain ``cudaGetDeviceProperties`` query inside the constructor. No CUDA state in flight, no rollback needed -- straight substitution. - ``adapters.cuh``: ``stream_adapter::clear()`` synchronizes the stream before performing blocking deallocations. Substituting ``cuda_try`` means that path can now throw; the existing ``~stream_adapter()`` sanity assertion (``cleared_or_moved`` must be true) would then fire spuriously and report "clear() was not called" -- which is misleading, because clear() *was* called, it just failed mid-way. Guard the bool flip with ``SCOPE(exit)`` at the top of ``clear()`` so the destructor's contract holds whether or not the synchronization throws, and pull in ``scope_guard.cuh`` explicitly rather than relying on transitive include. Pilot PR for the broader migration -- intentionally small so the conversion pattern (including SCOPE-body usage of ``cuda_safe_call`` for the rollback step in future PRs) can be reviewed before scaling up.
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR improves exception-safety and error-handling consistency in STF allocators. ChangesSTF Allocator Robustness
Possibly related PRs
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 963e43f7-56ce-479c-8aca-f1922db75a36
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/allocators/adapters.cuhcudax/include/cuda/experimental/__stf/allocators/pooled_allocator.cuh
| // Flip the bool up-front so the destructor's "clear() was not called" | ||
| // sanity assertion still holds even if a CUDA call below throws and the | ||
| // caller catches the exception. From the user's contract perspective, | ||
| // ``clear()`` *was* called; whether every individual deallocation | ||
| // succeeded is communicated via the thrown ``cuda_exception``. | ||
| SCOPE(exit) | ||
| { | ||
| cleared_or_moved = true; | ||
| }; |
There was a problem hiding this comment.
critical: Do not mark clear() as completed when it exits via exception.
If cuda_try throws on Line 195, adapter_state->to_free is still populated, but the SCOPE(exit) on Lines 179-182 sets cleared_or_moved = true anyway. That makes a retry hit the precondition on Line 172 and lets the destructor stop asserting even though buffers were never released, so this exception path becomes a silent leak. Only flip the state after all frees succeed, or split this into separate "clear attempted" and "clear completed" flags.
Also applies to: 195-195
There was a problem hiding this comment.
@andralex Would not that go through SCOPE(fail) instead of SCOPE(exit) ?
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
|
/ok to test 0f22512 |
|
Closing to let the author review the diff locally before pushing. Will reopen (or a successor PR) once approved. |
🥳 CI Workflow Results🟩 Finished in 59m 28s: Pass: 100%/55 | Total: 23h 04m | Max: 59m 27s | Hits: 15%/150797See results here. |
Summary
First PR in a series migrating production STF headers off the abort-on-failure
cuda_safe_calland onto the throw-on-failurecuda_try, so callers (including Python wrappers and any wider exception-aware control flow) have the option to recover from CUDA errors instead of having the process aborted underneath them.Scope here is intentionally tiny -- the entire
cudax/include/cuda/experimental/__stf/allocators/subtree, two call sites -- so the migration pattern can be reviewed before it scales up.Changes
pooled_allocator.cuhPlain
cudaGetDevicePropertiesquery inside the constructor. No CUDA state in flight, no rollback needed -- straightcuda_safe_call->cuda_trysubstitution.adapters.cuhstream_adapter::clear()synchronizes the stream before any blocking deallocations. Withcuda_try, that path can now throw. The existing~stream_adapter()sanity assertion (cleared_or_movedmust be true at destruction) would then fire spuriously and report "clear() was not called" -- which is misleading:clear()was called, it just failed mid-way.Guarded the bool flip with
SCOPE(exit)at the top ofclear()so the destructor's contract holds whether or not the synchronization throws. Explicitscope_guard.cuhinclude added rather than relying on transitive inclusion.The
SCOPE(exit)body only writes abool, which isnoexcept-- no risk ofstd::terminateduring unwinding.Migration pattern notes (for the follow-ups)
cuda_trysubstitution.cuda_try+SCOPE(fail)for the rollback. Inside theSCOPE(fail)body, usecuda_safe_call, notcuda_try-- guard destructors arenoexcept, so a thrown exception during unwinding wouldstd::terminate.cuda_safe_call. Same rationale -- those contexts arenoexcept.Test plan
/ok to testonce branch is pushed -- see comment below)@codedoxygen blocks in either file referencecuda_safe_call, so docs do not drift