Refactor warpspeed scan 1/2#9168
Conversation
no SASS changes
kernel gets one instruction shorter: ISETP.GE.U64.AND P3, PT, R14, 0xf80, PT ; plus a few registers have different names now
|
Actionable comments posted: 0 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
WalkthroughRefactors the warpspeed scan kernel by extracting inline ChangesWarpspeed scan kernel refactoring
Possibly related PRs
Suggested reviewers
Comment |
|
Actionable comments posted: 0 |
This reverts commit befb22d.
| [[nodiscard]] _CCCL_HOST_DEVICE_API friend constexpr bool | ||
| operator==(const SquadDesc& lhs, const SquadDesc& rhs) noexcept |
There was a problem hiding this comment.
This is needed to avoid ODR use of squad descriptions which are static constexpr members of a class, since otherwise we get the error that those are not accessible in __device__ code.
This comment has been minimized.
This comment has been minimized.
| int barrierIdx = (int) this->mSquadIdx + 1; | ||
| const int barrierIdx = this->mSquadIdx + 1; | ||
|
|
||
| __barrier_sync_count(barrierIdx, this->threadCount()); |
There was a problem hiding this comment.
Unrelated to the PR: Why do we use unaligned version of the __barrier_sync? From what I see in the main loop, we always synchronize the warps on the same line in the code, so we should be able to use the aligned version, which produces a bit less code.
See the comparison: https://godbolt.org/z/vPGbT9W7c
There was a problem hiding this comment.
Ah, I'm probably wrong, it's not that all threads that are part of the barrier must call the instruction uniformly, but rather the whole CTA :(
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
|
/ok to test 700f422 |
| bool is_first_tile, | ||
| bool is_last_tile, // TODO(bgruber): should we dispatch on is_last_tile outside this function and compile it twice? |
There was a problem hiding this comment.
I believe in the long run we should have an enumeration like
enum class good_name{
__full_tile,
__first_tile,
__last_tile,
__only_tile,
};| PhaseInOutT& phaseInOutRW, | ||
| PhaseSumT& phaseSumThreadAndWarpW, | ||
| int valid_items, | ||
| bool is_first_tile, |
There was a problem hiding this comment.
Should those rather be template arguments?
There was a problem hiding this comment.
This is exactly what the next line implies:
// TODO(bgruber): should we dispatch on is_last_tile outside this function and compile it twice?
I believe we may want to test this, but not in this PR.
🥳 CI Workflow Results🟩 Finished in 1h 51m: Pass: 100%/285 | Total: 3d 07h | Max: 41m 02s | Hits: 100%/196423See results here. |
This is the first part of a few cleanup commits that should improve the readability of the warpspeed scan implementation.
AI generated summary:
I tried pedantically to not cause any SASS changes, but extracting the
reduce_tilefunction somehow let the compiler elide anISETP.GE.U64.AND P3, PT, R14, 0xf80, PT ;instruction, so the kernel is now exactly one instruction shorter. A few register names changed as well, but the instructions are otherwise identical.