
Fix tiny problem sizes for warpspeed scan #7921

Merged
bernhardmgruber merged 3 commits into NVIDIA:main from bernhardmgruber:scan_fixxes on Mar 9, 2026

Conversation

@bernhardmgruber
Contributor

Fixes: #7821

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner March 6, 2026 19:19
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 6, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 6, 2026
  constexpr ::cuda::std::uint16_t byteMask = 0xFFFF;
  const ::cuda::std::uint16_t byteMaskStart = byteMask << cpAsyncOobInfo.smemStartSkipBytes;
- const ::cuda::std::uint16_t byteMaskEnd = byteMask >> (16 - cpAsyncOobInfo.smemEndBytesAfter16BBoundary);
+ const ::cuda::std::uint16_t byteMaskEnd = byteMask >> (16 - cpAsyncOobInfo.smemEndBytesAfter16BBoundary) % 16;
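
To make the effect of the added `% 16` concrete, here is a minimal host-side C++ sketch comparing the two `byteMaskEnd` expressions from the diff. The function names and the `int` parameter `endBytes` (standing in for `cpAsyncOobInfo.smemEndBytesAfter16BBoundary`, assumed here to lie in [0, 16]) are hypothetical and not taken from the PR.

```cpp
#include <cstdint>

// Hypothetical stand-in for the byteMaskEnd expression without the modulo.
// endBytes is assumed to lie in [0, 16].
inline std::uint16_t endMaskWithoutModulo(int endBytes)
{
  constexpr std::uint16_t byteMask = 0xFFFF;
  // byteMask is promoted to int before the shift, so shifting by 16 is
  // well defined and produces 0.
  return static_cast<std::uint16_t>(byteMask >> (16 - endBytes));
}

// Hypothetical stand-in for the byteMaskEnd expression with the modulo.
inline std::uint16_t endMaskWithModulo(int endBytes)
{
  constexpr std::uint16_t byteMask = 0xFFFF;
  // % binds tighter than >>, so the shift amount is (16 - endBytes) % 16.
  return static_cast<std::uint16_t>(byteMask >> (16 - endBytes) % 16);
}
```

Under these assumptions the two versions differ only at `endBytes == 0`: without the modulo the shift amount is 16 and the mask becomes `0x0000`, while with it the shift amount wraps to 0 and the mask stays `0xFFFF`. For every other value in [0, 16] both return the same mask.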
Contributor Author


@ahendriksen is there any smarter way to shift byteMask by smemEndBytesAfter16BBoundary, but leave it when it's 16? Would a predicated shift be faster?

Contributor


Can't think of a quicker way off the top of my head. I always hoped that the compiler would figure out the best way to compute all these values.

A modulo 16 operation is just an AND with 0xF, which should be quicker than computing the predicate.
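
The equivalence behind this reply can be checked with a small host-side C++ sketch. For unsigned operands, `x % 16` equals `x & 0xF`, and for shift amounts in [0, 16] both agree with a select that maps 16 to 0 and leaves every other value alone. The helper names are hypothetical, and the sketch says nothing about which form the compiler emits or which is faster on a GPU.

```cpp
// Hypothetical helpers; shift is assumed to lie in [0, 16].

// Shift amount reduced with the modulo operator, as in the PR's expression.
inline unsigned shiftViaModulo(unsigned shift)
{
  return shift % 16u;
}

// The same reduction written as a bitwise AND with the low four bits.
inline unsigned shiftViaAnd(unsigned shift)
{
  return shift & 0xFu;
}

// The predicated alternative: only the value 16 is replaced by 0.
inline unsigned shiftViaSelect(unsigned shift)
{
  return shift == 16u ? 0u : shift;
}
```

All three helpers return the same value for every input in [0, 16], so the choice between them is purely a question of generated code.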

Contributor Author


Alright, thx!

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

🥳 CI Workflow Results

🟩 Finished in 3h 52m: Pass: 100%/249 | Total: 9d 03h | Max: 3h 51m | Hits: 71%/155156

See results here.

@bernhardmgruber bernhardmgruber merged commit 578d64b into NVIDIA:main Mar 9, 2026
271 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Mar 9, 2026
@bernhardmgruber bernhardmgruber deleted the scan_fixxes branch March 9, 2026 08:15


Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[BUG]: Spurious failures in cub.test.device.scan_alignment.lid_0

3 participants