Merged
2 changes: 1 addition & 1 deletion cub/cub/detail/warpspeed/squad/load_store.cuh
@@ -236,7 +236,7 @@ squadStoreBulkSync(Squad squad, CpAsyncOobInfo<OutputT> cpAsyncOobInfo, const ::

constexpr ::cuda::std::uint16_t byteMask = 0xFFFF;
const ::cuda::std::uint16_t byteMaskStart = byteMask << cpAsyncOobInfo.smemStartSkipBytes;
- const ::cuda::std::uint16_t byteMaskEnd = byteMask >> (16 - cpAsyncOobInfo.smemEndBytesAfter16BBoundary);
+ const ::cuda::std::uint16_t byteMaskEnd = byteMask >> (16 - cpAsyncOobInfo.smemEndBytesAfter16BBoundary) % 16;
Contributor Author

@ahendriksen is there any smarter way to shift byteMask by smemEndBytesAfter16BBoundary, but leave it when it's 16? Would a predicated shift be faster?

Contributor

Can't think of a quicker way off the top of my head. I always hoped that the compiler would figure out the best way to compute all these values.

A modulo 16 operation on a non-negative value is just an AND with 0xF, which should be quicker than computing the predicate.

Contributor Author

Alright, thx!
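
To make the effect of the `% 16` concrete, here is a minimal host-side sketch. The helper names (`end_mask_old`, `end_mask_mod`, `end_mask_and`) and the `endBytes` parameter are hypothetical stand-ins for the `byteMaskEnd` expression and `cpAsyncOobInfo.smemEndBytesAfter16BBoundary`; the sketch assumes that value lies in [0, 16]. It is not code from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Pre-change expression: the shift amount is 16 - endBytes, i.e. 16 when endBytes == 0.
constexpr std::uint16_t end_mask_old(int endBytes)
{
  return static_cast<std::uint16_t>(0xFFFF >> (16 - endBytes));
}

// Post-change expression. `%` binds tighter than `>>`, so this is
// 0xFFFF >> ((16 - endBytes) % 16): a shift amount of 16 wraps to 0.
constexpr std::uint16_t end_mask_mod(int endBytes)
{
  return static_cast<std::uint16_t>(0xFFFF >> (16 - endBytes) % 16);
}

// Same result with the modulo written as a bitwise AND. This is only
// equivalent because 16 - endBytes is non-negative on the assumed range.
constexpr std::uint16_t end_mask_and(int endBytes)
{
  return static_cast<std::uint16_t>(0xFFFF >> ((16 - endBytes) & 0xF));
}

// The old and new expressions differ only for endBytes == 0.
static_assert(end_mask_old(0) == 0x0000, "old: shift by 16 clears the mask");
static_assert(end_mask_mod(0) == 0xFFFF, "new: shift by 0 keeps the mask");
static_assert(end_mask_old(16) == 0xFFFF && end_mask_mod(16) == 0xFFFF, "unchanged");
static_assert(end_mask_old(4) == 0x000F && end_mask_mod(4) == 0x000F, "unchanged");
```

For every other value in the assumed range the two expressions agree, so the change only affects the case where the shift amount would have been a full 16 bits.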

// byteMaskStart contains zeroes at the left
# if _CCCL_CUDA_COMPILER(NVCC, >=, 13, 2)
const ::cuda::std::uint16_t byteMaskSmall = byteMaskStart & byteMaskEnd;
6 changes: 6 additions & 0 deletions cub/cub/device/dispatch/dispatch_scan.cuh
@@ -410,6 +410,12 @@ struct DispatchScan
template <typename ActivePolicyT>
CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t __invoke_lookahead_algorithm(ActivePolicyT)
{
+ if (num_items == 0)
+ {
+   temp_storage_bytes = 1; // just fulfill the contract that CUB always requires some temporary storage
+   return cudaSuccess;
+ }
+
using InputT = ::cuda::std::iter_value_t<InputIteratorT>;
using OutputT = ::cuda::std::iter_value_t<OutputIteratorT>;
using WarpspeedPolicy = typename ActivePolicyT::WarpspeedPolicy;
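
The early return above can be read against CUB's usual two-phase calling convention, where a first call with a null temporary-storage pointer only reports the required size. Below is a minimal host-only sketch of that pattern. `scan_like`, `run_scan_like`, `Status`, and the per-item size are hypothetical and are not the real `DispatchScan` interface. One plausible reason for reporting a non-zero size (an assumption, not stated in the PR) is that a zero-byte allocation could hand back a null pointer, which the second call would then treat as another size query.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Status { success, error };

// Hypothetical stand-in for a CUB-style entry point.
inline Status scan_like(void* d_temp_storage, std::size_t& temp_storage_bytes, std::size_t num_items)
{
  if (num_items == 0)
  {
    temp_storage_bytes = 1; // mirrors the diff: never report zero bytes
    return Status::success;
  }
  constexpr std::size_t bytes_per_item = 4; // placeholder size, purely illustrative
  if (d_temp_storage == nullptr)
  {
    temp_storage_bytes = num_items * bytes_per_item; // phase 1: size query only
    return Status::success;
  }
  // Phase 2: the real algorithm would run here using d_temp_storage.
  return Status::success;
}

// Caller side: query the size, allocate, then run.
inline Status run_scan_like(std::size_t num_items)
{
  std::size_t bytes = 0;
  const Status query = scan_like(nullptr, bytes, num_items);
  if (query != Status::success)
  {
    return query;
  }
  assert(bytes > 0); // holds for empty input as well
  std::vector<unsigned char> temp(bytes); // host buffer standing in for device memory
  return scan_like(temp.data(), bytes, num_items);
}
```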
2 changes: 1 addition & 1 deletion cub/test/catch2_test_device_scan_alignment.cu
@@ -39,7 +39,7 @@ C2H_TEST("Device scan works with all device interfaces", "[scan][device]", value
constexpr offset_t max_num_items = 8192;

const auto offset = GENERATE_COPY(values({0, 1, 3, 4, 7, 8, 11, 12, 16}), take(3, random(0, max_offset)));
- const auto num_items = GENERATE_COPY(values({1, max_num_items}), take(64, random(0, max_num_items)));
+ const auto num_items = GENERATE_COPY(values({0, 1, max_num_items}), take(64, random(2, max_num_items - 1)));

CAPTURE(num_items, offset);
