Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
CUB
Is your feature request related to a problem? Please describe.
The atomic-based specialization of the block histogram uses device-wide atomics instead of block-wide ones:

```cuda
atomicAdd(histogram + items[i], 1);
```

When the histogram is in shared memory (and the compiler can see that), this inefficiency is optimized away. Nevertheless, the block histogram allows the histogram to be in global memory, which leads to suboptimal codegen (the resulting `atom` instruction uses `gpu` scope instead of `cta` scope).
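To make the scope difference concrete, here is a minimal standalone sketch (illustrative only, not CUB code; the kernel names are made up). With the histogram in global memory, plain `atomicAdd` should lower to a `gpu`-scope atomic, while `atomicAdd_block` (available on SM 6.0+) should lower to a cheaper `cta`-scope atomic, which is sufficient when all updates to a given histogram slice come from a single thread block:

```cuda
#include <cuda_runtime.h>

// Device-wide atomic update: visible to all blocks on the device,
// so the compiler must emit a gpu-scope atom.
__global__ void histo_device_scope(int *histogram, const int *items, int n)
{
  for (int i = threadIdx.x; i < n; i += blockDim.x)
  {
    atomicAdd(histogram + items[i], 1);
  }
}

// Block-wide atomic update (SM 6.0+): only guaranteed atomic with
// respect to threads in the same block, allowing a cta-scope atom.
__global__ void histo_block_scope(int *histogram, const int *items, int n)
{
  for (int i = threadIdx.x; i < n; i += blockDim.x)
  {
    atomicAdd_block(histogram + items[i], 1);
  }
}
```

Since each block of `BlockHistogram` writes to its own slice of the global histogram, the block-wide form is correct for this use case.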
Describe the solution you'd like
Scoped atomics are a Pascal+ (SM 6.0) feature, so we could consider something along the lines of:

```cuda
NV_IF_TARGET(NV_PROVIDES_SM_60,
             (atomicAdd_block(histogram + items[i], 1);),
             (atomicAdd(histogram + items[i], 1);));
```

Potential benchmark for this change:
```cuda
#include <cub/block/block_histogram.cuh>
#include <cub/block/block_load.cuh>

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>

#include <nvbench/nvbench.cuh>

template <int BlockThreads, int ItemsPerThread, int Bins>
__global__ void kernel(int *data, int *histogram)
{
  using histogram_t = cub::BlockHistogram<int,
                                          BlockThreads,
                                          ItemsPerThread,
                                          Bins,
                                          cub::BlockHistogramAlgorithm::BLOCK_HISTO_ATOMIC>;

  __shared__ typename histogram_t::TempStorage temp_storage;

  int thread_data[ItemsPerThread];
  cub::LoadDirectStriped<BlockThreads>(threadIdx.x, data, thread_data);

  // Each block accumulates into its own slice of the global histogram.
  histogram_t(temp_storage).Histogram(thread_data, histogram + Bins * blockIdx.x);
}

template <class BlockThreads, class ItemsPerThread, class Bins>
void bench(nvbench::state &state, nvbench::type_list<BlockThreads, ItemsPerThread, Bins>)
{
  constexpr int block_threads    = BlockThreads::value;
  constexpr int items_per_thread = ItemsPerThread::value;
  constexpr int bins             = Bins::value;

  int grid_size  = 800;
  int input_size = block_threads * items_per_thread;

  thrust::device_vector<int> data(input_size);
  thrust::device_vector<int> histogram(bins * grid_size);
  thrust::tabulate(data.begin(), data.end(), [] __host__ __device__(int i) { return i % bins; });

  state.exec([&](nvbench::launch &launch) {
    kernel<block_threads, items_per_thread, bins>
      <<<grid_size, block_threads, 0, launch.get_stream()>>>(
        thrust::raw_pointer_cast(data.data()),
        thrust::raw_pointer_cast(histogram.data()));
  });
}

using block_threads = nvbench::enum_type_list<128, 256, 512>;
using items         = nvbench::enum_type_list<1, 3, 7>;
using bins          = nvbench::enum_type_list<10, 50, 100>;

NVBENCH_BENCH_TYPES(bench, NVBENCH_TYPE_AXES(block_threads, items, bins));
```

Describe alternatives you've considered
No response
Additional context
No response