[FEA]: Reduce scope of histogram atomics #3357

@gevtushenko

Description

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

The atomic-based specialization of block histogram uses device-wide atomics instead of block-wide ones:

atomicAdd(histogram + items[i], 1);

When the histogram is in shared memory (and the compiler can see that), this inefficiency is optimized away. However, block histogram also allows the histogram to live in global memory, which leads to suboptimal codegen (a gpu-scoped rather than cta-scoped atom instruction).
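The scope difference shows up in the generated PTX. A sketch (my illustration, not from the issue) of what the two intrinsics roughly lower to for a global-memory histogram:

```cuda
// Illustration (assumption, not from the issue): the same increment with
// device-scoped vs. block-scoped atomics, and the PTX each roughly produces.
__device__ void bump(int *histogram, int bin)
{
  atomicAdd(histogram + bin, 1);        // red.global.add.u32     (gpu scope, the default)
  atomicAdd_block(histogram + bin, 1);  // red.cta.global.add.u32 (cta scope, SM 60+)
}
```

The cta-scoped form only has to be ordered with respect to threads in the same block, which is exactly the guarantee block histogram needs.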

Describe the solution you'd like

Scoped atomics are a Pascal+ (SM 60) feature, so we can consider something along the lines of:

NV_IF_TARGET(NV_PROVIDES_SM_60, 
             (atomicAdd_block(histogram + items[i], 1);), 
             (atomicAdd(histogram + items[i], 1);));
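For reuse inside the atomic specialization, the dispatch could be wrapped in a small helper. A minimal sketch; the helper name and placement are hypothetical:

```cuda
#include <nv/target>  // NV_IF_TARGET / NV_PROVIDES_SM_60

// Hypothetical helper: block-scoped atomic add on SM 60+, falling back to a
// device-scoped add on older architectures where atomicAdd_block is unavailable.
template <typename T>
__device__ __forceinline__ void scoped_histogram_add(T *addr, T val)
{
  NV_IF_TARGET(NV_PROVIDES_SM_60,
               (atomicAdd_block(addr, val);),
               (atomicAdd(addr, val);));
}
```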

Potential benchmark for this change:

#include <cub/block/block_histogram.cuh>
#include <cub/block/block_load.cuh>

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>

#include <nvbench/nvbench.cuh>

template <int BlockThreads, int ItemsPerThread, int Bins>
__global__ void kernel(int *data, int *histogram)
{
  using histogram_t = cub::BlockHistogram<int,
                                          BlockThreads,
                                          ItemsPerThread,
                                          Bins,
                                          cub::BlockHistogramAlgorithm::BLOCK_HISTO_ATOMIC>;
  __shared__ typename histogram_t::TempStorage temp_storage;

  int thread_data[ItemsPerThread];
  cub::LoadDirectStriped<BlockThreads>(threadIdx.x, data, thread_data);
  histogram_t(temp_storage).Histogram(thread_data, histogram + Bins * blockIdx.x);
}

template <class BlockThreads, class ItemsPerThread, class Bins>
void bench(nvbench::state &state, nvbench::type_list<BlockThreads, ItemsPerThread, Bins>)
{
  constexpr int block_threads    = BlockThreads::value;
  constexpr int items_per_thread = ItemsPerThread::value;
  constexpr int bins             = Bins::value;

  int grid_size  = 800;
  int input_size = block_threads * items_per_thread;
  thrust::device_vector<int> data(input_size);
  thrust::device_vector<int> histogram(bins * grid_size);
  thrust::tabulate(data.begin(), data.end(), [] __host__ __device__(int i) { return i % bins; });

  state.exec([&](nvbench::launch &launch) {
    kernel<block_threads, items_per_thread, bins>
      <<<grid_size, block_threads, 0, launch.get_stream()>>>(thrust::raw_pointer_cast(data.data()),
                                                             thrust::raw_pointer_cast(
                                                               histogram.data()));
  });
}

using block_threads = nvbench::enum_type_list<128, 256, 512>;
using items         = nvbench::enum_type_list<1, 3, 7>;
using bins          = nvbench::enum_type_list<10, 50, 100>;

NVBENCH_BENCH_TYPES(bench, NVBENCH_TYPE_AXES(block_threads, items, bins));

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Status: In Review