[FEA]: Reduce scope of histogram atomics #3357

@gevtushenko

Description

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

The atomic-based specialization of block histogram uses device-wide atomics instead of block-wide ones:

atomicAdd(histogram + items[i], 1);

When the histogram is in shared memory (and the compiler can see that), this inefficiency is optimized away. However, block histogram also allows the histogram to live in global memory, which leads to suboptimal codegen (a gpu-scoped rather than cta-scoped atom instruction).
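The scope difference shows up in the generated PTX. A sketch (my illustration, not from the issue) of what the two intrinsics roughly lower to for a global-memory histogram:

```cuda
// Illustration (assumption, not from the issue): the same increment with
// device-scoped vs. block-scoped atomics, and the PTX each roughly produces.
__device__ void bump(int *histogram, int bin)
{
  atomicAdd(histogram + bin, 1);        // red.global.add.u32     (gpu scope, the default)
  atomicAdd_block(histogram + bin, 1);  // red.cta.global.add.u32 (cta scope, SM 60+)
}
```

The cta-scoped form only has to be ordered with respect to threads in the same block, which is exactly the guarantee block histogram needs.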

Describe the solution you'd like

Scoped atomics are a Pascal+ (SM 60) feature, so we can consider something along the lines of:

NV_IF_TARGET(NV_PROVIDES_SM_60, 
             (atomicAdd_block(histogram + items[i], 1);), 
             (atomicAdd(histogram + items[i], 1);));
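For reuse inside the atomic specialization, the dispatch could be wrapped in a small helper. A minimal sketch; the helper name and placement are hypothetical:

```cuda
#include <nv/target>  // NV_IF_TARGET / NV_PROVIDES_SM_60

// Hypothetical helper: block-scoped atomic add on SM 60+, falling back to a
// device-scoped add on older architectures where atomicAdd_block is unavailable.
template <typename T>
__device__ __forceinline__ void scoped_histogram_add(T *addr, T val)
{
  NV_IF_TARGET(NV_PROVIDES_SM_60,
               (atomicAdd_block(addr, val);),
               (atomicAdd(addr, val);));
}
```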

Potential benchmark for this change:

#include <cub/block/block_histogram.cuh>
#include <cub/block/block_load.cuh>

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>

#include <nvbench/nvbench.cuh>

template <int BlockThreads, int ItemsPerThread, int Bins>
__global__ void kernel(int *data, int *histogram)
{
  using histogram_t = cub::BlockHistogram<int,
                                          BlockThreads,
                                          ItemsPerThread,
                                          Bins,
                                          cub::BlockHistogramAlgorithm::BLOCK_HISTO_ATOMIC>;
  __shared__ typename histogram_t::TempStorage temp_storage;

  int thread_data[ItemsPerThread];
  cub::LoadDirectStriped<BlockThreads>(threadIdx.x, data, thread_data);
  histogram_t(temp_storage).Histogram(thread_data, histogram + Bins * blockIdx.x);
}

template <class BlockThreads, class ItemsPerThread, class Bins>
void bench(nvbench::state &state, nvbench::type_list<BlockThreads, ItemsPerThread, Bins>)
{
  constexpr int block_threads    = BlockThreads::value;
  constexpr int items_per_thread = ItemsPerThread::value;
  constexpr int bins             = Bins::value;

  int grid_size  = 800;
  int input_size = block_threads * items_per_thread;
  thrust::device_vector<int> data(input_size);
  thrust::device_vector<int> histogram(bins * grid_size);
  thrust::tabulate(data.begin(), data.end(), [] __host__ __device__(int i) { return i % bins; });

  state.exec([&](nvbench::launch &launch) {
    kernel<block_threads, items_per_thread, bins>
      <<<grid_size, block_threads, 0, launch.get_stream()>>>(thrust::raw_pointer_cast(data.data()),
                                                             thrust::raw_pointer_cast(
                                                               histogram.data()));
  });
}

using block_threads = nvbench::enum_type_list<128, 256, 512>;
using items         = nvbench::enum_type_list<1, 3, 7>;
using bins          = nvbench::enum_type_list<10, 50, 100>;

NVBENCH_BENCH_TYPES(bench, NVBENCH_TYPE_AXES(block_threads, items, bins));

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Status: In Review