<img src="Images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;">

## Exercise: Coarsening

To compute the [block histogram using CUB](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockHistogram.html), you will need to do the following four things:

1. Declare the histogram parameters as compile-time constants: the block size, the number of bins, and the number of items per thread:

```c++
// block size has to be known at compile time
constexpr int block_size = 256;
constexpr int items_per_thread = 1;
constexpr int num_bins = 10;
```

2. Declare the block histogram type using the previously declared values:

```c++
using histogram_t =
  cub::BlockHistogram<
    int,              // type of the samples (bin indices) being histogrammed
    block_size,       // number of threads in a block
    items_per_thread, // number of bin indices that each thread contributes
    num_bins>;        // number of bins in the histogram
```

3. Allocate temporary storage in shared memory:

```c++
__shared__ typename histogram_t::TempStorage temp_storage;
```

4. Construct the histogram object from the temporary storage and call its `Histogram` member function:

```c++
int thread_bins[items_per_thread] = {...};
histogram_t{temp_storage}.Histogram(thread_bins, block_histogram);
```
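
To make the semantics of these four steps concrete, here is a minimal host-side sketch of what a block-wide histogram computes: every thread contributes `items_per_thread` bin indices, and the block ends up with one counter per bin. The function `block_histogram_reference`, the flat input layout (thread `t` owns elements `t * items_per_thread` onward), and the tiny sizes are assumptions made for illustration; this is not CUB code and says nothing about how CUB implements the operation.

```c++
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Deliberately tiny values so the result is easy to trace by hand.
constexpr int block_size = 4;
constexpr int items_per_thread = 2;
constexpr int num_bins = 3;

// Hypothetical sequential stand-in for a block-wide histogram.
// thread_bins holds the bin indices of all "threads" back to back:
// thread t owns thread_bins[t * items_per_thread + 0 .. items_per_thread - 1].
std::array<int, num_bins> block_histogram_reference(const std::vector<int>& thread_bins)
{
  assert(thread_bins.size() == static_cast<std::size_t>(block_size * items_per_thread));

  std::array<int, num_bins> histogram{}; // all counters start at zero
  for (int thread = 0; thread < block_size; ++thread) {
    for (int item = 0; item < items_per_thread; ++item) {
      const int bin = thread_bins[thread * items_per_thread + item];
      assert(bin >= 0 && bin < num_bins); // every index must name a valid bin
      ++histogram[bin];
    }
  }
  return histogram;
}
```

Calling `block_histogram_reference({0, 1, 1, 1, 2, 0, 1, 2})` models four threads with two items each. On the GPU, the same per-thread arrays are passed to `Histogram`, and the threads of the block cooperate to produce the counters.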

Modify the kernel below to compute the block histogram with `cub::BlockHistogram` and store it in `block_histogram`:

In [None]:
#@title Google Colab Setup
!mkdir -p Sources
!wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/03.06-Cooperative-Algorithms/Sources/ach.cuh -nv -O Sources/ach.cuh
!wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/03.06-Cooperative-Algorithms/Sources/__init__.py -nv -O Sources/__init__.py
!wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/03.06-Cooperative-Algorithms/Sources/ach.py -nv -O Sources/ach.py

<details>
<summary>Original code in case you need it</summary>

```c++
%%writefile Sources/cooperative.cpp
#include "ach.cuh"

constexpr int block_size = 256;
constexpr int items_per_thread = 1;
constexpr int num_bins = 10;
constexpr float bin_width = 10;

__global__ void histogram_kernel(cuda::std::span<float> temperatures,
                                 cuda::std::span<int> histogram) 
{
  __shared__ int block_histogram[num_bins];

  int cell = blockIdx.x * blockDim.x + threadIdx.x;
  int bins[items_per_thread] = {static_cast<int>(temperatures[cell] / bin_width)};

  ??? // 1. Use `cub::BlockHistogram` to compute the block histogram

  __syncthreads();
  if (threadIdx.x < num_bins) {
    cuda::atomic_ref<int, cuda::thread_scope_device> ref(histogram[threadIdx.x]);
    ref.fetch_add(block_histogram[threadIdx.x]);
  }
}

void histogram(cuda::std::span<float> temperatures, 
               cuda::std::span<int> histogram, 
               cudaStream_t stream) 
{
  int grid_size = cuda::ceil_div(temperatures.size(), block_size);
  histogram_kernel<<<grid_size, block_size, 0, stream>>>(
    temperatures, histogram);
}

```
    
</details>

In [None]:
%%writefile Sources/cooperative.cpp
#include "ach.cuh"

constexpr int block_size = 256;
constexpr int items_per_thread = 1;
constexpr int num_bins = 10;
constexpr float bin_width = 10;

__global__ void histogram_kernel(cuda::std::span<float> temperatures,
                                 cuda::std::span<int> histogram) 
{
  __shared__ int block_histogram[num_bins];

  int cell = blockIdx.x * blockDim.x + threadIdx.x;
  int bins[items_per_thread] = {static_cast<int>(temperatures[cell] / bin_width)};

  // TODO: Use `cub::BlockHistogram` to compute the block histogram

  __syncthreads();
  if (threadIdx.x < num_bins) {
    cuda::atomic_ref<int, cuda::thread_scope_device> ref(histogram[threadIdx.x]);
    ref.fetch_add(block_histogram[threadIdx.x]);
  }
}

void histogram(cuda::std::span<float> temperatures, 
               cuda::std::span<int> histogram, 
               cudaStream_t stream) 
{
  int grid_size = cuda::ceil_div(temperatures.size(), block_size);
  histogram_kernel<<<grid_size, block_size, 0, stream>>>(
    temperatures, histogram);
}


In [None]:
import Sources.ach
Sources.ach.run("Sources/cooperative.cpp")
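
When checking your kernel, it can help to compare its output against a sequential reference. The sketch below bins each temperature the same way the kernel does, with `static_cast<int>(temperature / bin_width)`. The helper `reference_histogram` is hypothetical and not part of `ach.cuh`, and the code assumes every temperature lies in the range `[0, num_bins * bin_width)`.

```c++
#include <cassert>
#include <vector>

// Hypothetical CPU reference, not part of ach.cuh.
// Counts how many temperatures fall into each bin of width bin_width.
std::vector<int> reference_histogram(const std::vector<float>& temperatures,
                                     int num_bins,
                                     float bin_width)
{
  std::vector<int> histogram(num_bins, 0);
  for (float temperature : temperatures) {
    // Same mapping as in the kernel: truncate the quotient to get the bin index.
    const int bin = static_cast<int>(temperature / bin_width);
    // The index must be valid; this holds when 0 <= temperature < num_bins * bin_width.
    assert(bin >= 0 && bin < num_bins);
    ++histogram[bin];
  }
  return histogram;
}
```

For example, `reference_histogram({5.0f, 15.0f, 19.5f, 99.0f}, 10, 10.0f)` places one value in bin 0, two in bin 1, and one in bin 9. If the input is copied back to the host, the counters produced by the kernel should match this reference bin for bin.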

If you’re unsure how to proceed, consider expanding this section for guidance. Use the hint only after giving the problem a genuine attempt.

<details>

  <summary>Hints</summary>

  - `cub::BlockHistogram` needs temporary storage in shared memory

</details>

Open this section only after you’ve made a serious attempt at solving the problem. Once you’ve completed your solution, compare it with the reference provided here to evaluate your approach and identify any potential improvements.

<details>
  <summary>Solution</summary>

  Key points:

  - Use `cub::BlockHistogram` to compute the block histogram
  - Allocate temporary storage in shared memory
  - Make sure to synchronize before reading the block histogram

  Solution:
  ```c++
  using histogram_t = cub::BlockHistogram<int, block_size, items_per_thread, num_bins>;
  __shared__ typename histogram_t::TempStorage temp_storage;
  histogram_t(temp_storage).Histogram(bins, block_histogram);
  __syncthreads();
  ```

  You can find the full solution [here](Solutions/cooperative.cu).
</details>
