# Atomics

## Content

* [Histogram of Temperature Grid](#Histogram-of-Temperature-Grid)
* [Data Race](#Data-Race)
* [Exercise: Fix Histogram](03.03.02-Exercise-Fix-Histogram.ipynb)

---
We recently fixed a bug caused by our thread hierarchy, which might prompt the question: why did we need that hierarchy in the first place? 
To illustrate its value, let’s look at a related problem: computing a histogram of our temperature grid.

## Histogram of Temperature Grid
A histogram helps visualize the distribution of temperatures by grouping values into "bins".
In this example, each bin covers a 10-degree range, so the first bin represents temperatures in `[0, 10)`, the second in `[10, 20)`, and so on.

![Histogram](Images/histogram.png "Histogram")

Given a cell’s temperature, how do we determine the bin it belongs to? We divide by the bin width and truncate the result to an integer:

```c++
int bin = static_cast<int>(temperatures[cell] / bin_width);
```

So, a temperature of 14 falls into bin 1, while 4 maps to bin 0. 
Next, we’ll implement this logic in a CUDA kernel, assigning one thread per cell to calculate its bin.
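
Before moving to the GPU, the binning rule can be checked on the host. The sketch below is plain C++ and not part of the tutorial sources: `bin_index` and `reference_histogram` are hypothetical helper names, and it assumes non-negative temperatures that all fall inside the histogram's range. A sequential version like this can also serve as a reference when checking GPU results.

```c++
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the tutorial sources): maps a temperature
// to its bin by dividing by the bin width and truncating toward zero.
// Assumes non-negative temperatures.
int bin_index(float temperature, float bin_width)
{
  return static_cast<int>(temperature / bin_width);
}

// Sequential reference histogram: one increment per cell, no concurrency.
// Assumes every temperature falls into [0, num_bins * bin_width).
std::vector<int> reference_histogram(const std::vector<float>& temperatures,
                                     float bin_width, std::size_t num_bins)
{
  std::vector<int> histogram(num_bins, 0);
  for (float t : temperatures) {
    histogram[bin_index(t, bin_width)] += 1;
  }
  return histogram;
}
```

With a bin width of 10, `bin_index(14.0f, 10.0f)` gives 1 and `bin_index(4.0f, 10.0f)` gives 0, matching the examples above.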

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/03.03-Atomics/Sources/ach.cuh -nv -O Sources/ach.cuh
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/03.03-Atomics/Sources/__init__.py -nv -O Sources/__init__.py
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/03.03-Atomics/Sources/ach.py -nv -O Sources/ach.py

In [None]:
%%writefile Sources/histogram-bug.cpp
#include "ach.cuh"

constexpr float bin_width = 10;

__global__ void histogram_kernel(cuda::std::span<float> temperatures,
                                 cuda::std::span<int> histogram)
{
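  int cell = blockIdx.x * blockDim.x + threadIdx.x;
  if (cell >= static_cast<int>(temperatures.size())) return; // spare threads in the last block

  // Map the cell's temperature to its bin, then bump that bin's counter
  int bin = static_cast<int>(temperatures[cell] / bin_width);
  int old_count = histogram[bin];
  int new_count = old_count + 1;
  histogram[bin] = new_count;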
}

void histogram(cuda::std::span<float> temperatures,
               cuda::std::span<int> histogram,
               cudaStream_t stream)
{
  int block_size = 256;
  int grid_size = cuda::ceil_div(temperatures.size(), block_size);
  histogram_kernel<<<grid_size, block_size, 0, stream>>>(
    temperatures, histogram);
}

In [None]:
import Sources.ach
Sources.ach.run("Sources/histogram-bug.cpp")

## Data Race
Something went wrong. 
Despite having four million cells, our histogram comes out nearly empty. 
The culprit is in this kernel code:

```c++
int old_count = histogram[bin];
int new_count = old_count + 1;
histogram[bin] = new_count;
```

Millions of threads run this code concurrently, and many of them read and write the same element of the single shared `histogram` span without any synchronization. That is a data race.
For example, if two threads increment the same bin at the same time,
both read the same initial value, both compute the same new value, and one write overwrites the other,
so the bin increases by one instead of two.
Repeated across millions of cells, these lost updates leave the histogram nearly empty.

![Data Race](Images/race.png "Data Race")
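
The lost update can be replayed deterministically on the host. The sketch below is plain sequential C++ that models two threads, A and B, by hand. The function names are hypothetical and no real concurrency is involved; it only illustrates the interleaving described above, assuming both threads target the same bin.

```c++
// Hypothetical model of the race: both simulated threads read the bin
// before either one writes it back.
int racy_double_increment(int bin_count)
{
  int old_count_a = bin_count; // thread A reads
  int old_count_b = bin_count; // thread B reads the same value

  int new_count_a = old_count_a + 1;
  int new_count_b = old_count_b + 1;

  bin_count = new_count_a; // thread A writes
  bin_count = new_count_b; // thread B overwrites A's update with the same value
  return bin_count;
}

// Hypothetical model of the desired behavior: thread A finishes its whole
// read-modify-write before thread B starts.
int serialized_double_increment(int bin_count)
{
  int old_count_a = bin_count;
  bin_count = old_count_a + 1;

  int old_count_b = bin_count;
  bin_count = old_count_b + 1;
  return bin_count;
}
```

Starting from a count of 0, `racy_double_increment` returns 1 while `serialized_double_increment` returns 2: one of the two increments is lost in the racy interleaving.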

To fix this, we need to make the read, modify, and write steps a single, indivisible operation. 
CUDA provides atomic operations that handle concurrency safely, ensuring we don’t lose any increments in our histogram.

In [None]:
%%writefile Sources/atomic.cpp
#include <cuda/std/span>
#include <cuda/std/atomic>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream> // std::cout, std::endl

__global__ void kernel(cuda::std::span<int> count)
{
    // Wrap data in atomic_ref
    cuda::std::atomic_ref<int> ref(count[0]);

    // Atomically increment the underlying value
    ref.fetch_add(1);
}

int main()
{
    thrust::device_vector<int> count(1);

    int threads_in_block = 256;
    int blocks_in_grid = 42;

    kernel<<<blocks_in_grid, threads_in_block>>>(
        cuda::std::span<int>{thrust::raw_pointer_cast(count.data()), 1});

    cudaDeviceSynchronize();

    thrust::host_vector<int> count_host = count;
    std::cout << "expected: " << threads_in_block * blocks_in_grid << std::endl;
    std::cout << "observed: " << count_host[0] << std::endl;
}

In [None]:
!nvcc -arch=native -x cu Sources/atomic.cpp -o /tmp/a.out -run

In the example above, we reproduce our histogram kernel’s structure, where multiple threads attempt to increment the same memory location. 
This time, however, we wrap the memory reference in a `cuda::std::atomic_ref<int>`:

```c++
cuda::std::atomic_ref<int> ref(count[0]);
```

Here, `int` indicates the type of the underlying value, and the constructor accepts a reference to the memory we want to modify. 
The resulting `atomic_ref` object offers atomic operations, such as:

```c++
ref.fetch_add(1);
```

This call performs an indivisible read-modify-write operation: it reads the current value of `count[0]`, adds one, and writes the result back atomically.
You can think of an atomic operation as sending memory an instruction ("add one to whatever is there") rather than a precomputed value.

![Atomics](Images/atomic.png "Atomics")


In the figure, the "?" stands for whatever value `count[0]` holds when the operation executes; that value is incremented by one and stored in a single step.
It doesn’t matter how many threads do this concurrently: the increments are applied one at a time, so none are lost.
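
The same guarantee can be observed on the host with standard C++. The sketch below is an analogy, not the tutorial's code: it uses `std::atomic<int>` and `std::thread` instead of `cuda::std::atomic_ref` and a CUDA kernel, and `count_with_atomics` is a hypothetical name. Depending on the toolchain, it may need to be compiled with `-pthread`.

```c++
#include <atomic>
#include <thread>
#include <vector>

// Host-side analogy (assumption: std::atomic stands in for the tutorial's
// cuda::std::atomic_ref). Every thread increments one shared counter.
int count_with_atomics(int num_threads, int increments_per_thread)
{
  std::atomic<int> count{0};

  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&count, increments_per_thread] {
      for (int i = 0; i < increments_per_thread; ++i) {
        count.fetch_add(1); // indivisible read-modify-write
      }
    });
  }

  // Wait for all threads before reading the final value
  for (std::thread& thread : threads) {
    thread.join();
  }
  return count.load();
}
```

Calling `count_with_atomics(42, 256)` mirrors the launch configuration above and returns 10752 regardless of how the threads interleave.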

---

Next, you will fix the histogram kernel using atomics.
Move on to the [next exercise](03.03.02-Exercise-Fix-Histogram.ipynb).