# Exercise - Kernel Authoring - Book Histogram

Let's learn to use some advanced CUDA features like shared memory, atomics, and [cuda.cooperative](https://nvidia.github.io/cccl/python/cooperative.html) to write an efficient histogram kernel to determine the most frequent characters in a collection of books.

First, let's install our dependencies and download our dataset.

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG") and not os.path.exists("/ach-installed"): # If running in Google Colab:
  !curl -s -L -O https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_3/NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb
  !sudo dpkg -i NsightSystems-linux-cli-public-2025.3.1.90-3582212.deb > /dev/null
  !pip uninstall "cuda-python" --yes > /dev/null
  !pip install "numba-cuda" "cuda-cccl[test-cu12]" "nsightful[notebook] @ git+https://github.com/brycelelbach/nsightful.git" > /dev/null 2>&1
  open("/ach-installed", "a").close()

import numpy as np
import matplotlib.pyplot as plt
import nsightful
import urllib.request

In [None]:
urllib.request.urlretrieve(
  "https://drive.usercontent.google.com/download?id=1MW1lPgkTq3YG9ikuq6u3d9sfpt-wKQZ0&export=download",
  "books__15m.txt")

A histogram kernel counts how many times each value occurs in a dataset. To implement this, we create an array with one bin for every possible value (256 bins when counting 1-byte characters). Then, for each element in the dataset, we increment the bin indexed by that element's value.
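
As a reference point, the counting scheme can be sketched on the CPU in plain Python. This is a minimal illustration, not the GPU kernel; the function name `byte_histogram` and the sample string are made up for this example.

```python
def byte_histogram(data: bytes, bins: int = 256) -> list[int]:
    """Count how many times each byte value occurs in `data`."""
    counts = [0] * bins      # One bin per possible byte value.
    for value in data:       # Iterating over bytes yields ints in [0, 255].
        counts[value] += 1   # Increment the bin indexed by the value.
    return counts

sample = b"hello world"      # Hypothetical sample input.
counts = byte_histogram(sample)
print(counts[ord("l")], counts[ord("o")], sum(counts))
```

The sum of all bins always equals the number of input elements, which makes a handy correctness check for any histogram implementation.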

Let's try a simple way to implement this:

In [None]:
%%writefile histogram_global.py

from numba import cuda
import cupy as cp
import cupyx as cpx
import sys
import os

bins = 256

values = cp.fromfile("books__15m.txt", dtype=cp.uint8)
histogram = cp.zeros((bins), dtype=cp.int32)

threads_per_block = 512 if len(sys.argv) < 3 else int(sys.argv[2])
items_per_thread = 8 if len(sys.argv) < 4 else int(sys.argv[3])
items_per_block = threads_per_block * items_per_thread
blocks = len(values) // items_per_block
assert values.size % items_per_block == 0

@cuda.jit
def histogram_global(values, histogram):
  for i in range(items_per_thread):
    value = values[cuda.grid(1) * items_per_thread + i]
    old_count = histogram[value]
    new_count = old_count + 1
    histogram[value] = new_count

def launch(output):
  histogram[:] = 0 # Reset histogram to 0 each trial.
  histogram_global[blocks, threads_per_block](values, histogram)

  if (output):
    cp.savetxt(sys.stdout, histogram, delimiter=",", fmt="%i")

if len(sys.argv) >= 2 and sys.argv[1] == "output":
  launch(output=True)
elif os.getenv("NV_COMPUTE_PROFILER_PERFWORKS_DIR"): # Running under `ncu`.
  launch(output=False) # `ncu` slows things down; so just launch once when running under it.
else:
  launch(output=False)
  D = cpx.profiler.benchmark(launch, (False,), n_repeat=15, n_warmup=4).gpu_times[0]
  print(f"{D.mean():.3g} s ± {(D.std() / D.mean()):.2%} (mean ± relative stdev of {D.size} runs)")

Now let's make sure it runs and check the output.

In [None]:
!python histogram_global.py

In [None]:
histogram_output = !python histogram_global.py output
histogram = np.loadtxt(histogram_output, delimiter=",")

# Plot the most frequently occurring characters.
pairs = sorted(((i,c) for i,c in enumerate(histogram) if c), key=lambda x: x[1], reverse=True)[:20]
labels = [('SPACE' if i==32 else chr(i)) if 32<=i<=126 else f'0x{i:02X}' for i,_ in pairs]
plt.barh(labels[::-1], [c for _,c in pairs][::-1]); plt.xlabel('count'); plt.tight_layout(); plt.show()

Something is wrong: our counts are far too low, and the most common characters don't make much sense. Many of our increments are getting lost!

What's happening here is called a data race. Many threads read, modify, and write the same histogram bins at the same time without any synchronization, so their updates can overwrite each other.

Imagine that two threads are trying to update the same bin.

- Thread 0 reads the count of the bin, which is 0, and stores it in its local variable `old_count`.
- Thread 0 adds 1 to its `old_count`, producing a `new_count` of 1.
- Thread 1 reads the count of the bin, which is still 0, and stores it in its local variable `old_count`.
- Thread 1 adds 1 to its `old_count`, producing a `new_count` of 1.
- Thread 0 stores `new_count` to the bin, setting it to 1.
- Thread 1 stores `new_count` to the bin, setting it to 1, and losing the increment from thread 0!
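
The interleaving above can be replayed deterministically in ordinary Python. This is only a sequential simulation of the two threads; the variable names `t0_old`, `t1_old`, and so on are invented for this sketch.

```python
def replay_lost_update() -> int:
    """Replay the six steps above in order and return the final bin count."""
    bin_count = 0          # The shared histogram bin.

    t0_old = bin_count     # Thread 0 reads 0.
    t0_new = t0_old + 1    # Thread 0 computes 1.
    t1_old = bin_count     # Thread 1 reads 0 (thread 0 has not stored yet).
    t1_new = t1_old + 1    # Thread 1 computes 1.
    bin_count = t0_new     # Thread 0 stores 1.
    bin_count = t1_new     # Thread 1 stores 1, overwriting thread 0's update.

    return bin_count

final_count = replay_lost_update()
print(f"two increments, final count = {final_count}")
```

Two increments were issued, but the bin ends at 1 rather than 2. On the GPU, the same thing happens across thousands of threads at once.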

To fix this, we need to use atomic operations. `cuda.atomic.add(array, index, value)` will perform `array[index] += value` as a single indivisible operation. This will ensure that no increments get lost.
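
To see what "indivisible" buys us, here is a CPU-side analogy using Python threads and a lock. It is not CUDA code, and the helper `atomic_add` is a hypothetical stand-in, not the Numba function; the lock simply guarantees that no other thread can slip in between the read and the write.

```python
import threading

counts = [0] * 256
counts_lock = threading.Lock()

def atomic_add(array, index, value):
    """CPU stand-in for an atomic add: the read-modify-write cannot be interleaved."""
    with counts_lock:
        array[index] += value

def worker(data):
    for value in data:
        atomic_add(counts, value, 1)

data = b"abracadabra" * 1000   # Hypothetical input shared by all threads.
threads = [threading.Thread(target=worker, args=(data,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(counts)
print(total == 4 * len(data))
```

Because every update happens under the lock, the total always matches the number of increments issued, no matter how the threads are scheduled.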

**TODO: Fix the code above by modifying it to use `cuda.atomic.add`.**

Now let's profile our code.

In [None]:
!ncu -f --kernel-name regex:histogram_global --set full -o histogram_global python histogram_global.py
histogram_global_csv = !ncu --import histogram_global.ncu-rep --csv

In [None]:
nsightful.display_ncu_csv_in_notebook(histogram_global_csv)

The profile shows that our code is quite slow: look at the memory workload tab and see how low the throughput is!

One improvement is to separate loading from `values` from updating the histogram, and to perform striped loads, where consecutive threads read consecutive elements. We'll use [cuda.cooperative](https://nvidia.github.io/cccl/python/cooperative.html)'s block load instead of writing this by hand.
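
The difference between the access pattern in our first kernel (a blocked arrangement, where each thread owns a contiguous run of items) and a striped arrangement comes down to index arithmetic. The sketch below computes offsets within a single block's tile in plain Python; the helper names `blocked_index` and `striped_index` are invented here, and the tiny sizes are chosen only for readability.

```python
def blocked_index(thread, item, items_per_thread):
    """Blocked arrangement: each thread owns a contiguous run of items."""
    return thread * items_per_thread + item

def striped_index(thread, item, threads_per_block):
    """Striped arrangement: item i of thread t sits at i * threads_per_block + t."""
    return item * threads_per_block + thread

threads_per_block = 4   # Tiny values so the pattern is easy to read.
items_per_thread = 2

# Offsets touched by every thread when it loads its first item (item 0).
blocked_first = [blocked_index(t, 0, items_per_thread) for t in range(threads_per_block)]
striped_first = [striped_index(t, 0, threads_per_block) for t in range(threads_per_block)]

print("blocked:", blocked_first)
print("striped:", striped_first)
```

With the striped arrangement, neighboring threads touch neighboring addresses on each load step, which is the pattern GPU memory systems handle most efficiently. Both arrangements cover exactly the same set of elements; only the assignment of elements to threads differs.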

**TODO: Rewrite the code below to use `cuda.cooperative` to load from `values` into local memory.**
- **Create a `coop.block.load(dtype, threads_per_block, items_per_thread, algorithm)` object outside of the kernel.**
- **Make sure to link the algorithm object to the kernel by adding a `link` parameter to the decorator.**
- **Create storage for the items we'll load with `cuda.local.array`.**

**TODO: Look at the profile trace and code and think about how we could improve performance further.**

**HINT:**
- **What sorts of operations are we performing? Are they expensive? Can we make the code more efficient by reducing the number of expensive operations we perform?**
- **You can allocate memory accessible by the entire block with `cuda.shared.array`.**
- **You can synchronize all threads within a block with `cuda.syncthreads`.**
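
One possible reading of the hints is to give each block a private histogram and merge it into the global one at the end. The CPU sketch below simulates that idea sequentially to show how it reduces the number of updates that reach the global histogram. It is an illustration under stated assumptions, not the exercise solution: the name `privatized_histogram`, the sample text, and the choice to skip empty bins during the merge are all made up for this example.

```python
def privatized_histogram(data: bytes, items_per_block: int, bins: int = 256):
    """Simulate per-block private histograms that are merged into a global one."""
    global_hist = [0] * bins
    global_updates = 0                    # Updates that would hit global memory.

    for start in range(0, len(data), items_per_block):
        block_hist = [0] * bins           # Stand-in for a `cuda.shared.array`.
        for value in data[start:start + items_per_block]:
            block_hist[value] += 1        # Block-local update.
        # On the GPU, `cuda.syncthreads()` would separate these two phases.
        for b, count in enumerate(block_hist):
            if count:
                global_hist[b] += count   # One global update per non-empty bin.
                global_updates += 1

    return global_hist, global_updates

data = b"the quick brown fox jumps over the lazy dog " * 64
hist, global_updates = privatized_histogram(data, items_per_block=512)
print(f"{len(data)} items, {global_updates} global updates")
```

In this simulation, the number of global updates is bounded by the number of blocks times the number of bins, not by the number of input items.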

In [None]:
%%writefile histogram_localized.py

from numba import cuda
import cupy as cp
import cupyx as cpx
import sys
import os

bins = 256

values = cp.fromfile("books__15m.txt", dtype=cp.uint8)
histogram = cp.zeros((bins), dtype=cp.int32)

threads_per_block = 512 if len(sys.argv) < 3 else int(sys.argv[2])
items_per_thread = 8 if len(sys.argv) < 4 else int(sys.argv[3])
items_per_block = threads_per_block * items_per_thread
blocks = len(values) // items_per_block
assert values.size % items_per_block == 0

@cuda.jit
def histogram_localized(values, histogram):
  for i in range(items_per_thread):
    value = values[cuda.grid(1) * items_per_thread + i]
    old_count = histogram[value]
    new_count = old_count + 1
    histogram[value] = new_count

def launch(check, output):
  histogram[:] = 0 # Reset histogram to 0 each trial.
  histogram_localized[blocks, threads_per_block](values, histogram)

  if (check):
    assert cp.sum(histogram) == len(values)

  if (output):
    cp.savetxt(sys.stdout, histogram, delimiter=",", fmt="%i")

if len(sys.argv) >= 2 and sys.argv[1] == "output":
  launch(check=True, output=True)
elif os.getenv("NV_COMPUTE_PROFILER_PERFWORKS_DIR"): # Running under `ncu`.
  launch(check=False, output=False) # `ncu` slows things down; so just launch once when running under it.
else:
  launch(check=True, output=False)
  D = cpx.profiler.benchmark(launch, (False, False), n_repeat=15, n_warmup=4).gpu_times[0]
  print(f"{D.mean():.3g} s ± {(D.std() / D.mean()):.2%} (mean ± relative stdev of {D.size} runs)")

Now let's make sure it runs, check the output, and then profile it.

In [None]:
!python histogram_localized.py

In [None]:
histogram_output = !python histogram_localized.py output
histogram = np.loadtxt(histogram_output, delimiter=",")

# Plot the most frequently occurring characters.
pairs = sorted(((i,c) for i,c in enumerate(histogram) if c), key=lambda x: x[1], reverse=True)[:20]
labels = [('SPACE' if i==32 else chr(i)) if 32<=i<=126 else f'0x{i:02X}' for i,_ in pairs]
plt.barh(labels[::-1], [c for _,c in pairs][::-1]); plt.xlabel('count'); plt.tight_layout(); plt.show()

In [None]:
!ncu -f --kernel-name regex:histogram_localized --set full -o histogram_localized python histogram_localized.py
histogram_localized_csv = !ncu --import histogram_localized.ncu-rep --csv

In [None]:
nsightful.display_ncu_csv_in_notebook(histogram_localized_csv)

In [None]:
histogram_global_duration    = !python histogram_global.py
histogram_localized_duration = !python histogram_localized.py
speedup = float(histogram_global_duration[0].split()[0]) / float(histogram_localized_duration[0].split()[0])

print(f"histogram_global:    {histogram_global_duration[0]}")
print(f"histogram_localized: {histogram_localized_duration[0]}")
print(f"histogram_localized speedup over histogram_global: {speedup:.2f}")