# Summing a Vector
Numba exposes many CUDA features, including shared memory. To demonstrate shared memory, let’s reimplement a well-known CUDA approach to summing a vector, which works by “folding” the data using successively fewer threads.

Note that this is a fairly naive implementation, and there are more efficient ways of implementing reductions using Numba; see Monte Carlo Integration for an example.

In [2]:
import numpy as np
from numba import cuda
from numba.types import int32

Let’s create some one-dimensional data that we’ll use to demonstrate the kernel itself:

In [3]:
# generate data
a = cuda.to_device(np.arange(1024))
nelem = len(a)

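Here is a version of the kernel implemented using Numba. It assumes a single thread block with one thread per element and a power-of-two block size, and it writes the result back into the first element of the input array: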

In [4]:
@cuda.jit
def array_sum(data):
    tid = cuda.threadIdx.x
    size = len(data)
    if tid < size:
        i = cuda.grid(1)

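        # Declare an array in shared memory. The shape must be a
        # compile-time constant, so the global `nelem` is fixed when
        # the kernel is compiled.
        shr = cuda.shared.array(nelem, int32)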
        shr[tid] = data[i]

        # Ensure writes to shared memory are visible
        # to all threads before reducing
        cuda.syncthreads()

        s = 1
        while s < cuda.blockDim.x:
            if tid % (2 * s) == 0:
                # Stride by `s` and add
                shr[tid] += shr[tid + s]
            s *= 2
            cuda.syncthreads()

        # After the loop, the zeroth element contains the sum
        if tid == 0:
            data[tid] = shr[tid]
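
To make the “folding” pattern easier to follow, here is a minimal host-side sketch that mimics the kernel's loop in plain Python. It is not GPU code: the function name `fold_interleaved` and the list `shr` are illustrative stand-ins for the kernel and its shared-memory array, and the input length is assumed to be a power of two.

```python
def fold_interleaved(values):
    """Pure-Python model of the kernel's loop (illustrative, not GPU code).

    Assumes len(values) is a power of two.
    """
    shr = list(values)        # stands in for the shared-memory array
    n = len(shr)              # stands in for cuda.blockDim.x
    active_per_step = []
    s = 1
    while s < n:
        # Only threads whose index is a multiple of 2*s do work in this step
        active = [tid for tid in range(n) if tid % (2 * s) == 0]
        for tid in active:
            shr[tid] += shr[tid + s]
        active_per_step.append(len(active))
        s *= 2
    return shr[0], active_per_step


total, active = fold_interleaved(range(8))
print(total)   # 28
print(active)  # [4, 2, 1]
```

The second value shows how the number of working threads halves at every step, and how the threads that remain active are spread out across the block rather than packed together.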

We can run the kernel with a single block of `nelem` threads and verify that it produces the same result as summing the data on the host:

In [5]:
array_sum[1, nelem](a)
print(a[0])                  # 523776
print(sum(np.arange(1024)))  # 523776



523776
523776


## Challenge
This algorithm can be greatly improved upon by redesigning the inner loop to use sequential memory accesses, and even further by using strategies that keep more threads active and working, since in this example most threads quickly become idle.
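
As one possible starting point for the first suggestion, the sketch below models a sequential-addressing fold in plain Python. It is a host-side illustration only, not a Numba kernel: the name `fold_sequential` is hypothetical, the list `shr` stands in for shared memory, and the input length is assumed to be a power of two.

```python
def fold_sequential(values):
    """Pure-Python model of a sequential-addressing fold (not GPU code).

    Assumes len(values) is a power of two.
    """
    shr = list(values)        # stands in for the shared-memory array
    s = len(shr) // 2         # start with a stride of half the block
    active_per_step = []
    while s > 0:
        # The first `s` threads are active; they read shr[s:2*s] and
        # accumulate into shr[0:s], so both ranges are contiguous
        for tid in range(s):
            shr[tid] += shr[tid + s]
        active_per_step.append(s)
        s //= 2
    return shr[0], active_per_step


total, active = fold_sequential(range(8))
print(total)   # 28
print(active)  # [4, 2, 1]
```

The number of active threads per step is unchanged in this model, but the active threads are now the lowest-numbered ones and touch adjacent elements. Translating this into a kernel would replace the `tid % (2 * s) == 0` test with `tid < s` and a halving stride; keeping more threads busy, the second suggestion, is not addressed here.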