# GTC 2020 Numba Tutorial Notebook 3: Memory Management

## Managing GPU Memory

During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs.  If you want to reduce the impact of host-to-device/device-to-host bandwidth, it is best to copy data to the GPU explicitly and leave it there to amortize the cost over multiple function calls.  In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also be a performance improvement.

Let's create our example addition ufunc again:

In [None]:
from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

In [None]:
n = 5000000
x = np.arange(n).astype(np.float32)
y = 2 * x

In [None]:
%timeit add_ufunc(x, y)  # Baseline performance with host arrays
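
To get a feel for why transfers matter in the baseline above, here is a back-of-envelope estimate of the data moved by a single call on host arrays. It is only a sketch: the bandwidth figure is a hypothetical placeholder rather than a measurement, and the names `ASSUMED_BANDWIDTH_BYTES_PER_S`, `array_bytes`, and `bytes_moved` are introduced purely for illustration.

```python
# Rough estimate of the data moved by one add_ufunc(x, y) call on host arrays.
# ASSUMPTION: 12 GB/s is a hypothetical host<->device bandwidth, not a measurement.
ASSUMED_BANDWIDTH_BYTES_PER_S = 12e9

n = 5_000_000            # same array length as in the cell above
bytes_per_float32 = 4

array_bytes = n * bytes_per_float32   # size of one float32 array
bytes_moved = 3 * array_bytes         # x and y go to the GPU, the result comes back

estimated_transfer_s = bytes_moved / ASSUMED_BANDWIDTH_BYTES_PER_S

print(f"one array: {array_bytes / 1e6:.0f} MB")
print(f"moved per call: {bytes_moved / 1e6:.0f} MB")
print(f"estimated transfer time: {estimated_transfer_s * 1e3:.1f} ms")
```

Under that assumed bandwidth, every call pays for roughly 60 MB of traffic before any arithmetic happens, which is the cost that keeping data on the GPU avoids. Substitute a bandwidth measured on your own system to get a more realistic number.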

There are two ways to create GPU arrays to pass to Numba: Numba's own device arrays, described here, and CuPy arrays, covered later in this notebook.  Numba's GPU array object is not as fully featured as a CuPy array, but it can be useful if your application doesn't need the rest of CuPy.  The `numba.cuda` module includes a function that copies host data to the GPU and returns a CUDA device array:

In [None]:
from numba import cuda

x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

print(x_device)
print(x_device.shape)
print(x_device.dtype)

Device arrays can be passed to Numba's compiled CUDA functions just like NumPy arrays, but without the copy overhead:

In [None]:
%timeit add_ufunc(x_device, y_device)

That is already a decent performance improvement, but we are still allocating a device array for the output of the ufunc and copying it back to the host.  We can create the output buffer ahead of time with the `numba.cuda.device_array()` function:

In [None]:
out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()

And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:

In [None]:
%timeit add_ufunc(x_device, y_device, out=out_device); cuda.synchronize()

(Don't worry about what `cuda.synchronize()` does for now.  We'll talk about it in the next section.  It is present here to ensure that benchmark times are accurate.)
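
As an aside, the `out` keyword follows the same convention that NumPy's own ufuncs use on the CPU. The following host-only sketch (the names `a`, `b`, and `buf` are illustrative) shows the pattern without needing a GPU: the result is written into a preallocated buffer instead of a freshly allocated array.

```python
import numpy as np

a = np.arange(5, dtype=np.float32)
b = 2 * a
buf = np.empty_like(a)            # allocated once, contents uninitialized

result = np.add(a, b, out=buf)    # writes into buf instead of allocating

print(result is buf)              # the ufunc returns the buffer it was given
print(buf)
```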

Now that we have removed the device allocation and copy steps, the computation runs *much* faster than before.  When we want to bring the device array back to the host memory, we can use the `copy_to_host()` method:

In [None]:
out_host = out_device.copy_to_host()
print(out_host[:10])
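
The introduction mentioned allocating GPU arrays once and refilling them with data from the host. One way to sketch that pattern, assuming the `copy_to_device()` method of Numba device arrays, is shown below. The helper `roundtrip_batches` and the `HAVE_GPU` flag are hypothetical names introduced for this example, and the host-only fallback exists only so the sketch still runs on a machine without a CUDA device.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except ImportError:  # Numba is not installed
    HAVE_GPU = False


def roundtrip_batches(batches):
    """Send each host batch through one reused device buffer and copy it back."""
    if not HAVE_GPU:
        # Fallback for machines without a CUDA device: plain host copies.
        return [b.copy() for b in batches]

    # Allocate the device buffer once, sized for a single batch.
    buf = cuda.device_array(shape=batches[0].shape, dtype=batches[0].dtype)
    results = []
    for b in batches:
        buf.copy_to_device(b)               # refill the existing allocation
        results.append(buf.copy_to_host())  # fresh host array for each result
    return results


batches = [np.full(4, i, dtype=np.float32) for i in range(3)]
results = roundtrip_batches(batches)
print(results)
```

In a real pipeline a ufunc or kernel call would sit between the refill and the copy back. The point here is only that the device allocation happens once, outside the loop.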

## CuPy Interoperability

Recent versions of CuPy (>= 4.5) support [Numba's generic CUDA array interface](https://numba.pydata.org/numba-doc/latest/cuda/cuda_array_interface.html).  We can see this on a CuPy array by looking for the `__cuda_array_interface__` attribute:

In [None]:
import cupy as cp

x_cp = cp.asarray(x)
y_cp = cp.asarray(y)
out_cp = cp.empty_like(y_cp)

x_cp.__cuda_array_interface__

This describes the CuPy array in a portable way so that other packages, like Numba, can use it:

In [None]:
add_ufunc(x_cp, y_cp, out=out_cp)

print(out_cp[:10])

This offers the same speed benefits as allocating the CUDA device array in Numba:

In [None]:
%timeit add_ufunc(x_cp, y_cp, out=out_cp); cuda.synchronize()

(In fact, depending on your system, you may notice that CuPy arrays have even less overhead than Numba arrays!  This is due to a [performance regression](https://github.com/numba/numba/issues/5191) in Numba's ufuncs that will be fixed in a future release.)

Note that Numba won't automatically create a CuPy array for the ufunc output, so if you want to ensure the ufunc result is saved in a CuPy array, be sure to pass an explicit `out` argument to the ufunc, as shown above.
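
To get a sense of what the `__cuda_array_interface__` dictionary shown earlier contains without needing a GPU, it can help to look at its host-side counterpart. The sketch below inspects NumPy's `__array_interface__`, which shares several field names with the CUDA version (such as `shape`, `typestr`, and `data`). This is an analogy for orientation only, not a description of every field in the CUDA interface.

```python
import numpy as np

a = np.arange(5, dtype=np.float32)
iface = a.__array_interface__    # host-side counterpart of __cuda_array_interface__

print(iface["shape"])      # array dimensions as a tuple
print(iface["typestr"])    # byte order, kind, and item size in bytes
print(iface["data"])       # (pointer as an integer, read-only flag)
```

A consumer such as Numba only needs this kind of description (where the memory lives, its shape, and its element type) to operate on an array it did not allocate itself.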

## Exercise

Given these ufuncs:

In [None]:
import math

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def make_pulses(i, period, amplitude):
    return max(math.sin(i / period) - 0.3, 0.0) * amplitude

n = 100000
noise = (np.random.normal(size=n) * 3).astype(np.float32)
t = np.arange(n, dtype=np.float32)
period = n / 23

Convert the code below to use device allocations so that host<->device copies happen only at the beginning and end, then benchmark the change in performance.  Use either CuPy arrays or Numba device allocations for `t`, `pulses`, `waveform` and `noise`.

In [None]:
# copy the input data to the GPU first, and pass out= arguments to the functions
pulses = make_pulses(t, period, 100.0)
waveform = add_ufunc(pulses, noise)

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.plot(waveform)