# GTC 2017 Numba Tutorial Notebook 3: Memory Management

## Managing GPU Memory

During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs.  To reduce the cost of host-to-device and device-to-host transfers, it is best to copy data to the GPU explicitly and leave it there, amortizing the transfer cost over multiple function calls.  In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also improve performance.

Let's create our example addition ufunc again:

In [3]:
from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

In [5]:
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x

In [6]:
%timeit add_ufunc(x, y)  # Baseline performance with host arrays

The slowest run took 137.56 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.25 ms per loop
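
As a rough, back-of-the-envelope sketch of what this baseline call has to move, the snippet below counts the bytes in the two inputs and the one output. It assumes the ufunc copies both host inputs to the device and the result back to the host on every call, and it ignores allocation and kernel-launch overhead. The variable names `bytes_to_device`, `bytes_to_host`, and `total_bytes` are illustrative only.

```python
import numpy as np

n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x

# Assumption: each call with host arrays copies x and y to the GPU
# and copies one result array of the same size back to the host.
bytes_to_device = x.nbytes + y.nbytes
bytes_to_host = x.nbytes  # the output has the same shape and dtype as x
total_bytes = bytes_to_device + bytes_to_host

print(f"{total_bytes / 1e6:.1f} MB transferred per call")
```

For these arrays that comes to 1.2 MB per call, which is the traffic the rest of this notebook works to avoid.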


The `numba.cuda` module includes a function that will copy host data to the GPU and return a CUDA device array:

In [12]:
from numba import cuda

x_device = cuda.to_device(x)
y_device = cuda.to_device(y)

print(x_device)
print(x_device.shape)
print(x_device.dtype)

<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x112327f28>
(100000,)
float32


Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:

In [13]:
%timeit add_ufunc(x_device, y_device)

1000 loops, best of 3: 429 µs per loop


That's a big performance improvement already, but the ufunc is still allocating a new device array for its output on every call.  We can create the output buffer ourselves, once, with the `numba.cuda.device_array()` function:

In [15]:
out_device = cuda.device_array(shape=(n,), dtype=np.float32)  # does not initialize the contents, like np.empty()

And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:

In [16]:
%timeit add_ufunc(x_device, y_device, out=out_device)

1000 loops, best of 3: 235 µs per loop


Now that we have removed both the input copies and the output allocation, the call takes 235 µs instead of the 1.25 ms host-array baseline, roughly a 5x speedup.  The result stays on the GPU; when we want to bring the device array back to host memory, we can use the `copy_to_host()` method:

In [17]:
out_host = out_device.copy_to_host()
print(out_host[:10])

[  0.   3.   6.   9.  12.  15.  18.  21.  24.  27.]


## Exercise

Convert the example functions to ufuncs, then measure the impact of computing with host arrays versus device arrays.