In [6]:
import numpy as np
import cupy as cp

In [8]:
x_cpu = np.array([1, 2, 3])

##### In a normal CUDA workflow we have to allocate memory on the GPU and then move the data into it. In CuPy these two steps do not have to be written separately: `cp.asarray` allocates the device memory and copies the data in a single call.

In [14]:
%%time
x_gpu_0 = cp.asarray(x_cpu)  # copy the ndarray from host memory to GPU 0 memory.

CPU times: user 441 µs, sys: 335 µs, total: 776 µs
Wall time: 532 µs
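
To make the contrast concrete, here is a minimal sketch that compares the explicit two-step path (allocate, then copy) with the single `cp.asarray` call. The helper names `to_device_two_step`, `to_device_one_step` and `to_host` are hypothetical, and the NumPy fallback is only an assumption added so the cell still runs on a machine without a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp = None
    HAVE_GPU = False


def to_device_two_step(host_array):
    """Allocate device memory first, then copy the host data into it."""
    if not HAVE_GPU:
        return host_array.copy()  # stand-in so the sketch runs without a GPU
    device_array = cp.empty(host_array.shape, dtype=host_array.dtype)  # allocation only
    device_array.set(host_array)  # host-to-device copy into the existing buffer
    return device_array


def to_device_one_step(host_array):
    """Allocate and copy in a single call, as in the cell above."""
    if not HAVE_GPU:
        return host_array.copy()
    return cp.asarray(host_array)


def to_host(array):
    """Bring the data back as a numpy.ndarray for comparison."""
    return cp.asnumpy(array) if HAVE_GPU else np.asarray(array)


x_cpu_demo = np.array([1, 2, 3])
a = to_device_two_step(x_cpu_demo)
b = to_device_one_step(x_cpu_demo)
print(np.array_equal(to_host(a), to_host(b)))
```

Both paths should end up with the same data on the device; the one-step form is simply shorter to write.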


#### In the past, any communication between two GPUs had to go through the PCIe bus. NVIDIA now offers a technology called NVLink, a direct GPU-to-GPU interconnect that scales multi-GPU input/output (IO) within a node. On systems equipped with NVLink, this makes GPU-to-GPU transfer (D2D transfer) much faster than GPU-to-host (D2H) or host-to-GPU (H2D) transfer.

In [15]:
%%time
with cp.cuda.Device(1):
    x_gpu_1 = cp.asarray(x_gpu_0)  # copy the array from GPU 0 to GPU 1.

CPU times: user 224 µs, sys: 169 µs, total: 393 µs
Wall time: 321 µs
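
One way to confirm that the copy really landed on the second GPU is to inspect the array's `device` attribute. The sketch below assumes a machine with at least two CUDA devices and skips the transfer otherwise; `copy_to_device` is a hypothetical helper, not part of CuPy.

```python
import numpy as np

try:
    import cupy as cp
    N_GPUS = cp.cuda.runtime.getDeviceCount()
except Exception:  # CuPy not installed, or no usable CUDA device
    cp = None
    N_GPUS = 0


def copy_to_device(array, device_id):
    """Return `array` as a cupy.ndarray that lives on GPU `device_id`."""
    with cp.cuda.Device(device_id):
        # asarray allocates on the current device and copies if the
        # input lives on the host or on a different GPU.
        return cp.asarray(array)


result = None
if N_GPUS >= 2:
    src = copy_to_device(np.array([1, 2, 3]), 0)  # host -> GPU 0
    dst = copy_to_device(src, 1)                  # GPU 0 -> GPU 1
    result = (src.device.id, dst.device.id, cp.asnumpy(dst).tolist())
    print(result)
else:
    print("Fewer than two GPUs available; skipping the D2D copy.")
```

Whether the transfer actually uses NVLink or falls back to PCIe depends on the hardware topology, which this sketch does not check.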


##### There are two ways to fetch data from the GPU to the host: ``cupy.ndarray.get()`` or ``cupy.asnumpy()``.

In [17]:
with cp.cuda.Device(0):
    x_cpu = cp.asnumpy(x_gpu_0)  # move the array back to the host.

In [18]:
with cp.cuda.Device(1):
    x_cpu = x_gpu_1.get()
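
As a quick check that the two routes are interchangeable, the following sketch fetches the same array both ways and compares the results. The `to_host` helper is hypothetical, and the fallback branch is an assumption added so the cell runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp = None
    HAVE_GPU = False


def to_host(array):
    """Return a numpy.ndarray for either a NumPy or a CuPy input."""
    if cp is not None and isinstance(array, cp.ndarray):
        return array.get()  # device-to-host copy
    return np.asarray(array)  # already on the host


host_from_numpy = to_host(np.array([1, 2, 3]))

if HAVE_GPU:
    x_gpu_demo = cp.asarray([1, 2, 3])
    via_method = x_gpu_demo.get()          # method on the array
    via_function = cp.asnumpy(x_gpu_demo)  # module-level function
    both_match = (
        isinstance(via_method, np.ndarray)
        and isinstance(via_function, np.ndarray)
        and np.array_equal(via_method, via_function)
    )
else:
    both_match = None  # nothing to compare without a GPU

print(both_match)
```

The method form reads naturally when you already hold a `cupy.ndarray`; the function form is convenient in code that may receive either kind of array, since `cupy.asnumpy` also accepts NumPy input.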