# CUDA/HIP

## Level 0: Hello World

We start with a serial CPU code printing a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%ld\n", i0);
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

In [None]:
!../build/print-numbers/print-numbers-base

CUDA/HIP utilize separate *kernel* functions which are *launched* from the host code. They must return `void` and are marked with the `__global__` qualifier.

```cpp
__global__ void increase(size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        printf("%ld\n", i0);
    }
}
```

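The kernel launch is configured by providing an *execution configuration* in triple-chevron syntax. Its first entry is the number of blocks and its second entry is the number of threads per block, here one block with a single thread.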

```cpp
increase<<<1, 1>>>(nx);
```

The above example runs on the GPU, but all work is done in a single thread.
To achieve parallelism, you must manually assign each loop iteration to a separate thread and spawn as many threads as there are iterations.
Each thread computes a unique global or data index using built-in thread variables.

```cpp
__global__ void increase(size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;
    printf("%ld\n", i0);
}
```
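
To make the index computation concrete, the mapping can be emulated on the host. The following is a sketch in plain C++, not CUDA code: the function name `emulateGlobalIndices` and its parameters are made up for illustration, and the two loops merely stand in for the hardware enumerating `blockIdx.x` and `threadIdx.x`.

```cpp
#include <cstddef>
#include <vector>

// Host-only emulation of the kernel's index computation (illustrative helper,
// not part of CUDA/HIP). gridDim and blockDim mirror the launch configuration.
std::vector<std::size_t> emulateGlobalIndices(std::size_t gridDim, std::size_t blockDim) {
    std::vector<std::size_t> indices;
    indices.reserve(gridDim * blockDim);

    for (std::size_t blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            // same formula as in the kernel
            indices.push_back(blockIdx * blockDim + threadIdx);
        }
    }

    return indices;
}
```

For a configuration with 2 blocks of 4 threads each, block 0 produces the indices 0 to 3 and block 1 produces 4 to 7, so every index occurs exactly once.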

```cpp
auto numThreadsPerBlock = 256;
auto numBlocks = nx / numThreadsPerBlock;
increase<<<numBlocks, numThreadsPerBlock>>>(nx);
```

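While the above example works if `nx` is evenly divisible by the block size, the integer division rounds down in all other cases, and the trailing `nx % numThreadsPerBlock` elements are never processed.
The common solution is to round the number of blocks up, which adds one partially filled block, and to ensure that only threads with a valid index perform computations.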

```cpp
__global__ void increase(size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        printf("%ld\n", i0);
}
```

```cpp
auto numThreadsPerBlock = 256;
auto numBlocks = ceilingDivide(nx, numThreadsPerBlock);
increase<<<numBlocks, numThreadsPerBlock>>>(nx);
```
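
The `ceilingDivide` helper is not defined in the snippets above. One possible implementation is sketched below; this is an assumption, and the helper in the example sources may differ. It assumes `b > 0` and that `a + b - 1` does not overflow.

```cpp
#include <cstddef>

// Hypothetical implementation of the ceilingDivide helper used above:
// integer division rounded up, so that numBlocks * numThreadsPerBlock >= nx.
constexpr std::size_t ceilingDivide(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// nx = 1000 with 256 threads per block:
// truncating division yields 3 blocks (768 threads, 232 elements missed),
// rounding up yields 4 blocks (1024 threads, 24 of them fail the bounds check).
static_assert(1000 / 256 == 3, "truncating division drops the partial block");
static_assert(ceilingDivide(1000, 256) == 4, "rounding up adds the partial block");
static_assert(ceilingDivide(1024, 256) == 4, "no extra block if evenly divisible");
```

Because the surplus threads of the last block still execute the kernel, the `if (i0 < nx)` guard is what keeps them from touching invalid indices.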

The complete example code is available in [print-numbers-cuda.cu](../src/print-numbers/print-numbers-cuda.cu), and can be built and executed using the following cells.

In [None]:
!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/print-numbers/print-numbers-cuda ../src/print-numbers/print-numbers-cuda.cu

In [None]:
!../build/print-numbers/print-numbers-cuda

## Level 1: Adding Managed Memory

Our next application is increasing all elements of an array by one.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part, and our starting point, is the `increase` function.

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

Data allocation and deallocation are done explicitly in CUDA.
For managed memory, this works as follows.

```cpp
double *data;                  // unified allocation
cudaMallocManaged((void **)&data, sizeof(double) * nx);

/* ... */

cudaFree(data);                // unified de-allocation
```

Migration of data structures is implicit, but can be triggered explicitly to optimize performance.

```cpp
cudaMemPrefetchAsync(data, sizeof(double) * nx, 0 /* deviceId */);   // host to device
cudaMemPrefetchAsync(data, sizeof(double) * nx, cudaCpuDeviceId);    // device to host
```

Pointers to allocated memory are passed to the kernel as arguments.

```cpp
__global__ void increase(double* data, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        data[i0] += 1;
}
```

```cpp
auto numThreadsPerBlock = 256;
auto numBlocks = ceilingDivide(nx, numThreadsPerBlock);
increase<<<numBlocks, numThreadsPerBlock>>>(data, nx);
```
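
Putting the pieces of this level together, a minimal end-to-end program could look as follows. This is a sketch under assumptions: the problem size, the initialization and the final check are chosen for illustration and are not taken from the example sources. Prefetching and error checking of the API calls are omitted for brevity. Note the `cudaDeviceSynchronize()` call: kernel launches are asynchronous, so the host has to wait before it reads the results.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increase(double* data, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        data[i0] += 1;
}

int main() {
    const size_t nx = 1 << 20;  // illustrative problem size

    double* data;               // unified allocation
    cudaMallocManaged((void**)&data, sizeof(double) * nx);

    for (size_t i0 = 0; i0 < nx; ++i0)  // initialize on the host
        data[i0] = (double)i0;

    const unsigned int numThreadsPerBlock = 256;
    // number of blocks, rounded up
    const unsigned int numBlocks = (unsigned int)((nx + numThreadsPerBlock - 1) / numThreadsPerBlock);
    increase<<<numBlocks, numThreadsPerBlock>>>(data, nx);

    // wait for the kernel to finish before accessing data on the host
    cudaDeviceSynchronize();

    size_t numErrors = 0;
    for (size_t i0 = 0; i0 < nx; ++i0)
        if (data[i0] != (double)i0 + 1)
            ++numErrors;
    printf("%zu errors\n", numErrors);

    cudaFree(data);             // unified de-allocation

    return numErrors == 0 ? 0 : 1;
}
```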

The complete example code is available in [increase-cuda-mm.cu](../src/increase/increase-cuda-mm.cu).
Build and execute it using the following cells.

In [None]:
!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/increase/increase-cuda-mm ../src/increase/increase-cuda-mm.cu

In [None]:
!../build/increase/increase-cuda-mm

## Level 2: Switching to Explicit Memory Management

Switching from managed memory to explicit memory management requires the following changes and additions:

**1.** Separate device and host allocations

A frequently used pattern is having two separate pointers for host and device allocations, which share the same *data layout*.
To make this intent clear to readers of the code, the same variable name is often used for both, with a `d_` prefix for the device version.
Additionally prefixing the host version with `h_` is even more explicit, but is less common in practice.

```cpp
double *data;                  // host allocation
cudaMallocHost((void **)&data, sizeof(double) * nx);      // HIP uses hipHostMalloc

double *d_data;                // device allocation    
cudaMalloc((void **)&d_data, sizeof(double) * nx);

/* ... */

cudaFree(d_data);              // device de-allocation

cudaFreeHost(data);            // host de-allocation   // HIP uses hipHostFree
```

**2.** Explicit copies between host and device

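Data transfer is performed with `cudaMemcpy(target, source, bytesToTransfer, direction)`, where `direction` is, for instance, `cudaMemcpyHostToDevice` or `cudaMemcpyDeviceToHost`. For example, the host data is copied to the device with: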

```cpp
cudaMemcpy(d_data, data, sizeof(double) * nx, cudaMemcpyHostToDevice);
```

As before, pointers to (device) memory can be passed to kernels as arguments.
Note that the `d_` prefix is commonly dropped inside the kernel.

```cpp
__global__ void increase(double* data, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        data[i0] += 1;
}
```

```cpp
auto numThreadsPerBlock = 256;
auto numBlocks = ceilingDivide(nx, numThreadsPerBlock);
increase<<<numBlocks, numThreadsPerBlock>>>(d_data, nx);
```
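
After the kernel has run, the results still reside in device memory and have to be copied back with `cudaMemcpyDeviceToHost` before the host can use them. The following sketch shows the full round trip. The problem size, the initialization, the final check and the `CHECK_CUDA` macro are assumptions made for illustration and are not taken from the example sources.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// hypothetical helper: abort with a message if a runtime API call fails
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void increase(double* data, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        data[i0] += 1;
}

int main() {
    const size_t nx = 1 << 20;  // illustrative problem size
    const size_t numBytes = sizeof(double) * nx;

    double* data;               // host allocation
    CHECK_CUDA(cudaMallocHost((void**)&data, numBytes));
    double* d_data;             // device allocation
    CHECK_CUDA(cudaMalloc((void**)&d_data, numBytes));

    for (size_t i0 = 0; i0 < nx; ++i0)  // initialize on the host
        data[i0] = (double)i0;

    CHECK_CUDA(cudaMemcpy(d_data, data, numBytes, cudaMemcpyHostToDevice));

    const unsigned int numThreadsPerBlock = 256;
    // number of blocks, rounded up
    const unsigned int numBlocks = (unsigned int)((nx + numThreadsPerBlock - 1) / numThreadsPerBlock);
    increase<<<numBlocks, numThreadsPerBlock>>>(d_data, nx);
    CHECK_CUDA(cudaGetLastError());  // reports launch errors, e.g. an invalid configuration

    // issued to the same stream as the kernel, so the copy starts after the kernel
    // has finished; the call returns once the data has arrived on the host
    CHECK_CUDA(cudaMemcpy(data, d_data, numBytes, cudaMemcpyDeviceToHost));

    size_t numErrors = 0;
    for (size_t i0 = 0; i0 < nx; ++i0)
        if (data[i0] != (double)i0 + 1)
            ++numErrors;
    printf("%zu errors\n", numErrors);

    CHECK_CUDA(cudaFree(d_data));    // device de-allocation
    CHECK_CUDA(cudaFreeHost(data));  // host de-allocation

    return numErrors == 0 ? 0 : 1;
}
```

In contrast to the managed memory version, no explicit `cudaDeviceSynchronize()` is needed here, since the blocking device-to-host copy already orders the host access after the kernel.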

The complete example code is available in [increase-cuda-expl.cu](../src/increase/increase-cuda-expl.cu).
Build and execute it using the following cells.

In [None]:
!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/increase/increase-cuda-expl ../src/increase/increase-cuda-expl.cu

In [None]:
!../build/increase/increase-cuda-expl