# Parallel Computation

## Recap

As discussed in the introduction, parallel computation on a GPU involves several key steps:

**1. Trigger execution on GPUs**

**2. Spawn threads**

GPUs, like other hardware components, are designed with a hierarchical structure.
To efficiently utilize the hardware, threads and their organization are also typically hierarchical.

* CUDA/HIP: *thread* > *block* > *grid*
* SYCL: *work item* > *workgroup* > *nd-range*
* OpenMP: *thread* > *team* > *league*
* OpenACC: *thread* > *vector* > *worker* > *gang*
* Kokkos: *thread* > *team* > *league*

On the hardware level, threads are further grouped as follows:
* *Warps* of 32 on NVIDIA GPUs
* *Wavefronts* of 64 on AMD GPUs (RDNA-based GPUs default to 32)
* *Sub-groups* or *sub-workgroups* on Intel GPUs

**3. Map threads**

Each thread executes the same set of operations.
To differentiate them, each thread is assigned one or more IDs or indices, which are used to calculate *globally unique thread indices*.
These indices are then used to map threads to specific portions of the work.

CUDA/HIP make this explicit by providing *built-in thread variables* that yield different values depending on the evaluating thread.
A global thread index is commonly computed from the block index, the block-local thread index, and the block size (number of threads per block) as follows:
```cpp
blockIdx.x * blockDim.x + threadIdx.x
```
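To make the mapping concrete, the following CPU-only sketch enumerates all (block, thread) pairs of a one-dimensional launch serially and counts how often each global index is produced. The `Dim` struct and the functions `globalIndex` and `countGlobalIndices` are hypothetical stand-ins introduced for illustration only; they are not part of CUDA or HIP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the built-in CUDA/HIP index types (x component only).
struct Dim {
    std::size_t x;
};

// Same formula as above, written as a plain host function.
std::size_t globalIndex(Dim blockIdx, Dim blockDim, Dim threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// Serially visits every (block, thread) pair of a 1D launch and records
// how many times each global index is generated.
std::vector<int> countGlobalIndices(std::size_t numBlocks, std::size_t threadsPerBlock) {
    std::vector<int> hits(numBlocks * threadsPerBlock, 0);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        for (std::size_t t = 0; t < threadsPerBlock; ++t) {
            const std::size_t i = globalIndex({b}, {threadsPerBlock}, {t});
            assert(i < hits.size());  // indices stay within the launched range
            ++hits[i];
        }
    }
    return hits;
}
```

Every entry of the returned vector is expected to be exactly one, i.e. the formula assigns each launched thread its own index without gaps or duplicates.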

SYCL and Kokkos provide a global index as a single lambda parameter.

OpenMP and OpenACC internally map existing loop indices onto threads.

Many standard algorithms do not expose indices directly, but instead operate on references to elements of the input/output data structures.

**4. Synchronization**

Waiting for the GPU to finish outstanding work can be done either:
* implicitly at the end of GPU code sections (OpenMP, OpenACC), or
* via specific API function calls.

## Implementation

Generally, there are three main approaches to implementing parallel computations on GPUs:
* Writing a dedicated GPU kernel (function) as a separate code section, launched from the host code
* Defining an inline kernel for better language integration, while still exposing a GPU-specific implementation
* Relying on automatic conversion of code originally written for CPU execution

## Example

To demonstrate the different offloading and parallelization approaches, we consider a simple test case: increasing all elements of an array by one. \
[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part is the `increase` function:

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

Other tasks performed by our application include:
* Parsing command line arguments:
    * `nx`: the number of elements in the vector to be processed
    * `nItWarmUp`: the number of warm-up iterations
    * `nIt`: the number of timed iterations
* Allocating an array with `nx` elements
* Initializing the array so that each element holds a value equal to its index
* Calling `increase` for `nItWarmUp` iterations
* Calling `increase` for `nIt` iterations and measuring the time taken
* Printing statistics and estimated performance metrics
* Verifying that all array elements have the expected value
* Deallocating the array
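The steps listed above could be sketched roughly as follows. This is not the content of `increase-base.cpp`, which is not reproduced here; the function name `runBenchmark`, the printed statistics, and the assumption of one read and one write per element are illustrative choices. Command line parsing is left out, so the three parameters are passed in directly.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void increase(double* data, std::size_t nx) {
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}

// Hypothetical driver: returns true if all elements hold the expected value.
bool runBenchmark(std::size_t nx, std::size_t nItWarmUp, std::size_t nIt) {
    assert(nx > 0 && nIt > 0);  // the sketch assumes a non-empty array and timed work

    // Allocate and initialize: element i holds the value i.
    std::vector<double> data(nx);
    for (std::size_t i = 0; i < nx; ++i) {
        data[i] = static_cast<double>(i);
    }

    // Warm-up iterations (not timed).
    for (std::size_t it = 0; it < nItWarmUp; ++it) {
        increase(data.data(), nx);
    }

    // Timed iterations.
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t it = 0; it < nIt; ++it) {
        increase(data.data(), nx);
    }
    const auto end = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(end - start).count();

    // Assumed traffic model: one read and one write per element and iteration.
    const double gigabytes = 2.0 * sizeof(double) * nx * nIt / 1e9;
    std::printf("elapsed: %g s, data moved: %g GB\n", seconds, gigabytes);

    // Verify: every element was increased once per iteration.
    for (std::size_t i = 0; i < nx; ++i) {
        if (data[i] != static_cast<double>(i + nItWarmUp + nIt)) {
            return false;
        }
    }
    return true;  // the vector releases its memory automatically
}
```

A call such as `runBenchmark(1000, 2, 10)` runs two warm-up and ten timed iterations on 1000 elements.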

You can compile and execute the code using the following cells:

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/increase/increase-base ../src/increase/increase-base.cpp

In [None]:
!../build/increase/increase-base

## OpenMP

**1.** OpenMP enables code execution on GPUs by introducing *target regions*.

```cpp
#pragma omp target
for (size_t i0 = 0; i0 < nx; ++i0) {
    data[i0] += 1;
}
```

**2.** This code executes the loop on the GPU *serially*. To introduce parallelism, use `teams` and `parallel`.

**3.** Loop iterations can be mapped to spawned threads using `distribute` and `for`.
If not specified, the compiler chooses the number of teams and threads per team.

```cpp
#pragma omp target teams distribute parallel for
for (size_t i0 = 0; i0 < nx; ++i0) {
    data[i0] += 1;
}
```
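If the defaults chosen by the compiler are not suitable, the number of teams and threads per team can be requested explicitly with the `num_teams` and `thread_limit` clauses. The sketch below is one possible way to do this; the value of 256 threads per team and the `map` clause are assumptions made to keep the function self-contained, not settings taken from the example files. Without OpenMP support enabled at compile time, the pragma is ignored and the loop runs serially on the CPU.

```cpp
#include <cassert>
#include <cstddef>

void increase(double* data, std::size_t nx) {
    assert(data != nullptr);
    if (nx == 0) {
        return;  // num_teams must be positive, so skip empty arrays
    }

    // Hypothetical tuning values: 256 threads per team, enough teams to cover nx.
    constexpr int threadsPerTeam = 256;
    const int numTeams = static_cast<int>((nx + threadsPerTeam - 1) / threadsPerTeam);

    // map(tofrom: ...) copies the array to the device and back (assumed here;
    // data movement may instead be handled elsewhere in a real application).
    #pragma omp target teams distribute parallel for \
        map(tofrom: data[0:nx]) num_teams(numTeams) thread_limit(threadsPerTeam)
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

Both clauses are requests for upper bounds; the runtime may still launch fewer teams or threads than specified.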

**4.** Synchronization occurs implicitly at the end of the target region.

The complete example code can be found in [increase-omp-target-expl.cpp](../src/increase/increase-omp-target-expl.cpp) and [increase-omp-target-mm.cpp](../src/increase/increase-omp-target-mm.cpp).
Build and execute them using the following cells.

In [None]:
!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/increase/increase-omp-target-expl ../src/increase/increase-omp-target-expl.cpp

In [None]:
!../build/increase/increase-omp-target-expl

In [None]:
!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -gpu=mem:managed -o ../build/increase/increase-omp-target-mm ../src/increase/increase-omp-target-mm.cpp

In [None]:
!../build/increase/increase-omp-target-mm

## OpenACC

OpenACC offers a similar approach to OpenMP target offloading, using `parallel` (to spawn threads) and `loop` (to distribute work).

Whether execution occurs on the CPU or GPU is determined by compiler arguments.

If not specified, the compiler chooses the number of gangs, workers, and vector size.

```cpp
#pragma acc parallel loop
for (size_t i0 = 0; i0 < nx; ++i0) {
    data[i0] += 1;
}   // implicit synchronization
```

As with OpenMP, this code instructs the compiler to parallelize the loop regardless of whether doing so is safe (e.g., when inter-iteration dependencies would cause race conditions).
Alternatively, `kernels` can be used to give the compiler more control.
In this case, the compiler will:
* Analyze dependencies and only parallelize loops without dependencies
* Apply loop and kernel transformations, including fusion

```cpp
#pragma acc kernels
{
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
    /* potentially more work */
}
```
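The difference matters as soon as iterations depend on each other. The sketch below contrasts the independent `increase` loop with a running sum, where each iteration reads the result of the previous one. The function `prefixSum` and the `copy` clauses are assumptions added for illustration. Without OpenACC support enabled at compile time, both pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <cstddef>

// Independent iterations: forcing parallelization with `parallel loop` is safe.
void increase(double* data, std::size_t nx) {
    assert(data != nullptr);
    #pragma acc parallel loop copy(data[0:nx])
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}

// Loop-carried dependency: iteration i0 reads the value written by i0 - 1.
// Annotating this loop with `parallel loop` would introduce a race condition.
// With `kernels`, the compiler is expected to detect the dependency and keep
// the loop sequential.
void prefixSum(double* data, std::size_t nx) {
    assert(data != nullptr);
    #pragma acc kernels copy(data[0:nx])
    {
        for (std::size_t i0 = 1; i0 < nx; ++i0) {
            data[i0] += data[i0 - 1];
        }
    }
}
```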

The complete example code is available in [increase-openacc-expl.cpp](../src/increase/increase-openacc-expl.cpp) and [increase-openacc-mm.cpp](../src/increase/increase-openacc-mm.cpp).
Build and execute them using the following cells.

In [None]:
!nvc++ -O3 -std=c++17 -acc=gpu -target=gpu -o ../build/increase/increase-openacc-expl ../src/increase/increase-openacc-expl.cpp

In [None]:
!../build/increase/increase-openacc-expl

In [None]:
!nvc++ -O3 -std=c++17 -acc=gpu -target=gpu -gpu=mem:managed -o ../build/increase/increase-openacc-mm ../src/increase/increase-openacc-mm.cpp

In [None]:
!../build/increase/increase-openacc-mm

## Modern C++

An alternative to loop-based operations is to use STL algorithms.
These can be parallelized by providing an *execution policy*, and GPU offloading can be enabled via compiler arguments.

```cpp
std::transform(std::execution::par_unseq, data, data + nx, data,
               [=](auto data_item) {
                   return data_item + 1;
               }); // implicit synchronization
```

All algorithm executions are synchronous with respect to the CPU, i.e., no explicit GPU synchronization is necessary.

If indices are needed (for example, to access neighboring elements), there are two main approaches.

Reconstruct the index using pointer arithmetic...
```cpp
std::for_each(std::execution::par_unseq, data, data + nx,
              [=](const auto& data_item) {
                  const size_t i0 = &data_item - data;
                  data[i0] += 1;
              });
```

...or use a Thrust `counting_iterator` (also available on AMD via *rocThrust*):
```cpp
std::for_each(std::execution::par_unseq, thrust::make_counting_iterator<size_t>(0), thrust::make_counting_iterator<size_t>(nx),
              [=](const auto &i0) {
                  data[i0] += 1;
              });
```
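For the neighbor-access use case mentioned above, the pointer-arithmetic variant could look like the following sketch of a three-point average. The function `smooth` and the separate input and output arrays are assumptions for illustration; reading neighbors from the array that is being written would create a race condition under parallel execution. The execution policy is deliberately omitted so that the sketch builds without a parallel backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Writes the average of each element and its two neighbors to `out`.
// Boundary elements are copied unchanged.
void smooth(const double* in, double* out, std::size_t nx) {
    assert(in != out);  // input and output must not alias

    // An execution policy such as std::execution::par_unseq (from <execution>)
    // could be passed as the first argument; it is left out here.
    std::for_each(out, out + nx, [=](double& out_item) {
        // Reconstruct the index from the element's address.
        const std::size_t i0 = &out_item - out;
        if (i0 == 0 || i0 == nx - 1) {
            out_item = in[i0];
        } else {
            out_item = (in[i0 - 1] + in[i0] + in[i0 + 1]) / 3;
        }
    });
}
```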

The complete example code is available in [increase-std-par.cpp](../src/increase/increase-std-par.cpp).
Build and execute it using the following cells.

In [None]:
!nvc++ -O3 -std=c++17 -stdpar=gpu -target=gpu -gpu=cc86 -o ../build/increase/increase-std-par ../src/increase/increase-std-par.cpp

In [None]:
!../build/increase/increase-std-par

## Thrust

For more control and support for additional computational patterns, Thrust is a strong alternative to 'standard' modern C++.

It provides GPU-accelerated versions of many STL algorithms, as well as additional ones.
Thrust algorithms also accept an *execution policy* argument, which specifies *where* the computation should be performed (in contrast to `std::execution` policies, which describe *how* an algorithm may be parallelized).

```cpp
thrust::transform(thrust::device, data.begin(), data.end(), data.begin(),
                  [=] __host__ __device__ (double data_elem) {
                      return data_elem + 1;
                  }); // implicit synchronization
```

As before, you can also use a counting iterator.

```cpp
double *data_ptr = thrust::raw_pointer_cast(data.data());
thrust::for_each(thrust::device, thrust::make_counting_iterator<size_t>(0), thrust::make_counting_iterator<size_t>(nx),
                 [=] __host__ __device__ (size_t i0) {
                     data_ptr[i0] += 1;
                 });
```

Alternatively, Thrust provides the `tabulate` pattern.
It applies a transformation to the *index* of each element and stores the result in that element.

```cpp
double *data_ptr = thrust::raw_pointer_cast(data.data());
thrust::tabulate(thrust::device, data.begin(), data.end(),
                 [=] __host__ __device__ (size_t i0) {
                     return data_ptr[i0] + 1;
                 });
```

As with STL algorithms, their Thrust counterparts are synchronous with respect to the CPU, i.e., no explicit GPU synchronization is necessary.

The complete example code is available in [increase-thrust-expl.cu](../src/increase/increase-thrust-expl.cu) and [increase-thrust-mm.cu](../src/increase/increase-thrust-mm.cu).
Build and execute them using the following cells.

In [None]:
!nvcc -O3 -std=c++17 --extended-lambda -arch=sm_86 -o ../build/increase/increase-thrust-expl ../src/increase/increase-thrust-expl.cu

In [None]:
!../build/increase/increase-thrust-expl 

In [None]:
!nvcc -O3 -std=c++17 --extended-lambda -arch=sm_86 -o ../build/increase/increase-thrust-mm ../src/increase/increase-thrust-mm.cu

In [None]:
!../build/increase/increase-thrust-mm 

## Kokkos

Kokkos provides its own abstraction for parallel loops.
Depending on how Kokkos is compiled, this will map to either CPU or GPU execution spaces.

```cpp
Kokkos::parallel_for(
    Kokkos::RangePolicy<>(0, nx),
    KOKKOS_LAMBDA(const size_t i0) {
        data(i0) += 1;
    });
```

Tuning the thread hierarchy can be done by specifying an additional *team policy*.

Synchronization with the GPU is *not* implicit.
It can be triggered by calling:
```cpp
Kokkos::fence();
```

The complete example code is available in [increase-kokkos.cpp](../src/increase/increase-kokkos.cpp).
Build and execute it using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++20 -I/root/kokkos/install-serial/include -L/root/kokkos/install-serial/lib -o ../build/increase/increase-kokkos-serial ../src/increase/increase-kokkos.cpp -lkokkoscore -ldl

In [None]:
!../build/increase/increase-kokkos-serial

In [None]:
!/root/kokkos/install-cuda/bin/nvcc_wrapper -O3 -march=native -std=c++20 -arch=sm_86 --expt-extended-lambda --expt-relaxed-constexpr -I/root/kokkos/install-cuda/include -L/root/kokkos/install-cuda/lib -o ../build/increase/increase-kokkos-cuda ../src/increase/increase-kokkos.cpp -lkokkoscore -ldl -lcuda

In [None]:
!../build/increase/increase-kokkos-cuda

## SYCL

SYCL also provides an abstraction for parallel loops.
Using it requires a *handler*, which in turn requires a work *queue*.

The latter can be initialized with the `in_order` property that ensures that kernels and other operations are executed in order.

```cpp
sycl::queue q(sycl::property::queue::in_order{});
```

Work can then be submitted to the queue, which provides a handler:

```cpp
q.submit([&](sycl::handler &h) {
    h.parallel_for(nx, [=](auto i0) {
        data[i0] += 1;
    });
});
```

You can tune the workgroup size by specifying global and local sizes (the total number of threads and the number of threads per workgroup).
Note that the global size must be evenly divisible by the local size, so it is rounded up to the next multiple, and the surplus threads must be masked to avoid out-of-bounds accesses.

```cpp
auto local_size = 256;
auto global_size = ceilingDivide(nx, local_size) * local_size;
q.submit([&](sycl::handler &h) {
    h.parallel_for( sycl::nd_range<1>{ global_size, local_size }, [=](auto item) {
        auto i0 = item.get_global_id(0);
        if (i0 < nx) {
            data[i0] += 1;
        }
    });
});
```
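The `ceilingDivide` helper used above is not shown in this section. A minimal sketch, assuming unsigned integer arguments and a non-zero divisor, could look as follows; `paddedGlobalSize` is a hypothetical convenience function that mirrors the `global_size` computation.

```cpp
#include <cassert>
#include <cstddef>

// Smallest q such that q * b >= a, for b > 0.
// Note: a + b - 1 can overflow for values of a close to the maximum of size_t.
constexpr std::size_t ceilingDivide(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Global size rounded up to the next multiple of the local size.
inline std::size_t paddedGlobalSize(std::size_t nx, std::size_t localSize) {
    assert(localSize > 0);
    return ceilingDivide(nx, localSize) * localSize;
}

// 1000 elements with a workgroup size of 256 require 4 workgroups.
static_assert(ceilingDivide(1000, 256) == 4, "rounds up");
static_assert(ceilingDivide(1024, 256) == 4, "exact multiples are unchanged");
```

For `nx = 1000` and a local size of 256, the padded global size is 1024, so 24 threads are masked by the `i0 < nx` check.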

In all cases, explicit synchronization with the GPU is performed by calling:

```cpp
q.wait();
```

When using buffers, you must create additional accessors to access data.

```cpp
q.submit([&](sycl::handler &h) {
    auto data = b_data.get_access(h, sycl::read_write);
    h.parallel_for(nx, [=](auto i0) {
        data[i0] += 1;
    });
});
```

The complete example code is available in [increase-sycl-expl.cpp](../src/increase/increase-sycl-expl.cpp), [increase-sycl-mm.cpp](../src/increase/increase-sycl-mm.cpp), and [increase-sycl-buffer.cpp](../src/increase/increase-sycl-buffer.cpp).
Build and execute them using the following cells.

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/increase/increase-sycl-expl ../src/increase/increase-sycl-expl.cpp

In [None]:
!../build/increase/increase-sycl-expl

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/increase/increase-sycl-buffer ../src/increase/increase-sycl-buffer.cpp

In [None]:
!../build/increase/increase-sycl-buffer

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/increase/increase-sycl-mm ../src/increase/increase-sycl-mm.cpp

In [None]:
!../build/increase/increase-sycl-mm

## CUDA/HIP

CUDA/HIP use separate *kernel* functions that are *launched* from the host code. Kernels must return `void` and are marked with the `__global__` qualifier.

```cpp
__global__ void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

The kernel launch is configured by providing an *execution configuration* in triple-chevron syntax, which specifies the number of blocks and the number of threads per block.

```cpp
increase<<<1, 1>>>(d_data, nx);
```

The above example runs on the GPU, but all work is done in a single thread.
To achieve parallelism, you must manually assign each loop iteration to a separate thread and spawn as many threads as there are iterations.
Each thread computes a unique global or data index using built-in thread variables.

```cpp
__global__ void increase(double* data, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;
    data[i0] += 1;
}
```

```cpp
auto numThreadsPerBlock = 256;
auto numBlocks = nx / numThreadsPerBlock;
increase<<<numBlocks, numThreadsPerBlock>>>(d_data, nx);
```

The above example only works if `nx` is evenly divisible by the block size.
Otherwise, the integer division rounds down and the last `nx % numThreadsPerBlock` elements are never processed.
The common solution is to round the number of blocks up and ensure that only threads with a valid index perform computations.

```cpp
__global__ void increase(double* data, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        data[i0] += 1;
}
```

```cpp
auto numThreadsPerBlock = 256;
auto numBlocks = ceilingDivide(nx, numThreadsPerBlock);
increase<<<numBlocks, numThreadsPerBlock>>>(d_data, nx);
```
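The effect of the two launch configurations can be checked without a GPU. The following sketch emulates the guarded kernel serially on the CPU and counts how many elements were updated; `increaseThread` and `emulateLaunch` are hypothetical names, and the nested loops only stand in for the threads that a real launch would run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of the guarded kernel, with the built-in variables passed as arguments.
void increaseThread(double* data, std::size_t nx, std::size_t blockIdx,
                    std::size_t blockDim, std::size_t threadIdx) {
    const std::size_t i0 = blockIdx * blockDim + threadIdx;
    if (i0 < nx) {
        data[i0] += 1;
    }
}

// Serial stand-in for increase<<<numBlocks, numThreadsPerBlock>>>(data, nx).
// Returns the number of elements that were updated.
std::size_t emulateLaunch(std::size_t numBlocks, std::size_t numThreadsPerBlock,
                          std::size_t nx) {
    std::vector<double> data(nx, 0.0);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        for (std::size_t t = 0; t < numThreadsPerBlock; ++t) {
            increaseThread(data.data(), nx, b, numThreadsPerBlock, t);
        }
    }

    std::size_t updated = 0;
    for (double v : data) {
        assert(v == 0.0 || v == 1.0);  // no element is updated more than once
        if (v == 1.0) {
            ++updated;
        }
    }
    return updated;
}
```

For `nx = 1000` and 256 threads per block, the truncating division yields 3 blocks and `emulateLaunch(3, 256, 1000)` updates only 768 elements, while rounding up to 4 blocks updates all 1000.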

The complete example code is available in [increase-cuda-expl.cu](../src/increase/increase-cuda-expl.cu) and [increase-cuda-mm.cu](../src/increase/increase-cuda-mm.cu).
Build and execute them using the following cells.

In [None]:
!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/increase/increase-cuda-expl ../src/increase/increase-cuda-expl.cu

In [None]:
!../build/increase/increase-cuda-expl

In [None]:
!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/increase/increase-cuda-mm ../src/increase/increase-cuda-mm.cu

In [None]:
!../build/increase/increase-cuda-mm

## Next Step

Proceed to the [next steps](./next-steps.ipynb) notebook.