# Next Steps

## Profiling and Optimization

Nvidia provides mature profiling tools that can be used for all approaches.

A popular pattern is following a *top-down approach* to performance analysis, starting with a *whole application* performance overview and then narrowing down to specific *hot spots*.
The initial overview can be obtained with Nsight Systems, either solely via its command line interface or complemented by its GUI.
Nsight Compute can then be used to evaluate the performance of single kernels of interest.

## Nsight Systems CLI

First, we compile and execute our benchmark application to make sure that results are as expected.

In [None]:
!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/increase/increase-cuda-expl ../src/increase/increase-cuda-expl.cu

Next, we profile our binary with `nsys profile`.
Further command line arguments are:
* `--stats=true`: prints a summary of performance statistics on the command line
* `-o ...`: sets the target output profile file
* `--force-overwrite=true`: replaces the profile file if it already exists (instead of aborting)

In [None]:
!nsys profile --stats=true -o ../profiles/increase-cuda-expl --force-overwrite=true ../build/increase/increase-cuda-expl

The most relevant sections are:

**CUDA API Statistics** (`cuda_api_sum`)

This section provides timing details for calls to the CUDA API such as `cudaMalloc`, `cudaMemcpy`, `cudaDeviceSynchronize`, etc.

For each API function called, Nsight Systems reports
* the relative and absolute total (aggregated) time spent in its execution,
* the number of calls, and
* statistics (min, max, average and median) about each call's duration.

**CUDA Kernel Statistics** (`cuda_gpu_kern_sum`)

This section reports for each kernel
* the relative and absolute total (aggregated) time spent in its execution,
* the number of launches, and
* statistics (min, max, average and median) of the kernel execution times.

**Memory Transfers** (`cuda_gpu_mem_time_sum` and `cuda_gpu_mem_size_sum`)

These two sections focus on data transfers between host and device and report
* the total time spent in transfers per direction, and their statistics, and
* the total amount of data transferred, and statistics about the individual transfers.

## Nsight Systems GUI

We can further investigate the application behavior by opening up the generated `increase-cuda-expl.nsys-rep` report file which has been saved to the `../profiles` folder.

To do so, download the file using the file browser on the left or **shift** + **right-click** on this [link](../profiles/increase-cuda-expl.nsys-rep) and choose *save link as*.
Next, open the downloaded file with your local installation of Nsight Systems (requires version 2025.1.1 or newer).

## Nsight Compute CLI

After fixing system-level performance issues and isolating hot-spot kernels, the next step is profiling the application with `ncu`, the CLI of Nsight Compute.

Its command line arguments are:
* `-o ...`: sets the target output profile file (equivalent to `nsys`)
* `--force-overwrite`: replaces the profile file if it already exists (in contrast to `nsys` no `=true`)

We can further limit the scope of profiled kernels with
* `--launch-skip n` or `-s n`: skips the first `n` kernels encountered
* `--launch-count n` or `-c n`: limits profiling to the first `n` applicable kernels
* `--kernel-name name` or `-k name`: limits profiling to kernels with the name `name`
  * also accepts a regular expression via the `regex:` prefix, i.e. `-k regex:<expression>`
  * Nsight Compute also supports kernel renaming

All command line arguments are listed in the [documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#profile).

Without an output file, results are printed to the command line.

Note that we also decrease the number of iterations since by default *every kernel* is profiled.

In [None]:
!ncu -s 2 -c 1  ../build/increase/increase-cuda-expl

When profiling on a remote machine while the GUI used to evaluate the results runs locally, it is often advantageous to collect more data than might be necessary.
One easy way to realize this behavior is adding the `--set=full` argument.

In [None]:
!ncu -s 2 -c 1 --set=full -o ../profiles/increase-cuda-expl --force-overwrite ../build/increase/increase-cuda-expl

As before, the resulting file (`increase-cuda-expl.ncu-rep`) can be opened locally after being downloaded using the file browser on the left or with **shift** + **right-click** on this [link](../profiles/increase-cuda-expl.ncu-rep) and *save link as*.

A more in-depth look into GPU performance engineering is beyond the scope of this tutorial, but NHR@FAU offers a comprehensive course on this topic.
Details can be found at [https://hpc.fau.de/teaching/tutorials-and-courses](https://hpc.fau.de/teaching/tutorials-and-courses).

### Optimization Considerations

Using the previously (shallowly) introduced performance evaluation techniques, different patterns can be identified.
Generally, these patterns guide optimization strategies, but, more relevant to the scope of this workshop, also the choice of GPU programming approaches.

**Majority of Time Spent in Device Synchronization**

While this is generally a good thing (the GPU is doing meaningful work during this time), it can also be a hint that overlapping additional, independent work on the CPU is possible.
In order to achieve this, fully asynchronous kernel execution must be supported.

**Low Degree of Parallelism**

A single kernel is not able to fully saturate the capabilities of the GPU it is executed on.
This frequently shows in a low *occupancy*, and can in many cases also be approximated from the hardware characteristics.
In particular, a number of CUDA blocks lower than the number of SMs available and/or a very low block size are good indicators.
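These two indicators can be checked on the host before reaching for a profiler. The following is a minimal sketch assuming a 1D launch; the helper names `numBlocks` and `looksUnderParallelized` as well as the one-warp threshold (32 threads) are illustrative assumptions and not part of any CUDA or Nsight API.

```cpp
#include <cstddef>

// Number of blocks needed to cover nx elements with blockSize threads each
// (ceiling division, as commonly used for 1D kernel launches).
constexpr std::size_t numBlocks(std::size_t nx, std::size_t blockSize) {
    return (nx + blockSize - 1) / blockSize;
}

// Hypothetical heuristic: flag a launch that uses fewer blocks than SMs
// or a block size below one warp (32 threads on Nvidia GPUs).
constexpr bool looksUnderParallelized(std::size_t nx, std::size_t blockSize,
                                      std::size_t numSMs) {
    return numBlocks(nx, blockSize) < numSMs || blockSize < 32;
}
```

On a real system, the SM count would be queried at runtime, e.g. from the `multiProcessorCount` field filled in by `cudaGetDeviceProperties`, rather than hard-coded.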

Apart from attempting to restructure the computation to expose a higher degree of parallelism, one alternative remedy is running multiple kernels in parallel.
This requires the ability to model dependencies between kernels and thereby the expression of potential overlap.
More fine-grained synchronization mechanics can also be useful.

**Data Transfers between Host and Device Dominate**

If the application relies on managed memory, the first step is looking into the potential of applying prefetching.
For cases where this concept is not supported by the chosen GPU programming approach, switching to explicitly handled memory can be worthwhile.

Another potential optimization lies in overlapping data transfers and kernel execution.
As before, this requires fully asynchronous kernel execution, but in addition to that also fully asynchronous memory transfers as well as synchronization or dependency primitives between them.

**Data Sharing and Aggregation across Groups of Threads**

In many applications, the threads of a given group (block, team, gang, work-group, ...) access the same data elements multiple times.
While caching is usually able to handle this quite well, in some cases manually buffering data is the better choice.

Using the SM-local *shared memory* allows exactly that and, additionally, also synchronization between all threads of a group.

## Computational Patterns

When choosing a GPU programming approach for an application at hand, it can be vital to first identify *computational patterns* and check for their support.
* Can they be implemented at all?
* Can they be implemented concisely?
* Can they be implemented efficiently?

One frequently occurring pattern is the *reduction*, e.g. summing all elements of a vector, computing a dot product, or finding the minimum value in an array.
The key issue when parallelizing reductions lies in the naturally occurring race condition.
Consider the following CPU function and corresponding kernel:

```cpp
double reduce(double *data, size_t nx) {
    double sum = 0;

    for (size_t i0 = 0; i0 < nx; ++i0)
        sum += data[i0];

    return sum;
}
```

```cpp
__global__ void reduce(double *data, double* sum, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        *sum += data[i0];
}
```

The compound assignment looks like a single operation, but in practice it requires multiple steps:
* loading the old value of `sum` from memory and storing it in an intermediate variable (e.g. `tmp`),
* adding `data[i0]` to `tmp`, and
* writing back the value of `tmp` to `sum`.

When these operations are executed in parallel, some updates may be lost:
* Multiple threads read `sum` concurrently, then
* they modify their local versions concurrently, and
* lastly, they write back their computational result, thereby overwriting the contributions of other concurrent threads.
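The lost update can be reproduced deterministically by writing out one possible interleaving by hand. The sketch below is a sequential illustration only, not concurrent code; the function name `lostUpdateDemo` and the two-thread scenario are assumptions made for this example.

```cpp
// Emulates two "threads" executing `*sum += value` as three separate steps
// (load, add, store) in one fixed, unfortunate interleaving.
double lostUpdateDemo(double sum, double a, double b) {
    // step 1: both threads load the old value of sum
    double tmpA = sum;
    double tmpB = sum;

    // step 2: both threads add their element to their private copy
    tmpA += a;
    tmpB += b;

    // step 3: both threads write back; the second store overwrites the first
    sum = tmpA;
    sum = tmpB;

    return sum;  // holds the old value plus b only, the contribution of a is lost
}
```

Starting from `sum = 0` with contributions `1` and `2`, this interleaving yields `2` instead of the expected `3`.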

One option to fix this is making sure that the compound assignment is realized as a single *atomic* operation.

```cpp
__global__ void reduce(double *data, double* sum, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        atomicAdd(sum, data[i0]);
}
```

While this version works correctly, performance can be sub-optimal due to *atomic congestion*.
Further optimization is possible by performing a *hierarchical reduction* via one or more of the following adaptations:
* Each thread computes the sum across multiple input values.
* Each warp performs a reduction across its threads.
* Each block performs a reduction across its threads/warps.
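The structure of these levels can be sketched as a sequential host-side emulation. This is not GPU code and says nothing about performance; the function name `hierarchicalSum` and the parameters `blockSize` (assumed to be a power of two) and `itemsPerThread` are hypothetical and chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Sequential emulation of a hierarchical sum reduction.
// Assumption: blockSize is a power of two.
double hierarchicalSum(const std::vector<double> &data, std::size_t blockSize,
                       std::size_t itemsPerThread) {
    const std::size_t itemsPerBlock = blockSize * itemsPerThread;
    double sum = 0;  // stands in for the global accumulator

    for (std::size_t blockStart = 0; blockStart < data.size(); blockStart += itemsPerBlock) {
        // level 1: each "thread" privately accumulates several input values
        std::vector<double> partial(blockSize, 0.0);
        for (std::size_t t = 0; t < blockSize; ++t)
            for (std::size_t k = 0; k < itemsPerThread; ++k) {
                const std::size_t i = blockStart + k * blockSize + t;
                if (i < data.size())
                    partial[t] += data[i];
            }

        // level 2: tree reduction across the "threads" of the block
        for (std::size_t stride = blockSize / 2; stride > 0; stride /= 2)
            for (std::size_t t = 0; t < stride; ++t)
                partial[t] += partial[t + stride];

        // level 3: a single update of the global sum per block
        // (the only place where a GPU version would need an atomic)
        sum += partial[0];
    }

    return sum;
}
```

Compared to one atomic per element, this scheme performs one global update per block, which is what relieves the atomic congestion.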

Discussing implementation details for all variants is beyond the scope of this tutorial.
One accessible option, however, is implementing a block reduction using `cub`, a header-only library shipped as part of the CUDA Toolkit (or using `hipCUB` on AMD).

```cpp
#include <cub/cub.cuh>

template <unsigned int blockSize>
__global__ void reduce(double *data, double* sum, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    // define BlockReduce type with as many threads per block as we use
    typedef cub::BlockReduce <double, blockSize, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce;

    // allocate shared memory for block reduction
    __shared__ typename BlockReduce::TempStorage tempStorage;

    double elem = 0;

    if (i0 < nx)
        elem = data[i0];

    // reduce over block (all threads must participate)
    double blockSum = BlockReduce(tempStorage).Sum(elem);

    // atomically add the result to the global sum
    if (0 == threadIdx.x && i0 < nx)
        atomicAdd(sum, blockSum);
}
```

Other approaches offer reductions as well, with varying programming effort, flexibility, and performance.

### OpenMP

```cpp
double sum = 0;

#pragma omp target teams distribute parallel for \
            reduction(+ : sum)
for (size_t i0 = 0; i0 < nx; ++i0)
    sum += data[i0];
```

### OpenACC

```cpp
double sum = 0;

#pragma acc parallel loop present(data[:nx]) \
            reduction(+ : sum)
for (size_t i0 = 0; i0 < nx; ++i0)
    sum += data[i0];
```

### Modern C++

```cpp
double sum = std::reduce(std::execution::par_unseq, data, data + nx, 0., std::plus{});
```

### Thrust

```cpp
double sum = thrust::reduce(data, data + nx, 0.);
```

### Kokkos

```cpp
double sum = 0;

Kokkos::parallel_reduce(
    Kokkos::RangePolicy<>(0, nx),
    KOKKOS_LAMBDA(const size_t i0, double &acc) {
        acc += data(i0);
    }, sum);
```

### SYCL

```cpp
q.submit([&](sycl::handler &h) {
    h.parallel_for(nx, [=](auto i0) {
        auto v = sycl::atomic_ref<double, sycl::memory_order::relaxed,
                                  sycl::memory_scope::device,
                                  sycl::access::address_space::global_space>(
            sum[0]);
        v.fetch_add(data[i0]);
    });
});
```

Additional optimization, similar to CUDA, can be implemented as well.
For this, [Intel's oneAPI optimization guide](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/reduction.html) is a good starting point.

### Additional Consideration

In many cases, reducing over a set of values can be fused with their production.
Consider, e.g., computing a dot product.
A naive implementation is done in two steps:
* The result of a point-wise multiplication is stored in a temporary vector.
* A sum reduction is applied to the temporary vector.

Fusing both steps not only reduces the memory footprint, but is also expected to improve performance since less data needs to be read/written.

For most approaches discussed previously, this fusion is not difficult to implement.
Two slight exceptions are modern C++ and Thrust, where different algorithms need to be used:
* `std::transform_reduce`
* `thrust::transform_reduce` or `thrust::transform_iterator`
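For the modern C++ case, a fused dot product might look as follows. This is a minimal sketch using the sequential overload of `std::transform_reduce`; the function name `dot` is an assumption for this example. A parallel variant would pass an execution policy such as `std::execution::par_unseq` as the first argument, as in the `std::reduce` example above.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>

// Fused dot product: the element-wise products are consumed directly by the
// reduction, so no temporary vector is required.
double dot(const double *x, const double *y, std::size_t nx) {
    return std::transform_reduce(x, x + nx, y, 0.,
                                 std::plus<>{},         // reduction operator
                                 std::multiplies<>{});  // element-wise transform
}
```

For `x = {1, 2, 3}` and `y = {4, 5, 6}`, the function returns `32`.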

## Beyond 1D

Many algorithms don't build on 1D iteration spaces and data structures, but instead on multidimensional ones.
Key considerations when choosing a GPU programming approach are
* Can nD iteration spaces be parallelized intuitively?
* Can the different parts of the thread hierarchy be multidimensional? This can boost data reuse for data access patterns following a concept of neighborhood.
* Is there support for multidimensional data structures?
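Where an approach only offers 1D iteration spaces or flat arrays, multidimensional indices have to be mapped by hand. The following sketch shows one way to do this for the 2D case; the helper names `linearIndex` and `delinearize` and the layout with `i0` as the fastest-running index are assumptions for illustration.

```cpp
#include <cstddef>

// Linearizes a 2D index for a field with nx elements per row
// (i0 is the fastest-running index).
constexpr std::size_t linearIndex(std::size_t i0, std::size_t i1, std::size_t nx) {
    return i1 * nx + i0;
}

// Inverse mapping: recovers the 2D index from a flat 1D loop or thread index.
inline void delinearize(std::size_t i, std::size_t nx, std::size_t &i0, std::size_t &i1) {
    i0 = i % nx;
    i1 = i / nx;
}
```

Approaches with native multidimensional support hide this bookkeeping and may additionally map the dimensions onto the thread hierarchy.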

## Interoperability

Another topic to consider when comparing different GPU programming approaches is their interoperability.
Many of them expose an interface from and to CUDA/HIP on the corresponding platforms.

## Multi-GPU

Last but not least, scaling workloads to multiple GPUs and even multiple GPU-equipped compute nodes is a relevant topic in many HPC application areas.

This requires concepts for targeting different GPUs available on the current node, as well as interoperability with distributed memory parallelization solutions such as MPI.

## Next Step

Head to the [programming challenge](./programming-challenge.ipynb) notebook.