# Next Steps


In this section, we outline how to analyze and optimize GPU applications using Nvidia's profiling tools, and discuss patterns and considerations for further performance improvements.

## Profiling and Optimization

Profiling is essential for understanding application performance and identifying optimization opportunities.

Nvidia provides robust profiling tools suitable for all GPU programming approaches.

A common and effective strategy is the *top-down approach* to performance analysis: start with a *whole application* overview, then focus on specific *hot spots*.

Begin with Nsight Systems to get a broad performance overview, using either the command line or the GUI.
Then, use Nsight Compute to analyze the performance of individual kernels in detail.

## Nsight Systems Command Line Interface (CLI)

First, compile and run the benchmark application to ensure it produces correct results.

!nvcc -O3 -std=c++17 -arch=sm_86 -o ../build/increase/increase-cuda-expl ../src/increase/increase-cuda-expl.cu

Next, profile the binary using `nsys profile`.

Key command line arguments:
* `--stats=true`: Prints a summary of performance statistics to the command line
* `-o ...`: Specifies the output profile file
* `--force-overwrite=true`: Overwrites the profile file if it already exists

!nsys profile --stats=true -o ../profiles/increase-cuda-expl --force-overwrite=true ../build/increase/increase-cuda-expl

The most relevant sections in the profiling output are:

**CUDA API Statistics** (`cuda_api_sum`)

This section provides timing details for CUDA API calls such as `cudaMalloc`, `cudaMemcpy`, and `cudaDeviceSynchronize`.

For each API function, Nsight Systems reports:
* The relative and absolute total time spent
* The number of calls
* Statistics (min, max, average, median) for call durations

**CUDA Kernel Statistics** (`cuda_gpu_kern_sum`)

This section reports, for each kernel:
* The relative and absolute total execution time
* The number of launches
* Statistics (min, max, average, median) for kernel execution times

**Memory Transfers** (`cuda_gpu_mem_time_sum` and `cuda_gpu_mem_size_sum`)

These two sections focus on data transfers between host and device, reporting:
* The total time spent on transfers per direction, with statistics
* The total amount of data transferred, with statistics for individual transfers

## Nsight Systems Graphical User Interface (GUI)

To further investigate application behavior, open the generated `increase-cuda-expl.nsys-rep` report file from the `../profiles` folder *on your local machine*.

Download the file using the file browser on the left, or **shift** + **right-click** on this [link](../profiles/increase-cuda-expl.nsys-rep) and choose *Save Link As*.
Then, open the file with your local installation of Nsight Systems (version 2025.1.1 or newer required).

## Nsight Compute Command Line Interface (CLI)

After addressing system-level performance issues and identifying hot-spot kernels, the next step is to profile the application in detail using the Nsight Compute CLI (`ncu`).

Key command line arguments:
* `-o ...`: Specifies the output profile file (similar to `nsys`)
* `--force-overwrite`: Overwrites the profile file if it already exists (without `=true`, unlike `nsys`)

You can further limit which kernels are profiled with:
* `--launch-skip n` or `-s n`: Skips the first `n` kernels
* `--launch-count n` or `-c n`: Profiles only the first `n` applicable kernels
* `--kernel-name name` or `-k name`: Profiles only kernels with the specified name (supports regex via the `regex:` prefix, and kernel renaming)

See the [documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#profile) for a full list of arguments.

If no output file is specified, results are printed to the command line.

Note: If you do not restrict which kernels are profiled (e.g., by setting the launch count), *every kernel* will be profiled.

!ncu -s 2 -c 1  ../build/increase/increase-cuda-expl

When profiling on a remote machine and analyzing results locally, it is often beneficial to collect more data than strictly necessary.
You can do this by adding the `--set=full` argument.

!ncu -s 2 -c 1 --set=full -o ../profiles/increase-cuda-expl --force-overwrite ../build/increase/increase-cuda-expl

As before, download the resulting file (`increase-cuda-expl.ncu-rep`) using the file browser on the left or **shift** + **right-click** on this [link](../profiles/increase-cuda-expl.ncu-rep) and choose *Save Link As* to open it locally.

A comprehensive discussion of this topic is beyond this tutorial's scope.
For more details, see for example the NHR@FAU course *GPU Performance Engineering*.
The next workshop date will be announced at [https://hpc.fau.de/teaching/tutorials-and-courses](https://hpc.fau.de/teaching/tutorials-and-courses).

## Optimization Considerations

Using the performance evaluation techniques introduced above, you can identify different performance patterns.
These patterns not only guide optimization strategies, but also inform the choice of GPU programming approaches.

**Majority of Time Spent in Device Synchronization**

While this often means the GPU is doing useful work, it may also indicate that additional, independent CPU work could be overlapped.
Achieving this requires fully asynchronous kernel execution.

**Low Degree of Parallelism**

A single kernel may not fully utilize the GPU's capabilities.
This is often reflected in low *occupancy*, which can sometimes be estimated from hardware characteristics.
For example, if the number of CUDA blocks is less than the number of available SMs, or if block sizes are very small, parallelism is likely insufficient.
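
As a rough illustration of the block-count check, the following host-side sketch computes how many blocks a one-thread-per-element launch would create and compares that number to an SM count. The helper names `numBlocks` and `leavesSMsIdle` are hypothetical, and `numSMs` is an assumed input; in CUDA it could be read from the `multiProcessorCount` field of `cudaDeviceProp`.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks needed to cover nx elements with one thread per element
// (ceiling division, as commonly used for launch configurations).
inline std::size_t numBlocks(std::size_t nx, std::size_t blockSize) {
    assert(blockSize > 0);
    return (nx + blockSize - 1) / blockSize;
}

// Rough heuristic only: a grid with fewer blocks than SMs leaves some SMs idle.
// numSMs is an assumed input, not a value queried from a device here.
inline bool leavesSMsIdle(std::size_t nx, std::size_t blockSize, std::size_t numSMs) {
    return numBlocks(nx, blockSize) < numSMs;
}
```

For instance, 1000 elements with a block size of 256 yield only 4 blocks, which can occupy at most 4 SMs. Passing this check does not guarantee good occupancy; it only rules out the most obvious case of insufficient parallelism.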

To address this, consider restructuring the computation to expose more parallelism, or run multiple kernels in parallel.
This may require modeling dependencies between kernels and using finer-grained synchronization.

**Data Transfers between Host and Device Dominate**

If your application uses managed memory, first consider applying prefetching.
If prefetching is not supported by your programming approach, switching to explicit memory management may help.

Another optimization is to overlap data transfers and kernel execution.
This requires fully asynchronous kernel execution, asynchronous memory transfers, and appropriate synchronization or dependency primitives.

**Data Sharing and Aggregation across Groups of Threads**

In many applications, threads within a group (block, team, gang, work-group, etc.) repeatedly access the same data elements.
While hardware caching often helps, manual buffering can sometimes be more efficient.

Using SM-local *shared memory* enables both efficient data sharing and synchronization among threads in a group.

## Computational Patterns

When selecting a GPU programming approach, it is crucial to identify the application's *computational patterns* and check for their support:
* Can the pattern be implemented at all?
* Can it be implemented concisely?
* Can it be implemented efficiently?

A common pattern is the *reduction*, such as summing all elements of a vector, computing a dot product, or finding a minimum value in an array.

The main challenge in parallelizing reductions is the inherent race condition.
Consider the following CPU function and its corresponding kernel:

```cpp
double reduce(double *data, size_t nx) {
    double sum = 0;

    for (size_t i0 = 0; i0 < nx; ++i0)
        sum += data[i0];

    return sum;
}
```

```cpp
__global__ void reduce(double *data, double* sum, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        *sum += data[i0];
}
```

The compound assignment looks like a single operation, but it actually involves several steps:
* Loading the old value of `sum` from memory into a temporary variable (e.g., `tmp`)
* Adding `data[i0]` to `tmp`
* Writing the value of `tmp` back to `sum`

When these steps are performed in parallel, some updates may be lost:
* Multiple threads read `sum` concurrently
* Each modifies its local copy
* They write back their results, potentially overwriting each other's contributions
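
The lost update can be replayed deterministically on the host. The sketch below is not GPU code and uses no real threads; it simply executes the three steps of two hypothetical threads A and B in one problematic order. The function name `lostUpdate` is an illustrative assumption.

```cpp
#include <cassert>

// Replays one problematic interleaving of `*sum += value` for two threads.
// a and b are the values that threads A and B want to add.
inline double lostUpdate(double sum, double a, double b) {
    double tmpA = sum;     // thread A loads the old value of sum
    double tmpB = sum;     // thread B loads the same old value
    assert(tmpA == tmpB);  // both threads now work on a stale copy

    tmpA += a;             // thread A adds its contribution
    tmpB += b;             // thread B adds its contribution

    sum = tmpA;            // thread A writes back
    sum = tmpB;            // thread B writes back and overwrites A's result
    return sum;
}
```

Starting from `sum = 0` with contributions 1 and 2, this interleaving ends with 2 instead of 3: the contribution of thread A is lost.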

One solution is to ensure the compound assignment is performed as a single *atomic* operation:

```cpp
__global__ void reduce(double *data, double* sum, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    if (i0 < nx)
        atomicAdd(sum, data[i0]);
}
```

While this version is correct, performance may suffer due to *atomic congestion*.
Further optimization is possible through *hierarchical reduction*, which can include:
* Each thread summing multiple input values
* Each warp performing a reduction across its threads
* Each block performing a reduction across its threads or warps
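
To make the hierarchy concrete, here is a sequential host-side model of it. This is a sketch under simplifying assumptions, not GPU code: the loops over `t` and over blocks stand in for threads and blocks that would run in parallel on a device, the warp level is omitted, and the names `hierarchicalSum` and `itemsPerThread` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of a hierarchical sum:
//   1. each "thread" privately accumulates itemsPerThread values,
//   2. each "block" combines the partial sums of its threads,
//   3. block results are added to the final sum
//      (on a GPU, this would be one atomic update per block).
inline double hierarchicalSum(const std::vector<double> &data,
                              std::size_t blockSize, std::size_t itemsPerThread) {
    assert(blockSize > 0 && itemsPerThread > 0);

    const std::size_t nx = data.size();
    const std::size_t itemsPerBlock = blockSize * itemsPerThread;
    double sum = 0;

    for (std::size_t blockStart = 0; blockStart < nx; blockStart += itemsPerBlock) {
        double blockSum = 0;

        for (std::size_t t = 0; t < blockSize; ++t) {
            double threadSum = 0;

            // Thread t handles elements t, t + blockSize, ... within the block's range
            for (std::size_t k = 0; k < itemsPerThread; ++k) {
                const std::size_t i0 = blockStart + k * blockSize + t;
                if (i0 < nx)
                    threadSum += data[i0];
            }

            blockSum += threadSum;
        }

        sum += blockSum;
    }

    return sum;
}
```

In this model, the number of updates to the final result drops from `nx` to roughly `nx / (blockSize * itemsPerThread)`, which is the effect that relieves atomic congestion on a real device.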

A full discussion of all variants is beyond the scope of this tutorial, but a practical option is to use a block reduction with `cub`, a header-only library included in the CUDA Toolkit and the Nvidia HPC SDK (or `hipCUB` on AMD):

```cpp
#include <cub/cub.cuh>

template <unsigned int blockSize>
__global__ void reduce(double *data, double* sum, size_t nx) {
    const size_t i0 = blockIdx.x * blockDim.x + threadIdx.x;

    // Define BlockReduce type for the block size
    typedef cub::BlockReduce<double, blockSize, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY> BlockReduce;

    // Allocate shared memory for block reduction
    __shared__ typename BlockReduce::TempStorage tempStorage;

    double elem = 0;

    if (i0 < nx)
        elem = data[i0];

    // Reduce within the block (all threads *must* participate)
    double blockSum = BlockReduce(tempStorage).Sum(elem);

    // Atomically add the result to the global sum
    if (0 == threadIdx.x && i0 < nx)
        atomicAdd(sum, blockSum);
}
```

Other programming models also support reductions, with varying levels of programming effort, flexibility, and performance.

### OpenMP

```cpp
double sum = 0;

#pragma omp target teams distribute parallel for \
            reduction(+ : sum)
for (size_t i0 = 0; i0 < nx; ++i0)
    sum += data[i0];
```

### OpenACC

```cpp
double sum = 0;

#pragma acc parallel loop present(data[:nx]) \
            reduction(+ : sum)
for (size_t i0 = 0; i0 < nx; ++i0)
    sum += data[i0];
```

### Modern C++

```cpp
double sum = std::reduce(std::execution::par_unseq, data, data + nx, 0., std::plus<>{});
```

### Thrust

```cpp
double sum = thrust::reduce(data, data + nx, 0.);
```

### Kokkos

```cpp
double sum = 0;

Kokkos::parallel_reduce(
    Kokkos::RangePolicy<>(0, nx),
    KOKKOS_LAMBDA(const size_t i0, double &acc) {
        acc += data(i0);
    }, sum);
```

### SYCL

```cpp
q.submit([&](sycl::handler &h) {
    h.parallel_for(nx, [=](auto i0) {
        auto v = sycl::atomic_ref<double, sycl::memory_order::relaxed,
                                  sycl::memory_scope::device,
                                  sycl::access::address_space::global_space>(
            sum[0]);
        v.fetch_add(data[i0]);
    });
});
```

Additional optimizations, similar to CUDA, are possible.
For more information, see [Intel's oneAPI optimization guide](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/reduction.html).

### Additional Consideration

Often, reductions can be fused with the production of their input values.
For example, when computing a dot product, a naive implementation performs two steps:
* Compute the point-wise multiplication and store the result in a temporary vector
* Apply a sum reduction to the temporary vector

Fusing these steps reduces memory usage and typically improves performance by minimizing data movement.

Most of the approaches discussed above support this fusion easily.
Two exceptions are modern C++ and Thrust, which require different algorithms:
* `std::transform_reduce`
* `thrust::transform_reduce` or `thrust::transform_iterator`
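
As a sketch of the fused variant in standard C++17, the dot product can be written with a single `std::transform_reduce` call. The function name `dot` is hypothetical. The execution policy is left out here so that the snippet builds without a parallel backend; passing `std::execution::par_unseq` as the first argument, as in the `std::reduce` example above, would be the parallel form.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>

// Fused dot product: multiplies element pairs and sums the products
// in one pass, without storing a temporary vector.
inline double dot(const double *x, const double *y, std::size_t nx) {
    assert(nx == 0 || (x != nullptr && y != nullptr));

    return std::transform_reduce(x, x + nx,              // first input range
                                 y,                      // start of second input range
                                 0.0,                    // initial value of the sum
                                 std::plus<>{},          // reduction operation
                                 std::multiplies<>{});   // point-wise transformation
}
```

Compared to the two-step version, no temporary vector of length `nx` is allocated, written, or read back.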

## Beyond 1D

Many algorithms use multidimensional iteration spaces and data structures, not just 1D.
When choosing a GPU programming approach, consider:
* Can multidimensional iteration spaces be parallelized intuitively?
* Can the thread hierarchy be multidimensional? This can improve data reuse for neighborhood-based access patterns.
* Is there support for multidimensional data structures?
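
If an approach offers no multidimensional data structures, a common fallback is to linearize indices by hand. The following is a minimal host-side sketch, assuming row-major storage with `nx` elements per row; `flatIndex` and `fillRowIds` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map a 2D coordinate (ix, iy) to a position in a 1D array.
// Assumes row-major storage: consecutive ix values are adjacent in memory.
inline std::size_t flatIndex(std::size_t ix, std::size_t iy, std::size_t nx) {
    assert(ix < nx);
    return iy * nx + ix;
}

// Example 2D loop nest: store the row number in every element
// of an nx-by-ny field that is kept in a flat vector.
inline std::vector<double> fillRowIds(std::size_t nx, std::size_t ny) {
    std::vector<double> field(nx * ny);

    for (std::size_t iy = 0; iy < ny; ++iy)
        for (std::size_t ix = 0; ix < nx; ++ix)
            field[flatIndex(ix, iy, nx)] = static_cast<double>(iy);

    return field;
}
```

On a GPU, `ix` and `iy` would typically be derived from the x and y components of the block and thread indices instead of loop counters, while the index computation stays the same.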

## Interoperability

Another important consideration is interoperability.
Many GPU programming approaches provide interfaces to and from CUDA/HIP on their respective platforms.

## Multi-GPU

Scaling workloads to multiple GPUs and even multiple GPU-equipped nodes is crucial in many HPC applications.

This requires mechanisms for targeting different GPUs on a node and interoperability with distributed memory solutions such as MPI.

## Next Step

Proceed to the [programming challenge](./programming-challenge.ipynb) notebook to apply what you've learned.