# Test Case: Conjugate Gradient

In this lab, we extend our numerical solver, previously implemented with matrix-free Jacobi iterations, to use the conjugate gradient (CG) method.
A deeper familiarity with the algorithm is not necessary, but if you are interested, have a look at, e.g., this [Wikipedia article](https://en.wikipedia.org/wiki/Conjugate_gradient_method#The_resulting_algorithm).
The linked page also outlines the implemented algorithm, which builds on the following building blocks:
* matrix-vector products (i.e. stencil applications)
* other vector operations such as scaling and addition (i.e. similar to the stream pattern)
* vector dot products (i.e. reductions)
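
The three building blocks above can be combined as in the following minimal CPU sketch. It is not the course implementation: the function names (`applyStencil`, `dot`, `axpby`, `conjugateGradient`), the 5-point stencil with zero values outside the grid, and the stopping criterion are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Matrix-vector product: apply a 5-point stencil (4 on the diagonal, -1 per
// neighbor) without storing the matrix. Neighbors outside the grid count as 0.
void applyStencil(const Vec &u, Vec &Au, std::size_t nx, std::size_t ny) {
    for (std::size_t j = 0; j < ny; ++j)
        for (std::size_t i = 0; i < nx; ++i) {
            const std::size_t c = j * nx + i;
            double v = 4.0 * u[c];
            if (i > 0)      v -= u[c - 1];
            if (i + 1 < nx) v -= u[c + 1];
            if (j > 0)      v -= u[c - nx];
            if (j + 1 < ny) v -= u[c + nx];
            Au[c] = v;
        }
}

// Dot product: a reduction over all vector entries.
double dot(const Vec &a, const Vec &b) {
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) sum += a[k] * b[k];
    return sum;
}

// Stream-like vector update: y = a * x + b * y.
void axpby(double a, const Vec &x, double b, Vec &y) {
    for (std::size_t k = 0; k < y.size(); ++k) y[k] = a * x[k] + b * y[k];
}

// Solve A u = rhs starting from the given u; returns the iterations used.
int conjugateGradient(const Vec &rhs, Vec &u, std::size_t nx, std::size_t ny,
                      int maxIt, double tol) {
    Vec r(rhs.size()), p, Ap(rhs.size());
    applyStencil(u, r, nx, ny);
    axpby(1.0, rhs, -1.0, r);  // r = rhs - A u
    p = r;
    double rr = dot(r, r);
    for (int it = 0; it < maxIt; ++it) {
        if (std::sqrt(rr) <= tol) return it;  // residual norm small enough
        applyStencil(p, Ap, nx, ny);
        const double alpha = rr / dot(p, Ap);
        axpby(alpha, p, 1.0, u);    // u = u + alpha * p
        axpby(-alpha, Ap, 1.0, r);  // r = r - alpha * Ap
        const double rrNew = dot(r, r);
        axpby(1.0, r, rrNew / rr, p);  // p = r + beta * p
        rr = rrNew;
    }
    return maxIt;
}
```

Each iteration performs one stencil application, two reductions and three stream-like updates, which is why the individual steps are worth marking separately before profiling.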

Since this algorithm includes multiple steps, we first augment our baseline implementation with markers to make subsequent performance analysis easier.

## NVTX markers


We start by adding markers provided by the **NVIDIA Tools Extension Library (NVTX)**.
The project is [open source](https://github.com/NVIDIA/NVTX) and well [documented](https://nvidia.github.io/NVTX/doxygen-cpp/).

The NVIDIA HPC SDK (NVHPC) ships with a version of NVTX.
Depending on the features required, this might already be sufficient.
For this course, however, we use the latest version directly.
To obtain it, execute the following command once.
Since NVTX is a **header-only library**, no additional steps are necessary.

In [None]:
!cd ~ && \
    git clone https://github.com/NVIDIA/NVTX.git

Other paths are of course also possible in practice, but the remainder of this lab builds on this default choice.

After preparation, the next steps are adding the necessary header and modifying the code.

```cpp
#include <nvtx3/nvtx3.hpp>
```
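
As a rough sketch of what the code modification can look like, the snippet below wraps two steps in named ranges using `nvtx3::scoped_range` from the NVTX C++ API. The helper functions `axpy` and `solveStep`, the alias `ProfRange`, and the no-op fallback (used only when the NVTX header is not on the include path) are hypothetical and not taken from the course sources.

```cpp
#include <cstddef>
#include <vector>

#if __has_include(<nvtx3/nvtx3.hpp>)
#include <nvtx3/nvtx3.hpp>
using ProfRange = nvtx3::scoped_range;  // RAII range from the NVTX C++ API
#else
// Assumption: no-op stand-in so the sketch also compiles without NVTX.
struct ProfRange {
    explicit ProfRange(const char *) {}
};
#endif

// Hypothetical helper: y = a * x + y, wrapped in a named range.
void axpy(double a, const std::vector<double> &x, std::vector<double> &y) {
    ProfRange range{"axpy"};  // opens here, closes when the scope ends
    for (std::size_t k = 0; k < y.size(); ++k) y[k] += a * x[k];
}

// Hypothetical helper: ranges may nest, here "axpy" and "dot" inside "solveStep".
double solveStep(std::vector<double> &u, const std::vector<double> &p) {
    ProfRange range{"solveStep"};
    axpy(0.5, p, u);
    double sum = 0.0;
    {
        ProfRange inner{"dot"};  // a block scope limits the range to the reduction
        for (double v : u) sum += v * v;
    }
    return sum;
}
```

Because the range object closes in its destructor, early returns and exceptions still end the range correctly, which is the main reason to prefer it over manual push/pop calls.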

The course material includes a CPU serial base version as well as GPU-accelerated versions based on CUDA, OpenMP and OpenACC.
The base and CUDA versions already include the discussed changes.
Review them to see examples of using NVTX in practice.

* [cg-base.cpp](../src/cg/cg-base.cpp),
* [cg-cuda-mm.cu](../src/cg/cg-cuda-mm.cu),
* [cg-omp-target-mm.cpp](../src/cg/cg-omp-target-mm.cpp), and
* [cg-openacc-mm.cpp](../src/cg/cg-openacc-mm.cpp).

As before, parameterization via command line arguments is possible:
- **Data type**: `float` or `double`
- **nx, ny**: Grid dimensions, scaling the total workload (`nx * ny`)
- **nWarmUp**: Number of non-timed warm-up iterations
- **nIt**: Number of timed iterations

Compilation, execution and profiling with Nsight Systems can be done with the cells below.

### Base

In [None]:
!nvc++ -O3 -march=native -std=c++17 -I$HOME/NVTX/c/include ../src/cg/cg-base.cpp -o ../build/cg-base

In [None]:
!../build/cg-base double 8192 8192 2 16

In [None]:
!nsys profile --stats=true -o ../profiles/cg-base --force-overwrite=true ../build/cg-base double 8192 8192 2 16

### CUDA

In [None]:
!nvc++ -O3 -fast -std=c++17 -I$HOME/NVTX/c/include -o ../build/cg-cuda-mm ../src/cg/cg-cuda-mm.cu

In [None]:
!../build/cg-cuda-mm double 8192 8192 2 16

In [None]:
!nsys profile --stats=true -o ../profiles/cg-cuda-mm --force-overwrite=true ../build/cg-cuda-mm double 8192 8192 2 16

### OpenMP

In [None]:
!nvc++ -O3 -std=c++17 -I$HOME/NVTX/c/include -mp=gpu -target=gpu -gpu=managed -o ../build/cg-omp-target-mm ../src/cg/cg-omp-target-mm.cpp

In [None]:
!../build/cg-omp-target-mm double 8192 8192 2 16

In [None]:
!nsys profile --stats=true -o ../profiles/cg-omp-target-mm --force-overwrite=true ../build/cg-omp-target-mm double 8192 8192 2 16

### OpenACC

In [None]:
!nvc++ -O3 -std=c++17 -I$HOME/NVTX/c/include -acc=gpu -target=gpu -gpu=managed -o ../build/cg-openacc-mm ../src/cg/cg-openacc-mm.cpp

In [None]:
!../build/cg-openacc-mm double 8192 8192 2 16

In [None]:
!nsys profile --stats=true -o ../profiles/cg-openacc-mm --force-overwrite=true ../build/cg-openacc-mm double 8192 8192 2 16

## Exercise

This exercise is designed to be longer and to give you more flexibility in which techniques you want to experiment with.
The baseline implementations are already partly GPU accelerated, but lack the desired performance.
Your tasks are as follows:
* Review the code(s) and choose one version (or create an independent one).
* Profile the application and check the Nsight Systems command line output and timeline visualization for NVTX data.
* Do some POD iterations.
  * Profile: use Nsight Systems and Compute to isolate hot-spots and performance issues in the application.
  * Optimize: implement performance optimizations to address bottlenecks.
  * Deploy: check whether the results are still correct.
* Add your performance result to the leaderboard.

Note that each GPU-accelerated version includes *performance bugs*.
Apart from fixing them, here are some additional optimization ideas to get you started:
* Optimize memory transfers
* Optimize occupancy and execution configurations
* Perform reductions on GPU
  * \[CUDA\]: use optimized reductions, e.g. from CUB or Thrust
* Apply kernel fusion
  * \[CUDA\]: apply additional kernel fusion using cooperative groups
* Add alternating forwards-backwards kernels
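
To illustrate the kernel fusion idea from the list above, here is a CPU-side sketch that merges the residual update with the subsequent dot product. The function names `updateThenDot` and `fusedUpdateDot` are hypothetical, and on a GPU the fused variant would additionally need a device-side reduction.

```cpp
#include <cstddef>
#include <vector>

// Unfused: two sweeps over r, one for the update and one for the reduction.
double updateThenDot(double alpha, const std::vector<double> &Ap,
                     std::vector<double> &r) {
    for (std::size_t k = 0; k < r.size(); ++k) r[k] -= alpha * Ap[k];
    double sum = 0.0;
    for (std::size_t k = 0; k < r.size(); ++k) sum += r[k] * r[k];
    return sum;
}

// Fused: a single sweep updates r and accumulates r * r from the fresh value,
// so r is read only once instead of twice.
double fusedUpdateDot(double alpha, const std::vector<double> &Ap,
                      std::vector<double> &r) {
    double sum = 0.0;
    for (std::size_t k = 0; k < r.size(); ++k) {
        const double v = r[k] - alpha * Ap[k];
        r[k] = v;
        sum += v * v;
    }
    return sum;
}
```

Both variants compute the same result. The fused one reduces memory traffic, which tends to matter most for bandwidth-bound vector operations like these.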

## Next Step

Congratulations on finishing this course!

If you want to dive deeper, here are some topics this course did not cover:
* [NVIDIA CUDA Profiling Tools Interface (CUPTI)](https://developer.nvidia.com/cupti) provides means to profile applications programmatically.
* [AMD Tools](https://github.com/ROCm/rocprofiler-sdk)
  * Phase-out: `ROCTracer`, `ROCprofiler`, `rocprof`, and `rocprofv2`
  * Upcoming: `ROCprofiler-SDK` and `rocprofv3`
  * [ROCm Systems Profiler](https://github.com/ROCm/rocprofiler-systems) (formerly omnitrace), and [ROCm Compute Profiler](https://github.com/ROCm/rocprofiler-compute)

Here are some pointers if you want to further extend your GPU and HPC knowledge:
* [NHR@FAU](https://nhr.fau.de) offers a number of courses on different HPC related topics
  * [https://hpc.fau.de/teaching/tutorials-and-courses/](https://hpc.fau.de/teaching/tutorials-and-courses/)
* Likewise, most compute centers offer a variety of different courses, many of them online and free of charge
* NVIDIA's [On-Demand Video Collection](https://www.nvidia.com/en-us/on-demand/) contains thousands of recordings of insightful talks covering various GPU-related topics.
* [GTC](https://www.nvidia.com/gtc/) is one of the premier conferences around GPU computing and virtual attendance is usually free of charge.
