# Programming Challenge

This is the final part of the course.
Most of the examples discussed are also part of the [Accelerate Programming EXamples (APEX)](https://github.com/SebastianKuckuk/apex/tree/main/src/benchmark) repository.
Feel free to check it out in case you get stuck.

## Level 0: Code Review

At this point you might feel quite overwhelmed by all the different concepts discussed.
Reflect on the different approaches by revisiting the code examples of the *increase* example at TODO.
Next, choose one or more approaches you want to try out in this code challenge.

## Level 1: Stream Benchmark

We start by accelerating a straightforward vector copy benchmark, similar to the previously discussed *increase* example.
The main difference is that we now have two arrays and copy data between them in a ping-pong fashion.
Each iteration still increases every element by one, which makes it easy to check the results for correctness.
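
The ping-pong pattern described above can be sketched in plain C++ as follows. This is only an illustration and not the content of the baseline file: the function names `streamSweep`, `runStream` and `checkStream` are hypothetical, and the sketch assumes that element `i` is initialized to `i`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One sweep: copy src into dest and add one to every element.
// All iterations are independent, so each one could be handled by one GPU thread.
void streamSweep(const std::vector<double> &src, std::vector<double> &dest) {
    assert(src.size() == dest.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dest[i] = src[i] + 1;
}

// Run nIt sweeps and swap the roles of the two arrays after each one (ping-pong).
// After the call, 'a' holds the most recent result.
void runStream(std::vector<double> &a, std::vector<double> &b, std::size_t nIt) {
    for (std::size_t it = 0; it < nIt; ++it) {
        streamSweep(a, b);
        std::swap(a, b);
    }
}

// Correctness check, assuming element i was initialized to i (an assumption of this sketch):
// after nIt sweeps it must hold i + nIt.
bool checkStream(const std::vector<double> &a, std::size_t nIt) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != static_cast<double>(i + nIt))
            return false;
    return true;
}
```

Swapping the two vectors only exchanges their internal pointers, so no additional data is copied between sweeps.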

A serial baseline *CPU-only* implementation can be found in [stream-base.cpp](../src/stream/stream-base.cpp).
As usual, it can be compiled and executed with the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/stream/stream-base ../src/stream/stream-base.cpp

In [None]:
!../build/stream/stream-base

Start by copying it to a new file, e.g. using the following cell.

In [None]:
!cp ../src/stream/stream-base.cpp ../src/stream/stream-TODO.cpp

For CUDA or SYCL, optionally include the corresponding utility header.
Set the compiler and its arguments for your chosen approach in the next cell.

In [None]:
!TODO -O3 -std=c++17 -o ../build/stream/stream-TODO ../src/stream/stream-TODO.cpp

Apply GPU parallelization and make sure that after each change the result of your benchmark application is still correct.
Can you manage to get close to the bandwidth limit of the GPU you are running on?
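
To judge how close a run comes to the bandwidth limit, the achieved bandwidth can be estimated from the amount of data moved and the measured time. The following is a minimal sketch under the assumption that each element is read once and written once per iteration. The helper name `bandwidthGBs` is hypothetical, and the statistics printed by the course utilities may be computed differently.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per element and iteration in the stream benchmark (assumption):
// one read from the source array plus one write to the destination array.
constexpr std::size_t streamBytesPerElement = 2 * sizeof(double);

// Achieved bandwidth in GB/s for nIt timed iterations over nElements elements.
double bandwidthGBs(std::size_t nElements, std::size_t nIt, std::size_t bytesPerElement, double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(nElements) * static_cast<double>(nIt)
                       * static_cast<double>(bytesPerElement);
    return bytes / seconds / 1e9;
}
```

For example, `bandwidthGBs(nx, nIt, streamBytesPerElement, elapsedSeconds)` can be compared against the peak memory bandwidth listed in the data sheet of your GPU. On CPUs, the actual memory traffic can be higher than this estimate because of write-allocate transfers.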

## Level 2: 2D Stencil

The next application to be accelerated with GPUs is a simple yet widely used benchmark: a 2D stencil application.
It can be regarded as a proxy application for (matrix-free) matrix-vector multiplications, which are ubiquitous in HPC applications.

Our baseline application solves a 2D finite difference discretization of the Laplace equation using Jacobi iterations.
It can, among many other things, be used to simulate heat distribution.

<img src="https://upload.wikimedia.org/wikipedia/commons/0/01/Heat.gif" alt="heat equation" width="50%"/>

The details are not important for this tutorial.
In essence, an update for each point of a 2D grid is computed based on the values of neighboring points.
In this particular example, only the neighbors in the cardinal directions are used, which, when visualized, looks like a stencil.
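
A single Jacobi sweep with such a stencil might look like the following sketch. It assumes a row-wise storage of the grid, a zero right-hand side, unit grid spacing and fixed boundary values. The function name `jacobiSweep` is hypothetical, and the update in the baseline file may differ in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep for the 2D Laplace equation on an nx * ny grid stored row by row.
// Each interior point becomes the average of its four cardinal neighbors.
// Boundary points are not written (assumed to hold fixed boundary values).
void jacobiSweep(const std::vector<double> &u, std::vector<double> &uNew,
                 std::size_t nx, std::size_t ny) {
    assert(u.size() == nx * ny && uNew.size() == nx * ny);
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i; // index of the center point
            uNew[c] = 0.25 * (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx]);
        }
    }
}
```

Since every point only reads from `u` and writes to `uNew`, all updates within one sweep are independent, which makes the two loops a natural candidate for a 2D GPU parallelization. As in the stream benchmark, the two arrays swap roles after each sweep.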

As before, it can be parameterized with command line arguments (cf. `parseCLA_2d` in [stencil-2d-util.h](../src/stencil-2d/stencil-2d-util.h)):
* **nx, ny**: Grid dimensions, defining total workload (`nx * ny`)
* **nWarmUp**: Number of non-timed warm-up iterations
* **nIt**: Number of timed iterations

Basic diagnostic output of performance data is available via the `printStats` function in [util.h](../src/util.h).

Perform the same steps as for the last exercise:
* Compile and run the serial CPU base version
* Copy the code to a new file and optionally include additional headers
* Set up the compilation
* Port the application to GPU
* Make sure that the result stays the same (we use the absolute value of the residual as an indicator)
* Compare the achieved bandwidth with the performance of the stream benchmark
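
For the correctness check in the steps above, one way to compute a residual is sketched below. It assumes a zero right-hand side and uses the Euclidean norm over all interior points. Both the function name `residualNorm` and this exact definition are assumptions, and the course code may define its residual differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm of the residual of the discrete Laplace equation (zero right-hand side assumed).
// For a converged solution, 4 * u(center) equals the sum of the four neighbors at every interior point.
double residualNorm(const std::vector<double> &u, std::size_t nx, std::size_t ny) {
    assert(u.size() == nx * ny);
    double sum = 0.0;
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i;
            const double r = 4.0 * u[c] - u[c - 1] - u[c + 1] - u[c - nx] - u[c + nx];
            sum += r * r;
        }
    }
    return std::sqrt(sum);
}
```

Comparing this value between the CPU baseline and the GPU port, for identical arguments, is a cheap way to detect indexing or synchronization mistakes. Small differences in the last digits can occur when the summation order changes.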

In [None]:
!g++ -O3 -march=native -std=c++17 -Wall -o ../build/stencil-2d/stencil-2d-base ../src/stencil-2d/stencil-2d-base.cpp

In [None]:
!../build/stencil-2d/stencil-2d-base 8192 8192 2 16

In [None]:
!cp ../src/stencil-2d/stencil-2d-base.cpp ../src/stencil-2d/stencil-2d-TODO.cpp

For CUDA or SYCL, optionally include the corresponding utility header.
Set the compiler and its arguments for your chosen approach in the next cell.

In [None]:
!TODO -O3 -std=c++17 -o ../build/stencil-2d/stencil-2d-TODO ../src/stencil-2d/stencil-2d-TODO.cpp

Apply GPU parallelization and make sure that after each change the result of your benchmark application is still correct.
Can you manage to get close to the bandwidth limit of the GPU you are running on?

In [None]:
!../build/stencil-2d/stencil-2d-TODO 8192 8192 2 16

## Level 3: Conjugate Gradient

The final exercise and performance challenge extends the previous numerical solver, which was implemented with matrix-free Jacobi iterations, by using the conjugate gradient method.
Being familiar with the algorithm on a deeper level is not necessary, but in case you are interested, have a look at, e.g., this [Wikipedia article](https://en.wikipedia.org/wiki/Conjugate_gradient_method#The_resulting_algorithm).
The linked page also shows an outline of the implemented algorithm, which builds on the following building blocks:
* matrix-vector products (i.e. stencil applications)
* other vector operations such as scaling and addition (i.e. similar to the stream pattern)
* vector dot products (i.e. reductions)
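
The way these three building blocks fit together can be sketched as follows. This is a minimal, unpreconditioned CG for the 5-point Laplacian and not the course's base version. All names are hypothetical, and the sketch assumes that the boundary entries of `u` and `rhs` are zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Matrix-vector product (stencil pattern): ap = A * p for the 5-point Laplacian.
// Only interior points are computed; boundary entries of ap are set to zero.
void applyStencil(const Vec &p, Vec &ap, std::size_t nx, std::size_t ny) {
    assert(p.size() == nx * ny && ap.size() == nx * ny);
    for (auto &v : ap) v = 0.0;
    for (std::size_t j = 1; j + 1 < ny; ++j)
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i;
            ap[c] = 4.0 * p[c] - p[c - 1] - p[c + 1] - p[c - nx] - p[c + nx];
        }
}

// Dot product (reduction pattern).
double dot(const Vec &a, const Vec &b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// y = y + alpha * x (stream-like pattern).
void axpy(double alpha, const Vec &x, Vec &y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Unpreconditioned CG for A * u = rhs, starting from the initial guess in u.
// Returns the norm of the final, recursively updated residual.
double conjugateGradient(Vec &u, const Vec &rhs, std::size_t nx, std::size_t ny,
                         std::size_t maxIt, double tol) {
    Vec r(rhs), ap(u.size());
    applyStencil(u, ap, nx, ny);
    axpy(-1.0, ap, r);                      // r = rhs - A * u
    Vec p(r);                               // first search direction
    double rr = dot(r, r);
    for (std::size_t it = 0; it < maxIt && rr > tol * tol; ++it) {
        applyStencil(p, ap, nx, ny);
        const double alpha = rr / dot(p, ap);
        axpy(alpha, p, u);                  // u = u + alpha * p
        axpy(-alpha, ap, r);                // r = r - alpha * A * p
        const double rrNew = dot(r, r);
        const double beta = rrNew / rr;
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + beta * p[i];      // new search direction
        rr = rrNew;
    }
    return std::sqrt(rr);
}
```

When porting to GPU, the stencil and the vector updates parallelize like the previous two exercises. The dot products need extra care, since all threads contribute to a single value and each result is required on the host to compute `alpha` and `beta`.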

The algorithm includes multiple steps, so start by carefully reviewing the serial base version [TODO]().

In [None]:
!g++ -O3 -march=native -std=c++17 -Wall -o ../build/cg/cg-base ../src/cg/cg-base.cpp

In [None]:
!../build/cg/cg-base 8192 8192 2 16

In [None]:
!cp ../src/cg/cg-base.cpp ../src/cg/cg-TODO.cpp

For CUDA or SYCL, optionally include the corresponding utility header.
Set the compiler and its arguments for your chosen approach in the next cell.

In [None]:
!TODO -O3 -std=c++17 -o ../build/cg/cg-TODO ../src/cg/cg-TODO.cpp

Apply GPU parallelization and make sure that after each change the result of your benchmark application is still correct.
Can you manage to get close to the bandwidth limit of the GPU you are running on?

In [None]:
!../build/cg/cg-TODO 8192 8192 2 16