# Programming Challenge

Welcome to the final part of the course!
This section will put your knowledge to the test with hands-on programming challenges designed to reinforce the concepts you've learned.

Many of the examples discussed are also available in the [Accelerator Programming EXamples (APEX)](https://github.com/SebastianKuckuk/apex/tree/main/src/benchmark) repository.
If you get stuck, feel free to explore the repository for inspiration and guidance.

## Level 0: Code Review

Begin by reflecting on the different approaches covered in the course.

At this point, you might feel overwhelmed by the variety of concepts discussed.
Take a moment to revisit the code examples for the *increase* example at `src/increase`.
Then, select one or more approaches you would like to try in this challenge.

## Level 1: Stream Benchmark

Your first task is to accelerate a straightforward vector copy benchmark, similar to the previously discussed *increase* example.
The main difference is that we now have two arrays, copying data between them in a ping-pong fashion.
Each element is incremented by one in every iteration, allowing you to verify correctness.
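
The ping-pong pattern described above can be sketched in plain serial C++ as follows. This is only an illustration and not the code from the baseline file: the names `streamStep` and `runStream`, the use of `std::vector<double>`, and the initialization with the element index are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: one pass copies src into dest and adds one to every element.
void streamStep(const std::vector<double> &src, std::vector<double> &dest) {
    for (std::size_t i = 0; i < src.size(); ++i)
        dest[i] = src[i] + 1.0;
}

// Runs nIt ping-pong iterations and returns the number of elements that
// deviate from the expected value (initial value + nIt).
std::size_t runStream(std::size_t nx, std::size_t nIt) {
    std::vector<double> src(nx), dest(nx, 0.0);
    for (std::size_t i = 0; i < nx; ++i)
        src[i] = static_cast<double>(i);  // assumed initialization

    for (std::size_t it = 0; it < nIt; ++it) {
        streamStep(src, dest);
        std::swap(src, dest);  // ping-pong: the output becomes the next input
    }

    // After the final swap, src holds the most recent data.
    std::size_t numErrors = 0;
    for (std::size_t i = 0; i < nx; ++i)
        if (src[i] != static_cast<double>(i) + static_cast<double>(nIt))
            ++numErrors;
    return numErrors;
}
```

Because every element grows by exactly one per iteration, the final check is a cheap way to confirm that a GPU port still touches every element in every iteration.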

A serial, CPU-only baseline implementation is provided in [stream-base.cpp](../src/stream/stream-base.cpp).
As usual, you can compile and run it using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/stream/stream-base ../src/stream/stream-base.cpp

In [None]:
!../build/stream/stream-base

Start by copying the baseline file to a new file, for example using the following command.

In [None]:
!cp ../src/stream/stream-base.cpp ../src/stream/stream-TODO.cpp

For CUDA or SYCL, you may optionally include the corresponding utility header.
Set the compiler and its arguments for your chosen approach in the next cell.

In [None]:
!TODO -O3 -std=c++17 -o ../build/stream/stream-TODO ../src/stream/stream-TODO.cpp

Apply GPU parallelization, ensuring that your benchmark application produces correct results after each change.
Can you approach the bandwidth limit of your GPU?

## Level 2: 2D Stencil

The next application to accelerate with GPUs is a simple yet widely used benchmark: a 2D stencil application.
This serves as a proxy for (matrix-free) matrix-vector multiplications, which are common in HPC applications.

Our baseline application solves a 2D finite difference discretization of the Laplace equation using Jacobi iterations.
Among other things, it can be used to simulate heat distribution.

<img src="https://upload.wikimedia.org/wikipedia/commons/0/01/Heat.gif" alt="heat equation" width="50%"/>

The details are not crucial for this tutorial.
Essentially, each point in a 2D grid is updated based on the values of its neighboring points.
In this example, only the neighbors in the cardinal directions are used, forming a stencil pattern when visualized.
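
A minimal serial sketch of such an update is shown below. It assumes a row-major grid, a zero right-hand side (so each interior point becomes the average of its four neighbors), and fixed boundary values. The name `jacobiSweep` is hypothetical, and the baseline code may differ in its details.

```cpp
#include <cstddef>
#include <vector>

// One Jacobi sweep with a 5-point stencil: every interior point of uNew becomes
// the average of its west, east, south and north neighbors in u.
// Boundary points are not modified (assumed fixed boundary values).
void jacobiSweep(const std::vector<double> &u, std::vector<double> &uNew,
                 std::size_t nx, std::size_t ny) {
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            uNew[j * nx + i] = 0.25 * (u[j * nx + i - 1] + u[j * nx + i + 1]
                                     + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
        }
    }
}
```

Since every point of `uNew` depends only on values of `u`, all interior points can be updated independently, which is what makes this pattern a good fit for GPUs.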

As before, you can parameterize the application with command line arguments (see `parseCLA_2d` in [stencil-2d-util.h](../src/stencil-2d/stencil-2d-util.h)):
* **nx, ny**: Grid dimensions, defining the total workload (`nx * ny`)
* **nWarmUp**: Number of non-timed warm-up iterations
* **nIt**: Number of timed iterations

Basic diagnostic output and performance data are available via the `printStats` function in [util.h](../src/util.h).

Follow these steps as in the previous exercise:
* Compile and run the serial CPU base version
* Copy the code to a new file and optionally include additional headers
* Set up the compilation
* Port the application to GPU
* Ensure the results remain correct (use the absolute value of the residual as an indicator)
* Compare the achieved bandwidth with the stream benchmark's performance
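
For the correctness check, a residual can be computed along the following lines. This sketch assumes the same 5-point stencil with a zero right-hand side and measures the Euclidean norm over all interior points. The name `residualNorm` is hypothetical, and the residual printed by the baseline application may be defined or scaled differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm of the residual r = 4 * u_center - (sum of the four neighbors)
// over all interior points, assuming a zero right-hand side.
double residualNorm(const std::vector<double> &u, std::size_t nx, std::size_t ny) {
    double sum = 0.0;
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const double r = 4.0 * u[j * nx + i]
                           - (u[j * nx + i - 1] + u[j * nx + i + 1]
                            + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
            sum += r * r;
        }
    }
    return std::sqrt(sum);
}
```

A GPU version that reports a noticeably different residual than the serial baseline for the same parameters most likely contains an indexing or synchronization error.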

In [None]:
!g++ -O3 -march=native -std=c++17 -Wall -o ../build/stencil-2d/stencil-2d-base ../src/stencil-2d/stencil-2d-base.cpp

In [None]:
!../build/stencil-2d/stencil-2d-base 8192 8192 2 16

In [None]:
!cp ../src/stencil-2d/stencil-2d-base.cpp ../src/stencil-2d/stencil-2d-TODO.cpp

For CUDA or SYCL, you may optionally include the corresponding utility header.
Set the compiler and its arguments for your chosen approach in the next cell.

In [None]:
!TODO -O3 -std=c++17 -o ../build/stencil-2d/stencil-2d-TODO ../src/stencil-2d/stencil-2d-TODO.cpp

Apply GPU parallelization, ensuring that your benchmark application produces correct results after each change.
Can you approach the bandwidth limit of your GPU?

In [None]:
!../build/stencil-2d/stencil-2d-TODO 8192 8192 2 16

## Level 3: Conjugate Gradient

The final exercise and performance challenge is to extend the previous numerical solver (implemented with matrix-free Jacobi iterations) using the conjugate gradient method.
While a deep understanding of the algorithm is not required, you can learn more from this [Wikipedia article](https://en.wikipedia.org/wiki/Conjugate_gradient_method#The_resulting_algorithm).

The linked page outlines the algorithm, which combines the following building blocks:
* Matrix-vector products (i.e., stencil applications)
* Other vector operations such as scaling and addition (similar to the stream pattern)
* Vector dot products (i.e., reductions)

The algorithm involves multiple steps.
Start by carefully reviewing the serial base version [cg-base.cpp](../src/cg/cg-base.cpp) and copying it to a new file.


In [None]:
!g++ -O3 -march=native -std=c++17 -Wall -o ../build/cg/cg-base ../src/cg/cg-base.cpp

In [None]:
!../build/cg/cg-base 8192 8192 2 16

In [None]:
!cp ../src/cg/cg-base.cpp ../src/cg/cg-TODO.cpp

For CUDA or SYCL, you may optionally include the corresponding utility header.
Set the compiler and its arguments for your chosen approach in the next cell.

In [None]:
!TODO -O3 -std=c++17 -o ../build/cg/cg-TODO ../src/cg/cg-TODO.cpp

Apply GPU parallelization, ensuring that your benchmark application produces correct results after each change.
If you progress quickly, also consider employing optimizations such as kernel fusion to further improve performance.
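
As one possible illustration of kernel fusion, the sketch below merges a vector update with the dot product that follows it, so the updated vector is not read from memory a second time. The function name `axpyDotFused` is hypothetical and the loop is serial. In a GPU port, this would correspond to a single kernel that includes a reduction instead of two separate kernels.

```cpp
#include <cstddef>
#include <vector>

// Fused update and reduction: performs y += alpha * x and returns dot(y, y)
// in a single pass over the data.
double axpyDotFused(double alpha, const std::vector<double> &x, std::vector<double> &y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] += alpha * x[i];  // vector update
        sum += y[i] * y[i];    // reduction on the freshly updated value
    }
    return sum;
}
```

Whether fusion pays off depends on how much memory traffic and launch overhead it removes, so measure before and after applying it.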

In [None]:
!../build/cg/cg-TODO 8192 8192 2 256

Once you achieve satisfactory performance, publish your results (execution time for **256 iterations** in ms) on the scoreboard.
The link to the scoreboard will be provided during the workshop.