# Use Case: 2D Stencil

## CPU Baseline

We begin with a simple yet widely used benchmark: a 2D stencil application.
It can be regarded as a proxy application for (matrix-free) matrix-vector multiplications, which are ubiquitous in HPC applications.

A serial baseline *CPU-only* implementation can be found in [stencil-2d-base.cpp](../src/stencil-2d/stencil-2d-base.cpp).
Reviewing the implementation reveals these key points:
* The main workload is encapsulated in the `stencil2d` function.
* The application can be parameterized with command line arguments (see `parseCLA_2d` in [stencil-2d-util.h](../src/stencil-2d/stencil-2d-util.h)):
  - **Data type**: `float` or `double`
  - **nx, ny**: Grid dimensions, defining the total workload (`nx * ny` grid points)
  - **nWarmUp**: Number of non-timed warm-up iterations
  - **nIt**: Number of timed iterations
* Basic diagnostic output of performance data is available via the `printStats` function in [util.h](../src/util.h).
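
To make the first key point more concrete, the kernel of such an application could look roughly like the sketch below. It assumes a 5-point, Jacobi-style averaging stencil on a row-major grid. The name `stencil2dSketch`, its signature, and the coefficients are assumptions made for illustration and need not match the actual `stencil2d` function in the linked source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of one 2D 5-point stencil sweep (not the course's code).
// u holds the input grid, uNew receives the result, both stored row by row.
template <typename tpe>
void stencil2dSketch(const std::vector<tpe>& u, std::vector<tpe>& uNew,
                     std::size_t nx, std::size_t ny) {
    assert(u.size() == nx * ny && uNew.size() == nx * ny);

    // Update interior points only; boundary values in uNew stay untouched.
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i;  // linear index of point (i, j)
            uNew[c] = tpe(0.25) * (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx]);
        }
    }
}
```

Each interior point reads four neighbors and writes one value. This access pattern resembles applying a sparse matrix to a vector without ever storing the matrix.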

After reviewing the code, we can use the following commands to compile and execute the application:

In [None]:
!g++ -O3 -march=native -std=c++17 ../src/stencil-2d/stencil-2d-base.cpp -o ../build/stencil-2d-base

In [None]:
!../build/stencil-2d-base double 8192 8192 2 256
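
Judging by the parameter list above, the last two arguments of this command presumably set `nWarmUp` and `nIt`. The following sketch shows one way to separate warm-up from timed iterations and to turn the elapsed time into a throughput figure. The helpers `timeSweeps` and `mlups` are hypothetical and are not taken from `util.h`; the actual `printStats` may report different metrics.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls sweep() nWarmUp times without timing, then nIt
// times with timing, and returns the wall-clock seconds of the timed part.
template <typename Sweep>
double timeSweeps(Sweep&& sweep, std::size_t nWarmUp, std::size_t nIt) {
    for (std::size_t it = 0; it < nWarmUp; ++it) sweep();

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t it = 0; it < nIt; ++it) sweep();
    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical metric: million lattice (grid point) updates per second.
inline double mlups(std::size_t nx, std::size_t ny, std::size_t nIt,
                    double seconds) {
    assert(seconds > 0.0);
    return double(nx) * double(ny) * double(nIt) / seconds / 1e6;
}
```

A call such as `timeSweeps([&] { /* one stencil sweep */ }, 2, 256)` would mirror the command above. Excluding the warm-up iterations keeps one-time costs, such as first-touch page faults, out of the measurement.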

Next, we introduce OpenMP parallelization to enhance *CPU* performance.
The updated version is available in [stencil-2d-omp-host.cpp](../src/stencil-2d/stencil-2d-omp-host.cpp), and can be compiled and executed using:

In [None]:
!g++ -O3 -march=native -std=c++17 -fopenmp ../src/stencil-2d/stencil-2d-omp-host.cpp -o ../build/stencil-2d-omp-host

In [None]:
!../build/stencil-2d-omp-host double 8192 8192 2 256

Depending on the chosen parameters and the hardware used, the effect of parallelization varies: the speedup may be substantial or negligible, and you might even observe a slowdown.
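
A host-parallel version of a stencil sweep can be sketched as follows. It again assumes a 5-point averaging stencil; the name `stencil2dOmpHostSketch` and its signature are hypothetical, and the actual code in `stencil-2d-omp-host.cpp` may differ, for example in the scheduling clause.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of an OpenMP-parallel sweep (not the course's code).
template <typename tpe>
void stencil2dOmpHostSketch(const std::vector<tpe>& u, std::vector<tpe>& uNew,
                            std::size_t nx, std::size_t ny) {
    assert(u.size() == nx * ny && uNew.size() == nx * ny);
    if (nx < 3 || ny < 3) return;  // no interior points to update

    // Rows are independent of each other, so they can be split across threads.
    // When compiled without -fopenmp, the pragma is ignored and the loop
    // runs serially with identical results.
    #pragma omp parallel for schedule(static)
    for (std::size_t j = 1; j < ny - 1; ++j) {
        for (std::size_t i = 1; i < nx - 1; ++i) {
            const std::size_t c = j * nx + i;
            uNew[c] = tpe(0.25) * (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx]);
        }
    }
}
```

The number of threads can be set with the standard `OMP_NUM_THREADS` environment variable, which is one of the factors that influence the observed speedup.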

## First Attempt at GPU Acceleration

To offload computations to the GPU, we extend the code with OpenMP target offloading.
In a first attempt, we limit code changes to the `stencil2d` function.
The updated version is in [stencil-2d-omp-target-v0.cpp](../src/stencil-2d/stencil-2d-omp-target-v0.cpp).
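
One plausible shape of such a function-local change is sketched below. The stencil, the name `stencil2dOmpTargetSketch`, and the choice of clauses are assumptions for illustration; the linked `stencil-2d-omp-target-v0.cpp` is the authoritative version and may be written differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a stencil sweep offloaded with OpenMP target
// directives (not the course's code). All changes stay inside the function.
template <typename tpe>
void stencil2dOmpTargetSketch(const tpe* u, tpe* uNew,
                              std::size_t nx, std::size_t ny) {
    assert(u != nullptr && uNew != nullptr);
    if (nx < 3 || ny < 3) return;  // no interior points to update
    const std::size_t n = nx * ny;

    // map(to: ...) copies the input to the device before the loop;
    // map(tofrom: ...) copies uNew in both directions so that boundary
    // values, which the loop does not write, are preserved.
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: u[0:n]) map(tofrom: uNew[0:n])
    for (std::size_t j = 1; j < ny - 1; ++j) {
        for (std::size_t i = 1; i < nx - 1; ++i) {
            const std::size_t c = j * nx + i;
            uNew[c] = tpe(0.25) * (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx]);
        }
    }
}
```

If no device is available, or offloading is not enabled at compile time, an OpenMP implementation typically falls back to running the loop on the host, so the results stay the same.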

In [None]:
!nvc++ -O3 -march=native -std=c++17 -mp=gpu -target=gpu ../src/stencil-2d/stencil-2d-omp-target-v0.cpp -o ../build/stencil-2d-omp-target-v0

In [None]:
!../build/stencil-2d-omp-target-v0 double 8192 8192 2 256

Perhaps surprisingly, this first GPU version performs worse than the CPU baseline.
If you have experience with OpenMP target offloading, this outcome may not surprise you, but for now we pretend not to know the underlying issues.

More importantly, we can already *evaluate* performance, which is useful for comparing different variants of the same application.
What is missing, however, are the answers to the following questions:
* Could performance be better for this particular combination of hardware and software?
* If so, how can we pinpoint which optimizations are needed, and where, to improve performance?

## Next Step

The rest of this course covers ways of *modeling* performance, tools and techniques to verify these models, and optimizations that follow from the insights gained.
The core goal is to identify performance bottlenecks and explore ways to mitigate or shift them for better efficiency.

Before we start modeling, however, it is worthwhile to take a closer look at the underlying GPU architecture.
Head over to the [GPU Architecture](./gpu-architecture.ipynb) notebook to get started.