# Micro Benchmarks

Micro benchmarks are essential for:
* Gaining insights into hardware behavior and performance characteristics.
* Identifying realistic performance boundaries.
* Serving as simplified proxies for analyzing complex applications.

All benchmark implementations discussed in this lab are available as a CPU serial base version as well as GPU-accelerated versions based on CUDA, OpenMP and OpenACC.
Implementations based on additional GPU programming approaches can be found online in the [Accelerated Programming EXamples (APEX)](https://github.com/SebastianKuckuk/apex) and [APEX Generator](https://github.com/SebastianKuckuk/apex-generator) repositories.
The latter also contains automatic benchmarking capabilities, which were used to obtain the data presented in this lab.

All discussed performance results have been obtained using the CUDA version.
Their OpenMP and OpenACC counterparts *generally* show the same performance characteristics, but small differences may arise.
Using the APEX generator and the included scripts, it should be straightforward to re-run the benchmarks for various GPU programming approaches and GPUs.

## Stream Benchmark

We begin with a simple vector copy benchmark to establish realistic performance limits caused by data transfers to and from main memory (as well as the L2 cache at lower problem sizes).


The course material includes a CPU serial base version as well as GPU-accelerated versions based on CUDA, OpenMP and OpenACC:
* [stream-base.cpp](../src/stream/stream-base.cpp),
* [stream-cuda-expl.cu](../src/stream/stream-cuda-expl.cu),
* [stream-omp-target-expl.cpp](../src/stream/stream-omp-target-expl.cpp), and
* [stream-openacc-expl.cpp](../src/stream/stream-openacc-expl.cpp).

Compilation and execution can be done with the cells below.

As before, parameterization via command line arguments is possible:
- **Data type**: `float` or `double`
- **nx**: Number of elements to be copied
- **nWarmUp**: Number of non-timed warm-up iterations
- **nIt**: Number of timed iterations

### Base

In [None]:
!g++ -O3 -march=native -std=c++17 ../src/stream/stream-base.cpp -o ../build/stream-base

In [None]:
!../build/stream-base double $((64 * 1024 * 1024)) 2 8

### CUDA

In [None]:
!nvc++ -O3 -fast -std=c++17 -o ../build/stream-cuda-expl ../src/stream/stream-cuda-expl.cu

In [None]:
!../build/stream-cuda-expl double $((1024 * 1024 * 1024)) 2 8

### OpenMP

In [None]:
!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/stream-omp-target-expl ../src/stream/stream-omp-target-expl.cpp

In [None]:
!../build/stream-omp-target-expl double $((1024 * 1024 * 1024)) 2 8

### OpenACC

In [None]:
!nvc++ -O3 -std=c++17 -acc=gpu -target=gpu -o ../build/stream-openacc-expl ../src/stream/stream-openacc-expl.cpp

In [None]:
!../build/stream-openacc-expl double $((1024 * 1024 * 1024)) 2 8

### Results

<img src="img/stream-A40.png" alt="A40 stream results" width="512px" style="background-color:white"/>
<img src="img/stream-A100-SXM4-80GB.png" alt="A100 stream results" width="512px" style="background-color:white"/>
<img src="img/stream-H100.png" alt="H100 stream results" width="512px" style="background-color:white"/>

The following table compares the theoretical bandwidth limits reported by the vendor (see also the [GPU Architecture](./gpu-architecture.ipynb) notebook) with the asymptotic bandwidth observed (taking the higher value where several plateaus occur) and the maximum observed bandwidth.

| GPU         | theoretical BW | asymptotic BW |   max BW   |
| ----------- | :------------: | :-----------: | :--------: |
| A40         |    696 GB/s    |   ~650 GB/s   | ~1500 GB/s |
| A100 (80GB) |   2039 GB/s    |  ~1750 GB/s   | ~2600 GB/s |
| H100        |   2400 GB/s    |  ~2200 GB/s   | ~4100 GB/s |

Based on the results, we can see different recurring patterns.

**Slow ramp up behavior**

As discussed previously, GPUs rely on massive parallelism.
One reason for this is the oversubscription of resources to achieve *latency hiding*.
Without a sufficient number of threads, and thus bytes in flight, the sustained bandwidth decreases.
Nevertheless, this benchmark already reveals the performance to be expected for workloads that strongly under-utilize the GPU.
It can also be used to set up more tailored performance models, including roofline variants.
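The relation between bandwidth and bytes in flight can be sketched with Little's law (required concurrency = bandwidth × latency). The snippet below is only an illustration: the memory latency of 500 ns is a hypothetical value that was not measured in this lab, and `bytes_in_flight` is a helper name introduced here.

```python
def bytes_in_flight(bandwidth_gb_s: float, latency_ns: float) -> float:
    """Bytes that must be in flight to sustain a bandwidth (Little's law)."""
    # GB/s * ns = (1e9 B/s) * (1e-9 s) = bytes
    return bandwidth_gb_s * latency_ns


ASSUMED_LATENCY_NS = 500.0  # hypothetical latency, not measured in this lab

# theoretical bandwidths as listed in the table above
for gpu, bandwidth in [("A40", 696.0), ("A100 (80GB)", 2039.0), ("H100", 2400.0)]:
    required = bytes_in_flight(bandwidth, ASSUMED_LATENCY_NS)
    # simplification: each thread keeps exactly one double (8 bytes) in flight
    print(f"{gpu}: ~{required / 1024:.0f} KiB in flight, "
          f"i.e. ~{required / 8:.0f} threads with one double each")
```

Under these assumptions, problem sizes far below the resulting thread counts cannot be expected to saturate main memory.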

**Theoretical bandwidth limits are not reached**

The achievable bandwidth can depend heavily on the number of bytes read and written per thread/element.
A copy benchmark is only one option (reading and writing 4 bytes per element for `float`, or 8 bytes for `double`).
Other common options are:

| Pattern | Formula                   | Elements per LUP |
| ------- | ------------------------- | ---------------- |
| init    | A[i] = c                  | 1 store          |
| read    | discard = A[i]            | 1 load           |
| scale   | A[i] = B[i] * c           | 1 load, 1 store  |
| triad   | A[i] = B[i] + D[i] * C[i] | 3 loads, 1 store |

as well as stencil-like patterns which we will pick up again later.
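How the load and store counts of these patterns enter a bandwidth estimate can be sketched as follows. This is a minimal model, assuming that only the listed loads and stores cause traffic (additional transfers such as write-allocates are ignored); the function name and the run time used in the loop are made up for illustration and are not taken from the benchmark code.

```python
def effective_bandwidth_gb_s(nx, elem_size, loads, stores, seconds_per_iteration):
    """Estimated bandwidth in GB/s from data volume and measured run time."""
    bytes_moved = nx * elem_size * (loads + stores)
    return bytes_moved / seconds_per_iteration / 1e9


# (loads, stores) per element for the copy benchmark and the patterns above
PATTERNS = {
    "copy": (1, 1),
    "init": (0, 1),
    "read": (1, 0),
    "scale": (1, 1),
    "triad": (3, 1),
}

nx = 64 * 1024 * 1024
hypothetical_time = 0.5e-3  # seconds per iteration, made-up value

for name, (loads, stores) in PATTERNS.items():
    bw = effective_bandwidth_gb_s(nx, 8, loads, stores, hypothetical_time)
    print(f"{name:5s}: {bw:8.1f} GB/s")
```

For the same run time, patterns with more accesses per element correspond to a proportionally higher estimated bandwidth, which is why the pattern has to be stated alongside any bandwidth number.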

**Theoretical limits are exceeded**

Bandwidth tests frequently feature a distinct bump where the estimated bandwidth exceeds the theoretical limits.

### Exercise

Check the raw performance data and isolate the locations of the bumps.
Can you relate these numbers to the hardware characteristics?

### Solution

Locating the bumps and relating the corresponding total data sizes to the L2 cache sizes reveals the reason for the observed bandwidth bumps.

| GPU         | theoretical BW |   max BW   |       occurred at       | L2 Cache |
| ----------- | :------------: | :--------: | :---------------------: | :------: |
| A40         |    696 GB/s    | ~1500 GB/s | ~350k elements (5.6 MB) |   6 MB   |
| A100 (80GB) |   2039 GB/s    | ~2600 GB/s | ~1.3M elements (21 MB)  |  40 MB   |
| H100        |   2400 GB/s    | ~4100 GB/s | ~2.0M elements (32 MB)  |  50 MB   |
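The data sizes in the table can be reproduced with a few lines. The sketch assumes `double` elements (8 bytes) and two arrays (source and destination), which matches the listed sizes; `working_set_mb` is a helper introduced here for illustration.

```python
def working_set_mb(n_elements, elem_size=8, n_arrays=2):
    """Total data size in MB touched by a copy of n_elements elements."""
    return n_elements * elem_size * n_arrays / 1e6


# bump locations and L2 cache sizes (in MB) as listed in the table above
BUMPS = {
    "A40": (350_000, 6),
    "A100 (80GB)": (1_300_000, 40),
    "H100": (2_000_000, 50),
}

for gpu, (n_elements, l2_mb) in BUMPS.items():
    size = working_set_mb(n_elements)
    print(f"{gpu}: {size:.1f} MB working set, {size / l2_mb:.0%} of the L2 cache")
```

In all three cases the working set at the bump is below the L2 cache size.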

Note: while this may suggest that we implemented an L2 bandwidth benchmark, the obtained results have to be regarded with caution.
One reason is that they are usually very sensitive to varying *execution configurations*.
A more carefully designed benchmark that repeatedly and redundantly reads data that fits into the target cache size is usually a better choice.

## FMA Benchmark

Next, we complement the stream benchmark and its results by analyzing floating-point performance.
Most modern hardware provides built-in support for *fused multiply-add (FMA)* operations.
Check out one of the benchmark variants below. Each repeats a fixed number of FMAs per cell to generate a numerical workload sufficient to saturate the hardware capabilities.
* [fma-base.cpp](../src/fma/fma-base.cpp),
* [fma-cuda-expl.cu](../src/fma/fma-cuda-expl.cu),
* [fma-omp-target-expl.cpp](../src/fma/fma-omp-target-expl.cpp), and
* [fma-openacc-expl.cpp](../src/fma/fma-openacc-expl.cpp).

Compilation and execution can be done with the cells below.

### Base

In [None]:
!g++ -O3 -march=native -std=c++17 ../src/fma/fma-base.cpp -o ../build/fma-base

In [None]:
!../build/fma-base float $((1024)) 2 8

### CUDA

In [None]:
!nvc++ -O3 -fast -std=c++17 -o ../build/fma-cuda-expl ../src/fma/fma-cuda-expl.cu

In [None]:
!../build/fma-cuda-expl float $((1024 * 1024)) 2 8

### OpenMP

In [None]:
!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/fma-omp-target-expl ../src/fma/fma-omp-target-expl.cpp

In [None]:
!../build/fma-omp-target-expl float $((1024 * 1024)) 2 8

### OpenACC

In [None]:
!nvc++ -O3 -std=c++17 -acc=gpu -target=gpu -o ../build/fma-openacc-expl ../src/fma/fma-openacc-expl.cpp

In [None]:
!../build/fma-openacc-expl float $((1024 * 1024)) 2 8

### Results

<img src="img/fma-A40.png" alt="A40 fma results" width="512px" style="background-color:white"/>
<img src="img/fma-A100-SXM4-80GB.png" alt="A100 fma results" width="512px" style="background-color:white"/>
<img src="img/fma-H100.png" alt="H100 fma results" width="512px" style="background-color:white"/>

Comparing the measured performance to the theoretical maxima (all values in TFLOP/s) yields

| GPU         | asymptotic float | theor. limit | asymptotic double | theor. limit |
| ----------- | :--------------: | :----------: | :---------------: | :----------: |
| A40         |       ~19        |      37      |       ~0.58       |     0.59     |
| A100 (80GB) |       ~19        |      19      |       ~9.7        |     9.75     |
| H100        |       ~34        |      67      |        ~22        |      34      |
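As a rough cross-check, the single precision limits can be derived from the number of execution units and the clock rate, counting each FMA as two floating-point operations. The SM counts, FP32 lanes per SM and boost clocks below are taken from the public vendor specifications of the A40 and A100, not from this notebook, and should be treated as assumptions; `peak_tflops` is a helper introduced here.

```python
def peak_tflops(n_sms, lanes_per_sm, clock_ghz, flops_per_fma=2):
    """Theoretical peak in TFLOP/s: one FMA per lane and clock cycle."""
    return n_sms * lanes_per_sm * flops_per_fma * clock_ghz / 1000


# (SMs, FP32 lanes per SM, boost clock in GHz); assumed vendor specifications
SPECS = {
    "A40": (84, 128, 1.74),
    "A100 (80GB)": (108, 64, 1.41),
}

for gpu, (n_sms, lanes, clock) in SPECS.items():
    print(f"{gpu}: {peak_tflops(n_sms, lanes, clock):.1f} TFLOP/s (float)")
```

The results (about 37.4 and 19.5 TFLOP/s) agree with the rounded theoretical limits in the table.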

**Single vs. double precision performance**

The performance for different data types closely follows the hardware capabilities discussed in the [GPU Architecture](./gpu-architecture.ipynb) notebook.

**Sawtooth behavior**

### Exercise

Take another look at the raw performance data.
Can you identify the problem sizes related to the sawtooth spikes?
How can they be connected to hardware properties?

### Solution

Performance is best when the number of threads is just below a multiple of the *wave size*, i.e. the number of threads required to fill the current GPU entirely.

The observed performance pattern can be attributed to load imbalances usually referred to as *partial waves*.
Threads exceeding a full wave go to a new wave which, if not filled completely, counts as partial.
With equal execution time per thread, as is the case here, the overall run time scales with the number of waves, including partial ones.
The number of FLOPs performed, however, scales with the number of threads, so the overall *throughput* drops whenever a new partial wave is started.
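The sawtooth can be reproduced with a small model of this effect. The sketch assumes an A100 with 108 SMs and 2048 resident threads per SM at full occupancy; the actual wave size depends on the kernel's resource usage and launch configuration, and `wave_efficiency` is a helper introduced here.

```python
import math


def wave_efficiency(n_threads, wave_size):
    """Fraction of the time-equivalent thread slots doing useful work."""
    n_waves = math.ceil(n_threads / wave_size)  # partial waves count fully
    return n_threads / (n_waves * wave_size)


WAVE_SIZE = 108 * 2048  # assumed: A100 at full occupancy

for n_threads in (WAVE_SIZE, WAVE_SIZE + 1, 2 * WAVE_SIZE, 2 * WAVE_SIZE + 1):
    eff = wave_efficiency(n_threads, WAVE_SIZE)
    print(f"{n_threads:7d} threads: {eff:.3f} relative throughput")
```

In this model the throughput drops to roughly one half right after the first full wave and to roughly two thirds after the second, so the relative loss shrinks as the number of waves grows.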

## Strided Stream Benchmark

Next, we extend our previous stream benchmark to investigate the effect of different memory access patterns.
For this, we add optional **strides** when **reading** from the input vector and/or when **writing** to the output vector.
The total number of elements read and written is kept *independent of the stride*.

As with the base stream benchmark, this course includes different versions:
* [stream-strided-base.cpp](../src/stream-strided/stream-strided-base.cpp),
* [stream-strided-cuda-expl.cu](../src/stream-strided/stream-strided-cuda-expl.cu),
* [stream-strided-omp-target-expl.cpp](../src/stream-strided/stream-strided-omp-target-expl.cpp), and
* [stream-strided-openacc-expl.cpp](../src/stream-strided/stream-strided-openacc-expl.cpp).

The extended list of command line arguments is now:
- **Data type**: `float` or `double`
- **nx**: Number of elements to be copied
- **strideRead**: stride applied when reading data
- **strideWrite**: stride applied when writing data
- **nWarmUp**: Number of non-timed warm-up iterations
- **nIt**: Number of timed iterations

Compilation and execution can be done with the cells below.

### Base

In [None]:
!g++ -O3 -march=native -std=c++17 ../src/stream-strided/stream-strided-base.cpp -o ../build/stream-strided-base

In [None]:
!../build/stream-strided-base double $((64 * 1024 * 1024)) 2 1 2 8

### CUDA

In [None]:
!nvc++ -O3 -fast -std=c++17 -o ../build/stream-strided-cuda-expl ../src/stream-strided/stream-strided-cuda-expl.cu

In [None]:
!../build/stream-strided-cuda-expl double $((1024 * 1024 * 1024)) 2 1 2 8

### OpenMP

In [None]:
!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/stream-strided-omp-target-expl ../src/stream-strided/stream-strided-omp-target-expl.cpp

In [None]:
!../build/stream-strided-omp-target-expl double $((1024 * 1024 * 1024)) 2 1 2 8

### OpenACC

In [None]:
!nvc++ -O3 -std=c++17 -acc=gpu -target=gpu -o ../build/stream-strided-openacc-expl ../src/stream-strided/stream-strided-openacc-expl.cpp

In [None]:
!../build/stream-strided-openacc-expl double $((1024 * 1024 * 1024)) 2 1 2 8

### Results

<img src="img/stream-strided-A40.png" alt="A40 strided stream results" width="512px" style="background-color:white"/>
<img src="img/stream-strided-A100-SXM4-80GB.png" alt="A100 strided stream results" width="512px" style="background-color:white"/>
<img src="img/stream-strided-H100.png" alt="H100 strided stream results" width="512px" style="background-color:white"/>

**Decreased Throughput**

Generally, we distinguish two access patterns:
* *Coalesced* accesses: consecutive threads access consecutive memory locations.
* *Uncoalesced* accesses: consecutive threads access non-consecutive memory locations, either with strides or fully random.

The main issue stems from the way GPUs transfer memory between the different stages of the memory hierarchy.
In particular, transferring single bytes is generally not possible.
Instead, the GPU works with chunks of differing granularity of **up to 128 bytes**.
These chunks go by different names (e.g. cache line, sector, transfer, transaction), and their size may depend on different factors (e.g. the GPU in use, the source and destination as well as the operation).
The following figures illustrate the issue: the less coalesced the accesses, the more unused data has to be transferred, which reduces the effective bandwidth.

**A fully coalesced access** results in optimal performance.

<img src="img/coalesced-access.png" alt="coalesced access" width="512px" style="background-color:white"/>

**A partly coalesced access** with a stride of two doubles the number of bytes transferred and halves the observed bandwidth.

<img src="img/stride-2-access.png" alt="stride 2 access" width="512px" style="background-color:white"/>

**A fully uncoalesced access** results in a massive drop in performance.

<img src="img/uncoalesced-access.png" alt="uncoalesced access" width="512px" style="background-color:white"/>
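The three cases can be approximated with a simple count of the chunks touched by a group of threads. The sketch assumes a chunk size of 32 bytes and an aligned base address; as noted above, the real granularity depends on the GPU and the operation. `transfer_efficiency` is a helper introduced here for illustration.

```python
def transfer_efficiency(n_threads, stride, elem_size, chunk_size):
    """Fraction of the transferred bytes that the threads actually use."""
    # thread i accesses the element at byte offset i * stride * elem_size
    touched_chunks = {(i * stride * elem_size) // chunk_size
                      for i in range(n_threads)}
    transferred = len(touched_chunks) * chunk_size
    useful = n_threads * elem_size
    return useful / transferred


CHUNK_SIZE = 32  # bytes; assumed granularity

for stride in (1, 2, 4, 8):
    eff = transfer_efficiency(n_threads=32, stride=stride,
                              elem_size=8, chunk_size=CHUNK_SIZE)
    print(f"stride {stride}: {eff:.2f} of the transferred bytes are used")
```

In this model the efficiency halves for a stride of two and bottoms out at `elem_size / chunk_size` once every thread touches its own chunk.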

**Higher Impact on Write Operations**

When all bytes of a chunk are written, the chunk can be placed directly in L2.
If, however, only some of its bytes are written, e.g. due to uncoalesced accesses, the chunk first has to be filled with the original values from DRAM before being partially overwritten in cache.

## Strided FMA Benchmark

Lastly, we extend the previously introduced FMA benchmark with an optional stride.
While this shares certain similarities with the strided stream benchmark, the effect investigated is a different one.

As with the original FMA benchmark, this course includes different versions:
* [fma-strided-base.cpp](../src/fma-strided/fma-strided-base.cpp),
* [fma-strided-cuda-expl.cu](../src/fma-strided/fma-strided-cuda-expl.cu),
* [fma-strided-omp-target-expl.cpp](../src/fma-strided/fma-strided-omp-target-expl.cpp), and
* [fma-strided-openacc-expl.cpp](../src/fma-strided/fma-strided-openacc-expl.cpp).

Compilation and execution can be done with the cells below.

### Base

In [None]:
!g++ -O3 -march=native -std=c++17 ../src/fma-strided/fma-strided-base.cpp -o ../build/fma-strided-base

In [None]:
!../build/fma-strided-base float $((1024)) 2 2 8

### CUDA

In [None]:
!nvc++ -O3 -fast -std=c++17 -o ../build/fma-strided-cuda-expl ../src/fma-strided/fma-strided-cuda-expl.cu

In [None]:
!../build/fma-strided-cuda-expl float $((1024 * 1024)) 2 2 8

### OpenMP

In [None]:
!nvc++ -O3 -std=c++17 -mp=gpu -target=gpu -o ../build/fma-strided-omp-target-expl ../src/fma-strided/fma-strided-omp-target-expl.cpp

In [None]:
!../build/fma-strided-omp-target-expl float $((1024 * 1024)) 2 2 8

### OpenACC

In [None]:
!nvc++ -O3 -std=c++17 -acc=gpu -target=gpu -o ../build/fma-strided-openacc-expl ../src/fma-strided/fma-strided-openacc-expl.cpp

In [None]:
!../build/fma-strided-openacc-expl float $((1024 * 1024)) 2 2 8

### Results

<img src="img/fma-strided-A40.png" alt="A40 strided fma results" width="512px" style="background-color:white"/>
<img src="img/fma-strided-A100-SXM4-80GB.png" alt="A100 strided fma results" width="512px" style="background-color:white"/>
<img src="img/fma-strided-H100.png" alt="H100 strided fma results" width="512px" style="background-color:white"/>

**Idle threads waste performance**

If only a subset of the threads in a warp participates in arithmetic operations, some of the execution units cannot perform meaningful work (assuming that non-participating threads cannot overlap other work).
This translates directly into proportionally reduced performance.
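The expected scaling can be sketched by counting the participating lanes of a warp. The snippet assumes that a thread only performs FMAs if its index is a multiple of the stride, which may differ from the selection used in the benchmark sources, and it only looks at power-of-two strides, for which all warps behave identically. `active_lane_fraction` is a helper introduced here.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def active_lane_fraction(stride, warp_size=WARP_SIZE):
    """Fraction of lanes in a warp that perform arithmetic work."""
    # assumed selection rule: a lane is active if its index is a multiple of the stride
    active = sum(1 for lane in range(warp_size) if lane % stride == 0)
    return active / warp_size


for stride in (1, 2, 4, 8, 16, 32):
    fraction = active_lane_fraction(stride)
    print(f"stride {stride:2d}: {fraction:.4f} of the lanes are active")
```

Under this assumption, the achievable throughput would halve with every doubling of the stride until only a single lane per warp remains active.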

Note: branching itself is usually not harmful for performance, only the resulting *divergence* is.
Moreover, modern GPUs provide hardware support for handling thread divergence, which can help mitigate its negative impact.

## Next Step

With a better understanding of the GPU hardware and some common performance-limiting patterns, we resume the analysis of our 2D stencil application in the [Application Level Profiling](./application-level-profiling.ipynb) notebook.