# Performance Models

## Overview

Performance models provide simplified yet effective frameworks to analyze and predict the performance of GPU kernels.
Their main aims are to:
* **Predict** the performance of (parts of) applications
* **Relate** observed performance to hardware characteristics
* **Evaluate** whether spending time on (further) optimizing the code is promising
* **Guide** optimization efforts by identifying performance bottlenecks

## Bottleneck Model

The bottleneck model analyzes potential performance limiters individually, such as:
* **DRAM Bandwidth**, i.e. the speed at which data can be transferred from/to global memory.
* **Compute Throughput**, i.e. the rate at which computations can be executed.
* **Cache Bandwidth**, i.e. the speed of accessing data from different parts of the memory hierarchy, e.g. L2 cache, L1 cache, or shared memory.

Each bottleneck is modeled separately to estimate the execution time the kernel would need if that bottleneck were the only limiter.

### Steps to Apply the Bottleneck Model

1. For each potential bottleneck, calculate the time required for the kernel to execute assuming it is limited by that bottleneck.
2. Take the maximum of the individual times as the estimate for the total execution time. This assumes that the different resources are used fully concurrently.
3. Compare the predicted time to the actual run time.
   * The interpretation of the (potential) difference is similar to that of the roofline model (see below).
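
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `bottleneck_estimate`, the kernel counts, and the hardware limits (in particular the L2 bandwidth) are hypothetical placeholders that would have to be replaced with values for the actual kernel and GPU.

```python
def bottleneck_estimate(flops, dram_bytes, l2_bytes, peak_flops, dram_bw, l2_bw):
    """Return (predicted time in s, name of the limiter, time per bottleneck)."""
    # Step 1: time required if the kernel were limited by each bottleneck alone
    times = {
        "compute": flops / peak_flops,
        "dram": dram_bytes / dram_bw,
        "l2": l2_bytes / l2_bw,
    }
    # Step 2: the slowest resource determines the predicted run time
    limiter = max(times, key=times.get)
    return times[limiter], limiter, times


# Hypothetical kernel and GPU numbers, for illustration only
predicted, limiter, times = bottleneck_estimate(
    flops=7e9, dram_bytes=8e9, l2_bytes=16e9,
    peak_flops=19.5e12, dram_bw=2.0e12, l2_bw=5.0e12,
)
print(f"predicted: {predicted * 1e3:.2f} ms, limited by {limiter}")
```

Step 3 then compares `predicted` with the measured kernel run time, e.g. as reported by a profiler.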

## Roofline Model

The roofline model is one of the most widely used performance models in GPU computing.
It highlights two primary performance limiters for a kernel:
1. **Computational Throughput**, i.e. how fast the GPU can perform the required computations.
2. **Memory Bandwidth** (data throughput), i.e. how fast data can be transferred to and from execution units.

The *model* is parameterized with upper limits for both. These limits are either
- **Theoretical Peak Values** to provide optimistic performance limits based on GPU architecture (see the [GPU architecture](gpu-architecture.ipynb) notebook).
- **Benchmark Values** obtained through micro benchmarks (see the [Micro Benchmarks](micro-benchmarks.ipynb) notebook).

The *application* code is represented by its ratio of computations to memory traffic:

 **Arithmetic Intensity (AI) = FLOPs (Floating Point Operations) / Bytes Transferred**.

* **High AI** points to code that is likely bound by computational throughput.
    * Examples are (dense) matrix-matrix multiplications, fast Fourier transforms (FFT), and many operations in deep neural nets (DNN).
* **Low AI** points to code that is likely bound by memory bandwidth.
    * Examples are sparse matrix-vector multiplications (incl. stencil applications), histogram computations, and vector operations (init, copy, ...).
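
To make the difference between high and low AI concrete, the following sketch computes the arithmetic intensity of a vector update (`y[i] = a * x[i] + y[i]`) and of a dense matrix-matrix multiplication. The helper names are hypothetical, and the matrix case assumes ideal caching, i.e. each matrix is moved between DRAM and the GPU exactly once, which real implementations only approximate.

```python
def arithmetic_intensity(flops, bytes_transferred):
    """AI = FLOPs / bytes transferred."""
    return flops / bytes_transferred


def axpy_ai(bytes_per_element):
    # y[i] = a * x[i] + y[i]: 2 FLOPs, 2 loads and 1 store per element
    return arithmetic_intensity(2, 3 * bytes_per_element)


def matmul_ai(n, bytes_per_element):
    # C = A * B with n x n matrices: 2 * n^3 FLOPs.
    # Assumption: A and B are read once and C is written once (ideal caching).
    return arithmetic_intensity(2 * n**3, 3 * n**2 * bytes_per_element)


print(f"AXPY   (SP): {axpy_ai(4):.3f} FLOP/byte")
print(f"MatMul (SP, n = 4096): {matmul_ai(4096, 4):.1f} FLOP/byte")
```

The vector update has a constant, low AI regardless of the problem size, whereas the AI of the matrix multiplication grows linearly with `n` under the stated assumption.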

### Steps to Apply the Roofline Model

1. Identify GPU-specific compute and memory bandwidth limits.
2. Calculate the arithmetic intensity of the kernel to be modeled.
3. Map the kernel's AI onto the Roofline chart.
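
Mapping a kernel onto the roofline amounts to taking the minimum of the compute limit and the bandwidth limit scaled by the AI. The sketch below shows this under assumed, roughly A100-like limits (19.5 TFLOP/s and 2 TB/s) that serve purely as an illustration; the function names are hypothetical.

```python
def roofline(ai, peak_flops, mem_bw):
    """Attainable performance in FLOP/s for a given arithmetic intensity."""
    return min(peak_flops, ai * mem_bw)


def ridge_point(peak_flops, mem_bw):
    """AI at which the bandwidth roof meets the compute roof (machine balance)."""
    return peak_flops / mem_bw


# Assumed limits, for illustration only
PEAK_FLOPS = 19.5e12  # FLOP/s
MEM_BW = 2.0e12       # byte/s

for ai in (0.5, ridge_point(PEAK_FLOPS, MEM_BW), 100.0):
    limit = roofline(ai, PEAK_FLOPS, MEM_BW)
    print(f"AI = {ai:6.2f} FLOP/byte -> at most {limit / 1e12:.2f} TFLOP/s")
```

Kernels with an AI below the ridge point sit on the sloped, bandwidth-limited part of the roof; kernels above it sit on the flat, compute-limited part.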

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b1/Example_of_a_Roofline_model.svg" alt="roofline example" width="768px" style="background-color:white"/>

While setting up roofline models by hand can give valuable insights, it can also be cumbersome in practice.
One alternative is using profiling tools to automate (parts of) the process, e.g. Nsight Compute as discussed in the [Kernel Level Profiling](./kernel-level-profiling.ipynb) notebook.

<img src="img/ncu-roofline.png" alt="ncu roofline" width="768px"/>

## Exercise

* Compute the *machine balance* (ridge point of the roofline) for the **A40**, **A100** (80 GB) and **H100** (94 GB) GPUs as discussed in the [GPU Architecture](./gpu-architecture.ipynb) notebook.
* Compute the *arithmetic intensity* of the [2D stencil test case](../src/stencil-2d/stencil-2d-base.cpp).
* Do both tasks for **single** and **double precision** floating point operations.

### Solution

**Machine balance** in FLOP/byte (peak compute throughput in TFLOP/s / memory bandwidth in TB/s)

| GPU  |         SP         |          DP           |
| ---- | :----------------: | :-------------------: |
| A40  | **~53** (37 / 0.7) | **~0.8** (0.59 / 0.7) |
| A100 | **~10** (19.5 / 2) |   **~5** (9.75 / 2)   |
| H100 | **~28** (67 / 2.4) |  **~14** (34 / 2.4)   |

**Arithmetic intensity** in FLOP/byte (FLOPs / bytes transferred)

| App        |        SP        |        DP         |
| ---------- | :--------------: | :---------------: |
| Stencil-2D | **~0.9** (7 / 8) | **~0.4** (7 / 16) |
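
The numbers of the solution can be reproduced with a short script. This is a sketch under assumptions: the peak values are approximate (TFLOP/s and TB/s), and the stencil is taken to perform 7 FLOPs while moving two elements per grid point, which matches the 8 and 16 bytes used above and presumably corresponds to one load and one store with perfect caching of neighboring values.

```python
# Approximate peak values: SP and DP in TFLOP/s, memory bandwidth in TB/s
GPUS = {
    "A40":  {"sp": 37.0, "dp": 0.59, "bw": 0.7},
    "A100": {"sp": 19.5, "dp": 9.75, "bw": 2.0},
    "H100": {"sp": 67.0, "dp": 34.0, "bw": 2.4},
}
BYTES_PER_ELEMENT = {"sp": 4, "dp": 8}


def machine_balance(peak_tflops, bw_tbs):
    # TFLOP/s divided by TB/s yields FLOP/byte
    return peak_tflops / bw_tbs


def stencil_ai(bytes_per_element, flops_per_point=7, elements_per_point=2):
    # Assumption: two elements are moved per grid point
    return flops_per_point / (elements_per_point * bytes_per_element)


for name, gpu in GPUS.items():
    for precision in ("sp", "dp"):
        balance = machine_balance(gpu[precision], gpu["bw"])
        ai = stencil_ai(BYTES_PER_ELEMENT[precision])
        bound = "memory" if ai < balance else "compute"
        print(f"{name:5s} {precision}: balance {balance:5.1f}, "
              f"stencil AI {ai:.2f} -> {bound} bound")
```

In all six cases the stencil's AI lies below the machine balance, so the roofline model predicts the kernel to be bound by memory bandwidth.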

## Interpreting the Roofline Model

The roofline model offers a quick estimate of the best-case performance of your code on a given GPU, as well as of the available optimization potential.
Comparing measured performance against the roofline leads to one of three outcomes:
* Measured performance near roofline
    * suggests that the kernel is hitting hardware limits, so further optimization is only viable if the primary bottleneck itself can be addressed.
* Measured performance exceeds roofline
    * means that at least one modeling assumption does not hold, e.g.
        * the number of FLOPs is lower due to compiler optimizations such as removal of redundant computations, or
        * the number of bytes transferred is lower due to caching of data or reuse of already cached data.
* Measured performance below roofline
    * indicates either model inconsistencies, e.g. modeling for a different data type than is actually used, or
    * optimization potential that can frequently be attributed to common patterns (also discussed in more detail in the [Micro Benchmarks](./micro-benchmarks.ipynb) notebook), e.g.
        * **Uncoalesced Memory Accesses** from, e.g., strided or random access patterns. This leads to more bytes transferred than required.
        * **Occupancy Problems** due to launching too few threads, choosing non-optimal block sizes or over-utilizing common resources such as shared memory.
        * **Thread Divergence** due to divergent branching in warps.
        * **Serialization Effects** from heavy use of synchronization, or from a high degree of contention on atomic operations.
        * **Load Imbalances** either due to varying workload per thread or due to partial waves.
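
The three outcomes can be expressed as a simple classification of the ratio between measured performance and roofline limit. The sketch below is illustrative only: the function names and the 10 % tolerance are arbitrary assumptions, and the hardware limits are roughly A100-like placeholder values.

```python
def roofline_limit(ai, peak_flops, mem_bw):
    return min(peak_flops, ai * mem_bw)


def interpret(measured_flops, ai, peak_flops, mem_bw, tol=0.1):
    """Return (efficiency, verdict) for a measured performance in FLOP/s."""
    efficiency = measured_flops / roofline_limit(ai, peak_flops, mem_bw)
    if efficiency > 1.0 + tol:
        verdict = "exceeds"  # revisit the assumed FLOP and byte counts
    elif efficiency >= 1.0 - tol:
        verdict = "near"     # kernel is close to the hardware limit
    else:
        verdict = "below"    # model inconsistency or optimization potential
    return efficiency, verdict


# Assumed example: AI of 0.875 FLOP/byte, limits of 19.5 TFLOP/s and 2 TB/s
for measured in (0.7e12, 1.6e12, 2.5e12):
    eff, verdict = interpret(measured, 0.875, 19.5e12, 2.0e12)
    print(f"{measured / 1e12:.1f} TFLOP/s -> {eff:4.0%} of roofline ({verdict})")
```

The tolerance accounts for measurement noise and for the fact that theoretical peak values are rarely reached exactly; using benchmark values as limits typically allows a tighter tolerance.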

Besides being able to estimate the efficiency with which a given kernel runs on a given GPU, roofline models also allow estimating how a kernel would perform on a *different* GPU.
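
A minimal sketch of such a projection is shown below. It assumes that the kernel reaches the same fraction of the roofline on both GPUs, which is a simplification, and it uses approximate peak values (TFLOP/s and TB/s) for an A100 and an H100; the function names are hypothetical.

```python
def attainable(ai, peak_tflops, bw_tbs):
    """Roofline limit in TFLOP/s."""
    return min(peak_tflops, ai * bw_tbs)


def projected_speedup(ai, source, target):
    # Assumption: the same fraction of the roofline is reached on both GPUs
    return attainable(ai, *target) / attainable(ai, *source)


# Approximate SP limits: (peak in TFLOP/s, memory bandwidth in TB/s)
A100 = (19.5, 2.0)
H100 = (67.0, 2.4)

low_ai = projected_speedup(0.875, A100, H100)   # bandwidth-bound on both GPUs
high_ai = projected_speedup(100.0, A100, H100)  # compute-bound on both GPUs
print(f"AI = 0.875: ~{low_ai:.2f}x, AI = 100: ~{high_ai:.2f}x")
```

For a bandwidth-bound kernel, the projected speedup follows the ratio of memory bandwidths, not the ratio of peak compute throughputs.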

## Extended Roofline Models

The classical roofline model can be extended to cover additional effects, such as
* Performing computations with **different data types**, which introduces additional AI values as well as additional, potentially different computational throughput limits.
* Modeling transfers from/to different **cache levels**, which can be helpful for
    * kernels where (part of) the read data is already in the L2 cache from a previous kernel
    * kernels that are limited by the available L2 bandwidth
    * kernels that write only a small amount of memory which gets cached in L2, i.e. is not directly written to DRAM
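
One way to sketch the cache-level extension is to compute one bandwidth ceiling per memory level, each with its own AI, and take the minimum over all ceilings. The function name, the traffic volumes, and the L2 bandwidth below are hypothetical values chosen for illustration only.

```python
def extended_roofline(flops, bytes_per_level, bw_per_level, peak_flops):
    """Return (attainable FLOP/s, limiting ceiling) across several memory levels."""
    ceilings = {"compute": peak_flops}
    for level, nbytes in bytes_per_level.items():
        if nbytes > 0:
            # Each level has its own AI: FLOPs per byte moved at that level
            ceilings[level] = (flops / nbytes) * bw_per_level[level]
    limiter = min(ceilings, key=ceilings.get)
    return ceilings[limiter], limiter


# Hypothetical kernel: little DRAM traffic, but heavy L2 traffic
limit, limiter = extended_roofline(
    flops=7e9,
    bytes_per_level={"dram": 8e9, "l2": 40e9},
    bw_per_level={"dram": 2.0e12, "l2": 5.0e12},  # byte/s, assumed values
    peak_flops=19.5e12,
)
print(f"attainable: {limit / 1e12:.3f} TFLOP/s, limited by {limiter}")
```

Different data types can be handled in the same spirit by evaluating the model once per data type, each with its own peak throughput and bytes per element.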

## Next Step

Next, we use micro benchmarks to assess realistic performance limits and to investigate some of the common performance limiting patterns mentioned above.
Head over to the [Micro Benchmarks](./micro-benchmarks.ipynb) notebook to get started.