# GPU Architecture Basics

Before diving into performance analysis, we must understand how GPUs work. Key performance factors include:
* **Memory bandwidth:** How quickly data can be moved between memory and compute units.
* **Compute throughput:** How many floating-point or integer operations can be performed per second.
* **Occupancy:** Ratio of active threads to the maximum number of threads the hardware can keep resident at the same time (usually measured per streaming multiprocessor).
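
Occupancy can be computed as a simple ratio. The sketch below counts in warps instead of threads, which gives the same ratio. The limit of 64 resident warps per SM is an assumption that holds for recent NVIDIA data-center GPUs but is not stated in this notebook, and the kernel configuration is hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def occupancy(active_warps_per_sm: int, max_warps_per_sm: int = 64) -> float:
    """Ratio of resident warps to the hardware limit on one SM.

    max_warps_per_sm=64 is an assumed limit, check it for your device.
    """
    if not 0 <= active_warps_per_sm <= max_warps_per_sm:
        raise ValueError("active warps must lie between 0 and the hardware limit")
    return active_warps_per_sm / max_warps_per_sm


# Hypothetical kernel: 6 resident blocks of 256 threads each on one SM.
blocks_per_sm = 6
threads_per_block = 256
active_warps = blocks_per_sm * threads_per_block // WARP_SIZE

print(f"active warps: {active_warps}, occupancy: {occupancy(active_warps):.2f}")
```

In practice, the number of resident blocks is limited by register and shared-memory usage, which this sketch does not model.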

## Example: NVIDIA H100

<img align="right" src="img/h100-chip.png" alt="H100 Chip" width="384px"/>

A key design principle of all GPUs is their **hierarchical structure**, which is essential for achieving high levels of parallelism and scalability.
To make this concrete, we examine the **NVIDIA H100**, a state-of-the-art data-center GPU that illustrates both the hierarchical design and the hardware capabilities that determine performance.

For further details, refer to the official [NVIDIA H100 White Paper](https://resources.nvidia.com/en-us-tensor-core).
The figures and specifications below are taken from this white paper.

<img align="right" src="img/h100-layout-annotated.png" alt="H100 Chip Layout with Annotations" width="768px" style="background-color:white"/>

The diagram on the right illustrates a 'full configuration' of one H100 chip, emphasizing the hierarchical arrangement of its components:
* **Graphics Processing Clusters (GPCs)** with multiple
* **Texture Processing Clusters (TPCs)** with multiple
* **Streaming Multiprocessors (SMs)** with multiple
* **SM Sub-Partitions (SMSPs)** (see below).

In practice, not all units in the full configuration are available. For example, the SXM5 version of the H100 includes:
* 8 GPCs, each with
* 8 or 9 TPCs, each with
* 2 SMs, each with
* 4 SMSPs,
for a total of **132 Streaming Multiprocessors (SMs)** available for computation.
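
The numbers above can be cross-checked with a few lines of arithmetic. The sketch assumes that every one of the 8 GPCs has exactly 8 or 9 active TPCs and derives how many GPCs of each kind are needed to reach 132 SMs. Which physical GPCs carry 9 TPCs is not stated here and may differ between chips.

```python
N_GPCS = 8
SMS_PER_TPC = 2
TOTAL_SMS = 132

# 132 SMs with 2 SMs per TPC require 66 active TPCs.
total_tpcs = TOTAL_SMS // SMS_PER_TPC

# Find all splits (n8, n9) with n8 + n9 = 8 GPCs and 8*n8 + 9*n9 = 66 TPCs.
solutions = [
    (n8, N_GPCS - n8)
    for n8 in range(N_GPCS + 1)
    if 8 * n8 + 9 * (N_GPCS - n8) == total_tpcs
]

for n8, n9 in solutions:
    print(f"{n8} GPCs with 8 TPCs and {n9} GPCs with 9 TPCs -> {total_tpcs} TPCs")
```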

<img align="right" src="img/h100-sm-layout.png" alt="H100 SM Layout" width="384px" style="background-color:white"/>

Each SM is further subdivided into 4 sub-partitions (SMSPs), as visualized in the figure on the right, each with
* 16 INT32 units, for a total of 132 * 4 * 16 = 8448,
* 32 FP32 units, for a total of 132 * 4 * 32 = 16896,
* 16 FP64 units, for a total of 132 * 4 * 16 = 8448, and
* 1 tensor core, for a total of 132 * 4 = 528.

Each unit can execute one fused multiply-add (FMA) operation per cycle, except for the tensor cores, each of which can perform 512 mixed-precision (FP16/FP32) FMAs per cycle.
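
The unit totals follow directly from the hierarchy. The following minimal sketch recomputes them from the per-SMSP counts listed above; the variable names are chosen for illustration only.

```python
N_SMS = 132
SMSPS_PER_SM = 4

# Execution units per SM sub-partition, as listed above.
UNITS_PER_SMSP = {"INT32": 16, "FP32": 32, "FP64": 16, "Tensor Core": 1}

n_smsps = N_SMS * SMSPS_PER_SM
totals = {name: n_smsps * count for name, count in UNITS_PER_SMSP.items()}

# Each tensor core performs 512 mixed-precision FMAs per cycle.
tensor_fma_per_cycle = totals["Tensor Core"] * 512

for name, total in totals.items():
    print(f"{name:>11}: {total}")
print(f"Tensor-core FMAs per cycle (whole GPU): {tensor_fma_per_cycle}")
```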

## Theoretical Peak Performance of a GPU

To evaluate the computational capabilities of a GPU, we calculate its **theoretical peak performance**. The formula is:

$$p_{\text{peak}} = n_{\text{cores}} \cdot \frac{n_{\text{inst}}}{cy} \cdot \frac{n_{\text{OP}}}{\text{inst}} \cdot clk$$

Where:
* **$n_{\text{cores}}$**: The total number of execution units (cores) in the GPU.
* **$\frac{n_{\text{inst}}}{cy}$**: The number of instructions each core can issue per cycle.
* **$\frac{n_{\text{OP}}}{\text{inst}}$**: The number of operations, usually floating-point operations (FLOPs), executed per instruction, determined by:
  * An **FMA factor** (usually 2), as one fused multiply-add (FMA) instruction performs two operations.
  * The **SIMD width**, representing the number of parallel operations per instruction.
* **$clk$**: The clock rate or frequency of the GPU.
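
The formula translates directly into code. The helper names below are illustrative, and the device in the usage example is hypothetical.

```python
def ops_per_instruction(fma_factor: int, simd_width: int) -> int:
    """Operations per instruction: FMA factor times SIMD width."""
    return fma_factor * simd_width


def peak_performance(n_cores: int, inst_per_cycle: float,
                     ops_per_inst: int, clk_hz: float) -> float:
    """Theoretical peak performance in operations per second."""
    return n_cores * inst_per_cycle * ops_per_inst * clk_hz


# Hypothetical device: 1000 scalar cores, 1 FMA instruction per cycle, 1 GHz.
p_peak = peak_performance(
    n_cores=1000,
    inst_per_cycle=1,
    ops_per_inst=ops_per_instruction(fma_factor=2, simd_width=1),
    clk_hz=1.0e9,
)
print(f"{p_peak / 1e12:.1f} TF/s")
```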


For GPUs, performance modeling can adopt different perspectives depending on how the architecture is abstracted.
Below, we calculate $p_{\text{peak}}$ for **single-precision floating-point operations** for one H100 GPU using three views:

#### 1. Execution Units as Cores

Each execution unit is modeled as a single core:
- Total execution units: $132 \, \text{SMs} \times 4 \, \text{Sub-Partitions per SM} \times 32 \, \text{FP32 units per Sub-Partition} = 16\,896 \, \text{execution units}$.
- Instructions per cycle: $1$.
- FMA factor: $2$ (each FMA performs two FLOPs).
- Clock speed: $1.84 \, \text{GHz}$.

$$p_{\text{peak}} = 16\,896 \cdot 1 \cdot 2 \cdot 1.84 \, \text{GF/s} \approx 62 \, \text{TF/s}$$

#### 2. SMSPs as Cores

Each **Streaming Multiprocessor Sub-Partition (SMSP)** is modeled as a core with a SIMD width of 32:
- Total SMSPs: $132 \, \text{SMs} \times 4 \, \text{SMSPs per SM} = 528 \, \text{SMSPs}$.
- SIMD width: $32$.
- Instructions per cycle: $1$.
- FMA factor: $2$.
- Clock speed: $1.84 \, \text{GHz}$.

$$p_{\text{peak}} = 528 \cdot 1 \cdot (2 \cdot 32) \cdot 1.84 \, \text{GF/s} \approx 62 \, \text{TF/s}$$

#### 3. SMs as Cores

Each **Streaming Multiprocessor (SM)** is modeled as a core capable of issuing 4 instructions per cycle, with a SIMD width of 32:
- Total SMs: $132$.
- Instructions per cycle: $4$.
- SIMD width: $32$.
- FMA factor: $2$.
- Clock speed: $1.84 \, \text{GHz}$.

$$p_{\text{peak}} = 132 \cdot 4 \cdot (2 \cdot 32) \cdot 1.84 \, \text{GF/s} \approx 62 \, \text{TF/s}$$
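
All three views describe the same hardware and therefore must give the same result. The sketch below evaluates them side by side, using the numbers from the three calculations above.

```python
CLK_GHZ = 1.84

# view name -> (number of cores, instructions per cycle, operations per instruction)
VIEWS = {
    "execution units": (132 * 4 * 32, 1, 2 * 1),
    "SMSPs": (132 * 4, 1, 2 * 32),
    "SMs": (132, 4, 2 * 32),
}

# cores * inst/cycle * ops/inst * GHz yields GF/s; divide by 1000 for TF/s.
peaks_tflops = {
    name: n_cores * inst_per_cycle * ops_per_inst * CLK_GHZ / 1e3
    for name, (n_cores, inst_per_cycle, ops_per_inst) in VIEWS.items()
}

for name, tflops in peaks_tflops.items():
    print(f"{name:>16}: {tflops:.2f} TF/s")
```

The choice of view only shifts factors between the number of cores, the instructions per cycle, and the SIMD width; their product stays constant.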

#### Note on SIMD Width and Warp Size

We determine the **SIMD width** from the number of FP32 execution units per SMSP (32 on the H100).
Alternatively, the **warp size** (typically 32 for NVIDIA GPUs) can be used as the basis for modeling.
In that case, the **instructions per cycle** must be adjusted and may drop below $1$ if multiple cycles are required to finish the computations for an entire warp. For example, with 16 FP64 units per SMSP, an FP64 instruction for a 32-thread warp takes two cycles, which corresponds to $0.5$ instructions per cycle.
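
A minimal sketch of the warp-based view follows. It assumes a simplified model in which one warp instruction occupies the matching execution units of an SMSP for `WARP_SIZE / units` cycles; the function name is illustrative.

```python
WARP_SIZE = 32
N_SMSPS = 132 * 4
CLK_GHZ = 1.84
FMA_FACTOR = 2


def warp_view_peak_tflops(units_per_smsp: int) -> float:
    """Peak in TF/s with SMSPs as cores and the warp size as SIMD width."""
    # Fewer units than threads in a warp means less than one instruction per cycle.
    inst_per_cycle = units_per_smsp / WARP_SIZE
    return N_SMSPS * inst_per_cycle * (FMA_FACTOR * WARP_SIZE) * CLK_GHZ / 1e3


fp32_peak = warp_view_peak_tflops(units_per_smsp=32)  # 1.0 instructions per cycle
fp64_peak = warp_view_peak_tflops(units_per_smsp=16)  # 0.5 instructions per cycle

print(f"FP32: {fp32_peak:.2f} TF/s, FP64: {fp64_peak:.2f} TF/s")
```

For FP32 this reproduces the result of the three views above, and for FP64 it yields half of that value.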

## Memory Hierarchy

<img align="right" src="img/h100-memory-abstract.png" alt="memory abstraction" width="512px"/>

The GPU memory system is organized into a hierarchy designed to balance capacity, speed, and access latency.
Its levels, using the H100 as an example, are listed below (latency and bandwidth values are taken from [GPU Benches](https://github.com/RRZE-HPC/gpu-benches)):
- **Registers**
    - Scope: Private to each thread.
    - Latency: Typically 1 cycle.
    - Capacity: Each SM has 256 KB available for registers.
- **L1 Cache** and **Shared Memory**
    - Scope: Shared among all threads scheduled on a given SM. Threads of a given block can access the same shared memory.
    - Latency: ~32 cycles.
    - Capacity: 256 KB per SM in total, of which up to 228 KB can be configured as shared memory; the remaining space is used as L1 cache.
- **L2 Cache**
    - Scope: Shared among all threads (GPU-wide).
    - Latency: ~280 cycles.
    - Capacity: 50 MB (two partitions with 25 MB each).
- **Global Memory (DRAM)**
    - Scope: Global to all threads.
    - Latency: ~690 cycles.
    - Capacity: 80 or 96 GB.
    - Bandwidth: up to 3.35 TB/s.
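
To get a feeling for these latencies, the cycle counts can be converted to nanoseconds. The sketch assumes that the cycles refer to the 1.84 GHz clock used in the peak performance calculation; the clock at which the latencies were measured is not stated here, so the results are rough estimates.

```python
CLK_GHZ = 1.84  # assumed clock; the measurement clock may differ

# Approximate latencies in cycles, as listed above.
LATENCY_CYCLES = {
    "Registers": 1,
    "L1 / Shared Memory": 32,
    "L2 Cache": 280,
    "Global Memory (DRAM)": 690,
}

# cycles / (cycles per nanosecond) = nanoseconds
latency_ns = {level: cycles / CLK_GHZ for level, cycles in LATENCY_CYCLES.items()}

for level, ns in latency_ns.items():
    print(f"{level:>22}: {ns:7.1f} ns")
```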

<img align="right" src="img/interconnect.png" alt="interconnect" width="384px"/>

Beyond accessing data already residing in GPU memory, applications also need to transfer data between host and device as well as between devices.
These connection paths are usually much slower than on-device memory access and can become serious bottlenecks in applications.

| Connection Type  | Bandwidth             | Direction     | Purpose                                   |
|------------------|-----------------------|---------------|-------------------------------------------|
| **DRAM**         | 3.35 TB/s             | Bidirectional | Internal memory access for GPU operations |
| **PCIe 5.0 x16** | 63 GB/s               | Per direction | Communication with CPU                    |
| **NVLink**       | 450 GB/s              | Per direction | High-speed data sharing between GPUs      |
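
The impact of these differences becomes clear when estimating transfer times. The sketch below is a pure bandwidth model that ignores latency and protocol overhead, so real transfers will take longer. The data size of 10 GB is an arbitrary example, and the DRAM value uses the 3.35 TB/s quoted in the memory hierarchy list.

```python
# Bandwidths in GB/s (per direction for PCIe and NVLink).
BANDWIDTH_GB_S = {
    "DRAM": 3350.0,
    "PCIe 5.0 x16": 63.0,
    "NVLink": 450.0,
}


def transfer_time_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound for the transfer time in milliseconds."""
    return size_gb / bandwidth_gb_s * 1e3


size_gb = 10.0  # arbitrary example size
times_ms = {name: transfer_time_ms(size_gb, bw) for name, bw in BANDWIDTH_GB_S.items()}

for name, t in times_ms.items():
    print(f"{name:>12}: {t:8.2f} ms")
```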


## Comparison of Different GPUs

### NVIDIA

|                             |  V100 PCIe 32 GB   |      RTX 3080       |       A40 PCIe       | A100 SXM 40 GB \| 80 GB |    H100 SXM 94 GB     |    H100 PCIe 96 GB    |
| --------------------------- | :----------------: | :-----------------: | :------------------: | :---------------------: | :-------------------: | :-------------------: |
| Availability                |  NHR@FAU TinyGPU   |   NHR@FAU TinyGPU   |     NHR@FAU Alex     |      NHR@FAU Alex       |     NHR@FAU Helma     |          ---          |
| Compute Capability          |        7.0         |         8.6         |         8.6          |           8.0           |          9.0          |          9.0          |
| #Cores                      | 5120<br/>(80 * 64) | 8704<br/>(68 * 128) | 10752<br/>(84 * 128) |   6912<br/>(108 * 64)   | 16896<br/>(132 * 128) | 16896<br/>(132 * 128) |
| FP32 Performance \[TFLOPS\] |         14         |         30          |          37          |           19            |          67           |          62           |
| FP64 Performance \[TFLOPS\] |         7          |        0.47         |         0.59         |          9.75           |          34           |          31           |
| FP64:FP32 Ratio             |        1:2         |        1:64         |         1:64         |           1:2           |          1:2          |          1:2          |
| Memory \[GB\]               |         32         |         10          |          48          |        40 \| 80         |          94           |          96           |
| Bandwidth \[GB/s\]          |        897         |         760         |         696          |      1555 \| 2039       |         2400          |         3360          |
| L2 Cache \[MB\]             |         6          |          5          |          6           |           40            |          50           |          50           |
| TDP \[W\]                   |        250         |         320         |         300          |           400           |          700          |          700          |
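
The peak performance formula can also be inverted to estimate the clock rate behind the FP32 numbers in the table. The sketch assumes one FMA per core per cycle; since the table values are rounded, the resulting clocks are only approximate.

```python
# GPU -> (number of cores, FP32 peak in TFLOPS), taken from the table above.
GPUS = {
    "V100 PCIe 32 GB": (5120, 14.0),
    "RTX 3080": (8704, 30.0),
    "A40 PCIe": (10752, 37.0),
    "A100 SXM": (6912, 19.0),
    "H100 SXM 94 GB": (16896, 67.0),
    "H100 PCIe 96 GB": (16896, 62.0),
}


def implied_clock_ghz(n_cores: int, fp32_tflops: float) -> float:
    """Clock in GHz that yields the given peak with one FMA per core and cycle."""
    return fp32_tflops * 1e3 / (n_cores * 2)


implied = {name: implied_clock_ghz(*spec) for name, spec in GPUS.items()}

for name, clk in implied.items():
    print(f"{name:>16}: ~{clk:.2f} GHz")
```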

### AMD

|                             |        MI100         |        MI210         |        MI250X        |        MI300X        |        MI300A        |
| --------------------------- | :------------------: | :------------------: | :------------------: | :------------------: | :------------------: |
| Availability                | NHR@FAU Test Cluster | NHR@FAU Test Cluster |        LUMI-G        | NHR@FAU Test Cluster | NHR@FAU Test Cluster |
| #Cores                      | 7680<br/>(120 * 64)  | 6656<br/>(104 * 64)  | 14080<br/>(220 * 64) | 19456<br/>(304 * 64) | 14592<br/>(228 * 64) |
| FP32 Performance \[TFLOPS\] |          23          |          23          |          48          |         163          |         123          |
| FP64 Performance \[TFLOPS\] |          12          |          23          |          48          |          82          |          61          |
| FP64:FP32 Ratio             |         1:2          |         1:1          |         1:1          |         1:2          |         1:2          |
| Memory \[GB\]               |          32          |          64          |         128          |         192          |         128          |
| Bandwidth \[GB/s\]          |         1229         |         1638         |         3277         |         5300         |         5300         |
| L2 Cache \[MB\]             |          8           |          16          |          16          |          16          |          4           |
| Infinity Cache \[MB\]       |         ---          |         ---          |         ---          |         256          |         256          |
| TDP \[W\]                   |         300          |         300          |         500          |         750          |         550          |

## Host-to-Device (H2D) and Device-to-Device (D2D) Bandwidth

### NVIDIA

|                                    |  Volta  | Ampere  | Hopper  | 
|------------------------------------|:-------:|:-------:|:-------:|
| Compute Capability                 |   7.x   |   8.x   |   9.x   |
| PCIe                               | 3.0 x16 | 4.0 x16 | 5.0 x16 |
| Bandwidth (per direction) \[GB/s\] |  15.8   |  31.5   |  63.0   |
| NVLink (per direction) \[GB/s\]    |   150   |   300   |   450   |
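
The PCIe bandwidths in the table can be reproduced from the link parameters. The per-lane transfer rates and the 128b/130b line encoding used below come from the PCIe specifications, not from this notebook, and further protocol overhead is ignored, so achievable bandwidths are lower.

```python
LANES = 16
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding (PCIe 3.0 to 5.0)

# Raw transfer rate per lane in GT/s for each PCIe generation.
TRANSFER_RATE_GT_S = {"3.0": 8, "4.0": 16, "5.0": 32}


def pcie_bandwidth_gb_s(rate_gt_s: float, lanes: int = LANES) -> float:
    """Theoretical bandwidth per direction in GB/s."""
    # One transfer carries one bit per lane; 8 bits make one byte.
    return rate_gt_s * lanes * ENCODING_EFFICIENCY / 8


pcie_bandwidth = {gen: pcie_bandwidth_gb_s(rate) for gen, rate in TRANSFER_RATE_GT_S.items()}

for gen, bw in pcie_bandwidth.items():
    print(f"PCIe {gen} x16: {bw:.1f} GB/s per direction")
```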

### AMD

|                                                      |  MI100  |  MI210  | MI250X  |
| ---------------------------------------------------- | :-----: | :-----: | :-----: |
| PCIe                                                 | 4.0 x16 | 4.0 x16 | 4.0 x16 |
| Bandwidth (per direction) \[GB/s\]                   |  31.5   |  31.5   |  31.5   |
| Infinity Fabric (inside GPU, per direction) \[GB/s\] |   ---   |   ---   |   400   |
| Infinity Fabric D-D (per direction) \[GB/s\]         |   100   |   150   |   300   |

## Next Step

Next, we will look at performance models to relate the theoretical peak performance to application characteristics.
Head over to the [Performance Models](./performance-models.ipynb) notebook to get started.