

## **Energy-Efficient GPU Computing**

Ben van Werkhoven

Alessio Sclocco

Stijn Heldens

Floris-Jan Willemsen







### Introduction: Ben van Werkhoven

Ben van Werkhoven Assistant Professor at LIACS, Leiden University b.j.c.van.werkhoven@liacs.leidenuniv.nl



#### Research interests:

High-Performance Computing, GPU programming, automatic performance tuning (auto-tuning), optimization techniques, energy efficiency, mixed-precision and accuracy, performance modeling, concurrency and multi-threading, research software engineering

#### Background:

- 2010-2014 PhD "Scientific Supercomputing with Graphics Processing Units" at VU Amsterdam
- 2014-2023 Research Software Engineer at the Netherlands eScience Center
- 2023-now Assistant Professor at LIACS, Leiden University

## Introduction: Alessio Sclocco

Research Software Engineer @ Netherlands eScience Center

#### a.sclocco@esciencecenter.nl

#### Background:

- 2011-2012 junior researcher at VU Amsterdam
  - Working on GPUs for radio astronomy
- 2012-2017 PhD "Accelerating Radio Astronomy with Auto-Tuning" at VU Amsterdam
  - Under the supervision or professors Henri Bal and Rob van Nieuwpoort
- 2015-2016 scientific programmer at ASTRON, the Netherlands Institute for Radio Astronomy
  - Designing and developing a real-time GPU pipeline for the Westerbork radio telescope
- 2019 visiting scholar at Nanyang Technological University in Singapore
  - Real-time tracking of social insects
- 2017-2023 Research Software Engineer at the Netherlands eScience Center
  - Radio astronomy, climate modeling, biology, natural language processing, high-energy physics



# Introduction: Stijn Heldens

- Stijn Heldens (s.heldens@esciencecenter.nl)
  - Research Software Engineer (RSE)
  - Research interests in HPC include:
     GPU programming, parallel algorithms,
     distributed systems, and scalable programming models.
  - background:
    - 2022-now, Research Software Engineer at Netherlands eScience Center
    - 2018-2022, PhD candidate on scalable programming models
    - 2016-2017, Researcher at University of Twente
    - 2015-2016, Researcher at Delft University of Technology
    - 2012-2015, MSc Computer science at VU University Amsterdam



## Introduction: Floris-Jan Willemsen

- Floris-Jan Willemsen
  - PhD Candidate at LIACS (Leiden University) and Netherlands eScience Center
  - Research focusses on optimization algorithms in auto-tuning and efficient search space resolution
  - Get in touch via fjwillemsen.com or f.j.willemsen@esciencecenter.nl



| 13:30 - | 13:35 | Opening and welcome                                |
|---------|-------|----------------------------------------------------|
|         |       | Introduction Energy Efficient GPU Computing        |
| 14:00 - | 14:15 | Introduction to auto-tuning with Kernel Tuner      |
| 14:15 - | 14:30 | First hands-on session                             |
|         |       |                                                    |
| 14:30 - | 15:00 | Code optimization techniques for energy efficiency |
| 15:00 - | 15:30 | Coffee break                                       |
| 15:30 - | 15:45 | Second hands-on session                            |
|         |       |                                                    |
| 15:45 - | 16:15 | Mixed precision programming techniques             |
| 16:15 - | 16:30 | Third hands-on session                             |
|         |       |                                                    |
| 16:30 - | 16:45 | Optimizing GPU core clock frequency                |
| 16:45 - | 16:55 | Fourth hands-on session                            |
| 16:55 – | 17:00 | Closing remarks                                    |

- This tutorial is organized in four parts, each part consisting of a presentation followed by a hands-on exercise
- The slides are verbose on purpose, and you can use them as references while working on the exercises or after the tutorial

- We will use Google Colab for the hands-on sessions, so you don't need to have access to a GPU or install anything locally
  - But you are free to experiment on your own system
- Please download the latest version of slides and hands-on notebooks here:
  - https://github.com/KernelTuner/kernel\_tuner\_tutorial



# Energy Efficient GPU Computing



# LLM Training emissions

#### CO2 Equivalent Emissions (Tonnes) by Selected Machine Learning Models and Real Life Examples, 2022



#### What is 500 tons of CO2?

#### Roughly equal to:

- 8,268 tree seedlings grown for 10 years
- \$80000 in electricity bill
- 63 homes' energy use for a year in the US
- 111 passenger cars driving around for a year in the US

• Less than 2 days of running the Frontier supercomputer ...



# Energy cost of supercomputers

Frontier: #1 in TOP500 list (Jun 2023)

- #6 Green500 (Jun 2023)
- 20 Megawatt continuously
- \$40 million annual electricity bill
- 100,000 metric tons of CO2 annually
- ~20,000 cars on the road for a year in US

Summit: (#5, Frontier's predecessor)

• 64% of energy is consumed by GPUs



Efficient Computation through Tuned Approximation
David Keyes, SIAG/SC Supercomputing Spotlights 2022

Autotuning based on frequency scaling toward energy efficiency of blockchain algorithms on graphics processing units M. Stachowski, A. Fiebig, and T. Rauber, Journal of Supercomputing, 2020.



Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead Capra et al. 2020 IEEE Access



Summarizing CPU and GPU Design Trends with Product Data Sun et al. 2020

#### • Nvidia H100 GPU:

• Energy: 350 Watt

• Surface: 8.14 cm<sup>2</sup>

• Heat dissipation: 43.0 Watt/cm<sup>2</sup>

#### • Light bulb:

• Energy: 100 Watt

• Surface: 15 cm<sup>2</sup>

• Heat dissipation: 6.7 Watt/cm<sup>2</sup>

#### • Electric cooker:

• Energy: 1800 Watt

• Surface: 1017 cm<sup>2</sup>

• Heat dissipation: 1.8 Watt/cm<sup>2</sup>









Hotter GPUs can be ~7% less energy efficient

Results obtained with xGPU (radio astronomy correlator) on Nvidia K20

 Liquid cooling is more energy friendly than air cooling

 But as the efficiency difference between hot and cold GPUs is ~7%, you probably shouldn't overdo the cooling

#### **RACK LEVEL COST REDUCTION**

A100 PCIe Air-Cooled vs A100 PCIe Liquid-Cooled



Configuration

2000 servers each with 2x CPU | 192GB | 1TB SSD | 2x A100 80GB
Air-cooled and liquid-cooled GPUs each at 300W TOP and same performance characteristics
Air-cooled infrastructure @ 1.6 PUE; Liquid-cooled infrastructure @ 1.15 PUE
15KW Air-Cooled Rack | 30KW Liquid-Cooled Rack | Power costs = \$0.2 per KWhr



 Moving data around is 20x more expensive than computing on it

#### Estimations for Nvidia H100:

- A single double-precision Fused Multiply-Add¹: 13.7 pJ
- Moving the operands (4x 64-bits) for 10 mm within chip<sup>2</sup>: 294.4 pJ (21x more energy)

mad.f64 %f1, %f2, %f3, %f0; // c += a\*b;



Three strategies for energy efficient GPU Computing:

- 1. Use for shorter amount of time
- 2. Minimize data movements
- 3. Optimize device settings

Three strategies for energy efficient GPU Computing:

1. Use for shorter amount of time  $\longrightarrow$  Optimize application performance

2. Minimize data movements  $\longrightarrow$  Lower/mixed precision techniques

3. Optimize device settings  $\longrightarrow$  Optimize clock frequency

```
13:30 – 13:35 Opening and welcome
      13:35 - 14:00 Introduction Energy Efficient GPU Computing
      14:00 – 14:15 Introduction to auto-tuning with Kernel Tuner
      14:15 - 14:30 First hands-on session
      14:30 - 15:00 Code optimization techniques for energy efficiency
      15:00 – 15:30 Coffee break
      15:30 - 15:45 Second hands-on session
7. 15:45 – 16:15 Mixed precision programming techniques
      16:15 - 16:30 Third hands-on session
      16:30 – 16:45 Optimizing GPU core clock frequency
16:45 – 16:55 Fourth hands-on session
      16:55 – 17:00 Closing remarks
```

```
13:30 – 13:35 Opening and welcome
      13:35 - 14:00 Introduction Energy Efficient GPU Computing
      14:00 – 14:15 Introduction to auto-tuning with Kernel Tuner
      14:15 - 14:30 First hands-on session
      14:30 - 15:00 Code optimization techniques for energy efficiency
      15:00 – 15:30 Coffee break
      15:30 - 15:45 Second hands-on session
2. 15:45 – 16:15 Mixed precision programming techniques
      16:15 – 16:30 Third hands-on session
      16:30 – 16:45 Optimizing GPU core clock frequency
16:45 – 16:55 Fourth hands-on session
      16:55 – 17:00 Closing remarks
```



# Introduction to Auto-Tuning with Kernel Tuner



To maximize GPU code performance, you need to find the best combination of:

- Different mappings of the problem to threads and thread blocks
- Different data layouts in different memories (shared, constant, ...)
- Different ways of exploiting special hardware features
- Thread block dimensions
- Code optimizations that may be applied or not
- Work per thread in each dimension
- Loop unrolling factors
- Overlapping computation and communication
- •

#### Challenge:

A very large search space



- Optimizing code manually you iteratively:
  - Modify the code
  - Run a few benchmarks
  - Revert or accept the change



- Write a templated version of your code or a code generator
- Benchmark the performance of all code variants









A tool for automatic performance tuning of GPU kernels

- Developed open source (Apache 2.0)
- Funded by several national and European projects
- Used by 10+ eScience center projects, and 10+ other universities & organizations
- Supports:
  - CUDA, HIP, OpenCL, C, Fortran, OpenACC
  - 20+ search optimization algorithms
  - Energy measurement of GPU kernels
  - Many different use cases











```
import numpy
from kernel_tuner import tune_kernel
kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block size x + threadIdx.x;
    if (i<n) {
       c[i] = a[i] + b[i];
}"""
n = numpy.int32(1e7)
a = numpy.random.randn(n).astype(numpy.float32)
b = numpy.random.randn(n).astype(numpy.float32)
c = numpy.zeros like(b)
args = [c, a, b, n]
tune_params = {"block_size_x": [32, 64, 128, 256, 512]}
tune_kernel("vector_add", kernel_string, n, args, tune_params)
```

# Growing Kernel Tuner ecosystem

# **Kernel** Launcher

C++ magic to integrate auto-tuned kernels into C++ applications

# **Kernel** Float

C++ data types for mixed-precision GPU kernel programming





# Auto-tuning as numerical optimization

• Construct a search space  $\mathcal{X}$  from all tunable parameters  $P_1, P_2, \dots P_n$  and all possible values  $D_1, D_2, \dots, D_n$  satisfying all user-defined constraints  $C_1, C_2, \dots, C_m$ :

$$\mathcal{X} = CSP\langle P, D, C \rangle$$

• Let f(x) be the execution time of kernel configuration  $x \in \mathcal{X}$ 

Treat the problem as a numerical optimization problem

$$x_{opt} = \arg\min f(x)$$

# Optimization strategies in Kernel Tuner

- Local optimization
  - Nelder-Mead, Powell, CG, BFGS, L-BFGS-B, TNC, COBYLA, and SLSQP
- Global optimization
  - Basin Hopping, Simulated Annealing,
     Differential Evolution, Genetic Algorithm,
     Particle Swarm Optimization, Firefly
     Algorithm, Bayesian Optimization, Multistart local search, Iterative local search,
     Dual Annealing, Random search, ...



# Optimizing Energy Efficiency

Minimize energy consumption instead of only the execution time

#### Kernel Tuner supports:

- Measuring power consumption during tuning
  - GPU built-in current sensors
  - External power measurement hardware
- Custom objectives to optimize for time, energy or any user-defined metric
- Performance models for model-steered auto-tuning



- GFLOPS/W is a widely-used metric for energy efficiency, also used by Green500
- GFLOPS or GFLOP/s is a measure of computational throughput: billions of floatingpoint operations per second
- We can compute GFLOP/s as: the total number of floating-point operations (in billions) divided by the kernel execution time in seconds
- Watt (W) is a measure of power, equal to energy in Joule (J) per second (s)

$$\frac{GFLOPS}{W} = \frac{GFLOP/s}{J/s} = \frac{GFLOP}{J}$$

- Prerequisites:
  - Python 3.8 or newer
  - CUDA or OpenCL device with necessary drivers and compilers installed
- To install Kernel Tuner:
  - pip install kernel\_tuner
- For more information:
  - <a href="https://kerneltuner.github.io/kernel\_tuner/latest/install.html">https://kerneltuner.github.io/kernel\_tuner/latest/install.html</a>

• Note: installation on your system is not required for the hands-on sessions



# Hands-on



- The first hands-on notebook is:
  - <a href="https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/00\_Kernel\_Tuner\_Introduction.ipynb">https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/00\_Kernel\_Tuner\_Introduction.ipynb</a>
- The goal of this hands-on is to:
  - Install and run Kernel Tuner
  - Measure energy consumption of our kernels
  - View the results using Kernel Tuner Dashboard
- Open the notebook in Google Colab and work there
- Please follow the instructions in the Notebook
- Feel free to ask questions to instructors and mentors



#### Observers

- To observe quantities other than execution time
- Measurements are stored in results, but not printed to screen

#### Metrics

- Allows user to create derived metrics
- Specified as keys in a dict using lambda functions
- Always printed to screen, also stored in results

#### • Cache files

- Allows Kernel Tuner to continue from previous runs, where it left off
- Allows Dashboard to visualize tuning results during/after the run



# Code Optimization Techniques for Energy Efficiency



- Modify the kernel source code to improve performance or tunability
- Effects on performance can be different on different GPUs or different input data
- You can tune:
  - Enabling or disabling an optimization
  - The parameters introduced by certain optimizations
- You often need to combine multiple different optimizations with specific tunable parameter values to arrive at optimal performance

- In March 2023, we published a literature review summarizing the last decade of code optimizations for GPU programming
  - We describe which optimizations are used in literature and how they are used
- Optimization Techniques for GPU Programming Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, and Henri Bal ACM Computing surveys 2023

https://dl.acm.org/doi/abs/10.1145/3570638

- Coalescing memory accesses
- Host/device communication
- Kernel fusion
- Loop blocking
- Loop unrolling
- Prefetching
- Recomputing values
- Reducing atomics

- Reducing branch divergence
- Reducing redundant work
- Reducing register usage
- Reformatting input data
- Using a specific memory space
- Using warp shuffle instructions
- Varying work per thread
- Vectorization

- Coalescing memory accesses
- Host/device communication
- Kernel fusion
- Loop blocking
- Loop unrolling
- Prefetching
- Recomputing values
- Reducing atomics

- Reducing branch divergence
- Reducing redundant work
- Reducing register usage
- Reformatting input data
- Using a specific memory space
- Using warp shuffle instructions
- Varying work per thread
- Vectorization

Merge one or more kernels into one kernel

- Why?
  - Reduces data movements between off-chip DRAM and GPU registers
  - Moving data around is more expensive than computing on it
- How?
  - Fuse the kernel arguments and computations of two kernels into one
  - Demote a kernel to a \_\_\_device\_\_\_ function and call it from another kernel
  - Temporal fusion: merge multiple calls of the same kernel into one

```
// c = a+b
                                                       // e = a+b+d
vector_add<<<grid, threads>>>(c, a, b, n);
                                                       vector_3add<<<grid, threads>>>(e, a, b, d, n);
// e = c+d
vector_add<<<grid, threads>>>(e, c, d, n);
__global__
                                                       global
void vector_add(float *c, float *a, float *b,
                                                       void vector 3add(float *d, float *a, float *b,
                int n) {
                                                                        float *c, int n) {
    int i = (blockIdx.x*blockDim.x)+threadIdx.x;
                                                           int i = (blockIdx.x*blockDim.x)+threadIdx.x;
    if (i < n) {
                                                           if (i < n) {
        c[i] = a[i] + b[i];
                                                               d[i] = a[i] + b[i] + c[i];
```

Modify the structure of one or more loops to work in blocks over the data

- Why?
  - Increases spatial / temporal locality
  - Reduces the 'working set' of the algorithm
- How?
  - Change the order of computations and data accesses in nested loops
  - Usually nearly doubles the number of for-loops in the code
  - Outer-loops iterate over the blocks
  - Inner-loops iterate within each block



```
for (int j=0; j<ny; j++) {
    for (int i=0; i<nx; i++) {
        ...[j*nx + i]
    }
}</pre>
```



```
for (int j=0; j<ny; j+=nyb) {
    for (int i=0; i<nx; i+=nxb) {

        for (int jb=0; jb<nyb; jb++) {
            for (int ib=0; ib<nxb; ib++) {
                ...[(j+jb)*nx + (i+ib)]
            }
        }
    }
}</pre>
```

### Reduce the number of iterations of a loop

- Why?
  - Increases instruction-level-parallelism
  - Reduces loop overhead instructions
- How?
  - Replicate the contents of a for-loop n times, increase loop counter by n
  - In the early days, only manually or with a code generator
  - Compiler does this now: **#pragma unroll <value>**
  - In CUDA, value has to be an integer constant expression
    - 0 is not allowed and gives an error, 1 means unrolling is disabled

```
#pragma unroll loop_unroll_factor_k
for (int k=0; k<n; k++) {
    ...
}</pre>
```

The compiler can unroll this loop if **n** is known at compiletime. The **loop\_unroll\_factor\_k** parameter should be a divisor of the loop counter **n** 

Reduce the number of registers per thread required by the kernel

- Why?
  - Registers are an important and limited SM resource and are likely to limit occupancy
  - Allows to increase the tunable range of thread block dimensions
- How?
  - Compiling constant values into your code rather than keeping them in registers (e.g. using templates or tunable parameters)
  - Limiting or disabling loop unrolling is very effective in reducing register usage
  - In kernels that do many different things, split the kernel
  - Enabling register spilling with compiler flag -maxrregcount=N or tuning the number of blocks per SM using the kernel \_\_launch\_bounds\_\_()

```
__global__ void
some_kernel(...)
{
    ...
}
```

Increase or decrease the amount of work per thread (or thread block) and adjust the number of threads and thread blocks accordingly

### Why?

- Increasing work per thread often increases data reuse and locality
- Reduces redundant instructions previously executed by other threads
- Increases instruction-level parallelism and possibly increases register usage

#### How?

- Reduce number of threads blocks in total, but increase the work per thread block
- Bring down number of threads within the block, but keep the amount of work equal

```
#pragma unroll
for (kb = 0; kb < block_size_x; kb++) {</pre>
    sum += sA[ty][kb] * sB[kb][tx];
#pragma unroll
for (kb = 0; kb < block_size_x; kb++) {</pre>
    #pragma unroll
    for (int j = 0; j < work_per_thread_x; j++) {</pre>
        sum[j] += sA[ty][kb] * sB[kb][tx + j * block_size_x];
```



- Coalescing memory accesses
- Host/device communication
- Kernel fusion
- Loop blocking
- Loop unrolling
- Prefetching
- Recomputing values
- Reducing atomics

- Reducing branch divergence
- Reducing redundant work
- Reducing register usage
- Reformatting input data
- Using a specific memory space
- Using warp shuffle instructions
- Varying work per thread
- Vectorization



## Hands-on



- The second hands-on notebook is:
  - https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/01\_Code\_Optimizations\_for\_Energy.ipynb

- The goal of this hands-on is to:
  - See an example of Kernel Fusion
  - Compare the energy consumption of different kernels
- Open the notebook in Google Colab and work there
- Please follow the instructions in the Notebook
- Feel free to ask questions to instructors and mentors



# Mixed Precision Programming Techniques



• Prevalent is double precision (64 bit); GPUs also support lower precision



- Low precision has many benefits ☺!
  - Faster computation
    - Less compute cycles required, especially double precision is often slow
  - Lower memory footprint
    - Less bits required per number
  - Better cache utilization
    - Higher cache hit rates
  - Higher effective memory bandwidth
    - More numbers per second
  - Lower register usage
    - Increases GPU occupancy, thus performance
  - All these points also increase energy efficiency

• But, at the cost of loss in precision 😂

| Floating-point | Value of Pi                                                | Error                   |
|----------------|------------------------------------------------------------|-------------------------|
| Infinite bits  | 3.141592653589793238462643383279502884197169399375         | 0                       |
| 64 bit         | 3.141592653589793115997963468544185161590576171875         | 3.9×10 <sup>-15</sup> % |
| 32 bit         | 3.14159250259399414062500000000000000000000000000000000000 | 0.000005%               |
| 16 bit         | 3.14062500000000000000000000000000000000000                | 0.03%                   |
| 8 bit          | 3.0000000000000000000000000000000000000                    | 4.5%                    |

- Core idea:
  - What if we mix different precision levels in one application?
  - Use different floating-point types for different variables in code
- Leads to trade-off between accuracy and performance
  - Lower precision typically results in higher performance
  - Need to find balance between error and speedup
- What precision should be used for each variable?
  - Ideally, we want maximum performance for an acceptable error
  - Auto-tuning to the rescue!





- IEEE 754 standard is implemented in all architectures
- Floating-point number consists of three parts:
  - S: sign (+ or -)
  - M: mantissa/significand
  - E: exponent
- Floating-point number represented using exponential format:
  - $(-1)^S \times M \times 2^E$
  - Example:  $+1.42 \times 2^3$  means S=+1, M=1.42, E=3
  - Where 1≤M<2, which makes representation unique
  - There are also non-normal numbers: NaN, Inf, subnormal

- Sign bit (1 bit)
- Mantissa/significand (A bits)
  - Determines number of significant digits
  - Results rounded to number of decimal places
  - Example: A=23 means ~7 decimal places
- Exponent (**B** bits)
  - Determines range of numbers
  - Numbers outside range become zero or infinity
  - Example: **B=8** means range is ~10<sup>-38</sup> to ~10<sup>38</sup>
- Total size: 1 + A + B bits

| Туре           | √2               |
|----------------|------------------|
| A=52 (Float64) | 1.41421356237309 |
| A=23 (Float32) | 1.414213         |
| A=10 (Float16) | 1.414            |
| A=2 (Float8)   | 1.5              |





- double precision (64 bits) prevalent in scientific computing
- GPUs typically slow on double arithmetic
  - Except the scientific/datacenter-rated GPUs

| Type name |    |    | Significant<br>bits |          |          | Decimal<br>places | 1+Epsilon    |
|-----------|----|----|---------------------|----------|----------|-------------------|--------------|
| double    | 64 | 11 | 52                  | 2.2e-308 | 1.8e+308 | 15                | 1 + 2.22e-16 |



- Single precision (32 bits) balances accuracy and throughput
- Widely used in graphics and general GPU applications

| Type name | Total<br>bits | Exponent<br>bits | Significant<br>bits | Smallest<br>normal | Biggest<br>normal | Decimal<br>places | 1+Epsilon   |
|-----------|---------------|------------------|---------------------|--------------------|-------------------|-------------------|-------------|
| float     | 32            | 8                | 23                  | 1.2e-38            | 3.4e+38           | 6                 | 1.000000119 |



- Introduced with NVIDIA's Pascal architecture (2016)
- Double computational throughput of float
- Limited range, reasonable accuracy

| Type name | Total<br>bits |   | Significant<br>bits |          | Biggest<br>normal | Decimal<br>places | 1+Epsilon |
|-----------|---------------|---|---------------------|----------|-------------------|-------------------|-----------|
| half      | 16            | 5 | 10                  | 0.000061 | 65536             | 3                 | 1.00097   |



- "Brain" Floating-point. Introduced by Google Brain project
- Introduced with NVIDIA's Ampere architecture (2020)
- Large range, limited accuracy

| Type name | Total<br>bits | Exponent<br>bits | Significant<br>bits |         | Biggest<br>normal | Decimal<br>places | 1+Epsilon |
|-----------|---------------|------------------|---------------------|---------|-------------------|-------------------|-----------|
| bfloat    | 16            | 8                | 7                   | 1.2e-38 | 3.4e+38           | 2                 | 1.00781   |





- Introduced with NVIDIA's Hopper architecture (2022)
- Two flavors: 5+2 bits or 4+3 bits
- No arithmetic functions, only conversions

| Type name | Total<br>bits | Exponent<br>bits | Significant<br>bits | Smallest<br>normal | Biggest<br>normal | Decimal<br>places | 1+Epsilon |
|-----------|---------------|------------------|---------------------|--------------------|-------------------|-------------------|-----------|
| fp8_e4m3  | 8             | 4                | 3                   | 0.015625           | 256               | 1                 | 1.125     |
| fp8_e5m2  | 8             | 5                | 2                   | 0.000061           | 65536             | 0                 | 1.25      |

| Type name | Total<br>bits | Exponent<br>bits | Significant<br>bits | Smallest<br>normal | Biggest<br>normal | Decimal<br>places | 1+Epsilon    |
|-----------|---------------|------------------|---------------------|--------------------|-------------------|-------------------|--------------|
| double    | 64            | 11               | 52                  | 2.2e-308           | 1.8e+308          | 15                | 1 + 2.22e-16 |
| float     | 32            | 8                | 23                  | 1.2e-38            | 3.4e+38           | 6                 | 1.000000119  |
| half      | 16            | 5                | 10                  | 0.000061           | 65536             | 3                 | 1.00097      |
| bfloat    | 16            | 8                | 7                   | 1.2e-38            | 3.4e+38           | 2                 | 1.00781      |
| fp8_e4m3  | 8             | 4                | 3                   | 0.015625           | 256               | 1                 | 1.125        |
| fp8_e5m2  | 8             | 5                | 2                   | 0.000061           | 65536             | <1                | 1.25         |

- Create type aliases in kernels
  - C: use preprocess #define
  - C++: use template parameters
- Available data types in CUDA
  - double and float are predefined
  - \_\_half found in <cuda\_fp16.h>
  - \_\_nv\_bfloat16 found in <cuda\_bf16.h>
  - \_\_nv\_fp8\_eXmY found in <cuda\_fp8.h>

```
__global__ void vector_add(
       int n,
       const float* A,
        const float* B,
             float* C
 int i = threadIdx.x + blockIdx.x * blockDim.x;
 if (i < n)
     C[i] = A[i] + B[i];
```

```
#define TYPE_A float
#define TYPE_B float
#define TYPE_C float
__global__ void vector_add(
        int n,
        const TYPE_A* A,
        const TYPE_B* B,
             TYPE_C* C
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
     C[i] = A[i] + B[i];
```

```
#include <cuda_fp16.h>
#define TYPE_A __half
#define TYPE_B 110at
#define TYPE_C float
__global__ void vector_add(
        int n,
        const TYPE_A* A,
        const TYPE_B* B,
             TYPE_C* C
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
      C[i] = A[i] + B[i];
```

```
#include <cuda fp16.h>
#define TYPE_A __half
#define TYPE B float
#define TYPE C float
__global__ void vector_add(
       int n,
       const TYPE A* A,
       const TYPE_B* B,
            TYPE C* C
                                                  Does not compile ②!
 int i = threadIdx.x + blockIdx.x * blockDim.x;
 if (i < n)
     C[i] = A[i] + B[i];
                                                  kernel.cu(15): error: no operator
                                                  "+" matches these operands
                                                  operand types are: __half + float
```

```
#include <cuda fp16.h>
#define TYPE_A __half
#define TYPE B float
#define TYPE C float
__global__ void vector_add(
       int n,
       const TYPE_A* A,
       const TYPE_B* B,
            TYPE C* C
                                                  Does not compile 😂!
 int i = threadIdx.x + blockIdx.x * blockDim.x;
 if (i < n)
    C[1] = A[i] + B[i];
                                                  kernel.cu(15): error: no operator
                                                  "+" matches these operands
                                                  operand types are: __half + float
```

- No type promotion
  - Cannot mix types in binary operations
- Some operations require intrinsics
  - \_\_hdiv(), \_\_hsin(), \_\_hfmad()
- Missing operations
  - No \_\_htan()?
- Missing or awkward type conversion
  - \_\_nv\_cvt\_bfloat16raw2\_to\_fp8x2
  - No fp8 to double?
  - No half to bfloat16?

```
_global__ void kernel(const __half* input, float constant, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    _half in0 = input[2 * i + 0];
    _half in1 = input[2 * 1 + 1];
    _half2 a = _halves2half2(in0, int1);
    float b = float(constant);
    _half c = _float2half(b);
    _half2 d = _half2half2(c);
    _half2 e = _hadd2(a, d);
    _half f = _low2half(e);
    _half g = _high2half(e);
    float out0 = _half2float(f);
    float out1 = _half2float(g);
    output[2 * i + 0] = out0;
    output[2 * i + 1] = out1;
}
```

```
global void kernel(const half* input, float constant, float* output) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   _half in0 = input[2 * i + 0];
   _half in1 = input[2 * 1 + 1];
   __half2 a = __halves2half2(in0, int1);
   float b = float(constant);
   half c = float2half(b);
   _half2 d = _half2half2(c);
   half2 e = hadd2(a, d);
   half f = low2half(e);
   __half g = __high2half(e);
                                               #include "kernel float.h"
   float out0 = __half2float(f);
                                               namespace kf = kernel_float;
   float out1 = half2float(g);
   output[2 * i + 0] = out0;
                                               __global__ void kernel(const kf::vec<half, 2>* input, float constant, kf::vec<float, 2>* output) {
   output[2 * i + 1] = out1;
                                                  int i = blockIdx.x * blockDim.x + threadIdx.x;
                                                  output[i] = input[i] + kf::cast<half>(constant);
```



https://github.com/KernelTuner/kernel\_float

- Header-only C++ library to simplify mixed precision GPU programming
- Offers single type: vec<T, N>
  - N elements of type T
  - Auto selects optimal storage format
- Offers all mathematical operations
  - Auto selects best intrinsic
  - Fallback to single precision for missing operations



```
#define TYPE_A float
#define TYPE_B float
#define TYPE_C float
__global__ void vector_add(
        int n,
        const TYPE_A* A,
        const TYPE_B* B,
             TYPE_C* C
) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
     C[i] = A[i] + B[i];
```

```
#include "kernel_float.h"
#define TYPE_A float
#define TYPE B float
#define TYPE_C float
__global__ void vector_add(
        int n,
        const kernel_float::vec<TYPE_A, 1>* A,
        const kernel_float::vec<TYPE_B, 1>* B,
              kernel_float::vec<TYPE_C, 1>* C
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
      C[i] = A[i] + B[i];
```

```
#include "kernel_float.h"
#define TYPE_A __half
#define TYPE_C float
__global__ void vector_add(
        int n,
        const kernel_float::vec<TYPE_A, 1>* A,
        const kernel_float::vec<TYPE_B, 1>* B,
              kernel_float::vec<TYPE_C, 1>* C
 int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
     C[i] = A[i] + B[i];
```

- Kernel Float automatically uses vector intrinsics
  - Requires using **kernel\_float::vec<T, N>** with N≥2
- Several types benefit from vectorization!
  - half and bfloat require vectorized intrinsics for high throughput
  - Vectorized memory operations
  - Vectorized integer operations
  - •

```
#include "kernel_float.h"
#define TYPE_A float
#define TYPE B float
#define TYPE_C __half
#define VECTOR SIZE 1
__global__ void vector_add(
        int n,
        const kernel_float::vec<TYPE_A, VECTOR_SIZE>* A,
        const kernel float::vec<TYPE B, VECTOR SIZE>* B,
              kernel_float::vec<TYPE_C, VECTOR_SIZE>* C
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i * VECTOR_SIZE < n)</pre>
      C[i] = A[i] + B[i];
```

```
#include "kernel_float.h"
#define TYPE_A float
#define TYPE B float
#define TYPE C half
#uefine VECTOR_SIZE 2
__global__ void vector_add(
        int n,
        const kernel_float::vec<TYPE_A, VECTOR_SIZE>* A,
        const kernel float::vec<TYPE B, VECTOR SIZE>* B,
              kernel_float::vec<TYPE_C, VECTOR_SIZE>* C
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i * VECTOR_SIZE < n)</pre>
      C[i] = A[i] + B[i];
```

- Accuracy vs performance trade-off
  - What type should we use for each variable?
  - Ideally want high performance with low error
- Variables datatypes and kernel parameters both affect performance
  - Usually heavily intertwined, we cannot tune them separately
- Leads to large search-space, for example:
  - 10 variables and 4 precision levels:  $4^{10} = 1$  million options
  - 8 parameters with each 6 options:  $6^8 = 1$  million options
  - Total: 1 trillion configurations!

## Kernel Tuner offers native support for accuracy tuning

- Step 1: Add tunable floating-point types as tuning parameters
- Step 2: Wrap inputs/outputs in **TunablePrecision** objects
- Step 3: provide reference output as **answer**
- Step 4: Add AccuracyObserver

### See the example:

examples/cuda/accuracy.py

```
size = 100000000
n = numpy.int32(size)
a = numpy.random.randn(size)
b = numpy.random.randn(size)
c = numpy.zeros like(b)
args = [n,
     TunablePrecision("float_type", a),
     TunablePrecision("float type", b),
     TunablePrecision("float_type", c)]
answer = [None, None, None, a + b]
tune params = dict()
tune params ["block size x"] = [32, 64, 128, 256, 512, 1024]
tune params["float type"] = ["float", "double", "half"]
observers = [AccuracyObserver("RMSE")]
results, env = tune kernel("vector add", kernel string,
     size, args, tune_params, answer=answer,
     observers=observers, lang="CUDA")
```

```
size = 100000000
n = numpy.int32(size)
a = numpy.random.randn(size)
b = numpy.random.randn(size)
c = numpy.zeros like(b)
args = [n,
     TunablePrecision("float_type", a),
     TunablePrecision("float type", b),
     TunablePrecision("float_type", c)]
answer = [None, None, None, a + b]
tune params = dict()
tune params ["block size x"] = [32, 64, 128, 256, 512, 1024]
tune params["float type"] = ["float", "double", "half"]
observers = [AccuracyObserver("RMSE")]
results, env = tune kernel("vector add", kernel string,
     size, args, tune_params, answer=answer,
     observers=observers, lang="CUDA")
```

```
size = 100000000
n = numpy.int32(size)
a = numpy.random.randn(size)
b = numpy.random.randn(size)
c = numpy.zeros like(b)
args = [n,
     TunablePrecision("float type", a),
     TunablePrecision("float type", b),
     TunablePrecision("float_type", c)]
answer = [None, None, None, a + b]
tune params = dict()
tune params ["block size x"] = [32, 64, 128, 256, 512, 1024]
tune_params["float_type"] = ["float", "double", "half"]
observers = [AccuracyObserver("RMSE")]
results, env = tune kernel("vector add", kernel string,
     size, args, tune_params, answer=answer,
     observers=observers, lang="CUDA")
```

- The TunablePrecision wrapper tells Kernel Tuner that type of input/output arguments depends on a tunable parameter
- Before benchmarking, data converted to provided data types
- During benchmarking, kernel is passed pointer of correct data type
- [Advanced] The general **Tunable** object allows arbitrary conversions

```
size = 100000000
n = numpy.int32(size)
a = numpy.random.randn(size)
b = numpy.random.randn(size)
c = numpy.zeros like(b)
args = [n,
     TunablePrecision("float_type", a),
     TunablePrecision("float type", b),
     TunablePrecision("float type", c)]
answer = [None, None, None, a + b]
tune params = dict()
tune_params["block_size_x"] = [32, 64, 128, 256, 512, 1024]
tune params["float type"] = ["float", "double", "half"]
observers = [AccuracyObserver("RMSE")]
results, env = tune kernel("vector add" kernel string,
     size, args, tune_params, answer=answer,
     observers=observers, lang "CUDA")
```

```
size = 100000000
n = numpy.int32(size)
a = numpy.random.randn(size)
b = numpy.random.randn(size)
c = numpy.zeros like(b)
args = [n,
     TunablePrecision("float type", a),
     TunablePrecision("float type", b),
     TunablePrecision("float type", c)]
answer = [None, None, None, a + b]
tune params = dict()
tune params ["block size x"] = [32, 64, 128, 256, 512, 1024]
tune params["float type"] = ["float", "double", "half"]
observers = [AccuracyObserver("RMSE")]
 results, env = tune_kernel("vector_add",kernel_string,
     size angs tune panams answer=answer,
     observers=observers, larg="CUDA")
```

- The AccuracyObserver measures the error and adds a metric
- Supports 10+ well-known metrics
  - Root mean square error (RMSE)
  - Mean relative error (rel)
  - Mean absolute error (abs)
  - Maximum relative error (max)
  - •
  - Custom metrics are also possible!
- The best error metric is application-dependent
- Compatible with other observers!



## Hands-on



- The third hands-on notebook is:
  - <a href="https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/02\_Mixed\_precision\_programming.ipynb">https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/02\_Mixed\_precision\_programming.ipynb</a>

- The goal of this hands-on is to:
  - Tune a signal convolution kernel with mixed precision types
  - Experiment with the accuracy-performance trade-off
- Open the notebook in Google Colab and work there
- Please follow the instructions in the Notebook
- Feel free to ask questions to instructors and mentors



```
# The tunable types. Currently, the code is only
tune_params = dict()
tune_params["OUTPUT_TYPE"] = ["double"]
tune_params["INPUT_TYPE"] = ["double"]
tune_params["FILTER_TYPE"] = ["double"]
tune_params["ACCUM_TYPE"] = ["double"]

# Other tunable parameters
tune_params["block_size_x"] = [128, 256]
tune_params["VECTOR_SIZE"] = [1, 2, 4]
tune_params["PREFETCH_INPUT"] = [0, 1]
tune_params["UNROLL_LOOP"] = [0, 1]
```



# Optimizing GPU Core Clock Frequency



NVML can observe GPU temperature, core and memory clocks, core voltage, and power

#### Advantages:

• Highly available

#### Disadvantages:

- Returns time-averaged power, not instantaneous power consumption
- Limited time resolution

#### Current solution:

 Measure power while continuously running the kernel for one second



Going Green: optimizing GPUs for energy efficiency through model-steered auto-tuning Richard Schoonhoven, Bram Veenboer, Ben van Werkhoven, K. Joost Batenburg PMBS workshop at SC22 2022

#### Pros:

- Instantaneous power readings
- Time resolution: 2.8 KHz
- Open source: <a href="https://gitlab.com/astron-misc/PowerSensor">https://gitlab.com/astron-misc/PowerSensor</a>

#### Cons:

- Some assembly required
  - You need to build the hardware!



Supported in Kernel Tuner, using PowerSensorObserver

## Allows to measure several quantities during tuning:

 Power consumption, core frequency, core voltage, memory frequency, GPU temperature, and energy consumption

#### Provides an interface within Kernel Tuner to NVML:

- Enables new tunable parameters:
  - nvml\_pwr\_limit: try out different power limits
  - nvml\_gr\_clock: set the GPU core clock frequency
  - nvml\_mem\_clock: set the GPU memory clock frequency
  - Setting these requires root privileges

```
tune_params["nvml_pwr_limit"] = [250, 225, 200, 175]

nvmlobserver = NVMLObserver(["nvml_energy", "temperature"])

metrics = OrderedDict()
metrics["GFLOPS/W"] = lambda p: (size/1e9) / p["nvml_energy"]

results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params, observers=[nvmlobserver], metrics=metrics, iterations=32)
```

Kernel Tuner has helper functions to setup tunable parameters:

In kernel\_tuner.observers.nvml:

- get\_nvml\_pwr\_limits(device, n=None, quiet=False):
  - Device is the device ordinal as reported by **nvidia-smi**
  - **n** is the number of evenly-spaced values to tune
    - if unspecified returns values spaced 5 Watts apart
- get\_nvml\_gr\_clocks(device, n=None, quiet=False):
  - **n** is the number of evenly-spaced values to tune
    - If unspecified, all supported core clocks are returned

Many tunable parameters affect compute performance and/or energy efficiency But we can also:

- Limit the GPU clock frequency, allow GPU to vary power consumption
- Limit the GPU power consumption, allow GPU to determine clock frequency

Both methods unfortunately require root privileges for the latest generations of Nvidia GPUs

Tuning CLBlast GEMM using frequency or power limit tuning



## Advantages of power capping:

- Potentially more effective, GPU may also lower memory clock
- Reliable method in face of limited power

## Advantages of frequency tuning:

- Especially on A100, frequency tuning enables a wider power range
- Fixing the clock frequency also improves measurement stability

- GPUs rapidly ramp up voltage when clock frequency increases beyond a certain point
- This point appears to be a sweet spot in the trade-off between energy consumption and compute performance
- We call this point the 'ridge point'



Going Green: optimizing GPUs for energy efficiency through model-steered auto-tuning Richard Schoonhoven, Bram Veenboer, Ben van Werkhoven, K. Joost Batenburg PMBS workshop at SC22 2022

Core Frequency [MHz]

- Not every GPU reports core voltages, but we can estimate the voltage using a simple power model
- When we fix all parameters and vary the clock frequency, we can approximate power consumption using:

$$P_{load} = \min(P_{max}, P_{idle} + \alpha * f * v^2)$$

 And identify the GPUs 'ridge point' frequency in this way





Going Green: optimizing GPUs for energy efficiency through model-steered auto-tuning Richard Schoonhoven, Bram Veenboer, Ben van Werkhoven, K. Joost Batenburg PMBS workshop at SC22 2022

 Use performance model to limit the frequency range for tuning

 Reduces the search space by ~80% on average



- By default, Kernel Tuner's optimization strategy minimizes time
- But there is also support for using a custom tuning **objective**
- The objective can be any observed quantity or user-defined metric



## Hands-on



- The fourth hands-on notebook is:
  - <a href="https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/03\_energy\_efficient\_computing.ipynb">https://colab.research.google.com/github/KernelTuner/kernel\_tuner\_tutorial/blob/master/energy/03\_energy\_efficient\_computing.ipynb</a>
- The goal of this hands-on is to:
  - Tune a kernel to minimize the execution time or the energy consumption
  - Use an optimization strategy
  - Compare different energy optimization strategies
- Open the notebook in Google Colab and work there
- Please follow the instructions in the Notebook
- Feel free to ask questions to instructors and mentors



## Closing Remarks



- We are developing Kernel Tuner as an open-source project
- GitHub repository:
  - https://github.com/KernelTuner/kernel\_tuner
  - License: Apache 2.0
- If you use Kernel Tuner in a project, please cite the paper:
  - B. van Werkhoven, Kernel Tuner: A search-optimizing GPU code auto-tuner, Future Generation Computer Systems, 2019

- Contributions can come in many forms: tweets, blog posts, issues, pull requests
- Before making larger changes, please create an issue to discuss

For the full contribution guide, please see:
 <u>https://kerneltuner.github.io/kernel\_tuner/stable/contributing.html</u>

 We have a discussion board on GitHub!



- Kernel Launcher: C++ library for creating optimal-performance portable CUDA applications
   S. Heldens, B. van Werkhoven
   International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IPDPS 2023
- Optimization Techniques for GPU Programming Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, and Henri Bal ACM Computing surveys 2023
- Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning Richard Schoonhoven, Bram Veenboer, Ben van Werkhoven, K. Joost Batenburg International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computer Systems (PMBS) at Supercomputing (SC22) 2022
- Bayesian Optimization for auto-tuning GPU kernels
   F.J. Willemsen, R.V. van Nieuwpoort, B. van Werkhoven
   International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) at Supercomputing (SC21) 2021
- Kernel Tuner: A search-optimizing GPU code auto-tuner
   B. van Werkhoven
   Future Generation Computer Systems 2019

## Acknowledgments

- The CORTEX project has received funding from the Dutch Research Council (NWO) in the framework of the NWA-ORC Call (file number NWA.1160.18.316).
- The COMPAS project has received funding from the Netherlands eScience Center (NLESC.OEC.2022.001).
- ESiWACE3 is funded by the European Union. This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Spain, Netherlands, Germany, Sweden, Finland, Italy and France, under grant agreement No 1010930





## Thanks!

If you have any further questions or would like to reach out, please feel free to contact me at:

Ben van Werkhoven b.van.werkhoven@liacs.leidenuniv.nl

