 <div align="center">
    
# Kernel Tuner demo

<br />
<br />
<br />
<br />
<br />
<br />
<br />


By Ben van Werkhoven, Netherlands eScience Center <br />
b.vanwerkhoven@esciencecenter.nl
    
</div>

Press alt+r to start the slideshow, spacebar or shift+spacebar to move forward to the next slide, and comma to hide the on-screen buttons.

Preparation: run the next code cell, start a second terminal, and go to the directory of this notebook.

In [None]:
%%bash
# -f keeps the cell from failing when the files do not exist yet
rm -f matmul_cache.json
rm -f vector_add.cu

# Optimizing GPU Applications

To maximize GPU code performance, you need to find the best combination of:

* Different mappings of the problem to threads and thread blocks
* Different data layouts in different memories (shared, constant, …)
* Different ways of exploiting special hardware features
* Thread block dimensions
* Code optimizations that may be applied or not
* Work per thread in each dimension
* Loop unrolling factors
* Overlapping computation and communication
* ...

<font color=red>Problem</font>:
* Creates a very large design space!
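To get a feeling for how quickly the design space grows, here is a minimal sketch that enumerates every combination of a handful of tunable parameters. The parameter names and value lists are made up for illustration and do not belong to any particular kernel.

```python
from itertools import product

# Hypothetical tunable parameters; names and values are illustrative only.
design_space = {
    "block_size_x": [32, 64, 128, 256, 512, 1024],
    "block_size_y": [1, 2, 4, 8, 16, 32],
    "work_per_thread_x": [1, 2, 4, 8],
    "work_per_thread_y": [1, 2, 4, 8],
    "loop_unroll_factor": [1, 2, 4, 8],
    "use_shared_memory": [0, 1],
}

# Every combination of values is one candidate kernel configuration.
configurations = list(product(*design_space.values()))
print(len(configurations))
```

Six modest parameters already give 4608 configurations, each of which would have to be compiled and benchmarked in an exhaustive search.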

# Kernel Tuner

*A Python tool for optimizing and tuning GPU applications*

Started in 2016:
* As a software development tool for GPU projects at the eScience Center
* To be used directly on existing kernels
* Without inserting dependencies in the kernel code
* Kernels can still be compiled with regular compilers

Today:
* Comprehensive toolbox for auto-tuning with several tools being built on top
* Developed by a team of 7 developers across CWI, ASTRON, and the eScience Center
* Used in over 10 different eScience Center projects and by others

https://github.com/KernelTuner/kernel_tuner

# Minimal Example

In [None]:
%%writefile vector_add.cu
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
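Note that the kernel uses `block_size_x` without declaring it. Kernel Tuner supplies tunable parameters to the kernel as compile-time constants. The sketch below imitates that idea by prepending `#define` lines to the source; `add_defines` is a hypothetical helper written for this slide, not part of the Kernel Tuner API.

```python
kernel_source = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
"""

def add_defines(source, params):
    """Prepend one #define per tunable parameter (hypothetical helper)."""
    defines = "".join(f"#define {name} {value}\n" for name, value in params.items())
    return defines + source

# One candidate configuration turned into compilable source code.
tuned_source = add_defines(kernel_source, {"block_size_x": 128})
print(tuned_source)
```

Because the values arrive as preprocessor definitions, the same file can still be built by a regular compiler, for example with `-Dblock_size_x=128` on the command line.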

In [None]:
import numpy as np
import kernel_tuner as kt

size = 1000000

a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
args = [c, a, b, np.int32(size)]

tune_params = dict()
tune_params["block_size_x"] = [32, 64, 128, 256, 512]

_ = kt.tune_kernel("vector_add", "vector_add.cu", size, args, tune_params)
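In the call above, `size` is passed as the problem size, and the number of thread blocks follows from dividing it by `block_size_x` and rounding up. A minimal sketch of that arithmetic, where `grid_size` is a hypothetical helper and not a Kernel Tuner function:

```python
size = 1000000
block_sizes = [32, 64, 128, 256, 512]

def grid_size(problem_size, block_size):
    """Number of thread blocks needed to cover problem_size (ceiling division)."""
    return (problem_size + block_size - 1) // block_size

grid_sizes = {bs: grid_size(size, bs) for bs in block_sizes}
for bs, blocks in grid_sizes.items():
    # Threads beyond `size` are the reason for the `if (i<n)` guard in the kernel.
    surplus = blocks * bs - size
    print(f"block_size_x={bs:4d}  blocks={blocks:6d}  surplus threads={surplus}")
```

For block sizes that do not divide the problem size evenly, the last thread block contains surplus threads, which the bounds check in the kernel keeps from writing out of range.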

<img src="img/dashboard_logo.png" style="height:100px;">

<div align="left">

#### Live visualizations of auto-tuning sessions using Kernel Tuner

<img src="img/dashboard.png" style="height:40%;">
    
https://github.com/KernelTuner/dashboard
</div>

# Tuning a larger problem

In [None]:
from collections import OrderedDict
problem_size = (512, 512)
A = np.random.randn(*problem_size).astype(np.float32)
B = np.random.randn(*problem_size).astype(np.float32)
C = np.zeros_like(A)

args = [C, A, B]

tune_params = OrderedDict()
tune_params["block_size_x"] = [2**i for i in range(0, 11)]
tune_params["block_size_y"] = [2**i for i in range(0, 11)]
tune_params["tile_size_x"] = [2**i for i in range(0, 6)]
tune_params["tile_size_y"] = [2**i for i in range(0, 6)]

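# Only configurations that satisfy this expression are benchmarked
restrict = ["block_size_x == block_size_y * tile_size_y"]
# The grid size in each dimension is the problem size divided by these parameters
grid_div_x = ["block_size_x", "tile_size_x"]
grid_div_y = ["block_size_y", "tile_size_y"]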

from kernel_tuner.nvml import NVMLObserver
nvml_observer = NVMLObserver(["nvml_energy", "temperature", "core_freq"])

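# User-defined metrics: a 512x512 matrix multiplication performs 2 * 512**3 floating-point operations
metrics = OrderedDict()
metrics["GFLOP/s"] = lambda p : (2 * 512**3 / 1e9) / (p["time"] / 1e3)  # time is in ms
metrics["GFLOPs/W"] = lambda p : (2 * 512**3 / 1e9) / (p["nvml_energy"])  # energy is in J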

_ = kt.tune_kernel("matmul_kernel", "matmul.cu", problem_size, args, tune_params,
                   observers=[nvml_observer], grid_div_y=grid_div_y, grid_div_x=grid_div_x,
                   restrictions=restrict, metrics=metrics, cache="matmul_cache.json")
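To see what the restriction, the grid divisors, and the metrics do, the following sketch redoes the bookkeeping in plain Python without a GPU. It only mirrors the arithmetic; the sample configuration and the 2.0 ms kernel time are invented for illustration, and hardware limits such as the maximum number of threads per block are ignored here.

```python
from itertools import product

tune_params = {
    "block_size_x": [2**i for i in range(0, 11)],
    "block_size_y": [2**i for i in range(0, 11)],
    "tile_size_x": [2**i for i in range(0, 6)],
    "tile_size_y": [2**i for i in range(0, 6)],
}

# Full Cartesian product versus the configurations that pass the restriction.
names = list(tune_params)
all_configs = [dict(zip(names, values)) for values in product(*tune_params.values())]
valid_configs = [c for c in all_configs
                 if c["block_size_x"] == c["block_size_y"] * c["tile_size_y"]]
print(len(all_configs), len(valid_configs))

# Grid dimensions for one sample configuration (chosen for illustration):
# the problem size is divided by the product of the listed parameters, rounded up.
config = {"block_size_x": 32, "block_size_y": 8, "tile_size_x": 2, "tile_size_y": 4}
div_x = config["block_size_x"] * config["tile_size_x"]
div_y = config["block_size_y"] * config["tile_size_y"]
grid = ((512 + div_x - 1) // div_x, (512 + div_y - 1) // div_y)
print(grid)

# GFLOP/s metric for a hypothetical measured kernel time of 2.0 ms.
time_ms = 2.0
gflops = (2 * 512**3 / 1e9) / (time_ms / 1e3)
print(round(gflops, 3))
```

The restriction shrinks the search space from 4356 to 306 configurations, which is what makes an exhaustive run feasible during a live demo.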

While the previous cell is running, go to the second terminal and run `ktdashboard matmul_cache.json`.

# Final remarks

Kernel Tuner is currently being used to optimize and tune code in:
* Ultrasound Brain Imaging with Erasmus MC (RECRUIT)
* Atmospheric Modeling (ESiWACE-2 MicroHH)
* Radio Astronomy (CORTEX)

Kernel Tuner can also be used for optimizing the energy efficiency of GPU applications:
> Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning <br />
R. Schoonhoven, B. Veenboer, B. van Werkhoven, K. J. Batenburg <br />
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) at Supercomputing (SC22) 2022 

Main repository: <br />
https://github.com/KernelTuner/kernel_tuner  <br />
Documentation: <br /> 
https://KernelTuner.github.io  <br />
Tutorial: <br />
https://github.com/KernelTuner/kernel_tuner_tutorial <br />