# Kernel Tuner Tutorial

## Advanced Hands-on

In this hands-on we will look at a few of the Kernel Tuner features that were recently introduced to you: **search optimization strategies** and **custom observers**.

But first, if you have not done so already, install and import `kernel_tuner` and its dependencies.

In [None]:
%pip install kernel_tuner

import numpy as np
import kernel_tuner as kt
import collections

To work with these features we will use a matrix multiplication kernel.

Matrix multiplication is one of the most well-known and widely-used linear algebra operations, and is frequently used to demonstrate the high-performance computing capabilities of GPUs. As such, matrix multiplication presents a familiar starting point for many GPU programmers. More information about matrix multiplication can be found on [Wikipedia](https://en.wikipedia.org/wiki/Matrix_multiplication).

In [None]:
%%writefile matmul.cu

#define WIDTH 512

__global__ void matmul_kernel(float *C, float *A, float *B) {

    __shared__ float sA[block_size_y*tile_size_y][block_size_x];
    __shared__ float sB[block_size_y*tile_size_y][block_size_x * tile_size_x];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = blockIdx.x * block_size_x * tile_size_x + threadIdx.x;
    int y = blockIdx.y * block_size_y * tile_size_y + threadIdx.y;
    int k, kb;

    float sum[tile_size_y][tile_size_x];
    #pragma unroll
    for (int i = 0; i < tile_size_y; i++) {
        #pragma unroll
        for (int j = 0; j < tile_size_x; j++) {
            sum[i][j] = 0.0f;
        }
    }

    for (k = 0; k < WIDTH; k += block_size_x) {

        __syncthreads();
        #pragma unroll
        for (int i = 0; i < tile_size_y; i++) {
            sA[ty + block_size_y * i][tx] = A[(y+i*block_size_y) * WIDTH + k + tx];

            #pragma unroll
            for (int j = 0; j < tile_size_x; j++) {
                sB[ty + block_size_y * i][tx + j * block_size_x] = B[(k + ty + block_size_y * i) * WIDTH + x + j * block_size_x];
            }
        }
        __syncthreads();

        //compute
        #pragma unroll
        for (kb = 0; kb < block_size_x; kb++) {

            #pragma unroll
            for (int i = 0; i < tile_size_y; i++) {
                #pragma unroll
                for (int j = 0; j < tile_size_x; j++) {
                    sum[i][j] += sA[ty + block_size_y * i][kb] * sB[kb][tx + j * block_size_x];
                }
            }

        }

    }

    //store result
    #pragma unroll
    for (int i = 0; i < tile_size_y; i++) {
        #pragma unroll
        for (int j = 0; j < tile_size_x; j++) {
            C[y * WIDTH + x + block_size_y * i * WIDTH + j * block_size_x] = sum[i][j];
        }
    }

}

We now allocate memory, define tunable parameters and constraints, and tune the kernel.

In [None]:
# matrix width needs to match the value in the kernel source
problem_size = (512, 512)

A = np.random.randn(*problem_size).astype(np.float32)
B = np.random.randn(*problem_size).astype(np.float32)
C = np.zeros_like(A)

args = [C, A, B]

tune_params = collections.OrderedDict()
tune_params["block_size_x"] = [2**i for i in range(0, 11)]
tune_params["block_size_y"] = [2**i for i in range(0, 11)]
tune_params["tile_size_x"] = [2**i for i in range(0, 6)]
tune_params["tile_size_y"] = [2**i for i in range(0, 6)]

restrict = ["block_size_x == block_size_y * tile_size_y"]

grid_div_x = ["block_size_x", "tile_size_x"]
grid_div_y = ["block_size_y", "tile_size_y"]

answer = [np.matmul(A,B), None, None]

metrics = collections.OrderedDict()
metrics["GFLOP/s"] = lambda p : (2 * 512**3 / 1e9) / (p["time"] / 1e3)
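
To get a feel for the size of the search space before tuning, the parameter lists and the restriction above can be enumerated in plain Python. The following is a minimal sketch that does not use Kernel Tuner: it applies the restriction by hand and, for one example configuration, computes the grid dimensions under the assumption that each problem dimension is divided by the product of its grid divisor parameters and rounded up. The names `params` and `grid_dims` are hypothetical and used only for illustration. The count reflects the restriction only; some of these configurations may still fail to compile or launch because of device limits.

```python
import itertools
import math

# Same value lists as in the cell above (named `params` to avoid shadowing tune_params)
params = {
    "block_size_x": [2**i for i in range(0, 11)],
    "block_size_y": [2**i for i in range(0, 11)],
    "tile_size_x": [2**i for i in range(0, 6)],
    "tile_size_y": [2**i for i in range(0, 6)],
}

# Cartesian product of all parameter values, one dict per configuration
names = list(params)
all_configs = [dict(zip(names, values)) for values in itertools.product(*params.values())]

# Apply the restriction "block_size_x == block_size_y * tile_size_y" by hand
valid_configs = [
    c for c in all_configs
    if c["block_size_x"] == c["block_size_y"] * c["tile_size_y"]
]

print(len(all_configs), len(valid_configs))  # prints: 4356 306


def grid_dims(config, problem_size=(512, 512)):
    """Grid size assuming ceiling division by the product of the grid divisors."""
    grid_x = math.ceil(problem_size[0] / (config["block_size_x"] * config["tile_size_x"]))
    grid_y = math.ceil(problem_size[1] / (config["block_size_y"] * config["tile_size_y"]))
    return grid_x, grid_y


example = {"block_size_x": 16, "block_size_y": 4, "tile_size_x": 4, "tile_size_y": 4}
print(grid_dims(example))  # prints: (8, 32)
```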

In [None]:
results, env = kt.tune_kernel("matmul_kernel", "matmul.cu",
                             problem_size, args, tune_params,
                             grid_div_y=grid_div_y, grid_div_x=grid_div_x,
                             answer=answer, atol=1e-4,
                             restrictions=restrict, verbose=True, iterations=32, metrics=metrics, lang="cupy", cache="matmul_cache.json")
print(f"Number of configurations: {len(results)}")
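
`tune_kernel` returns a list with one dictionary per benchmarked configuration. The sketch below shows one way to pick the fastest entry and to reproduce the GFLOP/s metric by hand. It runs on a hypothetical stand-in list, `example_results`, whose timings are made up for illustration; the helper names `gflops` and `best_configuration` are also hypothetical. It assumes that `time` is reported in milliseconds, as the metric definition above implies, and it skips entries whose `time` is not a number.

```python
# Hypothetical stand-in for the list returned by kt.tune_kernel (timings are invented)
example_results = [
    {"block_size_x": 16, "block_size_y": 4, "tile_size_x": 4, "tile_size_y": 4, "time": 0.50},
    {"block_size_x": 32, "block_size_y": 8, "tile_size_x": 2, "tile_size_y": 4, "time": 0.25},
    {"block_size_x": 64, "block_size_y": 16, "tile_size_x": 1, "tile_size_y": 4, "time": 1.00},
]


def gflops(time_ms, n=512):
    """Same formula as the GFLOP/s metric: 2*n^3 operations over the time in seconds."""
    return (2 * n**3 / 1e9) / (time_ms / 1e3)


def best_configuration(results):
    """Return the entry with the lowest numeric time, ignoring non-numeric entries."""
    timed = [r for r in results if isinstance(r.get("time"), (int, float))]
    return min(timed, key=lambda r: r["time"])


best = best_configuration(example_results)
print(best["block_size_x"], best["block_size_y"], f"{gflops(best['time']):.1f} GFLOP/s")
```

In the notebook, the same two helpers can be applied directly to the `results` list from the cell above.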

We can also visualize the tuning results using [KTdashboard](https://github.com/benvanwerkhoven/dashboard).

In [None]:
%pip install git+https://github.com/benvanwerkhoven/dashboard
import panel as pn
pn.extension(comms='colab')
import ktdashboard.ktdashboard as ktd

In [None]:
ktd.KTdashboard("matmul_cache.json").notebook()

Sometimes the number of possible configurations of tunable parameters is too large, or time constraints do not allow for a full search. In those cases, it can be beneficial to use one of Kernel Tuner's **search optimization strategies**.

You can experiment with them in the next block. Try different strategies, and compare the best configuration each one finds with the overall optimum found by the full search above. You can also time the tuning process to see the differences there.

The strategies and how to enable them are described in Kernel Tuner's [API](https://benvanwerkhoven.github.io/kernel_tuner/user-api.html).

In [None]:
# experiment with enabling a search optimization strategy
strategy = ""

# tell the strategy to compile and benchmark at most 40 kernel configurations
strategy_options = dict(max_fevals=40)

results_opt, env_opt = kt.tune_kernel("matmul_kernel", "matmul.cu",
                                      problem_size, args, tune_params,
                                      grid_div_y=grid_div_y, grid_div_x=grid_div_x,
                                      answer=answer, atol=1e-4,
                                      restrictions=restrict, verbose=True, iterations=32,
                                      metrics=metrics, lang="cupy",
                                      strategy=strategy, strategy_options=strategy_options)
print(f"Number of configurations: {len(results_opt)}")
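
To compare several strategies side by side, the call above can be wrapped in a small harness that records the best time found and the wall-clock duration of each tuning run. This is only a sketch: `compare_strategies` and `run_tuning` are hypothetical names, and `run_tuning` is assumed to be any callable that accepts `strategy` and `strategy_options` and returns the list of results. The strategy names in the commented usage are ones I believe Kernel Tuner provides; check the API page for the exact list and for which options each strategy supports.

```python
import time


def compare_strategies(run_tuning, strategies, max_fevals=40):
    """Run one tuning session per strategy and summarize the outcome."""
    summary = {}
    for name in strategies:
        start = time.perf_counter()
        results = run_tuning(strategy=name, strategy_options=dict(max_fevals=max_fevals))
        elapsed = time.perf_counter() - start
        # Keep only entries with a numeric time
        times = [r["time"] for r in results if isinstance(r.get("time"), (int, float))]
        summary[name] = {
            "best_time_ms": min(times) if times else None,
            "evaluated": len(results),
            "tuning_seconds": elapsed,
        }
    return summary


# Possible usage in this notebook (requires a GPU, so it is left commented out):
# run_tuning = lambda **kw: kt.tune_kernel(
#     "matmul_kernel", "matmul.cu", problem_size, args, tune_params,
#     grid_div_y=grid_div_y, grid_div_x=grid_div_x, answer=answer, atol=1e-4,
#     restrictions=restrict, iterations=32, metrics=metrics, lang="cupy", **kw)[0]
# print(compare_strategies(run_tuning, ["random_sample", "genetic_algorithm", "simulated_annealing"]))
```

Passing the tuning call in as a callable keeps the harness independent of the kernel, so it can be reused for other tuning problems.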

Next, we are going to add a **custom observer** to the tuning process. One possibility is an observer that records the number of registers used by the kernel, so that this value can be added to the metrics.

To create a new observer, extend the `BenchmarkObserver` class that Kernel Tuner provides in the `kernel_tuner.observers` module and implement its `get_results` method. The number of registers used by a kernel instance is available inside your observer class as `self.dev.func.num_regs`.

As usual, how to add observers is described in Kernel Tuner's [API](https://benvanwerkhoven.github.io/kernel_tuner/user-api.html).
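
If you get stuck, one possible shape for such an observer is sketched below. It assumes that `get_results` should return a dictionary whose keys are added to each benchmark result, and it uses the `self.dev.func.num_regs` attribute mentioned above. The class name `RegisterObserver` is a hypothetical choice, and the `try`/`except` fallback exists only so the sketch also runs where Kernel Tuner is not installed.

```python
try:
    from kernel_tuner.observers import BenchmarkObserver
except ImportError:
    # Stand-in base class so this sketch can run without Kernel Tuner installed
    class BenchmarkObserver:
        pass


class RegisterObserver(BenchmarkObserver):
    """Report the number of registers used by the compiled kernel instance."""

    def get_results(self):
        # self.dev is expected to be set by Kernel Tuner before benchmarking starts
        return {"num_regs": self.dev.func.num_regs}
```

The key `num_regs` matches the one used by the commented-out `metrics["regs"]` line in the next cell.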

In [None]:
observers = []

# add a custom observer
from kernel_tuner.observers import BenchmarkObserver

# define your own observer class that extends BenchmarkObserver
#class CustomObserver(BenchmarkObserver):

# implement the get_results method of this class
# ...

# create an instance of your custom observer
custom_observer = None

# append it to the list of observers by uncommenting the line below
#observers.append(custom_observer)

# add a metric so that our observed number of registers appears in the console output
#metrics["regs"] = lambda p:p["num_regs"]


# add an NVMLObserver
from kernel_tuner.nvml import NVMLObserver
nvml_observer = NVMLObserver(["nvml_energy", "temperature"])

observers.append(nvml_observer)

# add metrics to enable console output for observed quantities
metrics["GFLOPS/W"] = lambda p : (2 * 512**3 / 1e9) / (p["nvml_energy"])
metrics["T"] = lambda p:p["temperature"]

# call tune_kernel to tune using our new Observers and additional metrics
results, env = kt.tune_kernel("matmul_kernel", "matmul.cu",
                             problem_size, args, tune_params,
                             observers=observers,
                             grid_div_y=grid_div_y, grid_div_x=grid_div_x,
                             answer=answer, atol=1e-4,
                             restrictions=restrict, verbose=True, iterations=32, metrics=metrics, lang="cupy")
print(f"Number of configurations: {len(results)}")
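
The `GFLOPS/W` metric above divides the work in GFLOP by the measured energy. The short sketch below works through the units, assuming that `nvml_energy` is reported in joules per kernel execution and `time` in milliseconds: GFLOP per joule is the same quantity as GFLOP/s per watt. The helper names and the numbers are made up for illustration and are not measurements.

```python
N = 512
GFLOP = 2 * N**3 / 1e9  # total work of one matrix multiplication, in GFLOP


def gflops_per_watt(energy_joules):
    """Same formula as the GFLOPS/W metric above: GFLOP divided by joules."""
    return GFLOP / energy_joules


def gflops_per_watt_from_power(time_ms, power_watts):
    """Equivalent route: throughput (GFLOP/s) divided by average power (W)."""
    throughput = GFLOP / (time_ms / 1e3)
    return throughput / power_watts


# Invented example values: 0.25 ms at an average of 200 W corresponds to 0.05 J
time_ms, power_watts = 0.25, 200.0
energy_joules = power_watts * time_ms / 1e3

print(gflops_per_watt(energy_joules))
print(gflops_per_watt_from_power(time_ms, power_watts))
```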