# Kernel Tuner Tutorial

## Getting Started Hands-on

In this hands-on we will look at two features of Kernel Tuner that were recently introduced to you: **tunable grid dimensions** and **user-defined metrics**.

But first, if you have not done so already, install and import `kernel_tuner` and its dependencies.

In [None]:
%pip install kernel_tuner

import numpy as np
import kernel_tuner as kt
import collections

To introduce these concepts we will use a modified vector add kernel.

This kernel computes the same result as the kernel in the previous hands-on, i.e. the elementwise sum of two vectors of size `n`, but each thread can compute more than one element.

The content of the cell is written to the file `vector_add_tiled.cu`. You only need to execute this cell once, because this hands-on does not require changing the implementation of the kernel.

In [None]:
%%writefile vector_add_tiled.cu

__global__ void vector_add(float * c, float * a, float * b, int n) {
    int i = (blockIdx.x * blockDim.x * tiling_factor) + (threadIdx.x * tiling_factor);

    #pragma unroll
    for ( int item = 0; item < tiling_factor; item++ ) {
        // Check every element, so the tail is still computed when n is not a multiple of tiling_factor
        if ( (i + item) < n ) {
            c[i + item] = a[i + item] + b[i + item];
        }
    }
}
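
To make the index arithmetic of the kernel concrete, the following minimal Python sketch reproduces which elements a single thread is responsible for. The helper `tile_indices` is hypothetical and written only for illustration. It mirrors the expression used to compute `i` in the kernel and leaves out the bounds check against `n`.

```python
def tile_indices(block_idx, thread_idx, block_dim, tiling_factor):
    """Return the element indices handled by one thread (bounds check omitted)."""
    # Same expression as in the kernel:
    # i = blockIdx.x * blockDim.x * tiling_factor + threadIdx.x * tiling_factor
    start = (block_idx * block_dim * tiling_factor) + (thread_idx * tiling_factor)
    return list(range(start, start + tiling_factor))


# With 4 threads per block and a tiling factor of 3:
print(tile_indices(0, 0, 4, 3))  # [0, 1, 2]
print(tile_indices(0, 1, 4, 3))  # [3, 4, 5]
print(tile_indices(1, 0, 4, 3))  # [12, 13, 14]
```

Each thread processes `tiling_factor` consecutive elements, so one thread block covers `blockDim.x * tiling_factor` elements of the vectors.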

Before running the code we need to allocate memory and add some tuning parameters.

In [None]:
size = 1_000_000

a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
n = np.int32(size)

args = [c, a, b, n]

tune_params = collections.OrderedDict()
tune_params["block_size_x"] = [2**i for i in range(0, 11)]
tune_params["tiling_factor"] = list(range(1, 11))

Normally, Kernel Tuner computes the grid size of our CUDA kernel automatically, based on the problem size and the number of threads per block (`block_size_x`). However, this is not possible when other tunable parameters (here, `tiling_factor`) also affect the grid size.

It is your responsibility to tell Kernel Tuner to work with **tunable grid dimensions**. To do so, you can define a Python list containing the names of the tunable parameters that should be used to compute the grid dimensions from the problem size.
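
The effect of tiling on the grid size can be checked with a little arithmetic. The sketch below is an illustration only, not Kernel Tuner's internal code: the helper `blocks_needed` is hypothetical and assumes that the number of thread blocks is the problem size divided by the number of elements covered per block, rounded up.

```python
def blocks_needed(n_elements, block_size_x, tiling_factor=1):
    """Thread blocks required when each thread handles tiling_factor elements."""
    elements_per_block = block_size_x * tiling_factor
    # Integer ceiling division: round up so the last partial block is included
    return (n_elements + elements_per_block - 1) // elements_per_block


n_elements = 1_000_000
print(blocks_needed(n_elements, 256))     # 3907 blocks without tiling
print(blocks_needed(n_elements, 256, 4))  # 977 blocks with tiling_factor = 4
```

A grid computed from the block size alone would launch roughly four times more blocks than the tiled kernel needs in the second case.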

In [None]:
# EXERCISE 1: Provide a list of tunable parameter names that divide the grid dimensions
grid_div_x = []

Execution time is important, but it is not always the most relevant metric for users. For this reason, Kernel Tuner supports **user-defined metrics**, which are computed within `tune_kernel` and returned together with the other results.

Metrics are passed to Kernel Tuner as `lambda` functions stored in an ordered dictionary, where the key of each entry is the name of the metric. The order matters because a metric may build on metrics defined earlier in the dictionary.
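
The following sketch illustrates why the order matters. It does not use Kernel Tuner itself: the helper `apply_metrics`, the peak value, and the measurement in `fake_result` are all made up for illustration, and the evaluation loop is only an assumption about how an ordered dictionary of metrics can be applied.

```python
from collections import OrderedDict

n_elements = 1_000_000          # one floating-point addition per element
HYPOTHETICAL_PEAK_GFLOPS = 100  # made-up peak, used only for this illustration

example_metrics = OrderedDict()
# First metric: uses only the measured time (in milliseconds)
example_metrics["GFLOP/s"] = lambda p: (n_elements / 1e9) / (p["time"] / 1e3)
# Second metric: builds on the first one, so it must be defined after it
example_metrics["Fraction of peak"] = lambda p: p["GFLOP/s"] / HYPOTHETICAL_PEAK_GFLOPS


def apply_metrics(result, metrics):
    """Evaluate the metrics in order, storing each value so later metrics can read it."""
    result = dict(result)  # work on a copy, leave the input untouched
    for name, func in metrics.items():
        result[name] = func(result)
    return result


fake_result = {"block_size_x": 128, "tiling_factor": 2, "time": 0.5}  # made-up measurement
print(apply_metrics(fake_result, example_metrics))
```

If the two entries were defined in the opposite order, the second lambda would look up `"GFLOP/s"` before it exists and raise a `KeyError`.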

It is your responsibility to define one or more metrics and then tune the provided kernel. Possible user-defined metrics in this case are the number of operations per second and the memory bandwidth.


In [None]:
# First we create an OrderedDict. Since Python 3.7, regular dictionaries also preserve insertion order.
metrics = collections.OrderedDict()

# Now we define our first metric. In this case, we want the performance of our kernel to
# be computed in billions of floating-point operations per second.
metrics["Performance (GFLOP/s)"] = lambda p: (n / 1e9) / (p["time"] / 1e3)
# Let's unpack what the above line means. We've created a lambda function that
# takes an argument 'p' that contains the results collected by Kernel Tuner.
# Our function should return the performance in GFLOP/s of this specific code
# variant of our kernel.
# Because 'n' is the size of our array, and equal to the number of floating-point additions
# our kernel performs, we start by dividing n by one billion (1e9).
# Kernel Tuner measures the execution time of our kernel in milliseconds. So, to arrive
# at the execution time in seconds, we divide the execution time by a thousand.

# EXERCISE 2: Define a user-defined metric for the achieved memory bandwidth (throughput)
# of our vector_add kernel, using "Throughput (GB/s)" as the key.
# Because the vector_add kernel reads twice as much data as it writes, it is OK
# to only consider the bandwidth required for the input data.
# Think of how to express the throughput of our kernel in gigabytes per second.

Now we are ready to pass these additional arguments to the `tune_kernel` function as documented in Kernel Tuner's [API](https://KernelTuner.github.io/kernel_tuner/stable/user-api.html).

In [None]:
# Stays None if one of the exercises has not been completed yet
results = None

if not grid_div_x:
    print("Error: first set up grid_div_x (Exercise 1)")
elif "Throughput (GB/s)" not in metrics:
    print("Error: first define a metric for the throughput (Exercise 2)")
else:
    # Call the tuner
    # Mostly the same as before, but now we also pass:
    #    grid_div_x, to tell Kernel Tuner how to compute the grid dimensions
    #    metrics, a dictionary with user-defined metrics
    results, env = kt.tune_kernel("vector_add", "vector_add_tiled.cu", size, args, tune_params,
                                  grid_div_x=grid_div_x,
                                  metrics=metrics,
                                  lang="cupy")

if results:
    print(f"Number of configurations: {len(results)}")
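
Once tuning has finished, the returned list can be inspected with plain Python. The sketch below assumes that each entry is a dictionary holding the tunable parameters, the measured `"time"` in milliseconds, and the user-defined metrics. The entries in `example_results` are made up so that the snippet runs on its own, and the filter on numeric times is a defensive assumption in case failed configurations are reported without a numeric time.

```python
# Made-up entries that mimic the structure assumed for the tuning results
example_results = [
    {"block_size_x": 64, "tiling_factor": 1, "time": 0.040, "Performance (GFLOP/s)": 25.0},
    {"block_size_x": 128, "tiling_factor": 2, "time": 0.025, "Performance (GFLOP/s)": 40.0},
    {"block_size_x": 256, "tiling_factor": 4, "time": 0.050, "Performance (GFLOP/s)": 20.0},
]

# Keep only configurations with a numeric execution time
valid = [r for r in example_results if isinstance(r["time"], (int, float))]

# The best configuration is the one with the lowest execution time
best = min(valid, key=lambda r: r["time"])
print(f"Best configuration: block_size_x={best['block_size_x']}, "
      f"tiling_factor={best['tiling_factor']}, time={best['time']} ms")
```

Replacing `example_results` with the `results` returned by `tune_kernel` applies the same selection to the real measurements.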