# Kernel Tuner Tutorial

## Introduction Hands-on

Welcome to the first hands-on of the Kernel Tuner tutorial. In this hands-on exercise we will learn how to **install** and access Kernel Tuner from Python, and **tune** our first CUDA kernel.

We install `kernel_tuner` with `pip`. Before installing it, we need a working CUDA installation, and both `numpy` and `cupy` must be installed.

In [None]:
%pip install kernel-tuner[tutorial]==1.0.0b3

After installing all necessary packages, we can import `numpy` and `kernel_tuner`.

In [None]:
import numpy as np
import kernel_tuner as kt

Before using Kernel Tuner, we will create a text file containing the code of the CUDA kernel that we are going to use in this hands-on.

This simple kernel is called `vector_add` and computes the elementwise sum of two vectors of size `n`.

In [None]:
%%writefile vector_add_kernel.cu

__global__ void vector_add(float * c, float * a, float * b, int n) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    
    if ( i < n ) {
        c[i] = a[i] + b[i];
    }
}

The execution of the cell above created the file `vector_add_kernel.cu` containing the source code of our kernel.
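
To make the indexing in the kernel concrete, here is a minimal CPU sketch that emulates it with NumPy, one simulated thread at a time. The helper `vector_add_reference` and the small test arrays are hypothetical and only serve as illustration; they are not part of Kernel Tuner.

```python
import numpy as np

def vector_add_reference(a, b, block_size_x=256):
    """Emulate the vector_add CUDA kernel on the CPU (illustrative helper)."""
    n = a.size
    c = np.zeros_like(a)
    # number of thread blocks needed to cover all n elements, rounded up
    num_blocks = (n + block_size_x - 1) // block_size_x
    for block_idx in range(num_blocks):
        for thread_idx in range(block_size_x):
            # same global index computation as in the kernel
            i = block_idx * block_size_x + thread_idx
            # same bounds check: threads past the end of the vector do nothing
            if i < n:
                c[i] = a[i] + b[i]
    return c

a_small = np.arange(10, dtype=np.float32)
b_small = np.ones(10, dtype=np.float32)
c_small = vector_add_reference(a_small, b_small, block_size_x=4)
print(c_small)
```

With 10 elements and 4 threads per block, the third block is only partially used, which is exactly the case the `i < n` check guards against.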

We can now use Kernel Tuner to execute the code on the GPU; please read carefully both code and comments to become familiar with how Kernel Tuner works.

For more details refer to the [API](https://KernelTuner.github.io/kernel_tuner/stable/user-api.html).

In [None]:
# the size of the vectors
size = 1000000

# all the kernel input and output data need to use numpy data types,
# note that we explicitly state that these arrays should consist of
# 32 bit floating-point values, to match our kernel source code
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
n = np.int32(size)

# now we combine these variables in an argument list, which matches
# the order and types of the function arguments of our CUDA kernel
args = [c, a, b, n]

# the next step is to create a dictionary to tell Kernel Tuner about
# the tunable parameters in our code and what values these may take
tune_params = dict(block_size_x=[16, 32, 64, 128, 256, 512, 1024])

# finally, we call tune_kernel to start the tuning process. To do so,
# we pass 
#    the name of the kernel we'd like to tune, in our case: "vector_add", 
#    the name of the file containing our source code,
#    the problem_size that our kernel operates on
#    the argument list to call our kernel function
#    the dictionary with tunable parameters
results, env = kt.tune_kernel("vector_add", "vector_add_kernel.cu", size, args, tune_params, lang="cupy")
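
Note that the call above never specifies a grid size. As far as the Kernel Tuner documentation describes it, the number of thread blocks is by default derived by dividing `problem_size` by `block_size_x` and rounding up. The sketch below only reproduces that arithmetic; the helper `num_thread_blocks` is hypothetical and not a Kernel Tuner function.

```python
def num_thread_blocks(problem_size, block_size_x):
    """Ceiling division: blocks needed to cover problem_size elements."""
    return (problem_size + block_size_x - 1) // block_size_x

example_size = 1000000
grid_sizes = {
    bs: num_thread_blocks(example_size, bs)
    for bs in [16, 32, 64, 128, 256, 512, 1024]
}

for bs, blocks in grid_sizes.items():
    print(f"block_size_x={bs:4d} -> {blocks} thread blocks")
```

Because the result is rounded up, the last thread block may contain threads that fall outside the vector, which is why the kernel checks `i < n`.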



The `tune_kernel` function returns two outputs that we saved as `results` and `env`:

* `results` is a list of dictionaries, each containing detailed information about the configurations that have been benchmarked;
* `env` is a dictionary that stores information about the hardware and software environment in which this experiment took place; it is recommended to store this information along with the benchmark results.
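
One possible way to store both outputs together is a single JSON file. This is a minimal sketch, not a Kernel Tuner feature: the file name, the helper `to_builtin`, and the placeholder dictionaries are assumptions standing in for the real `results` and `env`, whose exact keys depend on your setup.

```python
import json
from pathlib import Path

import numpy as np

def to_builtin(value):
    """Convert NumPy scalars and arrays to plain Python types for JSON."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Cannot serialize object of type {type(value)}")

# hypothetical placeholders standing in for the outputs of tune_kernel
example_results = [{"block_size_x": 128, "time": np.float32(0.05)}]
example_env = {"device_name": "placeholder GPU"}

# hypothetical output file name
out_file = Path("vector_add_results.json")
out_file.write_text(
    json.dumps({"results": example_results, "env": example_env},
               default=to_builtin, indent=2)
)

# read the file back to check that nothing was lost
loaded = json.loads(out_file.read_text())
print(loaded["env"])
```

The `default=to_builtin` hook is there because values such as `np.float32` are not JSON-serializable out of the box.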

We can also print the content of `results` to have a look at the output.

In [None]:
print(f"Number of configurations: {len(results)}")
for res in results:
    print(res)

As we can see, the results returned by `tune_kernel` list the average execution time of our kernel in the field "time", and even include the individual measurements for each benchmarked configuration, stored under "times".

Kernel Tuner also collects a lot of other timing information, including the overall time it took to compile and benchmark our kernel.
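
Since `results` is a plain list of dictionaries, ordinary Python is enough to pick out the fastest configuration. The sketch below uses made-up placeholder numbers that only mimic the "time" and "times" fields; they are not measurements.

```python
# made-up placeholder entries, NOT real benchmark results
placeholder_results = [
    {"block_size_x": 64, "time": 0.060, "times": [0.061, 0.059, 0.060]},
    {"block_size_x": 128, "time": 0.050, "times": [0.050, 0.051, 0.049]},
    {"block_size_x": 256, "time": 0.055, "times": [0.055, 0.056, 0.054]},
]

# the configuration with the lowest average execution time
best = min(placeholder_results, key=lambda r: r["time"])

# spread of the individual measurements of that configuration
spread = max(best["times"]) - min(best["times"])

print(f"best block_size_x: {best['block_size_x']}")
print(f"average time: {best['time']} ms, spread: {spread:.3f} ms")
```

Replacing `placeholder_results` with the `results` list returned by `tune_kernel` applies the same selection to the real measurements.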

## Energy measurements

However, today we are particularly interested in the energy used by our kernels, so let's add our first **Observer**.

In [None]:
from kernel_tuner.observers.nvml import NVMLObserver

The NVMLObserver uses the NVIDIA Management Library (NVML) to query all kinds of information about our GPU while the kernel is running. This way we can measure many things, such as the clock frequency, the temperature, and the power usage of our GPU.

Let's set up an NVMLObserver and tell it about the quantities we want to observe.

In [None]:
# Setup the NVMLObserver

# among the options we can choose from are:
# "nvml_power", "nvml_energy", "core_freq", "mem_freq", "temperature"

# The constructor expects to receive a list of 'observables', e.g. NVMLObserver(["nvml_energy", "temperature"])
# Finish the code below to pick which quantities you want to observe while tuning and construct the NVMLObserver

nvmlobserver = NVMLObserver(["nvml_energy", ...]) # TODO: replace ... with something else


The quantities observed by the Observers are added to the results returned by `tune_kernel`. They are, however, not printed while `tune_kernel` is running. User-defined metrics are always printed to the screen, so let's add a few metrics as well.

In [None]:
nvmlobserver = NVMLObserver(["nvml_energy", "temperature"]) # remove this line for the tutorial

metrics = dict()
metrics["GFLOP/s"]  = lambda p: (size/1e9) / (p["time"]/1e3)  # Kernel Tuner's time is always in ms, so convert to s
metrics["GFLOPS/W"] = lambda p: (size/1e9) / p["nvml_energy"] # computed as GFLOP/J
# Optional TODO: add another metric
#metrics["my_metric"] = lambda p: p[...]

results, env = kt.tune_kernel("vector_add", "vector_add_kernel.cu", size, args, tune_params,
                              observers=[nvmlobserver], metrics=metrics,
                              cache="vec_add_cache.json", lang="cupy")
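
To see what the two metric functions compute, the sketch below applies them by hand to a single hypothetical result. The numbers in `placeholder_result` are made up, and the energy is assumed to be in joules, as the "GFLOP/J" comment in the cell above implies. The kernel performs one addition per element, so the number of floating-point operations equals the vector size.

```python
example_size = 1000000  # one floating-point addition per vector element

example_metrics = {
    # time is in milliseconds, so convert to seconds first
    "GFLOP/s": lambda p: (example_size / 1e9) / (p["time"] / 1e3),
    # GFLOP per joule, which is the same unit as GFLOP/s per watt
    "GFLOPS/W": lambda p: (example_size / 1e9) / p["nvml_energy"],
}

# made-up placeholder values, NOT a real measurement
placeholder_result = {"time": 0.05, "nvml_energy": 0.005}

computed = {name: fn(placeholder_result) for name, fn in example_metrics.items()}
for name, value in computed.items():
    print(f"{name}: {value:.3f}")
```

Each metric is simply a function of the dictionary that holds one configuration's measurements, so any quantity collected by an observer can be used in the same way.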

Kernel Tuner has now printed, for every kernel configuration it benchmarked, the performance in GFLOP/s and the energy efficiency in GFLOPS/W. To get a better view of these results and understand how changing the thread block dimension `block_size_x` influences them, we can use Kernel Tuner's dashboard.

Install and import the dashboard by running the next cell:

In [None]:
%pip install git+https://github.com/KernelTuner/dashboard
import panel as pn
pn.extension(comms='colab')
import ktdashboard.ktdashboard as ktd

Now, let's visualize the results from our last run:

In [None]:
ktd.KTdashboard("vec_add_cache.json").notebook()