# Kernel Tuner Tutorial

## Introduction Hands-on

Welcome to the first hands-on of the Kernel Tuner tutorial. In this hands-on exercise we will learn how to **install** and access Kernel Tuner from Python, and **tune** our first CUDA kernel.

To install the latest version of `kernel_tuner` we use `pip`. In general, before installing Kernel Tuner, we need a working CUDA installation and both `numpy` and `cupy` installed. On Google Colab, these packages are already present and we only need to install Kernel Tuner.

In [None]:
%pip install kernel-tuner

After installing all necessary packages, we can import `numpy` and `kernel_tuner`.

In [None]:
import numpy as np
import kernel_tuner as kt

Before using Kernel Tuner, we will create a text file containing the code of the CUDA kernel that we are going to use in this hands-on.

This simple kernel is called `vector_add` and computes the elementwise sum of two vectors of size `n`.

In [None]:
%%writefile vector_add_kernel.cu

__global__ void vector_add(float * c, float * a, float * b, int n) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

    if ( i < n ) {
        c[i] = a[i] + b[i];
    }
}

The execution of the cell above created the file `vector_add_kernel.cu` containing the source code of our kernel.
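To make the indexing in the kernel concrete, here is a minimal CPU sketch in Python that mimics how each thread computes its global index and applies the bounds check. The helper name `emulate_vector_add` and the small input sizes are assumptions made for illustration. The loops run sequentially, whereas real CUDA threads run in parallel, so this only illustrates the index arithmetic.

```python
import numpy as np

def emulate_vector_add(a, b, n, block_dim):
    """Sequentially emulate the vector_add kernel (illustrative helper, not part of Kernel Tuner)."""
    c = np.zeros_like(a)
    # number of thread blocks needed to cover n elements (ceiling division)
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            i = block_idx * block_dim + thread_idx
            if i < n:                          # threads past the end do nothing
                c[i] = a[i] + b[i]
    return c

a_small = np.arange(10, dtype=np.float32)
b_small = np.ones(10, dtype=np.float32)
c_small = emulate_vector_add(a_small, b_small, n=10, block_dim=4)
print(c_small)
```

With 10 elements and 4 threads per block, three blocks (12 threads) are launched, and the `i < n` check keeps the last two threads from writing out of bounds.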

We can now use Kernel Tuner to execute the code on the GPU; please read both the code and the comments carefully to become familiar with how Kernel Tuner works. For more details, refer to the [API](https://KernelTuner.github.io/kernel_tuner/stable/user-api.html).

Our first step is to create some input data to test our `vector_add` kernel.

In [None]:
# the size of the vectors
size = 1000000

# all the kernel input and output data need to use numpy data types,
# note that we explicitly state that these arrays should consist of
# 32 bit floating-point values, to match our kernel source code
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
n = np.int32(size)

# EXERCISE 1: combine the above created variables into an argument list that matches
# the order and types of the function arguments of our CUDA kernel
args = [] # Example Python syntax for creating a list is: [x, y, z]

The next step is to define the tunable parameters. These are the things that we would like Kernel Tuner to vary and experiment with in our code. Every tunable parameter has a name (as a string) and a list of possible values (usually a list of integers).

We will use a Python dictionary, which works as a key-value store. Each key is the name of a tunable parameter, and the corresponding value is the list of possible values for that parameter.
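As a rough illustration of what such a dictionary describes, the sketch below enumerates every combination of values for two tunable parameters. The names `tile_size` and `unroll_factor` are hypothetical and do not appear in our kernel. The sketch also assumes an exhaustive search over all combinations, and it uses plain Python rather than Kernel Tuner itself.

```python
from itertools import product

# hypothetical tunable parameters, chosen only for illustration
example_params = {
    "tile_size": [1, 2, 4],
    "unroll_factor": [1, 2],
}

# every configuration is one choice of value per parameter
names = list(example_params.keys())
configs = [dict(zip(names, values)) for values in product(*example_params.values())]

print(f"Number of configurations: {len(configs)}")
for config in configs:
    print(config)
```

The number of configurations is the product of the list lengths, so the search space grows quickly as parameters are added.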

For our first tunable parameter, we will be testing the `vector_add` kernel with different thread block dimensions. By default, Kernel Tuner assumes that the x-dimension of the thread block is called `block_size_x`. In other words, this is a special tunable parameter name that Kernel Tuner recognizes to mean the thread block size in x.
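The thread block size also determines how many thread blocks are needed to cover the whole problem. The sketch below shows this arithmetic under the assumption that the number of blocks is the problem size divided by the block size, rounded up. The helper `blocks_needed` and the three block sizes are illustrative choices, not part of the Kernel Tuner API.

```python
problem_size = 1000000

def blocks_needed(problem_size, block_size_x):
    """Ceiling division: the smallest number of blocks that covers all elements."""
    return (problem_size + block_size_x - 1) // block_size_x

for block_size_x in [32, 128, 1024]:
    num_blocks = blocks_needed(problem_size, block_size_x)
    # threads in the last block that fall outside the vector
    idle_threads = num_blocks * block_size_x - problem_size
    print(f"block_size_x={block_size_x}: {num_blocks} blocks, {idle_threads} idle threads")
```

For these values the loop reports 31250, 7813, and 977 blocks, with 0, 64, and 448 idle threads respectively. The idle threads are the reason the kernel needs its `i < n` check.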

In [None]:
# EXERCISE 2: Create a dictionary to tell Kernel Tuner about
# the tunable parameters in our code and what values these may take.
#
# There are many ways in which you can instantiate dictionaries in Python.
# Let's call our dictionary 'tune_params'
tune_params = dict()
# Now we can insert a new key named 'block_size_x' and supply it
# with a list of possible values for our thread block size
# try for example some powers of 2
tune_params['block_size_x'] = [] # TODO: insert values here

Now, we are ready to call the tuner.

In [None]:
# First we check that both exercises have been completed. If so, we call
# tune_kernel to start the tuning process. To do so, we pass
#    the name of the kernel we'd like to tune, in our case: "vector_add",
#    the name of the file containing our source code,
#    the "problem_size" that our kernel operates on,
#    the argument list to call our kernel function,
#    the dictionary with tunable parameters,
#    lang="cupy" to compile and run the kernel through CuPy
if not args:
    print("Error: You have to first create the argument list (Exercise 1)")
elif not tune_params['block_size_x']:
    print("Error: You have to first insert some values for the block_size_x (Exercise 2)")
else:
    results, env = kt.tune_kernel("vector_add", "vector_add_kernel.cu", size, args, tune_params, lang="cupy")



The `tune_kernel` function returns two outputs that we saved as `results` and `env`:

* `results` is a list of dictionaries, each containing detailed information about one of the configurations that have been benchmarked;
* `env` is a dictionary that stores information about the hardware and software environment in which this experiment took place; it is recommended to store this information along with the benchmark results.
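Because `results` is a list of dictionaries, ordinary Python is enough to pick out the fastest configuration and to store it together with `env`. The sketch below uses mock data so that it runs without a GPU: the timing values are invented placeholders, not measurements. It assumes that each entry holds the tunable parameter values plus a `time` key, and the `device_name` key in the mock environment is likewise an assumption.

```python
import json

# mock data standing in for the output of tune_kernel; all values are invented
example_results = [
    {"block_size_x": 64, "time": 0.052},
    {"block_size_x": 128, "time": 0.047},
    {"block_size_x": 256, "time": 0.049},
]
example_env = {"device_name": "example GPU"}

# the configuration with the lowest measured time
best = min(example_results, key=lambda config: config["time"])
print(f"Best configuration: block_size_x={best['block_size_x']}")

# keep the results and the environment together, for example as JSON text
serialized = json.dumps({"results": example_results, "env": example_env}, indent=2)
print(serialized)
```

Real output may contain additional keys or NumPy values that need converting before they can be written as JSON.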

We can also print the content of `results` to have a look at the output.

In [None]:
print(f"Number of configurations: {len(results)}")
print(results)

Congratulations! You have made your first steps towards automatic performance tuning of GPU kernels!

There is much more training material available on: https://github.com/kerneltuner/kernel_tuner_tutorial