# GPU Optimization with Kernel Tuner

## First hands-on

Welcome to the first hands-on of the GPU optimization with Kernel Tuner tutorial.

The goal of this hands-on exercise is to familiarize yourself with calling Kernel Tuner from Python and to tune our first CUDA kernel for performance.

We start by importing `numpy` and `kernel_tuner`.

In [None]:
import numpy as np
import kernel_tuner as kt

Before using Kernel Tuner, we will create a text file containing the code of the CUDA kernel that we are going to use in this hands-on.

This simple kernel is called `vector_add` and computes the element-wise sum of two vectors of size `n`.

In [None]:
%%writefile vector_add_kernel.cu

__global__ void vector_add(float * c, float * a, float * b, int n) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

    if ( i < n ) {
        c[i] = a[i] + b[i];
    }
}

The execution of the cell above has created the file `vector_add_kernel.cu` containing the source code of our kernel.
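To make the index arithmetic in the kernel concrete, here is a minimal CPU-only sketch in NumPy that mimics what each GPU thread does. It is not part of Kernel Tuner: the names `grid_size` and `vector_add_reference` are hypothetical helpers introduced for illustration, and the sketch assumes the kernel is launched with just enough thread blocks to cover all `n` elements.

```python
import numpy as np

def grid_size(n, block_size_x):
    # Number of thread blocks needed to cover n elements (ceiling division).
    # Assumption: the launch uses exactly this many blocks.
    return (n + block_size_x - 1) // block_size_x

def vector_add_reference(a, b, block_size_x):
    # Sequential emulation of the vector_add kernel, one loop iteration per thread.
    n = a.size
    c = np.zeros_like(a)
    for block_idx in range(grid_size(n, block_size_x)):
        for thread_idx in range(block_size_x):
            # Same global index computation as in the CUDA kernel
            i = block_idx * block_size_x + thread_idx
            # The last block may contain threads past the end of the vectors,
            # which is why the kernel needs the bounds check
            if i < n:
                c[i] = a[i] + b[i]
    return c

a = np.arange(10, dtype=np.float32)
b = np.ones(10, dtype=np.float32)
c = vector_add_reference(a, b, block_size_x=4)
print(np.array_equal(c, a + b))
```

With 10 elements and 4 threads per block, three blocks are needed and the last two threads of the third block do no work; without the `i < n` check they would access memory out of bounds.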

We can now use Kernel Tuner to execute the code on the GPU; please read both the code and the comments below carefully to become familiar with how Kernel Tuner works.

For more details, refer to the [API documentation](https://KernelTuner.github.io/kernel_tuner/stable/user-api.html).

In [None]:
# the size of the vectors
size = 1000000

# all the kernel input and output data need to use numpy data types,
# note that we explicitly state that these arrays should consist of
# 32 bit floating-point values, to match our kernel source code
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
n = np.int32(size)

# now we combine these variables in an argument list, which matches
# the order and types of the function arguments of our CUDA kernel
args = [c, a, b, n]

# the next step is to create a dictionary to tell Kernel Tuner about
# the tunable parameters in our code and what values these may take
tune_params = dict()
# add some values to the block_size_x tunable parameter
tune_params["block_size_x"] = [1]


# finally, we call tune_kernel to start the tuning process. To do so,
# we pass
#    the name of the kernel we'd like to tune, in our case: "vector_add",
#    the name of the file containing our source code,
#    the problem_size that our kernel operates on
#    the argument list to call our kernel function
#    the dictionary with tunable parameters
results, env = kt.tune_kernel(
    "vector_add",
    "vector_add_kernel.cu",
    size,
    args,
    tune_params,
    lang="cupy",
)

When you run the cell above, Kernel Tuner compiles and benchmarks our `vector_add` kernel once for every thread block dimension listed in `tune_params`. A summary of the results is printed to the console. The `tune_kernel` function also returns two outputs that we saved as `results` and `env`:

* `results` is a list of dictionaries, one per benchmarked configuration, each containing detailed information about that configuration;
* `env` is a dictionary that stores information about the hardware and software environment in which this experiment took place; it is recommended to store this information along with the benchmark results.
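One simple way to keep `env` together with `results` is to write both to a single JSON file. The sketch below uses only the standard `json` module and NumPy; `to_builtin` and `save_tuning_output` are hypothetical helper names, not Kernel Tuner functions, and the placeholder dictionaries are made-up stand-ins for the real outputs of `tune_kernel`. The converter is there on the assumption that the records may contain NumPy values, which `json` cannot serialize on its own.

```python
import json
import os
import tempfile

import numpy as np

def to_builtin(value):
    # Convert NumPy scalars and arrays to plain Python types for JSON
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Cannot serialize object of type {type(value).__name__}")

def save_tuning_output(results, env, filename):
    # Store the benchmark results and the environment side by side
    with open(filename, "w") as fh:
        json.dump({"results": results, "env": env}, fh, indent=2, default=to_builtin)

# Placeholder stand-ins (NOT real measurements); in the notebook you would
# pass the `results` and `env` returned by tune_kernel instead
example_results = [{"block_size_x": 1, "time": np.float32(1.0), "times": [1.0, 1.0]}]
example_env = {"device_name": "placeholder"}

filename = os.path.join(tempfile.mkdtemp(), "vector_add_tuning.json")
save_tuning_output(example_results, example_env, filename)

# Read the file back to check that both parts survived the round trip
with open(filename) as fh:
    loaded = json.load(fh)
print(sorted(loaded.keys()))
```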

We can also print the content of `results` to have a look at the output.

In [None]:
print(f"Number of configurations: {len(results)}")
for res in results:
    print(res)

As we can see, each entry in the `results` returned by `tune_kernel` lists the average execution time of our kernel in the field "time", and also includes the individual measurements for that configuration under "times".
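Once several configurations have been benchmarked, the "time" and "times" fields can be used to pick the fastest one and to see how much the individual measurements vary. The following is a minimal sketch that runs without a GPU: `example_results` is a hand-written list with made-up numbers that only imitates the shape of the real `results`, so the values say nothing about actual performance.

```python
import statistics

# Made-up records imitating the structure of `results` (NOT real measurements)
example_results = [
    {"block_size_x": 32, "time": 3.0, "times": [3.1, 2.9, 3.0]},
    {"block_size_x": 128, "time": 1.0, "times": [1.1, 0.9, 1.0]},
    {"block_size_x": 1024, "time": 2.0, "times": [2.1, 1.9, 2.0]},
]

# The fastest configuration is the one with the lowest average time
best = min(example_results, key=lambda res: res["time"])
print(f"Fastest configuration: block_size_x={best['block_size_x']}")

# The individual measurements show how stable each benchmark was
for res in example_results:
    spread = statistics.stdev(res["times"])
    print(f"block_size_x={res['block_size_x']}: time={res['time']}, stdev={spread:.3f}")
```

In the notebook, replacing `example_results` with the `results` returned by `tune_kernel` applies the same selection to the real measurements.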

Kernel Tuner also collects other timing information, including the time it took to compile and to benchmark our kernel.
