# Kernel Tuner Tutorial

## Getting Started Hands-on

In this hands-on we will look at two features of Kernel Tuner that were recently introduced to you: **tunable grid dimensions** and **user-defined metrics**.

But first, if you have not done so already, install and import `kernel_tuner` and its dependencies.

In [None]:
%pip install numpy
%pip install pycuda
%pip install kernel_tuner

import numpy as np
import kernel_tuner as kt
import collections

To introduce these concepts we will use a modified vector add kernel.

This kernel computes the same result as the kernel in the previous hands-on, i.e. the elementwise sum of two vectors of size `n`, but each thread can compute more than one element.

The content of the cell is written to the `vector_add_tiled.cu` file. You only need to execute this cell once, as this hands-on does not require changing the implementation of the kernel.

In [None]:
%%writefile vector_add_tiled.cu

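// tiling_factor is a tunable parameter that Kernel Tuner defines at compile time
__global__ void vector_add(float * c, float * a, float * b, int n) {
    // index of the first element processed by this thread
    int i = (blockIdx.x * blockDim.x * tiling_factor) + (threadIdx.x * tiling_factor);

    // only process the tile if it fits entirely within the vectors
    if ( (i + tiling_factor) <= n ) {
        #pragma unroll
        for ( int item = 0; item < tiling_factor; item++ ) {
            c[i + item] = a[i + item] + b[i + item];
        }
    }
}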
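
To make the index arithmetic of the tiled kernel easier to follow, here is a minimal CPU sketch in plain Python that emulates which elements each thread writes. The function name `covered_indices` and the small sizes are hypothetical and chosen only for illustration. The sketch also shows that, with the guard as written, a trailing tile that does not fit entirely within `n` is left untouched.

```python
def covered_indices(n, grid_size, block_size_x, tiling_factor):
    """Return the set of element indices written by the emulated kernel."""
    written = set()
    for block_idx in range(grid_size):
        for thread_idx in range(block_size_x):
            # same start index as in the CUDA kernel
            i = (block_idx * block_size_x * tiling_factor) + (thread_idx * tiling_factor)
            # same guard as in the CUDA kernel
            if i + tiling_factor <= n:
                for item in range(tiling_factor):
                    written.add(i + item)
    return written


# 2 blocks of 2 threads, 3 elements per thread: 12 elements are fully covered
full = covered_indices(12, grid_size=2, block_size_x=2, tiling_factor=3)
print(sorted(set(range(12)) - full))  # []

# with n = 10 the last tile starts at 9 and does not fit, so element 9 is skipped
partial = covered_indices(10, grid_size=2, block_size_x=2, tiling_factor=3)
print(sorted(set(range(10)) - partial))  # [9]
```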

Before running the code we need to allocate memory and add some tuning parameters.

In [None]:
size = 1000000

compiler_flags = ["-Wno-deprecated-gpu-targets"]

a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
n = np.int32(size)

args = [c, a, b, n]

tune_params = collections.OrderedDict()
# threads per block: 1, 2, 4, ..., 1024
tune_params["block_size_x"] = [2**i for i in range(0, 11)]
# elements computed by each thread: 1, 2, ..., 10
tune_params["tiling_factor"] = list(range(1, 11))
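
As a quick sanity check, the size of the search space defined by these two parameters can be counted without a GPU. This is a small sketch using only the standard library; the variable names are illustrative. The number of configurations that are actually benchmarked may be lower if some of them fail to compile or run on your device.

```python
import itertools

block_sizes = [2**i for i in range(0, 11)]  # 1, 2, 4, ..., 1024
tiling_factors = list(range(1, 11))         # 1, 2, ..., 10

# every combination of block size and tiling factor is one configuration
configurations = list(itertools.product(block_sizes, tiling_factors))
print(len(configurations))  # 110
```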

Normally, Kernel Tuner computes the CUDA grid size for you based on the problem size and the number of threads per block. However, this is not possible in cases like this one, where tunable parameters other than the number of threads (here, `tiling_factor`) also affect the grid size.

It is your responsibility to configure Kernel Tuner to work with **tunable grid dimensions**. To do so, define a Python list containing a string that represents the right expression to use for computing the grid size.

In [None]:
# define how the problem size should be divided to obtain the grid size
grid_div_x = []
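
The arithmetic behind the grid size can be sketched in plain Python. This assumes, as we understand the documented behaviour, that Kernel Tuner divides the problem size by the grid divisor and rounds up. The helper `grid_size_x` is hypothetical and is not part of Kernel Tuner. Since each thread block of the tiled kernel processes `block_size_x * tiling_factor` elements, dividing by `block_size_x` alone would launch more blocks than needed.

```python
def grid_size_x(problem_size, block_size_x, tiling_factor):
    """Number of thread blocks needed when each thread handles tiling_factor elements."""
    elements_per_block = block_size_x * tiling_factor
    # integer ceiling division
    return (problem_size + elements_per_block - 1) // elements_per_block


size = 1000000
print(grid_size_x(size, 256, 4))   # 977
print(grid_size_x(size, 1024, 10)) # 98
# dividing by the block size only (tiling_factor ignored) gives about 4x more blocks
print(grid_size_x(size, 256, 1))   # 3907
```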

Execution time is important, but it is not always the most relevant metric for users. For this reason, Kernel Tuner lets you work with **user-defined metrics** that are computed within, and then returned by, `tune_kernel`.

Metrics are passed to Kernel Tuner as `lambda` functions contained in a dictionary, with the key of the entry being the name of the metric itself.

It is your responsibility to define one or more metrics and then tune the provided kernel. Possible user-defined metrics in this case are the number of operations per second or the memory bandwidth.


In [None]:
# define at least one user defined metric for the kernel
metrics = collections.OrderedDict()
metrics[] = 
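
To illustrate the shape of such a dictionary, here is a minimal sketch that can be run without a GPU. It assumes that each metric function receives a dictionary `p` holding the tunable parameters of a configuration together with its measured `time` in milliseconds. The names `example_metrics` and `example`, as well as the time value, are made up for illustration and are not measurements.

```python
from collections import OrderedDict

size = 1000000  # number of vector elements, as in this hands-on

example_metrics = OrderedDict()
# one floating-point addition per element
example_metrics["GFLOP/s"] = lambda p: (size / 1e9) / (p["time"] / 1e3)
# two reads and one write of a 4-byte float per element
example_metrics["GB/s"] = lambda p: (3 * 4 * size / 1e9) / (p["time"] / 1e3)

# hypothetical configuration with a made-up time in milliseconds
example = {"block_size_x": 128, "tiling_factor": 2, "time": 0.5}
for name, metric in example_metrics.items():
    print(name, metric(example))
```

Because the metrics only depend on the problem size and the measured time, they rank configurations in the same order as the time itself, but they are easier to compare against the peak numbers of the hardware.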

Do not forget to pass the additional parameters to the `tune_kernel` function as documented in Kernel Tuner's [API](https://benvanwerkhoven.github.io/kernel_tuner/user-api.html).

In [None]:
# add the right parameters to the tune_kernel function
results, env = kt.tune_kernel("vector_add", "vector_add_tiled.cu",
                             size, args, tune_params, compiler_options=compiler_flags)

print("Number of configurations: {}".format(len(results)))
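
Once tuning has finished, the returned list can be inspected with ordinary Python. The sketch below assumes that each entry is a dictionary containing the tunable parameter values and a `time` key. The list `example_results` holds made-up placeholder values, not measurements, so that the snippet runs without a GPU.

```python
# placeholder entries that mimic the assumed structure of the tuning results
example_results = [
    {"block_size_x": 32, "tiling_factor": 1, "time": 0.9},
    {"block_size_x": 128, "tiling_factor": 2, "time": 0.5},
    {"block_size_x": 256, "tiling_factor": 4, "time": 0.7},
]

# select the configuration with the lowest execution time
best = min(example_results, key=lambda r: r["time"])
print("Best configuration: block_size_x={}, tiling_factor={}".format(
    best["block_size_x"], best["tiling_factor"]))
```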