# Kernel Tuner Tutorial

## Intermediate Hands-on

In this hands-on we will look at two features of Kernel Tuner that have been recently introduced to you: **search space restrictions** and **caching**.

But first, if you have not done it already, it is time to install and import `kernel_tuner` and its dependencies.

In [None]:
%pip install numpy
%pip install pycuda
%pip install kernel_tuner

import numpy as np
import kernel_tuner as kt
import collections

To work with these features we will use a matrix multiplication kernel.

Matrix multiplication is one of the best-known and most widely used linear algebra operations, and is frequently used to demonstrate the high-performance computing capabilities of GPUs. As such, it is a familiar starting point for many GPU programmers. More information about matrix multiplication can be found on [Wikipedia](https://en.wikipedia.org/wiki/Matrix_multiplication).

The following cell contains the code of a matrix multiplication kernel that uses shared memory. The content of the cell is written to the file `matmul_shared.cu`; you only need to execute the cell once, because this hands-on does not require changing the kernel implementation.

This kernel assumes that the width and height of the matrices `A`, `B`, and `C` are all equal to `WIDTH`, which is known at compile time. In practice you would want a more flexible solution, but this is just an example kernel to demonstrate how to use Kernel Tuner.

In [None]:
%%writefile matmul_shared.cu

#define WIDTH 512

__global__ void matmul_kernel(float *C, float *A, float *B) {

    __shared__ float sA[block_size_y][block_size_x];
    __shared__ float sB[block_size_y][block_size_x];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = blockIdx.x * block_size_x + tx;
    int y = blockIdx.y * block_size_y + ty;

    float sum = 0.0f;
    int k,kb;

    for (k=0; k<WIDTH; k+=block_size_x) {
        __syncthreads();
        sA[ty][tx] = A[y*WIDTH+k+tx];
        sB[ty][tx] = B[(k+ty)*WIDTH+x];
        __syncthreads();

        for (kb=0; kb<block_size_x; kb++) {
            sum += sA[ty][kb] * sB[kb][tx];
        }

    }

    C[y*WIDTH+x] = sum;
}

Before running the code we need to allocate input and output matrices, and add some tuning parameters.

In [None]:
# matrix width needs to match the value in the kernel source
problem_size = (512, 512)

compiler_flags = ["-Wno-deprecated-gpu-targets"]

A = np.random.randn(*problem_size).astype(np.float32)
B = np.random.randn(*problem_size).astype(np.float32)
C = np.zeros_like(A)

args = [C, A, B]

tune_params = collections.OrderedDict()
tune_params["block_size_x"] = [2**i for i in range(0, 11)]
tune_params["block_size_y"] = [2**i for i in range(0, 11)]

It is now your turn to add some **search space restrictions**. You are free to add as many restrictions as you want, but one in particular is required for the kernel to produce correct results: the thread block must be square, that is, `block_size_x` must equal `block_size_y`.

Remember that restrictions are specified as either a Python list containing strings, each string being one restriction, or as a callable object that returns `True` if the configuration is valid and `False` otherwise.
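As a rough illustration of the two forms, the sketch below filters the full search space in plain Python, without calling Kernel Tuner. It assumes that a callable restriction receives a dictionary mapping parameter names to values, which may differ between Kernel Tuner versions, and it uses `eval` only to mimic how string restrictions could be interpreted. The names `params`, `string_restrictions`, and `callable_restriction` are hypothetical.

```python
import itertools

# same parameter values as in the tuning cell above
params = {
    "block_size_x": [2**i for i in range(0, 11)],
    "block_size_y": [2**i for i in range(0, 11)],
}

# form 1: a list of strings, each string being one restriction
string_restrictions = ["block_size_x == block_size_y"]

# form 2: a callable returning True for valid configurations
# (assumption: the configuration is passed as a dict)
def callable_restriction(config):
    return config["block_size_x"] == config["block_size_y"]

# enumerate every combination of parameter values
names = list(params)
all_configs = [dict(zip(names, values))
               for values in itertools.product(*params.values())]

# illustration only: evaluate each string with the parameter values as variables
valid_from_strings = [c for c in all_configs
                      if all(eval(r, {}, c) for r in string_restrictions)]
valid_from_callable = [c for c in all_configs if callable_restriction(c)]

print(len(all_configs), len(valid_from_strings), len(valid_from_callable))
```

Both forms keep the same configurations here; the string form is usually shorter, while a callable is convenient when a restriction needs more logic than a single expression.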

In [None]:
# define some search space restrictions for the matrix multiplication kernel
restrict = 

To enable the **caching** of intermediate results during tuning, Kernel Tuner needs to know the name of the cache file. The name can be specified as a string, to which Kernel Tuner automatically adds the `.json` extension.
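To make the idea behind caching concrete, here is a minimal sketch of the general technique: results are stored in a JSON file, and a later run skips every configuration that is already stored. This is not Kernel Tuner's implementation or file format; `benchmark` and `tune_with_cache` are hypothetical helpers, and the "timing" is a made-up number.

```python
import json
import os
import tempfile

def benchmark(config):
    """Hypothetical stand-in for compiling and timing one configuration."""
    return float(config["block_size_x"] * config["block_size_y"])

def tune_with_cache(configs, cache_path):
    """Benchmark only the configurations that are not yet in the JSON cache."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    benchmarked = 0
    for config in configs:
        # build a string key from the parameter values
        key = ",".join(str(v) for v in config.values())
        if key not in cache:
            cache[key] = benchmark(config)
            benchmarked += 1
            # write after every new result so an interrupted run keeps its progress
            with open(cache_path, "w") as f:
                json.dump(cache, f)
    return cache, benchmarked

configs = [{"block_size_x": 2**i, "block_size_y": 2**i} for i in range(6)]
cache_path = os.path.join(tempfile.mkdtemp(), "matmul_cache.json")

_, first_run = tune_with_cache(configs, cache_path)
_, second_run = tune_with_cache(configs, cache_path)
print(first_run, second_run)
```

The second call finds every configuration in the file and benchmarks nothing, which is the behaviour that makes a cache useful when a long tuning run is interrupted or repeated.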

In [None]:
# define a string containing the cache file name
cache_name = 

Do not forget to pass the restrictions to the `tune_kernel` function and enable caching as documented in Kernel Tuner's [API](https://benvanwerkhoven.github.io/kernel_tuner/user-api.html).

In [None]:
# add the right parameters to the tune_kernel function
results, env = kt.tune_kernel("matmul_kernel", "matmul_shared.cu",
                              problem_size, args, tune_params,
                              restrictions=restrict, cache=cache_name,
                              verbose=True, compiler_options=compiler_flags)

print("Number of configurations: {}".format(len(results)))


### Output verification

There are times, like with this matrix multiplication kernel, when some tuning configurations may produce wrong results.

It is important to catch such errors as early as possible. Kernel Tuner lets you pass a reference answer to the `tune_kernel` function, against which the results produced by every configuration are compared.

The reference answer is a Python list that matches the kernel's argument list (`args` in our case) in size and order, with `None` for every element that does not need to be compared. When working with floating-point values, Kernel Tuner also lets you specify a tolerance.

Again, refer to the [API](https://benvanwerkhoven.github.io/kernel_tuner/user-api.html) for more information.
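The layout of such a reference list can be sketched with NumPy on a small example. The 64x64 size and the names `A_small`, `B_small`, `C_small`, and `example_reference` are chosen only for this illustration, and `np.allclose` is used to show what a comparison with an absolute tolerance looks like; it is an assumption that Kernel Tuner's own check behaves similarly.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 64  # small illustrative size, not the 512 used by the kernel

A_small = rng.standard_normal((size, size)).astype(np.float32)
B_small = rng.standard_normal((size, size)).astype(np.float32)
C_small = np.zeros_like(A_small)
example_args = [C_small, A_small, B_small]

# one entry per kernel argument: the expected output for C,
# and None for the inputs A and B, which do not need to be compared
example_reference = [A_small @ B_small, None, None]

# a result identical to the reference passes the tolerance check
good_result = example_reference[0].copy()
# a result that is off by 1.0 everywhere does not
bad_result = example_reference[0] + 1.0

good_ok = np.allclose(good_result, example_reference[0], atol=1e-4)
bad_ok = np.allclose(bad_result, example_reference[0], atol=1e-4)
print(good_ok, bad_ok)
```

Only the first element is compared because `C` is the kernel's only output; the tolerance absorbs the small rounding differences that are expected between a GPU and a CPU computation.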

In [None]:
# compute the reference result, e.g. by using NumPy
reference = 

# add the right parameters to the tune_kernel function
results, env = kt.tune_kernel("matmul_kernel", "matmul_shared.cu",
                              problem_size, args, tune_params,
                              restrictions=restrict, answer=reference,
                              compiler_options=compiler_flags, atol=1e-4)

print("Number of configurations: {}".format(len(results)))