# Kernel Tuner Tutorial

Welcome to the first part of the Kernel Tuner tutorial! Let's start by making sure everything we need is installed. Before you run this notebook, make sure you have a working CUDA installation.

If you have not yet installed Kernel Tuner you may run the following cell by selecting it and pressing Shift+Enter. If you have already installed Kernel Tuner then feel free to skip this cell.

In [None]:
%pip install numpy
%pip install kernel_tuner[pycuda]

If needed, you can restart the kernel from the menu at the top of the notebook. Once we have everything installed we can import the modules that we will need for this tutorial.

In [11]:
import numpy as np
import kernel_tuner

Our first GPU kernel will be stored in a file that is created by running the following cell:

In [12]:
%%writefile kernel.cu

__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}


Overwriting kernel.cu


In case you'd like to check the contents of our newly created kernel file from this notebook you can run the following cell:

In [13]:
%pycat kernel.cu

Now that we have our kernel source code in a file, we can get started on the tuning script! We start by generating some input data that we will use to benchmark our GPU kernel.

In [14]:
size = 1000000

# all the kernel input and output data need to use numpy data types,
# note that we explicitly state that these arrays should consist of
# 32 bit floating-point values
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
n = np.int32(size)

# Now we combine these variables in an argument list, which matches
# the order and types of the function arguments of our GPU kernel
args = [c, a, b, n]

# The next step is to create a dictionary to tell Kernel Tuner about
# the tunable parameters in our code and what values these may take
tune_params = dict()
tune_params["block_size_x"] = [32, 64, 128, 256, 512]

# Finally, we call tune_kernel to start the tuning process. To do so,
# we pass 
#    the name of the kernel we'd like to tune "vector_add", 
#    the name of the file containing our source code,
#    the problem_size that our kernel operates on
#    the argument list to call our kernel function
#    the dictionary with tunable parameters
res, _ = kernel_tuner.tune_kernel("vector_add", "kernel.cu", size, args, tune_params)

Using: Tesla V100-PCIE-32GB
block_size_x=32, time=0.631510853767395ms
block_size_x=64, time=0.3285440036228725ms
block_size_x=128, time=0.1769965716770717ms
block_size_x=256, time=0.19806171102183207ms
block_size_x=512, time=0.1955839991569519ms
best performing configuration: block_size_x=128, time=0.1769965716770717ms


## Step 2. Problem sizes and thread block dimensions

You may have noticed that in the vector_add example we never specified the number of threads per block or the total number of thread blocks that Kernel Tuner should use to tune our kernel. The tuner infers both automatically from the problem_size, which we did specify, and from the tunable parameter with the special name "block_size_x", which Kernel Tuner interprets as the thread block x-dimension.
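As a rough sketch of this inference for the one-dimensional case: the number of thread blocks is the problem size divided by the thread block size, rounded up. The helper name `num_thread_blocks` is hypothetical and not part of Kernel Tuner; it only illustrates the arithmetic for the values used in the vector_add example.

```python
def num_thread_blocks(problem_size, block_size):
    """Hypothetical helper: thread blocks needed to cover problem_size items."""
    # Ceiling division on integers, so a partly filled last block is counted
    return (problem_size + block_size - 1) // block_size

size = 1000000
grid_sizes = {bs: num_thread_blocks(size, bs) for bs in [32, 64, 128, 256, 512]}

for bs, blocks in grid_sizes.items():
    print(f"block_size_x={bs}: {blocks} thread blocks")
```

Because the last thread block may be only partly filled, the kernel needs its `if (i<n)` bounds check.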

This is a good point to answer some questions that may arise. 

* What happens if we have multiple dimensions?

The single-dimension behavior translates directly to multiple dimensions. You can pass a tuple as the problem size to specify the number of items in each dimension, and use block_size_x, block_size_y, and block_size_z to denote the thread block dimensions in up to three dimensions.

* How do I specify the number of thread blocks for kernels that don't follow this pattern?

In Kernel Tuner we try to provide good defaults for the most common use cases, but also offer the flexibility to override those defaults and specify what you need. In cases where the problem_size is divided by more than just the thread block dimensions, for example when you tune the number of elements processed by each thread, you can tell Kernel Tuner about it by specifying grid divisor lists. Each list contains the names of the tunable parameters that together divide the problem_size in that dimension to calculate the number of thread blocks. Note that the result of this division is rounded up to the nearest integer.

If you prefer to specify the number of thread blocks directly rather than through a division of the problem size, for instance because you are using a grid-strided loop (see the Reduction example), set problem_size to the desired number of thread blocks and pass empty lists as the grid divisor lists. This may seem a bit cumbersome, but remember that the user interface is optimized for the most common use cases. Also note that problem_size does not need to be a tuple of integers: you may also use strings that name tunable parameters, in case you'd like to tune the number of thread blocks that execute the kernel.
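The grid computation described in this answer can be sketched in plain Python. This only illustrates the arithmetic, assuming each dimension is divided by the product of the parameters in its divisor list and rounded up; it is not Kernel Tuner's actual implementation. The function name `grid_dimensions` is hypothetical, and `work_per_thread_x` is an example name for a parameter that sets the number of elements processed by each thread.

```python
import math

def grid_dimensions(problem_size, params, grid_divisors):
    """Hypothetical sketch: divide each dimension of problem_size by the
    product of the tunable parameters named in its divisor list, rounding up."""
    grid = []
    for size, divisor_names in zip(problem_size, grid_divisors):
        # math.prod of an empty sequence is 1, so an empty list leaves size unchanged
        divisor = math.prod(params[name] for name in divisor_names)
        grid.append(-(-size // divisor))  # ceiling division on integers
    return tuple(grid)

params = {"block_size_x": 128, "block_size_y": 2, "work_per_thread_x": 2}

# Each dimension divided by its thread block size only
default_grid = grid_dimensions((10000, 10000), params,
                               [["block_size_x"], ["block_size_y"]])

# Each thread handles several elements in x, so fewer thread blocks are needed
divided_grid = grid_dimensions((10000, 10000), params,
                               [["block_size_x", "work_per_thread_x"], ["block_size_y"]])

# Empty divisor lists: problem_size is used directly as the number of thread blocks
direct_grid = grid_dimensions((80, 1), params, [[], []])

print(default_grid, divided_grid, direct_grid)
```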

* Can I use a name different from "block_size_x"?

Yes. Use tune_kernel's optional argument "block_size_names" to specify the names of the tunable parameters that you use for the thread block dimensions.



Let's do a little exercise with what we've just learned about multiple dimensions and grid divisor lists. First, we should define a kernel that we want to tune:

In [17]:
%%writefile kernel2d.cu

__global__ void vector_add(float *c, float *a, float *b, int n, int m) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    int j = blockIdx.y * block_size_y + threadIdx.y;
    if ((i<n) && (j<m)) {
        c[j*n+i] = a[j*n+i] + b[j*n+i];
    }
}

Overwriting kernel2d.cu


In [19]:
# We again start with preparing the input and output data of our kernel
n = np.int32(1e4)
m = np.int32(1e4)
a = np.random.randn(n*m).astype(np.float32)
b = np.random.randn(n*m).astype(np.float32)
c = np.zeros_like(b)

# And combine these into an argument list that fits our kernel
args = [c, a, b, n, m]

# Now we have to define the two tunable parameters in our code and what values they may take
# Let's just pick a few sensible options
tune_params = dict()
tune_params["block_size_x"] = [32, 64, 128, 256, 512]
tune_params["block_size_y"] = [1, 2, 4, 8]

# Now we are ready to call tune kernel again, but this time for our 2-dimensional problem
res, _ = kernel_tuner.tune_kernel("vector_add", "kernel2d.cu", (n, m), args, tune_params)

Using: Tesla V100-PCIE-32GB
block_size_x=32, block_size_y=1, time=5.526089123317173ms
block_size_x=32, block_size_y=2, time=2.7643519810267856ms
block_size_x=32, block_size_y=4, time=1.6470354114259993ms
block_size_x=32, block_size_y=8, time=1.6515565940311976ms
block_size_x=64, block_size_y=1, time=2.7718171732766286ms
block_size_x=64, block_size_y=2, time=1.6206948586872645ms
block_size_x=64, block_size_y=4, time=1.6136639969689506ms
block_size_x=64, block_size_y=8, time=1.6407268728528703ms
block_size_x=128, block_size_y=1, time=1.5689097132001604ms
block_size_x=128, block_size_y=2, time=1.5852845736912318ms
block_size_x=128, block_size_y=4, time=1.5961097138268607ms
block_size_x=128, block_size_y=8, time=1.6275657245091029ms
block_size_x=256, block_size_y=1, time=1.5665599959237235ms
block_size_x=256, block_size_y=2, time=1.5599817037582397ms
block_size_x=256, block_size_y=4, time=1.6133805683680944ms
skipping config 256_8 reason: too many threads per block
block_size_x=512, block_

You may notice that Kernel Tuner has skipped a number of configurations. What about 256x8? What about 512x4 and 512x8? Before benchmarking, Kernel Tuner queries the device for the maximum number of threads per block and, by default, silently skips configurations that cannot be executed. It does the same for configurations that cannot be compiled because they use too much shared memory, and for kernels that cannot be launched at runtime because they use too many registers. To be told about skipped configurations, add the option ``verbose=True`` to the call to tune_kernel above.
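To see which configurations fall outside the limit, the following sketch enumerates the search space and applies the check by hand. It assumes a limit of 1024 threads per block, which holds for the Tesla V100 used here. Kernel Tuner queries the actual limit from the device, so the hard-coded constant is for illustration only.

```python
from itertools import product

MAX_THREADS_PER_BLOCK = 1024  # assumed limit; Kernel Tuner queries this from the device

block_size_x = [32, 64, 128, 256, 512]
block_size_y = [1, 2, 4, 8]

valid, skipped = [], []
for bx, by in product(block_size_x, block_size_y):
    # A thread block of bx * by threads must fit within the device limit
    if bx * by <= MAX_THREADS_PER_BLOCK:
        valid.append((bx, by))
    else:
        skipped.append((bx, by))

print(f"{len(valid)} valid configurations, {len(skipped)} skipped: {skipped}")
```

Under this assumption, the skipped configurations are exactly 256x8, 512x4 and 512x8, each of which would need 2048 or more threads per block.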

Let's modify our kernel a bit further. We introduce another tunable parameter, which allows us to vary the amount of work performed by each thread in the x-dimension. Note that when threads are processing more than one element in a particular dimension, we either need to create fewer threads or use fewer thread blocks. We start with an implementation that will use fewer thread blocks when increasing work per thread.

In [20]:
%%writefile kernel2dw.cu

__global__ void vector_add(float *c, float *a, float *b, int n, int m) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    int j = blockIdx.y * block_size_y + threadIdx.y;
    for (int ti=0; ti<work_per_thread_x; ti++) {
        if (((i+ti*block_size_x)<n) && (j<m)) {
            c[j*n+(i+ti*block_size_x)] = a[j*n+(i+ti*block_size_x)] + b[j*n+(i+ti*block_size_x)];
        }
    }
}

Writing kernel2dw.cu


In [23]:
# Now we have to add our new tunable parameter work_per_thread_x to our dictionary of tunable parameters.
# We'll keep the number of possible values extremely low for this tutorial because the total number of
# possible configurations in the search space explodes really quickly.
tune_params["work_per_thread_x"] = [1, 2]

# If we were to call tune_kernel now in the same way as we did before it would use too many thread blocks
# for configurations that do more work per thread. Therefore, we have to tell Kernel Tuner that we now have
# another parameter (in addition to the block_size_x) that divides the number of thread blocks in the x dimension.
grid_div_x = ["block_size_x", "work_per_thread_x"]

# Now we are ready to call tune kernel again, but this time using grid_div_x
res, _ = kernel_tuner.tune_kernel("vector_add", "kernel2dw.cu", (n, m), args, tune_params, grid_div_x=grid_div_x)

Using: Tesla V100-PCIE-32GB
block_size_x=32, block_size_y=1, work_per_thread_x=1, time=5.512470858437674ms
block_size_x=32, block_size_y=1, work_per_thread_x=2, time=2.770939452307565ms
block_size_x=32, block_size_y=2, work_per_thread_x=1, time=2.7705051558358327ms
block_size_x=32, block_size_y=2, work_per_thread_x=2, time=1.4131199972970145ms
block_size_x=32, block_size_y=4, work_per_thread_x=1, time=1.6228891440800257ms
block_size_x=32, block_size_y=4, work_per_thread_x=2, time=0.9341714382171631ms
block_size_x=32, block_size_y=8, work_per_thread_x=1, time=1.6370788642338343ms
block_size_x=32, block_size_y=8, work_per_thread_x=2, time=0.9449965698378426ms
block_size_x=64, block_size_y=1, work_per_thread_x=1, time=2.7841006006513322ms
block_size_x=64, block_size_y=1, work_per_thread_x=2, time=1.4267154250826155ms
block_size_x=64, block_size_y=2, work_per_thread_x=1, time=1.5895268746784754ms
block_size_x=64, block_size_y=2, work_per_thread_x=2, time=0.8997851354735238ms
block_size_x=6

Now you might be wondering if we can output something more meaningful than the kernel run time. GPU programmers typically use two different metrics for the performance of their kernels. For bandwidth-limited kernels we focus on the achieved throughput in GB/s and for compute-bound kernels we focus on compute performance in GFLOP/s (giga floating-point operations per second). We can calculate these ourselves based on the output returned by tune_kernel, but we can also tell Kernel Tuner how to compute any user-defined metrics. 
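As a sketch of the do-it-yourself route, the snippet below computes both metrics from a list of result dictionaries. It assumes each entry holds the tunable parameters plus a "time" key in milliseconds, as in the printed output. The two entries are copied by hand from the output of the 2D example above rather than produced by a live call to tune_kernel, and the helper names `gflops` and `gbytes_per_second` are hypothetical.

```python
n, m = 10000, 10000

# Hand-copied from the 2D tuning output above; times are in milliseconds
results = [
    {"block_size_x": 32, "block_size_y": 1, "time": 5.526089123317173},
    {"block_size_x": 256, "block_size_y": 2, "time": 1.5599817037582397},
]

def gflops(result):
    # One floating-point addition per element, time converted from ms to s
    return (n * m / 1e9) / (result["time"] / 1000)

def gbytes_per_second(result):
    # Two float32 arrays are read and one is written: 3 * 4 bytes per element
    return (3 * 4 * n * m / 1e9) / (result["time"] / 1000)

best = min(results, key=lambda r: r["time"])
for r in results:
    print(f'{r["block_size_x"]}x{r["block_size_y"]}: '
          f'{gflops(r):.1f} GFLOP/s, {gbytes_per_second(r):.1f} GB/s')
```

For this kernel the two numbers differ only by a constant factor of 12 bytes per floating-point operation, which is why the bandwidth figure is the more informative of the two.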

In [25]:
# Because metrics are composable (meaning that we can define a metric and then use it in the definition of another),
# we have to specify the order in which we define metrics, which we do by using an OrderedDict.
from collections import OrderedDict
metrics = OrderedDict()

# We can specify for example how Kernel Tuner should calculate the GFLOP/s metric of our kernel
# by passing a function that calculates the total number of floating-point operations and divides
# it by 1*10^9 (or 1e9 for short). Time in Kernel Tuner is expressed in milliseconds by default,
# and therefore we divide the measured time by a thousand to convert from milliseconds
# to seconds.
metrics["GFLOP/s"] = lambda p : (n*m/1e9) / (p["time"]/1000)

# The function defined using a lambda is assumed to receive a dictionary with the benchmark results and
# the tunable parameters used in this specific configuration, similar to the information Kernel Tuner
# prints to the screen or returns in the results dictionary. Therefore, we can access the run time
# of this configuration using the "time" key in the dictionary.

# However, for our bandwidth-limited 2D vector add kernel the throughput in GB/s will be more relevant.

