# GPU Optimization with Kernel Tuner

Welcome to the first hands-on of day 2 of the GPU optimization with Kernel Tuner tutorial.

In this hands-on exercise, we will learn about code optimizations and look at their impact on energy efficiency.

We again use Kernel Tuner to easily benchmark different implementations of our GPU kernels.

In [None]:
import numpy as np
import kernel_tuner as kt
from kernel_tuner.observers.nvml import NVMLObserver
from kernel_tuner.util import get_best_config

The main example of a code optimization that we'll look at in this hands-on exercise is a simple case where we can apply *Kernel Fusion*. Kernel Fusion is the process of merging or 'fusing' multiple kernels into a single kernel.

The operation that we would like to perform is the dot product of two vectors a and b of equal length. The computation consists of two steps: first, the elements of a and b are multiplied point-wise, then the resulting products are summed into a single value. These two steps can be implemented as separate kernels or, as we will see, fused into one.
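
As a quick CPU-side illustration of these two steps, here is a small NumPy sketch. It is not part of the original exercise, and the array names, seed, and tiny vector length are chosen only for this example. It computes the dot product once as a multiplication followed by a sum, and once in a single pass that never stores the intermediate products:

```python
import numpy as np

# Small illustrative vectors; the hands-on itself uses much larger ones.
rng = np.random.default_rng(0)
a = rng.standard_normal(1000).astype(np.float32)
b = rng.standard_normal(1000).astype(np.float32)

# Two-step approach: materialize the point-wise products, then sum them.
c = a * b                              # intermediate array, same length as a and b
two_step = c.sum(dtype=np.float64)

# Single-pass approach: accumulate the products without storing them.
single_pass = 0.0
for x, y in zip(a, b):
    single_pass += float(x) * float(y)

print(two_step, single_pass)
```

Both approaches should agree up to floating-point rounding. The difference that matters for the rest of this hands-on is that only the two-step version needs the intermediate array `c`.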

Our first CUDA kernel applies a point-wise multiplication of two vectors:

In [None]:
%%writefile vector_mul_kernel.cu

__global__ void vector_mul(float * c, float * a, float * b, int n) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n) {
        c[i] = a[i] * b[i];
    }
}

To compute a full dot product of two vectors, we also have to sum the result. So let's take a simple CUDA kernel to compute the sum of a large vector.
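
The kernel that follows combines the partial sums of the threads in a block using a tree reduction in shared memory. As a rough CPU-side sketch of that pattern (plain Python, with the hypothetical helper name `block_tree_reduce`, and assuming the number of values is a power of two, as is the case for the block sizes used later), the loop below halves the number of active "threads" in every step:

```python
def block_tree_reduce(values):
    """Sum a power-of-two-length list the way one thread block would."""
    cache = list(values)              # stands in for the shared-memory array
    s = len(cache) // 2
    while s > 0:
        # Every "thread" with index < s adds the element s positions away.
        for tid in range(s):
            cache[tid] += cache[tid + s]
        s //= 2                       # halve the number of active threads
    return cache[0]

# Hypothetical per-thread partial sums for a block of 8 threads.
partial_sums = [float(v) for v in range(8)]
print(block_tree_reduce(partial_sums))  # 28.0
```

On the GPU, the inner loop over `tid` runs in parallel, which is why the kernel needs a `__syncthreads()` after each step. With a length that is not a power of two, this halving scheme would skip some elements.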

In [None]:
%%writefile sum_kernel.cu

__global__ void sum(float *result, float *X, int n) {
    __shared__ float cache[block_size_x];
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    for (; i<n; i+= blockDim.x * gridDim.x) {
        temp += X[i];
    }
    cache[threadIdx.x] = temp;
    __syncthreads();
    for (int s=block_size_x/2; s>0; s/=2) {
        if (threadIdx.x < s) {
            cache[threadIdx.x] += cache[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        atomicAdd(result, cache[0]);
    }
}

Now, we use Kernel Tuner to measure the execution time and energy consumption of both kernels.

In [None]:
# the size of the vectors
size = 100_000_000

# all the kernel input and output data need to use numpy data types,
# note that we explicitly state that these arrays should consist of
# 32 bit floating-point values, to match our kernel source code
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(b)
r = np.zeros(1, dtype=np.float32)
n = np.int32(size)

# Setup the tunable parameters
tune_params = dict(block_size_x=[16, 32, 64, 128, 256, 512, 1024])

# Setup the NVMLObserver to measure energy
nvmlobserver = NVMLObserver(["nvml_energy"])
metrics = dict(energy=lambda p: p["nvml_energy"])

# Let's call tune_kernel to start the tuning process.
results_mul, env = kt.tune_kernel("vector_mul", "vector_mul_kernel.cu", size, [c, a, b, n], tune_params, lang="cupy",
                                  observers=[nvmlobserver], metrics=metrics)

# Call tune_kernel again, this time to tune the sum kernel.
results_sum, env = kt.tune_kernel("sum", "sum_kernel.cu", size, [r, c, n], tune_params, lang="cupy",
                                 observers=[nvmlobserver], metrics=metrics)

If we create a single kernel that applies the point-wise multiplication and the sum reduction in one go, we effectively implement a dot product kernel. This kernel can be seen as the fusion of the previous two kernels.

**Exercise**: Complete the missing line of code in the kernel below! Then, run the cell to store the code to a file.

In [None]:
%%writefile dot_kernel.cu

__global__ void dot(float *result, float *a, float *b, int n) {
    __shared__ float cache[block_size_x];
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    for (; i<n; i+= blockDim.x * gridDim.x) {
        temp += ...; //TODO: write code here
    }
    cache[threadIdx.x] = temp;
    __syncthreads();
    for (int s=block_size_x/2; s>0; s/=2) {
        if (threadIdx.x < s) {
            cache[threadIdx.x] += cache[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        atomicAdd(result, cache[0]);
    }
}

Now that we've fused the mul and sum kernels into a dot kernel, let's have a look at the impact on the time and energy of our computations:

In [None]:
# TODO: create the argument list
# hint: we want to compute the dot product of a and b and store the result in r
# hint2: look at the CUDA kernel code above to see the kernel arguments
dot_args = [ ]

if not dot_args:
    print("Error: First write the argument list dot_args!")
else:
    # Let's call tune_kernel to start the tuning process.
    results_dot, env = kt.tune_kernel("dot", "dot_kernel.cu", size, dot_args, tune_params, lang="cupy",
                                      observers=[nvmlobserver], metrics=metrics)

OK, now that we've collected our measurements, let's take a look at the energy needed to complete our dot product computation using two separate kernels versus a single fused kernel:

In [None]:
# The best kernel configurations in terms of energy for mul and sum were:
mul_best = get_best_config(results_mul, "energy")
print('Energy of the best mul kernel:', mul_best["energy"], "Joule")
sum_best = get_best_config(results_sum, "energy")
print('Energy of the best sum kernel:', sum_best["energy"], "Joule")

# The total amount of energy for the two separate kernels:
total = mul_best["energy"] + sum_best["energy"]
print('Energy used in total by using separate kernels:', total, "Joule")

# Energy used by the fused kernel:
dot_best = get_best_config(results_dot, "energy")
print('Energy of the best dot kernel:', dot_best["energy"], "Joule")

Can you think of why the approach with two separate kernels uses so much more energy than a single kernel?

Hint: Think about the total amount of data that is loaded from and stored to global memory by the two separate kernels, and the same for the fused kernel.
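
One way to follow up on the hint is to count the bytes that each approach moves through global memory. The sketch below is a back-of-the-envelope estimate rather than a measurement. It uses the hypothetical helper name `global_memory_traffic`, assumes 4-byte floats as in the kernels above, and ignores caching effects and the few atomic updates of the single result value:

```python
BYTES_PER_FLOAT = 4  # float32, as used by the kernels in this hands-on

def global_memory_traffic(n):
    """Return (separate, fused) global memory traffic in bytes for vectors of length n."""
    mul_bytes = 3 * n * BYTES_PER_FLOAT   # vector_mul: read a and b, write c
    sum_bytes = 1 * n * BYTES_PER_FLOAT   # sum: read c
    dot_bytes = 2 * n * BYTES_PER_FLOAT   # dot: read a and b only
    return mul_bytes + sum_bytes, dot_bytes

separate, fused = global_memory_traffic(100_000_000)
print(f"separate kernels: {separate / 1e9:.1f} GB, fused kernel: {fused / 1e9:.1f} GB")
# separate kernels: 1.6 GB, fused kernel: 0.8 GB
```

Under these assumptions, the fused kernel moves roughly half as much data, because the intermediate vector c is never written to or read back from global memory.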

**That's it! You've successfully completed the first hands-on of the day!**