In [1]:
import platform
import numpy as np
# from pycuda import autoinit # set up the PyCUDA runtime

print("Platform == " + platform.platform() + ' ' + platform.machine())
print("Python version == " + platform.python_version())
print("NumPy version == " + np.__version__)
# print("PyCUDA version == " + pycuda.__version__)

Platform == Windows-10-10.0.16299 AMD64
Python version == 2.7.14
NumPy version == 1.14.0


# Background

## Autotuning

Autotuning lets a program search for the best parameters of an algorithm and then generate an implementation specialized to those parameters. It is a widely used technique in numerical computation software such as [ATLAS](http://math-atlas.sourceforge.net/) and [FFTW](http://www.fftw.org/).

The use of autotuning for GPU kernels has become a hot research topic in recent years. The two major approaches are model-based tuning (which builds a computation model *a priori* from the algorithm and architecture) and empirical tuning (which measures performance while treating the kernel as a black box). The `autotuner` package takes the latter approach for the specific problem of dense matrix multiplication on the GPU.

## Matrix Multiplication

Dense matrix multiplication is one of the most studied GPU applications. Although highly tuned implementations can be found with relative ease, an autotuner for matrix multiplication still has its benefits: an implementation hand-tuned for one device may not run as well on another, while an autotuner can be ported easily to a new architecture and still deliver good performance. In addition, matrix multiplication kernels are relatively simple and serve as a good introduction to autotuner design.

## PyCUDA and GPU metaprogramming

[PyCUDA](https://mathema.tician.de/software/pycuda/) is a Python library for accessing Nvidia’s CUDA parallel computation API. One of the main strengths of PyCUDA is its ability to perform **just-in-time compilation (JIT)** of CUDA C source code. A programmer can create some kernels, compile and run them on the GPU, and then modify the existing kernels dynamically based on the results without leaving the Python interpreter. This process, when done in a program-controlled manner, enables **GPU metaprogramming**, which is crucial to the development of autotuners.

# Using `autotuner.py`

When invoked as a Python script on the command line, `autotuner.py` parses the arguments and initiates a profiling session to find the optimal parameters for the matmul kernel (this is the autotuning part). After profiling is done, the source code for the optimal kernel is written out as a `.cu` file which can later be used either with PyCUDA or just as input to `nvcc`.

In [2]:
# run autotuner.py -h

A sample session using `autotuner.py` looks like the following:

In [3]:
# run autotuner.py 2000 single -o kernel.cu

In [4]:
# with open('kernel.cu', 'r') as f:
#     print(f.read())

# A closer look at the internals

The base building block of each kernel is the template file `template.cu`:

In [5]:
with open('template.cu', 'r') as f:
    print(f.read())

__global__ void matmul({real} *M, {real} *N, {real} *P, int Width)
{{
    // Compute M * N and store result in P
    // M and N are Width * Width matrices
    __shared__ {real} Ms[{TW}][{TW}];
    __shared__ {real} Ns[{TW}][{TW}];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int Row = blockIdx.y * {TW} + ty;
    int Col = blockIdx.x * {TW} + tx;

    {real} Pvalue = {fzero};
    for (int ph = 0; ph < ceil(Width / ({real}){TW}); ++ph)
    {{
        // Cooperatively load tile into shared memory
        if (Row < Width && ph*{TW} + tx < Width)
        {{
            Ms[ty][tx] = M[Row*Width + ph*{TW} + tx];
        }}
        else
        {{
            Ms[ty][tx] = {fzero};
        }}
        if (Col < Width && ph*{TW} + ty < Width)
        {{
            Ns[ty][tx] = N[(ph*{TW} + ty)*Width + Col];
        }}
        else
        {{
            Ns[ty][tx] = {fzero};
        }}
        __syncthreads();

        {loop}
        __syncthreads();
    }}

    if (Row < Width && Col

The template is meant to be passed to `str.format` to produce a functioning CUDA C kernel (hence the doubled curly braces, which `str.format` renders as literal single braces). It is a fairly straightforward tiled matrix multiplication kernel, with the tile width `{TW}` as one of the parameters. The `{loop}` placeholder is filled with either the full inner loop or an unrolled version of it, as specified by the user.
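To make the brace escaping and the two forms of `{loop}` concrete, here is a minimal sketch. It uses a shortened, hypothetical template rather than the real `template.cu`, and a hypothetical helper `gen_loop`; how `MatMulKernel` actually builds the loop body is not shown in this notebook, so treat the generated text as an assumption.

```python
# Hypothetical, shortened template (not the real template.cu).
# {{ and }} become literal braces; {real}, {TW} and {loop} are substituted.
TEMPLATE = """__global__ void demo({real} *P)
{{
    __shared__ {real} Ms[{TW}][{TW}];
    __shared__ {real} Ns[{TW}][{TW}];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    {real} Pvalue = 0;
    {loop}
    P[ty * {TW} + tx] = Pvalue;
}}"""


def gen_loop(tile_width, unroll):
    """Return the inner-product loop as a for loop or as unrolled statements."""
    if not unroll:
        return ("for (int k = 0; k < {TW}; ++k)\n"
                "        Pvalue += Ms[ty][k] * Ns[k][tx];").format(TW=tile_width)
    # Unrolled form: one statement per value of k, no loop counter.
    lines = ["Pvalue += Ms[ty][{k}] * Ns[{k}][tx];".format(k=k)
             for k in range(tile_width)]
    return "\n    ".join(lines)


source = TEMPLATE.format(real="float", TW=4, loop=gen_loop(4, unroll=True))
print(source)
```

Because the unrolled body is generated in Python, the tile width becomes a compile-time constant in the emitted source, which is what allows a separate kernel to be produced for every parameter set.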

The textual conversion is handled by the class `MatMulKernel` defined in `matmul.py`, which also serves as the interface to the PyCUDA runtime. We can construct a kernel for double-precision matmul with a tile width of 16 and a non-unrolled loop as follows:

In [6]:
# from matmul import MatMulKernel
# k1 = MatMulKernel(dtype=np.float64, tile_width=16, loop_unroll=False)

Under the hood, the constructor calls the `gen_source` method to retrieve the template and perform the textual transformation. The generated source code is then compiled by `nvcc` via the `compile` method and loaded onto the GPU. Alternatively, compilation can be skipped explicitly by passing `compile=False`:

In [7]:
# k1_uncompiled = MatMulKernel(dtype=np.float64, tile_width=16, loop_unroll=False, compile=False)

The `matmul` method of `MatMulKernel` provides a convenient interface for computing a matrix product with the compiled kernel. It determines the block and grid sizes automatically and launches the CUDA kernel through PyCUDA's APIs. It also includes code to track execution time, which the profiler uses to compare the relative performance of different kernels.
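The launch-configuration logic inside `matmul` is not reproduced in this notebook. The sketch below shows one plausible way to derive it, assuming a square thread block of `tile_width` x `tile_width` threads and enough blocks to cover a `width` x `width` output matrix. The function name `launch_config` is hypothetical.

```python
def launch_config(width, tile_width):
    """Return (block, grid) tuples covering a width x width output matrix."""
    # One thread per output element within a tile.
    block = (tile_width, tile_width, 1)
    # Ceiling division: a partial tile at the edge still needs a whole block.
    blocks_per_side = (width + tile_width - 1) // tile_width
    grid = (blocks_per_side, blocks_per_side)
    return block, grid


for tw in (16, 32):
    block, grid = launch_config(2000, tw)
    print("tile width {}: block={}, grid={}".format(tw, block, grid))
```

When `width` is not a multiple of the tile width, some threads in the edge blocks fall outside the matrix; the `Row < Width` and `Col < Width` checks in the template are what keep those threads from reading or writing out of bounds.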

In [9]:
# n = 2000
# # Construct test matrices
# M = np.random.randn(n, n)
# N = np.random.randn(n, n)
# P = M.dot(N) # sequential result
# P_gpu = k1.matmul(M, N) # GPU result using our kernel
# # P_gpu, milisecs = k1.matmul(M, N, timed=True)
# err = np.max(np.abs((P - P_gpu) / P))
# print("Error between CPU and GPU result: {:e}".format(err))

Profiling is done by the `tune_kernels` function defined in `autotuner.py`. `tune_kernels` accepts a matrix width `n` and a data type `dtype` (either `numpy.float32` or `numpy.float64`) and finds the kernel that multiplies two `n` x `n` matrices of that `dtype` fastest on the current CUDA device. It does this by constructing different `MatMulKernel` instances, generating two random matrices, and timing each instance on those two matrices. Since GPU execution time may differ drastically between runs, a third parameter, `num_trials`, sets how many trials are averaged. Once the profiling data is gathered, the parameter set with the lowest execution time is returned.
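The body of `tune_kernels` is not shown in this notebook. As a rough, CPU-only sketch of the empirical approach it describes, the code below times a stand-in workload for every candidate parameter set, averages over `num_trials`, and returns the fastest. The names `tune`, `candidates` and `fake_kernel`, as well as the candidate tile widths, are hypothetical; the real tuner would time `MatMulKernel.matmul` on the GPU instead.

```python
import itertools
from timeit import default_timer


def tune(run, candidates, num_trials=3):
    """Return the parameter tuple with the lowest mean run time."""
    best_params, best_time = None, float("inf")
    for params in candidates:
        total = 0.0
        for _ in range(num_trials):
            start = default_timer()
            run(*params)  # the kernel is treated as a black box
            total += default_timer() - start
        mean = total / num_trials
        if mean < best_time:
            best_params, best_time = params, mean
    return best_params


def fake_kernel(tile_width, unroll):
    """Stand-in workload (assumption): less work for larger tile widths."""
    work = 200000 // tile_width
    return sum(i * i for i in range(work))


# Hypothetical search space: every combination of tile width and unroll flag.
candidates = list(itertools.product([8, 16, 32], [False, True]))
tw, unroll = tune(fake_kernel, candidates)
print("best tile width: {}, unroll: {}".format(tw, unroll))
```

Averaging over several trials matters because a single timing can be skewed by one-off costs, so the winner reported for a single trial may change from run to run.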

In [10]:
# from autotuner import tune_kernels
# n = 2000
# dtype = np.float64
# tw, unroll = tune_kernels(n, dtype) # invoke the autotuner

The optimal kernel can then be constructed from the returned parameter set for use in subsequent calculations (this is essentially what the script interface does):

In [11]:
# k_optimal = MatMulKernel(dtype, tw, unroll)
# # Do something with k_optimal