# Matrix multiplication

Matrix multiplication is one of the best-known linear algebra algorithms and is frequently used to demonstrate the high-performance computing capabilities of GPUs, so an example using matrix multiplication could not be left out. A naive CUDA kernel for multiplying two square matrices is:

In [None]:
# %load matmul_naive.cu
#define WIDTH 4096

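// block_size_x and block_size_y are not defined in this file; Kernel Tuner
// inserts them as compile-time constants for each configuration it benchmarks.
__global__ void matmul_kernel(float *C, float *A, float *B) {
    // Column (x) and row (y) of the output element computed by this thread.
    int x = blockIdx.x * block_size_x + threadIdx.x;
    int y = blockIdx.y * block_size_y + threadIdx.y;
    float sum = 0.0f;

    // Dot product of row y of A with column x of B.
    for (int k=0; k<WIDTH; k++) {
        sum += A[y*WIDTH+k] * B[k*WIDTH+x];
    }

    C[y*WIDTH+x] = sum;
}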


This kernel uses one thread per output element. Each thread computes the index of the element of C it is responsible for, then iterates over the corresponding row of A and the corresponding column of B.
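The thread-to-element mapping can be checked on the CPU with a small emulation. The following is a minimal sketch in plain Python, not part of the original example: the tiny `WIDTH`, the block sizes, and the helper names `naive_thread` and `launch_naive` are illustrative assumptions, and the emulated threads run one after another rather than in parallel.

```python
# Pure-Python emulation of the naive kernel's thread-to-element mapping.
# WIDTH and the block sizes are deliberately tiny, illustrative values.
WIDTH = 4
BLOCK_SIZE_X = 2
BLOCK_SIZE_Y = 2


def naive_thread(C, A, B, block_idx, thread_idx):
    """Work done by one emulated thread: one dot product, one output element."""
    x = block_idx[0] * BLOCK_SIZE_X + thread_idx[0]  # column of C
    y = block_idx[1] * BLOCK_SIZE_Y + thread_idx[1]  # row of C
    total = 0.0
    for k in range(WIDTH):
        # Element k of row y of A times element k of column x of B.
        total += A[y * WIDTH + k] * B[k * WIDTH + x]
    C[y * WIDTH + x] = total


def launch_naive(A, B):
    """Run every emulated thread of every emulated block, sequentially."""
    C = [0.0] * (WIDTH * WIDTH)
    for by in range(WIDTH // BLOCK_SIZE_Y):
        for bx in range(WIDTH // BLOCK_SIZE_X):
            for ty in range(BLOCK_SIZE_Y):
                for tx in range(BLOCK_SIZE_X):
                    naive_thread(C, A, B, (bx, by), (tx, ty))
    return C


# Row-major input matrices with small integer values, so results are exact.
A = [float(i) for i in range(WIDTH * WIDTH)]
B = [float(i % 3) for i in range(WIDTH * WIDTH)]
C = launch_naive(A, B)
```

Every output element is written exactly once, and each emulated thread reads a full row of A and a full column of B on its own, which is the memory access pattern the rest of this example tries to improve.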

In [14]:
#!/usr/bin/env python
import numpy
import kernel_tuner
from collections import OrderedDict

problem_size = (4096, 4096)
size = numpy.prod(problem_size)

A = numpy.random.randn(*problem_size).astype(numpy.float32)
B = numpy.random.randn(*problem_size).astype(numpy.float32)
C = numpy.zeros_like(A)

args = [C, A, B]
tune_params = OrderedDict()
tune_params["block_size_x"] = [2**i for i in range(3,7)]
tune_params["block_size_y"] = [2**i for i in range(6)]

answer = [numpy.dot(A,B), None, None]

results = kernel_tuner.tune_kernel("matmul_kernel", "matmul_naive.cu",
                                   problem_size, args, tune_params, answer=answer, atol=1e-3)  

Using: GeForce GTX TITAN X
block_size_x=8, block_size_y=1, time=3349.53833008
block_size_x=8, block_size_y=2, time=1678.16967773
block_size_x=8, block_size_y=4, time=876.774719238
block_size_x=8, block_size_y=8, time=861.712475586
block_size_x=8, block_size_y=16, time=782.979882813
block_size_x=8, block_size_y=32, time=605.516772461
block_size_x=16, block_size_y=1, time=1706.25876465
block_size_x=16, block_size_y=2, time=855.339624023
block_size_x=16, block_size_y=4, time=763.935302734
block_size_x=16, block_size_y=8, time=706.841027832
block_size_x=16, block_size_y=16, time=585.361218262
block_size_x=16, block_size_y=32, time=501.997399902
block_size_x=32, block_size_y=1, time=985.078295898
block_size_x=32, block_size_y=2, time=858.53651123
block_size_x=32, block_size_y=4, time=817.795361328
block_size_x=32, block_size_y=8, time=628.078552246
block_size_x=32, block_size_y=16, time=555.53515625
block_size_x=32, block_size_y=32, time=526.978662109
block_size_x=64, block_size_y=1, time=9

There aren't many parameters to tune yet and, more importantly, tuning will not be very effective because this kernel is limited by memory bandwidth rather than compute.
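A rough back-of-the-envelope estimate illustrates why the kernel is bandwidth-bound. The sketch below counts the floating-point operations and the bytes requested from global memory by the naive kernel. It assumes that every read in the inner loop goes to global memory, so it ignores caching and memory coalescing, which change the real traffic; the variable names are illustrative.

```python
# Rough estimate of the naive kernel's arithmetic intensity.
# Assumption: every load in the inner loop is served from global memory.
WIDTH = 4096
BYTES_PER_FLOAT = 4

flops_per_iteration = 2                     # one multiply and one add
bytes_per_iteration = 2 * BYTES_PER_FLOAT   # one element of A and one of B

intensity = flops_per_iteration / bytes_per_iteration  # flop per byte

# WIDTH * WIDTH threads, each running WIDTH inner-loop iterations.
total_flops = flops_per_iteration * WIDTH**3
total_bytes_requested = bytes_per_iteration * WIDTH**3

print(f"arithmetic intensity: {intensity} flop/byte")
print(f"bytes requested: {total_bytes_requested / 2**30:.0f} GiB")
```

Under these assumptions the kernel performs only a quarter of a floating-point operation per byte requested, far too little to keep the arithmetic units busy regardless of the thread block dimensions.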

The utilisation of the GPU is very low, even for the optimal combination of block_size_x and block_size_y:

![](Matmul-naive-utilisation.png)

There is, however, a lot of opportunity for data reuse, which can be exploited by making the threads in a thread block collaborate.

# Increase data reuse

Data reuse can be increased with a technique called loop blocking or loop tiling. We define two square arrays in shared memory, which store square tiles of matrices A and B. The threads in a thread block collaboratively fill these two arrays and then perform all the computations that need this data before moving on to the next tile.
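The effect of tiling on global memory traffic can be sketched with a small CPU emulation. The code below is an illustration in plain Python, not the CUDA kernel itself: `WIDTH`, `TILE` (standing in for equal block_size_x and block_size_y), and the helpers `tiled_block` and `tiled_matmul` are assumptions made for this example, and the two barriers are modelled simply by finishing the load phase before starting the compute phase.

```python
# Pure-Python emulation of loop tiling with "shared" tiles.
WIDTH = 8
TILE = 4  # plays the role of block_size_x == block_size_y


def tiled_block(C, A, B, bx, by):
    """Emulate one thread block; return the number of global loads it made."""
    sums = [[0.0] * TILE for _ in range(TILE)]  # one accumulator per thread
    loads = 0
    for k in range(0, WIDTH, TILE):
        # Phase 1: each thread copies one element of A and one of B.
        sA = [[A[(by * TILE + ty) * WIDTH + k + tx] for tx in range(TILE)]
              for ty in range(TILE)]
        sB = [[B[(k + ty) * WIDTH + bx * TILE + tx] for tx in range(TILE)]
              for ty in range(TILE)]
        loads += 2 * TILE * TILE
        # Phase 2: each thread reuses the tiles instead of global memory.
        for ty in range(TILE):
            for tx in range(TILE):
                for kb in range(TILE):
                    sums[ty][tx] += sA[ty][kb] * sB[kb][tx]
    for ty in range(TILE):
        for tx in range(TILE):
            C[(by * TILE + ty) * WIDTH + bx * TILE + tx] = sums[ty][tx]
    return loads


def tiled_matmul(A, B):
    """Run all emulated blocks; return the result and the total global loads."""
    C = [0.0] * (WIDTH * WIDTH)
    loads = 0
    for by in range(WIDTH // TILE):
        for bx in range(WIDTH // TILE):
            loads += tiled_block(C, A, B, bx, by)
    return C, loads


A = [float(i % 5) for i in range(WIDTH * WIDTH)]
B = [float(i % 7) for i in range(WIDTH * WIDTH)]
C, tiled_loads = tiled_matmul(A, B)

# A kernel without tiling loads two values per inner-loop iteration per thread.
naive_loads = 2 * WIDTH**3
print(naive_loads // tiled_loads)  # reduction factor in global loads
```

In this model the number of global loads shrinks by a factor equal to the tile width, because every value brought into a tile is used by a whole row or column of threads in the block.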

In [None]:
# %load matmul_data_reuse.cu
#define WIDTH 4096

__global__ void matmul_kernel(float *C, float *A, float *B) {

    __shared__ float sA[block_size_y][block_size_x];
    __shared__ float sB[block_size_y][block_size_x];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = blockIdx.x * block_size_x + tx;
    int y = blockIdx.y * block_size_y + ty;

    float sum = 0.0;
    int k,kb;

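    // Walk over the k dimension one tile at a time. The tiles are indexed as
    // squares, which is why tuning restricts block_size_x == block_size_y.
    for (k=0; k<WIDTH; k+=block_size_x) {
        // Wait until all threads are done with the previous tiles.
        __syncthreads();
        sA[ty][tx] = A[y*WIDTH+k+tx];
        sB[ty][tx] = B[(k+ty)*WIDTH+x];
        // Wait until both tiles are completely filled.
        __syncthreads();

        for (kb=0; kb<block_size_x; kb++) {
            sum += sA[ty][kb] * sB[kb][tx];
        }

    }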

    C[y*WIDTH+x] = sum;
}



In [17]:
restrict = ["block_size_x==block_size_y"] 

results = kernel_tuner.tune_kernel("matmul_kernel", "matmul_data_reuse.cu",
                                   problem_size, args, tune_params, restrictions=restrict, answer=answer, atol=1e-3)  

Using: GeForce GTX TITAN X
block_size_x=8, block_size_y=8, time=247.656115723
block_size_x=16, block_size_y=16, time=184.494805908
block_size_x=32, block_size_y=32, time=183.317108154
best performing configuration: block_size_x=32, block_size_y=32, time=183.317108154
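Only three configurations were benchmarked because the restriction keeps just the square thread blocks, which the kernel needs since its shared-memory tiles are indexed as squares. The sketch below reproduces that filtering in plain Python to show its effect on the search space. It is not how Kernel Tuner evaluates restrictions internally, and the names `all_configs` and `square_configs` are illustrative.

```python
from itertools import product

# Same tunable parameters as in the tuning script above.
tune_params = {
    "block_size_x": [2**i for i in range(3, 7)],  # 8, 16, 32, 64
    "block_size_y": [2**i for i in range(6)],     # 1, 2, 4, 8, 16, 32
}

# Full Cartesian product of all parameter values.
all_configs = [
    dict(zip(tune_params, values)) for values in product(*tune_params.values())
]

# Keep only the configurations that satisfy block_size_x == block_size_y.
square_configs = [
    config for config in all_configs
    if config["block_size_x"] == config["block_size_y"]
]

print(len(all_configs), "configurations before the restriction")
for config in square_configs:
    print(config)
```

The three remaining configurations are the ones listed in the output above. A block size of 64 is dropped because block_size_y never takes that value.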


The matrix multiplication is now about three times faster, thanks to much more efficient memory use:
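The speedup can be computed directly from the reported timings. The sketch below uses the fastest naive configuration for each block_size_x that is visible in the tuning output above. That output is truncated, so results for block_size_x=64 are not included, and the comparison is only as complete as the listed data.

```python
# Timings copied from the tuning output above, keyed by
# (block_size_x, block_size_y). The naive output is truncated, so only the
# fastest visible configuration per block_size_x is listed.
naive_times = {
    (8, 32): 605.516772461,
    (16, 32): 501.997399902,
    (32, 32): 526.978662109,
}
tiled_best_time = 183.317108154  # block_size_x=32, block_size_y=32

best_naive_config = min(naive_times, key=naive_times.get)
best_naive_time = naive_times[best_naive_config]

speedup = best_naive_time / tiled_best_time
print(f"best visible naive configuration: {best_naive_config}")
print(f"speedup from data reuse: {speedup:.2f}x")
```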

![](Matmul-utilisation-with-data-reuse.png)

The compute intensity has dropped slightly because of the __syncthreads() operations.