In tut-kernelexecpols-label and tut-launchexecpols-label, we presented a simple array initialization kernel using the RAJA::kernel and RAJA::launch interfaces, respectively, and compared the two. This section describes the implementation of a matrix transpose kernel using both the RAJA::kernel and RAJA::launch interfaces. The intent is to compare and contrast the two, as well as to introduce additional features of the interfaces.
There are exercise files RAJA/exercises/kernel-matrix-transpose.cpp and RAJA/exercises/launch-matrix-transpose.cpp for you to work through if you wish to get some practice with RAJA. The files RAJA/exercises/kernel-matrix-transpose_solution.cpp and RAJA/exercises/launch-matrix-transpose_solution.cpp contain complete working code for the examples. You can use the solution files to check your work and for guidance if you get stuck. To build the exercises, run make kernel-matrix-transpose (or make launch-matrix-transpose) and make kernel-matrix-transpose_solution (or make launch-matrix-transpose_solution) from the build directory.
Key RAJA features shown in this example are:

- RAJA::kernel method and kernel execution policies
- RAJA::launch method and kernel execution interface
In the example, we compute the transpose of an input matrix A of size Nr × Nc and store the result in a second matrix At of size Nc × Nr.
First, we define our matrix dimensions
../../../../exercises/kernel-matrix-transpose_solution.cpp
and wrap the data pointers for the matrices in RAJA::View objects to simplify the multi-dimensional indexing:
../../../../exercises/kernel-matrix-transpose_solution.cpp
Then, a C-style for-loop implementation looks like this:
../../../../exercises/kernel-matrix-transpose_solution.cpp
For RAJA::kernel variants, we use RAJA::statement::For and RAJA::statement::Lambda statement types in the execution policies. The complete sequential RAJA::kernel variant is:
../../../../exercises/kernel-matrix-transpose_solution.cpp
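A hedged sketch of what that variant looks like is below. The policy alias, view names, and loop-index types are plausible reconstructions rather than copies of the solution file, and the code assumes a RAJA installation:

```cpp
#include "RAJA/RAJA.hpp"

// Sequential RAJA::kernel matrix transpose sketch.
// Tuple entry 0 is the column range, entry 1 the row range;
// the outer For iterates over tuple index 1 (rows).
using KERNEL_EXEC_POL = RAJA::KernelPolicy<
  RAJA::statement::For<1, RAJA::seq_exec,      // rows
    RAJA::statement::For<0, RAJA::seq_exec,    // columns
      RAJA::statement::Lambda<0>
    >
  >
>;

void kernel_transpose(RAJA::View<int, RAJA::Layout<2>> Aview,
                      RAJA::View<int, RAJA::Layout<2>> Atview,
                      int N_r, int N_c)
{
  RAJA::kernel<KERNEL_EXEC_POL>(
    RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, N_c),
                     RAJA::TypedRangeSegment<int>(0, N_r)),
    [=](int col, int row) {
      Atview(col, row) = Aview(row, col);
    });
}
```

Note that the lambda arguments appear in tuple order (column first, row second), while the nesting order of the For statements controls the loop ordering.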
A CUDA RAJA::kernel variant for the GPU is similar, with different policies in the RAJA::statement::For statements:
../../../../exercises/kernel-matrix-transpose_solution.cpp
A notable difference between the CPU and GPU execution policies is the insertion of the RAJA::statement::CudaKernel type in the GPU version, which indicates that the execution will launch a CUDA device kernel.
In the CUDA RAJA::kernel variant above, the thread-block size and the number of blocks to launch are determined by the implementation of the RAJA::kernel execution policy constructs, using the sizes of the RAJA::TypedRangeSegment objects in the iteration space tuple.
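One plausible shape for such a CUDA policy is sketched below. The exact thread-mapping policies in the solution file may differ, and this assumes a CUDA-enabled RAJA build:

```cpp
#include "RAJA/RAJA.hpp"

// Sketch of a CUDA RAJA::kernel execution policy. CudaKernel
// launches a device kernel; the For statements map the two
// ranges onto CUDA thread dimensions, and the launch dimensions
// are derived from the range segment lengths.
using KERNEL_EXEC_POL_CUDA = RAJA::KernelPolicy<
  RAJA::statement::CudaKernel<
    RAJA::statement::For<1, RAJA::cuda_thread_y_loop,   // rows
      RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // columns
        RAJA::statement::Lambda<0>
      >
    >
  >
>;

// Usage mirrors the sequential variant, but the lambda must be
// decorated for device execution:
//
//   RAJA::kernel<KERNEL_EXEC_POL_CUDA>(
//     RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, N_c),
//                      RAJA::TypedRangeSegment<int>(0, N_r)),
//     [=] RAJA_DEVICE (int col, int row) {
//       Atview(col, row) = Aview(row, col);
//     });
```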
For RAJA::launch variants, we use RAJA::loop methods to write a loop hierarchy within the kernel execution space. For a sequential implementation, we pass the RAJA::seq_launch_t template parameter to the launch method and the RAJA::seq_exec parameter to the loop methods. The complete sequential RAJA::launch variant is:
../../../../exercises/launch-matrix-transpose_solution.cpp
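A hedged sketch of that structure follows; the policy aliases and function name are illustrative reconstructions, not copies of the solution file, and RAJA is assumed to be installed:

```cpp
#include "RAJA/RAJA.hpp"

using launch_policy = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
using loop_policy   = RAJA::LoopPolicy<RAJA::seq_exec>;

void launch_transpose(RAJA::View<int, RAJA::Layout<2>> Aview,
                      RAJA::View<int, RAJA::Layout<2>> Atview,
                      int N_r, int N_c)
{
  // LaunchParams takes no arguments for a sequential host launch.
  RAJA::launch<launch_policy>(RAJA::LaunchParams(),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
      RAJA::loop<loop_policy>(ctx, RAJA::TypedRangeSegment<int>(0, N_r),
        [&](int row) {
          RAJA::loop<loop_policy>(ctx, RAJA::TypedRangeSegment<int>(0, N_c),
            [&](int col) {
              Atview(col, row) = Aview(row, col);
            });
        });
    });
}
```

Unlike RAJA::kernel, the loop nesting here is written explicitly in the lambda body rather than encoded in a statement list.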
A CUDA RAJA::launch variant for the GPU is similar, with CUDA policies in the RAJA::loop methods. The complete RAJA::launch variant is:
../../../../exercises/launch-matrix-transpose_solution.cpp
A notable difference between the CPU and GPU RAJA::launch implementations is the definition of the compute grid. For the CPU version, the argument list of the RAJA::LaunchParams constructor is empty. For the CUDA GPU implementation, we define a 'Team' of one two-dimensional thread-block with 16 x 16 = 256 threads.
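A sketch of that CUDA launch configuration is below. The policy aliases and thread-mapping choices are plausible assumptions rather than copies of the solution file, and a CUDA-enabled RAJA build is assumed:

```cpp
#include "RAJA/RAJA.hpp"

using cuda_launch    = RAJA::LaunchPolicy<RAJA::cuda_launch_t<false>>;
using cuda_threads_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_loop>;
using cuda_threads_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_loop>;

void launch_transpose_cuda(RAJA::View<int, RAJA::Layout<2>> Aview,
                           RAJA::View<int, RAJA::Layout<2>> Atview,
                           int N_r, int N_c)
{
  // One 'Team' (thread-block) of 16 x 16 = 256 threads.
  RAJA::launch<cuda_launch>(
    RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(16, 16)),
    [=] RAJA_DEVICE (RAJA::LaunchContext ctx) {
      RAJA::loop<cuda_threads_y>(ctx, RAJA::TypedRangeSegment<int>(0, N_r),
        [&](int row) {
          RAJA::loop<cuda_threads_x>(ctx, RAJA::TypedRangeSegment<int>(0, N_c),
            [&](int col) {
              Atview(col, row) = Aview(row, col);
            });
        });
    });
}
```

The thread-loop policies stride the row and column ranges over the y and x thread dimensions of the single block, so ranges larger than 16 are still covered.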