In tut-kernelexecpols-label and tut-launchexecpols-label, we presented a simple array initialization kernel using the RAJA::kernel and RAJA::launch interfaces, respectively, and compared the two. This section describes the implementation of a matrix transpose kernel using both the RAJA::kernel and RAJA::launch interfaces. The intent is to compare and contrast the two, as well as to introduce additional features of the interfaces.
There are exercise files RAJA/exercises/kernel-matrix-transpose.cpp and RAJA/exercises/launch-matrix-transpose.cpp for you to work through if you wish to get some practice with RAJA. The files RAJA/exercises/kernel-matrix-transpose_solution.cpp and RAJA/exercises/launch-matrix-transpose_solution.cpp contain complete working code for the examples. You can use the solution files to check your work and for guidance if you get stuck. To build the exercises, run make kernel-matrix-transpose (or make launch-matrix-transpose) and make kernel-matrix-transpose_solution (or make launch-matrix-transpose_solution) from the build directory.
Key RAJA features shown in this example are:

- RAJA::kernel method and kernel execution policies
- RAJA::launch method and kernel execution interface
In the example, we compute the transpose of an input matrix A of size Nr × Nc and store the result in a second matrix At of size Nc × Nr.
First, we define our matrix dimensions
../../../../exercises/kernel-matrix-transpose_solution.cpp
and wrap the data pointers for the matrices in RAJA::View objects to simplify the multi-dimensional indexing:
../../../../exercises/kernel-matrix-transpose_solution.cpp
Then, a C-style for-loop implementation looks like this:
../../../../exercises/kernel-matrix-transpose_solution.cpp
For RAJA::kernel variants, we use RAJA::statement::For and RAJA::statement::Lambda statement types in the execution policies. The complete sequential RAJA::kernel variant is:
../../../../exercises/kernel-matrix-transpose_solution.cpp
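A hedged sketch of what that variant looks like is below. The policy alias, view names, and loop-index types are plausible reconstructions rather than copies of the solution file, and the code assumes a RAJA installation:

```cpp
#include "RAJA/RAJA.hpp"

// Sequential RAJA::kernel matrix transpose sketch.
// Tuple entry 0 is the column range, entry 1 the row range;
// the outer For iterates over tuple index 1 (rows).
using KERNEL_EXEC_POL = RAJA::KernelPolicy<
  RAJA::statement::For<1, RAJA::seq_exec,      // rows
    RAJA::statement::For<0, RAJA::seq_exec,    // columns
      RAJA::statement::Lambda<0>
    >
  >
>;

void kernel_transpose(RAJA::View<int, RAJA::Layout<2>> Aview,
                      RAJA::View<int, RAJA::Layout<2>> Atview,
                      int N_r, int N_c)
{
  RAJA::kernel<KERNEL_EXEC_POL>(
    RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, N_c),
                     RAJA::TypedRangeSegment<int>(0, N_r)),
    [=](int col, int row) {
      Atview(col, row) = Aview(row, col);
    });
}
```

Note that the lambda arguments appear in tuple order (column first, row second), while the nesting order of the For statements controls the loop ordering.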
A CUDA RAJA::kernel variant for the GPU is similar, with different policies in the RAJA::statement::For statements:
../../../../exercises/kernel-matrix-transpose_solution.cpp
A notable difference between the CPU and GPU execution policies is the insertion of the RAJA::statement::CudaKernel type in the GPU version, which indicates that the execution will launch a CUDA device kernel.
In the CUDA RAJA::kernel variant above, the thread-block size and the number of blocks to launch are determined by the implementation of the RAJA::kernel execution policy constructs, using the sizes of the RAJA::TypedRangeSegment objects in the iteration space tuple.
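One plausible shape for such a CUDA policy is sketched below. The exact thread-mapping policies in the solution file may differ, and this assumes a CUDA-enabled RAJA build:

```cpp
#include "RAJA/RAJA.hpp"

// Sketch of a CUDA RAJA::kernel execution policy. CudaKernel
// launches a device kernel; the For statements map the two
// ranges onto CUDA thread dimensions, and the launch dimensions
// are derived from the range segment lengths.
using KERNEL_EXEC_POL_CUDA = RAJA::KernelPolicy<
  RAJA::statement::CudaKernel<
    RAJA::statement::For<1, RAJA::cuda_thread_y_loop,   // rows
      RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // columns
        RAJA::statement::Lambda<0>
      >
    >
  >
>;

// Usage mirrors the sequential variant, but the lambda must be
// decorated for device execution:
//
//   RAJA::kernel<KERNEL_EXEC_POL_CUDA>(
//     RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, N_c),
//                      RAJA::TypedRangeSegment<int>(0, N_r)),
//     [=] RAJA_DEVICE (int col, int row) {
//       Atview(col, row) = Aview(row, col);
//     });
```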
For RAJA::launch variants, we use RAJA::loop methods to write a loop hierarchy within the kernel execution space. For a sequential implementation, we pass the RAJA::seq_launch_t template parameter to the launch method and the RAJA::seq_exec parameter to the loop methods. The complete sequential RAJA::launch variant is:
../../../../exercises/launch-matrix-transpose_solution.cpp
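A hedged sketch of that structure follows; the policy aliases and function name are illustrative reconstructions, not copies of the solution file, and RAJA is assumed to be installed:

```cpp
#include "RAJA/RAJA.hpp"

using launch_policy = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
using loop_policy   = RAJA::LoopPolicy<RAJA::seq_exec>;

void launch_transpose(RAJA::View<int, RAJA::Layout<2>> Aview,
                      RAJA::View<int, RAJA::Layout<2>> Atview,
                      int N_r, int N_c)
{
  // LaunchParams takes no arguments for a sequential host launch.
  RAJA::launch<launch_policy>(RAJA::LaunchParams(),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
      RAJA::loop<loop_policy>(ctx, RAJA::TypedRangeSegment<int>(0, N_r),
        [&](int row) {
          RAJA::loop<loop_policy>(ctx, RAJA::TypedRangeSegment<int>(0, N_c),
            [&](int col) {
              Atview(col, row) = Aview(row, col);
            });
        });
    });
}
```

Unlike RAJA::kernel, the loop nesting here is written explicitly in the lambda body rather than encoded in a statement list.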
A CUDA RAJA::launch variant for the GPU is similar, with CUDA policies in the RAJA::loop methods. The complete RAJA::launch variant is:
../../../../exercises/launch-matrix-transpose_solution.cpp
A notable difference between the CPU and GPU RAJA::launch implementations is the definition of the compute grid. For the CPU version, the argument list of the RAJA::LaunchParams constructor is empty. For the CUDA GPU implementation, we define a 'Team' of one two-dimensional thread-block with 16 x 16 = 256 threads.
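A sketch of that CUDA launch configuration is below. The policy aliases and thread-mapping choices are plausible assumptions rather than copies of the solution file, and a CUDA-enabled RAJA build is assumed:

```cpp
#include "RAJA/RAJA.hpp"

using cuda_launch    = RAJA::LaunchPolicy<RAJA::cuda_launch_t<false>>;
using cuda_threads_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_loop>;
using cuda_threads_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_loop>;

void launch_transpose_cuda(RAJA::View<int, RAJA::Layout<2>> Aview,
                           RAJA::View<int, RAJA::Layout<2>> Atview,
                           int N_r, int N_c)
{
  // One 'Team' (thread-block) of 16 x 16 = 256 threads.
  RAJA::launch<cuda_launch>(
    RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(16, 16)),
    [=] RAJA_DEVICE (RAJA::LaunchContext ctx) {
      RAJA::loop<cuda_threads_y>(ctx, RAJA::TypedRangeSegment<int>(0, N_r),
        [&](int row) {
          RAJA::loop<cuda_threads_x>(ctx, RAJA::TypedRangeSegment<int>(0, N_c),
            [&](int col) {
              Atview(col, row) = Aview(row, col);
            });
        });
    });
}
```

The thread-loop policies stride the row and column ranges over the y and x thread dimensions of the single block, so ranges larger than 16 are still covered.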