GPU Porting in Axom
===================

Axom uses the following two libraries as the main workhorses for GPU porting:

- RAJA: handles software abstractions that enable architecture and
  programming model portability for HPC applications.
- Umpire: a resource management library that allows the discovery, provision,
  and management of memory on machines with multiple memory devices such as
  NUMA nodes and GPUs.

From RAJA and Umpire, Axom derives a set of convenience macros and function wrappers in the axom namespace that encapsulate commonly used RAJA and Umpire functions, along with preset execution spaces for host and device execution.

For the user's guide on using GPU utilities, see also :ref:`Core Acceleration<core-acceleration>`.

Macros
------

Axom's macros can be found in the file axom/core/Macros.hpp.

Most of the GPU-related macros are used to guard device code for compilation.

For guarding device code:

.. literalinclude:: ../../../axom/core/Macros.hpp
   :start-after:  _guarding_macros_start
   :end-before:  _guarding_macros_end
   :language: C++

.. note::

   - Functions called in CUDA or HIP device code require the __device__
     annotation.
   - Functions called in both device code and CPU host code require the
     __host__ __device__ annotation.

The following code shows the macros used in Axom to apply these annotations:

.. literalinclude:: ../../../axom/core/Macros.hpp
   :start-after:  _decorating_macros_start
   :end-before:  _decorating_macros_end
   :language: C++

Below is a function that uses Axom macros to apply a __host__ __device__ annotation and to guard the use of a CUDA intrinsic so that it is only compiled within device code:

.. literalinclude:: ../../../axom/core/utilities/BitUtilities.hpp
   :start-after:  gpu_macros_example_start
   :end-before:  gpu_macros_example_end
   :language: C++
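
As an illustration, here is a minimal sketch of how these macros might appear in downstream code; the function clamp_nonnegative is hypothetical and not part of Axom, and the sketch assumes the AXOM_HOST_DEVICE and AXOM_DEVICE_CODE macros from Macros.hpp shown above:

.. code-block:: C++

   #include "axom/core/Macros.hpp"

   // Hypothetical example function, callable from both host and device code.
   // AXOM_HOST_DEVICE expands to __host__ __device__ in CUDA/HIP builds and
   // to nothing otherwise.
   AXOM_HOST_DEVICE double clamp_nonnegative(double x)
   {
   #ifdef AXOM_DEVICE_CODE
     // Compiling for the device: device-side math intrinsics are available.
     return fmax(x, 0.0);
   #else
     // Compiling for the host.
     return x < 0.0 ? 0.0 : x;
   #endif
   }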

Memory
------

Axom's memory management routines can be found in the file axom/core/memory_management.hpp.

Memory Management Routines
~~~~~~~~~~~~~~~~~~~~~~~~~~

Umpire has the concept of "allocators", one associated with each memory resource type (umpire::resource::MemoryResourceType).

To allocate memory on a particular resource, you use the ID of the allocator associated with that resource.

You can also set a default allocator; unless otherwise specified, all memory allocations are then made on the resource associated with that allocator:

.. literalinclude:: ../../../axom/core/memory_management.hpp
   :start-after:  _memory_management_routines_start
   :end-before:  _memory_management_routines_end
   :language: C++

.. note::

   When Axom is built without Umpire, the getters and setters shown above
   become no-ops or are undefined, while the memory allocation functions
   default to C++ standard library functions that allocate only on the host
   (CPU):

   - axom::allocate calls std::malloc
   - axom::deallocate calls std::free
   - axom::reallocate calls std::realloc
   - axom::copy calls std::memcpy
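
As a short sketch of these routines in use (assuming an Umpire-enabled build with a unified-memory resource available):

.. code-block:: C++

   #include "axom/core/memory_management.hpp"

   // Get the allocator ID associated with Umpire's unified memory resource.
   const int unified_id = axom::getUmpireResourceAllocatorID(
     umpire::resource::MemoryResourceType::Unified);

   // Allocate 256 doubles in unified memory...
   double* data = axom::allocate<double>(256, unified_id);

   // ...or make it the default so subsequent allocations use it implicitly.
   axom::setDefaultAllocator(unified_id);
   double* more_data = axom::allocate<double>(256);

   axom::deallocate(data);
   axom::deallocate(more_data);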

MemorySpace
~~~~~~~~~~~

.. literalinclude:: ../../../axom/core/memory_management.hpp
   :start-after:  _memory_space_start
   :end-before:  _memory_space_end
   :language: C++

Axom provides the axom::MemorySpace enum type to define values indicating the memory space where data in axom::Array and axom::ArrayView lives.

MemorySpace::Dynamic allows you to define the location at runtime, with some caveats (see :ref:`Core Containers<core-containers>` for more details and examples).
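
For example, the memory space can be fixed at compile time through the template parameter, or left Dynamic and chosen at runtime via an allocator ID; in this sketch, unified_id is assumed to be a valid Umpire allocator ID:

.. code-block:: C++

   // Compile-time choice: a 1D array that always lives in unified memory.
   axom::Array<int, 1, axom::MemorySpace::Unified> unified_arr(100);

   // Runtime choice: MemorySpace::Dynamic (the default), with the location
   // determined by the allocator ID passed to the constructor.
   axom::Array<int> dynamic_arr(100, 100, unified_id);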

Useful Links
~~~~~~~~~~~~

Umpire Tutorial - First two sections cover Allocators and Resources.

Kernels
-------

axom::for_all
~~~~~~~~~~~~~

axom::for_all can be found in the file axom/core/execution/for_all.hpp.

axom::for_all is a wrapper around RAJA::forall, which is used to execute simple for-loop kernels.

It is used in Axom to execute for-loop style kernels that run on a GPU device, or on both a GPU device and a CPU host. It has the following two overloads:

.. code-block:: C++

   template <typename ExecSpace, typename KernelType>
   void axom::for_all(const IndexType& N, KernelType&& kernel);

   template <typename ExecSpace, typename KernelType>
   void axom::for_all(const IndexType& begin, const IndexType& end, KernelType&& kernel);
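
For instance, a minimal kernel that squares each index into a device-accessible array might look like the following sketch (assuming a CUDA-enabled build and a previously defined length N; with SEQ_EXEC the same kernel runs sequentially on the host):

.. code-block:: C++

   using exec = axom::CUDA_EXEC<256>;

   axom::Array<int> arr(N, N, axom::execution_space<exec>::allocatorID());
   const auto arr_view = arr.view();

   axom::for_all<exec>(
     N,
     AXOM_LAMBDA(axom::IndexType i) { arr_view[i] = i * i; });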

.. note::

   When Axom is built without RAJA, axom::for_all becomes a sequential
   for-loop on the host (CPU).

RAJA::kernel
~~~~~~~~~~~~

RAJA::kernel is used to execute kernels implemented using nested loops.

It is used infrequently in Axom, appearing mainly in a few unit tests; your general go-to should be axom::for_all.
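
For reference, here is a small nested-loop sketch with RAJA::kernel using sequential policies; Ni, Nj, and out_view are assumed to be defined elsewhere:

.. code-block:: C++

   // Policy: the outer For iterates segment 1 (j), the inner For iterates
   // segment 0 (i), and Lambda<0> runs in the innermost scope.
   using KERNEL_POL = RAJA::KernelPolicy<
     RAJA::statement::For<1, RAJA::seq_exec,
       RAJA::statement::For<0, RAJA::seq_exec,
         RAJA::statement::Lambda<0>>>>;

   RAJA::kernel<KERNEL_POL>(
     RAJA::make_tuple(RAJA::RangeSegment(0, Ni), RAJA::RangeSegment(0, Nj)),
     [=](RAJA::Index_type i, RAJA::Index_type j) {
       out_view[j * Ni + i] = i + j;
     });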

Useful Links
~~~~~~~~~~~~

RAJA Loops - Covers the RAJA::forall, RAJA::kernel, and RAJA::launch kernel execution methods.

Execution Spaces & Policies
---------------------------

Axom's execution spaces can be found in the file axom/core/execution/execution_space.hpp.

Axom's execution spaces are derived from an axom::execution_space<ExecSpace> traits class containing RAJA execution policies and default Umpire memory allocators associated with each space.

Axom currently supports four execution spaces, each a type specializing the execution_space class:

- SEQ_EXEC - Sequential execution policies on host
- OMP_EXEC - OpenMP execution policies on host
- CUDA_EXEC - CUDA execution policies in Unified Memory (host + device)
- HIP_EXEC - HIP execution policies in Unified Memory (host + device)

Additionally, the CUDA_EXEC and HIP_EXEC types are templated on the number of threads per block and on SYNCHRONOUS or ASYNC execution:

.. literalinclude:: ../../../axom/core/execution/internal/cuda_exec.hpp
   :start-after:  _cuda_exec_start
   :end-before:  _cuda_exec_end
   :language: C++
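
For example, aliases selecting 256 threads per block with synchronous (the default) and asynchronous execution might look like:

.. code-block:: C++

   // 256 threads per block, synchronous execution (the default)
   using cuda_exec = axom::CUDA_EXEC<256>;

   // 256 threads per block, asynchronous kernel launches
   using async_cuda_exec = axom::CUDA_EXEC<256, axom::ASYNC>;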

Each execution space provides the following (a usage sketch follows this list):

- Axom policies that are type aliases of RAJA policies to be used with
  kernels, RAJA types, and RAJA operations

  - loop_policy - For RAJA scans and other operations; axom::for_all uses the
    loop_policy from the templated execution space.

  - reduce_policy - For RAJA reduction types that perform reduction
    operations:

    .. literalinclude:: ../../../axom/core/examples/core_acceleration.cpp
       :start-after:  _gpu_reduce_start
       :end-before:  _gpu_reduce_end
       :language: C++

  - atomic_policy - For RAJA atomic operations that avoid race conditions
    when multiple threads update the same data value:

    .. literalinclude:: ../../../axom/core/examples/core_acceleration.cpp
       :start-after:  _gpu_atomic_start
       :end-before:  _gpu_atomic_end
       :language: C++

  - sync_policy - For Axom's synchronize function, a wrapper around
    RAJA::synchronize() that synchronizes execution threads when using an
    asynchronous loop_policy:

    .. literalinclude:: ../../../axom/core/execution/synchronize.hpp
       :start-after:  _gpu_synchronize_start
       :end-before:  _gpu_synchronize_end
       :language: C++

- Umpire allocator defaults

  - memory_space - The memory space abstraction for use by
    :ref:`Core Containers<core-containers>` like axom::Array.

  - allocatorID() - Gets the allocator ID for the Umpire resource to use in
    this execution space.

- General information on the execution space

  - name() - Name of the execution space
  - onDevice() - Is the execution space on device? (True/False)
  - valid() - Is the execution space valid? (True)
  - async() - Is the execution space asynchronous? (True/False)
The :doc:`Mint <../../../axom/mint/docs/sphinx/index>` component also provides a set of nested execution policies, located in axom/mint/execution/internal/structured_exec.hpp, to be used with RAJA::kernel, e.g. when iterating over Mint meshes.

.. note::

   When Axom is built without RAJA, only SEQ_EXEC is available for host (CPU)
   execution. When Axom is built with RAJA but without Umpire for memory
   management on device, only SEQ_EXEC and OMP_EXEC are available for host
   (CPU) execution.

General, Rough Porting Tips
---------------------------

- Start by figuring out what memory you need on the device, and use
  axom::Array, axom::ArrayView, and the
  :ref:`memory management routines<gpu-memory-label>` to do the allocations:

  .. code-block:: C++

     // Allocate 100 2D Triangles in unified memory
     using cuda_exec = axom::CUDA_EXEC<256>;
     using TriangleType = axom::primal::Triangle<double, 2>;
     axom::Array<TriangleType> tris(
       100, 100, axom::execution_space<cuda_exec>::allocatorID());
     axom::ArrayView<TriangleType> tris_view(tris);

     // Allocate the sum of Triangle areas
     using reduce_pol = typename axom::execution_space<cuda_exec>::reduce_policy;
     RAJA::ReduceSum<reduce_pol, double> totalArea(0);
- Using an axom::for_all kernel with a device policy, attempt to access
  and/or manipulate the memory on the device:

  .. code-block:: C++

     axom::for_all<cuda_exec>(
       100,
       AXOM_LAMBDA(int idx) {
         // Set values on device
         tris_view[idx] = TriangleType();

         // Touch the reduction variable on device
         // (RAJA reducers support += inside kernels)
         totalArea += 0;
       });
- Add the functions you want to call on the device to the axom::for_all
  kernel:

  .. code-block:: C++

     axom::for_all<cuda_exec>(
       100,
       AXOM_LAMBDA(int idx) {
         tris_view[idx] = TriangleType();
         totalArea += 0;

         // Call the area() method on device
         double area = tris_view[idx].area();
       });
- Apply a __host__ __device__ annotation to your functions if you see an
  error like the following:

  .. code-block:: text

     error: reference to __host__ function 'area' in __host__ __device__ function

  - Recompiling will likely surface complaints about more functions, namely
    the not-yet-decorated functions that your newly decorated functions call:

    .. code-block:: text

       error: reference to __host__ function 'abs<double>' in __host__ __device__ function

       error: reference to __host__ function 'signedArea<2>' in __host__ __device__ function

    Keep decorating until all the complaints are gone.

  - Most of the C++ standard library is not available on device. Your options
    are to use Axom's equivalent functions and classes where they exist, to
    add your own, or to rewrite the code to avoid the standard library.

- With no more decorating complaints from the compiler, write the logically
  correct kernel:

  .. code-block:: C++

     // Compute the total area of 100 triangles
     axom::for_all<cuda_exec>(
       100,
       AXOM_LAMBDA(int idx) { totalArea += tris_view[idx].area(); });

  - If at this point your kernel is not working or is segfaulting, the cause
    is hopefully a logical error that you can debug without reaching for
    device debugging tools:

    - Use printf() for debugging output.
    - Try the SEQ_EXEC execution space to check the logic on the host.

Useful Links
~~~~~~~~~~~~