Merge pull request #378 from LLNL/feature/kunen1/nested2
Rewrite of nested::forall to support complex loop structures
davidbeckingsale committed Mar 19, 2018
2 parents e3457c5 + f829384 commit 995cbb6
Showing 103 changed files with 10,740 additions and 2,550 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@
*.gch
build/
build-*/
/Debug/
@@ -12,24 +12,25 @@
.. ## For details about use and distribution, please read RAJA/LICENSE.
.. ##
.. _forall-label:
.. _loop_basic-label:

=========================
forall and nested::forall
Single and Nested Loops
=========================

The ``RAJA::forall`` and ``RAJA::nested::forall`` loop traversal template
The ``RAJA::forall`` and ``RAJA::kernel`` loop traversal template
methods are the building blocks for most RAJA usage. RAJA users pass
application code fragments, such as loop bodies, into these loop traversal
methods using lambda expressions along with iteration space information.
Then, once loops are written in the RAJA form, they can be run using different
programming model back-ends by changing execution policy template arguments.
For information on available RAJA execution policies, see :ref:`policies-label`.

.. note:: * All forall and nested::forall methods are in the namespace
            ``RAJA``.
          * Each loop traversal method is templated on an *execution policy*,
            or multiple execution policies for the case of ``nested::forall``.
.. note:: * All forall and kernel methods are in the namespace ``RAJA``.
          * Each ``RAJA::forall`` traversal method is templated on an
            *execution policy*.
          * Each ``RAJA::kernel`` method requires a statement with an
            *execution policy* type for each level in a loop nest.

The ``RAJA::forall`` templates encapsulate standard C-style for loops.
For example, a C-style loop like::
@@ -48,7 +49,7 @@
The RAJA form takes a template argument for the execution policy, and
two arguments: an object describing the loop iteration space (e.g., a RAJA
segment or index set) and a lambda expression defining the loop body.
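
For illustration, a minimal sketch of this pattern (assuming a sequential
execution policy and arrays ``a`` and ``b`` allocated and initialized
elsewhere)::

   // C-style loop
   for (int i = 0; i < N; ++i) {
     a[i] += b[i];
   }

   // Equivalent RAJA form; the execution policy is the template argument
   RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
     a[i] += b[i];
   });

Changing ``RAJA::seq_exec`` to another policy (e.g., an OpenMP or CUDA policy)
changes how the loop runs without changing the loop body.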

The ``RAJA::nested::forall`` traversal templates provide flexibility in
The ``RAJA::kernel`` traversal templates provide flexibility in
how arbitrary loop nests can be run with minimal source code changes. A
loop nest, such as::

@@ -61,29 +62,42 @@

may be written in a RAJA form as::

   RAJA::nested::forall< RAJA::nested::Policy<
                           RAJA::nested::For<N, exec_policyN>,
                           ...
                           RAJA::nested::For<0, exec_policy0> > >(

     RAJA::make_tuple(iter_space IN, ..., iter_space I0),

     [=] (index_type iN, ... , index_type i1) {
        //loop body
     });

   RAJA::kernel< RAJA::KernelPolicy<
                   RAJA::statement::For<N, exec_policyN,
                     ...
                     RAJA::statement::For<0, exec_policy0,
                       RAJA::statement::Lambda<0>
                     >
                     ...
                   >
                 > >(

     RAJA::make_tuple(iter_space IN, ..., iter_space I0),

     [=] (index_type iN, ... , index_type i1) {
        //loop body
     });

Here, we have a loop nest of M = N+1 levels. The ``RAJA::nested::forall``
method is templated on 'M' execution policy arguments and takes, as arguments,
a tuple of M iteration spaces and a lambda expression for the inner loop body.
The lambda expression for the loop body must have M loop index arguments and
they must be in the same order as the associated iteration spaces in the tuple.
Here, we have a loop nest of M = N+1 levels. The ``RAJA::kernel`` method
takes a ``RAJA::KernelPolicy`` template type, which defines a nested sequence
of ``RAJA::statement::For`` types, one for each level of the loop nest, plus
a ``RAJA::statement::Lambda`` type for the lambda loop body. The first argument
to the ``RAJA::kernel`` method is a tuple of M iteration spaces and the second
is the lambda expression for the inner loop body. The lambda expression for
the loop body must have M loop index arguments, and they must be in the same
order as the associated iteration spaces in the tuple.
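
As a concrete sketch (an illustrative two-level example, assuming sequential
policies, flat arrays ``a``, ``b``, ``c`` allocated elsewhere, and extents
``Ni`` and ``Nj``; not code from the RAJA repository)::

   using EXEC_POL =
     RAJA::KernelPolicy<
       RAJA::statement::For<1, RAJA::seq_exec,      // outer loop: tuple index 1 (j)
         RAJA::statement::For<0, RAJA::seq_exec,    // inner loop: tuple index 0 (i)
           RAJA::statement::Lambda<0>
         >
       >
     >;

   RAJA::kernel<EXEC_POL>(
     RAJA::make_tuple(RAJA::RangeSegment(0, Ni),    // tuple index 0
                      RAJA::RangeSegment(0, Nj)),   // tuple index 1
     [=] (int i, int j) {
       a[j*Ni + i] = b[i] * c[j];                   // loop body
     });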

.. note:: For the nested loop case, the loop nest ordering is determined by the
          order of the nested policies, starting with the outermost loop and
          ending with the innermost loop. The integer value that appears as
          the first parameter to each of the ``For`` templates indicates which
          iteration space/lambda index argument it corresponds to.

          **This allows arbitrary loop nesting order transformations to
          be done simply by changing the ordering of the policies.** This
          is analogous to changing the order of 'for-loop' statements in
          C-style code.
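
For example, reusing the illustrative policy sketch from above and swapping
the two ``For`` statements (the tuple and lambda are left unchanged) makes
``i`` the outermost loop variable::

   using EXEC_POL_SWAPPED =
     RAJA::KernelPolicy<
       RAJA::statement::For<0, RAJA::seq_exec,      // 'i' now runs outermost
         RAJA::statement::For<1, RAJA::seq_exec,    // 'j' now runs innermost
           RAJA::statement::Lambda<0>
         >
       >
     >;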

In summary, these RAJA template methods require a user to understand how to
specify several items:

@@ -97,5 +111,5 @@

#. The loop iteration variables and their types, which are arguments to the lambda loop body.

Typical usage of ``RAJA::forall`` and ``RAJA::nested::forall`` may be found
Basic usage of ``RAJA::forall`` and ``RAJA::kernel`` may be found
in the examples in :ref:`tutorial-label`.
4 changes: 4 additions & 0 deletions docs/sphinx/user_guide/feature/policies.rst
@@ -18,6 +18,10 @@
Execution Policies
==================

.. warning:: **This section is a work-in-progress!! It needs to be updated
             and reworked to be consistent with recent changes related to
             new 'kernel' stuff.**

This section describes the various execution policies that ``RAJA`` provides.

.. note:: * All RAJA execution policies are in the namespace ``RAJA``.
2 changes: 1 addition & 1 deletion docs/sphinx/user_guide/features.rst
@@ -23,7 +23,7 @@
This section provides a high-level description of all the main RAJA features.
.. toctree::
:maxdepth: 2

feature/forall
feature/loop_basic
feature/policies
feature/index
feature/reduction
3 changes: 2 additions & 1 deletion docs/sphinx/user_guide/index.rst
@@ -45,7 +45,8 @@
source code bases that can be readily ported to new architectures. RAJA is
one C++-based programming model abstraction layer that can help to meet this
performance portability challenge.

RAJA provides portable abstractions for single and nested loops, reductions,
RAJA provides portable abstractions for singly-nested and multiply-nested
loops -- as well as a variety of loop transformations, reductions,
scans, atomic operations, data layouts and views, iteration spaces, etc.
Currently supported execution policies for different programming model
back-ends include: sequential, SIMD, CUDA, OpenMP multi-threading and target
33 changes: 26 additions & 7 deletions docs/sphinx/user_guide/tutorial.rst
@@ -19,11 +19,28 @@
RAJA Tutorial
**********************

This RAJA tutorial introduces the most commonly-used RAJA concepts and
capabilities via a sequence of simple examples. To understand the discussion
and example codes, a working knowledge of C++ templates and lambda functions
is required. Before we begin, we provide a bit of background discussion of
the key features of C++ lambda expressions, which are essential using RAJA
easily.
capabilities via a sequence of simple examples.

To understand the discussion and example codes, a working knowledge of C++
templates and lambda functions is required. Here, we provide a bit
of background discussion of the key aspects of C++ lambda expressions, which
are essential to using RAJA easily.

To understand the examples that run on a GPU device, it is important to note
that any lambda expression that is defined outside of a GPU kernel and passed
to a GPU kernel must be decorated with the ``__device__`` attribute when it is
defined. This can be done directly or by using the ``RAJA_DEVICE`` macro.
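
For example, a minimal sketch (assuming a CUDA build, a device-accessible
array ``a``, and an illustrative block size of 256 threads)::

   RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
     [=] RAJA_DEVICE (int i) {   // RAJA_DEVICE supplies the __device__ attribute
       a[i] = 2.0 * a[i];
   });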

It is also important to understand how CPU (host) and GPU (device) memory
allocation and data transfer work. For a detailed discussion, see
`Device Memory <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory>`_. RAJA does not provide a memory model by design, so users
are responsible for ensuring that data is properly allocated and initialized
on the device when running GPU code. This can be done using explicit host and
device allocation and copying between host and device memory spaces, or via
CUDA unified memory (UM), if available. The RAJA developers also support a
library called ``CHAI``, which is complementary to RAJA and provides a
simple alternative to manual CUDA calls or UM. For more information about
CHAI, see :ref:`plugins-label`.
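
As one illustration of the unified memory option (a sketch assuming a
CUDA-enabled build; error checking omitted)::

   double* a = nullptr;

   // Allocate memory that is accessible from both host and device (CUDA UM)
   cudaMallocManaged((void**)&a, N * sizeof(double));

   for (int i = 0; i < N; ++i) { a[i] = 1.0; }   // initialize on the host

   // ... run RAJA GPU kernels that read and write 'a' ...

   cudaFree(a);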

===============================
A Little C++ Lambda Background
@@ -97,7 +114,7 @@
Examples
The remainder of this tutorial illustrates how to exercise various RAJA
features using simple examples. Note that all the examples employ
RAJA traversal template methods, which are described briefly
here :ref:`forall-label`. For the purposes of the discussion, we
here :ref:`loop_basic-label`. For the purposes of the discussion, we
assume that any and all data used has been properly allocated and initialized.
This is done in the code examples, but is not discussed further here.

@@ -114,9 +131,11 @@
for reference.
tutorial/add_vectors.rst
tutorial/dot_product.rst
tutorial/indexset_segments.rst
tutorial/vertexsum_coloring.rst
tutorial/matrix_multiply.rst
tutorial/nested_loop_reorder.rst
tutorial/vertexsum_coloring.rst
tutorial/complex_loops-intro.rst
tutorial/complex_loops-shmem.rst
tutorial/reductions.rst
tutorial/atomic_binning.rst
tutorial/scan.rst
9 changes: 1 addition & 8 deletions docs/sphinx/user_guide/tutorial/add_vectors.rst
@@ -94,14 +94,7 @@
parameter is optional; if not specified the RAJA policy provides a default of

Since the lambda defining the loop body will be passed to a device kernel,
it must be decorated with the ``__device__`` attribute when it is defined.
This can be done directly, or by using the ``RAJA_DEVICE`` macro if one so
chooses.

Note that the user is responsible for making sure that the data arrays
are properly allocated and initialized on the device. This can be done using
explicit device allocation and copying from host memory, via CUDA unified
memory if available, or by using ``CHAI`` (for more information about CHAI,
see :ref:`plugins-label`).
This can be done directly or by using the ``RAJA_DEVICE`` macro.

The file ``RAJA/examples/ex1-add-vectors.cpp`` contains the complete
working example code.
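
A minimal sketch of the pattern this section describes (assuming
device-accessible arrays ``a``, ``b``, ``c`` and an assumed block-size
constant ``CUDA_BLOCK_SIZE``; see the example file for the actual code)::

   RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
     [=] RAJA_DEVICE (int i) {
       c[i] = a[i] + b[i];
   });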
27 changes: 27 additions & 0 deletions docs/sphinx/user_guide/tutorial/complex_loops-intro.rst
@@ -0,0 +1,27 @@
.. ##
.. ## Copyright (c) 2016-18, Lawrence Livermore National Security, LLC.
.. ##
.. ## Produced at the Lawrence Livermore National Laboratory
.. ##
.. ## LLNL-CODE-689114
.. ##
.. ## All rights reserved.
.. ##
.. ## This file is part of RAJA.
.. ##
.. ## For details about use and distribution, please read RAJA/LICENSE.
.. ##
.. _complex_intro-label:

---------------------------------
Introduction to Complex Loops
---------------------------------

.. warning:: **This section is a work-in-progress!!**

Introduce concepts and semantics of complex loop execution using
``RAJA::kernel`` and ``RAJA::KernelPolicy`` constructs....

Add example codes to the examples directory and reference here to provide
working examples to support the discussion.
27 changes: 27 additions & 0 deletions docs/sphinx/user_guide/tutorial/complex_loops-shmem.rst
@@ -0,0 +1,27 @@
.. ##
.. ## Copyright (c) 2016-18, Lawrence Livermore National Security, LLC.
.. ##
.. ## Produced at the Lawrence Livermore National Laboratory
.. ##
.. ## LLNL-CODE-689114
.. ##
.. ## All rights reserved.
.. ##
.. ## This file is part of RAJA.
.. ##
.. ## For details about use and distribution, please read RAJA/LICENSE.
.. ##
.. _complex_shmem-label:

---------------------------------
Complex Loops: Shared Memory
---------------------------------

.. warning:: **This section is a work-in-progress!!**

Describe and illustrate shared memory window concepts. Motivate for
performance: cache-blocking on CPU, CUDA shared memory on GPU.

Add example codes to the examples directory and reference here to provide
working examples to support the discussion.
