Merge pull request #378 from LLNL/feature/kunen1/nested2
Rewrite of nested::forall to support complex loop structures
davidbeckingsale committed Mar 19, 2018
2 parents e3457c5 + f829384 commit 995cbb6
Showing 103 changed files with 10,740 additions and 2,550 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@
*.gch
build/
build-*/
/Debug/
@@ -12,24 +12,25 @@
.. ## For details about use and distribution, please read RAJA/LICENSE.
.. ##
.. _forall-label:
.. _loop_basic-label:

=========================
forall and nested::forall
Single and Nested Loops
=========================

The ``RAJA::forall`` and ``RAJA::nested::forall`` loop traversal template
The ``RAJA::forall`` and ``RAJA::kernel`` loop traversal template
methods are the building blocks for most RAJA usage. RAJA users pass
application code fragments, such as loop bodies, into these loop traversal
methods using lambda expressions along with iteration space information.
Then, once loops are written in the RAJA form, they can be run using different
programming model back-ends by changing execution policy template arguments.
For information on available RAJA execution policies, see :ref:`policies-label`.

.. note:: * All forall and nested::forall methods are in the namespace
            ``RAJA``.
          * Each loop traversal method is templated on an *execution policy*,
            or multiple execution policies for the case of ``nested::forall``.
.. note:: * All forall and kernel methods are in the namespace ``RAJA``.
          * Each ``RAJA::forall`` traversal method is templated on an
            *execution policy*.
          * Each ``RAJA::kernel`` method requires a statement with an
            *execution policy* type for each level in a loop nest.

The ``RAJA::forall`` templates encapsulate standard C-style for loops.
For example, a C-style loop like::
@@ -48,7 +49,7 @@
The RAJA form takes a template argument for the execution policy, and
two arguments: an object describing the loop iteration space (e.g., a RAJA
segment or index set) and a lambda expression defining the loop body.
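
For illustration, a minimal sketch of this pattern (assuming a sequential
execution policy and arrays ``a`` and ``b`` allocated and initialized
elsewhere)::

   // C-style loop
   for (int i = 0; i < N; ++i) {
     a[i] += b[i];
   }

   // Equivalent RAJA form; the execution policy is the template argument
   RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
     a[i] += b[i];
   });

Changing ``RAJA::seq_exec`` to another policy (e.g., an OpenMP or CUDA policy)
changes how the loop runs without changing the loop body.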

The ``RAJA::nested::forall`` traversal templates provide flexibility in
The ``RAJA::kernel`` traversal templates provide flexibility in
how arbitrary loop nests can be run with minimal source code changes. A
loop nest, such as::

@@ -61,29 +62,42 @@

may be written in a RAJA form as::

   RAJA::nested::forall< RAJA::nested::Policy<
                           RAJA::nested::For<N, exec_policyN>,
                           ...
                           RAJA::nested::For<0, exec_policy0> > >(

     RAJA::make_tuple(iter_space IN, ..., iter_space I0),

     [=] (index_type iN, ... , index_type i1) {
        //loop body
     });

   RAJA::kernel< RAJA::KernelPolicy<
                   RAJA::statement::For<N, exec_policyN,
                     ...
                     RAJA::statement::For<0, exec_policy0,
                       RAJA::statement::Lambda<0>
                     >
                     ...
                   >
                 > >(

     RAJA::make_tuple(iter_space IN, ..., iter_space I0),

     [=] (index_type iN, ... , index_type i1) {
        //loop body
     });

Here, we have a loop nest of M = N+1 levels. The ``RAJA::nested::forall``
method is templated on 'M' execution policy arguments and takes, as arguments,
a tuple of M iteration spaces and a lambda expression for the inner loop body.
The lambda expression for the loop body must have M loop index arguments and
they must be in the same order as the associated iteration spaces in the tuple.
Here, we have a loop nest of M = N+1 levels. The ``RAJA::kernel`` method
takes a ``RAJA::KernelPolicy`` template type, which defines a nested sequence
of ``RAJA::statement::For`` types, one for each level of the loop nest, plus
a ``RAJA::statement::Lambda`` type for the lambda loop body. The first argument
to the ``RAJA::kernel`` method is a tuple of M iteration spaces and the second
is the lambda expression for the inner loop body. The lambda expression for
the loop body must have M loop index arguments, and they must be in the same
order as the associated iteration spaces in the tuple.
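
As a concrete sketch (an illustrative two-level example, assuming sequential
policies, flat arrays ``a``, ``b``, ``c`` allocated elsewhere, and extents
``Ni`` and ``Nj``; not code from the RAJA repository)::

   using EXEC_POL =
     RAJA::KernelPolicy<
       RAJA::statement::For<1, RAJA::seq_exec,      // outer loop: tuple index 1 (j)
         RAJA::statement::For<0, RAJA::seq_exec,    // inner loop: tuple index 0 (i)
           RAJA::statement::Lambda<0>
         >
       >
     >;

   RAJA::kernel<EXEC_POL>(
     RAJA::make_tuple(RAJA::RangeSegment(0, Ni),    // tuple index 0
                      RAJA::RangeSegment(0, Nj)),   // tuple index 1
     [=] (int i, int j) {
       a[j*Ni + i] = b[i] * c[j];                   // loop body
     });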

.. note:: For the nested loop case, the loop nest ordering is determined by the
          order of the nested policies, starting with the outermost loop and
          ending with the innermost loop. The integer value that appears as
          the first parameter to each of the ``For`` templates indicates which
          iteration space/lambda index argument it corresponds to.

          **This allows arbitrary loop nesting order transformations to
          be done simply by changing the ordering of the policies.** This
          is analogous to changing the order of 'for-loop' statements in
          C-style code.
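
For example, reusing the illustrative policy sketch from above and swapping
the two ``For`` statements (the tuple and lambda are left unchanged) makes
``i`` the outermost loop variable::

   using EXEC_POL_SWAPPED =
     RAJA::KernelPolicy<
       RAJA::statement::For<0, RAJA::seq_exec,      // 'i' now runs outermost
         RAJA::statement::For<1, RAJA::seq_exec,    // 'j' now runs innermost
           RAJA::statement::Lambda<0>
         >
       >
     >;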

In summary, these RAJA template methods require a user to understand how to
specify several items:

@@ -97,5 +111,5 @@

#. The loop iteration variables and their types, which are arguments to the lambda loop body.

Typical usage of ``RAJA::forall`` and ``RAJA::nested::forall`` may be found
Basic usage of ``RAJA::forall`` and ``RAJA::kernel`` may be found
in the examples in :ref:`tutorial-label`.
4 changes: 4 additions & 0 deletions docs/sphinx/user_guide/feature/policies.rst
@@ -18,6 +18,10 @@
Execution Policies
==================

.. warning:: **This section is a work-in-progress!! It needs to be updated
             and reworked to be consistent with recent changes related to
             new 'kernel' stuff.**

This section describes the various execution policies that ``RAJA`` provides.

.. note:: * All RAJA execution policies are in the namespace ``RAJA``.
2 changes: 1 addition & 1 deletion docs/sphinx/user_guide/features.rst
@@ -23,7 +23,7 @@
This section provides a high-level description of all the main RAJA features.
.. toctree::
:maxdepth: 2

feature/forall
feature/loop_basic
feature/policies
feature/index
feature/reduction
3 changes: 2 additions & 1 deletion docs/sphinx/user_guide/index.rst
@@ -45,7 +45,8 @@
source code bases that can be readily ported to new architectures. RAJA is
one C++-based programming model abstraction layer that can help to meet this
performance portability challenge.

RAJA provides portable abstractions for single and nested loops, reductions,
RAJA provides portable abstractions for singly-nested and multiply-nested
loops -- as well as a variety of loop transformations, reductions,
scans, atomic operations, data layouts and views, iteration spaces, etc.
Currently supported execution policies for different programming model
back-ends include: sequential, SIMD, CUDA, OpenMP multi-threading and target
33 changes: 26 additions & 7 deletions docs/sphinx/user_guide/tutorial.rst
@@ -19,11 +19,28 @@
RAJA Tutorial
**********************

This RAJA tutorial introduces the most commonly-used RAJA concepts and
capabilities via a sequence of simple examples. To understand the discussion
and example codes, a working knowledge of C++ templates and lambda functions
is required. Before we begin, we provide a bit of background discussion of
the key features of C++ lambda expressions, which are essential using RAJA
easily.
capabilities via a sequence of simple examples.

To understand the discussion and example codes, a working knowledge of C++
templates and lambda functions is required. Here, we provide a bit
of background discussion of the key aspects of C++ lambda expressions, which
are essential to using RAJA easily.

To understand the examples that run on a GPU device, it is important to note
that any lambda expression that is defined outside of a GPU kernel and passed
to a GPU kernel must be decorated with the ``__device__`` attribute when it is
defined. This can be done directly or by using the ``RAJA_DEVICE`` macro.
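
For example, a minimal sketch (assuming a CUDA build, a device-accessible
array ``a``, and an illustrative block size of 256 threads)::

   RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
     [=] RAJA_DEVICE (int i) {   // RAJA_DEVICE supplies the __device__ attribute
       a[i] = 2.0 * a[i];
   });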

It is also important to understand how CPU (host) and GPU (device) memory
allocation and data transfer work. For a detailed discussion, see
`Device Memory <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory>`_. RAJA does not provide a memory model by design, so users
are responsible for ensuring that data is properly allocated and initialized
on the device when running GPU code. This can be done using explicit host and
device allocation and copying between host and device memory spaces, or via
CUDA unified memory (UM), if available. The RAJA developers also support a
library called ``CHAI``, which is complementary to RAJA and provides a
simple alternative to manual CUDA calls or UM. For more information about
CHAI, see :ref:`plugins-label`.
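
As one illustration of the unified memory option (a sketch assuming a
CUDA-enabled build; error checking omitted)::

   double* a = nullptr;

   // Allocate memory that is accessible from both host and device (CUDA UM)
   cudaMallocManaged((void**)&a, N * sizeof(double));

   for (int i = 0; i < N; ++i) { a[i] = 1.0; }   // initialize on the host

   // ... run RAJA GPU kernels that read and write 'a' ...

   cudaFree(a);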

===============================
A Little C++ Lambda Background
@@ -97,7 +114,7 @@
Examples
The remainder of this tutorial illustrates how to exercise various RAJA
features using simple examples. Note that all the examples employ
RAJA traversal template methods, which are described briefly
here :ref:`forall-label`. For the purposes of the discussion, we
here :ref:`loop_basic-label`. For the purposes of the discussion, we
assume that any and all data used has been properly allocated and initialized.
This is done in the code examples, but is not discussed further here.

@@ -114,9 +131,11 @@
for reference.
tutorial/add_vectors.rst
tutorial/dot_product.rst
tutorial/indexset_segments.rst
tutorial/vertexsum_coloring.rst
tutorial/matrix_multiply.rst
tutorial/nested_loop_reorder.rst
tutorial/vertexsum_coloring.rst
tutorial/complex_loops-intro.rst
tutorial/complex_loops-shmem.rst
tutorial/reductions.rst
tutorial/atomic_binning.rst
tutorial/scan.rst
9 changes: 1 addition & 8 deletions docs/sphinx/user_guide/tutorial/add_vectors.rst
@@ -94,14 +94,7 @@
parameter is optional; if not specified the RAJA policy provides a default of

Since the lambda defining the loop body will be passed to a device kernel,
it must be decorated with the ``__device__`` attribute when it is defined.
This can be done directly, or by using the ``RAJA_DEVICE`` macro if one so
chooses.

Note that the user is responsible for making sure that the data arrays
are properly allocated and initialized on the device. This can be done using
explicit device allocation and copying from host memory, via CUDA unified
memory if available, or by using ``CHAI`` (for more information about CHAI,
see :ref:`plugins-label`).
This can be done directly or by using the ``RAJA_DEVICE`` macro.

The file ``RAJA/examples/ex1-add-vectors.cpp`` contains the complete
working example code.
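
A minimal sketch of the pattern this section describes (assuming
device-accessible arrays ``a``, ``b``, ``c`` and an assumed block-size
constant ``CUDA_BLOCK_SIZE``; see the example file for the actual code)::

   RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
     [=] RAJA_DEVICE (int i) {
       c[i] = a[i] + b[i];
   });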
27 changes: 27 additions & 0 deletions docs/sphinx/user_guide/tutorial/complex_loops-intro.rst
@@ -0,0 +1,27 @@
.. ##
.. ## Copyright (c) 2016-18, Lawrence Livermore National Security, LLC.
.. ##
.. ## Produced at the Lawrence Livermore National Laboratory
.. ##
.. ## LLNL-CODE-689114
.. ##
.. ## All rights reserved.
.. ##
.. ## This file is part of RAJA.
.. ##
.. ## For details about use and distribution, please read RAJA/LICENSE.
.. ##
.. _complex_intro-label:

---------------------------------
Introduction to Complex Loops
---------------------------------

.. warning:: **This section is a work-in-progress!!**

Introduce concepts and semantics of complex loop execution using
``RAJA::kernel`` and ``RAJA::KernelPolicy`` constructs....

Add example codes to the examples directory and reference here to provide
working examples to support the discussion.
27 changes: 27 additions & 0 deletions docs/sphinx/user_guide/tutorial/complex_loops-shmem.rst
@@ -0,0 +1,27 @@
.. ##
.. ## Copyright (c) 2016-18, Lawrence Livermore National Security, LLC.
.. ##
.. ## Produced at the Lawrence Livermore National Laboratory
.. ##
.. ## LLNL-CODE-689114
.. ##
.. ## All rights reserved.
.. ##
.. ## This file is part of RAJA.
.. ##
.. ## For details about use and distribution, please read RAJA/LICENSE.
.. ##
.. _complex_shmem-label:

---------------------------------
Complex Loops: Shared Memory
---------------------------------

.. warning:: **This section is a work-in-progress!!**

Describe and illustrate shared memory window concepts. Motivate for
performance: cache-blocking on CPU, CUDA shared memory on GPU.

Add example codes to the examples directory and reference here to provide
working examples to support the discussion.
