
Remotes/origin/task/chen59/omptarget #539

Merged
20 commits merged into develop on Feb 14, 2019

Conversation

@rchen20 (Member) commented Nov 6, 2018

This change makes the OpenMP target NUMTEAMS parameter look and feel like CUDA's NUMTHREADS, for consistency. The user specifies a number of threads per block for omp target directives. The number of OpenMP teams is calculated internally, using the data size and the number of threads per block.

Also included is an updated omp_target script for the most recent XL compiler (10.29).
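As a rough illustration of the team calculation described in the first paragraph (the helper below is hypothetical, not RAJA code), the number of teams is a ceiling division of the iteration count by the threads-per-block value:

#include <cstddef>

// Hypothetical helper, for illustration only: teams are derived from the
// iteration count and the user-supplied threads-per-block value.
constexpr std::size_t num_teams(std::size_t distance, std::size_t threads_per_block)
{
  return (distance + threads_per_block - 1) / threads_per_block;
}

// e.g. num_teams(1000000, 256) == 3907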

@codecov-io commented Nov 6, 2018

Codecov Report

Merging #539 into develop will not change coverage.
The diff coverage is n/a.


@@            Coverage Diff            @@
##           develop      #539   +/-   ##
=========================================
  Coverage   98.631%   98.631%           
=========================================
  Files           62        62           
  Lines         1242      1242           
=========================================
  Hits          1225      1225           
  Misses          17        17


@rhornung67 (Member)

@davidbeckingsale when you have a chance, please look this over and approve if you're good with it.

@davidbeckingsale (Member) left a comment

Looks good apart from ThreadC

@@ -60,6 +60,10 @@ template <unsigned int TeamSize>
struct Teams : std::integral_constant<unsigned int, TeamSize> {
};

template <unsigned int ThreadCount>
struct ThreadC : std::integral_constant<unsigned int, ThreadCount> {
Member

This could be named better: Threads or ThreadCount

@rhornung67 (Member)

@davidbeckingsale, @trws could you guys look over this PR when you have a few minutes? It's been languishing for a while. Thanks.

@davidbeckingsale (Member)

This is looking pretty good.

- omp_target_alloc(Teams * sizeof(T), info.deviceID))},
- host{new T[Teams]}
+ omp_target_alloc(Threads * sizeof(T), info.deviceID))},
+ host{new T[Threads]}
Member

I think this will break the omp target reducers when Teams > Threads.
Each team writes one value to this array, so when there are more teams than threads per team this will cause writes beyond the end of the buffer.
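A minimal sketch of the hazard being described (names and structure are assumed for illustration; this is not the RAJA code):

#include <omp.h>

template <typename T, unsigned Teams, unsigned Threads>
void reduce_buffer_sketch(int device_id)
{
  // The reduction buffer holds one slot per team but is sized to Threads,
  // as in the diff above.
  T *device = static_cast<T *>(omp_target_alloc(Threads * sizeof(T), device_id));
  #pragma omp target teams num_teams(Teams) thread_limit(Threads) is_device_ptr(device)
  {
    // Each team writes its own slot: out of bounds once a team id reaches Threads.
    device[omp_get_team_num()] = T{};
  }
  omp_target_free(device, device_id);
}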

@rchen20 (Member Author) Nov 29, 2018


@MrBurmark makes a great point. The code in its current form does guarantee that Teams <= Threads, but this is not obvious to RAJA developers (and even less obvious to RAJA users). The guarantee is specified in the omp_target_parallel_for_exec pattern (/include/RAJA/policy/openmp/target_forall.hpp):

auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
#pragma omp target teams distribute parallel for num_teams(teamnum) thread_limit(Threads) schedule(static, 1) map(to : body)

So far, this is the only pattern we use for omp_target policies, but if we develop other patterns they will require similar team computations or checks. There is another unused pattern (omp_target_parallel_for_exec_nt) in which neither teams nor threads are specified. But this should still work with reduce because the default number of teams is 1.

Potentially adding to the confusion is a piece of code I neglected to change:

#if defined(RAJA_ENABLE_TARGET_OPENMP)
template <size_t Teams>
struct omp_target_reduce
: make_policy_pattern_t<Policy::target_openmp, Pattern::reduce> {
};
#endif

"Teams" should be changed to "Threads", because that is how they are used with my other modifications.

Member

Won't teamnum be greater than Threads if distance > Threads*Threads?

Member Author

That is true! I suppose I can add an assert after that calculation, and maybe to be ultra safe, another assert in ~TargetReduce() and ~TargetReduceLoc() where teams are used during reduction.
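A sketch of what that guard might look like in the forall (illustrative only; the exact placement and form of the assert could differ):

#include <cassert>

auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
// Guard against computing more teams than the Threads-sized reduction buffer
// can hold; e.g. with Threads = 256 and distance = 100000, teamnum = 391 > 256.
assert(teamnum <= (int)Threads);
// ... followed by the target teams pragma shown above.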

@rhornung67 (Member)

@MrBurmark good point. Thanks!

@rchen20 (Member Author) commented Dec 5, 2018

@trws @MrBurmark @davidbeckingsale If you have time, would you mind looking at the latest small commits? Thanks!

@davidbeckingsale (Member) left a comment

As long as Jason is okay with the Reducers stuff, I'm happy.

@trws (Member) previously requested changes Dec 5, 2018

Aside from the small doc/wording changes this looks fine to me.

- * ``omp_target_parallel_for_exec<NUMTEAMS>`` - Execute a loop in parallel using an ``omp target parallel for`` pragma with given number of thread teams; e.g.,
- if a GPU device is available, this is similar to launching a CUDA kernel with
- a thread block size of NUMTEAMS.
+ * ``omp_target_parallel_for_exec<NUMTHREADS>`` - Execute a loop in parallel using an ``omp target parallel for`` pragma with given number of threads per team; e.g.,
Member

I realize the original had this too, but the trailing "e.g." implies an example that seems to be missing. Either delete it or add one.

include/RAJA/policy/openmp/policy.hpp (outdated review thread, resolved)
@rhornung67 (Member) left a comment

@rchen20 I think I accidentally removed @trws's comment about renaming 'Threads' to 'ThreadsPerTeam' by clicking the wrong button. Sorry about that. In addition, I think 'Teams' would be better named 'NumTeams'.

- #pragma omp target teams distribute parallel for num_teams(Teams) \
- schedule(static, 1) map(to \
- : body)
+ auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
Member

Would 'numteams' or 'nteams' be a better name for this variable? 'teamnum' seems to imply a team id or similar.

@@ -117,8 +117,8 @@ struct Reduce_Data {
explicit Reduce_Data(T defaultValue, T identityValue, Offload_Info &info)
: value(identityValue),
device{reinterpret_cast<T *>(
- omp_target_alloc(Teams * sizeof(T), info.deviceID))},
- host{new T[Teams]}
+ omp_target_alloc(Threads * sizeof(T), info.deviceID))},
Member

Is this what we want here? I thought we wanted a data item per team (i.e., thread block), not per thread.

Member

That is true, and that is still what it is actually doing.
Maybe we should add a constexpr MaxNumTeams = ThreadsPerTeam to the reducer to make it clearer that this is happening.
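For example, the rename might look something like this inside the reducer data struct (a sketch using the names from this thread, not the final code):

// Same value as before; the name now matches how it is used: one slot per team.
static constexpr unsigned int MaxNumTeams = ThreadsPerTeam;

// device{reinterpret_cast<T *>(omp_target_alloc(MaxNumTeams * sizeof(T), info.deviceID))},
// host{new T[MaxNumTeams]}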

Member

@MrBurmark I think that would help when reading the code (at least for me!). @rchen20 please do that.

Member Author

I'm a bit confused about the reasoning for something like "constexpr MaxNumTeams = ThreadsPerTeam" in the reducer. Oddly enough, the reducer never specifies or demands the number of teams, so this constant would be compiled away. The reducer relies solely on the number of teams calculated in the execution policy.

If we want a real check to occur, I could try one of the following:
A. In the reducer, add an assert( omp_get_num_teams() <= ThreadsPerTeam ) before the omp pragma in ~TargetReduce().

B. Create a NumTeams member variable in the execution policy, and do a similar sanity check by accessing that member variable in each of the specializations (i.e. ReduceSum, ReduceMin, etc.).

Option A would put the sanity check closest to the potential bug culprit in the reducer, but it would still be relatively unclear who is setting the number of teams. Option B would be fairly clear, but every future specialization will need to implement similar checks, leaving us susceptible to an occasional "gotcha". We could also do both and have overkill on this problem. I'm open to any suggestions.

Member

Adding "constexpr MaxNumTeams = ThreadsPerTeam" is intended to be a change in name only. Hopefully the name change will avoid confusion by having the name match its usage in the reducer. This change is also simpler than re-implementing how the size of the allocation is decided/communicated, but doesn't solve the issue you raised about a real check.

To restate the problem, the ThreadsPerTeam check added to the forall is a band-aid over the lack of communication between the reducers and the forall. It relies on the reduction and execution policies having the same values of ThreadsPerTeam. This assumption is not checked and can cause out of bounds accesses when the ThreadsPerTeam in the reduction policy is smaller than the ThreadsPerTeam in the execution policy.

There are a couple of ways to deal with these issues that trade off performance, complexity, and capability.

A. Test inside the loop with omp_get_num_teams. The omp target reducers don't have ideal performance anyway, so adding an assert in their destructors may not be too bad. We would have to try it and see the performance impact.

B. Test outside the loop. Unfortunately, the execution policy and the reducers can only communicate through an intermediary. For example, the CUDA backend fixes essentially the same problem by communicating the number of blocks through thread-local global variables. The thread-local globals are set before the loop and are then read in the copy constructors of the reducers, which can then allocate the correct amount of memory. In this case no cap on the number of blocks is required because the number of blocks is communicated to the reducers. (A sketch of this pattern follows after option C.)

C. Remove the ThreadsPerTeam template parameter from the reduction policies and use an arbitrary global value for MaxNumTeams. Then MaxNumTeams can be enforced in forall without needing to check for a consistent value in the reducers. This does arbitrarily limit MaxNumTeams, but it was already limited by ThreadsPerTeam.
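A generic sketch of the hand-off pattern described in option B (names are hypothetical; this is not the CUDA backend's actual code):

namespace detail {
// Set by the forall before the target region; read by reducer copy constructors.
thread_local int g_omp_target_num_teams = 0;
}

// In the forall, before launching the loop:
//   detail::g_omp_target_num_teams = teamnum;
//
// In a reducer copy constructor, size the device buffer from it:
//   int nteams = detail::g_omp_target_num_teams;
//   device = static_cast<T *>(omp_target_alloc(nteams * sizeof(T), info.deviceID));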

Member

@rchen20 @MrBurmark since we removed the block size parameter from the CUDA reduction policies, we should not have it in the omp target reduction policies for consistency.

Member Author

@MrBurmark Thanks, the reasoning for constexpr makes sense now. Your A and B suggestions are pretty similar to mine, but stated much more clearly. I considered C as well, but didn't think that would be flexible enough.

@rhornung67 @MrBurmark I'll implement the constexpr for clarity, along with the reducer asserts that Jason and I suggested (the option A suggestions).

@MrBurmark (Member) previously approved these changes Dec 5, 2018

Looks good to me.

…ts in reducer to check for valid number of teams.
@rchen20 (Member Author) commented Dec 7, 2018

Ran the reducers with and without asserts using the normal target_forall and target_reduce test cases. No observable wall-clock time difference between the two. This may differ for larger data sets.

@rhornung67 (Member) left a comment

@rchen20 in addition to my other comments, this PR needs to have the OpenMP target reduce policies updated to remove the threads-per-team template parameter. We want this to look like the CUDA reduce policies where we recently removed the thread block size parameter.


set(RAJA_COMPILER "RAJA_COMPILER_XLC" CACHE STRING "")

set(CMAKE_CXX_COMPILER "/usr/tce/packages/xl/xl-beta-2018.10.29/bin/xlc++_r" CACHE PATH "")
Member

I don't think we need this host-config file. The 2018.11.02 one should be sufficient.

@@ -0,0 +1,32 @@
#!/bin/bash
Member

I think we can get rid of this build script. The 2018.11.02 one should be sufficient.

@rhornung67 (Member)

@rchen20 when will this PR be ready to merge?

@rchen20 (Member Author) commented Dec 21, 2018

I ran into a couple of issues when I tried to remove the threads-per-team template parameter from OpenMP target reduce:

  • The only clean way I can think of to pass the threads-per-team (or any other team info) to the reducer is via a global variable. I tried the simple technique of setting OMP_NUM_THREADS and retrieving it via omp_get_max_threads(), but that call only returns the maximum number of available threads, which can vary over time. The only problem with having a global variable is compiling and linking the .cpp file (in RAJA/src/) in which it resides. I tried throwing this file into BLT, but it tries to nvlink it, which is unnecessary and fails; that is what I'm trying to work through at the moment.

  • There is also a test case which uses omp_target_reduce, but does not use omp_target_parallel_for_exec. If I remove threads-per-team from the reducer, this test case will fail, because threads-per-team is initialized in omp_target_parallel_for_exec. Once I get the previous linking issue solved, I can attempt to somehow gracefully fail on this test case.

@trws (Member) commented Dec 21, 2018 via email

@rchen20 (Member Author) commented Dec 22, 2018

Yes, it is for the calculation of the number of teams. I might be able to avoid doing this check within the parallel regions. If so, I can get all the checks done on the CPU, and may not need to pass any threadprivate variables to omp.

Robert Chang Che Chen added 4 commits January 10, 2019 17:46
…ver-allocate target_reduce array to max CUDA threads per block size 1024. May need to revisit this if performance declines or if max CUDA changes. Updated scripts for xl 11.26.
Conflicts:
	docs/sphinx/user_guide/feature/policies.rst
	include/RAJA/policy/openmp/policy.hpp
	include/RAJA/policy/openmp_target/forall.hpp
	include/RAJA/policy/openmp_target/reduce.hpp
	test/unit/omp-target/test-nested-reduce.cpp
	test/unit/omp-target/test-reductions.cpp
@rchen20 (Member Author) commented Jan 11, 2019

@trws @MrBurmark @davidbeckingsale @rhornung67 Hello, I made some updates to make the omp_target_reduce API look the same as CUDA's reductions; namely, the user no longer needs to specify ThreadsPerTeam when declaring an omp target reduction object. To do this, the reducer now allocates, by default, an array sized to the maximum number of CUDA threads per block (1024). If the parallel-for execution policy notices that the user exceeds this number, it will readjust to 1024 automatically. Please review these changes and let me know what you think. Thanks!
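Roughly, the scheme described above might look like the following (an illustrative sketch; the constant name and the exact clamping logic are one plausible reading of the comment, not necessarily the merged code):

// Reducers always allocate one slot per team, sized to the CUDA maximum
// threads per block, so no template parameter is needed:
constexpr int MaxNumTeams = 1024;
// device{reinterpret_cast<T *>(omp_target_alloc(MaxNumTeams * sizeof(T), info.deviceID))},
// host{new T[MaxNumTeams]}

// The forall clamps the computed team count so writes stay inside that buffer:
// auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
// if (teamnum > MaxNumTeams) { teamnum = MaxNumTeams; }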

@rchen20 (Member Author) commented Feb 6, 2019

@trws @davidbeckingsale @MrBurmark @rhornung67 Just a friendly reminder to please look over this PR. Thanks!

@MrBurmark (Member) left a comment

Shall we approve as an improvement, but remember to revisit this later?

@rhornung67 (Member) left a comment

I agree with @MrBurmark. Let's get this in after one or two more reviewers approve. But, continue to evaluate it via perf suite, etc.

@rchen20 dismissed trws’s stale review February 14, 2019 20:08

Addressed in 0.7 release.

@rchen20 merged commit 6659bd6 into develop Feb 14, 2019
@rchen20 deleted the remotes/origin/task/chen59/omptarget branch February 14, 2019 20:09