
Remotes/origin/task/chen59/omptarget #539

Merged
20 commits merged into develop on Feb 14, 2019

Conversation

@rchen20 (Member) commented Nov 6, 2018

This change makes the OpenMP target NUMTEAMS parameter look and feel like CUDA's NUMTHREADS, for consistency. The user specifies a number of threads per block for omp target directives. The number of OpenMP teams is calculated internally, using the data size and the number of threads per block.

Also included is an updated omp_target script for the most recent XL compiler (10.29).
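As a rough illustration of the team calculation described in the first paragraph (the helper below is hypothetical, not RAJA code), the number of teams is a ceiling division of the iteration count by the threads-per-block value:

#include <cstddef>

// Hypothetical helper, for illustration only: teams are derived from the
// iteration count and the user-supplied threads-per-block value.
constexpr std::size_t num_teams(std::size_t distance, std::size_t threads_per_block)
{
  return (distance + threads_per_block - 1) / threads_per_block;
}

// e.g. num_teams(1000000, 256) == 3907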

@codecov-io commented Nov 6, 2018

Codecov Report

Merging #539 into develop will not change coverage.
The diff coverage is n/a.


@@            Coverage Diff            @@
##           develop      #539   +/-   ##
=========================================
  Coverage   98.631%   98.631%           
=========================================
  Files           62        62           
  Lines         1242      1242           
=========================================
  Hits          1225      1225           
  Misses          17        17


@rhornung67 (Member)

@davidbeckingsale when you have a chance, please look this over and approve if you're good with it.

@davidbeckingsale (Member) left a comment

Looks good apart from ThreadC

@@ -60,6 +60,10 @@ template <unsigned int TeamSize>
struct Teams : std::integral_constant<unsigned int, TeamSize> {
};

template <unsigned int ThreadCount>
struct ThreadC : std::integral_constant<unsigned int, ThreadCount> {
Member

This could be named better: Threads or ThreadCount

@rhornung67 (Member)

@davidbeckingsale, @trws could you guys look over this PR when you have a few minutes? It's been languishing for a while. Thanks.

@davidbeckingsale (Member)

This is looking pretty good.

- omp_target_alloc(Teams * sizeof(T), info.deviceID))},
- host{new T[Teams]}
+ omp_target_alloc(Threads * sizeof(T), info.deviceID))},
+ host{new T[Threads]}
Member

I think this will break the omp target reducers when Teams > Threads.
Each team writes one value to this array, so when there are more teams than threads per team this will cause writes beyond the end of the buffer.
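A minimal sketch of the hazard being described (names and structure are assumed for illustration; this is not the RAJA code):

#include <omp.h>

template <typename T, unsigned Teams, unsigned Threads>
void reduce_buffer_sketch(int device_id)
{
  // The reduction buffer holds one slot per team but is sized to Threads,
  // as in the diff above.
  T *device = static_cast<T *>(omp_target_alloc(Threads * sizeof(T), device_id));
  #pragma omp target teams num_teams(Teams) thread_limit(Threads) is_device_ptr(device)
  {
    // Each team writes its own slot: out of bounds once a team id reaches Threads.
    device[omp_get_team_num()] = T{};
  }
  omp_target_free(device, device_id);
}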

@rchen20 (Member Author) Nov 29, 2018


@MrBurmark makes a great point. The code in its current form does guarantee that Teams <= Threads, but this is not obvious to RAJA developers (and even less obvious to RAJA users). The guarantee is specified in the omp_target_parallel_for_exec pattern (/include/RAJA/policy/openmp/target_forall.hpp):

auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
#pragma omp target teams distribute parallel for num_teams(teamnum) thread_limit(Threads) schedule(static, 1) map(to : body)

So far, this is the only pattern we use for omp_target policies, but if we develop other patterns they will require similar team computations or checks. There is another unused pattern (omp_target_parallel_for_exec_nt) in which neither teams nor threads are specified. But this should still work with reduce because the default number of teams is 1.

Potentially adding to the confusion is a piece of code I neglected to change:

#if defined(RAJA_ENABLE_TARGET_OPENMP)
template <size_t Teams>
struct omp_target_reduce
: make_policy_pattern_t<Policy::target_openmp, Pattern::reduce> {
};
#endif

"Teams" should be changed to "Threads", because that is how they are used with my other modifications.

Member

Won't teamnum be greater than Threads if distance > Threads*Threads?

Member Author

That is true! I suppose I can add an assert after that calculation, and maybe to be ultra safe, another assert in ~TargetReduce() and ~TargetReduceLoc() where teams are used during reduction.
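A sketch of what that guard might look like in the forall (illustrative only; the exact placement and form of the assert could differ):

#include <cassert>

auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
// Guard against computing more teams than the Threads-sized reduction buffer
// can hold; e.g. with Threads = 256 and distance = 100000, teamnum = 391 > 256.
assert(teamnum <= (int)Threads);
// ... followed by the target teams pragma shown above.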

@rhornung67 (Member)

@MrBurmark good point. Thanks!

@rchen20 (Member Author) commented Dec 5, 2018

@trws @MrBurmark @davidbeckingsale If you have time, would you mind looking at the latest small commits? Thanks!

@davidbeckingsale (Member) left a comment

As long as Jason is okay with the Reducers stuff, I'm happy.

@trws (Member) previously requested changes Dec 5, 2018

Aside from the small doc/wording changes this looks fine to me.

- * ``omp_target_parallel_for_exec<NUMTEAMS>`` - Execute a loop in parallel using an ``omp target parallel for`` pragma with given number of thread teams; e.g.,
- if a GPU device is available, this is similar to launching a CUDA kernel with
- a thread block size of NUMTEAMS.
+ * ``omp_target_parallel_for_exec<NUMTHREADS>`` - Execute a loop in parallel using an ``omp target parallel for`` pragma with given number of threads per team; e.g.,
Member

I realize the original had this too, but the trailing "e.g." implies an example that seems to be missing. Either delete it or add one.

include/RAJA/policy/openmp/policy.hpp (outdated review thread, resolved)
@rhornung67 (Member) left a comment

@rchen20 I think I accidentally removed @trws's comment about renaming 'Threads' to 'ThreadsPerTeam' by clicking the wrong button. Sorry about that. In addition, I think 'Teams' would be better named 'NumTeams'.

- #pragma omp target teams distribute parallel for num_teams(Teams) \
- schedule(static, 1) map(to \
- : body)
+ auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
Member

Would 'numteams' or 'nteams' be a better name for this variable? 'teamnum' seems to imply a team id or similar.

@@ -117,8 +117,8 @@ struct Reduce_Data {
explicit Reduce_Data(T defaultValue, T identityValue, Offload_Info &info)
: value(identityValue),
device{reinterpret_cast<T *>(
- omp_target_alloc(Teams * sizeof(T), info.deviceID))},
- host{new T[Teams]}
+ omp_target_alloc(Threads * sizeof(T), info.deviceID))},
Member

Is this what we want here? I thought we wanted a data item per team (i.e., thread block), not per thread.

Member

That is true, and that is still what it is actually doing.
Maybe we should add a constexpr MaxNumTeams = ThreadsPerTeam to the reducer to make it clearer that this is happening.
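For example, the rename might look something like this inside the reducer data struct (a sketch using the names from this thread, not the final code):

// Same value as before; the name now matches how it is used: one slot per team.
static constexpr unsigned int MaxNumTeams = ThreadsPerTeam;

// device{reinterpret_cast<T *>(omp_target_alloc(MaxNumTeams * sizeof(T), info.deviceID))},
// host{new T[MaxNumTeams]}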

Member

@MrBurmark I think that would help when reading the code (at least for me!). @rchen20 please do that.

Member Author

I'm a bit confused about the reasoning for something like "constexpr MaxNumTeams = ThreadsPerTeam" in the reducer. Oddly enough, the reducer never specifies or demands the number of teams, so this constant would be compiled away. The reducer relies solely on the number of teams calculated in the execution policy.

If we want a real check to occur, I could try one of the following:
A. In the reducer, add an assert( omp_get_num_teams() <= ThreadsPerTeam ) before the omp pragma in ~TargetReduce().

B. Create a NumTeams member variable in the execution policy, and do a similar sanity check by accessing that member variable in each of the specializations (i.e. ReduceSum, ReduceMin, etc.).

Option A would put the sanity check closest to the potential bug culprit in the reducer, but it would still be relatively unclear who is setting the number of teams. Option B would be fairly clear, but every future specialization will need to implement similar checks, leaving us susceptible to an occasional "gotcha". We could also do both and have overkill on this problem. I'm open to any suggestions.

Member

Adding "constexpr MaxNumTeams = ThreadsPerTeam" is intended to be a change in name only. Hopefully the name change will avoid confusion by having the name match its usage in the reducer. This change is also simpler than re-implementing how the size of the allocation is decided/communicated, but doesn't solve the issue you raised about a real check.

To restate the problem, the ThreadsPerTeam check added to the forall is a band-aid over the lack of communication between the reducers and the forall. It relies on the reduction and execution policies having the same values of ThreadsPerTeam. This assumption is not checked and can cause out of bounds accesses when the ThreadsPerTeam in the reduction policy is smaller than the ThreadsPerTeam in the execution policy.

There are a couple of ways to deal with these issues that trade off performance, complexity, and capability.

A. Test inside the loop with omp_get_num_teams. The omp target reducers don't have ideal performance anyway, so adding an assert in their destructors may not be too bad. We would have to try it and see the performance impact.

B. Test outside the loop. Unfortunately, the execution policy and the reducers can only communicate through an intermediary. For example, the CUDA backend fixes essentially the same problem by communicating the number of blocks through thread-local global variables. The thread-local globals are set before the loop and are then read in the copy constructors of the reducers, which can then allocate the correct amount of memory. In this case no cap on the number of blocks is required because the number of blocks is communicated to the reducers. (A sketch of this pattern follows after option C.)

C. Remove the ThreadsPerTeam template parameter from the reduction policies and use an arbitrary global value for MaxNumTeams. Then MaxNumTeams can be enforced in forall without needing to check for a consistent value in the reducers. This does arbitrarily limit MaxNumTeams, but it was already limited by ThreadsPerTeam.
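A generic sketch of the hand-off pattern described in option B (names are hypothetical; this is not the CUDA backend's actual code):

namespace detail {
// Set by the forall before the target region; read by reducer copy constructors.
thread_local int g_omp_target_num_teams = 0;
}

// In the forall, before launching the loop:
//   detail::g_omp_target_num_teams = teamnum;
//
// In a reducer copy constructor, size the device buffer from it:
//   int nteams = detail::g_omp_target_num_teams;
//   device = static_cast<T *>(omp_target_alloc(nteams * sizeof(T), info.deviceID));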

Member

@rchen20 @MrBurmark since we removed the block size parameter from the CUDA reduction policies, we should not have it in the omp target reduction policies for consistency.

Member Author

@MrBurmark Thanks, the reasoning for constexpr makes sense now. Your A and B suggestions are pretty similar to mine, but stated much more clearly. I considered C as well, but didn't think that would be flexible enough.

@rhornung67 @MrBurmark I'll implement the constexpr for clarity, along with the reducer asserts that Jason and I suggested (the option A suggestions).

@MrBurmark (Member) previously approved these changes Dec 5, 2018

Looks good to me.

…ts in reducer to check for valid number of teams.
@rchen20 (Member Author) commented Dec 7, 2018

Ran the reducers with and without asserts using the normal target_forall and target_reduce test cases. No observable wall-clock time difference between the two. This may differ for larger data sets.

@rhornung67 (Member) left a comment

@rchen20 in addition to my other comments, this PR needs to have the OpenMP target reduce policies updated to remove the threads-per-team template parameter. We want this to look like the CUDA reduce policies where we recently removed the thread block size parameter.


set(RAJA_COMPILER "RAJA_COMPILER_XLC" CACHE STRING "")

set(CMAKE_CXX_COMPILER "/usr/tce/packages/xl/xl-beta-2018.10.29/bin/xlc++_r" CACHE PATH "")
Member

I don't think we need this host-config file. The 2018.11.02 one should be sufficient.

@@ -0,0 +1,32 @@
#!/bin/bash
Member

I think we can get rid of this build script. The 2018.11.02 one should be sufficient.

@rhornung67 (Member)

@rchen20 when will this PR be ready to merge?

@rchen20 (Member Author) commented Dec 21, 2018

I ran into a couple of issues when I tried to remove the threads-per-team template parameter from OpenMP target reduce:

  • The only clean way I can think of to pass the threads-per-team (or any other team info) to the reducer is via a global variable. I tried the simple technique of setting OMP_NUM_THREADS and retrieving it via omp_get_max_threads(), but that call only returns the maximum number of available threads, which can vary over time. The only problem with having a global variable is compiling and linking the .cpp file (in RAJA/src/) in which it resides. I tried throwing this file into BLT, but it tries to nvlink it, which is unnecessary and fails; that is what I'm trying to work through at the moment.

  • There is also a test case which uses omp_target_reduce, but does not use omp_target_parallel_for_exec. If I remove threads-per-team from the reducer, this test case will fail, because threads-per-team is initialized in omp_target_parallel_for_exec. Once I get the previous linking issue solved, I can attempt to somehow gracefully fail on this test case.

@trws (Member) commented Dec 21, 2018 via email

@rchen20 (Member Author) commented Dec 22, 2018

Yes, it is for the calculation of the number of teams. I might be able to avoid doing this check within the parallel regions. If so, I can get all the checks done on the CPU, and may not need to pass any threadprivate variables to omp.

Robert Chang Che Chen added 4 commits January 10, 2019 17:46
…ver-allocate target_reduce array to max CUDA threads per block size 1024. May need to revisit this if performance declines or if max CUDA changes. Updated scripts for xl 11.26.
Conflicts:
	docs/sphinx/user_guide/feature/policies.rst
	include/RAJA/policy/openmp/policy.hpp
	include/RAJA/policy/openmp_target/forall.hpp
	include/RAJA/policy/openmp_target/reduce.hpp
	test/unit/omp-target/test-nested-reduce.cpp
	test/unit/omp-target/test-reductions.cpp
@rchen20 (Member Author) commented Jan 11, 2019

@trws @MrBurmark @davidbeckingsale @rhornung67 Hello, I made some updates to make the omp_target_reduce API look the same as CUDA's reductions; namely, the user no longer needs to specify ThreadsPerTeam when declaring an omp target reduction object. To do this, the reducer now allocates, by default, an array sized to the maximum number of CUDA threads per block (1024). If the parallel-for execution policy notices that the user exceeds this number, it will readjust to 1024 automatically. Please review these changes and let me know what you think. Thanks!
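Roughly, the scheme described above might look like the following (an illustrative sketch; the constant name and the exact clamping logic are one plausible reading of the comment, not necessarily the merged code):

// Reducers always allocate one slot per team, sized to the CUDA maximum
// threads per block, so no template parameter is needed:
constexpr int MaxNumTeams = 1024;
// device{reinterpret_cast<T *>(omp_target_alloc(MaxNumTeams * sizeof(T), info.deviceID))},
// host{new T[MaxNumTeams]}

// The forall clamps the computed team count so writes stay inside that buffer:
// auto teamnum = RAJA_DIVIDE_CEILING_INT( (int)distance, (int)Threads );
// if (teamnum > MaxNumTeams) { teamnum = MaxNumTeams; }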

@rchen20 (Member Author) commented Feb 6, 2019

@trws @davidbeckingsale @MrBurmark @rhornung67 Just a friendly reminder to please look over this PR. Thanks!

@MrBurmark (Member) left a comment

Shall we approve as an improvement, but remember to revisit this later?

@rhornung67 (Member) left a comment

I agree with @MrBurmark. Let's get this in after one or two more reviewers approve. But, continue to evaluate it via perf suite, etc.

@rchen20 dismissed trws’s stale review February 14, 2019 20:08

Addressed in 0.7 release.

@rchen20 merged commit 6659bd6 into develop Feb 14, 2019
@rchen20 deleted the remotes/origin/task/chen59/omptarget branch February 14, 2019 20:09