Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Policy performance test cleanup #39

Merged
merged 18 commits into from Aug 19, 2016

Conversation

trws
Copy link
Member

@trws trws commented Jun 5, 2016

This is a clean rework of the policy as iterator consumer experiment. It has all of begin/end, containers, iterators, rangesegment, listsegment and indirection array implemented as iterables along with their Icount variants. The result is about half as much code, and some of what's here can probably be factored out if it is determined to be a worthwhile approach.

@trws
Copy link
Member Author

trws commented Jun 5, 2016

Oh, as a note, the IndexSets could go through this as well, but the rules for the construction of their policies is sufficiently different that I didn't want to just do that without some feedback on the preferred mechanism. This should give us a good read on the optimization impacts at least, indexset stuff can come later.

@ghost
Copy link

ghost commented Jun 8, 2016

Tom, did you say you had run this on the GPU? I can't remember.

@trws
Copy link
Member Author

trws commented Jun 8, 2016

I have a version that runs on the GPU for the most part, but I'm working the kinks out of it now that rzhas is back up.

@trws trws force-pushed the policy-performance-test-cleanup branch from ba7ba35 to bee4d26 Compare June 12, 2016 22:58
@trws
Copy link
Member Author

trws commented Jun 13, 2016

Folks, I've decided to pare this down to the absolute base refactor for now, so you can think of this as a mark 1. It's lacking wrappers to remove the need for the Icount variants, and it doesn't support callables as policies, but as it stands it's a huge savings in code size and requirements for new policies and segments, while retaining the exact same original overloading resolution approach. It may still need a bit of work as I haven't tested it on BGQ ( I actually don't have access) but it's ready for review.

@ghost
Copy link

ghost commented Jun 13, 2016

I'll test on Wednesday.

trws added 9 commits June 21, 2016 11:23
This version uses an iterator-based interface for both RangeSegment and
ListSegment execution along with policy implementations for simd, serial
and the standard OpenMP options.  There are extra "range" functions in
each of the policies for now, and the iterator-based version does not
implement the Icount variants yet, but at least for kripke it looks all
good so far.  Lulesh is harder to tell, because the indexsets are still
implemented the original way, that said, that also seems promising so
far.
@trws trws force-pushed the policy-performance-test-cleanup branch from 1f143ae to 4ad2d78 Compare June 21, 2016 18:31
/// Dependency graph node dtor.
///
~DepGraphNode() { if (m_semaphore_value) free(m_semaphore_value); }
m_semaphore_value(0) { }
Copy link

@ghost ghost Jun 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the alignment of m_semaphore_value in the old version was an attempt to prevent thrashing. Since each core may own it's own semaphore, we did not want multiple semaphores to accidentally fall in the same 'coherence page', causing the page to move around among owning cores as any of the semaphores are updated.

That said, the original implementation was 'int semaphore[numSegments]', and the current implementation automatically spreads the semaphores out, at least a little bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I'll add an alignment attribute to the semaphore value member shortly.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trws Or an alignment attribute to the object as a whole if that is easily doable.

Copy link
Member Author

@trws trws Jun 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's equally easy. Do you have a preference @Keasler?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trws Given the current implementation, Making sure the whole object is aligned to a coherence boundary makes good sense to me.

@trws
Copy link
Member Author

trws commented Jun 21, 2016

@Keasler, the update I just pushed should address most of that. I had to add a new macro to config to specify the alignment because GCC has some strange hangup with alignments greater than 128 using the c++11 standard keyword, but works with their attribute. Also this fixes a bug with the chrono timer where it is accidentally rounding to the nearest second rather than retaining the detail from the timer.

/// Satisfy one incoming dependency
///
void satisfyOne() {
if (m_semaphore_value > 0) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for m_semaphore_value > 0 should not be necessary from a code standpoint. The schedules are tightly controlled. If it is needed to make C++ recognize m_semapore_value can't be optimized away, that's different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mainly just protection against underflow to preserve the check against 0. You're right that it shouldn't be required when used correctly, but I thought it would be safer to have the check.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those places where we would throw an internal error if the RAJA_err mechanism were in place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It probably should be an assert rather than an if now that I think about it.

@trws trws assigned trws and unassigned ghost and rhornung67 Jun 24, 2016
@trws trws changed the title [DNM] Policy performance test cleanup Policy performance test cleanup Jun 26, 2016
trws added 2 commits June 30, 2016 17:20
…ce-test-cleanup

The massive merge of doom is complete, major testing in order just
because of the scale of this thing.
@trws
Copy link
Member Author

trws commented Jul 27, 2016

Ok, this one should be good to go now assuming all the tests pass. The merge was huge, so I'm not 100% confident that there aren't merge issues in there somewhere without the testing.

@@ -0,0 +1,185 @@
#ifndef RAJA_ITERATORS_HXX
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trws Please add the release statement to the top of this new file -- can be cut and pasted from some other file. Thanks.

The CMAKE_C_COMPILER variable needs to be set for consistency in library
usage, adding that makes both 16.0.109 and 17 work for our current unit
test suite.  In addition, the Iterators.hxx file now has the copyright
header applied to it.
@trws
Copy link
Member Author

trws commented Jul 27, 2016

Pending other comments, this last commit has the copyright header added, as well as some fixes to the intel host-config files. It now passes every test on all of our platforms but windows, and the known bug in the nested test with icpc-16.0.210, (see https://lc.llnl.gov/bamboo/browse/RAJA-GH6-ICC16H-9 for the LC test runs).

while (m_semaphore_value > 0) {
// TODO: an efficient wait would be better here, but the standard
// promise/future is not good enough
std::this_thread::yield();
Copy link

@ghost ghost Jul 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a key function point to optimize. Unfortunately there is no best approach. Do you want to add a comment for at least three possibilities to help people understand what the tradeoffs are?
(1) No waiting in the while loop will touch main memory 'too often', causing contention. (semaphore is volatile, which forces main memory touch).
(2) a small spin-loop on a volatile loop-control-variable is probably optimal, but the 'correct' length for that loop is going to be lambda-body and hardware dependent.
(3) The yield overhead is O/S dependent, so you never know what delay you will get. On some systems, the tasks may need to be very large for the yield overhead to not dominate execution time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly wouldn't hurt, but this is really a stopgap at the moment. At least for OpenMP, this should probably tie into the built-in task mechanism rather than being built on top, the same may be true for the other models as well.

@trws
Copy link
Member Author

trws commented Aug 19, 2016

The merge with develop turned out to be trivial. This should be good to go if I can get a second approval and someone to hit go. @davidbeckingsale?

@DavidPoliakoff
Copy link
Contributor

Approved

@trws trws merged commit 0850bb4 into LLNL:develop Aug 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants