-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Policy performance test cleanup #39
Conversation
Oh, as a note, the IndexSets could go through this as well, but the rules for the construction of their policies is sufficiently different that I didn't want to just do that without some feedback on the preferred mechanism. This should give us a good read on the optimization impacts at least, indexset stuff can come later. |
Tom, did you say you had run this on the GPU? I can't remember. |
I have a version that runs on the GPU for the most part, but I'm working the kinks out of it now that rzhas is back up. |
ba7ba35
to
bee4d26
Compare
Folks, I've decided to pare this down to the absolute base refactor for now, so you can think of this as a mark 1. It's lacking wrappers to remove the need for the Icount variants, and it doesn't support callables as policies, but as it stands it's a huge savings in code size and requirements for new policies and segments, while retaining the exact same original overloading resolution approach. It may still need a bit of work as I haven't tested it on BGQ ( I actually don't have access) but it's ready for review. |
I'll test on Wednesday. |
This version uses an iterator-based interface for both RangeSegment and ListSegment execution along with policy implementations for simd, serial and the standard OpenMP options. There are extra "range" functions in each of the policies for now, and the iterator-based version does not implement the Icount variants yet, but at least for kripke it looks all good so far. Lulesh is harder to tell, because the indexsets are still implemented the original way, that said, that also seems promising so far.
…simplify the design
1f143ae
to
4ad2d78
Compare
/// Dependency graph node dtor. | ||
/// | ||
~DepGraphNode() { if (m_semaphore_value) free(m_semaphore_value); } | ||
m_semaphore_value(0) { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of the alignment of m_semaphore_value in the old version was an attempt to prevent thrashing. Since each core may own it's own semaphore, we did not want multiple semaphores to accidentally fall in the same 'coherence page', causing the page to move around among owning cores as any of the semaphores are updated.
That said, the original implementation was 'int semaphore[numSegments]', and the current implementation automatically spreads the semaphores out, at least a little bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I'll add an alignment attribute to the semaphore value member shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trws Or an alignment attribute to the object as a whole if that is easily doable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's equally easy. Do you have a preference @Keasler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trws Given the current implementation, Making sure the whole object is aligned to a coherence boundary makes good sense to me.
@Keasler, the update I just pushed should address most of that. I had to add a new macro to config to specify the alignment because GCC has some strange hangup with alignments greater than 128 using the c++11 standard keyword, but works with their attribute. Also this fixes a bug with the chrono timer where it is accidentally rounding to the nearest second rather than retaining the detail from the timer. |
/// Satisfy one incoming dependency | ||
/// | ||
void satisfyOne() { | ||
if (m_semaphore_value > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check for m_semaphore_value > 0 should not be necessary from a code standpoint. The schedules are tightly controlled. If it is needed to make C++ recognize m_semapore_value can't be optimized away, that's different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mainly just protection against underflow to preserve the check against 0. You're right that it shouldn't be required when used correctly, but I thought it would be safer to have the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of those places where we would throw an internal error if the RAJA_err mechanism were in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. It probably should be an assert rather than an if now that I think about it.
…ce-test-cleanup The massive merge of doom is complete, major testing in order just because of the scale of this thing.
Ok, this one should be good to go now assuming all the tests pass. The merge was huge, so I'm not 100% confident that there aren't merge issues in there somewhere without the testing. |
…itting invalid instructions on rzhas under some circumstances
@@ -0,0 +1,185 @@ | |||
#ifndef RAJA_ITERATORS_HXX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trws Please add the release statement to the top of this new file -- can be cut and pasted from some other file. Thanks.
The CMAKE_C_COMPILER variable needs to be set for consistency in library usage, adding that makes both 16.0.109 and 17 work for our current unit test suite. In addition, the Iterators.hxx file now has the copyright header applied to it.
Pending other comments, this last commit has the copyright header added, as well as some fixes to the intel host-config files. It now passes every test on all of our platforms but windows, and the known bug in the nested test with icpc-16.0.210, (see https://lc.llnl.gov/bamboo/browse/RAJA-GH6-ICC16H-9 for the LC test runs). |
while (m_semaphore_value > 0) { | ||
// TODO: an efficient wait would be better here, but the standard | ||
// promise/future is not good enough | ||
std::this_thread::yield(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be a key function point to optimize. Unfortunately there is no best approach. Do you want to add a comment for at least three possibilities to help people understand what the tradeoffs are?
(1) No waiting in the while loop will touch main memory 'too often', causing contention. (semaphore is volatile, which forces main memory touch).
(2) a small spin-loop on a volatile loop-control-variable is probably optimal, but the 'correct' length for that loop is going to be lambda-body and hardware dependent.
(3) The yield overhead is O/S dependent, so you never know what delay you will get. On some systems, the tasks may need to be very large for the yield overhead to not dominate execution time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It certainly wouldn't hurt, but this is really a stopgap at the moment. At least for OpenMP, this should probably tie into the built-in task mechanism rather than being built on top, the same may be true for the other models as well.
The merge with develop turned out to be trivial. This should be good to go if I can get a second approval and someone to hit go. @davidbeckingsale? |
Approved |
This is a clean rework of the policy as iterator consumer experiment. It has all of begin/end, containers, iterators, rangesegment, listsegment and indirection array implemented as iterables along with their Icount variants. The result is about half as much code, and some of what's here can probably be factored out if it is determined to be a worthwhile approach.