Policy performance test cleanup #39

trws · 2016-06-05T20:43:02Z

This is a clean rework of the policy-as-iterator-consumer experiment. It implements begin/end pairs, containers, iterators, RangeSegment, ListSegment, and indirection arrays as iterables, along with their Icount variants. The result is about half as much code, and some of what's here can probably be factored out if this turns out to be a worthwhile approach.
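For readers skimming the thread, here is a minimal sketch of what "segments as iterables, policies as iterator consumers" could look like. It is not the code in this PR: the `sketch` namespace, the simplified `RangeSegment`, and the `seq_exec` tag are illustrative assumptions, and a `std::vector` stands in for a list segment.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

namespace sketch {

// Hypothetical stand-in for a range segment: iterating it yields the
// indices begin, begin + 1, ..., end - 1.
class RangeSegment {
 public:
  class iterator {
   public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = std::ptrdiff_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const value_type*;
    using reference = value_type;

    explicit iterator(value_type v) : m_value(v) {}
    value_type operator*() const { return m_value; }
    iterator& operator++() { ++m_value; return *this; }
    bool operator==(const iterator& o) const { return m_value == o.m_value; }
    bool operator!=(const iterator& o) const { return m_value != o.m_value; }

   private:
    value_type m_value;
  };

  RangeSegment(std::ptrdiff_t b, std::ptrdiff_t e) : m_begin(b), m_end(e) {}
  iterator begin() const { return iterator(m_begin); }
  iterator end() const { return iterator(m_end); }

 private:
  std::ptrdiff_t m_begin;
  std::ptrdiff_t m_end;
};

// Hypothetical sequential policy tag.
struct seq_exec {};

// The policy consumes a begin/end iterator pair...
template <typename Iter, typename Body>
void forall(seq_exec, Iter first, Iter last, Body body) {
  for (; first != last; ++first) body(*first);
}

// ...and anything with begin()/end() is forwarded to that one overload.
template <typename Iterable, typename Body>
void forall(seq_exec policy, const Iterable& iterable, Body body) {
  forall(policy, std::begin(iterable), std::end(iterable), body);
}

}  // namespace sketch
```

With this shape, a new segment type only has to provide `begin()` and `end()`, and a new policy only has to provide the iterator-pair overload.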

trws · 2016-06-05T20:44:42Z

Oh, as a note, the IndexSets could go through this as well, but the rules for constructing their policies are sufficiently different that I didn't want to do that without some feedback on the preferred mechanism. This should at least give us a good read on the optimization impacts; the IndexSet work can come later.

ghost · 2016-06-08T18:56:50Z

Tom, did you say you had run this on the GPU? I can't remember.

trws · 2016-06-08T18:57:47Z

I have a version that runs on the GPU for the most part, but I'm working the kinks out of it now that rzhas is back up.

trws · 2016-06-13T21:44:22Z

Folks, I've decided to pare this down to the absolute base refactor for now, so you can think of this as a mark 1. It's lacking wrappers to remove the need for the Icount variants, and it doesn't support callables as policies, but as it stands it's a huge savings in code size and in the requirements for new policies and segments, while retaining the exact same overload resolution approach as the original. It may still need a bit of work, as I haven't tested it on BGQ (I actually don't have access), but it's ready for review.

ghost · 2016-06-13T21:46:50Z

I'll test on Wednesday.

This version uses an iterator-based interface for both RangeSegment and ListSegment execution, along with policy implementations for simd, serial and the standard OpenMP options. There are extra "range" functions in each of the policies for now, and the iterator-based version does not implement the Icount variants yet, but at least for kripke it all looks good so far. Lulesh is harder to judge because the indexsets are still implemented the original way; that said, it also seems promising so far.
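As a rough illustration of per-policy implementations over one iterator interface, the sketch below dispatches on a policy tag. The tag names and the `sketch` namespace are assumptions for illustration, not the identifiers used in this branch, and the OpenMP overload assumes random-access iterators.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical policy tags.
struct seq_exec {};
struct omp_parallel_for_exec {};

// Serial policy: a plain loop over the iterator pair.
template <typename Iter, typename Body>
void forall(seq_exec, Iter first, Iter last, Body body) {
  for (Iter it = first; it != last; ++it) body(*it);
}

// OpenMP policy: requires random-access iterators so the loop can be
// written over an integer offset. If OpenMP is not enabled at compile
// time, the pragma is ignored and the loop simply runs serially.
template <typename Iter, typename Body>
void forall(omp_parallel_for_exec, Iter first, Iter last, Body body) {
  const std::ptrdiff_t n = last - first;
#pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    body(first[i]);
  }
}

}  // namespace sketch
```

Both overloads take the same arguments after the tag, so switching policies at a call site is a one-token change.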


ghost · 2016-06-21T18:45:57Z

include/RAJA/DepGraphNode.hxx

-   /// Dependency graph node dtor.
-   ///
-   ~DepGraphNode() { if (m_semaphore_value) free(m_semaphore_value); }
+        m_semaphore_value(0) { }


The purpose of the alignment of m_semaphore_value in the old version was to prevent thrashing. Since each core may own its own semaphore, we did not want multiple semaphores to accidentally fall in the same 'coherence page', causing the page to move around among owning cores as any of the semaphores are updated.

That said, the original implementation was 'int semaphore[numSegments]', and the current implementation automatically spreads the semaphores out, at least a little bit.

Fair point. I'll add an alignment attribute to the semaphore value member shortly.

@trws Or an alignment attribute to the object as a whole if that is easily doable.

That's equally easy. Do you have a preference @Keasler?

@trws Given the current implementation, making sure the whole object is aligned to a coherence boundary makes good sense to me.
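A sketch of whole-object alignment, for reference. The macro names, the 256-byte boundary, and the simplified `DepNode` are all assumptions for illustration; the real boundary is platform dependent. The macro uses the compiler attribute on GCC-compatible compilers and falls back to the standard `alignas` keyword elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical coherence boundary in bytes; the right value depends on
// the target hardware.
#define SKETCH_COHERENCE_BYTES 256

// Hypothetical alignment macro: compiler attribute where available,
// standard alignas otherwise.
#if defined(__GNUC__)
#define SKETCH_ALIGNED(N) __attribute__((aligned(N)))
#else
#define SKETCH_ALIGNED(N) alignas(N)
#endif

namespace sketch {

// Aligning the whole object pads its size up to a multiple of the
// boundary, so two nodes in an array never share a coherence unit.
struct SKETCH_ALIGNED(SKETCH_COHERENCE_BYTES) DepNode {
  volatile int m_semaphore_value = 0;
};

static_assert(alignof(DepNode) == SKETCH_COHERENCE_BYTES,
              "node must start on a coherence boundary");
static_assert(sizeof(DepNode) % SKETCH_COHERENCE_BYTES == 0,
              "node size must be a multiple of the boundary");

}  // namespace sketch
```

The trade-off is memory: every node occupies a full boundary-sized slot even though the semaphore itself is a single int.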

trws · 2016-06-21T19:15:52Z

@Keasler, the update I just pushed should address most of that. I had to add a new macro to config to specify the alignment because GCC has some strange hangup with alignments greater than 128 using the C++11 standard keyword, but works with its own attribute. This also fixes a bug with the chrono timer, where it was accidentally rounding to the nearest second rather than retaining the detail from the timer.
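For context on the timer fix, one common way sub-second detail gets lost with `std::chrono` is casting to an integral seconds type, which discards the fraction. The sketch below contrasts that with a floating-point duration. The function names are hypothetical and this is not the timer code from the commit.

```cpp
#include <chrono>

namespace sketch {

// Casting to std::chrono::seconds (an integral representation) drops
// the fractional part of the elapsed time.
template <typename Duration>
long long whole_seconds(Duration elapsed) {
  return std::chrono::duration_cast<std::chrono::seconds>(elapsed).count();
}

// Casting to a double-based duration keeps the sub-second detail.
template <typename Duration>
double fractional_seconds(Duration elapsed) {
  return std::chrono::duration_cast<std::chrono::duration<double>>(elapsed)
      .count();
}

}  // namespace sketch
```

For an elapsed time of 1500 ms, `whole_seconds` reports 1 while `fractional_seconds` reports 1.5.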

ghost · 2016-06-21T19:19:18Z

include/RAJA/DepGraphNode.hxx

+   /// Satisfy one incoming dependency
+   ///
+   void satisfyOne() {
+       if (m_semaphore_value > 0) {


The check for m_semaphore_value > 0 should not be necessary from a code standpoint. The schedules are tightly controlled. If it is needed to make C++ recognize that m_semaphore_value can't be optimized away, that's different.

It's mainly just protection against underflow to preserve the check against 0. You're right that it shouldn't be required when used correctly, but I thought it would be safer to have the check.

This is one of those places where we would throw an internal error if the RAJA_err mechanism were in place.

Agreed. It probably should be an assert rather than an if now that I think about it.
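A sketch of the assert-based variant discussed above. It swaps in `std::atomic<int>` for the counter, which is an assumption made for the example rather than what the diff uses, and the simplified `DepNode` class and `ready()` helper are hypothetical.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

class DepNode {
 public:
  explicit DepNode(int dependencies) : m_semaphore_value(dependencies) {}

  // Satisfy one incoming dependency. Underflow indicates a scheduling
  // bug, so it is treated as an internal error in debug builds instead
  // of being silently skipped.
  void satisfyOne() {
    const int before = m_semaphore_value.fetch_sub(1);
    assert(before > 0 && "satisfyOne() called with no pending dependency");
    (void)before;  // silence unused-variable warnings when NDEBUG is set
  }

  // True once every incoming dependency has been satisfied.
  bool ready() const { return m_semaphore_value.load() == 0; }

 private:
  std::atomic<int> m_semaphore_value;
};

}  // namespace sketch
```

Note that with `NDEBUG` defined the check disappears entirely, so an underflow would go undetected in release builds.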


trws · 2016-07-27T17:58:30Z

Ok, this one should be good to go now assuming all the tests pass. The merge was huge, so I'm not 100% confident that there aren't merge issues in there somewhere without the testing.


rhornung67 · 2016-07-27T21:37:36Z

include/RAJA/Iterators.hxx

@@ -0,0 +1,185 @@
+#ifndef RAJA_ITERATORS_HXX


@trws Please add the release statement to the top of this new file -- can be cut and pasted from some other file. Thanks.

The CMAKE_C_COMPILER variable needs to be set for consistency in library usage; adding that makes both 16.0.109 and 17 work for our current unit test suite. In addition, the Iterators.hxx file now has the copyright header applied to it.

trws · 2016-07-27T22:38:57Z

Pending other comments, this last commit adds the copyright header as well as some fixes to the Intel host-config files. It now passes every test on all of our platforms except Windows, apart from the known bug in the nested test with icpc-16.0.210 (see https://lc.llnl.gov/bamboo/browse/RAJA-GH6-ICC16H-9 for the LC test runs).

ghost · 2016-07-27T22:45:23Z

include/RAJA/DepGraphNode.hxx

+    while (m_semaphore_value > 0) {
+      // TODO: an efficient wait would be better here, but the standard
+      // promise/future is not good enough
+      std::this_thread::yield();


This is going to be a key function point to optimize. Unfortunately there is no best approach. Do you want to add a comment for at least three possibilities to help people understand what the tradeoffs are?
(1) No waiting in the while loop will touch main memory 'too often', causing contention. (semaphore is volatile, which forces main memory touch).
(2) a small spin-loop on a volatile loop-control-variable is probably optimal, but the 'correct' length for that loop is going to be lambda-body and hardware dependent.
(3) The yield overhead is O/S dependent, so you never know what delay you will get. On some systems, the tasks may need to be very large for the yield overhead to not dominate execution time.

It certainly wouldn't hurt, but this is really a stopgap at the moment. At least for OpenMP, this should probably tie into the built-in task mechanism rather than being built on top of it; the same may be true for the other models as well.
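To make the tradeoffs above concrete, here is a sketch that combines options (2) and (3): a bounded spin followed by a yield loop. The function name and the `spin_limit` parameter are hypothetical, `std::atomic<int>` is swapped in for the volatile semaphore, and a suitable spin length would have to be tuned per platform and loop body.

```cpp
#include <atomic>
#include <thread>

namespace sketch {

// Waits until the semaphore reaches zero. Returns true if the wait
// finished during the spin phase, false if it fell through to the
// yield loop (or if spin_limit was not positive).
inline bool wait_until_zero(const std::atomic<int>& semaphore,
                            int spin_limit) {
  // Phase 1: short busy-wait, cheap when the dependency clears quickly.
  for (int i = 0; i < spin_limit; ++i) {
    if (semaphore.load(std::memory_order_acquire) == 0) return true;
  }
  // Phase 2: hand the core back to the OS between checks. The cost of
  // each yield is OS dependent.
  while (semaphore.load(std::memory_order_acquire) != 0) {
    std::this_thread::yield();
  }
  return false;
}

}  // namespace sketch
```

A `spin_limit` of zero reduces this to the pure yield loop in the diff, which makes it easy to compare the two strategies in a benchmark.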


trws · 2016-08-19T20:22:42Z

The merge with develop turned out to be trivial. This should be good to go if I can get a second approval and someone to hit go. @davidbeckingsale?

DavidPoliakoff · 2016-08-19T21:10:15Z

Approved

trws force-pushed the policy-performance-test-cleanup branch from ba7ba35 to bee4d26 Compare June 12, 2016 22:58

trws assigned trws, davidbeckingsale, ghost and rhornung67 Jun 13, 2016

trws added 9 commits June 21, 2016 11:23

icount replaced, ready for some performance tests

ee6e58c

small updates for stability and speed

6be044e

added strided range iterators

7e9ec1a

removing some dead code

0916adb

partial cuda support and support for overloads

8e1892e

adding cuda header

a8d4de7

moving to an all forall based approach, and reducing the wrappers to …

de4ca98

…simplify the design

basically all ported to iterator default policies, including IndexSet…

4ad2d78

… execution

trws force-pushed the policy-performance-test-cleanup branch from 1f143ae to 4ad2d78 Compare June 21, 2016 18:31

ghost reviewed Jun 21, 2016

fix for alignment, timer casting and cuda foralls

2359b13

ghost reviewed Jun 21, 2016

bringing cilk+ into line

ab22b4a

trws mentioned this pull request Jun 22, 2016

Interface abstraction hardening tracking issue #63

Open

14 tasks

trws unassigned davidbeckingsale and trws Jun 24, 2016

trws assigned trws and unassigned ghost and rhornung67 Jun 24, 2016

trws changed the title ~~[DNM] Policy performance test cleanup~~ Policy performance test cleanup Jun 26, 2016

trws added 2 commits June 30, 2016 17:20

Merge branch 'develop' into policy-performance-test-cleanup

0b939af

Merge remote-tracking branch 'upstream/develop' into policy-performan…

c66b68c

…ce-test-cleanup The massive merge of doom is complete, major testing in order just because of the scale of this thing.

trws and others added 3 commits July 27, 2016 11:09

fixes for merge issue

930d1cb

Merge branch 'develop' into policy-performance-test-cleanup

3257e56

switching clang host-config to a reliable version, old version was em…

f0d27dd

…itting invalid instructions on rzhas under some circumstances

rhornung67 reviewed Jul 27, 2016

icc hostconfig repair and adding copyright header

1f98335


ghost reviewed Jul 27, 2016

Merge remote-tracking branch 'upstream/develop' into policy-performan…

c11b059

…ce-test-cleanup

trws merged commit 0850bb4 into LLNL:develop Aug 19, 2016
