faster pairwise summation #4069
Conversation
Force-pushed from d04ca1f to f531248
data = data[16 .. $];
}
else store[idx] = sumPairwiseN!(16, false, F)(data);
foreach (_; 0 .. cast(uint)bsf(k+1))
This should best be a while loop checking the limit of idx, so there's only one iteration variable.
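For illustration, a minimal sketch of the suggested shape, written as a hypothetical free function (in the PR, store and idx are locals of the enclosing sum routine):

```d
import core.bitop : bsf;

// After pushing block k, bsf(k + 1) adjacent partial sums need merging;
// bounding a while loop by idx avoids a second iteration variable.
void collapseStore(F)(F[] store, ref size_t idx, size_t k)
{
    immutable target = idx - bsf(k + 1);
    while (idx > target)
    {
        store[idx - 2] += store[idx - 1];
        --idx;
    }
}

unittest
{
    double[4] store = [1, 2, 3, 4];
    size_t idx = 4;
    collapseStore(store[], idx, 3);   // bsf(4) == 2, so two merges
    assert(idx == 2 && store[1] == 2 + 3 + 4);
}
```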
done, the others also
@andralex Thanks for the comments, I'll fix them soon. Are you interested in more things like this? I do love writing fast implementations of simple algorithms, but e.g. if you compare this implementation to the old one, it's a lot harder to understand / maintain.
@John-Colvin depends on the gains. At some point we do need to add some benchmarks to Phobos, and let's defer performance work to after that. In this case I'd say eliminating the unused iteration variable may arguably lead to code that's easier to understand and maintain, too.
@andralex I don't mean your suggestions make it harder to maintain/understand (quite the opposite actually), I was comparing to the old code, which was very simple. If we had a framework for benchmarking then each performance improvement pull could be required to add an attempt at best/worst/mediocre case benchmarks for the new implementation. It's pretty boring writing benchmarks without being allowed to optimise the code, so it might be good to bundle the work together.
Force-pushed from f5f037e to 22bedb9
Ewwww, found a nasty bug in std.range.Take.
The std.range.Take bug was introduced in #2158, which is a great example of why unittests should actually test something, not just do something to check it compiles.
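The general point, as a hypothetical illustration (not the actual test from #2158):

```d
import std.algorithm.comparison : equal;
import std.range : take;

// Merely doing something: this passes even if take is broken, because
// nothing about the result is ever checked.
unittest
{
    auto r = [0, 1, 2, 3, 4].take(3);
    foreach (e; r) { }
}

// Actually testing something: the result is compared to the expected
// values, so a regression fails loudly.
unittest
{
    assert([0, 1, 2, 3, 4].take(3).equal([0, 1, 2]));
}
```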
F[64] store = void;
size_t idx = 0;

auto collapseStore(T)(T k) |
@andralex I had to make this a template because size_t is no good as a generic index/length type for ranges (e.g. std.range.iota of ulong on 32-bit). I'm now unsure what to do with e.g. line 4087, which uses size_t as a counter (it also makes me reconsider my assumptions about the necessary store length). Should I use ulong? Do we have some sort of standard as to how long ranges are allowed to be while still being supported by Phobos? With the current state of affairs, even something as trivial as walkLength is limited to 2^^32 - 1 on 32-bit platforms.
My thinking in this matter is to avoid implementation aggravation. If someone needs ranges longer than 4B, they need to switch to a 64-bit platform. I think dealing with lengths that are non-size_t is just wrong.
Agreed. It overcomplicates life to support lengths of any type other than size_t.
Ok, that's fine, I just needed a decision. I'll sort this out now.
On second thoughts, what does that mean for iota(5L) or iota(long.max) on 32-bit? What type should the length be?
IMO, if we go with enforcing length to be size_t, then iota(long.max) should simply have no length
+1
This is pretty much already addressed. Somehow got stalled: #4013
Iota with longs should probably not offer length on 32-bit systems. Or just assert and truncate.
@andralex On x86 or on x64 as well?
On 24 Mar 2016 8:04 pm, "Andrei Alexandrescu" notifications@github.com wrote:

In std/algorithm/iteration.d, #4069 (comment):

    {
    static assert (isFloatingPoint!Result);
    switch (r.length)
    import core.bitop : bsf;
    // Works for r with at least length < 2^^(64 + log2(16)), in keeping with the use of size_t
    // elsewhere in std.algorithm and std.range on 64 bit platforms. The 16 in log2(16) comes
    // from the manual unrolling in sumPairWise16
    F[64] store = void;
    size_t idx = 0;
    auto collapseStore(T)(T k)

Iota with longs should probably not offer length.
#4013 takes the approach of asserting that the length of the range will fit in size_t, so long and ulong will work on 32-bit systems - but only if the resultant range is not too long. So, most common cases will work just fine, but if you try and do something like iota(long.min, long.max), then the assertion will fail. The downside, of course, is that when the problem does occur, it's only found at runtime, but in the vast majority of cases, you can continue to use long and ulong with iota and have length as we've had.
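A minimal sketch of that approach, using a hypothetical iota-like struct (the real implementation is in #4013):

```d
// Sketch: keep the count in the wide integer type internally, but
// expose a size_t length, asserting at runtime that the value fits
// on the current platform (the concern only bites on 32-bit targets).
struct LongIota
{
    long current, stop;

    @property bool empty() const { return current >= stop; }
    @property long front() const { return current; }
    void popFront() { ++current; }

    @property size_t length() const
    {
        immutable ulong len = cast(ulong)(stop - current);
        assert(len <= size_t.max, "range too long for a size_t length");
        return cast(size_t) len;
    }
}
```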
I am working on a summation module... could we hold this PR?
@9il: As far as I can see, this PR only improves the existing implementation, which is probably a good thing to have until your module arrives in Phobos, no? (I presume the latter will take quite some time, review process and all.)
@klickverbot Ok
Please rebase as it now has merge conflicts.
rebased
~ Take.stringof);
-            return source[i .. j - i];
+            return source[i .. j];
Oops. Not sure what I was thinking when I wrote this. Kind of sad it took 2 years to catch it :/
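To make the off-by-slice concrete, a hypothetical illustration (not the original code path):

```d
// For i = 2, j = 6, opSlice should cover elements 2 .. 6 (length 4),
// but the old expression source[i .. j - i] sliced 2 .. 4 (length 2).
void main()
{
    auto source = [0, 1, 2, 3, 4, 5, 6, 7];
    size_t i = 2, j = 6;
    assert(source[i .. j].length == 4);     // the fixed expression
    assert(source[i .. j - i].length == 2); // what the buggy code computed
}
```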
Should we just close this then? Or does somebody want to port the mir version to Phobos?
Nitpick: without an official review process (see wiki) I cannot add new modules.
@John-Colvin could you please separate your fix for std.range.Take into its own PR?
I argue for this to be merged. Just because something in the future might be faster doesn't mean that improving what we have now shouldn't happen. This is a performance optimization of some pretty common code; it should be merged posthaste.
It's OK for this to be just a commit without a separate PR. At the same time, we need a filed issue to show this fix in the history.
If someone would like to have this in Phobos, feel free to merge. Nitpick: this algorithm allows us to remove Kahan summation, so it should be removed.
Oh, looks like I may be wrong about performance for the current state of mir :-/
OK, sorry to be an annoying complainer ...
Test results: the PR code is better for DMD and Mir. So, the PR LGTM. Do not forget to remove Kahan summation.
Benchmark code: http://dpaste.dzfl.pl/adb07eea2db4
Anyhow it would be good to merge this while waiting on more of the Mir additions to Phobos.
Ping - so, any other objections except that Kahan summation should be removed? I also vote for merging :)
I have none. I vote too.
Auto-merge toggled on
Why was Kahan not removed?
Was that a requirement?
The new pairwise summation can work with input ranges, so we don't need the Kahan summation algorithm, which was selected if the range has no random access.
Should I send a follow-up PR to remove Kahan?
Yeah, a follow-up makes sense.
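For context, this is the classic Kahan (compensated) summation that the new pairwise code makes redundant, as a minimal sketch rather than Phobos's exact implementation:

```d
// Kahan summation: carry the rounding error of each addition in a
// separate compensation term and feed it back into the next step.
F kahanSum(F)(in F[] data)
{
    F s = 0, c = 0;                // running sum and compensation
    foreach (x; data)
    {
        immutable y = x - c;       // re-apply the bits lost last time
        immutable t = s + y;       // low-order bits of y are lost here
        c = (t - s) - y;           // recover them into the compensation
        s = t;
    }
    return s;
}

unittest
{
    assert(kahanSum([1.0, 2.0, 3.0]) == 6.0);
}
```

Kahan was the fallback for ranges without random access; since the iterative pairwise version only needs a single forward pass, that fallback is no longer necessary.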
Just had a closer look at it and didn't see the random access and …
For those interested in speed, there are two things I discovered: LDC doesn't inline core.bitop.bsf, and I found that copy-pasting the bsf definition into the benchmark module and replacing sumPairwise16 with sumPairwise32 brought this implementation to speed-parity with mir.
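For reference, what bsf computes (standard core.bitop; shown only for context):

```d
import core.bitop : bsf;

void main()
{
    assert(bsf(0b1000) == 3); // index of the least-significant set bit
    assert(bsf(7) == 0);
    // The result is undefined for 0; in the PR, the argument k + 1 is
    // always non-zero, so that case cannot arise.
}
```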
We need to make …
that or fix cross-module inlining
This would not work for Phobos. Even if you place …
Hmm, GDC and DMD manage to inline it ok. I guess they're magic intrinsics. Any idea what's going on with the definition in dmd's druntime? Looks …
Minimized test case:

//import stdx.traits;
//import stdx.typecons;

template ReturnType(func...)
if (isCallable!func)
{
    enum ReturnType = true; // dummy
}

template isCallable(T...)
{
    static if (is(typeof(&T[0].opCall) V))
        enum isCallable = true;
}

auto sliced(Range, Lengths...)(Range range)
{
    alias S = Slice!(0, typeof(range));
    S ret;
    return ret;
}

void main() //unittest
{
    auto a = sliced(2);
    auto b = sliced(2);
    a.opIndexAssign(b);
}

struct Slice(size_t _N, _Range)
{
    auto opCall(int) { }
    auto opCall(string) { }
    auto opIndex() { return this; }
    void opIndexAssign(size_t RN, RRange)(Slice!(RN, RRange) )
    if (ReturnType!opIndex) // will test &opIndex().opCall in std.traits.isCallable
    {
    }
}
Using an iterative approach instead of a recursive one. When I originally wrote this I recorded an order-of-magnitude speedup on arrays and approx. 2x on more complicated ranges.
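A minimal, self-contained sketch of the iterative scheme, assuming a block size of 2 in place of the PR's unrolled 16 (names and layout simplified from the actual implementation):

```d
import core.bitop : bsf;

double sumPairwise(const(double)[] data)
{
    double[64] store = void; // stack of partial sums
    size_t idx = 0;          // number of live partial sums
    size_t k = 0;            // index of the block being pushed

    while (data.length >= 2)
    {
        // Base case: sum one block (the PR unrolls blocks of 16).
        store[idx++] = data[0] + data[1];
        data = data[2 .. $];

        // Binary-counter carry: merging one pair of adjacent partial
        // sums per trailing zero bit of k + 1 reproduces the balanced
        // pairwise association order without any recursion.
        foreach (_; 0 .. bsf(k + 1))
        {
            store[idx - 2] += store[idx - 1];
            --idx;
        }
        ++k;
    }

    double result = data.length ? data[0] : 0.0; // odd leftover element
    while (idx > 0)
        result += store[--idx];                  // fold remaining partials
    return result;
}

unittest
{
    import std.array : array;
    import std.range : iota;
    assert(sumPairwise(iota(1.0, 101.0).array) == 5050.0);
}
```

The recursive version repeatedly halves its input and so wants random access; the stack-of-partial-sums form consumes elements strictly left to right, which is what lets it work on input ranges.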