faster pairwise summation #4069

John-Colvin · 2016-03-09T14:31:51Z

Using an iterative approach instead of recursive. When I originally wrote this I recorded an order-of-magnitude speedup on arrays and approx. 2x on more complicated ranges.

andralex · 2016-03-09T16:50:51Z

std/algorithm/iteration.d

+                data = data[16 .. $];
+            }
+            else store[idx] = sumPairwiseN!(16, false, F)(data);
+            foreach (_; 0 .. cast(uint)bsf(k+1))


This should best be a while loop checking the limit of idx so there's only one iteration variable.

done, the others also

John-Colvin · 2016-03-10T00:22:35Z

@andralex Thanks for the comments, I'll fix them soon. Are you interested in more things like this? I do love writing fast implementations of simple algorithms, but e.g. if you compare this implementation to the old one, it's a lot harder to understand / maintain.

andralex · 2016-03-10T15:07:39Z

@John-Colvin depends on the gains. At some point we do need to add some benchmarks to Phobos, and let's defer performance work to after that.

In this case I'd say eliminating the unused iteration variable may arguably lead to code that's easier to understand and maintain, too.

John-Colvin · 2016-03-10T16:36:00Z

@andralex I don’t mean your suggestions make it harder to maintain/understand (quite the opposite actually), I was comparing to the old code, which was very simple.

If we had a framework for benchmarking then each performance improvement pull could be required to add an attempt at best/worst/mediocre case benchmarks for the new implementation. It’s pretty boring writing benchmarks without being allowed to optimise the code, so it might be good to bundle the work together.

On 10 Mar 2016, at 15:07, Andrei Alexandrescu notifications@github.com wrote:

@John-Colvin https://github.com/John-Colvin depends on the gains. At some point we do need to add some benchmarks to Phobos, and let's defer performance work to after that.

In this case I'd say eliminating the unused iteration variable may arguably lead to code that's easier to understand and maintain, too.

—
Reply to this email directly or view it on GitHub #4069 (comment).

John-Colvin · 2016-03-10T18:33:06Z

Ewwww, found a nasty bug in std.range.Take. Fixed, tested, added more testing for pairwise sum (which was how I found the bug, completely by chance), changed to use one less loop-variable as per @andralex's recommendation and factored the whole store-collapse in to a local function.

John-Colvin · 2016-03-10T18:37:39Z

The std.range.Take bug was introduced in #2158, which is a great example of why unittests should actually test something, not just do something to check it compiles.

John-Colvin · 2016-03-10T22:25:53Z

std/algorithm/iteration.d

+    F[64] store = void;
+    size_t idx = 0;
+
+    auto collapseStore(T)(T k)


@andralex I had to make this a template because size_t is no good as a generic index/length type for ranges (e.g. std.range.iota of ulong on 32bit).

I'm now unsure what to do with e.g. line 4087 which uses size_t as a counter (also it makes me reconsider my assumptions about the necessary store length). Should I use ulong? Do we have some sort of standard as to how long ranges are allowed to be while still being supported by phobos?

With the current state of affairs, even something as trivial as walkLength is limited to 2^^32 - 1 on 32 bit platforms.

My thinking in this matter is to avoid implementation aggravation. If someone needs ranges longer than 4B, they need to switch to a 64 bit platform. I think dealing with lengths that are non-size_t is just wrong.

Agreed. It overcomplicates life to support lengths of any type other than size_t.

Ok, that's fine, I just needed a decision. I'll sort this out now.

on second thoughts, what does that mean for iota(5L) or iota(long.max) on 32 bit? What type should the length be?

IMO, if we go with enforcing length to be size_t, then iota(long.max) should simply have no length

+1

This is pretty much already addressed. Somehow got stalled: #4013

Iota with longs should probably not offer length on 32-bit systems. Or just assert and truncate.

@andralex On x86 or on x64 as well?
On 24 Mar 2016 8:04 pm, "Andrei Alexandrescu" notifications@github.com
wrote:

In std/algorithm/iteration.d
#4069 (comment)
:

{

static assert (isFloatingPoint!Result);

switch (r.length)

import core.bitop : bsf;

// Works for r with at least length < 2^^(64 + log2(16)), in keeping with the use of size_t

// elsewhere in std.algorithm and std.range on 64 bit platforms. The 16 in log2(16) comes

// from the manual unrolling in sumPairWise16

F[64] store = void;

size_t idx = 0;

auto collapseStore(T)(T k)

Iota with longs should probably not offer length.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/D-Programming-Language/phobos/pull/4069/files/ef9f8338fe123c1514a61c5b4b50fbfb072632e4#r57379092

#4013 takes the approach of asserting that the length of the range will fit in size_t, so long and ulong will work on 32-bit systems - but only if the resultant range is not too long. So, most common cases will work just fine, but if you try and do something like iota(long.min, long.max), then the assertion will fail. The downside, of course, is that when the problem does occur, it's only found at runtime, but in the vast majority of cases, you can continue to use long and ulong with iota and have length as we've had.

9il · 2016-03-14T16:56:12Z

I am working on summation module... could we hold this PR?

dnadlinger · 2016-03-14T17:08:08Z

@9il: As far as I can see, this PR only improves the existing implementation, which is probably a good thing to have until your module arrives in Phobos, no? (I presume the latter will take quite some time, review process and all.)

9il · 2016-03-14T17:09:31Z

@klickverbot Ok

wilzbach · 2016-03-24T11:56:44Z

Please rebase as it now has merge conflicts.

John-Colvin · 2016-03-24T13:15:44Z

rebased

monarchdodra · 2016-03-24T18:30:43Z

std/range/package.d

                ~ Take.stringof);
-            return source[i .. j - i];
+            return source[i .. j];


Oops. Not sure what I was thinking when I wrote this. Kind of sad it took 2 years to catch it :/

9il · 2016-03-29T17:54:30Z

mir.las.sum was released. May be we can just put the link to mir in the docs :-) ? I expect that mir version would be significantly faster, especially with optimization for native CPU by LDC/GDC.

quickfur · 2016-03-29T18:01:13Z

Should we just close this then? Or does somebody want to port the mir version to Phobos?

wilzbach · 2016-03-29T18:03:22Z

Should we just close this then? Or does somebody want to port the mir version to Phobos?

AFAIK will @9il merge mir periodically to Phobos.

9il · 2016-03-29T18:05:47Z

AFAIK will @9il merge mir periodically to Phobos.

Nitpick: without official review process (see wiki) I can not add new modules.

wilzbach · 2016-03-29T18:05:49Z

@John-Colvin could you please separate your fix for Take into another PR?

JackStouffer · 2016-03-29T19:28:36Z

I argue for this to be merged. Just because something in the future might be faster doesn't mean that improving what we have now shouldn't happen. This is a performance optimization of some pretty common code, it should be merged posthaste.

9il · 2016-03-29T20:30:08Z

@John-Colvin could you please separate your fix for Take into another PR?

This is OK to have just a commit without separate PR. In the same time we need a filled issue to show this fix in the history.

9il · 2016-03-29T20:32:59Z

If someone would like to have this in Phobos feel free to merge.

Nitpick: this algorithm allows to remove Kahan summation, so it should be removed.

9il · 2016-03-29T20:46:17Z

Oh, looks like a may be wrong about performance for current state of mir :-/

wilzbach · 2016-03-30T04:48:54Z

This is OK to have just a commit without separate PR. In the same time we need a filled issue to show this fix in the history.

OK, sorry to be an annoying complainer ...

9il · 2016-03-30T07:33:54Z

Test results

The PR code is better for DMD and Mir 0.11.0-beta2 is better for LDC. The difference for LDC is small if bounds check are turned off and an array is large. Mir is 2x+ times faster when array length = 15.

So, the PR LGTM. Do not forget to remove Kahan summation.

Compiler	BoundsCheck	Array Length	Bench Count	Algorithm	Time
DMD	safe	10000	10000	std	466 ms
DMD	safe	10000	10000	PR	94 ms
DMD	safe	10000	10000	Mir	128 ms
DMD	safe	15	100000	std	69 ms
DMD	safe	15	100000	PR	73 ms
DMD	safe	15	100000	Mir	65 ms
DMD	off	10000	10000	std	407 ms
DMD	off	10000	10000	PR	43 ms
DMD	off	10000	10000	Mir	96 ms
DMD	off	15	100000	std	55 ms
DMD	off	15	100000	PR	61 ms
DMD	off	15	100000	Mir	63 ms
LDC	safe	10000	10000	std	306 ms
LDC	safe	10000	10000	PR	62 ms
LDC	safe	10000	10000	Mir	38 ms
LDC	safe	15	100000	std	38 ms
LDC	safe	15	100000	PR	59 ms
LDC	safe	15	100000	Mir	26 ms
LDC	off	10000	10000	std	307 ms
LDC	off	10000	10000	PR	39 ms
LDC	off	10000	10000	Mir	37 ms
LDC	off	15	100000	std	40 ms
LDC	off	15	100000	PR	56 ms
LDC	off	15	100000	Mir	22 ms

9il · 2016-03-30T07:40:43Z

Benchmark code http://dpaste.dzfl.pl/adb07eea2db4

DmitryOlshansky · 2016-04-01T10:51:36Z

Anyhow it would be good to merge this while waiting on more of MIR additions to Phobos

wilzbach · 2016-04-10T15:36:32Z

Ping - so any other objections except that khan summation should be removed? I also vote for merging :)

9il · 2016-04-10T15:37:22Z

Ping - so any other objections except that khan summation should be removed? I also vote for merging :)

I have not. Vote too

DmitryOlshansky · 2016-04-12T07:34:15Z

Auto-merge toggled on

9il · 2016-04-12T08:13:01Z

Auto-merge toggled on

Why Kahan was not removed?

DmitryOlshansky · 2016-04-12T08:19:19Z

Why Kahan was not removed?

Was that a requirement?

9il · 2016-04-12T08:31:03Z

Was that a requirement?

New pairwise summation can work with input ranges, so we don't need Kahan summation algorithms, which was selected of range has hot random access

wilzbach · 2016-04-12T09:21:22Z

Should I sent a follow-up PR to remove Kahan?

DmitryOlshansky · 2016-04-12T09:25:21Z

Yeah, a follow-up makes sense

wilzbach · 2016-04-12T09:32:00Z

New pairwise summation can work with input ranges, so we don't need Kahan summation algorithms, which was selected of range has hot random access

Just had a closer look at it and didn't see the random access

 static assert (isFloatingPoint!Result && isMutable!Result);
    Result c = 0;
    for (; !r.empty; r.popFront())
    {
        auto y = r.front - c;
        auto t = result + y;
        c = (t - result) - y;
        result = t;
    }
    return result;

and

$(LI If $(D ElementType!R) is a floating-point type and $(D R) is a
finite input range (but not a random-access range with slicing), then
$(D sum) uses the $(WEB en.wikipedia.org/wiki/Kahan_summation,
Kahan summation) algorithm.)

John-Colvin · 2016-04-12T09:58:58Z

For those interested in speed, there are two things I discovered:

LDC doesn't inline core.bitop.bsf
manually unrolling to 32 is even faster than 16

I found that copy-pasting the bsf definition in to the benchmark module and replacing sumPairwise16 with sumPairwise32 brought this implementation to speed-parity with mir.

9il · 2016-04-12T16:25:44Z

I found that copy-pasting the bsf definition in to the benchmark module

We need to make bsf and others templates

John-Colvin · 2016-04-12T16:28:05Z

that or fix cross-module inlining

9il · 2016-04-12T16:36:35Z

... or fix cross-module inlining

This would not work for Phobos. Even if you place bsf in module with summation bsf would be called from Phobos.

John-Colvin · 2016-04-12T22:02:27Z

Hmm, GDC and DMD manage to inline it ok. I guess they're magic intrinsics.

Any idea what's going on with the definition in dmd's druntime? Looks
seriously weird to me. On x86_64 size_t is ulong, so there's a separate
declaration and definition in the same module, in D? Then the definition
does a no-op cast and calls itself? Whut? I'm guessing the way intrinsics
work in DMD breaks language rules and is confusing me.
On 12 Apr 2016 5:36 pm, "Ilya Yaroshenko" notifications@github.com wrote:

... or fix cross-module inlining

This would not work for Phobos. Even if you place bsf in module with
summation bsf would be called from Phobos.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#4069 (comment)

Minized test case: //import stdx.traits; //import stdx.typecons; template ReturnType(func...) if (isCallable!func) { enum ReturnType = true; // dummy } template isCallable(T...) { static if (is(typeof(&T[0].opCall) V)) enum isCallable = true; } auto sliced(Range, Lengths...)(Range range) { alias S = Slice!(0, typeof(range)); S ret; return ret; } void main()//unittest { auto a = sliced(2); auto b = sliced(2); a.opIndexAssign(b); } struct Slice(size_t _N, _Range) { auto opCall(int) { } auto opCall(string) { } auto opIndex() { return this; } void opIndexAssign(size_t RN, RRange)(Slice!(RN, RRange) ) if (ReturnType!opIndex) // will test &opIndex().opCall in std.traits.isCallable { } }

John-Colvin force-pushed the patch-16 branch 2 times, most recently from d04ca1f to f531248 Compare March 9, 2016 15:10

andralex reviewed Mar 9, 2016
View reviewed changes

John-Colvin force-pushed the patch-16 branch 2 times, most recently from f5f037e to 22bedb9 Compare March 10, 2016 18:29

John-Colvin force-pushed the patch-16 branch from 22bedb9 to ef9f833 Compare March 10, 2016 22:19

John-Colvin reviewed Mar 10, 2016
View reviewed changes

9il added the math label Mar 14, 2016

John-Colvin added 2 commits March 24, 2016 13:11

fix broken std.range.Take slicing

f69a7d9

faster pairwise summation

678a511

John-Colvin force-pushed the patch-16 branch from ef9f833 to 678a511 Compare March 24, 2016 13:15

John-Colvin mentioned this pull request Mar 24, 2016

Fix Issue 12731 - Infinite range slices are not themselves sliceable #2158

Merged

monarchdodra reviewed Mar 24, 2016
View reviewed changes

wilzbach mentioned this pull request Apr 11, 2016

Detect performance regressions libmir/mir#75

Closed

DmitryOlshansky merged commit 1d2e88c into dlang:master Apr 12, 2016

Uh oh!

faster pairwise summation #4069

faster pairwise summation #4069

Uh oh!

Conversation

John-Colvin commented Mar 9, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

John-Colvin commented Mar 10, 2016

Uh oh!

andralex commented Mar 10, 2016

Uh oh!

John-Colvin commented Mar 10, 2016

Uh oh!

John-Colvin commented Mar 10, 2016

Uh oh!

John-Colvin commented Mar 10, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

9il commented Mar 14, 2016

Uh oh!

dnadlinger commented Mar 14, 2016

Uh oh!

9il commented Mar 14, 2016

Uh oh!

wilzbach commented Mar 24, 2016

Uh oh!

John-Colvin commented Mar 24, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

9il commented Mar 29, 2016

Uh oh!

quickfur commented Mar 29, 2016

Uh oh!

wilzbach commented Mar 29, 2016

Uh oh!

9il commented Mar 29, 2016

Uh oh!

wilzbach commented Mar 29, 2016

Uh oh!

JackStouffer commented Mar 29, 2016

Uh oh!

9il commented Mar 29, 2016

Uh oh!

9il commented Mar 29, 2016

Uh oh!

9il commented Mar 29, 2016

Uh oh!

wilzbach commented Mar 30, 2016

Uh oh!

9il commented Mar 30, 2016

Test results

Uh oh!

9il commented Mar 30, 2016

Uh oh!

DmitryOlshansky commented Apr 1, 2016

Uh oh!