Skip to content

faster pairwise summation #4069

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 12, 2016
Merged

Conversation

John-Colvin
Copy link
Contributor

Using an iterative approach instead of recursive. When I originally wrote this I recorded an order-of-magnitude speedup on arrays and approx. 2x on more complicated ranges.

@John-Colvin John-Colvin force-pushed the patch-16 branch 2 times, most recently from d04ca1f to f531248 Compare March 9, 2016 15:10
data = data[16 .. $];
}
else store[idx] = sumPairwiseN!(16, false, F)(data);
foreach (_; 0 .. cast(uint)bsf(k+1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should best be a while loop checking the limit of idx so there's only one iteration variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, the others also

@John-Colvin
Copy link
Contributor Author

@andralex Thanks for the comments, I'll fix them soon. Are you interested in more things like this? I do love writing fast implementations of simple algorithms, but e.g. if you compare this implementation to the old one, it's a lot harder to understand / maintain.

@andralex
Copy link
Member

@John-Colvin depends on the gains. At some point we do need to add some benchmarks to Phobos, and let's defer performance work to after that.

In this case I'd say eliminating the unused iteration variable may arguably lead to code that's easier to understand and maintain, too.

@John-Colvin
Copy link
Contributor Author

@andralex I don’t mean your suggestions make it harder to maintain/understand (quite the opposite actually), I was comparing to the old code, which was very simple.

If we had a framework for benchmarking then each performance improvement pull could be required to add an attempt at best/worst/mediocre case benchmarks for the new implementation. It’s pretty boring writing benchmarks without being allowed to optimise the code, so it might be good to bundle the work together.

On 10 Mar 2016, at 15:07, Andrei Alexandrescu notifications@github.com wrote:

@John-Colvin https://github.com/John-Colvin depends on the gains. At some point we do need to add some benchmarks to Phobos, and let's defer performance work to after that.

In this case I'd say eliminating the unused iteration variable may arguably lead to code that's easier to understand and maintain, too.


Reply to this email directly or view it on GitHub #4069 (comment).

@John-Colvin John-Colvin force-pushed the patch-16 branch 2 times, most recently from f5f037e to 22bedb9 Compare March 10, 2016 18:29
@John-Colvin
Copy link
Contributor Author

Ewwww, found a nasty bug in std.range.Take. Fixed, tested, added more testing for pairwise sum (which was how I found the bug, completely by chance), changed to use one less loop-variable as per @andralex's recommendation and factored the whole store-collapse in to a local function.

@John-Colvin
Copy link
Contributor Author

The std.range.Take bug was introduced in #2158, which is a great example of why unittests should actually test something, not just do something to check it compiles.

F[64] store = void;
size_t idx = 0;

auto collapseStore(T)(T k)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andralex I had to make this a template because size_t is no good as a generic index/length type for ranges (e.g. std.range.iota of ulong on 32bit).

I'm now unsure what to do with e.g. line 4087 which uses size_t as a counter (also it makes me reconsider my assumptions about the necessary store length). Should I use ulong? Do we have some sort of standard as to how long ranges are allowed to be while still being supported by phobos?

With the current state of affairs, even something as trivial as walkLength is limited to 2^^32 - 1 on 32 bit platforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking in this matter is to avoid implementation aggravation. If someone needs ranges longer than 4B, they need to switch to a 64 bit platform. I think dealing with lengths that are non-size_t is just wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It overcomplicates life to support lengths of any type other than size_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's fine, I just needed a decision. I'll sort this out now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on second thoughts, what does that mean for iota(5L) or iota(long.max) on 32 bit? What type should the length be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if we go with enforcing length to be size_t, then iota(long.max) should simply have no length

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much already addressed. Somehow got stalled: #4013

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iota with longs should probably not offer length on 32-bit systems. Or just assert and truncate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andralex On x86 or on x64 as well?
On 24 Mar 2016 8:04 pm, "Andrei Alexandrescu" notifications@github.com
wrote:

In std/algorithm/iteration.d
#4069 (comment)
:

{

  • static assert (isFloatingPoint!Result);
  • switch (r.length)
  • import core.bitop : bsf;
  • // Works for r with at least length < 2^^(64 + log2(16)), in keeping with the use of size_t
  • // elsewhere in std.algorithm and std.range on 64 bit platforms. The 16 in log2(16) comes
  • // from the manual unrolling in sumPairWise16
  • F[64] store = void;
  • size_t idx = 0;
  • auto collapseStore(T)(T k)

Iota with longs should probably not offer length.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/D-Programming-Language/phobos/pull/4069/files/ef9f8338fe123c1514a61c5b4b50fbfb072632e4#r57379092

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4013 takes the approach of asserting that the length of the range will fit in size_t, so long and ulong will work on 32-bit systems - but only if the resultant range is not too long. So, most common cases will work just fine, but if you try and do something like iota(long.min, long.max), then the assertion will fail. The downside, of course, is that when the problem does occur, it's only found at runtime, but in the vast majority of cases, you can continue to use long and ulong with iota and have length as we've had.

@9il
Copy link
Member

9il commented Mar 14, 2016

I am working on summation module... could we hold this PR?

@9il 9il added the math label Mar 14, 2016
@dnadlinger
Copy link
Member

@9il: As far as I can see, this PR only improves the existing implementation, which is probably a good thing to have until your module arrives in Phobos, no? (I presume the latter will take quite some time, review process and all.)

@9il
Copy link
Member

9il commented Mar 14, 2016

@klickverbot Ok

@wilzbach
Copy link
Member

Please rebase as it now has merge conflicts.

@John-Colvin
Copy link
Contributor Author

rebased

~ Take.stringof);
return source[i .. j - i];
return source[i .. j];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Not sure what I was thinking when I wrote this. Kind of sad it took 2 years to catch it :/

@9il
Copy link
Member

9il commented Mar 29, 2016

mir.las.sum was released. May be we can just put the link to mir in the docs :-) ? I expect that mir version would be significantly faster, especially with optimization for native CPU by LDC/GDC.

@quickfur
Copy link
Member

Should we just close this then? Or does somebody want to port the mir version to Phobos?

@wilzbach
Copy link
Member

Should we just close this then? Or does somebody want to port the mir version to Phobos?

AFAIK will @9il merge mir periodically to Phobos.

@9il
Copy link
Member

9il commented Mar 29, 2016

AFAIK will @9il merge mir periodically to Phobos.

Nitpick: without official review process (see wiki) I can not add new modules.

@wilzbach
Copy link
Member

@John-Colvin could you please separate your fix for Take into another PR?

@JackStouffer
Copy link
Member

I argue for this to be merged. Just because something in the future might be faster doesn't mean that improving what we have now shouldn't happen. This is a performance optimization of some pretty common code, it should be merged posthaste.

@9il
Copy link
Member

9il commented Mar 29, 2016

@John-Colvin could you please separate your fix for Take into another PR?

This is OK to have just a commit without separate PR. In the same time we need a filled issue to show this fix in the history.

@9il
Copy link
Member

9il commented Mar 29, 2016

If someone would like to have this in Phobos feel free to merge.

Nitpick: this algorithm allows to remove Kahan summation, so it should be removed.

@9il
Copy link
Member

9il commented Mar 29, 2016

Oh, looks like a may be wrong about performance for current state of mir :-/

@wilzbach
Copy link
Member

This is OK to have just a commit without separate PR. In the same time we need a filled issue to show this fix in the history.

OK, sorry to be an annoying complainer ...

@9il
Copy link
Member

9il commented Mar 30, 2016

Test results

The PR code is better for DMD and Mir 0.11.0-beta2 is better for LDC. The difference for LDC is small if bounds check are turned off and an array is large. Mir is 2x+ times faster when array length = 15.

So, the PR LGTM. Do not forget to remove Kahan summation.

Compiler BoundsCheck Array Length Bench Count Algorithm Time
DMD safe 10000 10000 std 466 ms
DMD safe 10000 10000 PR 94 ms
DMD safe 10000 10000 Mir 128 ms
DMD safe 15 100000 std 69 ms
DMD safe 15 100000 PR 73 ms
DMD safe 15 100000 Mir 65 ms
DMD off 10000 10000 std 407 ms
DMD off 10000 10000 PR 43 ms
DMD off 10000 10000 Mir 96 ms
DMD off 15 100000 std 55 ms
DMD off 15 100000 PR 61 ms
DMD off 15 100000 Mir 63 ms
LDC safe 10000 10000 std 306 ms
LDC safe 10000 10000 PR 62 ms
LDC safe 10000 10000 Mir 38 ms
LDC safe 15 100000 std 38 ms
LDC safe 15 100000 PR 59 ms
LDC safe 15 100000 Mir 26 ms
LDC off 10000 10000 std 307 ms
LDC off 10000 10000 PR 39 ms
LDC off 10000 10000 Mir 37 ms
LDC off 15 100000 std 40 ms
LDC off 15 100000 PR 56 ms
LDC off 15 100000 Mir 22 ms

@9il
Copy link
Member

9il commented Mar 30, 2016

Benchmark code http://dpaste.dzfl.pl/adb07eea2db4

@DmitryOlshansky
Copy link
Member

Anyhow it would be good to merge this while waiting on more of MIR additions to Phobos

@wilzbach
Copy link
Member

Ping - so any other objections except that khan summation should be removed? I also vote for merging :)

@9il
Copy link
Member

9il commented Apr 10, 2016

Ping - so any other objections except that khan summation should be removed? I also vote for merging :)

I have not. Vote too

@DmitryOlshansky
Copy link
Member

Auto-merge toggled on

@9il
Copy link
Member

9il commented Apr 12, 2016

Auto-merge toggled on

Why Kahan was not removed?

@DmitryOlshansky
Copy link
Member

Why Kahan was not removed?

Was that a requirement?

@9il
Copy link
Member

9il commented Apr 12, 2016

Was that a requirement?

New pairwise summation can work with input ranges, so we don't need Kahan summation algorithms, which was selected of range has hot random access

@DmitryOlshansky DmitryOlshansky merged commit 1d2e88c into dlang:master Apr 12, 2016
@wilzbach
Copy link
Member

Should I sent a follow-up PR to remove Kahan?

@DmitryOlshansky
Copy link
Member

Yeah, a follow-up makes sense

@wilzbach
Copy link
Member

New pairwise summation can work with input ranges, so we don't need Kahan summation algorithms, which was selected of range has hot random access

Just had a closer look at it and didn't see the random access

 static assert (isFloatingPoint!Result && isMutable!Result);
    Result c = 0;
    for (; !r.empty; r.popFront())
    {
        auto y = r.front - c;
        auto t = result + y;
        c = (t - result) - y;
        result = t;
    }
    return result;

and

$(LI If $(D ElementType!R) is a floating-point type and $(D R) is a
finite input range (but not a random-access range with slicing), then
$(D sum) uses the $(WEB en.wikipedia.org/wiki/Kahan_summation,
Kahan summation) algorithm.)

@John-Colvin
Copy link
Contributor Author

For those interested in speed, there are two things I discovered:

LDC doesn't inline core.bitop.bsf
manually unrolling to 32 is even faster than 16

I found that copy-pasting the bsf definition in to the benchmark module and replacing sumPairwise16 with sumPairwise32 brought this implementation to speed-parity with mir.

@9il
Copy link
Member

9il commented Apr 12, 2016

I found that copy-pasting the bsf definition in to the benchmark module

We need to make bsf and others templates

@John-Colvin
Copy link
Contributor Author

that or fix cross-module inlining

@9il
Copy link
Member

9il commented Apr 12, 2016

... or fix cross-module inlining

This would not work for Phobos. Even if you place bsf in module with summation bsf would be called from Phobos.

@John-Colvin
Copy link
Contributor Author

Hmm, GDC and DMD manage to inline it ok. I guess they're magic intrinsics.

Any idea what's going on with the definition in dmd's druntime? Looks
seriously weird to me. On x86_64 size_t is ulong, so there's a separate
declaration and definition in the same module, in D? Then the definition
does a no-op cast and calls itself? Whut? I'm guessing the way intrinsics
work in DMD breaks language rules and is confusing me.
On 12 Apr 2016 5:36 pm, "Ilya Yaroshenko" notifications@github.com wrote:

... or fix cross-module inlining

This would not work for Phobos. Even if you place bsf in module with
summation bsf would be called from Phobos.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#4069 (comment)

9rnsr added a commit to 9rnsr/phobos that referenced this pull request Apr 25, 2016
Minized test case:

//import stdx.traits;
//import stdx.typecons;

template ReturnType(func...)
    if (isCallable!func)
{
    enum ReturnType = true; // dummy
}

template isCallable(T...)
{
    static if (is(typeof(&T[0].opCall) V))
        enum isCallable = true;
}

auto sliced(Range, Lengths...)(Range range)
{
    alias S = Slice!(0, typeof(range));
    S ret;
    return ret;
}

void main()//unittest
{
    auto a = sliced(2);
    auto b = sliced(2);

    a.opIndexAssign(b);
}

struct Slice(size_t _N, _Range)
{
    auto opCall(int)
    {
    }

    auto opCall(string)
    {
    }

    auto opIndex()
    {
        return this;
    }

    void opIndexAssign(size_t RN, RRange)(Slice!(RN, RRange) )
        if (ReturnType!opIndex) // will test &opIndex().opCall in std.traits.isCallable
    {
    }
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.