Performance of matrices using complex numbers #1145

Closed
cdubout opened this Issue Aug 10, 2012 · 23 comments

Comments

Projects
None yet
8 participants

cdubout commented Aug 10, 2012

I am trying to reimplement some Matlab code in Julia, but there is a big performance gap when using complex numbers.

As an example:

m = rand(1024,1024);
@time n = sum(m);
elapsed time: 0.007010221481323242 seconds (shortest of 5 trials on my machine)

m = m + im;
@time n = sum(m);
elapsed time: 0.07563614845275879 seconds (again shortest of 5 trials on my machine)

@time n = sum(real(m)) + sum(imag(m));
elapsed time: 0.016952991485595703 seconds

Why is the second experiment more than 10 times slower instead of just 2 (like Matlab)?
And why the third still manages to beat the second...

Owner

timholy commented Aug 10, 2012

There are others who can comment more fully (and possibly, accurately!), but my impression is that our complex number support is definitely slower than ideal. The issue comes down to the fact that currently Julia doesn't support "immutable types," which in this case would mean a set of two doubles in adjacent memory locations that are addressable as a type (structure). There are some clever shenanigans that simulate this (see all the "box" and "unbox" stuff in base/complex.jl), but I think there's some overhead to this.

The good news is that Jeff appears to be working on this precise issue. I suspect that when he's done, you'll see a significant performance boost.

I can't comment on your third case.

Owner

Keno commented Aug 10, 2012

As Tim has mentioned, there is the issue of immutability in structs, but I think that's not at work here. If I am interpreting my observations correctly, the actual issue is that the first and third case are inlined (though there's additional copying overhead for the third) while the second one is not (LLVM dump here ). Maybe @JeffBezanson can elaborate a little further.

EDIT: I should have been more precise. What I meant is that + is inlined into sum in the first and third case, but not in the second one.

Owner

JeffBezanson commented Aug 10, 2012

See #323. But also, @loladiro is right and I think I can do something more in this case.

cdubout commented Aug 13, 2012

Ok yes that fixed it.
In fact that fixed it so well you may want to check it:

m = rand(1024,1024); n = sum(m); @time n = sum(m); m = m + im; n = sum(m); @time n = sum(m);
elapsed time: 0.0039370059967041016 seconds
elapsed time: 0.002023935317993164 seconds

So now instead of being 10 times slower it is twice faster with complex. Strange…

For comparison here is why I get with Matlab 2012a:

m = rand(1024,1024); n = sum(m); tic; n = sum(m); toc, m = m + 1i; n = sum(m); tic; n = sum(m); toc
Elapsed time is 0.000504 seconds.
Elapsed time is 0.000921 seconds.

Owner

JeffBezanson commented Aug 13, 2012

That is because we use a compensated summation algorithm by default for float arrays; it is a bit slower. I guess we should use the same algorithm for complex float.

cdubout commented Aug 13, 2012

A bit? It is 4 times slower…

On Aug 13, 2012, at 21:19, Jeff Bezanson notifications@github.com wrote:

That is because we use a compensated summation algorithm by default for float arrays; it is a bit slower. I guess we should use the same algorithm for complex float.


Reply to this email directly or view it on GitHub.

Owner

JeffBezanson commented Aug 13, 2012

For some reason I get a much smaller gap on my machine:

julia> m = rand(1024,1024); n = sum(m); @time n = sum(m); m = m + im; n = sum(m); @time n = sum(m);
elapsed time: 0.0037260055541992188 seconds
elapsed time: 0.0035698413848876953 seconds

Also since we're 2x to 7x slower than matlab here, the differences between real and complex, compensated and not are masked by other overhead. I would also not expect complex to be 2x slower since it is unrolled, doing 2 sums per loop (e.g. in matlab it is only 1.8x slower).

Owner

StefanKarpinski commented Aug 13, 2012

This kind of makes sense given that our theory about why compensated summation is only marginally slower is that superscalar architectures can do a few more ops in a tight loop essentially for free. That would be extremely architecture-sensitive. Maybe @cdubout's machine is a sufficiently different CPU that the extra work in the compensated version is no longer free.

Owner

timholy commented Aug 13, 2012

@cdubout, in case it's not entirely clear, the compensated algorithm is designed to be insensitive to roundoff. See line 1430 of array.jl, commit 19ff52a, and this wikipedia page: http://en.wikipedia.org/wiki/Kahan_summation_algorithm.

Jeff and Stefan, it would seem that a poor-person's version of an improved summation routine might offer a middle course (warning, just making this up off the top of my head):

z = zero(eltype(a))
n = length(a)
blocksize = int(round(sqrt(n)))
s = z
for i = 1:blocksize:n
    ilim = min(n, i+blocksize-1)
    spart = z
    for j = i:ilim
        spart += a[j]
    end
    s += spart
end

The idea is that if a's values are essentially randomly-distributed, then adding up in blocks of size sqrt(n) should reduce the frequency of adding big and small values.

Owner

Keno commented Aug 13, 2012

My timings on an I7-2760QM with 1333 Mhz DDR3 RAM are quite similar to @cdubout's (same code as Jeff's):

elapsed time: 0.0037779808044433594 seconds
elapsed time: 0.002089977264404297 seconds

, while my timings on beowulf show are quite different:

elapsed time: 0.031223058700561523 seconds
elapsed time: 0.04382896423339844 seconds

Replacing the compensated sum with the trivial algorithm on my i7 gives results that might be closer to what is expected:

elapsed time: 0.0010178089141845703 seconds
elapsed time: 0.0020389556884765625 seconds
Owner

StefanKarpinski commented Aug 13, 2012

Makes sense. Those beowulf machines are ancient.

Owner

StefanKarpinski commented Aug 13, 2012

Maybe we should just have a separate function kahansum? We could do exactsum that does some efficient computation of the exact floating-point sum (I have some idea for this).

Owner

Keno commented Aug 13, 2012

+1 for making it the compensated sum a separate function.

cdubout commented Aug 13, 2012

My timings were on an i7-2720QM with 1333 Mhz DDR3 RAM, makes sense that they are similar.
I also think that having two separate functions (a fast one and an accurate one) makes sense if the performance gap is that big.

Member

johnmyleswhite commented Aug 13, 2012

The requirement for having at least two functions (a fast one and an accurate one) is going to be a general issue as we move forward. It's come up while I've been writing code for doing PCA, for example.

It would be great to come up with a general naming convention for this distinction.

-- John

On Aug 13, 2012, at 1:48 PM, cdubout wrote:

My timings were on an i7-2720QM with 1333 Mhz DDR3 RAM, makes sense that they are similar.
I also think that having two separate functions (a fast one and an accurate one) makes sense if the performance gap is that big.


Reply to this email directly or view it on GitHub.

Owner

StefanKarpinski commented Aug 13, 2012

One interesting approach would be to have Approx and Exact modules and import from one or the other depending on whether you generally want approximate or exact computations. Finer-grained control could be gotten by importing specific functions from one or the other. That would allow one to play with this just by changing imports rather than searching and replacing in code.

Owner

timholy commented Aug 13, 2012

There's a whole class of approximate solutions to problems (e.g., approximate nearest-neighbor, ANN), for which there's a speed/accuracy tradeoff. For those algorithms you can often specify an upper bound on how accurate you require the answer to be. Not certain that applies in this case, but if one is trying to come up with a general API it would be worth keeping this class of algorithms in mind.

Member

johnmyleswhite commented Aug 14, 2012

Agreed, Tim: this becomes a very deep question once we start thinking about it.

In R, for example, there are, at minimum, two separate PCA functions that differ in numerical accuracy based on their use or non-use of the SVD. Neither even gets into the more complex issues raised by newer approximate PCA methods that let you tune the level of approximation. We should definitely think about this. It might be as big an issue to get right as the modeling functions like fit and predict that we've temporarily punted on.

Member

kmsquire commented Sep 6, 2012

Bringing the discussion on exact vs. approximate summation back here (from #1257), since most of the discussion is here.

For K-B-N summation, at least, it would be nice to have a KBNSum module. IMHO, the best scenario would be to have any identical bindings in the current namespace overridden with an import KBNSum.* Unfortunately, from my understanding of modules (*), this does not possible--is this correct? None of my playing around seems to allow this.

In the case of K-B-N summation, at least, this is probably fine. The K-B-N summation only needs to be defined for FloatingPoints, which could otherwise use generic/templated summation functions. For other classes of functions, it might not be as easy.

Shall we try this for K-B-N summation? I can update #1257 to further the discussion.

(*) https://groups.google.com/forum/#!msg/julia-dev/Rj1xkrZkgcw/yb9ZlEAoKzUJ

Owner

StefanKarpinski commented Sep 6, 2012

That sort of facility with modules would be nice. I guess the idea would be that you have normal sum and cumsum functions provided by default, but then do import KBNSum.* and get new versions of sum and cumsum that do KBN summation in the current scope.

The other way that you'd want to use this is to opt for KBN (or something even more precise) globally as an option. That's an entirely different issue, but not off-topic here, I don't think.

So it seems like what we want is:

  1. a way to override a bunch of bindings via an import: import KBNSum.*
  2. a way to switch to using KBNSum everywhere.

The latter indicates that summation may be something we want to load later than most of the stuff in base since we might want the option to be able to choose to load a different definition based on command-line flags.

Member

kmsquire commented Sep 6, 2012

Nice summary! The first is a clearer explanation of what I was proposing, and the second summarizes the needs discussed previously.

(KBNSum now seems awkward looking at it, but Exact or ExactSum sounded misleading, since there still could be roundoff error... Other ideas for the name?)

Owner

timholy commented Sep 6, 2012

CompensatedSum? (I'm bad at names, don't listen to me!)

But there's some merit in starting the name with Sum, e.g., SumKBN, because if packages are listed alphabetically it might appear nearer to where someone might expect to find this functionality.

Member

pao commented Sep 6, 2012

From the "what other languages do" department, here's some pseudo-Haskell-Julia:

import Base hiding (sum, cumsum)
import KBNSum

So basically you can exclude certain things. Pseudo-Julia, it could be nice to do:

import Base.* hiding exports(KBNSum)
import KBNSum.*

where exports(m::Module) returns an array of exported names, and is a method that might already exist but I can't check right now. UPDATE: whos() comes close, but needs to have the module imported first without a .*, which is rather inconvenient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment