
sum(a) is now 30% slower than NumPy #30290

Closed
stevengj opened this issue Dec 6, 2018 · 24 comments

@stevengj
Member

stevengj commented Dec 6, 2018

I was updating my performance-optimization lecture notes from last year to Julia 1.0, which start with a comparison of C, Python, and Julia sum functions, and I noticed something odd:

Both the Julia sum(::Vector{Float64}) function and the NumPy sum function are faster than last year (yay for compiler improvements?). Last year, Julia and NumPy sum had almost identical speed, but now the NumPy sum function is about 30% faster than Julia's.

I'm running a 2016 Intel Core i7, the same as last year. So apparently the NumPy sum function has gotten some new optimization that we don't have? (I did switch from Python 2 to Python 3; I'm using the Conda Python.) Some kind of missing SIMD optimization?

I'm not so concerned about sum per se, but this is a pretty basic function — if we are leaving 30% on the table here, then we might be missing performance opportunities in many other places too.
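
For reference, a minimal sketch of the kind of timing comparison involved (not the exact notebook code; it assumes BenchmarkTools and PyCall are installed and that PyCall's Python has NumPy):

```julia
# Minimal benchmark sketch (assumptions: BenchmarkTools and PyCall installed).
using BenchmarkTools, PyCall

a = rand(10^7)            # 10^7 Float64 values
np = pyimport("numpy")
a_py = PyObject(a)        # convert to a NumPy array once, outside the timed region

@btime sum($a)            # Julia's built-in sum
@btime $(np.sum)($a_py)   # NumPy's sum on the same data
```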

@stevengj stevengj added the performance Must go faster label Dec 6, 2018
@andreasnoack
Member

I (and the audience) noticed this last week when I used the notebook in a presentation. It looks like Intel has been looking into this; I suspect the cause could be numpy/numpy#10251.

@KristofferC
Sponsor Member

The PR numpy/numpy#11113 removes all the intrinsics and says the compiler can do it by itself with the same performance. So I'm not sure why we get less out of LLVM, then.

@stevengj
Member Author

stevengj commented Dec 6, 2018

Note that numpy/numpy#11113 does not seem to specifically benchmark sum, so it could be that the intrinsics are still better than LLVM there? Or maybe Conda's numpy is using a different compiler (icc?) or different options?

@StefanKarpinski
Sponsor Member

I've been seeing the same thing on a 2018 2.7 GHz Intel Core i7, so this affects a pretty wide range of Intel CPUs. It makes the "variations on sum" / "Julia is fast" summation notebook a bit of a sad-trumpet demo, since NumPy wins.

@JeffreySarnoff
Contributor

Are the summed values equal?

@SyxP
Contributor

SyxP commented Dec 7, 2018

Are the summed values equal?

Yes. They agree to within floating-point rounding error, but they are not exactly equal as floats.
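
For anyone wondering why they can't be bit-for-bit equal: the vectorized/pairwise reductions re-associate the additions, and floating-point addition is not associative. A tiny illustrative sketch (naive_sum is a hypothetical helper, not code from the notebook):

```julia
# Re-associating the additions changes the floating-point rounding, so two
# correct sum implementations can differ in the last few bits.
function naive_sum(a)              # strict left-to-right accumulation
    s = zero(eltype(a))
    for x in a
        s += x
    end
    return s
end

a = rand(10^6)
sum(a) == naive_sum(a)             # typically false
isapprox(sum(a), naive_sum(a))     # true: they agree to within rounding error
```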

@andreasnoack
Member

Just an update: I don't think this is caused by the PR I linked to. I tried out older versions of Numpy, and all of them (back to and including 1.11) are just as fast.

@tkf
Member

tkf commented Dec 7, 2018

FYI, if you are using conda to install Numpy 1.11, you may be installing a very new build https://anaconda.org/anaconda/numpy/files?version=1.11.3 (says 6 days old). Anaconda could be using a newer/better compiler?

@andreasnoack
Member

I built that Numpy version from source with the default gcc (which on macOS is Apple's LLVM-based clang). The odd thing is that Numpy is also significantly faster than the C version, regardless of which flags I try.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Dec 7, 2018

Can you see the code generated by Numpy?
Annotating a Julia sum function's loop with @simd gives that ~30% speedup on my machine (until the number of summands exceeds ~35,000, at which point the speedup starts to decrease).
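
For concreteness, the kind of @simd-annotated loop in question looks roughly like this (a sketch with a hypothetical mysum, not the notebook's exact code):

```julia
# Sketch of a @simd-annotated sum loop; @simd lets the compiler re-associate
# the additions and vectorize the reduction.
function mysum(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end
```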

@KristofferC
Sponsor Member

If you look in the notebook, you can see that @simd is (of course) used.

@JeffreySarnoff
Contributor

I meant: do we know which instructions Numpy executes to perform the summation, and how that emitted assembly for the summation loop differs from what Julia (via LLVM) generates?
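
On the Julia side, the emitted code can be inspected directly, e.g. (a sketch; the output depends on the CPU and Julia version):

```julia
# Inspect what Julia/LLVM emits for the summation. In the output, look for
# vector types like <4 x double> (LLVM IR) or ymm/zmm registers (assembly).
using InteractiveUtils   # loaded automatically in the REPL

a = rand(10^6)
@code_llvm sum(a)
@code_native sum(a)
```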

@AzamatB
Contributor

AzamatB commented Dec 9, 2018

Would be interesting to see whether updating Julia's LLVM to the latest version would solve the issue.

@c42f
Member

c42f commented Dec 11, 2018

Oddly enough, the Julia built-in sum doesn't vectorize on my machine. See also #30320, where there is some discussion of vectorization in the current mapreduce implementation.

@c42f
Member

c42f commented Dec 11, 2018

sum doesn't vectorize on my machine

Sorry, ignore that — I misread the assembly.

@hycakir
Contributor

hycakir commented Dec 31, 2018

On my notebook (i7-6700HQ, which has no AVX-512), the conda-distributed Numpy sum does not give any performance benefit.

I believe this is an AVX-512 optimization issue. Perhaps the conda Numpy distributions are compiled with icc (although there is also a separate Intel numpy), which presumably optimizes better for AVX-512 targets.

Looking at the output from gcc 8.2 vs icc 19.0 at the link below, icc generates instructions using the 512-bit registers (zmms), while neither gcc nor clang does.

https://godbolt.org/z/we75gC

@chriselrod
Contributor

chriselrod commented Dec 31, 2018 via email

@hycakir
Contributor

hycakir commented Jan 1, 2019

Ah, I see. icc's default of -fp-model fast=1 misled me. Julia also makes use of the AVX-512 registers for the same sum implementation when the loop has the @simd annotation.

@nlw0
Contributor

nlw0 commented Apr 18, 2019

Would be interesting to see whether updating Julia's LLVM to the latest version would solve the issue.

I have seen other vectorization issues where clang 6.0 produces vectorized code and Julia doesn't, for instance #29445 and #31442.

@vchuravy
Sponsor Member

vchuravy commented Nov 2, 2019

One idea might be to check whether this happens with numpy from Conda or numpy from pip. The one from Conda has Intel-specific code and uses VML.
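
A quick way to check which build is actually in use (a sketch via PyCall; the exact output depends on the install):

```julia
# Report the NumPy version and build configuration that PyCall sees.
# Conda builds typically report MKL; pip wheels typically report OpenBLAS.
using PyCall
np = pyimport("numpy")
println(np.__version__)
np.show_config()
```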

@carstenbauer
Member

carstenbauer commented Feb 25, 2021

I just noticed this as well on 1.5.3 (numpy ~25% faster). On julia#master (i.e. 1.7-dev) I find that numpy is "only" ~12% faster. It's still unfortunate that we are slower here (in particular for demonstration/selling purposes).

@StefanKarpinski
Sponsor Member

I'm not sure there's much value in keeping this open: this is generally a symptom of LLVM not generating as well-optimized SIMD code for newer hardware as NumPy's hand-coded kernels. We generally catch up as soon as LLVM learns how to do as well as the hand-written code, but there's always newer hardware. Not sure what is actionable here.

@vtjnash
Sponsor Member

vtjnash commented Mar 1, 2021

I love closing issues

@vtjnash vtjnash closed this as completed Mar 1, 2021
@chriselrod
Contributor

chriselrod commented Mar 1, 2021

I'm not sure there's much value in keeping this open: this is generally a symptom of LLVM not generating as well-optimized SIMD code for newer hardware as NumPy's hand-coded kernels.

Hand-coded kernels of course add a lot of maintenance burden, and LLVM is normally way (months or years) ahead of OpenBLAS in supporting new architectures.
For example, it wasn't until last year that OpenBLAS really started supporting AVX-512, yet my laptop (which has AVX-512) still isn't supported in the latest release and still uses the Nehalem kernels.

So I'm just stating the obvious reasons why I don't think hand-coded kernels are the best idea.
