
sum(a) is now 30% slower than NumPy #30290

Closed
stevengj opened this issue Dec 6, 2018 · 24 comments

@stevengj
Member

stevengj commented Dec 6, 2018

I was updating my performance-optimization lecture notes from last year to Julia 1.0, which start with a comparison of C, Python, and Julia sum functions, and I noticed something odd:

Both the Julia sum(::Vector{Float64}) function and the NumPy sum function are faster than last year (yay for compiler improvements?). Last year, Julia and NumPy sum had almost identical speed, but now the NumPy sum function is about 30% faster than Julia's.

I'm running a 2016 Intel Core i7, the same as last year. So apparently the NumPy sum function has gotten some new optimization that we don't have? (I did switch from Python 2 to Python 3; I'm using the Conda Python.) Some kind of missing SIMD optimization?

I'm not so concerned about sum per se, but this is a pretty basic function — if we are leaving 30% on the table here, then we might be missing performance opportunities in many other places too.
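
For reference, a minimal sketch of the kind of timing comparison involved (not the exact notebook code; it assumes BenchmarkTools and PyCall are installed and that PyCall's Python has NumPy):

```julia
# Minimal benchmark sketch (assumptions: BenchmarkTools and PyCall installed).
using BenchmarkTools, PyCall

a = rand(10^7)            # 10^7 Float64 values
np = pyimport("numpy")
a_py = PyObject(a)        # convert to a NumPy array once, outside the timed region

@btime sum($a)            # Julia's built-in sum
@btime $(np.sum)($a_py)   # NumPy's sum on the same data
```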

@stevengj stevengj added the performance Must go faster label Dec 6, 2018
@andreasnoack
Member

I (and the audience) noticed this last week when I used the notebook in a presentation. It looks like Intel has been looking into this; I suspect the cause could be numpy/numpy#10251.

@KristofferC
Sponsor Member

The PR numpy/numpy#11113 removes all the intrinsics and says the compiler can do it by itself with the same performance. So I'm not sure why we get less out of LLVM, then.

@stevengj
Member Author

stevengj commented Dec 6, 2018

Note that numpy/numpy#11113 does not seem to specifically benchmark sum, so it could be that the intrinsics are still better than LLVM there? Or maybe Conda's numpy is using a different compiler (icc?) or different options?

@StefanKarpinski
Sponsor Member

I've been seeing the same thing on a 2018 2.7 GHz Intel Core i7, so this affects a pretty wide range of Intel CPUs. It makes the "variations on sum" / "Julia is fast" summation notebook a bit of a sad-trumpet demo, since NumPy wins.

@JeffreySarnoff
Contributor

Are the summed values equal?

@SyxP
Contributor

SyxP commented Dec 7, 2018

Are the summed values equal?

Yes. They agree to within floating-point rounding error, but they are not exactly equal as floats.
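
For anyone wondering why they can't be bit-for-bit equal: the vectorized/pairwise reductions re-associate the additions, and floating-point addition is not associative. A tiny illustrative sketch (naive_sum is a hypothetical helper, not code from the notebook):

```julia
# Re-associating the additions changes the floating-point rounding, so two
# correct sum implementations can differ in the last few bits.
function naive_sum(a)              # strict left-to-right accumulation
    s = zero(eltype(a))
    for x in a
        s += x
    end
    return s
end

a = rand(10^6)
sum(a) == naive_sum(a)             # typically false
isapprox(sum(a), naive_sum(a))     # true: they agree to within rounding error
```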

@andreasnoack
Member

Just an update: I don't think this is caused by the PR I linked to. I tried out older versions of Numpy, and all of them (back to and including 1.11) are just as fast.

@tkf
Member

tkf commented Dec 7, 2018

FYI, if you are using conda to install Numpy 1.11, you may be installing a very new build https://anaconda.org/anaconda/numpy/files?version=1.11.3 (says 6 days old). Anaconda could be using a newer/better compiler?

@andreasnoack
Member

I built that Numpy version from source with the default gcc (which on macOS is Apple's LLVM-based clang). The odd thing is that Numpy is also significantly faster than the C version, regardless of which flags I try.

@JeffreySarnoff
Contributor

JeffreySarnoff commented Dec 7, 2018

Can you see the code generated by Numpy?
Annotating a Julia sum function's loop with @simd gives that ~30% speedup on my machine (until the number of summands exceeds ~35,000, at which point the speedup starts to decrease).
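
For concreteness, the kind of @simd-annotated loop in question looks roughly like this (a sketch with a hypothetical mysum, not the notebook's exact code):

```julia
# Sketch of a @simd-annotated sum loop; @simd lets the compiler re-associate
# the additions and vectorize the reduction.
function mysum(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end
```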

@KristofferC
Sponsor Member

If you look in the notebook, you can see that @simd is (of course) used.

@JeffreySarnoff
Contributor

I meant: do we know which instructions Numpy executes to perform the summation, and how that emitted assembly for the summation loop differs from what Julia (via LLVM) generates?
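
On the Julia side, the emitted code can be inspected directly, e.g. (a sketch; the output depends on the CPU and Julia version):

```julia
# Inspect what Julia/LLVM emits for the summation. In the output, look for
# vector types like <4 x double> (LLVM IR) or ymm/zmm registers (assembly).
using InteractiveUtils   # loaded automatically in the REPL

a = rand(10^6)
@code_llvm sum(a)
@code_native sum(a)
```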

@AzamatB
Contributor

AzamatB commented Dec 9, 2018

Would be interesting to see whether updating Julia's LLVM to the latest version would solve the issue.

@c42f
Member

c42f commented Dec 11, 2018

Oddly enough, the Julia built-in sum doesn't vectorize on my machine. See also #30320, where there is some discussion of vectorization in the current mapreduce implementation.

@c42f
Member

c42f commented Dec 11, 2018

sum doesn't vectorize on my machine

Sorry, ignore that — I misread the assembly.

@hycakir
Contributor

hycakir commented Dec 31, 2018

On my notebook (i7-6700HQ, which has no AVX-512), the conda-distributed Numpy sum does not give any performance benefit.

I believe this is an AVX-512 optimization issue. Perhaps the conda Numpy distributions are compiled with icc (although there is also a separate Intel numpy), which presumably optimizes better for AVX-512 targets.

Looking at the output from gcc 8.2 vs icc 19.0 at the link below, icc generates instructions using the 512-bit registers (zmms), while neither gcc nor clang does.

https://godbolt.org/z/we75gC

@chriselrod
Contributor

chriselrod commented Dec 31, 2018 via email

@hycakir
Contributor

hycakir commented Jan 1, 2019

Ah, I see. icc's default of -fp-model fast=1 misled me. Julia also makes use of the AVX-512 registers for the same sum implementation when the loop has the @simd annotation.

@nlw0
Contributor

nlw0 commented Apr 18, 2019

Would be interesting to see whether updating Julia's LLVM to the latest version would solve the issue.

I have seen other vectorization issues where clang 6.0 produces vectorized code and Julia doesn't, for instance #29445 and #31442.

@vchuravy
Sponsor Member

vchuravy commented Nov 2, 2019

One idea might be to check whether this happens with numpy from Conda or numpy from pip. The one from Conda has Intel-specific code and uses VML.
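
A quick way to check which build is actually in use (a sketch via PyCall; the exact output depends on the install):

```julia
# Report the NumPy version and build configuration that PyCall sees.
# Conda builds typically report MKL; pip wheels typically report OpenBLAS.
using PyCall
np = pyimport("numpy")
println(np.__version__)
np.show_config()
```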

@carstenbauer
Member

carstenbauer commented Feb 25, 2021

I just noticed this as well on 1.5.3 (numpy ~25% faster). On julia#master (i.e. 1.7-dev) I find that numpy is "only" ~12% faster. It's still unfortunate that we are slower here (in particular for demonstration/selling purposes).

@StefanKarpinski
Sponsor Member

I'm not sure there's much value in keeping this open: this is generally a symptom of LLVM not generating as well-optimized SIMD code for newer hardware as NumPy's hand-coded kernels. We generally catch up as soon as LLVM learns how to do as well as the hand-written code, but there's always newer hardware. Not sure what is actionable here.

@vtjnash
Sponsor Member

vtjnash commented Mar 1, 2021

I love closing issues

@vtjnash vtjnash closed this as completed Mar 1, 2021
@chriselrod
Contributor

chriselrod commented Mar 1, 2021

I'm not sure there's much value in keeping this open: this is generally a symptom of LLVM not generating as well-optimized SIMD code for newer hardware as NumPy's hand-coded kernels.

Hand-coded kernels of course add a lot of maintenance burden, and LLVM is normally way (months or years) ahead of OpenBLAS in supporting new architectures.
For example, it wasn't until last year that OpenBLAS really started supporting AVX-512, yet my laptop (which has AVX-512) still isn't supported in the latest release and still uses the Nehalem kernels.

So I'm just stating the obvious reasons why I don't think hand-coded kernels are the best idea.
