
Dramatic performance degradation in Julia 0.5 #290

Closed
multidis opened this issue Sep 28, 2016 · 14 comments

Comments

@multidis

multidis commented Sep 28, 2016

Use case: pmap with a function that uses Optim within its body.
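
For reference, a minimal sketch of this pattern (the objective, data, and function names below are made up for illustration; the real function body is more involved):

@everywhere using Optim

# Hypothetical per-task work: each pmap task runs its own optimization.
@everywhere function fit_one(data)
    objective(x) = sum((data .- x[1]).^2) + x[2]^2
    res = optimize(objective, [0.0, 0.0])   # defaults to Nelder-Mead
    return Optim.minimum(res)               # accessor name per current Optim
end

datasets = [rand(100) for i in 1:8]
@time results = pmap(fit_one, datasets)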

Problem:

Julia 0.4 @time results for pmap-call:

  • single core: 3 s, 2 M allocations, 90 MB, 0.57% gc time
  • 8 cores: 1 s, 5 k allocations, 500 KB

Julia 0.5:

  • single core: 3 s, 2 M allocations, 80 MB, 0.71% gc time
  • 8 cores: 18 s, 1 k allocations, 150 KB

Thus, the multicore pmap call is 18 times slower on Julia 0.5.

It appears that the problem is related to Optim.jl; other parts of the function body applied via pmap do not produce this slowdown. Starting Julia with a single core (julia, not julia -p 8) and tracking CPU load, I noticed that the "single core" run still loads multiple CPU cores intermittently. Not sure if this is some kind of multithreading in Julia 0.5, or something else that was disabled in 0.4.

I tried fminbox with cg, and also unconstrained Nelder-Mead: same problem. Removing all @simd annotations from the Optim sources did not change much either. Note that Julia prints

WARNING: could not attach metadata for @simd loop.
#157 (comment)

Any suggestions on how to force Optim to use a single worker core? Or did I misunderstand what is happening?

@KristofferC
Contributor

Solving linear systems will use multiple cores unless you call Base.BLAS.set_num_threads (if I remember the function name correctly).
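
For example (a sketch), pinning OpenBLAS to one thread on the master and on every worker:

# Keep OpenBLAS from spawning its own threads, so it does not
# compete with the pmap worker processes for CPU cores.
Base.BLAS.set_num_threads(1)              # master process
@everywhere Base.BLAS.set_num_threads(1)  # all worker processes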

@pkofod
Member

pkofod commented Sep 28, 2016

As mentioned above, external calls from Julia to another library will not honor '-p'. Are you absolutely sure it is the same version of Optim?

@multidis
Author

Are you absolutely sure it is the same version of Optim?

Yes, I use the same fork with both Julia 0.4 and 0.5. The performance degradation actually reproduces across a range of Optim commits from the last two months.

Trying to look into BLAS threads now...

@multidis
Author

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@multidis
Author

I tried Base.BLAS.set_num_threads(1) in Julia 0.5: same results. The single-core run now seems to stay on a single core, so it must have been BLAS multithreading that loaded multiple cores. However, even with @everywhere Base.BLAS.set_num_threads(1), the multicore call stays as slow as before.

@multidis
Author

multidis commented Sep 28, 2016

Correction: Julia 0.5 also writes

WARNING: could not attach metadata for @simd loop.

Removing all @simd in the Optim sources speeds things up by a small factor, but it is still much slower than Julia 0.4, which prints the same @simd warning.

Perhaps there is more to it than just forcing Optim to use a single worker core.

@multidis
Author

An interesting inverse scaling:

julia
2.494269 seconds (1.90 M allocations: 80.106 MB, 0.55% gc time)

julia -p 2
11.449032 seconds (1.20 k allocations: 140.594 KB)

julia -p 4
16.890558 seconds (1.22 k allocations: 153.734 KB)

julia -p 8
20.968143 seconds (1.25 k allocations: 158.594 KB)

@amitmurthy

Can you post some sample code that I can test with?

@multidis
Author

@amitmurthy Here is a simple mockup of what I am running:
https://gist.github.com/multidis/660adbcb2e3b0df2e07436c75b5403a0

Note that in this particular case the list that pmap iterates over has only 3 elements, but in practice it may vary from 2 to thousands of elements. Run times with the actual, full code are also longer on v0.5 for larger lists (an observation that is difficult to reproduce with simplified mockup code).

Julia v0.4 runs (second @time call):

~/julia04 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
0.006862 seconds (21.71 k allocations: 8.112 MB)

~/julia04 -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
0.007946 seconds (2.47 k allocations: 7.827 MB)

Julia v0.5 runs (second @time call):

julia pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
0.008386 seconds (14.66 k allocations: 7.944 MB)

julia -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
(first call results)
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
4.073709 seconds (2.31 k allocations: 7.731 MB)

Note also that the @simd warnings are generated for the second function call as well, unlike the v0.4 behaviour.

Version specifics:

~/julia04 -e "versioninfo()"
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
System: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3

julia -e "versioninfo()"
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@amitmurthy

The numbers above are the first-time compilation overhead. I just added a few more runs of frunmap to the test script.

Ntimes=3

julia -p 8 --depwarn=no pmap_optim_time.jl 
Any[5.60283e-17,5.60283e-17,5.60283e-17]
  5.641841 seconds (5.99 M allocations: 262.483 MB, 3.54% gc time)
  2.350770 seconds (2.25 k allocations: 7.729 MB, 0.26% gc time)
  2.305116 seconds (1.70 k allocations: 7.715 MB)
  0.019322 seconds (620 allocations: 7.688 MB)
  0.024080 seconds (629 allocations: 7.689 MB, 21.92% gc time)
  0.015221 seconds (619 allocations: 7.688 MB)

Ntimes=8

julia -p 8 --depwarn=no pmap_optim_time.jl 
Any[5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17]
  7.750557 seconds (5.99 M allocations: 262.648 MB, 3.01% gc time)
  0.029905 seconds (1.44 k allocations: 7.766 MB, 19.75% gc time)
  0.014826 seconds (1.46 k allocations: 7.768 MB)
  0.014171 seconds (1.44 k allocations: 7.767 MB)
  0.019212 seconds (1.48 k allocations: 7.769 MB, 25.70% gc time)
  0.012541 seconds (1.45 k allocations: 7.767 MB)

In 0.5 all the workers are used in a round-robin fashion. With Ntimes=3, the first run primes workers 1-3, the second 4-6, and the third 7-8. Only from the fourth run onwards do you see the optimized timings. With Ntimes=8, the first run itself warms up all the workers.

In 0.4 only the first 3 workers are used every time, hence you see fast run times from the second run onwards.
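
One way to keep this warm-up out of the measurements is to touch every worker once before timing. A sketch, using the mockup's frunmap and an assumed input list name lst:

# Run frunmap once on each worker so per-process JIT compilation
# does not contaminate the timed pmap call (lst is an assumed name).
for p in workers()
    remotecall_fetch(frunmap, p, first(lst))
end
@time res = pmap(frunmap, lst)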

Run times with the actual, full code are longer in the case of v0.5 for larger lists as well (that observation is difficult to reproduce with a simplified mockup code).

How much longer? There is a slight slowdown due to changes in closure serialization. A CachingPool (http://docs.julialang.org/en/latest/stdlib/parallel/#Base.CachingPool) may help if that is the cause.
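
A sketch of that, again with the mockup's frunmap and an assumed input list lst:

# Cache the function on the workers so it is not re-serialized
# for every pmap task.
wp = CachingPool(workers())
@time res = pmap(wp, frunmap, lst)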

Also see JuliaLang/julia#16508

@multidis
Author

multidis commented Oct 3, 2016

Only from the fourth run onwards you see the optimized timings.

This explains a lot of my confusion, thanks @amitmurthy

After extensive testing taking this into account, the Julia 0.5 slowdown (compared to 0.4) is about 5 to 10% extra time. Interestingly, memory allocation is half of what it was in 0.4. Win on memory, lose on closures, I guess... CachingPool does not seem to affect run times.

I guess I'll stay with 0.4 in production to avoid additional computing costs, for as long as package dependencies work with 0.4. Closing the issue since nothing specific to Optim.jl was found.

@multidis multidis closed this as completed Oct 3, 2016
@pkofod
Member

pkofod commented Oct 3, 2016

Hm, you say closures... Is this related to JuliaLang/julia#15276 as well? Maybe when that is fixed, performance will make v0.5 interesting for you again.
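
(For context, julia#15276 is about captured variables being boxed inside closures. A rough, hypothetical illustration of the pattern and the usual let-rebinding workaround:)

# Hypothetical illustration of the julia#15276 pattern: a captured variable
# that is reassigned in the enclosing scope gets boxed, slowing the closure.
function make_objective(data)
    scale = 1.0
    scale = scale * 2               # reassignment forces scale to be boxed
    let scale = scale               # workaround: rebind so the closure
        return x -> scale * sum((data .- x[1]).^2)   # captures a constant
    end
end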

@multidis
Author

multidis commented Oct 3, 2016

Could be; frankly, I have not had a chance to dig into the closure issues in 0.5 in detail. Will definitely re-time with each subsequent 0.5.* update.

Looking into another performance degradation, possibly related to the change from MultivariateOptimizationResults{T,N} to MultivariateOptimizationResults{T,N,M}, but not sure yet, so keeping this closed for now.

@multidis
Author

multidis commented Oct 4, 2016

Moving this to a new issue
#292
