
Dramatic performance degradation in Julia 0.5 #290

Closed
multidis opened this issue Sep 28, 2016 · 14 comments

Comments

@multidis

multidis commented Sep 28, 2016

Use case: pmap with a function that uses Optim within its body.
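
For reference, a minimal sketch of this pattern (the objective, data, and function names below are made up for illustration; the real function body is more involved):

@everywhere using Optim

# Hypothetical per-task work: each pmap task runs its own optimization.
@everywhere function fit_one(data)
    objective(x) = sum((data .- x[1]).^2) + x[2]^2
    res = optimize(objective, [0.0, 0.0])   # defaults to Nelder-Mead
    return Optim.minimum(res)               # accessor name per current Optim
end

datasets = [rand(100) for i in 1:8]
@time results = pmap(fit_one, datasets)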

Problem:

Julia 0.4 @time results for pmap-call:

  • single core: 3 s, 2 M allocations, 90 MB, 0.57% gc time
  • 8 cores: 1 s, 5 k allocations, 500 KB

Julia 0.5:

  • single core: 3 s, 2 M allocations, 80 MB, 0.71% gc time
  • 8 cores: 18 s, 1 k allocations, 150 KB

Thus, the multicore pmap call is 18 times slower on Julia 0.5.

It appears that the problem is related to Optim.jl; other parts of the function body applied via pmap do not produce this slowdown. Starting Julia with a single core (julia, not julia -p 8) and tracking CPU load, I noticed that the "single core" run still loads multiple CPU cores intermittently. Not sure if this is some kind of multithreading in Julia 0.5, or something else that was disabled in 0.4.

I tried fminbox with cg, and also unconstrained Nelder-Mead: same problem. Removing all @simd annotations from the Optim sources did not change much either. Note that Julia prints

WARNING: could not attach metadata for @simd loop.
#157 (comment)

Any suggestions on how to force Optim to use a single worker core? Or did I misunderstand what is happening?

@KristofferC
Contributor

Solving linear systems will use multiple cores unless you call Base.BLAS.set_num_threads (if I remember the function name correctly).
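
For example (a sketch), pinning OpenBLAS to one thread on the master and on every worker:

# Keep OpenBLAS from spawning its own threads, so it does not
# compete with the pmap worker processes for CPU cores.
Base.BLAS.set_num_threads(1)              # master process
@everywhere Base.BLAS.set_num_threads(1)  # all worker processes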

@pkofod
Member

pkofod commented Sep 28, 2016

As mentioned above, external calls from Julia to another library will not honor '-p'. Are you absolutely sure it is the same version of Optim?

@multidis
Author

Are you absolutely sure it is the same version of Optim?

Yes, I use the same fork with both Julia 0.4 and 0.5. The performance degradation actually reproduces across a range of Optim commits from the last two months.

Trying to look into BLAS threads now...

@multidis
Author

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@multidis
Author

I tried Base.BLAS.set_num_threads(1) in Julia 0.5: same results. The single-core run now seems to stay on a single core, so it must have been BLAS multithreading that loaded multiple cores. However, even with @everywhere Base.BLAS.set_num_threads(1), the multicore call stays as slow as before.

@multidis
Author

multidis commented Sep 28, 2016

Correction: Julia 0.5 also writes

WARNING: could not attach metadata for @simd loop.

Removing all @simd in the Optim sources speeds things up by a small factor, but it is still much slower than Julia 0.4, which prints the same @simd warning.

Perhaps there is more to it than just forcing Optim to use a single worker core.

@multidis
Author

An interesting inverse scaling:

julia
2.494269 seconds (1.90 M allocations: 80.106 MB, 0.55% gc time)

julia -p 2
11.449032 seconds (1.20 k allocations: 140.594 KB)

julia -p 4
16.890558 seconds (1.22 k allocations: 153.734 KB)

julia -p 8
20.968143 seconds (1.25 k allocations: 158.594 KB)

@amitmurthy

Can you post some sample code that I can test with?

@multidis
Author

@amitmurthy Here is a simple mockup of what I am running:
https://gist.github.com/multidis/660adbcb2e3b0df2e07436c75b5403a0

Note that in this particular case the list that pmap iterates over has only 3 elements, but in practice it may vary from 2 to thousands of elements. Run times with the actual, full code are also longer on v0.5 for larger lists (an observation that is difficult to reproduce with simplified mockup code).

Julia v0.4 runs (second @time call):

~/julia04 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
0.006862 seconds (21.71 k allocations: 8.112 MB)

~/julia04 -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
0.007946 seconds (2.47 k allocations: 7.827 MB)

Julia v0.5 runs (second @time call):

julia pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
0.008386 seconds (14.66 k allocations: 7.944 MB)

julia -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
(first call results)
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
4.073709 seconds (2.31 k allocations: 7.731 MB)

Note also that the @simd warnings are generated for the second function call as well, unlike the v0.4 behaviour.

Version specifics:

~/julia04 -e "versioninfo()"
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
System: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3

julia -e "versioninfo()"
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@amitmurthy

The numbers above are the first-time compilation overhead. I just added a few more runs of frunmap to the test script.

Ntimes=3

julia -p 8 --depwarn=no pmap_optim_time.jl 
Any[5.60283e-17,5.60283e-17,5.60283e-17]
  5.641841 seconds (5.99 M allocations: 262.483 MB, 3.54% gc time)
  2.350770 seconds (2.25 k allocations: 7.729 MB, 0.26% gc time)
  2.305116 seconds (1.70 k allocations: 7.715 MB)
  0.019322 seconds (620 allocations: 7.688 MB)
  0.024080 seconds (629 allocations: 7.689 MB, 21.92% gc time)
  0.015221 seconds (619 allocations: 7.688 MB)

Ntimes=8

julia -p 8 --depwarn=no pmap_optim_time.jl 
Any[5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17]
  7.750557 seconds (5.99 M allocations: 262.648 MB, 3.01% gc time)
  0.029905 seconds (1.44 k allocations: 7.766 MB, 19.75% gc time)
  0.014826 seconds (1.46 k allocations: 7.768 MB)
  0.014171 seconds (1.44 k allocations: 7.767 MB)
  0.019212 seconds (1.48 k allocations: 7.769 MB, 25.70% gc time)
  0.012541 seconds (1.45 k allocations: 7.767 MB)

In 0.5 all the workers are used in a round-robin fashion. With Ntimes=3, the first run primes workers 1-3, the second 4-6, and the third 7-8. Only from the fourth run onwards do you see the optimized timings. With Ntimes=8, the first run itself warms up all the workers.

In 0.4 only the first 3 workers are used every time, hence you see fast run times from the second run onwards.
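
One way to keep this warm-up out of the measurements is to touch every worker once before timing. A sketch, using the mockup's frunmap and an assumed input list name lst:

# Run frunmap once on each worker so per-process JIT compilation
# does not contaminate the timed pmap call (lst is an assumed name).
for p in workers()
    remotecall_fetch(frunmap, p, first(lst))
end
@time res = pmap(frunmap, lst)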

Run times with the actual, full code are longer in the case of v0.5 for larger lists as well (that observation is difficult to reproduce with a simplified mockup code).

How much longer? There is a slight slowdown due to changes in closure serialization. A CachingPool (http://docs.julialang.org/en/latest/stdlib/parallel/#Base.CachingPool) may help if that is the cause.
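
A sketch of that, again with the mockup's frunmap and an assumed input list lst:

# Cache the function on the workers so it is not re-serialized
# for every pmap task.
wp = CachingPool(workers())
@time res = pmap(wp, frunmap, lst)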

Also see JuliaLang/julia#16508

@multidis
Author

multidis commented Oct 3, 2016

Only from the fourth run onwards you see the optimized timings.

This explains a lot of my confusion, thanks @amitmurthy

After extensive testing taking this into account, the Julia 0.5 slowdown (compared to 0.4) is about 5 to 10% extra time. Interestingly, memory allocation is half of what it was in 0.4. Win on memory, lose on closures, I guess... CachingPool does not seem to affect run times.

I guess I'll stay with 0.4 in production to avoid additional computing costs, for as long as package dependencies work with 0.4. Closing the issue since nothing specific to Optim.jl was found.

@multidis multidis closed this as completed Oct 3, 2016
@pkofod
Member

pkofod commented Oct 3, 2016

Hm, you say closures... Is this related to JuliaLang/julia#15276 as well? Maybe when that is fixed, performance will make v0.5 interesting for you again.
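
(For context, julia#15276 is about captured variables being boxed inside closures. A rough, hypothetical illustration of the pattern and the usual let-rebinding workaround:)

# Hypothetical illustration of the julia#15276 pattern: a captured variable
# that is reassigned in the enclosing scope gets boxed, slowing the closure.
function make_objective(data)
    scale = 1.0
    scale = scale * 2               # reassignment forces scale to be boxed
    let scale = scale               # workaround: rebind so the closure
        return x -> scale * sum((data .- x[1]).^2)   # captures a constant
    end
end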

@multidis
Author

multidis commented Oct 3, 2016

Could be; frankly, I have not had a chance to dig into the closure issues in 0.5 in detail. Will definitely re-time with each subsequent 0.5.* update.

Looking into another performance degradation, possibly related to the change from MultivariateOptimizationResults{T,N} to MultivariateOptimizationResults{T,N,M}, but not sure yet, so keeping this closed for now.

@multidis
Author

multidis commented Oct 4, 2016

Moving this to a new issue
#292
