Notable performance slowdown after PR #99 in julia-0.6.0-pre.alpha #102
Comments
Thanks for reporting. At some point during the development of #99, I also noticed a performance regression. Yet, that was addressed and the performance was back.

For benchmarking TaylorSeries I use Fateman's tests. The following table shows some results I obtained for fateman1(20), using BenchmarkTools, considering the commit 487cb8b (before #99 was merged), 0697deb (#99), and the current master. The figures indicate the minimum time and the memory estimate provided by `@benchmark fateman1(20) seconds=20`.

TaylorSeries commit | Julia 0.5 | Julia 0.6
---|---|---
487cb8b | 2.578 s, 33.39 MiB | 2.583 s, 25.51 MiB
0697deb | 2.441 s, 19.27 MiB | 2.448 s, 19.26 MiB
master | 2.547 s, 19.27 MiB | 2.460 s, 19.26 MiB

The table shows that the best timings are obtained for 0697deb; the memory usage is actually improved in #99. I must say that different runs show variations of ~10% in the timing results (whatever goes on in the CPU). The tests considered 8 or 9 runs, so we are far from having good statistics.

The problem with the performance that I noted was related to `TaylorN` multiplication (and similarly in other functions), which was allocating unnecessary arrays. This was related to the use of `*` for `HomogeneousPolynomial` products in the form `c[k+1] += a[i+1] * b[k-i+1]` instead of `mul!(c[k+1], a[i+1], b[k-i+1])`. The subtlety is that using `a[i+1] * b[k-i+1]` creates a temporary array, while `mul!(c[k+1], a[i+1], b[k-i+1])` does not.

Can you check if the performance hit happens in general, or if it happens concretely in the integration of the variational equations or the Liapunov spectrum?

One more comment. While checking the time the tests take gives you a first approximation to this, I think it is better to use more sophisticated tools, simply because other processes on the CPU may influence the results, aside from compilation issues. |
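The temporary-allocation issue described here is language-agnostic; a minimal Python sketch of the difference (the function names `mul_temp` and `mul_inplace` are illustrative stand-ins, not the TaylorSeries API, and the elementwise product is only a stand-in for the actual polynomial convolution):

```python
# Coefficients modeled as plain lists of floats.
# The naive c += a*b builds a temporary list for the product a*b before
# adding it to c; the in-place version accumulates directly into c.

def mul_temp(c, a, b):
    """c += a*b, allocating a temporary list for the product."""
    tmp = [x * y for x, y in zip(a, b)]  # temporary allocation per call
    for i, t in enumerate(tmp):
        c[i] += t

def mul_inplace(c, a, b):
    """c += a*b with no intermediate allocation (the mul! idea)."""
    for i in range(len(c)):
        c[i] += a[i] * b[i]

a, b = [1.0, 2.0], [3.0, 4.0]
c1, c2 = [0.0, 0.0], [0.0, 0.0]
mul_temp(c1, a, b)
mul_inplace(c2, a, b)
assert c1 == c2 == [3.0, 8.0]
```

In a hot inner loop such as the convolution sum over k and i, the first form allocates and discards a temporary once per term, which is exactly the pressure an in-place `mul!` removes.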
By "more sophisticated tools", I think Luis is referring to the BenchmarkTools.jl package. I second this recommendation.
|
Sorry if I was ambiguous... |
Are Fateman's tests only for
Thanks for the feedback! I did as both of you suggested and benchmarked TaylorIntegration.jl's tests using BenchmarkTools.jl, for the commits 487cb8b, 0697deb, and current master (47b201c). At TaylorIntegration.jl's root folder, for each of the above commits of TaylorSeries, I ran the following code:

```julia
using BenchmarkTools
mybench = @benchmarkable include("./test/runtests.jl") seconds=1e10 evals=10 samples=2
@time include("./test/runtests.jl") # warmup lap
run(mybench)
```

My results were the following:
Sure! I'm going to separate TaylorIntegration.jl's tests (the ones that use |
Responding to your question: Fateman's test is usually for |
Regarding the benchmarks, my feeling is that #99 induced that the naive approach we use in |
Here are the results for the "separated" benchmarks; i.e., those which use `TaylorN` and those which don't.

Non-`TaylorN` TaylorIntegration.jl tests:

- julia-0.5.1:
TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
---|---|---|---|---|
487cb8b | 112.21 MiB | 1810163 | 4.803 s (1.06% GC) | 4.907 s (1.07% GC) |
0697deb | 125.21 MiB | 2084495 | 4.746 s (1.00% GC) | 4.748 s (1.06% GC) |
master (47b201c) | 125.66 MiB | 2094893 | 4.686 s (1.07% GC) | 4.764 s (1.10% GC) |
- julia-0.6.0.pre.alpha.315:
TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
---|---|---|---|---|
487cb8b | 124.03 MiB | 1967996 | 4.427 s (1.16% GC) | 4.473 s (1.13% GC) |
0697deb | 137.31 MiB | 2241712 | 4.699 s (1.24% GC) | 4.718 s (1.28% GC) |
master (47b201c) | 137.55 MiB | 2247012 | 4.549 s (1.23% GC) | 4.618 s (1.25% GC) |
TaylorIntegration.jl tests which involve `TaylorN`:
- julia-0.5.1:
TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
---|---|---|---|---|
487cb8b | 12.91 GiB | 142833706 | 13.829 s (16.46% GC) | 14.586 s (18.56% GC) |
0697deb | 20.94 GiB | 146733942 | 25.631 s (17.93% GC) | 25.692 s (18.74% GC) |
master (47b201c) | 20.41 GiB | 159212234 | 25.313 s (16.25% GC) | 26.327 s (18.54% GC) |
- julia-0.6.0.pre.alpha.315:
TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
---|---|---|---|---|
487cb8b | 10.80 GiB | 110187006 | 10.409 s (18.18% GC) | 10.695 s (19.75% GC) |
0697deb | 20.20 GiB | 122406324 | 16.521 s (19.23% GC) | 17.247 s (20.95% GC) |
master (47b201c) | 19.49 GiB | 122409215 | 16.702 s (19.50% GC) | 17.083 s (21.20% GC) |
Ok, thanks!
From the above benchmarks, I agree with you: I think we can say the performance slowdown observed in TaylorIntegration.jl tests is mainly due to a worse use of memory with |
Can you please tell me which version (commit) of TaylorIntegration you are using? Regarding the non- |
I'm using the latest commit in PerezHz/TaylorIntegration.jl#18 (3a8e12e) |
Below, I show some benchmarks which don't involve TaylorIntegration.jl, nor

```julia
function NBP_pN!(t::Float64, q::Array{Taylor1{Float64}, 1}, dq::Array{Taylor1{Float64}, 1})
    # ... some calculations here ...
    for i in eachindex(dq)
        dq[i] = # ... in-place assignments ...
    end
    nothing
end
```

Then I did
So for this example, there's a ~6x slowdown in execution time, and almost 2x as much memory used when using master and 0697deb, vs commit 487cb8b. Also, |
As described in the performance tips, you can run a Julia script with the option
to see where memory allocation is occurring. |
Thanks @dpsanders, I will try that! |
Below, I detail a reproducible (simpler) example of what I think is a performance slowdown of arithmetic operations (sums and products) with

First I ran the following code:

```julia
using TaylorSeries, BenchmarkTools

# some parameters, etc.
const order = 28
const q0 = [19.0, 20.0, 50.0]
const σ = 16.0
const β = 4.0
const ρ = 45.92
const t0 = 0.0
const q0T = Array{Taylor1{Float64}}(3)
const dq0T = Array{Taylor1{Float64}}(3)

# the equations of the Lorenz system
function lorenz!(t::Float64, x::Array{Taylor1{Float64}}, dx::Array{Taylor1{Float64}})
    dx[1] = σ*(x[2]-x[1])
    dx[2] = x[1]*(ρ-x[3])-x[2]
    dx[3] = x[1]*x[2]-β*x[3]
    nothing
end

# fill the first coeffs. of each element of q0T with the corresponding elements of q0
for i in eachindex(q0)
    q0T[i] = Taylor1(q0[i], order)
end

# the function that we will benchmark
function lorenzmanytimes()
    for i in 1:1000000
        lorenz!(t0,q0T,dq0T)
    end
end
```

Then for the benchmarks themselves I ran:

```julia
mybench = @benchmarkable lorenzmanytimes() seconds=1e10 evals=10 samples=1
lorenzmanytimes() # warmup lap for lorenzmanytimes()
run(mybench)
```

And got the following results (the data in the table reads as: estimated allocated memory, estimated number of allocations, and the mean execution time):
|
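As an aside, the warmup-lap-plus-repeated-samples methodology used throughout this thread generalizes beyond BenchmarkTools.jl. A rough Python sketch of the same idea (illustrative only, not the BenchmarkTools API; timings vary by machine):

```python
import time

def benchmark(f, evals=10, samples=5):
    """Run f `evals` times per sample; report min and mean per-eval time.
    The minimum is the usual low-noise estimator, since other processes
    on the CPU can only make a run slower, never faster."""
    f()  # warmup lap: exclude one-time setup/compilation costs
    times = []
    for _ in range(samples):
        t0 = time.perf_counter()
        for _ in range(evals):
            f()
        times.append((time.perf_counter() - t0) / evals)
    return min(times), sum(times) / len(times)

tmin, tmean = benchmark(lambda: sum(range(1000)))
assert 0 < tmin <= tmean
```

The min/mean split mirrors the "min time" and "mean time" columns in the tables above.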
Following @dpsanders' suggestion, I ran the Lorenz system example which I detailed above, and got the `.mem` files for the TaylorSeries.jl files in the

The only memory-allocating lines in the file

```
        - function resize_coeffs1!{T<:Number}(coeffs::Array{T,1}, order::Int)
        0     lencoef = length(coeffs)
        0     order ≤ lencoef-1 && return nothing
     1008     resize!(coeffs, order+1)
      144     coeffs[lencoef+1:end] .= zero(coeffs[1])
        0     return nothing
        - end
```

and also lines 140 through 150:

```
        - ## fixorder ##
        - for T in (:Taylor1, :TaylorN)
        -     @eval begin
        -         fixorder(a::$T, order::Int64) = $T(a.coeffs, order)
        -         function fixorder{R<:Number}(a::$T{R}, b::$T{R})
 64000000             a.order == b.order && return a, b
        0             a.order < b.order && return $T(a.coeffs, b.order), b
        0             return a, $T(b.coeffs, a.order)
        -         end
        -     end
        - end
```

The only memory-allocating lines for the same file,

```
        - function resize_coeffs1!{T<:Number}(coeffs::Array{T,1}, order::Int)
        0     lencoef = length(coeffs)
        0     order ≤ lencoef-1 && return nothing
672001008     resize!(coeffs, order+1)
 96000144     coeffs[lencoef+1:end] .= zero(coeffs[1])
        0     return nothing
        - end
```

In particular, I noted the line

On the other hand, in the |
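The `.mem`-file workflow above (per-line allocation counts) has a rough Python analogue in `tracemalloc`, which also attributes allocations to source lines. A sketch contrasting a resize-per-call pattern, similar in spirit to the `resize_coeffs1!` hotspot, with preallocated storage (all names are illustrative, not TaylorSeries code):

```python
import tracemalloc

def grow_each_call(buf, n):
    """Mimics grow-on-every-call storage: extend the list each time."""
    buf.extend([0.0] * n)  # allocates on every call

def preallocated(buf):
    """Reuse existing storage; no new allocation per call."""
    for i in range(len(buf)):
        buf[i] = 0.0

tracemalloc.start()
buf1, buf2 = [], [0.0] * 10000
snap0 = tracemalloc.take_snapshot()
for _ in range(100):
    grow_each_call(buf1, 100)
    preallocated(buf2)
snap1 = tracemalloc.take_snapshot()
tracemalloc.stop()

# Statistics grouped by line show where the bytes came from,
# much like the counts in a .mem file.
for stat in snap1.compare_to(snap0, "lineno")[:3]:
    print(stat)
```

After the loop, `buf1` has grown to 10000 elements through repeated allocations, while `buf2` kept its original storage throughout.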
Just noticing that the heavy use of |
Thanks for checking this so carefully. Yesterday, I simply changed line 20 of auxiliary.jl to

I can't look at this today, so I'll check it tomorrow. |
I'm not sure what is happening. It seems to me that using dot-operations inside

Simply put, the line

```julia
@__dot__ v = $f(a.coeffs, b.coeffs)
```

(inside an

```julia
for ind = 1:length(a.coeffs)
    v[ind] = $f(a[ind], b[ind])
end
```

does not. @dpsanders Do you know of this kind of problem? I guess that a solution is using |
What does

In any case, if this is a problem, it should certainly be reported as an issue on the Julia repo. |
You can use |
I'll check how everything is expanded... Thanks for the suggestion! |
Are you checking all these allocations on 0.6? Dot fusion is less powerful on 0.5 |
I'm checking it in 0.5, though I'm aware that they are less powerful. The problem seems to appear in both versions... |
I actually don't think there's too much point in worrying about 0.5...
|
I agree with you; yet, according to what @PerezHz is reporting (see this comment), the problem also occurs in Julia 0.6. |
The use of those if's is at the heart of the problem reported in [#102]. The idea was to avoid fixorder by setting the coefficients that are not defined to zero. This involves if's that killed performance. Also, fixorder was corrected to return a copy of the coefficients to avoid side effects. `max_order` is no longer needed, so it is deleted.
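The trade-off this commit message describes can be illustrated outside Julia: either every coefficient access branches ("return zero beyond the stored order"), or both operands are first padded to a common order (the `fixorder` idea) so the inner loop indexes unconditionally. A hedged Python sketch, not the TaylorSeries implementation:

```python
def get_branchy(coeffs, i):
    """getindex with an if: coefficients beyond the stored order read as zero."""
    return coeffs[i] if i < len(coeffs) else 0.0

def fixorder(a, b):
    """Pad copies of both coefficient lists to a common length.
    Returning copies avoids side effects on the callers' arrays."""
    n = max(len(a), len(b))
    return a + [0.0] * (n - len(a)), b + [0.0] * (n - len(b))

a, b = [1.0, 2.0], [3.0, 4.0, 5.0]
# Branchy path: a check on every access inside the hot loop.
s1 = [get_branchy(a, i) + get_branchy(b, i) for i in range(3)]
# fixorder path: pad once up front, then index with no per-access branch.
ap, bp = fixorder(a, b)
s2 = [ap[i] + bp[i] for i in range(3)]
assert s1 == s2 == [4.0, 6.0, 5.0]
```

Both paths compute the same sum; the difference is where the cost lands: one branch per access versus one padding pass (and two small copies) per operation.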
Taking back what I said before on the dot operations, I think the problem is actually related to some changes in

I just submitted a PR (see #103) to see if reverting that brings things in order, and it seems so. |
Well, it seems to solve the problem for julia 0.5.1 (at least for the total allocated memory, although the number of allocations still goes up). For julia-0.6.0-pre.alpha, #103 seems to help a lot, although in this latter case there still seems to be a bit of a slowdown when benchmarking
|
I have just heard about PkgBenchmark.jl, which should help for these kinds of comparisons. |
Thanks @dpsanders for the tip on PkgBenchmark.jl; I'll check it later. @PerezHz I just pushed another commit which I think slightly improves the performance over 487cb8b. Can you confirm this? |
On my machine I ran the

```julia
function lorenzmanytimes()
    for i in 1:5000000 # <--- notice the change here; previous value was 1000000
        lorenz!(t0,q0T,dq0T)
    end
end
```

i.e., instead of evaluating

For these benchmarks, I found that #103's latest commit is essentially on par with 487cb8b, and even slightly better (wrt execution time and total allocated memory), on julia 0.5.1! Actually, total allocated memory goes down by ~30% for julia 0.5.1 when comparing #103 vs 487cb8b. I would say this solves the performance regression, at least for julia 0.5.1. On the other hand, what I noticed for julia 0.6.0-pre.alpha is that even though total allocated memory is better in #103 than 487cb8b (15.80 GiB vs 16.02 GiB), the total number of allocations seems to be slightly higher in #103 (1.25x10^8 vs 1.1x10^8), as well as execution time (10.759 s vs 9.455 s). I see a similar trend for other benchmarks with TaylorIntegration.
|
Just updated to |
I just pushed another commit (33a2cb7), which improves the number of allocations and the memory usage. Locally, my results are the following:

I can't beat the timing of 487cb8b using your benchmarks by ~10%. I guess that this is due to some optimization (in

@PerezHz Can you check in your application whether the slowdown for 33a2cb7 is as bad as before? I will see if there is something I can do to further improve the timing, concretely in the method mentioned above for |
For the

On the other hand, I benchmarked TaylorIntegration tests, and found that while total allocated memory is better (~25%) for 33a2cb7 vs 487cb8b, the number of allocs is slightly worse (~13%), and execution time is actually slightly better, by about 8%. So I'd be happy to close this issue as far as the original slowdown report is concerned! Thank you @lbenet for all the effort you're putting into this 😄 ! |
Great news! @dpsanders Do you agree to merge #103? |
I haven't been following very closely. If #103 solves the problem, then that sounds good to me, even if it's aesthetically less pleasing to some extent. Hopefully in the future, we may be able to find a different solution, but for now it seems like a good one! |
Sorry guys, I have to take back my word about performance for my TaylorIntegration application 😞... I just realised that I didn't correctly check out the various TaylorSeries commits when benchmarking. After re-doing it carefully, I'm seeing 33a2cb7 about 8% higher in total allocated memory, 20% higher in number of allocs, and ~10% slower in execution time on average vs 487cb8b. This correction is only for the N-body problem application with TaylorIntegration. The other two benchmarks ( |
Sorry for the delay in getting back to this. I've been playing a little bit with the old and new implementations and, including

I just pushed a commit with those changes; @PerezHz can you check that this indeed improves the benchmarks over 33a2cb7? There may be a marginal difference in favor of 487cb8b; yet, I prefer to stick to the current implementation since I think it will permit doing some other nice stuff. |
Sure, @lbenet! Just benchmarked the latest commit in #103 vs 487cb8b, and at least on my machine, for the
|
Great news! I think this should be merged, to continue with other stuff. |
Since the |
I actually thought to include |
Well, I was thinking of something like |
Also, right now I'm benchmarking one of my |
I think I would be more interested in the TaylorIntegration.jl application 😄 I'll do the keplermanytimes() test... |
If you use
|
Thanks @dpsanders! For one of my

For commit 487cb8b:
So specifically for this benchmark, 487cb8b seems to still be marginally faster than c19c1eb, but I also found that this difference is almost unnoticeable on longer runs. |
* Use `fixorder` again, to avoid the if's used in `getindex`

  The use of those if's is at the heart of the problem reported in [#102]. The idea was to avoid fixorder by setting the coefficients that are not defined to zero. This involves if's that killed performance. Also, fixorder was corrected to return a copy of the coefficients to avoid side effects. `max_order` is no longer needed, so it is deleted.

* Fix tests and fateman.jl
* Tiny correction to fixorder
* Improvements on allocations
* Inline mul! and div! methods
* Use a modified power_by_squaring, and inline the (mutating) recursion functions
* Inline fixorder
* Inlining div!
Thanks @lbenet for all the effort you and @dpsanders have put into this; it is really nice to have performance back! For the record, after #103 was merged I benchmarked once again one of my

commit 487cb8b:
current master (b173816):
So it seems that current master is performing better than 487cb8b in julia 0.6.0-pre.beta, as far as both total allocated memory and number of allocs are concerned; execution time is essentially on par with 487cb8b! I also ran some other |
Thanks for the update! |
That seems to be an enormous amount of memory being allocated in a very short time... |
While working with TaylorIntegration.jl and TaylorSeries.jl, I came across a notable slowdown after PR #99. While benchmarking TaylorIntegration.jl tests using commit 3a8e12e of PerezHz/TaylorIntegration.jl#18 on julia 0.6.0-pre.alpha.315, I got the following typical values (as always, I ran a "warmup lap" before benchmarking to discard compilation overhead):
Julia 0.6.0-pre.alpha.315 TaylorIntegration.jl tests benchmarks for different commits of TaylorSeries:
- TaylorSeries latest master (commit 47b201c, PR "Add `getindex`, `setindex!` methods for `::Colon`" #101): 22.468 seconds (124.61 M allocations: 19.621 GiB)
- TaylorSeries commit 0697deb (PR "Refactor functions" #99): 22.786 seconds (124.60 M allocations: 20.336 GiB, 15.47% gc time)
- TaylorSeries commit 487cb8b (PR "Add getindex; replace a.coeffs[i] with a[i]" #97): 16.958 seconds (112.10 M allocations: 10.923 GiB, 14.04% gc time)
So there's a slowdown of about 30% in typical execution times, about 11% more allocations, and about 50% more total bytes allocated when using commits 47b201c and 0697deb vs using commit 487cb8b in julia 0.6.0-pre.alpha. I also tested some "heavier" benchmarks on two different machines, for a more complicated problem than the ones included in TaylorIntegration.jl's tests, and the slowdown is even bigger, about 50%.
Interestingly enough, this slowdown does not occur in julia 0.5.1. Still, julia-0.5.1's fastest time is slower than julia-0.6.0-pre.alpha's slowest time:
Julia 0.5.1 TaylorIntegration.jl tests benchmarks for different commits of TaylorSeries:
- TaylorSeries commit 47b201c: 32.264 seconds (161.27 M allocations: 20.527 GB, 15.54% gc time)
- TaylorSeries commit 0697deb: 32.350 seconds (161.27 M allocations: 20.528 GB, 14.73% gc time)
- TaylorSeries commit 487cb8b: 32.849 seconds (161.27 M allocations: 20.527 GB, 14.51% gc time)