Notable performance slowdown after PR #99 in julia-0.6.0-pre.alpha #102

Closed · PerezHz opened this issue Mar 31, 2017 · 51 comments


PerezHz commented Mar 31, 2017

While working with TaylorIntegration.jl and TaylorSeries.jl, I came across a notable slowdown after PR #99. Benchmarking TaylorIntegration.jl's tests with commit 3a8e12e from PerezHz/TaylorIntegration.jl#18 on julia 0.6.0-pre.alpha.315, I got the following typical values (as always, I ran a "warmup lap" before benchmarking to discard compilation overhead):

Julia 0.6.0-pre.alpha.315 TaylorIntegration.jl tests benchmarks for different commits of TaylorSeries:

So there's a slowdown of about 30% in typical execution times, about 11% more allocations, and about 50% more total bytes allocated when using commits 47b201c and 0697deb vs commit 487cb8b on julia 0.6.0-pre.alpha. I also ran some "heavier" benchmarks on two different machines, for a more complicated problem than the ones included in TaylorIntegration.jl's tests, and there the slowdown is even bigger, about 50%.

Interestingly enough, this slowdown does not occur in julia 0.5.1. Still, julia-0.5.1's fastest time is slower than julia-0.6.0-pre.alpha's slowest time:

Julia 0.5.1 TaylorIntegration.jl tests benchmarks for different commits of TaylorSeries:

  • TaylorSeries commit 47b201c:
    32.264 seconds (161.27 M allocations: 20.527 GB, 15.54% gc time)

  • TaylorSeries commit 0697deb:
    32.350 seconds (161.27 M allocations: 20.528 GB, 14.73% gc time)

  • TaylorSeries commit 487cb8b:
    32.849 seconds (161.27 M allocations: 20.527 GB, 14.51% gc time)


lbenet commented Mar 31, 2017

Thanks for reporting. At some point during the development of #99, I also noticed a performance regression. Yet, that was addressed and performance was recovered.

For benchmarking TaylorSeries I use Fateman's tests. The following table shows some results I obtained for fateman1(20), using BenchmarkTools, considering commits 487cb8b (before #99 was merged), 0697deb (#99), and the current master. The figures indicate the minimum time and the memory estimate reported by @benchmark fateman1(20) seconds=20.

| commit | Julia 0.5 | Julia 0.6 |
|---|---|---|
| 487cb8b | 2.578 s, 33.39 MiB | 2.583 s, 25.51 MiB |
| 0697deb | 2.441 s, 19.27 MiB | 2.448 s, 19.26 MiB |
| master | 2.547 s, 19.27 MiB | 2.460 s, 19.26 MiB |

The table shows that the best timings are obtained for 0697deb; memory usage is actually improved in #99. I must say that different runs show variations of about 10% in the timing results (whatever else is going on in the CPU). The tests involved only 8 or 9 runs, so we are far from having good statistics.

The performance problem I noted was related to TaylorN multiplication (and similarly in other functions), which was allocating unnecessary arrays. It came from using * for HomogeneousPolynomial products in the form c[k+1] += a[i+1] * b[k-i+1] instead of mul!(c[k+1], a[i+1], b[k-i+1]). The subtlety is that a[i+1] * b[k-i+1] creates a temporary array, while mul!(c[k+1], a[i+1], b[k-i+1]) does not.
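To make the allocation difference concrete, here is a minimal sketch using plain Vector{Float64} instead of the actual HomogeneousPolynomial type (the function names are illustrative, not TaylorSeries API):

using BenchmarkTools

# Allocating pattern: the product is materialized as a temporary array
# before being accumulated into c.
function accum_with_temporary!(c, a, b)
    tmp = a .* b              # allocates a new array on every call
    for i in eachindex(c)
        c[i] += tmp[i]
    end
    return c
end

# In-place pattern: accumulate each product directly into c, no temporaries.
function accum_in_place!(c, a, b)
    for i in eachindex(c)
        c[i] += a[i] * b[i]
    end
    return c
end

a, b, c = rand(100), rand(100), zeros(100)
@benchmark accum_with_temporary!($c, $a, $b)   # reports one array allocation per call
@benchmark accum_in_place!($c, $a, $b)         # reports zero allocations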

Can you check if the performance hit happens in general, or if it happens concretely in the integration of the variational equations or the Liapunov spectrum?

One more comment. While checking the time that takes the tests gives you a first approximation to this, I think it is better to use more sophisticated tools, simply because other processes in the CPU may influence the results, aside from compilation issues.


dpsanders commented Mar 31, 2017 via email


lbenet commented Mar 31, 2017

Sorry if I was ambiguous...


PerezHz commented Mar 31, 2017

> For benchmarking TaylorSeries I use Fateman's tests.

Are Fateman's tests only for TaylorN or do they test Taylor1 also?

> One more comment: while checking the time the tests take gives you a first approximation, I think it is better to use more sophisticated tools, simply because other processes on the CPU may influence the results, aside from compilation issues.

+

> By "more sophisticated tools", I think Luis is referring to the BenchmarkTools.jl package. I second this recommendation.

Thanks for the feedback! I did as both of you suggested and benchmarked TaylorIntegration.jl's tests using BenchmarkTools.jl, for the commits 487cb8b, 0697deb, and current master (47b201c).

At TaylorIntegration.jl's root folder, for each of the above commits of TaylorSeries, I ran the following code:

using BenchmarkTools
mybench = @benchmarkable include("./test/runtests.jl") seconds=1e10 evals=10 samples=2
@time include("./test/runtests.jl") #warmup lap
run(mybench)

My results were the following:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 10.92 GiB | 112104218 | 16.230 s (15.92% GC) | 17.082 s (18.62% GC) |
| 0697deb | 20.34 GiB | 124597238 | 22.737 s (17.60% GC) | 23.778 s (20.32% GC) |
| 47b201c | 19.62 GiB | 124605429 | 22.252 s (17.70% GC) | 23.325 s (20.17% GC) |

> Can you check if the performance hit happens in general, or if it happens concretely in the integration of the variational equations or the Liapunov spectrum?

Sure! I'm going to separate TaylorIntegration.jl's tests into the ones that use TaylorN and the ones which do not, and report back here.


lbenet commented Mar 31, 2017

Responding to your question: Fateman's test is usually for TaylorN. fateman1(20) corresponds to computing s*(s+1), with s = (1+x+y+z+w)^20. The idea is not to compute s^2+s, which can be optimized. This computation requires Int128 if you want the exact coefficients.
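For reference, a self-contained sketch of that computation (the actual fateman1 in TaylorSeries' perf scripts may be organized differently, e.g. building the series from HomogeneousPolynomial pieces):

using TaylorSeries

# Fateman-style test: s = (1 + x + y + z + w)^degree, then s*(s+1).
# Int128 coefficients hold the exact values; the order is 2*degree so that
# the full product fits without truncation.
function fateman_sketch(degree::Int)
    x, y, z, w = set_variables(Int128, "x", numvars=4, order=2*degree)
    s = (1 + x + y + z + w)^degree
    return s * (s + 1)        # deliberately s*(s+1), not the optimizable s^2 + s
end

fateman_sketch(20)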


lbenet commented Mar 31, 2017

Regarding the benchmarks, my feeling is that after #99 the naive approach we use in TaylorIntegration.jl with TaylorN creates temporary arrays, which roughly doubles the memory use, and thus the time.


PerezHz commented Mar 31, 2017

Here are the results for the "separated" benchmarks, i.e., splitting those which use TaylorN (jet transport, Lyapunov spectrum) from those which do not (typical ODE integration with Float64 variables). Benchmarks were performed on both julia-0.5.1 and julia-0.6.0.pre.alpha.315.

Non-TaylorN TaylorIntegration.jl tests:

  • julia-0.5.1:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 112.21 MiB | 1810163 | 4.803 s (1.06% GC) | 4.907 s (1.07% GC) |
| 0697deb | 125.21 MiB | 2084495 | 4.746 s (1.00% GC) | 4.748 s (1.06% GC) |
| master (47b201c) | 125.66 MiB | 2094893 | 4.686 s (1.07% GC) | 4.764 s (1.10% GC) |

  • julia-0.6.0.pre.alpha.315:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 124.03 MiB | 1967996 | 4.427 s (1.16% GC) | 4.473 s (1.13% GC) |
| 0697deb | 137.31 MiB | 2241712 | 4.699 s (1.24% GC) | 4.718 s (1.28% GC) |
| master (47b201c) | 137.55 MiB | 2247012 | 4.549 s (1.23% GC) | 4.618 s (1.25% GC) |

TaylorIntegration.jl tests which involve TaylorN:

  • julia-0.5.1:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 12.91 GiB | 142833706 | 13.829 s (16.46% GC) | 14.586 s (18.56% GC) |
| 0697deb | 20.94 GiB | 146733942 | 25.631 s (17.93% GC) | 25.692 s (18.74% GC) |
| master (47b201c) | 20.41 GiB | 159212234 | 25.313 s (16.25% GC) | 26.327 s (18.54% GC) |

  • julia-0.6.0.pre.alpha.315:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 10.80 GiB | 110187006 | 10.409 s (18.18% GC) | 10.695 s (19.75% GC) |
| 0697deb | 20.20 GiB | 122406324 | 16.521 s (19.23% GC) | 17.247 s (20.95% GC) |
| master (47b201c) | 19.49 GiB | 122409215 | 16.702 s (19.50% GC) | 17.083 s (21.20% GC) |


PerezHz commented Mar 31, 2017

> Responding to your question: Fateman's test is usually for TaylorN.

Ok, thanks!

> Regarding the benchmarks, my feeling is that after #99 the naive approach we use in TaylorIntegration.jl with TaylorN creates temporary arrays, which roughly doubles the memory use, and thus the time.

From the above benchmarks, I agree with you: the performance slowdown observed in TaylorIntegration.jl's tests is mainly due to worse memory usage with TaylorN in 0697deb and master vs 487cb8b, although there is also a smaller slowdown in ODE integrations which do not involve TaylorN. Actually, I have a specific ODE example (the post-Newtonian N-body problem for the Solar System) which does not involve TaylorN; there I observe a greater slowdown than these benchmarks suggest, and that was one of the problems I was working on when I first noticed this issue. For completeness I will also report those benchmarks here.


lbenet commented Apr 1, 2017

Can you please tell me which version (commit) of TaylorIntegration you are using?

Regarding the non-TaylorN tests, I think they essentially stay the same. The TaylorN tests indeed got worse, and I think this is because we are somehow creating temporary arrays that allocate memory...


PerezHz commented Apr 1, 2017

> Can you please tell me which version (commit) of TaylorIntegration you are using?

I'm using the latest commit in PerezHz/TaylorIntegration.jl#18 (3a8e12e)


PerezHz commented Apr 1, 2017

Below I show some benchmarks which involve neither TaylorIntegration.jl nor TaylorN variables. I defined a function named NBP_pN!, which takes a Float64 and two arrays of Taylor1{Float64}, does some internal calculations, and updates one of these arrays in place; the function has the following structure:

function NBP_pN!(t::Float64, q::Array{Taylor1{Float64}, 1}, dq::Array{Taylor1{Float64}, 1})
    
    #... some calculations here...

    for i in eachindex(dq)
        dq[i] = # ... in-place assignments ...
    end
    nothing
end

Then I did mybench = @benchmarkable NBP_pN!(t0,q0T,dq0T) seconds=1e10 evals=10 samples=100, where t0::Float64, q0T::Array{Taylor1{Float64}} and dq0T::Array{Taylor1{Float64}} (q0T and dq0T have 60 elements each), and benchmarked it using @time run(mybench). The results were:

  • julia-0.5.1:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 4.76 MiB | 31600 | 3.373 ms (7.42% GC) | 3.558 ms (9.71% GC) |
| 0697deb | 13.53 MiB | 43760 | 22.767 ms (8.68% GC) | 23.434 ms (9.02% GC) |
| master (47b201c) | 13.53 MiB | 43760 | 21.538 ms (0.00% GC) | 24.361 ms (10.37% GC) |

  • julia-0.6.0.pre.alpha.315:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 3.12 MiB | 23130 | 2.528 ms (5.91% GC) | 2.658 ms (9.51% GC) |
| 0697deb | 13.30 MiB | 36390 | 15.923 ms (9.95% GC) | 16.483 ms (11.23% GC) |
| master (47b201c) | 13.30 MiB | 36390 | 17.295 ms (11.10% GC) | 17.832 ms (11.70% GC) |

So for this example, there's a ~6x slowdown in execution time, and almost 2x as much memory used when using master and 0697deb, vs commit 487cb8b. Also, @time run(mybench) reports that the total benchmark execution time goes from around 5 seconds to about 19 seconds.

dpsanders (Contributor) commented:

As described in the performance tips, you can run a Julia script with the option

--track-allocation=user

to see where memory allocation is occurring.
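For completeness, a sketch of that workflow (the script name here is hypothetical): start Julia as julia --track-allocation=user and then run

# First run compiles everything; its allocations would pollute the counts.
include("lorenz_benchmark.jl")

# Discard the allocation data gathered so far (compilation noise).
# On Julia >= 0.7 this requires `using Profile`; on 0.5/0.6 Profile is in Base.
Profile.clear_malloc_data()

# Run again; after quitting Julia, per-line byte counts appear in *.jl.mem files.
include("lorenz_benchmark.jl")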


PerezHz commented Apr 1, 2017

Thanks @dpsanders, I will try that!


PerezHz commented Apr 1, 2017

Below I detail a simpler, reproducible example of what I think is a performance slowdown in arithmetic operations (sums and products) with Taylor1 after PR #99:

First I ran the following code:

using TaylorSeries, BenchmarkTools

#some parameters, etc.
const order = 28
const q0 = [19.0, 20.0, 50.0]
const σ = 16.0
const β = 4.0
const ρ = 45.92
const t0 = 0.0
const q0T = Array{Taylor1{Float64}}(3)
const dq0T = Array{Taylor1{Float64}}(3)

#the equations of the Lorenz system
function lorenz!(t::Float64, x::Array{Taylor1{Float64}}, dx::Array{Taylor1{Float64}})
    dx[1] = σ*(x[2]-x[1])
    dx[2] = x[1]*(ρ-x[3])-x[2]
    dx[3] = x[1]*x[2]-β*x[3]
    nothing
end

#fill the first coeffs. of each element of q0T with the corresponding elements of q0
for i in eachindex(q0)
    q0T[i] = Taylor1(q0[i], order)
end

#the function that we will benchmark
function lorenzmanytimes()
    for i in 1:1000000
        lorenz!(t0,q0T,dq0T)
    end
end

Then for the benchmarks themselves I ran:

mybench = @benchmarkable lorenzmanytimes() seconds=1e10 evals=10 samples=1
lorenzmanytimes() #warmup lap for lorenzmanytimes()
run(mybench)

And got the following results (the data in the table reads as: estimated allocated memory, estimated number of allocations, and the mean time of execution):

| commit | julia 0.6.0-pre.alpha.315 | julia 0.5.1 |
|---|---|---|
| 487cb8b | 3.20 GiB, 22000000 allocs, 1.977 sec | 4.86 GiB, 30000000 allocs, 2.444 sec |
| 0697deb | 11.35 GiB, 31000000 allocs, 12.102 sec | 11.58 GiB, 39000000 allocs, 16.768 sec |
| master (47b201c) | 11.35 GiB, 31000000 allocs, 12.449 sec | 11.58 GiB, 39000000 allocs, 17.090 sec |


PerezHz commented Apr 1, 2017

Following @dpsanders' suggestion, I ran the Lorenz system example detailed above and got the .mem files for the TaylorSeries.jl files in the src directory. As far as I can tell, the main difference in memory allocation between commits 487cb8b and 0697deb is in the src/auxiliary.jl file.

The only memory-allocating lines in the file src/auxiliary.jl.mem, for the 487cb8b commit, are lines 17 through 23:

        - function resize_coeffs1!{T<:Number}(coeffs::Array{T,1}, order::Int)
        0     lencoef = length(coeffs)
        0     order ≤ lencoef-1 && return nothing
     1008     resize!(coeffs, order+1)
      144     coeffs[lencoef+1:end] .= zero(coeffs[1])
        0     return nothing
        - end

and also lines 140 through 150:

        - ## fixorder ##
        - for T in (:Taylor1, :TaylorN)
        -     @eval begin
        -         fixorder(a::$T, order::Int64) = $T(a.coeffs, order)
        -         function fixorder{R<:Number}(a::$T{R}, b::$T{R})
 64000000             a.order == b.order && return a, b
        0             a.order < b.order && return $T(a.coeffs, b.order), b
        0             return a, $T(b.coeffs, a.order)
        -         end
        -     end
        - end

The only memory-allocating lines for the same file, src/auxiliary.jl.mem, but for the 0697deb commit, are lines 17 through 23:

        - function resize_coeffs1!{T<:Number}(coeffs::Array{T,1}, order::Int)
        0     lencoef = length(coeffs)
        0     order ≤ lencoef-1 && return nothing
672001008     resize!(coeffs, order+1)
 96000144     coeffs[lencoef+1:end] .= zero(coeffs[1])
        0     return nothing
        - end

In particular, I noted the line resize!(coeffs, order+1), which allocates 672001008 bytes! That was the only significant difference I could notice using --track-allocation=user.

On the other hand, in the src/arithmetic.jl file, commit 0697deb seems to be much better at memory allocation than 487cb8b (e.g., using the in-place mul! instead of the allocating multiplication, as mentioned by @lbenet; that trick saves a lot of memory!). The rest of the files in the src directory had no significant contributions to memory allocation, as far as I could tell...


PerezHz commented Apr 1, 2017

Just noticing that the heavy use of resize_coeffs1! in commit 0697deb seems to come from the calls to the Taylor1(coeffs, order) constructor. This constructor is called every time the method zero(a::Taylor1, order) is called; in turn, zero is called by the *{T<:Number}(a::Taylor1{T}, b::Taylor1{T}) multiplication method.
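A quick way to test this hypothesis locally (not from the thread; just an illustrative check one could run under each commit) is to measure the allocation of a single call:

using TaylorSeries

t = Taylor1(rand(29))    # a Taylor1{Float64} of order 28
zero(t)                  # warm up so compilation is not measured
@allocated zero(t)       # bytes allocated by one call; compare across commits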


lbenet commented Apr 1, 2017

Thanks for checking this so carefully.

Yesterday, I simply changed line 20 of auxiliary.jl to return a.coeffs == b.coeffs and got a reduction from 11 GB to 3.4 GB. So it seems that some usage of the dot-operations creates temporaries that were not created before. This may happen in zero, but especially in +, - and perhaps *.

I can't look at this today, so I'll check it tomorrow.


lbenet commented Apr 3, 2017

I'm not sure what is happening. It seems to me that using dot-operations inside @eval blocks somehow prevents the dot-operations from being handled properly; the following refers to some lines in arithmetic.jl, when defining :+ and :-.

Simply put, the line

    @__dot__ v = $f(a.coeffs, b.coeffs)

(inside an @eval block) allocates lots of memory, while the supposedly equivalent

for ind = 1:length(a.coeffs)
    v[ind] = $f(a[ind], b[ind])
end

does not.

@dpsanders Do you know about this kind of problem?

I guess that a solution is using for ... end blocks again.
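As a self-contained illustration of the two code-generation patterns being compared (hypothetical functions on plain vectors, not the actual arithmetic.jl code):

using BenchmarkTools

# Variant 1: fused-dot assignment inside an @eval block.
for (fname, f) in ((:dot_add!, :+), (:dot_sub!, :-))
    @eval function $fname(v, a, b)
        @__dot__ v = $f(a, b)        # written as @. on Julia 0.6
        return v
    end
end

# Variant 2: explicit loop inside an @eval block.
for (fname, f) in ((:loop_add!, :+), (:loop_sub!, :-))
    @eval function $fname(v, a, b)
        for ind in 1:length(a)
            v[ind] = $f(a[ind], b[ind])
        end
        return v
    end
end

a, b, v = rand(100), rand(100), zeros(100)
@benchmark dot_add!($v, $a, $b)      # compare the allocations reported here...
@benchmark loop_add!($v, $a, $b)     # ...against the explicit-loop version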

dpsanders (Contributor) commented:

What does @__dot__ do? Is it the same as @.?
Any particular reason to not just write it with . everywhere? Does that have the same problem?

In any case, if this is a problem, it should certainly be reported as an issue on the Julia repo.

dpsanders (Contributor) commented:

You can use macroexpand (or @macroexpand on 0.6) to see what this is being expanded to. I guess including the @eval and everything.


lbenet commented Apr 3, 2017

@__dot__ is the equivalent of @. for 0.5.

I'll check how everything is expanded... Thanks for the suggestion!

dpsanders (Contributor) commented:

Are you checking all these allocations on 0.6? Dot fusion is less powerful on 0.5


lbenet commented Apr 3, 2017

I'm checking it on 0.5, though I'm aware that dot operations are less powerful there. The problem seems to appear in both versions...


dpsanders commented Apr 3, 2017 via email


lbenet commented Apr 3, 2017

I agree with you; yet, according to what @PerezHz is reporting (see this comment), the problem also occurs in Julia 0.6.

lbenet pushed a commit that referenced this issue Apr 5, 2017
The use of those if's is at the heart of the problem reported in [#102].
The idea was to avoid fixorder by setting the coefficients that are
not defined to zero. This involves if's that killed performance.

Also, fixorder was corrected to return a copy of the coefficients to
avoid side effects.

`max_order` is no longer needed, so it is deleted.

lbenet commented Apr 5, 2017

Taking back what I said before about the dot operations, I think the problem is actually related to some changes in getindex(), which allow using a[i] beyond the actual number of coefficients. The idea was that they would return an adequate zero, so we could avoid using fixorder.

I just submitted a PR (see #103) to see if reverting that brings things in order, and it seems so.

@PerezHz Could you check if #103 solves this?


PerezHz commented Apr 5, 2017

Well, it seems to solve the problem for julia 0.5.1 (at least for the total allocated memory, although the number of allocations still goes up). For julia-0.6.0-pre.alpha, #103 seems to help a lot, although there still seems to be a bit of a slowdown when benchmarking lorenzmanytimes() with the exact same parameters as before:

| commit | julia 0.6.0-pre.alpha.315 | julia 0.5.1 |
|---|---|---|
| #103 (248b0a2) | 4.53 GiB, 33000000 allocs, 2.429 sec | 4.69 GiB, 39000000 allocs, 2.692 sec |
| master (47b201c) | 11.58 GiB, 39000000 allocs, 17.126 sec | 11.35 GiB, 31000000 allocs, 12.105 sec |
| 487cb8b | 3.20 GiB, 22000000 allocs, 1.900 sec | 4.86 GiB, 30000000 allocs, 2.651 sec |

dpsanders (Contributor) commented:

I have just heard about PkgBenchmark.jl which should help for these kinds of comparisons.
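A sketch of how it could be used here, assuming TaylorSeries gains a benchmark/benchmarks.jl suite (the exact PkgBenchmark API may differ between versions):

using PkgBenchmark

# Run the package's benchmark suite at two git references and compare them;
# `judge` flags regressions and improvements of the target vs the baseline.
judge("TaylorSeries", "master", "487cb8b")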


lbenet commented Apr 5, 2017

Thanks @dpsanders for the tip on PkgBenchmark.jl; I'll check it later.

@PerezHz I just pushed another commit which I think slightly improves performance over 487cb8b. Can you confirm this?


PerezHz commented Apr 5, 2017

On my machine I ran the lorenzmanytimes() benchmarks, only with a small change:

function lorenzmanytimes()
    for i in 1:5000000 #<--- notice the change here; previous value was 1000000
        lorenz!(t0,q0T,dq0T)
    end
end

i.e., instead of evaluating lorenz! a million times, I evaluated it 5 million times.

For these benchmarks, I found that #103's latest commit is essentially on par with 487cb8b, and even slightly better (in execution time and total allocated memory) on julia 0.5.1! Actually, total allocated memory goes down by ~30% on julia 0.5.1 when comparing #103 vs 487cb8b. I would say this solves the performance regression, at least for julia 0.5.1.

On the other hand, what I noticed for julia 0.6.0-pre.alpha is that even though total allocated memory is better in #103 than in 487cb8b (15.80 GiB vs 16.02 GiB), the total number of allocations is slightly higher in #103 (1.25x10^8 vs 1.1x10^8), as is the execution time (10.759 s vs 9.455 s). I see a similar trend in other benchmarks with TaylorIntegration.

| commit | julia 0.6.0-pre.alpha.315 | julia 0.5.1 |
|---|---|---|
| #103 (c528ed2) | 15.80 GiB, 1.25x10^8 allocs, 10.759 sec | 16.61 GiB, 1.55x10^8 allocs, 11.737 sec |
| 487cb8b | 16.02 GiB, 1.1x10^8 allocs, 9.455 sec | 24.29 GiB, 1.5x10^8 allocs, 12.266 sec |


PerezHz commented Apr 5, 2017

Just updated to julia 0.6.0-pre.beta.33 and the results are essentially the same as the ones obtained with julia 0.6.0-pre.alpha.315


lbenet commented Apr 6, 2017

Thanks @PerezHz for checking this!

I will see if there is something else to be done to improve performance for 0.6. In any case, I think #103 or something close to that implementation should be merged.


lbenet commented Apr 7, 2017

I just pushed another commit (33a2cb7), which improves the number of allocations and the memory usage.

Locally, my results are the following: I can't beat the timing of 487cb8b using your benchmarks; I'm about 10% behind. I guess this is due to an optimization (in * for Taylor1) that is applied in 487cb8b but cannot be used now, probably because of mul!(c, a, b, k). On the other hand, using 33a2cb7 and perf/fateman.jl I get an improvement of ~5%.

@PerezHz Can you check in your application whether the slowdown with 33a2cb7 is as bad as before?

I will see if there is something I can do to further improve the timing, concretely in the mul! method mentioned above. Yet, I must admit that I like the current "modularity"...


PerezHz commented Apr 7, 2017

For the lorenzmanytimes() benchmark we've been working with, I also see commit 33a2cb7 about 10% slower than 487cb8b; but for the TaylorIntegration application I'm seeing 33a2cb7 right on par with 487cb8b: total allocated memory and number of allocs are practically the same between these two commits, and execution time is also practically the same!

On the other hand, I benchmarked TaylorIntegration tests, and found that while total allocated memory is better (~25%) for 33a2cb7 vs 487cb8b, number of allocs is slightly worse (~13%), and execution time is actually slightly better, by about 8%.

So I'd be happy to close this issue as far as the original slowdown report is concerned! Thank you @lbenet for all the effort you're putting into this 😄 !


lbenet commented Apr 7, 2017

Great news!

@dpsanders Do you agree to merge #103?

dpsanders (Contributor) commented:

I haven't been following very closely. If #103 solves the problem, then that sounds good to me, even if it's aesthetically less pleasing to some extent. Hopefully in the future, we may be able to find a different solution, but for now it seems like a good one!


PerezHz commented Apr 8, 2017

Sorry guys, I have to take back my word about performance for my TaylorIntegration application 😞... I just realised that I didn't check out the various TaylorSeries commits correctly when benchmarking. After redoing it carefully, I'm seeing 33a2cb7 about 8% higher in total allocated memory, 20% higher in number of allocs, and ~10% slower in execution time on average vs 487cb8b.

This correction is only for the N-body problem application with TaylorIntegration. The other two benchmarks (lorenzmanytimes() and TaylorIntegration tests themselves) were quoted correctly in my last comment. Again, sorry for this confusion...


lbenet commented Apr 19, 2017

Sorry for the delay in getting back to this. I've been playing a little with the old and new implementations and, after adding @inline to mul!, I get results that are pretty comparable with what we had for 487cb8b.

I just pushed a commit with those changes; @PerezHz can you check that this indeed improves the benchmarks over 33a2cb7? There may be a marginal difference in favor of 487cb8b; yet, I prefer to stick to the current implementation since I think it will permit some other nice stuff.
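For context, the change amounts to marking the mutating convolution kernel with @inline; a schematic plain-vector version of such a kernel (not the actual TaylorSeries mul! method) would look like:

# Hypothetical sketch: accumulate the k-th coefficient of a truncated
# polynomial product into c, in place, and let the compiler inline the call.
@inline function mul_coeff!(c::Vector{Float64}, a::Vector{Float64},
                            b::Vector{Float64}, k::Int)
    @inbounds for i in 0:k
        c[k+1] += a[i+1] * b[k-i+1]
    end
    return nothing
end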


PerezHz commented Apr 20, 2017

Sure, @lbenet! Just benchmarked the latest commit in #103 vs 487cb8b, and at least in my machine, for the lorenzmanytimes() benchmarks, I'm finding #103 actually ahead of 487cb8b, both in julia 0.5.1 and 0.6.0-pre.beta.224! (This time I double-checked that I got the git checkouts right 😅)

| commit | julia 0.6.0-pre.beta.224 | julia 0.5.1 |
|---|---|---|
| #103 (c19c1eb) | 14.16 GiB, 9.5x10^7 allocs, min time 9.214 s, mean time 9.247 s | 14.98 GiB, 1.25x10^8 allocs, min time 10.040 s, mean time 10.318 s |
| 487cb8b | 16.02 GiB, 1.1x10^8 allocs, min time 9.390 s, mean time 9.619 s | 24.29 GiB, 1.5x10^8 allocs, min time 12.163 s, mean time 12.354 s |


lbenet commented Apr 20, 2017

Great news! I think this should be merged, to continue with other stuff.


PerezHz commented Apr 20, 2017

Since the lorenzmanytimes() benchmark doesn't involve division or raising to powers, do you think similar things, such as inlining, should be done for those other operations?


lbenet commented Apr 20, 2017

I actually thought of including @inline for other functions. Do you have another possible test for those cases?


PerezHz commented Apr 20, 2017

Well, I was thinking of something like keplermanytimes(), that is, modifying lorenzmanytimes() for the Kepler problem; if you think this is worthwhile I can do it right away!
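A possible shape for such a test, exercising division and powers on Taylor1 (the equations and names below are illustrative, following the same pattern as lorenz! above):

using TaylorSeries

# Planar Kepler problem: x = (q1, q2, p1, p2); the right-hand side uses / and ^.
function kepler!(t::Float64, x::Array{Taylor1{Float64}}, dx::Array{Taylor1{Float64}})
    r3 = (x[1]^2 + x[2]^2)^1.5
    dx[1] = x[3]
    dx[2] = x[4]
    dx[3] = -x[1]/r3
    dx[4] = -x[2]/r3
    nothing
end

# This would then be wrapped in a keplermanytimes() loop and benchmarked
# exactly like lorenzmanytimes() above.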


PerezHz commented Apr 20, 2017

Also, right now I'm benchmarking one of my TaylorIntegration.jl applications


lbenet commented Apr 20, 2017

I think I would be more interested in the TaylorIntegration.jl application 😄

I'll do the keplermanytimes() test...

dpsanders (Contributor) commented:

If you use BenchmarkTools, it should not be necessary to run many times. Or, rather, BenchmarkTools controls that.

using BenchmarkTools

@benchmark lorenz(...)


PerezHz commented Apr 20, 2017

Thanks @dpsanders! For one of my TaylorIntegration.jl applications, as you suggested, I ran @benchmark tv, xv = taylorinteg(...) for an ODE which involves divisions and powers, as well as sums and products. The results on julia 0.6.0-pre.beta.224 were the following:

For commit 487cb8b:

BenchmarkTools.Trial: 
  memory estimate:  964.43 MiB
  allocs estimate:  10116941
  --------------
  minimum time:     675.706 ms (11.69% GC)
  median time:      692.568 ms (12.01% GC)
  mean time:        692.260 ms (11.97% GC)
  maximum time:     726.560 ms (12.83% GC)
  --------------
  samples:          8
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

For #103 (c19c1eb):

BenchmarkTools.Trial: 
  memory estimate:  1.03 GiB
  allocs estimate:  12225761
  --------------
  minimum time:     772.681 ms (12.71% GC)
  median time:      798.790 ms (13.52% GC)
  mean time:        810.775 ms (13.61% GC)
  maximum time:     864.409 ms (15.25% GC)
  --------------
  samples:          7
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

So specifically for this benchmark, 487cb8b still seems to be marginally faster than c19c1eb, but I also found that this difference is almost unnoticeable on longer runs.

lbenet added a commit that referenced this issue Apr 21, 2017
* Use `fixorder` again, to avoid the if's used in `getindex`

The use of those if's is at the heart of the problem reported in [#102].
The idea was to avoid fixorder by setting the coefficients that are
not defined to zero. This involves if's that killed performance.

Also, fixorder was corrected to return a copy of the coefficients to
avoid side effects.

`max_order` is no longer needed, so it is deleted.

* Fix tests and fateman.jl

* Tiny correction to fixorder

* Improvements on allocations

* Inline mul! and div! methods

* Use a modified power_by_squaring, and inline the (mutating) recursion functions

* Inline fixorder

* Inlining div!

lbenet commented Apr 21, 2017

@PerezHz I just merged #103; thanks a lot for reporting and your help in this. I'm closing this issue.

lbenet closed this as completed Apr 21, 2017

PerezHz commented Apr 24, 2017

Thanks @lbenet for all the effort you and @dpsanders have put into this; it is really nice to have performance back! For the record, after #103 was merged I benchmarked once again one of my TaylorIntegration.jl applications (the same as in my last comment in this issue), for both commit 487cb8b and the current master (b173816), with the following results:

commit 487cb8b:

BenchmarkTools.Trial: 
  memory estimate:  964.43 MiB
  allocs estimate:  10116945
  --------------
  minimum time:     671.458 ms (11.55% GC)
  median time:      678.895 ms (11.49% GC)
  mean time:        679.357 ms (11.47% GC)
  maximum time:     687.634 ms (11.70% GC)
  --------------
  samples:          8
  evals/sample:     1

current master (b173816):

BenchmarkTools.Trial: 
  memory estimate:  933.25 MiB
  allocs estimate:  9282765
  --------------
  minimum time:     669.862 ms (11.14% GC)
  median time:      676.142 ms (11.34% GC)
  mean time:        677.879 ms (11.28% GC)
  maximum time:     689.530 ms (11.05% GC)
  --------------
  samples:          8
  evals/sample:     1

So it seems that current master is performing better than 487cb8b in julia 0.6.0-pre.beta, as far as both total allocated memory and number of allocs are concerned; execution time is essentially on-par with 487cb8b! I also ran some other TaylorIntegration.jl applications of mine and found similar results!


lbenet commented Apr 24, 2017

Thanks for the update!

dpsanders (Contributor) commented:

That seems to be an enormous amount of memory being allocated in a very short time...
