Notable performance slowdown after PR #99 in julia-0.6.0-pre.alpha #102

Closed · PerezHz opened this issue Mar 31, 2017 · 51 comments


PerezHz commented Mar 31, 2017

While working with TaylorIntegration.jl and TaylorSeries.jl, I came across a notable slowdown after PR #99. Benchmarking TaylorIntegration.jl's tests with commit 3a8e12e from PerezHz/TaylorIntegration.jl#18 on julia 0.6.0-pre.alpha.315, I got the following typical values (as always, I ran a "warmup lap" before benchmarking to discard compilation overhead):

Julia 0.6.0-pre.alpha.315 TaylorIntegration.jl tests benchmarks for different commits of TaylorSeries:

So there's a slowdown of about 30% in typical execution times, about 11% more allocations, and about 50% more total bytes allocated when using commits 47b201c and 0697deb vs commit 487cb8b on julia 0.6.0-pre.alpha. I also ran some "heavier" benchmarks on two different machines, for a more complicated problem than the ones included in TaylorIntegration.jl's tests, and there the slowdown is even bigger, about 50%.

Interestingly enough, this slowdown does not occur in julia 0.5.1. Still, julia-0.5.1's fastest time is slower than julia-0.6.0-pre.alpha's slowest time:

Julia 0.5.1 TaylorIntegration.jl tests benchmarks for different commits of TaylorSeries:

  • TaylorSeries commit 47b201c:
    32.264 seconds (161.27 M allocations: 20.527 GB, 15.54% gc time)

  • TaylorSeries commit 0697deb:
    32.350 seconds (161.27 M allocations: 20.528 GB, 14.73% gc time)

  • TaylorSeries commit 487cb8b:
    32.849 seconds (161.27 M allocations: 20.527 GB, 14.51% gc time)


lbenet commented Mar 31, 2017

Thanks for reporting. At some point during the development of #99, I also noticed a performance regression. Yet, that was addressed and performance was recovered.

For benchmarking TaylorSeries I use Fateman's tests. The following table shows some results I obtained for fateman1(20), using BenchmarkTools, considering commits 487cb8b (before #99 was merged), 0697deb (#99), and the current master. The figures indicate the minimum time and the memory estimate reported by @benchmark fateman1(20) seconds=20.

| commit | Julia 0.5 | Julia 0.6 |
|---|---|---|
| 487cb8b | 2.578 s, 33.39 MiB | 2.583 s, 25.51 MiB |
| 0697deb | 2.441 s, 19.27 MiB | 2.448 s, 19.26 MiB |
| master | 2.547 s, 19.27 MiB | 2.460 s, 19.26 MiB |

The table shows that the best timings are obtained for 0697deb; memory usage is actually improved in #99. I must say that different runs show variations of about 10% in the timing results (whatever else is going on in the CPU). The tests involved only 8 or 9 runs, so we are far from having good statistics.

The performance problem I noted was related to TaylorN multiplication (and similarly in other functions), which was allocating unnecessary arrays. It came from using * for HomogeneousPolynomial products in the form c[k+1] += a[i+1] * b[k-i+1] instead of mul!(c[k+1], a[i+1], b[k-i+1]). The subtlety is that a[i+1] * b[k-i+1] creates a temporary array, while mul!(c[k+1], a[i+1], b[k-i+1]) does not.
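To make the allocation difference concrete, here is a minimal sketch using plain Vector{Float64} instead of the actual HomogeneousPolynomial type (the function names are illustrative, not TaylorSeries API):

using BenchmarkTools

# Allocating pattern: the product is materialized as a temporary array
# before being accumulated into c.
function accum_with_temporary!(c, a, b)
    tmp = a .* b              # allocates a new array on every call
    for i in eachindex(c)
        c[i] += tmp[i]
    end
    return c
end

# In-place pattern: accumulate each product directly into c, no temporaries.
function accum_in_place!(c, a, b)
    for i in eachindex(c)
        c[i] += a[i] * b[i]
    end
    return c
end

a, b, c = rand(100), rand(100), zeros(100)
@benchmark accum_with_temporary!($c, $a, $b)   # reports one array allocation per call
@benchmark accum_in_place!($c, $a, $b)         # reports zero allocations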

Can you check if the performance hit happens in general, or if it happens concretely in the integration of the variational equations or the Liapunov spectrum?

One more comment. While checking the time that takes the tests gives you a first approximation to this, I think it is better to use more sophisticated tools, simply because other processes in the CPU may influence the results, aside from compilation issues.


dpsanders commented Mar 31, 2017 via email


lbenet commented Mar 31, 2017

Sorry if I was ambiguous...


PerezHz commented Mar 31, 2017

> For benchmarking TaylorSeries I use Fateman's tests.

Are Fateman's tests only for TaylorN or do they test Taylor1 also?

> One more comment: while checking the time the tests take gives you a first approximation, I think it is better to use more sophisticated tools, simply because other processes on the CPU may influence the results, aside from compilation issues.

+

> By "more sophisticated tools", I think Luis is referring to the BenchmarkTools.jl package. I second this recommendation.

Thanks for the feedback! I did as both of you suggested and benchmarked TaylorIntegration.jl's tests using BenchmarkTools.jl, for the commits 487cb8b, 0697deb, and current master (47b201c).

At TaylorIntegration.jl's root folder, for each of the above commits of TaylorSeries, I ran the following code:

using BenchmarkTools
mybench = @benchmarkable include("./test/runtests.jl") seconds=1e10 evals=10 samples=2
@time include("./test/runtests.jl") #warmup lap
run(mybench)

My results were the following:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 10.92 GiB | 112104218 | 16.230 s (15.92% GC) | 17.082 s (18.62% GC) |
| 0697deb | 20.34 GiB | 124597238 | 22.737 s (17.60% GC) | 23.778 s (20.32% GC) |
| 47b201c | 19.62 GiB | 124605429 | 22.252 s (17.70% GC) | 23.325 s (20.17% GC) |

> Can you check if the performance hit happens in general, or if it happens concretely in the integration of the variational equations or the Liapunov spectrum?

Sure! I'm going to separate TaylorIntegration.jl's tests into the ones that use TaylorN and the ones which do not, and report back here.


lbenet commented Mar 31, 2017

Responding to your question: Fateman's test is usually for TaylorN. fateman1(20) corresponds to computing s*(s+1), with s = (1+x+y+z+w)^20. The idea is not to compute s^2+s, which can be optimized. This computation requires Int128 if you want the exact coefficients.
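For reference, a self-contained sketch of that computation (the actual fateman1 in TaylorSeries' perf scripts may be organized differently, e.g. building the series from HomogeneousPolynomial pieces):

using TaylorSeries

# Fateman-style test: s = (1 + x + y + z + w)^degree, then s*(s+1).
# Int128 coefficients hold the exact values; the order is 2*degree so that
# the full product fits without truncation.
function fateman_sketch(degree::Int)
    x, y, z, w = set_variables(Int128, "x", numvars=4, order=2*degree)
    s = (1 + x + y + z + w)^degree
    return s * (s + 1)        # deliberately s*(s+1), not the optimizable s^2 + s
end

fateman_sketch(20)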


lbenet commented Mar 31, 2017

Regarding the benchmarks, my feeling is that after #99 the naive approach we use in TaylorIntegration.jl with TaylorN creates temporary arrays, which roughly doubles the memory use, and thus the time.


PerezHz commented Mar 31, 2017

Here are the results for the "separated" benchmarks, i.e., splitting those which use TaylorN (jet transport, Lyapunov spectrum) from those which do not (typical ODE integration with Float64 variables). Benchmarks were performed on both julia-0.5.1 and julia-0.6.0.pre.alpha.315.

Non-TaylorN TaylorIntegration.jl tests:

  • julia-0.5.1:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 112.21 MiB | 1810163 | 4.803 s (1.06% GC) | 4.907 s (1.07% GC) |
| 0697deb | 125.21 MiB | 2084495 | 4.746 s (1.00% GC) | 4.748 s (1.06% GC) |
| master (47b201c) | 125.66 MiB | 2094893 | 4.686 s (1.07% GC) | 4.764 s (1.10% GC) |

  • julia-0.6.0.pre.alpha.315:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 124.03 MiB | 1967996 | 4.427 s (1.16% GC) | 4.473 s (1.13% GC) |
| 0697deb | 137.31 MiB | 2241712 | 4.699 s (1.24% GC) | 4.718 s (1.28% GC) |
| master (47b201c) | 137.55 MiB | 2247012 | 4.549 s (1.23% GC) | 4.618 s (1.25% GC) |

TaylorIntegration.jl tests which involve TaylorN:

  • julia-0.5.1:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 12.91 GiB | 142833706 | 13.829 s (16.46% GC) | 14.586 s (18.56% GC) |
| 0697deb | 20.94 GiB | 146733942 | 25.631 s (17.93% GC) | 25.692 s (18.74% GC) |
| master (47b201c) | 20.41 GiB | 159212234 | 25.313 s (16.25% GC) | 26.327 s (18.54% GC) |

  • julia-0.6.0.pre.alpha.315:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 10.80 GiB | 110187006 | 10.409 s (18.18% GC) | 10.695 s (19.75% GC) |
| 0697deb | 20.20 GiB | 122406324 | 16.521 s (19.23% GC) | 17.247 s (20.95% GC) |
| master (47b201c) | 19.49 GiB | 122409215 | 16.702 s (19.50% GC) | 17.083 s (21.20% GC) |


PerezHz commented Mar 31, 2017

> Responding to your question: Fateman's test is usually for TaylorN.

Ok, thanks!

> Regarding the benchmarks, my feeling is that after #99 the naive approach we use in TaylorIntegration.jl with TaylorN creates temporary arrays, which roughly doubles the memory use, and thus the time.

From the above benchmarks, I agree with you: the performance slowdown observed in TaylorIntegration.jl's tests is mainly due to worse memory usage with TaylorN in 0697deb and master vs 487cb8b, although there is also a smaller slowdown in ODE integrations which do not involve TaylorN. Actually, I have a specific ODE example (the post-Newtonian N-body problem for the Solar System) which does not involve TaylorN; there I observe a greater slowdown than these benchmarks suggest, and that was one of the problems I was working on when I first noticed this issue. For completeness I will also report those benchmarks here.


lbenet commented Apr 1, 2017

Can you please tell me which version (commit) of TaylorIntegration you are using?

Regarding the non-TaylorN tests, I think they essentially stay the same. The TaylorN tests indeed got worse, and I think this is because we are somehow creating temporary arrays that allocate memory...


PerezHz commented Apr 1, 2017

> Can you please tell me which version (commit) of TaylorIntegration you are using?

I'm using the latest commit in PerezHz/TaylorIntegration.jl#18 (3a8e12e)


PerezHz commented Apr 1, 2017

Below I show some benchmarks which involve neither TaylorIntegration.jl nor TaylorN variables. I defined a function named NBP_pN!, which takes a Float64 and two arrays of Taylor1{Float64}, does some internal calculations, and updates one of these arrays in place; the function has the following structure:

function NBP_pN!(t::Float64, q::Array{Taylor1{Float64}, 1}, dq::Array{Taylor1{Float64}, 1})
    
    #... some calculations here...

    for i in eachindex(dq)
        dq[i] = # ... in-place assignments ...
    end
    nothing
end

Then I did mybench = @benchmarkable NBP_pN!(t0,q0T,dq0T) seconds=1e10 evals=10 samples=100, where t0::Float64, q0T::Array{Taylor1{Float64}} and dq0T::Array{Taylor1{Float64}} (q0T and dq0T have 60 elements each), and benchmarked it using @time run(mybench). The results were:

  • julia-0.5.1:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 4.76 MiB | 31600 | 3.373 ms (7.42% GC) | 3.558 ms (9.71% GC) |
| 0697deb | 13.53 MiB | 43760 | 22.767 ms (8.68% GC) | 23.434 ms (9.02% GC) |
| master (47b201c) | 13.53 MiB | 43760 | 21.538 ms (0.00% GC) | 24.361 ms (10.37% GC) |

  • julia-0.6.0.pre.alpha.315:

| TaylorSeries.jl commit | memory estimate | allocs estimate | min time | mean time |
|---|---|---|---|---|
| 487cb8b | 3.12 MiB | 23130 | 2.528 ms (5.91% GC) | 2.658 ms (9.51% GC) |
| 0697deb | 13.30 MiB | 36390 | 15.923 ms (9.95% GC) | 16.483 ms (11.23% GC) |
| master (47b201c) | 13.30 MiB | 36390 | 17.295 ms (11.10% GC) | 17.832 ms (11.70% GC) |

So for this example, there's a ~6x slowdown in execution time, and almost 2x as much memory used when using master and 0697deb, vs commit 487cb8b. Also, @time run(mybench) reports that the total benchmark execution time goes from around 5 seconds to about 19 seconds.

dpsanders (Contributor) commented:

As described in the performance tips, you can run a Julia script with the option

--track-allocation=user

to see where memory allocation is occurring.
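For completeness, a sketch of that workflow (the script name here is hypothetical): start Julia as julia --track-allocation=user and then run

# First run compiles everything; its allocations would pollute the counts.
include("lorenz_benchmark.jl")

# Discard the allocation data gathered so far (compilation noise).
# On Julia >= 0.7 this requires `using Profile`; on 0.5/0.6 Profile is in Base.
Profile.clear_malloc_data()

# Run again; after quitting Julia, per-line byte counts appear in *.jl.mem files.
include("lorenz_benchmark.jl")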


PerezHz commented Apr 1, 2017

Thanks @dpsanders, I will try that!


PerezHz commented Apr 1, 2017

Below I detail a simpler, reproducible example of what I think is a performance slowdown in arithmetic operations (sums and products) with Taylor1 after PR #99:

First I ran the following code:

using TaylorSeries, BenchmarkTools

#some parameters, etc.
const order = 28
const q0 = [19.0, 20.0, 50.0]
const σ = 16.0
const β = 4.0
const ρ = 45.92
const t0 = 0.0
const q0T = Array{Taylor1{Float64}}(3)
const dq0T = Array{Taylor1{Float64}}(3)

#the equations of the Lorenz system
function lorenz!(t::Float64, x::Array{Taylor1{Float64}}, dx::Array{Taylor1{Float64}})
    dx[1] = σ*(x[2]-x[1])
    dx[2] = x[1]*(ρ-x[3])-x[2]
    dx[3] = x[1]*x[2]-β*x[3]
    nothing
end

#fill the first coeffs. of each element of q0T with the corresponding elements of q0
for i in eachindex(q0)
    q0T[i] = Taylor1(q0[i], order)
end

#the function that we will benchmark
function lorenzmanytimes()
    for i in 1:1000000
        lorenz!(t0,q0T,dq0T)
    end
end

Then for the benchmarks themselves I ran:

mybench = @benchmarkable lorenzmanytimes() seconds=1e10 evals=10 samples=1
lorenzmanytimes() #warmup lap for lorenzmanytimes()
run(mybench)

And got the following results (the data in the table reads as: estimated allocated memory, estimated number of allocations, and the mean time of execution):

| commit | julia 0.6.0-pre.alpha.315 | julia 0.5.1 |
|---|---|---|
| 487cb8b | 3.20 GiB, 22000000 allocs, 1.977 sec | 4.86 GiB, 30000000 allocs, 2.444 sec |
| 0697deb | 11.35 GiB, 31000000 allocs, 12.102 sec | 11.58 GiB, 39000000 allocs, 16.768 sec |
| master (47b201c) | 11.35 GiB, 31000000 allocs, 12.449 sec | 11.58 GiB, 39000000 allocs, 17.090 sec |


PerezHz commented Apr 1, 2017

Following @dpsanders' suggestion, I ran the Lorenz system example detailed above and got the .mem files for the TaylorSeries.jl files in the src directory. As far as I can tell, the main difference in memory allocation between commits 487cb8b and 0697deb is in the src/auxiliary.jl file.

The only memory-allocating lines in the file src/auxiliary.jl.mem, for the 487cb8b commit, are lines 17 through 23:

        - function resize_coeffs1!{T<:Number}(coeffs::Array{T,1}, order::Int)
        0     lencoef = length(coeffs)
        0     order ≤ lencoef-1 && return nothing
     1008     resize!(coeffs, order+1)
      144     coeffs[lencoef+1:end] .= zero(coeffs[1])
        0     return nothing
        - end

and also lines 140 through 150:

        - ## fixorder ##
        - for T in (:Taylor1, :TaylorN)
        -     @eval begin
        -         fixorder(a::$T, order::Int64) = $T(a.coeffs, order)
        -         function fixorder{R<:Number}(a::$T{R}, b::$T{R})
 64000000             a.order == b.order && return a, b
        0             a.order < b.order && return $T(a.coeffs, b.order), b
        0             return a, $T(b.coeffs, a.order)
        -         end
        -     end
        - end

The only memory-allocating lines for the same file, src/auxiliary.jl.mem, but for the 0697deb commit, are lines 17 through 23:

        - function resize_coeffs1!{T<:Number}(coeffs::Array{T,1}, order::Int)
        0     lencoef = length(coeffs)
        0     order ≤ lencoef-1 && return nothing
672001008     resize!(coeffs, order+1)
 96000144     coeffs[lencoef+1:end] .= zero(coeffs[1])
        0     return nothing
        - end

In particular, I noted the line resize!(coeffs, order+1), which allocates 672001008 bytes! That was the only significant difference I could notice using --track-allocation=user.

On the other hand, in the src/arithmetic.jl file, commit 0697deb seems to be much better at memory allocation than 487cb8b (e.g., using the in-place mul! instead of the allocating multiplication, as mentioned by @lbenet; that trick saves a lot of memory!). The rest of the files in the src directory had no significant contributions to memory allocation, as far as I could tell...


PerezHz commented Apr 1, 2017

Just noticing that the heavy use of resize_coeffs1! in commit 0697deb seems to come from the calls to the Taylor1(coeffs, order) constructor. This constructor is called every time the method zero(a::Taylor1, order) is called; in turn, zero is called by the *{T<:Number}(a::Taylor1{T}, b::Taylor1{T}) multiplication method.
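A quick way to test this hypothesis locally (not from the thread; just an illustrative check one could run under each commit) is to measure the allocation of a single call:

using TaylorSeries

t = Taylor1(rand(29))    # a Taylor1{Float64} of order 28
zero(t)                  # warm up so compilation is not measured
@allocated zero(t)       # bytes allocated by one call; compare across commits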


lbenet commented Apr 1, 2017

Thanks for checking this so carefully.

Yesterday, I simply changed line 20 of auxiliary.jl to return a.coeffs == b.coeffs and got a reduction from 11 GB to 3.4 GB. So it seems that some usage of the dot-operations creates temporaries that were not created before. This may happen in zero, but especially in +, - and perhaps *.

I can't look at this today, so I'll check it tomorrow.


lbenet commented Apr 3, 2017

I'm not sure what is happening. It seems to me that using dot-operations inside @eval blocks somehow prevents the dot-operations from being handled properly; the following refers to some lines in arithmetic.jl, when defining :+ and :-.

Simply put, the line

    @__dot__ v = $f(a.coeffs, b.coeffs)

(inside an @eval block) allocates lots of memory, while the supposedly equivalent

for ind = 1:length(a.coeffs)
    v[ind] = $f(a[ind], b[ind])
end

does not.

@dpsanders Do you know about this kind of problem?

I guess that a solution is using for ... end blocks again.
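As a self-contained illustration of the two code-generation patterns being compared (hypothetical functions on plain vectors, not the actual arithmetic.jl code):

using BenchmarkTools

# Variant 1: fused-dot assignment inside an @eval block.
for (fname, f) in ((:dot_add!, :+), (:dot_sub!, :-))
    @eval function $fname(v, a, b)
        @__dot__ v = $f(a, b)        # written as @. on Julia 0.6
        return v
    end
end

# Variant 2: explicit loop inside an @eval block.
for (fname, f) in ((:loop_add!, :+), (:loop_sub!, :-))
    @eval function $fname(v, a, b)
        for ind in 1:length(a)
            v[ind] = $f(a[ind], b[ind])
        end
        return v
    end
end

a, b, v = rand(100), rand(100), zeros(100)
@benchmark dot_add!($v, $a, $b)      # compare the allocations reported here...
@benchmark loop_add!($v, $a, $b)     # ...against the explicit-loop version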

dpsanders (Contributor) commented:

What does @__dot__ do? Is it the same as @.?
Any particular reason to not just write it with . everywhere? Does that have the same problem?

In any case, if this is a problem, it should certainly be reported as an issue on the Julia repo.

dpsanders (Contributor) commented:

You can use macroexpand (or @macroexpand on 0.6) to see what this is being expanded to. I guess including the @eval and everything.


lbenet commented Apr 3, 2017

@__dot__ is the equivalent of @. for 0.5.

I'll check how everything is expanded... Thanks for the suggestion!

dpsanders (Contributor) commented:

Are you checking all these allocations on 0.6? Dot fusion is less powerful on 0.5


lbenet commented Apr 3, 2017

I'm checking it on 0.5, though I'm aware that dot operations are less powerful there. The problem seems to appear in both versions...


dpsanders commented Apr 3, 2017 via email


lbenet commented Apr 3, 2017

I agree with you; yet, according to what @PerezHz is reporting (see this comment), the problem also occurs in Julia 0.6.

lbenet pushed a commit that referenced this issue Apr 5, 2017
The use of those if's is at the heart of the problem reported in [#102].
The idea was to avoid fixorder by setting the coefficients that are
not defined to zero. This involves if's that killed performance.

Also, fixorder was corrected to return a copy of the coefficients to
avoid side effects.

`max_order` is no longer needed, so it is deleted.

lbenet commented Apr 5, 2017

Taking back what I said before about the dot operations, I think the problem is actually related to some changes in getindex(), which allow using a[i] beyond the actual number of coefficients. The idea was that they would return an adequate zero, so we could avoid using fixorder.

I just submitted a PR (see #103) to see if reverting that brings things in order, and it seems so.

@PerezHz Could you check if #103 solves this?


PerezHz commented Apr 5, 2017

Well, it seems to solve the problem for julia 0.5.1 (at least for the total allocated memory, although the number of allocations still goes up). For julia-0.6.0-pre.alpha, #103 seems to help a lot, although there still seems to be a bit of a slowdown when benchmarking lorenzmanytimes() with the exact same parameters as before:

| commit | julia 0.6.0-pre.alpha.315 | julia 0.5.1 |
|---|---|---|
| #103 (248b0a2) | 4.53 GiB, 33000000 allocs, 2.429 sec | 4.69 GiB, 39000000 allocs, 2.692 sec |
| master (47b201c) | 11.58 GiB, 39000000 allocs, 17.126 sec | 11.35 GiB, 31000000 allocs, 12.105 sec |
| 487cb8b | 3.20 GiB, 22000000 allocs, 1.900 sec | 4.86 GiB, 30000000 allocs, 2.651 sec |

dpsanders (Contributor) commented:

I have just heard about PkgBenchmark.jl which should help for these kinds of comparisons.
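A sketch of how it could be used here, assuming TaylorSeries gains a benchmark/benchmarks.jl suite (the exact PkgBenchmark API may differ between versions):

using PkgBenchmark

# Run the package's benchmark suite at two git references and compare them;
# `judge` flags regressions and improvements of the target vs the baseline.
judge("TaylorSeries", "master", "487cb8b")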


lbenet commented Apr 5, 2017

Thanks @dpsanders for the tip on PkgBenchmark.jl; I'll check it later.

@PerezHz I just pushed another commit which I think slightly improves performance over 487cb8b. Can you confirm this?


PerezHz commented Apr 5, 2017

On my machine I ran the lorenzmanytimes() benchmarks, only with a small change:

function lorenzmanytimes()
    for i in 1:5000000 #<--- notice the change here; previous value was 1000000
        lorenz!(t0,q0T,dq0T)
    end
end

i.e., instead of evaluating lorenz! a million times, I evaluated it 5 million times.

For these benchmarks, I found that #103's latest commit is essentially on par with 487cb8b, and even slightly better (in execution time and total allocated memory) on julia 0.5.1! Actually, total allocated memory goes down by ~30% on julia 0.5.1 when comparing #103 vs 487cb8b. I would say this solves the performance regression, at least for julia 0.5.1.

On the other hand, what I noticed for julia 0.6.0-pre.alpha is that even though total allocated memory is better in #103 than in 487cb8b (15.80 GiB vs 16.02 GiB), the total number of allocations is slightly higher in #103 (1.25x10^8 vs 1.1x10^8), as is the execution time (10.759 s vs 9.455 s). I see a similar trend in other benchmarks with TaylorIntegration.

| commit | julia 0.6.0-pre.alpha.315 | julia 0.5.1 |
|---|---|---|
| #103 (c528ed2) | 15.80 GiB, 1.25x10^8 allocs, 10.759 sec | 16.61 GiB, 1.55x10^8 allocs, 11.737 sec |
| 487cb8b | 16.02 GiB, 1.1x10^8 allocs, 9.455 sec | 24.29 GiB, 1.5x10^8 allocs, 12.266 sec |


PerezHz commented Apr 5, 2017

Just updated to julia 0.6.0-pre.beta.33 and the results are essentially the same as the ones obtained with julia 0.6.0-pre.alpha.315


lbenet commented Apr 6, 2017

Thanks @PerezHz for checking this!

I will see if there is something else to be done to improve performance for 0.6. In any case, I think #103 or something close to that implementation should be merged.


lbenet commented Apr 7, 2017

I just pushed another commit (33a2cb7), which improves the number of allocations and the memory usage.

Locally, my results are the following: I can't beat the timing of 487cb8b using your benchmarks; I'm about 10% behind. I guess this is due to an optimization (in * for Taylor1) that is applied in 487cb8b but cannot be used now, probably because of mul!(c, a, b, k). On the other hand, using 33a2cb7 and perf/fateman.jl I get an improvement of ~5%.

@PerezHz Can you check in your application whether the slowdown with 33a2cb7 is as bad as before?

I will see if there is something I can do to further improve the timing, concretely in the mul! method mentioned above. Yet, I must admit that I like the current "modularity"...


PerezHz commented Apr 7, 2017

For the lorenzmanytimes() benchmark we've been working with, I also see commit 33a2cb7 about 10% slower than 487cb8b; but for the TaylorIntegration application I'm seeing 33a2cb7 right on par with 487cb8b: total allocated memory and number of allocs are practically the same between these two commits, and execution time is also practically the same!

On the other hand, I benchmarked TaylorIntegration tests, and found that while total allocated memory is better (~25%) for 33a2cb7 vs 487cb8b, number of allocs is slightly worse (~13%), and execution time is actually slightly better, by about 8%.

So I'd be happy to close this issue as far as the original slowdown report is concerned! Thank you @lbenet for all the effort you're putting into this 😄 !


lbenet commented Apr 7, 2017

Great news!

@dpsanders Do you agree to merge #103?

dpsanders (Contributor) commented:

I haven't been following very closely. If #103 solves the problem, then that sounds good to me, even if it's aesthetically less pleasing to some extent. Hopefully in the future, we may be able to find a different solution, but for now it seems like a good one!


PerezHz commented Apr 8, 2017

Sorry guys, I have to take back my word about performance for my TaylorIntegration application 😞... I just realised that I didn't check out the various TaylorSeries commits correctly when benchmarking. After redoing it carefully, I'm seeing 33a2cb7 about 8% higher in total allocated memory, 20% higher in number of allocs, and ~10% slower in execution time on average vs 487cb8b.

This correction is only for the N-body problem application with TaylorIntegration. The other two benchmarks (lorenzmanytimes() and TaylorIntegration tests themselves) were quoted correctly in my last comment. Again, sorry for this confusion...


lbenet commented Apr 19, 2017

Sorry for the delay in getting back to this. I've been playing a little with the old and new implementations and, after adding @inline to mul!, I get results that are pretty comparable with what we had for 487cb8b.

I just pushed a commit with those changes; @PerezHz can you check that this indeed improves the benchmarks over 33a2cb7? There may be a marginal difference in favor of 487cb8b; yet, I prefer to stick to the current implementation since I think it will permit some other nice stuff.
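For context, the change amounts to marking the mutating convolution kernel with @inline; a schematic plain-vector version of such a kernel (not the actual TaylorSeries mul! method) would look like:

# Hypothetical sketch: accumulate the k-th coefficient of a truncated
# polynomial product into c, in place, and let the compiler inline the call.
@inline function mul_coeff!(c::Vector{Float64}, a::Vector{Float64},
                            b::Vector{Float64}, k::Int)
    @inbounds for i in 0:k
        c[k+1] += a[i+1] * b[k-i+1]
    end
    return nothing
end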


PerezHz commented Apr 20, 2017

Sure, @lbenet! Just benchmarked the latest commit in #103 vs 487cb8b, and at least in my machine, for the lorenzmanytimes() benchmarks, I'm finding #103 actually ahead of 487cb8b, both in julia 0.5.1 and 0.6.0-pre.beta.224! (This time I double-checked that I got the git checkouts right 😅)

| commit | julia 0.6.0-pre.beta.224 | julia 0.5.1 |
|---|---|---|
| #103 (c19c1eb) | 14.16 GiB, 9.5x10^7 allocs, min time 9.214 s, mean time 9.247 s | 14.98 GiB, 1.25x10^8 allocs, min time 10.040 s, mean time 10.318 s |
| 487cb8b | 16.02 GiB, 1.1x10^8 allocs, min time 9.390 s, mean time 9.619 s | 24.29 GiB, 1.5x10^8 allocs, min time 12.163 s, mean time 12.354 s |


lbenet commented Apr 20, 2017

Great news! I think this should be merged, to continue with other stuff.


PerezHz commented Apr 20, 2017

Since the lorenzmanytimes() benchmark doesn't involve division or raising to powers, do you think similar things, such as inlining, should be done for those other operations?


lbenet commented Apr 20, 2017

I actually thought of including @inline for other functions. Do you have another possible test for those cases?


PerezHz commented Apr 20, 2017

Well, I was thinking of something like keplermanytimes(), that is, modifying lorenzmanytimes() for the Kepler problem; if you think this is worthwhile I can do it right away!
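A possible shape for such a test, exercising division and powers on Taylor1 (the equations and names below are illustrative, following the same pattern as lorenz! above):

using TaylorSeries

# Planar Kepler problem: x = (q1, q2, p1, p2); the right-hand side uses / and ^.
function kepler!(t::Float64, x::Array{Taylor1{Float64}}, dx::Array{Taylor1{Float64}})
    r3 = (x[1]^2 + x[2]^2)^1.5
    dx[1] = x[3]
    dx[2] = x[4]
    dx[3] = -x[1]/r3
    dx[4] = -x[2]/r3
    nothing
end

# This would then be wrapped in a keplermanytimes() loop and benchmarked
# exactly like lorenzmanytimes() above.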


PerezHz commented Apr 20, 2017

Also, right now I'm benchmarking one of my TaylorIntegration.jl applications


lbenet commented Apr 20, 2017

I think I would be more interested in the TaylorIntegration.jl application 😄

I'll do the keplermanytimes() test...

dpsanders (Contributor) commented:

If you use BenchmarkTools, it should not be necessary to run many times. Or, rather, BenchmarkTools controls that.

using BenchmarkTools

@benchmark lorenz(...)


PerezHz commented Apr 20, 2017

Thanks @dpsanders! For one of my TaylorIntegration.jl applications, as you suggested, I ran @benchmark tv, xv = taylorinteg(...) for an ODE which involves divisions and powers, as well as sums and products. The results on julia 0.6.0-pre.beta.224 were the following:

For commit 487cb8b:

BenchmarkTools.Trial: 
  memory estimate:  964.43 MiB
  allocs estimate:  10116941
  --------------
  minimum time:     675.706 ms (11.69% GC)
  median time:      692.568 ms (12.01% GC)
  mean time:        692.260 ms (11.97% GC)
  maximum time:     726.560 ms (12.83% GC)
  --------------
  samples:          8
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

For #103 (c19c1eb):

BenchmarkTools.Trial: 
  memory estimate:  1.03 GiB
  allocs estimate:  12225761
  --------------
  minimum time:     772.681 ms (12.71% GC)
  median time:      798.790 ms (13.52% GC)
  mean time:        810.775 ms (13.61% GC)
  maximum time:     864.409 ms (15.25% GC)
  --------------
  samples:          7
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

So specifically for this benchmark, 487cb8b still seems to be marginally faster than c19c1eb, but I also found that this difference is almost unnoticeable on longer runs.

lbenet added a commit that referenced this issue Apr 21, 2017
* Use `fixorder` again, to avoid the if's used in `getindex`

The use of those if's is at the heart of the problem reported in [#102].
The idea was to avoid fixorder by setting the coefficients that are
not defined to zero. This involves if's that killed performance.

Also, fixorder was corrected to return a copy of the coefficients to
avoid side effects.

`max_order` is no longer needed, so it is deleted.

* Fix tests and fateman.jl

* Tiny correction to fixorder

* Improvements on allocations

* Inline mul! and div! methods

* Use a modified power_by_squaring, and inline the (mutating) recursion functions

* Inline fixorder

* Inlining div!

lbenet commented Apr 21, 2017

@PerezHz I just merged #103; thanks a lot for reporting and your help in this. I'm closing this issue.

lbenet closed this as completed Apr 21, 2017

PerezHz commented Apr 24, 2017

Thanks @lbenet for all the effort you and @dpsanders have put into this; it is really nice to have performance back! For the record, after #103 was merged I benchmarked once again one of my TaylorIntegration.jl applications (the same as in my last comment in this issue), for both commit 487cb8b and the current master (b173816), with the following results:

commit 487cb8b:

BenchmarkTools.Trial: 
  memory estimate:  964.43 MiB
  allocs estimate:  10116945
  --------------
  minimum time:     671.458 ms (11.55% GC)
  median time:      678.895 ms (11.49% GC)
  mean time:        679.357 ms (11.47% GC)
  maximum time:     687.634 ms (11.70% GC)
  --------------
  samples:          8
  evals/sample:     1

current master (b173816):

BenchmarkTools.Trial: 
  memory estimate:  933.25 MiB
  allocs estimate:  9282765
  --------------
  minimum time:     669.862 ms (11.14% GC)
  median time:      676.142 ms (11.34% GC)
  mean time:        677.879 ms (11.28% GC)
  maximum time:     689.530 ms (11.05% GC)
  --------------
  samples:          8
  evals/sample:     1

So it seems that current master is performing better than 487cb8b in julia 0.6.0-pre.beta, as far as both total allocated memory and number of allocs are concerned; execution time is essentially on-par with 487cb8b! I also ran some other TaylorIntegration.jl applications of mine and found similar results!


lbenet commented Apr 24, 2017

Thanks for the update!

dpsanders (Contributor) commented:

That seems to be an enormous amount of memory being allocated in a very short time...
