
fix perf in exp(::Matrix) (for smallish matrices) #29116

Merged 1 commit into master on Sep 12, 2018
Conversation

@KristofferC (Member) commented Sep 10, 2018

Fixes #29113

Before:

julia> @btime exp([1. 0; 2 0])
  18.912 μs (73 allocations: 4.80 KiB)

After:

julia> @btime exp([1. 0; 2 0])
  3.529 μs (25 allocations: 2.30 KiB)

Edit: The text below is no longer relevant:

Should broadcasting be able to do better here? cc @mbauman. Here is an MWE:

f(A, A2, A4, A6, CC, Inn)      =  A * (A6 * (CC[14]*A6  + CC[12]*A4  + CC[10]*A2)  + CC[8]*A6  + CC[6]*A4  + CC[4]*A2  + CC[2]*Inn)
f_dots(A, A2, A4, A6, CC, Inn) =  A * (A6 * (CC[14]*A6 .+ CC[12]*A4 .+ CC[10]*A2) .+ CC[8]*A6 .+ CC[6]*A4 .+ CC[4]*A2 .+ CC[2]*Inn)
CC = rand(100); A = rand(2,2); A2 = A*A; A4 = A2*A2; A6 = A2 * A4; Inn = rand(2,2);

@btime f($A, $A2, $A4, $A6, $CC, $Inn)
  6.949 μs (28 allocations: 1.69 KiB)

@btime f_dots($A, $A2, $A4, $A6, $CC, $Inn)
  565.278 ns (11 allocations: 1.20 KiB)

Update, since it is being discussed: using full fusion:

f_ddots(A, A2, A4, A6, CC, Inn) =  A * (A6 * @.((CC[14]*A6  + CC[12]*A4  + CC[10]*A2)  + CC[8]*A6  + CC[6]*A4  + CC[4]*A2  + CC[2]*Inn))

@btime f_ddots($A, $A2, $A4, $A6, $CC, $Inn)
  1.787 μs (7 allocations: 464 bytes)

@KristofferC added the performance (Must go faster) and domain:linear algebra (Linear algebra) labels on Sep 10, 2018
Review comments on the following diff hunk:

U = A * (A6 * (CC[14]*A6 .+ CC[12]*A4 .+ CC[10]*A2) .+
CC[8]*A6 + CC[6]*A4 + CC[4]*A2 + CC[2]*Inn)
V = A6 * (CC[13]*A6 + CC[11]*A4 + CC[9]*A2) +
CC[7]*A6 + CC[5]*A4 + CC[3]*A2 + CC[1]*Inn
Member:
CC[14]*A6 etcetera should be CC[14].*A6 in order to fuse. Probably cleaner to just do @.(CC[14]*A6 + CC[12]*A4 + …)
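For concreteness, a small illustration of the fusion point above (A, B, c, d are placeholder names, not from the PR): a plain scalar multiply like c*A allocates a temporary matrix before the .+ can fuse anything, whereas c .* A stays lazy and folds into the surrounding broadcast, so the whole right-hand side runs as a single loop with one result allocation.

using BenchmarkTools

A, B = rand(2, 2), rand(2, 2)
c, d = 1.3, 2.7

g_mixed(c, d, A, B) = c*A .+ d*B        # two temporaries from the scalar multiplies
g_fused(c, d, A, B) = c .* A .+ d .* B  # fully fused: only the result is allocated

@btime g_mixed($c, $d, $A, $B);
@btime g_fused($c, $d, $A, $B);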

Member:
(I'm not sure why CC is an array rather than a bunch of scalars. Not that any of these micro-optimizations matter for large arrays.)

@KristofferC (Author):
That makes the code three times slower, so I'd rather not do that.

@KristofferC (Author):
Like, I put an MWE at the top so we can talk about actual performance instead of theoretical performance :P

@mbauman (Member) commented Sep 10, 2018:
That looks like #28126, particularly since when I tried it I saw a 2x gain. :-\

julia> f_dots2(A, A2, A4, A6, CC, Inn) =  A * (A6 * (CC[14].*A6 .+ CC[12].*A4 .+ CC[10].*A2) .+ CC[8]*A6 .+ CC[6].*A4 .+ CC[4].*A2 .+ CC[2].*Inn)
f_dots2 (generic function with 1 method)

julia> @btime f_dots2($A, $A2, $A4, $A6, $CC, $Inn)
  254.430 ns (5 allocations: 560 bytes)
2×2 Array{Float64,2}:
 0.516647  0.311652
 0.186972  0.116884

@KristofferC (Author):
Ok, I also get that, but not when using the macro... Did I mess up?

@KristofferC (Author):
Anyway, just pushed it with explicit dots.

Member:
Ah, the difference appears to be in the n-ary operations. The parser doesn't use special n-ary parsing for .+, but the @. macro does.
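A rough way to see the parsing difference described above (assumes a stock Julia 1.x session; the lowered output is paraphrased in the comments rather than reproduced verbatim):

a, b, c = rand(2, 2), rand(2, 2), rand(2, 2)

Meta.@lower a .+ b .+ c    # chained .+ lowers to nested binary broadcasted calls
Meta.@lower @. a + b + c   # @. turns the n-ary + into a single broadcasted(+, a, b, c)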

Review comment on the updated hunk:

U = A * (A6 * (CC[14]*A6 .+ CC[12]*A4 .+ CC[10]*A2) .+
CC[8]*A6 .+ CC[6]*A4 .+ CC[4]*A2 .+ CC[2]*Inn)
V = A6 * (CC[13]*A6 .+ CC[11]*A4 .+ CC[9]*A2) .+
CC[7]*A6 .+ CC[5]*A4 .+ CC[3]*A2 .+ CC[1]*Inn
@stevengj (Member) commented Sep 10, 2018:
If we really wanted to we could split out the + CC[2]*Inn operation into a separate loop that just updates the diagonals in-place.

In general, it looks like there are lots of opportunities for avoiding temporary array allocations in this routine if we care, both by re-using arrays and by using things like rmul!.

It will only matter for tiny arrays like the one in #29113, however, since for larger arrays the Θ(n³) operations will dominate.
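A minimal sketch of the in-place diagonal update suggested above (add_scaled_identity! is a hypothetical helper for illustration, not code from this PR): instead of allocating CC[2]*Inn and adding an entire matrix, one could bump the diagonal of the already-computed sum directly.

using LinearAlgebra

function add_scaled_identity!(U::AbstractMatrix, c::Number)
    n = LinearAlgebra.checksquare(U)  # throws unless U is square
    @inbounds for i in 1:n
        U[i, i] += c                  # in-place version of U + c*I, no temporaries
    end
    return U
end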

@KristofferC changed the title from "fix perf in exp(::Matrix)" to "fix perf in exp(::Matrix) (for smallish matrices)" on Sep 10, 2018
@Keno (Member) commented Sep 10, 2018:

Do we have this as a nanosoldier benchmark and didn't notice? If not, add it?

@KristofferC added the kind:potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Sep 10, 2018
@KristofferC (Author):
The benchmark label is on there, so anyone should feel free to add it.

@KristofferC merged commit 6acaa10 into master on Sep 12, 2018
@mbauman deleted the kc/perf_exp branch on September 12, 2018
KristofferC added a commit that referenced this pull request Sep 12, 2018
(cherry picked from commit 6acaa10)
@KristofferC mentioned this pull request on Sep 12, 2018
@KristofferC removed the kind:potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Oct 16, 2018
KristofferC added a commit that referenced this pull request Feb 11, 2019
(cherry picked from commit 6acaa10)
Labels: domain:linear algebra (Linear algebra), performance (Must go faster)

4 participants