RFC: Performance improvement for Diagonal' * Vector|Matrix #21302

Merged: 1 commit into JuliaLang:master on Apr 18, 2017

Conversation

georgemarrows (Contributor)

Added tests
Only addresses diagonal part of #21286, not sparse matrices

julia> include("test.jl")
  0.000014 seconds (6 allocations: 78.359 KiB)
  0.000012 seconds (6 allocations: 78.359 KiB)
  0.000012 seconds (6 allocations: 78.359 KiB)

from

begin
	D = Diagonal(randn(10000))
	v = randn(10000)
	@time D * v
	@time D' * v
	@time D.' * v
end

@georgemarrows (Contributor, Author)

First Julia PR - please be gentle :-)
Potential problems, as far as I can see:

  • tests A'*B rather than Ac/t_mul_B directly
  • no performance tests
  • limited generality: covers Diagonal' * Vector only

@@ -225,6 +225,9 @@ A_mul_B!(A::AbstractMatrix,B::Diagonal) = scale!(A,B.diag)
A_mul_Bt!(A::AbstractMatrix,B::Diagonal) = scale!(A,B.diag)
A_mul_Bc!(A::AbstractMatrix,B::Diagonal) = scale!(A,conj(B.diag))

Ac_mul_B(A::Diagonal,B::AbstractVector) = conj.(A.diag) .* B
Member:

Our (c)transpose is recursive so it should be ctranspose.(A.diag) .* B. It would be great if you could make the same fix for Ac/t_mul_B!.
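
A minimal illustration of why the recursion matters (hypothetical block values, not from this PR): when the Diagonal's entries are themselves matrices, conj only conjugates each block, while ctranspose also transposes it, which is what D' * v actually needs.

# Hypothetical block-diagonal example, not from the PR
D = Diagonal([[1 2im; 3 4], [5 6; 7im 8]])
v = [[1, 1], [1, 1]]
conj.(D.diag) .* v        # blocks conjugated but not transposed - wrong for D' * v
ctranspose.(D.diag) .* v  # conjugate-transposes each block - what D' * v requires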

georgemarrows (Contributor, Author):

Ah, for block diagonal matrices? New PR tomorrow covering both points.

Member:

Yes. You can find a couple of issues where the details have been discussed. Please just update this PR instead of opening a new one.

@georgemarrows (Contributor, Author)

Fixed for block matrices and a new test added.

For Ac/t_mul_B!, I notice that A_mul_B! is actually even slower:

begin
	D = Diagonal(randn(10000))
	v = randn(10000)
	vv = similar(v)
	@time A_mul_B!(vv, D, v)
	@time Ac_mul_B!(vv, D, v)
	@time At_mul_B!(vv, D, v)

	@time vv .= D * v
	@time vv .= Ac_mul_B(D, v)
	@time vv .= At_mul_B(D, v)
	nothing
end
  1.570805 seconds (4 allocations: 160 bytes)
  1.095022 seconds (4 allocations: 160 bytes)
  1.093878 seconds (4 allocations: 160 bytes)
  0.000030 seconds (6 allocations: 78.359 KiB)
  0.000019 seconds (6 allocations: 78.359 KiB)
  0.000017 seconds (6 allocations: 78.359 KiB)

Should I make a change for that too?
Are the vv .= Ac_mul_B(D, v)-style implementations OK? They're way faster but allocate more.
Should block diagonals work here too?

BTW, should the following work? I wanted to use it when writing my test for transposes.

julia> D = Diagonal([[1 2; 3 4], [1 2; 3 4]])
2×2 Diagonal{Array{Int64,2}}:
 [1 2; 3 4]           
            [1 2; 3 4]

julia> full(D)
ERROR: MethodError: no method matching zero(::Type{Array{Int64,2}})
Closest candidates are:
  zero(::Type{Base.LibGit2.GitHash}) at libgit2/oid.jl:106
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:80
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:120
  ...
Stacktrace:
 [1] diagm(::Array{Array{Int64,2},1}, ::Int64) at ./linalg/dense.jl:251
 [2] full(::Diagonal{Array{Int64,2}}) at ./linalg/diagonal.jl:56
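
(For what it's worth: full calls diagm, which needs zero(Array{Int64,2}) to fill the off-diagonal entries, and a zero matrix's size can't be inferred from the element type alone. A hypothetical manual workaround, not part of this PR:)

# Hypothetical workaround, not part of the PR: assemble the dense matrix by hand
D = Diagonal([[1 2; 3 4], [1 2; 3 4]])
fullD = zeros(Int, 4, 4)
fullD[1:2, 1:2] = D.diag[1]
fullD[3:4, 3:4] = D.diag[2]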

@georgemarrows (Contributor, Author)

Now with perf fixes for Ac/t_mul_B for Vector and Diagonal. All tests pass. Will push again with squashed commits if you're happy with this.

Ac_mul_B(A::Diagonal, v::AbstractVector) = ctranspose.(A.diag) .* v
At_mul_B(A::Diagonal, v::AbstractVector) = transpose.(A.diag) .* v

A_mul_B!(vout::AbstractVector, A::Diagonal, vin::AbstractVector) = vout .= A * vin
Member:

Couldn't this just be vout .= A.diag .* vin, thereby avoiding the allocation?

@georgemarrows (Contributor, Author)

Indeed, and then you don't need separate non-! versions. I tried that briefly before, but missed that I was also measuring compilation time :-(

Notes on performance here: https://gist.github.com/georgemarrows/678876779e7292a954a344aa3addeaf1
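
For concreteness, a hedged sketch of the fused in-place pattern being discussed (0.6-era names; the definitions that actually land in this PR may differ):

# Hedged sketch, not necessarily the merged code: the fused broadcasts write
# straight into vout, so no temporary vector is allocated.
import Base: A_mul_B!, Ac_mul_B!, At_mul_B!
A_mul_B!(vout::AbstractVector, A::Diagonal, vin::AbstractVector)  = vout .= A.diag .* vin
Ac_mul_B!(vout::AbstractVector, A::Diagonal, vin::AbstractVector) = vout .= ctranspose.(A.diag) .* vin
At_mul_B!(vout::AbstractVector, A::Diagonal, vin::AbstractVector) = vout .= transpose.(A.diag) .* vin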

@andreasnoack (Member)

Did you confirm that this version still fixes #21286? I'm not sure that D::Diagonal' * B::Matrix dispatches to the two-argument Ac_mul_B!s in the current version.

@georgemarrows (Contributor, Author) commented Apr 11, 2017

Latest push covers D' * Matrix as well as D' * Vector, and adds tests for this combination too. It doesn't address sparse matrices, which were also raised in #21286 - I think that's better as a separate PR (and perhaps a separate bug report).

I ran tests/linalg/diagonal.jl, and this gist shows that performance is hugely improved in both the Vector and Matrix cases, so the transpose and ctranspose cases are now on par with plain multiplication.
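
For reference, a hedged sketch of how the Matrix case can reuse the same broadcast trick (illustrative only; the methods actually merged may be structured differently). Broadcasting the length-n diag vector against an n×m matrix scales row i by the (c)transposed i-th diagonal entry, which is exactly what D' * B computes.

# Hedged sketch, not necessarily the merged code
import Base: Ac_mul_B, At_mul_B
Ac_mul_B(A::Diagonal, B::AbstractMatrix) = ctranspose.(A.diag) .* B
At_mul_B(A::Diagonal, B::AbstractMatrix) = transpose.(A.diag) .* B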

@georgemarrows (Contributor, Author)

Nudging for review when time is available...

@Sacha0 (Member) left a comment:

LGTM from a brief review! Thanks for the great first PR @georgemarrows! :)

Includes tests
Only addresses diagonal part of 21286, not sparse matrices
@georgemarrows (Contributor, Author)

Thanks @Sacha0. I've rebased and squashed commits.
make test is all green locally, and we still have the performance improvements shown in this gist.
PR title changed to "RFC:".

@georgemarrows changed the title from "WIP: Fix #21286, improve perf of Diagonal' * Vector" to "RFC: Performance improvement for Diagonal' * Vector|Matrix" on Apr 18, 2017
@Sacha0 (Member) commented Apr 18, 2017

Additional thoughts @andreasnoack? Mergeworthy? Best!

@@ -225,6 +225,16 @@ A_mul_B!(A::AbstractMatrix,B::Diagonal) = scale!(A,B.diag)
A_mul_Bt!(A::AbstractMatrix,B::Diagonal) = scale!(A,B.diag)
A_mul_Bc!(A::AbstractMatrix,B::Diagonal) = scale!(A,conj(B.diag))

# Get ambiguous method if try to unify AbstractVector/AbstractMatrix here using AbstractVecOrMat
A_mul_B!(out::AbstractVector, A::Diagonal, in::AbstractVector) = out .= A.diag .* in
Contributor:

we really shouldn't be syntax-highlighting in...

@andreasnoack merged commit b0c7084 into JuliaLang:master on Apr 18, 2017
@georgemarrows (Contributor, Author)

Thanks @andreasnoack, I appreciate your guidance and reviews while I thrashed towards a solution.

@andreasnoack (Member)

You are welcome. Thanks for the contribution.

Labels: domain:linear algebra, performance