
Sparse-sparse matrix multiplication 4x slower on 0.5 than 0.4, 6x slower on master #21370

Closed
tkelman opened this issue Apr 13, 2017 · 11 comments
Labels: domain:arrays:sparse (Sparse arrays), kind:potential benchmark (Could make a good benchmark in BaseBenchmarks), kind:regression (Regression in behavior compared to a previous version), performance (Must go faster)

Comments

@tkelman
Contributor

tkelman commented Apr 13, 2017

Test code:

$ cat compare.jl
using BenchmarkTools
testfile = "amazon0302"
if !isfile("$(testfile)_inc.tsv")
    download("https://graphchallenge.s3.amazonaws.com/snap/$testfile/$(testfile)_inc.tsv",
        "$(testfile)_inc.tsv")
end
ii = readdlm("$(testfile)_inc.tsv", '\t', Int64)
E = sparse(ii[:,1], ii[:,2], ii[:,3])
@show VERSION
display(@benchmark E'*E)
println()

$ for v in 0.4 0.5 0.6; do ~/Julia/julia-$v/julia compare.jl; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.5M  100 26.5M    0     0  14.3M      0  0:00:01  0:00:01 --:--:-- 14.3M
VERSION = v"0.4.8-pre+3"
BenchmarkTools.Trial:
  memory estimate:  662.97 MiB
  allocs estimate:  4497172
  --------------
  minimum time:     1.231 s (0.00% GC)
  median time:      1.255 s (1.43% GC)
  mean time:        1.268 s (2.51% GC)
  maximum time:     1.331 s (6.86% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
VERSION = v"0.5.2-pre+1"
BenchmarkTools.Trial:
  memory estimate:  660.97 MiB
  allocs estimate:  2698301
  --------------
  minimum time:     5.189 s (4.54% GC)
  median time:      5.189 s (4.54% GC)
  mean time:        5.189 s (4.54% GC)
  maximum time:     5.189 s (4.54% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
VERSION = v"0.6.0-pre.beta.121"
BenchmarkTools.Trial:
  memory estimate:  660.97 MiB
  allocs estimate:  2698301
  --------------
  minimum time:     7.657 s (3.10% GC)
  median time:      7.657 s (3.10% GC)
  mean time:        7.657 s (3.10% GC)
  maximum time:     7.657 s (3.10% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

From looking into the profiles, almost all of the slowdown comes from the sorting of the row indices that happens here: https://github.com/JuliaLang/julia/blame/c8026a91acd988fbbb08158092a89e111c243965/base/sparse/linalg.jl#L192. Neither spmatmul nor sortSparseMatrixCSC! (https://github.com/JuliaLang/julia/blame/c8026a91acd988fbbb08158092a89e111c243965/base/sparse/sparsematrix.jl#L3310-L3362) has changed much from 0.4 to 0.5, or from 0.5 to now.

@tkelman added the performance (Must go faster), kind:regression (Regression in behavior compared to a previous version), domain:arrays:sparse (Sparse arrays), and kind:potential benchmark (Could make a good benchmark in BaseBenchmarks) labels on Apr 13, 2017
@KristofferC
Sponsor Member

KristofferC commented Apr 13, 2017

julia> Profile.print(noisefloor=2)
 [...]
      1600 ./sparse/linalg.jl:134; Ac_mul_B(::SparseMatrixCSC{Int64,Int64}, :...
       1450 ./sparse/linalg.jl:192; #spmatmul#54(::Symbol, ::Function, ::Spar...
        1450 ./<missing>:0; (::Base.SparseArrays.#kw##sortSparseMatrix...
         1435 ./sparse/sparsematrix.jl:3350; #sortSparseMatrixCSC!#37(::Symbol, ::Fu...
          1409 ./sort.jl:660; #sortperm!#12(::Base.Sort.QuickSortAlg,...

Ref: #939 (comment) (sortperm! 150x slower than sort!). Also, sortperm! is 3x slower on 0.6 than 0.5.

@KristofferC
Sponsor Member

KristofferC commented Apr 13, 2017

It seems that creating the Perm object takes the vast majority of the time.

Adding (using https://github.com/KristofferC/TimerOutputs.jl):

@timeit "create ord" begin
  order = ord(lt,by,rev,order)
end
@timeit "create perm" begin
  p = Perm(order, v)
end
@timeit "do sort" begin
  sort!(x, alg, p)
end

to sortperm! gives (after a few loops calling sortperm!):

 Section       ncalls     time   %tot     avg
 ────────────────────────────────────────────
 create perm       50   1.14ms  97.5%  22.8μs
 do sort           50   21.0μs  1.80%   420ns
 create ord        50   8.27μs  0.71%   165ns
 ────────────────────────────────────────────

As a note, it is a bit sad that only 2% of the time is spent actually sorting while 98% is overhead.

@KristofferC
Sponsor Member

KristofferC commented Apr 13, 2017

Reduced to:

using BenchmarkTools

function _sortperm!(v::AbstractVector;
                   lt=isless,
                   by=identity,
                   rev::Bool=false,
                   order::Base.Ordering=Base.Order.Forward)
        order = Base.Order.ord(lt,by,rev,order)
        p = Base.Order.Perm(order, v)
end

v = rand(12)
@btime _sortperm!($v)

0.6: 14.255 μs (1 allocation: 16 bytes)
0.5: 5.316 μs (1 allocation: 16 bytes)

The change from 0.5 to 0.6 is that jl_typemap_assoc_exact got a lot slower.
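As a sanity check that the sort itself is cheap once the ordering exists, here is a sketch that builds the Perm ordering directly and calls the three-argument sort!. (Perm, Forward, and this internal sort! method are Base internals from the era referenced above; treating them as stable API is an assumption.)

```julia
# Sketch: bypass sortperm!'s keyword-argument path by constructing the
# Perm ordering once and calling the internal three-argument sort!.
using Base.Order: Perm, Forward

v = rand(12)
p = collect(1:length(v))          # pre-allocated permutation vector
sort!(p, Base.Sort.QuickSort, Perm(Forward, v))
@assert v[p] == sort(v)
```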

Full relevant 0.6 trace:

1618 /home/kristoffer/juliamkl/src/gf.c:1910; jl_apply_generic
  1614 /home/kristoffer/juliamkl/src/gf.c:1858; jl_lookup_generic_
   1614 /home/kristoffer/juliamkl/src/julia_internal.h:876; jl_typemap_assoc_exact
    1614 /home/kristoffer/juliamkl/src/typemap.c:828; jl_typemap_level_assoc_exact
     729 /home/kristoffer/juliamkl/src/typemap.c:764; jl_typemap_entry_assoc_exact
     643 /home/kristoffer/juliamkl/src/typemap.c:783; jl_typemap_entry_assoc_exact
      63  /home/kristoffer/juliamkl/src/typemap.c:137; sig_match_simple
      383 /home/kristoffer/juliamkl/src/typemap.c:145; sig_match_simple
       90  /home/kristoffer/juliamkl/src/subtype.c:977; jl_types_equal
        48 /home/kristoffer/juliamkl/src/subtype.c:178; obviously_egal
       190 /home/kristoffer/juliamkl/src/subtype.c:978; jl_types_equal
        58 /home/kristoffer/juliamkl/src/subtype.c:210; obviously_unequal
         42 /home/kristoffer/juliamkl/src/jltypes.c:819; jl_unwrap_unionall
        76 /home/kristoffer/juliamkl/src/subtype.c:211; obviously_unequal
         52 /home/kristoffer/juliamkl/src/jltypes.c:819; jl_unwrap_unionall

Full relevant 0.5 trace:

603 .../buildbot/slave/package_tarball64/build/src/gf.c:1909; jl_apply_generic
603 ...ave/package_tarball64/build/src/julia_internal.h:749; jl_typemap_assoc_exact
 601 ...bot/slave/package_tarball64/build/src/typemap.c:866; jl_typemap_level_assoc_exact
  33  ...bot/slave/package_tarball64/build/src/typemap.c:803; jl_typemap_entry_assoc_exact
  24  ...bot/slave/package_tarball64/build/src/typemap.c:817; jl_typemap_entry_assoc_exact
   24 ...bot/slave/package_tarball64/build/src/typemap.c:107; sig_match_leaf
  250 ...bot/slave/package_tarball64/build/src/typemap.c:821; jl_typemap_entry_assoc_exact
    31 ...ldbot/slave/package_tarball64/build/src/julia.h:969; jl_is_type_type
   19  ...bot/slave/package_tarball64/build/src/typemap.c:136; sig_match_simple
    19 ...ldbot/slave/package_tarball64/build/src/julia.h:690; jl_svecref
   50  ...bot/slave/package_tarball64/build/src/typemap.c:142; sig_match_simple
   101 ...bot/slave/package_tarball64/build/src/typemap.c:147; sig_match_simple
    28 ...ot/slave/package_tarball64/build/src/jltypes.c:1675; type_eqv__
    14 ...ot/slave/package_tarball64/build/src/jltypes.c:1696; type_eqv__
    15 ...ot/slave/package_tarball64/build/src/jltypes.c:1717; type_eqv__
  21  ...bot/slave/package_tarball64/build/src/typemap.c:835; jl_typemap_entry_assoc_exact

cc @JeffBezanson

@martinholters
Member

With that benchmark, I also get:
0.6: 11.105 μs
0.5: 5.36 μs
0.4: 648.852 ns

@fredrikekre
Member

From #20993 (benchmarking 0.6 vs 0.5):

| ID | time ratio | memory ratio |
|----|------------|--------------|
| ["sort","insertionsort",("sort forwards","ascending")] | 1.59 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sort! forwards","ascending")] | 2.48 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm forwards","ascending")] | 1.51 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm forwards","descending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm forwards","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm reverse","ascending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm reverse","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! forwards","descending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! forwards","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! reverse","ascending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! reverse","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","issorted",("reverse","ascending")] | 1.34 (30%) ❌ | 1.00 (1%) |
| ["sort","issorted",("reverse","random")] | 1.35 (30%) ❌ | 1.00 (1%) |
| ["sort","quicksort",("sortperm forwards","ascending")] | 1.36 (30%) ❌ | 1.00 (1%) |
| ["sort","quicksort",("sortperm forwards","descending")] | 1.33 (30%) ❌ | 1.00 (1%) |
| ["sort","quicksort",("sortperm! reverse","descending")] | 1.30 (30%) ❌ | 1.00 (1%) |

@KristofferC
Sponsor Member

Is that the actual sorting that got slower or just overhead like here?

@StefanKarpinski
Sponsor Member

We should probably have a dedicated sortperm implementation that doesn't use the same code paths as everything else. I believe it can use a linear-time sorting algorithm, since the exact distribution of values is known.
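The linear-time idea can be sketched as a counting-sort sortperm for integers in a known range [lo, hi]. (The function name and signature here are illustrative, not a Base API.)

```julia
# Counting-sort sortperm sketch: O(n + range) when the values are integers
# in a known range [lo, hi]. Illustrative, not the Base implementation.
function sortperm_counting(v::Vector{Int}, lo::Int, hi::Int)
    counts = zeros(Int, hi - lo + 2)
    for x in v
        counts[x - lo + 2] += 1        # histogram, shifted by one slot
    end
    cumsum!(counts, counts)            # prefix sums = starting offsets
    p = similar(v)
    for (i, x) in enumerate(v)
        counts[x - lo + 1] += 1
        p[counts[x - lo + 1]] = i      # stable placement of index i
    end
    return p
end

v = [3, 1, 2, 2, 3, 1]
p = sortperm_counting(v, 1, 3)
@assert v[p] == sort(v)
```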

@KristofferC
Sponsor Member

While true, it's perhaps not relevant for this issue.

@jebej
Contributor

jebej commented Apr 13, 2017

That sort of thing is why I wanted to check performance against 0.4 in #20947.

@tkelman
Contributor Author

tkelman commented Apr 16, 2017

We should add something derived from this to the performance tracking tests. I'll have to familiarize myself with how to structure those if no one beats me to it.

@RalphAS

RalphAS commented Apr 17, 2017

Although the constructor issue is solved, sparse matmul is still often dominated by sorting, so let me leave a note for future issue-posters or potential pull-requesters to consider.
While investigating Julia's sorting schemes, I noticed @JeffBezanson's Base.Sort.sortperm_int_range, which is a degenerate form (i.e. a stunningly elegant reduction) of radix sort. It seems tailor-made for this use case (anticipated by @StefanKarpinski above), but it is not invoked because the CSC matrix code uses the pre-allocated sortperm! interface. (FWIW, my experiments suggest that sortperm_int_range is faster than [tuned] traditional radix sort for ranges smaller than about 10^6.)
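The dispatch gap described above can be illustrated as follows. (Whether sortperm actually reaches the sortperm_int_range fast path is an internal, version-dependent detail, so this is a hedged sketch rather than a guaranteed behavior.)

```julia
# Hedged illustration: sortperm on an integer vector with default ordering
# may dispatch to the sortperm_int_range fast path, while the pre-allocated
# sortperm! used by sortSparseMatrixCSC! takes the generic comparison-sort
# path. Both produce valid permutations either way.
v = rand(1:100, 10_000)
p  = sortperm(v)                 # may use Base.Sort.sortperm_int_range
ix = collect(1:length(v))
sortperm!(ix, v)                 # generic path
@assert v[p] == sort(v) && v[ix] == sort(v)
```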

8 participants