
Sparse-sparse matrix multiplication 4x slower on 0.5 than 0.4, 6x slower on master #21370

Closed
tkelman opened this issue Apr 13, 2017 · 11 comments
Labels: domain:arrays:sparse (Sparse arrays), kind:potential benchmark (Could make a good benchmark in BaseBenchmarks), kind:regression (Regression in behavior compared to a previous version), performance (Must go faster)

Comments

@tkelman
Contributor

tkelman commented Apr 13, 2017

Test code:

$ cat compare.jl
using BenchmarkTools
testfile = "amazon0302"
if !isfile("$(testfile)_inc.tsv")
    download("https://graphchallenge.s3.amazonaws.com/snap/$testfile/$(testfile)_inc.tsv",
        "$(testfile)_inc.tsv")
end
ii = readdlm("$(testfile)_inc.tsv", '\t', Int64)
E = sparse(ii[:,1], ii[:,2], ii[:,3])
@show VERSION
display(@benchmark E'*E)
println()

$ for v in 0.4 0.5 0.6; do ~/Julia/julia-$v/julia compare.jl; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.5M  100 26.5M    0     0  14.3M      0  0:00:01  0:00:01 --:--:-- 14.3M
VERSION = v"0.4.8-pre+3"
BenchmarkTools.Trial:
  memory estimate:  662.97 MiB
  allocs estimate:  4497172
  --------------
  minimum time:     1.231 s (0.00% GC)
  median time:      1.255 s (1.43% GC)
  mean time:        1.268 s (2.51% GC)
  maximum time:     1.331 s (6.86% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
VERSION = v"0.5.2-pre+1"
BenchmarkTools.Trial:
  memory estimate:  660.97 MiB
  allocs estimate:  2698301
  --------------
  minimum time:     5.189 s (4.54% GC)
  median time:      5.189 s (4.54% GC)
  mean time:        5.189 s (4.54% GC)
  maximum time:     5.189 s (4.54% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
VERSION = v"0.6.0-pre.beta.121"
BenchmarkTools.Trial:
  memory estimate:  660.97 MiB
  allocs estimate:  2698301
  --------------
  minimum time:     7.657 s (3.10% GC)
  median time:      7.657 s (3.10% GC)
  mean time:        7.657 s (3.10% GC)
  maximum time:     7.657 s (3.10% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

From looking into the profiles, almost all of the slowdown comes from the sorting of the row indices that happens here: https://github.com/JuliaLang/julia/blame/c8026a91acd988fbbb08158092a89e111c243965/base/sparse/linalg.jl#L192. Neither spmatmul nor sortSparseMatrixCSC! (https://github.com/JuliaLang/julia/blame/c8026a91acd988fbbb08158092a89e111c243965/base/sparse/sparsematrix.jl#L3310-L3362) has changed much from 0.4 to 0.5, or from 0.5 to now.

@tkelman added the performance (Must go faster), kind:regression (Regression in behavior compared to a previous version), domain:arrays:sparse (Sparse arrays), and kind:potential benchmark (Could make a good benchmark in BaseBenchmarks) labels on Apr 13, 2017
@KristofferC
Sponsor Member

KristofferC commented Apr 13, 2017

julia> Profile.print(noisefloor=2)
 [...]
      1600 ./sparse/linalg.jl:134; Ac_mul_B(::SparseMatrixCSC{Int64,Int64}, :...
       1450 ./sparse/linalg.jl:192; #spmatmul#54(::Symbol, ::Function, ::Spar...
        1450 ./<missing>:0; (::Base.SparseArrays.#kw##sortSparseMatrix...
         1435 ./sparse/sparsematrix.jl:3350; #sortSparseMatrixCSC!#37(::Symbol, ::Fu...
          1409 ./sort.jl:660; #sortperm!#12(::Base.Sort.QuickSortAlg,...

Ref: #939 (comment) (sortperm! 150x slower than sort!). Also, sortperm! is 3x slower on 0.6 than 0.5.

@KristofferC
Sponsor Member

KristofferC commented Apr 13, 2017

It seems that creating the Perm object takes the vast majority of the time.

Adding (using https://github.com/KristofferC/TimerOutputs.jl):

@timeit "create ord" begin
  order = ord(lt,by,rev,order)
end
@timeit "create perm" begin
  p = Perm(order, v)
end
@timeit "do sort" begin
  sort!(x, alg, p)
end

to sortperm! gives (after a few loops calling sortperm!):

 Section       ncalls     time   %tot     avg
 ────────────────────────────────────────────
 create perm       50   1.14ms  97.5%  22.8μs
 do sort           50   21.0μs  1.80%   420ns
 create ord        50   8.27μs  0.71%   165ns
 ────────────────────────────────────────────

As a note, it is a bit sad that only 2% of the time is spent actually sorting while 98% is overhead.

@KristofferC
Sponsor Member

KristofferC commented Apr 13, 2017

Reduced to:

using BenchmarkTools

function _sortperm!(v::AbstractVector;
                   lt=isless,
                   by=identity,
                   rev::Bool=false,
                   order::Base.Ordering=Base.Order.Forward)
        order = Base.Order.ord(lt,by,rev,order)
        p = Base.Order.Perm(order, v)
end

v = rand(12)
@btime _sortperm!($v)

0.6: 14.255 μs (1 allocation: 16 bytes)
0.5: 5.316 μs (1 allocation: 16 bytes)

The change from 0.5 to 0.6 is that jl_typemap_assoc_exact got a lot slower.
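As a sanity check that the sort itself is cheap once the ordering exists, here is a sketch that builds the Perm ordering directly and calls the three-argument sort!. (Perm, Forward, and this internal sort! method are Base internals from the era referenced above; treating them as stable API is an assumption.)

```julia
# Sketch: bypass sortperm!'s keyword-argument path by constructing the
# Perm ordering once and calling the internal three-argument sort!.
using Base.Order: Perm, Forward

v = rand(12)
p = collect(1:length(v))          # pre-allocated permutation vector
sort!(p, Base.Sort.QuickSort, Perm(Forward, v))
@assert v[p] == sort(v)
```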

Full relevant 0.6 trace:

1618 /home/kristoffer/juliamkl/src/gf.c:1910; jl_apply_generic
  1614 /home/kristoffer/juliamkl/src/gf.c:1858; jl_lookup_generic_
   1614 /home/kristoffer/juliamkl/src/julia_internal.h:876; jl_typemap_assoc_exact
    1614 /home/kristoffer/juliamkl/src/typemap.c:828; jl_typemap_level_assoc_exact
     729 /home/kristoffer/juliamkl/src/typemap.c:764; jl_typemap_entry_assoc_exact
     643 /home/kristoffer/juliamkl/src/typemap.c:783; jl_typemap_entry_assoc_exact
      63  /home/kristoffer/juliamkl/src/typemap.c:137; sig_match_simple
      383 /home/kristoffer/juliamkl/src/typemap.c:145; sig_match_simple
       90  /home/kristoffer/juliamkl/src/subtype.c:977; jl_types_equal
        48 /home/kristoffer/juliamkl/src/subtype.c:178; obviously_egal
       190 /home/kristoffer/juliamkl/src/subtype.c:978; jl_types_equal
        58 /home/kristoffer/juliamkl/src/subtype.c:210; obviously_unequal
         42 /home/kristoffer/juliamkl/src/jltypes.c:819; jl_unwrap_unionall
        76 /home/kristoffer/juliamkl/src/subtype.c:211; obviously_unequal
         52 /home/kristoffer/juliamkl/src/jltypes.c:819; jl_unwrap_unionall

Full relevant 0.5 trace:

603 .../buildbot/slave/package_tarball64/build/src/gf.c:1909; jl_apply_generic
603 ...ave/package_tarball64/build/src/julia_internal.h:749; jl_typemap_assoc_exact
 601 ...bot/slave/package_tarball64/build/src/typemap.c:866; jl_typemap_level_assoc_exact
  33  ...bot/slave/package_tarball64/build/src/typemap.c:803; jl_typemap_entry_assoc_exact
  24  ...bot/slave/package_tarball64/build/src/typemap.c:817; jl_typemap_entry_assoc_exact
   24 ...bot/slave/package_tarball64/build/src/typemap.c:107; sig_match_leaf
  250 ...bot/slave/package_tarball64/build/src/typemap.c:821; jl_typemap_entry_assoc_exact
    31 ...ldbot/slave/package_tarball64/build/src/julia.h:969; jl_is_type_type
   19  ...bot/slave/package_tarball64/build/src/typemap.c:136; sig_match_simple
    19 ...ldbot/slave/package_tarball64/build/src/julia.h:690; jl_svecref
   50  ...bot/slave/package_tarball64/build/src/typemap.c:142; sig_match_simple
   101 ...bot/slave/package_tarball64/build/src/typemap.c:147; sig_match_simple
    28 ...ot/slave/package_tarball64/build/src/jltypes.c:1675; type_eqv__
    14 ...ot/slave/package_tarball64/build/src/jltypes.c:1696; type_eqv__
    15 ...ot/slave/package_tarball64/build/src/jltypes.c:1717; type_eqv__
  21  ...bot/slave/package_tarball64/build/src/typemap.c:835; jl_typemap_entry_assoc_exact

cc @JeffBezanson

@martinholters
Member

With that benchmark, I also get:
0.6: 11.105 μs
0.5: 5.36 μs
0.4: 648.852 ns

@fredrikekre
Member

From #20993 (benchmarking 0.6 vs 0.5):

| ID | time ratio | memory ratio |
|----|------------|--------------|
| ["sort","insertionsort",("sort forwards","ascending")] | 1.59 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sort! forwards","ascending")] | 2.48 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm forwards","ascending")] | 1.51 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm forwards","descending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm forwards","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm reverse","ascending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm reverse","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! forwards","descending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! forwards","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! reverse","ascending")] | 1.42 (30%) ❌ | 1.00 (1%) |
| ["sort","insertionsort",("sortperm! reverse","random")] | 1.41 (30%) ❌ | 1.00 (1%) |
| ["sort","issorted",("reverse","ascending")] | 1.34 (30%) ❌ | 1.00 (1%) |
| ["sort","issorted",("reverse","random")] | 1.35 (30%) ❌ | 1.00 (1%) |
| ["sort","quicksort",("sortperm forwards","ascending")] | 1.36 (30%) ❌ | 1.00 (1%) |
| ["sort","quicksort",("sortperm forwards","descending")] | 1.33 (30%) ❌ | 1.00 (1%) |
| ["sort","quicksort",("sortperm! reverse","descending")] | 1.30 (30%) ❌ | 1.00 (1%) |

@KristofferC
Sponsor Member

Is that the actual sorting that got slower or just overhead like here?

@StefanKarpinski
Sponsor Member

We should probably have a dedicated sortperm implementation that doesn't use the same code paths as everything else. I believe it can use a linear-time sorting algorithm, since the exact distribution of values is known.
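The linear-time idea can be sketched as a counting-sort sortperm for integers in a known range [lo, hi]. (The function name and signature here are illustrative, not a Base API.)

```julia
# Counting-sort sortperm sketch: O(n + range) when the values are integers
# in a known range [lo, hi]. Illustrative, not the Base implementation.
function sortperm_counting(v::Vector{Int}, lo::Int, hi::Int)
    counts = zeros(Int, hi - lo + 2)
    for x in v
        counts[x - lo + 2] += 1        # histogram, shifted by one slot
    end
    cumsum!(counts, counts)            # prefix sums = starting offsets
    p = similar(v)
    for (i, x) in enumerate(v)
        counts[x - lo + 1] += 1
        p[counts[x - lo + 1]] = i      # stable placement of index i
    end
    return p
end

v = [3, 1, 2, 2, 3, 1]
p = sortperm_counting(v, 1, 3)
@assert v[p] == sort(v)
```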

@KristofferC
Sponsor Member

While true, it's perhaps not relevant for this issue.

@jebej
Contributor

jebej commented Apr 13, 2017

That sort of thing is why I wanted to check performance against 0.4 in #20947.

@tkelman
Contributor Author

tkelman commented Apr 16, 2017

We should add something derived from this to the performance tracking tests. I'll have to familiarize myself with how to structure those if no one beats me to it.

@RalphAS

RalphAS commented Apr 17, 2017

Although the constructor issue is solved, sparse matmul is still often dominated by sorting, so let me leave a note for future issue-posters or potential pull-requesters to consider.
While investigating Julia's sorting schemes, I noticed @JeffBezanson's Base.Sort.sortperm_int_range, which is a degenerate form (i.e. a stunningly elegant reduction) of radix sort. It seems tailor-made for this use case (anticipated by @StefanKarpinski above), but it is not invoked because the CSC matrix code uses the pre-allocated sortperm! interface. (FWIW, my experiments suggest that sortperm_int_range is faster than [tuned] traditional radix sort for ranges smaller than about 10^6.)
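The dispatch gap described above can be illustrated as follows. (Whether sortperm actually reaches the sortperm_int_range fast path is an internal, version-dependent detail, so this is a hedged sketch rather than a guaranteed behavior.)

```julia
# Hedged illustration: sortperm on an integer vector with default ordering
# may dispatch to the sortperm_int_range fast path, while the pre-allocated
# sortperm! used by sortSparseMatrixCSC! takes the generic comparison-sort
# path. Both produce valid permutations either way.
v = rand(1:100, 10_000)
p  = sortperm(v)                 # may use Base.Sort.sortperm_int_range
ix = collect(1:length(v))
sortperm!(ix, v)                 # generic path
@assert v[p] == sort(v) && v[ix] == sort(v)
```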

8 participants