
Sparsity-preserving outer products #24980

Merged: 7 commits into JuliaLang:master on Jan 7, 2019

Conversation

@jmert (Contributor) commented Dec 8, 2017

This PR adds sparsity-preserving outer products of sparse vectors and views into sparse matrices, implemented as methods of kron and *.

My motivation for digging into this is the desire for fast, efficient quadratic matrix products. My specific case is computing a very large sparse-dense-sparse product, where the middle dense matrix can be computed in parallel one column at a time.

Some example code showing the impact:

using BenchmarkTools
using Test

m,n = (25_000, 100_000);  # dimensions of matrices
k = 1000;                 # column vector location to fill
A = sprand(m, n, 0.01);   # large, sparse operator
b = rand(n);              # column subset of dense matrix
B = sparse(collect(1:n), fill(k, n), b, n, n); # sparse representation

if true
    # outer products only
    @btime $b * $A[:, $k]';
    @btime $b * view($A, :, $k)';

    # quadratic matrix products which motivated
    C1 = @btime $A * $B * $A';
    C2 = @btime ($A * $b) * view($A, :, $k)';

    @test C1 == C2
end

On master (Version 0.7.0-DEV.2770, Commit 11d5a53):

  1.830 s (6 allocations: 1.86 GiB)
  1.953 s (2 allocations: 1.86 GiB)
  39.669 ms (190 allocations: 116.06 MiB)
  199.583 ms (4 allocations: 190.77 MiB)
Test Passed

This PR:

  4.948 ms (12 allocations: 40.48 MiB)
  4.938 ms (10 allocations: 40.47 MiB)
  36.562 ms (205 allocations: 116.23 MiB)
  3.570 ms (12 allocations: 4.12 MiB)
Test Passed

I've marked this as a work-in-progress because I'd welcome comments on how to better integrate the new methods with existing code. The first commit could be dropped or extracted into a separate PR — it was just something I noticed while grappling with the current code — and the tests still need a Complex case that exercises handling of the RowVector{<:Any, ConjArray} case.
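
As a minimal conceptual sketch (not this PR's implementation, just an illustration of why preserving sparsity pays off), the outer product u * v' of two sparse vectors only needs to touch the nonzeros of each factor, producing nnz(u) * nnz(v) stored entries instead of an m×n dense array:

using SparseArrays, LinearAlgebra

function sparse_outer(u::SparseVector, v::SparseVector)
    ui, uv = findnz(u)                    # nonzero row indices and values of u
    vi, vv = findnz(v)                    # nonzero column indices and values of v
    I = repeat(ui, outer = length(vi))    # row index of every nonzero pair
    J = repeat(vi, inner = length(ui))    # column index of every nonzero pair
    V = vec(uv .* transpose(vv))          # all pairwise products, column-major
    return sparse(I, J, V, length(u), length(v))
end

u = sprand(1_000, 0.01);
v = sprand(2_000, 0.01);
sparse_outer(u, v) == u * v'    # true on Julia ≥ 1.2, where u * v' stays sparse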

@andreasnoack (Member) left a comment

Great. Thanks for adding this functionality.

Review comment on base/sparse/linalg.jl (outdated, resolved)
@jmert (Contributor, Author) commented Dec 8, 2017

Looks like the first commit created a method ambiguity (resolved), and I might want to work on an implementation which will rebase on top of #24969 easily.

@ViralBShah (Member) commented:

@jmert Would it be possible for you to revive this PR so that it can help resolve various open issues?

@jmert (Contributor, Author) commented Dec 17, 2018

I have been meaning to return to this for quite some time, so yes, but possibly not until mid-January or so. Between the holidays and my own academic work, I don't have much free time in the next few weeks. (Maybe I'll squeeze it in between things as my own form of relaxing and getting away from Matlab ;-P)

@jmert (Contributor, Author) commented Dec 18, 2018

As a reference point for myself — running benchmarks similar to the original post on current master (v1.2.0-DEV.27) brings my computer to its knees for a few minutes. I think the second benchmark's timing is inflated because I actually run out of RAM and start hitting swap (during the second benchmark, htop showed all 16 GB of RAM on my laptop in use and ~57 GB of virtual memory mapped, with periodic stalls as the machine becomes overtaxed).

  1.998 s (46 allocations: 37.25 GiB)
  43.104 s (4 allocations: 18.63 GiB)
  945.432 ms (26 allocations: 904.92 MiB)
  6.910 s (6 allocations: 4.66 GiB)

@jmert (Contributor, Author) commented Dec 26, 2018

I've reimplemented this PR on the current master. The big differences from before are:

  1. The core optimization is implemented within the broadcasting code path rather than as specific methods of kron and *. Unfortunately, I don't understand the sparse broadcasting machinery all that well, so this special case is added through a bit of a hack (see is_specialcase_sparse_broadcast(); a rough sketch of the idea follows after this list).
  2. No specialization occurs for mixed dense-sparse outer products (yet).
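
A rough, hypothetical sketch of the kind of check that special case performs (this is not the actual code in higherorderfns.jl, and the real implementation also handles views of sparse-matrix columns):

using SparseArrays, LinearAlgebra

# Hypothetical sketch only: the special case is the outer product u .* v',
# i.e. `*` broadcast over a sparse column vector and the adjoint/transpose
# of a sparse vector, which can be materialized directly from the nonzeros
# instead of going through the generic sparsification machinery.
is_sparse_outer_broadcast(f, x, y) =
    f === (*) &&
    x isa SparseVector &&
    y isa Union{Adjoint{<:Any,<:SparseVector}, Transpose{<:Any,<:SparseVector}}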

The updated benchmark I'm working with is:

using BenchmarkTools, SparseArrays, Test

m,n = (2_000, 10_000);
k = 1000;
A = sprand(m, n, 0.01);
a = view(A, :, k);
u = A[:, k];
x = sparse(rand(n));
X = sparse(collect(1:n), fill(k, n), x, n, n);

suite = BenchmarkGroup()
# outer products only
suite["x × u'"] = @benchmarkable $x * $(u');
suite["x × a'"] = @benchmarkable $x * $(a');
# quadratic matrix products which motivated
suite["A × X × A'"] = @benchmarkable $A * $X * $(A');
suite["(A × x) × a'"] = @benchmarkable ($A * $x) * $(a');
tune!(suite)

before = run(suite)

using Revise
Revise.track(Base)

after = run(suite)

if true
    println("Before vs after for specific operations")
    println("=======================================\n")
    for key in keys(after)
        t1, t0 = minimum.((after[key], before[key]))
        r = judge(t1, t0)
        print(key, ": ")
        show(stdout, MIME"text/plain"(), r)
        println()
    end

    println("\n\n")
    println("Optimized quadratic product, before and after")
    println("=============================================\n")
    for (trial,name) in [(before,"before"), (after,"after")]
        key1 = "(A × x) × a'"
        key2 = "A × X × A'"
        t1, t0 = minimum.((trial[key1], trial[key2]))
        r = judge(t1, t0)
        print("$key1 vs $key2 ($name): ")
        show(stdout, MIME"text/plain"(), r)
        println()
    end
end

and gives results

Before vs after for specific operations
=======================================

(A × x) × a': BenchmarkTools.TrialJudgement:
  time:   -99.82% => improvement (5.00% tolerance)
  memory: -97.70% => improvement (1.00% tolerance)
x × a': BenchmarkTools.TrialJudgement:
  time:   -99.96% => improvement (5.00% tolerance)
  memory: -97.89% => improvement (1.00% tolerance)
A × X × A': BenchmarkTools.TrialJudgement:
  time:   +0.38% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
x × u': BenchmarkTools.TrialJudgement:
  time:   -98.03% => improvement (5.00% tolerance)
  memory: -98.95% => improvement (1.00% tolerance)



Optimized quadratic product, before and after
=============================================

(A × x) × a' vs A × X × A' (before): BenchmarkTools.TrialJudgement:
  time:   +6941.20% => regression (5.00% tolerance)
  memory: +290.58% => regression (1.00% tolerance)
(A × x) × a' vs A × X × A' (after): BenchmarkTools.TrialJudgement:
  time:   -87.66% => improvement (5.00% tolerance)
  memory: -91.02% => improvement (1.00% tolerance)

@StefanKarpinski (Member) commented:

cc @mbauman, @KristofferC

@ViralBShah (Member) commented:

The outer product part is straightforward. If @mbauman or @KristofferC can check the broadcasting bits, we can probably merge. @andreasnoack mentioned lowrankupdate!, and I am not sure whether we want to do anything further there.

Also, it is fine to get this performance improvement in now and make further improvements in follow-up PRs.

@mbauman (Member) left a comment

I'm impressed — the sparse broadcasting machinery has survived quite a few incremental code changes and is a bit of a tangled mess. You've adeptly hooked in right where it makes sense.

I think this kind of peephole optimization is worthwhile, and with a few minor renames we can set the stage for supporting more and more array types without needing to _sparsifystructured them.

Three review comments on stdlib/SparseArrays/src/higherorderfns.jl (outdated, resolved)
@mbauman (Member) commented Jan 3, 2019

Am I correct in understanding that the only functional change here is that kron used to return a dense matrix whereas now it's appropriately sparse? Ah, and also multiplication with views. Everything else is performance.

@mbauman added the labels 'performance (Must go faster)' and 'domain:broadcast (Applying a function over a collection)' on Jan 3, 2019
@ViralBShah (Member) commented:

IIUC, this used to be OK, but after the whole adjoint business, the kron of a sparse matrix with a transposed matrix became dense. That's one of the cases this fixes; views are the other one.

@jmert (Contributor, Author) commented Jan 4, 2019

@andreasnoack mentioned lowrankupdate!, and I am not sure whether we want to do anything further there.

Correct. The old PR had put in a stub function, but any performance "improvement" implied by providing an in-place update would have been a lie, since I wasn't actually implementing such a feature at the time. If someone wants to tackle that, I agree it can be done in a new PR.

I'm impressed — the sparse broadcasting machinery has survived quite a few incremental code changes and is a bit of a tangled mess. You've adeptly hooked in right where it makes sense.

Thanks!! I did have to abandon several approaches after realizing they weren't going to work out as I'd hoped.

Am I correct in understanding that the only functional change here is that kron used to return a dense matrix whereas now it's appropriately sparse? Ah, and also multiplication with views. Everything else is performance.

@ViralBShah has the story right — since I started this PR a year ago, the switch to Transpose/Adjoint caused a huge slowdown by routing the outer product through the generic dense LinearAlgebra code. This PR now maintains sparsity in kron(u, v), u * v', and u .* v'.
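
For concreteness, an illustrative check of the behavior with this change (assuming Julia ≥ 1.2, where it landed):

using SparseArrays, LinearAlgebra
u = sprand(100, 0.1);
v = sprand(200, 0.1);
issparse(kron(u, v))    # true: kron of sparse vectors stays sparse
issparse(u * v')        # true: the outer product stays sparse
issparse(u .* v')       # true: the broadcast form is kept sparse as well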

In the rewrite against current master, I dropped one of my original goals — having mixed dense-sparse outer products produce a sparse output. I'm unsure enough of the broadcasting implementation at this point that I decided not to spend time on that goal; issue #26613 would be a good place to revive that conversation and let those enhancements happen alongside a future PR extending this implementation.

@ViralBShah (Member) commented:

@mbauman @andreasnoack Is this good to merge?

@jmert (Contributor, Author) commented Jan 5, 2019

I did run just the SparseArrays and broadcast test suites locally, and they passed after switching return broadcast to return _copy as @mbauman suggested in a review comment — I can add a final commit with that change if desired.

@ViralBShah (Member) commented:

Since it was requested, please go ahead.

@jmert (Contributor, Author) commented Jan 5, 2019

Commit added. This is ready to go from my perspective if/when it passes CI.

* Change is_specialcase_sparse_broadcast -> can_skip_sparsification.
* Lift the parent(y) call one function earlier for clarity.
@jmert (Contributor, Author) commented Jan 5, 2019

Can you rebase on master (unless you did it in the last day or two) so that we pick up all the CI fixes?

No problem — done.

@jmert changed the title from "[WIP] Sparsity-preserving outer products" to "Sparsity-preserving outer products" on Jan 5, 2019
@andreasnoack merged commit dffe119 into JuliaLang:master on Jan 7, 2019
@KristofferC (Member) commented:

Can you rebase on master (unless you did it in the last day or two) so that we pick up all the CI fixes?

Note that CI runs on the merge commit, so rebasing doesn't really do anything with respect to CI.

I don't really see how this can be backported to a patch release since it changes behavior? Tentatively removing the backport label. Also, this needs a NEWS entry.

@KristofferC added the label 'needs news (A NEWS entry is required for this change)' and removed the label 'backport pending 1.0' on Jan 10, 2019
@mbauman (Member) commented Jan 10, 2019

Agreed — this shouldn't be backported.

@StefanKarpinski added the label 'kind:minor change (Marginal behavior change acceptable for a minor release)' on Jan 10, 2019
@jmert deleted the sparse_outer branch on January 13, 2019
jmert added a commit to jmert/julia that referenced this pull request Jan 13, 2019
@abraunst (Contributor) commented Jan 19, 2019

Should kron(::Diagonal, ::Union{SparseVector, SparseMatrix}) (and vice versa) return SparseMatrixCSC? (Sure!)
Should kron(::Diagonal, ::Matrix) (and vice versa) return SparseMatrixCSC? (This would be consistent with this PR -- but currently there is an implementation for diagonal times dense...)

@saolof commented Apr 5, 2019

This will be utterly amazing for my research and I'd like to thank you all for working on this.

jmert added a commit to jmert/CMB.jl that referenced this pull request Jul 8, 2020
The feature was added to Julia in time for v1.2 in JuliaLang/julia#24980,
so get rid of the custom `outer()` method here and rewrite `quadprod()`
in terms of just standard matrix methods. Julia v1.2 is the
minimum-supported version at this point, so no need to worry about
backporting the functionality.

In the future, this function may yet still go away since the
implementation is nearly trivial at this point, but that can be a
follow-up PR.
jmert added a commit to jmert/CMB.jl that referenced this pull request Oct 28, 2020
First, this removes the option to do row-wise quadratic products since
they aren't being used within this package anyway. That allows removing
the "keyword" argument for choosing which direction to apply.

Second, the original implementation was optimized for the fast vector
outer product (that I got added to SparseArrays in JuliaLang/julia#24980
and made it into Julia v1.2), but when scaling up to multiple columns
the performance was disastrous because the dispatch of a transposed
view led to generic matrix multiplication which did the full
dense-dense style loops. By not using views, we get the desired
sparse matrix multiplication instead.
jmert added a commit to jmert/Healpix.jl that referenced this pull request Oct 23, 2022
The feature was added to Julia in time for v1.2 in JuliaLang/julia#24980,
so get rid of the custom `outer()` method here and rewrite `quadprod()`
in terms of just standard matrix methods. Julia v1.2 is the
minimum-supported version at this point, so no need to worry about
backporting the functionality.

In the future, this function may yet still go away since the
implementation is nearly trivial at this point, but that can be a
follow-up PR.
jmert added a commit to jmert/Healpix.jl that referenced this pull request Oct 23, 2022
First, this removes the option to do row-wise quadratic products since
they aren't being used within this package anyway. That allows removing
the "keyword" argument for choosing which direction to apply.

Second, the original implementation was optimized for the fast vector
outer product (that I got added to SparseArrays in JuliaLang/julia#24980
and made it into Julia v1.2), but when scaling up to multiple columns
the performance was disastrous because the dispatch of a transposed
view led to generic matrix multiplication which did the full
dense-dense style loops. By not using views, we get the desired
sparse matrix multiplication instead.
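
An illustrative sketch of the distinction that commit describes (hypothetical dimensions; exact dispatch behavior depends on the Julia/SparseArrays version):

using SparseArrays, LinearAlgebra
A = sprand(500, 2_000, 0.01);      # m×n sparse operator
cols = 1:8;                        # a handful of columns of the middle matrix
mid = A * rand(2_000, 8);          # m×8 dense "middle" block (stand-in for A*b)
# Transposed view of several columns: per the commit message above, this
# dispatched to the generic (dense-style) matrix-multiplication loops.
slow = mid * view(A, :, cols)';
# Indexed copy of the same columns: avoids that generic fallback.
fast = mid * A[:, cols]';
slow ≈ fast                        # same result; the commit reports very different performance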

Labels: domain:arrays:sparse, domain:broadcast, domain:linear algebra, kind:minor change, needs news, performance