
Sparsity-preserving outer products #24980

Merged: 7 commits merged into JuliaLang:master from jmert:sparse_outer on Jan 7, 2019

Conversation

@jmert (Contributor) commented Dec 8, 2017

This PR adds sparsity-preserving outer products of sparse vectors and views into sparse matrices, implemented as methods of kron and *.

The motivation for digging into this is the desire for fast/efficient quadratic matrix products. My specific case is computing a very large sparse-dense-sparse product, where the middle dense matrix can be computed one column at a time in parallel.

Some example code showing the impact:

using BenchmarkTools
using Test

m,n = (25_000, 100_000);  # dimensions of matrices
k = 1000;                 # column vector location to fill
A = sprand(m, n, 0.01);   # large, sparse operator
b = rand(n);              # column subset of dense matrix
B = sparse(collect(1:n), fill(k, n), b, n, n); # sparse representation

if true
    # outer products only
    @btime $b * $A[:, $k]';
    @btime $b * view($A, :, $k)';

    # quadratic matrix products which motivated this PR
    C1 = @btime $A * $B * $A';
    C2 = @btime ($A * $b) * view($A, :, $k)';

    @test C1 == C2
end

On master (Version 0.7.0-DEV.2770, Commit 11d5a53):

  1.830 s (6 allocations: 1.86 GiB)
  1.953 s (2 allocations: 1.86 GiB)
  39.669 ms (190 allocations: 116.06 MiB)
  199.583 ms (4 allocations: 190.77 MiB)
Test Passed

This PR:

  4.948 ms (12 allocations: 40.48 MiB)
  4.938 ms (10 allocations: 40.47 MiB)
  36.562 ms (205 allocations: 116.23 MiB)
  3.570 ms (12 allocations: 4.12 MiB)
Test Passed

I've marked this as a work-in-progress because I'd welcome comments on how to better integrate the new methods with the existing code. The first commit could be thrown away / extracted to a separate PR — it was just something I noticed while grappling with the current code — and the tests still need a Complex case which exercises handling of the RowVector{<:Any, ConjArray} case.

@andreasnoack (Member) left a comment

Great. Thanks for adding this functionality.

Review comment on base/sparse/linalg.jl (outdated, resolved)
@jmert (Contributor, Author) commented Dec 8, 2017

Looks like the first commit created a method ambiguity (resolved), and I might want to work on an implementation which will rebase on top of #24969 easily.

@ViralBShah (Member) commented Dec 17, 2018

@jmert Would it be possible for you to revive this PR so that it can help resolve various open issues?

@jmert (Contributor, Author) commented Dec 17, 2018

I have been meaning to return to this for quite some time, so yes, but possibly not until mid-January or so. Between the holidays and my own academic work, I don't have much free time in the next few weeks. (Maybe I'll squeeze it in between things as my own form of relaxing and getting away from Matlab ;-P)

@jmert (Contributor, Author) commented Dec 18, 2018

As a reference point for myself — on current master (v1.2.0-DEV.27), running benchmarks similar to those in the original post brings my computer to its knees for a few minutes. I think the second @benchmark's timing is inflated because I actually run out of RAM and start hitting swap (during the second benchmark, htop showed all 16 GB of RAM on my laptop in use and ~57 GB of virtual memory mapped, with periodic stalls as the machine became overtaxed).

  1.998 s (46 allocations: 37.25 GiB)
  43.104 s (4 allocations: 18.63 GiB)
  945.432 ms (26 allocations: 904.92 MiB)
  6.910 s (6 allocations: 4.66 GiB)

@jmert force-pushed the jmert:sparse_outer branch from 0aa366b to df1e18f on Dec 26, 2018

@jmert (Contributor, Author) commented Dec 26, 2018

I've reimplemented this PR on the current master. The big differences from before are:

  1. The core optimization is implemented within the broadcasting code path rather than as specific methods of kron and *. Unfortunately, I don't really understand the sparse broadcasting code all that well, so this special case is added through a sort of hack (see is_specialcase_sparse_broadcast()). A sketch of the underlying idea follows this list.
  2. No specialization occurs for mixed dense-sparse outer products (yet).
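
This is not the code in this PR (which hooks into the sparse broadcast machinery), just a minimal sketch of the underlying idea, assuming two SparseVector inputs: an outer product only needs to visit the stored entries, so the result carries nnz(u) * nnz(v) stored values instead of a dense length(u) × length(v) matrix. The helper name sparse_outer is purely illustrative.

using SparseArrays

# Illustrative sketch: form u * v' while visiting only the stored entries.
function sparse_outer(u::SparseVector, v::SparseVector)
    ui, uv = findnz(u)                       # stored indices and values of u
    vi, vv = findnz(v)                       # stored indices and values of v
    rowind = repeat(ui, outer = length(vi))  # row index of each stored (i, j) pair
    colind = repeat(vi, inner = length(ui))  # matching column index, column-major order
    vals   = vec(uv .* adjoint(vv))          # pairwise products u[i] * conj(v[j])
    return sparse(rowind, colind, vals, length(u), length(v))
end

u = sprand(1_000, 0.05);
v = sprand(800, 0.05);
sparse_outer(u, v) ≈ Vector(u) * Vector(v)'  # same values, only nnz(u)*nnz(v) stored entries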

The updated benchmark I'm working with is:

using BenchmarkTools, SparseArrays, Test

m,n = (2_000, 10_000);
k = 1000;
A = sprand(m, n, 0.01);
a = view(A, :, k);
u = A[:, k];
x = sparse(rand(n));
X = sparse(collect(1:n), fill(k, n), x, n, n);

suite = BenchmarkGroup()
# outer products only
suite["x × u'"] = @benchmarkable $x * $(u');
suite["x × a'"] = @benchmarkable $x * $(a');
# quadratic matrix products which motivated this PR
suite["A × X × A'"] = @benchmarkable $A * $X * $(A');
suite["(A × x) × a'"] = @benchmarkable ($A * $x) * $(a');
tune!(suite)

before = run(suite)

using Revise
Revise.track(Base)  # reload the locally modified sources so `after` runs with this PR's code

after = run(suite)

if true
    println("Before vs after for specific operations")
    println("=======================================\n")
    for key in keys(after)
        t1, t0 = minimum.((after[key], before[key]))
        r = judge(t1, t0)
        print(key, ": ")
        show(stdout, MIME"text/plain"(), r)
        println()
    end

    println("\n\n")
    println("Optimized quadratic product, before and after")
    println("=============================================\n")
    for (trial,name) in [(before,"before"), (after,"after")]
        key1 = "(A × x) × a'"
        key2 = "A × X × A'"
        t1, t0 = minimum.((trial[key1], trial[key2]))
        r = judge(t1, t0)
        print("$key1 vs $key2 ($name): ")
        show(stdout, MIME"text/plain"(), r)
        println()
    end
end

and gives the following results:

Before vs after for specific operations
=======================================

(A × x) × a': BenchmarkTools.TrialJudgement:
  time:   -99.82% => improvement (5.00% tolerance)
  memory: -97.70% => improvement (1.00% tolerance)
x × a': BenchmarkTools.TrialJudgement:
  time:   -99.96% => improvement (5.00% tolerance)
  memory: -97.89% => improvement (1.00% tolerance)
A × X × A': BenchmarkTools.TrialJudgement:
  time:   +0.38% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
x × u': BenchmarkTools.TrialJudgement:
  time:   -98.03% => improvement (5.00% tolerance)
  memory: -98.95% => improvement (1.00% tolerance)



Optimized quadratic product, before and after
=============================================

(A × x) × a' vs A × X × A' (before): BenchmarkTools.TrialJudgement:
  time:   +6941.20% => regression (5.00% tolerance)
  memory: +290.58% => regression (1.00% tolerance)
(A × x) × a' vs A × X × A' (after): BenchmarkTools.TrialJudgement:
  time:   -87.66% => improvement (5.00% tolerance)
  memory: -91.02% => improvement (1.00% tolerance)
@StefanKarpinski (Member) commented Jan 1, 2019 (comment minimized)

@ViralBShah requested review from @mbauman and @KristofferC on Jan 3, 2019

@ViralBShah (Member) commented Jan 3, 2019

The outer product part is straightforward. If @mbauman or @KristofferC can check the broadcasting bits, we can probably merge. @andreasnoack mentioned lowrankupdate!, and I am not sure whether we want to do anything further.

Also, it is fine to get this performance improvement in and do further improvements in other PRs.

@mbauman (Member) approved these changes on Jan 3, 2019 and left a comment

I'm impressed — the sparse broadcasting machinery has survived quite a few incremental code changes and is a bit of a tangled mess. You've adeptly hooked in right where it'd make sense.

I think this kind of peephole optimization is worth doing, and with a few minor renames we can set the stage for supporting more and more array types without needing to _sparsifystructured them.

Three review comments on stdlib/SparseArrays/src/higherorderfns.jl (outdated, resolved)
@mbauman (Member) commented Jan 3, 2019

Am I correct in understanding that the only functional change here is that kron used to return a dense matrix whereas now it's appropriately sparse? Ah, and also multiplication with views. Everything else is performance.

@ViralBShah (Member) commented Jan 3, 2019

IIUC, this used to be ok, but after the whole adjoint business, the kron of a sparse matrix with a transposed matrix became dense. That's one of the cases this fixes, views being the other one.

@jmert (Contributor, Author) commented Jan 4, 2019

> @andreasnoack mentioned lowrankupdate!, and I am not sure whether we want to do anything further.

Correct. The old PR had put in a stub function, but any performance "improvement" implied by providing an in-place update would have been a lie, since I wasn't implementing such a feature at that time. If someone wants to tackle that, I agree it can be in a new PR.

> I'm impressed — the sparse broadcasting machinery has survived quite a few incremental code changes and is a bit of a tangled mess. You've adeptly hooked in right where it'd make sense.

Thanks!! I did have to abandon several approaches after realizing they weren't going to work out as I'd hoped.

> Am I correct in understanding that the only functional change here is that kron used to return a dense matrix whereas now it's appropriately sparse? Ah, and also multiplication with views. Everything else is performance.

@ViralBShah has the current story correct — in the year since I started this PR, the switch to Transpose/Adjoint introduced a huge slow-down because the outer product is calculated via the generic dense LinearAlgebra code. This PR now maintains sparsity in kron(u, v), u * v', and u .* v'.
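
For concreteness, a minimal sanity check of the operations that should now stay sparse (illustrative only, meant to be run against a build that includes these changes) could look like:

using SparseArrays, Test

u = sprand(100, 0.1);
v = sprand(80, 0.1);

@test issparse(kron(u, v))   # Kronecker product of two sparse vectors stays sparse
@test issparse(u * v')       # outer product via *
@test issparse(u .* v')      # outer product via broadcasting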

After the rewrite against current master, I dropped one of my original goals — having dense-sparse outer products produce a sparse output. I'm not yet comfortable enough with the broadcasting implementation to spend time on that goal now; issue #26613 would be a good place to revive that conversation and let those enhancements happen in a future PR that extends this implementation.

@ViralBShah (Member) commented Jan 5, 2019

@mbauman @andreasnoack Is this good to merge?

@jmert (Contributor, Author) commented Jan 5, 2019

I ran just the SparseArrays and broadcast test suites locally, and they passed after switching return broadcast to return _copy as @mbauman suggested in a review comment — I can add a final commit with that change if it's desired.

@ViralBShah (Member) commented Jan 5, 2019

Since it was requested, please go ahead.

@jmert (Contributor, Author) commented Jan 5, 2019

Commit added. This is ready to go from my perspective if/when it passes CI.

@ViralBShah (Member) commented Jan 5, 2019

Can you rebase on master (unless you did it in the last day or two) so that we pick up all the CI fixes?

@jmert force-pushed the jmert:sparse_outer branch from e28e256 to 9bec023 on Jan 5, 2019

@jmert (Contributor, Author) commented Jan 5, 2019

> Can you rebase on master (unless you did it in the last day or two) so that we pick up all the CI fixes?

No problem — done.

@jmert changed the title from "[WIP] Sparsity-preserving outer products" to "Sparsity-preserving outer products" on Jan 5, 2019

@andreasnoack merged commit dffe119 into JuliaLang:master on Jan 7, 2019

3 checks passed:

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
julia freebsd ci: Build done
@KristofferC (Contributor) commented Jan 10, 2019

> Can you rebase on master (unless you did it in the last day or two) so that we pick up all the CI fixes?

Note that CI runs on the merge commit, so rebasing doesn't really do anything with respect to CI.

I don't really see how this can be backported to a patch release since it changes behavior? Tentatively removing the backport label. Also, this needs a NEWS entry.

@mbauman (Member) commented Jan 10, 2019

Agreed — this shouldn't be backported.

@jmert deleted the jmert:sparse_outer branch on Jan 13, 2019

@jmert added a commit to jmert/julia that referenced this pull request on Jan 13, 2019
@abraunst (Contributor) commented Jan 19, 2019

Should kron(::Diagonal, ::Union{SparseVector, SparseMatrix}) (and vice versa) return SparseMatrixCSC? (sure!)
Should kron(::Diagonal, ::Matrix) (and vice versa) return SparseMatrixCSC? (this would be consistent with this PR -- but currently there is an implementation for diagonal times dense...)

@ViralBShah added a commit that referenced this pull request on Jan 20, 2019
@saolof commented Apr 5, 2019

This will be utterly amazing for my research and I'd like to thank you all for working on this.
