
Add Rust-based SparsePauliOp.to_matrix and Miri tests #11388

Merged: 11 commits merged into Qiskit:main on Apr 26, 2024

Conversation

@jakelishman (Member) commented Dec 8, 2023

New SparsePauliOp.to_matrix

This rewrites the numerical version of SparsePauliOp.to_matrix in parallelised Rust, building up the matrices row-by-row rather than converting each contained operator to a matrix individually and summing them.

The new algorithms are completely row-based, which is embarrassingly parallel for dense matrices, and parallelisable with additional copies and cumulative sums in the CSR case.

The two new algorithms are an asymptotic complexity improvement for both dense and sparse matrices over the "sum the individual matrices" version. In the dense case, the cost goes from $\mathcal O\bigl(4^{\text{qubits}} \cdot \text{ops}\bigr)$ to $\mathcal O\bigl(4^{\text{qubits}} + 2^{\text{qubits}} \cdot \text{reduced ops}\bigr)$, where the first term is from the zeroing of the matrix, and the second is from updating the elements in-place. $\text{reduced ops}$ is the number of explicitly stored operations after Pauli-operator uniqueness compaction, so is upper-bounded as $4^{\text{qubits}}$. (The Pauli compaction goes as $\mathcal O(\text{ops})$, so is irrelevant to the complexity discussion.)
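To make the origin of those two terms concrete, here is a minimal serial Python sketch of the row-based dense build (my own illustrative names and bitmask representation, not the PR's actual Rust code; it assumes each term's (-i)**num_y phase has already been folded into its coefficient):

import numpy as np

def dense_from_zx(z_masks, x_masks, coeffs, num_qubits):
    # Each Pauli term is a pair of per-qubit bitmasks (z, x) plus a complex
    # coefficient assumed to absorb the (-i)**num_y phase of its Y factors.
    dim = 1 << num_qubits
    out = np.zeros((dim, dim), dtype=complex)  # the O(4**qubits) zeroing term
    for row in range(dim):  # rows are independent: embarrassingly parallel
        for z, x, coeff in zip(z_masks, x_masks, coeffs):  # reduced (unique) terms
            # Z**z X**x has exactly one non-zero per row, at column row ^ x,
            # with sign (-1)**popcount(row & z).
            sign = -1.0 if bin(row & z).count("1") % 2 else 1.0
            out[row, row ^ x] += sign * coeff  # the O(2**qubits * reduced ops) term
    return out

Since each row is written by exactly one iteration of the outer loop, the rows can be handed to separate threads with no synchronisation, which is the sense in which the dense case is embarrassingly parallel.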

The CSR form of the algorithm goes as $\mathcal O\bigl(2^{\text{qubits}} \cdot \text{reduced ops}\cdot\lg(\text{reduced ops})\bigr)$, which (I think! - I didn't fully calculate it) is asymptotically the same as before, but in practice the constant factors and intermediate memory use are dramatically reduced, and the new algorithm is threadable with an additional $\mathcal O\bigl(2^{\text{qubits}} \cdot \text{reduced ops}\bigr)$ intermediate memory overhead (the serial form has only $\mathcal O(\text{reduced ops})$ memory overhead).
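The corresponding serial Python sketch of the CSR build (again my own illustrative code, not the PR's Rust): each row accumulates at most one entry per reduced term, and sorting those entries by column before appending them is where the lg(reduced ops) factor comes from.

from scipy.sparse import csr_matrix

def csr_from_zx(z_masks, x_masks, coeffs, num_qubits):
    dim = 1 << num_qubits
    data, indices, indptr = [], [], [0]
    for row in range(dim):
        entries = {}  # at most `reduced ops` non-zeros per row
        for z, x, coeff in zip(z_masks, x_masks, coeffs):
            sign = -1.0 if bin(row & z).count("1") % 2 else 1.0
            entries[row ^ x] = entries.get(row ^ x, 0.0) + sign * coeff
        for col in sorted(entries):  # the lg(reduced ops) factor
            indices.append(col)
            data.append(entries[col])
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(dim, dim))

To thread this over blocks of rows, each thread needs its own scratch output (the extra O(2**qubits * reduced ops) intermediate memory mentioned above), and the per-row non-zero counts need a cumulative sum to stitch the per-block indptr segments together.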

The object-coefficients form is left as-is to avoid exploding the complexity in Rust space; these objects are already slow and unsuited for high-performance code, so the effects should be minimal.

Miri tests

As we begin to include more unsafe code in the Rust-accelerated components, it is becoming more important for us to test these in an undefined-behaviour sanitiser. This is done in a separate CI job because:

  • we do not yet know how stable Miri will be, so we don't want to block on it.

  • some dependencies need their version-resolution patching to Miri-compatible versions, but we want to run our regular test suites with the same versions of packages we will be building against.

Performance

Using a setup of:

import numpy as np
import pyqrusty as qrusty
from qiskit.quantum_info import SparsePauliOp
from qiskit.quantum_info.random import random_pauli_list
from scipy.sparse import csr_matrix

def rand_spo(num_qubits, num_ops, seed=None):
    rng = np.random.default_rng(seed)
    paulis = random_pauli_list(num_qubits, num_ops, seed=rng)
    coeffs = rng.random(num_ops) + 1j*rng.random(num_ops)
    return SparsePauliOp(paulis, coeffs)

def prep_qrusty(spo):
    return [l.to_label() for l in spo.paulis], spo.coeffs

def qrusty_to_sparse(paulis, coeffs, scipy=False):
    paulis = [qrusty.Pauli(l) for l in paulis]
    shape, data, indices, indptr = qrusty.SparsePauliOp(paulis, coeffs).to_matrix_mode("RowwiseUnsafeChunked/128").export()
    if scipy:
        return csr_matrix((data, indices, indptr), shape=shape)
    return data, indices, indptr

then timing with:

import collections

outs = collections.defaultdict(list)
qubits = list(range(4, 19))

for num_qubits in qubits:
    print(f"\n{num_qubits = }")
    spo = rand_spo(num_qubits, 1_000, seed=0)

    paulis, coeffs = prep_qrusty(spo)
    res = %timeit -o qrusty_to_sparse(paulis, coeffs, scipy=False)
    outs["qrusty"].append(res)
    res = %timeit -o qrusty_to_sparse(paulis, coeffs, scipy=True)
    outs["qrusty-scipy"].append(res)
    res = %timeit -o spo.to_matrix(sparse=True, force_serial=True)
    outs["qiskit-serial"].append(res)
    res = %timeit -o spo.to_matrix(sparse=True, force_serial=False)
    outs["qiskit-parallel"].append(res)

the resulting graph looks like this:

[Figure 1: best-time benchmark results for qrusty, qrusty-scipy, qiskit-serial, and qiskit-parallel against number of qubits]

The solid Qrusty line is the full path, including the construction of the Scipy matrix at the end. The dashed Qrusty line is just to construct the three arrays, since the index types of Qrusty's native CSR representation get converted when moving to Scipy. Both Qrusty lines are parallelised at 8 SMT threads.

The solid Qiskit line is with 8-thread SMT parallelism on my Macbook; the dashed one is the serial form. In absolute timings for the 18q, 1000-ops case, the full Qrusty path took ~4.54s (or ~3.68s without Scipy construction), while the parallel Qiskit path took ~2.86s (or ~3.91s in serial). Qiskit therefore takes under 2/3 of the time when both libraries build a Scipy matrix, and is still roughly 20% faster even when Qiskit's full Scipy construction is compared against Qrusty building only its native CSR arrays.

Older version of the graph that used averages instead of the best times (despite what the axis says):

[Figure: earlier averaged version of the same benchmark plot]

The big bumps in the parallel lines at 6 and 12 qubits appear legitimate and reproducible, but I'm not at all sure why they exist - maybe a weirdness where we move from one allocator to another mid-algorithm? Update: the bump appears to be caused by the same medium-size-allocator buggy shenanigans first detected by @kevinhartman in #10827, since on macOS 14 setting the environment variable MallocMediumZone=0 removes the big bump, and the bump doesn't seem to appear in the same way on Linux.

Fix #8772 (Speedup sparse matrix construction in SparsePauliOp).

@jakelishman added labels: type: qa, performance, Changelog: New Feature, mod: quantum info, Rust — Dec 8, 2023
@jakelishman jakelishman added this to the 1.0.0 milestone Dec 8, 2023
@jakelishman jakelishman requested review from ikkoham and a team as code owners December 8, 2023 20:55
@qiskit-bot (Collaborator)

One or more of the following people are requested to review this:

impl<T: Send, S: ?Sized> ParallelSliceMutExt<T> for S where S: ParallelSliceMut<T> {}
@jakelishman (Member Author):

Why did I write <S: ?Sized> ... where S: ParallelSliceMut<T> and not put both either in the <> or in the where? Who knows.

@coveralls commented Dec 8, 2023

Pull Request Test Coverage Report for Build 8848600959

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 414 of 471 (87.9%) changed or added relevant lines in 3 files are covered.
  • 662 unchanged lines in 65 files lost coverage.
  • Overall coverage increased (+0.2%) to 89.444%

Changes Missing Coverage                    Covered Lines  Changed/Added Lines      %
crates/accelerate/src/rayon_ext.rs                     61                   82  74.39
crates/accelerate/src/sparse_pauli_op.rs              344                  380  90.53

Files with Coverage Reduction                            New Missed Lines      %
qiskit/circuit/library/standard_gates/r.py                              1  97.62
qiskit/transpiler/target.py                                             1  93.74
crates/qasm2/src/expr.rs                                                1  94.03
qiskit/primitives/backend_sampler_v2.py                                 1  99.21
qiskit/circuit/library/generalized_gates/permutation.py                 1  93.1
qiskit/primitives/containers/data_bin.py                                1  97.92
qiskit/circuit/library/standard_gates/rxx.py                            1  97.44
qiskit/circuit/library/standard_gates/u2.py                             1  96.67
qiskit/circuit/library/standard_gates/rzx.py                            1  97.44
qiskit/quantum_info/operators/operator.py                               1  94.94

Totals (Coverage Status):
  Change from base Build 8728350061: +0.2%
  Covered Lines: 60930
  Relevant Lines: 68121


A follow-up commit parallelises the previously serial-only cumulative sum of the indptr array of numbers of non-zero entries at the end. In practice, I didn't actually see any change in performance from this, but philosophically it feels like the right thing to do.
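For reference, the standard two-pass trick that makes such a cumulative sum parallelisable, written as a serial Python sketch (the real code uses rayon; the function name and chunking here are illustrative, and assume len(counts) >= num_chunks so no chunk is empty):

import numpy as np

def chunked_cumsum(counts, num_chunks):
    # Pass 1 (parallelisable): prefix-sum each chunk independently.
    chunks = np.array_split(np.asarray(counts), num_chunks)
    partials = [np.cumsum(chunk) for chunk in chunks]
    # Short serial pass: each chunk's total becomes the offset of the next chunk.
    offsets = np.cumsum([0] + [int(p[-1]) for p in partials[:-1]])
    # Pass 2 (parallelisable): add each chunk's offset back in.
    return np.concatenate([p + off for p, off in zip(partials, offsets)])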
@jakelishman (Member Author) commented Jan 13, 2024

Now rebased over #11133.

edit: ugh, Miri's failing after the version of crossbeam-epoch got updated in our lock. I'll have to look into that again.

@mtreinish mtreinish modified the milestones: 1.0.0, 1.1.0 Jan 23, 2024
@mtreinish (Member) left a comment

Overall this LGTM, the underlying algorithm here makes sense and I learned a fair amount about rayon internals and plumbing by reviewing this and looking up the details in the rayon docs. For future observers https://github.com/rayon-rs/rayon/blob/v1.10.0/src/iter/plumbing/README.md provides a useful overview.

In the back of my head while reviewing this I was wondering how much the unsafe usage around uninitialized memory actually saves us performance-wise. But the analysis, inline comments, and Miri tests give me confidence that the usage is OK, and it definitely is faster. I left a few inline questions and comments, but they're mostly minor musings.

The other thing: I was looking at the coverage report and there are a few coverage gaps reported, mainly the 64-bit sparse path and then the serial paths for everything. They're definitely covered by the Rust tests, but I'm wondering if you want to add some Python-side test coverage just so we're exercising the full path (and including it in the coverage report).

crates/accelerate/src/rayon_ext.rs (resolved)
Comment on lines 251 to 252
let x_like = pack_bits(self.x).unwrap();
let z_like = pack_bits(self.z).unwrap();
@mtreinish (Member):

Do you want to return a real error here, since pack_bits can error? Probably a PyOverflowError, since the error condition is exceeding 64 bits.

@jakelishman (Member Author):

unwrap failing here would have been an internal logic error, because ZXPaulis validates that x and z have the same shape on construction from Python space, then the start of ZXPaulisView::matrix_compress validates that the number of qubits will fit in a u64 after the packing, which is the error condition for this (and elevates that to a ValueError).

I switched these over to expect with an error message explaining where the internal-logic error would come from, but can switch them to full Python exceptions if you think that's better (though it should be impossible to trigger them, even from Rust space, using the public interface).

@mtreinish (Member):

Hmm, if you're saying that the input to pack_bits will always fit in 64 bits because of earlier assertions, then I would probably just remove the size check and Result return from the function.

@jakelishman (Member Author):

Done in 6b82a7b.

@jakelishman (Member Author):

Whoops sorry, I got out-of-sync with your comments. The "done in" is referring to my prior comment.

The pack_bits function could still fail if it were called from other valid places in Rust space using its regular public interface (and as we expand SparsePauliOp more in Rust space, we're more likely to use that), so I think it's right to return a Result. It's just that in this particular case, we already validated the inputs so we know it should always work.
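For context on the error condition under discussion: pack_bits packs per-qubit booleans into machine words, so the overflow case is more than 64 qubits per word. A one-dimensional Python sketch of the idea (the real function is Rust and operates on 2D arrays):

def pack_bits(bools):
    # Pack one boolean per qubit into a single u64-style integer mask.
    if len(bools) > 64:
        raise OverflowError("cannot pack more than 64 bits into a u64")
    mask = 0
    for i, bit in enumerate(bools):
        mask |= int(bool(bit)) << i
    return mask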

crates/accelerate/src/sparse_pauli_op.rs (outdated, resolved)
Comment on lines +717 to +718
// It's also important that `$int_ty` does not implement `Drop`, since otherwise it
// will be called on uninitialised memory (all primitive int types satisfy this).
@mtreinish (Member):

Do you think it's worth restricting the int type to match PrimInt (https://docs.rs/num-traits/latest/num_traits/int/trait.PrimInt.html), just so we don't have potential reuse of the macro with an unsafe type (I could see someone trying to use BigInt or something)? It's probably over-engineering, since this really won't be exposed publicly.

@jakelishman (Member Author):

Do you have a way in mind to do this assertion? The macro_rules is more like a template-match than a generic, so I'm not sure of an obvious way to put a "trait bound" on the input.

I guess I could possibly hack something together where we attempt to codegen something invalid if $int_ty or $uint_ty don't satisfy PrimInt, but I don't know if there's a better pattern.

@mtreinish (Member):

Not specifically, I guess; I was thinking something like adding a dummy generic function that had the trait bound and passing $int_ty to it, so we'd get a compile error. I'm not sure if there is a better pattern either, TBH.

But it's not important; it was more just my idle musing. Especially since I don't think we'll ever make the macro use public, it's fine with just the comment.

@jakelishman (Member Author)

> In the back of my head while reviewing this I was wondering how much the unsafe usage around uninitialized memory actually saves us performance-wise.

I wrote this PR quite some time ago now, so I've forgotten exactly what I did, but typically when I'm making these performance-improvement PRs, I make them in a series of teeny-tiny commits where I change one thing at a time (like changing from growing a Vec to pre-allocating empty space) and verify that each change improves performance as expected, then squash the chain down into a coherent set of commits. I imagine I did the same thing here, but I don't 100% remember for sure.

> The other thing: I was looking at the coverage report and there are a few coverage gaps reported, mainly the 64-bit sparse path and then the serial paths for everything. They're definitely covered by the Rust tests, but I'm wondering if you want to add some Python-side test coverage just so we're exercising the full path (and including it in the coverage report).

The problem with this is that, from Python space, there's no real way to force the 64-bit mode without generating a very large operator that would take an annoyingly long time to test. It does leave a bit of a testing gap for the Python-space components, though, so if you think it's still worth it I can add one.

@mtreinish (Member)

> The problem with this is that, from Python space, there's no real way to force the 64-bit mode without generating a very large operator that would take an annoyingly long time to test. It does leave a bit of a testing gap for the Python-space components, though, so if you think it's still worth it I can add one.

Yeah, that's what I figured for the 64-bit case; I'm fine skipping those, as realistically the only option would be a slow test we only ran nightly, which wouldn't show up in our coverage report. But for the serial paths we should at least verify that via Python, I think.

@jakelishman (Member Author) commented Apr 26, 2024

Whoops, totally missed that you'd mentioned the serial stuff too. Yeah, that was an oversight, thanks. Serial vs parallel tests added in edfe2c6.

@mtreinish (Member) left a comment

LGTM now, thanks for the quick updates.

@mtreinish mtreinish added this pull request to the merge queue Apr 26, 2024
Merged via the queue into Qiskit:main with commit d084aeb Apr 26, 2024
13 checks passed
@jakelishman jakelishman deleted the spo-to-matrix branch April 26, 2024 16:14