
Add Rust-based SparsePauliOp.to_matrix and Miri tests #11388

Merged: 11 commits merged into Qiskit:main on Apr 26, 2024

Conversation

@jakelishman (Member) commented Dec 8, 2023

New SparsePauliOp.to_matrix

This rewrites the numerical version of SparsePauliOp.to_matrix in parallelised Rust, building up the matrices row-by-row rather than converting each contained operator to a matrix individually and summing them.

The new algorithms are completely row-based, which is embarrassingly parallel for dense matrices, and parallelisable with additional copies and cumulative sums in the CSR case.

The two new algorithms are an asymptotic complexity improvement for both dense and sparse matrices over the "sum the individual matrices" version. In the dense case, the cost goes from $\mathcal O\bigl(4^{\text{qubits}} \cdot \text{ops}\bigr)$ to $\mathcal O\bigl(4^{\text{qubits}} + 2^{\text{qubits}} \cdot \text{reduced ops}\bigr)$, where the first term is from the zeroing of the matrix, and the second is from updating the elements in-place. $\text{reduced ops}$ is the number of explicitly stored operations after Pauli-operator uniqueness compaction, so is upper-bounded as $4^{\text{qubits}}$. (The Pauli compaction goes as $\mathcal O(\text{ops})$, so is irrelevant to the complexity discussion.)
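To make the origin of those two terms concrete, here is a minimal serial Python sketch of the row-based dense build (my own illustrative names and bitmask representation, not the PR's actual Rust code; it assumes each term's (-i)**num_y phase has already been folded into its coefficient):

import numpy as np

def dense_from_zx(z_masks, x_masks, coeffs, num_qubits):
    # Each Pauli term is a pair of per-qubit bitmasks (z, x) plus a complex
    # coefficient assumed to absorb the (-i)**num_y phase of its Y factors.
    dim = 1 << num_qubits
    out = np.zeros((dim, dim), dtype=complex)  # the O(4**qubits) zeroing term
    for row in range(dim):  # rows are independent: embarrassingly parallel
        for z, x, coeff in zip(z_masks, x_masks, coeffs):  # reduced (unique) terms
            # Z**z X**x has exactly one non-zero per row, at column row ^ x,
            # with sign (-1)**popcount(row & z).
            sign = -1.0 if bin(row & z).count("1") % 2 else 1.0
            out[row, row ^ x] += sign * coeff  # the O(2**qubits * reduced ops) term
    return out

Since each row is written by exactly one iteration of the outer loop, the rows can be handed to separate threads with no synchronisation, which is the sense in which the dense case is embarrassingly parallel.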

The CSR form of the algorithm goes as $\mathcal O\bigl(2^{\text{qubits}} \cdot \text{reduced ops}\cdot\lg(\text{reduced ops})\bigr)$, which (I think! - I didn't fully calculate it) is asymptotically the same as before, but in practice the constant factors and intermediate memory use are dramatically reduced, and the new algorithm is threadable with an additional $\mathcal O\bigl(2^{\text{qubits}} \cdot \text{reduced ops}\bigr)$ intermediate memory overhead (the serial form has only $\mathcal O(\text{reduced ops})$ memory overhead).
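The corresponding serial Python sketch of the CSR build (again my own illustrative code, not the PR's Rust): each row accumulates at most one entry per reduced term, and sorting those entries by column before appending them is where the lg(reduced ops) factor comes from.

from scipy.sparse import csr_matrix

def csr_from_zx(z_masks, x_masks, coeffs, num_qubits):
    dim = 1 << num_qubits
    data, indices, indptr = [], [], [0]
    for row in range(dim):
        entries = {}  # at most `reduced ops` non-zeros per row
        for z, x, coeff in zip(z_masks, x_masks, coeffs):
            sign = -1.0 if bin(row & z).count("1") % 2 else 1.0
            entries[row ^ x] = entries.get(row ^ x, 0.0) + sign * coeff
        for col in sorted(entries):  # the lg(reduced ops) factor
            indices.append(col)
            data.append(entries[col])
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(dim, dim))

To thread this over blocks of rows, each thread needs its own scratch output (the extra O(2**qubits * reduced ops) intermediate memory mentioned above), and the per-row non-zero counts need a cumulative sum to stitch the per-block indptr segments together.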

The object-coefficients form is left as-is to avoid exploding the complexity in Rust space; these objects are already slow and unsuited for high-performance code, so the effects should be minimal.

Miri tests

As we begin to include more unsafe code in the Rust-accelerated components, it is becoming more important for us to test these in an undefined-behaviour sanitiser. This is done in a separate CI job because:

  • we do not yet know how stable Miri will be, so we don't want to block on it.

  • some dependencies need their version-resolution patching to Miri-compatible versions, but we want to run our regular test suites with the same versions of packages we will be building against.

Performance

Using a setup of:

import numpy as np
import pyqrusty as qrusty
from qiskit.quantum_info import SparsePauliOp
from qiskit.quantum_info.random import random_pauli_list
from scipy.sparse import csr_matrix

def rand_spo(num_qubits, num_ops, seed=None):
    rng = np.random.default_rng(seed)
    paulis = random_pauli_list(num_qubits, num_ops, seed=rng)
    coeffs = rng.random(num_ops) + 1j*rng.random(num_ops)
    return SparsePauliOp(paulis, coeffs)

def prep_qrusty(spo):
    return [l.to_label() for l in spo.paulis], spo.coeffs

def qrusty_to_sparse(paulis, coeffs, scipy=False):
    paulis = [qrusty.Pauli(l) for l in paulis]
    shape, data, indices, indptr = qrusty.SparsePauliOp(paulis, coeffs).to_matrix_mode("RowwiseUnsafeChunked/128").export()
    if scipy:
        return csr_matrix((data, indices, indptr), shape=shape)
    return data, indices, indptr

then timing with:

import collections

outs = collections.defaultdict(list)
qubits = list(range(4, 19))

for num_qubits in qubits:
    print(f"\n{num_qubits = }")
    spo = rand_spo(num_qubits, 1_000, seed=0)

    paulis, coeffs = prep_qrusty(spo)
    res = %timeit -o qrusty_to_sparse(paulis, coeffs, scipy=False)
    outs["qrusty"].append(res)
    res = %timeit -o qrusty_to_sparse(paulis, coeffs, scipy=True)
    outs["qrusty-scipy"].append(res)
    res = %timeit -o spo.to_matrix(sparse=True, force_serial=True)
    outs["qiskit-serial"].append(res)
    res = %timeit -o spo.to_matrix(sparse=True, force_serial=False)
    outs["qiskit-parallel"].append(res)

the resulting graph looks like this:

[Figure 1: best-time benchmark results for qrusty, qrusty-scipy, qiskit-serial, and qiskit-parallel against number of qubits]

The solid Qrusty line is the full path, including the construction of the Scipy matrix at the end. The dashed Qrusty line is just to construct the three arrays, since the index types of Qrusty's native CSR representation get converted when moving to Scipy. Both Qrusty lines are parallelised at 8 SMT threads.

The solid Qiskit line is with 8-thread SMT parallelism on my Macbook; the dashed one is the serial form. In absolute timings for the 18q, 1000-ops case, the full Qrusty path took ~4.54s (or ~3.68s without Scipy construction), while the parallel Qiskit path took ~2.86s (or ~3.91s in serial). Qiskit therefore takes under 2/3 of the time when both libraries build a Scipy matrix, and is still roughly 20% faster even when Qiskit's full Scipy construction is compared against Qrusty building only its native CSR arrays.

Older version of the graph that used averages instead of the best times (despite what the axis says):

[Figure: earlier averaged version of the same benchmark plot]

The big bumps in the parallel lines at 6 and 12 qubits appear legitimate and reproducible, but I'm not at all sure why they exist - maybe a weirdness where we move from one allocator to another mid-algorithm? Update: the bump appears to be caused by the same medium-size-allocator buggy shenanigans first detected by @kevinhartman in #10827, since on macOS 14 setting the environment variable MallocMediumZone=0 removes the big bump, and the bump doesn't seem to appear in the same way on Linux.

Fix #8772 (Speedup sparse matrix construction in SparsePauliOp).

@jakelishman added labels: type: qa, performance, Changelog: New Feature, mod: quantum info, Rust — Dec 8, 2023
@jakelishman jakelishman added this to the 1.0.0 milestone Dec 8, 2023
@jakelishman jakelishman requested review from ikkoham and a team as code owners December 8, 2023 20:55
@qiskit-bot (Collaborator)

One or more of the following people are requested to review this:

impl<T: Send, S: ?Sized> ParallelSliceMutExt<T> for S where S: ParallelSliceMut<T> {}
@jakelishman (Member Author):

Why did I write <S: ?Sized> ... where S: ParallelSliceMut<T> and not put both either in the <> or in the where? Who knows.

@coveralls commented Dec 8, 2023

Pull Request Test Coverage Report for Build 8848600959

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 414 of 471 (87.9%) changed or added relevant lines in 3 files are covered.
  • 662 unchanged lines in 65 files lost coverage.
  • Overall coverage increased (+0.2%) to 89.444%

Changes Missing Coverage                    Covered Lines  Changed/Added Lines      %
crates/accelerate/src/rayon_ext.rs                     61                   82  74.39
crates/accelerate/src/sparse_pauli_op.rs              344                  380  90.53

Files with Coverage Reduction                            New Missed Lines      %
qiskit/circuit/library/standard_gates/r.py                              1  97.62
qiskit/transpiler/target.py                                             1  93.74
crates/qasm2/src/expr.rs                                                1  94.03
qiskit/primitives/backend_sampler_v2.py                                 1  99.21
qiskit/circuit/library/generalized_gates/permutation.py                 1  93.1
qiskit/primitives/containers/data_bin.py                                1  97.92
qiskit/circuit/library/standard_gates/rxx.py                            1  97.44
qiskit/circuit/library/standard_gates/u2.py                             1  96.67
qiskit/circuit/library/standard_gates/rzx.py                            1  97.44
qiskit/quantum_info/operators/operator.py                               1  94.94

Totals (Coverage Status):
  Change from base Build 8728350061: +0.2%
  Covered Lines: 60930
  Relevant Lines: 68121


A follow-up commit parallelises the previously serial-only cumulative sum of the indptr array of numbers of non-zero entries at the end. In practice, I didn't actually see any change in performance from this, but philosophically it feels like the right thing to do.
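For reference, the standard two-pass trick that makes such a cumulative sum parallelisable, written as a serial Python sketch (the real code uses rayon; the function name and chunking here are illustrative, and assume len(counts) >= num_chunks so no chunk is empty):

import numpy as np

def chunked_cumsum(counts, num_chunks):
    # Pass 1 (parallelisable): prefix-sum each chunk independently.
    chunks = np.array_split(np.asarray(counts), num_chunks)
    partials = [np.cumsum(chunk) for chunk in chunks]
    # Short serial pass: each chunk's total becomes the offset of the next chunk.
    offsets = np.cumsum([0] + [int(p[-1]) for p in partials[:-1]])
    # Pass 2 (parallelisable): add each chunk's offset back in.
    return np.concatenate([p + off for p, off in zip(partials, offsets)])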
@jakelishman (Member Author) commented Jan 13, 2024

Now rebased over #11133.

edit: ugh, Miri's failing after the version of crossbeam-epoch got updated in our lock. I'll have to look into that again.

@mtreinish mtreinish modified the milestones: 1.0.0, 1.1.0 Jan 23, 2024
@mtreinish (Member) left a comment

Overall this LGTM, the underlying algorithm here makes sense and I learned a fair amount about rayon internals and plumbing by reviewing this and looking up the details in the rayon docs. For future observers https://github.com/rayon-rs/rayon/blob/v1.10.0/src/iter/plumbing/README.md provides a useful overview.

In the back of my head while reviewing this I was wondering how much the unsafe usage around uninitialized memory actually saves us performance-wise. But the analysis, inline comments, and Miri tests give me confidence that the usage is OK, and it definitely is faster. I left a few inline questions and comments, but they're mostly minor musings.

The other thing: I was looking at the coverage report and there are a few coverage gaps reported, mainly the 64-bit sparse path and then the serial paths for everything. They're definitely covered by the Rust tests, but I'm wondering if you want to add some Python-side test coverage just so we're exercising the full path (and including it in the coverage report).

crates/accelerate/src/rayon_ext.rs (resolved)
Comment on lines 251 to 252
let x_like = pack_bits(self.x).unwrap();
let z_like = pack_bits(self.z).unwrap();
@mtreinish (Member):

Do you want to return a real error here, since pack_bits can error? Probably a PyOverflowError, since the error condition is exceeding 64 bits.

@jakelishman (Member Author):

unwrap failing here would have been an internal logic error, because ZXPaulis validates that x and z have the same shape on construction from Python space, then the start of ZXPaulisView::matrix_compress validates that the number of qubits will fit in a u64 after the packing, which is the error condition for this (and elevates that to a ValueError).

I switched these over to expect with an error message explaining where the internal-logic error would come from, but can switch them to full Python exceptions if you think that's better (though it should be impossible to trigger them, even from Rust space, using the public interface).

@mtreinish (Member):

Hmm, if you're saying that the input to pack_bits will always fit in 64 bits because of earlier assertions, then I would probably just remove the size check and Result return from the function.

@jakelishman (Member Author):

Done in 6b82a7b.

@jakelishman (Member Author):

Whoops sorry, I got out-of-sync with your comments. The "done in" is referring to my prior comment.

The pack_bits function could still fail if it were called from other valid places in Rust space using its regular public interface (and as we expand SparsePauliOp more in Rust space, we're more likely to use that), so I think it's right to return a Result. It's just that in this particular case, we already validated the inputs so we know it should always work.
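For context on the error condition under discussion: pack_bits packs per-qubit booleans into machine words, so the overflow case is more than 64 qubits per word. A one-dimensional Python sketch of the idea (the real function is Rust and operates on 2D arrays):

def pack_bits(bools):
    # Pack one boolean per qubit into a single u64-style integer mask.
    if len(bools) > 64:
        raise OverflowError("cannot pack more than 64 bits into a u64")
    mask = 0
    for i, bit in enumerate(bools):
        mask |= int(bool(bit)) << i
    return mask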

crates/accelerate/src/sparse_pauli_op.rs (outdated, resolved)
Comment on lines +717 to +718
// It's also important that `$int_ty` does not implement `Drop`, since otherwise it
// will be called on uninitialised memory (all primitive int types satisfy this).
@mtreinish (Member):

Do you think it's worth restricting the int type to match PrimInt (https://docs.rs/num-traits/latest/num_traits/int/trait.PrimInt.html), just so we don't have potential reuse of the macro with an unsafe type (I could see someone trying to use BigInt or something)? It's probably over-engineering, since this really won't be exposed publicly.

@jakelishman (Member Author):

Do you have a way in mind to do this assertion? The macro_rules is more like a template-match than a generic, so I'm not sure of an obvious way to put a "trait bound" on the input.

I guess I could possibly hack something together where we attempt to codegen something invalid if $int_ty or $uint_ty don't satisfy PrimInt, but I don't know if there's a better pattern.

@mtreinish (Member):

Not specifically, I guess; I was thinking something like adding a dummy generic function that had the trait bound and passing $int_ty to it, so we'd get a compile error. I'm not sure if there is a better pattern either, TBH.

But it's not important; it was more just my idle musing. Especially since I don't think we'll ever make the macro use public, it's fine with just the comment.

@jakelishman (Member Author)

> In the back of my head while reviewing this I was wondering how much the unsafe usage around uninitialized memory actually saves us performance-wise.

I wrote this PR quite some time ago now, so I've forgotten exactly what I did, but typically when I'm making these performance-improvement PRs, I make them in a series of teeny-tiny commits where I change one thing at a time (like changing from growing a Vec to pre-allocating empty space) and verify that each change improves performance as expected, then squash the chain down into a coherent set of commits. I imagine I did the same thing here, but I don't 100% remember for sure.

> The other thing: I was looking at the coverage report and there are a few coverage gaps reported, mainly the 64-bit sparse path and then the serial paths for everything. They're definitely covered by the Rust tests, but I'm wondering if you want to add some Python-side test coverage just so we're exercising the full path (and including it in the coverage report).

The problem with this is that, from Python space, there's no real way to force the 64-bit mode without generating a very large operator that would take an annoyingly long time to test. It does leave a bit of a testing gap for the Python-space components, though, so if you think it's still worth it I can add one.

@mtreinish (Member)

> The problem with this is that, from Python space, there's no real way to force the 64-bit mode without generating a very large operator that would take an annoyingly long time to test. It does leave a bit of a testing gap for the Python-space components, though, so if you think it's still worth it I can add one.

Yeah, that's what I figured for the 64-bit case; I'm fine skipping those, as realistically the only option would be a slow test we only ran nightly, which wouldn't show up in our coverage report. But for the serial paths we should at least verify that via Python, I think.

@jakelishman (Member Author) commented Apr 26, 2024

Whoops, totally missed that you'd mentioned the serial stuff too. Yeah, that was an oversight, thanks. Serial vs parallel tests added in edfe2c6.

@mtreinish (Member) left a comment

LGTM now, thanks for the quick updates.

@mtreinish mtreinish added this pull request to the merge queue Apr 26, 2024
Merged via the queue into Qiskit:main with commit d084aeb Apr 26, 2024
13 checks passed
@jakelishman jakelishman deleted the spo-to-matrix branch April 26, 2024 16:14