
Faster QubitDevice.marginal_prob #799

Merged (33 commits) on Sep 28, 2020
Conversation

@antalszava (Contributor) commented Sep 15, 2020

Context:
While looking at improvements for the QubitDevice class, it was found that certain internal functions for computing statistics might not scale well with larger numbers of qubits. One method in particular, marginal_prob, performed poorly as the number of wires increased.

Running a benchmark for 22 qubits and extracting probabilities showed the following results:

61333601 function calls (61261521 primitive calls) in 924.477 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     9570  618.700    0.065  618.700    0.065 {built-in method numpy.core._multiarray_umath.c_einsum}
19410/19380  205.138    0.011  205.138    0.011 {built-in method numpy.array}
       30   61.982    2.066  278.104    9.270 _qubit_device.py:425(marginal_prob)
19470/19410   10.603    0.001  629.722    0.032 {built-in method numpy.core._multiarray_umath.implement_array_function}
  4277871    5.188    0.000   11.272    0.000 {method 'update' of 'dict' objects}
  4277790    2.770    0.000   16.490    0.000 unweighted.py:69(_single_shortest_path_length)
     9570    2.309    0.000  623.943    0.065 default_qubit.py:485(_apply_unitary_einsum)

This and further results for smaller systems showed that considerable time is spent on the conversion to numpy.array:

[image: profiling breakdown]

A numpy.array is created in the following expression inside marginal_prob:

basis_states = np.array(list(itertools.product([0, 1], repeat=len(device_wires))))

Description of the Change:

  1. Adds a faster approach for computing basis_states by using the states_to_binary method.
  2. Further improves the original array creation by using np.fromiter and itertools.chain.
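The second improvement can be sketched as follows; `device_wires` is replaced here by an explicit `num_wires` argument to keep the comparison self-contained:

```python
import itertools

import numpy as np


def basis_states_original(num_wires):
    # Original approach: build a Python list of all 2**num_wires tuples,
    # then convert it to an array in one (slow) np.array call.
    return np.array(list(itertools.product([0, 1], repeat=num_wires)))


def basis_states_fromiter(num_wires):
    # Improved approach: stream the flattened product directly into a
    # 1-D array with np.fromiter and reshape, avoiding the intermediate
    # list of tuples entirely.
    flat = itertools.chain(*itertools.product((0, 1), repeat=num_wires))
    return np.fromiter(flat, dtype=int).reshape(-1, num_wires)


# Both produce the same (2**num_wires, num_wires) array of basis states.
```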

Benefits:

  1. A significant decrease in the contribution of marginal_prob and its sub-methods to the overall time taken for the benchmark on extracting probabilities (>205.138 seconds -> ~35 seconds):
61333691 function calls (61261611 primitive calls) in 627.123 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     9570  566.435    0.059  566.435    0.059 {built-in method numpy.core._multiarray_umath.c_einsum}
       30   19.549    0.652   37.160    1.239 _qubit_device.py:425(marginal_prob)
19470/19410   10.014    0.001  576.823    0.030 {built-in method numpy.core._multiarray_umath.implement_array_function}
       30    4.529    0.151    6.861    0.229 _qubit_device.py:314(states_to_binary)
  4277871    4.285    0.000    9.177    0.000 {method 'update' of 'dict' objects}
       30    2.332    0.078    2.332    0.078 {method 'astype' of 'numpy.ndarray' objects}
  4277790    2.222    0.000   13.308    0.000 unweighted.py:69(_single_shortest_path_length)
     9570    2.174    0.000  571.274    0.060 default_qubit.py:485(_apply_unitary_einsum)


  2. A faster default approach for when the states_to_binary method is not ideal to use (>30 qubits):
%%time
num = len(device_wires)
basis_states = np.fromiter(
    itertools.chain(*itertools.product((0, 1), repeat=num)), dtype=int
).reshape(-1, num)

CPU times: user 4.59 s, sys: 357 ms, total: 4.95 s
Wall time: 4.95 s

Original:

%%time
num = len(device_wires)
basis_states = np.array(list(itertools.product((0, 1), repeat=num)))
CPU times: user 6.74 s, sys: 304 ms, total: 7.05 s
Wall time: 7.05 s

Possible Drawbacks:
The states_to_binary method creates basis states faster (for larger systems at times over 25x faster) than the approach using itertools, at the expense of using slightly more memory with dtype=int64. Hence we constrain the dtype of the array to 32-bit unsigned integers (uint32).

Due to this constraint, an overflow occurs for 32 or more wires:

import numpy as np

arr_32 = np.zeros(1, dtype=np.uint32)

for i in range(40):
    arr_32[0] = 2 ** i
    if arr_32[0] != 2 ** i:
        print('Overflow.', i)
        break
Overflow. 32

Therefore this approach is used only for fewer wires.

For a smaller number of wires, speed is comparable to the improved original approach, hence we resort to that one for testing purposes.
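A minimal sketch of the states_to_binary idea (a vectorized shift-and-mask conversion; not necessarily PennyLane's exact implementation) looks like this:

```python
import numpy as np


def states_to_binary(samples, num_wires):
    # Vectorized base-10 -> binary conversion: right-shift each integer
    # past every bit position (most significant bit first) and mask off
    # the lowest bit. uint32 keeps memory down but, as noted above, caps
    # the number of wires below 32.
    shifts = np.arange(num_wires - 1, -1, -1, dtype=np.uint32)
    return ((samples[:, None] >> shifts) & 1).astype(np.uint32)


states_base_ten = np.arange(2 ** 3, dtype=np.uint32)
basis_states = states_to_binary(states_base_ten, 3)
# basis_states[5] is [1, 0, 1], the binary representation of 5
```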

Related GitHub Issues:
N/A

@antalszava antalszava changed the title Faster marginal_prob Faster QubitDevice.marginal_prob Sep 15, 2020
codecov bot commented Sep 15, 2020

Codecov Report

Merging #799 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #799   +/-   ##
=======================================
  Coverage   91.09%   91.10%           
=======================================
  Files         130      130           
  Lines        8701     8709    +8     
=======================================
+ Hits         7926     7934    +8     
  Misses        775      775           
Impacted Files Coverage Δ
pennylane/_qubit_device.py 98.82% <100.00%> (+0.05%) ⬆️


@josh146 (Member) left a comment:
This is a huge improvement @antalszava! Great work 🙂

Comment on lines 499 to 500
states_base_ten = np.arange(2 ** num_wires, dtype=np.int32)
basis_states = self.states_to_binary(states_base_ten, num_wires)
Member:

Can this be made less memory intensive by having states_base_ten be a generator (so that you don't have to store all basis states in memory at once)?

I'm not sure this idea would work though, since then you might not be able to perform the bitwise shift << 🙂

@antalszava (Contributor, Author) commented Sep 25, 2020:
Have looked into this. As we'd like to eventually use basis_states to compute perm and permute the probabilities, we'd need to have all the basis states in a single array at one time.

One thing I found was that we can use a 32-bit integer for casting in states_to_binary, which helped with memory to some extent (from 64 bits down to 32).
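As a rough illustration of that saving (a hypothetical 20-wire system, comparing the footprint of the full basis-state array under the two dtypes):

```python
import numpy as np

num_wires = 20

# Memory footprint of the (2**num_wires, num_wires) basis-state array;
# uint32 halves the footprint relative to int64.
n_bytes_64 = np.zeros((2 ** num_wires, num_wires), dtype=np.int64).nbytes
n_bytes_32 = np.zeros((2 ** num_wires, num_wires), dtype=np.uint32).nbytes

print(n_bytes_64 // 2 ** 20, "MiB vs", n_bytes_32 // 2 ** 20, "MiB")
# prints: 160 MiB vs 80 MiB
```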

# therefore this approach is used only for fewer wires.
# For smaller number of wires speed is comparable to the next
# approach, hence we resort to that one for testing purposes
states_base_ten = np.arange(2 ** num_wires, dtype=np.int32)
Contributor:
If you use np.uint32, do we make it to 32 qubits?

@antalszava (Contributor, Author):
Thanks! Yeah, could increase it so that we can have 31 qubits (overflow at 32)

josh146 and others added 23 commits September 15, 2020 22:05
Co-authored-by: Josh Izaac <josh146@gmail.com>
@antalszava (Contributor, Author) commented:

Thanks @josh146 @trbromley for the comments! Addressed them.

(Also found a cool memory profiler, pip install -U memory_profiler. Can be used akin to timeit: %memit 😄 . In a Jupyter notebook after loading with %load_ext memory_profiler, each cell can be run with the %%memit magic to see the memory consumption of the statement.)
