
Cache tape executions #817

Merged: 32 commits into master from add_caching_to_tape on Sep 25, 2020

Conversation

@trbromley (Contributor) commented Sep 21, 2020

This PR adds the ability for QNodes and quantum tapes to cache previous evaluations:

import pennylane as qml
from pennylane.beta.tapes import qnode
from pennylane.beta.queuing import expval

dev = qml.device("default.qubit", wires=2)

@qnode(dev, caching=10)  # cache up to 10 evaluations
def qfunc(x):
    qml.RX(x, wires=0)
    qml.RX(0.3, wires=1)
    qml.CNOT(wires=[0, 1])
    return expval(qml.PauliZ(1))

for i in range(100):
    qfunc(0.1)
# Even with 100 calls, the internal execute_device() method
# of the quantum tape has only been called once

This is a replacement of #776 to work with the new QTape core.
It also replaces #795 which was a branch from quantum-tape 😅

Things to note

  • Caching works with both the underlying tape and the QNode.
  • Caching is off by default, so it should not break anything.
  • Caching still takes place if the user is returning samples or if operating in non-analytic mode. It isn't hard to prevent caching in these cases (and this is done in [WIP] Cache previous calls to QNode evaluate #776), but I ended up on the fence about whether to. Turning off caching implicitly seems wrong - if a user turns on caching, they should get what they expect (cached results). Is there a case people can think of as a counterargument?
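To make the sampling question concrete, here is a toy memoized sampler (not the PR's implementation; all names are hypothetical): with caching on, repeated calls with the same parameters return the exact same samples, with no fresh randomness.

```python
import random

# Toy stand-in for a cached non-analytic device execution.
_sample_cache = {}

def sample(theta, shots=5):
    key = (theta, shots)
    if key not in _sample_cache:
        # stand-in for a real device execution returning `shots` samples
        _sample_cache[key] = [random.choice([-1, 1]) for _ in range(shots)]
    return _sample_cache[key]

first = sample(0.1)
second = sample(0.1)
assert first is second  # cached: the identical sample list, no new randomness
```

This is the behaviour the PR deliberately keeps: a user who opts into caching gets cached results, even for stochastic return types.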

Benchmarking

Let's take a look to see if caching is working. We'll consider a 6-qubit circuit composed of StronglyEntanglingLayers of depth 6 (similar to #776). We will benchmark both the quantum tape and the QNode in two settings: evaluation (with 500 repetitions) and gradient evaluation (with 20 repetitions), where each repetition uses the same input parameters.

Tape

Without caching:
Evaluate - 5.36 seconds
Gradient - 24.67 seconds

With caching:
Evaluate - 0.18 seconds
Gradient - 1.43 seconds

QNode

Without caching:
Evaluate - 9.79 seconds
Gradient - 69.20 seconds

With caching:
Evaluate - 3.92 seconds
Gradient - 3.98 seconds

Summary

Caching is helping us! Also, we can see that using the QNode has an additional overhead, e.g. from methods such as construct().

@trbromley trbromley self-assigned this Sep 21, 2020
@trbromley trbromley added the review-ready 👌 PRs which are ready for review by someone from the core team. label Sep 21, 2020
@codecov bot commented Sep 21, 2020

Codecov Report

Merging #817 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #817      +/-   ##
==========================================
+ Coverage   91.02%   91.05%   +0.02%     
==========================================
  Files         129      129              
  Lines        8649     8678      +29     
==========================================
+ Hits         7873     7902      +29     
  Misses        776      776              
Impacted Files Coverage Δ
pennylane/beta/tapes/qnode.py 98.75% <100.00%> (+0.10%) ⬆️
pennylane/beta/tapes/tape.py 98.82% <100.00%> (+0.05%) ⬆️

Last update d237ab4...9765a13.

@trbromley (Contributor, Author)

Local code coverage is ok:

[screenshot: local coverage report] (line 194 is for something else)

[screenshot: local coverage report] (the contribution in this PR is from line 438 onwards)

"""float: number of device executions to store in a cache to speed up subsequent
executions. If set to zero, no caching occurs."""

if caching is not None:
Contributor:

Could the default for the caching argument be 0? Also, it seems that if the user specified 0, then this condition would evaluate to True, correct? 🤔

Contributor (Author):

Good idea! This is now done: f86adc7
It also makes the if statement correct - you are right that a user specifying 0 would have resulted in a warning!

Comment on lines 1016 to 1019
if self._caching and hashed_params not in self._cache_execute:
    self._cache_execute[hashed_params] = res
    if len(self._cache_execute) > self._caching:
        self._cache_execute.popitem(last=False)
Contributor:

It would be good to document this behaviour: users could also think that once the caching limit is reached, then nothing happens to the cache. However, this seems to be not the case as each time there's a new execution, the very first cached result is dropped and we're adding the latest result.

Overall also might be worth considering which of the two options is more beneficial.

Contributor (Author):

Good point! I've added it to the caching part of the docstring: 5bd1d37
I think the current behaviour makes the most sense, else you might end up with a cache that is very stale. Moreover, we always want to be caching the last execution since that is the most likely one to be repeated.
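To make the eviction behaviour concrete, here is a minimal stand-in (not the PR's code; the FIFOCache name is hypothetical) for a bounded cache that drops the oldest entry first, mirroring the OrderedDict.popitem(last=False) call in the snippet above:

```python
from collections import OrderedDict

class FIFOCache:
    """Bounded, insertion-ordered cache: oldest entry is evicted first."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def add(self, key, value):
        if key not in self._store:
            self._store[key] = value
            if len(self._store) > self.maxsize:
                # drop the least recently *inserted* entry
                self._store.popitem(last=False)

    def get(self, key):
        return self._store.get(key)

cache = FIFOCache(maxsize=2)
cache.add("h1", 0.99)
cache.add("h2", 0.42)
cache.add("h3", 0.07)        # evicts "h1", the oldest entry
assert cache.get("h1") is None
assert cache.get("h3") == 0.07
```

The latest execution is always retained, which matches the reasoning that the most recent result is the most likely to be repeated.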

Comment on lines 152 to 154
warnings.warn(
    "Caching mode activated. The quantum circuit being executed by the QNode must have "
    "a fixed structure.",
Contributor:

This way a user would receive this warning even if the quantum circuit was fine, right? Could an error be raised if a user creates a mutable QNode? (E.g. hashing the circuit operations: name of the operation and wire they act on and raising an error if the latest hash differs from the stored hash for the circuit.)

Contributor (Author):

Yes, this is now done: instead of hashing the tape arguments, we use the hash of the circuit graph.

There are performance hits for both approaches: we need to use set_parameters() for the former and we need to serialize for the latter. I hope to summarize more the relative performance in a follow up comment.
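As a sketch of the idea (assumed serialization format; this is not PennyLane's CircuitGraph.serialize), a cache key can be derived by hashing each operation's name, wires, and parameters, so any mutation of the circuit changes the key:

```python
import hashlib

def circuit_hash(ops):
    """Hash a circuit's structure *and* parameters: serialize each
    (name, wires, params) triple, then hash the resulting string."""
    serialization = "".join(
        f"{name}|{tuple(wires)}|{tuple(params)};" for name, wires, params in ops
    )
    return hashlib.sha256(serialization.encode()).hexdigest()

fixed = [("RX", [0], [0.1]), ("CNOT", [0, 1], [])]
mutated = [("RY", [0], [0.1]), ("CNOT", [0, 1], [])]  # different gate, same params

# identical circuits collide on the same key; a mutated circuit does not
assert circuit_hash(fixed) == circuit_hash([("RX", [0], [0.1]), ("CNOT", [0, 1], [])])
assert circuit_hash(fixed) != circuit_hash(mutated)
```

Because the key covers structure as well as parameters, a mutated QNode simply produces a cache miss instead of silently returning a stale result.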

Comment on lines 399 to 406
if self._caching:
    self.qtape._cache_execute = self._cache_execute

# execute the tape
res = self.qtape.execute(device=self.device)

if self._caching:
    self._cache_execute = self.qtape._cache_execute
Contributor:

Could only the qtape have the _cache_execute attribute? Since a QNode can access the attributes of a qtape

Contributor (Author):

Ideally, but unfortunately the self.construct() method wipes the previous QTape and starts with a fresh one, so this line allows the cache to persist across multiple QTapes.
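A minimal sketch (all names hypothetical, not PennyLane code) of why this hand-off is needed: if each call builds a fresh tape, a cache stored only on the tape would be lost, so the QNode-level object injects the cache before execution and collects it afterwards:

```python
class Tape:
    """Stand-in for a quantum tape that is rebuilt on every call."""

    def __init__(self):
        self._cache_execute = {}

    def execute(self, key, fn):
        if key not in self._cache_execute:
            self._cache_execute[key] = fn()
        return self._cache_execute[key]

class Node:
    """Stand-in for the QNode, which owns the long-lived cache."""

    def __init__(self):
        self._cache_execute = {}

    def __call__(self, key, fn):
        tape = Tape()                                 # construct() wipes the old tape
        tape._cache_execute = self._cache_execute     # hand the cache down
        res = tape.execute(key, fn)
        self._cache_execute = tape._cache_execute     # persist it back up
        return res

calls = []
node = Node()
node("k", lambda: calls.append(1) or 1.0)
node("k", lambda: calls.append(1) or 1.0)
assert len(calls) == 1  # second call is served from the persisted cache
```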

Member:

Ah, this answers my question above! Probably a good idea to add a line comment, since Antal and I both independently had the same question

Member:

If caching is on, can we avoid redundant tape constructions?

Contributor (Author):

I just had another quick try of: when caching is on, the tape is only constructed the first time.
This causes the tests related to mutability and classical processing to fail. For example, the cache is still used when the parameters are the same but the circuit differs. Although, I'm not sure if the problem is even deeper, since isn't construction the place where the input arguments are fixed to the gate ops?

"""Tests for the _hash_iterable function."""

iterables = [
[1, 1.4, -1],
Contributor:

Could be worth adding unsupported cases:

The iterable must be flat and can contain only numbers and NumPy arrays.

When this condition is not satisfied.

Contributor (Author):

🤔 Ok, I can think about this if we decide to keep this test. Currently the test has been removed because we are using the hash of the circuit graph.

@antalszava (Contributor) left a comment:

This is overall looking great @trbromley! 💯 Curious to hear your thoughts on a couple of things, e.g. _get_all_parameters and the FIFO nature of _cache_execute, so leaving a comment review for now.

@josh146 (Member) left a comment:

Thanks Tom!

I have to admit, I'm somewhat shocked at the overhead between tape and QNode gradients. I knew it was there, but I didn't know it was that large. I wonder what we can do to mitigate this.

A couple of questions:

  • What interface did you use in the benchmarks?

  • How does the non-cached time compare to the existing QNode time?

Comment on lines 101 to 103
caching (int): number of device executions to store in a cache to speed up subsequent
executions. Caching does not take place by default. In caching mode, the quantum circuit
being executed must have a constant structure and only its parameters can be varied.
Member:

In addition (not sure if it's implied by 'only its parameters can be varied', but it wasn't clear to me): there can't be any classical processing within the QNode.

E.g., qml.RX(2*x, wires=0) is not allowed.

This is a side effect of the tape being a part of the user's quantum function --- we can't extract the tape independently of the classical processing :(

Contributor (Author):

🤔 @josh146 it's not clear to me why classical processing is not allowed?
I added some tests and it seems compatible: a640e26
(note, this is using a new approach that came out of code review, where we are caching hashes of the circuit, which can pick up if the circuit has mutated)

Member:

I have to admit, it's not something I've tested, just a 'hunch' that it wouldn't work. My thinking was that:

When the QNode is evaluated, the user's qfunc() is called inside a quantum tape:

with QuantumTape() as tape:
    qfunc(*args, **kwargs)

This will do two things:

  1. classically process the args and kwargs as per whatever classical processing the user has within the QNode, e.g., tf.sin(a)

  2. construct the tape, with incident gate arguments potentially different to args and kwargs due to post-processing.

Say the user passes a qfunc of the form:

def circuit(x):
    qml.RX(tf.sin(x), wires=0)
    return qml.expval(qml.PauliZ(0))

After the first construction (the user has passed x=0.6), we 'cache' or store the tape. On the second evaluation (now the user passes x=0.1), we want to execute the previous tape, but with the new gate argument; however, we can't do tape.set_parameters([0.1]), because the user actually wants tape.set_parameters([tf.sin(0.1)]) --- how do we determine the classical post-processing step?

So having immutable tapes is not enough as far as I can see. Or maybe I am missing something --- would be happy if that's the case and this works!
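For what it's worth, a toy sketch (not PennyLane code; all names hypothetical) of why hashing the constructed circuit sidesteps this: the cache key is built from the post-processed gate argument (e.g. sin(x)) rather than the raw QNode input x, so re-constructing the tape on every call and then hashing keeps the cache consistent even with classical processing.

```python
import math

executions = []   # records every "device execution"
_cache = {}

def evaluate(x):
    # classical processing inside the qfunc: the gate sees sin(x), not x
    gate_arg = math.sin(x)
    # the cache key is derived from the *constructed* circuit:
    # gate name, wires, and the post-processed argument
    key = ("RX", (0,), round(gate_arg, 12))
    if key not in _cache:
        executions.append(x)
        _cache[key] = math.cos(gate_arg)  # stand-in for <Z> after RX(gate_arg)
    return _cache[key]

evaluate(0.6)
evaluate(0.6)   # same post-processed argument -> cache hit
evaluate(0.1)   # different argument -> fresh execution
assert len(executions) == 2
```

Because the tape is rebuilt before hashing, there is no need to invert the classical post-processing with set_parameters().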

@@ -890,6 +906,60 @@ def execute(self, device, params=None):

return self._execute(params, device=device)

def _get_all_parameters(self, params):
Member:

Isn't this the same as doing

>>> tape.set_parameters([0.1, 0.2])
>>> tape.get_parameters(trainable_only=False)
[0.1, 0.543, 0.2]

Contributor (Author):

Yes! 😁

Well, _get_all_parameters allows you to get all the params without modifying the trainable params themselves. Whereas we need to do

saved_params = tape.get_parameters()
tape.set_parameters([0.1, 0.2])
all_parameters = tape.get_parameters(trainable_only=False)
# Do stuff
tape.set_parameters(saved_params)

However, this is already what is happening in execute_device, so you're right maybe it's not worth making a new method.

I've now removed the method and used the suggested approach db32341.

@thisac (Contributor) left a comment:

I think this looks great @trbromley. A really nice addition (that I actually would have needed a few times already). 😄

@trbromley (Contributor, Author) left a comment:

Thanks @antalszava @josh146 for the comments! Your comments resulted in one big change:

  • Instead of hashing the quantum tape parameters, we can use the qtape.graph.hash which describes both the parameters and the gates and is calculated from a serialization of the circuit.
  • This allows us to remove _get_all_parameters() and _hash_iterable() (and corresponding tests)
  • It allows us to not have the implicit assumption that the circuit structure is unchanged, since the generated hash depends on the circuit structure. Caching is hence compatible with mutable circuits.
  • This change has performance implications that I will summarize below.

Performance implications

We compare 4 options:

(A) The way the PR is now, using the hash of the circuit graph but being explicitly safe regarding mutability.

(B) When you originally reviewed, this PR cached the internal variables of the tape. This requires the assumption that the QNode is immutable.

(C) Using the new quantum-tape core, but without caching

(D) Using the current, non-beta, core

These are the results:
[screenshots: benchmark plots for evaluation and gradient times]

Things to note:

  • C is generally better than D except for two cases: (i) TF interface evaluation, (ii) autograd interface gradient. The reason for (ii) is that the set_parameters() method is quite an overhead in autograd. The reason for (i) I'm not entirely sure of, but the broadcast() function is a massive overhead in the new core (57% new vs 20% old).

  • (B) is quicker than (A). This is because (B) does not need to serialize the circuit graph and call set_parameters() - it is just hashing the parameters and assuming the circuit graph is the same.

Choosing (A) or (B)

What do people recommend? Slower caching time with more safety, or quicker caching time but relying on users to keep their circuits immutable? I'd probably edge toward (A) (slower but safer) and hope to speed up the slow bits in future (e.g. could we do quick serialization?)
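To make options (A) and (B) comparable, here is a rough, self-contained sketch (assumed serialization format; this is not CircuitGraph.serialize) of the two key strategies. Timings vary by machine, so no numbers are claimed:

```python
import hashlib
import timeit

params = [0.1 * i for i in range(36)]                   # e.g. 6 qubits x 6 layers
ops = [("Rot", (i % 6,), (p,)) for i, p in enumerate(params)]

def key_b(params):
    # option (B): hash only the parameters; assumes the circuit is immutable
    return hash(tuple(params))

def key_a(ops):
    # option (A): serialize the whole circuit, then hash; safe under mutation
    s = "".join(f"{n}|{w}|{p};" for n, w, p in ops)
    return hashlib.sha256(s.encode()).hexdigest()

t_b = timeit.timeit(lambda: key_b(params), number=10_000)
t_a = timeit.timeit(lambda: key_a(ops), number=10_000)
print(f"(B) params-only: {t_b:.4f}s   (A) full serialization: {t_a:.4f}s")
```

The extra work in (A) is building and hashing the serialization string each call, which is the cost being weighed against mutation safety.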

Co-authored-by: Theodor <theodor@xanadu.ai>
@trbromley (Contributor, Author):

Thanks @thisac for the review!

@trbromley (Contributor, Author):

Also, updates for tape benchmark:

With current caching (A):
Evaluate - 0.59 seconds
Gradient - 3.09 seconds

With previous caching (B):
Evaluate - 0.18 seconds
Gradient - 1.43 seconds

Without caching (C):
Evaluate - 5.36 seconds
Gradient - 24.67 seconds

Again, the new approach (A) is slower due to having to hash the circuit graph.

@antalszava (Contributor) left a comment:

Hi @trbromley, looks good to me! Saw the options, though at this point I'm also happy to approve whichever would be eventually accepted.

Couple of thoughts/questions:

  • What happens if a user inputs a mutable qfunc in the (B) case? Could one end up with invalid results?
  • CircuitGraph.serialize doesn't seem to use "".join(serialization_string) for string concatenation, which I've come across as one of the fastest options. Yet, it seemed to me that even when using "".join, serialization is not significantly faster. Perhaps hash with a long string argument takes a long time. If going with (A), we could perhaps consider a shorter string to be hashed, or explore further as you suggested.
  • Wondering why it could be that (A) is much slower than (B) for the gradient when using autograd? 🤔

As we are doing some hashing even in case (B), I might lean towards going with (A) and tracking down why hashing takes considerably more time.

@josh146 (Member) left a comment:

Thanks Tom, this looks great! I agree, I prefer the slower but safer approach where circuit mutability is taken into account by hashing.

I suppose I have one quite general question, that doesn't affect merging this PR:

  • Under what general use case do you see improvement? Is it often that general use, e.g., computing gradients, results in repeated evaluation with the same parameters?

@@ -114,9 +120,11 @@ class QNode:
>>> qnode = QNode(circuit, dev)
"""

# pylint:disable=too-many-instance-attributes
# pylint:disable=too-many-instance-attributes,too-many-arguments
Member:

😆

Comment on lines 501 to 509
@property
def caching(self):
    """float: number of device executions to store in a cache to speed up subsequent
    executions. If set to zero, no caching occurs."""
    return self._caching

@caching.setter
def caching(self, value):
    self._caching = value
Member:

Are these two properties needed? It doesn't seem that they are used anywhere. Is this to allow the user to modify the cache size dynamically on an existing QNode?

Contributor (Author):

Good point! I decided to remove the setter but keep the property, just in case users want to see the current caching value.

With the setter there, I was half thinking to let users dynamically set the cache. This is fine if they set it to a bigger number than the current size, but if setting smaller we need to drop multiple parts of the existing cache. I thought this was a bit of an overcomplication for now, so just got rid of the setter.

@trbromley (Contributor, Author) left a comment:

Thanks @josh146 !!

@trbromley trbromley merged commit 27dc036 into master Sep 25, 2020
@trbromley trbromley deleted the add_caching_to_tape branch September 25, 2020 15:55
@trbromley (Contributor, Author):

> Under what general use case do you see improvement? Is it often that general use, e.g., computing gradients, results in repeated evaluation with the same parameters?

I forgot to answer this question!

One big saving can be when using the autograd interface to calculate the jacobian, since, as we discussed, that has some extra repetition (the saving was pretty good there, but I'd have to dig up the numbers). Another use case might be more on the ML testing side, e.g. after the model has been trained and most of the weights are fixed. Depending on the data, we might often be repeating previously evaluated runs.
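As a sketch of the gradient use case (toy code, not PennyLane's autograd interface): a forward-difference gradient re-evaluates the circuit at the same unshifted point once per parameter, so a cache collapses those repeats, and a repeated gradient call is served entirely from cache.

```python
calls = 0
_cache = {}

def circuit(params):
    """Stand-in for a cached QNode evaluation."""
    global calls
    key = tuple(params)
    if key not in _cache:
        calls += 1                                   # a real device execution
        _cache[key] = sum(p**2 for p in params)      # toy "expectation value"
    return _cache[key]

def grad(params, h=1e-7):
    """Forward-difference gradient: evaluates circuit(params) per parameter."""
    out = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += h
        out.append((circuit(shifted) - circuit(params)) / h)
    return out

g = grad([0.1, 0.2, 0.3])
# 3 shifted points + 1 base point = 4 executions; the base point is
# requested 3 times but only the first request hits the "device"
assert calls == 4
grad([0.1, 0.2, 0.3])   # a repeated gradient is fully cached
assert calls == 4
```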
