Cache tape executions #817
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #817      +/-   ##
==========================================
+ Coverage   91.02%   91.05%   +0.02%
==========================================
  Files         129      129
  Lines        8649     8678      +29
==========================================
+ Hits         7873     7902      +29
  Misses        776      776
```
Continue to review full report at Codecov.
pennylane/beta/tapes/qnode.py (Outdated)
```python
"""float: number of device executions to store in a cache to speed up subsequent
executions. If set to zero, no caching occurs."""

if caching is not None:
```
Could the default for the `caching` argument be `0`? Also, it seems that if the user specified `0`, then this condition would evaluate to `True`, correct? 🤔
Good idea! This is now done: f86adc7
It also makes the `if` statement correct: you are right that a user specifying `0` would have resulted in a warning!
pennylane/beta/tapes/tape.py (Outdated)
```python
if self._caching and hashed_params not in self._cache_execute:
    self._cache_execute[hashed_params] = res
    if len(self._cache_execute) > self._caching:
        self._cache_execute.popitem(last=False)
```
It would be good to document this behaviour: users might think that once the `caching` limit is reached, nothing further happens to the cache. However, this is not the case: on each new execution, the oldest cached result is dropped and the latest result is added. It might also be worth considering which of the two options is more beneficial.
Good point! I've added it to the `caching` part of the docstring: 5bd1d37
I think the current behaviour makes the most sense, else you might end up with a cache that is very stale. Moreover, we always want to be caching the last execution since that is the most likely one to be repeated.
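The eviction behaviour being discussed can be sketched with a plain `OrderedDict` (a minimal illustration of the FIFO policy, not PennyLane's actual code):

```python
from collections import OrderedDict

# Minimal sketch of the FIFO eviction discussed above: once the cache is
# full, the oldest entry is dropped to make room for the newest result.
cache = OrderedDict()
limit = 2

for key, result in [("a", 0.1), ("b", 0.2), ("c", 0.3)]:
    if key not in cache:
        cache[key] = result
        if len(cache) > limit:
            cache.popitem(last=False)  # drop the oldest (first-inserted) entry

print(list(cache))  # the two most recent keys survive
```

After inserting three entries with a limit of two, only the two most recent keys remain, which matches the "stale entries are dropped first" behaviour described above.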
pennylane/beta/tapes/qnode.py (Outdated)
```python
warnings.warn(
    "Caching mode activated. The quantum circuit being executed by the QNode must have "
    "a fixed structure.",
```
This way a user would receive this warning even if the quantum circuit was fine, right? Could an error be raised if a user creates a mutable QNode? (E.g., hash the circuit operations, i.e. the name of each operation and the wires it acts on, and raise an error if the latest hash differs from the stored hash for the circuit.)
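The suggested mutation check could look something like this minimal sketch (all names here are hypothetical; `ops` is a list of pairs standing in for the tape's operations):

```python
# Illustrative sketch of the mutation check suggested above: hash the circuit
# structure (operation names and wires) and compare against a stored hash.
def structure_hash(ops):
    """Hash a list of (name, wires) pairs describing the circuit structure."""
    return hash(tuple((name, tuple(wires)) for name, wires in ops))

stored = structure_hash([("RX", [0]), ("CNOT", [0, 1])])
same = structure_hash([("RX", [0]), ("CNOT", [0, 1])])
mutated = structure_hash([("RY", [0]), ("CNOT", [0, 1])])

assert stored == same       # identical structure: hashes agree
assert stored != mutated    # a mutated circuit would trigger an error
```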
Yes, this is now done: instead of hashing the tape arguments, we use the hash of the circuit graph.
There are performance hits for both approaches: we need to use `set_parameters()` for the former, and we need to serialize for the latter. I hope to summarize the relative performance in a follow-up comment.
pennylane/beta/tapes/qnode.py (Outdated)
```diff
         if self._caching:
             self.qtape._cache_execute = self._cache_execute

         # execute the tape
-        return self.qtape.execute(device=self.device)
+        res = self.qtape.execute(device=self.device)

         if self._caching:
             self._cache_execute = self.qtape._cache_execute
```
Could only the qtape have the `_cache_execute` attribute? Since a QNode can access the attributes of a `qtape`.
Ideally, but unfortunately the `self.construct()` method wipes the previous QTape and starts with a fresh one, so this line allows the cache to persist across multiple QTapes.
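The hand-off described here can be sketched with toy classes (purely illustrative; `Tape` and `Node` are stand-ins, not the real `QuantumTape`/`QNode`):

```python
# Sketch of the cache hand-off pattern described above: the node keeps the
# cache and re-attaches it to each freshly constructed tape, so cached
# results survive tape reconstruction.
class Tape:
    def __init__(self):
        self._cache_execute = {}  # fresh tape starts with an empty cache

class Node:
    def __init__(self):
        self._cache_execute = {}
        self.qtape = None

    def construct(self):
        self.qtape = Tape()  # old tape (and its cache attribute) is discarded

    def __call__(self):
        self.construct()
        # hand the node's cache to the new tape before executing
        self.qtape._cache_execute = self._cache_execute
        self.qtape._cache_execute["run"] = 1  # tape stores an execution result
        # persist the (possibly updated) cache back on the node
        self._cache_execute = self.qtape._cache_execute

node = Node()
node()
node()
assert node._cache_execute == {"run": 1}  # cache survives reconstruction
```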
Ah, this answers my question above! Probably a good idea to add a line comment, since Antal and I both independently had the same question
If caching is on, can we avoid redundant tape constructions?
I just had another quick try: when caching is on, the tape is only constructed the first time. This causes the tests related to mutability and classical processing to fail. For example, the cache is still used when the parameters are the same but the circuit differs. I'm also not sure if the problem is even deeper, since isn't construction the place where the input arguments are fixed to the gate ops?
tests/test_utils.py (Outdated)
```python
"""Tests for the _hash_iterable function."""

iterables = [
    [1, 1.4, -1],
```
Could be worth adding unsupported cases: the iterable must be flat and can contain only numbers and NumPy arrays, so it would be good to test what happens when this condition is not satisfied.
🤔 Ok, I can think about this if we decide to keep this test. Currently the test has been removed because we are using the hash of the circuit graph.
This is overall looking great @trbromley! 💯 Curious to hear your thoughts on a couple of things, e.g. `_get_all_parameters` and the FIFO nature of `_cache_execute`, so leaving a comment review for now.
Thanks Tom!
I have to admit, I'm somewhat shocked at the overhead between tape and QNode gradients. I knew it was there, but I didn't know it was that large. I wonder what we can do to mitigate this.
A couple of questions:
- What interface did you use in the benchmarks?
- How does the non-cached time compare to the existing QNode time?
pennylane/beta/tapes/qnode.py (Outdated)
```python
caching (int): number of device executions to store in a cache to speed up subsequent
    executions. Caching does not take place by default. In caching mode, the quantum circuit
    being executed must have a constant structure and only its parameters can be varied.
```
In addition (not sure if it's implied by 'only its parameters can be varied', but it wasn't clear to me): there can't be any classical processing within the QNode. E.g., `qml.RX(2*x, wires=0)` is not allowed.
This is a side effect of the tape being part of the user's quantum function: we can't extract the tape independently of the classical processing :(
I have to admit, it's not something I've tested, just a 'hunch' that it wouldn't work. My thinking was that, when the QNode is evaluated, the user's `qfunc()` is called inside a quantum tape:

```python
with QuantumTape() as tape:
    qfunc(*args, **kwargs)
```

This will do two things:
- classically process the `args` and `kwargs` as per whatever classical processing the user has within the QNode, e.g., `tf.sin(a)`
- construct the tape, with incident gate arguments potentially different to `args` and `kwargs` due to post-processing.

Say the user passes a qfunc of the form:

```python
def circuit(x):
    qml.RX(tf.sin(x), wires=0)
    return qml.expval(qml.PauliZ(0))
```

After the first construction (the user has passed x=0.6), we 'cache' or store the tape. On the second evaluation (now the user passes x=0.1), we want to execute the previous tape but with the new gate argument; however, we can't do `tape.set_parameters([0.1])`, because the user actually wants `tape.set_parameters([tf.sin(0.1)])`. How do we determine the classical post-processing step?
So having immutable tapes is not enough as far as I can see. Or maybe I am missing something; I'd be happy if that's the case and this works!
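The core difficulty can be illustrated with a plain-Python analogue (illustrative only; `circuit_gate_arg` is a stand-in for the classical step inside the qfunc, using `math.sin` in place of `tf.sin`):

```python
import math

# Toy illustration of the argument above: the gate argument the tape stores
# is the *processed* value, so replaying a stored tape with a new raw input
# requires re-running the user's classical step, which we cannot recover.
def circuit_gate_arg(x):
    return math.sin(x)  # classical processing inside the qfunc

first_call = circuit_gate_arg(0.6)   # tape records sin(0.6), not 0.6
second_call = circuit_gate_arg(0.1)  # naive replay would wrongly set 0.1

assert abs(first_call - math.sin(0.6)) < 1e-12
assert second_call != 0.1  # the raw argument is not the gate parameter
```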
pennylane/beta/tapes/tape.py (Outdated)
```diff
@@ -890,6 +906,60 @@ def execute(self, device, params=None):

         return self._execute(params, device=device)

+    def _get_all_parameters(self, params):
```
Isn't this the same as doing

```python
>>> tape.set_parameters([0.1, 0.2])
>>> tape.get_parameters(trainable_only=False)
[0.1, 0.543, 0.2]
```
Yes! 😁
Well, `_get_all_parameters` allows you to get all the params without modifying the trainable params themselves. Otherwise, we need to do:

```python
saved_params = tape.get_parameters()
tape.set_parameters([0.1, 0.2])
all_parameters = tape.get_parameters(trainable_only=False)
# Do stuff
tape.set_parameters(saved_params)
```

However, this is already what is happening in `execute_device`, so you're right, maybe it's not worth making a new method.
I've now removed the method and used the suggested approach: db32341.
I think this looks great @trbromley. A really nice addition (that I actually would have needed a few times already). 😄
Thanks @antalszava @josh146 for the comments! Your comments resulted in one big change:
- Instead of hashing the quantum tape parameters, we can use the `qtape.graph.hash`, which describes both the parameters and the gates and is calculated from a serialization of the circuit.
- This allows us to remove `_get_all_parameters()` and `_hash_iterable()` (and corresponding tests).
- It allows us to drop the implicit assumption that the circuit structure is unchanged, since the generated hash depends on the circuit structure. Caching is hence compatible with mutable circuits.
- This change has performance implications that I will summarize below.
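The graph-hash idea can be sketched as follows (a hedged illustration; `serialize` and `graph_hash` are hypothetical stand-ins for the real `CircuitGraph.serialize` and `qtape.graph.hash`):

```python
# Sketch of the graph-hash cache key described above: the key is a hash of a
# serialization covering both structure and parameters, so changing either
# one produces a different key. Names and formats here are illustrative only.
def serialize(ops):
    """ops: list of (name, wires, params) triples standing in for tape operations."""
    parts = []
    for name, wires, params in ops:
        parts.append(f"{name}|{wires}|{params};")
    return "".join(parts)

def graph_hash(ops):
    return hash(serialize(ops))

base = [("RX", (0,), (0.5,))]
assert graph_hash(base) == graph_hash([("RX", (0,), (0.5,))])  # cache hit
assert graph_hash(base) != graph_hash([("RX", (0,), (0.7,))])  # new params
assert graph_hash(base) != graph_hash([("RY", (0,), (0.5,))])  # new structure
```

Because the structure is part of the key, a mutated circuit simply produces a cache miss rather than a stale hit, which is why this approach is safe for mutable circuits.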
Performance implications
We compare 4 options:
(A) The way the PR is now, using the hash of the circuit graph but being explicitly safe regarding mutability.
(B) When you originally reviewed, this PR cached the internal variables of the tape. This requires the assumption that the QNode is immutable.
(C) Using the new `quantum-tape` core, but without caching
(D) Using the current, non-beta, core
Things to note:
- C is generally better than D except for two cases: (i) TF interface evaluation, and (ii) autograd interface gradient. The reason for (ii) is that the `set_parameters()` method is quite an overhead in autograd. The reason for (i) I'm not entirely sure of, but the `broadcast()` function is a massive overhead in the new core (57% new vs 20% old).
- (B) is quicker than (A). This is because (B) does not need to serialize the circuit graph and call `set_parameters()`; it just hashes the parameters and assumes the circuit graph is the same.
Choosing (A) or (B)
What do people recommend? Slower caching time with more safety, or quicker caching time but relying on users to keep their circuits immutable? I'd probably edge toward (A) (slower but safer) and hope to speed up the slow bits in future (e.g. could we do quick serialization?)
Co-authored-by: Theodor <theodor@xanadu.ai>
Thanks @thisac for the review!
Also, updates for the tape benchmark (timings shown with current caching (A), with previous caching (B), and without caching (C)). Again, the new approach (A) is slower due to having to hash the circuit graph.
Hi @trbromley, looks good to me! Saw the options, though at this point I'm also happy to approve whichever is eventually accepted.
Couple of thoughts/questions:
- What happens if a user inputs a mutable qfunc in the (B) case? Could one end up with invalid results?
- `CircuitGraph.serialize` doesn't seem to use `"".join(serialization_string)` for string concatenation, which I've come across as one of the fastest options. Yet, it seemed to me that even when using `"".join`, serialization is not significantly faster. Perhaps `hash` with a long string argument takes a long time. If going with (A), we could perhaps consider a shorter string to be hashed, or explore further as you suggested.
- Wondering why (A) is much slower than (B) for the gradient when using Autograd? 🤔

As we are doing some hashing even in case (B), I might be leaning towards going with (A) and tracking down why hashing takes considerably more time.
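The string-concatenation point above can be sanity-checked with a quick micro-benchmark (illustrative only; the fragment contents and sizes are arbitrary):

```python
import timeit

# Compare += concatenation with "".join when building a serialization-like
# string. "".join is usually faster since it avoids repeated reallocation,
# though actual numbers depend on the platform and Python version.
pieces = ["RX|0|0.5;"] * 10_000

def concat_plus():
    s = ""
    for p in pieces:
        s += p
    return s

def concat_join():
    return "".join(pieces)

assert concat_plus() == concat_join()  # both build the same string
t_plus = timeit.timeit(concat_plus, number=50)
t_join = timeit.timeit(concat_join, number=50)
```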
Thanks Tom, this looks great! I agree, I prefer the slower but safer approach where circuit mutability is taken into account by hashing.
I suppose I have one quite general question, that doesn't affect merging this PR:
- Under what general use case do you see improvement? Is it often that general use, e.g., computing gradients, results in repeated evaluation with the same parameters?
```diff
@@ -114,9 +120,11 @@ class QNode:
     >>> qnode = QNode(circuit, dev)
     """

-    # pylint:disable=too-many-instance-attributes
+    # pylint:disable=too-many-instance-attributes,too-many-arguments
```
😆
pennylane/beta/tapes/qnode.py
Outdated
@property | ||
def caching(self): | ||
"""float: number of device executions to store in a cache to speed up subsequent | ||
executions. If set to zero, no caching occurs.""" | ||
return self._caching | ||
|
||
@caching.setter | ||
def caching(self, value): | ||
self._caching = value |
Are these two properties needed? It doesn't seem that they are used anywhere. Is this to allow the user to modify the cache size dynamically on an existing QNode?
Good point! I decided to remove the setter but keep the property, just in case users want to see the current caching value.
With the setter there, I was half thinking to let users dynamically resize the cache. This is fine if they set it to a bigger number than the current size, but if they set it smaller, we need to drop multiple entries from the existing cache. I thought this was a bit of an overcomplication for now, so I just got rid of the setter.
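The complication mentioned here is that shrinking the limit means evicting several of the oldest entries at once, e.g. (hypothetical sketch, not code from the PR, which removed the setter instead):

```python
from collections import OrderedDict

# Sketch of why a caching setter is more involved than it looks: shrinking
# the limit requires evicting oldest entries until the cache fits again.
cache = OrderedDict((k, k) for k in range(5))  # cache holds keys 0..4

def set_limit(cache, new_limit):
    while len(cache) > new_limit:
        cache.popitem(last=False)  # drop oldest entries until within limit

set_limit(cache, 2)
assert list(cache) == [3, 4]  # only the two newest entries remain
```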
Thanks @josh146 !!
I forgot to answer this question! One big saving can be when using the autograd interface for calculating the jacobian, since, as we discussed, that has some extra repetition (the saving was pretty good there, but I'd have to dig up the numbers). Another use case might be more on the ML testing side, e.g. after the model has been trained and most of the weights are fixed. Depending on the data, we might often be repeating previously evaluated runs.
This PR adds the ability for QNodes and quantum tapes to cache previous evaluations.
This is a replacement of #776 to work with the new QTape core.
It also replaces #795, which was a branch from `quantum-tape` 😅

Things to note

Benchmarking
Let's take a look to see if caching is working. We'll consider a 6-qubit circuit composed of `StronglyEntanglingLayers` of depth 6 (similarly to #776). We will also look at benchmarking both the quantum tape and the QNode in two settings: evaluation (with 500 repetitions) and gradient evaluation (with 20 repetitions), where each repetition is with the same input parameters.

Tape
Without caching:
Evaluate - 5.36 seconds
Gradient - 24.67 seconds
With caching:
Evaluate - 0.18 seconds
Gradient - 1.43 seconds
QNode
Without caching:
Evaluate - 9.79 seconds
Gradient - 69.20 seconds
With caching:
Evaluate - 3.92 seconds
Gradient - 3.98 seconds
Summary
Caching is helping us! Also, we can see that using the QNode has an additional overhead, e.g. from methods such as `construct()`.
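For reference, the speedups implied by the timings above work out roughly as follows:

```python
# Rough speedups implied by the benchmark timings quoted above
# (same numbers as in the text; repetitions as stated there).
tape_eval = 5.36 / 0.18    # tape evaluation: roughly 30x faster with caching
tape_grad = 24.67 / 1.43   # tape gradient: roughly 17x
qnode_eval = 9.79 / 3.92   # QNode evaluation: roughly 2.5x
qnode_grad = 69.20 / 3.98  # QNode gradient: roughly 17x

assert tape_eval > 25
assert qnode_grad > 15
```

The QNode speedups are smaller than the tape ones at evaluation time, consistent with the per-call overhead (e.g. `construct()`) that caching cannot remove.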