
Cache tape executions #817

Merged: 32 commits into master from add_caching_to_tape on Sep 25, 2020

Conversation

@trbromley (Contributor) commented Sep 21, 2020

This PR adds the ability for QNodes and quantum tapes to cache previous evaluations:

import pennylane as qml
from pennylane.beta.tapes import qnode
from pennylane.beta.queuing import expval

dev = qml.device("default.qubit", wires=2)

@qnode(dev, caching=10)  # cache up to 10 evaluations
def qfunc(x):
    qml.RX(x, wires=0)
    qml.RX(0.3, wires=1)
    qml.CNOT(wires=[0, 1])
    return expval(qml.PauliZ(1))

for i in range(100):
    qfunc(0.1)
# Even with 100 calls, the internal execute_device() method
# of the quantum tape has only been called once

This is a replacement of #776 to work with the new QTape core.
It also replaces #795 which was a branch from quantum-tape 😅

Things to note

  • Caching works with both the underlying tape and the QNode.
  • Caching is off by default, so it should not break anything.
  • Caching still takes place if the user is returning samples or if operating in non-analytic mode. It isn't hard to prevent caching in these cases (and this is done in [WIP] Cache previous calls to QNode evaluate #776), but I ended up on the fence about whether to. Turning off caching implicitly seems wrong - if a user turns on caching, they should get what they expect (cached results). Is there a case people can think of as a counterargument?
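To make the sampling question concrete, here is a toy memoized sampler (not the PR's implementation; all names are hypothetical): with caching on, repeated calls with the same parameters return the exact same samples, with no fresh randomness.

```python
import random

# Toy stand-in for a cached non-analytic device execution.
_sample_cache = {}

def sample(theta, shots=5):
    key = (theta, shots)
    if key not in _sample_cache:
        # stand-in for a real device execution returning `shots` samples
        _sample_cache[key] = [random.choice([-1, 1]) for _ in range(shots)]
    return _sample_cache[key]

first = sample(0.1)
second = sample(0.1)
assert first is second  # cached: the identical sample list, no new randomness
```

This is the behaviour the PR deliberately keeps: a user who opts into caching gets cached results, even for stochastic return types.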

Benchmarking

Let's take a look to see if caching is working. We'll consider a 6-qubit circuit composed of StronglyEntanglingLayers of depth 6 (similar to #776). We will benchmark both the quantum tape and the QNode in two settings: evaluation (with 500 repetitions) and gradient evaluation (with 20 repetitions), where each repetition uses the same input parameters.

Tape

Without caching:
Evaluate - 5.36 seconds
Gradient - 24.67 seconds

With caching:
Evaluate - 0.18 seconds
Gradient - 1.43 seconds

QNode

Without caching:
Evaluate - 9.79 seconds
Gradient - 69.20 seconds

With caching:
Evaluate - 3.92 seconds
Gradient - 3.98 seconds

Summary

Caching is helping us! Also, we can see that using the QNode has an additional overhead, e.g. from methods such as construct().

@trbromley trbromley self-assigned this Sep 21, 2020
@trbromley trbromley added the review-ready 👌 PRs which are ready for review by someone from the core team. label Sep 21, 2020
@codecov bot commented Sep 21, 2020

Codecov Report

Merging #817 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #817      +/-   ##
==========================================
+ Coverage   91.02%   91.05%   +0.02%     
==========================================
  Files         129      129              
  Lines        8649     8678      +29     
==========================================
+ Hits         7873     7902      +29     
  Misses        776      776              
Impacted Files Coverage Δ
pennylane/beta/tapes/qnode.py 98.75% <100.00%> (+0.10%) ⬆️
pennylane/beta/tapes/tape.py 98.82% <100.00%> (+0.05%) ⬆️

Last update d237ab4...9765a13.

@trbromley (Contributor, Author)

Local code coverage is ok:

[screenshot: local coverage report] (line 194 is for something else)

[screenshot: local coverage report] (the contribution in this PR is from line 438 onwards)

"""float: number of device executions to store in a cache to speed up subsequent
executions. If set to zero, no caching occurs."""

if caching is not None:
Contributor:

Could the default for the caching argument be 0? Also, it seems that if the user specified 0, then this condition would evaluate to True, correct? 🤔

Contributor (Author):

Good idea! This is now done: f86adc7
It also makes the if statement correct - you are right that a user specifying 0 would have resulted in a warning!

Comment on lines 1016 to 1019
if self._caching and hashed_params not in self._cache_execute:
    self._cache_execute[hashed_params] = res
    if len(self._cache_execute) > self._caching:
        self._cache_execute.popitem(last=False)
Contributor:

It would be good to document this behaviour: users could also think that once the caching limit is reached, then nothing happens to the cache. However, this seems to be not the case as each time there's a new execution, the very first cached result is dropped and we're adding the latest result.

Overall also might be worth considering which of the two options is more beneficial.

Contributor (Author):

Good point! I've added it to the caching part of the docstring: 5bd1d37
I think the current behaviour makes the most sense, else you might end up with a cache that is very stale. Moreover, we always want to be caching the last execution since that is the most likely one to be repeated.
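To make the eviction behaviour concrete, here is a minimal stand-in (not the PR's code; the FIFOCache name is hypothetical) for a bounded cache that drops the oldest entry first, mirroring the OrderedDict.popitem(last=False) call in the snippet above:

```python
from collections import OrderedDict

class FIFOCache:
    """Bounded, insertion-ordered cache: oldest entry is evicted first."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def add(self, key, value):
        if key not in self._store:
            self._store[key] = value
            if len(self._store) > self.maxsize:
                # drop the least recently *inserted* entry
                self._store.popitem(last=False)

    def get(self, key):
        return self._store.get(key)

cache = FIFOCache(maxsize=2)
cache.add("h1", 0.99)
cache.add("h2", 0.42)
cache.add("h3", 0.07)        # evicts "h1", the oldest entry
assert cache.get("h1") is None
assert cache.get("h3") == 0.07
```

The latest execution is always retained, which matches the reasoning that the most recent result is the most likely to be repeated.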

Comment on lines 152 to 154
warnings.warn(
    "Caching mode activated. The quantum circuit being executed by the QNode must have "
    "a fixed structure.",
Contributor:

This way a user would receive this warning even if the quantum circuit was fine, right? Could an error be raised if a user creates a mutable QNode? (E.g. hashing the circuit operations: name of the operation and wire they act on and raising an error if the latest hash differs from the stored hash for the circuit.)

Contributor (Author):

Yes, this is now done: instead of hashing the tape arguments, we use the hash of the circuit graph.

There are performance hits for both approaches: we need to use set_parameters() for the former and we need to serialize for the latter. I hope to summarize more the relative performance in a follow up comment.
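As a sketch of the idea (assumed serialization format; this is not PennyLane's CircuitGraph.serialize), a cache key can be derived by hashing each operation's name, wires, and parameters, so any mutation of the circuit changes the key:

```python
import hashlib

def circuit_hash(ops):
    """Hash a circuit's structure *and* parameters: serialize each
    (name, wires, params) triple, then hash the resulting string."""
    serialization = "".join(
        f"{name}|{tuple(wires)}|{tuple(params)};" for name, wires, params in ops
    )
    return hashlib.sha256(serialization.encode()).hexdigest()

fixed = [("RX", [0], [0.1]), ("CNOT", [0, 1], [])]
mutated = [("RY", [0], [0.1]), ("CNOT", [0, 1], [])]  # different gate, same params

# identical circuits collide on the same key; a mutated circuit does not
assert circuit_hash(fixed) == circuit_hash([("RX", [0], [0.1]), ("CNOT", [0, 1], [])])
assert circuit_hash(fixed) != circuit_hash(mutated)
```

Because the key covers structure as well as parameters, a mutated QNode simply produces a cache miss instead of silently returning a stale result.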

Comment on lines 399 to 406
if self._caching:
    self.qtape._cache_execute = self._cache_execute

# execute the tape
res = self.qtape.execute(device=self.device)

if self._caching:
    self._cache_execute = self.qtape._cache_execute
Contributor:

Could only the qtape have the _cache_execute attribute? Since a QNode can access the attributes of a qtape

Contributor (Author):

Ideally, but unfortunately the self.construct() method wipes the previous QTape and starts with a fresh one, so this line allows the cache to persist across multiple QTapes.
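A minimal sketch (all names hypothetical, not PennyLane code) of why this hand-off is needed: if each call builds a fresh tape, a cache stored only on the tape would be lost, so the QNode-level object injects the cache before execution and collects it afterwards:

```python
class Tape:
    """Stand-in for a quantum tape that is rebuilt on every call."""

    def __init__(self):
        self._cache_execute = {}

    def execute(self, key, fn):
        if key not in self._cache_execute:
            self._cache_execute[key] = fn()
        return self._cache_execute[key]

class Node:
    """Stand-in for the QNode, which owns the long-lived cache."""

    def __init__(self):
        self._cache_execute = {}

    def __call__(self, key, fn):
        tape = Tape()                                 # construct() wipes the old tape
        tape._cache_execute = self._cache_execute     # hand the cache down
        res = tape.execute(key, fn)
        self._cache_execute = tape._cache_execute     # persist it back up
        return res

calls = []
node = Node()
node("k", lambda: calls.append(1) or 1.0)
node("k", lambda: calls.append(1) or 1.0)
assert len(calls) == 1  # second call is served from the persisted cache
```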

Member:

Ah, this answers my question above! Probably a good idea to add a line comment, since Antal and I both independently had the same question

Member:

If caching is on, can we avoid redundant tape constructions?

Contributor (Author):

I just had another quick try of: when caching is on, the tape is only constructed the first time.
This causes the tests related to mutability and classical processing to fail. For example, the cache is still used when the parameters are the same but the circuit differs. Although, I'm not sure if the problem is even deeper, since isn't construction the place where the input arguments are fixed to the gate ops?

"""Tests for the _hash_iterable function."""

iterables = [
[1, 1.4, -1],
Contributor:

Could be worth adding unsupported cases:

The iterable must be flat and can contain only numbers and NumPy arrays.

When this condition is not satisfied.

Contributor (Author):

🤔 Ok, I can think about this if we decide to keep this test. Currently the test has been removed because we are using the hash of the circuit graph.

@antalszava (Contributor) left a comment:

This is overall looking great @trbromley! 💯 Curious to hear your thoughts on a couple of things, e.g. _get_all_parameters and the FIFO nature of _cache_execute, so leaving a comment review for now.

@josh146 (Member) left a comment:

Thanks Tom!

I have to admit, I'm somewhat shocked at the overhead between tape and QNode gradients. I knew it was there, but I didn't know it was that large. I wonder what we can do to mitigate this.

A couple of questions:

  • What interface did you use in the benchmarks?

  • How does the non-cached time compare to the existing QNode time?

Comment on lines 101 to 103
caching (int): number of device executions to store in a cache to speed up subsequent
executions. Caching does not take place by default. In caching mode, the quantum circuit
being executed must have a constant structure and only its parameters can be varied.
Member:

In addition (not sure if it's implied by 'only its parameters can be varied', but it wasn't clear to me): there can't be any classical processing within the QNode.

E.g., qml.RX(2*x, wires=0) is not allowed.

This is a side effect of the tape being a part of the user's quantum function --- we can't extract the tape independently of the classical processing :(

Contributor (Author):

🤔 @josh146 it's not clear to me why classical processing is not allowed?
I added some tests and it seems compatible: a640e26
(note, this is using a new approach that came out of code review, where we are caching hashes of the circuit, which can pick up if the circuit has mutated)

Member:

I have to admit, it's not something I've tested, just a 'hunch' that it wouldn't work. My thinking was that:

When the QNode is evaluated, the user's qfunc() is called inside a quantum tape:

with QuantumTape() as tape:
    qfunc(*args, **kwargs)

This will do two things:

  1. classically process the args and kwargs as per whatever classical processing the user has within the QNode, e.g., tf.sin(a)

  2. construct the tape, with incident gate arguments potentially different to args and kwargs due to post-processing.

Say the user passes a qfunc of the form:

def circuit(x):
    qml.RX(tf.sin(x), wires=0)
    return qml.expval(qml.PauliZ(0))

After the first construction (the user has passed x=0.6), we 'cache' or store the tape. On the second evaluation (now the user passes x=0.1), we want to execute the previous tape, but with the new gate argument; however, we can't do tape.set_parameters([0.1]), because the user actually wants tape.set_parameters([tf.sin(0.1)]) --- how do we determine the classical post-processing step?

So having immutable tapes is not enough as far as I can see. Or maybe I am missing something --- would be happy if that's the case and this works!
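For what it's worth, a toy sketch (not PennyLane code; all names hypothetical) of why hashing the constructed circuit sidesteps this: the cache key is built from the post-processed gate argument (e.g. sin(x)) rather than the raw QNode input x, so re-constructing the tape on every call and then hashing keeps the cache consistent even with classical processing.

```python
import math

executions = []   # records every "device execution"
_cache = {}

def evaluate(x):
    # classical processing inside the qfunc: the gate sees sin(x), not x
    gate_arg = math.sin(x)
    # the cache key is derived from the *constructed* circuit:
    # gate name, wires, and the post-processed argument
    key = ("RX", (0,), round(gate_arg, 12))
    if key not in _cache:
        executions.append(x)
        _cache[key] = math.cos(gate_arg)  # stand-in for <Z> after RX(gate_arg)
    return _cache[key]

evaluate(0.6)
evaluate(0.6)   # same post-processed argument -> cache hit
evaluate(0.1)   # different argument -> fresh execution
assert len(executions) == 2
```

Because the tape is rebuilt before hashing, there is no need to invert the classical post-processing with set_parameters().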

@@ -890,6 +906,60 @@ def execute(self, device, params=None):

return self._execute(params, device=device)

def _get_all_parameters(self, params):
Member:

Isn't this the same as doing

>>> tape.set_parameters([0.1, 0.2])
>>> tape.get_parameters(trainable_only=False)
[0.1, 0.543, 0.2]

Contributor (Author):

Yes! 😁

Well, _get_all_parameters allows you to get all the params without modifying the trainable params themselves. Whereas we need to do

saved_params = tape.get_parameters()
tape.set_parameters([0.1, 0.2])
all_parameters = tape.get_parameters(trainable_only=False)
# Do stuff
tape.set_parameters(saved_params)

However, this is already what is happening in execute_device, so you're right maybe it's not worth making a new method.

I've now removed the method and used the suggested approach db32341.

@thisac (Contributor) left a comment:

I think this looks great @trbromley. A really nice addition (that I actually would have needed a few times already). 😄

@trbromley (Contributor, Author) left a comment:

Thanks @antalszava @josh146 for the comments! Your comments resulted in one big change:

  • Instead of hashing the quantum tape parameters, we can use the qtape.graph.hash which describes both the parameters and the gates and is calculated from a serialization of the circuit.
  • This allows us to remove _get_all_parameters() and _hash_iterable() (and corresponding tests)
  • It allows us to not have the implicit assumption that the circuit structure is unchanged, since the generated hash depends on the circuit structure. Caching is hence compatible with mutable circuits.
  • This change has performance implications that I will summarize below.

Performance implications

We compare 4 options:

(A) The way the PR is now, using the hash of the circuit graph but being explicitly safe regarding mutability.

(B) When you originally reviewed, this PR cached the internal variables of the tape. This requires the assumption that the QNode is immutable.

(C) Using the new quantum-tape core, but without caching

(D) Using the current, non-beta, core

These are the results:
[screenshots: benchmark plots for evaluation and gradient times]

Things to note:

  • C is generally better than D except for two cases: (i) TF interface evaluation, (ii) autograd interface gradient. The reason for (ii) is that the set_parameters() method is quite an overhead in autograd. The reason for (i) I'm not entirely sure of, but the broadcast() function is a massive overhead in the new core (57% new vs 20% old).

  • (B) is quicker than (A). This is because (B) does not need to serialize the circuit graph and call set_parameters() - it is just hashing the parameters and assuming the circuit graph is the same.

Choosing (A) or (B)

What do people recommend? Slower caching time with more safety, or quicker caching time but relying on users to keep their circuits immutable? I'd probably edge toward (A) (slower but safer) and hope to speed up the slow bits in future (e.g. could we do quick serialization?)
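To make options (A) and (B) comparable, here is a rough, self-contained sketch (assumed serialization format; this is not CircuitGraph.serialize) of the two key strategies. Timings vary by machine, so no numbers are claimed:

```python
import hashlib
import timeit

params = [0.1 * i for i in range(36)]                   # e.g. 6 qubits x 6 layers
ops = [("Rot", (i % 6,), (p,)) for i, p in enumerate(params)]

def key_b(params):
    # option (B): hash only the parameters; assumes the circuit is immutable
    return hash(tuple(params))

def key_a(ops):
    # option (A): serialize the whole circuit, then hash; safe under mutation
    s = "".join(f"{n}|{w}|{p};" for n, w, p in ops)
    return hashlib.sha256(s.encode()).hexdigest()

t_b = timeit.timeit(lambda: key_b(params), number=10_000)
t_a = timeit.timeit(lambda: key_a(ops), number=10_000)
print(f"(B) params-only: {t_b:.4f}s   (A) full serialization: {t_a:.4f}s")
```

The extra work in (A) is building and hashing the serialization string each call, which is the cost being weighed against mutation safety.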

Co-authored-by: Theodor <theodor@xanadu.ai>
@trbromley (Contributor, Author):

Thanks @thisac for the review!

@trbromley (Contributor, Author):

Also, updates for tape benchmark:

With current caching (A):
Evaluate - 0.59 seconds
Gradient - 3.09 seconds

With previous caching (B):
Evaluate - 0.18 seconds
Gradient - 1.43 seconds

Without caching (C):
Evaluate - 5.36 seconds
Gradient - 24.67 seconds

Again, the new approach (A) is slower due to having to hash the circuit graph.

@antalszava (Contributor) left a comment:

Hi @trbromley, looks good to me! Saw the options, though at this point I'm also happy to approve whichever would be eventually accepted.

Couple of thoughts/questions:

  • What happens if a user inputs a mutable qfunc in the (B) case? Could one end up with invalid results?
  • CircuitGraph.serialize doesn't seem to use "".join(serialization_string) for string concatenation, which I've come across as one of the fastest options. Yet, it seemed to me that even when using "".join, serialization is not significantly faster. Perhaps hash with a long string argument takes a long time. If going with (A), we could perhaps consider a shorter string to be hashed, or explore further as you suggested.
  • Wondering why it could be that (A) is much slower than (B) for the gradient when using autograd? 🤔

As we are doing some hashing even in case (B), I might lean towards going with (A) and tracking down why hashing takes considerably more time.

@josh146 (Member) left a comment:

Thanks Tom, this looks great! I agree, I prefer the slower but safer approach where circuit mutability is taken into account by hashing.

I suppose I have one quite general question, that doesn't affect merging this PR:

  • Under what general use case do you see improvement? Is it often that general use, e.g., computing gradients, results in repeated evaluation with the same parameters?

@@ -114,9 +120,11 @@ class QNode:
>>> qnode = QNode(circuit, dev)
"""

# pylint:disable=too-many-instance-attributes
# pylint:disable=too-many-instance-attributes,too-many-arguments
Member:

😆

Comment on lines 501 to 509
@property
def caching(self):
    """float: number of device executions to store in a cache to speed up subsequent
    executions. If set to zero, no caching occurs."""
    return self._caching

@caching.setter
def caching(self, value):
    self._caching = value
Member:

Are these two properties needed? It doesn't seem that they are used anywhere. Is this to allow the user to modify the cache size dynamically on an existing QNode?

Contributor (Author):

Good point! I decided to remove the setter but keep the property, just in case users want to see the current caching value.

With the setter there, I was half thinking to let users dynamically set the cache. This is fine if they set it to a bigger number than the current size, but if setting smaller we need to drop multiple parts of the existing cache. I thought this was a bit of an overcomplication for now, so just got rid of the setter.

@trbromley (Contributor, Author) left a comment:

Thanks @josh146 !!

@trbromley trbromley merged commit 27dc036 into master Sep 25, 2020
@trbromley trbromley deleted the add_caching_to_tape branch September 25, 2020 15:55
@trbromley (Contributor, Author):

> Under what general use case do you see improvement? Is it often that general use, e.g., computing gradients, results in repeated evaluation with the same parameters?

I forgot to answer this question!

One big saving can be when using the autograd interface to calculate the jacobian, since, as we discussed, that has some extra repetition (the saving was pretty good there, but I'd have to dig up the numbers). Another use case might be more on the ML testing side, e.g. after the model has been trained and most of the weights are fixed. Depending on the data, we might often be repeating previously evaluated runs.
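As a sketch of the gradient use case (toy code, not PennyLane's autograd interface): a forward-difference gradient re-evaluates the circuit at the same unshifted point once per parameter, so a cache collapses those repeats, and a repeated gradient call is served entirely from cache.

```python
calls = 0
_cache = {}

def circuit(params):
    """Stand-in for a cached QNode evaluation."""
    global calls
    key = tuple(params)
    if key not in _cache:
        calls += 1                                   # a real device execution
        _cache[key] = sum(p**2 for p in params)      # toy "expectation value"
    return _cache[key]

def grad(params, h=1e-7):
    """Forward-difference gradient: evaluates circuit(params) per parameter."""
    out = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += h
        out.append((circuit(shifted) - circuit(params)) / h)
    return out

g = grad([0.1, 0.2, 0.3])
# 3 shifted points + 1 base point = 4 executions; the base point is
# requested 3 times but only the first request hits the "device"
assert calls == 4
grad([0.1, 0.2, 0.3])   # a repeated gradient is fully cached
assert calls == 4
```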
