# Parallelized Remote QNode Execution

This notebook demonstrates how training and execution of remote qnodes can be parallelized.
For this demonstration, we use open-access IBM quantum hardware simulators.
These simulators have relatively short queue times compared to the available quantum computers, yet they still reveal the advantages of parallelizing the web requests used to invoke the remote IBM Q services.

In [1]:
import pennylane as qml
from pennylane import numpy as np

from context import QNetOptimizer as QNopt

In [2]:
from qiskit import IBMQ

# For details regarding integration between PennyLane and IBM Q,
# see https://pennylaneqiskit.readthedocs.io/en/latest/devices/ibmq.html#accounts-and-tokens
provider = IBMQ.load_account()

## Setup

For simplicity, we consider a CHSH scenario ansatz with a static Bell state preparation and local qubit measurements optimized over the $xz$-plane.
The qnodes are executed/trained remotely on the `ibmq_qasm_simulator`.
In certain cases, qnode execution is performed locally for greater efficiency.
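
To make the optimized quantity concrete, the following is a minimal sketch of the CHSH score for an idealized version of this scenario. It assumes a noiseless $|\Phi^+\rangle$ Bell state and projective measurements along directions in the $xz$-plane, in which case the two-party correlator reduces to $\cos(a - b)$. The function names `correlator` and `chsh_score` are hypothetical and are not part of `QNopt`, whose cost function may use different sign conventions.

```python
import math


def correlator(a, b):
    """Ideal correlator for |Phi+> with xz-plane measurement angles a and b."""
    return math.cos(a - b)


def chsh_score(a0, a1, b0, b1):
    """CHSH combination of the four correlators (one per input pair)."""
    return (
        correlator(a0, b0)
        + correlator(a0, b1)
        + correlator(a1, b0)
        - correlator(a1, b1)
    )


# A standard optimal choice of angles reaches the quantum bound 2 * sqrt(2).
best = chsh_score(0.0, math.pi / 2, math.pi / 4, -math.pi / 4)
print(best)
```

Each of the four correlators corresponds to one qnode execution, which is why a single evaluation of the CHSH cost on a remote device involves four independent requests.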

In [3]:
prep_nodes = [
    QNopt.PrepareNode(1, [0,1], QNopt.ghz_state, 0)
]
meas_nodes = [
    QNopt.MeasureNode(2, 2, [0], QNopt.local_RY, 1),
    QNopt.MeasureNode(2, 2, [1], QNopt.local_RY, 1)
]

dev_ibm_qasm = {
    "name" : "qiskit.ibmq",
    "shots" : 2000,
    "backend" : "ibmq_qasm_simulator",
    "provider" : provider
}

local_sim_chsh_ansatz = QNopt.NetworkAnsatz(prep_nodes, meas_nodes)
ibm_sim_chsh_ansatz = QNopt.NetworkAnsatz(
    prep_nodes, meas_nodes, dev_kwargs = dev_ibm_qasm
)

## QNode Execution Parallelization

We now demonstrate the performance gained by parallelizing qnode execution across IBM Q hardware simulator devices.
In this example, the CHSH cost function requires 4 remote qnode executions, which are run first serially and then in parallel.

### Non-Parallelized Remote QNode Execution

By default, PennyLane issues the web requests that invoke remote qnode execution serially, one after another.
This is a significant inefficiency, given that each of these qnode executions is independent of the others.

In [4]:
%%time

chsh_cost = QNopt.chsh_inequality_cost(ibm_sim_chsh_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_scenario_settings()

chsh_cost(rand_settings)

CPU times: user 357 ms, sys: 30.3 ms, total: 387 ms
Wall time: 17.5 s


-0.5859999999999999

### Parallelized Remote QNode Execution

Parallelizing the qnode execution across 4 separate threads yields a speedup of roughly 2-4x.
This speedup persists even when a single remote device runs all qnode executions, because the parallelized web requests populate the queue sooner than serial web requests do.
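
The effect can be illustrated without any remote service. The sketch below is not the `QNopt` implementation: `fake_remote_qnode` is a hypothetical stand-in that simulates request latency with `time.sleep`, and the 0.1 s latency is an arbitrary assumption. It compares a serial loop against Python's standard `concurrent.futures.ThreadPoolExecutor` with 4 worker threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_remote_qnode(setting, latency=0.1):
    """Hypothetical stand-in for a remote qnode call that blocks on the network."""
    time.sleep(latency)  # simulated queue and web-request latency
    return setting ** 2  # placeholder result


settings = [0, 1, 2, 3]  # four independent executions, as in the CHSH cost

# Serial: each request waits for the previous one to return.
start = time.perf_counter()
serial_results = [fake_remote_qnode(s) for s in settings]
serial_time = time.perf_counter() - start

# Parallel: all four requests are in flight at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(fake_remote_qnode, settings))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f} s, parallel: {parallel_time:.2f} s")
```

Threads suit this workload because each call spends nearly all of its time waiting on I/O rather than computing, and `pool.map` returns the results in the same order as the inputs.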



In [5]:
%%time

parallel_chsh_cost = QNopt.chsh_inequality_cost(
    ibm_sim_chsh_ansatz, parallel=True
)

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_scenario_settings()

parallel_chsh_cost(rand_settings)

CPU times: user 460 ms, sys: 43.1 ms, total: 503 ms
Wall time: 5.37 s


-0.599

## QNode Gradients on Remote Hardware

In this section, we demonstrate the performance gained by parallelizing the gradient computation of remote qnodes.
The gradient is evaluated using the parameter shift rule.
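
As a reminder of what the parameter shift rule computes, here is a minimal single-parameter sketch. It assumes the textbook case of a qubit prepared with `RY(theta)` and measured in the $Z$ basis, whose expectation value is $\cos\theta$. The analytic function `expval_z_after_ry` stands in for a remote qnode, and both function names are hypothetical.

```python
import math


def expval_z_after_ry(theta):
    """Analytic <Z> for RY(theta)|0>, standing in for a remote qnode."""
    return math.cos(theta)


def parameter_shift(f, theta):
    """Gradient from two evaluations of f at theta shifted by +/- pi/2."""
    shift = math.pi / 2
    return (f(theta + shift) - f(theta - shift)) / 2


theta = 0.4
grad = parameter_shift(expval_z_after_ry, theta)
print(grad, -math.sin(theta))  # the two values agree
```

Each parameter costs two circuit evaluations, and those evaluations are independent of one another, which is what makes the gradient a natural target for parallel web requests.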


### Non-Parallelized Remote Gradient Computation

By default, all cost functions and gradients are evaluated serially on remote hardware.
This requires 28 web requests in total to the remote IBM simulator.

In [6]:
%%time

ibm_sim_chsh_cost = QNopt.chsh_inequality_cost(ibm_sim_chsh_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_scenario_settings()

QNopt.gradient_descent(
    ibm_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1
)

iteration :  0 , score :  0.611
CPU times: user 1.89 s, sys: 187 ms, total: 2.08 s
Wall time: 2min 3s


{'opt_score': 1.149,
 'opt_settings': [[array([], shape=(1, 0), dtype=float64)],
  [array([[ 1.86700571],
          [-1.60032715]]),
   array([[2.06005211],
          [2.73548852]])]],
 'scores': [0.611, 1.149],
 'samples': [0, 0],
 'settings_history': [[[array([], shape=(1, 0), dtype=float64)],
   [array([[ 1.86700571],
           [-1.60032715]]),
    array([[2.06005211],
           [2.73548852]])]]]}

### Parallelized Remote Gradient Computation

To speed up the gradient computation, we parallelize the parameter shift rule across 4 threads.
Evaluating the cost function locally drops 8 of the web requests from the previous example, so only 20 web requests to the remote IBM simulator are needed in total.
Training is thus confined to 20 web requests split across four independent threads.
We see a roughly 4-5x improvement in training time (2 min 3 s versus 25.3 s of wall time).
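
One accounting that is consistent with the request counts quoted above is sketched below. The breakdown is an assumption rather than something stated in this notebook: it supposes that each of the 4 qnodes depends on 2 trainable measurement angles, that each angle needs 2 shifted evaluations, and that the gradient also triggers 1 unshifted forward pass per qnode. The 2 cost evaluations match the two entries of `scores` in the output.

```python
# Assumed breakdown of web requests for one gradient-descent step.
num_qnodes = 4                  # one per CHSH input pair
params_per_qnode = 2            # assumption: one angle per measurement node
shifts_per_param = 2            # parameter shift: +pi/2 and -pi/2
forward_passes_per_qnode = 1    # assumption: unshifted evaluation during the gradient

gradient_requests = num_qnodes * (
    forward_passes_per_qnode + params_per_qnode * shifts_per_param
)

cost_evaluations = 2            # initial score and final score
cost_requests = cost_evaluations * num_qnodes

total_requests = gradient_requests + cost_requests
print(gradient_requests, cost_requests, total_requests)
```

Under these assumptions the gradient accounts for 20 requests and the two cost evaluations for the remaining 8, which are the ones moved to the local simulator.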

In [7]:
%%time

local_sim_chsh_cost = QNopt.chsh_inequality_cost(local_sim_chsh_ansatz)
parallel_grad_fn = QNopt.parallel_chsh_grad(
    ibm_sim_chsh_ansatz, diff_method="parameter-shift"
)


np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_scenario_settings()

QNopt.gradient_descent(
    local_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1,
    grad_fn=parallel_grad_fn
)

iteration :  0 , score :  0.6183525458603142
CPU times: user 1.39 s, sys: 137 ms, total: 1.53 s
Wall time: 25.3 s


{'opt_score': 1.127667258710372,
 'opt_settings': [[array([], shape=(1, 0), dtype=float64)],
  [array([[ 1.86550571],
          [-1.60027715]]),
   array([[2.06000211],
          [2.73408852]])]],
 'scores': [0.6183525458603142, 1.127667258710372],
 'samples': [0, 0],
 'settings_history': [[[array([], shape=(1, 0), dtype=float64)],
   [array([[ 1.86550571],
           [-1.60027715]]),
    array([[2.06000211],
           [2.73408852]])]]]}