# Parallelizing Remote QNode Execution

This notebook demonstrates the importance of parallelizing the evaluation of cost functions and their gradients when using remote hardware.
For this demonstration, we use open-access IBM quantum hardware simulators and qNetVO's parallelization functionality built on dask.
These simulators have relatively short queue times compared to the available quantum computers, yet they still reveal the advantages of parallelizing the web requests used to invoke the remote IBM Q services.

In [1]:
import pennylane as qml
from pennylane import numpy as np

import qnetvo as qnet

In [2]:
from qiskit import IBMQ

# For details regarding integration between PennyLane and IBM Quantum,
# see https://pennylaneqiskit.readthedocs.io/en/latest/devices/ibmq.html#accounts-and-tokens
provider = IBMQ.load_account()

## Setup

For simplicity, we consider a CHSH scenario ansatz with a static Bell state preparation and local qubit measurements optimized over the $xz$-plane.
The qnodes are executed/trained remotely on the `ibmq_qasm_simulator`.
In certain cases, qnode execution is performed locally for greater efficiency.

In [3]:
prep_nodes = [
    qnet.PrepareNode(1, [0,1], qnet.ghz_state, 0)
]
meas_nodes = [
    qnet.MeasureNode(2, 2, [0], qnet.local_RY, 1),
    qnet.MeasureNode(2, 2, [1], qnet.local_RY, 1)
]

dev_ibm_qasm = {
    "name" : "qiskit.ibmq",
    "shots" : 2000,
    "backend" : "ibmq_qasm_simulator",
    "provider" : provider
}

local_sim_chsh_ansatz = qnet.NetworkAnsatz(prep_nodes, meas_nodes)
ibm_sim_chsh_ansatz = qnet.NetworkAnsatz(
    prep_nodes, meas_nodes, dev_kwargs = dev_ibm_qasm
)

## QNode Execution Parallelization

We now demonstrate the performance gains obtained by parallelizing qnode execution across IBM Q hardware simulator devices.
In this example, the CHSH cost function requires 4 remote qnode executions, which are run first serially and then in parallel.
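To see why four executions might be needed, it helps to write the CHSH score as a sum of four correlators, one per pair of measurement settings. The sketch below is a plain-Python illustration, not qNetVO code: it assumes the standard analytic correlator $\cos(a - b)$ for the Bell state $(|00\rangle + |11\rangle)/\sqrt{2}$ measured in the $xz$-plane, and the names `correlator` and `chsh_score` are hypothetical.

```python
import math

def correlator(a, b):
    """Correlator <A(a) B(b)> for the Bell state (|00> + |11>)/sqrt(2)
    when both parties measure in the xz-plane at angles a and b."""
    return math.cos(a - b)

def chsh_score(alice, bob):
    """CHSH score from two settings per party: one correlator per setting pair."""
    score = 0.0
    for x in (0, 1):
        for y in (0, 1):
            # Only the (1, 1) setting pair enters with a minus sign.
            sign = -1 if (x == 1 and y == 1) else 1
            score += sign * correlator(alice[x], bob[y])
    return score

# Settings that reach the maximal quantum value 2*sqrt(2).
optimal = chsh_score((0.0, math.pi / 2), (math.pi / 4, -math.pi / 4))
print(round(optimal, 4))  # 2.8284
```

Each of the four `correlator` calls plays the role of one independent qnode execution, which is what makes the evaluation a natural candidate for parallelization.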

### Non-Parallelized Remote QNode Execution

By default, PennyLane chains the web requests that invoke remote qnode execution serially.
This is highly inefficient, given that each of these qnode executions is independent of the others.

In [4]:
%%time

chsh_cost = qnet.chsh_inequality_cost_fn(ibm_sim_chsh_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_network_settings()

chsh_cost(*rand_settings)



CPU times: user 477 ms, sys: 53.9 ms, total: 531 ms
Wall time: 25.7 s


-0.589

### Parallelized Remote QNode Execution

A speedup of roughly 2-4x is found by parallelizing the qnode executions across 4 separate threads; in the run recorded here, the wall time drops from 25.7 s to 14.1 s (about 1.8x).
This speedup persists even when a single remote device runs all qnode executions because the parallelized web requests populate the queue sooner than serial web requests.
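The effect of issuing latency-bound requests concurrently can be reproduced without any remote service. The following is a minimal sketch that swaps in Python's standard-library `ThreadPoolExecutor` for dask and uses `time.sleep` as a stand-in for the round trip to a remote device; `fake_remote_qnode` and `LATENCY` are hypothetical names, and the delay is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.2  # assumed round-trip time of one remote request, in seconds

def fake_remote_qnode(x, y):
    """Stand-in for a remote qnode call: waits, then echoes its inputs."""
    time.sleep(LATENCY)
    return (x, y)

# One job per pair of measurement inputs, as in the CHSH cost function.
inputs = [(x, y) for x in (0, 1) for y in (0, 1)]

# Serial execution: each request waits for the previous one to finish.
start = time.perf_counter()
serial_results = [fake_remote_qnode(x, y) for x, y in inputs]
serial_time = time.perf_counter() - start

# Parallel execution: all four requests are in flight at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(lambda xy: fake_remote_qnode(*xy), inputs))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f} s, parallel: {parallel_time:.2f} s")
```

Threads are sufficient here because the work is dominated by waiting rather than by local computation; real queue times vary, so the speedup on a remote device will generally be less clean than in this toy setting.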



In [5]:
%%time

parallel_chsh_cost = qnet.chsh_inequality_cost_fn(
    ibm_sim_chsh_ansatz, parallel=True
)

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_network_settings()

parallel_chsh_cost(*rand_settings)



CPU times: user 424 ms, sys: 49.3 ms, total: 473 ms
Wall time: 14.1 s


-0.626

## QNode Gradients on Remote Hardware

In this section, we demonstrate the performance gains obtained by parallelizing the gradient computation of remote qnodes.
The gradient is evaluated using the parameter-shift rule.
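For reference, the two-term parameter-shift rule for a gate generated by a Pauli operator reads $\partial_\theta f(\theta) = [f(\theta + s) - f(\theta - s)] / (2 \sin s)$, commonly used with $s = \pi/2$. The sketch below checks this on a one-qubit toy case, assuming the expectation value $\langle Z \rangle = \cos\theta$ after applying $R_y(\theta)$ to $|0\rangle$; the function names are hypothetical and no PennyLane calls are involved.

```python
import math

def expval(theta):
    """Expectation <Z> after RY(theta) acts on |0>, which equals cos(theta)."""
    return math.cos(theta)

def parameter_shift(f, theta, shift=math.pi / 2):
    """Two-term parameter-shift gradient for a Pauli-generated rotation."""
    return (f(theta + shift) - f(theta - shift)) / (2 * math.sin(shift))

theta = 0.7
grad = parameter_shift(expval, theta)

# The exact derivative of cos(theta) is -sin(theta).
print(abs(grad - (-math.sin(theta))) < 1e-12)  # True
```

Each parameter therefore costs two extra circuit evaluations, and all of these shifted evaluations are independent of one another.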


### Non-Parallelized Remote Gradient Computation

By default, all cost functions and gradients are evaluated serially on remote hardware.
This requires 28 web requests in total to the remote IBM simulator.

In [6]:
%%time

ibm_sim_chsh_cost = qnet.chsh_inequality_cost_fn(ibm_sim_chsh_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_network_settings()

qnet.gradient_descent(
    ibm_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1
)



iteration :  0 , score :  0.647




elapsed time :  199.9208381175995




CPU times: user 1.41 s, sys: 98.4 ms, total: 1.51 s
Wall time: 4min 46s


{'datetime': '2022-09-23T14:57:29Z',
 'opt_score': 1.1340000000000001,
 'opt_settings': [tensor(1.86520571, requires_grad=True),
  tensor(-1.60177715, requires_grad=True),
  tensor(2.05925211, requires_grad=True),
  tensor(2.73383852, requires_grad=True)],
 'scores': [0.647, 1.1340000000000001],
 'samples': [0, 1],
 'settings_history': [[tensor(1.74485571, requires_grad=True),
   tensor(-1.64907715, requires_grad=True),
   tensor(2.03750211, requires_grad=True),
   tensor(2.92638852, requires_grad=True)],
  [tensor(1.86520571, requires_grad=True),
   tensor(-1.60177715, requires_grad=True),
   tensor(2.05925211, requires_grad=True),
   tensor(2.73383852, requires_grad=True)]],
 'step_times': [199.9208381175995, 199.9208381175995],
 'step_size': 0.1}

### Parallelized Remote Gradient Computation

To speed up the gradient computation, we parallelize the parameter-shift rule across 4 threads.
Only 20 web requests to the remote IBM simulator are needed in total, because evaluating the cost function locally drops 8 of the 28 requests used in the previous example.
The training is thus confined to 20 web requests split across four independent threads.
We see a roughly 4x or greater improvement in training time; in the runs recorded here, the wall time drops from 4min 46s to 28.1 s.
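The idea of dispatching shifted evaluations concurrently can be sketched in a few lines. This is not how `qnet.parallel_chsh_grad_fn` is implemented; it is an illustrative stand-in that assumes a hypothetical `toy_cost` whose parameters each enter through a single Pauli rotation, and it uses the standard-library `ThreadPoolExecutor` in place of dask.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def toy_cost(settings):
    """Hypothetical stand-in for a remotely evaluated cost function."""
    return -sum(math.cos(s) for s in settings)

def shifted_inputs(settings, shift=math.pi / 2):
    """All +/- shifted copies of the settings, two per parameter."""
    jobs = []
    for i in range(len(settings)):
        for sign in (+1, -1):
            shifted = list(settings)
            shifted[i] += sign * shift
            jobs.append(shifted)
    return jobs

def parallel_parameter_shift(cost, settings, max_workers=4):
    """Evaluate every shifted cost concurrently, then pair them into a gradient."""
    jobs = shifted_inputs(settings)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so values[2i] and values[2i + 1]
        # are the +shift and -shift evaluations for parameter i.
        values = list(pool.map(cost, jobs))
    return [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(len(settings))]

settings = [1.74, -1.65, 2.04, 2.93]
grad = parallel_parameter_shift(toy_cost, settings)
print([round(g, 3) for g in grad])
```

For this toy cost the exact gradient is $\sin(s_i)$ for each setting, which gives a simple check on the result. With a remote device, each call to `cost` would be a web request, so the wall time is governed by the slowest batch of concurrent requests rather than by their sum.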

In [7]:
%%time

local_sim_chsh_cost = qnet.chsh_inequality_cost_fn(local_sim_chsh_ansatz)
parallel_grad_fn = qnet.parallel_chsh_grad_fn(
    ibm_sim_chsh_ansatz, diff_method="parameter-shift"
)


np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_network_settings()

qnet.gradient_descent(
    local_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1,
    grad_fn=parallel_grad_fn
)

iteration :  0 , score :  0.6183525458603143




elapsed time :  28.06047487258911
CPU times: user 762 ms, sys: 56.7 ms, total: 818 ms
Wall time: 28.1 s


{'datetime': '2022-09-23T15:02:16Z',
 'opt_score': 1.1265334933929982,
 'opt_settings': [tensor(1.86705571, requires_grad=True),
  tensor(-1.60237715, requires_grad=True),
  tensor(2.05830211, requires_grad=True),
  tensor(2.73478852, requires_grad=True)],
 'scores': [0.6183525458603143, 1.1265334933929982],
 'samples': [0, 1],
 'settings_history': [[tensor(1.74485571, requires_grad=True),
   tensor(-1.64907715, requires_grad=True),
   tensor(2.03750211, requires_grad=True),
   tensor(2.92638852, requires_grad=True)],
  [tensor(1.86705571, requires_grad=True),
   tensor(-1.60237715, requires_grad=True),
   tensor(2.05830211, requires_grad=True),
   tensor(2.73478852, requires_grad=True)]],
 'step_times': [28.06047487258911, 28.06047487258911],
 'step_size': 0.1}

In [8]:
%%time

local_sim_chsh_cost = qnet.chsh_inequality_cost_fn(local_sim_chsh_ansatz)
parallel_grad_fn = qnet.parallel_chsh_grad_fn(
    ibm_sim_chsh_ansatz, natural_grad=True, diff_method="parameter-shift"
)


np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_network_settings()

qnet.gradient_descent(
    local_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1,
    grad_fn=parallel_grad_fn
)

iteration :  0 , score :  0.6183525458603143




elapsed time :  115.36319899559021
CPU times: user 953 ms, sys: 71.8 ms, total: 1.02 s
Wall time: 1min 55s


{'datetime': '2022-09-23T15:02:44Z',
 'opt_score': 1.5436297536118553,
 'opt_settings': [tensor(1.98802261, requires_grad=True),
  tensor(-1.55292314, requires_grad=True),
  tensor(2.08209161, requires_grad=True),
  tensor(2.5424781, requires_grad=True)],
 'scores': [0.6183525458603143, 1.5436297536118553],
 'samples': [0, 1],
 'settings_history': [[tensor(1.74485571, requires_grad=True),
   tensor(-1.64907715, requires_grad=True),
   tensor(2.03750211, requires_grad=True),
   tensor(2.92638852, requires_grad=True)],
  [tensor(1.98802261, requires_grad=True),
   tensor(-1.55292314, requires_grad=True),
   tensor(2.08209161, requires_grad=True),
   tensor(2.5424781, requires_grad=True)]],
 'step_times': [115.36319899559021, 115.36319899559021],
 'step_size': 0.1}