# $n$-Local Chain: Parallelized Remote QNode Execution

This notebook demonstrates how training and execution of remote qnodes can be parallelized for $n$-local chain networks.
For this demonstration, we use open-access IBM quantum hardware simulators.
These simulators have relatively short queue times compared to the available quantum computers, yet they still reveal the advantages of parallelizing the web requests used to invoke the remote IBM Q services.

In [1]:
import pennylane as qml
from pennylane import numpy as np

from context import qnetvo as QNopt

In [2]:
from qiskit import IBMQ

# For details regarding integration between PennyLane and IBM Q,
# see https://pennylaneqiskit.readthedocs.io/en/latest/devices/ibmq.html#accounts-and-tokens
provider = IBMQ.load_account()

## Setup

For simplicity, we consider the bilocal chain network in which two static Bell states are prepared and local qubit measurements are parameterized over the $xz$-plane.
The qnodes are executed/trained remotely on the `ibmq_qasm_simulator`.
In certain cases, qnode execution is performed locally for greater efficiency.

In [3]:
prep_nodes = [
    QNopt.PrepareNode(1, [0,1], QNopt.ghz_state, 0),
    QNopt.PrepareNode(1, [2,3], QNopt.ghz_state, 0)
]
meas_nodes = [
    QNopt.MeasureNode(2, 2, [0], QNopt.local_RY, 1),
    QNopt.MeasureNode(2, 2, [1, 2], QNopt.local_RY, 2),
    QNopt.MeasureNode(2, 2, [3], QNopt.local_RY, 1),
]

# executes on local device
local_sim_chain_ansatz = QNopt.NetworkAnsatz(prep_nodes, meas_nodes)

dev_ibm_qasm = {
    "name" : "qiskit.ibmq",
    "shots" : 2000,
    "backend" : "ibmq_qasm_simulator",
    "provider" : provider
}

# executes on IBM hardware simulator
ibm_sim_chain_ansatz = QNopt.NetworkAnsatz(
    prep_nodes, meas_nodes, dev_kwargs = dev_ibm_qasm
)

## QNode Execution Parallelization

We now demonstrate the performance gains obtained by parallelizing qnode execution across IBM Q hardware simulator devices.
In this example, the bilocal chain cost function requires 8 remote qnode executions, which are first run serially and then in parallel.

### Benchmark: Local QNode Execution

As a benchmark, we first execute the bilocal chain simulation locally. Due to the small circuit size, this is very efficient to compute even when circuit executions are done serially.

In [4]:
%%time

local_chain_cost = QNopt.nlocal_chain_cost_22(local_sim_chain_ansatz)

np.random.seed(13)
rand_settings = local_sim_chain_ansatz.rand_scenario_settings()

local_chain_cost(rand_settings)

CPU times: user 40.3 ms, sys: 3.43 ms, total: 43.7 ms
Wall time: 43.7 ms


-0.4091725165341098

### Serial Remote QNode Execution

By default, PennyLane chains the web requests that invoke remote qnode execution serially.
This is a significant inefficiency given that each of these qnode executions is independent of the others.
In total, we run 8 quantum circuits to collect enough data to evaluate the bilocal Bell inequality cost function.

In [5]:
%%time

chain_cost = QNopt.nlocal_chain_cost_22(ibm_sim_chain_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

chain_cost(rand_settings)

CPU times: user 714 ms, sys: 60 ms, total: 774 ms
Wall time: 37.4 s


-0.40430852320829

### Parallel Remote QNode Execution

A speedup of roughly 3x (37.4 s serial versus 11.2 s parallel in the runs shown here) is obtained by parallelizing the qnode execution across 4 separate threads.
This speedup persists even when a single remote device runs all qnode executions because the parallelized web requests populate the queue sooner than serial web requests.
Note that the current implementation performs two batches of four parallel circuit executions. The work is split into two batches because IBM allows no more than 5 parallel requests per user.

In [6]:
%%time

parallel_chain_cost = QNopt.nlocal_chain_cost_22(
    ibm_sim_chain_ansatz, parallel=True
)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

parallel_chain_cost(rand_settings)

CPU times: user 753 ms, sys: 70.5 ms, total: 823 ms
Wall time: 11.2 s


-0.45880553786673633
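
The batching idea described above can be sketched with the Python standard library alone. This is a minimal illustration and not the `qnetvo` implementation: `fake_remote_circuit` is a hypothetical stand-in that sleeps to mimic web-request latency, and `run_in_batches` is a helper name assumed for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_remote_circuit(circuit_id, latency=0.05):
    """Hypothetical stand-in for a remote qnode call: wait, then return a value."""
    time.sleep(latency)  # mimics queue and network latency
    return circuit_id * 0.1


def run_in_batches(fn, inputs, batch_size=4):
    """Run fn over inputs in consecutive batches, one thread per item in a batch."""
    results = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            # pool.map preserves the order of the inputs
            results.extend(pool.map(fn, batch))
    return results


circuit_ids = list(range(8))  # 8 circuits, as in the bilocal cost function

t0 = time.perf_counter()
serial_results = [fake_remote_circuit(i) for i in circuit_ids]
serial_time = time.perf_counter() - t0

t0 = time.perf_counter()
parallel_results = run_in_batches(fake_remote_circuit, circuit_ids, batch_size=4)
parallel_time = time.perf_counter() - t0

print(f"serial: {serial_time:.2f} s, parallel: {parallel_time:.2f} s")
```

Threads suit this workload because each call spends nearly all of its time waiting on a remote response rather than computing. A batch size of 4 keeps the number of simultaneous requests under the limit of 5 mentioned above.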

## QNode Gradients on Remote Hardware

In this section, we demonstrate the performance gains obtained by parallelizing the gradient computation of remote qnodes.
The gradient is evaluated using the parameter shift rule.
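
As a reminder of what the parameter-shift rule computes, here is a minimal sketch in plain Python. It assumes a toy cost built from $\cos$ terms, which is the form of an expectation value produced by single-qubit rotations such as `RY`. The names `toy_cost` and `parameter_shift_grad` are hypothetical and are not part of `qnetvo` or PennyLane.

```python
import math


def toy_cost(params):
    """Toy cost: a product of cosines, one factor per rotation angle (assumed form)."""
    return math.cos(params[0]) * math.cos(params[1])


def parameter_shift_grad(f, params, shift=math.pi / 2):
    """Gradient of f from two shifted evaluations per parameter."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += shift
        minus[i] -= shift
        # With shift = pi/2 the denominator equals 2
        grad.append((f(plus) - f(minus)) / (2 * math.sin(shift)))
    return grad


params = [0.3, 1.1]
grad = parameter_shift_grad(toy_cost, params)
exact = [
    -math.sin(params[0]) * math.cos(params[1]),
    -math.cos(params[0]) * math.sin(params[1]),
]
print(grad, exact)
```

Each parameter costs two extra evaluations of the cost, and every shifted evaluation is independent of the others, which is what makes the gradient a natural target for parallelization.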


### Benchmark: Local Serial Gradient Computation

As a benchmark, we iterate one step of gradient descent, computing the gradient locally using the parameter-shift rule. Note that the quantities evaluated on the remote simulator deviate from the benchmark due to finite shot statistics.

In [7]:
%%time

local_sim_chain_cost = QNopt.nlocal_chain_cost_22(local_sim_chain_ansatz)

np.random.seed(13)
rand_settings = local_sim_chain_ansatz.rand_scenario_settings()

serial_local_opt_dict = QNopt.gradient_descent(
    local_sim_chain_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1
)

print("max score : ", serial_local_opt_dict["opt_score"])

iteration :  0 , score :  0.4091725165341098
elapsed time :  0.044850826263427734
max score :  0.4947684708958426
CPU times: user 97.2 ms, sys: 7.16 ms, total: 104 ms
Wall time: 102 ms
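
To give a sense of scale for the finite-statistics deviations mentioned above, the following sketch estimates a single $\pm 1$-valued expectation value from 2000 shots, the shot count configured for the remote device. It is only an illustration: the cost function in this notebook combines several such correlators, so its actual statistical error will differ. The value `exact = -0.41` and the helper names are assumptions chosen for the example.

```python
import math
import random


def sample_expval(exact, shots, rng):
    """Estimate an expectation value from `shots` simulated +1/-1 outcomes."""
    p_plus = (1 + exact) / 2  # probability of observing +1
    total = sum(1 if rng.random() < p_plus else -1 for _ in range(shots))
    return total / shots


def standard_error(exact, shots):
    """Standard deviation of the sample mean of a +1/-1 variable."""
    return math.sqrt((1 - exact**2) / shots)


rng = random.Random(13)
exact = -0.41  # assumed value, chosen only for illustration
shots = 2000

estimate = sample_expval(exact, shots, rng)
sigma = standard_error(exact, shots)
print(f"estimate: {estimate:.4f}, standard error: {sigma:.4f}")
```

With 2000 shots the standard error of one such expectation value is about 0.02, so differences in the second decimal place between local and remote results are expected.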


### Serial Remote Gradient Computation

By default, all cost functions and gradients are evaluated serially on remote hardware, resulting in 88 total web requests.
Of these, 16 evaluate the cost function before and after the gradient evaluation.
An additional 8 web requests evaluate the $I_{22}$ and $J_{22}$ quantities during the gradient computation.
Finally, the parameter-shift rule with 4 parameters accounts for the remaining 64 circuit evaluations.
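
The request counts quoted above can be tallied in a few lines. The factorization of the 64 parameter-shift evaluations into 8 circuits, 4 parameters per circuit, and 2 shifts per parameter is an assumption of this sketch that matches the stated total. The variable names are illustrative.

```python
circuits_per_cost = 8    # circuits needed for one cost evaluation (stated above)
cost_evaluations = 2     # cost is evaluated before and after the gradient step
params_per_circuit = 4   # assumption: each circuit depends on 4 parameters
shifts_per_param = 2     # assumption: forward and backward shift per parameter

cost_requests = cost_evaluations * circuits_per_cost
ij_requests = circuits_per_cost  # I_22 and J_22 evaluated once in the gradient
shift_requests = circuits_per_cost * params_per_circuit * shifts_per_param

total_requests = cost_requests + ij_requests + shift_requests
print(cost_requests, ij_requests, shift_requests, total_requests)
```

The parameter-shift evaluations dominate the total, so they are where parallelization pays off the most.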

In [8]:
%%time

ibm_sim_chain_cost = QNopt.nlocal_chain_cost_22(ibm_sim_chain_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

serial_remote_opt_dict = QNopt.gradient_descent(
    ibm_sim_chain_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1
)

print("max score : ", serial_remote_opt_dict["opt_score"])

iteration :  0 , score :  0.35410250498847456
elapsed time :  322.37982392311096
max score :  0.4952960262985534
CPU times: user 6.84 s, sys: 581 ms, total: 7.42 s
Wall time: 6min 33s


### Parallel Remote Gradient Computation

To speed up the gradient computation, we first evaluate the cost function locally, saving 16 web requests.
Second, we parallelize the $I_{22}$ and $J_{22}$ evaluations across 4 threads each. Finally, the remaining 64 web requests are parallelized using 4 threads.
This yields roughly a 4x improvement in training time (6 min 33 s serial versus 1 min 28 s parallel in the runs shown here).

In [9]:
%%time

local_sim_chain_cost = QNopt.nlocal_chain_cost_22(local_sim_chain_ansatz)
parallel_grad_fn = QNopt.parallel_nlocal_chain_grad_fn(
    ibm_sim_chain_ansatz, diff_method="parameter-shift"
)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

parallel_remote_opt_dict = QNopt.gradient_descent(
    local_sim_chain_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1,
    grad_fn=parallel_grad_fn
)

print("max score : ", parallel_remote_opt_dict["opt_score"])

iteration :  0 , score :  0.4091725165341098
elapsed time :  88.0086669921875
max score :  0.4907867582069676
CPU times: user 6.54 s, sys: 591 ms, total: 7.13 s
Wall time: 1min 28s
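
The strategy used here, dispatching all shifted evaluations to a small thread pool and then assembling the gradient, can be sketched as follows. This is a hedged illustration and not the code behind `parallel_nlocal_chain_grad_fn`: `slow_cost` is a hypothetical stand-in for a remote evaluation, and the other names are assumed for the example.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def slow_cost(params, latency=0.02):
    """Hypothetical stand-in for a remote cost evaluation with request latency."""
    time.sleep(latency)
    return math.cos(params[0]) * math.cos(params[1])


def shifted(params, index, delta):
    """Return a copy of params with one entry shifted by delta."""
    out = list(params)
    out[index] += delta
    return out


def parallel_parameter_shift(f, params, num_threads=4):
    """Evaluate all +/- pi/2 shifted costs in a thread pool, then combine them."""
    shift = math.pi / 2
    jobs = []
    for i in range(len(params)):
        jobs.append(shifted(params, i, +shift))
        jobs.append(shifted(params, i, -shift))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        values = list(pool.map(f, jobs))  # order matches jobs
    return [(values[2 * i] - values[2 * i + 1]) / 2 for i in range(len(params))]


params = [0.3, 1.1]
grad = parallel_parameter_shift(slow_cost, params)

# One plain gradient descent step on the toy cost
step_size = 0.1
new_params = [p - step_size * g for p, g in zip(params, grad)]
print(grad, new_params)
```

Because the jobs are built up front and collected in order, the combination step is identical to the serial rule. Only the waiting overlaps.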


### Parallel Natural Gradient Descent Optimization Step
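
For orientation, natural gradient descent in its standard form rescales the gradient by the pseudo-inverse of a metric tensor before taking a step. The symbols below ($\theta_t$ for the settings, $\eta$ for the step size, $C$ for the cost, $g$ for the metric tensor) are notation assumed for this sketch. How `natural_gradient=True` is realized inside `parallel_nlocal_chain_grad_fn` is not shown in this notebook.

$$
\theta_{t+1} = \theta_t - \eta \, g^{+}(\theta_t) \, \nabla C(\theta_t)
$$

Setting $g$ to the identity recovers the ordinary gradient descent step used in the previous section.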

In [10]:
%%time

local_sim_chain_cost = QNopt.nlocal_chain_cost_22(local_sim_chain_ansatz)
natural_grad_fn = QNopt.parallel_nlocal_chain_grad_fn(
    ibm_sim_chain_ansatz, natural_gradient=True, diff_method="parameter-shift"
)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

nat_grad_remote_opt_dict = QNopt.gradient_descent(
    local_sim_chain_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.5,
    grad_fn=natural_grad_fn
)

print("max score : ", nat_grad_remote_opt_dict["opt_score"])

iteration :  0 , score :  0.4091725165341098
elapsed time :  95.79748916625977
max score :  0.8289509758224339
CPU times: user 6.91 s, sys: 610 ms, total: 7.52 s
Wall time: 1min 35s
