# $n$-Local Chain: Parallelized Remote QNode Execution

This notebook demonstrates how training and execution of remote qnodes can be parallelized for $n$-local chain networks.
For this demonstration, we use open-access IBM quantum hardware simulators.
These simulators have relatively short queue times compared with the available quantum computers, yet they still reveal the advantage of parallelizing the web requests that invoke the remote IBM Q services.

In [1]:
import pennylane as qml
from pennylane import numpy as np

from context import QNetOptimizer as QNopt

In [2]:
from qiskit import IBMQ

# For details regarding integration between PennyLane and IBM Q,
# see https://pennylaneqiskit.readthedocs.io/en/latest/devices/ibmq.html#accounts-and-tokens
provider = IBMQ.load_account()

## Setup

For simplicity, we consider the bilocal chain network in which two static Bell states are prepared and the local qubit measurements are parameterized over the $xz$-plane.
The qnodes are executed/trained remotely on the `ibmq_qasm_simulator`.
In certain cases, qnode execution is performed locally for greater efficiency.
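
As a sketch of what "parameterized over the $xz$-plane" means, assume that `QNopt.local_RY` applies a rotation $R_y(\theta)$ to each measured qubit before a computational-basis ($Z$) measurement. This is inferred from the function name and is not confirmed by the text. Under that assumption, the effective observable is

$$
R_y(\theta) = e^{-i\theta Y/2}, \qquad R_y(\theta)^\dagger \, Z \, R_y(\theta) = \cos\theta \, Z - \sin\theta \, X .
$$

Sweeping $\theta$ therefore moves the measured observable through the $xz$-plane of the Bloch sphere, with $\theta = 0$ recovering a plain $Z$ measurement.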

In [3]:
prep_nodes = [
    QNopt.PrepareNode(1, [0,1], QNopt.ghz_state, 0),
    QNopt.PrepareNode(1, [2,3], QNopt.ghz_state, 0)
]
meas_nodes = [
    QNopt.MeasureNode(2, 2, [0], QNopt.local_RY, 1),
    QNopt.MeasureNode(2, 2, [1, 2], QNopt.local_RY, 2),
    QNopt.MeasureNode(2, 2, [3], QNopt.local_RY, 1),
]

dev_ibm_qasm = {
    "name" : "qiskit.ibmq",
    "shots" : 4000,
    "backend" : "ibmq_qasm_simulator",
    "provider" : provider
}

local_sim_chain_ansatz = QNopt.NetworkAnsatz(prep_nodes, meas_nodes)
ibm_sim_chain_ansatz = QNopt.NetworkAnsatz(
    prep_nodes, meas_nodes, dev_kwargs = dev_ibm_qasm
)

## QNode Execution Parallelization

We now demonstrate the performance gains granted by parallelizing qnode execution across IBM Q hardware simulator devices.
In this example, the bilocal chain cost function requires 4 remote qnode executions, which are run first serially and then in parallel.

### Ideal Cost Evaluation

First, we demonstrate the evaluation of the cost function on the local `"default.qubit"` simulator.

In [4]:
%%time

local_chain_cost = QNopt.nlocal_chain_cost_22(local_sim_chain_ansatz)

np.random.seed(13)
rand_settings = local_sim_chain_ansatz.rand_scenario_settings()

local_chain_cost(rand_settings)

CPU times: user 30 ms, sys: 3.54 ms, total: 33.5 ms
Wall time: 31.4 ms


-0.4091725165341098

### Non-Parallelized Remote QNode Execution

By default, PennyLane issues the web requests that invoke remote qnode execution serially.
This is a significant inefficiency given that the qnode executions are independent of each other.

In [5]:
%%time

serial_chain_cost = QNopt.nlocal_chain_cost_22(ibm_sim_chain_ansatz)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

serial_chain_cost(rand_settings)

CPU times: user 711 ms, sys: 58.8 ms, total: 769 ms
Wall time: 38.7 s


-0.3560966626228658

### Parallelized Remote QNode Execution

Parallelizing the qnode execution across 4 separate threads yields a speedup of roughly 2-4x (in the runs shown here, 38.7 s serially versus 10.6 s in parallel).
This speedup persists even when a single remote device runs all qnode executions because the parallelized web requests populate the queue sooner than serial web requests.
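
The threading pattern can be illustrated without any quantum backend. The following is a minimal sketch, not the `QNopt` implementation: `fake_remote_qnode` is a hypothetical stand-in that sleeps to mimic web-request latency, and the standard-library `ThreadPoolExecutor` dispatches the four independent calls concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_remote_qnode(x):
    """Hypothetical stand-in for a remote qnode call; sleeps to mimic latency."""
    time.sleep(0.2)
    return x * x


inputs = [0, 1, 2, 3]  # four independent "qnode executions"

# Serial execution: each request waits for the previous one to finish.
start = time.perf_counter()
serial_results = [fake_remote_qnode(x) for x in inputs]
serial_time = time.perf_counter() - start

# Parallel execution: all four requests are in flight at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(fake_remote_qnode, inputs))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f} s, parallel: {parallel_time:.2f} s")
```

Threads suit this workload because the time is spent waiting on the network and the remote queue rather than on local computation.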



In [6]:
%%time

parallel_chain_cost = QNopt.nlocal_chain_cost_22(
    ibm_sim_chain_ansatz, parallel=True
)

np.random.seed(13)
rand_settings = ibm_sim_chain_ansatz.rand_scenario_settings()

parallel_chain_cost(rand_settings)

[[tensor([], shape=(1, 0), dtype=float64, requires_grad=True), tensor([], shape=(1, 0), dtype=float64, requires_grad=True)], [tensor([[ 1.74485571],
        [-1.64907715]], requires_grad=True), tensor([[ 2.03750211,  2.92638852],
        [ 2.96944038, -0.292487  ]], requires_grad=True), tensor([[0.685134  ],
        [1.73118415]], requires_grad=True)]]
CPU times: user 801 ms, sys: 71.1 ms, total: 872 ms
Wall time: 10.6 s


-0.4663243690497446

## QNode Gradients on Remote Hardware

In this section, we demonstrate the performance gains granted by parallelizing the gradient computation of remote qnodes.
The gradient is evaluated using the parameter shift rule.
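
For readers unfamiliar with the rule, here is a minimal sketch of the two-term parameter-shift rule on a single-qubit toy model. It assumes a circuit that applies $R_y(\theta)$ to $|0\rangle$ and measures $Z$, so the expectation value is $\cos\theta$. The function names are illustrative and are not part of `QNopt` or PennyLane.

```python
import math


def expval_z(theta):
    """<Z> after RY(theta) acts on |0>, which equals cos(theta)."""
    return math.cos(theta)


def parameter_shift_grad(f, theta):
    """Two-term parameter-shift rule with shifts of +/- pi/2."""
    shift = math.pi / 2
    return (f(theta + shift) - f(theta - shift)) / 2


theta = 0.7
grad = parameter_shift_grad(expval_z, theta)
exact = -math.sin(theta)  # analytic derivative of cos(theta)

print(grad, exact)
```

Each parameter costs two circuit evaluations, and the evaluations are independent of one another, which is what makes the gradient a natural target for parallel web requests.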


### Non-Parallelized Remote Gradient Computation

By default, all cost functions and gradients are evaluated serially on remote hardware.
This requires 28 web requests in total to the IBM remote simulator.

In [None]:
%%time

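# NOTE: `ibm_sim_chsh_ansatz` is not defined in this notebook; it is assumed to be
# a CHSH-scenario `QNopt.NetworkAnsatz` constructed with `dev_kwargs=dev_ibm_qasm`.
ibm_sim_chsh_cost = QNopt.chsh_inequality_cost(ibm_sim_chsh_ansatz)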

np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_scenario_settings()

QNopt.gradient_descent(
    ibm_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1
)

### Parallelized Remote Gradient Computation

To speed up the gradient computation, we parallelize the parameter-shift rule across four threads of web requests.
Only 20 web requests to the remote IBM simulator are needed in total because evaluating the cost function locally drops 8 requests from the previous example.
Training is thus reduced to 20 web requests split across four independent threads.
We see a roughly 4x improvement in training time.
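
The request counts above can be turned into a back-of-the-envelope estimate. The sketch below is an idealized count, assuming every web request takes the same time and ignoring queueing and thread overhead; only the numbers 8, 20, and 4 come from the text, and the variable names are illustrative.

```python
import math

cost_requests = 8       # cost evaluations, moved to the local simulator
gradient_requests = 20  # gradient evaluations kept on the remote device
num_threads = 4

# Serial case: every request is remote and waits for the previous one.
serial_total = cost_requests + gradient_requests

# Parallel case: remote requests are issued in rounds of `num_threads`.
parallel_rounds = math.ceil(gradient_requests / num_threads)

# Idealized upper bound on the speedup.
ideal_speedup = serial_total / parallel_rounds

print(serial_total, parallel_rounds, ideal_speedup)
```

This bound is optimistic; overhead from threading and contention in the remote queue would account for a smaller observed improvement.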

In [None]:
%%time

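# NOTE: `local_sim_chsh_ansatz` and `ibm_sim_chsh_ansatz` are not defined in this notebook;
# they are assumed to be the same CHSH-scenario ansatz on the local and IBM devices, respectively.
local_sim_chsh_cost = QNopt.chsh_inequality_cost(local_sim_chsh_ansatz)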
parallel_grad_fn = QNopt.parallel_chsh_grad(
    ibm_sim_chsh_ansatz, diff_method="parameter-shift"
)


np.random.seed(13)
rand_settings = ibm_sim_chsh_ansatz.rand_scenario_settings()

QNopt.gradient_descent(
    local_sim_chsh_cost,
    rand_settings,
    num_steps=1,
    sample_width=1,
    step_size=0.1,
    grad_fn=parallel_grad_fn
)