# Qubit routing using Graph neural network aided Monte Carlo tree search

Animesh Sinha,\* Utkarsh Azad,† and Harjinder Singh‡
Center for Computational Natural Sciences and Bioinformatics,
International Institute of Information Technology, Hyderabad.

(Dated: April 1, 2021)

Near-term quantum hardware can support two-qubit operations only on the qubits that can interact with each other. Therefore, to execute an arbitrary quantum circuit on the hardware, compilers have to first perform the task of qubit routing, i.e., to transform the quantum circuit either by inserting additional SWAP gates or by reversing existing CNOT gates to satisfy the connectivity constraints of the target topology. We propose a procedure for qubit routing that is architecture agnostic and that outperforms other available routing implementations on various circuit benchmarks. The depth of the transformed quantum circuits is minimised by utilizing the Monte Carlo tree search to perform qubit routing, aided by a Graph neural network that evaluates the value function and action probabilities for each state.

#### I. INTRODUCTION

The present-day quantum computers, more generally known as Noisy Intermediate-Scale quantum (NISQ) devices [1] come in a variety of hardware architectures [2–5], but there exist a few problems pervading across all of them. These problems constitute the poor quality of qubits, limited connectivity between qubits, and the absence of error-correction for noise-induced errors encountered during the execution of gate operations. These place a considerable restriction on the number of instructions that can be executed to perform useful quantum computation [1]. Collectively these instructions can be realized as a sequential series of one or two-qubit gates that can be visualized more easily as a quantum circuit as shown in Fig. 1a [6].

To execute an arbitrarily given quantum circuit on the target quantum hardware, a compiler routine must transform it to satisfy the connectivity constraints of the topology of the hardware [7]. These transformations usually include the addition of SWAP gates and the reversal of existing CNOT gates. This ensures that any non-local quantum operations are performed only between the qubits that are nearest-neighbors. This process of circuit transformation by a compiler routine for the target hardware is known as qubit routing [7]. The output instructions in the transformed quantum circuit should follow the connectivity constraints and essentially result in the same overall unitary evolution as the original circuit [8].

In the context of NISQ hardware, this procedure is of extreme importance as the transformed circuit will, in general, have higher depth due to the insertion of extra SWAP gates. This overhead in the circuit depth becomes more prominent due to the high decoherence rates of the qubits and it becomes essential to find the most optimal

and efficient strategy to minimize it [7–9]. In this article, we present a procedure that we refer to as *QRoute*. In this procedure, we use Monte Carlo tree search (MCTS), which is a look-ahead search algorithm for finding optimal decisions in the decision space guided by a heuristic evaluation function [10–12]. We use it for explicitly searching the decision space for depth minimization and as a stable and performant machine learning setting. It is aided by a Graph neural network (GNN) [13], which is a neural network architecture that is being used to learn and evaluate the heuristic function that will help guide the MCTS.

Structure: In Section II, we introduce the problem of qubit routing, the previous works that are done in the field, and show how our work differs from them. Next, the methodology of the QRoute algorithm is described in Section III. Then, in Section IV, we benchmark the performance of our algorithm against other state-of-theart quantum compilers. Finally, we discuss our results and possible improvements in Section V.

# II. QUBIT ROUTING

In this section, we begin by defining the problem of qubit routing formally and discussing the work done previously in the field.

# A. Describing the Problem

The topology of quantum hardware can be visualized as a qubit connectivity graph (Fig. 2). Each node in this graph would correspond to a physical qubit which in turn might correspond to a logical qubit depending on whether it is being used in the execution of quantum instruction set or not. The quantum instruction set, which is also referred to as quantum circuit (Fig. 1a), is a sequential series of single-qubit and two-qubit gate operations that act on the logical qubits. Amongst these two type of gate operations, two-qubit gates such as CNOT can only be performed between two logical qubits iff there exists an

<sup>\*</sup> animesh.sinha@research.iiit.ac.in

 $<sup>^{\</sup>dagger}$ utkarsh.azad@research.iiit.ac.in

<sup>&</sup>lt;sup>‡</sup> laltu@iiit.ac.in



FIG. 1: An example of qubit routing on a quantum circuit for  $3\times3$  grid architecture (Figure 2a). (a) For simplicity, the original quantum circuit consists only of two-qubit gate operations. (b) Decomposition of the original quantum circuit into series of slices such that all the instructions present in a slice can be executed in parallel. Here, the two-qubit gate operations:  $\{d,e\}$  (green) comply with the topology of the grid architecture whereas the operations:  $\{a,b,c,f\}$  (red) does not comply with the topology (and therefore cannot be performed). Note that the successive two-qubit gate operations on  $q_3 \to q_4$  (blue) are redundant and are not considered while routing. (c) Decomposition of the transformed quantum circuit we get after qubit routing. Four additional SWAP gates are added that increased the circuit depth to 5, i.e., an overhead circuit depth of 2. Here, final qubit labels are represented at the end right side of the circuit. The qubits that are not moved (or swapped) are shown in brown ( $\{q_1,q_5\}$ ), while the rest of them are shown in blue. Note that the overall unitary operation performed by the circuit is preserved despite the changes in the order of two-qubit gate operations



(a) 3×3 grid architecture

(b) IBMQX-20 architecture

FIG. 2: Examples of qubit connectivity graphs for some common quantum architectures

edge between the nodes that correspond to the physical qubits, [9]. This edge could be either unidirectional or bidirectional, i.e., CNOT can be performed either in one direction or in both directions. In this work, we consider only the bidirectional case, while noting that the direction of a CNOT gate can be reversed by sandwiching it between pair of Hadamard gates acting on both control and target qubits [14].

Now, given a target hardware topology and a quantum circuit, the task of qubit routing refers to transforming this quantum circuit by adding series of SWAP gates such

that all its gate operations then satisfy the connectivity constraints of the target topology  $\mathcal{D}$  (Fig. 1c). Formally, for a routing algorithm R, we can represent this process as follows:

$$R(C, D) \to C'$$
 (1)

Depth of  $\mathcal{C}'$  (transformed quantum circuit) will, in general, be more than that of the original circuit due to the insertion of additional SWAP gates. This comes from the definition of the term depth in the context of quantum circuits. This can be understood by decomposing a quantum circuit into series of individual slices, each of which contains a group of gate operations that have no overlapping qubits, i.e., all the instructions present in a slice can be executed in parallel (Fig. 1b). The depth of the quantum circuit then refers to the minimum number of such slices the circuit can be decomposed into, i.e., the minimum amount of parallel executions needed to execute the circuit. The goal is to minimize the overhead depth of the transformed circuit with respect to the original circuit.

This goal involves solving two subsequent problems of (i) qubit allocation, which refers to the mapping of program qubits to logic qubits, and (ii) qubit movement, which refers to routing qubits between different locations such that interaction can be made possible [15]. In this

work, we focus on the latter problem of qubit movement only and refer to it as qubit routing. However, it should be noted that qubit allocation is also an important problem and it can play an important role in minimizing the effort needed to perform qubit movement.

## B. Related Work

The first major attraction for solving the qubit routing problem was the competition organized by IBM in 2018 to find the best routing algorithm. This competition was won by Zulehner et al. [16], for developing a Computer Aided Design-based (CAD) routing strategy. Since then, this problem has been outlined and presented in many different ways. These include graph-based architecture-agnostic solution by Cowtan et al. [7] and reinforcement learning in a combinatorial action space solution by Herbert and Sengupta [9]. In the latter, they also proposed the use of simulated annealing to search through the combinatorial action space, aided by a Feed-Forward neural network to judge the long-term expected depth. This was further extended to use Double Deep Q-learning and prioritized experience replay by Pozzi et al. [8].

Recently, Monte Carlo tree search (MCTS), a popular reinforcement learning algorithm [17] previously proven successful in a variety of domains like playing puzzle games such as Chess and Go [18], and was used by Zhou et al. [19] to develop a qubit routing solution. However, they used MCTS in the context of minimizing the total volume of quantum circuits (i.e., the number of gates ignoring the parallelization).

### C. Our Contributions

As mentioned before, in this work we propose to use the Monte Carlo tree search (MCTS), a reinforcement learning (RL) algorithm, for the task of total depth minimization. In the context of RL, we use an array of mutex locks to represent the information that an agent needs to know to make effective use of parallelization. Moreover, we propose a Graph neural network (GNN) that aids the MCTS in its exploration by evaluating function and action probabilities throughout the routing procedure. From these additional features, we aim to gain an agent that is not greedy immediately since it can only see the current state of the system and not effectively plan for the future. This also stabilizes the decision process which would have suffered from randomness if a method such as simulated annealing was used. Finally, we provide a simple python package that can be used to test and visualize routing algorithms with different neural net architectures, combining algorithms, reward structures, etc. [20]

### III. METHOD

The QRoute algorithm takes in an input circuit and an injective map,  $\mathcal{M}:Q\to N$ , from logical qubits to nodes (physical qubits). Iteratively, over multiple timesteps, it tries to schedule the gate operations that are present in the input circuit onto the target hardware. To do so, from the set of unscheduled gate operations,  $\mathcal{P}$ , it takes all the current operations, which are the first unscheduled operation for both the qubits that they act on, and try to make them into local operations, which are those two-qubit operations that involve qubits that are mapped to nodes connected on the target hardware.

In every timestep t, QRoute starts by greedily scheduling all the operations that are both current and local in  $\mathcal{P}$ . To evolve  $\mathcal{M}$ , it then performs a Monte Carlo tree search (MCTS) to find an optimal set of SWAPs by the evaluation metrics described in the Section III B such that all operations in the current timestep put together form a parallelizable set, i.e., a set of local operations such that no two operations in the set act on the same qubit. The number of states we can encounter in the action space explodes exponentially with the depth of our search, therefore an explicit search till the circuit is done compiling is not possible. Therefore we cut short our search at some shallow intermediate state, and use a neural network to get its heuristic evaluation.

The following subsections describe in greater detail the working of the search and the heuristic evaluation.

# A. Search and Action Space

**Definition III.1 (State)** It captures entire specification of the state of compilation at some timestep t. Abstractly, it is described as:

$$s_t = (\mathcal{C}, \ \mathcal{D}, \ \mathcal{M}_t, \ \mathcal{P}_t, \ \mathcal{L}_t)$$

where, the C is the list of gate operations from the input circuit and D is the topology of target hardware.  $\mathcal{M}_t$  and  $\mathcal{P}_t$  represents the current values of  $\mathcal{M}$  and  $\mathcal{P}$  respectively.  $\mathcal{L}_t$  is the set of nodes that are locked by the gate operations from the previous timestep and therefore cannot be operated in the current timestep.

Definition III.2 (Action) It is a set of SWAP gates (represented by the pair of qubits it acts on) such that all gates are local, and its union with the set of operations that were scheduled in the same timestep forms a parallelizable set.

We are performing a tree search over state-action pairs. Since the number of actions that can be taken at any timestep is exponential in the number of connections on the hardware, we are forced to build a single action up, step-by-step.



FIG. 3: Iteration of a Monte Carlo tree search: (i) select - recursively choosing a node in the search tree for exploration using a selection criteria, (ii) expand - expanding the tree using the available actions if an unexplored leaf node is selected, (iii) rollout - estimating long term reward using a neural network for the action-state pair for the new node added to search tree, and (iv) backup - propagating reward value from the leaf nodes upwards to update the evaluations of its ancestors.

**Definition III.3 (Move)** It is a single step in a search procedure which either builds up the action or applies it to the current state. Moves are of the following two types:

- SWAP(n<sub>1</sub>, n<sub>2</sub>): Inserts a new SWAP on nodes n<sub>1</sub> and n<sub>2</sub> into the action set. Such an insertion is only possible if the operation is local and resulting set of operations for the timestep form a parallel set.
- 2. COMMIT: Finishes the construction of the action set for that timestep. It also uses the action formed until now to update the state  $s_t$  (schedules the SWAP gates on the hardware), and resets the action set for the next step.

In reality, different gate operation takes different amounts of timesteps for execution. For example, if a hardware requires SWAP gate to be broken down into CNOT gates, then it would take three timesteps for complete execution [14]. This means, operations which are being scheduled must maintain mutual exclusivity with other other operations over the nodes which participates

in them. This is essential to minimizing the depth of the circuit since it models paralleizability of operations.

However, constructing a parallelizable set and representing the state of parallelization to our heuristic evaluator is a challenge. But an analogy can be drawn here to the nodes being thought of as "resources" that cannot be shared, and the operations as "consumers" [21]. This motivates us to propose using Mutex Locks for this purpose. These will lock a node until a scheduled gate operation involving that node executes completely. Therefore, this allows our framework to naturally handle different types of operations which take different amounts of time to complete.

For every state-action pair, the application of a feasible move m on it will result in a new state-action pair:  $(s,a) \xrightarrow{m} (s',a')$ . Associated with each such state-action-move pair ((s,a),m), we maintain two additional values that are used by MCTS:

- (i) N-value The number of times we have taken the said move m from said state-action pair (s, a).
- (ii) Q-value Given a reward function  $\mathcal{R}$ , it is the average long-term reward expected after taking said move m over all iterations of the search. (Future

rewards are discounted by a factor  $\gamma$ )

$$Q((s,a),m) = \mathcal{R}((s,a),m) + \frac{\sum_{m'} N((s',a'),m') \cdot Q((s',a'),m')}{\sum_{m'} N((s',a'),m')}$$
(2)

#### B. Monte Carlo Tree Search

Monte Carlo tree search progresses iteratively by executing its four phases of select, expand, rollout, and backup as illustrated in Fig. 3. In each iteration, it begins traversing down the existing search tree by selecting the node with the maximum UCT value (Eq. 3) at each level. During this traversal, whenever it encounters a leaf node, it expands the tree by choosing a move m from that leaf node. Then, it estimates the scalar evaluation for the new state-action pair and backpropagates it up the tree to update evaluations of its ancestors.

To build an optimal action set, we would want to select the move m with the maximum true Q-value. But since true Q-values are intractably expensive to compute, we can only approximate the Q-values through efficient exploration. We use the Upper Confidence Bound on Trees (UCT) objective [12] to balance exploration and exploitation as we traverse through the search tree. Moreover, as this problem results in a highly assymetric tree, since some move block a lot of other moves, while others block fewer moves, we use the formulation of UCT adapted for assymetric trees [22]:

$$UCT((s,a),m) = Q((s,a),m) + c\frac{\sqrt{N((s,a))}}{N((s,a),m)} \times p(m|(s,a))$$
(3)

Here, the value p(m|(s,a)) is the prior policy function, which is obtained by adding a Dirichlet noise to the policy output of the neural network [23]. As MCTS continues probing the action space, it gets a better estimate of the true values of the actions. This means that it act as a policy enhancement function whose output policy (Eq. 4) can be used to train the neural network's prior and average Q-value computed can be used to train its scalar evaluation (Eq. 5).

$$\pi(m|(s,a)) \propto N((s,a),m)) \tag{4}$$

$$\mathcal{V}((s,a)) = \frac{\sum_{m} Q((s,a),m)}{\sum_{m} N((s,a),m)}$$
 (5)

The details of how MCTS progresses have been elaborated in Algo. 1. Once it gets terminated, i.e., the search gets completed, we go down the tree selecting the child with the maximum Q-value at each step until a

# Algorithm 1: Monte Carlo tree search

```
Data: state s_t
 1 Initialize: root \leftarrow (s_t, \text{ empty action set})
 2 loop n_mcts times
 3
        (s, a) \leftarrow root
        repeat
 4
             Compute UCT values using prior + noise
 5
             Select move that maximizes UCT value
 6
             if (s, a).child[move] \neq null then
 7
                 (s, a) \leftarrow (s, a).child[move]
 8
 9
             else
10
                 if move = COMMIT then
11
                      s' \leftarrow \mathbf{step} (s, a)
                      a' \leftarrow \text{empty set}
12
                 else
13
                      s' \leftarrow s
14
                      a' \leftarrow \mathbf{insert} into a the qubit pair
15
                       corresponding to the move
16
                 state.child[move] \leftarrow (s', a')
17
                 store reward[(s, a), move] \leftarrow \mathcal{R}(s', a') -
18
19
            end
20
        until previous move was expand
21
        reward \leftarrow evaluation from model of (s, a)
22
        while (s, a) \neq root do
             p-move \leftarrow move from parent of (s, a) to (s, a)
23
             (s, a) \leftarrow parent in tree of (s, a)
24
25
             reward \leftarrow reward[(s, a), p-move] + \gamma \cdot reward
             update (s, a).Q-value[move] with reward
26
             increment (s, a).N-value[move] by 1
27
28
        end
29 end
30 memorize the Q-values and N-values at the root for
     training the model later
31 (s, a) \leftarrow root
32 repeat
     (s, a) \leftarrow \text{child of } (s, a) \text{ with maximum Q-value}
34 until move \neq COMMIT
35 return a
```

COMMIT action is found, we use the action set of the selected state-action pair to schedule SWAPs for the current timestep, and we re-root the tree at the child node of the COMMIT action to prepare for the next timestep.

# C. Neural Network Architecture

Each iteration of the MCTS requires evaluation of Q-values for a newly encountered state-action pair. But these values are impossible to be computed exactly since it would involve an intractable number of iteration in exploring and expanding the complete search tree. Therefore, it is favorable to heuristically evaluate the expected long-term reward from the state-action pair using a Neural Network, as it acts as an excellent function approximator that can learn the symmetries and general rules inherent to the system.



FIG. 4: Graph Neural Network Architecture with a value head and a policy head

So, once the MCTS sends a state-action pair to the evaluator, it begins by committing the action to the state and getting the resultant state. We then generate the following featurized representation of this state and pass this representation through the neural-network architecture as shown in Fig. 4.

- (i) Node Targets It is a square boolean matrix whose rows and columns correspond to the nodes on a target device. An element (i,j) is true iff some logical qubits  $q_x$  and  $q_y$  are currently mapped to nodes i and j respectively, such that  $(q_x, q_y)$  is the first unscheduled operation that  $q_x$  partakes in.
- (ii) Locked Edges It is a set of edges (pairs of connected nodes) that are still locked due to either of its qubits being involved in an operation in the current timestep or another longer operation that hasn't yet terminated from the previous timesteps.
- (iii) Remaining targets It is a list of the number of gate-operations that are yet to be scheduled for each logical qubit.

The SWAP operations each qubit would partake in depends primarily on its target node, and on those of the nodes in its neighborhood that might be competing for the same resources. It seems reasonable that we can use a Graph Neural Network with the device topology graph for its connectivity since the decision of the optimal SWAP action for some node is largely affected by other nodes in its physical neighborhood. Therefore, our architecture includes an edge-convolution block [13], followed by some fully-connected layers with Swish [24] activations for the policy and value heads. The value function and the policy function computed from this neural network are returned back to the MCTS.

### IV. RESULTS

We compare QRoute against the routing algorithms from other state-of-the-art frameworks on various circuit benchmarks: (i) Qiskit and its three variants [25]: (a) basic, (b) stochastic, and (c) sabre, (ii) Deep-Q-Networks (DQN) from [8], (iii) Cirq [26], and (iv) t|ket\rangle from Cambridge Quantum Computing (CQC) [27]. The algorithm t|ket\rangle uses BRIDGE gates along with SWAP gates and Qiskit uses gate commutation rules while perform qubit routing. These strategies are shown to be advantageous in achieving lower circuit depths [28] but were disabled in our simulations to have a fair comparison. The results for DQN shown are adapted from the data provided by the authors of Pozzi  $et\ al.\ [8]$ , and were not replicated by us.

## A. Random Test Circuits

Our first benchmark for comparing our performance is the random circuits. These circuits are generated on the fly and initialized with the same number of qubits as there are nodes on the device. Then two-qubit gates are put between any pair of qubits chosen at random. In our simulations, the number of such gates is varied from 30 to 150 and the results for assessing performance of different frameworks are given in Fig. 5. The experiments were repeated 10 times on each circuit size, and final results were aggregated over this repetition.

Amongst the frameworks compared, QRoute ranks a very close second only to Deep-Q-Network (DQN) guided simulated annealer, after performing equivalent to it on low layer density circuits. Nevertheless, QRoute still does consistently better than all the other major frameworks:



FIG. 5: Comparative performance of routing algorithms on random circuits as a function of the number of two-qubit operations in the circuit. The results for DQN have been adapted from the CDR data (Eq. 6) received from the authors, and has not been replicated by us.

Qiskit, Cirq and  $t|\text{ket}\rangle$ , despite the increase in the average layer density and it also scales reasonably well. Thus, even though the MCTS is guided reasonably well through the action space by the GNN, in few cases it might still be unable to distinguish between the optimality of two actions differing by few Q-values due to limited number of iterations. This could lead to a few layer errors in the overall procedure which can be attributed as the reason to our close (on an average  $\leq 4$  layers), but not identical performance as DQN on the random circuits with larger layer density.

### B. Small Realistic Circuits

Next we test on the set of all circuits which use 100 or less gates from the IBM-Q realistic quantum circuit dataset used by dat [29]. The comparative performance of all routing frameworks has been shown by plotting the depths of the output circuits summed over all the circuits in the test set in Fig. 6. Since the lack of a good initial qubit allocation becomes a significant problem for all pure routing algorithms on small circuits, we have benchmarked QRoute on this dataset from three trials with different initial allocations.

Our model has the best performance on this dataset. We also compare the best result from a pool of all routers including QRoute against that of another pool of the same routers but excluding QRoute. The pool including QRoute gives on an average 2.5% lower circuit depth, indicating that there is a significant number of circuits where QRoute is the best routing method available.

On this dataset also, closet to QRoute performance is shown by Deep-Q-Network guided simulated annealer. To compare performances, we look at the average circuit



FIG. 6: Plots of output circuit depths of routing algorithms over small realistic circuits ( $\leq 100$  gates), summed over the entire dataset. The inset shows the results on the same data comparing the best performant scheduler excluding and including QRoute on each circuit respectively. The results for DQN have been extrapolated from the CDR data (Eq. 6) received from the authors, and has not been replicated by us.

depth ratio (CDR), which is defined by [8]:

$$CDR = \frac{1}{\text{#circuits}} \sum_{\text{circuits}} \frac{\text{Output Circuit Depth}}{\text{Input Circuit Depth}}$$
 (6)

The resultant CDR for QRoute is 1.178, where as the reported CDR for the DQN is 1.19. In fact, QRoute outperforms DQN on at least 80% of the circuits. This is significant because in contrast to the random circuit benchmark, the realistic benchmarks consist of the circuits that are closer to the circuits used in useful computation.

## C. Large Realistic Circuit

For final benchmark, we took eight large circuits ranging from 154 gates to 5960 gates in its input from the IBM-Quantum realistic test dataset [29]. The results are plotted in Fig. 7. Our model has the best performance of all available routing methods: Qiskit and  $t|ket\rangle$ , on every one of these sampled circuits with on an average 13.6% lower circuit depth, and notable increase in winning difference on the larger circuits.

The results from DQN and Cirq are not available for this benchmarks as they are not designed to scale to such huge circuits. In case of DQN, the CDR data results were not provided for the circuits over 200 gates, mainly because simulated annealing used in it is computationally expensive. Similarly, for Cirq, it takes several days to compile each of the near 5000 qubit circuits. In contrast to them, in our case, MCTS has to return a recommended optimal action at the end of the search iteration. These



FIG. 7: The results over eight circuits sampled from the large realistic dataset benchmark, the outputs of each routing algorithm are shown for every circuit.

recommended actions gradually improve as the statistics about the simulated routing-tree gets updated.

### V. DISCUSSION AND CONCLUSION

In this article, we have shown that the problem of qubit routing has a very powerful and elegant formulation in Reinforcement Learning (RL) which can surpass the results of any classical heuristic algorithm across all sizes of circuits and types of architectures. Furthermore, the central idea of building up solutions step-by-step when searching in combinatorial action spaces and enforcing constraints using mutex locks can be adapted for several other combinatorial optimization problems [30–34].

Our approach is also flexible enough that it allows us to compile circuits of any size onto any device, from small ones like IBMQX20 with 20 qubits, to much larger hardware like Google Sycamore with 53 qubits (the Circuit Depth Ratio on small circuits on Google Sycamore was 1.64,). Our model can also intrinsically deal with hardware with different primitive instruction set, for example on hardware where SWAP gates are not a primitive and they get decomposed to 3 operations. Our approach also enjoys significant tunability; hyperparameters can be changed easily to alter the tradeoff between time and optimality of decisions, exploration and exploitation, etc.

QRoute is a reasonably fast method, taking well under 10 minutes to route a circuit with under 100 operations, and at most 4 hours for those with upto 5000 operations, when tested on personal computer with an i3 processor (3.7 GHz) and no GPU acceleration. Yet more can be desired in terms of speed, given that Qiskit and  $t|ket\rangle$  are a lot faster than QRoute, but it is hard to achieve that without reducing the number of search iterations and trading off a bit of performance. Better, more pre-

dictive neural networks can help squeeze in slightly better speeds.

One of the challenges of methods like DQN, that use Simulated Annealing to build up their actions is that the algorithm cannot plan for the gates which are not yet waiting to be scheduled, those which will come to the head of the list once the gates which are currently waiting are executed [8]. QRoute also shares this deficiency, but the effect of this issue is mitigated by the explicit tree search which takes into account the rewards that will be accrued in the longer-term future. There is scope to further improve this by feeding the entire list of future targets directly into our neural network by using transformer encoders to handle the arbitrary length sequence data. This and other aspects of neural network design will be a primary facet of future explorations.

Another means of improving the performance would be to introduce new actions by incorporating use of BRIDGE gates [28] and gate commutation rules [14] alongside currently used SWAP gates. The advantage of former is that it allows running CNOT gates on non-adjacent qubit without permuting the ordering of the logical qubits; whereas, the latter would allow MCTS to recognize the redundancy in action space, making its exploration and selection more efficient.

Finally, we provide an open-sourced access to our software library [20]. It will allow researchers and developers to implement variants of our methods with minimal effort. We hope that this will aid future research in quantum circuit transformations.

On the whole, the Monte Carlo Tree Search for building up solutions in combinatorial actions spaces has exceeded the current state of art methods that perform qubit routing. Despite its success, we note that QRoute is a primitive implementation of our ideas, and there is great scope of improvement in future.

### ACKNOWLEDGEMENTS

A. S. and U. A. have contributed equally to the manuscript. A. S. and U. A. would like to acknowledge

the assistance from their colleague Bhuvanesh Sridharan for his help in implementing and refining MCTS, Additionally, they would also like to thank their colleagues Jai Bardhan, Kalp Shah for their suggestions.

- [1] J. Preskill, Quantum 2, 79 (2018).
- [2] "IBM Quantum," https://quantum-computing.ibm.com/ (2021).
- [3] F. Arute, K. Arya, R. Babbush, and et al., Nature 574, 505–510 (2019).
- [4] P. J. Karalekas, N. A. Tezak, E. C. Peterson, C. A. Ryan, M. P. da Silva, and R. S. Smith, "A quantum-classical cloud platform optimized for variational hybrid algorithms," Quantum Sci. Technol. 5 024003 (2020) (2020), arXiv:2001.04449.
- [5] J. E. Bourassa, R. N. Alexander, M. Vasmer, A. Patil, I. Tzitrin, T. Matsuura, D. Su, B. Q. Baragiola, S. Guha, G. Dauphinais, K. K. Sabapathy, N. C. Menicucci, and I. Dhand, "Blueprint for a scalable photonic faulttolerant quantum computer," Quantum 5, 392 (2021) (2020), arXiv:2010.02905.
- [6] A. M. Childs, E. Schoute, and C. M. Unsal, in 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2019), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 135 (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019) pp. 3:1–3:24.
- [7] A. Cowtan, S. Dilkes, R. Duncan, A. Krajenbrink, W. Simmons, and S. Sivarajah, in 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2019), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 135 (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019) pp. 5:1-5:32.
- [8] M. G. Pozzi, S. J. Herbert, A. Sengupta, and R. D. Mullins, arXiv e-prints, arXiv:2007.15957 (2020), arXiv:2007.15957 [quant-ph].
- [9] S. Herbert and A. Sengupta, arXiv e-prints arXiv:1812.11619 (2018), arXiv:1812.11619 [quant-ph].
- [10] L. Kocsis and C. Szepesvari, in ECML (2006).
- [11] R. Munos, Found. Trends Mach. Learn. 7, 1 (2014).
- [12] L. Kocsis and C. Szepesvári, in European conference on machine learning (Springer, 2006) pp. 282–293.
- [13] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, arXiv e-prints, arXiv:1801.07829 (2018), arXiv:1801.07829 [cs.CV].
- [14] J. C. Garcia-Escartin and P. Chamorro-Posada, arXiv eprints, arXiv:1110.2998 (2011), arXiv:1110.2998 [quantph].
- [15] S. S. Tannu and M. K. Qureshi, in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19 (Association for Computing Machinery, New York, NY, USA, 2019) p. 987–999.
- [16] A. Zulehner, A. Paler, and R. Wille, IEEE Transactions on Computer-Aided Design of Integrated Circuits and

- Systems 38, 1226 (2019).
- [17] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, IEEE Transactions on Computational Intelligence and AI in Games 4, 1 (2012).
- [18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, and et al., Nature 529, 484 (2016).
- [19] X. Zhou, Y. Feng, and S. Li, in Proceedings of the 39th International Conference on Computer-Aided Design, ICCAD '20 (Association for Computing Machinery, New York, NY, USA, 2020).
- [20] (2021), code for the complete simulation and visualization is available on https://github.com/ AnimeshSinha1309/quantum-rl.
- [21] E. W. Dijkstra, "Cooperating sequential processes," in *The Origin of Concurrent Programming: From Semaphores to Remote Procedure Calls* (Springer New York, New York, NY, 2002) pp. 65–138.
- [22] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker, arXiv e-prints, arXiv:1805.09218 (2018), arXiv:1805.09218 [stat.ML].
- [23] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis, CoRR abs/1712.01815 (2017), arXiv:1712.01815.
- [24] P. Ramachandran, B. Zoph, and Q. V. Le, arXiv e-prints , arXiv:1710.05941 (2017), arXiv:1710.05941 [cs.NE].
- [25] H. Abraham and et al., "Qiskit: An Open-source Framework for Quantum Computing," (2019).
- [26] Cirq Developers, "Cirq," (2021).
- [27] S. Sivarajah, S. Dilkes, A. Cowtan, W. Simmons, A. Edgington, and R. Duncan, Quantum Science and Technology 6, 014003 (2020).
- [28] T. Itoko, R. Raymond, T. Imamichi, and A. Matsuo, arXiv e-prints, arXiv:1907.02686 (2019), arXiv:1907.02686 [quant-ph].
- [29] "IBM Qiskit developer challenge," https://www.ibm.com/blogs/research/2018/08/winners-qiskit-developer-challenge/ (2018), accessed on: 2021-03-23.
- [30] N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev, arXiv e-prints, arXiv:2003.03600 (2020), arXiv:2003.03600 [cs.LG].
- [31] Z. Xing and S. Tu, IEEE Access 8, 108418 (2020).
- [32] R. Xu and K. Lieberherr, arXiv e-prints arXiv:1903.03674 (2019), arXiv:1903.03674 [cs.AI].
- [33] K. Abe, Z. Xu, I. Sato, and M. Sugiyama, arXiv e-prints , arXiv:1905.11623 (2019), arXiv:1905.11623 [cs.LG].
- [34] A. Laterre, Y. Fu, M. Khalil Jabri, A.-S. Cohen, D. Kas, K. Hajjar, T. S. Dahl, A. Kerkeni, and K. Beguir, arXiv e-prints , arXiv:1807.01672 (2018), arXiv:1807.01672 [cs.LG].