# Generalised Circuit Partitioning for Distributed Quantum Computing

Felix Burt
Department of Electrical and
Electronic Engineering
Imperial College London
London, UK
f.burt23@imperial.ac.uk

Kuan-Cheng Chen
Department of Electrical and
Electronic Engineering
Imperial College London
London, UK
kuan-cheng.chen17@imperial.ac.uk

Kin K. Leung
Department of Electrical and
Electronic Engineering
Imperial College London
London, UK
kin.leung@imperial.ac.uk

Abstract—Distributed quantum computing (DOC) is a new paradigm aimed at scaling up quantum computing via the interconnection of smaller quantum processing units (QPUs). Shared entanglement allows teleportation of both states and gates between QPUs. This leads to an attractive horizontal scaling of quantum processing power, which comes at the expense of the additional time and noise introduced by entanglement sharing protocols. Consequently, methods for partitioning quantum circuits across multiple QPUs should aim to minimise the amount of entanglement-based communication required between distributed QPUs. Existing protocols tend to focus primarily on optimising entanglement costs for gate teleportation or state teleportation to cover operations between QPUs, rather than both at the same time. The most general form of the problem should treat gate and state teleportation on the same footing, allowing minimal cost circuit partitions through a combination of the two. This work introduces a graph-based formulation which allows joint optimisation of gate and state teleportation cost, including extensions of gate teleportation which group gates together for distribution using common resources. The formulation permits low e-bit cost for a variety of circuit types. Using a basic genetic algorithm, improved performance over state-of-the-art methods is obtained in terms of both average e-bit cost and time scaling.

Index Terms—Quantum Computing, Distributed Quantum Computing, Optimization, Quantum Networks, Quantum Communication

# I. INTRODUCTION

Distributed quantum computing (DQC) is a new paradigm aimed at connecting multiple quantum processing units (QPUs) to form larger, more powerful, quantum computing systems [1]. This provides an alternative route to scaling qubit numbers [2] which avoids the challenges of connecting many qubits on a single quantum processor [3]. However, DQC creates new difficulties relating to the interaction of QPUs in both hardware and software. Quantum information can be distributed between OPUs using entangled pairs of qubits, or e-bits, which must be efficiently distributed between devices [2]. E-bits can be consumed to teleport quantum states between QPUs, or to directly perform certain quantum gates non-locally [4]. However, preparation of e-bits is estimated to be over 10 times slower than local two-qubit gates [5], resulting in a new objective for quantum circuit optimisation - minimisation of e-bit consumption. This has been referred to as the "DQC problem" [6] or the "circuit partitioning problem" [7]. Broadly, the aim of the problem is to cover all gates in a circuit split between multiple QPUs using as few e-bits as possible. Dominant methods in the literature for solving circuit partitioning focus on optimising e-bit cost from either state teleportation [8]–[10] or gate teleportation [6], [11]–[14], failing to fully leverage the possibilities available to cover inter-QPU operations. While there are works which consider both [9], [15], [16], they fail to treat them jointly as target of optimisation, resulting in higher e-bit costs than what is possible in general. This work proposes a generalised graph formulation of the circuit partitioning problem, aiming to unify the dominant approaches. The results are compared with leading methods from different paradigms [6], [8], which both perform very well on circuits which fit their methods but perform sub-optimally in certain cases. Using simple heuristics, we obtain competitive performance with the best cases of existing methods, while retaining strong performance in instances where the benchmark methods break down, leading to lower e-bit costs on average. The results are obtained using a genetic algorithm with time complexity linear in the total circuit size (width × depth), with experimental demonstrations showing significant improvement over the scaling of the benchmark methods.

# II. THE CIRCUIT PARTITIONING PROBLEM (CPP): BACKGROUND

# A. Static circuit partitioning

The general aim of circuit partitioning is to split a quantum circuit C, consisting of list of operations on a set of logical qubits  $q=\{q_0,q_1...\}$ , among a set of QPUs  $Q=\{Q_0,Q_1...\}$ . Splitting quantum circuits into separate parts can be considered as assignment  $\phi$  of logical qubits to QPUs,  $\phi:q\to Q$ , resulting in a collection of operations which are either local or non-local. Local operations either act only on a single qubit, or act on multiple qubits which are assigned to the same QPU, i.e.  $\phi(q_i)=\phi(q_j)$ . Non-local operations, on the other hand, are gates which involve two or more qubits assigned to different QPUs,  $\phi(q_i)\neq\phi(q_j)$ . Any assignment  $\phi$  will give a corresponding number of non-local operations to be covered. In order to minimise the number of non-local operations, a quantum circuit can be converted to an interaction graph

where each qubit corresponds to a node on the graph, while interactions of all pairs of qubits are summed into weighted edges. This reduces the problem to min-cut graph partitioning [16]. For balanced partitions min-cut is NP-hard, but good heuristics are available [17]. In terms of circuit partitioning, this is referred to as *static* partitioning because a single qubit assignment is used for the full circuit. Static partitioning methods minimise the number of non-local operations, which is only equal to the e-bit cost if each non-local operation takes a single e-bit. In general, this is not the case, as e-bits can be used in more flexible ways.



(a) Random quantum circuit. See section V-B1.



Fig. 1: **Static partitioning workflow.** The circuit in a) is converted to a static interaction graph shown in b). The graph is then partitioned according to the constraints of the QPUs, such that the edge weight between partitions is minimised.

#### B. Teleportation

1) State teleportation: Quantum state teleportation is the "conventional" teleportation procedure, which uses a single e-bit, shared between a two parties, to move an arbitrary quantum state from one party to the other [18].



Fig. 2: **State teleportation.** This is the well known quantum teleportation procedure implemented using a controlled-phase gate rather than a CNOT.

State teleportation can be used to cover a non-local operation by moving one qubit state to its partner's QPU, then

performing the operation locally, at the cost of a single e-bit. If the state is then returned to its original QPU, the overall e-bit cost is two. However, it is not always required to return the qubit to its original QPU – in some cases it may even be beneficial for the qubit to remain at the destination of the teleportation.

2) Gate teleportation: A less well-known procedure is gate teleportation, which uses a single e-bit to perform a two-qubit controlled unitary, without directly transferring either qubit to the other QPU [4].



Fig. 3: **Gate teleportation.** The circuit displays how a single controlled-phase gate can be teleported using an e-bit. The procedure is general to all controlled-unitary gates with  $q_0$  as the control.

Gate teleportation works by entangling the control qubit with a communication qubit in another QPU, such that both qubits behave the same way when controlling two-qubit gates on other qubits. The state of the control qubit is mapped from  $|\psi\rangle=\alpha\,|0\rangle+\beta\,|1\rangle$  to the joint state  $|\psi\rangle=\alpha\,|00\rangle+\beta\,|11\rangle$  with the communication qubit. This allows the communication qubit to control a two-qubit (or multi-qubit gate) in place of the original qubit. After the gate has been performed, the communication qubit is measured, and the result used to correct the state of the original control qubit. In contrast with state teleportation, at the end of the procedure both qubits remain assigned their original QPUs.

3) Extended gate teleportation: While the basic gate teleportation procedure focuses on teleporting a single qubit controlled-unitary, it has been shown that larger sequences of operations can in fact be teleported, sometimes at no additional cost [4].



Fig. 4: **Extended gate teleportation.** The circuit displays how two controlled-unitary gates can be teleported using the same e-bit.

In particular, for a collection of two-qubit controlledunitaries, if each gate shares a common control qubit and their target qubits are assigned to the same QPU, then all gates can be teleported using a single e-bit. This requires a delay in the measurement of the communication qubit in the target QPU, waiting until all gates in the group being covered have been completed to end the process and send the classical information back to the original QPU. In addition, any diagonal or anti-diagonal single-qubit gate acting on the common control qubit can be included in the teleportation procedure [19], as this does not change the behaviour of the linked communication qubit as a control.

Extended gate teleportation can be broken up into a "starting process", which creates the linked entangled state, an "intermediate process" where the corresponding local operations are performed, and an "ending process" which ends the link with measurement and classical communication [12]. The starting and ending processes have also been referred to as the "cat-entangler" and "cat-disentangler", respectively [20]. The groups of gates covered is referred to as a "distributable packet" [12]. Wu et al. go on to describe how non-adjacent distributable packets can be merged together via "embedding" distribution processes for intermediate unitaries within a larger process. Embedding processes allow further reduction of e-bit costs but are beyond the scope of the paper - details can be found in Wu et al. [12]. Following the literature, we will also refer to extended gate teleportation methods as "gate packing" methods.

#### C. Competing methods

The dominant methods in the DQC literature tend to focus heavily on exploiting the benefits of either state teleportation, or gate packing to produce low e-bit costs. We will roughly outline these two extreme ends of the spectrum giving reference to the relevant existing works.

1) Fine-grained partitioning: As an alternative to basic graph partitioning, Baker et al. [8] propose a method referred to as fine-grained partitioning (FGP). Rather than reduce a quantum circuit to a single, static, interaction graph, the essence of FGP is to produce a sequence of interaction graphs for each layer, or "time-slice", of the circuit. Timesliced graphs can consider future interactions in the circuit by adding time-decaying weighted edges for gates in later slices. Edges representing gates in the current slice are given infinite weights to ensure a local assignment at the time of interaction. The strategy is then to find a low-cost sequence of non-local swaps (two-way gate teleportation) to transition between each slice of the circuit. The result is that all nonlocal operations are covered by state teleportation, while the look-ahead weights work to ensure that swaps also consider future interactions in the circuit. The methods produce low cost assignment sequences in certain scenarios but are limited by the restriction that all non-local operations are covered using state teleportation, especially for circuits for which qubit connectivity changes erratically. Similar methods are presented in Sundaram et al. [9] and Nikhad et al. [10]. The significance of FGP type methods is that the solution for the partitioning problem is no longer a single state assignment of qubits, rather a different assignment for each layer of the circuit. Partitioning over time captures gradual changes in interactivity of quantum circuits very well.

- 2) Gate packing: A number of other methods are motivated rather by extended gate teleportation, focusing on packing sequences of gates together into the same teleportation protocol to reduce e-bit cost. A hypergraph partitioning method was proposed originally by Andres-Martinez et al. [11]. This work extends the normal graph approach by using hyperedges to represent groups of gates which can be distributed using a single e-bit, such that the number of hyperedges which cut the partition directly corresponds to e-bit count rather than non-local gate count. The methods are extended in Andres-Martinez et al. [6] and combined with the methods from Wu et al. [12], which extend the scope of gate teleportation further. The methods go beyond just gate packing, using the embedding methods mentioned in Section II-B3. The methods presented in the latter work are capable of finding very low cost distributions in instances of circuits which contain large sequences of gates meeting the conditions for gate packing and embedding, but are unable to take advantage of state teleportation when such sequences are not available, such as in the quantum volume benchmarks used in their work.
- 3) Hybrid methods: We acknowledge the existence of methods that consider both extended gate teleportation and state teleportation [9], [15], [16], but note that they treat state teleportation and gate teleportation separately in the optimisation process. For example, Sundaram et al. first optimise extended gate teleportation cost using static graph partitioning, then see if the result can be improved by strategically inserting teleportations into circuits [9]. Similarly, Ferrari et al. begin with static graph partitioning using the METIS library [17] followed by a scheduling pass which chooses whether to use state or gate teleportation to cover the gates based on cost [16]. In addition, an appendix in Andres-Martinez et al. describes how large circuits are broken up into sections which are partitioned separately, requiring teleportations to transition between each section, making this method technically a hybrid method [11]. However, this is dealt with using a pre-processing routine to determine where circuits should be split, rather than combined treatment of state and gate teleportation.

#### III. GENERALISED CIRCUIT PARTITIONING (GCP)

To capture full generality, the problem formulation should cover instances where circuits are best executed using state teleportation only, gate teleportation only, and a combination of both – including gate packing. Doing so results in a unification between fine-grained partitioning methods and gate packing methods, thus permitting low e-bit cost in all cases. We will present two variants of GCP; a *simple* variant which considers only regular, non-extended, gate teleportation, and an *extended* variant which includes simple gate packing procedures.

#### A. Generalised circuit partitioning - simple (GCP-S)

Central to both methods is an extended circuit interaction graph (see Figure 5, which retains the time-step/layer number



Fig. 5: **Extended circuit interaction graph.** The graph shown here represented a QFT circuit on 8 qubits. Here the nodes are all assigned an unoptimised static mapping indicating gate teleportation for all edges cutting the partition.

for all gates in the circuit. In contrast to FGP methods, which produce a different graph for each layer of the circuit, the extended interaction graph is a single graph G(V, E), containing separate nodes for each qubit at each layer of the circuit. Gate-like edges connect partner nodes at the layer of their interaction, while additional state-like edges connect adjacent nodes representing the same qubit. The result is that an assignment of nodes to QPUs corresponds to an assignment of qubits for each layer of the circuit, thus determining both the state teleportations and the gate teleportations required distribute the circuit. Formally, for each logical qubit i at the  $l^{th}$  layer of the circuit,  $q_i^{(l)}$  maps uniquely to a node  $v_k$ . However, valid solutions are constrained to those which always meet the QPU size constraints. These are enforced by including nodes to represent any idle qubits in each QPU, thus we increase the size of the set of logical qubits to match the total number of physical qubits. This allows the space of all valid solutions to be explored via permutations of nodes within their own layer. This picture extends the problem to a more general, constrained, graph partitioning problem. The aim is still to find a min-cut partition of the nodes but the number of nodes has been extended to reflect the dynamics of the circuit. This picture aligns closely with that presented in Davis et al. under the name "gate partitioning" [21]. However, the authors do not include the constraints to enforce the size constraints of the QPUs, rather they perform post-processing to ensure that these conditions are met, resulting in high e-bit costs in many of the presented benchmarks. The authors also consider no extension to include gate packing, as is done in Section III-B.

Formally, we define a partition of the circuit as a mapping from the set of nodes to the set of QPUs,  $\Phi:V\to Q$ , though since the nodes represent qubits at each layer of the circuit, we can equivalently think of a partition as an assignment of qubits to QPUs for each layer of the circuit. We retain similar notation to the static case, except that the capital  $\Phi$  is used



Fig. 6: **Simple optimisation.** This is the same QFT as Figure 5, optimised according to GCP-S. Nodes have been dragged across to indicate where state teleportation occurs. The circuit is covered using 6 state teleportations, provided there is a single additional space in the second QPU.

to denote assignment across all layers. As such, a partition can be viewed as a matrix where the element  $\phi_{i,j}$  indicates the QPU that qubit  $q_j$  is assigned to at layer i. The cost of a partition is given by:

$$c = \sum_{u,v} e_{u,v} (1 - \delta_{\Phi(u)\Phi(v)}) f(r(\Phi(u), \Phi(v))), \tag{1}$$

where we have included a generic function f to denote the cost scaling, as a function of the length  $r(\Phi(u), \Phi(v))$  of the path separating the nodes u and v in the network. We only consider homogeneous networks, for which  $f \to 1$  in all cases, though it is included to demonstrate ease of extension.



Fig. 7: **Visualising assignments on graphs.** In a) we see the cost of a naive assignment, resulting in an e-bit cost of 13. In b) we see the effect of an optimised assignment sequence, resulting in an e-bit cost of 5. Coloured nodes indicate where e-bits are used.

#### B. Generalised circuit partitioning - extended (GCP-E)

The extension to GCP can be achieved by modifying the edge structure of the graph. The graph is first scanned to identify where gates can be grouped together according to the conditions identified in Section II-B3. Since different possibilities are available for grouping gates, a greedy approach is used, which always adds gates to the largest available group. The conditions for grouping gates are:

- 1) All two-qubit gates share a common control qubit.
- 2) All two-gates are adjacent on the control.



Fig. 8: **Extended optimisation.** The Quantum Fourier Transform (QFT) circuit optimised using the GCP-E formulation. In this case, no state teleportation is required, with a total e-bit cost of 3 with the same size constraints as in Figure 6. Edges which have been merged onto a control node are coloured red.

For condition 1., the shared control must be the same for each gate. Since there may be many controls involved in gate groups which are not the shared control, we refer to the qubit representing the shared control as the "lead control qubit". The node representing the first appearance of the shared control is referred to as the "original control node" for a group. "Adjacent", in condition 2., means that there are no non-diagonal single-qubit gates acting on the original control between any consecutive gates in the group. From section II-B3, we know that diagonal single-qubit gates on the original control can form part of a group, so these can be freely applied to a qubit while it is linked to other QPUs. For symmetric two-qubit gates, in which both involved qubits are control qubits, the conditions are only relevant to the lead control qubit. There are no restrictions on the partner control qubits.

The edges for the gate groups identified are all merged onto the original control node. This is demonstrated in Figure 8. Edges grouped on a particular original control node can be treated similarly to hyperedges in a hypergraph. This is because the e-bit requirements for distributing a group of gates will correspond not to the number of edges but rather the number of different QPUs to which the edges in the group connect to, since e-bits can be re-used within each linked QPU. This modification allows us to explore the full solution space from assignments which imply state teleportation only to those

which rely on extended gate teleportation as well as everything in between. The new cost function is:

$$c = \sum_{u,v \in V} e_{u,v} (1 - \delta_{\Phi(u)\Phi(v)}) h(u, \Phi(v)) f(r(\Phi(u), \Phi(v))),$$
(2)

where  $h(u, \Phi(v))$  is a function which returns 1 if  $e_{u,v}$  is the first edge from source node u connecting to a node in  $\Phi(v)$ . With this modification, we do not count edges which can benefit from an existing linked copy. To see why this is the case, consider the natural extension of the gate teleportation procedure in Section II-B1. In order for a qubit to be linked to k QPUs, its state must be transformed to to a (k+1)-fold entangled state  $|\psi\rangle = \alpha |0\rangle^{\otimes k+1} + \beta |1\rangle^{\otimes k+1}$ , such that each linked communication qubit can be used to control operations for the lead control. Creating such a state requires the gate teleportation starting procedure to be repeated for each QPU, requiring k e-bits to be consumed. Thus, for each group of edges which have been merged onto a control node, the ebit cost is equal to the number of external QPUs which are connected. Once the links are active, the states can either remain at the communication qubit, or swap onto idle data qubits to free up communication qubit space. An example of grouping gates and calculating costs is shown in Figures 9a and 9b.

#### IV. SOLUTION METHODS

#### A. Genetic algorithm

To solve GCP, we propose a genetic algorithm (GA). GAs are evolutionary algorithms which are well known for being able to produce good heuristic solutions in combinatorial optimisation problems [22], particularly if tailored by prior knowledge. In this case, we can constrain solutions of the problems to take the shape of matrices, where the column index corresponds to qubit number j, the row index to the layer i of the circuit. Corresponding elements are assignments  $\Phi(v)$  where v corresponds to qubit  $q_j$  at layer i. Each solution corresponds to a set of state and gate teleportations, shown in 7. We can explore all valid solutions with permutations of the elements within each layer.

The workflow for GCP is described in algorithm 1. For costfunc, the cost functions in equations 1 and 2 are used, depending on which variant of the problem is being considered. The crossover and mutation functions can be chosen to suit the problem. We use a single point crossover, which selects crossover points by row of the matrix, so that the size constraints remain satisfied. For mutation, we use a custom mutation function which samples random pairs of nodes in a given interval and exchanges them if they have a net difference of neighbours assigned to each other's partition, inspired by the Kernighan-Lin heuristic [23] This mutation function is slower than a normal flip mutation but better avoids local minima.



(a) Gate grouping. All edges are merged onto the original control (b) Interrupted gate grouping. Since the control line is internode. Since this node connects to 3 partitions outside its own, the ebit cost for the assignment is 3.



rupted by a non-diagonal single-qubit gate (green node), two gate groups must be formed. The resulting e-bit cost is 5, since the first group connects to 3 partitions and the second to 2.

#### Algorithm 1 Genetic Circuit Partitioning

```
1: Input: A graph G(V, E) for circuit of depth d with n_a
    qubits. A set of QPUs Q and their sizes |Q_n|. Population
    size |L|. Number of generations/iterations N_q.
 2: Output: Sorted population \tilde{L} of candidate assignments.
 3: Initialise empty population L
 4: for n = 0 to |L| - 1 do
         Produce a single-layer, static, assignment \phi.
 5:
         Repeat \phi for depth d to create full assignment \Phi.
 6:
         Add candidate to L
 7.
 8: end for
 9: for n=1 to N_g-1 do
         for candidate in L do
10:
             cost \leftarrow costfunc(G, candidate)
11:
              Add cost to costlist
12:
         end for
13:
        Initialise empty population \tilde{L} for m=0 to \frac{|L|}{2}-1 do
14:
15:
             dist \leftarrow softmax(costlist)
16:
             Sample c_a and c_b from L according to dist.
17:
             \tilde{c_a}, \tilde{c_b} \leftarrow crossover(c_a, c_b)
18:
             \tilde{c_a}, \tilde{c_b} \leftarrow mutation(\tilde{c_a}, \tilde{c_b})
19:
              Add \tilde{c_a}, \tilde{c_b} to \tilde{L}
20:
         end for
21:
         L \leftarrow \tilde{L}
22:
23: end for
24: return L
```

#### B. Complexity analysis

The time complexity of the genetic algorithm is determined by population size |L|, number of generations  $N_a$ , cost function evaluation and the crossover and mutation functions. For all tests, we used constant L and  $N_q$  such that the scaling is independent of these factors. Typically, as the size of the problem increases, a constant population size and number of iterations is less likely to reach to an optimal solution, however, we retain competitive and often superior performance to the benchmark methods with constant parameters in all tests. In this case the time complexity is  $\mathcal{O}(cost + selection +$ crossover + mutation). The cost function requires checking

the assignment of nodes for  $n_q(d+1)$  state edges and a maximum of  $\frac{1}{2}n_qd$  gate edges, such that the cost function time complexity is  $\mathcal{O}(n_q d)$ . The selection employs a softmax over all fitness values which has complexity  $\mathcal{O}(|L|)$  which is constant since |L| is fixed. Crossover simply selects a random point at which to split two solutions which is also a constant time operation. In our case, the mutation function checks the partitions of a subset of nodes for two qubits in a randomly chosen interval. This is done for a constant k pairs of qubits, such that we query the partition of at worst 2dk nodes, making the complexity  $\mathcal{O}(d)$ . Altogether this gives a time complexity  $\mathcal{O}(n_q d) + \mathcal{O}(d) = \mathcal{O}(n_q d)$ , such that the scaling is linear in the overall circuit size. The complexity for specific circuits is discussed in section VI.

#### V. RESULTS

We use the GA to produce results for both the simple and the extended variants of GCP, here termed GCP-S and GCP-E respectively. These are compared against two leading methods in the literature, exemplifying the extremes of state and gate teleportation-focused algorithms. The comparison graphs are shown in Figures 11, 12 and 13, while the results are summarised in Table I. All results are produced using an HP Elite Tower 800 G9, running Windows 11 Pro with an Intel Core i5-12500 processor, and 16GB of DDR5 RAM.

# A. Benchmark algorithms

- 1) FGP-rOEE: The first is the algorithm presented in Baker et al. for solving their FGP problem formulation. The algorithm is a variant of the overall extreme exchange (OEE) graph partitioning heuristic [24], called relaxed-OEE. We use a version of rOEE which differs only in the initial phase, where the authors use the regular OEE algorithm to find a starting assignment. We use a simplified version of this which simply swaps the assignment of nodes until no benefit can be gained, then proceed with rOEE.
- 2) Pytket-DQC: The second algorithm comes from Andres-Martinez et al. [6] and is freely available in pytket-DQC [25]. The authors present a number of different workflows – we use a custom workflow which is the same as the best performing "EmbedSteinerDetached (ESD)" but that it uses the annealing allocator rather than the hypergraph partitioner. We will refer



Fig. 10: Example of a CP fraction circuit mapped using different techniques. The best e-bit cost is in d) where all teleportation procedures are optimised over.

to this workflow as AESD. Using the annealer may result in a small performance decrease, however, reliance on external software led to implementation issues.

#### B. Benchmark circuits

Four different circuit types are used for comparing methods, each with a very different qubit interaction structure. The choices are motivated such that there are instances where each of the benchmark algorithms performs both strongly and poorly. For all circuits, the universal gate set of controlled-phase gates  $CP(\vartheta)$  and general unitary rotations  $U(\phi,\theta,\lambda)$ , is used. This gate set is useful for DQC as the  $CP(\vartheta)$  allows for many gate packing possibilities. In all cases, circuits are partitioned among QPUs containing 8 qubits each.

- 1) CP fraction: CP fraction circuits are a generalisation of the CZ fraction introduced in Sundaram et al. [14] and used in later works [6], [26]. A CZ fraction circuit, for a given number of qubits  $n_q$ , depth d and fraction p, consists of d layers of gates where, for each layer, a Hadamard (H) gate is applied to each qubit with probability 1-p, while all qubits for which H was not applied are randomly paired with CZ gates. The only difference in a CP fraction is that each H is replaced with  $U(\phi, \theta, \lambda)$  for randomly chosen  $\phi, \theta$  and  $\lambda$ . In addition, each CZ is replaced with  $CP(\vartheta)$ , for some randomly chosen  $\vartheta$ . In all tests, we set  $d=n_q$ .
- 2) QFT: The QFT is a well-known subroutine in many quantum algorithms [27], implemented using a very regular structure of two-qubit gates [28]. The QFT has been used as a benchmark for various circuit partitioning methods and is an example of a circuit type consisting of large distributable packets.
- 3) Quantum volume (QV): QV circuits are randomised circuits used for benchmarking the performance of quantum computers [29]. It is reasonable to believe that QV circuits are also a strong indicator of performance of a distributed quantum architecture, where good compiling strongly impacts the QV score. Another reason for choosing QV circuits for benchmarking is that they represent a middle ground between highly structured and fully random circuits. QV circuits consist of sequences of Haar random unitaries acting on pairs of

qubits interleaved with permutations of qubit numbers [29]. The resulting circuit consists of alternating blocks of repeated single- and two-qubit gate patterns.

4) Quantum approximate optimisation algorithm (QAOA): QAOA circuits are designed to solve combinatorial optimisation problems using parameterised quantum circuits [30]. The circuit consists of parameterised single- and two-qubit gates which are classically optimised to find an approximate solution to the problem. The structure of the circuit depends partly on the structure of the problem being solved. We consider randomly initialised QAOA circuits for MaxCut problems over random graphs with an edge probability of 50% for each pair of nodes.

| Method  | FGP-rOEE | GCP-S | GCP-E | AESD |
|---------|----------|-------|-------|------|
| CP 0.3  | 0.66     | 0.54  | 0.50  | 0.49 |
| CO 0.4  | 0.59     | 0.57  | 0.50  | 0.52 |
| CP 0.5  | 0.65     | 0.55  | 0.47  | 0.45 |
| CP 0.7  | 0.66     | 0.57  | 0.42  | 0.35 |
| CP 0.8  | 0.59     | 0.58  | 0.36  | 0.28 |
| QFT     | 0.34     | 0.59  | 0.14  | 0.12 |
| QV      | 0.20     | 0.43  | 0.42  | 0.73 |
| QAOA    | 0.35     | 0.51  | 0.12  | 0.14 |
| Average | 0.52     | 0.55  | 0.37  | 0.39 |
|         |          |       |       |      |

TABLE I: Average performance across all tests. The metric used is the ratio of e-bits required to the total number of two-qubit gates in the circuit, such that a lower score indicates a more efficient circuit distribution.

#### VI. DISCUSSION AND CONCLUSION

It is clear from the results in Figures 11, 12 and 13 that methods which depend heavily on state teleportation or gate teleportation perform very well in specific circumstances but do not retain the performance in all cases. For example, AESD performs very well on QFT circuits, as these have multiple long chains of two-qubit gates which share a common control qubit. As a result, no state teleportation is required for efficient distribution. However, the AESD performs poorly on QV circuits, which do not contain any distributable packets. In this case, these methods are limited to providing a qubit assignment



Fig. 11: Results from real circuits distributed over QPUs consisting of 8 qubits each.



Fig. 12: Results for random, CP fraction circuits, as described in section V-B1, distributed over QPUs consisting of 8 qubits each

which performs only as well as static graph partitioning. In contrast, FGP-rOEE performs exceptionally well on QV circuits, where full reliance on state teleportation is always most effective. This is because all blocks of a QV circuit can be made local if state teleportations are inserted to match the qubit permutation between each section. This makes it

particularly easy for FGP-rOEE to find an optimal sequence of assignments. In principle, both GCP-S and GCP-E should be able to find the same assignment, but it becomes increasingly less likely as the size of the circuit increases so would require longer runtime for the GA. Again, since there are no distributable packets in QV circuits, GCP-E reduces to GCP-S



Fig. 13: Results from larger CP fraction circuits distributed over QPUs consisting of 16 qubits each.

which is reflected in the similarity of their performance.

For CP fraction circuits, which have a more random structure, FGP-rOEE performs the worst, particularly as the number of qubits and fraction of two-qubit gates increases. Since these circuits have non-uniform qubit interactions, it can be inefficient to be constrained to state teleportation. In contrast, AESD performs worse than GCP-E when the fraction of twoqubit gates is lower. This is because there are more singlequbit gates separating the two-qubit gates, which decreases the average size of the distributable packets. GCP-E can benefit from state teleportation in this cases so retains low e-bit cost. As the fraction of two-qubit gates increases however, AESD performs the best in terms of e-bit cost. AESD contains the most flexible conditions for extended gate teleportation, with the benefits of embedding reflected for two-qubit gate fractions greater than 0.4. This improved performance comes at a significant time cost, however, which can be seen in the distribution time graphs. For all circuits containing 40 qubits or more, the distribution time of AESD drastically increases. In these instances, the cost performance of GCP-E remains competitive for a much shorter distribution time. The overall, average performance is shown in Table I. GCP-E is the best performer in terms of e-bit cost, outperforming FGP-rOEE by around 20% and AESD by around 2%.

The time graphs in Figures 11, 12 and 13 accord with the scaling from section IV-B. While slower at smaller scales, GCP-S and GCP-E scale better and are the fastest in most cases by the end. CP fractions circuits are built with  $d=n_q$ ,

while QFT and QV circuits have  $d \propto n_q$ , so in these cases the complexity reduces to  $\mathcal{O}(n_q^2)$ . For QAOA, the depth is more variable and depends on the optimisation problem. Since we use circuits designed for MaxCut on random graphs, the depth scales with the number of edges of the graphs and the number of repetitions, so steeper polynomial scaling is expected. The experimental results demonstrate superior scaling compared to the benchmark methods for most circuits. However, the complexities of these methods are highly dependent on the type of circuit. The results from larger circuits, in Figure 13, show significant time improvements over the benchmark methods.

In future, we look to extend these methods to general network topologies. We avoided doing this here, as current methods for doing so (scaling cost linearly with path distance [6], [9]) are over-simplified and require careful treatment. We acknowledge another assumption regarding the number of communication qubits available to facilitate e-bit generation at each QPU. At present, this assumption can be justified on the basis that communication qubits can freely swap states in and out of data qubits which are idle in computation. For realistic systems with data and communication qubit number constraints, restrictions will need to be enforced and used to guide the optimisation process.

# VII. ACKNOWLEDGEMENTS

The authors thank Richard Meister for engaging discussions and acknowledge funding from the Engineering and Physical Sciences Research Council (EPSRC) funded Distributed Quantum Computing project, grant number EP/W032643/1

#### REFERENCES

- [1] D. CUOMO, M. CALEFFI AND A. S. CACCIAPUOTI, Towards a distributed quantum computing ecosystem, IET Quantum Communication 1 (2020), pp. 3–8. \_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1049/iet-qtc.2020.0002. URL: https://onlinelibrary.wiley.com/doi/abs/10.1049/iet-qtc.2020.0002, doi:10.1049/iet-qtc.2020.0002.
- [2] M. CALEFFI ET AL., Distributed Quantum Computing: a Survey, Dec. 2022. arXiv:2212.10609 [quant-ph]. URL: http://arxiv.org/abs/2212.10609, doi:10.48550/arXiv.2212.10609.
- [3] W. G. UNRUH, Maintaining coherence in quantum computers, Physical Review A 51 (1995), pp. 992–997. Publisher: American Physical Society. URL: https://link.aps.org/doi/10.1103/PhysRevA.51.992, doi: 10.1103/PhysRevA.51.992.
- [4] J. EISERT, K. JACOBS, P. PAPADOPOULOS AND M. B. PLENIO, Optimal local implementation of nonlocal quantum gates, Physical Review A 62 (2000), p. 052317. Publisher: American Physical Society. URL: https://link.aps.org/doi/10.1103/PhysRevA.62.052317, doi:10.1103/PhysRevA.62.052317.
- [5] N. ISAILOVIC, Y. PATEL, M. WHITNEY AND J. KUBIATOW-ICZ, Interconnection Networks for Scalable Quantum Computers, ACM SIGARCH Computer Architecture News 34 (2006), pp. 366– 377. URL: https://dl.acm.org/doi/10.1145/1150019.1136505, doi:10. 1145/1150019.1136505.
- [6] P. ANDRES-MARTINEZ ET AL., Distributing circuits over heterogeneous, modular quantum computing network architectures, July 2023. arXiv:2305.14148 [quant-ph]. URL: http://arxiv.org/abs/2305.14148.
- [7] D. BARRAL ET AL., Review of Distributed Quantum Computing. From single QPU to High Performance Quantum Computing, 2024. Publisher: arXiv Version Number: 1. URL: https://arxiv.org/abs/2404.01265, doi: 10.48550/ARXIV.2404.01265.
- [8] J. M. BAKER, C. DUCKERING, A. HOOVER AND F. T. CHONG, Time-Sliced Quantum Circuit Partitioning for Modular Architectures, May 2020. URL: https://arxiv.org/abs/2005.12259v1, doi:10.1145/ 3387902.3392617
- [9] R. G. SUNDARAM AND H. GUPTA, Distributing Quantum Circuits Using Teleportations, May 2023. arXiv:2306.00195 [quant-ph]. URL: http://arxiv.org/abs/2306.00195, doi:10.48550/arXiv.2306.00195.
- [10] E. NIKAHD, N. MOHAMMADZADEH, M. SEDIGHI AND M. S. ZA-MANI, Automated window-based partitioning of quantum circuits, Physica Scripta 96 (2021), p. 035102. Publisher: IOP Publishing. URL: https://dx.doi.org/10.1088/1402-4896/abd57c, doi:10.1088/1402-4896/abd57c.
- [11] P. ANDRÉS-MARTÍNEZ AND C. HEUNEN, Automated distribution of quantum circuits via hypergraph partitioning, Physical Review A 100 (2019), p. 032308. Publisher: American Physical Society. URL: https://link.aps.org/doi/10.1103/PhysRevA.100.032308, doi:10. 1103/PhysRevA.100.032308.
- [12] J.-Y. WU ET AL., Entanglement-efficient bipartite-distributed quantum computing, Quantum 7 (2023), p. 1196. Publisher: Verein zur Förderung des Open Access Publizierens in den Quantenwissenschaften. URL: https://quantum-journal.org/papers/q-2023-12-05-1196/, doi: 10.22331/q-2023-12-05-1196.
- [13] D. CUOMO ET AL., Optimized Compiler for Distributed Quantum Computing, ACM Transactions on Quantum Computing 4 (2023), pp. 1– 29. URL: https://dl.acm.org/doi/10.1145/3579367, doi:10.1145/ 3579367.
- [14] R. G. SUNDARAM, Efficient Distribution of Quantum Circuits, 2021.
- [15] D. FERRARI, A. S. CACCIAPUOTI, M. AMORETTI AND M. CALEFFI, Compiler Design for Distributed Quantum Computing, IEEE Transactions on Quantum Engineering 2 (2021), pp. 1–20. URL: https:// ieeexplore.ieee.org/document/9334411/, doi:10.1109/TQE.2021. 3053921.
- [16] D. FERRARI, S. CARRETTA AND M. AMORETTI, A Modular Quantum Compilation Framework for Distributed Quantum Computing, IEEE Transactions on Quantum Engineering 4 (2023), pp. 1–13. arXiv:2305.02969 [quant-ph]. URL: http://arxiv.org/abs/2305.02969, doi:10.1109/TQE.2023.3303935.

- [17] G. KARYPIS AND V. KUMAR, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing 20 (1998), pp. 359–392. URL: http://epubs.siam.org/doi/10. 1137/S1064827595287997, doi:10.1137/S1064827595287997.
- [18] C. H. BENNETT, G. BRASSARD, C. CRÉPEAU, R. JOZSA, A. PERES AND W. K. WOOTTERS, Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels, Physical Review Letters 70 (1993), pp. 1895–1899. Publisher: American Physical Society. URL: https://link.aps.org/doi/10.1103/PhysRevLett.70.1895, doi: 10.1103/PhysRevLett.70.1895.
- [19] S. F. HUELGA, J. A. VACCARO, A. CHEFLES AND M. B. PLENIO, Quantum remote control: Teleportation of unitary operations, Physical Review A 63 (2001), p. 042303. Publisher: American Physical Society. URL: https://link.aps.org/doi/10.1103/PhysRevA.63.042303, doi: 10.1103/PhysRevA.63.042303.
- [20] A. YIMSIRIWATTANA AND S. J. LOMONACO JR, Generalized GHZ States and Distributed Quantum Computing, Mar. 2004. arXiv:quantph/0402148. URL: http://arxiv.org/abs/quant-ph/0402148.
- [21] M. G. DAVIS, J. CHUNG, D. ENGLUND AND R. KETTIMUTHU, To-wards Distributed Quantum Computing by Qubit and Gate Graph Partitioning Techniques, Oct. 2023. arXiv:2310.03942 [quant-ph]. URL: http://arxiv.org/abs/2310.03942.
- [22] H. MÜHLENBEIN, M. GORGES-SCHLEUTER AND O. KRÄMER, Evolution algorithms in combinatorial optimization, Parallel Computing 7 (1988), pp. 65–85. URL: https://www.sciencedirect.com/science/ article/pii/0167819188900981, doi:10.1016/0167-8191(88) 90098-1.
- [23] B. W. KERNIGHAN AND S. LIN, An efficient heuristic procedure for partitioning graphs, The Bell System Technical Journal 49 (1970), pp. 291–307. Conference Name: The Bell System Technical Journal. URL: https://ieeexplore.ieee.org/document/6771089, doi:10.1002/j.1538-7305.1970.tb01770.x.
- [24] T. PARK AND C. Y. LEE, Algorithms for partitioning a graph, Computers & Industrial Engineering 28 (1995), pp. 899–909. URL: https://www.sciencedirect.com/science/article/pii/036083529500003J, doi: 10.1016/0360-8352(95)00003-J.
- [25] P. ANDRES-MARTINEZ, D. MILLS, T. FORRER AND L. HENAUT, CQCL/pytket-dqc, June 2024. original-date: 2021-12-20T18:54:45Z. URL: https://github.com/CQCL/pytket-dqc.
- [26] R. G. SUNDARAM, H. GUPTA AND C. R. RAMAKRISHNAN, Distribution of Quantum Circuits Over General Quantum Networks, June 2022. arXiv:2206.06437 [quant-ph]. URL: http://arxiv.org/abs/2206.06437, doi:10.48550/arXiv.2206.06437.
- [27] P. SHOR, Algorithms for quantum computation: discrete logarithms and factoring, in Proceedings 35th Annual Symposium on Foundations of Computer Science, Nov. 1994, pp. 124–134. URL: https://ieeexplore.ieee.org/document/365700/?arnumber=365700, doi: 10.1109/SFCS.1994.365700.
- [28] D. COPPERSMITH, An approximate Fourier transform useful in quantum factoring, Jan. 2002. arXiv:quant-ph/0201067. URL: http://arxiv.org/abs/ quant-ph/0201067, doi:10.48550/arXiv.quant-ph/0201067.
- [29] A. W. CROSS, L. S. BISHOP, S. SHELDON, P. D. NATION AND J. M. GAMBETTA, Validating quantum computers using randomized model circuits, Physical Review A 100 (2019), p. 032328. arXiv:1811.12926 [quant-ph]. URL: http://arxiv.org/abs/1811.12926, doi:10.1103/PhysRevA.100.032328.
- [30] E. FARHI, J. GOLDSTONE AND S. GUTMANN, A Quantum Approximate Optimization Algorithm, Nov. 2014. arXiv:1411.4028 [quant-ph]. URL: http://arxiv.org/abs/1411.4028.