# Compiling Unitaries Using Diffusion Models

**What you'll do:**
* Synthesize quantum circuits corresponding to a given unitary matrix with a diffusion model
* Evaluate if the obtained circuit is accurate
* Filter better quantum circuits
* Sample a circuit using a noise model

The notebook is organized as follows:
* Overview of the Compiling Unitaries with Diffusion Models workflow
* Generate circuits for a given unitary
* Select a circuit that meets specific criteria
* Compare circuits under the presence of noise

This notebook is based on the work presented in [Quantum circuit synthesis with diffusion models, Florian Fürrutter, Gorka Muñoz-Gil & Hans J. Briegel, Nat. Mach. Intell. **6**, 515–524 (2024)](https://doi.org/10.1038/s42256-024-00831-9).

Let's begin by installing the required packages.

In [None]:
#!pip install genQC==0.1.0 torch==2.9.1 -q

In [None]:
import functools
import itertools
import numpy as np
import torch
import cudaq
import matplotlib.pyplot as plt

import genQC
from genQC.imports import *
from genQC.pipeline.diffusion_pipeline import DiffusionPipeline
from genQC.inference.export_cudaq import genqc_to_cudaq
import genQC.inference.infer_compilation as infer_comp
import genQC.util as util


# Fixed seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

---
## Overview

Quantum computing relies on efficiently translating quantum operations into viable physical realizations on existing quantum hardware. This compilation task can be computationally intensive. Recently, machine learning techniques have demonstrated strong performance on it (see, for instance, [(Moro et al., 2021)](https://www.nature.com/articles/s42005-021-00684-3) and [(Fürrutter, et al., 2024)](https://doi.org/10.1038/s42256-024-00831-9)).

This notebook focuses on circuit synthesis with diffusion models [(Fürrutter, et al., 2024)](https://doi.org/10.1038/s42256-024-00831-9) — a powerful class of generative models in machine learning. We will demonstrate how to use this method to synthesize arbitrary unitaries into a `cudaq.kernel`, effectively decomposing them into sequences of quantum gates, a process commonly known as unitary compilation.
In particular, we will use a pre-trained model to generate circuits with specific characteristics, such as being limited to a restricted set of gates, for a given unitary.  The following figure provides an overview of the circuit generation pipeline:


![Quantum Synthesis with Diffusion Model](images/unitary_synthesis_dm.png)

The pipeline consists of 3 main components:

**1) Circuit encoding:** Like any neural network, diffusion models operate with continuous inputs and outputs. However, since the circuits we consider are composed of discrete gates (i.e., with no continuous parameters), we develop a mapping that transforms each gate into a continuous vector. This allows us to represent a given circuit as a three-dimensional tensor, as illustrated. Crucially, this mapping is invertible: when the DM generates continuous tensors, we can apply the inverse map to convert them back into the circuit form. 

**2) Conditioning:** The user's input (the set of available gates and the unitary to compile) is also transformed into a continuous tensor by two neural networks. For the gate set description, where the input is a text prompt (e.g., "Compile using: ['x', 'h']"), we utilize a pre-trained language model. For the unitary, we employ a neural network that is trained jointly with the diffusion model.

**3) Unitary compilation:** The generation procedure follows the typical DM process: the model is given a fully noisy tensor which is iteratively de-noised until reaching a clean sample based on the given conditioning (the desired unitary and gate set). The tensors generated by the DM are then mapped to circuits via the inverse encoding procedure. To learn more about the practical implementation of diffusion models, we recommend [this tutorial](https://course.fast.ai/Lessons/lesson9.html).

In the following we will use `CUDA-Q` and `genQC` to perform all these steps and go from a desired unitary matrix, $U$, to a quantum circuit that we can execute using CUDA-Q.

---
## Generate circuits for a given unitary using a diffusion model

Let's start by defining the unitary that we want to compile. Note that this model has been trained to compile unitaries arising from circuits composed of the gates `['h', 'cx', 'z', 'x', 'ccx', 'swap']`. Although this is a universal gate set (i.e., a gate set capable of universal computation), arbitrary computations can require an arbitrarily large number of gates. For this tutorial, we will use a model trained to generate kernels with at most 12 gates, so we can only expect it to work for unitaries that can be implemented within this limit. Let's consider the compilation of one such unitary.

We start by defining our unitary as a `numpy.array`:

In [None]:
U = np.array(
    [
        [ 0.70710678,  0.        ,  0.        , 0.        ,  0.70710678,  0.        , 0.        ,  0.        ],
        [ 0.        , -0.70710678,  0.        , 0.        ,  0.        , -0.70710678, 0.        ,  0.        ],
        [-0.70710678,  0.        ,  0.        , 0.        ,  0.70710678,  0.        , 0.        ,  0.        ],
        [ 0.        ,  0.70710678,  0.        , 0.        ,  0.        , -0.70710678, 0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.70710678, 0.        ,  0.        ,  0.        , 0.        ,  0.70710678],
        [ 0.        ,  0.        ,  0.        , 0.70710678,  0.        ,  0.        , 0.70710678,  0.        ],
        [ 0.        ,  0.        , -0.70710678, 0.        ,  0.        ,  0.        , 0.        ,  0.70710678],
        [ 0.        ,  0.        ,  0.        ,-0.70710678,  0.        ,  0.        , 0.70710678,  0.        ]
    ],
    dtype=np.complex64
)
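
Before handing a matrix to the model, it can be worth confirming that it is actually unitary, since a typo in one entry would make the compilation target meaningless. The following is a minimal sketch of such a check; the helper name `is_unitary` and the tolerance `atol=1e-6` are illustrative choices, not part of `genQC` or CUDA-Q, and the Hadamard matrix is only a stand-in example.

```python
import numpy as np


def is_unitary(matrix: np.ndarray, atol: float = 1e-6) -> bool:
    """Return True if `matrix` is square and matrix^dagger @ matrix equals the identity within `atol`."""
    matrix = np.asarray(matrix)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        return False
    identity = np.eye(matrix.shape[0])
    return bool(np.allclose(matrix.conj().T @ matrix, identity, atol=atol))


# Stand-in example: the single-qubit Hadamard gate is unitary
hadamard = np.array([[1, 1], [1, -1]], dtype=np.complex64) / np.sqrt(2)
print(is_unitary(hadamard))

# A non-unitary matrix for contrast
shear = np.array([[1, 1], [0, 1]], dtype=np.complex64)
print(is_unitary(shear))
```

The same helper can be applied to the `U` defined above, for example with `assert is_unitary(U)`.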

### The Diffusion Model

#### Setup and compilation

In [None]:
device = util.infer_torch_device()  # Use CUDA if we can
util.MemoryCleaner.purge_mem()  # Clean existing memory allocation
print(device)

#### Load model

First, we load the pre-trained model weights and set up the DM pipeline.

In [None]:
pipeline = DiffusionPipeline.from_pretrained("Floki00/qc_unitary_3qubit", device)  # Download from Hugging Face
pipeline.scheduler.set_timesteps(40)

Next, we set the parameters the model was trained on. Note that these are fixed and depend on the pre-trained model.

In [None]:
vocab = {i + 1: gate for i, gate in enumerate(pipeline.gate_pool)}  # Gateset used during training, used for decoding
print("The vocabulary of the gate set:", vocab)

num_of_qubits = 3  # Number of qubits
max_gates = 12

### Unitary compilation

The loaded model was trained on the gate set `['h', 'cx', 'z', 'x', 'ccx', 'swap']`. Specifically, it was trained to generate circuits using any arbitrary subset of this gate set. Therefore, during inference, we can instruct the model to compile the unitary using any of these subsets. However, it is crucial to follow the prompt structure `Compile using: [...]`, as the model was trained with this specific format. For example, let's consider a scenario where we compile the unitary without using the `'ccx'` and `'swap'` gates:

In [None]:
basis_gates = ['h', 'cx', 'z', 'x']
prompt = "Compile using: " + str(basis_gates)

Now, we call the diffusion model pipeline to generate encoded circuits based on the specified conditions: `prompt` and `U`. One of the key advantages of this method is that, once the model is trained, sampling new circuits is very fast. Therefore, under the same conditions defined above, we will generate a bunch of circuits and analyze them in the following sections.

In [None]:
# Number of circuits to sample from the trained DM.
samples = 256

# The neural network works only with real numbers, so we split U into its
# real and imaginary parts and stack them into a real tensor of shape
# (2, 2**num_of_qubits, 2**num_of_qubits):
U_r, U_i = torch.Tensor(np.real(U)), torch.Tensor(np.imag(U))
U_tensor = torch.stack([U_r, U_i], dim=0)

# Generate tensor representations of candidate circuits with the DM,
# conditioned on the prompt and U (this step is also known as inference).
out_tensors = infer_comp.generate_comp_tensors(
    pipeline=pipeline,
    prompt=prompt,
    U=U_tensor,
    samples=samples,
    system_size=num_of_qubits,  # Max qubit number allowed by the model (this model is only trained with 3 qubits)
    num_of_qubits=num_of_qubits,
    max_gates=max_gates,
    g=10,  # classifier-free-guidance (CFG) scale
)
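
The `g` argument above is the classifier-free-guidance scale. As a rough illustration of what such a scale does, the sketch below shows one common formulation of CFG, in which the unconditional and conditional noise predictions are blended linearly. This is an assumption about the general technique, not a description of how `genQC` implements it internally; the function name `cfg_combine` and the random arrays are placeholders.

```python
import numpy as np


def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, g: float) -> np.ndarray:
    """Blend unconditional and conditional noise predictions with guidance scale g."""
    return eps_uncond + g * (eps_cond - eps_uncond)


# Placeholder noise predictions standing in for two forward passes of a denoiser
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(3, 12))
eps_cond = rng.normal(size=(3, 12))

# g = 0 ignores the condition, g = 1 uses the conditional prediction only,
# and g > 1 extrapolates further in the direction of the condition
no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)
plain_conditional = cfg_combine(eps_uncond, eps_cond, 1.0)
strong_guidance = cfg_combine(eps_uncond, eps_cond, 10.0)
print(np.allclose(no_guidance, eps_uncond), np.allclose(plain_conditional, eps_cond))
```

Under this formulation, larger values of `g` push samples harder toward the conditioning (here, the prompt and `U`), typically at the cost of sample diversity.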

Below is an example of a generated tensor which represents a quantum circuit:

In [None]:
out_tensors[0]
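
To get a feeling for how such a tensor can be read, here is a small stand-alone sketch of a decoder. It assumes a layout with one row per qubit and one column per time step, where `0` marks an idle qubit, the absolute value of an entry is the gate label from the vocabulary, negative entries mark control qubits, and positive entries mark target qubits. These conventions are inferred from how this notebook handles labels and may not match the `genQC` internals exactly; `decode_tokens`, `toy_vocab`, and `toy_tokens` are hypothetical names introduced for illustration.

```python
import numpy as np

# Hypothetical vocabulary in the same form as `vocab` above (label -> gate name)
toy_vocab = {1: "h", 2: "cx", 3: "z", 4: "x", 5: "ccx", 6: "swap"}


def decode_tokens(tokens, vocab):
    """Turn a (qubits x time steps) integer matrix into a list of (gate, controls, targets)."""
    gates = []
    for column in np.asarray(tokens).T:  # one column per time step
        labels = {abs(int(t)) for t in column if t != 0}
        if not labels:
            continue  # empty (padding) time step
        if len(labels) > 1:
            raise ValueError("more than one gate type in a single time step")
        name = vocab[labels.pop()]
        controls = [q for q, t in enumerate(column) if t < 0]
        targets = [q for q, t in enumerate(column) if t > 0]
        gates.append((name, controls, targets))
    return gates


# Toy circuit on 3 qubits: h on qubit 0, cx (control 0, target 1), z on qubit 1
toy_tokens = np.array([
    [1, -2, 0, 0],
    [0,  2, 3, 0],
    [0,  0, 0, 0],
])
decoded = decode_tokens(toy_tokens, toy_vocab)
print(decoded)
# [('h', [], [0]), ('cx', [0], [1]), ('z', [], [1])]
```

A real decoder would also need to check that each gate acts on the right number of qubits; that check is omitted here to keep the sketch short.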

### Convert tensors to CUDA-Q kernels

Next, we convert each generated tensor into a `cudaq.kernel`.

In [None]:
#cudaq.set_target("qpp-cpu")  # Note that cpu could be faster for 3-qubit kernels

cudaq.set_target('nvidia') # Set to GPU for larger circuits

Note that some generated tensors might not correspond to a valid kernel. For example, a generated tensor might encode a CNOT gate applied to a single qubit, or two different gates (say, an X and a Y gate) applied to different qubits in the same time step. Neither of these is a meaningful quantum kernel. Therefore, in the next code block, we keep only the valid tensors.

In [None]:
kernel_list = []
valid_tensors = []

invalid_tensors = 0
for out_tensors_i in tqdm(out_tensors):

    # Use a try-except to catch invalid tensors (if any)
    try:
        kernel = genqc_to_cudaq(out_tensors_i, vocab)  # Convert out_tensors to CUDA-Q kernels
    except Exception:  # Conversion fails for tensors that do not encode a valid circuit
        kernel = None

    if kernel:
        kernel_list.append(kernel)
        valid_tensors.append(out_tensors_i)
    else:
        invalid_tensors += 1

print(f"The model generated {invalid_tensors} invalid tensors that do not correspond to a circuit.")

In [None]:
valid_tensors[0]

Next, we verify whether the valid tensors contain only gates from the specified basis gate set.

In [None]:
# Labels of the allowed gates, plus 0 for identity (idle) entries
labels = [label for label, gate in vocab.items() if gate in set(basis_gates)] + [0]
# Labels 2 (cx) and 5 (ccx) are controlled gates: their control qubits carry the negated label
labels.extend([-gate for gate in [2, 5] if gate in labels])
allowed_labels = set(labels)

correct_tensors = []
correct_kernels = []

for tensor_i, kernel_i in zip(valid_tensors, kernel_list):
    if set(torch.unique(tensor_i).tolist()).issubset(allowed_labels):
        correct_tensors.append(tensor_i)
        correct_kernels.append(kernel_i)

print(f"The model generated {len(correct_kernels)} correct kernels that meet the prompted criteria.")

In [None]:
correct_tensors[0]

For example, the tensor encoding printed above corresponds to the following `cudaq.kernel`:

In [None]:
# Arbitrary input state to the circuit for plotting

input_state = [0] * (2**num_of_qubits)

print(cudaq.draw(correct_kernels[0], input_state))

### Evaluate generated circuits

As mentioned earlier, one of the key advantages of using diffusion models (DMs) as a unitary compiler is the ability to rapidly sample many circuits. However, as is common in machine learning, the model has a certain accuracy, meaning not all generated circuits are expected to exactly compile the specified unitary. In this section, we will evaluate how many of the generated circuits are indeed correct and then perform post-selection to identify (at least) one circuit that successfully performs the desired unitary operation.

### Simulate kernels

First, we calculate the $2^n\times2^n$ unitary matrix $U$ implemented by each of the kernels. The elements of this matrix are defined by the transition amplitudes between the basis states, which can be expressed as:
\begin{equation}
   \langle i|\,\text{kernel}\,|j\rangle = U_{ij},
\end{equation}
where $|i\rangle$ and $|j\rangle$ are computational basis states (typically in the $z$-basis), with $|i\rangle$ representing the standard basis vector of dimension $2^n$ that has a $1$ in the $i^{th}$ position and $0$ elsewhere.

In [None]:
def get_unitary(kernel: cudaq.PyKernel) -> np.ndarray:
    N = 2**num_of_qubits
    unitary = np.zeros((N, N), dtype=np.complex64)

    for j in range(N):
        basis_state_j = np.zeros(N, dtype=np.complex64)
        basis_state_j[j] = 1
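        # Column j of the unitary is the output state for the basis state |j>.
        # np.asarray avoids a copy when possible and still works when one is required.
        unitary[:, j] = np.asarray(cudaq.get_state(kernel, basis_state_j))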

    return unitary
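
The column-by-column construction used in `get_unitary` can be checked on a case where the answer is known in advance. The sketch below replaces the simulator call with a plain NumPy function (`apply_toy_circuit`, a hypothetical stand-in that applies a Hadamard followed by a Z gate on one qubit) and confirms that feeding in each basis state recovers the matrix product of the gates.

```python
import numpy as np

HADAMARD = np.array([[1, 1], [1, -1]], dtype=np.complex128) / np.sqrt(2)
PAULI_Z = np.array([[1, 0], [0, -1]], dtype=np.complex128)


def apply_toy_circuit(state: np.ndarray) -> np.ndarray:
    """Stand-in for a simulator call: apply H, then Z, to a single-qubit state."""
    return PAULI_Z @ (HADAMARD @ state)


def unitary_from_columns(apply_fn, dim: int) -> np.ndarray:
    """Build the matrix of a linear map by applying it to every computational basis state."""
    unitary = np.zeros((dim, dim), dtype=np.complex128)
    for j in range(dim):
        basis_state = np.zeros(dim, dtype=np.complex128)
        basis_state[j] = 1
        unitary[:, j] = apply_fn(basis_state)  # column j is the image of |j>
    return unitary


reconstructed = unitary_from_columns(apply_toy_circuit, dim=2)
print(np.allclose(reconstructed, PAULI_Z @ HADAMARD))
```

Note that gates applied later in the circuit appear further to the left in the matrix product.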

In [None]:
N = 2**num_of_qubits
got_unitaries = np.zeros((len(correct_kernels), N, N), dtype=np.complex64)

for i, kernel in tqdm(enumerate(correct_kernels), total=got_unitaries.shape[0]):
    got_unitaries[i, :, :] = get_unitary(kernel)

For example, the circuit printed above corresponds to the following unitary:

In [None]:
np.set_printoptions(linewidth=1000)
print(np.round(got_unitaries[0], 4))

### Compare unitaries

Now that we have the unitaries for each of the kernels, we compare them to the user provided unitary matrix, `U`.
To do so, we compute the infidelity between the exact unitary and the generated ones.
The infidelity is defined as follows:

\begin{equation}
\text{Infidelity}(U, V) = 1 -  \left|\frac{1}{2^n} \text{Tr} (U^\dagger V) \right|^2.
\end{equation}

The infidelity is a value between 0 and 1, where 0 indicates that the unitaries are identical (up to a global phase).

In [None]:
def infidelity(want_unitary, got_unitary):
    return 1 - np.abs(np.trace(np.conj(want_unitary).T @ got_unitary) / 2**num_of_qubits) ** 2

infidelities = np.array([infidelity(U, got_unitary) for got_unitary in got_unitaries])
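
Two properties of this infidelity are easy to verify numerically on small matrices: it ignores a global phase, and it reaches 1 when the trace overlap vanishes. The following sketch uses a stand-alone variant of the function above (`unitary_infidelity`, which reads the dimension from the matrix instead of from `num_of_qubits`); the phase value `0.7` is arbitrary.

```python
import numpy as np


def unitary_infidelity(want: np.ndarray, got: np.ndarray) -> float:
    """1 - |Tr(want^dagger @ got) / dim|^2 for square matrices of equal size."""
    dim = want.shape[0]
    return float(1 - np.abs(np.trace(want.conj().T @ got) / dim) ** 2)


identity = np.eye(2, dtype=np.complex128)
pauli_x = np.array([[0, 1], [1, 0]], dtype=np.complex128)
global_phase = np.exp(1j * 0.7)

# Same gate up to a global phase: infidelity is zero up to rounding
same_up_to_phase = unitary_infidelity(pauli_x, global_phase * pauli_x)

# Identity versus X: Tr(X) = 0, so the infidelity is 1
orthogonal = unitary_infidelity(identity, pauli_x)

print(same_up_to_phase, orthogonal)
```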

Let's now plot a histogram of the infidelities for the generated circuits that satisfy the gate set constraint:

In [None]:
plt.figure(figsize=(7, 4))
plt.title(
    f"Distribution of infidelities for {len(got_unitaries)} generated circuits",
    fontsize=12,
)
plt.ylabel("Number of circuits", fontsize=14)
plt.xlabel("Unitary infidelity", fontsize=14)
plt.hist(infidelities, bins=30)
plt.show()

Here we can find the kernel with the lowest infidelity:

In [None]:
min_index = np.argmin(infidelities)

print(f"The best kernel has an infidelity of {infidelities[min_index]:0.2},")

input_state = [0] * (2**num_of_qubits)
input_state[0] = 1
print(cudaq.draw(correct_kernels[min_index], input_state))

print("with the unitary:")
print(np.round(got_unitaries[min_index], 4))

When the infidelity is (numerically) zero, this matches our target unitary up to a global phase:

In [None]:
print(np.round(U, 4))

---
## Select a circuit that meets specific criteria

To identify optimal circuits, a practical approach is to find the circuit with the fewest CNOT gates (also known as `cx`). In our `vocab` definition above, we identified that `cx` corresponds to the label 2 in our tokenized tensors. Let's use this information to search for the circuit that minimizes the number of `cx` gates:

In [None]:
# First, we remove possible duplicates and only pick distinct circuits
# (stack the tensors and move them to the CPU so that NumPy can process them)
_, idx_unique = np.unique(torch.stack(correct_tensors).cpu().numpy(), axis=0, return_index=True)
unique_tensors = torch.stack(correct_tensors)[idx_unique]
unique_infidelities = infidelities[idx_unique]
unique_kernels = [correct_kernels[idx] for idx in idx_unique]

# Then, filter the distinct circuits
threshold = 0.5
idx_filtered = torch.argwhere(torch.tensor(unique_infidelities) < threshold).flatten().tolist()
filtered_tensors = unique_tensors[idx_filtered]
filtered_kernels = [unique_kernels[idx] for idx in idx_filtered]
filtered_infidelities = unique_infidelities[idx_filtered]
print(f"The model generated {filtered_tensors.shape[0]} distinct circuits with infidelity below {threshold}.")

# Now let's flatten the last two dimensions (related to the actual circuit)
# and count how many entries equal 2 (i.e., cx targets) each circuit has:
num_cx = (filtered_tensors.flatten(1, 2) == 2).sum(1)
print("These circuits have this number of cx gates:", num_cx.tolist())

# Get optimal/nonoptimal circuits (with minimum/maximum cx gates)
min_idx, max_idx = torch.argmin(num_cx), torch.argmax(num_cx)
print(f"Optimal circuit: index {min_idx}, {num_cx[min_idx]} cx gates, infidelity {filtered_infidelities[min_idx]:.6f}")
print(f"Nonoptimal circuit: index {max_idx}, {num_cx[max_idx]} cx gates, infidelity {filtered_infidelities[max_idx]:.6f}")
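
The deduplication step relies on `np.unique` with `axis=0` and `return_index=True`. The toy example below illustrates its behavior on three small token matrices, two of which are identical, and then counts the entries equal to 2 in each distinct matrix. The arrays are made up for illustration and follow the same assumed layout as the tensors in this notebook. One detail worth knowing is that `np.unique` returns the unique entries in sorted order, so the returned indices are not necessarily in the original order.

```python
import numpy as np

# Three toy "circuits" as (qubits x time steps) token matrices;
# the first and the third are identical
toy_circuits = np.array([
    [[4, 0], [0, 3]],
    [[1, -2], [0, 2]],
    [[4, 0], [0, 3]],
])

# unique_circuits holds the distinct matrices (sorted), first_index their
# first positions in toy_circuits
unique_circuits, first_index = np.unique(toy_circuits, axis=0, return_index=True)
print(first_index.tolist())

# Count the entries equal to 2 (cx targets) in each distinct circuit
flattened = toy_circuits[first_index].reshape(len(first_index), -1)
cx_counts = (flattened == 2).sum(axis=1)
print(cx_counts.tolist())
```

Because any per-circuit data (kernels, infidelities) is indexed with the same `first_index`, the different lists stay aligned even though the order changes.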

As the output shows, the circuits that pass the filter can differ in the number of `cx` gates they use. We can now print a few of these circuits to select the one that best suits our needs or to identify any noteworthy patterns the model employs for this specific unitary.

In [None]:
# Get the circuits with the lowest cx gate count
optimal_kernels = [filtered_kernels[idx] for idx in torch.argwhere(num_cx == num_cx[min_idx]).flatten()]

# Draw a few of these circuits
for kernel in optimal_kernels[:2]:
    print(cudaq.draw(kernel, input_state))

In [None]:
# Get the circuits with the highest cx gate count
nonoptimal_kernels = [filtered_kernels[idx] for idx in torch.argwhere(num_cx == num_cx[max_idx]).flatten()]

# Draw a few of these circuits
for kernel in nonoptimal_kernels[:2]:
    print(cudaq.draw(kernel, input_state))

---
## Compare circuits under the presence of noise

In this section, we'll define a `noise_model` and verify that a lower number of cx gates yields better results under this noise model.

Tips:
1. The `nvidia` backend in CUDA-Q supports [Trajectory Noisy Simulation](https://nvidia.github.io/cuda-quantum/latest/using/backends/sims/noisy.html#trajectory-noisy-simulation), so you can keep using `cudaq.set_target("nvidia")`.

2. The `cudaq.sample` function can take a noise model as an argument to perform a simulation with noise: `cudaq.sample(kernel, noise_model=noise_model)`

See [Noisy Simulation example](https://nvidia.github.io/cuda-quantum/latest/examples/python/noisy_simulations.html) for more details.

In [None]:
# Define a noise model 

def tensor(matrices):
    return functools.reduce(np.kron, matrices)

def depolarizing_kraus(p: float, n: int = 2):
    I = np.array([[1, 0], [0, 1]], dtype=np.complex128)
    X = np.array([[0, 1], [1, 0]], dtype=np.complex128)
    Y = np.array([[0, -1j], [1j, 0]], dtype=np.complex128)
    Z = np.array([[1, 0], [0, -1]], dtype=np.complex128)

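    single_qubit_paulis = [I, X, Y, Z]

    # Kraus operators: the identity with weight sqrt(1 - p), and each of the
    # 4**n - 1 non-identity Pauli strings with equal weight
    kraus_operators = [np.sqrt(1 - p) * tensor([I] * n)]
    coeff = np.sqrt(p / (4**n - 1))

    for pauli_string in itertools.product(single_qubit_paulis, repeat=n):
        if not all(np.array_equal(pauli, I) for pauli in pauli_string):
            kraus_operators.append(coeff * tensor(pauli_string))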

    return kraus_operators


noise_model = cudaq.NoiseModel()
noise_model.add_all_qubit_channel("cx", cudaq.KrausChannel(depolarizing_kraus(0.03)))
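
A set of Kraus operators describes a valid (trace-preserving) channel only if the sum of $K_k^\dagger K_k$ over all operators equals the identity. The sketch below checks this for a two-qubit depolarizing channel built the same way as above; it is a stand-alone NumPy re-implementation for illustration (`two_qubit_depolarizing` is a hypothetical helper name), not a CUDA-Q API call.

```python
import itertools

import numpy as np


def two_qubit_depolarizing(p: float):
    """Kraus operators of a two-qubit depolarizing channel with error probability p."""
    single = [
        np.eye(2, dtype=np.complex128),                    # I
        np.array([[0, 1], [1, 0]], dtype=np.complex128),   # X
        np.array([[0, -1j], [1j, 0]], dtype=np.complex128),  # Y
        np.array([[1, 0], [0, -1]], dtype=np.complex128),  # Z
    ]
    operators = []
    for a, b in itertools.product(range(4), repeat=2):
        # Identity string gets sqrt(1 - p); the 15 other Pauli strings share p equally
        weight = np.sqrt(1 - p) if (a, b) == (0, 0) else np.sqrt(p / 15)
        operators.append(weight * np.kron(single[a], single[b]))
    return operators


kraus_ops = two_qubit_depolarizing(0.03)

# Completeness relation: sum_k K_k^dagger K_k must equal the identity
completeness = sum(k.conj().T @ k for k in kraus_ops)
print(len(kraus_ops), np.allclose(completeness, np.eye(4)))
# 16 True
```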

In [None]:
# Sample with noiseless simulation
cudaq.set_target("nvidia")
shots_count = 1000

result = dict(cudaq.sample(filtered_kernels[0], input_state, shots_count=shots_count).items())

In [None]:
# Sample using noisy simulation for a kernel with the lowest number of cx gates
result_optimal = dict(
    cudaq.sample(
        optimal_kernels[0],
        input_state,
        noise_model=noise_model,
        shots_count=shots_count,
    ).items()
)

In [None]:
# Sample using noisy simulation for a kernel with more cx gates
result_nonoptimal = dict(
    cudaq.sample(
        nonoptimal_kernels[0],
        input_state,
        noise_model=noise_model,
        shots_count=shots_count,
    ).items()
)

In [None]:
# Merge all bitstrings to ensure consistency across results
bitstrings = sorted(set(result_optimal.keys()) | set(result.keys()) | set(result_nonoptimal.keys()))

# Function to extract probabilities
def get_probabilities(result, keys):
    total_shots = sum(result.values())
    return [result.get(k, 0) / total_shots for k in keys]

# Extracting probabilities
prob = get_probabilities(result, bitstrings)
prob_optimal = get_probabilities(result_optimal, bitstrings)
prob_nonoptimal = get_probabilities(result_nonoptimal, bitstrings)

# Bar width
bar_width = 0.3
x = np.arange(len(bitstrings))

# Plot bars
plt.figure(figsize=(10, 6))
plt.bar(x - bar_width, prob, bar_width, label="Noiseless simulation (on GPU)", color="#808080")
plt.bar(x, prob_optimal, bar_width, label="Noisy simulation for optimal kernel", color="#76B900")
plt.bar(x + bar_width, prob_nonoptimal, bar_width, label="Noisy simulation for nonoptimal kernel", color="#c4e884")

# Labels
plt.xticks(x, bitstrings)
plt.xlabel("Bitstring Outcomes")
plt.ylabel("Probability")
plt.title("Comparison of sampling simulations")
plt.legend(fontsize=14)

# Show plot
plt.show()
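
Besides comparing the bars visually, the effect of noise can be summarized in a single number. One possible choice (an addition for illustration, not part of the original workflow) is the total variation distance between two sampled distributions, which is 0 for identical distributions and 1 for distributions with no overlap. The counts below are made up and are not results from this notebook.

```python
def total_variation_distance(counts_a: dict, counts_b: dict) -> float:
    """Half the L1 distance between the two empirical distributions given as count dictionaries."""
    shots_a = sum(counts_a.values())
    shots_b = sum(counts_b.values())
    keys = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(key, 0) / shots_a - counts_b.get(key, 0) / shots_b)
        for key in keys
    )


# Made-up counts for illustration only
toy_ideal = {"000": 500, "100": 500}
toy_noisy = {"000": 480, "100": 470, "010": 50}

distance = total_variation_distance(toy_ideal, toy_noisy)
print(round(distance, 4))
# 0.05
```

Applied to the dictionaries above, for example `total_variation_distance(result, result_optimal)` and `total_variation_distance(result, result_nonoptimal)`, a smaller value would indicate that the noisy distribution stays closer to the noiseless one. Keep in mind that with a finite number of shots, part of the distance comes from sampling fluctuations alone.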