# Abstract

Variational Quantum Algorithms (VQAs) are becoming increasingly significant with the advancement of quantum computing technologies. However, the benchmarking of such algorithms remains fragmented, with most existing benchmarks relying either on classical datasets mapped to quantum representations or on custom datasets created for specific research purposes. This lack of standardization hinders fair and reproducible evaluation of quantum machine learning pipelines. In this project, we address this gap by revamping the Datasets module of the Qiskit Machine Learning repository. Our contributions include the systematic standardization of dataset generators, enabling consistent and reproducible benchmarking of VQAs. This work lays the foundation for a more robust evaluation framework and facilitates future developments in quantum machine learning research.

# Motivation & Previous Work

Recent literature argues that a meaningful search for quantum advantage should start from natively quantum datasets. When a feature map is used to encode classical data into quantum states, any benchmark of the pipeline inevitably also measures the performance of that feature map. As a step in this direction, the Qiskit Machine Learning repository already provided a natively quantum toy dataset, `ad_hoc_data`.

While the original implementation was constrained to 2 and 3 qubits, our refactoring generalizes this dataset generator to arbitrary qubit counts while preserving its core mathematical structure. The dataset encodes data vectors $\vec{x} \in (0, 2\pi]^n$ through a parameterized quantum circuit:


$$U_{\Phi}(\vec{x}) = \exp\left(i\sum_{S \subseteq [n]}\phi_S(\vec{x})\prod_{i\in S}Z_i\right)$$

where $\phi_{\{i,j\}} = (\pi-x_i)(\pi-x_j)$ and $\phi_{\{i\}} = x_i$. Labels are then assigned by the expression below, where $V$ is a random unitary:

$$m(\vec{x}) = \text{sign}\left(\langle\Phi(\vec{x})|V^\dagger\left(\prod_i Z_i\right)V|\Phi(\vec{x})\rangle\right)$$
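
For intuition about $U_{\Phi}(\vec{x})$: every term in the exponent is a product of $Z$ operators, so the unitary is diagonal in the computational basis and acts on a basis state as a pure phase. The sketch below evaluates that phase angle in plain Python. It assumes that only the single-qubit and two-qubit terms defined above are non-zero and that every pair of qubits is coupled; `feature_phase` is a hypothetical helper for illustration and is not part of `qiskit_machine_learning`.

```python
import itertools
import math


def feature_phase(x, bits):
    """Angle theta such that U_Phi(x)|bits> = exp(i * theta)|bits>.

    Assumption: only subsets S of size 1 and 2 contribute.
    """
    # Z_i has eigenvalue +1 on |0> and -1 on |1>.
    z = [1 - 2 * b for b in bits]
    # Single-qubit terms: phi_{i}(x) = x_i
    theta = sum(x_i * z_i for x_i, z_i in zip(x, z))
    # Two-qubit terms: phi_{i,j}(x) = (pi - x_i)(pi - x_j)
    for i, j in itertools.combinations(range(len(x)), 2):
        theta += (math.pi - x[i]) * (math.pi - x[j]) * z[i] * z[j]
    return theta


# At x = (pi, pi) the pair term vanishes, leaving only the single-qubit terms.
print(feature_phase([math.pi, math.pi], (0, 0)))
```

The number of pair terms grows quadratically with the number of qubits, which is one reason a generator for arbitrary `n` has to build them programmatically rather than hard-code them.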

Below is an example call of the refactored `ad_hoc_data`, which runs for any number of qubits:

In [3]:
from qiskit_machine_learning.datasets import ad_hoc_data
x_train, _, _, _ = ad_hoc_data(
    training_size = 4,
    test_size = 2,
    n = 4,
    gap = 0.2,
    plot_data = False,
    one_hot = True,
    include_sample_total = False,
    entanglement = "full",
    sampling_method = "sobol",
    divisions = 0,
    labelling_method = "expectation",
    class_labels = None)
x_train


Following this, we propose three new datasets:
1. Entanglement Concentration Dataset for Binary Classification pipelines: https://github.com/qiskit-community/qiskit-machine-learning/pull/915
2. Phase Of Matter Dataset for Multi-class Classification pipelines: https://github.com/qiskit-community/qiskit-machine-learning/pull/918
3. H-Molecule Evolution Dataset for Fast Forwarding pipelines: https://github.com/qiskit-community/qiskit-machine-learning/pull/916

# Entanglement Concentration Dataset

One of the intrinsic properties of quantum data is correlation between qubits. Louis Schatzki et al. introduced the NTangled dataset, which consists of quantum states with varied multipartite entanglement types and degrees. These states are generated by parameterized quantum circuits (PQCs) trained to produce specific entanglement levels, quantified by measures such as concentratable entanglement (CE). Algorithms can then be benchmarked on separating the different entanglement types into classes.

This dataset is constructed to benchmark binary classification tasks and has two modes:
- `"easy"`: uses CE values 0.18 and 0.40 for n = 3, and 0.12 and 0.43 for n = 4.
- `"hard"`: uses CE values 0.28 and 0.40 for n = 3, and 0.22 and 0.34 for n = 4.

The generation process involves three key steps:
1. Circuit Initialization: Loading pre-trained weights from the repository to configure the PQCs.
2. Input Preparation: Sampling product states (for example, random computational basis states) as inputs.
3. State Evolution: Passing inputs through the initialized circuits to output entangled states.



## Circuit Initialization
The snippet below shows the structure of the ansatz used.

In [None]:
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator, Statevector
from qiskit.circuit import ParameterVector

def _hardware_efficient_ansatz(
    qc: QuantumCircuit, 
    params: ParameterVector, 
    n_qubits: int, 
    depth: int
) -> None:
    """Append a hardware-efficient ansatz layer-by-layer to the quantum circuit."""

    p = iter(params) 

    for _ in range(depth):
        for q in range(n_qubits):
            qc.rx(next(p), q)
            qc.ry(next(p), q)
            qc.rz(next(p), q)

        for q in range(n_qubits - 1):
            qc.cx(q, q + 1)
        if n_qubits > 1:
            qc.cx(n_qubits - 1, 0)
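
Each layer of the ansatz above consumes three rotation angles per qubit and then applies a ring of CNOTs, so `params` must hold `3 * n_qubits * depth` entries; with fewer, `next(p)` raises `StopIteration`. The small helper below makes that bookkeeping explicit. `ansatz_layout` is a hypothetical name introduced here for illustration, not a function from the repository.

```python
def ansatz_layout(n_qubits, depth):
    """Return (number of rotation parameters, CX pairs applied in one layer)."""
    # Three rotations (rx, ry, rz) per qubit per layer.
    n_params = 3 * n_qubits * depth
    # Nearest-neighbour CNOT chain ...
    cx_pairs = [(q, q + 1) for q in range(n_qubits - 1)]
    # ... closed into a ring when there is more than one qubit.
    if n_qubits > 1:
        cx_pairs.append((n_qubits - 1, 0))
    return n_params, cx_pairs


n_params, cx_pairs = ansatz_layout(n_qubits=3, depth=2)
print(n_params, cx_pairs)
```

The parameter count is the length one would pass when creating the `ParameterVector` for the circuit.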

## Input Preparation
Single-qubit states are sampled on the Bloch sphere and combined via Kronecker (tensor) products, which ensures that the input states carry no entanglement to begin with. Two sampling options are available. Isotropic sampling draws each qubit state uniformly at random on the Bloch sphere and takes the tensor product over all qubits to build the input state. Cardinal sampling restricts each qubit to states lying on the axes of the Bloch sphere before taking the tensor product. The snippet below demonstrates isotropic sampling:

In [None]:
import numpy as np
def isotropic(n_qubits: int, n_points: int) -> np.ndarray:
    """Sample single-qubit states uniformly on the Bloch sphere and return their tensor products."""
    rng = np.random
    # Uniform sampling on the sphere
    z = rng.uniform(-1, 1, size=(n_points, n_qubits))
    phi = rng.uniform(0, 2 * np.pi, size=(n_points, n_qubits))
    theta = np.arccos(z)
    cos = np.cos(theta / 2)
    sin = np.sin(theta / 2)
    q_vectors = np.stack(
        [cos, sin * np.exp(1j * phi)],
        axis=-1
    )
    # Broadcast-and-Product
    ints   = np.arange(2**n_qubits, dtype=np.uint16)[:, None]
    bits   = ((ints >> np.arange(n_qubits)) & 1).astype(np.int8)
    labels = np.flip(bits, axis=1)
    picked = np.take_along_axis(
        q_vectors[:, None, :, :],
        labels[None, :, :, None],  
        axis=3,
    )  
    amplitudes = picked.squeeze(-1).prod(axis=2)
    return amplitudes[:, :, None]
isotropic(2, 2)
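
For comparison, cardinal sampling can be sketched in plain Python as follows. This is a minimal illustration and not the repository's implementation: it assumes that "states on the axes" means the six eigenstates of the Pauli X, Y and Z operators, and the names `CARDINAL_STATES`, `kron` and `cardinal` are hypothetical.

```python
import math
import random

S = 1 / math.sqrt(2)

# Assumed cardinal set: the six states where the x, y and z axes meet the Bloch sphere.
CARDINAL_STATES = [
    (1, 0),        # |0>,  +z
    (0, 1),        # |1>,  -z
    (S, S),        # |+>,  +x
    (S, -S),       # |->,  -x
    (S, 1j * S),   # |+i>, +y
    (S, -1j * S),  # |-i>, -y
]


def kron(a, b):
    """Kronecker product of two amplitude vectors."""
    return [x * y for x in a for y in b]


def cardinal(n_qubits, n_points, seed=None):
    """Return n_points product states, each a list of 2**n_qubits amplitudes."""
    rng = random.Random(seed)
    states = []
    for _ in range(n_points):
        state = [1]
        # Pick one cardinal state per qubit and tensor them together.
        for _ in range(n_qubits):
            state = kron(state, rng.choice(CARDINAL_STATES))
        states.append(state)
    return states


samples = cardinal(n_qubits=2, n_points=3, seed=0)
print(len(samples), len(samples[0]))
```

Because each sample is built purely from tensor products of single-qubit states, it is a product state by construction, just as in the isotropic case.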

## State Evolution
Finally, the input states are passed through two different circuits, one for class A and one for class B.

In [None]:
from qiskit_machine_learning.datasets import entanglement_concentration_data
count = 5
x_train, _, _, _ = entanglement_concentration_data(
    training_size=count, test_size=0, n=3, mode="easy", formatting="statevector"
)

## Concentration of Entanglement
The snippet below verifies that the generated states have the expected CE:

In [None]:
import itertools

import numpy as np
from qiskit.quantum_info import partial_trace

def _compute_ce(sv):
    """Compute CE from the closed-form expression due to Beckey, J. L. et al.
    (alternatively, a SWAP test can be used if done in a quantum circuit)."""
    n = sv.num_qubits
    # Density matrix |psi><psi| of the pure state, as a NumPy array
    rho = sv.to_operator().data
    ce_sum = 0.0
    # Loop over all non-empty subsets of qubit indices
    qubit_indices = list(range(n))
    for r in range(1, n + 1):
        for subset in itertools.combinations(qubit_indices, r):
            # Trace out the complement to get the reduced density matrix on the subset
            traced_out = [i for i in qubit_indices if i not in subset]
            reduced_rho = partial_trace(rho, traced_out)
            # Purity is real; drop any residual imaginary part
            ce_sum += np.real(reduced_rho.purity())
    ce = 1 - (ce_sum / (2**n))
    return ce

# The first `count` samples are expected to be the low-CE class, the next `count` the high-CE class
low_ce = np.mean([_compute_ce(x_train[i]) for i in range(count)])
high_ce = np.mean([_compute_ce(x_train[i + count]) for i in range(count)])
print(low_ce, high_ce)
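
As a library-independent cross-check, the sketch below evaluates the same purity sum in plain Python for small pure states given as amplitude lists. By default it mirrors the subset loop of `_compute_ce` above, which runs over non-empty subsets only. The `include_empty` flag is our own hypothetical addition: it also counts the empty subset with purity 1, which, to our understanding, is the convention in Beckey et al. and lowers every value by $1/2^n$. The function names are illustrative and not part of the repository.

```python
import itertools
import math


def reduced_purity(psi, n, keep):
    """Purity Tr[rho_A^2] of the reduced state on qubits `keep` of a pure state psi."""
    keep = list(keep)
    rest = [q for q in range(n) if q not in keep]

    def index(bits_keep, bits_rest):
        # Qubit q corresponds to bit q of the basis-state index.
        idx = 0
        for q, b in zip(keep, bits_keep):
            idx |= b << q
        for q, b in zip(rest, bits_rest):
            idx |= b << q
        return idx

    keep_vals = list(itertools.product((0, 1), repeat=len(keep)))
    rest_vals = list(itertools.product((0, 1), repeat=len(rest)))
    purity = 0.0
    for a in keep_vals:
        for a2 in keep_vals:
            # rho_A[a, a2] = sum_b psi[a, b] * conj(psi[a2, b])
            elem = sum(psi[index(a, b)] * psi[index(a2, b)].conjugate() for b in rest_vals)
            purity += abs(elem) ** 2
    return purity


def concentratable_entanglement(psi, n, include_empty=False):
    """1 - (sum of subset purities) / 2**n, optionally counting the empty subset."""
    start = 0 if include_empty else 1
    total = sum(
        reduced_purity(psi, n, subset)
        for r in range(start, n + 1)
        for subset in itertools.combinations(range(n), r)
    )
    return 1.0 - total / 2**n


bell = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
product = [1.0, 0.0, 0.0, 0.0]
print(concentratable_entanglement(bell, 2), concentratable_entanglement(product, 2))
```

With the default setting a two-qubit product state evaluates to $1/4$ and a Bell state to $1/2$; with `include_empty=True` these become $0$ and $1/4$. The choice of convention therefore matters when comparing computed values against CE targets quoted elsewhere.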