# Abstract

Variational Quantum Algorithms (VQAs) are becoming increasingly significant with the advancement of quantum computing technologies. However, the benchmarking of such algorithms remains fragmented, with most existing benchmarks relying either on classical datasets mapped to quantum representations or on custom datasets created for specific research purposes. This lack of standardization hinders fair and reproducible evaluation of quantum machine learning pipelines. In this project, we address this gap by revamping the Datasets module of the Qiskit Machine Learning repository. Our contributions include the systematic standardization of dataset generators, enabling consistent and reproducible benchmarking of VQAs. This work lays the foundation for a more robust evaluation framework and facilitates future developments in quantum machine learning research.

# Motivation & Previous Work

Recent literature argues that a meaningful search for quantum advantage should start from natively quantum datasets. When a feature map is used to encode classical data into quantum states, any benchmark of the pipeline inevitably also measures the performance of that feature map. As a step in this direction, the Qiskit Machine Learning repository already provided a natively quantum toy dataset, `ad_hoc_data`.

While the original implementation was constrained to 2 and 3 qubits, our refactoring generalizes this dataset generator to arbitrary qubit counts while preserving its core mathematical structure. The dataset encodes data vectors $\vec{x} \in (0, 2\pi]^n$ through a parameterized quantum circuit:


$$U_{\Phi}(\vec{x}) = \exp\left(i\sum_{S \subseteq [n]}\phi_S(\vec{x})\prod_{i\in S}Z_i\right)$$

where $\phi_{\{i\}} = x_i$ and $\phi_{\{i,j\}} = (\pi-x_i)(\pi-x_j)$. Labels are then assigned by the expression below, where $V$ is a random unitary:

$$m(\vec{x}) = \text{sign}\left(\langle\Phi(\vec{x})|V^\dagger\left(\prod_i Z_i\right)V|\Phi(\vec{x})\rangle\right)$$
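To make the two formulas above concrete, here is a minimal standard-library sketch, not the repository's implementation. It relies on the fact that every term in the exponent is a product of Pauli-$Z$ operators, so $U_{\Phi}(\vec{x})$ is diagonal in the computational basis. Several things are assumptions made for illustration: only the single- and two-qubit terms of the sum are non-zero, every pair $(i, j)$ contributes, $|\Phi(\vec{x})\rangle$ is prepared by a single layer of $U_{\Phi}(\vec{x})$ acting on the uniform superposition, and an expectation value of exactly zero is mapped to $+1$. The function names are hypothetical, and the basis ordering puts qubit 0 as the leftmost bit, which differs from Qiskit's little-endian convention.

```python
import cmath
import math
import random
from itertools import combinations, product


def feature_map_diagonal(x):
    """Diagonal of U_Phi(x) in the computational basis (qubit 0 = leftmost bit)."""
    n = len(x)
    diagonal = []
    for bits in product((0, 1), repeat=n):
        z = [1 - 2 * b for b in bits]  # Z eigenvalue: +1 on |0>, -1 on |1>
        # Single-qubit terms phi_{i} = x_i.
        phase = sum(x[i] * z[i] for i in range(n))
        # Two-qubit terms phi_{i,j} = (pi - x_i)(pi - x_j); all pairs assumed.
        phase += sum(
            (math.pi - x[i]) * (math.pi - x[j]) * z[i] * z[j]
            for i, j in combinations(range(n), 2)
        )
        diagonal.append(cmath.exp(1j * phase))
    return diagonal


def random_unitary(dim, rng):
    """Random unitary (as a list of orthonormal rows) via modified Gram-Schmidt."""
    rows = []
    for _ in range(dim):
        v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
        for u in rows:
            overlap = sum(a.conjugate() * b for a, b in zip(u, v))
            v = [b - overlap * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in v))
        rows.append([c / norm for c in v])
    return rows


def parity_label(state, unitary):
    """sign(<state| V^dagger (Z x ... x Z) V |state>), with 0 mapped to +1."""
    rotated = [sum(r * s for r, s in zip(row, state)) for row in unitary]
    # The parity operator is diagonal: +1 for an even number of 1 bits, else -1.
    expectation = sum(
        (-1) ** bin(k).count("1") * abs(amp) ** 2 for k, amp in enumerate(rotated)
    )
    return 1 if expectation >= 0 else -1


rng = random.Random(7)
x = [rng.uniform(0, 2 * math.pi) for _ in range(3)]
dim = 2 ** len(x)
# Assumed state preparation: one layer, U_Phi(x) applied to the uniform superposition.
state = [d / math.sqrt(dim) for d in feature_map_diagonal(x)]
label = parity_label(state, random_unitary(dim, rng))
print(x, label)
```

The nested loops scale as $2^n$ per sample and $4^n$ for the unitary, so this is only practical for a handful of qubits; it is meant to show the structure of the labelling rule rather than to reproduce the generator's performance.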

Below is an example call of our refactored `ad_hoc_data`, which can run for any number of qubits:

    from qiskit_machine_learning.datasets import ad_hoc_data


    31.882813 s
    (200000, 256, 1) (200000, 2)
    [[[-0.03102683+0.01821032j]
      [ 0.03273911+0.06978137j]
      [-0.01073349-0.01157602j]
      ...

Following this, we propose three new datasets