Apple Silicon MPS issues on macOS: parametric gates fail due to mixed float/complex stack, and dense circuits fail around 16 qubits #149

@flyingparanoia

Hi, thanks for maintaining DeepQuantum.

I tested deepquantum on an Apple Silicon Mac and found two MPS-related problems that look reproducible with the current release combination:

  • deepquantum==4.4.0
  • torch==2.10.0
  • macOS on Apple M2 Ultra
  • Python 3.11.11

PyTorch MPS is available in this environment:

import torch
print(torch.__version__)
print(torch.backends.mps.is_built())
print(torch.backends.mps.is_available())

Output:

2.10.0
True
True

Summary

There seem to be two separate issues on mps:

  1. Parametric single-qubit gates such as Rx fail even for very small circuits.
  2. Dense statevector simulation on mps fails at 16 qubits for a simple GHZ-style circuit (14 qubits still works).

In contrast, the same circuits run on CPU.

Issue 1: Rx fails on MPS because of mixed float / complex stacking

Minimal reproduction

import torch
import deepquantum as dq

print("torch", torch.__version__)
print("mps available", torch.backends.mps.is_available())

cir = dq.QubitCircuit(2)
cir.rx(0, 0.1)
cir.to("mps")
cir()

Observed error

RuntimeError: Failed to create function state object for: cat_int32_t_float_float2

Likely cause

From local inspection, Rx.get_matrix() currently does:

theta = self.inputs_to_tensor(theta)
cos  = torch.cos(theta / 2)
isin = torch.sin(theta / 2) * 1j
return torch.stack([cos, -isin, -isin, cos]).reshape(2, 2)

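Here cos is float32 while isin is complex64, because multiplying by 1j promotes the sine term to complex. torch.stack([...]) over that mixed-dtype list works on CPU but fails on MPS with the error above.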

I verified that a temporary local patch which casts both branches to complex64 before stacking makes the parametric circuit work on MPS up to 14 qubits in my test:

def patched_get_matrix(self, theta):
    theta = self.inputs_to_tensor(theta)
    # Cast both entries to complex64 so torch.stack sees a single dtype.
    cos = torch.cos(theta / 2).to(torch.complex64)
    isin = torch.sin(theta / 2).to(torch.complex64) * 1j
    return torch.stack([cos, -isin, -isin, cos]).reshape(2, 2)

This suggests the first issue is likely fixable inside DeepQuantum.
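
As a standalone illustration of the same idea, independent of DeepQuantum, the sketch below builds an Rx matrix with a uniform complex64 dtype and checks it on CPU. The function name rx_matrix and its device parameter are hypothetical and are not part of the DeepQuantum API; only the casting pattern mirrors the local patch.

```python
import torch


def rx_matrix(theta: float, device: str = "cpu") -> torch.Tensor:
    """Hypothetical standalone Rx(theta) matrix; all entries are complex64 before stacking."""
    t = torch.tensor(theta, dtype=torch.float32, device=device)
    cos = torch.cos(t / 2).to(torch.complex64)
    isin = torch.sin(t / 2).to(torch.complex64) * 1j
    return torch.stack([cos, -isin, -isin, cos]).reshape(2, 2)


# The dtype mismatch in the unpatched code, shown on CPU where stacking still works.
t = torch.tensor(0.1)
print(torch.cos(t / 2).dtype)         # real-valued entry
print((torch.sin(t / 2) * 1j).dtype)  # complex-valued entry

# With uniform dtypes the result is a complex64 unitary matrix.
m = rx_matrix(0.1)
identity = torch.eye(2, dtype=torch.complex64)
print(m.dtype, torch.allclose(m @ m.conj().T, identity, atol=1e-6))
```

Passing device="mps" would exercise the same code path on Apple Silicon; that variant is not verified here.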

Issue 2: dense statevector execution fails at 16 qubits on Apple MPS

Minimal reproduction

import deepquantum as dq

cir = dq.QubitCircuit(16)
cir.h(0)
for i in range(15):
    cir.cnot(i, i + 1)
cir.to("mps")
cir()

Observed error

RuntimeError: MPS supports tensors with dimensions <= 16, but got 17.

Additional observation

  • 14 qubits works for a fixed GHZ circuit on MPS in my test.
  • 16 qubits fails.
  • CPU works for the same circuit up to at least 24 qubits in my local probe.

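This second issue may be a PyTorch MPS backend limitation rather than a pure DeepQuantum bug: DeepQuantum reshapes the state into a high-rank tensor during evolution, and the reported rank of 17 at 16 qubits is consistent with one axis per qubit plus one extra (presumably batch) axis, which exceeds the 16-dimension limit in the error message. However, it would still be very helpful if DeepQuantum could either: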

  • detect this case and fall back to CPU automatically, or
  • raise a clearer compatibility warning for Apple MPS.
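
A minimal sketch of what such a guard could look like, written as plain Python so it runs without MPS hardware. It assumes the state tensor has one axis per qubit plus one extra axis, which matches the 17 in the error message at 16 qubits but is not confirmed against the DeepQuantum source. The names MPS_MAX_RANK, state_rank, and choose_device are hypothetical.

```python
import warnings

# Assumption: limit taken from the error text "MPS supports tensors with dimensions <= 16".
MPS_MAX_RANK = 16


def state_rank(nqubit: int, extra_axes: int = 1) -> int:
    """Rank of the state tensor, assuming one axis per qubit plus `extra_axes`."""
    return nqubit + extra_axes


def choose_device(nqubit: int, requested: str = "mps") -> str:
    """Return `requested`, or "cpu" when the state tensor would exceed the MPS rank limit."""
    if requested == "mps" and state_rank(nqubit) > MPS_MAX_RANK:
        warnings.warn(
            f"{nqubit}-qubit dense state needs a rank-{state_rank(nqubit)} tensor; "
            f"MPS supports at most rank {MPS_MAX_RANK}. Falling back to CPU.",
            RuntimeWarning,
        )
        return "cpu"
    return requested


print(choose_device(14))  # mps
print(choose_device(16))  # cpu, after emitting a RuntimeWarning
```

Under this assumption 15 qubits would still pass the guard (rank 16); that case was not covered by my probe, which stepped in increments of two.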

Larger probe results

I ran a small probe comparing CPU vs MPS for two circuit types:

  • fixed_ghz: H + CNOT chain + expectation
  • param_ghz: fixed_ghz plus one Rx on each qubit

Observed behavior:

  • fixed_ghz on CPU: successful up to 24 qubits
  • fixed_ghz on MPS: successful up to 14 qubits, failed at 16 qubits
  • param_ghz on CPU: successful up to 24 qubits
  • param_ghz on MPS: already failed at 2 qubits with the cat_int32_t_float_float2 error

Selected console output:

[fixed_ghz]
  Device: mps
     2 qubits -> ok
     4 qubits -> ok
     6 qubits -> ok
     8 qubits -> ok
    10 qubits -> ok
    12 qubits -> ok
    14 qubits -> ok
    16 qubits -> FAIL (RuntimeError: MPS supports tensors with dimensions <= 16, but got 17.)

[param_ghz]
  Device: mps
     2 qubits -> FAIL (RuntimeError: Failed to create function state object for: cat_int32_t_float_float2)
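
For reference, a probe loop of this shape can be sketched as follows. This is not the exact script I ran: the probe function is a generic harness, and stub_run is a stand-in that only mimics the rank limit from the error message so that the block runs without MPS hardware or DeepQuantum installed. Both names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def probe(
    name: str,
    run: Callable[[int, str], None],
    device: str,
    qubit_counts: Iterable[int],
) -> List[Tuple[int, str]]:
    """Call run(nqubit, device) for each size and stop at the first RuntimeError."""
    results = []
    print(f"[{name}]")
    print(f"  Device: {device}")
    for n in qubit_counts:
        try:
            run(n, device)
            status = "ok"
        except RuntimeError as exc:
            status = f"FAIL (RuntimeError: {exc})"
        print(f"  {n:4d} qubits -> {status}")
        results.append((n, status))
        if status != "ok":
            break
    return results


def stub_run(nqubit: int, device: str) -> None:
    """Stand-in for building and running a circuit; mimics the reported rank limit."""
    rank = nqubit + 1  # assumption: one axis per qubit plus one extra axis
    if device == "mps" and rank > 16:
        raise RuntimeError(f"MPS supports tensors with dimensions <= 16, but got {rank}.")


results = probe("fixed_ghz (stub)", stub_run, "mps", range(2, 26, 2))
```

To probe DeepQuantum itself, stub_run would be replaced by a function that builds dq.QubitCircuit(nqubit), adds the gates, and calls cir.to(device) and cir() as in the reproductions above.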

Environment

macOS: Darwin 25.3.0
Machine: arm64
Hardware: Apple M2 Ultra
Python: 3.11.11
torch: 2.10.0
deepquantum: 4.4.0
cuda available: False
mps built: True
mps available: True

Request

Would you consider:

  1. fixing the parametric gate matrix construction for MPS by avoiding mixed float/complex torch.stack paths, and
  2. documenting or guarding the dense high-qubit MPS limitation on Apple Silicon?

If you prefer, I can split this into two separate issues, since the first looks like a library bug while the second may partly come from a PyTorch MPS backend limitation.


Labels: bug (Something isn't working)
