Apple Silicon MPS issues on macOS: parametric gates fail due to mixed float/complex stack, and dense circuits fail around 16 qubits #149

Hi, thanks for maintaining DeepQuantum.

I tested `deepquantum` on an Apple Silicon Mac and found two problems with the PyTorch MPS (Metal Performance Shaders) device that reproduce with the following versions:

- `deepquantum==4.4.0`
- `torch==2.10.0`
- macOS on Apple M2 Ultra
- Python 3.11.11

PyTorch MPS is available in this environment:

```python
import torch
print(torch.__version__)
print(torch.backends.mps.is_built())
print(torch.backends.mps.is_available())
```

Output:

```text
2.10.0
True
True
```

## Summary

There seem to be two separate issues on `mps`:

1. Parametric single-qubit gates such as `Rx` fail even for very small circuits.
2. Dense statevector simulation on `mps` fails around 16 qubits for a simple GHZ-style circuit.

In contrast, the same circuits run on CPU.

## Issue 1: `Rx` fails on MPS because of mixed float / complex stacking

### Minimal reproduction

```python
import torch
import deepquantum as dq

print("torch", torch.__version__)
print("mps available", torch.backends.mps.is_available())

cir = dq.QubitCircuit(2)
cir.rx(0, 0.1)
cir.to("mps")
cir()
```

### Observed error

```text
RuntimeError: Failed to create function state object for: cat_int32_t_float_float2
```

### Likely cause

From local inspection, `Rx.get_matrix()` currently does:

```python
theta = self.inputs_to_tensor(theta)
cos  = torch.cos(theta / 2)
isin = torch.sin(theta / 2) * 1j
return torch.stack([cos, -isin, -isin, cos]).reshape(2, 2)
```

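Here `cos` is `float32` while `isin` is `complex64`. The same code runs on CPU, but on MPS `torch.stack([...])` fails with that mixture. The kernel name in the error (`cat_int32_t_float_float2`) appears to refer to a concatenation from `float` inputs to a `float2` (complex) output.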

I verified that a temporary local patch which casts both branches to `complex64` before stacking makes the parametric circuit work on MPS up to 14 qubits in my test:

```python
import torch

def patched_get_matrix(self, theta):
    theta = self.inputs_to_tensor(theta)
    # Cast both terms to complex64 before stacking so torch.stack sees a single dtype.
    cos = torch.cos(theta / 2).to(torch.complex64)
    isin = torch.sin(theta / 2).to(torch.complex64) * 1j
    return torch.stack([cos, -isin, -isin, cos]).reshape(2, 2)
```

This suggests the first issue is likely fixable inside DeepQuantum.
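A dtype-preserving variant of that patch could look like the sketch below. It is a standalone function, not DeepQuantum's actual implementation: `rx_matrix` and `_COMPLEX_FOR` are hypothetical names, and it assumes `theta` is already a tensor (the role `inputs_to_tensor` plays in the library). The only idea it illustrates is picking the complex dtype that matches the input precision and casting every entry before `torch.stack`.

```python
import torch

# Hypothetical mapping, not part of DeepQuantum: real dtype -> matching complex dtype.
_COMPLEX_FOR = {torch.float32: torch.complex64, torch.float64: torch.complex128}


def rx_matrix(theta: torch.Tensor) -> torch.Tensor:
    """Build the 2x2 Rx matrix with all entries cast to one complex dtype."""
    # Fall back to complex64 for any dtype not listed above.
    cdtype = _COMPLEX_FOR.get(theta.dtype, torch.complex64)
    half = theta / 2
    cos = torch.cos(half).to(cdtype)
    isin = torch.sin(half).to(cdtype) * 1j
    # Every element is already complex, so stack never sees mixed dtypes.
    return torch.stack([cos, -isin, -isin, cos]).reshape(2, 2)


# Runs on CPU as written; create theta with device="mps" to exercise the MPS path.
theta = torch.tensor(0.1)
m = rx_matrix(theta)
print(m.dtype)
# Sanity check: Rx should be unitary.
print(torch.allclose(m @ m.conj().T, torch.eye(2, dtype=m.dtype)))
```

Keeping `complex128` for `float64` inputs avoids silently lowering precision on CPU, which a hard-coded `complex64` cast would do.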

## Issue 2: dense statevector simulation fails around 16 qubits on the Apple MPS device

### Minimal reproduction

```python
import deepquantum as dq

cir = dq.QubitCircuit(16)
cir.h(0)
for i in range(15):
    cir.cnot(i, i + 1)
cir.to("mps")
cir()
```

### Observed error

```text
RuntimeError: MPS supports tensors with dimensions <= 16, but got 17.
```

### Additional observation

- 14 qubits works for a fixed GHZ circuit on MPS in my test.
- 16 qubits fails.
- CPU works for the same circuit up to at least 24 qubits in my local probe.

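This second issue may be a PyTorch MPS backend limitation rather than a pure DeepQuantum bug: DeepQuantum reshapes the state into a high-rank tensor during evolution, and the rank of 17 reported at 16 qubits would be consistent with one axis per qubit plus one extra axis (presumably a batch axis). However, it would still be very helpful if DeepQuantum could either: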

- detect this case and fall back to CPU automatically, or
- raise a clearer compatibility warning for Apple MPS.
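A minimal sketch of such a guard is shown below. It is hypothetical and not tied to DeepQuantum's internals: `choose_device`, `MPS_MAX_RANK`, and `extra_dims` are illustrative names. The limit of 16 is taken from the error message above, and the rank formula (one axis per qubit plus one extra axis) is an assumption inferred from that message.

```python
import warnings

MPS_MAX_RANK = 16  # from "MPS supports tensors with dimensions <= 16"


def choose_device(nqubit: int, mps_available: bool, extra_dims: int = 1) -> str:
    """Return "mps" only if the assumed state-tensor rank fits the MPS limit."""
    # Assumption: the dense state is reshaped to one axis per qubit plus extra_dims.
    rank = nqubit + extra_dims
    if not mps_available:
        return "cpu"
    if rank > MPS_MAX_RANK:
        warnings.warn(
            f"{nqubit} qubits would need a rank-{rank} tensor, above the MPS "
            f"limit of {MPS_MAX_RANK}; falling back to CPU."
        )
        return "cpu"
    return "mps"


for n in (14, 16):
    print(n, choose_device(n, mps_available=True))
```

In practice `mps_available` would come from `torch.backends.mps.is_available()`. Sizes between 14 and 16 qubits were not tested in this report, so the helper's answer for 15 qubits follows only from the assumed rank formula.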

## Larger probe results

I ran a small probe comparing CPU vs MPS for two circuit types:

- `fixed_ghz`: `H` + `CNOT` chain + expectation
- `param_ghz`: `fixed_ghz` plus one `Rx` on each qubit

Observed behavior:

- `fixed_ghz` on CPU: successful up to 24 qubits
- `fixed_ghz` on MPS: successful up to 14 qubits, failed at 16 qubits
- `param_ghz` on CPU: successful up to 24 qubits
- `param_ghz` on MPS: failed already at 2 qubits with the `cat_int32_t_float_float2` error

Selected console output:

```text
[fixed_ghz]
  Device: mps
     2 qubits -> ok
     4 qubits -> ok
     6 qubits -> ok
     8 qubits -> ok
    10 qubits -> ok
    12 qubits -> ok
    14 qubits -> ok
    16 qubits -> FAIL (RuntimeError: MPS supports tensors with dimensions <= 16, but got 17.)

[param_ghz]
  Device: mps
     2 qubits -> FAIL (RuntimeError: Failed to create function state object for: cat_int32_t_float_float2)
```
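The probe script itself is not included in this report. A harness that produces output in this shape could look roughly like the sketch below. Everything in it is hypothetical: `probe` and `fake_run` are illustrative names, stopping at the first failing size is an assumption, and `fake_run` is a stand-in workload so the sketch runs without DeepQuantum or an MPS device.

```python
from typing import Callable, Iterable


def probe(name: str, device: str, run: Callable[[int, str], None],
          sizes: Iterable[int]) -> dict[int, str]:
    """Call run(nqubit, device) for each size and record "ok" or the error text."""
    results: dict[int, str] = {}
    print(f"[{name}]")
    print(f"  Device: {device}")
    for n in sizes:
        try:
            run(n, device)
            results[n] = "ok"
        except Exception as exc:  # report the failure instead of aborting the sweep
            results[n] = f"FAIL ({type(exc).__name__}: {exc})"
        print(f"  {n:4d} qubits -> {results[n]}")
        if results[n] != "ok":
            break  # assumption: stop at the first failing size
    return results


def fake_run(nqubit: int, device: str) -> None:
    """Stand-in workload: pretends that sizes of 6 qubits and above fail."""
    if nqubit >= 6:
        raise RuntimeError("stand-in failure")


results = probe("demo", "cpu", fake_run, range(2, 9, 2))
```

For a real run, `run` would build the circuit with the calls from the minimal reproductions (`dq.QubitCircuit`, `h`, `cnot`, `rx`, `to`) and execute it on the given device.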

## Environment

```text
macOS: Darwin 25.3.0
Machine: arm64
Hardware: Apple M2 Ultra
Python: 3.11.11
torch: 2.10.0
deepquantum: 4.4.0
cuda available: False
mps built: True
mps available: True
```

## Request

Would you consider:

1. fixing the parametric gate matrix construction for MPS by avoiding mixed float/complex `torch.stack` paths, and
2. documenting or guarding the dense high-qubit MPS limitation on Apple Silicon?

If you prefer, I can split this into two separate issues because the first one looks like a library bug while the second one may partly come from a PyTorch MPS backend limitation.

