
custatevec OOM on DGX Spark (GB10 UMA) — cudaMalloc fails on unified memory architecture #215

@reinamora137

Description


Summary

custatevec-backed quantum simulation (via PennyLane lightning.gpu) fails with an out-of-memory error on the NVIDIA DGX Spark (GB10 Superchip, SM 12.1), even for trivially small state vectors (4 qubits = 256 bytes). The failure originates in cudaMalloc on the unified memory architecture, where cudaMemGetInfo reports near-zero free memory despite 128 GB of shared DRAM being available.
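As a sanity check on the quoted figure, the state-vector footprint can be computed directly: a dense state vector holds 2^n complex128 amplitudes at 16 bytes each.

```python
def statevector_bytes(n_qubits: int) -> int:
    """Memory footprint of a dense state vector in complex128 (16 bytes per amplitude)."""
    return (1 << n_qubits) * 16

print(statevector_bytes(4))  # -> 256 bytes, vastly below the 128 GB of shared DRAM
```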

Environment

  • Hardware: NVIDIA DGX Spark, GB10 Superchip (SM 12.1)
  • Memory: 128GB LPDDR5X unified CPU+GPU (no discrete VRAM)
  • CUDA: 13.0, Driver 580.126.09
  • OS: Ubuntu (aarch64)
  • custatevec: 1.12.0 (via custatevec-cu12 pip package)

Reproducer

import pennylane as qml

# Uses custatevec under the hood
dev = qml.device('lightning.gpu', wires=4)

@qml.qnode(dev)
def circuit():
    qml.Hadamard(wires=0)
    return qml.expval(qml.PauliZ(0))

result = circuit()

This fails with:

pennylane_lightning.lightning_gpu_ops.LightningException:
[...DevTag.hpp][Line:65][Method:refresh]: Error in PennyLane Lightning: out of memory

Root Cause

DGX Spark implements Unified Memory Architecture (UMA) — the GPU and CPU share the same physical DRAM. Standard CUDA memory query APIs (cudaMemGetInfo) return misleading values on UMA systems, causing libraries that pre-check available memory to believe no GPU memory exists.
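For illustration, here is a minimal sketch of a UMA-aware allocation policy. This is a hypothetical helper, not cuStateVec's actual logic; it assumes the caller has already read the `integrated` flag from cudaGetDeviceProperties and the free-memory figure from cudaMemGetInfo.

```python
def pick_allocator(integrated: bool, free_bytes: int, request_bytes: int) -> str:
    """Choose a CUDA allocation call for a buffer of `request_bytes`.

    Hypothetical policy: on UMA systems (cudaDeviceProp.integrated == 1),
    cudaMemGetInfo's free-memory value is unreliable, so skip the pre-check
    and use managed memory; on discrete GPUs, honor the reported free memory.
    """
    if integrated:
        # UMA: `free_bytes` may read as ~0 despite ample shared DRAM.
        return "cudaMallocManaged"
    if request_bytes <= free_bytes:
        return "cudaMalloc"
    raise MemoryError(f"requested {request_bytes} B but only {free_bytes} B free")

# A 4-qubit state vector (256 B) on a UMA device reporting 0 B free:
print(pick_allocator(integrated=True, free_bytes=0, request_bytes=256))
# -> cudaMallocManaged
```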

This UMA behavior is documented by NVIDIA, and similar issues have been reported against other CUDA libraries.

Questions

  1. Does custatevec internally use cudaMemGetInfo for pre-allocation checks? If so, is there a plan to support UMA platforms?
  2. Would using cudaMallocManaged instead of cudaMalloc on UMA systems (detected via cudaDeviceProp::integrated) be a viable fix?
  3. Is there a custatevec configuration option or environment variable to bypass the memory pre-check?
  4. What is the roadmap for cuQuantum support on DGX Spark / Grace-Blackwell UMA systems?

Context

The DGX Spark is shipping to quantum computing researchers and developers. GPU-accelerated quantum simulation is a natural use case for this hardware. Currently, custatevec is unusable on it, forcing fallback to CPU-only simulation.

We've also filed a related issue on PennyLaneAI/pennylane-lightning since the allocation code path goes through their DataBuffer.hpp, but the underlying question is whether custatevec itself has UMA-incompatible memory assumptions.
