Background
Mixed-precision eigenvalue decomposition is constrained by the forward error bound: |λ_computed - λ_true| ≲ ε_machine × κ(λ) × ||H||. For chemical accuracy (1.6 mHa), FP16 (ε ≈ 1e-3) and lower precisions are insufficient for a direct eigensolve when ||H|| ≈ O(1-100) Ha, because rounding error alone is then of order 1-100 mHa.
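As a rough worked check of this bound, the sketch below multiplies the machine epsilon of each format by a few values of ||H|| and compares the result with the 1.6 mHa target. It assumes κ(λ) = 1 (which holds for a Hermitian H) and ignores dimension-dependent constants, so the numbers are order-of-magnitude estimates only. The BF16 and TF32 epsilons are derived from their 7 and 10 explicit mantissa bits, and the helper names are illustrative.

```python
import numpy as np

CHEMICAL_ACCURACY_HA = 1.6e-3  # 1.6 mHa target from the text

# Machine epsilon per format. BF16 and TF32 are not NumPy dtypes, so their
# values come from the explicit mantissa widths (7 and 10 bits).
EPS = {
    "FP64": float(np.finfo(np.float64).eps),
    "FP32": float(np.finfo(np.float32).eps),
    "TF32": 2.0 ** -10,
    "FP16": float(np.finfo(np.float16).eps),
    "BF16": 2.0 ** -7,
}


def rounding_error_estimate(eps, norm_h):
    """Order-of-magnitude eigenvalue error in Ha, assuming kappa = 1."""
    return eps * norm_h


def meets_chemical_accuracy(fmt, norm_h):
    """True if the estimate for format `fmt` stays below 1.6 mHa."""
    return rounding_error_estimate(EPS[fmt], norm_h) < CHEMICAL_ACCURACY_HA


for fmt in EPS:
    row = [f"{rounding_error_estimate(EPS[fmt], nh):.1e}" for nh in (1.0, 10.0, 100.0)]
    print(f"{fmt}: " + "  ".join(row) + "  Ha for ||H|| = 1, 10, 100")
```

Under these assumptions FP32 stays about two orders of magnitude below the target even at ||H|| = 100 Ha, while FP16 and TF32 (same epsilon) reach roughly 10 mHa at ||H|| = 10 Ha.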
However, low precision CAN accelerate specific stages of the computation while preserving FP64 final accuracy.
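One simple instance of such a split, sketched in NumPy on a synthetic dense symmetric matrix: the eigensolve runs in FP32 and only the Rayleigh quotients of the resulting vectors are recomputed in FP64. The test matrix, its spectrum, and the function names are assumptions made for illustration; this is not the project's implementation.

```python
import numpy as np


def symmetric_test_matrix(n, lo, hi, seed=0):
    """Dense symmetric matrix with eigenvalues evenly spaced in [lo, hi]."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    h = (q * np.linspace(lo, hi, n)) @ q.T
    return 0.5 * (h + h.T)


def eigh_fp32_with_fp64_rayleigh(h64, k):
    """Lowest k eigenvalues: FP32 eigensolve, then FP64 Rayleigh quotients."""
    vals32, vecs32 = np.linalg.eigh(h64.astype(np.float32))
    v = vecs32[:, :k].astype(np.float64)
    # Rayleigh quotient v^T H v / v^T v per column, evaluated in FP64.
    refined = np.einsum("ik,ik->k", v, h64 @ v) / np.einsum("ik,ik->k", v, v)
    return vals32[:k].astype(np.float64), refined


h = symmetric_test_matrix(100, -20.0, 20.0)
exact = np.linalg.eigvalsh(h)[:5]
raw, refined = eigh_fp32_with_fp64_rayleigh(h, 5)
err_fp32 = np.max(np.abs(raw - exact))
err_refined = np.max(np.abs(refined - exact))
print(f"FP32 eigenvalues: {err_fp32:.1e} Ha, after FP64 Rayleigh quotient: {err_refined:.1e} Ha")
```

The refinement works because the Rayleigh quotient error is quadratic in the eigenvector error, so an FP32-accurate vector can still yield an eigenvalue well beyond FP32 accuracy, provided the eigenvalues are reasonably separated.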
Research directions
P1: FP32 projected Hamiltonian construction
Build H_proj in FP32 instead of FP64 (50% memory reduction). Only upcast to FP64 before eigensolve. Validated by CIM-QS(H)CI (arXiv:2603.13160, March 2026).
Target: matrix_elements_fast() in hamiltonians/molecular/hamiltonian.py
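A minimal sketch of the P1 idea, using a dense projection V^T H V as a stand-in for the element-wise construction. The function `build_h_proj`, the random test matrix, and the sizes are hypothetical and do not reflect what `matrix_elements_fast()` actually does; the point is only to show FP32 construction with a single upcast before the eigensolve.

```python
import numpy as np


def build_h_proj(h_full, basis, dtype):
    """Form basis^T H basis with all arithmetic and storage in `dtype`."""
    h = h_full.astype(dtype)
    v = basis.astype(dtype)
    h_proj = v.T @ (h @ v)
    return 0.5 * (h_proj + h_proj.T)  # symmetrize away rounding asymmetry


rng = np.random.default_rng(0)
n, k = 400, 40
a = rng.standard_normal((n, n))
h_full = 10.0 * (a + a.T) / np.sqrt(2 * n)  # spectral norm of roughly 20
basis, _ = np.linalg.qr(rng.standard_normal((n, k)))

h_proj32 = build_h_proj(h_full, basis, np.float32)
h_proj64 = build_h_proj(h_full, basis, np.float64)

# Upcast only at the eigensolve, as P1 proposes.
e_mixed = np.linalg.eigvalsh(h_proj32.astype(np.float64))
e_ref = np.linalg.eigvalsh(h_proj64)
max_diff = np.max(np.abs(e_mixed - e_ref))
print(f"storage: {h_proj32.nbytes} vs {h_proj64.nbytes} bytes, "
      f"max eigenvalue difference: {max_diff:.1e} Ha")
```

Comparing against the all-FP64 build on the same basis isolates the error introduced by FP32 construction from any error in the basis itself.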
P2: Chebyshev filtered subspace iteration with TF32/BF16
Replace scipy.sparse.linalg.expm_multiply in SKQD with Chebyshev polynomial filtering using TF32 tensor cores for matvec. R-ChFSI (arXiv:2503.22652) demonstrated 2.1x speedup with BF16 communication and TF32 filtering, with tolerance to inexact matvec.
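For orientation, here is a plain Chebyshev filtered subspace iteration in NumPy with the filter matvecs in reduced precision and the Rayleigh-Ritz step in FP64. Several things are assumptions: FP32 stands in for TF32 (NumPy has no TF32 type, so TF32 rounding is not reproduced), the matrix is a synthetic dense one, and this is the textbook filter rather than the residual-based R-ChFSI variant or anything taken from SKQD.

```python
import numpy as np


def chebyshev_filter(h, x, degree, a, b):
    """Apply T_degree((H - c) / e) to x: damps [a, b], amplifies below a."""
    e = x.dtype.type((b - a) / 2.0)  # half-width, cast to keep x's precision
    c = x.dtype.type((b + a) / 2.0)  # centre of the damped interval
    y = (h @ x - c * x) / e
    for _ in range(2, degree + 1):
        y_next = 2.0 * (h @ y - c * y) / e - x  # three-term recurrence
        x, y = y, y_next
    return y


def chfsi_lowest(h64, n_want, block=12, degree=12, n_iter=15, seed=0):
    """Lowest n_want Ritz pairs; filter in FP32, Rayleigh-Ritz in FP64."""
    rng = np.random.default_rng(seed)
    h32 = h64.astype(np.float32)   # low-precision copy, used only in the filter
    b = np.linalg.norm(h64, 1)     # induced 1-norm: upper bound on the spectrum
    q, _ = np.linalg.qr(rng.standard_normal((h64.shape[0], block)))
    ritz, vecs = np.linalg.eigh(q.T @ h64 @ q)
    x = q @ vecs
    for _ in range(n_iter):
        a = ritz[-1]               # damp everything above the current block
        y = chebyshev_filter(h32, x.astype(np.float32), degree, a, b)
        q, _ = np.linalg.qr(y.astype(np.float64))
        ritz, vecs = np.linalg.eigh(q.T @ h64 @ q)
        x = q @ vecs
    return ritz[:n_want], x[:, :n_want]


rng = np.random.default_rng(1)
n = 300
pert = rng.standard_normal((n, n))
h = np.diag(np.linspace(-10.0, 20.0, n)) + 0.0005 * (pert + pert.T)
ritz, vecs = chfsi_lowest(h, 4)
exact = np.linalg.eigvalsh(h)[:4]
print("max Ritz value error (Ha):", np.max(np.abs(ritz - exact)))
```

The filter tolerates inexact matvecs because its only job is to enrich the subspace; the eigenvalues come from the FP64 Rayleigh-Ritz projection, whose error is second order in the subspace error.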
P3: Randomized basis selection with FP16 projection
Use FP16 random projection for initial basis selection (noise-tolerant by construction), then FP64 for the small dense eigenvalue problem. Theoretical foundation in arXiv:2601.19250 (Jan 2026).
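One possible reading of P3, sketched as a randomized range finder: the sketch H Ω is formed with FP16 inputs and output (FP32 accumulation, mimicking tensor-core behaviour), and everything after that runs in FP64. This is an illustration under strong assumptions, not the algorithm of arXiv:2601.19250: the matrix is synthetic and numerically low rank, and the wanted eigenvalues are the ones of largest magnitude, which is what a range finder captures.

```python
import numpy as np


def fp16_sketch(h, omega):
    """Sketch H @ Omega with FP16 inputs and output, FP32 accumulation."""
    h16 = h.astype(np.float16).astype(np.float32)
    om16 = omega.astype(np.float16).astype(np.float32)
    return (h16 @ om16).astype(np.float16)


def randomized_lowest(h64, n_want, oversample=25, seed=0):
    """Lowest n_want Ritz values from an FP16-sketched subspace."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((h64.shape[0], n_want + oversample))
    y = fp16_sketch(h64, omega)                # low precision: basis selection only
    q, _ = np.linalg.qr(y.astype(np.float64))  # FP64 orthonormalization
    ritz = np.linalg.eigvalsh(q.T @ h64 @ q)   # small dense FP64 eigenproblem
    return ritz[:n_want]


# Synthetic test matrix: five well-separated low eigenvalues, tiny remainder.
rng = np.random.default_rng(42)
n = 200
u, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate([[-10.0, -9.0, -8.0, -7.0, -6.0],
                       1e-4 * rng.uniform(-1.0, 1.0, n - 5)])
h = (u * eigs) @ u.T
h = 0.5 * (h + h.T)

exact = np.linalg.eigvalsh(h)[:5]
approx = randomized_lowest(h, 5)
print("max Ritz value error (Ha):", np.max(np.abs(approx - exact)))
```

FP16 rounding only perturbs which subspace gets selected. The Ritz values are still computed from the FP64 matrix, so they remain variational upper bounds on the exact eigenvalues.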
P4: cuSOLVER BF16x9 math mode
cuSOLVER 13.2 supports CUSOLVER_FP32_EMULATED_BF16X9_MATH for syevd. Internal GEMMs use 9x BF16 tensor core operations to emulate FP32 accuracy with higher throughput on Blackwell GPUs. Requires calling cuSOLVER directly (not exposed via PyTorch).
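The arithmetic idea behind the "9x BF16" emulation can be shown in NumPy without calling cuSOLVER: each FP32 operand is split into three BF16-representable pieces, and the 3 × 3 = 9 partial products are accumulated in FP32. This is a conceptual sketch only. It uses truncation for the split and ordinary FP32 matmuls for the partial products, and makes no claim about how cuSOLVER orders or scales the terms internally.

```python
import numpy as np


def truncate_to_bf16(x):
    """Keep the top 16 bits of each float32 (sign, exponent, 7 mantissa bits)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


def split_bf16x3(x):
    """Split float32 x into three BF16-representable pieces that sum to x."""
    x = x.astype(np.float32)
    p0 = truncate_to_bf16(x)
    p1 = truncate_to_bf16(x - p0)
    p2 = truncate_to_bf16(x - p0 - p1)  # remainder already fits in 8 bits
    return p0, p1, p2


def matmul_bf16x9(a, b):
    """Emulate an FP32 GEMM with nine BF16-input products, FP32 accumulation."""
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for pa in split_bf16x3(a):
        for pb in split_bf16x3(b):
            c += pa @ pb
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((128, 128)).astype(np.float32)
b = rng.standard_normal((128, 128)).astype(np.float32)
ref = a.astype(np.float64) @ b.astype(np.float64)


def rel_err(c):
    """Relative Frobenius-norm error against the FP64 reference product."""
    return np.linalg.norm(c - ref) / np.linalg.norm(ref)


err_fp32 = rel_err(a @ b)
err_bf16 = rel_err(truncate_to_bf16(a) @ truncate_to_bf16(b))
err_bf16x9 = rel_err(matmul_bf16x9(a, b))
print(f"FP32: {err_fp32:.1e}  single BF16: {err_bf16:.1e}  BF16x9: {err_bf16x9:.1e}")
```

Three 8-bit significands cover the 24-bit FP32 significand, which is why nine products are enough to recover FP32-level accuracy from BF16 inputs.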
What's already done
mixed_precision_eigh (PR #5, "feat: mixed-precision eigensolve, persistent integral cache, pre-commit hooks"): FP32 solve + FP64 Rayleigh quotient refinement with TF32 enabled. 6.3x speedup on DGX Spark GB10 at n=2000.
References
- Higham & Mary, "Mixed Precision Algorithms in Numerical Linear Algebra", Acta Numerica (2022)
- CIM-QS(H)CI (arXiv:2603.13160): FP32 Hamiltonian construction validated
- R-ChFSI (arXiv:2503.22652): TF32/BF16 Chebyshev filtering, 2.1x speedup
- JCTC 2026: BF16 preconditioning for DFT eigensolver on AI-focused GPUs
- Xu et al. (arXiv:2601.19250): Precision-adaptive randomized SVD
- cuSOLVER 13.2 docs: BF16x9 emulated math mode