feat: mixed-precision eigensolve, persistent integral cache, pre-commit hooks#5
Conversation
FP32 solve + FP64 Rayleigh quotient refinement. Benchmarked on DGX Spark GB10: 6.3x speedup at n=2000, 2.6x at n=4000 vs CPU FP64. Handles known torch.linalg.eigh FP32 GPU bug (NaN/Inf on low-rank matrices) with automatic fallback to FP64. Small matrices (n<=64) skip mixed-precision entirely. 10 tests: correctness, sorting, dtype, orthonormality, GPU (2 tests on DGX Spark NVIDIA GB10). Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
cached_compute_molecular_integrals() wraps compute_molecular_integrals with joblib.Memory disk caching. Second call with same geometry/basis returns instantly from disk, saving 10-50s of PySCF computation. - get_integral_cache(cache_dir) for custom cache location - clear_integral_cache(cache_dir) to remove cached data - Default cache: ~/.cache/qvartools/integrals (configurable via QVARTOOLS_CACHE_DIR env var) - 5 tests: import, correctness, cache hit, different geometry, clear Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
Compatible with both pre-commit and prek (Rust reimplementation, 5-6x faster, adopted by CPython and FastAPI). Enforces the same ruff check + ruff format policy locally that CI enforces remotely — catches issues before push. Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces three independent developer-facing improvements: a mixed-precision eigensolver utility (Torch-based), a persistent on-disk cache for PySCF integral computation (joblib.Memory), and a pre-commit configuration to enforce Ruff lint/format locally.
Changes:
- Added
mixed_precision_eigh(H)for FP32 eigensolve with FP64 Rayleigh-quotient refinement and FP64 fallback handling. - Added persistent integral caching utilities:
get_integral_cache,cached_compute_molecular_integrals, andclear_integral_cache. - Added Ruff-based pre-commit hooks plus new unit tests for both features.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/qvartools/_utils/gpu/linear_algebra.py |
Adds mixed_precision_eigh mixed-precision eigendecomposition utility. |
src/qvartools/hamiltonians/integrals.py |
Adds joblib-based persistent caching helpers around compute_molecular_integrals. |
tests/test_utils/test_mixed_precision.py |
Adds correctness/behavior tests for mixed_precision_eigh (CPU + optional GPU). |
tests/test_utils/test_hamiltonian_cache.py |
Adds tests validating the integral caching and cache clearing behavior. |
.pre-commit-config.yaml |
Adds Ruff lint/format hooks for local enforcement. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Symmetrize to eliminate floating-point asymmetry | ||
| H = 0.5 * (H + H.T) | ||
|
|
||
| # For small matrices or already-FP64 on CPU: direct FP64 solve | ||
| if n <= _MIXED_PREC_MIN_SIZE or (H.dtype == torch.float64 and device.type == "cpu"): | ||
| H_fp64 = H.to(dtype=torch.float64) | ||
| vals, vecs = torch.linalg.eigh(H_fp64) | ||
| order = torch.argsort(vals) | ||
| return vals[order].cpu(), vecs[:, order].cpu() | ||
|
|
||
| # --- Mixed precision path --- | ||
| H_fp64 = H.to(dtype=torch.float64, device=device) | ||
|
|
||
| # Step 1: FP32 solve for speed | ||
| H_fp32 = H_fp64.to(dtype=torch.float32) | ||
| try: | ||
| vals_fp32, vecs_fp32 = torch.linalg.eigh(H_fp32) | ||
|
|
||
| # Validate: check for NaN/Inf (known FP32 GPU bug) | ||
| if not torch.isfinite(vals_fp32).all() or not torch.isfinite(vecs_fp32).all(): | ||
| raise RuntimeError("FP32 eigh produced non-finite values") | ||
|
|
||
| except (RuntimeError, torch.linalg.LinAlgError): | ||
| logger.warning("FP32 eigh failed; falling back to FP64.") | ||
| vals_fp64, vecs_fp64 = torch.linalg.eigh(H_fp64) | ||
| order = torch.argsort(vals_fp64) | ||
| return vals_fp64[order].cpu(), vecs_fp64[:, order].cpu() | ||
|
|
||
| # Step 2: Rayleigh quotient refinement in FP64 | ||
| vecs_fp64 = vecs_fp32.to(dtype=torch.float64) | ||
| Hv = H_fp64 @ vecs_fp64 # (n, n) @ (n, n) in FP64 | ||
| # E_i = v_i^T H v_i / (v_i^T v_i) | ||
| numerator = (vecs_fp64 * Hv).sum(dim=0) # diag(V^T H V) | ||
| denominator = (vecs_fp64 * vecs_fp64).sum(dim=0) # diag(V^T V) | ||
| vals_refined = numerator / denominator |
There was a problem hiding this comment.
mixed_precision_eigh claims to accept “any dtype”, but the implementation assumes a real symmetric matrix: it symmetrizes with H.T (no conjugate) and the Rayleigh-quotient refinement uses elementwise products without conjugation. If callers pass a complex Hermitian matrix, the refinement will be mathematically wrong. Consider either (a) explicitly rejecting complex inputs (raise with a clear error) or (b) supporting complex Hermitian inputs by using conjugate-transpose for symmetrization and v.conj() in the Rayleigh quotient.
| For matrices larger than 64x64, solves in float32 (leveraging TF32 | ||
| matmul on NVIDIA GPUs for ~10x speedup), then refines eigenvalues | ||
| via Rayleigh quotient in float64 to recover full precision. | ||
|
|
||
| For small matrices or float64 input on CPU, delegates directly to | ||
| ``torch.linalg.eigh`` in float64. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| H : torch.Tensor | ||
| Real symmetric matrix, shape ``(n, n)``. Any dtype accepted; | ||
| output is always float64. | ||
|
|
||
| Returns | ||
| ------- | ||
| eigenvalues : torch.Tensor | ||
| Eigenvalues sorted ascending, shape ``(n,)``, dtype float64, | ||
| on CPU. | ||
| eigenvectors : torch.Tensor | ||
| Corresponding eigenvectors as columns, shape ``(n, n)``, | ||
| dtype float64, on CPU. | ||
|
|
||
| Notes | ||
| ----- | ||
| ``torch.linalg.eigh`` has a known issue with float32 on GPU for | ||
| large low-rank matrices (may produce NaN/Inf). This function | ||
| detects such failures and falls back to float64 automatically. | ||
|
|
||
| Reference: Higham & Mary, "Mixed Precision Algorithms in Numerical | ||
| Linear Algebra", Acta Numerica (2022). | ||
| """ | ||
| n = H.shape[0] | ||
| device = H.device | ||
|
|
||
| # Symmetrize to eliminate floating-point asymmetry | ||
| H = 0.5 * (H + H.T) | ||
|
|
||
| # For small matrices or already-FP64 on CPU: direct FP64 solve | ||
| if n <= _MIXED_PREC_MIN_SIZE or (H.dtype == torch.float64 and device.type == "cpu"): | ||
| H_fp64 = H.to(dtype=torch.float64) | ||
| vals, vecs = torch.linalg.eigh(H_fp64) | ||
| order = torch.argsort(vals) | ||
| return vals[order].cpu(), vecs[:, order].cpu() | ||
|
|
||
| # --- Mixed precision path --- | ||
| H_fp64 = H.to(dtype=torch.float64, device=device) | ||
|
|
||
| # Step 1: FP32 solve for speed | ||
| H_fp32 = H_fp64.to(dtype=torch.float32) | ||
| try: |
There was a problem hiding this comment.
The mixed-precision path currently runs on CPU for large FP32 inputs as well (only skips when H is FP64 on CPU). On CPU there’s no TF32 acceleration benefit, so this adds extra casts/memory and does an FP64 matmul refinement, which is likely slower than a straight FP64 torch.linalg.eigh. Consider gating the mixed-precision branch to device.type == "cuda" (and otherwise always do the direct FP64 solve), or update the docstring to reflect intended CPU behavior.
| """Second call with same args should use cache (faster).""" | ||
| import time | ||
|
|
||
| from qvartools.hamiltonians.integrals import get_integral_cache | ||
|
|
||
| cache = get_integral_cache(str(tmp_path / "cache")) | ||
| geometry = [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.74))] | ||
|
|
||
| # First call — computes | ||
| t0 = time.perf_counter() | ||
| r1 = cache(geometry, basis="sto-3g") | ||
| t_first = time.perf_counter() - t0 | ||
|
|
||
| # Second call — should be cached | ||
| t0 = time.perf_counter() | ||
| r2 = cache(geometry, basis="sto-3g") | ||
| t_second = time.perf_counter() - t0 | ||
|
|
||
| # Cached call should be significantly faster | ||
| assert t_second < t_first or t_first < 0.1 # very fast compute is OK too |
There was a problem hiding this comment.
test_second_call_is_cached relies on wall-clock timing (t_second < t_first) to assert caching, which is inherently flaky on shared/variable CI environments. Since this is specifically testing joblib.Memory behavior, it should use a deterministic signal (e.g., joblib’s check_call_in_cache(...) / cache metadata inspection, or verifying the cache directory contents change only on the first call) rather than performance measurements.
| """Second call with same args should use cache (faster).""" | |
| import time | |
| from qvartools.hamiltonians.integrals import get_integral_cache | |
| cache = get_integral_cache(str(tmp_path / "cache")) | |
| geometry = [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.74))] | |
| # First call — computes | |
| t0 = time.perf_counter() | |
| r1 = cache(geometry, basis="sto-3g") | |
| t_first = time.perf_counter() - t0 | |
| # Second call — should be cached | |
| t0 = time.perf_counter() | |
| r2 = cache(geometry, basis="sto-3g") | |
| t_second = time.perf_counter() - t0 | |
| # Cached call should be significantly faster | |
| assert t_second < t_first or t_first < 0.1 # very fast compute is OK too | |
| """Second call with same args should use cache (reuse on-disk data).""" | |
| import pathlib | |
| from qvartools.hamiltonians.integrals import get_integral_cache | |
| cache_dir = tmp_path / "cache" | |
| cache = get_integral_cache(str(cache_dir)) | |
| geometry = [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.74))] | |
| # First call — should create cache entries on disk | |
| r1 = cache(geometry, basis="sto-3g") | |
| cache_files_after_first = sorted( | |
| p.relative_to(cache_dir) | |
| for p in pathlib.Path(cache_dir).rglob("*") | |
| if p.is_file() | |
| ) | |
| # Ensure that something was actually cached | |
| assert len(cache_files_after_first) > 0 | |
| # Second call — should reuse existing cache entries (no new files) | |
| r2 = cache(geometry, basis="sto-3g") | |
| cache_files_after_second = sorted( | |
| p.relative_to(cache_dir) | |
| for p in pathlib.Path(cache_dir).rglob("*") | |
| if p.is_file() | |
| ) | |
| assert cache_files_after_second == cache_files_after_first |
| # joblib may leave directory but should have no cached results | ||
| cached_files = list(cache_path.rglob("*.npy")) |
There was a problem hiding this comment.
test_clear_cache only checks for leftover *.npy files, but joblib caches may store results as other file types (e.g., pickles/metadata). This can let the test pass even if cached data remains. Consider asserting that the cache directory does not exist after clear_integral_cache, or that it contains no files of any kind (excluding empty directories).
| # joblib may leave directory but should have no cached results | |
| cached_files = list(cache_path.rglob("*.npy")) | |
| # joblib may leave directory but should have no cached results of any type | |
| cached_files = [p for p in cache_path.rglob("*") if p.is_file()] |
mixed_precision_eigh: - Reject complex input with TypeError (was silently wrong) - Gate mixed-precision to CUDA only (no benefit on CPU) - Enable TF32 tensor cores during FP32 solve (Blackwell/Hopper) - Restore TF32 setting in finally block - Move import torch to module top (remove noqa: E402) - Fix docstring: real benchmarks (6.3x not ~10x) integrals.py: - Move logging/os/shutil imports to file top - Add safety check in clear_integral_cache (refuse to delete directories without 'qvartools' or 'cache' in path) Tests: - Add complex input rejection test - Replace timing-based cache test with deterministic file check - Fix clear_cache test to check all files (not just *.npy) - Use tmp_path in all cache tests (no home dir pollution) Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Critical fixes: - #1/#2: Use persistent prev_coeffs/prev_batch_configs for PT2 scoring (was using best_coeffs before it was defined in current iteration) - QuantumNoLab#5: Eviction checks cumulative_basis size (was checking best_batch_configs) - QuantumNoLab#7: Convert numpy indices to torch.LongTensor before indexing CUDA tensor - QuantumNoLab#8: Default energy_weight=0.0, entropy_weight=0.0 (was 0.1/0.05, silently changing existing behavior) Lower priority fixes: - QuantumNoLab#6: CIPSI docstring mentions build_sparse_hamiltonian requirement - QuantumNoLab#9: Warn when energy_weight>0 but hamiltonian=None Not applicable (won't fix): - QuantumNoLab#3/QuantumNoLab#4: config_integer_hash returns tuple for n_sites>=64; our CAS max is 58Q (<64), so int cast is safe Co-authored-by: leo07010 <leo07010@gmail.com>
TLDR
Three independent improvements: GPU-accelerated eigendecomposition, joblib-based integral caching, and pre-commit hook configuration. No algorithm changes, no public API breaks. All changes are in files untouched by PR #4 (zero conflict verified).
What changed
1. Mixed-Precision Eigensolve (
_utils/gpu/linear_algebra.py)mixed_precision_eigh(H)— FP32 solve + FP64 Rayleigh quotient refinement.Benchmarked on DGX Spark NVIDIA GB10 (UMA 128GB):
Handles known
torch.linalg.eighFP32 GPU bug (NaN/Inf on low-rank matrices) with automatic fallback to FP64. Small matrices (n<=64) skip mixed-precision. Output always FP64.2. Persistent Integral Cache (
hamiltonians/integrals.py)cached_compute_molecular_integrals()— joblib.Memory disk caching for PySCF integral computations. Second call with same geometry/basis returns instantly from disk, saving 10-50s per molecule.get_integral_cache(cache_dir)for custom cache locationclear_integral_cache()to remove cached data~/.cache/qvartools/integrals(configurable viaQVARTOOLS_CACHE_DIR)3. Pre-commit Hooks (
.pre-commit-config.yaml)Ruff lint + format hooks compatible with both pre-commit and prek (Rust reimplementation, 5-6x faster, adopted by CPython/FastAPI). Enforces same policy locally that CI enforces remotely.
Test plan
ruff check + formatclean (145 files)