Skip to content

feat: mixed-precision eigensolve, persistent integral cache, pre-commit hooks#5

Merged
George930502 merged 4 commits into
QuantumNoLab:mainfrom
thc1006:feat/mixed-precision-and-cache
Mar 26, 2026
Merged

feat: mixed-precision eigensolve, persistent integral cache, pre-commit hooks#5
George930502 merged 4 commits into
QuantumNoLab:mainfrom
thc1006:feat/mixed-precision-and-cache

Conversation

@thc1006
Copy link
Copy Markdown
Member

@thc1006 thc1006 commented Mar 26, 2026

TLDR

Three independent improvements: GPU-accelerated eigendecomposition, joblib-based integral caching, and pre-commit hook configuration. No algorithm changes, no public API breaks. All changes are in files untouched by PR #4 (zero conflict verified).

What changed

1. Mixed-Precision Eigensolve (_utils/gpu/linear_algebra.py)

mixed_precision_eigh(H) — FP32 solve + FP64 Rayleigh quotient refinement.

Benchmarked on DGX Spark NVIDIA GB10 (UMA 128GB):

n CPU FP64 GPU Mixed Speedup
1000 0.119s 0.418s 0.3x
2000 0.570s 0.090s 6.3x
4000 4.089s 1.551s 2.6x

Handles known torch.linalg.eigh FP32 GPU bug (NaN/Inf on low-rank matrices) with automatic fallback to FP64. Small matrices (n<=64) skip mixed-precision. Output always FP64.

2. Persistent Integral Cache (hamiltonians/integrals.py)

cached_compute_molecular_integrals() — joblib.Memory disk caching for PySCF integral computations. Second call with same geometry/basis returns instantly from disk, saving 10-50s per molecule.

  • get_integral_cache(cache_dir) for custom cache location
  • clear_integral_cache() to remove cached data
  • Default: ~/.cache/qvartools/integrals (configurable via QVARTOOLS_CACHE_DIR)

3. Pre-commit Hooks (.pre-commit-config.yaml)

Ruff lint + format hooks compatible with both pre-commit and prek (Rust reimplementation, 5-6x faster, adopted by CPython/FastAPI). Enforces same policy locally that CI enforces remotely.

Test plan

thc1006 added 3 commits March 26, 2026 21:12
FP32 solve + FP64 Rayleigh quotient refinement. Benchmarked on DGX
Spark GB10: 6.3x speedup at n=2000, 2.6x at n=4000 vs CPU FP64.

Handles known torch.linalg.eigh FP32 GPU bug (NaN/Inf on low-rank
matrices) with automatic fallback to FP64. Small matrices (n<=64)
skip mixed-precision entirely.

10 tests: correctness, sorting, dtype, orthonormality, GPU (2 tests
on DGX Spark NVIDIA GB10).

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
cached_compute_molecular_integrals() wraps compute_molecular_integrals
with joblib.Memory disk caching. Second call with same geometry/basis
returns instantly from disk, saving 10-50s of PySCF computation.

- get_integral_cache(cache_dir) for custom cache location
- clear_integral_cache(cache_dir) to remove cached data
- Default cache: ~/.cache/qvartools/integrals (configurable via
  QVARTOOLS_CACHE_DIR env var)
- 5 tests: import, correctness, cache hit, different geometry, clear

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
Compatible with both pre-commit and prek (Rust reimplementation,
5-6x faster, adopted by CPython and FastAPI).

Enforces the same ruff check + ruff format policy locally that CI
enforces remotely — catches issues before push.

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 26, 2026 13:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces three independent developer-facing improvements: a mixed-precision eigensolver utility (Torch-based), a persistent on-disk cache for PySCF integral computation (joblib.Memory), and a pre-commit configuration to enforce Ruff lint/format locally.

Changes:

  • Added mixed_precision_eigh(H) for FP32 eigensolve with FP64 Rayleigh-quotient refinement and FP64 fallback handling.
  • Added persistent integral caching utilities: get_integral_cache, cached_compute_molecular_integrals, and clear_integral_cache.
  • Added Ruff-based pre-commit hooks plus new unit tests for both features.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/qvartools/_utils/gpu/linear_algebra.py Adds mixed_precision_eigh mixed-precision eigendecomposition utility.
src/qvartools/hamiltonians/integrals.py Adds joblib-based persistent caching helpers around compute_molecular_integrals.
tests/test_utils/test_mixed_precision.py Adds correctness/behavior tests for mixed_precision_eigh (CPU + optional GPU).
tests/test_utils/test_hamiltonian_cache.py Adds tests validating the integral caching and cache clearing behavior.
.pre-commit-config.yaml Adds Ruff lint/format hooks for local enforcement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +141 to +175
# Symmetrize to eliminate floating-point asymmetry
H = 0.5 * (H + H.T)

# For small matrices or already-FP64 on CPU: direct FP64 solve
if n <= _MIXED_PREC_MIN_SIZE or (H.dtype == torch.float64 and device.type == "cpu"):
H_fp64 = H.to(dtype=torch.float64)
vals, vecs = torch.linalg.eigh(H_fp64)
order = torch.argsort(vals)
return vals[order].cpu(), vecs[:, order].cpu()

# --- Mixed precision path ---
H_fp64 = H.to(dtype=torch.float64, device=device)

# Step 1: FP32 solve for speed
H_fp32 = H_fp64.to(dtype=torch.float32)
try:
vals_fp32, vecs_fp32 = torch.linalg.eigh(H_fp32)

# Validate: check for NaN/Inf (known FP32 GPU bug)
if not torch.isfinite(vals_fp32).all() or not torch.isfinite(vecs_fp32).all():
raise RuntimeError("FP32 eigh produced non-finite values")

except (RuntimeError, torch.linalg.LinAlgError):
logger.warning("FP32 eigh failed; falling back to FP64.")
vals_fp64, vecs_fp64 = torch.linalg.eigh(H_fp64)
order = torch.argsort(vals_fp64)
return vals_fp64[order].cpu(), vecs_fp64[:, order].cpu()

# Step 2: Rayleigh quotient refinement in FP64
vecs_fp64 = vecs_fp32.to(dtype=torch.float64)
Hv = H_fp64 @ vecs_fp64 # (n, n) @ (n, n) in FP64
# E_i = v_i^T H v_i / (v_i^T v_i)
numerator = (vecs_fp64 * Hv).sum(dim=0) # diag(V^T H V)
denominator = (vecs_fp64 * vecs_fp64).sum(dim=0) # diag(V^T V)
vals_refined = numerator / denominator
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed_precision_eigh claims to accept “any dtype”, but the implementation assumes a real symmetric matrix: it symmetrizes with H.T (no conjugate) and the Rayleigh-quotient refinement uses elementwise products without conjugation. If callers pass a complex Hermitian matrix, the refinement will be mathematically wrong. Consider either (a) explicitly rejecting complex inputs (raise with a clear error) or (b) supporting complex Hermitian inputs by using conjugate-transpose for symmetrization and v.conj() in the Rayleigh quotient.

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +156
For matrices larger than 64x64, solves in float32 (leveraging TF32
matmul on NVIDIA GPUs for ~10x speedup), then refines eigenvalues
via Rayleigh quotient in float64 to recover full precision.

For small matrices or float64 input on CPU, delegates directly to
``torch.linalg.eigh`` in float64.

Parameters
----------
H : torch.Tensor
Real symmetric matrix, shape ``(n, n)``. Any dtype accepted;
output is always float64.

Returns
-------
eigenvalues : torch.Tensor
Eigenvalues sorted ascending, shape ``(n,)``, dtype float64,
on CPU.
eigenvectors : torch.Tensor
Corresponding eigenvectors as columns, shape ``(n, n)``,
dtype float64, on CPU.

Notes
-----
``torch.linalg.eigh`` has a known issue with float32 on GPU for
large low-rank matrices (may produce NaN/Inf). This function
detects such failures and falls back to float64 automatically.

Reference: Higham & Mary, "Mixed Precision Algorithms in Numerical
Linear Algebra", Acta Numerica (2022).
"""
n = H.shape[0]
device = H.device

# Symmetrize to eliminate floating-point asymmetry
H = 0.5 * (H + H.T)

# For small matrices or already-FP64 on CPU: direct FP64 solve
if n <= _MIXED_PREC_MIN_SIZE or (H.dtype == torch.float64 and device.type == "cpu"):
H_fp64 = H.to(dtype=torch.float64)
vals, vecs = torch.linalg.eigh(H_fp64)
order = torch.argsort(vals)
return vals[order].cpu(), vecs[:, order].cpu()

# --- Mixed precision path ---
H_fp64 = H.to(dtype=torch.float64, device=device)

# Step 1: FP32 solve for speed
H_fp32 = H_fp64.to(dtype=torch.float32)
try:
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mixed-precision path currently runs on CPU for large FP32 inputs as well (only skips when H is FP64 on CPU). On CPU there’s no TF32 acceleration benefit, so this adds extra casts/memory and does an FP64 matmul refinement, which is likely slower than a straight FP64 torch.linalg.eigh. Consider gating the mixed-precision branch to device.type == "cuda" (and otherwise always do the direct FP64 solve), or update the docstring to reflect intended CPU behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +54
"""Second call with same args should use cache (faster)."""
import time

from qvartools.hamiltonians.integrals import get_integral_cache

cache = get_integral_cache(str(tmp_path / "cache"))
geometry = [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.74))]

# First call — computes
t0 = time.perf_counter()
r1 = cache(geometry, basis="sto-3g")
t_first = time.perf_counter() - t0

# Second call — should be cached
t0 = time.perf_counter()
r2 = cache(geometry, basis="sto-3g")
t_second = time.perf_counter() - t0

# Cached call should be significantly faster
assert t_second < t_first or t_first < 0.1 # very fast compute is OK too
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_second_call_is_cached relies on wall-clock timing (t_second < t_first) to assert caching, which is inherently flaky on shared/variable CI environments. Since this is specifically testing joblib.Memory behavior, it should use a deterministic signal (e.g., joblib’s check_call_in_cache(...) / cache metadata inspection, or verifying the cache directory contents change only on the first call) rather than performance measurements.

Suggested change
"""Second call with same args should use cache (faster)."""
import time
from qvartools.hamiltonians.integrals import get_integral_cache
cache = get_integral_cache(str(tmp_path / "cache"))
geometry = [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.74))]
# First call — computes
t0 = time.perf_counter()
r1 = cache(geometry, basis="sto-3g")
t_first = time.perf_counter() - t0
# Second call — should be cached
t0 = time.perf_counter()
r2 = cache(geometry, basis="sto-3g")
t_second = time.perf_counter() - t0
# Cached call should be significantly faster
assert t_second < t_first or t_first < 0.1 # very fast compute is OK too
"""Second call with same args should use cache (reuse on-disk data)."""
import pathlib
from qvartools.hamiltonians.integrals import get_integral_cache
cache_dir = tmp_path / "cache"
cache = get_integral_cache(str(cache_dir))
geometry = [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.74))]
# First call — should create cache entries on disk
r1 = cache(geometry, basis="sto-3g")
cache_files_after_first = sorted(
p.relative_to(cache_dir)
for p in pathlib.Path(cache_dir).rglob("*")
if p.is_file()
)
# Ensure that something was actually cached
assert len(cache_files_after_first) > 0
# Second call — should reuse existing cache entries (no new files)
r2 = cache(geometry, basis="sto-3g")
cache_files_after_second = sorted(
p.relative_to(cache_dir)
for p in pathlib.Path(cache_dir).rglob("*")
if p.is_file()
)
assert cache_files_after_second == cache_files_after_first

Copilot uses AI. Check for mistakes.
Comment on lines +93 to +94
# joblib may leave directory but should have no cached results
cached_files = list(cache_path.rglob("*.npy"))
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_clear_cache only checks for leftover *.npy files, but joblib caches may store results as other file types (e.g., pickles/metadata). This can let the test pass even if cached data remains. Consider asserting that the cache directory does not exist after clear_integral_cache, or that it contains no files of any kind (excluding empty directories).

Suggested change
# joblib may leave directory but should have no cached results
cached_files = list(cache_path.rglob("*.npy"))
# joblib may leave directory but should have no cached results of any type
cached_files = [p for p in cache_path.rglob("*") if p.is_file()]

Copilot uses AI. Check for mistakes.
mixed_precision_eigh:
- Reject complex input with TypeError (was silently wrong)
- Gate mixed-precision to CUDA only (no benefit on CPU)
- Enable TF32 tensor cores during FP32 solve (Blackwell/Hopper)
- Restore TF32 setting in finally block
- Move import torch to module top (remove noqa: E402)
- Fix docstring: real benchmarks (6.3x not ~10x)

integrals.py:
- Move logging/os/shutil imports to file top
- Add safety check in clear_integral_cache (refuse to delete
  directories without 'qvartools' or 'cache' in path)

Tests:
- Add complex input rejection test
- Replace timing-based cache test with deterministic file check
- Fix clear_cache test to check all files (not just *.npy)
- Use tmp_path in all cache tests (no home dir pollution)

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
@George930502
Copy link
Copy Markdown
Collaborator

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@George930502 George930502 merged commit 1f140be into QuantumNoLab:main Mar 26, 2026
5 checks passed
@thc1006 thc1006 deleted the feat/mixed-precision-and-cache branch March 26, 2026 15:17
thc1006 added a commit to thc1006/qvartools that referenced this pull request Apr 1, 2026
Critical fixes:
- #1/#2: Use persistent prev_coeffs/prev_batch_configs for PT2 scoring
  (was using best_coeffs before it was defined in current iteration)
- QuantumNoLab#5: Eviction checks cumulative_basis size (was checking best_batch_configs)
- QuantumNoLab#7: Convert numpy indices to torch.LongTensor before indexing CUDA tensor
- QuantumNoLab#8: Default energy_weight=0.0, entropy_weight=0.0 (was 0.1/0.05,
  silently changing existing behavior)

Lower priority fixes:
- QuantumNoLab#6: CIPSI docstring mentions build_sparse_hamiltonian requirement
- QuantumNoLab#9: Warn when energy_weight>0 but hamiltonian=None

Not applicable (won't fix):
- QuantumNoLab#3/QuantumNoLab#4: config_integer_hash returns tuple for n_sites>=64; our CAS max
  is 58Q (<64), so int cast is safe

Co-authored-by: leo07010 <leo07010@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants