Skip to content

KeithTab/MPSCF

Repository files navigation

MCPB-GPU4PySCF

Workflow Overview

Python GPU4PySCF CUDA License

A GPU-accelerated QM/MM parameterization workflow for metalloenzyme active sites. Combines the MCPB (Metal Center Parameter Builder) methodology with GPU4PySCF to replace the traditional Gaussian QM backend — delivering the same rigorous parameterization on modern NVIDIA hardware.


Table of Contents


Overview

Accurate force-field parameterization of metalloenzyme active sites requires high-quality QM data — optimised geometry, vibrational force constants, and electrostatic potential (ESP) charges — that is prohibitively slow for large models (~150 atoms) using conventional CPU-based codes such as Gaussian.

This project replaces the Gaussian QM backend with PySCF / GPU4PySCF, enabling:

  • DF-DFT (density-fitting DFT) with analytical gradients and Hessians on GPU
  • Full two-stage RESP fitting within PySCF (no external resp binary required)
  • Seminario bond/angle force constants directly from the GPU Hessian matrix
  • An end-to-end pipeline from raw PDB to tleap-ready topology

Step descriptions

Step Engine Inputs Key Outputs
1 — Model building pyMSMT Raw PDB Small/large .com, .in, mol2 templates
2a — Geometry opt GPU4PySCF Small model Optimised .xyz, .json
2b — Hessian GPU4PySCF Optimised geometry Force constants .npy, .fchk
2c — RESP GPU4PySCF Large model Charges .npy, .txt
3 — Frcmod Seminario Hessian .npy _mcpbpy.frcmod
4 — Leap prep pyMSMT RESP mol2 + PDB _mcpbpy.pdb, _tleap.in

Features

  • Full GPU acceleration — DF-DFT SCF, analytical gradients, and Hessians via GPU4PySCF
  • Multi-GPU Hessian — built-in patches (lib/patch.py) fix upstream single-GPU bottlenecks and OOM issues that occur with ≥3 GPUs (see Multi-GPU Hessian Patch)
  • Two-stage RESP fitting — stage-1 free fit; stage-2 with pyMSMT-style backbone charge constraints (chgmod) and CH₂/CH₃ equivalence constraints, entirely within gpu4pyscf.pop.esp
  • Seminario frcmod — metal-site bond/angle force constants from the GPU4PySCF Hessian (no Gaussian .fchk required)
  • Open-shell support — automatically selects UKS for radicals / high-spin metal centres
  • Dispersion correction — D3BJ / D3 / D4 / D3Zero
  • JSON parameter interface — all settings via a single JSON file; no code edits needed
  • tleap-ready outputset head/tail connect-atom definitions auto-inserted; redundant backbone bond commands removed
  • Optional fchk output — Gaussian-compatible .fchk files for Multiwfn, NBO, etc.

Requirements

Package Version Notes
Python ≥ 3.11
pyscf 2.12.1
gpu4pyscf 1.7.0 CUDA toolkit required
geometric 1.1 Geometry optimisation driver
numpy ≥ 2.0
openmm 8.4.0 PDB preparation
pdbfixer 1.12.0 PDB repair & protonation
pyMSMT 22.0 MCPB model building (steps 1, 3, 4)
AmberTools (tleap) ≥ 22 Final topology generation

Installation

conda create -n gpu4pyscf python=3.11
conda activate gpu4pyscf

# Core stack
conda install -c conda-forge numpy openmm pdbfixer pymsmt ambertools

# PySCF
pip install pyscf geometric

# GPU4PySCF  (choose your CUDA version)
pip install gpu4pyscf-cuda12x   # CUDA 12.x
# pip install gpu4pyscf-cuda11x # CUDA 11.x

Clone this repository into any working directory — no installation step required:

git clone <repo-url> mcpb_run
cd mcpb_run

Quick Start

Full pipeline (steps 1 → 4)

python main.py input.pdb output_H.pdb \
    --metal-charge ZN:2 \
    --step1 --step2 --step3 --step4 \
    --json scf_params.json

Add --gpu-json gpu.json to any command that runs a Hessian (step 2) to override the GPU memory-tuning defaults — see GPU Memory Tuning.

Standalone steps

Run each step independently from the directory containing the .in file:

# Step 2 — QM calculations (opt + Hessian + RESP)
python main.py --step2 --json scf_params.json

# Step 3 — Seminario frcmod
python main.py --step3 --json scf_params.json

# Step 4 — mol2 files, _mcpbpy.pdb, _tleap.in
python main.py --step4 --json scf_params.json

Running tleap

tleap -s -f {gname}_tleap.in

SCF Parameters (JSON)

All settings are controlled by a single JSON file passed with --json. Copy and edit scf_params.json:

{
    "basis":                "def2-svp",
    "auxbasis":             "def2-svp-ri",
    "xc":                   "b3lyp",

    "device":               "auto",
    "cuda_visible_devices": "0,1,2",

    "max_cycle":            100,
    "conv_tol":             1e-7,
    "grid_level":           3,
    "damp":                 0.2,
    "level_shift":          0.2,

    "maxsteps":             30,
    "hessian":              true,
    "hessian_device":       "gpu",
    "hessian_patch":        true,
    "cphf_grid_level":      null,

    "output_dir":           "outputs",
    "disp":                 "",
    "verbose":              4,
    "write_fchk":           true,

    "resp_a1":              5e-4,
    "resp_a2":              1e-3,
    "resp_b":               0.1,
    "hfree":                true,
    "chgmod":               1,

    "scalef":               1.0,
    "bond_avg":             true,
    "angle_avg":            true,

    "paraset":              "12_6"
}

Parameter Reference

SCF / Geometry optimisation

Key Type Default Description
basis string "def2-svp" Orbital basis set
auxbasis string "def2-svp-ri" Auxiliary basis for density fitting
xc string "b3lyp" XC functional
device string "auto" "auto" (GPU → CPU fallback), "gpu", "cpu"
cuda_visible_devices string "0" GPU index(es) for CUDA_VISIBLE_DEVICES
max_cycle int 100 Maximum SCF iterations
conv_tol float 1e-7 SCF energy convergence threshold (Hartree)
grid_level int 3 DFT integration grid level (1–9)
damp float 0.2 SCF damping factor
level_shift float 0.2 Level-shift value (Hartree)
maxsteps int 30 Maximum geometry optimisation steps
disp string|null "" Dispersion: "d3bj", "d3", "d4", "d3zero", or ""
verbose int 4 PySCF verbosity (0=silent … 5=debug)

Hessian

Key Type Default Description
hessian bool true Compute Hessian after optimisation
hessian_device string "same" Device for Hessian SCF: "same" as opt, or "cpu"
hessian_patch bool true Enable multi-GPU Hessian patches (see below)
cphf_grid_level int|null null Separate DFT grid level for CPSCF XC response (null = same as grid_level; 2 ≈ SG-1 for ~20–40 % speedup)

Output

Key Type Default Description
output_dir string "outputs" Directory for all QM output files
write_fchk bool false Write Gaussian-compatible .fchk files

RESP charge fitting

Key Type Default Description
resp_a1 float 5e-4 Stage-1 restraint strength
resp_a2 float 1e-3 Stage-2 restraint strength
resp_b float 0.1 Hyperbolic restraint width
hfree bool true Exclude H atoms from RESP restraints
chgmod int 1 Backbone charge constraint level: 0=free; 1=fix CA/N/C/O; 2=+H/HA; 3=+CB

Frcmod generation (Seminario, step 3)

Key Type Default Description
scalef float 1.0 Frequency scale factor (force constants scaled by scalef²)
bond_avg bool true Average both off-diagonal Hessian blocks per bond
angle_avg bool true 4-way block average for angle force constants

Leap preparation (step 4)

Key Type Default Description
paraset string "12_6" Ion non-bonded parameter set: "12_6", "12_6_4", "HFE", "CM", "IOD"

GPU Memory Tuning (gpu.json)

The Hessian pipeline sizes every large GPU buffer at run time from a small set of memory-tuning knobs. These default to values that work well on the test hardware, but you can override any of them without editing the source by passing a JSON file with --gpu-json:

python main.py --step2 --json scf_params.json --gpu-json gpu.json

The file is optional — omit --gpu-json to use the built-in defaults. Only the keys you want to change need to appear; any key starting with _ is treated as a comment and ignored, and an unknown key raises an error. All values are read when the Hessian kernel runs, so a single load in main() applies to the whole job.

Accuracy is unaffected. These parameters only change how tensors are blocked and distributed across GPUs — never the arithmetic. Lower a *_frac / *_factor (or raise cphf_est_cycles / make_h1_avail_div) to use less GPU memory at the cost of speed; raise them to go faster when VRAM is ample.

Example gpu.json

{
    "_comment": "GPU memory-tuning parameters. Pass with --gpu-json gpu.json.",

    "cphf_subspace_frac":        0.50,
    "cphf_est_cycles":           20,

    "make_h1_pinned_frac":       0.40,
    "make_h1_avail_div":         4,
    "make_h1_stream_tile_frac":  0.08,

    "split_min_frac":            0.55,
    "hk_ao_aux_budget_frac":     0.30,
    "hk_ip1_ip1_budget_frac":    0.35,
    "aux_aux_budget_frac":       0.20,

    "df_blksize_factor":         0.05
}

Parameter Reference

CPHF / Krylov subspace (solve_mo1)

Key Type Default Description
cphf_subspace_frac float 0.50 Fraction of device-0 free VRAM the Krylov subspace (xs/ax) may occupy. Lower if CPHF OOMs on GPU 0.
cphf_est_cycles int 20 Assumed Krylov iteration count used to size the per-atom subspace cost. Raise to shrink the atom block (less memory, more batches).

make_h1 / _get_jk_ip

Key Type Default Description
make_h1_pinned_frac float 0.40 If wk_Pl_ ≥ this fraction of the smallest GPU's free memory, build it in pinned host memory instead of on-GPU. Lower to spill to host sooner.
make_h1_avail_div int 4 Divisor applied to get_avail_mem inside native make_h1 (smaller blocks). Raise for a smaller footprint.
make_h1_stream_tile_frac float 0.08 Fraction of free VRAM used to size the streamed wk0_10_P__ aux tile. Lower if make_h1 OOMs.

Multi-GPU partial_hess_elec

Key Type Default Description
split_min_frac float 0.55 Minimum share (relative to an even split) each GPU is guaranteed, so no device is starved to zero rows.
hk_ao_aux_budget_frac float 0.30 BlockPlan budget fraction for the ao-aux contraction phase.
hk_ip1_ip1_budget_frac float 0.35 BlockPlan budget fraction for the fused ip1/ip1 phase.
aux_aux_budget_frac float 0.20 BlockPlan budget fraction for the aux-aux (int2c_ip) phase.

DF JK loop

Key Type Default Description
df_blksize_factor float 0.05 Fraction of free VRAM used to size the DF JK-loop block. Lower for a smaller footprint.

If a phase still OOMs at runtime, solve_mo1 additionally auto-halves its atom block and retries, so these knobs mainly trade peak memory for fewer/larger (faster) batches rather than being strictly required for correctness.


Multi-GPU Hessian Patch

lib/patch.py is loaded automatically when "hessian_patch": true (the default). It fixes three upstream issues in GPU4PySCF 1.7 that cause OOM crashes or sub-optimal utilisation when running Hessian calculations with multiple GPUs.

Set "hessian_patch": false to bypass all patches and use the unmodified GPU4PySCF path — recommended only when VRAM per GPU is large enough to hold the full intermediate tensors (typically A100/H100 80 GB cards).

Fix 1 — partial_hess_elec OOM (_hk_ip1_ip1)

Root cause: Inside _partial_hess_ejk, the function _hk_ip1_ip1 computes a tensor contraction 'pikx,pkiy->ikxy' of shape [blksize × nao × nao × 3]. CuPy's einsum implementation reshapes this into a batch matrix multiply of size [nao², blksize, 3], creating a temporary array of equal size. The original blksize formula uses a 0.4 memory factor that does not account for this temporary — the actual peak usage is 2 × blksize × nao² × 3 × 8 bytes, all on GPU 0.

Example: nao=1200, 20 GB card → 23 GB required on GPU 0 alone → OOM.

Fix: _hk_ip1_ip1_multigpu replaces the function for the duration of kernel(). It splits the aux-shell (nnz) axis across all visible GPUs via ThreadPoolExecutor and uses a conservative 0.1 blksize factor. The results are reduced onto GPU 0. Peak memory per GPU drops to ~2.2 GB; wall-time is ~2–3× faster than the original single-GPU path.

Fix 2 — make_h1 single-GPU OOM (get_int3c2e_wjk)

Root cause: make_h1 calls _get_jk_ipget_int3c2e_wjk, which allocates wk_Pl_ of shape [naux × nao × nocc] on GPU 0 alone when it fits within 40% of GPU 0's free VRAM, then solve_j2c allocates a second equally-sized tensor on the same device. The other GPUs are not considered.

Fix: When wk_Pl_ exceeds 40% of the minimum free VRAM across all GPUs, the patch forces the built-in pinned-CPU-memory path. The downstream get_int3c2e_ip1_vjk and get_int3c2e_ip2_vjk functions already distribute work across GPUs via ThreadPoolExecutor and transparently accept CPU/pinned input.

Fix 3 — Redundant XC-kernel recomputation in CPSCF (solve_mo1)

Root cause: Inside the CPSCF Krylov loop, get_veff_resp_mo calls ni.cache_xc_kernel on every iteration to compute the ground-state XC kernel (ρ₀, vxc, fxc). This kernel depends only on the converged SCF density, which does not change during CPSCF — yet the original code recomputes it every step. For a large system (300 000 DFT grid points) this wastes 5–15 s per Krylov iteration, or 250–750 s over 50 iterations.

Fix: ni.cache_xc_kernel is shadowed with a memoised wrapper for the lifetime of kernel(). The first call computes and stores (ρ₀, vxc, fxc); all subsequent calls return the cached result instantly. The wrapper is removed unconditionally in the finally block, leaving the NumInt object unmodified after the Hessian.

Numerical accuracy: all three fixes are exact — they perform identical floating-point operations and produce Hessian values within floating-point rounding error (~10⁻¹⁵) of the original code.


Output Files

All QM outputs are written to output_dir (default: outputs/). Step 3/4 outputs are written to the working directory.

Step 2 outputs

File Description
{gname}_small_opt.xyz Optimised small-model geometry
{gname}_small_opt.json Optimisation metadata (energy, convergence, timings)
{gname}_small_fc.hessian.npy Mass-unweighted Hessian (3N × 3N), Hartree/Bohr²
{gname}_small_fc.hessian.txt Hessian (plain text)
{gname}_small_fc.json Hessian metadata
{gname}_small_fc.fchk Gaussian fchk for small model (if write_fchk: true)
{gname}_large_mk.resp_charges.npy Final RESP charges (natm,), electrons
{gname}_large_mk.resp_charges.txt RESP charges plain text (stage-1 and final)
{gname}_large_mk.json RESP metadata
{gname}_large_mk.fchk Gaussian fchk for large model (if write_fchk: true)

Step 3 outputs

File Description
{gname}_mcpbpy_pre.frcmod Pre-frcmod with mass/VDW/standard parameters
{gname}_mcpbpy.frcmod Final frcmod with Seminario bond/angle force constants
{gname}_step3.json Step-3 metadata

Step 4 outputs

File Description
{mcresname}.mol2 One mol2 per metal-site residue (with RESP charges)
{gname}_mcpbpy.pdb PDB with MCPB-style residue names for tleap
{gname}_tleap.in tleap input (patched set head/tail, cleaned backbone bonds)
{gname}_step4.json Step-4 metadata

MCPB Input File (.in)

The .in file is generated automatically in step 1, or can be written manually. Key fields:

group_name      1okl
ion_ids         260
ion_mol2files   ZN.mol2
original_pdb    1okl_H.pdb
force_field     ff19SB
water_model     opc
cut_off         2.8

The standalone --step2, --step3, and --step4 modes auto-detect this file in the current working directory — no extra arguments required.


Plugin: Gaussian External Interface

plugin/external_pyscf.py is a standalone Gaussian External interface that lets you call PySCF / GPU4PySCF as a drop-in QM engine from within a Gaussian job (geometry optimisations, IRC, frequency, etc.) via the External= keyword.

# Gaussian input example
#External="python /path/to/external_pyscf.py --xc b3lyp --basis def2-svp --gpu 0 --disp d3bj --ri"

This plugin is independent of the MCPB workflow and can be used separately.


License

Released under the MIT License.

About

Generate forcefield parameters for metal enzyme system by using GPU4PYSCF

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors