A GPU-accelerated QM/MM parameterization workflow for metalloenzyme active sites. Combines the MCPB (Metal Center Parameter Builder) methodology with GPU4PySCF to replace the traditional Gaussian QM backend — delivering the same rigorous parameterization on modern NVIDIA hardware.
- Overview
- Features
- Requirements
- Installation
- Quick Start
- SCF Parameters
- GPU Memory Tuning
- Multi-GPU Hessian Patch
- Output Files
- Plugin: Gaussian External Interface
- License
Accurate force-field parameterization of metalloenzyme active sites requires high-quality QM data — optimised geometry, vibrational force constants, and electrostatic potential (ESP) charges — that is prohibitively slow for large models (~150 atoms) using conventional CPU-based codes such as Gaussian.
This project replaces the Gaussian QM backend with PySCF / GPU4PySCF, enabling:
- DF-DFT (density-fitting DFT) with analytical gradients and Hessians on GPU
- Full two-stage RESP fitting within PySCF (no external
respbinary required) - Seminario bond/angle force constants directly from the GPU Hessian matrix
- An end-to-end pipeline from raw PDB to tleap-ready topology
| Step | Engine | Inputs | Key Outputs |
|---|---|---|---|
| 1 — Model building | pyMSMT | Raw PDB | Small/large .com, .in, mol2 templates |
| 2a — Geometry opt | GPU4PySCF | Small model | Optimised .xyz, .json |
| 2b — Hessian | GPU4PySCF | Optimised geometry | Force constants .npy, .fchk |
| 2c — RESP | GPU4PySCF | Large model | Charges .npy, .txt |
| 3 — Frcmod | Seminario | Hessian .npy |
_mcpbpy.frcmod |
| 4 — Leap prep | pyMSMT | RESP mol2 + PDB | _mcpbpy.pdb, _tleap.in |
- Full GPU acceleration — DF-DFT SCF, analytical gradients, and Hessians via GPU4PySCF
- Multi-GPU Hessian — built-in patches (
lib/patch.py) fix upstream single-GPU bottlenecks and OOM issues that occur with ≥3 GPUs (see Multi-GPU Hessian Patch) - Two-stage RESP fitting — stage-1 free fit; stage-2 with pyMSMT-style backbone charge constraints (
chgmod) and CH₂/CH₃ equivalence constraints, entirely withingpu4pyscf.pop.esp - Seminario frcmod — metal-site bond/angle force constants from the GPU4PySCF Hessian (no Gaussian
.fchkrequired) - Open-shell support — automatically selects UKS for radicals / high-spin metal centres
- Dispersion correction — D3BJ / D3 / D4 / D3Zero
- JSON parameter interface — all settings via a single JSON file; no code edits needed
- tleap-ready output —
set head/tailconnect-atom definitions auto-inserted; redundant backbonebondcommands removed - Optional fchk output — Gaussian-compatible
.fchkfiles for Multiwfn, NBO, etc.
| Package | Version | Notes |
|---|---|---|
| Python | ≥ 3.11 | |
pyscf |
2.12.1 | |
gpu4pyscf |
1.7.0 | CUDA toolkit required |
geometric |
1.1 | Geometry optimisation driver |
numpy |
≥ 2.0 | |
openmm |
8.4.0 | PDB preparation |
pdbfixer |
1.12.0 | PDB repair & protonation |
pyMSMT |
22.0 | MCPB model building (steps 1, 3, 4) |
AmberTools (tleap) |
≥ 22 | Final topology generation |
conda create -n gpu4pyscf python=3.11
conda activate gpu4pyscf
# Core stack
conda install -c conda-forge numpy openmm pdbfixer pymsmt ambertools
# PySCF
pip install pyscf geometric
# GPU4PySCF (choose your CUDA version)
pip install gpu4pyscf-cuda12x # CUDA 12.x
# pip install gpu4pyscf-cuda11x # CUDA 11.xClone this repository into any working directory — no installation step required:
git clone <repo-url> mcpb_run
cd mcpb_runpython main.py input.pdb output_H.pdb \
--metal-charge ZN:2 \
--step1 --step2 --step3 --step4 \
--json scf_params.jsonAdd
--gpu-json gpu.jsonto any command that runs a Hessian (step 2) to override the GPU memory-tuning defaults — see GPU Memory Tuning.
Run each step independently from the directory containing the .in file:
# Step 2 — QM calculations (opt + Hessian + RESP)
python main.py --step2 --json scf_params.json
# Step 3 — Seminario frcmod
python main.py --step3 --json scf_params.json
# Step 4 — mol2 files, _mcpbpy.pdb, _tleap.in
python main.py --step4 --json scf_params.jsontleap -s -f {gname}_tleap.inAll settings are controlled by a single JSON file passed with --json.
Copy and edit scf_params.json:
{
"basis": "def2-svp",
"auxbasis": "def2-svp-ri",
"xc": "b3lyp",
"device": "auto",
"cuda_visible_devices": "0,1,2",
"max_cycle": 100,
"conv_tol": 1e-7,
"grid_level": 3,
"damp": 0.2,
"level_shift": 0.2,
"maxsteps": 30,
"hessian": true,
"hessian_device": "gpu",
"hessian_patch": true,
"cphf_grid_level": null,
"output_dir": "outputs",
"disp": "",
"verbose": 4,
"write_fchk": true,
"resp_a1": 5e-4,
"resp_a2": 1e-3,
"resp_b": 0.1,
"hfree": true,
"chgmod": 1,
"scalef": 1.0,
"bond_avg": true,
"angle_avg": true,
"paraset": "12_6"
}| Key | Type | Default | Description |
|---|---|---|---|
basis |
string | "def2-svp" |
Orbital basis set |
auxbasis |
string | "def2-svp-ri" |
Auxiliary basis for density fitting |
xc |
string | "b3lyp" |
XC functional |
device |
string | "auto" |
"auto" (GPU → CPU fallback), "gpu", "cpu" |
cuda_visible_devices |
string | "0" |
GPU index(es) for CUDA_VISIBLE_DEVICES |
max_cycle |
int | 100 |
Maximum SCF iterations |
conv_tol |
float | 1e-7 |
SCF energy convergence threshold (Hartree) |
grid_level |
int | 3 |
DFT integration grid level (1–9) |
damp |
float | 0.2 |
SCF damping factor |
level_shift |
float | 0.2 |
Level-shift value (Hartree) |
maxsteps |
int | 30 |
Maximum geometry optimisation steps |
disp |
string|null | "" |
Dispersion: "d3bj", "d3", "d4", "d3zero", or "" |
verbose |
int | 4 |
PySCF verbosity (0=silent … 5=debug) |
| Key | Type | Default | Description |
|---|---|---|---|
hessian |
bool | true |
Compute Hessian after optimisation |
hessian_device |
string | "same" |
Device for Hessian SCF: "same" as opt, or "cpu" |
hessian_patch |
bool | true |
Enable multi-GPU Hessian patches (see below) |
cphf_grid_level |
int|null | null |
Separate DFT grid level for CPSCF XC response (null = same as grid_level; 2 ≈ SG-1 for ~20–40 % speedup) |
| Key | Type | Default | Description |
|---|---|---|---|
output_dir |
string | "outputs" |
Directory for all QM output files |
write_fchk |
bool | false |
Write Gaussian-compatible .fchk files |
| Key | Type | Default | Description |
|---|---|---|---|
resp_a1 |
float | 5e-4 |
Stage-1 restraint strength |
resp_a2 |
float | 1e-3 |
Stage-2 restraint strength |
resp_b |
float | 0.1 |
Hyperbolic restraint width |
hfree |
bool | true |
Exclude H atoms from RESP restraints |
chgmod |
int | 1 |
Backbone charge constraint level: 0=free; 1=fix CA/N/C/O; 2=+H/HA; 3=+CB |
| Key | Type | Default | Description |
|---|---|---|---|
scalef |
float | 1.0 |
Frequency scale factor (force constants scaled by scalef²) |
bond_avg |
bool | true |
Average both off-diagonal Hessian blocks per bond |
angle_avg |
bool | true |
4-way block average for angle force constants |
| Key | Type | Default | Description |
|---|---|---|---|
paraset |
string | "12_6" |
Ion non-bonded parameter set: "12_6", "12_6_4", "HFE", "CM", "IOD" |
The Hessian pipeline sizes every large GPU buffer at run time from a small set of
memory-tuning knobs. These default to values that work well on the test hardware,
but you can override any of them without editing the source by passing a JSON
file with --gpu-json:
python main.py --step2 --json scf_params.json --gpu-json gpu.jsonThe file is optional — omit --gpu-json to use the built-in defaults. Only the
keys you want to change need to appear; any key starting with _ is treated as a
comment and ignored, and an unknown key raises an error. All values are read when
the Hessian kernel runs, so a single load in main() applies to the whole job.
Accuracy is unaffected. These parameters only change how tensors are blocked and distributed across GPUs — never the arithmetic. Lower a
*_frac/*_factor(or raisecphf_est_cycles/make_h1_avail_div) to use less GPU memory at the cost of speed; raise them to go faster when VRAM is ample.
{
"_comment": "GPU memory-tuning parameters. Pass with --gpu-json gpu.json.",
"cphf_subspace_frac": 0.50,
"cphf_est_cycles": 20,
"make_h1_pinned_frac": 0.40,
"make_h1_avail_div": 4,
"make_h1_stream_tile_frac": 0.08,
"split_min_frac": 0.55,
"hk_ao_aux_budget_frac": 0.30,
"hk_ip1_ip1_budget_frac": 0.35,
"aux_aux_budget_frac": 0.20,
"df_blksize_factor": 0.05
}| Key | Type | Default | Description |
|---|---|---|---|
cphf_subspace_frac |
float | 0.50 |
Fraction of device-0 free VRAM the Krylov subspace (xs/ax) may occupy. Lower if CPHF OOMs on GPU 0. |
cphf_est_cycles |
int | 20 |
Assumed Krylov iteration count used to size the per-atom subspace cost. Raise to shrink the atom block (less memory, more batches). |
| Key | Type | Default | Description |
|---|---|---|---|
make_h1_pinned_frac |
float | 0.40 |
If wk_Pl_ ≥ this fraction of the smallest GPU's free memory, build it in pinned host memory instead of on-GPU. Lower to spill to host sooner. |
make_h1_avail_div |
int | 4 |
Divisor applied to get_avail_mem inside native make_h1 (smaller blocks). Raise for a smaller footprint. |
make_h1_stream_tile_frac |
float | 0.08 |
Fraction of free VRAM used to size the streamed wk0_10_P__ aux tile. Lower if make_h1 OOMs. |
| Key | Type | Default | Description |
|---|---|---|---|
split_min_frac |
float | 0.55 |
Minimum share (relative to an even split) each GPU is guaranteed, so no device is starved to zero rows. |
hk_ao_aux_budget_frac |
float | 0.30 |
BlockPlan budget fraction for the ao-aux contraction phase. |
hk_ip1_ip1_budget_frac |
float | 0.35 |
BlockPlan budget fraction for the fused ip1/ip1 phase. |
aux_aux_budget_frac |
float | 0.20 |
BlockPlan budget fraction for the aux-aux (int2c_ip) phase. |
| Key | Type | Default | Description |
|---|---|---|---|
df_blksize_factor |
float | 0.05 |
Fraction of free VRAM used to size the DF JK-loop block. Lower for a smaller footprint. |
If a phase still OOMs at runtime,
solve_mo1additionally auto-halves its atom block and retries, so these knobs mainly trade peak memory for fewer/larger (faster) batches rather than being strictly required for correctness.
lib/patch.py is loaded automatically when "hessian_patch": true (the default).
It fixes three upstream issues in GPU4PySCF 1.7 that cause OOM crashes or sub-optimal
utilisation when running Hessian calculations with multiple GPUs.
Set
"hessian_patch": falseto bypass all patches and use the unmodified GPU4PySCF path — recommended only when VRAM per GPU is large enough to hold the full intermediate tensors (typically A100/H100 80 GB cards).
Root cause: Inside _partial_hess_ejk, the function _hk_ip1_ip1 computes a
tensor contraction 'pikx,pkiy->ikxy' of shape [blksize × nao × nao × 3].
CuPy's einsum implementation reshapes this into a batch matrix multiply of size
[nao², blksize, 3], creating a temporary array of equal size. The original
blksize formula uses a 0.4 memory factor that does not account for this
temporary — the actual peak usage is 2 × blksize × nao² × 3 × 8 bytes, all on
GPU 0.
Example: nao=1200, 20 GB card → 23 GB required on GPU 0 alone → OOM.
Fix: _hk_ip1_ip1_multigpu replaces the function for the duration of
kernel(). It splits the aux-shell (nnz) axis across all visible GPUs via
ThreadPoolExecutor and uses a conservative 0.1 blksize factor. The results are
reduced onto GPU 0. Peak memory per GPU drops to ~2.2 GB; wall-time is ~2–3× faster
than the original single-GPU path.
Root cause: make_h1 calls _get_jk_ip → get_int3c2e_wjk, which allocates
wk_Pl_ of shape [naux × nao × nocc] on GPU 0 alone when it fits within 40% of
GPU 0's free VRAM, then solve_j2c allocates a second equally-sized tensor on the
same device. The other GPUs are not considered.
Fix: When wk_Pl_ exceeds 40% of the minimum free VRAM across all GPUs, the
patch forces the built-in pinned-CPU-memory path. The downstream
get_int3c2e_ip1_vjk and get_int3c2e_ip2_vjk functions already distribute work
across GPUs via ThreadPoolExecutor and transparently accept CPU/pinned input.
Root cause: Inside the CPSCF Krylov loop, get_veff_resp_mo calls
ni.cache_xc_kernel on every iteration to compute the ground-state XC kernel
(ρ₀, vxc, fxc). This kernel depends only on the converged SCF density, which
does not change during CPSCF — yet the original code recomputes it every step.
For a large system (300 000 DFT grid points) this wastes 5–15 s per Krylov
iteration, or 250–750 s over 50 iterations.
Fix: ni.cache_xc_kernel is shadowed with a memoised wrapper for the lifetime
of kernel(). The first call computes and stores (ρ₀, vxc, fxc); all subsequent
calls return the cached result instantly. The wrapper is removed unconditionally
in the finally block, leaving the NumInt object unmodified after the Hessian.
Numerical accuracy: all three fixes are exact — they perform identical floating-point operations and produce Hessian values within floating-point rounding error (~10⁻¹⁵) of the original code.
All QM outputs are written to output_dir (default: outputs/).
Step 3/4 outputs are written to the working directory.
| File | Description |
|---|---|
{gname}_small_opt.xyz |
Optimised small-model geometry |
{gname}_small_opt.json |
Optimisation metadata (energy, convergence, timings) |
{gname}_small_fc.hessian.npy |
Mass-unweighted Hessian (3N × 3N), Hartree/Bohr² |
{gname}_small_fc.hessian.txt |
Hessian (plain text) |
{gname}_small_fc.json |
Hessian metadata |
{gname}_small_fc.fchk |
Gaussian fchk for small model (if write_fchk: true) |
{gname}_large_mk.resp_charges.npy |
Final RESP charges (natm,), electrons |
{gname}_large_mk.resp_charges.txt |
RESP charges plain text (stage-1 and final) |
{gname}_large_mk.json |
RESP metadata |
{gname}_large_mk.fchk |
Gaussian fchk for large model (if write_fchk: true) |
| File | Description |
|---|---|
{gname}_mcpbpy_pre.frcmod |
Pre-frcmod with mass/VDW/standard parameters |
{gname}_mcpbpy.frcmod |
Final frcmod with Seminario bond/angle force constants |
{gname}_step3.json |
Step-3 metadata |
| File | Description |
|---|---|
{mcresname}.mol2 |
One mol2 per metal-site residue (with RESP charges) |
{gname}_mcpbpy.pdb |
PDB with MCPB-style residue names for tleap |
{gname}_tleap.in |
tleap input (patched set head/tail, cleaned backbone bonds) |
{gname}_step4.json |
Step-4 metadata |
The .in file is generated automatically in step 1, or can be written manually.
Key fields:
group_name 1okl
ion_ids 260
ion_mol2files ZN.mol2
original_pdb 1okl_H.pdb
force_field ff19SB
water_model opc
cut_off 2.8
The standalone --step2, --step3, and --step4 modes auto-detect this file in
the current working directory — no extra arguments required.
plugin/external_pyscf.py is a standalone Gaussian External interface that
lets you call PySCF / GPU4PySCF as a drop-in QM engine from within a Gaussian job
(geometry optimisations, IRC, frequency, etc.) via the External= keyword.
# Gaussian input example
#External="python /path/to/external_pyscf.py --xc b3lyp --basis def2-svp --gpu 0 --disp d3bj --ri"
This plugin is independent of the MCPB workflow and can be used separately.
Released under the MIT License.
