Open-source 3D molecular alignment and scoring engine for shape, electrostatics, and pharmacophore similarity, with a Rust core and Python API.
MolAlign is currently strongest in:
- rigid alignment of drug-like molecules
- batch CPU throughput for rigid scoring/alignment
- integration work toward REINVENT4 and de novo design workflows
Current limitations:
- flexible alignment is still experimental for de novo screening and scaffold hopping
- rigid batch alignment scales well across CPU cores, and flexible batches now parallelize across molecules, but single flexible alignment is still effectively single-core
- LOBSTER oracle benchmarking is in place, but objective selection needs harder analyses than RMSD alone
- DUDE-Z virtual-screening performance currently trails published RoShAMBo results on the public `CXCR4` and `CSF1R` targets
- Gaussian shape overlap scoring (ROCS-like)
- Electrostatic potential similarity (3-Gaussian 1/r fit)
- Pharmacophore feature matching (RDKit features as "color" Gaussians)
- Pharmacophore features now follow flexible atom motion through contributing-atom tracking
- Rigid alignment
- Multi-conformer ranking
- Flexible alignment with alternating rigid/torsion local search
- Batch CPU parallelism via Rayon for rigid and flexible batch processing
- REINVENT4 scoring component plugin
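As a rough illustration of what ROCS-like Gaussian shape scoring computes, the sketch below evaluates a first-order overlap volume between two sets of atom centers and turns it into Tanimoto and Tversky similarities. It is not MolAlign's Rust implementation: the single carbon-like radius, the amplitude `p = 2*sqrt(2)`, the first-order truncation, and the Tversky weights are all assumptions chosen for the example, and the function names are hypothetical.

```python
import math

# Assumptions for illustration only: every atom is a carbon-like sphere of
# radius 1.7 Angstrom, represented by one Gaussian with amplitude p = 2*sqrt(2).
P = 2.0 * math.sqrt(2.0)
RADIUS = 1.7
# Exponent chosen so the Gaussian's integrated volume equals the sphere volume.
ALPHA = math.pi * (3.0 * P / (4.0 * math.pi)) ** (2.0 / 3.0) / RADIUS**2


def pair_overlap(a, b):
    """Overlap integral of two atomic Gaussians with the same exponent."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    # The product of two Gaussians with exponent ALPHA has exponent 2 * ALPHA.
    return P * P * math.exp(-ALPHA * d2 / 2.0) * (math.pi / (2.0 * ALPHA)) ** 1.5


def overlap(mol_a, mol_b):
    """First-order overlap volume: sum over all atom pairs."""
    return sum(pair_overlap(a, b) for a in mol_a for b in mol_b)


def shape_tanimoto(mol_a, mol_b):
    o_ab = overlap(mol_a, mol_b)
    return o_ab / (overlap(mol_a, mol_a) + overlap(mol_b, mol_b) - o_ab)


def shape_tversky(ref, fit, alpha=0.95, beta=0.05):
    """ROCS-style Tversky form; alpha and beta here are illustrative values."""
    return overlap(ref, fit) / (alpha * overlap(ref, ref) + beta * overlap(fit, fit))


# Three collinear "atoms" and a copy shifted by 1 Angstrom along x.
ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in ref]
print(f"self Tanimoto:    {shape_tanimoto(ref, ref):.3f}")
print(f"shifted Tanimoto: {shape_tanimoto(ref, shifted):.3f}")
print(f"shifted Tversky:  {shape_tversky(ref, shifted):.3f}")
```

An alignment engine maximizes this kind of overlap over rigid-body transforms (and, in flexible mode, torsions); the sketch only scores fixed coordinates.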
MolAlign uses Rayon for molecule-level batch parallelism in both rigid and flexible batch workflows.
| Level | Description | Expected Speedup |
|---|---|---|
| Molecule | Parallel rigid batch alignment | Measured, scales well |
| Flexible single molecule | Current state | Effectively single-core |
| Flexible batch | Parallel over molecules in Rust | Best fit for REINVENT-style batches |
# Set thread count via environment variable (before Python import)
RAYON_NUM_THREADS=8 python your_script.py

import os
os.environ['RAYON_NUM_THREADS'] = '8'  # Must be set before importing molalign
from molalign import MolAligner
from molalign.prepared import PreparedScreeningDataset

| Threads | Time | Throughput | Speedup | Efficiency |
|---|---|---|---|---|
| 1 | 17215 ms | 58 mol/s | 1.00x | 100% |
| 2 | 10025 ms | 100 mol/s | 1.72x | 86% |
| 4 | 4898 ms | 204 mol/s | 3.51x | 88% |
| 8 | 3076 ms | 325 mol/s | 5.60x | 70% |
These numbers come from `benchmarks/benchmark_scaling.py` on a synthetic rigid batch benchmark. The script currently measures CPU scaling on randomly generated molecules, not DUDE-Z data.
For comparison, a single flexible alignment still shows only ~1.15x speedup from 1 to 8 Rayon threads, i.e. within-molecule flexible search remains effectively single-core today even though flexible batches now scale across molecules.
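The speedup and efficiency columns follow directly from the wall-clock times: speedup is the single-thread time divided by the N-thread time, and efficiency is speedup divided by N. The short sketch below recomputes them from the timings in the table; the helper name `scaling_table` is hypothetical and not part of `benchmarks/benchmark_scaling.py`.

```python
def scaling_table(timings_ms):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = timings_ms[1]
    rows = []
    for threads in sorted(timings_ms):
        speedup = base / timings_ms[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


# Wall-clock times (ms) copied from the rigid batch table above.
timings = {1: 17215, 2: 10025, 4: 4898, 8: 3076}
for threads, speedup, efficiency in scaling_table(timings):
    print(f"{threads} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

By the same arithmetic, the ~1.15x speedup of a single flexible alignment on 8 threads corresponds to roughly 14% parallel efficiency.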
pip install maturin
maturin develop --release

from molalign import MolAligner
from molalign.prepared import PreparedScreeningDataset

aligner = MolAligner(
    ref_mol=ref_3d_mol,
    preset="balanced",
    mode="flexible",
    num_conformers=50,
    weights={"shape": 0.5, "esp": 0.25, "pharm": 0.25},
    top_k_starts=3,
)
results = aligner.align(smiles_list)
# Large-scale rigid screening can use prepared packed arrays
prepared = PreparedScreeningDataset.from_rdkit_mols(dataset_mols)
screened = aligner.align_prepared(prepared)

Presets:
- `screening`: faster staged rigid pipeline with shape prescreen and ESP-light reranking
- `balanced`: default compromise for general discovery work
- `design`: larger conformer pools and deeper reranking for accuracy-first workflows
Rigid-mode results now include metadata such as generated conformer count, shortlist size, generation time, effective weights, and whether staged reranking was applied.
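To make the role of the `weights` argument concrete, here is a minimal sketch of one plausible way per-component similarities could be folded into a single score. It assumes a normalized linear combination of shape, ESP, and pharmacophore similarities in [0, 1]; the README does not state how MolAlign combines them or how "effective weights" are derived, so both helper functions are hypothetical.

```python
def normalize_weights(weights):
    """Scale weights so they sum to 1 (an assumed meaning of 'effective weights')."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


def combined_score(components, weights):
    """Weighted average of per-component similarities (assumed linear form)."""
    effective = normalize_weights(weights)
    return sum(effective[name] * components[name] for name in effective)


# Illustrative component similarities, not measured values.
components = {"shape": 0.8, "esp": 0.6, "pharm": 0.4}
weights = {"shape": 0.5, "esp": 0.25, "pharm": 0.25}
print(f"combined score: {combined_score(components, weights):.2f}")
```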
# Basic benchmarks
python benchmarks/benchmark_rigid.py
python benchmarks/benchmark_batch.py
# LOBSTER oracle + de novo benchmark
python benchmarks/benchmark_lobster.py
# Aggregate LOBSTER + DUDE-Z scorecard
python benchmarks/benchmark_scorecard.py
# Targeted DUDE-Z calibration smoke run
python benchmarks/benchmark_roshambo_compare.py --targets CXCR4 --modes shape,shape_esp,shape_pharm,balanced --max-queries 1
# RoShAMBo-style DUDE-Z comparison (published metrics only)
python benchmarks/benchmark_roshambo_compare.py
# Parallelization scaling benchmark
python benchmarks/benchmark_scaling.py

LOBSTER oracle mode:
- rigid validation now uses the actual returned `aligned_coords` to compute Shape Tversky/Tanimoto and pose deltas
- native LOBSTER poses and RDKit rigid alignment are both reported as references for rigid single-conformer validation
- score self-consistency is checked by rescoring the returned rigid pose with `score_pre_aligned`
Validated run on subset_90/80/70 with 30 pairs each, 12 workers, and 20 random baseline samples:
| Scheme | Returned Tv | Native Tv | RDKit Tv |
|---|---|---|---|
| shape | 0.896 | 0.892 | 0.894 |
| shape+esp | 0.897 | 0.892 | 0.894 |
| shape+pharm | 0.899 | 0.892 | 0.890 |
LOBSTER de novo mode:
- generated conformer pools are now used to validate rigid multi-conformer selection directly
- the harness reports oracle-best conformer in the generated pool, selected conformer, regret, and top-1/top-3 selection accuracy
- recommended smoke run:
python benchmarks/benchmark_lobster.py --mode all --subsets subset_90 --max-pairs 5 --n-confs 10 --seeds 42,43
Validated de novo run on subset_90/80/70 with 30 pairs each, 10 conformers, and seeds 42,43,44:
| Scheme | Selected Tv | Best Tv | Regret | Top-1 | Top-3 |
|---|---|---|---|---|---|
| shape | 0.829 | 0.839 | 0.010 | 49.6% | 77.4% |
| shape+esp | 0.835 | 0.844 | 0.008 | 55.6% | 85.6% |
| shape+pharm | 0.818 | 0.831 | 0.013 | 45.2% | 76.3% |
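The regret and top-k columns can be read as per-pair quantities averaged over the benchmark. The sketch below shows one way to compute them for a single conformer pool, assuming regret is the oracle-best Tversky minus the Tversky of the selected conformer, and that top-k means the selected conformer is among the k best by oracle Tversky. The harness in `benchmarks/benchmark_lobster.py` may define them differently; `selection_metrics` is a hypothetical name.

```python
def selection_metrics(oracle_tv, selected_idx, ks=(1, 3)):
    """Compare the scorer's chosen conformer with the oracle ranking of the pool.

    oracle_tv: oracle Shape Tversky for each generated conformer.
    selected_idx: index of the conformer the scoring scheme picked.
    """
    ranked = sorted(range(len(oracle_tv)), key=lambda i: oracle_tv[i], reverse=True)
    best_tv = oracle_tv[ranked[0]]
    selected_tv = oracle_tv[selected_idx]
    metrics = {
        "selected_tv": selected_tv,
        "best_tv": best_tv,
        "regret": best_tv - selected_tv,
    }
    for k in ks:
        # True when the selected conformer is within the oracle's top k.
        metrics[f"top{k}"] = selected_idx in ranked[:k]
    return metrics


# Illustrative pool of four conformers; the scorer picked index 3.
pool_tv = [0.81, 0.84, 0.79, 0.83]
result = selection_metrics(pool_tv, selected_idx=3)
print(result)
```

Averaging `regret` and the boolean `top1`/`top3` flags over all pairs and seeds gives table-style summary numbers.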
Current roadmap progress:
- completed: benchmark scorecard generation, staged rigid shortlist/rerank pipeline, adaptive conformer pruning controls, result metadata, molecule-aware rigid calibration
- completed: major Rust rigid speed pass with in-place transforms, top-start refinement, conformer-level parallelism, and SIMD-backed shape overlap
- next: recover the small quality regression introduced by the fast path while keeping most of the new throughput, then continue rigid ranking/pruning work before flexible search redesign
RoShAMBo-style DUDE-Z comparison:
- implemented on the public `CXCR4` and `CSF1R` 3D pose sets
- MolAlign currently underperforms published RoShAMBo enrichment and AUC metrics on these targets
- this is a real gap to close before publication claims about VS competitiveness
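For readers unfamiliar with the virtual-screening metrics mentioned above, the following sketch computes ROC AUC and an enrichment factor from a ranked list of actives and decoys. It uses the standard textbook definitions; the exact cutoffs and tie handling in `benchmarks/benchmark_roshambo_compare.py` are not shown here, so the function names and the 20% cutoff in the example are assumptions.

```python
def roc_auc(scores, labels):
    """Probability that a random active outscores a random decoy (ties count 0.5)."""
    actives = [s for s, y in zip(scores, labels) if y]
    decoys = [s for s, y in zip(scores, labels) if not y]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction=0.01):
    """Active rate in the top `fraction` of the ranking divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    hits = sum(y for _, y in ranked[:n_top])
    return (hits / n_top) / (sum(labels) / len(labels))


# Toy screen: 2 actives (label 1) among 10 molecules, scored by similarity.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(f"AUC:    {roc_auc(scores, labels):.4f}")
print(f"EF 20%: {enrichment_factor(scores, labels, fraction=0.2):.2f}")
```

The pairwise AUC loop is quadratic in the number of molecules, which is fine for a sketch but would be replaced by a rank-based computation for full DUDE-Z target sets.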