AdrienCerdan/molalign

MolAlign

Open-source 3D molecular alignment and scoring engine for shape, electrostatics, and pharmacophore similarity, with a Rust core and Python API.

Current Status

MolAlign is currently strongest in:

  • rigid alignment of drug-like molecules
  • batch CPU throughput for rigid scoring/alignment
  • integration work toward REINVENT4 and de novo design workflows

Current limitations:

  • flexible alignment is still experimental for de novo screening and scaffold hopping
  • rigid batch alignment scales well across CPU cores and flexible batches now parallelize across molecules, but a single flexible alignment is still effectively single-core
  • LOBSTER oracle benchmarking is in place, but objective selection needs harder analyses than RMSD alone
  • DUDE-Z virtual-screening performance currently trails published RoShAMBo results on the public CXCR4 and CSF1R targets

Features

  • Gaussian shape overlap scoring (ROCS-like)
  • Electrostatic potential similarity (3-Gaussian 1/r fit)
  • Pharmacophore feature matching (RDKit features as "color" Gaussians)
  • Pharmacophore features now follow flexible atom motion through contributing-atom tracking
  • Rigid alignment
  • Multi-conformer ranking
  • Flexible alignment with alternating rigid/torsion local search
  • Batch CPU parallelism via Rayon for rigid and flexible batch processing
  • REINVENT4 scoring component plugin
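
To make the first feature concrete, here is a minimal sketch of first-order Gaussian shape overlap, the idea behind ROCS-like scoring. It is not MolAlign's Rust implementation: the function names, the single shared atom radius, and the amplitude P = 2.7 are illustrative assumptions, and only pairwise (first-order) overlap terms are summed.

```python
import math

P = 2.7        # Gaussian amplitude (assumed; a common literature choice)
RADIUS = 1.7   # one shared atom radius in angstroms (assumed, carbon-like)
# Width chosen so each atomic Gaussian integrates to the hard-sphere volume.
ALPHA = math.pi * (3.0 * P / (4.0 * math.pi)) ** (2.0 / 3.0) / RADIUS ** 2


def overlap(coords_a, coords_b):
    """Sum of pairwise Gaussian overlap integrals between two atom sets."""
    # For two Gaussians of equal width alpha separated by d, the overlap is
    # P^2 * (pi / (2 * alpha))^(3/2) * exp(-alpha * d^2 / 2).
    prefactor = P * P * (math.pi / (2.0 * ALPHA)) ** 1.5
    total = 0.0
    for a in coords_a:
        for b in coords_b:
            d2 = math.dist(a, b) ** 2
            total += prefactor * math.exp(-0.5 * ALPHA * d2)
    return total


def shape_tanimoto(coords_a, coords_b):
    """Tanimoto similarity built from self- and cross-overlaps."""
    o_ab = overlap(coords_a, coords_b)
    o_aa = overlap(coords_a, coords_a)
    o_bb = overlap(coords_b, coords_b)
    return o_ab / (o_aa + o_bb - o_ab)


# Toy three-atom "molecule" and a copy displaced by 1 angstrom along x.
ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in ref]

print(shape_tanimoto(ref, ref))            # identical poses give 1.0
print(shape_tanimoto(ref, shifted) < 1.0)  # a displaced copy scores lower
```

An alignment engine would maximize this overlap over rigid transforms of the query; the sketch only scores fixed coordinates.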

Parallelization

MolAlign uses Rayon for molecule-level batch parallelism in both rigid and flexible batch workflows.

Level                     Description                      Expected Speedup
Molecule                  Parallel rigid batch alignment   Measured, scales well
Flexible single molecule  Current state                    Effectively single-core
Flexible batch            Parallel over molecules in Rust  Best fit for REINVENT-style batches

Thread Configuration

# Set thread count via environment variable (before Python import)
RAYON_NUM_THREADS=8 python your_script.py

# Alternatively, set it from inside Python
import os
os.environ['RAYON_NUM_THREADS'] = '8'  # Must be set before importing molalign
from molalign import MolAligner
from molalign.prepared import PreparedScreeningDataset

Measured Scaling

Threads  Time      Throughput  Speedup  Efficiency
1        17215 ms  58 mol/s    1.00x    100%
2        10025 ms  100 mol/s   1.72x    86%
4        4898 ms   204 mol/s   3.51x    88%
8        3076 ms   325 mol/s   5.60x    70%

These numbers come from benchmarks/benchmark_scaling.py on a synthetic rigid batch benchmark. The script currently measures CPU scaling on randomly generated molecules, not DUDE-Z data.

For comparison, a single flexible alignment speeds up only ~1.15x when going from 1 to 8 Rayon threads. Within-molecule flexible search therefore remains effectively single-core today, even though flexible batches now scale across molecules.
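
The Speedup and Efficiency columns in the scaling table follow directly from the wall-clock times. A short sketch of that arithmetic, assuming speedup is the 1-thread time divided by the N-thread time and efficiency is speedup divided by N (the helper name is illustrative and not taken from benchmarks/benchmark_scaling.py):

```python
# Wall-clock times in milliseconds, copied from the Measured Scaling table.
times_ms = {1: 17215, 2: 10025, 4: 4898, 8: 3076}


def scaling(times):
    """Return (threads, speedup, efficiency) rows relative to 1 thread."""
    base = times[1]
    rows = []
    for threads, elapsed in sorted(times.items()):
        speedup = base / elapsed
        efficiency = speedup / threads
        rows.append((threads, speedup, efficiency))
    return rows


for threads, speedup, efficiency in scaling(times_ms):
    print(f"{threads} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

Running it reproduces the table values (1.72x at 2 threads, 5.60x and 70% at 8 threads).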

Installation

pip install maturin
maturin develop --release

Usage

from molalign import MolAligner
from molalign.prepared import PreparedScreeningDataset

aligner = MolAligner(
    ref_mol=ref_3d_mol,
    preset="balanced",
    mode="flexible",
    num_conformers=50,
    weights={"shape": 0.5, "esp": 0.25, "pharm": 0.25},
    top_k_starts=3,
)

results = aligner.align(smiles_list)

# Large-scale rigid screening can use prepared packed arrays
prepared = PreparedScreeningDataset.from_rdkit_mols(dataset_mols)
screened = aligner.align_prepared(prepared)

Presets:

  • screening: faster staged rigid pipeline with shape prescreen and ESP-light reranking
  • balanced: default compromise for general discovery work
  • design: larger conformer pools and deeper reranking for accuracy-first workflows

Rigid-mode results now include metadata such as generated conformer count, shortlist size, generation time, effective weights, and whether staged reranking was applied.
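
The weights argument and the "effective weights" metadata suggest a weighted combination of per-term similarities. The sketch below is only one plausible reading, assuming each term is a similarity in [0, 1] and that weights are normalized to sum to 1. The function combine_scores and the example term scores are hypothetical and not part of MolAlign's API.

```python
def combine_scores(scores, weights):
    """Weighted mean of per-term similarities (hypothetical helper)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the effective weights sum to 1.
    effective = {name: w / total for name, w in weights.items()}
    combined = sum(effective[name] * scores[name] for name in effective)
    return combined, effective


weights = {"shape": 0.5, "esp": 0.25, "pharm": 0.25}  # from the Usage example
scores = {"shape": 0.8, "esp": 0.6, "pharm": 0.4}     # made-up term scores

combined, effective = combine_scores(scores, weights)
print(round(combined, 3), effective)
```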

Benchmarks

# Basic benchmarks
python benchmarks/benchmark_rigid.py
python benchmarks/benchmark_batch.py

# LOBSTER oracle + de novo benchmark
python benchmarks/benchmark_lobster.py

# Aggregate LOBSTER + DUDE-Z scorecard
python benchmarks/benchmark_scorecard.py

# Targeted DUDE-Z calibration smoke run
python benchmarks/benchmark_roshambo_compare.py --targets CXCR4 --modes shape,shape_esp,shape_pharm,balanced --max-queries 1

# RoShAMBo-style DUDE-Z comparison (published metrics only)
python benchmarks/benchmark_roshambo_compare.py

# Parallelization scaling benchmark
python benchmarks/benchmark_scaling.py

Current Benchmark Snapshot

LOBSTER oracle mode:

  • rigid validation now uses the actual returned aligned_coords to compute Shape Tversky/Tanimoto and pose deltas
  • native LOBSTER poses and RDKit rigid alignment are both reported as references for rigid single-conformer validation
  • score self-consistency is checked by rescoring the returned rigid pose with score_pre_aligned

Validated run on subset_90/80/70 with 30 pairs each, 12 workers, and 20 random baseline samples:

Scheme       Returned Tv  Native Tv  RDKit Tv
shape        0.896        0.892      0.894
shape+esp    0.897        0.892      0.894
shape+pharm  0.899        0.892      0.890

LOBSTER de novo mode:

  • generated conformer pools are now used to validate rigid multi-conformer selection directly
  • the harness reports oracle-best conformer in the generated pool, selected conformer, regret, and top-1/top-3 selection accuracy
  • recommended smoke run: python benchmarks/benchmark_lobster.py --mode all --subsets subset_90 --max-pairs 5 --n-confs 10 --seeds 42,43
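
The selection metrics listed above can be illustrated with a small sketch. It assumes regret is the oracle-best score minus the oracle score of the selected conformer, and that a top-k hit means the oracle-best conformer is among the k conformers ranked highest by the scoring function. The helper name and the toy numbers are hypothetical, and the benchmark harness may define these quantities differently.

```python
def selection_metrics(oracle, predicted, k=3):
    """oracle/predicted: per-conformer scores in the same order, higher is better."""
    best = max(range(len(oracle)), key=oracle.__getitem__)
    # Conformer indices ordered by the score used for selection, best first.
    ranked = sorted(range(len(predicted)), key=predicted.__getitem__, reverse=True)
    selected = ranked[0]
    return {
        "best": best,
        "selected": selected,
        "regret": oracle[best] - oracle[selected],
        "top1": selected == best,
        "topk": best in ranked[:k],
    }


oracle = [0.81, 0.84, 0.79, 0.83]     # toy oracle Tversky per conformer
predicted = [0.70, 0.72, 0.69, 0.75]  # toy scores used to pick a conformer

m = selection_metrics(oracle, predicted)
print(m["selected"], m["best"], round(m["regret"], 3), m["top1"], m["topk"])
```

In this toy case conformer 3 is selected while conformer 1 is oracle-best, so the regret is 0.01, top-1 misses, and top-3 hits. Averaging these per-pair values over a benchmark would give table-style summaries.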

Validated de novo run on subset_90/80/70 with 30 pairs each, 10 conformers, and seeds 42,43,44:

Scheme       Selected Tv  Best Tv  Regret  Top-1  Top-3
shape        0.829        0.839    0.010   49.6%  77.4%
shape+esp    0.835        0.844    0.008   55.6%  85.6%
shape+pharm  0.818        0.831    0.013   45.2%  76.3%

Current roadmap progress:

  • completed: benchmark scorecard generation, staged rigid shortlist/rerank pipeline, adaptive conformer pruning controls, result metadata, molecule-aware rigid calibration
  • completed: major Rust rigid speed pass with in-place transforms, top-start refinement, conformer-level parallelism, and SIMD-backed shape overlap
  • next: recover the small quality regression introduced by the fast path while keeping most of the new throughput, then continue rigid ranking/pruning work before flexible search redesign

RoShAMBo-style DUDE-Z comparison:

  • implemented on the public CXCR4 and CSF1R 3D pose sets
  • MolAlign currently underperforms published RoShAMBo enrichment and AUC metrics on these targets
  • this gap must be closed before making any published claims about virtual-screening (VS) competitiveness
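
For readers unfamiliar with the enrichment and AUC metrics mentioned above, here is a minimal sketch using their standard definitions on a ranked list of actives and decoys. The function names and the toy ranking are illustrative, ties are ignored, and the cutoffs used by benchmark_roshambo_compare.py are not specified here.

```python
def enrichment_factor(labels_ranked, fraction=0.01):
    """labels_ranked: 1 for active, 0 for decoy, sorted best score first."""
    n = len(labels_ranked)
    n_top = max(1, round(n * fraction))
    actives_total = sum(labels_ranked)
    actives_top = sum(labels_ranked[:n_top])
    # Active rate in the top slice divided by the active rate overall.
    return (actives_top / n_top) / (actives_total / n)


def roc_auc(labels_ranked):
    """Probability that a random active is ranked above a random decoy."""
    actives = sum(labels_ranked)
    decoys = len(labels_ranked) - actives
    decoys_seen = 0
    pairs_won = 0
    # Walk from worst to best: each active outranks every decoy seen so far.
    for label in reversed(labels_ranked):
        if label:
            pairs_won += decoys_seen
        else:
            decoys_seen += 1
    return pairs_won / (actives * decoys)


ranked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # toy ranking: 3 actives, 7 decoys
print(round(enrichment_factor(ranked, fraction=0.2), 3))
print(round(roc_auc(ranked), 3))
```

A random ranking gives an enrichment factor near 1 and an AUC near 0.5, which is the baseline any comparison against published numbers should keep in view.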
