Molecular docking, the prediction of the binding mode and binding affinity of a molecule to a target of known structure, is a widely used computational tool for structure-based drug design. However, docking scoring functions are mostly empirical or knowledge-based, and most docking studies neglect receptor flexibility entirely. Recent work has shown that scoring functions can be learnt effectively by convolutional neural networks (CNNs). This project builds on those findings to develop a CNN scoring function for flexible docking by extending gnina, a state-of-the-art deep learning framework for molecular docking, and by building a high-quality training dataset for flexible docking.
This Google Summer of Code 2019 project aims to extend the capabilities of gnina, the deep learning framework for molecular docking developed in David Koes's group, to build a CNN-based scoring function for docking with flexible side chains.
The main stages of the project are the following:
- Build a high-quality training dataset of docking with flexible side chains
  - Get and pre-process PDBbind18 (see `PDBbind18/README.md`)
  - Re-docking with flexible side chains using smina (see `docking/README.md`)
  - Optimize crystal poses using smina (see `optimisation/README.md`)
  - Build training dataset (see `datasets/flexdock/README.md`)
- Enable optimisation of flexible side chains (see PR #73)
  - Split ligand and receptor movable atoms in the correct channels
  - Combine ligand and receptor gradients for geometry optimisation
- Train a new CNN-based scoring function for docking with flexible side chains (see `mltraining/README.md`)
  - Evaluate the performance of pose prediction
  - Evaluate the performance of pose optimisation
- Iterate training on datasets augmented with CNN-optimized poses
  - Optimize docking poses with the CNN (see `mlopt/README.md`)
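
As a rough illustration of the re-docking stage listed above, the following sketch assembles a smina command line for docking with flexible side chains. It is not the pipeline used in this repository (see `docking/README.md` for that): the file names are hypothetical, and the numeric values (flexible-residue cutoff, box padding, exhaustiveness, number of modes, seed) are placeholder assumptions rather than the settings used in the project. Only the argument list is built; nothing is executed.

```python
from typing import List


def smina_flexdock_command(
    receptor: str,
    ligand: str,
    out_ligand: str,
    out_flex: str,
    flexdist: float = 3.0,      # assumed cutoff (in angstrom), not the project's value
    autobox_add: float = 10.0,  # assumed padding around the ligand box
    exhaustiveness: int = 8,
    num_modes: int = 20,
    seed: int = 42,
) -> List[str]:
    """Build a smina argument list for re-docking with flexible side chains.

    Side chains within `flexdist` of the reference ligand are treated as
    flexible; the search box is derived from the same ligand.
    """
    return [
        "smina",
        "-r", receptor,
        "-l", ligand,
        "--flexdist_ligand", ligand,   # reference ligand used to select flexible residues
        "--flexdist", str(flexdist),
        "--autobox_ligand", ligand,    # search box built around the crystal ligand
        "--autobox_add", str(autobox_add),
        "--exhaustiveness", str(exhaustiveness),
        "--num_modes", str(num_modes),
        "--seed", str(seed),
        "--out", out_ligand,           # docked ligand poses
        "--out_flex", out_flex,        # matching flexible side-chain conformations
    ]


# Hypothetical file names, for illustration only.
cmd = smina_flexdock_command(
    "1abc_protein.pdb", "1abc_ligand.sdf", "1abc_dock.sdf", "1abc_flex.pdb"
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where smina is installed. Writing the flexible residues to a separate file with `--out_flex` keeps each ligand pose paired with the side-chain conformation it was docked against.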
This repository collects the different pipelines built in order to achieve the project goals. A list of contributions and fixes to openbabel, smina and gnina (OpenChemistry organisation) and MDAnalysis (NumFOCUS organisation) is given below.
The datasets related to this project will be released on Zenodo in due time.
List of contributions to gnina and gnina-scripts:
- Optimisation of flexible side chains (PR #73)
- Added option to `pymol_arrows.py` (PR #31)
- Low-memory and faster substitute `combine_rows.py` (PR #30)
- Attempt to decrease memory usage of `combine_rows.py` (PR #29)
- Added serialization of `struct residue` (PR #74)
- Small fixes to `gninavis` for gradients (PR #72)
- Fixed Python3 `pickle` in clustering pipeline (PR #26)
- Added insertion code support to `makeflex.py` (PR #65)
- Improved `makeflex.py` script to deal with PDB files without atom types (PR #64)
- Added test support for newer versions of Boost (PR #62)
- Provided documentation and PDB standardization for `makeflex.py` script (PR #61)
- Provided fixes for the `makeflex.py` script (PR #60)
- Raised issue about `gnina` parallel compilation without `libmolgrid` installed (Issue #57)
- Updated `PDBQTUtilities.cpp` to latest OpenBabel version (PR #59)
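
Several of the items above concern PDB insertion codes. For readers unfamiliar with them, here is a minimal sketch of how a residue identifier, including the insertion code, can be read from the fixed-width columns of a PDB `ATOM`/`HETATM` record. This is not the code in `makeflex.py`: the `ResidueId` type and the `parse_residue_id` function are hypothetical names introduced for illustration, and the two example records are made up.

```python
from typing import NamedTuple


class ResidueId(NamedTuple):
    chain: str
    resname: str
    resnum: int
    icode: str  # empty string when the residue has no insertion code


def parse_residue_id(line: str) -> ResidueId:
    """Extract the residue identifier from a PDB ATOM/HETATM record.

    PDB records are fixed-width: residue name in columns 18-20, chain ID in
    column 22, residue number in columns 23-26, insertion code in column 27
    (1-based columns, hence the 0-based slices below).
    """
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return ResidueId(
        chain=line[21],
        resname=line[17:20].strip(),
        resnum=int(line[22:26]),
        icode=line[26].strip(),
    )


# Two made-up records: residue 60A (with insertion code) and residue 61 (without).
with_icode = "ATOM      1  N   SER A  60A     11.104   6.134  -6.504"
without_icode = "ATOM      2  CA  SER A  61      12.000   7.000  -5.000"

print(parse_residue_id(with_icode))
print(parse_residue_id(without_icode))
```

Residues `60` and `60A` on the same chain are distinct, so a script that selects flexible residues by number alone can confuse them. Keeping the insertion code as part of the identifier avoids that ambiguity.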
List of contributions to libmolgrid:
- Fixed issue with unsupported CUDA architecture (PR #5)
List of contributions to smina:
- Fixed a problem with proline residues, broken by flexible docking (MR #3)
List of contributions to openbabel:
- Fixed various problems with PDB and PDBQT insertion codes (PR #1998)
- Fixed CMake when compiling without RapidJSON (PR #1988)
List of contributions to MDAnalysis:
- Improved mass guess (PR #2331)
- Fixed issues with PDB `HEADER` field in `PDBReader` and `PDBWriter` (PR #2325)
- Allowed MOL2 parser to ignore status bit strings (PR #2319)
- Dr. David Ryan Koes, Assistant Professor, Department of Computational and Systems Biology, University of Pittsburgh
- Jocelyn Sunseri, Computational Biology Doctoral Candidate, Carnegie Mellon and University of Pittsburgh