GitHub - Goodman-lab/DP5: Python workflow for DP5 and DP4 analysis of organic molecules

===============================================================

DP5

version 1.0

Copyright (c) 2020 Alexander Howarth, Kristaps Ermanis, Jonathan M. Goodman

distributed under MIT license

===============================================================

CONTENTS
1) Release Notes
2) Requirements and Setup
3) Usage
4) NMR Description Format
5) Included Utilities
6) Code Organization

===============================================================

RELEASE NOTES

This latest release represents a leap forward in structural uncertainty calculation through the addition of the new DP5
module. By adding this functionality to our existing state-of-the-art DP4 software, both DP4 and DP5
probabilities can be calculated alongside each other fully automatically.

DP5 as described in our latest paper (https://doi.org/10.1039/D1SC04406K) calculates the standalone probability of structures being correct.
If a user cannot guarantee the correct structure is in the list of proposals, DP5 calculates the probability each is
correct, quantifying this uncertainty.

When the user is certain that one of their proposals is correct, i.e. in the case of elucidating relative stereochemistry,
DP4 is the best metric to use, as it uses the additional information that one of the proposals must be correct.

The power of DP5 and DP4, however, is greatest when they are used in conjunction, as DP5 can be used to increase
reliability and confidence in the conclusions of DP4 calculations (also described in the paper).

DP5 (https://doi.org/10.1039/D1SC04406K) integrates NMR-AI, software for automatic processing, assignment
and visualisation of raw NMR data. This functionality affords fully automated DP4 analysis of databases of molecules
with no user input. DP5 also utilises a fully revised version of PyDP4 updated for Python 3 and with major workflow
improvements.

Users can now also choose to run DP5 within a graphical user interface for increased ease of use when running single DP4
calculations. This graphical user interface can also be used to explore the spectra processed and assigned by NMR-AI.
DP5 can also be utilised to perform assignments of NMR spectra for a single isomer of a molecule. These assignments
can be viewed in the graphical output from DP5 or investigated interactively using the included GUI application.

More details about the software can be found in the publication DP5 Straight from Spectrometer to Structure
(https://doi.org/10.1039/D1SC04406K) and in the accompanying supporting information.

===============================================================

REQUIREMENTS AND SETUP

All the Python files and one utility to convert to and from the nonstandard TINKER
xyz file format are in the attached archive. They are set up to work
from a centralised location. It is not a requirement, but it is probably
best to add the location to the PATH variable.

The script is currently set up to use MacroModel for molecular mechanics and
Gaussian for DFT, and it runs Gaussian on ziggy by default. NWChem and TINKER are also supported.
This setup has several requirements.

1) One should have MacroModel or TINKER and NWChem or Gaussian.
The DP5 folder can contain settings.cfg, where the locations of the installed software packages
can be given, so that DP5 can launch them and use them in the workflow. Examples of how to do
this are provided in the included settings.example - this should be renamed to settings.cfg and
any relevant changes made.

2) To run DP5, a Python 3.6+ environment with the following packages is required:
numpy scipy Cython matplotlib nmrglue statsmodels lmfit openbabel rdkit pathos qml
Additionally, the graphical user interface requires PyQt5.
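
A quick way to check an environment before running DP5 is to test whether each
package can be found for import. The following is a minimal sketch and not part of DP5:
the helper name missing_modules is hypothetical, and the import names are assumed to
match the package names listed above.

```python
import importlib.util

# Import names assumed to match the packages listed above.
REQUIRED = ["numpy", "scipy", "Cython", "matplotlib", "nmrglue",
            "statsmodels", "lmfit", "openbabel", "rdkit", "pathos", "qml"]
GUI_ONLY = ["PyQt5"]  # only needed for the graphical user interface


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("GUI only", GUI_ONLY)):
        absent = missing_modules(names)
        status = "all found" if not absent else "missing " + ", ".join(absent)
        print(label + ": " + status)
```

This only checks that a module can be located; it does not verify versions or that
compiled extensions load correctly.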

3) RDKit and OpenBabel are required for the automatic diastereomer generation and
other manipulations of sdf files (renumbering, ring corner flipping), as well as molecule
graphical plotting. For OpenBabel this requirement includes Python bindings.
The following links provide instructions for building OpenBabel with Python bindings:
http://openbabel.org/docs/dev/UseTheLibrary/PythonInstall.html
http://openbabel.org/docs/dev/Installation/install.html#compile-bindings
Alternatively, the OpenBabel package in the conda repository has also lately become
more reliable, and is much easier to install via 'conda install openbabel -c conda-forge'. Both OpenBabel 2.4.1 and 3.x are supported.

4) Finally, to run calculations on a computational cluster, a passwordless
ssh connection should be set up in both directions -
desktop -> cluster and cluster -> desktop. In most cases the modification
of the relevant functions in Gaussian.py or NWChem.py will be required
to fit your situation.

5) All development and testing was done on Linux. However, both the scripts
and all the backend software should work equally well on MacOS with minor adjustments.
Windows will require more modification, and is currently not supported.

===================

USAGE - GUI

To call the script from the graphical user interface, open the folder containing the input files in a terminal
and activate the correct Python environment.

The GUI is then simply called by:

python PyDP4_GUI.py

This will open the main DP5 GUI window, where the user can select the required input files and the settings for MM and
DFT calculations. The expected structure file format is *.sdf

Once DP5 has finished all calculations, the GUI will open a number of new tabs. These tabs display interactive
carbon and proton NMR spectra as processed and assigned by NMR-AI, as well as interactive plots of the NMR prediction
errors and probabilities, and conformer energies and populations.

===================

USAGE - Terminal

To call the script from terminal:

1) With all diastereomer generation:

PyDP4.py Candidate CandidateNMR

where Candidate is the sdf file containing 3D coordinates of the candidate
structure (without the extension).

DP5 will automatically detect whether an NMR description (see below) or raw NMR data has been provided by the user.
If raw NMR data has been provided (currently in the Bruker file format), NMR-AI will automatically process the NMR spectra
provided. In this case the NMR data should be organised into a single folder (passed to DP5) containing one or two
folders labelled Proton and (or) Carbon.
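
The folder layout described above can be checked before starting a run. This is a
minimal sketch under the assumption that the subfolders are named exactly Proton and
Carbon; the function name find_raw_nmr_folders is hypothetical and not part of DP5.

```python
from pathlib import Path


def find_raw_nmr_folders(nmr_dir):
    """Return which of the expected subfolders (Proton, Carbon) exist in nmr_dir."""
    root = Path(nmr_dir)
    return [name for name in ("Proton", "Carbon") if (root / name).is_dir()]
```

For example, find_raw_nmr_folders("CandidateNMR") returns an empty list if neither
subfolder is present, which suggests the path is not a raw NMR data folder in the
expected layout.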

Alternatively:

PyDP4.py -w gmns Candidate CandidateNMR

The -w switch specifies the PyDP4 workflow as a string of letters:

  c  structure cleaning utilising RDKit
  g  diastereomer generation
  m  molecular mechanics conformational searching
  o  DFT geometry optimisation
  n  DFT NMR calculation
  e  M062X single point energy calculations
  s  DP4 statistics
  w  DP5 statistics

The default workflow is gnms; the optimum workflow is gnomes. The letter a can be given in place of s, in which case
DP5 will analyse and assign the provided NMR spectra but not calculate any probabilities.
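
To make the meaning of a workflow string explicit, the letters described above can be
expanded into their steps. This is an illustrative sketch only; it is not how PyDP4.py
interprets the -w switch, and the names WORKFLOW_STEPS and describe_workflow are
hypothetical.

```python
# Workflow letters as described in the text above.
WORKFLOW_STEPS = {
    "c": "structure cleaning with RDKit",
    "g": "diastereomer generation",
    "m": "molecular mechanics conformational search",
    "o": "DFT geometry optimisation",
    "n": "DFT NMR calculation",
    "e": "M062X single point energy calculation",
    "s": "DP4 statistics",
    "w": "DP5 statistics",
    "a": "NMR assignment only (no probabilities)",
}


def describe_workflow(letters):
    """Map each letter of a -w string to its description; reject unknown letters."""
    unknown = [ch for ch in letters if ch not in WORKFLOW_STEPS]
    if unknown:
        raise ValueError("unknown workflow letter(s): " + ", ".join(unknown))
    return [WORKFLOW_STEPS[ch] for ch in letters]


if __name__ == "__main__":
    for step in describe_workflow("gnomes"):
        print(step)
```

The steps are listed in the order the letters are given, which need not be the order in
which PyDP4.py runs them.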

In addition the -s switch can be used to specify the solvent for use in MM and DFT calculations. Supported solvents are
listed in the TMSdata file. Other Gaussian/Jaguar/NWChem supported solvents can be used, but only with manually interpreted
data.

PyDP4.py -s chloroform Candidate CandidateNMR

If solvent is not given, no solvent is used.

2) With explicit diastereomer/other candidate structures:

PyDP4.py Candidate1 Candidate2 Candidate3 ... CandidateNMR

In this case the script does not attempt to generate diastereomers and simply carries out the
DP4 analysis on the specified candidate structures.

Structures can also be added from InChI, SMILES and SMARTS strings using the designated switches. For example:

PyDP4.py --InChI Candidates_inchis.inchi ... CandidateNMR

where Candidates_inchis.inchi is a text file with all the desired InChI strings on separate lines.
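
Such a file can be written from a list of strings. The sketch below is a hypothetical
helper (write_inchi_file is not part of DP5) that writes one InChI string per line, as
described above.

```python
from pathlib import Path


def write_inchi_file(inchis, path):
    """Write the given InChI strings to `path`, one per line, skipping blanks.

    Returns the number of strings written.
    """
    lines = [s.strip() for s in inchis if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)
```

For instance, write_inchi_file(["InChI=1S/CH4/h1H4", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"], "Candidates_inchis.inchi")
writes a two-line file; methane and ethanol are placeholders here to show the layout,
not realistic candidate structures.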

The script has several other switches, including ones for selecting the molecular mechanics and DFT software:

  -m {t,m}, --mm {t,m}  Select molecular mechanics program, t for tinker or m
                        for macromodel, default is t

  -d {j,g,n,z,w}, --dft {j,g,n,z,w}
                        Select DFT program, j for Jaguar, g for Gaussian, n
                        for NWChem, z for Gaussian on ziggy, w for NWChem on
                        ziggy, default is z (jaguar is not yet implemented)

  --StepCount STEPCOUNT
                        Specify stereocentres for diastereomer generation

  -s SOLVENT, --solvent SOLVENT
                        Specify solvent to use for dft calculations

  -q QUEUE, --queue QUEUE
                        Specify queue for job submission on ziggy
                        (default is s1)

  -t NTAUT, --ntaut NTAUT
                        Specify number of explicit tautomers per diastereomer
                        given in structure files, must be a multiple of
                        structure files

  -r, --rot5            Manually generate conformers for 5-membered rings

  --ra RA               Specify ring atoms, for the ring to be rotated, useful
                        for molecules with several 5-membered rings

  --AssumeDFTDone       Assume RMSD pruning, DFT setup and DFT calculations
                        have been run already (saves time when repeating DP4
                        analysis)

  -g, --GenOnly         Only generate diastereomers and tinker input files,
                        but don't run any calculations (useful for diastereomer
                        generation for calculations run on computers
                        without OpenBabel)

  -c STEREOCENTRES, --StereoCentres STEREOCENTRES
                        Specify stereocentres for diastereomer generation

  -T, --GenTautomers    Automatically generate tautomers

  -o, --DFTOpt          Optimize geometries at DFT level before NMR prediction

  --pd                  Use python port of DP4

  -b BASICATOMS, --BasicAtoms BASICATOMS
                        Generate protonated states on the specified atoms and
                        consider as tautomers

More information on those can be obtained by running PyDP4.py -h
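
When many calculations are launched from a script, it can help to assemble the command
line programmatically. The sketch below uses only the -w, -s, -m and -d switches
documented above; the function build_pydp4_command is a hypothetical helper, not part
of DP5.

```python
import shlex


def build_pydp4_command(structures, nmr, workflow=None, solvent=None, mm=None, dft=None):
    """Assemble an argument list for PyDP4.py from the switches documented above."""
    cmd = ["PyDP4.py"]
    if workflow:
        cmd += ["-w", workflow]
    if solvent:
        cmd += ["-s", solvent]
    if mm:
        cmd += ["-m", mm]
    if dft:
        cmd += ["-d", dft]
    # Structure files come first, the NMR description or folder comes last.
    return cmd + list(structures) + [nmr]


cmd = build_pydp4_command(["Candidate"], "CandidateNMR",
                          workflow="gnomes", solvent="chloroform", mm="t", dft="g")
print(" ".join(shlex.quote(arg) for arg in cmd))
# prints: PyDP4.py -w gnomes -s chloroform -m t -d g Candidate CandidateNMR
```

The resulting list could be passed to subprocess.run, assuming PyDP4.py is executable
and on the PATH in the active environment.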

======================

NMR DESCRIPTION FORMAT

NMRFILE example begins:
59.58(C3),127.88(C11),127.52(C10),115.71(C9),157.42(C8),133.98(C23),118.22(C22),115.79(C21),158.00(C20),167.33(C1),59.40(C2),24.50(C31),36.36(C34),71.05(C37),142.14(C42),127.50(C41),114.64(C40),161.02(C39)

4.81(H5),7.18(H15),6.76(H14),7.22(H28),7.13(H27),3.09(H4),1.73(H32 or H33),1.83(H32 or H33),1.73(H36 or H35),1.73(H36 or H35),4.50(H38),7.32(H47),7.11(H46)

H15,H16
H14,H17
H28,H29
H27,H30
H47,H48
H46,H49
C10,C12
C9,C13
C22,C24
C21,C25
C41,C43
C40,C44

OMIT H19,H51

:example ends

Sections are separated by empty lines.
1) The first section contains the assigned C shifts; the label can also be (any).
2) The second section contains the (un)assigned H shifts.
3) The third section defines chemically equivalent atoms. Each line is a new set;
all atoms in a line are treated as equivalent and their computed shifts are averaged.
4) The final section, starting with the keyword OMIT, defines atoms to be ignored.
Atoms defined in this section do not need a corresponding shift in the NMR
description.
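
The format above can be read with a few lines of Python. The following is a minimal
sketch assuming exactly the layout shown in the example (C shifts, H shifts, equivalent
sets, then OMIT); it is not the parser used by DP5, and the function names are
hypothetical.

```python
import re

# Matches entries such as 59.58(C3) or 1.73(H32 or H33).
SHIFT_RE = re.compile(r"([-+]?\d+(?:\.\d+)?)\(([^)]*)\)")


def parse_shifts(block):
    """Return (shift, label) pairs from text such as '59.58(C3),127.88(C11)'."""
    return [(float(value), label.strip()) for value, label in SHIFT_RE.findall(block)]


def parse_nmr_description(text):
    """Split an NMR description into C shifts, H shifts, equivalents and omits."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    result = {"C": [], "H": [], "equivalents": [], "omit": []}
    if len(blocks) > 0:
        result["C"] = parse_shifts(blocks[0])
    if len(blocks) > 1:
        result["H"] = parse_shifts(blocks[1])
    for block in blocks[2:]:
        if block.upper().startswith("OMIT"):
            # Drop the keyword, then split the remaining atom labels.
            result["omit"] = [a.strip() for a in block[4:].split(",") if a.strip()]
        else:
            # Each line is one set of chemically equivalent atoms.
            for line in block.splitlines():
                atoms = [a.strip() for a in line.split(",") if a.strip()]
                if atoms:
                    result["equivalents"].append(atoms)
    return result


example = """59.58(C3),127.88(C11)

4.81(H5),1.73(H32 or H33)

H15,H16
C10,C12

OMIT H19,H51
"""
parsed = parse_nmr_description(example)
print(parsed["omit"])
```

Labels such as "H32 or H33" are kept as plain strings here; how ambiguous assignments
are resolved is left to the DP5 code.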


=====================

CODE ORGANIZATION

PyDP4.py
file that should be called to start the DP5 workflow. Interprets the
arguments and takes care of the general workflow logic.

PyDP4_GUI.py
file that should be called to start the DP5 graphical user interface

InchiGen.py
Gets called if diastereomer and/or tautomer and/or protomer generation is
used. Called by PyDP4.py.

FiveConf.py
Gets called if automatic 5-membered cycle corner-flipping is used. Called by
PyDP4.py.

MacroModel.py
Contains all of the MacroModel specific code for input generation, calculation
execution and output interpretation. Called by PyDP4.py.

Tinker.py
Contains all of the Tinker specific code for input generation, calculation
execution and output interpretation. Called by PyDP4.py.

ConfPrune.pyx
Cython file for conformer alignment and RMSD pruning. Called by Gaussian.py
and NWChem.py

Gaussian.py
Contains all of the Gaussian specific code for input generation and calculation
execution. Called by PyDP4.py.

GaussianZiggy.py
Code specific to running Gaussian on Cambridge CMI local cluster. Called by PyDP4.py.

GaussianDarwin.py
Code specific to running Gaussian on Cambridge CSD3 HPC cluster. Called by PyDP4.py.

NWChem.py
Contains all of the NWChem specific code for input generation and calculation
execution. Called by PyDP4.py.

NMRDP4GTF.py
Takes care of all the NMR description interpretation, equivalent atom
averaging, Boltzmann averaging, tautomer population optimisation (if used)
and DP4 input preparation and running either DP4.jar or DP4.py. Called by
PyDP4.py

nmrPredictNWChem.py
Extracts NMR shifts from NWChem output files

DP4.py
Equivalent and compact Python port of the original DP4 Java implementation.
Has been updated to include multigaussian models.

DP5.py
Python implementation of the DP5 probability calculation

Proton_processing.py/Carbon_processing.py
NMR-AI Python script for processing of raw proton/carbon NMR data

Proton_assignment.py/Carbon_assignment.py
NMR-AI python script for assignment of raw proton/carbon NMR data

Proton_plotting.py/Carbon_plotting.py
NMR-AI python script for plotting of raw proton/carbon NMR data processed and assigned by NMR-AI

StructureInput.py
Script for cleaning structures using RDKit, generating 3D geometries and reading InChI, SMILES and SMARTS input formats
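
The NMRDP4GTF.py entry above mentions Boltzmann averaging and equivalent atom averaging.
A minimal sketch of both steps follows, assuming conformer energies in kJ/mol and a
temperature of 298.15 K. The function names and data layout are hypothetical and are
not taken from the DP5 code.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def boltzmann_weights(energies_kj_mol, temperature=298.15):
    """Return normalised Boltzmann populations for a list of conformer energies."""
    e_min = min(energies_kj_mol)  # shift by the minimum for numerical stability
    rt = R_KJ_PER_MOL_K * temperature
    factors = [math.exp(-(e - e_min) / rt) for e in energies_kj_mol]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(shifts_per_conformer, energies_kj_mol, temperature=298.15):
    """Population-weighted average of per-atom shifts over conformers."""
    weights = boltzmann_weights(energies_kj_mol, temperature)
    n_atoms = len(shifts_per_conformer[0])
    return [
        sum(w * conf[i] for w, conf in zip(weights, shifts_per_conformer))
        for i in range(n_atoms)
    ]


def average_equivalents(shifts, equivalent_sets):
    """Replace the shift of each atom in an equivalent set by the set mean.

    `shifts` maps atom labels to shifts; `equivalent_sets` is a list of label lists.
    """
    averaged = dict(shifts)
    for atoms in equivalent_sets:
        mean = sum(shifts[a] for a in atoms) / len(atoms)
        for a in atoms:
            averaged[a] = mean
    return averaged
```

With two conformers of equal energy, boltzmann_weights gives each a population of one
half, so the averaged shift of every atom is the plain mean over the two conformers.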