**Import required libraries and scripts**

In [1]:
# Import required libraries and scripts
import os
import time
from pathlib import Path

from rdkit.Chem import PandasTools

# Project modules (star imports kept as in the original workflow)
from scripts.library_preparation import *
from scripts.utilities import *
from scripts.docking_functions import *
from scripts.clustering_functions import *
from scripts.rescoring_functions import *
from scripts.consensus_methods import *
from scripts.performance_calculation import *
from scripts.dogsitescorer import *
from scripts.get_pocket import *

[14:18:12] Initializing Normalizer


In [2]:
protein_name = 'protein.pdb'
ligand_library = 'library.sdf'
reference_ligand = 'ref.sdf'

In [3]:
HERE = Path(_dh[-1])
DATA = (HERE / "wocondock_main")


# The input data (protein pdb, docking library and reference ligand) must be placed in the DATA directory
software = str(HERE / "software")
protein_file = str(DATA / protein_name)
docking_library = str(DATA / ligand_library)
ref_file = str(DATA / reference_ligand)

**Set up**
- **software**: The path to the software folder.
- **proteinfile**: The path to the protein file (.pdb).
- **pocket**: The method to use for pocket determination. Must be one of 'reference' or 'dogsitescorer'.
- **dockinglibrary**: The path to the docking library file (.sdf).
- **idcolumn**: The unique identifier column used in the docking library.
- **protonation**: The method to use for compound protonation. Must be one of 'pkasolver', 'GypsumDL', or 'None'.
- **docking**: The method(s) to use for docking. Must be one or more of 'GNINA', 'SMINA', or 'PLANTS'.
- **metric**: The method(s) to use for pose clustering. Must be one or more of 'RMSD', 'spyRMSD', 'espsim', 'USRCAT', '3DScore', 'bestpose', 'bestpose_GNINA', 'bestpose_SMINA', or 'bestpose_PLANTS'.
- **nposes**: The number of poses to generate for each docking software. Default=10
- **exhaustiveness**: The precision used if docking with SMINA/GNINA. Default=8
- **parallel**: Whether or not to run the workflow in parallel. Default=1 (on). Can be set to 1 (on) or 0 (off).
- **ncpus**: The number of cpus to use for the workflow. Default behavior is to use half of the available cpus.
- **clustering**: Which algorithm to use for clustering. Must be one of 'KMedoids', 'Aff_prop'.
- **rescoring**: Which scoring functions to use for rescoring. Must be one or more of 'gnina', 'AD4', 'chemplp', 'rfscorevs', 'LinF9', 'vinardo', 'plp', 'AAScore'.

The software will then create a temporary directory to store the output of the various functions.
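
Because several options only accept fixed values, it can help to check them before starting a long run. The following is a minimal sketch, not part of the workflow's scripts: `ALLOWED` and `check_options` are hypothetical names, and the allowed values are copied from the list above.

```python
# Allowed values, copied from the option list above.
ALLOWED = {
    "pocket": {"reference", "dogsitescorer"},
    "protonation": {"pkasolver", "GypsumDL", "None"},
    "docking": {"GNINA", "SMINA", "PLANTS"},
    "clustering": {"KMedoids", "Aff_prop"},
}


def check_options(**options):
    """Hypothetical helper: raise ValueError if an option is outside ALLOWED."""
    for name, value in options.items():
        # Accept a single value or a collection of values.
        values = value if isinstance(value, (list, tuple, set)) else [value]
        unknown = [v for v in values if v not in ALLOWED[name]]
        if unknown:
            raise ValueError(f"Unsupported value(s) for {name!r}: {unknown}")


check_options(
    pocket="reference",
    protonation="pkasolver",
    docking=["GNINA", "SMINA", "PLANTS"],
    clustering="KMedoids",
)
```

A misspelled option then fails immediately with a readable message instead of partway through docking.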

In [4]:
pocket = 'reference'
protonation = 'pkasolver'
docking_programs = ['GNINA', 'SMINA', 'PLANTS']
clustering_metrics = ['RMSD', 'spyRMSD', 'espsim', 'bestpose', 'bestpose_PLANTS', 'bestpose_GNINA', 'bestpose_SMINA']
clustering_method = 'KMedoids'
rescoring = ['gnina', 'AD4', 'chemplp', 'rfscorevs', 'LinF9', 'RTMScore', 'vinardo', 'SCORCH']
id_column = 'ID'
n_poses = 10
exhaustiveness = 8
ncpus = 14
#Create a temporary folder for all further calculations
w_dir = Path(protein_file).parent
print('The working directory has been set to:', w_dir)
(DATA/'temp').mkdir(exist_ok=True)

The working directory has been set to: /home/ibrahim/Gitlab/DockM8/wocondock_main


**Pocket Extraction**  

This cell extracts the pocket using the method specified in the 'pocket' variable. 'reference' uses the reference ligand to define the pocket. 'dogsitescorer' queries the DoGSiteScorer server and uses the pocket with the largest volume. The code also accepts 'RoG', which calls `get_pocket_RoG` with the reference ligand.

In [5]:
if pocket == 'reference':
    # Define the pocket around the reference ligand
    pocket_definition = get_pocket(ref_file, protein_file, 10)
elif pocket == 'RoG':
    pocket_definition = get_pocket_RoG(ref_file, protein_file)
elif pocket == 'dogsitescorer':
    # Query DoGSiteScorer and keep the pocket with the largest volume
    pocket_definition = binding_site_coordinates_dogsitescorer(protein_file, w_dir, method='volume')
else:
    raise ValueError(f"Unknown pocket method: {pocket!r}")
print(pocket_definition)


[2023-Jun-28 14:18:20]: Extracting pocket from /home/ibrahim/Gitlab/DockM8/wocondock_main/protein.pdb using /home/ibrahim/Gitlab/DockM8/wocondock_main/ref.sdf as reference ligand

[2023-Jun-28 14:18:29]: Finished extracting pocket from /home/ibrahim/Gitlab/DockM8/wocondock_main/protein.pdb using /home/ibrahim/Gitlab/DockM8/wocondock_main/ref.sdf as reference ligand
{'center': [16.7, -2.69, 17.61], 'size': [20.0, 20.0, 20.0]}
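
The printed pocket is a box given by a center and a size along each axis. The sketch below shows one way such a box could be derived from ligand coordinates. It is an assumption, not the actual `get_pocket` implementation: `box_from_ligand` is a hypothetical helper that centers the box on the ligand centroid and uses twice the given radius as the side length, which would be consistent with the value 10 passed above and the size of 20.0 in the output.

```python
def box_from_ligand(coords, radius):
    """Hypothetical helper: box centered on the ligand centroid.

    coords: list of (x, y, z) atom coordinates.
    radius: half of the box side length (assumed meaning).
    """
    n_atoms = len(coords)
    center = [round(sum(atom[i] for atom in coords) / n_atoms, 2) for i in range(3)]
    size = [2.0 * radius] * 3
    return {"center": center, "size": size}


# Made-up coordinates, for illustration only.
ligand_xyz = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]
print(box_from_ligand(ligand_xyz, 10))
```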


**Library preparation**  

In [18]:
# Skip library preparation if the prepared library already exists
if not (w_dir / 'temp' / 'final_library.sdf').is_file():
    prepare_library(docking_library, id_column, protonation, ncpus)


[2023-Jun-28 13:44:19]: Standardizing docking library using ChemBL Structure Pipeline...


Standardizing molecules: 100%|██████████| 10/10 [00:00<00:00, 174.41mol/s]



[2023-Jun-28 13:44:19]: Standardization of compound library finished: Started with 10, ended with 10 : 0 compounds lost

[2023-Jun-28 13:44:19]: Calculating protonation states using pkaSolver...


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


CN(C)c1nc(Cc2nnn[nH]2)cs1
<rdkit.Chem.rdchem.Mol object at 0x7f830781f060>

For help, use: python dimorphite_dl.py --help

If you use Dimorphite-DL in your research, please cite:
Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An
open-source program for enumerating the ionization states of drug-like small
molecules. J Cheminform 11:14. doi:10.1186/s13321-019-0336-9.

Proposed mol at pH 7.4: CN(C)c1nc(Cc2nnn[nH]2)cs1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


Cc1nc(SCc2cc(=O)n3ccsc3n2)c2ccccc2n1
<rdkit.Chem.rdchem.Mol object at 0x7f81f7097060>

Proposed mol at pH 7.4: Cc1nc(SCc2cc(=O)n3ccsc3n2)c2ccccc2n1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


Cn1cccc1C(=O)OCc1ccccc1C#N
<rdkit.Chem.rdchem.Mol object at 0x7fbbe9bc7060>

Proposed mol at pH 7.4: Cn1cccc1C(=O)OCc1ccccc1C#N
#########################
Could not identify any ionizable group. Aborting.
#########################


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


C[C@H](OC(=O)c1cc2c(s1)CCC2)c1nc2ccccc2c(=O)[nH]1
<rdkit.Chem.rdchem.Mol object at 0x7f036f0c3140>

Proposed mol at pH 7.4: C[C@H](OC(=O)c1cc2c(s1)CCC2)c1nc2ccccc2c(=O)[nH]1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


CC(=O)NCc1ccc(C(=O)COC(=O)c2cc3c(s2)CCCCC3)o1
<rdkit.Chem.rdchem.Mol object at 0x7f4207c0f060>

Proposed mol at pH 7.4: CC(=O)NCc1ccc(C(=O)COC(=O)c2cc3c(s2)CCCCC3)o1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


CCN(C(=O)COc1cccc(-n2cnnn2)c1)c1cccc2ccccc12
<rdkit.Chem.rdchem.Mol object at 0x7f3810847060>

Proposed mol at pH 7.4: CCN(C(=O)COc1cccc(-n2cnnn2)c1)c1cccc2ccccc12


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


Cn1c(Cc2ccccc2)nnc1SCc1nc2ccsc2c(=O)[nH]1
<rdkit.Chem.rdchem.Mol object at 0x7f3835f4f060>

Proposed mol at pH 7.4: Cn1c(Cc2ccccc2)nnc1SCc1nc2ccsc2c(=O)[nH]1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


N#Cc1ccc(CSc2nnc(CN3CCCC3)o2)cc1
<rdkit.Chem.rdchem.Mol object at 0x7f134b61b060>

Proposed mol at pH 7.4: N#Cc1ccc(CSc2nnc(C[NH+]3CCCC3)o2)cc1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


Cc1noc(C)c1CSCC(=O)Nc1ccccc1
<rdkit.Chem.rdchem.Mol object at 0x7fcfc803f060>

Proposed mol at pH 7.4: Cc1noc(C)c1CSCC(=O)Nc1ccccc1


[query.py:297 - calculate_microstate_pka_values()] Using dimorphite-dl to identify protonation sites.


CC1CCCC(NC(=O)Cn2nnc(-c3ccccc3Br)n2)C1C
<rdkit.Chem.rdchem.Mol object at 0x7fabfca43060>

Proposed mol at pH 7.4: CC1CCCC(NC(=O)Cn2nnc(-c3ccccc3Br)n2)C1C

[2023-Jun-28 13:44:33]: ERROR in adding missing protonating state

[2023-Jun-28 13:44:33]: 'Rdkit_mol'

[2023-Jun-28 13:44:33]:

**Docking**

This cell docks all compounds into the receptor, using the reference ligand to define the binding site. (Note: DogSiteScorer is not yet implemented here.)

The docking programs specified in the 'docking' variable (`docking_programs` in this notebook) are used.

Docking is distributed across CPU cores depending on the value of the 'parallel' variable; the number of cores is passed as `ncpus`.

The docking results are written to the temporary folder.

In [6]:
docking(w_dir, protein_file, pocket_definition, docking_programs, exhaustiveness, n_poses, ncpus)


[2023-Jun-28 14:19:11]: Split final library folder already exists...


In [7]:
concat_all_poses(w_dir, docking_programs)


[2023-Jun-28 14:19:13]: All poses succesfully combined!


All poses are then loaded into memory for clustering.

In [8]:
print('Loading all poses SDF file...')
tic = time.perf_counter()
all_poses = PandasTools.LoadSDF(str(w_dir / 'temp' / 'allposes.sdf'), idName='Pose ID', molColName='Molecule', includeFingerprints=False, strictParsing=True)
toc = time.perf_counter()
print(f'Finished loading all poses SDF in {toc-tic:0.4f}!...')

Loading all poses SDF file...
Finished loading all poses SDF in 0.0837!...


**Clustering**

This cell performs the clustering according to the values of the 'clustering_metrics', 'clustering_method' and 'parallel' variables. If the clustering file for a metric has already been generated, that metric is skipped.

In [9]:
for metric in clustering_metrics:
    # Skip metrics whose clustered file already exists
    if not (w_dir / 'temp' / 'clustering' / f'{metric}_clustered.sdf').is_file():
        cluster_pebble(metric, clustering_method, w_dir, protein_file, all_poses, ncpus)
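
To make the 'RMSD' metric and the 'KMedoids' method more concrete: K-medoids represents each cluster by one of its own members, the medoid, which is the member closest to all others. The sketch below computes a plain RMSD between poses and picks the medoid of a set of poses. It is a simplified illustration, not what `cluster_pebble` does internally; `rmsd` and `medoid_index` are hypothetical helpers, and the RMSD assumes identical atom ordering with no symmetry correction.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z)."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


def medoid_index(poses):
    """Index of the pose with the smallest summed RMSD to all poses."""
    totals = [sum(rmsd(p, q) for q in poses) for p in poses]
    return totals.index(min(totals))


# Made-up single-atom "poses", for illustration only.
poses = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(5.0, 0.0, 0.0)]]
print(medoid_index(poses))
```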

**Rescoring**

This cell will rescore all the clustered .sdf files according to the specified scoring functions and the value of the 'parallel' variable.

In [10]:
for metric in clustering_metrics:
    clustered_sdf = str(w_dir / 'temp' / 'clustering' / f'{metric}_clustered.sdf')
    rescore_all(w_dir, protein_file, pocket_definition, clustered_sdf, rescoring, ncpus)


[2023-Jun-28 14:19:20]: Skipping gnina rescoring...

[2023-Jun-28 14:19:20]: Skipping AD4 rescoring...

[2023-Jun-28 14:19:20]: Skipping chemplp rescoring...

[2023-Jun-28 14:19:20]: Skipping rfscorevs rescoring...

[2023-Jun-28 14:19:20]: Skipping LinF9 rescoring...
/home/ibrahim/Gitlab/DockM8/wocondock_main/protein_pocket.pdb
Splitting SDF file RMSD_clustered.sdf ...


Splitting files: 100%|██████████| 28/28 [00:00<00:00, 1068.88it/s]

Split docking library into 28 files each containing 1 compounds

[2023-Jun-28 14:19:20]: Skipping vinardo rescoring...

[2023-Jun-28 14:19:20]: Converting protein file to .pdbqt ...





AttributeError: 'str' object has no attribute 'stem'
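
This `AttributeError` is what Python raises when `.stem` is accessed on a plain string: `.stem` belongs to `pathlib.Path` objects. The traceback is not shown, so which argument is affected is unknown; the sketch below only illustrates the difference. Wrapping a path string in `Path(...)` before passing it on is one thing to try, under the assumption that the rescoring code expects a `Path`.

```python
from pathlib import Path

protein_as_str = "wocondock_main/protein.pdb"
protein_as_path = Path(protein_as_str)

# A str has no .stem attribute, so accessing it raises AttributeError.
print(hasattr(protein_as_str, "stem"))

# A Path exposes the file name without its suffix.
print(protein_as_path.stem)

# Deriving an output name with a different extension, e.g. for a .pdbqt file.
pdbqt_name = protein_as_path.with_suffix(".pdbqt").name
print(pdbqt_name)
```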

**Final ranking methods**

This code calculates the final ranking of compounds using various methods.
- *Method 1* : Calculates the ECR value for each cluster center, then outputs the top-ranked center.
- *Method 2* : Calculates the ECR value for each cluster center, then outputs the average ECR value for each compound.
- *Method 3* : Calculates the average rank of each compound, then outputs the corresponding ECR value for each compound.
- *Method 4* : Calculates the Rank by Rank consensus.
- *Method 5* : Calculates the Rank by Vote consensus.
- *Method 6* : Calculates the Z-score for each cluster center, then outputs the top-ranked center.
- *Method 7* : Calculates the Z-score for each cluster center, then outputs the average Z-score for each compound.

All methods are then combined into a single dataframe for comparison purposes.
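
The consensus ideas behind these methods can be sketched on a toy example. This is not the code in `scripts.consensus_methods`: it assumes ECR stands for exponential consensus ranking in its commonly used form (a sum over scoring functions of `exp(-rank / sigma) / sigma`), that lower scores are better, and that each compound has a single score per function. The scores, `sigma` and `top_n` are made-up values.

```python
import math
from statistics import mean, pstdev

# Made-up scores for three compounds from two scoring functions (lower = better).
scores = {
    "f1": {"A": -9.0, "B": -7.0, "C": -5.0},
    "f2": {"A": -6.0, "B": -8.0, "C": -4.0},
}


def ranks(column):
    """Rank compounds within one scoring function, best score gets rank 1."""
    ordered = sorted(column, key=column.get)
    return {compound: i + 1 for i, compound in enumerate(ordered)}


rank_table = {name: ranks(column) for name, column in scores.items()}


def ecr(compound, sigma=1.0):
    """Exponential consensus ranking score (higher = better)."""
    return sum(math.exp(-r[compound] / sigma) / sigma for r in rank_table.values())


def rank_by_rank(compound):
    """Average rank across scoring functions (lower = better)."""
    return mean([r[compound] for r in rank_table.values()])


def rank_by_vote(compound, top_n=1):
    """Number of scoring functions placing the compound in their top_n."""
    return sum(r[compound] <= top_n for r in rank_table.values())


def mean_zscore(compound):
    """Average standardized score across scoring functions (lower = better)."""
    z_values = []
    for column in scores.values():
        values = list(column.values())
        z_values.append((column[compound] - mean(values)) / pstdev(values))
    return mean(z_values)


for compound in ("A", "B", "C"):
    print(compound, ecr(compound), rank_by_rank(compound),
          rank_by_vote(compound), mean_zscore(compound))
```

In this toy data, A and B each rank first once and second once, so they tie under every method, while C is last everywhere.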

In [None]:
calculate_EF_single_functions(w_dir, docking_library, clustering_metrics)
apply_consensus_methods_combinations(w_dir, docking_library, clustering_metrics)

KeyError: "['Activity'] not in index"
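
The `KeyError` suggests that the performance calculation looks for an 'Activity' column that this docking library does not contain. An enrichment factor (EF) needs such labels, since it compares the proportion of actives among the top-ranked compounds with the proportion in the whole library. The sketch below shows the standard EF calculation. `enrichment_factor` is a hypothetical helper, not `calculate_EF_single_functions`, and it assumes activity labels of 1 (active) and 0 (inactive).

```python
def enrichment_factor(ranked_activity, fraction=0.01):
    """EF = (active rate in the top fraction) / (active rate overall).

    ranked_activity: 0/1 labels ordered from best- to worst-ranked compound.
    """
    n_total = len(ranked_activity)
    n_top = max(1, round(n_total * fraction))
    actives_total = sum(ranked_activity)
    if actives_total == 0:
        raise ValueError("No actives in the library; EF is undefined.")
    actives_top = sum(ranked_activity[:n_top])
    return (actives_top / n_top) / (actives_total / n_total)


# Made-up ranking: both actives are ranked first among ten compounds.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(labels, fraction=0.2))
```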