**Import required libraries and scripts**

In [1]:
#Import required libraries and scripts
from scripts.library_preparation import *
from scripts.utilities import *
from scripts.docking_functions import *
from scripts.clustering_functions import *
from scripts.rescoring_functions import *
from scripts.ranking_functions import *
from scripts.performance_calculation import *
from scripts.dogsitescorer import *
from scripts.get_pocket import *

[19:55:34] Initializing Normalizer


**Set up**
- **software**: The path to the software folder.
- **protein_file**: The path to the protein file (.pdb).
- **ref_file**: The path to the reference ligand file (.sdf).
- **pocket**: The method to use for pocket determination. Must be one of 'reference' or 'dogsitescorer'.
- **docking_library**: The path to the docking library file (.sdf).
- **id_column**: The unique identifier column used in the docking library.
- **protonation**: The method to use for compound protonation. Must be one of 'pkasolver', 'GypsumDL', or 'None'.
- **docking_programs**: The method(s) to use for docking. Must be one or more of 'GNINA', 'SMINA', or 'PLANTS'.
- **clustering_metrics**: The method(s) to use for pose clustering. Must be one or more of 'RMSD', 'spyRMSD', 'espsim', 'USRCAT', '3DScore', 'bestpose', 'bestpose_GNINA', 'bestpose_SMINA', or 'bestpose_PLANTS'.
- **n_poses**: The number of poses to generate for each docking software. Default=10
- **exhaustiveness**: The precision used if docking with SMINA/GNINA. Default=8
- **parallel**: Whether to run the workflow in parallel. Must be 1 (on) or 0 (off). Default=1 (on).
- **ncpus**: The number of CPUs to use for the workflow. By default, half of the available CPUs are used.
- **clustering_method**: Which algorithm to use for clustering. Must be one of 'KMedoids' or 'Aff_prop'.
- **rescoring**: Which scoring functions to use for rescoring. Must be one or more of 'gnina', 'AD4', 'chemplp', 'rfscorevs', 'LinF9', 'vinardo', 'plp', 'AAScore'.

The software will then create a temporary directory to store the output of the various functions.

In [2]:
software = '/home/tony/DockM8/software'
protein_file = '/home/tony/DockM8/wocondock_main/2o1x_A_apo_protoss.pdb'
ref_file = '/home/tony/DockM8/wocondock_main/2o1x_A_lig_protoss.sdf'
pocket = 'reference'
protonation = 'pkasolver'
docking_library = './wocondock_main/Selection_of_FCHGroup_LeadLike.sdf'
docking_programs = ['GNINA', 'SMINA', 'PLANTS']
clustering_metrics = ['RMSD']
clustering_method = 'KMedoids'
rescoring= ['SCORCH']
id_column = 'ID'
n_poses = 10
exhaustiveness = 8
parallel = 1
ncpus = int(os.cpu_count()/2)
#Create a temporary folder for all further calculations
w_dir = os.path.dirname(protein_file)
print('The working directory has been set to:', w_dir)
create_temp_folder(w_dir+'/temp')

The working directory has been set to: ./wocondock_main
The folder: ./wocondock_main/temp already exists
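
Because several settings only accept a fixed set of values, it can help to check them before starting a long run. The sketch below is not part of the workflow scripts: `ALLOWED` and `check_settings` are hypothetical names, and the allowed values are simply copied from the list in the Set up section. Only a subset of the options is covered.

```python
# Hypothetical helper, not part of the DockM8 scripts.
# Allowed values are copied from the option list in the Set up section.
ALLOWED = {
    'pocket': {'reference', 'dogsitescorer'},
    'protonation': {'pkasolver', 'GypsumDL', 'None'},
    'docking_programs': {'GNINA', 'SMINA', 'PLANTS'},
    'clustering_metrics': {'RMSD', 'spyRMSD', 'espsim', 'USRCAT', '3DScore',
                           'bestpose', 'bestpose_GNINA', 'bestpose_SMINA',
                           'bestpose_PLANTS'},
    'clustering_method': {'KMedoids', 'Aff_prop'},
}


def check_settings(settings):
    """Return a list of problems found in `settings` (empty if none)."""
    problems = []
    for name, allowed in ALLOWED.items():
        value = settings.get(name)
        # Treat single values and lists of values the same way
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if v not in allowed:
                problems.append(f"{name}: {v!r} is not one of {sorted(allowed)}")
    return problems


settings = {
    'pocket': 'reference',
    'protonation': 'pkasolver',
    'docking_programs': ['GNINA', 'SMINA', 'PLANTS'],
    'clustering_metrics': ['RMSD'],
    'clustering_method': 'KMedoids',
}
for problem in check_settings(settings):
    print(problem)
```

With the values used in this notebook, the loop prints nothing. A misspelled option such as `pocket = 'dogsitescore'` would be reported before any docking starts.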


**Pocket Extraction**  

This cell will extract the pocket based on the method specified in the 'pocket' variable. Using 'reference' will use the reference ligand to define the pocket. Using 'dogsitescorer' will query the DoGSiteScorer server and use the pocket with the largest volume.

In [None]:
if os.path.isfile(protein_file.replace('.pdb', '_pocket.pdb')) == False:
    if pocket == 'reference':
        pocket_definition = GetPocket(ref_file, protein_file, 8)
    elif pocket == 'dogsitescorer':
        pocket_definition = binding_site_coordinates_dogsitescorer(protein_file, w_dir, method='volume')
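
To illustrate what a reference-ligand pocket definition can look like, here is a minimal sketch. It assumes that the third argument of `GetPocket` (8) is a distance cutoff in Å around the ligand, which the notebook does not state. The functions `atoms_near_ligand` and `ligand_center` are hypothetical and work on plain coordinate tuples rather than on the .pdb and .sdf files.

```python
import math


def atoms_near_ligand(protein_atoms, ligand_atoms, cutoff=8.0):
    """Return indices of protein atoms within `cutoff` of any ligand atom.

    Both inputs are lists of (x, y, z) tuples. The cutoff value is an
    assumption about what the 8 passed to GetPocket means.
    """
    return [
        i for i, p in enumerate(protein_atoms)
        if any(math.dist(p, l) <= cutoff for l in ligand_atoms)
    ]


def ligand_center(ligand_atoms):
    """Return the geometric center of the ligand atoms."""
    n = len(ligand_atoms)
    return tuple(sum(atom[k] for atom in ligand_atoms) / n for k in range(3))


# Toy coordinates, for illustration only
protein = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
ligand = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

print(atoms_near_ligand(protein, ligand))  # indices of pocket atoms
print(ligand_center(ligand))               # center of the pocket
```

The center and the selected atoms are the kind of information a docking program needs to place its search box. The actual `GetPocket` function may compute the pocket differently.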

**Library preparation**

This function will first standardize the library using the ChEMBL Structure Pipeline. This will remove salts and make the library consistent.

Protonation states can be calculated by one of three methods depending on the value of the 'protonation' variable:
- pkasolver : will use the pkasolver library to predict a single protonation state
- GypsumDL : will use the GypsumDL program to predict a single protonation state
- None : will skip protonation and use the protonation state supplied in the docking library

Finally, one 3D conformer is generated per molecule using GypsumDL.

The final library is then written to final_library.sdf. The cell below looks for this file in the temporary folder and skips the preparation if it already exists.

In [None]:
if os.path.isfile(w_dir+'/temp/final_library.sdf') == False:
    prepare_library(docking_library, id_column, software, protonation, ncpus)

**Docking**

This cell will dock all compounds in the receptor, using the reference ligand as a way to define the binding site. (Note: DogSiteScorer not yet implemented here).

The docking algorithms specified in the 'docking_programs' variable will be used.

Depending on the value of the 'parallel' variable, the docking is run either on several CPU cores in parallel or on a single core.

The docking results are written to the temporary folder. 

In [None]:
# Output folder of each supported docking program.
# A separate name is used so that the 'docking_programs' selection from the set-up cell is not overwritten.
docking_paths = {'GNINA': w_dir+'/temp/gnina/', 'SMINA': w_dir+'/temp/smina/', 'PLANTS': w_dir+'/temp/plants/'}
# Use the split (parallel) docking function when parallel == 1
docking_function = docking_splitted if parallel == 1 else docking
for program, file_path in docking_paths.items():
    # Only run the selected programs, and skip those whose output folder already exists
    if program in docking_programs and not os.path.isdir(file_path):
        docking_function(w_dir, protein_file, ref_file, software, [program], exhaustiveness, n_poses, ncpus)
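
The parallel mode relies on splitting the library so that several parts can be docked at once. The sketch below shows that general idea only. `split_into_chunks`, `dock_chunk` and `dock_in_parallel` are hypothetical names, `dock_chunk` is a placeholder that does no real docking, and the way `docking_splitted` actually divides and schedules the work is not shown in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous chunks of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def dock_chunk(chunk):
    """Placeholder for docking one chunk; a real version would call a docking program."""
    return [f'{compound}_docked' for compound in chunk]


def dock_in_parallel(compounds, ncpus):
    """Process the chunks concurrently and return the results in input order."""
    chunks = split_into_chunks(compounds, ncpus)
    with ThreadPoolExecutor(max_workers=ncpus) as pool:
        results = list(pool.map(dock_chunk, chunks))
    return [pose for chunk_result in results for pose in chunk_result]


print(split_into_chunks(list(range(7)), 3))
print(dock_in_parallel(['a', 'b', 'c'], 2))
```

Threads are used here only to keep the example short. Since each chunk is written to its own output, the results have to be gathered again afterwards, which is why the next step differs between the parallel and non-parallel modes.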


**Combining docking poses**

This cell combines all the poses from the docking programs into a single .sdf file. This is done slightly differently when 'parallel' is set to 1, because the library is split in that case.

In [None]:
if parallel == 1:
    if os.path.isfile(w_dir+'/temp/allposes.sdf') == False:
        fetch_poses_splitted(w_dir, n_poses, w_dir+'/temp/split_final_library')
else:
    if os.path.isfile(w_dir+'/temp/allposes.sdf') == False:
        fetch_poses(w_dir, n_poses, w_dir+'/temp/split_final_library')

All poses are then loaded into memory for clustering.

In [None]:
print('Loading all poses SDF file...')
tic = time.perf_counter()
all_poses = PandasTools.LoadSDF(w_dir+'/temp/allposes.sdf', idName='Pose ID', molColName='Molecule', includeFingerprints=False, strictParsing=True)
toc = time.perf_counter()
print(f'Finished loading all poses SDF in {toc-tic:0.4f}!...')


**Clustering**

This cell will perform the clustering according to the values of the 'clustering_metrics', 'clustering_method' and 'parallel' variables. If it detects that the clustering file for a metric has already been generated, it will skip that metric.

In [None]:
if parallel == 1:
    for metric in clustering_metrics:
        if os.path.isfile(w_dir+f'/temp/clustering/{metric}_clustered.sdf') == False:
            cluster_futures(metric, clustering_method, w_dir, protein_file, all_poses, ncpus)
else:
    for metric in clustering_metrics:
        if os.path.isfile(w_dir+f'/temp/clustering/{metric}_clustered.sdf') == False:
            cluster(metric, clustering_method, w_dir, protein_file, all_poses, ncpus)
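
As an illustration of the 'RMSD' metric combined with a medoid-based method, the sketch below computes the RMSD between poses and picks the medoid, i.e. the pose with the smallest total distance to all others. This is the single-cluster case of k-medoids and a simplification: `rmsd` and `medoid_index` are hypothetical helpers, poses are plain coordinate lists with identical atom order, no alignment or symmetry correction is applied, and the `cluster` functions used above may work differently.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples."""
    if len(pose_a) != len(pose_b):
        raise ValueError('poses must have the same number of atoms')
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(pose_a, pose_b))
    return math.sqrt(squared / len(pose_a))


def medoid_index(poses):
    """Return the index of the pose with the smallest summed RMSD to all poses."""
    totals = [sum(rmsd(p, q) for q in poses) for p in poses]
    return totals.index(min(totals))


# Three toy poses of a two-atom molecule, shifted along z
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],
]
print(rmsd(poses[0], poses[1]))
print(medoid_index(poses))
```

The medoid is always one of the input poses, so the representative of a cluster is a pose that a docking program actually produced rather than an averaged structure.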

**Rescoring**

This cell will rescore all the clustered .sdf files according to the specified scoring functions and the value of the 'parallel' variable.

In [5]:
for metric in clustering_metrics:
    rescore_all(w_dir, protein_file, ref_file, software, w_dir+f'/temp/clustering/{metric}_clustered.sdf', rescoring, parallel, ncpus)


The folder: ./wocondock_main/temp/rescoring_RMSD_clustered already exists
The folder: ./wocondock_main/temp/rescoring_RMSD_clustered/SCORCH_rescoring/ was created


  Failed to kekulize aromatic bonds in OBMol::PerceiveBondOrders (title is ./wocondock_main/2o1x_A_apo_protoss.pdb)

1 molecule converted
28 molecules converted



[2023-Mar-20 19:59:40]: python ./software/SCORCH-main/scorch.py --receptor ./wocondock_main/temp/rescoring_RMSD_clustered/SCORCH_rescoring/protein.pdbqt --ligand ./wocondock_main/temp/rescoring_RMSD_clustered/SCORCH_rescoring/ligands.pdbqt --out ./wocondock_main/temp/rescoring_RMSD_clustered/SCORCH_rescoring/scoring_results.csv --threads 8 --verbose --return_pose_scores


  from pandas import MultiIndex, Int64Index


**************************************************************************
SCORCH v1.0
Miles McGibbon, Samuel Money-Kyrle, Vincent Blay & Douglas R. Houston

**************************************************************************

Found 1 ligand(s) for scoring against a single receptor...

**************************************************************************

Counting input poses...
100%|██████████| 1/1 [00:00<00:00, 738.04it/s]

**************************************************************************

Model Request Summary:

XGBoost Model: Yes
Traceback (most recent call last):
  File "./software/SCORCH-main/scorch.py", line 1265, in <module>
    scoring_function_results = scoring(params)
  File "./software/SCORCH-main/scorch.py", line 1219, in scoring
    model_dict = prepare_models(params)
  File "./software/SCORCH-main/scorch.py", line 1054, in prepare_models
    models['xgboost_model'] = pickle.load(open(xgb_path,'rb'))
FileN


[2023-Mar-20 19:59:46]: Rescoring with SCORCH complete in 94.2967!

[2023-Mar-20 19:59:46]: Combining all score for ./wocondock_main/temp/rescoring_RMSD_clustered


IndexError: list index out of range

**Final ranking methods**

This code calculates the final ranking of compounds using various methods.
- *Method 1* : Calculates ECR value for each cluster center, then outputs the top ranked center.
- *Method 2* : Calculates ECR value for each cluster center, then outputs the average ECR value for each compound.
- *Method 3* : Calculates the average rank of each compound, then outputs the corresponding ECR value for each compound.
- *Method 4* : Calculates the Rank by Rank consensus.
- *Method 5* : Calculates the Rank by Vote consensus.
- *Method 6* : Calculates Z-score for each cluster center, then outputs the top ranked center.
- *Method 7* : Calculates Z-score for each cluster center, then outputs the average Z-score for each compound.

All methods are then combined into a single dataframe for comparison purposes.

In [None]:
apply_consensus_methods(w_dir, clustering_metrics)
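
The consensus ideas listed above can be sketched in a few lines. The example assumes the commonly used exponential consensus ranking (ECR) definition, in which each scoring function contributes exp(-rank/sigma)/sigma to a compound's score, a rank-by-rank consensus taken as the mean rank, and a Z-score computed per scoring function. The helper names and the toy scores are hypothetical, and `apply_consensus_methods` may differ in detail (for example in the choice of sigma or in how poses are grouped per compound).

```python
import math
from statistics import mean, pstdev


def ranks(scores, higher_is_better=True):
    """Map each compound to its rank (1 = best) for one scoring function."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {compound: i + 1 for i, compound in enumerate(ordered)}


def ecr(rank_tables, sigma):
    """Exponential consensus rank: sum of exp(-rank/sigma)/sigma over scoring functions."""
    return {
        c: sum(math.exp(-table[c] / sigma) / sigma for table in rank_tables)
        for c in rank_tables[0]
    }


def rank_by_rank(rank_tables):
    """Mean rank of each compound across scoring functions (lower is better)."""
    return {c: mean(table[c] for table in rank_tables) for c in rank_tables[0]}


def z_scores(scores):
    """Standardize the scores of one scoring function (requires non-identical scores)."""
    mu, sd = mean(scores.values()), pstdev(scores.values())
    return {c: (s - mu) / sd for c, s in scores.items()}


# Toy scores from two hypothetical scoring functions (higher is better)
scores_a = {'c1': 9.0, 'c2': 7.0, 'c3': 5.0}
scores_b = {'c1': 0.8, 'c2': 0.9, 'c3': 0.1}
tables = [ranks(scores_a), ranks(scores_b)]

print(rank_by_rank(tables))
print(ecr(tables, sigma=1.0))
print(z_scores(scores_a))
```

Rank-based methods only need the ordering produced by each scoring function, so they are insensitive to the different scales of the scores, whereas the Z-score keeps information about how far apart the compounds are.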