**Import required libraries and scripts**

In [1]:
#Import required libraries and scripts
from scripts.library_preparation import *
from scripts.utilities import *
from scripts.docking_functions import *
from scripts.clustering_functions import *
from scripts.rescoring_functions import *
from scripts.ranking_functions import *
from scripts.performance_calculation import *
import numpy as np
import os

software = '/home/tony/CADD22/software'
protein_file = '/home/tony/CADD22/wocondock_main/2o1x_A_apo_protoss.pdb'
ref_file = '/home/tony/CADD22/wocondock_main/2o1x_A_lig_protoss.sdf'
docking_library = '/home/tony/CADD22/wocondock_main/500_of_FCHGroup_LeadLike.sdf'
docking_programs = ['SMINA','GNINA','PLANTS']
id_column = 'ID'
n_poses = 10
exhaustiveness = 4

#Initialise variables and create a temporary folder
w_dir = os.path.dirname(protein_file)
print('The working directory has been set to:', w_dir)
create_temp_folder(w_dir+'/temp')

[12:20:48] Initializing Normalizer


The working directory has been set to: /home/tony/CADD22/wocondock_main
The folder: /home/tony/CADD22/wocondock_main/temp already exists


In [None]:
cleaned_pkasolver_df = prepare_library(docking_library, id_column, software, 'pkasolver')

**Docking**

This function docks all compounds into the receptor, using the reference ligand to define the binding site. The docking results are written to the temporary folder.

In [2]:
docking_splitted(w_dir, protein_file, ref_file, software, docking_programs, exhaustiveness, n_poses)

Splitting docking library...
The folder: /home/tony/CADD22/wocondock_main/temp/split_files was created
Split docking library into 5 files each containing 2 compounds
The folder: /home/tony/CADD22/wocondock_main/temp/plants was created
Converting protein file to .mol2 format for PLANTS docking...
Converting reference file from .sdf to .mol2 format for PLANTS docking...
Determining binding site coordinates using PLANTS...
Docking split files using PLANTS...
Docking with PLANTS complete in 102.7282!
Docking split files using SMINA...
Docking with SMINA complete in 124.4346!
Docking split files using GNINA...
Docking with GNINA complete in 117.7798!
Combined all docking poses in 1.0927!


In [None]:
all_poses = docking(protein_file, ref_file, software, docking_programs, exhaustiveness, n_poses)
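
To illustrate how a reference ligand can define a binding site, the sketch below builds an axis-aligned search box around a set of ligand atom coordinates. It is only an illustration: `box_from_ligand`, the 4 Å padding and the toy coordinates are assumptions, not the method used inside `docking()` (the log above indicates that PLANTS determines the binding site coordinates itself).

```python
import numpy as np

def box_from_ligand(coords, padding=4.0):
    """Return (center, size) of an axis-aligned box enclosing the ligand atoms.

    coords  : (n_atoms, 3) array of x, y, z coordinates in angstroms
    padding : margin added on every side of the ligand (assumed value)
    """
    coords = np.asarray(coords, dtype=float)
    lower = coords.min(axis=0) - padding
    upper = coords.max(axis=0) + padding
    center = (lower + upper) / 2.0
    size = upper - lower
    return center, size

# Toy coordinates standing in for the atoms read from the reference .sdf file
ligand_coords = np.array([[10.0, 2.0, -3.0],
                          [12.0, 4.0, -1.0],
                          [14.0, 3.0, -2.0]])

center, size = box_from_ligand(ligand_coords, padding=4.0)
print('Box center:', center)
print('Box size:', size)
```

A larger padding lets the docking program sample poses further from the reference ligand, at the cost of a larger search space.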

**Clustering**

We first load all the poses generated by the docking run. The cluster() function calculates the clustering metric (for now simpleRMSD and electroshape similarity), then clusters the poses with the k-medoids algorithm, choosing the number of clusters that gives the best silhouette score. Finally, the cluster centers are collected and written to a file in the temporary directory (/temp/clustering/), one file per clustering metric.

In [None]:
cluster_dask('RMSD', w_dir, protein_file)

In [None]:
cluster('bestpose', w_dir, protein_file)

In [None]:
cluster('espsim', w_dir, protein_file)
cluster('spyRMSD', w_dir, protein_file)
cluster('USRCAT', w_dir, protein_file)
cluster('RMSD', w_dir, protein_file)
cluster('3DScore', w_dir, protein_file)
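
The clustering idea described above (k-medoids on a pairwise distance matrix, with the number of clusters picked by silhouette score) can be sketched with NumPy alone. This is a simplified illustration, not the code inside `cluster()`: the function names, the deterministic farthest-first initialisation, and the 2D toy points standing in for poses are all assumptions.

```python
import numpy as np

def init_medoids(dist, k):
    """Farthest-first initialisation (an assumed, deterministic choice)."""
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    return np.array(medoids)

def k_medoids(dist, k, n_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    medoids = init_medoids(dist, k)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            # The medoid minimises the summed distance to its cluster members
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

def silhouette(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    clusters = np.unique(labels)
    scores = np.zeros(dist.shape[0])
    for i in range(dist.shape[0]):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, same].sum() / (n_same - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def best_clustering(dist, k_values):
    """Run k-medoids for each k and keep the result with the best silhouette."""
    best = None
    for k in k_values:
        medoids, labels = k_medoids(dist, k)
        score = silhouette(dist, labels)
        if best is None or score > best[0]:
            best = (score, k, medoids, labels)
    return best

# Toy data: three well-separated groups of 2D points standing in for poses;
# Euclidean distance stands in for a pairwise metric such as RMSD.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

score, k, medoids, labels = best_clustering(dist, range(2, 7))
print('Best number of clusters:', k)
print('Indices of the cluster centers (medoids):', medoids)
```

Because medoids are actual data points, each cluster center corresponds to a real docked pose that can be written out directly, which is what makes k-medoids a natural fit for this step.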

**Rescoring**

The file containing all the cluster centers is then rescored with all available scoring functions (GNINA, Vina, AutoDock4, PLP, CHEMPLP, RF-Score-VS). The rescored output is returned as a dataframe.

In [None]:
RMSD_rescored = rescore_all(w_dir, protein_file, ref_file, software, w_dir+'/temp/clustering/RMSD_clustered.sdf')
espsim_rescored = rescore_all(w_dir, protein_file, ref_file, software, w_dir+'/temp/clustering/espsim_clustered.sdf')
spyRMSD_rescored = rescore_all(w_dir, protein_file, ref_file, software, w_dir+'/temp/clustering/spyRMSD_clustered.sdf')
USRCAT_rescored = rescore_all(w_dir, protein_file, ref_file, software, w_dir+'/temp/clustering/USRCAT_clustered.sdf')
DScore_rescored = rescore_all(w_dir, protein_file, ref_file, software, w_dir+'/temp/clustering/3DScore_clustered.sdf')
bestpose_rescored = rescore_all(w_dir, protein_file, ref_file, software, w_dir+'/temp/clustering/bestpose_clustered.sdf')



**Final ranking methods**

This code calculates the final ranking of compounds using several consensus methods.
*Method 1* : Calculates the ECR value for each cluster center, then outputs the top-ranked center.
*Method 2* : Calculates the ECR value for each cluster center, then outputs the average ECR value for each compound.
*Method 3* : Calculates the average rank of each compound, then outputs the corresponding ECR value for each compound.
*Method 6* : Calculates the Z-score for each cluster center, then outputs the top-ranked center.
*Method 7* : Calculates the Z-score for each cluster center, then outputs the average Z-score for each compound.

All methods are then combined into a single dataframe for comparison purposes.

In [None]:
apply_ranking_methods_simplified(w_dir)
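
The ranking methods listed above can be sketched on a toy table. This is an interpretation under stated assumptions, not the code behind `apply_ranking_methods_simplified()`: the column names, the value `sigma=2.0`, and the convention that a lower score is better for every scoring function are all hypothetical. ECR is taken here as the exponential consensus rank, i.e. the sum over scoring functions of `exp(-rank / sigma) / sigma`, and the Z-score is sign-flipped so that higher is better for every method.

```python
import numpy as np
import pandas as pd

# Toy rescoring table: one row per cluster center (hypothetical columns)
scores = pd.DataFrame({
    'ID':      ['cpd1', 'cpd1', 'cpd2', 'cpd2', 'cpd3'],
    'score_a': [-9.0, -7.5, -8.0, -6.0, -5.0],
    'score_b': [-60.0, -72.0, -65.0, -50.0, -55.0],
})
score_cols = ['score_a', 'score_b']

def consensus_table(df, cols, id_col='ID', sigma=2.0):
    """Return one row per compound with one column per ranking method."""
    ids = df[id_col]
    # Rank 1 = best pose, assuming lower (more negative) scores are better
    ranks = df[cols].rank(method='min', ascending=True)
    ecr = (np.exp(-ranks / sigma) / sigma).sum(axis=1)
    # Standardise each scoring function, then flip the sign (higher = better)
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    zscore = -z.mean(axis=1)
    # Method 3 averages the ranks per compound before applying the ECR formula
    avg_rank = ranks.groupby(ids).mean()
    return pd.DataFrame({
        'method1_ECR_best': ecr.groupby(ids).max(),
        'method2_ECR_avg': ecr.groupby(ids).mean(),
        'method3_avg_rank_ECR': (np.exp(-avg_rank / sigma) / sigma).sum(axis=1),
        'method6_Zscore_best': zscore.groupby(ids).max(),
        'method7_Zscore_avg': zscore.groupby(ids).mean(),
    })

table = consensus_table(scores, score_cols)
print(table.round(3))
```

In this convention a higher value is better in every column, so sorting any column in descending order gives that method's compound ranking.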

In [None]:
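import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

test_df = pd.read_csv('/home/tony/CADD22/wocondock_refactored_chatgpt/temp/ranking/ranking_results.csv')

def show_correlation(dataframe):
    # Pairwise correlation between the numeric ranking columns only
    matrix = dataframe.corr(numeric_only=True).round(2)
    # Hide the upper triangle and diagonal, which duplicate the lower triangle
    mask = np.triu(np.ones_like(matrix, dtype=bool))
    sns.heatmap(matrix, mask=mask, annot=False, vmax=1, vmin=-1, center=0, linewidths=.5, cmap='coolwarm')
    plt.show()

show_correlation(test_df)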

In [None]:
calculate_EFs(w_dir, docking_library)