
## Practical Cheminformatics Quick Trick #1 
### Picking the Highest Scoring Molecule(s) From Each Cluster

Here's a quick trick that some might find helpful. In many cases, we want to select a diverse set of docked compounds. One way to do this is to cluster the docked molecules and pick the molecule with the best docking score from each cluster.


Install the RDKit

In [None]:
!pip install rdkit-pypi



In [None]:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools
from rdkit.ML.Cluster import Butina
import requests
import numpy as np
from tqdm.auto import tqdm
import pandas as pd

Enable the Pandas [progress_apply](https://towardsdatascience.com/progress-bars-in-python-and-pandas-f81954d33bae) command.

In [None]:
tqdm.pandas()

Define a couple of utility functions.

In [None]:
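def mol2fp(mol):
    # Morgan fingerprint with radius 2, folded into a bit vector
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    return fp

def taylor_butina_clustering(fp_list, cutoff=0.35):
    # Build the lower triangle of the Tanimoto distance matrix as a flat list
    dists = []
    nfps = len(fp_list)
    for i in range(1, nfps):
        sims = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        dists.extend([1 - x for x in sims])
    # Molecules within `cutoff` distance of a centroid join its cluster
    mol_clusters = Butina.ClusterData(dists, nfps, cutoff, isDistData=True)
    return mol_clusters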
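
The flat `dists` list built in `taylor_butina_clustering` holds the lower triangle of the distance matrix, row by row: pair (1,0), then (2,0), (2,1), then (3,0), and so on. The sketch below reproduces that layout in plain Python, with sets of on-bit indices standing in for RDKit fingerprints. The helper names `tanimoto` and `condensed_distances` and the toy bit sets are illustrative assumptions, not part of RDKit.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def condensed_distances(fps):
    """Return 1 - Tanimoto for every pair (i, j) with j < i, row by row."""
    dists = []
    for i in range(1, len(fps)):
        for j in range(i):
            dists.append(1.0 - tanimoto(fps[i], fps[j]))
    return dists


# Toy "fingerprints": each set lists the bits that are switched on.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10}]
toy_dists = condensed_distances(toy_fps)

# n fingerprints give n * (n - 1) / 2 pairwise distances.
print(len(toy_dists))  # 6

# The distance for the pair (i, j), j < i, sits at index i * (i - 1) // 2 + j.
i, j = 3, 2
print(toy_dists[i * (i - 1) // 2 + j])  # 0.5
```

Because the list grows quadratically with the number of molecules, 10,000 fingerprints produce roughly 50 million distances, which is worth keeping in mind before scaling this approach to much larger sets.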

Read an SD file with docking results. This data was generated by the team at [OpenEye Scientific Software](https://www.eyesopen.com/blog/openeye-releases-additional-giga-scale-virtual-screening-covid-19-data-for-public-use) who used their Orion platform to dock molecules from the Enamine REAL database into ACE2 bound to the spike protein of the SARS-CoV-2 virus.

In [None]:
fname = "Giga_Docking_10K_Hit_List.sdf.gz"
url = "https://raw.githubusercontent.com/PatWalters/datafiles/main/"+fname
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
with open(fname, 'wb') as ofs:
    ofs.write(r.content)

Put the docking results into a dataframe. PandasTools imports all of the data fields in an SD file as type "object". To sort the dataframe by the docking score, we need to convert the "Chemgauss4" column to type float.

In [None]:
df = PandasTools.LoadSDF(fname)
df.Chemgauss4 = df.Chemgauss4.astype(float)
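
If some records might carry a missing or malformed score, `astype(float)` will raise an error. A more forgiving variant, sketched here on a toy dataframe rather than the real data, is `pd.to_numeric` with `errors="coerce"`. The column values below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a freshly loaded SD file: every property arrives as a string.
toy_df = pd.DataFrame(
    {"Name": ["a", "b", "c"], "Chemgauss4": ["-9.5", "-12.1", "n/a"]}
)

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy_df["Chemgauss4"] = pd.to_numeric(toy_df["Chemgauss4"], errors="coerce")

print(toy_df["Chemgauss4"].dtype)  # float64

# NaN scores sort to the end by default, so valid scores stay at the top.
print(toy_df.sort_values("Chemgauss4")["Name"].tolist())  # ['b', 'a', 'c']
```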

Add a fingerprint to the dataframe.

In [None]:
df['fp'] = df.ROMol.progress_apply(mol2fp)


Cluster the fingerprints.

In [None]:
cluster_res = taylor_butina_clustering(df.fp.values)
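
To make the clustering step less of a black box, here is a simplified pure-Python sketch of the Taylor-Butina idea: count each item's neighbors within the cutoff, visit items from most to fewest neighbors, and let each unassigned item become a centroid that claims its still-unassigned neighbors. This is an illustration on a toy distance matrix, not RDKit's implementation. The function name `butina_sketch`, the full-matrix input, and the tie-breaking rule are assumptions made for the example.

```python
def butina_sketch(dist, cutoff):
    """Cluster items given a full symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Neighbors of i: every other item within the distance cutoff.
    neighbors = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff]
        for i in range(n)
    ]
    # Visit candidates from most to fewest neighbors (ties: lower index first).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters = []
    for i in order:
        if assigned[i]:
            continue
        # The centroid comes first, followed by its unassigned neighbors.
        members = [i] + [j for j in neighbors[i] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


toy_dist = [
    [0.0, 0.2, 0.3, 0.9, 0.8],
    [0.2, 0.0, 0.6, 0.9, 0.9],
    [0.3, 0.6, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
toy_clusters = butina_sketch(toy_dist, cutoff=0.35)
print(toy_clusters)  # [(0, 1, 2), (3, 4)]
```

Note that items 1 and 2 end up in the same cluster even though they are 0.6 apart: membership depends on distance to the centroid (item 0), not on distances between members.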

In [None]:
cluster_id_list = np.zeros(len(df),dtype=int)
for cluster_num,cluster in enumerate(cluster_res):
    for member in cluster:
        cluster_id_list[member] = cluster_num  

Add a cluster column to the dataframe.

In [None]:
df['cluster'] = cluster_id_list

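Sort the data by cluster and docking score. The sort is ascending, so within each cluster the most negative (best) Chemgauss4 score comes first.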

In [None]:
df.sort_values(["cluster","Chemgauss4"],inplace=True)

Define the list of properties to export into the SD file.

In [None]:
prop_list = [x for x in df.columns if x not in ['ROMol','fp']]

Create a new dataframe with the best scoring molecule in each cluster.

In [None]:
cluster_best_df = df.drop_duplicates("cluster")

Write the single best-scoring molecule in each cluster to an SD file.

In [None]:
PandasTools.WriteSDF(cluster_best_df,"cluster_best.sdf",properties=prop_list)

Create a new dataframe with the two best-scoring molecules in each cluster.

In [None]:
cluster_best_2_df = df.groupby("cluster").head(2)
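
Both selections depend on the dataframe already being sorted by cluster and score: `drop_duplicates` keeps the first row it sees for each cluster, and `groupby(...).head(2)` keeps the first two. The toy example below, with made-up names and scores, shows the pattern in isolation.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["m1", "m2", "m3", "m4", "m5", "m6"],
        "cluster": [1, 0, 0, 1, 0, 2],
        "Chemgauss4": [-8.0, -10.5, -12.0, -9.1, -7.3, -6.4],
    }
)

# Ascending sort: within each cluster the most negative score comes first.
toy = toy.sort_values(["cluster", "Chemgauss4"])

# First row per cluster, i.e. the best-scoring member.
best = toy.drop_duplicates("cluster")
print(best["name"].tolist())  # ['m3', 'm4', 'm6']

# First two rows per cluster; singleton clusters contribute one row.
best_2 = toy.groupby("cluster").head(2)
print(best_2["name"].tolist())  # ['m3', 'm2', 'm4', 'm1', 'm6']
```

Skipping the sort would still return one or two rows per cluster, but they would be whichever rows happened to appear first, not the best-scoring ones.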

Write the two best-scoring molecules in each cluster to an SD file.

In [None]:
PandasTools.WriteSDF(cluster_best_2_df,"cluster_best_2.sdf",properties=prop_list)