<a href="https://colab.research.google.com/github/Benitmulindwa/Cheminformatics/blob/main/butina_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install rdkit pandas seaborn tqdm mols2grid



In [None]:
from rdkit import Chem
from rdkit.ML.Cluster import Butina
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import DataStructs

In [None]:
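def butina_cluster(mol_list, cutoff=0.35):
  """Cluster molecules with Butina; return one 1-based cluster id per molecule.

  cutoff is a Tanimoto distance (1 - similarity): molecules within
  this distance of each other are treated as neighbors.
  """
  # Morgan fingerprints, radius 3, folded to 2048 bits
  generator = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=2048)
  fp_list = [generator.GetFingerprint(x) for x in mol_list]

  # Flattened lower triangle of the distance matrix: (1,0), (2,0), (2,1), ...
  distances = []
  num_fps = len(fp_list)
  for i in range(1, num_fps):
    similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
    distances.extend([1 - x for x in similarities])

  # ClusterData returns a tuple of clusters, each a tuple of molecule indices
  mol_clusters = Butina.ClusterData(distances, num_fps, cutoff, isDistData=True)

  # Map each molecule index to the 1-based id of its cluster
  clusters_id_list = [0] * num_fps
  for idx, cluster in enumerate(mol_clusters, 1):
    for member in cluster:
      clusters_id_list[member] = idx

  return clusters_id_list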
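
The loop in `butina_cluster` builds the distance list as a flattened lower triangle, one row at a time. As a rough illustration of that ordering, here is a pure-Python sketch that uses plain sets of "on" bits in place of RDKit fingerprints. The helper names (`tanimoto`, `condensed_distances`, `condensed_index`) and the toy fingerprints are hypothetical and are not part of RDKit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bits."""
    union = len(a | b)
    # Assumption for this sketch: two empty sets get similarity 0.0
    return len(a & b) / union if union else 0.0


def condensed_distances(fps):
    """Flatten the lower triangle row by row: (1,0), (2,0), (2,1), (3,0), ..."""
    distances = []
    for i in range(1, len(fps)):
        distances.extend(1.0 - tanimoto(fps[i], fps[j]) for j in range(i))
    return distances


def condensed_index(i, j):
    """Position of the pair (i, j), with i > j, in the flattened list."""
    return i * (i - 1) // 2 + j


# Toy "fingerprints": each set holds the indices of the bits that are on
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 2, 3, 4}]
dists = condensed_distances(fps)

print(dists)                         # one entry per unordered pair
print(dists[condensed_index(3, 1)])  # distance between items 3 and 1
```

For `n` items the list holds `n * (n - 1) / 2` distances, which is why the number of points is passed alongside the flattened list.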

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/PatWalters/practical_cheminformatics_tutorials/main/data/dude_erk2_mk01.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | SMILES | ID | is_active |
|---|---|---|---|---|
| 0 | 0 | `Cn1ccnc1Sc2ccc(cc2Cl)Nc3c4cc(c(cc4ncc3C#N)OCCC...` | 168691 | 1 |
| 1 | 1 | `C[C@@]12[C@@H]([C@@H](CC(O1)n3c4ccccc4c5c3c6n2...` | 86358 | 1 |
| 2 | 2 | `Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3...` | 575087 | 1 |
| 3 | 3 | `Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3...` | 575065 | 1 |
| 4 | 4 | `Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3...` | 575047 | 1 |


In [None]:
import mols2grid
mols2grid.display(df)


In [None]:
df['structure']=df.SMILES.apply(Chem.MolFromSmiles)

**Cluster the molecules in the dataframe**

In [None]:
df['cluster']=butina_cluster(df.structure.values)
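
For intuition about what happens to the distances inside `butina_cluster`, the following is a simplified pure-Python sketch of the Taylor-Butina idea: count each item's neighbors within the cutoff, visit items from most to fewest neighbors, and let each unassigned item claim its unassigned neighbors as a cluster. It is an illustration under stated assumptions, not RDKit's implementation. The function `butina_sketch`, the `<=` comparison, and the toy 1-D points are all hypothetical.

```python
def butina_sketch(dist, n, cutoff):
    """Simplified Taylor-Butina clustering.

    dist(i, j) returns the distance between items i and j.
    Returns a list of clusters (tuples of indices), centroid first.
    """
    # 1. Neighbor list for every item: all other items within the cutoff
    neighbors = [
        [j for j in range(n) if j != i and dist(i, j) <= cutoff]
        for i in range(n)
    ]
    # 2. Visit items from most to fewest neighbors (ties keep input order)
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)

    assigned = [False] * n
    clusters = []
    for i in order:
        if assigned[i]:
            continue
        # 3. The item becomes a centroid and claims its unassigned neighbors
        members = [i] + [j for j in neighbors[i] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


# Toy data: points on a line, distance is the absolute difference
points = [0, 1, 2, 10, 11, 50]
clusters = butina_sketch(lambda i, j: abs(points[i] - points[j]), len(points), cutoff=1)

# Same index-to-cluster-id mapping used at the end of butina_cluster
ids = [0] * len(points)
for idx, cluster in enumerate(clusters, 1):
    for member in cluster:
        ids[member] = idx

print(clusters)
print(ids)
```

An item with no neighbors within the cutoff ends up as a singleton cluster, so every molecule still receives a cluster id.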

**View the dataframe with the new `cluster` column**

In [None]:
mols2grid.display(df,subset=['img','ID','cluster'])

Select the molecule with the lowest LogP from each cluster.
- Calculate the LogP for each molecule.
- Put these values into a new column called "LogP".

In [None]:
from rdkit.Chem import Crippen
df['LogP']=df.structure.apply(Crippen.MolLogP)

In [None]:
mols2grid.display(df, subset=['img','ID','cluster','LogP'],transform={'LogP':lambda x: f"{x:.2f}"})

**Sort the dataframe by cluster, then by LogP, so the lowest-LogP molecule comes first within each cluster**

In [None]:
df.sort_values(['cluster','LogP'], inplace=True)

In [None]:
mols2grid.display(df,subset=['img','ID','cluster','LogP'], transform={'LogP': lambda x: f'{x:.2f}'})

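Create a new dataframe containing only the lowest-LogP molecule from each cluster. Because the dataframe is already sorted by cluster and then by LogP, `drop_duplicates` keeps the first row of each cluster, which is the one with the lowest LogP.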

In [None]:
df_unique=df.drop_duplicates('cluster')
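
The sort-then-deduplicate pattern can be checked on a small example. The sketch below uses a made-up dataframe (the `toy` frame and its values are hypothetical, not taken from the ERK2 data) and compares the pattern with a `groupby`/`idxmin` alternative that does not depend on the row order.

```python
import pandas as pd

# Hypothetical toy data: five molecules in three clusters
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "cluster": [2, 1, 1, 2, 3],
    "LogP": [3.1, 2.4, 1.7, 0.9, 4.2],
})

# Pattern from the notebook: sort, then keep the first row of each cluster
by_sort = toy.sort_values(["cluster", "LogP"]).drop_duplicates("cluster")

# Alternative: look up the row label of the minimum LogP in each cluster
by_group = toy.loc[toy.groupby("cluster")["LogP"].idxmin()]

print(by_sort)
print(by_group)
```

Both approaches select the same rows here. The `groupby` version leaves the original frame unsorted, which may be preferable when the row order matters elsewhere.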

In [None]:
mols2grid.display(df_unique,subset=['img','ID','cluster','LogP'], transform={'LogP': lambda x: f'{x:.2f}'})