# Molecular similarity based on Morgan fingerprints

A detailed description of this Jupyter notebook is given in our Nature Protocols paper.

Please cite: Tran-Nguyen, V. K., Junaid, M., Simeon, S. & Ballester, P. J. A practical guide to machine-learning scoring for structure-based virtual screening. Nat. Protoc. (2023)

This Jupyter notebook helps users compare sets of compounds based on the Tanimoto similarity of their Morgan fingerprints. Please refer to our Nature Protocols paper cited above for more information.

The protocol-env.yml file for setting up the environment required to run this Jupyter notebook can be found in our GitHub repository: https://github.com/vktrannguyen/MLSF-protocol.

## 1. Install all required Python dependencies

Several Python dependencies must be installed beforehand: set up the protocol-env environment using conda and the protocol-env.yml file (downloaded from our GitHub repository).

In [None]:
from rdkit import DataStructs
from rdkit import Chem
import pandas as pd
from rdkit.Chem.PandasTools import RenderImagesInAllDataFrames
from rdkit.Chem import AllChem
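
Before running the remaining cells, it can help to confirm that the active environment actually provides the packages this notebook imports. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` and the package list are assumptions made for illustration (the list is taken from the imports used in this notebook, not from protocol-env.yml).

```python
import importlib.util

# Top-level packages imported by the cells of this notebook.
REQUIRED = ["rdkit", "pandas", "numpy", "seaborn", "matplotlib"]

def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If any package is reported as missing, the conda environment is probably not activated or was not created from the yml file.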

## 2. Load the active, decoy, training and test SMILES

In [None]:
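# The files are read with header=None so that the SMILES end up in column 0,
# which is how the later cells access them.

# Provide the path to the SMILES file of the active molecules:
active_smiles = pd.read_csv("pathway_to_active_smiles", header=None)

# Provide the path to the SMILES file of the decoy molecules:
decoys_smiles = pd.read_csv("pathway_to_decoys_smiles", header=None)

# Provide the path to the SMILES file of the training molecules:
training_smiles = pd.read_csv("pathway_to_training_smiles", header=None)

# Provide the path to the SMILES file of the test molecules:
test_smiles = pd.read_csv("pathway_to_test_smiles", header=None)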
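
The later cells access the SMILES through column `0` (for example `training_smiles[0]`), which only exists when the file is read without a header row. The sketch below illustrates the difference on a small in-memory example; the three SMILES strings are made up for illustration and assume a file with one SMILES per line and no header.

```python
import io
import pandas as pd

# Hypothetical file content: one SMILES string per line, no header row.
smi_text = "CCO\nc1ccccc1\nCC(=O)O\n"

# Default behaviour: the first line is consumed as the column name.
with_header = pd.read_csv(io.StringIO(smi_text))

# header=None: every line is data and the single column is labelled 0.
without_header = pd.read_csv(io.StringIO(smi_text), header=None)

print(len(with_header), len(without_header))
smiles_list = without_header[0].to_list()
print(smiles_list)
```

With the default settings the first molecule is silently lost to the header, so checking the row count against the expected number of compounds is a cheap sanity test.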

## 3. Convert the SMILES to RDKit molecules

In [None]:
# Column 0 of each DataFrame holds the SMILES strings
mol_actives = [Chem.MolFromSmiles(x) for x in active_smiles[0].to_list()]
mol_decoys = [Chem.MolFromSmiles(x) for x in decoys_smiles[0].to_list()]

mol_actives_train = [Chem.MolFromSmiles(x) for x in training_smiles[0].to_list()]
mol_actives_test = [Chem.MolFromSmiles(x) for x in test_smiles[0].to_list()]
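
`Chem.MolFromSmiles` returns `None` when a SMILES string cannot be parsed, and the fingerprint step would then fail on that entry. The following is a minimal sketch of how such entries could be separated out before continuing; the example SMILES are illustrative and not taken from the protocol's data sets.

```python
from rdkit import Chem

# Two valid SMILES and one deliberately invalid string (illustrative only).
example_smiles = ["CCO", "c1ccccc1", "this_is_not_smiles"]

parsed = [(smi, Chem.MolFromSmiles(smi)) for smi in example_smiles]

# Keep the molecules that were parsed and record the strings that were not.
valid_mols = [mol for smi, mol in parsed if mol is not None]
failed_smiles = [smi for smi, mol in parsed if mol is None]

print(f"{len(valid_mols)} parsed, {len(failed_smiles)} failed: {failed_smiles}")
```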


## 4. Compute the Morgan fingerprints of all input compounds 

In [None]:
# Here we compute Morgan fingerprints of radius 2, 2048 bits:

fp_actives = [AllChem.GetMorganFingerprintAsBitVect(x, radius=2, nBits=2048) for x in mol_actives]
fp_decoys = [AllChem.GetMorganFingerprintAsBitVect(x, radius=2, nBits=2048) for x in mol_decoys]

fp_actives_train = [AllChem.GetMorganFingerprintAsBitVect(x, radius=2, nBits=2048) for x in mol_actives_train]
fp_actives_test = [AllChem.GetMorganFingerprintAsBitVect(x, radius=2, nBits=2048) for x in mol_actives_test]
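
To see what one of these fingerprints looks like, the sketch below computes it for two small molecules with the same call and settings as above, then inspects the bit vector. Ethanol and 1-propanol are chosen purely for illustration and are not part of the protocol's data.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

ethanol = Chem.MolFromSmiles("CCO")
propanol = Chem.MolFromSmiles("CCCO")

fp_ethanol = AllChem.GetMorganFingerprintAsBitVect(ethanol, radius=2, nBits=2048)
fp_propanol = AllChem.GetMorganFingerprintAsBitVect(propanol, radius=2, nBits=2048)

print(fp_ethanol.GetNumBits())     # length of the bit vector
print(fp_ethanol.GetNumOnBits())   # number of bits set to 1

# A fingerprint compared with itself is maximally similar;
# two different molecules give a value between 0 and 1.
self_similarity = DataStructs.TanimotoSimilarity(fp_ethanol, fp_ethanol)
cross_similarity = DataStructs.TanimotoSimilarity(fp_ethanol, fp_propanol)
print(self_similarity, cross_similarity)
```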

## 5. Calculate the Tanimoto similarity of Morgan fingerprints and create a Tanimoto similarity matrix
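
For two bit-vector fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below illustrates this on hand-written sets of on-bit indices, to show what each entry of the similarity matrix represents. The bit positions are made up, and `tanimoto` is a hypothetical helper, not an RDKit function.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(on_bits_a & on_bits_b)   # bits set in both fingerprints
    total = len(on_bits_a | on_bits_b)    # bits set in at least one fingerprint
    # Convention chosen for this sketch: two empty fingerprints give 0.0.
    return shared / total if total else 0.0

fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 100}

# 2 shared bits (5 and 9) out of 5 distinct bits (1, 5, 9, 42, 100).
print(tanimoto(fp_a, fp_b))
```

A value of 1 means the two fingerprints are identical, and a value of 0 means they share no bits.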

### **Actives and decoys**

In [None]:
import numpy as np
# One row per decoy, one column per active
size_x = len(fp_actives)
size_y = len(fp_decoys)
similarity_matrix = np.zeros((size_y, size_x))
similarity_matrix.shape

In [39]:
np_fps = []
for idx, fp in enumerate(fp_decoys):
    # Keep a NumPy copy of each decoy fingerprint
    np_fp = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, np_fp)
    np_fps.append(np_fp)
    # Calculate the Tanimoto similarity of this decoy to every active
    similarity = DataStructs.BulkTanimotoSimilarity(fp, fp_actives)
    # Save it as one row of the similarity matrix
    similarity_matrix[idx] = similarity
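
As a cross-check on the loop above, the same kind of matrix can be computed directly from fingerprints stored as 0/1 NumPy arrays. The sketch below uses tiny made-up 4-bit fingerprints rather than real Morgan fingerprints, and `tanimoto_matrix` is a hypothetical helper written for this illustration.

```python
import numpy as np

def tanimoto_matrix(rows, cols):
    """Pairwise Tanimoto similarity between two stacks of 0/1 fingerprints.

    rows has shape (n_rows, n_bits) and cols has shape (n_cols, n_bits);
    the result has shape (n_rows, n_cols).
    """
    rows = np.asarray(rows, dtype=float)
    cols = np.asarray(cols, dtype=float)
    shared = rows @ cols.T                     # bits set in both fingerprints
    on_rows = rows.sum(axis=1)[:, None]        # on-bits per row fingerprint
    on_cols = cols.sum(axis=1)[None, :]        # on-bits per column fingerprint
    either = on_rows + on_cols - shared        # bits set in at least one
    # Pairs of all-zero fingerprints are left at 0 to avoid dividing by zero.
    return np.divide(shared, either, out=np.zeros_like(shared), where=either > 0)

# Made-up example: 3 "decoys" (rows) and 2 "actives" (columns).
decoys = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]])
actives = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])

matrix = tanimoto_matrix(decoys, actives)
print(matrix.shape)
print(matrix)
```

The row and column order mirrors the notebook's matrix: each row is one decoy and each column is one active.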

In [40]:
df_similarity = pd.DataFrame(similarity_matrix)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(11.7,8.27)})
fig, ax = plt.subplots(dpi=300, figsize=(7,5))
ax = sns.heatmap(df_similarity, vmin=0, vmax=1,
                yticklabels=False, xticklabels=False,cmap="coolwarm")
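# Axis labels report the actual number of compounds in each set
ax.set_xlabel(f"Actives ({len(fp_actives)})", fontsize=20)
ax.set_ylabel(f"Decoys ({len(fp_decoys)})", fontsize=20)
# Provide the path where the PNG file should be saved:
plt.savefig('path_to_save_png_file')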

### **Training and test actives**

In [None]:
import numpy as np
# One row per test active, one column per training active
size_x = len(fp_actives_test)
size_y = len(fp_actives_train)
similarity_matrix = np.zeros((size_x, size_y))
similarity_matrix.shape

In [None]:
np_fps = []
for idx, fp in enumerate(fp_actives_test):
    # Keep a NumPy copy of each test fingerprint
    np_fp = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, np_fp)
    np_fps.append(np_fp)
    # Calculate the Tanimoto similarity of this test active to every training active
    similarity = DataStructs.BulkTanimotoSimilarity(fp, fp_actives_train)
    # Save it as one row of the similarity matrix
    similarity_matrix[idx] = similarity

In [None]:
df_similarity = pd.DataFrame(similarity_matrix)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(11.7,8.27)})
fig, ax = plt.subplots(dpi=300, figsize=(7,5))
ax = sns.heatmap(df_similarity, vmin=0, vmax=1,
                yticklabels=False, xticklabels=False,cmap="coolwarm")
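# Axis labels report the actual number of compounds in each set
ax.set_xlabel(f"training actives ({len(fp_actives_train)})", fontsize=20)
ax.set_ylabel(f"test actives ({len(fp_actives_test)})", fontsize=20)
# Provide the path where the PNG file should be saved:
plt.savefig('path_to_save_png_file')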
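
A heatmap shows the overall pattern; a complementary way to read the matrix is to list, for each test compound, its most similar training compound. The sketch below does this on a made-up 3 x 4 matrix with the same orientation as above (rows are test compounds, columns are training compounds). The values are invented for illustration only.

```python
import numpy as np

# Made-up similarity matrix: 3 test compounds (rows) x 4 training compounds (columns).
example_matrix = np.array([
    [0.20, 0.85, 0.10, 0.30],
    [0.15, 0.25, 0.40, 0.35],
    [0.90, 0.05, 0.60, 0.10],
])

nearest_index = example_matrix.argmax(axis=1)     # closest training compound per test compound
nearest_similarity = example_matrix.max(axis=1)   # Tanimoto similarity to that compound

for row, (col, sim) in enumerate(zip(nearest_index, nearest_similarity)):
    print(f"test compound {row}: nearest training compound {col} (Tanimoto {sim:.2f})")
```

Applied to the notebook's `similarity_matrix`, the same two calls would give the nearest training active for each test active.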