<span style="color:Red">MinMax diversity</span> selection is a particular case of the <span style="color:Red">Kennard-Stone algorithm</span>. Both algorithms split a dataset into two parts based on user-selected factors, e.g. molecular descriptors. In cheminformatics, the main purpose of this procedure is to obtain a test set (or a validation or optimization set) for machine learning, such as QSAR modelling.

The gist of the algorithm is an iterative selection of new training set candidates from the remaining dataset: each new candidate is the compound whose lowest dissimilarity to the compounds already in the training set is the highest among all remaining compounds. The procedure is repeated until the desired split ratio (e.g. 75% training, 25% test) is achieved. The goal of the splitting is to create a training set that covers the full diversity of the overall dataset, while the test set contains no compounds that are too different from the training set. <u>It is important to mention that the results of the splitting depend strongly on the similarity metric and the descriptor space.</u>
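
The selection loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in this notebook: the function name `minmax_split`, its parameters, the Euclidean distance, and the choice to start from one member of the most distant pair are all assumptions made for the example.

```python
import numpy as np

def minmax_split(X, n_train):
    """Greedy MinMax (Kennard-Stone-style) split; hypothetical helper.

    X       : (n_samples, n_features) array of descriptors
    n_train : number of rows to put into the training set
    Returns (train_indices, test_indices) as integer arrays.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= n_train <= n:
        raise ValueError("n_train must be between 1 and the number of rows")

    # Full pairwise Euclidean distance matrix (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Start from one member of the most distant pair; the loop below
    # then picks the other member (or an equally distant point) next.
    first, _ = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first)]

    # min_dist[k] = distance from row k to its nearest selected row.
    min_dist = dist[:, selected[0]].copy()
    while len(selected) < n_train:
        min_dist[selected] = -np.inf          # never pick a row twice
        nxt = int(np.argmax(min_dist))        # highest of the lowest distances
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[:, nxt])

    train = np.array(selected)
    test = np.setdiff1d(np.arange(n), train)
    return train, test

# Toy one-dimensional data: the two extremes (rows 0 and 3) are chosen first.
toy = np.array([[0.0], [1.0], [2.0], [10.0]])
train_idx, test_idx = minmax_split(toy, n_train=3)
print(train_idx, test_idx)
```

On real descriptor tables the columns usually live on very different scales (ring counts versus molecular weight, for instance), so standardizing them before computing distances is probably sensible; otherwise the largest-valued descriptor dominates the split.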

In [1]:
import rdkit as rd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import copy
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import MACCSkeys
from rdkit.Chem import GraphDescriptors
from rdkit import DataStructs
from rdkit.ML.Descriptors import MoleculeDescriptors
import numpy as np
import pandas as pd
import pickle

In [2]:
from rdkit.Chem import rdFMCS
cox2_sdf = r'assets/COX2_inhibitors_final.sdf'
df = PandasTools.LoadSDF(cox2_sdf, molColName='Mol')
df["Inhibition, %"] = df["Inhibition, %"].astype(int)

In [8]:
# One row per molecule, one column per descriptor
descr_df = np.full((df.shape[0], 4), 0, dtype="float64")
for i in range(0, len(df.index)):
    descr_bundle = []
    mol = df['Mol'][i]
    descr_bundle.append(rdMolDescriptors.CalcNumAromaticRings(mol))
    descr_bundle.append(Descriptors.NumValenceElectrons(mol))
    descr_bundle.append(round(GraphDescriptors.BalabanJ(mol), 2))
    descr_bundle.append(round(rdMolDescriptors.CalcExactMolWt(mol), 1))
    descr_df[i, 0:len(descr_bundle)] = descr_bundle
descr_df = pd.DataFrame(descr_df, index=df['CHEMBLID'])
### naming the descr df
descr_names = ['NumAromaticRings', 'NumValenceElectrons', 'BalabanJ', 'MW']
descr_df.columns = descr_names
descr_df.head()

CHEMBLID,NumAromaticRings,NumValenceElectrons,BalabanJ,MW
CHEMBL366429,2.0,126.0,2.02,365.1
CHEMBL176216,2.0,146.0,1.79,458.0
CHEMBL174680,3.0,136.0,1.86,380.1
CHEMBL176357,3.0,138.0,1.83,421.0
CHEMBL369840,2.0,142.0,2.13,465.9


In [9]:
### structural keys
finger_df = np.full((df.shape[0], 167), 0, dtype="float64")
for i in range(0, len(df.index)):
    finger_df[i, :] = np.array(rdMolDescriptors.GetMACCSKeysFingerprint(df['Mol'][i]))
finger_df = pd.DataFrame(finger_df, index=df['CHEMBLID'])
del finger_df[0]  # removing the empty column (bit 0 of the MACCS keys is unused)
MACCSkeys_names = list(MACCSkeys.smartsPatts.values())
finger_df.columns = MACCSkeys_names
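
Because the split depends on the similarity metric, a binary key matrix such as the one built above is often compared with the Tanimoto coefficient rather than with Euclidean distance. The sketch below shows one way to obtain a Tanimoto dissimilarity matrix from a 0/1 array; the helper name `tanimoto_distance_matrix` is an assumption made for illustration and is not an RDKit function.

```python
import numpy as np

def tanimoto_distance_matrix(bits):
    """Pairwise Tanimoto dissimilarity (1 - similarity) for 0/1 rows.

    Hypothetical helper: `bits` is an (n_samples, n_bits) array or
    DataFrame of binary keys.
    """
    bits = np.asarray(bits, dtype=float)
    common = bits @ bits.T                       # bits set in both rows
    counts = bits.sum(axis=1)                    # bits set in each row
    union = counts[:, None] + counts[None, :] - common
    # Two all-zero rows have an empty union; treat them as identical.
    sim = np.divide(common, union, out=np.ones_like(common), where=union > 0)
    return 1.0 - sim

# Small check on three hand-made fingerprints.
toy_bits = np.array([[1, 1, 0],
                     [1, 0, 0],
                     [0, 0, 1]])
print(tanimoto_distance_matrix(toy_bits))
```

A matrix like this could take the place of the Euclidean distances in a MinMax selection, for example by calling it on `finger_df.values`; whether that gives a better split than continuous descriptors is something to check on the dataset at hand.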

In [None]:
# # capstone descr
# capstone_desc_path = SOME_PATH # r"/Users/marcusc/Documents/Courses/neovarsity_chemoinformatics_2024/assets/comblib_descr.csv"
# capstone_desc = pd.read_csv(capstone_desc_path, index_col='CHEMBLID')
# capstone_desc.head()

Index(['Unnamed: 0', 'FpDensityMorgan1', 'HeavyAtomMolWt',
       'MaxAbsPartialCharge', 'MinPartialCharge', 'MolWt',
       'NumValenceElectrons', 'Chi0n', 'Chi1v', 'Chi2v', 'FractionCSP3',
       'HallKierAlpha', 'Kappa1', 'Kappa2', 'NumAliphaticCarbocycles',
       'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAmideBonds',
       'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings',
       'NumAtomStereoCenters', 'NumHBA', 'NumHBD', 'NumHeteroatoms',
       'NumLipinskiHBA', 'NumRings', 'NumSaturatedHeterocycles', 'BertzCT',
       '('[#6]~[#16]~[#7]', 0)', '('[#16R]', 0)',
       '('[#7]~[#6](~[#8])~[#7]', 0)', '('[#7]~[#6](~[#6])~[#7]', 0)',
       '('F', 0)', '('[!#6;!#1;!H0]~*~[!#6;!#1;!H0]', 0)', '('Br', 0)',
       '('[#16]~*~[#7]', 0)', '('[#7]~[#7]', 0)',
       '('[!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0]', 0)',
       '('[!#6;!#1;!H0]~*~*~[!#6;!#1;!H0]', 0)', '('[#8]~[#16]~[#8]', 0)',
       '('[#8R]', 0)', '('*@*!@*@*', 0)', '('*@*!@[#16]', 0)', '('c:n',