# ADN_T004. QSAR models

Authors:
* Adnane Aouidate, (2019-2020), Computer Aided Drug Discovery Center, Shenzhen Institute of Advanced Technology (SIAT), Shenzhen, China.
* Adnane Aouidate, (2021-2022), Structural Bioinformatics and Chemoinformatics, Institute of Organic and Analytical Chemistry (ICOA), Orléans, France.
* Update, 2023, Ait Melloul Faculty of Applied Sciences, Ibn Zohr University, Agadir, Morocco.

### Aim of this tutorial

In this tutorial, you will learn how to build and validate a Quantitative Structure-Activity Relationship (QSAR) model using data from ChEMBL, a manually curated database of bioactive molecules with drug-like properties.

QSAR models are **widely used in drug discovery research**, particularly in the hit-to-lead and lead optimization stages.

A QSAR model **helps researchers prioritize new drug candidates** by predicting, from molecular structure, which compounds are likely to be active against a biological target.

In this tutorial, you will learn how to use the Python scikit-learn library to preprocess and curate your data, and then use the curated data in machine learning-based QSAR models.

**Let's get started!**


In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mordred import descriptors, Calculator
from rdkit import Chem
from rdkit.Chem import AllChem
print(np.__version__)

1.21.5


In [11]:
Calc = Calculator(descs=descriptors, ignore_3D=True)

In [12]:
df = pd.read_csv('./databases/acetylcholinesterase_Ki_pKi_bioactivity_data_curated.csv')

In [13]:
df.dropna(how= 'any', inplace=True)

In [14]:
df

Unnamed: 0,molecule_chembl_id,units,Ki,smiles,pKi
0,CHEMBL11805,nM,0.104,COc1ccccc1CN(C)CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)...,9.982967
1,CHEMBL208599,nM,0.026,CCC1=CC2Cc3nc4cc(Cl)ccc4c(N)c3[C@@H](C1)C2,10.585027
2,CHEMBL60745,nM,1.630,CC[N+](C)(C)c1cccc(O)c1.[Br-],8.787812
3,CHEMBL95,nM,151.000,Nc1c2c(nc3ccccc13)CCCC2,6.821023
4,CHEMBL173309,nM,12.200,CCN(CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)CCCCCN(CC)C...,7.913640
...,...,...,...,...,...
467,CHEMBL5220695,nM,120.000,CC(C)(C)OC(=O)Nc1ccc(O)c(C(=O)NCCCN2CCCCC2)c1,6.920819
468,CHEMBL5219239,nM,170.000,CC1CCCCN1CCCNC(=O)c1cc(NC(=O)OC(C)(C)C)ccc1O,6.769551
469,CHEMBL5218804,nM,0.264,COc1cccc2c1CCC(NC(=O)OCc1ccccc1)C2,9.578396
470,CHEMBL5219425,nM,3500.000,CCN(CC)C(=O)OC1C[N+]2(C)CCC1CC2.[I-],5.455932


In [15]:
y = df['pKi']
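
The `pKi` column used as the response appears to be the negative base-10 logarithm of Ki expressed in mol/L; the values in the table above are consistent with this reading (Ki = 0.104 nM corresponds to pKi of about 9.98). A minimal sketch of that conversion, where the helper name `ki_nm_to_pki` is hypothetical and not part of the original notebook:

```python
import math


def ki_nm_to_pki(ki_nm: float) -> float:
    """Convert a Ki value given in nanomolar to pKi = -log10(Ki in mol/L)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(ki_nm * 1e-9) = 9 - log10(ki_nm)
    return 9.0 - math.log10(ki_nm)


# Values taken from the first and fourth rows of the table above
pki_first = ki_nm_to_pki(0.104)
pki_fourth = ki_nm_to_pki(151.0)
```

Working on the logarithmic scale compresses the wide range of Ki values (here from below 0.1 nM to several µM) into a range that regression models usually handle better.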

In [16]:
#mols = [Chem.MolFromSmiles(smi) for smi in df['smiles']]  # the table above names this column 'smiles'

In [17]:
# df1 = Calc.pandas(mols)

In [18]:
#df1.to_csv('data_mordred_descriptors.csv',index=False)

In [19]:
df.isnull().sum().sum()

0

In [20]:
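# df2 has to exist before it can be re-indexed. This assumes the Mordred descriptors
# were written to disk earlier by the commented-out cells above.
df2 = pd.read_csv('data_mordred_descriptors.csv')
df2.set_index(df.index, inplace=True)
df2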

In [19]:
df2.dtypes

ABC         float64
ABCGG       float64
nAcid         int64
nBase         int64
SpAbs_A      object
             ...   
WPol          int64
Zagreb1     float64
Zagreb2     float64
mZagreb1     object
mZagreb2    float64
Length: 1613, dtype: object

### Convert values to numeric

In [20]:
data_columuns = df2.columns
data_columuns
df2_indices = df2.index
df

molecule_chembl_id,canonical_smiles,pchembl_value
CHEMBL1795572,CO/N=C(/C(=O)NCP(=O)(O)Oc1ccc(C#N)c(F)c1)c1ccc...,4.51
CHEMBL3112752,N[C@@H](Cc1ccc(NC(=O)[C@@H]2CC[C@@H]3CN2C(=O)N...,4.55
CHEMBL3112746,O=C(Nc1ccncc1)[C@@H]1CC[C@@H]2CN1C(=O)N2OS(=O)...,4.70
CHEMBL1173339,CCC(S)P(=O)(OC(C)C)OC(C)C,5.70
CHEMBL1172388,CCC(S)P(=O)(O)O,4.82
...,...,...
CHEMBL4088285,CC(=O)SCC(CCCc1ccccc1)c1nnn[nH]1,4.38
CHEMBL4064978,CC(=O)SCC(Cc1ccccc1)c1nnn[nH]1,4.17
CHEMBL4075406,O=P(O)(O)C(CO)CCCCc1ccccc1,5.35
CHEMBL4069211,O=P(O)(O)C(CO)CCCc1ccccc1,4.47


In [21]:
# Coerce every descriptor column to a numeric dtype; entries that cannot be parsed become NaN
df2 = df2[data_columuns].apply(pd.to_numeric, errors='coerce')
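
Columns with dtype `object` typically arise when some entries are not plain numbers. With `errors='coerce'`, such entries are replaced by `NaN`, so it is worth counting missing values in the descriptor table itself after the conversion. The sketch below uses a made-up two-column table (the data and the names `toy`, `toy_numeric`, and `toy_clean` are illustrative only):

```python
import pandas as pd

# Toy descriptor table: one column contains a non-numeric entry,
# so pandas stores the whole column with dtype object.
toy = pd.DataFrame({
    "ABC": [1.2, 3.4, 5.6],
    "SpAbs_A": ["7.1", "not computable", "9.3"],
})

# Coerce to numeric: unparseable entries become NaN
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

# Count the NaN values introduced per column
n_missing = toy_numeric.isna().sum()

# One possible policy: drop every descriptor that has any missing value
toy_clean = toy_numeric.dropna(axis=1, how="any")
```

Dropping affected columns is only one option; imputing the missing values or removing the affected compounds are alternatives, and the right choice depends on how many entries are missing.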

In [22]:
df2.dtypes

ABC         float64
ABCGG       float64
nAcid         int64
nBase         int64
SpAbs_A     float64
             ...   
WPol          int64
Zagreb1     float64
Zagreb2     float64
mZagreb1    float64
mZagreb2    float64
Length: 1613, dtype: object

In [23]:
df2.to_csv('betalactamase_mordred_all_descriptors.csv', index=True)

In [24]:
df.isnull().sum().sum()

0

In [25]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler

In [26]:
Scaler = StandardScaler().fit(df2)
df2 = Scaler.transform(df2)
df2 = pd.DataFrame(data= df2, index=df2_indices, columns= data_columuns)
df2

  updated_mean = (last_sum + new_sum) / updated_sample_count
  result = op(x, *args, **kwargs)


molecule_chembl_id,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
CHEMBL1795572,0.085011,0.495100,2.203399,-0.394642,-0.018392,-0.893046,-0.731131,-0.018392,-1.584933,0.196866,...,-0.130687,0.520441,0.783628,2.675741,-0.121708,-0.122044,0.017907,-0.102373,1.108502,0.305109
CHEMBL3112752,0.611166,0.603645,4.674405,1.388081,0.357585,0.896336,0.251232,0.357585,-1.287309,0.618517,...,0.646492,0.900949,0.747344,0.533264,-0.121708,0.484744,0.647191,0.581844,1.505499,0.338800
CHEMBL3112746,-0.391348,-0.325746,2.203399,-0.394642,-0.574502,0.892969,0.176168,-0.574502,-0.731794,-0.333416,...,0.110881,0.221033,-0.332643,1.003748,-0.121709,-0.382096,-0.296735,-0.279763,-0.082489,-0.694408
CHEMBL1173339,-2.235638,-1.856657,-0.267607,-0.394642,-2.449497,-1.118906,-0.856604,-2.449497,-5.420783,-2.886535,...,-2.276734,-2.418560,-1.612615,-1.204268,-0.121709,-1.855724,-2.247517,-2.256390,-0.175122,-1.985918
CHEMBL1172388,-3.200974,-3.267351,4.674405,-0.394642,-3.359235,-2.723774,-2.657959,-3.359235,-6.397642,-5.350458,...,-4.227229,-3.294470,-2.668214,0.940401,-0.121709,-2.722565,-3.128515,-3.067314,-1.471980,-3.131432
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL4088285,-1.086948,-0.998379,4.674405,-0.394642,-1.011702,-1.837288,-2.088464,-1.011702,-0.057424,-1.120192,...,-1.996907,-0.421894,-0.984667,-0.486915,-0.121709,-1.508988,-1.240662,-1.344101,-1.223856,-0.885327
CHEMBL4064978,-1.399896,-1.263435,4.674405,-0.394642,-1.355599,-1.520047,-1.630808,-1.355599,-0.031552,-1.545641,...,-2.089384,-0.648176,-1.336534,0.027374,-0.121709,-1.682356,-1.492376,-1.546832,-1.462055,-1.289626
CHEMBL4075406,-1.660442,-1.616855,4.674405,-0.394642,-1.721691,-2.146560,-2.010075,-1.721691,-2.167171,-1.919044,...,-2.135989,-2.047583,-1.386578,-0.917610,-0.121709,-1.682356,-1.744090,-1.825587,-0.823551,-1.446853
CHEMBL4069211,-1.816915,-1.753043,4.674405,-0.394642,-1.872260,-2.081201,-1.936714,-1.872260,-2.032828,-2.169510,...,-2.201062,-2.163828,-1.562511,-0.707412,-0.121709,-1.769040,-1.869946,-1.926952,-0.942650,-1.649003


In [27]:
df2.to_csv('betalactamase_mordred_scaled_descriptors.csv', index=True)

In [28]:
from tqdm.auto import tqdm

In [29]:
from sklearn.feature_selection import mutual_info_regression, SelectKBest
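
The two names imported here can be combined to rank descriptors by their estimated mutual information with the response and keep the top `k`. The notebook does not show this step yet, so the following is only a sketch on synthetic data; the array names, the value `k=2`, and the use of `functools.partial` to fix `random_state` are assumptions made for illustration:

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: only the first two of five features influence the response
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 2.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Fix the random state of the estimator so the scores are reproducible
score_func = partial(mutual_info_regression, random_state=0)

# Keep the k features with the highest mutual-information score
selector = SelectKBest(score_func=score_func, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)
selected_idx = selector.get_support(indices=True)
```

In a real workflow the selector would be fitted on the training set only and then applied to the test set with `selector.transform`.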

In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    df2, y, test_size=0.33, random_state=42)

In [31]:
X_train.shape, X_test.shape

((41691, 1613), (20535, 1613))
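
In the cells above, `StandardScaler` is fitted on the full descriptor table before the split, which lets the test compounds contribute to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training set only. A minimal sketch on synthetic data (the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic descriptor table and response
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(100, 3)),
                      columns=["d1", "d2", "d3"])
y_demo = pd.Series(rng.normal(size=100), name="pKi")

# Split first, so the test set plays no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Fit on the training set only, then apply the same transform to both sets
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr),
                           index=X_tr.index, columns=X_tr.columns)
X_te_scaled = pd.DataFrame(scaler.transform(X_te),
                           index=X_te.index, columns=X_te.columns)
```

With this ordering the scaled training columns have zero mean and unit variance, while the test columns are only approximately standardized, which is the expected behavior for unseen data.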

### First, we need to drop correlated descriptors

In [32]:
from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection

In [33]:
Sel = DropCorrelatedFeatures(threshold=0.6,
                             method='pearson',
                             missing_values='ignore')
Sel.fit(X_train)

DropCorrelatedFeatures(threshold=0.6,
                       variables=['ABC', 'ABCGG', 'nAcid', 'nBase', 'SpAbs_A',
                                  'SpMax_A', 'SpDiam_A', 'SpAD_A', 'SpMAD_A',
                                  'LogEE_A', 'VE1_A', 'VE2_A', 'VE3_A', 'VR1_A',
                                  'VR2_A', 'VR3_A', 'nAromAtom', 'nAromBond',
                                  'nAtom', 'nHeavyAtom', 'nSpiro',
                                  'nBridgehead', 'nHetero', 'nH', 'nB', 'nC',
                                  'nN', 'nO', 'nS', 'nP', ...])

In [34]:
len(Sel.features_to_drop_)

1171

In [35]:
X_trainA = Sel.transform(X_train)
X_testA = Sel.transform(X_test)

X_trainA.shape, X_testA.shape

((41691, 442), (20535, 442))
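
To make the idea behind the correlation filter concrete, the sketch below implements a simple greedy version with pandas alone: columns are scanned in order, and a column is dropped when its absolute Pearson correlation with any already-kept column exceeds the threshold. This illustrates the general technique and is not claimed to reproduce the exact procedure used by `feature_engine`; the function name `drop_correlated` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd


def drop_correlated(X: pd.DataFrame, threshold: float = 0.6) -> list:
    """Return the columns to drop under a greedy first-kept, later-dropped rule."""
    corr = X.corr(method="pearson").abs()
    kept, dropped = [], []
    for col in X.columns:
        # Drop the column if it is too correlated with any column kept so far
        if any(corr.loc[col, k] > threshold for k in kept):
            dropped.append(col)
        else:
            kept.append(col)
    return dropped


# Toy data: 'a_copy' is almost identical to 'a', while 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=200), "b": b})

to_drop = drop_correlated(toy, threshold=0.6)
toy_reduced = toy.drop(columns=to_drop)
```

As in the cells above, the list of columns to drop should be determined on the training set and then applied unchanged to the test set.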

# To be continued...