# ADN_T004. QSAR models

Authors:
* Adnane Aouidate, (2019-2020), Computer Aided Drug Discovery Center, Shenzhen Institute of Advanced Technology(SIAT), Shenzhen, China.
* Adnane Aouidate, (2021-2022), Structural Bioinformatics and Chemoinformatics, Institute of Organic and Analytical Chemistry (ICOA), Orléans, France.
* Update , 2023, Ait Melloul Faculty of Applied Sciences, Ibn Zohr University, Agadir, Morocco,

### Aim of this tutorial

In this tutorial, you will learn how to build and validate a Quantitative Structure-Activity Relationship (QSAR) model using data from the ChEMBL (Chemical Entities of Biological Interest) database.

QSAR models are **useful techniques in drug discovery research** and are frequently utilized in the hit-to-lead and lead optimization steps by drug discovery researchers.

QSAR **is a technique that allows researchers to identify new drug candidates** by predicting which compounds are likely to be active against a target molecule.

In this tutorial, you will learn how to use the Python scikit-learn library in order to preprocess and curate your data, and then use it in machine learning-based QSAR models.

**Let's get started!**


In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mordred import descriptors, Calculator
from rdkit import Chem
from rdkit.Chem import AllChem
print(np.__version__)

1.21.5


In [23]:
df = pd.read_csv('./databases/acetylcholinesterase_pKi_mordredMD.csv', index_col=0)

In [24]:
df.dropna(how= 'any', inplace=True)

In [25]:
df

Unnamed: 0_level_0,ABC,ABCGG,nAcid,nBase,nAromAtom,nAromBond,nAtom,nHeavyAtom,nSpiro,nBridgehead,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb2,pKi
molecule_chembl_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
CHEMBL11805,35.142811,24.264889,0,2,12,12,114,48,0,0,...,10.238852,84.598254,666.508407,5.846565,14630,63,218.0,238.0,11.361111,9.982967
CHEMBL208599,16.987142,12.790496,0,0,10,11,40,21,0,2,...,10.293467,55.839709,298.123676,7.453092,842,39,120.0,147.0,4.472222,10.585027
CHEMBL60745,8.850899,8.508709,1,1,6,6,29,13,0,0,...,9.303375,43.773162,245.041526,8.449708,1200000190,16,58.0,65.0,2.708333,8.787812
CHEMBL95,11.968445,9.625522,0,0,10,11,29,15,0,0,...,9.827416,47.796305,198.115698,6.831576,326,25,82.0,99.0,3.277778,6.821023
CHEMBL173309,36.338245,25.499176,0,2,12,12,120,50,0,0,...,10.283053,86.794615,694.539707,5.787831,16085,67,226.0,248.0,12.027778,7.913640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL5220695,20.523035,16.217858,0,1,6,6,58,27,0,0,...,9.925102,61.461191,377.231456,6.503991,2240,35,132.0,145.0,5.930556,6.920819
CHEMBL5219239,21.299091,16.948723,0,1,6,6,61,28,0,0,...,10.015431,62.768394,391.247107,6.413887,2463,38,138.0,153.0,6.125000,6.769551
CHEMBL5218804,17.706179,13.397976,0,0,12,12,44,23,0,0,...,9.899530,57.005714,311.152144,7.071640,1376,33,116.0,133.0,5.222222,9.578396
CHEMBL5219425,12.915350,11.452675,1,1,0,0,43,18,0,2,...,9.909817,51.182410,368.096076,8.560374,1700000520,25,88.0,103.0,3.847222,5.455932


In [26]:
y = df['pKi']

In [19]:
df.isnull().sum().sum()

0

In [28]:
df.dtypes

ABC          float64
ABCGG        float64
nAcid          int64
nBase          int64
nAromAtom      int64
              ...   
WPol           int64
Zagreb1      float64
Zagreb2      float64
mZagreb2     float64
pKi          float64
Length: 1027, dtype: object

In [33]:
df_indices = df.index
data_columuns = df.columns

### Convert values to mumerics

In [34]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler

In [36]:
Scaler = StandardScaler().fit(df)
df1 = Scaler.transform(df)
df1 = pd.DataFrame(data= df1, index=df_indices, columns= data_columuns)
df1

Unnamed: 0_level_0,ABC,ABCGG,nAcid,nBase,nAromAtom,nAromBond,nAtom,nHeavyAtom,nSpiro,nBridgehead,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb2,pKi
molecule_chembl_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
CHEMBL11805,1.708479,1.735525,-0.382930,1.753151,-0.193073,-0.257035,2.868298,2.015231,-0.089721,-0.163233,...,0.178133,1.212683,1.942580,-1.141241,-0.299641,1.009158,1.338423,1.017473,2.352085,1.870065
CHEMBL208599,-0.704518,-0.846431,-0.382930,-0.991417,-0.507315,-0.400440,-0.775018,-0.847242,-0.089721,5.340041,...,0.285876,-0.945155,-0.905936,-0.163119,-0.299645,-0.341633,-0.558565,-0.425854,-0.921218,2.191256
CHEMBL60745,-1.785873,-1.809915,1.406605,0.380867,-1.135801,-1.117465,-1.316592,-1.695382,-0.089721,-0.163233,...,-1.667368,-1.850543,-1.316391,0.443663,0.040745,-1.636141,-1.758700,-1.726434,-1.759342,1.232467
CHEMBL95,-1.371532,-1.558610,-0.382930,-0.991417,-0.507315,-0.400440,-1.316592,-1.483347,-0.089721,-0.163233,...,-0.633544,-1.548675,-1.679243,-0.541524,-0.299645,-1.129594,-1.294131,-1.187169,-1.488766,0.183211
CHEMBL173309,1.867360,2.013263,-0.382930,1.753151,-0.193073,-0.257035,3.163702,2.227266,-0.089721,-0.163233,...,0.265333,1.377482,2.159331,-1.177000,-0.299641,1.234290,1.493279,1.176080,2.668856,0.766107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL5220695,-0.234576,-0.075210,-0.382930,0.380867,-1.135801,-1.117465,0.111194,-0.211137,-0.089721,-0.163233,...,-0.440830,-0.523359,-0.294239,-0.740972,-0.299645,-0.566765,-0.326281,-0.457575,-0.228281,0.236450
CHEMBL5219239,-0.131434,0.089249,-0.382930,0.380867,-1.135801,-1.117465,0.258896,-0.105119,-0.089721,-0.163233,...,-0.262629,-0.425275,-0.185864,-0.795831,-0.299645,-0.397916,-0.210138,-0.330690,-0.135889,0.155751
CHEMBL5218804,-0.608953,-0.709737,-0.382930,-0.991417,-0.193073,-0.257035,-0.578082,-0.635207,-0.089721,-0.163233,...,-0.491279,-0.857666,-0.805194,-0.395363,-0.299645,-0.679330,-0.635993,-0.647904,-0.564850,1.654232
CHEMBL5219425,-1.245683,-1.147466,1.406605,0.380867,-2.078529,-1.977894,-0.627316,-1.165294,-0.089721,5.340041,...,-0.470984,-1.294605,-0.364878,0.511041,0.182574,-1.129594,-1.177989,-1.123726,-1.218191,-0.545047


In [27]:
df2.to_csv('betalactamase_mordred_scaled_descriptors.csv', index=True)

In [28]:
from tqdm.auto import tqdm

In [29]:
from sklearn.feature_selection import mutual_info_regression, SelectKBest

In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    df2, y, test_size=0.33, random_state=42)

In [31]:
X_train.shape, X_test.shape

((41691, 1613), (20535, 1613))

### First we need to drop coorelated descriptors 

In [32]:
from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection

In [33]:
Sel = DropCorrelatedFeatures(threshold= 0.6,
                             method= 'pearson', 
                            missing_values='ignore')
Sel.fit(X_train)

DropCorrelatedFeatures(threshold=0.6,
                       variables=['ABC', 'ABCGG', 'nAcid', 'nBase', 'SpAbs_A',
                                  'SpMax_A', 'SpDiam_A', 'SpAD_A', 'SpMAD_A',
                                  'LogEE_A', 'VE1_A', 'VE2_A', 'VE3_A', 'VR1_A',
                                  'VR2_A', 'VR3_A', 'nAromAtom', 'nAromBond',
                                  'nAtom', 'nHeavyAtom', 'nSpiro',
                                  'nBridgehead', 'nHetero', 'nH', 'nB', 'nC',
                                  'nN', 'nO', 'nS', 'nP', ...])

In [34]:
len(Sel.features_to_drop_)

1171

In [35]:
X_trainA = Sel.transform(X_train)
X_testA = Sel.transform(X_test)

X_trainA.shape, X_testA.shape

((41691, 442), (20535, 442))

# To be continued...