# ADN_T004. QSAR models

Authors:
* Adnane Aouidate, (2019-2020), Computer Aided Drug Discovery Center, Shenzhen Institute of Advanced Technology (SIAT), Shenzhen, China.
* Adnane Aouidate, (2021-2022), Structural Bioinformatics and Chemoinformatics, Institute of Organic and Analytical Chemistry (ICOA), Orléans, France.
* Update, 2023, Ait Melloul Faculty of Applied Sciences, Ibn Zohr University, Agadir, Morocco.

### Aim of this tutorial

In this tutorial, you will learn how to build and validate a Quantitative Structure-Activity Relationship (QSAR) model using data from the ChEMBL database, a manually curated resource of bioactive molecules with drug-like properties.

QSAR models are **widely used in drug discovery research**, particularly in the hit-to-lead and lead optimization steps.

A QSAR model **relates numerical descriptors of a compound's structure to its measured biological activity**, which helps researchers prioritize new drug candidates by predicting which compounds are likely to be active against a target.

You will use the Python scikit-learn library to preprocess and curate the data, and then prepare it for machine learning-based QSAR models.

**Let's get started!**


In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mordred import descriptors, Calculator
from rdkit import Chem
from rdkit.Chem import AllChem
print(np.__version__)

1.21.5


In [49]:
df = pd.read_csv('./databases/acetylcholinesterase_pKi_mordredMD.csv', index_col=0)

In [50]:
df.dropna(how= 'any', inplace=True)

In [51]:
df

molecule_chembl_id,ABC,ABCGG,nAcid,nBase,nAromAtom,nAromBond,nAtom,nHeavyAtom,nSpiro,nBridgehead,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb2,pKi
CHEMBL11805,35.142811,24.264889,0,2,12,12,114,48,0,0,...,10.238852,84.598254,666.508407,5.846565,14630,63,218.0,238.0,11.361111,9.982967
CHEMBL208599,16.987142,12.790496,0,0,10,11,40,21,0,2,...,10.293467,55.839709,298.123676,7.453092,842,39,120.0,147.0,4.472222,10.585027
CHEMBL60745,8.850899,8.508709,1,1,6,6,29,13,0,0,...,9.303375,43.773162,245.041526,8.449708,1200000190,16,58.0,65.0,2.708333,8.787812
CHEMBL95,11.968445,9.625522,0,0,10,11,29,15,0,0,...,9.827416,47.796305,198.115698,6.831576,326,25,82.0,99.0,3.277778,6.821023
CHEMBL173309,36.338245,25.499176,0,2,12,12,120,50,0,0,...,10.283053,86.794615,694.539707,5.787831,16085,67,226.0,248.0,12.027778,7.913640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL5220695,20.523035,16.217858,0,1,6,6,58,27,0,0,...,9.925102,61.461191,377.231456,6.503991,2240,35,132.0,145.0,5.930556,6.920819
CHEMBL5219239,21.299091,16.948723,0,1,6,6,61,28,0,0,...,10.015431,62.768394,391.247107,6.413887,2463,38,138.0,153.0,6.125000,6.769551
CHEMBL5218804,17.706179,13.397976,0,0,12,12,44,23,0,0,...,9.899530,57.005714,311.152144,7.071640,1376,33,116.0,133.0,5.222222,9.578396
CHEMBL5219425,12.915350,11.452675,1,1,0,0,43,18,0,2,...,9.909817,51.182410,368.096076,8.560374,1700000520,25,88.0,103.0,3.847222,5.455932


In [5]:
y = df['pKi']
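
The target column `pKi` is the negative base-10 logarithm of the inhibition constant Ki expressed in mol/L. The sketch below shows that conversion for Ki values given in nanomolar; the helper name `ki_nm_to_pki` and the example values are hypothetical and are not taken from the dataset used in this tutorial.

```python
import numpy as np
import pandas as pd


def ki_nm_to_pki(ki_nm):
    """Convert Ki in nanomolar to pKi = -log10(Ki in mol/L)."""
    ki_nm = np.asarray(ki_nm, dtype=float)
    # 1 nM = 1e-9 mol/L, so -log10(Ki_nM * 1e-9) = 9 - log10(Ki_nM)
    return 9.0 - np.log10(ki_nm)


# Hypothetical Ki values in nM, for illustration only
example = pd.DataFrame({"Ki_nM": [1.0, 10.0, 1000.0]})
example["pKi"] = ki_nm_to_pki(example["Ki_nM"])
print(example)
```

A larger pKi means a more potent compound: each unit corresponds to a tenfold decrease in Ki.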

In [6]:
df.isnull().sum().sum()

0

In [7]:
df.dtypes

ABC          float64
ABCGG        float64
nAcid          int64
nBase          int64
nAromAtom      int64
              ...   
WPol           int64
Zagreb1      float64
Zagreb2      float64
mZagreb2     float64
pKi          float64
Length: 1027, dtype: object

In [8]:
df_indices = df.index
data_columuns = df.columns

### Convert values to numerics

In [9]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler

In [10]:
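# Standardize every column to zero mean and unit variance.
# Note: df still contains pKi, so the target column is rescaled too; the unscaled target is kept in y.
Scaler = StandardScaler().fit(df)
df1 = Scaler.transform(df)
# transform() returns a NumPy array, so restore the index and column labels
df1 = pd.DataFrame(data=df1, index=df_indices, columns=data_columuns)
df1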

molecule_chembl_id,ABC,ABCGG,nAcid,nBase,nAromAtom,nAromBond,nAtom,nHeavyAtom,nSpiro,nBridgehead,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb2,pKi
CHEMBL11805,1.708479,1.735525,-0.382930,1.753151,-0.193073,-0.257035,2.868298,2.015231,-0.089721,-0.163233,...,0.178133,1.212683,1.942580,-1.141241,-0.299641,1.009158,1.338423,1.017473,2.352085,1.870065
CHEMBL208599,-0.704518,-0.846431,-0.382930,-0.991417,-0.507315,-0.400440,-0.775018,-0.847242,-0.089721,5.340041,...,0.285876,-0.945155,-0.905936,-0.163119,-0.299645,-0.341633,-0.558565,-0.425854,-0.921218,2.191256
CHEMBL60745,-1.785873,-1.809915,1.406605,0.380867,-1.135801,-1.117465,-1.316592,-1.695382,-0.089721,-0.163233,...,-1.667368,-1.850543,-1.316391,0.443663,0.040745,-1.636141,-1.758700,-1.726434,-1.759342,1.232467
CHEMBL95,-1.371532,-1.558610,-0.382930,-0.991417,-0.507315,-0.400440,-1.316592,-1.483347,-0.089721,-0.163233,...,-0.633544,-1.548675,-1.679243,-0.541524,-0.299645,-1.129594,-1.294131,-1.187169,-1.488766,0.183211
CHEMBL173309,1.867360,2.013263,-0.382930,1.753151,-0.193073,-0.257035,3.163702,2.227266,-0.089721,-0.163233,...,0.265333,1.377482,2.159331,-1.177000,-0.299641,1.234290,1.493279,1.176080,2.668856,0.766107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL5220695,-0.234576,-0.075210,-0.382930,0.380867,-1.135801,-1.117465,0.111194,-0.211137,-0.089721,-0.163233,...,-0.440830,-0.523359,-0.294239,-0.740972,-0.299645,-0.566765,-0.326281,-0.457575,-0.228281,0.236450
CHEMBL5219239,-0.131434,0.089249,-0.382930,0.380867,-1.135801,-1.117465,0.258896,-0.105119,-0.089721,-0.163233,...,-0.262629,-0.425275,-0.185864,-0.795831,-0.299645,-0.397916,-0.210138,-0.330690,-0.135889,0.155751
CHEMBL5218804,-0.608953,-0.709737,-0.382930,-0.991417,-0.193073,-0.257035,-0.578082,-0.635207,-0.089721,-0.163233,...,-0.491279,-0.857666,-0.805194,-0.395363,-0.299645,-0.679330,-0.635993,-0.647904,-0.564850,1.654232
CHEMBL5219425,-1.245683,-1.147466,1.406605,0.380867,-2.078529,-1.977894,-0.627316,-1.165294,-0.089721,5.340041,...,-0.470984,-1.294605,-0.364878,0.511041,0.182574,-1.129594,-1.177989,-1.123726,-1.218191,-0.545047
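
The cell above fits the scaler on the whole table, so the `pKi` column is standardized together with the descriptors. A common alternative is to scale the descriptors only and leave the target in its original units. The following is a minimal sketch of that variant on made-up data; the frame `toy` and its column names are assumptions used purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the descriptor table (values are made up)
toy = pd.DataFrame(
    {
        "desc_a": [1.0, 2.0, 3.0, 4.0],
        "desc_b": [10.0, 10.0, 30.0, 30.0],
        "pKi": [5.0, 6.0, 7.0, 8.0],
    },
    index=["mol1", "mol2", "mol3", "mol4"],
)

features = toy.drop(columns="pKi")  # descriptors only
target = toy["pKi"]                 # left in original pKi units

scaler = StandardScaler().fit(features)
features_scaled = pd.DataFrame(
    scaler.transform(features),
    index=features.index,
    columns=features.columns,
)

# Each descriptor now has mean 0 and (population) standard deviation 1
print(features_scaled.mean().round(6))
print(features_scaled.std(ddof=0).round(6))
```

In a stricter workflow the scaler would be fitted on the training split only and then applied to the test split, so that no information from the test compounds enters the preprocessing.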


In [12]:
df1.to_csv('./databases/acetylcholinesterase_mordred_scaled_descriptors.csv', index=True)

In [13]:
from tqdm.auto import tqdm

## Remove features that display high correlation with other features 

In [26]:
correlated_features_1 = set()
corr_matrix = df1.corr() 

In [27]:
corr_matrix

Unnamed: 0,ABC,ABCGG,nAcid,nBase,nAromAtom,nAromBond,nAtom,nHeavyAtom,nSpiro,nBridgehead,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb2,pKi
ABC,1.000000,0.975893,-0.017485,0.156459,0.734489,0.732963,0.926278,0.993147,0.112880,0.095730,...,0.791155,0.873177,0.921575,-0.277788,0.180991,0.958953,0.993935,0.978759,0.979102,0.330915
ABCGG,0.975893,1.000000,-0.020316,0.150692,0.664477,0.658610,0.895024,0.969510,0.180522,0.088568,...,0.806022,0.884833,0.900923,-0.249467,0.227039,0.959700,0.976799,0.966753,0.954461,0.282831
nAcid,-0.017485,-0.020316,1.000000,0.468423,0.049378,0.039674,0.040153,0.051392,0.034454,0.062684,...,-0.102307,-0.054238,0.298560,0.400702,0.632917,-0.064457,-0.038783,-0.059341,0.017197,-0.116094
nBase,0.156459,0.150692,0.468423,1.000000,0.067140,0.044946,0.249803,0.200010,0.034172,0.046170,...,0.053970,0.193255,0.292760,-0.006248,0.260757,0.075135,0.132486,0.107642,0.196819,-0.151362
nAromAtom,0.734489,0.664477,0.049378,0.067140,1.000000,0.996773,0.533746,0.707204,0.000802,-0.024188,...,0.715977,0.642245,0.662290,-0.019054,0.036271,0.711422,0.746545,0.750792,0.665723,0.296372
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WPol,0.958953,0.959700,-0.064457,0.075135,0.711422,0.710969,0.843061,0.934784,0.182161,0.128635,...,0.884904,0.864715,0.854810,-0.227681,0.136115,1.000000,0.980536,0.990781,0.904882,0.302159
Zagreb1,0.993935,0.976799,-0.038783,0.132486,0.746545,0.745032,0.894446,0.977376,0.139934,0.107885,...,0.843405,0.890066,0.900943,-0.256871,0.162405,0.980536,1.000000,0.995180,0.954042,0.327956
Zagreb2,0.978759,0.966753,-0.059341,0.107642,0.750792,0.749837,0.859312,0.953987,0.166101,0.116339,...,0.879422,0.896116,0.872244,-0.240430,0.143242,0.990781,0.995180,1.000000,0.924220,0.321681
mZagreb2,0.979102,0.954461,0.017197,0.196819,0.665723,0.663627,0.962763,0.991499,0.078950,0.078617,...,0.682087,0.828639,0.926827,-0.324511,0.218753,0.904882,0.954042,0.924220,1.000000,0.316342


In [28]:
# Flag column i when its absolute Pearson correlation with any earlier column j exceeds 0.8
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            colname = corr_matrix.columns[i]
            correlated_features_1.add(colname)
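
The nested loop visits every pair of columns once and flags the later column of each highly correlated pair. The same rule can be written with a lower-triangle mask, which avoids the explicit Python loop over pairs. This is a sketch on made-up data; the function name `find_correlated` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def find_correlated(frame, threshold=0.8):
    """Return columns whose |r| with any earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only entries strictly below the diagonal (row i vs. columns j < i);
    # everything else becomes NaN and never passes the comparison.
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    lower = corr.where(mask)
    return {col for col in lower.index if (lower.loc[col] > threshold).any()}


# Toy data: b is almost a multiple of a, c is unrelated to both
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [2.0, 4.0, 6.0, 8.0, 10.5],
        "c": [1.0, -1.0, 1.0, -1.0, 1.0],
    }
)
print(find_correlated(toy))
```

As in the loop version, the first column of a correlated pair is kept and the later one is flagged, so the result depends on column order.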

In [29]:
correlated_features_1

{'AATS0are',
 'AATS0d',
 'AATS0m',
 'AATS0p',
 'AATS0pe',
 'AATS0se',
 'AATS0v',
 'AATS1are',
 'AATS1d',
 'AATS1dv',
 'AATS1i',
 'AATS1m',
 'AATS1p',
 'AATS1pe',
 'AATS1se',
 'AATS1v',
 'AATS2Z',
 'AATS2are',
 'AATS2d',
 'AATS2dv',
 'AATS2i',
 'AATS2m',
 'AATS2p',
 'AATS2pe',
 'AATS2se',
 'AATS2v',
 'AATS3Z',
 'AATS3are',
 'AATS3d',
 'AATS3dv',
 'AATS3m',
 'AATS3p',
 'AATS3pe',
 'AATS3se',
 'AATS3v',
 'AATS4Z',
 'AATS4are',
 'AATS4d',
 'AATS4dv',
 'AATS4m',
 'AATS4p',
 'AATS4pe',
 'AATS4se',
 'AATS4v',
 'AATS5Z',
 'AATS5are',
 'AATS5d',
 'AATS5dv',
 'AATS5m',
 'AATS5p',
 'AATS5pe',
 'AATS5se',
 'AATS5v',
 'AATSC0Z',
 'AATSC0are',
 'AATSC0d',
 'AATSC0dv',
 'AATSC0i',
 'AATSC0m',
 'AATSC0p',
 'AATSC0pe',
 'AATSC0se',
 'AATSC1Z',
 'AATSC1are',
 'AATSC1i',
 'AATSC1m',
 'AATSC1p',
 'AATSC1pe',
 'AATSC1se',
 'AATSC1v',
 'AATSC2Z',
 'AATSC2are',
 'AATSC2c',
 'AATSC2d',
 'AATSC2dv',
 'AATSC2i',
 'AATSC2m',
 'AATSC2p',
 'AATSC2pe',
 'AATSC2se',
 'AATSC2v',
 'AATSC3Z',
 'AATSC3are',
 'AATSC3c',


In [30]:
## Remove correlated features.
def remove_correlated_features(features, data):
    for x in features:
        data.drop(x, axis=1, inplace=True)
    return data

In [32]:
remove_correlated_features(correlated_features_1, df1)

molecule_chembl_id,ABC,nAcid,nBase,nAromAtom,nSpiro,nBridgehead,nHetero,nB,nN,nO,...,JGI3,JGI4,JGI5,JGI6,JGI7,JGI8,JGI9,JGI10,SRW03,pKi
CHEMBL11805,1.708479,-0.382930,1.753151,-0.193073,-0.089721,-0.163233,0.730146,-0.065233,0.955423,0.530488,...,-0.549645,-0.656709,-0.615962,-0.979284,-0.583657,-0.531796,-0.738214,0.179914,0.0,1.870065
CHEMBL208599,-0.704518,-0.382930,-0.991417,-0.507315,-0.089721,5.340041,-1.353464,-0.065233,-0.266690,-1.569221,...,0.915764,0.739583,1.129886,0.406215,1.394307,0.120460,1.911806,2.863825,0.0,2.191256
CHEMBL60745,-1.785873,1.406605,0.380867,-1.135801,-0.089721,-0.163233,-1.353464,-0.065233,-0.877747,-1.044294,...,0.804034,1.265459,1.000292,0.554652,-2.569629,-2.306414,-1.909838,-1.837927,0.0,1.232467
CHEMBL95,-1.371532,-0.382930,-0.991417,-0.507315,-0.089721,-0.163233,-1.770186,-0.065233,-0.266690,-1.569221,...,0.231339,1.478972,-0.618365,-0.677716,-2.569629,-2.306414,-1.909838,-1.837927,0.0,0.183211
CHEMBL173309,1.867360,-0.382930,1.753151,-0.193073,-0.089721,-0.163233,0.730146,-0.065233,0.955423,0.530488,...,-0.728975,-0.859856,-0.837146,-0.882694,-0.591194,-0.356611,-0.630121,0.011761,0.0,0.766107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL5220695,-0.234576,-0.382930,0.380867,-1.135801,-0.089721,-0.163233,0.313424,-0.065233,0.344367,0.530488,...,0.999675,-0.840255,0.087875,0.857119,0.201399,1.598277,-0.390424,-0.139805,0.0,0.236450
CHEMBL5219239,-0.131434,-0.382930,0.380867,-1.135801,-0.089721,-0.163233,0.313424,-0.065233,0.344367,0.530488,...,1.295477,-0.511460,0.200874,0.728633,0.325246,1.532909,-0.150967,0.119500,0.0,0.155751
CHEMBL5218804,-0.608953,-0.382930,-0.991417,-0.193073,-0.089721,-0.163233,-0.936742,-0.065233,-0.877747,0.005561,...,-0.686181,-0.160577,-0.221959,0.242935,-0.255752,-0.594722,-0.535216,-0.079403,0.0,1.654232
CHEMBL5219425,-1.245683,1.406605,0.380867,-2.078529,-0.089721,5.340041,-0.520020,-0.065233,-0.266690,-0.519366,...,0.430537,0.555944,0.623707,0.627144,1.514427,0.988169,-1.909838,-1.837927,0.0,-0.545047


## Remove descriptors with low variance

In [33]:
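from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold=0.15):
    # Keep only the columns whose variance exceeds the threshold
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]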

In [39]:
df1 = variance_threshold_selector(df1)
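
Because `df1` was standardized earlier, every non-constant column already has unit variance, so a threshold of 0.15 can only remove columns that are constant. The sketch below illustrates this on made-up data by applying the same threshold before and after scaling; the frame `raw` and its column names are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame(
    {
        "wide": [0.0, 10.0, 20.0, 30.0],
        "narrow": [1.00, 1.01, 1.00, 1.01],  # tiny variance on the raw scale
        "constant": [5.0, 5.0, 5.0, 5.0],
    }
)
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# Same threshold, applied to raw and to standardized values
kept_raw = list(raw.columns[VarianceThreshold(0.15).fit(raw).get_support()])
kept = list(scaled.columns[VarianceThreshold(0.15).fit(scaled).get_support()])

print("raw:   ", kept_raw)
print("scaled:", kept)
```

If the goal is to discard near-constant descriptors, the filter is more informative when applied to the unscaled values.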

In [40]:
df1

molecule_chembl_id,ABC,nAcid,nBase,nAromAtom,nSpiro,nBridgehead,nHetero,nB,nN,nO,...,JGI2,JGI3,JGI4,JGI5,JGI6,JGI7,JGI8,JGI9,JGI10,pKi
CHEMBL11805,1.708479,-0.382930,1.753151,-0.193073,-0.089721,-0.163233,0.730146,-0.065233,0.955423,0.530488,...,0.233026,-0.549645,-0.656709,-0.615962,-0.979284,-0.583657,-0.531796,-0.738214,0.179914,1.870065
CHEMBL208599,-0.704518,-0.382930,-0.991417,-0.507315,-0.089721,5.340041,-1.353464,-0.065233,-0.266690,-1.569221,...,-1.026694,0.915764,0.739583,1.129886,0.406215,1.394307,0.120460,1.911806,2.863825,2.191256
CHEMBL60745,-1.785873,1.406605,0.380867,-1.135801,-0.089721,-0.163233,-1.353464,-0.065233,-0.877747,-1.044294,...,2.493112,0.804034,1.265459,1.000292,0.554652,-2.569629,-2.306414,-1.909838,-1.837927,1.232467
CHEMBL95,-1.371532,-0.382930,-0.991417,-0.507315,-0.089721,-0.163233,-1.770186,-0.065233,-0.266690,-1.569221,...,0.023073,0.231339,1.478972,-0.618365,-0.677716,-2.569629,-2.306414,-1.909838,-1.837927,0.183211
CHEMBL173309,1.867360,-0.382930,1.753151,-0.193073,-0.089721,-0.163233,0.730146,-0.065233,0.955423,0.530488,...,0.090800,-0.728975,-0.859856,-0.837146,-0.882694,-0.591194,-0.356611,-0.630121,0.011761,0.766107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL5220695,-0.234576,-0.382930,0.380867,-1.135801,-0.089721,-0.163233,0.313424,-0.065233,0.344367,0.530488,...,-0.529436,0.999675,-0.840255,0.087875,0.857119,0.201399,1.598277,-0.390424,-0.139805,0.236450
CHEMBL5219239,-0.131434,-0.382930,0.380867,-1.135801,-0.089721,-0.163233,0.313424,-0.065233,0.344367,0.530488,...,0.233026,1.295477,-0.511460,0.200874,0.728633,0.325246,1.532909,-0.150967,0.119500,0.155751
CHEMBL5218804,-0.608953,-0.382930,-0.991417,-0.193073,-0.089721,-0.163233,-0.936742,-0.065233,-0.877747,0.005561,...,-0.740394,-0.686181,-0.160577,-0.221959,0.242935,-0.255752,-0.594722,-0.535216,-0.079403,1.654232
CHEMBL5219425,-1.245683,1.406605,0.380867,-2.078529,-0.089721,5.340041,-0.520020,-0.065233,-0.266690,-0.519366,...,1.638099,0.430537,0.555944,0.623707,0.627144,1.514427,0.988169,-1.909838,-1.837927,-0.545047


## Train/test split and feature selection

In [86]:
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.metrics import r2_score

In [87]:
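# Hold out 33% of the compounds as a test set.
# Caution: df1 still includes the scaled pKi column at this point.
X_train, X_test, y_train, y_test = train_test_split(
    df1, y, test_size=0.33, random_state=42)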

In [88]:
X_train.shape, X_test.shape

((316, 236), (156, 236))
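
The feature table passed to `train_test_split` above still lists `pKi` among its columns. To keep the target out of the feature matrix, it can be dropped before splitting. This is a minimal sketch on made-up data; the frame `toy` and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table with two descriptors and the target (values are made up)
toy = pd.DataFrame(
    {
        "desc_a": list(range(10)),
        "desc_b": [v % 3 for v in range(10)],
        "pKi": [5.0 + 0.1 * v for v in range(10)],
    }
)

X = toy.drop(columns="pKi")  # features only
y_toy = toy["pKi"]           # target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_toy, test_size=0.33, random_state=42
)
print(X_tr.shape, X_te.shape)
```

`train_test_split` shuffles features and target together, so the row labels of `X_tr` and `y_tr` stay aligned.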

### First, we need to drop correlated descriptors

In [113]:
from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection
from sklearn.feature_selection import VarianceThreshold, f_regression

In [90]:
Sel = DropCorrelatedFeatures(threshold=0.15,
                             method='pearson',
                             missing_values='ignore')
Sel.fit(X_train, y_train)

DropCorrelatedFeatures(threshold=0.15)

In [91]:
len(Sel.features_to_drop_)

217

In [92]:
#y_trainA = Sel.fit_transform(X_trainA)

In [100]:
X_trainA = Sel.transform(X_train)
X_testA = Sel.transform(X_test)

X_trainA.shape, X_testA.shape

((316, 19), (156, 19))

In [107]:
y_trainA = y_train.loc[X_trainA.index]

In [108]:
y_trainA

molecule_chembl_id
CHEMBL4292477    7.207608
CHEMBL3326700    7.404504
CHEMBL4080726    4.879426
CHEMBL3326711    7.258061
CHEMBL4583534    7.000000
                   ...   
CHEMBL592857     3.508638
CHEMBL3235217    5.511873
CHEMBL3819155    5.549443
CHEMBL4783427    6.404504
CHEMBL151        4.181774
Name: pKi, Length: 316, dtype: float64

In [133]:
select_reg = SelectKBest(f_regression,k=5)
select_reg.fit(X_trainA, y_train)
X_trainB = select_reg.transform(X_trainA)
X_testB = select_reg.transform(X_testA)



In [134]:
X_trainC = X_trainA.iloc[:,select_reg.get_support()]
X_testC = X_testA.iloc[:,select_reg.get_support()]

In [136]:
X_trainC.shape, X_testC.shape

((316, 5), (156, 5))
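
`SelectKBest` with `f_regression` scores each descriptor separately by a univariate F-test against the target and keeps the `k` highest-scoring ones; `get_support()` returns the boolean mask used above to recover the column names. The sketch below shows the same pattern on synthetic data where the informative columns are known in advance; the column names `d0` to `d5` and the synthetic target are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 6)), columns=[f"d{i}" for i in range(6)])
# Synthetic target that depends only on d1 and d4 (assumption for the demo)
y_demo = 3.0 * X["d1"] - 2.0 * X["d4"] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(f_regression, k=2).fit(X, y_demo)

mask = selector.get_support()        # boolean mask over the input columns
selected = list(X.columns[mask])     # names of the retained descriptors
X_sel = X.loc[:, mask]               # DataFrame keeps index and column labels

print(selected)
print(pd.Series(selector.scores_, index=X.columns).round(1))
```

Since each descriptor is scored on its own, this filter does not account for redundancy between the selected descriptors, which is why a collinearity check is still useful afterwards.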

In [135]:
X_trainC.head(), X_testC.head()

(                         ABC  nBridgehead    ATSC4c    GATS2d  PEOE_VSA13
 molecule_chembl_id                                                       
 CHEMBL4292477       0.166579    -0.163233 -1.148664  0.384693    1.131203
 CHEMBL3326700       0.788392    -0.163233  0.660501 -1.190017    1.131203
 CHEMBL4080726       0.195113    -0.163233 -0.261420 -1.641450   -0.475683
 CHEMBL3326711       0.976350    -0.163233  1.515439 -1.185167    1.131203
 CHEMBL4583534      -0.369609    -0.163233 -0.023089  0.451069   -0.475683,
                          ABC  nBridgehead    ATSC4c    GATS2d  PEOE_VSA13
 molecule_chembl_id                                                       
 CHEMBL254300       -1.277506    -0.163233  0.557457  2.733309   -0.475683
 CHEMBL481           1.631985    -0.163233 -0.921737 -0.468875    1.036562
 CHEMBL1255901      -1.059564     5.340041  0.175167  0.792889   -0.475683
 CHEMBL3235225      -0.133685    -0.163233  0.034537  1.096678   -0.475683
 CHEMBL2323355      -0.2

In [140]:
# Code from Stack Overflow: "Variance Inflation Factor in Python"
# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif


In [144]:
sklearn_vif(X_trainC.columns, X_trainC)

Unnamed: 0,VIF,Tolerance
ABC,1.067244,0.936993
nBridgehead,1.036082,0.965174
ATSC4c,1.039978,0.961559
GATS2d,1.026348,0.974329
PEOE_VSA13,1.017326,0.982969
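
The VIF of a descriptor is 1 / (1 - R²), where R² comes from regressing that descriptor on all the others; values close to 1, as in the table above, indicate little collinearity. The same numbers can be obtained from the diagonal of the inverse correlation matrix. The sketch below checks that equivalence on synthetic data; the columns `a`, `b`, `c` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X["c"] = 0.7 * X["a"] + 0.3 * X["c"]  # make c partly collinear with a

# Regression-based VIF: regress each column on the remaining ones
vif_reg = {}
for col in X.columns:
    others = X.drop(columns=col)
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    vif_reg[col] = 1.0 / (1.0 - r2)

# Matrix shortcut: diagonal of the inverse of the correlation matrix
vif_inv = np.diag(np.linalg.inv(X.corr().to_numpy()))

print(pd.DataFrame({"regression": vif_reg, "inverse_corr": vif_inv}, index=X.columns))
```

The matrix form needs a single inversion instead of one regression per descriptor, but it becomes numerically unstable when descriptors are almost perfectly collinear.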


# To be continued...