# Machine-learning based predictor construction of CES2 inhibitors

In this demo, we construct an ML-based predictor for the activity of CES2 inhibitors. Five ML algorithms were employed to construct the predictor, using a total of 734 samples with confirmed activities. The 734 CES2 inhibitors were first divided into a training/test set (i.e., Modeling_set, with 433 positives/579 negatives) and a separate test set (i.e., Validation_set, with 144 positives/500 negatives). Users can also build their own predictor by following this protocol.

## Import modules

In [2]:
from rdkit.Chem import MACCSkeys
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina

In [3]:
import os
import pandas as pd
import sys
import numpy as np
os.chdir('./')
#print(os.getcwd())

## Molecular characterization

In [4]:
sys.path.append('./release/') 
from Data_preprocess import load_data,calcMCFP

Load data

In [5]:
path = "Dataset_CES2_inhibitors_pIC50.csv"

dataset,canonical_smi,canonical_mols = load_data(path)

Calculate MCFP descriptors

In [6]:
pred_data = calcMCFP(mols = canonical_mols, dataset = dataset)

## Predictors construction

In [7]:
# Import our predictor modules
from Predictor import fit_model, save_model, model_predict

# Import ML algorithm from sklearn
#import os
#import matplotlib.pyplot as plt
from sklearn import tree
from sklearn import linear_model
from sklearn import svm
from sklearn import neighbors
from sklearn import ensemble
#from sklearn.ensemble import BaggingRegressor
#from sklearn.tree import ExtraTreeRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, median_absolute_error

In this section, users can adjust the model parameters to suit their own predictor. In our work, the parameters were left at their defaults unless set explicitly in the cell below.

In [8]:
#### DT ####
dt_reg = tree.DecisionTreeRegressor()

#### LR ####
lr_reg = linear_model.LinearRegression()

#### SVM ####
svm_reg = svm.SVR(kernel='rbf')

#### KNN ####
knn_reg = neighbors.KNeighborsRegressor()

#### RF ####
rf_reg = ensemble.RandomForestRegressor(n_estimators=5)  # uses 5 decision trees

#### PLS ####
pls_reg = PLSRegression(n_components=2,scale=True)

The train_test_split function from sklearn is used to split the data into a training set and a test set before the ML model is trained.
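
To illustrate what a seeded hold-out split does, here is a minimal plain-Python sketch. It is not the code inside fit_model or sklearn; `holdout_split` is a hypothetical helper written for this illustration, and it rounds the test-set size, whereas sklearn's rule may differ slightly for sizes that do not divide evenly.

```python
import random


def holdout_split(rows, test_size=0.3, seed=43):
    """Shuffle row indices with a fixed seed and hold out a test fraction.

    Hypothetical helper for illustration only; it mimics the idea behind
    a seeded train/test split, not the exact behaviour of any library.
    """
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_idx]
    test = [rows[i] for i in range(len(rows)) if i in test_idx]
    return train, test


rows = list(range(10))                 # stand-in for 10 samples
train_rows, test_rows = holdout_split(rows, test_size=0.3, seed=43)
print(len(train_rows), len(test_rows))
```

Calling the function again with the same seed returns the same partition, which is why a fixed random_state makes the reported metrics reproducible.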

In [9]:
data = pred_data
test_size = 0.3              # The proportion of the dataset to include in the test split
random_state = 43          # A seed to the random generator
model_clf = svm_reg

fit_model(data=data, test_size=test_size, random_state=random_state, model_clf= model_clf)

Training set R2: 0.7954 | Test set R2: 0.5990 | Training set MSE: 0.3184 | Test set MSE: 0.5616
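
For readers unfamiliar with the two metrics reported above, the following sketch computes R2 and MSE from their standard definitions. The numbers are made up for illustration, and the helper names `mse` and `r2` are hypothetical; how fit_model computes its metrics internally is not shown in this demo.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up pIC50 values, for illustration only
y_true = [5.0, 6.0, 7.0]
y_pred = [5.5, 6.0, 6.5]

print(f"R2: {r2(y_true, y_pred):.4f} | MSE: {mse(y_true, y_pred):.4f}")
# prints "R2: 0.7500 | MSE: 0.1667"
```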


Save the optimal predictor.

In [10]:
model_clf = model_clf
path = "./model/1.pka"

# Uncomment to write the predictor to disk
#save_model(model_clf=model_clf, path=path)
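
The internals of save_model and model_predict are not shown in this demo. As a sketch of one common way to persist a Python object, the example below uses the standard pickle module; this is an assumption, and the Predictor module may serialize models differently. `save_object` and `load_object` are hypothetical names, and a plain dictionary stands in for a fitted model.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to path with pickle, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_object(path):
    """Load a pickled object back from path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# A dictionary stands in for a fitted predictor (illustration only)
stand_in_model = {"kernel": "rbf", "coefficients": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model", "1.pka")
    save_object(stand_in_model, demo_path)
    restored = load_object(demo_path)

print(restored == stand_in_model)
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.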

## Predicting data

Input dataset

In [11]:
path = "Demo_pre_act.csv"

dataset,canonical_smi,canonical_mols = load_data(path)
vali_data = calcMCFP(mols = canonical_mols, dataset = dataset)

Load predictor

In [12]:
data = vali_data  # validation set
path = "./model/1.pka"

pre_act = model_predict(data, path)

In [16]:
pd.concat([dataset["ID"],dataset["Smiles"], pre_act], axis=1)

| | ID | Smiles | Pred_pIC50 |
|---|---|---|---|
| 0 | Scaffold -1 | `CC(C1=CC(OC)=C(OCC(N(C)C)=O)C=C1)=O` | 5.18404 |
| 1 | Scaffold -2 | `CC(C1=CC(OC)=C(OCC(N(CC)CC)=O)C=C1)=O` | 5.986582 |
| 2 | Scaffold -3 | `O=C(N(CC(C)C)CC(C)C)COC1=CC=C(C(C)=O)C(OC)=C1` | 5.797131 |
| 3 | Scaffold -4 | `CC(C1=CC(OC)=C(OCC(NC)=O)C=C1)=O` | 5.651106 |
| 4 | Scaffold -5 | `CC(C1=CC(OC)=C(OCC(NCCC)=O)C=C1)=O` | 5.7596 |
| 5 | Scaffold -6 | `O=C(NC(C)C)COC1=C(OC)C=C(C(C)=O)C=C1` | 5.403786 |
| 6 | Scaffold -7 | `CC(C1=CC(OC)=C(OCC(NC(C)(C)C)=O)C=C1)=O` | 5.84699 |
| 7 | Scaffold -8 | `CC(C1=CC(OC)=C(OCC(NC2CC2)=O)C=C1)=O` | 5.367029 |
| 8 | Scaffold -9 | `CC(C1=CC(OC)=C(OCC(NC2CCCCC2)=O)C=C1)=O` | 5.364095 |
| 9 | Scaffold -10 | `CC(C1=CC(OC)=C(OCC(NCC2CCCCC2)=O)C=C1)=O` | 4.964656 |
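
The predictions are reported as pIC50 values. To read them as concentrations, the sketch below converts between pIC50 and IC50, assuming the usual convention pIC50 = -log10(IC50 in mol/L); the demo itself does not state the units of the underlying IC50 data, so treat this as an assumption. The function names are hypothetical.

```python
import math


def pic50_to_ic50_nM(pic50):
    """Convert pIC50 to IC50 in nanomolar, assuming IC50 is in mol/L."""
    # IC50 [mol/L] = 10 ** (-pIC50); multiply by 1e9 for nmol/L
    return 10.0 ** (9.0 - pic50)


def ic50_nM_to_pic50(ic50_nM):
    """Convert IC50 in nanomolar back to pIC50."""
    return 9.0 - math.log10(ic50_nM)


# A pIC50 of 6 corresponds to 1 micromolar, i.e. 1000 nM
print(pic50_to_ic50_nM(6.0))
```

Under this convention a higher predicted pIC50 means a more potent inhibitor, and a difference of one pIC50 unit corresponds to a tenfold change in IC50.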
