# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then combine them with the bioactivity values into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! pip install wget
! pip install padelpy



In [2]:
import wget

wget.download("https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip")

'fingerprints_xml (2).zip'
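
The output `'fingerprints_xml (2).zip'` suggests that an earlier download was already present, so the new copy was saved under a numbered name, while the next cell unzips `fingerprints_xml.zip`. A minimal sketch of a download step that skips the transfer when the target file already exists is shown here. It assumes the standard library's `urllib.request.urlretrieve` as a stand-in for `wget`; the helper name `download_once` and its `fetch` parameter are hypothetical.

```python
import os
import urllib.request


def download_once(url, filename=None, fetch=urllib.request.urlretrieve):
    """Download url to filename unless that file already exists.

    download_once and its fetch parameter are hypothetical names used for
    illustration; fetch must accept (url, filename).
    """
    if filename is None:
        # Default to the last path component of the URL.
        filename = url.rsplit("/", 1)[-1]
    if not os.path.exists(filename):
        fetch(url, filename)
    return filename
```

Calling `download_once("https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip")` would then always return `'fingerprints_xml.zip'`, so the file name used by the unzip step stays the same across reruns.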

In [3]:
!unzip -o fingerprints_xml.zip


Archive:  fingerprints_xml.zip
  inflating: AtomPairs2DFingerprintCount.xml  
  inflating: AtomPairs2DFingerprinter.xml  
  inflating: EStateFingerprinter.xml  
  inflating: ExtendedFingerprinter.xml  
  inflating: Fingerprinter.xml       
  inflating: GraphOnlyFingerprinter.xml  
  inflating: KlekotaRothFingerprintCount.xml  
  inflating: KlekotaRothFingerprinter.xml  
  inflating: MACCSFingerprinter.xml  
  inflating: PubchemFingerprinter.xml  
  inflating: SubstructureFingerprintCount.xml  
  inflating: SubstructureFingerprinter.xml  


In [4]:
import glob
xml_files = glob.glob("*.xml")
xml_files.sort()
xml_files

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'DELGraphOnlyFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

In [5]:
FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']

In [6]:
fp = dict(zip(FP_list, xml_files))
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'DELGraphOnlyFingerprinter.xml',
 'CDKextended': 'EStateFingerprinter.xml',
 'CDK': 'ExtendedFingerprinter.xml',
 'CDKgraphonly': 'Fingerprinter.xml',
 'KlekotaRothCount': 'GraphOnlyFingerprinter.xml',
 'KlekotaRoth': 'KlekotaRothFingerprintCount.xml',
 'MACCS': 'KlekotaRothFingerprinter.xml',
 'PubChem': 'MACCSFingerprinter.xml',
 'SubstructureCount': 'PubchemFingerprinter.xml',
 'Substructure': 'SubstructureFingerprintCount.xml'}

In [7]:
fp['AtomPairs2D']

'AtomPairs2DFingerprinter.xml'
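
Pairing `FP_list` with the sorted file list by position only works when both lists have the same length and order. In the listing above, `xml_files` holds 13 entries (including `DELGraphOnlyFingerprinter.xml`, which is not part of the archive) against 12 names, so every pair after `AtomPairs2D` is shifted by one. A more defensive alternative is an explicit mapping that is checked against the files on disk. The sketch below assumes the name-to-file pairs implied by the file names; `FP_FILES` and `build_fp_map` are hypothetical names.

```python
# Explicit name -> file pairs, inferred from FP_list and the archive listing
# above (an assumption based on the file names).
FP_FILES = {
    'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
    'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
    'EState': 'EStateFingerprinter.xml',
    'CDKextended': 'ExtendedFingerprinter.xml',
    'CDK': 'Fingerprinter.xml',
    'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
    'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
    'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
    'MACCS': 'MACCSFingerprinter.xml',
    'PubChem': 'PubchemFingerprinter.xml',
    'SubstructureCount': 'SubstructureFingerprintCount.xml',
    'Substructure': 'SubstructureFingerprinter.xml',
}


def build_fp_map(available_files, expected=FP_FILES):
    """Return a copy of expected after checking that every file is present."""
    missing = sorted(set(expected.values()) - set(available_files))
    if missing:
        raise FileNotFoundError(f"Missing descriptor files: {missing}")
    # Extra files in the directory are ignored rather than shifting the pairs.
    return dict(expected)
```

With this approach, `fp = build_fp_map(glob.glob("*.xml"))` would give the same mapping regardless of any additional XML files in the working directory.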

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that serve as the target variable for building a regression model.
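
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nanomolar (an assumption; the unit is not stated in this notebook), and `pic50_from_nm` is a hypothetical helper name.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)


# An IC50 of 1000 nM (1 uM) corresponds to a pIC50 of about 6.
print(round(pic50_from_nm(1000.0), 3))
```

Higher pIC50 values therefore correspond to more potent compounds, since a smaller IC50 gives a larger negative logarithm.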

In [8]:
import wget
wget.download("https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv")

'acetylcholinesterase_04_bioactivity_data_3class_pIC50 (3).csv'

In [9]:
import pandas as pd

In [10]:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [11]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL463210,CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl,intermediate,350.591,4.7181,0.0,5.0,5.737549
1,1,CHEMBL2252723,CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O,inactive,455.557,6.3177,0.0,6.0,3.947999
2,2,CHEMBL2252722,CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O,inactive,441.53,5.9276,0.0,6.0,4.425969
3,3,CHEMBL2252721,CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,427.503,5.5375,0.0,6.0,5.346787
4,4,CHEMBL2252851,CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,413.476,5.1474,0.0,6.0,5.735182
5,5,CHEMBL2252850,CCOP(=O)(OCC)SCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,399.449,4.7573,0.0,6.0,5.419075
6,6,CHEMBL2252849,CCOP(=O)(OCC)SCCCCCN1C(=O)c2ccccc2C1=O,inactive,385.422,4.3672,0.0,6.0,4.908685
7,7,CHEMBL2252848,CCOP(=O)(OCC)SCCCCN1C(=O)c2ccccc2C1=O,intermediate,371.395,3.9771,0.0,6.0,5.003488
8,8,CHEMBL2252847,CCOP(=O)(OCC)SCCCN1C(=O)c2ccccc2C1=O,intermediate,357.368,3.587,0.0,6.0,5.081445
9,9,CHEMBL2252846,CCOP(=O)(OCC)SCCCCCCCCCCSP(=O)(OCC)OCC,intermediate,478.594,7.9358,0.0,8.0,5.754487


In [12]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [13]:
! head -5 molecule.smi

CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl	CHEMBL463210
CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252723
CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252722
CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252721
CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252851


In [14]:
! wc -l < molecule.smi

      18
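
Before passing the file to PaDEL, it can be worth reading it back and confirming that it has one row per molecule, no empty SMILES, and no repeated identifiers. The following sketch uses a small toy table with made-up identifiers in place of `df3` and writes to a temporary directory; only the pandas calls already used above are assumed.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df3; only the two columns used above are needed.
toy = pd.DataFrame({
    'canonical_smiles': ['CCO', 'c1ccccc1', 'CC(=O)O'],
    'molecule_chembl_id': ['ID1', 'ID2', 'ID3'],  # made-up identifiers
})

smi_path = os.path.join(tempfile.mkdtemp(), 'molecule.smi')
toy[['canonical_smiles', 'molecule_chembl_id']].to_csv(
    smi_path, sep='\t', index=False, header=False)

# Read the file back and compare it with the source table.
check = pd.read_csv(smi_path, sep='\t', header=None,
                    names=['canonical_smiles', 'molecule_chembl_id'])
same_length = len(check) == len(toy)
n_missing = int(check['canonical_smiles'].isna().sum())
n_duplicated = int(check['molecule_chembl_id'].duplicated().sum())
print(same_length, n_missing, n_duplicated)
```

Applied to the real data, the same three checks would use `df3` and `'molecule.smi'` in place of the toy table and the temporary path.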


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [15]:
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'DELGraphOnlyFingerprinter.xml',
 'CDKextended': 'EStateFingerprinter.xml',
 'CDK': 'ExtendedFingerprinter.xml',
 'CDKgraphonly': 'Fingerprinter.xml',
 'KlekotaRothCount': 'GraphOnlyFingerprinter.xml',
 'KlekotaRoth': 'KlekotaRothFingerprintCount.xml',
 'MACCS': 'KlekotaRothFingerprinter.xml',
 'PubChem': 'MACCSFingerprinter.xml',
 'SubstructureCount': 'PubchemFingerprinter.xml',
 'Substructure': 'SubstructureFingerprintCount.xml'}

In [16]:
fp['PubChem']

'MACCSFingerprinter.xml'

In [18]:
from padelpy import padeldescriptor

fingerprint = 'Substructure'

fingerprint_output_file = f'{fingerprint}.csv'  # 'Substructure.csv'
# Look up the descriptor XML for this key. With the fp mapping printed above,
# 'Substructure' resolves to 'SubstructureFingerprintCount.xml', so the output
# holds substructure counts (SubFPC columns).
fingerprint_descriptortypes = fp[fingerprint]

padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [19]:
! ls -l

total 89808
-rwxr-xr-x@ 1 sandrapepkolaj  staff     4645 Mar 27  2018 AtomPairs2DFingerprintCount.xml
-rwxr-xr-x@ 1 sandrapepkolaj  staff     4645 Mar 27  2018 AtomPairs2DFingerprinter.xml
-rw-r--r--@ 1 sandrapepkolaj  staff   154554 Oct 15 13:49 CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--@ 1 sandrapepkolaj  staff   269174 Oct 22 12:24 CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb
-rw-r--r--@ 1 sandrapepkolaj  staff    95910 Dec 11 23:59 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rw-r--r--@ 1 sandrapepkolaj  staff    54533 Oct 22 13:54 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation_ DOUBLE.ipynb
-rw-r--r--@ 1 sandrapepkolaj  staff    99668 Dec 11 23:22 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
-rw-r--r--@ 1 sandrapepkolaj  staff   508733 Dec 11 23:36 CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb
-rwxr-xr-x@ 1 sandrapepkolaj 

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [20]:
df3_X = pd.read_csv('Substructure.csv')

In [21]:
df3_X

Unnamed: 0,Name,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,...,SubFPC298,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307
0,CHEMBL463210,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,4.0,6.0,0.0,0.0,0.0,0.0,6.0
1,CHEMBL2252723,2.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,15.0,15.0,16.0,0.0,0.0,0.0,0.0,10.0
2,CHEMBL2252722,2.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,14.0,14.0,15.0,0.0,0.0,0.0,0.0,10.0
3,CHEMBL2252721,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,13.0,13.0,14.0,0.0,0.0,0.0,0.0,10.0
4,CHEMBL2252851,2.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,12.0,12.0,13.0,0.0,0.0,0.0,0.0,10.0
5,CHEMBL2252850,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,11.0,11.0,12.0,0.0,0.0,0.0,0.0,10.0
6,CHEMBL2252849,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,11.0,0.0,0.0,0.0,0.0,10.0
7,CHEMBL2252848,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,9.0,9.0,10.0,0.0,0.0,0.0,0.0,10.0
8,CHEMBL2252847,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,9.0,0.0,0.0,0.0,0.0,10.0
9,CHEMBL2252846,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,19.0,19.0,21.0,0.0,0.0,0.0,0.0,2.0


In [22]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,SubFPC10,...,SubFPC298,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307
0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,4.0,6.0,0.0,0.0,0.0,0.0,6.0
1,2.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,15.0,15.0,16.0,0.0,0.0,0.0,0.0,10.0
2,2.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,14.0,14.0,15.0,0.0,0.0,0.0,0.0,10.0
3,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,13.0,13.0,14.0,0.0,0.0,0.0,0.0,10.0
4,2.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,12.0,12.0,13.0,0.0,0.0,0.0,0.0,10.0
5,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,11.0,11.0,12.0,0.0,0.0,0.0,0.0,10.0
6,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,11.0,0.0,0.0,0.0,0.0,10.0
7,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,9.0,9.0,10.0,0.0,0.0,0.0,0.0,10.0
8,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,9.0,0.0,0.0,0.0,0.0,10.0
9,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,19.0,19.0,21.0,0.0,0.0,0.0,0.0,2.0
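
Dropping `Name` and later combining X and Y by position assumes that the descriptor file lists the molecules in the same order as `molecule.smi` and that no molecule was skipped. A hedged alternative is to join the two tables on the identifier before dropping it. The sketch below uses toy tables with made-up identifiers; the column names follow the ones shown above.

```python
import pandas as pd

# Toy descriptor table whose rows come back in a different order than the input.
descriptors = pd.DataFrame({
    'Name': ['ID2', 'ID1', 'ID3'],
    'SubFPC1': [2.0, 1.0, 3.0],
})
# Toy bioactivity table in the original input order.
bioactivity = pd.DataFrame({
    'molecule_chembl_id': ['ID1', 'ID2', 'ID3'],
    'pIC50': [5.0, 6.0, 7.0],
})

# Join on the identifier so each descriptor row meets its own pIC50 value;
# validate='one_to_one' raises if an identifier is repeated on either side.
merged = bioactivity.merge(descriptors, left_on='molecule_chembl_id',
                           right_on='Name', how='inner',
                           validate='one_to_one')

X = merged.drop(columns=['molecule_chembl_id', 'Name', 'pIC50'])
Y = merged['pIC50']
```

If the real data contains repeated ChEMBL IDs, the `validate` argument will raise an error, which is a signal to deduplicate before merging.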


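### **Y variable**

The pIC50 values are already present in the loaded bioactivity file, so no conversion from IC50 is needed here; we simply select the **pIC50** column.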

In [23]:
df3_Y = df3['pIC50']
df3_Y

0     5.737549
1     3.947999
2     4.425969
3     5.346787
4     5.735182
5     5.419075
6     4.908685
7     5.003488
8     5.081445
9     5.754487
10    5.844664
11    5.315155
12    4.991400
13    6.060481
14    4.908685
15    5.093126
16    5.785156
17    1.397940
Name: pIC50, dtype: float64

## **Combining X and Y variables**

In [24]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,SubFPC10,...,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307,pIC50
0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,4.0,4.0,6.0,0.0,0.0,0.0,0.0,6.0,5.737549
1,2.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,15.0,15.0,16.0,0.0,0.0,0.0,0.0,10.0,3.947999
2,2.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,14.0,14.0,15.0,0.0,0.0,0.0,0.0,10.0,4.425969
3,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,13.0,13.0,14.0,0.0,0.0,0.0,0.0,10.0,5.346787
4,2.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,12.0,12.0,13.0,0.0,0.0,0.0,0.0,10.0,5.735182
5,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,11.0,11.0,12.0,0.0,0.0,0.0,0.0,10.0,5.419075
6,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,10.0,10.0,11.0,0.0,0.0,0.0,0.0,10.0,4.908685
7,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,9.0,9.0,10.0,0.0,0.0,0.0,0.0,10.0,5.003488
8,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,9.0,0.0,0.0,0.0,0.0,10.0,5.081445
9,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,19.0,19.0,21.0,0.0,0.0,0.0,0.0,2.0,5.754487
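
`pd.concat(..., axis=1)` aligns rows on the index, so a difference in row counts between X and Y shows up as missing values rather than as an error. A small sketch with toy data of how such gaps could be detected before saving:

```python
import pandas as pd

X = pd.DataFrame({'SubFPC1': [2.0, 2.0]})      # 2 descriptor rows
Y = pd.Series([5.7, 3.9, 4.4], name='pIC50')   # 3 activity values

combined = pd.concat([X, Y], axis=1)
# The unmatched index label 2 yields NaN in the descriptor column.
rows_with_gaps = combined[combined.isna().any(axis=1)]
print(len(combined), len(rows_with_gaps))
```

On the real data, an empty `rows_with_gaps` for `dataset3` would indicate that every molecule has both descriptors and a pIC50 value.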


In [25]:
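# Note: the file name ends in "pubchem_fp", but the descriptors saved here are
# the Substructure count fingerprints (SubFPC columns) computed above.
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)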

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**