# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble the descriptors and the pIC50 values into a dataset for the model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
# ! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
# ! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

data_folder = "data"
formatted_target_name = 'Kallikrein_1'

In [2]:
# ! unzip padel.zip

## **Load bioactivity data**

Download (or load from the local data folder) the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that serve as the target variable for the regression model.

In [3]:
# ! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

In [4]:
import pandas as pd

In [5]:
# df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')
df3 = pd.read_csv(f'{data_folder}/{formatted_target_name}_04_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL174750,NS(=O)(=O)c1ccccc1-c1ccc(C(=O)Nc2ccccc2C(=O)Nc...,intermediate,551.422,4.66310,3.0,5.0,5.677781
1,1,CHEMBL294121,CN1CCN=C1c1ccc(C(=O)N2CCN(S(=O)(=O)c3cc4cc(Cl)...,intermediate,485.997,2.66000,1.0,5.0,5.236572
2,2,CHEMBL432306,N=C(N)c1cccc(OCCNC(=O)c2ccc(-c3ccccc3S(N)(=O)=...,intermediate,438.509,2.09387,4.0,5.0,5.677781
3,3,CHEMBL297835,N=C(N)c1cccc(Oc2ccccc2NC(=O)c2ccc(-c3ccccc3S(N...,active,486.553,4.32967,4.0,5.0,6.602060
4,4,CHEMBL10378,O=c1oc(-c2ccccc2I)nc2ccccc12,intermediate,349.127,3.45960,0.0,3.0,5.004804
...,...,...,...,...,...,...,...,...,...
332,332,CHEMBL3897372,Cc1cc(C)n(Cc2ccc(Cn3cc(C(=O)NCc4c(C)cc(N)nc4C)...,inactive,444.543,2.71208,2.0,8.0,4.397940
333,333,CHEMBL5204354,COc1ccnc(CNC(=O)c2cn(Cc3ccc(Cn4ccccc4=O)cc3)nc...,intermediate,515.467,3.63290,1.0,7.0,5.397940
334,334,CHEMBL5095064,COCc1nn(Cc2ccc(Cn3cc(F)ccc3=O)cc2)cc1C(=O)NCc1...,inactive,509.513,2.89960,1.0,8.0,4.397940
335,335,CHEMBL5276515,Nc1nn(Cc2ccc(Cn3ccccc3=O)cc2)cc1C(=O)NCCOc1ccc...,active,477.952,3.18580,2.0,7.0,8.537602


In [24]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [25]:
! cat molecule.smi | head -5

NS(=O)(=O)c1ccccc1-c1ccc(C(=O)Nc2ccccc2C(=O)Nc2ccc(Br)cn2)cc1	CHEMBL174750
CN1CCN=C1c1ccc(C(=O)N2CCN(S(=O)(=O)c3cc4cc(Cl)ccc4[nH]3)CC2)cc1	CHEMBL294121
N=C(N)c1cccc(OCCNC(=O)c2ccc(-c3ccccc3S(N)(=O)=O)cc2)c1	CHEMBL432306
N=C(N)c1cccc(Oc2ccccc2NC(=O)c2ccc(-c3ccccc3S(N)(=O)=O)cc2)c1	CHEMBL297835
O=c1oc(-c2ccccc2I)nc2ccccc12	CHEMBL10378


In [26]:
! cat molecule.smi | wc -l

     337
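The cells above write a headerless, tab-separated file with the SMILES string first and the ChEMBL ID second. The PaDEL log further down reports each molecule by its ChEMBL ID, which suggests the second column is read as the molecule name. The following is a minimal sketch of that file layout on a two-row toy frame (the two SMILES/ID pairs are copied from the table above; the in-memory buffer and the variable names are illustrative assumptions, not part of the notebook):

```python
import io

import pandas as pd

# Toy stand-in for df3; both rows are taken from the table shown earlier.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL10378", "CHEMBL174750"],
    "canonical_smiles": [
        "O=c1oc(-c2ccccc2I)nc2ccccc12",
        "NS(=O)(=O)c1ccccc1-c1ccc(C(=O)Nc2ccccc2C(=O)Nc2ccc(Br)cn2)cc1",
    ],
})

# Column order matters: structure first, identifier second.
buffer = io.StringIO()
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    buffer, sep="\t", index=False, header=False
)

smi_lines = buffer.getvalue().splitlines()
n_records = len(smi_lines)  # plays the role of `wc -l`
first_smiles, first_id = smi_lines[0].split("\t")
print(n_records, first_id)
```

Comparing the record count with `len(df3)` is a cheap check that no row was lost or split when the file was written.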


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [29]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file data/descriptors_output.csv
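For readers who prefer to launch PaDEL from Python rather than through `padel.sh`, the same command can be written as an argument list. This is a sketch only: every flag and path is copied from the script above, and nothing else is assumed about PaDEL's options. As far as the flags can be read from the script, `-Xms1G -Xmx1G` fix the JVM heap at 1 GB, `-Djava.awt.headless=true` runs Java without a display, `-descriptortypes` points at the XML file selecting the PubChem fingerprinter, `-dir` is the input directory and `-file` is the output CSV.

```python
import shlex

# Flags and paths copied from padel.sh; only the Python wrapper is new.
padel_cmd = [
    "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
    "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    "-removesalt", "-standardizenitro", "-fingerprints",
    "-descriptortypes", "./PaDEL-Descriptor/PubchemFingerprinter.xml",
    "-dir", "./",
    "-file", "data/descriptors_output.csv",
]

# Render the list as a single shell-style line for inspection.
command_line = shlex.join(padel_cmd)
print(command_line)

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# import subprocess
# subprocess.run(padel_cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues, and `check=True` raises an exception if Java exits with an error.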


In [30]:
! bash padel.sh

Processing CHEMBL174750 in molecule.smi (1/337). 
Processing CHEMBL432306 in molecule.smi (3/337). 
Processing CHEMBL294121 in molecule.smi (2/337). 
Processing CHEMBL297835 in molecule.smi (4/337). 
Processing CHEMBL273264 in molecule.smi (6/337). Average speed: 1.65 s/mol.
Processing CHEMBL10378 in molecule.smi (5/337). Average speed: 3.20 s/mol.
Processing CHEMBL21052 in molecule.smi (7/337). Average speed: 1.15 s/mol.
Processing CHEMBL20918 in molecule.smi (8/337). Average speed: 1.20 s/mol.
Processing CHEMBL279486 in molecule.smi (9/337). Average speed: 0.90 s/mol.
Processing CHEMBL21140 in molecule.smi (10/337). Average speed: 0.68 s/mol.
Processing CHEMBL2371651 in molecule.smi (11/337). Average speed: 0.62 s/mol.
Processing CHEMBL2371642 in molecule.smi (12/337). Average speed: 0.63 s/mol.
Processing CHEMBL2371636 in molecule.smi (13/337). Average speed: 0.56 s/mol.
Processing CHEMBL3143643 in molecule.smi (15/337). Average speed: 0.47 s/mol.
Processing CHEMBL3143648 in molecul

In [12]:
! ls -l

total 140024
-rw-r--r--   1 claradv  staff     84215 Jan 21 17:16 CDD_ML_Part1_Bioactivity_Data_Concised.ipynb
-rw-r--r--   1 claradv  staff    277099 Jan 21 17:17 CDD_ML_Part2_Exploratory_Data_Analysis.ipynb
-rw-r--r--   1 claradv  staff     75139 Jan 21 17:50 CDD_ML_Part3_Descriptor_Dataset_Preparation copy.ipynb
-rw-r--r--@  1 claradv  staff    551083 Jan 21 17:49 CDD_ML_Part3_Descriptor_Dataset_Preparation.ipynb
drwxrwxr-x  21 claradv  staff       672 May 30  2020 PaDEL-Descriptor
-rw-r--r--   1 claradv  staff        26 Jan 21 10:43 README.md
drwxr-xr-x   4 claradv  staff       128 Jan 21 17:41 __MACOSX
drwxr-xr-x   4 claradv  staff       128 Jan 21 10:44 __pycache__
-rw-r--r--   1 claradv  staff    655414 Jan 21 17:39 acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
-rw-r--r--   1 claradv  staff    655414 Jan 21 17:41 acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.1
-rw-r--r--   1 claradv  staff  16633406 Jan 21 17:49 acetylchol

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [13]:
df3_X = pd.read_csv('data/descriptors_output.csv')

In [14]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL294121,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL432306,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL174750,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL10378,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL297835,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,CHEMBL3897372,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
333,CHEMBL5204354,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
334,CHEMBL5095064,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
335,CHEMBL5276515,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
333,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
334,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
335,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
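The X matrix above has one column per PubChem fingerprint bit, `PubchemFP0` through `PubchemFP880` (881 bits), and every value should be 0 or 1. A small sanity check along these lines can catch a truncated or malformed descriptor file before modelling. The helper name `check_fingerprint_matrix` and its parameters are hypothetical, introduced only for this sketch:

```python
import pandas as pd


def check_fingerprint_matrix(X: pd.DataFrame,
                             prefix: str = "PubchemFP",
                             n_bits: int = 881) -> bool:
    """Raise ValueError if a bit column is missing or holds a non-binary value."""
    expected = [f"{prefix}{i}" for i in range(n_bits)]
    missing = [col for col in expected if col not in X.columns]
    if missing:
        raise ValueError(f"{len(missing)} fingerprint columns missing, e.g. {missing[0]}")
    # isin() is False for NaN, so missing values are rejected as well.
    if not X[expected].isin([0, 1]).all().all():
        raise ValueError("fingerprint matrix contains values other than 0 and 1")
    return True


# Toy 3-bit example; on the real data one would call check_fingerprint_matrix(df3_X).
toy_X = pd.DataFrame({"PubchemFP0": [1, 1], "PubchemFP1": [0, 1], "PubchemFP2": [0, 0]})
toy_ok = check_fingerprint_matrix(toy_X, n_bits=3)
print(toy_ok)
```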


## **Y variable**

### **Convert IC50 to pIC50**

In [16]:
df3_Y = df3['pIC50']
df3_Y

0      5.677781
1      5.236572
2      5.677781
3      6.602060
4      5.004804
         ...   
332    4.397940
333    5.397940
334    4.397940
335    8.537602
336    6.000000
Name: pIC50, Length: 337, dtype: float64
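The cell above only selects the `pIC50` column. The IC50-to-pIC50 conversion itself was done during the pre-processing in the earlier parts. For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nM (the usual unit for ChEMBL IC50 values, but an assumption here), and the function name `pic50_from_nM` is hypothetical:

```python
import math


def pic50_from_nM(ic50_nM: float) -> float:
    """Return pIC50 = -log10(IC50 in mol/L) for an IC50 given in nM (assumed unit)."""
    if ic50_nM <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nM * 1e-9  # nM -> M
    return -math.log10(ic50_molar)


# 1000 nM is 1 µM, so the pIC50 is about 6; 40000 nM gives about 4.398.
print(round(pic50_from_nM(1000.0), 3), round(pic50_from_nM(40000.0), 3))
```

Because of the negative logarithm, a larger pIC50 means a more potent compound: each unit of pIC50 corresponds to a tenfold drop in IC50.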

## **Combining X and Y variables**

In [17]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.677781
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.236572
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.677781
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.602060
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.004804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.397940
333,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.397940
334,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.397940
335,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.537602
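One caveat about the positional `pd.concat` above: the `Name` column of the descriptor table lists the compounds in a different order from `df3` (row 0 is CHEMBL294121 in the descriptor output but CHEMBL174750 in `df3`), and the PaDEL log also shows molecules being processed out of order. Concatenating by row position therefore risks pairing a fingerprint with another compound's pIC50. A more defensive variant, sketched here on toy data, keeps `Name` and merges on the ChEMBL ID instead. The IDs and pIC50 values are copied from the tables above; the fingerprint bits are made up for illustration:

```python
import pandas as pd

# Toy bioactivity table in the original df3 order.
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL174750", "CHEMBL294121", "CHEMBL432306"],
    "pIC50": [5.677781, 5.236572, 5.677781],
})

# Toy descriptor table in the (different) PaDEL output order; bits are invented.
desc = pd.DataFrame({
    "Name": ["CHEMBL294121", "CHEMBL432306", "CHEMBL174750"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [0, 1, 1],
})

# Join on the identifier so each fingerprint row gets its own pIC50.
aligned = desc.merge(
    bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",  # raises MergeError if an ID appears more than once
)
dataset = aligned.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

If the real data contains repeated ChEMBL IDs, the `validate` check will raise, and the duplicates need to be resolved (for example by deduplicating upstream) before merging.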


In [18]:
dataset3.to_csv(f'{data_folder}/{formatted_target_name}_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Download the CSV file to your local computer for the model building in Part 4.**