# Part III: Descriptor Calculation and Dataset Preparation

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. Particularly, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then combine the descriptors with the bioactivity values into a dataset for model building in Part 4.

---

## Download PaDEL-Descriptor

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2023-11-15 06:33:15--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-11-15 06:33:15--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-11-15 06:33:15 (152 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-11-15 06:33:15--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## Load bioactivity data

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **RUNX1_bioactivity_data_class_pIC50.csv** file, which contains the pIC50 values that will serve as the target for a regression model.

In [None]:
import pandas as pd

In [None]:
df3= pd.read_csv('/content/RUNX1_bioactivity_data_class_pIC50.csv')
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL1339329,CCCOc1ccc(C2C(C(=O)OC)=CN(CCOC)C=C2C(=O)OC)cc1,intermediate,389.448,2.6348,0.0,7.0,5.653647
1,1,CHEMBL1378709,CCCc1cc2c(OC(=O)CC)c(-c3ccc(C(=O)OCC)o3)c(=O)o...,intermediate,428.437,4.5062,0.0,8.0,5.258848
2,2,CHEMBL1464508,Cc1cc2c(c(=O)n1CCc1ccccc1)C(c1ccco1)C(C#N)=C(N)O2,inactive,373.412,3.2106,1.0,6.0,4.440093
3,3,CHEMBL1323994,O=S(=O)(Nc1ccc(-c2nc3cccnc3s2)cc1)c1ccc(Cl)s1,intermediate,407.929,4.8740,1.0,6.0,5.262807
4,4,CHEMBL1338756,COc1cccc(-c2nc(-c3ccc4c(c3)-c3ccccc3S4(=O)=O)c...,inactive,494.572,6.2411,1.0,5.0,4.473661
...,...,...,...,...,...,...,...,...,...
89,89,CHEMBL1364045,O=S(=O)(Nc1ccc(-c2nc3cccnc3s2)cc1)c1ccccc1F,inactive,385.445,4.2982,1.0,5.0,4.301030
90,90,CHEMBL1453190,CCS(=O)(=O)Nc1ccc(-c2nc3cccnc3s2)cc1,inactive,319.411,3.1199,1.0,5.0,4.892485
91,91,CHEMBL1353698,O=S(=O)(Nc1ccc(-c2nc3cccnc3s2)cc1)c1cccs1,intermediate,373.484,4.2206,1.0,6.0,5.439316
92,92,CHEMBL1330155,O=S(=O)(Nc1ccc(-c2nc3cccnc3s2)cc1)c1cc(F)ccc1F,inactive,403.435,4.4373,1.0,5.0,4.000000


In [None]:
# Write a .smi file: SMILES string first, molecule ID second, tab-separated, no header row
selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
!cat molecule.smi | head -5

CCCOc1ccc(C2C(C(=O)OC)=CN(CCOC)C=C2C(=O)OC)cc1	CHEMBL1339329
CCCc1cc2c(OC(=O)CC)c(-c3ccc(C(=O)OCC)o3)c(=O)oc2cc1OC	CHEMBL1378709
Cc1cc2c(c(=O)n1CCc1ccccc1)C(c1ccco1)C(C#N)=C(N)O2	CHEMBL1464508
O=S(=O)(Nc1ccc(-c2nc3cccnc3s2)cc1)c1ccc(Cl)s1	CHEMBL1323994
COc1cccc(-c2nc(-c3ccc4c(c3)-c3ccccc3S4(=O)=O)c(-c3ccccc3)[nH]2)c1OC	CHEMBL1338756


In [None]:
!cat molecule.smi | wc -l

94
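
Before running the descriptor calculation, it can help to read the `.smi` file back and confirm that every row has both fields and that the molecule IDs are unique, since the IDs are what tie the descriptor output back to the bioactivity table. The sketch below is only illustrative: the helper name `read_smi` is hypothetical, and an in-memory sample built from two of the rows shown above stands in for the real file.

```python
import io
import pandas as pd

def read_smi(source):
    """Read a tab-separated SMILES file with no header row (hypothetical helper)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["canonical_smiles", "molecule_chembl_id"],
    )

# Two rows copied from the output above stand in for molecule.smi.
sample = io.StringIO(
    "CCCOc1ccc(C2C(C(=O)OC)=CN(CCOC)C=C2C(=O)OC)cc1\tCHEMBL1339329\n"
    "O=S(=O)(Nc1ccc(-c2nc3cccnc3s2)cc1)c1ccc(Cl)s1\tCHEMBL1323994\n"
)
smi = read_smi(sample)

# No missing fields, and each molecule ID appears exactly once.
assert not smi.isna().any().any()
assert smi["molecule_chembl_id"].is_unique
print(len(smi), "molecules read")
```

With the real file, `read_smi('molecule.smi')` would be used in place of the in-memory sample.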


## Calculate PaDEL fingerprint descriptors

In [None]:
!cat padel.sh  # the script removes salts and standardizes nitro groups before computing PubChem fingerprints

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
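
The same invocation can also be assembled from Python, which makes it easier to swap the fingerprint definition file or the output name. The sketch below reuses only the flags that appear in `padel.sh` above. The function name `build_padel_command` and its parameters are hypothetical, and the actual run is left commented out because it needs Java and the unzipped `PaDEL-Descriptor` folder.

```python
import subprocess

def build_padel_command(
    fingerprint_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
    heap="1G",
):
    """Return the command from padel.sh as an argument list (hypothetical helper)."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        # Structure clean-up flags taken verbatim from padel.sh
        "-removesalt", "-standardizenitro",
        # Compute fingerprints of the type defined in the XML file
        "-fingerprints", "-descriptortypes", fingerprint_xml,
        # Directory holding the input structures, and the output CSV
        "-dir", input_dir,
        "-file", output_csv,
    ]

cmd = build_padel_command()
print(" ".join(cmd))

# Uncomment to run PaDEL (requires Java and the PaDEL-Descriptor folder):
# subprocess.run(cmd, check=True)
```

Passing a different XML file from the `PaDEL-Descriptor` folder as `fingerprint_xml` would, under the same assumptions, select another fingerprint type.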


In [None]:
!bash padel.sh

Processing CHEMBL1339329 in molecule.smi (1/94). 
Processing CHEMBL1378709 in molecule.smi (2/94). 
Processing CHEMBL1464508 in molecule.smi (3/94). Average speed: 2.84 s/mol.
Processing CHEMBL1323994 in molecule.smi (4/94). Average speed: 1.62 s/mol.
Processing CHEMBL1338756 in molecule.smi (5/94). Average speed: 1.27 s/mol.
Processing CHEMBL1604389 in molecule.smi (6/94). Average speed: 1.04 s/mol.
Processing CHEMBL1468099 in molecule.smi (7/94). Average speed: 1.06 s/mol.
Processing CHEMBL1467976 in molecule.smi (8/94). Average speed: 0.95 s/mol.
Processing CHEMBL1608118 in molecule.smi (9/94). Average speed: 0.87 s/mol.
Processing CHEMBL1479811 in molecule.smi (10/94). Average speed: 0.85 s/mol.
Processing CHEMBL3190303 in molecule.smi (11/94). Average speed: 0.79 s/mol.
Processing CHEMBL1613571 in molecule.smi (12/94). Average speed: 0.76 s/mol.
Processing CHEMBL1426072 in molecule.smi (14/94). Average speed: 0.69 s/mol.
Processing CHEMBL504791 in molecule.smi (13/94). Average spe

In [None]:
!ls -l

total 25384
-rw-r--r-- 1 root root   178477 Nov 15 06:33 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Nov 15 06:33 __MACOSX
-rw-r--r-- 1 root root     5816 Nov 15 06:33 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Nov 15 06:33 padel.sh
-rw-r--r-- 1 root root 25768637 Nov 15 06:33 padel.zip
-rw-r--r-- 1 root root    12539 Nov 15 06:33 RUNX1_bioactivity_data_class_pIC50.csv
drwxr-xr-x 1 root root     4096 Nov 13 14:20 sample_data


# Preparing the X & Y Data Matrices

## X data matrix

In [None]:
df3_X= pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1339329,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL1378709,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL1464508,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL1323994,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL1604389,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,CHEMBL1483506,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
90,CHEMBL1453190,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91,CHEMBL1353698,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
92,CHEMBL1450979,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df3_X= df3_X.drop(columns= ['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
90,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
91,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
92,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## Y variable
### Extract the pIC50 values

In [None]:
df3_Y= df3['pIC50']
df3_Y

0     5.653647
1     5.258848
2     4.440093
3     5.262807
4     4.473661
        ...   
89    4.301030
90    4.892485
91    5.439316
92    4.000000
93    4.000000
Name: pIC50, Length: 94, dtype: float64
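
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The conversion itself was done in the earlier parts of this series, so the cell above only selects the column. The sketch below assumes IC50 values given in nM; the helper name `ic50_nm_to_pic50` is hypothetical.

```python
import math

def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(molar)

# A more potent compound (lower IC50) gets a higher pIC50.
for ic50 in (100000, 1000, 10):
    print(ic50, "nM ->", round(ic50_nm_to_pic50(ic50), 3))
```

Under this assumption, an IC50 of 100000 nM corresponds to a pIC50 of 4, which is the value seen in the last rows above.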

## Combining the X and Y variables

In [None]:
dataset= pd.concat([df3_X, df3_Y], axis= 1)
dataset

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.653647
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.258848
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.440093
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.262807
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.473661
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
90,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.892485
91,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.439316
92,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
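
Note that `pd.concat(..., axis=1)` pairs rows by index position, so it is only correct when the descriptor file lists the molecules in the same order as the bioactivity table. In the outputs above, row 4 is CHEMBL1604389 in the descriptor table but CHEMBL1338756 in the bioactivity table, which suggests the two orders may differ. A more defensive option is to join on the molecule ID before the `Name` column is dropped. The sketch below uses small made-up frames in place of the real data, and the function name `combine_on_id` is hypothetical.

```python
import pandas as pd

def combine_on_id(descriptors, bioactivity):
    """Join fingerprints to pIC50 by molecule ID, not by row position (hypothetical helper)."""
    y = bioactivity[["molecule_chembl_id", "pIC50"]]
    merged = descriptors.merge(
        y,
        left_on="Name",
        right_on="molecule_chembl_id",
        how="inner",
        validate="one_to_one",  # raises if an ID is duplicated on either side
    )
    # Keep only the fingerprint bits and the target column.
    return merged.drop(columns=["Name", "molecule_chembl_id"])

# Toy data (not real values) with the descriptor rows deliberately out of order.
toy_X = pd.DataFrame({"Name": ["B", "A"], "PubchemFP0": [0, 1]})
toy_df = pd.DataFrame({"molecule_chembl_id": ["A", "B"], "pIC50": [5.0, 4.0]})

toy_dataset = combine_on_id(toy_X, toy_df)
print(toy_dataset)
```

With the notebook's variables, this would be called on the descriptor frame as read from `descriptors_output.csv` (while it still has the `Name` column) together with `df3`.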


In [None]:
dataset.to_csv('RUNX1_bioactivity_class_2_pIC50_pubchem_fp.csv', index=False)