# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble these into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2023-08-13 23:03:54--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-08-13 23:03:54--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-08-13 23:03:55 (181 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-08-13 23:03:55--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_preprocessed_final.csv** file, which contains the pIC50 values that we will use to build a regression model.
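
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The minimal sketch below assumes IC50 values given in nM; the function name `pic50_from_nm` is hypothetical and is not taken from the earlier parts of this series.

```python
import math

def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar (assumed unit) to pIC50."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)

print(round(pic50_from_nm(100.0), 3))  # 100 nM -> 7.0
```

On this scale a larger pIC50 means a more potent compound, which is why the inactive compounds in the table have the lower values.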

In [3]:
import pandas as pd

In [5]:
data_path = '/content/bioactivity_data_preprocessed_final.csv'
df3 = pd.read_csv(data_path)

In [6]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL281957,CCN(CC)C/C=C/c1nc(O)c2c(ccc3nc(Nc4c(Cl)cccc4Cl...,active,484.431,6.54092,2.0,6.0,9.397940
1,CHEMBL207028,NC(=O)c1sc(-c2ccccc2)cc1N,inactive,218.281,2.09620,2.0,3.0,4.821023
2,CHEMBL386760,COc1ccc(N(C(=O)Oc2c(C)cccc2C)c2ccnc(Nc3cc(OC)c...,active,686.810,6.22484,1.0,12.0,7.000000
3,CHEMBL215943,Cc1ccc(C(=O)Nc2cccc(C(F)(F)F)c2)cc1-c1ccc2nc(N...,active,535.570,5.62042,2.0,6.0,6.263603
4,CHEMBL373882,CNc1ncnc(-c2cccnc2Oc2ccc(F)c(C(=O)Nc3cc(C(F)(F...,inactive,598.605,5.56570,2.0,9.0,4.602060
...,...,...,...,...,...,...,...,...
4286,CHEMBL3936761,C=CC(=O)N1CCC([C@@H]2CCNc3c(C(N)=O)c(-c4ccc(Oc...,active,471.561,4.22260,2.0,6.0,6.443697
4287,CHEMBL3936761,C=CC(=O)N1CCC([C@@H]2CCNc3c(C(N)=O)c(-c4ccc(Oc...,active,471.561,4.22260,2.0,6.0,8.744727
4288,CHEMBL3707348,CC#CC(=O)N1CCC[C@H]1c1nc(-c2ccc(C(=O)Nc3ccccn3...,active,465.517,3.31260,2.0,7.0,8.292430
4289,CHEMBL3936761,C=CC(=O)N1CCC([C@@H]2CCNc3c(C(N)=O)c(-c4ccc(Oc...,active,471.561,4.22260,2.0,6.0,9.522879


In [7]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [8]:
! cat molecule.smi | head -5

CCN(CC)C/C=C/c1nc(O)c2c(ccc3nc(Nc4c(Cl)cccc4Cl)n(C)c32)c1C	CHEMBL281957
NC(=O)c1sc(-c2ccccc2)cc1N	CHEMBL207028
COc1ccc(N(C(=O)Oc2c(C)cccc2C)c2ccnc(Nc3cc(OC)c(OCCCN4CCN(C)CC4)c(OC)c3)n2)c(OC)c1	CHEMBL386760
Cc1ccc(C(=O)Nc2cccc(C(F)(F)F)c2)cc1-c1ccc2nc(NCCN3CCOCC3)ncc2c1	CHEMBL215943
CNc1ncnc(-c2cccnc2Oc2ccc(F)c(C(=O)Nc3cc(C(F)(F)F)ccc3N(C)CCCN(C)C)c2)n1	CHEMBL373882


In [9]:
! cat molecule.smi | wc -l

4291
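
The same row-count check can be done in Python. The sketch below is illustrative only: `write_smi` is a hypothetical helper, and the two-row `toy` frame stands in for `df3` using SMILES strings printed above. It refuses to write the file if any SMILES is missing and returns the number of lines written.

```python
import os
import tempfile

import pandas as pd

def write_smi(df, path):
    """Write SMILES and ID as a tab-separated, header-less file; return the line count."""
    if df["canonical_smiles"].isna().any():
        raise ValueError("missing SMILES found; drop or fix these rows first")
    df[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        path, sep="\t", index=False, header=False
    )
    with open(path) as handle:
        return sum(1 for _ in handle)

# Toy stand-in for df3 (hypothetical two-row subset)
toy = pd.DataFrame({
    "canonical_smiles": [
        "NC(=O)c1sc(-c2ccccc2)cc1N",
        "Cc1ccc(C(=O)Nc2cccc(C(F)(F)F)c2)cc1-c1ccc2nc(NCCN3CCOCC3)ncc2c1",
    ],
    "molecule_chembl_id": ["CHEMBL207028", "CHEMBL215943"],
})

with tempfile.TemporaryDirectory() as tmp:
    n_lines = write_smi(toy, os.path.join(tmp, "toy.smi"))

print(n_lines == len(toy))  # one line per molecule is expected
```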


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
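
As the flag names suggest, the script asks PaDEL to remove salts, standardize nitro groups, and compute the fingerprints defined in `PubchemFingerprinter.xml` for the structure files found in `./`, writing the result to `descriptors_output.csv`. The sketch below rebuilds the same command line in Python so that individual arguments are easier to vary; `build_padel_command` and its parameter names are hypothetical, and only the flags already present in `padel.sh` are used.

```python
import shlex

def build_padel_command(
    jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    descriptor_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
    heap="1G",
):
    """Return the PaDEL command from padel.sh as an argument list."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", jar,
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", descriptor_xml,
        "-dir", input_dir,
        "-file", output_csv,
    ]

cmd = build_padel_command()
print(shlex.join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# import subprocess
# subprocess.run(cmd, check=True)
```

Swapping `descriptor_xml` for one of the other XML files unpacked from `padel.zip` (for example `MACCSFingerprinter.xml`) would presumably select a different fingerprint type, although that is not tested here.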


In [11]:
! bash padel.sh

Processing CHEMBL207028 in molecule.smi (2/4291). 
Processing CHEMBL281957 in molecule.smi (1/4291). 
Processing CHEMBL386760 in molecule.smi (3/4291). Average speed: 10.61 s/mol.
Processing CHEMBL215943 in molecule.smi (4/4291). Average speed: 6.02 s/mol.
Processing CHEMBL373882 in molecule.smi (5/4291). Average speed: 4.47 s/mol.
Processing CHEMBL230761 in molecule.smi (6/4291). Average speed: 3.47 s/mol.
Processing CHEMBL233338 in molecule.smi (7/4291). Average speed: 2.90 s/mol.
Processing CHEMBL246356 in molecule.smi (8/4291). Average speed: 2.45 s/mol.
Processing CHEMBL249097 in molecule.smi (9/4291). Average speed: 2.17 s/mol.
Processing CHEMBL248676 in molecule.smi (10/4291). Average speed: 1.93 s/mol.
Processing CHEMBL398422 in molecule.smi (11/4291). Average speed: 1.74 s/mol.
Processing CHEMBL271190 in molecule.smi (12/4291). Average speed: 1.61 s/mol.
Processing CHEMBL410643 in molecule.smi (13/4291). Average speed: 1.47 s/mol.
Processing CHEMBL410295 in molecule.smi (14/42

In [12]:
! ls -l

total 33636
-rw-r--r-- 1 root root   649276 Aug 13 23:06 bioactivity_data_preprocessed_final.csv
-rw-r--r-- 1 root root  7640674 Aug 13 23:35 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Aug 13 23:03 __MACOSX
-rw-r--r-- 1 root root   359391 Aug 13 23:08 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Aug 13 23:03 padel.sh
-rw-r--r-- 1 root root 25768637 Aug 13 23:03 padel.zip
drwxr-xr-x 1 root root     4096 Aug 10 19:19 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [13]:
df3_X = pd.read_csv('descriptors_output.csv')

In [14]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL207028,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL281957,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL386760,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL215943,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL373882,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4286,CHEMBL5175977,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4287,CHEMBL3707348,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4288,CHEMBL3936761,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4289,CHEMBL460702,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
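
Note that the `Name` column starts with CHEMBL207028, whereas `molecule.smi` starts with CHEMBL281957, so the PaDEL output is not necessarily in the same row order as the input. A minimal sketch of how that could be checked before discarding `Name` is shown below; the two short Series are toy stand-ins for `df3['molecule_chembl_id']` and `df3_X['Name']`.

```python
import pandas as pd

# Toy stand-ins: first three IDs in input order and in PaDEL output order
input_ids = pd.Series(["CHEMBL281957", "CHEMBL207028", "CHEMBL386760"])
output_ids = pd.Series(["CHEMBL207028", "CHEMBL281957", "CHEMBL386760"])

# Element-wise comparison of IDs at the same row position
mismatches = input_ids.ne(output_ids)
n_mismatched = int(mismatches.sum())

print(n_mismatched)  # 2
```

A non-zero count means the descriptor rows should be re-aligned by compound ID rather than by row position.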


In [15]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4286,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4287,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4288,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4289,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

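### **Select the pIC50 column**

The pre-processed file already contains pIC50 values, so no conversion is needed here; we only select the column.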

In [16]:
df3_Y = df3['pIC50']
df3_Y

0       9.397940
1       4.821023
2       7.000000
3       6.263603
4       4.602060
          ...   
4286    6.443697
4287    8.744727
4288    8.292430
4289    9.522879
4290    8.698970
Name: pIC50, Length: 4291, dtype: float64

## **Combining X and Y variables**

In [17]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.397940
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.821023
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.263603
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.602060
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4286,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.443697
4287,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.744727
4288,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.292430
4289,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.522879
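
`pd.concat(..., axis=1)` pairs rows by index label, not by compound, so it only gives correct X/Y pairs when the descriptor rows are in the same order as the bioactivity rows. If they are not, one option is to join on the compound ID instead. The sketch below is an assumption-laden alternative, not the method used in this notebook: `align_fingerprints` is a hypothetical helper, it needs the `Name` column to still be present, and it assumes that a repeated ChEMBL ID always maps to the same structure and therefore the same fingerprint.

```python
import pandas as pd

def align_fingerprints(bioactivity, fingerprints):
    """Attach fingerprints to each bioactivity row by compound ID."""
    # Keep one fingerprint row per ID (assumes repeated IDs share one structure)
    fp_unique = fingerprints.drop_duplicates(subset="Name")
    merged = bioactivity[["molecule_chembl_id", "pIC50"]].merge(
        fp_unique,
        left_on="molecule_chembl_id",
        right_on="Name",
        how="left",               # keep every bioactivity row, in its original order
        validate="many_to_one",   # fail loudly if an ID still maps to several rows
    )
    # Fingerprint columns first, pIC50 last
    fp_cols = [c for c in fp_unique.columns if c != "Name"]
    return merged[fp_cols + ["pIC50"]]

# Toy data: IDs "A" and "B" are placeholders, and "A" is measured twice
bio = pd.DataFrame({"molecule_chembl_id": ["A", "B", "A"],
                    "pIC50": [9.4, 4.8, 8.7]})
fps = pd.DataFrame({"Name": ["B", "A", "A"],
                    "PubchemFP0": [0, 1, 1],
                    "PubchemFP1": [1, 1, 1]})

aligned = align_fingerprints(bio, fps)
print(aligned)
```

Compounds for which no fingerprint row exists would end up with missing values in the fingerprint columns, so they are easy to spot afterwards.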


In [18]:
dataset3.to_csv('bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
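
Because column-wise `pd.concat` pads unmatched index labels with NaN, it can be worth counting incomplete rows in the combined table before it is used for modelling. The sketch below illustrates the idea on toy frames of deliberately different lengths; the variable names are hypothetical.

```python
import pandas as pd

# Toy X has two rows, toy Y has three, so the last combined row has no descriptors
toy_X = pd.DataFrame({"PubchemFP0": [1, 1], "PubchemFP1": [0, 1]})
toy_Y = pd.Series([9.4, 4.8, 8.7], name="pIC50")

combined = pd.concat([toy_X, toy_Y], axis=1)

# Rows with at least one missing value
n_incomplete = int(combined.isna().any(axis=1).sum())
print(len(combined), n_incomplete)  # 3 1
```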

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**