# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will assemble these into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2025-01-07 00:52:37--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2025-01-07 00:52:37--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2025-01-07 00:52:38 (165 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2025-01-07 00:52:38--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf
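The `unzip` output above also extracts macOS metadata (`__MACOSX/` entries and `.DS_Store` files) that PaDEL does not need. As a minimal sketch, assuming you would rather skip those entries, the archive can be unpacked with Python's standard `zipfile` module instead. The helper name `extract_without_macos_metadata` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_without_macos_metadata(zip_path, dest="."):
    """Extract a zip archive, skipping __MACOSX/ entries and .DS_Store files.

    Returns the list of archive member names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            member = Path(name)
            # macOS Finder metadata; not needed to run PaDEL-Descriptor
            if "__MACOSX" in member.parts or member.name == ".DS_Store":
                continue
            archive.extract(name, dest)
            extracted.append(name)
    return extracted
```

For example, `extract_without_macos_metadata('padel.zip')` would unpack the `PaDEL-Descriptor/` folder into the current directory without creating `__MACOSX/`.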

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model. Note that the cells below read the data from `ErbB2_PIC50.csv`, so that file must be present in the working directory.

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2020-06-09 17:00:26--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’


2020-06-09 17:00:26 (9.21 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’ saved [655414/655414]



In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('ErbB2_PIC50.csv')

In [5]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,active,383.814,4.45034,3.0,4.0,6.522879
1,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,active,383.814,4.45034,3.0,4.0,5.602060
2,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,active,482.903,3.61432,3.0,6.0,6.397940
3,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,active,482.903,3.61432,3.0,6.0,5.917215
4,CHEMBL67057,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,active,559.630,4.82482,3.0,7.0,7.000000
...,...,...,...,...,...,...,...,...
1852,CHEMBL2316777,Cc1ccc(C2=NN(c3nc(-c4ccc(Br)cc4)cs3)C(c3ccc4c(...,active,431.529,3.05820,4.0,6.0,5.651695
1853,CHEMBL2311747,Brc1ccc(C2=NN(c3nc(-c4ccc(Br)cc4)cs3)C(c3ccc4c...,inactive,358.478,4.97970,1.0,4.0,6.744727
1854,CHEMBL2316794,COc1ccc(-c2csc(N3N=C(c4ccc(Br)cc4)CC3c3ccc4c(c...,inactive,358.478,4.97970,1.0,4.0,6.070581
1855,CHEMBL2316793,Clc1ccc(C2=NN(c3nc(-c4ccc(Br)cc4)cs3)C(c3ccc4c...,inactive,443.584,5.03210,3.0,5.0,5.623423


In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [7]:
! cat molecule.smi | head -5

Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1	CHEMBL68920
Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1	CHEMBL68920
Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(Nc3ccc(F)c(Cl)c3)c21	CHEMBL69960
Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(Nc3ccc(F)c(Cl)c3)c21	CHEMBL69960
Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(Nc3ccc4c(ccn4Cc4ccccc4)c3)c21	CHEMBL67057


In [8]:
! cat molecule.smi | wc -l

1857
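The line count above confirms that one line was written per molecule. A slightly stricter check, sketched below under the assumption that the file was written exactly as in the `to_csv` call above (tab-separated, no header, SMILES first and ChEMBL ID second), reads the file back and compares it with the source DataFrame. The helper names `read_smi` and `smi_matches_source` are hypothetical.

```python
import pandas as pd


def read_smi(path):
    """Read a tab-separated SMILES file with two columns: SMILES, molecule name."""
    return pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["smiles", "name"],
        dtype=str,
        keep_default_na=False,  # keep every field as literal text
    )


def smi_matches_source(path, source,
                       smiles_col="canonical_smiles",
                       id_col="molecule_chembl_id"):
    """Return True if the .smi file lists the same molecules, in the same order, as `source`."""
    smi = read_smi(path)
    return (
        len(smi) == len(source)
        and smi["smiles"].tolist() == source[smiles_col].tolist()
        and smi["name"].tolist() == source[id_col].tolist()
    )
```

In this notebook the call would be `smi_matches_source('molecule.smi', df3)`.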


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [9]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
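The script is a single `java` call: it runs the PaDEL-Descriptor jar in headless mode, asks for fingerprints of the type defined in `PubchemFingerprinter.xml`, reads input molecules from the current directory, and writes the result to `descriptors_output.csv`. As a rough sketch, the same command can be assembled in Python, which makes it easier to swap in another fingerprint definition file. The flags are copied from `padel.sh` above; the helper names `padel_command` and `run_padel` and their parameters are hypothetical.

```python
import subprocess


def padel_command(fingerprint_xml="PubchemFingerprinter.xml",
                  output_csv="descriptors_output.csv",
                  padel_dir="./PaDEL-Descriptor",
                  input_dir="./",
                  heap="1G"):
    """Build the PaDEL-Descriptor command line used in padel.sh as an argument list."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", f"{padel_dir}/{fingerprint_xml}",
        "-dir", input_dir,
        "-file", output_csv,
    ]


def run_padel(**kwargs):
    """Run PaDEL-Descriptor; raises CalledProcessError if java exits with an error."""
    return subprocess.run(padel_command(**kwargs), check=True)
```

For instance, `run_padel(fingerprint_xml='MACCSFingerprinter.xml', output_csv='maccs_output.csv')` would use the MACCS definition file that appears in the unzip listing, assuming Java is installed and the jar is in place.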


In [10]:
! bash padel.sh

Processing CHEMBL68920 in molecule.smi (1/1857). 
Processing CHEMBL68920 in molecule.smi (2/1857). 
Processing CHEMBL69960 in molecule.smi (3/1857). Average speed: 8.15 s/mol.
Processing CHEMBL69960 in molecule.smi (4/1857). Average speed: 4.14 s/mol.
Processing CHEMBL65848 in molecule.smi (6/1857). Average speed: 2.44 s/mol.
Processing CHEMBL67057 in molecule.smi (5/1857). Average speed: 3.23 s/mol.
Processing CHEMBL65848 in molecule.smi (7/1857). Average speed: 2.30 s/mol.
Processing CHEMBL69629 in molecule.smi (9/1857). Average speed: 1.80 s/mol.
Processing CHEMBL69629 in molecule.smi (8/1857). Average speed: 1.99 s/mol.
Processing CHEMBL66570 in molecule.smi (10/1857). Average speed: 1.63 s/mol.
Processing CHEMBL66570 in molecule.smi (11/1857). Average speed: 1.50 s/mol.
Processing CHEMBL305194 in molecule.smi (12/1857). Average speed: 1.39 s/mol.
Processing CHEMBL305194 in molecule.smi (13/1857). Average speed: 1.31 s/mol.
Processing CHEMBL305246 in molecule.smi (15/1857). Average

In [11]:
! ls -l

total 28796
-rw-r--r-- 1 root root  3311261 Jan  7 01:11 descriptors_output.csv
-rw-r--r-- 1 root root   251282 Jan  7 01:01 ErbB2_PIC50.csv
drwxr-xr-x 3 root root     4096 Jan  7 01:01 __MACOSX
-rw-r--r-- 1 root root   127099 Jan  7 01:02 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Jan  7 00:52 padel.sh
-rw-r--r-- 1 root root 25768637 Jan  7 00:52 padel.zip
drwxr-xr-x 1 root root     4096 Jan  3 14:22 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [12]:
df3_X = pd.read_csv('descriptors_output.csv')

In [13]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL68920,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL68920,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL69960,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL69960,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL65848,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1852,CHEMBL2316777,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1853,CHEMBL2311747,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1854,CHEMBL2316794,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1855,CHEMBL2316793,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
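
Note that row 4 in the table above is CHEMBL65848, whereas row 4 of `df3` is CHEMBL67057, and the PaDEL log shows molecules finishing out of order (6 before 5, 9 before 8). The rows of `descriptors_output.csv` are therefore not guaranteed to follow the order of `molecule.smi`. The sketch below reorders the descriptor rows to match a list of ChEMBL IDs before the `Name` column is dropped. It assumes that each ChEMBL ID always maps to the same SMILES string, so repeated IDs share one fingerprint; the helper name `align_descriptors` is hypothetical.

```python
import pandas as pd


def align_descriptors(descriptors, ids, name_col="Name"):
    """Return descriptor rows reordered to follow `ids`, one row per entry in `ids`.

    Assumes rows sharing the same name carry identical descriptor values,
    so only the first occurrence of each name is kept for the lookup.
    """
    unique = descriptors.drop_duplicates(subset=name_col).set_index(name_col)
    missing = pd.Index(ids).difference(unique.index)
    if len(missing) > 0:
        raise KeyError(f"No descriptors found for: {list(missing)}")
    # Label-based lookup repeats a row whenever its ID is repeated in `ids`
    return unique.loc[list(ids)].reset_index(drop=True)
```

Used here, `df3_X = align_descriptors(df3_X, df3['molecule_chembl_id'])` would return the fingerprint columns only, in the same row order as `df3`.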


In [14]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1852,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1853,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1854,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1855,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values were already computed in the earlier parts of this series, so here we only need to select the `pIC50` column.

In [15]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,6.522879
1,5.602060
2,6.397940
3,5.917215
4,7.000000
...,...
1852,5.651695
1853,6.744727
1854,6.070581
1855,5.623423
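
As a reminder of what these values mean, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so a higher pIC50 indicates a more potent compound. The sketch below assumes the IC50 is given in nanomolar (nM); the function name `ic50_nm_to_pic50` is hypothetical and is not the code used in the earlier parts.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar (nM) to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    ic50_molar = ic50_nm * 1e-9  # nM -> mol/L
    return -math.log10(ic50_molar)
```

Under this definition, a pIC50 of 7.0 (as in row 4 above) corresponds to an IC50 of 100 nM.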


## **Combining the X and Y variables**

In [16]:
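# pd.concat(axis=1) pairs rows by index (0, 1, 2, ...), so df3_X must list the
# molecules in the same order as df3; PaDEL's output order can differ from its input order.
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3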

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.522879
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.602060
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.397940
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.917215
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1852,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.651695
1853,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.744727
1854,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.070581
1855,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.623423


In [17]:
dataset3.to_csv('ErbB2_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
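
Before relying on the saved file, it can help to run a few sanity checks on the combined table. The sketch below assumes the layout shown above (881 binary columns named `PubchemFP0` to `PubchemFP880` plus a `pIC50` column); the helper name `check_dataset` is hypothetical.

```python
import pandas as pd


def check_dataset(dataset, target="pIC50", n_bits=881, prefix="PubchemFP"):
    """Return a list of problems found in the combined dataset (empty if none)."""
    problems = []
    fp_cols = [c for c in dataset.columns if c.startswith(prefix)]
    if len(fp_cols) != n_bits:
        problems.append(f"expected {n_bits} fingerprint columns, found {len(fp_cols)}")
    if target not in dataset.columns:
        problems.append(f"missing target column '{target}'")
    if dataset.isna().any().any():
        problems.append("dataset contains missing values")
    # Fingerprint bits should be 0 or 1 only
    if not dataset[fp_cols].isin([0, 1]).all().all():
        problems.append("fingerprint columns contain values other than 0 and 1")
    return problems
```

An empty list from `check_dataset(dataset3)` suggests the table has the expected shape; a "missing values" entry would typically point to a row-count mismatch between `df3_X` and `df3_Y`.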

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**