# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for the subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [18]:
! wget https://raw.githubusercontent.com/dataprofessor/bioinformatics/padel/padel.zip
! wget https://raw.githubusercontent.com/dataprofessor/bioinformatics/padel/padel.sh

--2025-07-07 20:39:11--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/padel/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip.1’


2025-07-07 20:39:12 (252 MB/s) - ‘padel.zip.1’ saved [25768637/25768637]

--2025-07-07 20:39:12--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/padel/padel.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231 [text/plain]
Saving to: ‘padel.sh.1’


2025-07-07 20:39:12 (8.15 MB/s) - ‘padel.sh.1’ saved [23

In [20]:
! unzip padel.zip.1

Archive:  padel.zip.1
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  i

## **Load bioactivity data**

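Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model. Note that the download cell below fetches the acetylcholinesterase example file, whereas the cells that follow read an `Enoyl-[acyl-carrier-protein] reductase` file that must already be present in the working directory.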

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2020-06-09 17:00:26--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’


2020-06-09 17:00:26 (9.21 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’ saved [655414/655414]



In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('Enoyl-[acyl-carrier-protein] reductase_04_bioactivity_data_3class_pIC50.csv')

In [5]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL217926,O=C(Nc1ccccc1)C1CC(=O)N(C2CCCCC2)C1,inactive,286.375,2.80630,1.0,2.0,4.972243
1,1,CHEMBL216547,O=C(Nc1ccccc1Br)C1CC(=O)N(C2CCCCC2)C1,inactive,365.271,3.56880,1.0,2.0,4.000000
2,2,CHEMBL213720,O=C(Nc1ccc2c(c1)OCCO2)C1CC(=O)N(C2CCCCC2)C1,inactive,344.411,2.57750,1.0,4.0,4.000000
3,3,CHEMBL217274,Cc1cccc(C)c1NC(=O)C1CC(=O)N(C2CCCCC2)C1,inactive,314.429,3.42314,1.0,2.0,4.000000
4,4,CHEMBL217773,O=C(Nc1ccc(Oc2ccccc2)cc1)C1CC(=O)N(C2CCCCC2)C1,inactive,378.472,4.59860,1.0,3.0,4.000000
...,...,...,...,...,...,...,...,...,...
368,368,CHEMBL5398235,Cc1cc(Nc2ccc(N3CCC(c4ccccc4)CC3)cc2)c2cc(Br)cc...,inactive,472.430,7.43332,1.0,3.0,4.301030
369,369,CHEMBL5430244,Cc1cc(Nc2ccc(-n3cccn3)cc2)c2cc(Br)ccc2n1,intermediate,379.261,5.23502,1.0,4.0,5.314258
370,370,CHEMBL5434416,Cc1cc(Nc2ccc(N3CCCC3CO)cc2)c2cc(Br)ccc2n1,inactive,412.331,5.01042,2.0,4.0,4.301030
371,371,CHEMBL5406599,CCc1cc(Nc2ccc(N3CCCCC3)cc2)c2cc(Br)ccc2n1,intermediate,410.359,6.29360,1.0,3.0,5.812479


In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [7]:
! cat molecule.smi | head -5

O=C(Nc1ccccc1)C1CC(=O)N(C2CCCCC2)C1	CHEMBL217926
O=C(Nc1ccccc1Br)C1CC(=O)N(C2CCCCC2)C1	CHEMBL216547
O=C(Nc1ccc2c(c1)OCCO2)C1CC(=O)N(C2CCCCC2)C1	CHEMBL213720
Cc1cccc(C)c1NC(=O)C1CC(=O)N(C2CCCCC2)C1	CHEMBL217274
O=C(Nc1ccc(Oc2ccccc2)cc1)C1CC(=O)N(C2CCCCC2)C1	CHEMBL217773
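As a quick check, the tab-separated file can be read back into pandas to confirm that it holds one SMILES/ID pair per compound. The sketch below is only an illustration: it uses a two-row toy DataFrame built from the first two rows shown above and a temporary directory instead of the real `df3`, and the helper name `write_smi` is hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_smi(df, path):
    """Write SMILES and molecule ID as a headerless, tab-separated .smi file."""
    df[['canonical_smiles', 'molecule_chembl_id']].to_csv(
        path, sep='\t', index=False, header=False
    )


# Toy stand-in for df3, using the first two rows shown above.
toy = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL217926', 'CHEMBL216547'],
    'canonical_smiles': [
        'O=C(Nc1ccccc1)C1CC(=O)N(C2CCCCC2)C1',
        'O=C(Nc1ccccc1Br)C1CC(=O)N(C2CCCCC2)C1',
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    smi_path = os.path.join(tmp, 'molecule.smi')
    write_smi(toy, smi_path)
    # The file has no header row, so column names are supplied when reading.
    check = pd.read_csv(smi_path, sep='\t', header=None,
                        names=['canonical_smiles', 'molecule_chembl_id'])

# The number of rows read back should equal the number of input rows.
print(len(check) == len(toy))
```

With the real data, the same comparison against `len(df3)` is a cheap way to spot a stale or truncated `molecule.smi` before running PaDEL.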


In [None]:
! cat molecule.smi | wc -l

4695


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [23]:
! cat padel.sh.1

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
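If we prefer to drive PaDEL from Python rather than from the shell script, the same command can be assembled as an argument list. This is a minimal sketch: the function `build_padel_command` and its parameter names are hypothetical, and the flags are copied verbatim from `padel.sh` without adding any others. Judging from the log further down, PaDEL picks up `molecule.smi` from the directory passed to `-dir`.

```python
import shlex


def build_padel_command(jar='./PaDEL-Descriptor/PaDEL-Descriptor.jar',
                        descriptor_xml='./PaDEL-Descriptor/PubchemFingerprinter.xml',
                        input_dir='./',
                        output_file='descriptors_output.csv',
                        heap='1G'):
    """Return the PaDEL command from padel.sh as an argument list."""
    return [
        'java', f'-Xms{heap}', f'-Xmx{heap}', '-Djava.awt.headless=true',
        '-jar', jar,
        '-removesalt', '-standardizenitro', '-fingerprints',
        '-descriptortypes', descriptor_xml,
        '-dir', input_dir,
        '-file', output_file,
    ]


cmd = build_padel_command()
# Print the command as a single shell-style line for comparison with padel.sh.
print(shlex.join(cmd))
# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
#   import subprocess; subprocess.run(cmd, check=True)
```

To try a different fingerprint, pass another of the XML files listed in the unzip output, for example `descriptor_xml='./PaDEL-Descriptor/MACCSFingerprinter.xml'`.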


In [24]:
! bash padel.sh.1

Processing CHEMBL217926 in molecule.smi (1/373). 
Processing CHEMBL216547 in molecule.smi (2/373). 
Processing CHEMBL213720 in molecule.smi (3/373). Average speed: 2.65 s/mol.
Processing CHEMBL217274 in molecule.smi (4/373). Average speed: 1.45 s/mol.
Processing CHEMBL217773 in molecule.smi (5/373). Average speed: 1.28 s/mol.
Processing CHEMBL217273 in molecule.smi (6/373). Average speed: 1.01 s/mol.
Processing CHEMBL265016 in molecule.smi (8/373). Average speed: 0.94 s/mol.
Processing CHEMBL216781 in molecule.smi (7/373). Average speed: 1.05 s/mol.
Processing CHEMBL217524 in molecule.smi (9/373). Average speed: 0.90 s/mol.
Processing CHEMBL384149 in molecule.smi (10/373). Average speed: 0.79 s/mol.
Processing CHEMBL216339 in molecule.smi (11/373). Average speed: 0.78 s/mol.
Processing CHEMBL216704 in molecule.smi (12/373). Average speed: 0.73 s/mol.
Processing CHEMBL386324 in molecule.smi (13/373). Average speed: 0.71 s/mol.
Processing CHEMBL216807 in molecule.smi (14/373). Average sp

In [25]:
! ls -l

total 26272
-rw-r--r-- 1 root root   674376 Jul  7 20:42  descriptors_output.csv
-rw-r--r-- 1 root root    51461 Jul  7 20:17 'Enoyl-[acyl-carrier-protein] reductase_04_bioactivity_data_3class_pIC50.csv'
drwxr-xr-x 3 root root     4096 Jul  7 20:39  __MACOSX
-rw-r--r-- 1 root root    24222 Jul  7 20:19  molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020  PaDEL-Descriptor
-rw-r--r-- 1 root root   178484 Jul  7 20:25  padel.sh
-rw-r--r-- 1 root root      231 Jul  7 20:39  padel.sh.1
-rw-r--r-- 1 root root   178242 Jul  7 20:25  padel.zip
-rw-r--r-- 1 root root 25768637 Jul  7 20:39  padel.zip.1
drwxr-xr-x 1 root root     4096 Jul  4 13:34  sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [26]:
df3_X = pd.read_csv('descriptors_output.csv')

In [27]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL217926,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL216547,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL213720,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL217274,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL217273,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,CHEMBL5398235,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
369,CHEMBL5430244,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
370,CHEMBL5434416,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
371,CHEMBL5406599,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
369,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
370,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
371,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Select the pIC50 column**

In [29]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,4.972243
1,4.000000
2,4.000000
3,4.000000
4,4.000000
...,...
368,4.301030
369,5.314258
370,4.301030
371,5.812479
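The pIC50 column used here comes ready-made from the earlier parts of the series. As a reminder of what the values mean, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. The sketch below assumes the IC50 values are given in nM; the function name `ic50_nm_to_pic50` and the two input values are illustrative and not taken from the dataset files.

```python
import numpy as np
import pandas as pd


def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 values in nM to pIC50 = -log10(IC50 in mol/L)."""
    ic50_molar = pd.Series(ic50_nm, dtype=float) * 1e-9  # nM -> M
    return -np.log10(ic50_molar)


# Illustrative inputs (assumed values, in nM).
pic50 = ic50_nm_to_pic50([100000.0, 50000.0])
print(pic50.round(6).tolist())
```

Under this definition an IC50 of 100,000 nM maps to a pIC50 of 4.0 and 50,000 nM to roughly 4.30, so a larger pIC50 means a more potent compound.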


## **Combining X and Y variable**

In [30]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.972243
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
369,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.314258
370,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
371,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.812479
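`pd.concat(..., axis=1)` pairs rows by index, which here means by position. That is only safe if `descriptors_output.csv` lists the molecules in the same order as `df3`. The PaDEL log above shows molecule 8 being processed before molecule 7, and row 4 of the descriptor table is CHEMBL217273 while row 4 of `df3` is CHEMBL217773, so the order may not match. A more defensive sketch, assuming we still have the `Name` column, joins the two tables on the molecule ID instead. The data below is a toy stand-in with made-up IDs and values.

```python
import pandas as pd

# Toy stand-in for df3 (bioactivity table); pIC50 values are illustrative.
bio = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL_A', 'CHEMBL_B', 'CHEMBL_C'],
    'pIC50': [4.97, 4.00, 5.31],
})

# Toy stand-in for descriptors_output.csv, deliberately in a different order.
fp = pd.DataFrame({
    'Name': ['CHEMBL_B', 'CHEMBL_A', 'CHEMBL_C'],
    'PubchemFP0': [1, 0, 1],
    'PubchemFP1': [0, 1, 1],
})

# Join on the molecule ID so each fingerprint row gets its own pIC50.
# validate='one_to_one' raises an error if an ID appears more than once.
merged = fp.merge(
    bio[['molecule_chembl_id', 'pIC50']],
    left_on='Name', right_on='molecule_chembl_id',
    how='inner', validate='one_to_one',
)

# Drop the ID columns to leave the fingerprint bits plus the pIC50 target.
dataset = merged.drop(columns=['Name', 'molecule_chembl_id'])
print(dataset)
```

With the real data, this would be applied to the descriptor table before its `Name` column is dropped. If the same ChEMBL ID occurs more than once, the `validate` check will fail and the duplicates need to be resolved first.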


In [31]:
dataset3.to_csv('Enoyl-[acyl-carrier-protein] reductase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
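Before moving on to model building, it can be worth reloading the saved file, checking it for missing values, and splitting it back into the X and Y parts. The sketch below does this on a toy table written to a temporary directory; the file name `toy_pubchem_fp.csv` and the values are placeholders, not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for dataset3: two fingerprint bits plus the pIC50 column.
toy = pd.DataFrame({
    'PubchemFP0': [1, 1, 0],
    'PubchemFP1': [0, 1, 1],
    'pIC50': [4.97, 4.00, 5.31],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_pubchem_fp.csv')
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Missing values often signal that X and Y had different lengths or indexes.
n_missing = int(reloaded.isna().sum().sum())

# Split back into the feature matrix and the target vector.
X = reloaded.drop(columns=['pIC50'])
Y = reloaded['pIC50']
print(X.shape, Y.shape, n_missing)
```

A non-zero count of missing values in the real file would suggest that the fingerprint table and the pIC50 column did not line up row for row when they were combined.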

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**

In [None]:
! unzip padel.zip