# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [3]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-05-08 10:41:26--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-05-08 10:41:26--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip.1’


2024-05-08 10:41:26 (151 MB/s) - ‘padel.zip.1’ saved [25768637/25768637]

--2024-05-08 10:41:27--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (

In [4]:
! unzip padel.zip

Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

## **Load bioactivity data**

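Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that serve as the target for building a regression model. Note that the download cell below fetches the acetylcholinesterase version of this file, whereas the cells that follow read `alpha_amylase_04_bioactivity_data_3class_pIC50.csv`, a file with the same layout that must already be present in the working directory.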

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2020-06-09 17:00:26--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’


2020-06-09 17:00:26 (9.21 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’ saved [655414/655414]



In [5]:
import pandas as pd

In [6]:
df3 = pd.read_csv('alpha_amylase_04_bioactivity_data_3class_pIC50.csv')

In [7]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL2057965,CCC(CC)CN1C(=O)S/C(=C\c2ccc(O)c(C(F)(F)F)c2)C1=O,intermediate,373.396,4.88350,1.0,4.0,12.806180
1,1,CHEMBL2058262,CC(CN1C(=O)S/C(=C\c2ccc(O)c(C(F)(F)F)c2)C1=O)(...,intermediate,495.363,6.35840,1.0,4.0,13.204120
2,2,CHEMBL479638,C/C(=N\O)c1ccc(-c2ccc(O)cc2)c(Cl)c1O,inactive,277.707,3.61640,3.0,4.0,12.479844
3,3,CHEMBL479837,COc1ccc(-c2ccc(/C(C)=N/O)c(O)c2Cl)cc1F,inactive,309.724,4.05850,2.0,4.0,12.314818
4,4,CHEMBL480612,COc1ccc(-c2ccc(O)c(/C=N/O)c2)cc1,inactive,243.262,2.87590,2.0,4.0,12.135934
...,...,...,...,...,...,...,...,...,...
1284,1284,CHEMBL4786111,O=C(CN1C(=O)S/C(=C/c2ccco2)C1=O)Nc1ccc(Cl)c(Cl)c1,inactive,397.239,4.26140,1.0,5.0,12.337833
1285,1285,CHEMBL4552180,Cc1c(NC(=O)c2cc(C(N)=O)nc3cc(F)ccc23)c(C(F)(F)...,intermediate,527.522,5.59462,2.0,5.0,13.425969
1286,1286,CHEMBL4445670,CN1CCN(c2cnc3c(C4CCN(C(=O)c5ccc(OC(F)(F)F)cc5N...,active,515.540,3.27720,1.0,8.0,14.961082
1287,1287,CHEMBL4455582,Cc1nc(C2CCN(C(=O)c3ccc(OC(F)(F)F)cc3N)CC2)c2nc...,intermediate,529.567,3.58562,1.0,8.0,12.698970


In [8]:
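# Keep one molecule per line: SMILES first, then the ChEMBL ID as the molecule name
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
# Write a tab-separated .smi file with no header row and no index column
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)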

In [9]:
! cat molecule.smi | head -5

CCC(CC)CN1C(=O)S/C(=C\c2ccc(O)c(C(F)(F)F)c2)C1=O	CHEMBL2057965
CC(CN1C(=O)S/C(=C\c2ccc(O)c(C(F)(F)F)c2)C1=O)(CC(F)(F)F)CC(F)(F)F	CHEMBL2058262
C/C(=N\O)c1ccc(-c2ccc(O)cc2)c(Cl)c1O	CHEMBL479638
COc1ccc(-c2ccc(/C(C)=N/O)c(O)c2Cl)cc1F	CHEMBL479837
COc1ccc(-c2ccc(O)c(/C=N/O)c2)cc1	CHEMBL480612


In [10]:
! cat molecule.smi | wc -l

1289
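Counting lines with `wc -l` confirms the number of records but not their shape. As a minimal sketch, the same kind of tab-separated file can be read back with pandas to check that every line has both a SMILES string and a name. The toy DataFrame and the column names `smiles` and `name` are assumptions made for illustration; the two molecules are copied from the output above.

```python
import io
import pandas as pd

# Toy stand-in for df3 (two molecules copied from the output above)
toy = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL479638', 'CHEMBL480612'],
    'canonical_smiles': [r'C/C(=N\O)c1ccc(-c2ccc(O)cc2)c(Cl)c1O',
                         r'COc1ccc(-c2ccc(O)c(/C=N/O)c2)cc1'],
})

# Write in the same layout as molecule.smi, but to an in-memory buffer
buffer = io.StringIO()
toy[['canonical_smiles', 'molecule_chembl_id']].to_csv(
    buffer, sep='\t', index=False, header=False)

# Read it back; 'smiles' and 'name' are illustrative column names
buffer.seek(0)
smi = pd.read_csv(buffer, sep='\t', header=None, names=['smiles', 'name'])

# Rows with a missing field, and molecule names that occur more than once
n_missing = int(smi.isna().any(axis=1).sum())
n_duplicate_names = int(smi['name'].duplicated().sum())
print(smi.shape, n_missing, n_duplicate_names)
```

On the real data, replacing the buffer with `'molecule.smi'` in `pd.read_csv` would run the same checks against the file written above.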


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [11]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
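The command in `padel.sh` can also be assembled from Python, which makes the individual arguments easier to see and to change. This is only a sketch: the function name `build_padel_command` and its parameters are hypothetical, and the flags are exactly those that appear in `padel.sh`, with nothing added.

```python
import shlex
import subprocess

def build_padel_command(descriptor_xml, output_csv, input_dir='./', heap='1G'):
    """Assemble the PaDEL-Descriptor call from padel.sh as an argument list."""
    return [
        'java', f'-Xms{heap}', f'-Xmx{heap}', '-Djava.awt.headless=true',
        '-jar', './PaDEL-Descriptor/PaDEL-Descriptor.jar',
        '-removesalt', '-standardizenitro', '-fingerprints',
        '-descriptortypes', descriptor_xml,  # XML file selecting the fingerprint type
        '-dir', input_dir,                   # directory containing the .smi file(s)
        '-file', output_csv,                 # where the descriptor table is written
    ]

cmd = build_padel_command('./PaDEL-Descriptor/PubchemFingerprinter.xml',
                          'descriptors_output.csv')
print(shlex.join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# subprocess.run(cmd, check=True)
```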


In [12]:
! bash padel.sh

Processing CHEMBL2057965 in molecule.smi (1/1289). 
Processing CHEMBL2058262 in molecule.smi (2/1289). 
Processing CHEMBL479638 in molecule.smi (3/1289). Average speed: 5.80 s/mol.
Processing CHEMBL479837 in molecule.smi (4/1289). Average speed: 5.90 s/mol.
Processing CHEMBL480612 in molecule.smi (5/1289). Average speed: 2.38 s/mol.
Processing CHEMBL449451 in molecule.smi (6/1289). Average speed: 2.41 s/mol.
Processing CHEMBL3092944 in molecule.smi (7/1289). Average speed: 1.61 s/mol.
Processing CHEMBL45068 in molecule.smi (8/1289). Average speed: 1.38 s/mol.
Processing CHEMBL537087 in molecule.smi (9/1289). Average speed: 1.31 s/mol.
Processing CHEMBL601041 in molecule.smi (10/1289). Average speed: 1.19 s/mol.
Processing CHEMBL589723 in molecule.smi (11/1289). Average speed: 1.11 s/mol.
Processing CHEMBL592044 in molecule.smi (12/1289). Average speed: 1.08 s/mol.
Processing CHEMBL591395 in molecule.smi (13/1289). Average speed: 1.00 s/mol.
Processing CHEMBL592550 in molecule.smi (14/1

In [14]:
! ls -l

total 27708
-rw-r--r-- 1 root root   184719 May  8 10:46 alpha_amylase_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 root root  2302429 May  8 10:55 descriptors_output.csv
drwxr-xr-x 3 root root     4096 May  8 10:36 __MACOSX
-rw-r--r-- 1 root root    86788 May  8 10:48 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 May  8 10:35 padel.sh
-rw-r--r-- 1 root root 25768637 May  8 10:35 padel.zip
drwxr-xr-x 1 root root     4096 May  6 13:19 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [15]:
df3_X = pd.read_csv('descriptors_output.csv')

In [16]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL2057965,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL2058262,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL479638,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL479837,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL480612,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1284,CHEMBL4793560,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1285,CHEMBL4445670,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1286,CHEMBL4552180,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1287,CHEMBL4455582,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1284,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1285,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1286,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1287,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


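### **Y variable**

The pIC50 values are already present in the bioactivity file loaded above, so this step only selects that column as the Y variable.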

In [18]:
df3_Y = df3['pIC50']
df3_Y

0       12.806180
1       13.204120
2       12.479844
3       12.314818
4       12.135934
          ...    
1284    12.337833
1285    13.425969
1286    14.961082
1287    12.698970
1288    12.892366
Name: pIC50, Length: 1289, dtype: float64
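For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nanomolar; the function name `pic50_from_nM` is hypothetical and is not taken from the earlier parts of this series.

```python
import math

def pic50_from_nM(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nM * 1e-9)

# A tenfold drop in IC50 raises pIC50 by one unit
for ic50 in (1.0, 1000.0, 10000.0):
    print(ic50, round(pic50_from_nM(ic50), 3))
```

By this definition an IC50 of 1 nM corresponds to a pIC50 of 9, so values above 12, like those in the output above, correspond to sub-picomolar IC50 values. It may be worth confirming the concentration units used upstream.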

## **Combining the X and Y variables**

In [19]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12.806180
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,13.204120
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12.479844
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12.314818
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12.135934
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1284,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12.337833
1285,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,13.425969
1286,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,14.961082
1287,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12.698970
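`pd.concat(..., axis=1)` pairs rows by index position, so it is only correct if `descriptors_output.csv` lists the molecules in the same order as `df3`. In the outputs above, rows 1285 and 1286 carry CHEMBL4552180 and CHEMBL4445670 in `df3` but appear in the opposite order in the descriptor table, so a positional concat may attach a pIC50 value to the wrong fingerprint. A minimal sketch of a safer alternative is to join on the molecule ID before dropping the `Name` column. The toy frames below reuse IDs and pIC50 values from the outputs above; the fingerprint bits and the two-column subset are illustrative.

```python
import pandas as pd

# Toy stand-in for df3
bioactivity = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL4552180', 'CHEMBL4445670', 'CHEMBL4455582'],
    'pIC50': [13.425969, 14.961082, 12.698970],
})

# Toy stand-in for descriptors_output.csv, with the first two rows swapped
descriptors = pd.DataFrame({
    'Name': ['CHEMBL4445670', 'CHEMBL4552180', 'CHEMBL4455582'],
    'PubchemFP0': [1, 1, 1],   # illustrative bits
    'PubchemFP2': [1, 0, 1],   # illustrative bits
})

# Join on the molecule ID instead of relying on row position;
# validate='one_to_one' raises an error if an ID is duplicated on either side
aligned = descriptors.merge(
    bioactivity[['molecule_chembl_id', 'pIC50']],
    left_on='Name', right_on='molecule_chembl_id',
    how='inner', validate='one_to_one',
)

# Drop the ID columns only after the rows are aligned
dataset = aligned.drop(columns=['Name', 'molecule_chembl_id'])
print(dataset)
```

If the real data contains repeated ChEMBL IDs, the `validate` check will fail, and the duplicates would need to be resolved before merging.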


In [22]:
dataset3.to_csv('alpha_amylase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**