# **Bioinformatics Project - Computational Drug Discovery [Part 2] Descriptor Calculation and Dataset Preparation**

Nusrat Jahan

In this Jupyter notebook, we will be building a real-life **data science project**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 2**, we will calculate molecular descriptors, which are essentially **quantitative descriptions** of the compounds in the dataset. Finally, we will assemble these into a dataset for model building in Part 3.

---

## **Download PaDEL-Descriptor**

Here we use PaDEL-Descriptor as the software for calculating molecular descriptors.

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-09-28 22:29:26--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-09-28 22:29:26--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-09-28 22:29:27 (306 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-09-28 22:29:27--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Part 1 of this Bioinformatics Project series. Here we will use the **PLK1_RO5_pIC50.csv** file, which contains the pIC50 values that we will use for building a regression model.

In [3]:
import pandas as pd

In [5]:
df = pd.read_csv('/content/PLK1_RO5_pIC50.csv')

In [6]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,STATUS,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,inactive,291.354,3.62150,2.0,2.0,5.000000
1,CHEMBL199996,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12,inactive,315.358,2.67572,4.0,4.0,4.698970
2,CHEMBL199658,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12,inactive,334.788,2.93742,3.0,4.0,4.000000
3,CHEMBL199657,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12,inactive,334.788,2.93742,3.0,4.0,4.000000
4,CHEMBL371695,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccccc3Cl)c12,inactive,334.788,2.93742,3.0,4.0,4.638272
...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,Cc1nn(-c2ccccc2)c2cc(N[C@@H](C)c3ccccc3)ncc12,intermediate,328.419,4.90202,1.0,4.0,5.885723
1209,CHEMBL559845,Cn1nc(C(N)=O)c2c1-c1nc(Nc3ccccc3)ncc1CC2,active,320.356,1.81820,2.0,6.0,7.167491
1210,CHEMBL562104,CNC(=O)c1nn(C)c2c1CCc1cnc(Nc3ccccc3)nc1-2,intermediate,334.383,2.07890,2.0,6.0,5.375202
1211,CHEMBL563150,Cn1nc(C(=O)Nc2ccccc2)c2c1-c1nc(Nc3ccccc3)ncc1CC2,inactive,396.454,3.97160,2.0,6.0,5.000000


In [7]:
# **Molecular fingerprints**
# A molecular fingerprint encodes a molecular structure as a bit string.
# Because the bits reflect structural features, fingerprints serve as
# molecular descriptors and can be used to compare the structural
# similarity of molecules.
# PaDEL reads the structures from a .smi file, which here holds one molecule
# per line: the SMILES string, a tab, and the ChEMBL ID. From these it
# computes the PubChem fingerprint of each compound.
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [8]:
! cat molecule.smi | head -5

O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1	CHEMBL115220
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12	CHEMBL199996
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12	CHEMBL199658
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12	CHEMBL199657
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccccc3Cl)c12	CHEMBL371695


In [9]:
! cat molecule.smi | wc -l

1213
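
The comments above note that fingerprints make structural similarity measurable. One common way to do this is the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either. The following is a minimal sketch on two made-up 8-bit fingerprints (real PubChem fingerprints have 881 bits); the function name `tanimoto` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length 0/1 fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == 1 and b == 1)
    # Bits set in at least one fingerprint.
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == 1 or b == 1)
    # Convention chosen here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


# Hypothetical toy fingerprints, for illustration only.
fp_x = [1, 1, 0, 0, 1, 0, 1, 0]
fp_y = [1, 1, 1, 0, 0, 0, 1, 0]

# 3 shared bits out of 5 bits set in either fingerprint.
print(tanimoto(fp_x, fp_y))
```

A value of 1 means the two fingerprints are identical, and a value near 0 means they share almost no structural features.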


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [12]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
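
To make the options in `padel.sh` easier to read, the same command can be assembled as an argument list in Python. This is only a sketch: the flags and paths are copied from the script above, the helper name `build_padel_command` is hypothetical, and the comments reflect our reading of the PaDEL-Descriptor options rather than its official documentation. The command is printed, not executed.

```python
import shlex


def build_padel_command(
    jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    descriptor_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
    heap="1G",
):
    """Return the argument list equivalent to the java call in padel.sh."""
    return [
        "java",
        f"-Xms{heap}",                # initial JVM heap size
        f"-Xmx{heap}",                # maximum JVM heap size
        "-Djava.awt.headless=true",   # run without a graphical display
        "-jar", jar,
        "-removesalt",                # strip salts from the structures
        "-standardizenitro",          # standardize nitro groups
        "-fingerprints",              # calculate fingerprints
        "-descriptortypes", descriptor_xml,  # which fingerprint type to compute
        "-dir", input_dir,            # directory holding the input structures
        "-file", output_csv,          # where to write the results
    ]


cmd = build_padel_command()
print(shlex.join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# import subprocess
# subprocess.run(cmd, check=True)
```

Swapping `descriptor_xml` for another XML file from the `PaDEL-Descriptor` folder, such as `MACCSFingerprinter.xml`, would presumably select a different fingerprint type.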


In [14]:
! bash padel.sh

Processing CHEMBL115220 in molecule.smi (1/1213). 
Processing CHEMBL199996 in molecule.smi (2/1213). 
Processing CHEMBL199657 in molecule.smi (4/1213). Average speed: 2.80 s/mol.
Processing CHEMBL199658 in molecule.smi (3/1213). Average speed: 5.05 s/mol.
Processing CHEMBL371695 in molecule.smi (5/1213). Average speed: 1.97 s/mol.
Processing CHEMBL382070 in molecule.smi (6/1213). Average speed: 1.56 s/mol.
Processing CHEMBL199759 in molecule.smi (7/1213). Average speed: 1.33 s/mol.
Processing CHEMBL370199 in molecule.smi (8/1213). Average speed: 1.14 s/mol.
Processing CHEMBL371239 in molecule.smi (10/1213). Average speed: 0.91 s/mol.
Processing CHEMBL199737 in molecule.smi (9/1213). Average speed: 1.03 s/mol.
Processing CHEMBL199383 in molecule.smi (11/1213). Average speed: 0.84 s/mol.
Processing CHEMBL197923 in molecule.smi (12/1213). Average speed: 0.81 s/mol.
Processing CHEMBL199528 in molecule.smi (14/1213). Average speed: 0.80 s/mol.
Processing CHEMBL199755 in molecule.smi (13/121

In [15]:
! ls -l

total 27552
-rw-r--r-- 1 root root  2167690 Sep 28 22:35 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Sep 28 22:29 __MACOSX
-rw-r--r-- 1 root root    85588 Sep 28 22:30 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Sep 28 22:29 padel.sh
-rw-r--r-- 1 root root 25768637 Sep 28 22:29 padel.zip
-rw-r--r-- 1 root root   166747 Sep 28 22:30 PLK1_RO5_pIC50.csv
drwxr-xr-x 1 root root     4096 Sep 26 13:45 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [16]:
df_pubchem = pd.read_csv('/content/descriptors_output.csv')

In [17]:
df_pubchem

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL199996,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199658,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL199657,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL382070,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,CHEMBL559845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1210,CHEMBL562104,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1211,CHEMBL563150,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
df_pubchem_X = df_pubchem.drop(columns=['Name'])
df_pubchem_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1209,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1210,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1211,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The loaded file already contains pIC50 values, so no IC50 conversion is needed here; we simply select the `pIC50` column.

In [19]:
df_pubchem_Y = df['pIC50']
df_pubchem_Y

0       5.000000
1       4.698970
2       4.000000
3       4.000000
4       4.638272
          ...   
1208    5.885723
1209    7.167491
1210    5.375202
1211    5.000000
1212    5.244125
Name: pIC50, Length: 1213, dtype: float64
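
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so larger values mean more potent compounds. The sketch below assumes the IC50 is given in nanomolar, which is an assumption on our part and not something this notebook states; the function name `pic50_from_nM` is illustrative.

```python
import math


def pic50_from_nM(ic50_nM):
    """Convert an IC50 in nanomolar (assumed unit) to pIC50."""
    if ic50_nM <= 0:
        raise ValueError("IC50 must be positive")
    # -log10(IC50 in M) with IC50 in M = ic50_nM * 1e-9,
    # which simplifies to 9 - log10(ic50_nM).
    return 9.0 - math.log10(ic50_nM)


# An IC50 of 10,000 nM (10 µM) gives a pIC50 of about 5,
# and 100,000 nM gives about 4.
for value in (10_000, 100_000):
    print(value, pic50_from_nM(value))
```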

## **Combining X and Y variables**

In [20]:
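# Note: concat with axis=1 pairs rows by position, so this assumes that
# descriptors_output.csv lists the molecules in the same order as df.
# The PaDEL log and the Name column above show a different order, so
# verify the alignment before modeling.
dataset_pubchem = pd.concat([df_pubchem_X,df_pubchem_Y], axis=1)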
dataset_pubchem

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.698970
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.885723
1209,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.167491
1210,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.375202
1211,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
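
Because `pd.concat(..., axis=1)` pairs rows by position, X and Y only line up if `descriptors_output.csv` lists the molecules in the same order as `df`. The PaDEL log above processes some molecules out of order (for example 4/1213 before 3/1213), and `df_pubchem` starts with CHEMBL199996 while `df` starts with CHEMBL115220, so it may be safer to join on the ChEMBL ID instead. The following is a minimal sketch, not the notebook's original method, using three rows copied from the tables above; the variable names `bioactivity`, `descriptors`, `positional`, and `aligned` are illustrative.

```python
import pandas as pd

# Three rows from the bioactivity table, in the order of df.
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL115220", "CHEMBL199996", "CHEMBL199658"],
    "pIC50": [5.000000, 4.698970, 4.000000],
})

# The same molecules in the order PaDEL wrote them (first two swapped).
descriptors = pd.DataFrame({
    "Name": ["CHEMBL199996", "CHEMBL115220", "CHEMBL199658"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP2": [0, 1, 0],
})

# Positional pairing: CHEMBL199996 silently receives the pIC50 of CHEMBL115220.
positional = pd.concat([descriptors, bioactivity["pIC50"]], axis=1)

# ID-based pairing: each fingerprint row gets the pIC50 of the same molecule.
# validate="one_to_one" raises an error if an ID appears more than once.
aligned = descriptors.merge(
    bioactivity[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
).drop(columns="molecule_chembl_id")

print(positional)
print(aligned)
```

If the real data contain repeated ChEMBL IDs, the `validate` check will fail, and the duplicates would need to be resolved before merging.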


In [21]:
dataset_pubchem.to_csv('PLK1_bioactivity_data_pIC50_pubchem_fp.csv', index=False)

In [22]:
dataset_pubchem_name = pd.concat([df_pubchem,df_pubchem_Y], axis=1)
dataset_pubchem_name

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,CHEMBL199996,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199658,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL199657,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL382070,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL559845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,CHEMBL562104,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,CHEMBL563150,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [23]:
dataset_pubchem_name.to_csv('PLK1_bioactivity_data_pIC50_pubchem_fp_named.csv', index=False)

# **Let's download the CSV files to your local computer for Part 3 (Model Building).**
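
One possible way to do this is to bundle the two CSV files into a single archive and, when running in Google Colab, trigger a browser download. This is a sketch under assumptions: the archive name `part2_outputs.zip` and the helper `bundle_outputs` are hypothetical, and the download step only runs where the `google.colab` module is available.

```python
import zipfile
from pathlib import Path


def bundle_outputs(paths, archive="part2_outputs.zip"):
    """Zip the files that exist and return the names that were added."""
    added = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in map(Path, paths):
            if path.is_file():
                zf.write(path, arcname=path.name)
                added.append(path.name)
    return added


# File names written by the cells above.
outputs = [
    "PLK1_bioactivity_data_pIC50_pubchem_fp.csv",
    "PLK1_bioactivity_data_pIC50_pubchem_fp_named.csv",
]
added = bundle_outputs(outputs)
print("Archived:", added)

# google.colab is only importable inside Colab; skip the download elsewhere.
try:
    from google.colab import files
except ImportError:
    files = None

if files is not None and added:
    files.download("part2_outputs.zip")
```

Outside Colab, the archive is simply left in the working directory and can be copied from there.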