# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for subsequent model building in Part 4.

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-09-25 12:02:17--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-09-25 12:02:18--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-09-25 12:02:20 (70.1 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-09-25 12:02:20--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (g

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf
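The listing above shows that the archive also carries macOS metadata (`__MACOSX/` entries and `.DS_Store` files) that PaDEL does not need. As an optional alternative to `unzip`, here is a minimal sketch using Python's standard `zipfile` module that skips those entries. The helper name `extract_without_macosx` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_without_macosx(zip_path, dest="."):
    """Extract a zip archive, skipping macOS metadata entries.

    Hypothetical helper for illustration; returns the extracted entry names.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Skip resource-fork folders and Finder metadata files.
            if name.startswith("__MACOSX/") or Path(name).name == ".DS_Store":
                continue
            zf.extract(name, dest)
            extracted.append(name)
    return extracted


# Only runs if the archive downloaded above is present.
if Path("padel.zip").exists():
    extract_without_macosx("padel.zip")
```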

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **renin_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('renin_04_bioactivity_data_3class_pIC50.csv')

In [5]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL47918,CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC...,active,745.059,5.2830,2.0,8.0,7.522879
1,1,CHEMBL50217,CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC...,active,766.033,6.9418,3.0,8.0,8.657577
2,2,CHEMBL301347,CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC...,active,773.069,5.8374,3.0,9.0,8.602060
3,3,CHEMBL48624,CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC...,active,787.096,5.1015,3.0,9.0,8.619789
4,4,CHEMBL101874,CC[C@H](C)[C@H](NC(=O)[C@@H](NC(=O)[C@H](CC(C)...,intermediate,813.010,2.4533,7.0,10.0,5.853872
...,...,...,...,...,...,...,...,...,...
3039,3039,CHEMBL4210995,CCc1cccc(-c2c(F)cccc2C(O)(CCCNC(=O)OC)[C@@H]2C...,active,557.707,4.2420,3.0,6.0,8.301030
3040,3040,CHEMBL4212859,CCc1cccc(-c2c(F)cccc2C(O)(CCCNC(=O)OC)[C@@H]2C...,active,543.680,3.5879,4.0,6.0,8.698970
3041,3041,CHEMBL4560579,C#CCOc1cnc(C(=O)Nc2cc(F)c(F)c([C@@]3(C)N=C(N)S...,inactive,473.505,2.7009,2.0,8.0,3.168770
3042,3042,CHEMBL4577527,C#CCOc1cnc(C(=O)Nc2cc(F)c(F)c([C@@]3(C)N=C(N)S...,inactive,479.459,3.3196,2.0,7.0,3.669586


In [6]:
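# Write a headerless, tab-separated .smi file for PaDEL:
# SMILES first, then the ChEMBL ID, which serves as the molecule name.
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)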

In [7]:
! head -5 molecule.smi

CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC1)C(=O)N[C@@H](CC1CCCCC1)[C@@H](O)C[C@H](C(=O)N(C)CCN(C)C)C(C)C	CHEMBL47918
CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC1)C(=O)N[C@@H](CC1CCCCC1)[C@@H](O)C[C@H](NC(=O)OCc1ccccc1)C(C)C	CHEMBL50217
CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC1)C(=O)N[C@@H](CC1CCCCC1)[C@@H](O)C[C@H](NC(=O)OCCN1CCCC1)C(C)C	CHEMBL301347
CCCC[C@H](O[C@@H](Cc1ccccc1)C(=O)N1CCC(OCOC)CC1)C(=O)N[C@@H](CC1CCCCC1)[C@@H](O)C[C@H](C(=O)NCCCN1CCOCC1)C(C)C	CHEMBL48624
CC[C@H](C)[C@H](NC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](C)NC(=O)[C@H](Cc1ccccc1)NC(=O)OC(C)(C)C)C(C)C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)OC	CHEMBL101874


In [8]:
! wc -l < molecule.smi

3044
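The same check can be done from Python. The sketch below writes a toy `.smi` file with the same `to_csv` call pattern used above, then reads it back to confirm that every line has a SMILES string and a name, and that the number of lines matches the number of molecules. The toy SMILES strings and IDs are made up for illustration and are not taken from the dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df3 (illustrative values only).
toy = pd.DataFrame({
    "molecule_chembl_id": ["MOL1", "MOL2"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
})

with tempfile.TemporaryDirectory() as tmp:
    smi_path = Path(tmp) / "toy_molecule.smi"

    # SMILES first, then the name, tab-separated, no header and no index.
    toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        smi_path, sep="\t", index=False, header=False
    )

    # Read the file back into two named columns.
    check = pd.read_csv(smi_path, sep="\t", header=None, names=["smiles", "name"])

n_lines = len(check)
print(n_lines == len(toy))            # one line per molecule
print(check.notna().all().all())      # no missing SMILES or names
```

On the real data, the same comparison would be between the line count of `molecule.smi` and `len(df3)`.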


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [9]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
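To make the command in `padel.sh` easier to read and adapt, here is a sketch that builds the same argument list in Python and runs it with `subprocess`. Every flag is copied from the script above; the function name `build_padel_command` and its parameters are hypothetical. Going by the flag names, `-removesalt` and `-standardizenitro` clean up the structures before calculation, `-fingerprints` requests fingerprint output, `-descriptortypes` points to the XML file that selects the fingerprint type, `-dir` is the folder scanned for input structures, and `-file` is the output CSV.

```python
import subprocess
from pathlib import Path

JAR = "./PaDEL-Descriptor/PaDEL-Descriptor.jar"
PUBCHEM_XML = "./PaDEL-Descriptor/PubchemFingerprinter.xml"


def build_padel_command(fingerprint_xml=PUBCHEM_XML, out_csv="descriptors_output.csv"):
    """Return the argument list equivalent to the command in padel.sh.

    Hypothetical helper for illustration; flags are copied from padel.sh.
    """
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", JAR,
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", fingerprint_xml,
        "-dir", "./",
        "-file", out_csv,
    ]


cmd = build_padel_command()
print(" ".join(cmd))

# Only run PaDEL if the jar extracted earlier is actually present.
if Path(JAR).exists():
    subprocess.run(cmd, check=True)
```

Passing a different XML file from the extracted `PaDEL-Descriptor` folder (for example `MACCSFingerprinter.xml`, listed in the unzip output) together with a different output name should, under the same assumptions, produce another fingerprint type.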


In [10]:
! bash padel.sh

Processing CHEMBL47918 in molecule.smi (1/3044). 
Processing CHEMBL50217 in molecule.smi (2/3044). 
Processing CHEMBL301347 in molecule.smi (3/3044). Average speed: 8.78 s/mol.
Processing CHEMBL48624 in molecule.smi (4/3044). Average speed: 4.76 s/mol.
Processing CHEMBL101874 in molecule.smi (5/3044). Average speed: 3.87 s/mol.
Processing CHEMBL445118 in molecule.smi (6/3044). Average speed: 3.17 s/mol.
Processing CHEMBL300504 in molecule.smi (7/3044). Average speed: 2.71 s/mol.
Processing CHEMBL49218 in molecule.smi (9/3044). Average speed: 2.12 s/mol.
Processing CHEMBL49225 in molecule.smi (8/3044). Average speed: 2.38 s/mol.
Processing CHEMBL49286 in molecule.smi (11/3044). Average speed: 1.83 s/mol.
Processing CHEMBL297129 in molecule.smi (10/3044). Average speed: 1.97 s/mol.
Processing CHEMBL318242 in molecule.smi (12/3044). Average speed: 1.89 s/mol.
Processing CHEMBL297701 in molecule.smi (13/3044). Average speed: 1.62 s/mol.
Processing CHEMBL295048 in molecule.smi (15/3044). Av

In [11]:
! ls -l

total 31312
-rw-r--r-- 1 root root  5421626 Sep 25 12:20 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Sep 25 12:02 __MACOSX
-rw-r--r-- 1 root root   314279 Sep 25 12:03 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Sep 25 12:02 padel.sh
-rw-r--r-- 1 root root 25768637 Sep 25 12:02 padel.zip
-rw-r--r-- 1 root root   535275 Sep 25 12:02 renin_04_bioactivity_data_3class_pIC50.csv
drwxr-xr-x 1 root root     4096 Sep 22 13:42 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [12]:
df3_X = pd.read_csv('descriptors_output.csv')

In [13]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL47918,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL50217,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL301347,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL48624,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL101874,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3039,CHEMBL4210995,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3040,CHEMBL4212859,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3041,CHEMBL4577527,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3042,CHEMBL4560579,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Drop the molecule name so that only the fingerprint bit columns remain
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3039,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3040,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3041,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3042,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
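The columns `PubchemFP0` to `PubchemFP880` are fingerprint bits, so each value should be 0 or 1 with nothing missing. A minimal sanity check along those lines is sketched below; the function name `check_fingerprint_matrix` is hypothetical and the toy matrix only stands in for `df3_X`.

```python
import pandas as pd


def check_fingerprint_matrix(X: pd.DataFrame) -> None:
    """Raise ValueError if X has missing values or non-binary columns.

    Hypothetical helper for illustration.
    """
    if X.isna().any().any():
        raise ValueError("fingerprint matrix contains missing values")
    # Collect columns holding anything other than 0 or 1.
    non_binary = [c for c in X.columns if not X[c].isin([0, 1]).all()]
    if non_binary:
        raise ValueError(f"non-binary columns, e.g. {non_binary[:5]}")


# Toy stand-in for df3_X (illustrative values only).
toy_X = pd.DataFrame({"PubchemFP0": [1, 1, 0], "PubchemFP1": [0, 1, 0]})
check_fingerprint_matrix(toy_X)  # passes silently
```

On the real data the call would be `check_fingerprint_matrix(df3_X)` after the `Name` column has been dropped.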


### **Y variable**

The pIC50 values were already computed in the earlier parts of this series, so we only need to select that column.

In [15]:
df3_Y = df3['pIC50']
df3_Y

0       7.522879
1       8.657577
2       8.602060
3       8.619789
4       5.853872
          ...   
3039    8.301030
3040    8.698970
3041    3.168770
3042    3.669586
3043    4.000000
Name: pIC50, Length: 3044, dtype: float64
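For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nanomolar, so the formula becomes pIC50 = 9 - log10(IC50 in nM). The function name `pic50_from_ic50_nm` is hypothetical, and the input values are chosen only for illustration.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L), with 1 nM = 1e-9 mol/L."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


print(round(pic50_from_ic50_nm(30.0), 6))       # 30 nM
print(round(pic50_from_ic50_nm(100_000.0), 6))  # 100 µM
```

Under this assumption, an IC50 of 30 nM gives a pIC50 of about 7.52 and an IC50 of 100 µM gives 4.0, which helps when reading the values above: higher pIC50 means a more potent compound.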

## **Combining X and Y variable**

In [16]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.522879
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.657577
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.602060
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.619789
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.853872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3039,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.301030
3040,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.698970
3041,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.168770
3042,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.669586
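A caution on the positional `pd.concat` used above: it pairs row *i* of the fingerprints with row *i* of the pIC50 values, which is only correct if `descriptors_output.csv` lists the molecules in the same order as the bioactivity file. The outputs shown earlier suggest this may not hold here: the PaDEL log processes some molecules out of order, and the last two ChEMBL IDs appear in opposite order in the two tables. A safer approach is to join on the molecule ID. The sketch below uses toy data with made-up IDs and values; applying it to the real data would mean merging before the `Name` column is dropped.

```python
import pandas as pd

# Toy stand-in for the bioactivity table (illustrative values only).
bio = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C"],
    "pIC50": [7.5, 8.6, 3.2],
})

# Toy stand-in for the PaDEL output; note that C and B are swapped.
fp = pd.DataFrame({
    "Name": ["A", "C", "B"],
    "PubchemFP0": [1, 0, 1],
    "PubchemFP1": [0, 1, 1],
})

# Join on the ID instead of relying on row position.
# validate="one_to_one" raises if an ID occurs more than once on either side.
aligned = fp.merge(
    bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Keep only the fingerprint bits and the target.
dataset = aligned.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

If the real data contains repeated ChEMBL IDs, the `validate` check will raise an error, and the duplicates would need to be resolved before merging.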


In [18]:
dataset3.to_csv('renin_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for the next part (Model Building).**