# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Based on the work of Chanin Nantasenamat [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-02-07 17:11:18--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-02-07 17:11:18--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip.1’


2024-02-07 17:11:23 (5,58 MB/s) - ‘padel.zip.1’ saved [25768637/25768637]

--2024-02-07 17:11:23--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com 

In [2]:
! unzip padel.zip

Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. The original tutorial uses the **bioactivity_data_3class_pIC50.csv** file; this notebook loads **HSC70_bioactivity_pIC50.csv** instead. The file contains the pIC50 values that we will use to build a regression model.

In [None]:
#! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

In [5]:
import pandas as pd

In [10]:
df3 = pd.read_csv('HSC70_bioactivity_pIC50.csv')

In [5]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL470334,N#Cc1ccc(COC[C@H]2O[C@@H](n3c(NCc4ccc(Cl)c(Cl)...,inactive,556.410,3.03508,4.0,11.0,4.982967
1,1,CHEMBL1358945,CCCc1cc(SCC(=O)Nc2c(C)n(C)n(-c3ccccc3)c2=O)n2c...,inactive,510.623,4.84030,1.0,8.0,3.000435
2,2,CHEMBL3191473,O=C(/C=C(/NCCO)C(F)(F)F)c1ccc(Cl)cc1,inactive,293.672,2.55080,2.0,3.0,3.000435
3,3,CHEMBL1537858,Cc1cc(NC(=O)CSc2ncnc3ccccc23)no1,inactive,300.343,2.65702,1.0,6.0,3.000435
4,4,CHEMBL1576855,CCOc1ccccc1N1Cc2cc(C)c(C)cc2C1,inactive,267.372,4.22234,0.0,2.0,3.000435
...,...,...,...,...,...,...,...,...,...
87,90,CHEMBL3309990,Nc1nc2c3c(F)cccc3nc(Cc3ccc4c(c3)OCO4)n2n1,inactive,337.314,2.31830,1.0,7.0,5.000000
88,91,CHEMBL3309995,Nc1ccc2c(c1)nc(Cc1ccc3c(c1)OCO3)n1nc(N)nc21,inactive,334.339,1.76140,2.0,8.0,5.000000
89,92,CHEMBL3309997,OCCCNc1ccc2c(c1)nc(-c1ccc3c(c1)OCO3)n1ncnc21,inactive,363.377,2.46750,2.0,8.0,5.000000
90,93,CHEMBL3309998,Nc1nc2c3ccc(NCCO)cc3nc(Cc3ccc4c(c3)OCO4)n2n1,inactive,378.392,1.58340,3.0,9.0,5.000000


In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [7]:
! cat molecule.smi | head -5

N#Cc1ccc(COC[C@H]2O[C@@H](n3c(NCc4ccc(Cl)c(Cl)c4)nc4c(N)ncnc43)[C@H](O)[C@@H]2O)cc1	CHEMBL470334
CCCc1cc(SCC(=O)Nc2c(C)n(C)n(-c3ccccc3)c2=O)n2c(nc3ccccc32)c1C#N	CHEMBL1358945
O=C(/C=C(/NCCO)C(F)(F)F)c1ccc(Cl)cc1	CHEMBL3191473
Cc1cc(NC(=O)CSc2ncnc3ccccc23)no1	CHEMBL1537858
CCOc1ccccc1N1Cc2cc(C)c(C)cc2C1	CHEMBL1576855


In [8]:
! cat molecule.smi | wc -l

      92
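The `molecule.smi` file written above is a headerless, tab-separated file with the SMILES string first and the molecule identifier second, which is the layout PaDEL-Descriptor picks up from the working directory. A minimal sketch of that export on a made-up two-row DataFrame (the identifiers `MOL_A`/`MOL_B` and the file name `example.smi` are hypothetical, used only for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df3; the IDs and pIC50 values are invented for this sketch.
toy = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "pIC50": [5.0, 6.2],
})

with tempfile.TemporaryDirectory() as tmp:
    smi_path = Path(tmp) / "example.smi"
    # Same call pattern as the notebook: SMILES first, ID second,
    # tab-separated, no index column and no header row.
    toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        smi_path, sep="\t", index=False, header=False
    )
    lines = smi_path.read_text().splitlines()

print(lines)
```

Each line holds exactly one molecule, so the line count of the file should equal the number of rows in the DataFrame.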


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

http://www.yapcwsoft.com/dd/padeldescriptor/

In [1]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
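The one-line command in `padel.sh` packs several options together. The sketch below rebuilds the same command as a Python list so that each part can be commented; the helper name `build_padel_command` and its parameters are hypothetical, and the flag descriptions reflect my reading of the PaDEL-Descriptor options rather than anything stated in this notebook.

```python
import shlex


def build_padel_command(padel_dir="./PaDEL-Descriptor", input_dir="./",
                        output_csv="descriptors_output.csv", heap="1G"):
    """Return the PaDEL command from padel.sh as an argument list."""
    return [
        "java",
        f"-Xms{heap}", f"-Xmx{heap}",        # initial and maximum JVM heap size
        "-Djava.awt.headless=true",          # run without a graphical display
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt",                       # strip salt / counter-ion fragments
        "-standardizenitro",                 # standardize nitro group representation
        "-fingerprints",                     # calculate fingerprints
        "-descriptortypes", f"{padel_dir}/PubchemFingerprinter.xml",  # which types to compute
        "-dir", input_dir,                   # directory containing the input structures
        "-file", output_csv,                 # CSV file that receives the results
    ]


command = build_padel_command()
print(shlex.join(command))
```

If Java is installed, the list could be passed to `subprocess.run(command, check=True)` instead of calling `bash padel.sh`; swapping the XML file for another one in the `PaDEL-Descriptor` folder would select a different fingerprint type.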


In [2]:
! bash padel.sh

Processing CHEMBL470334 in molecule.smi (1/92). 
Processing CHEMBL1358945 in molecule.smi (2/92). 
Processing CHEMBL3191473 in molecule.smi (3/92). 
Processing CHEMBL1537858 in molecule.smi (4/92). 
Processing CHEMBL1576855 in molecule.smi (5/92). 
Processing CHEMBL1360493 in molecule.smi (6/92). 
Processing CHEMBL1420920 in molecule.smi (7/92). 
Processing CHEMBL1341697 in molecule.smi (8/92). 
Processing CHEMBL496 in molecule.smi (9/92). Average speed: 1.61 s/mol.
Processing CHEMBL1566165 in molecule.smi (12/92). Average speed: 0.83 s/mol.
Processing CHEMBL1421407 in molecule.smi (10/92). Average speed: 1.61 s/mol.
Processing CHEMBL1332706 in molecule.smi (11/92). Average speed: 1.10 s/mol.
Processing CHEMBL1539325 in molecule.smi (13/92). Average speed: 0.67 s/mol.
Processing CHEMBL1373655 in molecule.smi (14/92). Average speed: 0.56 s/mol.
Processing CHEMBL1609479 in molecule.smi (15/92). Average speed: 0.49 s/mol.
Processing CHEMBL1480546 in molecule.smi (16/92). Average speed: 0.

In [3]:
! ls -l

total 105288
-rw-r--r--@  1 damaro  staff     79737 Feb  7 15:58 CDD_ML_Part_1_bioactivity_data.ipynb
-rw-r--r--@  1 damaro  staff    461811 Feb  7 17:05 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
-rw-r--r--@  1 damaro  staff    208719 Feb  7 16:45 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rw-r--r--@  1 damaro  staff      4168 Feb  7 13:44 ChEMBL.ipynb
-rw-r--r--@  1 damaro  staff     11241 Feb  7 17:05 HSC70_bioactivity_pIC50.csv
drwxrwxr-x  21 damaro  staff       672 May 30  2020 PaDEL-Descriptor
drwxr-xr-x   4 damaro  staff       128 Feb  7 16:59 __MACOSX
-rw-r--r--@  1 damaro  staff     50622 Feb  7 14:59 bioactivity_data.csv
-rw-r--r--@  1 damaro  staff      7141 Feb  7 15:13 bioactivity_preprocessed_data.csv
-rw-r--r--   1 damaro  staff    174897 Feb  7 17:15 descriptors_output.csv
-rw-r--r--@  1 damaro  staff       122 Feb  7 16:40 mannwhitneyu_LogP.csv
-rw-r--r--@  1 damaro  staff       118 Feb  7 16:40 mannwhitneyu_MW.csv

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [6]:
df3_X = pd.read_csv('/Users/damaro/Documents/Bioinformatics/descriptors_output.csv')

In [7]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL3191473,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL1420920,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL1341697,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL1537858,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL1576855,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,CHEMBL2443026,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
88,CHEMBL2443044,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
89,CHEMBL3309998,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
90,CHEMBL3309997,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
88,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
89,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
90,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Select the pIC50 column**

In [11]:
df3_Y = df3['pIC50']
df3_Y

0     4.982967
1     3.000435
2     3.000435
3     3.000435
4     3.000435
        ...   
87    5.000000
88    5.000000
89    5.000000
90    5.000000
91    5.000000
Name: pIC50, Length: 92, dtype: float64
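The `pIC50` column is already present in the pre-processed file, so no conversion happens in this cell. For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. A small sketch, assuming the IC50 is given in nM (the function name `ic50_nm_to_pic50` is hypothetical):

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(molar)


# 10,000 nM is 1e-5 M; 1,000 nM is 1e-6 M.
for value in (10_000.0, 1_000.0):
    print(value, round(ic50_nm_to_pic50(value), 6))
```

Under this definition, a pIC50 of 5.0, as seen in the last rows of the output above, corresponds to an IC50 of 10 µM.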

## **Combining the X and Y variables**

In [12]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.982967
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000435
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000435
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000435
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000435
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
88,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
89,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
90,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
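`pd.concat(..., axis=1)` pairs rows by index position, so it relies on `descriptors_output.csv` listing the molecules in the same order, and in the same number, as `df3`. In the outputs above the first descriptor row is CHEMBL3191473 while the first row of `df3` is CHEMBL470334, and the descriptor table ends at index 90 while `df3_Y` has 92 entries, so that assumption may not hold here. A hedged alternative, sketched on made-up toy data, is to join on the molecule identifier before dropping the `Name` column:

```python
import pandas as pd

# Toy stand-ins: IDs, fingerprint bits and pIC50 values are invented.
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "pIC50": [5.0, 6.2, 4.1],
})
# Descriptor rows arrive in a different order, and MOL_C is missing.
descriptors = pd.DataFrame({
    "Name": ["MOL_B", "MOL_A"],
    "PubchemFP0": [1, 0],
    "PubchemFP1": [0, 1],
})

# Match each fingerprint row to its own pIC50 by identifier, not by position.
merged = descriptors.merge(
    bioactivity[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",              # keep only molecules present in both tables
    validate="one_to_one",    # raise if an identifier is duplicated
)

# Drop the identifier columns once the rows are aligned.
aligned = merged.drop(columns=["Name", "molecule_chembl_id"])
print(aligned)
```

With real data, `validate="one_to_one"` raises an error if a ChEMBL ID appears more than once, which is a useful signal that duplicates need handling before modelling.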


In [None]:
dataset3.to_csv('HSC70_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)