# Compute molecular descriptors using the PaDEL-Descriptor software and prepare the dataset for model building.
A follow-along project on drug discovery for *Zaire ebolavirus*

by Rea Kalampaliki

**Digital mentor**: Chanin Nantasenamat ([Data Professor](https://www.youtube.com/watch?v=qWVTxfLq2ak&list=PLtqF5YXg7GLlQJUv9XJ3RWdd5VYGwBHrP&index=2))

## Download PaDEL-Descriptor Software

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2021-02-01 17:08:42--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2021-02-01 17:08:42--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2021-02-01 17:08:43 (45.9 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2021-02-01 17:08:43--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (github.c

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## Load bioactivity data (3 classes + pIC50)

In [None]:
import pandas as pd
df3 = pd.read_csv('/content/GAR_transformylase_04_bioactivity_data_3class_pIC50.csv')

print(df3.shape)
df3.head()

(69, 8)


| | molecule_chembl_id | canonical_smiles | class | MW | LogP | NumHDonors | NumHAcceptors | pIC50 |
|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL289923 | NCCCC(NC(=O)c1ccc(NCC2CNc3nc(N)nc(O)c3C2)cc1)C... | intermediate | 429.481 | 0.3826 | 7.0 | 9.0 | 5.318759 |
| 1 | CHEMBL152172 | Nc1nc(N)c(CCCN(C=O)c2ccc(C(=O)NC(CCC(=O)O)C(=O... | inactive | 460.447 | -0.01 | 6.0 | 9.0 | 4.744727 |
| 2 | CHEMBL153550 | CN(CCCc1c(N)nc(N)nc1O)c1ccc(C(=O)NC(CCC(=O)O)C... | intermediate | 446.464 | 0.4634 | 6.0 | 9.0 | 5.366532 |
| 3 | CHEMBL13659 | Nc1nc(N)c(CCCNc2ccc(C(=O)N[C@@H](CCC(=O)O)C(=O... | active | 432.437 | 0.4391 | 7.0 | 9.0 | 6.200659 |
| 4 | CHEMBL158034 | Nc1nc(O)c2c(n1)NCC1CCN(c3ccc(C(=O)NC(CCC(=O)O)... | inactive | 470.486 | 0.8478 | 6.0 | 9.0 | 4.69897 |


Select the columns `canonical_smiles` and `molecule_chembl_id` from `df3`.

In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]

Save `df3_selection` to an SMI file: tab-separated, SMILES first, then the ChEMBL ID, with no header row and no index.

In [None]:
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

NCCCC(NC(=O)c1ccc(NCC2CNc3nc(N)nc(O)c3C2)cc1)C(=O)O	CHEMBL289923
Nc1nc(N)c(CCCN(C=O)c2ccc(C(=O)NC(CCC(=O)O)C(=O)O)cc2)c(O)n1	CHEMBL152172
CN(CCCc1c(N)nc(N)nc1O)c1ccc(C(=O)NC(CCC(=O)O)C(=O)O)cc1	CHEMBL153550
Nc1nc(N)c(CCCNc2ccc(C(=O)N[C@@H](CCC(=O)O)C(=O)O)cc2)c(O)n1	CHEMBL13659
Nc1nc(O)c2c(n1)NCC1CCN(c3ccc(C(=O)NC(CCC(=O)O)C(=O)O)cc3)CC21	CHEMBL158034


In [None]:
! cat molecule.smi | wc -l

69
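
The SMI file written above is plain text: one molecule per line, with the SMILES string, a tab, and then the identifier. The following is a minimal sketch of that format on toy data; the SMILES strings and the IDs `ID_ETHANOL` and `ID_BENZENE` are placeholders, not entries from the dataset.

```python
import io

import pandas as pd

# Toy stand-in for df3_selection; SMILES and IDs are placeholders.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "molecule_chembl_id": ["ID_ETHANOL", "ID_BENZENE"],
})

# Write to an in-memory buffer with the same options used for molecule.smi.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False, header=False)
smi_text = buffer.getvalue()
print(smi_text)

# Read the text back; header=None because the SMI file has no header row.
roundtrip = pd.read_csv(
    io.StringIO(smi_text),
    sep="\t",
    header=None,
    names=["canonical_smiles", "molecule_chembl_id"],
)
print(len(roundtrip))  # number of molecules, comparable to `wc -l`
```

Counting the rows after reading the file back is a Python-side equivalent of the `wc -l` check.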


## Calculate fingerprint descriptors

In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


Running `padel.sh` does the following:

1. Reads the molecule files found in the current directory (`-dir ./`), here `molecule.smi`.
2. Removes salts (e.g. Na+, Cl-) and small organic acids from the chemical structures (`-removesalt`) and standardizes nitro groups (`-standardizenitro`).
3. Computes the fingerprint descriptors of each chemical compound (**descriptor type = PubChem fingerprint**).
4. Stores the descriptors in a CSV file named `descriptors_output.csv`.
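
The command in `padel.sh` can also be assembled from Python, which makes the individual options easier to see and to vary. This is only a sketch: `build_padel_command` and its parameters are hypothetical names introduced here, and every flag is copied from the `padel.sh` contents shown above. Actually running it assumes that Java is installed and that the archive has been unzipped into `./PaDEL-Descriptor`.

```python
import subprocess
from pathlib import Path


def build_padel_command(
    padel_dir="./PaDEL-Descriptor",
    fingerprint_xml="PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
    heap="1G",
):
    """Return the PaDEL-Descriptor command line as a list of arguments.

    Hypothetical helper: it mirrors the flags in padel.sh and nothing more.
    """
    padel_dir = Path(padel_dir)
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", str(padel_dir / "PaDEL-Descriptor.jar"),
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", str(padel_dir / fingerprint_xml),
        "-dir", input_dir,
        "-file", output_csv,
    ]


command = build_padel_command()
print(" ".join(command))

# Uncomment to run (requires Java and the unzipped PaDEL-Descriptor folder):
# subprocess.run(command, check=True)
```

Passing a different file from the archive, for example `fingerprint_xml="MACCSFingerprinter.xml"`, would presumably select another fingerprint type in the same way.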



In [None]:
! bash padel.sh

Processing CHEMBL152172 in molecule.smi (2/69). 
Processing CHEMBL289923 in molecule.smi (1/69). 
Processing CHEMBL153550 in molecule.smi (3/69). Average speed: 2.62 s/mol.
Processing CHEMBL13659 in molecule.smi (4/69). Average speed: 1.36 s/mol.
Processing CHEMBL158034 in molecule.smi (5/69). Average speed: 1.14 s/mol.
Processing CHEMBL23699 in molecule.smi (6/69). Average speed: 0.88 s/mol.
Processing CHEMBL23278 in molecule.smi (7/69). Average speed: 0.86 s/mol.
Processing CHEMBL23388 in molecule.smi (8/69). Average speed: 0.74 s/mol.
Processing CHEMBL284013 in molecule.smi (9/69). Average speed: 0.73 s/mol.
Processing CHEMBL279302 in molecule.smi (10/69). Average speed: 0.64 s/mol.
Processing CHEMBL23872 in molecule.smi (11/69). Average speed: 0.59 s/mol.
Processing CHEMBL31176 in molecule.smi (12/69). Average speed: 0.59 s/mol.
Processing CHEMBL29713 in molecule.smi (13/69). Average speed: 0.58 s/mol.
Processing CHEMBL31107 in molecule.smi (14/69). Average speed: 0.55 s/mol.
Proce

In [None]:
! ls -l

total 25336
-rw-r--r-- 1 root root   133954 Feb  1 17:40 descriptors_output.csv
-rw-r--r-- 1 root root    10051 Feb  1 17:15 GAR_transformylase_04_bioactivity_data_3class_pIC50.csv
drwxr-xr-x 3 root root     4096 Feb  1 17:09 __MACOSX
-rw-r--r-- 1 root root     5150 Feb  1 17:20 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Feb  1 17:08 padel.sh
-rw-r--r-- 1 root root 25768637 Feb  1 17:08 padel.zip
drwxr-xr-x 1 root root     4096 Jan 20 17:27 sample_data


## Prepare the X and Y data matrices

### X matrix (independent variables)

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

print(df3_X.shape)
df3_X.head()

(69, 882)


Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,PubchemFP10,PubchemFP11,PubchemFP12,PubchemFP13,PubchemFP14,PubchemFP15,PubchemFP16,PubchemFP17,PubchemFP18,PubchemFP19,PubchemFP20,PubchemFP21,PubchemFP22,PubchemFP23,PubchemFP24,PubchemFP25,PubchemFP26,PubchemFP27,PubchemFP28,PubchemFP29,PubchemFP30,PubchemFP31,PubchemFP32,PubchemFP33,PubchemFP34,PubchemFP35,PubchemFP36,PubchemFP37,PubchemFP38,...,PubchemFP841,PubchemFP842,PubchemFP843,PubchemFP844,PubchemFP845,PubchemFP846,PubchemFP847,PubchemFP848,PubchemFP849,PubchemFP850,PubchemFP851,PubchemFP852,PubchemFP853,PubchemFP854,PubchemFP855,PubchemFP856,PubchemFP857,PubchemFP858,PubchemFP859,PubchemFP860,PubchemFP861,PubchemFP862,PubchemFP863,PubchemFP864,PubchemFP865,PubchemFP866,PubchemFP867,PubchemFP868,PubchemFP869,PubchemFP870,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL152172,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,CHEMBL289923,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,CHEMBL13659,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,CHEMBL153550,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,CHEMBL23699,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
df3_X = df3_X.drop(columns=['Name'])
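
Before the matrix is used for modelling, a quick sanity check can confirm that every fingerprint column holds only 0 or 1 and that the expected number of bits is present. The sketch below runs on toy data; `check_fingerprints` is a hypothetical helper, not part of PaDEL or pandas.

```python
import pandas as pd


def check_fingerprints(frame, prefix="PubchemFP"):
    """Return (number of fingerprint columns, whether all values are 0 or 1)."""
    fp_cols = [c for c in frame.columns if str(c).startswith(prefix)]
    is_binary = bool(frame[fp_cols].isin([0, 1]).all().all())
    return len(fp_cols), is_binary


# Toy stand-in for the descriptor table; IDs and bit values are made up.
toy = pd.DataFrame({
    "Name": ["MOL_A", "MOL_B"],
    "PubchemFP0": [1, 0],
    "PubchemFP1": [0, 0],
})

n_bits, is_binary = check_fingerprints(toy)
print(n_bits, is_binary)
```

Applied to the descriptor table loaded above, the count should be 881 (882 columns minus `Name`).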

### Y matrix (dependent variable)

In [None]:
df3_Y = df3['pIC50']

print(df3_Y.shape)
df3_Y.head()

(69,)


0    5.318759
1    4.744727
2    5.366532
3    6.200659
4    4.698970
Name: pIC50, dtype: float64

### Combine X and Y matrices

In [None]:
dataset3 = pd.concat([df3_X, df3_Y], axis = 1)

print(dataset3.shape)
dataset3.head()

(69, 882)


Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,PubchemFP10,PubchemFP11,PubchemFP12,PubchemFP13,PubchemFP14,PubchemFP15,PubchemFP16,PubchemFP17,PubchemFP18,PubchemFP19,PubchemFP20,PubchemFP21,PubchemFP22,PubchemFP23,PubchemFP24,PubchemFP25,PubchemFP26,PubchemFP27,PubchemFP28,PubchemFP29,PubchemFP30,PubchemFP31,PubchemFP32,PubchemFP33,PubchemFP34,PubchemFP35,PubchemFP36,PubchemFP37,PubchemFP38,PubchemFP39,...,PubchemFP842,PubchemFP843,PubchemFP844,PubchemFP845,PubchemFP846,PubchemFP847,PubchemFP848,PubchemFP849,PubchemFP850,PubchemFP851,PubchemFP852,PubchemFP853,PubchemFP854,PubchemFP855,PubchemFP856,PubchemFP857,PubchemFP858,PubchemFP859,PubchemFP860,PubchemFP861,PubchemFP862,PubchemFP863,PubchemFP864,PubchemFP865,PubchemFP866,PubchemFP867,PubchemFP868,PubchemFP869,PubchemFP870,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.318759
1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.744727
2,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.366532
3,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6.200659
4,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.69897
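
`pd.concat(..., axis=1)` pairs rows by index position, so it implicitly assumes that `descriptors_output.csv` lists the molecules in the same order as `df3`. The PaDEL log and the `Name` column above start with CHEMBL152172, while `df3` starts with CHEMBL289923, so that assumption may not hold here. A more defensive approach keeps the `Name` column and merges on the identifier. The sketch below uses toy data with made-up IDs and values:

```python
import pandas as pd

# Toy bioactivity table (stand-in for df3); IDs and values are made up.
bio = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "pIC50": [5.3, 4.7, 6.2],
})

# Toy descriptor table (stand-in for descriptors_output.csv),
# deliberately listed in a different row order.
desc = pd.DataFrame({
    "Name": ["MOL_B", "MOL_A", "MOL_C"],
    "FP0": [1, 0, 1],
    "FP1": [0, 0, 1],
})

# Match rows by identifier instead of by position.
# validate="one_to_one" raises an error if an ID is duplicated on either side.
merged = desc.merge(
    bio,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the identifier columns once the rows are aligned.
aligned = merged.drop(columns=["Name", "molecule_chembl_id"])
print(aligned)
```

With this approach each pIC50 value stays attached to the fingerprint of its own molecule, whatever order PaDEL writes the rows in.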


Save `dataset3` to a CSV file for later use in model building.

In [None]:
dataset3.to_csv('GAR_transformylase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
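
For the model-building step, the saved file can be read back and split into the feature matrix and the target again. The following sketch shows that round trip on toy data; the file name `toy_dataset.csv` and the column values are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for dataset3: fingerprint bits plus the pIC50 target.
toy = pd.DataFrame({
    "FP0": [1, 0, 1],
    "FP1": [0, 0, 1],
    "pIC50": [5.3, 4.7, 6.2],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dataset.csv")
    # index=False avoids an extra "Unnamed: 0" column when the file is reloaded.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Split back into features (X) and target (Y).
X = reloaded.drop(columns=["pIC50"])
Y = reloaded["pIC50"]
print(X.shape, Y.shape)
```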