<div style="padding: 0.5em; background-color: #1876d1; color: #fff;">

### **[Part 3] Computational Drug Discovery - Descriptor Calculation and Dataset Preparation**

</div>
In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then combine the descriptors with the bioactivity values into a single dataset for model building in Part 4.

Note:
* Target enzyme: aromatase, a drug target in breast cancer
* Objective: find compounds that inhibit aromatase function

---
<b># Bioinformatics Project </b>

## **Download PaDEL-Descriptor**

In [16]:
#! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
#! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

In [18]:
#! unzip padel.zip

In [1]:
! pip install padelpy

Collecting padelpy
  Downloading padelpy-0.1.16-py3-none-any.whl.metadata (7.7 kB)
Downloading padelpy-0.1.16-py3-none-any.whl (20.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 9.9 MB/s eta 0:00:00
Installing collected packages: padelpy
Successfully installed padelpy-0.1.16


## **Load bioactivity data**

In [2]:
import pandas as pd

In [4]:
df3 = pd.read_csv('data/04-bioactivity-3class-data.csv')

In [5]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,intermediate,329.528,4.28820,2.0,2.0,5.148736
1,1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,inactive,315.501,3.89810,2.0,2.0,4.301029
2,2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,active,412.306,5.70542,0.0,3.0,6.623241
3,3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,active,319.383,4.63450,0.0,3.0,7.243364
4,4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,active,321.811,4.58780,0.0,3.0,7.266803
...,...,...,...,...,...,...,...,...,...
2592,2592,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,active,292.338,3.17100,0.0,4.0,7.882729
2593,2593,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,active,278.311,2.86800,1.0,4.0,7.882729
2594,2594,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,active,292.338,3.17100,0.0,4.0,6.623606
2595,2595,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,intermediate,278.311,2.86800,1.0,4.0,5.958568


In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('data/05-molecule.smi', sep='\t', index=False, header=False)

In [None]:
! head -n 5 data/05-molecule.smi

In [14]:
! wc -l < data/05-molecule.smi

    2597
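
The `.smi` file written above has one molecule per line: the SMILES string, a tab, then the ChEMBL ID, with no header row. As a minimal sketch of that layout, the cell below rebuilds two rows copied from the table above in a toy DataFrame (the name `toy` is ours, not part of the notebook) and prints the text that `to_csv` would write.

```python
import pandas as pd

# Two rows copied from the df3 table above; column names match df3.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL113637", "CHEMBL112021"],
    "canonical_smiles": [
        "CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21",
        "Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21",
    ],
})

# With no path argument, to_csv returns the text instead of writing a file.
smi_text = toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    sep="\t", index=False, header=False
)

smi_lines = smi_text.splitlines()
first_smiles, first_name = smi_lines[0].split("\t")
print(len(smi_lines), first_name)
```

Because there is no header row, the line count of the real file should equal the number of rows in `df3`, which matches the 2597 reported by `wc -l`.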


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
#! cat padel.sh
#! bash padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


In [23]:
from padelpy import padeldescriptor

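# Compute PubChem fingerprints for every molecule in the .smi file.
# The XML file selects the fingerprint type; it is expected in the
# PaDEL-Descriptor folder (presumably unpacked from padel.zip above).
padeldescriptor(mol_dir='data/05-molecule.smi',           # input: SMILES <tab> ChEMBL ID
                d_file='data/05-descriptors_output.csv',  # output CSV, one row per molecule
                descriptortypes='./PaDEL-Descriptor/PubchemFingerprinter.xml',
                fingerprints=True,
                standardizenitro=True,
                removesalt=True,
                threads=6)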
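
Since the download cells at the top are commented out, the call above can fail if the fingerprint XML file or a Java runtime is missing. The sketch below is one way to check for both before running PaDEL. The helper `missing_padel_prerequisites` is a hypothetical name of ours and is not part of `padelpy`; it uses only the standard library.

```python
import os
import shutil


def missing_padel_prerequisites(descriptor_xml, input_smi):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # PaDEL-Descriptor is a Java program, so a java executable must be on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    # Both the descriptor-type XML and the input molecule file must exist.
    for path in (descriptor_xml, input_smi):
        if not os.path.isfile(path):
            problems.append(f"missing file: {path}")
    return problems


problems = missing_padel_prerequisites(
    "./PaDEL-Descriptor/PubchemFingerprinter.xml",
    "data/05-molecule.smi",
)
print(problems or "all prerequisites found")
```

If the list is not empty, uncommenting the `wget` and `unzip` cells at the top of the notebook is the likely fix for a missing XML file.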

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [24]:
df3_X = pd.read_csv('data/05-descriptors_output.csv')

In [25]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL341591,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL2111947,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL431859,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL113637,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL112021,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2592,CHEMBL5285636,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2593,CHEMBL5266533,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2594,CHEMBL5278229,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2595,CHEMBL5275747,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
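
The `Name` column is the only link between each fingerprint row and its molecule, and the later steps pair X and Y purely by row position. Before dropping `Name`, it may be worth confirming that the rows are still in the same order as `df3`. A minimal sketch, using toy frames with values copied from the tables above (the helper `same_row_order` is a hypothetical name of ours):

```python
import pandas as pd


def same_row_order(ids_a, ids_b):
    """True when both ID sequences have equal length and identical order."""
    return len(ids_a) == len(ids_b) and list(ids_a) == list(ids_b)


# Toy stand-ins for df3 and the PaDEL output.
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL113637", "CHEMBL112021"],
    "pIC50": [7.243364, 7.266803],
})
fp = pd.DataFrame({
    "Name": ["CHEMBL113637", "CHEMBL112021"],
    "PubchemFP0": [1, 1],
    "PubchemFP3": [0, 0],
})

# Reversing the rows simulates an output file written in a different order.
fp_reversed = fp.iloc[::-1].reset_index(drop=True)

print(same_row_order(bio["molecule_chembl_id"], fp["Name"]))
print(same_row_order(bio["molecule_chembl_id"], fp_reversed["Name"]))
```

In the notebook, the equivalent check would be `same_row_order(df3['molecule_chembl_id'], df3_X['Name'])`.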


In [26]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2592,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2593,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2594,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2595,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Select the pIC50 column**

In [27]:
df3_Y = df3['pIC50']
df3_Y

0       5.148736
1       4.301029
2       6.623241
3       7.243364
4       7.266803
          ...   
2592    7.882729
2593    7.882729
2594    6.623606
2595    5.958568
2596    5.596691
Name: pIC50, Length: 2597, dtype: float64
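
The `pIC50` column is read straight from the input file, so no conversion happens in this notebook. For reference, the standard definition is pIC50 = -log10(IC50 in mol/L). The sketch below assumes the upstream IC50 values were in nanomolar, which is our assumption and is not shown in this part; the function names are ours.

```python
import math


def pic50_from_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L), with IC50 given in nanomolar."""
    return -math.log10(ic50_nm * 1e-9)


def nm_from_pic50(pic50):
    """Inverse of pic50_from_nm: IC50 in nanomolar from a pIC50 value."""
    return 10 ** (9 - pic50)


print(round(pic50_from_nm(1000.0), 3))  # 1000 nM (1 µM) gives 6.0
```

A higher pIC50 means a more potent compound: each unit corresponds to a tenfold lower IC50.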

## **Combining the X and Y variables**

In [28]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.148736
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301029
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.623241
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.243364
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.266803
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2592,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.882729
2593,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.882729
2594,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.623606
2595,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.958568
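
`pd.concat(..., axis=1)` pairs rows by index label, not by position, so it only works as intended when both objects share the same index. The sketch below uses small hypothetical values to show what happens when the indexes differ, and how the row count and a NaN count can serve as a quick sanity check.

```python
import pandas as pd

X = pd.DataFrame({"PubchemFP0": [1, 1, 1]})
# Same number of values, but with an index that does not start at 0,
# as can happen after filtering rows without resetting the index.
y_shifted = pd.Series([5.1, 4.3, 6.6], index=[1, 2, 3], name="pIC50")

# Index labels differ, so concat builds the union of labels and fills gaps with NaN.
misaligned = pd.concat([X, y_shifted], axis=1)
# Resetting the index makes both objects share labels 0..2.
aligned = pd.concat([X, y_shifted.reset_index(drop=True)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

For `dataset3`, the same two checks would be `len(dataset3) == len(df3)` and `dataset3.isna().sum().sum() == 0`.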


In [29]:
dataset3.to_csv('data/06-bioactivity_data_3classpubchem_fp.csv', index=False)

---