# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**


In this Jupyter notebook, we will be building a **data science project**: a machine learning model trained on the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will prepare these as a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-05-31 16:52:03--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-05-31 16:52:03--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-05-31 16:52:03 (150 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-05-31 16:52:04--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will be using the **bioactivity_data_3class_pIC50.csv** file (uploaded below as `aromatase_04_bioactivity_data_3class_pIC50.csv`), which contains the pIC50 values that we will use to build a regression model.

In [3]:
from google.colab import files


uploaded = files.upload()


Saving aromatase_04_bioactivity_data_3class_pIC50.csv to aromatase_04_bioactivity_data_3class_pIC50.csv


In [4]:
import pandas as pd

In [6]:
df3 = pd.read_csv('aromatase_04_bioactivity_data_3class_pIC50.csv')

In [7]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,intermediate,329.528,4.28820,2.0,2.0,5.148742
1,1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,inactive,315.501,3.89810,2.0,2.0,4.301030
2,2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,active,412.306,5.70542,0.0,3.0,6.623423
3,3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,active,319.383,4.63450,0.0,3.0,7.244125
4,4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,active,321.811,4.58780,0.0,3.0,7.267606
...,...,...,...,...,...,...,...,...,...
2591,2592,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,active,292.338,3.17100,0.0,4.0,7.886057
2592,2593,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,active,278.311,2.86800,1.0,4.0,7.886057
2593,2594,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,active,292.338,3.17100,0.0,4.0,6.623788
2594,2595,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,intermediate,278.311,2.86800,1.0,4.0,5.958607


In [8]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [9]:
! cat molecule.smi | head -5

CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12	CHEMBL341591
C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43C)[C@@H]1CC[C@@H]2[C@H]1CN1	CHEMBL2111947
CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21	CHEMBL431859
CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21	CHEMBL113637
Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21	CHEMBL112021


In [10]:
! cat molecule.smi | wc -l

2596
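Before handing `molecule.smi` to PaDEL, it can help to confirm that every line really has the two expected tab-separated fields (SMILES, then ChEMBL ID). The following is a minimal sketch using only the standard library; the function name `check_smi` is hypothetical, and the sample lines are copied from the `head` output above.

```python
import io

# A few lines of molecule.smi as printed above: SMILES<TAB>ChEMBL ID, no header.
sample = (
    "CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12\tCHEMBL341591\n"
    "CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21\tCHEMBL113637\n"
    "Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21\tCHEMBL112021\n"
)

def check_smi(handle):
    """Return (n_rows, bad_line_numbers) for a tab-separated SMILES file.

    A line is flagged as bad if it does not have exactly two non-empty fields.
    """
    n_rows, bad = 0, []
    for line_no, line in enumerate(handle, start=1):
        n_rows += 1
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(f.strip() for f in fields):
            bad.append(line_no)
    return n_rows, bad

n_rows, bad = check_smi(io.StringIO(sample))
print(n_rows, bad)
```

On the real file, the same function could be called as `check_smi(open('molecule.smi'))`; the row count it returns should then agree with the `wc -l` result above.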


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [11]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
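The command in `padel.sh` sets the Java heap size (`-Xms1G -Xmx1G`), runs headless, removes salts (`-removesalt`), standardizes nitro groups (`-standardizenitro`), and computes the fingerprint type described by the XML file passed to `-descriptortypes`, reading molecule files from `-dir` and writing to `-file`. As a rough sketch, the same command line can be assembled in Python so the fingerprint file or output name is easy to vary. The helper `build_padel_command` and its parameter names are hypothetical; only the flags and paths shown in `padel.sh` are taken from the source.

```python
import shlex

def build_padel_command(
    fingerprint_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
    heap="1G",
):
    """Assemble the PaDEL-Descriptor command from padel.sh as an argument list."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", fingerprint_xml,
        "-dir", input_dir,
        "-file", output_csv,
    ]

cmd = build_padel_command()
# Print the command as a single shell-style line for comparison with padel.sh.
print(shlex.join(cmd))
```

The list could then be passed to `subprocess.run(cmd, check=True)` in an environment where Java and the unzipped `PaDEL-Descriptor` folder are available. Pointing `fingerprint_xml` at one of the other XML files in the archive (for example `MACCSFingerprinter.xml`) is assumed to select a different fingerprint type, but that is not verified here.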


In [12]:
! bash padel.sh

Processing CHEMBL341591 in molecule.smi (1/2596). 
Processing CHEMBL2111947 in molecule.smi (2/2596). 
Processing CHEMBL113637 in molecule.smi (4/2596). Average speed: 2.97 s/mol.
Processing CHEMBL431859 in molecule.smi (3/2596). Average speed: 5.47 s/mol.
Processing CHEMBL112021 in molecule.smi (5/2596). Average speed: 2.62 s/mol.
Processing CHEMBL324070 in molecule.smi (6/2596). Average speed: 1.99 s/mol.
Processing CHEMBL41761 in molecule.smi (7/2596). Average speed: 1.90 s/mol.
Processing CHEMBL111868 in molecule.smi (8/2596). Average speed: 1.75 s/mol.
Processing CHEMBL111888 in molecule.smi (9/2596). Average speed: 1.55 s/mol.
Processing CHEMBL112074 in molecule.smi (10/2596). Average speed: 1.48 s/mol.
Processing CHEMBL37321 in molecule.smi (12/2596). Average speed: 1.27 s/mol.
Processing CHEMBL324326 in molecule.smi (11/2596). Average speed: 1.34 s/mol.
Processing CHEMBL41066 in molecule.smi (14/2596). Average speed: 1.11 s/mol.
Processing CHEMBL353068 in molecule.smi (13/2596)

In [13]:
! ls -l

total 30208
-rw-r--r-- 1 root root   349878 May 31 16:56 aromatase_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 root root  4625701 May 31 17:12 descriptors_output.csv
drwxr-xr-x 3 root root     4096 May 31 16:53 __MACOSX
-rw-r--r-- 1 root root   162593 May 31 16:58 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 May 31 16:52 padel.sh
-rw-r--r-- 1 root root 25768637 May 31 16:52 padel.zip
drwxr-xr-x 1 root root     4096 May 30 13:36 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [14]:
df3_X = pd.read_csv('descriptors_output.csv')

In [15]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL2111947,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL341591,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL431859,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL113637,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL112021,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2591,CHEMBL5285636,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2592,CHEMBL5266533,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2593,CHEMBL5278229,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2594,CHEMBL5275747,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2591,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2592,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2593,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2594,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values were already computed in the earlier parts of this series, so here we simply select the `pIC50` column.

In [17]:
df3_Y = df3['pIC50']
df3_Y

0       5.148742
1       4.301030
2       6.623423
3       7.244125
4       7.267606
          ...   
2591    7.886057
2592    7.886057
2593    6.623788
2594    5.958607
2595    5.596708
Name: pIC50, Length: 2596, dtype: float64
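For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The conversion itself is not shown in this notebook, so the snippet below is only a sketch of the standard definition, assuming IC50 values given in nM; the function name `pic50_from_ic50_nm` is hypothetical.

```python
import math

def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)

# Under this definition, 50,000 nM gives about 4.30103,
# the same value as the second entry of df3_Y above.
print(round(pic50_from_ic50_nm(50_000), 6))
```

Higher pIC50 therefore means a more potent compound: each unit increase corresponds to a tenfold lower IC50.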

## **Combining the X and Y variables**

In [18]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.148742
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.623423
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.244125
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.267606
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2591,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.886057
2592,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.886057
2593,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.623788
2594,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.958607
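One caveat: `pd.concat(..., axis=1)` pairs rows by index position, not by molecule ID. In the outputs above, `descriptors_output.csv` lists CHEMBL2111947 before CHEMBL341591, while `df3` has them the other way round, so the row order of the PaDEL output may not match the input. A safer sketch is to keep the `Name` column and merge on the ID. The toy frames below reuse IDs, pIC50 values, and a couple of fingerprint bits from the tables above; the variable names `bio`, `fp`, and `dataset` are illustrative.

```python
import pandas as pd

# Toy stand-in for df3 (input order).
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL341591", "CHEMBL2111947", "CHEMBL431859"],
    "pIC50": [5.148742, 4.301030, 6.623423],
})

# Toy stand-in for descriptors_output.csv (PaDEL order, Name column kept).
fp = pd.DataFrame({
    "Name": ["CHEMBL2111947", "CHEMBL341591", "CHEMBL431859"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP3": [1, 1, 0],
})

# Check whether the two tables happen to be in the same row order.
order_matches = bool((fp["Name"].values == bio["molecule_chembl_id"].values).all())
print("same order:", order_matches)

# Align by molecule ID instead of by position.
# validate="one_to_one" raises if an ID is duplicated on either side.
aligned = fp.merge(
    bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)
dataset = aligned.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

If the check reports that the order already matches on the real data, the positional `pd.concat` gives the same result; if not, the merged version keeps each fingerprint row attached to its own pIC50.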


In [19]:
dataset3.to_csv('aromatase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)