# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then prepare them as a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [2]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-05-02 14:53:00--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 20.207.73.82
Connecting to github.com (github.com)|20.207.73.82|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-05-02 14:53:01--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-05-02 14:53:04 (29.7 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-05-02 14:53:05--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (git

In [3]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **MgluR5_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [4]:
import pandas as pd

In [5]:
df3 = pd.read_csv('MgluR5_04_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL66654,Cc1cccc(C#Cc2ccccc2)n1,active,193.249000,2.78982,0.0,1.0,-1.556303
1,1,CHEMBL88612,Cc1cccc(/C=C/c2ccccc2)n1,intermediate,195.265000,3.56042,0.0,1.0,-3.477121
2,2,CHEMBL2112677,[3H]C([3H])([3H])OCc1cccc(C#Cc2cccc(C)n2)c1,active,243.326148,2.93622,0.0,2.0,-1.000000
3,3,CHEMBL39338,N[C@@H](C[C@H](CCCC(c1ccccc1)c1ccccc1)C(=O)O)C...,inactive,355.434000,3.49160,3.0,3.0,-5.477121
4,4,CHEMBL40123,N[C@@H](C[C@H](CC(c1ccccc1)c1ccccc1)C(=O)O)C(=O)O,inactive,327.380000,2.71140,3.0,3.0,-5.477121
...,...,...,...,...,...,...,...,...,...
1967,1967,CHEMBL4591221,CC(C)(C)c1cc2n(n1)CCN(C(=O)c1ccc3ccccc3c1)C2,active,333.435000,3.98980,0.0,3.0,-2.678518
1968,1968,CHEMBL4572894,O=C1N[C@H](c2cncc(C#Cc3ccccc3)c2)[C@@H](c2c(F)...,active,376.362000,4.28180,1.0,3.0,-1.431364
1969,1969,CHEMBL1527295,CCOC(=O)c1cnc2c(OC)cccc2c1N1CCN(c2ccccc2F)CC1,inactive,409.461000,3.88580,0.0,6.0,-4.477121
1970,1970,CHEMBL4751065,Cl.N[C@]1(C(=O)O)[C@@H]2[C@@H](C(=O)O)[C@@H]2[...,inactive,359.350000,0.77650,4.0,5.0,-4.096910
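
The `Unnamed` columns in the table above look like index columns left over from saving earlier CSV files with their index. A minimal sketch of dropping them after loading is shown below on a toy CSV; the in-memory text and the variable names are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Toy CSV (illustrative): a file saved with its index twice, so it
# carries two leftover index columns when read back.
csv_text = (
    ",Unnamed: 0,molecule_chembl_id,pIC50\n"
    "0,0,CHEMBL66654,-1.556303\n"
    "1,1,CHEMBL88612,-3.477121\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Drop every column whose name starts with 'Unnamed'.
unnamed = [c for c in df.columns if c.startswith('Unnamed')]
df = df.drop(columns=unnamed)
print(list(df.columns))
```

Writing CSV files with `index=False`, as the final cell of this notebook does, avoids creating these columns in the first place.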


In [7]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [8]:
! head -5 molecule.smi

Cc1cccc(C#Cc2ccccc2)n1	CHEMBL66654
Cc1cccc(/C=C/c2ccccc2)n1	CHEMBL88612
[3H]C([3H])([3H])OCc1cccc(C#Cc2cccc(C)n2)c1	CHEMBL2112677
N[C@@H](C[C@H](CCCC(c1ccccc1)c1ccccc1)C(=O)O)C(=O)O	CHEMBL39338
N[C@@H](C[C@H](CC(c1ccccc1)c1ccccc1)C(=O)O)C(=O)O	CHEMBL40123


In [9]:
! cat molecule.smi | wc -l

1972
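
Before running the descriptor calculation, it can help to confirm that every row of the SMILES file has the expected layout. The sketch below assumes the tab-separated, header-less format written above (SMILES first, then the molecule ID). The helper name `check_smi` and the malformed sample row are hypothetical.

```python
import csv
import io

def check_smi(handle):
    """Return (row_count, bad_rows) for a tab-separated SMILES file.

    A row counts as bad when it does not have exactly two non-empty
    fields (SMILES string, then molecule ID).
    """
    bad_rows = []
    n = 0
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for n, row in enumerate(reader, start=1):
        if len(row) != 2 or not all(field.strip() for field in row):
            bad_rows.append(n)
    return n, bad_rows

# Demonstration on an in-memory sample; the third row is a made-up
# malformed entry with the SMILES field missing.
sample = io.StringIO(
    "Cc1cccc(C#Cc2ccccc2)n1\tCHEMBL66654\n"
    "Cc1cccc(/C=C/c2ccccc2)n1\tCHEMBL88612\n"
    "CHEMBL0000000\n"
)
count, bad = check_smi(sample)
print(count, bad)
```

To check the real file, pass an open handle instead, for example `with open('molecule.smi', newline='') as fh: check_smi(fh)`.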


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
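
The same command can be assembled in Python, which makes it easier to change individual options. This is only a sketch: the function name and its parameters are assumptions, while the flags are copied from `padel.sh` above.

```python
import subprocess

def build_padel_command(padel_dir='./PaDEL-Descriptor',
                        fingerprint='PubchemFingerprinter',
                        input_dir='./',
                        output_file='descriptors_output.csv',
                        heap='1G'):
    """Assemble the argument list used in padel.sh."""
    return [
        'java', f'-Xms{heap}', f'-Xmx{heap}', '-Djava.awt.headless=true',
        '-jar', f'{padel_dir}/PaDEL-Descriptor.jar',
        '-removesalt', '-standardizenitro', '-fingerprints',
        '-descriptortypes', f'{padel_dir}/{fingerprint}.xml',
        '-dir', input_dir,
        '-file', output_file,
    ]

cmd = build_padel_command()
print(' '.join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# subprocess.run(cmd, check=True)
```

Passing a different `fingerprint` value, such as `'MACCSFingerprinter'` from the unzipped archive listing, should point the command at another fingerprint definition file.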


In [11]:
! bash padel.sh

Processing CHEMBL66654 in molecule.smi (1/1972). 
Processing CHEMBL88612 in molecule.smi (2/1972). 
Processing CHEMBL2112677 in molecule.smi (3/1972). Average speed: 3.39 s/mol.
Processing CHEMBL40123 in molecule.smi (5/1972). Average speed: 1.41 s/mol.
Processing CHEMBL39338 in molecule.smi (4/1972). Average speed: 3.41 s/mol.
Processing CHEMBL97574 in molecule.smi (6/1972). Average speed: 1.59 s/mol.
Processing CHEMBL420262 in molecule.smi (7/1972). Average speed: 0.99 s/mol.
Processing CHEMBL319279 in molecule.smi (8/1972). Average speed: 1.06 s/mol.
Processing CHEMBL439775 in molecule.smi (9/1972). Average speed: 0.83 s/mol.
Processing CHEMBL329920 in molecule.smi (11/1972). Average speed: 0.70 s/mol.
Processing CHEMBL99462 in molecule.smi (10/1972). Average speed: 0.77 s/mol.
Processing CHEMBL319732 in molecule.smi (12/1972). Average speed: 0.65 s/mol.
Processing CHEMBL432038 in molecule.smi (13/1972). Average speed: 0.61 s/mol.
Processing CHEMBL95868 in molecule.smi (14/1972). Av

In [12]:
! ls -l

total 464504
-rw-rw-rw-  1 codespace codespace    70569 Apr 24 13:18 CDD_ML_Part_1_Bioactivity_Data_Concised.ipynb
-rw-rw-rw-  1 codespace root         64775 Apr 24 14:38 CDD_ML_Part_1_Bioactivity_Data_Concised_MgluR5.ipynb
-rw-rw-rw-  1 codespace root        100688 Apr 24 14:50 CDD_ML_Part_1_MgluR5_Bioactivity_Data_Concised.ipynb
-rw-rw-rw-  1 codespace root         85900 Apr 24 13:07 CDD_ML_Part_1_MgluR5_bioactivity_data.ipynb
-rw-rw-rw-  1 codespace root        441157 May  2 07:01 CDD_ML_Part_2_Exploratory_Data_Analysis_MgluR5.ipynb
-rw-rw-rw-  1 codespace root        372504 Apr 29 09:40 CDD_ML_Part_2_MgluR5_Exploratory_Data_Analysis.ipynb
-rw-rw-rw-  1 codespace root        350559 May  2 14:57 CDD_ML_Part_3_MgluR5_Descriptor_Dataset_Preparation.ipynb
-rw-rw-rw-  1 codespace root        100076 Apr 24 13:07 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
-rw-rw-rw-  1 codespace root        230778 Apr 24 13:07 CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipy

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [14]:
df3_X = pd.read_csv('descriptors_output.csv')

In [15]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL66654,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL88612,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL2112677,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL39338,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL40123,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1967,CHEMBL4591221,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1968,CHEMBL1527295,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1969,CHEMBL4572894,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1970,CHEMBL4758183,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1967,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1968,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1969,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1970,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values were already computed in the earlier parts of this series, so we simply select the `pIC50` column here.

In [18]:
df3_Y = df3['pIC50']
df3_Y

0      -1.556303
1      -3.477121
2      -1.000000
3      -5.477121
4      -5.477121
          ...   
1967   -2.678518
1968   -1.431364
1969   -4.477121
1970   -4.096910
1971   -4.000000
Name: pIC50, Length: 1972, dtype: float64
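
For reference, pIC50 is conventionally defined as the negative base-10 logarithm of the IC50 expressed in molar units. The values above are negative, which would be consistent with the logarithm having been taken on nM values without converting to molar. If that is the case, adding 9 to each value recovers the conventional scale. The sketch below assumes IC50 values given in nM; the function name is hypothetical.

```python
import math

def pic50_from_nM(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nM <= 0:
        raise ValueError('IC50 must be positive')
    return -math.log10(ic50_nM * 1e-9)

# An IC50 of 1000 nM (1 uM) corresponds to a pIC50 of 6.
print(round(pic50_from_nM(1000), 3))
```

On this scale, more potent compounds (lower IC50) have higher pIC50 values.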

## **Combining the X and Y variables**

In [19]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-1.556303
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-3.477121
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-1.000000
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-5.477121
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-5.477121
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1967,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-2.678518
1968,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-1.431364
1969,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-4.477121
1970,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-4.096910
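
Note that `pd.concat(..., axis=1)` pairs rows by position, so it is only correct if `descriptors_output.csv` lists the molecules in the same order as `df3`. The tables above suggest this may not hold: row 1968 is CHEMBL4572894 in `df3` but CHEMBL1527295 in the descriptor table. A more defensive sketch, shown here on toy data, keeps the `Name` column and merges on the molecule ID instead. It assumes each ID appears only once; the toy values and frame names are illustrative.

```python
import pandas as pd

# Toy stand-in for df3: IDs with their pIC50 values, in input order.
activity = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL1', 'CHEMBL2', 'CHEMBL3'],
    'pIC50': [7.4, 5.5, 6.0],
})

# Toy stand-in for descriptors_output.csv: same molecules, different order.
descriptors = pd.DataFrame({
    'Name': ['CHEMBL1', 'CHEMBL3', 'CHEMBL2'],
    'PubchemFP0': [1, 1, 0],
    'PubchemFP1': [0, 1, 1],
})

# Check whether positional concatenation would be safe.
same_order = (descriptors['Name'].tolist()
              == activity['molecule_chembl_id'].tolist())
print('Rows already aligned:', same_order)

# Merge on the ID so each fingerprint row gets its own pIC50.
# validate='one_to_one' raises an error if an ID is duplicated on either side.
merged = descriptors.merge(
    activity,
    left_on='Name',
    right_on='molecule_chembl_id',
    how='inner',
    validate='one_to_one',
)

# Drop the identifier columns to leave fingerprints plus pIC50.
dataset = merged.drop(columns=['Name', 'molecule_chembl_id'])
print(dataset)
```

If the real data contain repeated ChEMBL IDs, the `one_to_one` check will raise an error. In that case a unique row key would need to be written to `molecule.smi` in place of the ChEMBL ID.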


In [20]:
dataset3.to_csv('MgluR5_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Download the CSV file to your local computer for Part 4 (Model Building).**