# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Nusrat Jahan

In this Jupyter notebook, we will be building a real-life **data science project**. In particular, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially **quantitative descriptions** of the compounds in the dataset. Finally, we will prepare these as a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

Here we use PaDEL-Descriptor, a software package for calculating molecular descriptors and fingerprints.

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-09-20 22:07:13--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-09-20 22:07:14--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-09-20 22:07:17 (71.9 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-09-20 22:07:17--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (g

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will be using the **ER_alpha_RO5_pIC50.csv** file, which contains the pIC50 values that we will use for building a regression model.

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('/content/ER_alpha_RO5_pIC50.csv')

In [5]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,value,STATUS,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL431611,Oc1ccc2c(c1)S[C@H](C1CCCC1)[C@H](c1ccc(OCCN3CC...,2.50,active,439.621,6.04150,1.0,5.0,8.602060
1,CHEMBL316132,Oc1ccc2c(c1)S[C@H](C1CCCCCC1)[C@H](c1ccc(OCCN3...,7.50,active,467.675,6.82170,1.0,5.0,8.124939
2,CHEMBL85881,Oc1ccc2c(c1)S[C@H](CC1CCCCC1)[C@H](c1ccc(OCCN3...,3.90,active,467.675,6.82170,1.0,5.0,8.408935
3,CHEMBL85536,Oc1ccc2c(c1)S[C@H](Cc1ccccc1)[C@H](c1ccc(OCCN3...,7.40,active,461.627,6.09400,1.0,5.0,8.130768
4,CHEMBL83451,Oc1ccc2c(c1)S[C@H](c1ccncc1)[C@H](c1ccc(OCCN3C...,490.00,active,448.588,5.61900,1.0,6.0,6.309804
...,...,...,...,...,...,...,...,...,...
1467,CHEMBL4779838,CC/C(=C(/c1ccc(/C=C/C(=O)O)cc1)c1cc2ccccc2[nH]...,0.76,active,495.928,8.30700,2.0,1.0,9.119186
1468,CHEMBL4873534,Cc1c(-c2ccc(O)cc2)n(Cc2ccc(OCCn3cc(COCCOc4cccc...,3.20,active,770.799,4.74352,3.0,13.0,8.494850
1469,CHEMBL68236,CCN(CC)CCOc1ccc(C(=C(C#N)c2ccccc2)c2ccc(OCCN(C...,10000.00,inactive,511.710,6.61048,0.0,5.0,5.000000
1470,CHEMBL71584,Oc1ccc([C@@H]2Sc3ccc(O)cc3O[C@@H]2c2ccc(OCCN3C...,1.80,active,463.599,5.92960,2.0,6.0,8.744727


In [6]:
# **Molecular fingerprints**
# A molecular fingerprint describes a molecular structure by encoding it
# as a bit string.
# Because the fingerprint encodes the structure of a molecule, it is a
# useful molecular descriptor for comparing the structural similarity
# among molecules.
# PaDEL requires an input file with the .smi extension, which generally
# contains the SMILES string and the ChEMBL ID of each compound.
# PaDEL then computes fingerprints of a chosen type, such as PubChem
# fingerprints, for our compounds.
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
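
To make the idea of comparing bit-string fingerprints concrete, here is a minimal sketch of the Tanimoto coefficient, a common similarity measure for binary fingerprints. It is not part of the original notebook; the function name `tanimoto` and the two 8-bit example strings are hypothetical and chosen only for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit strings of '0'/'1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Positions of the bits that are switched on in each fingerprint
    on_a = {i for i, bit in enumerate(fp_a) if bit == "1"}
    on_b = {i for i, bit in enumerate(fp_b) if bit == "1"}
    union = on_a | on_b
    if not union:
        # Design choice for this sketch: two all-zero fingerprints count as identical
        return 1.0
    return len(on_a & on_b) / len(union)


# Hypothetical 8-bit fingerprints (real PubChem fingerprints have 881 bits)
fp_1 = "11110000"
fp_2 = "11100001"
similarity = tanimoto(fp_1, fp_2)
print(similarity)  # 3 shared bits out of 5 set in either fingerprint
```

A value of 1 means the two fingerprints set exactly the same bits, and a value of 0 means they share none.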

In [7]:
! cat molecule.smi | head -5

Oc1ccc2c(c1)S[C@H](C1CCCC1)[C@H](c1ccc(OCCN3CCCCC3)cc1)O2	CHEMBL431611
Oc1ccc2c(c1)S[C@H](C1CCCCCC1)[C@H](c1ccc(OCCN3CCCCC3)cc1)O2	CHEMBL316132
Oc1ccc2c(c1)S[C@H](CC1CCCCC1)[C@H](c1ccc(OCCN3CCCCC3)cc1)O2	CHEMBL85881
Oc1ccc2c(c1)S[C@H](Cc1ccccc1)[C@H](c1ccc(OCCN3CCCCC3)cc1)O2	CHEMBL85536
Oc1ccc2c(c1)S[C@H](c1ccncc1)[C@H](c1ccc(OCCN3CCCCC3)cc1)O2	CHEMBL83451


In [8]:
! cat molecule.smi | wc -l

1472


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat '/padel_pubchem.sh'

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_pubchem.csv
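
The same command can also be assembled from Python, which makes it easier to switch the fingerprint type and the output file. The sketch below only builds the argument list from the flags shown in the script above; the helper name `build_padel_command` and its parameters are hypothetical, and running the command still requires Java and the unpacked `PaDEL-Descriptor` folder.

```python
import shlex


def build_padel_command(fingerprint_xml, output_csv,
                        padel_dir="./PaDEL-Descriptor", input_dir="./", heap="1G"):
    """Return the PaDEL command as an argument list (hypothetical helper)."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", f"{padel_dir}/{fingerprint_xml}",
        "-dir", input_dir,
        "-file", output_csv,
    ]


cmd = build_padel_command("PubchemFingerprinter.xml", "descriptors_pubchem.csv")
print(shlex.join(cmd))

# To actually run it (needs Java and PaDEL on disk):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues if a path ever contains spaces.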


In [12]:
! bash /padel_pubchem.sh

Processing CHEMBL431611 in molecule.smi (1/1472). 
Processing CHEMBL316132 in molecule.smi (2/1472). 
Processing CHEMBL85881 in molecule.smi (3/1472). Average speed: 4.38 s/mol.
Processing CHEMBL85536 in molecule.smi (4/1472). Average speed: 2.26 s/mol.
Processing CHEMBL83451 in molecule.smi (5/1472). Average speed: 2.09 s/mol.
Processing CHEMBL315761 in molecule.smi (6/1472). Average speed: 1.57 s/mol.
Processing CHEMBL25228 in molecule.smi (7/1472). Average speed: 1.56 s/mol.
Processing CHEMBL432454 in molecule.smi (8/1472). Average speed: 1.32 s/mol.
Processing CHEMBL419110 in molecule.smi (9/1472). Average speed: 1.41 s/mol.
Processing CHEMBL85090 in molecule.smi (10/1472). Average speed: 1.34 s/mol.
Processing CHEMBL83060 in molecule.smi (11/1472). Average speed: 1.23 s/mol.
Processing CHEMBL85650 in molecule.smi (12/1472). Average speed: 1.18 s/mol.
Processing CHEMBL313825 in molecule.smi (14/1472). Average speed: 1.18 s/mol.
Processing CHEMBL313941 in molecule.smi (13/1472). Ave

In [13]:
! ls -l

total 28060
-rw-r--r-- 1 root root  2627504 Sep 20 22:20 descriptors_pubchem.csv
-rw-r--r-- 1 root root   204153 Sep 20 22:00 ER_alpha_RO5_pIC50.csv
drwxr-xr-x 3 root root     4096 Sep 20 22:07 __MACOSX
-rw-r--r-- 1 root root    96971 Sep 20 22:11 molecule.smi
-rw-r--r-- 1 root root      238 Sep 20 22:05 padel_AtomPairs2DFingerprintCount.sh
-rw-r--r-- 1 root root      235 Sep 20 22:05 padel_AtomPairs2DFingerprinter.sh
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      228 Sep 20 22:05 padel_MACCS.sh
-rw-r--r-- 1 root root      231 Sep 20 22:07 padel.sh
-rw-r--r-- 1 root root 25768637 Sep 20 22:07 padel.zip
drwxr-xr-x 1 root root     4096 Sep 14 13:44 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [18]:

df_pubchem = pd.read_csv('/content/descriptors_pubchem.csv')

In [19]:
df_pubchem

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL431611,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL316132,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL85881,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL85536,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL315761,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,CHEMBL4779838,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1468,CHEMBL68236,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1469,CHEMBL4873534,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1470,CHEMBL71584,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
df_pubchem_X = df_pubchem.drop(columns=['Name'])
df_pubchem_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1468,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1469,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1470,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pre-processed file already contains the pIC50 values, so we only need to select the pIC50 column.

In [22]:
df_pubchem_Y = df['pIC50']
df_pubchem_Y

0       8.602060
1       8.124939
2       8.408935
3       8.130768
4       6.309804
          ...   
1467    9.119186
1468    8.494850
1469    5.000000
1470    8.744727
1471    8.397940
Name: pIC50, Length: 1472, dtype: float64
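
For reference, the relationship between IC50 and pIC50 can be sketched as follows. This assumes that the `value` column holds IC50 in nanomolar, which is consistent with the rows displayed earlier (for example, 2.50 gives 8.602060). The function name `ic50_nm_to_pic50` is hypothetical and not taken from the earlier parts of this series.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M (assumes the input is in nM)
    return -math.log10(ic50_molar)


# IC50 values taken from the bioactivity table shown earlier
for ic50 in (2.50, 7.50, 10000.00):
    print(ic50, round(ic50_nm_to_pic50(ic50), 6))
```

Because of the negative logarithm, a lower IC50 (a more potent compound) gives a higher pIC50.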

## **Combining X and Y variables**

In [23]:
dataset_pubchem = pd.concat([df_pubchem_X,df_pubchem_Y], axis=1)
dataset_pubchem

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.602060
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.124939
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.408935
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.130768
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.309804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.119186
1468,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.494850
1469,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1470,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.744727


In [24]:
dataset_pubchem.to_csv('ERA_bioactivity_data_pIC50_pubchem_fp.csv', index=False)
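
One caveat about combining by position: in the outputs above, row 4 of the descriptor table is CHEMBL315761 while row 4 of the bioactivity table is CHEMBL83451, and the PaDEL log shows molecules finishing out of order. A positional `pd.concat` may therefore pair a fingerprint with the pIC50 of a different compound. A minimal sketch of a safer alternative, joining on the compound ID, is shown below. The tiny data frames and the `FP0`/`FP1` bit values are illustrative placeholders, not real descriptor output.

```python
import pandas as pd

# Bioactivity table in its original order (IDs and pIC50 from the table above)
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL431611", "CHEMBL316132", "CHEMBL85881"],
    "pIC50": [8.602060, 8.124939, 8.408935],
})

# Descriptor table whose rows come back in a different order
# (bit values here are made up for illustration)
descriptors = pd.DataFrame({
    "Name": ["CHEMBL316132", "CHEMBL431611", "CHEMBL85881"],
    "FP0": [1, 0, 1],
    "FP1": [0, 1, 1],
})

# Join on the compound ID instead of relying on row position.
# validate="one_to_one" raises an error if an ID appears more than once.
merged = descriptors.merge(
    bioactivity[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the ID columns so only the features and the target remain
dataset = merged.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

If the real data contains the same ChEMBL ID more than once, the `validate` check will fail, and the duplicates would need to be resolved before merging.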

In [25]:
! bash padel_AtomPairs2DFingerprintCount.sh

Processing CHEMBL431611 in molecule.smi (1/1472). 
Processing CHEMBL316132 in molecule.smi (2/1472). 
Processing CHEMBL85881 in molecule.smi (3/1472). Average speed: 2.55 s/mol.
Processing CHEMBL85536 in molecule.smi (4/1472). Average speed: 1.29 s/mol.
Processing CHEMBL315761 in molecule.smi (6/1472). Average speed: 0.69 s/mol.
Processing CHEMBL83451 in molecule.smi (5/1472). Average speed: 0.90 s/mol.
Processing CHEMBL25228 in molecule.smi (7/1472). Average speed: 0.56 s/mol.
Processing CHEMBL432454 in molecule.smi (8/1472). Average speed: 0.48 s/mol.
Processing CHEMBL419110 in molecule.smi (9/1472). Average speed: 0.41 s/mol.
Processing CHEMBL85090 in molecule.smi (10/1472). Average speed: 0.38 s/mol.
Processing CHEMBL83060 in molecule.smi (11/1472). Average speed: 0.33 s/mol.
Processing CHEMBL313941 in molecule.smi (13/1472). Average speed: 0.29 s/mol.
Processing CHEMBL85650 in molecule.smi (12/1472). Average speed: 0.31 s/mol.
Processing CHEMBL313825 in molecule.smi (14/1472). Ave

In [26]:
df_AP2DFC_X = pd.read_csv('/content/descriptors_AP2DFC.csv')
df_AP2DFC_X

Unnamed: 0,Name,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,...,APC2D10_I_I,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X
0,CHEMBL316132,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL431611,25.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL85881,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL85536,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL83451,24.0,5.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,CHEMBL4779838,30.0,2.0,2.0,0.0,0.0,3.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1468,CHEMBL68236,29.0,7.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1469,CHEMBL71584,26.0,3.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1470,CHEMBL4873534,39.0,11.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
df_AP2DFC_X = df_AP2DFC_X.drop(columns=['Name'])
df_AP2DFC_X

Unnamed: 0,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,APC2D1_C_B,...,APC2D10_I_I,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X
0,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,25.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,24.0,5.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,30.0,2.0,2.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1468,29.0,7.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1469,26.0,3.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1470,39.0,11.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
df_AP2DFC_Y = df['pIC50']
df_AP2DFC_Y

0       8.602060
1       8.124939
2       8.408935
3       8.130768
4       6.309804
          ...   
1467    9.119186
1468    8.494850
1469    5.000000
1470    8.744727
1471    8.397940
Name: pIC50, Length: 1472, dtype: float64

In [29]:
dataset_AP2DFC = pd.concat([df_AP2DFC_X, df_AP2DFC_Y], axis=1)
dataset_AP2DFC

Unnamed: 0,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,APC2D1_C_B,...,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X,pIC50
0,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.602060
1,25.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.124939
2,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.408935
3,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.130768
4,24.0,5.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.309804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,30.0,2.0,2.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.119186
1468,29.0,7.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.494850
1469,26.0,3.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000
1470,39.0,11.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.744727


In [30]:
dataset_AP2DFC.to_csv('ERA_bioactivity_data_pIC50_AP2DFC.csv', index=False)

In [31]:
! bash padel_AtomPairs2DFingerprinter.sh

Processing CHEMBL431611 in molecule.smi (1/1472). 
Processing CHEMBL316132 in molecule.smi (2/1472). 
Processing CHEMBL85881 in molecule.smi (3/1472). Average speed: 2.47 s/mol.
Processing CHEMBL85536 in molecule.smi (4/1472). Average speed: 1.26 s/mol.
Processing CHEMBL83451 in molecule.smi (5/1472). Average speed: 0.87 s/mol.
Processing CHEMBL315761 in molecule.smi (6/1472). Average speed: 0.67 s/mol.
Processing CHEMBL25228 in molecule.smi (7/1472). Average speed: 0.46 s/mol.
Processing CHEMBL432454 in molecule.smi (8/1472). Average speed: 0.46 s/mol.
Processing CHEMBL85090 in molecule.smi (10/1472). Average speed: 0.36 s/mol.
Processing CHEMBL419110 in molecule.smi (9/1472). Average speed: 0.41 s/mol.
Processing CHEMBL83060 in molecule.smi (11/1472). Average speed: 0.33 s/mol.
Processing CHEMBL85650 in molecule.smi (12/1472). Average speed: 0.31 s/mol.
Processing CHEMBL313941 in molecule.smi (13/1472). Average speed: 0.28 s/mol.
Processing CHEMBL313825 in molecule.smi (14/1472). Ave

In [32]:
df_AP2DFP_X = pd.read_csv('/content/descriptors_AP2DFP.csv')
df_AP2DFP_X

Unnamed: 0,Name,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,...,AD2D771,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780
0,CHEMBL316132,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL431611,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL85881,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL85536,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL315761,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,CHEMBL4779838,1,1,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1468,CHEMBL68236,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1469,CHEMBL71584,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1470,CHEMBL4873534,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
df_AP2DFP_X = df_AP2DFP_X.drop(columns=['Name'])
df_AP2DFP_X

Unnamed: 0,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,AD2D10,...,AD2D771,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,1,1,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1468,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1469,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1470,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
df_AP2DFP_Y = df['pIC50']
df_AP2DFP_Y

0       8.602060
1       8.124939
2       8.408935
3       8.130768
4       6.309804
          ...   
1467    9.119186
1468    8.494850
1469    5.000000
1470    8.744727
1471    8.397940
Name: pIC50, Length: 1472, dtype: float64

In [35]:
dataset_AP2DFP = pd.concat([df_AP2DFP_X, df_AP2DFP_Y], axis=1)
dataset_AP2DFP

Unnamed: 0,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,AD2D10,...,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780,pIC50
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8.602060
1,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8.124939
2,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8.408935
3,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8.130768
4,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6.309804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,1,1,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,9.119186
1468,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8.494850
1469,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1470,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,8.744727


In [36]:
dataset_AP2DFP.to_csv('ERA_bioactivity_data_pIC50_AP2DFP.csv', index=False)

In [37]:
! bash padel_MACCS.sh

Processing CHEMBL431611 in molecule.smi (1/1472). 
Processing CHEMBL316132 in molecule.smi (2/1472). 
Processing CHEMBL85881 in molecule.smi (3/1472). Average speed: 4.34 s/mol.
Processing CHEMBL85536 in molecule.smi (4/1472). Average speed: 2.21 s/mol.
Processing CHEMBL83451 in molecule.smi (5/1472). Average speed: 1.72 s/mol.
Processing CHEMBL315761 in molecule.smi (6/1472). Average speed: 1.35 s/mol.
Processing CHEMBL25228 in molecule.smi (7/1472). Average speed: 1.16 s/mol.
Processing CHEMBL432454 in molecule.smi (8/1472). Average speed: 1.00 s/mol.
Processing CHEMBL419110 in molecule.smi (9/1472). Average speed: 0.88 s/mol.
Processing CHEMBL85090 in molecule.smi (10/1472). Average speed: 0.80 s/mol.
Processing CHEMBL83060 in molecule.smi (11/1472). Average speed: 0.73 s/mol.
Processing CHEMBL85650 in molecule.smi (12/1472). Average speed: 0.70 s/mol.
Processing CHEMBL313941 in molecule.smi (13/1472). Average speed: 0.67 s/mol.
Processing CHEMBL313825 in molecule.smi (14/1472). Ave

In [38]:
df_MACCS_X = pd.read_csv('/content/descriptors_MACCS.csv')
df_MACCS_X

Unnamed: 0,Name,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,...,MACCSFP157,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166
0,CHEMBL431611,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
1,CHEMBL316132,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
2,CHEMBL85881,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
3,CHEMBL85536,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
4,CHEMBL83451,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,CHEMBL4649161,0,0,0,0,0,0,0,1,0,...,1,1,0,1,1,1,1,1,1,0
1468,CHEMBL68236,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1469,CHEMBL71584,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
1470,CHEMBL4873534,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0


In [39]:
df_MACCS_X = df_MACCS_X.drop(columns=['Name'])
df_MACCS_X

Unnamed: 0,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,MACCSFP10,...,MACCSFP157,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166
0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
1,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
2,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
3,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
4,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,0,0,0,0,0,0,0,1,0,0,...,1,1,0,1,1,1,1,1,1,0
1468,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
1469,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,1,0
1470,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0


In [40]:
df_MACCS_Y = df['pIC50']
df_MACCS_Y

0       8.602060
1       8.124939
2       8.408935
3       8.130768
4       6.309804
          ...   
1467    9.119186
1468    8.494850
1469    5.000000
1470    8.744727
1471    8.397940
Name: pIC50, Length: 1472, dtype: float64

In [41]:
dataset_MACCS = pd.concat([df_MACCS_X,df_MACCS_Y], axis=1)
dataset_MACCS

Unnamed: 0,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,MACCSFP10,...,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166,pIC50
0,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,0,8.602060
1,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,0,8.124939
2,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,0,8.408935
3,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,0,8.130768
4,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,0,6.309804
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1467,0,0,0,0,0,0,0,1,0,0,...,1,0,1,1,1,1,1,1,0,9.119186
1468,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,8.494850
1469,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,0,5.000000
1470,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,8.744727


In [42]:
dataset_MACCS.to_csv('ERA_bioactivity_data_pIC50_MACCS.csv', index=False)

# **Let's download the CSV files to your local computer for Part 4 (Model Building).**