# **Bioinformatics Project - Computational Drug Discovery [Part 2] Descriptor Calculation and Dataset Preparation**

Nusrat Jahan

In this Jupyter notebook, we will build a real-life **data science project**: a machine learning model trained on ChEMBL bioactivity data.

In **Part 2**, we calculate molecular descriptors, which are essentially **quantitative descriptions** of the compounds in the dataset. We then assemble them into a dataset for model building in Part 3.

---

## **Download PaDEL-Descriptor**

Here we use PaDEL-Descriptor, a software package for calculating molecular descriptors and fingerprints.

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-10-02 17:56:26--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-10-02 17:56:26--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-10-02 17:56:26 (289 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-10-02 17:56:26--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed earlier in this Bioinformatics Project series. Here we use the **PLK1_RO5_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/content/PLK1_RO5_pIC50.csv')

In [None]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,STATUS,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,inactive,291.354,3.62150,2.0,2.0,5.000000
1,CHEMBL199996,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12,inactive,315.358,2.67572,4.0,4.0,4.698970
2,CHEMBL199658,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12,inactive,334.788,2.93742,3.0,4.0,4.000000
3,CHEMBL199657,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12,inactive,334.788,2.93742,3.0,4.0,4.000000
4,CHEMBL371695,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccccc3Cl)c12,inactive,334.788,2.93742,3.0,4.0,4.638272
...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,Cc1nn(-c2ccccc2)c2cc(N[C@@H](C)c3ccccc3)ncc12,intermediate,328.419,4.90202,1.0,4.0,5.885723
1209,CHEMBL559845,Cn1nc(C(N)=O)c2c1-c1nc(Nc3ccccc3)ncc1CC2,active,320.356,1.81820,2.0,6.0,7.167491
1210,CHEMBL562104,CNC(=O)c1nn(C)c2c1CCc1cnc(Nc3ccccc3)nc1-2,intermediate,334.383,2.07890,2.0,6.0,5.375202
1211,CHEMBL563150,Cn1nc(C(=O)Nc2ccccc2)c2c1-c1nc(Nc3ccccc3)ncc1CC2,inactive,396.454,3.97160,2.0,6.0,5.000000


In [None]:
# Molecular fingerprints
# A molecular fingerprint encodes a molecular structure as a bit string.
# Because the bits reflect structural features, fingerprints serve as
# molecular descriptors and can be compared to measure the structural
# similarity between molecules.
# PaDEL reads the compounds from a .smi file. Here each line holds a SMILES
# string and its ChEMBL ID, separated by a tab. PaDEL then computes the
# selected fingerprint types (for example, PubChem fingerprints) for every
# compound in that file.
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
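
To make the idea of a fingerprint as a bit string more concrete, the sketch below compares two made-up bit vectors with the Tanimoto coefficient, a common similarity measure for binary fingerprints. The vectors and the helper name `tanimoto` are illustrative assumptions and are not PaDEL output.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: bits set in both / bits set in either."""
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    total = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two all-zero fingerprints share nothing; report 0.0 instead of dividing by zero
    return shared / total if total else 0.0

# Hypothetical 8-bit fingerprints, for illustration only
fp1 = [1, 1, 0, 0, 1, 0, 0, 1]
fp2 = [1, 1, 0, 0, 0, 0, 1, 1]

similarity = tanimoto(fp1, fp2)
print(similarity)
```

A value near 1 means the two compounds set almost the same bits, while a value near 0 means they share few structural features.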

In [None]:
! cat molecule.smi | head -5

O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1	CHEMBL115220
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12	CHEMBL199996
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12	CHEMBL199658
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12	CHEMBL199657
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccccc3Cl)c12	CHEMBL371695


In [None]:
! cat molecule.smi | wc -l

1213


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_pubchem.csv &&

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_MACCS.csv &&

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/AtomPairs2DFingerprinter.xml -dir ./ -file descriptors_AtomPairs2DFP.csv &&

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/Fingerprinter.xml -dir ./ -file descriptors_Fingerprinter.csv &&

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./
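
The lines of `padel.sh` shown above differ only in the fingerprint XML file and the output CSV name. As a rough sketch, the same command can be assembled in Python as an argument list. The flags are copied from the script; the function name `padel_command` is hypothetical, and the sketch only builds the command without running Java.

```python
def padel_command(fingerprint_xml, output_csv,
                  jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar"):
    """Return the argument list for one PaDEL run, mirroring a line of padel.sh."""
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", jar,
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", fingerprint_xml,  # which fingerprint definition to use
        "-dir", "./",                         # directory containing molecule.smi
        "-file", output_csv,                  # where the descriptor table is written
    ]

cmd = padel_command("./PaDEL-Descriptor/PubchemFingerprinter.xml",
                    "descriptors_pubchem.csv")
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would be one way to execute it from a notebook cell, assuming Java and the unzipped PaDEL folder are available.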

In [None]:
! bash '/content/padel.sh'

Streaming output truncated to the last 5000 lines.
Processing CHEMBL4762177 in molecule.smi (1070/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4782750 in molecule.smi (1073/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4781698 in molecule.smi (1072/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4753107 in molecule.smi (1075/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4745902 in molecule.smi (1074/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4786880 in molecule.smi (1077/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4795057 in molecule.smi (1076/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4782976 in molecule.smi (1079/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4779347 in molecule.smi (1078/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4747182 in molecule.smi (1081/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4799166 in molecule.smi (1080/1213). Average speed: 0.08 s/mol.
Processing CHEMBL4758334 in molecule.smi (10

In [None]:
! bash '/content/padel_1.sh'

Processing CHEMBL115220 in molecule.smi (1/1213). 
Processing CHEMBL199996 in molecule.smi (2/1213). 
Processing CHEMBL199657 in molecule.smi (4/1213). Average speed: 3.63 s/mol.
Processing CHEMBL199658 in molecule.smi (3/1213). Average speed: 3.62 s/mol.
Processing CHEMBL371695 in molecule.smi (5/1213). Average speed: 1.28 s/mol.
Processing CHEMBL382070 in molecule.smi (6/1213). Average speed: 0.97 s/mol.
Processing CHEMBL199759 in molecule.smi (7/1213). Average speed: 0.81 s/mol.
Processing CHEMBL370199 in molecule.smi (8/1213). Average speed: 0.68 s/mol.
Processing CHEMBL199737 in molecule.smi (9/1213). Average speed: 0.60 s/mol.
Processing CHEMBL199383 in molecule.smi (11/1213). Average speed: 0.48 s/mol.
Processing CHEMBL371239 in molecule.smi (10/1213). Average speed: 0.53 s/mol.
Processing CHEMBL199755 in molecule.smi (13/1213). Average speed: 0.41 s/mol.
Processing CHEMBL197923 in molecule.smi (12/1213). Average speed: 0.45 s/mol.
Processing CHEMBL199528 in molecule.smi (14/121

In [None]:
! ls -l

total 78072
-rw-r--r-- 1 root root  3830709 Oct  2 18:53 descriptors_AtomPairs2DFPCount.csv
-rw-r--r-- 1 root root  1917453 Oct  2 18:05 descriptors_AtomPairs2DFP.csv
-rw-r--r-- 1 root root   211555 Oct  2 19:40 descriptors_EstateFP.csv
-rw-r--r-- 1 root root  2512398 Oct  2 18:50 descriptors_ExtendedFP.csv
-rw-r--r-- 1 root root  2509326 Oct  2 18:57 descriptors_Fingerprinter.csv
-rw-r--r-- 1 root root  2514446 Oct  2 19:41 descriptors_GraphOFP.csv
-rw-r--r-- 1 root root 23648407 Oct  2 18:50 descriptors_KRFPcount.csv
-rw-r--r-- 1 root root 11852034 Oct  2 20:23 descriptors_KRFP.csv
-rw-r--r-- 1 root root   423475 Oct  2 18:04 descriptors_MACCS.csv
-rw-r--r-- 1 root root  2167690 Oct  2 18:02 descriptors_pubchem.csv
-rw-r--r-- 1 root root  1516216 Oct  2 18:56 descriptors_SubSFPCount.csv
-rw-r--r-- 1 root root   766478 Oct  2 18:54 descriptors_substructureFP.csv
drwxr-xr-x 3 root root     4096 Oct  2 17:56 __MACOSX
-rw-r--r-- 1 root root    85588 Oct  2 17:56 molecule.smi
-rw-r--r-- 1

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df_pubchem_name = pd.read_csv('/content/descriptors_pubchem.csv')
df_pubchem_name

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL199996,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199658,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL371695,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL199657,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,CHEMBL559845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1210,CHEMBL562104,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1211,CHEMBL603463,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_pubchem_X = df_pubchem_name.drop(columns=['Name'])
df_pubchem_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1209,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1210,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1211,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
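
The table above still contains an `Unnamed: 0` column. This usually appears when a CSV that was saved together with its index is read back without `index_col`; if it stays in the X matrix, the model will treat the row number as a feature. The sketch below reproduces the effect on a tiny made-up CSV (the compound names and bits are invented) and shows one way to avoid it.

```python
import io
import pandas as pd

# Tiny made-up CSV that mimics a descriptor file saved together with its index
csv_text = ",Name,PubchemFP0,PubchemFP1\n0,CHEMBL_A,1,0\n1,CHEMBL_B,1,1\n"

# Default read: the saved index shows up as an extra "Unnamed: 0" column
raw = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores it as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))

# Keep only the fingerprint bits as features
X = clean.drop(columns=["Name"])
```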


In [None]:
df_MACCS_name = pd.read_csv('/content/descriptors_MACCS.csv')
df_MACCS_name

Unnamed: 0,Name,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,...,MACCSFP157,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166
0,CHEMBL199996,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
1,CHEMBL115220,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,1,1,1,0
2,CHEMBL199658,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
3,CHEMBL199657,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
4,CHEMBL382070,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,0,1,0
1209,CHEMBL559845,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,1,0
1210,CHEMBL562104,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,1,0
1211,CHEMBL603463,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0


In [None]:
df_MACCS_X = df_MACCS_name.drop(columns=['Name'])
df_MACCS_X

Unnamed: 0,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,MACCSFP10,...,MACCSFP157,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166
0,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,1,1,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,0,1,0
1209,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,1,0
1210,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,1,0
1211,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0


In [None]:
df_AP2DFP_name = pd.read_csv('/content/descriptors_AtomPairs2DFP.csv')
df_AP2DFP_name

Unnamed: 0,Name,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,...,AD2D771,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780
0,CHEMBL199996,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199658,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL199657,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL371695,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL559845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,CHEMBL562104,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1210,CHEMBL525907,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1211,CHEMBL563150,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_AP2DFP_X = df_AP2DFP_name.drop(columns=['Name'])
df_AP2DFP_X

Unnamed: 0,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,AD2D10,...,AD2D771,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1210,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1211,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_AP2DFPC_name = pd.read_csv('/content/descriptors_AtomPairs2DFPCount.csv')
df_AP2DFPC_name

Unnamed: 0,Name,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,...,APC2D10_I_I,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X
0,CHEMBL199996,12.0,7.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL115220,19.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL199657,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL199658,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL382070,10.0,5.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL559845,15.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1209,CHEMBL525907,20.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1210,CHEMBL562104,15.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1211,CHEMBL563150,21.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df_AP2DFPC_X = df_AP2DFPC_name.drop(columns=['Name'])
df_AP2DFPC_X

Unnamed: 0,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,APC2D1_C_B,...,APC2D10_I_I,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X
0,12.0,7.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,19.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,10.0,5.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,15.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1209,20.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1210,15.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1211,21.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df_EFP_name = pd.read_csv('/content/descriptors_ExtendedFP.csv')
df_EFP_name

Unnamed: 0,Name,ExtFP1,ExtFP2,ExtFP3,ExtFP4,ExtFP5,ExtFP6,ExtFP7,ExtFP8,ExtFP9,...,ExtFP1015,ExtFP1016,ExtFP1017,ExtFP1018,ExtFP1019,ExtFP1020,ExtFP1021,ExtFP1022,ExtFP1023,ExtFP1024
0,CHEMBL199996,0,1,1,1,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,0,0,0,1,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
2,CHEMBL199657,1,1,1,1,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
3,CHEMBL199658,1,1,1,1,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
4,CHEMBL371695,1,1,1,1,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,1,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
1209,CHEMBL559845,1,0,1,0,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1210,CHEMBL563150,1,0,1,0,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1211,CHEMBL562104,1,0,1,0,1,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


In [None]:
df_EFP_X = df_EFP_name.drop(columns=['Name'])
df_EFP_X

Unnamed: 0,ExtFP1,ExtFP2,ExtFP3,ExtFP4,ExtFP5,ExtFP6,ExtFP7,ExtFP8,ExtFP9,ExtFP10,...,ExtFP1015,ExtFP1016,ExtFP1017,ExtFP1018,ExtFP1019,ExtFP1020,ExtFP1021,ExtFP1022,ExtFP1023,ExtFP1024
0,0,1,1,1,0,0,1,1,0,1,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,1,1,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1209,1,0,1,0,1,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1210,1,0,1,0,1,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1211,1,0,1,0,1,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


In [None]:
df_FP_name = pd.read_csv('/content/descriptors_Fingerprinter.csv')
df_FP_name

Unnamed: 0,Name,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,...,FP1015,FP1016,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024
0,CHEMBL199996,1,0,1,0,0,0,0,0,0,...,1,0,1,0,1,0,0,1,0,1
1,CHEMBL115220,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,CHEMBL199658,1,0,1,0,0,0,0,0,0,...,1,1,1,0,1,0,0,1,0,1
3,CHEMBL199657,1,0,1,0,0,0,0,0,0,...,1,1,1,0,1,0,0,1,0,1
4,CHEMBL371695,1,0,1,0,0,0,0,0,0,...,1,1,1,0,1,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL514499,1,1,1,0,0,0,1,0,0,...,0,1,0,1,0,0,0,1,0,0
1209,CHEMBL562104,1,1,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0
1210,CHEMBL559845,1,1,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0
1211,CHEMBL563150,1,1,0,1,1,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0


In [None]:
df_FP_X = df_FP_name.drop(columns=['Name'])
df_FP_X

Unnamed: 0,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,FP10,...,FP1015,FP1016,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024
0,1,0,1,0,0,0,0,0,0,0,...,1,0,1,0,1,0,0,1,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,1,0,1,0,0,0,0,0,0,0,...,1,1,1,0,1,0,0,1,0,1
3,1,0,1,0,0,0,0,0,0,0,...,1,1,1,0,1,0,0,1,0,1
4,1,0,1,0,0,0,0,0,0,0,...,1,1,1,0,1,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,1,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1209,1,1,0,1,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1210,1,1,0,1,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1211,1,1,0,1,1,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0


In [None]:
df_KRFPC_name = pd.read_csv('/content/descriptors_KRFPcount.csv')
df_KRFPC_name

Unnamed: 0,Name,KRFPC1,KRFPC2,KRFPC3,KRFPC4,KRFPC5,KRFPC6,KRFPC7,KRFPC8,KRFPC9,...,KRFPC4851,KRFPC4852,KRFPC4853,KRFPC4854,KRFPC4855,KRFPC4856,KRFPC4857,KRFPC4858,KRFPC4859,KRFPC4860
0,CHEMBL199996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL115220,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL199658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL199657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL371695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1209,CHEMBL559845,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1210,CHEMBL562104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1211,CHEMBL563150,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df_KRFPC_X = df_KRFPC_name.drop(columns=['Name'])
df_KRFPC_X

Unnamed: 0,KRFPC1,KRFPC2,KRFPC3,KRFPC4,KRFPC5,KRFPC6,KRFPC7,KRFPC8,KRFPC9,KRFPC10,...,KRFPC4851,KRFPC4852,KRFPC4853,KRFPC4854,KRFPC4855,KRFPC4856,KRFPC4857,KRFPC4858,KRFPC4859,KRFPC4860
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1210,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df_SFPC_name = pd.read_csv('/content/descriptors_SubSFPCount.csv')
df_SFPC_name

Unnamed: 0,Name,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,...,SubFPC298,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307
0,CHEMBL199996,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,16.0
1,CHEMBL115220,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,5.0,0.0,0.0,0.0,0.0,19.0
2,CHEMBL199657,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0
3,CHEMBL199658,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0
4,CHEMBL382070,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,5.0,3.0,0.0,0.0,0.0,0.0,21.0
1209,CHEMBL559845,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,7.0,7.0,3.0,0.0,0.0,0.0,0.0,16.0
1210,CHEMBL562104,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,4.0,0.0,0.0,0.0,0.0,17.0
1211,CHEMBL563150,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,23.0


In [None]:
df_SFPC_X = df_SFPC_name.drop(columns=['Name'])
df_SFPC_X

Unnamed: 0,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,SubFPC10,...,SubFPC298,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,16.0
1,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,5.0,0.0,0.0,0.0,0.0,19.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,5.0,3.0,0.0,0.0,0.0,0.0,21.0
1209,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,7.0,7.0,3.0,0.0,0.0,0.0,0.0,16.0
1210,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,4.0,0.0,0.0,0.0,0.0,17.0
1211,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,23.0


In [None]:
df_SFP_name = pd.read_csv('/content/descriptors_substructureFP.csv')
df_SFP_name

Unnamed: 0,Name,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,...,SubFP298,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307
0,CHEMBL199996,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1,CHEMBL115220,0,1,1,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
2,CHEMBL199658,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
3,CHEMBL199657,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
4,CHEMBL371695,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1209,CHEMBL559845,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1210,CHEMBL562104,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1211,CHEMBL603463,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1


In [None]:
df_SFP_X = df_SFP_name.drop(columns=['Name'])
df_SFP_X

Unnamed: 0,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,SubFP10,...,SubFP298,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307
0,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1,0,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
2,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
3,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1209,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1210,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1211,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1


In [None]:
df_EsFP_name = pd.read_csv('/content/descriptors_EstateFP.csv')
df_EsFP_name

Unnamed: 0,Name,EStateFP1,EStateFP2,EStateFP3,EStateFP4,EStateFP5,EStateFP6,EStateFP7,EStateFP8,EStateFP9,...,EStateFP70,EStateFP71,EStateFP72,EStateFP73,EStateFP74,EStateFP75,EStateFP76,EStateFP77,EStateFP78,EStateFP79
0,CHEMBL115220,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL199996,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199657,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL199658,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL371695,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,CHEMBL562104,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1210,CHEMBL559845,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1211,CHEMBL603463,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_EsFP_X = df_EsFP_name.drop(columns=['Name'])
df_EsFP_X

Unnamed: 0,EStateFP1,EStateFP2,EStateFP3,EStateFP4,EStateFP5,EStateFP6,EStateFP7,EStateFP8,EStateFP9,EStateFP10,...,EStateFP70,EStateFP71,EStateFP72,EStateFP73,EStateFP74,EStateFP75,EStateFP76,EStateFP77,EStateFP78,EStateFP79
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1210,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1211,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_GraphFP_name = pd.read_csv('/content/descriptors_GraphOFP.csv')
df_GraphFP_name

Unnamed: 0,Name,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,...,GraphFP1015,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024
0,CHEMBL199996,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199658,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL199657,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL371695,0,0,0,0,0,0,0,1,1,...,1,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,CHEMBL562104,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1210,CHEMBL559845,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1211,CHEMBL603463,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_GraphFP_X = df_GraphFP_name.drop(columns=['Name'])
df_GraphFP_X

Unnamed: 0,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,GraphFP10,...,GraphFP1015,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,1,0,...,1,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1210,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_KRFP_name = pd.read_csv('/content/descriptors_KRFP.csv')
df_KRFP_name

Unnamed: 0,Name,KRFP1,KRFP2,KRFP3,KRFP4,KRFP5,KRFP6,KRFP7,KRFP8,KRFP9,...,KRFP4851,KRFP4852,KRFP4853,KRFP4854,KRFP4855,KRFP4856,KRFP4857,KRFP4858,KRFP4859,KRFP4860
0,CHEMBL199996,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199658,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL199657,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL371695,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,CHEMBL559845,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1210,CHEMBL562104,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1211,CHEMBL563150,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_KRFP_X = df_KRFP_name.drop(columns=['Name'])
df_KRFP_X

Unnamed: 0,KRFP1,KRFP2,KRFP3,KRFP4,KRFP5,KRFP6,KRFP7,KRFP8,KRFP9,KRFP10,...,KRFP4851,KRFP4852,KRFP4853,KRFP4854,KRFP4855,KRFP4856,KRFP4857,KRFP4858,KRFP4859,KRFP4860
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1209,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1210,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
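
The read-then-drop pattern used above is repeated for every fingerprint type. As a rough sketch, it could be wrapped in a small helper that returns the compound names and the feature matrix separately and also checks that every remaining column is numeric. The function name `load_fingerprints` and the toy CSV are hypothetical and only serve as an illustration.

```python
import io
import pandas as pd

def load_fingerprints(csv_source, id_col="Name"):
    """Read a descriptor CSV and return (names, X): the ID column and the feature matrix."""
    table = pd.read_csv(csv_source)
    names = table[id_col]
    X = table.drop(columns=[id_col])
    # Fingerprint columns should all be numeric; fail early if something else slipped in
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric feature columns: {non_numeric}")
    return names, X

# Toy CSV held in memory (hypothetical values), standing in for a descriptor file
toy_csv = io.StringIO("Name,KRFP1,KRFP2\nCHEMBL1,0,1\nCHEMBL2,1,0\n")
names, X = load_fingerprints(toy_csv)
print(names.tolist())
print(X.shape)
```

With the real files, `csv_source` would simply be a path such as `'/content/descriptors_KRFP.csv'`.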


## **Y variable**

### **Select the precomputed pIC50 column**

In [None]:
df_pubchem_Y = df['pIC50']
df_pubchem_Y

0       5.000000
1       4.698970
2       4.000000
3       4.000000
4       4.638272
          ...   
1208    5.885723
1209    7.167491
1210    5.375202
1211    5.000000
1212    5.244125
Name: pIC50, Length: 1213, dtype: float64
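
The cell above only selects a `pIC50` column that already exists in `df`. For reference, here is a minimal sketch of the conversion itself, assuming the IC50 values are given in nanomolar: pIC50 = -log10(IC50 in molar). The function name, the column name `standard_value_nM`, and the example values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 values in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = np.asarray(ic50_nm, dtype=float) * 1e-9  # nM -> M
    return -np.log10(ic50_molar)

# Hypothetical IC50 values in nM, chosen only to make the arithmetic easy to follow
example = pd.DataFrame({"standard_value_nM": [10000.0, 100000.0, 1.0]})
example["pIC50"] = ic50_nm_to_pic50(example["standard_value_nM"])
print(example)
```

Under this definition, an IC50 of 10,000 nM corresponds to a pIC50 of 5 and 100,000 nM to a pIC50 of 4, so a larger pIC50 means a more potent compound.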

## **Combining X and Y variables**

In [None]:
dataset_pubchem = pd.concat([df_pubchem_X,df_pubchem_Y], axis=1)
dataset_pubchem

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.698970
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.885723
1209,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.167491
1210,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.375202
1211,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
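
`pd.concat(..., axis=1)` aligns rows on their index labels, and a row present in only one of the inputs is filled with `NaN`. The pIC50 series above reports a length of 1213, while the fingerprint tables shown end at index 1211, so it may be worth checking that X and Y line up before combining them. The helper below is a sketch of such a check; `concat_checked` and the toy data are hypothetical.

```python
import pandas as pd

def concat_checked(X, y):
    """Column-wise concat that refuses to proceed if X and y rows do not line up."""
    if len(X) != len(y):
        raise ValueError(f"row count mismatch: X has {len(X)}, y has {len(y)}")
    if not X.index.equals(y.index):
        raise ValueError("X and y have different index labels")
    return pd.concat([X, y], axis=1)

# Toy data (hypothetical): y is one row longer than X
X_toy = pd.DataFrame({"FP1": [1, 0, 1], "FP2": [0, 0, 1]})
y_toy = pd.Series([5.0, 4.7, 4.0, 6.1], name="pIC50")

try:
    concat_checked(X_toy, y_toy)
except ValueError as err:
    print("refused:", err)

# Once the lengths agree, the concat goes through and contains no NaN
combined = concat_checked(X_toy, y_toy.iloc[:3])
print(combined.shape)
```

Raising an error here is a design choice: a silent `NaN` row in the training table tends to surface much later, during model building.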


In [None]:
dataset_MACCS = pd.concat([df_MACCS_X,df_pubchem_Y], axis=1)
dataset_MACCS

Unnamed: 0,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,MACCSFP10,...,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166,pIC50
0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,5.000000
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,1,1,1,0,4.698970
2,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,4.000000
3,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,4.000000
4,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,0,1,1,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,0,1,0,5.885723
1209,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0,7.167491
1210,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0,5.375202
1211,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,5.000000


In [None]:
dataset_AP2DFP = pd.concat([df_AP2DFP_X,df_pubchem_Y], axis=1)
dataset_AP2DFP

Unnamed: 0,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,AD2D10,...,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780,pIC50
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_AP2DFPC = pd.concat([df_AP2DFPC_X,df_pubchem_Y], axis=1)
dataset_AP2DFPC

Unnamed: 0,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,APC2D1_C_B,...,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X,pIC50
0,12.0,7.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000
1,19.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.698970
2,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
3,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
4,10.0,5.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,15.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.885723
1209,20.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.167491
1210,15.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.375202
1211,21.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000


In [None]:
dataset_KRFP = pd.concat([df_KRFP_X,df_pubchem_Y], axis=1)
dataset_KRFP

Unnamed: 0,KRFP1,KRFP2,KRFP3,KRFP4,KRFP5,KRFP6,KRFP7,KRFP8,KRFP9,KRFP10,...,KRFP4852,KRFP4853,KRFP4854,KRFP4855,KRFP4856,KRFP4857,KRFP4858,KRFP4859,KRFP4860,pIC50
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_KRFPC = pd.concat([df_KRFPC_X,df_pubchem_Y], axis=1)
dataset_KRFPC

Unnamed: 0,KRFPC1,KRFPC2,KRFPC3,KRFPC4,KRFPC5,KRFPC6,KRFPC7,KRFPC8,KRFPC9,KRFPC10,...,KRFPC4852,KRFPC4853,KRFPC4854,KRFPC4855,KRFPC4856,KRFPC4857,KRFPC4858,KRFPC4859,KRFPC4860,pIC50
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.698970
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.885723
1209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.167491
1210,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.375202
1211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000


In [None]:
dataset_EsFP = pd.concat([df_EsFP_X,df_pubchem_Y], axis=1)
dataset_EsFP

Unnamed: 0,EStateFP1,EStateFP2,EStateFP3,EStateFP4,EStateFP5,EStateFP6,EStateFP7,EStateFP8,EStateFP9,EStateFP10,...,EStateFP71,EStateFP72,EStateFP73,EStateFP74,EStateFP75,EStateFP76,EStateFP77,EStateFP78,EStateFP79,pIC50
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_EFP = pd.concat([df_EFP_X,df_pubchem_Y], axis=1)
dataset_EFP

Unnamed: 0,ExtFP1,ExtFP2,ExtFP3,ExtFP4,ExtFP5,ExtFP6,ExtFP7,ExtFP8,ExtFP9,ExtFP10,...,ExtFP1016,ExtFP1017,ExtFP1018,ExtFP1019,ExtFP1020,ExtFP1021,ExtFP1022,ExtFP1023,ExtFP1024,pIC50
0,0,1,1,1,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,1,1,1,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,1,1,1,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,1,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,1,0,1,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,7.167491
1210,1,0,1,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,5.375202
1211,1,0,1,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_GOFP = pd.concat([df_GraphFP_X,df_pubchem_Y], axis=1)
dataset_GOFP

Unnamed: 0,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,GraphFP10,...,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024,pIC50
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,0,0,0,0,0,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,7.167491
1210,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,5.375202
1211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_SFP = pd.concat([df_SFP_X,df_pubchem_Y], axis=1)
dataset_SFP

Unnamed: 0,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,SubFP10,...,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307,pIC50
0,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.000000
1,0,1,1,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.698970
2,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.000000
3,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.000000
4,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.885723
1209,0,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,7.167491
1210,0,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.375202
1211,0,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.000000


In [None]:
dataset_SFPC = pd.concat([df_SFPC_X,df_pubchem_Y], axis=1)
dataset_SFPC

Unnamed: 0,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,SubFPC10,...,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307,pIC50
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,16.0,5.000000
1,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,10.0,10.0,5.0,0.0,0.0,0.0,0.0,19.0,4.698970
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0,4.000000
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0,4.000000
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,13.0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.0,5.0,3.0,0.0,0.0,0.0,0.0,21.0,5.885723
1209,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,7.0,7.0,3.0,0.0,0.0,0.0,0.0,16.0,7.167491
1210,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,4.0,0.0,0.0,0.0,0.0,17.0,5.375202
1211,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,23.0,5.000000


In [None]:
dataset_FP = pd.concat([df_FP_X,df_pubchem_Y], axis=1)
dataset_FP

Unnamed: 0,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,FP10,...,FP1016,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024,pIC50
0,1,0,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,5.000000
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,4.698970
2,1,0,1,0,0,0,0,0,0,0,...,1,1,0,1,0,0,1,0,1,4.000000
3,1,0,1,0,0,0,0,0,0,0,...,1,1,0,1,0,0,1,0,1,4.000000
4,1,0,1,0,0,0,0,0,0,0,...,1,1,0,1,0,0,1,0,1,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,1,1,1,0,0,0,1,0,0,0,...,1,0,1,0,0,0,1,0,0,5.885723
1209,1,1,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,7.167491
1210,1,1,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,5.375202
1211,1,1,0,1,1,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,5.000000


In [None]:
dataset_pubchem.to_csv('PLK1_bioactivity_data_pIC50_pubchem_fp.csv', index=False)

In [None]:
dataset_MACCS.to_csv('PLK1_bioactivity_data_pIC50_MACCS_fp.csv', index=False)

In [None]:
dataset_KRFP.to_csv('PLK1_bioactivity_data_pIC50_KRFP_fp.csv', index=False)

In [None]:
dataset_KRFPC.to_csv('PLK1_bioactivity_data_pIC50_KRFPC_fp.csv', index=False)

In [None]:
dataset_FP.to_csv('PLK1_bioactivity_data_pIC50_FP.csv', index=False)

In [None]:
dataset_EsFP.to_csv('PLK1_bioactivity_data_pIC50_ESFP_fp.csv', index=False)

In [None]:
dataset_EFP.to_csv('PLK1_bioactivity_data_pIC50_EXFP_fp.csv', index=False)

In [None]:
dataset_GOFP.to_csv('PLK1_bioactivity_data_pIC50_GOFP_fp.csv', index=False)

In [None]:
dataset_SFP.to_csv('PLK1_bioactivity_data_pIC50_SSFP_fp.csv', index=False)

In [None]:
dataset_SFPC.to_csv('PLK1_bioactivity_data_pIC50_SFPC_fp.csv', index=False)

In [None]:
dataset_AP2DFP.to_csv('PLK1_bioactivity_data_pIC50_AP2DFP_fp.csv', index=False)

In [None]:
dataset_AP2DFPC.to_csv('PLK1_bioactivity_data_pIC50_AP2DFPC_fp.csv', index=False)
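
The twelve export cells above all follow the same pattern, so they could also be written as one loop over a dictionary of DataFrames. This is only a sketch: `save_datasets`, its parameters, and the toy frames are hypothetical, and the file names in this notebook do not all follow a single scheme, so the keys would need adjusting to reproduce them exactly.

```python
import tempfile
from pathlib import Path
import pandas as pd

def save_datasets(datasets, out_dir=".", prefix="PLK1_bioactivity_data_pIC50", suffix="fp"):
    """Write each DataFrame to <prefix>_<key>_<suffix>.csv and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for key, frame in datasets.items():
        path = out_dir / f"{prefix}_{key}_{suffix}.csv"
        frame.to_csv(path, index=False)  # index=False avoids a stray index column on reload
        written.append(path)
    return written

# Toy frames (hypothetical values) standing in for dataset_pubchem, dataset_MACCS, ...
toy = {
    "pubchem": pd.DataFrame({"PubchemFP0": [1, 0], "pIC50": [5.0, 4.0]}),
    "MACCS": pd.DataFrame({"MACCSFP1": [0, 0], "pIC50": [5.0, 4.0]}),
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_datasets(toy, out_dir=tmp)
    names = [p.name for p in paths]
    reloaded = pd.read_csv(paths[0])

print(names)
```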

In [None]:
dataset_pubchem_named = pd.concat([df_pubchem_name,df_pubchem_Y], axis=1)
dataset_pubchem_named

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,CHEMBL199996,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199658,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL371695,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL199657,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL559845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,CHEMBL562104,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,CHEMBL603463,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
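
In the named tables of this notebook, the order of the compounds in the `Name` column is not always the same from one fingerprint table to the next, whereas `pd.concat(..., axis=1)` pairs rows purely by index position. A safer alternative may be to join the labels on the compound identifier. The sketch below assumes the bioactivity table carries an ID column, here hypothetically called `molecule_chembl_id`; the IDs and values are made up.

```python
import pandas as pd

# Hypothetical fingerprint table whose rows are in a different order than the labels
fp_named = pd.DataFrame({
    "Name": ["CHEMBL2", "CHEMBL1", "CHEMBL3"],
    "FP1": [0, 1, 1],
})

# Hypothetical label table; "molecule_chembl_id" is an assumed column name
labels = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL1", "CHEMBL2", "CHEMBL3"],
    "pIC50": [5.0, 4.7, 4.0],
})

# Join on the identifier instead of relying on row position
merged = fp_named.merge(
    labels,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",  # raises if an ID appears more than once on either side
).drop(columns=["molecule_chembl_id"])

print(merged)
```

Each compound keeps its own pIC50 regardless of row order. If the real data contains repeated compound IDs, `validate="one_to_one"` will raise, and the duplicates would have to be resolved first.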


In [None]:
dataset_MACCS_named = pd.concat([df_MACCS_name,df_pubchem_Y], axis=1)
dataset_MACCS_named

Unnamed: 0,Name,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,...,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166,pIC50
0,CHEMBL199996,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,5.000000
1,CHEMBL115220,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,1,1,1,0,4.698970
2,CHEMBL199658,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,4.000000
3,CHEMBL199657,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,4.000000
4,CHEMBL382070,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,0,1,1,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,0,1,0,5.885723
1209,CHEMBL559845,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0,7.167491
1210,CHEMBL562104,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0,5.375202
1211,CHEMBL603463,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0,5.000000


In [None]:
dataset_FP_named = pd.concat([df_FP_name,df_pubchem_Y], axis=1)
dataset_FP_named

Unnamed: 0,Name,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,...,FP1016,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024,pIC50
0,CHEMBL199996,1,0,1,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,5.000000
1,CHEMBL115220,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,4.698970
2,CHEMBL199658,1,0,1,0,0,0,0,0,0,...,1,1,0,1,0,0,1,0,1,4.000000
3,CHEMBL199657,1,0,1,0,0,0,0,0,0,...,1,1,0,1,0,0,1,0,1,4.000000
4,CHEMBL371695,1,0,1,0,0,0,0,0,0,...,1,1,0,1,0,0,1,0,1,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL514499,1,1,1,0,0,0,1,0,0,...,1,0,1,0,0,0,1,0,0,5.885723
1209,CHEMBL562104,1,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,7.167491
1210,CHEMBL559845,1,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,5.375202
1211,CHEMBL563150,1,1,0,1,1,0,1,0,0,...,0,0,0,0,0,0,1,0,0,5.000000


In [None]:
dataset_SSFP_named = pd.concat([df_SFP_name,df_pubchem_Y], axis=1)
dataset_SSFP_named

Unnamed: 0,Name,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,...,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307,pIC50
0,CHEMBL199996,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.000000
1,CHEMBL115220,0,1,1,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.698970
2,CHEMBL199658,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.000000
3,CHEMBL199657,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.000000
4,CHEMBL371695,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.885723
1209,CHEMBL559845,0,1,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,7.167491
1210,CHEMBL562104,0,1,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.375202
1211,CHEMBL603463,0,1,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,5.000000


In [None]:
dataset_SSFPC_named = pd.concat([df_SFPC_name,df_pubchem_Y], axis=1)
dataset_SSFPC_named

Unnamed: 0,Name,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,...,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307,pIC50
0,CHEMBL199996,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,16.0,5.000000
1,CHEMBL115220,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,10.0,10.0,5.0,0.0,0.0,0.0,0.0,19.0,4.698970
2,CHEMBL199657,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0,4.000000
3,CHEMBL199658,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,15.0,4.000000
4,CHEMBL382070,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,6.0,4.0,0.0,0.0,0.0,0.0,13.0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.0,5.0,3.0,0.0,0.0,0.0,0.0,21.0,5.885723
1209,CHEMBL559845,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,7.0,7.0,3.0,0.0,0.0,0.0,0.0,16.0,7.167491
1210,CHEMBL562104,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,4.0,0.0,0.0,0.0,0.0,17.0,5.375202
1211,CHEMBL563150,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,8.0,8.0,5.0,0.0,0.0,0.0,0.0,23.0,5.000000


In [None]:
dataset_KRFP_named = pd.concat([df_KRFP_name,df_pubchem_Y], axis=1)
dataset_KRFP_named

Unnamed: 0,Name,KRFP1,KRFP2,KRFP3,KRFP4,KRFP5,KRFP6,KRFP7,KRFP8,KRFP9,...,KRFP4852,KRFP4853,KRFP4854,KRFP4855,KRFP4856,KRFP4857,KRFP4858,KRFP4859,KRFP4860,pIC50
0,CHEMBL199996,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL115220,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199658,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL199657,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL371695,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL559845,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,CHEMBL562104,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,CHEMBL563150,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_KRFPC_named = pd.concat([df_KRFPC_name,df_pubchem_Y], axis=1)
dataset_KRFPC_named

Unnamed: 0,Name,KRFPC1,KRFPC2,KRFPC3,KRFPC4,KRFPC5,KRFPC6,KRFPC7,KRFPC8,KRFPC9,...,KRFPC4852,KRFPC4853,KRFPC4854,KRFPC4855,KRFPC4856,KRFPC4857,KRFPC4858,KRFPC4859,KRFPC4860,pIC50
0,CHEMBL199996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000
1,CHEMBL115220,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.698970
2,CHEMBL199658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
3,CHEMBL199657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
4,CHEMBL371695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.885723
1209,CHEMBL559845,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.167491
1210,CHEMBL562104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.375202
1211,CHEMBL563150,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000


In [None]:
dataset_AP2DFP_named = pd.concat([df_AP2DFP_name,df_pubchem_Y], axis=1)
dataset_AP2DFP_named

Unnamed: 0,Name,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,...,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780,pIC50
0,CHEMBL199996,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199658,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL199657,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL371695,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL559845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL562104,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7.167491
1210,CHEMBL525907,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.375202
1211,CHEMBL563150,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_AP2DFPC_named = pd.concat([df_AP2DFPC_name,df_pubchem_Y], axis=1)
dataset_AP2DFPC_named

Unnamed: 0,Name,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,...,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X,pIC50
0,CHEMBL199996,12.0,7.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000
1,CHEMBL115220,19.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.698970
2,CHEMBL199657,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
3,CHEMBL199658,13.0,5.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.000000
4,CHEMBL382070,10.0,5.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL559845,15.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.885723
1209,CHEMBL525907,20.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.167491
1210,CHEMBL562104,15.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.375202
1211,CHEMBL563150,21.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.000000


In [None]:
dataset_GOFP_named = pd.concat([df_GraphFP_name,df_pubchem_Y], axis=1)
dataset_GOFP_named

Unnamed: 0,Name,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,...,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024,pIC50
0,CHEMBL199996,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL115220,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199658,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL199657,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL371695,0,0,0,0,0,0,0,1,1,...,1,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL562104,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,7.167491
1210,CHEMBL559845,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,5.375202
1211,CHEMBL603463,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_EsFP_named = pd.concat([df_EsFP_name,df_pubchem_Y], axis=1)
dataset_EsFP_named

Unnamed: 0,Name,EStateFP1,EStateFP2,EStateFP3,EStateFP4,EStateFP5,EStateFP6,EStateFP7,EStateFP8,EStateFP9,...,EStateFP71,EStateFP72,EStateFP73,EStateFP74,EStateFP75,EStateFP76,EStateFP77,EStateFP78,EStateFP79,pIC50
0,CHEMBL115220,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL199996,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199657,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL199658,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL371695,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL562104,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,7.167491
1210,CHEMBL559845,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,5.375202
1211,CHEMBL603463,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_EFP_named = pd.concat([df_EFP_name,df_pubchem_Y], axis=1)
dataset_EFP_named

Unnamed: 0,Name,ExtFP1,ExtFP2,ExtFP3,ExtFP4,ExtFP5,ExtFP6,ExtFP7,ExtFP8,ExtFP9,...,ExtFP1016,ExtFP1017,ExtFP1018,ExtFP1019,ExtFP1020,ExtFP1021,ExtFP1022,ExtFP1023,ExtFP1024,pIC50
0,CHEMBL199996,0,1,1,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,5.000000
1,CHEMBL115220,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,4.698970
2,CHEMBL199657,1,1,1,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,4.000000
3,CHEMBL199658,1,1,1,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,4.000000
4,CHEMBL371695,1,1,1,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,4.638272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,CHEMBL525907,0,0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,5.885723
1209,CHEMBL559845,1,0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,7.167491
1210,CHEMBL563150,1,0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,5.375202
1211,CHEMBL562104,1,0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,5.000000


In [None]:
dataset_pubchem_named.to_csv('PLK1_bioactivity_data_pIC50_pubchem_named.csv', index=False)

In [None]:
dataset_MACCS_named.to_csv('PLK1_bioactivity_data_pIC50_MACCS_named.csv', index=False)

In [None]:
dataset_AP2DFP_named.to_csv('PLK1_bioactivity_data_pIC50_AP2DFP_named.csv', index=False)

In [None]:
dataset_AP2DFPC_named.to_csv('PLK1_bioactivity_data_pIC50_AP2DFPC_named.csv', index=False)

In [None]:
dataset_KRFP_named.to_csv('PLK1_bioactivity_data_pIC50_KRFP_named.csv', index=False)

In [None]:
dataset_KRFPC_named.to_csv('PLK1_bioactivity_data_pIC50_KRFPC_named.csv', index=False)

In [None]:
dataset_EsFP_named.to_csv('PLK1_bioactivity_data_pIC50_ESFP_named.csv', index=False)

In [None]:
dataset_EFP_named.to_csv('PLK1_bioactivity_data_pIC50_EXFP_named.csv', index=False)

In [None]:
dataset_FP_named.to_csv('PLK1_bioactivity_data_pIC50_FP_named.csv', index=False)

In [None]:
dataset_SSFP_named.to_csv('PLK1_bioactivity_data_pIC50_SSFP_named.csv', index=False)

In [None]:
dataset_SSFPC_named.to_csv('PLK1_bioactivity_data_pIC50_SSFPC_named.csv', index=False)

In [None]:
dataset_GOFP_named.to_csv('PLK1_bioactivity_data_pIC50_GOFP_named.csv', index=False)

# **Download the CSV files to your local computer for Part 3 (Model Building)**
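
One way to make the download easier is to bundle all the exported CSV files into a single zip archive first. The sketch below uses only the Python standard library; `zip_csv_files`, the archive name, and the toy files are hypothetical. In Google Colab, the resulting archive could then be fetched through the file browser or with `files.download` from `google.colab`.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_csv_files(folder=".", pattern="*.csv", archive_name="datasets.zip"):
    """Bundle every file in `folder` matching `pattern` into one zip archive."""
    folder = Path(folder)
    archive_path = folder / archive_name
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for csv_path in sorted(folder.glob(pattern)):
            # arcname keeps only the file name, not the full folder path
            archive.write(csv_path, arcname=csv_path.name)
    return archive_path

# Demonstration in a temporary folder with two toy CSV files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a_fp.csv").write_text("FP1,pIC50\n1,5.0\n")
    Path(tmp, "b_fp.csv").write_text("FP1,pIC50\n0,4.0\n")
    archive_path = zip_csv_files(tmp)
    with zipfile.ZipFile(archive_path) as archive:
        members = archive.namelist()

print(members)
```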