
# Download PaDEL-Descriptor
In Part 3 of this project, we use PaDEL-Descriptor to calculate molecular descriptors, which are numerical representations of the compounds in our dataset. These descriptors provide quantitative information about the chemical structures of the compounds and serve as the input features for model building in Part 4.

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2023-04-18 20:33:27--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-04-18 20:33:27--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip.1’


2023-04-18 20:33:27 (143 MB/s) - ‘padel.zip.1’ saved [25768637/25768637]

--2023-04-18 20:33:27--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (

In [None]:
! unzip padel.zip

Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/._PaDEL-Descriptor  
replace PaDEL-Descriptor/MACCSFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
replace __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
replace PaDEL-Descriptor/AtomPairs2DFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
replace __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
replace PaDEL-Descriptor/EStateFingerprinter.xml? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
replace __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml? [y]es, [n
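
The `.1` suffixes in the wget output and the "replace ...?" prompts from unzip suggest that these cells were run more than once. As a hedged alternative, the download and extraction can be made repeatable from Python. The helper names `fetch_if_missing` and `extract_zip` are hypothetical and only illustrate the idea; the standard-library calls are real.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_if_missing(url: str, dest: str) -> Path:
    """Download url to dest, but skip the download if dest already exists."""
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


def extract_zip(archive: str, target_dir: str = ".") -> None:
    """Extract every member of archive into target_dir.

    Existing files are overwritten without an interactive prompt.
    """
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)


# Usage in the notebook (needs network access, so it is left commented out):
# base = "https://github.com/dataprofessor/bioinformatics/raw/master/"
# for name in ("padel.zip", "padel.sh"):
#     fetch_if_missing(base + name, name)
# extract_zip("padel.zip")
```

Running the commented lines twice would leave a single copy of each file and would not prompt before overwriting the extracted files.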

# Import Libraries

In [None]:
import pandas as pd

# Loading CSV

To proceed with the current Bioinformatics Project, we need to load the pre-processed ChEMBL bioactivity data. We will use the file "plk1_04_bioactivity_data_3class_pIC50.csv", which contains the pIC50 values for building a regression model. The cell below reads a copy of this file whose name ends in " (1)".

In [None]:
df3 = pd.read_csv('plk1_04_bioactivity_data_3class_pIC50 (1).csv')


In [None]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,inactive,291.354,3.62150,2.0,2.0,5.000000
1,1,CHEMBL200586,COC(=O)c1cc2c(C)n[nH]c2s1,inactive,196.231,1.71942,1.0,4.0,4.000000
2,2,CHEMBL199996,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12,inactive,315.358,2.67572,4.0,4.0,4.698970
3,3,CHEMBL199658,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12,inactive,334.788,2.93742,3.0,4.0,4.000000
4,4,CHEMBL199657,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12,inactive,334.788,2.93742,3.0,4.0,4.000000
...,...,...,...,...,...,...,...,...,...
1306,1306,CHEMBL5082952,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,active,928.971,0.26190,8.0,14.0,6.698970
1307,1307,CHEMBL5075075,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,intermediate,973.968,-1.62370,9.0,16.0,5.782516
1308,1308,CHEMBL5093970,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,intermediate,991.986,-0.50210,9.0,15.0,5.910095
1309,1309,CHEMBL5078060,COCCOCCOCCOCCOc1cc(OCCOCCOCCOCCOC)cc(C(=O)N[C@...,active,1197.233,-1.65770,8.0,22.0,6.080922


This selects the two columns 'canonical_smiles' and 'molecule_chembl_id' from the dataframe 'df3' and assigns them to a new dataframe called 'df3_selection'. Then, it exports this new dataframe to a tab-separated file called 'molecule.smi', without including an index column or header row.

In [None]:

selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
     

Running 'cat molecule.smi | head -5' shows the first 5 lines of the file. Each line holds a SMILES string followed by the molecule name (its ChEMBL ID), separated by a tab.

SMILES (Simplified Molecular Input Line Entry System) encodes a chemical structure as a single line of text.

In [None]:

! cat molecule.smi | head -5
     

O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1	CHEMBL115220
COC(=O)c1cc2c(C)n[nH]c2s1	CHEMBL200586
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12	CHEMBL199996
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12	CHEMBL199658
Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12	CHEMBL199657


In [None]:
! cat molecule.smi | wc -l

1311
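
The same check can be done from Python, which also makes it easy to look for missing values or repeated IDs before running the descriptor calculation. This is a minimal sketch on a two-line stand-in for molecule.smi (the two rows are copied from the output above); in the notebook the file path would replace the `io.StringIO` object. The variable names are illustrative.

```python
import io

import pandas as pd

# Stand-in for molecule.smi: tab-separated, no header, SMILES first and ID second.
smi_text = (
    "O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1\tCHEMBL115220\n"
    "COC(=O)c1cc2c(C)n[nH]c2s1\tCHEMBL200586\n"
)

smi = pd.read_csv(
    io.StringIO(smi_text),
    sep="\t",
    header=None,
    names=["canonical_smiles", "molecule_chembl_id"],
)

n_rows = len(smi)                                                # number of molecules
n_missing = int(smi.isna().sum().sum())                          # empty cells
n_duplicate_ids = int(smi["molecule_chembl_id"].duplicated().sum())  # repeated IDs

print(n_rows, n_missing, n_duplicate_ids)
```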


# Calculate fingerprint descriptors
PaDEL-Descriptor is commonly used in drug discovery to calculate molecular descriptors and fingerprints, which are numerical representations of a molecule's structure and physicochemical properties. The padel.sh shell script contains the command that runs PaDEL-Descriptor on our dataset. The command does the following:

*   Runs Java with 1 GB of heap memory (-Xms1G -Xmx1G).
*   Sets the property java.awt.headless=true, which tells Java not to use a display. This is needed in an environment without a graphical display, such as Google Colab, a server, or a plain terminal.
*   Uses the -jar option to run the PaDEL-Descriptor JAR file.
*   Uses the -removesalt option to remove salts and small organic acids from the chemical structures.
*   Uses the -standardizenitro option to standardize nitro groups.
*   Uses the -fingerprints option to tell the program to compute molecular fingerprints.
*   Uses the -descriptortypes option with PubchemFingerprinter.xml to select the PubChem fingerprint.
*   Uses -dir ./ to read the input structures from the current directory (here, molecule.smi) and -file to write the results to descriptors_output.csv.
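
To make the role of each argument explicit, the command in padel.sh can also be assembled as a Python argument list. This is only a sketch: `build_padel_command` is a hypothetical helper, and the flags and paths are the ones that appear in padel.sh. The command is printed, not executed.

```python
import shlex


def build_padel_command(jar, descriptor_xml, input_dir, output_csv, heap="1G"):
    """Return the PaDEL-Descriptor command as a list of arguments."""
    return [
        "java",
        f"-Xms{heap}",                 # initial Java heap size
        f"-Xmx{heap}",                 # maximum Java heap size
        "-Djava.awt.headless=true",    # run without a display
        "-jar", jar,
        "-removesalt",
        "-standardizenitro",
        "-fingerprints",
        "-descriptortypes", descriptor_xml,
        "-dir", input_dir,
        "-file", output_csv,
    ]


cmd = build_padel_command(
    jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    descriptor_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
)
print(shlex.join(cmd))
```

If desired, the list could be passed to `subprocess.run(cmd, check=True)` instead of calling the shell script, and a different fingerprint XML file from the PaDEL-Descriptor folder could be supplied as `descriptor_xml`.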



In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


In [None]:
! bash padel.sh

Processing CHEMBL200586 in molecule.smi (2/1311). 
Processing CHEMBL115220 in molecule.smi (1/1311). 
Processing CHEMBL199996 in molecule.smi (3/1311). Average speed: 6.89 s/mol.
Processing CHEMBL199658 in molecule.smi (4/1311). Average speed: 3.89 s/mol.
Processing CHEMBL199657 in molecule.smi (5/1311). Average speed: 2.61 s/mol.
Processing CHEMBL371695 in molecule.smi (6/1311). Average speed: 2.14 s/mol.
Processing CHEMBL382070 in molecule.smi (7/1311). Average speed: 1.73 s/mol.
Processing CHEMBL199759 in molecule.smi (8/1311). Average speed: 1.54 s/mol.
Processing CHEMBL370199 in molecule.smi (9/1311). Average speed: 1.33 s/mol.
Processing CHEMBL199737 in molecule.smi (10/1311). Average speed: 1.20 s/mol.
Processing CHEMBL371239 in molecule.smi (11/1311). Average speed: 1.11 s/mol.
Processing CHEMBL199383 in molecule.smi (12/1311). Average speed: 1.01 s/mol.
Processing CHEMBL197923 in molecule.smi (13/1311). Average speed: 1.08 s/mol.
Processing CHEMBL199755 in molecule.smi (14/131

In [None]:
! ls -l

total 57692
-rw-r--r-- 1 root root  2341367 Apr 18 20:41  BChE_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
-rw-r--r-- 1 root root  2341918 Apr 18 21:05  descriptors_output.csv
drwxr-xr-x 3 root root     4096 Apr 18 20:34  __MACOSX
-rw-r--r-- 1 root root    97369 Apr 18 20:59  molecule.smi
drwxrwxr-x 4 root root     4096 Apr 18 20:34  PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Apr 18 19:31  padel.sh
-rw-r--r-- 1 root root      231 Apr 18 20:33  padel.sh.1
-rw-r--r-- 1 root root 25768637 Apr 18 19:31  padel.zip
-rw-r--r-- 1 root root 25768637 Apr 18 20:33  padel.zip.1
-rw-r--r-- 1 root root   190743 Apr 18 20:59 'plk1_04_bioactivity_data_3class_pIC50 (1).csv'
-rw-r--r-- 1 root root   190743 Apr 18 20:33  plk1_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 root root  2341367 Apr 18 20:41  plk1_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
drwxr-xr-x 1 root root     4096 Apr 14 13:35  sample_data


The code below reads the descriptor output from PaDEL-Descriptor into a pandas DataFrame named df3_X. This DataFrame will be used to construct the feature matrix for machine learning model building.

The output file contains one row per compound and one column per PubChem fingerprint bit (PubchemFP0 to PubchemFP880, 881 bits in total), each set to 0 or 1. These fingerprints encode the molecular structure of each compound and will be used to train machine learning models to predict bioactivity.

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')


In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL200586,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL115220,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL199996,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL199658,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL199657,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1306,CHEMBL5082952,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1307,CHEMBL5075075,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1308,CHEMBL5093970,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1309,CHEMBL5078060,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This removes the 'Name' column from the dataframe df3_X and displays the updated dataframe.

The 'Name' column holds the ChEMBL ID of each compound. It is an identifier, not a molecular descriptor, so it should not be used as a feature in a predictive model. Note that the rows of descriptors_output.csv are not guaranteed to be in the same order as molecule.smi: in the output above, CHEMBL200586 appears before CHEMBL115220, the reverse of their order in df3. Once 'Name' is dropped, the rows can no longer be matched back to their compounds, so check the row order before this step.
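
Because the descriptor file lists CHEMBL200586 before CHEMBL115220 while df3 has them the other way round, it may be worth putting the descriptor rows back into the order of df3 before the 'Name' column is dropped. The following is a minimal sketch on toy data, assuming every ID occurs exactly once in the descriptor file; the small frames `df` and `desc` are stand-ins for df3 and df3_X, not the real data.

```python
import pandas as pd

# Toy stand-in for df3: compounds in their original order.
df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL115220", "CHEMBL200586", "CHEMBL199996"],
    "pIC50": [5.0, 4.0, 4.69897],
})

# Toy stand-in for the descriptor output: the first two rows are swapped.
desc = pd.DataFrame({
    "Name": ["CHEMBL200586", "CHEMBL115220", "CHEMBL199996"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP2": [0, 1, 0],
})

# Look up each compound by ID in the order used by df, then discard the ID.
# reindex requires the IDs in desc["Name"] to be unique (assumption).
aligned = (
    desc.set_index("Name")
        .reindex(df["molecule_chembl_id"])
        .reset_index(drop=True)
)

print(aligned)
```

After this step, row i of `aligned` describes the same compound as row i of `df`, so the fingerprints and pIC50 values can be combined by position.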

In [None]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1306,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1307,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1308,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1309,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0



Here df3_Y = df3['pIC50'] assigns the 'pIC50' column from the original df3 to the variable df3_Y. This column contains the bioactivity data that will be used as the dependent variable (Y) in subsequent model building.






In [None]:
df3_Y = df3['pIC50']
df3_Y

0       5.000000
1       4.000000
2       4.698970
3       4.000000
4       4.000000
          ...   
1306    6.698970
1307    5.782516
1308    5.910095
1309    6.080922
1310    5.692932
Name: pIC50, Length: 1311, dtype: float64


df3_X and df3_Y hold the fingerprint descriptors and the corresponding pIC50 values, respectively. The code dataset3 = pd.concat([df3_X,df3_Y], axis=1) joins them side by side along the columns axis (axis=1) to create a new dataframe, dataset3. Because pd.concat aligns rows by index label, each row of df3_X must describe the same compound as the row of df3 with the same index. dataset3 contains the descriptors and the pIC50 values together and will be used for model building in Part 4 of the project.

In [None]:

dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.698970
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1306,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.698970
1307,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.782516
1308,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.910095
1309,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.080922
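
As a small illustration of how `pd.concat(..., axis=1)` pairs rows, the sketch below uses toy data (the values are made up for the example). Rows are matched by index label, so a shifted index produces extra rows with missing values instead of an error.

```python
import pandas as pd

# Toy feature matrix and target with identical index labels 0..2.
X = pd.DataFrame({"PubchemFP0": [1, 1, 0]}, index=[0, 1, 2])
y = pd.Series([5.0, 4.0, 4.7], name="pIC50", index=[0, 1, 2])

# Matching labels: one combined row per compound.
same_index = pd.concat([X, y], axis=1)

# Shifted labels: concat takes the union of both indexes and fills gaps with NaN.
y_shifted = y.copy()
y_shifted.index = [1, 2, 3]
mismatched = pd.concat([X, y_shifted], axis=1)

print(same_index.shape, mismatched.shape)
print(int(mismatched.isna().sum().sum()))
```

A quick check such as comparing the number of rows of the combined frame with that of the feature matrix, or counting missing values, can catch this kind of mismatch before the file is saved.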


# Saving CSV for model building (Part 4)

In [None]:
dataset3.to_csv('plk1_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)