### **Part-3 Descriptor Calculation**
---
#### We will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset.
#### Finally, we will assemble them into a dataset for model building in Part 4.
---

#### **Load Bioactivity data**
##### We'll load the bioactivity data that we pre-processed in Part 2.

In [1]:
import pandas as pd
import os
df3 = pd.read_csv (os.path.join ("Datasets", "Part-2_bioactivity_two_class_pic50.csv"))
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MolecularWeight,LogP,NumHDonors,NumHAcceptors,PIC50
0,0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,active,312.325,2.8032,0.0,6.0,6.124939
1,1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,active,376.913,4.5546,0.0,5.0,7.000000
2,2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,inactive,426.851,5.3574,0.0,5.0,4.301030
3,3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,active,404.845,4.7069,0.0,5.0,6.522879
4,4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,active,346.334,3.0953,0.0,6.0,6.096910
...,...,...,...,...,...,...,...,...,...
4923,6335,CHEMBL4645659,COc1ccc(CCC(=O)Nc2nc(-c3cc4ccccc4oc3=O)cs2)cc1OC,active,436.489,4.5050,1.0,7.0,6.130768
4924,6336,CHEMBL513063,COc1ccc(-c2csc(NC(=O)CCN3CCCC3)n2)cc1,active,331.441,3.2431,1.0,5.0,6.292430
4925,6337,CHEMBL4640608,COc1cc(C2C3=C(CCCC3=O)NC3=C2C(=O)CCC3)ccc1OCc1...,inactive,447.506,5.1143,1.0,5.0,3.903090
4926,6338,CHEMBL4173961,O=C1CCCC2=C1C(c1ccc(OCc3cccc(F)c3)c(Br)c1)C1=C...,inactive,496.376,5.8682,1.0,4.0,4.000000


In [2]:
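# Keep only the SMILES string and the molecule ID, in that order
selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3 [selection]
# Write a tab-separated .smi file (no header, no index) to use as the descriptor input
df3_selection.to_csv (os.path.join ("Datasets", 'molecule.smi'), sep = '\t', index = False, header = False)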

In [3]:
df3_selection.head (5)

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,CHEMBL133897
1,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,CHEMBL336398
2,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,CHEMBL131588
3,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,CHEMBL130628
4,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,CHEMBL130478
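
##### The `to_csv` call above writes one molecule per line: the SMILES string, a tab, then the ChEMBL ID. A minimal sketch of that layout, using the first two rows from the table and an in-memory buffer instead of a file on disk (the names `toy`, `buffer` and `smi_lines` are illustrative):

```python
import io

import pandas as pd

# First two molecules from the table above
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398"],
    "canonical_smiles": [
        "CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1",
        "O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1",
    ],
})

# Same options as the notebook cell, but written to a string buffer
buffer = io.StringIO()
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    buffer, sep="\t", index=False, header=False
)

smi_lines = buffer.getvalue().splitlines()
for line in smi_lines:
    smiles, name = line.split("\t")
    print(name, "->", smiles)
```

##### Putting the SMILES column first matters here, because the second field is what later identifies each molecule in the descriptor output.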


### **Calculate PaDEL Descriptors Using the PaDELPy Library**

###### We'll perform the following steps:

#### **1. Remove salt:** Strips salt counterions so that only the parent molecule is described
#### **2. Standardize nitro:** Converts nitro groups to one consistent representation, another structure-cleaning step
#### **3. PubChem fingerprinting:** PubChem fingerprints describe the local substructural features of each molecule
---
#### **Intuition for PubChem Fingerprints:**
##### Each molecule is described by the building blocks it contains.

##### If we think of a molecule as being built from Lego bricks, the fingerprint records which bricks are present. The particular combination of bricks, and the way they are connected, gives each drug its unique structure and properties.
##### The fingerprint encodes this as a vector of 881 bits (PubchemFP0 to PubchemFP880), where a bit is 1 if the corresponding substructure is present and 0 otherwise.
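
##### The Lego intuition above can be sketched as a toy bit vector. This is only an illustration: the four "blocks" and the two molecules are hypothetical, whereas the real PubChem fingerprint checks 881 predefined substructure keys on the actual chemical structure.

```python
# Hypothetical building blocks, standing in for PubChem's predefined keys
BLOCKS = ["benzene ring", "chlorine", "carbonyl", "trifluoromethyl"]


def toy_fingerprint(blocks_present):
    """Return one bit per known block: 1 if the molecule has it, else 0."""
    return [1 if block in blocks_present else 0 for block in BLOCKS]


# Two made-up molecules, each described by the blocks it contains
mol_a = {"benzene ring", "chlorine", "carbonyl"}
mol_b = {"benzene ring", "carbonyl", "trifluoromethyl"}

fp_a = toy_fingerprint(mol_a)
fp_b = toy_fingerprint(mol_b)

print(fp_a)  # [1, 1, 1, 0]
print(fp_b)  # [1, 0, 1, 1]
```

##### Each position plays the role of one `PubchemFP` column in the descriptor table: molecules that share a block share a 1 in that position.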

###### More about the PaDEL-Descriptor software: **https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21707**
---

##### Install the padelpy library

In [4]:
!pip install padelpy



In [4]:
! mkdir fingerprints_xml
! wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip -P fingerprints_xml

# The archive is unzipped in the next cell, from inside the fingerprints_xml folder

--2021-10-24 17:43:05--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 13.234.176.102
Connecting to github.com (github.com)|13.234.176.102|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2021-10-24 17:43:05--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘fingerprints_xml/fingerprints_xml.zip’


2021-10-24 17:43:05 (24.6 MB/s) - ‘fingerprints_xml/fingerprints_xml.zip’ saved [10871/10871]



In [14]:
%cd fingerprints_xml
! unzip fingerprints_xml.zip

/home/shraeyas/Drug-Discovery-Using-Python/fingerprints_xml
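
##### The `wget` and `unzip` shell commands above are not available on every system. A minimal sketch of the same download-and-extract step using only the Python standard library is shown below; the helper name `fetch_and_extract` is an assumption made for illustration, and it skips the download when the zip file is already present.

```python
import os
import urllib.request
import zipfile

# Same URL as the wget command above
ZIP_URL = "https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip"


def fetch_and_extract(url, dest_dir, zip_name="fingerprints_xml.zip"):
    """Download the archive into dest_dir (if missing) and extract it there.

    Returns the sorted list of .xml file names found in dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, zip_name)

    # Only hit the network when the archive is not already on disk
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)

    return sorted(name for name in os.listdir(dest_dir) if name.endswith(".xml"))
```

##### Calling `fetch_and_extract (ZIP_URL, "fingerprints_xml")` from the project folder would then take the place of the two shell cells, without changing the working directory.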


##### List and sort fingerprint XML files

In [15]:
import glob
xml_files = glob.glob ("*.xml")
xml_files.sort ()
%cd ..
xml_files

/home/shraeyas/Drug-Discovery-Using-Python


['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

In [16]:
FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']

##### Create a dictionary that maps each fingerprint name to its XML file

In [17]:
fp = dict (zip (FP_list, xml_files))
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'PubChem': 'PubchemFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml',
 'Substructure': 'SubstructureFingerprinter.xml'}

In [18]:
fp ['PubChem']

'PubchemFingerprinter.xml'
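
##### `dict (zip (...))` pairs names and files purely by position, so it silently depends on both lists having the same length and the same order. A minimal sketch of a guard is shown below; the helper `build_fingerprint_map` and its "Count" heuristic are illustrative assumptions, not part of padelpy.

```python
def build_fingerprint_map(names, files):
    """Pair fingerprint names with XML files, failing loudly on a mismatch."""
    if len(names) != len(files):
        raise ValueError(f"{len(names)} names but {len(files)} XML files")

    for name, xml in zip(names, files):
        # Heuristic: count-based fingerprints should map to a '...Count.xml' file
        if ("Count" in name) != ("Count" in xml):
            raise ValueError(f"{name!r} does not look like a match for {xml!r}")

    return dict(zip(names, files))


# Three of the pairs listed above
names = ["KlekotaRothCount", "MACCS", "PubChem"]
files = [
    "KlekotaRothFingerprintCount.xml",
    "MACCSFingerprinter.xml",
    "PubchemFingerprinter.xml",
]

fp_checked = build_fingerprint_map(names, files)
print(fp_checked["PubChem"])
```

##### With this check, a missing XML file or a reordered list raises an error instead of quietly assigning the wrong file to a fingerprint name.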

In [19]:
from padelpy import padeldescriptor

fingerprint = 'PubChem'
fingerprint_output_file = os.path.join ("Datasets", 'Part-3_PubChem_Descriptors_Output.csv')
# Look up the XML definition file for the chosen fingerprint in the dictionary
fingerprint_descriptortypes = os.path.join ('fingerprints_xml', fp [fingerprint])
mol_dir = os.path.join ("Datasets", "molecule.smi")

padeldescriptor (mol_dir = mol_dir,                              # input SMILES file
                 d_file = fingerprint_output_file,               # output CSV of descriptors
                 descriptortypes = fingerprint_descriptortypes,  # which fingerprint to compute
                 detectaromaticity = False,
                 standardizenitro = True,                        # step 2 above
                 standardizetautomers = False,
                 threads = 4,
                 removesalt = True,                              # step 1 above
                 log = False,
                 fingerprints = True)

#### **Create X and Y Data Matrices**
##### **X-Matrix contains Molecular Descriptors (PubChem Fingerprint)**

In [21]:
df3_X = pd.read_csv (os.path.join ("Datasets", 'Part-3_PubChem_Descriptors_Output.csv'))
df3_X = df3_X.drop (columns = ['Name'])
df3_X.head (3)

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


##### **Y-Matrix contains the PIC50 values (the target variable)**

In [22]:
df3_Y = df3 ['PIC50']
df3_Y.head (3)

0    6.124939
1    7.000000
2    4.301030
Name: PIC50, dtype: float64

##### **Combine the X and Y Matrices**

In [23]:
dataset3 = pd.concat ([df3_X, df3_Y], axis = 1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,PIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.124939
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.522879
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.096910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4923,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.130768
4924,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.292430
4925,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.903090
4926,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
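
##### `pd.concat (..., axis = 1)` joins rows by index, so it is only correct if the descriptor file lists the molecules in the same order as `df3`. A more defensive sketch is to join on the molecule ID instead of dropping it. The example below uses small made-up frames and assumes that the `Name` column of the descriptor output holds the ChEMBL ID written to `molecule.smi`; the fingerprint bits are toy values.

```python
import pandas as pd

# Toy descriptor output, deliberately in a different row order
descriptors = pd.DataFrame({
    "Name": ["CHEMBL336398", "CHEMBL133897"],
    "PubchemFP0": [1, 1],
    "PubchemFP2": [1, 0],
})

# Toy bioactivity table with the target variable
activity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398"],
    "PIC50": [6.124939, 7.000000],
})

# Join on the ID; validate raises an error if any ID appears more than once
merged = descriptors.merge(
    activity,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the ID columns once the rows are matched
dataset = merged.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

##### Comparing `len (dataset)` with `len (df3)` afterwards shows whether any molecule was lost during descriptor calculation. If the real data contained repeated IDs, `validate = "one_to_one"` would raise an error, and the duplicates would need to be handled first.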


#### **Save the dataset as a CSV File**

In [24]:
dataset3.to_csv (os.path.join ("Datasets", "Part-3_Bioactivity_Three_PubChem_PIC50.csv"), index = False)