# Computational Drug Discovery Project - Ligand Based Drug Design [Part3] Descriptor Calculation

In this part we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset.

A molecular descriptor transforms the chemical information encoded in a symbolic representation of a molecule (for example, a SMILES string) into a number.

Descriptors are calculated for chemical compounds and used to build quantitative structure-activity relationship (QSAR) models that predict the biological activities of novel compounds.

-------------------------------------------------------------------------------------------------------------------

MolVS is a molecule validation and standardization tool, written in Python using the RDKit chemistry framework.

In [None]:
!pip install molvs

In [1]:
import numpy as np
import pandas as pd
from rdkit.Chem import AllChem
from rdkit import Chem, DataStructs

## Calculate fingerprint descriptors

### Calculate Morgan fingerprint descriptors

Morgan fingerprints are also known as extended-connectivity fingerprints (ECFP).
This family of fingerprints is based on the Morgan algorithm. Each bit corresponds to a circular environment around an atom in the molecule, and the radius sets how many bonds away from that atom the neighbouring atoms and bonds are considered. ECFP names use the diameter rather than the radius, so the radius of 3 used in the class below corresponds to ECFP6.
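
To make the role of the radius concrete, here is a toy, pure-Python sketch of a Morgan-style circular fingerprint. It is not RDKit's implementation: the names `stable_hash` and `toy_circular_fingerprint`, the atom/bond list input, and the SHA-1 based hashing are all assumptions made for illustration, and details of the real algorithm such as bond orders and duplicate-environment removal are left out.

```python
import hashlib


def stable_hash(value):
    """Deterministic integer hash (the built-in hash() is salted for strings)."""
    digest = hashlib.sha1(repr(value).encode()).hexdigest()
    return int(digest[:8], 16)


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy Morgan-style fingerprint (hypothetical helper, illustration only).

    atoms: list of element symbols, e.g. ['C', 'C', 'O']
    bonds: list of (i, j) atom-index pairs
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Radius 0: each identifier depends only on the atom and its degree.
    identifiers = [
        stable_hash((symbol, len(neighbours[i])))
        for i, symbol in enumerate(atoms)
    ]
    features = set(identifiers)

    for _ in range(radius):
        # Each round mixes in the neighbours' current identifiers,
        # so the environment described by an identifier grows by one bond.
        identifiers = [
            stable_hash(
                (identifiers[i], tuple(sorted(identifiers[j] for j in neighbours[i])))
            )
            for i in range(len(atoms))
        ]
        features.update(identifiers)

    # Fold every environment identifier into a fixed-length bit vector.
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


# Heavy atoms of ethanol (CCO) as a small example
atoms = ['C', 'C', 'O']
bonds = [(0, 1), (1, 2)]
fp_r0 = toy_circular_fingerprint(atoms, bonds, radius=0)
fp_r2 = toy_circular_fingerprint(atoms, bonds, radius=2)
print(sum(fp_r0), sum(fp_r2))
```

In this sketch a larger radius can only add environments, so every bit set at radius 0 is still set at radius 2. Distinct environments can also fold onto the same bit, which is why a bit in a hashed fingerprint does not always map back to a single substructure.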

In [2]:
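class ECFP6:
    """Compute ECFP6 (Morgan, radius 3) bit-vector fingerprints for a list of SMILES strings."""

    def __init__(self, smiles):
        # Note: Chem.MolFromSmiles returns None for a SMILES string it cannot parse
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles

    def mol2fp(self, mol, radius=3, n_bits=2048):
        # radius=3 means a diameter of 6 bonds, hence the name ECFP6
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        array = np.zeros((1,))
        # ConvertToNumpyArray resizes `array` to the fingerprint length and fills in the bits
        DataStructs.ConvertToNumpyArray(fp, array)
        return array

    def compute_ECFP6(self, name, n_bits=2048):
        bit_headers = ['bit' + str(i) for i in range(n_bits)]
        # Collect one row per molecule, then build the matrix once
        # instead of growing it with np.vstack inside the loop
        fps = [self.mol2fp(mol, n_bits=n_bits) for mol in self.mols]
        arr = np.array(fps, dtype=int).reshape(-1, n_bits)
        df_ecfp6 = pd.DataFrame(arr, columns=bit_headers)
        df_ecfp6.insert(loc=0, column='smiles', value=self.smiles)
        # name[:-4] drops a four-character extension such as '.csv'
        df_ecfp6.to_csv(name[:-4] + '_morgan.csv', index=False)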
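
The output name above is built by slicing off the last four characters of `name`, which only works when the name ends in a four-character extension such as `.csv`. As a hedged alternative, the sketch below derives the same output name with `pathlib`; the helper `morgan_output_name` is hypothetical and not part of the original class.

```python
from pathlib import Path


def morgan_output_name(name, suffix='_morgan.csv'):
    """Replace the extension of `name` (if any) with `suffix`."""
    path = Path(name)
    # path.stem is the file name without its final extension
    return str(path.with_name(path.stem + suffix))


print(morgan_output_name('PPARa_fp.csv'))
```

For `'PPARa_fp.csv'` this gives the same `PPARa_fp_morgan.csv` that the class writes, and it also behaves sensibly when the name has no extension at all.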

In [3]:
from molvs import standardize_smiles

def main():
    filename = 'ppara_bioactivity_data_ro5.csv'  # path to your CSV file
    df = pd.read_csv(filename)                   # read the CSV file into a pandas DataFrame
    print(len(df))
    # Standardize every SMILES string with MolVS before computing fingerprints
    smiles = [standardize_smiles(i) for i in df['canonical_smiles'].values]
    print(len(smiles))
    # Compute ECFP6 fingerprints and export them to a CSV file.
    ecfp6_descriptor = ECFP6(smiles)  # create the ECFP6 object from the SMILES list
    # The class strips the last four characters ('.csv') from the given name and
    # appends '_morgan.csv', so this call writes 'PPARa_fp_morgan.csv'.
    ecfp6_descriptor.compute_ECFP6('PPARa_fp.csv')

if __name__ == '__main__':
    main()

1674
1674


In [4]:
df = pd.read_csv('PPARa_fp_morgan.csv')
df

,smiles,bit0,bit1,bit2,bit3,bit4,bit5,bit6,bit7,bit8,...,bit2038,bit2039,bit2040,bit2041,bit2042,bit2043,bit2044,bit2045,bit2046,bit2047
0,CCCCC/C=C\C/C=C\C/C=C\C/C=C\CCCC(=O)O,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CCO[C@@H](Cc1ccc(OCCn2c3ccccc3c3cc(Br)ccc32)cc...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,O=C(O)[C@H](Cc1ccccc1)Oc1ccc(C(F)(F)F)cc1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cc1ccc(-c2nc(C)c(C(=O)N[C@H]3CCCN(c4cccc(C(=O)...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CCc1ccc(O[C@H](C)CCOc2ccc(CCC(=O)O)c(C)c2)c(C(...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1669,CC(C)(C)OC(=O)NC(CSCc1ccc(C(=O)c2ccc([N+](=O)[...,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1670,CC(C)(C)OC(=O)NC(COCc1ccc(-c2ccccc2)cc1)C(=O)O,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1671,CC(SCc1ccccc1)C(NC(=O)c1ccccc1)C(=O)O,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1672,CC(C)(C)OC(=O)NC(COCc1ccc(Cc2ccccc2)cc1)C(=O)O,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
