# Molecular Fingerprint-Based QSAR Analysis for SARS Coronavirus 3C-like Protease Inhibitors

**Quantitative Structure-Activity Relationship** (QSAR) modeling is a computational method that relates the chemical structure of molecules, encoded as numerical descriptors or fingerprints, to their biological activity.


**Molecular fingerprints** encode the structural features of small molecules as fixed-length bit strings (or count vectors), so that molecules can be compared computationally, for example by bit-string similarity.
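
As a rough illustration of a bit-string comparison, the sketch below computes the Tanimoto (Jaccard) coefficient, a commonly used similarity measure for binary fingerprints. The two short bit strings are made up for the example and are not taken from the dataset.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints, and bits set in at least one of them
    both = sum(x == '1' and y == '1' for x, y in zip(a, b))
    either = sum(x == '1' or y == '1' for x, y in zip(a, b))
    # Two all-zero fingerprints are treated as identical by convention here
    return both / either if either else 1.0

# Hypothetical 6-bit fingerprints
fp_a = "110100"
fp_b = "100110"
sim = tanimoto(fp_a, fp_b)
print(sim)
```

A value of 1.0 means the two fingerprints set exactly the same bits; 0.0 means they share none.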

## Installing Libraries 

In [None]:
import pandas as pd
import numpy as np

In [None]:
#conda
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

--2023-06-13 05:52:07--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85055499 (81M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’


2023-06-13 05:52:07 (198 MB/s) - ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’ saved [85055499/85055499]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ done
Solving environment: / - done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447c_0
    

In [None]:
#rdkit
! pip install rdkit
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdkit
  Downloading rdkit-2023.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.5 MB)
Collecting Pillow
  Downloading Pillow-9.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Collecting numpy
  Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: Pillow, numpy, rdkit
Successfully installed Pillow-9.5.0 numpy-1.21.6 rdkit-2023.3.1


In [None]:
! pip install padelpy
from padelpy import padeldescriptor

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting padelpy
  Downloading padelpy-0.1.14-py2.py3-none-any.whl (20.9 MB)
Installing collected packages: padelpy
Successfully installed padelpy-0.1.14


In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2023-06-13 05:52:41--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-06-13 05:52:41--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-06-13 05:52:42 (234 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-06-13 05:52:42--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [None]:
import glob

## Data Extraction

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/CSE498R/Results/covid_df_augmented_smiles.csv') 
df

Unnamed: 0,smiles,pIC50,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors
0,O=S(c1cccc(c1)Cl)(N(CC(=O)NC1CCCC1)C)=O,6.026872,active,330.837,2.0193,1.0,3.0
1,N(CC(NC1CCCC1)=O)(S(c1cccc(Cl)c1)(=O)=O)C,6.026872,active,330.837,2.0193,1.0,3.0
2,O=S(N(CC(=O)NC1CCCC1)C)(=O)c1cccc(c1)Cl,6.026872,active,330.837,2.0193,1.0,3.0
3,C1C(NC(CN(C)S(=O)(c2cccc(c2)Cl)=O)=O)CCC1,6.026872,active,330.837,2.0193,1.0,3.0
4,O=C(CN(S(=O)(=O)c1cccc(Cl)c1)C)NC1CCCC1,6.026872,active,330.837,2.0193,1.0,3.0
...,...,...,...,...,...,...,...
1416,c1c(ccc([N+](=O)[O-])c1)S(=O)(=O)c1ccc(cc1)[N+...,4.602060,inactive,308.271,2.3358,0.0,6.0
1417,C(CC)CN1c2c(cc(I)cc2)C(=O)C1=O,4.180456,inactive,329.137,2.6206,0.0,2.0
1418,C(C)CCN1c2ccc(cc2C(C1=O)=O)I,4.180456,inactive,329.137,2.6206,0.0,2.0
1419,C(CCN1C(=O)C(=O)c2c1ccc(c2)I)C,4.180456,inactive,329.137,2.6206,0.0,2.0
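
The `pIC50` column is presumably the negative base-10 logarithm of the IC50 expressed in molar units. That is the usual convention, but it is an assumption here, since the notebook does not show how the column was computed. The sketch below illustrates the conversion; the IC50 inputs are back-calculated for illustration and are not read from the dataset.

```python
import math

def pic50_from_nM(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nM <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nM * 1e-9)

# Hypothetical IC50 values (nM), chosen to match pIC50 values seen in the table
for ic50 in (940.0, 25000.0, 66000.0):
    print(ic50, round(pic50_from_nM(ic50), 6))
```

On this scale a larger pIC50 means a more potent inhibitor, which is consistent with the `active` rows having higher values than the `inactive` ones.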


## Fingerprint Generation

In [None]:
! unzip padel.zip 

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

In [None]:
! wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
! unzip fingerprints_xml.zip

--2023-05-26 11:41:32--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2023-05-26 11:41:32--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘fingerprints_xml.zip’


2023-05-26 11:41:33 (81.3 MB/s) - ‘fingerprints_xml.zip’ saved [10871/10871]

Archive:  fingerprints_xml.zip
  inflating: AtomPairs2DFingerprintCount.xml  
  inflating: AtomPairs2DFin

In [None]:
xml_files = glob.glob("*.xml")
xml_files.sort()
xml_files

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

In [None]:
FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']

In [None]:
fp = dict(zip(FP_list, xml_files))
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'PubChem': 'PubchemFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml',
 'Substructure': 'SubstructureFingerprinter.xml'}
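
`dict(zip(FP_list, xml_files))` pairs names and files purely by position, so it is only correct while the sorted `glob` result lines up with `FP_list`. A stray `.xml` file in the working directory would silently shift every pairing. The sketch below is one possible safeguard, not part of the original workflow: it spells the mapping out explicitly (copied from the output above) and checks a candidate dictionary against it. The helper name `check_mapping` and the file `AAA_extra.xml` are hypothetical.

```python
# Explicit name -> XML mapping, copied from the dictionary printed above
EXPECTED = {
    'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
    'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
    'EState': 'EStateFingerprinter.xml',
    'CDKextended': 'ExtendedFingerprinter.xml',
    'CDK': 'Fingerprinter.xml',
    'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
    'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
    'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
    'MACCS': 'MACCSFingerprinter.xml',
    'PubChem': 'PubchemFingerprinter.xml',
    'SubstructureCount': 'SubstructureFingerprintCount.xml',
    'Substructure': 'SubstructureFingerprinter.xml',
}

def check_mapping(fp, available_files):
    """Return (wrong pairings, expected XML files that are not available)."""
    wrong = {k: v for k, v in fp.items() if EXPECTED.get(k) != v}
    missing = sorted(set(EXPECTED.values()) - set(available_files))
    return wrong, missing

# Simulate one extra XML file that sorts first and shifts the zip pairing
xml_files = sorted(list(EXPECTED.values()) + ['AAA_extra.xml'])
fp_zipped = dict(zip(EXPECTED.keys(), xml_files))
wrong, missing = check_mapping(fp_zipped, xml_files)
print(len(wrong), "mismatched pairings;", len(missing), "missing files")
```

In the notebook, the same check could be run on the real `fp` and `xml_files` before any fingerprint is computed.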

In [None]:
selection = ['smiles']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
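
Because `molecule.smi` contains only the SMILES column, the fingerprint tables further down identify rows as `AUTOGEN_molecule_N`. A SMILES file conventionally allows an identifier after the structure, separated by whitespace. Assuming PaDEL-Descriptor reads that second column as the molecule name (worth checking against its documentation), the sketch below writes such a file. The identifiers `mol_0004` and `mol_1417` and the output file name are hypothetical.

```python
import os
import tempfile

def write_smi(smiles, names, path):
    """Write one 'SMILES<TAB>name' record per line."""
    if len(smiles) != len(names):
        raise ValueError("need exactly one name per SMILES string")
    with open(path, 'w') as fh:
        for smi, name in zip(smiles, names):
            fh.write(f"{smi}\t{name}\n")

# Two SMILES strings from the table above, with made-up identifiers
smiles = ['O=C(CN(S(=O)(=O)c1cccc(Cl)c1)C)NC1CCCC1',
          'C(CC)CN1c2c(cc(I)cc2)C(=O)C1=O']
names = ['mol_0004', 'mol_1417']

path = os.path.join(tempfile.mkdtemp(), 'molecule_named.smi')
write_smi(smiles, names, path)

with open(path) as fh:
    lines = fh.read().splitlines()
print(lines)
```

Keeping an identifier in the file makes it easier to join each fingerprint row back to its `pIC50` value without relying on row order.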

### PubChem

In [None]:
fingerprint = 'PubChem'

fingerprint_output_file = f'{fingerprint}.csv'  # 'PubChem.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'PubchemFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
pubchem = pd.read_csv(fingerprint_output_file)
pubchem

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,AUTOGEN_molecule_1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1418,AUTOGEN_molecule_1419,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1419,AUTOGEN_molecule_1420,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [1]:
pubchem.to_csv('pubchem.csv', index=False)


In [None]:
! cp pubchem.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### AtomPairs2D

In [None]:
fingerprint = 'AtomPairs2D'

fingerprint_output_file = f'{fingerprint}.csv'  # 'AtomPairs2D.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'AtomPairs2DFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
atom2d = pd.read_csv(fingerprint_output_file)
atom2d

Unnamed: 0,Name,AD2D1,AD2D2,AD2D3,AD2D4,AD2D5,AD2D6,AD2D7,AD2D8,AD2D9,...,AD2D771,AD2D772,AD2D773,AD2D774,AD2D775,AD2D776,AD2D777,AD2D778,AD2D779,AD2D780
0,AUTOGEN_molecule_1,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1418,AUTOGEN_molecule_1419,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1419,AUTOGEN_molecule_1420,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
atom2d.to_csv('atom2d.csv', index=False)

In [None]:
! cp atom2d.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### AtomPairs2DCount

In [None]:
fingerprint = 'AtomPairs2DCount'

fingerprint_output_file = f'{fingerprint}.csv'  # 'AtomPairs2DCount.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'AtomPairs2DFingerprintCount.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
atompairs2dcount = pd.read_csv(fingerprint_output_file)
atompairs2dcount

Unnamed: 0,Name,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_P,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_Br,APC2D1_C_I,...,APC2D10_I_I,APC2D10_I_B,APC2D10_I_Si,APC2D10_I_X,APC2D10_B_B,APC2D10_B_Si,APC2D10_B_X,APC2D10_Si_Si,APC2D10_Si_X,APC2D10_X_X
0,AUTOGEN_molecule_1,12.0,4.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AUTOGEN_molecule_2,12.0,4.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,AUTOGEN_molecule_3,12.0,4.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,AUTOGEN_molecule_4,12.0,4.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,AUTOGEN_molecule_5,12.0,4.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,12.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1417,AUTOGEN_molecule_1418,11.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1418,AUTOGEN_molecule_1419,11.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1419,AUTOGEN_molecule_1420,11.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
atompairs2dcount.to_csv('atompairs2dcount.csv', index=False)

In [None]:
! cp atompairs2dcount.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### EState

In [None]:
fingerprint = 'EState'

fingerprint_output_file = f'{fingerprint}.csv'  # 'EState.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'EStateFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
estate = pd.read_csv(fingerprint_output_file)
estate

Unnamed: 0,Name,EStateFP1,EStateFP2,EStateFP3,EStateFP4,EStateFP5,EStateFP6,EStateFP7,EStateFP8,EStateFP9,...,EStateFP70,EStateFP71,EStateFP72,EStateFP73,EStateFP74,EStateFP75,EStateFP76,EStateFP77,EStateFP78,EStateFP79
0,AUTOGEN_molecule_1,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,1,0,0,0,0
1418,AUTOGEN_molecule_1419,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,1,0,0,0,0
1419,AUTOGEN_molecule_1420,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,1,0,0,0,0


In [None]:
estate.to_csv('estate.csv', index=False)

In [None]:

! cp estate.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### CDK

In [None]:
fingerprint = 'CDK'

fingerprint_output_file = f'{fingerprint}.csv'  # 'CDK.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'Fingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
cdk = pd.read_csv(fingerprint_output_file)
cdk

Unnamed: 0,Name,FP1,FP2,FP3,FP4,FP5,FP6,FP7,FP8,FP9,...,FP1015,FP1016,FP1017,FP1018,FP1019,FP1020,FP1021,FP1022,FP1023,FP1024
0,AUTOGEN_molecule_1,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1418,AUTOGEN_molecule_1419,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1419,AUTOGEN_molecule_1420,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [None]:
cdk.to_csv('cdk.csv', index=False)

In [None]:

! cp cdk.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### CDKextended

In [None]:
fingerprint = 'CDKextended'

fingerprint_output_file = f'{fingerprint}.csv'  # 'CDKextended.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'ExtendedFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
cdk_extended = pd.read_csv(fingerprint_output_file)
cdk_extended

Unnamed: 0,Name,ExtFP1,ExtFP2,ExtFP3,ExtFP4,ExtFP5,ExtFP6,ExtFP7,ExtFP8,ExtFP9,...,ExtFP1015,ExtFP1016,ExtFP1017,ExtFP1018,ExtFP1019,ExtFP1020,ExtFP1021,ExtFP1022,ExtFP1023,ExtFP1024
0,AUTOGEN_molecule_1,0,0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,0,0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,0,0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,0,0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,0,0,1,0,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,0,0,0,0,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
1418,AUTOGEN_molecule_1419,0,0,0,0,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
1419,AUTOGEN_molecule_1420,0,0,0,0,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0


In [None]:
cdk_extended.to_csv('cdk_extended.csv', index=False)

In [None]:

! cp cdk_extended.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### CDKgraphonly

In [None]:
fingerprint = 'CDKgraphonly'

fingerprint_output_file = f'{fingerprint}.csv'  # 'CDKgraphonly.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'GraphOnlyFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
cdk_graph = pd.read_csv(fingerprint_output_file)
cdk_graph

Unnamed: 0,Name,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,...,GraphFP1015,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024
0,AUTOGEN_molecule_1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1418,AUTOGEN_molecule_1419,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1419,AUTOGEN_molecule_1420,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
cdk_graph.to_csv('cdk_graph.csv', index=False)

In [None]:

! cp cdk_graph.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### KlekotaRothCount

In [None]:
fingerprint = 'KlekotaRothCount'

fingerprint_output_file = f'{fingerprint}.csv'  # 'KlekotaRothCount.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'KlekotaRothFingerprintCount.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
klekota_roth_count = pd.read_csv(fingerprint_output_file)
klekota_roth_count

Unnamed: 0,Name,KRFPC1,KRFPC2,KRFPC3,KRFPC4,KRFPC5,KRFPC6,KRFPC7,KRFPC8,KRFPC9,...,KRFPC4851,KRFPC4852,KRFPC4853,KRFPC4854,KRFPC4855,KRFPC4856,KRFPC4857,KRFPC4858,KRFPC4859,KRFPC4860
0,AUTOGEN_molecule_1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AUTOGEN_molecule_2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,AUTOGEN_molecule_3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,AUTOGEN_molecule_4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,AUTOGEN_molecule_5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1417,AUTOGEN_molecule_1418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1418,AUTOGEN_molecule_1419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1419,AUTOGEN_molecule_1420,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
klekota_roth_count.to_csv('klekota_roth_count.csv', index=False)

In [None]:
! cp klekota_roth_count.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### KlekotaRoth

In [None]:
fingerprint = 'KlekotaRoth'

fingerprint_output_file = f'{fingerprint}.csv'  # 'KlekotaRoth.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'KlekotaRothFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
klekota_roth = pd.read_csv(fingerprint_output_file)
klekota_roth

Unnamed: 0,Name,KRFP1,KRFP2,KRFP3,KRFP4,KRFP5,KRFP6,KRFP7,KRFP8,KRFP9,...,KRFP4851,KRFP4852,KRFP4853,KRFP4854,KRFP4855,KRFP4856,KRFP4857,KRFP4858,KRFP4859,KRFP4860
0,AUTOGEN_molecule_1,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,AUTOGEN_molecule_2,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,AUTOGEN_molecule_3,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,AUTOGEN_molecule_4,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,AUTOGEN_molecule_5,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1417,AUTOGEN_molecule_1418,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1418,AUTOGEN_molecule_1419,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1419,AUTOGEN_molecule_1420,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
klekota_roth.to_csv('klekota_roth.csv', index=False)

In [None]:

! cp klekota_roth.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### MACCS

In [None]:
fingerprint = 'MACCS'

fingerprint_output_file = f'{fingerprint}.csv'  # 'MACCS.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'MACCSFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
maccs = pd.read_csv(fingerprint_output_file)
maccs

Unnamed: 0,Name,MACCSFP1,MACCSFP2,MACCSFP3,MACCSFP4,MACCSFP5,MACCSFP6,MACCSFP7,MACCSFP8,MACCSFP9,...,MACCSFP157,MACCSFP158,MACCSFP159,MACCSFP160,MACCSFP161,MACCSFP162,MACCSFP163,MACCSFP164,MACCSFP165,MACCSFP166
0,AUTOGEN_molecule_1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
1,AUTOGEN_molecule_2,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
2,AUTOGEN_molecule_3,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
3,AUTOGEN_molecule_4,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
4,AUTOGEN_molecule_5,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,0,0,0,0,0,0,...,0,1,1,0,1,1,1,1,1,0
1417,AUTOGEN_molecule_1418,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
1418,AUTOGEN_molecule_1419,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0
1419,AUTOGEN_molecule_1420,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,1,0


In [None]:
maccs.to_csv('maccs.csv', index=False)

In [None]:

! cp maccs.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### Substructure

In [None]:
fingerprint = 'Substructure'

fingerprint_output_file = f'{fingerprint}.csv'  # 'Substructure.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'SubstructureFingerprinter.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
substructure = pd.read_csv(fingerprint_output_file)
substructure

Unnamed: 0,Name,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,...,SubFP298,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307
0,AUTOGEN_molecule_1,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1,AUTOGEN_molecule_2,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
2,AUTOGEN_molecule_3,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
3,AUTOGEN_molecule_4,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
4,AUTOGEN_molecule_5,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0,0,0,0,0,0,0,0,0,...,1,1,0,0,1,0,0,0,0,1
1417,AUTOGEN_molecule_1418,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1418,AUTOGEN_molecule_1419,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1419,AUTOGEN_molecule_1420,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1


In [None]:
substructure.to_csv('substructure.csv', index=False)

In [None]:
! cp substructure.csv '/content/gdrive/MyDrive/CSE498R/Results/'

### SubstructureCount

In [None]:
fingerprint = 'SubstructureCount'

fingerprint_output_file = f'{fingerprint}.csv'  # 'SubstructureCount.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'SubstructureFingerprintCount.xml'

# Compute the fingerprint for every SMILES string in molecule.smi
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)

In [None]:
substructure_count = pd.read_csv(fingerprint_output_file)
substructure_count

Unnamed: 0,Name,SubFPC1,SubFPC2,SubFPC3,SubFPC4,SubFPC5,SubFPC6,SubFPC7,SubFPC8,SubFPC9,...,SubFPC298,SubFPC299,SubFPC300,SubFPC301,SubFPC302,SubFPC303,SubFPC304,SubFPC305,SubFPC306,SubFPC307
0,AUTOGEN_molecule_1,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,6.0,0.0,0.0,0.0,0.0,11.0
1,AUTOGEN_molecule_2,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,6.0,0.0,0.0,0.0,0.0,11.0
2,AUTOGEN_molecule_3,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,6.0,0.0,0.0,0.0,0.0,11.0
3,AUTOGEN_molecule_4,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,6.0,0.0,0.0,0.0,0.0,11.0
4,AUTOGEN_molecule_5,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,10.0,6.0,0.0,0.0,0.0,0.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,AUTOGEN_molecule_1417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,2.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,15.0
1417,AUTOGEN_molecule_1418,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,4.0,3.0,0.0,0.0,0.0,0.0,9.0
1418,AUTOGEN_molecule_1419,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,4.0,3.0,0.0,0.0,0.0,0.0,9.0
1419,AUTOGEN_molecule_1420,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,4.0,3.0,0.0,0.0,0.0,0.0,9.0
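
Comparing this table with the binary Substructure table suggests that each `SubFP` bit is 1 exactly when the matching `SubFPC` count is greater than zero. That holds for the values visible in the two tables; whether it holds for every column is an assumption. The sketch below checks it on the first nine columns of two rows, with the values copied from the outputs.

```python
def binarize(counts):
    """Turn a count fingerprint into a presence/absence bit vector."""
    return [1 if c > 0 else 0 for c in counts]

# First nine columns of rows 0 and 1417, copied from the two tables
row0_counts = [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # SubFPC1..SubFPC9
row0_bits = [0, 1, 0, 0, 0, 0, 0, 0, 0]                       # SubFP1..SubFP9
row1417_counts = [1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
row1417_bits = [1, 1, 0, 0, 0, 0, 0, 0, 0]

print(binarize(row0_counts) == row0_bits)
print(binarize(row1417_counts) == row1417_bits)
```

If the relationship holds generally, the count variant carries strictly more information than the binary one, which is a reason to evaluate the two as separate feature sets.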


In [None]:
substructure_count.to_csv('substructure_count.csv', index=False)

In [None]:
! cp substructure_count.csv '/content/gdrive/MyDrive/CSE498R/Results/'
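
The fingerprint cells above differ only in the fingerprint name, so they could be collapsed into a loop over the `fp` dictionary. The sketch below shows one way to do this; it only builds the keyword arguments, so it runs without PaDEL installed. The helper `build_jobs` and the constant `COMMON_OPTIONS` are hypothetical names, and only two `fp` entries are repeated to keep the example short.

```python
# Options shared by every padeldescriptor call in this notebook
COMMON_OPTIONS = dict(
    mol_dir='molecule.smi',
    detectaromaticity=True,
    standardizenitro=True,
    standardizetautomers=True,
    threads=2,
    removesalt=True,
    log=True,
    fingerprints=True,
)

def build_jobs(fp):
    """Return one keyword-argument dict per fingerprint, ready for padeldescriptor."""
    return [
        dict(COMMON_OPTIONS, d_file=f'{name}.csv', descriptortypes=xml_file)
        for name, xml_file in fp.items()
    ]

# Two entries of the fp dictionary, repeated so the sketch is self-contained
fp = {'PubChem': 'PubchemFingerprinter.xml', 'MACCS': 'MACCSFingerprinter.xml'}
jobs = build_jobs(fp)

for job in jobs:
    print(job['d_file'], job['descriptortypes'])
    # In the notebook, the computation itself would be: padeldescriptor(**job)
```

Passing the full `fp` dictionary would produce one job per fingerprint type, and each resulting CSV could then be copied to Drive in the same loop.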