## Building a dataframe of fingerprints

This notebook builds a dataframe containing several types of fingerprints for the 6935 molecules in a dataset taken from this research work: Meyer, J.G., Liu, S., Miller, I.J., Coon, J.J., Gitter, A., 2019. Learning Drug Functions from Chemical Structures with Convolutional Neural Networks and Random Forests. J. Chem. Inf. Model. 59, 4438–4449. https://doi.org/10.1021/acs.jcim.9b00236


#### In case installing from requirements.txt does not work

In [1]:
# pip install rdkit-pypi

In [2]:
# pip install PubChemPy

### Imports and loads

In [3]:
import os
import pandas as pd
import numpy as np
from sklearn import preprocessing
from rdkit.Chem import AllChem, MACCSkeys, rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import PandasTools as pt
import pubchempy as pcp

In [4]:
# I recommend turning this warning off: every operation below maps a function over a whole column, so there shouldn't be any problems
pd.options.mode.chained_assignment = None

In [5]:
dataset = 'pubchem'
# The f prefix must sit outside the quotes, otherwise '{dataset}' is not substituted
drugs = pd.read_csv(os.path.join('dataframes', 'pubchem', f'{dataset}_dataset_label_clean.csv'))

In [6]:
drugs['ATC_Code'].unique()

array(['B', 'L', 'H', 'J', 'A', 'V', 'G', 'D', 'N', 'S', 'M', 'C', 'R',
       'P'], dtype=object)

In [7]:
drugs.head()

Unnamed: 0,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,RuleFive,IsomericSMILES,ATC_Code
0,101041682,37.0,28.0,2180.2853,-0.76,0,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,B
1,657181,16.0,16.0,1209.3983,1.04,0,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,L
2,5311128,18.0,17.0,1269.4105,0.3,0,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,L
3,5311065,15.0,14.0,1069.22,-1.0,0,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,H
4,25074887,18.0,17.0,1431.038,1.33,0,CC(C)C[C@H](NC(=O)[C@@H](CCCNC(N)=O)NC(=O)[C@H...,H


### Functions

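The functions below compute fingerprints either from an RDKit molecule or, for the PubChem-based ones, from the compound's CID (using PubChemPy). Each returns a NumPy array, or `None` if the computation fails.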

In [8]:
def compute_connectivity_invariants(mol):
    """Function that obtains the connectivity invariants of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        con_inv_fp = rdMolDescriptors.GetConnectivityInvariants(mol)
    except:
        print('Something went wrong computing Connectivity Invariants')
        return None
    return np.array(con_inv_fp)

In [9]:
def compute_feature_invariants(mol):
    """Function that obtains the feature invariants of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        inv_fp = rdMolDescriptors.GetFeatureInvariants(mol)
    except:
        print('Something went wrong computing Feature Invariants')
        return None
    return np.array(inv_fp)

In [10]:
def compute_morgan_fp(mol, depth=2, nBits=2048):
    """Function that obtains the Morgan fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        mor_fp = AllChem.GetMorganFingerprintAsBitVect(mol, depth, nBits=nBits)
    except:
        print('Something went wrong computing Morgan fingerprints')
        return None
    return np.array(mor_fp)

In [11]:
def compute_maccskeys(mol):
    """Function that obtains the MACCSKeys of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        mkeys = MACCSkeys.GenMACCSKeys(mol)   
    except:
        print('Something went wrong computing MACCSKeys')
        return None
    return np.array(mkeys)

In [12]:
def compute_atom_pair_fp(mol, nBits=2048):
    """Function that obtains the atom pair Fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        atom_pair_fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits)
    except:
        print('Something went wrong computing Atom Pair fingerprints')
        return None
    return np.array(atom_pair_fp)

In [13]:
def compute_topological_torsion_fp(mol, nBits=2048):
    """Function that obtains the topological torsion fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        # Forward nBits so the function argument is actually used
        tt_fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=nBits)
    except:
        print('Something went wrong computing Topological Torsion fingerprints')
        return None
    return np.array(tt_fp)
    

In [14]:
def compute_avalon_fp(mol, nBits=2048):
    """Function that obtains the Avalon fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        av_fp = pyAvalonTools.GetAvalonFP(mol, nBits)
    except:
        print('Something went wrong computing Avalon fingerprints')
        return None
    return np.array(av_fp)

In [26]:
def compute_rdkit_fp(mol, maxPath=5, fpSize=2048):
    """Function that obtains the RDKit fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
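        # Use keywords: the second positional argument of RDKFingerprint is minPath, not maxPath
        rdkit_fp = AllChem.RDKFingerprint(mol, maxPath=maxPath, fpSize=fpSize)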
    except:
        print('Something went wrong computing RDKit fingerprints')
        return None
    return np.array(rdkit_fp)

In [15]:
def compute_pubchem_fingerprints(cid):
    """Function that obtains the PubChem fingerprints of a molecule
    Input: molecule's CID
    Output: Numpy array
    """
    try:
        comp = pcp.Compound.from_cid(int(cid))
        fp_bin = bin(int(comp.fingerprint, 16))[2:]   
    except:
        print('Something went wrong computing Pubchem fingerprints')
        return None
    return np.array(list(fp_bin)).astype('int')
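
One detail of the decoding step above: `bin(int(hex_string, 16))[2:]` drops any leading zero bits, so the decoded vector is shorter than four bits per hex digit. The sketch below shows a fixed-width alternative. It is only an illustration: `hex_to_bits` is a hypothetical helper and `toy_hex` is a made-up value, not a real PubChem fingerprint.

```python
def hex_to_bits(hex_string):
    """Decode a hex string into a list of 0/1 ints, keeping leading zeros."""
    width = 4 * len(hex_string)  # each hex digit encodes exactly 4 bits
    bit_string = format(int(hex_string, 16), 'b').zfill(width)
    return [int(bit) for bit in bit_string]


toy_hex = '0371C0'                 # made-up value for illustration only
naive = bin(int(toy_hex, 16))[2:]  # same decoding as in the function above
fixed = hex_to_bits(toy_hex)

print(len(naive), len(fixed))      # prints: 18 24
```

Whether the leading bits matter depends on how the fingerprint is used downstream; a fixed width simply guarantees that bit positions line up across molecules.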

In [16]:
def compute_cactvs_fingerprints(cid):
    """Function that obtains the Cactvs fingerprints of a molecule
    Input: molecule's CID
    Output: Numpy array
    """
    try:
        comp = pcp.Compound.from_cid(int(cid))
        # PubChemPy's cactvs_fingerprint property returns the decoded 881-bit string
        cactvs_fp_bin = comp.cactvs_fingerprint
    except:
        print('Something went wrong computing Cactvs fingerprints')
        return None
    return np.array(list(cactvs_fp_bin)).astype('int')
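
All the functions above repeat the same pattern: try the computation, print a message and return `None` on failure. A minimal sketch of factoring that pattern into a decorator follows. The names `safe_fingerprint` and `toy_fingerprint` are hypothetical, and the toy function works on a plain string instead of an RDKit molecule so that the example runs without RDKit.

```python
import functools


def safe_fingerprint(label):
    """Decorator factory: return None (and report why) if the wrapped call fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(mol, *args, **kwargs):
            if mol is None:  # e.g. a SMILES string that could not be parsed
                return None
            try:
                return func(mol, *args, **kwargs)
            except Exception as err:
                print(f'Could not compute {label}: {err}')
                return None
        return wrapper
    return decorator


@safe_fingerprint('toy fingerprint')
def toy_fingerprint(mol):
    # Stand-in for an RDKit call; here "mol" is just a string
    return [ord(ch) % 2 for ch in mol]


print(toy_fingerprint('CCO'))  # prints: [1, 1, 1]
print(toy_fingerprint(None))   # prints: None
```

Catching `Exception` rather than using a bare `except:` keeps `KeyboardInterrupt` working, which helps when a long mapping has to be stopped by hand.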

### Build a dataframe of fingerprints

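Add a column with the RDKit molecule to the dataframe. RDKit cannot parse four of the SMILES strings (see the log below); those rows get `None` in the `Molecule` column.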

In [19]:
pt.AddMoleculeColumnToFrame(frame=drugs, smilesCol='IsomericSMILES', molCol='Molecule')
drugs.head()

[18:07:57] Explicit valence for atom # 0 N, 4, is greater than permitted
[18:07:57] Explicit valence for atom # 0 N, 4, is greater than permitted
[18:07:57] Explicit valence for atom # 0 N, 4, is greater than permitted
[18:07:57] SMILES Parse Error: syntax error while parsing: OC1=CC=CC(=C1)C-1=C2\CCC(=N2)\C(=C2/N\C(\C=C2)=C(/C2=N/C(/C=C2)=C(\C2=CC=C\-1N2)C1=CC(O)=CC=C1)C1=CC(O)=CC=C1)\C1=CC(O)=CC=C1
[18:07:57] SMILES Parse Error: Failed parsing SMILES 'OC1=CC=CC(=C1)C-1=C2\CCC(=N2)\C(=C2/N\C(\C=C2)=C(/C2=N/C(/C=C2)=C(\C2=CC=C\-1N2)C1=CC(O)=CC=C1)C1=CC(O)=CC=C1)\C1=CC(O)=CC=C1' for input: 'OC1=CC=CC(=C1)C-1=C2\CCC(=N2)\C(=C2/N\C(\C=C2)=C(/C2=N/C(/C=C2)=C(\C2=CC=C\-1N2)C1=CC(O)=CC=C1)C1=CC(O)=CC=C1)\C1=CC(O)=CC=C1'


Unnamed: 0,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,RuleFive,IsomericSMILES,ATC_Code,Molecule
0,101041682,37.0,28.0,2180.2853,-0.76,0,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,B,<rdkit.Chem.rdchem.Mol object at 0x00000211023...
1,657181,16.0,16.0,1209.3983,1.04,0,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,L,<rdkit.Chem.rdchem.Mol object at 0x00000211023...
2,5311128,18.0,17.0,1269.4105,0.3,0,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,L,<rdkit.Chem.rdchem.Mol object at 0x00000211023...
3,5311065,15.0,14.0,1069.22,-1.0,0,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,H,<rdkit.Chem.rdchem.Mol object at 0x00000211023...
4,25074887,18.0,17.0,1431.038,1.33,0,CC(C)C[C@H](NC(=O)[C@@H](CCCNC(N)=O)NC(=O)[C@H...,H,<rdkit.Chem.rdchem.Mol object at 0x00000211023...


Select columns of interest

In [20]:
# .copy() makes drug_ids an independent dataframe, so adding columns later is unambiguous
drug_ids = drugs[['CID', 'ATC_Code', 'Molecule']].copy()
display(drug_ids.sample(5))

Unnamed: 0,CID,ATC_Code,Molecule
1821,5284587,H,<rdkit.Chem.rdchem.Mol object at 0x0000021102C...
2736,71961,J,<rdkit.Chem.rdchem.Mol object at 0x0000021102C...
656,5755,D,<rdkit.Chem.rdchem.Mol object at 0x0000021102C...
2185,2292,C,<rdkit.Chem.rdchem.Mol object at 0x0000021102C...
1139,10836,N,<rdkit.Chem.rdchem.Mol object at 0x0000021102C...


Encode the ATC_Code column; the encoded column will be our label

In [22]:
le = preprocessing.LabelEncoder()
le = le.fit(drug_ids['ATC_Code'])
drug_ids['ATC_Code_#'] = le.transform(drug_ids['ATC_Code'])
display(drug_ids.head())


Unnamed: 0,CID,ATC_Code,Molecule,ATC_Code_#
0,101041682,B,<rdkit.Chem.rdchem.Mol object at 0x00000211023...,1
1,657181,L,<rdkit.Chem.rdchem.Mol object at 0x00000211023...,7
2,5311128,L,<rdkit.Chem.rdchem.Mol object at 0x00000211023...,7
3,5311065,H,<rdkit.Chem.rdchem.Mol object at 0x00000211023...,5
4,25074887,H,<rdkit.Chem.rdchem.Mol object at 0x00000211023...,5
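
`LabelEncoder` numbers the classes by the sorted order of the unique labels, which is why `B` maps to 1, `H` to 5 and `L` to 7 in the table above. The plain-Python sketch below reproduces that mapping for the 14 ATC codes listed earlier. It illustrates the resulting mapping only, not scikit-learn's implementation, and the variable names are made up for this example.

```python
# The 14 unique ATC codes shown by drugs['ATC_Code'].unique()
all_classes = ['B', 'L', 'H', 'J', 'A', 'V', 'G', 'D', 'N', 'S', 'M', 'C', 'R', 'P']

# Sorted unique labels -> consecutive integers starting at 0
classes = sorted(set(all_classes))
code_to_int = {code: idx for idx, code in enumerate(classes)}
int_to_code = {idx: code for code, idx in code_to_int.items()}

first_rows = ['B', 'L', 'L', 'H', 'H']  # ATC codes of the first five rows
encoded = [code_to_int[code] for code in first_rows]

print(encoded)                            # prints: [1, 7, 7, 5, 5]
print([int_to_code[i] for i in encoded])  # prints: ['B', 'L', 'L', 'H', 'H']
```

With the fitted encoder, `le.classes_` holds the same sorted labels and `le.inverse_transform` maps integers back to codes.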


#### Using the functions described above to add columns containing the fingerprints

In [23]:
drug_ids['FeatInvariants'] = drug_ids['Molecule'].map(compute_feature_invariants)
drug_ids['ConnInvariants'] = drug_ids['Molecule'].map(compute_connectivity_invariants)
drug_ids['Morgan2FP'] = drug_ids['Molecule'].map(compute_morgan_fp)
drug_ids['MACCSKeys'] = drug_ids['Molecule'].map(compute_maccskeys)
drug_ids['AtomPairFP'] = drug_ids['Molecule'].map(compute_atom_pair_fp)
drug_ids['TopTorFP'] = drug_ids['Molecule'].map(compute_topological_torsion_fp)
drug_ids['AvalonFP'] = drug_ids['Molecule'].map(compute_avalon_fp)

Something went wrong computing Feature Invariants
Something went wrong computing Feature Invariants
Something went wrong computing Feature Invariants
Something went wrong computing Feature Invariants
Something went wrong computing Connectivity Invariants
Something went wrong computing Connectivity Invariants
Something went wrong computing Connectivity Invariants
Something went wrong computing Connectivity Invariants
Something went wrong computing Morgan fingerprints
Something went wrong computing Morgan fingerprints
Something went wrong computing Morgan fingerprints
Something went wrong computing Morgan fingerprints
Something went wrong computing MACCSKeys
Something went wrong computing MACCSKeys
Something went wrong computing MACCSKeys
Something went wrong computing MACCSKeys
Something went wrong computing Atom Pair fingerprints
Something went wrong computing Atom Pair fingerprints
Something went wrong computing Atom Pair fingerprints
Something went wrong computing Atom Pair fingerprints

In [24]:
# These mappings can take a very long time: each one took over an hour on my computer
drug_ids['PubchemFP'] = drug_ids['CID'].map(compute_pubchem_fingerprints)
drug_ids['CactvsFP'] = drug_ids['CID'].map(compute_cactvs_fingerprints)
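
Both PubChem mappings call `pcp.Compound.from_cid` for every row, so each compound is requested twice. The sketch below shows one way to avoid the repeated request, by caching the fetch with `functools.lru_cache`. It is an assumption-laden illustration: `fetch_fingerprint_hex` is a hypothetical stand-in that fakes the network call, so the example runs offline.

```python
import functools

call_log = []  # records every simulated network request


def fetch_fingerprint_hex(cid):
    """Hypothetical stand-in for pcp.Compound.from_cid(cid).fingerprint."""
    call_log.append(cid)
    return format(cid, '08X')  # fake hex payload, not a real fingerprint


@functools.lru_cache(maxsize=None)
def cached_fingerprint_hex(cid):
    # The cache key is the CID, so a repeated CID never triggers a second request
    return fetch_fingerprint_hex(int(cid))


cids = [5755, 2292, 5755]
first_pass = [cached_fingerprint_hex(cid) for cid in cids]
second_pass = [cached_fingerprint_hex(cid) for cid in cids]

print(len(call_log))  # prints: 2
```

In the notebook, the cached function would wrap the real PubChemPy call, and both fingerprint columns would be derived from its result.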

In [None]:
# This takes so long that it crashes my computer, and I couldn't find a way around it
drug_ids['RDKitFP'] = drug_ids['Molecule'].map(compute_rdkit_fp)

### Saving the dataframe

In [114]:
# The f prefix must sit outside the quotes, otherwise '{dataset}' is not substituted
drug_ids.to_pickle(os.path.join('res', 'pickles', f'{dataset}_fp.pkl'))

In [4]:
drug_ids.sample(5)

NameError: name 'drug_ids' is not defined

In [116]:
drug_ids.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2938 entries, 5 to 14995
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Molecule        2934 non-null   object 
 1   atc_code_3      2938 non-null   object 
 2   CID             2900 non-null   float64
 3   atc_code_#      2938 non-null   int32  
 4   FeatInvariants  2934 non-null   object 
 5   ConnInvariants  2934 non-null   object 
 6   Morgan2FP       2934 non-null   object 
 7   MACCSKeys       2934 non-null   object 
 8   AtomPairFP      2934 non-null   object 
 9   TopTorFP        2934 non-null   object 
 10  AvalonFP        2934 non-null   object 
 11  PubchemFP       2900 non-null   object 
 12  CactvsFP        2900 non-null   object 
dtypes: float64(1), int32(1), object(11)
memory usage: 309.9+ KB


In [1]:
n = pd.read_pickle(os.path.join('res', 'pickles', 'drug'))
