## Building a dataframe of fingerprints

This notebook builds a dataframe containing several types of fingerprints for 6935 molecules taken from the dataset published in this research work: Meyer, J.G., Liu, S., Miller, I.J., Coon, J.J., Gitter, A., 2019. Learning Drug Functions from Chemical Structures with Convolutional Neural Networks and Random Forests. J. Chem. Inf. Model. 59, 4438–4449. https://doi.org/10.1021/acs.jcim.9b00236


#### If requirements.txt does not work

In [1]:
# pip install rdkit-pypi

In [2]:
# pip install PubChemPy

### Imports and loads

In [3]:
import os
import pandas as pd
import numpy as np
from sklearn import preprocessing
from rdkit.Chem import AllChem, MACCSkeys, rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import PandasTools as pt
import pubchempy as pcp

In [4]:
# I recommend turning this warning off: every operation maps a function over a whole column, so there shouldn't be any problems
pd.options.mode.chained_assignment = None

In [35]:
dataset = 'all_label_drugs'
drugs = pd.read_csv(os.path.join('dataframes','pubchem',f'{dataset}.csv'))

In [11]:
display(drugs.head())
drugs.info()

Unnamed: 0,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,RuleFive,IsomericSMILES,ATC_Code_Short,ATC_Code_Explanation
0,1,4.0,0.0,203.24,0.4,1.0,CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C,N,NERVOUS SYSTEM
1,119,3.0,2.0,103.12,-3.17,1.0,C(CC(=O)O)CN,N,NERVOUS SYSTEM
2,137,4.0,2.0,131.13,-1.5,1.0,C(CC(=O)O)C(=O)CN,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS
3,176,2.0,1.0,60.05,-0.17,1.0,CC(=O)O,G,GENITO URINARY SYSTEM AND SEX HORMONES
4,187,2.0,0.0,146.21,0.2,1.0,CC(=O)OCC[N+](C)(C)C,S,SENSORY ORGANS


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10183 entries, 0 to 10182
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   CID                   10183 non-null  int64  
 1   HBondAcceptorCount    10183 non-null  float64
 2   HBondDonorCount       10183 non-null  float64
 3   MolecularWeight       10183 non-null  float64
 4   LogP                  7546 non-null   float64
 5   RuleFive              10183 non-null  float64
 6   IsomericSMILES        10183 non-null  object 
 7   ATC_Code_Short        10183 non-null  object 
 8   ATC_Code_Explanation  10183 non-null  object 
dtypes: float64(5), int64(1), object(3)
memory usage: 716.1+ KB


In [12]:
drugs['ATC_Code_Short'].unique()

array(['N', 'L', 'G', 'S', 'P', 'A', 'V', 'D', 'R', 'J', 'B', 'C', 'H',
       'M', 'I', 'O'], dtype=object)

### Functions

The functions below obtain fingerprints from RDKit molecules or from their CID (using PubChemPy).

In [13]:
def compute_connectivity_invariants(mol):
    """Function that obtains the connectivity invariants of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        con_inv_fp = rdMolDescriptors.GetConnectivityInvariants(mol)
    except Exception:
        print('Something went wrong computing Connectivity Invariants')
        return None
    return np.array(con_inv_fp)

In [14]:
def compute_feature_invariants(mol):
    """Function that obtains the feature invariants of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        inv_fp = rdMolDescriptors.GetFeatureInvariants(mol)
    except Exception:
        print('Something went wrong computing Feature Invariants')
        return None
    return np.array(inv_fp)

In [15]:
def compute_morgan_fp(mol, depth=2, nBits=2048):
    """Function that obtains the Morgan fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        mor_fp = AllChem.GetMorganFingerprintAsBitVect(mol, depth, nBits=nBits)
    except Exception:
        print('Something went wrong computing Morgan fingerprints')
        return None
    return np.array(mor_fp)

In [16]:
def compute_maccskeys(mol):
    """Function that obtains the MACCSKeys of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        mkeys = MACCSkeys.GenMACCSKeys(mol)
    except Exception:
        print('Something went wrong computing MACCSKeys')
        return None
    return np.array(mkeys)

In [17]:
def compute_atom_pair_fp(mol, nBits=2048):
    """Function that obtains the atom pair Fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        atom_pair_fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits)
    except Exception:
        print('Something went wrong computing Atom Pair fingerprints')
        return None
    return np.array(atom_pair_fp)

In [18]:
def compute_topological_torsion_fp(mol, nBits=2048):
    """Function that obtains the topological torsion fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
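        # Pass nBits through; the original call ignored the function's nBits argument
        tt_fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=nBits)
    except Exception: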
        print('Something went wrong computing Topological Torsion fingerprints')
        return None
    return np.array(tt_fp)
    

In [19]:
def compute_avalon_fp(mol, nBits=2048):
    """Function that obtains the Avalon fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
        av_fp = pyAvalonTools.GetAvalonFP(mol, nBits)
    except Exception:
        print('Something went wrong computing Avalon fingerprints')
        return None
    return np.array(av_fp)

In [20]:
def compute_rdkit_fp(mol, maxPath=5, fpSize=2048):
    """Function that obtains the RDKit fingerprints of a molecule
    Input: RDKit molecule
    Output: Numpy array
    """
    try:
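        # Pass by keyword: positionally, the second argument of RDKFingerprint is minPath, not maxPath
        rdkit_fp = AllChem.RDKFingerprint(mol, maxPath=maxPath, fpSize=fpSize)
    except Exception: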
        print('Something went wrong computing RDKit fingerprints')
        return None
    return np.array(rdkit_fp)
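
A note on argument order: as far as I know, `RDKFingerprint` is declared as `RDKFingerprint(mol, minPath=1, maxPath=7, fpSize=2048, ...)`, so positional arguments after `mol` bind to `minPath` first. The sketch below uses a hypothetical stand-in function with the same leading parameters (it is not RDKit code) to show how positional and keyword calls bind differently.

```python
def rdk_fingerprint_stub(mol, minPath=1, maxPath=7, fpSize=2048):
    """Hypothetical stand-in that only reports how its arguments were bound."""
    return {"minPath": minPath, "maxPath": maxPath, "fpSize": fpSize}

# Positional call: 5 lands in minPath and 2048 lands in maxPath.
positional = rdk_fingerprint_stub("mol", 5, 2048)

# Keyword call: each value reaches the parameter it was meant for.
keyword = rdk_fingerprint_stub("mol", maxPath=5, fpSize=2048)

print(positional)
print(keyword)
```

With positional arguments, a call such as `RDKFingerprint(mol, 5, 2048)` would request paths up to 2048 bonds long, which could plausibly account for a very long run time.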

In [22]:
def compute_pubchem_fingerprints(cid):
    """Function that obtains the PubChem fingerprints of a molecule
    Input: molecule's CID
    Output: Numpy array
    """
    try:
        comp = pcp.Compound.from_cid(int(cid))
        fp_bin = bin(int(comp.fingerprint, 16))[2:]
    except Exception:
        print('Something went wrong computing Pubchem fingerprints')
        return None
    return np.array(list(fp_bin)).astype('int')
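
The hex-to-bits conversion used above can be tried offline on a short string. The sketch below uses a made-up hex value (not a real PubChem fingerprint) and a hypothetical helper, `hex_to_bits`. It shows that `bin(int(..., 16))[2:]` drops leading zero bits, and how `zfill` restores a fixed width if that is wanted.

```python
def hex_to_bits(hex_string, keep_leading_zeros=False):
    """Convert a hex string to a list of 0/1 integers."""
    bits = bin(int(hex_string, 16))[2:]
    if keep_leading_zeros:
        # Each hex digit encodes exactly 4 bits.
        bits = bits.zfill(4 * len(hex_string))
    return [int(b) for b in bits]

example_hex = "0371"  # made-up value for illustration only
short_bits = hex_to_bits(example_hex)
full_bits = hex_to_bits(example_hex, keep_leading_zeros=True)

print(len(short_bits), short_bits)
print(len(full_bits), full_bits)
```

The first call yields 10 bits and the second yields 16, because the leading zeros of the hex string are only kept when the result is padded back to four bits per hex digit.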

In [23]:
def compute_cactvs_fingerprints(cid):
    """Function that obtains the Cactvs fingerprints of a molecule
    Input: molecule's CID
    Output: Numpy array
    """
    try:
        comp = pcp.Compound.from_cid(int(cid))
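        # NOTE: this reads the same attribute as compute_pubchem_fingerprints,
        # so both columns hold identical values. pubchempy also exposes
        # Compound.cactvs_fingerprint, which may be what was intended here.
        cactvs_fp_bin = bin(int(comp.fingerprint, 16))[2:]
    except Exception: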
        print('Something went wrong computing Cactvs fingerprints')
        return None
    return np.array(list(cactvs_fp_bin)).astype('int')
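
The functions above all follow the same pattern: call a fingerprint routine, return `None` on failure, otherwise return the result. As an optional refactoring sketch, that pattern could be factored into a decorator. The name `none_on_error` is hypothetical, and the example uses a toy function instead of RDKit so that it runs without any dependency.

```python
import functools

def none_on_error(label):
    """Return a decorator that prints a message and returns None on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                print(f'Something went wrong computing {label}')
                return None
        return wrapper
    return decorator

@none_on_error('toy fingerprint')
def toy_fingerprint(text):
    # Toy stand-in: one bit per character, 1 for vowels; fails on non-strings.
    return [1 if ch in 'aeiou' else 0 for ch in text.lower()]

ok = toy_fingerprint('CCO')
failed = toy_fingerprint(None)  # prints the message and returns None
```

Catching `Exception` instead of using a bare `except:` keeps Ctrl-C working, which matters for mappings that run for a long time.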

### Build a dataframe of fingerprints

Add a column with the RDKit Molecule to the Dataframe

In [24]:
pt.AddMoleculeColumnToFrame(frame=drugs, smilesCol='IsomericSMILES', molCol='Molecule')
drugs.head()



Unnamed: 0,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,RuleFive,IsomericSMILES,ATC_Code_Short,ATC_Code_Explanation,Molecule
0,1,4.0,0.0,203.24,0.4,1.0,CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C,N,NERVOUS SYSTEM,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...
1,119,3.0,2.0,103.12,-3.17,1.0,C(CC(=O)O)CN,N,NERVOUS SYSTEM,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...
2,137,4.0,2.0,131.13,-1.5,1.0,C(CC(=O)O)C(=O)CN,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...
3,176,2.0,1.0,60.05,-0.17,1.0,CC(=O)O,G,GENITO URINARY SYSTEM AND SEX HORMONES,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...
4,187,2.0,0.0,146.21,0.2,1.0,CC(=O)OCC[N+](C)(C)C,S,SENSORY ORGANS,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...


Select columns of interest

In [26]:
drug_ids = drugs[['CID', 'ATC_Code_Short', 'Molecule']].copy()  # copy so new columns are added to an independent dataframe
display(drug_ids.sample(5))

Unnamed: 0,CID,ATC_Code_Short,Molecule
5990,6437473,D,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...
6271,90476194,J,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...
924,6300,A,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...
9144,5458825,L,<rdkit.Chem.rdchem.Mol object at 0x000001C4DAC...
6848,53297392,L,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...


Encode the ATC_Code_Short column; the encoded column will be our label

In [27]:
le = preprocessing.LabelEncoder()
le = le.fit(drug_ids['ATC_Code_Short'])
drug_ids['ATC_Code_#'] = le.transform(drug_ids['ATC_Code_Short'])
display(drug_ids.head())


Unnamed: 0,CID,ATC_Code_Short,Molecule,ATC_Code_#
0,1,N,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...,10
1,119,N,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...,10
2,137,L,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...,8
3,176,G,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...,4
4,187,S,<rdkit.Chem.rdchem.Mol object at 0x000001C4CB4...,14


In [29]:
drug_ids['ATC_Code_#'].nunique()

16
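
To make the mapping explicit: `LabelEncoder` sorts the distinct labels and numbers them from 0. The dependency-free sketch below reproduces that mapping for the 16 ATC codes listed earlier; `code_to_int` is a name chosen for illustration.

```python
atc_codes = ['N', 'L', 'G', 'S', 'P', 'A', 'V', 'D', 'R', 'J', 'B', 'C', 'H',
             'M', 'I', 'O']

# Sort the distinct labels, then number them from 0, as LabelEncoder does.
classes = sorted(set(atc_codes))
code_to_int = {code: i for i, code in enumerate(classes)}

print(code_to_int['N'], code_to_int['L'], code_to_int['G'], code_to_int['S'])
# → 10 8 4 14
```

With the fitted encoder, `le.classes_` holds the sorted labels and `le.inverse_transform` maps the integers back to ATC codes.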

#### Add fingerprint columns using the functions defined above

In [30]:
drug_ids['FeatInvariants'] = drug_ids['Molecule'].map(compute_feature_invariants)
drug_ids['ConnInvariants'] = drug_ids['Molecule'].map(compute_connectivity_invariants)
drug_ids['Morgan2FP'] = drug_ids['Molecule'].map(compute_morgan_fp)
drug_ids['MACCSKeys'] = drug_ids['Molecule'].map(compute_maccskeys)
drug_ids['AtomPairFP'] = drug_ids['Molecule'].map(compute_atom_pair_fp)
drug_ids['TopTorFP'] = drug_ids['Molecule'].map(compute_topological_torsion_fp)
drug_ids['AvalonFP'] = drug_ids['Molecule'].map(compute_avalon_fp)

In [31]:
# These mappings may take a very long time
drug_ids['PubchemFP'] = drug_ids['CID'].map(compute_pubchem_fingerprints)  # This takes over 1 hour on my computer
drug_ids['CactvsFP'] = drug_ids['CID'].map(compute_cactvs_fingerprints)  # This takes over 1 hour on my computer
#drug_ids['RDKitFP'] = drug_ids['Molecule'].map(compute_rdkit_fp)  # This takes so long that it crashes my computer, and I couldn't find a way around it
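
`compute_pubchem_fingerprints` and `compute_cactvs_fingerprints` each request the same compound from PubChem, so every CID is downloaded twice. One possible way to reduce the run time is to cache the lookup. The sketch below is an assumption, not something this notebook did: `make_cached_fetcher` is a hypothetical helper, and a counting fake stands in for the network call so that the example runs offline.

```python
from functools import lru_cache

def make_cached_fetcher(fetch):
    """Wrap a one-argument fetch function so each CID is fetched only once."""
    @lru_cache(maxsize=None)
    def cached(cid):
        return fetch(cid)
    return cached

calls = []

def fake_fetch(cid):
    # Stand-in for a network request; records every real call.
    calls.append(cid)
    return {'cid': cid, 'fingerprint': '0371'}

get_compound = make_cached_fetcher(fake_fetch)
first = get_compound(119)
second = get_compound(119)  # served from the cache, no new call
third = get_compound(137)

print(len(calls))
```

In the notebook, the wrapped function would be `pcp.Compound.from_cid`, for example `get_compound = make_cached_fetcher(pcp.Compound.from_cid)`, with both fingerprint functions calling `get_compound(int(cid))`.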

In [None]:
drug_ids['RDKitFP'] = drug_ids['Molecule'].map(compute_rdkit_fp)  # This takes so long that it crashes my computer, and I couldn't find a way around it

### Saving the dataframe

In [36]:
drug_ids.to_pickle(os.path.join('res','pickles',f'{dataset}.pkl'))

In [33]:
drug_ids.sample(5)

Unnamed: 0,CID,ATC_Code_Short,Molecule,ATC_Code_#,FeatInvariants,ConnInvariants,Morgan2FP,MACCSKeys,AtomPairFP,TopTorFP,AvalonFP,PubchemFP,CactvsFP
4738,21945,J,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...,7,"[0, 18, 0, 18, 0, 18, 0, 18, 0, 0, 4, 4, 4, 4,...","[2968968094, 2092489639, 2968968094, 209248963...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ..."
7599,157917,C,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...,2,"[0, 0, 0, 0, 0, 0, 19, 0, 0, 2, 2, 0, 0, 0, 0,...","[2246728737, 3217380708, 3217380708, 297603378...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ..."
6976,122130735,J,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...,7,"[0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, ...","[2246728737, 2245384272, 2976033787, 297681616...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ..."
7920,45356880,C,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...,2,"[0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, ...","[2246728737, 2976033787, 2976033787, 297603378...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ..."
1621,65526,B,<rdkit.Chem.rdchem.Mol object at 0x000001C4CDE...,1,"[4, 4, 4, 4, 4, 4, 0, 19, 32, 2, 1]","[3218693969, 3218693969, 3217380708, 321869396...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, ..."


In [116]:
drug_ids.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2938 entries, 5 to 14995
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Molecule        2934 non-null   object 
 1   atc_code_3      2938 non-null   object 
 2   CID             2900 non-null   float64
 3   atc_code_#      2938 non-null   int32  
 4   FeatInvariants  2934 non-null   object 
 5   ConnInvariants  2934 non-null   object 
 6   Morgan2FP       2934 non-null   object 
 7   MACCSKeys       2934 non-null   object 
 8   AtomPairFP      2934 non-null   object 
 9   TopTorFP        2934 non-null   object 
 10  AvalonFP        2934 non-null   object 
 11  PubchemFP       2900 non-null   object 
 12  CactvsFP        2900 non-null   object 
dtypes: float64(1), int32(1), object(11)
memory usage: 309.9+ KB


In [1]:
n = pd.read_pickle(os.path.join('res', 'pickles', 'drug'))

NameError: name 'drug_ids' is not defined