## PubChem Molecular Fingerprints Drug-Set Library
### Drug-set labels: Molecular Fingerprints
#### ALL DATABASES ACCESSED 08/2020
##### Author : Eryk Kropiwnicki | eryk.kropiwnicki@icahn.mssm.edu

This notebook queries the PubChem API with the master list of Drugmonizome small molecules to retrieve the PubChem 2D fingerprint of each molecule. Each fingerprint is decoded into a bit string in which every bit represents a structural feature, and the bit strings are then converted into a drug-set library.

In [1]:
import os
import json

from rdkit import Chem
from rdkit import DataStructs

import pandas as pd
import numpy as np

import requests
import time

In [2]:
os.chdir('../../scripts')
from export_script import *
os.chdir('../notebooks/RDKIT')

PubChem fingerprints are returned in base64 encoding and must be decoded into bit strings.

In [3]:
# Function for converting base64 encoded fingerprint into bit string

from base64 import b64decode

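def PCFP_BitString(pcfp_base64):
    # Decode to bytes, write each byte as 8 bits, then keep bits 32-912:
    # the 881 fingerprint bits that follow the 4-byte length prefix
    pcfp_bitstring = "".join("{:08b}".format(x) for x in b64decode(pcfp_base64))[32:913]
    return pcfp_bitstring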
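
The slice `[32:913]` follows from the layout of the decoded fingerprint: a 4-byte length prefix (32 bits), then the 881 fingerprint bits, then padding bits that complete the final byte. A minimal sketch that checks this on a synthetic fingerprint; the helper `encode_synthetic_pcfp` is hypothetical and exists only for this illustration, and no real PubChem data is used:

```python
from base64 import b64decode, b64encode

def pcfp_bitstring(pcfp_base64):
    """Decode a base64 fingerprint and keep the 881 feature bits."""
    bits = "".join("{:08b}".format(byte) for byte in b64decode(pcfp_base64))
    return bits[32:913]

def encode_synthetic_pcfp(on_bits, n_bits=881):
    """Hypothetical helper: build a base64 string laid out like a PubChem fingerprint."""
    bits = ["0"] * n_bits
    for position in on_bits:
        bits[position] = "1"
    payload = "".join(bits)
    payload += "0" * (-len(payload) % 8)           # pad to a whole number of bytes
    body = int(payload, 2).to_bytes(len(payload) // 8, "big")
    prefix = n_bits.to_bytes(4, "big")             # 4-byte length prefix
    return b64encode(prefix + body).decode("ascii")

encoded = encode_synthetic_pcfp({0, 10, 880})
decoded = pcfp_bitstring(encoded)
on_positions = [i for i, bit in enumerate(decoded) if bit == "1"]
print(len(decoded), on_positions)
```

The round trip recovers exactly the bits that were set, which is what the slice is meant to guarantee.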

### Import data

In [4]:
df = pd.read_csv('../../metadata/drugmonizome_metadata.tsv', sep='\t',
                 usecols=['Common name', 'InChI Key', 'Canonical_SMILES'])
df = df.dropna()
df.head()

| | Common name | InChI Key | Canonical_SMILES |
|---|---|---|---|
| 0 | Bivalirudin | OIRCOABEOLEUMC-GEJPAHFPSA-N | CCC(C)C(C(=O)N1CCCC1C(=O)NC(CCC(=O)O)C(=O)NC(C... |
| 1 | Leuprolide | GFIJNRVAKGFPGQ-LIJARHBVSA-N | CCNC(=O)C1CCCN1C(=O)C(CCCN=C(N)N)NC(=O)C(CC(C)... |
| 2 | Goserelin | BLCLNMBMMGCOAS-URPVMXJPSA-N | CC(C)CC(C(=O)NC(CCCN=C(N)N)C(=O)N1CCCC1C(=O)NN... |
| 3 | Gramicidin D | NDAYQJDHGXTBJL-MWWSRJDJSA-N | CC(C)CC(C(=O)NC(C)C(=O)NC(C(C)C)C(=O)NC(C(C)C)... |
| 4 | Desmopressin | NFLWUMRGJYTJIN-PNIOQBSNSA-N | C1CC(N(C1)C(=O)C2CSSCCC(=O)NC(C(=O)NC(C(=O)NC(... |


In [5]:
df.shape

(14312, 3)

In [6]:
# Create query list of SMILES representations of small molecules
all_drugs = list(df['Canonical_SMILES'])

### Query PubChem API

In [7]:
bit_vectors = []
failed = []
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/%s/property/Fingerprint2D/json'

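for query in all_drugs:
    try:
        # Note: the SMILES string is substituted into the URL path without percent-encoding
        response = requests.get(url % query)
        bit_vectors.append(response.json()['PropertyTable']['Properties'][0]['Fingerprint2D'])

    except (json.decoder.JSONDecodeError, KeyError):
        # No fingerprint in the response: record the SMILES so it can be dropped from df later
        failed.append(query)

    # Throttle to about 5 requests per second
    time.sleep(0.2)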
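
SMILES strings can contain characters that have special meaning in a URL, such as `#` (triple bond), `/` and `\` (double-bond stereochemistry), and `+`. A hedged sketch of percent-encoding each SMILES before substituting it into the request path, which may reduce the number of failed queries. `build_fingerprint_url` is a hypothetical helper, not part of the original notebook, and this variant was not used to produce the results reported here:

```python
from urllib.parse import quote

URL_TEMPLATE = ('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/'
                '%s/property/Fingerprint2D/json')

def build_fingerprint_url(smiles):
    """Hypothetical helper: percent-encode a SMILES string and place it in the URL path."""
    # safe='' also encodes '/', which quote() would otherwise leave untouched
    return URL_TEMPLATE % quote(smiles, safe='')

nitrile_url = build_fingerprint_url('CC#N')        # '#' would otherwise start a URL fragment
alkene_url = build_fingerprint_url('C/C=C/C')      # '/' would otherwise split the path
print(nitrile_url)
print(alkene_url)
```

The request loop would then call `requests.get(build_fingerprint_url(query))` in place of `requests.get(url % query)`.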

Proportion of SMILES that could not be matched

In [8]:
len(failed)/len(all_drugs)

0.06316377864728899

In [15]:
# Drop failed SMILES from df
df = df[~df['Canonical_SMILES'].isin(failed)]

### Convert list of bit vectors into binary array

In [16]:
# Decode each PubChem fingerprint into a bit string and convert it to an RDKit bit vector
pubchem_fps = []
for vector in bit_vectors:
    pubchem_fps.append(DataStructs.CreateFromBitString(PCFP_BitString(vector)))

In [18]:
# Create binary array of all small molecule bit strings
pubchem_np_fps = []
for fp in pubchem_fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    pubchem_np_fps.append(arr)
    
df_pubchem_fps = pd.DataFrame(pubchem_np_fps, index = list(df['InChI Key'])) #index by InChI Key

In [19]:
# Rename column labels to 'PubChem<bit position>'
df_pubchem_fps.columns = ['PubChem' + str(col) for col in df_pubchem_fps.columns]
df_pubchem_fps.shape

(13408, 881)

### Export as drug-set library

In [22]:
os.chdir('../../data/RDKIT')

In [24]:
drugsetlibrary = {}
for col in df_pubchem_fps.columns:
    # Unique drugs (InChI Keys) that have this fingerprint bit set
    drugs = set(df_pubchem_fps[df_pubchem_fps[col] == 1].index)
    # Keep only terms associated with at least 5 drugs
    if len(drugs) >= 5:
        drugsetlibrary[col] = list(drugs)
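
The cell above inverts a drugs-by-bits matrix into a mapping from each bit to the drugs that have it set. The same inversion can be illustrated without pandas. A minimal sketch on toy data; the function name `build_drugset_library`, the placeholder keys, and the lowered size threshold are illustrative assumptions:

```python
def build_drugset_library(fingerprints, min_size=5):
    """Map each fingerprint bit to the drugs that have it set, dropping small sets."""
    library = {}
    for drug, bits in fingerprints.items():
        for position, bit in enumerate(bits):
            if bit == 1:
                library.setdefault('PubChem' + str(position), set()).add(drug)
    # Keep only terms with at least min_size drugs
    return {term: sorted(drugs) for term, drugs in library.items()
            if len(drugs) >= min_size}

# Toy fingerprints with 3 bits each; the keys stand in for InChI Keys
toy_fingerprints = {
    'KEY-A': [1, 0, 1],
    'KEY-B': [1, 1, 0],
    'KEY-C': [1, 0, 1],
}
toy_library = build_drugset_library(toy_fingerprints, min_size=2)
print(toy_library)
```

With `min_size=2`, the bit set in only one toy drug is filtered out, mirroring the role of the threshold of 5 in the notebook.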

In [25]:
gmt_formatter(drugsetlibrary, 'PubChem_fingerprints_drugsetlibrary.gmt')
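
`gmt_formatter` comes from the project's `export_script` module, whose source is not shown in this notebook. As an assumption about what such a helper might do, the sketch below writes one tab-separated line per term in the common GMT layout: term name, a description field (left empty here), then the set members. The function name `write_gmt` and the empty description column are illustrative and may differ from the actual helper:

```python
import os
import tempfile

def write_gmt(library, path):
    """Hypothetical GMT writer: one line per term, fields separated by tabs."""
    with open(path, 'w') as handle:
        for term, drugs in library.items():
            # Layout assumed here: term, empty description, then the members
            handle.write('\t'.join([term, ''] + list(drugs)) + '\n')

toy_library = {'PubChem0': ['KEY-A', 'KEY-B'], 'PubChem2': ['KEY-A', 'KEY-C']}
gmt_path = os.path.join(tempfile.mkdtemp(), 'toy_drugsetlibrary.gmt')
write_gmt(toy_library, gmt_path)

with open(gmt_path) as handle:
    gmt_lines = handle.read().splitlines()
print(gmt_lines)
```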

In [26]:
library_counts(drugsetlibrary)

13379 unique drugs
669 unique association terms
1735873 unique associations
2594.727952167414 average drugs per term
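
The figures above are consistent with simple counts over the library dictionary: 1735873 associations divided by 669 terms gives about 2594.73 drugs per term. A hedged sketch of how such a summary could be computed; the function `summarize_library` is hypothetical, since the real `library_counts` lives in `export_script` and is not shown here:

```python
def summarize_library(library):
    """Hypothetical summary: count drugs, terms, and drug-term associations."""
    unique_drugs = set()
    n_associations = 0
    for drugs in library.values():
        members = set(drugs)                # ignore duplicate entries within a term
        unique_drugs.update(members)
        n_associations += len(members)
    n_terms = len(library)
    return {
        'unique_drugs': len(unique_drugs),
        'unique_terms': n_terms,
        'unique_associations': n_associations,
        'average_drugs_per_term': n_associations / n_terms if n_terms else 0.0,
    }

summary = summarize_library({
    'PubChem0': ['KEY-A', 'KEY-B', 'KEY-C'],
    'PubChem2': ['KEY-A', 'KEY-C'],
})
print(summary)
```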
