## DrugBank Small Molecule Carriers Drug-Set Library
### Drug-set labels: Genes
#### ALL DATABASES ACCESSED 01/01/20
##### Author : Eryk Kropiwnicki | eryk.kropiwnicki@icahn.mssm.edu

In [1]:
import json
import pandas as pd
from collections import defaultdict
import csv
import os

In [2]:
os.chdir('../../../scripts')
from export_script import *
from gene_resolver import *
os.chdir('../notebooks/Drugbank/Small molecules')

### Matching Small Molecule Names to Entrez Gene Symbols
#### Input File : drugbank_small_molecule_carrier_polypeptide_ids.csv (https://www.drugbank.ca/releases/latest#protein-identifiers)
#### Downloaded on 01/01/2020

In [4]:
# Import carrier gene names, species, and the DrugBank IDs matched to each carrier #
df = pd.read_csv('input/drugbank_small_molecule_carrier_polypeptide_ids.csv',
                 usecols = ['Gene Name', 'Species', 'Drug IDs'])

In [5]:
df.head()

Unnamed: 0,Gene Name,Species,Drug IDs
0,F8,Humans,DB09130
1,F5,Humans,DB09130
2,SERPINA6,Humans,DB00180; DB00240; DB00253; DB00324; DB00394; D...
3,FXYD2,Humans,DB09479
4,RBP3,Humans,DB06755


In [6]:
# Dropping all non-human gene names #
df = df[df['Species'].str.contains('Humans', na = False)]

In [7]:
len(df)

92

### Validating genes using lookup table
#### Lookup table generated from ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia

In [8]:
gene_resolver(df, columnName = 'Gene Name')

In [9]:
len(df)

92

In [10]:
df.head()

Unnamed: 0,Gene Name,Species,Drug IDs,Approved Symbol
0,F8,Humans,DB09130,F8
1,F5,Humans,DB09130,F5
2,SERPINA6,Humans,DB00180; DB00240; DB00253; DB00324; DB00394; D...,SERPINA6
3,FXYD2,Humans,DB09479,FXYD2
4,RBP3,Humans,DB06755,RBP3
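
The body of `gene_resolver` is not shown in this notebook. As a rough sketch of what lookup-based validation can look like, the example below maps symbols and aliases to an approved symbol and drops names that cannot be resolved. The lookup rows, the `Symbol` and `Synonyms` column names, and the `symbol_map` helper are illustrative assumptions, not the actual implementation.

```python
import pandas as pd

# Hypothetical lookup rows (illustrative, not copied from the NCBI file):
# an approved symbol plus pipe-separated aliases
lookup = pd.DataFrame({'Symbol': ['SERPINA6', 'ALB'],
                       'Synonyms': ['CBG', 'HSA']})

# Map every alias to its approved symbol
symbol_map = {}
for symbol, synonyms in zip(lookup['Symbol'], lookup['Synonyms']):
    for alias in synonyms.split('|'):
        symbol_map.setdefault(alias, symbol)

# Approved symbols map to themselves and take priority over aliases
for symbol in lookup['Symbol']:
    symbol_map[symbol] = symbol

genes = pd.DataFrame({'Gene Name': ['CBG', 'ALB', 'NOTAGENE']})
genes['Approved Symbol'] = genes['Gene Name'].map(symbol_map)

# Names without a match get NaN and are dropped
resolved = genes.dropna(subset=['Approved Symbol'])
print(resolved)
```

In this notebook the row count is 92 both before and after resolution, so no carrier gene names were dropped at this step.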


In [11]:
# Splitting "; " separated drug IDs into separate rows #
df_carrier = pd.DataFrame(df['Drug IDs'].str.split('; ').tolist(), index = df['Approved Symbol']).stack()
df_carrier = df_carrier.reset_index()[[0, 'Approved Symbol']]
df_carrier.columns = ['DrugBank ID','Gene']

In [12]:
df_carrier.head()

Unnamed: 0,DrugBank ID,Gene
0,DB09130,F8
1,DB09130,F5
2,DB00180,SERPINA6
3,DB00240,SERPINA6
4,DB00253,SERPINA6
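
For comparison, the same split-into-rows step can be sketched with `Series.str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). The toy frame below is an assumption that only mirrors the notebook's column names.

```python
import pandas as pd

# Toy frame mirroring the notebook's columns (values are illustrative)
toy = pd.DataFrame({
    'Approved Symbol': ['F8', 'SERPINA6'],
    'Drug IDs': ['DB09130', 'DB00180; DB00240'],
})

# Split each "; "-separated string into a list, then emit one row per list element
toy_long = (
    toy.assign(**{'DrugBank ID': toy['Drug IDs'].str.split('; ')})
       .explode('DrugBank ID')[['DrugBank ID', 'Approved Symbol']]
       .rename(columns={'Approved Symbol': 'Gene'})
       .reset_index(drop=True)
)
print(toy_long)
```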


### Associate each DrugBank ID with InChI Key

In [13]:
# Import Drugbank mapping file
drugbank_mapping = pd.read_csv('../../../metadata/drugmonizome_metadata.tsv', sep = '\t',
                               usecols = ['DrugBank ID', 'Standard InChI Key'])

In [14]:
drugbank_mapping.head()

Unnamed: 0,DrugBank ID,Standard InChI Key
0,DB00006,OIRCOABEOLEUMC-GEJPAHFPSA-N
1,DB00007,GFIJNRVAKGFPGQ-LIJARHBVSA-N
2,DB00014,BLCLNMBMMGCOAS-URPVMXJPSA-N
3,DB00027,NDAYQJDHGXTBJL-MWWSRJDJSA-N
4,DB00035,NFLWUMRGJYTJIN-PNIOQBSNSA-N


In [15]:
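# Inner join on the shared 'DrugBank ID' column; IDs without an InChI Key in the mapping are dropped #
df_carrier = df_carrier.merge(drugbank_mapping, on = 'DrugBank ID')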

In [16]:
df_carrier.head(3)

Unnamed: 0,DrugBank ID,Gene,Standard InChI Key
0,DB09130,F8,RYGMFSIKBFXOCR-UHFFFAOYSA-N
1,DB09130,F5,RYGMFSIKBFXOCR-UHFFFAOYSA-N
2,DB09130,ALB,RYGMFSIKBFXOCR-UHFFFAOYSA-N
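
Because the merge is an inner join, DrugBank IDs that are missing from the mapping file disappear without any message. One way to see which IDs would be lost is a left join with `indicator=True`, sketched here on toy data. The ID `DB99999` and the `KEY-A`/`KEY-B` values are placeholders, not real identifiers.

```python
import pandas as pd

# Toy stand-ins for df_carrier and drugbank_mapping (placeholder values)
pairs = pd.DataFrame({'DrugBank ID': ['DB09130', 'DB00180', 'DB99999'],
                      'Gene': ['F8', 'SERPINA6', 'ALB']})
mapping = pd.DataFrame({'DrugBank ID': ['DB09130', 'DB00180'],
                        'Standard InChI Key': ['KEY-A', 'KEY-B']})

# A left join keeps every pair; the _merge column records whether a match was found
checked = pairs.merge(mapping, on='DrugBank ID', how='left', indicator=True)

# IDs present in the pairs but absent from the mapping
unmatched = checked.loc[checked['_merge'] == 'left_only', 'DrugBank ID'].tolist()

# Rows an inner join would have kept
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(unmatched)
```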


In [17]:
# Creating parallel lists of gene names and InChI Keys #
genes = df_carrier['Gene'].tolist()
drugs = df_carrier['Standard InChI Key'].tolist()

### Creating drugsetlibrary and exporting

In [18]:
# Each gene appears once per associated drug, so gene names repeat across the (gene, InChI Key) pairs #
# Pair each gene with its InChI Key, then collect all InChI Keys for a gene under a single dictionary key #

id_dict = tuple(zip(genes, drugs))

drugsetlibrary = defaultdict(list)
for k, v in id_dict:
    drugsetlibrary[k].append(v)

In [19]:
# Deduplicating each drug set and removing all terms paired with fewer than 5 unique drugs #
drugsetlibrary = {k:list(set(v)) for k,v in drugsetlibrary.items() if len(set(v))>=5}
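
To make the grouping and thresholding concrete, here is a small worked example on toy pairs. It uses a threshold of 2 instead of 5 so the effect is visible; the gene and key values are illustrative, and `min_size` is a name introduced only for this sketch.

```python
from collections import defaultdict

# Toy (gene, InChI Key) pairs; the duplicate ('ALB', 'K1') pair counts once
pairs = [('ALB', 'K1'), ('ALB', 'K1'), ('ALB', 'K2'), ('F8', 'K3')]
min_size = 2  # the notebook uses 5

# Collect a set of keys per gene so duplicates are removed while grouping
library = defaultdict(set)
for gene, key in pairs:
    library[gene].add(key)

# Keep only genes with at least min_size unique drugs
filtered = {gene: sorted(keys) for gene, keys in library.items()
            if len(keys) >= min_size}
print(filtered)
```

The threshold applies to unique drugs: `ALB` has three pairs but only two distinct keys, and `F8` is removed because it has a single drug.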

In [20]:
os.chdir('../../../data/Drugbank')

In [21]:
gmt_formatter(drugsetlibrary, 'Drugbank_smallmolecule_carrier_drugsetlibrary.gmt')
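
`gmt_formatter` comes from `export_script` and its body is not shown here. Assuming it writes the common GMT layout (term, a description field, then the set members, all tab-separated, one term per line), a minimal writer might look like the sketch below. `write_gmt` is a hypothetical name, and leaving the description field empty is an assumption.

```python
import io

def write_gmt(library, handle):
    """Write one line per term: term, empty description, then members, tab-separated."""
    for term, members in library.items():
        handle.write('\t'.join([term, '', *members]) + '\n')

# Write to an in-memory buffer so the sketch does not touch the file system
buffer = io.StringIO()
write_gmt({'ALB': ['KEY-A', 'KEY-B']}, buffer)
gmt_text = buffer.getvalue()
print(repr(gmt_text))
```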

### Library counts

In [22]:
library_counts(drugsetlibrary)

458 unique drugs
14 unique association terms
627 unique associations
44.785714285714285 average drugs per term
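
`library_counts` is also imported from `export_script`. The sketch below shows one plausible way to compute the same four figures; `summarize_library` is a hypothetical stand-in, not the original function. It is consistent with the output above in that the reported average equals associations divided by terms (627 / 14).

```python
def summarize_library(library):
    """Return summary counts for a {term: [drug, ...]} library."""
    unique_drugs = {drug for members in library.values() for drug in members}
    n_terms = len(library)
    # One association per unique (term, drug) pair
    n_associations = sum(len(set(members)) for members in library.values())
    return {
        'unique drugs': len(unique_drugs),
        'unique association terms': n_terms,
        'unique associations': n_associations,
        'average drugs per term': n_associations / n_terms if n_terms else 0.0,
    }

summary = summarize_library({'A': ['x', 'y'], 'B': ['y']})
print(summary)
```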
