## Filtering important compounds

This notebook covers data filtering and cleaning. The data file lists too many compounds and interactions to handle comfortably in a simple project, so I will trim the set down to those interactions for which data is easy to find on ChEMBL.

In [1]:
# load the file as a pandas dataframe
import pandas as pd

df = pd.read_csv('../data/DtcDrugTargetInteractions_Jan_2019.csv', low_memory=True)



In [22]:
# check the columns
df.columns

Index(['compound_id', 'standard_inchi_key', 'compound_name', 'synonym',
       'target_id', 'target_pref_name', 'gene_names', 'wildtype_or_mutant',
       'mutation_info', 'pubmed_id', 'standard_type', 'standard_relation',
       'standard_value', 'standard_units', 'activity_comment',
       'ep_action_mode', 'assay_format', 'assaytype', 'assay_subtype',
       'inhibitor_type', 'detection_tech', 'assay_cell_line',
       'compound_concentration_value', 'compound_concentration_value_unit',
       'substrate_type', 'substrate_relation', 'substrate_value',
       'substrate_units', 'assay_description', 'title', 'journal', 'doc_type',
       'annotation_comments'],
      dtype='object')

Most of these columns are not of interest here, so I will work with a small subset, which is also quicker to handle.

In [2]:
# Subset the following columns for now, as they represent the drug-target interaction

df_set = df[['compound_id', 'standard_inchi_key', 
       'target_id', 'gene_names', 'wildtype_or_mutant', 'standard_type', 'standard_relation',
       'standard_value', 'standard_units']]
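
If memory or load time becomes a problem, one option is to read only the needed columns in the first place. The sketch below is not what this notebook does; it uses a tiny in-memory CSV (hypothetical rows, invented for illustration) so that it runs on its own, and relies on the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical miniature of the real CSV, used only so the example is self-contained.
csv_text = io.StringIO(
    "compound_id,compound_name,target_id,standard_type\n"
    "CHEMBL1,foo,P00001,KD\n"
    "CHEMBL2,bar,P00002,IC50\n"
)

wanted = ["compound_id", "target_id", "standard_type"]

# usecols tells the parser to keep only the listed columns and skip the rest.
small = pd.read_csv(csv_text, usecols=wanted)
print(small.columns.tolist())
```

With the real file, the same `usecols` list would be passed alongside the path used in the first cell.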

In [3]:
df_set.isna().sum() # check the null values associated with these columns to decide which columns to drop

compound_id            134222
standard_inchi_key      91914
target_id               14833
gene_names            1216689
wildtype_or_mutant    5805369
standard_type             350
standard_relation     2288713
standard_value         378702
standard_units         458412
dtype: int64
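
Raw null counts are hard to judge without the total row count. A minimal sketch, on a made-up toy frame rather than the real data, of reporting the fraction of missing values per column instead:

```python
import pandas as pd

# Toy frame with invented values; None marks a missing entry.
toy = pd.DataFrame({
    "compound_id": ["CHEMBL1", None, "CHEMBL3", "CHEMBL4"],
    "wildtype_or_mutant": [None, None, None, "mutated"],
})

# isna() gives a boolean frame; the mean of booleans is the fraction that is True.
null_fraction = toy.isna().mean()
print(null_fraction)
```

Applied to `df_set`, this would show at a glance which columns are mostly empty.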

In [4]:
df_set.head()

| | compound_id | standard_inchi_key | target_id | gene_names | wildtype_or_mutant | standard_type | standard_relation | standard_value | standard_units |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL3545284 | NaN | Q9Y4K4 | MAP4K5 | NaN | KDAPP | = | 19155.14 | NM |
| 1 | CHEMBL3545284 | NaN | Q9Y478 | PRKAB1 | NaN | KDAPP | = | 1565.72 | NM |
| 2 | CHEMBL3545284 | NaN | Q9Y2U5 | MAP3K2 | NaN | KDAPP | = | 746.77 | NM |
| 3 | CHEMBL3545284 | NaN | Q9Y2K2 | SIK3 | NaN | KDAPP | = | 13558.67 | NM |
| 4 | CHEMBL3545284 | NaN | Q9UL54 | TAOK2 | NaN | KDAPP | = | 2220.98 | NM |


## Selecting concentration values

I will keep only those records whose standard type is reported as a Kd or Ki value (the filter below matches 'KD', 'Kd', 'KI' and 'Ki'). I will also keep only records whose units are given as NM, MM or NMOL/L, and whose standard relation is '='. I also don't want to work with mutated proteins, so I will filter those out as well.

In [5]:
# filtering with boolean masks: keep Kd/Ki rows, then the accepted units, then exact ('=') measurements

df_set = df_set[(df_set.standard_type == 'KD') | (df_set.standard_type == 'Kd')| 
                (df_set.standard_type == 'KI') | (df_set.standard_type == 'Ki')]

df_set = df_set[(df_set.standard_units == 'NM')|(df_set.standard_units == 'MM') | 
                (df_set.standard_units == 'NMOL/L')]

df_set = df_set[(df_set.standard_relation == '=')]
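
The chained `|` comparisons above can also be written with `Series.isin`, which tends to be easier to extend. The following is a sketch on a small invented frame, not on the real data; the list names `keep_types` and `keep_units` are my own.

```python
import pandas as pd

# Invented rows covering the cases the filter has to distinguish.
toy = pd.DataFrame({
    "standard_type": ["KD", "Ki", "IC50", "Kd", "KI"],
    "standard_units": ["NM", "NM", "NM", "%", "NMOL/L"],
    "standard_relation": ["=", ">", "=", "=", "="],
})

keep_types = ["KD", "Kd", "KI", "Ki"]
keep_units = ["NM", "MM", "NMOL/L"]

# A row is kept only if all three conditions hold.
mask = (
    toy["standard_type"].isin(keep_types)
    & toy["standard_units"].isin(keep_units)
    & (toy["standard_relation"] == "=")
)
filtered = toy[mask]
print(filtered)
```

Only rows 0 and 4 pass all three conditions; the others fail on relation, type or units respectively.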


In [8]:
# remove rows flagged as mutated; rows with no wildtype_or_mutant value are kept
df_set = df_set[(df_set.wildtype_or_mutant != 'mutated')]
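
Note that a `!=` comparison keeps rows where the value is missing, which matters here because most rows have no `wildtype_or_mutant` entry. A small sketch with invented values (the label 'wild_type' is a placeholder, not necessarily a value used in the data file):

```python
import numpy as np
import pandas as pd

# Invented status column: one mutated row, one placeholder label, one missing value.
status = pd.Series(["mutated", "wild_type", np.nan])

# NaN != "mutated" evaluates to True, so the missing row survives the filter.
kept = status[status != "mutated"]
print(kept)
```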

In [9]:
# drop rows where every column is null
# (reassigning instead of inplace=True avoids modifying a slice of the original frame)

df_set = df_set.dropna(how='all')

In [10]:
# remove rows that are exact duplicates across all selected columns
df_set = df_set.drop_duplicates()

In [11]:
df_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 443163 entries, 179 to 5980809
Data columns (total 9 columns):
compound_id           414330 non-null object
standard_inchi_key    432589 non-null object
target_id             437324 non-null object
gene_names            314300 non-null object
wildtype_or_mutant    12229 non-null object
standard_type         443163 non-null object
standard_relation     443163 non-null object
standard_value        443163 non-null float64
standard_units        443163 non-null object
dtypes: float64(1), object(8)
memory usage: 33.8+ MB


In [12]:
# look at how many unique compounds are present

df_set.compound_id.nunique()

164662

In [17]:
# drop rows with a null target_id or gene_names;
# there is no point keeping a compound that has no target data associated with it

df_set = df_set.dropna(axis=0, subset=['target_id', 'gene_names'])
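
The two `dropna` calls in this notebook behave differently: `how='all'` removes a row only when every column is null, while `subset=[...]` removes a row when any of the listed columns is null. A sketch on an invented three-row frame:

```python
import pandas as pd

# Invented rows: complete, missing target only, and entirely empty.
toy = pd.DataFrame({
    "compound_id": ["CHEMBL1", "CHEMBL2", None],
    "target_id": ["P00001", None, None],
    "gene_names": ["ABC1", "XYZ2", None],
})

# Drops only the last row, where every column is null.
all_null_dropped = toy.dropna(how="all")

# Drops any row missing target_id or gene_names (the last two rows here).
target_required = toy.dropna(subset=["target_id", "gene_names"])

print(len(all_null_dropped), len(target_required))
```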

In [18]:
# collect the distinct compound IDs that survived the filtering
imp_comp = set(df_set.compound_id.values)

In [19]:
len(imp_comp)

121540
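
One thing to keep in mind when comparing this count with `nunique()`: `nunique()` ignores missing values by default, whereas a Python `set` built from the raw values keeps a NaN entry if one is present. Since `compound_id` still contains nulls at this point, the set may include one. A sketch with invented IDs:

```python
import numpy as np
import pandas as pd

# Invented IDs with one missing value and one repeat.
ids = pd.Series(["CHEMBL1", "CHEMBL2", np.nan, "CHEMBL1"])

n_unique = ids.nunique()        # NaN is not counted
as_set = set(ids.values)        # NaN is kept as a set member
clean_set = set(ids.dropna())   # NaN removed before building the set

print(n_unique, len(as_set), len(clean_set))
```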

In [21]:
# Save these compounds so their chemical information can be extracted in a different notebook.
with open('../cleaned_data/imp_comp.txt', 'w') as f:
    for item in imp_comp:
        # skip a missing (NaN) ID, which would otherwise be written as the string 'nan'
        if pd.notna(item):
            f.write(f"{item}\n")
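
For completeness, here is a sketch of how the follow-up notebook might read the list back into a set. It writes a throwaway file in a temporary directory with two invented IDs so that it runs on its own; with the real data the path would be the one used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "imp_comp.txt")

    # Write a tiny stand-in file, one ID per line, in the same format as above.
    with open(path, "w") as f:
        for item in ["CHEMBL1", "CHEMBL2"]:
            f.write(f"{item}\n")

    # Read it back, stripping newlines and ignoring blank lines.
    with open(path) as f:
        loaded = {line.strip() for line in f if line.strip()}

print(sorted(loaded))
```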