### Aim: Merge data on current liver toxicity predictors with ToxCast predictors for 175 drugs **(& identify missing values for imputation)**

In this notebook, a dataset of 350 drugs populated with the current predictors from Falgun Shah/Minjun Chen's model of drug-induced liver injury (DILI) (Chen M et al., 2016, Hepatology; Falgun Shah et al., 2015) is merged with a dataset containing all 216 ToxCast predictors for 175 drugs.

After the merge is completed, any missing values in Minjun Chen's predictors will need to be imputed, either here or externally.

Note: modeling will be performed on a final dataset of 175 drugs that have both ToxCast predictors and current predictors.
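Because the two tables cover different numbers of drugs, it may help to check which drugs fail to match before trusting the merged table. The sketch below is not part of the original analysis: it uses two small toy tables with hypothetical values, assumes the shared key column is `chnm`, and relies on the `indicator` and `validate` options of `pd.merge` to label each row by its origin and to guard against duplicated drug names.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not the real data)
op = pd.DataFrame({
    "chnm": ["abacavir", "acarbose", "acetaminophen"],
    "logP": [1.2, -6.8, 0.46],
})
tp = pd.DataFrame({
    "chnm": ["abacavir", "acetaminophen", "acitretin"],
    "ADCY5": [0.0, 0.0, 0.0],
})

# Outer merge keeps every drug from both tables; indicator=True adds a
# "_merge" column with the values "left_only", "right_only" or "both".
# validate="one_to_one" raises an error if a drug name is duplicated.
check = pd.merge(op, tp, on="chnm", how="outer",
                 indicator=True, validate="one_to_one")

# Number of rows per origin
match_counts = check["_merge"].value_counts()
print(match_counts)

# Drugs found in only one table; a default (inner) merge would drop these
unmatched = check.loc[check["_merge"] != "both", "chnm"].tolist()
unmatched_sorted = sorted(unmatched)
print(unmatched_sorted)
```

If the list of unmatched names is longer than expected, spelling or capitalization differences in the drug names are a likely cause and worth inspecting before the inner merge is used for modeling.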

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [14]:
# IPython setting (not a magic command): display the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [15]:
# Read in the datasets to be merged

OP = pd.read_csv("RM_Cmax_DD_logP_MW.csv")  # OP = other (current) predictors, for 350 drugs

TP_bin = pd.read_csv("dili_targets_binary.csv")  # TP_bin = test predictors (ToxCast targets), for 175 drugs, in binary format (0/1)


OP.head()
TP_bin.head()

Unnamed: 0,chnm,Classification,casn,code,CmaxStand,Molecular Weght,Daily dose,logP,ReactiveMetabolites
0,abacavir,MostDILI Drugs,136470-78-5,C136470785,14.8985,286.33232,600.0,1.2,Yes
1,acarbose,MostDILI Drugs,56180-94-0,C56180940,0.1502,645.60481,300.0,-6.8,
2,aceclofenac,LessDILIDrugs,89796-99-6,C89796996,26.4272,354.18472,200.0,4.48,
3,acetaminophen,MostDILI Drugs,103-90-2,C103902,132.3066,151.16255,3000.0,0.46,Yes
4,acetazolamide,MostDILI Drugs,59-66-5,C59665,134.9843,222.24543,750.0,-0.26,No


Unnamed: 0,chnm,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
0,abacavir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,acetaminophen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,acitretin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,albendazole,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,alclofenac,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
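# Inner merge (the pd.merge default) on drug name: keeps only drugs present in both tables
m1 = pd.merge(OP, TP_bin, on="chnm")

m1.head()

# Count missing values per column
m1.isnull().sum()

m1.to_csv('merge1.csv', index=False)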


Unnamed: 0,chnm,Classification,casn,code,CmaxStand,Molecular Weght,Daily dose,logP,ReactiveMetabolites,ADCY5,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
0,abacavir,MostDILI Drugs,136470-78-5,C136470785,14.8985,286.33232,600.0,1.2,Yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,acetaminophen,MostDILI Drugs,103-90-2,C103902,132.3066,151.16255,3000.0,0.46,Yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,acitretin,MostDILI Drugs,55079-83-9,C55079839,1.2805,326.42934,35.0,6.4,No,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,albendazole,MostDILI Drugs,54965-21-8,C54965218,0.9045,265.33139,400.0,2.7,Yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,alclofenac,MostDILI Drugs,22131-79-9,C22131799,600.0276,226.65623,1250.0,2.88,Yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


chnm                    0
Classification          0
casn                    0
code                    0
CmaxStand               0
Molecular Weght         0
Daily dose              9
logP                    0
ReactiveMetabolites    60
ADCY5                   0
ADORA1                  0
ADORA2                  0
ADORA2A                 0
ADRA1A                  0
ADRA1B                  0
ADRA2A                  0
ADRA2B                  0
ADRB1                   0
AHR                     0
AKT1                    0
AKT2                    0
AR                      0
ATAD5                   0
AVPR1                   0
BMPR2                   0
CACNA1C                 0
CASP5                   0
CCKAR                   0
CCL2                    0
CCL26                   0
                       ..
estradiol               0
estrogen                0
hmgbox                  0
hydroxypregnenelone     0
hydroxyprogesterone     0
intmrkr                 0
mmp                     0
nucsize     

### Conclusions/Next steps: 
The merged data above show 60 missing values in the ReactiveMetabolites column and 9 in the Daily dose column. After the data are exported, these missing values will be treated before any additional processing steps.
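The missing-value counts printed above (9 in `Daily dose`, 60 in `ReactiveMetabolites`) could be summarized and treated along the following lines. This is only a sketch on a toy frame with hypothetical values: the choice of median imputation for the dose and of an explicit `"Unknown"` category for reactive metabolites is an assumption made for illustration, not the method chosen in this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the merged column names (values are hypothetical)
m1_demo = pd.DataFrame({
    "chnm": ["drug_a", "drug_b", "drug_c", "drug_d"],
    "Daily dose": [600.0, np.nan, 200.0, 3000.0],
    "ReactiveMetabolites": ["Yes", np.nan, np.nan, "No"],
    "ADCY5": [0.0, 0.0, 1.0, 0.0],
})

# Report only the columns that actually contain missing values,
# which is easier to read than the full per-column listing
n_missing = m1_demo.isnull().sum()
missing_only = n_missing[n_missing > 0]
print(missing_only)

imputed = m1_demo.copy()

# Assumption: fill the numeric dose with the column median
dose_median = imputed["Daily dose"].median()
imputed["Daily dose"] = imputed["Daily dose"].fillna(dose_median)

# Assumption: keep missing Yes/No labels as their own "Unknown" category
# instead of guessing a value
imputed["ReactiveMetabolites"] = imputed["ReactiveMetabolites"].fillna("Unknown")

print(imputed)
```

Keeping an explicit "Unknown" level avoids asserting a Yes/No value for drugs where none was recorded. Whether that is appropriate depends on how the downstream model encodes this predictor.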