### Data prep - for integration of targets with other liver toxicity predictors

In this notebook, all "positive" targets (i.e., those with at least a medium NAS score, meaning -4 or higher) and "negative" targets (i.e., those with NAS values < -4 or absent from the imported dataset) are prepared for merging with other predictors of liver toxicity (Chen M et al., 2016; Shah F et al., 2015) for ML modeling.

I anticipate attempting variable reduction by PCA, given the large number of positive targets (216), prior to data integration/merge. The data will therefore be prepared in a "continuous" format (and not only a binary format) to facilitate PCA.
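As a rough illustration of the planned variable reduction, the sketch below runs PCA on a small wide table through a NumPy SVD. It is not the notebook's actual analysis: the toy values, the drug names, and the helper `pca_scores` are all hypothetical, and a dedicated PCA library could be used instead.

```python
import numpy as np
import pandas as pd

# Toy wide table (hypothetical values): rows are drugs, columns are targets,
# with -1000 standing in for "not perturbed" as in the continuous format
toy_wide = pd.DataFrame(
    {
        "AHR": [0.31, -1000.0, -1000.0, -1.56],
        "ESR1": [0.80, -1000.0, 0.45, -1000.0],
        "PGR": [-1000.0, 0.89, -1000.0, -1000.0],
    },
    index=["drug_a", "drug_b", "drug_c", "drug_d"],
)


def pca_scores(df, n_components):
    """Hypothetical helper: project rows of df onto the first principal components."""
    values = df.to_numpy(dtype=float)
    centered = values - values.mean(axis=0)  # center each target column
    # SVD of the centered matrix: rows of vt are the principal axes
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()  # share of variance per component
    scores = centered @ vt[:n_components].T
    cols = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=df.index, columns=cols), explained[:n_components]


scores, explained_ratio = pca_scores(toy_wide, n_components=2)
print(scores.round(3))
print(explained_ratio.round(3))
```

Because the -1000 sentinel lies far outside the observed NAS range, it will account for most of the variance in a table like this, so the choice of sentinel is worth checking before interpreting the components.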


In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [25]:
#IPython setting to print all output from a cell instead of only the last line
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [26]:
#Read in data from ebtc with test targets and their NAS scores; these are converted to wide format below for use in ML after concatenation with other DILI predictors

dili_targets=pd.read_csv("mndt_sl.csv")

dili_targets.head()

Unnamed: 0,Classification,chnm,tgt_abbr,testname,NAS
0,MostDILI Drugs,abacavir,AHR,TOX21_AhR_LUC_Agonist_Row256,0.307311
1,MostDILI Drugs,abacavir,ESR1,TOX21_ERa_LUC_BG1_Agonist_Row253,0.800209
2,MostDILI Drugs,acetaminophen,CD40,BSK_LPS_CD40_down_Row64,0.999917
3,MostDILI Drugs,acetaminophen,F3,BSK_LPS_TissueFactor_down_Row147,0.999924
4,MostDILI Drugs,acetaminophen,PGR,NVS_NR_hPR_Row590,0.891443


In [27]:
#Add a dummy column set to 1 for every row to indicate the target is perturbed (the original NAS values are kept)

dili_targets["NAS_dummy"]=1

dili_targets.head()

Unnamed: 0,Classification,chnm,tgt_abbr,testname,NAS,NAS_dummy
0,MostDILI Drugs,abacavir,AHR,TOX21_AhR_LUC_Agonist_Row256,0.307311,1
1,MostDILI Drugs,abacavir,ESR1,TOX21_ERa_LUC_BG1_Agonist_Row253,0.800209,1
2,MostDILI Drugs,acetaminophen,CD40,BSK_LPS_CD40_down_Row64,0.999917,1
3,MostDILI Drugs,acetaminophen,F3,BSK_LPS_TissueFactor_down_Row147,0.999924,1
4,MostDILI Drugs,acetaminophen,PGR,NVS_NR_hPR_Row590,0.891443,1


### Convert positive/negative targets data to wide format 

As part of this step, all NAS values will be converted to either a binary format or a continuous format using the following schema:<br>
*a) binary format: all targets with a medium or higher NAS value will be converted to 1, else 0 <br>
 b) continuous format: for non-perturbed targets (i.e., NAS < -4), a dummy value will be inserted (-1000 for not perturbed)*

The wide format will simplify the data merge with other predictors of liver toxicity. Each of the formats (binary/continuous, with or without variable reduction) is expected to be evaluated for its effect on liver toxicity modeling.
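The two formats can be sketched on a toy table before applying them to the real data. The rows and drug names below are hypothetical; only the column names `chnm`, `tgt_abbr`, and `NAS` come from the imported dataset. The sketch assumes, as in this notebook, that every row in the long table is a perturbed target, so presence in the table is what marks a target as positive.

```python
import pandas as pd

# Toy long-format table (hypothetical rows) with the same column names as dili_targets
toy_long = pd.DataFrame(
    {
        "chnm": ["drug_a", "drug_a", "drug_b"],
        "tgt_abbr": ["AHR", "ESR1", "AHR"],
        "NAS": [0.31, 0.80, -1.56],
    }
)

# One row per drug, one column per target; drug/target pairs not in the table become NaN
wide = toy_long.pivot(index="chnm", columns="tgt_abbr", values="NAS")

# a) binary: 1 where a NAS value is present (target perturbed), else 0
toy_binary = wide.notna().astype(int)

# b) continuous: keep the NAS value, use -1000 as the "not perturbed" sentinel
toy_continuous = wide.fillna(-1000)

print(toy_binary)
print(toy_continuous)
```

Deriving the binary table from `notna()` is an alternative to pivoting a separate dummy column; both give the same 0/1 pattern when every row of the long table is a positive target.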

#### Binary formatted dili targets

In [28]:
#Convert data to wide-format to show all targets in columns and their "dummy" NAS values underneath each column

dili_targets_binary=dili_targets.pivot(index="chnm", columns="tgt_abbr", values="NAS_dummy")

dili_targets_binary.head()

tgt_abbr,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,AHR,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
chnm,,,,,,,,,,,,,,,,,,,,,
abacavir,,,,,,,,,,1.0,...,,,,,,,,,,
acetaminophen,,,,,,,,,,,...,,,,,,,,,,
acitretin,,,,,,,,,,,...,,,,,,,,,,
albendazole,,,,,,,,,,1.0,...,,,,,,,,,,
alclofenac,,,,,,,,,,,...,,,,,,,,,,
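`DataFrame.pivot` raises a `ValueError` when an index/column pair occurs more than once, so the successful call above implies each (`chnm`, `tgt_abbr`) pair is unique in this dataset. If a future version of the input had several assays per target, a check like the following could be run first. The toy rows are hypothetical, and keeping the highest NAS per pair is an assumption, not a rule taken from this notebook.

```python
import pandas as pd

# Toy long-format table (hypothetical rows) with a repeated drug/target pair
toy_long = pd.DataFrame(
    {
        "chnm": ["drug_a", "drug_a", "drug_b"],
        "tgt_abbr": ["AHR", "AHR", "ESR1"],
        "NAS": [0.31, 0.75, 0.80],
    }
)

# Flag every row whose (chnm, tgt_abbr) pair occurs more than once
dup_mask = toy_long.duplicated(subset=["chnm", "tgt_abbr"], keep=False)
n_dup = int(dup_mask.sum())
print(f"rows in duplicated pairs: {n_dup}")

# pivot_table aggregates duplicates instead of raising;
# aggfunc="max" (keep the highest NAS per pair) is an assumed choice
wide = toy_long.pivot_table(
    index="chnm", columns="tgt_abbr", values="NAS", aggfunc="max"
)
print(wide)
```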


In [29]:
#Impute all missing values to 0 (indicates these targets are NOT perturbed)

dili_targets_binary=dili_targets_binary.fillna(0)

dili_targets_binary.head()

dili_targets_binary["ADORA1"].value_counts() #For 172 drugs, target is not perturbed, for 3 drugs target is perturbed

tgt_abbr,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,AHR,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
chnm,,,,,,,,,,,,,,,,,,,,,
abacavir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
acetaminophen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
acitretin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
albendazole,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
alclofenac,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


0.0    172
1.0      3
Name: ADORA1, dtype: int64

In [30]:
#Move chnm from the index back into a regular column (flattens the two-row header)

dili_targets_binary=dili_targets_binary.reset_index()
dili_targets_binary.head()

tgt_abbr,chnm,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
0,abacavir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,acetaminophen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,acitretin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,albendazole,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,alclofenac,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [31]:
#Select columns chnm through zf_yse (all columns here) and clear the row index name
dili_targets_binary=dili_targets_binary.loc[:, "chnm":"zf_yse"]
dili_targets_binary.index.name=None

dili_targets_binary.head()

tgt_abbr,chnm,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
0,abacavir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,acetaminophen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,acitretin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,albendazole,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,alclofenac,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
#export data as csv file
dili_targets_binary.to_csv('dili_targets_binary.csv', index=False)

#### Continuous formatted dili targets

In [42]:
#Convert data to wide-format to show all targets in columns and their NAS values underneath each column

dili_targets_continuous=dili_targets.pivot(index="chnm", columns="tgt_abbr", values="NAS")

dili_targets_continuous.head()


#total missing values in df
dili_targets_continuous.isnull().sum().sum()

dili_targets_continuous.shape

#total number of cells (175 drugs x 216 targets)
175*216

#fraction of cells that are missing
36773/37800

tgt_abbr,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,AHR,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
chnm,,,,,,,,,,,,,,,,,,,,,
abacavir,,,,,,,,,,0.307311,...,,,,,,,,,,
acetaminophen,,,,,,,,,,,...,,,,,,,,,,
acitretin,,,,,,,,,,,...,,,,,,,,,,
albendazole,,,,,,,,,,-1.562636,...,,,,,,,,,,
alclofenac,,,,,,,,,,,...,,,,,,,,,,


36773

(175, 216)

37800

0.9728306878306878

**About 97.3% of the cells (36,773 of 37,800) are missing values, indicating that the majority of drugs do not perturb most ToxCast targets!**
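The same sparsity figure can be computed directly from the table rather than by typing in the counts, and broken down per target and per drug. This is a minimal sketch on hypothetical toy values; on the real data `toy_wide` would be the pivoted table before any `fillna` call.

```python
import numpy as np
import pandas as pd

# Toy wide table (hypothetical values); NaN means the drug/target pair was absent
toy_wide = pd.DataFrame(
    {
        "AHR": [0.31, np.nan, -1.56, np.nan],
        "ESR1": [0.80, np.nan, np.nan, np.nan],
        "PGR": [np.nan, np.nan, np.nan, np.nan],
    },
    index=["drug_a", "drug_b", "drug_c", "drug_d"],
)

n_missing = int(toy_wide.isnull().sum().sum())  # total missing cells
n_cells = toy_wide.size                         # rows x columns
missing_fraction = n_missing / n_cells

# Number of drugs perturbing each target, and number of targets hit by each drug
hits_per_target = toy_wide.notna().sum(axis=0)
hits_per_drug = toy_wide.notna().sum(axis=1)

print(f"{n_missing} of {n_cells} cells missing ({missing_fraction:.1%})")
print(hits_per_target)
print(hits_per_drug)
```

Targets with very few hits (such as `PGR` in the toy table, with none) carry little information and are natural candidates for the variable reduction step.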

In [36]:
#Impute all missing values to -1000 (indicates these targets are NOT perturbed)

dili_targets_continuous=dili_targets_continuous.fillna(-1000)

dili_targets_continuous.head()

dili_targets_continuous["ADORA1"].value_counts() #For 172 drugs, target is not perturbed, for 3 drugs target is perturbed



tgt_abbr,ADCY5,ADORA1,ADORA2,ADORA2A,ADRA1A,ADRA1B,ADRA2A,ADRA2B,ADRB1,AHR,...,zf_jaw,zf_nc,zf_pe,zf_snou,zf_somi,zf_swim,zf_teratoscore,zf_tr,zf_trun,zf_yse
chnm,,,,,,,,,,,,,,,,,,,,,
abacavir,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,0.307311,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0
acetaminophen,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0
acitretin,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0
albendazole,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1.562636,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0
alclofenac,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,...,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0,-1000.0


-1000.000000    172
 0.715462         1
 0.932850         1
 0.868277         1
Name: ADORA1, dtype: int64

In [None]:
#Move chnm from the index back into a regular column (flattens the two-row header)

dili_targets_continuous=dili_targets_continuous.reset_index()
dili_targets_continuous.head()

In [None]:
#Select columns chnm through zf_yse (all columns here) and clear the row index name
dili_targets_continuous=dili_targets_continuous.loc[:, "chnm":"zf_yse"]
dili_targets_continuous.index.name=None

dili_targets_continuous.head()

In [None]:
#export data as csv file
dili_targets_continuous.to_csv('dili_targets_continuous.csv', index=False)

#### Conclusions/Next steps:
dili_targets_continuous will be merged with the Falgun Shah/MinJun Chen variables (after imputation of any missing values) and processed for modeling.
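A minimal sketch of how that merge might look, assuming the other predictor tables also identify drugs by a `chnm` column. The table `toy_other` and its column `other_predictor` are hypothetical placeholders, not the actual Shah/Chen variables.

```python
import pandas as pd

# Toy stand-in for the exported targets table (hypothetical values)
toy_targets = pd.DataFrame(
    {"chnm": ["drug_a", "drug_b", "drug_c"], "AHR": [0.31, -1000.0, -1.56]}
)

# Hypothetical table of other liver toxicity predictors, keyed by the same drug name
toy_other = pd.DataFrame(
    {"chnm": ["drug_a", "drug_b", "drug_d"], "other_predictor": [600.0, 4000.0, 25.0]}
)

# Outer merge keeps drugs found in only one table; indicator=True adds a
# _merge column recording which side each row came from
merged = toy_targets.merge(toy_other, on="chnm", how="outer", indicator=True)
print(merged)

# Drugs found in only one table end up with NaN on the other side and need imputation
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["chnm", "_merge"]])
```

Inspecting the `_merge` column before imputing makes it easy to see whether mismatches come from drugs that are truly absent from one source or from spelling differences in the drug names.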