### Descriptions:
COHORT:
* Conor's original cohort: 52,314
* Remove admit years of 2014 and 2020: 52,298
* Remove 8,805 csn that are not full code or are under 18 years old: 43,493
* Remove 173 csn without vital signs and GCS, and 29 csn with only GCS and no other vital signs: 43,291
* Remove 201 csn without any labels during the hospital stay: 43,090
* Further remove 82 patients without any labels at the 24-hour mark: 43,008 (**cohort_labels.csv**)
* **final**: remove all patients without a complete set of VS: 41,654
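The attrition counts above can be cross-checked with a few lines of arithmetic. This is only a sketch: the step names are made up for illustration, and the counts for the first and last steps (16 and 1,354) are derived from the stated totals rather than given in the notes.

```r
# Remaining encounter counts after each step, copied from the list above.
# The element names are hypothetical labels, not names used in the data.
remaining <- c(
  original          = 52314,
  drop_2014_2020    = 52298,
  full_code_adult   = 43493,
  has_vitals        = 43291,
  has_any_label     = 43090,
  has_24h_label     = 43008,
  complete_first_vs = 41654
)

# Encounters removed at each step = previous count - current count.
removed <- setNames(head(remaining, -1) - tail(remaining, -1),
                    names(remaining)[-1])
print(removed)
```

The derived differences agree with the removals stated in the list (8,805; 173 + 29; 201; 82).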

On BQ, **cohort_final_with_labels_complete1VS** is our most up-to-date cohort (41,654 unique encounters); *cohort_labels* from Tiffany is marked as to_keep on BQ.

JOIN all features together with the final cohort:

Inputs: cohort_final (processed in R2 notebook), cohort_demo_final (R1), vitals_clean (R2), labs_clean (R3)

* Combine the cohort with demographics, vitals, and labs in long format
* Use the final cohort of 41,654 encounters: only patients who are full code, 18 years or older, and have at least one complete first set of VS
* Demographics include missingness indicators (ESI, height, and weight) and one-hot encoding for categorical variables (gender and race)
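To illustrate the last bullet, here is a minimal sketch of a missingness indicator plus one-hot encoding on toy data. The column names `ESI_i`, `delta_ESI` and `race.White` follow the names that appear later in this notebook, but reading them as "imputed value" and "missing flag" is an assumption, and median imputation is used here purely as a placeholder (the actual imputation was done in the R1 notebook).

```r
# Toy data: one encounter is missing its ESI.
toy <- data.frame(
  ESI  = c(3, NA, 2),
  race = factor(c("White", "Asian", "White"))
)

# Missingness indicator: 1 if ESI was missing, 0 otherwise.
toy$delta_ESI <- as.integer(is.na(toy$ESI))

# Placeholder imputation (median); the real method is not shown here.
toy$ESI_i <- ifelse(is.na(toy$ESI), median(toy$ESI, na.rm = TRUE), toy$ESI)

# One-hot encode race with base R; "- 1" drops the intercept so every
# level gets its own 0/1 column.
race_onehot <- model.matrix(~ race - 1, data = toy)
colnames(race_onehot) <- sub("^race", "race.", colnames(race_onehot))

toy_encoded <- cbind(toy[, c("ESI_i", "delta_ESI")], race_onehot)
print(toy_encoded)
```

Keeping the indicator next to the imputed value lets a downstream model distinguish observed values from filled-in ones.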

Output:
* **features_demos_vitals_labs.csv**: 3,308,906 rows in long format with anon_id, pat_enc_csn_id_coded, label_24hr_recent, admit_time, feature_type, features, values, and time (NA for demographics, recorded time for vitals, result time for labs)

### Importing R libraries

In [1]:
library(caret) # import this before glmnet to avoid rlang version problem
library(xgboost)
library(data.table)
library(tidyverse)
library(lubridate)
library(Matrix)
# library(slam)
library(glmnet)
library(bit64)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
library(mice)
options(repr.matrix.max.rows=200, repr.matrix.max.cols=40)

Loading required package: lattice

Loading required package: ggplot2

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ tibble  3.0.3     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0
✔ purrr   0.3.4     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()   masks data.table::between()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::first()     masks data.table::first()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::last()      masks data.table::last()
✖ purrr::lift()      mas

### Reload all datasets:
* demographics with ESI
* vitals with GCS (note that this data has 43,320 rows; for ESI imputation, those with only GCS are removed as well, leaving 43,291)
* labs, still containing 9999999 values

In [33]:
# nrow(cohort_vitals_clean %>% filter(anon_id == "JCd49287")) #23
# cohort_demo_clean %>% filter(anon_id == "JCd49287")

cohort <- read.csv("./Data/cohort_final.csv")
# demos <-  read.csv("./Data/cohort_demo_completed.csv")
demos <- read.csv("./Data/cohort_demo_final.csv") # updated demographic with latest cohort
vitals <- read.csv("./Data/vitals_clean.csv")
labs <- read.csv("./Data/labs_clean.csv")

nrow(cohort) # cohort final 41654
nrow(demos)
nrow(vitals) #1,274,314
nrow(labs) #1,368,351

nrow(demos %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
# nrow(demos %>% select(pat_enc_csn_id_coded) %>% distinct())

nrow(vitals %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
# nrow(vitals %>% select(pat_enc_csn_id_coded) %>% distinct())

nrow(labs %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) # 39226
# nrow(labs %>% select(pat_enc_csn_id_coded) %>% distinct())

# nrow(demos %>% distinct(pat_enc_csn_id_coded) %>% filter(pat_enc_csn_id_coded %in% cohort$pat_enc_csn_id_coded))
# nrow(vitals %>% distinct(pat_enc_csn_id_coded) %>% filter(pat_enc_csn_id_coded %in% cohort$pat_enc_csn_id_coded))
# nrow(labs %>% distinct(pat_enc_csn_id_coded) %>% filter(pat_enc_csn_id_coded %in% cohort$pat_enc_csn_id_coded))

nrow(cohort %>% distinct(pat_enc_csn_id_coded) %>% filter(pat_enc_csn_id_coded %in% demos$pat_enc_csn_id_coded))
nrow(cohort %>% distinct(pat_enc_csn_id_coded) %>% filter(pat_enc_csn_id_coded %in% vitals$pat_enc_csn_id_coded))
nrow(cohort %>% distinct(pat_enc_csn_id_coded) %>% filter(pat_enc_csn_id_coded %in% labs$pat_enc_csn_id_coded))
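The three checks above count how many cohort encounters appear in each feature table. A small sketch of the same idea on toy data follows; `n_covered` is a hypothetical helper, not something defined elsewhere in this notebook.

```r
library(dplyr)

cohort_toy <- data.frame(pat_enc_csn_id_coded = c(1, 2, 3, 4))
labs_toy   <- data.frame(pat_enc_csn_id_coded = c(1, 1, 3, 9),
                         values = c(4.1, 3.9, 140, 7))

# Hypothetical helper: number of cohort encounters with >= 1 row in `tbl`.
n_covered <- function(cohort, tbl, key = "pat_enc_csn_id_coded") {
  sum(unique(cohort[[key]]) %in% tbl[[key]])
}

covered <- n_covered(cohort_toy, labs_toy)

# anti_join lists the cohort encounters that have no row in the table.
missing_ids <- anti_join(cohort_toy, labs_toy, by = "pat_enc_csn_id_coded")

covered
missing_ids
```

Listing the unmatched encounters with `anti_join` is often more useful than the count alone, since it shows which encounters would contribute no labs.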

In [34]:
colnames(cohort)
colnames(demos)
colnames(vitals)
colnames(labs)

### OLD --- when Tiffany had a list of patients who had no labels throughout their hospital stays

### Remove hospitalized patients with missing levels of care
* 202: no labels at all for the entire hospital stay, smallest set
* 806: no levels at the admission level, adt table
* 136: no levels of care at 0 - 12hrs
* 82: no levels of care at 24hrs

In [7]:
# JCdcafca and 131187786922, 20015518, 0, 2016-05-15 20:57:00+00:00 
# in the no_labels cohort (and Conor's cohort) but not in the updated cohort
noinco <- cohort[cohort$pat_enc_csn_id_coded %in% no_labels$pat_enc_csn_id_coded,]
head(no_labels[!no_labels$pat_enc_csn_id_coded %in% noinco$pat_enc_csn_id_coded, ])

Unnamed: 0_level_0,int64_field_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time
Unnamed: 0_level_1,<int>,<fct>,<dbl>,<int>,<int>,<fct>
1,101,JCcb8554,131178016708,18546350,0,2016-02-25 03:23:00+00:00
2,332,JCcbb116,131065778333,14009021,0,2015-02-11 09:17:00+00:00
3,412,JCcbc455,131112222787,16567238,0,2015-08-21 20:12:00+00:00
4,529,JCcbe464,131080007580,14770205,0,2015-04-20 21:01:00+00:00
5,690,JCcc0607,131109972005,16518630,0,2015-09-05 05:35:00+00:00
6,830,JCcc2366,131126219970,16980216,0,2015-09-15 10:01:00+00:00


In [90]:
# remove patients in the final cohort with missing labels:
cohort <- anti_join(cohort, no_labels, by = c("anon_id", "pat_enc_csn_id_coded"))
nrow(cohort) # 43291 - 202
head(cohort)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<fct>
1,JC29f8ad2,131274729058,40679773,0,2019-08-31 12:52:00
2,JC29f8ad3,131278291027,42992239,0,2019-10-05 23:48:00
3,JC29f8b9c,131266787806,36261582,0,2019-05-05 01:07:00
4,JC29f8beb,131264387263,34626013,0,2019-03-15 03:35:00
5,JC29f8beb,131279241689,43527040,0,2019-11-27 15:29:00
6,JC29f8bef,131280937356,44544574,0,2019-11-30 10:35:00


### OK to continue here

In [35]:
summary(vitals %>% group_by(features) %>% select(values))
summary(labs %>% group_by(features) %>% select(values))

Adding missing grouping variables: `features`



  features          values      
 DBP  :203896   Min.   :  3.00  
 GCS  : 44442   1st Qu.: 37.00  
 Pulse:260114   Median : 82.00  
 RR   :206469   Mean   : 75.19  
 SBP  :203937   3rd Qu.: 99.00  
 SpO2 :218899   Max.   :419.00  
 Temp :136530                   

Adding missing grouping variables: `features`



    features           values       
 Glucose:  49639   Min.   : -30.00  
 K      :  43392   1st Qu.:   3.46  
 Hct    :  43307   Median :  12.40  
 Na     :  43275   Mean   :  38.90  
 Hgb    :  43261   3rd Qu.:  37.00  
 Cl     :  42976   Max.   :9655.00  
 (Other):1102501                    

### Check cohort patients who are not in the vital signs table
Note that all NA values were dropped from the vital signs. An alternative is to keep them and impute values at the same time points in wide-format tables.
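As a rough sketch of that alternative, pivoting a long vitals table to wide format makes the gaps explicit as NA, one row per encounter and time point. The toy columns below (`csn`, `time`) are simplified stand-ins, and the imputation step itself is left open because the notebook does not specify one.

```r
library(tidyr)

vitals_toy <- data.frame(
  csn      = c(1, 1, 1),
  time     = c("09:00", "09:00", "10:00"),
  features = c("SBP", "DBP", "SBP"),
  values   = c(117, 70, 121),
  stringsAsFactors = FALSE
)

# One row per (csn, time); vitals not recorded at that time become NA.
vitals_wide <- pivot_wider(vitals_toy,
                           names_from  = features,
                           values_from = values)
print(vitals_wide)
```

Here DBP is NA at 10:00 because only SBP was recorded then; this is the kind of cell an imputation step would fill.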

In [36]:
demos <- demos %>% select(-c(SBP, DBP, Pulse, RR, SpO2, Temp))
colnames(demos)

In [48]:
dim(demos)
colnames(demos)
demo_long <- gather(demos, features, values, ESI_i:race.White, factor_key=TRUE) %>%
                mutate(feature_type = "demo") %>% select(-c(inpatient_data_id_coded, label_max24))
                
head(demo_long, n=1)
dim(demos)
nrow(demo_long) # nrow(demos) * number of gathered columns (41654 * 17 = 708,118)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<chr>
1,JC29f8ad2,131274729058,2019-08-31 12:52:00,ESI_i,3,demo
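The wide-to-long step can be seen on a toy table. This sketch mirrors the `gather()` call above with made-up rows; only the column names `ESI_i`, `age` and `race.White` are taken from the notebook.

```r
library(tidyr)
library(dplyr)

demos_toy <- data.frame(
  anon_id    = c("A", "B"),
  ESI_i      = c(3, 2),
  age        = c(52, 67),
  race.White = c(1, 0),
  stringsAsFactors = FALSE
)

# Stack every column from ESI_i through race.White into features/values;
# factor_key = TRUE keeps the original column order as factor levels.
demo_long_toy <- gather(demos_toy, features, values,
                        ESI_i:race.White, factor_key = TRUE) %>%
  mutate(feature_type = "demo")

print(demo_long_toy)
```

The long table has one row per encounter per gathered column, so its row count is the number of encounters times the number of gathered columns. In current tidyr, `gather()` is superseded by `pivot_longer()`, which could be used instead.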


In [49]:
summary(demo_long$values)
demo_long %>% group_by(features) %>% count()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   18.18    1.00  328.00 

features,n
<fct>,<int>
ESI_i,41654
delta_ESI,41654
gender,41654
age,41654
medis,41654
English,41654
Height_i,41654
delta_H,41654
Weight_i,41654
delta_W,41654


In [52]:
head(demo_long, n=1)
head(vitals, n=1)
head(labs, n=1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<chr>
1,JC29f8ad2,131274729058,2019-08-31 12:52:00,ESI_i,3,demo


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time,features,recorded_time,values,feature_type
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<fct>,<fct>,<fct>,<dbl>,<fct>
1,JCcb658e,131231466934,24822070,0,2017-06-24 12:56:00,SBP,2017-06-24 09:00:00+00:00,117,vitals


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time,admit_label,label_24hr_recent,label_12hr_recent,has_admit_label,first_label,first_label_time_since_admit,acute_to_critical_label,critical_to_acute_label,features,values,result_time,feature_type
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<fct>,<int>,<int>,<int>,<int>,<int>,<fct>,<int>,<int>,<fct>,<dbl>,<fct>,<fct>
1,JC29f8ad2,131274729058,40679773,0,2019-08-31 12:52:00+00:00,0,0,0,1,0,0,0,0,"Magnesium, Ser/Plas",2,2019-08-31 11:49:00,labs


In [53]:
# clean vitals and labs to merge
vitals <- vitals %>% select(anon_id, pat_enc_csn_id_coded, admit_time, features, values, feature_type, time=recorded_time) 
labs <- labs %>% select(anon_id, pat_enc_csn_id_coded, admit_time, features, values, feature_type, time=result_time)

In [63]:
head(labs %>% arrange(values))
head(labs %>% arrange(desc(values)))

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<fct>
1,JCce9b8f,131117935744,2015-09-04 11:45:00+00:00,Base,-30,labs,2015-09-04 01:22:00
2,JCcfa09a,131243846676,2018-01-21 10:31:00+00:00,Base,-30,labs,2018-01-21 01:22:00
3,JCd08ac9,131102861750,2015-07-03 18:21:00+00:00,Base,-30,labs,2015-07-03 10:35:00
4,JCd08ac9,131102861750,2015-07-03 18:21:00+00:00,Base,-30,labs,2015-07-03 08:41:00
5,JCdab3d0,131094116026,2015-06-29 01:26:00+00:00,Base,-30,labs,2015-06-28 22:38:00
6,JCdab3d0,131094116026,2015-06-29 01:26:00+00:00,Base,-30,labs,2015-06-28 13:25:00


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<fct>
1,JCd924e2,131073340176,2015-03-17 07:09:00+00:00,"AST (SGOT), Ser/Plas",9655,labs,2015-03-17 03:11:00
2,JCe207ac,131267177049,2019-04-04 03:57:00+00:00,"AST (SGOT), Ser/Plas",6729,labs,2019-04-03 21:44:00
3,JCdcd138,131279658341,2019-11-02 03:04:00+00:00,"AST (SGOT), Ser/Plas",6673,labs,2019-11-02 03:01:00
4,JCe53597,131238056628,2017-08-17 23:06:00+00:00,"AST (SGOT), Ser/Plas",6589,labs,2017-08-17 21:01:00
5,JCe53597,131238056628,2017-08-17 23:06:00+00:00,"AST (SGOT), Ser/Plas",6307,labs,2017-08-17 21:03:00
6,JCea5c17,131249268828,2018-04-15 05:44:00+00:00,"AST (SGOT), Ser/Plas",5985,labs,2018-04-15 03:33:00


In [55]:
# combine demos, vitals and labs, long format, with "time"
feat3 <- bind_rows(demo_long, vitals, labs)
feat3 <- as.data.frame(unclass(feat3))
nrow(feat3)
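Because `demo_long` has no `time` column, `bind_rows()` fills it with NA for the demographic rows. A small sketch with toy rows (values invented for illustration):

```r
library(dplyr)

demo_toy <- data.frame(anon_id = "A", features = "age", values = 52,
                       feature_type = "demo", stringsAsFactors = FALSE)
vitals_toy <- data.frame(anon_id = "A", features = "SBP", values = 117,
                         feature_type = "vitals",
                         time = "2019-08-31 11:00:00",
                         stringsAsFactors = FALSE)
labs_toy <- data.frame(anon_id = "A", features = "K", values = 4.1,
                       feature_type = "labs",
                       time = "2019-08-31 11:49:00",
                       stringsAsFactors = FALSE)

# Columns are matched by name; columns missing from a table become NA.
combined <- bind_rows(demo_toy, vitals_toy, labs_toy)
print(combined)
```

This is why the first row of the combined table shows an empty `time` for the demographic feature.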

In [56]:
head(feat3, n=1)
tail(feat3, n=1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<fct>
1,JC29f8ad2,131274729058,2019-08-31 12:52:00,ESI_i,3,demo,


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<fct>
3350756,JCec489c,131226296895,2017-05-08 01:56:00+00:00,Neut,7.71,labs,2017-05-08 00:58:00


In [58]:
feat3 %>% count(feature_type)

feature_type,n
<fct>,<int>
demo,708118
labs,1368351
vitals,1274287


In [61]:
feat3 %>% group_by(feature_type, features) %>% count()

feature_type,features,n
<fct>,<fct>,<int>
demo,ESI_i,41654
demo,delta_ESI,41654
demo,gender,41654
demo,age,41654
demo,medis,41654
demo,English,41654
demo,Height_i,41654
demo,delta_H,41654
demo,Weight_i,41654
demo,delta_W,41654


In [62]:
summary(feat3)

      anon_id        pat_enc_csn_id_coded               admit_time     
 JCe8f38d :   3464   Min.   :1.311e+11    2015-06-28 23:10:00:    551  
 JC29fe299:   3214   1st Qu.:1.312e+11    2016-01-19 03:57:00:    484  
 JC2a0d68b:   2595   Median :1.312e+11    2019-06-08 21:23:00:    479  
 JCe228ac :   2517   Mean   :1.312e+11    2015-11-22 18:04:00:    452  
 JCdd32fa :   2450   3rd Qu.:1.313e+11    2019-06-14 02:30:00:    413  
 JCe22af4 :   2429   Max.   :1.313e+11    2015-06-30 02:30:00:    404  
 (Other)  :3334087                        (Other)            :3347973  
    features           values        feature_type    
 Pulse  : 260114   Min.   : -30.00   demo  : 708118  
 SpO2   : 218899   1st Qu.:   2.25   labs  :1368351  
 RR     : 206469   Median :  26.00   vitals:1274287  
 SBP    : 203937   Mean   :  48.32                   
 DBP    : 203896   3rd Qu.:  89.00                   
 Temp   : 136530   Max.   :9655.00                   
 (Other):2120911                              

In [64]:
# remember the labs and vitals still contain patients who have no other vital signs except for a GCS
nrow(feat3 %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
nrow(feat3 %>% select(pat_enc_csn_id_coded) %>% distinct())

### Explore -- GCS and ESI -- No need to redo
* GCS was not used to impute ESI because few encounters have a GCS score
* Encounters that have a GCS but no other VS are therefore excluded from the data
* We checked these GCS-only encounters, and they all have an ESI
* We still exclude them: if some lacked an ESI, handling the include/exclude cases would take too many extra steps

### Join with the final cohort


In [66]:
nrow(feat3)
nrow(feat3 %>% distinct(anon_id, pat_enc_csn_id_coded))
head(feat3, n=1)
nrow(cohort)
head(cohort, n=1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<fct>
1,JC29f8ad2,131274729058,2019-08-31 12:52:00,ESI_i,3,demo,


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time,admit_label,label_24hr_recent,label_12hr_recent,has_admit_label,first_label,first_label_time_since_admit,acute_to_critical_label,critical_to_acute_label
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<fct>,<int>,<int>,<int>,<int>,<int>,<fct>,<int>,<int>
1,JC29f8ad2,131274729058,40679773,0,2019-08-31 12:52:00+00:00,0,0,0,1,0,0,0,0


In [67]:
cohort <- cohort %>% mutate(admit_time = ymd_hms(admit_time))
feat3 <- feat3 %>% mutate(admit_time = ymd_hms(admit_time))
head(cohort, n=1)
head(feat3, n=1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time,admit_label,label_24hr_recent,label_12hr_recent,has_admit_label,first_label,first_label_time_since_admit,acute_to_critical_label,critical_to_acute_label
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<dttm>,<int>,<int>,<int>,<int>,<int>,<fct>,<int>,<int>
1,JC29f8ad2,131274729058,40679773,0,2019-08-31 12:52:00,0,0,0,1,0,0,0,0


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<dttm>,<fct>,<dbl>,<fct>,<fct>
1,JC29f8ad2,131274729058,2019-08-31 12:52:00,ESI_i,3,demo,
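Parsing matters here because the raw `admit_time` strings come in two formats (with and without a `+00:00` suffix), and the join that follows uses `admit_time` as one of its keys. A short sketch of why the parsed values match while the raw strings do not:

```r
library(lubridate)

raw_with_offset    <- "2019-08-31 12:52:00+00:00"
raw_without_offset <- "2019-08-31 12:52:00"

# As plain strings these are different keys and would not join.
strings_equal <- raw_with_offset == raw_without_offset

# ymd_hms() returns UTC date-times, so both parse to the same instant.
with_offset    <- ymd_hms(raw_with_offset)
without_offset <- ymd_hms(raw_without_offset)
parsed_equal   <- with_offset == without_offset

strings_equal
parsed_equal
```

Note that this treats timestamps without an offset as UTC, which is what the notebook does implicitly; whether that matches the source system's clock is not verified here.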


In [68]:
final_feat3 <- left_join(cohort, feat3)
nrow(final_feat3) # 3,308,906
nrow(final_feat3 %>% distinct(anon_id, pat_enc_csn_id_coded))

Joining, by = c("anon_id", "pat_enc_csn_id_coded", "admit_time")
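Since `left_join(cohort, feat3)` keeps every cohort row, any feature rows for encounters outside the cohort are silently dropped. In the outputs, the total goes from 3,350,756 to 3,308,906 rows, a difference of 41,850 that falls entirely on vitals (1,274,287 to 1,232,437). A sketch of how such a drop could be inspected on toy data, using explicit join keys:

```r
library(dplyr)

cohort_toy <- data.frame(anon_id = c("A", "B"),
                         pat_enc_csn_id_coded = c(1, 2),
                         label = c(0, 1),
                         stringsAsFactors = FALSE)
feat_toy <- data.frame(anon_id = c("A", "A", "C"),
                       pat_enc_csn_id_coded = c(1, 1, 3),
                       features = c("SBP", "age", "SBP"),
                       values = c(117, 52, 130),
                       stringsAsFactors = FALSE)

keys <- c("anon_id", "pat_enc_csn_id_coded")

joined    <- left_join(cohort_toy, feat_toy, by = keys)
dropped   <- anti_join(feat_toy, cohort_toy, by = keys)  # features outside the cohort
unmatched <- anti_join(cohort_toy, feat_toy, by = keys)  # cohort rows with no features

nrow(joined)
dropped
unmatched
```

Naming the keys in `by` also avoids relying on the automatic "Joining, by = ..." choice, which changes whenever the two tables gain or lose a shared column.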



In [74]:
nrow(final_feat3 %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
nrow(final_feat3 %>% select(pat_enc_csn_id_coded) %>% distinct())
nrow(final_feat3 %>% select(anon_id) %>% distinct())

In [75]:
head(final_feat3, n=1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time,admit_label,label_24hr_recent,label_12hr_recent,has_admit_label,first_label,first_label_time_since_admit,acute_to_critical_label,critical_to_acute_label,features,values,feature_type,time
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<dttm>,<int>,<int>,<int>,<int>,<int>,<fct>,<int>,<int>,<fct>,<dbl>,<fct>,<fct>
1,JC29f8ad2,131274729058,40679773,0,2019-08-31 12:52:00,0,0,0,1,0,0,0,0,ESI_i,3,demo,


In [76]:
final_feat3 %>% count(feature_type)

feature_type,n
<fct>,<int>
demo,708118
labs,1368351
vitals,1232437


In [78]:
final_feat3 %>% group_by(feature_type, features) %>% count()

feature_type,features,n
<fct>,<fct>,<int>
demo,ESI_i,41654
demo,delta_ESI,41654
demo,gender,41654
demo,age,41654
demo,medis,41654
demo,English,41654
demo,Height_i,41654
demo,delta_H,41654
demo,Weight_i,41654
demo,delta_W,41654


In [79]:
summary(final_feat3)

      anon_id        pat_enc_csn_id_coded inpatient_data_id_coded
 JCe8f38d :   3464   Min.   :1.311e+11    Min.   :13616753       
 JC29fe299:   3138   1st Qu.:1.312e+11    1st Qu.:18137027       
 JC2a0d68b:   2585   Median :1.312e+11    Median :25721698       
 JCe228ac :   2517   Mean   :1.312e+11    Mean   :26400576       
 JCdd32fa :   2450   3rd Qu.:1.313e+11    3rd Qu.:33134355       
 JCdc9c9c :   2372   Max.   :1.313e+11    Max.   :45698377       
 (Other)  :3292380                                               
  label_max24       admit_time                   admit_label   
 Min.   :0.0000   Min.   :2015-01-01 08:24:00   Min.   :0.00   
 1st Qu.:0.0000   1st Qu.:2016-01-15 04:15:00   1st Qu.:0.00   
 Median :0.0000   Median :2017-08-10 02:26:00   Median :0.00   
 Mean   :0.1763   Mean   :2017-07-04 16:50:32   Mean   :0.15   
 3rd Qu.:0.0000   3rd Qu.:2018-11-25 09:19:00   3rd Qu.:0.00   
 Max.   :1.0000   Max.   :2019-12-31 22:00:00   Max.   :1.00   
                        

In [80]:
final_feat3 <- final_feat3 %>% select(anon_id, pat_enc_csn_id_coded, label_24hr_recent, admit_time,
                                     feature_type, features, values, time)
head(final_feat3)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,label_24hr_recent,admit_time,feature_type,features,values,time
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<dttm>,<fct>,<fct>,<dbl>,<fct>
1,JC29f8ad2,131274729058,0,2019-08-31 12:52:00,demo,ESI_i,3,
2,JC29f8ad2,131274729058,0,2019-08-31 12:52:00,demo,delta_ESI,0,
3,JC29f8ad2,131274729058,0,2019-08-31 12:52:00,demo,gender,1,
4,JC29f8ad2,131274729058,0,2019-08-31 12:52:00,demo,age,52,
5,JC29f8ad2,131274729058,0,2019-08-31 12:52:00,demo,medis,0,
6,JC29f8ad2,131274729058,0,2019-08-31 12:52:00,demo,English,1,


In [81]:
tail(final_feat3)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,label_24hr_recent,admit_time,feature_type,features,values,time
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<dttm>,<fct>,<fct>,<dbl>,<fct>
3308901,JCec489c,131226296895,1,2017-05-08 01:56:00,labs,TBili,0.5,2017-05-08 01:13:00
3308902,JCec489c,131226296895,1,2017-05-08 01:56:00,labs,Eos,0.14,2017-05-08 00:58:00
3308903,JCec489c,131226296895,1,2017-05-08 01:56:00,labs,Lymp,1.25,2017-05-08 00:58:00
3308904,JCec489c,131226296895,1,2017-05-08 01:56:00,labs,Basos,0.1,2017-05-08 00:58:00
3308905,JCec489c,131226296895,1,2017-05-08 01:56:00,labs,Mono,0.89,2017-05-08 00:58:00
3308906,JCec489c,131226296895,1,2017-05-08 01:56:00,labs,Neut,7.71,2017-05-08 00:58:00


In [82]:
write.csv(final_feat3, file = "./Data/features_demos_vitals_labs.csv", row.names=FALSE)
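A quick round-trip check after writing can confirm that row counts and missing `time` values survive the export. This sketch uses a temporary file and toy rows rather than the real output path.

```r
out <- data.frame(anon_id = c("A", "A"),
                  feature_type = c("demo", "vitals"),
                  values = c(52, 117),
                  time = c(NA, "2019-08-31 11:00:00"),
                  stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")
write.csv(out, file = path, row.names = FALSE)

# Read the file back and compare shape and missing values.
back <- read.csv(path, stringsAsFactors = FALSE)
same_shape <- nrow(back) == nrow(out) && identical(names(back), names(out))

same_shape
is.na(back$time)
```

`write.csv()` writes missing values as `NA`, which `read.csv()` reads back as NA by default, so demographic rows keep an empty `time`.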

### OLD --

In [101]:
# write.csv(cohort, file = "./Data/cohort_has_vs_hxlabels.csv", row.names=FALSE)

In [103]:
# read Tiffany's label
labels <- read.csv("./Data/labels.csv")
nrow(labels)

In [114]:
head(labels, n=1)
head(cohort, n=1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time,admit_label,X_24hr_recent_label,X_12hr_recent_label
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>
1,JCcbd0bd,131093156488,15900683,0,2015-06-19 05:29:00+00:00,,,


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,label_max24,admit_time
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<int>,<fct>
1,JC29f8ad2,131274729058,40679773,0,2019-08-31 12:52:00


In [112]:
summary(labels)
colnames(labels)

      anon_id      pat_enc_csn_id_coded inpatient_data_id_coded  label_max24   
 JCe8f38d :   37   Min.   :1.311e+11    Min.   :13616753        Min.   :0.000  
 JC29fe299:   36   1st Qu.:1.312e+11    1st Qu.:19479492        1st Qu.:0.000  
 JC2a0d68b:   35   Median :1.312e+11    Median :26829094        Median :0.000  
 JCdd32fa :   34   Mean   :1.312e+11    Mean   :27179266        Mean   :0.138  
 JCdc9c9c :   32   3rd Qu.:1.313e+11    3rd Qu.:33542568        3rd Qu.:0.000  
 JCcfe0cb :   29   Max.   :1.313e+11    Max.   :45698377        Max.   :1.000  
 (Other)  :43088                                                               
                     admit_time     admit_label     X_24hr_recent_label
 2015-07-29 05:24:00+00:00:    3   Min.   :0.0000   Min.   :0.00000    
 2018-06-26 01:39:00+00:00:    3   1st Qu.:0.0000   1st Qu.:0.00000    
 2018-09-09 00:41:00+00:00:    3   Median :0.0000   Median :0.00000    
 2019-10-05 01:25:00+00:00:    3   Mean   :0.1127   Mean   :0.09799    


In [115]:
new_cohort <- cohort %>% select(-admit_time) %>% left_join(labels) %>%
                    rename(label_24hr_recent = X_24hr_recent_label, label_12hr_recent = X_12hr_recent_label)
nrow(new_cohort)
summary(new_cohort)

Joining, by = c("anon_id", "pat_enc_csn_id_coded", "inpatient_data_id_coded", "label_max24")



      anon_id      pat_enc_csn_id_coded inpatient_data_id_coded
 JCe8f38d :   37   Min.   :1.311e+11    Min.   :13616753       
 JC29fe299:   36   1st Qu.:1.312e+11    1st Qu.:19477028       
 JC2a0d68b:   35   Median :1.312e+11    Median :26817183       
 JCdd32fa :   34   Mean   :1.312e+11    Mean   :27160060       
 JCdc9c9c :   32   3rd Qu.:1.313e+11    3rd Qu.:33520092       
 JCcfe0cb :   29   Max.   :1.313e+11    Max.   :45698377       
 (Other)  :42887                                               
  label_max24                         admit_time     admit_label    
 Min.   :0.0000   2015-07-29 05:24:00+00:00:    3   Min.   :0.0000  
 1st Qu.:0.0000   2018-06-26 01:39:00+00:00:    3   1st Qu.:0.0000  
 Median :0.0000   2018-09-09 00:41:00+00:00:    3   Median :0.0000  
 Mean   :0.1382   2019-10-05 01:25:00+00:00:    3   Mean   :0.1129  
 3rd Qu.:0.0000   2015-01-04 01:11:00+00:00:    2   3rd Qu.:0.0000  
 Max.   :1.0000   2015-01-10 00:04:00+00:00:    2   Max.   :1.0000  
     

In [116]:
# write.csv(new_cohort, "./Data/cohort_final_with_labels.csv", row.names = FALSE)

In [124]:
# sum() counts the matches; length() would only return the vector length
sum(cohort$pat_enc_csn_id_coded %in% new_cohort$pat_enc_csn_id_coded)
sum(new_cohort$pat_enc_csn_id_coded %in% cohort$pat_enc_csn_id_coded)
sum(new_cohort$pat_enc_csn_id_coded %in% feat3$pat_enc_csn_id_coded)
sum(feat3$pat_enc_csn_id_coded %in% new_cohort$pat_enc_csn_id_coded)
nrow(feat3)