### Description:

The code in this notebook comes from the second part of `2.3_vitalsigns_R.ipynb`.

It updates the cohort, reducing it to observations that have at least one COMPLETE set of vital signs.

*This notebook only demonstrates how the updated cohort was obtained; there is no need to re-run it.*

Inputs: `2_3_coh2_vitals.csv`

Outputs: 
- `2_3_coh2_vs1st.csv` -- first recorded value of each vital sign
- `1_3_cohort.csv` -- updated cohort, use for ESI imputation

### Importing R libraries

In [1]:
library(bigrquery)  # to query STARR-OMOP (stored in BigQuery) using SQL
library(tidyverse)
library(lubridate)

# library(data.table)
# library(Matrix)
# library(caret) # import this before glmnet to avoid rlang version problem
# library(glmnet)
# library(bit64)

# library(slam)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
# library(mice)

options(repr.matrix.max.rows=250, repr.matrix.max.cols=30)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




## Get the first set of vital signs - UPDATE COHORT
This produces an updated (smaller) cohort, `1_3_cohort`, containing only observations that have at least one complete set of vital signs.

The code blocks in this section are the same as those in:
- `1_cohort/1.5_cohort_complete1vitals_R.ipynb`
- second part of the `2.3_vitalsigns_R.ipynb`

In [114]:
# read inputs
datadir = "../../DataTD"
cohortdir = "../../OutputTD/1_cohort"
featuredir = "../../OutputTD/2_features"

cohort_vitals <- read.csv(file.path(featuredir, "2_3_coh2_vitals.csv"))
nrow(cohort_vitals)
length(unique(cohort_vitals$pat_enc_csn_id_coded))

In [29]:
head(cohort_vitals)

| | anon_id | pat_enc_csn_id_coded | inpatient_data_id_coded | template | features | units | recorded_time | values | feature_type |
|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` |
| 1 | JCe15c20 | 131254814765 | 31338316 | Vitals | RR | | 2018-08-22 16:00:00+00:00 | 16 | vitals |
| 2 | JCd0a8db | 131264538689 | 34735715 | DATA VALIDATE | SBP | | 2019-02-11 11:45:00+00:00 | 178 | vitals |
| 3 | JCdeb93e | 131266509496 | 36078047 | M/S VS | Temp | | 2019-04-03 22:45:00+00:00 | 37 | vitals |
| 4 | JCcbf217 | 131241202790 | 27087604 | DATA VALIDATE | SBP | | 2017-12-18 06:30:00+00:00 | 88 | vitals |
| 5 | JCd64572 | 131231133056 | 24776616 | Vitals | SBP | | 2017-06-21 20:45:00+00:00 | 163 | vitals |
| 6 | JCd884d0 | 131121019062 | 16807919 | DATA VALIDATE | SBP | | 2015-10-16 19:00:00+00:00 | 97 | vitals |


In [None]:
# first recorded value of each vital sign per encounter; takes ~8min to run
vs1st <- cohort_vitals %>% 
            select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, 
                   admit_time, label, recorded_time, features, values, feature_type) %>% 
            mutate(recorded_time = ymd_hms(recorded_time)) %>% 
            group_by(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, 
                     admit_time, label, features) %>%
            # keep the earliest recorded_time per group, ties included
            # (newer dplyr equivalent: slice_min(recorded_time, n = 1, with_ties = TRUE))
            top_n(n = -1, recorded_time) %>% 
            # average values that share the earliest time, giving one row per group
            summarise(first_val = mean(values, na.rm = TRUE)) %>% distinct()
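
To see what the tie handling does, here is a minimal sketch on hypothetical toy data. The `toy` table and its values are invented for illustration, and `slice_min()` is used in place of `top_n()`. Two SBP readings share the earliest timestamp, so both are kept and then averaged.

```r
library(dplyr)

# Hypothetical toy data (not from the cohort): one encounter with three SBP
# readings, two of them at the same earliest time, plus one RR reading.
toy <- tibble::tibble(
  pat_enc_csn_id_coded = c(1, 1, 1, 1),
  features      = c("SBP", "SBP", "SBP", "RR"),
  recorded_time = as.POSIXct(c("2020-01-01 08:00:00", "2020-01-01 08:00:00",
                               "2020-01-01 09:00:00", "2020-01-01 08:30:00"),
                             tz = "UTC"),
  values        = c(120, 130, 150, 16)
)

toy_first <- toy %>%
  group_by(pat_enc_csn_id_coded, features) %>%
  # with_ties = TRUE keeps both 08:00 SBP rows, as top_n(n = -1, ...) would
  slice_min(recorded_time, n = 1, with_ties = TRUE) %>%
  # tied rows collapse into their mean: one row per encounter and feature
  summarise(first_val = mean(values, na.rm = TRUE), .groups = "drop")

toy_first
```

With `with_ties = FALSE`, only one of the tied SBP rows would be kept (the first in row order), so the result would be that single value instead of the average.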

In [48]:
# each encounter/feature pair has a single row, because only the earliest recorded time was kept above
nrow(vs1st) #226510
nrow(vs1st %>% distinct(anon_id, pat_enc_csn_id_coded, features))
nrow(vs1st %>% distinct(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded)) #45613
nrow(vs1st %>% distinct(anon_id, pat_enc_csn_id_coded)) 
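
A more direct way to confirm that no encounter/feature pair appears twice is to count rows per key and look for counts above 1. The following is a minimal sketch on a hypothetical `toy_vs` table (invented values, not the real cohort); the same pattern could be applied to `vs1st`.

```r
library(dplyr)

# Hypothetical toy table with the same key columns as vs1st
toy_vs <- tibble::tibble(
  anon_id              = c("A", "A", "B"),
  pat_enc_csn_id_coded = c(1, 1, 2),
  features             = c("SBP", "RR", "SBP"),
  first_val            = c(120, 16, 110)
)

# Any key combination that occurs more than once would show up here
dups <- toy_vs %>%
  count(anon_id, pat_enc_csn_id_coded, features) %>%
  filter(n > 1)

nrow(dups)  # zero rows means every encounter/feature pair is unique
```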

In [92]:
write.csv(vs1st, file.path(featuredir, "2_3_coh2_vs1st.csv"), row.names=FALSE)

In [None]:
# note: cohort size dropped further
# get the cohort with the 1st complete set of VS for ESI imputation
cohort1vs <- vs1st %>% drop_na() %>% spread(features, first_val) %>% drop_na() 
colnames(cohort1vs)
nrow(cohort1vs %>% distinct(pat_enc_csn_id_coded))
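
The reshaping step can be illustrated on toy data. This sketch assumes a hypothetical `toy_long` table with invented values, and it uses `pivot_wider()`, the newer tidyr counterpart of `spread()`. Encounter 2 has no RR value, so it becomes `NA` after widening and `drop_na()` removes that encounter.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format table: encounter 1 has SBP and RR, encounter 2 only SBP
toy_long <- tibble::tibble(
  pat_enc_csn_id_coded = c(1, 1, 2),
  features             = c("SBP", "RR", "SBP"),
  first_val            = c(120, 16, 110)
)

toy_wide <- toy_long %>%
  # one column per vital sign; missing combinations are filled with NA
  pivot_wider(names_from = features, values_from = first_val) %>%
  # keep only encounters with every vital sign present
  drop_na()

toy_wide
```

Only encounters with a value in every vital-sign column survive, which is why the cohort shrinks at this step.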

In [91]:
# 44258
write.csv(cohort1vs, file.path(cohortdir, "1_3_cohort.csv"), row.names=FALSE)