### Description:

**This notebook builds an alternative cohort that is not used in the final modeling.** It can be disregarded.

Create the final cohort from the previous cohorts by applying different inclusion/exclusion criteria. This is done after all features with values have been processed and the labels have been created.

- `1_1_cohort`: 55170, original cohort queried from BQ: ER patients admitted as inpatients, 2015 - 2020
- `1_2_cohort`: 45794, after additional criteria: age >= 18, hospital encounters, and full code only
- `1_3_cohort`: 44258, observations with at least one complete set of vital signs
- *NEW*: used *1_3_cohort* to query and process labs, retaining only observations with at least 1 lab result
    - this was saved as cohort3L_withlabs.csv (size 41627), not necessary to keep
- `1_4_cohort`: 43980, using *1_3_cohort* to create labels from Tiffany's
- Next, 2 options:
    - One: use the *1_4_cohort* as the final cohort; this would be similar to the Triage project's cohort
    - Two: keep only the *1_4_cohort* observations that have lab results as the final cohort. We will go with this option.

Input files:
- `2_4_coh3_labs` (which used 1_3_cohort) 44258 --> 41627
- `1_4_cohort` (with labels from Tiffany) 44258 --> 43980

Output file: `1_5_cohort_final` (size 41366)
- the cohort went through: inclusion criteria --> (demographics) --> vital signs --> labels and labs --> final
- the final cohort is the intersection of the 1_3_cohort observations with at least 1 lab result and 1_4_cohort: 41366
    - 261 obs in labs but not in cohort with labels
    - 2614 obs in labels but not in labs (resulting from using 1_3_cohort after vital signs)
- *label* in labs' cohort (or any feature cohort since the original cohort) is renamed as *label_max24* in cohort4
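
The counts above can be cross-checked with simple arithmetic. This is a minimal sketch that uses only the numbers reported in this notebook; the variable names are illustrative and do not appear elsewhere in the project.

```r
# Distinct-encounter counts reported in this notebook
n_labs        <- 41627  # encounters with at least 1 lab result (2_4_coh3_labs)
n_labels      <- 43980  # encounters with labels (1_4_cohort)
n_labs_only   <- 261    # in labs but not in the cohort with labels
n_labels_only <- 2614   # in the cohort with labels but not in labs

# Both subtractions should give the size of the intersection
n_final_from_labs   <- n_labs - n_labs_only
n_final_from_labels <- n_labels - n_labels_only

stopifnot(n_final_from_labs == n_final_from_labels)
n_final_from_labs
```

Both routes give 41366, which matches the reported size of `1_5_cohort_final`.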

### Importing R libraries

In [1]:
library(bigrquery)  # to query STARR-OMOP (stored in BigQuery) using SQL
library(tidyverse)
library(lubridate)
# library(mice)
# library(VIM) # for missing data plot

# library(data.table)
# library(Matrix)
# library(caret) # import this before glmnet to avoid rlang version problem
# library(glmnet)
# library(bit64)

# library(slam)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
options(repr.matrix.max.rows=250, repr.matrix.max.cols=30)

"package 'bigrquery' was built under R version 4.0.5"
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.0

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union




### Set up and run queries
Do this on Nero GCP, as querying from a local computer takes much longer.

In [3]:
# CREDENTIALS depending on LOCATIONS:
# credential <- paste0("/home/", "minh084", "/.config/gcloud/application_default_credentials.json")

# local computer
# credential <- "C:/Users/User/AppData/Roaming/gcloud/application_default_credentials.json"

# Nero onprem
# credential <- "/home/minh084/.config/gcloud/application_default_credentials.json"

# Nero gcp notebook
credential <- "/home/jupyter/.config/gcloud/application_default_credentials.json"

project_id <- "som-nero-phi-jonc101"

Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = credential)
Sys.setenv(GCLOUD_PROJECT = project_id)
gargle::credentials_app_default()

NULL
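
Because the credential path differs by location, it may help to confirm that the file exists before setting the environment variable. The sketch below is an assumption about how such a guard could look; `set_adc_credentials` is a hypothetical helper, not part of `bigrquery` or `gargle`. As far as I understand, a `NULL` return from `gargle::credentials_app_default()` generally means no token was produced, so a guard like this can help narrow down why.

```r
# Hypothetical helper: point Application Default Credentials at a file,
# but only if that file actually exists.
set_adc_credentials <- function(path) {
  if (!file.exists(path)) {
    warning("Credential file not found: ", path)
    return(FALSE)
  }
  Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = path)
  TRUE
}

# Example with a path that does not exist: the helper returns FALSE
# and leaves the environment variable untouched.
ok <- suppressWarnings(set_adc_credentials("/no/such/credentials.json"))
ok
```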

In [4]:
library(DBI)
con <- dbConnect(
  bigrquery::bigquery(),
  project = project_id,
  dataset = "shc_core" #, billing = project_id
)
con 
dbListTables(con)

<BigQueryConnection>
  Dataset: som-nero-phi-jonc101.shc_core
  Billing: som-nero-phi-jonc101

In [2]:
# read inputs
datadir <- "../../DataTD"
cohortdir <- "../../OutputTD/1_cohort"
featuredir <- "../../OutputTD/2_features"

In [19]:
cohort4 <- read.csv(file.path(cohortdir, '1_4_cohort.csv'))
nrow(cohort4) #43980

labs0 <- read.csv(file.path(featuredir, '2_4_coh3_labs.csv'))
nrow(labs0) #1367422

In [22]:
nrow(labs0 %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) # 41627
length(unique(labs0$pat_enc_csn_id_coded)) # 41627 (same as above; labs have fewer encounters than the cohort)

nrow(cohort4 %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) # 43980
length(unique(cohort4$pat_enc_csn_id_coded)) # 43980 (same as above; the cohort has more encounters than labs)

length(setdiff(labs0$pat_enc_csn_id_coded, cohort4$pat_enc_csn_id_coded)) # 261
length(setdiff(cohort4$pat_enc_csn_id_coded, labs0$pat_enc_csn_id_coded)) # 2614
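
The set comparisons in the cell above can be wrapped in a small helper so that all overlap counts are reported together. This is a sketch on toy IDs; `compare_ids` is a hypothetical name and the IDs are made up, not real encounter numbers.

```r
# Hypothetical helper: summarise the overlap between two vectors of encounter IDs
compare_ids <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  c(
    n_a    = length(a),
    n_b    = length(b),
    both   = length(intersect(a, b)),
    only_a = length(setdiff(a, b)),
    only_b = length(setdiff(b, a))
  )
}

# Toy IDs (103 is repeated on purpose, as labs have many rows per encounter)
labs_ids   <- c(101, 102, 103, 103, 104)
cohort_ids <- c(102, 103, 104, 105, 106)

overlap <- compare_ids(labs_ids, cohort_ids)
overlap
```

By construction `n_a` equals `both + only_a` and `n_b` equals `both + only_b`, the same identity that ties 41627, 43980, 261, and 2614 to the final cohort size.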

In [23]:
head(cohort4, 1)
head(labs0, 1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,admit_time,label_max24,label_24hr_recent,admit_label,has_admit_label,died_within_24hrs,death_24hr_max_label,death_24hr_recent_label,first_label,first_label_minutes_since_admit,acute_to_critical_label_recent,critical_to_acute_label_recent,acute_to_critical_label_max,critical_to_acute_label_max
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>
1,JCd97296,131176000000.0,18290644,2016-02-06 22:31:00+00:00,0,0,,0,0,0,0,0,1325,0,0,0,0


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,admit_time,label,features,base_name,ord_value,values,reference_low,reference_high,reference_unit,result_in_range_yn,result_flag,result_time,feature_type
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,JCe33305,131063880385,13777312,2015-01-04 08:11:00+00:00,0,Lactate,LACWBL,2.2,2.2,,,mmol/L,,,2015-01-04 03:18:00+00:00,labs


In [24]:
# `label` in labs corresponds to `label_max24` in cohort4
# the distinct encounters from labs were saved as cohort3L_withlabs.csv
cohort <- labs0 %>%
    select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, admit_time) %>%
    distinct() %>%
    inner_join(cohort4)  # natural join on the four shared key columns
nrow(cohort) # 41366

Joining, by = c("anon_id", "pat_enc_csn_id_coded", "inpatient_data_id_coded", "admit_time")
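
The join above lets `inner_join()` pick the shared columns automatically. A sketch of naming the keys explicitly and checking that the join cannot multiply rows is shown below. It uses base `merge()` on toy data so that it runs without extra packages; with dplyr the equivalent would be passing `by = keys` to `inner_join()`. The data frames and the two-column key are illustrative assumptions (the notebook joins on four columns).

```r
# Two of the notebook's four join keys, kept short for the toy example
keys <- c("anon_id", "pat_enc_csn_id_coded")

# Toy stand-ins for the distinct lab encounters and the cohort with labels
lab_encounters <- data.frame(
  anon_id = c("A", "B", "C"),
  pat_enc_csn_id_coded = c(1, 2, 3),
  stringsAsFactors = FALSE
)
labelled <- data.frame(
  anon_id = c("B", "C", "D"),
  pat_enc_csn_id_coded = c(2, 3, 4),
  label_max24 = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Each key combination should appear once per side; otherwise the join duplicates rows
stopifnot(!anyDuplicated(lab_encounters[keys]), !anyDuplicated(labelled[keys]))

toy_cohort <- merge(lab_encounters, labelled, by = keys)  # inner join by default
nrow(toy_cohort)
```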



In [25]:
head(cohort, 1)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,admit_time,label_max24,label_24hr_recent,admit_label,has_admit_label,died_within_24hrs,death_24hr_max_label,death_24hr_recent_label,first_label,first_label_minutes_since_admit,acute_to_critical_label_recent,critical_to_acute_label_recent,acute_to_critical_label_max,critical_to_acute_label_max
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>
1,JCe33305,131063880385,13777312,2015-01-04 08:11:00+00:00,0,0,0,1,0,0,0,0,0,0,0,0,0


In [26]:
cohort4 %>% group_by(label_24hr_recent) %>% summarise(count = n(), percent = round(100*count/nrow(cohort4),2))
cohort %>% group_by(label_24hr_recent) %>% summarise(count = n(), percent = round(100*count/nrow(cohort),2))

`summarise()` ungrouping output (override with `.groups` argument)



label_24hr_recent,count,percent
<int>,<int>,<dbl>
0,39824,90.55
1,4156,9.45


`summarise()` ungrouping output (override with `.groups` argument)



label_24hr_recent,count,percent
<int>,<int>,<dbl>
0,37332,90.25
1,4034,9.75
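
The two summaries can be compared directly from the counts they report. This sketch recomputes the percentages and the number of encounters dropped per class when a lab result is required; `pct` is a hypothetical helper and the counts are copied from the outputs above.

```r
# Hypothetical helper: percent of each class from a vector of counts
pct <- function(counts) round(100 * counts / sum(counts), 2)

# Counts reported above for label_24hr_recent (classes 0 and 1)
before <- c(`0` = 39824, `1` = 4156)  # 1_4_cohort
after  <- c(`0` = 37332, `1` = 4034)  # final cohort, lab result required

pct(before)
pct(after)

# Encounters dropped by requiring a lab result, per class
dropped <- before - after
dropped
```

The dropped encounters sum to 2614, the same number found by the set difference between the cohort with labels and the labs, and the positive class share moves only from 9.45% to 9.75%.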


In [27]:
write.csv(cohort, file.path(cohortdir, "1_5_cohort_final.csv"), row.names=FALSE)
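
After writing, a quick read-back can confirm that the file has the expected shape. This is a sketch on a toy data frame written to a temporary file; the object names and the path are placeholders, not the notebook's actual output.

```r
# Toy data frame standing in for the final cohort
toy <- data.frame(anon_id = c("A", "B"), label_max24 = c(0L, 1L),
                  stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".csv")  # temporary path, not the notebook's output path
write.csv(toy, out_file, row.names = FALSE)

# Read back and confirm that the row count and column names survived
check <- read.csv(out_file, stringsAsFactors = FALSE)
stopifnot(nrow(check) == nrow(toy), identical(names(check), names(toy)))
```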