### Description: NEW
Refine new cohort4 to only include CSNs that are not already in the original cohort
- New cohort: shc_core_2021, from 04/2020 - 2021
- Original old cohort: shc_core, 2015 - 03/2020

However, some CSNs in the new cohort also appear in the old cohort, but with a different anon_id and admit time.
- Remove these overlapping CSNs from the new cohort. The result is clean as long as cohort4 is used for both the old and the new data.
- When using the demographics/HW data later in the 6.8 notebook, be aware that a few overlapping CSNs remain there, because cohort2 was used to get the demo/HW data. These overlapping CSNs were already removed from cohort3. They are removed from the new cohort4 again when merging with the old demo/HW data to prevent further issues, but this `cohort4` remains intact.
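The overlap described above can be illustrated on toy data. This is a minimal sketch in base R, not the notebook's actual pipeline: the IDs and admit times are made up, and `old`, `new`, `overlap` and `new_only` are hypothetical names. Joining on the CSN alone lets rows that disagree on `anon_id` or `admit_time` still pair up, so the disagreement can be inspected before the rows are dropped.

```r
# Toy data only: every ID and timestamp below is invented for illustration.
old <- data.frame(
  anon_id = c("JC001", "JC002", "JC003"),
  pat_enc_csn_id_coded = c(101, 102, 103),
  admit_time = c("2019-01-01 10:00:00", "2019-06-01 08:30:00", "2020-03-15 23:10:00"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  anon_id = c("JC900", "JC004"),
  pat_enc_csn_id_coded = c(103, 201),
  admit_time = c("2020-03-16 00:05:00", "2020-05-02 14:00:00"),
  stringsAsFactors = FALSE
)

# Join on the CSN only, so mismatching anon_id / admit_time rows still pair up.
overlap <- merge(old, new, by = "pat_enc_csn_id_coded", suffixes = c("_old", "_new"))
overlap$same_anon_id <- overlap$anon_id_old == overlap$anon_id_new
overlap$same_admit_time <- overlap$admit_time_old == overlap$admit_time_new
print(overlap)

# Keep only new-cohort rows whose CSN is absent from the old cohort.
new_only <- new[!new$pat_enc_csn_id_coded %in% old$pat_enc_csn_id_coded, ]
print(new_only)
```

In this toy case CSN 103 appears in both tables under a different `anon_id`, so it is dropped from `new_only`.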

**Input:**
- `1_4_cohort.csv` (original cohort4)
- `6_7_0_cohort4.csv` (new cohort4 with labels)


**Output:**
- `6_7_cohort4.csv`: new cohort4 with the overlapping CSNs removed (16,484)
- `6_7_cohort4_all.csv`: size 60,464. This is the *final cohort* combining 2015 - 03/2020 (43,980) and 04/2020 - 2021 (16,484) data


In [3]:
library(data.table)
library(tidyverse)
library(lubridate)
# library(Matrix)
# library(slam)
# library(bit64)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
options(repr.matrix.max.rows=200, repr.matrix.max.cols=30)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()   masks data.table::between()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::first()     masks data.table::first()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::last()      masks data.table::last()
✖ purrr::transpose() masks data.table::transpose()

In [4]:
datadir = "../../DataTD"
datadir6 = "../../DataTD/validation"
valdir = "../../OutputTD/6_validation"
cohortdir = "../../OutputTD/1_cohort"
featuredir = "../../OutputTD/2_features"

In [7]:
# old cohort up to 03/2020
cohort0 <- read.csv(file.path(cohortdir,  '1_4_cohort.csv'))
nrow(cohort0) #43980

# new cohort from 04/2020 - 2021
cohort6 <- read.csv(file.path(valdir,  '6_7_0_cohort4.csv'))
nrow(cohort6) #16700

In [8]:
# check overlapping cohort
length(setdiff(cohort0$pat_enc_csn_id_coded, cohort6$pat_enc_csn_id_coded))
length(setdiff(cohort6$pat_enc_csn_id_coded, cohort0$pat_enc_csn_id_coded)) 
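One detail worth keeping in mind when reading these counts: base R's `setdiff()` and `intersect()` return unique values, so the lengths count distinct CSNs rather than rows. A small sketch with made-up CSN vectors (`old_csn` and `new_csn` are hypothetical names, not objects from this notebook):

```r
# Toy CSN vectors; 103 is duplicated in the old vector on purpose.
old_csn <- c(101, 102, 103, 103)
new_csn <- c(103, 201, 202)

n_old_only <- length(setdiff(old_csn, new_csn))   # distinct CSNs only in old
n_new_only <- length(setdiff(new_csn, old_csn))   # distinct CSNs only in new
n_overlap  <- length(intersect(old_csn, new_csn)) # distinct CSNs in both

c(old_only = n_old_only, new_only = n_new_only, overlap = n_overlap)

# When the new CSNs are unique, the three counts are tied together:
stopifnot(length(unique(new_csn)) == n_new_only + n_overlap)
```

Checking the overlap count directly with `intersect()` is a cheap cross-check on the two `setdiff()` lengths.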

In [9]:
# CSNs in the new cohort (cohort6) that are not in the old cohort (cohort0)
csn_new <- setdiff(cohort6$pat_enc_csn_id_coded, cohort0$pat_enc_csn_id_coded)
head(csn_new)

In [10]:
# keep only the new-cohort rows whose CSN is not already in the old cohort
cohort6 <- cohort6 %>% filter(pat_enc_csn_id_coded %in% csn_new)
nrow(cohort6)
length(unique(cohort6$pat_enc_csn_id_coded))

In [12]:
cohort <- bind_rows(cohort0, cohort6)
nrow(cohort)
nrow(cohort %>% select(anon_id) %>% distinct()) # 60464
nrow(cohort %>% select(pat_enc_csn_id_coded) %>% distinct()) # 60464
nrow(cohort %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) # 60464
head(cohort)

| | anon_id | pat_enc_csn_id_coded | inpatient_data_id_coded | admit_time | label_max24 | label_24hr_recent | admit_label | has_admit_label | died_within_24hrs | death_24hr_max_label | death_24hr_recent_label | first_label | first_label_minutes_since_admit | acute_to_critical_label_recent | critical_to_acute_label_recent | acute_to_critical_label_max | critical_to_acute_label_max | previous_icu_visit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<int>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | JCd97296 | 131176042095 | 18290644 | 2016-02-06 22:31:00+00:00 | 0 | 0 | | 0 | 0 | 0 | 0 | 0 | 1325 | 0 | 0 | 0 | 0 | |
| 2 | JCcdc7e1 | 131064611420 | 13865299 | 2015-01-15 21:16:00+00:00 | 1 | 1 | 1.0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | |
| 3 | JCe3e5f4 | 131072326078 | 14296997 | 2015-01-28 11:12:00+00:00 | 1 | 1 | 1.0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | |
| 4 | JCdcfce9 | 131178712824 | 18633398 | 2016-03-04 17:01:00+00:00 | 1 | 1 | 1.0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | |
| 5 | JCdaaaa6 | 131211945620 | 22773101 | 2016-12-07 22:17:00+00:00 | 0 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 6 | JCe8840f | 131264906504 | 34995073 | 2019-02-14 22:22:00+00:00 | 0 | 0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
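The counts above can also be turned into hard checks so the cell fails loudly if an overlapping CSN slips through. A minimal sketch on toy data, using base R `rbind()` in place of `bind_rows()` (the toy tables share the same columns, so the two behave alike here); all IDs and the names `old`, `new_only` and `combined` are hypothetical:

```r
# Toy stand-ins for the old cohort and the de-duplicated new cohort.
old <- data.frame(
  anon_id = c("JC001", "JC002"),
  pat_enc_csn_id_coded = c(101, 102),
  stringsAsFactors = FALSE
)
new_only <- data.frame(
  anon_id = c("JC004", "JC001"),  # JC001 returns with a new encounter
  pat_enc_csn_id_coded = c(201, 202),
  stringsAsFactors = FALSE
)

combined <- rbind(old, new_only)

# Fail if rows were lost or if any CSN appears more than once.
stopifnot(
  nrow(combined) == nrow(old) + nrow(new_only),
  anyDuplicated(combined$pat_enc_csn_id_coded) == 0
)

n_patients   <- length(unique(combined$anon_id))
n_encounters <- length(unique(combined$pat_enc_csn_id_coded))
c(patients = n_patients, encounters = n_encounters)
```

In general a patient can have several encounters, so the number of distinct `anon_id` values can be smaller than the number of distinct CSNs, as it is in this toy example.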


In [14]:
write.csv(cohort6, file = file.path(valdir, "6_7_cohort4.csv"), row.names=FALSE) 
write.csv(cohort, file = file.path(valdir, "6_7_cohort4_all.csv"), row.names=FALSE) 
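A quick read-back is one way to confirm that a written file round-trips, in particular that the 12-digit CSNs come back unchanged. This sketch writes a toy table to a temporary file rather than to `valdir`; the data and the names `toy`, `path` and `back` are made up for illustration:

```r
# Toy table with CSN-sized numeric IDs (invented values).
toy <- data.frame(
  anon_id = c("JC001", "JC004"),
  pat_enc_csn_id_coded = c(131000000101, 131000000201),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the project directories is touched.
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# Read it back and compare columns, row count and CSN values.
back <- read.csv(path, stringsAsFactors = FALSE)
stopifnot(
  identical(names(back), names(toy)),
  nrow(back) == nrow(toy),
  all(back$pat_enc_csn_id_coded == toy$pat_enc_csn_id_coded),
  identical(back$anon_id, toy$anon_id)
)
```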