### Description:
Explore and clean lab variables

- First grab by lab_names (or base_names), then check their base_names (or lab_names), from all_labs file (labs that we went through to select)
- Then reverse to include lab_names with more than 1 base_names and base_names with more than 1 lab_names.
    - Based off Triage work of selecting lab_names, take those final lab_names and get their base_names
    - Add a few more base_names 
    - Use these base_names to run m6_labs.sql. This makes sure we have everything from Triage selection.
- Check the units and ranges of these labs to keep only valid ones.
- Combine lab_names/base_names and convert to same units as needed --> rename these groups
- Most urine related labs are all NA
- For ord_num_value = 9999999 of extreme cases on either end (hi/lo, panic hi/lo):
    - Remove those observations where no < > in ord_value: cancelled or N/A
    - Use ord_value to take bounds values. For example, if ord_value is <0.2 and ord_num_value is 9999999, then replace 9999999 with 0.2
    
**Changes:**
- Use shc_core_2021
- It would be shorter and better to use `base_name`, but still need to check because some same labs have different base_names.
- shc_core_2021 has lab_name `BUN, Arterial`, `BUN, Peripheral`, and `BUN, Venous` with base_name `BUN`. These lab_name weren't  in old data
- Added 'BUN, Arterial', 'BUN, Peripheral', and 'BUN, Venous' under lab_name list
- shc_core_2021/lab doesn't have any of the 9999999 values

**Input:**
- `labs_2021` (from SQL)
- `6_5_cohort3`

**Output:** 
- `6_coh3_labs`
- `6_cohort3_withlabs`: cohort information only from `6_coh3_labs`

### Importing R libraries

In [1]:
library(bigrquery)  # to query STARR-OMOP (stored in BigQuery) using SQL
library(tidyverse)
library(lubridate)
# library(mice)
# library(VIM) # for missing data plot

# library(data.table)
# library(Matrix)
# library(caret) # import this before glmnet to avoid rlang version problem
# library(glmnet)
# library(bit64)

# library(slam)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
options(repr.matrix.max.rows=250, repr.matrix.max.cols=30)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.3.3     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.1.2     [32m✔[39m [34mdplyr  [39m 1.0.6
[32m✔[39m [34mtidyr  [39m 1.1.3     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 1.4.0     [32m✔[39m [34mforcats[39m 0.5.1

── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




### Set up and run queries
Do this on Nero GCP as querying from a local computer takes much longer time

In [2]:
# CREDENTIALS depending on LOCATIONS:
# credential <- paste0("/home/", "minh084", "/.config/gcloud/application_default_credentials.json")

# local computer
# credential <- "C:/Users/User/AppData/Roaming/gcloud/application_default_credentials.json"

# Nero onprem
# credential <- "/home/minh084/.config/gcloud/application_default_credentials.json"

# Nero gcp notebook
credential <- "/home/jupyter/.config/gcloud/application_default_credentials.json"

project_id <- "som-nero-phi-jonc101"

Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = credential)
Sys.setenv(GCLOUD_PROJECT = project_id)
gargle::credentials_app_default()

NULL

In [3]:
library(DBI)
con <- dbConnect(
  bigrquery::bigquery(),
  project = project_id,
  dataset = "shc_core_2021" #, billing = project_id
)
con 
dbListTables(con)

<BigQueryConnection>
  Dataset: som-nero-phi-jonc101.shc_core_2021
  Billing: som-nero-phi-jonc101

### LABS
1. Use lab results as is. no manually process as with vital signs. 
2. Some notes about labs:
- Cannot trust "result_in_range_yn" --> remove this
- There is 1 incident of Glucose by Meter = hi in ord_value
- ord_value has text, blanks and #
- ord_num_value has NA, 9999999, #. The blanks from ord_value are NA in ord_num_value
- result_flag: "Abnormal" for text in ord_value, usually correct. However, when ord_value is N/A or < or >, the ord_num_value is 9999999.
3. Use ord_num_value and result_flag are good
4. All the urine labs start with X seem to be NA. Remove mag and phos as these are low count 1K

In [4]:
# read inputs
datadir = "../../DataTD/shc2021"
outdir = "../../OutputTD/shc2021"

cohort <- read.csv(file.path(outdir, "5_cohort3.csv"))
labs0 <- read.csv(file.path(datadir, "labs_2021.csv")) # 1386585

# check for unique CSNs and MRNs
nrow(cohort) #44258
nrow(labs0) #depending on which cohort_demo used to query labs 1863768
nrow(labs0 %>% select(pat_enc_csn_id_coded) %>% distinct()) # 41724 vs 41627
colnames(labs0)

In [5]:
labs0 <- labs0 %>% mutate(reference_unit = factor(reference_unit),
                          reference_low = factor(reference_low),
                          reference_high = factor(reference_high),
                          result_in_range_yn = factor(result_in_range_yn),
                          result_flag = factor(result_flag)) #%>% 

In [6]:
options(repr.matrix.max.rows=200, repr.matrix.max.cols=20)
# labs0 %>% group_by(base_name, lab_name) %>% count() %>% arrange(base_name)
labs0 %>% group_by(base_name, lab_name) %>% count() %>% arrange(lab_name)

base_name,lab_name,n
<chr>,<chr>,<int>
ALB,"Albumin, Ser/Plas",53271
ALKP,"Alk P'TASE, Total, Ser/Plas",53129
ALT,"ALT (SGPT), Ser/Plas",53128
AG,Anion Gap,57701
AGAP,"Anion Gap, ISTAT",2607
PCAGP,"Anion Gap, ISTAT",146
PCO2A,Arterial pCO2 for POC,13
PHA,Arterial pH for POC,13
PO2A,Arterial pO2 for POC,13
AST,"AST (SGOT), Ser/Plas",53127


In [7]:
labs0 %>% group_by(base_name) %>% count(sort=TRUE)

base_name,n
<chr>,<int>
GLU,71170
HCT,62773
HGB,62749
K,62537
,62422
CR,61607
CL,60716
BUN,60294
PLT,58353
WBC,58104


In [8]:
# from Triage
triage_labs <- c("Platelet count", "Total Bilirubin, Ser/Plas", "Total Bilirubin", 
"TROPONIN I", "Troponin I, POCT", "WBC", "WBC count", 
"Hematocrit", "Hct, ISTAT", "Hct(Calc), ISTAT", 
"Hct (Est)", "HCT, POC", 
"Hematocrit (Manual Entry) See EMR for details", 
"Hemoglobin", "Hgb(Calc), ISTAT", "Hgb, calculated, POC", "HgB", 
"Potassium, Ser/Plas", "Potassium, ISTAT",  "Potassium, Whole Bld", 
"Potassium, whole blood, ePOC", "Sodium, Ser/Plas", "Sodium, ISTAT",  
"Sodium, Whole Blood", "Chloride, Ser/Plas",  "Chloride, ISTAT",  
"Chloride, Whole Bld", "Creatinine, Ser/Plas", "Creatinine,ISTAT", 
"Anion Gap", "Anion Gap, ISTAT", "Glucose, Ser/Plas", "Glucose,ISTAT", 
"Glucose, Whole Blood", "Glucose by Meter", "Glucose, Non-fasting", 
"BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas", 'BUN, Arterial', 'BUN, Peripheral', 'BUN, Venous',
"Neutrophil, Absolute", "NEUT, ABS", "Neut, ABS (Seg+Band) (man diff)", 
"Neutrophils, Absolute (Manual Diff)", "Neut, ABS (Seg+Band) (man diff)", 
"Basophil, Absolute", "BASOS, ABS", "Basophils, ABS (man diff)", 
"Baso, ABS (man diff)", "Eosinophil, Absolute", "EOS, ABS", 
"Eosinophils, ABS (man diff)", "Eos, ABS (man diff)", 
"Lymphocyte, Absolute", "LYM, ABS", "Lymphocytes, ABS (man diff)", 
"Lym, ABS (man diff)", "Lymphocytes, Abs.", "Monocyte, Absolute", 
"MONO, ABS", "Monocytes, ABS (man diff)", "Mono, ABS (man diff)", 
"Lactate, ISTAT", "Lactate, Whole Bld", "Lactic Acid", "HCO3", 
"HCO3 (a), ISTAT", "Bicarbonate, Art for POC", "HCO3 (v), ISTAT", 
"HCO3, ISTAT", "O2 Saturation (a)", "O2 Saturation, ISTAT", 
"Oxygen Saturation for POC", "O2 Saturation (v)",  
"O2 Saturation, ISTAT (Ven)", "pCO2 (a)", "pCO2 (a), ISTAT",  
"PCO2, ISTAT", "Arterial pCO2 for POC", "pCO2 (v)", "PCO2 (v), ISTAT", 
"pH (a)", "PH (a), ISTAT", "pH by Meter", "Arterial pH for POC", 
"pH (v)", "PH (v), ISTAT", "PH, ISTAT", "pO2 (a)", "PO2 (a), ISTAT",  
"PO2, ISTAT", "Arterial pO2 for POC", "pO2 (v)", "PO2 (v), ISTAT", 
"tCO2", "TCO2 (a), ISTAT","TCO2, ISTAT", "TCO2, (ISTAT)", 
"CO2 Arterial Total for POC", "INR", "INR, ISTAT", 
"Prothrombin Time", "PT, ISTAT")
str_sort(triage_labs)

In [9]:
# labs0[labs0$lab_name == "Platelets", ]$base_name
# labs0[labs0$lab_name %in% lab_names, ]$base_name   
# lapply(lab_names, function(x) labs0[labs0$lab_name == x, ]$basename)

In [9]:
# check overlapping of Triage labs and the new queried labs
# str_sort(unique(labs0$base_name))
# str_sort(intersect(labs0$lab_name, lab_names))

setdiff(triage_labs, labs0$lab_name) # NONE
str_sort(setdiff(labs0$lab_name, triage_labs)) # in A, not B

triage_labs0_names <- labs0 %>% filter(lab_name %in% triage_labs) %>% distinct(lab_name, base_name)
triage_labs0_base_names <- str_sort(unique(triage_labs0_names$base_name))
triage_labs0_lab_names <- str_sort(unique(triage_labs0_names$lab_name))

triage_labs0_base_names; # triage_labs0_lab_names

str_sort(setdiff(labs0$base_name, triage_labs0_base_names))

In [10]:
# get lab_names and base_names from labs0
labs0_names <- labs0 %>% distinct(lab_name, base_name)
base_names <- str_sort(unique(labs0_names$base_name))
lab_names <- str_sort(unique(labs0_names$lab_name))

lab_names

In [11]:
# check each lab in the list for reference units, low and high to see if consistent among similar lab names
lab_test <- labs0 %>% select(lab_name, ord_num_value, reference_unit, reference_low, reference_high, base_name)# %>%
#                         mutate(reference_unit = factor(reference_unit),
#                                reference_low = factor(reference_low),
#                                reference_high = factor(reference_high)) #%>% 
#                 drop_na(ord_num_value)
c=0
for (l in lab_names){
    c = c+1
    lab <- lab_test %>% filter(lab_name == l & ord_num_value != 9999999)
    print(l)
    print(lab %>% group_by(base_name) %>% count())
    print(summary(lab %>% select(ord_num_value, reference_unit, reference_low, reference_high)))
}

# options(repr.plot.width=7, repr.plot.height=7)
print(c)

# 'Albumin, Ser/Plas', 'Alk P\'TASE, Total, Ser/Plas', 'ALT (SGPT), Ser/Plas', 'AST (SGOT), Ser/Plas'
# 'Base Excess (vt)', 'Base Excess Arterial for POC', 'Base Excess, ISTAT'
# 'Calcium, Ser/Plas''CO2, Ser/Plas''eGFR''Globulin''Glucose'
# 'Glucose Urine''Glucose, urine''Glucose, Urine'
# 'MCH''Neut, ABS (man diff)''Platelets''Potassium''Protein, Total, Ser/Plas'
# 'RDW''Sodium, whole blood, ePOC'

[1] "Albumin, Ser/Plas"
[90m# A tibble: 1 x 2[39m
[90m# Groups:   base_name [1][39m
  base_name     n
  [3m[90m<chr>[39m[23m     [3m[90m<int>[39m[23m
[90m1[39m ALB       [4m5[24m[4m3[24m211
 ord_num_value                               reference_unit  reference_low  
 Min.   :0.900   g/dL                               :53211   3.5    :53201  
 1st Qu.:3.100                                      :    0   3.2    :    5  
 Median :3.600                                      :    0   3.9    :    4  
 Mean   :3.571   /uL                                :    0   3.8    :    1  
 3rd Qu.:4.100   %                                  :    0          :    0  
 Max.   :6.700   % (See scan or EMR data for detail):    0   >60    :    0  
                 (Other)                            :    0   (Other):    0  
 reference_high 
 5.0    :32215  
 5.2    :20990  
 4.5    :    5  
 5.4    :    1  
        :    0  
 <0.055 :    0  
 (Other):    0  
[1] "Alk P'TASE, Total, Ser/Plas"
[90m#

In [12]:
# update lab list, cut some labs - from TRIAGE
# OK to include 1 entry Glucose
# be careful with Lactate has 2 basename whole blood and none

# remove Potassium --> 1 entry, 7.4 weird
# remove Platelets --> all NA
# remove base_name of glucose urine --> all NA
# no longer here to remove: remove Basophils --> unit is %
# no longer here to remove: "pH" , "Ketone, urine", "Leukocyte Esterase, urine" # few, and all NA

lab_list <- c("RBC", "MCH", "MCHC", "MCV", "Calcium, Ser/Plas", "CO2, Ser/Plas",
               "Albumin, Ser/Plas", "ALT (SGPT), Ser/Plas", "AST (SGOT), Ser/Plas", "Alk P'TASE, Total, Ser/Plas",
               "Globulin", "Protein, Total, Ser/Plas",
               "Magnesium, Ser/Plas",  "Phosphorus, Ser/Plas", #14 ind
               
               "WBC", "WBC count", # WBC count is 1000*WBC
               "TROPONIN I", "Troponin I, POCT",
               "Total Bilirubin, Ser/Plas", "Total Bilirubin",
               "Platelet count", # remove Platelets
              
               "Hematocrit", "Hct, ISTAT", "Hct(Calc), ISTAT", "Hct (Est)", "HCT, POC", "Hematocrit (Manual Entry) See EMR for details",
               "Hemoglobin", "Hgb(Calc), ISTAT", "Hgb, calculated, POC", "HgB", 
               "Potassium, Ser/Plas", "Potassium, ISTAT",  "Potassium, Whole Bld", "Potassium, whole blood, ePOC", # remove Potassium
               "Sodium, Ser/Plas", "Sodium, ISTAT",  "Sodium, Whole Blood", 
               "Chloride, Ser/Plas",  "Chloride, ISTAT",  "Chloride, Whole Bld",
               "Creatinine, Ser/Plas", "Creatinine,ISTAT", 
              
               "Anion Gap", "Anion Gap, ISTAT",
               "Glucose, Ser/Plas", "Glucose,ISTAT", "Glucose, Whole Blood", "Glucose by Meter", "Glucose, Non-fasting",
               "BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas", 'BUN, Arterial', 'BUN, Peripheral', 'BUN, Venous',
               "Neutrophil, Absolute", "NEUT, ABS", "Neut, ABS (Seg+Band) (man diff)", "Neutrophils, Absolute (Manual Diff)", "Neut, ABS (Seg+Band) (man diff)",
               "Basophil, Absolute", "BASOS, ABS", "Basophils, ABS (man diff)", "Baso, ABS (man diff)", # remove Basophils
               "Eosinophil, Absolute", "EOS, ABS", "Eosinophils, ABS (man diff)", "Eos, ABS (man diff)",
               "Lymphocyte, Absolute", "LYM, ABS", "Lymphocytes, ABS (man diff)", "Lym, ABS (man diff)", "Lymphocytes, Abs.",
                  # Lymphocytes, Abs. = 1000* the rest of lymphocytes
               "Monocyte, Absolute", "MONO, ABS", "Monocytes, ABS (man diff)", "Mono, ABS (man diff)",
               "Lactate, ISTAT", "Lactate, Whole Bld", "Lactic Acid",
               "Base Excess, ISTAT", "Base Excess (vt)", "Base Excess Arterial for POC",
              
                "HCO3", "HCO3 (a), ISTAT", "Bicarbonate, Art for POC", 
                "HCO3 (v), ISTAT", "HCO3, ISTAT", 
                "O2 Saturation (a)", "O2 Saturation, ISTAT", "Oxygen Saturation for POC", 
                "O2 Saturation (v)",  "O2 Saturation, ISTAT (Ven)",
                "pCO2 (a)", "pCO2 (a), ISTAT",  "PCO2, ISTAT", "Arterial pCO2 for POC",
                "pCO2 (v)", "PCO2 (v), ISTAT",
                "pH (a)", "PH (a), ISTAT", "pH by Meter", "Arterial pH for POC",
                "pH (v)", "PH (v), ISTAT", "PH, ISTAT", 
                "pO2 (a)", "PO2 (a), ISTAT",  "PO2, ISTAT", "Arterial pO2 for POC",
                "pO2 (v)", "PO2 (v), ISTAT", 
              
                "tCO2", "TCO2 (a), ISTAT","TCO2, ISTAT", "TCO2, (ISTAT)", "CO2 Arterial Total for POC", 
                "TCO2 (v), ISTAT", # 1 individual
                "INR", "INR, ISTAT", 
                "Prothrombin Time", "PT, ISTAT") # 33 more than 1
#               "pH", "Ketone, urine", "Leukocyte Esterase, urine") #"O2 Saturation, ISTAT (Oth)", "ctO2 (a)"
length(lab_list)

In [13]:
colnames(labs0)

In [14]:
# sodium base_name is NA, so might not pick up. either collect is.na(base_name) or specify these lab_names
labs0 %>% filter(str_detect(lab_name, "Sodium")) %>% group_by(lab_name, base_name) %>% count() %>% distinct()

lab_name,base_name,n
<chr>,<chr>,<int>
"Sodium, ISTAT",,4127
"Sodium, Ser/Plas",,57803
"Sodium, Whole Blood",,403
"Sodium, whole blood, ePOC",,89


In [15]:
# Only need to remove glucose in urine, 1 potassium, 1 platelets, and base_name of urine glucose
# changed labs0 to labs here
removed_lab_names <- c("Platelets", "Potassium")
labs <- labs0 %>% filter(!lab_name %in% removed_lab_names, base_name != "GLUURN" | is.na(base_name)) %>%
                select(-c(taken_time_utc, order_time_utc)) %>%
#             select(anon_id, pat_enc_csn_id_coded, lab_name, base_name, ord_num_value, 
#                    result_flag, order_time_utc, taken_time_utc, result_time_utc) %>%
                rename(features = lab_name, values = ord_num_value, result_time = result_time_utc) %>% 
#             mutate(feature_type = "labs", result_flag = ifelse(result_flag=="",0,1)) %>% 
                drop_na(values) %>% distinct()
nrow(labs)

In [16]:
str_sort(setdiff(labs$lab_name, lab_list)) # NONE
str_sort(setdiff(lab_list, labs$features)) # ok
labs %>% filter(features %in% c('TCO2 (v), ISTAT')) # not there (only 1 person from the old dataset)
str_sort(unique(labs$base_name)) # has NA

“number of rows of result is not a multiple of vector length (arg 2)”
“number of rows of result is not a multiple of vector length (arg 2)”
“number of rows of result is not a multiple of vector length (arg 2)”
“number of rows of result is not a multiple of vector length (arg 2)”


anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,admit_time,label,DBP,Pulse,RR,SBP,Temp,⋯,features,base_name,ord_value,values,reference_low,reference_high,reference_unit,result_in_range_yn,result_flag,result_time
<chr>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<dbl>,⋯,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>,<chr>


In [17]:
# convert WBC count to WBC unit and Lymphocytes Abs. before combining
labs <- labs %>% mutate(values = ifelse(features == "WBC count", round(values/1000.0, 1), 
                                          ifelse(features == "Lymphocytes, Abs.", round(values/1000.0, 3), values))) 

In [18]:
# in range: Y or blank 
labs %>% count(result_in_range_yn) %>% arrange(desc(result_in_range_yn))
# normal as blanks
labs %>% count(result_flag) %>% arrange(desc(n))
head(labs %>% count(values) %>% arrange())
tail(labs %>% count(values) %>% arrange())

result_in_range_yn,n
<fct>,<int>
Y,23070
,1865039


result_flag,n
<fct>,<int>
,1272903
High,301293
Low,290268
High Panic,12137
Low Panic,11491
Abnormal,17


Unnamed: 0_level_0,values,n
Unnamed: 0_level_1,<dbl>,<int>
1,-31.2,1
2,-30.0,4
3,-29.8,1
4,-29.0,13
5,-28.0,19
6,-27.7,1


Unnamed: 0_level_0,values,n
Unnamed: 0_level_1,<dbl>,<int>
6811,6307,1
6812,6589,1
6813,6673,1
6814,6729,1
6815,9655,1
6816,9999999,21945


In [19]:
labs %>% group_by(features, base_name) %>% count() %>% distinct()

features,base_name,n
<chr>,<chr>,<int>
"Albumin, Ser/Plas",ALB,53253
"Alk P'TASE, Total, Ser/Plas",ALKP,53033
"ALT (SGPT), Ser/Plas",ALT,52930
Anion Gap,AG,57613
"Anion Gap, ISTAT",AGAP,2524
"Anion Gap, ISTAT",PCAGP,146
Arterial pCO2 for POC,PCO2A,13
Arterial pH for POC,PHA,13
Arterial pO2 for POC,PO2A,13
"AST (SGOT), Ser/Plas",AST,52879


In [20]:
###str_sort(unique(labs$features))

In [21]:


# ISTAT seems all A, so put pH in A
# combine same labs and change the lab name in the data
# could use base_name except for LACTATE has 2 different base_names
# Glucose lab_name as base_name as GLU and UGLU. exclude glucose here, only 1 entry anyways
# New: Glucose, Sodium ePOC, 'Neut, ABS (man diff)' and ALB - RDW

ALB = 'Albumin, Ser/Plas'
ALK = 'Alk P\'TASE, Total, Ser/Plas'
ALT = 'ALT (SGPT), Ser/Plas'
AST = 'AST (SGOT), Ser/Plas'

Ca = 'Calcium, Ser/Plas'
CO2 = 'CO2, Ser/Plas'
eGFR = 'eGFR' # as char
Glob = 'Globulin'

MCH = 'MCH' # as char
TProtein = 'Protein, Total, Ser/Plas'
RDW = 'RDW' # as char

Platelet = c("Platelet count") # remove platelets 
TBili = c("Total Bilirubin", "Total Bilirubin, Ser/Plas")
Trop = c("TROPONIN I", "Troponin I, POCT")
WBC = c("WBC", "WBC count") # WBC count /1000 7

Hct = c("Hct (Est)", "Hct, ISTAT", "HCT, POC", "Hct(Calc), ISTAT", "Hematocrit",    
        "Hematocrit (Manual Entry) See EMR for details")
Hgb = c("Hemoglobin", "HgB", "Hgb, calculated, POC", "Hgb(Calc), ISTAT")
K = c("Potassium, ISTAT",  "Potassium, Ser/Plas", "Potassium, Whole Bld", "Potassium, whole blood, ePOC") # remove Potassium
Na = c("Sodium, ISTAT",  "Sodium, Ser/Plas", "Sodium, Whole Blood", 'Sodium, whole blood, ePOC') # ePOC is new
Cl = c("Chloride, ISTAT",  "Chloride, Ser/Plas",  "Chloride, Whole Bld")
Cr = c("Creatinine, Ser/Plas", "Creatinine,ISTAT") #22

AnionGap = c("Anion Gap", "Anion Gap, ISTAT") 
Glucose = c('Glucose', "Glucose by Meter", "Glucose, Non-fasting", "Glucose, Ser/Plas", "Glucose, Whole Blood", "Glucose,ISTAT")
BUN = c("BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas", 'BUN, Arterial', 'BUN, Peripheral', 'BUN, Venous') #10
Neut = c("NEUT, ABS", 'Neut, ABS (man diff)', "Neut, ABS (Seg+Band) (man diff)", "Neutrophil, Absolute",  "Neutrophils, Absolute (Manual Diff)")
Basos = c("Basophils, ABS (man diff)", "Basophil, Absolute", "Baso, ABS (man diff)", "BASOS, ABS")
Eos = c("EOS, ABS", "Eosinophils, ABS (man diff)", "Eosinophil, Absolute", "Eos, ABS (man diff)")
Lymp = c("LYM, ABS", "Lymphocytes, ABS (man diff)", "Lymphocyte, Absolute", "Lym, ABS (man diff)", "Lymphocytes, Abs.")
Mono = c("MONO, ABS", "Monocytes, ABS (man diff)", "Monocyte, Absolute", "Mono, ABS (man diff)")
Lactate = c("Lactate, ISTAT", "Lactate, Whole Bld", "Lactic Acid")
Base = c("Base Excess (vt)", "Base Excess Arterial for POC", "Base Excess, ISTAT") #28

HCO3_a = c("HCO3", "HCO3 (a), ISTAT", "Bicarbonate, Art for POC") # HCO3 is A based on base name
HCO3_v = c("HCO3 (v), ISTAT", "HCO3, ISTAT") 
O2sat_a = c("O2 Saturation (a)", "O2 Saturation, ISTAT", "Oxygen Saturation for POC")
O2sat_v = c("O2 Saturation (v)",  "O2 Saturation, ISTAT (Ven)")
pCO2_a = c("pCO2 (a)", "pCO2 (a), ISTAT",  "PCO2, ISTAT", "Arterial pCO2 for POC")
pCO2_v = c("pCO2 (v)", "PCO2 (v), ISTAT") 
pH_a = c("pH (a)", "PH (a), ISTAT", "pH by Meter", "Arterial pH for POC", "PH, ISTAT")
pH_v = c ("pH (v)", "PH (v), ISTAT") # new: move "PH, ISTAT" to arterial
PO2_a = c("pO2 (a)", "PO2 (a), ISTAT",  "PO2, ISTAT", "Arterial pO2 for POC")
PO2_v = c ("pO2 (v)", "PO2 (v), ISTAT") #29

TCO2_a = c("tCO2", "TCO2 (a), ISTAT", "TCO2, (ISTAT)", "TCO2, ISTAT", "CO2 Arterial Total for POC")
INR = c("INR", "INR, ISTAT")
PT = c("Prothrombin Time", "PT, ISTAT") #9

labs <- labs %>% 
        mutate(features = 
            ifelse(features %in% Platelet, "Platelet", ifelse(features %in% TBili, "TBili", 
            ifelse(features %in% Trop, "Trop", ifelse(features %in% WBC, "WBC",
            ifelse(features %in% Hct, "Hct", ifelse(features %in% Hgb, "Hgb", 
            ifelse(features %in% K, "K", ifelse(features %in% Na, "Na",
            ifelse(features %in% Cl, "Cl", ifelse(features %in% Cr, "Cr",
            ifelse(features %in% AnionGap, "AnionGap", ifelse(features %in% Glucose, "Glucose",
            ifelse(features %in% BUN, "BUN", ifelse(features %in% Neut, "Neut",    
            ifelse(features %in% Basos, "Basos", ifelse(features %in% Eos, "Eos",      
            ifelse(features %in% Lymp, "Lymp", ifelse(features %in% Mono, "Mono",    
            ifelse(features %in% Lactate, "Lactate",  ifelse(features %in% Base, "Base", 
            ifelse(features %in% HCO3_a, "HCO3_a", ifelse(features %in% HCO3_v, "HCO3_v", 
            ifelse(features %in% O2sat_a, "O2sat_a", ifelse(features %in% O2sat_v, "O2sat_v",
            ifelse(features %in% pCO2_a, "pCO2_a", ifelse(features %in% pCO2_v, "pCO2_v",
            ifelse(features %in% pH_a, "pH_a", ifelse(features %in% pH_v, "pH_v",      
            ifelse(features %in% PO2_a, "PO2_a", ifelse(features %in% PO2_v, "PO2_v",
            ifelse(features %in% TCO2_a, "TCO2_a", ifelse(features %in% INR, "INR",
            ifelse(features %in% PT, "PT", ifelse(features %in% ALB, "ALB",
            ifelse(features %in% ALK, "ALK", ifelse(features %in% ALT, "ALT",
            ifelse(features %in% AST, "AST", ifelse(features %in% TProtein, "TProtein",                                     
            ifelse(features %in% Ca, "Ca", ifelse(features %in% CO2, "CO2",                                        
            ifelse(features %in% Glob, "Glob", as.character(features))))))))))))))))))))))))))))))))))))))))))) %>%
        distinct() 


In [22]:
nrow(labs) # 1367892 vs 1367838
# total 48 lab_name but 56 base_name
nrow(labs %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) #41627
# labs %>% select(base_name, features) %>% distinct() %>% arrange(features, base_name) 
labs %>% group_by(features, base_name) %>% count(sort=TRUE) %>% arrange(features, base_name) 

features,base_name,n
<chr>,<chr>,<int>
ALB,ALB,53253
ALK,ALKP,53033
ALT,ALT,52930
AnionGap,AG,57613
AnionGap,AGAP,2524
AnionGap,PCAGP,146
AST,AST,52879
Base,BE,28611
Basos,BASOAB,50167
BUN,BUN,60269


### Process the 9999999 values
- check ord_value

In [23]:
# there are no NA, but only 15668 of 9999999 values
value99 <- labs %>% filter(values==9999999) 
nrow(value99)
value99 %>% group_by(ord_value) %>% count() %>% arrange(-n)

ord_value,n
<chr>,<int>
<0.017,9051
>60,5339
>90,2197
<0.1,1556
<0.2,830
<10,550
,430
<0.30,318
>500,309
<5,256


In [24]:
value99 %>% filter(ord_value %in% c("N/A", "hi") | str_detect(ord_value, 'CANCELLED')) %>% count(ord_value, sort=TRUE)
value99 %>% filter(!grepl('>|<', ord_value)) %>% count(ord_value, sort=TRUE)

value99 %>% filter(ord_value %in% c("N/A", "hi") | str_detect(ord_value, 'CANCELLED')) %>% count(features, sort=TRUE)

ord_value,n
<chr>,<int>
,430
CANCELLED: Specimen Clotted.,4
CANCELLED: Note:,1
CANCELLED: Inappropriate Specimen/Container.,1
hi,1


ord_value,n
<chr>,<int>
,430
CANCELLED: Specimen Clotted.,4
CANCELLED: Note:,1
CANCELLED: Inappropriate Specimen/Container.,1
hi,1


features,n
<chr>,<int>
Basos,26
Eos,26
Lymp,26
Mono,26
Neut,26
Platelet,26
Base,21
MCH,21
Hgb,20
RDW,20


In [25]:
# number of cancelled, NA, or 1 hi observations under ord_value
nrow(labs %>% filter(values==9999999, !grepl('>|<', ord_value)) %>% mutate(ord_value = factor(ord_value))) # %>% group_by(ord_value) %>% count() %>% arrange(-n)

In [26]:
# revmove those without > or < in ord_value (416), work with this set and then do it on the main labs
# take the upper/lower bounds for the 9999999 values and add 1% to it
value99 <- value99 %>% filter(grepl('>|<', ord_value)) %>% #mutate(ord_value = as.character(ord_value)) %>%
            mutate(values = 1.01 * as.double(gsub(paste(c(">", "<"), collapse = "|"),"", ord_value))) #%>%
#             drop_na(values) # %>% distinct() # same

nrow(value99) # 21508
summary(value99$values)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 -30.300    0.017    0.202   40.603   60.600 7070.000 

In [27]:
head(value99)

Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,inpatient_data_id_coded,admit_time,label,DBP,Pulse,RR,SBP,Temp,⋯,features,base_name,ord_value,values,reference_low,reference_high,reference_unit,result_in_range_yn,result_flag,result_time
Unnamed: 0_level_1,<chr>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<int>,<dbl>,⋯,<chr>,<chr>,<chr>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>,<chr>
1,JC514732,131245301962,33736992,2018-02-08 23:26:00+00:00,1,57,86,26,128,35.85,⋯,TCO2_a,TCO2A,<5,5.05,23.0,27,mmol/L,,Low Panic,2018-02-08 23:11:00+00:00
2,JC628815,131072949142,19611085,2015-03-15 05:41:00+00:00,1,79,131,34,137,36.0,⋯,TCO2_a,TCO2A,<5,5.05,23.0,27,mmol/L,,Low Panic,2015-03-15 05:17:00+00:00
3,JC1839951,131090282451,20778663,2015-06-14 22:51:00+00:00,0,46,43,18,96,37.0,⋯,TCO2_a,TCO2A,<5,5.05,23.0,27,mmol/L,,Low Panic,2015-06-13 23:49:00+00:00
4,JC577879,131102861750,21400034,2015-07-24 18:21:00+00:00,1,46,102,18,80,33.8,⋯,Lactate,LACWBL,>15.0,15.15,,<2.0,mmol/L,,High,2015-07-24 18:11:00+00:00
5,JC549970,131197131993,26495482,2016-09-13 03:20:00+00:00,1,58,119,26,130,37.8,⋯,TCO2_a,TCO2A,>50,50.5,23.0,27,mmol/L,,High Panic,2016-09-13 02:55:00+00:00
6,JC1767031,131179885980,24058527,2016-03-15 04:39:00+00:00,1,89,92,22,138,36.8,⋯,TCO2_a,TCO2A,>50,50.5,23.0,27,mmol/L,,High Panic,2016-03-15 03:03:00+00:00


In [28]:
# remove only the observations (where values = 999999 and do not have < or > in ord_value) from processed labs
# could do 1.01 or 0.99 but more complicated
grepl('>|<', ">60")
labs99 <- labs %>% filter(!(values==9999999 & !grepl('>|<', ord_value))) %>%
            mutate(values = as.double(gsub(paste(c(">", "<"), collapse = "|"),"", ord_value)),
                   feature_type = "labs") 
nrow(labs) - nrow(labs99) # 437
nrow(labs99) # 1887672
summary(labs99$values)
colnames(labs99)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  -31.20     3.40    12.80    39.16    38.20 11900.00 

In [29]:
labs99 <- labs99 %>% select(-c('DBP','Pulse','RR','SBP','Temp', 'order_id_coded'))
colnames(labs99)

In [30]:
# not 9999999 and do not have > < from processed labs
write.csv(labs99, file.path(outdir, "6_coh3_labs.csv"), row.names=FALSE)

### Keep this cohort information:

In [31]:
cohort_with_labs <- labs99 %>% select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, admit_time, label) %>% distinct()
nrow(cohort_with_labs) # 57831

cohort_with_labs %>% group_by(label) %>% summarise(count = n(), percent = round(100*count/nrow(cohort_with_labs),2))

label,count,percent
<int>,<int>,<dbl>
0,49816,86.14
1,8015,13.86


In [32]:
write.csv(cohort_with_labs, file.path(outdir, "6_cohort3_withlabs.csv"), row.names=FALSE)