This notebook was originally developed by Conor K. Corbin, modified by Minh Nguyen

### Description:
Query the original *init* cohort of admitted ED patients.

- Check the different patient services and classes.
- Keep rows with patient class Emergency Services (`pat_class_c` = 112) and patient service strictly "Emergency".
- Group by the combination of `anon_id` and CSN. If a CSN had multiple `anon_id`s, they would show up here.
- Our data shows that the cohort CSNs are unique, i.e., only one `anon_id` is associated with each CSN.

**Output**: original `1_1_cohort`

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os 
from datetime import datetime

In [2]:
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

### THIS IS MEANT TO RUN ON NERO GCP Jupyter notebook
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jupyter/.config/gcloud/application_default_credentials.json'

# FOR LOCAL COMPUTER:
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\Users\User\AppData\Roaming\gcloud\application_default_credentials.json' 

os.environ['GCLOUD_PROJECT'] = 'som-nero-phi-jonc101' 
%load_ext google.cloud.bigquery

# bigquery is already imported at the top of this cell
client = bigquery.Client()

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery




In [3]:
datadir = "../../DataTD"
cohortdir = "../../OutputTD/1_cohort"

### Notes on variables queried:
- Omit event_time_jittered, because it records when the event record was actually created.
- Use effective_time_jittered for the actual date and time of the patient admit / discharge / transfer.
- seq_num_in_enc: the sequence number for this event within a patient encounter. You can use this number to determine the order of events for a particular encounter. Only non-canceled events are included within this sequence. The ED admission is 1 and the first inpatient admission/LOC is 2.
- Columns to grab from ADT later: 'anon_id', 'pat_enc_csn_id_coded', 'pat_lvl_of_care_c', 'pat_lv_of_care', 'accomodation', 'effective_time_jittered_utc', 'admit_time_jittered', 'seq_num_in_enc', 'time_since_admit', 'label'
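
As a minimal sketch of the `seq_num_in_enc` note above, the snippet below orders events within each encounter and picks the first one. The rows are made-up toy values, not data from `shc_core.adt`, and the names `toy_adt`, `ordered`, and `first_events` are hypothetical.

```python
import pandas as pd

# Toy ADT rows (made-up values, NOT real data); column names follow the notes above.
toy_adt = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 1, 1, 2, 2],
    "seq_num_in_enc": [2, 1, 3, 2, 1],
    "event_type": ["Transfer In", "Admission", "Discharge", "Discharge", "Admission"],
})

# Sort by encounter, then by sequence number, to recover the event order.
ordered = toy_adt.sort_values(
    ["pat_enc_csn_id_coded", "seq_num_in_enc"]
).reset_index(drop=True)

# First event of each encounter (the row with the lowest seq_num_in_enc).
first_events = ordered.groupby("pat_enc_csn_id_coded")["event_type"].first()
print(first_events.to_dict())
```

Sorting on the encounter id first keeps each encounter's events contiguous, so `first()` returns the event with the lowest sequence number.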


In [4]:
# no need to re-run, this was to check patient class that is linked to Emergency
q = """
SELECT adt.pat_class_c, adt.pat_class, adt.base_pat_class, adt.pat_service, adt.pat_lv_of_care
FROM 
    `som-nero-phi-jonc101.shc_core.adt` as adt
WHERE adt.pat_class_c = "112"
or adt.pat_class = "Emergency Services"
or adt.pat_service LIKE  "%Emergency%"
"""
query_job = client.query(q)
adt = query_job.to_dataframe()

In [5]:
pd.options.display.max_rows = 100
adt.groupby(['pat_class_c', 'pat_class', 'base_pat_class', 'pat_service']).size().sort_values(ascending=False)

pat_class_c  pat_class                             base_pat_class  pat_service                  
112          Emergency Services                    Emergency       Emergency                        692933
                                                                   Urgent Care                       36578
128          Observation                           Outpatient      Emergency Medicine                26141
                                                                   Emergency                         15743
112          Emergency Services                    Emergency       Emergency Medicine                 7466
126          Inpatient                             Inpatient       Emergency                          7169
                                                                   Emergency Medicine                 3562
140          Outpatient                            Outpatient      Emergency Medicine                  581
122          OP Surgery/Procedure              

In [6]:
# only for pat_class_c = 112 (Emergency Services)
adt.loc[adt['pat_class_c'] == '112'].groupby(['pat_class_c', 'pat_class', 'base_pat_class', 'pat_service']).size().sort_values(ascending=False) 

pat_class_c  pat_class           base_pat_class  pat_service                  
112          Emergency Services  Emergency       Emergency                        692933
                                                 Urgent Care                       36578
                                                 Emergency Medicine                 7466
                                                 Trauma                              203
                                                 Family Medicine                     170
                                                 FINANCIALLY CLEARED                 103
                                                 General Medicine (University)        60
                                                 Obstetrics                           37
                                                 Orthopaedic Surgery                  37
                                                 Ophthalmology                        31
                               

In [7]:
adt.to_csv(os.path.join(datadir,"adt_class_serv_loc.csv"), index=False)

### Query ADT and return rows where patient encounter id is associated with an ED visit
* Get ER stays that BEGIN between 2015 and 2020. Some of our index times will be in 2021, because an ER visit can start on 12-31-2020 and ADT has 2021 data.
* Assumes patient encounters associated with an ED visit START with an ED visit, which is probably always true, but we might want to check this.
* Assumes we can trust pat_enc_csn_id_coded as a patient encounter, which we can't always. Encounters are weird, and sometimes multiple pat_enc_csn_id_coded values overlap in time for a particular patient (Stephen knows more about this). UPDATE: ER encounters should have unique CSNs. This was checked toward the end of the notebook.
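
The "encounters START with an ED visit" assumption could be checked along the following lines. This is only a sketch on made-up toy rows; the helper name `csns_not_starting_in_ed` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy events (made-up values, NOT real data); column names follow the ADT query.
toy = pd.DataFrame({
    "pat_enc_csn_id_coded": [10, 10, 11, 11],
    "effective_time_jittered_utc": pd.to_datetime(
        ["2019-01-01 10:00", "2019-01-01 14:00",
         "2019-02-01 08:00", "2019-02-01 09:00"], utc=True),
    "pat_class": ["Emergency Services", "Inpatient",
                  "Inpatient", "Emergency Services"],
})

def csns_not_starting_in_ed(events):
    """Return CSNs whose earliest event is not 'Emergency Services'."""
    ordered = events.sort_values(
        ["pat_enc_csn_id_coded", "effective_time_jittered_utc"])
    first_class = ordered.groupby("pat_enc_csn_id_coded")["pat_class"].first()
    return sorted(first_class[first_class != "Emergency Services"].index.tolist())

print(csns_not_starting_in_ed(toy))
```

An empty list on the real dataframe would support the assumption; any CSNs returned would be the encounters worth inspecting by hand.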

In [8]:
query = """

WITH er_admit_times AS (
SELECT pat_enc_csn_id_coded, MIN(EXTRACT(YEAR FROM event_time_jittered_utc)) admit_year
FROM shc_core.adt
WHERE pat_class_c = "112"
AND pat_service = "Emergency"
GROUP BY pat_enc_csn_id_coded
)

SELECT er.anon_id, er.pat_enc_csn_id_coded, er.effective_time_jittered_utc,
        er.base_pat_class_c, er.pat_lv_of_care, er.pat_class, er.event_type
           
FROM shc_core.adt as er
INNER JOIN er_admit_times
USING (pat_enc_csn_id_coded)
WHERE er_admit_times.admit_year BETWEEN 2015 AND 2020
ORDER BY anon_id, pat_enc_csn_id_coded, effective_time_jittered_utc 
"""
query_job = client.query(query)
df = query_job.to_dataframe()

# ORDER BY was changed from event_time to effective_time

In [9]:
# saving the df - stored in box
df.to_csv("{}/dfer.csv".format(datadir), index=None)

In [None]:
# Run this cell ONLY when reading the locally stored df instead of querying BigQuery
df = pd.read_csv("{}/dfer.csv".format(datadir))
df.shape

# Convert to datetime when reading from locally stored data (when queried directly from BQ, the column is already datetime)
df.effective_time_jittered_utc = pd.to_datetime(df.effective_time_jittered_utc)

In [10]:
df.head()

Unnamed: 0,anon_id,pat_enc_csn_id_coded,effective_time_jittered_utc,base_pat_class_c,pat_lv_of_care,pat_class,event_type
0,JC29f8a9e,131273320899,2019-07-19 19:08:00+00:00,3.0,,Emergency Services,Admission
1,JC29f8a9e,131273320899,2019-07-19 20:53:00+00:00,,,Emergency Services,Discharge
2,JC29f8aa9,131267052573,2019-04-22 20:38:00+00:00,3.0,,Emergency Services,Admission
3,JC29f8aa9,131267052573,2019-04-23 00:00:00+00:00,,,Emergency Services,Transfer Out
4,JC29f8aa9,131267052573,2019-04-23 00:00:00+00:00,,,Emergency Services,Transfer In


### Get counts on unique patient trajectories 
These show where patients were sent throughout their stay, at the granularity of inpatient vs. observation vs. discharged straight from the ER, not the unit they were sent to.

Trajectories of patient encounter ids:
* Most get discharged directly from the ER.
* Next most common is ER -> inpatient (which is what we're focusing on).
* Next is ER -> observation.
* Next is ER -> observation -> inpatient, etc.

In [11]:
# Keep only rows where base_pat_class_c is populated (these appear to be the rows where pat_class is set or changes)
df_change = df[~df['base_pat_class_c'].isna()]
print(len(df)) # 2685583
print(len(df_change)) #543657

2685583
543657


In [12]:
# Collapse pat_class per patient encounter id into a trajectory string, then count encounters per trajectory.
df_change[['pat_enc_csn_id_coded', 'pat_class']].groupby('pat_enc_csn_id_coded').agg(
{'pat_class' : lambda x: ' -> '.join(x)}).reset_index().groupby('pat_class').agg(
{'pat_enc_csn_id_coded' : 'count'}).reset_index().sort_values('pat_enc_csn_id_coded', ascending=False)

Unnamed: 0,pat_class,pat_enc_csn_id_coded
2,Emergency Services,316400
5,Emergency Services -> Inpatient,53520
17,Emergency Services -> Observation,31177
18,Emergency Services -> Observation -> Inpatient,14706
3,Emergency Services -> Bedded Outpatient (corre...,2361
8,Emergency Services -> Inpatient -> Observation,1161
15,Emergency Services -> OP Surgery/Procedure,995
4,Emergency Services -> Bedded Outpatient (corre...,266
6,Emergency Services -> Inpatient -> Bedded Outp...,266
16,Emergency Services -> OP Surgery/Procedure -> ...,230


### Patient Level of Care column is the indicator we want
It seems to be missing in many rows.
Update: the LOC is not always on the same row as the change in patient class (i.e., inpatient vs. observation vs. emergency services).

Filter for patients with an inpatient code immediately after emergency services and create trajectories.
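
To make the "immediately after" rule concrete, here is a small standalone sketch. The function name `inpatient_directly_after_er` is hypothetical and the input lists are toy examples; it is meant to illustrate the rule, not to replace the notebook's own helper.

```python
def inpatient_directly_after_er(classes):
    """True if 'Inpatient' appears immediately after 'Emergency Services'.

    Assumes `classes` is a list of pat_class values ordered by time.
    """
    return any(
        prev == "Emergency Services" and curr == "Inpatient"
        for prev, curr in zip(classes, classes[1:])
    )

# Toy trajectories (illustrative only).
print(inpatient_directly_after_er(["Emergency Services", "Inpatient"]))
print(inpatient_directly_after_er(["Emergency Services", "Observation", "Inpatient"]))
print(inpatient_directly_after_er(["Emergency Services"]))
```

Under this rule an encounter that passes through Observation before Inpatient is not selected, because the Inpatient code does not directly follow Emergency Services.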

In [19]:
import datetime

def has_inpatient_code(arr):
    for a in arr:
        if a == 'Inpatient':
            return True
    return False

def has_inpatient_code_after_er(arr):
    """Return True if 'Inpatient' directly follows 'Emergency Services'.
    Assumes arr is ordered by time."""
    has_er = False
    for a in arr:
        if a == 'Emergency Services':
            has_er = True
        elif a == 'Inpatient' and has_er:
            return True
        else:
            has_er = False

    return False

def get_trajectory(arr):
    """Join values into a path, appending only when the level of care changes."""
    traj = []
    for a in arr:
        if not traj or a != traj[-1]:
            traj.append(a)
    return ' -> '.join(traj)

# Get the set of patient encounter ids that have an inpatient code immediately after an ER code
df_temp = df_change[['pat_enc_csn_id_coded', 'pat_class']].groupby('pat_enc_csn_id_coded').agg(
{'pat_class' : has_inpatient_code_after_er}).reset_index()
inpatient_ids = set(df_temp[df_temp['pat_class'] == True]['pat_enc_csn_id_coded'].values)

# Filter original df for patients in this set and create level of care trajectories.
df_lofc = df[df['pat_enc_csn_id_coded'].isin(inpatient_ids)]

# Get df of csn_ids and admit timestamps
df_admit_times = df_change[df_change['pat_enc_csn_id_coded'].isin(inpatient_ids)]
df_admit_times = df_admit_times[df_admit_times['pat_class'] == 'Inpatient'].groupby(
    'pat_enc_csn_id_coded').first().reset_index()[['pat_enc_csn_id_coded', 'effective_time_jittered_utc']].rename(
    columns={'effective_time_jittered_utc' : 'admit_time_jittered'})
df_admit_times.head()

# TODO: ideally, query ADT again for the anon_id matching each ER CSN id and look ahead 24 hours, to mitigate the overlapping CSN issue

# Merge to df_lofc and create column called time_since_admit
df_lofc = pd.merge(df_lofc, df_admit_times, how='left', on='pat_enc_csn_id_coded')
df_lofc['time_since_admit'] = df_lofc['effective_time_jittered_utc'] - df_lofc['admit_time_jittered']

# Filter df_lofc so that we only look 24 hours into admission
df_lofc = df_lofc[df_lofc['time_since_admit'] < pd.Timedelta(hours=24)]

print(len(df_lofc)) # 541690 vs 541712

541712


### Print the trajectories 24 hours into admission

In [20]:
# Build a level-of-care trajectory per encounter from df_lofc, then count encounters per trajectory
df_traj = df_lofc[['pat_enc_csn_id_coded', 'pat_lv_of_care']].dropna().groupby(
'pat_enc_csn_id_coded').agg({'pat_lv_of_care' : get_trajectory}).reset_index().groupby(
'pat_lv_of_care').count().reset_index().sort_values('pat_enc_csn_id_coded', ascending=False)

# Print row by row because these trajectories are long
for i in range(len(df_traj)):
    print(df_traj['pat_lv_of_care'].values[i], ' : ', df_traj['pat_enc_csn_id_coded'].values[i])

# Count number of encounters with a trajectory. 
print('Number of encounters with a trajectory : ', df_traj['pat_enc_csn_id_coded'].sum())
print('Total Number of inpatient encounters : ', len(inpatient_ids))

Acute Care (Assessment or intervention q4-8)  :  33651
Intermediate Care - With Cardiac Monitor  :  7172
Critical Care  :  4309
IICU/Intermediate Care (Assessment or intervention q2-4)  :  2244
Acute Care (Assessment or intervention q4-8) -> IICU/Intermediate Care (Assessment or intervention q2-4)  :  1157
Critical Care -> Acute Care (Assessment or intervention q4-8)  :  1072
IICU/Intermediate Care (Assessment or intervention q2-4) -> Acute Care (Assessment or intervention q4-8)  :  879
Acute Care (Assessment or intervention q4-8) -> Critical Care  :  524
Acute Care (Assessment or intervention q4-8) -> Intermediate Care - With Cardiac Monitor  :  391
Critical Care -> IICU/Intermediate Care (Assessment or intervention q2-4)  :  370
Intermediate Care - With Cardiac Monitor -> Acute Care (Assessment or intervention q4-8)  :  368
Acute Care (Assessment or intervention q4-8) -> IICU/Intermediate Care (Assessment or intervention q2-4) -> Acute Care (Assessment or intervention q4-8)  :  255
C

### Create labelling function for highest level of care within 24 hours of admit
For each CSN id, we create positive or negative labels based on whether the patient has a critical care label in pat_lv_of_care within 24 hours of admit. This means that if they are originally placed in critical care but then sent to acute care, we still label them as critical care. The label is thus the max level of care within 24 hours of admit.

Output dataframe should have anon_id, csn_id, admit_time, label.

NOTE: this isn't completely correct because I've grouped on pat_enc_csn_id_coded. I really should be taking the anon_id from each ER CSN id and looking ahead 24 hours in the ADT table to see if there exist other CSN ids associated with the encounter where the level of care changes...

UPDATE: It is OK to group by pat_enc_csn_id_coded and take the first admit time for each visit. Taking the first anon_id would only matter if there were multiple anon_ids for a CSN, which is not the case in our data (checked).
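
The labelling rule described above can be sketched on toy data as follows. All values are made up, and the names `toy`, `within_24h`, and `labels` are hypothetical; the sketch assumes `time_since_admit` has already been computed per event.

```python
import pandas as pd

# Toy events (made-up values, NOT real data).
acute = "Acute Care (Assessment or intervention q4-8)"
toy = pd.DataFrame({
    "anon_id": ["A", "A", "B", "B", "C", "C"],
    "pat_enc_csn_id_coded": [1, 1, 2, 2, 3, 3],
    "pat_lv_of_care": ["Critical Care", acute, acute, None, acute, "Critical Care"],
    "time_since_admit": pd.to_timedelta([0, 30, 1, 5, 2, 30], unit="h"),
})

# Keep only events within 24 hours of admit.
within_24h = toy[toy["time_since_admit"] < pd.Timedelta(hours=24)]

# Label an encounter 1 if any remaining event is 'Critical Care', else 0.
labels = (
    within_24h.assign(is_critical=within_24h["pat_lv_of_care"].eq("Critical Care"))
    .groupby(["anon_id", "pat_enc_csn_id_coded"])["is_critical"]
    .max()
    .astype(int)
    .rename("label")
    .reset_index()
)
print(labels)
```

Encounter 1 is positive even though the patient later moves to acute care, while encounter 3 is negative because its critical care event falls outside the 24-hour window.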

In [21]:
def was_placed_in_critical_care(arr):
    """Returns 1 if patient was placed in critical care within 24 hours of admit, else 0
       Assumes we have already done the 24 hours logic
       Assumes no overlapping csn ids... """
    for a in arr:
        if a == 'Critical Care':
            return 1
    return 0

In [22]:
# group by anon_id and csn
df_labels = df_lofc.groupby(['anon_id','pat_enc_csn_id_coded']).agg({ #cohort0_all_anon_ids
    'admit_time_jittered' : 'first',
    'pat_lv_of_care' : was_placed_in_critical_care}).rename(
    columns={"pat_lv_of_care" : 'label'}).reset_index()[['anon_id', 'pat_enc_csn_id_coded', 'admit_time_jittered', 'label']]

print(len(df_labels)) # 55170 in this run (55168 in another run)

# Save to .csv and read back as necessary
df_labels.to_csv(os.path.join(cohortdir,"1_1_cohort.csv"), index=False)

# check labels
df_labels.groupby('label').count() # 47290 vs. 7878 (55170: 47294 vs 7876)

55170


Unnamed: 0_level_0,anon_id,pat_enc_csn_id_coded,admit_time_jittered
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,47294,47294,47294
1,7876,7876,7876


In [18]:
# make sure you only see 1's in the anon_id column!! (or else there are multiple anon_ids for a csn)
df_labels[['anon_id','pat_enc_csn_id_coded']].groupby('pat_enc_csn_id_coded').nunique().reset_index().sort_values('anon_id')

Unnamed: 0,pat_enc_csn_id_coded,anon_id
0,131062667066,1
36773,131258544685,1
36774,131258545129,1
36775,131258545430,1
36776,131258545798,1
...,...,...
18394,131210383608,1
18395,131210383668,1
18396,131210383794,1
18398,131210393400,1


### Save to BigQuery: the original 1_1_cohort = cohort _init_

In [25]:
# read back as necessary
df_labels = pd.read_csv(os.path.join(cohortdir, '1_1_cohort.csv'), index_col=False)
print(len(df_labels))
print(list(df_labels.columns))

55170
['anon_id', 'pat_enc_csn_id_coded', 'admit_time_jittered', 'label']


In [26]:
table_schema = [{'name' : 'anon_id', 'type' : 'STRING'},
                {'name' : 'pat_enc_csn_id_coded', 'type' : 'INTEGER'},
                {'name' : 'admit_time_jittered', 'type' : 'TIMESTAMP'},
                {'name' : 'label', 'type' : 'INTEGER'}]
                       
DATASET_NAME = 'triageTD'
TABLE_NAME = '1_1_cohort'
# Build the destination from both constants so the dataset name is not hard-coded twice
df_labels.to_gbq(destination_table='%s.%s' % (DATASET_NAME, TABLE_NAME),
                 project_id='som-nero-phi-jonc101',
                 table_schema=table_schema,
                 if_exists='replace')

1it [00:05,  5.66s/it]
