# Extract and encode MIMIC data
We extract the MIMIC-IV data and encode it as time-series data.
We use the following tables:
- labevents: contains laboratory measurements such as blood tests
- microbiologyevents
- outputevents
- chartevents: contains vital signs

In [1]:
%load_ext autoreload
%autoreload 2

import os 

Define the relative or absolute data path. The mimic data must be in the folders: 
- data_root + '/mimiciv/3.0/hosp/' 
- data_root + '/mimiciv/3.0/icu/' 

In [2]:
# define a relative or absolute path to the mimic folder. 
data_path = '/../data/real_world_data/physionet.org_small/files'
# data_path = "/lustre/groups/labs/marr/qscd01/datasets/vonKleist/physionet.org/files"
root_dir = os.path.dirname(os.path.abspath('UserInterface.ipynb')) + data_path

# define path for processed files
target_path = './data/features/'

In [3]:
import ipywidgets as widgets
import sys
from pathlib import Path
import importlib
import pandas as pd


# make the pipeline's module folders importable
for module_path in ['preprocessing/day_intervals_preproc', 'utils',
                    'preprocessing/hosp_module_preproc', 'model']:
    if module_path not in sys.path:
        sys.path.append(module_path)
#print(sys.path)
# define the path to the data from current repository 
# (from here data is in mimiciv/3.0/hosp )
root_dir = os.path.dirname(os.path.abspath('UserInterface.ipynb')) + data_path

# root_dir = data_path

import day_intervals_cohort
from day_intervals_cohort import *

import day_intervals_cohort_v2
from day_intervals_cohort_v2 import *

import day_intervals_cohort_v3
from day_intervals_cohort_v3 import *

import data_generation_icu

import data_generation
import evaluation

import feature_selection_hosp
from feature_selection_hosp import *

In [4]:
# this is only for training
import ml_models
from ml_models import *

import dl_train
from dl_train import *

import tokenization
from tokenization import *

import behrt_train
from behrt_train import *

import feature_selection_icu
from feature_selection_icu import *
import fairness
import callibrate_output

In [5]:
importlib.reload(day_intervals_cohort)
import day_intervals_cohort
from day_intervals_cohort import *

importlib.reload(day_intervals_cohort_v2)
import day_intervals_cohort_v2
from day_intervals_cohort_v2 import *

importlib.reload(day_intervals_cohort_v3)
import day_intervals_cohort_v3
from day_intervals_cohort_v3 import *


importlib.reload(data_generation_icu)
import data_generation_icu
importlib.reload(data_generation)
import data_generation

importlib.reload(feature_selection_hosp)
import feature_selection_hosp
from feature_selection_hosp import *

importlib.reload(feature_selection_icu)
import feature_selection_icu
from feature_selection_icu import *

importlib.reload(tokenization)
import tokenization
from tokenization import *

importlib.reload(ml_models)
import ml_models
from ml_models import *

importlib.reload(dl_train)
import dl_train
from dl_train import *

importlib.reload(behrt_train)
import behrt_train
from behrt_train import *

importlib.reload(fairness)
import fairness

importlib.reload(callibrate_output)
import callibrate_output

importlib.reload(evaluation)
import evaluation

# Welcome to your MIMIC-IV Project

This repository explains the steps to download and clean the MIMIC-IV dataset for analysis.
The repository is compatible with MIMIC-IV v1.0, v2.0, and v3.0.

Please go to:
- https://physionet.org/content/mimiciv/1.0/ for v1.0
- https://physionet.org/content/mimiciv/2.0/ for v2.0
- https://physionet.org/content/mimiciv/3.0/ for v3.0

Follow instructions to get access to MIMIC-IV dataset.

Download the files using your terminal (replace `mehakg` with your own PhysioNet username):
- wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/1.0/ or
- wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/2.0/ or
- wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/3.0/

Save the downloaded files in the parent directory of this GitHub repo.

The structure should look like below for v1.0-
- mimiciv/1.0/core
- mimiciv/1.0/hosp
- mimiciv/1.0/icu

The structure should look like below for v2.0-
- mimiciv/2.0/hosp
- mimiciv/2.0/icu

The structure should look like below for v3.0-
- mimiciv/3.0/hosp
- mimiciv/3.0/icu
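
Before running the extraction, it can help to confirm that the download has the layout listed above. The following is a minimal sketch, not part of the pipeline: the helper `missing_modules` and the `data_root` argument are hypothetical names introduced here, and only the folder names come from the lists above.

```python
from pathlib import Path

# Sub-folders listed above for each MIMIC-IV release.
EXPECTED_MODULES = {
    "1.0": ("core", "hosp", "icu"),
    "2.0": ("hosp", "icu"),
    "3.0": ("hosp", "icu"),
}


def missing_modules(data_root, version):
    """Return the expected sub-folders that are absent under data_root/mimiciv/<version>."""
    base = Path(data_root) / "mimiciv" / version
    return [name for name in EXPECTED_MODULES[version] if not (base / name).is_dir()]


# "." is a placeholder; point it at the folder that contains "mimiciv".
print(missing_modules(".", "3.0"))
```

An empty list means every expected folder was found; any names returned are the folders still to be downloaded or moved.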

## 1. DATA EXTRACTION
Please run the cell below to select options for cohort selection.
The cohort will be saved in **./data/cohort/**

In [6]:
print("Please select the appropriate version of MIMIC-IV for which you have downloaded data ?")
version = widgets.RadioButtons(options=['Version 1','Version 2','Version 3'],value='Version 3')
display(version)

print("Please select what prediction task you want to perform ?")
radio_input4 = widgets.RadioButtons(options=['Mortality','Length of Stay','Readmission','Phenotype'],value='Phenotype')
display(radio_input4)


Please select the appropriate version of MIMIC-IV for which you have downloaded data ?


RadioButtons(index=2, options=('Version 1', 'Version 2', 'Version 3'), value='Version 3')

Please select what prediction task you want to perform ?


RadioButtons(index=3, options=('Mortality', 'Length of Stay', 'Readmission', 'Phenotype'), value='Phenotype')

### Refining Cohort and Prediction Task Definition

Based on your current selection, the following block provides options to further refine the prediction task and the cohort associated with it:

- First, refine the prediction task by choosing from the following options -
    - **Length of Stay** - Select one of the predefined options or enter a custom number of days to predict a length of stay greater than or equal to that number of days.

    - **Readmission** - Select one of the predefined options or enter a custom number of days to predict readmission in the chosen number of days after the previous admission.

    - **Phenotype Prediction** - Select one of four major chronic diseases to predict its future outcome (in the code below, readmission due to that disease in 30 days)

        - Heart failure
        - CAD (Coronary Artery Disease)
        - CKD (Chronic Kidney Disease)
        - COPD (Chronic obstructive pulmonary disease)

- Second, choose whether to perform the above task using ICU or non-ICU admissions data

- Third, you can refine the cohort selection for any of the above prediction tasks by including only the admissions associated with a particular chronic disease -
    - Heart failure
    - CAD (Coronary Artery Disease)
    - CKD (Chronic Kidney Disease)
    - COPD (Chronic obstructive pulmonary disease)

**Please run the cell below to extract the cohort for the selected options**
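
The way these choices turn into extraction parameters can be sketched as a small lookup. This is an illustrative condensation of the logic in the cells that follow, not code from the pipeline: `resolve_task` and `DISEASE_ROOT` are hypothetical names, while the option strings and ICD-10 roots are the ones used in this notebook.

```python
# ICD-10 roots used by the notebook for the four chronic diseases.
DISEASE_ROOT = {"Heart Failure": "I50", "CAD": "I25", "CKD": "N18", "COPD": "J44"}


def resolve_task(task, option, custom_days=None):
    """Map the widget choices to (label, time_in_days, disease_label)."""
    if task == "Mortality":
        return "Mortality", 0, ""
    if task == "Phenotype":
        # e.g. "CKD in 30 days" -> disease "CKD", window of 30 days
        disease, window = option.split(" in ")
        return "Readmission", int(window.split()[0]), DISEASE_ROOT[disease]
    if option == "Custom":
        return task, int(custom_days), ""
    # "30 Day Readmission" -> 30, "Length of Stay ge 3" -> 3
    days = next(int(token) for token in option.split() if token.isdigit())
    return task, days, ""


print(resolve_task("Phenotype", "CKD in 30 days"))
```

A phenotype choice is therefore treated as a readmission task restricted to one disease, which is why the extraction log later reports a readmission label.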

In [7]:
# define what data to load 
if radio_input4.value=='Length of Stay':
    options=['Length of Stay ge 3','Length of Stay ge 7','Custom']
    value='Length of Stay ge 3'
    radio_input2 = widgets.RadioButtons(options=options, value=value)
    display(radio_input2)
    # slider used only when 'Custom' is selected
    text1=widgets.IntSlider(
        value=3,
        min=1,
        max=10,
        step=1,
        disabled=False,
        continuous_update=False,
        orientation='horizontal',
        readout=True,
        readout_format='d'
    )
    display(widgets.HBox([widgets.Label('Length of stay ge (in days)',layout={'width': '180px'}), text1]))
elif radio_input4.value=='Readmission':
    options=['30 Day Readmission','60 Day Readmission','90 Day Readmission',
             '120 Day Readmission','Custom']
    value= '30 Day Readmission'
    radio_input2 = widgets.RadioButtons(options = options,value = value)
    display(radio_input2)
    text1=widgets.IntSlider(
    value=30,
    min=10,
    max=150,
    step=10,
    disabled=False
    )
    display(widgets.HBox([widgets.Label('Readmission after (in days)',layout={'width': '180px'}), text1]))
elif radio_input4.value=='Phenotype':
    options = ['Heart Failure in 30 days','CAD in 30 days','CKD in 30 days','COPD in 30 days']
    value ='CKD in 30 days'
    radio_input2 = widgets.RadioButtons(options=options, value = value)
    display(radio_input2)
elif radio_input4.value=='Mortality':
    options=['Mortality']
    value='Mortality'
    radio_input2 = widgets.RadioButtons(options=options,value=value)
    #display(radio_input2)

print("Extract Data")
print("Please select below if you want to work with ICU or Non-ICU data ?")
radio_input1 = widgets.RadioButtons(options=['ICU', 'Non-ICU'],value='ICU')
display(radio_input1)

print("Please select if you want to perform chosen prediction task for a specific disease.")
radio_input3 = widgets.RadioButtons(options=['No Disease Filter','Heart Failure','CKD','CAD','COPD'],value='CKD')
display(radio_input3)

RadioButtons(index=2, options=('Heart Failure in 30 days', 'CAD in 30 days', 'CKD in 30 days', 'COPD in 30 day…

Extract Data
Please select below if you want to work with ICU or Non-ICU data ?


RadioButtons(options=('ICU', 'Non-ICU'), value='ICU')

Please select if you want to perform chosen prediction task for a specific disease.


RadioButtons(index=2, options=('No Disease Filter', 'Heart Failure', 'CKD', 'CAD', 'COPD'), value='CKD')

In [8]:
disease_label=""
time=0
label=radio_input4.value

if label=='Readmission':
    if radio_input2.value=='Custom':
        time=text1.value
    else:
        time=int(radio_input2.value.split()[0])
elif label=='Length of Stay':
    if radio_input2.value=='Custom':
        time=text1.value
    else:
        time=int(radio_input2.value.split()[4])

if label=='Phenotype':    
    if radio_input2.value=='Heart Failure in 30 days':
        label='Readmission'
        time=30
        disease_label='I50'
    elif radio_input2.value=='CAD in 30 days':
        label='Readmission'
        time=30
        disease_label='I25'
    elif radio_input2.value=='CKD in 30 days':
        label='Readmission'
        time=30
        disease_label='N18'
    elif radio_input2.value=='COPD in 30 days':
        label='Readmission'
        time=30
        disease_label='J44'
    
data_icu=radio_input1.value=="ICU"
data_mort=label=="Mortality"
data_admn=label=='Readmission'
data_los=label=='Length of Stay'
        

if (radio_input3.value=="Heart Failure"):
    icd_code='I50'
elif (radio_input3.value=="CKD"):
    icd_code='N18'
elif (radio_input3.value=="COPD"):
    icd_code='J44'
elif (radio_input3.value=="CAD"):
    icd_code='I25'
else:
    icd_code='No Disease Filter'

if version.value=='Version 1':
    version_path= "mimiciv/1.0"
    cohort_output = day_intervals_cohort.extract_data(radio_input1.value,label,time,icd_code, root_dir,disease_label)
elif version.value=='Version 2':
    version_path= "mimiciv/2.0"
    cohort_output = day_intervals_cohort_v2.extract_data(radio_input1.value,label,time,icd_code, root_dir,disease_label)
elif version.value=='Version 3':
    version_path= "mimiciv/3.0"
    cohort_output = day_intervals_cohort_v3.extract_data(radio_input1.value,label,time,icd_code, root_dir,disease_label)

EXTRACTING FOR: | ICU | READMISSION DUE TO N18 | ADMITTED DUE TO N18 | 30 |


  2%|█▌                                                                                | 11/565 [00:00<00:05, 98.79it/s]

[ READMISSION DUE TO N18 ]


100%|█████████████████████████████████████████████████████████████████████████████████| 565/565 [00:05<00:00, 94.89it/s]


[ READMISSION LABELS FINISHED ]
[ COHORT SUCCESSFULLY SAVED ]
[ SUMMARY SUCCESSFULLY SAVED ]
Readmission FOR ICU DATA
# Admission Records: 902
# Patients: 565
# Positive cases: 143
# Negative cases: 759
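
As a quick sanity check on the summary above, the label counts can be compared against the number of admission records. The numbers below are copied from this output; nothing else is assumed.

```python
# Counts copied from the cohort summary printed above.
admissions, positives, negatives = 902, 143, 759

# Every admission record carries exactly one label.
assert positives + negatives == admissions

prevalence = positives / admissions
print(f"positive rate: {prevalence:.1%}")
```

Roughly one admission in six is a positive case, so the cohort is noticeably imbalanced.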


## 2. FEATURE SELECTION
Features available for ICU data -
- Diagnosis (https://mimic.mit.edu/docs/iv/modules/hosp/diagnoses_icd/)
- Procedures (https://mimic.mit.edu/docs/iv/modules/icu/procedureevents/)
- Medications (https://mimic.mit.edu/docs/iv/modules/icu/inputevents/)
- Output Events (https://mimic.mit.edu/docs/iv/modules/icu/outputevents/)
- Chart Events (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/)
- Lab Events (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/)
- Microbiology Events (https://mimic.mit.edu/docs/iv/modules/hosp/microbiologyevents/)

All features will be saved in the defined **target_path**

**Please run below cell to select features**

In [53]:
print("Feature Selection")
if data_icu:
    print("Which Features you want to include for cohort?")
    check_input1 = widgets.Checkbox(description='Diagnosis', value=True)
    display(check_input1)
    check_input2 = widgets.Checkbox(description='Output Events', value=True)
    display(check_input2)
    check_input3 = widgets.Checkbox(description='Chart Events', value=True)
    display(check_input3)
    check_input4 = widgets.Checkbox(description='Procedures', value=True)
    display(check_input4)
    check_input5 = widgets.Checkbox(description='Medications', value=True)
    display(check_input5)
    check_input100 = widgets.Checkbox(description='Lab Events', value=True)
    display(check_input100)
    check_input101 = widgets.Checkbox(description='Microbiology Events', value=False)
    display(check_input101)

    print("icu")
else:
    print("Which Features you want to include for cohort?")
    check_input1 = widgets.Checkbox(description='Diagnosis', value=True)
    display(check_input1)
    check_input2 = widgets.Checkbox(description='Labs', value=True)
    display(check_input2)
    check_input3 = widgets.Checkbox(description='Procedures', value=True)
    display(check_input3)
    check_input4 = widgets.Checkbox(description='Medications', value=True )
    display(check_input4)
print("**Please run below cell to extract selected features**")

Feature Selection
Which Features you want to include for cohort?


Checkbox(value=True, description='Diagnosis')

Checkbox(value=True, description='Output Events')

Checkbox(value=True, description='Chart Events')

Checkbox(value=True, description='Procedures')

Checkbox(value=True, description='Medications')

Checkbox(value=True, description='Lab Events')

Checkbox(value=False, description='Microbiology Events')

icu
**Please run below cell to extract selected features**


In [54]:
diag_flag=check_input1.value
out_flag=check_input2.value
chart_flag=check_input3.value
proc_flag=check_input4.value
med_flag=check_input5.value
lab_flag=check_input100.value
micro_flag=check_input101.value

In [61]:
# load and preprocess the data (by dropping some columns)
if data_icu:
    diag_flag=check_input1.value
    out_flag=check_input2.value
    chart_flag=check_input3.value
    proc_flag=check_input4.value
    med_flag=check_input5.value

    lab_flag=check_input100.value
    micro_flag=check_input101.value

    
    #feature_icu(cohort_output, root_dir, root_dir + '/'+ version_path,diag_flag,out_flag,chart_flag,proc_flag,med_flag)
    data = feature_icu( cohort_output = cohort_output, 
                        root_dir = root_dir, 
                        version_path = root_dir + '/'+ version_path,
                        save_path = target_path, 
                        diag_flag = diag_flag,
                        out_flag = out_flag,
                        chart_flag = chart_flag,
                        proc_flag = proc_flag,
                        med_flag  = med_flag,
                        lab_flag = lab_flag,
                        micro_flag = micro_flag)
else:
    # not adapted yet
    print("Warning: this code hasn't been checked!")
    diag_flag=check_input1.value
    lab_flag=check_input2.value
    proc_flag=check_input3.value
    med_flag=check_input4.value
    # argument order not verified against the signature of feature_nonicu
    feature_nonicu(cohort_output, root_dir, root_dir+ '/'+ version_path , version_path,diag_flag,lab_flag,proc_flag,med_flag)

[EXTRACTING DIAGNOSIS DATA]
# unique ICD-9 codes 1130
# unique ICD-10 codes 1853
# unique ICD-10 codes (After converting ICD-9 to ICD-10) 2067
# unique ICD-10 codes (After clinical gruping ICD-10 codes) 695
# Admissions:   902
Total rows 21812
Columns kept for diagnosis: ['subject_id', 'hadm_id', 'stay_id', 'icd_code', 'root_icd10_convert', 'root']
[SUCCESSFULLY SAVED DIAGNOSIS DATA]
[EXTRACTING OUTPUT EVENTS DATA]
# Unique Events:   59
# Admissions:   833
Total rows 43693
Columns kept for output events: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'charttime', 'intime', 'event_time_from_admit', 'value']


0it [00:00, ?it/s]

[SUCCESSFULLY SAVED OUTPUT EVENTS DATA]
[EXTRACTING CHART EVENTS DATA]


3it [00:41, 13.98s/it]


# Unique Events:   681
# Admissions:   902
Total rows 1496896
Columns kept for chart events: ['stay_id', 'itemid', 'event_time_from_admit', 'valuenum']
[SUCCESSFULLY SAVED CHART EVENTS DATA]
[EXTRACTING PROCEDURES DATA]
# Unique Events:   120
# Admissions:   796
Total rows 6821
Columns kept for procedures: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'starttime', 'intime', 'event_time_from_admit']
[SUCCESSFULLY SAVED PROCEDURES DATA]
[EXTRACTING MEDICATION DATA]
# of unique type of drug:  104
# Admissions:   731
# Total rows 52798
Columns kept for medication: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'starttime', 'endtime', 'start_hours_from_admit', 'stop_hours_from_admit', 'rate', 'amount', 'orderid']


0it [00:00, ?it/s]

[SUCCESSFULLY SAVED MEDICATION DATA]
[EXTRACTING LAB EVENTS DATA]


2it [00:03,  1.96s/it]


# of unique type of lab events:  461
# Admissions:   282
# Total rows 167158
Columns kept for lab events: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'charttime', 'storetime', 'valuenum', 'valueuom', 'ref_range_lower', 'ref_range_upper', 'chart_hours_from_admit', 'store_hours_from_admit']
[SUCCESSFULLY SAVED LAB EVENTS DATA]


In [57]:
# look at the data 
data['chart']

Unnamed: 0,stay_id,itemid,event_time_from_admit,valuenum
0,32669861,220045,1 days 22:29:31,93.0
1,32669861,220210,1 days 22:29:31,26.0
2,32669861,220277,1 days 22:29:31,93.0
3,32669861,220179,1 days 22:30:31,128.0
4,32669861,220180,1 days 22:30:31,55.0
...,...,...,...,...
1496889,30259797,227442,12 days 16:33:25,4.7
1496890,30259797,227443,12 days 16:33:25,29.0
1496891,30259797,227465,12 days 16:33:25,14.2
1496892,30259797,227466,12 days 16:33:25,28.1
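
The `event_time_from_admit` column is shown as text such as `1 days 22:29:31`. For the later time-series steps it is often convenient to express it in hours. The sketch below does this with the standard library only; `hours_from_admit` is a hypothetical helper, and in pandas the same conversion is usually done with `pd.to_timedelta`.

```python
import re
from datetime import timedelta

_PATTERN = re.compile(r"(\d+) days (\d+):(\d+):(\d+)")


def hours_from_admit(text):
    """Convert a string such as '1 days 22:29:31' into hours since admission."""
    days, hours, minutes, seconds = map(int, _PATTERN.fullmatch(text).groups())
    delta = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    return delta.total_seconds() / 3600


print(round(hours_from_admit("1 days 22:29:31"), 2))
```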


## 3. CLINICAL GROUPING
Below you have the option to clinically group diagnoses and medications.
Grouping medical codes reduces the dimensionality of the feature space.

The default options selected below group medical codes to reduce the feature dimension.
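
To illustrate why grouping shrinks the feature space, the sketch below truncates ICD-10 codes to their three-character category. This is an assumption made for illustration: the notebook does not show how the pipeline groups codes, although roots such as `N18` and `I50` used earlier have this form. `icd10_root` is a hypothetical helper.

```python
def icd10_root(code):
    """Return the three-character ICD-10 category of a code, ignoring dots."""
    return code.replace(".", "").strip().upper()[:3]


codes = ["N18.3", "N184", "N18.6", "I50.22", "I5032", "J44.1"]
roots = sorted({icd10_root(c) for c in codes})
print(len(codes), "codes ->", len(roots), "groups:", roots)
```

Six distinct codes collapse into three groups here, which mirrors the drop in unique codes reported by the extraction log.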

**Please run the cell below to select preprocessing for different features**

In [62]:
if data_icu:
    if diag_flag:
        print("Do you want to group ICD 10 DIAG codes ?")
        radio_input4 = widgets.RadioButtons(options=['Keep both ICD-9 and ICD-10 codes','Convert ICD-9 to ICD-10 codes','Convert ICD-9 to ICD-10 and group ICD-10 codes'],value='Convert ICD-9 to ICD-10 and group ICD-10 codes',layout={'width': '100%'})
        display(radio_input4)   
    
else:
    if diag_flag:
        print("Do you want to group ICD 10 DIAG codes ?")
        radio_input4 = widgets.RadioButtons(options=['Keep both ICD-9 and ICD-10 codes','Convert ICD-9 to ICD-10 codes','Convert ICD-9 to ICD-10 and group ICD-10 codes'],value='Convert ICD-9 to ICD-10 and group ICD-10 codes',layout={'width': '100%'})
        display(radio_input4)     
    if med_flag:
        print("Do you want to group Medication codes to use Non proprietary names?")
        radio_input5 = widgets.RadioButtons(options=['Yes','No'],value='Yes',layout={'width': '100%'})
        display(radio_input5)
    if proc_flag:
        print("Which ICD codes for Procedures you want to keep in data?")
        radio_input6 = widgets.RadioButtons(options=['ICD-9 and ICD-10','ICD-10'],value='ICD-10',layout={'width': '100%'})
        display(radio_input6)
print("**Please run below cell to perform feature preprocessing**")

Do you want to group ICD 10 DIAG codes ?


RadioButtons(index=2, layout=Layout(width='100%'), options=('Keep both ICD-9 and ICD-10 codes', 'Convert ICD-9…

**Please run below cell to perform feature preprocessing**


In [63]:
group_diag=False
group_med=False
group_proc=False
if data_icu:
    if diag_flag:
        group_diag=radio_input4.value
    preprocess_features_icu(cohort_output = cohort_output, 
                            save_path = target_path, 
                            diag_flag = diag_flag, 
                            group_diag = group_diag,
                            chart_flag = False, 
                            clean_chart = False, 
                            impute_outlier_chart = False,  
                            thresh = 0, 
                            left_thresh= 0)
else:
    # not adapted yet
    print("Warning: this code hasn't been checked!")
    if diag_flag:
        group_diag=radio_input4.value
    if med_flag:
        group_med=radio_input5.value
    if proc_flag:
        group_proc=radio_input6.value
    preprocess_features_hosp(cohort_output, diag_flag,proc_flag,med_flag,False,group_diag,group_med,group_proc,False,False,0,0)

[PROCESSING DIAGNOSIS DATA]
Total number of rows 21238
[SUCCESSFULLY SAVED DIAGNOSIS DATA]


## 4. SUMMARY OF FEATURES

This step generates a summary of all features extracted so far.<br>
It saves the summary files in **./data/summary/**<br>
- These files report the **mean frequency** of medical codes per admission.<br>
- They also report the **total occurrence count** of each medical code.<br>
- For labs and chart events, they also report the <br>**missing %**, i.e. the percentage of rows for a given medical code that have a missing value.

Please use this information to further refine your cohort by selecting <br>which medical codes in each feature you want to keep and <br>which codes you would like to remove for downstream analysis tasks.
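
The three statistics can be sketched on a toy table as follows. This is not the pipeline's implementation: `summarize` is a hypothetical helper, and the mean frequency is assumed to be averaged over the admissions in which a code appears at least once.

```python
from collections import defaultdict


def summarize(rows):
    """rows: iterable of (admission_id, code, value); value is None when missing."""
    total = defaultdict(int)       # occurrences of each code
    missing = defaultdict(int)     # occurrences without a recorded value
    admissions = defaultdict(set)  # admissions in which the code appears
    for adm, code, value in rows:
        total[code] += 1
        admissions[code].add(adm)
        if value is None:
            missing[code] += 1
    return {
        code: {
            "total_count": total[code],
            "mean_frequency": total[code] / len(admissions[code]),
            "missing_pct": 100.0 * missing[code] / total[code],
        }
        for code in total
    }


rows = [(1, "A", 5.0), (1, "A", None), (2, "A", 7.0), (2, "B", 1.0)]
summary = summarize(rows)
print(summary["A"])
```

Codes with a high missing percentage or a very low total count are natural candidates for removal in the next step.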

**Please run below cell to generate summary files**

In [64]:
if data_icu:
    #generate_summary_icu(diag_flag,proc_flag,med_flag,out_flag,chart_flag)
    generate_summary_icu(diag_flag,proc_flag,med_flag,out_flag,chart_flag, lab_flag, micro_flag)
else:
    generate_summary_hosp(diag_flag,proc_flag,med_flag,lab_flag)

[GENERATING FEATURE SUMMARY]
[SUCCESSFULLY SAVED FEATURE SUMMARY]


## 5. Feature Selection

Based on the files generated in the previous step and other information you have gathered,<br>
please select which medical codes you want to include in this study.

Please run the cell below to select the features for which you want to perform feature selection.

- Select **Yes** if you want to keep only a subset of medical codes for that feature, and<br> **edit** the corresponding feature file accordingly.
- Select **No** if you want to keep all the codes in a feature.
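
Conceptually, selecting **Yes** means that only the codes still listed in the edited file survive. The sketch below shows that idea with an in-memory stand-in for the file. The column name `itemid`, the file contents, and both helper functions are assumptions for illustration; the real feature files may be organised differently.

```python
import csv
import io


def read_kept_codes(csv_text, column="itemid"):
    """Return the set of codes that remain in an edited feature file."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


def filter_events(events, kept_codes):
    """Keep only the events whose code is still listed in the feature file."""
    return [e for e in events if e["itemid"] in kept_codes]


# Stand-in for an edited summary file; the real column names may differ.
edited_file = "itemid,total_count\n220045,120\n220277,95\n"
events = [
    {"stay_id": "1", "itemid": "220045"},
    {"stay_id": "1", "itemid": "220210"},
    {"stay_id": "2", "itemid": "220277"},
]
selected = filter_events(events, read_kept_codes(edited_file))
print(selected)
```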

In [65]:
if data_icu:
    if diag_flag:
        print("Do you want to do Feature Selection for Diagnosis \n (If yes, please edit list of codes in ./data/summary/diag_features.csv)")
        radio_input4 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input4)       
    if med_flag:
        print("Do you want to do Feature Selection for Medication \n (If yes, please edit list of codes in ./data/summary/med_features.csv)")
        radio_input5 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input5)   
    if proc_flag:
        print("Do you want to do Feature Selection for Procedures \n (If yes, please edit list of codes in ./data/summary/proc_features.csv)")
        radio_input6 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input6)   
    if out_flag:
        print("Do you want to do Feature Selection for Output event \n (If yes, please edit list of codes in ./data/summary/out_features.csv)")
        radio_input7 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input7)  
    if chart_flag:
        print("Do you want to do Feature Selection for Chart events \n (If yes, please edit list of codes in ./data/summary/chart_features.csv)")
        radio_input8 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input8)  
else:
    if diag_flag:
        print("Do you want to do Feature Selection for Diagnosis \n (If yes, please edit list of codes in ./data/summary/diag_features.csv)")
        radio_input4 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input4)         
    if med_flag:
        print("Do you want to do Feature Selection for Medication \n (If yes, please edit list of codes in ./data/summary/med_features.csv)")
        radio_input5 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input5)   
    if proc_flag:
        print("Do you want to do Feature Selection for Procedures \n (If yes, please edit list of codes in ./data/summary/proc_features.csv)")
        radio_input6 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input6)   
    if lab_flag:
        print("Do you want to do Feature Selection for Labs \n (If yes, please edit list of codes in ./data/summary/lab_features.csv)")
        radio_input7 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input7)   
print("**Please run below cell to perform feature selection**")

Do you want to do Feature Selection for Diagnosis 
 (If yes, please edit list of codes in ./data/summary/diag_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Medication 
 (If yes, please edit list of codes in ./data/summary/med_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Procedures 
 (If yes, please edit list of codes in ./data/summary/proc_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Output event 
 (If yes, please edit list of codes in ./data/summary/out_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Chart events 
 (If yes, please edit list of codes in ./data/summary/chart_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

**Please run below cell to perform feature selection**


In [66]:
select_diag=False
select_med=False
select_proc=False
select_lab=False
select_out=False
select_chart=False

if data_icu:
    if diag_flag:
        select_diag=radio_input4.value == 'Yes'
    if med_flag:
        select_med=radio_input5.value == 'Yes'
    if proc_flag:
        select_proc=radio_input6.value == 'Yes'
    if out_flag:
        select_out=radio_input7.value == 'Yes'
    if chart_flag:
        select_chart=radio_input8.value == 'Yes'
    features_selection_icu(cohort_output, diag_flag,proc_flag,med_flag,out_flag, chart_flag,select_diag,select_med,select_proc,select_out,select_chart)
else:
    if diag_flag:
        select_diag=radio_input4.value == 'Yes'
    if med_flag:
        select_med=radio_input5.value == 'Yes'
    if proc_flag:
        select_proc=radio_input6.value == 'Yes'
    if lab_flag:
        select_lab=radio_input7.value == 'Yes'
    features_selection_hosp(cohort_output, diag_flag,proc_flag,med_flag,lab_flag,select_diag,select_med,select_proc,select_lab)

## 6. CLEANING OF FEATURES
Below you have the option to clean lab and chart events by performing outlier removal and unit conversion.

Outlier removal discards values higher than the selected **right threshold** percentile and lower than the selected **left threshold** percentile among all values for each itemid.
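
A percentile-based filter of this kind can be sketched as below for the values of a single itemid. Two points are assumptions, since the notebook does not show the pipeline's internals: percentiles are computed by linear interpolation, and "impute" is taken to mean replacing an outlier with the nearest threshold value. `percentile` and `clean_outliers` are hypothetical helpers.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def clean_outliers(values, right=98, left=0, impute=False):
    """Drop values outside the [left, right] percentile band, or clip them if impute=True."""
    ordered = sorted(values)
    low, high = percentile(ordered, left), percentile(ordered, right)
    if impute:
        return [min(max(v, low), high) for v in values]
    return [v for v in values if low <= v <= high]


readings = [60, 62, 65, 70, 72, 75, 80, 85, 90, 400]  # 400 is implausible
print(clean_outliers(readings, right=90))
print(clean_outliers(readings, right=90, impute=True))
```

Removal shortens the series, whereas imputation keeps its length, which matters if later steps expect one value per time point.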

**Please run the cell below to select preprocessing for different features**

In [67]:
if data_icu:
    if chart_flag:
        print("Outlier removal in values of chart events ?")
        layout = widgets.Layout(width='100%', height='40px') #set width and height

        radio_input5 = widgets.RadioButtons(options=['No outlier detection','Impute Outlier (default:98)','Remove outliers (default:98)'],value='No outlier detection',layout=layout)
        display(radio_input5)
        outlier=widgets.IntSlider(
        value=98,
        min=90,
        max=99,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        left_outlier=widgets.IntSlider(
        value=0,
        min=0,
        max=10,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        #display(oulier)
        display(widgets.HBox([widgets.Label('Right Outlier Threshold',layout={'width': '150px'}), outlier]))
        display(widgets.HBox([widgets.Label('Left Outlier Threshold',layout={'width': '150px'}), left_outlier]))
    
else:      
    if lab_flag:
        print("Outlier removal in values of lab events ?")
        layout = widgets.Layout(width='100%', height='40px') #set width and height

        radio_input7 = widgets.RadioButtons(options=['No outlier detection','Impute Outlier (default:98)','Remove outliers (default:98)'],value='No outlier detection',layout=layout)
        display(radio_input7)
        outlier=widgets.IntSlider(
        value=98,
        min=90,
        max=99,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        left_outlier=widgets.IntSlider(
        value=0,
        min=0,
        max=10,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        #display(oulier)
        display(widgets.HBox([widgets.Label('Right Outlier Threshold',layout={'width': '150px'}), outlier]))
        display(widgets.HBox([widgets.Label('Left Outlier Threshold',layout={'width': '150px'}), left_outlier]))
print("**Please run below cell to perform feature preprocessing**")

Outlier removal in values of chart events ?


RadioButtons(layout=Layout(height='40px', width='100%'), options=('No outlier detection', 'Impute Outlier (def…

HBox(children=(Label(value='Right Outlier Threshold', layout=Layout(width='150px')), IntSlider(value=98, layou…

HBox(children=(Label(value='Left Outlier Threshold', layout=Layout(width='150px')), IntSlider(value=0, layout=…

**Please run below cell to perform feature preprocessing**


In [69]:
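# defaults, so every variable is defined even when the corresponding flag is off
thresh=0
left_thresh=0
clean_chart=False
impute_outlier_chart=False
clean_lab=False
impute_outlier=False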
if data_icu:
    if chart_flag:
        clean_chart=radio_input5.value!='No outlier detection'
        impute_outlier_chart=radio_input5.value=='Impute Outlier (default:98)'
        thresh=outlier.value
        left_thresh=left_outlier.value
    preprocess_features_icu(cohort_output = cohort_output, 
                            save_path = target_path, 
                            diag_flag = False, 
                            group_diag = False,
                            chart_flag = chart_flag,
                            clean_chart = clean_chart,
                            impute_outlier_chart = impute_outlier_chart,
                            thresh = thresh,
                            left_thresh = left_thresh)
else:
    # not adapted yet
    print("Warning: this code hasn't been checked!")
    if lab_flag:
        clean_lab=radio_input7.value!='No outlier detection'
        impute_outlier=radio_input7.value=='Impute Outlier (default:98)'
        thresh=outlier.value
        left_thresh=left_outlier.value
    preprocess_features_hosp(cohort_output, False,False,False,lab_flag,False,False,False,clean_lab,impute_outlier,thresh,left_thresh)

## Visualize the data
Before encoding the data into time-series format, we first load and visualize it to understand it better.

In [81]:
from data_loading.load_preprocessed_data import DataLoader

#data_loader = DataLoader(root_dir, cohort_output,data_mort,data_admn,data_los,diag_flag,proc_flag,out_flag,chart_flag,med_flag, lab_flag, micro_flag,impute, include,bucket,predW)
data_loader = DataLoader(root_dir = root_dir, 
                         cohort_output = cohort_output,
                         if_mort = data_mort,
                         if_admn = data_admn,
                         if_los = data_los,
                         feat_cond = diag_flag,
                         feat_proc = proc_flag,
                         feat_out = out_flag,
                         feat_chart = chart_flag,
                         feat_med = med_flag, 
                         feat_lab = lab_flag, 
                         feat_micro = micro_flag,
                         impute  = 0)

In [82]:
# load each table in a dictionary 
dataset = data_loader.load()
print("Loaded the data for: ", dataset.keys())

[ READ ADM FEATURES ]


0it [00:00, ?it/s]



1it [00:08,  8.60s/it]


[ READ ALL FEATURES ]
[ PROCESSED TIME SERIES TO EQUAL LENGTH  ]
Loaded the data for:  dict_keys(['data', 'cond', 'cond_per_adm', 'meds', 'proc', 'out', 'chart', 'lab'])


In [93]:
# now we can look at individual tables
dataset['data']['stay_id'].unique().shape

(718,)

## 7. Time-Series Representation
In this section, choose how to process and represent the time-series data.

- The first option is the length of time-series data to include in the study (default: 72 hours).

- The second option is the bucket size, i.e. the width of the time windows into which the time series is divided.<br>
For example, with a bucket size of **2**, data is aggregated every 2 hours, and <br>a 24-hour time series is represented by 12 time windows, <br>each aggregating 2 hours of the original raw time series.
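
The bucketing described above can be sketched with pandas. This is a minimal illustration rather than the pipeline's implementation: the column names `hour` and `value`, the toy measurements, and the choice of the mean as the aggregate are assumptions.

```python
import pandas as pd

# Toy measurements for one stay (hypothetical values), indexed by hour since admission.
events = pd.DataFrame({
    "hour":  [0, 1, 2, 3, 5],
    "value": [80.0, 84.0, 90.0, 94.0, 100.0],
})

bucket = 2   # bucket size in hours
include = 8  # length of the included time series in hours

# Assign each measurement to a bucket and average within the bucket.
events["bucket"] = events["hour"] // bucket
bucketed = (
    events.groupby("bucket")["value"].mean()
    .reindex(range(include // bucket))  # buckets without measurements stay NaN
)
print(bucketed)
```

With these settings the 8-hour series is reduced to 4 windows; the last window has no measurement and remains missing until imputation.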

During this step, we also save the time-series data in data dictionaries, in a format that can be used directly for the subsequent deep learning analysis.

### Imputation
You can also choose whether to impute lab/chart values. Imputation uses forward fill followed by mean or median imputation.<br>
Values are forward filled first; if no value exists for that admission, the mean or median value for the patient is used.
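
The two imputation steps can be sketched as follows. This is a hedged illustration, not the pipeline's code: the column names and toy values are assumptions, and the fallback statistic is computed over all observed values in the toy table, which may differ from the population the pipeline uses.

```python
import pandas as pd

# Toy bucketed series for two stays; NaN marks buckets without a measurement.
nan = float("nan")
ts = pd.DataFrame({
    "stay_id": [1, 1, 1, 2, 2, 2],
    "bucket":  [0, 1, 2, 0, 1, 2],
    "value":   [nan, 6.0, nan, 8.0, nan, nan],
})

# Step 1: forward fill within each stay, so values never leak across stays.
ts["imputed"] = ts.groupby("stay_id")["value"].ffill()

# Step 2: fill what is still missing with a summary statistic.
# The mean of the observed values is used here; use .median() for median imputation.
fallback = ts["value"].mean()
ts["imputed"] = ts["imputed"].fillna(fallback)
print(ts)
```

Only the first bucket of stay 1 needs the fallback, because no earlier value exists to carry forward.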

The data dictionaries will be saved in **./data/dict/**

Please refer to the readme for the structure of the data dictionaries.

**Please run the cell below to select the time-series representation**

In [None]:
print("=======Time-series Data Representation=======")

print("Length of data to be included for time-series prediction ?")
if(data_mort):
    radio_input8 = widgets.RadioButtons(options=['First 72 hours','First 48 hours','First 24 hours','Custom'],value='First 72 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=24,
    max=72,
    step=1,
    description='First',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('First (in hours):',layout={'width': '150px'}), text2]))
elif(data_admn):
    radio_input8 = widgets.RadioButtons(options=['Last 72 hours','Last 48 hours','Last 24 hours','Custom'],value='Last 72 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=24,
    max=72,
    step=1,
    description='Last',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('Last (in hours):',layout={'width': '150px'}), text2]))
elif(data_los):
    radio_input8 = widgets.RadioButtons(options=['First 12 hours','First 24 hours','Custom'],value='First 24 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=12,
    max=72,
    step=1,
    description='First',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('First (in hours):',layout={'width': '150px'}), text2]))
    
    
print("What time bucket size do you want to choose?")
radio_input7 = widgets.RadioButtons(options=['1 hour','2 hour','3 hour','4 hour','5 hour','Custom'],value='1 hour')
display(radio_input7)
text1=widgets.IntSlider(
    value=1,
    min=1,
    max=6,
    step=1,
    disabled=False
    )
#display(text1)
display(widgets.HBox([widgets.Label('Bucket Size (in hours):',layout={'width': '150px'}), text1]))
print("Do you want to forward fill and mean or median impute lab/chart values to form continuous data signal?")
radio_impute = widgets.RadioButtons(options=['No Imputation', 'forward fill and mean','forward fill and median'],value='No Imputation')
display(radio_impute)   

radio_input6 = widgets.RadioButtons(options=['0 hours','2 hours','4 hours','6 hours'],value='0 hours')
if(data_mort):
    print("If you have chosen the mortality prediction task, what prediction window length do you want to keep?")
    radio_input6 = widgets.RadioButtons(options=['2 hours','4 hours','6 hours','8 hours','Custom'],value='2 hours')
    display(radio_input6)
    text3=widgets.IntSlider(
    value=2,
    min=2,
    max=8,
    step=1,
    disabled=False
    )
    display(widgets.HBox([widgets.Label('Prediction window (in hours)',layout={'width': '180px'}), text3]))
print("**Please run the cell below to perform the time-series representation and save it in data dictionaries**")

In [None]:
if (radio_input6.value=='Custom'):
    predW=int(text3.value)
else:
    # take the leading number of the option, e.g. '2 hours' -> 2
    predW=int(radio_input6.value.split()[0])
if (radio_input7.value=='Custom'):
    bucket=int(text1.value)
else:
    bucket=int(radio_input7.value.split()[0])
if (radio_input8.value=='Custom'):
    include=int(text2.value)
else:
    include=int(radio_input8.value.split()[1])
if (radio_impute.value=='forward fill and mean'):
    impute='Mean'
elif (radio_impute.value=='forward fill and median'):
    impute='Median'
else:
    impute=False

if data_icu:
    gen=data_generation_icu.Generator(root_dir, cohort_output,data_mort,data_admn,data_los,diag_flag,proc_flag,out_flag,chart_flag,med_flag, lab_flag, micro_flag,impute,include,bucket,predW)
    #gen=data_generation_icu.Generator(cohort_output,data_mort,diag_flag,False,False,chart_flag,False,impute,include,bucket,predW)
    #if chart_flag:
    #    gen=data_generation_icu.Generator(cohort_output,data_mort,False,False,False,chart_flag,False,impute,include,bucket,predW)
else:
    gen=data_generation.Generator(root_dir, cohort_output,data_mort,data_admn,data_los,diag_flag,lab_flag,proc_flag,med_flag,impute,include,bucket,predW)

## Extract acute kidney injury information

In [None]:
def extract_aki(out_df, charts_df):
    """
    Extract the acute kidney injury information to compute the label, based on urine output.
    
    Parameters:
    - out_df: DataFrame containing output events.
    - charts_df: DataFrame containing chart events.
    
    Returns:
    - Combined DataFrame with filtered rows from both input DataFrames.
    """

    # Step 1: filter output events
    # item IDs are stored as integers in the loaded tables
    itemids_out = [226627, 226631, 227488, 227489]
    out_aki = out_df[out_df['itemid'].isin(itemids_out)]

    # check if all ids exist
    print("Unique ids within output data: ", out_aki['itemid'].unique())
    

    # Step 2: filter charts
    # item IDs are stored as integers in the loaded tables
    itemids_chart = [224639, 226512, 226531]
    charts_aki = charts_df[charts_df['itemid'].isin(itemids_chart)]
    
    # check if all ids exist
    print("Unique ids within chart data: ", charts_aki['itemid'].unique())

    # Step 3: combine the two tables and keep 'stay_id', 'itemid', 'start_time', and 'value'/'valuenum'
    combined_aki = pd.concat([out_aki[['stay_id', 'itemid', 'start_time', 'value']],
                              charts_aki[['stay_id', 'itemid', 'start_time', 'valuenum']].rename(columns={'valuenum': 'value'})])

    # Step 4: check which item IDs ended up in the combined table
    print("Unique ids within combined data: ", combined_aki['itemid'].unique())
    
    return combined_aki

In [139]:
combined_aki = extract_aki(dataset['out'], dataset['chart'])

Unique ids within output data:  [226631 227488 227489 226627]
Unique ids within chart data:  [226531 224639 226512]
Unique ids within combined data:  [226631 227488 227489 226627 226531 224639 226512]


In [134]:
import pandas as pd

def extract_patient_weights(df):
    # Define the itemids for weight
    weight_itemids = [224639, 226512, 226531]
    
    # Filter the DataFrame for the relevant itemids (copy so the assignment below does not write to a view of df)
    weights_df = df[df['itemid'].isin(weight_itemids)].copy()
    
    # Convert the weight from pounds to kilograms for the last itemid
    pound_to_kg_factor = 0.453592
    is_pounds = weights_df['itemid'] == 226531
    weights_df.loc[is_pounds, 'value'] = weights_df.loc[is_pounds, 'value'] * pound_to_kg_factor
    
    # Calculate the average weights for each stay_id
    averages_df = weights_df.groupby('stay_id').agg({
        'value': 'mean'
    }).reset_index()
    
    # Rename the 'value' column to indicate it's the average weight
    averages_df.rename(columns={'value': 'weight'}, inplace=True)
    
    # Pivot the DataFrame to create a new table with stay_id and weight versions
    # This step is no longer necessary for the final output
    # weights_pivot = weights_df.pivot_table(index='stay_id', columns='itemid', values='value', aggfunc='mean')  # Use mean to average the weights
    
    # Compare the mean weight recorded under consecutive itemids and warn on large gaps
    for i in range(len(weight_itemids) - 1):
        item1 = weight_itemids[i]
        item2 = weight_itemids[i + 1]
        mean1 = weights_df.loc[weights_df['itemid'] == item1, 'value'].mean()
        mean2 = weights_df.loc[weights_df['itemid'] == item2, 'value'].mean()
        if abs(mean1 - mean2) > 15:
            print(f"Warning: Differences greater than 15 found between itemids {item1} and {item2}")
    
    # Return only the average weight DataFrame
    return averages_df


In [135]:
weight_df = extract_patient_weights(combined_aki)

In [136]:
weight_df

Unnamed: 0,stay_id,weight
0,30001396,161.925454
1,30057454,86.334029
2,30072056,55.292865
3,30086978,84.232034
4,30112361,43.729993
...,...,...
648,39949413,88.813314
649,39971380,116.754581
650,39972000,60.690610
651,39983674,99.580114


### Urine

In [150]:
import pandas as pd

def calculate_urine_output_and_aki(df, df_weight):
    """
    Calculate urine output and determine Acute Kidney Injury (AKI) status from patient data for all stay_ids.

    Parameters:
    df (pd.DataFrame): DataFrame containing patient stay data with the following columns:
                       - 'stay_id': Unique identifier for each patient stay
                       - 'start_time': Timestamp for each measurement
                       - 'value': Measurement value (e.g., urine volume)
                       - 'itemid': Identifier for the type of measurement (e.g., urine input/output)
    df_weight (pd.DataFrame): DataFrame containing patient weight data with the following columns:
                              - 'stay_id': Unique identifier for each patient stay
                              - 'weight' (or 'itemid_weight'): Weight of the patient in kg

    Returns:
    pd.DataFrame: A DataFrame containing the following columns:
                  - 'stay_id': Unique identifier for each patient stay
                  - 'start_time': Timestamp for each measurement
                  - 'value': Difference between urine output and input
                  - 'interval': Time difference between measurements (in hours)
                  - 'urine_per_hour': Urine output per hour
                  - 'urine_per_hour_per_kg': Urine output per hour per kg of patient weight
                  - 'AKI_volume': Indicator for low urine output (<0.5 mL/kg/h)
                  - 'AKI_time': Indicator for long intervals (>=12 hours)
                  - 'AKI_preliminary': Combined indicator for AKI based on volume and time
                  - 'AKI': Final indicator for AKI status (1 if AKI is present, 0 otherwise)
    """
    # Filter the main DataFrame to keep relevant columns
    df = df[['stay_id', 'start_time', 'value', 'itemid']]

    # Filter and rename for urine input and output
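    # Breakpoint used to inspect the item IDs interactively (see the session below);
    # remove it to run the function end to end.
    import pdb
    pdb.set_trace()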
    df_GU_in = df[df['itemid'] == 227488].rename(columns={'itemid': 'itemid_in', 'value': 'value_in'})
    df_GU_out = df[df['itemid'] == 227489].rename(columns={'itemid': 'itemid_out', 'value': 'value_out'})

    # Merge input and output data on stay_id and start_time
    df_calculation = df_GU_out.merge(df_GU_in, how='inner', on=['stay_id', 'start_time'])

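    # Calculate urine output and intervals (sort first so that diff() follows time order)
    df_calculation['value'] = df_calculation['value_out'] - df_calculation['value_in']
    df_calculation = df_calculation.sort_values(['stay_id', 'start_time'])
    df_calculation['interval'] = df_calculation.groupby('stay_id')['start_time'].diff()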

    # Merge patient weights
    df_weight = df_weight.rename(columns={'itemid_weight': 'weight'})
    df_calculation = df_calculation.merge(df_weight[['stay_id', 'weight']], on='stay_id', how='left')

    # Calculate urine output metrics
    df_calculation['urine_per_hour'] = df_calculation['value'] / df_calculation['interval']
    df_calculation['urine_per_hour_per_kg'] = df_calculation['urine_per_hour'] / df_calculation['weight']

    # Determine AKI indicators
    df_calculation["AKI_volume"] = (df_calculation['urine_per_hour_per_kg'] < 0.5).astype(int)
    # 'start_time' is stored as a number of hours, so the interval is compared with a plain number
    df_calculation["AKI_time"] = (df_calculation['interval'] >= 12).astype(int)
    df_calculation["AKI_preliminary"] = df_calculation["AKI_volume"] + df_calculation["AKI_time"]
    df_calculation['AKI'] = (df_calculation['AKI_preliminary'] == 2).astype(int)

    return df_calculation[['stay_id', 'start_time', 'value', 'interval', 
                           'urine_per_hour', 'urine_per_hour_per_kg', 
                           'AKI_volume', 'AKI_time', 'AKI_preliminary', 'AKI']]

# Example usage
# df = pd.read_csv('your_data.csv')  # Load your main DataFrame
# df_weight = pd.read_csv('your_weight_data.csv')  # Load your weight DataFrame
# result_df = calculate_urine_output_and_aki(df, df_weight)
# print(result_df)


In [151]:
calculate_urine_output_and_aki(df = combined_aki, df_weight = weight_df)

> /tmp/ipykernel_3988/672885513.py(38)calculate_urine_output_and_aki()
     36     import pdb
     37     pdb.set_trace()
---> 38     df_GU_in = df[df['itemid'] == 227488].rename(columns={'itemid': 'itemid_in', 'value': 'value_in'})
     39     df_GU_out = df[df['itemid'] == 227489].rename(columns={'itemid

ipdb> df[df['itemid'] == 227488]
        stay_id  start_time  value  itemid
3617   37839379           5   30.0  227488
3627   37839379          12   30.0  227488
18494  37161074          14  250.0  227488
18496  37161074          14   50.0  227488
18498  37161074          15   50.0  227488
18556  37161074           0  200.0  227488
18558  37161074           1  100.0  227488
18560  37161074           2  100.0  227488
18562  37161074           3  100.0  227488
18564  37161074           4  200.0  227488
18566  37161074           5  200.0  227488
18568  37161074           6  100.0  227488
18570  37161074           7  300.0  227488
18572  37161074           9  200.0  227488
18574  37161074          10  100.0  227488
32103  34589391           8   10.0  227488
ipdb> df['itemid'].unique()
array([226631, 227488, 227489, 226627, 226531, 224639, 226512])
ipdb> df['stay_id'].unique()
array([33290343, 37839379, 33856536, 33378345, 37171883, 32841071,
       37161074, 31171895, 38468262, 32794579, 3

(653,)
ipdb> df[df['itemid'] == 227488]['stay_id'].unique()
array([37839379, 37161074, 34589391])
ipdb> df[df['itemid'] == '227488']['stay_id'].unique()
array([], dtype=int64)
ipdb> df[df['itemid'] == '227489']['stay_id'].unique()
array

BdbQuit: 
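
The session above shows that `itemid` is stored as integers, so a comparison against the string `'227488'` selects no rows. A small sketch of guarding against this mismatch follows; the toy frame reuses a few values seen in the session, and coercing the column with `pd.to_numeric` is one possible approach, not necessarily what the pipeline does.

```python
import pandas as pd

# Toy frame with the same columns as the table inspected in the debugger.
df = pd.DataFrame({
    "stay_id": [37839379, 37839379, 37161074],
    "itemid":  [227488, 227489, 227488],
    "value":   [30.0, 30.0, 250.0],
})

# Comparing the integer column with a string selects no rows.
print(len(df[df["itemid"] == "227488"]))

# Coerce the column to integers once (this also handles IDs read in as strings),
# then always filter with integer IDs.
df["itemid"] = pd.to_numeric(df["itemid"]).astype("int64")
irrigant_in = df[df["itemid"] == 227488]
print(irrigant_in["stay_id"].tolist())
```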

In [145]:
combined_aki

Unnamed: 0,stay_id,itemid,start_time,value
909,33290343,226631,0,200.0
3617,37839379,227488,5,30.0
3618,37839379,227489,5,30.0
3627,37839379,227488,12,30.0
3628,37839379,227489,12,30.0
...,...,...,...,...
1421176,30259797,224639,-207,100.8
1422397,30259797,224639,-256,96.8
1424351,30259797,224639,-279,96.0
1424474,30259797,224639,-234,98.0


In [None]:
# Single-stay version of the urine output calculation.
# `path` (CSV file) and `stay_id_value` must be defined before running this cell.
df = pd.read_csv(path)

df = df[['stay_id', 'time', 'value', 'itemid', 'itemid_weight']]
df_stay = df[df['stay_id'] == stay_id_value]

df_GU_in = df_stay[df_stay['itemid'] == 227488]
df_GU_in = df_GU_in.rename(columns={'itemid': 'itemid_in', 'value': 'value_in'})
# keep the weight column only once so the merge does not duplicate it
df_GU_in = df_GU_in.drop(columns='itemid_weight')

df_GU_out = df_stay[df_stay['itemid'] == 227489]
df_GU_out = df_GU_out.rename(columns={'itemid': 'itemid_out', 'value': 'value_out'})

df_calculation = df_GU_out.merge(df_GU_in, how='inner', on=['stay_id','time'])
df_calculation['value'] = df_calculation['value_out'] - df_calculation['value_in']
df_calculation['interval'] = df_calculation['time'].diff()
df_calculation['urine_per_hour'] = df_calculation['value']/df_calculation['interval']
df_calculation['urine_per_hour_per_kg'] = df_calculation['urine_per_hour']/df_calculation['itemid_weight']

# the indicators are computed on the merged table, not on the raw input
df_calculation["AKI_volume"] = df_calculation['urine_per_hour_per_kg'].apply(lambda x: 1 if x < 0.5 else 0)
df_calculation["AKI_time"] = df_calculation['interval'].apply(lambda x: 1 if x >= 12 else 0)
df_calculation["AKI_preliminary"] = df_calculation["AKI_volume"] + df_calculation["AKI_time"]
df_calculation['AKI'] = df_calculation['AKI_preliminary'].apply(lambda x: 1 if x == 2 else 0)

In [138]:
data['chart']

Unnamed: 0,stay_id,itemid,event_time_from_admit,valuenum
0,32669861,220045,1 days 22:29:31,93.0
1,32669861,220210,1 days 22:29:31,26.0
2,32669861,220277,1 days 22:29:31,93.0
3,32669861,220179,1 days 22:30:31,128.0
4,32669861,220180,1 days 22:30:31,55.0
...,...,...,...,...
1496889,30259797,227442,12 days 16:33:25,4.7
1496890,30259797,227443,12 days 16:33:25,29.0
1496891,30259797,227465,12 days 16:33:25,14.2
1496892,30259797,227466,12 days 16:33:25,28.1


In [98]:
import pandas as pd

def extract_aki_info(combined_aki):
    """
    Extracts Acute Kidney Injury (AKI) information from a combined DataFrame of medical measurements.

    Parameters:
    combined_aki (pd.DataFrame): A DataFrame that contains medical data with 'stay_id', 'itemid', 'start_time', 
                                 and 'value' columns. 'itemid' identifies different measurements, such as urine 
                                 output and patient weight.

    The function calculates AKI based on the following conditions:
    - Urine output per hour per kilogram of body weight < 0.5 for at least 12 hours indicates potential AKI.
    
    It calculates for all `stay_id`s at once:
    - Urine output per hour.
    - Urine output per hour per kilogram.
    - Flags whether the AKI volume condition is met (urine_per_hour_per_kg < 0.5).
    - Flags whether the AKI time condition is met (interval >= 12 hours).
    - Combines the volume and time criteria to flag AKI.

    Returns:
    pd.DataFrame: A DataFrame with the AKI-related calculations for each stay.
    """

    df = combined_aki
    
    itemid_urine  = 226627 # '226631'
    itemid_weight = 226512

    # Filter for urine output and weight records
    urine_output_df = df[df['itemid'] == itemid_urine][['stay_id', 'start_time', 'value']]
    weight_df = df[df['itemid'] == itemid_weight][['stay_id', 'value']].drop_duplicates(subset=['stay_id'])

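    # Breakpoint used to inspect the filtered tables interactively;
    # remove it to run the function end to end.
    import pdb
    pdb.set_trace() 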
    
    # Merge urine output with patient weight data
    df_merged = pd.merge(urine_output_df, weight_df, on='stay_id', how='outer', suffixes=('_urine', '_weight'))

    # Calculate time intervals (in hours) for each stay_id
    df_merged['interval'] = df_merged.groupby('stay_id')['start_time'].diff()

    # Calculate urine output per hour
    df_merged['urine_per_hour'] = df_merged['value_urine'] / df_merged['interval']

    # Calculate urine output per hour per kilogram of body weight
    df_merged['urine_per_hour_per_kg'] = df_merged['urine_per_hour'] / df_merged['value_weight']

    # AKI volume criteria: urine output < 0.5 mL/kg/h
    df_merged["AKI_volume"] = df_merged['urine_per_hour_per_kg'].apply(lambda x: 1 if x < 0.5 else 0)

    # AKI time criteria: interval >= 12 hours
    df_merged["AKI_time"] = df_merged['interval'].apply(lambda x: 1 if x >= 12 else 0)

    # Preliminary AKI: both conditions must be met
    df_merged["AKI_preliminary"] = df_merged["AKI_volume"] + df_merged["AKI_time"]

    # Final AKI flag: both conditions must be satisfied (AKI_preliminary == 2)
    df_merged['AKI'] = df_merged['AKI_preliminary'].apply(lambda x: 1 if x == 2 else 0)

    return df_merged


In [99]:
df_merged = extract_aki_info(combined_aki)

> /tmp/ipykernel_3988/2486658497.py(39)extract_aki_info()
     37 
     38     # Merge urine output with patient weight data
---> 39     df_merged = pd.merge(urine_output_df, weight_df, on='stay_id', how='outer', suffixes=('_urine', '_weight'))
     40 
     41     # Calculate time intervals (in hours) for each stay_id

ipdb> n
> /tmp/ipykernel_3988/2486658497.py(42)extract_aki_info()
     40 

BdbQuit: 
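
The docstring of `extract_aki_info` describes the criterion as urine output below 0.5 mL/kg/h sustained for at least 12 hours. One way to check for such a sustained run is sketched below. It is an illustration only: it assumes the rate has already been resampled to one value per hour in time order, and `has_sustained_low_output` is a hypothetical helper that is not part of the pipeline.

```python
def has_sustained_low_output(rates_ml_per_kg_per_h, threshold=0.5, min_hours=12):
    """Return True if the rate stays below `threshold` for `min_hours` consecutive hours.

    `rates_ml_per_kg_per_h` is assumed to hold one value per hour, in time order.
    """
    run = 0  # length of the current run of consecutive low-output hours
    for rate in rates_ml_per_kg_per_h:
        run = run + 1 if rate < threshold else 0
        if run >= min_hours:
            return True
    return False


# 12 consecutive low hours: criterion met.
print(has_sustained_low_output([1.0] * 3 + [0.3] * 12))           # True
# Low output interrupted after 11 hours: criterion not met.
print(has_sustained_low_output([0.3] * 11 + [0.8] + [0.3] * 11))  # False
```

Unlike a per-row flag on the gap between two measurements, this check resets whenever the rate recovers, so only an uninterrupted run counts.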

## Visualize the dynamic data
Now that the data has been processed into a temporal format, we can visualize the resulting tables.

In [None]:
from data_loading.load_preprocessed_dynamic_data import DynamicDataLoader

# Load data with default paths
data_loader = DynamicDataLoader(data_icu=True)
X, y = data_loader.load_data()

# Load data with custom paths
data_loader_custom = DynamicDataLoader(data_icu=True, 
                                       labels_path='./data/csv/labels.csv',
                                       data_dir='./data/csv/')
X_custom, y_custom = data_loader_custom.load_data()

# Load data for a specific number of patients (e.g., 10 patients) with custom paths
data_loader_limited = DynamicDataLoader(data_icu=True, 
                                 labels_path='./data/csv/labels.csv',
                                 data_dir='./data/csv/', 
                                 num_patients=10)
X_limited, y_limited = data_loader_limited.load_data()

In [None]:
# let's take a look
X_custom.head()

## 8. Machine Learning Models
Select the machine learning model, how the time-series features are combined (concatenated or aggregated), the cross-validation scheme, and whether to oversample the minority class.

In [None]:
print("=======Machine Learning Models=======")
radio_input5 = widgets.RadioButtons(options=['Logistic Regression','Random Forest','Gradient Bossting','Xgboost'],value='Gradient Bossting')
display(radio_input5)
print("Do you want to concatenate the time-series features?")
radio_input6 = widgets.RadioButtons(options=['Conactenate','Aggregate'],value='Conactenate')
display(radio_input6)
print("Please select below option for cross-validation")
radio_input7 = widgets.RadioButtons(options=['No CV','5-fold CV','10-fold CV'],value='5-fold CV')
display(radio_input7)
print("Do you want to do oversampling for the minority class?")
radio_input8 = widgets.RadioButtons(options=['True','False'],value='True')
display(radio_input8)

In [None]:
if radio_input7.value=='No CV':
    cv=0
elif radio_input7.value=='5-fold CV':
    cv=int(5)
elif radio_input7.value=='10-fold CV':
    cv=int(10)
ml=ml_models.ML_models(data_icu,cv,radio_input5.value,concat=radio_input6.value=='Conactenate',oversampling=radio_input8.value=='True')

## 9. Deep Learning Models
- Time-series LSTM and Time-series CNN use only time-series events (medications, charts, labs, output events) to train the model.

- Hybrid LSTM and Hybrid CNN use static data (diagnoses, demographic data) along with the time-series data to train the model.

- LSTM with Attention uses an attention layer to rank the important features and learn to predict the output. It uses both static and time-series data.

**Go to ./model/parameter.py and define all variables needed for model building and training**

**Please run the cell below to select which model to use**

In [None]:
radio_input6=widgets.RadioButtons(options=['Time-series LSTM','Time-series CNN','Hybrid LSTM','Hybrid CNN'],value='Time-series LSTM')
display(radio_input6)
print("Please select below option for cross-validation")
radio_input7 = widgets.RadioButtons(options=['No CV','5-fold CV','10-fold CV'],value='5-fold CV')
display(radio_input7)
print("Do you want to do oversampling for the minority class?")
radio_input8 = widgets.RadioButtons(options=['True','False'],value='True')
display(radio_input8)

In [None]:
if radio_input7.value=='No CV':
    cv=0
elif radio_input7.value=='5-fold CV':
    cv=int(5)
elif radio_input7.value=='10-fold CV':
    cv=int(10)
    
if data_icu:
    model=dl_train.DL_models(data_icu,diag_flag,proc_flag,out_flag,chart_flag,med_flag,False,radio_input6.value,cv,oversampling=radio_input8.value=='True',model_name='attn_icu_read',train=True)
else:
    model=dl_train.DL_models(data_icu,diag_flag,proc_flag,False,False,med_flag,lab_flag,radio_input6.value,cv,oversampling=radio_input8.value=='True',model_name='attn_icu_read',train=True)

## 10. Running BEHRT
Below we integrate the implementation of BEHRT into our pipeline.
We perform the pre-processing needed to run the BEHRT model. https://github.com/deepmedicine/BEHRT

A few things to note before running BEHRT:
- The numerical values are binned into quantiles.
- BEHRT recommends a maximum of 512 events per sample, 
    so feature selection is important to keep the number of events per sample from exceeding 512.
- The model is quite computationally heavy so it requires a GPU.
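
The first two points can be illustrated with a short sketch. It is not the tokenization code of the pipeline: the toy values, the choice of 4 quantiles, and the name `MAX_EVENTS` are assumptions made for the example.

```python
import pandas as pd

# Toy numeric lab values (hypothetical).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Bin the values into 4 quantiles; each value becomes a discrete token 0..3.
tokens = pd.qcut(values, q=4, labels=False)
print(tokens.tolist())

# Check a sample's event sequence against the recommended limit.
MAX_EVENTS = 512
sample_events = tokens.tolist()
print(len(sample_events) <= MAX_EVENTS)
```

Binning turns continuous measurements into a small vocabulary of tokens, which is what a token-based model such as BEHRT consumes.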

The output files for BEHRT will be saved in ./data/behrt/ folder

**Please run the cell below to pre-process and run BEHRT on the selected cohort**

In [None]:
if data_icu:
    token=tokenization.BEHRT_models(data_icu,diag_flag,proc_flag,out_flag,chart_flag,med_flag,False)
    tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels=token.tokenize()
else:
    token=tokenization.BEHRT_models(data_icu,diag_flag,proc_flag,False,False,med_flag,lab_flag)
    tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels=token.tokenize()
    
behrt_train.train_behrt(tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels)

### EVALUATION AS STANDALONE MODULE
The cell below shows an example of how the evaluation module can be used as a standalone module.

The evaluation.Loss class can be instantiated, and the model output and ground truth can be passed to it to obtain results.

In the example below, the model output and ground truth were captured in a file, which is then read back.

In the call ***loss(prob,truth,logits,False)***:

prob -> List of predicted probabilities of the case being positive

truth -> List of ground truth labels

logits -> List of logits obtained from the last fully connected layer before applying the softmax/sigmoid function in the model.
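
To make the relationship between these three inputs concrete, the sketch below derives probabilities from logits with a sigmoid and computes a few threshold-based metrics. It is not the implementation inside `evaluation.Loss`: `confusion_metrics` is a hypothetical helper, and the toy logits, labels, and the 0.5 threshold are assumptions.

```python
import math


def confusion_metrics(prob, truth, threshold=0.5):
    """Compute basic classification metrics from probabilities and 0/1 labels."""
    pred = [1 if p >= threshold else 0 for p in prob]
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(truth)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


logits = [2.0, -1.0, 0.5, -3.0]                  # toy model outputs
prob = [1 / (1 + math.exp(-z)) for z in logits]  # sigmoid turns logits into probabilities
truth = [1, 0, 0, 0]                             # toy ground truth labels
metrics = confusion_metrics(prob, truth)
print(metrics)
```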

In [None]:
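import pickle
import torch

# fall back to the CPU when no GPU is available, so that `device` is always defined
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'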
loss=evaluation.Loss(device,acc=True,ppv=True,sensi=True,tnr=True,npv=True,auroc=True,aurocPlot=True,auprc=True,auprcPlot=True,callb=True,callbPlot=True)
with open("./data/output/outputDict", 'rb') as fp:
    outputDict=pickle.load(fp)
prob=list(outputDict['Prob'])
truth=list(outputDict['Labels'])
logits=list(outputDict['Logits'])
#print(torch.tensor(prob))
print("======= TESTING ========")
loss(prob,truth,logits,train=False,standalone=True)


### 11. FAIRNESS EVALUATION
During training and testing, output files are saved in the **./data/output/** folder.

This file contains the list of demographic variables included in training and testing of the model.

It also contains the ground truth labels and the predicted probability for each sample.

We use this saved data to perform a fairness evaluation of the results obtained from model testing.

This module can also be used as a stand-alone module.

Please create a file that contains the predicted probabilities from the last sigmoid layer in a column named **Prob** and
the ground truth labels for each sample in a column named **Labels**.
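
As a rough illustration of what a fairness check on such a table can look like, the sketch below compares the true positive rate across groups. It does not reproduce the metrics computed by `fairness.fairness_evaluation`: the `gender` column, the toy values, and the 0.5 threshold are assumptions; only the `Prob` and `Labels` column names come from the description above.

```python
import pandas as pd

# Toy results table with the two required columns plus one demographic column.
results = pd.DataFrame({
    "Prob":   [0.9, 0.2, 0.7, 0.6, 0.8, 0.1],
    "Labels": [1,   1,   0,   1,   1,   0],
    "gender": ["F", "F", "F", "M", "M", "M"],  # hypothetical demographic variable
})
results["Pred"] = (results["Prob"] >= 0.5).astype(int)

# True positive rate per group: share of positive cases that are predicted positive.
positives = results[results["Labels"] == 1]
tpr_by_group = positives.groupby("gender")["Pred"].mean()
print(tpr_by_group)
```

A large gap between groups in a metric like this is the kind of signal a fairness report is meant to surface.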

In [None]:
fairness.fairness_evaluation(inputFile='outputDict',outputFile='fairnessReport')

### 12. MODEL CALIBRATION

Please run the cell below if you want to calibrate the predicted probabilities of the model on test data.
It uses the output saved during the testing of the model.

The file is saved in **./data/output/**.

This module can also be used as a stand-alone module.

Please create a file that contains the predicted logits from the last fully connected layer in a column named **Logits** and <br>the ground truth labels for each sample in a column named **Labels**.

In [None]:
callibrate_output.callibrate(inputFile='outputDict',outputFile='callibratedResults')