# Extract and encode MIMIC data
We extract the MIMIC-IV data and encode it as time-series data.
We use the following tables:
- labevents:
    - contains laboratory measurements such as blood tests
- microbiologyevents:
    - contains microbiology culture results
- outputevents:
    - contains patient outputs such as urine output
- chartevents:
    - contains vital signs
   



In [1]:
%load_ext autoreload
%autoreload 2

import os 

Define the relative or absolute data path. The MIMIC data must be in the folders:
- root_dir + '/mimiciv/3.0/hosp/'
- root_dir + '/mimiciv/3.0/icu/'
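
The notebook builds `root_dir` by concatenating the notebook's directory with `data_path`, which leaves a `/../` segment inside the resulting string. A minimal sketch of how such a path could be normalized before use; `resolve_data_root` is a hypothetical helper and not part of the repository:

```python
import os

def resolve_data_root(notebook_dir: str, data_path: str) -> str:
    """Join the notebook directory and a relative data path, collapsing '..' segments."""
    # data_path in this notebook starts with '/', so strip it before joining;
    # otherwise os.path.join would treat it as an absolute path.
    return os.path.normpath(os.path.join(notebook_dir, data_path.lstrip("/")))

# Illustrative directory name only; substitute the folder that holds the notebook.
print(resolve_data_root("/home/user/repo", "/../data/real_world_data/physionet.org_small/files"))
```

A normalized path makes error messages easier to read when a folder is missing, but the unnormalized string works just as well for file access.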

In [2]:
# define a relative or absolute path to the mimic folder. 
# data_path = '/../data/real_world_data/physionet.org_small/files'
# data_path = '/../data/real_world_data/physionet.org/files'
data_path = '/../data/real_world_data/physionet.org_small/files'
# data_path = "/lustre/groups/labs/marr/qscd01/datasets/vonKleist/physionet.org/files"
root_dir = os.path.dirname(os.path.abspath('UserInterface.ipynb')) + data_path

# define path for processed files
target_path = './data/features/'

In [3]:
import ipywidgets as widgets
import sys
from pathlib import Path
import importlib
import pandas as pd


module_path='preprocessing/day_intervals_preproc'
if module_path not in sys.path:
    sys.path.append(module_path)

module_path='utils'
if module_path not in sys.path:
    sys.path.append(module_path)
    
module_path='preprocessing/hosp_module_preproc'
if module_path not in sys.path:
    sys.path.append(module_path)
    
module_path='model'
if module_path not in sys.path:
    sys.path.append(module_path)
#print(sys.path)
# define the path to the data from current repository 
# (from here data is in mimiciv/3.0/hosp )
root_dir = os.path.dirname(os.path.abspath('UserInterface.ipynb')) + data_path

# root_dir = data_path

import day_intervals_cohort
from day_intervals_cohort import *

import day_intervals_cohort_v2
from day_intervals_cohort_v2 import *

import day_intervals_cohort_v3
from day_intervals_cohort_v3 import *

import data_generation_icu

import data_generation
import evaluation

import feature_selection_hosp
from feature_selection_hosp import *

In [4]:
# this is only for training
import ml_models
from ml_models import *

import dl_train
from dl_train import *

import tokenization
from tokenization import *

import behrt_train
from behrt_train import *

import feature_selection_icu
from feature_selection_icu import *
import fairness
import callibrate_output

In [5]:
importlib.reload(day_intervals_cohort)
import day_intervals_cohort
from day_intervals_cohort import *

importlib.reload(day_intervals_cohort_v2)
import day_intervals_cohort_v2
from day_intervals_cohort_v2 import *

importlib.reload(day_intervals_cohort_v3)
import day_intervals_cohort_v3
from day_intervals_cohort_v3 import *


importlib.reload(data_generation_icu)
import data_generation_icu
importlib.reload(data_generation)
import data_generation

importlib.reload(feature_selection_hosp)
import feature_selection_hosp
from feature_selection_hosp import *

importlib.reload(feature_selection_icu)
import feature_selection_icu
from feature_selection_icu import *

importlib.reload(tokenization)
import tokenization
from tokenization import *

importlib.reload(ml_models)
import ml_models
from ml_models import *

importlib.reload(dl_train)
import dl_train
from dl_train import *

importlib.reload(behrt_train)
import behrt_train
from behrt_train import *

importlib.reload(fairness)
import fairness

importlib.reload(callibrate_output)
import callibrate_output

importlib.reload(evaluation)
import evaluation

# Welcome to your MIMIC-IV Project

This repository explains the steps to download and clean the MIMIC-IV dataset for analysis.
The repository is compatible with MIMIC-IV v1.0, v2.0, and v3.0.

Please go to:
- https://physionet.org/content/mimiciv/1.0/ for v1.0
- https://physionet.org/content/mimiciv/2.0/ for v2.0
- https://physionet.org/content/mimiciv/3.0/ for v3.0

Follow instructions to get access to MIMIC-IV dataset.

Download the files using your terminal (replace `mehakg` with your own PhysioNet username):
- wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/1.0/ or
- wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/2.0/ or
- wget -r -N -c -np --user mehakg --ask-password https://physionet.org/files/mimiciv/3.0/

Save the downloaded files in the parent directory of this GitHub repo.

For v1.0, the structure should look like this:
- mimiciv/1.0/core
- mimiciv/1.0/hosp
- mimiciv/1.0/icu

For v2.0, the structure should look like this:
- mimiciv/2.0/hosp
- mimiciv/2.0/icu

For v3.0, the structure should look like this:
- mimiciv/3.0/hosp
- mimiciv/3.0/icu
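
Before running the extraction, it can help to confirm that the download produced the folder layout listed above. The following is a small sketch, assuming the data root is the folder that contains `mimiciv/`; `missing_modules` is a hypothetical helper, not a function of this repository:

```python
from pathlib import Path

# Sub-folders each release is expected to contain, as listed above.
EXPECTED_MODULES = {
    "1.0": ["core", "hosp", "icu"],
    "2.0": ["hosp", "icu"],
    "3.0": ["hosp", "icu"],
}

def missing_modules(data_root, version):
    """Return the expected sub-folders that are absent under data_root/mimiciv/<version>."""
    base = Path(data_root) / "mimiciv" / version
    return [name for name in EXPECTED_MODULES[version] if not (base / name).is_dir()]
```

Calling `missing_modules(root_dir, "3.0")` should return an empty list when the v3.0 layout is complete.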

## 1. DATA EXTRACTION
Please run the cell below to choose the options for cohort selection.
The cohort will be saved in **./data/cohort/**

In [6]:
print("Please select the appropriate version of MIMIC-IV for which you have downloaded data ?")
version = widgets.RadioButtons(options=['Version 1','Version 2','Version 3'],value='Version 3')
display(version)

print("Please select what prediction task you want to perform ?")
radio_input4 = widgets.RadioButtons(options=['Mortality','Length of Stay','Readmission','Phenotype'],value='Phenotype')
display(radio_input4)


Please select the appropriate version of MIMIC-IV for which you have downloaded data ?


RadioButtons(index=2, options=('Version 1', 'Version 2', 'Version 3'), value='Version 3')

Please select what prediction task you want to perform ?


RadioButtons(index=3, options=('Mortality', 'Length of Stay', 'Readmission', 'Phenotype'), value='Phenotype')

### Refining Cohort and Prediction Task Definition

Based on your current selection, the following block lets you further refine the prediction task and the cohort associated with it:

- First, refine the prediction task by choosing from the following options:
    - **Length of Stay** - Select one of two predefined options or enter a custom number of days, to predict whether the length of stay is greater than or equal to that number of days.

    - **Readmission** - Select one of four predefined options or enter a custom number of days, to predict readmission within that number of days after the previous admission.

    - **Phenotype Prediction** - Select one of four major chronic diseases to predict its future outcome:

        - Heart failure
        - CAD (Coronary Artery Disease)
        - CKD (Chronic Kidney Disease)
        - COPD (Chronic Obstructive Pulmonary Disease)

- Second, choose whether to perform the above task using ICU or non-ICU admissions data.

- Third, you can refine the cohort selection for any of the above prediction tasks by including only the admissions with a particular chronic disease:
    - Heart failure
    - CAD (Coronary Artery Disease)
    - CKD (Chronic Kidney Disease)
    - COPD (Chronic Obstructive Pulmonary Disease)

**Please run the cell below to extract the cohort for the selected options**
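
The refinement options above reduce to three values that are passed to the cohort extraction: a label, a time window in days, and an optional ICD-10 disease label. The sketch below restates that mapping as a plain function so it can be checked without widgets. `resolve_task` and `PHENOTYPE_CODES` are hypothetical names introduced for illustration; the option strings and codes are the ones used in this notebook.

```python
# ICD-10 root codes used by the notebook for the four phenotypes.
PHENOTYPE_CODES = {"Heart Failure": "I50", "CAD": "I25", "CKD": "N18", "COPD": "J44"}

def resolve_task(task, option, custom_days=0):
    """Map the widget selections to (label, time_in_days, disease_label)."""
    if task in ("Readmission", "Length of Stay"):
        if option == "Custom":
            return task, custom_days, ""
        # '30 Day Readmission' and 'Length of Stay ge 3' each contain one integer token.
        days = next(int(tok) for tok in option.split() if tok.isdigit())
        return task, days, ""
    if task == "Phenotype":
        if option == "No specific prediction":
            return "", 100000, "no_label"
        disease = option.replace(" in 30 days", "")
        return "Readmission", 30, PHENOTYPE_CODES[disease]
    return task, 0, ""  # Mortality needs no time window
```

For example, `resolve_task("Phenotype", "CKD in 30 days")` yields `("Readmission", 30, "N18")`, reflecting that a phenotype task is treated as a 30-day readmission with a disease label.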

In [7]:
# define what data to load 
print("Please specify which prediction task to perform:")
if radio_input4.value=='Length of Stay':
    options=['Length of Stay ge 3','Length of Stay ge 7','Custom']
    value='Length of Stay ge 3'
    radio_input2 = widgets.RadioButtons(options = options, value = value)
    display(radio_input2)
    text1=widgets.IntSlider(
    value=3,
    min=1,
    max=10,
    step=1,
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
    display(widgets.HBox([widgets.Label('Length of stay ge (in days)',layout={'width': '180px'}), text1]))
elif radio_input4.value=='Readmission':
    options=['30 Day Readmission','60 Day Readmission','90 Day Readmission',
             '120 Day Readmission','Custom']
    value= '30 Day Readmission'
    radio_input2 = widgets.RadioButtons(options = options,value = value)
    display(radio_input2)
    text1=widgets.IntSlider(
    value=30,
    min=10,
    max=150,
    step=10,
    disabled=False
    )
    display(widgets.HBox([widgets.Label('Readmission after (in days)',layout={'width': '180px'}), text1]))
elif radio_input4.value=='Phenotype':
    options = ['No specific prediction', 'Heart Failure in 30 days','CAD in 30 days','CKD in 30 days','COPD in 30 days']
    value ='No specific prediction'
    radio_input2 = widgets.RadioButtons(options=options, value = value)
    display(radio_input2)
elif radio_input4.value=='Mortality':
    options=['Mortality']
    value='Mortality'
    radio_input2 = widgets.RadioButtons(options=options,value=value)
    #display(radio_input2)

print("Extract Data")
print("Please select below if you want to work with ICU or Non-ICU data ?")
radio_input1 = widgets.RadioButtons(options=['ICU', 'Non-ICU'],value='ICU')
display(radio_input1)

print("Please select if you want to perform the chosen prediction task for a specific disease.")
radio_input3 = widgets.RadioButtons(options=['No Disease Filter','Heart Failure','CKD','CAD','COPD'],value='No Disease Filter')
display(radio_input3)

Please specify which prediction task to perform:


RadioButtons(options=('No specific prediction', 'Heart Failure in 30 days', 'CAD in 30 days', 'CKD in 30 days'…

Extract Data
Please select below if you want to work with ICU or Non-ICU data ?


RadioButtons(options=('ICU', 'Non-ICU'), value='ICU')

Please select if you want to perform the chosen prediction task for a specific disease.


RadioButtons(options=('No Disease Filter', 'Heart Failure', 'CKD', 'CAD', 'COPD'), value='No Disease Filter')

In [8]:
disease_label=""
time=0
label=radio_input4.value

if label=='Readmission':
    if radio_input2.value=='Custom':
        time=text1.value
    else:
        time=int(radio_input2.value.split()[0])
elif label=='Length of Stay':
    if radio_input2.value=='Custom':
        time=text1.value
    else:
        time=int(radio_input2.value.split()[4])

if label=='Phenotype':    
    if radio_input2.value=='Heart Failure in 30 days':
        label='Readmission'
        time=30
        disease_label='I50'
    elif radio_input2.value=='CAD in 30 days':
        label='Readmission'
        time=30
        disease_label='I25'
    elif radio_input2.value=='CKD in 30 days':
        label='Readmission'
        time=30
        disease_label='N18'
    elif radio_input2.value=='COPD in 30 days':
        label='Readmission'
        time=30
        disease_label='J44'
    elif radio_input2.value=='No specific prediction':
        label=''
        time=100000
        disease_label='no_label'
    
data_icu=radio_input1.value=="ICU"
data_mort=label=="Mortality"
data_admn=label=='Readmission'
data_los=label=='Length of Stay'
        

if (radio_input3.value=="Heart Failure"):
    icd_code='I50'
elif (radio_input3.value=="CKD"):
    icd_code='N18'
elif (radio_input3.value=="COPD"):
    icd_code='J44'
elif (radio_input3.value=="CAD"):
    icd_code='I25'
else:
    icd_code='No Disease Filter'

if version.value=='Version 1':
    version_path= "mimiciv/1.0"
    cohort_output = day_intervals_cohort.extract_data(radio_input1.value,label,time,icd_code, root_dir,disease_label)
elif version.value=='Version 2':
    version_path= "mimiciv/2.0"
    cohort_output = day_intervals_cohort_v2.extract_data(radio_input1.value,label,time,icd_code, root_dir,disease_label)
elif version.value=='Version 3':
    version_path= "mimiciv/3.0"
    cohort_output = day_intervals_cohort_v3.extract_data(radio_input1.value,label,time,icd_code, root_dir,disease_label)

EXTRACTING FOR: | ICU |  DUE TO NO_LABEL | 100000 | 
[ COHORT SUCCESSFULLY SAVED ]
[ SUMMARY SUCCESSFULLY SAVED ]
ICU DATA
# Admission Records: 2999
# Patients: 1960


## 2. FEATURE SELECTION
Features available for ICU data -
- Diagnosis (https://mimic.mit.edu/docs/iv/modules/hosp/diagnoses_icd/)
- Procedures (https://mimic.mit.edu/docs/iv/modules/icu/procedureevents/)
- Medications (https://mimic.mit.edu/docs/iv/modules/icu/inputevents/)
- Output Events (https://mimic.mit.edu/docs/iv/modules/icu/outputevents/)
- Chart Events (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/)
- Lab Events (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/)
- Microbiology Events (https://mimic.mit.edu/docs/iv/modules/hosp/microbiologyevents/)

All features will be saved in the defined **target_path**

**Please run below cell to select features**

In [9]:
print("Feature Selection")
if data_icu:
    print("Which Features you want to include for cohort?")
    check_input1 = widgets.Checkbox(description='Diagnosis', value=True)
    display(check_input1)
    check_input2 = widgets.Checkbox(description='Output Events', value=True)
    display(check_input2)
    check_input3 = widgets.Checkbox(description='Chart Events', value=True)
    display(check_input3)
    check_input4 = widgets.Checkbox(description='Procedures', value=True)
    display(check_input4)
    check_input5 = widgets.Checkbox(description='Medications', value=True)
    display(check_input5)
    check_input100 = widgets.Checkbox(description='Lab Events', value=True)
    display(check_input100)
    check_input101 = widgets.Checkbox(description='Microbiology Events', value=False)
    display(check_input101)

    print("icu")
else:
    print("Which Features you want to include for cohort?")
    check_input1 = widgets.Checkbox(description='Diagnosis', value=True)
    display(check_input1)
    check_input2 = widgets.Checkbox(description='Labs', value=True)
    display(check_input2)
    check_input3 = widgets.Checkbox(description='Procedures', value=True)
    display(check_input3)
    check_input4 = widgets.Checkbox(description='Medications', value=True )
    display(check_input4)
print("**Please run below cell to extract selected features**")

Feature Selection
Which Features you want to include for cohort?


Checkbox(value=True, description='Diagnosis')

Checkbox(value=True, description='Output Events')

Checkbox(value=True, description='Chart Events')

Checkbox(value=True, description='Procedures')

Checkbox(value=True, description='Medications')

Checkbox(value=True, description='Lab Events')

Checkbox(value=False, description='Microbiology Events')

icu
**Please run below cell to extract selected features**


In [10]:
diag_flag=check_input1.value
out_flag=check_input2.value
chart_flag=check_input3.value
proc_flag=check_input4.value
med_flag=check_input5.value
lab_flag=check_input100.value
micro_flag=check_input101.value

In [11]:
# load and preprocess the data (by dropping some columns)
if data_icu:
    diag_flag=check_input1.value
    out_flag=check_input2.value
    chart_flag=check_input3.value
    proc_flag=check_input4.value
    med_flag=check_input5.value

    lab_flag=check_input100.value
    micro_flag=check_input101.value

    
    #feature_icu(cohort_output, root_dir, root_dir + '/'+ version_path,diag_flag,out_flag,chart_flag,proc_flag,med_flag)
    data = feature_icu( cohort_output = cohort_output, 
                        root_dir = root_dir, 
                        version_path = root_dir + '/'+ version_path,
                        save_path = target_path, 
                        diag_flag = diag_flag,
                        out_flag = out_flag,
                        chart_flag = chart_flag,
                        proc_flag = proc_flag,
                        med_flag  = med_flag,
                        lab_flag = lab_flag,
                        micro_flag = micro_flag)
else:
    # not adapted yet
    print("Warning: this code hasn't been checked!")
    diag_flag=check_input1.value
    lab_flag=check_input2.value
    proc_flag=check_input3.value
    med_flag=check_input4.value
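    # fixed the missing comma between cohort_output and root_dir; the remaining arguments are unverified
    feature_nonicu(cohort_output, root_dir, root_dir+ '/'+ version_path , version_path,diag_flag,lab_flag,proc_flag,med_flag)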

[EXTRACTING DIAGNOSIS DATA]
[RESULTS] Admissions with diagnoses not in cohort: 4746
[RESULTS] Admissions in cohort without diagnosis: 0
[RESULTS] Patients with diagnoses not in cohort: 0
[RESULTS] Patients in cohort without diagnosis: 0
# unique ICD-9 codes 2226
# unique ICD-10 codes 3256
# unique ICD-10 codes (After converting ICD-9 to ICD-10) 3569
# unique ICD-10 codes (After clinical gruping ICD-10 codes) 991
# Admissions:   2999
Total rows 56933
Columns kept for diagnosis: ['subject_id', 'hadm_id', 'stay_id', 'icd_code', 'root_icd10_convert', 'root']
[SUCCESSFULLY SAVED DIAGNOSIS DATA]
[EXTRACTING OUTPUT EVENTS DATA]
[RESULTS] Stays with output info not in cohort: 0
[RESULTS] Stays in cohort without output info: 0
# Unique Events:   67
# Admissions:   2909
Total rows 164170
Columns kept for output events: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'charttime', 'intime', 'event_time_from_admit', 'value']


0it [00:00, ?it/s]

[SUCCESSFULLY SAVED OUTPUT EVENTS DATA]
[EXTRACTING CHART EVENTS DATA]


2it [00:24, 12.21s/it]


Total rows processed: 13383573
Total rows with missing 'valuenum': 8180197
[RESULTS] Stays in charts not in cohort: 0
[RESULTS] Stays in cohort without chart info: 0
# Unique Events:   2027
# Admissions:   2999
Total rows 11730995
Columns kept for chart events: ['stay_id', 'itemid', 'event_time_from_admit', 'valuenum']
[SUCCESSFULLY SAVED CHART EVENTS DATA]
[EXTRACTING PROCEDURES DATA]
[RESULTS] Stays with proc not in cohort: 0
[RESULTS] Stays in cohort without proc info: 319
# Unique Events:   143
# Admissions:   2680
Total rows 24050
Columns kept for procedures: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'starttime', 'intime', 'event_time_from_admit']
[SUCCESSFULLY SAVED PROCEDURES DATA]
[EXTRACTING MEDICATION DATA]
[RESULTS] Stays with medication data not in cohort: 0
[RESULTS] Stays in cohort without medications: 313
# of unique type of drug:  139
# Admissions:   2546
# Total rows 173465
Columns kept for medication: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'starttime', 'e

0it [00:00, ?it/s]

[SUCCESSFULLY SAVED MEDICATION DATA]
[EXTRACTING LAB EVENTS DATA]


3it [00:07,  2.63s/it]


[RESULTS] Admissions with lab info not in cohort: 4192
[RESULTS] Admissions in cohort without lab info: 22
# of unique type of lab events:  749
# Admissions:   2977
# Total rows 1734157
Columns kept for lab events: ['subject_id', 'hadm_id', 'stay_id', 'itemid', 'charttime', 'storetime', 'valuenum', 'valueuom', 'ref_range_lower', 'ref_range_upper', 'chart_hours_from_admit', 'store_hours_from_admit']
[SUCCESSFULLY SAVED LAB EVENTS DATA]


In [12]:
# look at the data 
data['lab']

Unnamed: 0,subject_id,hadm_id,stay_id,itemid,charttime,storetime,valuenum,valueuom,ref_range_lower,ref_range_upper,chart_hours_from_admit,store_hours_from_admit
0,10005593,26835370.0,32896438,53185,2125-06-23 20:18:00,2125-06-24 06:33:00,,,,,-3 days +09:04:58,-3 days +19:19:58
1,10005593,26835370.0,34389119,53185,2125-06-23 20:18:00,2125-06-24 06:33:00,,,,,-3 days +01:56:37,-3 days +12:11:37
2,10005593,26835370.0,32896438,51463,2125-06-23 20:19:00,2125-06-23 22:57:00,,/hpf,,,-3 days +09:05:58,-3 days +11:43:58
3,10005593,26835370.0,34389119,51463,2125-06-23 20:19:00,2125-06-23 22:57:00,,/hpf,,,-3 days +01:57:37,-3 days +04:35:37
4,10005593,26835370.0,32896438,51464,2125-06-23 20:19:00,2125-06-23 22:57:00,,,,,-3 days +09:05:58,-3 days +11:43:58
...,...,...,...,...,...,...,...,...,...,...,...,...
1734152,19997367,21508795.0,36980198,51506,2127-04-16 17:21:00,2127-04-16 18:22:00,,,,,14 days 04:34:04,14 days 05:35:04
1734153,19997367,21508795.0,36980198,51508,2127-04-16 17:21:00,2127-04-16 18:22:00,,,,,14 days 04:34:04,14 days 05:35:04
1734154,19997367,21508795.0,36980198,51514,2127-04-16 17:21:00,2127-04-16 18:22:00,,mg/dL,0.2,1.0,14 days 04:34:04,14 days 05:35:04
1734155,19997367,21508795.0,36980198,51516,2127-04-16 17:21:00,2127-04-16 18:22:00,4.0,#/hpf,0.0,5.0,14 days 04:34:04,14 days 05:35:04
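
The offset columns in the table above are time differences in normalized form: the day part carries the sign and the time part is always positive, so `-3 days +09:04:58` means roughly 63 hours before the reference time, not more than three days. The sketch below decodes the first row with the standard library. Treating the reference time as the ICU `intime` of that stay is an assumption; this section does not state it.

```python
from datetime import datetime, timedelta

# First row of the table above: charttime and chart_hours_from_admit.
charttime = datetime(2125, 6, 23, 20, 18, 0)
offset = timedelta(days=-3, hours=9, minutes=4, seconds=58)  # shown as '-3 days +09:04:58'

hours_from_admit = offset.total_seconds() / 3600
reference_time = charttime - offset  # assumed to be the ICU intime of that stay

print(round(hours_from_admit, 2))  # -62.92
print(reference_time)              # 2125-06-26 11:13:02
```

Negative offsets therefore mark lab results charted before the reference time, which is worth keeping in mind when deciding which events to include in a time series.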


## 3. CLINICAL GROUPING
Below you can choose to clinically group diagnoses and medications.
Grouping medical codes reduces the dimensionality of the feature space.

The default options selected below group medical codes to reduce the feature dimension.
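
To illustrate why grouping shrinks the feature space, the sketch below collapses ICD-10 codes to their three-character category. This is only an assumption about what grouping amounts to; the pipeline's actual grouping, including the ICD-9 to ICD-10 conversion, relies on mappings that are not shown in this section. `icd10_root` is a hypothetical helper.

```python
from collections import Counter

def icd10_root(code):
    """Return the three-character category of an ICD-10 code (dots ignored)."""
    return code.replace(".", "").upper()[:3]

codes = ["I50.9", "I5022", "N18.3", "N186", "J449"]  # sample codes for illustration
grouped = Counter(icd10_root(c) for c in codes)

print(len(set(codes)), "->", len(grouped))  # 5 -> 3
```

The same effect is visible in the extraction log above, where the number of unique ICD-10 codes drops from 3569 to 991 after clinical grouping.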

**Please run the cell below to select preprocessing for different features**

In [13]:
if data_icu:
    if diag_flag:
        print("Do you want to group ICD 10 DIAG codes ?")
        radio_input4 = widgets.RadioButtons(options=['Keep both ICD-9 and ICD-10 codes','Convert ICD-9 to ICD-10 codes','Convert ICD-9 to ICD-10 and group ICD-10 codes'],value='Convert ICD-9 to ICD-10 and group ICD-10 codes',layout={'width': '100%'})
        display(radio_input4)   
    
else:
    if diag_flag:
        print("Do you want to group ICD 10 DIAG codes ?")
        radio_input4 = widgets.RadioButtons(options=['Keep both ICD-9 and ICD-10 codes','Convert ICD-9 to ICD-10 codes','Convert ICD-9 to ICD-10 and group ICD-10 codes'],value='Convert ICD-9 to ICD-10 and group ICD-10 codes',layout={'width': '100%'})
        display(radio_input4)     
    if med_flag:
        print("Do you want to group Medication codes to use non-proprietary names?")
        radio_input5 = widgets.RadioButtons(options=['Yes','No'],value='Yes',layout={'width': '100%'})
        display(radio_input5)
    if proc_flag:
        print("Which ICD codes for Procedures you want to keep in data?")
        radio_input6 = widgets.RadioButtons(options=['ICD-9 and ICD-10','ICD-10'],value='ICD-10',layout={'width': '100%'})
        display(radio_input6)
print("**Please run below cell to perform feature preprocessing**")

Do you want to group ICD 10 DIAG codes ?


RadioButtons(index=2, layout=Layout(width='100%'), options=('Keep both ICD-9 and ICD-10 codes', 'Convert ICD-9…

**Please run below cell to perform feature preprocessing**


In [14]:
group_diag=False
group_med=False
group_proc=False
if data_icu:
    if diag_flag:
        group_diag=radio_input4.value
    preprocess_features_icu(cohort_output = cohort_output, 
                            save_path = target_path, 
                            diag_flag = diag_flag, 
                            group_diag = group_diag,
                            chart_flag = False, 
                            clean_chart = False, 
                            impute_outlier_chart = False,  
                            thresh = 0, 
                            left_thresh= 0)
else:
    # not adapted yet
    print("Warning: this code hasn't been checked!")
    if diag_flag:
        group_diag=radio_input4.value
    if med_flag:
        group_med=radio_input5.value
    if proc_flag:
        group_proc=radio_input6.value
    preprocess_features_hosp(cohort_output, diag_flag,proc_flag,med_flag,False,group_diag,group_med,group_proc,False,False,0,0)

[PROCESSING DIAGNOSIS DATA]
Total number of rows 54612
[SUCCESSFULLY SAVED DIAGNOSIS DATA]


## 4. SUMMARY OF FEATURES

This step generates a summary of all features extracted so far.<br>
It saves the summary files in **./data/summary/**<br>
- These files report the **mean frequency** of medical codes per admission.<br>
- They also report the **total occurrence count** of each medical code.<br>
- For lab and chart events they additionally report the **missing %**, i.e. the percentage of rows for a given medical code that have a missing value.

Please use this information to further refine your cohort by selecting which medical codes in each feature you want to keep and which codes you would like to remove for downstream analysis tasks.
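
The three statistics can be made concrete with a small sketch. It assumes that "mean frequency" is the total count of a code divided by the number of admissions in which it occurs; the definition used by `generate_summary_icu` may differ. `summarize` is a hypothetical helper.

```python
from collections import defaultdict

def summarize(rows):
    """rows: iterable of (admission_id, code, value); value is None when missing."""
    total = defaultdict(int)
    missing = defaultdict(int)
    admissions = defaultdict(set)
    for adm, code, value in rows:
        total[code] += 1
        missing[code] += value is None  # True counts as 1
        admissions[code].add(adm)
    return {
        code: {
            "total_count": total[code],
            "mean_frequency": total[code] / len(admissions[code]),
            "missing_pct": 100 * missing[code] / total[code],
        }
        for code in total
    }
```

A code with a high missing percentage or a very low total count is a natural candidate for removal in the feature selection step.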

**Please run below cell to generate summary files**

In [15]:
if data_icu:
    #generate_summary_icu(diag_flag,proc_flag,med_flag,out_flag,chart_flag)
    generate_summary_icu(diag_flag,proc_flag,med_flag,out_flag,chart_flag, lab_flag, micro_flag)
else:
    generate_summary_hosp(diag_flag,proc_flag,med_flag,lab_flag)

[GENERATING FEATURE SUMMARY]
[SUCCESSFULLY SAVED FEATURE SUMMARY]


## 5. FEATURE SELECTION

Based on the files generated in the previous step and any other information you have gathered,<br>
please select which medical codes you want to include in this study.

Please run the cell below to select the features for which you want to perform feature selection.

- Select **Yes** if you want to select a subset of medical codes for that feature and<br> **edit** the corresponding feature file for it.
- Select **No** if you want to keep all the codes in a feature.
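
Conceptually, selecting **Yes** means that only the codes still listed in the edited file survive. A minimal sketch of that idea follows; the column name `code` and both helper functions are hypothetical, since the layout of the summary files is not shown here.

```python
import csv
import io

def load_kept_codes(csv_text, column="code"):
    """Read the codes remaining in an edited summary file ('code' is a placeholder column name)."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}

def filter_events(events, kept_codes):
    """Keep only the events whose code is still listed in the edited file."""
    return [e for e in events if e["code"] in kept_codes]

edited_file = "code\nI50\nN18\n"  # stands in for an edited summary CSV
events = [{"stay_id": 1, "code": "I50"}, {"stay_id": 1, "code": "J44"}]
print(filter_events(events, load_kept_codes(edited_file)))
```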

In [16]:
if data_icu:
    if diag_flag:
        print("Do you want to do Feature Selection for Diagnosis \n (If yes, please edit list of codes in ./data/summary/diag_features.csv)")
        radio_input4 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input4)       
    if med_flag:
        print("Do you want to do Feature Selection for Medication \n (If yes, please edit list of codes in ./data/summary/med_features.csv)")
        radio_input5 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input5)   
    if proc_flag:
        print("Do you want to do Feature Selection for Procedures \n (If yes, please edit list of codes in ./data/summary/proc_features.csv)")
        radio_input6 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input6)   
    if out_flag:
        print("Do you want to do Feature Selection for Output event \n (If yes, please edit list of codes in ./data/summary/out_features.csv)")
        radio_input7 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input7)  
    if chart_flag:
        print("Do you want to do Feature Selection for Chart events \n (If yes, please edit list of codes in ./data/summary/chart_features.csv)")
        radio_input8 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input8)  
else:
    if diag_flag:
        print("Do you want to do Feature Selection for Diagnosis \n (If yes, please edit list of codes in ./data/summary/diag_features.csv)")
        radio_input4 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input4)         
    if med_flag:
        print("Do you want to do Feature Selection for Medication \n (If yes, please edit list of codes in ./data/summary/med_features.csv)")
        radio_input5 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input5)   
    if proc_flag:
        print("Do you want to do Feature Selection for Procedures \n (If yes, please edit list of codes in ./data/summary/proc_features.csv)")
        radio_input6 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input6)   
    if lab_flag:
        print("Do you want to do Feature Selection for Labs \n (If yes, please edit list of codes in ./data/summary/lab_features.csv)")
        radio_input7 = widgets.RadioButtons(options=['Yes','No'],value='No')
        display(radio_input7)   
print("**Please run below cell to perform feature selection**")

Do you want to do Feature Selection for Diagnosis 
 (If yes, please edit list of codes in ./data/summary/diag_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Medication 
 (If yes, please edit list of codes in ./data/summary/med_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Procedures 
 (If yes, please edit list of codes in ./data/summary/proc_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Output event 
 (If yes, please edit list of codes in ./data/summary/out_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

Do you want to do Feature Selection for Chart events 
 (If yes, please edit list of codes in ./data/summary/chart_features.csv)


RadioButtons(index=1, options=('Yes', 'No'), value='No')

**Please run below cell to perform feature selection**


In [17]:
select_diag=False
select_med=False
select_proc=False
select_lab=False
select_out=False
select_chart=False

if data_icu:
    if diag_flag:
        select_diag=radio_input4.value == 'Yes'
    if med_flag:
        select_med=radio_input5.value == 'Yes'
    if proc_flag:
        select_proc=radio_input6.value == 'Yes'
    if out_flag:
        select_out=radio_input7.value == 'Yes'
    if chart_flag:
        select_chart=radio_input8.value == 'Yes'
    features_selection_icu(cohort_output, diag_flag,proc_flag,med_flag,out_flag, chart_flag,select_diag,select_med,select_proc,select_out,select_chart)
else:
    if diag_flag:
        select_diag=radio_input4.value == 'Yes'
    if med_flag:
        select_med=radio_input5.value == 'Yes'
    if proc_flag:
        select_proc=radio_input6.value == 'Yes'
    if lab_flag:
        select_lab=radio_input7.value == 'Yes'
    features_selection_hosp(cohort_output, diag_flag,proc_flag,med_flag,lab_flag,select_diag,select_med,select_proc,select_lab)

## 6. CLEANING OF FEATURES
Below you can choose to clean lab and chart events by performing outlier removal and unit conversion.

Outlier removal discards values higher than the selected **right threshold** percentile and lower than the selected **left threshold** percentile, computed over all values of each itemid.
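
The percentile thresholds can be sketched in plain Python for the values of a single itemid. Two assumptions are made: percentiles are linearly interpolated, and the "Impute Outlier" option is read as clipping values to the threshold. Neither is confirmed by this section, and `percentile` and `clean_outliers` are hypothetical helpers.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a sorted, non-empty list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def clean_outliers(values, right=98, left=0, impute=False):
    """Drop values outside the [left, right] percentile range, or clip them if impute=True."""
    ordered = sorted(values)
    low, high = percentile(ordered, left), percentile(ordered, right)
    if impute:
        return [min(max(v, low), high) for v in values]
    return [v for v in values if low <= v <= high]
```

With the defaults shown in the widgets (right threshold 98, left threshold 0), only the top two percent of values per itemid would be affected.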

**Please run the cell below to select preprocessing for different features**

In [18]:
if data_icu:
    if chart_flag:
        print("Outlier removal in values of chart events ?")
        layout = widgets.Layout(width='100%', height='40px') #set width and height

        radio_input5 = widgets.RadioButtons(options=['No outlier detection','Impute Outlier (default:98)','Remove outliers (default:98)'],value='No outlier detection',layout=layout)
        display(radio_input5)
        outlier=widgets.IntSlider(
        value=98,
        min=90,
        max=99,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        left_outlier=widgets.IntSlider(
        value=0,
        min=0,
        max=10,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        #display(oulier)
        display(widgets.HBox([widgets.Label('Right Outlier Threshold',layout={'width': '150px'}), outlier]))
        display(widgets.HBox([widgets.Label('Left Outlier Threshold',layout={'width': '150px'}), left_outlier]))
    
else:      
    if lab_flag:
        print("Outlier removal in values of lab events ?")
        layout = widgets.Layout(width='100%', height='40px') #set width and height

        radio_input7 = widgets.RadioButtons(options=['No outlier detection','Impute Outlier (default:98)','Remove outliers (default:98)'],value='No outlier detection',layout=layout)
        display(radio_input7)
        outlier=widgets.IntSlider(
        value=98,
        min=90,
        max=99,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        left_outlier=widgets.IntSlider(
        value=0,
        min=0,
        max=10,
        step=1,
        disabled=False,layout={'width': '100%'}
        )
        #display(oulier)
        display(widgets.HBox([widgets.Label('Right Outlier Threshold',layout={'width': '150px'}), outlier]))
        display(widgets.HBox([widgets.Label('Left Outlier Threshold',layout={'width': '150px'}), left_outlier]))
print("**Please run below cell to perform feature preprocessing**")

Outlier removal in values of chart events ?


RadioButtons(layout=Layout(height='40px', width='100%'), options=('No outlier detection', 'Impute Outlier (def…

HBox(children=(Label(value='Right Outlier Threshold', layout=Layout(width='150px')), IntSlider(value=98, layou…

HBox(children=(Label(value='Left Outlier Threshold', layout=Layout(width='150px')), IntSlider(value=0, layout=…

**Please run below cell to perform feature preprocessing**


In [19]:
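# defaults, so the calls below also work when the corresponding flag is disabled
thresh=0
left_thresh=0
clean_chart=False
impute_outlier_chart=False
clean_lab=False
impute_outlier=False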
if data_icu:
    if chart_flag:
        clean_chart=radio_input5.value!='No outlier detection'
        impute_outlier_chart=radio_input5.value=='Impute Outlier (default:98)'
        thresh=outlier.value
        left_thresh=left_outlier.value
    preprocess_features_icu(cohort_output = cohort_output, 
                            save_path = target_path, 
                            diag_flag = False, 
                            group_diag = False,
                            chart_flag = chart_flag,
                            clean_chart = clean_chart,
                            impute_outlier_chart = impute_outlier_chart,
                            thresh = thresh,
                            left_thresh = left_thresh)
else:
    # not adapted yet
    print("Warning: this code hasn't been checked!")
    if lab_flag:
        clean_lab=radio_input7.value!='No outlier detection'
        impute_outlier=radio_input7.value=='Impute Outlier (default:98)'
        thresh=outlier.value
        left_thresh=left_outlier.value
    preprocess_features_hosp(cohort_output, False,False,False,lab_flag,False,False,False,clean_lab,impute_outlier,thresh,left_thresh)

## Visualize the data
Before encoding the data as time series, we can load the preprocessed tables and inspect them to understand the data better.

In [20]:
from data_loading.load_preprocessed_data import DataLoader

#data_loader = DataLoader(root_dir, cohort_output,data_mort,data_admn,data_los,diag_flag,proc_flag,out_flag,chart_flag,med_flag, lab_flag, micro_flag,impute, include,bucket,predW)
data_loader = DataLoader(root_dir = root_dir, 
                         cohort_output = cohort_output,
                         if_mort = data_mort,
                         if_admn = data_admn,
                         if_los = data_los,
                         feat_cond = diag_flag,
                         feat_proc = proc_flag,
                         feat_out = out_flag,
                         feat_chart = chart_flag,
                         feat_med = med_flag, 
                         feat_lab = lab_flag, 
                         feat_micro = micro_flag,
                         impute  = 0)

In [21]:
# load each table in a dictionary 
dataset = data_loader.load()
print("Loaded the data for: ", dataset.keys())

[ READ ADM FEATURES ]


0it [00:00, ?it/s]



3it [01:16, 25.65s/it]


[ READ ALL FEATURES ]
Loaded the data for:  dict_keys(['data', 'cond', 'cond_per_adm', 'meds', 'proc', 'out', 'chart', 'lab'])


In [26]:
# now we can look at individual tables
dataset['cond']['stay_id'].unique().shape

(2996,)

In [51]:
dataset['proc']['stay_id'].unique().shape

(2662,)

## 7. Time-Series Representation
In this section, please choose how you want to process and represent time-series data.

- The first option sets how many hours of time-series data to include in this study (default: 72 hours).

- The second option sets the bucket size, i.e. the width of the time windows into which the time series is divided.<br>
For example, with a bucket size of **2**, data is aggregated every 2 hours, so <br>a 24-hour time series is represented by 12 time windows, <br>each aggregating 2 hours of the original raw time series.
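
The bucketing described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the helper `bucket_hourly`, the column names `hour` and `value`, and the use of the mean as aggregation are assumptions made for this example.

```python
import pandas as pd


def bucket_hourly(df: pd.DataFrame, bucket: int, include: int) -> pd.DataFrame:
    """Aggregate an hourly series into windows of `bucket` hours (hypothetical helper).

    `df` needs an integer column 'hour' (hours since admission) and a numeric column 'value'.
    """
    df = df[df["hour"] < include].copy()
    df["window"] = df["hour"] // bucket              # hours 0-1 -> window 0, hours 2-3 -> window 1, ...
    n_windows = -(-include // bucket)                # ceiling division
    per_window = df.groupby("window")["value"].mean()
    # Reindex so that windows without any measurement still appear (as NaN)
    all_windows = pd.RangeIndex(n_windows, name="window")
    return per_window.reindex(all_windows).reset_index()


hourly = pd.DataFrame({"hour": range(24), "value": range(24)})
bucketed = bucket_hourly(hourly, bucket=2, include=24)
print(len(bucketed))  # 12 windows for 24 hours at bucket size 2
```

Whether the mean, the sum, or the last value is the right aggregation depends on the feature, so the choice above is only a placeholder.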

During this step, we also save the time-series data in data dictionaries, in a format that can be used directly for the subsequent deep learning analysis.

### Imputation
You can also choose whether to impute lab/chart values. Imputation is done by forward fill combined with mean or median imputation.<br>
Values are forward filled first; if no value exists for that admission, the mean or median value for the patient is used.
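
A minimal sketch of this two-step imputation on a single series is shown below. The helper `impute_series` is hypothetical, and the fallback used here (mean or median of the observed values in the same series) is an assumption; how the pipeline computes the fallback value is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def impute_series(values: pd.Series, how: str = "Mean") -> pd.Series:
    """Forward fill, then fill what is still missing with the mean or median (hypothetical helper)."""
    filled = values.ffill()                           # carry the last observed value forward
    fallback = values.mean() if how == "Mean" else values.median()
    return filled.fillna(fallback)                    # only gaps before the first observation remain


series = pd.Series([np.nan, 7.0, np.nan, np.nan, 9.0, np.nan])
imputed = impute_series(series, how="Mean")
print(imputed.tolist())  # [8.0, 7.0, 7.0, 7.0, 9.0, 9.0]
```

Forward fill cannot fill values before the first measurement, which is why the second step is needed at all.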

The data dictionaries will be saved in **./data/dict/**

Please refer to the readme for the structure of the data dictionaries.

**Please run the cell below to select the time-series representation**

In [71]:
print("=======Time-series Data Representation=======")

print("Length of data to be included for time-series prediction ?")
if(data_mort):
    radio_input8 = widgets.RadioButtons(options=['First 72 hours','First 48 hours','First 24 hours','Custom'],value='First 72 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=24,
    max=72,
    step=1,
    description='First',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('First (in hours):',layout={'width': '150px'}), text2]))
elif(data_admn):
    radio_input8 = widgets.RadioButtons(options=['Last 72 hours','Last 48 hours','Last 24 hours','Custom'],value='Last 72 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=24,
    max=72,
    step=1,
    description='Last',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('Last (in hours):',layout={'width': '150px'}), text2]))
elif(data_los):
    radio_input8 = widgets.RadioButtons(options=['First 12 hours','First 24 hours','Custom'],value='First 24 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=12,
    max=72,
    step=1,
    description='First',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('First (in hours):',layout={'width': '150px'}), text2]))
else:
    radio_input8 = widgets.RadioButtons(options=['First 72 hours','First 48 hours','First 24 hours','Custom'],value='First 72 hours')
    display(radio_input8)
    text2=widgets.IntSlider(
    value=72,
    min=24,
    max=72,
    step=1,
    description='First',
    disabled=False
    )
    display(widgets.HBox([widgets.Label('First (in hours):',layout={'width': '150px'}), text2]))
    
print("What time bucket size you want to choose ?")
radio_input7 = widgets.RadioButtons(options=['1 hour','2 hour','3 hour','4 hour','5 hour','Custom'],value='1 hour')
display(radio_input7)
text1=widgets.IntSlider(
    value=1,
    min=1,
    max=6,
    step=1,
    disabled=False
    )
#display(text1)
display(widgets.HBox([widgets.Label('Bucket Size (in hours):',layout={'width': '150px'}), text1]))
print("Do you want to forward fill and mean or median impute lab/chart values to form continuous data signal?")
radio_impute = widgets.RadioButtons(options=['No Imputation', 'forward fill and mean','forward fill and median'],value='No Imputation')
display(radio_impute)   

radio_input6 = widgets.RadioButtons(options=['0 hours','2 hours','4 hours','6 hours'],value='0 hours')
if(data_mort):
    print("If you have chosen the mortality prediction task, what prediction window length do you want to keep?")
    radio_input6 = widgets.RadioButtons(options=['2 hours','4 hours','6 hours','8 hours','Custom'],value='2 hours')
    display(radio_input6)
    text3=widgets.IntSlider(
    value=2,
    min=2,
    max=8,
    step=1,
    disabled=False
    )
    display(widgets.HBox([widgets.Label('Prediction window (in hours)',layout={'width': '180px'}), text3]))
print("**Please run below cell to perform time-series representation and save in data dictionaries**")

Length of data to be included for time-series prediction ?


RadioButtons(options=('First 72 hours', 'First 48 hours', 'First 24 hours', 'Custom'), value='First 72 hours')

HBox(children=(Label(value='First (in hours):', layout=Layout(width='150px')), IntSlider(value=72, description…

What time bucket size you want to choose ?


RadioButtons(options=('1 hour', '2 hour', '3 hour', '4 hour', '5 hour', 'Custom'), value='1 hour')

HBox(children=(Label(value='Bucket Size (in hours):', layout=Layout(width='150px')), IntSlider(value=1, max=6,…

Do you want to forward fill and mean or median impute lab/chart values to form continuous data signal?


RadioButtons(options=('No Imputation', 'forward fill and mean', 'forward fill and median'), value='No Imputati…

**Please run below cell to perform time-series representation and save in data dictionaries**


In [75]:
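# Option strings look like '2 hours' or '1 hour'; take the leading number
# (split() also works for multi-digit values, unlike indexing the first character)
if (radio_input6.value=='Custom'):
    predW=int(text3.value)
else:
    predW=int(radio_input6.value.split()[0])
if (radio_input7.value=='Custom'):
    bucket=int(text1.value)
else:
    bucket=int(radio_input7.value.split()[0])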
if (radio_input8.value=='Custom'):
    include=int(text2.value)
else:
    include=int(radio_input8.value.split()[1])
if (radio_impute.value=='forward fill and mean'):
    impute='Mean'
elif (radio_impute.value=='forward fill and median'):
    impute='Median'
else:
    impute=False

if data_icu:
    gen=data_generation_icu.Generator(
        root_dir  = root_dir, 
        cohort_output = cohort_output,
        if_mort = data_mort,
        if_admn = data_admn,
        # if_los = data_los,
        if_los = True,
        feat_cond = diag_flag,
        feat_proc  = proc_flag,
        feat_out = out_flag,
        # feat_chart = chart_flag,
        feat_chart = False,
        feat_med = med_flag,
        # feat_med = False,
        # feat_lab = lab_flag,
        feat_lab = False,
        feat_micro =  micro_flag,
        impute = impute,
        include_time=include,
        bucket=bucket,
        predW=predW)
else:
    gen=data_generation.Generator(root_dir, cohort_output,data_mort,data_admn,data_los,diag_flag,lab_flag,proc_flag,med_flag,impute,include,bucket,predW)

[ READ COHORT ]


  0%|                                                                                            | 0/72 [00:00<?, ?it/s]

[ READ ALL FEATURES ]
include_time 72
[ PROCESSED TIME SERIES TO EQUAL LENGTH  ]


100%|███████████████████████████████████████████████████████████████████████████████████| 72/72 [00:01<00:00, 43.22it/s]


[ PROCESSED TIME SERIES TO EQUAL TIME INTERVAL ]
> /mnt/c/Users/HenrikvonKleist/OneDrive - Helmholtz Zentrum München/Dokumente/PhD/Code/Active Feature Acquisition/MIMIC-IV-Data-Pipeline-main/model/data_generation_icu.py(706)smooth_meds()
    704         import pdb
    705         pdb.set_trace()
--> 706         self.create_Dict(final_meds,final_proc,final_out,final_chart, final_lab, final_micro,los)
    707 
    708 

ipdb> final_proc
        stay_id  itemid  subject_id  start_time
0      30008635  224275    15102490           0

BdbQuit: 

## Extract acute kidney injury information

In [None]:
# Step 1: extract weight

In [52]:
import pandas as pd

def extract_patient_weights(chart_df):
    """
    Extract and compute average patient weights from the chart events DataFrame.

    This function filters the input chart data for relevant weight item IDs, converts 
    measurements from pounds to kilograms where necessary, and calculates the average 
    weight for each ICU stay (identified by `stay_id`). As a sanity check, it also 
    compares the cohort-wide mean of each weight item ID with that of the next one and 
    prints a warning if two means differ by more than 15 kg.

    Parameters:
    - chart_df: DataFrame containing chart events. The DataFrame should have at least the following columns:
        - stay_id: The unique identifier for the patient's ICU stay.
        - itemid: The identifier for the type of measurement.
        - valuenum: The numeric measurement value (e.g., weight).
    
    Returns:
    - averages_df: DataFrame with the following columns:
        - stay_id: The unique identifier for each ICU stay.
        - weight: The average weight for each stay, in kilograms.
    
    Notes:
    - Converts weight values recorded in pounds (itemid 226531) to kilograms.
    - Prints a warning (it does not raise) if the means of two weight item IDs differ by more than 15 kg.
    """
    
    # Define the itemid_map for weight
    itemid_map_weight = {
        224639: {'name': 'Daily Weight', 'uom': 'kg'},
        226512: {'name': 'Admission Weight (Kg)', 'uom': 'kg'},
        226531: {'name': 'Admission Weight (lbs.)', 'uom': 'lbs'},
    }
    
    # Define the itemids for weight
    weight_itemids = list(itemid_map_weight.keys())
    
    # Filter the DataFrame for the relevant itemids
    # (copy, so the unit conversion below does not write into a slice of chart_df)
    weights_df = chart_df[chart_df['itemid'].isin(weight_itemids)].copy()
    
    # Convert the weight from pounds to kilograms for the last itemid
    pound_to_kg_factor = 0.453592
    weights_df.loc[weights_df['itemid'] == 226531, 'valuenum'] = (
        weights_df.loc[weights_df['itemid'] == 226531, 'valuenum'] * pound_to_kg_factor )
    
    # Calculate the average weights for each stay_id
    averages_df = weights_df.groupby('stay_id').agg({
        'valuenum': 'mean'
    }).reset_index()
    

    # Rename the 'valuenum' column to indicate it's the average weight
    averages_df.rename(columns={'valuenum': 'weight'}, inplace=True)
    
    # Check the differences between the weight columns
    weight_means = weights_df.groupby('itemid')['valuenum'].mean()
    
    if len(weight_means) > 1:  # Ensure there are at least two itemid groups to compare
        for i in range(len(weight_itemids) - 1):  # Loop through itemids
            item1 = weight_itemids[i]
            item2 = weight_itemids[i + 1]
            if item1 in weight_means and item2 in weight_means:
                diff = abs(weight_means[item1] - weight_means[item2])
                if diff > 15:
                    print(f"Warning: Differences greater than 15 kg found between itemids {item1} and {item2}")
    
    # Return only the average weight DataFrame
    return averages_df


In [39]:
weight_df = extract_patient_weights(dataset['chart'])

In [40]:
weight_df['stay_id'].unique().shape

(2528,)

In [41]:
weight_df

| | stay_id | weight |
|---|---|---|
| 0 | 30004242 | 76.766514 |
| 1 | 30008635 | 74.661243 |
| 2 | 30014281 | 95.992255 |
| 3 | 30020961 | 57.878339 |
| 4 | 30030100 | 92.714205 |
| ... | ... | ... |
| 2523 | 39986935 | 67.857363 |
| 2524 | 39993968 | 57.473936 |
| 2525 | 39994129 | 73.426584 |
| 2526 | 39995974 | 63.366802 |
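
The consistency check inside `extract_patient_weights` compares means over the whole cohort. A per-stay variant could look like the sketch below. The helper `flag_weight_discrepancies` and the toy values are made up for illustration; the item IDs and the 15 kg threshold are taken from the function above.

```python
import pandas as pd

LB_TO_KG = 0.453592


def flag_weight_discrepancies(chart_df: pd.DataFrame, max_diff_kg: float = 15.0) -> pd.DataFrame:
    """Return the stays whose weight sources disagree by more than `max_diff_kg` (hypothetical helper)."""
    weights = chart_df[chart_df["itemid"].isin([224639, 226512, 226531])].copy()
    is_lbs = weights["itemid"] == 226531
    weights.loc[is_lbs, "valuenum"] = weights.loc[is_lbs, "valuenum"] * LB_TO_KG
    # One row per stay and one column per itemid, holding the mean weight in kg
    per_stay = weights.pivot_table(index="stay_id", columns="itemid",
                                   values="valuenum", aggfunc="mean")
    spread = per_stay.max(axis=1) - per_stay.min(axis=1)   # NaN entries are skipped
    return per_stay[spread > max_diff_kg]


# Toy data: stay 1 is consistent (80 kg vs. 176 lbs), stay 2 is not (60 kg vs. 200 lbs)
toy_chart = pd.DataFrame({
    "stay_id": [1, 1, 2, 2],
    "itemid": [226512, 226531, 226512, 226531],
    "valuenum": [80.0, 176.0, 60.0, 200.0],
})
flagged = flag_weight_discrepancies(toy_chart)
print(flagged.index.tolist())  # [2]
```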


In [None]:
# Step 2: extract urine output information


In [42]:
def combine_urine_output(out_df):
    """
    Combine different versions of urine output into one version, summing measurements if there are conflicts
    (i.e., multiple measurements at the same start_time). Conflicts are detected below, but the
    corresponding warning messages are currently commented out.

    Parameters:
    - out_df: DataFrame containing output events, including urine output.

    Returns:
    - DataFrame with stay_id, urine_volume (summed), and start_time.
    """
    
    # Define the itemid_map for urine output
    itemid_map_urine = {
        "226627": {'name': 'OR Urine', 'uom': 'mL'},
        "226631": {'name': 'PACU Urine', 'uom': 'mL'},
        # "227488": {'name': 'GU Irrigant Volume In', 'uom': 'mL'},
        # "227489": {'name': 'GU Irrigant/Urine Volume Out', 'uom': 'mL'},
        # "227519": {'name': 'Urine output_ApacheIV', 'uom': None},
        # "226566": {'name': 'Urine and GU Irrigant Out', 'uom': 'mL'},
        "226559": {'name': 'Foley', 'uom': 'mL'},
        "226561": {'name': 'Condom Cath', 'uom': 'mL'},
    }

    # Filter out_df to only include the specified urine output itemids
    itemids_urine = list(itemid_map_urine.keys())
    filtered_df = out_df[out_df['itemid'].isin(map(int, itemids_urine))]

    # Group by stay_id and start_time, then sum the 'value' if there are multiple entries
    combined_df = filtered_df.groupby(['stay_id', 'start_time']).agg(
        urine_volume=('value', 'sum')).reset_index()

    # Check for conflicts (multiple rows combined)
    conflicting_entries = filtered_df.groupby(['stay_id', 'start_time']).size()
    conflicts = conflicting_entries[conflicting_entries > 1]
    
    if not conflicts.empty:
        for index, num_measurements in conflicts.items():
            stay_id, start_time = index
            conflicting_values = filtered_df[(filtered_df['stay_id'] == stay_id) &
                                             (filtered_df['start_time'] == start_time)]
            conflicting_itemids = conflicting_values['itemid'].unique()
            item_names = [itemid_map_urine[str(itemid)]['name'] for itemid in conflicting_itemids]

            # print(f"Warning: Combining {num_measurements} measurements at stay_id {stay_id}, start_time {start_time}.")
            # print(f"Itemids combined: {conflicting_itemids}, names: {item_names}")
            # print(f"Urine volumes combined: {conflicting_values['value'].tolist()}")
    
    return combined_df



In [43]:
urine_df  = combine_urine_output(out_df = dataset['out'])

In [44]:
urine_df

| | stay_id | start_time | urine_volume |
|---|---|---|---|
| 0 | 30004242 | 14 | 925.0 |
| 1 | 30004242 | 18 | 75.0 |
| 2 | 30004242 | 19 | 175.0 |
| 3 | 30004242 | 21 | 75.0 |
| 4 | 30004242 | 22 | 175.0 |
| ... | ... | ... | ... |
| 107766 | 39997710 | 321 | 250.0 |
| 107767 | 39997710 | 326 | 160.0 |
| 107768 | 39997710 | 328 | 100.0 |
| 107769 | 39997710 | 330 | 90.0 |


In [None]:
# Step 3: normalize urine output by weight

In [45]:
import pandas as pd

def normalize_urine_output(urine_df: pd.DataFrame, weight_df: pd.DataFrame) -> pd.DataFrame:
    """
    Normalizes urine output to urine volume per kilogram of weight. Imputes missing weights with the mean and provides a warning about the imputation.

    Parameters:
    - urine_df (pd.DataFrame): Dataframe containing urine output data with columns ['stay_id', 'start_time', 'urine_volume'].
    - weight_df (pd.DataFrame): Dataframe containing weight data with columns ['stay_id', 'weight'].

    Returns:
    - pd.DataFrame: A dataframe containing ['stay_id', 'start_time', 'urine_volume_per_kg'] after normalization.
    
    Note:
    - If a weight is missing for a stay_id, the function imputes the missing weight using the mean of the weights from the `weight_df`.
    - A warning is displayed with the percentage of weights that were imputed.
    """
    # Merge urine_df with weight_df on 'stay_id'
    merged_df = pd.merge(urine_df, weight_df, on='stay_id', how='left')
    
    # Calculate the mean weight for imputation from weight_df (one value per stay),
    # so that stays with many urine measurements do not dominate the mean
    mean_weight = weight_df['weight'].mean()

    # Count missing weights
    missing_weights_count = merged_df['weight'].isna().sum()

    # Impute missing weights with the mean weight
    # (assigning the result back avoids the chained `inplace=True` pattern)
    merged_df['weight'] = merged_df['weight'].fillna(mean_weight)

    # Issue a warning about the percentage of urine measurements with an imputed weight
    if missing_weights_count > 0:
        imputed_percentage = (missing_weights_count / len(merged_df)) * 100
        print(f"Warning: {imputed_percentage:.2f}% of the urine measurements use a weight imputed with the mean value.")

    # Normalize urine volume per kg
    merged_df['urine_volume_per_kg'] = merged_df['urine_volume'] / merged_df['weight']

    # Return the new dataframe with the required columns
    return merged_df[['stay_id', 'start_time', 'urine_volume_per_kg']]


In [46]:
urine_norm_df = normalize_urine_output(urine_df = urine_df, weight_df = weight_df)



## Visualize the dynamic data
Now that the data has been processed into a temporal format, we can visualize the resulting tables.

In [70]:
from data_loading.load_preprocessed_dynamic_data import DynamicDataLoader

# Load data with default paths
data_loader = DynamicDataLoader( data_icu=True, 
                                 root_dir = root_dir,
                                cohort_output = cohort_output,
                                 data_dir='./data/csv/', 
                                 num_stays=None
                               )
temporal_df, static_df = data_loader.load_data()


Processing IDs:   7%|████▊                                                           | 224/2999 [02:08<38:44,  1.19id/s]

Skipping id: 32029950 (missing file)
Skipping id: 33957882 (missing file)


Processing IDs:  38%|███████████████████████                                      | 1133/2999 [22:10<1:27:17,  2.81s/id]

Skipping id: 35065251 (missing file)


Processing IDs:  60%|█████████████████████████████████████▍                         | 1785/2999 [53:25<36:20,  1.80s/id]


KeyboardInterrupt: 

In [65]:
temporal_df.columns

MultiIndex([( 'MEDS', '222168'),
            ( 'MEDS', '225158'),
            ( 'MEDS', '225943'),
            ( 'MEDS', '221906'),
            ( 'MEDS', '225823'),
            ( 'MEDS', '225942'),
            ( 'MEDS', '222042'),
            ( 'MEDS', '221668'),
            ( 'MEDS', '225828'),
            ( 'MEDS', '220949'),
            ...
            ('CHART', '227607'),
            ('CHART', '227608'),
            ('CHART', '227609'),
            ('CHART', '227612'),
            ('CHART', '227613'),
            ('CHART', '227618'),
            ('CHART', '227596'),
            ('CHART', '229993'),
            (   'id',       ''),
            ( 'time',       '')],
           length=2986)

In [68]:
import pandas as pd

# read items
df_items = pd.read_csv(root_dir + '/mimiciv/3.0/icu/d_items.csv.gz')

def rename_columns_with_descriptions(df_items, df_data):
    # Create a dictionary to map item_id to description
    id_to_description = df_items.set_index('itemid')['label'].to_dict()

    # Count the number of unique first-level index values and how many columns belong to each
    first_index_counts = df_data.columns.get_level_values(0).value_counts()

    # Print the results
    for first_index, count in first_index_counts.items():
        print(f"Category: {first_index} has {count} different features")
    
    # Create a new list of column names
    new_columns = []
    for col in df_data.columns:
        if isinstance(col, tuple):
            # Multi-level column
            first_level = col[0]  # e.g., 'MEDS', 'CHART'
            second_level = col[1]  # e.g., item_id
            
            # Get the description for the second level; use the item_id if no description found
            # (otherwise all unlabeled columns of a category would share the same name)
            if len(second_level) > 0:
                label = id_to_description.get(int(second_level), "")

                # Combine first level column name and description, falling back to the item_id
                new_col_name = f"{first_level}_{label}" if label else f"{first_level}_{second_level}"
            else:
                new_col_name = f"{first_level}"
        else:
            # Single-level column; keep it unchanged
            new_col_name = col
        
        # Avoid adding an underscore for single-level columns like 'id' and 'time'
        if new_col_name.endswith('_'):
            new_col_name = new_col_name[:-1]  # Remove trailing underscore if present

        new_columns.append(new_col_name)
    
    # Rename the columns with the new names
    df_data.columns = new_columns
    
    return df_data



In [69]:
df_data = rename_columns_with_descriptions(df_items = df_items, df_data = temporal_df)

Category: CHART has 1995 different features
Category: LABS has 657 different features
Category: PROC has 138 different features
Category: MEDS has 128 different features
Category: OUT has 66 different features
Category: time has 1 different features
Category: id has 1 different features


In [None]:
df_data['id']

In [None]:
urine_df['stay_id']

## Add AKI label

In [None]:
unique_ids_in_urine = urine_df['stay_id'].nunique()
print(unique_ids_in_urine )
unique_ids_in_data = df_data ['id'].nunique()
print(unique_ids_in_data)

In [None]:
def add_urine_output_and_aki_label(urine_df, df_temporal):
    # Merge the weight-normalized urine output (urine_df) into the temporal data (df_temporal) based on stay id and time
    df_temporal = df_temporal.merge(urine_df[['stay_id', 'start_time', 'urine_volume_per_kg']], 
                    left_on=['id', 'time'], 
                    right_on=['stay_id', 'start_time'], 
                    how='left')
    
    # Fill missing urine output values with 0
    df_temporal['urine_volume_per_kg'] = df_temporal['urine_volume_per_kg'].fillna(0)
    
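    # Calculate the 12-hour rolling sum and from it the average urine output over the last 12 hours.
    # Note: the window counts rows, so this assumes one row per hour, ordered by time within each stay.
    # The sum is always divided by 12, also during the first 11 hours of a stay.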
    df_temporal['urine_volume_per_kg_12hr_avg'] = df_temporal.groupby('id')['urine_volume_per_kg']\
                                        .transform(lambda x: x.rolling(window=12, min_periods=1).sum() / 12)
    
    # Add AKI label based on the condition
    df_temporal['aki_label'] = (df_temporal['urine_volume_per_kg_12hr_avg'] < 0.5).astype(int)

    
    # compute fraction of positive cases: 
    # Step 1: Identify stays with at least one aki_label == 1
    # (use 'id': the merged 'stay_id' column is NaN for hours without a urine measurement)
    stay_ids_with_aki = df_temporal.loc[df_temporal['aki_label'] == 1, 'id'].unique()

    # Step 2: Count the total number of unique stay_ids
    total_stay_ids = df_temporal['id'].nunique()

    # Step 3: Compute the fraction
    fraction_with_aki = len(stay_ids_with_aki) / total_stay_ids

    # Output the result
    print(f"Fraction of stay_ids with at least one aki_label == 1: {fraction_with_aki:.2f}")

    return df_temporal

In [None]:
df_temporal = add_urine_output_and_aki_label(urine_df = urine_norm_df, df_temporal = df_data)
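
Because the function above divides the rolling sum by 12 even when fewer than 12 hours are available, the first hours of a stay tend to fall below the 0.5 threshold regardless of the actual urine output. A stricter variant that only labels hours with a full window could be sketched as follows. The helper `label_low_urine_output` and the toy data are illustrative assumptions, not part of the pipeline; the column names follow the function above.

```python
import pandas as pd


def label_low_urine_output(df: pd.DataFrame, window: int = 12,
                           threshold: float = 0.5) -> pd.DataFrame:
    """Label hours whose mean urine output over the last `window` hours is below `threshold`.

    Assumes one row per stay and hour with columns 'id', 'time', 'urine_volume_per_kg'.
    """
    df = df.sort_values(["id", "time"]).copy()
    rolling_mean = (
        df.groupby("id")["urine_volume_per_kg"]
          .transform(lambda x: x.rolling(window=window, min_periods=window).mean())
    )
    df["urine_window_avg"] = rolling_mean
    # NaN < threshold evaluates to False, so hours without a full window get label 0
    df["aki_label"] = (rolling_mean < threshold).astype(int)
    return df


# Toy stay with 14 hours of constantly low urine output (0.2 mL/kg per hour)
toy = pd.DataFrame({"id": [1] * 14, "time": range(14), "urine_volume_per_kg": [0.2] * 14})
labeled = label_low_urine_output(toy)
print(labeled["aki_label"].sum())  # 3: only hours 11, 12 and 13 have a full 12-hour window
```

Whether missing hours should count as zero output, as in the function above, is a separate modelling decision that this sketch does not address.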

In [None]:
df_temporal

In [None]:
urine_norm_df


## 8. Machine Learning Models
Select the model type, whether to concatenate or aggregate the time-series features, the cross-validation scheme, and whether to oversample the minority class.

In [None]:
print("=======Machine Learning Models=======")
radio_input5 = widgets.RadioButtons(options=['Logistic Regression','Random Forest','Gradient Bossting','Xgboost'],value='Gradient Bossting')
display(radio_input5)
print("Do you want to concatenate the time-series features?")
radio_input6 = widgets.RadioButtons(options=['Conactenate','Aggregate'],value='Conactenate')
display(radio_input6)
print("Please select below option for cross-validation")
radio_input7 = widgets.RadioButtons(options=['No CV','5-fold CV','10-fold CV'],value='5-fold CV')
display(radio_input7)
print("Do you want to do oversampling for minority class ?")
radio_input8 = widgets.RadioButtons(options=['True','False'],value='True')
display(radio_input8)

In [None]:
if radio_input7.value=='No CV':
    cv=0
elif radio_input7.value=='5-fold CV':
    cv=int(5)
elif radio_input7.value=='10-fold CV':
    cv=int(10)
ml=ml_models.ML_models(data_icu,cv,radio_input5.value,concat=radio_input6.value=='Conactenate',oversampling=radio_input8.value=='True')

## 9. Deep Learning Models
- Time-series LSTM and Time-series CNN use only time-series events, such as medications, charts, labs, and output events, to train the model.

- Hybrid LSTM and Hybrid CNN use static data (diagnosis and demographic data) along with the time-series data to train the model.

- LSTM with Attention uses an attention layer to rank the important features and learns to predict the output. It uses both static and time-series data.

**Go to ./model/parameter.py and define all variables needed for model building and training**

**Please run below cell to select which model to use**

In [None]:
radio_input6=widgets.RadioButtons(options=['Time-series LSTM','Time-series CNN','Hybrid LSTM','Hybrid CNN'],value='Time-series LSTM')
display(radio_input6)
print("Please select below option for cross-validation")
radio_input7 = widgets.RadioButtons(options=['No CV','5-fold CV','10-fold CV'],value='5-fold CV')
display(radio_input7)
print("Do you want to do oversampling for minority class ?")
radio_input8 = widgets.RadioButtons(options=['True','False'],value='True')
display(radio_input8)

In [None]:
if radio_input7.value=='No CV':
    cv=0
elif radio_input7.value=='5-fold CV':
    cv=int(5)
elif radio_input7.value=='10-fold CV':
    cv=int(10)
    
if data_icu:
    model=dl_train.DL_models(data_icu,diag_flag,proc_flag,out_flag,chart_flag,med_flag,False,radio_input6.value,cv,oversampling=radio_input8.value=='True',model_name='attn_icu_read',train=True)
else:
    model=dl_train.DL_models(data_icu,diag_flag,proc_flag,False,False,med_flag,lab_flag,radio_input6.value,cv,oversampling=radio_input8.value=='True',model_name='attn_icu_read',train=True)

## 10. Running BEHRT
Below we integrate the implementation of BEHRT into our pipeline.
We perform the pre-processing needed to run the BEHRT model. https://github.com/deepmedicine/BEHRT

A few things to note before running BEHRT:
- The numerical values are binned into quantiles.
- BEHRT recommends a maximum of 512 events per sample, 
    so feature selection is important to keep the number of events per sample from exceeding 512.
- The model is computationally heavy, so it requires a GPU.
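
Quantile binning of a numerical feature can be sketched with `pandas.qcut` as below. The number of bins (4) and the token prefix `HR_` are assumptions chosen for this illustration; the binning used by the tokenization module may differ.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Four quantile bins, returned as integer codes 0..3 instead of interval labels
bins = pd.qcut(values, q=4, labels=False)

# Turn each bin into a discrete token (hypothetical naming scheme)
tokens = "HR_" + bins.astype(str)
print(tokens.tolist())  # ['HR_0', 'HR_0', 'HR_1', 'HR_1', 'HR_2', 'HR_2', 'HR_3', 'HR_3']
```

Binning turns each continuous measurement into one of a small number of tokens, which keeps the vocabulary finite.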

The output files for BEHRT will be saved in the ./data/behrt/ folder.

**Please run the cell below to pre-process and run BEHRT on the selected cohort**

In [None]:
if data_icu:
    token=tokenization.BEHRT_models(data_icu,diag_flag,proc_flag,out_flag,chart_flag,med_flag,False)
    tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels=token.tokenize()
else:
    token=tokenization.BEHRT_models(data_icu,diag_flag,proc_flag,False,False,med_flag,lab_flag)
    tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels=token.tokenize()
    
behrt_train.train_behrt(tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels)

### EVALUATION AS STANDALONE MODULE
The cell below shows an example of how the evaluation module can be used as a standalone module.

The evaluation.Loss class can be instantiated, and the model output and ground truth can be passed to it to obtain results.

In the example below, we captured the model output and ground truth in a file and read the data from that file.

In the function call ***loss(prob,truth,logits,False)***:

prob -> list of predicted probabilities of the case being positive

truth -> list of ground truth labels

logits -> list of logits obtained from the last fully connected layer, before the softmax/sigmoid function is applied in the model.

In [None]:
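import pickle
import torch

# Fall back to the CPU when no GPU is available, so that `device` is always defined
device='cuda:0' if torch.cuda.is_available() else 'cpu'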
loss=evaluation.Loss(device,acc=True,ppv=True,sensi=True,tnr=True,npv=True,auroc=True,aurocPlot=True,auprc=True,auprcPlot=True,callb=True,callbPlot=True)
with open("./data/output/outputDict", 'rb') as fp:
    outputDict=pickle.load(fp)
prob=list(outputDict['Prob'])
truth=list(outputDict['Labels'])
logits=list(outputDict['Logits'])
#print(torch.tensor(prob))
print("======= TESTING ========")
loss(prob,truth,logits,train=False,standalone=True)
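
To try the standalone evaluation without a trained model, an input file with the keys read above (`Prob`, `Labels`, `Logits`) can be created by hand. The sketch below assumes the file is a pickled pandas DataFrame; the actual type written by the training code is not shown in this notebook, and all values are made up.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

logits = np.array([-2.0, -0.5, 0.3, 1.5])            # made-up logits
output_df = pd.DataFrame({
    "Logits": logits,
    "Prob": 1.0 / (1.0 + np.exp(-logits)),           # sigmoid of the logits
    "Labels": [0, 0, 1, 1],                          # made-up ground truth
})

# Written to a temporary folder here; the notebook cell reads ./data/output/outputDict
path = os.path.join(tempfile.mkdtemp(), "outputDict")
with open(path, "wb") as fp:
    pickle.dump(output_df, fp)

# Read the file back in the same way as the cell above
with open(path, "rb") as fp:
    restored = pickle.load(fp)
prob = list(restored["Prob"])
truth = list(restored["Labels"])
print(len(prob), truth)
```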


### 11. FAIRNESS EVALUATION
In the training and testing steps, we save output files in the **./data/output/** folder.

This file contains the list of demographic variables included in training and testing of the model.

It also contains the ground truth labels and the predicted probability for each sample.

We use this saved data to perform a fairness evaluation of the results obtained from model testing.

This module can also be used as a stand-alone module.

Please create a file that contains the predicted probabilities from the last sigmoid layer in a column named **Prob** and
the ground truth labels for each sample in a column named **Labels**.
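
As an illustration of the kind of comparison a fairness evaluation makes, the sketch below computes the true positive rate per demographic group from the **Prob** and **Labels** columns. It is not the implementation of the `fairness` module; the helper name, the `gender` column, the 0.5 threshold, and the values are assumptions for this example.

```python
import pandas as pd


def true_positive_rate_by_group(df: pd.DataFrame, group_col: str,
                                threshold: float = 0.5) -> pd.Series:
    """Share of truly positive samples predicted as positive, per group (hypothetical helper)."""
    predicted = (df["Prob"] >= threshold).astype(int)
    positives = df[df["Labels"] == 1]
    return predicted.loc[positives.index].groupby(positives[group_col]).mean()


toy = pd.DataFrame({
    "Prob":   [0.9, 0.6, 0.8, 0.2, 0.7, 0.6],
    "Labels": [1,   1,   0,   1,   1,   0],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
tpr = true_positive_rate_by_group(toy, "gender")
print(tpr.to_dict())  # {'F': 1.0, 'M': 0.5}
```

A large gap between groups in such a metric is the kind of signal a fairness report is meant to surface.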

In [None]:
fairness.fairness_evaluation(inputFile='outputDict',outputFile='fairnessReport')

### 12. MODEL CALIBRATION

Please run the cell below if you want to calibrate the predicted probabilities of the model on test data.
It will use the output saved during the testing of the model.

The file is saved in **./data/output/**.

This module can also be used as a stand-alone module.

Please create a file that contains the predicted logits from the last fully connected layer in a column named **Logits** and <br>the ground truth labels for each sample in a column named **Labels**.
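
One common way to calibrate probabilities from logits is temperature scaling: the logits are divided by a single scalar before the sigmoid, and that scalar is chosen to minimize the negative log-likelihood. The sketch below illustrates the idea with a simple grid search. Whether `callibrate_output` uses this method is an assumption; the helper names and values are made up.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def negative_log_likelihood(logits: np.ndarray, labels: np.ndarray, temperature: float) -> float:
    prob = np.clip(sigmoid(logits / temperature), 1e-12, 1 - 1e-12)
    return float(-np.mean(labels * np.log(prob) + (1 - labels) * np.log(1 - prob)))


def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Pick the temperature with the lowest NLL from a fixed grid (hypothetical helper)."""
    grid = np.linspace(0.25, 5.0, 96)
    losses = [negative_log_likelihood(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Made-up, overconfident model: large logits, but two of eight predictions are wrong
logits = np.array([4.0, 3.0, -3.5, 5.0, -4.0, 2.5, -3.0, 4.5])
labels = np.array([1, 0, 0, 1, 1, 1, 0, 1])

temperature = fit_temperature(logits, labels)
uncalibrated = sigmoid(logits)
calibrated = sigmoid(logits / temperature)
print(temperature > 1.0)  # True: the probabilities are pulled towards 0.5
```

In practice the temperature would be fitted on held-out data rather than on the samples being evaluated.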

In [None]:
callibrate_output.callibrate(inputFile='outputDict',outputFile='callibratedResults')