# eICU Data Joining
---

Reading and joining all the preprocessed parts of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

The main goal of this notebook is to prepare a single CSV document that contains all the relevant data for training a machine learning model that predicts mortality. This involves joining tables, filtering out uninformative columns and performing imputation.

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")
# Path to the CSV dataset files
data_path = 'data/eICU/uncompressed/cleaned/'
# Path to the code files
project_path = 'code/eICU-mortality-prediction/'

In [None]:
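# Make sure that every large operation can be handled, by using the disk as an overflow for the memory
# (set through `os.environ`, as `!export` only affects a temporary subshell and not the notebook's kernel)
os.environ['MODIN_OUT_OF_CORE'] = 'true'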
# Another trick to do with Pandas so as to be able to allocate bigger objects to memory
!sudo bash -c 'echo 1 > /proc/sys/vm/overcommit_memory'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
# import pandas as pd
import data_utils as du                    # Data science and machine learning relevant methods

Allow pandas to show more columns and rows:

In [None]:
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

Set the random seed for reproducibility:

In [None]:
du.set_random_seed(42)

## Initializing variables

In [None]:
dtype_dict = {'patientunitstayid': 'uint32',
              'gender': 'UInt8',
              'age': 'UInt8',
              'ethnicity': 'Int8',
              'admissionheight': 'float32',
              'admissionweight': 'float32',
              'death_ts': 'Int32',
              'ts': 'int32',
              'smoking_status': 'Int8',
              'ethanol_use': 'Int8',
              'cad': 'UInt8',
              'cancer': 'UInt8',
              # 'diagnosis_type_1': 'UInt64',
              # 'diagnosis_disorder_2': 'UInt64',
              # 'diagnosis_detailed_3': 'UInt64',
              # 'allergyname': 'UInt64',
              # 'drugallergyhiclseqno': 'UInt64',
              # 'pasthistoryvalue': 'UInt64',
              # 'pasthistorytype': 'UInt64',
              # 'pasthistorydetails': 'UInt64',
              # 'treatmenttype': 'UInt64',
              # 'treatmenttherapy': 'UInt64',
              # 'treatmentdetails': 'UInt64',
              # 'drugunit_x': 'UInt64',
              # 'drugadmitfrequency_x': 'UInt64',
              # 'drughiclseqno_x': 'UInt64',
              'drugdosage_x': 'float32',
              # 'drugadmitfrequency_y': 'UInt64',
              # 'drughiclseqno_y': 'UInt64',
              'drugdosage_y': 'float32',
              # 'drugunit_y': 'UInt64',
              'bodyweight_(kg)': 'float32',
              'oral_intake': 'float32',
              'urine_output': 'float32',
              'i.v._intake': 'float32',
              'saline_flush_(ml)_intake': 'float32',
              'volume_given_ml': 'float32',
              'stool_output': 'float32',
              'prbcs_intake': 'float32',
              'gastric_(ng)_output': 'float32',
              'dialysis_output': 'float32',
              'propofol_intake': 'float32',
              'lr_intake': 'float32',
              'indwellingcatheter_output': 'float32',
              'feeding_tube_flush_ml': 'float32',
              'patient_fluid_removal': 'float32',
              'fentanyl_intake': 'float32',
              'norepinephrine_intake': 'float32',
              'crystalloids_intake': 'float32',
              'voided_amount': 'float32',
              'nutrition_total_intake': 'float32',
              # 'nutrition': 'UInt64',
              # 'nurse_treatments': 'UInt64',
              # 'hygiene/adls': 'UInt64',
              # 'activity': 'UInt64',
              # 'pupils': 'UInt64',
              # 'neurologic': 'UInt64',
              # 'secretions': 'UInt64',
              # 'cough': 'UInt64',
              'priorvent': 'UInt8',
              'onvent': 'UInt8',
              'noninvasivesystolic': 'float32',
              'noninvasivediastolic': 'float32',
              'noninvasivemean': 'float32',
              'paop': 'float32',
              'cardiacoutput': 'float32',
              'cardiacinput': 'float32',
              'svr': 'float32',
              'svri': 'float32',
              'pvr': 'float32',
              'pvri': 'float32',
              'temperature': 'float32',
              'sao2': 'float32',
              'heartrate': 'float32',
              'respiration': 'float32',
              'cvp': 'float32',
              'etco2': 'float32',
              'systemicsystolic': 'float32',
              'systemicdiastolic': 'float32',
              'systemicmean': 'float32',
              'pasystolic': 'float32',
              'padiastolic': 'float32',
              'pamean': 'float32',
              'st1': 'float32',
              'st2': 'float32',
              'st3': 'float32',
              'icp': 'float32',
              # 'labtypeid': 'UInt64',
              # 'labname': 'UInt64',
              # 'lab_units': 'UInt64',
              'lab_result': 'float32'}

## Loading the data

### Patient information

In [None]:
patient_df = pd.read_csv(f'{data_path}normalized/patient.csv')
patient_df.head()

In [None]:
du.search_explore.dataframe_missing_values(patient_df)

Remove rows that don't identify the patient:

In [None]:
patient_df = patient_df[~patient_df.patientunitstayid.isnull()]

In [None]:
du.search_explore.dataframe_missing_values(patient_df)

In [None]:
patient_df.patientunitstayid = patient_df.patientunitstayid.astype(int)
patient_df.ts = patient_df.ts.astype(int)
patient_df.dtypes

In [None]:
note_df = pd.read_csv(f'{data_path}normalized/note.csv')
note_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
patient_df = patient_df.drop(columns='Unnamed: 0')
note_df = note_df.drop(columns='Unnamed: 0')
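
The `Unnamed: 0` column usually appears when a CSV was written with the dataframe's index included. As a minimal sketch, using plain pandas and a toy dataframe (the names `toy_df`, `reloaded_with_index` and `reloaded_without_index` are illustrative, not part of this project), saving with `index=False` avoids having to drop the column after every load:

```python
import io
import pandas as pd

# Toy dataframe standing in for one of the preprocessed tables
toy_df = pd.DataFrame({'patientunitstayid': [1, 2], 'ts': [0, 30]})

# Writing with the default index=True stores the index as a nameless first column...
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# ...whereas index=False leaves only the real columns in the file
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether this is preferable depends on how the upstream preprocessing notebooks write their files, which is not shown here.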

### Diagnosis

In [None]:
diagns_df = pd.read_csv(f'{data_path}normalized/diagnosis.csv')
diagns_df.head()

In [None]:
alrg_df = pd.read_csv(f'{data_path}normalized/allergy.csv')
alrg_df.head()

In [None]:
past_hist_df = pd.read_csv(f'{data_path}normalized/pastHistory.csv')
past_hist_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
diagns_df = diagns_df.drop(columns='Unnamed: 0')
alrg_df = alrg_df.drop(columns='Unnamed: 0')
past_hist_df = past_hist_df.drop(columns='Unnamed: 0')

### Treatments

In [None]:
treat_df = pd.read_csv(f'{data_path}normalized/treatment.csv')
treat_df.head()

In [None]:
adms_drug_df = pd.read_csv(f'{data_path}normalized/admissionDrug.csv')
adms_drug_df.head()

In [None]:
inf_drug_df = pd.read_csv(f'{data_path}normalized/infusionDrug.csv')
inf_drug_df.head()

In [None]:
med_df = pd.read_csv(f'{data_path}normalized/medication.csv')
med_df.head()

In [None]:
in_out_df = pd.read_csv(f'{data_path}normalized/intakeOutput.csv')
in_out_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
treat_df = treat_df.drop(columns='Unnamed: 0')
adms_drug_df = adms_drug_df.drop(columns='Unnamed: 0')
inf_drug_df = inf_drug_df.drop(columns='Unnamed: 0')
med_df = med_df.drop(columns='Unnamed: 0')
in_out_df = in_out_df.drop(columns='Unnamed: 0')

### Nursing data

In [None]:
# nurse_care_df = pd.read_csv(f'{data_path}normalized/nurseCare.csv')
# nurse_care_df.head()

In [None]:
# nurse_assess_df = pd.read_csv(f'{data_path}normalized/nurseAssessment.csv')
# nurse_assess_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
# nurse_care_df = nurse_care_df.drop(columns='Unnamed: 0')
# nurse_assess_df = nurse_assess_df.drop(columns='Unnamed: 0')

### Respiratory data

In [None]:
resp_care_df = pd.read_csv(f'{data_path}normalized/respiratoryCare.csv')
resp_care_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
resp_care_df = resp_care_df.drop(columns='Unnamed: 0')

### Vital signals

In [None]:
vital_aprdc_df = pd.read_csv(f'{data_path}normalized/vitalAperiodic.csv')
vital_aprdc_df.head()

In [None]:
vital_prdc_df = pd.read_csv(f'{data_path}normalized/vitalPeriodic.csv')
vital_prdc_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
vital_aprdc_df = vital_aprdc_df.drop(columns='Unnamed: 0')
vital_prdc_df = vital_prdc_df.drop(columns='Unnamed: 0')

### Exams data

In [None]:
lab_df = pd.read_csv(f'{data_path}normalized/lab.csv')
lab_df.head()

Remove the unneeded 'Unnamed: 0' column:

In [None]:
lab_df = lab_df.drop(columns='Unnamed: 0')

## Joining dataframes

### Checking the matching of unit stays IDs

In [None]:
full_stays_list = set(patient_df.patientunitstayid.unique())

Total number of unit stays:

In [None]:
len(full_stays_list)

In [None]:
note_stays_list = set(note_df.patientunitstayid.unique())

In [None]:
len(note_stays_list)

Number of unit stays that have note data:

In [None]:
len(set.intersection(full_stays_list, note_stays_list))
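
The same overlap check is repeated for every table in this section. A small helper can express it once; the function `stay_coverage` and the toy ID sets below are hypothetical and only illustrate the idea:

```python
def stay_coverage(reference_ids, table_ids):
    """Return how many reference unit stays appear in a table, and the percentage."""
    reference_ids, table_ids = set(reference_ids), set(table_ids)
    n_common = len(reference_ids & table_ids)
    pct = 100 * n_common / len(reference_ids) if reference_ids else 0.0
    return n_common, pct


# Toy unit stay IDs: stay 5 has notes but no patient record, so it is not counted
toy_full_stays = {1, 2, 3, 4}
toy_note_stays = {2, 3, 5}
n_common, pct = stay_coverage(toy_full_stays, toy_note_stays)
print(n_common, pct)
```

Reporting the percentage alongside the count makes it easier to judge which tables cover enough unit stays to justify filtering on them.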

In [None]:
diagns_stays_list = set(diagns_df.patientunitstayid.unique())

In [None]:
len(diagns_stays_list)

Number of unit stays that have diagnosis data:

In [None]:
len(set.intersection(full_stays_list, diagns_stays_list))

In [None]:
alrg_stays_list = set(alrg_df.patientunitstayid.unique())

In [None]:
len(alrg_stays_list)

Number of unit stays that have allergy data:

In [None]:
len(set.intersection(full_stays_list, alrg_stays_list))

In [None]:
past_hist_stays_list = set(past_hist_df.patientunitstayid.unique())

In [None]:
len(past_hist_stays_list)

Number of unit stays that have past history data:

In [None]:
len(set.intersection(full_stays_list, past_hist_stays_list))

In [None]:
treat_stays_list = set(treat_df.patientunitstayid.unique())

In [None]:
len(treat_stays_list)

Number of unit stays that have treatment data:

In [None]:
len(set.intersection(full_stays_list, treat_stays_list))

In [None]:
adms_drug_stays_list = set(adms_drug_df.patientunitstayid.unique())

In [None]:
len(adms_drug_stays_list)

Number of unit stays that have admission drug data:

In [None]:
len(set.intersection(full_stays_list, adms_drug_stays_list))

In [None]:
inf_drug_stays_list = set(inf_drug_df.patientunitstayid.unique())

In [None]:
len(inf_drug_stays_list)

Number of unit stays that have infusion drug data:

In [None]:
len(set.intersection(full_stays_list, inf_drug_stays_list))

In [None]:
med_stays_list = set(med_df.patientunitstayid.unique())

In [None]:
len(med_stays_list)

Number of unit stays that have medication data:

In [None]:
len(set.intersection(full_stays_list, med_stays_list))

In [None]:
in_out_stays_list = set(in_out_df.patientunitstayid.unique())

In [None]:
len(in_out_stays_list)

Number of unit stays that have intake and output data:

In [None]:
len(set.intersection(full_stays_list, in_out_stays_list))

In [None]:
# nurse_care_stays_list = set(nurse_care_df.patientunitstayid.unique())

In [None]:
# len(nurse_care_stays_list)

Number of unit stays that have nurse care data:

In [None]:
# len(set.intersection(full_stays_list, nurse_care_stays_list))

In [None]:
# nurse_assess_stays_list = set(nurse_assess_df.patientunitstayid.unique())

In [None]:
# len(nurse_assess_stays_list)

Number of unit stays that have nurse assessment data:

In [None]:
# len(set.intersection(full_stays_list, nurse_assess_stays_list))

In [None]:
resp_care_stays_list = set(resp_care_df.patientunitstayid.unique())

In [None]:
len(resp_care_stays_list)

Number of unit stays that have respiratory care data:

In [None]:
len(set.intersection(full_stays_list, resp_care_stays_list))

In [None]:
vital_aprdc_stays_list = set(vital_aprdc_df.patientunitstayid.unique())

In [None]:
len(vital_aprdc_stays_list)

Number of unit stays that have vital aperiodic data:

In [None]:
len(set.intersection(full_stays_list, vital_aprdc_stays_list))

In [None]:
vital_prdc_stays_list = set(vital_prdc_df.patientunitstayid.unique())

In [None]:
len(vital_prdc_stays_list)

Number of unit stays that have vital periodic data:

In [None]:
len(set.intersection(full_stays_list, vital_prdc_stays_list))

In [None]:
lab_stays_list = set(lab_df.patientunitstayid.unique())

In [None]:
len(lab_stays_list)

Number of unit stays that have lab data:

In [None]:
len(set.intersection(full_stays_list, lab_stays_list))

### Joining patient with note data

In [None]:
eICU_df = pd.merge(patient_df, note_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with diagnosis data

Filter to the unit stays that also have data in the other tables:

In [None]:
diagns_df = diagns_df[diagns_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(diagns_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, diagns_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()
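
To make the effect of the outer join concrete, here is a minimal sketch on toy data, using plain pandas (the dataframes `toy_patient` and `toy_diagnosis` and their values are invented for illustration). Rows are matched on the same unit stay and timestamp; anything without a match is kept, with missing values in the columns of the other table:

```python
import pandas as pd

toy_patient = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                            'ts': [0, 60, 0],
                            'age': [70, 70, 55]})
toy_diagnosis = pd.DataFrame({'patientunitstayid': [1, 2],
                              'ts': [60, 120],
                              'diagnosis': [3, 7]})

# Outer join on both keys: only (stay 1, ts 60) exists in both tables
merged = pd.merge(toy_patient, toy_diagnosis, how='outer', on=['patientunitstayid', 'ts'])
merged = merged.sort_values(['patientunitstayid', 'ts']).reset_index(drop=True)
print(merged)
```

This is why the joined dataframe becomes increasingly sparse with each table: every timestamp that exists in only one table adds a row that is empty everywhere else.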

### Joining with allergy data

Filter to the unit stays that also have data in the other tables:

In [None]:
alrg_df = alrg_df[alrg_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, alrg_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with past history data

Filter to the unit stays that also have data in the other tables:

In [None]:
past_hist_df = past_hist_df[past_hist_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(past_hist_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
len(eICU_df)

In [None]:
eICU_df = pd.merge(eICU_df, past_hist_df, how='outer', on='patientunitstayid')
eICU_df.head()

In [None]:
len(eICU_df)

In [None]:
eICU_df.patientunitstayid.nunique()
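
Unlike the other joins, this one matches on `patientunitstayid` alone, so the past history values are repeated on every timestamp of a unit stay. The sketch below, on invented toy data with plain pandas, shows why checking the length before and after is worthwhile: if a unit stay has more than one past history row, each of its timestamps is duplicated.

```python
import pandas as pd

toy_series = pd.DataFrame({'patientunitstayid': [1, 1, 1, 2],
                           'ts': [0, 60, 120, 0],
                           'heartrate': [80.0, 85.0, 90.0, 70.0]})
# Unit stay 1 has two past history rows, unit stay 2 has one
toy_past_hist = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                              'pasthistoryvalue': [4, 9, 4]})

merged = pd.merge(toy_series, toy_past_hist, how='outer', on='patientunitstayid')
print(len(toy_series), len(merged))
```

Whether the real `past_hist_df` holds a single row per unit stay is not verified here; if it does, the length should stay the same after the merge.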

### Joining with treatment data

Filter to the unit stays that also have data in the other tables:

In [None]:
treat_df = treat_df[treat_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(treat_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, treat_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with admission drug data

Filter to the unit stays that also have data in the other tables:

In [None]:
adms_drug_df = adms_drug_df[adms_drug_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, adms_drug_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with infusion drug data

Filter to the unit stays that also have data in the other tables:

In [None]:
inf_drug_df = inf_drug_df[inf_drug_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, inf_drug_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with medication data

Filter to the unit stays that also have data in the other tables:

In [None]:
med_df = med_df[med_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(med_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, med_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with intake and output data

Filter to the unit stays that also have data in the other tables:

In [None]:
in_out_df = in_out_df[in_out_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(in_out_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, in_out_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with nurse care data

In [None]:
# eICU_df = pd.merge(eICU_df, nurse_care_df, how='outer', on=['patientunitstayid', 'ts'])
# eICU_df.head()

### Joining with nurse assessment data

In [None]:
# eICU_df = pd.merge(eICU_df, nurse_assess_df, how='outer', on=['patientunitstayid', 'ts'])
# eICU_df.head()

### Joining with nurse charting data

In [None]:
# eICU_df = pd.merge(eICU_df, nurse_chart_df, how='outer', on=['patientunitstayid', 'ts'])
# eICU_df.head()

### Joining with respiratory care data

Filter to the unit stays that also have data in the other tables:

In [None]:
resp_care_df = resp_care_df[resp_care_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, resp_care_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with aperiodic vital signals data

Filter to the unit stays that also have data in the other tables:

In [None]:
vital_aprdc_df = vital_aprdc_df[vital_aprdc_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(vital_aprdc_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, vital_aprdc_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with periodic vital signals data

Filter to the unit stays that also have data in the other tables:

In [None]:
vital_prdc_df = vital_prdc_df[vital_prdc_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(vital_prdc_df.patientunitstayid.unique())]

Save the current dataframe:

In [None]:
eICU_df.to_csv(f'{data_path}normalized/eICU_before_joining_vital_prdc.csv')

Merge the dataframes:

In [None]:
eICU_df = pd.read_csv(f'{data_path}normalized/eICU_before_joining_vital_prdc.csv',
                      dtype=dtype_dict)
eICU_df = eICU_df.drop(columns=['Unnamed: 0'])
eICU_df.head()

In [None]:
eICU_df = pd.merge(eICU_df, vital_prdc_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

Save the current dataframe:

In [None]:
eICU_df.to_csv(f'{data_path}normalized/eICU_post_joining_vital_prdc.csv')

In [None]:
eICU_df = pd.read_csv(f'{data_path}normalized/eICU_post_joining_vital_prdc.csv',
                      dtype=dtype_dict)
eICU_df = eICU_df.drop(columns=['Unnamed: 0'])
eICU_df.head()

In [None]:
eICU_df.patientunitstayid.nunique()

### Joining with lab data

Filter to the unit stays that also have data in the other tables:

In [None]:
lab_df = lab_df[lab_df.patientunitstayid.isin(eICU_df.patientunitstayid.unique())]

Also filter only to the unit stays that have data in this new table, considering its importance:

In [None]:
eICU_df = eICU_df[eICU_df.patientunitstayid.isin(lab_df.patientunitstayid.unique())]

Merge the dataframes:

In [None]:
eICU_df = pd.merge(eICU_df, lab_df, how='outer', on=['patientunitstayid', 'ts'])
eICU_df.head()

Save the current dataframe:

In [None]:
eICU_df.to_csv(f'{data_path}normalized/eICU_post_joining.csv')

In [None]:
eICU_df = pd.read_csv(f'{data_path}normalized/eICU_post_joining.csv', dtype=dtype_dict)
eICU_df = eICU_df.drop(columns=['Unnamed: 0'])
eICU_df.head()

In [None]:
eICU_df.columns

In [None]:
eICU_df.dtypes

In [None]:
eICU_df.patientunitstayid.nunique()

## Cleaning the joined data

### Removing unit stays that are too short

In [None]:
eICU_df.info(memory_usage='deep')

Make sure that the dataframe is ordered by time `ts`:

In [None]:
eICU_df = eICU_df.sort_values('ts')
eICU_df.head()

Remove unit stays that have fewer than 10 records:

In [None]:
unit_stay_len = eICU_df.groupby('patientunitstayid').patientunitstayid.count()
unit_stay_len

In [None]:
unit_stay_short = set(unit_stay_len[unit_stay_len < 10].index)
unit_stay_short

In [None]:
len(unit_stay_short)

In [None]:
eICU_df.patientunitstayid.nunique()

In [None]:
eICU_df = eICU_df[~eICU_df.patientunitstayid.isin(unit_stay_short)]

In [None]:
eICU_df.patientunitstayid.nunique()

Remove unit stays whose data span less than 48h (`ts` is measured in minutes):

In [None]:
unit_stay_duration = eICU_df.groupby('patientunitstayid').ts.apply(lambda x: x.max() - x.min())
unit_stay_duration

In [None]:
unit_stay_short = set(unit_stay_duration[unit_stay_duration < 48*60].index)
unit_stay_short

In [None]:
len(unit_stay_short)

In [None]:
eICU_df.patientunitstayid.nunique()

In [None]:
eICU_df = eICU_df[~eICU_df.patientunitstayid.isin(unit_stay_short)]

In [None]:
eICU_df.patientunitstayid.nunique()
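
Both filters of this section can be sketched together on toy data. The example uses plain pandas, invented values, and a minimum of 2 records instead of 10 so that the toy dataframe stays small; `min_records`, `min_duration_min` and `keep_ids` are illustrative names:

```python
import pandas as pd

toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 1, 2, 2, 3],
                       'ts': [0, 1500, 3000, 0, 600, 0]})
min_records = 2              # the notebook uses 10
min_duration_min = 48 * 60   # 48h, with `ts` in minutes

grouped = toy_df.groupby('patientunitstayid').ts
n_records = grouped.count()
duration = grouped.max() - grouped.min()

# Keep only the unit stays that pass both the record count and the duration thresholds
keep_ids = n_records[(n_records >= min_records) & (duration >= min_duration_min)].index
filtered = toy_df[toy_df.patientunitstayid.isin(keep_ids)]
print(filtered.patientunitstayid.unique().tolist())
```

Here only unit stay 1 survives: stay 2 covers just 10 hours and stay 3 has a single record.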

### Joining duplicate columns

#### Continuous features

In [None]:
# Base names of the columns that got duplicated in the merges (strip only the trailing `_x` / `_y` suffix)
set(col[:-2] for col in eICU_df.columns if col.endswith(('_x', '_y')))

In [None]:
eICU_df[['drugdosage_x', 'drugdosage_y']].head(20)

In [None]:
eICU_df[eICU_df.index == 2564878][['drugdosage_x', 'drugdosage_y']]

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
eICU_df, pd = du.utils.convert_dataframe(eICU_df, to='pandas')

In [None]:
eICU_df = du.data_processing.merge_columns(eICU_df, cols_to_merge=['drugdosage'])
eICU_df.sample(20)

In [None]:
eICU_df['drugdosage'].head(20)

In [None]:
eICU_df[eICU_df.index == 2564878][['drugdosage']]
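
The internals of `du.data_processing.merge_columns` are not shown in this notebook. As a rough illustration of what merging a pair of `_x` / `_y` columns can look like, the sketch below uses plain pandas and `combine_first` on toy data; this is an assumption about the general idea, and the actual function may resolve conflicting values differently.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'drugdosage_x': [1.0, np.nan, np.nan, 2.0],
                       'drugdosage_y': [np.nan, 5.0, np.nan, 4.0]})

# Take the `_x` value where present and fall back to `_y` otherwise
toy_df['drugdosage'] = toy_df['drugdosage_x'].combine_first(toy_df['drugdosage_y'])
toy_df = toy_df.drop(columns=['drugdosage_x', 'drugdosage_y'])
print(toy_df)
```

In the last toy row both columns have a value and the `_x` one wins; rows like that are the ones worth inspecting in the real data, as done above for a specific index.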

Save the current dataframe:

In [None]:
eICU_df.to_csv(f'{data_path}normalized/eICU_post_merge_continuous_cols.csv')

In [None]:
eICU_df = pd.read_csv(f'{data_path}normalized/eICU_post_merge_continuous_cols.csv', dtype=dtype_dict)
eICU_df = eICU_df.drop(columns=['Unnamed: 0'])
eICU_df.head()

#### Categorical features

Join encodings of the same features, from different tables.

Load encoding dictionaries:

In [None]:
# Use context managers so that the files are closed after being read
with open(f'{data_path}cat_embed_feat_enum_adms_drug.yaml', 'r') as stream_adms_drug:
    cat_embed_feat_enum_adms_drug = yaml.load(stream_adms_drug, Loader=yaml.FullLoader)
with open(f'{data_path}cat_embed_feat_enum_med.yaml', 'r') as stream_med:
    cat_embed_feat_enum_med = yaml.load(stream_med, Loader=yaml.FullLoader)

In [None]:
eICU_df[['drugadmitfrequency_x', 'drugunit_x', 'drughiclseqno_x',
         'drugadmitfrequency_y', 'drugunit_y', 'drughiclseqno_y']].head(20)

Standardize the encoding of similar columns:

In [None]:
list(cat_embed_feat_enum_adms_drug.keys())

In [None]:
list(cat_embed_feat_enum_med.keys())

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
eICU_df, pd = du.utils.convert_dataframe(eICU_df, to='pandas')

In [None]:
# Dictionary that will hold the converged encodings of the merged features
cat_embed_feat_enum = dict()

In [None]:
eICU_df, cat_embed_feat_enum['drugadmitfrequency'] = du.embedding.converge_enum(eICU_df, cat_feat_name=['drugadmitfrequency_x',
                                                                                                        'drugadmitfrequency_y'],
                                                                                dict1=cat_embed_feat_enum_adms_drug['drugadmitfrequency'],
                                                                                dict2=cat_embed_feat_enum_med['frequency'],
                                                                                nan_value=0, sort=True, inplace=True)

In [None]:
eICU_df, cat_embed_feat_enum['drugunit'] = du.embedding.converge_enum(eICU_df, cat_feat_name=['drugunit_x',
                                                                                              'drugunit_y'],
                                                                      dict1=cat_embed_feat_enum_adms_drug['drugunit'],
                                                                      dict2=cat_embed_feat_enum_med['drugunit'],
                                                                      nan_value=0, sort=True, inplace=True)

In [None]:
eICU_df, cat_embed_feat_enum['drughiclseqno'] = du.embedding.converge_enum(eICU_df, cat_feat_name=['drughiclseqno_x',
                                                                                                   'drughiclseqno_y'],
                                                                           dict1=cat_embed_feat_enum_adms_drug['drughiclseqno'],
                                                                           dict2=cat_embed_feat_enum_med['drughiclseqno'],
                                                                           nan_value=0, sort=True, inplace=True)
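
The idea behind converging two encodings can be sketched in plain Python. This is not the implementation of `du.embedding.converge_enum`; the function `converge_toy_enums`, the toy dictionaries, and the assumption that each dictionary maps a category name to an integer code (with `'nan'` reserved for missing values) are all illustrative.

```python
def converge_toy_enums(dict1, dict2, nan_value=0):
    """Build one category -> code mapping from two encodings of the same feature.

    Also returns, for each original dictionary, a map from its old codes to the shared codes.
    """
    # Collect every category seen in either table, keeping the missing value marker apart
    categories = sorted((set(dict1) | set(dict2)) - {'nan'})
    shared = {'nan': nan_value}
    for new_code, category in enumerate(categories, start=nan_value + 1):
        shared[category] = new_code
    remap1 = {old_code: shared[category] for category, old_code in dict1.items()}
    remap2 = {old_code: shared[category] for category, old_code in dict2.items()}
    return shared, remap1, remap2


# The same unit 'ml' is encoded as 2 in one table and as 1 in the other
toy_enum_adms_drug = {'nan': 0, 'mg': 1, 'ml': 2}
toy_enum_med = {'nan': 0, 'ml': 1, 'units': 2}
shared_enum, remap_adms_drug, remap_med = converge_toy_enums(toy_enum_adms_drug, toy_enum_med)
print(shared_enum)
print(remap_med)
```

Once the codes of each column are remapped to the shared dictionary, equal numbers mean equal categories, which is what makes merging the `_x` and `_y` columns meaningful.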

Merge the features:

In [None]:
eICU_df = du.data_processing.merge_columns(eICU_df, cols_to_merge=['drugadmitfrequency', 'drugunit', 'drughiclseqno'])
eICU_df.sample(20)

In [None]:
eICU_df[['drugadmitfrequency', 'drugunit', 'drughiclseqno']].head(20)

Save the current dataframe:

In [None]:
eICU_df.to_csv(f'{data_path}normalized/eICU_post_merge_categorical_cols.csv')

In [None]:
eICU_df = pd.read_csv(f'{data_path}normalized/eICU_post_merge_categorical_cols.csv')
eICU_df = eICU_df.drop(columns=['Unnamed: 0'])
eICU_df.head()

### Creating a single encoding dictionary for the complete dataframe

Combine the encoding dictionaries of all tables, taking into account the converged ones, into a single dictionary that represents all the categorical features in the resulting dataframe.

In [None]:
stream_adms_drug = open(f'{data_path}cat_embed_feat_enum_adms_drug.yaml', 'r')
stream_inf_drug = open(f'{data_path}cat_embed_feat_enum_inf_drug.yaml', 'r')
stream_med = open(f'{data_path}cat_embed_feat_enum_med.yaml', 'r')
stream_treat = open(f'{data_path}cat_embed_feat_enum_treat.yaml', 'r')
stream_in_out = open(f'{data_path}cat_embed_feat_enum_in_out.yaml', 'r')
stream_diag = open(f'{data_path}cat_embed_feat_enum_diag.yaml', 'r')
stream_alrg = open(f'{data_path}cat_embed_feat_enum_alrg.yaml', 'r')
stream_past_hist = open(f'{data_path}cat_embed_feat_enum_past_hist.yaml', 'r')
stream_resp_care = open(f'{data_path}cat_embed_feat_enum_resp_care.yaml', 'r')
# stream_nurse_care = open(f'{data_path}cat_embed_feat_enum_nurse_care.yaml', 'r')
# stream_nurse_assess = open(f'{data_path}cat_embed_feat_enum_nurse_assess.yaml', 'r')
stream_lab = open(f'{data_path}cat_embed_feat_enum_lab.yaml', 'r')
stream_patient = open(f'{data_path}cat_embed_feat_enum_patient.yaml', 'r')
stream_notes = open(f'{data_path}cat_embed_feat_enum_notes.yaml', 'r')

In [None]:
cat_embed_feat_enum_adms_drug = yaml.load(stream_adms_drug, Loader=yaml.FullLoader)
cat_embed_feat_enum_inf_drug = yaml.load(stream_inf_drug, Loader=yaml.FullLoader)
cat_embed_feat_enum_med = yaml.load(stream_med, Loader=yaml.FullLoader)
cat_embed_feat_enum_treat = yaml.load(stream_treat, Loader=yaml.FullLoader)
cat_embed_feat_enum_in_out = yaml.load(stream_in_out, Loader=yaml.FullLoader)
cat_embed_feat_enum_diag = yaml.load(stream_diag, Loader=yaml.FullLoader)
cat_embed_feat_enum_alrg = yaml.load(stream_alrg, Loader=yaml.FullLoader)
cat_embed_feat_enum_past_hist = yaml.load(stream_past_hist, Loader=yaml.FullLoader)
cat_embed_feat_enum_resp_care = yaml.load(stream_resp_care, Loader=yaml.FullLoader)
# cat_embed_feat_enum_nurse_care = yaml.load(stream_nurse_care, Loader=yaml.FullLoader)
# cat_embed_feat_enum_nurse_assess = yaml.load(stream_nurse_assess, Loader=yaml.FullLoader)
cat_embed_feat_enum_lab = yaml.load(stream_lab, Loader=yaml.FullLoader)
cat_embed_feat_enum_patient = yaml.load(stream_patient, Loader=yaml.FullLoader)
cat_embed_feat_enum_notes = yaml.load(stream_notes, Loader=yaml.FullLoader)

In [None]:
cat_embed_feat_enum = du.utils.merge_dicts([cat_embed_feat_enum_adms_drug, cat_embed_feat_enum_inf_drug,
                                            cat_embed_feat_enum_med, cat_embed_feat_enum_treat,
                                            cat_embed_feat_enum_in_out, cat_embed_feat_enum_diag,
                                            cat_embed_feat_enum_alrg, cat_embed_feat_enum_past_hist,
                                            cat_embed_feat_enum_resp_care, cat_embed_feat_enum_lab,
                                            cat_embed_feat_enum_patient, cat_embed_feat_enum_notes,
                                            cat_embed_feat_enum])

In [None]:
list(cat_embed_feat_enum.keys())

Save the final encoding dictionary:

In [None]:
# `data_path` already ends in 'cleaned/', so the file name is appended directly
with open(f'{data_path}cat_embed_feat_enum_eICU.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Removing unit stays with too many missing values

Consider removing all unit stays that have, combining rows and columns, a very high percentage of missing values.

Reconvert dataframe to Modin:

In [None]:
eICU_df, pd = du.utils.convert_dataframe(eICU_df, to='modin')

In [None]:
n_features = len(eICU_df.columns)
n_features

Create a temporary column that counts each row's number of missing values:

In [None]:
eICU_df['row_msng_val'] = eICU_df.isnull().sum(axis=1)
eICU_df[['patientunitstayid', 'ts', 'row_msng_val']].head()

Check each unit stay's percentage of missing data points:

In [None]:
# Number of possible data points in each unit stay
n_data_points = eICU_df.groupby('patientunitstayid').ts.count() * n_features
n_data_points

In [None]:
# Number of missing values in each unit stay
n_msng_val = eICU_df.groupby('patientunitstayid').row_msng_val.sum()
n_msng_val

In [None]:
# Percentage of missing values in each unit stay
msng_val_prct = (n_msng_val / n_data_points) * 100
msng_val_prct

In [None]:
msng_val_prct.describe()

Remove unit stays that have too many missing values (>70% of their respective data points):

In [None]:
unit_stay_high_msgn = set(msng_val_prct[msng_val_prct > 70].index)
unit_stay_high_msgn

In [None]:
eICU_df.patientunitstayid.nunique()

In [None]:
eICU_df = eICU_df[~eICU_df.patientunitstayid.isin(unit_stay_high_msgn)]

In [None]:
eICU_df.patientunitstayid.nunique()
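
The per unit stay percentage computed above can be checked by hand on a small example. The sketch below uses plain pandas and invented values:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 2, 2],
                       'ts': [0, 60, 0, 60],
                       'heartrate': [80.0, np.nan, np.nan, np.nan],
                       'sao2': [97.0, 98.0, np.nan, np.nan]})
n_features = len(toy_df.columns)

# Missing values per row, then summed over each unit stay
row_msng_val = toy_df.isnull().sum(axis=1)
n_msng_val = row_msng_val.groupby(toy_df.patientunitstayid).sum()

# Number of possible data points in each unit stay: rows times columns
n_data_points = toy_df.groupby('patientunitstayid').ts.count() * n_features
msng_val_prct = n_msng_val / n_data_points * 100
print(msng_val_prct)
```

Unit stay 2 has no measurements at all, yet it scores 50% rather than 100%, because `patientunitstayid` and `ts` are counted as features and are never missing. The same applies to the notebook's `n_features`, so the 70% threshold is somewhat stricter than it looks.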

### Removing columns with too many missing values

We should remove features that have too many missing values (in this case, those with more than 70% of missing values). Without enough data, even imputation is risky, as it's unlikely to correctly model the missing feature.

In [None]:
du.search_explore.dataframe_missing_values(eICU_df)

In [None]:
prev_features = eICU_df.columns
len(prev_features)

In [None]:
eICU_df = du.data_processing.remove_cols_with_many_nans(eICU_df, nan_percent_thrsh=70, inplace=True)

In [None]:
features = eICU_df.columns
len(features)

Removed features:

In [None]:
set(prev_features) - set(features)

In [None]:
eICU_df.head()

### Removing rows with too many missing values

This might not actually make sense to do, as some tables, such as `patient`, have their data set at a timestamp that is unlikely to match rows from other tables, even though that data is still useful to keep.

In [None]:
len(eICU_df)

In [None]:
n_features = len(eICU_df.columns)
n_features

In [None]:
eICU_df = eICU_df[eICU_df.isnull().sum(axis=1) < 0.5 * n_features]

In [None]:
len(eICU_df)

### Performing imputation

In [None]:
du.search_explore.dataframe_missing_values(eICU_df)

In [None]:
# [TODO] Be careful to avoid interpolating categorical features (e.g. `drugunit`); these must only
# be imputed through zero filling
eICU_df = du.data_processing.missing_values_imputation(eICU_df, method='interpolation',
                                                       id_column='patientunitstayid', inplace=True)
eICU_df.head()

In [None]:
du.search_explore.dataframe_missing_values(eICU_df)
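
One way to handle the concern about categorical features is to split the columns by type before imputing. The sketch below uses plain pandas on toy data; it is not the implementation of `du.data_processing.missing_values_imputation`, and the lists `continuous_cols` and `categorical_cols` are illustrative.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 1, 2, 2],
                       'ts': [0, 60, 120, 0, 60],
                       'heartrate': [80.0, np.nan, 90.0, np.nan, 70.0],
                       'drugunit': [2.0, np.nan, 2.0, np.nan, 1.0]})
continuous_cols = ['heartrate']
categorical_cols = ['drugunit']

# Interpolate continuous features inside each unit stay, so values never leak between patients
toy_df = toy_df.sort_values(['patientunitstayid', 'ts'])
toy_df[continuous_cols] = (toy_df.groupby('patientunitstayid')[continuous_cols]
                           .transform(lambda s: s.interpolate(limit_direction='both')))

# Categorical codes are only zero filled, as 0 is the code reserved for missing values
toy_df[categorical_cols] = toy_df[categorical_cols].fillna(0)
print(toy_df)
```

Interpolating `drugunit` would have produced codes that either point to an unrelated category or to none at all, which is the failure the comment above warns about.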

## Setting the label

Define the label column considering the desired time window on which we want to predict mortality (0, 24h, 48h, 72h, etc).

In [None]:
time_window_h = 24

In [None]:
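# The label is positive when death occurs within `time_window_h` hours of the row's timestamp
# (`ts` and `death_ts` are in minutes; rows without a `death_ts` get a label of 0)
eICU_df['label'] = (eICU_df.death_ts - eICU_df.ts <= time_window_h * 60).fillna(False).astype('uint8')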
eICU_df.head()
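
A minimal sketch of the labeling rule on toy data, using plain pandas and invented values, assuming timestamps in minutes and a missing `death_ts` for patients who survived:

```python
import numpy as np
import pandas as pd

time_window_h = 24
toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                       'ts': [0, 2000, 0],
                       'death_ts': [3000.0, 3000.0, np.nan]})

# Minutes left until death at each row; missing for the surviving patient
time_to_death = toy_df.death_ts - toy_df.ts

# Comparisons against a missing value evaluate to False, so survivors are labeled 0
toy_df['label'] = (time_to_death <= time_window_h * 60).astype('uint8')
print(toy_df)
```

The first row of unit stay 1 is labeled 0 because death is still 50 hours away, while its second row falls inside the 24h window and is labeled 1. The label therefore changes over time within a unit stay.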