<h1> Prepare Data </h1>
To prepare the data for our models we need to:
<ol>
<li> remove unnecessary rows from the CSV files (rows that will not be used as model input, as explained in the instructions) </li>
<li> create train and validation datasets </li>
</ol>
<h2> Adding Columns </h2>
For each patient (and each row) we add 3 columns to the data frame:
<ol>
<li> max_ICULOS - the total time a patient was in the ICU </li>
<li> time_bm - the difference between the current time and the total time the patient was in the ICU, defined as $\text{time\_bm} = \text{ICULOS} - \text{max\_ICULOS}$
</li>
<li> Label column - 1 if the patient developed sepsis at some point during the ICU stay, and 0 otherwise
</li>
</ol>
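
The three columns listed above can be sketched in pandas as follows. This is only an illustration and not the code inside `create_patients_df`: the helper name `add_patient_columns` and the column names `patient_id` and `SepsisLabel` are assumptions made for the example.

```python
import pandas as pd


def add_patient_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add max_ICULOS, time_bm and Label, computed per patient.

    Assumes hypothetical columns `patient_id`, `ICULOS` and `SepsisLabel`.
    """
    out = df.copy()
    keys = out["patient_id"]
    # Total ICU stay: the largest ICULOS value recorded for the patient.
    out["max_ICULOS"] = out["ICULOS"].groupby(keys).transform("max")
    # Current time minus total time: 0 on the last row, negative before it.
    out["time_bm"] = out["ICULOS"] - out["max_ICULOS"]
    # Patient-level label: 1 if any row of the patient is marked septic.
    out["Label"] = out["SepsisLabel"].groupby(keys).transform("max").astype(int)
    return out


example = pd.DataFrame({
    "patient_id": ["a", "a", "a", "b", "b"],
    "ICULOS": [1, 2, 3, 1, 2],
    "SepsisLabel": [0, 0, 1, 0, 0],
})
result = add_patient_columns(example)
```

With this definition `time_bm` is never positive, so each row records how many hours remain until the end of the patient's ICU stay.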


In [1]:
import pandas as pd
import os
import tqdm
from random import sample
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from DataPreaparators import create_patients_df, DataPreparator

In [2]:
TRAIN_PATH = 'filtered_train_df_0705'
VAL_PATH = 'filtered_val_df_0705'
TEST_PATH = 'filtered_test_df_0705'
TRAIN_MEAN_PATH = 'filtered_train_mean.csv'

First, we read the patients from the given directory, split the train data into train and validation sets (80/20, by patient), and save one dataframe per set (train, validation, test) to CSV.

In [4]:
for d_type in ['train', 'test']:
    data_path = f'/home/student/Early_Prediction_of_Sepsis/data/{d_type}/'
    # List patient files from the same directory that is passed to create_patients_df.
    patients = os.listdir(data_path)
    if d_type == 'train':
        # Hold out 20% of the patients (not rows) for validation.
        train_patients = sample(patients, int(len(patients) * 0.8))
        train_set = set(train_patients)  # set lookup keeps the filter below linear
        val_patients = [x for x in patients if x not in train_set]
        train_df = create_patients_df(train_patients, data_path)
        train_df.to_csv(f'{TRAIN_PATH}.csv', index=False)
        val_df = create_patients_df(val_patients, data_path)
        val_df.to_csv(f'{VAL_PATH}.csv', index=False)
    else:
        test_df = create_patients_df(patients, data_path)
        test_df.to_csv(f'{TEST_PATH}.csv', index=False)

In [5]:
frequency_used_attributes = ['BaseExcess',  'FiO2', 'pH', 'PaCO2', 'Glucose','Lactate', 'PTT']
# FREQUENCY_ATTR =['5w_sum_BaseExcess', '5w_sum_FiO2', '5w_sum_pH', '5w_sum_PaCO2', '5w_sum_Glucose', '5w_sum_Lactate', '5w_sum_PTT']
# LAB_ATTR = ['Hct',  'Glucose','Potassium']
CONST_ATTR = ['max_ICULOS','Gender']
OTHER_ATTR = ['HR','MAP','O2Sat', 'Resp','SBP','ICULOS']
ALL_LAB_ATTR = ['BaseExcess', 'HCO3', 'FiO2', 'pH', 'PaCO2',
 'SaO2', 'AST', 'BUN', 'Alkalinephos', 'Calcium', 'Chloride',
 'Creatinine', 'Bilirubin_direct', 'Glucose', 'Lactate',
 'Magnesium', 'Phosphate', 'Potassium', 'Bilirubin_total',
 'TroponinI', 'Hct', 'Hgb', 'PTT', 'WBC', 'Fibrinogen','Platelets']
COLS = CONST_ATTR+OTHER_ATTR

<h2> RNN Preprocessing </h2>

To use the data as input for RNN models, we need additional preprocessing and imputation, as explained in the report.
The DataPreparator class adds frequency and window columns and uses an iterative imputer to fill in each patient's missing data.
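
The exact definition of the frequency columns lives in `DataPreparator` and is not shown in this notebook. One plausible reading, suggested by the commented-out `5w_sum_*` names, is a per-patient rolling count of how often each lab value was actually measured. The sketch below follows that assumption; the helper name `add_frequency_columns`, the `patient_id` column, and the window size are all hypothetical.

```python
import numpy as np
import pandas as pd


def add_frequency_columns(df, lab_columns, window=5):
    """For each lab column, count measurements in the last `window` rows per patient."""
    out = df.copy()
    for col in lab_columns:
        # 1 where the lab value was measured in that row, 0 where it is missing.
        measured = out[col].notna().astype(int)
        out[f"{window}w_sum_{col}"] = (
            measured.groupby(out["patient_id"])
            .transform(lambda s: s.rolling(window, min_periods=1).sum())
            .astype(int)
        )
    return out


example = pd.DataFrame({
    "patient_id": ["a"] * 7 + ["b"] * 2,
    "Lactate": [np.nan, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0, np.nan],
})
freq = add_frequency_columns(example, ["Lactate"], window=3)
```

Grouping by patient before rolling keeps one patient's measurements from leaking into the window of the next patient.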

For a patient whose values for some feature are all NULL, we impute that feature with its mean value over the train set.
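
A minimal sketch of this two-stage imputation for a single patient is shown here. It is not the `DataPreparator` implementation: the helper name `impute_patient`, the example values, and the `IterativeImputer` settings are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def impute_patient(patient_df, train_means, columns):
    """Fill one patient's missing values (hypothetical helper)."""
    out = patient_df.copy()
    # Stage 1: a feature with no observed value for this patient
    # falls back to the train-set mean of that feature.
    for col in columns:
        if out[col].isna().all():
            out[col] = train_means[col]
    # Stage 2: the remaining gaps are estimated from the other features.
    imputer = IterativeImputer(max_iter=10, random_state=0)
    out[columns] = imputer.fit_transform(out[columns])
    return out


# Made-up train means and patient rows, for illustration only.
train_means = pd.Series({"HR": 84.0, "MAP": 80.0, "Lactate": 1.5})
patient = pd.DataFrame({
    "HR": [80.0, np.nan, 90.0, 85.0],
    "MAP": [70.0, 72.0, np.nan, 75.0],
    "Lactate": [np.nan] * 4,
})
imputed = impute_patient(patient, train_means, ["HR", "MAP", "Lactate"])
```

Applying the mean fallback first means the iterative imputer never sees a feature with no observed values, which it cannot model from a single patient's rows.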

In [7]:
train_df = pd.read_csv(f'{TRAIN_PATH}.csv')
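# Save the per-feature train means. to_csv returns None when given a path, so the result is not assigned.
train_df.mean(numeric_only=True).reset_index().to_csv(TRAIN_MEAN_PATH, index=False)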

In [8]:
p = DataPreparator(columns=COLS,freq_columns=ALL_LAB_ATTR)

In [9]:
train_df = pd.read_csv(f'{TRAIN_PATH}.csv')
train_df = p.prepare_data(train_df,rolling=False, freq=True)
train_df.to_csv(f'{TRAIN_PATH}_LSTM_new.csv',index=False)

In [10]:
val_df = pd.read_csv(f'{VAL_PATH}.csv')
val_df = p.prepare_data(val_df)
val_df.to_csv(f'{VAL_PATH}_LSTM_new.csv',index=False)

In [11]:
test_df = pd.read_csv(f'{TEST_PATH}.csv')
test_df = p.prepare_data(test_df)
test_df.to_csv(f'{TEST_PATH}_LSTM_new.csv',index=False)