# eICU Data Joining
---

Reading and joining all parts of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

The main goal of this notebook is to prepare a single CSV document that contains all the relevant data for training a machine learning model that predicts mortality. This involves joining tables, filtering out unneeded columns and performing imputation.

## Importing the necessary packages

In [None]:
import dask.dataframe as dd                # Dask to handle big data in dataframes
import pandas as pd                        # Pandas to load the data initially
from dask.distributed import Client        # Dask scheduler
from dask.diagnostics import ProgressBar   # Dask progress bar
import re                                  # re to do regex searches in string data
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
from tqdm import tqdm_notebook             # tqdm tracks code execution progress
import numbers                             # numbers allows to check if data is numeric
import utils                               # Contains auxiliary functions

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../..")

# Path to the CSV dataset files
data_path = 'Datasets/Thesis/eICU/uncompressed/'

In [None]:
# Activate the progress bar for all dask computations
pbar = ProgressBar()
pbar.register()

In [None]:
# Set up local cluster
# client = Client("tcp://127.0.0.1:57608")
# client

In [None]:
# Upload the utils.py file, so that the Dask cluster has access to relevant auxiliary functions
# client.upload_file('GitHub/eICU-mortality-prediction/NeuralNetwork.py')
# client.upload_file('utils.py')

## Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

## Patient data

### Read the data

In [None]:
patient_df = pd.read_csv(f'{data_path}patient.csv')
patient_df = dd.from_pandas(patient_df, npartitions=8)
patient_df.head()

In [None]:
# patient_df = dd.read_csv(f'{data_path}patient.csv')
# patient_df.head()

In [None]:
# patient_df = patient_df.repartition(npartitions=4)

In [None]:
patient_df.visualize()

In [None]:
patient_df.columns

In [None]:
patient_df.dtypes

In [None]:
patient_df.npartitions

### Remove unneeded features

In [None]:
patient_df = patient_df[['patientunitstayid', 'gender', 'age', 'ethnicity', 'apacheadmissiondx',  'admissionheight', 
                         'hospitaldischargeoffset', 'hospitaldischargelocation', 'hospitaldischargestatus', 
                         'admissionweight', 'dischargeweight', 'unitdischargeoffset']]
patient_df.head()

### Check for missing values

In [None]:
utils.dataframe_missing_values(patient_df)
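
The internals of `utils.dataframe_missing_values` are not shown in this notebook. As a rough idea of what such a check could report, here is a minimal pandas sketch; the function name `missing_values_summary` and the toy data are hypothetical, and a Dask dataframe would additionally need a `.compute()` call on the counts.

```python
import numpy as np
import pandas as pd

def missing_values_summary(df):
    """Hypothetical stand-in for `utils.dataframe_missing_values` (pandas only)."""
    # Count the missing entries of each column
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': n_missing,
        'percent_missing': 100 * n_missing / len(df),
    })
    # Show the columns with the most missing data first
    return summary.sort_values(by='percent_missing', ascending=False)

# Toy data with made-up values
toy = pd.DataFrame({'age': [45.0, np.nan, 72.0, np.nan],
                    'gender': [1, 0, 1, np.nan]})
summary = missing_values_summary(toy)
print(summary)
```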

In [None]:
patient_df.visualize()

### Make the age feature numeric

In the eICU dataset, ages above 89 years old are not specified. Instead, we only get the indication "> 89". To be able to work with the age feature numerically, we replace the "> 89" values with "90", as if the patient were 90 years old. This will not always be the patient's true age, but it should not be far off and it probably has little effect on the model's logic.
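
The effect of this replacement can be checked on a small pandas series before running it on the full table. This is only an illustrative sketch with made-up values; the real column is handled with Dask in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `age` column, which holds strings because of the "> 89" entries
age = pd.Series(['45', '> 89', '72', np.nan], dtype=object, name='age')

# Treat the censored "> 89" group as 90 years old
age = age.replace(to_replace='> 89', value='90')

# Convert to float so that missing ages can stay as NaN
age_numeric = age.astype(float)
print(age_numeric)
```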

In [None]:
patient_df.age.value_counts().head()

In [None]:
# Replace the "> 89" years old indication with 90 years
patient_df.age = patient_df.age.replace(to_replace='> 89', value=90)

In [None]:
patient_df.age.value_counts().head()

In [None]:
# Make the age feature numeric
patient_df.age = patient_df.age.astype(float)

In [None]:
patient_df.visualize()

In [None]:
# Save current dataframe in memory to avoid accumulating several operations on the dask graph
# patient_df = client.persist(patient_df)
patient_df = patient_df.persist()

In [None]:
patient_df.visualize()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.
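
As a rough illustration of the first two cases, the sketch below maps a binary feature to 0/1 and one hot encodes a low-cardinality feature with `pd.get_dummies`. The data is made up, and `unittype` only stands in for any feature with few categories; it is not one of the columns kept in this notebook.

```python
import pandas as pd

# Toy data with made-up values
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'unittype': ['MICU', 'SICU', 'MICU', 'CCU'],   # hypothetical low-cardinality feature
})

# Binary feature: map the two categories to 1 and 0 (any other value becomes NaN)
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})

# Low-cardinality feature: one indicator column per category
df = pd.get_dummies(df, columns=['unittype'], dtype=int)
print(df)
```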

#### Convert binary categorical features into numeric

In [None]:
patient_df.gender.value_counts().compute()

In [None]:
patient_df.gender = patient_df.gender.map(lambda x: 1 if x == 'Male' else 0 if x == 'Female' else np.nan)

In [None]:
patient_df.gender.value_counts().compute()

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

[TODO] Only enumerate the `apacheadmissiondx` feature after joining it with all the remaining diagnosis features
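
The implementation of `utils.enum_categorical_feature` is not part of this notebook. A minimal sketch of what enumerating a feature could look like is given below, assuming that missing values get the reserved index 0; the function `enumerate_categories`, its signature and the toy values are hypothetical.

```python
import pandas as pd

def enumerate_categories(series):
    """Hypothetical stand-in for `utils.enum_categorical_feature`.

    Returns the encoded series and the category -> integer dictionary,
    reserving 0 for missing values (an assumption of this sketch).
    """
    categories = sorted(series.dropna().unique())
    enum = {category: idx for idx, category in enumerate(categories, start=1)}
    # Categories absent from the dictionary (i.e. missing values) map to NaN, then to 0
    encoded = series.map(enum).fillna(0).astype(int)
    return encoded, enum

# Toy data with made-up values
ethnicity = pd.Series(['Caucasian', 'Hispanic', None, 'Asian', 'Caucasian'], dtype=object)
encoded, enum = enumerate_categories(ethnicity)
print(enum)
print(encoded.tolist())
```

Keeping the dictionary around, as `cat_embed_feat_enum` does in the cells that follow, makes it possible to map the integers back to the original categories later.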

In [None]:
# Update list of categorical features
cat_feat = ['ethnicity', 'apacheadmissiondx']

In [None]:
cat_feat_nunique = [patient_df[feature].nunique().compute() for feature in cat_feat]
cat_feat_nunique

In [None]:
patient_df[cat_feat].head()

In [None]:
for feature, n_unique in zip(cat_feat, cat_feat_nunique):
    # Compare strings with `!=`; `is not` checks object identity, not equality
    if n_unique > 5 and feature != 'apacheadmissiondx':
        # Prepare for embedding, i.e. enumerate categories
        patient_df[feature], cat_embed_feat_enum[feature] = utils.enum_categorical_feature(patient_df, feature)

In [None]:
patient_df[cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
patient_df[cat_feat].dtypes

In [None]:
patient_df.visualize()

In [None]:
# Save current dataframe in memory to avoid accumulating several operations on the dask graph
# patient_df = client.persist(patient_df)
patient_df = patient_df.persist()

In [None]:
patient_df.visualize()

### Create mortality label

Combine the information from discharge location and discharge status. The hospital discharge data is used instead of the unit discharge data, as it covers a longer period of the patient's stay. A feature called `deathoffset` is then saved, which holds the hospital discharge offset if the patient is dead on hospital discharge, or NaN if the patient is still alive or their status is unknown (patients with unknown status are presumed alive). Based on this, a label can be made later on, once all the tables are combined in a single dataframe, indicating whether a patient dies within the following X time, according to how far ahead we want to predict.
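
To make the later labelling step more concrete, here is a minimal sketch of how a label could be derived from a death offset and a timestamp. It assumes that both offsets are expressed in minutes since unit admission and uses a hypothetical 24 hour prediction window; the data is made up and the column `ts` is only created further down in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows: NaN in `deathoffset` means the patient is presumed alive
df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2, 2],
    'ts': [0, 3000, 0, 1500],
    'deathoffset': [4000.0, 4000.0, np.nan, np.nan],
})

# Hypothetical prediction window: does the patient die within the next 24 hours?
window = 24 * 60

# Comparisons with NaN evaluate to False, so presumed survivors get label 0
time_to_death = df['deathoffset'] - df['ts']
df['label'] = ((time_to_death >= 0) & (time_to_death <= window)).astype(int)
print(df)
```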

In [None]:
patient_df.hospitaldischargestatus.value_counts().compute()

In [None]:
patient_df.hospitaldischargelocation.value_counts().compute()

In [None]:
patient_df['deathoffset'] = patient_df.apply(lambda row: row['hospitaldischargeoffset']
                                                         if row['hospitaldischargestatus'] == 'Expired' or
                                                         row['hospitaldischargelocation'] == 'Death' else np.nan, axis=1,
                                                         meta=('deathoffset', float))

In [None]:
patient_df.head()

Remove the now unneeded hospital discharge features:

In [None]:
patient_df = patient_df.drop(['hospitaldischargeoffset', 'hospitaldischargestatus', 'hospitaldischargelocation'], axis=1)
patient_df.head(6)

In [None]:
patient_df.visualize()

In [None]:
# Save current dataframe in memory to avoid accumulating several operations on the dask graph
# patient_df = client.persist(patient_df)
patient_df = patient_df.persist()

In [None]:
patient_df.visualize()

### Create a discharge instance and the timestamp feature

Create the timestamp (`ts`) feature:

In [None]:
patient_df['ts'] = 0
patient_df.head()

In [None]:
patient_df.patientunitstayid.value_counts().compute()

Duplicate every row, so as to create a discharge event:

In [None]:
# `append` has been removed from recent pandas and Dask releases, so concatenate instead
patient_df = dd.concat([patient_df, patient_df])
patient_df.patientunitstayid.value_counts().compute()

Sort by `patientunitstayid` so as to keep the timestamps of the same patient together:

In [None]:
patient_df = patient_df.compute().sort_values(by='patientunitstayid')
patient_df.head(6)

Create a weight feature:

In [None]:
# Create feature weight and assign the initial weight that the patient has on admission
patient_df['weight'] = patient_df['admissionweight']
patient_df.head()

Set the `weight` and `ts` features to initially have the value on admission and, on the second timestamp, have the value on discharge:
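
An alternative way to reach the same layout, shown here only as a sketch on a toy pandas dataframe with made-up values, is to build the admission and discharge events as two separate dataframes and stack them, instead of duplicating rows and overwriting values afterwards. The names `patients`, `admission`, `discharge` and `events` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the patient table
patients = pd.DataFrame({
    'patientunitstayid': [10, 11],
    'admissionweight': [80.0, 65.0],
    'dischargeweight': [78.5, 66.0],
    'unitdischargeoffset': [3000, 1200],
})

# Admission event: timestamp 0 and the weight measured on admission
admission = pd.DataFrame({
    'patientunitstayid': patients['patientunitstayid'],
    'ts': 0,
    'weight': patients['admissionweight'],
})

# Discharge event: unit discharge offset and the weight measured on discharge
discharge = pd.DataFrame({
    'patientunitstayid': patients['patientunitstayid'],
    'ts': patients['unitdischargeoffset'],
    'weight': patients['dischargeweight'],
})

# Stack both events and keep each patient's rows together, ordered in time
events = (pd.concat([admission, discharge], ignore_index=True)
            .sort_values(by=['patientunitstayid', 'ts'])
            .reset_index(drop=True))
print(events)
```

Building the two events explicitly avoids depending on the order in which rows are visited.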

In [None]:
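# Flag the second row of each unit stay, which represents the discharge event.
# Unlike a global flag toggled inside `apply`, this does not rely on the order
# in which the rows are visited
discharge_row = patient_df.duplicated(subset='patientunitstayid', keep='first')

In [None]:
# Keep the admission weight on the first row and use the discharge weight on the second one
patient_df['weight'] = np.where(discharge_row, patient_df['dischargeweight'], patient_df['weight'])
patient_df.head(6)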

In [None]:
# Flag the second row of each unit stay, which represents the discharge event
discharge_row = patient_df.duplicated(subset='patientunitstayid', keep='first')

In [None]:
# Keep `ts` at 0 on the admission row and use the unit discharge offset on the discharge row
patient_df['ts'] = np.where(discharge_row, patient_df['unitdischargeoffset'], patient_df['ts'])
patient_df = dd.from_pandas(patient_df, npartitions=8)
patient_df.head(6)

Remove the remaining, now unneeded, weight and timestamp features:

In [None]:
patient_df = patient_df.drop(['admissionweight', 'dischargeweight', 'unitdischargeoffset'], axis=1)
patient_df.head(6)

In [None]:
patient_df.visualize()

In [None]:
# Save current dataframe in memory to avoid accumulating several operations on the dask graph
# patient_df = client.persist(patient_df)
patient_df = patient_df.persist()

In [None]:
patient_df.visualize()

### Normalize data

Save the dataframe before normalizing:

In [None]:
# `data_path` already ends with a slash, so no extra separator is needed
patient_df.to_parquet(f'{data_path}cleaned/patient.parquet')

In [None]:
patient_df_norm = utils.normalize_data(patient_df)
patient_df_norm
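
The notebook does not show how `utils.normalize_data` works. One common choice is z-score normalization of the numeric columns, which is sketched below on a toy pandas dataframe. The function `zscore_normalize`, its `skip_columns` parameter and the choice of skipped columns are assumptions of this sketch, not the behaviour of the actual helper.

```python
import pandas as pd

def zscore_normalize(df, skip_columns=('patientunitstayid', 'ts')):
    """Hypothetical stand-in for `utils.normalize_data`.

    Z-scores the numeric columns, leaving the identifier and timestamp
    columns untouched (an assumption of this sketch).
    """
    normalized = df.copy()
    numeric = normalized.select_dtypes(include='number').columns
    columns = [col for col in numeric if col not in skip_columns]
    mean = normalized[columns].mean()
    std = normalized[columns].std()
    # Subtract each column's mean and divide by its standard deviation
    normalized[columns] = (normalized[columns] - mean) / std
    return normalized

# Toy data with made-up values
toy = pd.DataFrame({
    'patientunitstayid': [1, 2, 3],
    'ts': [0, 0, 0],
    'age': [40.0, 50.0, 60.0],
})
toy_norm = zscore_normalize(toy)
print(toy_norm)
```

Encoded categorical features, such as `gender` or the enumerated `ethnicity`, would probably also need to be skipped, since scaling them would distort their encoding.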

### Rename columns

## Vital signs periodic data

### Join dataframes and save to a parquet file