# eICU Data Joining
---

Reading and joining all parts of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

The main goal of this notebook is to prepare a single CSV document that contains all the relevant data for training a machine learning model that predicts mortality. This involves joining tables, filtering out useless columns and performing imputation.

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")

# Path to the CSV dataset files
data_path = 'Documents/Datasets/Thesis/eICU/uncompressed/'

# Path to the code files
project_path = 'Documents/GitHub/eICU-mortality-prediction/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)

## Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

## Vital signs periodic data

### Read the data

In [None]:
vital_prdc_df = pd.read_csv(f'{data_path}original/vitalPeriodic.csv')
vital_prdc_df.head()

Get an overview of the dataframe through the `describe` method:

In [None]:
vital_prdc_df.describe().transpose()

In [None]:
vital_prdc_df.columns

In [None]:
vital_prdc_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(vital_prdc_df)
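
For readers without access to `data_utils`, the kind of summary produced by a missing values check can be approximated with plain `pandas`. This is only a sketch: the toy columns and the output layout are assumptions, not the actual behaviour of `du.search_explore.dataframe_missing_values`.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (not taken from eICU)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 2, 3, 4],
    'temperature': [36.5, np.nan, np.nan, 37.0],
    'sao2': [98.0, 97.0, np.nan, 99.0],
})

# Count and percentage of missing values per column, most incomplete first
missing_summary = pd.DataFrame({
    'missing_count': toy_df.isnull().sum(),
    'missing_percent': 100 * toy_df.isnull().mean(),
}).sort_values('missing_percent', ascending=False)

print(missing_summary)
```

Columns with a very high missing percentage are natural candidates for removal in the next step.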

### Remove unneeded features

In [None]:
patient_df = patient_df[['patientunitstayid', 'gender', 'age', 'ethnicity', 'apacheadmissiondx',  'admissionheight',
                         'hospitaldischargeoffset', 'hospitaldischargelocation', 'hospitaldischargestatus',
                         'admissionweight', 'dischargeweight', 'unitdischargeoffset']]
patient_df.head()

### Discretize categorical features

Convert binary categorical features into numeric values, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Convert binary categorical features into numeric

In [None]:
patient_df.gender.value_counts()

In [None]:
patient_df.gender = patient_df.gender.map(lambda x: 1 if x == 'Male' else 0 if x == 'Female' else np.nan)

In [None]:
patient_df.gender.value_counts()
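
As a minimal sketch of this binary encoding on a made-up series (plain `pandas` is used instead of `modin.pandas` to keep the example self-contained), any value other than the two expected categories ends up as `NaN`. A dictionary-based `map` is shown alongside the lambda as a possible shorter alternative.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (not taken from eICU)
gender = pd.Series(['Male', 'Female', 'Unknown', None, 'Female'])

# Same mapping as in the cell above: 1 for 'Male', 0 for 'Female', NaN otherwise
gender_lambda = gender.map(lambda x: 1 if x == 'Male' else 0 if x == 'Female' else np.nan)

# Alternative: values missing from the dictionary are mapped to NaN automatically
gender_dict = gender.map({'Male': 1, 'Female': 0})

print(gender_lambda.tolist())
print(gender_dict.tolist())
```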

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['ethnicity', 'apacheadmissiondx']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [patient_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for feature, n_unique in zip(new_cat_feat, cat_feat_nunique):
    if n_unique > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(feature)
        # Add feature to the list of the new ones (from the current table) that will be embedded
        new_cat_embed_feat.append(feature)

In [None]:
patient_df[new_cat_feat].head()

In [None]:
for feature in new_cat_embed_feat:
    # Prepare for embedding, i.e. enumerate categories
    patient_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(patient_df, feature)

In [None]:
patient_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
patient_df[cat_feat].dtypes
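
To make the enumeration step more concrete, here is a rough sketch of what enumerating a categorical feature can look like. The helper `enumerate_categories` is hypothetical, and reserving `0` for missing values is an assumption; the actual `du.embedding.enum_categorical_feature` may behave differently.

```python
import pandas as pd

def enumerate_categories(series):
    """Hypothetical helper: map each distinct category to an integer, keeping 0 for missing values."""
    categories = sorted(series.dropna().unique())
    # Start the numbering at 1 so that 0 stays free for missing values
    enum = {category: number for number, category in enumerate(categories, start=1)}
    encoded = series.map(enum).fillna(0).astype(int)
    return encoded, enum

# Toy data with made-up values (not taken from eICU)
ethnicity = pd.Series(['Caucasian', 'Hispanic', None, 'Asian', 'Caucasian'])
encoded, enum = enumerate_categories(ethnicity)

print(enum)
print(encoded.tolist())
```

The returned dictionary plays the same role as an entry of `cat_embed_feat_enum`: it allows the integer codes to be translated back to the original categories.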

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed (and flushed) after writing
with open('cat_embed_feat_enum_vital.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)
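
A quick way to gain confidence in the saved mapping is a round trip: write a dictionary to YAML, load it back and compare. The sketch below uses a made-up enumeration and a temporary file whose name is purely illustrative.

```python
import os
import tempfile

import yaml

# Toy enumeration with made-up categories
enum_example = {'ethnicity': {'Asian': 1, 'Caucasian': 2, 'Hispanic': 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'cat_embed_feat_enum_example.yaml')
    # Save the dictionary in block style, as in the cell above
    with open(file_path, 'w') as stream:
        yaml.dump(enum_example, stream, default_flow_style=False)
    # Load it back; safe_load only builds plain Python objects
    with open(file_path) as stream:
        loaded = yaml.safe_load(stream)

print(loaded == enum_example)
```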

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
vital_prdc_df['ts'] = vital_prdc_df['observationoffset']
vital_prdc_df = vital_prdc_df.drop('observationoffset', axis=1)
vital_prdc_df.head()

In [None]:
vital_prdc_df.patientunitstayid.value_counts()

Remove duplicate rows:

In [None]:
len(vital_prdc_df)

In [None]:
vital_prdc_df = vital_prdc_df.drop_duplicates()
vital_prdc_df.head()

In [None]:
len(vital_prdc_df)

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
# Sort first, as setting the index alone does not order the rows
vital_prdc_df = vital_prdc_df.sort_values('ts').set_index('ts')
vital_prdc_df.head(6)

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
vital_prdc_df.reset_index().head()

In [None]:
# Largest numbers of rows sharing the same unit stay ID and timestamp
vital_prdc_df.reset_index().groupby(['patientunitstayid', 'ts']).size().nlargest(5)

In [None]:
vital_prdc_df[vital_prdc_df.patientunitstayid == 3069495].head(20)

### Join rows that have the same IDs

In [None]:
vital_prdc_df = du.embedding.join_categorical_enum(vital_prdc_df, new_cat_embed_feat)
vital_prdc_df.head()

In [None]:
vital_prdc_df.dtypes

In [None]:
# Largest numbers of rows sharing the same unit stay ID and timestamp
vital_prdc_df.reset_index().groupby(['patientunitstayid', 'ts']).size().nlargest(5)

In [None]:
vital_prdc_df[vital_prdc_df.patientunitstayid == 3069495].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.
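
The idea of joining rows that share the same unit stay ID and timestamp can be illustrated with a plain `pandas` `groupby`. This is a sketch under stated assumptions, not the implementation of `du.embedding.join_categorical_enum`: the helper `join_enum`, the `;` separator, the use of the mean for continuous columns and the toy column `value` are all hypothetical.

```python
import pandas as pd

# Toy data with made-up values: two rows share the key (1, 0)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [0, 0, 5, 0],
    'culturesite': [3, 7, 3, 2],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Number of rows per (unit stay ID, timestamp) pair before joining
rows_per_key = toy_df.groupby(['patientunitstayid', 'ts']).size()

def join_enum(values):
    """Hypothetical helper: concatenate the distinct enumerated categories of a group."""
    return ';'.join(str(value) for value in sorted(set(values)))

# One row per key: categories are concatenated, continuous values are averaged
joined_df = (toy_df.groupby(['patientunitstayid', 'ts'])
                   .agg({'culturesite': join_enum, 'value': 'mean'})
                   .reset_index())

print(rows_per_key.max())
print(joined_df)
```

After the join, every (unit stay ID, timestamp) pair appears exactly once, which is the property the checks above are looking for.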

### Normalize data

In [None]:
patient_df_norm = du.data_processing.normalize_data(patient_df, embed_columns=new_cat_feat,
                                                    id_columns=['patientunitstayid', 'ts', 'deathoffset'])
patient_df_norm.head(6)

Confirm that everything is ok through the `describe` method:

In [None]:
patient_df_norm.describe().transpose()
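
As a rough illustration of what a normalization step can do, the sketch below applies z-score standardization to every column that is not an identifier. Whether `du.data_processing.normalize_data` uses z-scores is an assumption, and `zscore_normalize` is a hypothetical helper, not part of `data_utils`.

```python
import pandas as pd

def zscore_normalize(df, id_columns):
    """Hypothetical helper: standardize all non-ID columns to zero mean and unit (sample) standard deviation."""
    feature_columns = [column for column in df.columns if column not in id_columns]
    normalized = df.copy()
    features = df[feature_columns]
    normalized[feature_columns] = (features - features.mean()) / features.std()
    return normalized

# Toy data with made-up values (not taken from eICU)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [0, 5, 0],
    'noninvasivemean': [60.0, 80.0, 100.0],
})

toy_df_norm = zscore_normalize(toy_df, id_columns=['patientunitstayid', 'ts'])
print(toy_df_norm)
```

With this kind of normalization, `describe` should report a mean close to 0 and a standard deviation close to 1 for the normalized columns, while the ID columns remain unchanged.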

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
patient_df.columns = du.data_processing.clean_naming(patient_df.columns)
patient_df_norm.columns = du.data_processing.clean_naming(patient_df_norm.columns)
patient_df_norm.head()
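
The naming rules described above (lower case, underscores instead of spaces, no commas) can be sketched in a few lines. The function `clean_names` is a hypothetical stand-in for illustration; `du.data_processing.clean_naming` may handle more cases.

```python
def clean_names(columns):
    """Hypothetical helper: lower case, spaces to underscores, commas removed."""
    return [str(column).strip().lower().replace(' ', '_').replace(',', '')
            for column in columns]

# Made-up column names
cleaned = clean_names(['Patient Unit Stay ID', 'Weight, kg', 'ts'])
print(cleaned)  # ['patient_unit_stay_id', 'weight_kg', 'ts']
```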

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
patient_df.to_csv(f'{data_path}cleaned/unnormalized/patient.csv')

Save the dataframe after normalizing:

In [None]:
patient_df_norm.to_csv(f'{data_path}cleaned/normalized/patient.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
patient_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
patient_df = pd.read_csv(f'{data_path}cleaned/normalized/patient.csv')
patient_df.head()

In [None]:
vital_prdc_df = pd.read_csv(f'{data_path}cleaned/normalized/vitalPeriodic.csv')
vital_prdc_df.head()

In [None]:
eICU_df = pd.merge_asof(patient_df, vital_prdc_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
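
The behaviour of `merge_asof` with a tolerance is easier to see on a small example. The sketch below uses plain `pandas` and made-up data, with `ts` expressed in minutes; note that `merge_asof` requires both dataframes to be sorted by the `on` key.

```python
import pandas as pd

# Toy dataframes with made-up values
left_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [0, 100, 10],
    'a': [0.1, 0.2, 0.3],
})
right_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [20, 200, 50],
    'b': [1.0, 2.0, 3.0],
})

# merge_asof needs both sides sorted by the `on` key
left_df = left_df.sort_values('ts')
right_df = right_df.sort_values('ts')

# For each left row, take the nearest right row of the same unit stay,
# but only if the timestamps differ by at most 30 (minutes)
merged_df = pd.merge_asof(left_df, right_df, on='ts', by='patientunitstayid',
                          direction='nearest', tolerance=30)
print(merged_df)
```

Here only the row of unit stay 1 at `ts` 0 finds a match (the right row at `ts` 20). The other two rows have no right row within 30 minutes, so their `b` value is left as `NaN`.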

## Vital signs aperiodic data

### Read the data

In [None]:
vital_aprdc_df = pd.read_csv(f'{data_path}original/vitalAperiodic.csv')
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
vital_aprdc_df.describe().transpose()

In [None]:
vital_aprdc_df.columns

In [None]:
vital_aprdc_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(vital_aprdc_df)

### Remove unneeded features

In [None]:
vital_aprdc_df = vital_aprdc_df.drop('vitalaperiodicid', axis=1)
vital_aprdc_df.head()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
vital_aprdc_df['ts'] = vital_aprdc_df['observationoffset']
vital_aprdc_df = vital_aprdc_df.drop('observationoffset', axis=1)
vital_aprdc_df.head()

Remove duplicate rows:

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df = vital_aprdc_df.drop_duplicates()
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
# Sort first, as setting the index alone does not order the rows
vital_aprdc_df = vital_aprdc_df.sort_values('ts').set_index('ts')
vital_aprdc_df.head(6)

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
# DataFrame.nlargest requires the number of rows `n` to be given explicitly
vital_aprdc_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(5, columns='noninvasivemean')

In [None]:
vital_aprdc_df[vital_aprdc_df.patientunitstayid == 3069495].head(20)

### Join rows that have the same IDs

In [None]:
vital_aprdc_df = du.embedding.join_categorical_enum(vital_aprdc_df, new_cat_embed_feat)
vital_aprdc_df.head()

In [None]:
vital_aprdc_df.dtypes

In [None]:
# DataFrame.nlargest requires the number of rows `n` to be given explicitly
vital_aprdc_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(5, columns='noninvasivemean')

In [None]:
vital_aprdc_df[vital_aprdc_df.patientunitstayid == 3069495].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Normalize data

In [None]:
vital_aprdc_df_norm = du.data_processing.normalize_data(vital_aprdc_df,
                                                        id_columns=['patientunitstayid', 'ts'])
vital_aprdc_df_norm.head(6)

Confirm that everything is ok through the `describe` method:

In [None]:
vital_aprdc_df_norm.describe().transpose()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
vital_aprdc_df.columns = du.data_processing.clean_naming(vital_aprdc_df.columns)
vital_aprdc_df_norm.columns = du.data_processing.clean_naming(vital_aprdc_df_norm.columns)
vital_aprdc_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
vital_aprdc_df.to_csv(f'{data_path}cleaned/unnormalized/vitalAperiodic.csv')

Save the dataframe after normalizing:

In [None]:
vital_aprdc_df_norm.to_csv(f'{data_path}cleaned/normalized/vitalAperiodic.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
vital_aprdc_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
vital_aprdc_df = pd.read_csv(f'{data_path}cleaned/normalized/vitalAperiodic.csv')
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, vital_aprdc_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()