# Vital Data Preprocessing
---

Reading and preprocessing the vital signs data of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* vitalAperiodic
* vitalPeriodic

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")
# Path to the CSV dataset files
data_path = 'Datasets/Thesis/eICU/uncompressed/'
# Path to the code files
project_path = 'GitHub/eICU-mortality-prediction/'

In [None]:
# import modin.pandas as pd                  # Optimized distributed version of Pandas
import pandas as pd
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility:

In [None]:
du.set_random_seed(42)

## Vital signs aperiodic data

### Read the data

In [None]:
vital_aprdc_df = pd.read_csv(f'{data_path}original/vitalAperiodic.csv')
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
vital_aprdc_df.describe().transpose()

In [None]:
vital_aprdc_df.info()

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(vital_aprdc_df)
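
The `dataframe_missing_values` method comes from the `data_utils` package, whose internals are not shown in this notebook. As a rough plain-pandas sketch, assuming it reports the count and percentage of missing values per column, something similar can be done on a toy dataframe (the helper name `missing_values_summary` and all values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for vital_aprdc_df (values are made up)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2, 2],
    'noninvasivemean': [80.0, np.nan, 75.0, np.nan],
    'paop': [np.nan, np.nan, np.nan, 12.0],
})


def missing_values_summary(df):
    """Hypothetical helper: count and percentage of NaN values per column."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': n_missing,
        'missing_pct': 100 * n_missing / len(df),
    })
    # Show the columns with the most missing values first
    return summary.sort_values('missing_pct', ascending=False)


missing_summary = missing_values_summary(toy_df)
print(missing_summary)
```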

### Remove unneeded features

In [None]:
vital_aprdc_df.noninvasivesystolic.value_counts()

In [None]:
vital_aprdc_df.noninvasivediastolic.value_counts()

In [None]:
vital_aprdc_df.noninvasivemean.value_counts()

In [None]:
vital_aprdc_df.paop.value_counts()

In [None]:
vital_aprdc_df.cardiacoutput.value_counts()

In [None]:
vital_aprdc_df.cardiacinput.value_counts()

In [None]:
vital_aprdc_df.svr.value_counts()

In [None]:
vital_aprdc_df.svri.value_counts()

In [None]:
vital_aprdc_df.pvr.value_counts()

In [None]:
vital_aprdc_df.pvri.value_counts()

In [None]:
vital_aprdc_df = vital_aprdc_df.drop(columns=['vitalaperiodicid'])
vital_aprdc_df.head()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
vital_aprdc_df = vital_aprdc_df.rename(columns={'observationoffset': 'ts'})
vital_aprdc_df.head()

Remove duplicate rows:

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df = vital_aprdc_df.drop_duplicates()
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
vital_aprdc_df = vital_aprdc_df.sort_values('ts')
vital_aprdc_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
vital_aprdc_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='noninvasivemean', n=5).head()

In [None]:
vital_aprdc_df[(vital_aprdc_df.patientunitstayid == 625065) & (vital_aprdc_df.ts == 1515)].head(10)

We can see that there are up to 4 rows per combination of `patientunitstayid` and `ts`, so we must join them. However, this scenario differs from the other cases: since all features are numeric, we just need to average them.
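
The averaging described above can be sketched in plain pandas. This is only an illustration on made-up values; the notebook itself relies on `du.embedding.join_categorical_enum`, whose internals are not shown here and may differ:

```python
import numpy as np
import pandas as pd

# Toy data: two rows share the same (patientunitstayid, ts) pair
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [10, 10, 20, 10],
    'noninvasivesystolic': [120.0, 130.0, 110.0, 100.0],
    'noninvasivemean': [80.0, np.nan, 75.0, 70.0],
})

# Average every numeric feature within each (stay, timestamp) group;
# `mean` skips NaN by default, so a single valid value is kept as is
joined_df = (toy_df
             .groupby(['patientunitstayid', 'ts'], as_index=False)
             .mean())
print(joined_df)
```

After this step each pair of `patientunitstayid` and `ts` appears in exactly one row.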

### Join rows that have the same IDs

In [None]:
vital_aprdc_df = du.embedding.join_categorical_enum(vital_aprdc_df, cont_join_method='mean', inplace=True)
vital_aprdc_df.head()

In [None]:
vital_aprdc_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='noninvasivemean', n=5).head()

In [None]:
vital_aprdc_df[(vital_aprdc_df.patientunitstayid == 625065) & (vital_aprdc_df.ts == 1515)].head(10)

Comparing the output of the two previous cells with what we had before applying the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names so that they are in lower case, have spaces replaced by underscores, and have commas removed.

In [None]:
vital_aprdc_df.columns = du.data_processing.clean_naming(vital_aprdc_df.columns)
vital_aprdc_df.head()
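
For reference, the three cleaning rules stated above can be sketched with plain string methods. The helper `clean_column_name` and the column names are hypothetical; `du.data_processing.clean_naming` may apply additional rules not covered here:

```python
import pandas as pd


def clean_column_name(name):
    """Hypothetical helper: lower case, spaces to underscores, commas removed."""
    return name.lower().replace(' ', '_').replace(',', '')


# Empty toy dataframe with made-up, messy column names
toy_df = pd.DataFrame(columns=['patientunitstayid', 'Non Invasive Mean', 'SVR, SVRI'])
toy_df.columns = [clean_column_name(col) for col in toy_df.columns]
print(list(toy_df.columns))
```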

### Normalize data

Save the dataframe before normalizing:

In [None]:
vital_aprdc_df.to_csv(f'{data_path}cleaned/unnormalized/vitalAperiodic.csv')

In [None]:
vital_aprdc_df = du.data_processing.normalize_data(vital_aprdc_df, inplace=True)
vital_aprdc_df.head(6)
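
The `normalize_data` method comes from `data_utils` and its internals are not shown here. Assuming it applies z-score normalization to the feature columns while leaving `patientunitstayid` and `ts` untouched (an assumption, not confirmed by this notebook), the idea can be sketched on made-up values as:

```python
import pandas as pd

# Toy dataframe standing in for vital_aprdc_df (values are made up)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2, 2],
    'ts': [10, 20, 10, 20],
    'noninvasivemean': [70.0, 80.0, 90.0, 100.0],
})

# Assumption: identifier and timestamp columns are not normalized
id_columns = ['patientunitstayid', 'ts']
feature_columns = [col for col in toy_df.columns if col not in id_columns]

# Z-score: subtract the column mean and divide by the column standard deviation
means = toy_df[feature_columns].mean()
stds = toy_df[feature_columns].std()
normalized_df = toy_df.copy()
normalized_df[feature_columns] = (toy_df[feature_columns] - means) / stds
print(normalized_df)
```

Keeping `means` and `stds` around makes it possible to map normalized values back to their original units later.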

### Save the dataframe

Save the dataframe after normalizing:

In [None]:
vital_aprdc_df.to_csv(f'{data_path}cleaned/normalized/vitalAperiodic.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
vital_aprdc_df.describe().transpose()

In [None]:
vital_aprdc_df.info()

## Vital signs periodic data

### Read the data

In [None]:
vital_prdc_df = pd.read_csv(f'{data_path}original/vitalPeriodic.csv')
vital_prdc_df.head()

In [None]:
len(vital_prdc_df)

In [None]:
vital_prdc_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
vital_prdc_df.describe().transpose()

In [None]:
vital_prdc_df.info()

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(vital_prdc_df)

### Remove unneeded features

In [None]:
vital_prdc_df.temperature.value_counts()

In [None]:
vital_prdc_df.sao2.value_counts()

In [None]:
vital_prdc_df.heartrate.value_counts()

In [None]:
vital_prdc_df.respiration.value_counts()

In [None]:
vital_prdc_df.cvp.value_counts()

In [None]:
vital_prdc_df.systemicsystolic.value_counts()

In [None]:
vital_prdc_df.systemicdiastolic.value_counts()

In [None]:
vital_prdc_df.systemicmean.value_counts()

In [None]:
vital_prdc_df.pasystolic.value_counts()

In [None]:
vital_prdc_df.padiastolic.value_counts()

In [None]:
vital_prdc_df.pamean.value_counts()

In [None]:
vital_prdc_df.st1.value_counts()

In [None]:
vital_prdc_df.st2.value_counts()

In [None]:
vital_prdc_df.st3.value_counts()

In [None]:
vital_prdc_df.icp.value_counts()

In [None]:
vital_prdc_df = vital_prdc_df.drop(columns=['vitalperiodicid'])
vital_prdc_df.head()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
vital_prdc_df = vital_prdc_df.rename(columns={'observationoffset': 'ts'})
vital_prdc_df.head()

Remove duplicate rows:

In [None]:
len(vital_prdc_df)

In [None]:
vital_prdc_df = vital_prdc_df.drop_duplicates()
vital_prdc_df.head()

In [None]:
len(vital_prdc_df)

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
vital_prdc_df = vital_prdc_df.sort_values('ts')
vital_prdc_df.head()
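
Sorting by `ts` matters for some merge strategies. As one illustration, and only under the assumption that the later merge is an as-of merge (this notebook does not show it), `pd.merge_asof` requires both dataframes to be sorted by the `on` key. The toy values below are made up:

```python
import pandas as pd

# Toy aperiodic and periodic measurements, each sorted by `ts`
toy_aprdc_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [10, 30, 20],
    'noninvasivemean': [80.0, 85.0, 90.0],
}).sort_values('ts')
toy_prdc_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2, 2],
    'ts': [5, 25, 15, 40],
    'heartrate': [70.0, 75.0, 60.0, 65.0],
}).sort_values('ts')

# For each aperiodic row, take the most recent periodic row
# of the same unit stay with a timestamp not later than its own
merged_df = pd.merge_asof(toy_aprdc_df, toy_prdc_df,
                          on='ts', by='patientunitstayid',
                          direction='backward')
print(merged_df)
```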

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
vital_prdc_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='heartrate', n=5).head()

In [None]:
vital_prdc_df[(vital_prdc_df.patientunitstayid == 625065) & (vital_prdc_df.ts == 1515)].head(10)

We can see that there are up to 4 rows per combination of `patientunitstayid` and `ts`, so we must join them. However, this scenario differs from the other cases: since all features are numeric, we just need to average them.

### Join rows that have the same IDs

In [None]:
vital_prdc_df = du.embedding.join_categorical_enum(vital_prdc_df, cont_join_method='mean', inplace=True)
vital_prdc_df.head()

In [None]:
vital_prdc_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='heartrate', n=5).head()

In [None]:
vital_prdc_df[(vital_prdc_df.patientunitstayid == 625065) & (vital_prdc_df.ts == 1515)].head(10)

Comparing the output of the two previous cells with what we had before applying the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names so that they are in lower case, have spaces replaced by underscores, and have commas removed.

In [None]:
vital_prdc_df.columns = du.data_processing.clean_naming(vital_prdc_df.columns)
vital_prdc_df.head()

### Normalize data

Save the dataframe before normalizing:

In [None]:
vital_prdc_df.to_csv(f'{data_path}cleaned/unnormalized/vitalPeriodic.csv')

In [None]:
vital_prdc_df = du.data_processing.normalize_data(vital_prdc_df, inplace=True)
vital_prdc_df.head(6)

### Save the dataframe

Save the dataframe after normalizing:

In [None]:
vital_prdc_df.to_csv(f'{data_path}cleaned/normalized/vitalPeriodic.csv')
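
Note that `to_csv` writes the dataframe index as an unnamed first column by default. A minimal sketch of reading such a file back, using an in-memory buffer and made-up values instead of the real CSV file:

```python
import io

import pandas as pd

# Toy dataframe standing in for vital_prdc_df (values are made up)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 2],
    'ts': [10, 20],
    'heartrate': [70.0, 65.0],
})

# Write to an in-memory buffer with the default setting (index=True)
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of adding an "Unnamed: 0" column
reloaded_df = pd.read_csv(buffer, index_col=0)
print(reloaded_df)
```

Alternatively, passing `index=False` to `to_csv` skips writing the index altogether.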

Confirm that everything is ok through the `describe` method:

In [None]:
vital_prdc_df.describe().transpose()

In [None]:
vital_prdc_df.info()