# Respiratory Data Preprocessing
---

Reading and preprocessing the respiratory data of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* respiratoryCare
* respiratoryCharting

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")

# Path to the CSV dataset files
data_path = 'Datasets/Thesis/eICU/uncompressed/'

# Path to the code files
project_path = 'GitHub/eICU-mortality-prediction/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)

## Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

## Respiratory care data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
resp_care_df = pd.read_csv(f'{data_path}original/respiratoryCare.csv', dtype={'airwayposition': 'object',
                                                                              'airwaytype': 'object',
                                                                              'airwaysize': 'object',
                                                                              'apneaparms': 'object',
                                                                              'setapneafio2': 'object',
                                                                              'setapneaie': 'object',
                                                                              'setapneainsptime': 'object',
                                                                              'setapneainterval': 'object',
                                                                              'setapneaippeephigh': 'object',
                                                                              'setapneapeakflow': 'object',
                                                                              'setapneatv': 'object'})
resp_care_df.head()

In [None]:
len(resp_care_df)

In [None]:
resp_care_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
resp_care_df.describe().transpose()

In [None]:
resp_care_df.columns

In [None]:
resp_care_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(resp_care_df)

### Remove unneeded features

For the respiratoryCare table, I'm not going to use any of the several features that describe the ventilator and its settings. Besides not appearing to be very relevant for the patient, they have a lot of missing values (>67%). Instead, I'm going to set a ventilation label (active between the start and the end of ventilation) and a prior ventilation label.

In [None]:
resp_care_df = resp_care_df[['patientunitstayid', 'ventstartoffset',
                             'ventendoffset', 'priorventstartoffset']]
resp_care_df.head()
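
As a rough illustration of the missing-values criterion mentioned above, the percentage of missing values per column can be computed in plain pandas. This is only a sketch on made-up toy data; it is not the `data_utils` implementation, and `toy_df`, `threshold` and `cols_to_drop` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a slice of respiratoryCare (values are made up)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 2, 3, 4],
    'ventstartoffset': [10, 20, 30, 40],
    'airwaysize': [np.nan, '7.5', np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)

# Hypothetical cutoff, chosen only to mirror the ">67%" figure in the text
threshold = 67
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

print(missing_pct)
print(cols_to_drop)
```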

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
resp_care_df = resp_care_df.rename(columns={'ventstartoffset': 'ts'})
resp_care_df.head()

Remove duplicate rows:

In [None]:
len(resp_care_df)

In [None]:
resp_care_df = resp_care_df.drop_duplicates()
resp_care_df.head()

In [None]:
len(resp_care_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
resp_care_df = resp_care_df.sort_values('ts')
resp_care_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
resp_care_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='ventendoffset', n=5).head()

In [None]:
resp_care_df[resp_care_df.patientunitstayid == 1113084].head(20)

We can see that there are up to 5283 duplicate rows per set of `patientunitstayid` and `ts`. As such, we must join them.
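
A related way to spot repeated keys, sketched here on made-up toy data, is to count rows per key pair with `size()`. Unlike `count()`, it also counts rows whose remaining columns are missing. The names `toy_df`, `rows_per_key` and `repeated_keys` are illustrative only.

```python
import pandas as pd

# Toy data: the pair (1, 5) appears twice with different end offsets
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [5, 5, 9, 5],
    'ventendoffset': [50, 60, 90, 70],
})

# Number of rows for each (patientunitstayid, ts) pair
rows_per_key = toy_df.groupby(['patientunitstayid', 'ts']).size()

# Keep only the pairs that show up more than once
repeated_keys = rows_per_key[rows_per_key > 1]
print(repeated_keys)
```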

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to apply a groupby function, selecting the minimum value for each of the offset features, as the larger values don't make sense (in the `priorventstartoffset`).

In [None]:
# Each comparison needs its own parentheses, as `&` binds more tightly than `!=`
((resp_care_df.ts > resp_care_df.ventendoffset) & (resp_care_df.ventendoffset != 0)).value_counts()

There are no cases where the ventilation start timestamp is later than the ventilation end timestamp.

In [None]:
resp_care_df = du.embedding.join_categorical_enum(resp_care_df, cont_join_method='min', inplace=True)
resp_care_df.head()
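
To make the effect of joining with the minimum more concrete, here is a minimal plain-pandas sketch on made-up toy data. It only mimics the idea of keeping the smallest value per ID and timestamp; it is not how `join_categorical_enum` is implemented, and `toy_df` and `joined_df` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data: two conflicting rows for the same ID and timestamp
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [5, 5, 7],
    'ventendoffset': [60.0, 50.0, 70.0],
    'priorventstartoffset': [np.nan, -20.0, -3.0],
})

# One row per (patientunitstayid, ts); min() skips NaN by default
joined_df = (toy_df.groupby(['patientunitstayid', 'ts'])
                   .min()
                   .reset_index())
print(joined_df)
```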

In [None]:
resp_care_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='ventendoffset', n=5).head()

In [None]:
resp_care_df[resp_care_df.patientunitstayid == 1113084].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
resp_care_df, pd = du.utils.convert_dataframe(resp_care_df, to='pandas')

In [None]:
type(resp_care_df)

Only keep the first instance of each patient, as we're only keeping track of when they are on ventilation:

In [None]:
resp_care_df = resp_care_df.groupby('patientunitstayid').first().sort_values('ts').reset_index()
resp_care_df.head(20)

### Create prior ventilation label

Make a feature `priorvent` that indicates if the patient has been on ventilation before.

Create the prior ventilation column:

In [None]:
resp_care_df['priorvent'] = (resp_care_df.priorventstartoffset < resp_care_df.ts).astype(int)
resp_care_df.head()
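
One detail worth keeping in mind for this label: a comparison against a missing value evaluates to `False`, so stays without a prior ventilation offset end up with `priorvent = 0`. A small sketch on made-up toy data (the name `toy_df` is illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: a prior start before ts, a missing one, and one after ts
toy_df = pd.DataFrame({
    'ts': [100, 100, 100],
    'priorventstartoffset': [-50.0, np.nan, 200.0],
})

# NaN < ts is False, so the missing offset becomes 0 after the cast
toy_df['priorvent'] = (toy_df.priorventstartoffset < toy_df.ts).astype(int)
print(toy_df)
```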

Remove the now unneeded `priorventstartoffset` column:

In [None]:
resp_care_df = resp_care_df.drop('priorventstartoffset', axis=1)
resp_care_df.head()

### Create current ventilation label

Make a feature `onvent` that indicates if the patient is currently on ventilation.

Create an `onvent` feature:

In [None]:
resp_care_df['onvent'] = 1
resp_care_df.head(6)

Duplicate every row, so as to create a ventilation end event:

In [None]:
new_df = resp_care_df.copy()
new_df.head()

Set the new dataframe's rows to have the ventilation stop timestamp, indicating that ventilation use ended:

In [None]:
new_df.ts = new_df.ventendoffset
new_df.onvent = 0
new_df.head()

Join the new rows to the remaining dataframe:

In [None]:
# Use `concat`, as `DataFrame.append` was removed in pandas 2.0
resp_care_df = pd.concat([resp_care_df, new_df])
resp_care_df.head()

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
resp_care_df = resp_care_df.sort_values('ts')
resp_care_df.head()

Remove the now unneeded ventilation end column:

In [None]:
resp_care_df = resp_care_df.drop('ventendoffset', axis=1)
resp_care_df.head(6)

In [None]:
resp_care_df.tail(6)

In [None]:
resp_care_df[resp_care_df.patientunitstayid == 1557538]
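
The whole start/end event construction from this section can be condensed into a short plain-pandas sketch. The data is made up and the names `vent_df`, `start_events`, `end_events` and `events_df` are hypothetical; it illustrates the idea rather than replacing the cells above.

```python
import pandas as pd

# Toy data: one ventilation period per unit stay
vent_df = pd.DataFrame({
    'patientunitstayid': [1, 2],
    'ts': [10, 30],
    'ventendoffset': [100, 60],
})

# Ventilation starts at `ts`...
start_events = vent_df.assign(onvent=1)
# ...and stops at `ventendoffset`, which becomes the timestamp of the end event
end_events = vent_df.assign(ts=vent_df.ventendoffset, onvent=0)

# Stack both event types, drop the now redundant column and sort by time
events_df = (pd.concat([start_events, end_events], ignore_index=True)
               .drop(columns='ventendoffset')
               .sort_values('ts')
               .reset_index(drop=True))
print(events_df)
```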

Reconvert dataframe to Modin:

In [None]:
resp_care_df, pd = du.utils.convert_dataframe(resp_care_df, to='modin')

In [None]:
type(resp_care_df)

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
resp_care_df.columns = du.data_processing.clean_naming(resp_care_df.columns)
resp_care_df.head()
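
For reference, the three rules described above (lower case, underscores instead of spaces, no commas) can be sketched as a small helper. This is an assumption about the intended behaviour, not the actual `clean_naming` code; `clean_column_name` and the column names below are made up.

```python
import pandas as pd

def clean_column_name(name):
    # Lower case, spaces to underscores, commas removed
    return name.lower().replace(' ', '_').replace(',', '')

# Toy dataframe with deliberately messy column names
toy_df = pd.DataFrame(columns=['Patient Unit Stay ID', 'On Vent, Current'])
toy_df.columns = [clean_column_name(col) for col in toy_df.columns]
print(list(toy_df.columns))
```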

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
resp_care_df.to_csv(f'{data_path}cleaned/unnormalized/respiratoryCare.csv')

Save the dataframe after normalizing:

In [None]:
resp_care_df.to_csv(f'{data_path}cleaned/normalized/respiratoryCare.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
resp_care_df.describe().transpose()

## Respiratory charting data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
resp_chart_df = pd.read_csv(f'{data_path}original/respiratoryCharting.csv')
resp_chart_df.head()

In [None]:
len(resp_chart_df)

In [None]:
resp_chart_df.patientunitstayid.nunique()

Only 13001 unit stays have data in this table, so it might not be useful to include it.

Get an overview of the dataframe through the `describe` method:

In [None]:
resp_chart_df.describe().transpose()

In [None]:
resp_chart_df.columns

In [None]:
resp_chart_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(resp_chart_df)

### Remove unneeded features

In [None]:
resp_chart_df.celllabel.value_counts()

In [None]:
resp_chart_df.cellattribute.value_counts()

In [None]:
resp_chart_df.cellattributevalue.value_counts()

In [None]:
resp_chart_df.cellattributepath.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Intervention'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Neurologic'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Pupils'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Edema'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Secretions'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Cough'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Neurologic'].cellattribute.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Pupils'].cellattribute.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Secretions'].cellattribute.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Cough'].cellattribute.value_counts()

Besides the usual removal of the row identifier, `nurseassessid`, and the timestamp of when the data was entered, `nurseassessentryoffset`, I'm also removing `cellattributepath` and `cellattribute`, which are redundant with `celllabel`. Regarding data categories, I'm only keeping `Neurologic`, `Pupils`, `Secretions` and `Cough`, as the remaining ones either don't add much value, have too little data or are redundant with data from other tables.

In [None]:
resp_chart_df = resp_chart_df.drop(['nurseassessid', 'nurseassessentryoffset',
                                      'cellattributepath', 'cellattribute'], axis=1)
resp_chart_df.head()

In [None]:
categories_to_keep = ['Neurologic', 'Pupils', 'Secretions', 'Cough']

In [None]:
resp_chart_df.celllabel.isin(categories_to_keep).head()

In [None]:
resp_chart_df = resp_chart_df[resp_chart_df.celllabel.isin(categories_to_keep)]
resp_chart_df.head()

### Convert categories to features

Make the `celllabel` and `cellattributevalue` columns of type categorical:

In [None]:
# `categorize` is a Dask method; with a Pandas-style API, cast through `astype`
resp_chart_df = resp_chart_df.astype({'celllabel': 'category', 'cellattributevalue': 'category'})

In [None]:
resp_chart_df.head()

Transform the `celllabel` categories and `cellattributevalue` values into separate features:
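
A minimal sketch of this kind of transformation in plain pandas, on made-up toy data: each category in `celllabel` gets its own column, filled with the matching `cellattributevalue` and left missing elsewhere. This is an illustration under those assumptions and not necessarily the method used in this notebook; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Toy data in the long format (one category/value pair per row)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'celllabel': ['Cough', 'Pupils', 'Cough'],
    'cellattributevalue': ['Weak', 'Reactive', 'Strong'],
})

# Create one column per category; `where` keeps the value only on
# the rows that belong to that category and leaves NaN elsewhere
for category in toy_df['celllabel'].unique():
    toy_df[category] = toy_df['cellattributevalue'].where(toy_df['celllabel'] == category)

print(toy_df)
```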

Now we have the categories separated into their own features, as desired.

Remove the old `celllabel` and `cellattributevalue` columns:

In [None]:
resp_chart_df = resp_chart_df.drop(['celllabel', 'cellattributevalue'], axis=1)
resp_chart_df.head()

In [None]:
resp_chart_df['Neurologic'].value_counts()

In [None]:
resp_chart_df['Pupils'].value_counts()

In [None]:
resp_chart_df['Secretions'].value_counts()

In [None]:
resp_chart_df['Cough'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Pupils', 'Neurologic', 'Secretions', 'Cough']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [resp_chart_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for feature, n_unique in zip(new_cat_feat, cat_feat_nunique):
    if n_unique > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(feature)
        new_cat_embed_feat.append(feature)

In [None]:
resp_chart_df[new_cat_feat].head()

In [None]:
for feature in new_cat_embed_feat:
    # Prepare for embedding, i.e. enumerate categories
    resp_chart_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(resp_chart_df, feature, nan_value=0,
                                                                                                 forbidden_digit=0)

In [None]:
resp_chart_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
resp_chart_df[new_cat_feat].dtypes
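
To give an idea of what enumerating a categorical feature looks like, here is a minimal plain-pandas sketch where each category gets a positive integer and 0 is reserved for missing values. It does not reproduce `enum_categorical_feature` (in particular, the `forbidden_digit` option is not handled), and the data and the names `toy_series`, `encoded` and `enum_mapping` are made up.

```python
import numpy as np
import pandas as pd

# Toy categorical feature with one missing value
toy_series = pd.Series(['Weak', 'Strong', np.nan, 'Weak', 'Absent'])

# factorize numbers categories by order of appearance and marks NaN with -1
codes, categories = pd.factorize(toy_series)

# Shifting by one reserves 0 for missing values
encoded = pd.Series(codes + 1, index=toy_series.index)

# Mapping from the original strings to their numeric encoding
enum_mapping = {'nan': 0, **{cat: i + 1 for i, cat in enumerate(categories)}}

print(encoded.tolist())
print(enum_mapping)
```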

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_resp.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)
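
When the mapping is needed again later, it can be read back with `yaml.safe_load`. The sketch below does a round trip through a temporary folder with a made-up dictionary; `toy_enum` and the file name are hypothetical and unrelated to the real paths used in this notebook.

```python
import os
import tempfile
import yaml

# Made-up enumeration mapping, in the same nested shape as cat_embed_feat_enum
toy_enum = {'Cough': {'Absent': 1, 'Weak': 2, 'Strong': 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy_enum.yaml')
    # Save the mapping
    with open(file_path, 'w') as stream:
        yaml.dump(toy_enum, stream, default_flow_style=False)
    # Load it back as plain Python dictionaries
    with open(file_path) as stream:
        loaded_enum = yaml.safe_load(stream)

print(loaded_enum)
```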

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
resp_chart_df = resp_chart_df.rename(columns={'nurseassessoffset': 'ts'})
resp_chart_df.head()

Remove duplicate rows:

In [None]:
len(resp_chart_df)

In [None]:
resp_chart_df = resp_chart_df.drop_duplicates()
resp_chart_df.head()

In [None]:
len(resp_chart_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
resp_chart_df = resp_chart_df.sort_values('ts')
resp_chart_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
resp_chart_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough', n=5).head()

In [None]:
resp_chart_df[resp_chart_df.patientunitstayid == 2553254].head(10)

We can see that there are up to 80 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

Convert dataframe to Pandas, as the groupby operation in `join_categorical_enum` isn't working properly with Modin:

In [None]:
resp_chart_df, pd = du.utils.convert_dataframe(resp_chart_df, to='pandas')

In [None]:
type(resp_chart_df)

In [None]:
resp_chart_df = du.embedding.join_categorical_enum(resp_chart_df, new_cat_embed_feat, inplace=True)
resp_chart_df.head()

Reconvert dataframe to Modin:

In [None]:
resp_chart_df, pd = du.utils.convert_dataframe(resp_chart_df, to='modin')

In [None]:
type(resp_chart_df)

In [None]:
resp_chart_df.dtypes

In [None]:
resp_chart_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough', n=5).head()

In [None]:
resp_chart_df[resp_chart_df.patientunitstayid == 2553254].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
resp_chart_df.columns = du.data_processing.clean_naming(resp_chart_df.columns)
resp_chart_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
resp_chart_df.to_csv(f'{data_path}cleaned/unnormalized/respiratoryCharting.csv')

Save the dataframe after normalizing:

In [None]:
resp_chart_df.to_csv(f'{data_path}cleaned/normalized/respiratoryCharting.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
resp_chart_df.describe().transpose()