# Patient Data Preprocessing
---

Reading and preprocessing patient data from the eICU dataset, released by MIT, which contains data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* patient
* note

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")

# Path to the CSV dataset files
data_path = 'Datasets/Thesis/eICU/uncompressed/'

# Path to the code files
project_path = 'GitHub/eICU-mortality-prediction/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)

## Patient data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
patient_df = pd.read_csv(f'{data_path}original/patient.csv')
patient_df.head()

In [None]:
len(patient_df)

In [None]:
patient_df.patientunitstayid.nunique()

In [None]:
patient_df.patientunitstayid.value_counts()

Get an overview of the dataframe through the `describe` method:

In [None]:
patient_df.describe().transpose()

In [None]:
patient_df.columns

In [None]:
patient_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(patient_df)
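
For readers without access to `data_utils`, a missing values overview can be approximated in plain pandas. This is only a minimal sketch on a made-up toy dataframe; it is not the implementation of `dataframe_missing_values`, whose exact output format may differ.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values, not eICU data)
toy_df = pd.DataFrame({
    'age': [45, np.nan, 72, 60],
    'ethnicity': ['Caucasian', None, None, 'Hispanic'],
})

# Percentage of missing values per column, most incomplete columns first
missing_pct = toy_df.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```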

### Remove unneeded features

Besides removing unneeded hospital and time information, I'm also removing the admission diagnosis (`apacheadmissiondx`) as it doesn't follow the same structure as the remaining diagnosis data (which is categorized in increasingly specific categories, separated by "|").

In [None]:
patient_df = patient_df[['patientunitstayid', 'gender', 'age', 'ethnicity',  'admissionheight',
                         'hospitaldischargeoffset', 'hospitaldischargelocation', 'hospitaldischargestatus',
                         'admissionweight']]
patient_df.head()

### Make the age feature numeric

In the eICU dataset, ages above 89 years old are not specified. Instead, we just get the indication "> 89". To be able to work with the age feature numerically, we'll replace the "> 89" values with "90", as if the patient were 90 years old. That won't always be the true age, but it shouldn't be far off and it probably doesn't affect the model's logic much.

In [None]:
patient_df.age.value_counts().head()

In [None]:
# Replace the "> 89" years old indication with 90 years
patient_df.age = patient_df.age.replace(to_replace='> 89', value=90)

In [None]:
patient_df.age.value_counts().head()

In [None]:
# Make the age feature numeric
patient_df.age = patient_df.age.astype(float)
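
The age conversion above can be checked in isolation on a toy series. The sketch below uses plain pandas rather than Modin, with made-up values, and relies on `pd.to_numeric` with `errors='coerce'` so that any other unexpected string would become NaN instead of raising an error (an extra safety net that the cells above don't use).

```python
import pandas as pd

# Toy ages as they might be read from a CSV: strings, with a capped "> 89" value
ages = pd.Series(['45', '> 89', '72', None])

# Replace the capped indication with 90, then convert everything to numbers
ages = ages.replace('> 89', '90')
ages_numeric = pd.to_numeric(ages, errors='coerce')
print(ages_numeric)
```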

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Convert binary categorical features into numeric

In [None]:
patient_df.gender.value_counts()

In [None]:
patient_df.gender = patient_df.gender.map(lambda x: 1 if x == 'Male' else 0 if x == 'Female' else np.nan)

In [None]:
patient_df.gender.value_counts()
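
As an alternative to the lambda, `map` also accepts a dictionary, in which case any value missing from the dictionary becomes NaN. A small sketch in plain pandas, with made-up values:

```python
import pandas as pd

# Toy gender column, including an unexpected category and a missing value
gender = pd.Series(['Male', 'Female', 'Unknown', None])

# Values that are not keys of the dictionary are mapped to NaN
gender_num = gender.map({'Male': 1, 'Female': 0})
print(gender_num)
```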

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.


Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['ethnicity']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [patient_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        # Add feature to the list of the new ones (from the current table) that will be embedded
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
patient_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    patient_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(patient_df, feature, nan_value=0,
                                                                                              forbidden_digit=0)

In [None]:
patient_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
patient_df[cat_feat].dtypes
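
To illustrate what enumerating categories for an embedding layer means, here is a minimal sketch based on `pd.factorize`. It is a stand-in written for illustration, not the implementation of `du.embedding.enum_categorical_feature`; the function name `enumerate_categories` is hypothetical and the only behaviour assumed is that missing values get the code 0.

```python
import pandas as pd

def enumerate_categories(series):
    """Hypothetical helper: map each category to an integer, reserving 0 for NaN."""
    # factorize returns -1 for missing values; shifting by 1 maps them to 0
    codes, uniques = pd.factorize(series)
    encoded = pd.Series(codes + 1, index=series.index, name=series.name)
    mapping = {category: idx + 1 for idx, category in enumerate(uniques)}
    return encoded, mapping

ethnicity = pd.Series(['Caucasian', 'Hispanic', None, 'Caucasian'], name='ethnicity')
encoded, mapping = enumerate_categories(ethnicity)
print(encoded.tolist(), mapping)
```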

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# data_path already ends with "/"; the context manager closes the file after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_patient.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)
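
A quick way to check that the saved mapping can be recovered is a YAML round trip. The sketch below works on a made-up dictionary and in memory, instead of on the real file. It assumes the dictionary only holds plain Python strings and integers; NumPy integer values would first need to be converted with `int()`, as `yaml.safe_dump` doesn't know how to represent them.

```python
import yaml

# Toy enumeration dictionary (made-up values)
enum_map = {'ethnicity': {'Caucasian': 1, 'Hispanic': 2}}

# Dump to a YAML string and load it back
yaml_text = yaml.safe_dump(enum_map, default_flow_style=False)
restored = yaml.safe_load(yaml_text)
print(yaml_text)
```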

### Create mortality label

Combine info from discharge location and discharge status, using the hospital discharge data instead of the unit's, as it has a longer perspective on the patient's status. I then save a feature called `death_ts`, which holds the hospital discharge offset if the patient is dead on hospital discharge, or NaN if the patient is still alive or the status is unknown (presumed alive if unknown). Based on this, a label can be made later on, when all the tables are combined in a single dataframe, indicating if a patient dies within the following X time, according to how far ahead we want to predict.

In [None]:
patient_df.hospitaldischargestatus.value_counts()

In [None]:
patient_df.hospitaldischargelocation.value_counts()

In [None]:
tmp_col = patient_df.apply(lambda df: df['hospitaldischargeoffset']
                                      if df['hospitaldischargestatus'] == 'Expired'
                                      else np.nan, axis=1)
tmp_col

In [None]:
tmp_col.index.value_counts()

In [None]:
tmp_col[0][100783]

In [None]:
(~tmp_col.isnull()).sum()

In [None]:
len(patient_df)

In [None]:
# Remove the rows with duplicated index values (the duplicate NaNs)
tmp_col = tmp_col.loc[~tmp_col.index.duplicated(keep='first')]

In [None]:
tmp_col.index.value_counts()

In [None]:
tmp_col[0][100783]

In [None]:
(~tmp_col.isnull()).sum()

In [None]:
len(patient_df)

Something's wrong with this `apply` line, as it creates duplicate rows and fake NaNs (NaNs in rows where there should be a value). Moving on to a more complex solution, where we can filter out the fake NaNs and remove the duplicate NaNs.

In [None]:
def get_death_ts(df):
    if df['hospitaldischargestatus'] == 'Expired':
        df['death_ts'] = df['hospitaldischargeoffset']
    else:
        df['death_ts'] = np.nan
    return df

In [None]:
tmp_df = patient_df.copy()
tmp_df['death_ts'] = tmp_df['hospitaldischargeoffset']

In [None]:
tmp_df = tmp_df.apply(get_death_ts, axis=1, result_type='broadcast')
tmp_df.head()

In [None]:
tmp_df['death_ts'].index.value_counts()

In [None]:
tmp_df['death_ts'][100783]

In [None]:
(~tmp_df['death_ts'].isnull()).sum()

In [None]:
tmp_col = tmp_df.groupby('patientunitstayid').death_ts.max()
tmp_col

In [None]:
tmp_col.index.value_counts()

In [None]:
(~tmp_col.isnull()).sum()

In [None]:
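# tmp_col is indexed by unit stay ID, so align on patientunitstayid instead of the row index
patient_df['death_ts'] = patient_df['patientunitstayid'].map(tmp_col)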
patient_df.head()
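
For comparison, the same `death_ts` column can be obtained without a row-wise `apply`, through `Series.where`, which keeps a value where the condition holds and puts NaN elsewhere. This is a sketch in plain pandas on a made-up dataframe; I haven't checked whether it avoids the issues seen above when running on Modin.

```python
import pandas as pd

# Toy patient table (made-up values)
toy = pd.DataFrame({
    'patientunitstayid': [1, 2, 3],
    'hospitaldischargeoffset': [5000, 1200, 300],
    'hospitaldischargestatus': ['Alive', 'Expired', None],
})

# Keep the discharge offset only for patients that expired; NaN otherwise
toy['death_ts'] = toy['hospitaldischargeoffset'].where(
    toy['hospitaldischargestatus'] == 'Expired')
print(toy)
```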

Remove the now unneeded hospital discharge features:

In [None]:
patient_df = patient_df.drop(['hospitaldischargeoffset', 'hospitaldischargestatus', 'hospitaldischargelocation'], axis=1)
patient_df.head(6)
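
The label mentioned earlier, indicating if a patient dies within the following X time, could later be derived from `ts` and `death_ts` roughly as follows. This is only a sketch: the function name `mortality_label` and the 24 hour window are assumptions made for illustration, with offsets taken to be in minutes.

```python
import numpy as np
import pandas as pd

def mortality_label(ts, death_ts, window):
    """Hypothetical helper: 1 if death happens within `window` after `ts`, else 0."""
    time_to_death = death_ts - ts
    # Comparisons with NaN (patients presumed alive) evaluate to False, i.e. label 0
    return ((time_to_death >= 0) & (time_to_death <= window)).astype(int)

# Toy rows (made-up values): alive, dies in 1200 min, dies in 4000 min
toy = pd.DataFrame({'ts': [0, 0, 1000], 'death_ts': [np.nan, 1200.0, 5000.0]})
toy['label'] = mortality_label(toy['ts'], toy['death_ts'], window=1440)
print(toy)
```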

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
patient_df['ts'] = 0
patient_df.head()

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
# Index setting is failing on modin, so we're just going to skip this part for now
# patient_df = patient_df.set_index('ts')
# patient_df.head()

### Normalize data

Save the dataframe before normalizing:

In [None]:
patient_df.to_csv(f'{data_path}cleaned/unnormalized/patient.csv')

In [None]:
new_cat_feat

In [None]:
patient_df_norm = du.data_processing.normalize_data(patient_df, categ_columns=new_cat_feat,
                                                    id_columns=['patientunitstayid', 'ts', 'death_ts'],
                                                    inplace=True)
patient_df_norm.head(6)

In [None]:
patient_df_norm.to_csv(f'{data_path}cleaned/normalized/patient.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
patient_df_norm.describe().transpose()
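
As a reference for what the check above should show, here is a minimal normalization sketch, assuming z-score normalization (subtract the mean, divide by the standard deviation) and skipping ID and categorical columns. That is an assumption: `du.data_processing.normalize_data` may use a different method, and `zscore_normalize` is a hypothetical name.

```python
import pandas as pd

def zscore_normalize(df, skip_columns):
    """Hypothetical helper: z-score every column except those in skip_columns."""
    out = df.copy()
    cols = [c for c in out.columns if c not in skip_columns]
    out[cols] = (out[cols] - out[cols].mean()) / out[cols].std()
    return out

# Toy dataframe (made-up values)
toy = pd.DataFrame({
    'patientunitstayid': [1, 2, 3],
    'ts': [0, 0, 0],
    'age': [40.0, 60.0, 80.0],
    'admissionweight': [70.0, 80.0, 90.0],
})
toy_norm = zscore_normalize(toy, skip_columns=['patientunitstayid', 'ts'])
print(toy_norm.describe().transpose())
```

After a z-score normalization, `describe` should report a mean close to 0 and a standard deviation close to 1 for the normalized columns.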

In [None]:
# [TODO] Remove the rows with ts = 0 if there are no matching rows in other tables

## Notes data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
note_df = pd.read_csv(f'{data_path}original/note.csv')
note_df.head()

In [None]:
len(note_df)

In [None]:
note_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
note_df.describe().transpose()

In [None]:
note_df.columns

In [None]:
note_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(note_df)

### Remove unneeded features

In [None]:
note_df.notetype.value_counts().head(20)

In [None]:
note_df.notepath.value_counts().head(40)

In [None]:
note_df.notevalue.value_counts().head(20)

In [None]:
note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')].head(20)

In [None]:
note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')].notepath.value_counts().head(20)

In [None]:
note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')].notevalue.value_counts().head(20)

Out of all the possible notes, only those addressing the patient's social history seem interesting and contain information not found in other tables. As such, we'll only keep the note paths that mention social history:

In [None]:
note_df = note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')]
note_df.head()

In [None]:
len(note_df)
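
One detail worth keeping in mind with this filter: `str.contains` interprets its pattern as a regular expression by default. The path used here has no special regex characters, so the result is the same, but passing `regex=False` makes the literal match explicit, and `na=False` treats missing paths as non-matches. A small sketch in plain pandas, with made-up paths:

```python
import pandas as pd

# Toy note paths (made-up values), including a missing one
paths = pd.Series([
    'notes/Progress Notes/Social History / Family History/Social History/Social History/Smoking Status/Never smoked',
    'notes/Progress Notes/Some Other Section/Some Value',
    None,
])

# Literal substring match; missing values count as non-matches
mask = paths.str.contains('notes/Progress Notes/Social History', regex=False, na=False)
print(mask.tolist())
```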

There are still rows that seem to contain irrelevant data. Let's remove them by finding rows that contain specific words, like "obtain" and "print", that only appear in said irrelevant rows:

In [None]:
category_types_to_remove = ['obtain', 'print', 'copies', 'options']

In [None]:
du.search_explore.find_row_contains_word(note_df, feature='notepath', words=category_types_to_remove).value_counts()

In [None]:
note_df = note_df[~du.search_explore.find_row_contains_word(note_df, feature='notepath', words=category_types_to_remove)]
note_df.head()

In [None]:
len(note_df)
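
The word-based filtering can be reproduced in plain pandas by joining the words into a single regular expression. This is a sketch of the idea, not the implementation of `du.search_explore.find_row_contains_word`; the function name `rows_containing_words` is hypothetical, and matching case-insensitively is an assumption.

```python
import re
import pandas as pd

def rows_containing_words(series, words):
    """Hypothetical helper: boolean mask of rows containing any of the words."""
    # Escape each word so that it is matched literally inside the regex
    pattern = '|'.join(re.escape(word) for word in words)
    return series.str.contains(pattern, case=False, regex=True, na=False)

# Toy note paths (made-up values)
paths = pd.Series([
    'Social History/Smoking Status/Never smoked',
    'Social History/Unable to obtain',
    'Social History/Print options',
])
mask = rows_containing_words(paths, ['obtain', 'print', 'copies', 'options'])
print(paths[~mask].tolist())
```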

In [None]:
note_df.patientunitstayid.nunique()

In [None]:
note_df.notetype.value_counts().head(20)

Filtering for just the interesting social history data greatly reduced the volume of the notes table, which is now present in only around 20.5% of the unit stays. Still, it might be useful to include.

Besides the usual removal of row identifier, `noteid`, I'm also removing apparently irrelevant (`noteenteredoffset`, `notetype`) and redundant (`notetext`) columns:

In [None]:
note_df = note_df.drop(['noteid', 'noteenteredoffset', 'notetype', 'notetext'], axis=1)
note_df.head()

### Separate high level notes

In [None]:
note_df.notepath.value_counts().head(20)

In [None]:
note_df.notepath.map(lambda x: x.split('/')).head().values

In [None]:
note_df.notepath.map(lambda x: len(x.split('/'))).min()

In [None]:
note_df.notepath.map(lambda x: len(x.split('/'))).max()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='/')).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='/')).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 3, separator='/')).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='/')).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='/')).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 6, separator='/')).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 7, separator='/')).value_counts()

In [None]:
note_df.notevalue.value_counts()

The note paths always have 8 levels. As the first 6 are essentially always the same ("notes/Progress Notes/Social History / Family History/Social History/Social History/"), it's best to preserve just the 7th one and isolate the 8th in a new feature. This way, the split gives the model further insight into similar notes. However, it's also worth noting that the 8th level of `notepath` seems to be identical to the feature `notevalue`. We'll look more into it later.

In [None]:
note_df['notetopic'] = note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 6, separator='/'))
note_df['notedetails'] = note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 7, separator='/'))
note_df.head()
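
The level extraction can also be seen on a single toy path with pandas' own `str.split`. Note that the "/" inside "Social History / Family History" counts as a separator too, which is why the topic ends up at index 6 and the details at index 7. The last two levels of the path below are made up for illustration.

```python
import pandas as pd

# Toy path following the 8 level structure described above
path = ('notes/Progress Notes/Social History / Family History/'
        'Social History/Social History/Smoking Status/Never smoked')

# expand=True returns one column per level
parts = pd.Series([path]).str.split('/', expand=True)
print(parts.shape[1], parts.loc[0, 6], parts.loc[0, 7])
```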

Remove the now redundant `notepath` column:

In [None]:
note_df = note_df.drop('notepath', axis=1)
note_df.head()

Compare columns `notevalue` and `notedetails`:

In [None]:
note_df[note_df.notevalue != note_df.notedetails]

The previous blank output confirms that the newly created `notedetails` feature is exactly equal to the already existing `notevalue` feature. So, we should remove one of them:

In [None]:
note_df = note_df.drop('notedetails', axis=1)
note_df.head()

In [None]:
note_df[note_df.notetopic == 'Smoking Status'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Ethanol Use'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'CAD'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Cancer'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Recent Travel'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Bleeding Disorders'].notevalue.value_counts()

Only the "Smoking Status" and "Ethanol Use" categories in `notetopic` have more than one possible `notevalue` category, and only 2 of the remaining categories are useful ("CAD" and "Cancer", since "Recent Travel" and "Bleeding Disorders" have too few samples). As such, it's probably best to just turn them into features, instead of packing them into the same embedded feature.

### Convert categories to features

Make the `notetopic` and `notevalue` columns of type categorical:

In [None]:
# Only needed while using Dask, not with Modin or Pandas
# note_df = note_df.categorize(columns=['notetopic', 'notevalue'])

Transform the `notetopic` categories and `notevalue` values into separate features:

In [None]:
note_df = du.data_processing.category_to_feature(note_df, categories_feature='notetopic',
                                                 values_feature='notevalue', min_len=1000, inplace=True)
note_df.head()

Now we have the categories separated into their own features, as desired. Notice also how categories `Bleeding Disorders` and `Recent Travel` weren't added, as they appeared in less than the specified minimum of 1000 rows.
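
To make the transformation concrete, here is a minimal sketch of turning categories into features in plain pandas. It is a stand-in for illustration, not the implementation of `du.data_processing.category_to_feature`; the function name and the toy values are made up, and `min_len` is assumed to be a minimum number of rows per category.

```python
import pandas as pd

def category_to_feature_sketch(df, categories_feature, values_feature, min_len=None):
    """Hypothetical helper: one new column per sufficiently frequent category."""
    out = df.copy()
    counts = out[categories_feature].value_counts()
    for category, n_rows in counts.items():
        if min_len is not None and n_rows < min_len:
            continue  # Skip categories that appear in too few rows
        # Keep the value only in the rows that belong to this category
        out[category] = out[values_feature].where(out[categories_feature] == category)
    return out

toy = pd.DataFrame({
    'notetopic': ['Smoking Status', 'CAD', 'Smoking Status', 'CAD', 'Recent Travel'],
    'notevalue': ['Never smoked', 'CAD', 'Current smoker', 'CAD', 'Recent Travel'],
})
toy_wide = category_to_feature_sketch(toy, 'notetopic', 'notevalue', min_len=2)
print(toy_wide)
```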

Remove the old `notevalue` and `notetopic` columns:

In [None]:
note_df = note_df.drop(['notevalue', 'notetopic'], axis=1)
note_df.head()

While `Ethanol Use` and `Smoking Status` have several unique values, `CAD` and `Cancer` only have 1, indicating when that characteristic is present. As such, we should turn `CAD` and `Cancer` into binary features:

In [None]:
note_df['CAD'] = note_df['CAD'].apply(lambda x: 1 if x == 'CAD' else 0)
note_df['Cancer'] = note_df['Cancer'].apply(lambda x: 1 if x == 'Cancer' else 0)
note_df.head()

In [None]:
note_df['CAD'].value_counts()

In [None]:
note_df['Cancer'].value_counts()
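
The same binary conversion can be written without a lambda, by comparing the whole column at once. A small sketch in plain pandas, with a made-up column:

```python
import numpy as np
import pandas as pd

# Toy column: the category name when present, NaN otherwise
cad = pd.Series(['CAD', np.nan, 'CAD', np.nan])

# Element-wise comparison; NaN compares as False, so it becomes 0
cad_flag = cad.eq('CAD').astype(int)
print(cad_flag.tolist())
```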

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Smoking Status', 'Ethanol Use', 'CAD', 'Cancer']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [note_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
note_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    note_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(note_df, feature, nan_value=0,
                                                                                           forbidden_digit=0)

In [None]:
note_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
note_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# data_path already ends with "/"; the context manager closes the file after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_note.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
note_df = note_df.rename(columns={'noteoffset': 'ts'})
note_df.head()

Remove duplicate rows:

In [None]:
len(note_df)

In [None]:
note_df = note_df.drop_duplicates()
note_df.head()

In [None]:
len(note_df)

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
note_df = note_df.sort_values('ts')
note_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
note_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='CAD', n=5).head()

In [None]:
note_df[note_df.patientunitstayid == 3091883].head(10)

In [None]:
note_df[note_df.patientunitstayid == 3052175].head(10)

We can see that there are up to 5 categories per set of `patientunitstayid` and `ts`. As such, we must join them. However, this is a different scenario than in the other cases. Since we created the features from one categorical column, there are no repeated values, only different rows indicating each of the new features' values. As such, we just need to collapse those rows into one, keeping each feature's value (done below with `cont_join_method='max'`).

### Join rows that have the same IDs

In [None]:
note_df = du.embedding.join_categorical_enum(note_df, cont_join_method='max', inplace=True)
note_df.head()

In [None]:
note_df.dtypes

In [None]:
note_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='CAD', n=5).head()

In [None]:
note_df[note_df.patientunitstayid == 3091883].head(10)

In [None]:
note_df[note_df.patientunitstayid == 3052175].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.
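
The joining step can be pictured with a plain pandas `groupby`. The sketch below is not the implementation of `du.embedding.join_categorical_enum`; it assumes that, per unit stay and timestamp, each categorical feature has at most one non-missing value (so `first`, which skips NaN, is enough) and that binary features can be merged with `max`. The values are made up.

```python
import numpy as np
import pandas as pd

# Toy notes table with two rows for the same unit stay and timestamp
toy = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [10, 10, 5],
    'Smoking Status': ['Never smoked', np.nan, np.nan],
    'CAD': [0, 1, 0],
})

# One row per (patientunitstayid, ts): first non-missing category, max of the binary flag
joined = toy.groupby(['patientunitstayid', 'ts'], as_index=False).agg(
    {'Smoking Status': 'first', 'CAD': 'max'})
print(joined)
```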

### Clean column names

Standardize all column names to be in lowercase, with spaces replaced by underscores and commas removed.

In [None]:
note_df.columns = du.data_processing.clean_naming(note_df.columns)
note_df.head()
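
The naming rules described above can be sketched with basic string methods. This is an illustrative stand-in, not the implementation of `du.data_processing.clean_naming`; the function name `clean_names` is hypothetical.

```python
def clean_names(columns):
    """Hypothetical helper: lowercase, remove commas, spaces to underscores."""
    return [name.lower().replace(',', '').strip().replace(' ', '_') for name in columns]

cleaned = clean_names(['Smoking Status', 'Ethanol Use', 'CAD', 'Cancer'])
print(cleaned)
```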

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
note_df.to_csv(f'{data_path}cleaned/unnormalized/note.csv')

Save the dataframe to the normalized data folder as well (no normalization is applied to the note features in this notebook):

In [None]:
note_df.to_csv(f'{data_path}cleaned/normalized/note.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
note_df.describe().transpose()