# Nursing Data Preprocessing
---

Reading and preprocessing the nursing data of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* nurseAssessment
* nurseCare
* nurseCharting

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")

# Path to the CSV dataset files
data_path = 'Datasets/Thesis/eICU/uncompressed/'

# Path to the code files
project_path = 'GitHub/eICU-mortality-prediction/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)

## Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

## Nurse care data

### Read the data

In [None]:
nursecare_df = pd.read_csv(f'{data_path}original/nurseCare.csv')
nursecare_df.head()

In [None]:
len(nursecare_df)

In [None]:
nursecare_df.patientunitstayid.nunique()

Only 13052 unit stays have nurse care data, so it might not be useful to include this table.

Get an overview of the dataframe through the `describe` method:

In [None]:
nursecare_df.describe().transpose()

In [None]:
nursecare_df.columns

In [None]:
nursecare_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(nursecare_df)
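For readers without access to `data_utils`, a missing values report of this kind can be approximated with plain pandas. The sketch below is an assumption about what such a report contains (count and percentage of nulls per column), not the actual implementation of `du.search_explore.dataframe_missing_values`; `missing_values_sketch` and the toy dataframe are hypothetical.

```python
import pandas as pd

def missing_values_sketch(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    report = pd.DataFrame({
        'missing_count': n_missing,
        'percent_missing': 100 * n_missing / len(df),
    })
    # Columns with the most missing data come first
    return report.sort_values('percent_missing', ascending=False)

# Toy data, only for illustration
toy_df = pd.DataFrame({'celllabel': ['Nutrition', None, 'Activity', None],
                       'patientunitstayid': [1, 2, 3, 4]})
missing_report = missing_values_sketch(toy_df)
print(missing_report)
```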

### Remove unneeded features

In [None]:
nursecare_df.celllabel.value_counts()

In [None]:
nursecare_df.cellattribute.value_counts()

In [None]:
nursecare_df.cellattributevalue.value_counts()

In [None]:
nursecare_df.cellattributepath.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Nutrition'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Activity'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Hygiene/ADLs'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Safety'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Treatments'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Isolation Precautions'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Restraints'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Equipment'].cellattributevalue.value_counts()

Besides the usual removal of row identifier, `nursecareid`, and the timestamp when data was added, `nursecareentryoffset`, I'm also removing `cellattributepath` and `cellattribute`, which have redundant info with `celllabel`.

In [None]:
nursecare_df = nursecare_df.drop(['nursecareid', 'nursecareentryoffset',
                                  'cellattributepath', 'cellattribute'], axis=1)
nursecare_df.head()

Additionally, some information, like "Equipment" and "Restraints", seems to be unnecessary. So let's remove it:

In [None]:
categories_to_remove = ['Safety', 'Restraints', 'Equipment', 'Airway Type',
                        'Isolation Precautions', 'Airway Size']

In [None]:
~(nursecare_df.celllabel.isin(categories_to_remove)).head()

In [None]:
nursecare_df = nursecare_df[~(nursecare_df.celllabel.isin(categories_to_remove))]
nursecare_df.head()

### Convert categories to features

Transform the `celllabel` categories and `cellattributevalue` values into separate features:

In [None]:
nursecare_df = du.data_processing.category_to_feature(nursecare_df, categories_feature='celllabel',
                                                      values_feature='cellattributevalue', min_len=1000, inplace=True)
nursecare_df.head()

Now we have the categories separated into their own features, as desired.
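To make the transformation concrete, here is a minimal pandas sketch of the general idea: each category in one column becomes its own column, filled with the matching value and left empty elsewhere. `category_to_feature_sketch` and the toy data are hypothetical; the real `du.data_processing.category_to_feature` may behave differently (for instance, this sketch ignores the `min_len` parameter).

```python
import pandas as pd

def category_to_feature_sketch(df, categories_feature, values_feature):
    """Hypothetical stand-in: create one column per category, holding that category's values."""
    out = df.copy()
    for category in out[categories_feature].dropna().unique():
        mask = out[categories_feature] == category
        # Keep the value only on rows that belong to this category; NaN elsewhere
        out[category] = out[values_feature].where(mask)
    return out

# Toy data, only for illustration
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'celllabel': ['Nutrition', 'Activity', 'Nutrition'],
    'cellattributevalue': ['NPO', 'Bedrest', 'Regular'],
})
wide_df = category_to_feature_sketch(toy_df, 'celllabel', 'cellattributevalue')
print(wide_df)
```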

Remove the old `celllabel` and `cellattributevalue` columns:

In [None]:
nursecare_df = nursecare_df.drop(['celllabel', 'cellattributevalue'], axis=1)
nursecare_df.head()

In [None]:
nursecare_df['Nutrition'].value_counts()

In [None]:
nursecare_df['Treatments'].value_counts()

In [None]:
nursecare_df['Hygiene/ADLs'].value_counts()

In [None]:
nursecare_df['Activity'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Nutrition', 'Treatments', 'Hygiene/ADLs', 'Activity']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [nursecare_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])
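The same selection can be written more compactly with pandas' `nunique`. This is just an equivalent sketch on hypothetical toy data; `EMBED_THRESHOLD` is a name introduced here for the cut-off of 5 used in the text.

```python
import pandas as pd

EMBED_THRESHOLD = 5  # features with more unique categories than this get embedded

# Toy data, only for illustration
toy_df = pd.DataFrame({
    'Nutrition': ['a', 'b', 'c', 'd', 'e', 'f', None],
    'Activity': ['x', 'y', 'x', 'y', 'x', 'y', 'x'],
})
# nunique() ignores missing values by default
n_unique = toy_df.nunique()
embed_candidates = n_unique[n_unique > EMBED_THRESHOLD].index.tolist()
print(embed_candidates)
```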

In [None]:
nursecare_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    nursecare_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(nursecare_df, feature, nan_value=0)
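As a rough illustration of what enumerating a categorical feature means, the sketch below maps each category to an integer and reserves `nan_value` for missing entries. `enum_categories_sketch` is a hypothetical helper written for this example; the actual `du.embedding.enum_categorical_feature` may order or encode categories differently.

```python
import pandas as pd

def enum_categories_sketch(series, nan_value=0):
    """Hypothetical helper: map each category to an integer, with nan_value for missing data."""
    categories = sorted(series.dropna().unique())
    # Start numbering right after the value reserved for missing data
    mapping = {cat: idx for idx, cat in enumerate(categories, start=nan_value + 1)}
    encoded = series.map(mapping).fillna(nan_value).astype(int)
    return encoded, mapping

# Toy data, only for illustration
toy_series = pd.Series(['NPO', None, 'Regular', 'NPO'])
encoded, mapping = enum_categories_sketch(toy_series)
print(encoded.tolist(), mapping)
```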

In [None]:
nursecare_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
nursecare_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_nurse_care.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
nursecare_df = nursecare_df.rename(columns={'nursecareoffset': 'ts'})
nursecare_df.head()

Remove duplicate rows:

In [None]:
len(nursecare_df)

In [None]:
nursecare_df = nursecare_df.drop_duplicates()
nursecare_df.head()

In [None]:
len(nursecare_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
nursecare_df = nursecare_df.sort_values('ts')
nursecare_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
nursecare_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Nutrition', n=5).head()

In [None]:
nursecare_df[nursecare_df.patientunitstayid == 2798325].head(10)

We can see that there are up to 21 categories per set of `patientunitstayid` and `ts`. As such, we must join them. However, this is a different scenario than in the other cases. Since we created the features from one categorical column, it doesn't have repeated values, only different rows to indicate each of the new features' values. As such, we just need to sum the features.
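The summing idea can be sketched with a plain pandas `groupby`. This is only an illustration on a hypothetical toy dataframe: it assumes that missing values were enumerated as 0 and that each feature has at most one nonzero value per `patientunitstayid` and `ts` pair. It is not meant to reproduce `join_categorical_enum`.

```python
import pandas as pd

# Toy data, only for illustration: the first two rows share the same ID and timestamp
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [10, 10, 5],
    'Nutrition': [3, 0, 1],
    'Activity': [0, 2, 0],
})
# With 0 standing for "no value", summing collapses the rows without losing information
joined_df = toy_df.groupby(['patientunitstayid', 'ts'], as_index=False).sum()
print(joined_df)
```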

### Join rows that have the same IDs

Convert dataframe to Pandas, as the groupby operation in `join_categorical_enum` isn't working properly with Modin:

In [None]:
nursecare_df, pd = du.utils.convert_dataframe(nursecare_df, to='pandas')

In [None]:
type(nursecare_df)

In [None]:
nursecare_df = du.embedding.join_categorical_enum(nursecare_df, new_cat_embed_feat, inplace=True)
nursecare_df.head()

Reconvert dataframe to Modin:

In [None]:
nursecare_df, pd = du.utils.convert_dataframe(nursecare_df, to='modin')

In [None]:
type(nursecare_df)

In [None]:
nursecare_df.dtypes

In [None]:
nursecare_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Nutrition', n=5).head()

In [None]:
nursecare_df[nursecare_df.patientunitstayid == 2798325].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
nursecare_df = nursecare_df.rename(columns={'Treatments':'nurse_treatments'})
nursecare_df.head()

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
nursecare_df.columns = du.data_processing.clean_naming(nursecare_df.columns)
nursecare_df.head()
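The naming rules described above (lower case, underscores instead of spaces, no commas) can be expressed in a few lines of plain Python. `clean_naming_sketch` is a hypothetical helper that only follows that description; `du.data_processing.clean_naming` may apply further rules.

```python
def clean_naming_sketch(names):
    """Hypothetical helper: lower case, spaces to underscores, commas removed."""
    return [str(name).lower().replace(' ', '_').replace(',', '') for name in names]

# Toy column names, only for illustration
cleaned_names = clean_naming_sketch(['Hygiene/ADLs', 'nurse_treatments', 'Score, Total'])
print(cleaned_names)
```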

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
nursecare_df.to_csv(f'{data_path}cleaned/unnormalized/nurseCare.csv')

Save the dataframe after normalizing:

In [None]:
nursecare_df.to_csv(f'{data_path}cleaned/normalized/nurseCare.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
nursecare_df.describe().transpose()

## Nurse assessment data

### Read the data

In [None]:
nurseassess_df = pd.read_csv(f'{data_path}original/nurseAssessment.csv')
nurseassess_df.head()

In [None]:
len(nurseassess_df)

In [None]:
nurseassess_df.patientunitstayid.nunique()

Only 13001 unit stays have nurse assessment data, so it might not be useful to include this table.

Get an overview of the dataframe through the `describe` method:

In [None]:
nurseassess_df.describe().transpose()

In [None]:
nurseassess_df.columns

In [None]:
nurseassess_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(nurseassess_df)

### Remove unneeded features

In [None]:
nurseassess_df.celllabel.value_counts()

In [None]:
nurseassess_df.cellattribute.value_counts()

In [None]:
nurseassess_df.cellattributevalue.value_counts()

In [None]:
nurseassess_df.cellattributepath.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Intervention'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Neurologic'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Pupils'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Edema'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Secretions'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Cough'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Neurologic'].cellattribute.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Pupils'].cellattribute.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Secretions'].cellattribute.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Cough'].cellattribute.value_counts()

Besides the usual removal of row identifier, `nurseAssessID`, and the timestamp when data was added, `nurseAssessEntryOffset`, I'm also removing `cellattributepath` and `cellattribute`, which have redundant info with `celllabel`. Regarding data categories, I'm only keeping `Neurologic`, `Pupils`, `Secretions` and `Cough`, as the remaining ones either don't add much value, have too little data or are redundant with data from other tables.

In [None]:
nurseassess_df = nurseassess_df.drop(['nurseassessid', 'nurseassessentryoffset',
                                      'cellattributepath', 'cellattribute'], axis=1)
nurseassess_df.head()

In [None]:
categories_to_keep = ['Neurologic', 'Pupils', 'Secretions', 'Cough']

In [None]:
nurseassess_df.celllabel.isin(categories_to_keep).head()

In [None]:
nurseassess_df = nurseassess_df[nurseassess_df.celllabel.isin(categories_to_keep)]
nurseassess_df.head()

### Convert categories to features

Transform the `celllabel` categories and `cellattributevalue` values into separate features:

In [None]:
nurseassess_df = du.data_processing.category_to_feature(nurseassess_df, categories_feature='celllabel',
                                                        values_feature='cellattributevalue', min_len=1000, inplace=True)
nurseassess_df.head()

Now we have the categories separated into their own features, as desired.

Remove the old `celllabel` and `cellattributevalue` columns:

In [None]:
nurseassess_df = nurseassess_df.drop(['celllabel', 'cellattributevalue'], axis=1)
nurseassess_df.head()

In [None]:
nurseassess_df['Neurologic'].value_counts()

In [None]:
nurseassess_df['Pupils'].value_counts()

In [None]:
nurseassess_df['Secretions'].value_counts()

In [None]:
nurseassess_df['Cough'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Pupils', 'Neurologic', 'Secretions', 'Cough']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [nurseassess_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
nurseassess_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    nurseassess_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(nurseassess_df, feature, nan_value=0)

In [None]:
nurseassess_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
nurseassess_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_nurse_assess.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
nurseassess_df = nurseassess_df.rename(columns={'nurseassessoffset': 'ts'})
nurseassess_df.head()

Remove duplicate rows:

In [None]:
len(nurseassess_df)

In [None]:
nurseassess_df = nurseassess_df.drop_duplicates()
nurseassess_df.head()

In [None]:
len(nurseassess_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
nurseassess_df = nurseassess_df.sort_values('ts')
nurseassess_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
nurseassess_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough', n=5).head()

In [None]:
nurseassess_df[nurseassess_df.patientunitstayid == 2553254].head(10)

We can see that there are up to 80 categories per set of `patientunitstayid` and `ts`. As such, we must join them.
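A direct way to quantify this kind of repetition is to count the rows per `patientunitstayid` and `ts` pair. The sketch below uses a hypothetical toy dataframe and is only meant to show the pattern.

```python
import pandas as pd

# Toy data, only for illustration: the key (1, 10) appears twice
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [10, 10, 20, 5],
    'Cough': [1, 2, 1, 3],
})
# Number of rows for each ID and timestamp pair
rows_per_key = toy_df.groupby(['patientunitstayid', 'ts']).size()
# Keys that would need to be joined into a single row
repeated_keys = rows_per_key[rows_per_key > 1]
print(rows_per_key.max(), repeated_keys.index.tolist())
```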

### Join rows that have the same IDs

Convert dataframe to Pandas, as the groupby operation in `join_categorical_enum` isn't working properly with Modin:

In [None]:
nurseassess_df, pd = du.utils.convert_dataframe(nurseassess_df, to='pandas')

In [None]:
type(nurseassess_df)

In [None]:
nurseassess_df = du.embedding.join_categorical_enum(nurseassess_df, new_cat_embed_feat, inplace=True)
nurseassess_df.head()

Reconvert dataframe to Modin:

In [None]:
nurseassess_df, pd = du.utils.convert_dataframe(nurseassess_df, to='modin')

In [None]:
type(nurseassess_df)

In [None]:
nurseassess_df.dtypes

In [None]:
nurseassess_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough', n=5).head()

In [None]:
nurseassess_df[nurseassess_df.patientunitstayid == 2553254].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
nurseassess_df.columns = du.data_processing.clean_naming(nurseassess_df.columns)
nurseassess_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
nurseassess_df.to_csv(f'{data_path}cleaned/unnormalized/nurseAssessment.csv')

Save the dataframe after normalizing:

In [None]:
nurseassess_df.to_csv(f'{data_path}cleaned/normalized/nurseAssessment.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
nurseassess_df.describe().transpose()

## Nurse charting data

### Read the data

In [None]:
nursechart_df = pd.read_csv(f'{data_path}original/nurseCharting.csv')
nursechart_df.head()

In [None]:
len(nursechart_df)

In [None]:
nursechart_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
nursechart_df.describe().transpose()

In [None]:
nursechart_df.columns

In [None]:
nursechart_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(nursechart_df)

### Remove unneeded features

In [None]:
nursechart_df.nursingchartcelltypecat.value_counts()

In [None]:
nursechart_df.nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df.nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df.nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Vital Signs'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Scores'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Other Vital Signs and Infusions'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Vital Signs and Infusions'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Invasive'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'SVO2'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'ECG'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Pain Score'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Pain Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Assessment'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Present'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Present'].nursingchartvalue.value_counts()

Regarding the patient's pain information, the only label that seems to be relevant is `Pain Score`. However, it's important to note that this score has different possible measurement systems (`Pain Assessment`). Due to this, we will only consider the most frequent pain scale (`WDL`). `Pain Present` has less information and, as such, is less relevant.

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Glasgow coma score'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Glasgow coma score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'GCS Total'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'GCS Total'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Score (Glasgow Coma Scale)'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Score (Glasgow Coma Scale)'].nursingchartvalue.value_counts()

Labels `GCS Total` and `Score (Glasgow Coma Scale)` should be merged, as they represent exactly the same thing.
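A minimal sketch of such a merge, assuming the two labels already exist as separate columns in which at most one of them is filled per row. The toy dataframe is hypothetical; `fillna` with another series fills the gaps of the first column with the values of the second.

```python
import numpy as np
import pandas as pd

# Toy data, only for illustration
toy_df = pd.DataFrame({
    'GCS Total': [15.0, np.nan, np.nan],
    'Score (Glasgow Coma Scale)': [np.nan, 9.0, np.nan],
})
# Take `GCS Total` where available, otherwise fall back to the equivalent label
toy_df['glasgow_coma_score'] = toy_df['GCS Total'].fillna(toy_df['Score (Glasgow Coma Scale)'])
print(toy_df)
```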

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'SEDATION SCORE'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'SEDATION SCORE'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Sedation Scale/Score/Goal'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Sedation Scale/Score/Goal'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Sedation Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Sedation Scale'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Delirium Scale/Score'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Delirium Scale/Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Delirium Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Delirium Scale'].nursingchartvalue.value_counts()

Sedation and delirium scores could be interesting features. However, as with the pain score, they are recorded in different scales, which don't seem to be directly convertible between each other. Due to this, we will only consider the most frequent scale for each case (`RASS` and `CAM-ICU`, respectively).
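Picking the most frequent scale and discarding the rest can be sketched as follows. The toy dataframe and its values are hypothetical, and it assumes that scale and score already sit in separate columns of the same row.

```python
import pandas as pd

# Toy data, only for illustration
toy_df = pd.DataFrame({
    'Sedation Scale': ['RASS', 'RASS', 'SAS', 'RASS'],
    'Sedation Score': ['-1', '0', '4', '-2'],
})
# value_counts() sorts by frequency, so idxmax() returns the most common scale
most_common_scale = toy_df['Sedation Scale'].value_counts().idxmax()
same_scale_df = toy_df[toy_df['Sedation Scale'] == most_common_scale]
print(most_common_scale, len(same_scale_df))
```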

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Best Motor Response'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Best Motor Response'].nursingchartvalue.value_counts()

These "Best ___ Response" features are subparts of the total Glasgow Coma Score calculation. Because of that, and for having less data, they will be discarded.

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Gastrointestinal Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Gastrointestinal Assessment'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Genitourinary Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Genitourinary Assessment'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Integumentary Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Integumentary Assessment'].nursingchartvalue.value_counts()

Some other information, like these gastrointestinal, genitourinary and integumentary domains, could be relevant to add. The problem is that we only seem to have access to how they were measured (i.e. their scale) and not the real values.

Besides the usual removal of the row identifier, `nursingchartid`, and the timestamp when data was added, `nursingchartentryoffset`, I'm also removing all labels and names except those that relate to pain, coma, sedation and delirium scores. Furthermore, `nursingchartcelltypecat` doesn't add much relevant info either, so it will be removed.

In [None]:
nursechart_df = nursechart_df.drop(['nursingchartid', 'nursingchartentryoffset', 'nursingchartcelltypecat'], axis=1)
nursechart_df.head()

In [None]:
labels_to_keep = ['Glasgow coma score', 'Score (Glasgow Coma Scale)',
                  'Sedation Scale/Score/Goal', 'Delirium Scale/Score']

In [None]:
nursechart_df = nursechart_df[nursechart_df.nursingchartcelltypevallabel.isin(labels_to_keep)]
nursechart_df.head()

In [None]:
names_to_keep = ['Pain Score', 'GCS Total', 'Value', 'Sedation Score',
                 'Sedation Scale', 'Delirium Score', 'Delirium Scale']

In [None]:
nursechart_df = nursechart_df[nursechart_df.nursingchartcelltypevalname.isin(names_to_keep)]
nursechart_df.head()

### Convert categories to features

Make the `nursingchartcelltypevallabel` and `nursingchartcelltypevalname` columns of type categorical:

In [None]:
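# `categorize` is a Dask method; with Pandas or Modin, cast the columns with `astype` instead
nursechart_df = nursechart_df.astype({'nursingchartcelltypevallabel': 'category',
                                      'nursingchartcelltypevalname': 'category'})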

In [None]:
nursechart_df.head()

Transform the `nursingchartcelltypevallabel` categories and `nursingchartvalue` values into separate features:
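One possible way to perform this kind of long-to-wide transformation is a pandas `pivot_table`, shown below as a hedged sketch on hypothetical toy data. It is an alternative technique, not the notebook's own call: it pivots on `nursingchartcelltypevalname` (an assumption made for this example), keeps the first value per cell, and also collapses rows that share the same stay ID and offset. The offset column name `nursingchartoffset` is assumed as well.

```python
import pandas as pd

# Toy data, only for illustration
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'nursingchartoffset': [10, 10, 20, 5],
    'nursingchartcelltypevalname': ['GCS Total', 'Sedation Score', 'GCS Total', 'Pain Score'],
    'nursingchartvalue': ['15', '-1', '14', '3'],
})
# One column per name; values stay as strings, so keep the first one found per cell
wide_df = toy_df.pivot_table(index=['patientunitstayid', 'nursingchartoffset'],
                             columns='nursingchartcelltypevalname',
                             values='nursingchartvalue',
                             aggfunc='first').reset_index()
print(wide_df)
```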

Now we have the categories separated into their own features, as desired.

Remove the old `nursingchartcelltypevallabel`, `nursingchartcelltypevalname` and `nursingchartvalue` columns:

In [None]:
nursechart_df = nursechart_df.drop(['nursingchartcelltypevallabel', 'nursingchartcelltypevalname', 'nursingchartvalue'], axis=1)
nursechart_df.head()

In [None]:
nursechart_df['Pain Score'].value_counts()

### Filter the most common measurement scales

Only keep data that is in the same, most common measurement scale.

In [None]:
nursechart_df = nursechart_df[((nursechart_df['Pain Assessment'] == 'WDL')
                               | (nursechart_df['Sedation Scale'] == 'RASS')
                               | (nursechart_df['Delirium Scale'] == 'CAM-ICU'))]
nursechart_df.head()

Merge Glasgow coma score columns:

In [None]:
def set_gcs(row):
    # Prefer `GCS Total`; fall back to the equivalent `Score (Glasgow Coma Scale)` when it is missing
    if pd.isna(row['GCS Total']):
        return row['Score (Glasgow Coma Scale)']
    else:
        return row['GCS Total']

In [None]:
nursechart_df['glasgow_coma_score'] = nursechart_df.apply(set_gcs, axis=1)
nursechart_df.head()

Drop unneeded columns:

In [None]:
nursechart_df = nursechart_df.drop(['Pain Assessment', 'GCS Total', 'Score (Glasgow Coma Scale)',
                                    'Value', 'Sedation Scale', 'Delirium Scale'], axis=1)
nursechart_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['nursingchartcelltypecat', 'nursingchartvalue']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [nursechart_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
nursechart_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    nursechart_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(nursechart_df, feature, nan_value=0)

In [None]:
nursechart_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
nursechart_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_nurse_chart.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
nursechart_df = nursechart_df.rename(columns={'nursingchartoffset': 'ts'})
nursechart_df.head()

Remove duplicate rows:

In [None]:
len(nursechart_df)

In [None]:
nursechart_df = nursechart_df.drop_duplicates()
nursechart_df.head()

In [None]:
len(nursechart_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
nursechart_df = nursechart_df.sort_values('ts')
nursechart_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
nursechart_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='nursingchartcelltypecat', n=5).head()

In [None]:
nursechart_df[nursechart_df.patientunitstayid == 2553254].head(10)

We can see that there are up to 80 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

Convert dataframe to Pandas, as the groupby operation in `join_categorical_enum` isn't working properly with Modin:

In [None]:
nursechart_df, pd = du.utils.convert_dataframe(nursechart_df, to='pandas')

In [None]:
type(nursechart_df)

In [None]:
nursechart_df = du.embedding.join_categorical_enum(nursechart_df, new_cat_embed_feat, inplace=True)
nursechart_df.head()

Reconvert dataframe to Modin:

In [None]:
nursechart_df, pd = du.utils.convert_dataframe(nursechart_df, to='modin')

In [None]:
type(nursechart_df)

In [None]:
nursechart_df.dtypes

In [None]:
nursechart_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='nursingchartcelltypecat', n=5).head()

In [None]:
nursechart_df[nursechart_df.patientunitstayid == 2553254].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
nursechart_df = nursechart_df.rename(columns={'nursingchartcelltypecat':'nurse_assess_label',
                                                'nursingchartvalue':'nurse_assess_value'})
nursechart_df.head()

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
nursechart_df.columns = du.data_processing.clean_naming(nursechart_df.columns)
nursechart_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
nursechart_df.to_csv(f'{data_path}cleaned/unnormalized/nurseCharting.csv')

Save the dataframe after normalizing:

In [None]:
nursechart_df.to_csv(f'{data_path}cleaned/normalized/nurseCharting.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
nursechart_df.describe().transpose()