# eICU Data Joining
---

Reading and joining all parts of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

The main goal of this notebook is to prepare a single CSV document containing all the relevant data for training a machine learning model that predicts mortality. This involves joining tables, filtering out unneeded columns and performing imputation.

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")

# Path to the CSV dataset files
data_path = 'Documents/Datasets/Thesis/eICU/uncompressed/'

# Path to the code files
project_path = 'Documents/GitHub/eICU-mortality-prediction/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)

## Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

## Patient data

### Read the data

In [None]:
patient_df = pd.read_csv(f'{data_path}original/patient.csv')
patient_df.head()

In [None]:
len(patient_df)

In [None]:
patient_df.patientunitstayid.nunique()

In [None]:
patient_df.patientunitstayid.value_counts()

Get an overview of the dataframe through the `describe` method:

In [None]:
patient_df.describe().transpose()

In [None]:
patient_df.columns

In [None]:
patient_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(patient_df)

### Remove unneeded features

Besides removing unneeded hospital and time information, I'm also removing the admission diagnosis (`apacheadmissiondx`) as it doesn't follow the same structure as the remaining diagnosis data (which is categorized in increasingly specific categories, separated by "|").

In [None]:
patient_df = patient_df[['patientunitstayid', 'gender', 'age', 'ethnicity',  'admissionheight',
                         'hospitaldischargeoffset', 'hospitaldischargelocation', 'hospitaldischargestatus',
                         'admissionweight', 'dischargeweight', 'unitdischargeoffset']]
patient_df.head()

### Make the age feature numeric

In the eICU dataset, ages above 89 years old are not specified; instead, we just receive the indication "> 89". To be able to work with the age feature numerically, we'll replace the "> 89" values with "90", as if the patient were 90 years old. This might not always be the case, but the difference should be small and is unlikely to affect the model's logic much.

In [None]:
patient_df.age.value_counts().head()

In [None]:
# Replace the "> 89" years old indication with 90 years
patient_df.age = patient_df.age.replace(to_replace='> 89', value=90)

In [None]:
patient_df.age.value_counts().head()

In [None]:
# Make the age feature numeric
patient_df.age = patient_df.age.astype(float)
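
As a quick sanity check of the conversion above, here is a minimal sketch on a toy series. It uses plain pandas rather than Modin, and the values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the eICU `age` column, which is read as strings
age = pd.Series(['45', '> 89', '72', None, '> 89'])

# Replace the censored "> 89" marker with 90, then convert to numbers;
# missing ages stay missing (NaN)
age_numeric = pd.to_numeric(age.replace('> 89', '90'))

print(age_numeric.tolist())
```

Every "> 89" entry becomes 90, while missing ages remain NaN for later imputation.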

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Convert binary categorical features into numeric

In [None]:
patient_df.gender.value_counts()

In [None]:
patient_df.gender = patient_df.gender.map(lambda x: 1 if x == 'Male' else 0 if x == 'Female' else np.nan)

In [None]:
patient_df.gender.value_counts()

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.


Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['ethnicity']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [patient_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        # Add feature to the list of the new ones (from the current table) that will be embedded
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
patient_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    patient_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(patient_df, feature)

In [None]:
patient_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
patient_df[cat_feat].dtypes
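
The internals of `du.embedding.enum_categorical_feature` are not shown in this notebook. As a rough idea of what enumerating categories for an embedding layer can look like, here is a sketch with a hypothetical helper, `enumerate_categories`, which assumes that 0 is reserved for missing values:

```python
import pandas as pd

def enumerate_categories(series, nan_value=0):
    """Hypothetical helper: map each category to an integer, keeping `nan_value` for missing data."""
    categories = sorted(series.dropna().unique())
    # Start counting right after the value reserved for missing data
    mapping = {category: index for index, category in enumerate(categories, start=nan_value + 1)}
    encoded = series.map(mapping).fillna(nan_value).astype(int)
    return encoded, mapping

# Toy data, not taken from the real dataset
ethnicity = pd.Series(['Caucasian', 'Hispanic', None, 'Asian', 'Caucasian'])
encoded, mapping = enumerate_categories(ethnicity)

print(encoded.tolist())
print(mapping)
```

The returned mapping plays the same role as `cat_embed_feat_enum`: it allows the integer codes to be translated back into the original strings.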

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create mortality label

Combine the information from the discharge location and the discharge status, using the hospital discharge data instead of the unit discharge data, as it gives a longer perspective on the patient's status. I then save a feature called `deathoffset`, which holds the hospital discharge offset if the patient is dead on hospital discharge, or NaN if the patient is still alive or their status is unknown (presumed alive if unknown). Based on this, a label can be made later on, when all the tables are combined in a single dataframe, indicating whether a patient dies within the following X time, according to how far ahead we want to predict.

In [None]:
patient_df.hospitaldischargestatus.value_counts()

In [None]:
patient_df.hospitaldischargelocation.value_counts()

In [None]:
# Keep the hospital discharge offset only for patients that died; all others get NaN
patient_df['deathoffset'] = patient_df.apply(lambda row: row['hospitaldischargeoffset']
                                                         if row['hospitaldischargestatus'] == 'Expired' or
                                                         row['hospitaldischargelocation'] == 'Death' else np.nan, axis=1)
patient_df.head()
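
The same death offset can also be computed without a row-wise `apply`. The sketch below uses plain pandas on toy data and shows how a label could later be derived from it. The `horizon` and `current_ts` names and the 3-day window are assumptions made only for illustration:

```python
import numpy as np
import pandas as pd

# Toy discharge data (offsets in minutes since unit admission)
stays = pd.DataFrame({
    'patientunitstayid': [1, 2, 3],
    'hospitaldischargeoffset': [4000, 2500, 9000],
    'hospitaldischargestatus': ['Alive', 'Expired', None],
    'hospitaldischargelocation': ['Home', 'Death', 'Other'],
})

# A patient is considered dead if either discharge field says so
died = ((stays['hospitaldischargestatus'] == 'Expired') |
        (stays['hospitaldischargelocation'] == 'Death'))

# Keep the discharge offset for dead patients, NaN for everyone else
stays['deathoffset'] = stays['hospitaldischargeoffset'].where(died)

# Hypothetical label: does the patient die within `horizon` minutes of `current_ts`?
horizon = 3 * 24 * 60
current_ts = 0
stays['label'] = ((stays['deathoffset'] - current_ts) <= horizon).astype(int)

print(stays[['patientunitstayid', 'deathoffset', 'label']])
```

Comparisons against NaN evaluate to false, so patients without a death offset get a label of 0.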

Remove the now unneeded hospital discharge features:

In [None]:
patient_df = patient_df.drop(['hospitaldischargeoffset', 'hospitaldischargestatus', 'hospitaldischargelocation'], axis=1)
patient_df.head(6)

### Create a discharge instance and the timestamp feature

Create the timestamp (`ts`) feature:

In [None]:
patient_df['ts'] = 0
patient_df.head()

Create a weight feature:

In [None]:
# Create feature weight and assign the initial weight that the patient has on admission
patient_df['weight'] = patient_df['admissionweight']
patient_df.head()

Duplicate every row, so as to create a discharge event:

In [None]:
new_df = patient_df.copy()
new_df.head()

Set the `weight` and `ts` features to initially have the value on admission and, on the second timestamp, have the value on discharge:

In [None]:
new_df.ts = new_df.unitdischargeoffset
new_df.weight = new_df.dischargeweight
new_df.head()

Join the new rows to the remaining dataframe:

In [None]:
# `DataFrame.append` was removed in pandas 2.0; `pd.concat` is the supported equivalent
patient_df = pd.concat([patient_df, new_df])
patient_df.head()

Remove the remaining, now unneeded, weight and timestamp features:

In [None]:
patient_df = patient_df.drop(['admissionweight', 'dischargeweight', 'unitdischargeoffset'], axis=1)
patient_df.head(6)
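
The admission/discharge duplication done in the cells above can be condensed as in the following sketch, which uses plain pandas and made-up values:

```python
import pandas as pd

# Toy static data: one row per unit stay
stay = pd.DataFrame({
    'patientunitstayid': [10, 11],
    'admissionweight': [70.0, 82.5],
    'dischargeweight': [68.0, 80.0],
    'unitdischargeoffset': [3000, 1200],
})

# Admission event: timestamp 0 and the weight measured on admission
admission = stay.assign(ts=0, weight=stay['admissionweight'])
# Discharge event: timestamp of unit discharge and the weight measured on discharge
discharge = stay.assign(ts=stay['unitdischargeoffset'], weight=stay['dischargeweight'])

# Stack both events and drop the columns that are now redundant
events = (pd.concat([admission, discharge], ignore_index=True)
            .drop(columns=['admissionweight', 'dischargeweight', 'unitdischargeoffset'])
            .sort_values(['patientunitstayid', 'ts'])
            .reset_index(drop=True))

print(events)
```

Each unit stay ends up with exactly two rows, one at `ts = 0` and one at the unit discharge offset.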

Sort by `patientunitstayid` so as to check the data of each patient together:

In [None]:
patient_df.sort_values(by='patientunitstayid').head(6)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
patient_df = patient_df.set_index('ts').sort_index()
patient_df.head(6)

### Normalize data

Save the dataframe before normalizing:

In [None]:
patient_df.to_csv(f'{data_path}cleaned/unnormalized/patient.csv')

In [None]:
new_cat_feat

In [None]:
patient_df_norm = du.data_processing.normalize_data(patient_df, embed_columns=new_cat_feat,
                                                 id_columns=['patientunitstayid', 'deathoffset'])
patient_df_norm.head(6)

In [None]:
patient_df_norm.to_csv(f'{data_path}cleaned/normalized/patient.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
patient_df_norm.describe().transpose()
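
The exact behavior of `du.data_processing.normalize_data` is not shown in this notebook. Assuming it performs something like z-score normalization while leaving ID and embedding columns untouched, a minimal stand-in could look like the sketch below. The `zscore_normalize` helper is hypothetical:

```python
import pandas as pd

def zscore_normalize(df, skip_columns):
    """Hypothetical stand-in: z-score every column except those listed in `skip_columns`."""
    normalized = df.copy()
    for column in df.columns:
        if column in skip_columns:
            continue
        std = df[column].std()
        # Skip constant or empty columns to avoid dividing by zero
        if pd.notna(std) and std > 0:
            normalized[column] = (df[column] - df[column].mean()) / std
    return normalized

# Toy data
frame = pd.DataFrame({'patientunitstayid': [1, 2, 3], 'age': [50, 70, 90]})
normalized = zscore_normalize(frame, skip_columns=['patientunitstayid'])

print(normalized)
```

With this kind of normalization, `describe` should report a mean close to 0 and a standard deviation close to 1 for every normalized column, which is what the check above looks for.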

### Create the unifying dataframe

In [None]:
eICU_df = patient_df
eICU_df.head()

## Vital signs periodic data

### Read the data

In [None]:
vital_prdc_df = pd.read_csv(f'{data_path}original/vitalPeriodic.csv')
vital_prdc_df.head()

Get an overview of the dataframe through the `describe` method:

In [None]:
vital_prdc_df.describe().transpose()

In [None]:
vital_prdc_df.columns

In [None]:
vital_prdc_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(vital_prdc_df)

### Remove unneeded features

In [None]:
# Drop the unique identifier of the table, which carries no clinical information
vital_prdc_df = vital_prdc_df.drop('vitalperiodicid', axis=1)
vital_prdc_df.head()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
vital_prdc_df['ts'] = vital_prdc_df['observationoffset']
vital_prdc_df = vital_prdc_df.drop('observationoffset', axis=1)
vital_prdc_df.head()

In [None]:
vital_prdc_df.patientunitstayid.value_counts()

Remove duplicate rows:

In [None]:
len(vital_prdc_df)

In [None]:
vital_prdc_df = vital_prdc_df.drop_duplicates()
vital_prdc_df.head()

In [None]:
len(vital_prdc_df)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
vital_prdc_df = vital_prdc_df.set_index('ts').sort_index()
vital_prdc_df.head(6)

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
vital_prdc_df.reset_index().head()

In [None]:
# Count the rows per unit stay ID and timestamp and show the largest groups
vital_prdc_df.reset_index().groupby(['patientunitstayid', 'ts']).size().nlargest(5)

In [None]:
vital_prdc_df[vital_prdc_df.patientunitstayid == 3069495].head(20)

### Join rows that have the same IDs

In [None]:
# No categorical features in this table; rows are joined with the function's default settings
vital_prdc_df = du.embedding.join_categorical_enum(vital_prdc_df)
vital_prdc_df.head()

In [None]:
vital_prdc_df.dtypes

In [None]:
vital_prdc_df.reset_index().groupby(['patientunitstayid', 'ts']).size().nlargest(5)

In [None]:
vital_prdc_df[vital_prdc_df.patientunitstayid == 3069495].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Normalize data

In [None]:
vital_prdc_df_norm = du.data_processing.normalize_data(vital_prdc_df,
                                                       id_columns=['patientunitstayid', 'ts'])
vital_prdc_df_norm.head(6)

Confirm that everything is ok through the `describe` method:

In [None]:
vital_prdc_df_norm.describe().transpose()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
vital_prdc_df.columns = du.data_processing.clean_naming(vital_prdc_df.columns)
vital_prdc_df_norm.columns = du.data_processing.clean_naming(vital_prdc_df_norm.columns)
vital_prdc_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
vital_prdc_df.to_csv(f'{data_path}cleaned/unnormalized/vitalPeriodic.csv')

Save the dataframe after normalizing:

In [None]:
vital_prdc_df_norm.to_csv(f'{data_path}cleaned/normalized/vitalPeriodic.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
vital_prdc_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
patient_df = pd.read_csv(f'{data_path}cleaned/normalized/patient.csv')
patient_df.head()

In [None]:
vital_prdc_df = pd.read_csv(f'{data_path}cleaned/normalized/vitalPeriodic.csv')
vital_prdc_df.head()

In [None]:
eICU_df = pd.merge_asof(patient_df, vital_prdc_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
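
To make the behavior of the tolerance-based merge more concrete, here is a small sketch with plain pandas and toy data (the values are invented). Note that `merge_asof` requires both dataframes to be sorted by the `on` key:

```python
import pandas as pd

# Toy static data and toy vital signs (timestamps in minutes)
left = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                     'ts': [0, 100, 0],
                     'age': [60, 60, 75]})
right = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                      'ts': [10, 140, 45],
                      'heartrate': [80, 95, 70]})

# Both sides must be sorted by the `on` key before calling merge_asof
left = left.sort_values('ts')
right = right.sort_values('ts')

# Match each left row with the nearest right row of the same unit stay,
# as long as the timestamps differ by at most 30 minutes
merged = pd.merge_asof(left, right, on='ts', by='patientunitstayid',
                       direction='nearest', tolerance=30)

print(merged)
```

Only the row of unit stay 1 at `ts = 0` finds a match (the measurement at minute 10). The other two rows have no measurement within 30 minutes, so their `heartrate` stays NaN. Also note that the result keeps the rows and timestamps of the left dataframe only.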

## Vital signs aperiodic data

### Read the data

In [None]:
vital_aprdc_df = pd.read_csv(f'{data_path}original/vitalAperiodic.csv')
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
vital_aprdc_df.describe().transpose()

In [None]:
vital_aprdc_df.columns

In [None]:
vital_aprdc_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(vital_aprdc_df)

### Remove unneeded features

In [None]:
vital_aprdc_df = vital_aprdc_df.drop('vitalaperiodicid', axis=1)
vital_aprdc_df.head()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
vital_aprdc_df['ts'] = vital_aprdc_df['observationoffset']
vital_aprdc_df = vital_aprdc_df.drop('observationoffset', axis=1)
vital_aprdc_df.head()

Remove duplicate rows:

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df = vital_aprdc_df.drop_duplicates()
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
vital_aprdc_df = vital_aprdc_df.set_index('ts').sort_index()
vital_aprdc_df.head(6)

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
vital_aprdc_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='noninvasivemean').head()

In [None]:
vital_aprdc_df[vital_aprdc_df.patientunitstayid == 3069495].head(20)

### Join rows that have the same IDs

In [None]:
# No categorical features in this table; rows are joined with the function's default settings
vital_aprdc_df = du.embedding.join_categorical_enum(vital_aprdc_df)
vital_aprdc_df.head()

In [None]:
vital_aprdc_df.dtypes

In [None]:
vital_aprdc_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='noninvasivemean').head()

In [None]:
vital_aprdc_df[vital_aprdc_df.patientunitstayid == 3069495].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Normalize data

In [None]:
vital_aprdc_df_norm = du.data_processing.normalize_data(vital_aprdc_df,
                                                        id_columns=['patientunitstayid', 'ts'])
vital_aprdc_df_norm.head(6)

Confirm that everything is ok through the `describe` method:

In [None]:
vital_aprdc_df_norm.describe().transpose()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
vital_aprdc_df.columns = du.data_processing.clean_naming(vital_aprdc_df.columns)
vital_aprdc_df_norm.columns = du.data_processing.clean_naming(vital_aprdc_df_norm.columns)
vital_aprdc_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
vital_aprdc_df.to_csv(f'{data_path}cleaned/unnormalized/vitalAperiodic.csv')

Save the dataframe after normalizing:

In [None]:
vital_aprdc_df_norm.to_csv(f'{data_path}cleaned/normalized/vitalAperiodic.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
vital_aprdc_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
vital_aprdc_df = pd.read_csv(f'{data_path}cleaned/normalized/vitalAperiodic.csv')
vital_aprdc_df.head()

In [None]:
len(vital_aprdc_df)

In [None]:
vital_aprdc_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, vital_aprdc_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Infectious disease data

### Read the data

In [None]:
infect_df = pd.read_csv(f'{data_path}original/carePlanInfectiousDisease.csv')
infect_df.head()

In [None]:
infect_df.infectdiseasesite.value_counts().head(10)

In [None]:
infect_df.infectdiseaseassessment.value_counts().head(10)

In [None]:
infect_df.responsetotherapy.value_counts().head(10)

In [None]:
infect_df.treatment.value_counts().head(10)

Most features in this table either don't add much information or they have a lot of missing values. The truly relevant one seems to be `infectdiseasesite`. Even `activeupondischarge` doesn't seem very practical as we don't have complete information as to when infections end, might as well just register when they are first verified.

Get an overview of the dataframe through the `describe` method:

In [None]:
infect_df.describe().transpose()

In [None]:
infect_df.columns

In [None]:
infect_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(infect_df)

### Remove unneeded features

In [None]:
infect_df = infect_df[['patientunitstayid', 'cplinfectdiseaseoffset', 'infectdiseasesite']]
infect_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['infectdiseasesite']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [infect_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        # Add feature to the list of the new ones (from the current table) that will be embedded
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
infect_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    infect_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(infect_df, feature)

In [None]:
infect_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
infect_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
infect_df['ts'] = infect_df['cplinfectdiseaseoffset']
infect_df = infect_df.drop('cplinfectdiseaseoffset', axis=1)
infect_df.head()

In [None]:
infect_df.patientunitstayid.value_counts()

Only 3620 unit stays have infection data. Might not be useful to include them.

Remove duplicate rows:

In [None]:
len(infect_df)

In [None]:
infect_df = infect_df.drop_duplicates()
infect_df.head()

In [None]:
len(infect_df)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
infect_df = infect_df.set_index('ts').sort_index()
infect_df.head(6)

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
infect_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='infectdiseasesite').head()

In [None]:
infect_df[infect_df.patientunitstayid == 3049689].head(20)

We can see that there are up to 6 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
infect_df = du.embedding.join_categorical_enum(infect_df, new_cat_embed_feat)
infect_df.head()

In [None]:
infect_df.dtypes

In [None]:
infect_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='infectdiseasesite').head()

In [None]:
infect_df[infect_df.patientunitstayid == 3049689].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.
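
How `du.embedding.join_categorical_enum` combines rows is not shown in this notebook. One possible approach, sketched below with plain pandas on toy data, is to group by unit stay ID and timestamp and concatenate the distinct category codes into a single delimited string. The `join_unique` helper and the `;` separator are assumptions made for illustration:

```python
import pandas as pd

# Toy enumerated data with repeated (unit stay, timestamp) pairs
rows = pd.DataFrame({
    'patientunitstayid': [5, 5, 5, 6],
    'ts': [30, 30, 90, 30],
    'infectdiseasesite': [2, 4, 2, 1],
})

def join_unique(values):
    """Hypothetical helper: join the distinct category codes of a group into one string."""
    return ';'.join(str(value) for value in sorted(set(values)))

# One row per (unit stay, timestamp), with all of its categories in a single cell
joined = (rows.groupby(['patientunitstayid', 'ts'], as_index=False)
              .agg({'infectdiseasesite': join_unique}))

print(joined)
```

After this step, each pair of `patientunitstayid` and `ts` appears only once, which is the property verified by the group counts above.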

### Normalize data

In [None]:
infect_df_norm = du.data_processing.normalize_data(infect_df, embed_columns=new_cat_feat,
                                                id_columns=['patientunitstayid'])
infect_df_norm.head(6)

Confirm that everything is ok through the `describe` method:

In [None]:
infect_df_norm.describe().transpose()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
infect_df.columns = du.data_processing.clean_naming(infect_df.columns)
infect_df_norm.columns = du.data_processing.clean_naming(infect_df_norm.columns)
infect_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
infect_df.to_csv(f'{data_path}cleaned/unnormalized/carePlanInfectiousDisease.csv')

Save the dataframe after normalizing:

In [None]:
infect_df_norm.to_csv(f'{data_path}cleaned/normalized/carePlanInfectiousDisease.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
infect_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
infect_df = pd.read_csv(f'{data_path}cleaned/normalized/carePlanInfectiousDisease.csv')
infect_df.head()

In [None]:
len(infect_df)

In [None]:
infect_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, infect_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Microbiology data

### Read the data

In [None]:
micro_df = pd.read_csv(f'{data_path}original/microLab.csv')
micro_df.head()

In [None]:
len(micro_df)

In [None]:
micro_df.patientunitstayid.nunique()

Only 2923 unit stays have microbiology data. Might not be useful to include them.

Get an overview of the dataframe through the `describe` method:

In [None]:
micro_df.describe().transpose()

In [None]:
micro_df.columns

In [None]:
micro_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(micro_df)

### Remove unneeded features

In [None]:
micro_df.culturesite.value_counts()

In [None]:
micro_df.organism.value_counts()

In [None]:
micro_df.antibiotic.value_counts()

In [None]:
micro_df.sensitivitylevel.value_counts()

All features appear to be relevant, except the unique identifier of the table.

In [None]:
micro_df = micro_df.drop('microlabid', axis=1)
micro_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

In the case of microbiology data, we're also going to embed the antibiotic `sensitivitylevel`, not because it has many categories, but because there can be several rows of data per timestamp (which would be impractical on one hot encoded data).

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['culturesite', 'organism', 'antibiotic', 'sensitivitylevel']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [micro_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5 or new_cat_feat[i] == 'sensitivitylevel':
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
micro_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    micro_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(micro_df, feature)

In [None]:
micro_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
micro_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
micro_df['ts'] = micro_df['culturetakenoffset']
micro_df = micro_df.drop('culturetakenoffset', axis=1)
micro_df.head()

Remove duplicate rows:

In [None]:
len(micro_df)

In [None]:
micro_df = micro_df.drop_duplicates()
micro_df.head()

In [None]:
len(micro_df)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
micro_df = micro_df.set_index('ts').sort_index()
micro_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
micro_df.reset_index().head()

In [None]:
micro_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='culturesite').head()

In [None]:
micro_df[micro_df.patientunitstayid == 3069495].head(20)

We can see that there are up to 120 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
micro_df = du.embedding.join_categorical_enum(micro_df, new_cat_embed_feat)
micro_df.head()

In [None]:
micro_df.dtypes

In [None]:
micro_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='culturesite').head()

In [None]:
micro_df[micro_df.patientunitstayid == 3069495].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Normalize data

In [None]:
micro_df_norm = du.data_processing.normalize_data(micro_df, embed_columns=new_cat_feat,
                                               id_columns=['patientunitstayid'])
micro_df_norm.head(6)

Confirm that everything is ok through the `describe` method:

In [None]:
micro_df_norm.describe().transpose()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
micro_df.columns = du.data_processing.clean_naming(micro_df.columns)
micro_df_norm.columns = du.data_processing.clean_naming(micro_df_norm.columns)
micro_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
micro_df.to_csv(f'{data_path}cleaned/unnormalized/microLab.csv')

Save the dataframe after normalizing:

In [None]:
micro_df_norm.to_csv(f'{data_path}cleaned/normalized/microLab.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
micro_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
micro_df = pd.read_csv(f'{data_path}cleaned/normalized/microLab.csv')
micro_df.head()

In [None]:
len(micro_df)

In [None]:
micro_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, micro_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Respiratory care data

### Read the data

In [None]:
resp_care_df = pd.read_csv(f'{data_path}original/respiratoryCare.csv', dtype={'airwayposition': 'object',
                                                                              'airwaysize': 'object',
                                                                              'apneaparms': 'object',
                                                                              'setapneafio2': 'object',
                                                                              'setapneaie': 'object',
                                                                              'setapneainsptime': 'object',
                                                                              'setapneainterval': 'object',
                                                                              'setapneaippeephigh': 'object',
                                                                              'setapneapeakflow': 'object',
                                                                              'setapneatv': 'object'})
resp_care_df.head()

In [None]:
len(resp_care_df)

In [None]:
resp_care_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
resp_care_df.describe().transpose()

In [None]:
resp_care_df.columns

In [None]:
resp_care_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(resp_care_df)

### Remove unneeded features

For the respiratoryCare table, I'm not going to use any of the several features that detail what the vent in the hospital is like. Besides not appearing to be very relevant for the patient, they have a lot of missing values (>67%). Instead, I'm going to set a ventilation label (between the start and the end), and a previous ventilation label.

In [None]:
resp_care_df = resp_care_df[['patientunitstayid', 'ventstartoffset',
                             'ventendoffset', 'priorventstartoffset']]
resp_care_df.head()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
resp_care_df['ts'] = resp_care_df['ventstartoffset']
resp_care_df = resp_care_df.drop('ventstartoffset', axis=1)
resp_care_df.head()

Remove duplicate rows:

In [None]:
len(resp_care_df)

In [None]:
resp_care_df = resp_care_df.drop_duplicates()
resp_care_df.head()

In [None]:
len(resp_care_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
resp_care_df = resp_care_df.set_index('ts')
resp_care_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
resp_care_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='ventendoffset').head()

In [None]:
resp_care_df[resp_care_df.patientunitstayid == 3348331].head(20)

We can see that there are up to 5283 duplicate rows per set of `patientunitstayid` and `ts`. As such, we must join them.
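
As a minimal sketch of this kind of check, assuming plain pandas and made-up toy values, the number of rows per `patientunitstayid` and `ts` pair can also be counted with `size`, which counts every row, while `count` skips missing values:

```python
import pandas as pd

# Toy data: stay 1 has three rows at ts=10, stay 2 has a single row
toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 1, 2],
                       'ts': [10, 10, 10, 5],
                       'ventendoffset': [50, 60, 70, 20]})

# Number of rows for each (unit stay, timestamp) pair, largest first
rows_per_key = (toy_df.groupby(['patientunitstayid', 'ts'])
                      .size()
                      .sort_values(ascending=False))
print(rows_per_key.head())
```

Any pair with a count above 1 still needs to be joined into a single row.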

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to apply a groupby function, selecting the minimum value of each offset feature, as the larger values don't make sense (in the case of `priorventstartoffset`).

In [None]:
# Each comparison needs its own parentheses, as `&` binds more tightly than `!=`
((resp_care_df.index > resp_care_df.ventendoffset) & (resp_care_df.ventendoffset != 0)).value_counts()

There are no errors of having the start vent timestamp later than the end vent timestamp.
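
A minimal sketch of how such a boolean mask is built, assuming plain pandas and made-up toy values. In Python, `&` binds more tightly than comparison operators such as `!=`, so each comparison needs its own parentheses. Treating 0 as "no end recorded" is my assumption, drawn from the check above:

```python
import pandas as pd

start = pd.Series([10, 30, 5])   # toy ventilation start offsets
end = pd.Series([20, 0, 2])      # toy ventilation end offsets

# Start is later than end AND end is not the assumed 0 placeholder
mask = (start > end) & (end != 0)
print(mask.tolist())
```

Only the last toy row is flagged, since the second row's end offset is 0.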

In [None]:
resp_care_df = du.embedding.join_categorical_enum(resp_care_df, cont_join_method='min')
resp_care_df.head()
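
I'm not reproducing the internals of `join_categorical_enum` here. As a rough sketch of what a minimum-based join amounts to, assuming plain pandas and made-up toy values:

```python
import pandas as pd

toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                       'ts': [10, 10, 5],
                       'ventendoffset': [60, 50, 20],
                       'priorventstartoffset': [-30, -100, 0]})

# Keep one row per (unit stay, timestamp), taking the minimum of each offset
joined_df = (toy_df.groupby(['patientunitstayid', 'ts'], as_index=False)
                   .min())
print(joined_df)
```

The two toy rows of stay 1 collapse into one, keeping the smallest value of each offset column.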

In [None]:
resp_care_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='ventendoffset').head()

In [None]:
resp_care_df[resp_care_df.patientunitstayid == 1113084].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

Only keep the first instance of each patient, as we're only keeping track of when they are on ventilation:

In [None]:
resp_care_df = resp_care_df.reset_index().groupby('patientunitstayid').first().reset_index().set_index('ts')
resp_care_df.head(20)

### Create prior ventilation label

Make a feature `priorvent` that indicates if the patient has been on ventilation before.

Convert to pandas:

In [None]:
resp_care_df = resp_care_df

Create the prior ventilation column:

In [None]:
resp_care_df['priorvent'] = (resp_care_df.priorventstartoffset < resp_care_df.index).astype(int)
resp_care_df.head()

Remove the now unneeded `priorventstartoffset` column:

In [None]:
resp_care_df = resp_care_df.drop('priorventstartoffset', axis=1)
resp_care_df.head()

### Create current ventilation label

Make a feature `onvent` that indicates if the patient is currently on ventilation.

Create an `onvent` feature:

In [None]:
resp_care_df['onvent'] = 1
resp_care_df.head(6)

Reset index to allow editing the `ts` column:

In [None]:
resp_care_df = resp_care_df.reset_index()
resp_care_df.head(6)

Duplicate every row, so as to create a discharge event:

In [None]:
new_df = resp_care_df.copy()
new_df.head()

Set the new dataframe's rows to have the ventilation stop timestamp, indicating that ventilation use ended:

In [None]:
new_df.ts = new_df.ventendoffset
new_df.onvent = 0
new_df.head()

Join the new rows to the remaining dataframe:

In [None]:
# `DataFrame.append` was removed in pandas 2.0; `concat` is the supported equivalent
resp_care_df = pd.concat([resp_care_df, new_df])
resp_care_df.head()

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
resp_care_df = resp_care_df.set_index('ts')
resp_care_df.head()

Remove the now unneeded ventilation end column:

In [None]:
resp_care_df = resp_care_df.drop('ventendoffset', axis=1)
resp_care_df.head(6)

In [None]:
resp_care_df.tail(6)

In [None]:
resp_care_df[resp_care_df.patientunitstayid == 1557538]
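
The start and stop event construction from this subsection can be sketched end to end on toy data. This is a simplified version in plain pandas with made-up values, not the exact cells above:

```python
import pandas as pd

vent_df = pd.DataFrame({'patientunitstayid': [1, 2],
                        'ts': [10, 0],
                        'ventendoffset': [120, 45]})
vent_df['onvent'] = 1

# Copy every row and move it to the ventilation end, flagging that it stopped
stop_df = vent_df.copy()
stop_df['ts'] = stop_df['ventendoffset']
stop_df['onvent'] = 0

# Stack start and stop events, then order them per unit stay and timestamp
events_df = (pd.concat([vent_df, stop_df], ignore_index=True)
               .drop(columns='ventendoffset')
               .sort_values(['patientunitstayid', 'ts'])
               .reset_index(drop=True))
print(events_df)
```

Each unit stay ends up with one row where `onvent` is 1 (ventilation start) and one where it is 0 (ventilation end).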

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
resp_care_df.columns = du.data_processing.clean_naming(resp_care_df.columns)
resp_care_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
resp_care_df.to_csv(f'{data_path}cleaned/unnormalized/respiratoryCare.csv')

Save the dataframe after normalizing:

In [None]:
resp_care_df.to_csv(f'{data_path}cleaned/normalized/respiratoryCare.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
resp_care_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
resp_care_df = pd.read_csv(f'{data_path}cleaned/normalized/respiratoryCare.csv')
resp_care_df.head()

In [None]:
len(resp_care_df)

In [None]:
resp_care_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, resp_care_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
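
To make the tolerance behaviour concrete, here's a minimal sketch in plain pandas with made-up toy values (`heartrate` is just a hypothetical column). `merge_asof` needs both dataframes sorted by the `on` key and behaves like a left join, so rows from the right dataframe with no left row within the tolerance aren't carried over:

```python
import pandas as pd

left_df = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                        'ts': [0, 100, 50],
                        'heartrate': [80, 85, 90]}).sort_values('ts')
right_df = pd.DataFrame({'patientunitstayid': [1, 2],
                         'ts': [20, 200],
                         'onvent': [1, 1]}).sort_values('ts')

# Match each left row with the nearest right row of the same unit stay,
# as long as their timestamps differ by at most 30 minutes
merged_df = pd.merge_asof(left_df, right_df, on='ts', by='patientunitstayid',
                          direction='nearest', tolerance=30)
print(merged_df)
```

Only the row of stay 1 at `ts` 0 gets a ventilation value (20 minutes away); the other two rows are more than 30 minutes from any match, so they stay as NaN.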

## Respiratory charting data

### Read the data

In [None]:
resp_chart_df = pd.read_csv(f'{data_path}original/respiratoryCharting.csv')
resp_chart_df.head()

In [None]:
len(resp_chart_df)

In [None]:
resp_chart_df.patientunitstayid.nunique()

Only 13001 unit stays have nurse care data. Might not be useful to include them.

Get an overview of the dataframe through the `describe` method:

In [None]:
resp_chart_df.describe().transpose()

In [None]:
resp_chart_df.columns

In [None]:
resp_chart_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(resp_chart_df)

### Remove unneeded features

In [None]:
resp_chart_df.celllabel.value_counts()

In [None]:
resp_chart_df.cellattribute.value_counts()

In [None]:
resp_chart_df.cellattributevalue.value_counts()

In [None]:
resp_chart_df.cellattributepath.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Intervention'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Neurologic'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Pupils'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Edema'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Secretions'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Cough'].cellattributevalue.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Neurologic'].cellattribute.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Pupils'].cellattribute.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Secretions'].cellattribute.value_counts()

In [None]:
resp_chart_df[resp_chart_df.celllabel == 'Cough'].cellattribute.value_counts()

Besides the usual removal of the row identifier, `nurseAssessID`, and the timestamp of when the data was added, `nurseAssessEntryOffset`, I'm also removing `cellattributepath` and `cellattribute`, which are redundant with `celllabel`. Regarding data categories, I'm only keeping `Neurologic`, `Pupils`, `Secretions` and `Cough`, as the remaining ones either don't add much value, have too little data or are redundant with data from other tables.

In [None]:
resp_chart_df = resp_chart_df.drop(['nurseassessid', 'nurseassessentryoffset',
                                      'cellattributepath', 'cellattribute'], axis=1)
resp_chart_df.head()

In [None]:
categories_to_keep = ['Neurologic', 'Pupils', 'Secretions', 'Cough']

In [None]:
resp_chart_df.celllabel.isin(categories_to_keep).head()

In [None]:
resp_chart_df = resp_chart_df[resp_chart_df.celllabel.isin(categories_to_keep)]
resp_chart_df.head()

### Convert categories to features

Make the `celllabel` and `cellattributevalue` columns of type categorical:

In [None]:
# `categorize` is a Dask method; with a pandas-style API, `astype` does the conversion
resp_chart_df = resp_chart_df.astype({'celllabel': 'category', 'cellattributevalue': 'category'})

In [None]:
resp_chart_df.head()

Transform the `celllabel` categories and `cellattributevalue` values into separate features:
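
A minimal sketch of one way to do such a transformation, in plain pandas on made-up toy values. The column names mirror the ones above, but the approach itself is an assumption, not necessarily the one originally used:

```python
import pandas as pd

toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                       'nurseassessoffset': [5, 5, 9],
                       'celllabel': ['Cough', 'Pupils', 'Cough'],
                       'cellattributevalue': ['Strong', 'Equal', 'Weak']})

# One new column per category, holding the value only on rows of that category
for label in toy_df['celllabel'].unique():
    toy_df[label] = toy_df['cellattributevalue'].where(toy_df['celllabel'] == label)
print(toy_df)
```

Rows that belong to a different category are left as NaN in each new column.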

Now we have the categories separated into their own features, as desired.

Remove the old `celllabel` and `cellattributevalue` columns:

In [None]:
resp_chart_df = resp_chart_df.drop(['celllabel', 'cellattributevalue'], axis=1)
resp_chart_df.head()

In [None]:
resp_chart_df['Neurologic'].value_counts()

In [None]:
resp_chart_df['Pupils'].value_counts()

In [None]:
resp_chart_df['Secretions'].value_counts()

In [None]:
resp_chart_df['Cough'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Pupils', 'Neurologic', 'Secretions', 'Cough']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [resp_chart_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
resp_chart_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    resp_chart_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(resp_chart_df, feature)

In [None]:
resp_chart_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
resp_chart_df[new_cat_feat].dtypes
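
The selection and enumeration steps above rely on `du.embedding.enum_categorical_feature`, whose internals aren't shown here. As a hedged sketch of the general idea in plain pandas, with made-up toy values and a toy threshold, and assuming that missing values get the number 0:

```python
import pandas as pd

toy_df = pd.DataFrame({'Cough': ['Strong', 'Weak', None, 'Strong', 'Weak'],
                       'Secretions': ['thin', 'thick', None, 'bloody', 'thin']})
max_one_hot = 2  # toy threshold; the notebook uses 5

# Features with more categories than the threshold are marked for embedding
embed_feat = [col for col in toy_df.columns if toy_df[col].nunique() > max_one_hot]

enum_maps = {}
for col in embed_feat:
    # factorize numbers categories from 0 and uses -1 for missing values;
    # shifting by 1 makes missing values map to 0 (an assumption of this sketch)
    codes, uniques = pd.factorize(toy_df[col])
    toy_df[col] = codes + 1
    enum_maps[col] = {'nan': 0, **{cat: i + 1 for i, cat in enumerate(uniques)}}
print(embed_feat)
print(enum_maps)
```

Here only `Secretions` goes over the toy threshold, so it's the only feature that gets enumerated.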

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
resp_chart_df['ts'] = resp_chart_df['nurseassessoffset']
resp_chart_df = resp_chart_df.drop('nurseassessoffset', axis=1)
resp_chart_df.head()

Remove duplicate rows:

In [None]:
len(resp_chart_df)

In [None]:
resp_chart_df = resp_chart_df.drop_duplicates()
resp_chart_df.head()

In [None]:
len(resp_chart_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
resp_chart_df = resp_chart_df.set_index('ts')
resp_chart_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
resp_chart_df.reset_index().head()

In [None]:
resp_chart_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough').head()

In [None]:
resp_chart_df[resp_chart_df.patientunitstayid == 2553254].head(10)

We can see that there are up to 80 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
resp_chart_df = du.embedding.join_categorical_enum(resp_chart_df, new_cat_embed_feat)
resp_chart_df.head()

In [None]:
resp_chart_df.dtypes

In [None]:
resp_chart_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough').head()

In [None]:
resp_chart_df[resp_chart_df.patientunitstayid == 2553254].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
resp_chart_df.columns = du.data_processing.clean_naming(resp_chart_df.columns)
resp_chart_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
resp_chart_df.to_csv(f'{data_path}cleaned/unnormalized/respiratoryCharting.csv')

Save the dataframe after normalizing:

In [None]:
resp_chart_df.to_csv(f'{data_path}cleaned/normalized/respiratoryCharting.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
resp_chart_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
resp_chart_df = pd.read_csv(f'{data_path}cleaned/normalized/respiratoryCharting.csv')
resp_chart_df.head()

In [None]:
len(resp_chart_df)

In [None]:
resp_chart_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, resp_chart_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Allergy data

### Read the data

In [None]:
alrg_df = pd.read_csv(f'{data_path}original/allergy.csv')
alrg_df.head()

In [None]:
len(alrg_df)

In [None]:
alrg_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
alrg_df.describe().transpose()

In [None]:
alrg_df.columns

In [None]:
alrg_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(alrg_df)

### Remove unneeded features

In [None]:
alrg_df[alrg_df.allergytype == 'Non Drug'].drughiclseqno.value_counts()

In [None]:
alrg_df[alrg_df.allergytype == 'Drug'].drughiclseqno.value_counts()

As we can see, the drug features in this table only have data if the allergy derives from using the drug. As such, we don't need the `allergytype` feature. I'm also ignoring hospital staff related information and using just the drug codes instead of the drug names, as the codes are independent of the drug brand.

In [None]:
alrg_df.allergynotetype.value_counts()

The `allergynotetype` feature also doesn't seem very relevant, so I'm discarding it.

In [None]:
alrg_df = alrg_df[['patientunitstayid', 'allergyoffset',
                   'allergyname', 'drughiclseqno']]
alrg_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

In the case of allergy data, `drughiclseqno` is already a numeric code, so it skips the enumeration encoding; only its missing values need to be filled in.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['allergyname', 'drughiclseqno']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [alrg_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
alrg_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Skip the 'drughiclseqno' from enumeration encoding
    if feature == 'drughiclseqno':
        continue
    # Prepare for embedding, i.e. enumerate categories
    alrg_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(alrg_df, feature)

Fill missing values of the drug data with 0, so as to prepare for embedding:

In [None]:
alrg_df.drughiclseqno = alrg_df.drughiclseqno.fillna(0).astype(int)

In [None]:
alrg_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
alrg_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
alrg_df['ts'] = alrg_df['allergyoffset']
alrg_df = alrg_df.drop('allergyoffset', axis=1)
alrg_df.head()

Remove duplicate rows:

In [None]:
len(alrg_df)

In [None]:
alrg_df = alrg_df.drop_duplicates()
alrg_df.head()

In [None]:
len(alrg_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
alrg_df = alrg_df.set_index('ts')
alrg_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
alrg_df.reset_index().head()

In [None]:
alrg_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='allergyname').head()

In [None]:
alrg_df[alrg_df.patientunitstayid == 3197554].head(10)

We can see that there are up to 47 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
alrg_df = du.embedding.join_categorical_enum(alrg_df, new_cat_embed_feat)
alrg_df.head()
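
As a rough sketch of what concatenating enumerations could look like, assuming plain pandas, made-up toy values and a `;` separator (the actual format used by `join_categorical_enum` may differ):

```python
import pandas as pd

toy_df = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                       'ts': [10, 10, 5],
                       'allergyname': [3, 7, 3]})

# Collapse rows sharing the same (unit stay, timestamp) into one,
# joining the distinct enumerations into a single ';'-separated string
joined_df = (toy_df.groupby(['patientunitstayid', 'ts'], as_index=False)
                   .agg({'allergyname': lambda codes: ';'.join(str(c) for c in sorted(set(codes)))}))
print(joined_df)
```

This keeps every category that was registered at a given timestamp, instead of arbitrarily picking one of them.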

In [None]:
alrg_df.dtypes

In [None]:
alrg_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='allergyname').head()

In [None]:
alrg_df[alrg_df.patientunitstayid == 3197554].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
alrg_df = alrg_df.rename(columns={'drughiclseqno':'drugallergyhiclseqno'})
alrg_df.head()

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
alrg_df.columns = du.data_processing.clean_naming(alrg_df.columns)
alrg_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
alrg_df.to_csv(f'{data_path}cleaned/unnormalized/allergy.csv')

Save the dataframe after normalizing:

In [None]:
alrg_df.to_csv(f'{data_path}cleaned/normalized/allergy.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
alrg_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
alrg_df = pd.read_csv(f'{data_path}cleaned/normalized/allergy.csv')
alrg_df.head()

In [None]:
len(alrg_df)

In [None]:
alrg_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, alrg_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

In [None]:
# [TODO] Check if careplangeneral table could be useful. It seems to have mostly subjective data.

## General care plan data

### Read the data

In [None]:
careplangen_df = pd.read_csv(f'{data_path}original/carePlanGeneral.csv')
careplangen_df.head()

In [None]:
len(careplangen_df)

In [None]:
careplangen_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
careplangen_df.describe().transpose()

In [None]:
careplangen_df.columns

In [None]:
careplangen_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(careplangen_df)

### Remove unneeded features

In [None]:
careplangen_df.cplgroup.value_counts()

In [None]:
careplangen_df.cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Activity'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Care Limitation'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Route-Status'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Critical Care Discharge/Transfer Planning'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Safety/Restraints'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Sedation'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Analgesia'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Ordered Protocols'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Volume Status'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Psychosocial Status'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Current Rate'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Baseline Status'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Protein'].cplitemvalue.value_counts()

In [None]:
careplangen_df[careplangen_df.cplgroup == 'Calories'].cplitemvalue.value_counts()

In this case, there aren't entire columns to remove. However, some specific types of care plan categories seem to be less relevant (e.g. activity, critical care discharge/transfer planning) or redundant (e.g. ventilation, infectious diseases). So, we're going to remove rows that have those categories.

In [None]:
careplangen_df = careplangen_df.drop('cplgeneralid', axis=1)
careplangen_df.head()

In [None]:
categories_to_remove = ['Ventilation', 'Airway', 'Activity', 'Care Limitation',
                        'Route-Status', 'Critical Care Discharge/Transfer Planning',
                        'Ordered Protocols', 'Acuity', 'Volume Status', 'Prognosis',
                        'Care Providers', 'Family/Health Care Proxy/Contact Info', 'Current Rate',
                        'Daily Goals/Safety Risks/Discharge Requirements', 'Goal Rate',
                        'Planned Procedures', 'Infectious Disease',
                        'Care Plan Reviewed with Patient/Family', 'Protein', 'Calories']

In [None]:
~(careplangen_df.cplgroup.isin(categories_to_remove)).head()

In [None]:
careplangen_df = careplangen_df[~(careplangen_df.cplgroup.isin(categories_to_remove))]
careplangen_df.head()

In [None]:
len(careplangen_df)

In [None]:
careplangen_df.patientunitstayid.nunique()

There's still plenty of data left, affecting around 92.48% of the unit stays, even after removing several categories.
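
A coverage figure like this one can be obtained by comparing the number of unit stays left after filtering with the total number of unit stays. A minimal sketch in plain pandas, with made-up toy values standing in for the real tables:

```python
import pandas as pd

patient_ids = pd.Series([1, 2, 3, 4])  # toy stand-in for all the unit stays
table_df = pd.DataFrame({'patientunitstayid': [1, 1, 2, 4],
                         'cplgroup': ['Sedation', 'Activity', 'Analgesia', 'Activity']})
categories_to_remove = ['Activity']

# Drop the rows of unwanted categories, then measure how many unit stays remain
kept_df = table_df[~table_df['cplgroup'].isin(categories_to_remove)]
coverage = 100 * kept_df['patientunitstayid'].nunique() / patient_ids.nunique()
print(f'{coverage:.2f}% of the unit stays still have data')
```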

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['cplgroup', 'cplitemvalue']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [careplangen_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
careplangen_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    careplangen_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(careplangen_df, feature)

In [None]:
careplangen_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
careplangen_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
careplangen_df['ts'] = careplangen_df['cplitemoffset']
careplangen_df = careplangen_df.drop('cplitemoffset', axis=1)
careplangen_df.head()

Remove duplicate rows:

In [None]:
len(careplangen_df)

In [None]:
careplangen_df = careplangen_df.drop_duplicates()
careplangen_df.head()

In [None]:
len(careplangen_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
careplangen_df = careplangen_df.set_index('ts')
careplangen_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
careplangen_df.reset_index().head()

In [None]:
careplangen_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='cplgroup').head()

In [None]:
careplangen_df[careplangen_df.patientunitstayid == 3138123].head(10)

We can see that there are up to 32 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
careplangen_df = du.embedding.join_categorical_enum(careplangen_df, new_cat_embed_feat)
careplangen_df.head()

In [None]:
careplangen_df.dtypes

In [None]:
careplangen_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='cplgroup').head()

In [None]:
careplangen_df[careplangen_df.patientunitstayid == 3138123].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

I'm keeping the `activeupondischarge` feature so that, once we have the full dataframe, we can decide whether to forward fill each general care plan value or leave it as NaN. However, the feature's name needs to identify its original table, general care plan, so that it isn't confused with other data.

In [None]:
careplangen_df = careplangen_df.rename(columns={'activeupondischarge':'cpl_activeupondischarge'})
careplangen_df.head()

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
careplangen_df.columns = du.data_processing.clean_naming(careplangen_df.columns)
careplangen_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
careplangen_df.to_csv(f'{data_path}cleaned/unnormalized/carePlanGeneral.csv')

Save the dataframe after normalizing:

In [None]:
careplangen_df.to_csv(f'{data_path}cleaned/normalized/carePlanGeneral.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
careplangen_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
careplangen_df = pd.read_csv(f'{data_path}cleaned/normalized/carePlanGeneral.csv')
careplangen_df.head()

In [None]:
len(careplangen_df)

In [None]:
careplangen_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, careplangen_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Past history data

### Read the data

In [None]:
pasthist_df = pd.read_csv(f'{data_path}original/pastHistory.csv')
pasthist_df.head()

In [None]:
len(pasthist_df)

In [None]:
pasthist_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
pasthist_df.describe().transpose()

In [None]:
pasthist_df.columns

In [None]:
pasthist_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(pasthist_df)

### Remove unneeded features

In [None]:
pasthist_df.pasthistorypath.value_counts().head(20)

In [None]:
pasthist_df.pasthistorypath.value_counts().tail(20)

In [None]:
pasthist_df.pasthistoryvalue.value_counts()

In [None]:
pasthist_df.pasthistorynotetype.value_counts()

In [None]:
pasthist_df[pasthist_df.pasthistorypath == 'notes/Progress Notes/Past History/Past History Obtain Options/Performed'].pasthistoryvalue.value_counts()

In this case, considering that it regards past diagnoses of the patients, the timestamp when each one was observed probably is neither very reliable nor useful. As such, I'm going to remove the offset variables. Furthermore, `pasthistoryvaluetext` is redundant with `pasthistoryvalue`, while `pasthistorynotetype` and the past history path 'notes/Progress Notes/Past History/Past History Obtain Options/Performed' seem to be irrelevant.

In [None]:
pasthist_df = pasthist_df.drop(['pasthistoryid', 'pasthistoryoffset', 'pasthistoryenteredoffset',
                                'pasthistorynotetype', 'pasthistoryvaluetext'], axis=1)
pasthist_df.head()

In [None]:
categories_to_remove = ['notes/Progress Notes/Past History/Past History Obtain Options/Performed']

In [None]:
~(pasthist_df.pasthistorypath.isin(categories_to_remove)).head()

In [None]:
pasthist_df = pasthist_df[~(pasthist_df.pasthistorypath.isin(categories_to_remove))]
pasthist_df.head()

In [None]:
len(pasthist_df)

In [None]:
pasthist_df.patientunitstayid.nunique()

In [None]:
pasthist_df.pasthistorypath.value_counts().head(20)

In [None]:
pasthist_df.pasthistorypath.value_counts().tail(20)

In [None]:
pasthist_df.pasthistoryvalue.value_counts()

There's still plenty of data left, affecting around 81.87% of the unit stays, even after removing several categories.

### Separate high level notes

In [None]:
pasthist_df.pasthistorypath.map(lambda x: x.split('/')).head().values

In [None]:
pasthist_df.pasthistorypath.map(lambda x: len(x.split('/'))).min()

In [None]:
pasthist_df.pasthistorypath.map(lambda x: len(x.split('/'))).max()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='/')).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='/')).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='/')).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 3, separator='/')).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='/')).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='/')).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 6, separator='/')).value_counts()

The notes always have at least 5 levels. As the first 4 are essentially always the same ("notes/Progress Notes/Past History/Organ Systems/") and the 5th one tends to not be very specific (it only indicates which organ system was affected, when it isn't just a case of no health problems detected), it's best to preserve the 5th level and isolate the remaining string as a new feature. This way, the split gives the model further insight into similar notes.

In [None]:
pasthist_df['pasthistorytype'] = pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='/'), meta=('x', str))
pasthist_df['pasthistorydetails'] = pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='/', till_the_end=True), meta=('x', str))
pasthist_df.head()
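
As a minimal sketch of the split done in the previous cell, the plain-Python helper below mimics what `du.search_explore.get_element_from_split` is assumed to do here: return the n-th piece of a separated string, or everything from the n-th piece onwards when `till_the_end=True`. The helper name `element_from_split`, its behaviour for missing levels and the example path are all hypothetical, not taken from `data_utils` or from the dataset.

```python
def element_from_split(text, n, separator='/', till_the_end=False):
    """Hypothetical stand-in for du.search_explore.get_element_from_split."""
    pieces = text.split(separator)
    if n >= len(pieces):
        # Assumption: return an empty string when the level does not exist
        return ''
    if till_the_end:
        # Keep everything from level n onwards as a single string
        return separator.join(pieces[n:])
    return pieces[n]


# Illustrative path (made up for this example)
path = ('notes/Progress Notes/Past History/Organ Systems/'
        'Cardiovascular/Hypertension/hypertension requiring treatment')

history_type = element_from_split(path, 4)
history_details = element_from_split(path, 5, till_the_end=True)
```

Here `history_type` holds the 5th level (the organ system) and `history_details` holds the rest of the path, still joined by the separator.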

`pasthistoryvalue` seems to correspond to the last element of `pasthistorydetails`. Let's confirm it:

In [None]:
pasthist_df['pasthistorydetails_last'] = pasthist_df.pasthistorydetails.map(lambda x: x.split('/')[-1])
pasthist_df.head()

Compare columns `pasthistoryvalue` and `pasthistorydetails`'s last element:

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue != pasthist_df.pasthistorydetails_last]

The previous output confirms that the last element of the newly created `pasthistorydetails` feature (the last string in the separator-delimited list) is almost exactly equal to the already existing `pasthistoryvalue` feature. The differences are that `pasthistoryvalue` accounts for the cases of no health problems detected and behaves correctly for strings that contain the separator symbol. So, we should remove the last element of `pasthistorydetails`:

In [None]:
pasthist_df = pasthist_df.drop('pasthistorydetails_last', axis=1)
pasthist_df.head()

In [None]:
pasthist_df['pasthistorydetails'] = pasthist_df.pasthistorydetails.apply(lambda x: '/'.join(x.split('/')[:-1]), meta=('pasthistorydetails', str))
pasthist_df.head()

Remove irrelevant `Not Obtainable` and `Not Performed` values:

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'Not Obtainable'].pasthistorydetails.value_counts()

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'Not Performed'].pasthistorydetails.value_counts()

In [None]:
pasthist_df = pasthist_df[~((pasthist_df.pasthistoryvalue == 'Not Obtainable') | (pasthist_df.pasthistoryvalue == 'Not Performed'))]
pasthist_df.head()

In [None]:
pasthist_df.pasthistorytype.unique()

Replace blank `pasthistorydetails` values:

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'No Health Problems'].pasthistorydetails.value_counts()

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'No Health Problems'].pasthistorydetails.value_counts().index

In [None]:
pasthist_df[pasthist_df.pasthistorydetails == ''].head()

In [None]:
pasthist_df['pasthistorydetails'] = pasthist_df.apply(lambda df: 'No Health Problems' if df['pasthistorytype'] == 'No Health Problems'
                                                                 else df['pasthistorydetails'],
                                                      axis=1, meta=(None, str))
pasthist_df.head()

In [None]:
pasthist_df[pasthist_df.pasthistorydetails == '']

Remove the now redundant `pasthistorypath` column:

In [None]:
pasthist_df = pasthist_df.drop('pasthistorypath', axis=1)
pasthist_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['pasthistoryvalue', 'pasthistorytype', 'pasthistorydetails']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [pasthist_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
pasthist_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    pasthist_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(pasthist_df, feature)
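
To make the enumeration step more concrete, here is a minimal pure-Python sketch of the idea: each distinct category gets an integer and the mapping is kept so that it can be saved afterwards. It's not the implementation of `du.embedding.enum_categorical_feature`; the function name `enumerate_categories`, the choice of reserving 0 for missing values and the example units are assumptions made just for illustration.

```python
import math


def enumerate_categories(values):
    """Map each distinct category to an integer, reserving 0 for missing values."""
    # Assumption: missing values share the reserved index 0
    mapping = {'nan': 0}
    encoded = []
    for value in values:
        # Treat None and float NaN as missing
        if value is None or (isinstance(value, float) and math.isnan(value)):
            encoded.append(0)
            continue
        if value not in mapping:
            # New category: give it the next free integer
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded, mapping


# Made-up categorical column
drug_units = ['mg', 'ml', None, 'mg', 'units', 'ml']
encoded_units, unit_enum = enumerate_categories(drug_units)
```

The returned dictionary plays the role of one entry of `cat_embed_feat_enum`, i.e. it's what allows decoding the integers back into the original strings.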

In [None]:
pasthist_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
pasthist_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Remove duplicate rows

Remove duplicate rows:

In [None]:
len(pasthist_df)

In [None]:
pasthist_df = pasthist_df.drop_duplicates()
pasthist_df.head()

In [None]:
len(pasthist_df)

Check for possible multiple rows with the same unit stay ID:

In [None]:
pasthist_df.groupby(['patientunitstayid']).count().nlargest(n=5, columns='pasthistoryvalue').head()

In [None]:
pasthist_df[pasthist_df.patientunitstayid == 1558102].head(10)

We can see that there are up to 20 categories per `patientunitstayid`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
pasthist_df = du.embedding.join_categorical_enum(pasthist_df, new_cat_embed_feat, id_columns=['patientunitstayid'])
pasthist_df.head()
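
The joining step can be sketched in plain pandas as a `groupby` that concatenates the enumerations of each ID into a single string. This is only an illustration under assumptions: the `;` separator, the removal of repeated values and the helper name `join_unique` are hypothetical and not confirmed to match `du.embedding.join_categorical_enum`, and the data is made up.

```python
import pandas as pd

# Made-up enumerated past history rows: several categories per unit stay
df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'pasthistoryvalue': [3, 7, 3, 5],
})


def join_unique(series, separator=';'):
    # Concatenate the distinct enumerations, in order of first appearance
    return separator.join(str(value) for value in series.drop_duplicates())


# One row per unit stay, with all of its categories in a single string
joined = (df.groupby('patientunitstayid', as_index=False)
            .agg({'pasthistoryvalue': join_unique}))
```

After this, each `patientunitstayid` appears only once, which is the property that the two checks below verify on the real dataframe.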

In [None]:
pasthist_df.dtypes

In [None]:
pasthist_df.groupby(['patientunitstayid']).count().nlargest(n=5, columns='pasthistoryvalue').head()

In [None]:
pasthist_df[pasthist_df.patientunitstayid == 1558102].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
pasthist_df.columns = du.data_processing.clean_naming(pasthist_df.columns)
pasthist_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
pasthist_df.to_csv(f'{data_path}cleaned/unnormalized/pastHistory.csv')

No normalization is applied to this table in the steps above, so the same dataframe is also saved in the normalized folder:

In [None]:
pasthist_df.to_csv(f'{data_path}cleaned/normalized/pastHistory.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
pasthist_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
pasthist_df = pd.read_csv(f'{data_path}cleaned/normalized/pastHistory.csv')
pasthist_df.head()

In [None]:
len(pasthist_df)

In [None]:
pasthist_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, pasthist_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Infusion drug data

### Read the data

In [None]:
infdrug_df = pd.read_csv(f'{data_path}original/infusionDrug.csv')
infdrug_df.head()

In [None]:
len(infdrug_df)

In [None]:
infdrug_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
infdrug_df.describe().transpose()

In [None]:
infdrug_df.columns

In [None]:
infdrug_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(infdrug_df)

### Remove unneeded features

Besides removing the row ID `infusiondrugid`, I'm also removing `infusionrate`, `volumeoffluid` and `drugamount`, as they seem redundant with `drugrate` while having a lot more missing values.

In [None]:
infdrug_df = infdrug_df.drop(['infusiondrugid', 'infusionrate', 'volumeoffluid', 'drugamount'], axis=1)
infdrug_df.head()

### Remove string drug rate values

In [None]:
infdrug_df[infdrug_df.drugrate.map(du.utils.is_definitely_string)].head()

In [None]:
infdrug_df[infdrug_df.drugrate.map(du.utils.is_definitely_string)].drugrate.value_counts()

In [None]:
infdrug_df.drugrate = infdrug_df.drugrate.map(lambda x: np.nan if du.utils.is_definitely_string(x) else x)
infdrug_df.head()
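
For comparison, the same cleanup can be sketched with plain pandas, swapping `du.utils.is_definitely_string` for `pd.to_numeric` with `errors='coerce'`. This is a different technique from the one used above and might not treat every edge case in the same way; the example values (`'OFF'`, `'ERROR'`) are made up.

```python
import pandas as pd

# Made-up drug rate column mixing numbers stored as text with free-text entries
drugrate = pd.Series(['12.5', 'OFF', '0', 'ERROR', '3'])

# Entries that cannot be parsed as numbers become NaN
drugrate_numeric = pd.to_numeric(drugrate, errors='coerce')
```

The result is already a float column, so the later `astype(float)` conversion wouldn't be needed in this variant.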

In [None]:
infdrug_df.patientunitstayid = infdrug_df.patientunitstayid.astype(int)
infdrug_df.infusionoffset = infdrug_df.infusionoffset.astype(int)
infdrug_df.drugname = infdrug_df.drugname.astype(str)
infdrug_df.drugrate = infdrug_df.drugrate.astype(float)
infdrug_df.patientweight = infdrug_df.patientweight.astype(float)
infdrug_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['drugname']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [infdrug_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
infdrug_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    infdrug_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(infdrug_df, feature)

In [None]:
infdrug_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
infdrug_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
infdrug_df['ts'] = infdrug_df['infusionoffset']
infdrug_df = infdrug_df.drop('infusionoffset', axis=1)
infdrug_df.head()

Standardize drug names:

In [None]:
infdrug_df = du.data_processing.clean_categories_naming(infdrug_df, 'drugname')
infdrug_df.head()

Remove duplicate rows:

In [None]:
len(infdrug_df)

In [None]:
infdrug_df = infdrug_df.drop_duplicates()
infdrug_df.head()

In [None]:
len(infdrug_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
infdrug_df = infdrug_df.set_index('ts')
infdrug_df.head(6)

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
infdrug_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='drugname').head()

In [None]:
infdrug_df[infdrug_df.patientunitstayid == 1785711].head(20)

We can see that there are up to 17 categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, as we shouldn't mix absolute values of drug rates from different drugs, we should normalize them.

### Normalize data

In [None]:
infdrug_df_norm = du.data_processing.normalize_data(infdrug_df,
                                                 columns_to_normalize=['patientweight'],
                                                 columns_to_normalize_cat=[('drugname', 'drugrate')])
infdrug_df_norm.head()
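
To illustrate why the drug rate is normalized per drug, the sketch below z-scores each rate against the statistics of its own drug, using a plain pandas `groupby`. That `du.data_processing.normalize_data` uses z-scores (instead of, for instance, min-max scaling) is an assumption here, and the data is made up.

```python
import pandas as pd

# Made-up infusion rows: two drugs with very different rate scales
df = pd.DataFrame({
    'drugname': ['a', 'a', 'a', 'b', 'b', 'b'],
    'drugrate': [1.0, 2.0, 3.0, 100.0, 200.0, 300.0],
})

# Z-score each rate against the mean and standard deviation of its own drug
grouped = df.groupby('drugname')['drugrate']
df['drugrate_norm'] = (df['drugrate'] - grouped.transform('mean')) / grouped.transform('std')
```

Both drugs end up on the same scale, so joining rates of different drugs in the same row no longer mixes incomparable absolute values.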

In [None]:
infdrug_df_norm.patientweight.value_counts()

### Join rows that have the same IDs

In [None]:
infdrug_df_norm = du.embedding.join_categorical_enum(infdrug_df_norm, new_cat_embed_feat)
infdrug_df_norm.head()

In [None]:
infdrug_df_norm.dtypes

In [None]:
infdrug_df_norm.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='drugname').head()

In [None]:
infdrug_df_norm[infdrug_df_norm.patientunitstayid == 1785711].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
infdrug_df = infdrug_df.rename(columns={'patientweight': 'weight', 'drugname': 'infusion_drugname',
                                        'drugrate': 'infusion_drugrate'})
infdrug_df.head()

In [None]:
infdrug_df_norm = infdrug_df_norm.rename(columns={'patientweight': 'weight', 'drugname': 'infusion_drugname',
                                                  'drugrate': 'infusion_drugrate'})
infdrug_df_norm.head()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
infdrug_df.columns = du.data_processing.clean_naming(infdrug_df.columns)
infdrug_df_norm.columns = du.data_processing.clean_naming(infdrug_df_norm.columns)
infdrug_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
infdrug_df.to_csv(f'{data_path}cleaned/unnormalized/infusionDrug.csv')

Save the dataframe after normalizing:

In [None]:
infdrug_df_norm.to_csv(f'{data_path}cleaned/normalized/infusionDrug.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
infdrug_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
infdrug_df = pd.read_csv(f'{data_path}cleaned/normalized/infusionDrug.csv')
infdrug_df.head()

In [None]:
len(infdrug_df)

In [None]:
infdrug_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, infdrug_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
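
To see what the tolerance does in this kind of merge, here's a small self-contained sketch with made-up data, where `ts` is in minutes and the `heartrate` column is just a hypothetical feature standing in for the rest of `eICU_df`. Note that `pd.merge_asof` requires both dataframes to be sorted by the `on` key.

```python
import pandas as pd

# Made-up timelines, with `ts` in minutes since unit admission
left = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [0, 100, 50],
    'heartrate': [80, 95, 70],
}).sort_values('ts')
right = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [20, 200, 55],
    'infusion_drugrate': [0.5, 1.2, 0.3],
}).sort_values('ts')

# For each left row, take the nearest right row of the same unit stay,
# as long as it's at most 30 minutes away
merged = pd.merge_asof(left, right, on='ts', by='patientunitstayid',
                       direction='nearest', tolerance=30)
```

The row at `ts` 100 has no infusion data within 30 minutes, so it gets NaN. As this behaves like a left join, only the timestamps of the left dataframe are kept.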

## Diagnosis data

### Read the data

In [None]:
diagn_df = pd.read_csv(f'{data_path}original/diagnosis.csv')
diagn_df.head()

In [None]:
len(diagn_df)

In [None]:
diagn_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
diagn_df.describe().transpose()

In [None]:
diagn_df.columns

In [None]:
diagn_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(diagn_df)

### Remove unneeded features

Besides the usual removal of the row identifier, `diagnosisid`, I'm also removing `diagnosispriority`, which seems irrelevant (and subjective); `icd9code`, which is redundant and has missing values and other issues; and `activeupondischarge`, as we don't have complete information on when diagnoses end.

In [None]:
diagn_df = diagn_df.drop(['diagnosisid', 'diagnosispriority', 'icd9code', 'activeupondischarge'], axis=1)
diagn_df.head()

### Separate high level diagnosis

In [None]:
diagn_df.diagnosisstring.value_counts()

In [None]:
diagn_df.diagnosisstring.map(lambda x: x.split('|')).head()

In [None]:
diagn_df.diagnosisstring.map(lambda x: len(x.split('|'))).min()

There are always at least 2 higher level diagnoses. It could be beneficial to extract those first 2 levels into separate features, so as to avoid the need for the model to learn similarities that are already known.

In [None]:
diagn_df['diagnosis_type_1'] = diagn_df.diagnosisstring.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='|'), meta=('x', str))
diagn_df['diagnosis_disorder_2'] = diagn_df.diagnosisstring.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='|'), meta=('x', str))
diagn_df['diagnosis_detailed_3'] = diagn_df.diagnosisstring.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='|', till_the_end=True), meta=('x', str))
# Remove now redundant `diagnosisstring` feature
diagn_df = diagn_df.drop('diagnosisstring', axis=1)
diagn_df.head()
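
An alternative way of getting the same three columns, sketched in plain pandas, is to split each string at most twice, so that the third column keeps the remainder of the string. This swaps the `du` helper for `Series.str.split`, and the diagnosis strings below are made up to follow the `level|level|...` pattern.

```python
import pandas as pd

# Made-up diagnosis strings following the `level|level|...` pattern
diagnosis = pd.Series([
    'cardiovascular|shock / hypotension|hypotension',
    'pulmonary|respiratory failure|acute respiratory failure|due to pneumonia',
])

# Split at most twice, so the third column keeps the rest of the string
levels = diagnosis.str.split('|', n=2, expand=True)
levels.columns = ['diagnosis_type_1', 'diagnosis_disorder_2', 'diagnosis_detailed_3']
```

A string with only 2 levels would get a missing value in `diagnosis_detailed_3`, which would then need to be handled before the enumeration.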

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['diagnosis_type_1', 'diagnosis_disorder_2', 'diagnosis_detailed_3']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [diagn_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
diagn_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    diagn_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(diagn_df, feature)

In [None]:
diagn_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
diagn_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
diagn_df['ts'] = diagn_df['diagnosisoffset']
diagn_df = diagn_df.drop('diagnosisoffset', axis=1)
diagn_df.head()

Remove duplicate rows:

In [None]:
len(diagn_df)

In [None]:
diagn_df = diagn_df.drop_duplicates()
diagn_df.head()

In [None]:
len(diagn_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
diagn_df = diagn_df.set_index('ts')
diagn_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
diagn_df.reset_index().head()

In [None]:
diagn_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='diagnosis_type_1').head()

In [None]:
diagn_df[diagn_df.patientunitstayid == 3089982].head(10)

We can see that there are up to 69 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
diagn_df = du.embedding.join_categorical_enum(diagn_df, new_cat_embed_feat)
diagn_df.head()

In [None]:
diagn_df.dtypes

In [None]:
diagn_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='diagnosis_type_1').head()

In [None]:
diagn_df[diagn_df.patientunitstayid == 3089982].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
diagn_df.columns = du.data_processing.clean_naming(diagn_df.columns)
diagn_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
diagn_df.to_csv(f'{data_path}cleaned/unnormalized/diagnosis.csv')

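No normalization is applied to this table, as it only has categorical features, so the same dataframe is also saved in the normalized folder: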

In [None]:
diagn_df.to_csv(f'{data_path}cleaned/normalized/diagnosis.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
diagn_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
diagn_df = pd.read_csv(f'{data_path}cleaned/normalized/diagnosis.csv')
diagn_df.head()

In [None]:
len(diagn_df)

In [None]:
diagn_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, diagn_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Admission drug data

### Read the data

In [None]:
admsdrug_df = pd.read_csv(f'{data_path}original/admissionDrug.csv')
admsdrug_df.head()

In [None]:
len(admsdrug_df)

In [None]:
admsdrug_df.patientunitstayid.nunique()

There's not much admission drug data (only around 20% of the unit stays have this data). However, it might be useful, considering also that it complements the medication table.

Get an overview of the dataframe through the `describe` method:

In [None]:
admsdrug_df.describe().transpose()

In [None]:
admsdrug_df.columns

In [None]:
admsdrug_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(admsdrug_df)

### Remove unneeded features

In [None]:
admsdrug_df.drugname.value_counts()

In [None]:
admsdrug_df.drughiclseqno.value_counts()

In [None]:
admsdrug_df.drugnotetype.value_counts()

In [None]:
admsdrug_df.drugdosage.value_counts()

In [None]:
admsdrug_df.drugunit.value_counts()

In [None]:
admsdrug_df.drugadmitfrequency.value_counts()

In [None]:
admsdrug_df[admsdrug_df.drugdosage == 0].head(20)

In [None]:
admsdrug_df[admsdrug_df.drugdosage == 0].drugunit.value_counts()

In [None]:
admsdrug_df[admsdrug_df.drugdosage == 0].drugadmitfrequency.value_counts()

In [None]:
admsdrug_df[admsdrug_df.drugunit == ' '].drugdosage.value_counts()

Oddly, `drugunit` and `drugadmitfrequency` have several blank values. At the same time, when this happens, `drugdosage` tends to be 0 (which is also an unrealistic value). Considering that no NaNs are reported, these blanks and zeros probably represent missing values.

Besides removing irrelevant or hospital staff related data (e.g. `usertype`), I'm also removing the `drugname` column, which is redundant with the codes `drughiclseqno`, while also being brand dependent.

In [None]:
admsdrug_df = admsdrug_df[['patientunitstayid', 'drugoffset', 'drugdosage',
                           'drugunit', 'drugadmitfrequency', 'drughiclseqno']]
admsdrug_df.head()

### Fix missing values representation

Replace blank and unrealistic zero values with NaNs.

In [None]:
admsdrug_df.drugdosage = admsdrug_df.drugdosage.replace(to_replace=0, value=np.nan)
admsdrug_df.drugunit = admsdrug_df.drugunit.replace(to_replace=' ', value=np.nan)
admsdrug_df.drugadmitfrequency = admsdrug_df.drugadmitfrequency.replace(to_replace=' ', value=np.nan)
admsdrug_df.head()
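
The same replacement can also be written as a single `DataFrame.replace` call with one mapping per column, followed by a quick count of the resulting missing values. This is just a sketch on made-up rows that reuse the column names of this table.

```python
import numpy as np
import pandas as pd

# Made-up admission drug rows with blank and zero placeholders
df = pd.DataFrame({
    'drugdosage': [0.0, 50.0, 0.0],
    'drugunit': [' ', 'mg', ' '],
    'drugadmitfrequency': [' ', 'daily', 'twice a day'],
})

# Map each column's placeholder to NaN in a single call
df = df.replace({'drugdosage': {0: np.nan},
                 'drugunit': {' ': np.nan},
                 'drugadmitfrequency': {' ': np.nan}})

# Count the missing values that are now visible in each column
missing_per_column = df.isna().sum()
```

Restricting each placeholder to its own column avoids turning legitimate zeros or blanks of other columns into NaN by accident.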

In [None]:
du.search_explore.dataframe_missing_values(admsdrug_df)

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

In the case of admission drug data, `drughiclseqno` is skipped in the enumeration step below, as it already consists of drug codes, although it's still added to the list of features that will be embedded.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['drugunit', 'drugadmitfrequency', 'drughiclseqno']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [admsdrug_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
admsdrug_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Skip the 'drughiclseqno' from enumeration encoding
    if feature == 'drughiclseqno':
        continue
    # Prepare for embedding, i.e. enumerate categories
    admsdrug_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(admsdrug_df, feature)

In [None]:
admsdrug_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
admsdrug_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
admsdrug_df['ts'] = admsdrug_df['drugoffset']
admsdrug_df = admsdrug_df.drop('drugoffset', axis=1)
admsdrug_df.head()

Remove duplicate rows:

In [None]:
len(admsdrug_df)

In [None]:
admsdrug_df = admsdrug_df.drop_duplicates()
admsdrug_df.head()

In [None]:
len(admsdrug_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
admsdrug_df = admsdrug_df.set_index('ts')
admsdrug_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
admsdrug_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='drughiclseqno').head()

In [None]:
admsdrug_df[admsdrug_df.patientunitstayid == 2346930].head(10)

We can see that there are up to 48 categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, we need to normalize the dosage by the respective sets of drug code and units, so as to avoid mixing different absolute values.

### Normalize data

In [None]:
admsdrug_df_norm = admsdrug_df.reset_index()
admsdrug_df_norm.head()

In [None]:
admsdrug_df_norm = du.data_processing.normalize_data(admsdrug_df_norm, columns_to_normalize=False,
                                                  columns_to_normalize_cat=[(['drughiclseqno', 'drugunit'], 'drugdosage')])
admsdrug_df_norm.head()

In [None]:
admsdrug_df_norm = admsdrug_df_norm.set_index('ts')
admsdrug_df_norm.head()

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
admsdrug_df_norm = du.embedding.join_categorical_enum(admsdrug_df_norm, new_cat_embed_feat)
admsdrug_df_norm.head()

In [None]:
admsdrug_df_norm.dtypes

In [None]:
admsdrug_df_norm.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='drughiclseqno').head()

In [None]:
admsdrug_df_norm[admsdrug_df_norm.patientunitstayid == 2346930].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
admsdrug_df.columns = du.data_processing.clean_naming(admsdrug_df.columns)
admsdrug_df_norm.columns = du.data_processing.clean_naming(admsdrug_df_norm.columns)
admsdrug_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
admsdrug_df.to_csv(f'{data_path}cleaned/unnormalized/admissionDrug.csv')

Save the dataframe after normalizing:

In [None]:
admsdrug_df_norm.to_csv(f'{data_path}cleaned/normalized/admissionDrug.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
admsdrug_df_norm.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
admsdrug_df = pd.read_csv(f'{data_path}cleaned/normalized/admissionDrug.csv')
admsdrug_df.head()

In [None]:
len(admsdrug_df)

In [None]:
admsdrug_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, admsdrug_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Medication data

### Read the data

In [None]:
med_df = pd.read_csv(f'{data_path}original/medication.csv', dtype={'loadingdose': 'object'})
med_df.head()

In [None]:
len(med_df)

In [None]:
med_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
med_df.describe().transpose()

In [None]:
med_df.columns

In [None]:
med_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(med_df)

### Remove unneeded features

In [None]:
med_df.drugname.value_counts()

In [None]:
med_df.drughiclseqno.value_counts()

In [None]:
med_df.dosage.value_counts()

In [None]:
med_df.frequency.value_counts()

In [None]:
med_df.drugstartoffset.value_counts()

In [None]:
med_df[med_df.drugstartoffset == 0].head()

Besides removing less interesting data (e.g. `drugivadmixture`), I'm also removing the `drugname` column, which is redundant with the codes `drughiclseqno`, while also being brand dependent.

In [None]:
med_df = med_df[['patientunitstayid', 'drugstartoffset', 'drugstopoffset',
                 'drugordercancelled', 'dosage', 'frequency', 'drughiclseqno']]
med_df.head()

### Remove rows where the drug order was cancelled or the drug isn't specified

In [None]:
med_df.drugordercancelled.value_counts()

In [None]:
med_df = med_df[~((med_df.drugordercancelled == 'Yes') | (np.isnan(med_df.drughiclseqno)))]
med_df.head()

Remove the now unneeded `drugordercancelled` column:

In [None]:
med_df = med_df.drop('drugordercancelled', axis=1)
med_df.head()

In [None]:
du.search_explore.dataframe_missing_values(med_df)

### Separating units from dosage

In order to properly take into account the dosage quantities, as well as to standardize according to other tables like admission drugs, we should split the original `dosage` column into the `drugdosage` values and the `drugunit`.

No need to create a separate `pyxis` feature, which would indicate the use of the popular automated medications manager, as the frequency embedding will take that into account.

Create dosage and unit features:

In [None]:
med_df['drugdosage'] = np.nan
med_df['drugunit'] = np.nan
med_df.head()

Get the dosage and unit values for each row:

In [None]:
med_df[['drugdosage', 'drugunit']] = med_df.apply(du.data_processing.set_dosage_and_units, axis=1, result_type='expand')
med_df.head()
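
As the logic of `du.data_processing.set_dosage_and_units` isn't shown in this notebook, here is a hypothetical sketch of how a dosage string could be separated into a numeric value and a unit with a regular expression. The function name `split_dosage`, the pattern and the example dosage formats are all assumptions; the real `dosage` column may contain formats that this pattern doesn't cover.

```python
import re

# Number (with optional decimals or thousands commas), then optional unit text
DOSAGE_PATTERN = re.compile(r'^\s*([\d,]*\.?\d+)\s*(.*?)\s*$')


def split_dosage(dosage):
    """Return (value, unit); both are None when no leading number is found."""
    match = DOSAGE_PATTERN.match(dosage)
    if match is None:
        return None, None
    # Drop thousands separators before converting to float
    value = float(match.group(1).replace(',', ''))
    # An empty unit string is reported as None
    unit = match.group(2) or None
    return value, unit
```

For instance, `split_dosage('500 mg')` gives the value and the unit separately, while a string with no leading number, such as `'PYXIS'`, gives no value and no unit.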

Remove the now unneeded `dosage` column:

In [None]:
med_df = med_df.drop('dosage', axis=1)
med_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

In the case of medication data, `drughiclseqno` is skipped in the enumeration step below, as it already consists of drug codes, although it's still added to the list of features that will be embedded.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['drugunit', 'frequency', 'drughiclseqno']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [med_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
med_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Skip the 'drughiclseqno' from enumeration encoding
    if feature == 'drughiclseqno':
        continue
    # Prepare for embedding, i.e. enumerate categories
    med_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(med_df, feature)
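To make the enumeration step more concrete, here is a minimal sketch on toy data using plain pandas. The `enumerate_categories` helper is hypothetical and only approximates what `du.embedding.enum_categorical_feature` might do; in particular, reserving 0 for missing values is an assumption.

```python
import pandas as pd


def enumerate_categories(series):
    """Map each distinct non-null category to 1..N, reserving 0 for missing values."""
    categories = sorted(series.dropna().unique())
    mapping = {category: idx for idx, category in enumerate(categories, start=1)}
    # Categories absent from the mapping (i.e. missing values) become 0
    encoded = series.map(mapping).fillna(0).astype(int)
    mapping['nan'] = 0
    return encoded, mapping


toy_units = pd.Series(['mg', 'mL', None, 'mg', 'units'])
encoded, mapping = enumerate_categories(toy_units)
print(encoded.tolist())
print(mapping)
```

The returned dictionary plays the same role as each entry of `cat_embed_feat_enum`: it lets us go back from the integer codes to the original strings.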

In [None]:
med_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
med_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create drug stop event

Add a timestamp corresponding to when each patient stops taking each medication.

Duplicate every row, so as to create a drug stop event:

In [None]:
new_df = med_df.copy()
new_df.head()

Set the new dataframe's rows to have the drug stop timestamp, with no more information on the drug that was being used:

In [None]:
new_df.drugstartoffset = new_df.drugstopoffset
new_df.drugunit = np.nan
new_df.drugdosage = np.nan
new_df.frequency = np.nan
new_df.drughiclseqno = np.nan
new_df.head()

Join the new rows to the remaining dataframe:

In [None]:
# `DataFrame.append` was removed in pandas 2.0, so concatenate instead
med_df = pd.concat([med_df, new_df])
med_df.head()

Remove the now unneeded medication stop column:

In [None]:
med_df = med_df.drop('drugstopoffset', axis=1)
med_df.head(6)
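The whole stop event procedure can be summarized on a toy example. This is only a sketch in plain pandas, with made-up offsets and drug codes, and it keeps a single drug column for brevity:

```python
import numpy as np
import pandas as pd

# Made-up data: two drugs given to the same unit stay
toy_med = pd.DataFrame({
    'patientunitstayid': [1, 1],
    'drugstartoffset': [10, 40],
    'drugstopoffset': [100, 70],
    'drughiclseqno': [1234.0, 5678.0],
})

# Copy every row and move it to the moment when the drug is stopped,
# erasing the drug information
stop_events = toy_med.copy()
stop_events['drugstartoffset'] = stop_events['drugstopoffset']
stop_events['drughiclseqno'] = np.nan

# Stack start and stop events, then drop the now redundant stop column
toy_events = (pd.concat([toy_med, stop_events], ignore_index=True)
                .drop(columns='drugstopoffset')
                .sort_values(['patientunitstayid', 'drugstartoffset'])
                .reset_index(drop=True))
print(toy_events)
```

Each drug now contributes two rows: one with its information at the start offset and an empty one at the stop offset.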

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
med_df['ts'] = med_df['drugstartoffset']
med_df = med_df.drop('drugstartoffset', axis=1)
med_df.head()

Remove duplicate rows:

In [None]:
len(med_df)

In [None]:
med_df = med_df.drop_duplicates()
med_df.head()

In [None]:
len(med_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
med_df = med_df.set_index('ts')
med_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
med_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='drughiclseqno').head()

In [None]:
med_df[med_df.patientunitstayid == 979183].head(10)
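For reference, this kind of check can be reproduced on toy data with a group size count. The sketch below uses plain pandas and made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [0, 0, 5, 0],
    'drughiclseqno': [10.0, 20.0, 10.0, 30.0],
})

# Number of rows that share the same unit stay ID and timestamp, largest first
rows_per_event = (toy.groupby(['patientunitstayid', 'ts'])
                     .size()
                     .sort_values(ascending=False))
print(rows_per_event.head())
```

Any count above 1 signals a pair of `patientunitstayid` and `ts` whose rows still need to be joined.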

We can see that there are up to 41 categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, we need to normalize the dosage by the respective sets of drug code and units, so as to avoid mixing different absolute values.

### Normalize data

In [None]:
med_df_norm = med_df.reset_index()
med_df_norm.head()

In [None]:
med_df_norm = du.data_processing.normalize_data(med_df_norm, columns_to_normalize=False,
                                             columns_to_normalize_cat=[(['drughiclseqno', 'drugunit'], 'drugdosage')])
med_df_norm.head()

In [None]:
med_df_norm = med_df_norm.set_index('ts')
med_df_norm.head()
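A minimal sketch of this group-wise normalization, assuming a z-score computed separately for each pair of drug code and unit (which is consistent with the mean and standard deviation reported further below, but is not necessarily how `du.data_processing.normalize_data` is implemented). The data is made up and plain pandas is used:

```python
import pandas as pd

toy = pd.DataFrame({
    'drughiclseqno': [1, 1, 1, 2, 2, 2],
    'drugunit':      [1, 1, 1, 3, 3, 3],
    'drugdosage':    [100.0, 200.0, 300.0, 1.0, 2.0, 3.0],
})

# Statistics are computed per (drug code, unit) group, not over the whole column
groups = toy.groupby(['drughiclseqno', 'drugunit'])['drugdosage']
toy['drugdosage_norm'] = (toy['drugdosage'] - groups.transform('mean')) / groups.transform('std')
print(toy)
```

Despite the very different absolute scales, both drugs end up with the same normalized values, which is the point of normalizing by group.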

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
list(set(med_df_norm.columns) - set(new_cat_embed_feat) - set(['patientunitstayid', 'ts']))

In [None]:
med_df_norm = du.embedding.join_categorical_enum(med_df_norm, new_cat_embed_feat)
med_df_norm.head()
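To illustrate the idea of joining rows, the sketch below collapses rows that share the same ID and timestamp. The `;` separator for the enumerations and the mean for the continuous feature are assumptions made for this example, not a description of `du.embedding.join_categorical_enum`:

```python
import pandas as pd

toy = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts':                [0, 0, 5, 0],
    'drugunit':          [2, 3, 2, 1],
    'drugdosage':        [0.5, -0.5, 1.0, 0.2],
})

joined = (toy.groupby(['patientunitstayid', 'ts'], as_index=False)
             .agg(drugunit=('drugunit', lambda codes: ';'.join(str(code) for code in codes)),
                  drugdosage=('drugdosage', 'mean')))
print(joined)
```

After the join, each pair of `patientunitstayid` and `ts` appears in exactly one row, with all of its categories packed in a single field.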

In [None]:
med_df_norm.dtypes

In [None]:
med_df_norm.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='drughiclseqno').head()

In [None]:
med_df_norm[med_df_norm.patientunitstayid == 979183].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
med_df_norm = med_df_norm.rename(columns={'frequency':'drugadmitfrequency'})
med_df_norm.head()

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
med_df.columns = du.data_processing.clean_naming(med_df.columns)
med_df_norm.columns = du.data_processing.clean_naming(med_df_norm.columns)
med_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
med_df.to_csv(f'{data_path}cleaned/unnormalized/medication.csv')

Save the dataframe after normalizing:

In [None]:
med_df_norm.to_csv(f'{data_path}cleaned/normalized/medication.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
med_df_norm.describe().transpose()

In [None]:
med_df.nlargest(n=5, columns='drugdosage')

Although the `drugdosage` looks good on mean (close to 0) and standard deviation (close to 1), its minimum (-88.9) and maximum (174.1) values have very large magnitudes. Furthermore, these don't seem to be caused by NaN values, which the groupby normalization might have handled poorly. As such, it's hard to say if these are outliers or realistic values.

[TODO] Check if these very large extreme dosage values make sense and, if not, try to fix them.

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
med_df = pd.read_csv(f'{data_path}cleaned/normalized/medication.csv')
med_df.head()

In [None]:
len(med_df)

In [None]:
med_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, med_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
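As a reminder of how `merge_asof` behaves with `direction='nearest'` and a tolerance, here is a small sketch on made-up data, using plain pandas. Note that both dataframes must be sorted by the `on` key, and that the tolerance is expressed in the same units as `ts` (minutes, in this notebook):

```python
import pandas as pd

left = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                     'ts': [0, 100, 50]}).sort_values('ts')
right = pd.DataFrame({'patientunitstayid': [1, 1, 2],
                      'ts': [20, 160, 45],
                      'drugdosage': [0.1, 0.2, 0.3]}).sort_values('ts')

# For each left row, pick the right row of the same unit stay with the closest
# timestamp, as long as it's at most 30 minutes away
merged = pd.merge_asof(left, right, on='ts', by='patientunitstayid',
                       direction='nearest', tolerance=30)
print(merged)
```

The row of unit stay 1 at `ts` 100 gets a missing `drugdosage`, as its closest match is 60 minutes away, beyond the tolerance.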

## Notes data

### Read the data

In [None]:
note_df = pd.read_csv(f'{data_path}original/note.csv')
note_df.head()

In [None]:
len(note_df)

In [None]:
note_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
note_df.describe().transpose()

In [None]:
note_df.columns

In [None]:
note_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(note_df)

### Remove unneeded features

In [None]:
note_df.notetype.value_counts().head(20)

In [None]:
note_df.notepath.value_counts().head(40)

In [None]:
note_df.notevalue.value_counts().head(20)

In [None]:
note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')].head(20)

In [None]:
note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')].notepath.value_counts().head(20)

In [None]:
note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')].notevalue.value_counts().head(20)

Out of all the possible notes, only those addressing the patient's social history seem to be interesting and to contain information not found in other tables. As such, we'll only keep the note paths that mention social history:

In [None]:
note_df = note_df[note_df.notepath.str.contains('notes/Progress Notes/Social History')]
note_df.head()

In [None]:
len(note_df)

There are still rows that seem to contain irrelevant data. Let's remove them by finding rows that contain specific words, like "obtain" and "print", that only appear in said irrelevant rows:

In [None]:
category_types_to_remove = ['obtain', 'print', 'copies', 'options']

In [None]:
du.search_explore.find_row_contains_word(note_df, feature='notepath', words=category_types_to_remove).value_counts()

In [None]:
note_df = note_df[~du.search_explore.find_row_contains_word(note_df, feature='notepath', words=category_types_to_remove)]
note_df.head()

In [None]:
len(note_df)

In [None]:
note_df.patientunitstayid.nunique()

In [None]:
note_df.notetype.value_counts().head(20)

Filtering just for interesting social history data greatly reduced the data volume of the notes table, now only present in around 20.5% of the unit stays. Still, it might be useful to include.

Besides the usual removal of row identifier, `noteid`, I'm also removing apparently irrelevant (`noteenteredoffset`, `notetype`) and redundant (`notetext`) columns:

In [None]:
note_df = note_df.drop(['noteid', 'noteenteredoffset', 'notetype', 'notetext'], axis=1)
note_df.head()

### Separate high level notes

In [None]:
note_df.notepath.value_counts().head(20)

In [None]:
note_df.notepath.map(lambda x: x.split('/')).head().values

In [None]:
note_df.notepath.map(lambda x: len(x.split('/'))).min()

In [None]:
note_df.notepath.map(lambda x: len(x.split('/'))).max()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 3, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 6, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 7, separator='/'),
                       meta=('x', str)).value_counts()

In [None]:
note_df.notevalue.value_counts()

The note paths always have 8 levels. As the first 6 are essentially always the same ("notes/Progress Notes/Social History / Family History/Social History/Social History/"), it's best to just preserve the 7th one and isolate the 8th in a new feature. This way, the split gives the model further insight into similar notes. However, it's also worth noting that the 8th level of `notepath` seems to be identical to the feature `notevalue`. We'll look more into it later.

In [None]:
note_df['notetopic'] = note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 6, separator='/'), meta=('x', str))
note_df['notedetails'] = note_df.notepath.apply(lambda x: du.search_explore.get_element_from_split(x, 7, separator='/'), meta=('x', str))
note_df.head()
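The indexing used above can be checked on a single path. In the sketch below, `get_path_level` is a hypothetical stand-in for `du.search_explore.get_element_from_split`, and the last level of the example path is made up:

```python
def get_path_level(path, level, separator='/'):
    """Return the element at position `level` (0-based) of a separated path, or None if absent."""
    parts = path.split(separator)
    return parts[level] if level < len(parts) else None


# The first 6 levels come from the text above; the last one is a made-up value
example_path = ('notes/Progress Notes/Social History / Family History/'
                'Social History/Social History/Smoking Status/Never smoked')

print(len(example_path.split('/')))
print(get_path_level(example_path, 6))
print(get_path_level(example_path, 7))
```

Since the positions are 0-based, indices 6 and 7 correspond to the 7th and 8th levels of the path.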

Remove the now redundant `notepath` column:

In [None]:
note_df = note_df.drop('notepath', axis=1)
note_df.head()

Compare columns `notevalue` and `notedetails`:

In [None]:
note_df[note_df.notevalue != note_df.notedetails]

The previous blank output confirms that the newly created `notedetails` feature is exactly equal to the already existing `notevalue` feature. So, we should remove one of them:

In [None]:
note_df = note_df.drop('notedetails', axis=1)
note_df.head()

In [None]:
note_df[note_df.notetopic == 'Smoking Status'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Ethanol Use'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'CAD'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Cancer'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Recent Travel'].notevalue.value_counts()

In [None]:
note_df[note_df.notetopic == 'Bleeding Disorders'].notevalue.value_counts()

Only the "Smoking Status" and "Ethanol Use" categories of `notetopic` have more than one possible `notevalue` category, and only 2 of the remaining categories are useful ("Recent Travel" and "Bleeding Disorders" have too few samples). As such, it's probably best to just turn them into features, instead of packing them into the same embedded feature.

### Convert categories to features

Make the `notetopic` and `notevalue` columns of type categorical:

In [None]:
note_df = note_df.categorize(columns=['notetopic', 'notevalue'])

Transform the `notetopic` categories and `notevalue` values into separate features:

Now we have the categories separated into their own features, as desired. Notice also how categories `Bleeding Disorders` and `Recent Travel` weren't added, as they appeared in less than the specified minimum of 1000 rows.
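For illustration, a transformation of this kind can be sketched in plain pandas as shown next. The `categories_to_features` helper, its `min_len` parameter and the toy values are hypothetical, and the minimum count is lowered to 2 so that the toy data shows the filtering:

```python
import pandas as pd


def categories_to_features(df, categories_feature, values_feature, min_len=None):
    """Create one column per category, holding that category's values and NaN elsewhere."""
    result = df.copy()
    counts = result[categories_feature].value_counts()
    for category, count in counts.items():
        if min_len is not None and count < min_len:
            # Skip categories that appear in too few rows
            continue
        is_category = result[categories_feature] == category
        result[category] = result[values_feature].where(is_category)
    return result


toy_notes = pd.DataFrame({
    'notetopic': ['CAD', 'Smoking Status', 'CAD', 'Smoking Status', 'Recent Travel'],
    'notevalue': ['CAD', 'Never smoked', 'CAD', 'Current smoker', 'Recent Travel'],
})
features = categories_to_features(toy_notes, 'notetopic', 'notevalue', min_len=2)
print(features)
```

Rows that don't belong to a given category get a missing value in that category's column, which is why each original row only fills one of the new features.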

Remove the old `notevalue` and `notetopic` columns:

In [None]:
note_df = note_df.drop(['notevalue', 'notetopic'], axis=1)
note_df.head()

While `Ethanol Use` and `Smoking Status` have several unique values, `CAD` and `Cancer` only have 1, indicating when that characteristic is present. As such, we should turn `CAD` and `Cancer` into binary features:

In [None]:
note_df['CAD'] = note_df['CAD'].apply(lambda x: 1 if x == 'CAD' else 0, meta=('CAD', int))
note_df['Cancer'] = note_df['Cancer'].apply(lambda x: 1 if x == 'Cancer' else 0, meta=('Cancer', int))
note_df.head()

In [None]:
note_df['CAD'].value_counts()

In [None]:
note_df['Cancer'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Smoking Status', 'Ethanol Use', 'CAD', 'Cancer']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [note_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
note_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    note_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(note_df, feature)

In [None]:
note_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
note_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
note_df['ts'] = note_df['noteoffset']
note_df = note_df.drop('noteoffset', axis=1)
note_df.head()

Remove duplicate rows:

In [None]:
len(note_df)

In [None]:
note_df = note_df.drop_duplicates()
note_df.head()

In [None]:
len(note_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
note_df = note_df.set_index('ts')
note_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
note_df.reset_index().head()

In [None]:
note_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='CAD').head()

In [None]:
note_df[note_df.patientunitstayid == 3091883].head(10)

In [None]:
note_df[note_df.patientunitstayid == 3052175].head(10)

We can see that there are up to 5 categories per set of `patientunitstayid` and `ts`. As such, we must join them. However, this is a different scenario than in the other cases. Since we created the features from one categorical column, it doesn't have repeated values, only different rows to indicate each of the new features' values. As such, we just need to combine the rows, which we do by taking the maximum of each feature.

### Join rows that have the same IDs

In [None]:
note_df = du.embedding.join_categorical_enum(note_df, cont_join_method='max')
note_df.head()
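A minimal sketch of why taking the maximum works here, in plain pandas and with made-up values. It assumes that, for each ID and timestamp, every feature is non-zero in at most one row:

```python
import pandas as pd

# Each row carries the value of a single feature, with zeros elsewhere
toy_notes = pd.DataFrame({
    'patientunitstayid': [1, 1, 1],
    'ts':                [0, 0, 0],
    'cad':               [1, 0, 0],
    'cancer':            [0, 1, 0],
    'smoking_status':    [0, 0, 3],
})

joined_notes = toy_notes.groupby(['patientunitstayid', 'ts'], as_index=False).max()
print(joined_notes)
```

Under that assumption, the maximum and the sum give the same result, with the maximum being safer if a value happens to be repeated across rows.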

In [None]:
note_df.dtypes

In [None]:
note_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='CAD').head()

In [None]:
note_df[note_df.patientunitstayid == 3091883].head(10)

In [None]:
note_df[note_df.patientunitstayid == 3052175].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
note_df.columns = du.data_processing.clean_naming(note_df.columns)
note_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
note_df.to_csv(f'{data_path}cleaned/unnormalized/note.csv')

Save the dataframe after normalizing:

In [None]:
note_df.to_csv(f'{data_path}cleaned/normalized/note.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
note_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
note_df = pd.read_csv(f'{data_path}cleaned/normalized/note.csv')
note_df.head()

In [None]:
len(note_df)

In [None]:
note_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, note_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Treatment data

### Read the data

In [None]:
treat_df = pd.read_csv(f'{data_path}original/treatment.csv')
treat_df.head()

In [None]:
len(treat_df)

In [None]:
treat_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
treat_df.describe().transpose()

In [None]:
treat_df.columns

In [None]:
treat_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(treat_df)

### Remove unneeded features

Besides the usual removal of the row identifier, `treatmentid`, I'm also removing `activeupondischarge`, as we don't have complete information as to when treatments end.

In [None]:
treat_df = treat_df.drop(['treatmentid', 'activeupondischarge'], axis=1)
treat_df.head()

### Separate high level treatments

In [None]:
treat_df.treatmentstring.value_counts()

In [None]:
treat_df.treatmentstring.map(lambda x: x.split('|')).head()

In [None]:
treat_df.treatmentstring.map(lambda x: len(x.split('|'))).min()

In [None]:
treat_df.treatmentstring.map(lambda x: len(x.split('|'))).max()

There are always at least 3 higher levels in the treatment string. It could be beneficial to extract those first 3 levels into separate features, with the last one getting the values until the end of the string, so as to avoid the need for the model to learn similarities that are already known.

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='|'),
                               meta=('x', str)).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='|'),
                               meta=('x', str)).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='|'),
                               meta=('x', str)).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 3, separator='|'),
                               meta=('x', str)).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='|'),
                               meta=('x', str)).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='|'),
                               meta=('x', str)).value_counts()


In [None]:
treat_df['treatmenttype'] = treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='|'), meta=('x', str))
treat_df['treatmenttherapy'] = treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='|'), meta=('x', str))
treat_df['treatmentdetails'] = treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='|', till_the_end=True), meta=('x', str))
treat_df.head()
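The three-way split, with the last feature keeping everything until the end of the string, can be sketched with Python's `maxsplit` argument. The `split_treatment` helper and the example string are hypothetical, shown only to illustrate the idea behind `till_the_end=True`:

```python
def split_treatment(treatment, separator='|', n_levels=3):
    """Split into `n_levels` parts, with the last part keeping the remainder of the string."""
    parts = treatment.split(separator, n_levels - 1)
    # Pad with None if the string has fewer levels than requested
    parts += [None] * (n_levels - len(parts))
    return tuple(parts)


# Made-up treatment string with 4 levels
example = 'pulmonary|ventilation and oxygenation|mechanical ventilation|non-invasive ventilation'
treatmenttype, treatmenttherapy, treatmentdetails = split_treatment(example)
print(treatmenttype)
print(treatmenttherapy)
print(treatmentdetails)
```

Keeping the remainder in the last feature means that no information from deeper levels is lost, while the first two features still expose the known hierarchy.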

Remove the now redundant `treatmentstring` column:

In [None]:
treat_df = treat_df.drop('treatmentstring', axis=1)
treat_df.head()

In [None]:
treat_df.treatmenttype.value_counts()

In [None]:
treat_df.treatmenttherapy.value_counts()

In [None]:
treat_df.treatmentdetails.value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['treatmenttype', 'treatmenttherapy', 'treatmentdetails']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [treat_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
treat_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    treat_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(treat_df, feature)

In [None]:
treat_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
treat_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
treat_df['ts'] = treat_df['treatmentoffset']
treat_df = treat_df.drop('treatmentoffset', axis=1)
treat_df.head()

Remove duplicate rows:

In [None]:
len(treat_df)

In [None]:
treat_df = treat_df.drop_duplicates()
treat_df.head()

In [None]:
len(treat_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
treat_df = treat_df.set_index('ts')
treat_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
treat_df.reset_index().head()

In [None]:
treat_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='treatmenttype').head()

In [None]:
treat_df[treat_df.patientunitstayid == 1352520].head(10)

We can see that there are up to 105 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
treat_df = du.embedding.join_categorical_enum(treat_df, new_cat_embed_feat)
treat_df.head()

In [None]:
treat_df.dtypes

In [None]:
treat_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='treatmenttype').head()

In [None]:
treat_df[treat_df.patientunitstayid == 1352520].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
treat_df.columns = du.data_processing.clean_naming(treat_df.columns)
treat_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
treat_df.to_csv(f'{data_path}cleaned/unnormalized/treatment.csv')

Save the dataframe after normalizing:

In [None]:
treat_df.to_csv(f'{data_path}cleaned/normalized/treatment.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
treat_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
treat_df = pd.read_csv(f'{data_path}cleaned/normalized/treatment.csv')
treat_df.head()

In [None]:
len(treat_df)

In [None]:
treat_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, treat_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Nurse care data

### Read the data

In [None]:
nursecare_df = pd.read_csv(f'{data_path}original/nurseCare.csv')
nursecare_df.head()

In [None]:
len(nursecare_df)

In [None]:
nursecare_df.patientunitstayid.nunique()

Only 13052 unit stays have nurse care data. Might not be useful to include them.

Get an overview of the dataframe through the `describe` method:

In [None]:
nursecare_df.describe().transpose()

In [None]:
nursecare_df.columns

In [None]:
nursecare_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(nursecare_df)

### Remove unneeded features

In [None]:
nursecare_df.celllabel.value_counts()

In [None]:
nursecare_df.cellattribute.value_counts()

In [None]:
nursecare_df.cellattributevalue.value_counts()

In [None]:
nursecare_df.cellattributepath.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Nutrition'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Activity'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Hygiene/ADLs'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Safety'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Treatments'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Isolation Precautions'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Restraints'].cellattributevalue.value_counts()

In [None]:
nursecare_df[nursecare_df.celllabel == 'Equipment'].cellattributevalue.value_counts()

Besides the usual removal of row identifier, `nursecareid`, and the timestamp when data was added, `nursecareentryoffset`, I'm also removing `cellattributepath` and `cellattribute`, which have redundant info with `celllabel`.

In [None]:
nursecare_df = nursecare_df.drop(['nursecareid', 'nursecareentryoffset',
                                  'cellattributepath', 'cellattribute'], axis=1)
nursecare_df.head()

Additionally, some information, like "Equipment" and "Restraints", seems to be unnecessary. So let's remove it:

In [None]:
categories_to_remove = ['Safety', 'Restraints', 'Equipment', 'Airway Type',
                        'Isolation Precautions', 'Airway Size']

In [None]:
~(nursecare_df.celllabel.isin(categories_to_remove)).head()

In [None]:
nursecare_df = nursecare_df[~(nursecare_df.celllabel.isin(categories_to_remove))]
nursecare_df.head()

### Convert categories to features

Make the `celllabel` and `cellattributevalue` columns of type categorical:

In [None]:
nursecare_df = nursecare_df.categorize(columns=['celllabel', 'cellattributevalue'])

Transform the `celllabel` categories and `cellattributevalue` values into separate features:

Now we have the categories separated into their own features, as desired.

Remove the old `celllabel` and `cellattributevalue` columns:

In [None]:
nursecare_df = nursecare_df.drop(['celllabel', 'cellattributevalue'], axis=1)
nursecare_df.head()

In [None]:
nursecare_df['Nutrition'].value_counts()

In [None]:
nursecare_df['Treatments'].value_counts()

In [None]:
nursecare_df['Hygiene/ADLs'].value_counts()

In [None]:
nursecare_df['Activity'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Nutrition', 'Treatments', 'Hygiene/ADLs', 'Activity']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [nursecare_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
nursecare_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    nursecare_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(nursecare_df, feature)

In [None]:
nursecare_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
nursecare_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
nursecare_df['ts'] = nursecare_df['nursecareoffset']
nursecare_df = nursecare_df.drop('nursecareoffset', axis=1)
nursecare_df.head()

Remove duplicate rows:

In [None]:
len(nursecare_df)

In [None]:
nursecare_df = nursecare_df.drop_duplicates()
nursecare_df.head()

In [None]:
len(nursecare_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
nursecare_df = nursecare_df.set_index('ts')
nursecare_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
nursecare_df.reset_index().head()

In [None]:
nursecare_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='Nutrition').head()

In [None]:
nursecare_df[nursecare_df.patientunitstayid == 2798325].head(10)

We can see that there are up to 21 categories per set of `patientunitstayid` and `ts`. As such, we must join them. However, this is a different scenario than in the other cases. Since we created the features from one categorical column, it doesn't have repeated values, only different rows to indicate each of the new features' values. As such, we just need to combine the features' values from those rows.

### Join rows that have the same IDs

In [None]:
# [TODO] Find a way to join rows while ignoring zeros
nursecare_df = du.embedding.join_categorical_enum(nursecare_df, new_cat_embed_feat)
nursecare_df.head()

In [None]:
nursecare_df.dtypes

In [None]:
nursecare_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(n=5, columns='Nutrition').head()

In [None]:
nursecare_df[nursecare_df.patientunitstayid == 2798325].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
nursecare_df = nursecare_df.rename(columns={'Treatments':'nurse_treatments'})
nursecare_df.head()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
nursecare_df.columns = du.data_processing.clean_naming(nursecare_df.columns)
nursecare_df.head()
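As a rough illustration of the naming rules described above, the following sketch applies them to a few column names. The helper `clean_name` and the `'Hygiene, ADLs'` column are hypothetical; the actual `du.data_processing.clean_naming` may behave differently.

```python
def clean_name(name):
    """Lower-case a name, replace spaces with underscores and drop commas."""
    return name.lower().replace(' ', '_').replace(',', '')

# 'Hygiene, ADLs' is a made-up column name used only to show comma removal
columns = ['patientunitstayid', 'Nutrition', 'nurse_treatments', 'Hygiene, ADLs']
cleaned = [clean_name(col) for col in columns]
print(cleaned)
```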

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
nursecare_df.to_csv(f'{data_path}cleaned/unnormalized/nurseCare.csv')

Save the dataframe after normalizing:

In [None]:
nursecare_df.to_csv(f'{data_path}cleaned/normalized/nurseCare.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
nursecare_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
nursecare_df = pd.read_csv(f'{data_path}cleaned/normalized/nurseCare.csv')
nursecare_df.head()

In [None]:
len(nursecare_df)

In [None]:
nursecare_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, nursecare_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
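The behaviour of the tolerance-based merge above can be checked on a small example. This sketch uses plain pandas and made-up data (the `heartrate` column is hypothetical); it assumes, as in the notebook, that `ts` is an integer offset in minutes.

```python
import pandas as pd

# Made-up data: `ts` is an offset in minutes
left_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [0, 60, 15],
    'heartrate': [80, 95, 70],
})
right_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [20, 70, 90],
    'nutrition': [1, 2, 3],
})

# merge_asof requires both dataframes to be sorted by the `on` key
left_df = left_df.sort_values('ts')
right_df = right_df.sort_values('ts')

# For each left row, take the nearest right row of the same unit stay,
# but only if its timestamp is at most 30 minutes away
merged_df = pd.merge_asof(left_df, right_df, on='ts', by='patientunitstayid',
                          direction='nearest', tolerance=30)
print(merged_df)
```

Rows of the left dataframe without a match inside the tolerance window keep their own values and get missing values in the columns coming from the right dataframe.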

## Nurse assessment data

### Read the data

In [None]:
nurseassess_df = pd.read_csv(f'{data_path}original/nurseAssessment.csv')
nurseassess_df.head()

In [None]:
len(nurseassess_df)

In [None]:
nurseassess_df.patientunitstayid.nunique()

Only 13001 unit stays have nurse assessment data. It might not be useful to include them.

Get an overview of the dataframe through the `describe` method:

In [None]:
nurseassess_df.describe().transpose()

In [None]:
nurseassess_df.columns

In [None]:
nurseassess_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(nurseassess_df)

### Remove unneeded features

In [None]:
nurseassess_df.celllabel.value_counts()

In [None]:
nurseassess_df.cellattribute.value_counts()

In [None]:
nurseassess_df.cellattributevalue.value_counts()

In [None]:
nurseassess_df.cellattributepath.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Intervention'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Neurologic'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Pupils'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Edema'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Secretions'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Cough'].cellattributevalue.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Neurologic'].cellattribute.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Pupils'].cellattribute.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Secretions'].cellattribute.value_counts()

In [None]:
nurseassess_df[nurseassess_df.celllabel == 'Cough'].cellattribute.value_counts()

Besides the usual removal of the row identifier, `nurseAssessID`, and the timestamp when data was added, `nurseAssessEntryOffset`, I'm also removing `cellattributepath` and `cellattribute`, which are redundant with `celllabel`. Regarding data categories, I'm only keeping `Neurologic`, `Pupils`, `Secretions` and `Cough`, as the remaining ones either don't add much value, have too little data or are redundant with data from other tables.

In [None]:
nurseassess_df = nurseassess_df.drop(['nurseassessid', 'nurseassessentryoffset',
                                      'cellattributepath', 'cellattribute'], axis=1)
nurseassess_df.head()

In [None]:
categories_to_keep = ['Neurologic', 'Pupils', 'Secretions', 'Cough']

In [None]:
nurseassess_df.celllabel.isin(categories_to_keep).head()

In [None]:
nurseassess_df = nurseassess_df[nurseassess_df.celllabel.isin(categories_to_keep)]
nurseassess_df.head()

### Convert categories to features

Make the `celllabel` and `cellattributevalue` columns of type categorical:

In [None]:
nurseassess_df = nurseassess_df.astype({'celllabel': 'category', 'cellattributevalue': 'category'})

In [None]:
nurseassess_df.head()

Transform the `celllabel` categories and `cellattributevalue` values into separate features:

Now we have the categories separated into their own features, as desired.
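One possible way to perform this kind of transformation is sketched below on toy data with plain pandas. It is not necessarily the method used in this notebook, and the attribute values (`reactive`, `strong`, `sluggish`) are made up for illustration.

```python
import pandas as pd

# Toy data with made-up attribute values
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'nurseassessoffset': [5, 5, 8],
    'celllabel': ['Pupils', 'Cough', 'Pupils'],
    'cellattributevalue': ['reactive', 'strong', 'sluggish'],
})

# Create one column per category, holding the value only on the rows
# that belong to that category (the other rows stay missing)
for category in toy_df['celllabel'].unique():
    toy_df[category] = toy_df['cellattributevalue'].where(toy_df['celllabel'] == category)

print(toy_df[['celllabel', 'cellattributevalue', 'Pupils', 'Cough']])
```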

Remove the old `celllabel` and `cellattributevalue` columns:

In [None]:
nurseassess_df = nurseassess_df.drop(['celllabel', 'cellattributevalue'], axis=1)
nurseassess_df.head()

In [None]:
nurseassess_df['Neurologic'].value_counts()

In [None]:
nurseassess_df['Pupils'].value_counts()

In [None]:
nurseassess_df['Secretions'].value_counts()

In [None]:
nurseassess_df['Cough'].value_counts()

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['Pupils', 'Neurologic', 'Secretions', 'Cough']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [nurseassess_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for feature, n_unique in zip(new_cat_feat, cat_feat_nunique):
    if n_unique > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(feature)
        new_cat_embed_feat.append(feature)

In [None]:
nurseassess_df[new_cat_feat].head()

In [None]:
for feature in new_cat_embed_feat:
    # Prepare for embedding, i.e. enumerate categories
    nurseassess_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(nurseassess_df, feature)

In [None]:
nurseassess_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
nurseassess_df[new_cat_feat].dtypes
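To make the enumeration step more concrete, here is a minimal sketch of how categories could be mapped to integers. The helper `enumerate_categories` is hypothetical and only illustrates the idea; the real `du.embedding.enum_categorical_feature` may differ, and reserving 0 for missing values is an assumption of this sketch. The category strings are made up.

```python
import pandas as pd

def enumerate_categories(series):
    """Map each category to an integer, starting at 1 (0 is assumed to mean missing)."""
    categories = sorted(series.dropna().unique())
    mapping = {category: idx for idx, category in enumerate(categories, start=1)}
    # Values without a mapping (i.e. missing ones) become 0
    encoded = series.map(mapping).fillna(0).astype(int)
    return encoded, mapping

# Made-up category strings
pupils = pd.Series(['reactive', 'sluggish', None, 'reactive', 'fixed'])
encoded, mapping = enumerate_categories(pupils)
print(encoded.tolist())
print(mapping)
```

The returned dictionary plays the same role as an entry of `cat_embed_feat_enum`: it records how to translate between the original strings and their numeric codes.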

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
nurseassess_df['ts'] = nurseassess_df['nurseassessoffset']
nurseassess_df = nurseassess_df.drop('nurseassessoffset', axis=1)
nurseassess_df.head()

Remove duplicate rows:

In [None]:
len(nurseassess_df)

In [None]:
nurseassess_df = nurseassess_df.drop_duplicates()
nurseassess_df.head()

In [None]:
len(nurseassess_df)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
nurseassess_df = nurseassess_df.set_index('ts').sort_index()
nurseassess_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
nurseassess_df.reset_index().head()

In [None]:
nurseassess_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough').head()

In [None]:
nurseassess_df[nurseassess_df.patientunitstayid == 2553254].head(10)

We can see that there are up to 80 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
nurseassess_df = du.embedding.join_categorical_enum(nurseassess_df, new_cat_embed_feat)
nurseassess_df.head()

In [None]:
nurseassess_df.dtypes

In [None]:
nurseassess_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='Cough').head()

In [None]:
nurseassess_df[nurseassess_df.patientunitstayid == 2553254].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
nurseassess_df.columns = du.data_processing.clean_naming(nurseassess_df.columns)
nurseassess_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
nurseassess_df.to_csv(f'{data_path}cleaned/unnormalized/nurseAssessment.csv')

Save the dataframe after normalizing:

In [None]:
nurseassess_df.to_csv(f'{data_path}cleaned/normalized/nurseAssessment.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
nurseassess_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
nurseassess_df = pd.read_csv(f'{data_path}cleaned/normalized/nurseAssessment.csv')
nurseassess_df.head()

In [None]:
len(nurseassess_df)

In [None]:
nurseassess_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, nurseassess_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()

## Nurse charting data

### Read the data

In [None]:
nursechart_df = pd.read_csv(f'{data_path}original/nurseCharting.csv')
nursechart_df.head()

In [None]:
len(nursechart_df)

In [None]:
nursechart_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
nursechart_df.describe().transpose()

In [None]:
nursechart_df.columns

In [None]:
nursechart_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(nursechart_df)

### Remove unneeded features

In [None]:
nursechart_df.nursingchartcelltypecat.value_counts()

In [None]:
nursechart_df.nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df.nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df.nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Vital Signs'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Scores'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Other Vital Signs and Infusions'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Vital Signs and Infusions'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'Invasive'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'SVO2'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypecat == 'ECG'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Pain Score'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Pain Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Assessment'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Present'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Pain Present'].nursingchartvalue.value_counts()

Regarding the patient's pain information, the only label that seems to be relevant is `Pain Score`. However, it's important to note that this score can be recorded in different measurement systems (`Pain Assessment`). Because of this, we will only consider the most frequent pain scale (`WDL`). `Pain Present` carries less information and, as such, is less relevant.

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Glasgow coma score'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Glasgow coma score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'GCS Total'].nursingchartcelltypevallabel.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'GCS Total'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Score (Glasgow Coma Scale)'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Score (Glasgow Coma Scale)'].nursingchartvalue.value_counts()

Labels `GCS Total` and `Score (Glasgow Coma Scale)` should be merged, as they represent exactly the same thing.

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'SEDATION SCORE'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'SEDATION SCORE'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Sedation Scale/Score/Goal'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Sedation Scale/Score/Goal'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Sedation Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Sedation Scale'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Delirium Scale/Score'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Delirium Scale/Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Delirium Score'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevalname == 'Delirium Scale'].nursingchartvalue.value_counts()

Sedation and delirium scores could be interesting features. However, like the pain score, they are recorded in different scales, which don't seem to be directly convertible into one another. Because of this, we will only consider the most frequent scale for each case (`RASS` and `CAM-ICU`, respectively).

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Best Motor Response'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Best Motor Response'].nursingchartvalue.value_counts()

These "Best ___ Response" features are subparts of the total Glasgow Coma Score calculation. Because of that, and because they have less data, they will be discarded.

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Gastrointestinal Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Gastrointestinal Assessment'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Genitourinary Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Genitourinary Assessment'].nursingchartvalue.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Integumentary Assessment'].nursingchartcelltypevalname.value_counts()

In [None]:
nursechart_df[nursechart_df.nursingchartcelltypevallabel == 'Integumentary Assessment'].nursingchartvalue.value_counts()

Some other information, like these gastrointestinal, genitourinary and integumentary domains, could be relevant to add. The problem is that we only seem to have access to how they were measured (i.e. their scale) and not the real values.

Besides the usual removal of the row identifier, `nursingChartID`, and the timestamp when data was added, `nursingChartEntryOffset`, I'm also removing all labels and names except those that relate to pain, coma, sedation and delirium scores. Furthermore, `nursingchartcelltypecat` doesn't add much relevant info either, so it will be removed.

In [None]:
nursechart_df = nursechart_df.drop(['nursingchartid', 'nursingchartentryoffset', 'nursingchartcelltypecat'], axis=1)
nursechart_df.head()

In [None]:
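# Pain labels are kept as well, since the pain columns are used further below
labels_to_keep = ['Pain Score', 'Pain Assessment', 'Glasgow coma score',
                  'Score (Glasgow Coma Scale)', 'Sedation Scale/Score/Goal',
                  'Delirium Scale/Score']

In [None]:
nursechart_df = nursechart_df[nursechart_df.nursingchartcelltypevallabel.isin(labels_to_keep)]
nursechart_df.head()

In [None]:
# 'Pain Assessment' is assumed to also be the value name used under the `Pain Assessment` label
names_to_keep = ['Pain Score', 'Pain Assessment', 'GCS Total', 'Value', 'Sedation Score',
                 'Sedation Scale', 'Delirium Score', 'Delirium Scale']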

In [None]:
nursechart_df = nursechart_df[nursechart_df.nursingchartcelltypevalname.isin(names_to_keep)]
nursechart_df.head()

### Convert categories to features

Make the `nursingchartcelltypevallabel` and `nursingchartcelltypevalname` columns of type categorical:

In [None]:
nursechart_df = nursechart_df.astype({'nursingchartcelltypevallabel': 'category', 'nursingchartcelltypevalname': 'category'})

In [None]:
nursechart_df.head()

Transform the `nursingchartcelltypevallabel` categories and `nursingchartvalue` values into separate features:

Now we have the categories separated into their own features, as desired.

Remove the old `nursingchartcelltypevallabel`, `nursingchartcelltypevalname` and `nursingchartvalue` columns:

In [None]:
nursechart_df = nursechart_df.drop(['nursingchartcelltypevallabel', 'nursingchartcelltypevalname', 'nursingchartvalue'], axis=1)
nursechart_df.head()

In [None]:
nursechart_df['Pain Score'].value_counts()

### Filter the most common measurement scales

Only keep data that is in the most common measurement scale.

In [None]:
nursechart_df = nursechart_df[((nursechart_df['Pain Assessment'] == 'WDL')
                               | (nursechart_df['Sedation Scale'] == 'RASS')
                               | (nursechart_df['Delirium Scale'] == 'CAM-ICU'))]
nursechart_df.head()

Merge Glasgow coma score columns:

In [None]:
def set_gcs(row):
    # Use `GCS Total` when available; otherwise fall back to `Score (Glasgow Coma Scale)`
    if pd.isna(row['GCS Total']):
        return row['Score (Glasgow Coma Scale)']
    else:
        return row['GCS Total']

In [None]:
nursechart_df['glasgow_coma_score'] = nursechart_df.apply(set_gcs, axis=1)
nursechart_df.head()

Drop unneeded columns:

In [None]:
nursechart_df = nursechart_df.drop(['Pain Assessment', 'GCS Total', 'Score (Glasgow Coma Scale)',
                                    'Value', 'Sedation Scale', 'Delirium Scale'], axis=1)
nursechart_df.head()
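The merge of the two Glasgow coma score columns could also be written without a row-wise `apply`. The sketch below shows this alternative on toy data with plain pandas; the values are made up, and it assumes both columns are numeric with `NaN` marking missing entries.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): each row has at most one of the two scores
toy_df = pd.DataFrame({
    'GCS Total': [15.0, np.nan, np.nan],
    'Score (Glasgow Coma Scale)': [np.nan, 9.0, np.nan],
})

# Take `GCS Total` and fill its gaps with `Score (Glasgow Coma Scale)`
toy_df['glasgow_coma_score'] = toy_df['GCS Total'].fillna(toy_df['Score (Glasgow Coma Scale)'])
print(toy_df)
```

Rows where both columns are missing stay missing in the merged column, which leaves them for a later imputation step to handle.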

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['nursingchartcelltypecat', 'nursingchartvalue']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [nursechart_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for feature, n_unique in zip(new_cat_feat, cat_feat_nunique):
    if n_unique > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(feature)
        new_cat_embed_feat.append(feature)

In [None]:
nursechart_df[new_cat_feat].head()

In [None]:
for feature in new_cat_embed_feat:
    # Prepare for embedding, i.e. enumerate categories
    nursechart_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(nursechart_df, feature)

In [None]:
nursechart_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
nursechart_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
nursechart_df['ts'] = nursechart_df['nursingchartoffset']
nursechart_df = nursechart_df.drop('nursingchartoffset', axis=1)
nursechart_df.head()

Remove duplicate rows:

In [None]:
len(nursechart_df)

In [None]:
nursechart_df = nursechart_df.drop_duplicates()
nursechart_df.head()

In [None]:
len(nursechart_df)

Set `ts` as the index and sort by it, so that it is easier to merge with other dataframes later:

In [None]:
nursechart_df = nursechart_df.set_index('ts').sort_index()
nursechart_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
nursechart_df.reset_index().head()

In [None]:
nursechart_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='nursingchartcelltypecat').head()

In [None]:
nursechart_df[nursechart_df.patientunitstayid == 2553254].head(10)

We can see that there are up to 80 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
nursechart_df = du.embedding.join_categorical_enum(nursechart_df, new_cat_embed_feat)
nursechart_df.head()

In [None]:
nursechart_df.dtypes

In [None]:
nursechart_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='nursingchartcelltypecat').head()

In [None]:
nursechart_df[nursechart_df.patientunitstayid == 2553254].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
nursechart_df = nursechart_df.rename(columns={'nursingchartcelltypecat':'nurse_assess_label',
                                                'nursingchartvalue':'nurse_assess_value'})
nursechart_df.head()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
nursechart_df.columns = du.data_processing.clean_naming(nursechart_df.columns)
nursechart_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
nursechart_df.to_csv(f'{data_path}cleaned/unnormalized/nurseCharting.csv')

Save the dataframe after normalizing:

In [None]:
nursechart_df.to_csv(f'{data_path}cleaned/normalized/nurseCharting.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
nursechart_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
nursechart_df = pd.read_csv(f'{data_path}cleaned/normalized/nurseCharting.csv')
nursechart_df.head()

In [None]:
len(nursechart_df)

In [None]:
nursechart_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, nursechart_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()