# Treatment Data Preprocessing
---

Reading and preprocessing the treatment data of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* admissionDrug
* infusionDrug
* medication
* treatment
* intakeOutput

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")
# Path to the CSV dataset files
data_path = 'Datasets/Thesis/eICU/uncompressed/'
# Path to the code files
project_path = 'GitHub/eICU-mortality-prediction/'

In [None]:
# import modin.pandas as pd                  # Optimized distributed version of Pandas
import pandas as pd
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)

## Infusion drug data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
inf_drug_df = pd.read_csv(f'{data_path}original/infusionDrug.csv')
inf_drug_df.head()

In [None]:
len(inf_drug_df)

In [None]:
inf_drug_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
inf_drug_df.describe().transpose()

In [None]:
inf_drug_df.columns

In [None]:
inf_drug_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(inf_drug_df)

### Remove unneeded features

Besides removing the row ID `infusiondrugid`, I'm also removing `infusionrate`, `volumeoffluid` and `drugamount`, as they seem redundant with `drugrate` while having many more missing values. The cell below also drops `patientweight`.

In [None]:
inf_drug_df = inf_drug_df.drop(['infusiondrugid', 'infusionrate', 'volumeoffluid',
                              'drugamount', 'patientweight'], axis=1)
inf_drug_df.head()

### Remove string drug rate values

In [None]:
inf_drug_df[inf_drug_df.drugrate.map(du.utils.is_definitely_string)].head()

In [None]:
inf_drug_df[inf_drug_df.drugrate.map(du.utils.is_definitely_string)].drugrate.value_counts()

In [None]:
inf_drug_df.drugrate = inf_drug_df.drugrate.map(lambda x: np.nan if du.utils.is_definitely_string(x) else x)
inf_drug_df.head()

In [None]:
inf_drug_df.patientunitstayid = inf_drug_df.patientunitstayid.astype(int)
inf_drug_df.infusionoffset = inf_drug_df.infusionoffset.astype(int)
inf_drug_df.drugname = inf_drug_df.drugname.astype(str)
inf_drug_df.drugrate = inf_drug_df.drugrate.astype(float)
inf_drug_df.head()
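
For reference, the same clean-up can be sketched with plain pandas. This is only an illustration on made-up values; it does not use `du.utils.is_definitely_string`, whose exact rules are not shown here, and it assumes that any value that cannot be parsed as a number should become NaN.

```python
import pandas as pd

# Made-up sample mixing numeric strings, free text and a missing value
raw = pd.Series(['30', '12.5', 'OFF', None, 'UD'], name='drugrate')

# errors='coerce' turns anything that cannot be parsed as a number into NaN
rates = pd.to_numeric(raw, errors='coerce')
print(rates)
```

Unlike the two-step approach above (mask strings, then cast to float), this does the masking and the cast in a single call.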

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['drugname']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [inf_drug_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])
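
The threshold rule used in the loop above can also be sketched in a vectorized form. The toy dataframe and the names `EMBED_THRESHOLD`, `to_embed` and `to_one_hot` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Toy data: one feature with 6 categories, another with 2
df = pd.DataFrame({
    'drugname': ['a', 'b', 'c', 'd', 'e', 'f'],
    'drugunit': ['mg', 'ml', 'mg', 'ml', 'mg', 'ml'],
})

EMBED_THRESHOLD = 5  # same cut-off as in the cell above

# Number of distinct (non-missing) values per column
n_unique = df.nunique()
to_embed = n_unique[n_unique > EMBED_THRESHOLD].index.tolist()
to_one_hot = n_unique[n_unique <= EMBED_THRESHOLD].index.tolist()
print(to_embed, to_one_hot)
```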

In [None]:
inf_drug_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    inf_drug_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(inf_drug_df, feature, nan_value=0,
                                                                                               forbidden_digit=0)

In [None]:
inf_drug_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
inf_drug_df[new_cat_feat].dtypes
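
As a rough idea of what enumerating a categorical feature means, here is a minimal pandas-only sketch. It is not the implementation of `du.embedding.enum_categorical_feature`: the helper `enumerate_categories` is hypothetical, it only mirrors the `nan_value=0` convention (missing values map to 0), and it makes no attempt to reproduce the effect of `forbidden_digit`.

```python
import pandas as pd

def enumerate_categories(series):
    """Map each distinct non-missing value to an integer starting at 1; missing values map to 0."""
    codes, uniques = pd.factorize(series, sort=True)
    # factorize marks missing values with -1, so shifting by 1 sends them to 0
    encoded = pd.Series(codes + 1, index=series.index, name=series.name)
    mapping = {value: i + 1 for i, value in enumerate(uniques)}
    mapping['nan'] = 0
    return encoded, mapping

# Made-up drug names, including a missing value
drugs = pd.Series(['heparin', None, 'insulin', 'heparin'], name='drugname')
encoded, mapping = enumerate_categories(drugs)
print(encoded.tolist(), mapping)
```

Returning the mapping alongside the encoded column is what makes it possible to save the encoding and decode the integers later.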

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_inf_drug.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
inf_drug_df = inf_drug_df.rename(columns={'infusionoffset': 'ts'})
inf_drug_df.head()

Remove duplicate rows:

In [None]:
len(inf_drug_df)

In [None]:
inf_drug_df = inf_drug_df.drop_duplicates()
inf_drug_df.head()

In [None]:
len(inf_drug_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
inf_drug_df = inf_drug_df.sort_values('ts')
inf_drug_df.head(6)

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
inf_drug_df.dtypes

In [None]:
inf_drug_df, pd = du.utils.convert_dataframe(inf_drug_df, to='pandas')

In [None]:
type(inf_drug_df)

In [None]:
inf_drug_df.dtypes

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
inf_drug_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='drugname', n=5).head()

In [None]:
inf_drug_df[inf_drug_df.patientunitstayid == 1785711].head(20)

We can see that there are up to 17 categories per set of `patientunitstayid` and `ts`, so we must join them. But first, as we shouldn't mix absolute drug rate values from different drugs, we should normalize the rates per drug.

### Normalize data

In [None]:
inf_drug_df.drugrate = inf_drug_df.drugrate.astype(float)

In [None]:
inf_drug_df_norm = du.data_processing.normalize_data(inf_drug_df, columns_to_normalize=False,
                                                    columns_to_normalize_categ=[('drugname', 'drugrate')],
                                                    inplace=True)
inf_drug_df_norm.head()

Prevent infinite drug rate values:

In [None]:
# Replace both positive and negative infinities
inf_drug_df_norm = inf_drug_df_norm.replace(to_replace=[np.inf, -np.inf], value=0)

In [None]:
inf_drug_df_norm.drugrate.max()
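
To make the idea of normalizing per drug concrete, here is a minimal sketch with plain pandas. It assumes a z-score normalization computed separately for each `drugname`; whether `du.data_processing.normalize_data` works exactly this way is an assumption, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'drugname': ['a', 'a', 'a', 'b', 'b'],
    'drugrate': [1.0, 2.0, 3.0, 10.0, 10.0],
})

grouped = df.groupby('drugname')['drugrate']
mean = grouped.transform('mean')
std = grouped.transform('std')

# Groups with zero (or undefined) spread would cause a division by zero,
# so divide by 1 there; their centered values are simply 0
safe_std = std.where(std > 0, 1.0)
df['drugrate_norm'] = (df['drugrate'] - mean) / safe_std
print(df)
```

Guarding the denominator up front is an alternative to replacing infinite values after the fact, as done in the cells above.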

### Join rows that have the same IDs

In [None]:
inf_drug_df_norm = du.embedding.join_categorical_enum(inf_drug_df_norm, new_cat_embed_feat, inplace=True)
inf_drug_df_norm.head()

Reconvert dataframe to Modin:

In [None]:
inf_drug_df_norm, pd = du.utils.convert_dataframe(inf_drug_df_norm, to='modin')

In [None]:
type(inf_drug_df_norm)

In [None]:
inf_drug_df_norm.dtypes

In [None]:
inf_drug_df_norm.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='drugname', n=5).head()

In [None]:
inf_drug_df_norm[inf_drug_df_norm.patientunitstayid == 1785711].head(20)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.
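
The joining step can be pictured with the following pandas-only sketch. It is an illustration under assumptions, not the behaviour of `du.embedding.join_categorical_enum`: here the enumerated categories of rows sharing the same `patientunitstayid` and `ts` are concatenated into a `;`-separated string, and the numeric column is averaged. Both the separator and the averaging are choices made only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [0, 0, 5],
    'drugname': [3, 7, 3],       # already enumerated categories
    'drugrate': [0.5, 1.5, 2.0],
})

joined = (
    df.groupby(['patientunitstayid', 'ts'], as_index=False)
      .agg(drugname=('drugname', lambda codes: ';'.join(str(c) for c in codes)),
           drugrate=('drugrate', 'mean'))
)
print(joined)
```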

### Rename columns

In [None]:
inf_drug_df = inf_drug_df.rename(columns={'drugname': 'infusion_drugname',
                                        'drugrate': 'infusion_drugrate'})
inf_drug_df.head()

In [None]:
inf_drug_df_norm = inf_drug_df_norm.rename(columns={'drugname': 'infusion_drugname',
                                                  'drugrate': 'infusion_drugrate'})
inf_drug_df_norm.head()

### Clean column names

Standardize all column names to lower case, with spaces replaced by underscores and commas removed.

In [None]:
inf_drug_df.columns = du.data_processing.clean_naming(inf_drug_df.columns)
inf_drug_df_norm.columns = du.data_processing.clean_naming(inf_drug_df_norm.columns)
inf_drug_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
inf_drug_df.to_csv(f'{data_path}cleaned/unnormalized/infusionDrug.csv')

Save the dataframe after normalizing:

In [None]:
inf_drug_df_norm.to_csv(f'{data_path}cleaned/normalized/infusionDrug.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
inf_drug_df_norm.describe().transpose()

## Admission drug data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
adms_drug_df = pd.read_csv(f'{data_path}original/admissionDrug.csv')
adms_drug_df.head()

In [None]:
len(adms_drug_df)

In [None]:
adms_drug_df.patientunitstayid.nunique()

There's not much admission drug data (only around 20% of the unit stays have this data). However, it might be useful, considering also that it complements the medication table.

Get an overview of the dataframe through the `describe` method:

In [None]:
adms_drug_df.describe().transpose()

In [None]:
adms_drug_df.columns

In [None]:
adms_drug_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(adms_drug_df)

### Remove unneeded features

In [None]:
adms_drug_df.drugname.value_counts()

In [None]:
adms_drug_df.drugname.nunique()

In [None]:
adms_drug_df.drughiclseqno.value_counts()

In [None]:
adms_drug_df.drughiclseqno.nunique()

In [None]:
adms_drug_df.drugnotetype.value_counts()

In [None]:
adms_drug_df.drugdosage.value_counts()

In [None]:
adms_drug_df.drugunit.value_counts()

In [None]:
adms_drug_df.drugadmitfrequency.value_counts()

In [None]:
adms_drug_df[adms_drug_df.drugdosage == 0].head(20)

In [None]:
adms_drug_df[adms_drug_df.drugdosage == 0].drugunit.value_counts()

In [None]:
adms_drug_df[adms_drug_df.drugdosage == 0].drugadmitfrequency.value_counts()

In [None]:
adms_drug_df[adms_drug_df.drugunit == ' '].drugdosage.value_counts()

Oddly, `drugunit` and `drugadmitfrequency` have several blank values. At the same time, when this happens, `drugdosage` tends to be 0 (which is also an unrealistic value). Considering that no NaNs are reported, these blanks and zeros probably represent missing values.

Besides removing irrelevant or hospital staff related data (e.g. `usertype`), I'm also removing the `drugname` column, which is redundant with the codes in `drughiclseqno`, while also being brand dependent.

In [None]:
adms_drug_df = adms_drug_df[['patientunitstayid', 'drugoffset', 'drugdosage',
                           'drugunit', 'drugadmitfrequency', 'drughiclseqno']]
adms_drug_df.head()

### Fix missing values representation

Replace blank and unrealistic zero values with NaNs.

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
adms_drug_df, pd = du.utils.convert_dataframe(adms_drug_df, to='pandas')

In [None]:
type(adms_drug_df)

In [None]:
adms_drug_df.drugdosage = adms_drug_df.drugdosage.replace(to_replace=0, value=np.nan)
adms_drug_df.drugunit = adms_drug_df.drugunit.replace(to_replace=' ', value=np.nan)
adms_drug_df.drugadmitfrequency = adms_drug_df.drugadmitfrequency.replace(to_replace=' ', value=np.nan)
adms_drug_df.head()

In [None]:
du.search_explore.dataframe_missing_values(adms_drug_df)
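
A slightly more general variant of the replacement above is sketched below on made-up data. It assumes that empty and whitespace-only strings of any length should count as missing, which goes a little beyond the single-space case handled in the cell above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'drugdosage': [0.0, 5.0, 0.0],
    'drugunit': [' ', 'mg', ''],
})

# Zero dosages are treated as missing
df['drugdosage'] = df['drugdosage'].replace(0, np.nan)
# Empty or whitespace-only units are treated as missing
df['drugunit'] = df['drugunit'].replace(r'^\s*$', np.nan, regex=True)
print(df)
```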

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['drugunit', 'drugadmitfrequency', 'drughiclseqno']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [adms_drug_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
adms_drug_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    adms_drug_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(adms_drug_df, feature, nan_value=0,
                                                                                                forbidden_digit=0)

In [None]:
adms_drug_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
adms_drug_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_adms_drug.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
adms_drug_df = adms_drug_df.rename(columns={'drugoffset': 'ts'})
adms_drug_df.head()

Remove duplicate rows:

In [None]:
len(adms_drug_df)

In [None]:
adms_drug_df = adms_drug_df.drop_duplicates()
adms_drug_df.head()

In [None]:
len(adms_drug_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
adms_drug_df = adms_drug_df.sort_values('ts')
adms_drug_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
adms_drug_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='drughiclseqno', n=5).head()

In [None]:
adms_drug_df[adms_drug_df.patientunitstayid == 2346930].head(10)

We can see that there are up to 48 categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, we need to normalize the dosage by the respective sets of drug code and units, so as to avoid mixing different absolute values.

### Normalize data

In [None]:
adms_drug_df_norm = du.data_processing.normalize_data(adms_drug_df, columns_to_normalize=False,
                                                     columns_to_normalize_categ=[(['drughiclseqno', 'drugunit'], 'drugdosage')],
                                                     inplace=True)
adms_drug_df_norm.head()

In [None]:
adms_drug_df_norm = adms_drug_df_norm.sort_values('ts')
adms_drug_df_norm.head()

Prevent infinite values:

In [None]:
adms_drug_df_norm = adms_drug_df_norm.replace(to_replace=np.inf, value=0)

In [None]:
adms_drug_df_norm = adms_drug_df_norm.replace(to_replace=-np.inf, value=0)

In [None]:
adms_drug_df_norm.drugdosage.max()

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
adms_drug_df_norm = du.embedding.join_categorical_enum(adms_drug_df_norm, new_cat_embed_feat, inplace=True)
adms_drug_df_norm.head()

Reconvert dataframe to Modin:

In [None]:
adms_drug_df_norm, pd = du.utils.convert_dataframe(adms_drug_df_norm, to='modin')

In [None]:
type(adms_drug_df_norm)

In [None]:
adms_drug_df_norm.dtypes

In [None]:
adms_drug_df_norm.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='drughiclseqno', n=5).head()

In [None]:
adms_drug_df_norm[adms_drug_df_norm.patientunitstayid == 2346930].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to lower case, with spaces replaced by underscores and commas removed.

In [None]:
adms_drug_df.columns = du.data_processing.clean_naming(adms_drug_df.columns)
adms_drug_df_norm.columns = du.data_processing.clean_naming(adms_drug_df_norm.columns)
adms_drug_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
adms_drug_df.to_csv(f'{data_path}cleaned/unnormalized/admissionDrug.csv')

Save the dataframe after normalizing:

In [None]:
adms_drug_df_norm.to_csv(f'{data_path}cleaned/normalized/admissionDrug.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
adms_drug_df_norm.describe().transpose()

## Medication data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
med_df = pd.read_csv(f'{data_path}original/medication.csv', dtype={'loadingdose': 'object'})
med_df.head()

In [None]:
len(med_df)

In [None]:
med_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
med_df.describe().transpose()

In [None]:
med_df.columns

In [None]:
med_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(med_df)

### Remove unneeded features

In [None]:
med_df.drugname.value_counts()

In [None]:
med_df.drughiclseqno.value_counts()

In [None]:
med_df.dosage.value_counts()

In [None]:
med_df.frequency.value_counts()

In [None]:
med_df.drugstartoffset.value_counts()

In [None]:
med_df[med_df.drugstartoffset == 0].head()

Besides removing less interesting data (e.g. `drugivadmixture`), I'm also removing the `drugname` column, which is redundant with the codes in `drughiclseqno`, while also being brand dependent.

In [None]:
med_df = med_df[['patientunitstayid', 'drugstartoffset', 'drugstopoffset',
                 'drugordercancelled', 'dosage', 'frequency', 'drughiclseqno']]
med_df.head()

### Remove rows where the drug order was cancelled or the drug is not specified

In [None]:
med_df.drugordercancelled.value_counts()

In [None]:
med_df = med_df[~((med_df.drugordercancelled == 'Yes') | (med_df.drughiclseqno.isnull()))]
med_df.head()

Remove the now unneeded `drugordercancelled` column:

In [None]:
med_df = med_df.drop('drugordercancelled', axis=1)
med_df.head()

In [None]:
du.search_explore.dataframe_missing_values(med_df)

### Separating units from dosage

In order to properly take the dosage quantities into account, as well as to be consistent with other tables like admission drugs, we should split the original `dosage` column into the `drugdosage` values and the `drugunit`.

There's no need to create a separate `pyxis` feature, which would indicate the use of the popular automated medication manager, as the frequency embedding will take that into account.

Create dosage and unit features:

In [None]:
med_df['drugdosage'] = np.nan
med_df['drugunit'] = np.nan
med_df.head()

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
med_df, pd = du.utils.convert_dataframe(med_df, to='pandas')

In [None]:
type(med_df)

Get the dosage and unit values for each row:

In [None]:
med_df[['drugdosage', 'drugunit']] = med_df.apply(du.data_processing.set_dosage_and_units, axis=1, result_type='expand')
med_df.head()

In [None]:
med_df.drugunit.value_counts()

Remove the now unneeded `dosage` column:

In [None]:
med_df = med_df.drop('dosage', axis=1)
med_df.head()
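
To illustrate the kind of parsing involved in separating a dosage value from its unit, here is a small sketch. It is not the implementation of `du.data_processing.set_dosage_and_units`; the helper `split_dosage` is hypothetical and assumes dosage strings of the simple form "number, optional space, unit", which may not cover every format in the real `dosage` column.

```python
import re

# A leading number (integer or decimal), then whatever follows as the unit
DOSAGE_PATTERN = re.compile(r'^\s*([0-9]*\.?[0-9]+)\s*(.*?)\s*$')

def split_dosage(dosage):
    """Return (value, unit) from strings such as '500 mg'; (None, None) if there is no leading number."""
    if not isinstance(dosage, str):
        return None, None
    match = DOSAGE_PATTERN.match(dosage)
    if match is None:
        return None, None
    value, unit = match.groups()
    # An empty unit string is reported as None
    return float(value), unit or None

for text in ['500 mg', '2.5 mL', '10', 'PYXIS', None]:
    print(repr(text), '->', split_dosage(text))
```

Strings without a leading number, such as a bare dispenser label, yield no dosage and no unit under this rule.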

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['drugunit', 'frequency', 'drughiclseqno']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [med_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
med_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    med_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(med_df, feature, nan_value=0,
                                                                                          forbidden_digit=0)

In [None]:
med_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
med_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_med.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create drug stop event

Add a timestamp corresponding to when each patient stops taking each medication.

Duplicate every row, so as to create a drug stop event:

In [None]:
new_df = med_df.copy()
new_df.head()

Set the new dataframe's rows to have the drug stop timestamp, with no more information on the drug that was being used:

In [None]:
new_df.drugstartoffset = new_df.drugstopoffset
new_df.drugunit = 0
new_df.drugdosage = np.nan
new_df.frequency = 0
new_df.drughiclseqno = 0
new_df.head()

Join the new rows to the remaining dataframe:

In [None]:
# `DataFrame.append` was removed in pandas 2.0; `concat` is the supported equivalent
med_df = pd.concat([med_df, new_df])
med_df.head()

Remove the now unneeded medication stop column:

In [None]:
med_df = med_df.drop('drugstopoffset', axis=1)
med_df.head(6)
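
The whole stop-event construction can be condensed into the following self-contained sketch on toy data. The values are made up and only a subset of the columns is used; the steps mirror the cells above (copy, move the stop offset into the start column, blank out the drug information, concatenate, drop the stop column).

```python
import numpy as np
import pandas as pd

med = pd.DataFrame({
    'patientunitstayid': [1, 2],
    'drugstartoffset': [10, 20],
    'drugstopoffset': [50, 80],
    'drughiclseqno': [4, 9],
    'drugdosage': [1.0, 2.0],
})

# One extra row per drug, placed at the moment the drug stops
stops = med.copy()
stops['drugstartoffset'] = stops['drugstopoffset']
stops['drughiclseqno'] = 0       # 0 is the "no drug" / missing code
stops['drugdosage'] = np.nan

events = (
    pd.concat([med, stops], ignore_index=True)
      .drop(columns='drugstopoffset')
      .rename(columns={'drugstartoffset': 'ts'})
      .sort_values('ts', kind='stable')
      .reset_index(drop=True)
)
print(events)
```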

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
med_df = med_df.rename(columns={'drugstartoffset': 'ts'})
med_df.head()

Remove duplicate rows:

In [None]:
len(med_df)

In [None]:
med_df = med_df.drop_duplicates()
med_df.head()

In [None]:
len(med_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
med_df = med_df.sort_values('ts')
med_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
med_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='drughiclseqno', n=5).head()

In [None]:
med_df[med_df.patientunitstayid == 979183].head(10)

We can see that there are up to 41 categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, we need to normalize the dosage by the respective sets of drug code and units, so as to avoid mixing different absolute values.

### Normalize data

In [None]:
med_df_norm = du.data_processing.normalize_data(med_df, columns_to_normalize=False,
                                                columns_to_normalize_categ=[(['drughiclseqno', 'drugunit'], 'drugdosage')],
                                                inplace=True)
med_df_norm.head()

In [None]:
med_df_norm = med_df_norm.sort_values('ts')
med_df_norm.head()

Prevent infinite values:

In [None]:
med_df_norm = med_df_norm.replace(to_replace=np.inf, value=0)

In [None]:
med_df_norm = med_df_norm.replace(to_replace=-np.inf, value=0)

In [None]:
med_df_norm.drugdosage.max()

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
list(set(med_df_norm.columns) - set(new_cat_embed_feat) - set(['patientunitstayid', 'ts']))

In [None]:
med_df_norm = du.embedding.join_categorical_enum(med_df_norm, new_cat_embed_feat, inplace=True)
med_df_norm.head()

Reconvert dataframe to Modin:

In [None]:
# Convert the normalized, joined dataframe (not the original `med_df`)
med_df_norm, pd = du.utils.convert_dataframe(med_df_norm, to='modin')

In [None]:
type(med_df_norm)

In [None]:
med_df_norm.dtypes

In [None]:
med_df_norm.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='drughiclseqno', n=5).head()

In [None]:
med_df_norm[med_df_norm.patientunitstayid == 979183].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
med_df_norm = med_df_norm.rename(columns={'frequency':'drugadmitfrequency'})
med_df_norm.head()

### Clean column names

Standardize all column names to lower case, with spaces replaced by underscores and commas removed.

In [None]:
med_df.columns = du.data_processing.clean_naming(med_df.columns)
med_df_norm.columns = du.data_processing.clean_naming(med_df_norm.columns)
med_df_norm.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
med_df.to_csv(f'{data_path}cleaned/unnormalized/medication.csv')

Save the dataframe after normalizing:

In [None]:
med_df_norm.to_csv(f'{data_path}cleaned/normalized/medication.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
med_df_norm.describe().transpose()

In [None]:
med_df.nlargest(columns='drugdosage', n=5)

Although the `drugdosage` looks good in terms of mean (close to 0) and standard deviation (close to 1), its minimum (-88.9) and maximum (174.1) have very large magnitudes. Furthermore, these don't seem to be caused by NaN values, for which the groupby normalization could have been less than ideal. As such, it's hard to say whether these are outliers or realistic values.

[TODO] Check if these very large extreme dosage values make sense and, if not, try to fix them.

## Treatment data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
treat_df = pd.read_csv(f'{data_path}original/treatment.csv')
treat_df.head()

In [None]:
len(treat_df)

In [None]:
treat_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
treat_df.describe().transpose()

In [None]:
treat_df.columns

In [None]:
treat_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(treat_df)

### Remove unneeded features

Besides the usual removal of the row identifier, `treatmentid`, I'm also removing `activeupondischarge`, as we don't have complete information as to when treatments end.

In [None]:
treat_df = treat_df.drop(['treatmentid', 'activeupondischarge'], axis=1)
treat_df.head()

### Separate high level treatment categories

In [None]:
treat_df.treatmentstring.value_counts()

In [None]:
treat_df.treatmentstring.map(lambda x: x.split('|')).head()

In [None]:
treat_df.treatmentstring.map(lambda x: len(x.split('|'))).min()

In [None]:
treat_df.treatmentstring.map(lambda x: len(x.split('|'))).max()

The treatment string always has at least 3 levels. It could be beneficial to extract the first 3 levels into separate features, with the last one keeping everything until the end of the string, so that the model doesn't need to learn similarities that are already known.

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='|')).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='|')).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='|')).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 3, separator='|')).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='|')).value_counts()

In [None]:
treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='|')).value_counts()

In [None]:
treat_df['treatmenttype'] = treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='|'))
treat_df['treatmenttherapy'] = treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='|'))
treat_df['treatmentdetails'] = treat_df.treatmentstring.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='|', till_the_end=True))
treat_df.head()

Remove the now redundant `treatmentstring` column:

In [None]:
treat_df = treat_df.drop('treatmentstring', axis=1)
treat_df.head()

In [None]:
treat_df.treatmenttype.value_counts()

In [None]:
treat_df.treatmenttherapy.value_counts()

In [None]:
treat_df.treatmentdetails.value_counts()
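
The three-way split of the treatment string can also be sketched with pandas string methods alone. The sample strings below are illustrative, not taken from the dataset, and this is an alternative to the `du.search_explore.get_element_from_split` calls rather than a description of how that helper works.

```python
import pandas as pd

# Illustrative strings in the 'level|level|level|...' format
treatments = pd.Series([
    'cardiovascular|shock|vasopressors|norepinephrine',
    'pulmonary|ventilation and oxygenation|mechanical ventilation',
], name='treatmentstring')

# n=2 performs at most two splits, so the third part keeps the rest of the string
parts = treatments.str.split('|', n=2, expand=True)
parts.columns = ['treatmenttype', 'treatmenttherapy', 'treatmentdetails']
print(parts)
```

Limiting the number of splits plays the same role as `till_the_end=True` in the cell above: deeper levels stay together in the last feature.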

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['treatmenttype', 'treatmenttherapy', 'treatmentdetails']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [treat_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
treat_df[new_cat_feat].head()

In [None]:
for feature in new_cat_embed_feat:
    # Prepare for embedding, i.e. enumerate categories
    treat_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(treat_df, feature, nan_value=0,
                                                                                            forbidden_digit=0)
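
To make the enumeration step more concrete, here is a minimal sketch of what such an encoding could look like. The helper `enumerate_categories` is hypothetical and is not the `data_utils` implementation. It assumes that `nan_value` is the code reserved for missing values and that `forbidden_digit` means no code may contain that digit (so that the digit stays free to act as a separator later on):

```python
import pandas as pd

def enumerate_categories(series, nan_value=0, forbidden_digit=0):
    """Map each unique category to an integer code (hypothetical helper)."""
    mapping = {}
    code = 0
    for category in sorted(series.dropna().unique()):
        code += 1
        # Skip codes that contain the forbidden digit (e.g. 10, 20, 101)
        while str(forbidden_digit) in str(code):
            code += 1
        mapping[category] = code
    # Missing values get the reserved `nan_value` code
    encoded = series.map(mapping).fillna(nan_value).astype(int)
    return encoded, mapping

# Toy feature with 11 categories and one missing value
toy_series = pd.Series(list('abcdefghijk') + [None])
encoded, mapping = enumerate_categories(toy_series)
print(mapping)
print(encoded.tolist())
```

In this toy run the tenth category skips code 10 and gets 11 instead, since 10 contains the forbidden digit.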

In [None]:
treat_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
treat_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is always closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_treat.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)
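
As a quick sanity check, a saved mapping can be loaded back and compared with the original. The sketch below does this round trip on a toy dictionary in a temporary folder. The file name and the values are illustrative only:

```python
import os
import tempfile
import yaml

# Toy enumeration dictionary (illustrative values only)
toy_enum = {'treatmenttype': {'cardiovascular': 1, 'pulmonary': 2, 'renal': 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy_enum.yaml')
    # Write the dictionary to disk
    with open(file_path, 'w') as stream:
        yaml.safe_dump(toy_enum, stream, default_flow_style=False)
    # Read it back into a new dictionary
    with open(file_path, 'r') as stream:
        loaded_enum = yaml.safe_load(stream)

print(loaded_enum == toy_enum)
```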

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
treat_df = treat_df.rename(columns={'treatmentoffset': 'ts'})
treat_df.head()

Remove duplicate rows:

In [None]:
len(treat_df)

In [None]:
treat_df = treat_df.drop_duplicates()
treat_df.head()

In [None]:
len(treat_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
treat_df = treat_df.sort_values('ts')
treat_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
treat_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='treatmenttype', n=5).head()
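
An alternative way of running this check is with `size`, which counts rows per group regardless of missing values, whereas `count` only counts the non-null values of each column. A small sketch on toy data:

```python
import pandas as pd

# Toy data with repeated (patientunitstayid, ts) pairs
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2, 2],
    'ts': [10, 10, 20, 5, 5],
    'treatmenttype': [3, 7, 3, 2, 4],
})

# Number of rows for each unit stay ID and timestamp pair
rows_per_key = toy_df.groupby(['patientunitstayid', 'ts']).size()
print(rows_per_key.sort_values(ascending=False).head())
print('Maximum rows per key:', rows_per_key.max())
```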

In [None]:
treat_df[treat_df.patientunitstayid == 1352520].head(10)

We can see that there are up to 105 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

Convert dataframe to Pandas, as the groupby operation in `join_categorical_enum` isn't working properly with Modin:

In [None]:
treat_df, pd = du.utils.convert_dataframe(treat_df, to='pandas')

In [None]:
type(treat_df)

In [None]:
treat_df = du.embedding.join_categorical_enum(treat_df, new_cat_embed_feat, inplace=True)
treat_df.head()
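
To illustrate the idea behind this join, the sketch below concatenates the category codes of rows that share the same IDs into a single number, using the digit 0 as the separator. This is only an assumption about how `join_categorical_enum` behaves, and `join_codes` is a hypothetical helper, not the `data_utils` code:

```python
import pandas as pd

# Toy enumerated data with two rows for the same unit stay ID and timestamp
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [10, 10, 20, 5],
    'treatmenttype': [3, 12, 3, 7],
})

def join_codes(codes, separator='0'):
    """Concatenate the unique codes of one group into a single integer (hypothetical helper)."""
    unique_codes = sorted(set(codes))
    return int(separator.join(str(code) for code in unique_codes))

joined_df = (toy_df.groupby(['patientunitstayid', 'ts'], as_index=False)
                   .agg({'treatmenttype': join_codes}))
print(joined_df)
```

Under this assumption, no individual code may contain the digit 0, otherwise the joined number could not be split back into its original categories.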

Reconvert dataframe to Modin:

In [None]:
treat_df, pd = du.utils.convert_dataframe(treat_df, to='modin')

In [None]:
type(treat_df)

In [None]:
treat_df.dtypes

In [None]:
treat_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='treatmenttype', n=5).head()

In [None]:
treat_df[treat_df.patientunitstayid == 1352520].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
treat_df.columns = du.data_processing.clean_naming(treat_df.columns)
treat_df.head()
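
The three rules stated above can be sketched in plain Python as follows. `clean_column_name` is a hypothetical helper that covers only those rules. `clean_naming` may apply further changes that aren't reproduced here:

```python
def clean_column_name(name):
    """Lower case, underscores instead of spaces, no commas (hypothetical helper)."""
    return name.lower().replace(' ', '_').replace(',', '')

# Toy column names (illustrative only)
toy_columns = ['patientunitstayid', 'Bodyweight (kg)', 'Stool_output', 'Dose, Total']
clean_columns = [clean_column_name(col) for col in toy_columns]
print(clean_columns)
```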

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
treat_df.to_csv(f'{data_path}cleaned/unnormalized/treatment.csv')

Save the same dataframe to the normalized folder as well (no normalization step is applied to this table between the two saves):

In [None]:
treat_df.to_csv(f'{data_path}cleaned/normalized/treatment.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
treat_df.describe().transpose()

## Intake output data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
in_out_df = pd.read_csv(f'{data_path}original/intakeOutput.csv')
in_out_df.head()

In [None]:
len(in_out_df)

In [None]:
in_out_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
in_out_df.describe().transpose()

In [None]:
in_out_df.info()

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(in_out_df)

### Remove unneeded features

In [None]:
in_out_df.celllabel.value_counts()

In [None]:
in_out_df.celllabel.value_counts().head(50)

In [None]:
in_out_df.cellvaluenumeric.value_counts()

In [None]:
in_out_df.cellvaluetext.value_counts()

In [None]:
in_out_df.cellpath.value_counts()

Besides the usual removal of the row identifier, `intakeoutputid`, and the timestamp of when the data was entered, `intakeoutputentryoffset`, I'm also removing the cumulative features `intaketotal`, `outputtotal`, `dialysistotal`, and `nettotal` (cumulative values can grow very large, which could hurt the neural networks, and we're looking for data of the current moment), as well as `cellvaluetext`, which is redundant with `cellvaluenumeric`.

In [None]:
in_out_df = in_out_df.drop(columns=['intakeoutputid', 'intakeoutputentryoffset',
                                    'intaketotal', 'outputtotal', 'dialysistotal',
                                    'nettotal', 'cellvaluetext'])
in_out_df.head()

Additionally, we're going to focus on the most common data categories, ignoring rarer ones and those that could be redundant with data from other tables (e.g. infusion drugs).

In [None]:
categories_to_keep = ['Urine', 'URINE CATHETER', 'Urinary Catheter Output: Indwelling/Continuous Ure',
                      'Bodyweight (kg)', 'P.O.', 'P.O. Intake', 'Oral Intake', 'Oral',
                      'Stool', 'Crystalloids', 'Indwelling Catheter Output', 'Nutrition Total', 
                      'Enteral/Gastric Tube Intake', 'pRBCs', 'Gastric (NG)', 
                      'propofol', 'LR', 'LR IVF', 'I.V.', 'Voided Amount', 'Out', 
                      'Feeding Tube Flush/Water Bolus Amt (mL)', 'norepinephrine', 
                      'Saline Flush (mL)', 'Volume Given (mL)', 
                      'Actual Patient Fluid Removal', 'fentaNYL']

In [None]:
(in_out_df.celllabel.isin(categories_to_keep)).head()

In [None]:
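# Copy the filtered rows, so that later assignments don't act on a view of the original dataframe
in_out_df = in_out_df[in_out_df.celllabel.isin(categories_to_keep)].copy()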
in_out_df.head()

### Merge redundant data

Urine, oral intake and lactated Ringer's solution are each labeled in several different ways. We need to merge each group into a single representation, so that the machine learning models can learn more efficiently.

Unify urine data:

In [None]:
urine_labels = ['Urine', 'URINE CATHETER', 'Urinary Catheter Output: Indwelling/Continuous Ure']

In [None]:
in_out_df.loc[in_out_df.celllabel.isin(urine_labels), 'celllabel'] = 'urine'

Unify oral data:

In [None]:
oral_labels = ['P.O.', 'P.O. Intake', 'Oral Intake', 'Oral']

In [None]:
in_out_df.loc[in_out_df.celllabel.isin(oral_labels), 'celllabel'] = 'oral'

Unify lactated ringer data:

In [None]:
lr_labels = ['LR', 'LR IVF']

In [None]:
in_out_df.loc[in_out_df.celllabel.isin(lr_labels), 'celllabel'] = 'lr'

In [None]:
in_out_df.celllabel.value_counts()
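
The same unification can also be expressed in a single step with a mapping dictionary and `replace`, as sketched below on a toy series. Labels that aren't in the dictionary are left unchanged:

```python
import pandas as pd

# Toy labels (a subset of the kept categories)
toy_labels = pd.Series(['Urine', 'URINE CATHETER', 'P.O.', 'Oral Intake', 'LR IVF', 'Stool'])

# Map every raw label to its unified name
label_map = {'Urine': 'urine', 'URINE CATHETER': 'urine',
             'P.O.': 'oral', 'Oral Intake': 'oral',
             'LR': 'lr', 'LR IVF': 'lr'}
unified_labels = toy_labels.replace(label_map)
print(unified_labels.value_counts())
```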

### Distinguish intake from output data

For each label, separate into an intake and an output feature.

Get intake / output indicator:

In [None]:
in_out_df['in_out_indctr'] = in_out_df.cellpath.apply(lambda x: du.search_explore.get_element_from_split(x, 3, 
                                                                                                         separator='|'))
in_out_df.head()

In [None]:
in_out_df['in_out_indctr'].value_counts()

Add the appropriate intake or output indication to each label:

In [None]:
in_out_df.loc[in_out_df.in_out_indctr == 'Intake (ml)', 'celllabel'] += '_intake'

In [None]:
in_out_df.loc[in_out_df.in_out_indctr == 'Output (ml)', 'celllabel'] += '_output'

In [None]:
in_out_df.celllabel.value_counts()
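
The whole indicator logic can be summarized in the following sketch, which runs on toy rows. The `cellpath` values below are illustrative and merely assumed to follow the same `|`-separated pattern as the real data, with the intake / output indicator in the fourth position:

```python
import pandas as pd

# Toy rows (illustrative cellpath values)
toy_df = pd.DataFrame({
    'cellpath': ['flowsheet|Flowsheet Cell Labels|I&O|Output (ml)|Urine',
                 'flowsheet|Flowsheet Cell Labels|I&O|Intake (ml)|P.O.',
                 'flowsheet|Flowsheet Cell Labels|I&O|Dialysis (ml)|Out'],
    'celllabel': ['urine', 'oral', 'Out'],
})

# Fourth element of the path (index 3) tells if the row is an intake or an output
indicator = toy_df.cellpath.str.split('|').str[3]
# Rows that are neither intake nor output keep their label unchanged
suffix = indicator.map({'Intake (ml)': '_intake', 'Output (ml)': '_output'}).fillna('')
toy_df['celllabel'] = toy_df.celllabel + suffix
print(toy_df.celllabel.tolist())
```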

Remove the now unneeded intake / output indicator and path columns:

In [None]:
in_out_df = in_out_df.drop(columns=['in_out_indctr', 'cellpath'])
in_out_df.head()

### Convert categories to features

Transform the `celllabel` categories and `cellvaluenumeric` values into separate features:

In [None]:
in_out_df = du.data_processing.category_to_feature(in_out_df, categories_feature='celllabel',
                                                   values_feature='cellvaluenumeric', min_len=1000, inplace=True)
in_out_df.head()
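
Conceptually, this transformation creates one column per category, holding the value on rows of that category and NaN elsewhere. The sketch below shows that idea with plain pandas on toy data. It is not the `category_to_feature` implementation and it doesn't reproduce the `min_len` argument:

```python
import pandas as pd

# Toy long-format data: one row per (label, value) pair
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [10, 10, 5],
    'celllabel': ['urine_output', 'oral_intake', 'urine_output'],
    'cellvaluenumeric': [250.0, 100.0, 400.0],
})

for category in toy_df.celllabel.unique():
    # Keep the value on rows of this category, NaN everywhere else
    toy_df[category] = toy_df.cellvaluenumeric.where(toy_df.celllabel == category)
print(toy_df)
```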

Now we have the categories separated into their own features, as desired.

Remove the old `celllabel` and `cellvaluenumeric` columns:

In [None]:
in_out_df = in_out_df.drop(['celllabel', 'cellvaluenumeric'], axis=1)
in_out_df.head()

In [None]:
in_out_df['urine_output'].value_counts()

In [None]:
in_out_df['oral_intake'].value_counts()

In [None]:
in_out_df['Bodyweight (kg)'].value_counts()

In [None]:
in_out_df['Stool_output'].value_counts()

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
in_out_df = in_out_df.rename(columns={'intakeoutputoffset': 'ts'})
in_out_df.head()

Remove duplicate rows:

In [None]:
len(in_out_df)

In [None]:
in_out_df = in_out_df.drop_duplicates()
in_out_df.head()

In [None]:
len(in_out_df)

Sort by `ts` so as to be easier to merge with other dataframes later:

In [None]:
in_out_df = in_out_df.sort_values('ts')
in_out_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
in_out_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='urine_output', n=5).head()

In [None]:
in_out_df[(in_out_df.patientunitstayid == 433661) & (in_out_df.ts == 661)].head(10)

We can see that there are up to 2 rows per set of `patientunitstayid` and `ts`. As such, we must join them. However, this scenario is different from the previous tables: since all features are numeric, we just need to average them.

### Join rows that have the same IDs

Convert dataframe to Pandas, as the groupby operation in `join_categorical_enum` isn't working properly with Modin:

In [None]:
in_out_df, pd = du.utils.convert_dataframe(in_out_df, to='pandas')

In [None]:
type(in_out_df)

In [None]:
in_out_df = du.embedding.join_categorical_enum(in_out_df, cont_join_method='mean', inplace=True)
in_out_df.head()
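
For purely numeric features, joining by mean is essentially a grouped average. A minimal pandas sketch on toy data follows. Note that `mean` skips missing values, so complementary rows get merged into a single complete row:

```python
import pandas as pd

# Toy data with two complementary rows for the same unit stay ID and timestamp
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [10, 10, 5],
    'urine_output': [250.0, None, 400.0],
    'oral_intake': [None, 100.0, None],
})

# Average each feature over the rows that share the same IDs
joined_df = toy_df.groupby(['patientunitstayid', 'ts'], as_index=False).mean()
print(joined_df)
```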

Reconvert dataframe to Modin:

In [None]:
in_out_df, pd = du.utils.convert_dataframe(in_out_df, to='modin')

In [None]:
type(in_out_df)

In [None]:
in_out_df.dtypes

In [None]:
in_out_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='urine_output', n=5).head()

In [None]:
in_out_df[(in_out_df.patientunitstayid == 433661) & (in_out_df.ts == 661)].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
in_out_df = in_out_df.rename(columns={'Out': 'dialysis_output', 
                                      'Indwelling Catheter Output_output': 'indwellingcatheter_output',
                                      'Voided Amount_output' : 'voided_amount',
                                      'Feeding Tube Flush/Water Bolus Amt (mL)_intake': 'feeding_tube_flush_ml',
                                      'Volume Given (mL)_intake': 'volume_given_ml',
                                      'Actual Patient Fluid Removal_output': 'patient_fluid_removal'})
in_out_df.head()

### Clean column names

Standardize all column names to be in lower case, with spaces replaced by underscores and commas removed.

In [None]:
in_out_df.columns = du.data_processing.clean_naming(in_out_df.columns)
in_out_df.head()

### Normalize data

Save the dataframe before normalizing:

In [None]:
in_out_df.to_csv(f'{data_path}cleaned/unnormalized/intakeOutput.csv')

In [None]:
in_out_df_norm = du.data_processing.normalize_data(in_out_df, inplace=True)
in_out_df_norm.head()
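
As an illustration of what a normalization step can look like, the sketch below applies a z-score normalization to the feature columns of a toy dataframe, leaving the ID and timestamp columns untouched. Whether `normalize_data` uses this exact formula or excludes these exact columns is an assumption:

```python
import pandas as pd

# Toy data (illustrative values only)
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2, 2],
    'ts': [0, 60, 0, 60],
    'urine_output': [100.0, 300.0, 200.0, 400.0],
})

# Only normalize the features, not the ID and timestamp columns (assumption)
feature_cols = [col for col in toy_df.columns if col not in ('patientunitstayid', 'ts')]
means = toy_df[feature_cols].mean()
stds = toy_df[feature_cols].std()

# z-score: subtract the mean and divide by the standard deviation
toy_norm_df = toy_df.copy()
toy_norm_df[feature_cols] = (toy_df[feature_cols] - means) / stds
print(toy_norm_df)
```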

### Save the dataframe

Save the dataframe after normalizing:

In [None]:
in_out_df.to_csv(f'{data_path}cleaned/normalized/intakeOutput.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
in_out_df.describe().transpose()

In [None]:
du.search_explore.dataframe_missing_values(in_out_df)