# Exams Data Preprocessing
---

Reading and preprocessing exams data of the eICU dataset from MIT with data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* lab

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")
# Path to the CSV dataset files
data_path = 'data/eICU/uncompressed/'
# Path to the code files
project_path = 'code/eICU-mortality-prediction/'

In [None]:
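# Make sure that every large operation can be handled, by using the disk as an overflow for the memory.
# The variable is set in the kernel's own environment, as `!export` runs in a separate shell and does not persist.
os.environ['MODIN_OUT_OF_CORE'] = 'true'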
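
As a minimal sketch of why the way this variable is set matters: a command run through `!` executes in a child shell, so a variable exported there does not reach the Python process that later imports Modin, whereas `os.environ` changes the kernel's own environment. The `subprocess` call below stands in for the notebook's `!` syntax (an assumption, so that the example runs as plain Python on a POSIX shell).

```python
import os
import subprocess

# Start from a clean state so the comparison is meaningful
os.environ.pop('MODIN_OUT_OF_CORE', None)

# Exporting inside a child shell: the variable disappears when the shell exits
subprocess.run('export MODIN_OUT_OF_CORE=true', shell=True, check=False)
value_after_shell_export = os.environ.get('MODIN_OUT_OF_CORE')

# Setting it on the current process: visible to every later import
os.environ['MODIN_OUT_OF_CORE'] = 'true'
value_after_environ = os.environ.get('MODIN_OUT_OF_CORE')

print(value_after_shell_export, value_after_environ)
```

The variable has to be set before `modin.pandas` is imported for it to be picked up at start-up.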

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)
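
The internals of `du.set_random_seed` are not shown in this notebook. As a rough sketch of what seeding for reproducibility usually involves, assuming only Python's `random` module and NumPy are in play (the helper name `set_seed_sketch` is hypothetical):

```python
import random

import numpy as np


def set_seed_sketch(seed):
    """Seed the random number generators of `random` and NumPy (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


# Drawing after the same seed gives the same numbers
set_seed_sketch(42)
first_draw = (random.random(), float(np.random.rand()))
set_seed_sketch(42)
second_draw = (random.random(), float(np.random.rand()))
```

A deep learning framework, if used later in the pipeline, would need its own generator seeded as well.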

## Laboratory data

### Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

### Read the data

In [None]:
lab_df = pd.read_csv(f'{data_path}original/lab.csv')
lab_df.head()

In [None]:
len(lab_df)

In [None]:
lab_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
lab_df.describe().transpose()

In [None]:
lab_df.info()

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(lab_df)
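
For readers without access to `data_utils`, a comparable summary can be sketched with plain pandas. This is an approximation of what a missing values report might contain, not the actual output format of `dataframe_missing_values`; the toy dataframe and column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few rows of the lab table (values are illustrative only)
toy_df = pd.DataFrame({
    'labresult': [1.2, np.nan, 3.4, np.nan],
    'labname': ['glucose', 'sodium', None, 'BUN'],
})

# Count the missing values per column and express them as a percentage of the rows
missing_count = toy_df.isnull().sum()
missing_pct = 100 * missing_count / len(toy_df)
missing_summary = pd.DataFrame({
    'missing_count': missing_count,
    'missing_pct': missing_pct,
}).sort_values('missing_pct', ascending=False)
print(missing_summary)
```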

### Merge similar columns

In [None]:
lab_df.labresult.value_counts()

In [None]:
lab_df.labresulttext.value_counts()

In [None]:
lab_df.labresulttext.value_counts().tail(30)

In [None]:
lab_df.labmeasurenamesystem.value_counts()

In [None]:
lab_df.labmeasurenameinterface.value_counts()

~Merge the result columns:~

I will not merge the result columns, so as to make sure that we only have numeric result values:

In [None]:
# lab_df['lab_result'] = lab_df.apply(lambda df: du.data_processing.merge_values(df['labresult'], 
#                                                                                df['labresulttext'],
#                                                                                str_over_num=False, join_strings=False), 
#                                     axis=1)
# lab_df.head(10)
# Just renaming the lab results feature:
lab_df = lab_df.rename(columns={'labresult': 'lab_result'})

~Drop the now redundant `labresult` and `labresulttext` columns:~

In [None]:
# lab_df = lab_df.drop(columns=['labresult', 'labresulttext'])
# lab_df.head()

Merge the measurement unit columns:

In [None]:
lab_df['lab_units'] = lab_df.apply(lambda df: du.data_processing.merge_values(df['labmeasurenamesystem'], 
                                                                              df['labmeasurenameinterface'],
                                                                              str_over_num=True, join_strings=False), 
                                   axis=1)
lab_df.head(10)
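
Applying a Python lambda to every row can be slow on a table of this size. A vectorized sketch of the same idea, assuming the intent is simply "take the first column and fall back to the second when it is missing", uses pandas' `combine_first`. It may differ from `merge_values` when both columns hold different non-missing strings, so treat it as an approximation; the toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two measurement unit columns (values are illustrative only)
toy_df = pd.DataFrame({
    'labmeasurenamesystem': ['mg/dL', np.nan, 'mmol/L', np.nan],
    'labmeasurenameinterface': ['mg/dl', 'K/mcL', np.nan, np.nan],
})

# Keep the first column's value; fill its gaps with the second column's value
toy_df['lab_units'] = toy_df['labmeasurenamesystem'].combine_first(
    toy_df['labmeasurenameinterface']
)
print(toy_df)
```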

Drop the now redundant `labmeasurenamesystem` and `labmeasurenameinterface` columns:

In [None]:
lab_df = lab_df.drop(columns=['labmeasurenamesystem', 'labmeasurenameinterface'])
lab_df.head()

### Remove unneeded features

In [None]:
lab_df.labtypeid.value_counts()

In [None]:
lab_df.labname.value_counts()

Besides removing the row ID `labid` and the time when data was entered `labresultrevisedoffset`, I'm also removing `labresulttext` as it's redundant with `labresult` and has a string format instead of a numeric one.

In [None]:
lab_df = lab_df.drop(columns=['labid', 'labresultrevisedoffset', 'labresulttext'])
lab_df.head()

In [None]:
du.search_explore.dataframe_missing_values(lab_df)

In [None]:
lab_df.info()

### Discretize categorical features

Convert binary categorical features into simple numberings, one-hot encode features with a low number of categories (in this case, at most 5), and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

In the case of microbiology data, we're also going to embed the antibiotic `sensitivitylevel`, not because it has many categories, but because there can be several rows of data per timestamp (which would be impractical on one hot encoded data).

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['labtypeid', 'labname', 'lab_units']
cat_feat.extend(new_cat_feat)

In [None]:
# Skipping this step here as it's very slow for this large dataframe and we already
# know that all of these features are going to be embedded
# cat_feat_nunique = [lab_df[feature].nunique() for feature in du.utils.iterations_loop(new_cat_feat)]
# cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
#     if cat_feat_nunique[i] > 5:
    # Add feature to the list of those that will be embedded
    cat_embed_feat.append(new_cat_feat[i])
    new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
lab_df[new_cat_feat].head()

In [None]:
for i in du.utils.iterations_loop(range(len(new_cat_embed_feat))):
    feature = new_cat_embed_feat[i]
    if feature == 'labtypeid':
        # Skip the `labtypeid` feature as it already has a good numeric format
        continue
    # Prepare for embedding, i.e. enumerate categories
    lab_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(lab_df, feature, nan_value=0,
                                                                                          forbidden_digit=0)
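
To make the enumeration step more concrete, here is a pure Python sketch. It assumes that `nan_value` is the code reserved for missing values and that `forbidden_digit` is a digit that must not appear in any code (so, with 0 forbidden, the code 10 is skipped). These semantics are an assumption about `enum_categorical_feature`, and the helper names are hypothetical.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def enumerate_categories(values, nan_value=0, forbidden_digit=0):
    """Map each distinct category to an integer code (hypothetical helper).

    Missing values get `nan_value`; codes whose decimal representation
    contains `forbidden_digit` are skipped.
    """
    categories = sorted({v for v in values if not is_missing(v)})
    mapping = {}
    code = 0
    for category in categories:
        code += 1
        # Skip codes that collide with the NaN code or contain the forbidden digit
        while code == nan_value or str(forbidden_digit) in str(code):
            code += 1
        mapping[category] = code
    encoded = [nan_value if is_missing(v) else mapping[v] for v in values]
    return encoded, mapping


# Ten made-up categories plus one missing value
toy_labnames = [f'lab_{letter}' for letter in 'abcdefghij'] + [None]
encoded, mapping = enumerate_categories(toy_labnames)
print(mapping)
```

Keeping one digit out of every code leaves it free to act as a separator if several codes are later concatenated for the same timestamp.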

In [None]:
lab_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
lab_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
# Use a context manager so that the file is flushed and closed after writing
with open(f'{data_path}cleaned/cat_embed_feat_enum_lab.yaml', 'w') as stream:
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)
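
A quick way to check that the saved mapping can be recovered is a round trip through a temporary file. The sketch below uses a made-up mapping and a temporary directory rather than the notebook's real paths.

```python
import os
import tempfile

import yaml

# Made-up enumeration mapping, in the same nested shape as `cat_embed_feat_enum`
toy_enum = {
    'labname': {'glucose': 1, 'sodium': 2},
    'lab_units': {'mg/dL': 1, 'mmol/L': 2},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    yaml_path = os.path.join(tmp_dir, 'cat_embed_feat_enum_toy.yaml')
    # Write the mapping in block style, as in the cell above
    with open(yaml_path, 'w') as stream:
        yaml.dump(toy_enum, stream, default_flow_style=False)
    # Read it back with the safe loader
    with open(yaml_path) as stream:
        loaded_enum = yaml.safe_load(stream)

print(loaded_enum == toy_enum)
```

If the mapping holds NumPy integers rather than built-in `int`s, converting them with `int()` before dumping keeps the file plain and loadable by `yaml.safe_load`.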

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
lab_df = lab_df.rename(columns={'labresultoffset': 'ts'})
lab_df.head()

Remove duplicate rows:

In [None]:
len(lab_df)

In [None]:
lab_df = lab_df.drop_duplicates()
lab_df.head()

In [None]:
len(lab_df)

Sort by `ts` to make it easier to merge with other dataframes later:

In [None]:
lab_df = lab_df.sort_values('ts')
lab_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
lab_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='lab_result', n=5).head()
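
The same check can be sketched on a toy dataframe with `size`, which counts rows per group regardless of missing values (whereas `count` reports non-null entries per column). The IDs and results below are made up.

```python
import pandas as pd

# Toy stand-in: two exams share the same unit stay ID and timestamp twice
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2, 2],
    'ts': [10, 10, 20, 5, 5],
    'lab_result': [4.1, 138.0, 4.3, 7.4, 95.0],
})

# Number of rows for each (unit stay, timestamp) pair
rows_per_key = toy_df.groupby(['patientunitstayid', 'ts']).size()
# Keep only the pairs that appear more than once
repeated_keys = rows_per_key[rows_per_key > 1]
print(repeated_keys)
```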

In [None]:
lab_df[(lab_df.patientunitstayid == 3240757) & (lab_df.ts == 162)].head(10)

We can see that there are up to ___ categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, we need to normalize the results by the respective sets of exam name and units, so as to avoid mixing different absolute values.

### Normalize data

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
lab_df, pd = du.utils.convert_dataframe(lab_df, to='pandas')

In [None]:
type(lab_df)

In [None]:
lab_df_norm = du.data_processing.normalize_data(lab_df, columns_to_normalize=False,
                                                columns_to_normalize_categ=[(['labname', 'lab_units'], 'lab_result')],
                                                inplace=True)
lab_df_norm.head()
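
The idea of normalizing each result against its own exam name and unit can be sketched with a grouped z-score in plain pandas. This assumes z-score normalization (subtract the group mean, divide by the group standard deviation); whether `normalize_data` uses exactly this method is not shown here, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two exams measured on very different absolute scales
toy_df = pd.DataFrame({
    'labname': ['glucose'] * 3 + ['sodium'] * 3,
    'lab_units': ['mg/dL'] * 3 + ['mmol/L'] * 3,
    'lab_result': [90.0, 110.0, 100.0, 138.0, 142.0, 140.0],
})

# Compute the statistics within each (exam name, units) group
grouped = toy_df.groupby(['labname', 'lab_units'])['lab_result']
group_mean = grouped.transform('mean')
group_std = grouped.transform('std')

# Z-score of each result relative to its own group
toy_df['lab_result_norm'] = (toy_df['lab_result'] - group_mean) / group_std
print(toy_df)
```

After this step, a glucose value and a sodium value are both expressed as distances from their own group's mean, so they can share a column without one scale dominating the other.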

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
list(set(lab_df_norm.columns) - set(new_cat_embed_feat) - set(['patientunitstayid', 'ts']))

In [None]:
lab_df_norm = du.embedding.join_categorical_enum(lab_df_norm, new_cat_embed_feat, inplace=True)
lab_df_norm.head()
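
As an illustration of what joining rows means here, the sketch below collapses all rows that share a unit stay ID and timestamp into one, concatenating the categorical codes and averaging the numeric result. The semicolon separator and the use of the mean are assumptions made for readability; the exact encoding produced by `join_categorical_enum` is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in: two exams at the same timestamp for stay 1, one exam later on
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1],
    'ts': [10, 10, 20],
    'labname': [3, 7, 4],            # already enumerated categories
    'lab_result': [0.5, -0.5, 1.0],  # already normalized results
})


def join_codes(codes):
    """Concatenate a group's category codes into one string (hypothetical helper)."""
    return ';'.join(str(code) for code in codes)


# One row per (unit stay, timestamp) pair
joined_df = toy_df.groupby(['patientunitstayid', 'ts'], as_index=False).agg(
    labname=('labname', join_codes),
    lab_result=('lab_result', 'mean'),
)
print(joined_df)
```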

Reconvert dataframe to Modin:

In [None]:
lab_df_norm, pd = du.utils.convert_dataframe(lab_df_norm, to='modin')

In [None]:
type(lab_df_norm)

In [None]:
lab_df_norm.dtypes

In [None]:
lab_df_norm.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='lab_result', n=5).head()

In [None]:
lab_df_norm[(lab_df_norm.patientunitstayid == 3240757) & (lab_df_norm.ts == 162)].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be lowercase, with spaces replaced by underscores and commas removed.

In [None]:
lab_df.columns = du.data_processing.clean_naming(lab_df.columns)
lab_df_norm.columns = du.data_processing.clean_naming(lab_df_norm.columns)
lab_df_norm.head()
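
The naming rules described above can be sketched in a few lines of plain Python. The helper `clean_name` is hypothetical and only mirrors the three stated rules; `clean_naming` may handle additional cases.

```python
def clean_name(name):
    """Lowercase a column name, remove commas and replace spaces with underscores."""
    return name.strip().lower().replace(',', '').replace(' ', '_')


# Made-up column names to illustrate each rule
raw_columns = ['Lab Result', 'patientunitstayid', 'Units, Raw']
cleaned_columns = [clean_name(column) for column in raw_columns]
print(cleaned_columns)
```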

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
lab_df.to_csv(f'{data_path}cleaned/unnormalized/lab.csv')

Save the dataframe after normalizing:

In [None]:
lab_df_norm.to_csv(f'{data_path}cleaned/normalized/lab.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
lab_df_norm.describe().transpose()

In [None]:
lab_df.nlargest(columns='lab_result', n=5)