# Exams Data Preprocessing
---

Reading and preprocessing the exams data of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

This notebook addresses the preprocessing of the following eICU tables:
* lab

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../../..")
# Path to the CSV dataset files
data_path = 'data/eICU/uncompressed/'
# Path to the code files
project_path = 'code/eICU-mortality-prediction/'

In [None]:
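# Make sure that every large operation can be handled, by using the disk as an overflow for the memory
# (set through os.environ, as `!export` only affects a temporary subshell and not this kernel)
os.environ['MODIN_OUT_OF_CORE'] = 'true'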
# Another trick to do with Pandas so as to be able to allocate bigger objects to memory
!sudo bash -c 'echo 1 > /proc/sys/vm/overcommit_memory'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
# import pandas as pd
import data_utils as du                    # Data science and machine learning relevant methods

Allow pandas to show more columns:

In [None]:
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

Set the random seed for reproducibility:

In [None]:
du.set_random_seed(42)

Set the maximum number of categories to keep in each categorical feature:

In [None]:
MAX_CATEGORIES = 250

## Laboratory data

### Initialize variables

In [None]:
# List of categorical features
cat_feat = []
# Dictionary of the one hot encoded columns originating from each categorical feature, that will be embedded
cat_feat_ohe = dict()

### Read the data

In [None]:
lab_df = pd.read_csv(f'{data_path}original/lab.csv')
lab_df.head()

In [None]:
len(lab_df)

In [None]:
lab_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
lab_df.describe().transpose()

In [None]:
lab_df.info()

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(lab_df)

### Merge similar columns

In [None]:
lab_df.labresult.value_counts()

In [None]:
lab_df.labresulttext.value_counts()

In [None]:
lab_df.labresulttext.value_counts().tail(30)

In [None]:
lab_df.labmeasurenamesystem.value_counts()

In [None]:
lab_df.labmeasurenameinterface.value_counts()

~Merge the result columns:~

I will not merge the result columns, so as to make sure that we only have numeric result values:

In [None]:
# lab_df['lab_result'] = lab_df.apply(lambda df: du.data_processing.merge_values(df['labresult'],
#                                                                                df['labresulttext'],
#                                                                                str_over_num=False, join_strings=False),
#                                     axis=1)
# lab_df.head(10)
# Just renaming the lab results feature:
lab_df = lab_df.rename(columns={'labresult': 'lab_result'})

~Drop the now redundant `labresult` and `labresulttext` columns:~

In [None]:
# lab_df = lab_df.drop(columns=['labresult', 'labresulttext'])
# lab_df.head()

Merge the measurement unit columns:

In [None]:
lab_df['lab_units'] = lab_df.apply(lambda df: du.data_processing.merge_values(df['labmeasurenamesystem'],
                                                                              df['labmeasurenameinterface'],
                                                                              str_over_num=True, join_strings=False),
                                   axis=1)
lab_df.head(10)
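
As a rough illustration of this kind of column merge, here is a minimal sketch in plain pandas on a made-up toy dataframe. It assumes that the merge keeps the first column's value when it exists and falls back to the second one otherwise; the exact behaviour of `du.data_processing.merge_values` may differ, and `combine_first` is just a stand-in for it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two measurement unit columns (values are made up)
toy_df = pd.DataFrame({
    'labmeasurenamesystem': ['mg/dL', np.nan, 'mmol/L', np.nan],
    'labmeasurenameinterface': ['mg/dl', 'K/mcL', np.nan, np.nan],
})

# Keep the first column's value when present, otherwise fall back to the second
toy_df['lab_units'] = toy_df['labmeasurenamesystem'].combine_first(
    toy_df['labmeasurenameinterface']
)
print(toy_df)
```

Being vectorized, this avoids the row by row `apply`, which can matter on a table of this size.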

Drop the now redundant `labmeasurenamesystem` and `labmeasurenameinterface` columns:

In [None]:
lab_df = lab_df.drop(columns=['labmeasurenamesystem', 'labmeasurenameinterface'])
lab_df.head()

### Remove unneeded features

In [None]:
lab_df.labtypeid.value_counts()

In [None]:
lab_df.labname.value_counts()

Besides removing the row ID `labid` and the time when data was entered `labresultrevisedoffset`, I'm also removing `labresulttext` as it's redundant with `labresult` and has a string format instead of a numeric one.

In [None]:
lab_df = lab_df.drop(columns=['labid', 'labresultrevisedoffset', 'labresulttext'])
lab_df.head()

In [None]:
du.search_explore.dataframe_missing_values(lab_df)

In [None]:
lab_df.info()

### Normalize data

Update list of categorical features:

In [None]:
cat_feat = ['labtypeid', 'labname', 'lab_units']

Filter just to the most common categories:

In [None]:
for col in cat_feat:
    most_common_cat = list(lab_df[col].value_counts().nlargest(MAX_CATEGORIES).index)
    lab_df = lab_df[lab_df[col].isin(most_common_cat)]
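
The same filtering pattern can be tried on a small, made-up example, using plain pandas and a tiny `MAX_CATEGORIES` just for illustration:

```python
import pandas as pd

MAX_CATEGORIES = 2  # tiny value, only for this toy example

toy_df = pd.DataFrame({
    'labname': ['glucose', 'glucose', 'sodium', 'sodium', 'sodium', 'rare exam', None]
})

# Categories ordered by frequency, keeping only the MAX_CATEGORIES most common ones
most_common_cat = list(toy_df['labname'].value_counts().nlargest(MAX_CATEGORIES).index)
# Keep just the rows whose category is in that list
filtered_df = toy_df[toy_df['labname'].isin(most_common_cat)]
print(filtered_df)
```

Note that `value_counts` ignores missing values by default, so rows with a missing category are also removed by this filter. Since the loop filters one column at a time, the counts of each later column are computed on the already filtered dataframe.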

Convert dataframe to Pandas, as the next cells aren't working properly with Modin:

In [None]:
lab_df, pd = du.utils.convert_dataframe(lab_df, to='pandas')

In [None]:
type(lab_df)

In [None]:
lab_df.dtypes

Fix the dtypes:

In [None]:
lab_df.patientunitstayid = lab_df.patientunitstayid.astype('uint')
lab_df.labresultoffset = lab_df.labresultoffset.astype('int')
lab_df.lab_result = lab_df.lab_result.astype(float)

Normalize the data:

In [None]:
lab_df, mean, std = du.data_processing.normalize_data(lab_df, columns_to_normalize=False,
                                                columns_to_normalize_categ=[(['labname', 'lab_units'], 'lab_result')],
                                                get_stats=True, inplace=True)
lab_df.head()
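
To make the idea of normalizing by category more concrete, below is a minimal sketch on made-up values. It assumes a z-score normalization computed separately for each pair of exam name and units, using the sample standard deviation; this is an assumption about what `normalize_data` does with `columns_to_normalize_categ`, not a copy of its implementation.

```python
import pandas as pd

# Toy lab results for two exams with very different absolute scales
toy_df = pd.DataFrame({
    'labname': ['glucose', 'glucose', 'glucose', 'sodium', 'sodium', 'sodium'],
    'lab_units': ['mg/dL'] * 3 + ['mmol/L'] * 3,
    'lab_result': [90.0, 100.0, 110.0, 135.0, 140.0, 145.0],
})

# Statistics are computed inside each (exam name, units) group
groups = toy_df.groupby(['labname', 'lab_units'])['lab_result']
toy_df['lab_result_norm'] = (
    (toy_df['lab_result'] - groups.transform('mean')) / groups.transform('std')
)
print(toy_df)
```

After this step both exams end up on the same scale, even though their original values are far apart.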

Save a dictionary with the mean and standard deviation values of each column that was normalized:

In [None]:
norm_stats = dict()
for key, _ in mean.items():
    norm_stats[key] = dict()
    norm_stats[key]['mean'] = mean[key]
    norm_stats[key]['std'] = std[key]
norm_stats

In [None]:
# The context manager closes the file, making sure that the YAML is fully written to disk
with open(f'{data_path}cleaned/lab_norm_stats.yaml', 'w') as stream:
    yaml.dump(norm_stats, stream, default_flow_style=False)
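
One possible use of these saved statistics is to revert the normalization later on. The sketch below assumes a z-score normalization and a hypothetical layout with one `mean` and `std` entry per exam; the real keys produced by `normalize_data` may be structured differently, and the numbers are made up.

```python
# Hypothetical stats layout: one entry per exam, each with its mean and std
norm_stats = {
    'glucose': {'mean': 100.0, 'std': 10.0},
    'sodium': {'mean': 140.0, 'std': 5.0},
}


def denormalize(value, exam, stats):
    """Invert a z-score: original = normalized * std + mean."""
    return value * stats[exam]['std'] + stats[exam]['mean']


original = denormalize(-1.0, 'glucose', norm_stats)
print(original)
```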

### Discretize categorical features

Convert the categorical features into one hot encoded columns, which can later be embedded or used as is.

#### One hot encode features

In [None]:
lab_df[cat_feat].head()

Apply one hot encoding:

In [None]:
lab_df, new_columns = du.data_processing.one_hot_encoding_dataframe(lab_df, columns=cat_feat,
                                                                    join_rows=False,
                                                                    get_new_column_names=True,
                                                                    inplace=True)
lab_df

In [None]:
lab_df.dtypes

Save the association between the original categorical features and the new one hot encoded columns:

In [None]:
for orig_col in cat_feat:
    cat_feat_ohe[orig_col] = [ohe_col for ohe_col in new_columns
                              if ohe_col.startswith(orig_col)]

In [None]:
cat_feat_ohe
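
For reference, the encoding and the mapping can be reproduced on a toy dataframe with plain pandas. Here `pd.get_dummies` is only a stand-in for `one_hot_encoding_dataframe`, and the `<feature>_<category>` column naming is pandas' default, which I assume to be similar to what the `data_utils` method produces.

```python
import pandas as pd

cat_feat = ['labname', 'lab_units']
toy_df = pd.DataFrame({
    'labname': ['glucose', 'sodium'],
    'lab_units': ['mg/dL', 'mmol/L'],
    'lab_result': [0.5, -0.5],
})

# One hot encode the categorical features
ohe_df = pd.get_dummies(toy_df, columns=cat_feat)
# The new columns are those that didn't exist in the original dataframe
new_columns = [col for col in ohe_df.columns if col not in toy_df.columns]

# Associate each original feature with its one hot encoded columns
cat_feat_ohe = {
    orig_col: [ohe_col for ohe_col in new_columns if ohe_col.startswith(orig_col + '_')]
    for orig_col in cat_feat
}
print(cat_feat_ohe)
```

Matching on `orig_col + '_'` instead of just `orig_col` is a slightly stricter variant, which would avoid mixing up features whose names start with the same characters.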

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categorical features to their new one hot encoded columns.

In [None]:
# The context manager closes the file, making sure that the YAML is fully written to disk
with open(f'{data_path}cleaned/cat_feat_ohe_lab.yaml', 'w') as stream:
    yaml.dump(cat_feat_ohe, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
lab_df = lab_df.rename(columns={'labresultoffset': 'ts'})
lab_df.head()

Save the dataframe before dropping duplicates:

In [None]:
lab_df.to_csv(f'{data_path}cleaned/normalized/ohe/lab_before_drop_dupl.csv')

Remove duplicate rows:

In [None]:
len(lab_df)

In [None]:
lab_df = lab_df.drop_duplicates()
lab_df.head()

In [None]:
len(lab_df)

Sort by `ts`, so that it's easier to merge with other dataframes later:

In [None]:
lab_df = lab_df.sort_values('ts')
lab_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
lab_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='lab_result', n=5).head()
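
An alternative way of doing this check, sketched on made-up data, is to count the rows of each pair of unit stay ID and timestamp with `size` and keep only the pairs that appear more than once:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [10, 10, 20, 10],
    'lab_result': [0.1, 0.3, -0.2, 1.2],
})

# Number of rows for each (unit stay ID, timestamp) pair
rows_per_key = toy_df.groupby(['patientunitstayid', 'ts']).size()
# Pairs that show up in more than one row
repeated_keys = rows_per_key[rows_per_key > 1]
print(repeated_keys)
```

Unlike `count`, which counts the non-missing values of each column, `size` counts rows, so it doesn't depend on `lab_result` being filled in.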

In [None]:
lab_df[(lab_df.patientunitstayid == 3240757) & (lab_df.ts == 162)].head(10)

We can see that there are up to ___ categories per set of `patientunitstayid` and `ts`. As such, we must join them. But first, we need to normalize the results by the respective sets of exam name and units, so as to avoid mixing different absolute values.

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
list(set(lab_df.columns) - set(cat_feat) - set(['patientunitstayid', 'ts']))

Save the dataframe before joining rows:

In [None]:
lab_df.to_csv(f'{data_path}cleaned/normalized/ohe/lab_before_join.csv')

In [None]:
lab_df = du.embedding.join_repeated_rows(lab_df, inplace=True)
lab_df.head()
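
As a simplified picture of what joining rows means here, the sketch below groups a toy dataframe by ID and timestamp. The aggregation rules (maximum for the one hot encoded columns, mean for the numeric result) are assumptions chosen for illustration; `du.embedding.join_repeated_rows` may combine the values differently.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [10, 10, 10],
    'lab_result': [0.2, 0.4, -1.0],
    'labname_glucose': [1, 0, 0],
    'labname_sodium': [0, 1, 1],
})

# Assumed rules: a one hot column is active if any joined row had it active,
# while the continuous result is averaged
agg_rules = {
    'lab_result': 'mean',
    'labname_glucose': 'max',
    'labname_sodium': 'max',
}
joined_df = toy_df.groupby(['patientunitstayid', 'ts'], as_index=False).agg(agg_rules)
print(joined_df)
```

The two rows of unit stay 1 at timestamp 10 become a single row with both exam columns active.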

Reconvert dataframe to Modin:

In [None]:
lab_df, pd = du.utils.convert_dataframe(lab_df, to='modin')

In [None]:
type(lab_df)

In [None]:
lab_df.dtypes

In [None]:
lab_df.groupby(['patientunitstayid', 'ts']).count().nlargest(columns='lab_result', n=5).head()

In [None]:
lab_df[(lab_df.patientunitstayid == 3240757) & (lab_df.ts == 162)].head(10)

Comparing the output from the two previous cells with what we had before the `join_repeated_rows` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
lab_df.columns = du.data_processing.clean_naming(lab_df.columns)
lab_df.head()
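
The three rules described above can be sketched with a small helper. `clean_name` is a hypothetical function written for this example, not the actual `du.data_processing.clean_naming`, and the column names are only illustrative.

```python
def clean_name(name):
    """Lower case the name, replace spaces by underscores and remove commas."""
    return name.lower().replace(' ', '_').replace(',', '')


columns = ['labname_Base Excess', 'lab_units_mEq/L', 'labname_WBC, x 1000']
cleaned_columns = [clean_name(col) for col in columns]
print(cleaned_columns)
```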

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
# lab_df.to_csv(f'{data_path}cleaned/unnormalized/ohe/lab.csv')

Save the dataframe after normalizing:

In [None]:
lab_df.to_csv(f'{data_path}cleaned/normalized/ohe/lab.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
lab_df.describe().transpose()

In [None]:
lab_df.nlargest(columns='lab_result', n=5)

In [None]:
lab_df = pd.read_csv(f'{data_path}cleaned/normalized/ohe/lab.csv')
lab_df.head()

In [None]:
lab_df = lab_df.drop(columns='Unnamed: 0')
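
The `Unnamed: 0` column comes from the index that `to_csv` writes by default. A small sketch in plain pandas, with an in-memory buffer instead of a real file, shows the difference when the index is left out of the CSV:

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'ts': [10, 20], 'lab_result': [0.1, -0.3]})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy_df.to_csv()))
# Writing without the index avoids the extra column when reading the file back
without_index = pd.read_csv(io.StringIO(toy_df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Either saving with `index=False` or reading with `index_col=0` would make the later `drop` unnecessary, assuming that the index carries no information worth keeping.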

In [None]:
lab_df.info()