# eICU Data Joining
---

Reading and joining all parts of the eICU dataset from MIT, which contains data from over 139k patients collected in the US.

The main goal of this notebook is to prepare a single CSV document that contains all the relevant data to be used when training a machine learning model that predicts mortality. This involves joining tables, filtering out irrelevant columns and performing imputation.

## Importing the necessary packages

In [None]:
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
import yaml                                # Save and load YAML files

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")

# Path to the CSV dataset files
data_path = 'Documents/Datasets/Thesis/eICU/uncompressed/'

# Path to the code files
project_path = 'Documents/GitHub/eICU-mortality-prediction/'

In [None]:
import modin.pandas as pd                  # Optimized distributed version of Pandas
import data_utils as du                    # Data science and machine learning relevant methods

Set the random seed for reproducibility

In [None]:
du.set_random_seed(42)
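
The implementation of `du.set_random_seed` isn't shown in this notebook. As a minimal sketch of what seeding for reproducibility usually involves, assuming only Python's `random` module and NumPy need to be seeded (the hypothetical `set_seed` below is not the `data_utils` function):

```python
import random

import numpy as np


def set_seed(seed):
    """Hypothetical sketch: seed Python's and NumPy's random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Seeding twice with the same value should reproduce the same draws
set_seed(42)
first_draw = np.random.rand(3)
set_seed(42)
second_draw = np.random.rand(3)
print((first_draw == second_draw).all())
```

A real helper would probably also seed any deep learning framework in use, which is left out here.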

## Initialize variables

In [None]:
cat_feat = []                              # List of categorical features
cat_embed_feat = []                        # List of categorical features that will be embedded
cat_embed_feat_enum = dict()               # Dictionary of the enumerations of the categorical features that will be embedded

## Allergy data

### Read the data

In [None]:
alrg_df = pd.read_csv(f'{data_path}original/allergy.csv')
alrg_df.head()

In [None]:
len(alrg_df)

In [None]:
alrg_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
alrg_df.describe().transpose()

In [None]:
alrg_df.columns

In [None]:
alrg_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(alrg_df)
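
The `dataframe_missing_values` helper comes from the `data_utils` package and its implementation isn't shown here. As a rough sketch of what such a check usually amounts to, assuming plain pandas and a hypothetical `missing_values_summary` function, the count and percentage of missing values per column could be computed like this:

```python
import numpy as np
import pandas as pd


def missing_values_summary(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': n_missing,
        'missing_pct': 100 * n_missing / len(df),
    })
    # Show the columns with the most missing data first
    return summary.sort_values('missing_pct', ascending=False)


# Toy data, not taken from the eICU tables
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2, 3],
    'drughiclseqno': [np.nan, 1234.0, np.nan, np.nan],
})
summary = missing_values_summary(toy_df)
print(summary)
```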

### Remove unneeded features

In [None]:
alrg_df[alrg_df.allergytype == 'Non Drug'].drughiclseqno.value_counts()

In [None]:
alrg_df[alrg_df.allergytype == 'Drug'].drughiclseqno.value_counts()

As we can see, the drug features in this table only have data if the allergy derives from using the drug. As such, we don't need the `allergytype` feature. We're also ignoring hospital staff related information and using just the drug codes instead of their names, as the codes are independent of the drug brand.

In [None]:
alrg_df.allergynotetype.value_counts()

Feature `allergynotetype` also doesn't seem very relevant, so it's discarded.

In [None]:
alrg_df = alrg_df[['patientunitstayid', 'allergyoffset',
                   'allergyname', 'drughiclseqno']]
alrg_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

In the case of microbiology data, we're also going to embed the antibiotic `sensitivitylevel`, not because it has many categories, but because there can be several rows of data per timestamp (which would be impractical on one hot encoded data).

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['allergyname', 'drughiclseqno']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [alrg_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])
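
The threshold rule used above can also be wrapped in a small function. This is only a sketch on toy data, assuming plain pandas; `split_by_cardinality` and its `max_one_hot` parameter are hypothetical names, not part of `data_utils`:

```python
import pandas as pd


def split_by_cardinality(df, features, max_one_hot=5):
    """Hypothetical helper: separate features to embed from those to one hot encode."""
    to_embed, to_one_hot = [], []
    for feature in features:
        # Features with more unique categories than the threshold get embedded
        if df[feature].nunique() > max_one_hot:
            to_embed.append(feature)
        else:
            to_one_hot.append(feature)
    return to_embed, to_one_hot


# Toy data, not taken from the eICU tables
toy_df = pd.DataFrame({
    'allergyname': ['penicillin', 'latex', 'sulfa', 'codeine', 'aspirin', 'iodine'],
    'allergytype': ['Drug', 'Non Drug', 'Drug', 'Drug', 'Drug', 'Non Drug'],
})
to_embed, to_one_hot = split_by_cardinality(toy_df, ['allergyname', 'allergytype'])
print(to_embed, to_one_hot)
```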

In [None]:
alrg_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Skip the 'drughiclseqno' from enumeration encoding
    if feature == 'drughiclseqno':
        continue
    # Prepare for embedding, i.e. enumerate categories
    alrg_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(alrg_df, feature)

Fill missing values of the drug data with 0, so as to prepare for embedding:

In [None]:
alrg_df.drughiclseqno = alrg_df.drughiclseqno.fillna(0).astype(int)
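
The internals of `enum_categorical_feature` aren't shown in this notebook. One way such an enumeration could work, assuming that 0 is reserved for missing values (consistent with filling `drughiclseqno` with 0 above) and that categories are numbered in sorted order, is sketched below with a hypothetical `enumerate_categories` function. The real helper may number the categories differently.

```python
import numpy as np
import pandas as pd


def enumerate_categories(series):
    """Hypothetical sketch: map each category to an integer, keeping 0 for missing values."""
    categories = sorted(series.dropna().unique())
    # Start at 1 so that 0 stays free to represent missing values
    mapping = {category: idx for idx, category in enumerate(categories, start=1)}
    encoded = series.map(mapping).fillna(0).astype(int)
    return encoded, mapping


# Toy data, not taken from the eICU tables
toy_names = pd.Series(['penicillin', 'latex', np.nan, 'penicillin'])
encoded, mapping = enumerate_categories(toy_names)
print(encoded.tolist())
print(mapping)
```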

In [None]:
alrg_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
alrg_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    # The file is closed automatically once the dictionary is written
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
alrg_df['ts'] = alrg_df['allergyoffset']
alrg_df = alrg_df.drop('allergyoffset', axis=1)
alrg_df.head()

Remove duplicate rows:

In [None]:
len(alrg_df)

In [None]:
alrg_df = alrg_df.drop_duplicates()
alrg_df.head()

In [None]:
len(alrg_df)

Set `ts` as the index, so as to be easier to merge with other dataframes later:

In [None]:
alrg_df = alrg_df.set_index('ts')
alrg_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
alrg_df.reset_index().head()

In [None]:
alrg_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='allergyname').head()

In [None]:
alrg_df[alrg_df.patientunitstayid == 3197554].head(10)

We can see that there are up to 47 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

Even after removing duplicate rows, there are still some that have different information for the same ID and timestamp. We have to concatenate the categorical enumerations.

In [None]:
alrg_df = du.embedding.join_categorical_enum(alrg_df, new_cat_embed_feat)
alrg_df.head()
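
The implementation of `join_categorical_enum` isn't shown here. As a rough illustration of the idea, assuming plain pandas and a `;` separator (both the hypothetical `join_rows` function and the separator are assumptions), rows that share the same unit stay ID and timestamp can be grouped and their enumerated categories concatenated into a single string:

```python
import pandas as pd


def join_rows(df, cat_columns, id_columns=('patientunitstayid', 'ts'), separator=';'):
    """Hypothetical sketch: merge rows sharing the same IDs by concatenating categories."""
    def concat_unique(values):
        # Keep each category once, preserving the order of first appearance
        unique_values = list(dict.fromkeys(values))
        return separator.join(str(value) for value in unique_values)

    return (df.groupby(list(id_columns), as_index=False)
              .agg({column: concat_unique for column in cat_columns}))


# Toy data, not taken from the eICU tables
toy_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 1, 2],
    'ts': [0, 0, 30, 0],
    'allergyname': [3, 7, 3, 5],
})
joined_df = join_rows(toy_df, ['allergyname'])
print(joined_df)
```

After the join there is a single row per pair of `patientunitstayid` and `ts`, which is what the later merge relies on.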

In [None]:
alrg_df.dtypes

In [None]:
alrg_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='allergyname').head()

In [None]:
alrg_df[alrg_df.patientunitstayid == 3197554].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Rename columns

In [None]:
alrg_df = alrg_df.rename(columns={'drughiclseqno':'drugallergyhiclseqno'})
alrg_df.head()

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
alrg_df.columns = du.data_processing.clean_naming(alrg_df.columns)
alrg_df.head()
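
The `clean_naming` function belongs to `data_utils` and its code isn't shown. A minimal sketch of the three steps described above, using a hypothetical `clean_name` function and made-up column names, could look like this:

```python
def clean_name(name):
    """Hypothetical sketch: lower case, no commas, underscores instead of spaces."""
    return name.lower().replace(',', '').replace(' ', '_')


# Made-up column names, only for illustration
raw_columns = ['patientunitstayid', 'Drug Allergy, HICL Seq No']
clean_columns = [clean_name(column) for column in raw_columns]
print(clean_columns)
```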

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
alrg_df.to_csv(f'{data_path}cleaned/unnormalized/allergy.csv')

Save the dataframe after normalizing:

In [None]:
alrg_df.to_csv(f'{data_path}cleaned/normalized/allergy.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
alrg_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
alrg_df = pd.read_csv(f'{data_path}cleaned/normalized/allergy.csv')
alrg_df.head()

In [None]:
len(alrg_df)

In [None]:
alrg_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, alrg_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()
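
To make the behaviour of this kind of merge more concrete, here is a small sketch with plain pandas on toy data (the column `heartrate` and all values are made up). It assumes that offsets are expressed in minutes, so that `tolerance=30` corresponds to 30 minutes. Note that `merge_asof` expects both dataframes to be sorted by the `on` key.

```python
import pandas as pd

# Toy data with offsets in minutes, not taken from the eICU tables
left_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [0, 100, 50],
    'heartrate': [80, 85, 90],
})
right_df = pd.DataFrame({
    'patientunitstayid': [1, 1, 2],
    'ts': [20, 200, 45],
    'allergyname': ['3', '7', '5'],
})

# `merge_asof` requires both dataframes to be sorted by the `on` key
left_df = left_df.sort_values('ts')
right_df = right_df.sort_values('ts')

# Match each left row with the nearest right row of the same unit stay,
# as long as the timestamps differ by no more than 30
merged_df = pd.merge_asof(left_df, right_df, on='ts', by='patientunitstayid',
                          direction='nearest', tolerance=30)
print(merged_df)
```

In this toy case, the row of unit stay 1 at `ts` 100 gets a missing `allergyname`, since its nearest allergy row is 80 minutes away.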

## Past history data

### Read the data

In [None]:
pasthist_df = pd.read_csv(f'{data_path}original/pastHistory.csv')
pasthist_df.head()

In [None]:
len(pasthist_df)

In [None]:
pasthist_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
pasthist_df.describe().transpose()

In [None]:
pasthist_df.columns

In [None]:
pasthist_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(pasthist_df)

### Remove unneeded features

In [None]:
pasthist_df.pasthistorypath.value_counts().head(20)

In [None]:
pasthist_df.pasthistorypath.value_counts().tail(20)

In [None]:
pasthist_df.pasthistoryvalue.value_counts()

In [None]:
pasthist_df.pasthistorynotetype.value_counts()

In [None]:
pasthist_df[pasthist_df.pasthistorypath == 'notes/Progress Notes/Past History/Past History Obtain Options/Performed'].pasthistoryvalue.value_counts()

In this case, considering that the data regards past diagnoses of the patients, the timestamp at which they were recorded probably isn't very reliable or useful. As such, I'm going to remove the offset variables. Furthermore, `pasthistoryvaluetext` is redundant with `pasthistoryvalue`, while `pasthistorynotetype` and the past history path 'notes/Progress Notes/Past History/Past History Obtain Options/Performed' seem to be irrelevant.

In [None]:
pasthist_df = pasthist_df.drop(['pasthistoryid', 'pasthistoryoffset', 'pasthistoryenteredoffset',
                                'pasthistorynotetype', 'pasthistoryvaluetext'], axis=1)
pasthist_df.head()

In [None]:
categories_to_remove = ['notes/Progress Notes/Past History/Past History Obtain Options/Performed']

In [None]:
~(pasthist_df.pasthistorypath.isin(categories_to_remove)).head()

In [None]:
pasthist_df = pasthist_df[~(pasthist_df.pasthistorypath.isin(categories_to_remove))]
pasthist_df.head()

In [None]:
len(pasthist_df)

In [None]:
pasthist_df.patientunitstayid.nunique()

In [None]:
pasthist_df.pasthistorypath.value_counts().head(20)

In [None]:
pasthist_df.pasthistorypath.value_counts().tail(20)

In [None]:
pasthist_df.pasthistoryvalue.value_counts()

There's still plenty of data left, affecting around 81.87% of the unit stays, even after removing several categories.

### Separate high level notes

In [None]:
pasthist_df.pasthistorypath.map(lambda x: x.split('/')).head().values

In [None]:
pasthist_df.pasthistorypath.map(lambda x: len(x.split('/'))).min()

In [None]:
pasthist_df.pasthistorypath.map(lambda x: len(x.split('/'))).max()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='/'),
                                  meta=('x', str)).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='/'),
                                  meta=('x', str)).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='/'),
                                  meta=('x', str)).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 3, separator='/'),
                                  meta=('x', str)).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='/'),
                                  meta=('x', str)).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='/'),
                                  meta=('x', str)).value_counts()

In [None]:
pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 6, separator='/'),
                                  meta=('x', str)).value_counts()

The notes always have at least 5 levels. As the first 4 are essentially always the same ("notes/Progress Notes/Past History/Organ Systems/") and the 5th tends to not be very specific (it only indicates which organ system was affected, when it isn't just a case of no health problems detected), it's best to preserve the 5th level and isolate the remaining string as a new feature. This way, the split gives the model further insight into similar notes.

In [None]:
pasthist_df['pasthistorytype'] = pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 4, separator='/'), meta=('x', str))
pasthist_df['pasthistorydetails'] = pasthist_df.pasthistorypath.apply(lambda x: du.search_explore.get_element_from_split(x, 5, separator='/', till_the_end=True), meta=('x', str))
pasthist_df.head()
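
As `get_element_from_split` is a `data_utils` helper whose code isn't shown, the following is only a sketch of the split described above, using plain Python string methods. The `split_path` function and the example path are hypothetical:

```python
def split_path(path, separator='/', type_level=4):
    """Hypothetical sketch: isolate one level of the path and join everything after it."""
    levels = path.split(separator)
    history_type = levels[type_level]
    # Everything after the chosen level is kept together as the details
    history_details = separator.join(levels[type_level + 1:])
    return history_type, history_details


# Made-up path that follows the structure of the notes
path = 'notes/Progress Notes/Past History/Organ Systems/Cardiovascular/Hypertension/treated'
history_type, history_details = split_path(path)
print(history_type)
print(history_details)
```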

`pasthistoryvalue` seems to correspond to the last element of `pasthistorydetails`. Let's confirm it:

In [None]:
pasthist_df['pasthistorydetails_last'] = pasthist_df.pasthistorydetails.map(lambda x: x.split('/')[-1])
pasthist_df.head()

Compare columns `pasthistoryvalue` and `pasthistorydetails`'s last element:

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue != pasthist_df.pasthistorydetails_last]

The previous output confirms that the last element of the newly created `pasthistorydetails` feature (the last string in the separator-delimited list) is almost exactly equal to the already existing `pasthistoryvalue` feature. The differences are that `pasthistoryvalue` takes into account the scenarios of no health problems detected and behaves correctly in strings that contain the separator symbol. So, we should remove the last element of `pasthistorydetails`:

In [None]:
pasthist_df = pasthist_df.drop('pasthistorydetails_last', axis=1)
pasthist_df.head()

In [None]:
pasthist_df['pasthistorydetails'] = pasthist_df.pasthistorydetails.apply(lambda x: '/'.join(x.split('/')[:-1]), meta=('pasthistorydetails', str))
pasthist_df.head()

Remove irrelevant `Not Obtainable` and `Not Performed` values:

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'Not Obtainable'].pasthistorydetails.value_counts()

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'Not Performed'].pasthistorydetails.value_counts()

In [None]:
pasthist_df = pasthist_df[~((pasthist_df.pasthistoryvalue == 'Not Obtainable') | (pasthist_df.pasthistoryvalue == 'Not Performed'))]
pasthist_df.head()

In [None]:
pasthist_df.pasthistorytype.unique()

Replace blank `pasthistorydetails` values:

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'No Health Problems'].pasthistorydetails.value_counts()

In [None]:
pasthist_df[pasthist_df.pasthistoryvalue == 'No Health Problems'].pasthistorydetails.value_counts().index

In [None]:
pasthist_df[pasthist_df.pasthistorydetails == ''].head()

In [None]:
pasthist_df['pasthistorydetails'] = pasthist_df.apply(lambda df: 'No Health Problems' if df['pasthistorytype'] == 'No Health Problems'
                                                                 else df['pasthistorydetails'],
                                                      axis=1, meta=(None, str))
pasthist_df.head()
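
Row-wise `apply` calls can be slow on large tables. As a possible alternative, sketched here with plain pandas on toy data (whether it behaves identically under Modin is an assumption that would need checking), the same replacement can be written with a boolean mask:

```python
import pandas as pd

# Toy data, not taken from the eICU tables
toy_df = pd.DataFrame({
    'pasthistorytype': ['No Health Problems', 'Cardiovascular'],
    'pasthistorydetails': ['', 'Hypertension'],
})

# Rows whose type signals that no health problems were found
no_problems = toy_df['pasthistorytype'] == 'No Health Problems'
# Overwrite the details only on those rows, leaving the others untouched
toy_df.loc[no_problems, 'pasthistorydetails'] = 'No Health Problems'
print(toy_df)
```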

In [None]:
pasthist_df[pasthist_df.pasthistorydetails == '']

Remove the now redundant `pasthistorypath` column:

In [None]:
pasthist_df = pasthist_df.drop('pasthistorypath', axis=1)
pasthist_df.head()

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['pasthistoryvalue', 'pasthistorytype', 'pasthistorydetails']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [pasthist_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
pasthist_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    pasthist_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(pasthist_df, feature)

In [None]:
pasthist_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
pasthist_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    # The file is closed automatically once the dictionary is written
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Remove duplicate rows

Remove duplicate rows:

In [None]:
len(pasthist_df)

In [None]:
pasthist_df = pasthist_df.drop_duplicates()
pasthist_df.head()

In [None]:
len(pasthist_df)

Check for possible multiple rows with the same unit stay ID:

In [None]:
pasthist_df.groupby(['patientunitstayid']).count().nlargest(columns='pasthistoryvalue').head()

In [None]:
pasthist_df[pasthist_df.patientunitstayid == 1558102].head(10)

We can see that there are up to 20 categories per `patientunitstayid`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
pasthist_df = du.embedding.join_categorical_enum(pasthist_df, new_cat_embed_feat, id_columns=['patientunitstayid'])
pasthist_df.head()

In [None]:
pasthist_df.dtypes

In [None]:
pasthist_df.groupby(['patientunitstayid']).count().nlargest(columns='pasthistoryvalue').head()

In [None]:
pasthist_df[pasthist_df.patientunitstayid == 1558102].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
pasthist_df.columns = du.data_processing.clean_naming(pasthist_df.columns)
pasthist_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
pasthist_df.to_csv(f'{data_path}cleaned/unnormalized/pastHistory.csv')

Save the dataframe after normalizing:

In [None]:
pasthist_df.to_csv(f'{data_path}cleaned/normalized/pastHistory.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
pasthist_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`. As the offset columns were removed from the past history data, there is no timestamp, `ts`, to match on, so each unit stay's past history is attached to all of its rows.

In [None]:
pasthist_df = pd.read_csv(f'{data_path}cleaned/normalized/pastHistory.csv')
pasthist_df.head()

In [None]:
len(pasthist_df)

In [None]:
pasthist_df.patientunitstayid.nunique()

In [None]:
# Past history has no `ts` column, so a regular left merge on the unit stay ID is used
eICU_df = pd.merge(eICU_df, pasthist_df, on='patientunitstayid', how='left')
eICU_df.head()

## Diagnosis data

### Read the data

In [None]:
diagn_df = pd.read_csv(f'{data_path}original/diagnosis.csv')
diagn_df.head()

In [None]:
len(diagn_df)

In [None]:
diagn_df.patientunitstayid.nunique()

Get an overview of the dataframe through the `describe` method:

In [None]:
diagn_df.describe().transpose()

In [None]:
diagn_df.columns

In [None]:
diagn_df.dtypes

### Check for missing values

In [None]:
du.search_explore.dataframe_missing_values(diagn_df)

### Remove unneeded features

Besides the usual removal of the row identifier, `diagnosisid`, I'm also removing `diagnosispriority`, which is apparently irrelevant (and subjective); `icd9code`, which is redundant, has missing values and has other issues; and `activeupondischarge`, as we don't have complete information as to when diagnoses end.

In [None]:
diagn_df = diagn_df.drop(['diagnosisid', 'diagnosispriority', 'icd9code', 'activeupondischarge'], axis=1)
diagn_df.head()

### Separate high level diagnosis

In [None]:
diagn_df.diagnosisstring.value_counts()

In [None]:
diagn_df.diagnosisstring.map(lambda x: x.split('|')).head()

In [None]:
diagn_df.diagnosisstring.map(lambda x: len(x.split('|'))).min()

There are always at least 2 higher level diagnoses. It could be beneficial to extract those first 2 levels into separate features, so as to avoid the need for the model to learn similarities that are already known.

In [None]:
diagn_df['diagnosis_type_1'] = diagn_df.diagnosisstring.apply(lambda x: du.search_explore.get_element_from_split(x, 0, separator='|'), meta=('x', str))
diagn_df['diagnosis_disorder_2'] = diagn_df.diagnosisstring.apply(lambda x: du.search_explore.get_element_from_split(x, 1, separator='|'), meta=('x', str))
diagn_df['diagnosis_detailed_3'] = diagn_df.diagnosisstring.apply(lambda x: du.search_explore.get_element_from_split(x, 2, separator='|', till_the_end=True), meta=('x', str))
# Remove now redundant `diagnosisstring` feature
diagn_df = diagn_df.drop('diagnosisstring', axis=1)
diagn_df.head()
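
A similar split can be sketched with pandas' own string methods instead of the `data_utils` helper. This is an illustration on made-up diagnosis strings, assuming plain pandas. Here, strings with fewer than 3 levels end up with a missing value in the last column, which may differ from what `get_element_from_split` returns.

```python
import pandas as pd

# Made-up diagnosis strings that follow the `|` separated structure
toy_df = pd.DataFrame({
    'diagnosisstring': [
        'cardiovascular|shock / hypotension|sepsis',
        'pulmonary|respiratory failure|acute respiratory failure|due to pneumonia',
        'neurologic|seizures',
    ]
})

# Split at most twice, so that everything after the 2nd level stays together
levels = toy_df['diagnosisstring'].str.split('|', n=2, expand=True)
levels.columns = ['diagnosis_type_1', 'diagnosis_disorder_2', 'diagnosis_detailed_3']
print(levels)
```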

### Discretize categorical features

Convert binary categorical features into simple numberings, one hot encode features with a low number of categories (in this case, up to 5) and enumerate sparse categorical features that will be embedded.

#### Separate and prepare features for embedding

Identify categorical features that have more than 5 unique categories, which will go through an embedding layer afterwards, and enumerate them.

Update list of categorical features and add those that will need embedding (features with more than 5 unique values):

In [None]:
new_cat_feat = ['diagnosis_type_1', 'diagnosis_disorder_2', 'diagnosis_detailed_3']
cat_feat.extend(new_cat_feat)

In [None]:
cat_feat_nunique = [diagn_df[feature].nunique() for feature in new_cat_feat]
cat_feat_nunique

In [None]:
new_cat_embed_feat = []
for i in range(len(new_cat_feat)):
    if cat_feat_nunique[i] > 5:
        # Add feature to the list of those that will be embedded
        cat_embed_feat.append(new_cat_feat[i])
        new_cat_embed_feat.append(new_cat_feat[i])

In [None]:
diagn_df[new_cat_feat].head()

In [None]:
for i in range(len(new_cat_embed_feat)):
    feature = new_cat_embed_feat[i]
    # Prepare for embedding, i.e. enumerate categories
    diagn_df[feature], cat_embed_feat_enum[feature] = du.embedding.enum_categorical_feature(diagn_df, feature)

In [None]:
diagn_df[new_cat_feat].head()

In [None]:
cat_embed_feat_enum

In [None]:
diagn_df[new_cat_feat].dtypes

#### Save enumeration encoding mapping

Save the dictionary that maps from the original categories/strings to the new numerical encodings.

In [None]:
with open('cat_embed_feat_enum.yaml', 'w') as stream:
    # The file is closed automatically once the dictionary is written
    yaml.dump(cat_embed_feat_enum, stream, default_flow_style=False)

### Create the timestamp feature and sort

Create the timestamp (`ts`) feature:

In [None]:
diagn_df['ts'] = diagn_df['diagnosisoffset']
diagn_df = diagn_df.drop('diagnosisoffset', axis=1)
diagn_df.head()

Remove duplicate rows:

In [None]:
len(diagn_df)

In [None]:
diagn_df = diagn_df.drop_duplicates()
diagn_df.head()

In [None]:
len(diagn_df)

Set `ts` as the index, so as to be easier to merge with other dataframes later:

In [None]:
diagn_df = diagn_df.set_index('ts')
diagn_df.head()

Check for possible multiple rows with the same unit stay ID and timestamp:

In [None]:
diagn_df.reset_index().head()

In [None]:
diagn_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='diagnosis_type_1').head()

In [None]:
diagn_df[diagn_df.patientunitstayid == 3089982].head(10)

We can see that there are up to 69 categories per set of `patientunitstayid` and `ts`. As such, we must join them.

### Join rows that have the same IDs

In [None]:
diagn_df = du.embedding.join_categorical_enum(diagn_df, new_cat_embed_feat)
diagn_df.head()

In [None]:
diagn_df.dtypes

In [None]:
diagn_df.reset_index().groupby(['patientunitstayid', 'ts']).count().nlargest(columns='diagnosis_type_1').head()

In [None]:
diagn_df[diagn_df.patientunitstayid == 3089982].head(10)

Comparing the output from the two previous cells with what we had before the `join_categorical_enum` method, we can see that all rows with duplicate IDs have been successfully joined.

### Clean column names

Standardize all column names to be in lower case, have spaces replaced by underscores and have commas removed.

In [None]:
diagn_df.columns = du.data_processing.clean_naming(diagn_df.columns)
diagn_df.head()

### Save the dataframe

Save the dataframe before normalizing:

In [None]:
diagn_df.to_csv(f'{data_path}cleaned/unnormalized/diagnosis.csv')

Save the dataframe after normalizing:

In [None]:
diagn_df.to_csv(f'{data_path}cleaned/normalized/diagnosis.csv')

Confirm that everything is ok through the `describe` method:

In [None]:
diagn_df.describe().transpose()

### Join dataframes

Merge dataframes by the unit stay, `patientunitstayid`, and the timestamp, `ts`, with a tolerance for a difference of up to 30 minutes.

In [None]:
diagn_df = pd.read_csv(f'{data_path}cleaned/normalized/diagnosis.csv')
diagn_df.head()

In [None]:
len(diagn_df)

In [None]:
diagn_df.patientunitstayid.nunique()

In [None]:
eICU_df = pd.merge_asof(eICU_df, diagn_df, on='ts', by='patientunitstayid', direction='nearest', tolerance=30)
eICU_df.head()