# Prepare Data

In this script we will prepare the data for the DeepLabeler model.

Author: Ryan Fogle

Date: 4-7-2023

In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
import os
import re
from tqdm import tqdm
tqdm.pandas()
from datetime import datetime
start = datetime.now()

## Load in Notes & ICD9 Codes

We will now load the notes from the MIMIC-III dataset, keep only the discharge summaries, and then join them with the ICD9 codes.

To get access to the MIMIC-III dataset, please follow the guidelines at this link: https://eicu-crd.mit.edu/gettingstarted/access/

In [None]:
# IMPORTANT! change this to the location on your machine!
data_url = '../physionet.org/files/mimiciii/1.4/'

notes = pd.read_csv(data_url + 'NOTEEVENTS.csv')
notes.head()

## Subset to only discharge summaries

As you will see, the notes fall into many different categories. We only want the `Discharge summary` category.

In [None]:
notes['CATEGORY'].value_counts()

In [None]:
discharges = notes[notes['CATEGORY'] == 'Discharge summary'].copy()
discharges.shape

## Aggregate multiple discharge summaries

About 21% of the discharge summaries belong to a visit that has more than one discharge summary. We will combine these notes so that each visit has exactly one discharge summary.

In [None]:
gb = discharges.groupby(['SUBJECT_ID', 'HADM_ID'])['ROW_ID'].count()
f"Percent of pop with more than 1 discharge summary for one visit: {gb[gb > 1].sum() / discharges.shape[0] * 100:.2f}%"

In [None]:
# Compute each note's character length and sort longest first: the text is truncated later,
# so the most informative (longest) note should come first in the combined summary.
discharges['TEXT_LEN'] = discharges['TEXT'].progress_apply(len)
discharges.sort_values('TEXT_LEN', ascending=False, inplace=True)

# Drop duplicate notes, then join all notes of a visit into a single string.
discharges = (
    discharges[['SUBJECT_ID', 'HADM_ID', 'TEXT']]
    .drop_duplicates()
    .groupby(['SUBJECT_ID', 'HADM_ID'])['TEXT']
    .progress_apply(lambda x: " ".join(x))
    .reset_index()
)
discharges.shape
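
To make the aggregation step concrete, here is a minimal, dependency-free sketch of the same logic on made-up rows. The `rows` data and the `combine_notes` helper are hypothetical and are not part of the notebook; they only mirror what the pandas cell does: drop exact duplicates, put the longest note first, and join the notes of each `(SUBJECT_ID, HADM_ID)` pair.

```python
from collections import defaultdict


def combine_notes(rows):
    """Join all notes of a visit into one string, longest note first.

    rows: iterable of (subject_id, hadm_id, text) tuples (hypothetical toy input).
    """
    grouped = defaultdict(list)
    seen = set()
    for row in rows:
        if row in seen:  # drop exact duplicate rows
            continue
        seen.add(row)
        subject_id, hadm_id, text = row
        grouped[(subject_id, hadm_id)].append(text)
    # sort each visit's notes by character length, longest first, then join
    return {
        visit: " ".join(sorted(texts, key=len, reverse=True))
        for visit, texts in grouped.items()
    }


rows = [
    (1, 100, "short addendum"),
    (1, 100, "a much longer discharge summary"),
    (1, 100, "short addendum"),  # exact duplicate, dropped
    (2, 200, "only one note"),
]
combined = combine_notes(rows)
print(combined[(1, 100)])
```

Putting the longest note first means that, if the combined text is truncated later, the cut is more likely to fall in a short addendum than in the main summary.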

Let's take a look at one example

In [None]:
# print one randomly chosen combined discharge summary (numpy is already imported above)
print(discharges.iloc[np.random.randint(0, discharges.shape[0])]['TEXT'])

## Remove unnecessary tokens

The paper did not mention how the authors tokenized the text, so we need to create our own tokenization process. You will notice that some information in the notes has been omitted and replaced by `[** ... **]` placeholders, such as the patient's name, the dates, the doctor's name, and the hospital's name.

First, we remove these placeholders. This information is not useful for our purposes: logically, the name of the patient, doctor, or hospital, or a date by itself, should not indicate which ICD9 codes are assigned. We treat the placeholders as stop words in the tokenization process.

Second, we lowercase the words, which decreases our vocabulary size.

Third, we remove all punctuation and extra white space from the text, again to reduce the vocabulary size and the length of the notes. Punctuation would be considered stop words for the SVM model, and we only want to consider words for our Word2Vec and Doc2Vec models. This approach has a drawback: it splits "e.coli" into "e" and "coli" when we would perhaps want to treat it as one token.

Fourth, we remove all numbers. Individual numbers should not present a significant advantage for Word2Vec, as many of them are patient measurements. We also explicitly remove timestamps (e.g., 10:01PM).

Creating a better tokenizer could be a next step for this project, but I opted for a simpler regex solution.

In [None]:
# step 2: lowercase
discharges['TEXT'] = discharges['TEXT'].progress_apply(lambda x: x.lower())

# steps 1 and 3: remove [** ... **] placeholders, timestamps, and punctuation
# (raw strings avoid invalid-escape warnings in newer Python versions)
p = re.compile(r"(\[\*\*.+?\*\*\])|([0-9]{1,2}:[0-9]{2}([AaPp][Mm])?)|([!\"#$%&'()*+,\-./:;<=>?@\\\[\]^_`{|}~])")
s = re.compile(r"\s+")  # collapse runs of white space into one space
n = re.compile(r"[0-9]+")  # step 4: remove numbers
# remove common words; \b keeps "am", "pm", and "mg" from being cut out of longer words such as "exam"
w = re.compile(r"\b(admission\sdate|discharge\sdate|date\sof\sbirth|pm|am|mg)\b")
discharges['TEXT'] = (
    discharges['TEXT']
    .progress_apply(lambda x: p.sub(' ', x))
    .progress_apply(lambda x: n.sub('', x))
    .progress_apply(lambda x: w.sub('', x))
    .progress_apply(lambda x: s.sub(' ', x))
)
discharges['TEXT'].iloc[100]
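
The effect of the cleaning steps is easier to see on a single made-up sentence. The sketch below is a simplified stand-in rather than the notebook's exact patterns: the `clean` helper and the `sample` string are hypothetical, the punctuation pattern `[^\w\s]|_` is my shorthand for the explicit punctuation list, and the common-word removal is left out.

```python
import re

# Simplified versions of the cleaning patterns, written as raw strings (toy example).
deid = re.compile(r"\[\*\*.+?\*\*\]")                 # de-identification placeholders
clock = re.compile(r"[0-9]{1,2}:[0-9]{2}(?:[ap]m)?")  # timestamps such as 10:01pm
punct = re.compile(r"[^\w\s]|_")                      # punctuation (assumed shorthand)
digits = re.compile(r"[0-9]+")                        # remaining numbers
spaces = re.compile(r"\s+")                           # runs of white space


def clean(text):
    """Lowercase, then strip placeholders, timestamps, punctuation, and numbers."""
    text = text.lower()
    text = deid.sub(" ", text)
    text = clock.sub(" ", text)
    text = punct.sub(" ", text)
    text = digits.sub("", text)
    return spaces.sub(" ", text).strip()


sample = "Admitted [**2101-3-4**] at 10:01PM. E.coli grew; WBC 12.5, given Tylenol."
cleaned = clean(sample)
print(cleaned)  # admitted at e coli grew wbc given tylenol
```

Note how "E.coli" ends up as the two tokens "e" and "coli", which is the drawback described above.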

Let's look at the statistics

In [None]:
discharges['toks'] = discharges['TEXT'].progress_apply(lambda x: x.split())
toks_len = discharges['toks'].progress_apply(lambda x: len(x))
toks_len.agg(['mean', 'median', 'std', 'max', 'min'])

In [None]:
# save if needed
discharges.to_parquet('discharges.pq')

Our token counts are similar enough to those listed in the paper, so we will stop the tokenization process here.

## Load in ICD9 Diagnosis

In [None]:
diag_raw = pd.read_csv(data_url + 'DIAGNOSES_ICD.csv')
diag = diag_raw[diag_raw['ICD9_CODE'].notna()].copy()
diag.head()

In [None]:
diag = diag.groupby(['SUBJECT_ID', 'HADM_ID'])['ICD9_CODE'].progress_apply(lambda x: list(x)).reset_index()
diag.head()

## Check the Occurrences of ICD9 codes

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(diag['ICD9_CODE'].to_list())
mlb.classes_

In [None]:
codes = pd.Series(dict(zip(mlb.classes_, np.sum(y, axis=0))))
ax = codes.hist(bins=np.arange(0,100, 1))
ax.set_xlabel('Occurrence Count')
ax.set_ylabel('Frequency')
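
For readers unfamiliar with multi-label encoding, the following plain-Python sketch shows what the counting step amounts to: each admission becomes a 0/1 row over the sorted set of codes, and the column sums give the number of admissions in which each code appears. The `multi_hot` helper and the `admissions` list are hypothetical illustrations, not the scikit-learn implementation.

```python
def multi_hot(label_lists):
    """Return (classes, matrix): sorted unique labels and one 0/1 row per sample."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1  # a code listed twice still counts once
        matrix.append(row)
    return classes, matrix


# hypothetical admissions, each with a list of ICD9 codes
admissions = [["4019", "25000"], ["4019"], ["41401", "4019"]]
classes, matrix = multi_hot(admissions)

# column sums: number of admissions in which each code appears
counts = {code: sum(row[i] for row in matrix) for i, code in enumerate(classes)}
print(counts)
```

Because the rows are binary, the counts measure how many admissions carry a code, which is the quantity the histogram summarizes.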

As you can see, many codes have 20 or fewer examples. To speed up training, we will ignore codes that occur 50 times or fewer.

The authors of DeepLabeler trained a model to label all codes, but a code with only about 50 positive cases leaves a very small training set. We will now remove all codes that occur 50 times or fewer. With an 80/20 train/test split, a code at this threshold has on average only about 10 positive test samples (50 × 0.2 = 10).

In [None]:
valid_codes = codes[codes > 50].index.to_list()
len(valid_codes)
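
The arithmetic behind the threshold can be sketched in a few lines. The helper names and the `toy_counts` dictionary below are hypothetical; the filter mirrors the strict `codes > 50` comparison used in the cell above, so a code with exactly 50 occurrences is dropped.

```python
def expected_test_positives(n_occurrences, test_fraction=0.2):
    """Average number of positive examples that land in the test split."""
    return n_occurrences * test_fraction


def keep_frequent(counts, threshold=50):
    """Keep codes that occur more than `threshold` times (mirrors `codes > 50`)."""
    return [code for code, n in counts.items() if n > threshold]


# made-up occurrence counts per ICD9 code
toy_counts = {"4019": 120, "25000": 51, "41401": 50, "V1582": 7}
kept = keep_frequent(toy_counts)

print(kept)
print(expected_test_positives(50))
```

A code near the threshold therefore contributes only around ten positive test examples, which is why rarer codes are hard to evaluate reliably.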

In [None]:
# keep only the diagnosis rows whose code passed the frequency filter in the previous cell
diag = diag_raw[diag_raw['ICD9_CODE'].notna() & diag_raw['ICD9_CODE'].isin(valid_codes)]
diag.head()

In [None]:
diag = diag.groupby(['SUBJECT_ID', 'HADM_ID'])['ICD9_CODE'].progress_apply(lambda x: list(x)).reset_index()
diag.head()

In [None]:
final = discharges.merge(diag, on=['SUBJECT_ID', 'HADM_ID'], how='inner')
print(final.shape)
final.head()
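
Since both tables now hold one row per visit, the inner join keeps exactly the visits that have a discharge summary and at least one retained code. A minimal dictionary-based sketch of that behavior follows; the `inner_join` helper and the toy mappings are hypothetical and only illustrate the semantics of `how='inner'`.

```python
def inner_join(notes_by_visit, codes_by_visit):
    """Keep only visits present in both mappings, like how='inner'."""
    return {
        visit: (notes_by_visit[visit], codes_by_visit[visit])
        for visit in notes_by_visit
        if visit in codes_by_visit
    }


# keys are (SUBJECT_ID, HADM_ID) pairs; values are made up
notes_by_visit = {(1, 100): "note a", (2, 200): "note b"}
codes_by_visit = {(1, 100): ["4019"], (3, 300): ["25000"]}

joined = inner_join(notes_by_visit, codes_by_visit)
print(joined)
```

Visit `(2, 200)` has a note but no retained code, and visit `(3, 300)` has a code but no note, so both are dropped from the result.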

In [None]:
final.to_parquet('prepared-data.pq')

In [None]:
final.toks.str.len().hist(bins= np.arange(0, 6000, 50))

In [None]:
end = datetime.now()
total_time = end - start
total_time