In [1]:
from pathlib import Path

import pandas as pd

from src import preprocess

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
DATA_DIR = "./data/"
CLINICAL_NOTES_FILE = DATA_DIR + "ClinNotes.csv"
MEDICAL_CONCEPTS_FILE = DATA_DIR + "MedicalConcepts.csv"

PROCESSDED_DATA_DIR = './processed_data/'
PROCESSED_CLINICAL_NOTES_FILE = PROCESSDED_DATA_DIR + "ClinNotes.csv"

In [4]:
df_clinical = pd.read_csv(CLINICAL_NOTES_FILE)

# Data Understanding

First, let's take a look at the CSV format.

In [5]:
df_clinical.head()

Unnamed: 0,category,notes
0,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit..."
1,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...
2,Cardiovascular / Pulmonary,"2-D ECHOCARDIOGRAM,Multiple views of the heart..."
3,Cardiovascular / Pulmonary,"DESCRIPTION:,1. Normal cardiac chambers size...."
4,Cardiovascular / Pulmonary,"2-D STUDY,1. Mild aortic stenosis, widely calc..."


In [6]:
print('The dataset contains {} records.'.format(len(df_clinical)))

The dataset contains 818 records.


Let's take a peek at one specific clinical note.

In [7]:
df_clinical['notes'][0]

'2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left ventricle.,3.  Normal LV systolic function with left ventricular ejection fraction of 51%.,4.  Normal LV diastolic function.,5.  No pericardial effusion.,6.  Normal morphology of aortic valve, mitral valve, tricuspid valve, and pulmonary valve.,7.  PA systolic pressure is 36 mmHg.,DOPPLER: , ,1.  Mild mitral and tricuspid regurgitation.,2.  Trace aortic and pulmonary regurgitation.'

The note contains far more commas than ordinary prose would. This is likely the result of a normalization step that collapsed a properly formatted, multi-line clinical note into a single line. To check, let's replace each comma with a newline.

In [8]:
preprocess.print_note(df_clinical['notes'][0])

2-D M-MODE: 
 
1.  Left atrial enlargement with left atrial diameter of 4.7 cm.
2.  Normal size right and left ventricle.
3.  Normal LV systolic function with left ventricular ejection fraction of 51%.
4.  Normal LV diastolic function.
5.  No pericardial effusion.
6.  Normal morphology of aortic valve
 mitral valve
 tricuspid valve
 and pulmonary valve.
7.  PA systolic pressure is 36 mmHg.
DOPPLER: 
 
1.  Mild mitral and tricuspid regurgitation.
2.  Trace aortic and pulmonary regurgitation.


The note is much easier to read this way, even though replacing every comma also breaks up genuine in-sentence commas (see item 6).
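The source of `preprocess.print_note` is not shown in this notebook. A minimal sketch that reproduces the output above, assuming it does nothing more than swap commas for newlines, could look like this (`format_note_sketch` and `print_note_sketch` are hypothetical names):

```python
def format_note_sketch(note):
    """Replace every comma with a newline (assumed behaviour, not the real helper)."""
    return note.replace(",", "\n")


def print_note_sketch(note):
    """Print the note with one comma-separated fragment per line."""
    print(format_note_sketch(note))


print_note_sketch("DOPPLER: , ,1.  Mild mitral and tricuspid regurgitation.")
```

Because the replacement is unconditional, commas inside a sentence are split as well, which is exactly what happens to item 6 in the output above.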

For data understanding, we observe that the dataset contains a small collection (around 800) of clinical notes, along with an extra column giving the category each note comes from. The notes themselves need some extra cleaning before they can be used. One notable point is that the notes contain many numbers, which poses a challenge for vectorization because numbers in text generally act as noise and carry little semantic meaning. Sources of these numbers include dates, time durations, lengths and widths, blood pressure readings, and medication concentrations.
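To get a rough feel for how common numbers are, one option is to measure the share of whitespace-separated tokens that contain a digit. This is only an illustrative sketch; `NUMBER_PATTERN` and `numeric_token_ratio` are hypothetical names and are not part of `src.preprocess`.

```python
import re

# Matches integers and simple decimals such as "36" or "4.7".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def numeric_token_ratio(note):
    """Return the fraction of whitespace-separated tokens that contain a number."""
    tokens = note.split()
    if not tokens:
        return 0.0
    numeric_tokens = [token for token in tokens if NUMBER_PATTERN.search(token)]
    return len(numeric_tokens) / len(tokens)


sentence = "PA systolic pressure is 36 mmHg."
print(NUMBER_PATTERN.findall(sentence))  # numbers found in the sentence
print(numeric_token_ratio(sentence))     # share of tokens containing a number
```

Applied to a whole column (for example with `df_clinical['notes'].apply(numeric_token_ratio)`), this would give a per-note estimate before and after cleaning.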

# Data Preprocessing

The clinical notes are written in different formats, so it is hard to parse them in a unified way. Most of them, however, follow a convention in which a note is divided into several sections, each with an all-caps title followed by a paragraph of content. In this project, I extract only the content of the notes and discard the titles. The main considerations are as follows:
1. The content of each section is informative enough that it can be assumed to already cover the semantic meaning of its title.
2. We want the vectorization method to generalize, and not all clinical notes follow this format. Including the titles could give word-frequency-based methods an advantage. Extracting only the content makes our vectorization study more robust and applicable to notes in any format.

Let's use a regular expression with some predefined rules to match the titles, and try it on a few examples below.

In [9]:
preprocess.match_titles(df_clinical['notes'][0])

['2-D M-MODE: ', 'DOPPLER: ']

In [10]:
preprocess.match_titles(df_clinical['notes'][3])

['DESCRIPTION:', 'DOPPLER:', 'IMPRESSION:']

In [11]:
preprocess.match_titles(df_clinical['notes'][9])

['TIME SEEN: ',
 'TOTAL RECORDING TIME: ',
 'PATIENT HISTORY: ',
 'DESCRIPTION: ',
 'CLINICAL INTERPRETATION:  ']
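The actual rules live in `src.preprocess` and are not shown here. As a rough sketch of the idea, the pattern below assumes a title starts the note or follows a comma, is written in capitals (digits, spaces, `/` and `-` allowed), and ends with a colon plus optional spaces. `TITLE_PATTERN` and `match_titles_sketch` are hypothetical names, and the real implementation may use different rules.

```python
import re

# Assumed rule: a title sits at the start of the note or right after a comma,
# uses only capitals, digits, spaces, '/' or '-', and ends with ':' and spaces.
TITLE_PATTERN = re.compile(r"(?<![^,])[A-Z0-9][A-Z0-9 /\-]*:[ ]*")


def match_titles_sketch(note):
    """Return every substring of the note that looks like a section title."""
    return TITLE_PATTERN.findall(note)


# Shortened version of the first note in the dataset.
note = (
    "2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter "
    "of 4.7 cm.,2.  Normal size right and left ventricle.,DOPPLER: , ,"
    "1.  Mild mitral and tricuspid regurgitation."
)
print(match_titles_sketch(note))
```

Requiring the title to follow a comma keeps inline labels such as `1.  AWAKE:` from being treated as section titles, since they are preceded by a list index rather than a comma.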

Next we can implement the content extraction. The basic logic is to first match the titles of a note and then keep only the parts of the note that are not titles. Let's use the same notes as examples below.

In [12]:
preprocess.extract_content(df_clinical['notes'][0])

', ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left ventricle.,3.  Normal LV systolic function with left ventricular ejection fraction of 51%.,4.  Normal LV diastolic function.,5.  No pericardial effusion.,6.  Normal morphology of aortic valve, mitral valve, tricuspid valve, and pulmonary valve.,7.  PA systolic pressure is 36 mmHg.,, ,1.  Mild mitral and tricuspid regurgitation.,2.  Trace aortic and pulmonary regurgitation.'

In [13]:
preprocess.extract_content(df_clinical['notes'][3])

',1.  Normal cardiac chambers size.,2.  Normal left ventricular size.,3.  Normal LV systolic function.  Ejection fraction estimated around 60%.,4.  Aortic valve seen with good motion.,5.  Mitral valve seen with good motion.,6.  Tricuspid valve seen with good motion.,7.  No pericardial effusion or intracardiac masses.,,1.  Trace mitral regurgitation.,2.  Trace tricuspid regurgitation.,,1.  Normal LV systolic function.,2.  Ejection fraction estimated around 60%.,'

Here I did some extra cleaning: if a section title contains 'TIME' or 'DATE', the content of that section is removed as well. This reduces the amount of numbers in the content, which can be challenging for vectorization.

In [14]:
preprocess.extract_content(df_clinical['notes'][9])

', This is a 43-year-old female with a history of events concerning for seizures.  Video EEG monitoring is performed to capture events and/or identify etiology.,VIDEO EEG DIAGNOSES,1.  AWAKE:  Normal.,2.  SLEEP:  No activation.,3.  CLINICAL EVENTS:  None.,, Approximately 27 hours of continuous 21-channel digital video EEG monitoring was performed.  The waking background is unchanged from that previously reported.  Hyperventilation produced no changes in the resting record.  Photic stimulation failed to elicit a well-developed photic driving response.,Approximately five-and-half hours of spontaneous intermittent sleep was obtained.  Sleep spindles were present and symmetric.,The patient had no clinical events during the recording.,,This is normal video EEG monitoring for a patient of this age.  No interictal epileptiform activity was identified.  The patient had no clinical events during the recording.  Clinical correlation is required.'
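The extraction logic described above can be sketched as follows. This is a minimal illustration under the same assumed title rule (start of note or after a comma, all caps, trailing colon), not the code in `src.preprocess`; `TITLE_PATTERN`, `extract_content_sketch` and `drop_keywords` are hypothetical names, and the second example note is made up.

```python
import re

# Assumed title rule; the real one in src.preprocess may differ.
TITLE_PATTERN = re.compile(r"(?<![^,])[A-Z0-9][A-Z0-9 /\-]*:[ ]*")


def extract_content_sketch(note, drop_keywords=("TIME", "DATE")):
    """Keep section contents, dropping titles and whole TIME/DATE sections."""
    matches = list(TITLE_PATTERN.finditer(note))
    if not matches:
        return note  # no recognisable titles, keep the note unchanged

    pieces = [note[: matches[0].start()]]  # text before the first title
    for index, match in enumerate(matches):
        # A section runs from the end of its title to the start of the next title.
        is_last = index + 1 == len(matches)
        section_end = len(note) if is_last else matches[index + 1].start()
        if any(keyword in match.group() for keyword in drop_keywords):
            continue  # skip both the title and its content
        pieces.append(note[match.end():section_end])
    return "".join(pieces)


# Shortened from the first note in the dataset.
print(extract_content_sketch(
    "2-D M-MODE: , ,1.  Left atrial enlargement.,DOPPLER: , ,1.  Mild regurgitation."
))
# Made-up note with a time section.
print(extract_content_sketch("TIME SEEN: 0734 hours.,PATIENT HISTORY: Seizures."))
```

Dropping a section by its title is a coarse filter: it removes timestamps cheaply, but any non-numeric text in those sections is lost too.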

Besides removing dates and times, I also want to remove the numbered list indices (such as `1.`, `2.`) from the content. We can see the outcome of the function below.

In [18]:
preprocess.remove_number_index(df_clinical['notes'][10])

'DATE OF EXAMINATION: , Start:  12/29/2008 at 1859 hours.  End:  12/30/2008 at 0728 hours.,TOTAL RECORDING TIME:,  12 hours, 29 minutes.,PATIENT HISTORY:,  This is a 46-year-old female with a history of events concerning for seizures.  The patient has a history of epilepsy and has also had non-epileptic events in the past.  Video EEG monitoring is performed to assess whether it is epileptic seizures or non-epileptic events.,VIDEO EEG DIAGNOSES,   Awake:  Normal.,   Sleep:  Activation of a single left temporal spike seen maximally at T3.,   Clinical events:  None.,DESCRIPTION:  ,Approximately 12 hours of continuous 21-channel digital video EEG monitoring was performed.  During the waking state, there is a 9-Hz dominant posterior rhythm.  The background of the record consists primarily of alpha frequency activity.  At times, during the waking portion of the record, there appears to be excessive faster frequency activity.  No activation procedures were performed.,Approximately four hours 
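One way to strip such indices is a single regular-expression substitution. The sketch below assumes an index is a run of digits followed by a period and whitespace, sitting at the start of the note or right after a comma; `INDEX_PATTERN` and `remove_number_index_sketch` are hypothetical names, and the real helper may handle spacing differently.

```python
import re

# Assumed rule: digits + '.' at the start of the note or directly after a comma,
# followed by whitespace. Decimals such as "4.7" are preceded by a space, not a comma.
INDEX_PATTERN = re.compile(r"(?<![^,])\d+\.(?=\s)")


def remove_number_index_sketch(note):
    """Delete list indices like '1.' while leaving other numbers untouched."""
    return INDEX_PATTERN.sub("", note)


print(remove_number_index_sketch(
    "1.  Left atrial diameter of 4.7 cm.,2.  No pericardial effusion."
))
```

Anchoring on the preceding comma is what keeps measurements such as `4.7 cm` intact while the list markers are removed.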

Some extra data cleaning steps include removing extra commas and redundant whitespace. After cleaning, some notes become empty strings. For simplicity, I fill them with a dummy value here; in a more serious setting, we might remove those empty notes instead. With all the preprocessing and cleaning functions in place, we can batch process the clinical notes.

In [19]:
df_clinical_processed = df_clinical.copy()


def clean_note(note):
    # Apply the cleaning steps in the same order as before.
    note = preprocess.extract_content(note)
    note = preprocess.remove_number_index(note)
    note = preprocess.remove_commas(note)
    note = preprocess.remove_redundant_whitespace(note)
    return preprocess.fill_empty_note(note)


df_clinical_processed['notes'] = df_clinical_processed['notes'].apply(clean_note)
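The three small helpers at the end of the pipeline are not shown in this notebook. A hypothetical sketch of what they might do, assuming every remaining comma is treated as a separator and that the dummy value is the string `"empty"` (both assumptions, as are the `_sketch` function names):

```python
def remove_commas_sketch(note):
    """Replace every comma with a space (assumes commas only act as separators)."""
    return note.replace(",", " ")


def remove_redundant_whitespace_sketch(note):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(note.split())


def fill_empty_note_sketch(note, placeholder="empty"):
    """Return a placeholder for notes that are empty after cleaning."""
    return note if note.strip() else placeholder


raw = ", ,  Left atrial enlargement.,,  Normal LV diastolic function."
cleaned = fill_empty_note_sketch(
    remove_redundant_whitespace_sketch(remove_commas_sketch(raw))
)
print(cleaned)
```

Replacing commas with spaces rather than deleting them avoids gluing neighbouring words together when a comma was the only separator.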

Let's also save the processed notes so they can be reused in later notebooks.

In [20]:
Path(PROCESSDED_DATA_DIR).mkdir(parents=True, exist_ok=True)
df_clinical_processed.to_csv(PROCESSED_CLINICAL_NOTES_FILE, index=False)
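One thing worth keeping in mind when reloading the saved file in a later notebook: by default `pd.read_csv` parses empty fields as `NaN`, so a note left as an empty string would not come back as a string. The sketch below illustrates this with two made-up rows and an in-memory buffer instead of the real file.

```python
import io

import pandas as pd

# Two made-up rows; the second note is an empty string.
df = pd.DataFrame({"category": ["A", "B"], "notes": ["Normal study.", ""]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default behaviour: the empty field is read back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["notes"].isna().tolist())

# With keep_default_na=False the empty field stays an empty string.
buffer.seek(0)
reloaded_as_text = pd.read_csv(buffer, keep_default_na=False)
print(reloaded_as_text["notes"].tolist())
```

Filling empty notes with a placeholder before saving, as done above, sidesteps this; otherwise passing `keep_default_na=False` when reading is one alternative.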