In [1]:
import numpy as np
import pandas as pd

from src import preprocess, evaluate

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
DATA_DIR = "./data/"
CLINICAL_NOTES_FILE = DATA_DIR + "ClinNotes.csv"
MEDICAL_CONCEPTS_FILE = DATA_DIR + "MedicalConcepts.csv"

PROCESSED_DATA_DIR = './processed_data/'
TFIDF_VECTOR_FILE = PROCESSED_DATA_DIR + 'Tfidf_vector.npy'
TFIDF_EXTENDED_VECTOR_FILE = PROCESSED_DATA_DIR + 'Tfidf_extended_vector.npy'
BIOWORDVEC_VECTOR_FILE = PROCESSED_DATA_DIR + 'BioWordVec_vector.npy'
CLINICAL_BERT_VECTOR = PROCESSED_DATA_DIR + 'Clinical_Bert_vector.npy'

# Use Case: Similarity
Once we have the document vectors, we can measure the similarity between any two of them, which is useful for search and ranking. A typical use case: given a user query or a clinical note, find the most relevant notes in our database. I use cosine similarity as the metric so that vectors with larger norms are not favored. Below, given a query clinical note, I retrieve the top-k most relevant notes.
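The ranking step can be sketched with NumPy alone. This is a minimal illustration, not the implementation behind `evaluate.print_topk_most_similar_notes`, whose internals are not shown here; the function name `topk_cosine_similar` and the toy matrix are assumptions made for the example.

```python
import numpy as np

def topk_cosine_similar(vectors, query_idx, k):
    """Return indices and scores of the k rows most similar to row query_idx.

    Hypothetical helper for illustration only.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Normalize each row to unit length; the clip guards against all-zero rows.
    unit = vectors / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is the cosine similarity.
    scores = unit @ unit[query_idx]
    # Exclude the query itself from the ranking.
    scores[query_idx] = -np.inf
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy document vectors: row 1 points the same way as row 0 but is 10x longer.
toy = np.array([[1.0, 0.0], [10.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
indices, scores = topk_cosine_similar(toy, query_idx=0, k=2)
print(indices, scores)
```

In the toy matrix, row 1 ranks first even though its norm is ten times that of the query, which is the scale invariance that motivates cosine similarity over a raw dot product.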

Since we don't have labeled relevance data and I lack the domain knowledge for a qualitative analysis, I skip the evaluation for this task. Instead, I try to interpret the examples below in terms of the vectorization techniques.

In [4]:
df_clinical = pd.read_csv(CLINICAL_NOTES_FILE)

In [5]:
vectors_tfidf = np.load(TFIDF_VECTOR_FILE)
vectors_tfidf_extended = np.load(TFIDF_EXTENDED_VECTOR_FILE)
vectors_biowordvec = np.load(BIOWORDVEC_VECTOR_FILE)
vectors_clinical_bert = np.load(CLINICAL_BERT_VECTOR)

## TF-IDF
The matching depends mainly on word overlap. The extended TF-IDF vectors give results that differ only slightly from the plain ones.
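To see why TF-IDF matching tracks word overlap, here is a small hand-rolled sketch using only the standard library. It is not the vectorizer used to produce `Tfidf_vector.npy`; the whitespace tokenizer, the smoothed IDF formula, and the three toy sentences are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build dense TF-IDF vectors over a shared vocabulary (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "normal lv systolic function",
    "normal left ventricular size and lv systolic function",
    "informed consent was obtained",
]
vecs = tfidf_vectors(docs)
sim_shared = cosine(vecs[0], vecs[1])    # four shared terms
sim_disjoint = cosine(vecs[0], vecs[2])  # no shared terms
print(sim_shared, sim_disjoint)
```

Two documents with no terms in common have a cosine similarity of exactly zero under this scheme, regardless of how related they are in meaning.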

In [6]:
evaluate.print_topk_most_similar_notes(df_clinical, vectors_tfidf, 0, 3)

QUERY CLINICAL NOTE
Category: Cardiovascular / Pulmonary

2-D M-MODE: 
 
1.  Left atrial enlargement with left atrial diameter of 4.7 cm.
2.  Normal size right and left ventricle.
3.  Normal LV systolic function with left ventricular ejection fraction of 51%.
4.  Normal LV diastolic function.
5.  No pericardial effusion.
6.  Normal morphology of aortic valve
 mitral valve
 tricuspid valve
 and pulmonary valve.
7.  PA systolic pressure is 36 mmHg.
DOPPLER: 
 
1.  Mild mitral and tricuspid regurgitation.
2.  Trace aortic and pulmonary regurgitation.

******************************************************************************************
Category: Cardiovascular / Pulmonary

DESCRIPTION:
1.  Normal cardiac chambers size.
2.  Normal left ventricular size.
3.  Normal LV systolic function.  Ejection fraction estimated around 60%.
4.  Aortic valve seen with good motion.
5.  Mitral valve seen with good motion.
6.  Tricuspid valve seen with good motion.
7.  No pericardial effusion or intraca

In [7]:
evaluate.print_topk_most_similar_notes(df_clinical, vectors_tfidf_extended, 0, 3)

QUERY CLINICAL NOTE
Category: Cardiovascular / Pulmonary

2-D M-MODE: 
 
1.  Left atrial enlargement with left atrial diameter of 4.7 cm.
2.  Normal size right and left ventricle.
3.  Normal LV systolic function with left ventricular ejection fraction of 51%.
4.  Normal LV diastolic function.
5.  No pericardial effusion.
6.  Normal morphology of aortic valve
 mitral valve
 tricuspid valve
 and pulmonary valve.
7.  PA systolic pressure is 36 mmHg.
DOPPLER: 
 
1.  Mild mitral and tricuspid regurgitation.
2.  Trace aortic and pulmonary regurgitation.

******************************************************************************************
Category: Cardiovascular / Pulmonary

DESCRIPTION:
1.  Normal cardiac chambers size.
2.  Normal left ventricular size.
3.  Normal LV systolic function.  Ejection fraction estimated around 60%.
4.  Aortic valve seen with good motion.
5.  Mitral valve seen with good motion.
6.  Tricuspid valve seen with good motion.
7.  No pericardial effusion or intraca

## Word Vector Aggregation
The result is quite similar to that of the TF-IDF vectorization. This is expected, since overlapping words lead to similar aggregated word vectors.
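The effect of shared words on an aggregated vector can be sketched as follows. The aggregation used to build `BioWordVec_vector.npy` is not shown in this notebook, so mean pooling is an assumption here, and the three-dimensional embeddings are made up for the example rather than taken from BioWordVec.

```python
import numpy as np

# Hypothetical toy embeddings, not real BioWordVec values.
toy_embeddings = {
    "normal": np.array([0.9, 0.1, 0.0]),
    "mitral": np.array([0.1, 0.9, 0.1]),
    "valve": np.array([0.2, 0.8, 0.2]),
    "effusion": np.array([0.0, 0.2, 0.9]),
    "fracture": np.array([0.7, 0.0, 0.7]),
}

def mean_pool(tokens, embeddings):
    """Average the vectors of all tokens found in the embedding table."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    return np.mean(known, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = mean_pool(["normal", "mitral", "valve"], toy_embeddings)
doc_b = mean_pool(["normal", "mitral", "valve", "effusion"], toy_embeddings)
doc_c = mean_pool(["fracture"], toy_embeddings)

sim_overlap = cosine(doc_a, doc_b)   # three of four words shared
sim_disjoint = cosine(doc_a, doc_c)  # no words shared
print(sim_overlap, sim_disjoint)
```

Because the shared words contribute identical terms to both averages, the pooled vectors of `doc_a` and `doc_b` stay close, which mirrors the overlap-driven behavior of TF-IDF.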

In [8]:
evaluate.print_topk_most_similar_notes(df_clinical, vectors_biowordvec, 0, 3)

QUERY CLINICAL NOTE
Category: Cardiovascular / Pulmonary

2-D M-MODE: 
 
1.  Left atrial enlargement with left atrial diameter of 4.7 cm.
2.  Normal size right and left ventricle.
3.  Normal LV systolic function with left ventricular ejection fraction of 51%.
4.  Normal LV diastolic function.
5.  No pericardial effusion.
6.  Normal morphology of aortic valve
 mitral valve
 tricuspid valve
 and pulmonary valve.
7.  PA systolic pressure is 36 mmHg.
DOPPLER: 
 
1.  Mild mitral and tricuspid regurgitation.
2.  Trace aortic and pulmonary regurgitation.

******************************************************************************************
Category: Cardiovascular / Pulmonary

REASON FOR EXAMINATION: 
 Cardiac arrhythmia.
INTERPRETATION: 
 No significant pericardial effusion was identified.
The aortic root dimensions are within normal limits.  The four cardiac chambers dimensions are within normal limits.  No discrete regional wall motion abnormalities are identified.  The left ventricul

## Language Model based Vectorization
The results are quite different from those of the two methods above, and there appears to be a bias toward long documents.

In [9]:
evaluate.print_topk_most_similar_notes(df_clinical, vectors_clinical_bert, 0, 3)

QUERY CLINICAL NOTE
Category: Cardiovascular / Pulmonary

2-D M-MODE: 
 
1.  Left atrial enlargement with left atrial diameter of 4.7 cm.
2.  Normal size right and left ventricle.
3.  Normal LV systolic function with left ventricular ejection fraction of 51%.
4.  Normal LV diastolic function.
5.  No pericardial effusion.
6.  Normal morphology of aortic valve
 mitral valve
 tricuspid valve
 and pulmonary valve.
7.  PA systolic pressure is 36 mmHg.
DOPPLER: 
 
1.  Mild mitral and tricuspid regurgitation.
2.  Trace aortic and pulmonary regurgitation.

******************************************************************************************
Category: Cardiovascular / Pulmonary

INDICATION: 
 Aortic stenosis.
PROCEDURE: 
 Transesophageal echocardiogram.
INTERPRETATION:  
Procedure and complications explained to the patient in detail.  Informed consent was obtained.  The patient was anesthetized in the throat with lidocaine spray.  Subsequently
 3 mg of IV Versed was given for sedation.  Th