In [1]:
import numpy as np
import pandas as pd

from gensim.models.keyedvectors import KeyedVectors

from src import preprocess
from pathlib import Path

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
DATA_DIR = "./data/"
CLINICAL_NOTES_FILE = DATA_DIR + "ClinNotes.csv"
MEDICAL_CONCEPTS_FILE = DATA_DIR + "MedicalConcepts.csv"

PROCESSDED_DATA_DIR = './processed_data/'
PROCESSED_CLINICAL_NOTES_FILE = PROCESSDED_DATA_DIR + "ClinNotes.csv"
NORMALIZED_CLINICAL_NOTES_FILE = PROCESSDED_DATA_DIR + "ClinNotes_normalized.csv"
BIOWORDVEC_VECTOR_FILE = PROCESSDED_DATA_DIR + 'BioWordVec_vector.npy'

WORD_VEC_DIR = './word_vector/'
BIOWORDVEC_MODEL_FILE = WORD_VEC_DIR + 'BioWordVec_PubMed_MIMICIII_d200.vec.bin'

# Word Vector Aggregation Vectorization

In this notebook we use word vector aggregation to vectorize the clinical notes. Word vectors are pre-trained on a large corpus to capture semantic meaning, so for this project we can use vectors pre-trained on a medical corpus. The word vectors I chose are [BioWordVec](https://github.com/ncbi-nlp/BioSentVec).

As with TF-IDF, word order does not matter in this method. Unlike what I did for TF-IDF, however, I'm not going to include the related medical terms, since I believe the word vectors themselves should already encode the semantic correlation.

In [4]:
df_clinical_normalized = pd.read_csv(NORMALIZED_CLINICAL_NOTES_FILE)

As the word vector file is quite big, I didn't include it on GitHub. It can be downloaded from [here](https://github.com/ncbi-nlp/BioSentVec#biowordvec-1-biomedical-word-embeddings-with-fasttext).

In [5]:
%%time

model = KeyedVectors.load_word2vec_format(BIOWORDVEC_MODEL_FILE, binary=True, limit=500000)

CPU times: total: 3.78 s
Wall time: 3.8 s


There are several ways to aggregate word vectors; in this project we simply average them.
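To make the averaging idea concrete, here is a minimal sketch of what such an aggregation could look like. It is not the actual implementation in `src/preprocess.py`: the helper name `average_doc_vectors`, the whitespace tokenization, and the choice to leave an all-OOV document as a zero vector are assumptions made for illustration. A plain dict stands in for the word vector model so the snippet runs without the BioWordVec file.

```python
import numpy as np


def average_doc_vectors(lookup, docs, dim):
    """Hypothetical helper: average the vectors of the known tokens per document.

    `lookup` is anything supporting `token in lookup` and `lookup[token]`
    (a dict here, used as a stand-in for the loaded word vectors).
    """
    vectors = np.zeros((len(docs), dim), dtype=np.float32)
    num_word = 0  # total number of tokens seen
    num_oov = 0   # tokens with no vector in `lookup`
    for i, doc in enumerate(docs):
        tokens = str(doc).split()  # assumption: notes are whitespace-tokenized
        num_word += len(tokens)
        known = [lookup[t] for t in tokens if t in lookup]
        num_oov += len(tokens) - len(known)
        if known:  # a document with only OOV tokens stays a zero vector
            vectors[i] = np.mean(known, axis=0)
    return vectors, num_word, num_oov


# Toy 2-dimensional "word vectors" for illustration only.
toy = {"chest": np.array([1.0, 0.0]), "pain": np.array([0.0, 1.0])}
vecs, n_word, n_oov = average_doc_vectors(toy, ["chest pain", "chest xyzzy"], dim=2)
```

In this toy run the first document averages two known vectors, while the second contains one OOV token and so keeps only the vector for `chest`.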

In [6]:
vectors, num_word, num_oov = preprocess.get_aggregated_doc_vector(model, df_clinical_normalized['notes'])

In [9]:
Path(PROCESSDED_DATA_DIR).mkdir(parents=True, exist_ok=True)
np.save(BIOWORDVEC_VECTOR_FILE, vectors)

Below we get a sense of how many tokens are out of vocabulary (OOV) for the word vectors; the share is negligible.

In [10]:
print('OOV words has a percentage of {:.2f}%'.format((num_oov / num_word) * 100))

OOV words has a percentage of 0.98%
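If the OOV share ever looked larger, one way to investigate would be to count which tokens are missing. The sketch below is an illustration under assumptions, not part of the original pipeline: `count_oov_tokens` is a hypothetical helper, tokens are split on whitespace, and a small set stands in for the model vocabulary.

```python
from collections import Counter


def count_oov_tokens(vocab, docs):
    """Hypothetical helper: count tokens that are not found in `vocab`."""
    counts = Counter()
    for doc in docs:
        for token in str(doc).split():  # assumption: whitespace tokenization
            if token not in vocab:
                counts[token] += 1
    return counts


# Toy vocabulary and notes for illustration only.
toy_vocab = {"chest", "pain"}
oov_counts = count_oov_tokens(toy_vocab, ["chest pain", "chest xyzzy", "xyzzy plugh"])
top_oov = oov_counts.most_common(2)
```

Sorting by frequency shows whether the missing tokens are rare misspellings or domain terms that matter for the notes.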


I am aware that word vector aggregation is a bit crude, as it ignores word order entirely. This shortcoming could be overcome with an unsupervised sentence vectorization method. In fact, the BioWordVec project also provides BioSentVec, which is built on the Sent2Vec technique. However, installing the Sent2Vec package on my local machine failed with a C++ error, and due to time constraints I didn't dig into it and moved on with word vector aggregation.