## TF-IDF notebook
Loads clinical notes from a locally stored CSV file, vectorizes them with TF-IDF, and trains a small neural network to predict mortality.

The model's results look OK (~75-80% accuracy), but I'm not sure whether I should also use the other fields that come with the note data. There is a description column (VARCHAR(255)) and an iserror flag that I should probably consider when fine-tuning the model. I will also consider a custom stop word list if the final model uses TF-IDF.

Also, I don't know how many noteevents a single patient can have. I'm naively passing each note through TF-IDF into the prediction layer, but it may be better to bundle all the notes for a single patient together so the model has as much information as possible when predicting mortality.
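Both questions can be explored with a pandas `groupby`. The sketch below counts notes per patient and concatenates each patient's notes into one document. It runs on a toy DataFrame whose rows are invented for illustration, and it assumes the `SUBJECT_ID` and `TEXT` columns that the rest of this notebook reads from NOTEEVENTS.

```python
import pandas as pd

# Toy stand-in for a NOTEEVENTS chunk; the rows are invented for illustration.
toy_chunk = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3, 3, 3],
    "TEXT": ["admit note", "discharge summary", "radiology report",
             "nursing note", "echo report", "discharge summary"],
})

# How many notes does each patient have?
notes_per_patient = toy_chunk.groupby("SUBJECT_ID").size()

# Concatenate all notes for a patient into one document (one row per patient).
bundled = (
    toy_chunk.groupby("SUBJECT_ID")["TEXT"]
    .agg(" ".join)
    .reset_index()
)

print(notes_per_patient.to_dict())
print(bundled)
```

With bundled notes there would be one training example per patient, so the mortality labels would also need to be looked up once per patient rather than once per note.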

In [1]:
import pandas, os, sys, time
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Parameters for the model
num_words = 6000   # TF-IDF vocabulary size: how many words to consider
num_files = 10000  # number of note rows to load (one row per note, not per patient); use 500 if speed is a concern
test_split = 0.2   # fraction of the notes held out for testing

## Set up environment variables and the Spark session used to load mortality data

In [3]:
%env PSQL_USER=postgres
%env PSQL_PSWD=postgres
%env JDBC_PATH=../TAMU-MedResearch/postgresql-42.4.0.jar
%env PYSPARK_PYTHON=/home/ugrads/k/kingrc15/anaconda3/bin/python

if "/home/ugrads/n/nickcheng0921/omop-summary" not in sys.path:
    sys.path.append("/home/ugrads/n/nickcheng0921/omop-summary")
import omop_summary

sc, session = omop_summary.create_spark_context()

env: PSQL_USER=postgres
env: PSQL_PSWD=postgres
env: JDBC_PATH=../TAMU-MedResearch/postgresql-42.4.0.jar
env: PYSPARK_PYTHON=/home/ugrads/k/kingrc15/anaconda3/bin/python


Load note data from a local CSV file.

Extract mortality labels from MIMIC-III data using https://github.com/stmilab/omop-summary

In [4]:
start = time.time()
# The CSV is very large (about 3 GB, 91,691,299 lines),
# so read it in chunks and predict mortality within a chunk.
notes_path = "../mimic3Notes/physionet.org/files/mimiciii/1.4/NOTEEVENTS.csv"
notes_reader = pandas.read_csv(notes_path, iterator=True)

# The reader streams from the file, so each get_chunk() call consumes the next rows.
chunk = notes_reader.get_chunk(num_files)
notes = chunk["TEXT"].tolist()

end = time.time()
print(f"Execution {1.0*(end-start)/60} minutes")

Execution 0.02320173184076945 minutes


In [5]:
m_labels = []
start = time.time()
# Each row of the chunk is one note; look up the mortality flag of that note's patient.
for _, note_row in chunk.iterrows():
    patient_id = note_row.SUBJECT_ID
    query = f"SELECT * FROM mimiciii.patients WHERE subject_id = {patient_id}"
    m_labels.append(omop_summary.utils.get_table(query, session).head(1)[0]["expire_flag"])
end = time.time()
print(f"Execution {1.0*(end-start)/60} minutes") # about 1.22 min for 1000 notes, 6.5 for 5000, 12.5 for 10000

Execution 13.010804132620494 minutes
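The loop above issues one query per note, so a patient with several notes is looked up several times. A possible speed-up, sketched here under assumptions, is to query each distinct `SUBJECT_ID` once and map the flags back onto the notes. `fetch_expire_flag` is a hypothetical stand-in for the `omop_summary.utils.get_table` call, and the toy IDs and flags are made up.

```python
import pandas as pd

# Toy stand-in for the notes chunk; values are invented for illustration.
toy_chunk = pd.DataFrame({"SUBJECT_ID": [10, 10, 11, 12, 11]})

# Hypothetical patient table standing in for mimiciii.patients.
fake_patients = {10: 1, 11: 0, 12: 1}
calls = []  # records every lookup so the number of "queries" can be inspected

def fetch_expire_flag(subject_id):
    """Hypothetical lookup: returns expire_flag for one patient from a dict."""
    calls.append(subject_id)
    return fake_patients[subject_id]

# Query once per distinct patient, then map the flags back onto every note.
flag_by_patient = {
    sid: fetch_expire_flag(sid) for sid in toy_chunk["SUBJECT_ID"].unique()
}
toy_labels = toy_chunk["SUBJECT_ID"].map(flag_by_patient).tolist()

print(toy_labels, len(calls))
```

How much time this saves depends on how many notes each patient has in the chunk, which this notebook has not measured yet.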


In [6]:
# Ensure that each note has a mortality label
assert len(m_labels) == len(notes)
print(f"Viewing {len(m_labels)} records, w/ {100.0*sum(m_labels)/len(m_labels)}% mortality rate")

Viewing 10000 records, w/ 52.43% mortality rate


## Fit TF-IDF vectorizer and attach a fully connected prediction layer

In [7]:
notes_vectorizer = TfidfVectorizer(max_features=num_words)

In [8]:
from sklearn.model_selection import train_test_split
import numpy as np
import tensorflow as tf

In [9]:
x_train, x_test, y_train, y_test = train_test_split(notes, m_labels, test_size=test_split, random_state=13)

# Fit on the training notes only: the vectorizer has to learn its vocabulary
# and IDF weights before it can transform any text.
notes_tfidf = notes_vectorizer.fit(x_train)
embed1 = notes_vectorizer.transform(x_train).toarray()  # dense training features
embed2 = notes_vectorizer.transform(x_test).toarray()   # dense test features
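To make the fit/transform distinction concrete, here is a minimal sketch on a made-up corpus. `fit` learns the vocabulary and IDF weights from the training documents only, and `transform` then maps any text onto that fixed vocabulary, so words not kept during fitting are ignored. The toy sentences and the `max_features=3` setting are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["patient stable", "patient febrile overnight", "stable overnight"]
test_docs = ["patient intubated overnight"]  # "intubated" never appears in train_docs

# Keep only the 3 most frequent terms of the training corpus.
vectorizer = TfidfVectorizer(max_features=3)
vectorizer.fit(train_docs)

vocab = sorted(vectorizer.vocabulary_)
train_matrix = vectorizer.transform(train_docs).toarray()
test_matrix = vectorizer.transform(test_docs).toarray()

print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

The test matrix has the same number of columns as the training matrix, which is what lets the same network consume both.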

In [10]:
# Connect the TF-IDF features to a small fully connected network.
# Binary classification reference:
# https://stackoverflow.com/questions/64764131/how-to-feed-tf-idf-vectorizer-output-as-input-to-a-neural-network-for-classifica
# Multi-class reference (not used here):
# https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, input_shape=(embed1.shape[1],), activation='relu'),
    tf.keras.layers.Dropout(0.2),
    # Single sigmoid unit for binary mortality prediction.
    # Multi-class alternative: tf.keras.layers.Dense(5, activation='softmax')
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
# Multi-class alternative: model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               768128    
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 768,257
Trainable params: 768,257
Non-trainable params: 0
_________________________________________________________________


## Results

In [11]:
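# fit() uses its default of a single epoch here.
model.fit(np.array(embed1), np.array(y_train))
# evaluate() returns [loss, accuracy] on the held-out test notes.
model.evaluate(np.array(embed2), np.array(y_test))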



[0.44267070293426514, 0.7990000247955322]
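For context, the test accuracy (the second value in the output above) can be compared with a majority-class baseline derived from the 52.43% mortality rate printed earlier. This is a rough sketch: that rate was computed over all 10000 loaded notes, not over the test split, so the true test-set baseline may differ slightly.

```python
# Values copied from the notebook outputs above.
mortality_rate = 0.5243   # fraction of loaded notes with expire_flag = 1
model_accuracy = 0.7990   # second value returned by model.evaluate

# Always predicting the more common class would score this accuracy.
baseline_accuracy = max(mortality_rate, 1.0 - mortality_rate)
improvement = model_accuracy - baseline_accuracy

print(f"baseline {baseline_accuracy:.4f}, model {model_accuracy:.4f}, gain {improvement:.4f}")
```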