### Deep Learning Approach
In this notebook I am going to use 2 deep learning approaches:

* Collobert et al.(2011) and Gehrmann et al. (2017) CNN model. This model has shown a good performance on text analysis specifically MIMIC-III discharge summaries.
        
* CNN-BiLSTM: I am using this approach due to multiple reasons:
        
     * CNN can do feature extraction also it improves the result and speed of LSTM model which is a good method for analyzing sequence data
     * BiLSTMs can understand context better than LSTM
       
    
I am also trying different embedding methods:

* Make my own embedding using more than 2 million notes that are available in MIMIC-III database.

 * GloVe pretrained embedding

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import pickle
import warnings
warnings.filterwarnings('ignore')

In [4]:
import nltk
from nltk import word_tokenize
from nltk.stem import *
from nltk.util import ngrams
import string
from nltk.corpus import stopwords
import re
from time import time

# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Dense, Flatten, LSTM, Conv1D, Bidirectional
from keras.layers import MaxPooling1D, Dropout, Activation, GlobalMaxPooling1D, Add, Concatenate, concatenate, Input
from keras.layers.embeddings import Embedding
from keras.callbacks import Callback
from keras.optimizers import Adam, Adadelta

# SkLearn
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from keras.callbacks import TensorBoard
from gensim.models import Word2Vec

Using TensorFlow backend.


In [6]:
df_train = pd.read_pickle('df_train.pkl')
df_valid = pd.read_pickle('df_valid.pkl')

In [7]:
def preprocess_text(df):
    # This function preprocesses the text by filling not a number and replacing new lines ('\n') and carriage returns ('\r')
    df.TEXT = df.TEXT.fillna(' ')
    df.TEXT = df.TEXT.str.replace('\n',' ')
    df.TEXT = df.TEXT.str.replace('\r',' ')
    return df

In [8]:
df_train = preprocess_text(df_train)
df_valid = preprocess_text(df_valid)

In [9]:
y_train = df_train.OUTPUT_LABEL
y_valid = df_valid.OUTPUT_LABEL

#### I am not going to do stemming this time, as embedding of different formats of a word are very close to each other.

In [10]:
def clean_text(text):

    punc_list = string.punctuation+'0123456789'
    t = str.maketrans(dict.fromkeys(punc_list, " "))
    text = text.lower().translate(t).split()
    
    ## Remove stop words
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops and len(w) >= 3]
    text = " ".join(text)
    
    ## Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    ## Stemming
    # text = text.split()
    # stemmer = SnowballStemmer('english')
    # stemmed_words = [stemmer.stem(word) for word in text]
    # text = " ".join(stemmed_words)
    return text

In [12]:
df_train['TEXT'] = df_train['TEXT'].map(lambda x: clean_text(x))
df_valid['TEXT'] = df_valid['TEXT'].map(lambda x: clean_text(x))

In [13]:
df_train.TEXT[0]

'admission date discharge date date birth sex service colorectal surgery green surgery history present illness year old man history ulcerative colitis since patient hospitalized almost annually flareups current flare began three weeks ago time admitted hospital past three weeks recently started hydrocortisone sent home several days prior admission patient complained increasing symptoms weekend severe lower abdominal pain intake low grade fevers nausea vomiting bloody bowel movements per day past medical history ulcerative colitis past surgical history none medications hydrocortisone tid two ativan prn iron folic acid prevacid allergies mercaptopurine reaction jaundice social history tobacco occasional alcohol family history mother name disease review systems chest pain shortness breath palpitations dysuria hematuria hematemesis physical examination admission vital signs temperature heart rate blood pressure respirations pulse oxygenation room air alert oriented times three acute distre

#### Now I want to create sequence of the words using Keras Tokenizer. I only consider the first 4000 words of the texts. This is more than enough.

In [15]:
max_leng = 4000
vocabulary_size = 400000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(df_train['TEXT'])
sequences = tokenizer.texts_to_sequences(df_train['TEXT'])
data = pad_sequences(sequences, maxlen=max_leng)

In [16]:
### Create sequence for validation dataset:

In [17]:
sequences_valid = tokenizer.texts_to_sequences(df_valid['TEXT'])
data_valid = pad_sequences(sequences_valid , maxlen=max_leng)

In [18]:
data

array([[   0,    0,    0, ...,  450,  450,  171],
       [   0,    0,    0, ...,  450,  450,  171],
       [   0,    0,    0, ...,  450,  450,  171],
       ...,
       [   0,    0,    0, ...,  170,   60, 1092],
       [   0,    0,    0, ...,    4,  231,  275],
       [   0,    0,    0, ...,    2,   20,  171]], dtype=int32)

In [32]:
Model = []
AUC = []

### Word2Vec Embedding:
#### I am using gensim Word2Vec function to train a corpus of 2 million notes available in MIMIC-III NOTEEVENTS table. To prevent any leakage of data I have removed the test dataset notes from this corpus.

In [21]:
model_vec = Word2Vec.load("Word2Vec_5.bin")

#### Let’s see what the embedding of  "admission" looks like in a 100 dimensions world.  

In [23]:
model_vec['admission']

array([-9.940672  ,  4.7706604 , -0.17481485,  2.2851875 ,  3.879398  ,
       -0.34852552,  4.117314  , -3.8222647 ,  2.6205635 , -0.90670586,
       -2.7382095 ,  3.1581028 ,  5.8051367 ,  3.2299984 , -1.6231452 ,
        4.0764256 , -2.7143908 , -4.548696  ,  2.6629424 , -1.9236807 ,
        2.9011989 , -0.9748698 , -0.21091324, -0.12038308,  0.15923713,
       -8.00361   , -0.7134761 ,  0.8393403 , -1.629703  , -3.187114  ,
       -0.40883386, -5.453877  , -5.1056876 , -1.3243814 ,  1.9521133 ,
       -1.1441629 , -6.0349827 , -1.7179009 ,  0.8746022 ,  3.4964767 ,
        4.7658687 , -1.5917919 ,  1.0501578 , -4.5565786 , -0.37815642,
       -3.9397233 , -3.086613  , -2.6896765 , -3.4436028 , -2.5318625 ,
        1.2601362 , -1.2332696 , -3.9269254 ,  1.434145  ,  0.15418415,
        2.2525094 ,  2.9194953 , -3.1198769 ,  0.9414409 ,  0.23927784,
        0.6692262 , -1.4693029 , -4.6392746 ,  1.5093796 ,  5.982255  ,
       -6.322537  ,  2.7222247 , -2.4184532 ,  0.60186446, -1.19

In [24]:
embedding_matrix = np.zeros((len(tokenizer.word_index)+1, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = model_vec[word]
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [25]:
vocabulary_size = embedding_matrix.shape[0]
max_leng = 4000

### Using Collobert et al. Model (2011) with Word2Vec Embedding
#### This model that is later used by Gehrmann et al. (2017) has shown a good performance on MIMIC-III discharge summaries .

In [26]:
def define_model(max_leng, vocabulary_size):
    inputs = Input(shape=(max_leng,))
    # channel 1
    embedding1 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv1 = Conv1D(filters=100, kernel_size = 1, activation='relu')(embedding1)
    drop1 = Dropout(0.5)(conv1)
    pool1 = GlobalMaxPooling1D()(drop1)
    #flat1 = Flatten()(pool1)
    # channel 2
    embedding2 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv2 = Conv1D(filters=100, kernel_size = 2, activation='relu')(embedding2)
    drop2 = Dropout(0.5)(conv2)
    pool2 = GlobalMaxPooling1D()(drop2)
    #flat2 = Flatten()(pool2)
    # channel 3
    embedding3 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv3 = Conv1D(filters=100, kernel_size = 3, activation='relu')(embedding3)
    drop3 = Dropout(0.5)(conv3)
    pool3 = GlobalMaxPooling1D()(drop3)
    #flat3 = Flatten()(pool3)
    # channel 3
    embedding4 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv4 = Conv1D(filters=100, kernel_size = 4, activation='relu')(embedding3)
    drop4 = Dropout(0.5)(conv4)
    pool4 = GlobalMaxPooling1D()(drop4)
    #flat4 = Flatten()(pool4)
    # channel 3
    embedding5 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv5 = Conv1D(filters=100, kernel_size = 5, activation='relu')(embedding3)
    drop5 = Dropout(0.5)(conv5)
    pool5 = GlobalMaxPooling1D()(drop5)
    #flat5 = Flatten()(pool5)
    # merge
    merged = concatenate([pool1, pool2, pool3, pool4, pool5])
    #dense1 = Dense(hidden_dims, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(merged)
    model = Model(inputs=[inputs], outputs=outputs)
    # compile
    adam = Adam(0.0001)
    # adadelta = Adadelta(0.0001)
    model.compile(loss='binary_crossentropy', optimizer= 'adam', metrics=['accuracy'])
    # summarize
    print(model.summary())
    #plot_model(model, show_shapes=True, to_file='multichannel.png')
    return model

In [27]:
from keras.utils.vis_utils import plot_model
# define model
model = define_model(max_leng, vocabulary_size)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 4000)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 4000, 100)    4090300     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 4000, 100)    4090300     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 4000, 100)    4090300     input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (

In [44]:
model.fit(data,np.array(y_train),validation_data=(data_valid, y_valid), epochs= 50)

Train on 4784 samples, validate on 10223 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd80211e2e8>

In [45]:
y_pred = model.predict(data_valid)
fpr, tpr, _ = metrics.roc_curve(np.array(y_valid), y_pred)
auc = metrics.auc(fpr,tpr)
auc

0.6839439830314257

In [46]:
model.save('CNN5_W2V')

In [47]:
Model.append('Collbert CNN Model with Word2Vec Embedding')
AUC.append(auc)

### CNN + Stacked Bidirectional LSTM and Word2Vec Embedding

In [48]:
def create_conv_model():
    model_conv = Sequential()
    model_conv.add(Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False))
    model_conv.add(Dropout(0.2))
    model_conv.add(Conv1D(64, 5, activation='relu'))
    model_conv.add(MaxPooling1D(pool_size=4))
    model_conv.add(Bidirectional(LSTM(100,recurrent_dropout= 0.2, return_sequences= True)))
    model_conv.add(Bidirectional(LSTM(50,recurrent_dropout= 0.2)))
    model_conv.add(Dense(1, activation='sigmoid'))
    model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model_conv
model_conv = create_conv_model()

In [50]:
model_conv.fit(data, np.array(y_train), validation_data=(data_valid, y_valid), epochs=10)

Train on 4784 samples, validate on 10223 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd80c1f9a90>

In [51]:
y_pred = model_conv.predict_proba(data_valid)
fpr, tpr, _ = metrics.roc_curve(np.array(y_valid), y_pred)
auc = metrics.auc(fpr,tpr)
auc

0.63704221674103

In [52]:
model.save('CNN_BiLSTM_W2V')

In [53]:
Model.append('CNN-BiLSTM Model with Word2Vec Embedding')
AUC.append(auc)

### GloVe Embedding:
#### Here I am using GloVe model with 6B tokens and 400k vocabs from Wikipedia 2014 + Gigaword5 with 100 dimensions. You can download this pretrained embedding here:
https://nlp.stanford.edu/projects/glove/


In [54]:
embeddings_index = dict()
f = open('glove.6B.100d.txt', encoding="utf8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

In [55]:
vocabulary_size = 400000
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in tokenizer.word_index.items():
    if index > vocabulary_size - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector

In [56]:
def define_model(max_leng, vocabulary_size):
    inputs = Input(shape=(max_leng,))
    # channel 1
    embedding1 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv1 = Conv1D(filters=100, kernel_size = 1, activation='relu')(embedding1)
    drop1 = Dropout(0.5)(conv1)
    pool1 = GlobalMaxPooling1D()(drop1)
    #flat1 = Flatten()(pool1)
    # channel 2
    embedding2 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv2 = Conv1D(filters=100, kernel_size = 2, activation='relu')(embedding2)
    drop2 = Dropout(0.5)(conv2)
    pool2 = GlobalMaxPooling1D()(drop2)
    #flat2 = Flatten()(pool2)
    # channel 3
    embedding3 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv3 = Conv1D(filters=100, kernel_size = 3, activation='relu')(embedding3)
    drop3 = Dropout(0.5)(conv3)
    pool3 = GlobalMaxPooling1D()(drop3)
    #flat3 = Flatten()(pool3)
    # channel 3
    embedding4 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv4 = Conv1D(filters=100, kernel_size = 4, activation='relu')(embedding3)
    drop4 = Dropout(0.5)(conv4)
    pool4 = GlobalMaxPooling1D()(drop4)
    #flat4 = Flatten()(pool4)
    # channel 3
    embedding5 = Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False)(inputs)
    conv5 = Conv1D(filters=100, kernel_size = 5, activation='relu')(embedding3)
    drop5 = Dropout(0.5)(conv5)
    pool5 = GlobalMaxPooling1D()(drop5)
    #flat5 = Flatten()(pool5)
    # merge
    merged = concatenate([pool1, pool2, pool3, pool4, pool5])
    # interpretation
    # dense1 = Dense(hidden_dims, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(merged)
    model = Model(inputs=[inputs], outputs=outputs)
    # compile
    adam = Adam(0.00001)
    # adadelta = Adadelta(0.0001)
    model.compile(loss='binary_crossentropy', optimizer= 'adam', metrics=['accuracy'])
    # summarize
    print(model.summary())
    #plot_model(model, show_shapes=True, to_file='multichannel.png')
    return model

In [57]:
model.fit(data,np.array(y_train),validation_data=(data_valid, y_valid), epochs= 50)

Train on 4784 samples, validate on 10223 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7fd8188b2f28>

In [58]:
y_pred = model.predict(data_valid)
fpr, tpr, _ = metrics.roc_curve(np.array(y_valid), y_pred)
auc = metrics.auc(fpr,tpr)
auc

0.6913309266411104

In [59]:
model.save('CNN5_GloVe')

In [60]:
Model.append('Collbert CNN Model with GloVe Embedding')
AUC.append(auc)

### CNN + Stacked Bidirectional LSTM

In [61]:
def create_conv_model():
    model_conv = Sequential()
    model_conv.add(Embedding(vocabulary_size, 100, input_length=max_leng , weights=[embedding_matrix], trainable=False))
    model_conv.add(Dropout(0.2))
    model_conv.add(Conv1D(64, 5, activation='relu'))
    model_conv.add(MaxPooling1D(pool_size=4))
    model_conv.add(Bidirectional(LSTM(100,recurrent_dropout= 0.2, return_sequences= True)))
    model_conv.add(Bidirectional(LSTM(50,recurrent_dropout= 0.2)))
    model_conv.add(Dense(1, activation='sigmoid'))
    model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model_conv
model_conv = create_conv_model()

In [62]:
model_conv.fit(data, np.array(y_train), validation_data=(data_valid, y_valid), epochs=10)

Train on 4784 samples, validate on 10223 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd80c2b1c50>

In [63]:
y_pred = model_conv.predict_proba(data_valid)
fpr, tpr, _ = metrics.roc_curve(np.array(y_valid), y_pred)
auc = metrics.auc(fpr,tpr)
auc

0.6159363964085272

In [64]:
model.save('CNN_BiLSTM_GloVe')

In [65]:
Model.append('CNN-BiLSTM Model with GloVe Embedding')
AUC.append(auc)

### Results

In [68]:
Results_DL = pd.DataFrame({'Model': Model, 'AUC': AUC})
Results_BoW = pd.read_pickle('AUC_Models_BoW.pkl')
Results = pd.concat([Results_BoW, Results_DL], axis = 0)
Results.reset_index(drop = True)

Unnamed: 0,Model,AUC
0,LR on BoW,0.6997
1,LR on BoW & BoC,0.699682
2,NB on BoW & BoC,0.548622
3,LR on TF-IDF of BoW & BoC,0.658587
4,LR on BoC,0.571647
5,NN on BoW & BoC,0.708844
6,LR on BoW & BoC & up to 3grams,0.706103
7,NN on BoW & BoC & up to 3grams,0.715649
8,Collbert CNN Model with Word2Vec Embedding,0.683944
9,CNN-BiLSTM Model with Word2Vec Embedding,0.637042


### Conclusions:

 * Naïve approach of BoW yields slightly better results than sophisticated models such as CNN and BiLSTM.
 
 * Adding up to 3-grams improves the result of BoW model.

 * Combination of BoW and BoC with up to 3-grams and one dense layer has the best performance.

 * Adding bag of polarized CUIs to the analysis does not improve the results. It could be caused by the high correlation between text and CUIs. Also, I have realized that cTAKES sometimes cannot catch polarization and negation. It is also reported by my classmates.

 * Collobert et al. (2011) CNN model outperforms CNN-BiLSTM model.

 * There is not much difference between the results of GloVe pretrained embedding and Word2Vec embedding of MIMIC-III notes. 

 * LSTM and BiLSTM are prone to overfitting. Here you see that I am using only 10 epochs. This is because after certain numbers of epochs AUC drops due to overfitting.


### References:

 * Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. Journal of Machine Learning Research. 2011; 12(Aug):2493–2537. 
 
 * https://towardsdatascience.com/introduction-to-clinical-natural-language-processing-predicting-hospital-readmission-with-1736d52bc709
 
 * https://www.researchgate.net/publication/306093564_Dimensional_Sentiment_Analysis_Using_a_Regional_CNN-LSTM_Model

 * https://arxiv.org/abs/1703.08705

 * https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5510858/

 * https://www.nature.com/articles/sdata201635


Thank you!