The task in this notebook is in-hospital mortality prediction using physiological variables from the first 48 hours of data.

The positive class is mortality and the negative class is survival.

We define the preprocessing, training, and test code in Python and PyTorch. Complete the code and answer the questions in the following exercises.

## Exercises

1.   Train word embeddings using retrofitting
2.   Visualize retrofitted embedding
  *   What do the word2vec and retrofitted embeddings look like?
  *   What difference do you see and why?
3.   Train the model first with the word2vec embeddings and then with the retrofitted embeddings. Then evaluate both trained models on the test set.
  * Do you see any difference?
  * Do results improve? 
  * Compare the AUC and the calibration curves (with and without retrofitting).
  * **EXTRA** Plot the AUC and calibration curves for both models (with and without retrofitting) together.
  * **EXTRA** Try different (larger and smaller) values for the ```-n``` parameter of retrofitting (the number of iterations).
      * Do the results change?
4. **EXTRA** Generate the lexicon for retrofitting using the additional notebook Named Entity Recognition with Scispacy

## Download data

In [None]:
#download csv files from gdrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
#download file from gdrive with the id for share file
# create share link for tar.gz file  and copy id
#https://drive.google.com/file/d/1PV8SF2ToQ50QaFA435gXBw8UCXF6Zio9/view?usp=sharing
#3k patients
file_id = '1S4jRAEmI4mLNCNhT3Z06bM9nhPSTSkyE'
downloaded = drive.CreateFile({'id': file_id})
#3k patients
downloaded.GetContentFile('test_text_data_2.tar.gz')

In [None]:
#extract data
!tar -xzf test_text_data_2.tar.gz

**Pre-trained word embeddings** Embeddings generated from the clinical notes of the patients in the training set with word2vec.

In [None]:
#similarly for word embeddings
# https://drive.google.com/file/d/1i28MYb91_Gz2zB1c-5nExTLGOd9EhPQv/view?usp=sharing
file_id = '1b_XLXkNHdhLtI1pmgY9EsaTv-Y7s6fcW' # URL id. 
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('mimic_vectors_training.100d.txt')

## Upload utils 
**Upload mimic_utils_text.py**
Utilities for reading the CSV files, normalizing the data, and imputing missing values.
The imputation technique used here sets each missing value to the previous observed value; other imputation methods are available. These utilities extend the [YerevaNN](https://github.com/YerevaNN/mimic3-benchmarks) framework so that it can also use the clinical notes.
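
To make the "previous value" strategy concrete, the sketch below forward-fills a single time series in plain Python. It is an illustration only, not the code in `mimic_utils_text.py`; the function name `impute_previous` and the `default` fallback for gaps at the start of a series are assumptions made for this example.

```python
def impute_previous(values, default=0.0):
    """Forward-fill missing entries (None) with the last observed value.

    `default` is a hypothetical fallback used when a gap occurs before
    any value has been observed.
    """
    filled = []
    last = default
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Made-up heart-rate readings with gaps.
heart_rate = [None, 80.0, None, None, 95.0, None]
filled_rate = impute_previous(heart_rate, default=70.0)
print(filled_rate)
```

Forward filling keeps the series causal (a value is never filled from the future), which is one reason it is a common default for clinical time series.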

In [None]:
#download mimic_utils_text config
#TODO add config into dataset
from google.colab import files
uploaded = files.upload()

Saving mimic_utils_text.py to mimic_utils_text.py


**Configuration files for mimic_utils_text.py** Files necessary to use mimic_utils_text.py. For more information see the [YerevaNN](https://github.com/YerevaNN/mimic3-benchmarks) framework.

In [None]:
#download discretizer config
#TODO add config into dataset
from google.colab import files
uploaded = files.upload()

Saving discretizer_config.json to discretizer_config.json


In [None]:
#download normalizer config
#TODO add config into dataset
from google.colab import files
uploaded = files.upload()

Saving norm_start_time_zero.normalizer to norm_start_time_zero.normalizer


**Upload retrofit.py**
This file implements the retrofitting method seen in our class. More details and documentation are available [here](https://github.com/mfaruqui/retrofitting).

In [None]:
#download retrofitting
#TODO add config into dataset
from google.colab import files
uploaded = files.upload()

Saving retrofit.py to retrofit.py


**Lexicon (knowledge graph) to use for your experiment** This is a lexicon based on [UMLS](https://www.nlm.nih.gov/research/umls/index.html). Specifically:
  

1.   Entities have been annotated in the clinical notes of patients included in the training data;
2.   Annotated entities have been linked to UMLS concepts;
3.   Synonyms of those UMLS concepts have been extracted (using the proper relation).
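
The retrofitting script reads the lexicon from a plain-text file. Assuming the format used in the retrofitting repository (one entry per line: a word followed by its related words, separated by whitespace), the file can be read into a dictionary roughly as sketched below. The function name `read_lexicon_lines` and the example terms are illustrative and are not taken from `lexicon_UMLS_synonyms.txt`.

```python
def read_lexicon_lines(lines):
    """Parse lines of the assumed form 'word syn1 syn2 ...' into a dict."""
    lexicon = {}
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        lexicon[tokens[0]] = tokens[1:]
    return lexicon


# Illustrative entries only.
example_lines = ["hypertension htn", "", "Sepsis septicemia septicaemia"]
example_lexicon = read_lexicon_lines(example_lines)
print(example_lexicon)
```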



In [None]:
#download umls synonyms config
#TODO add config into dataset
from google.colab import files
uploaded = files.upload()

Saving lexicon_UMLS_synonyms.txt to lexicon_UMLS_synonyms.txt


## Install dependencies

In [None]:
!pip install stop_words

## Run retrofit
Retrofit the word embeddings to generate new embeddings that take UMLS synonyms into account.
Complete the command below by consulting its [documentation](https://github.com/mfaruqui/retrofitting). The retrofitted embeddings need to be saved in a file named 'mimic_vectors_retrofitted.txt'.
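
As background for the exercise, here is a minimal sketch of the iterative retrofitting update in plain Python, assuming all weights are equal: each word that has lexicon neighbours is moved to the average of its original vector and its neighbours' current vectors. The function `retrofit` and the toy two-dimensional vectors are made up for illustration; `retrofit.py` may differ in details such as word normalization.

```python
def retrofit(vectors, lexicon, num_iters=10):
    """Retrofit `vectors` (dict: word -> list of floats) to `lexicon`
    (dict: word -> list of neighbour words), with all weights set to 1."""
    new_vectors = {w: list(v) for w, v in vectors.items()}
    for _ in range(num_iters):
        for word, neighbours in lexicon.items():
            if word not in vectors:
                continue
            neighbours = [n for n in neighbours if n in vectors]
            k = len(neighbours)
            if k == 0:
                continue  # no usable neighbours: keep the vector as it is
            # k * original vector + sum of the neighbours' current vectors
            total = [k * x for x in vectors[word]]
            for n in neighbours:
                total = [t + x for t, x in zip(total, new_vectors[n])]
            new_vectors[word] = [t / (2 * k) for t in total]
    return new_vectors


# Toy vectors: two synonyms that start far apart and one unrelated word.
toy_vectors = {"htn": [0.0, 0.0], "hypertension": [1.0, 1.0], "fever": [5.0, 5.0]}
toy_lexicon = {"htn": ["hypertension"], "hypertension": ["htn"]}
retrofitted = retrofit(toy_vectors, toy_lexicon, num_iters=10)
print(retrofitted)
```

In this toy case the two synonyms are pulled towards each other while staying anchored to their original positions, and the word without lexicon entries is left unchanged. More iterations bring the vectors closer to the fixed point of the update.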

In [None]:
!python2 retrofit.py # COMPLETE HERE

Vectors read from: mimic_vectors_training.100d.txt 

Writing down the vectors in mimic_vectors_retrofitted.txt


## Import libraries

In [None]:
#import python and pytorch libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import codecs
import os
import sys
import numpy as np
import logging
import tempfile
import shutil
import pickle
import platform
import json
from datetime import datetime
from nltk.corpus import stopwords
from stop_words import get_stop_words
from collections import defaultdict
import string
import random
# absolute_import and print_function are already the default in Python 3,
# and a `from __future__` import is only valid at the top of a cell, so none is needed here
from sklearn import metrics
from mimic_utils_text import InHospitalMortalityReader, Discretizer, Normalizer, read_chunk

## Visualize embeddings
Visualize the word2vec embeddings using [t-distributed stochastic neighbor embedding (t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). 
Also visualize the retrofitted embeddings and compare them with the word2vec ones. What do the word2vec and retrofitted embeddings look like? What differences do you see, and why?

In [None]:
# load the word2vec vectors (stored as GloVe-style text, without a header line) into gensim
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

glove_file = 'mimic_vectors_training.100d.txt'
tmp_file = get_tmpfile("glove_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)

In [None]:
def tsne_plot_similar_words(title, labels, embedding_clusters, word_clusters, a, filename=None):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
        x = embeddings[:, 0]
        y = embeddings[:, 1]
        plt.scatter(x, y, color=color, alpha=a, label=label)  # color= takes a single RGBA value for all points
        for i, word in enumerate(words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.title(title)
    plt.grid(True)
    if filename:
        plt.savefig(filename, format='png', dpi=150, bbox_inches='tight')
    plt.show()


#plot clusters in the same fig for word2vec
keys = ['diabetes', 'sepsis', 'pneumonia']

embedding_clusters = []
word_clusters = []
for word in keys:
    embeddings = []
    words = []
    for similar_word, _ in model.most_similar(word, topn=20):
        words.append(similar_word)
        embeddings.append(model[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)

embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2) 

tsne_plot_similar_words('Similar words from word2vec', keys, embeddings_en_2d, word_clusters, 0.7,
                        'similar_words_word2vec.png')

# Your code here
# visualize retrofitted embeddings

## Pytorch Dataset

We define a vocabulary, a dataset, a collate function, and functions that create batches.


In [None]:
# vocabulary class to upload word2vec into pytorch
# default tokens
UNK_TOKEN = "<unk>"
PAD_TOKEN = "<pad>"
SOS_TOKEN = "<s>"
EOS_TOKEN = "</s>"


class Vocabulary:
    """
        Creates a vocabulary from a word2vec file. 
    """
    def __init__(self):
        self.idx_to_word = {0: PAD_TOKEN, 1: UNK_TOKEN, 2: SOS_TOKEN, 3: EOS_TOKEN}
        self.word_to_idx = {PAD_TOKEN: 0, UNK_TOKEN: 1, SOS_TOKEN: 2, EOS_TOKEN: 3}
        self.word_freqs = {}
       
    
    def __getitem__(self, key):
        return self.word_to_idx[key] if key in self.word_to_idx else self.word_to_idx[UNK_TOKEN]
    
    def word(self, idx):
        return self.idx_to_word[idx]
    
    def size(self):
        return len(self.word_to_idx)
    
    
    def from_data(input_file, vocab_size, emb_size):
      
        vocab = Vocabulary()
        vocab_size = vocab_size + len(vocab.idx_to_word)
        weight = np.zeros((vocab_size, emb_size))
        with codecs.open(input_file, 'rb')  as f:
         
          for l in f:
            line = l.decode().split()
            token = line[0]
            if token not in vocab.word_to_idx:
              idx = len(vocab.word_to_idx)
              vocab.word_to_idx[token] = idx
              vocab.idx_to_word[idx] = token
            
              vect = np.array(line[1:]).astype(np.float64)  # np.float was removed from recent NumPy versions
              weight[idx] = vect
          # average embedding for unk word
          avg_embedding = np.mean(weight, axis=0)
          weight[1] = avg_embedding
                            
        return vocab, weight

In [None]:

# PyTorch class for reading data into batches
class MIMICTextDataset(Dataset):
    """
       Loads the time series, clinical notes, and labels of all patients
       from a reader into memory.
    """
    def __init__(self, reader, discretizer, normalizer, 
            notes_output='sentence', max_w=25, max_s=500, max_d=500,
            target_repl=False, batch_labels=False):
        self.data = []
        self.y  = []
        self.max_w = max_w
        self.max_s = max_s
        self.max_d = max_d
        N = reader.get_number_of_examples()
        #if small_part:
        #    N = 1000
        ret = read_chunk(reader, N)
        data = ret["X"]
        notes_text = ret["text"]
        notes_info = ret["text_info"]
        ts = ret["t"]
        labels = ret["y"]
        names = ret["name"]
        data = [discretizer.transform(X, end=t)[0] for (X, t) in zip(data, ts)]
        if normalizer is not None:
            data = [normalizer.transform(X) for X in data]
        #self.x = np.array(data, dtype=np.float32)
        #self.T = self.data.shape[1]
        #if target_repl:
        #    self.y = self._extend_labels(self.y)
        #notes into list of sentences, docs, etc..
        self.notes = []
        tmp_data = []
        tmp_labels = []
        if notes_output == 'sentence':
            # [N, W] patients, words
            for patient_notes, _x, l  in zip(notes_text, data, labels):
                tmp_notes = []
                for doc in sorted(patient_notes):
                    sentences = patient_notes[doc]
                    for sentence in sentences:
                        #print(sentence)
                        tmp_notes.extend(sentence)
                if len(tmp_notes) > 0 and len(tmp_notes) <= self.max_w:
                    #print(tmp_notes)
                    self.notes.append(' '.join(tmp_notes))
                    #self.notes.append(tmp_notes)
                    tmp_data.append(_x)
                    tmp_labels.append(l)
                #elif len(tmp_notes) > 0:
                #    self.notes.append(' '.join(tmp_notes[:self.max_w]))
                #    tmp_data.append(_x)
        elif notes_output == 'sentence-max':
             # [N, W] patients, words
             # [N, W] patients, words
            for patient_notes, _x, l  in zip(notes_text, data, labels):
                tmp_notes = []
                for doc in sorted(patient_notes):
                    sentences = patient_notes[doc]
                    for sentence in sentences:
                        #print(sentence)
                        tmp_notes.extend(sentence)
                if len(tmp_notes) > 0 and len(tmp_notes) <= self.max_w:
                    #print(tmp_notes)
                    self.notes.append(' '.join(tmp_notes))
                    #self.notes.append(tmp_notes)
                    tmp_data.append(_x)
                    tmp_labels.append(l)
                elif len(tmp_notes) > 0:
                    self.notes.append(' '.join(tmp_notes[:self.max_w]))
                    tmp_data.append(_x)
                    tmp_labels.append(l)

        elif notes_output == 'doc':
            # [N, S, W] patients, sentences, words
            # TODO add max size!
            for patient_notes,  _x, l in zip(notes_text, data, labels):
                tmp_notes = []
                for doc in sorted(patient_notes):
                    sentences = patient_notes[doc]
                    for sentence in sentences:
                        if len(sentence) > 0 and len(sentence) <= max_w:
                            tmp_notes.append(sentence)
                        elif len(sentence) > 0:
                            tmp_notes.append(sentence[:max_w])
                if len(tmp_notes) > 0 and len(tmp_notes) <= max_s:
                    self.notes.append(tmp_notes)
                    tmp_data.append(_x)
                    tmp_labels.append(l)
                elif len(tmp_notes) > 0:
                    self.notes.append(tmp_notes[:max_s])
                    tmp_data.append(_x)
                    tmp_labels.append(l)
        
#
        self.x = np.array(tmp_data, dtype=np.float32)   
        self.T = self.x.shape[1]
        if batch_labels:
            self.y = np.array([[l] for l in tmp_labels], dtype=np.float32)
        else:
            self.y = np.array(tmp_labels, dtype=np.float32)


    def _extend_labels(self, labels):
        # (B,)
        labels = labels.repeat(self.T, axis=1)  # (B, T)
        return labels

    def __len__(self):
        # override len to get number of instances
        return len(self.x)

    def __getitem__(self, idx):
        # get words and label for a given instance index
        return self.x[idx], self.notes[idx], self.y[idx]


In [None]:
#collate and create batch for documents and sentences 


def create_sentence_batch(sentences, vocab, device, stopwords=False):
    """
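    Converts a list of sentences to a padded batch of word ids. Returns
    an input batch, a sequence mask over the input batch, and a tensor
    containing the sequence length of each batch element.
    :param sentences: a list of sentences, each a whitespace-separated string
    :param vocab: a Vocabulary object for this dataset
    :param device: the torch device to move the tensors to
    :param stopwords: if True, tokenize with _remove_stopwords (a helper that
        must be defined before use)
    :returns: a batch of padded inputs,  mask, lengths
    """
    # Keep the tokens as plain Python lists: sentences differ in length, and
    # recent NumPy versions refuse to build an array from ragged lists.
    if stopwords:
        tok = [_remove_stopwords(sen) for sen in sentences]
    else:
        tok = [sen.split() for sen in sentences]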
    #tok = np.array([sen[0] for sen in sentences])
    seq_lengths = [len(sen) for sen in tok]
    max_len = max(seq_lengths)
    pad_id = vocab[PAD_TOKEN]
    unk_id = vocab[UNK_TOKEN]

    pad_id_input = []
    #pad and find ids for words given the word2vec vocab
    #print(tok)
    for idx, sen in enumerate(tok):
      tmp_sent = []
      for t in range(max_len):
        if t < seq_lengths[idx]:
          try:
            token_id = vocab[sen[t]]
          except KeyError:
            token_id = unk_id
        else:
          token_id = pad_id
        tmp_sent.append(token_id)
      pad_id_input.append(tmp_sent) 

    
    # Convert everything to PyTorch tensors.
    batch_input = torch.tensor(pad_id_input)
    seq_mask = (batch_input != vocab[PAD_TOKEN])
    seq_length = torch.tensor(seq_lengths)
    
    # Move all tensors to the given device.
    batch_input = batch_input.to(device)
    seq_mask = seq_mask.to(device)
    seq_length = seq_length.to(device)
    
    return batch_input, seq_mask, seq_length


def doc_collate(batch):
    data = np.array([item[0] for item in batch])
    data = torch.tensor(data)
    notes = [item[1] for item in batch]
    target = np.array([item[2] for item in batch])
    target = torch.tensor(target)
    #target = torch.LongTensor(target)
    return [data, notes, target]


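def create_doc_batch(docs, vocab, device):
    """
    Converts a list of documents (each a list of sentences, each a list of
    tokens) to a padded batch of word ids with shape [B, S, W]. Returns the
    input batch, the number of sentences per document, and the number of
    words per sentence.
    """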
    sent_seq_lengths = np.array([len(doc) for doc in docs])
    word_seq_lengths = [[len(sent) for sent in doc] for doc in docs]
    b = len(docs)
    sent_max_len = max(sent_seq_lengths)
    word_max_len = max([max(w_seq) for w_seq in word_seq_lengths])
    pad_id = vocab[PAD_TOKEN]
    unk_id = vocab[UNK_TOKEN]

    pad_id_input = np.zeros((b, sent_max_len, word_max_len), dtype=int)
    word_seq_length = np.ones((b, sent_max_len), dtype=np.float32)
    for i, w_lens in enumerate(word_seq_lengths):
        for j, w_len in enumerate(w_lens):
            word_seq_length[i][j] = w_len
    #pad and find ids for words given the word2vec vocab
    #print(tok)
    for idx_doc, doc in enumerate(docs):
        #tmp_doc = []
        for i in range(sent_max_len):
            tmp_sent = []
            if i < sent_seq_lengths[idx_doc]:
                sent = doc[i]
                for j in range(word_max_len):
                    if j < word_seq_lengths[idx_doc][i]:
                        try:
                            token_id = vocab[sent[j]]
                        except KeyError:
                            token_id = unk_id
                    else:
                        token_id = pad_id
                    #tmp_sent.append(token_id)
                    pad_id_input[idx_doc][i][j] = token_id
            else:
                #tmp_sent = [pad_id for _ in range(word_max_len[idx_doc])] 
                for j in range(word_max_len):
                    pad_id_input[idx_doc][i][j] = pad_id
        #pad_id_input.append(tmp_sent) 

    # Convert everything to PyTorch tensors.
    batch_input = torch.tensor(pad_id_input)
    #seq_mask = (batch_input != vocab[PAD_TOKEN])
    sent_seq_length = torch.tensor(sent_seq_lengths)
    word_seq_length = torch.tensor(word_seq_length)
    
    # Move all tensors to the given device.
    batch_input = batch_input.to(device)
    #seq_mask = seq_mask.to(device)
    sent_seq_length = sent_seq_length.to(device)
    word_seq_length = word_seq_length.to(device)
    
    return batch_input, sent_seq_length, word_seq_length

## Evaluation metrics
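
The per-class precision and recall reported by the evaluation function are all derived from a 2x2 confusion matrix, which scikit-learn lays out with the true classes as rows and the predicted classes as columns. The sketch below recomputes them in plain Python on made-up counts; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(cf):
    """Compute metrics from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (rows: true class, columns: predicted class)."""
    (tn, fp), (fn, tp) = cf
    return {
        "acc": (tn + tp) / (tn + fp + fn + tp),
        "prec1": tp / (tp + fp),  # of the predicted deaths, how many died
        "rec1": tp / (tp + fn),   # of the actual deaths, how many were found
        "prec0": tn / (tn + fn),
        "rec0": tn / (tn + fp),
    }


# Made-up counts for illustration: 90 survivors and 10 deaths.
example_metrics = binary_metrics([[85, 5], [6, 4]])
print(example_metrics)
```

With imbalanced classes, as in this made-up example, accuracy can look high while recall on the positive class stays low, which is why the threshold-free AUC values are also reported.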

In [None]:
# eval metrics
def print_metrics_binary(y_true, predictions, logging, verbose=1):
    predictions = np.array(predictions)
    if len(predictions.shape) == 1:
        predictions = np.stack([1 - predictions, predictions]).transpose((1, 0))
    cf = metrics.confusion_matrix(y_true, predictions.argmax(axis=1))
    if verbose:
        logging.info("confusion matrix:")
        logging.info(cf)
    cf = cf.astype(np.float32)

    acc = (cf[0][0] + cf[1][1]) / np.sum(cf)
    prec0 = cf[0][0] / (cf[0][0] + cf[1][0])
    prec1 = cf[1][1] / (cf[1][1] + cf[0][1])
    rec0 = cf[0][0] / (cf[0][0] + cf[0][1])
    rec1 = cf[1][1] / (cf[1][1] + cf[1][0])
    auroc = metrics.roc_auc_score(y_true, predictions[:, 1])

    (precisions, recalls, thresholds) = metrics.precision_recall_curve(y_true, predictions[:, 1])
    auprc = metrics.auc(recalls, precisions)
    minpse = np.max([min(x, y) for (x, y) in zip(precisions, recalls)])

    if verbose:
        logging.info("accuracy = {0:.3f}".format(acc))
        logging.info("precision class 0 = {0:.3f}".format(prec0))
        logging.info("precision class 1 = {0:.3f}".format(prec1))
        logging.info("recall class 0 = {0:.3f}".format(rec0))
        logging.info("recall class 1 = {0:.3f}".format(rec1))
        logging.info("AUC of ROC = {0:.3f}".format(auroc))
        logging.info("AUC of PRC = {0:.3f}".format(auprc))
       

    return {"acc": acc,
            "prec0": prec0,
            "prec1": prec1,
            "rec0": rec0,
            "rec1": rec1,
            "auroc": auroc,
            "auprc": auprc}

## Training

Here we define the model, loss and training loop.
Train the model first with the word2vec embeddings and then with the retrofitted embeddings. Do you see any difference? (Also evaluate each model once it is trained; see below.)

In [None]:
#model Bow with a single sequence for a patient

class BoWText(nn.Module):

    def __init__(self, vocab_size, label_size, emb_size, hidden_size, dropout=0.2, model_w2vec=None, bidirectional=False):

        super().__init__()
        self.bidirectional = bidirectional
 #       self.pad_idx = pad_idx

        weights = torch.FloatTensor(model_w2vec)
        self.embedder = nn.Embedding.from_pretrained(weights, freeze=False) 
        #two experiments: freeze True/False (maybe it is good to update because Glove is general and our data is domain specific)

        self.combination_layer = nn.Linear(emb_size, hidden_size)

        self.projection = nn.Linear(hidden_size, label_size)
        self.relu = nn.ReLU()
        self.dropout_layer = nn.Dropout(p=dropout)

    def forward(self, x, seq_mask, seq_len): #x=notes

        # Compute word embeddings
        # [B,M,E] B=patient, M=sentence, E=emb
#        print(x.size())
        x_embed = self.embedder(x)
        #x_embed = self.dropout_layer(x_embed) #each word in the sentence is turned on/off for regulatization
        # [B, M, hid_size]
        h = self.combination_layer(x_embed)
        h = self.relu(h)
        h = self.dropout_layer(h)
        # masked mean over the words: zero out the padded positions so that
        # they do not contribute to the sum, then divide by the true lengths
        h = h * seq_mask.unsqueeze(-1).float()
        h = torch.sum(h, dim=1) / seq_len.unsqueeze(-1) # --> [B, hidden_size]

        logits = self.projection(h) #size [B,1] each patient has a logit mortality
        
        return logits

In [None]:
#eval model
def eval_model(model, dataset, device, vocab):
    model.eval()
    sigmoid = nn.Sigmoid()
    with torch.no_grad():
        y_true = []
        predictions = []
        for _, notes, labels  in dataset:
            labels = labels.to(device)
            x_notes, seq_mask, seq_len = create_sentence_batch(notes, 
                    vocab, 
                    device, 
                    stopwords=False)

            logits =  model(x_notes, seq_mask, seq_len)
            probs = sigmoid(logits)
            #_, predicted = torch.max(probs.data, 1)
            #y_hat_class = np.where(probs.data<0.5, 0, 1)
            predictions += [p.item() for p in probs]#y_hat_class.squeeze()
            y_true += [y.item() for y in labels]
    #print(predictions)
    #print(y_true)
    results = print_metrics_binary(y_true, predictions, logging)
    return results, predictions, y_true

In [None]:
def train(args):
  mode = 'train'
  hidden_size = args['dim']
  dropout = args['dropout']
  batch_size = args['batch_size']
  learning_rate = args['lr']
  num_epochs = args['epochs']
  emb_size = args['emb_size']
  aggregation_type = args['aggregation_type']
  bidirectional_encoder = args['bidirectional'] # TODO add into args
  seed = args['seed']
  steps = args['steps']
  data = args['data']
  notes = args['notes']
  word2vec = args['word2vec']
  max_w = args['max_w']
  timestep = args['timestep']
  normalizer_state = args['normalizer_state']
  vocab_size = args['vocab_size']
  if seed:
      torch.manual_seed(seed)
      np.random.seed(seed)
  device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")   
  
  logging.basicConfig(level=logging.INFO, 
          format='%(asctime)s %(message)s', 
          datefmt='%Y-%m-%d %H:%M:%S',
          )
  
  vocab, weight = Vocabulary.from_data(word2vec, vocab_size, emb_size) #glove 107647 400000, 300)
    #print(vocab)
    #rint(vocab["<unk>"])

  train_reader = InHospitalMortalityReader(dataset_dir=os.path.join(data, 'train'),
                                        notes_dir=notes,  
                                        listfile=os.path.join(data, 'train_listfile.csv'),
                                         period_length=48.0)

  val_reader = InHospitalMortalityReader(dataset_dir=os.path.join(data, 'train'),
                                       notes_dir=notes,
                                       listfile=os.path.join(data, 'val_listfile.csv'),
                                       period_length=48.0)

  discretizer = Discretizer(timestep=float(timestep),
                          store_masks=True,
                          impute_strategy='previous',
                          start_time='zero')
  discretizer_header = discretizer.transform(train_reader.read_example(0)["X"])[1].split(',')
  cont_channels = [i for (i, x) in enumerate(discretizer_header) if x.find("->") == -1]

  normalizer = Normalizer(fields=cont_channels)  # choose here which columns to standardize
  normalizer_state = normalizer_state
  if normalizer_state is None:
      normalizer_state = 'norm_start_time_zero.normalizer'
      #normalizer_state = 'ihm_ts{}.input_str:{}.start_time:zero.normalizer'.format(args['timestep'], args['imputation'])
      #normalizer_state = os.path.join(os.path.dirname(os.path.abspath("__file__")), normalizer_state)
  #print(normalizer_state)
  normalizer.load_params(normalizer_state)

  # 'sentence-max' option: process each patient's notes into a single sequence, truncated to max_w words
  train_dataset = MIMICTextDataset(train_reader, 
                discretizer, 
                normalizer, 
                batch_labels=True,
                max_w=max_w,
                notes_output='sentence-max')
  train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
  
  val_dataset = MIMICTextDataset(val_reader, 
            discretizer, 
            normalizer, 
            batch_labels=True,
            max_w=max_w,
            notes_output='sentence-max')
  val_dl = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
  
  # Define the classification model.
  model = BoWText(vocab_size=vocab.size(),
                        label_size=1, #label size = 1 because of the binary nature of the predictions(classification)
                        emb_size=emb_size, 
                        hidden_size=hidden_size,
                        dropout=dropout,
                        model_w2vec=weight)

  model = model.to(device)
  logging.info(args)
  logging.info(model)

  # Define optimizer
  optimizer = Adam(model.parameters(), lr=learning_rate) 
   
  criterion = nn.BCEWithLogitsLoss()

  # path to best model save on disk
  best_model = 'best_model.pt'
  best_val_auc = 0.

  results = []

  step = 0
  num_batches = 0
  #training loop
  # loop over the epochs
  for epoch_num in range(1, num_epochs+1): 
      loss_batch = .0
      num_batches = 0
      # loop over mini-batches
      for _, notes, labels  in train_dl:
          # x = x.to(device) structure data
          labels = labels.to(device)
          # to tensor
          x_notes, seq_mask, seq_len = create_sentence_batch(notes, 
                    vocab, 
                    device, 
                    stopwords=False)
          # Model is in training mode (for dropout).
          model.train()
          optimizer.zero_grad()
       
          # run forward
          logits =  model(x_notes, seq_mask, seq_len)
          
          loss = criterion(logits, labels)
            
          loss_batch += loss.item()
          # Backpropagate and update the model weights.
          loss.backward()
          optimizer.step()
        
          num_batches += 1
        
          # Every `steps` steps we report the running training loss.
          if step % steps == 0:
              logging.info("epoch (%d) step %d: training loss = %.2f"% 
                 (epoch_num, step, loss_batch/num_batches))
            
            
          step += 1
        
        
      metrics_results, _, _ = eval_model(model,
                                    val_dl,
                                    device,
                                    vocab)
      metrics_results['epoch'] = epoch_num
      results.append(metrics_results)
      if metrics_results['auroc'] > best_val_auc:
        best_val_auc = metrics_results['auroc']
        # save best model in disk
        torch.save(model.state_dict(), best_model)
        logging.info('best model AUC of ROC = %.3f'%(best_val_auc))
      logging.info("Finished epoch %d" % (epoch_num))
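
The training loop uses `nn.BCEWithLogitsLoss`, which takes raw logits rather than probabilities. To show why working on logits helps, the sketch below writes the per-example binary cross-entropy in a numerically stable form and compares it with the naive version that applies a sigmoid first. It is plain Python for illustration; the function names are hypothetical and PyTorch's own implementation may differ in detail.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in a numerically stable form:
    max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_naive(logit, target):
    """Same loss via an explicit sigmoid; fine for moderate logits only."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


print(bce_with_logits(0.0, 1.0))     # log(2)
print(bce_with_logits(1000.0, 0.0))  # stays finite
```

For a very confident but wrong prediction, the naive form would have to take the logarithm of a probability that rounds to zero, whereas the stable form stays finite.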


In [None]:
#execute training 
#define hyperparameters
# Change the parameter here to use the retrofitted embedding

args = {'dim':128,
        'dropout':0.2,
        'batch_size':64,
        'lr':1e-3, # for word embeddings in a clinical task, 1e-4 may be better to avoid forgetting
        'epochs':20,
        'emb_size':100,
        'aggregation_type':'mean',
        'bidirectional':False,
        'seed':42,
        'steps':50,
        'data':'test_text_data_2/in-hospital-mortality',
        'notes': 'test_text_data_2/train',
        'word2vec': 'mimic_vectors_training.100d.txt',
        'max_w': 10000,
        'timestep':1.0,
        'imputation':'previous',
        'normalizer_state':None,
        'vocab_size': 130212} # number of lines in the embedding file (word2vec parameter above)
train(args)

## Test

Here we take the model with the best validation AUC and run it on the test set. Run the prediction model with both the word2vec and the retrofitted embeddings. Do the results improve? Compare the AUC and the calibration curves (before and after retrofitting).


In [None]:
#test

def test(args):
    # read the hyperparameters and define the test dataset
    mode = 'test'
    hidden_size = args['dim']
    dropout = args['dropout']
    batch_size = args['batch_size']
    emb_size = args['emb_size']
    best_model = args['best_model']
    data = args['data']
    notes = args['notes']
    word2vec = args['word2vec']
    max_w = args['max_w']
    timestep = args['timestep']
    aggregation_type = args['aggregation_type']
    bidirectional_encoder = args['bidirectional'] # TODO add into args
    vocab_size = args['vocab_size']
    device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")   
    # 1. Get a unique working directory 
    
    logging.basicConfig(level=logging.INFO, 
            format='%(asctime)s %(message)s', 
            datefmt='%Y-%m-%d %H:%M:%S')
    
    vocab, weight = Vocabulary.from_data(word2vec, vocab_size, emb_size) #glove 107647 400000, 300)

    test_reader = InHospitalMortalityReader(dataset_dir=os.path.join(data, 'test'),
                                         listfile=os.path.join(data, 'test_listfile.csv'),
                                         notes_dir=notes, 
                                         period_length=48.0)

    
    discretizer = Discretizer(timestep=float(timestep),
                          store_masks=True,
                          impute_strategy='previous',
                          start_time='zero')

    discretizer_header = discretizer.transform(test_reader.read_example(0)["X"])[1].split(',')
    cont_channels = [i for (i, x) in enumerate(discretizer_header) if x.find("->") == -1]

    normalizer = Normalizer(fields=cont_channels)  # choose here which columns to standardize
    normalizer_state = args['normalizer_state']
    if normalizer_state is None:
        #normalizer_state = 'ihm_ts{}.input_str:{}.start_time:zero.normalizer'.format(args.timestep, args.imputation)
        #normalizer_state = os.path.join(os.path.dirname(__file__), normalizer_state)
        normalizer_state = 'norm_start_time_zero.normalizer'
    normalizer.load_params(normalizer_state)

    # Read data
    vocab, weight = Vocabulary.from_data(word2vec, vocab_size, emb_size) #glove 107647 , 300)
    # 'sentence-max' option: process each patient's notes into a single sequence, truncated to max_w words
    test_dataset = MIMICTextDataset(test_reader, 
            discretizer, 
            normalizer, 
            batch_labels=True,
            max_w=max_w,
            notes_output='sentence-max')
 
    test_dl =  DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


    # Define the classification model.
    model = BoWText(vocab_size=vocab.size(),
                        label_size=1, #label size = 1 because of the binary nature of the predictions(classification)
                        emb_size=emb_size, 
                        hidden_size=hidden_size,
                        dropout=dropout,
                        model_w2vec=weight)

    # map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
    model.load_state_dict(torch.load(best_model, map_location=device))
    logging.info(model)
    model = model.to(device)

    metrics_results, pred_probs, y_true = eval_model(model,
                                test_dl,
                                device,
                                vocab)
    return metrics_results, pred_probs, y_true
            

In [None]:
# Run test on best validation model
# Change the parameter here to use the retrofitted embedding
args = {'best_model':'best_model.pt',
        'dim':128,
        'dropout':0.2,
        'batch_size':16,
        'word2vec':'mimic_vectors_training.100d.txt',
        'emb_size':100,
        'aggregation_type':'mean',
        'bidirectional':False,
        'data':'test_text_data_2/in-hospital-mortality',
        'notes':'test_text_data_2/test',
        'timestep':1.0,
        'max_w':10000,
        'imputation':'previous',
        'normalizer_state':None,
        'vocab_size': 130212} # number of lines in the embedding file (word2vec parameter above)
metrics_results, pred_probs, y_true = test(args)

## Plots

ROC and calibration curves

In [None]:
# Figures ROC and calibration curve
import sys
import pickle
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import matplotlib.transforms as mtransforms
from sklearn import metrics
from sklearn.calibration import calibration_curve

In [None]:
# roc curve
bow_fpr, bow_tpr, bow_thresholds = metrics.roc_curve(y_true, pred_probs)

# plot the roc curve for the model
plt.figure()
plt.ylim(0., 1.0)
plt.xlim(0.,1.0)
plt.plot(bow_fpr, bow_tpr, marker='.', label='BoW', color='darkorange')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
plt.show()
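
When comparing the AUC of the two models, it can help to recall what the number means: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it by brute force on made-up data; `auc_by_pairs` is a hypothetical name, and the quadratic loop is only meant for small illustrations.

```python
def auc_by_pairs(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for illustration.
toy_labels = [0, 0, 1, 0, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8, 0.9]
toy_auc = auc_by_pairs(toy_labels, toy_probs)
print(toy_auc)
```

Because the AUC depends only on the ranking of the scores, two models can have the same AUC and still differ in calibration, which is why both plots are compared.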

In [None]:
#calibration curve
bow_y, bow_x = calibration_curve(y_true, pred_probs, n_bins=10)
plt.figure()
plt.ylim(0., 1.0)
plt.xlim(0.,1.0)
#fig, ax = plt.subplots()
# only these two lines are calibration curves
plt.plot(bow_x,bow_y, marker='^', linestyle="", markersize=7, label='BoW', color='darkorange')

plt.plot([0, 1], [0, 1], color='navy', linestyle='--')

plt.xlabel('Mean predicted value')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()

#plt.show()
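
To read the calibration plot, it helps to see how its points are obtained: predictions are grouped into bins, and for each bin the mean predicted probability (x axis) is compared with the observed fraction of positives (y axis). The sketch below is a simplified plain-Python version with equal-width bins on made-up data; `calibration_bins` is a hypothetical name, and its handling of bin edges may differ from scikit-learn's `calibration_curve`.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Return (fraction_of_positives, mean_predicted) per non-empty bin,
    using n_bins equal-width bins on [0, 1]."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # [sum_y, sum_p, count]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b][0] += y
        sums[b][1] += p
        sums[b][2] += 1
    frac_pos = [s[0] / s[2] for s in sums if s[2] > 0]
    mean_pred = [s[1] / s[2] for s in sums if s[2] > 0]
    return frac_pos, mean_pred


# Made-up labels and predicted probabilities for illustration.
cal_labels = [0, 0, 1, 0, 1, 1]
cal_probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
frac_pos, mean_pred = calibration_bins(cal_labels, cal_probs, n_bins=2)
print(frac_pos, mean_pred)
```

A well-calibrated model has points close to the diagonal: among patients given a predicted risk of about 0.7, roughly 70% should actually be positive.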
