# Med2Vec

This notebook provides functions to compute Med2Vec patient representations.

#### References

Multi-layer Representation Learning for Medical Concepts, E. Choi et al. (2016)

RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism, E. Choi et al. (2016)

Doctor AI: Predicting Clinical Events via Recurrent Neural Networks, E. Choi et al. (2016)

#### GitHub repository with the authors' code

https://github.com/mp2893/med2vec


#### Python environment
- Python 2.7
- Required packages: Theano, pickle, scikit-learn, SciPy, NumPy and pandas

#### Complementary Functions

1. An evaluation function: Evaluate_Med2Vec
        This function computes the loss of a trained model on a new sample (one that was not used to fit the model).

2. A training function: Train_Med2Vec
        This function trains the model and evaluates it on a held-out sample to detect overfitting:
        - Splits the sample into a training set (80%) and a held-out set (20%)
        - Trains the Med2Vec architecture on the training set
        - Evaluates the model on both sets by computing the loss on each
        - Returns the losses on the training and held-out sets

3. A grid search over hyperparameters: Gridsearch_Med2Vec
        This function tunes the following hyperparameters of the Med2Vec model:
        - embDimSize: dimension of the code representation (the intermediate latent space)
        - hiddenDimSize: dimension of the visit representation (the final latent space)
        - windowSize: size of the visit context window

        Each combination of hyperparameters given as input is used to train the model on a training set (80% of the sample).

        The optimal combination is the one that minimizes the loss on a validation set (20% of the sample).
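
The selection rule described in item 3 can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_validation_loss` is a hypothetical stand-in for "train Med2Vec on 80% of the sample, then compute the loss on the remaining 20%", and the candidate values are arbitrary.

```python
import itertools

def toy_validation_loss(embDimSize, hiddenDimSize, windowSize):
    # Hypothetical stand-in for training and evaluating the model.
    # It is NOT the Med2Vec loss; it only gives the search something to minimize.
    return abs(embDimSize - 100) + abs(hiddenDimSize - 50) + abs(windowSize - 2)

def grid_search(embDimSize_list, hiddenDimSize_list, windowSize_list, loss_fn):
    # Try every combination and keep the one with the smallest validation loss.
    results = []
    for emb, hid, win in itertools.product(embDimSize_list,
                                           hiddenDimSize_list,
                                           windowSize_list):
        results.append(((emb, hid, win), loss_fn(emb, hid, win)))
    best_params, best_loss = min(results, key=lambda r: r[1])
    return best_params, best_loss, results

best_params, best_loss, results = grid_search(
    [50, 100], [25, 50], [1, 2, 3], toy_validation_loss)
```

The number of models trained is the product of the three list lengths (here 2 x 2 x 3 = 12), so the cost of the search grows quickly with the size of each list.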

## Imports

Package of the authors: https://github.com/mp2893/med2vec

In [None]:
from med2vec import *

In [31]:
import os
import pickle

import numpy as np
import pandas as pd
import theano  # used explicitly by Evaluate_Med2Vec (theano.function)
from sklearn.model_selection import train_test_split

## Training

In [None]:
def Train_Med2Vec(File):
    '''
    Train Med2Vec on 80% of the sample and compute the loss on both parts of the split.

    Parameters
    ----------
    File : dict
        Dictionary with the following keys :
        - 'seqFile' (str) : The path to the pickled file containing visit information of patients
        - 'labelFile' (str) : The path to the pickled file containing grouped visit information of patients
        - 'outFile' (str) : The path to the output models (the split files are also written under this path)
        - 'demoFile' (str) : The path to the pickled file containing demographic information of patients
        - 'numXcodes' (int) : The number of unique input medical codes
        - 'numYcodes' (int) : The number of unique output medical codes (the number of unique grouped codes)
        - 'demoSize' (int) : The size of the demographic information vector
        - 'embDimSize' (int) : The size of the code representation
        - 'hiddenDimSize' (int) : The size of the visit representation
        - 'windowSize' (int) : The size of the visit context window (range: 1,2,3,4,5)
        - 'batchSize' (int) : The size of a single mini-batch
        - 'maxEpochs' (int) : The number of training epochs
        - 'logEps' (float) : A small value to prevent log(0)
        - 'L2_reg' (float) : L2 regularization for the code representation matrix W_c

    Returns
    -------
    loss_train : float
        The value of the loss on the training sample (80% of the data).
    loss_test : float
        The value of the loss on the held-out test sample (20% of the data).
    '''

    # Split train / test (80% / 20%); the split files are written under File['outFile']
    with open(File['seqFile'], 'rb') as f:
        seqs = np.array(pickle.load(f))
    if File['numYcodes'] > 0 :
        with open(File['labelFile'], 'rb') as f:
            labels = np.array(pickle.load(f))
        seqFile_train, seqFile_test, labelFile_train, labelFile_test = train_test_split(
            seqs, labels, test_size=0.2, random_state=42)
        to_save = {'seqFile_train': seqFile_train, 'seqFile_test': seqFile_test,
                   'labelFile_train': labelFile_train, 'labelFile_test': labelFile_test}
    else :
        seqFile_train, seqFile_test = train_test_split(seqs, test_size=0.2, random_state=42)
        to_save = {'seqFile_train': seqFile_train, 'seqFile_test': seqFile_test}
    for name, data in to_save.items():
        with open(File['outFile'] + '/' + name, 'wb') as f:
            pickle.dump(data, f, -1)

    # Train the model
    if File['numYcodes'] > 0 :
        train_med2vec(seqFile = File['outFile']+'/seqFile_train',
                    labelFile = File['outFile']+'/labelFile_train',
                    demoFile = File['demoFile'],
                    outFile = File['outFile'], 
                    numXcodes = File['numXcodes'],
                    numYcodes = File['numYcodes'],
                    demoSize = File['demoSize'],
                    embDimSize = File['embDimSize'], 
                    hiddenDimSize = File['hiddenDimSize'],
                    windowSize = File['windowSize'],
                    batchSize = File['batchSize'],
                    maxEpochs = File['maxEpochs'],
                    logEps = File['logEps'],
                    L2_reg = File['L2_reg'])
    else :
        train_med2vec(seqFile = File['outFile']+'/seqFile_train',
                    labelFile = File['labelFile'],
                    demoFile = File['demoFile'],
                    outFile = File['outFile'], 
                    numXcodes = File['numXcodes'],
                    numYcodes = File['numYcodes'],
                    demoSize = File['demoSize'],
                    embDimSize = File['embDimSize'], 
                    hiddenDimSize = File['hiddenDimSize'],
                    windowSize = File['windowSize'],
                    batchSize = File['batchSize'],
                    maxEpochs = File['maxEpochs'],
                    logEps = File['logEps'],
                    L2_reg = File['L2_reg'])
    
    # Evaluate
    Filetrain = File.copy()
    Filetest = File.copy()
    Filetrain['seqFile'] = File['outFile']+'/seqFile_train'
    Filetest['seqFile'] = File['outFile']+'/seqFile_test'
    if File['numYcodes'] > 0 :
        Filetrain['labelFile'] = File['outFile']+'/labelFile_train'
        Filetest['labelFile'] = File['outFile']+'/labelFile_test'
        
    loss_train = Evaluate_Med2Vec(Filetrain, File['outFile']+ '.' +str(File['maxEpochs']-1) + '.npz', Filetrain) 
    loss_test = Evaluate_Med2Vec(Filetest, File['outFile']+ '.' +str(File['maxEpochs']-1) + '.npz', Filetest) 

    return loss_train, loss_test
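
A minimal sketch of how the `File` dictionary expected by `Train_Med2Vec` might be assembled and checked before training. Every path and numeric value is a hypothetical placeholder, and `missing_keys` is an illustrative helper that is not part of the notebook or of med2vec.

```python
REQUIRED_KEYS = ['seqFile', 'labelFile', 'outFile', 'demoFile',
                 'numXcodes', 'numYcodes', 'demoSize',
                 'embDimSize', 'hiddenDimSize', 'windowSize',
                 'batchSize', 'maxEpochs', 'logEps', 'L2_reg']

def missing_keys(File):
    # Return the required keys that are absent from the dictionary.
    return [key for key in REQUIRED_KEYS if key not in File]

# Hypothetical configuration: all paths and values below are placeholders.
File = {'seqFile': 'data/seqs.pkl',
        'labelFile': 'data/labels.pkl',
        'demoFile': '',
        # Train_Med2Vec writes the split files under outFile + '/', so this
        # directory is assumed to exist before the call.
        'outFile': 'output/med2vec',
        'numXcodes': 1000,
        'numYcodes': 100,
        'demoSize': 0,
        'embDimSize': 200,
        'hiddenDimSize': 200,
        'windowSize': 1,
        'batchSize': 1000,
        'maxEpochs': 10,
        'logEps': 1e-8,
        'L2_reg': 0.001}

print(missing_keys(File))
# loss_train, loss_test = Train_Med2Vec(File)  # needs the data files and Theano
```

Checking the keys up front avoids a `KeyError` after the split files have already been written.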

## Evaluate the model

In [114]:
def Evaluate_Med2Vec(File, model, options) : 
        '''
        Compute the loss of a trained Med2Vec model on a given sample.

        Parameters
        ----------
        File : dict
                Dictionary with the following keys :
                - 'seqFile' (str) : The path to the pickled file containing visit information of patients
                - 'labelFile' (str) : The path to the pickled file containing grouped visit information of patients
                - 'outFile' (str) : The path to the output models
                - 'demoFile' (str) : The path to the pickled file containing demographic information of patients
                - 'numXcodes' (int) : The number of unique input medical codes
                - 'numYcodes' (int) : The number of unique output medical codes (the number of unique grouped codes)
                - 'demoSize' (int) : The size of the demographic information vector
                - 'batchSize' (int) : The size of a single mini-batch
                - 'maxEpochs' (int) : The number of training epochs
                - 'logEps' (float) : A small value to prevent log(0)
                - 'L2_reg' (float) : L2 regularization for the code representation matrix W_c
        model : str
                Path to the saved trained model (.npz file).
        options : dict
                Dictionary of hyperparameters containing the following keys :
                - 'embDimSize' (int) : The size of the code representation
                - 'hiddenDimSize' (int) : The size of the visit representation
                - 'windowSize' (int) : The size of the visit context window (range: 1,2,3,4,5)

        Returns
        -------
        loss_value : float
                The value of the loss on the sample.
        '''

        # Model
        model_param = np.load(model)
        tparams = {'W_emb': model_param['W_emb'],
                'b_emb': model_param['b_emb'],
                'W_hidden': model_param['W_hidden'],
                'b_hidden': model_param['b_hidden'],
                'W_output': model_param['W_output'],
                'b_output': model_param['b_output']}

        # Data
        seqs, demos, labels = load_data(File['seqFile'], File['demoFile'], File['labelFile'])
        
        if File['numYcodes'] > 0 :
                x_data, y_data, mask_data, iVector_data, jVector_data = padMatrix(seqs, labels, File)
        else :
                x_data, mask_data, iVector_data, jVector_data = padMatrix(seqs, labels, File)

        new_options = {'numXcodes': File['numXcodes'],
                'numYcodes': File['numYcodes'],
                'demoSize': File['demoSize'],
                'embDimSize': options['embDimSize'],
                'hiddenDimSize': options['hiddenDimSize'],
                'windowSize': options['windowSize'],
                'logEps': File['logEps'],
                'L2_reg': File['L2_reg']}
        
        # Compute loss
        if File['demoSize'] > 0 :
                if File['numYcodes'] > 0:
                        x, d, y, mask, iVector, jVector, total_cost = build_model(tparams, new_options)
                        eval_cost = theano.function(inputs=[x, d, y, mask, iVector, jVector],  outputs=total_cost)
                        loss_value = eval_cost(x_data, demos, y_data, mask_data, iVector_data, jVector_data)
                else :
                        x, d, mask, iVector, jVector, total_cost = build_model(tparams, new_options)
                        eval_cost = theano.function(inputs=[x, d, mask, iVector, jVector],  outputs=total_cost)
                        loss_value = eval_cost(x_data, demos, mask_data, iVector_data, jVector_data)
        else :
                if File['numYcodes'] > 0:
                        x, y, mask, iVector, jVector, total_cost = build_model(tparams, new_options)
                        eval_cost = theano.function(inputs=[x, y, mask, iVector, jVector],  outputs=total_cost)
                        loss_value = eval_cost(x_data, y_data, mask_data, iVector_data, jVector_data)
                else :
                        x, mask, iVector, jVector, total_cost = build_model(tparams, new_options)
                        eval_cost = theano.function(inputs=[x, mask, iVector, jVector],  outputs=total_cost)
                        loss_value = eval_cost(x_data, mask_data, iVector_data, jVector_data)
        
        return loss_value
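
Both `Train_Med2Vec` and `Gridsearch_Med2Vec` build the `model` argument as `File['outFile'] + '.' + str(File['maxEpochs']-1) + '.npz'`. The hypothetical helper below (not part of the notebook) makes that naming rule explicit. It assumes, as the calling code does, that epochs are numbered from 0 and that a parameter file exists for the last epoch.

```python
def last_epoch_model_path(outFile, maxEpochs):
    # Same string the notebook passes to Evaluate_Med2Vec as `model`.
    return outFile + '.' + str(maxEpochs - 1) + '.npz'

print(last_epoch_model_path('output/med2vec', 10))  # output/med2vec.9.npz
```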

## Gridsearch of the hyperparameters

In [115]:
def Gridsearch_Med2Vec(File, embDimSize_list, hiddenDimSize_list, window_size_list) : 
    '''
    Train one model per combination of hyperparameters and record its loss on a validation set.

    Parameters
    ----------
    File : dict
        Dictionary with the following keys :
        - 'seqFile' (str) : The path to the pickled file containing visit information of patients
        - 'labelFile' (str) : The path to the pickled file containing grouped visit information of patients
        - 'outFile' (str) : The path to the output models (the split files are also written under this path)
        - 'demoFile' (str) : The path to the pickled file containing demographic information of patients
        - 'numXcodes' (int) : The number of unique input medical codes
        - 'numYcodes' (int) : The number of unique output medical codes (the number of unique grouped codes)
        - 'demoSize' (int) : The size of the demographic information vector
        - 'batchSize' (int) : The size of a single mini-batch
        - 'maxEpochs' (int) : The number of training epochs
        - 'logEps' (float) : A small value to prevent log(0)
        - 'L2_reg' (float) : L2 regularization for the code representation matrix W_c
    embDimSize_list : list
        List of the embedding sizes (for the code representation) to be tested.
    hiddenDimSize_list : list
        List of the hidden sizes (for the visit representation) to be tested.
    window_size_list : list
        List of the window sizes to be tested.

    Returns
    -------
    df_hyperparameters : DataFrame
        One row per combination of hyperparameters tested, with the associated loss on the validation set stored in the 'Loss_test' column.
    '''

    df_hyperparameters = pd.DataFrame(columns=['embDimSize', 'hiddenDimSize', 'windowSize', 'Loss_test'])

    i=0
    
    # Train / Test Split
    if len(File['labelFile']) > 0 :
        if len(File['demoFile']) > 0 :
            seqFile_train, seqFile_test, labelFile_train, labelFile_test, demoFile_train, demoFile_test = train_test_split(np.array(pickle.load(open(File['seqFile'], 'rb'))), np.array(pickle.load(open(File['labelFile'], 'rb'))), np.array(pickle.load(open(File['demoFile'], 'rb'))), test_size=0.2, random_state=42)
            pickle.dump(seqFile_train, open(File['outFile']+'/seqFile_train_gridsearch', 'wb'), -1)
            pickle.dump(seqFile_test, open(File['outFile']+'/seqFile_test_gridsearch', 'wb'), -1)
            pickle.dump(labelFile_train, open(File['outFile']+'/labelFile_train_gridsearch', 'wb'), -1)
            pickle.dump(labelFile_test, open(File['outFile']+'/labelFile_test_gridsearch', 'wb'), -1)
            pickle.dump(demoFile_train, open(File['outFile']+'/demoFile_train_gridsearch', 'wb'), -1)
            pickle.dump(demoFile_test, open(File['outFile']+'/demoFile_test_gridsearch', 'wb'), -1)
        else :
            seqFile_train, seqFile_test, labelFile_train, labelFile_test = train_test_split(np.array(pickle.load(open(File['seqFile'], 'rb'))), np.array(pickle.load(open(File['labelFile'], 'rb'))), test_size=0.2, random_state=42)
            pickle.dump(seqFile_train, open(File['outFile']+'/seqFile_train_gridsearch', 'wb'), -1)
            pickle.dump(seqFile_test, open(File['outFile']+'/seqFile_test_gridsearch', 'wb'), -1)
            pickle.dump(labelFile_train, open(File['outFile']+'/labelFile_train_gridsearch', 'wb'), -1)
            pickle.dump(labelFile_test, open(File['outFile']+'/labelFile_test_gridsearch', 'wb'), -1)
    else :
        seqFile_train, seqFile_test = train_test_split(np.array(pickle.load(open(File['seqFile'], 'rb'))), test_size=0.2, random_state=42)
        pickle.dump(seqFile_train, open(File['outFile']+'/seqFile_train_gridsearch', 'wb'), -1)
        pickle.dump(seqFile_test, open(File['outFile']+'/seqFile_test_gridsearch', 'wb'), -1)
            

    for embDimSize in embDimSize_list :
        for hidden_dim in hiddenDimSize_list :
            for window_size in window_size_list :

                options_train = {'embDimSize': embDimSize,
                        'hiddenDimSize': hidden_dim,
                        'windowSize': window_size
                        }
            
                # Training
                if len(File['labelFile']) > 0 :
                    if len(File['demoFile']) > 0 :
                        train_med2vec(seqFile = File['outFile']+'/seqFile_train_gridsearch',
                                labelFile = File['outFile']+'/labelFile_train_gridsearch',
                                demoFile = File['outFile']+'/demoFile_train_gridsearch',
                                outFile = File['outFile'], 
                                numXcodes = File['numXcodes'],
                                numYcodes = File['numYcodes'],
                                demoSize = File['demoSize'],
                                embDimSize = options_train['embDimSize'], 
                                hiddenDimSize = options_train['hiddenDimSize'],
                                windowSize = options_train['windowSize'],
                                batchSize = File['batchSize'],
                                maxEpochs = File['maxEpochs'],
                                logEps = File['logEps'],
                                L2_reg = File['L2_reg'])
                    else :
                        train_med2vec(seqFile = File['outFile']+'/seqFile_train_gridsearch',
                                labelFile = File['outFile']+'/labelFile_train_gridsearch',
                                demoFile = '',
                                outFile = File['outFile'], 
                                numXcodes = File['numXcodes'],
                                numYcodes = File['numYcodes'],
                                demoSize = File['demoSize'],
                                embDimSize = options_train['embDimSize'], 
                                hiddenDimSize = options_train['hiddenDimSize'],
                                windowSize = options_train['windowSize'],
                                batchSize = File['batchSize'],
                                maxEpochs = File['maxEpochs'],
                                logEps = File['logEps'],
                                L2_reg = File['L2_reg'])
                else :
                    train_med2vec(seqFile = File['outFile']+'/seqFile_train_gridsearch',
                                labelFile = '',
                                demoFile = '',
                                outFile = File['outFile'], 
                                numXcodes = File['numXcodes'],
                                numYcodes = File['numYcodes'],
                                demoSize = File['demoSize'],
                                embDimSize = options_train['embDimSize'], 
                                hiddenDimSize = options_train['hiddenDimSize'],
                                windowSize = options_train['windowSize'],
                                batchSize = File['batchSize'],
                                maxEpochs = File['maxEpochs'],
                                logEps = File['logEps'],
                                L2_reg = File['L2_reg'])
                            
            

                # Evaluate the learned model on the held-out split files
                Filetest = File.copy()
                if len(File['labelFile']) > 0 :
                    if len(File['demoFile']) > 0 :
                        Filetest['seqFile'] = File['outFile']+'/seqFile_test_gridsearch'
                        Filetest['labelFile'] = File['outFile']+'/labelFile_test_gridsearch'
                        Filetest['demoFile'] = File['outFile']+'/demoFile_test_gridsearch'
                    else :
                        Filetest['seqFile'] = File['outFile']+'/seqFile_test_gridsearch'
                        Filetest['labelFile'] = File['outFile']+'/labelFile_test_gridsearch'
                else :
                    Filetest['seqFile'] = File['outFile']+'/seqFile_test_gridsearch'
                            
                loss_value_test = Evaluate_Med2Vec(Filetest, File['outFile']+ '.' +str(File['maxEpochs']-1) + '.npz', options_train)
                # Store this combination and its validation loss as row i
                df_hyperparameters.loc[i] = [embDimSize, hidden_dim, window_size, round(loss_value_test.item(), 4)]
                    
                i+=1

    return df_hyperparameters
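
Once the grid search has returned, the best combination is the row with the smallest `Loss_test`. A minimal sketch, using a hand-made DataFrame with made-up loss values in place of a real `Gridsearch_Med2Vec` result. The loss column is cast to float first because a DataFrame filled row by row may hold object dtype.

```python
import pandas as pd

# Made-up values standing in for the DataFrame returned by Gridsearch_Med2Vec.
df_hyperparameters = pd.DataFrame(
    [[100, 50, 1, 0.8123],
     [100, 100, 2, 0.7541],
     [200, 100, 2, 0.7906]],
    columns=['embDimSize', 'hiddenDimSize', 'windowSize', 'Loss_test'])

# Cast to float before comparing, then take the row with the smallest loss.
losses = df_hyperparameters['Loss_test'].astype(float)
best = df_hyperparameters.loc[losses.idxmin()]
print(best)
```

The selected values can then be placed in the `File` dictionary under `'embDimSize'`, `'hiddenDimSize'` and `'windowSize'` for a final call to `Train_Med2Vec`.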