# Deep Patient - Functions

This notebook provides helper functions for Deep Patient representation learning: a hyperparameter grid search and an evaluation of a trained model.

### Reference

Miotto, R., Li, L., Kidd, B. A. and Dudley, J. T. (2016). Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6(1):1–10.



### GitHub repository with the authors' code

https://github.com/riccardomiotto/deep_patient


### Python environment

- Python 2.7
- Required packages: Theano, scikit-learn, pandas and SciPy


### Complementary Functions

We provide two complementary functions:

1. A grid search function: gridsearch_sda

    This function optimizes the following hyperparameters of the Deep Patient architecture:
    - nhidden: dimension of the latent space
    - nlayer: number of autoencoder layers
    - corrupt_lvl: data corruption level

    Every set of hyperparameters given as input is used to train the model on a training set (80% of the sample).

    The optimal set of hyperparameters is the one that minimizes the loss function on a validation set (the remaining 20% of the sample).



2. An evaluation function: evaluate_sda

    This function computes the cost of the model, layer by layer, on both the training and validation sets.
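
The selection rule described for gridsearch_sda (keep the combination with the lowest validation loss) can be sketched on a results table as follows. The values below are invented purely for illustration; only the column names match the DataFrame that gridsearch_sda returns.

```python
import pandas as pd

# Hypothetical results, shaped like the DataFrame returned by gridsearch_sda.
# The numbers are made up for illustration only.
df_gridsearch = pd.DataFrame({
    'nlayer': [1, 1, 3],
    'nhidden': [50, 100, 100],
    'corrupt_lvl': [0.05, 0.05, 0.1],
    'Loss_train': [12.4, 10.9, 9.8],
    'Loss_test': [13.1, 11.2, 11.9],
})

# The columns may be stored with object dtype, so cast the loss to float
# before looking for the row with the smallest validation loss.
best_idx = df_gridsearch['Loss_test'].astype(float).idxmin()
best = df_gridsearch.loc[best_idx]

best_params = {
    'nlayer': int(best['nlayer']),
    'nhidden': int(best['nhidden']),
    'corrupt_lvl': float(best['corrupt_lvl']),
}
print(best_params)
```

Note that in this made-up table the lowest training loss and the lowest validation loss belong to different rows, which is why the selection is made on the validation column.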





## Imports

In [None]:
from scipy import sparse
import pandas as pd
from sklearn.model_selection import train_test_split
import theano
import theano.tensor as T
# Note: the SDA class used below is not defined in this notebook; it must be
# available in the namespace (it comes from the authors' code referenced above).

## Function for evaluating the training

In [None]:
def evaluate_sda(model, train, test, verbose=True):
    '''
    Parameters
    ----------
    model : SDA
        The model to evaluate.
    train : matrix
        Training samples (samples x features) used to train the model.
    test : matrix
        Validation samples (samples x features) on which to test the model.
    verbose : bool
        If True, print the training and validation loss of each layer.

    Returns
    -------
    cost_per_layer_train : list
        List of costs obtained for each layer (i.e. DA) on the training sample.
    cost_per_layer_test : list
        List of costs obtained for each layer (i.e. DA) on the validation sample.
    '''

    cost_per_layer_train = []
    cost_per_layer_test = []

    sda_train = SDA(train.shape[1],
                nhidden=model.sda[0].nh,
                nlayer=model.nlayer,
                param={
        'epochs': model.sda[0].epochs,
        'learn_rate' : model.sda[0].learn_rate,
        'batch_size': model.sda[0].batch_size,
        'corrupt_lvl': model.sda[0].corrupt_lvl
    })

    sda_test = SDA(test.shape[1],
                nhidden=model.sda[0].nh,
                nlayer=model.nlayer,
                param={
        'epochs': model.sda[0].epochs,
        'learn_rate' : model.sda[0].learn_rate,
        'batch_size': model.sda[0].batch_size,
        'corrupt_lvl': model.sda[0].corrupt_lvl
    })

    data_train = train
    data_test = test

    for i in range(model.nlayer) :
            
            sda_train.sda[i].w.set_value(model.sda[i].w.get_value())
            sda_train.sda[i].b.set_value(model.sda[i].b.get_value())
            sda_train.sda[i].bp.set_value(model.sda[i].bp.get_value())

            sda_test.sda[i].w.set_value(model.sda[i].w.get_value())
            sda_test.sda[i].b.set_value(model.sda[i].b.get_value())
            sda_test.sda[i].bp.set_value(model.sda[i].bp.get_value())

            try:
                data_train = data_train.toarray()
            except Exception:
                pass

            try:
                data_test = data_test.toarray()
            except Exception:
                pass

            data_train = sda_train.sda[i].normalizer.fit_transform(data_train)
            dt_train = theano.shared(value=data_train.astype(theano.config.floatX), borrow=True)
            x_train = T.matrix(name='dt_train')
            tilde_x_train = sda_train.sda[i]._corrupted_input(x_train)
            y_train = sda_train.sda[i]._hidden_representation(tilde_x_train)
            z_train = sda_train.sda[i]._reconstructed_input(y_train)

            data_test = sda_test.sda[i].normalizer.fit_transform(data_test)
            dt_test = theano.shared(value=data_test.astype(theano.config.floatX), borrow=True)
            x_test = T.matrix(name='dt_test')
            tilde_x_test = sda_test.sda[i]._corrupted_input(x_test)
            y_test = sda_test.sda[i]._hidden_representation(tilde_x_test)
            z_test = sda_test.sda[i]._reconstructed_input(y_test)

            gcst_train = - T.sum(x_train * T.log(z_train) + (1 - x_train) * T.log(1 - z_train), axis=1)
            gcst_test = - T.sum(x_test * T.log(z_test) + (1 - x_test) * T.log(1 - z_test), axis=1)
            loss_train = T.mean(gcst_train)
            loss_test = T.mean(gcst_test)

            compute_loss_train = theano.function(inputs=[x_train], outputs=loss_train)
            compute_loss_test = theano.function(inputs=[x_test], outputs=loss_test)
            loss_train_value = compute_loss_train(data_train)
            loss_test_value = compute_loss_test(data_test)

            if verbose:
                # Single-argument print calls behave the same in Python 2.7 and 3.
                print('Layer: ' + str(i))
                print('Loss_train: %.4f' % float(loss_train_value))
                print('Loss_test: %.4f' % float(loss_test_value))

            cost_per_layer_train.append(loss_train_value)
            cost_per_layer_test.append(loss_test_value)

            if i < model.nlayer-1:
                data_train = sda_train.sda[i].normalizer.transform(data_train)
                dt_train = theano.shared(data_train, borrow=True)
                data_train = sda_train.sda[i]._hidden_representation(dt_train).eval()

                data_test = sda_test.sda[i].normalizer.transform(data_test)
                dt_test = theano.shared(data_test, borrow=True)
                data_test = sda_test.sda[i]._hidden_representation(dt_test).eval()
                
    return cost_per_layer_train, cost_per_layer_test
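
The per-layer cost computed in evaluate_sda is a binary cross-entropy between each layer's input and its reconstruction, summed over features and averaged over samples. A minimal NumPy sketch of the same formula is given below. The function name `reconstruction_cross_entropy` and the `eps` clipping are assumptions added here for illustration and numerical safety; they are not part of the original code.

```python
import numpy as np

def reconstruction_cross_entropy(x, z, eps=1e-12):
    """Binary cross-entropy summed over features, averaged over samples.

    x : inputs in [0, 1], shape (samples, features)
    z : reconstructions in (0, 1), same shape
    eps : clipping bound (an addition in this sketch, to avoid log(0))
    """
    x = np.asarray(x, dtype=float)
    z = np.clip(np.asarray(z, dtype=float), eps, 1.0 - eps)
    # Same expression as gcst_train / gcst_test in evaluate_sda.
    per_sample = -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z), axis=1)
    return float(np.mean(per_sample))

# Toy inputs: a close reconstruction and an uninformative one.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
z_good = np.array([[0.9, 0.1], [0.2, 0.8]])
z_bad = np.array([[0.5, 0.5], [0.5, 0.5]])

loss_good = reconstruction_cross_entropy(x, z_good)
loss_bad = reconstruction_cross_entropy(x, z_bad)
print(loss_good < loss_bad)
```

With every reconstructed value equal to 0.5, each feature contributes ln 2 to the cost, so the uninformative reconstruction of two features costs 2 ln 2 per sample.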

## Grid search for optimal hyperparameters

Tested parameters:
- nhidden: dimension of the latent space
- nlayer: number of autoencoder layers
- corrupt_lvl: data corruption level
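
One model is trained per combination of the three parameters above, so the number of trainings is the product of the three list lengths. The sketch below enumerates such a grid with `itertools.product`, in the same order as three nested loops over nlayer, nhidden and corrupt_lvl. The candidate values are hypothetical and chosen only to illustrate the enumeration.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
layers_list = [1, 3]
embedding_dim_list = [50, 100]
corrupt_lvl_list = [0.05, 0.1]

# Same order as nested loops: nlayer (outer), nhidden, corrupt_lvl (inner).
grid = list(product(layers_list, embedding_dim_list, corrupt_lvl_list))
n_models = len(grid)

for nlayer, nhidden, corrupt_lvl in grid:
    print('nlayer=%d nhidden=%d corrupt_lvl=%.2f' % (nlayer, nhidden, corrupt_lvl))
```

Since each combination requires a full training run, it may be worth checking the grid size before launching the search.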

In [None]:
def gridsearch_sda(data, epochs, learning_rate, batch_size, embedding_dim_list, layers_list, corrupt_lvl_list) :
    '''
    Parameters
    ----------
    data : matrix
        Input data (samples x features), split 80/20 into training and validation sets.
    epochs : int
        Number of epochs.
    learning_rate : float
        Learning rate of the Gradient Descent.
    batch_size : int
        Number of samples per batch.
    embedding_dim_list : list
        List of dimensions of the resulting embedding (i.e. the hidden layer of the SDA) to test.
    layers_list : list
        List of the numbers of layers (i.e. DA) to test.
    corrupt_lvl_list : list
        List of the values of data corruption (i.e. noise) to test.
    
    Returns
    -------
    df_hyperparameters : DataFrame
        For each combination of parameters (nlayer, nhidden, corrupt_lvl), the loss of the last layer on both the training and validation samples.
    '''

    seq_matrix_train, seq_matrix_test = train_test_split(data, test_size=0.2, random_state=42)
    i = 0
    df_hyperparameters = pd.DataFrame(columns=['nlayer', 'nhidden', 'corrupt_lvl', 'Loss_train', 'Loss_test'])

    for nlayer in layers_list :
        for nhidden in embedding_dim_list :
            for corrupt_lvl in corrupt_lvl_list :

                sda = SDA(seq_matrix_train.shape[1],
                                nhidden=nhidden,
                                nlayer=nlayer,
                                param={
                        'epochs': epochs,
                        'learn_rate' : learning_rate,
                        'batch_size': batch_size,
                        'corrupt_lvl': corrupt_lvl
                    })
                
                sda.train(seq_matrix_train)
                
                cost_per_layer_train, cost_per_layer_test = evaluate_sda(sda, seq_matrix_train, seq_matrix_test, verbose=False)

                # Record the last-layer losses for this combination of parameters.
                df_hyperparameters.loc[i] = [
                    nlayer, nhidden, corrupt_lvl,
                    round(float(cost_per_layer_train[-1]), 4),
                    round(float(cost_per_layer_test[-1]), 4),
                ]
                i += 1

    return df_hyperparameters