# Deep Patient

Required environment:
- Python 2.7
- Packages: Theano, scikit-learn, pandas and SciPy

## Import the code

The package is available on the authors' GitHub:

https://github.com/riccardomiotto/deep_patient

In [None]:
from da import DA
from sda import SDA

Complementary functions are provided on our GitHub.

In [None]:
%run DeepPatient_Functions.ipynb

## Data

### Import the data 

In [None]:
import pandas as pd
data = pd.read_csv(...)

### Prepare the data
1. The medical codes must be factorized, i.e. mapped to integers.
2. The patient sequences must be padded so that every sequence has the same length.
3. The result is stored as a `sparse.csc_matrix()` of shape (number of samples, maximum sequence length).

In [None]:
from scipy import sparse

# Generate the final matrix
seq_matrix = sparse.csc_matrix(...)
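The three preparation steps listed above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the input is taken to be a long-format table with one row per recorded code, and the column names `patient_id` and `code`, the toy values, and the choice of `0` as the padding value are all hypothetical, not taken from the original data.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data: one row per (patient, medical code), in visit order.
toy = pd.DataFrame({
    'patient_id': [1, 1, 1, 2, 2, 3],
    'code': ['I10', 'E11', 'I10', 'J45', 'E11', 'I10'],
})

# 1. Factorize the codes. Shift by 1 so that 0 stays free for padding
#    (assumption: 0 is used as the padding value).
toy['code_id'] = pd.factorize(toy['code'])[0] + 1

# 2. Pad every patient sequence with zeros up to the longest sequence.
sequences = toy.groupby('patient_id')['code_id'].apply(list)
max_len = max(len(seq) for seq in sequences)
padded = np.zeros((len(sequences), max_len), dtype=np.int64)
for row, seq in enumerate(sequences):
    padded[row, :len(seq)] = seq

# 3. Store the result as a sparse matrix of shape (patients, max_len).
toy_matrix = sparse.csc_matrix(padded)
print(toy_matrix.shape)
```

The same pattern applies to the real data once the two column names are replaced by those of the actual table.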

## Gridsearch

Optimized parameters:
- nhidden: dimension of the latent space
- nlayer: number of autoencoder layers
- corrupt_lvl: data corruption level

In [None]:
epochs = 100
learning_rate = 1e-3
batch_size = 250
# Tested Parameters
embedding_dim_list = [10, 20, 50, 100]
layers_list = [1, 3, 5]
corrupt_lvl_list = [0.01, 0.05, 0.1]

In [None]:
df_hyperparameters = gridsearch_sda(seq_matrix, epochs, learning_rate, batch_size, embedding_dim_list, layers_list, corrupt_lvl_list)
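`gridsearch_sda` is defined in `DeepPatient_Functions.ipynb` and is not reproduced here. As a rough sketch of what a grid search over these three lists involves, the snippet below enumerates every combination and keeps the one with the lowest score. `placeholder_loss` is a hypothetical stand-in for "train a model and return its validation loss"; its values are meaningless and only let the loop run.

```python
from itertools import product

embedding_dim_list = [10, 20, 50, 100]
layers_list = [1, 3, 5]
corrupt_lvl_list = [0.01, 0.05, 0.1]


def placeholder_loss(dim, n_layers, level):
    # Hypothetical stand-in for training a model and measuring its
    # validation loss. The formula is arbitrary.
    return abs(dim - 50) / 100.0 + abs(n_layers - 3) / 10.0 + level


# One result row per combination of the three lists.
sketch_results = []
for dim, n_layers, level in product(embedding_dim_list, layers_list, corrupt_lvl_list):
    sketch_results.append({
        'nhidden': dim,
        'nlayer': n_layers,
        'corrupt_lvl': level,
        'Loss_test': placeholder_loss(dim, n_layers, level),
    })

# Keep the combination with the lowest loss.
sketch_best = min(sketch_results, key=lambda row: row['Loss_test'])
print(len(sketch_results))  # 4 * 3 * 3 = 36 combinations
```

With the lists used here, the search trains 36 models, which is worth keeping in mind when choosing the number of epochs for the grid search.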

We keep the set of hyperparameters that minimizes the loss on the validation set, in order to limit over-fitting.

In [None]:
# Row of the grid search results with the lowest validation loss
best = df_hyperparameters.iloc[df_hyperparameters['Loss_test'].values.argmin()]
print('The optimal parameters are:')
print('Dimension of the latent space : ' + str(best.nhidden))
print('Number of layers : ' + str(best.nlayer))
print('Corruption level : ' + str(best.corrupt_lvl))

## Training Step

### Split the sample into a training set (80%) and a test set (20%)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18

np.random.seed(407)
seq_matrix_train, seq_matrix_test = train_test_split(seq_matrix, test_size=0.2, random_state=42)

### Parameters

- Optimal hyperparameters

In [None]:
# Row of the grid search results with the lowest validation loss
best = df_hyperparameters.iloc[df_hyperparameters['Loss_test'].values.argmin()]
# Cast explicitly: values read from a mixed-dtype row may come back as floats
nhidden = int(best.nhidden)
nlayer = int(best.nlayer)
corrupt_lvl = float(best.corrupt_lvl)

- Training Parameters

In [None]:
epochs = 200
learning_rate = 1e-3
batch_size = 250

### Model Construction

In [None]:
model = SDA(seq_matrix_train.shape[1],
            nhidden=nhidden,
            nlayer=nlayer,
            param={
                'epochs': epochs,
                'learn_rate': learning_rate,
                'batch_size': batch_size,
                'corrupt_lvl': corrupt_lvl
            })

### Training

In [None]:
model.train(seq_matrix_train)

### Saving the model

In [None]:
import pickle

# Collect the parameters (w, b, bp) of each autoencoder layer
param = {}
for layer in range(nlayer):
    da = model.sda[layer]
    param['layer_' + str(layer)] = {'w': da.w.get_value(), 'b': da.b.get_value(), 'bp': da.bp.get_value()}

with open('model.pkl', 'wb') as f:
    pickle.dump(param, f)
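To reuse the saved parameters later, the pickle file can be read back into a dictionary with the same layout (`'layer_0'`, `'layer_1'`, ... each holding `'w'`, `'b'` and `'bp'`). The sketch below shows the round trip on a toy dictionary written to a temporary directory; the array shapes are arbitrary placeholders, not those of a trained model.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-in for the saved dictionary. Shapes are arbitrary placeholders.
toy_param = {
    'layer_0': {'w': np.zeros((4, 2)), 'b': np.zeros(2), 'bp': np.zeros(4)},
}

# Write the dictionary to a temporary file, as done for 'model.pkl'.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(toy_param, f)

# Pickle files must be opened in binary mode ('rb') for reading.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# Print the shape of every stored array, layer by layer.
for name in sorted(loaded):
    for key in sorted(loaded[name]):
        print('{} {} {}'.format(name, key, loaded[name][key].shape))
```

Only the raw arrays are stored, so the model object itself has to be rebuilt before these values can be assigned back to it.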

## Evaluation Step

In [None]:
cost_per_layer_train, cost_per_layer_test = evaluate_sda(model, seq_matrix_train, seq_matrix_test)

- Cost per layer on the training set

In [None]:
cost_per_layer_train

- Cost per layer on the test set

In [None]:
cost_per_layer_test

## Resulting latent space

### Apply the model on the whole sample

In [None]:
deep_repr = model.apply(seq_matrix)
print(deep_repr.shape)

Make sure that the representation has shape (number of patients, dimension of the latent space).

### Saving the embedding space

This step is required in order to load the patient representation when computing the clustering.

In [None]:
np.save('deep_repr', np.array(deep_repr))
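Note that `np.save` appends the `.npy` extension when the file name does not already have it, so the call above writes `deep_repr.npy`. The sketch below shows the save and reload round trip on a toy array in a temporary directory; the array values and its 3 x 2 shape are placeholders.

```python
import os
import tempfile

import numpy as np

# Toy representation: 3 patients x 2 latent dimensions (placeholder values).
toy_repr = np.arange(6, dtype=float).reshape(3, 2)

# Save without extension; NumPy writes '<name>.npy'.
base = os.path.join(tempfile.mkdtemp(), 'deep_repr')
np.save(base, toy_repr)

# Reload using the full file name, including the '.npy' extension.
reloaded = np.load(base + '.npy')
print(reloaded.shape)
```

In the clustering notebook, the equivalent call would be `np.load('deep_repr.npy')`, assuming it runs from the same working directory.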