EXTRA (APPENDIX)

In this notebook, deep learning models (LSTM) are trained on time series of the variables that make up OASIS to predict whether an ICU stay will last more than 3 days. The models are then evaluated with accuracy (ACC), AUC-ROC, AUC-PR, and confusion matrices. Cross-validation and Student's t-test are also used to determine whether the differences between results are statistically significant.

Note: first run '06CreateTimeSeries.ipynb' to obtain 'result_OneBigDataset.csv', 'result_OneBigDataset_y_true.csv', 'result_OneBigDataset_test.csv' and 'result_OneBigDataset_y_true_test.csv'.

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)
import csv
import os
import sys

In [2]:
import matplotlib.pyplot as plt
import matplotlib
import pylab
import seaborn as sns
matplotlib.style.use('ggplot')

In [3]:
import shutil
import random
random.seed(42)
from sklearn.preprocessing import StandardScaler
from sklearn.utils import class_weight
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_score
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from scipy import stats


In [4]:
result_OneBigDataset = pd.read_csv('/data/codi/OASIS/timesSeriesv2_cohorte/data/lengthOfStay/result_OneBigDataset.csv')
y_true = pd.read_csv('/data/codi/OASIS/timesSeriesv2_cohorte/data/lengthOfStay/result_OneBigDataset_y_true.csv')

In [5]:
result_OneBigDataset.head(50)

Unnamed: 0,Hours,HRATE,MAP,RESP_RATE,TEMP_C,gcs_e,gcs_m,gcs_total_carevue,gcs_v,AGE,PRELOS
0,0,97.0,74.0,23.0,36.6,4.0,6.0,15.0,5.0,43.531507,2.8
1,1,96.0,74.0,22.0,36.6,3.0,6.0,15.0,1.0,43.531507,2.8
2,2,90.0,74.0,20.0,36.6,3.0,6.0,15.0,1.0,43.531507,2.8
3,3,90.0,76.0,16.0,36.6,3.0,6.0,15.0,1.0,43.531507,2.8
4,4,81.0,76.0,16.0,36.6,3.0,6.0,15.0,1.0,43.531507,2.8
5,5,80.0,76.0,15.0,36.944444,3.0,6.0,15.0,1.0,43.531507,2.8
6,6,83.0,81.0,18.0,36.944444,3.0,6.0,15.0,1.0,43.531507,2.8
7,7,88.0,81.0,18.0,36.944444,3.0,6.0,15.0,1.0,43.531507,2.8
8,8,86.0,81.0,16.0,36.944444,3.0,6.0,15.0,1.0,43.531507,2.8
9,9,89.0,78.0,18.0,37.055556,3.0,6.0,15.0,5.0,43.531507,2.8


In [6]:
result_OneBigDataset.isnull().sum()

Hours                0
HRATE                0
MAP                  0
RESP_RATE            0
TEMP_C               0
gcs_e                0
gcs_m                0
gcs_total_carevue    0
gcs_v                0
AGE                  0
PRELOS               0
dtype: int64

In [7]:
result_OneBigDataset = np.array(result_OneBigDataset.drop(['Hours'],axis=1))
scaler = StandardScaler()  # standardize each feature (zero mean, unit variance)
result_OneBigDataset_tranformed = scaler.fit_transform(result_OneBigDataset)

In [8]:
# Without normalization:
#train_input = np.array(result_OneBigDataset.drop(['Hours'],axis=1)).reshape(y_true.shape[0], 24, 8)
# Normalized:
train_input = result_OneBigDataset_tranformed.reshape(y_true.shape[0], 24, 10)
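
The reshape above only works if the flat table is ordered so that each stay contributes 24 consecutive hourly rows. A minimal sketch with toy data illustrates the grouping; the array names and sizes here are hypothetical and not taken from the notebook.

```python
import numpy as np

# Hypothetical toy data: 3 stays x 24 hourly rows, 10 features per row.
n_stays, n_hours, n_features = 3, 24, 10
flat = np.arange(n_stays * n_hours * n_features, dtype=float).reshape(
    n_stays * n_hours, n_features
)

# Group every 24 consecutive rows into one stay: (stays, hours, features).
cube = flat.reshape(n_stays, n_hours, n_features)

# Hour h of stay s corresponds to row s * 24 + h of the flat table.
row_of_stay1_hour5 = flat[1 * n_hours + 5]
```

If the rows were not sorted by stay and hour, the reshape would still succeed but would mix hours from different stays, so checking the ordering beforehand seems worthwhile.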

In [9]:
train_input

array([[[ 6.04373538e-01, -3.00562352e-02,  7.29956932e-01, ...,
          6.48531688e-01, -1.19838408e+00, -2.92974697e-01],
        [ 5.52791841e-01, -3.00562352e-02,  5.52048634e-01, ...,
         -1.66958158e+00, -1.19838408e+00, -2.92974697e-01],
        [ 2.43301658e-01, -3.00562352e-02,  1.96232039e-01, ...,
         -1.66958158e+00, -1.19838408e+00, -2.92974697e-01],
        ...,
        [ 1.40138263e-01,  1.99531822e-02, -6.93309448e-01, ...,
          6.48531688e-01, -1.19838408e+00, -2.92974697e-01],
        [-1.46068282e-02,  2.70973847e-02, -3.37492853e-01, ...,
          6.48531688e-01, -1.19838408e+00, -2.92974697e-01],
        [-1.69351920e-01,  1.28089797e-02, -1.59584556e-01, ...,
          6.48531688e-01, -1.19838408e+00, -2.92974697e-01]],

       [[-9.94659075e-01,  4.37669361e-02,  1.08577353e+00, ...,
          6.48531688e-01,  9.50236535e-01,  7.83182393e-02],
        [-1.35573096e+00, -7.29214502e-02, -3.37492853e-01, ...,
          6.48531688e-01,  9.50236535e

In [10]:
train_input.shape

(29846, 24, 10)

In [11]:
y_true = np.array(y_true)
y_true

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [1.],
       [0.]])

In [12]:
y_true.shape

(29846, 1)

In [13]:
result_OneBigDataset_test = pd.read_csv('/data/codi/OASIS/timesSeriesv2_cohorte/data/lengthOfStay/result_OneBigDataset_test.csv')
y_true_test = pd.read_csv('/data/codi/OASIS/timesSeriesv2_cohorte/data/lengthOfStay/result_OneBigDataset_y_true_test.csv')

In [14]:
result_OneBigDataset_test.head(50)

Unnamed: 0,Hours,HRATE,MAP,RESP_RATE,TEMP_C,gcs_e,gcs_m,gcs_total_carevue,gcs_v,AGE,PRELOS
0,0,54.0,89.0,14.0,36.6,4.0,6.0,15.0,5.0,49.70411,84.9
1,1,54.0,77.0,15.0,36.6,4.0,6.0,15.0,5.0,49.70411,84.9
2,2,55.0,80.333298,10.0,36.111099,4.0,6.0,15.0,5.0,49.70411,84.9
3,3,57.0,69.333298,8.0,36.111099,4.0,6.0,15.0,5.0,49.70411,84.9
4,4,63.0,71.333298,14.0,35.777802,4.0,6.0,15.0,5.0,49.70411,84.9
5,5,63.0,70.666702,13.0,35.777802,4.0,6.0,15.0,5.0,49.70411,84.9
6,6,61.0,75.666702,9.0,35.777802,4.0,6.0,15.0,5.0,49.70411,84.9
7,7,58.0,66.0,8.0,35.777802,4.0,6.0,15.0,5.0,49.70411,84.9
8,8,56.0,82.0,8.0,35.944401,4.0,6.0,15.0,5.0,49.70411,84.9
9,9,61.0,70.666702,22.0,35.944401,4.0,6.0,15.0,5.0,49.70411,84.9


In [15]:
result_OneBigDataset_test = np.array(result_OneBigDataset_test.drop(['Hours'],axis=1))
scaler = StandardScaler()  # note: a new scaler is fit on the test set here, separately from the training scaler
result_OneBigDataset_test_tranformed = scaler.fit_transform(result_OneBigDataset_test)
test_input = result_OneBigDataset_test_tranformed.reshape(y_true_test.shape[0], 24, 10)
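
The cell above fits a second `StandardScaler` on the test set. A common alternative, sketched here with synthetic data and not what the notebook ran, is to learn the scaling statistics from the training rows only and reuse them for the test rows. The arrays `train_rows` and `test_rows` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical stand-ins for the flattened train and test tables (rows x 10 features).
train_rows = rng.normal(loc=80.0, scale=10.0, size=(240, 10))
test_rows = rng.normal(loc=85.0, scale=12.0, size=(48, 10))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_rows)  # statistics learned from train only
test_scaled = scaler.transform(test_rows)        # same statistics applied to test

# Same (stays, 24 hours, 10 features) layout as in the notebook.
train_cube = train_scaled.reshape(-1, 24, 10)
test_cube = test_scaled.reshape(-1, 24, 10)
```

With this variant the test features are no longer exactly zero-mean, which is expected: they are expressed in the units learned from the training data.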

In [16]:
test_input

array([[[-0.00393468,  0.49202727, -0.89175725, ...,  0.6607166 ,
         -0.81395485,  0.73924543],
        [-0.00393468, -0.0494832 , -0.71248693, ...,  0.6607166 ,
         -0.81395485,  0.73924543],
        [-0.00389908,  0.10093476, -1.6088385 , ...,  0.6607166 ,
         -0.81395485,  0.73924543],
        ...,
        [-0.00379226, -0.16982047, -0.71248693, ...,  0.6607166 ,
         -0.81395485,  0.73924543],
        [-0.00372105, -0.57595333, -0.89175725, ...,  0.6607166 ,
         -0.81395485,  0.73924543],
        [-0.00372105, -0.57595333, -0.89175725, ...,  0.6607166 ,
         -0.81395485,  0.73924543]],

       [[-0.00297335, -0.50074193,  0.72167558, ...,  0.6607166 ,
         -1.57832581, -0.33165456],
        [-0.00315137, -0.69628594,  0.36313495, ...,  0.6607166 ,
         -1.57832581, -0.33165456],
        [-0.00304456, -0.56090815,  0.18386464, ...,  0.6607166 ,
         -1.57832581, -0.33165456],
        ...,
        [-0.00304456, -0.45561606, -0.53321662, ...,  

In [17]:
test_input.shape

(5282, 24, 10)

In [18]:
y_true_test = np.array(y_true_test)
y_true_test

array([[0.],
       [1.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

In [19]:
y_true_test.shape

(5282, 1)

In [20]:
y_test = y_true_test.ravel()
y_test.shape

(5282,)

In [21]:
# For cross-validation, use the complete dataset  # Arthur
y = (np.concatenate([y_true, y_true_test], axis=0))
X = (np.concatenate([result_OneBigDataset_tranformed, result_OneBigDataset_test_tranformed], axis=0)).reshape(y.shape[0], 24, 10)

In [22]:
X.shape

(35128, 24, 10)

In [23]:
y.shape

(35128, 1)
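
As a rough check of what 5-fold cross-validation does with these 35128 samples, the sketch below splits a placeholder array using the same `KFold` settings that appear later in the notebook. The placeholder array is an assumption; only the number of samples matters for the fold sizes.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 35128  # matches X.shape[0] above
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# KFold only needs the number of samples, so a placeholder array is enough here.
placeholder = np.zeros((n_samples, 1))
fold_sizes = [
    (len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(placeholder)
]
```

Because 35128 is not divisible by 5, the first three test folds hold one more sample than the last two.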

In [24]:
predictions_output_dir = '/data/codi/OASIS/timesSeriesv2_cohorte/data/lengthOfStay/CV/'

## Li

LSTM without class weight

In [None]:
from keras.layers import Bidirectional
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.models import Model
from keras.layers import Flatten, LSTM

model = Sequential()
model.add(Bidirectional(LSTM(10, activation='sigmoid'), input_shape=(24, 10)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

history = model.fit(train_input, y_true, epochs=8, verbose=1)


In [None]:
print(model.summary())

In [None]:
test_output = model.predict(test_input, verbose=1) #return probabilities


In [None]:
test_output

In [None]:
y_pred = (test_output.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
print('labels(0 short stay, 1 long stay) predicted: ', y_pred)
print('true labels: ', y_test)

In [None]:

test_ac=np.round(metrics.accuracy_score(y_test, y_pred)*100,4)
print("Accuracy test:",test_ac)
auroc = metrics.roc_auc_score(y_test, test_output)
print("AUC-ROC: ", auroc)
(precisions, recalls, thresholds) = metrics.precision_recall_curve(y_test, test_output)
auprc = metrics.auc(recalls, precisions)
print("AUC-PR: ", auprc)
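
The AUC-PR above is obtained by trapezoidal integration of the precision-recall curve. The sketch below, on synthetic labels and scores (hypothetical, not the notebook's data), compares that value with `average_precision_score`, a step-wise summary of the same curve that scikit-learn also provides.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
# Hypothetical labels and scores standing in for y_test and test_output.
y_true_demo = rng.integers(0, 2, size=200)
scores_demo = np.clip(0.35 * y_true_demo + rng.uniform(0.0, 0.65, size=200), 0.0, 1.0)

precisions, recalls, _ = metrics.precision_recall_curve(y_true_demo, scores_demo)
auprc_trapezoid = metrics.auc(recalls, precisions)  # same approach as the cell above
auprc_step = metrics.average_precision_score(y_true_demo, scores_demo)
auroc = metrics.roc_auc_score(y_true_demo, scores_demo)
```

The two AUC-PR estimates are usually close but not identical, so it helps to state which one is reported when comparing against other work.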

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
plt.figure(figsize=(6,6))
ax = sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format((test_ac))
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(6,6))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(test_ac)
plt.title(all_sample_title, size = 15)
plt.show()
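
The second heatmap divides each row of the confusion matrix by its row total, so every cell becomes the fraction of that true class. A minimal sketch on hypothetical labels shows that this matches the `normalize="true"` option of `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 short stays (0) and 4 long stays (1).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

cm_counts = confusion_matrix(y_true_demo, y_pred_demo)

# Manual row normalization, as in the cell above.
cm_manual = cm_counts.astype(float) / cm_counts.sum(axis=1)[:, np.newaxis]

# Built-in equivalent: normalize over the true labels (rows).
cm_builtin = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")
```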

LSTM with class weight

In [None]:
classWeight = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(np.ravel(y_true)), y=np.ravel(y_true))
classWeight

In [None]:
classWeight = {i : classWeight[i] for i in range(2)}  #convert to dictionary in order to fit to keras model
classWeight
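
To make the 'balanced' weights less opaque, the sketch below reproduces them by hand on a small hypothetical label vector: each class receives `n_samples / (n_classes * class_count)`, so the rarer class gets the larger weight.

```python
import numpy as np
from sklearn.utils import class_weight

# Hypothetical imbalanced labels: 6 short stays (0) and 2 long stays (1).
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=float)
classes = np.unique(y_demo)

weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=classes, y=y_demo
)

# The 'balanced' heuristic: n_samples / (n_classes * count of each class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo.astype(int)))

# Dictionary form, keyed by integer class label, as passed to Keras.
weight_dict = {int(c): float(w) for c, w in zip(classes, weights)}
```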

In [None]:
from keras.layers import Bidirectional
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.models import Model
from keras.layers import Flatten, LSTM

model = Sequential()
model.add(Bidirectional(LSTM(10, activation='sigmoid'), input_shape=(24, 10)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

history = model.fit(train_input, y_true, epochs=8, verbose=1,class_weight=classWeight)


In [None]:
print(model.summary())

Prepare test Dataset, same process

In [None]:
test_output = model.predict(test_input, verbose=1)


In [None]:
print(test_output)

In [None]:
y_pred = (test_output.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)

In [None]:
print('labels(0 short stay, 1 long stay) predicted: ', y_pred)
print('true labels: ', y_test)

In [None]:
test_ac=np.round(metrics.accuracy_score(y_test, y_pred)*100,4)
print("Accuracy test:",test_ac)

In [None]:
# Second way to compute accuracy:
#https://keras.io/api/metrics/accuracy_metrics/
#import tensorflow as tf
#m = tf.keras.metrics.BinaryAccuracy(threshold=0.5)
#m.update_state(y_true_test,test_output)
#score = m.result().numpy()
#score

In [None]:
auroc = metrics.roc_auc_score(y_test, test_output)
print("AUC-ROC: ", auroc)
(precisions, recalls, thresholds) = metrics.precision_recall_curve(y_test, test_output)
auprc = metrics.auc(recalls, precisions)
print("AUC-PR: ", auprc)

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
plt.figure(figsize=(6,6))
ax = sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format((test_ac))
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(6,6))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(test_ac)
plt.title(all_sample_title, size = 15)
plt.show()

## Arthur

In [35]:
from tslearn.clustering import TimeSeriesKMeans
from keras.layers import Bidirectional
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.models import Model
from keras.layers import Flatten, LSTM
from keras.callbacks import EarlyStopping,ModelCheckpoint
from sklearn.utils import class_weight

def kfoldclustered(classifier, X, y, weighted=False):
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    f = 0;
    aucprs = []
    aucrocs = []
    accuracies = []
    for train, test in cv.split(X, y): #train and test are indexes
        f += 1
        print('KFold ',f,' ---')
        train_input = X[train]
        test_input = X[test]
        y_true = y[train]
        y_test = y[test]
        
        print("---- Clustering Train ---- ")
        # Apply TSCK
        n_clusters = 3
        kmeans = TimeSeriesKMeans(n_clusters=n_clusters, metric="dtw", n_init=1, max_iter=10, random_state=42)
        # Fit to the training data
        kmeans.fit(train_input)
        # Assign each training sample to a cluster
        train_cluster = kmeans.predict(train_input)
        # Pair each sample with its cluster and label, then sort by cluster
        train_df = list(zip(train_cluster, y_true, train_input))
        ls = sorted(train_df, key=lambda t: t[0])
        # Unzip sorted data
        cluster, y_, data = zip(*ls)
        data = np.array(data)
        y_ = np.array(y_)
        # Getting indexes of clusters division
        c = 0
        ind = []
        for i in range(len(ls)):
            if ls[i][0] > c:
                c = ls[i][0]
                ind.append(i)
        # Removing clusters with less than 10 samples
        d1 = ind[0]
        d2 = ind[1] - ind[0]
        d3 = y_true.shape[0] - ind[1]
        cluster_centers_ = kmeans.cluster_centers_
        if d1 < 10:
            dist1 = np.linalg.norm(kmeans.cluster_centers_[1])
            dist2 = np.linalg.norm(kmeans.cluster_centers_[2])
            if dist1 < dist2:
                c = 1
            else:
                c = 2
            train_cluster[:ind[0]] = c
            ind = ind[1:]
            cluster_centers_ = np.delete(kmeans.cluster_centers_, 0, 0)
        elif d2 < 10:
            dist1 = np.linalg.norm(kmeans.cluster_centers_[0])
            dist2 = np.linalg.norm(kmeans.cluster_centers_[2])
            if dist1 < dist2:
                c = 0
            else:
                c = 2
            train_cluster[ind[0]:ind[1]] = c
            ind = ind[:1]
            cluster_centers_ = np.delete(kmeans.cluster_centers_, 1, 0)
        elif d3 < 10:
            dist1 = np.linalg.norm(kmeans.cluster_centers_[0])
            dist2 = np.linalg.norm(kmeans.cluster_centers_[1])
            if dist1 < dist2:
                c = 0
            else:
                c = 1
            train_cluster[ind[1]:] = c
            ind = ind[:1]
            cluster_centers_ = np.delete(kmeans.cluster_centers_, 2, 0)
            
        print("-------- LSTM fitting for each cluster --------")
        i=0
        if weighted:
            classWeight = class_weight.compute_class_weight(class_weight='balanced',
                                                classes=np.unique(np.ravel(y_true)),
                                                y=np.ravel(y_true))
            classWeight = {i : classWeight[i] for i in range(2)}
        for index in range(len(ind)+1):
            if index == 0:
                cluster_X = data[:ind[index],:,:]
                cluster_Y = y_[:ind[index]]
            elif index == len(ind):
                cluster_X = data[ind[index-1]:,:,:]
                cluster_Y = y_[ind[index-1]:]
            else:
                cluster_X = data[ind[index-1]:ind[index],:,:]
                cluster_Y = y_[ind[index-1]:ind[index]]
            print(cluster_X.shape)
            
            # LSTM
            model = Sequential()
            model.add(Bidirectional(LSTM(10, activation='sigmoid'), input_shape=(24, 10)))
            model.add(Dropout(0.2))
            model.add(Dense(1, activation='sigmoid'))
            model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
            # LSTM
            if not weighted:
                earlyStop=EarlyStopping(monitor="val_loss",verbose=2,mode='min',patience=3, restore_best_weights=True)
                saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_tsck{}.h5'.format(f,i))
                history = model.fit(cluster_X, cluster_Y, epochs=10, verbose=1, validation_split=0.15, callbacks=[earlyStop])
                model.save(saved_model_path)
                i += 1
            # LSTM_W
            else:
                print("PASSOU")
                earlyStop=EarlyStopping(monitor="val_loss",verbose=2,mode='min',patience=3, restore_best_weights=True)
                saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_w_tsck{}.h5'.format(f,i))
                history = model.fit(cluster_X, cluster_Y, epochs=10, verbose=1, validation_split=0.15, class_weight=classWeight, callbacks=[earlyStop])
                model.save(saved_model_path)
                i += 1
                
        testkfold(f, cluster_centers_, train_cluster, y_, data, ind, y_test, test_input, aucprs, aucrocs, accuracies, weighted)
    return aucprs, aucrocs, accuracies
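
The function above sorts samples by cluster and then walks the sorted list to find where each cluster begins. A compact alternative, sketched here on hypothetical labels and offered only as an illustration, uses `np.argsort` and `np.searchsorted`; it also yields an empty group if a cluster receives no samples.

```python
import numpy as np

# Hypothetical cluster labels for 8 samples and 3 clusters.
labels = np.array([2, 0, 1, 0, 2, 2, 1, 0])
n_clusters = 3

# Indices that sort the samples by cluster (stable keeps the original order within a cluster).
order = np.argsort(labels, kind="stable")
sorted_labels = labels[order]

# First position of clusters 1..n_clusters-1 in the sorted array.
boundaries = np.searchsorted(sorted_labels, np.arange(1, n_clusters))

# Index groups per cluster, expressed in the original sample numbering.
groups = np.split(order, boundaries)
sizes = [len(g) for g in groups]
```

With these groups, `X[groups[k]]` and `y[groups[k]]` would select the samples and labels of cluster `k` without keeping a separately sorted copy of the data.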

In [36]:
from keras.models import load_model
def testkfold(f, cluster_centers_, train_cluster, y_, data, ind, y_test, test_input, aucprs, aucrocs, accuracies, weighted):
    # Create initial test data to store assigned clusters
    test_df = list(zip(train_cluster, y_test, test_input))
    # Each test sample is assigned to its cluster by Euclidean distance, and the corresponding model is applied
    print("Assigning each test sample to the closest cluster centroid...")
    new_cluster = [0 for i in range(y_test.shape[0])]
    for row in range(len(test_df)):
        min_distance = float('inf')
        closest_cluster = None
        for k in range(cluster_centers_.shape[0]):
            # Check if the assigned cluster has more than 100 samples
            # if train_clusters_df[k].shape[0] > 100: # Try without the limit
            distance = np.linalg.norm(cluster_centers_[k]-test_df[row][2])
            if distance < min_distance:
                min_distance = distance
                closest_cluster = k
        # Assign cluster to test sample
        new_cluster[row] = closest_cluster
    # Sort test data
    test_df = list(zip(new_cluster, y_test, test_input))
    ls_test = sorted(test_df, key=lambda t: t[0])
    # Unzip sorted data
    cluster_t, y_t, data_t = zip(*ls_test)
    data_t = np.array(data_t)
    y_t = np.array(y_t)
    # Getting indexes
    c = 0
    ind_t = []
    for i in range(len(ls_test)):
        if ls_test[i][0] > c:
            c = ls_test[i][0]
            ind_t.append(i)
    print("-------- Train metrics ---------")
    i = 0
    # For each cluster, predict probabilities of class labels
    for index in range(len(ind)+1):
        if index == 0:
            cluster_X = data[:ind[index],:,:]
            cluster_Y = y_[:ind[index]]
        elif index == len(ind):
            cluster_X = data[ind[index-1]:,:,:]
            cluster_Y = y_[ind[index-1]:]
        else:
            cluster_X = data[ind[index-1]:ind[index],:,:]
            cluster_Y = y_[ind[index-1]:ind[index]]
        if weighted:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_w_tsck{}.h5'.format(f,i))
        else:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_tsck{}.h5'.format(f,i))
        
        model = load_model(saved_model_path)
        classes_prob = model.predict(cluster_X, verbose=1)
        if i == 0:
            train_X_probs = np.array(classes_prob)
            y_train = np.array(cluster_Y)
        else: 
            train_X_probs = np.concatenate((train_X_probs, classes_prob))
            y_train = np.concatenate((y_train, cluster_Y))
        i += 1

    # Test metrics
    print("-------- Test metrics ---------")
    i = 0
    # For each cluster, predict probabilities of class labels
    for index in range(len(ind_t)+1):
        if index == 0:
            cluster_X = data_t[:ind_t[index],:,:]
            cluster_Y = y_t[:ind_t[index]]
        elif index == len(ind_t):
            cluster_X = data_t[ind_t[index-1]:,:,:]
            cluster_Y = y_t[ind_t[index-1]:]
        else:
            cluster_X = data_t[ind_t[index-1]:ind_t[index],:,:]
            cluster_Y = y_t[ind_t[index-1]:ind_t[index]]
            
        if weighted:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_w_tsck{}.h5'.format(f,i))
        else:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_tsck{}.h5'.format(f,i))
        model = load_model(saved_model_path)
        classes_prob = model.predict(cluster_X, verbose=1)
        if i == 0:
            test_X_probs = np.array(classes_prob)
            test_y = np.array(cluster_Y)
        else: 
            test_X_probs = np.concatenate((test_X_probs, classes_prob))
            test_y = np.concatenate((test_y, cluster_Y))
        print('----- cluster ',i,' -----')
        print(cluster_X.shape)
        # Test metrics
        y_pred1 = (classes_prob.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
        test_ac1=np.round(metrics.accuracy_score(cluster_Y, y_pred1)*100,4)
        print("Accuracy test cluster ",i,":", test_ac1)
        i += 1
    print("------ Metrics ------")
    # Train metrics
    y_pred_train = (train_X_probs.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
    print('labels(0 short stay, 1 long stay) predicted: ', y_pred_train)
    print('true labels: ', y_train)
    # Test metrics
    y_pred = (test_X_probs.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
    print('labels(0 short stay, 1 long stay) predicted: ', y_pred)
    print('true labels: ', test_y)
    #
    train_ac=np.round(metrics.accuracy_score(y_train, y_pred_train)*100,4)
    print("Accuracy train:", train_ac)
    test_ac=np.round(metrics.accuracy_score(test_y, y_pred)*100,4)
    print("Accuracy test:", test_ac)
    auroc = metrics.roc_auc_score(test_y, test_X_probs)
    print("AUC-ROC: ", auroc)
    (precisions, recalls, thresholds) = metrics.precision_recall_curve(test_y, test_X_probs)
    auprc = metrics.auc(recalls, precisions)
    print("AUC-PR: ", auprc)
    aucprs.append(auprc)
    aucrocs.append(auroc)
    accuracies.append(test_ac)

In [37]:
# Execute LSTM
aucprs, aucrocs, accuracies = kfoldclustered("LSTM", X, y, weighted = False)

KFold  1  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(12979, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(9129, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(5994, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(3406, 24, 10)
Accuracy test cluster  0 : 70.0822
----- cluster  1  -----
(2082, 24, 10)
Accuracy test cluster  1 : 76.7531
----- cluster  2  -----
(1538, 24, 10)
Accuracy test cluster  2 : 71.0013
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 0. 1. 1.]
true labels:  [[0.]
 [0.]
 [1.]
 ...
 [0.]
 [0.]
 [0.]]
labels(0 short stay, 1 long stay) 

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(4784, 24, 10)
Accuracy test cluster  0 : 73.1814
----- cluster  1  -----
(540, 24, 10)
Accuracy test cluster  1 : 70.1852
----- cluster  2  -----
(1702, 24, 10)
Accuracy test cluster  2 : 69.8002
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 0. 1.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [0.]
 [1.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 1. 1.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [1.]
 [0.]
 [0.]]
Accuracy train: 72.3329
Accuracy test: 72.1321
AUC-ROC:  0.7435762780123726
AUC-PR:  0.6589476165051861
KFold  3  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(5320, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epo

Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(16121, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(3111, 24, 10)
Accuracy test cluster  0 : 76.0206
----- cluster  1  -----
(3914, 24, 10)
Accuracy test cluster  1 : 71.5125
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 0. 0.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [1.]
 [0.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [0. 1. 0. ... 0. 0. 1.]
true labels:  [[0.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [0.]]
Accuracy train: 72.3517
Accuracy test: 73.5089
AUC-ROC:  0.7545235331534101
AUC-PR:  0.6666725268965457
KFold  5  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(10438, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epo

----- cluster  0  -----
(2523, 24, 10)
Accuracy test cluster  0 : 75.6639
----- cluster  1  -----
(1052, 24, 10)
Accuracy test cluster  1 : 70.4373
----- cluster  2  -----
(3450, 24, 10)
Accuracy test cluster  2 : 70.6377
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 0. 0.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [1.]
 [0.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 0. 0. 1.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [0.]
 [1.]
 [0.]]
Accuracy train: 72.5901
Accuracy test: 72.4128
AUC-ROC:  0.7337221920781335
AUC-PR:  0.6354399796243597


In [38]:
print ('aucpr scores:', aucprs)
print("%0.4f mean aucpr with a standard deviation of %0.4f" % (np.mean(aucprs), np.std(aucprs)))

print ('aucroc scores:', aucrocs)
print("%0.4f mean aucroc with a standard deviation of %0.4f" % (np.mean(aucrocs), np.std(aucrocs)))

print ('accuracy scores:', accuracies)
print("%0.4f mean accuracy with a standard deviation of %0.4f" % (np.mean(accuracies), np.std(accuracies)))

aucpr scores: [0.6613680322003546, 0.6589476165051861, 0.6474585594591604, 0.6666725268965457, 0.6354399796243597]
0.6540 mean aucpr with a standard deviation of 0.0112
aucroc scores: [0.7432762046323007, 0.7435762780123726, 0.7349363226432507, 0.7545235331534101, 0.7337221920781335]
0.7420 mean aucroc with a standard deviation of 0.0075
accuracy scores: [72.2602, 72.1321, 71.6339, 73.5089, 72.4128]
72.3896 mean accuracy with a standard deviation of 0.6177
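
The standard deviations above come from `np.std` with its default `ddof=0`, which is the population formula. With only five folds, the sample formula (`ddof=1`) gives a noticeably larger value; the sketch below recomputes both from the fold accuracies printed above.

```python
import numpy as np

# Fold accuracies printed above.
accuracies_demo = [72.2602, 72.1321, 71.6339, 73.5089, 72.4128]

mean_acc = np.mean(accuracies_demo)
std_population = np.std(accuracies_demo)      # ddof=0, what the cell above reports
std_sample = np.std(accuracies_demo, ddof=1)  # divides by n - 1 instead of n
```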


In [82]:
# Execute LSTM_w
aucprs_w, aucrocs_w, accuracies_w = kfoldclustered("LSTM", X, y, True)

KFold  1  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(12979, 24, 10)
PASSOU
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 5: early stopping
(9129, 24, 10)
PASSOU
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(5994, 24, 10)
PASSOU
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 0. 1. 1.]
true labels:  [[0.]
 [0.]
 [1.]
 ...
 [0.]
 [0.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 1. 1.]
true labels:  [[0.]
 [1.]
 [0.]
 ...
 [0.]
 [0.]
 [1.]]
Accuracy train: 72.1372
Accuracy test: 71.8901
AUC-ROC:  0.738998318143643
AUC-PR:  0.6476107357022802
KFold  2  ---
---- Clustering Train --

-------- Train metrics ---------
-------- Test metrics ---------
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 0. 0.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [1.]
 [0.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [0. 1. 0. ... 0. 0. 1.]
true labels:  [[0.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [0.]]
Accuracy train: 71.7895
Accuracy test: 73.1103
AUC-ROC:  0.7510376296694833
AUC-PR:  0.6555467656100392
KFold  5  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(10438, 24, 10)
PASSOU
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 8: early stopping
(3637, 24, 10)
PASSOU
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(14028, 24, 10)
PASSOU
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 7: early stopping
Assigning each test sample to the closest cluster centroid...
-------- Trai

In [83]:
print ('aucpr scores:', aucprs_w)
print("%0.4f mean aucpr with a standard deviation of %0.4f" % (np.mean(aucprs_w), np.std(aucprs_w)))

print ('aucroc scores:', aucrocs_w)
print("%0.4f mean aucroc with a standard deviation of %0.4f" % (np.mean(aucrocs_w), np.std(aucrocs_w)))

print ('accuracy scores:', accuracies_w)
print("%0.4f mean accuracy with a standard deviation of %0.4f" % (np.mean(accuracies_w), np.std(accuracies_w)))

aucpr scores: [0.6476107357022802, 0.6515220395344019, 0.6433301815159652, 0.6555467656100392, 0.631819612345009]
0.6460 mean aucpr with a standard deviation of 0.0082
aucroc scores: [0.738998318143643, 0.7397990248010862, 0.7343867905322874, 0.7510376296694833, 0.7305359618978717]
0.7390 mean aucroc with a standard deviation of 0.0069
accuracy scores: [71.8901, 71.5912, 71.1785, 73.1103, 71.5302]
71.8601 mean accuracy with a standard deviation of 0.6648
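
The introduction mentions a Student's t-test for comparing results. One possible form, sketched here as an assumption about how it could be applied, is a paired test on the per-fold AUC-PR values of the two runs, since both were evaluated on the same five folds. The 0.05 threshold is a conventional choice, not one stated in the notebook.

```python
from scipy import stats

# Per-fold AUC-PR values printed above (unweighted and class-weighted runs).
aucpr_plain = [0.6613680322003546, 0.6589476165051861, 0.6474585594591604,
               0.6666725268965457, 0.6354399796243597]
aucpr_weighted = [0.6476107357022802, 0.6515220395344019, 0.6433301815159652,
                  0.6555467656100392, 0.631819612345009]

# Paired test: the two runs share the same KFold split, so folds are matched.
t_stat, p_value = stats.ttest_rel(aucpr_plain, aucpr_weighted)
significant = p_value < 0.05  # assumed significance level
```

Fold scores from cross-validation are not fully independent, so a p-value obtained this way is better read as indicative than as exact.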


## K-Means

In [26]:
from sklearn.cluster import KMeans
from keras.layers import Bidirectional
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.models import Model
from keras.layers import Flatten, LSTM
from keras.callbacks import EarlyStopping,ModelCheckpoint
from sklearn.utils import class_weight

def kfoldclusteredkmeans(classifier, X, y, weighted=False):
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    f = 0;
    aucprs = []
    aucrocs = []
    accuracies = []
    for train, test in cv.split(X, y): #train and test are indexes
        f += 1
        print('KFold ',f,' ---')
        train_input = X[train]
        test_input = X[test]
        y_true = y[train]
        y_test = y[test]
        
        print("---- Clustering Train ---- ")
        # Apply KMeans
        n_clusters = 3
        kmeans = KMeans(
            init="random",
            n_clusters=n_clusters,
            n_init=10,
            max_iter=300,
            random_state=42
        )
        # Fit to the training data
        kmeans.fit(train_input.reshape(y_true.shape[0], 24*10))
        # Assign each training sample to a cluster
        train_cluster = kmeans.predict(train_input.reshape(y_true.shape[0], 24*10))
        # Pair each sample with its cluster and label, then sort by cluster
        train_df = list(zip(train_cluster, y_true, train_input))
        ls = sorted(train_df, key=lambda t: t[0])
        # Unzip sorted data
        cluster, y_, data = zip(*ls)
        data = np.array(data)
        y_ = np.array(y_)
        # Getting indexes of clusters division
        c = 0
        ind = []
        for i in range(len(ls)):
            if ls[i][0] > c:
                c = ls[i][0]
                ind.append(i)
        # Removing clusters with less than 10 samples
        d1 = ind[0]
        d2 = ind[1] - ind[0]
        d3 = y_true.shape[0] - ind[1]
        cluster_centers_ = kmeans.cluster_centers_
        if d1 < 10:
            dist1 = np.linalg.norm(kmeans.cluster_centers_[1])
            dist2 = np.linalg.norm(kmeans.cluster_centers_[2])
            if dist1 < dist2:
                c = 1
            else:
                c = 2
            train_cluster[:ind[0]] = c
            ind = ind[1:]
            cluster_centers_ = np.delete(kmeans.cluster_centers_, 0, 0)
        elif d2 < 10:
            dist1 = np.linalg.norm(kmeans.cluster_centers_[0])
            dist2 = np.linalg.norm(kmeans.cluster_centers_[2])
            if dist1 < dist2:
                c = 0
            else:
                c = 2
            train_cluster[ind[0]:ind[1]] = c
            ind = ind[:1]
            cluster_centers_ = np.delete(kmeans.cluster_centers_, 1, 0)
        elif d3 < 10:
            dist1 = np.linalg.norm(kmeans.cluster_centers_[0])
            dist2 = np.linalg.norm(kmeans.cluster_centers_[1])
            if dist1 < dist2:
                c = 0
            else:
                c = 1
            train_cluster[ind[1]:] = c
            ind = ind[:1]
            cluster_centers_ = np.delete(kmeans.cluster_centers_, 2, 0)
            
        print("-------- LSTM fitting for each cluster --------")
        i=0
        if weighted:
            classWeight = class_weight.compute_class_weight(class_weight='balanced',
                                                classes=np.unique(np.ravel(y_true)),
                                                y=np.ravel(y_true))
            classWeight = {i : classWeight[i] for i in range(2)}
        for index in range(len(ind)+1):
            if index == 0:
                cluster_X = data[:ind[index],:,:]
                cluster_Y = y_[:ind[index]]
            elif index == len(ind):
                cluster_X = data[ind[index-1]:,:,:]
                cluster_Y = y_[ind[index-1]:]
            else:
                cluster_X = data[ind[index-1]:ind[index],:,:]
                cluster_Y = y_[ind[index-1]:ind[index]]
            print(cluster_X.shape)
            
            # LSTM
            model = Sequential()
            model.add(Bidirectional(LSTM(10, activation='sigmoid'), input_shape=(24, 10)))
            model.add(Dropout(0.2))
            model.add(Dense(1, activation='sigmoid'))
            model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
        
            
            # Train with early stopping; class weights are used only for the weighted variant (LSTM_W)
            earlyStop = EarlyStopping(monitor="val_loss", verbose=2, mode='min', patience=3, restore_best_weights=True)
            if not weighted:
                # LSTM
                saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_kmeans{}.h5'.format(f,i))
                history = model.fit(cluster_X, cluster_Y, epochs=10, verbose=1, validation_split=0.15, callbacks=[earlyStop])
            else:
                # LSTM_W
                saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_w_kmeans{}.h5'.format(f,i))
                history = model.fit(cluster_X, cluster_Y, epochs=10, verbose=1, validation_split=0.15, class_weight=classWeight, callbacks=[earlyStop])
            model.save(saved_model_path)
            i += 1
                
        testkfoldkmeans(f, cluster_centers_, train_cluster, y_, data, ind, y_test, test_input, aucprs, aucrocs, accuracies, weighted)
    return aucprs, aucrocs, accuracies
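
The merge step in `kfoldclusteredkmeans` picks a target cluster by comparing the norms of the remaining centroids. A common alternative is to reassign an undersized cluster to the kept cluster whose centroid is closest to its own. The sketch below shows that alternative, not the notebook's rule; `merge_small_clusters` and `min_size` are hypothetical names, and labels are assumed to be integers in `0..k-1`, as returned by `kmeans.predict`.

```python
import numpy as np

def merge_small_clusters(labels, centers, min_size=10):
    """Reassign clusters with fewer than min_size samples to the nearest kept centroid.

    Hypothetical helper: returns (new_labels, keep), where keep holds the ids
    of the clusters that survive. The input arrays are not modified.
    """
    labels = np.asarray(labels).copy()
    centers = np.asarray(centers, dtype=float)
    sizes = np.bincount(labels, minlength=len(centers))
    keep = np.flatnonzero(sizes >= min_size)
    if keep.size == 0:
        raise ValueError("no cluster has at least min_size samples")
    for k in np.flatnonzero(sizes < min_size):
        # Distance from the small cluster's centroid to every kept centroid
        dists = np.linalg.norm(centers[keep] - centers[k], axis=1)
        labels[labels == k] = keep[np.argmin(dists)]
    return labels, keep
```

Under these assumptions, `centers[keep]` would play the role of `cluster_centers_` when test samples are later assigned to a cluster.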

In [27]:
from keras.models import load_model
def testkfoldkmeans(f, cluster_centers_, train_cluster, y_, data, ind, y_test, test_input, aucprs, aucrocs, accuracies, weighted):
    # Create initial test data to store assigned clusters
    test_df = list(zip(train_cluster, y_test, test_input))
    # Each test sample is assigned to its nearest cluster by Euclidean distance, and that cluster's model is applied
    print("Assigning each test sample to the closest cluster centroid...")
    new_cluster = [0 for i in range(y_test.shape[0])]
    for row in range(len(test_df)):
        min_distance = float('inf')
        closest_cluster = None
        for k in range(cluster_centers_.shape[0]):
            # (Disabled) only consider clusters with more than 100 samples; currently run without a limit
            # if train_clusters_df[k].shape[0] > 100:
            distance = np.linalg.norm(cluster_centers_[k]-test_df[row][2].reshape(240))
            if distance < min_distance:
                min_distance = distance
                closest_cluster = k
        # Assign cluster to test sample
        new_cluster[row] = closest_cluster
    # Sort test data
    test_df = list(zip(new_cluster, y_test, test_input))
    ls_test = sorted(test_df, key=lambda t: t[0])
    # Unzip sorted data
    cluster_t, y_t, data_t = zip(*ls_test)
    data_t = np.array(data_t)
    y_t = np.array(y_t)
    # Getting indexes
    c = 0
    ind_t = []
    for i in range(len(ls_test)):
        if ls_test[i][0] > c:
            c = ls_test[i][0]
            ind_t.append(i)
    print("-------- Train metrics ---------")
    i = 0
    # For each cluster, predict probabilities of class labels
    for index in range(len(ind)+1):
        if index == 0:
            cluster_X = data[:ind[index],:,:]
            cluster_Y = y_[:ind[index]]
        elif index == len(ind):
            cluster_X = data[ind[index-1]:,:,:]
            cluster_Y = y_[ind[index-1]:]
        else:
            cluster_X = data[ind[index-1]:ind[index],:,:]
            cluster_Y = y_[ind[index-1]:ind[index]]
        if weighted:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_w_kmeans{}.h5'.format(f,i))
        else:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_kmeans{}.h5'.format(f,i))
        
        model = load_model(saved_model_path)
        classes_prob = model.predict(cluster_X, verbose=1)
        if i == 0:
            train_X_probs = np.array(classes_prob)
            y_train = np.array(cluster_Y)
        else: 
            train_X_probs = np.concatenate((train_X_probs, classes_prob))
            y_train = np.concatenate((y_train, cluster_Y))
        i += 1

    # Test metrics
    print("-------- Test metrics ---------")
    i = 0
    # For each cluster, predict probabilities of class labels
    for index in range(len(ind_t)+1):
        if index == 0:
            cluster_X = data_t[:ind_t[index],:,:]
            cluster_Y = y_t[:ind_t[index]]
        elif index == len(ind_t):
            cluster_X = data_t[ind_t[index-1]:,:,:]
            cluster_Y = y_t[ind_t[index-1]:]
        else:
            cluster_X = data_t[ind_t[index-1]:ind_t[index],:,:]
            cluster_Y = y_t[ind_t[index-1]:ind_t[index]]
            
        if weighted:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_w_kmeans{}.h5'.format(f,i))
        else:
            saved_model_path = os.path.join(predictions_output_dir, 'cv{}_lstm_kmeans{}.h5'.format(f,i))
        model = load_model(saved_model_path)
        classes_prob = model.predict(cluster_X, verbose=1)
        if i == 0:
            test_X_probs = np.array(classes_prob)
            test_y = np.array(cluster_Y)
        else: 
            test_X_probs = np.concatenate((test_X_probs, classes_prob))
            test_y = np.concatenate((test_y, cluster_Y))
        print('----- cluster ',i,' -----')
        print(cluster_X.shape)
        # Test metrics
        y_pred1 = (classes_prob.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
        test_ac1=np.round(metrics.accuracy_score(cluster_Y, y_pred1)*100,4)
        print("Accuracy test cluster ",i,":", test_ac1)
        
        i += 1
    print("------ Metrics ------")
    # Train metrics
    y_pred_train = (train_X_probs.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
    print('labels(0 short stay, 1 long stay) predicted: ', y_pred_train)
    print('true labels: ', y_train)
    # Test metrics
    y_pred = (test_X_probs.ravel()>0.5) + 0.0 # predict and get class (0 if pred < 0.5 else 1)
    print('labels(0 short stay, 1 long stay) predicted: ', y_pred)
    print('true labels: ', test_y)
    #
    train_ac=np.round(metrics.accuracy_score(y_train, y_pred_train)*100,4)
    print("Accuracy train:", train_ac)
    test_ac=np.round(metrics.accuracy_score(test_y, y_pred)*100,4)
    print("Accuracy test:", test_ac)
    auroc = metrics.roc_auc_score(test_y, test_X_probs)
    print("AUC-ROC: ", auroc)
    (precisions, recalls, thresholds) = metrics.precision_recall_curve(test_y, test_X_probs)
    auprc = metrics.auc(recalls, precisions)
    print("AUC-PR: ", auprc)
    aucprs.append(auprc)
    aucrocs.append(auroc)
    accuracies.append(test_ac)
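
The nested loop in `testkfoldkmeans` that assigns each test sample to its closest centroid can also be written with NumPy broadcasting. This is a minimal sketch of the same Euclidean nearest-centroid rule; `assign_to_nearest_centroid` is a hypothetical helper, and `centers` is assumed to hold one flattened centroid per row (240 values for a 24x10 series).

```python
import numpy as np

def assign_to_nearest_centroid(samples, centers):
    """Return, for each sample, the row index of the closest centroid."""
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Flatten each (24, 10) series into one vector so it matches the centroids
    flat = samples.reshape(samples.shape[0], -1)
    # Pairwise distances with shape (n_samples, n_centroids)
    dists = np.linalg.norm(flat[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```

`np.argmin` returns the first minimum, so ties go to the lower index, which matches the strict `<` comparison in the loop. The returned values index rows of `centers`, so they stay aligned with the saved per-cluster models only if the rows are in the same order as those models.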

In [28]:
# Execute LSTM
aucprs, aucrocs, accuracies = kfoldclusteredkmeans("LSTM", X, y, weighted = False)

KFold  1  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(6779, 24, 10)


2022-06-02 15:29:40.467440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-06-02 15:29:40.468202: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-02 15:29:40.470156: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (sony-vaio): /proc/driver/nvidia/version does not exist


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(20716, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 8: early stopping
(607, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(1751, 24, 10)
Accuracy test cluster  0 : 69.7316
----- cluster  1  -----
(5121, 24, 10)
Accuracy test cluster  1 : 73.1693
----- cluster  2  -----
(154, 24, 10)
Accuracy test cluster  2 : 67.5325
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [1. 1. 1. ... 1. 1. 1.]
true labels:  [[0.]
 [1.]
 [1.]
 ...
 [0.]
 [0.]
 [1.]]
labels(0 short stay, 1 long stay) predicted:  [1. 1. 1. ... 1. 1. 0.]
true labels:  [[1.]
 [1.]
 [1.]
 ...
 [1.]
 [0.]
 [0.]]
Accuracy train

Epoch 9: early stopping
(576, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(1724, 24, 10)
Accuracy test cluster  0 : 71.4617
----- cluster  1  -----
(5146, 24, 10)
Accuracy test cluster  1 : 72.9693
----- cluster  2  -----
(156, 24, 10)
Accuracy test cluster  2 : 64.7436
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [1. 1. 1. ... 1. 1. 0.]
true labels:  [[1.]
 [1.]
 [1.]
 ...
 [0.]
 [1.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [1. 0. 1. ... 1. 1. 1.]
true labels:  [[0.]
 [0.]
 [0.]
 ...
 [1.]
 [0.]
 [0.]]
Accuracy train: 72.4397
Accuracy test: 72.4167
AUC-ROC:  0.7451711551371734
AUC-PR:  0.6607602253939872
KFold  3  ---
---- Clustering Train ---- 
-------- LSTM fitting for each cluster --------
(12816, 24, 10)
Epoch 1/10
Epoch 2/

-------- LSTM fitting for each cluster --------
(20667, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 10: early stopping
(6805, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(631, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(5176, 24, 10)
Accuracy test cluster  0 : 73.9567
----- cluster  1  -----
(1689, 24, 10)
Accuracy test cluster  1 : 72.2321
----- cluster  2  -----
(160, 24, 10)
Accuracy test cluster  2 : 52.5
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. ... 1. 1. 1.]
true labels:  [[0.]
 [0.]
 [1.]
 ...
 [0.]
 [1.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [0. 0. 0. 

Epoch 9/10
Epoch 10/10
(20588, 24, 10)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 9: early stopping
Assigning each test sample to the closest cluster centroid...
-------- Train metrics ---------
-------- Test metrics ---------
----- cluster  0  -----
(1637, 24, 10)
Accuracy test cluster  0 : 69.5174
----- cluster  1  -----
(143, 24, 10)
Accuracy test cluster  1 : 60.8392
----- cluster  2  -----
(5245, 24, 10)
Accuracy test cluster  2 : 73.346
------ Metrics ------
labels(0 short stay, 1 long stay) predicted:  [1. 1. 1. ... 0. 0. 0.]
true labels:  [[0.]
 [1.]
 [1.]
 ...
 [0.]
 [0.]
 [0.]]
labels(0 short stay, 1 long stay) predicted:  [1. 1. 0. ... 1. 0. 0.]
true labels:  [[1.]
 [1.]
 [0.]
 ...
 [0.]
 [0.]
 [1.]]
Accuracy train: 72.5118
Accuracy test: 72.1993
AUC-ROC:  0.7340974429200149
AUC-PR:  0.633228760856434


In [34]:
print ('aucpr scores:', aucprs)
print("%0.4f mean aucpr with a standard deviation of %0.4f" % (np.mean(aucprs), np.std(aucprs)))

print ('aucroc scores:', aucrocs)
print("%0.4f mean aucroc with a standard deviation of %0.4f" % (np.mean(aucrocs), np.std(aucrocs)))

print ('accuracy scores:', accuracies)
print("%0.4f mean accuracy with a standard deviation of %0.4f" % (np.mean(accuracies), np.std(accuracies)))

aucpr scores: [0.6565342290034353, 0.6607602253939872, 0.6550276592490609, 0.6685799085433248, 0.633228760856434]
0.6548 mean aucpr with a standard deviation of 0.0118
aucroc scores: [0.7408432479803762, 0.7451711551371734, 0.7373378165630684, 0.7511341881820595, 0.7340974429200149]
0.7417 mean aucroc with a standard deviation of 0.0060
accuracy scores: [72.189, 72.4167, 71.9613, 73.0534, 72.1993]
72.3639 mean accuracy with a standard deviation of 0.3736


## Li

In [29]:
# approach 2
#https://amirhessam88.github.io/roc-vs-pr/

def customCrossValidationMetrics(classifier, X, y, weighted = False  ):
    scaler = StandardScaler()
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    # example: classifier=RandomForestClassifier(n_estimators=10, random_state = 42)
    aucprs = []
    aucrocs = []
    accuracies = []
    
    i=1
    for train, test in cv.split(X, y): #train and test are indexes
        print('KFold ',i)
        # reshape to 2-D (rows = samples * 24 hours, columns = 10 features) because the scaler needs <= 2-D input
        X_train_tranformed = scaler.fit_transform(X[train].reshape(train.shape[0]*24, 10))
        X_test_tranformed = scaler.transform(X[test].reshape(test.shape[0]*24, 10))
        
        #re-reshape
        
        X_train_tranformed = X_train_tranformed.reshape(train.shape[0], 24, 10)
        X_test_tranformed = X_test_tranformed.reshape(test.shape[0], 24, 10)
        
        if weighted:
            classWeight = class_weight.compute_class_weight(class_weight='balanced',
                                                classes=np.unique(np.ravel(y[train])),
                                                y=np.ravel(y[train]))
            classWeight = {k : classWeight[k] for k in range(2)}  # convert to a dictionary, as the Keras model expects
            print("training: ")
            history = classifier.fit(X_train_tranformed, np.ravel(y[train]), epochs=8, verbose=1,class_weight=classWeight)

        else:
            print("training: ")
            history = classifier.fit(X_train_tranformed, np.ravel(y[train]),epochs=8, verbose=1)

        print("testing: ")
        probas_ = np.ravel(classifier.predict(X_test_tranformed, verbose=1))
        # Compute the PR curve and the area under the curve
        precision, recall, thresholds = metrics.precision_recall_curve(y[test], probas_)
        pr_auc =  np.round(metrics.auc(recall, precision), 6)
        aucprs.append(pr_auc)
        auroc =  np.round(metrics.roc_auc_score(y[test],probas_),6)
        aucrocs.append(auroc)

        # Reuse the predicted probabilities and threshold them (0 if pred <= 0.5 else 1)
        y_pred_binary = (probas_ > 0.5) + 0.0
        acc = np.round(metrics.accuracy_score(y[test],y_pred_binary)*100,4)
        accuracies.append(acc)
        
        i = i+1

    print ('aucpr scores:', aucprs)
    print("%0.4f mean aucpr with a standard deviation of %0.4f" % (np.mean(aucprs), np.std(aucprs)))

    print ('aucroc scores:', aucrocs)
    print("%0.4f mean aucroc with a standard deviation of %0.4f" % (np.mean(aucrocs), np.std(aucrocs)))

    print ('accuracy scores:', accuracies)
    print("%0.4f mean accuracy with a standard deviation of %0.4f" % (np.mean(accuracies), np.std(accuracies)))
    
    return aucprs, aucrocs, accuracies
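
The reshape around `StandardScaler` pools every hourly row, so each of the 10 features gets one mean and one standard deviation, estimated on the training fold only and then applied to the test fold. A minimal NumPy sketch of the same idea follows; `standardize_fold` is a hypothetical helper and the arrays are assumed to have shape (samples, 24, 10).

```python
import numpy as np

def standardize_fold(X_train, X_test):
    """Scale both folds with per-feature statistics taken from the training fold."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    n_features = X_train.shape[-1]
    # Pool samples and time steps: one row per (sample, hour)
    flat = X_train.reshape(-1, n_features)
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)  # population std (ddof=0), as StandardScaler uses
    std[std == 0] = 1.0     # leave constant features unscaled instead of dividing by zero
    # Broadcasting applies the per-feature statistics along the last axis
    return (X_train - mean) / std, (X_test - mean) / std
```

Fitting the statistics on the training fold alone keeps information about the test fold out of the preprocessing step.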


In [30]:
from keras.layers import Bidirectional
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.models import Model
from keras.layers import Flatten, LSTM

classifier = Sequential()
classifier.add(Bidirectional(LSTM(10, activation='sigmoid'), input_shape=(24, 10)))
classifier.add(Dropout(0.2))
classifier.add(Dense(1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
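
Keras `fit` resumes from a model's current weights. Passing this single compiled `classifier` through all five folds of `customCrossValidationMetrics` therefore means later folds start from weights already trained on samples that are now in their test split. One way to avoid this is to build a new model for every fold. The sketch below shows only that pattern, with a toy NumPy classifier standing in for the LSTM; `cross_validate_fresh`, `make_model` and `MajorityClassifier` are hypothetical names, and the split is a plain shuffled split rather than scikit-learn's `KFold`.

```python
import numpy as np

class MajorityClassifier:
    """Toy stand-in for the LSTM: always predicts the majority training label."""
    def fit(self, X, y):
        self.label_ = int(np.mean(y) >= 0.5)
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

def cross_validate_fresh(make_model, X, y, n_splits=5, seed=42):
    """Call make_model() once per fold so no fitted state carries over."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = make_model()  # fresh, untrained model for this fold
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(float(np.mean(pred == y[test_idx])))
    return scores
```

With Keras, `make_model` would be a function that builds and compiles the `Sequential` model from this cell and returns it.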


In [32]:
aucprs_LSTM, aucrocs_LSTM, accuracies_LSTM = customCrossValidationMetrics(classifier, X, y, weighted = False)

KFold  1
training: 


2022-06-02 15:52:00.810529: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 26977920 exceeds 10% of free system memory.


Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
testing: 
KFold  2
training: 
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
testing: 
KFold  3
training: 
Epoch 1/8
  4/879 [..............................] - ETA: 17s - loss: 0.5913 - accuracy: 0.7344

2022-06-02 15:56:58.271995: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 26977920 exceeds 10% of free system memory.


Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
testing: 
KFold  4
training: 
Epoch 1/8


2022-06-02 15:59:25.278477: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 26978880 exceeds 10% of free system memory.


Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
testing: 
KFold  5
training: 
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
testing: 
aucpr scores: [0.664502, 0.673576, 0.672413, 0.686458, 0.661199]
0.6716 mean aucpr with a standard deviation of 0.0088
aucroc scores: [0.746278, 0.751817, 0.743593, 0.764057, 0.747589]
0.7507 mean aucroc with a standard deviation of 0.0072
accuracy scores: [72.431, 72.189, 72.2032, 73.6655, 72.6406]
72.6259 mean accuracy with a standard deviation of 0.5456


In [None]:
from keras.layers import Bidirectional
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.models import Model
from keras.layers import Flatten, LSTM

classifier_weighted = Sequential()
classifier_weighted.add(Bidirectional(LSTM(10, activation='sigmoid'), input_shape=(24, 10)))
classifier_weighted.add(Dropout(0.2))
classifier_weighted.add(Dense(1, activation='sigmoid'))
classifier_weighted.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])


In [None]:
aucprs_LSTM_weighted, aucrocs_LSTM_weighted, accuracies_LSTM_weighted = customCrossValidationMetrics(classifier_weighted, X, y, weighted = True)

Statistical significance tests for comparing machine learning algorithms

https://stackoverflow.com/questions/54498235/compare-whether-the-difference-between-performance-accuracy-of-2-ml-models-is-st

In [None]:
print('LSTM vs LSTM_weighted: ', stats.ttest_rel(aucrocs_LSTM,aucrocs_LSTM_weighted))


In [None]:
print('LSTM vs LSTM_weighted: ', stats.ttest_rel(aucprs_LSTM,aucprs_LSTM_weighted))
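
`stats.ttest_rel` is a paired test: it works on the per-fold differences between the two score lists, which fits this setup because both models are evaluated on the same five `KFold` splits. Below is a minimal sketch of the statistic it computes, using only the standard library; `paired_t_statistic` is a hypothetical helper, and the scores in the usage lines are made-up illustration values, not results from this notebook.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per fold")
    n = len(scores_a)
    if n < 2:
        raise ValueError("at least two folds are required")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator)
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d / math.sqrt(var_d / n), n - 1

# Illustration only: made-up fold scores for two models
t, dof = paired_t_statistic([0.75, 0.76, 0.74], [0.74, 0.75, 0.72])
print("t =", t, "dof =", dof)
```

With 5 folds there are 4 degrees of freedom, for which the two-sided 5% critical value is about 2.776; an absolute t below that corresponds to p > 0.05.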


The differences in AUC-ROC and AUC-PR between LSTM and LSTM_weighted are not statistically significant (p-value > 0.05, so we cannot reject the null hypothesis that the two models perform similarly).  
Note: for LOS > 3 days, the benchmark cohort is more balanced (38% positive).