# Alzheimer's Detection from Handwriting Data with the DARWIN Dataset

The DARWIN dataset contains handwriting data collected according to the acquisition protocol described in [1], which comprises 25 handwriting tasks. The protocol was specifically designed for the early detection of Alzheimer’s disease (AD). The dataset includes data from 174 participants (89 AD patients and 85 healthy people).
The file “DARWIN.csv” contains the acquired data: one header row followed by one row per participant. The first 89 data rows hold the patients' data, and the remaining 85 rows hold the data from healthy people.
The file consists of 452 columns. The first column holds the participants' identifiers, and the last column holds the class to which each participant belongs. This value is either 'P' (Patient) or 'H' (Healthy).
The remaining 450 columns report the features extracted from the tasks: there are 25 tasks, and 18 features have been extracted for each task. Each column is named after the feature, followed by the number of the task from which the feature was extracted. E.g., the column with the header "total_time8" collects the values for the "total time" feature extracted from task #8.
Benchmark performances achieved on the DARWIN dataset have been published in [2].
For any further questions, do not hesitate to contact Dr. Fontanella (fontanella AT unicas DOT it).
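The column naming convention described above can be split programmatically. The following is a minimal sketch, assuming every feature column ends in its task number and that no feature name itself ends in a digit; `split_column` and `COLUMN_PATTERN` are hypothetical names introduced here for illustration, not part of the dataset tooling.

```python
import re

# Assumed pattern: a feature name followed by the task number (1 to 25).
COLUMN_PATTERN = re.compile(r"^(?P<feature>.+?)(?P<task>\d+)$")

def split_column(name):
    """Split a header such as 'total_time8' into ('total_time', 8)."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        # Columns such as 'ID' and 'class' carry no task number.
        raise ValueError(f"not a feature column: {name!r}")
    return match.group("feature"), int(match.group("task"))

print(split_column("total_time8"))         # ('total_time', 8)
print(split_column("mean_jerk_in_air25"))  # ('mean_jerk_in_air', 25)
```

Grouping columns this way makes it easy to select all features of one task, or one feature across all tasks.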

References
[1] N. D. Cilia, C. De Stefano, F. Fontanella, A. S. Di Freca, An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science 141 (2018) 466–471.
https://doi.org/10.1016/j.procs.2018.10.141

[2] N. D. Cilia, G. De Gregorio, C. De Stefano, F. Fontanella, A. Marcelli, A. Parziale, Diagnosing Alzheimer’s disease from online handwriting: A novel dataset and performance benchmarking, Engineering Applications of Artificial Intelligence 111 (2022) 104822.
https://doi.org/10.1016/j.engappai.2022.104822


In [22]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import pickle
import xgboost as xgb
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix
from sklearn.metrics import accuracy_score

In [6]:
dataset = pd.read_csv('Data/data.csv')
print(dataset.shape)
dataset.head()

(174, 452)


Unnamed: 0,ID,air_time1,disp_index1,gmrt_in_air1,gmrt_on_paper1,max_x_extension1,max_y_extension1,mean_acc_in_air1,mean_acc_on_paper1,mean_gmrt1,...,mean_jerk_in_air25,mean_jerk_on_paper25,mean_speed_in_air25,mean_speed_on_paper25,num_of_pendown25,paper_time25,pressure_mean25,pressure_var25,total_time25,class
0,id_1,5160,1.3e-05,120.804174,86.853334,957,6601,0.3618,0.217459,103.828754,...,0.141434,0.024471,5.596487,3.184589,71,40120,1749.278166,296102.7676,144605,P
1,id_2,51980,1.6e-05,115.318238,83.448681,1694,6998,0.272513,0.14488,99.383459,...,0.049663,0.018368,1.665973,0.950249,129,126700,1504.768272,278744.285,298640,P
2,id_3,2600,1e-05,229.933997,172.761858,2333,5802,0.38702,0.181342,201.347928,...,0.178194,0.017174,4.000781,2.392521,74,45480,1431.443492,144411.7055,79025,P
3,id_4,2130,1e-05,369.403342,183.193104,1756,8159,0.556879,0.164502,276.298223,...,0.113905,0.01986,4.206746,1.613522,123,67945,1465.843329,230184.7154,181220,P
4,id_5,2310,7e-06,257.997131,111.275889,987,4732,0.266077,0.145104,184.63651,...,0.121782,0.020872,3.319036,1.680629,92,37285,1841.702561,158290.0255,72575,P


In [25]:
dataset.drop('ID', axis=1, inplace=True)

In [8]:
dataset.describe()

Unnamed: 0,air_time1,disp_index1,gmrt_in_air1,gmrt_on_paper1,max_x_extension1,max_y_extension1,mean_acc_in_air1,mean_acc_on_paper1,mean_gmrt1,mean_jerk_in_air1,...,mean_gmrt25,mean_jerk_in_air25,mean_jerk_on_paper25,mean_speed_in_air25,mean_speed_on_paper25,num_of_pendown25,paper_time25,pressure_mean25,pressure_var25,total_time25
count,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,...,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0
mean,5664.166667,1e-05,297.666685,200.504413,1977.965517,7323.896552,0.416374,0.179823,249.085549,0.067556,...,221.360646,0.148286,0.019934,4.472643,2.871613,85.83908,43109.712644,1629.585962,163061.76736,164203.3
std,12653.772746,3e-06,183.943181,111.629546,1648.306365,2188.290512,0.381837,0.064693,132.698462,0.074776,...,63.762013,0.062207,0.002388,1.501411,0.852809,27.485518,19092.024337,324.142316,56845.610814,496939.7
min,65.0,2e-06,28.734515,29.935835,754.0,561.0,0.067748,0.096631,41.199445,0.011861,...,69.928033,0.030169,0.014987,1.323565,0.950249,32.0,15930.0,474.049462,26984.92666,29980.0
25%,1697.5,8e-06,174.153023,136.524742,1362.5,6124.0,0.218209,0.146647,161.136182,0.029523,...,178.798382,0.107732,0.018301,3.485934,2.401199,66.0,32803.75,1499.112088,120099.0468,59175.0
50%,2890.0,9e-06,255.791452,176.494494,1681.0,6975.5,0.275184,0.163659,224.445268,0.039233,...,217.431621,0.140483,0.019488,4.510578,2.830672,81.0,37312.5,1729.38501,158236.7718,76115.0
75%,4931.25,1.1e-05,358.917885,234.05256,2082.75,8298.5,0.442706,0.188879,294.392298,0.071057,...,264.310776,0.199168,0.021134,5.212794,3.335828,101.5,46533.75,1865.626974,200921.078475,127542.5
max,109965.0,2.8e-05,1168.328276,865.210522,18602.0,15783.0,2.772566,0.62735,836.784702,0.543199,...,437.373267,0.375078,0.029227,10.416715,5.602909,209.0,139575.0,1999.775983,352981.85,5704200.0


In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Columns: 452 entries, ID to class
dtypes: float64(300), int64(150), object(2)
memory usage: 614.6+ KB


In [24]:
X_train.describe()

Unnamed: 0,air_time1,disp_index1,gmrt_in_air1,gmrt_on_paper1,max_x_extension1,max_y_extension1,mean_acc_in_air1,mean_acc_on_paper1,mean_gmrt1,mean_jerk_in_air1,...,mean_gmrt25,mean_jerk_in_air25,mean_jerk_on_paper25,mean_speed_in_air25,mean_speed_on_paper25,num_of_pendown25,paper_time25,pressure_mean25,pressure_var25,total_time25
count,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,...,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0,139.0
mean,5967.884892,1e-05,306.594153,203.40885,1973.402878,7290.683453,0.415702,0.180503,255.001502,0.06716,...,220.250211,0.146122,0.019889,4.426823,2.861405,87.503597,43485.647482,1611.8596,162163.427748,177082.1
std,13975.740559,3e-06,193.04777,117.101777,1755.99964,2161.001942,0.398152,0.069014,138.464919,0.077972,...,66.09283,0.063944,0.002425,1.551919,0.87218,29.253112,19741.442347,339.574671,58254.957665,553723.2
min,65.0,2e-06,28.734515,29.935835,786.0,1137.0,0.067748,0.096631,41.199445,0.011861,...,69.928033,0.030169,0.014987,1.323565,0.950249,32.0,22425.0,474.049462,26984.92666,35530.0
25%,1532.5,8e-06,175.322217,139.0536,1392.0,6041.5,0.218249,0.144992,163.189898,0.029781,...,176.067991,0.101006,0.018273,3.344525,2.374763,65.5,32052.5,1490.670869,118910.3785,56572.5
50%,2725.0,9e-06,261.454359,176.873032,1680.0,6963.0,0.279051,0.16248,229.016144,0.040834,...,216.102339,0.13758,0.019374,4.316327,2.808008,84.0,37560.0,1722.652206,157649.098,75900.0
75%,4762.5,1.1e-05,363.192992,236.21004,2048.5,8222.0,0.415687,0.188403,305.293882,0.068184,...,264.310776,0.192259,0.020989,5.113032,3.335828,105.0,46950.0,1857.865016,198587.4758,126870.0
max,109965.0,2.8e-05,1168.328276,865.210522,18602.0,15783.0,2.772566,0.62735,836.784702,0.543199,...,437.373267,0.375078,0.029227,10.416715,5.602909,209.0,139575.0,1999.775983,352981.85,5704200.0


In [26]:
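# Split the dataset before scaling, so the scaler only sees the training data
colNames = dataset.columns.tolist()
predictors = colNames[:-1]
target = colNames[-1]
X = dataset[predictors]
y = dataset[target]
# LabelEncoder sorts the classes: 'H' (healthy) -> 0, 'P' (patient) -> 1
y = LabelEncoder().fit_transform(y)
# No random_state is set, so the split (and every score below) changes between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Standardize train and test data with statistics learned on the training set only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)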

(139, 450) (35, 450) (139,) (35,)
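The standardization step above learns the mean and standard deviation on the training data and reuses them for the test data. The following is a minimal pure-Python sketch of that idea for a single column, with made-up values; `fit_scaler` and `transform` are hypothetical helpers, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Return (mean, std) of one training column; std is the population std."""
    return fmean(train_column), pstdev(train_column)

def transform(column, mean, std):
    """Standardize values with statistics learned on the training data."""
    return [(value - mean) / std for value in column]

train = [2.0, 4.0, 6.0, 8.0]  # made-up training values
test = [5.0, 10.0]            # made-up test values

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
# The test column is scaled with the training mean and std, never its own.
test_scaled = transform(test, mean, std)
print(mean, std)
print(test_scaled)
```

After the transform, the training column has zero mean and unit variance, while the test column generally does not, which is expected.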


In [27]:
# Grid Search Optimization, training and evaluation of first model using logistic regression (Benchmark model)
tuned_params_v1 = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000], 
                   'penalty': ['l2', 'none']}
model_v1 = GridSearchCV(LogisticRegression(), 
                         tuned_params_v1, 
                         scoring = 'roc_auc', 
                         n_jobs = -1)
model_v1.fit(X_train_scaled, y_train)
# Best model selection
print('Best estimator: ', model_v1.best_estimator_)
# Class predictions on the test data
y_pred_v1 = model_v1.predict(X_test_scaled)
# Predicted probability of the positive class, needed for the ROC curve
y_pred_proba_v1 = model_v1.predict_proba(X_test_scaled)[:,1]
# Confusion matrix
print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred_v1))
# ROC AUC computed from hard class predictions (a single threshold), not from probabilities
roc_auc_v1 = roc_auc_score(y_test, y_pred_v1)
print('Area Under the curve, roc_auc_score: ', roc_auc_v1)
# ROC curve from the predicted probabilities
fpr_v1, tpr_v1, thresholds = roc_curve(y_test, y_pred_proba_v1)
# Area under the probability-based ROC curve
auc_v1 = auc(fpr_v1, tpr_v1)
print('AUC: ', auc_v1)
# Test data accuracy
accuracy_v1 = accuracy_score(y_test, y_pred_v1)
print('Accuracy: ', accuracy_v1)
# Save the model to disk
joblib.dump(model_v1, 'MLModels/model_v1.pkl')
# Pandas dataframe holding the metrics of every model for later comparison
df_models = pd.DataFrame()
dict_model_v1 = {'Name': 'model_v1', 
                  'Algorithm': 'Logistic Regression', 
                  'ROC_AUC Score': roc_auc_v1,
                  'AUC Score': auc_v1,
                  'Accuracy': accuracy_v1}
# Add the metrics to the dataframe (DataFrame.append was removed in pandas 2.0)
df_models = pd.concat([df_models, pd.DataFrame([dict_model_v1])], ignore_index = True)
display(df_models)





Best estimator:  LogisticRegression(C=0.01)
Confusion Matrix: 
 [[17  3]
 [ 5 10]]
Area Under the curve, roc_auc_score:  0.7583333333333333
AUC:  0.9233333333333333
Accuracy:  0.7714285714285715




Unnamed: 0,Name,Algorithm,ROC_AUC Score,AUC Score,Accuracy
0,model_v1,Logistic Regression,0.758333,0.923333,0.771429
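The gap between the two AUC columns comes from what is passed to the scorer: `roc_auc_score` on hard class labels sees a single threshold, and the area under that one-point ROC curve equals the mean of the true positive rate and the true negative rate. The sketch below reproduces the logistic regression numbers from the confusion matrix printed above; `rates_from_confusion` is a hypothetical helper written for this illustration.

```python
def rates_from_confusion(matrix):
    """matrix = [[TN, FP], [FN, TP]], the layout used by scikit-learn."""
    (tn, fp), (fn, tp) = matrix
    tpr = tp / (tp + fn)  # sensitivity
    tnr = tn / (tn + fp)  # specificity
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return tpr, tnr, accuracy

tpr, tnr, accuracy = rates_from_confusion([[17, 3], [5, 10]])
# Area under a ROC curve that has a single operating point.
hard_label_auc = (tpr + tnr) / 2
print(round(hard_label_auc, 6), round(accuracy, 6))  # 0.758333 0.771429
```

The probability-based AUC (0.923 for this model) ranks all test samples across every threshold, which is why it is the larger and more informative of the two values.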


In [28]:
# Grid Search Optimization, training and evaluation of second model using Random Forest Classifier
tuned_params_v2 = {'n_estimators': [50, 100, 200, 300, 400, 500],
                   'criterion': ['gini', 'entropy', 'log_loss'],
                   'min_samples_split': [2, 5, 10], 
                   'min_samples_leaf': [1, 2, 4]}
model_v2 = GridSearchCV(RandomForestClassifier(), 
                               tuned_params_v2, 
                               scoring = 'roc_auc', 
                               n_jobs  = -1)
model_v2.fit(X_train_scaled, y_train)
# Best model
print('Best Estimator: ', model_v2.best_estimator_)
# Class predictions on the test data
y_pred_v2 = model_v2.predict(X_test_scaled)
# Predicted probability of the positive class, needed for the ROC curve
y_pred_proba_v2 = model_v2.predict_proba(X_test_scaled)[:,1]
# Confusion matrix
print('Confusion Matrix: ', confusion_matrix(y_test, y_pred_v2))
# ROC AUC computed from hard class predictions (a single threshold), not from probabilities
roc_auc_v2 = roc_auc_score(y_test, y_pred_v2)
print("ROC_AUC: ", roc_auc_v2)
# ROC curve from the predicted probabilities
fpr_v2, tpr_v2, thresholds = roc_curve(y_test, y_pred_proba_v2)
# Area under the probability-based ROC curve
auc_v2 = auc(fpr_v2, tpr_v2)
print('AUC: ', auc_v2)
# Accuracy
accuracy_v2 = accuracy_score(y_test, y_pred_v2)
print("Accuracy: ", accuracy_v2)
# Save the model to disk
joblib.dump(model_v2, 'MLModels/model_v2.pkl')
dict_model_v2 = {'Name': 'model_v2', 
                  'Algorithm': 'Random Forest', 
                  'ROC_AUC Score': roc_auc_v2,
                  'AUC Score': auc_v2,
                  'Accuracy': accuracy_v2}
# Add the metrics to the dataframe (DataFrame.append was removed in pandas 2.0)
df_models = pd.concat([df_models, pd.DataFrame([dict_model_v2])], ignore_index = True)
display(df_models)

Best Estimator:  RandomForestClassifier(criterion='log_loss', min_samples_split=10,
                       n_estimators=200)
Confusion Matrix:  [[17  3]
 [ 4 11]]
ROC_AUC:  0.7916666666666667
AUC:  0.9299999999999999
Accuracy:  0.8




Unnamed: 0,Name,Algorithm,ROC_AUC Score,AUC Score,Accuracy
0,model_v1,Logistic Regression,0.758333,0.923333,0.771429
1,model_v2,Random Forest,0.791667,0.93,0.8


In [None]:
# Cross-validated search for k, training and evaluation of the third model using KNN
# Candidate values of k (odd numbers from 1 to 19)
neighbors = list(range(1, 20, 2))
# Mean cross-validation accuracy for each k
cv_scores = []
# Cross-validation to determine the best value of k
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv = 5, scoring = 'accuracy')
    cv_scores.append(scores.mean())
# Convert accuracy to misclassification error and pick the k with the lowest error
error = [1 - x for x in cv_scores]
optimal_k = neighbors[error.index(min(error))]
print('The best value of k is %d' % optimal_k)
# model_v3
model_v3 = KNeighborsClassifier(n_neighbors = optimal_k)
# Training
model_v3.fit(X_train_scaled, y_train)
# Class predictions on the test data
y_pred_v3 = model_v3.predict(X_test_scaled)
# Confusion matrix
print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred_v3))
# Predicted probability of the positive class
y_pred_proba_v3 = model_v3.predict_proba(X_test_scaled)[:,1]
# ROC AUC on the test data, computed from hard class predictions
roc_auc_v3 = roc_auc_score(y_test, y_pred_v3)
print('ROC_AUC: ', roc_auc_v3)
# ROC curve from the predicted probabilities
fpr_v3, tpr_v3, thresholds = roc_curve(y_test, y_pred_proba_v3)
# Area under the probability-based ROC curve
auc_v3 = auc(fpr_v3, tpr_v3)
print("AUC", auc_v3)
# Accuracy
accuracy_v3 = accuracy_score(y_test, y_pred_v3)
print('Accuracy: ', accuracy_v3)
# Save the model to disk
joblib.dump(model_v3, 'MLModels/model_v3.pkl')
# Dictionary with the metrics of model_v3
dict_modelo_v3 = {'Name': 'model_v3', 
                  'Algorithm': 'KNN', 
                  'ROC_AUC Score': roc_auc_v3,
                  'AUC Score': auc_v3,
                  'Accuracy': accuracy_v3}
# Add the metrics to the dataframe (DataFrame.append was removed in pandas 2.0)
df_models = pd.concat([df_models, pd.DataFrame([dict_modelo_v3])], ignore_index = True)
display(df_models)

In [29]:
# Grid search optimization of the fourth model, an SVM
def svc_param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    kernels = ['linear', 'rbf', 'poly', 'sigmoid']
    param_grid = {'C': Cs, 'gamma' : gammas, 'kernel' : kernels}
    # No scoring is given, so the SVC default (accuracy) is used
    grid_search = GridSearchCV(SVC(), param_grid, cv = nfolds)
    # Fit on the data passed in, not on the global training arrays
    grid_search.fit(X, y)
    return grid_search.best_params_
print('Best params: ', svc_param_selection(X_train_scaled, y_train, 10))

Best params:  {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
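To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below uses the same parameter values as the cell above and counts the candidates and cross-validation fits; it only illustrates the size of the search and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
}
nfolds = 10

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * nfolds

# The linear kernel ignores gamma, so these candidates differ only in an unused value.
linear_candidates = [c for c in candidates if c["kernel"] == "linear"]
distinct_linear = {c["C"] for c in linear_candidates}

print(len(candidates), total_fits)                   # 80 800
print(len(linear_candidates), len(distinct_linear))  # 20 5
```

Because `gamma` has no effect on the linear kernel, 15 of the 20 linear candidates repeat work. One option is to pass `GridSearchCV` a list of two grids, one for the linear kernel without `gamma` and one for the remaining kernels.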


In [30]:
# Build the model with the best parameters found by the grid search
model_v4 = SVC(C = 1, gamma = 0.001, kernel = 'rbf', probability = True)
# Training
model_v4.fit(X_train_scaled, y_train)
# Class predictions on the test data
y_pred_v4 = model_v4.predict(X_test_scaled)
# Confusion matrix (computed but not displayed, since it is not the last expression in the cell)
confusion_matrix(y_test, y_pred_v4)
# Predicted probability of the positive class
y_pred_proba_v4 = model_v4.predict_proba(X_test_scaled)[:, 1]
# ROC AUC computed from hard class predictions (a single threshold), not from probabilities
roc_auc_v4 = roc_auc_score(y_test, y_pred_v4)
print('ROC_AUC: ', roc_auc_v4)
# ROC curve from the predicted probabilities
fpr_v4, tpr_v4, thresholds = roc_curve(y_test, y_pred_proba_v4)
# Area under the probability-based ROC curve
auc_v4 = auc(fpr_v4, tpr_v4)
print('AUC: ', auc_v4)
# Accuracy
accuracy_v4 = accuracy_score(y_test, y_pred_v4)
print(accuracy_v4)
# Save the model to disk
joblib.dump(model_v4, 'MLModels/model_v4.pkl')
# model_v4 metrics
dict_model_v4 = {'Name': 'model_v4', 
                  'Algorithm': 'SVM', 
                  'ROC_AUC Score': roc_auc_v4,
                  'AUC Score': auc_v4,
                  'Accuracy': accuracy_v4}
# Add the metrics to the dataframe (DataFrame.append was removed in pandas 2.0)
df_models = pd.concat([df_models, pd.DataFrame([dict_model_v4])], ignore_index = True)
display(df_models)

ROC_AUC:  0.8166666666666667
AUC:  0.9500000000000001
0.8285714285714286




Unnamed: 0,Name,Algorithm,ROC_AUC Score,AUC Score,Accuracy
0,model_v1,Logistic Regression,0.758333,0.923333,0.771429
1,model_v2,Random Forest,0.791667,0.93,0.8
2,model_v4,SVM,0.816667,0.95,0.828571


In [31]:
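# Modeling, training and evaluation of the LSTM model
# Reshape to (samples, timesteps, features): each of the 450 features becomes one timestep
# Note: this overwrites the X_train and X_test DataFrames with 3-D arrays
X_train = X_train_scaled.reshape(X_train_scaled.shape[0], X_train_scaled.shape[1], 1)
X_test = X_test_scaled.reshape(X_test_scaled.shape[0], X_test_scaled.shape[1], 1)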

model_v5 = Sequential()
model_v5.add(LSTM(units=64, input_shape=(X_train.shape[1], X_train.shape[2])))
model_v5.add(Dense(units=32, activation='relu'))
model_v5.add(Dense(units=1, activation='sigmoid'))

model_v5.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])

model_v5.fit(X_train, y_train, batch_size=32, epochs=100)

loss, accuracy = model_v5.evaluate(X_test, y_test)
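The reshape in this cell feeds the LSTM a sequence of 450 timesteps with a single value each. Since the columns appear to be grouped task by task (18 features for task 1, then 18 for task 2, and so on), another possible layout is 25 timesteps of 18 features. The sketch below shows that layout with plain lists; it is an untested alternative under that column-order assumption, not the approach used in this notebook, and `row_to_task_sequence` is a hypothetical helper.

```python
N_TASKS = 25
N_FEATURES = 18

def row_to_task_sequence(row):
    """Split one flat row of 450 values into 25 task vectors of 18 features."""
    if len(row) != N_TASKS * N_FEATURES:
        raise ValueError(f"expected {N_TASKS * N_FEATURES} values, got {len(row)}")
    return [row[i * N_FEATURES:(i + 1) * N_FEATURES] for i in range(N_TASKS)]

# Placeholder values standing in for one scaled participant row.
flat_row = list(range(450))
sequence = row_to_task_sequence(flat_row)
print(len(sequence), len(sequence[0]))  # 25 18
print(sequence[7][0])                   # 126, the first feature of task #8
```

With NumPy arrays the same layout would be `X_train_scaled.reshape(-1, 25, 18)`, and the LSTM `input_shape` would become `(25, 18)`.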


In [32]:
# Predicted probabilities on the reshaped (3-D) test data
y_pred = model_v5.predict(X_test).ravel()
# ROC AUC score from the predicted probabilities
roc_auc_v5 = roc_auc_score(y_test, y_pred)
print("ROC AUC Score:", roc_auc_v5)
# With max_fpr=1.0 this is the full ROC AUC again, so both scores are identical
auc_v5 = roc_auc_score(y_test, y_pred, max_fpr=1.0)
print("AUC Score:", auc_v5)
loss, accuracy_v5 = model_v5.evaluate(X_test, y_test)
print(f'Test Loss: {loss:.4f}')
print(f'Test Accuracy: {accuracy_v5:.4f}')
# Save the model to disk
joblib.dump(model_v5, 'MLModels/model_v5.pkl')
# model_v5 metrics
dict_model_v5 = {'Name': 'model_v5', 
                  'Algorithm': 'LSTM', 
                  'ROC_AUC Score': roc_auc_v5,
                  'AUC Score': auc_v5,
                  'Accuracy': accuracy_v5}
# Add the metrics to the dataframe (DataFrame.append was removed in pandas 2.0)
df_models = pd.concat([df_models, pd.DataFrame([dict_model_v5])], ignore_index = True)
display(df_models)




ROC AUC Score: 0.5833333333333334
AUC Score: 0.5833333333333334
Test Loss: 0.7119
Test Accuracy: 0.5143




Unnamed: 0,Name,Algorithm,ROC_AUC Score,AUC Score,Accuracy
0,model_v1,Logistic Regression,0.758333,0.923333,0.771429
1,model_v2,Random Forest,0.791667,0.93,0.8
2,model_v4,SVM,0.816667,0.95,0.828571
3,model_v5,LSTM,0.583333,0.583333,0.514286


In [33]:
# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'n_estimators': [100, 200, 300],
    'gamma': [0.0001,0.001, 0.1, 0.5, 1, 10, 100],
}
# Create the XGBoost classifier
model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

# Perform grid search cross-validation
grid_search = GridSearchCV(model, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train_scaled, y_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
model_v6 = grid_search.best_estimator_

# Class predictions on the test data
y_pred = model_v6.predict(X_test_scaled)

# Evaluate the model
accuracy_v6 = accuracy_score(y_test, y_pred)
# Both scores below use hard class predictions, so they are identical
roc_auc_v6 = roc_auc_score(y_test, y_pred)
auc_v6 = roc_auc_score(y_test, y_pred, max_fpr=1.0)

# Save the model to disk
joblib.dump(model_v6, 'MLModels/model_v6.pkl')

# Metrics
dict_model_v6 = {'Name': 'model_v6', 
                  'Algorithm': 'XGBoost', 
                  'ROC_AUC Score': roc_auc_v6,
                  'AUC Score': auc_v6,
                  'Accuracy': accuracy_v6}

print(f'Accuracy: {accuracy_v6:.4f}')
print(f'ROC AUC: {roc_auc_v6:.4f}')

print("Best Hyperparameters:", best_params)

# Add the metrics to the dataframe (DataFrame.append was removed in pandas 2.0)
df_models = pd.concat([df_models, pd.DataFrame([dict_model_v6])], ignore_index = True)
display(df_models)

Accuracy: 0.9143
ROC AUC: 0.9250
Best Hyperparameters: {'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}




Unnamed: 0,Name,Algorithm,ROC_AUC Score,AUC Score,Accuracy
0,model_v1,Logistic Regression,0.758333,0.923333,0.771429
1,model_v2,Random Forest,0.791667,0.93,0.8
2,model_v4,SVM,0.816667,0.95,0.828571
3,model_v5,LSTM,0.583333,0.583333,0.514286
4,model_v6,XGBoost,0.925,0.925,0.914286


In [34]:
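### Best model selection
# AUC Score is a global, threshold-independent metric, so it is used to select the best model for this solution.
# Note: for model_v6 the AUC Score was computed from hard class predictions, so it is not
# directly comparable with the probability-based values of model_v1, model_v2 and model_v4.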
df_best_model = df_models[df_models['AUC Score'] == df_models['AUC Score'].max()]
print(df_best_model)

       Name Algorithm  ROC_AUC Score  AUC Score  Accuracy
2  model_v4       SVM       0.816667       0.95  0.828571


In [None]:
## Prediction with the best model
# Name of the best model
model = df_best_model['Name'].iloc[0]
# Load the model from disk
best_model = joblib.load('MLModels/' + model + '.pkl')
# Raw data of the new patient: the 450 feature values, in the same column order as the training data
new_pacient = []  # Insert new raw data in this field
print(len(new_pacient))
# Convert to a 2-D np.array with a single row
arr_pacient = np.array(new_pacient).reshape(1,-1)
# Apply the scaler fitted on the training data
arr_pacient = scaler.transform(arr_pacient)
# Class prediction
pred_new_pacient = best_model.predict(arr_pacient)
print(pred_new_pacient)
# Verify and print the result (1 = 'P', patient; 0 = 'H', healthy)
if pred_new_pacient[0] == 1:
    print("This patient is predicted to have Alzheimer's disease")
else:
    print("This patient is predicted not to have Alzheimer's disease")
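The check against `1` in the final step relies on how the labels were encoded earlier: `LabelEncoder` assigns integers to the classes in sorted order. The sketch below mimics that behavior in plain Python to make the mapping explicit; `fit_label_mapping` is a hypothetical stand-in, not the scikit-learn implementation.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

mapping = fit_label_mapping(["P", "H", "P", "H"])
inverse = {index: label for label, index in mapping.items()}
print(mapping)     # {'H': 0, 'P': 1}
print(inverse[1])  # P
```

Keeping the fitted `LabelEncoder` in a variable and calling its `inverse_transform` on the prediction would avoid hard-coding this mapping.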