# Modeling

We apply machine learning algorithms to predict whether a person has Parkinson's disease.

Our goal is to check whether this is possible using velocity-related metrics, and to understand which of those metrics matter.

# 1. Reading dataset

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

In [2]:
# Path to data (os.path.join avoids the doubled slash of manual concatenation)
data_path = os.path.join(os.path.dirname(os.getcwd()), 'Data')

# Read handwriting data
df_parkinson_hw = pd.read_csv(os.path.join(data_path, 'handwrite', 'parkinson_hw_velocity.csv'))

# Read keystroke (Tappy) data
df_parkinson_hand = pd.read_csv(os.path.join(data_path, 'tappy-keystroke', 'parkinson_tappy_hand.csv'), index_col='id')
df_parkinson_direction = pd.read_csv(os.path.join(data_path, 'tappy-keystroke', 'parkinson_tappy_direction.csv'), index_col='id')

# 2. Modeling

## 2.1 Dataset handwrite

In [3]:
def training(X, Y):
    '''Fit several classifiers on the same training data.'''

    # Classifiers
    dt = DecisionTreeClassifier()
    svm = SVC(gamma='scale', probability=True)
    lr = LogisticRegression(solver='liblinear')
    knn_3 = KNeighborsClassifier(n_neighbors=3)
    knn_5 = KNeighborsClassifier(n_neighbors=5)
    knn_7 = KNeighborsClassifier(n_neighbors=7)
    knn_9 = KNeighborsClassifier(n_neighbors=9)
    xgb = XGBClassifier()
    rf = RandomForestClassifier(n_estimators=100)

    clfs = {'dt': dt, 'svm': svm, 'lr': lr, 'knn_3': knn_3, 'knn_5': knn_5,
            'knn_7': knn_7, 'knn_9': knn_9, 'xgb': xgb, 'rf': rf}

    # Training: fit() returns the fitted estimator
    for k, v in clfs.items():
        clfs[k] = v.fit(X, Y)

    return clfs

def metrics(clfs, X, Y):
    '''Compute accuracy, sensitivity, specificity and AUC for each classifier.'''
    result = []
    for k, clf in clfs.items():
        # Predict once and reuse for the confusion matrix and the AUC
        Y_pred = clf.predict(X)
        tn, fp, fn, tp = confusion_matrix(Y, Y_pred).ravel()
        result.append({
            'classifier': k,
            'acc': round((tn + tp) / (tn + fp + fn + tp), 2),
            'sens': round(tp / (tp + fn), 2),
            'spec': round(tn / (tn + fp), 2),
            # AUC on hard labels, which equals (sens + spec) / 2
            'auc': round(roc_auc_score(Y, Y_pred), 2),
        })

    return result

def evaluate_classifiers(X, Y, test_size=0.3, oversampling='ADASYN'):
    '''Evaluate the classifiers over 50 random train/test splits.'''

    # Per-split results, one DataFrame per split
    results = []

    for i in range(50):
        # Split data
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size)

        # Apply oversampling to the training split only.
        # `sampling_strategy` and `fit_resample` replace the older
        # `ratio` and `fit_sample` names in imbalanced-learn.
        if oversampling == 'SMOTE':
            smote = SMOTE(sampling_strategy='minority')
            X_train, Y_train = smote.fit_resample(X_train, Y_train)

        elif oversampling == 'ADASYN':
            adasyn = ADASYN(sampling_strategy='minority')
            X_train, Y_train = adasyn.fit_resample(X_train, Y_train)

        # Training (plain arrays, so fit and predict see the same input type)
        clfs = training(np.asarray(X_train), np.asarray(Y_train))

        # Collect the metrics of this split
        results.append(pd.DataFrame(metrics(clfs, np.asarray(X_test), np.asarray(Y_test))))

    # Average each metric over the splits
    df_result = pd.concat(results).pivot_table(index='classifier', values=['acc', 'auc', 'sens', 'spec'])

    # Sort by AUC
    df_result.sort_values(by='auc', ascending=False, inplace=True)

    return df_result
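
The `metrics` function passes hard class predictions to `roc_auc_score`. For binary labels that value is the mean of sensitivity and specificity (the balanced accuracy), not a threshold-independent AUC. The sketch below compares it with the AUC computed from `predict_proba` scores. It runs on synthetic data from `make_classification`, which is an assumption for illustration and not the Parkinson data; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical; not the Parkinson dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=4,
                                     weights=[0.3, 0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

clf = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)

# Sensitivity and specificity from hard predictions
y_pred = clf.predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
sens = tp / (tp + fn)
spec = tn / (tn + fp)

# AUC from hard labels: a single operating point on the ROC curve
auc_from_labels = roc_auc_score(y_te, y_pred)

# AUC from predicted probabilities of the positive class: the full ROC curve
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print('AUC from labels:', round(auc_from_labels, 4))
print('(sens + spec) / 2:', round((sens + spec) / 2, 4))
print('AUC from scores:', round(auc_from_scores, 4))
```

Switching `metrics` to probability scores would change the `auc` column, so the tables in this notebook should be read as balanced accuracy under the default decision threshold.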

In [4]:
#Test 0 
X = df_parkinson_hw.loc[df_parkinson_hw['test_id'] == 0,['mean','std']]
Y = df_parkinson_hw.loc[df_parkinson_hw['test_id'] == 0,'parkinson']

evaluate_classifiers(X,Y)

| classifier | acc | auc | sens | spec |
|---|---|---|---|---|
| knn_9 | 0.6086 | 0.6708 | 0.5678 | 0.775 |
| lr | 0.5736 | 0.6432 | 0.5326 | 0.754 |
| knn_7 | 0.5926 | 0.6426 | 0.5574 | 0.7286 |
| svm | 0.4092 | 0.6156 | 0.2886 | 0.9428 |
| knn_5 | 0.5854 | 0.6104 | 0.566 | 0.6528 |
| knn_3 | 0.5874 | 0.5916 | 0.5874 | 0.5962 |
| xgb | 0.6034 | 0.5528 | 0.6324 | 0.4722 |
| rf | 0.5952 | 0.5378 | 0.6294 | 0.4452 |
| dt | 0.603 | 0.5334 | 0.6428 | 0.4244 |


In [28]:
#Test 1
X = df_parkinson_hw.loc[df_parkinson_hw['test_id'] == 1,['mean','std']]
Y = df_parkinson_hw.loc[df_parkinson_hw['test_id'] == 1,'parkinson']

evaluate_classifiers(X,Y)

| classifier | acc | auc | sens | spec |
|---|---|---|---|---|
| lr | 0.6682 | 0.7642 | 0.5898 | 0.9398 |
| svm | 0.5674 | 0.7214 | 0.448 | 0.9916 |
| knn_9 | 0.6592 | 0.7164 | 0.6156 | 0.8186 |
| rf | 0.724 | 0.6996 | 0.7404 | 0.6588 |
| knn_7 | 0.6542 | 0.6978 | 0.6182 | 0.7792 |
| xgb | 0.7264 | 0.6852 | 0.7546 | 0.6172 |
| knn_5 | 0.6476 | 0.6806 | 0.6204 | 0.7414 |
| knn_3 | 0.6708 | 0.6758 | 0.6604 | 0.692 |
| dt | 0.7182 | 0.6426 | 0.768 | 0.5182 |


Logistic regression (LR) was among the best classifiers by AUC in both tests (second behind knn_9 in test 0, first in test 1), and test 1 appears to separate people with and without Parkinson's better.
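
Since each table averages 50 random splits, the spread across splits helps judge whether a gap between two classifiers is meaningful. A minimal sketch of reporting mean and standard deviation together, using a few made-up rows shaped like the output of `metrics` (the numbers are hypothetical, not results from this study):

```python
import pandas as pd

# Hypothetical per-split metrics, shaped like the rows returned by `metrics`
rows = [
    {'classifier': 'lr', 'acc': 0.66, 'auc': 0.78},
    {'classifier': 'lr', 'acc': 0.68, 'auc': 0.74},
    {'classifier': 'knn_9', 'acc': 0.64, 'auc': 0.70},
    {'classifier': 'knn_9', 'acc': 0.68, 'auc': 0.74},
]
df_runs = pd.DataFrame(rows)

# Mean and sample standard deviation of each metric, per classifier
summary = df_runs.groupby('classifier')[['acc', 'auc']].agg(['mean', 'std'])

# Sort by mean AUC; columns are a MultiIndex, so the key is a tuple
summary = summary.sort_values(by=('auc', 'mean'), ascending=False)
print(summary)
```

The same `agg(['mean', 'std'])` call could replace the `pivot_table` step in `evaluate_classifiers` if the spread is wanted alongside the averages.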

Let's look at which coefficients logistic regression with an L1 (Lasso) penalty keeps.

In [31]:
# Select features and target for test 1
X = df_parkinson_hw.loc[df_parkinson_hw['test_id'] == 1,['mean','std']]
Y = df_parkinson_hw.loc[df_parkinson_hw['test_id'] == 1,'parkinson']

# Fit an L1-penalized logistic regression and list its coefficients
lr = LogisticRegression(solver='liblinear',penalty='l1').fit(X,Y)
dict(zip(X.columns.values,lr.coef_[0]))

{'mean': 2.9630101567267646, 'std': 0.0}

With L1-penalized LR, only the mean velocity keeps a non-zero coefficient; the standard deviation is dropped.
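
Which coefficients an L1 penalty zeroes out depends on the regularization strength `C` and on the scale of each feature, so a single fit at the default `C=1.0` on unscaled features is only one view. The sketch below standardizes the features and counts non-zero coefficients over a few `C` values. The data is synthetic (`make_classification`) and the `C` grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the handwriting dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)

# Put all features on the same scale so the penalty treats them alike
X_scaled = StandardScaler().fit_transform(X_demo)

# Smaller C means a stronger penalty and more coefficients driven to zero
nonzero_by_C = {}
for C in [1e-4, 1e-2, 1.0, 100.0]:
    model = LogisticRegression(solver='liblinear', penalty='l1', C=C).fit(X_scaled, y_demo)
    nonzero_by_C[C] = int(np.count_nonzero(model.coef_[0]))

print(nonzero_by_C)
```

A feature that stays non-zero across a wide range of `C` is a more convincing candidate than one that survives only at a single setting.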

## 2.2 Dataset hand and direction

In [22]:
#Hand
X = df_parkinson_hand.iloc[:,1:]
Y = df_parkinson_hand['parkinson']

evaluate_classifiers(X,Y)

| classifier | acc | auc | sens | spec |
|---|---|---|---|---|
| lr | 0.5824 | 0.5514 | 0.6154 | 0.488 |
| xgb | 0.6138 | 0.5048 | 0.7266 | 0.2826 |
| rf | 0.619 | 0.4922 | 0.7496 | 0.2346 |
| dt | 0.5798 | 0.4904 | 0.6696 | 0.3092 |
| knn_9 | 0.473 | 0.4836 | 0.46 | 0.5066 |
| svm | 0.7208 | 0.4822 | 0.9604 | 0.0034 |
| knn_5 | 0.4986 | 0.4776 | 0.5138 | 0.442 |
| knn_7 | 0.4822 | 0.4744 | 0.4846 | 0.4644 |
| knn_3 | 0.511 | 0.4728 | 0.5464 | 0.3982 |


In [23]:
#Direction
X = df_parkinson_direction.iloc[:,1:]
Y = df_parkinson_direction['parkinson']

evaluate_classifiers(X,Y)

| classifier | acc | auc | sens | spec |
|---|---|---|---|---|
| xgb | 0.7056 | 0.6058 | 0.7978 | 0.4156 |
| rf | 0.7124 | 0.5824 | 0.831 | 0.3326 |
| dt | 0.6648 | 0.5736 | 0.75 | 0.3968 |
| lr | 0.6044 | 0.5346 | 0.6734 | 0.3942 |
| svm | 0.7606 | 0.4982 | 0.9964 | 0.0 |
| knn_9 | 0.4664 | 0.4878 | 0.4496 | 0.5262 |
| knn_5 | 0.4958 | 0.4876 | 0.5082 | 0.4678 |
| knn_7 | 0.4702 | 0.4824 | 0.4636 | 0.502 |
| knn_3 | 0.524 | 0.4804 | 0.5662 | 0.3944 |


In [34]:
#Direction without flight
X = df_parkinson_direction.iloc[:,19:]
Y = df_parkinson_direction['parkinson']

evaluate_classifiers(X,Y)

| classifier | acc | auc | sens | spec |
|---|---|---|---|---|
| xgb | 0.6992 | 0.6008 | 0.7898 | 0.412 |
| rf | 0.6948 | 0.5642 | 0.8164 | 0.3116 |
| dt | 0.6438 | 0.5586 | 0.7226 | 0.3934 |
| svm | 0.7596 | 0.4982 | 0.9964 | 0.0 |
| lr | 0.5582 | 0.494 | 0.6214 | 0.3668 |
| knn_9 | 0.4864 | 0.489 | 0.4854 | 0.4938 |
| knn_7 | 0.4914 | 0.4852 | 0.5018 | 0.4678 |
| knn_5 | 0.5024 | 0.472 | 0.532 | 0.4116 |
| knn_3 | 0.4986 | 0.4378 | 0.5574 | 0.317 |


Of the three models, the one that uses the finer split into directions (LL, LR, LS, RL, RR, RS, SL, SR, SS) gives the best results, and the flight attributes are not needed: dropping them leaves the best AUC almost unchanged (0.6058 with them, 0.6008 without, both for xgb).

Again, we inspect the XGB feature importances, only to get a rough idea of which features matter most.

In [48]:
# Select features (direction attributes without flight) and target
X = df_parkinson_direction.iloc[:,19:]
Y = df_parkinson_direction['parkinson']

xgb = XGBClassifier().fit(X,Y)
dict(zip(X.columns,xgb.feature_importances_))

{'hold_time_LL_mean': 0.019666942,
 'hold_time_LL_std': 0.017505148,
 'hold_time_LR_mean': 0.04694745,
 'hold_time_LR_std': 0.00048592474,
 'hold_time_LS_mean': 0.03862971,
 'hold_time_LS_std': 0.053769127,
 'hold_time_RL_mean': 0.014556614,
 'hold_time_RL_std': 0.007845612,
 'hold_time_RR_mean': 0.030264033,
 'hold_time_RR_std': 0.031115191,
 'hold_time_RS_mean': 0.025287073,
 'hold_time_RS_std': 0.04054601,
 'hold_time_SL_mean': 0.035497032,
 'hold_time_SL_std': 0.01977266,
 'hold_time_SR_mean': 0.037707787,
 'hold_time_SR_std': 0.01936177,
 'hold_time_SS_mean': 0.02235417,
 'hold_time_SS_std': 0.033705965,
 'latency_LL_mean': 0.036825106,
 'latency_LL_std': 0.016128946,
 'latency_LR_mean': 0.017327586,
 'latency_LR_std': 0.019470986,
 'latency_LS_mean': 0.043868095,
 'latency_LS_std': 0.033731684,
 'latency_RL_mean': 0.0067273593,
 'latency_RL_std': 0.01589806,
 'latency_RR_mean': 0.019966086,
 'latency_RR_std': 0.018970633,
 'latency_RS_mean': 0.044319775,
 'latency_RS_std': 0.0290
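
A raw dictionary of importances is hard to scan; sorting it makes the ranking explicit. The sketch below shows one way to do that. It uses `RandomForestClassifier` as a stand-in for `XGBClassifier` (both expose `feature_importances_`), synthetic data, and hypothetical column names, so the values it prints say nothing about the Tappy features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
cols = ['feat_%d' % i for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=cols)

# Stand-in model; any estimator with `feature_importances_` works the same way
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Label each importance with its column name and rank from most to least important
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
top_features = importances.sort_values(ascending=False).head(3)
print(top_features)
```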

# 3. Conclusion

Identifying whether a person has Parkinson's disease is currently very difficult. For that reason we studied two datasets that mostly contain velocity-related metrics from such people, to understand whether the disease could be predicted well from them.

Our conclusion is that yes, reasonably satisfactory results can be reached with these metrics. This suggests that, in the future, other unconventional and non-intrusive means could be used to diagnose the disease by combining velocity-related data with machine learning.

Speed-related metrics are probably good predictors because people with Parkinson's tend to show tremor, rigidity and slowness of movement, so these attributes are relevant because of the symptoms of the disease itself. We could not identify any single feature that was clearly more relevant than the others.

# 4. Remarks

In this study we did no hyperparameter tuning. Tuning would very likely improve on the results shown here, which again suggests that data combined with machine learning is a promising direction for diagnosing diseases, Parkinson's in particular.
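
As an illustration of what such tuning could look like, here is a minimal sketch using scikit-learn's `GridSearchCV` with stratified folds and AUC as the selection metric. The data is synthetic and the parameter grid is an arbitrary assumption; neither was part of this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical; not the Parkinson datasets)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

# Hypothetical grid: the neighbor counts used above plus a weighting option
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring='roc_auc' ranks candidates by AUC computed from prediction scores
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='roc_auc', cv=cv)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 4))
```

To keep the estimate honest, the search should run only on the training split of each repetition, with the test split held back for the final score.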

Because the classes are imbalanced, we had to use algorithms that generate synthetic data, so our conclusions are not fully reliable, although they point in a useful direction. More data would allow firmer conclusions.
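
One alternative that avoids synthetic samples altogether is class weighting, where errors on the rarer class cost more during training. This is a different technique from the SMOTE/ADASYN oversampling used in this notebook and was not evaluated here. The sketch uses a made-up 90/10 class split to show the weights scikit-learn derives with `class_weight='balanced'`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1
rng = np.random.RandomState(0)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = rng.normal(size=(100, 2)) + y_demo[:, None]

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)
print(weights)  # 100 / (2 * 90) for class 0 and 100 / (2 * 10) for class 1

# The same weighting applied inside the classifier, with no resampling step
clf = LogisticRegression(solver='liblinear', class_weight='balanced').fit(X_demo, y_demo)
```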

PCA could be applied, especially to the hand and direction datasets, but we chose not to because we wanted to inspect the importance of the individual features.
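
For reference, a PCA variant could be sketched as a scikit-learn `Pipeline` that standardizes the features, keeps enough components to explain 95% of the variance, and then fits a classifier. The 95% threshold, the synthetic data, and the choice of logistic regression are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with correlated features (hypothetical; not the Tappy data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_redundant=10, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),          # PCA is sensitive to feature scale
    ('pca', PCA(n_components=0.95)),      # keep components covering 95% of variance
    ('clf', LogisticRegression(solver='liblinear')),
])
pipe.fit(X_demo, y_demo)

n_kept = pipe.named_steps['pca'].n_components_
explained = pipe.named_steps['pca'].explained_variance_ratio_.sum()
print(n_kept, round(explained, 4))
```

The trade-off is the one noted above: each component mixes several original attributes, so per-feature importances are no longer directly readable.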