## Parkinson Data

The Oxford Parkinson's Disease Telemonitoring Dataset is a multivariate regression problem with 26 integer and real-valued attributes and no missing data. The goal is to predict the total_UPDRS attribute.

In [4]:
library(plyr)        # load before tidyverse so dplyr's verbs are not masked by plyr
library(glmnet)
library(tidyverse)
library(caret)
library(pROC)
library(randomForest)
library(gbm)
library(rpart)
library(rpart.plot)
library(e1071)
library(ranger)
library(superml)

In [3]:
data <- read.csv("C:/Users/n__gu/Desktop/parkinson.csv", header=TRUE)

In [3]:
head(data)

age,sex,test_time,motor_UPDRS,Jitter...,Jitter.Abs.,Jitter.RAP,Jitter.PPQ5,Jitter.DDP,Shimmer,...,Shimmer.APQ3,Shimmer.APQ5,Shimmer.APQ11,Shimmer.DDA,NHR,HNR,RPDE,DFA,PPE,total_UPDRS
72,0,5.6431,28.199,0.00662,3.38e-05,0.00401,0.00317,0.01204,0.02565,...,0.01438,0.01309,0.01662,0.04314,0.01429,21.64,0.41888,0.54842,0.16006,34.398
72,0,12.666,28.447,0.003,1.68e-05,0.00132,0.0015,0.00395,0.02024,...,0.00994,0.01072,0.01689,0.02982,0.011112,27.183,0.43493,0.56477,0.1081,34.894
72,0,19.681,28.695,0.00481,2.46e-05,0.00205,0.00208,0.00616,0.01675,...,0.00734,0.00844,0.01458,0.02202,0.02022,23.047,0.46222,0.54405,0.21014,35.389
72,0,25.647,28.905,0.00528,2.66e-05,0.00191,0.00264,0.00573,0.02309,...,0.01106,0.01265,0.01963,0.03317,0.027837,24.445,0.4873,0.57794,0.33277,35.81
72,0,33.642,29.187,0.00335,2.01e-05,0.00093,0.0013,0.00278,0.01703,...,0.00679,0.00929,0.01819,0.02036,0.011625,26.126,0.47188,0.56122,0.19361,36.375
72,0,40.652,29.435,0.00353,2.29e-05,0.00119,0.00159,0.00357,0.02227,...,0.01006,0.01337,0.02263,0.03019,0.009438,22.946,0.53949,0.57243,0.195,36.87


In [4]:
dim(data)

In [5]:
set.seed(101)  # fix the seed so the same split can be reproduced
# Select 75% of the n rows as the training sample; the remaining 25% form the test set
train_idx <- sample.int(n = nrow(data), size = floor(0.75 * nrow(data)), replace = FALSE)
train <- data[train_idx, ]
test  <- data[-train_idx, ]

In [6]:
train <- as.data.frame(train)
# Columns 1-20 are the predictors; column 21 is the response, total_UPDRS
x <- as.matrix(train[, 1:20])
y <- as.matrix(train[, 21])

## LASSO

In [7]:
set.seed(100)
# alpha = 1 selects the LASSO penalty; lambda is chosen by 10-fold cross-validation
cv <- cv.glmnet(x, y, nfolds = 10, alpha = 1)

# Display the best lambda value
cv$lambda.min

In [8]:
model <- glmnet(x, y , alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)

21 x 1 sparse Matrix of class "dgCMatrix"
                         s0
(Intercept)    2.012336e+00
age            6.964683e-02
sex           -1.373079e+00
test_time      2.050973e-03
motor_UPDRS    1.224749e+00
Jitter...     -1.112536e+02
Jitter.Abs.    1.256790e+04
Jitter.RAP     1.533854e+02
Jitter.PPQ5   -3.056244e+01
Jitter.DDP     3.675900e+00
Shimmer       -1.066154e+01
Shimmer.dB.   -1.210871e+00
Shimmer.APQ3   .           
Shimmer.APQ5   7.580680e+01
Shimmer.APQ11 -4.328033e+01
Shimmer.DDA   -3.408243e+00
NHR           -2.087811e+00
HNR           -9.777607e-02
RPDE           3.087567e+00
DFA           -2.316508e+00
PPE           -4.597002e+00
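
In the coefficient table above, a `.` marks a coefficient that the LASSO penalty shrank to exactly zero (here Shimmer.APQ3). As a minimal sketch of how kept and dropped predictors could be listed programmatically, the block below fits the same kind of model on simulated data. The data and the names `x_sim`, `y_sim`, `cv_sim`, `kept` and `dropped` are hypothetical and are not part of the original analysis.

```r
library(glmnet)

# Simulated data (an assumption for illustration): only v1 and v2 affect the response
set.seed(1)
n <- 200
x_sim <- matrix(rnorm(n * 5), ncol = 5,
                dimnames = list(NULL, paste0("v", 1:5)))
y_sim <- 3 * x_sim[, "v1"] - 2 * x_sim[, "v2"] + rnorm(n)

# Cross-validated LASSO, mirroring the cv.glmnet() call used above
cv_sim <- cv.glmnet(x_sim, y_sim, nfolds = 10, alpha = 1)

# Convert the sparse coefficient matrix to a dense one-column matrix
beta <- as.matrix(coef(cv_sim, s = "lambda.min"))

# Predictors with a coefficient of exactly zero were dropped by the penalty
dropped <- rownames(beta)[beta[, 1] == 0]
kept    <- setdiff(rownames(beta), c(dropped, "(Intercept)"))

print(kept)
print(dropped)
```

Applied to the fitted `model`, the same steps would start from `as.matrix(coef(model))`.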

In [9]:
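# Test-set performance (caret's RMSE() and R2() take predictions first, then observations)
x.test <- as.matrix(test[, 1:20])
predictions <- model %>% predict(x.test) %>% as.vector()

RMSE(predictions, test[, 21])
R2(predictions, test[, 21])

In [10]:
# Train-set performance, for comparison with the test-set values
predictions <- model %>% predict(as.matrix(train[, 1:20])) %>% as.vector()
RMSE(predictions, train[, 21])
R2(predictions, train[, 21])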

The R2 values for the train and test data are similar, so the LASSO model performs well and shows no sign of overfitting.
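
The train/test comparison used here can be sketched in base R. Note that caret's `R2()` defaults to the squared correlation (`formula = "corr"`), which can stay high even when predictions are systematically biased, so the sketch also reports 1 - SSE/SST. The helper names (`rmse`, `r2_corr`, `r2_trad`, `compare_fit`) and the simulated values are assumptions for illustration, not part of the original analysis.

```r
# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Squared correlation, the definition caret's R2() uses by default
r2_corr <- function(obs, pred) cor(obs, pred)^2

# Coefficient of determination, 1 - SSE/SST
r2_trad <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical helper: one row of metrics per data split
compare_fit <- function(obs_train, pred_train, obs_test, pred_test) {
  data.frame(
    split   = c("train", "test"),
    rmse    = c(rmse(obs_train, pred_train), rmse(obs_test, pred_test)),
    r2_corr = c(r2_corr(obs_train, pred_train), r2_corr(obs_test, pred_test)),
    r2_trad = c(r2_trad(obs_train, pred_train), r2_trad(obs_test, pred_test))
  )
}

# Simulated values (not the Parkinson data): noisier predictions on the test split
set.seed(1)
obs_train  <- rnorm(300, mean = 30, sd = 10)
pred_train <- obs_train + rnorm(300, sd = 1)
obs_test   <- rnorm(100, mean = 30, sd = 10)
pred_test  <- obs_test + rnorm(100, sd = 3)

metrics <- compare_fit(obs_train, pred_train, obs_test, pred_test)
print(metrics)
```

A large gap between the train and test rows, rather than a high value in either row alone, is what would point to overfitting.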

## Decision Tree

In [11]:
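control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
tree <- train(form = total_UPDRS ~ .,
      data = train,
      method = "rpart",  # Decision Tree
      trControl = control,
      # Note: seq(.01, 0.5) steps by 1 by default, so this grid holds only cp = 0.01;
      # pass `by`, e.g. seq(0.01, 0.5, by = 0.01), to search a range of values
      tuneGrid = expand.grid(cp = seq(.01, 0.5)))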
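
The tuning grid in the call above is built with `seq(.01, 0.5)`. As a quick base-R check of what that grid contains, the sketch below compares it with a grid that has an explicit step. The step of 0.01 is an assumed choice for illustration; the source does not state what step was intended.

```r
# Grid as written: the default step of seq() is 1, so only the start value fits
grid_as_written <- seq(0.01, 0.5)
print(grid_as_written)
# [1] 0.01

# Hypothetical alternative with an explicit step (0.01 is an assumed value)
grid_with_step <- seq(0.01, 0.5, by = 0.01)
print(length(grid_with_step))
print(range(grid_with_step))
```

With the grid as written, cross-validation has only one candidate value of cp to evaluate.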

In [12]:
# Test-set performance with minbucket = 3
tree2 <- rpart(total_UPDRS ~ ., data = train, control = rpart.control(minbucket = 3, cp = 0.01))
predict2 <- predict(tree2, newdata = test)
RMSE(predict2, test$total_UPDRS)
R2(predict2, test$total_UPDRS)

In [13]:
# Test-set performance with minbucket = 5
tree2 <- rpart(total_UPDRS ~ ., data = train, control = rpart.control(minbucket = 5, cp = 0.01))
predict2 <- predict(tree2, newdata = test)
RMSE(predict2, test$total_UPDRS)
R2(predict2, test$total_UPDRS)

In [14]:
# Test-set performance with minbucket = 10
tree2 <- rpart(total_UPDRS ~ ., data = train, control = rpart.control(minbucket = 10, cp = 0.01))
predict2 <- predict(tree2, newdata = test)
RMSE(predict2, test$total_UPDRS)
R2(predict2, test$total_UPDRS)

In [15]:
# Test-set performance with minbucket = 20
tree2 <- rpart(total_UPDRS ~ ., data = train, control = rpart.control(minbucket = 20, cp = 0.01))
predict2 <- predict(tree2, newdata = test)
RMSE(predict2, test$total_UPDRS)
R2(predict2, test$total_UPDRS)

In [16]:
# Train-set performance of the last tree (minbucket = 20)
predict2 <- predict(tree2, newdata = train)
RMSE(predict2, train$total_UPDRS)
R2(predict2, train$total_UPDRS)

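Cross-validation over the complexity parameter returned cp = 0.01. Note that `seq(.01, 0.5)` uses the default step of 1, so the grid contained only that one value and cp was not searched over a range. minbucket was tuned manually, and the different minbucket values gave the same result, most likely because with cp = 0.01 the tree stops splitting before its leaves become small enough for these limits to matter.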
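
The manual minbucket comparison can be written as a single loop. The block below is a minimal sketch on simulated data, not the Parkinson data; the names `sim`, `sim_train`, `sim_test` and `results` are hypothetical. It also records the number of leaves, which makes it easy to see whether changing minbucket alters the tree at all. When only minbucket is supplied, `rpart.control()` sets minsplit to three times minbucket.

```r
library(rpart)

# Simulated regression data (an assumption for illustration)
set.seed(1)
n <- 500
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- 10 * sim$x1 + 5 * (sim$x2 > 0.5) + rnorm(n, sd = 0.5)

# 75/25 train/test split, as in the analysis above
idx <- sample.int(n, size = floor(0.75 * n))
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

# Fit one tree per minbucket value and collect test-set metrics
minbucket_values <- c(3, 5, 10, 20)
results <- do.call(rbind, lapply(minbucket_values, function(mb) {
  fit <- rpart(y ~ ., data = sim_train,
               control = rpart.control(minbucket = mb, cp = 0.01))
  pred <- predict(fit, newdata = sim_test)
  data.frame(
    minbucket = mb,
    rmse      = sqrt(mean((sim_test$y - pred)^2)),
    r2        = cor(sim_test$y, pred)^2,
    leaves    = sum(fit$frame$var == "<leaf>")  # number of terminal nodes
  )
}))
print(results)
```

If every row shows the same number of leaves and the same metrics, the cp threshold, not minbucket, is what limits the tree.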

The R2 values for the test and train data are very close, so the decision tree performs well and shows no sign of over- or underfitting.

## Random Forest

In [17]:
control <- trainControl(method = "cv", number = 3)  # 3-fold cross-validation
rf <- train(form = total_UPDRS ~ .,
      data = train,
      method = "ranger",  # Random forest (ranger implementation)
      trControl = control,
      tuneGrid = 
        expand.grid(mtry = c(2, 5, 7, 10), splitrule = "variance", min.node.size = 5))

In [19]:
# Train-set performance
predictions <- predict(rf, newdata = train)
RMSE(predictions, train[, 21])
R2(predictions, train[, 21])

In [22]:
# Test-set performance
predictions <- predict(rf, newdata = test)
RMSE(predictions, test[, 21])
R2(predictions, test[, 21])

The test RMSE is higher than the train RMSE, as expected. However, the R2 values are about 0.99 on both sets, which is suspiciously high, and we read this as a sign of overfitting.

## Stochastic Gradient Boosting

In [23]:
control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
sgb <- train(form = total_UPDRS ~ .,
      data = train,
      method = "gbm",  # Stochastic gradient boosting
      trControl = control,
      tuneGrid = 
        expand.grid(interaction.depth = c(1, 3, 5),
                    n.trees = c(100, 200, 500, 1000),
                    shrinkage = c(0.01, 0.001),
                    n.minobsinnode = 10))

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1      113.8426             nan     0.0010    0.1448
     2      113.6994             nan     0.0010    0.1433
     3      113.5549             nan     0.0010    0.1442
     4      113.4108             nan     0.0010    0.1443
     5      113.2685             nan     0.0010    0.1439
     6      113.1236             nan     0.0010    0.1433
     7      112.9789             nan     0.0010    0.1433
     8      112.8364             nan     0.0010    0.1427
     9      112.6956             nan     0.0010    0.1425
    10      112.5497             nan     0.0010    0.1427
    20      111.1391             nan     0.0010    0.1394
    40      108.4033             nan     0.0010    0.1345
    60      105.7674             nan     0.0010    0.1292
    80      103.2268             nan     0.0010    0.1244
   100      100.7868             nan     0.0010    0.1199
   120       98.4293             nan     0.0010    0.1162
   ...

In [24]:
# Test-set performance
predictions <- predict(sgb, newdata = test)
RMSE(predictions, test[, 21])
R2(predictions, test[, 21])

In [25]:
# Train-set performance
predictions <- predict(sgb, newdata = train)
RMSE(predictions, train[, 21])
R2(predictions, train[, 21])

The RMSE and R2 values are very similar; however, there seems to be an overfitting issue for this method as well.

Overall, random forest and SGB have an overfitting issue, while LASSO and the decision tree perform well without overfitting. Since the R2 value of LASSO is slightly higher than that of the decision tree, LASSO is the best method for this dataset.

## Overall Comments on Four Datasets and Four Methods

The best-performing method for each dataset was:

Covertype (classification): SGB

Bioassay (classification): Random Forest

Sports (classification): LASSO

Parkinson (regression): LASSO

Random forest overfit more often than the other methods, and the decision tree did not perform as well as the others. LASSO was reliable on all datasets, although it fell short of the best method in some cases. Each of the three classification problems had a different best-performing method. Hence, no single method is clearly better across the board, and the method should be chosen to suit the given dataset.