## Covertype

A multiclass classification problem with 54 attributes (categorical and integer) and no missing values.

Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

| Name | Data Type | Measurement | Description |
| --- | --- | --- | --- |
| Elevation | quantitative | meters | Elevation in meters |
| Aspect | quantitative | azimuth | Aspect in degrees azimuth |
| Slope | quantitative | degrees | Slope in degrees |
| Horizontal_Distance_To_Hydrology | quantitative | meters | Horizontal distance to nearest surface water features |
| Vertical_Distance_To_Hydrology | quantitative | meters | Vertical distance to nearest surface water features |
| Horizontal_Distance_To_Roadways | quantitative | meters | Horizontal distance to nearest roadway |
| Hillshade_9am | quantitative | 0 to 255 index | Hillshade index at 9am, summer solstice |
| Hillshade_Noon | quantitative | 0 to 255 index | Hillshade index at noon, summer solstice |
| Hillshade_3pm | quantitative | 0 to 255 index | Hillshade index at 3pm, summer solstice |
| Horizontal_Distance_To_Fire_Points | quantitative | meters | Horizontal distance to nearest wildfire ignition points |
| Wilderness_Area (4 binary columns) | qualitative | 0 (absence) or 1 (presence) | Wilderness area designation |
| Soil_Type (40 binary columns) | qualitative | 0 (absence) or 1 (presence) | Soil type designation |
| Cover_Type (7 types) | integer | 1 to 7 | Forest cover type designation |
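
The wilderness-area and soil-type indicators are one-hot encodings of a single qualitative variable each. When a model handles factors directly, they can be collapsed back into one column. A minimal sketch on a toy data frame, assuming every row has exactly one 1 among the indicator columns; `toy` and `collapse_one_hot` are illustrative names, not part of the original analysis:

```r
# Toy stand-in for the four Wilderness_Area indicator columns (illustrative data).
toy <- data.frame(
  Wilderness_Area1 = c(1, 0, 0, 1),
  Wilderness_Area2 = c(0, 0, 1, 0),
  Wilderness_Area3 = c(0, 1, 0, 0),
  Wilderness_Area4 = c(0, 0, 0, 0)
)

# Hypothetical helper: collapse one-hot columns sharing a prefix into one factor.
# Assumes each row has exactly one 1 among the matched columns.
collapse_one_hot <- function(df, prefix) {
  cols <- grep(paste0("^", prefix), names(df), value = TRUE)
  m <- as.matrix(df[, cols, drop = FALSE])
  stopifnot(all(rowSums(m) == 1))
  idx <- max.col(m, ties.method = "first")   # position of the 1 in each row
  suffixes <- sub(prefix, "", cols)          # "1", "2", "3", "4"
  factor(suffixes[idx], levels = suffixes)
}

wilderness <- collapse_one_hot(toy, "Wilderness_Area")
print(wilderness)
```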


In [2]:
library(glmnet)
library(tidyverse)
library(caret)
library(pROC)
library(randomForest)
library(gbm)
library(rpart)
library(rpart.plot)
library(e1071)
library(ranger)
library(dplyr)

In [3]:
covertype <- read.table("C:/Users/n__gu/Desktop/covertype.txt", header=TRUE)
dim(covertype)

The full dataset is large (roughly 581,000 rows, given that a 1% sample contains 5,810), so I will work with a 1% random sample.

In [4]:
data <- covertype %>% sample_frac(0.01)
dim(data)
str(data)

'data.frame':	5810 obs. of  55 variables:
 $ Elevation                         : int  3143 3051 3278 2982 3012 2995 2963 2840 3123 3121 ...
 $ Aspect                            : int  350 52 320 254 117 36 30 18 42 43 ...
 $ Slope                             : int  13 11 21 9 14 9 15 8 13 9 ...
 $ Horizontal_Distance_To_Hydrology  : int  30 268 335 816 30 190 644 255 90 170 ...
 $ Vertical_Distance_To_Hydrology    : int  5 13 79 82 2 18 144 78 -1 8 ...
 $ Horizontal_Distance_To_Roadways   : int  1170 1774 524 5994 845 1382 1350 3494 1648 1275 ...
 $ Hillshade_9am                     : int  197 226 161 200 244 220 214 215 222 222 ...
 $ Hillshade_Noon                    : int  218 217 215 246 226 221 206 224 210 221 ...
 $ Hillshade_3pm                     : int  159 124 191 185 108 136 124 146 121 134 ...
 $ Horizontal_Distance_To_Fire_Points: int  2266 1194 400 5749 350 1657 2648 6003 190 2072 ...
 $ Wilderness_Area1                  : int  1 0 0 1 0 0 0 1 1 1 ...
 $ Wilderness_Area2 

Train and test sets

In [5]:
set.seed(101) # fix the seed so the split is reproducible
# Randomly assign 75% of the rows to the training set and the rest to the test set

train_idx <- sample.int(n = nrow(data), size = floor(0.75 * nrow(data)), replace = FALSE)
train <- data[train_idx, ]
test  <- data[-train_idx, ]
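
With classes this unbalanced, a purely random split can leave the rare classes thinly represented in one of the two sets. A stratified split, which samples the same fraction within each class, is one way to guard against that. The sketch below uses base R on a made-up label vector; `labels_demo` and `stratified_index` are illustrative names, not objects from this notebook:

```r
set.seed(101)

# Made-up, imbalanced class labels standing in for Cover_Type.
labels_demo <- rep(1:3, times = c(80, 16, 4))

# Hypothetical helper: draw a fraction `p` of the rows within each class.
stratified_index <- function(y, p = 0.75) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(rows) {
    # Index via sample.int() so a class with a single row is handled correctly.
    rows[sample.int(length(rows), size = floor(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

strat_idx <- stratified_index(labels_demo, p = 0.75)
table(labels_demo[strat_idx])    # 60, 12 and 3 rows for classes 1, 2 and 3
table(labels_demo[-strat_idx])   # 20, 4 and 1 rows
```

caret's `createDataPartition()` provides stratified sampling of this kind for factor outcomes.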

## LASSO

We will find the best lambda parameter with cross-validation.

In [6]:
train <- as.data.frame(train)
x <- as.matrix(train[,1:54])
y <- as.matrix(as.factor(train[,55]))

In [7]:
set.seed(123) 
cv <- cv.glmnet(x, y, alpha = 1,family="multinomial")

# Display the best lambda value
cv$lambda.min



In [8]:
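# Note: family is not set here, so glmnet falls back to its default (gaussian)
# and fits a single regression on the class labels, not the multinomial model tuned above.
model <- glmnet(x, y , alpha = 1, lambda = cv$lambda.min)
# Display the fitted coefficients
coef(model)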

55 x 1 sparse Matrix of class "dgCMatrix"
                                              s0
(Intercept)                         3.330107e+00
Elevation                          -9.131917e-04
Aspect                              3.567257e-05
Slope                               3.531656e-03
Horizontal_Distance_To_Hydrology   -3.030395e-04
Vertical_Distance_To_Hydrology      1.162732e-03
Horizontal_Distance_To_Roadways     6.874047e-06
Hillshade_9am                       6.633523e-03
Hillshade_Noon                     -6.599422e-04
Hillshade_3pm                       2.967310e-03
Horizontal_Distance_To_Fire_Points  2.532896e-05
Wilderness_Area1                   -7.826716e-01
Wilderness_Area2                   -2.857832e-01
Wilderness_Area3                    .           
Wilderness_Area4                    6.173121e-01
Soil_Type1                          5.465523e-01
Soil_Type2                          6.121650e-01
Soil_Type3                         -3.508610e-01
Soil_Type4                 

In [9]:
x.test <- as.matrix(test[,1:54])
predictions <- model %>% predict(x.test) %>% as.vector()

table(test[,55])
auc(roc(response = test$Cover_Type, predictor = predictions))



  1   2   3   4   5   6   7 
568 682  88   8  22  40  45 

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


In [11]:
x.train <- as.matrix(train[,1:54])
predictions <- model %>% predict(x.train) %>% as.vector()
auc(roc(response = train$Cover_Type, predictor = predictions))

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


AUC is 0.67 on both the test and the training data, so there is no sign of over- or underfitting. However, LASSO does not perform well here. As the messages above show, `roc()` compared only class 1 (control) with class 2 (case), so this AUC says nothing about the other five classes.
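
For a seven-class response, one alternative is to keep `family = "multinomial"` for the final model and score its class probabilities with `pROC::multiclass.roc`, which averages pairwise AUCs over all class pairs. A minimal sketch on synthetic data follows; `x_demo` and `y_demo` are made up, the model is scored on its own training data for brevity, and the resulting value is not comparable to the numbers above:

```r
library(glmnet)
library(pROC)

set.seed(123)

# Synthetic three-class problem (illustrative only).
n <- 300
x_demo <- matrix(rnorm(n * 4), ncol = 4,
                 dimnames = list(NULL, paste0("v", 1:4)))
score <- cbind(x_demo[, 1], x_demo[, 2], -x_demo[, 1] - x_demo[, 2])
noise <- matrix(rnorm(n * 3, sd = 0.5), ncol = 3)
y_demo <- factor(max.col(score + noise))

# Cross-validate lambda for a multinomial LASSO (alpha = 1).
cv_demo <- cv.glmnet(x_demo, y_demo, alpha = 1, family = "multinomial")

# Class probabilities at lambda.min: an n x classes x 1 array, so take the first slice.
prob <- predict(cv_demo, newx = x_demo, s = "lambda.min", type = "response")[, , 1]

# multiclass.roc() matches probability columns to class levels by name.
colnames(prob) <- levels(y_demo)
mroc <- multiclass.roc(response = y_demo, predictor = prob)
auc_value <- as.numeric(mroc$auc)
auc_value
```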

## Decision Tree

We will tune the minimum number of observations per leaf (minbucket) and the complexity parameter (cp).

In [12]:
caret.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)
cpGrid <- expand.grid(.cp = seq(0.01, 0.5, by = 0.01))  # explicit step (0.01 is an assumed choice); seq(0.01, 0.5) alone returns only 0.01
rpart.cv <- train(Cover_Type ~ ., 
                  data = train,
                  method = "rpart",
                  trControl = caret.control,
                  tuneGrid = cpGrid)
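
A point to watch when building a grid such as `cpGrid` with `seq()`: the default step is 1, so a range narrower than 1 collapses to a single value unless `by` or `length.out` is supplied. A quick base-R check (the 0.01 step is an illustrative choice):

```r
# Without `by`, seq() steps by 1, so only the starting value fits in [0.01, 0.5].
single_value <- seq(0.01, 0.5)
length(single_value)               # 1

# With an explicit step the grid covers the whole range.
cp_values <- seq(0.01, 0.5, by = 0.01)
length(cp_values)                  # 50
range(cp_values)
```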

In [13]:
tree1 <- rpart(Cover_Type~., data=train,control = rpart.control(minbucket =5, cp= 0.01) , method = 'class')
predict1 <-predict(tree1, newdata=test, type = 'class')
table <- table(as.matrix(predict1), test$Cover_Type)
table
auc <- auc(predict1, test$Cover_Type)
auc

   
      1   2   3   4   5   6   7
  1 417 183   0   0   0   0  45
  2 151 486  18   0  21  11   0
  3   0  13  70   8   1  29   0

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


In [14]:
tree2 <- rpart(Cover_Type~., data=train,control = rpart.control(minbucket =10, cp=0.01),method = 'class')
predict2 <-predict(tree2, newdata=test, type = 'class')
table <- table(as.matrix(predict2), test$Cover_Type)
table
auc <- auc(predict2, test$Cover_Type)
auc

   
      1   2   3   4   5   6   7
  1 417 183   0   0   0   0  45
  2 151 486  18   0  21  11   0
  3   0  13  70   8   1  29   0

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


In [15]:
tree3 <- rpart(Cover_Type~., data=train,control = rpart.control(minbucket =20, cp=0.01),method = 'class')
predict3 <-predict(tree3, newdata=test, type = 'class')
table <- table(as.matrix(predict3), test$Cover_Type)
table
auc <- auc(predict3, test$Cover_Type)
auc

   
      1   2   3   4   5   6   7
  1 417 183   0   0   0   0  45
  2 151 486  18   0  21  11   0
  3   0  13  70   8   1  29   0

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


In [16]:
tree4 <- rpart(Cover_Type~., data=train,control = rpart.control(minbucket =30, cp=0.02),method = 'class')
predict4 <-predict(tree4, newdata=test, type = 'class')
table <- table(as.matrix(predict4), test$Cover_Type)
table
auc <- auc(predict4, test$Cover_Type)
auc

   
      1   2   3   4   5   6   7
  1 417 183   0   0   0   0  45
  2 151 486  24   0  21  12   0
  3   0  13  64   8   1  28   0

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


In [17]:
predict4 <-predict(tree4, newdata=train, type = 'class')
table <- table(as.matrix(predict4), train$Cover_Type)
table
auc <- auc(predict4, train$Cover_Type)
auc

   
       1    2    3    4    5    6    7
  1 1180  553    0    0    0    0  147
  2  441 1531   74    0   77   38    1
  3    0   30  176   17    2   90    0

"'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
Setting levels: control = 1, case = 2
Setting direction: controls < cases


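Among the settings tried, cp = 0.01 gives the best test result, with AUC = 0.699. Changing minbucket (5, 10, or 20) leaves the confusion matrix unchanged. On the training data, AUC is 0.694 (computed for tree4, which uses cp = 0.02), slightly below but very close to the test value. As with LASSO, these AUC values come from a two-class comparison (see the messages above), so they are rough indicators only.

This method never predicts classes 4-7, but its overall performance is better than LASSO.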
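
The failure on classes 4-7 can be quantified directly from the test-set confusion matrix of the cp = 0.01 trees. The sketch below re-enters those counts by hand (rows are predicted classes, columns are true classes, as in the `table()` calls above), adds zero rows for the classes that are never predicted, and derives overall accuracy and per-class recall:

```r
# Counts copied from the test-set table for tree1 (rows = predicted, cols = true).
classes <- as.character(1:7)
conf <- matrix(0, nrow = 7, ncol = 7,
               dimnames = list(predicted = classes, true = classes))
conf["1", ] <- c(417, 183,  0, 0,  0,  0, 45)
conf["2", ] <- c(151, 486, 18, 0, 21, 11,  0)
conf["3", ] <- c(  0,  13, 70, 8,  1, 29,  0)

accuracy <- sum(diag(conf)) / sum(conf)   # 973 correct out of 1453
recall   <- diag(conf) / colSums(conf)    # share of each true class recovered

round(accuracy, 3)
round(recall, 3)
```

On these counts the tree classifies about 67% of the test rows correctly, with recall of roughly 0.73, 0.71 and 0.80 for classes 1 to 3 and exactly 0 for classes 4-7.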

## Random Forest

The minimum node size is fixed at 5, and we will search for the best mtry value by cross-validation.

In [6]:
test$Cover_Type <- as.factor(test$Cover_Type)
train$Cover_Type <- as.factor(train$Cover_Type)

In [18]:
# Recode the class labels 1-7 as letters a-g: caret needs factor levels that are
# valid R variable names when classProbs = TRUE.
train <- train %>%
  mutate(Cover_Type = factor(Cover_Type, levels = 1:7, labels = letters[1:7]))
test <- test %>%
  mutate(Cover_Type = factor(Cover_Type, levels = 1:7, labels = letters[1:7]))

In [19]:
folds <- createMultiFolds(train$Cover_Type, k = 5, times = 3)
control <- trainControl(method = "cv", number = 5, verboseIter = TRUE,
                        classProbs = TRUE, savePredictions = TRUE, index = folds, allowParallel = TRUE)
tune_grid <- expand.grid(mtry = c(2,5,7,10), splitrule= "gini",  min.node.size = 5)
rf <- train(Cover_Type~., train, method = "ranger", tuneGrid = tune_grid, trControl = control)

+ Fold1.Rep1: mtry= 2, splitrule=gini, min.node.size=5 
- Fold1.Rep1: mtry= 2, splitrule=gini, min.node.size=5 
+ Fold1.Rep1: mtry= 5, splitrule=gini, min.node.size=5 
- Fold1.Rep1: mtry= 5, splitrule=gini, min.node.size=5 
+ Fold1.Rep1: mtry= 7, splitrule=gini, min.node.size=5 
- Fold1.Rep1: mtry= 7, splitrule=gini, min.node.size=5 
+ Fold1.Rep1: mtry=10, splitrule=gini, min.node.size=5 
- Fold1.Rep1: mtry=10, splitrule=gini, min.node.size=5 
+ Fold2.Rep1: mtry= 2, splitrule=gini, min.node.size=5 
- Fold2.Rep1: mtry= 2, splitrule=gini, min.node.size=5 
+ Fold2.Rep1: mtry= 5, splitrule=gini, min.node.size=5 
- Fold2.Rep1: mtry= 5, splitrule=gini, min.node.size=5 
+ Fold2.Rep1: mtry= 7, splitrule=gini, min.node.size=5 
- Fold2.Rep1: mtry= 7, splitrule=gini, min.node.size=5 
+ Fold2.Rep1: mtry=10, splitrule=gini, min.node.size=5 
- Fold2.Rep1: mtry=10, splitrule=gini, min.node.size=5 
+ Fold3.Rep1: mtry= 2, splitrule=gini, min.node.size=5 
- Fold3.Rep1: mtry= 2, splitrule=gini, min.node.

In [20]:
predictrf <-predict(rf, newdata=test)

In [21]:
confusionMatrix(test$Cover_Type,as.factor(predictrf))

Confusion Matrix and Statistics

          Reference
Prediction   a   b   c   d   e   f   g
         a 439 123   0   0   0   0   6
         b 101 571   4   0   1   5   0
         c   0   6  74   0   0   8   0
         d   0   0   7   1   0   0   0
         e   0  20   0   0   2   0   0
         f   0  12  15   0   0  13   0
         g  17   0   0   0   0   0  28

Overall Statistics
                                         
               Accuracy : 0.7763         
                 95% CI : (0.754, 0.7975)
    No Information Rate : 0.5038         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6323         
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: a Class: b Class: c  Class: d Class: e Class: f
Sensitivity            0.7882   0.7801  0.74000 1.0000000 0.666667 0.500000
Specificity            0.8560   0.8460  0.98965 0.9951791 0.98

In [22]:
predictrf <-predict(rf, newdata=train)
confusionMatrix(train$Cover_Type,as.factor(predictrf))

Confusion Matrix and Statistics

          Reference
Prediction    a    b    c    d    e    f    g
         a 1600   21    0    0    0    0    0
         b    2 2109    2    0    0    1    0
         c    0    5  245    0    0    0    0
         d    0    0    0   17    0    0    0
         e    0   11    2    0   66    0    0
         f    0    4    4    0    0  120    0
         g    4    0    0    0    0    0  144

Overall Statistics
                                          
               Accuracy : 0.9871          
                 95% CI : (0.9833, 0.9903)
    No Information Rate : 0.4935          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9792          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: a Class: b Class: c Class: d Class: e Class: f
Sensitivity            0.9963   0.9809  0.96838 1.000000  1.00000  0.99174

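Accuracy is 0.776 on the test data and 0.987 on the training data. This is the best test performance so far, but the large gap shows that the model overfits the training data. Note that `confusionMatrix()` was called with the true labels first and the predictions second, so the rows labeled Prediction in the output are actually the true classes. Accuracy is unaffected, but the per-class statistics should be read with that swap in mind.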
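
To make the per-class picture concrete, the test-set counts from the output above can be re-entered by hand. The sketch below assumes the rows are true classes and the columns are predictions, which follows from the order in which `confusionMatrix()` received its arguments (it expects the predictions first, then the reference labels), and derives recall and precision per class:

```r
classes <- letters[1:7]

# Test-set counts copied from the output above.
# Rows are TRUE classes, columns are PREDICTED classes (see the lead-in).
counts <- matrix(c(
  439, 123,  0, 0, 0,  0,  6,
  101, 571,  4, 0, 1,  5,  0,
    0,   6, 74, 0, 0,  8,  0,
    0,   0,  7, 1, 0,  0,  0,
    0,  20,  0, 0, 2,  0,  0,
    0,  12, 15, 0, 0, 13,  0,
   17,   0,  0, 0, 0,  0, 28
), nrow = 7, byrow = TRUE, dimnames = list(true = classes, predicted = classes))

accuracy  <- sum(diag(counts)) / sum(counts)
recall    <- diag(counts) / rowSums(counts)   # per true class
precision <- diag(counts) / colSums(counts)   # per predicted class

round(accuracy, 4)
round(rbind(recall, precision), 3)
```

The precision row reproduces the values printed as Sensitivity in the output (for example 0.7882 for class a), while the recall row tells a different story for the rare classes: class d has a recall of 1/8 = 0.125 although its printed Sensitivity is 1.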

## Stochastic Gradient Boosting

We will tune the tree depth (interaction.depth), the learning rate (shrinkage),
and the number of trees (n.trees) through cross-validation.
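
The candidate values form a full Cartesian grid, so its size grows multiplicatively with each parameter. A quick base-R count, using the same values as the `tuneGrid` in the cell that follows:

```r
# Same candidate values as the tuning grid used in this section.
grid <- expand.grid(interaction.depth = c(1, 3, 5),
                    n.trees = c(100, 200, 500, 1000),
                    shrinkage = c(0.01, 0.001),
                    n.minobsinnode = 10)

nrow(grid)        # 3 * 4 * 2 * 1 = 24 parameter combinations
head(grid, 3)
```

Each of the 24 combinations is scored on every cross-validation fold, which is worth keeping in mind before widening any of the ranges.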

In [23]:
control <- trainControl(method = "cv", number = 3)
sgb <- train(form = Cover_Type ~ .,
      data = train,
      method = "gbm", 
      trControl = control,
      tuneGrid = 
        expand.grid(interaction.depth=c(1, 3, 5), n.trees = c(100, 200, 500, 1000), shrinkage=c(0.01, 0.001), n.minobsinnode = 10)) 

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0055
     2        1.9430             nan     0.0010    0.0055
     3        1.9401             nan     0.0010    0.0055
     4        1.9372             nan     0.0010    0.0055
     5        1.9344             nan     0.0010    0.0055
     6        1.9316             nan     0.0010    0.0055
     7        1.9287             nan     0.0010    0.0054
     8        1.9259             nan     0.0010    0.0054
     9        1.9230             nan     0.0010    0.0054
    10        1.9202             nan     0.0010    0.0054
    20        1.8928             nan     0.0010    0.0051
    40        1.8411             nan     0.0010    0.0047
    60        1.7931             nan     0.0010    0.0044
    80        1.7484             nan     0.0010    0.0041
   100        1.7069             nan     0.0010    0.0038
   120        1.6682             nan     0.0010    0.0036
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0063
     2        1.9426             nan     0.0010    0.0062
     3        1.9393             nan     0.0010    0.0061
     4        1.9361             nan     0.0010    0.0061
     5        1.9329             nan     0.0010    0.0061
     6        1.9297             nan     0.0010    0.0061
     7        1.9265             nan     0.0010    0.0061
     8        1.9232             nan     0.0010    0.0060
     9        1.9201             nan     0.0010    0.0060
    10        1.9169             nan     0.0010    0.0060
    20        1.8858             nan     0.0010    0.0057
    40        1.8275             nan     0.0010    0.0052
    60        1.7738             nan     0.0010    0.0048
    80        1.7239             nan     0.0010    0.0045
   100        1.6773             nan     0.0010    0.0042
   120        1.6344             nan     0.0010    0.0039
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0065
     2        1.9424             nan     0.0010    0.0065
     3        1.9390             nan     0.0010    0.0064
     4        1.9356             nan     0.0010    0.0062
     5        1.9322             nan     0.0010    0.0064
     6        1.9288             nan     0.0010    0.0063
     7        1.9255             nan     0.0010    0.0063
     8        1.9221             nan     0.0010    0.0063
     9        1.9187             nan     0.0010    0.0062
    10        1.9154             nan     0.0010    0.0062
    20        1.8830             nan     0.0010    0.0059
    40        1.8220             nan     0.0010    0.0055
    60        1.7656             nan     0.0010    0.0051
    80        1.7138             nan     0.0010    0.0046
   100        1.6654             nan     0.0010    0.0043
   120        1.6201             nan     0.0010    0.0040
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0552
     2        1.9169             nan     0.0100    0.0526
     3        1.8888             nan     0.0100    0.0509
     4        1.8616             nan     0.0100    0.0478
     5        1.8361             nan     0.0100    0.0467
     6        1.8117             nan     0.0100    0.0448
     7        1.7879             nan     0.0100    0.0429
     8        1.7651             nan     0.0100    0.0418
     9        1.7431             nan     0.0100    0.0400
    10        1.7217             nan     0.0100    0.0387
    20        1.5450             nan     0.0100    0.0284
    40        1.3080             nan     0.0100    0.0172
    60        1.1568             nan     0.0100    0.0101
    80        1.0545             nan     0.0100    0.0066
   100        0.9831             nan     0.0100    0.0053
   120        0.9292             nan     0.0100    0.0039
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0611
     2        1.9130             nan     0.0100    0.0595
     3        1.8814             nan     0.0100    0.0563
     4        1.8519             nan     0.0100    0.0539
     5        1.8230             nan     0.0100    0.0519
     6        1.7954             nan     0.0100    0.0496
     7        1.7690             nan     0.0100    0.0474
     8        1.7429             nan     0.0100    0.0456
     9        1.7186             nan     0.0100    0.0435
    10        1.6949             nan     0.0100    0.0426
    20        1.4974             nan     0.0100    0.0311
    40        1.2368             nan     0.0100    0.0178
    60        1.0724             nan     0.0100    0.0117
    80        0.9606             nan     0.0100    0.0080
   100        0.8806             nan     0.0100    0.0057
   120        0.8215             nan     0.0100    0.0040
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0640
     2        1.9114             nan     0.0100    0.0608
     3        1.8788             nan     0.0100    0.0581
     4        1.8481             nan     0.0100    0.0565
     5        1.8175             nan     0.0100    0.0539
     6        1.7885             nan     0.0100    0.0515
     7        1.7609             nan     0.0100    0.0485
     8        1.7344             nan     0.0100    0.0481
     9        1.7085             nan     0.0100    0.0450
    10        1.6843             nan     0.0100    0.0445
    20        1.4776             nan     0.0100    0.0312
    40        1.2060             nan     0.0100    0.0182
    60        1.0350             nan     0.0100    0.0117
    80        0.9198             nan     0.0100    0.0078
   100        0.8362             nan     0.0100    0.0056
   120        0.7739             nan     0.0100    0.0037
   140        

"variable 51: Soil_Type37 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0054
     2        1.9431             nan     0.0010    0.0054
     3        1.9403             nan     0.0010    0.0054
     4        1.9375             nan     0.0010    0.0054
     5        1.9346             nan     0.0010    0.0053
     6        1.9318             nan     0.0010    0.0054
     7        1.9290             nan     0.0010    0.0053
     8        1.9262             nan     0.0010    0.0053
     9        1.9234             nan     0.0010    0.0053
    10        1.9206             nan     0.0010    0.0053
    20        1.8936             nan     0.0010    0.0050
    40        1.8426             nan     0.0010    0.0047
    60        1.7958             nan     0.0010    0.0043
    80        1.7520             nan     0.0010    0.0040
   100        1.7112             nan     0.0010    0.0037
   120        1.6728             nan     0.0010    0.0035
   140        

"variable 51: Soil_Type37 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0062
     2        1.9426             nan     0.0010    0.0063
     3        1.9393             nan     0.0010    0.0061
     4        1.9361             nan     0.0010    0.0061
     5        1.9329             nan     0.0010    0.0061
     6        1.9297             nan     0.0010    0.0061
     7        1.9265             nan     0.0010    0.0061
     8        1.9233             nan     0.0010    0.0061
     9        1.9201             nan     0.0010    0.0060
    10        1.9169             nan     0.0010    0.0060
    20        1.8860             nan     0.0010    0.0058
    40        1.8282             nan     0.0010    0.0053
    60        1.7749             nan     0.0010    0.0048
    80        1.7254             nan     0.0010    0.0044
   100        1.6795             nan     0.0010    0.0041
   120        1.6365             nan     0.0010    0.0038
   140        

"variable 51: Soil_Type37 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0065
     2        1.9424             nan     0.0010    0.0064
     3        1.9391             nan     0.0010    0.0063
     4        1.9357             nan     0.0010    0.0064
     5        1.9323             nan     0.0010    0.0063
     6        1.9289             nan     0.0010    0.0062
     7        1.9255             nan     0.0010    0.0063
     8        1.9222             nan     0.0010    0.0062
     9        1.9189             nan     0.0010    0.0061
    10        1.9155             nan     0.0010    0.0062
    20        1.8833             nan     0.0010    0.0059
    40        1.8228             nan     0.0010    0.0054
    60        1.7672             nan     0.0010    0.0050
    80        1.7156             nan     0.0010    0.0046
   100        1.6676             nan     0.0010    0.0043
   120        1.6229             nan     0.0010    0.0040
   140        

"variable 51: Soil_Type37 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0545
     2        1.9173             nan     0.0100    0.0521
     3        1.8903             nan     0.0100    0.0489
     4        1.8648             nan     0.0100    0.0476
     5        1.8398             nan     0.0100    0.0457
     6        1.8150             nan     0.0100    0.0442
     7        1.7920             nan     0.0100    0.0425
     8        1.7691             nan     0.0100    0.0409
     9        1.7474             nan     0.0100    0.0397
    10        1.7269             nan     0.0100    0.0378
    20        1.5518             nan     0.0100    0.0281
    40        1.3149             nan     0.0100    0.0167
    60        1.1634             nan     0.0100    0.0112
    80        1.0602             nan     0.0100    0.0075
   100        0.9877             nan     0.0100    0.0052
   120        0.9347             nan     0.0100    0.0040
   140        

"variable 51: Soil_Type37 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0610
     2        1.9137             nan     0.0100    0.0587
     3        1.8827             nan     0.0100    0.0553
     4        1.8531             nan     0.0100    0.0528
     5        1.8245             nan     0.0100    0.0522
     6        1.7967             nan     0.0100    0.0494
     7        1.7704             nan     0.0100    0.0471
     8        1.7454             nan     0.0100    0.0450
     9        1.7210             nan     0.0100    0.0438
    10        1.6978             nan     0.0100    0.0427
    20        1.5010             nan     0.0100    0.0298
    40        1.2420             nan     0.0100    0.0182
    60        1.0780             nan     0.0100    0.0112
    80        0.9667             nan     0.0100    0.0074
   100        0.8871             nan     0.0100    0.0054
   120        0.8290             nan     0.0100    0.0037
   140        

"variable 51: Soil_Type37 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0642
     2        1.9116             nan     0.0100    0.0597
     3        1.8790             nan     0.0100    0.0571
     4        1.8490             nan     0.0100    0.0559
     5        1.8188             nan     0.0100    0.0533
     6        1.7904             nan     0.0100    0.0518
     7        1.7629             nan     0.0100    0.0486
     8        1.7368             nan     0.0100    0.0465
     9        1.7120             nan     0.0100    0.0458
    10        1.6872             nan     0.0100    0.0438
    20        1.4817             nan     0.0100    0.0315
    40        1.2107             nan     0.0100    0.0186
    60        1.0406             nan     0.0100    0.0118
    80        0.9255             nan     0.0100    0.0075
   100        0.8430             nan     0.0100    0.0054
   120        0.7816             nan     0.0100    0.0040
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0056
     2        1.9430             nan     0.0010    0.0054
     3        1.9401             nan     0.0010    0.0056
     4        1.9372             nan     0.0010    0.0055
     5        1.9344             nan     0.0010    0.0055
     6        1.9315             nan     0.0010    0.0055
     7        1.9287             nan     0.0010    0.0054
     8        1.9258             nan     0.0010    0.0054
     9        1.9229             nan     0.0010    0.0054
    10        1.9202             nan     0.0010    0.0054
    20        1.8928             nan     0.0010    0.0052
    40        1.8412             nan     0.0010    0.0048
    60        1.7933             nan     0.0010    0.0044
    80        1.7490             nan     0.0010    0.0041
   100        1.7076             nan     0.0010    0.0038
   120        1.6687             nan     0.0010    0.0035
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0063
     2        1.9426             nan     0.0010    0.0064
     3        1.9392             nan     0.0010    0.0062
     4        1.9360             nan     0.0010    0.0062
     5        1.9327             nan     0.0010    0.0062
     6        1.9294             nan     0.0010    0.0060
     7        1.9263             nan     0.0010    0.0060
     8        1.9230             nan     0.0010    0.0062
     9        1.9198             nan     0.0010    0.0060
    10        1.9167             nan     0.0010    0.0061
    20        1.8854             nan     0.0010    0.0059
    40        1.8267             nan     0.0010    0.0054
    60        1.7726             nan     0.0010    0.0049
    80        1.7224             nan     0.0010    0.0045
   100        1.6758             nan     0.0010    0.0043
   120        1.6319             nan     0.0010    0.0039
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0010    0.0066
     2        1.9424             nan     0.0010    0.0064
     3        1.9389             nan     0.0010    0.0066
     4        1.9355             nan     0.0010    0.0065
     5        1.9320             nan     0.0010    0.0065
     6        1.9285             nan     0.0010    0.0063
     7        1.9251             nan     0.0010    0.0064
     8        1.9217             nan     0.0010    0.0063
     9        1.9183             nan     0.0010    0.0064
    10        1.9150             nan     0.0010    0.0062
    20        1.8825             nan     0.0010    0.0061
    40        1.8212             nan     0.0010    0.0055
    60        1.7647             nan     0.0010    0.0051
    80        1.7123             nan     0.0010    0.0047
   100        1.6636             nan     0.0010    0.0043
   120        1.6180             nan     0.0010    0.0041
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0553
     2        1.9172             nan     0.0100    0.0530
     3        1.8899             nan     0.0100    0.0500
     4        1.8630             nan     0.0100    0.0483
     5        1.8373             nan     0.0100    0.0465
     6        1.8125             nan     0.0100    0.0446
     7        1.7889             nan     0.0100    0.0433
     8        1.7662             nan     0.0100    0.0415
     9        1.7438             nan     0.0100    0.0400
    10        1.7225             nan     0.0100    0.0385
    20        1.5454             nan     0.0100    0.0279
    40        1.3068             nan     0.0100    0.0171
    60        1.1542             nan     0.0100    0.0103
    80        1.0510             nan     0.0100    0.0080
   100        0.9781             nan     0.0100    0.0055
   120        0.9250             nan     0.0100    0.0038
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0617
     2        1.9123             nan     0.0100    0.0593
     3        1.8812             nan     0.0100    0.0567
     4        1.8514             nan     0.0100    0.0538
     5        1.8224             nan     0.0100    0.0519
     6        1.7946             nan     0.0100    0.0501
     7        1.7675             nan     0.0100    0.0476
     8        1.7420             nan     0.0100    0.0470
     9        1.7172             nan     0.0100    0.0430
    10        1.6938             nan     0.0100    0.0423
    20        1.4944             nan     0.0100    0.0314
    40        1.2324             nan     0.0100    0.0176
    60        1.0665             nan     0.0100    0.0116
    80        0.9537             nan     0.0100    0.0080
   100        0.8728             nan     0.0100    0.0052
   120        0.8141             nan     0.0100    0.0040
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0646
     2        1.9117             nan     0.0100    0.0609
     3        1.8784             nan     0.0100    0.0580
     4        1.8473             nan     0.0100    0.0562
     5        1.8173             nan     0.0100    0.0533
     6        1.7888             nan     0.0100    0.0523
     7        1.7609             nan     0.0100    0.0495
     8        1.7340             nan     0.0100    0.0470
     9        1.7087             nan     0.0100    0.0462
    10        1.6841             nan     0.0100    0.0442
    20        1.4767             nan     0.0100    0.0314
    40        1.2021             nan     0.0100    0.0186
    60        1.0290             nan     0.0100    0.0115
    80        0.9119             nan     0.0100    0.0083
   100        0.8279             nan     0.0100    0.0055
   120        0.7651             nan     0.0100    0.0039
   140        

"variable 50: Soil_Type36 has no variation."

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.9459             nan     0.0100    0.0643
     2        1.9116             nan     0.0100    0.0610
     3        1.8794             nan     0.0100    0.0587
     4        1.8484             nan     0.0100    0.0559
     5        1.8189             nan     0.0100    0.0534
     6        1.7904             nan     0.0100    0.0512
     7        1.7631             nan     0.0100    0.0487
     8        1.7370             nan     0.0100    0.0469
     9        1.7115             nan     0.0100    0.0457
    10        1.6870             nan     0.0100    0.0434
    20        1.4824             nan     0.0100    0.0319
    40        1.2140             nan     0.0100    0.0181
    60        1.0454             nan     0.0100    0.0118
    80        0.9316             nan     0.0100    0.0080
   100        0.8503             nan     0.0100    0.0057
   120        0.7899             nan     0.0100    0.0038
   140        

In [37]:
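# Evaluate on the held-out test set. caret's confusionMatrix(data, reference) expects predictions first;
# here the actual labels are passed first, so rows labelled "Prediction" below are the actual classes.
predictions <- predict(sgb, newdata = test)
confusionMatrix(test$Cover_Type, predictions)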

Confusion Matrix and Statistics

          Reference
Prediction   a   b   c   d   e   f   g
         a 418 139   0   0   1   1   9
         b 109 563   3   0   1   6   0
         c   0   8  71   0   0   9   0
         d   0   0   6   2   0   0   0
         e   0  19   0   0   3   0   0
         f   0   7  15   0   0  18   0
         g  17   0   0   0   0   0  28

Overall Statistics
                                          
               Accuracy : 0.7591          
                 95% CI : (0.7363, 0.7809)
    No Information Rate : 0.5065          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6054          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: a Class: b Class: c Class: d Class: e Class: f
Sensitivity            0.7684   0.7649  0.74737 1.000000 0.600000  0.52941
Specificity            0.8350   0.8340  0.98748 0.99586

In [36]:
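# Repeat the evaluation on the training set to compare train and test accuracy.
# The actual labels are again passed first, so rows labelled "Prediction" below are the actual classes.
predictions <- predict(sgb, newdata = train)
confusionMatrix(train$Cover_Type, predictions)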

Confusion Matrix and Statistics

          Reference
Prediction    a    b    c    d    e    f    g
         a 1316  302    1    0    0    0    2
         b  268 1833    9    0    0    4    0
         c    0   12  236    0    0    2    0
         d    0    0    0   17    0    0    0
         e    2   31    2    0   44    0    0
         f    0   11    4    0    0  113    0
         g   18    2    0    0    0    0  128

Overall Statistics
                                          
               Accuracy : 0.8462          
                 95% CI : (0.8352, 0.8568)
    No Information Rate : 0.5029          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7494          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: a Class: b Class: c Class: d Class: e Class: f
Sensitivity            0.8204   0.8366  0.93651 1.000000  1.00000  0.94958

Accuracy is 0.76 on the test data and 0.85 on the train data. The model fits the train data better, but a gap of about 0.09 is not large enough to conclude that SGB is overfitting.
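
As a quick check on these figures, the sketch below recomputes both accuracies directly from the confusion-matrix counts printed above, using base R only. The names `test_cm`, `train_cm`, and `accuracy` are introduced here for illustration and are not part of the original notebook.

```r
# Counts copied from the two confusion matrices printed above (classes a..g).
classes <- letters[1:7]

test_cm <- matrix(c(
  418, 139,  0, 0, 1,  1,  9,
  109, 563,  3, 0, 1,  6,  0,
    0,   8, 71, 0, 0,  9,  0,
    0,   0,  6, 2, 0,  0,  0,
    0,  19,  0, 0, 3,  0,  0,
    0,   7, 15, 0, 0, 18,  0,
   17,   0,  0, 0, 0,  0, 28
), nrow = 7, byrow = TRUE, dimnames = list(classes, classes))

train_cm <- matrix(c(
  1316,  302,   1,  0,  0,   0,   2,
   268, 1833,   9,  0,  0,   4,   0,
     0,   12, 236,  0,  0,   2,   0,
     0,    0,   0, 17,  0,   0,   0,
     2,   31,   2,  0, 44,   0,   0,
     0,   11,   4,  0,  0, 113,   0,
    18,    2,   0,  0,  0,   0, 128
), nrow = 7, byrow = TRUE, dimnames = list(classes, classes))

# Accuracy is the share of observations on the diagonal (correctly classified).
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

test_acc  <- accuracy(test_cm)
train_acc <- accuracy(train_cm)
gap       <- train_acc - test_acc

print(round(c(test = test_acc, train = train_acc, gap = gap), 4))
```

The recomputed values agree with the 0.7591 and 0.8462 reported by `confusionMatrix`, and the difference between them is a little under 0.09.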

Comparing the results, Random Forest and SGB performed best, each with a test accuracy of 0.76. However, Random Forest overfits the train data. For this data set, I would say SGB performed best based on both test and train performance. An accuracy of 76% shows there is still room for improvement.
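
To see where that room for improvement lies, it can help to look at per-class results rather than overall accuracy. The sketch below is a minimal base R illustration using the test counts printed earlier. It assumes that, because `test$Cover_Type` was passed as the first argument to `confusionMatrix`, the rows of the printed table hold the actual classes and the columns hold the predicted classes. The names `test_cm`, `recall`, and `precision` are introduced here for illustration only.

```r
# Test-set counts copied from the confusion matrix above.
# Assumption: rows are actual classes, columns are predicted classes.
classes <- letters[1:7]

test_cm <- matrix(c(
  418, 139,  0, 0, 1,  1,  9,
  109, 563,  3, 0, 1,  6,  0,
    0,   8, 71, 0, 0,  9,  0,
    0,   0,  6, 2, 0,  0,  0,
    0,  19,  0, 0, 3,  0,  0,
    0,   7, 15, 0, 0, 18,  0,
   17,   0,  0, 0, 0,  0, 28
), nrow = 7, byrow = TRUE,
dimnames = list(actual = classes, predicted = classes))

# Recall: share of each actual class that the model labelled correctly.
recall <- diag(test_cm) / rowSums(test_cm)

# Precision: share of each predicted class that is actually that class.
precision <- diag(test_cm) / colSums(test_cm)

print(round(rbind(recall, precision), 3))
```

Under that assumption, the `Sensitivity` row printed by `confusionMatrix` above matches `precision` here, while `recall` is much lower for the rare classes: only 2 of the 8 actual class d observations and 3 of the 22 actual class e observations are classified correctly.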