> Noise, or error, has varying degrees of impact on models’ predictive performance and occurs in three general forms in most data sets:
- Since many predictors are measured, they contain some level of systematic noise associated with the measurement system. Any extraneous noise in the predictors is likely to be propagated directly through the model prediction equation and results in poor performance.
- A second way noise can be introduced into the data is by the inclusion of non-informative predictors (e.g., predictors that have no relationship with the response). Some models have the ability to filter out irrelevant information, and hence their predictive performance is relatively unaffected.
- A third way noise enters the modeling process is through the response variable. As with predictors, some outcomes can be measured with a degree of systematic, unwanted noise. This type of error gives rise to an upper bound on model performance for which no pre-processing, model complexity, or tuning can overcome. For example, if a measured categorical outcome is mislabeled in the training data 10 % of the time, it is unlikely

In [1]:
library(AppliedPredictiveModeling)
data(solubility)

library(caret)
set.seed(100)
trn_id <- createFolds(solTrainY,returnTrain = T)
ctrl <- trainControl(method="cv",index=trn_id)

Loading required package: lattice
Loading required package: ggplot2


In [2]:
library(doMC)
registerDoMC()

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel


In [3]:
set.seed(100)
mtryVals <- floor(seq(10,ncol(solTrainXtrans),length=10))
mtryGrid <- data.frame(.mtry=mtryVals)

fit_rf <- train(x=solTrainXtrans,y=solTrainY,
               method="rf",tuneGrid=mtryGrid,ntree=500,importance=T,
               trControl=ctrl)
fit_rf

Random Forest 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared   MAE      
   10   0.7074042  0.8879189  0.5261409
   34   0.6558718  0.8986420  0.4762569
   58   0.6514905  0.8987308  0.4717958
   82   0.6511441  0.8984975  0.4708676
  106   0.6476819  0.8993302  0.4681766
  131   0.6492597  0.8984346  0.4679246
  155   0.6466394  0.8992383  0.4673148
  179   0.6516802  0.8975136  0.4700424
  203   0.6502154  0.8979519  0.4677562
  228   0.6529326  0.8969530  0.4694228

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 155.

In [10]:
ImpOrder <- order(fit_rf$finalModel$importance[,1],
                 decreasing=T)
top20 <- rownames(fit_rf$finalModel$importance[ImpOrder,])[1:20]
solTrainX_imp <- subset(solTrainX,select=top20)
solTestX_imp <- subset(solTestX,select=top20)

In [11]:
permute_solTrainX_imp <- apply(solTrainX_imp,2,function(x) sample(x))
solSimX <- rbind(solTrainX_imp,permute_solTrainX_imp)
groupVals <- c("Training","Random")
groupY <- factor(rep(groupVals,each=nrow(solTrainX)))

In [12]:
head(solSimX)

Unnamed: 0,MolWeight,NumCarbon,SurfaceArea2,SurfaceArea1,NumNonHBonds,NumNonHAtoms,HydrophilicFactor,NumBonds,NumHydrogen,NumAtoms,NumMultBonds,NumOxygen,NumHalogen,FP075,NumAromaticBonds,NumRotBonds,FP072,FP015,FP092,NumChlorine
661,208.28,14,25.78,25.78,18,16,-0.856,30,12,28,16,0,0,0,16,0,0,1,0,0
662,365.54,21,80.43,52.19,29,26,-0.37,52,23,49,13,1,0,1,12,4,1,1,0,0
663,206.31,13,37.3,37.3,15,15,-0.33,33,18,33,7,2,0,0,6,4,1,1,0,0
665,136.26,10,0.0,0.0,10,10,-0.96,26,16,26,2,0,0,0,0,1,0,1,0,0
668,229.75,9,53.94,53.94,15,15,-0.069,31,16,31,6,0,1,1,6,5,0,1,1,1
669,270.25,10,45.61,20.31,14,15,-0.651,31,17,32,2,1,2,1,0,5,1,1,1,2


In [13]:
rfSolClass <- train(x=solSimX,y=groupY,
                   method="rf",tuneLength=5,ntree=500,
                   control=trainControl(method="LGOCV"))

solTestGroupProbs <- predict(rfSolClass,solTestX_imp,type="prob")

In [14]:
rfSolClass

Random Forest 

1902 samples
  20 predictor
   2 classes: 'Random', 'Training' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1902, 1902, 1902, 1902, 1902, 1902, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9968267  0.9936487
   6    0.9966000  0.9931950
  11    0.9954053  0.9908038
  15    0.9952882  0.9905706
  20    0.9945396  0.9890721

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.