> Noise, or error, has varying degrees of impact on models’ predictive performance and occurs in three general forms in most data sets:
- Since many predictors are measured, they contain some level of systematic noise associated with the measurement system. Any extraneous noise in the predictors is likely to be propagated directly through the model prediction equation and results in poor performance.
- A second way noise can be introduced into the data is by the inclusion of non-informative predictors (e.g., predictors that have no relationship with the response). Some models have the ability to filter out irrelevant information, and hence their predictive performance is relatively unaffected.
- A third way noise enters the modeling process is through the response variable. As with predictors, some outcomes can be measured with a degree of systematic, unwanted noise. This type of error gives rise to an upper bound on model performance for which no pre-processing, model complexity, or tuning can overcome. For example, if a measured categorical outcome is mislabeled in the training data 10% of the time, it is unlikely
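
The third kind of noise can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the solubility set: the sample size, the 50/50 class balance, and the seed are all assumptions chosen for illustration. It scores an "oracle" that always predicts the true class against labels that were recorded wrongly 10% of the time.

```r
# Simulated illustration only; none of these values come from the solubility data.
set.seed(1)                                    # arbitrary seed so the sketch is repeatable
n <- 10000                                     # hypothetical number of samples
true_class <- rbinom(n, size = 1, prob = 0.5)  # hypothetical true labels (0/1)
flipped <- runif(n) < 0.10                     # each label is recorded wrongly with probability 0.10
observed_class <- ifelse(flipped, 1 - true_class, true_class)

# An "oracle" that always predicts the true label still disagrees with the
# recorded label wherever that label was flipped.
oracle_accuracy <- mean(true_class == observed_class)
oracle_accuracy
```

Because the flips are random, the oracle's accuracy against the recorded labels comes out at one minus the share of flipped labels, roughly 0.9 here. This is the kind of ceiling the quoted passage describes.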

In [1]:
library(AppliedPredictiveModeling)
data(solubility)

library(caret)
set.seed(100)
# Fix the cross-validation training folds up front so the resamples are reproducible
trn_id <- createFolds(solTrainY,returnTrain=TRUE)
ctrl <- trainControl(method="cv",index=trn_id)

Loading required package: lattice
Loading required package: ggplot2


In [2]:
library(doMC)
registerDoMC()

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel


In [None]:
set.seed(100)
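# Ten mtry candidates, evenly spaced from 10 up to the number of predictors
mtryVals <- floor(seq(10,ncol(solTrainXtrans),length.out=10))
mtryGrid <- data.frame(.mtry=mtryVals)

# importance=TRUE makes randomForest store permutation-based importance scores
fit_rf <- train(x=solTrainXtrans,y=solTrainY,
               method="rf",tuneGrid=mtryGrid,ntree=500,importance=TRUE,
               trControl=ctrl)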
fit_rf

In [None]:
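# order() returns row positions, so use them to index the importance matrix's row names
imp <- fit_rf$finalModel$importance
ImpOrder <- order(imp[,1],decreasing=TRUE)
top20 <- rownames(imp)[ImpOrder[1:20]]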
solTrainX_imp <- subset(solTrainX,select=top20)
solTestX_imp <- subset(solTestX,select=top20)
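
Picking the top predictors by importance depends on how `order()` behaves: it returns integer row positions, not names. The sketch below shows this on a toy importance matrix. `toy_imp`, its row names, and its scores are hypothetical and only there to make the example self-contained.

```r
# Hypothetical importance matrix; the row names and scores are made up.
toy_imp <- matrix(c(0.2, 1.5, 0.7), ncol = 1,
                  dimnames = list(c("predA", "predB", "predC"), "%IncMSE"))

ord <- order(toy_imp[, 1], decreasing = TRUE)  # integer row positions, largest score first
rownames(ord)                                  # NULL: a plain vector has no row names
top2 <- rownames(toy_imp)[ord[1:2]]            # map the positions back to predictor names
top2
```

Indexing the matrix's own row names with the ordered positions gives the predictor names in descending order of importance.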

In [None]:
permute_solTrainX_imp <- apply(solTrainX_imp,2,function(x) sample(x))
solSimX <- rbind(solTrainX_imp,permute_solTrainX_imp)
groupVals <- c("Training","Random")
groupY <- factor(rep(groupVals,each=nrow(solTrainX)))
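
Shuffling each column independently, as `apply(..., 2, function(x) sample(x))` does above, keeps every predictor's own distribution but breaks the relationships between predictors. A minimal sketch on simulated data follows. The two correlated columns, the sample size, and the seed are assumptions made for illustration.

```r
# Simulated illustration only; x1 and x2 are hypothetical, strongly correlated predictors.
set.seed(1)                                        # arbitrary seed so the sketch is repeatable
toy_x1 <- rnorm(200)
toy <- data.frame(x1 = toy_x1, x2 = toy_x1 + rnorm(200, sd = 0.1))

toy_perm <- apply(toy, 2, function(x) sample(x))   # shuffle each column on its own

# Each column still holds exactly the same values, only in a different order
same_values <- all(sort(toy_perm[, "x1"]) == sort(toy$x1))

# The between-column correlation is what the shuffle removes
cor_before <- cor(toy$x1, toy$x2)
cor_after <- cor(toy_perm[, "x1"], toy_perm[, "x2"])
c(before = cor_before, after = cor_after)
```

A classifier trained to tell the original rows from the shuffled ones can therefore only rely on the joint structure of the predictors, since the marginal distributions are identical in both groups.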

In [None]:
rfSolClass <- train(x=solSimX,y=groupY,
                   method="rf",tuneLength=5,ntree=500,
                   trControl=trainControl(method="LGOCV"))
solTestGroupProbs <- predict(rfSolClass,solTestX_imp,type="prob")
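
One possible way to summarize the predicted class probabilities is the share of test samples that the classifier assigns to the "Training" group. The sketch below uses `toy_probs`, a hypothetical stand-in for `solTestGroupProbs` with made-up values. It assumes the probability columns are named after the factor levels (`Random`, `Training`), and the 0.5 cutoff is likewise an assumption of this sketch.

```r
# Hypothetical stand-in for solTestGroupProbs; the probabilities are made up.
toy_probs <- data.frame(Random   = c(0.10, 0.65, 0.30, 0.05),
                        Training = c(0.90, 0.35, 0.70, 0.95))

# Share of samples that look more like the training set than like permuted data.
# The 0.5 cutoff is an assumption of this sketch, not something the notebook sets.
looks_like_training <- toy_probs$Training > 0.5
mean(looks_like_training)
```

A high share would suggest that the test samples resemble the training data on the selected predictors. A low share would suggest that many test samples fall outside the region the training set covers.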