In [4]:
# Import necessary libraries (functions.R provides intervalScore and feature_selection)
library('tidyverse')
library('dplyr')
library('forcats')
library('quantregForest')
library('kableExtra')
source('functions.R')

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: randomForest

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:dplyr’:

    combine


The following object is masked from ‘package:ggplot2’:

    margin


Loading required package: RColorBrewer


Attaching package: ‘kableExt

## 05d-RandomForest
WARNING: This document takes more than 5 minutes to run (up to an hour, depending on computing power).

Below we write a custom function that performs k-fold cross-validation on a quantile regression forest and reports, for each fold, the interval score, average interval length, and coverage of the 50% and 80% prediction intervals.

In [3]:
#' @description
#' Perform k-fold cross-validation on a quantile regression forest
#'
#' @param Kfold integer, the number of folds to perform k-fold cv over
#' @param seed integer, a number to set randomness for reproduction
#' @param datafr a dataframe to train and test the model on; price must be the
#'   first column, all other columns are used as predictors
#'
#' @return a list with one summary matrix per fold (level, average interval
#'   length, interval score, coverage), returned invisibly; each summary is
#'   also printed

kFoldForest = function(Kfold, seed, datafr)
{ set.seed(seed)
  n = nrow(datafr)
  iperm <<- sample(n) # random row permutation; set as global for debugging check
  nhold = round(n/Kfold) # fold size
  # pred_y is unused, but this sample() call advances the RNG, so removing it
  # would change the fitted forests for a given seed
  pred_y = sample(n-nhold)
  results = list()
  for(k in 1:Kfold){
    # row indices of fold k; the last fold absorbs any remainder
    ilow = (k-1)*nhold+1
    ihigh = k*nhold
    if(k==Kfold) { ihigh = n }
    ifold = iperm[ilow:ihigh]

    holdo = datafr[ifold,]
    train = datafr[-ifold,]
    train$price = log(train$price) # fit on the log scale

    # column 1 is assumed to be price
    qRF = quantregForest(train[,-1],train[,1],ntree = 100)
    predRF = predict(qRF, what=c(.1,.25,.5,.75,.9), newdata=holdo[,-1])
    # back-transform to the price scale; columns are (median, lower, upper)
    preds50 = cbind(exp(predRF[,3]),exp(predRF[,2]),exp(predRF[,4]))
    preds80 = cbind(exp(predRF[,3]),exp(predRF[,1]),exp(predRF[,5]))

    # holdo$price is still on the original scale
    IS50qRF = intervalScore(preds50,holdo$price,0.5)
    IS80qRF = intervalScore(preds80,holdo$price,0.8)

    outqRF = rbind(IS50qRF$summary,IS80qRF$summary)
    colnames(outqRF) = c("level","avgleng","IS","cover")
    print(outqRF)
    results[[k]] = outqRF
  }
  invisible(results)
}
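
The fold bookkeeping in `kFoldForest` can be checked in isolation. The sketch below repeats the same index arithmetic on made-up values of `n` and `Kfold` (assumptions chosen only for illustration, not the real data size) and confirms that the folds partition the rows, with the last fold taking the remainder.

```r
# Illustration only: n and Kfold are made-up values, not the real data size.
n <- 10
Kfold <- 3
set.seed(123)
iperm <- sample(n)          # random permutation of the row indices
nhold <- round(n / Kfold)   # nominal fold size

folds <- lapply(1:Kfold, function(k) {
  ilow <- (k - 1) * nhold + 1
  ihigh <- if (k == Kfold) n else k * nhold  # last fold absorbs the remainder
  iperm[ilow:ihigh]
})

fold_sizes <- lengths(folds)          # 3 3 4
all_rows <- sort(unlist(folds))       # every row appears exactly once
print(fold_sizes)
```

Because the last fold runs to `n`, fold sizes can differ slightly whenever `n` is not a multiple of `Kfold`.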

### Running Kfold CV
Below I load the data and run k-fold CV on the full set of predictors, then on a reduced set. Some variables are excluded from the full set to improve training (variables were selected for low performance, such as paint_color).

To improve computation speed, this was run locally in RStudio and NOT in this notebook.

In [5]:
# load the training data
# state is removed since the forest can't handle >32 categories
training_set = readRDS('04a-wrangledTrain.rds') %>% select(-c(state,size,paint_color))

In [None]:
kFoldForest(Kfold = 3,seed = 123,datafr = training_set)

In [None]:
sub = feature_selection(training_set)
kFoldForest(Kfold = 3, seed = 123, datafr = sub)

The results heavily favoured the model with the full predictor set (lower interval scores are better): its average interval score was 10484 on the 50% interval and 16176 on the 80% interval, compared to 10995 and 22409 respectively for the reduced set. For a detailed report of the fold values see the table in the report document.
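
To make the comparison concrete, here is a minimal sketch of how an interval score is conventionally computed. `interval_score_sketch` is a hypothetical stand-in, not the `intervalScore` from `functions.R` (which is not shown in this notebook); it assumes the standard definition, where the score is the interval width plus a penalty of 2/alpha times the distance by which the observation falls outside the interval, with alpha = 1 - level.

```r
# Hypothetical stand-in for intervalScore() from functions.R (not shown here).
# pred:   matrix with columns (point prediction, lower bound, upper bound)
# actual: observed values
# level:  nominal coverage of the interval, e.g. 0.5 or 0.8
interval_score_sketch <- function(pred, actual, level) {
  alpha <- 1 - level
  lower <- pred[, 2]
  upper <- pred[, 3]
  width <- upper - lower
  below <- pmax(lower - actual, 0)  # distance below the interval, else 0
  above <- pmax(actual - upper, 0)  # distance above the interval, else 0
  score <- width + (2 / alpha) * (below + above)
  covered <- actual >= lower & actual <= upper
  c(level = level, avgleng = mean(width), IS = mean(score), cover = mean(covered))
}

# Toy data: the first observation is covered, the second misses by 5.
pred <- cbind(c(10, 20), c(8, 15), c(12, 25))
actual <- c(11, 30)
out <- interval_score_sketch(pred, actual, 0.5)
print(out)  # avgleng 7, IS 17, cover 0.5
```

Under this definition a model is rewarded for narrow intervals and penalised for misses, which is why a lower average score indicates the better model.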

## Fitting the full model
Below I fit the model on the whole training set and save it to disk.

In [5]:
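# Fit on the whole training set (column 1 is assumed to be price).
# Note: price is on its original scale here, whereas kFoldForest() fits on log(price).
qRF = quantregForest(training_set[,-1],training_set[,1],ntree = 100)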

In [6]:
saveRDS(qRF, "05d-RF.rds")

In [19]:
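# Predicting on the holdout set.
# NOTE: holdo is not created at top level in this notebook (it is local to
# kFoldForest), so the holdout data is assumed to be loaded in the session already.
# This cell refits on the reduced predictor set sub, replacing the qRF fitted above.
sub_log = sub
sub_log$price = log(sub_log$price) # log scale, without modifying sub on re-runs
qRF = quantregForest(sub_log[,-1],sub_log[,1],ntree = 100)
predRF = predict(qRF, what=c(.1,.25,.5,.75,.9), newdata=holdo[,-1])
# back-transform to the price scale; columns are (median, lower, upper)
preds50 = cbind(exp(predRF[,3]),exp(predRF[,2]),exp(predRF[,4]))
preds80 = cbind(exp(predRF[,3]),exp(predRF[,1]),exp(predRF[,5]))
IS50qRF = intervalScore(preds50,holdo$price,0.5)
IS80qRF = intervalScore(preds80,holdo$price,0.8)
outqRF = rbind(IS50qRF$summary,IS80qRF$summary)
colnames(outqRF) = c("level","avgleng","IS","cover")
print(outqRF)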

     level  avgleng       IS     cover
[1,]   0.5 11346.96 18694.62 0.6035763
[2,]   0.8 23133.01 29568.49 0.8753025
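
The cells above fit the forest on log(price) and then exponentiate the predicted quantiles. That back-transformation works because quantiles are preserved under increasing transformations, which the following sketch illustrates on simulated data (`log_price` is made up for the example and is not the car data).

```r
# Illustration on simulated data; log_price is made up, not the car data.
set.seed(1)
log_price <- rnorm(1001, mean = 9, sd = 0.5)
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# type = 1 returns an order statistic, so no interpolation is involved
q_then_exp <- exp(quantile(log_price, probs, type = 1))
exp_then_q <- quantile(exp(log_price), probs, type = 1)
same_quantiles <- isTRUE(all.equal(q_then_exp, exp_then_q))

# The mean does not commute with exp(): by Jensen's inequality,
# exp(mean(log_price)) is smaller than mean(exp(log_price)).
mean_is_biased <- exp(mean(log_price)) < mean(exp(log_price))
print(c(same_quantiles, mean_is_biased))
```

This is why interval bounds can be exponentiated directly, whereas a mean predicted on the log scale would need a bias correction before being reported as a price.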
