vignette and project name

KelvynBladen · Jul 14, 2023 · e2dbac6
1 parent a9da4af
commit e2dbac6

Showing 10 changed files with 112 additions and 99 deletions.
diff --git a/README.Rmd b/README.Rmd
@@ -68,62 +68,65 @@ You can view R package's source code on GitHub: <https://github.com/KelvynBladen
 
 ```{r, warning=FALSE, message=FALSE}
 library(rfvip)
+library(MASS)
+library(EZtune)
 ```
 
-To introduce the functionality of `rfvip`, we will look at a modeling problem 
-for the Boston housing data (found in the MASS package). We will attempt to 
-build an optimal Random Forest model for accuracy and interpretability. To 
-begin we will run some preliminary diagnostics on our data.
+To introduce the functionality of `rfvip`, we look at modeling the Boston 
+housing data (found in the MASS package). We want to 
+build a Random Forest model with a view towards both accuracy and 
+interpretability. We begin by running some preliminary diagnostics on our data.
 
-```{r, warning=FALSE, message=FALSE}
+```{r, warning=FALSE, message=FALSE, fig.width=3, fig.height=3, fig.align='center'}
 set.seed(1234)
 
-pcs <- partial_cor(medv ~ ., data = MASS::Boston, model = lm)
+pcs <- partial_cor(medv ~ ., data = Boston, model = lm)
 pcs$plot_y_part_cors
 
-rv <- robust_vifs(medv ~ ., data = MASS::Boston, model = lm)
+rv <- robust_vifs(medv ~ ., data = Boston, model = lm)
 rv$plot_nonlin_vifs
 ```
 
 These functions assess concerns with collinearity. Notice that the VIFs from 
-`robust_vifs`are all less than 10. The partial correlations with the response 
+`robust_vifs` are all less than 10. The partial correlations with the response 
 from `partial_cor` are a type of pseudo-importance assessing the importance 
 each variable does not share with the others. Now we tune our model by 
 assessing four different mtry values in the `mtry_compare` function.
 
-```{r, warning=FALSE, message=FALSE}
+```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
 set.seed(1)
 m <- mtry_compare(medv ~ .,
-  data = MASS::Boston, sqrt = TRUE,
+  data = Boston, sqrt = TRUE,
   mvec = c(1, 4, 9, 13), num_var = 7
 )
 m$gg_model_errors
 m$model_errors
 ```
 
 According to the accuracy plot and table above, our best choice is when mtry is 
-4. However, the accuracy for the best model is notably very similar to two of 
-the other models. We now look at the variable importance metrics across the 
-different models.
+4. However, the accuracy for the best model is notably only slightly better than 
+the models with mtry set to 9 and 13. We now look at the variable importance 
+metrics across the different models.
 
-```{r, warning=FALSE, message=FALSE}
+```{r, warning=FALSE, message=FALSE, fig.width=6, fig.height=5, fig.align='center'}
 m$gg_var_imp_error
 ```
 
 The top two variables are consistently identified as more important than the 
 other variables and their order remains unchanged across mtry. However, the 
-variables 'nox' and 'dis' switch order as mtry increases. Common sense suggests 
-that pollution (nox) is correlated with distance to employment centers (dis). 
-Our common sense leads us to assume that most home buyers consider distance to 
-work more than pollution when selecting a house. Therefore, 'dis' is likely a 
-more causal driver of price than 'nox'. Consequently, the model where mtry is 9 
-appears to be superior to the model where mtry is 4, despite mtry of 4 yielding 
-slightly more accurate results.
+variables 'nox' and 'dis' switch order as mtry increases. Pollution (nox) has a 
+strong negative correlation with distance to employment centers (dis). This 
+makes sense if the employment centers are responsible for much of the pollution.
+If many home buyers consider distance to 
+work more important than pollution when selecting a house, 'dis' is more likely
+to be a causal driver of price than 'nox'. By this reasoning, the model where 
+mtry is 9 appears to be superior to the model where mtry is 4, despite mtry of 4 
+yielding slightly more accurate results.
 
 We now take our selected model and build individual importance plots for it 
 using `ggvip`.
 
-```{r, warning=FALSE, message=FALSE}
+```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
 g <- ggvip(m$rf9)$both_vips
 ```
 

diff --git a/README.md b/README.md
@@ -62,33 +62,35 @@ You can view R package’s source code on GitHub:
 
 ``` r
 library(rfvip)
+library(MASS)
+library(EZtune)
 ```
 
-To introduce the functionality of `rfvip`, we will look at a modeling
-problem for the Boston housing data (found in the MASS package). We will
-attempt to build an optimal Random Forest model for accuracy and
-interpretability. To begin we will run some preliminary diagnostics on
+To introduce the functionality of `rfvip`, we look at modeling the
+Boston housing data (found in the MASS package). We want to build a
+Random Forest model with a view towards both accuracy and
+interpretability. We begin by running some preliminary diagnostics on
 our data.
 
 ``` r
 set.seed(1234)
 
-pcs <- partial_cor(medv ~ ., data = MASS::Boston, model = lm)
+pcs <- partial_cor(medv ~ ., data = Boston, model = lm)
 pcs$plot_y_part_cors
 ```
 
-<img src="man/figures/README-unnamed-chunk-3-1.png" width="100%" />
+<img src="man/figures/README-unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" />
 
 ``` r
 
-rv <- robust_vifs(medv ~ ., data = MASS::Boston, model = lm)
+rv <- robust_vifs(medv ~ ., data = Boston, model = lm)
 rv$plot_nonlin_vifs
 ```
 
-<img src="man/figures/README-unnamed-chunk-3-2.png" width="100%" />
+<img src="man/figures/README-unnamed-chunk-3-2.png" width="100%" style="display: block; margin: auto;" />
 
 These functions assess concerns with collinearity. Notice that the VIFs
-from `robust_vifs`are all less than 10. The partial correlations with
+from `robust_vifs` are all less than 10. The partial correlations with
 the response from `partial_cor` are a type of pseudo-importance
 assessing the importance each variable does not share with the others.
 Now we tune our model by assessing four different mtry values in the
@@ -97,13 +99,13 @@ Now we tune our model by assessing four different mtry values in the
 ``` r
 set.seed(1)
 m <- mtry_compare(medv ~ .,
-  data = MASS::Boston, sqrt = TRUE,
+  data = Boston, sqrt = TRUE,
   mvec = c(1, 4, 9, 13), num_var = 7
 )
 m$gg_model_errors
 ```
 
-<img src="man/figures/README-unnamed-chunk-4-1.png" width="100%" />
+<img src="man/figures/README-unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
 
 ``` r
 m$model_errors
@@ -115,26 +117,27 @@ m$model_errors
 ```
 
 According to the accuracy plot and table above, our best choice is when
-mtry is 4. However, the accuracy for the best model is notably very
-similar to two of the other models. We now look at the variable
-importance metrics across the different models.
+mtry is 4. However, the accuracy for the best model is notably only
+slightly better than the models with mtry set to 9 and 13. We now look
+at the variable importance metrics across the different models.
 
 ``` r
 m$gg_var_imp_error
 ```
 
-<img src="man/figures/README-unnamed-chunk-5-1.png" width="100%" />
+<img src="man/figures/README-unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
 
 The top two variables are consistently identified as more important than
 the other variables and their order remains unchanged across mtry.
 However, the variables ‘nox’ and ‘dis’ switch order as mtry increases.
-Common sense suggests that pollution (nox) is correlated with distance
-to employment centers (dis). Our common sense leads us to assume that
-most home buyers consider distance to work more than pollution when
-selecting a house. Therefore, ‘dis’ is likely a more causal driver of
-price than ‘nox’. Consequently, the model where mtry is 9 appears to be
-superior to the model where mtry is 4, despite mtry of 4 yielding
-slightly more accurate results.
+Pollution (nox) has a strong negative correlation with distance to
+employment centers (dis). This makes sense if the employment centers are
+responsible for much of the pollution. If many home buyers consider
+distance to work more important than pollution when selecting a house,
+‘dis’ is more likely to be a causal driver of price than ‘nox’. By this
+reasoning, the model where mtry is 9 appears to be superior to the model
+where mtry is 4, despite mtry of 4 yielding slightly more accurate
+results.
 
 We now take our selected model and build individual importance plots for
 it using `ggvip`.
@@ -143,7 +146,7 @@ it using `ggvip`.
 g <- ggvip(m$rf9)$both_vips
 ```
 
-<img src="man/figures/README-unnamed-chunk-6-1.png" width="100%" />
+<img src="man/figures/README-unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
 
 The plot above resembles a standard variable importance plot, but
 possesses superior tick labels and editing capabilities for the analyst.

diff --git a/man/figures/README-unnamed-chunk-3-1.png b/man/figures/README-unnamed-chunk-3-1.png
diff --git a/man/figures/README-unnamed-chunk-3-2.png b/man/figures/README-unnamed-chunk-3-2.png
diff --git a/man/figures/README-unnamed-chunk-4-1.png b/man/figures/README-unnamed-chunk-4-1.png
diff --git a/man/figures/README-unnamed-chunk-5-1.png b/man/figures/README-unnamed-chunk-5-1.png
diff --git a/man/figures/README-unnamed-chunk-6-1.png b/man/figures/README-unnamed-chunk-6-1.png
diff --git a/misc/pdp_compare.R b/misc/pdp_compare.R
@@ -41,18 +41,22 @@ pdp_compare <- function(x = Lo.rf, vars,
     im <- im[(length(im)-2):length(im)]
   }
 
-  # if(inherits(x,"gbm")){
-  #   n.trees = x$n.trees
-  #   im = gbm::relative.influence(x)
-  # }
+  if(inherits(x,"gbm")){
+    im = as.data.frame(gbm::relative.influence(x))
+    colnames(im) = "importance"
+    im$var = rownames(im)
+  }
 
   vvec <- colnames(model_frame)
 
   pd_num <- NULL
   pd_fac <- NULL
   for (i in vvec) {
-    tmp <- pdp::partial(x, pred.var = i, which.class = which.class,
-                   prob = prob)#, ...)
+    ifelse(inherits(x,"gbm"),
+      tmp <- pdp::partial(x, pred.var = i, which.class = which.class,
+                          prob = prob, n.trees = x$n.trees),
+      tmp <- pdp::partial(x, pred.var = i, which.class = which.class,
+                          prob = prob))#, ...)
     names(tmp) <- c("x", "y")
 
     if(inherits(tmp$x, "numeric")){
@@ -70,7 +74,7 @@ pdp_compare <- function(x = Lo.rf, vars,
                 y[floor(length(y)*trim)+1],
               sd = sd(y),
               mad = stats::mad(y, center = mean(y))) %>%
-    arrange(desc(trim_range))
+    arrange(desc(trim_range), desc(sd))
 
   if(exists("im")){
     imp <- dplyr::left_join(imp, im, by = c("var" = "var"))
@@ -197,6 +201,7 @@ mtcars.rf <- randomForest(formula = mpg ~ ., data = mtcars)
 # p <- partial(mtcars.rf, pred.var = c("drat"))
 # plotPartial(p)
 car_pd <- pdp_compare(x = mtcars.rf)
+car_pd$imp
 car_pd$full
 car_pd$drat
 car_pd$cyl

diff --git a/rfvip.Rproj → randomForestVIP.Rproj b/rfvip.Rproj → randomForestVIP.Rproj