lint readme unicode fixes

KelvynBladen · Jul 11, 2023 · a9da4af
1 parent 9a4059d
commit a9da4af

Showing 19 changed files with 198 additions and 135 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -6,3 +6,4 @@
 ^\.github$
 ^README\.Rmd$
 ^revdep$
+^CRAN-SUBMISSION$
diff --git a/CRAN-SUBMISSION b/CRAN-SUBMISSION
@@ -0,0 +1,3 @@
+Version: 0.1.0
+Date: 2023-07-10 23:07:17 UTC
+SHA: 9a4059db4b49ce0c61ee1b612a17a4227bd75f2d
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,23 +1,23 @@
 Type: Package
 Package: rfvip
-Title: Tune Random Forests Based On Variable Importance & Plot Results
+Title: Tune Random Forests Based on Variable Importance & Plot Results
 Version: 0.1.0
 Author: Kelvyn Bladen
 Maintainer: Kelvyn Bladen <kelvyn.bladen@usu.edu>
-Description: This package contains functions for assessing variable
-    relations and associations prior to modeling with a Random Forest
-    algorithm (although these are relevant for any predictive model).
+Description: Functions for assessing variable relations and associations 
+    prior to modeling with a Random Forest algorithm (although these are 
+    relevant for any predictive model).
     Metrics such as partial correlations and variance inflation factors
     are tabulated as well as plotted for the user. A function is available
-    for tuning the hyper-parameter mtry based on model performance and
-    variable importance metrics. This grid-search technique provides
+    for tuning the main Random Forest hyper-parameter based on model performance 
+    and variable importance metrics. This grid-search technique provides
     tables and plots showing the effect of mtry on each of the assessment
     metrics. It also returns each of the evaluated models to the user. The
-    package also provides superior ggplot2 variable importance plots for
-    singular models. All of the plots are developed with ggplot2 techniques 
+    package also provides superior variable importance plots for
+    individual models. All of the plots are developed 
     so that the user has the ability to edit and improve further upon the 
     plots. Derivations and methodology are described in Bladen (2022) 
-    <https://digitalcommons.usu.edu/etd/8587>.
+    <https://digitalcommons.usu.edu/etd/8587/>.
 License: GPL-3 + file LICENSE
 URL: https://github.com/KelvynBladen/rfvip
 Depends: 

diff --git a/R/data.R b/R/data.R
@@ -94,18 +94,18 @@
 #'   Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. Random Forests
 #'   for Classification in Ecology. Ecology 88(11): 2783-2792.
 #'
-#'   https://CRAN.R-project.org/package=EZtune
+#'   https://CRAN.R-project.org/package=EZtune/
 "lichen"
 
 
 #' Housing Values in Suburbs of Boston
 #'
-#' @name Boston
-#' @keywords Boston
+#' @name boston
+#' @keywords boston
 #' @docType data
 #' @description
 #' The Boston data frame has 506 rows and 14 columns.
-#' @usage Boston
+#' @usage boston
 #' @format
 #' This data frame contains the following columns:
 #'
@@ -122,12 +122,12 @@
 #'   \item{rad}{index of accessibility to radial highways.}
 #'   \item{tax}{full-value property-tax rate per $10,000.}
 #'   \item{ptratio}{pupil-teacher ratio by town.}
-#'   \item{black}{\eqn{1000(Bk−0.63)^2} where \eqn{Bk} is the proportion of
-#'   blacks by town.}
+#'   \item{black}{\eqn{1000(Bk-0.63)^2} where \eqn{Bk} is the
+#'   proportion of blacks by town.}
 #'   \item{lstat}{lower status of the population (percent).}
 #'   \item{medv}{median value of owner-occupied homes in $1000s.}
 #'   }
 #'
 #' @source
 #' https://www.stats.ox.ac.uk/pub/MASS4/
-"Boston"
+"boston"
diff --git a/R/ggvip.R b/R/ggvip.R
@@ -15,8 +15,8 @@
 #'   (1=mean decrease in accuracy or % increase in MSE, 2 = mean decrease in
 #'   node impurity or mean decrease in gini). Default is "both".
 #' @param num_var Optional argument for reducing the number of variables to the
-#'   top 'num_var'. Must be an integer between 1 and the total number of predictor
-#'   variables in the model.
+#'   top 'num_var'. Must be an integer between 1 and the total number of
+#'   predictor variables in the model.
 #' @return A ggplot dotchart showing the importance of the variables that were
 #'   plotted.
 #' @examples
@@ -52,7 +52,7 @@ ggvip <- function(x, scale = FALSE, sqrt = TRUE, type = "both", num_var) {
     d <- imp_frame %>%
       arrange(desc(get(colnames(imp_frame)[1]))) %>%
       filter(get(colnames(imp_frame)[1]) >=
-               get(colnames(imp_frame)[1])[num_var])
+        get(colnames(imp_frame)[1])[num_var])
 
     imp_frame <- imp_frame %>%
       filter(var %in% d$var)
@@ -67,10 +67,10 @@ ggvip <- function(x, scale = FALSE, sqrt = TRUE, type = "both", num_var) {
     ind <- findInterval(m, v)
 
     newr <- m / (10^(ind - 5))
-    rrr <- ceiling(newr / 10)*10
+    rrr <- ceiling(newr / 10) * 10
 
     if (newr / rrr < 3 / 4) {
-      rrr <- ceiling(newr / 4)*4
+      rrr <- ceiling(newr / 4) * 4
     }
 
     newm <- rrr * (10^(ind - 5))
@@ -116,10 +116,10 @@ ggvip <- function(x, scale = FALSE, sqrt = TRUE, type = "both", num_var) {
     ind <- findInterval(m, v)
 
     newr <- m / (10^(ind - 5))
-    rrr <- ceiling(newr/10)*10
+    rrr <- ceiling(newr / 10) * 10
 
     if (newr / rrr < 3 / 4) {
-      rrr <- ceiling(newr/4)*4
+      rrr <- ceiling(newr / 4) * 4
     }
 
     newm <- rrr * (10^(ind - 5))
@@ -133,8 +133,10 @@ ggvip <- function(x, scale = FALSE, sqrt = TRUE, type = "both", num_var) {
 
     imp_frame1 <- imp_frame
 
-    imp_frame1 <- imp_frame1[rev(do.call(base::order,
-                                         as.list(imp_frame1[1]))), ]
+    imp_frame1 <- imp_frame1[rev(do.call(
+      base::order,
+      as.list(imp_frame1[1])
+    )), ]
 
     g1 <- imp_frame %>%
       ggplot(aes_string(
@@ -158,10 +160,10 @@ ggvip <- function(x, scale = FALSE, sqrt = TRUE, type = "both", num_var) {
     ind <- findInterval(m, v)
 
     newr <- m / (10^(ind - 5))
-    rrr <- ceiling(newr/10)*10
+    rrr <- ceiling(newr / 10) * 10
 
     if (newr / rrr < 3 / 4) {
-      rrr <- ceiling(newr/4)*4
+      rrr <- ceiling(newr / 4) * 4
     }
 
     newm <- rrr * (10^(ind - 5))
@@ -175,8 +177,10 @@ ggvip <- function(x, scale = FALSE, sqrt = TRUE, type = "both", num_var) {
 
     imp_frame2 <- imp_frame
 
-    imp_frame2 <- imp_frame2[rev(do.call(base::order,
-                                         as.list(imp_frame2[2]))), ]
+    imp_frame2 <- imp_frame2[rev(do.call(
+      base::order,
+      as.list(imp_frame2[2])
+    )), ]
 
     g2 <- imp_frame %>%
       ggplot(aes_string(

diff --git a/R/mtry_compare.R b/R/mtry_compare.R
@@ -171,9 +171,9 @@ mtry_compare <- function(formula, data = NULL, scale = FALSE, sqrt = TRUE,
   ind <- findInterval(m, v)
 
   newr <- m / (10^(ind - 5))
-  rrr <- ceiling(newr/10)*10
+  rrr <- ceiling(newr / 10) * 10
 
-  rrr <- ifelse(newr / rrr < 3 / 4, ceiling(newr/4)*4, rrr)
+  rrr <- ifelse(newr / rrr < 3 / 4, ceiling(newr / 4) * 4, rrr)
 
   newm <- rrr * (10^(ind - 5))
 
@@ -204,9 +204,9 @@ mtry_compare <- function(formula, data = NULL, scale = FALSE, sqrt = TRUE,
   ind <- findInterval(m, v)
 
   newr <- m / (10^(ind - 5))
-  rrr <- ceiling(newr/10)*10
+  rrr <- ceiling(newr / 10) * 10
 
-  rrr <- ifelse(newr / rrr < 3 / 4, ceiling(newr/4)*4, rrr)
+  rrr <- ifelse(newr / rrr < 3 / 4, ceiling(newr / 4) * 4, rrr)
 
   newm <- rrr * (10^(ind - 5))
 
@@ -237,9 +237,9 @@ mtry_compare <- function(formula, data = NULL, scale = FALSE, sqrt = TRUE,
   ind <- findInterval(m, v)
 
   newr <- m / (10^(ind - 5))
-  rrr <- ceiling(newr/10)*10
+  rrr <- ceiling(newr / 10) * 10
 
-  rrr <- ifelse(newr / rrr < 3 / 4, ceiling(newr/4)*4, rrr)
+  rrr <- ifelse(newr / rrr < 3 / 4, ceiling(newr / 4) * 4, rrr)
 
   newm <- rrr * (10^(ind - 5))
 
@@ -281,9 +281,3 @@ mtry_compare <- function(formula, data = NULL, scale = FALSE, sqrt = TRUE,
 
   l
 }
-
-# library(rfvip)
-# m <- mtry_compare(formula = medv ~ ., data = Boston, num_var = 7,
-#                   mvec = c(-1.2, 3, 4, 5, 7, 9, 11.2, 13.3), sqrt = T)
-# m <- mtry_compare(formula = medv ~ ., data = Boston, sqrt = F)
-# m <- mtry_compare(formula = factor(Species) ~ ., data = iris, sqrt = TRUE)
diff --git a/R/partial_cor.R b/R/partial_cor.R
@@ -225,9 +225,3 @@ partial_cor <- function(formula, data = NULL, model = lm, num_var, ...) {
 
   l
 }
-
-# library(minerva); library(infotheo); library(entropy); library(rmi)
-# iris1 <- iris %>% filter(Species != "setosa")
-# p <- partial_cor(formula = Petal.Length ~ ., data = iris1, model = lm)
-# p1 <- partial_cor(formula = iris$Petal.Length ~ iris$Sepal.Width +
-#   iris$Sepal.Length + iris$Petal.Width)
diff --git a/R/robust_vifs.R b/R/robust_vifs.R
@@ -70,8 +70,8 @@ robust_vifs <- function(formula, data, model = randomForest,
 
     # Consider Fixes that use a test or OOB or CV error rather than
     # training Error.
-    r2 <- 1 - (sum((as.numeric(mf[, k]) - predict(r, mf[, -c(1, k)])) ^ 2) /
-               sum((as.numeric(mf[, k]) - mean(as.numeric(mf[, k])))))
+    r2 <- 1 - (sum((as.numeric(mf[, k]) - predict(r, mf[, -c(1, k)]))^2) /
+      sum((as.numeric(mf[, k]) - mean(as.numeric(mf[, k])))))
     vdf[k - 1, 4] <- 1 / (1 - r2)
     vdf[k - 1, 5] <- r2
   }
@@ -166,8 +166,3 @@ robust_vifs <- function(formula, data, model = randomForest,
 
   l
 }
-
-# library(car); library(MASS); library(rpart)
-# robust_vifs(formula = Petal.Length ~ ., data = iris[1:4],
-#             model = randomForest)
-# robust_vifs(medv ~ ., data = Boston)
diff --git a/README.Rmd b/README.Rmd
@@ -19,15 +19,33 @@ knitr::opts_chunk$set(
 [![R-CMD-check](https://github.com/KelvynBladen/rfvip/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/KelvynBladen/rfvip/actions/workflows/R-CMD-check.yaml)
 <!-- badges: end -->
 
-The goal of rfvip is to tune and select a good Random Forest model based on the accuracy and variable importance metrics associated with each model. To accomplish this, functions are available to tabulate and plot results designed to help the user select an optimal model. 
-
-This package contains functions for assessing variable relations and associations prior to modeling with a Random Forest algorithm (although these are relevant for any predictive model). Metrics such as partial correlations and variance inflation factors are tabulated as well as plotted for the user using the functions partial_cor() and robust_vifs(). 
-
-The function mtry_compare() is available for tuning the hyper-parameter mtry based on model performance and variable importance metrics. This grid-search technique provides tables and plots showing the effect of mtry on each of the assessment metrics. It also returns each of the evaluated models to the user.
-
-The package also provides superior ggplot2 variable importance plots for singular models using the function ggvip(). This function is a highly aesthetic and editable improvement upon the function randomForest::varImpPlot() and other basic importance graphics.
-
-All of the plots generated by these functions are developed with ggplot2 techniques so that the user has the ability to edit and improve further upon the plots.
+The goal of `rfvip` is to tune and select a good Random Forest model based on 
+the accuracy and variable importance metrics associated with each model. To 
+accomplish this, functions are available to tabulate and plot results designed 
+to help the user select an optimal model. 
+
+This package contains functions for assessing variable relations and 
+associations prior to modeling with a Random Forest algorithm (although these 
+are relevant for any predictive model). Metrics such as partial correlations 
+and variance inflation factors are tabulated as well as plotted for the user 
+using the functions `partial_cor` and `robust_vifs`. 
+
+The function `mtry_compare` is available for tuning the hyper-parameter mtry 
+based on model performance and variable importance metrics. This grid-search 
+technique provides tables and plots showing the effect of mtry on each of the 
+assessment metrics. It also returns each of the evaluated models to the user.
+
+The package also provides superior ggplot2 variable importance plots for 
+individual models using the function `ggvip`. This function is a highly 
+aesthetic and editable improvement upon the function `randomForest::varImpPlot`
+and other basic importance graphics.
+
+All of the plots generated by these functions are developed with ggplot2 
+techniques so that the user has the ability to edit and improve further upon 
+the plots.
+
+For methodology see "Contributions to Random Forest Variable Importance with 
+Applications in R" <https://digitalcommons.usu.edu/etd/8587/>.
 
 ## Installation
 
@@ -48,13 +66,14 @@ You can view R package's source code on GitHub: <https://github.com/KelvynBladen
 
 ## Example
 
-This is a basic example which shows you how to solve a common problem:
-
 ```{r, warning=FALSE, message=FALSE}
 library(rfvip)
 ```
 
-We will attempt to build an optimal model for the Boston housing data. This can be found in the MASS package. To begin we will run some preliminary diagnostics on our data.
+To introduce the functionality of `rfvip`, we will look at a modeling problem 
+for the Boston housing data (found in the MASS package). We will attempt to 
+build an optimal Random Forest model for accuracy and interpretability. To 
+begin we will run some preliminary diagnostics on our data.
 
 ```{r, warning=FALSE, message=FALSE}
 set.seed(1234)
@@ -66,28 +85,51 @@ rv <- robust_vifs(medv ~ ., data = MASS::Boston, model = lm)
 rv$plot_nonlin_vifs
 ```
 
-These do not look too bad with regard to collinearity. The VIFs are all less than 10. The partial correlations with the response are a type of pseudo-importance assessing the importance each variable does not share with the others. Now we tune our model across four mtry values.
+These functions assess concerns with collinearity. Notice that the VIFs from 
+`robust_vifs` are all less than 10. The partial correlations with the response 
+from `partial_cor` are a type of pseudo-importance assessing the importance 
+each variable does not share with the others. Now we tune our model by 
+assessing four different mtry values in the `mtry_compare` function.
 
 ```{r, warning=FALSE, message=FALSE}
 set.seed(1)
-m <- mtry_compare(medv ~ ., data = MASS::Boston, sqrt = TRUE, 
-                  mvec = c(1,4,9,13), num_var = 7)
+m <- mtry_compare(medv ~ .,
+  data = MASS::Boston, sqrt = TRUE,
+  mvec = c(1, 4, 9, 13), num_var = 7
+)
 m$gg_model_errors
 m$model_errors
 ```
 
-According to the accuracy plot and table, our best choice is when mtry is 4. However, the accuracy for the best model is very similar to two of the other models. We now look at the variable importance metrics across the different models.
+According to the accuracy plot and table above, our best choice is when mtry is 
+4. However, the accuracy for the best model is notably very similar to two of 
+the other models. We now look at the variable importance metrics across the 
+different models.
 
 ```{r, warning=FALSE, message=FALSE}
 m$gg_var_imp_error
 ```
 
-The top two variables are consistent. However, the variables 'nox' and 'dis' switch order as mtry increases. Common sense suggests that pollution(nox) is correlated with distance to employment centres(dis). It can be assumed that most home buyers consider location to work more than pollution when selecting a house. Therefore, 'dis' is likely a more casual driving of price than 'nox'. Consequently, the model where mtry is 9 appears to be superior to the model where mtry is 4 (even if it is slightly more accurate). 
+The top two variables are consistently identified as more important than the 
+other variables and their order remains unchanged across mtry. However, the 
+variables 'nox' and 'dis' switch order as mtry increases. Common sense suggests 
+that pollution (nox) is correlated with distance to employment centers (dis). 
+Our common sense leads us to assume that most home buyers consider distance to 
+work more than pollution when selecting a house. Therefore, 'dis' is likely a 
+more causal driver of price than 'nox'. Consequently, the model where mtry is 9 
+appears to be superior to the model where mtry is 4, despite mtry of 4 yielding 
+slightly more accurate results.
 
-We now take our selected model and build individual importance plots for it.
+We now take our selected model and build individual importance plots for it 
+using `ggvip`.
 
 ```{r, warning=FALSE, message=FALSE}
 g <- ggvip(m$rf9)$both_vips
 ```
 
-Looks great. We have used variable importance and accuracy metrics to pick a solid model for prediction and with reasonably useful importance values.
+The plot above resembles a standard variable importance plot, but possesses 
+superior tick labels and editing capabilities for the analyst. 
+
+We have used the `rfvip` package to tune a strong model for prediction and with 
+reasonably useful importance values. This was accomplished by assessing 
+variable importance and accuracy metrics across the hyper-parameter mtry.