vignette and project name
KelvynBladen committed Jul 14, 2023
1 parent a9da4af commit e2dbac6
Showing 10 changed files with 112 additions and 99 deletions.
47 changes: 25 additions & 22 deletions README.Rmd
@@ -68,62 +68,65 @@ You can view R package's source code on GitHub: <https://github.com/KelvynBladen

```{r, warning=FALSE, message=FALSE}
library(rfvip)
library(MASS)
library(EZtune)
```

To introduce the functionality of `rfvip`, we will look at a modeling problem
for the Boston housing data (found in the MASS package). We will attempt to
build an optimal Random Forest model for accuracy and interpretability. To
begin we will run some preliminary diagnostics on our data.
To introduce the functionality of `rfvip`, we look at modeling the Boston
housing data (found in the MASS package). We want to
build a Random Forest model with a view towards both accuracy and
interpretability. We begin by running some preliminary diagnostics on our data.

```{r, warning=FALSE, message=FALSE}
```{r, warning=FALSE, message=FALSE, fig.width=3, fig.height=3, fig.align='center'}
set.seed(1234)
pcs <- partial_cor(medv ~ ., data = MASS::Boston, model = lm)
pcs <- partial_cor(medv ~ ., data = Boston, model = lm)
pcs$plot_y_part_cors
rv <- robust_vifs(medv ~ ., data = MASS::Boston, model = lm)
rv <- robust_vifs(medv ~ ., data = Boston, model = lm)
rv$plot_nonlin_vifs
```

These functions assess concerns with collinearity. Notice that the VIFs from
`robust_vifs`are all less than 10. The partial correlations with the response
`robust_vifs` are all less than 10. The partial correlations with the response
from `partial_cor` are a type of pseudo-importance assessing the importance
each variable does not share with the others. Now we tune our model by
assessing four different mtry values in the `mtry_compare` function.
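
As an aside on the "less than 10" rule of thumb above, the classical linear VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The following is a minimal sketch in base R, assuming the MASS package is installed. It is not the computation performed by `robust_vifs` (whose internals are not shown here), and the names `predictors` and `vifs` are illustrative only.

```r
# Classical (linear) VIFs computed by hand.
# Assumption: this is a cross-check, NOT what rfvip::robust_vifs does.
library(MASS)  # provides the Boston data frame

# Every column except the response is treated as a predictor.
predictors <- setdiff(names(Boston), "medv")

vifs <- sapply(predictors, function(v) {
  # Regress predictor v on all remaining predictors.
  fit <- lm(reformulate(setdiff(predictors, v), response = v), data = Boston)
  # VIF = 1 / (1 - R^2) of that auxiliary regression.
  1 / (1 - summary(fit)$r.squared)
})

# Largest values first, so potential collinearity problems are easy to spot.
sort(round(vifs, 2), decreasing = TRUE)
```

A VIF close to 1 means the predictor is nearly unrelated to the others, while large values indicate that most of its variation is already explained by them.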

```{r, warning=FALSE, message=FALSE}
```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
set.seed(1)
m <- mtry_compare(medv ~ .,
data = MASS::Boston, sqrt = TRUE,
data = Boston, sqrt = TRUE,
mvec = c(1, 4, 9, 13), num_var = 7
)
m$gg_model_errors
m$model_errors
```

According to the accuracy plot and table above, our best choice is when mtry is
4. However, the accuracy for the best model is notably very similar to two of
the other models. We now look at the variable importance metrics across the
different models.
4. However, the accuracy for the best model is notably only slightly better than
the models with mtry set to 9 and 13. We now look at the variable importance
metrics across the different models.

```{r, warning=FALSE, message=FALSE}
```{r, warning=FALSE, message=FALSE, fig.width=6, fig.height=5, fig.align='center'}
m$gg_var_imp_error
```

The top two variables are consistently identified as more important than the
other variables and their order remains unchanged across mtry. However, the
variables 'nox' and 'dis' switch order as mtry increases. Common sense suggests
that pollution (nox) is correlated with distance to employment centers (dis).
Our common sense leads us to assume that most home buyers consider distance to
work more than pollution when selecting a house. Therefore, 'dis' is likely a
more causal driver of price than 'nox'. Consequently, the model where mtry is 9
appears to be superior to the model where mtry is 4, despite mtry of 4 yielding
slightly more accurate results.
variables 'nox' and 'dis' switch order as mtry increases. Pollution (nox) has a
strong negative correlation with distance to employment centers (dis). This
makes sense if the employment centers are responsible for much of the pollution.
If many home buyers consider distance to
work more important than pollution when selecting a house, 'dis' is more likely
to be a causal driver of price than 'nox'. By this reasoning, the model where
mtry is 9 appears to be superior to the model where mtry is 4, despite mtry of 4
yielding slightly more accurate results.
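
The correlation between 'nox' and 'dis' mentioned above can be checked directly. This is a small sketch, assuming the MASS package is installed; the variable names `r_nox_dis` and `rho_nox_dis` are illustrative and not part of `rfvip`.

```r
# Check the sign and strength of the nox/dis relationship in Boston.
library(MASS)  # provides the Boston data frame

# Pearson correlation: measures the linear association.
r_nox_dis <- cor(Boston$nox, Boston$dis)

# Spearman correlation: rank-based, so it is less sensitive to curvature.
rho_nox_dis <- cor(Boston$nox, Boston$dis, method = "spearman")

c(pearson = r_nox_dis, spearman = rho_nox_dis)
```

A strongly negative value for both measures is consistent with the collinearity argument in the paragraph above, although correlation alone says nothing about which variable is the causal driver.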

We now take our selected model and build individual importance plots for it
using `ggvip`.

```{r, warning=FALSE, message=FALSE}
```{r, warning=FALSE, message=FALSE, fig.width=4, fig.height=4, fig.align='center'}
g <- ggvip(m$rf9)$both_vips
```

49 changes: 26 additions & 23 deletions README.md
@@ -62,33 +62,35 @@ You can view R package’s source code on GitHub:

``` r
library(rfvip)
library(MASS)
library(EZtune)
```

To introduce the functionality of `rfvip`, we will look at a modeling
problem for the Boston housing data (found in the MASS package). We will
attempt to build an optimal Random Forest model for accuracy and
interpretability. To begin we will run some preliminary diagnostics on
To introduce the functionality of `rfvip`, we look at modeling the
Boston housing data (found in the MASS package). We want to build a
Random Forest model with a view towards both accuracy and
interpretability. We begin by running some preliminary diagnostics on
our data.

``` r
set.seed(1234)

pcs <- partial_cor(medv ~ ., data = MASS::Boston, model = lm)
pcs <- partial_cor(medv ~ ., data = Boston, model = lm)
pcs$plot_y_part_cors
```

<img src="man/figures/README-unnamed-chunk-3-1.png" width="100%" />
<img src="man/figures/README-unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" />

``` r

rv <- robust_vifs(medv ~ ., data = MASS::Boston, model = lm)
rv <- robust_vifs(medv ~ ., data = Boston, model = lm)
rv$plot_nonlin_vifs
```

<img src="man/figures/README-unnamed-chunk-3-2.png" width="100%" />
<img src="man/figures/README-unnamed-chunk-3-2.png" width="100%" style="display: block; margin: auto;" />

These functions assess concerns with collinearity. Notice that the VIFs
from `robust_vifs`are all less than 10. The partial correlations with
from `robust_vifs` are all less than 10. The partial correlations with
the response from `partial_cor` are a type of pseudo-importance
assessing the importance each variable does not share with the others.
Now we tune our model by assessing four different mtry values in the
@@ -97,13 +99,13 @@ Now we tune our model by assessing four different mtry values in the
``` r
set.seed(1)
m <- mtry_compare(medv ~ .,
data = MASS::Boston, sqrt = TRUE,
data = Boston, sqrt = TRUE,
mvec = c(1, 4, 9, 13), num_var = 7
)
m$gg_model_errors
```

<img src="man/figures/README-unnamed-chunk-4-1.png" width="100%" />
<img src="man/figures/README-unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />

``` r
m$model_errors
@@ -115,26 +117,27 @@ m$model_errors
```

According to the accuracy plot and table above, our best choice is when
mtry is 4. However, the accuracy for the best model is notably very
similar to two of the other models. We now look at the variable
importance metrics across the different models.
mtry is 4. However, the accuracy for the best model is notably only
slightly better than the models with mtry set to 9 and 13. We now look
at the variable importance metrics across the different models.

``` r
m$gg_var_imp_error
```

<img src="man/figures/README-unnamed-chunk-5-1.png" width="100%" />
<img src="man/figures/README-unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />

The top two variables are consistently identified as more important than
the other variables and their order remains unchanged across mtry.
However, the variables ‘nox’ and ‘dis’ switch order as mtry increases.
Common sense suggests that pollution (nox) is correlated with distance
to employment centers (dis). Our common sense leads us to assume that
most home buyers consider distance to work more than pollution when
selecting a house. Therefore, ‘dis’ is likely a more causal driver of
price than ‘nox’. Consequently, the model where mtry is 9 appears to be
superior to the model where mtry is 4, despite mtry of 4 yielding
slightly more accurate results.
Pollution (nox) has a strong negative correlation with distance to
employment centers (dis). This makes sense if the employment centers are
responsible for much of the pollution. If many home buyers consider
distance to work more important than pollution when selecting a house,
‘dis’ is more likely to be a causal driver of price than ‘nox’. By this
reasoning, the model where mtry is 9 appears to be superior to the model
where mtry is 4, despite mtry of 4 yielding slightly more accurate
results.

We now take our selected model and build individual importance plots for
it using `ggvip`.
@@ -143,7 +146,7 @@ it using `ggvip`.
g <- ggvip(m$rf9)$both_vips
```

<img src="man/figures/README-unnamed-chunk-6-1.png" width="100%" />
<img src="man/figures/README-unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />

The plot above resembles a standard variable importance plot, but
possesses superior tick labels and editing capabilities for the analyst.
Binary file modified man/figures/README-unnamed-chunk-3-1.png
Binary file modified man/figures/README-unnamed-chunk-3-2.png
Binary file modified man/figures/README-unnamed-chunk-4-1.png
Binary file modified man/figures/README-unnamed-chunk-5-1.png
Binary file modified man/figures/README-unnamed-chunk-6-1.png
19 changes: 12 additions & 7 deletions misc/pdp_compare.R
@@ -41,18 +41,22 @@ pdp_compare <- function(x = Lo.rf, vars,
im <- im[(length(im)-2):length(im)]
}

# if(inherits(x,"gbm")){
# n.trees = x$n.trees
# im = gbm::relative.influence(x)
# }
if(inherits(x,"gbm")){
im = as.data.frame(gbm::relative.influence(x))
colnames(im) = "importance"
im$var = rownames(im)
}

vvec <- colnames(model_frame)

pd_num <- NULL
pd_fac <- NULL
for (i in vvec) {
tmp <- pdp::partial(x, pred.var = i, which.class = which.class,
prob = prob)#, ...)
ifelse(inherits(x,"gbm"),
tmp <- pdp::partial(x, pred.var = i, which.class = which.class,
prob = prob, n.trees = x$n.trees),
tmp <- pdp::partial(x, pred.var = i, which.class = which.class,
prob = prob))#, ...)
names(tmp) <- c("x", "y")

if(inherits(tmp$x, "numeric")){
@@ -70,7 +74,7 @@ pdp_compare <- function(x = Lo.rf, vars,
y[floor(length(y)*trim)+1],
sd = sd(y),
mad = stats::mad(y, center = mean(y))) %>%
arrange(desc(trim_range))
arrange(desc(trim_range), desc(sd))

if(exists("im")){
imp <- dplyr::left_join(imp, im, by = c("var" = "var"))
@@ -197,6 +201,7 @@ mtcars.rf <- randomForest(formula = mpg ~ ., data = mtcars)
# p <- partial(mtcars.rf, pred.var = c("drat"))
# plotPartial(p)
car_pd <- pdp_compare(x = mtcars.rf)
car_pd$imp
car_pd$full
car_pd$drat
car_pd$cyl
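
A note on the `ifelse(inherits(x,"gbm"), tmp <- ..., tmp <- ...)` pattern in the hunk above: `ifelse()` is meant for vectorised selection, so a scalar condition such as a class check is usually written with `if`. One possible alternative, sketched here under stated assumptions, builds the shared argument list once and adds `n.trees` only for gbm fits. `partial_stub` and `call_partial` are hypothetical names used so the sketch runs without pdp or gbm installed; in `pdp_compare` the function passed in would presumably be `pdp::partial`.

```r
# Hypothetical stand-in for pdp::partial(); it only echoes its arguments
# so that this sketch runs without pdp or gbm installed.
partial_stub <- function(object, pred.var, ...) {
  list(pred.var = pred.var, extra = list(...))
}

# Build the common arguments once, then add n.trees only for gbm objects.
call_partial <- function(object, pred.var, partial_fun = partial_stub) {
  args <- list(object = object, pred.var = pred.var)
  if (inherits(object, "gbm")) {
    args$n.trees <- object$n.trees
  }
  do.call(partial_fun, args)
}

# Fake model objects carrying only the class attribute this sketch needs.
fake_gbm <- structure(list(n.trees = 100), class = "gbm")
fake_rf  <- structure(list(), class = "randomForest")

res_gbm <- call_partial(fake_gbm, "drat")
res_rf  <- call_partial(fake_rf, "drat")

names(res_gbm$extra)   # the extra argument forwarded for the gbm object
length(res_rf$extra)   # no extra arguments for the non-gbm object
```

Compared with two near-identical calls, this keeps the shared arguments in one place, so a later change to them only has to be made once.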
File renamed without changes.