Merge branch 'main' of https://github.com/SchlossLab/mikropml

SchlossLab · Nov 30, 2022 · e8ee491
2 parents b4cf70e + bfb08f6
Showing 8 changed files with 54 additions and 17 deletions.
diff --git a/docs/dev/articles/parallel.html b/docs/dev/articles/parallel.html
diff --git a/docs/dev/articles/parallel_files/figure-html/customize_perf_plot-1.png b/docs/dev/articles/parallel_files/figure-html/customize_perf_plot-1.png
diff --git a/docs/dev/articles/parallel_files/figure-html/feat_imp_plot-1.png b/docs/dev/articles/parallel_files/figure-html/feat_imp_plot-1.png
diff --git a/docs/dev/articles/parallel_files/figure-html/plot_perf-1.png b/docs/dev/articles/parallel_files/figure-html/plot_perf-1.png
diff --git a/docs/dev/pkgdown.yml b/docs/dev/pkgdown.yml
@@ -7,7 +7,7 @@ articles:
   parallel: parallel.html
   preprocess: preprocess.html
   tuning: tuning.html
-last_built: 2022-11-04T17:10Z
+last_built: 2022-11-17T17:37Z
 urls:
   reference: http://www.schlosslab.org/mikropml/reference
   article: http://www.schlosslab.org/mikropml/articles

diff --git a/docs/dev/reference/get_perf_metric_fn.html b/docs/dev/reference/get_perf_metric_fn.html
diff --git a/docs/dev/search.json b/docs/dev/search.json
diff --git a/vignettes/parallel.Rmd b/vignettes/parallel.Rmd
@@ -19,6 +19,14 @@ knitr::opts_chunk$set(
 )
 ```
 
+In this tutorial, we show how you can speed up pre-processing, model training,
+and feature importance steps for individual runs, as well as how to train
+multiple models in parallel within R.
+However, we highly recommend using a workflow manager such as Snakemake rather
+than parallelizing within a single R session.
+Jump to the section [Parallelizing with Snakemake](#parallelizing-with-snakemake)
+below if you're interested in skipping right to our best recommendation.
+
 ```{r setup}
 library(mikropml)
 library(dplyr)
@@ -156,26 +164,37 @@ perf_boxplot +
   coord_flip()
 ```
 
-#### feature importance
+#### Feature importance
+
+The `perf_metric_diff` from the feature importance data frame contains the 
+differences between the performance on the actual test data and the performance 
+on the permuted test data (i.e. **test** minus **permuted**).
+If a feature is important for model performance, we expect `perf_metric_diff` to
+be positive.
+In other words, the features that resulted in the largest **decrease** in
+performance when permuted are the most important features.
+
+You can select the top n most important features for your models and plot them
+like so:
 
 ```{r feat_imp_plot}
+top_n <- 5
 top_feats <- feat_df %>%
   group_by(method, names) %>%
   summarize(median_diff = median(perf_metric_diff)) %>%
-  slice_min(order_by = median_diff, n = 5)
+  filter(median_diff > 0) %>%
+  slice_max(order_by = median_diff, n = top_n)
 
 feat_df %>%
   right_join(top_feats, by = c("method", "names")) %>%
   mutate(features = factor(names, levels = rev(unique(top_feats$names)))) %>%
   ggplot(aes(x = perf_metric_diff, y = features, color = method)) +
   geom_boxplot() +
-  facet_wrap(~method) +
   theme_bw()
 ```
 
-The features that resulted in the largest **decrease** in performance when
-permuted are the most importance features.
-
+See the docs for `get_feature_importance()` for more details on how these values
+are computed.
 
 ## Live progress updates