Merge pull request #323 from SchlossLab/improve-docs

Improve description of `run_ml()` and its args
SchlossLab · Jan 20, 2023 · f73f8a8
2 parents 4df6fe7 + 23c2aa4

Showing 80 changed files with 276 additions and 262 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: mikropml
 Title: User-Friendly R Package for Supervised Machine Learning Pipelines
-Version: 1.5.0
+Version: 1.5.0.9000
 Date: 2023-01-15
 Authors@R: 
     c(person(given = "Begüm",

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,7 @@
+# mikropml development version
+
+- Minor documentation improvements (#323, @kelly-sovacool).
+
 # mikropml 1.5.0
 
 - New example showing how to plot feature importances in the `parallel` vignette (#310, @kelly-sovacool).

diff --git a/R/run_ml.R b/R/run_ml.R
@@ -1,13 +1,15 @@
 #' Run the machine learning pipeline
 #'
-#' This function runs machine learning (ML), evaluates the best model,
+#' This function splits the data set into a train & test set,
+#' trains machine learning (ML) models using k-fold cross-validation,
+#' evaluates the best model on the held-out test set,
 #' and optionally calculates feature importance using the framework
 #' outlined in Topçuoğlu _et al._ 2020 (\doi{10.1128/mBio.00434-20}).
-#' Required inputs are a dataframe with an outcome variable and other columns
-#' as features, as well as the ML method.
+#' Required inputs are a data frame (must contain an outcome variable and all
+#' other columns as features) and the ML method.
 #' See `vignette('introduction')` for more details.
 #'
-#' @param dataset Dataframe with an outcome variable and other columns as features.
+#' @param dataset Data frame with an outcome variable and other columns as features.
 #' @param method ML method.
 #'   Options: `c("glmnet", "rf", "rpart2", "svmRadial", "xgbTree")`.
 #'   - glmnet: linear, logistic, or multiclass regression
@@ -73,13 +75,28 @@
 #'
 #' - `trained_model`: Output of [caret::train()], including the best model.
 #' - `test_data`: Part of the data that was used for testing.
-#' - `performance`: Dataframe of performance metrics. The first column is the cross-validation performance metric, and the last two columns are the ML method used and the seed (if one was set), respectively. All other columns are performance metrics calculated on the test data. This contains only one row, so you can easily combine performance dataframes from multiple calls to `run_ml()` (see `vignette("parallel")`).
-#' - `feature_importance`: If feature importances were calculated, a dataframe where each row is a feature or correlated group. The columns are the performance metric of the permuted data, the difference between the true performance metric and the performance metric of the permuted data (true - permuted), the feature name, the ML method, the performance metric name, and the seed (if provided). For AUC and RMSE, the higher perf_metric_diff is, the more important that feature is for predicting the outcome. For log loss, the lower perf_metric_diff is, the more important that feature is for predicting the outcome.
-#'
+#' - `performance`: Data frame of performance metrics. The first column is the
+#'    cross-validation performance metric, and the last two columns are the ML
+#'    method used and the seed (if one was set), respectively.
+#'    All other columns are performance metrics calculated on the test data.
+#'    This contains only one row, so you can easily combine performance
+#'    data frames from multiple calls to `run_ml()`
+#'    (see `vignette("parallel")`).
+#' - `feature_importance`: If feature importances were calculated, a data frame
+#'    where each row is a feature or correlated group. The columns are the
+#'    performance metric of the permuted data, the difference between the true
+#'    performance metric and the performance metric of the permuted data
+#'    (true - permuted), the feature name, the ML method,
+#'    the performance metric name, and the seed (if provided).
+#'    For AUC and RMSE, the higher perf_metric_diff is, the more important that
+#'    feature is for predicting the outcome. For log loss, the lower
+#'    perf_metric_diff is, the more important that feature is for
+#'    predicting the outcome.
 #'
 #' @section More details:
 #'
-#' For more details, please see [the vignettes](http://www.schlosslab.org/mikropml/articles/).
+#' For more details, please see
+#' [the vignettes](http://www.schlosslab.org/mikropml/articles/).
 #'
 #' @export
 #' @author Begüm Topçuoğlu, \email{topcuoglu.begum@@gmail.com}

diff --git a/docs/dev/CODE_OF_CONDUCT.html b/docs/dev/CODE_OF_CONDUCT.html
diff --git a/docs/dev/CONTRIBUTING.html b/docs/dev/CONTRIBUTING.html
diff --git a/docs/dev/LICENSE-text.html b/docs/dev/LICENSE-text.html
diff --git a/docs/dev/LICENSE.html b/docs/dev/LICENSE.html
diff --git a/docs/dev/SUPPORT.html b/docs/dev/SUPPORT.html
diff --git a/docs/dev/articles/index.html b/docs/dev/articles/index.html
diff --git a/docs/dev/articles/introduction.html b/docs/dev/articles/introduction.html
diff --git a/docs/dev/articles/paper.html b/docs/dev/articles/paper.html
diff --git a/docs/dev/articles/parallel.html b/docs/dev/articles/parallel.html
diff --git a/docs/dev/articles/preprocess.html b/docs/dev/articles/preprocess.html
diff --git a/docs/dev/articles/tuning.html b/docs/dev/articles/tuning.html
diff --git a/docs/dev/authors.html b/docs/dev/authors.html
diff --git a/docs/dev/index.html b/docs/dev/index.html