From b0a3e2b7b2c72f0591f75bb81833594d6ba812e9 Mon Sep 17 00:00:00 2001
From: Kelly Sovacool
Date: Fri, 20 Jan 2023 15:32:16 -0500
Subject: [PATCH 1/4] Improve description of `run_ml()` and its args

---
 R/run_ml.R                                  | 33 ++++++++++----
 docs/reference/calc_baseline_precision.html |  6 +--
 docs/reference/get_hyperparams_list.html    |  6 +--
 docs/reference/index.html                   |  4 +-
 docs/reference/plot_curves.html             |  4 +-
 docs/reference/preprocess_data.html         |  6 +--
 docs/reference/randomize_feature_order.html | 14 +++---
 docs/reference/run_ml.html                  | 50 +++++++++++++++------
 docs/reference/sensspec.html                |  4 +-
 man/calc_baseline_precision.Rd              |  2 +-
 man/get_hyperparams_list.Rd                 |  2 +-
 man/preprocess_data.Rd                      |  2 +-
 man/randomize_feature_order.Rd              |  2 +-
 man/run_ml.Rd                               | 32 ++++++++++---
 14 files changed, 112 insertions(+), 55 deletions(-)

diff --git a/R/run_ml.R b/R/run_ml.R
index 2f9679a8..ffb23013 100644
--- a/R/run_ml.R
+++ b/R/run_ml.R
@@ -1,13 +1,15 @@
 #' Run the machine learning pipeline
 #'
-#' This function runs machine learning (ML), evaluates the best model,
+#' This function splits the data set into a train & test set,
+#' trains machine learning (ML) models using k-fold cross-validation,
+#' evaluates the best model on the held-out test set,
 #' and optionally calculates feature importance using the framework
 #' outlined in Topçuoğlu _et al._ 2020 (\doi{10.1128/mBio.00434-20}).
-#' Required inputs are a dataframe with an outcome variable and other columns
-#' as features, as well as the ML method.
+#' Required inputs are a data frame (must contain an outcome variable and all
+#' other columns as features) and the ML method.
 #' See `vignette('introduction')` for more details.
 #'
-#' @param dataset Dataframe with an outcome variable and other columns as features.
+#' @param dataset Data frame with an outcome variable and other columns as features.
 #' @param method ML method.
 #' Options: `c("glmnet", "rf", "rpart2", "svmRadial", "xgbTree")`.
 #' - glmnet: linear, logistic, or multiclass regression
@@ -73,13 +75,28 @@
 #'
 #' - `trained_model`: Output of [caret::train()], including the best model.
 #' - `test_data`: Part of the data that was used for testing.
-#' - `performance`: Dataframe of performance metrics. The first column is the cross-validation performance metric, and the last two columns are the ML method used and the seed (if one was set), respectively. All other columns are performance metrics calculated on the test data. This contains only one row, so you can easily combine performance dataframes from multiple calls to `run_ml()` (see `vignette("parallel")`).
-#' - `feature_importance`: If feature importances were calculated, a dataframe where each row is a feature or correlated group. The columns are the performance metric of the permuted data, the difference between the true performance metric and the performance metric of the permuted data (true - permuted), the feature name, the ML method, the performance metric name, and the seed (if provided). For AUC and RMSE, the higher perf_metric_diff is, the more important that feature is for predicting the outcome. For log loss, the lower perf_metric_diff is, the more important that feature is for predicting the outcome.
-#'
+#' - `performance`: Data frame of performance metrics. The first column is the
+#' cross-validation performance metric, and the last two columns are the ML
+#' method used and the seed (if one was set), respectively.
+#' All other columns are performance metrics calculated on the test data.
+#' This contains only one row, so you can easily combine performance
+#' data frames from multiple calls to `run_ml()`
+#' (see `vignette("parallel")`).
+#' - `feature_importance`: If feature importances were calculated, a data frame
+#' where each row is a feature or correlated group. The columns are the
+#' performance metric of the permuted data, the difference between the true
+#' performance metric and the performance metric of the permuted data
+#' (true - permuted), the feature name, the ML method,
+#' the performance metric name, and the seed (if provided).
+#' For AUC and RMSE, the higher perf_metric_diff is, the more important that
+#' feature is for predicting the outcome. For log loss, the lower
+#' perf_metric_diff is, the more important that feature is for
+#' predicting the outcome.
 #'
 #' @section More details:
 #'
-#' For more details, please see [the vignettes](http://www.schlosslab.org/mikropml/articles/).
+#' For more details, please see
+#' [the vignettes](http://www.schlosslab.org/mikropml/articles/).
 #'
 #' @export
 #' @author Begüm Topçuoğlu, \email{topcuoglu.begum@@gmail.com}
diff --git a/docs/reference/calc_baseline_precision.html b/docs/reference/calc_baseline_precision.html
index 9becd3e3..6ddf3d01 100644
--- a/docs/reference/calc_baseline_precision.html
+++ b/docs/reference/calc_baseline_precision.html
@@ -1,5 +1,5 @@
-Calculate the fraction of positives, i.e. baseline precision for a PRC curve — calc_baseline_precision • mikropml
+Calculate the fraction of positives, i.e. baseline precision for a PRC curve — calc_baseline_precision • mikropml
@@ -70,7 +70,7 @@
 Usage
 Arguments
 dataset
-Dataframe with an outcome variable and other columns as features.
+Data frame with an outcome variable and other columns as features.
 outcome_colname
@@ -131,7 +131,7 @@
 Examples
-Site built with pkgdown 2.0.7.
+Site built with pkgdown 2.0.6.
diff --git a/docs/reference/get_hyperparams_list.html b/docs/reference/get_hyperparams_list.html
index dda9ff35..a05c338a 100644
--- a/docs/reference/get_hyperparams_list.html
+++ b/docs/reference/get_hyperparams_list.html
@@ -1,5 +1,5 @@
-Set hyperparameters based on ML method and dataset characteristics — get_hyperparams_list • mikropml
+Set hyperparameters based on ML method and dataset characteristics — get_hyperparams_list • mikropml
@@ -70,7 +70,7 @@
 Usage
 Arguments
 dataset
-Dataframe with an outcome variable and other columns as features.
+Data frame with an outcome variable and other columns as features.
 method
@@ -123,7 +123,7 @@
 Examples
-Site built with pkgdown 2.0.7.
+Site built with pkgdown 2.0.6.
diff --git a/docs/reference/index.html b/docs/reference/index.html
index bf86f63a..6ef76199 100644
--- a/docs/reference/index.html
+++ b/docs/reference/index.html
@@ -1,5 +1,5 @@
-Function reference • mikropml
+Function reference • mikropml
@@ -369,7 +369,7 @@
 Pipeline customization
diff --git a/docs/reference/plot_curves.html b/docs/reference/plot_curves.html
index f21adeb6..a834218b 100644
--- a/docs/reference/plot_curves.html
+++ b/docs/reference/plot_curves.html
@@ -1,5 +1,5 @@
-Plot ROC and PRC curves — plot_mean_roc • mikropml
+Plot ROC and PRC curves — plot_mean_roc • mikropml
@@ -141,7 +141,7 @@
 Examples
-Site built with pkgdown 2.0.7.
+Site built with pkgdown 2.0.6.
diff --git a/docs/reference/preprocess_data.html b/docs/reference/preprocess_data.html
index 88a6d406..3599d27b 100644
--- a/docs/reference/preprocess_data.html
+++ b/docs/reference/preprocess_data.html
@@ -1,5 +1,5 @@
-Preprocess data prior to running machine learning — preprocess_data • mikropml
+Preprocess data prior to running machine learning — preprocess_data • mikropml
@@ -79,7 +79,7 @@
 Usage
 Arguments
 dataset
-Dataframe with an outcome variable and other columns as features.
+Data frame with an outcome variable and other columns as features.
 outcome_colname
@@ -202,7 +202,7 @@
 Examples
-Site built with pkgdown 2.0.7.
+Site built with pkgdown 2.0.6.
diff --git a/docs/reference/randomize_feature_order.html b/docs/reference/randomize_feature_order.html
index 49e30182..ebfd77e5 100644
--- a/docs/reference/randomize_feature_order.html
+++ b/docs/reference/randomize_feature_order.html
@@ -1,5 +1,5 @@
-Randomize feature order to eliminate any position-dependent effects — randomize_feature_order • mikropml
+Randomize feature order to eliminate any position-dependent effects — randomize_feature_order • mikropml
@@ -70,7 +70,7 @@
 Usage
 Arguments
 dataset
-Dataframe with an outcome variable and other columns as features.
+Data frame with an outcome variable and other columns as features.
 outcome_colname
@@ -97,10 +97,10 @@
 Examples
   a = 4:6, b = 7:9, c = 10:12, d = 13:15
 )
 randomize_feature_order(dat, "outcome")
-#>   outcome  c b a  d
-#> 1       1 10 7 4 13
-#> 2       2 11 8 5 14
-#> 3       3 12 9 6 15
+#>   outcome  d  c b a
+#> 1       1 13 10 7 4
+#> 2       2 14 11 8 5
+#> 3       3 15 12 9 6