---
title: "Hyperparameter tuning"
author: "Begüm D. Topçuoğlu"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Hyperparameter tuning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
One particularly important aspect of machine learning (ML) is hyperparameter tuning.
A hyperparameter is a parameter that is set before the ML training begins.
These parameters are tunable, and they affect how well the model trains.
To pick the best value for the model and dataset, we perform an exhaustive grid search over many hyperparameter possibilities.
In this package, we do this during the cross-validation step.
Let's start with an example ML run.
The input data to `run_ml()` is a dataframe where each row is a sample or observation.
One column (assumed to be the first) is the outcome of interest,
and all of the other columns are the features.
We package `otu_mini_bin` as a small example dataset with `mikropml`.
```{r}
# install.packages("devtools")
# devtools::install_github("SchlossLab/mikropml")
library(mikropml)
head(otu_mini_bin)
```
Before we train and evaluate an ML model, we can preprocess the data.
You can learn more about this in the preprocessing vignette: `vignette("preprocess")`.
```{r}
preproc <- preprocess_data(
  dataset = otu_mini_bin,
  outcome_colname = "dx"
)
dat <- preproc$dat_transformed
```
We'll use `dat` for the following examples.
## The simplest way to `run_ml()`
As mentioned above, the minimal input is your dataset (`dataset`) and the machine learning model you want to use (`method`).
When we `run_ml()`, by default we perform 100 repeats of 5-fold cross-validation,
meaning we evaluate the hyperparameters in these 500 total iterations.
Say we want to run L2 regularized logistic regression. We do this with:
```{r, warning = FALSE}
results <- run_ml(dat,
"glmnet",
outcome_colname = "dx",
cv_times = 100,
seed = 2019
)
```
You'll probably get a warning when you run this because the dataset is very small. If you want to learn more about that, check out the introductory vignette about training and evaluating an ML model: `vignette("introduction")`.
By default, `run_ml()` selects hyperparameters depending on the dataset and method used.
```{r}
results$trained_model
```
As you can see, the `alpha` hyperparameter is set to 0, which specifies L2 regularization.
`glmnet` gives us the option to run both L1 and L2 regularization.
If we change `alpha` to 1, we would run L1-regularized logistic regression.
You can also tune `alpha` by specifying a variety of values between 0 and 1.
When you use a value that is between 0 and 1, you are running elastic net.
The default for the `lambda` hyperparameter, which adjusts the strength of the regularization penalty, is a range of values from 10^-4 to 10.
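For instance, here is a minimal sketch of tuning `alpha` and `lambda` together to fit an elastic net; the candidate values below are illustrative choices, not package defaults:
```{r, eval = FALSE}
# illustrative grid: tune alpha and lambda together for elastic net
elastic_hp <- list(
  alpha = c(0, 0.5, 1),
  lambda = 10^seq(-4, 1, 1)
)
results_enet <- run_ml(dat, "glmnet",
  outcome_colname = "dx",
  hyperparameters = elastic_hp,
  seed = 2019
)
```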
When we look at the 100 repeated cross-validation performance metrics, such as
`AUC`, `Accuracy`, and `prAUC`, for each tested `lambda` value,
we see that some values are not appropriate for this dataset and others perform better.
```{r}
results$trained_model$results
```
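If you would rather pull out the single best-performing row than scan the whole table, here is a short sketch; it assumes the caret `results` data frame with the `alpha`, `lambda`, and `AUC` columns shown above:
```{r, eval = FALSE}
# a sketch: the lambda value with the highest mean cross-validation AUC
cv_results <- results$trained_model$results
cv_results[which.max(cv_results$AUC), c("alpha", "lambda", "AUC")]
```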
## Customizing hyperparameters
In this example, we want to change the `lambda` values to provide a better range to test in the cross-validation step.
We don't want to use the defaults, so we provide our own named list with new values.
For example:
```{r}
new_hp <- list(
  alpha = 1,
  lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.015, 0.02, 0.025, 0.03, 0.04, 0.05, 0.06, 0.1)
)
new_hp
```
Now let's run L1-regularized logistic regression (`alpha = 1` in `new_hp`) with the new `lambda` values:
```{r, warning = FALSE}
results <- run_ml(dat,
"glmnet",
outcome_colname = "dx",
cv_times = 100,
hyperparameters = new_hp,
seed = 2019
)
results$trained_model
```
This time, we cover a larger and different range of `lambda` settings in cross-validation.
How do we know which `lambda` value is the best one?
To answer that, we need to run the ML pipeline on multiple data splits
and look at the mean cross-validation performance of each `lambda` across those modeling experiments.
We describe how to run the pipeline with multiple data splits in `vignette("parallel")`.
Here we train the model with the new `lambda` range we defined above.
We run it 3 times each with a different seed, which will result in different
splits of the data into training and testing sets.
We can then use `plot_hp_performance` to see which `lambda` gives us the largest mean AUC value across modeling experiments.
```{r, warning = FALSE}
results <- lapply(seq(100, 102), function(seed) {
  run_ml(dat, "glmnet", seed = seed, hyperparameters = new_hp)
})
models <- lapply(results, function(x) x$trained_model)
hp_metrics <- combine_hp_performance(models)
plot_hp_performance(hp_metrics$dat, lambda, AUC)
```
As you can see, we get a mean maximum at `0.03`, which is the best `lambda` value
for this dataset when we use 3 data splits.
The fact that this maximum falls in the middle of our range, and not at the edges,
shows that we provided a wide enough range to exhaust our `lambda` search
as we build the model.
We recommend using this plot to make sure the best hyperparameter
does not fall at either edge of the provided list.
For a better estimate of the global maximum, it is best to run more data splits by using more seeds.
We picked 3 seeds to keep the runtime down for this vignette,
but for real-world data we recommend using many more seeds.
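If the maximum had instead landed at the edge of the range, one sketch of a fix is to extend the grid in that direction and re-run the data splits; the extra values below are illustrative:
```{r, eval = FALSE}
# illustrative: extend the lambda grid upward, then re-run the splits above
wider_hp <- list(
  alpha = 1,
  lambda = c(new_hp$lambda, 0.2, 0.5, 1)
)
```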
## Hyperparameter options
You can see which default hyperparameters would be used for your dataset with `get_hyperparams_list()`.
Here are a few examples with built-in datasets we provide:
```{r}
get_hyperparams_list(otu_mini_bin, "glmnet")
get_hyperparams_list(otu_mini_bin, "rf")
get_hyperparams_list(otu_small, "rf")
```
Here are the hyperparameters that are tuned for each of the modeling methods.
The output for all of them is very similar, so we won't go into those details.
### Regression
As mentioned above, `glmnet` uses the `alpha` and `lambda` hyperparameters.
`alpha` of `0` is for L2 regularization (ridge).
`alpha` of `1` is for L1 regularization (lasso).
`alpha` in between is elastic net. You can also tune `alpha` like you would any other hyperparameter.
Please refer to original `glmnet` documentation for more information:
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
The default hyperparameters chosen by `run_ml()` are fixed for `glmnet`.
```{r, echo = FALSE}
mikropml:::set_hparams_glmnet()
```
### Random forest
When we run `rf` or `parRF`, we are using the `randomForest` package implementation.
We are tuning the `mtry` hyperparameter.
This is the number of features randomly sampled as candidates at each split.
This number needs to be less than the number of features in the dataset.
Please refer to the original documentation for more information:
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
By default, we take the square root of the number of features in the dataset
and provide a range of `[sqrt_features / 2, sqrt_features, sqrt_features * 2]`.
For example if the number of features is 1000:
```{r, echo = FALSE}
mikropml:::set_hparams_rf(1000)
```
Similar to the `glmnet` method, we can provide our own `mtry` range.
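For example, here is a minimal sketch with an illustrative `mtry` grid; the values are chosen to stay below the number of features in this small example dataset:
```{r, eval = FALSE}
# illustrative mtry grid; all values are below the number of features
results_rf <- run_ml(dat, "rf",
  outcome_colname = "dx",
  hyperparameters = list(mtry = c(2, 3, 6)),
  seed = 2019
)
```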
### Decision tree
When we run `rpart2`, we are running the `rpart` package implementation of decision trees.
We are tuning the `maxdepth` hyperparameter.
This is the maximum depth of any node of the final tree.
Please refer to the original documentation for more information on `maxdepth`:
https://cran.r-project.org/web/packages/rpart/rpart.pdf
By default, we provide a range that is less than the number of features in the dataset.
For example if we have 1000 features:
```{r, echo = FALSE}
mikropml:::set_hparams_rpart2(1000)
```
or 10 features:
```{r, echo = FALSE}
mikropml:::set_hparams_rpart2(10)
```
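As with the other methods, you can supply your own `maxdepth` grid; here is a sketch with illustrative values:
```{r, eval = FALSE}
# illustrative maxdepth grid for the decision tree
results_dt <- run_ml(dat, "rpart2",
  outcome_colname = "dx",
  hyperparameters = list(maxdepth = c(1, 2, 4, 8)),
  seed = 2019
)
```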
### SVM with radial basis kernel
When we run the `svmRadial` method, we are tuning the `C` and `sigma` hyperparameters.
`sigma` defines how far the influence of a single training example reaches and `C` behaves as a regularization parameter.
Please refer to this great `sklearn` resource for more information on these hyperparameters:
https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
By default, we provide two separate ranges of values for these two hyperparameters.
```{r, echo = FALSE}
mikropml:::set_hparams_svmRadial()
```
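Both grids can be overridden with your own named list; here is a sketch with illustrative values for `C` and `sigma`:
```{r, eval = FALSE}
# illustrative grids for the SVM cost (C) and kernel width (sigma)
results_svm <- run_ml(dat, "svmRadial",
  outcome_colname = "dx",
  hyperparameters = list(
    C = c(0.1, 1, 10),
    sigma = c(0.001, 0.01, 0.1)
  ),
  seed = 2019
)
```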
### XGBoost
When we run the `xgbTree` method, we are tuning the
`nrounds`, `gamma`, `eta`, `max_depth`, `colsample_bytree`, `min_child_weight`, and `subsample` hyperparameters.
You can read more about these hyperparameters here:
https://xgboost.readthedocs.io/en/latest/parameter.html
By default, we set `nrounds`, `gamma`, `colsample_bytree`, and `min_child_weight`
to fixed values and provide a range of values for `eta`, `max_depth`, and `subsample`.
All of these can be changed and optimized by supplying a custom
named list of hyperparameters to `run_ml()`.
```{r, echo = FALSE}
mikropml:::set_hparams_xgbTree(1000)
```
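For instance, here is a sketch of such a custom list; the values are illustrative, but each name matches one of the tuned `xgbTree` hyperparameters:
```{r, eval = FALSE}
# illustrative custom grid: fixed values for some hyperparameters,
# small ranges for the others
results_xgb <- run_ml(dat, "xgbTree",
  outcome_colname = "dx",
  hyperparameters = list(
    nrounds = 100,
    gamma = 0,
    eta = c(0.01, 0.1, 0.3),
    max_depth = c(4, 8),
    colsample_bytree = 0.8,
    min_child_weight = 1,
    subsample = c(0.5, 0.7)
  ),
  seed = 2019
)
```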
## Other ML methods
While the above ML methods are those that we have tested and set default
hyperparameters for, in theory you may be able to use other methods supported by
caret with `run_ml()`. Take a look at the [available models in
caret](https://topepo.github.io/caret/available-models.html) (or see
[here](https://topepo.github.io/caret/train-models-by-tag.html) for a list by
tag). You will need to give `run_ml()` your own custom hyperparameters just like
in the examples above:
```{r rfRules, eval = FALSE}
run_ml(otu_mini_bin,
"regLogistic",
hyperparameters = list(
cost = 10^seq(-4, 1, 1),
epsilon = c(0.01),
loss = c("L2_primal")
)
)
```