-
Notifications
You must be signed in to change notification settings - Fork 9
/
parsnip.sthlp
450 lines (423 loc) · 22.3 KB
/
parsnip.sthlp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
{smcl}
{* *! version 2.1.0 09oct2020}{...}
{findalias asfradohelp}{...}
{vieweralsosee "" "--"}{...}
{vieweralsosee "[R] help" "help help"}{...}
{vieweralsosee "[R] mlrtime" "help mlrtime"}{...}
{vieweralsosee "[R] mlrtimesetup" "help mlrtimesetup"}{...}
{vieweralsosee "[R] grf" "help grf"}{...}
{vieweralsosee "[R] gsynth" "help gsynth"}{...}
{viewerjumpto "Syntax" "examplehelpfile##syntax"}{...}
{viewerjumpto "Description" "examplehelpfile##description"}{...}
{viewerjumpto "Author" "examplehelpfile##author"}{...}
{viewerjumpto "References" "examplehelpfile##references"}{...}
{viewerjumpto "Examples" "examplehelpfile##examples"}{...}
{title:Parsnip}
{phang}
{bf:parsnip} {hline 2} Run a machine learning model in R's parsnip and return
results
{marker syntax}{...}
{title:Syntax}
{p 8 17 2}
{cmd:parsnip} {help depvar} {help varlist} [{help if}] [{help in}],
model(string) [{it:options}]
{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Main}
{synopt:{help depvar}} The dependent variable being predicted.
Depending on the {opt model()} and {opt mode()} there may be restrictions on
the variable's type.
If {opt depvar} is a factor variable, it is generally recommended that you
convert it to a string variable.
All value labels will be dropped by {cmd: parsnip}.{p_end}
{synopt:{help varlist}} The set of predictors.
Factor-variable and time series operators are allowed, although their names are
not preserved, so postestimation may become more difficult to interpret.
Additionally, a number of the estimators do not have built-in ways to drop
variables with no variation.
So if, for example, you have a factor variable where one of the levels is absent
in the training data, you may want to make dummies by hand, excluding
that one.{p_end}
{synopt:{opt model}} Required.
The type of model to fit.
See details below for how to check for documentation these models.
Currently supports: {it: boost_tree}, {it: gen_additive_mod}, {it: linear_reg},
{it: logistic_reg}, {it: mars}, {it: mlp}, {it: multinom_reg},
{it: nearest_neighbor}, {it: rand_forest}, {it: svm_linear}, {it: svm_poly},
{it: svm_rbf}.{p_end}
{synoptline}
{syntab:Main Options}
{synopt:{opt clearR}} Start a new R environment before running your
command.{p_end}
{synopt:{opt seed(integer)}} Set the R seed to this.{p_end}
{synopt:{opt replace}} Drop variables that have the same names as those about
to be created by {opt pred()}.
Note that in the case of models that produce more than one column of
predictions, this may lead to variables being dropped even if they just have
{opt pred()} as a stem rather than being the exact name given in
{opt pred()}.{p_end}
{synopt:{opt results(string)}} What to do with the results (i.e. coefficients,
not predictions, noting that not all models will have coefficients) other than
printing them to screen.
Setting this to {opt results(replace)} will replace the data in memory with the
result of running the R {cmd: tidy()} function on the results.
Setting {opt results()} to a string that ends in .dta or .csv will instead
write the {cmd: tidy()} output to a Stata or CSV file, respectively.
If you forget the extension (and didn't write "results") it will assume you
wanted a .dta and output one.
{it: Importantly, note that tidy() doesn't work with any model that does not
produce coefficients.
For these models there are no results to return, only predictions.}{p_end}
{syntab:Stata Variables}
{synopt:{opt hold(varname)}} Variable name indicating a hold/train split.
A value of 0 will be used in estimation, predictions will be generated for
values of 1, and missing values will be excluded from both.{p_end}
{synopt:{opt pred(string)}} New variable name to store model predictions in.
For methods that produce more than one column of predictions, multiple columns
will be made that all start with {opt pred} as a stem.
Must be a valid name in R and Stata.
{cmd: predicted} by default.
If this variable already exists, be sure {opt replace} is specified so it can
be overwritten.
Specify {opt pred(none)} to skip prediction.{p_end}
{synopt:{opt id(varname)}} A unique row identifier that will be used to merge
predictions from R back into Stata.
If not specified, a new variable called {it:parsnip_merge_id} will be
created.{p_end}
{syntab:Additional Customization}
{synopt:{opt predopts(string)}} A set of options, in R syntax, separated by
commas, to use in the {cmd: predict} command.
See {cmd: rcall: help(predict.modeltype, package = 'packagename')} if you know
the object type your estimation method produces and the name of the package it
comes from, which then takes the place of "modeltype" in that help command.
Do not specify the {cmd:newdata} option.
R true/false options must be specified in all-caps as {cmd:TRUE} or {cmd:FALSE},
for example {cmd:ll.weight.penalty=TRUE, num.threads=4}{p_end}
{synopt:{opt otheropts(string)}} A set of options supported by the
{opt engine()}, in R syntax, separated by commas, to pass to your
{cmd: set_engine()} command and thus on to the estimation command.
See details below for accessing help files for your command; these help files
will specify other options you can set.
You can refer to other variables in your data as part of the overall data set
{cmd: dataset}, i.e. {cmd: dataset@@varname}, or similarly {cmd: holding} and
{cmd: training} for the holding and training subsets.
Use "@@" instead of "$" where $ appears in the R code to avoid confusing Stata.
This will be ignored if {opt engine()} is not specified.{p_end}
{synopt:{opt predformula(string)}} A custom prediction formula, in R syntax, to
pass to your {cmd: fit()} command.{p_end}
{synopt:{opt precedes(string)}} A list of R commands, separated by semicolons
({cmd: ;}), of whatever you'd like R to do after importing your data but before
running your {cmd: parsnip} command.
You can refer to other variables in your data as part of the data set
{cmd: dataset}, i.e. {cmd: dataset@@varname}, or similarly {cmd: holding} and
{cmd: training} for the holding and training subsets.
Use "@@" instead of "$" where $ appears in the R code to avoid confusing
Stata.{p_end}
{synoptline}
{syntab:Shared parsnip Estimation Options (repeated below)}
{synopt:{opt mode(string)}} the prediction outcome mode.
Different options available depending on which {opt model()} you have,
specified below.{p_end}
{synopt:{opt engine(string)}} The computational engine to use for fitting.
There is a default option selected for each {opt model()} but you will usually
want to specify this.
Options specified below.{p_end}
{synoptline}
{syntab:model(boost_tree) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can be "xgboost" (default) or "C5.0".
{opt engine(C5.0)} only works with {opt mode(classification)}.
"spark" is not supported in {cmd: mlrtime}.{p_end}
{synopt:{opt mtry(real)}} A number for the number (or proportion) of predictors
that will be randomly sampled at each split when creating the tree models
(specific engines only).{p_end}
{synopt:{opt trees(integer)}} An integer for the number of trees contained in
the ensemble.{p_end}
{synopt:{opt min_n(integer)}} An integer for the minimum number of data points
in a node that is required for the node to be split further.{p_end}
{synopt:{opt tree_depth(integer)}} An integer for the maximum depth of the tree
(i.e. number of splits) (specific engines only).{p_end}
{synopt:{opt learn_rate(real)}} A number for the rate at which the boosting
algorithm adapts from iteration-to-iteration (specific engines only).{p_end}
{synopt:{opt loss_reduction(real)}} A number for the reduction in the loss
function required to split further (specific engines only).{p_end}
{synopt:{opt sample_size(real)}} A number for the number (or proportion) of
data that is exposed to the fitting routine.
For xgboost, the sampling is done at each iteration while C5.0 samples once
during training.{p_end}
{synopt:{opt stop_iter(integer)}} The number of iterations without improvement
before stopping (specific engines only).{p_end}
{synoptline}
{syntab:model(gen_additive_mod) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".
Because {opt mode(classification)} does not work in the standard way with
predictions {it:or} tables of coefficients, you probably won't get much out of
it unless you can do your postestimation in R code using {cmd: rcall}; you
should specify {opt pred(none)} if specifying
{opt mode(classification) model(gen_additive_mod)}.{p_end}
{synopt:{opt engine(string)}} For now can only be "mgcv", the default.{p_end}
{synopt:{opt select_features}} If specified, the model has the ability to
eliminate a predictor (via penalization).
Increasing {opt adjust_deg_free} will increase the likelihood of removing
predictors.{p_end}
{synopt:{opt adjust_deg_free(real)}} If {opt select_features} is specified,
then this acts as a multiplier for smoothness.
Increase this beyond 1 to produce smoother models.{p_end}
{synoptline}
{syntab:model(linear_reg), model(logistic_reg), and model(multinom_reg) Additional Options}
{synopt:{opt mode(string)}} Prediction outcome mode.
Can only be the defaults, which are "regression" for {opt model(linear_reg)}
or "classification" for {opt model(logistic_reg)} or
{opt model(multinom_reg)}.{p_end}
{synopt:{opt engine(string)}} All three models have access to "glmnet" and
"keras" (see Description for notes on "keras").
{opt model(linear_reg)} can also be "stan" or its default "lm".
{opt model(logistic_reg)} can also be "LiblineaR" or its default "glm".
{opt model(multinom_reg)} can also be its default "nnet" (does not allow
{opt results(replace)}).
"spark" is not supported in {cmd: mlrtime}.{p_end}
{synopt:{opt penalty(real)}} Required if {opt engine(glmnet)} is specified.
A non-negative number representing the total amount of regularization (specific
engines only).
For keras models, this corresponds to purely L2 regularization (aka weight
decay) while the other models can be either or a combination of L1 and L2
(depending on the value of {opt mixture}).{p_end}
{synopt:{opt mixture(real)}} A number between zero and one (inclusive) that is
the proportion of L1 regularization (i.e. lasso) in the model.
When mixture = 1, it is a pure lasso model while mixture = 0 indicates that
ridge regression is being used (specific engines only).
For LiblineaR models, mixture must be exactly 0 or 1 only.{p_end}
{synoptline}
{syntab:model(mars) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can only be its default "earth".{p_end}
{synopt:{opt num_terms(integer)}} The number of features that will be retained
in the final model, including the intercept.{p_end}
{synopt:{opt prod_degree(integer)}} The number of features that will be retained
in the final model, including the intercept.{p_end}
{synopt:{opt prune_method(string)}} The pruning method.
Can be the default "backward", or "none", "exhaustive", "forward", "seqrep", or
"cv".{p_end}
{synoptline}
{syntab:model(mlp) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can be "keras" or default "nnet" (see Description
for notes on "keras").{p_end}
{synopt:{opt hidden_units(integer)}} An integer for the number of units in the
hidden model.{p_end}
{synopt:{opt penalty(real)}} A non-negative numeric value for the amount of
weight decay.{p_end}
{synopt:{opt dropout(real)}} A number between 0 (inclusive) and 1 denoting the
proportion of model parameters randomly set to zero during model
training.{p_end}
{synopt:{opt epochs(integer)}} An integer for the number of training
iterations.{p_end}
{synopt:{opt activation(string)}} The type of relationship between the original
predictors and the hidden unit layer.
The activation function between the hidden and output layers is automatically
set to either "linear" or "softmax" depending on the type of outcome.
Possible values are: "linear", "softmax", "relu", and "elu".{p_end}
{synoptline}
{syntab:model(nearest_neighbor) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can only be the default "kknn".{p_end}
{synopt:{opt neighbors(integer)}} The number of neighbors to consider (often
called k).
For kknn, a value of 5 is used if neighbors is not specified.{p_end}
{synopt:{opt weight_func(string)}} The type of kernel function used to weight
distances between samples.
Valid choices are: "rectangular", "triangular", "epanechnikov", "biweight",
"triweight", "cos", "inv", "gaussian", "rank", or "optimal".{p_end}
{synopt:{opt dist_power(real)}} The parameter used in calculating Minkowski
distance.{p_end}
{synoptline}
{syntab:model(rand_forest) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can be the default "ranger", or "randomForest".
"spark" is not supported in {cmd: mlrtime}.{p_end}
{synopt:{opt mtry(integer)}} The number of predictors that will be randomly
sampled at each split when creating the tree models.{p_end}
{synopt:{opt trees(integer)}} The number of trees contained in the
ensemble.{p_end}
{synopt:{opt min_n(integer)}} The minimum number of data points in a node that
are required for the node to be split further.{p_end}
{synoptline}
{syntab:model(svm_linear) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can be the default "LiblineaR", or
"kernlab".{p_end}
{synopt:{opt cost(real)}} A positive number for the cost of predicting a sample
within or on the wrong side of the margin{p_end}
{synopt:{opt margin(real)}} A positive number for the epsilon in the SVM
insensitive loss function (regression only){p_end}
{synoptline}
{syntab:model(svm_poly) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can only be the default "kernlab".{p_end}
{synopt:{opt cost(real)}} A positive number for the cost of predicting a sample
within or on the wrong side of the margin{p_end}
{synopt:{opt degree(real)}} A positive number for polynomial degree.{p_end}
{synopt:{opt scale_factor(real)}} A positive number for the polynomial scaling
factor.{p_end}
{synopt:{opt margin(real)}} A positive number for the epsilon in the SVM
insensitive loss function (regression only){p_end}
{synoptline}
{syntab:model(svm_rbf) Additional Options}
{synopt:{opt mode(string)}} Required.
Prediction outcome mode.
Options "regression" or "classification".{p_end}
{synopt:{opt engine(string)}} Can only be the default "kernlab".{p_end}
{synopt:{opt cost(real)}} A positive number for the cost of predicting a sample
within or on the wrong side of the margin{p_end}
{synopt:{opt rbf_sigma(real)}} A positive number for radial basis
function.{p_end}
{synopt:{opt margin(real)}} A positive number for the epsilon in the SVM
insensitive loss function (regression only){p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}
{cmd:parsnip}
{marker description}{...}
{title:Description}
{pstd}
{cmd:parsnip} is an interface between Stata and the various machine learning
functions in the R package {cmd: parsnip}, by Kuhn and Vaughn (2021).
Note the "spark" engine accessible in the R package is not included since
managing a Spark connection with in-memory Stata data doesn't make all that
much sense anyway.
{pstd}
parsnip is a common API for modeling a wide range of analysis functions,
focusing on those popular in machine learning (including many that are not
machine-learning-specific, like logistic regression).
The full list of supported models and engines is quite long, and is likely to
continue being updated (hopefully without breaking this Stata function).
The list can be seen at
{browse "https://www.tidymodels.org/find/parsnip/":the tidymodels website}.
{pstd}
{cmd:parsnip} will send data to an R session, fit the model of interest, and
then return model results or predicted values, as desired and appropriate for
the model.
Results will be printed to screen but {it:not} in Stata-table format, and
results will not be available by matrix after estimation; this is a consequence
of the wide range of available models with different forms of output.
Predicted values are of course available in the {opt pred()} variable after
estimation.
Model results itself, i.e. things like coefficients, can be accessed using the
option {opt results(replace)}, which will replace the data in memory with the
results.
There is not a suite of postestimation commands for these results, although you
can use {cmd:rcall} to run postestimation in R, referring to the estimated
model object as {cmd: M}.
{pstd}
The R {cmd: parsnip} package itself is a common API that refers out to a long
list of other estimation packages.
Some of these do not run in R and require additional installation.
For example, {opt engine(keras)} will require that you have a TensorFlow
installation already on your system that is accessible from R.
Before using {opt engine(keras)}, you must have a valid installation of
TensorFlow.
See {browse "https://tensorflow.rstudio.com/installation/": this page} for
instructions on installing TensorFlow in R.
If you do not have TensorFlow installed where R can find it, you can try
{cmd: rcall: tensorflow::install_tensorflow()}, although this process may be
finicky.
Yes, this means that if {opt engine(keras)} is specified, the Stata package
{cmd: mlrtime} contains the function {cmd:parsnip} which calls the Stata
{cmd: rcall} package, which calls the R package {cmd: parsnip}, which calls
the R package {cmd: keras}, which calls the R package {cmd: tensorflow}, which
calls the platform TensorFlow, which itself is written in (and calls on
libraries in) Python, C++, and CUDA.
Welcome to computing in the 2020s.
{pstd}
Unfortunately, troubleshooting your model if it does not work is rather
difficult when using a wrapper like this.
You can get some help on the GitHub page at
{browse "https://github.com/NickCH-K/mlrtime"} but I can't help everyone, you
may be a bit on your own.
You may be able to use {cmd: rcall} to see what error messages come up if things
are not being returned to Stata, or try putting {cmd: set trace on} before
running {cmd: parsnip} so you can see the R error message being read back, if
it doesn't print to the console.
{pstd}
Syntax takes the form of {help depvar} {help varlist}, {opt model()}.
{opt model()} is the type of model you'll be fitting.
You'll likely also want to specify {opt engine()}, which determines which
estimation engine / algorithm will be used to fit the model, {opt pred()},
which is the name of the variable where predictions will be stored in Stata,
{opt replace} if that variable already exists, and {opt hold()} to specify a
hold-out set to predict for, using only the training data to fit the model.
{pstd}
See {cmd: rcall help('linear_reg', package = 'parsnip')} (or other models
besides {cmd: linear_reg}, as approprate) for details on the different kinds
of engines available for that model.
The help page includes links to more information on the different engines.
These links unfortunately don't work when calling R's {cmd: help} using
{cmd: rcall}, but you can get more information by searching
{browse "https://www.tidymodels.org/find/parsnip/":the tidymodels website}
for the model, or doing
{cmd: rcall help('detail_model_engine', package = 'parsnip')}, replacing
"model" and "engine" with the model and engine of interest, respectively.
{pstd}
Note that if you know R you can add your own code as you like using
{cmd: rcall}.
The {opt precedes} option allows the insertion of R code after data is imported
but before estimating the model, and you can use {cmd: rcall} as normal
afterwards, referring to the fitted model object as {cmd: M}.
The relevant objects will remain in the R session and can be accessed and
manipulated further with R code using {help rcall} afterwards (Haghish 2019).
Run {help mlrtimesetup} before use for the first time.
{pstd}
Additional note: while {help depvar} and {help varlist} accept time-series,
interaction, and factor operators, the names of these variables are not
maintained in an interpretable way.
This may be a problem if you plan to do some sort of postestimation that
reports the variable names of predictors.
In this case you should create factor and time-series variables by hand and
store them directly in the data.
{pstd}
The use of any mlrtime function will often result in flashing blue screens.
If you are photosensitive you may want to look away.
{marker author}{...}
{title:Author}
Nick Huntington-Klein
nhuntington-klein@seattleu.edu
{marker references}{...}
{title:References}
{phang} Kuhn, M. and Vaughan, D. (2021). parsnip: A Common API to Modeling
and Analysis Functions. R package version 0.1.7.
{browse "https://CRAN.R-project.org/package=parsnip": Link}.
{phang} Haghish, E. F. (2019). Seamless interactive language interfacing between
R and Stata. The Stata Journal, 19(1), 61–82.
{browse "https://doi.org/10.1177/1536867X19830891":Link}.
{marker examples}{...}
{title:Examples}
Note that this syntax example produces some strange results, but this is an
artifact of the not-well-thought-out analysis.
{phang}{cmd:. sysuse auto.dta, clear}{p_end}
{phang}{cmd:. g model = word(make, 1)}{p_end}
{phang}{cmd:. encode model, g(modeln)}{p_end}
{phang}{cmd:. g hold = runiform() > .5}{p_end}
{phang}{cmd:. * I wonder how LASSO predictions compare to random forest!}{p_end}
{phang}{cmd:. parsnip price mpg headroom trunk i.modeln, model(linear_reg) hold(hold) engine(glmnet) penalty(.5) replace clearR pred(lasso_prediction)}{p_end}
{phang}{cmd:. parsnip price mpg headroom trunk i.modeln, model(rand_forest) hold(hold) engine(ranger) mode(regression) replace clearR pred(RF_prediction)}{p_end}
{phang}{cmd:. * We could, like in help grf, swap the training and holdout sets to get predictions for everyone}{p_end}
{phang}{cmd:. * but let's just look at which prediction correlates to the truth better (among our holdouts)}{p_end}
{phang}{cmd:. pwcorr price lasso_prediction RF_prediction}{p_end}
{phang}{cmd:. * Looks like the random forest does better here!}{p_end}