<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="mikropml">
<title>Introduction to mikropml • mikropml</title>
<!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png">
<link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png">
<script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script>
<link href="../deps/bootstrap-5.2.2/bootstrap.min.css" rel="stylesheet">
<script src="../deps/bootstrap-5.2.2/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><meta property="og:title" content="Introduction to mikropml">
<meta property="og:description" content="mikropml">
<meta property="og:image" content="http://www.schlosslab.org/mikropml/logo.png">
<meta name="robots" content="noindex">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-light navbar-expand-lg bg-light"><div class="container">
<a class="navbar-brand me-2" href="../index.html">mikropml</a>
<small class="nav-text text-danger me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="In-development version">1.5.0.9000</small>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto">
<li class="nav-item">
<a class="nav-link" href="../reference/index.html">Reference</a>
</li>
<li class="active nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a>
<div class="dropdown-menu" aria-labelledby="dropdown-articles">
<h6 class="dropdown-header" data-toc-skip>Paper</h6>
<a class="dropdown-item" href="../articles/paper.html">mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Vignettes</h6>
<a class="dropdown-item" href="../articles/introduction.html">Introduction to mikropml</a>
<a class="dropdown-item" href="../articles/preprocess.html">Preprocessing data</a>
<a class="dropdown-item" href="../articles/tuning.html">Hyperparameter tuning</a>
<a class="dropdown-item" href="../articles/parallel.html">Parallel processing</a>
</div>
</li>
<li class="nav-item">
<a class="nav-link" href="../news/index.html">Changelog</a>
</li>
</ul>
<form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off">
</form>
<ul class="navbar-nav">
<li class="nav-item">
<a class="external-link nav-link" href="https://github.com/SchlossLab/mikropml/" aria-label="github">
<span class="fab fa fab fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
</div>
</nav><div class="container template-article">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<img src="../logo.png" class="logo" alt=""><h1>Introduction to mikropml</h1>
<h4 data-toc-skip class="author">Zena Lapp</h4>
<small class="dont-index">Source: <a href="https://github.com/SchlossLab/mikropml/blob/HEAD/vignettes/introduction.Rmd" class="external-link"><code>vignettes/introduction.Rmd</code></a></small>
<div class="d-none name"><code>introduction.Rmd</code></div>
</div>
<p>The goal of <code>mikropml</code> is to make supervised machine
learning (ML) easy for you to run while implementing good practices for
machine learning pipelines. All you need to run the ML pipeline is one
function: <code><a href="../reference/run_ml.html">run_ml()</a></code>. We’ve selected sensible default
arguments related to good practices <span class="citation">(Topçuoğlu et
al. 2020; Tang et al. 2020)</span>, but we allow you to change those
arguments to tailor <code><a href="../reference/run_ml.html">run_ml()</a></code> to the needs of your data.</p>
<p>This document takes you through all of the <code><a href="../reference/run_ml.html">run_ml()</a></code>
inputs, both required and optional, as well as the outputs.</p>
<p>In summary, you provide:</p>
<ul>
<li>A dataset with an outcome column and feature columns (rows are
samples; unfortunately we do not support multi-label
classification)</li>
<li>Model choice (i.e. method)</li>
</ul>
<p>And the function outputs:</p>
<ul>
<li>The trained model</li>
<li>Model performance metrics</li>
<li>(Optional) feature importance metrics</li>
</ul>
<div class="section level2">
<h2 id="its-running-so-slow">It’s running so slow!<a class="anchor" aria-label="anchor" href="#its-running-so-slow"></a>
</h2>
<p>Since I assume a lot of you won’t read this entire vignette, I’m
going to say this at the beginning. If the <code><a href="../reference/run_ml.html">run_ml()</a></code>
function is running super slow, you should consider parallelizing. See
<code><a href="../articles/parallel.html">vignette("parallel")</a></code> for examples.</p>
</div>
<div class="section level2">
<h2 id="understanding-the-inputs">Understanding the inputs<a class="anchor" aria-label="anchor" href="#understanding-the-inputs"></a>
</h2>
<div class="section level3">
<h3 id="the-input-data">The input data<a class="anchor" aria-label="anchor" href="#the-input-data"></a>
</h3>
<p>The input data to <code><a href="../reference/run_ml.html">run_ml()</a></code> is a dataframe where each row
is a sample or observation. One column (assumed to be the first) is the
outcome of interest, and all of the other columns are the features. We
package <code>otu_mini_bin</code> as a small example dataset with
<code>mikropml</code>.</p>
<div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co"># install.packages("devtools")</span></span>
<span><span class="co"># devtools::install_github("SchlossLab/mikropml")</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://www.schlosslab.org/mikropml/" class="external-link">mikropml</a></span><span class="op">)</span></span>
<span><span class="fu"><a href="https://rdrr.io/r/utils/head.html" class="external-link">head</a></span><span class="op">(</span><span class="va">otu_mini_bin</span><span class="op">)</span></span>
<span><span class="co">#> dx Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006 Otu00007</span></span>
<span><span class="co">#> 1 normal 350 268 213 1 208 230 70</span></span>
<span><span class="co">#> 2 normal 568 1320 13 293 671 103 48</span></span>
<span><span class="co">#> 3 normal 151 756 802 556 145 271 57</span></span>
<span><span class="co">#> 4 normal 299 30 1018 0 25 99 75</span></span>
<span><span class="co">#> 5 normal 1409 174 0 3 2 1136 296</span></span>
<span><span class="co">#> 6 normal 167 712 213 4 332 534 139</span></span>
<span><span class="co">#> Otu00008 Otu00009 Otu00010</span></span>
<span><span class="co">#> 1 230 235 64</span></span>
<span><span class="co">#> 2 204 119 115</span></span>
<span><span class="co">#> 3 176 37 710</span></span>
<span><span class="co">#> 4 78 255 197</span></span>
<span><span class="co">#> 5 1 537 533</span></span>
<span><span class="co">#> 6 251 155 122</span></span></code></pre></div>
<p>Here, <code>dx</code> is the outcome column (normal or cancer), and
there are 10 features (<code>Otu00001</code> through
<code>Otu00010</code>). Because there are only 2 outcomes, we will be
performing binary classification in the majority of the examples below.
At the bottom, we will also briefly provide examples of multi-class and
continuous outcomes. As you’ll see, you run them in the same way as for
binary classification!</p>
<p>The feature columns are the amount of each <a href="https://en.wikipedia.org/wiki/Operational_taxonomic_unit" class="external-link">Operational
Taxonomic Unit (OTU)</a> in microbiome samples from patients with cancer
and without cancer. The goal is to predict <code>dx</code>, which
stands for diagnosis: whether an individual has cancer or not, based on
their microbiome. No need to understand exactly what that means, but if
you’re interested you can read more about it in the original
paper <span class="citation">(Topçuoğlu et al. 2020)</span>.</p>
<p>For real machine learning applications you’ll need to use more
features, but for the purposes of this vignette we’ll stick with this
example dataset so everything runs faster.</p>
</div>
<div class="section level3">
<h3 id="the-methods-we-support">The methods we support<a class="anchor" aria-label="anchor" href="#the-methods-we-support"></a>
</h3>
<p>All of the methods we use are supported by a great ML wrapper package
<a href="https://topepo.github.io/caret/" class="external-link"><code>caret</code></a>, which
we use to train our machine learning models.</p>
<p>The methods we have tested (and their backend packages) are:</p>
<ul>
<li>Logistic/multiclass/linear regression (<code>"glmnet"</code>)</li>
<li>Random forest (<code>"rf"</code>)</li>
<li>Decision tree (<code>"rpart2"</code>)</li>
<li>Support vector machine with a radial basis kernel
(<code>"svmRadial"</code>)</li>
<li>xgboost (<code>"xgbTree"</code>)</li>
</ul>
<p>For documentation on these methods, as well as many others, you can
look at the <a href="https://topepo.github.io/caret/available-models.html" class="external-link">available
models</a> (or see <a href="https://topepo.github.io/caret/train-models-by-tag.html" class="external-link">here</a>
for a list by tag). While we have not vetted the other models used by
<code>caret</code>, our function is general enough that others might
work. While we can’t promise that we can help with other models, feel
free to <a href="https://github.com/SchlossLab/mikropml/discussions" class="external-link">start a new discussion on GitHub</a> if
you have questions about other models and we <em>might</em> be able to
help.</p>
<p>We will first focus on <code>glmnet</code>, which is our default
implementation of L2-regularized logistic regression. Then we will cover
a few other examples towards the end.</p>
</div>
</div>
<div class="section level2">
<h2 id="before-running-ml">Before running ML<a class="anchor" aria-label="anchor" href="#before-running-ml"></a>
</h2>
<p>Before you execute <code><a href="../reference/run_ml.html">run_ml()</a></code>, you should consider
preprocessing your data, either on your own or with the
<code><a href="../reference/preprocess_data.html">preprocess_data()</a></code> function. You can learn more about this
in the preprocessing vignette: <code><a href="../articles/preprocess.html">vignette("preprocess")</a></code>.</p>
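<p>As a minimal sketch (see <code><a href="../articles/preprocess.html">vignette("preprocess")</a></code> for
the full set of options), preprocessing and then training might look
like this; <code>preproc</code> is a name we invent here, and the
processed data is returned in <code>dat_transformed</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># sketch: preprocess, then pass the transformed data to run_ml()
preproc <- preprocess_data(otu_mini_bin, outcome_colname = "dx")
results <- run_ml(preproc$dat_transformed, "glmnet",
  outcome_colname = "dx",
  seed = 2019
)</code></pre></div>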
</div>
<div class="section level2">
<h2 id="the-simplest-way-to-run_ml">The simplest way to <code>run_ml()</code><a class="anchor" aria-label="anchor" href="#the-simplest-way-to-run_ml"></a>
</h2>
<p>As mentioned above, the minimal input is your dataset
(<code>dataset</code>) and the machine learning model you want to use
(<code>method</code>).</p>
<p>You may also want to provide:</p>
<ul>
<li>The outcome column name. By default <code><a href="../reference/run_ml.html">run_ml()</a></code> will pick
the first column, but it’s best practice to specify the column name
explicitly.</li>
<li>A seed so that the results will be reproducible, and so that you get
the same results as those you see here (i.e. have the same train/test
split).</li>
</ul>
<p>Say we want to use logistic regression; the method to specify is
<code>glmnet</code>. To run the ML pipeline:</p>
<div class="sourceCode" id="cb2"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> outcome_colname <span class="op">=</span> <span class="st">"dx"</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>You’ll notice a few things:</p>
<ol style="list-style-type: decimal">
<li>It takes a little while to run. This is because of the repeated
cross-validation we perform by default.</li>
<li>There is a message stating that ‘dx’ is being used as the outcome
column. This is what we want, but it’s a nice sanity check!</li>
<li>There was a warning. Don’t worry about this warning right now - it
just means that some of the hyperparameters aren’t a good fit - but if
you’re interested in learning more, see
<code><a href="../articles/tuning.html">vignette("tuning")</a></code>.</li>
</ol>
<p>Now, let’s dig into the output a bit. The results object is a list of
4 elements:</p>
<div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/names.html" class="external-link">names</a></span><span class="op">(</span><span class="va">results</span><span class="op">)</span></span>
<span><span class="co">#> [1] "trained_model" "test_data" "performance" </span></span>
<span><span class="co">#> [4] "feature_importance"</span></span></code></pre></div>
<p><code>trained_model</code> is the trained model from
<code>caret</code>. There is a bunch of info in this that we won’t get
into, because you can learn more from the <code><a href="https://rdrr.io/pkg/caret/man/train.html" class="external-link">caret::train()</a></code>
documentation.</p>
<div class="sourceCode" id="cb4"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/names.html" class="external-link">names</a></span><span class="op">(</span><span class="va">results</span><span class="op">$</span><span class="va">trained_model</span><span class="op">)</span></span>
<span><span class="co">#> [1] "method" "modelInfo" "modelType" "results" "pred" </span></span>
<span><span class="co">#> [6] "bestTune" "call" "dots" "metric" "control" </span></span>
<span><span class="co">#> [11] "finalModel" "preProcess" "trainingData" "ptype" "resample" </span></span>
<span><span class="co">#> [16] "resampledCM" "perfNames" "maximize" "yLimits" "times" </span></span>
<span><span class="co">#> [21] "levels"</span></span></code></pre></div>
<p><code>test_data</code> is the partition of the dataset that was used
for testing. In machine learning, it’s always important to have a
held-out test dataset that is not used in the training stage. In this
pipeline we do that using <code><a href="../reference/run_ml.html">run_ml()</a></code> where we split your data
into training and testing sets. The training data are used to build the
model (e.g. tune hyperparameters, learn the data) and the test data are
used to evaluate how well the model performs.</p>
<div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/utils/head.html" class="external-link">head</a></span><span class="op">(</span><span class="va">results</span><span class="op">$</span><span class="va">test_data</span><span class="op">)</span></span>
<span><span class="co">#> dx Otu00009 Otu00005 Otu00010 Otu00001 Otu00008 Otu00004 Otu00003</span></span>
<span><span class="co">#> 9 normal 119 142 248 256 363 112 871</span></span>
<span><span class="co">#> 14 normal 60 209 70 86 96 1 123</span></span>
<span><span class="co">#> 16 cancer 205 5 180 1668 95 22 3</span></span>
<span><span class="co">#> 17 normal 188 356 107 381 1035 915 315</span></span>
<span><span class="co">#> 27 normal 4 21 161 7 1 27 8</span></span>
<span><span class="co">#> 30 normal 13 166 5 31 33 5 58</span></span>
<span><span class="co">#> Otu00002 Otu00007 Otu00006</span></span>
<span><span class="co">#> 9 995 0 137</span></span>
<span><span class="co">#> 14 426 54 40</span></span>
<span><span class="co">#> 16 20 590 570</span></span>
<span><span class="co">#> 17 357 253 341</span></span>
<span><span class="co">#> 27 25 322 5</span></span>
<span><span class="co">#> 30 179 6 30</span></span></code></pre></div>
<p><code>performance</code> is a dataframe of (mainly) performance
metrics: 1 column for the cross-validation performance metric, several
for test performance metrics, and 2 columns at the end with the ML
method and seed:</p>
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results</span><span class="op">$</span><span class="va">performance</span></span>
<span><span class="co">#> <span style="color: #949494;"># A tibble: 1 × 17</span></span></span>
<span><span class="co">#> cv_metric_AUC logLoss AUC prAUC Accuracy Kappa F1 Sensi…¹ Speci…² Pos_P…³</span></span>
<span><span class="co">#> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span></span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">1</span> 0.622 0.684 0.647 0.606 0.590 0.179 0.6 0.6 0.579 0.6</span></span>
<span><span class="co">#> <span style="color: #949494;"># … with 7 more variables: Neg_Pred_Value <dbl>, Precision <dbl>, Recall <dbl>,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># Detection_Rate <dbl>, Balanced_Accuracy <dbl>, method <chr>, seed <dbl>,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># and abbreviated variable names ¹Sensitivity, ²Specificity, ³Pos_Pred_Value</span></span></span></code></pre></div>
<p>When using logistic regression for binary classification, area under
the receiver operating characteristic curve (AUC) is a useful metric to
evaluate model performance. Because of that, it’s the default that we
use for <code>mikropml</code>. However, it is crucial to evaluate your
model performance using multiple metrics. Below you can find more
information about other performance metrics and how to use them in our
package.</p>
<p><code>cv_metric_AUC</code> is the AUC for the cross-validation folds
for the training data. This gives us a sense of how well the model
performs on the training data.</p>
<p>Most of the other columns are performance metrics for the test data —
the data that wasn’t used to build the model. Here, you can see that the
AUC for the test data is not much above 0.5, suggesting that this model
does not predict much better than chance, and that the model is overfit
because the cross-validation AUC (<code>cv_metric_AUC</code>, measured
during training) is much higher than the testing AUC. This isn’t too
surprising since we’re using so few features with this example dataset,
so don’t be discouraged. The default option also provides a number of
other performance metrics that you might be interested in, including
area under the precision-recall curve (prAUC).</p>
<p>The last columns of <code>results$performance</code> are the method
and seed (if you set one) to help with combining results from multiple
runs (see <code><a href="../articles/parallel.html">vignette("parallel")</a></code>).</p>
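<p>For instance, performance rows from multiple runs can be stacked into
one dataframe; here <code>results1</code> and <code>results2</code> are
hypothetical outputs of separate <code>run_ml()</code> calls with
different seeds:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># sketch: combine performance metrics across runs;
# the method and seed columns identify which run each row came from
all_performance <- dplyr::bind_rows(
  results1$performance,
  results2$performance
)</code></pre></div>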
<p><code>feature_importance</code> has information about feature
importance values if <code>find_feature_importance = TRUE</code> (the
default is <code>FALSE</code>). Since we used the defaults, there’s
nothing here:</p>
<div class="sourceCode" id="cb7"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results</span><span class="op">$</span><span class="va">feature_importance</span></span>
<span><span class="co">#> [1] "Skipped feature importance"</span></span></code></pre></div>
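<p>To get feature importance, you can turn the flag on. A sketch (note
that this makes <code><a href="../reference/run_ml.html">run_ml()</a></code> take noticeably longer, since
importance is estimated by permuting each feature):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># sketch: request permutation-based feature importance
results_imp <- run_ml(otu_mini_bin, "glmnet",
  outcome_colname = "dx",
  find_feature_importance = TRUE,
  seed = 2019
)
results_imp$feature_importance</code></pre></div>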
</div>
<div class="section level2">
<h2 id="customizing-parameters">Customizing parameters<a class="anchor" aria-label="anchor" href="#customizing-parameters"></a>
</h2>
<p>There are a few arguments that allow you to change how you execute
<code><a href="../reference/run_ml.html">run_ml()</a></code>. We’ve chosen reasonable defaults for you, but we
encourage you to change these if you think something else would be
better for your data.</p>
<div class="section level3">
<h3 id="changing-kfold-cv_times-and-training_frac">Changing <code>kfold</code>, <code>cv_times</code>, and
<code>training_frac</code><a class="anchor" aria-label="anchor" href="#changing-kfold-cv_times-and-training_frac"></a>
</h3>
<ul>
<li>
<code>kfold</code>: The number of folds to run for cross-validation
(default: 5).</li>
<li>
<code>cv_times</code>: The number of times to run repeated
cross-validation (default: 100).</li>
<li>
<code>training_frac</code>: The fraction of data for the training
set (default: 0.8). The rest of the data is used for testing.</li>
</ul>
<p>Here’s an example where we change some of the default parameters:</p>
<div class="sourceCode" id="cb8"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_custom</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> kfold <span class="op">=</span> <span class="fl">2</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> training_frac <span class="op">=</span> <span class="fl">0.5</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Loading required package: ggplot2</span></span>
<span><span class="co">#> Loading required package: lattice</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> Attaching package: 'caret'</span></span>
<span><span class="co">#> The following object is masked from 'package:mikropml':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> compare_models</span></span>
<span><span class="co">#> Warning in (function (w) : `caret::train()` issued the following warning:</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> simpleWarning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> This warning usually means that the model didn't converge in some cross-validation folds because it is predicting something close to a constant. As a result, certain performance metrics can't be calculated. This suggests that some of the hyperparameters chosen are doing very poorly.</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
<p>You might have noticed that this one ran faster — that’s because we
reduced <code>kfold</code> and <code>cv_times</code>. This is okay for
testing things out and may even be necessary for smaller datasets. But
in general it may be better to have larger numbers for these parameters;
we think the defaults are a good starting point <span class="citation">(Topçuoğlu et al. 2020)</span>.</p>
<div class="section level4">
<h4 id="custom-training-indices">Custom training indices<a class="anchor" aria-label="anchor" href="#custom-training-indices"></a>
</h4>
<p>When <code>training_frac</code> is a fraction between 0 and 1, a
random sample of observations in the dataset are chosen for the training
set to satisfy the <code>training_frac</code> using
<code><a href="../reference/get_partition_indices.html">get_partition_indices()</a></code>. However, in some cases you might
wish to control exactly which observations are in the training set. You
can instead assign <code>training_frac</code> a vector of indices that
correspond to which rows of the dataset should go in the training set
(all remaining rows will go in the testing set). Here’s an example
with ~80% of the data in the training set:</p>
<div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">n_obs</span> <span class="op"><-</span> <span class="va">otu_mini_bin</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://rdrr.io/r/base/nrow.html" class="external-link">nrow</a></span><span class="op">(</span><span class="op">)</span></span>
<span><span class="va">training_size</span> <span class="op"><-</span> <span class="fl">0.8</span> <span class="op">*</span> <span class="va">n_obs</span></span>
<span><span class="va">training_rows</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/sample.html" class="external-link">sample</a></span><span class="op">(</span><span class="va">n_obs</span>, <span class="va">training_size</span><span class="op">)</span></span>
<span><span class="va">results_custom_train</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> kfold <span class="op">=</span> <span class="fl">2</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> training_frac <span class="op">=</span> <span class="va">training_rows</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using the custom training set indices provided by `training_frac`.</span></span>
<span><span class="co">#> The fraction of data in the training set will be 0.8</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
</div>
</div>
<div class="section level3">
<h3 id="changing-the-performance-metric">Changing the performance metric<a class="anchor" aria-label="anchor" href="#changing-the-performance-metric"></a>
</h3>
<p>There are two arguments that allow you to change which performance
metric is used for model evaluation, and which performance metrics are
calculated using the test data.</p>
<p><code>perf_metric_function</code> is the function used to calculate
the performance metrics.</p>
<p>The default for classification is
<code><a href="https://rdrr.io/pkg/caret/man/postResample.html" class="external-link">caret::multiClassSummary()</a></code> and the default for regression
is <code><a href="https://rdrr.io/pkg/caret/man/postResample.html" class="external-link">caret::defaultSummary()</a></code>. We’d suggest not changing this
unless you really know what you’re doing.</p>
<p><code>perf_metric_name</code> is the column name from the output of
<code>perf_metric_function</code>. We chose reasonable defaults (AUC for
binary, logLoss for multiclass, and RMSE for continuous), but the
default functions calculate a bunch of different performance metrics, so
you can choose a different one if you’d like.</p>
<p>The default performance metrics available for classification are:</p>
<pre><code><span><span class="co">#> [1] "logLoss" "AUC" "prAUC" </span></span>
<span><span class="co">#> [4] "Accuracy" "Kappa" "Mean_F1" </span></span>
<span><span class="co">#> [7] "Mean_Sensitivity" "Mean_Specificity" "Mean_Pos_Pred_Value" </span></span>
<span><span class="co">#> [10] "Mean_Neg_Pred_Value" "Mean_Precision" "Mean_Recall" </span></span>
<span><span class="co">#> [13] "Mean_Detection_Rate" "Mean_Balanced_Accuracy"</span></span></code></pre>
<p>The default performance metrics available for regression are:</p>
<pre><code><span><span class="co">#> [1] "RMSE" "Rsquared" "MAE"</span></span></code></pre>
<p>Here’s an example using prAUC instead of AUC:</p>
<div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_pr</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> perf_metric_name <span class="op">=</span> <span class="st">"prAUC"</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Warning in (function (w) : `caret::train()` issued the following warning:</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> simpleWarning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> This warning usually means that the model didn't converge in some cross-validation folds because it is predicting something close to a constant. As a result, certain performance metrics can't be calculated. This suggests that some of the hyperparameters chosen are doing very poorly.</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
<p>You’ll see that the cross-validation metric is prAUC, instead of the
default AUC:</p>
<div class="sourceCode" id="cb13"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_pr</span><span class="op">$</span><span class="va">performance</span></span>
<span><span class="co">#> <span style="color: #949494;"># A tibble: 1 × 17</span></span></span>
<span><span class="co">#> cv_metric_p…¹ logLoss AUC prAUC Accur…² Kappa F1 Sensi…³ Speci…⁴ Pos_P…⁵</span></span>
<span><span class="co">#> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span></span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">1</span> 0.577 0.691 0.663 0.605 0.538 0.053<span style="text-decoration: underline;">9</span> 0.690 1 0.052<span style="text-decoration: underline;">6</span> 0.526</span></span>
<span><span class="co">#> <span style="color: #949494;"># … with 7 more variables: Neg_Pred_Value <dbl>, Precision <dbl>, Recall <dbl>,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># Detection_Rate <dbl>, Balanced_Accuracy <dbl>, method <chr>, seed <dbl>,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># and abbreviated variable names ¹cv_metric_prAUC, ²Accuracy, ³Sensitivity,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># ⁴Specificity, ⁵Pos_Pred_Value</span></span></span></code></pre></div>
</div>
<div class="section level3">
<h3 id="using-groups">Using groups<a class="anchor" aria-label="anchor" href="#using-groups"></a>
</h3>
<p>The optional <code>groups</code> is a vector of groups to keep
together when splitting the data into train and test sets and for
cross-validation. Sometimes it’s important to split up the data based on
a grouping instead of just randomly. This allows you to control for
within-group similarities that could otherwise skew your predictions
(e.g. batch effects). For example, with biological data you may have
samples collected from multiple hospitals, and you might like to keep
observations from the same hospital in the same partition.</p>
<p>Here’s an example where we split the data into train/test sets based
on groups:</p>
<div class="sourceCode" id="cb14"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co"># make random groups</span></span>
<span><span class="fu"><a href="https://rdrr.io/r/base/Random.html" class="external-link">set.seed</a></span><span class="op">(</span><span class="fl">2019</span><span class="op">)</span></span>
<span><span class="va">grps</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/sample.html" class="external-link">sample</a></span><span class="op">(</span><span class="va">LETTERS</span><span class="op">[</span><span class="fl">1</span><span class="op">:</span><span class="fl">8</span><span class="op">]</span>, <span class="fu"><a href="https://rdrr.io/r/base/nrow.html" class="external-link">nrow</a></span><span class="op">(</span><span class="va">otu_mini_bin</span><span class="op">)</span>, replace <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span>
<span><span class="va">results_grp</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">2</span>,</span>
<span> training_frac <span class="op">=</span> <span class="fl">0.8</span>,</span>
<span> groups <span class="op">=</span> <span class="va">grps</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Fraction of data in the training set: 0.795 </span></span>
<span><span class="co">#> Groups in the training set: A B D F G H </span></span>
<span><span class="co">#> Groups in the testing set: C E</span></span>
<span><span class="co">#> Groups will be kept together in CV partitions</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
<p>The one difference here is that <code><a href="../reference/run_ml.html">run_ml()</a></code> reports how much
of the data ended up in the training set when you run the above code
chunk. This can be a little finicky depending on how many samples and
groups you have, because the actual fraction won’t be exactly what you
specify with <code>training_frac</code>: all observations from a given
group must go in either the training set <em>or</em> the test set.</p>
<div class="section level4">
<h4 id="controlling-how-groups-are-assigned-to-partitions">Controlling how groups are assigned to partitions<a class="anchor" aria-label="anchor" href="#controlling-how-groups-are-assigned-to-partitions"></a>
</h4>
<p>When you use the <code>groups</code> parameter as above, by default
<code><a href="../reference/run_ml.html">run_ml()</a></code> will assume that you want all of the observations
from each group to be placed in the same partition of the train/test
split. This makes sense when you want to use groups to control for batch
effects. However, in some cases you might prefer to control exactly
which groups end up in which partition, and you might even be okay with
some observations from the same group being assigned to different
partitions.</p>
<p>For example, say you want groups A and B to be used for training, C
and D for testing, and you don’t have a preference for what happens to
the other groups. You can give the <code>group_partitions</code>
parameter a named list to specify which groups should go in the training
set and which should go in the testing set.</p>
<div class="sourceCode" id="cb15"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_grp_part</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">2</span>,</span>
<span> training_frac <span class="op">=</span> <span class="fl">0.8</span>,</span>
<span> groups <span class="op">=</span> <span class="va">grps</span>,</span>
<span> group_partitions <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span></span>
<span> train <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"A"</span>, <span class="st">"B"</span><span class="op">)</span>,</span>
<span> test <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"C"</span>, <span class="st">"D"</span><span class="op">)</span></span>
<span> <span class="op">)</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Fraction of data in the training set: 0.785 </span></span>
<span><span class="co">#> Groups in the training set: A B E F G H </span></span>
<span><span class="co">#> Groups in the testing set: C D</span></span>
<span><span class="co">#> Groups will not be kept together in CV partitions because the number of groups in the training set is not larger than `kfold`</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
<p>In the above case, all observations from A & B will be used for
training, all from C & D will be used for testing, and the remaining
groups will be randomly assigned to one or the other to satisfy the
<code>training_frac</code> as closely as possible.</p>
<p>In another scenario, maybe you want only groups A through F to be
used for training, while also allowing observations from A through F
that weren’t selected for training to be used for testing:</p>
<div class="sourceCode" id="cb16"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_grp_trainA</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">2</span>,</span>
<span> kfold <span class="op">=</span> <span class="fl">2</span>,</span>
<span> training_frac <span class="op">=</span> <span class="fl">0.5</span>,</span>
<span> groups <span class="op">=</span> <span class="va">grps</span>,</span>
<span> group_partitions <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/list.html" class="external-link">list</a></span><span class="op">(</span></span>
<span> train <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"A"</span>, <span class="st">"B"</span>, <span class="st">"C"</span>, <span class="st">"D"</span>, <span class="st">"E"</span>, <span class="st">"F"</span><span class="op">)</span>,</span>
<span> test <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"A"</span>, <span class="st">"B"</span>, <span class="st">"C"</span>, <span class="st">"D"</span>, <span class="st">"E"</span>, <span class="st">"F"</span>, <span class="st">"G"</span>, <span class="st">"H"</span><span class="op">)</span></span>
<span> <span class="op">)</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Fraction of data in the training set: 0.5 </span></span>
<span><span class="co">#> Groups in the training set: A B C D E F </span></span>
<span><span class="co">#> Groups in the testing set: A B C D E F G H</span></span>
<span><span class="co">#> Groups will be kept together in CV partitions</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
<p>If you need even more control than this, take a look at <a href="#custom-training-indices">setting custom training indices</a>. You
might also prefer to provide your own train control scheme with the
<code>cross_val</code> parameter in <code><a href="../reference/run_ml.html">run_ml()</a></code>.</p>
</div>
</div>
<div class="section level3">
<h3 id="more-arguments">More arguments<a class="anchor" aria-label="anchor" href="#more-arguments"></a>
</h3>
<p>Some ML methods take optional arguments, such as <code>ntree</code>
for <code>randomForest</code>-based models or <a href="https://topepo.github.io/caret/train-models-by-tag.html#Accepts_Case_Weights" class="external-link">case
weights</a>. Any additional arguments you give to <code><a href="../reference/run_ml.html">run_ml()</a></code>
are forwarded along to <code><a href="https://rdrr.io/pkg/caret/man/train.html" class="external-link">caret::train()</a></code> so you can leverage
those options.</p>
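<p>For example, to change the number of trees grown by a random forest model, you can pass <code>ntree</code> directly to <code>run_ml()</code> and it will be forwarded through <code>caret::train()</code> to the underlying <code>randomForest</code> engine. This is a sketch; <code>ntree = 1000</code> is an arbitrary value chosen for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R">library(mikropml)
# ntree is not a run_ml() argument; it is passed through to caret::train(),
# which hands it to randomForest::randomForest()
results_rf_ntree <- run_ml(otu_mini_bin,
  "rf",
  cv_times = 2,
  ntree = 1000, # grow 1000 trees instead of the randomForest default (500)
  seed = 2019
)</code></pre></div>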
<div class="section level4">
<h4 id="case-weights">Case weights<a class="anchor" aria-label="anchor" href="#case-weights"></a>
</h4>
<p>If you want to use case weights, you will also need to use custom
indices for the training data (i.e. perform the partition before
<code>run_ml()</code> as <a href="#custom-training-indices">above</a>).
Here’s one way to do this with the weights calculated from the
proportion of each class in the data set, with ~70% of the data in the
training set:</p>
<div class="sourceCode" id="cb17"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/Random.html" class="external-link">set.seed</a></span><span class="op">(</span><span class="fl">20221016</span><span class="op">)</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span><span class="op">)</span></span>
<span><span class="va">train_set_indices</span> <span class="op"><-</span> <span class="fu"><a href="../reference/get_partition_indices.html">get_partition_indices</a></span><span class="op">(</span><span class="va">otu_mini_bin</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="va">dx</span><span class="op">)</span>,</span>
<span> training_frac <span class="op">=</span> <span class="fl">0.70</span></span>
<span><span class="op">)</span></span>
<span><span class="va">case_weights_dat</span> <span class="op"><-</span> <span class="va">otu_mini_bin</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/count.html" class="external-link">count</a></span><span class="op">(</span><span class="va">dx</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>p <span class="op">=</span> <span class="va">n</span> <span class="op">/</span> <span class="fu"><a href="https://rdrr.io/r/base/sum.html" class="external-link">sum</a></span><span class="op">(</span><span class="va">n</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">dx</span>, <span class="va">p</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html" class="external-link">right_join</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>, by <span class="op">=</span> <span class="st">"dx"</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="op">-</span><span class="fu"><a href="https://tidyselect.r-lib.org/reference/starts_with.html" class="external-link">starts_with</a></span><span class="op">(</span><span class="st">"Otu"</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span></span>
<span> row_num <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/row_number.html" class="external-link">row_number</a></span><span class="op">(</span><span class="op">)</span>,</span>
<span> in_train <span class="op">=</span> <span class="va">row_num</span> <span class="op"><a href="https://rdrr.io/r/base/match.html" class="external-link">%in%</a></span> <span class="va">train_set_indices</span></span>
<span> <span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">in_train</span><span class="op">)</span></span>
<span><span class="co">#> Warning in right_join(., otu_mini_bin, by = "dx"): Each row in `x` is expected to match at most 1 row in `y`.</span></span>
<span><span class="co">#> <span style="color: #00BBBB;">ℹ</span> Row 1 of `x` matches multiple rows.</span></span>
<span><span class="co">#> <span style="color: #00BBBB;">ℹ</span> If multiple matches are expected, set `multiple = "all"` to silence this</span></span>
<span><span class="co">#> warning.</span></span>
<span><span class="fu"><a href="https://rdrr.io/r/utils/head.html" class="external-link">head</a></span><span class="op">(</span><span class="va">case_weights_dat</span><span class="op">)</span></span>
<span><span class="co">#> dx p row_num in_train</span></span>
<span><span class="co">#> 1 cancer 0.49 1 TRUE</span></span>
<span><span class="co">#> 2 cancer 0.49 2 TRUE</span></span>
<span><span class="co">#> 3 cancer 0.49 3 TRUE</span></span>
<span><span class="co">#> 4 cancer 0.49 4 TRUE</span></span>
<span><span class="co">#> 5 cancer 0.49 5 TRUE</span></span>
<span><span class="co">#> 6 cancer 0.49 6 TRUE</span></span>
<span><span class="fu"><a href="https://rdrr.io/r/utils/head.html" class="external-link">tail</a></span><span class="op">(</span><span class="va">case_weights_dat</span><span class="op">)</span></span>
<span><span class="co">#> dx p row_num in_train</span></span>
<span><span class="co">#> 136 normal 0.51 194 TRUE</span></span>
<span><span class="co">#> 137 normal 0.51 195 TRUE</span></span>
<span><span class="co">#> 138 normal 0.51 196 TRUE</span></span>
<span><span class="co">#> 139 normal 0.51 197 TRUE</span></span>
<span><span class="co">#> 140 normal 0.51 198 TRUE</span></span>
<span><span class="co">#> 141 normal 0.51 200 TRUE</span></span>
<span><span class="fu"><a href="https://rdrr.io/r/base/nrow.html" class="external-link">nrow</a></span><span class="op">(</span><span class="va">case_weights_dat</span><span class="op">)</span> <span class="op">/</span> <span class="fu"><a href="https://rdrr.io/r/base/nrow.html" class="external-link">nrow</a></span><span class="op">(</span><span class="va">otu_mini_bin</span><span class="op">)</span></span>
<span><span class="co">#> [1] 0.705</span></span></code></pre></div>
<div class="sourceCode" id="cb18"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_weighted</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> outcome_colname <span class="op">=</span> <span class="st">"dx"</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span>,</span>
<span> training_frac <span class="op">=</span> <span class="va">case_weights_dat</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="va">row_num</span><span class="op">)</span>,</span>
<span> weights <span class="op">=</span> <span class="va">case_weights_dat</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="va">p</span><span class="op">)</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>See the caret docs for <a href="https://topepo.github.io/caret/train-models-by-tag.html#Accepts_Case_Weights" class="external-link">a
list of models that accept case weights</a>.</p>
</div>
</div>
</div>
<div class="section level2">
<h2 id="finding-feature-importance">Finding feature importance<a class="anchor" aria-label="anchor" href="#finding-feature-importance"></a>
</h2>
<p>To find which features are contributing to predictive power, you can
use <code>find_feature_importance = TRUE</code>. How we use permutation
importance to determine feature importance is described in <span class="citation">(Topçuoğlu et al. 2020)</span>. Briefly, it permutes
each of the features individually (or correlated ones together) and
evaluates how much the performance metric decreases. The more
performance decreases when the feature is randomly shuffled, the more
important that feature is. The default is <code>FALSE</code> because it
takes a while to run and is only useful if you want to know what
features are important in predicting your outcome.</p>
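<p>The core idea of permutation importance can be sketched in a few lines of base R. This is a simplified illustration using the built-in <code>mtcars</code> data and a single permutation per feature, not the mikropml implementation, which repeats the permutations to estimate p-values and confidence intervals and uses your chosen <code>perf_metric_function</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode R"># toy permutation importance: how much does accuracy drop when we
# shuffle one feature, breaking its relationship with the outcome?
set.seed(2019)
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
acc <- function(model, dat) {
  mean((predict(model, dat, type = "response") > 0.5) == dat$am)
}
baseline <- acc(fit, mtcars)
perm_importance <- sapply(c("mpg", "wt"), function(feat) {
  dat_perm <- mtcars
  dat_perm[[feat]] <- sample(dat_perm[[feat]]) # shuffle one feature
  baseline - acc(fit, dat_perm) # drop in accuracy = importance
})
perm_importance</code></pre></div>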
<p>Let’s look at some feature importance results:</p>
<div class="sourceCode" id="cb19"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_imp</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"rf"</span>,</span>
<span> outcome_colname <span class="op">=</span> <span class="st">"dx"</span>,</span>
<span> find_feature_importance <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>Now, we can check out the feature importances:</p>
<div class="sourceCode" id="cb20"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_imp</span><span class="op">$</span><span class="va">feature_importance</span></span>
<span><span class="co">#> perf_metric perf_metric_diff pvalue lower upper feat method</span></span>
<span><span class="co">#> 1 0.5459125 0.0003375 0.51485149 0.49125 0.60250 Otu00001 rf</span></span>
<span><span class="co">#> 2 0.5682625 -0.0220125 0.73267327 0.50625 0.63125 Otu00002 rf</span></span>
<span><span class="co">#> 3 0.5482875 -0.0020375 0.56435644 0.50500 0.59000 Otu00003 rf</span></span>
<span><span class="co">#> 4 0.6314375 -0.0851875 1.00000000 0.55250 0.71250 Otu00004 rf</span></span>
<span><span class="co">#> 5 0.4991750 0.0470750 0.08910891 0.44125 0.57125 Otu00005 rf</span></span>
<span><span class="co">#> 6 0.5364875 0.0097625 0.28712871 0.50125 0.57375 Otu00006 rf</span></span>
<span><span class="co">#> 7 0.5382875 0.0079625 0.39603960 0.47500 0.58750 Otu00007 rf</span></span>
<span><span class="co">#> 8 0.5160500 0.0302000 0.09900990 0.46750 0.55750 Otu00008 rf</span></span>
<span><span class="co">#> 9 0.5293375 0.0169125 0.17821782 0.49500 0.55625 Otu00009 rf</span></span>
<span><span class="co">#> 10 0.4976500 0.0486000 0.12871287 0.41000 0.56250 Otu00010 rf</span></span>
<span><span class="co">#> perf_metric_name seed</span></span>
<span><span class="co">#> 1 AUC 2019</span></span>
<span><span class="co">#> 2 AUC 2019</span></span>
<span><span class="co">#> 3 AUC 2019</span></span>
<span><span class="co">#> 4 AUC 2019</span></span>
<span><span class="co">#> 5 AUC 2019</span></span>
<span><span class="co">#> 6 AUC 2019</span></span>
<span><span class="co">#> 7 AUC 2019</span></span>
<span><span class="co">#> 8 AUC 2019</span></span>
<span><span class="co">#> 9 AUC 2019</span></span>
<span><span class="co">#> 10 AUC 2019</span></span></code></pre></div>
<p>There are several columns:</p>
<ol style="list-style-type: decimal">
<li>
<code>perf_metric</code>: The performance value of the permuted
feature.</li>
<li>
<code>perf_metric_diff</code>: The difference between the
performance for the actual and permuted data (i.e. test performance
minus permuted performance). Features with a larger
<code>perf_metric_diff</code> are more important.</li>
<li>
<code>pvalue</code>: The probability of obtaining the actual
performance value under the null hypothesis.</li>
<li>
<code>lower</code>: The lower bound for the 95% confidence interval
of <code>perf_metric</code>.</li>
<li>
<code>upper</code>: The upper bound for the 95% confidence interval
of <code>perf_metric</code>.</li>
<li>
<code>feat</code>: The feature (or group of correlated features)
that was permuted.</li>
<li>
<code>method</code>: The <a href="#the-methods-we-support">ML
method</a> used.</li>
<li>
<code>perf_metric_name</code>: The <a href="#id_-changing-the-performance-metric">name of the performance
metric</a> represented by <code>perf_metric</code> &
<code>perf_metric_diff</code>.</li>
<li>
<code>seed</code>: The seed (if set).</li>
</ol>
<p>As you can see here, the differences are negligible (close to zero),
which makes sense since our model isn’t great. If you’re interested in
feature importance, it’s especially useful to run multiple different
train/test splits, as shown in our <a href="https://github.com/SchlossLab/mikropml-snakemake-workflow/" class="external-link">example
snakemake workflow</a>.</p>
<p>You can also choose to permute correlated features together using
<code>corr_thresh</code> (default: 1). Any features that are above the
correlation threshold are permuted together; i.e. perfectly correlated
features are permuted together when using the default value.</p>
<div class="sourceCode" id="cb21"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_imp_corr</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> find_feature_importance <span class="op">=</span> <span class="cn">TRUE</span>,</span>
<span> corr_thresh <span class="op">=</span> <span class="fl">0.2</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Warning in (function (w) : `caret::train()` issued the following warning:</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> simpleWarning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> This warning usually means that the model didn't converge in some cross-validation folds because it is predicting something close to a constant. As a result, certain performance metrics can't be calculated. This suggests that some of the hyperparameters chosen are doing very poorly.</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="va">results_imp_corr</span><span class="op">$</span><span class="va">feature_importance</span></span>
<span><span class="co">#> perf_metric perf_metric_diff pvalue lower upper</span></span>
<span><span class="co">#> 1 0.4941842 0.1531842 0.05940594 0.3236842 0.6473684</span></span>
<span><span class="co">#> feat</span></span>
<span><span class="co">#> 1 Otu00001|Otu00002|Otu00003|Otu00004|Otu00005|Otu00006|Otu00007|Otu00008|Otu00009|Otu00010</span></span>
<span><span class="co">#> method perf_metric_name seed</span></span>
<span><span class="co">#> 1 glmnet AUC 2019</span></span></code></pre></div>
<p>You can see which features were permuted together in the
<code>feat</code> column. Here, all 10 features were permuted together
as a single group (which doesn’t really make sense at such a low
threshold, but it’s just an example).</p>
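<p>To illustrate how <code>corr_thresh</code> groups features, here is a
rough base-R sketch of the idea (not mikropml’s internal code; the toy
data and the simple greedy grouping are stand-ins):</p>

```r
# Toy sketch: group features whose absolute pairwise correlation is at or
# above corr_thresh (illustrative only; not mikropml's implementation).
set.seed(2019)
x1 <- rnorm(50)
feats <- data.frame(a = x1, b = x1 + rnorm(50, sd = 0.01), c = rnorm(50))

group_correlated <- function(feats, corr_thresh = 1) {
  cmat <- abs(cor(feats)) >= corr_thresh
  groups <- list()
  remaining <- colnames(feats)
  while (length(remaining) > 0) {
    f <- remaining[1]
    grp <- remaining[cmat[f, remaining]] # everything correlated with f
    groups <- c(groups, list(grp))
    remaining <- setdiff(remaining, grp)
  }
  vapply(groups, paste, character(1), collapse = "|")
}

group_correlated(feats, corr_thresh = 0.9)
# "a|b" "c"  -- a and b would be permuted together
```

<p>With the default <code>corr_thresh = 1</code>, <code>a</code> and
<code>b</code> are only nearly (not exactly) correlated, so each feature
ends up in its own group.</p>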
<p>If you previously executed <code><a href="../reference/run_ml.html">run_ml()</a></code> without feature
importance but now wish to find feature importance after the fact, see
the example code in the <code><a href="../reference/get_feature_importance.html">get_feature_importance()</a></code>
documentation.</p>
<p><code><a href="../reference/get_feature_importance.html">get_feature_importance()</a></code> can show a live progress bar;
see <code><a href="../articles/parallel.html">vignette("parallel")</a></code> for examples.</p>
</div>
<div class="section level2">
<h2 id="tuning-hyperparameters-using-the-hyperparameter-argument">Tuning hyperparameters (using the <code>hyperparameters</code>
argument)<a class="anchor" aria-label="anchor" href="#tuning-hyperparameters-using-the-hyperparameter-argument"></a>
</h2>
<p>Hyperparameter tuning is important, so we have a whole vignette about
it. The bottom line is that we provide default hyperparameters you can
start with, but it’s important to tune them for your dataset. For more
information about the default hyperparameters and how to tune them, see
<code><a href="../articles/tuning.html">vignette("tuning")</a></code>.</p>
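<p>For a quick flavor of the interface, custom values can be supplied to
<code><a href="../reference/run_ml.html">run_ml()</a></code> as a named list of vectors via the
<code>hyperparameters</code> argument. The values below are purely
illustrative, not recommendations; see
<code><a href="../articles/tuning.html">vignette("tuning")</a></code> for sensible choices:</p>

```r
# Illustrative only: supply a custom hyperparameter grid to run_ml().
# These values are arbitrary; see vignette("tuning") for guidance.
results_custom_hp <- run_ml(otu_mini_bin,
  "glmnet",
  cv_times = 5,
  hyperparameters = list(
    alpha = 0, # ridge penalty only
    lambda = c(0.01, 0.1, 1, 10) # regularization strengths to try
  ),
  seed = 2019
)
```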
</div>
<div class="section level2">
<h2 id="other-models">Other models<a class="anchor" aria-label="anchor" href="#other-models"></a>
</h2>
<p>Here are examples of how to train and evaluate other models. The
output for all of them is very similar, so we won’t go into those
details.</p>
<div class="section level3">
<h3 id="random-forest">Random forest<a class="anchor" aria-label="anchor" href="#random-forest"></a>
</h3>
<div class="sourceCode" id="cb22"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_rf</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"rf"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>The <code>rf</code> method takes an optional argument
<code>ntree</code>: the number of trees to use for random forest. This
can’t be tuned with the underlying <code>randomForest</code> package
implementation. Please refer to the <code>caret</code> documentation if
you are interested in other packages with random forest
implementations.</p>
<div class="sourceCode" id="cb23"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_rf_nt</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"rf"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> ntree <span class="op">=</span> <span class="fl">1000</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
</div>
<div class="section level3">
<h3 id="decision-tree">Decision tree<a class="anchor" aria-label="anchor" href="#decision-tree"></a>
</h3>
<div class="sourceCode" id="cb24"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_dt</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"rpart2"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
</div>
<div class="section level3">
<h3 id="svm">SVM<a class="anchor" aria-label="anchor" href="#svm"></a>
</h3>
<div class="sourceCode" id="cb25"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_svm</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>,</span>
<span> <span class="st">"svmRadial"</span>,</span>
<span> cv_times <span class="op">=</span> <span class="fl">5</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>If you get a message “maximum number of iterations reached”, see <a href="https://github.com/topepo/caret/issues/425" class="external-link">this issue</a> in
caret.</p>
</div>
</div>
<div class="section level2">
<h2 id="other-data">Other data<a class="anchor" aria-label="anchor" href="#other-data"></a>
</h2>
<div class="section level3">
<h3 id="multiclass-data">Multiclass data<a class="anchor" aria-label="anchor" href="#multiclass-data"></a>
</h3>
<p>We provide <code>otu_mini_multi</code> with a multiclass outcome
(three or more outcomes):</p>
<div class="sourceCode" id="cb26"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">otu_mini_multi</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="st">"dx"</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/unique.html" class="external-link">unique</a></span><span class="op">(</span><span class="op">)</span></span>
<span><span class="co">#> [1] "adenoma" "carcinoma" "normal"</span></span></code></pre></div>
<p>Here’s an example of running multiclass data:</p>
<div class="sourceCode" id="cb27"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_multi</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_multi</span>,</span>
<span> outcome_colname <span class="op">=</span> <span class="st">"dx"</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>The performance metrics are slightly different, but the format of
everything else is the same:</p>
<div class="sourceCode" id="cb28"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_multi</span><span class="op">$</span><span class="va">performance</span></span>
<span><span class="co">#> <span style="color: #949494;"># A tibble: 1 × 17</span></span></span>
<span><span class="co">#> cv_metric…¹ logLoss AUC prAUC Accur…² Kappa Mean_F1 Mean_…³ Mean_…⁴ Mean_…⁵</span></span>
<span><span class="co">#> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> </span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">1</span> 1.07 1.11 0.506 0.353 0.382 0.044<span style="text-decoration: underline;">9</span> <span style="color: #BB0000;">NA</span> 0.360 0.682 NaN </span></span>
<span><span class="co">#> <span style="color: #949494;"># … with 7 more variables: Mean_Neg_Pred_Value <dbl>, Mean_Precision <chr>,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># Mean_Recall <dbl>, Mean_Detection_Rate <dbl>, Mean_Balanced_Accuracy <dbl>,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># method <chr>, seed <dbl>, and abbreviated variable names</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># ¹cv_metric_logLoss, ²Accuracy, ³Mean_Sensitivity, ⁴Mean_Specificity,</span></span></span>
<span><span class="co">#> <span style="color: #949494;"># ⁵Mean_Pos_Pred_Value</span></span></span></code></pre></div>
</div>
<div class="section level3">
<h3 id="continuous-data">Continuous data<a class="anchor" aria-label="anchor" href="#continuous-data"></a>
</h3>
<p>And here’s an example of running continuous data, where the outcome
column is numerical:</p>
<div class="sourceCode" id="cb29"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_cont</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_mini_bin</span><span class="op">[</span>, <span class="fl">2</span><span class="op">:</span><span class="fl">11</span><span class="op">]</span>,</span>
<span> <span class="st">"glmnet"</span>,</span>
<span> outcome_colname <span class="op">=</span> <span class="st">"Otu00001"</span>,</span>
<span> seed <span class="op">=</span> <span class="fl">2019</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>Again, the performance metrics are slightly different, but the format
of the rest is the same:</p>
<div class="sourceCode" id="cb30"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">results_cont</span><span class="op">$</span><span class="va">performance</span></span>
<span><span class="co">#> <span style="color: #949494;"># A tibble: 1 × 6</span></span></span>
<span><span class="co">#> cv_metric_RMSE RMSE Rsquared MAE method seed</span></span>
<span><span class="co">#> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><dbl></span></span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">1</span> 622. 731. 0.089<span style="text-decoration: underline;">3</span> 472. glmnet <span style="text-decoration: underline;">2</span>019</span></span></code></pre></div>
</div>
</div>
</main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2>
</nav></aside>
</div>
<footer><div class="pkgdown-footer-left">
<p></p>
<p>Developed by <a href="https://github.com/BTopcuoglu" class="external-link">Begüm Topçuoğlu</a>, <a href="https://github.com/zenalapp" class="external-link">Zena Lapp</a>, <a href="https://github.com/kelly-sovacool" class="external-link">Kelly Sovacool</a>, Evan Snitkin, Jenna Wiens, <a href="https://github.com/pschloss" class="external-link">Patrick Schloss</a>.</p>
</div>
<div class="pkgdown-footer-right">
<p></p>
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.7.</p>
</div>
</footer>
</div>
</body>
</html>