<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="mikropml">
<title>Parallel processing • mikropml</title>
<!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../favicon-16x16.png">
<link rel="icon" type="image/png" sizes="32x32" href="../favicon-32x32.png">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../apple-touch-icon.png">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../apple-touch-icon-60x60.png">
<script src="../deps/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link href="../deps/bootstrap-5.2.2/bootstrap.min.css" rel="stylesheet">
<script src="../deps/bootstrap-5.2.2/bootstrap.bundle.min.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- bootstrap-toc --><script src="https://cdn.jsdelivr.net/gh/afeld/bootstrap-toc@v1.0.1/dist/bootstrap-toc.min.js" integrity="sha256-4veVQbu7//Lk5TSmc7YV48MxtMy98e26cf5MrgZYnwo=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- search --><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- pkgdown --><script src="../pkgdown.js"></script><meta property="og:title" content="Parallel processing">
<meta property="og:description" content="mikropml">
<meta property="og:image" content="http://www.schlosslab.org/mikropml/logo.png">
<meta name="robots" content="noindex">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<a href="#main" class="visually-hidden-focusable">Skip to contents</a>
<nav class="navbar fixed-top navbar-light navbar-expand-lg bg-light"><div class="container">
<a class="navbar-brand me-2" href="../index.html">mikropml</a>
<small class="nav-text text-danger me-auto" data-bs-toggle="tooltip" data-bs-placement="bottom" title="In-development version">1.5.0.9000</small>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbar" aria-controls="navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar" class="collapse navbar-collapse ms-3">
<ul class="navbar-nav me-auto">
<li class="nav-item">
<a class="nav-link" href="../reference/index.html">Reference</a>
</li>
<li class="active nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" role="button" aria-expanded="false" aria-haspopup="true" id="dropdown-articles">Articles</a>
<div class="dropdown-menu" aria-labelledby="dropdown-articles">
<h6 class="dropdown-header" data-toc-skip>Paper</h6>
<a class="dropdown-item" href="../articles/paper.html">mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines</a>
<div class="dropdown-divider"></div>
<h6 class="dropdown-header" data-toc-skip>Vignettes</h6>
<a class="dropdown-item" href="../articles/introduction.html">Introduction to mikropml</a>
<a class="dropdown-item" href="../articles/preprocess.html">Preprocessing data</a>
<a class="dropdown-item" href="../articles/tuning.html">Hyperparameter tuning</a>
<a class="dropdown-item" href="../articles/parallel.html">Parallel processing</a>
</div>
</li>
<li class="nav-item">
<a class="nav-link" href="../news/index.html">Changelog</a>
</li>
</ul>
<form class="form-inline my-2 my-lg-0" role="search">
<input type="search" class="form-control me-sm-2" aria-label="Toggle navigation" name="search-input" data-search-index="../search.json" id="search-input" placeholder="Search for" autocomplete="off">
</form>
<ul class="navbar-nav">
<li class="nav-item">
<a class="external-link nav-link" href="https://github.com/SchlossLab/mikropml/" aria-label="github">
<span class="fab fa fab fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
</div>
</nav><div class="container template-article">
<div class="row">
<main id="main" class="col-md-9"><div class="page-header">
<img src="../logo.png" class="logo" alt=""><h1>Parallel processing</h1>
<h4 data-toc-skip class="author">Kelly L.
Sovacool</h4>
<small class="dont-index">Source: <a href="https://github.com/SchlossLab/mikropml/blob/HEAD/vignettes/parallel.Rmd" class="external-link"><code>vignettes/parallel.Rmd</code></a></small>
<div class="d-none name"><code>parallel.Rmd</code></div>
</div>
<p>In this tutorial, we show how you can speed up pre-processing, model
training, and feature importance steps for individual runs, as well as
how to train multiple models in parallel within R and visualize the
results. However, we highly recommend using a workflow manager such as
Snakemake rather than parallelizing within a single R session. Jump to
the section <a href="#parallelizing-with-snakemake">Parallelizing with
Snakemake</a> below if you’re interested in skipping right to our best
recommendation.</p>
<div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://www.schlosslab.org/mikropml/" class="external-link">mikropml</a></span><span class="op">)</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://dplyr.tidyverse.org" class="external-link">dplyr</a></span><span class="op">)</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> Attaching package: 'dplyr'</span></span>
<span><span class="co">#> The following objects are masked from 'package:stats':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> filter, lag</span></span>
<span><span class="co">#> The following objects are masked from 'package:base':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> intersect, setdiff, setequal, union</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://ggplot2.tidyverse.org" class="external-link">ggplot2</a></span><span class="op">)</span></span></code></pre></div>
<div class="section level2">
<h2 id="speed-up-single-runs">Speed up single runs<a class="anchor" aria-label="anchor" href="#speed-up-single-runs"></a>
</h2>
<p>By default, <code><a href="../reference/preprocess_data.html">preprocess_data()</a></code>, <code><a href="../reference/run_ml.html">run_ml()</a></code>,
and <code><a href="../reference/compare_models.html">compare_models()</a></code> use only one process in series. If
you’d like to parallelize various steps of the pipeline to make them run
faster, install <code>foreach</code>, <code>future</code>,
<code>future.apply</code>, and <code>doFuture</code>. Then, register a
future plan prior to calling these functions:</p>
<div class="sourceCode" id="cb2"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu">doFuture</span><span class="fu">::</span><span class="fu"><a href="https://doFuture.futureverse.org/reference/registerDoFuture.html" class="external-link">registerDoFuture</a></span><span class="op">(</span><span class="op">)</span></span>
<span><span class="fu">future</span><span class="fu">::</span><span class="fu"><a href="https://future.futureverse.org/reference/plan.html" class="external-link">plan</a></span><span class="op">(</span><span class="fu">future</span><span class="fu">::</span><span class="va"><a href="https://future.futureverse.org/reference/multicore.html" class="external-link">multicore</a></span>, workers <span class="op">=</span> <span class="fl">2</span><span class="op">)</span></span></code></pre></div>
<p>Above, we used the <code>multicore</code> plan to split the work
across 2 cores. See the <a href="https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html" class="external-link"><code>future</code>
documentation</a> for more about picking the best plan for your use
case. Notably, <code>multicore</code> does not work inside RStudio or on
Windows; you will need to use <code>multisession</code> instead in those
cases.</p>
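<p>As a minimal sketch of the choice described above, the snippet below
picks <code>multicore</code> when forking is supported and falls back to
<code>multisession</code> otherwise. It assumes the <code>future</code>
package is installed and relies on <code>future::supportsMulticore()</code>;
the variable name <code>n_workers</code> and the worker count of 2 are
arbitrary choices for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Pick a plan that works in the current environment (illustrative sketch).
n_workers <- 2
if (future::supportsMulticore()) {
  # Forked processes: available on Unix-alikes outside RStudio.
  future::plan(future::multicore, workers = n_workers)
} else {
  # Separate background R sessions: works on Windows and inside RStudio.
  future::plan(future::multisession, workers = n_workers)
}
# Check how many workers the registered plan will use.
future::nbrOfWorkers()</code></pre></div>
<p>To return to sequential processing afterwards, call
<code>future::plan(future::sequential)</code>.</p>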
<p>After registering a future plan, you can call
<code><a href="../reference/preprocess_data.html">preprocess_data()</a></code> and <code><a href="../reference/run_ml.html">run_ml()</a></code> as usual, and
they will run certain tasks in parallel.</p>
<div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">otu_data_preproc</span> <span class="op"><-</span> <span class="fu"><a href="../reference/preprocess_data.html">preprocess_data</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>, <span class="st">"dx"</span><span class="op">)</span><span class="op">$</span><span class="va">dat_transformed</span></span>
<span><span class="va">result1</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_data_preproc</span>, <span class="st">"glmnet"</span>, seed <span class="op">=</span> <span class="fl">2019</span><span class="op">)</span></span></code></pre></div>
<p>There’s also a parallel version of the <code>rf</code> engine
called <code>parRF</code> which trains the trees in the forest in
parallel. See the <a href="https://topepo.github.io/caret/train-models-by-tag.html#Random_Forest" class="external-link">caret
docs</a> for more information.</p>
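<p>A hypothetical usage sketch follows. It assumes that
<code>run_ml()</code> accepts <code>"parRF"</code> as its method in the
same way as <code>"rf"</code>, that a future plan has been registered as
shown above, and that the packages caret needs for this engine (such as
<code>randomForest</code>) are installed. The name
<code>result_parrf</code> is made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Hypothetical usage: swap "rf" for "parRF" to train the trees in parallel.
# Assumes "parRF" is accepted as a method and a future plan is registered.
result_parrf <- run_ml(otu_data_preproc, "parRF", seed = 2019)
result_parrf$performance</code></pre></div>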
<div class="section level3">
<h3 id="bootstrap-performance">Bootstrap performance<a class="anchor" aria-label="anchor" href="#bootstrap-performance"></a>
</h3>
<p>If you only intend to call <code><a href="../reference/run_ml.html">run_ml()</a></code> once to generate one
train/test split (e.g. for a temporal split of the dataset), you
can evaluate the model performance by bootstrapping the test set.</p>
<p>Here we show how to generate <code>100</code> bootstraps and
calculate a confidence interval for the model performance. We only use
<code>100</code> here for computation speed, but it is recommended to
generate <code>10000</code> bootstraps for a more precise estimation of
the confidence interval.</p>
<div class="sourceCode" id="cb4"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">boot_perf</span> <span class="op"><-</span> <span class="fu"><a href="../reference/bootstrap_performance.html">bootstrap_performance</a></span><span class="op">(</span><span class="va">result1</span>,</span>
<span> outcome_colname <span class="op">=</span> <span class="st">"dx"</span>,</span>
<span> bootstrap_times <span class="op">=</span> <span class="fl">100</span>, alpha <span class="op">=</span> <span class="fl">0.05</span></span>
<span><span class="op">)</span></span>
<span><span class="va">boot_perf</span></span>
<span><span class="co">#> <span style="color: #949494;"># A tibble: 15 × 6</span></span></span>
<span><span class="co">#> term .lower .estimate .upper .alpha .method </span></span>
<span><span class="co">#> <span style="color: #949494; font-style: italic;"><chr></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> </span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 1</span> AUC 0.434 0.639 0.820 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 2</span> Accuracy 0.422 0.583 0.744 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 3</span> Balanced_Accuracy 0.431 0.586 0.749 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 4</span> Detection_Rate 0.179 0.299 0.449 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 5</span> F1 0.412 0.585 0.762 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 6</span> Kappa -<span style="color: #BB0000;">0.132</span> 0.167 0.486 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 7</span> Neg_Pred_Value 0.375 0.572 0.807 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 8</span> Pos_Pred_Value 0.395 0.599 0.855 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;"> 9</span> Precision 0.395 0.599 0.855 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">10</span> Recall 0.375 0.584 0.824 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">11</span> Sensitivity 0.375 0.584 0.824 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">12</span> Specificity 0.379 0.587 0.823 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">13</span> cv_metric_AUC 0.622 0.622 0.622 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">14</span> logLoss 0.660 0.685 0.714 0.05 percentile</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">15</span> prAUC 0.442 0.583 0.734 0.05 percentile</span></span></code></pre></div>
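<p>The <code>.method</code> column reports percentile intervals. To
illustrate the idea, here is a minimal base-R sketch of a percentile
bootstrap on made-up data. It is not the implementation used by
<code>bootstrap_performance()</code>; the labels, the predictions, and
the <code>accuracy_of()</code> helper are all hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R">set.seed(2019)
# Made-up test-set labels and predictions (hypothetical data).
truth <- sample(c("cancer", "normal"), size = 40, replace = TRUE)
pred <- ifelse(runif(40) > 0.3, truth, sample(truth))
accuracy_of <- function(idx) mean(truth[idx] == pred[idx])

# Resample test-set rows with replacement and recompute the metric each time.
boot_acc <- replicate(100, accuracy_of(sample(seq_along(truth), replace = TRUE)))

# Percentile interval at alpha = 0.05: the 2.5% and 97.5% quantiles.
alpha <- 0.05
quantile(boot_acc, probs = c(alpha / 2, 1 - alpha / 2))</code></pre></div>
<p>With more bootstrap replicates, the quantile estimates at the tails
become more stable, which is why a larger number is recommended for real
analyses.</p>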
</div>
</div>
<div class="section level2">
<h2 id="call-run_ml-multiple-times-in-parallel-in-r">Call <code>run_ml()</code> multiple times in parallel in R<a class="anchor" aria-label="anchor" href="#call-run_ml-multiple-times-in-parallel-in-r"></a>
</h2>
<p>You can use functions from the <code>future.apply</code> package to
call <code><a href="../reference/run_ml.html">run_ml()</a></code> multiple times in parallel with different
parameters. You will first need to run <code><a href="https://future.futureverse.org/reference/plan.html" class="external-link">future::plan()</a></code> as
above if you haven’t already. Then, call <code><a href="../reference/run_ml.html">run_ml()</a></code> with
multiple seeds using <code>future_lapply()</code>:</p>
<div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co"># NOTE: use more seeds for real-world data</span></span>
<span><span class="va">results_multi</span> <span class="op"><-</span> <span class="fu">future.apply</span><span class="fu">::</span><span class="fu"><a href="https://future.apply.futureverse.org/reference/future_lapply.html" class="external-link">future_lapply</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/seq.html" class="external-link">seq</a></span><span class="op">(</span><span class="fl">100</span>, <span class="fl">102</span><span class="op">)</span>, <span class="kw">function</span><span class="op">(</span><span class="va">seed</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_data_preproc</span>, <span class="st">"glmnet"</span>, seed <span class="op">=</span> <span class="va">seed</span><span class="op">)</span></span>
<span><span class="op">}</span>, future.seed <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Loading required package: lattice</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> Attaching package: 'caret'</span></span>
<span><span class="co">#> The following object is masked from 'package:mikropml':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> compare_models</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Loading required package: lattice</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> Attaching package: 'caret'</span></span>
<span><span class="co">#> The following object is masked from 'package:mikropml':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> compare_models</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span></code></pre></div>
<p>Each call to <code><a href="../reference/run_ml.html">run_ml()</a></code> with a different seed uses a
different random split of the data into training and testing sets. Since
we are using seeds, we must set <code>future.seed</code> to
<code>TRUE</code> (see the <a href="https://cran.r-project.org/web/packages/future.apply/future.apply.pdf" class="external-link"><code>future.apply</code>
documentation</a> and <a href="https://www.r-bloggers.com/2020/09/future-1-19-1-making-sure-proper-random-numbers-are-produced-in-parallel-processing/" class="external-link">this
blog post</a> for details on parallel-safe random seeds). This example
uses only a few seeds for speed and simplicity, but for real data we
recommend using many more seeds to get a better estimate of model
performance.</p>
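<p>To see what parallel-safe seeding provides, the toy sketch below
draws random numbers in two separate <code>future_lapply()</code> calls
with the same integer <code>future.seed</code>. The <code>draw()</code>
function and the seed value 42 are hypothetical; the example assumes
<code>future.apply</code> is installed and works under any plan.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Draw one random number per task, twice, with the same integer future.seed.
draw <- function(i) rnorm(1)
first <- future.apply::future_lapply(1:3, draw, future.seed = 42)
second <- future.apply::future_lapply(1:3, draw, future.seed = 42)

# Both runs use the same parallel-safe RNG streams, so the draws match.
identical(first, second)</code></pre></div>
<p>Passing an integer fixes the streams for reproducibility, while
<code>future.seed = TRUE</code> derives them from the current state of
the random number generator.</p>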
<p>In these examples, we used functions from the
<code>future.apply</code> package to call <code><a href="../reference/run_ml.html">run_ml()</a></code> in parallel,
but you can accomplish the same thing with parallel versions of the
<code><a href="https://purrr.tidyverse.org/reference/map.html" class="external-link">purrr::map()</a></code> functions using the <code>furrr</code> package
(e.g. <code><a href="https://furrr.futureverse.org/reference/future_map.html" class="external-link">furrr::future_map_dfr()</a></code>).</p>
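<p>As a sketch of that alternative, assuming the <code>furrr</code>
package is installed and a future plan is registered, the same three
seeds could be run as follows. The name <code>perf_furrr</code> is made
up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Same three seeds as above, using furrr instead of future.apply.
# furrr_options(seed = TRUE) requests parallel-safe random numbers.
perf_furrr <- furrr::future_map_dfr(
  seq(100, 102),
  function(seed) {
    run_ml(otu_data_preproc, "glmnet", seed = seed)$performance
  },
  .options = furrr::furrr_options(seed = TRUE)
)
perf_furrr</code></pre></div>
<p>Because <code>future_map_dfr()</code> row-binds the return values,
this variant yields a single data frame of performance rows rather than
a list of full results.</p>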
<p>Extract the performance results and combine into one dataframe for
all seeds:</p>
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">perf_df</span> <span class="op"><-</span> <span class="fu">future.apply</span><span class="fu">::</span><span class="fu"><a href="https://future.apply.futureverse.org/reference/future_lapply.html" class="external-link">future_lapply</a></span><span class="op">(</span><span class="va">results_multi</span>,</span>
<span> <span class="kw">function</span><span class="op">(</span><span class="va">result</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="va">result</span><span class="op">[[</span><span class="st">"performance"</span><span class="op">]</span><span class="op">]</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">cv_metric_AUC</span>, <span class="va">AUC</span>, <span class="va">method</span><span class="op">)</span></span>
<span> <span class="op">}</span>,</span>
<span> future.seed <span class="op">=</span> <span class="cn">TRUE</span></span>
<span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/bind_rows.html" class="external-link">bind_rows</a></span><span class="op">(</span><span class="op">)</span></span>
<span><span class="va">perf_df</span></span>
<span><span class="co">#> <span style="color: #949494;"># A tibble: 3 × 3</span></span></span>
<span><span class="co">#> cv_metric_AUC AUC method</span></span>
<span><span class="co">#> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><dbl></span> <span style="color: #949494; font-style: italic;"><chr></span> </span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">1</span> 0.630 0.634 glmnet</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">2</span> 0.591 0.608 glmnet</span></span>
<span><span class="co">#> <span style="color: #BCBCBC;">3</span> 0.671 0.471 glmnet</span></span></code></pre></div>
<div class="section level3">
<h3 id="multiple-ml-methods">Multiple ML methods<a class="anchor" aria-label="anchor" href="#multiple-ml-methods"></a>
</h3>
<p>You may also wish to compare performance for different ML methods.
<code><a href="https://rdrr.io/r/base/mapply.html" class="external-link">mapply()</a></code> can iterate over multiple lists or vectors, and
<code>future_mapply()</code> works the same way:</p>
<div class="sourceCode" id="cb7"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co"># NOTE: use more seeds for real-world data</span></span>
<span><span class="va">param_grid</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/expand.grid.html" class="external-link">expand.grid</a></span><span class="op">(</span></span>
<span> seeds <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/seq.html" class="external-link">seq</a></span><span class="op">(</span><span class="fl">100</span>, <span class="fl">103</span><span class="op">)</span>,</span>
<span> methods <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"glmnet"</span>, <span class="st">"rf"</span><span class="op">)</span></span>
<span><span class="op">)</span></span>
<span><span class="va">results_mtx</span> <span class="op"><-</span> <span class="fu">future.apply</span><span class="fu">::</span><span class="fu"><a href="https://future.apply.futureverse.org/reference/future_mapply.html" class="external-link">future_mapply</a></span><span class="op">(</span></span>
<span> <span class="kw">function</span><span class="op">(</span><span class="va">seed</span>, <span class="va">method</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">otu_data_preproc</span>,</span>
<span> <span class="va">method</span>,</span>
<span> seed <span class="op">=</span> <span class="va">seed</span>,</span>
<span> find_feature_importance <span class="op">=</span> <span class="cn">TRUE</span></span>
<span> <span class="op">)</span></span>
<span> <span class="op">}</span>,</span>
<span> <span class="va">param_grid</span><span class="op">$</span><span class="va">seeds</span>,</span>
<span> <span class="va">param_grid</span><span class="op">$</span><span class="va">methods</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://rdrr.io/r/base/character.html" class="external-link">as.character</a></span><span class="op">(</span><span class="op">)</span>,</span>
<span> future.seed <span class="op">=</span> <span class="cn">TRUE</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Loading required package: lattice</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> Attaching package: 'caret'</span></span>
<span><span class="co">#> The following object is masked from 'package:mikropml':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> compare_models</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Loading required package: lattice</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> Attaching package: 'caret'</span></span>
<span><span class="co">#> The following object is masked from 'package:mikropml':</span></span>
<span><span class="co">#> </span></span>
<span><span class="co">#> compare_models</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Finding feature importance...</span></span>
<span><span class="co">#> Feature importance complete.</span></span></code></pre></div>
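<p>Like <code>mapply()</code>, <code>future_mapply()</code> simplifies
its result by default: when every call returns a list with the same
element names, the results are arranged as a matrix with one row per
element and one column per call. The toy base-R sketch below uses a
hypothetical <code>fake_run()</code> function as a stand-in for
<code>run_ml()</code> to show how such a matrix can be indexed by row
name.</p>
<div class="sourceCode"><pre class="sourceCode r">
<code class="sourceCode R"># Toy stand-in for run_ml(): returns a named list, as run_ml() does.
fake_run <- function(seed, method) {
  list(performance = paste(method, seed), seed = seed)
}
toy_mtx <- mapply(fake_run, c(100, 101), c("glmnet", "rf"))

# Rows are the list element names; columns are the individual calls.
dim(toy_mtx)
rownames(toy_mtx)

# Indexing a row by name gives that element from every call, as a list.
toy_mtx["performance", ]</code></pre></div>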
</div>
<div class="section level3">
<h3 id="visualize-the-results">Visualize the results<a class="anchor" aria-label="anchor" href="#visualize-the-results"></a>
</h3>
<p>The plotting functions below require <code>ggplot2</code>. You can
also build your own plots directly from the results data.</p>
<div class="section level4">
<h4 id="performance">Performance<a class="anchor" aria-label="anchor" href="#performance"></a>
</h4>
<div class="section level5">
<h5 id="mean-auc">Mean AUC<a class="anchor" aria-label="anchor" href="#mean-auc"></a>
</h5>
<div class="sourceCode" id="cb8"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">perf_df</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/lapply.html" class="external-link">lapply</a></span><span class="op">(</span></span>
<span> <span class="va">results_mtx</span><span class="op">[</span><span class="st">"performance"</span>, <span class="op">]</span>,</span>
<span> <span class="kw">function</span><span class="op">(</span><span class="va">x</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="va">x</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html" class="external-link">select</a></span><span class="op">(</span><span class="va">cv_metric_AUC</span>, <span class="va">AUC</span>, <span class="va">method</span><span class="op">)</span></span>
<span> <span class="op">}</span></span>
<span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/bind_rows.html" class="external-link">bind_rows</a></span><span class="op">(</span><span class="op">)</span></span>
<span></span>
<span><span class="va">perf_boxplot</span> <span class="op"><-</span> <span class="fu"><a href="../reference/plot_model_performance.html">plot_model_performance</a></span><span class="op">(</span><span class="va">perf_df</span><span class="op">)</span></span>
<span><span class="va">perf_boxplot</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/plot_perf-1.png" width="700"></p>
<p><code><a href="../reference/plot_model_performance.html">plot_model_performance()</a></code> returns a ggplot2 object. You
can add layers to customize the plot:</p>
<div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">perf_boxplot</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggtheme.html" class="external-link">theme_classic</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/scale_brewer.html" class="external-link">scale_color_brewer</a></span><span class="op">(</span>palette <span class="op">=</span> <span class="st">"Dark2"</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html" class="external-link">coord_flip</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/customize_perf_plot-1.png" width="700"></p>
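<p>The boxplot shows the spread of performance across seeds. If you also
want the per-method means that the heading refers to, one option is base
R's <code>aggregate()</code>. The sketch below uses a small made-up data
frame (<code>toy_perf</code>, with hypothetical values) that has the same
three columns selected above; in practice you would pass
<code>perf_df</code> instead.</p>

```r
# Hypothetical stand-in for perf_df: two methods, two seeds each.
toy_perf <- data.frame(
  cv_metric_AUC = c(0.60, 0.70, 0.65, 0.75),
  AUC = c(0.5, 0.7, 0.6, 0.8),
  method = c("glmnet", "glmnet", "rf", "rf")
)

# Mean cross-validation AUC and mean test AUC for each method.
mean_perf <- aggregate(
  cbind(cv_metric_AUC, AUC) ~ method,
  data = toy_perf,
  FUN = mean
)
mean_perf
```

<p>The result has one row per method, which is convenient for a summary
table alongside the plot.</p>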
</div>
<div class="section level5">
<h5 id="roc-and-prc-curves">ROC and PRC curves<a class="anchor" aria-label="anchor" href="#roc-and-prc-curves"></a>
</h5>
<p>First calculate the sensitivity, specificity, and precision for all
models.</p>
<div class="sourceCode" id="cb10"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">get_sensspec_seed</span> <span class="op"><-</span> <span class="kw">function</span><span class="op">(</span><span class="va">colnum</span><span class="op">)</span> <span class="op">{</span></span>
<span> <span class="va">result</span> <span class="op"><-</span> <span class="va">results_mtx</span><span class="op">[</span>, <span class="va">colnum</span><span class="op">]</span></span>
<span> <span class="va">trained_model</span> <span class="op"><-</span> <span class="va">result</span><span class="op">$</span><span class="va">trained_model</span></span>
<span> <span class="va">test_data</span> <span class="op"><-</span> <span class="va">result</span><span class="op">$</span><span class="va">test_data</span></span>
<span> <span class="va">seed</span> <span class="op"><-</span> <span class="va">result</span><span class="op">$</span><span class="va">performance</span><span class="op">$</span><span class="va">seed</span></span>
<span> <span class="va">method</span> <span class="op"><-</span> <span class="va">result</span><span class="op">$</span><span class="va">trained_model</span><span class="op">$</span><span class="va">method</span></span>
<span> <span class="va">sensspec</span> <span class="op"><-</span> <span class="fu"><a href="../reference/sensspec.html">calc_model_sensspec</a></span><span class="op">(</span></span>
<span> <span class="va">trained_model</span>,</span>
<span> <span class="va">test_data</span>,</span>
<span> <span class="st">"dx"</span></span>
<span> <span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>seed <span class="op">=</span> <span class="va">seed</span>, method <span class="op">=</span> <span class="va">method</span><span class="op">)</span></span>
<span> <span class="kw"><a href="https://rdrr.io/r/base/function.html" class="external-link">return</a></span><span class="op">(</span><span class="va">sensspec</span><span class="op">)</span></span>
<span><span class="op">}</span></span>
<span><span class="va">sensspec_dat</span> <span class="op"><-</span> <span class="fu">purrr</span><span class="fu">::</span><span class="fu"><a href="https://purrr.tidyverse.org/reference/map_dfr.html" class="external-link">map_dfr</a></span><span class="op">(</span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/seq.html" class="external-link">seq</a></span><span class="op">(</span><span class="fl">1</span>, <span class="fu"><a href="https://rdrr.io/r/base/dim.html" class="external-link">dim</a></span><span class="op">(</span><span class="va">results_mtx</span><span class="op">)</span><span class="op">[</span><span class="fl">2</span><span class="op">]</span><span class="op">)</span>,</span>
<span> <span class="va">get_sensspec_seed</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span></code></pre></div>
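<p>To make the three quantities concrete, here is a minimal base R sketch
that computes them at a single decision threshold. The label and
probability vectors and the 0.5 threshold are made up for illustration;
this is not a reproduction of what <code>calc_model_sensspec()</code>
does internally.</p>

```r
# Made-up observed labels and predicted probabilities of "cancer".
actual <- c("cancer", "cancer", "cancer", "normal", "normal", "normal")
prob_cancer <- c(0.9, 0.8, 0.3, 0.6, 0.2, 0.1)

# Call a sample "cancer" when its probability reaches the threshold.
threshold <- 0.5
predicted <- ifelse(prob_cancer >= threshold, "cancer", "normal")

tp <- sum(predicted == "cancer" & actual == "cancer") # true positives
fp <- sum(predicted == "cancer" & actual == "normal") # false positives
tn <- sum(predicted == "normal" & actual == "normal") # true negatives
fn <- sum(predicted == "normal" & actual == "cancer") # false negatives

sensitivity <- tp / (tp + fn) # also called recall
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)
```

<p>Repeating this over many thresholds traces out the points of an ROC
curve (sensitivity against specificity) and a PRC curve (precision
against recall).</p>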
<div class="section level6">
<h6 id="plot-curves-for-a-single-model">Plot curves for a single model<a class="anchor" aria-label="anchor" href="#plot-curves-for-a-single-model"></a>
</h6>
<div class="sourceCode" id="cb11"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">sensspec_1</span> <span class="op"><-</span> <span class="va">sensspec_dat</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">seed</span> <span class="op">==</span> <span class="fl">100</span>, <span class="va">method</span> <span class="op">==</span> <span class="st">"glmnet"</span><span class="op">)</span></span>
<span><span class="va">sensspec_1</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggplot.html" class="external-link">ggplot</a></span><span class="op">(</span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/aes.html" class="external-link">aes</a></span><span class="op">(</span>x <span class="op">=</span> <span class="va">specificity</span>, y <span class="op">=</span> <span class="va">sensitivity</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_path.html" class="external-link">geom_line</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html" class="external-link">geom_abline</a></span><span class="op">(</span></span>
<span> intercept <span class="op">=</span> <span class="fl">1</span>, slope <span class="op">=</span> <span class="fl">1</span>,</span>
<span> linetype <span class="op">=</span> <span class="st">"dashed"</span>, color <span class="op">=</span> <span class="st">"grey50"</span></span>
<span> <span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/coord_fixed.html" class="external-link">coord_equal</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/scale_continuous.html" class="external-link">scale_x_reverse</a></span><span class="op">(</span>expand <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">0</span>, <span class="fl">0</span><span class="op">)</span>, limits <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">1.01</span>, <span class="op">-</span><span class="fl">0.01</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/scale_continuous.html" class="external-link">scale_y_continuous</a></span><span class="op">(</span>expand <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">0</span>, <span class="fl">0</span><span class="op">)</span>, limits <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">0.01</span>, <span class="fl">1.01</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/labs.html" class="external-link">labs</a></span><span class="op">(</span>x <span class="op">=</span> <span class="st">"Specificity"</span>, y <span class="op">=</span> <span class="st">"Sensitivity"</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggtheme.html" class="external-link">theme_bw</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/theme.html" class="external-link">theme</a></span><span class="op">(</span>legend.title <span class="op">=</span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/element.html" class="external-link">element_blank</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/plot_curves_one_model-1.png" width="700"></p>
<div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span></span>
<span><span class="va">baseline_precision_otu</span> <span class="op"><-</span> <span class="fu"><a href="../reference/calc_baseline_precision.html">calc_baseline_precision</a></span><span class="op">(</span></span>
<span> <span class="va">otu_data_preproc</span>,</span>
<span> <span class="st">"dx"</span>, <span class="st">"cancer"</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="va">sensspec_1</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/rename.html" class="external-link">rename</a></span><span class="op">(</span>recall <span class="op">=</span> <span class="va">sensitivity</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggplot.html" class="external-link">ggplot</a></span><span class="op">(</span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/aes.html" class="external-link">aes</a></span><span class="op">(</span>x <span class="op">=</span> <span class="va">recall</span>, y <span class="op">=</span> <span class="va">precision</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_path.html" class="external-link">geom_line</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html" class="external-link">geom_hline</a></span><span class="op">(</span></span>
<span> yintercept <span class="op">=</span> <span class="va">baseline_precision_otu</span>,</span>
<span> linetype <span class="op">=</span> <span class="st">"dashed"</span>, color <span class="op">=</span> <span class="st">"grey50"</span></span>
<span> <span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/coord_fixed.html" class="external-link">coord_equal</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/scale_continuous.html" class="external-link">scale_x_continuous</a></span><span class="op">(</span>expand <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">0</span>, <span class="fl">0</span><span class="op">)</span>, limits <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">0.01</span>, <span class="fl">1.01</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/scale_continuous.html" class="external-link">scale_y_continuous</a></span><span class="op">(</span>expand <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="fl">0</span>, <span class="fl">0</span><span class="op">)</span>, limits <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">0.01</span>, <span class="fl">1.01</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/labs.html" class="external-link">labs</a></span><span class="op">(</span>x <span class="op">=</span> <span class="st">"Recall"</span>, y <span class="op">=</span> <span class="st">"Precision"</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggtheme.html" class="external-link">theme_bw</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/theme.html" class="external-link">theme</a></span><span class="op">(</span>legend.title <span class="op">=</span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/element.html" class="external-link">element_blank</a></span><span class="op">(</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/plot_curves_one_model-2.png" width="700"></p>
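<p>The dashed horizontal line in the precision-recall plot marks the
baseline precision, which, as we understand it, is the fraction of
samples that belong to the positive class. As a rough illustration with a
made-up data frame (<code>toy_dat</code> is hypothetical and is not
<code>otu_data_preproc</code>), that fraction can be computed by
hand:</p>

```r
# Hypothetical outcome column: 3 of 8 samples are "cancer".
toy_dat <- data.frame(
  dx = c(
    "cancer", "normal", "normal", "cancer",
    "normal", "normal", "cancer", "normal"
  )
)

# Fraction of samples in the positive class.
baseline_precision_toy <- mean(toy_dat$dx == "cancer")
baseline_precision_toy
```

<p>A model whose precision-recall curve sits near this line is doing
little better than labeling every sample as positive.</p>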
</div>
<div class="section level6">
<h6 id="plot-mean-roc-and-prc-for-all-models">Plot mean ROC and PRC for all models<a class="anchor" aria-label="anchor" href="#plot-mean-roc-and-prc-for-all-models"></a>
</h6>
<div class="sourceCode" id="cb13"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">sensspec_dat</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="../reference/sensspec.html">calc_mean_roc</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="../reference/plot_curves.html">plot_mean_roc</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/plot_roc_prc-1.png" width="700"></p>
<div class="sourceCode" id="cb14"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span></span>
<span><span class="va">sensspec_dat</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="../reference/sensspec.html">calc_mean_prc</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="../reference/plot_curves.html">plot_mean_prc</a></span><span class="op">(</span>baseline_precision <span class="op">=</span> <span class="va">baseline_precision_otu</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/plot_roc_prc-2.png" width="700"></p>
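<p>The idea behind a mean curve can be sketched in base R: average
sensitivity across seeds at each specificity value. The data frame below
is made up, and its specificity values already fall on a shared grid;
this is an assumption for illustration and not necessarily how
<code>calc_mean_roc()</code> bins or summarizes the data.</p>

```r
# Made-up ROC points for two seeds on a shared specificity grid.
toy_sensspec <- data.frame(
  seed = rep(c(100, 101), each = 3),
  specificity = rep(c(0, 0.5, 1), times = 2),
  sensitivity = c(1.0, 0.8, 0.0, 1.0, 0.6, 0.0)
)

# Average sensitivity across seeds at each specificity value.
mean_roc_toy <- aggregate(
  sensitivity ~ specificity,
  data = toy_sensspec,
  FUN = mean
)
mean_roc_toy
```

<p>With real model output the specificity values rarely line up exactly
across seeds, so some form of rounding or interpolation is needed before
averaging.</p>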
</div>
</div>
</div>
<div class="section level4">
<h4 id="feature-importance">Feature importance<a class="anchor" aria-label="anchor" href="#feature-importance"></a>
</h4>
<p>The <code>perf_metric_diff</code> column of the feature importance
data frame holds the difference between performance on the actual test
data and performance on the permuted test data
(i.e. <strong>test</strong> minus <strong>permuted</strong>). If a
feature is important for model performance, we expect
<code>perf_metric_diff</code> to be positive. In other words, the
features that cause the largest <strong>decrease</strong> in
performance when permuted are the most important features.</p>
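<p>A small worked example may help with the sign convention. The
performance values and feature names below are made up; only the
subtraction (test minus permuted) comes from the description above.</p>

```r
# Made-up values: test AUC of the fitted model, and AUC after permuting
# each feature in turn. The feature names are hypothetical.
perf_test <- 0.80
perf_permuted <- c(featA = 0.65, featB = 0.79, featC = 0.82)

# Test minus permuted: positive means permuting the feature hurt performance.
perf_metric_diff <- perf_test - perf_permuted

# Rank features from most to least important.
sort(perf_metric_diff, decreasing = TRUE)
```

<p>Here <code>featA</code> has the largest positive difference and would
rank as most important, while <code>featC</code> has a negative
difference, meaning permuting it did not reduce performance.</p>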
<div class="section level5">
<h5 id="feature-importance-for-multiple-models">Feature importance for multiple models<a class="anchor" aria-label="anchor" href="#feature-importance-for-multiple-models"></a>
</h5>
<p>You can select the top n most important features for your models and
plot them like so:</p>
<div class="sourceCode" id="cb15"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">feat_df</span> <span class="op"><-</span> <span class="va">results_mtx</span><span class="op">[</span><span class="st">"feature_importance"</span>, <span class="op">]</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu">dplyr</span><span class="fu">::</span><span class="fu"><a href="https://dplyr.tidyverse.org/reference/bind_rows.html" class="external-link">bind_rows</a></span><span class="op">(</span><span class="op">)</span></span>
<span></span>
<span><span class="va">top_n</span> <span class="op"><-</span> <span class="fl">5</span></span>
<span><span class="va">top_feats</span> <span class="op"><-</span> <span class="va">feat_df</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/group_by.html" class="external-link">group_by</a></span><span class="op">(</span><span class="va">method</span>, <span class="va">feat</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/summarise.html" class="external-link">summarize</a></span><span class="op">(</span>median_diff <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/stats/median.html" class="external-link">median</a></span><span class="op">(</span><span class="va">perf_metric_diff</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">median_diff</span> <span class="op">></span> <span class="fl">0</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/slice.html" class="external-link">slice_max</a></span><span class="op">(</span>order_by <span class="op">=</span> <span class="va">median_diff</span>, n <span class="op">=</span> <span class="va">top_n</span><span class="op">)</span></span>
<span><span class="co">#> `summarise()` has grouped output by 'method'. You can override using the</span></span>
<span><span class="co">#> `.groups` argument.</span></span>
<span></span>
<span><span class="va">feat_df</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html" class="external-link">right_join</a></span><span class="op">(</span><span class="va">top_feats</span>, by <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"method"</span>, <span class="st">"feat"</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>features <span class="op">=</span> <span class="fu">forcats</span><span class="fu">::</span><span class="fu"><a href="https://forcats.tidyverse.org/reference/fct_reorder.html" class="external-link">fct_reorder</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/factor.html" class="external-link">factor</a></span><span class="op">(</span><span class="va">feat</span><span class="op">)</span>, <span class="va">median_diff</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggplot.html" class="external-link">ggplot</a></span><span class="op">(</span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/aes.html" class="external-link">aes</a></span><span class="op">(</span>x <span class="op">=</span> <span class="va">perf_metric_diff</span>, y <span class="op">=</span> <span class="va">features</span>, color <span class="op">=</span> <span class="va">method</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html" class="external-link">geom_boxplot</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html" class="external-link">geom_vline</a></span><span class="op">(</span>xintercept <span class="op">=</span> <span class="fl">0</span>, linetype <span class="op">=</span> <span class="st">"dashed"</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/labs.html" class="external-link">labs</a></span><span class="op">(</span></span>
<span> x <span class="op">=</span> <span class="st">"Decrease in performance (actual minus permutation)"</span>,</span>
<span> y <span class="op">=</span> <span class="st">"Features"</span>,</span>
<span> caption <span class="op">=</span> <span class="st">"Features which have a lower performance when permuted have a</span></span>
<span><span class="st"> difference in performance above zero. The features with the greatest</span></span>
<span><span class="st"> decrease are the most important for model performance."</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu">stringr</span><span class="fu">::</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_wrap.html" class="external-link">str_wrap</a></span><span class="op">(</span>width <span class="op">=</span> <span class="fl">100</span><span class="op">)</span></span>
<span> <span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggtheme.html" class="external-link">theme_bw</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/theme.html" class="external-link">theme</a></span><span class="op">(</span>plot.caption <span class="op">=</span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/element.html" class="external-link">element_text</a></span><span class="op">(</span>hjust <span class="op">=</span> <span class="fl">0</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/feat_imp_plot-1.png" width="700"></p>
<p>See the docs for <code><a href="../reference/get_feature_importance.html">get_feature_importance()</a></code> for more
details on how these values are computed.</p>
</div>
<div class="section level5">
<h5 id="feature-importance-for-a-single-model">Feature importance for a single model<a class="anchor" aria-label="anchor" href="#feature-importance-for-a-single-model"></a>
</h5>
<p>You can also plot feature importance for a single model. Here we
report the actual performance, the permutation performance, and the
empirical 95% confidence interval for the permutation performance.</p>
<div class="sourceCode" id="cb16"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">feat_imp_1</span> <span class="op"><-</span> <span class="va">results_mtx</span><span class="op">[</span>, <span class="fl">1</span><span class="op">]</span><span class="op">[[</span><span class="st">"feature_importance"</span><span class="op">]</span><span class="op">]</span></span>
<span><span class="va">perf_metric_name</span> <span class="op"><-</span> <span class="va">results_mtx</span><span class="op">[</span>, <span class="fl">1</span><span class="op">]</span><span class="op">[[</span><span class="st">"trained_model"</span><span class="op">]</span><span class="op">]</span><span class="op">$</span><span class="va">metric</span></span>
<span><span class="va">perf_actual</span> <span class="op"><-</span> <span class="va">results_mtx</span><span class="op">[</span>, <span class="fl">1</span><span class="op">]</span><span class="op">[[</span><span class="st">"performance"</span><span class="op">]</span><span class="op">]</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/pull.html" class="external-link">pull</a></span><span class="op">(</span><span class="va">perf_metric_name</span><span class="op">)</span></span>
<span></span>
<span><span class="va">feat_imp_1</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/filter.html" class="external-link">filter</a></span><span class="op">(</span><span class="va">perf_metric_diff</span> <span class="op">></span> <span class="fl">0</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html" class="external-link">mutate</a></span><span class="op">(</span>feat <span class="op">=</span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/if_else.html" class="external-link">if_else</a></span><span class="op">(</span><span class="va">pvalue</span> <span class="op"><</span> <span class="fl">0.05</span>, <span class="fu"><a href="https://rdrr.io/r/base/paste.html" class="external-link">paste0</a></span><span class="op">(</span><span class="st">"*"</span>, <span class="va">feat</span><span class="op">)</span>, <span class="fu"><a href="https://rdrr.io/r/base/character.html" class="external-link">as.character</a></span><span class="op">(</span><span class="va">feat</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/factor.html" class="external-link">as.factor</a></span><span class="op">(</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu">forcats</span><span class="fu">::</span><span class="fu"><a href="https://forcats.tidyverse.org/reference/fct_reorder.html" class="external-link">fct_reorder</a></span><span class="op">(</span><span class="va">perf_metric_diff</span><span class="op">)</span><span class="op">)</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggplot.html" class="external-link">ggplot</a></span><span class="op">(</span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/aes.html" class="external-link">aes</a></span><span class="op">(</span>x <span class="op">=</span> <span class="va">perf_metric</span>, xmin <span class="op">=</span> <span class="va">lower</span>, xmax <span class="op">=</span> <span class="va">upper</span>, y <span class="op">=</span> <span class="va">feat</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_linerange.html" class="external-link">geom_pointrange</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html" class="external-link">geom_vline</a></span><span class="op">(</span>xintercept <span class="op">=</span> <span class="va">perf_actual</span>, linetype <span class="op">=</span> <span class="st">"dashed"</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/labs.html" class="external-link">labs</a></span><span class="op">(</span></span>
<span> x <span class="op">=</span> <span class="st">"Permutation performance"</span>,</span>
<span> y <span class="op">=</span> <span class="st">"Features"</span>,</span>
<span> caption <span class="op">=</span> <span class="st">"The dashed line represents the actual performance on the</span></span>
<span><span class="st"> test set. Features which have a lower performance when permuted are</span></span>
<span><span class="st"> important for model performance. Significant features (pvalue < 0.05)</span></span>
<span><span class="st"> are marked with an asterisk (*). Error bars represent the 95%</span></span>
<span><span class="st"> confidence interval."</span> <span class="op"><a href="https://magrittr.tidyverse.org/reference/pipe.html" class="external-link">%>%</a></span> <span class="fu">stringr</span><span class="fu">::</span><span class="fu"><a href="https://stringr.tidyverse.org/reference/str_wrap.html" class="external-link">str_wrap</a></span><span class="op">(</span>width <span class="op">=</span> <span class="fl">110</span><span class="op">)</span></span>
<span> <span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/ggtheme.html" class="external-link">theme_bw</a></span><span class="op">(</span><span class="op">)</span> <span class="op">+</span></span>
<span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/theme.html" class="external-link">theme</a></span><span class="op">(</span>plot.caption <span class="op">=</span> <span class="fu"><a href="https://ggplot2.tidyverse.org/reference/element.html" class="external-link">element_text</a></span><span class="op">(</span>hjust <span class="op">=</span> <span class="fl">0</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<p><img src="parallel_files/figure-html/feat_imp_single-1.png" width="700"></p>
</div>
</div>
</div>
</div>
<div class="section level2">
<h2 id="live-progress-updates">Live progress updates<a class="anchor" aria-label="anchor" href="#live-progress-updates"></a>
</h2>
<p><code><a href="../reference/preprocess_data.html">preprocess_data()</a></code> and
<code><a href="../reference/get_feature_importance.html">get_feature_importance()</a></code> support reporting live progress
updates using the <code>progressr</code> package. The format is up to
you, but we recommend using a progress bar like this:</p>
<div class="sourceCode" id="cb17"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="co"># optionally, specify the progress bar format with the `progress` package.</span></span>
<span><span class="fu">progressr</span><span class="fu">::</span><span class="fu"><a href="https://progressr.futureverse.org/reference/handlers.html" class="external-link">handlers</a></span><span class="op">(</span><span class="fu">progressr</span><span class="fu">::</span><span class="fu"><a href="https://progressr.futureverse.org/reference/handler_progress.html" class="external-link">handler_progress</a></span><span class="op">(</span></span>
<span> format <span class="op">=</span> <span class="st">":message :bar :percent | elapsed: :elapsed | eta: :eta"</span>,</span>
<span> clear <span class="op">=</span> <span class="cn">FALSE</span>,</span>
<span> show_after <span class="op">=</span> <span class="fl">0</span></span>
<span><span class="op">)</span><span class="op">)</span></span>
<span><span class="co"># tell progressr to always report progress in any functions that use it.</span></span>
<span><span class="co"># set this to FALSE to turn it back off again.</span></span>
<span><span class="fu">progressr</span><span class="fu">::</span><span class="fu"><a href="https://progressr.futureverse.org/reference/handlers.html" class="external-link">handlers</a></span><span class="op">(</span>global <span class="op">=</span> <span class="cn">TRUE</span><span class="op">)</span></span>
<span></span>
<span><span class="co"># run your code and watch the live progress updates.</span></span>
<span><span class="va">dat</span> <span class="op"><-</span> <span class="fu"><a href="../reference/preprocess_data.html">preprocess_data</a></span><span class="op">(</span><span class="va">otu_mini_bin</span>, <span class="st">"dx"</span><span class="op">)</span><span class="op">$</span><span class="va">dat_transformed</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> preprocessing ========================>------- 78% | elapsed: 1s | eta: 0s</span></span>
<span><span class="va">results</span> <span class="op"><-</span> <span class="fu"><a href="../reference/run_ml.html">run_ml</a></span><span class="op">(</span><span class="va">dat</span>, <span class="st">"glmnet"</span>,</span>
<span> kfold <span class="op">=</span> <span class="fl">2</span>, cv_times <span class="op">=</span> <span class="fl">2</span>,</span>
<span> find_feature_importance <span class="op">=</span> <span class="cn">TRUE</span></span>
<span><span class="op">)</span></span>
<span><span class="co">#> Using 'dx' as the outcome column.</span></span>
<span><span class="co">#> Training the model...</span></span>
<span><span class="co">#> Training complete.</span></span>
<span><span class="co">#> Feature importance =========================== 100% | elapsed: 37s | eta: 0s</span></span></code></pre></div>
<p>Note that only some future backends support “near-live” progress
updates, so progress may not be reported immediately when processing in
parallel with futures. Read more on that <a href="https://progressr.futureverse.org/articles/progressr-intro.html#near-live-versus-buffered-progress-updates-with-futures" class="external-link">in
the <code>progressr</code> vignette</a>. For more on
<code>progressr</code> and how to customize the format of progress
updates, see the <a href="https://progressr.futureverse.org/" class="external-link"><code>progressr</code>
docs</a>.</p>
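<p>If you would rather not turn on progress reporting globally,
<code>progressr</code> also provides <code>with_progress()</code>, which
reports progress only for the expression it wraps. The following is a
minimal sketch, assuming the global handler has not been enabled and
reusing the <code>otu_mini_bin</code> dataset from the example above:</p>
<div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(mikropml)

# report progress only for the wrapped expression, using whichever
# handlers are currently set, without changing the global setting.
progressr::with_progress({
  dat <- preprocess_data(otu_mini_bin, "dx")$dat_transformed
})</code></pre></div>
<p>This may be convenient in scripts where only one or two long-running
calls need a progress bar.</p>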
</div>
<div class="section level2">
<h2 id="parallelizing-with-snakemake">Parallelizing with Snakemake<a class="anchor" aria-label="anchor" href="#parallelizing-with-snakemake"></a>
</h2>
<p>When parallelizing multiple calls to <code><a href="../reference/run_ml.html">run_ml()</a></code> in R as in
the examples above, all of the results objects are held in memory. This
isn’t a big deal for a small dataset run with only a few seeds. However,
for large datasets run in parallel with, say, 100 seeds (recommended),
you may run into problems trying to store all of those objects in memory
at once.</p>
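<p>One way to reduce memory use while staying within R is to write each
seed's results to disk and keep only the file paths in memory. The
following is a minimal sketch rather than a recommended implementation:
<code>run_and_save()</code> and the <code>results</code> directory are
hypothetical names chosen for illustration, and the example assumes the
<code>seed</code> argument of <code>run_ml()</code> and the
<code>performance</code> element of its return value.</p>
<div class="sourceCode"><pre class="downlit sourceCode r">
<code class="sourceCode R">library(mikropml)

# hypothetical helper (not part of mikropml): train one model for a single
# seed, write its performance table to disk, and return only the file path.
run_and_save <- function(seed, out_dir = "results") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  result <- run_ml(otu_mini_bin, "glmnet", kfold = 2, cv_times = 2, seed = seed)
  out_file <- file.path(out_dir, paste0("performance_", seed, ".csv"))
  write.csv(result$performance, out_file, row.names = FALSE)
  out_file
}

# only the file paths are kept in memory; each full results object can be
# garbage-collected once its function call returns.
perf_files <- vapply(100:102, run_and_save, character(1))

# later, combine the small performance tables into one data frame.
perf_df <- do.call(rbind, lapply(perf_files, read.csv))</code></pre></div>
<p>The same helper could in principle be passed to a parallel apply
function in place of <code>vapply()</code>. Either way, this only saves
the performance table; the trained model would need to be saved
separately (for example with <code>saveRDS()</code>) if it is needed
later.</p>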
<p>Using a workflow manager such as Snakemake or Nextflow is highly
recommended to maximize the scalability and reproducibility of
computational analyses. We created <a href="https://github.com/SchlossLab/mikropml-snakemake-workflow" class="external-link">a
template Snakemake workflow here</a> which you can use as a starting
point for your ML project.</p>
<p><a href="https://github.com/SchlossLab/mikropml-snakemake-workflow" class="external-link"><img src="https://raw.githubusercontent.com/SchlossLab/mikropml-snakemake-workflow/main/figures/dag.png" alt="snakemake-dag"></a></p>
</div>
</main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2>
</nav></aside>
</div>
<footer><div class="pkgdown-footer-left">
<p></p>
<p>Developed by <a href="https://github.com/BTopcuoglu" class="external-link">Begüm Topçuoğlu</a>, <a href="https://github.com/zenalapp" class="external-link">Zena Lapp</a>, <a href="https://github.com/kelly-sovacool" class="external-link">Kelly Sovacool</a>, Evan Snitkin, Jenna Wiens, <a href="https://github.com/pschloss" class="external-link">Patrick Schloss</a>.</p>
</div>
<div class="pkgdown-footer-right">
<p></p>
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.7.</p>
</div>
</footer>
</div>
</body>
</html>