<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Getting Started with healthcareai • healthcareai</title>
<!-- favicons --><link rel="icon" type="image/png" sizes="16x16" href="../../favicon-16x16.png">
<link rel="icon" type="image/png" sizes="32x32" href="../../favicon-32x32.png">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="../../apple-touch-icon.png">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="../../apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="../../apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="../../apple-touch-icon-60x60.png">
<!-- jquery --><script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js" integrity="sha256-FgpCb/KJQlLNfOu91ta32o/NMZxltwRo8QtmkMRdAu8=" crossorigin="anonymous"></script><!-- Bootstrap --><link href="https://cdnjs.cloudflare.com/ajax/libs/bootswatch/3.3.7/yeti/bootstrap.min.css" rel="stylesheet" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha256-U5ZEeKfGNOja007MMD3YBI0A3OSZOQbeG6z2f2Y0hu8=" crossorigin="anonymous"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.7.1/css/all.min.css" integrity="sha256-nAmazAk6vS34Xqo0BSrTb+abbtFlgsFK7NKSi6o7Y78=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.7.1/css/v4-shims.min.css" integrity="sha256-6qHlizsOWFskGlwVOKuns+D1nB6ssZrHQrNj1wGplHc=" crossorigin="anonymous">
<!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.4/clipboard.min.js" integrity="sha256-FiZwavyI2V6+EXO1U+xzLG3IKldpiTFf3153ea9zikQ=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.9.4/headroom.min.js" integrity="sha256-DJFC1kqIhelURkuza0AvYal5RxMtpzLjFhsnVIeuk+U=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.9.4/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- pkgdown --><link href="../../pkgdown.css" rel="stylesheet">
<script src="../../pkgdown.js"></script><!-- docsearch --><script src="../../docsearch.js"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/docsearch.js/2.6.1/docsearch.min.css" integrity="sha256-QOSRU/ra9ActyXkIBbiIB144aDBdtvXBcNc3OTNuX/Q=" crossorigin="anonymous">
<link href="../../docsearch.css" rel="stylesheet">
<script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/jquery.mark.min.js" integrity="sha256-4HLtjeVgH0eIB3aZ9mLYF6E8oU5chNdjU6p6rrXpl9U=" crossorigin="anonymous"></script><meta property="og:title" content="Getting Started with healthcareai">
<meta property="og:description" content="">
<meta property="og:image" content="https://docs.healthcare.ai/logo.png">
<meta name="twitter:card" content="summary">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]--><!-- Global site tag (gtag.js) - Google Analytics --><script async src="https://www.googletagmanager.com/gtag/js?id=UA-85609357-1"></script><script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-85609357-1');
</script>
</head>
<body>
<div class="container template-article">
<header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<span class="navbar-brand">
<a class="navbar-link" href="../../index.html">healthcareai</a>
<span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Released version">2.4.0</span>
</span>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
Vignettes
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li>
<a href="../../articles/site_only/healthcareai.html">Getting Started</a>
</li>
<li>
<a href="../../articles/site_only/db_connections.html">Database Connections</a>
</li>
<li>
<a href="../../articles/site_only/deploy_model.html">Deploying a Model</a>
</li>
<li>
<a href="../../articles/site_only/best_levels.html">Variables with Many Categories</a>
</li>
<li>
<a href="../../articles/site_only/performance.html">Performance with Big Data</a>
</li>
<li>
<a href="../../articles/site_only/transitioning.html">Transition from Version 1</a>
</li>
</ul>
</li>
<li>
<a href="../../reference/index.html">Functions</a>
</li>
<li>
<a href="../../news/index.html">News</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="https://github.com/HealthCatalyst/healthcareai-r">
<span class="fa fa-github"></span>
</a>
</li>
<li>
<a href="https://healthcare-ai.slack.com/">
<span class="fa fa-users"></span>
</a>
</li>
</ul>
<form class="navbar-form navbar-right hidden-xs hidden-sm" role="search">
<div class="form-group">
<input type="search" class="form-control" name="search-input" id="search-input" placeholder="Search..." aria-label="Search for..." autocomplete="off">
</div>
</form>
</div>
<!--/.nav-collapse -->
</div>
<!--/.container -->
</div>
<!--/.navbar -->
</header><div class="row">
<div class="col-md-9 contents">
<div class="page-header toc-ignore">
<h1>Getting Started with healthcareai</h1>
<div class="hidden name"><code>healthcareai.Rmd</code></div>
</div>
<p>First we attach the healthcareai R package to make its functions available. If your package version is less than 2.0, none of the code here will work. You can check the package version with <code><a href="https://rdrr.io/r/utils/packageDescription.html">packageVersion("healthcareai")</a></code>, and you can get the latest stable version by running <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages("healthcareai")</a></code>. If you have v1.X code that you want to use with the new version of the package, check out the <a href="transitioning.html">Transitioning vignette</a>.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" data-line-number="1"><span class="kw"><a href="https://rdrr.io/r/base/library.html">library</a></span>(healthcareai)</a>
<a class="sourceLine" id="cb1-2" data-line-number="2"><span class="co"># > healthcareai version 2.4.0</span></a>
<a class="sourceLine" id="cb1-3" data-line-number="3"><span class="co"># > Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com</span></a></code></pre></div>
<p><code>healthcareai</code> comes with a built-in dataset documenting diabetes among adult Pima females. Once you attach the package, the dataset is available in the variable <code>pima_diabetes</code>. Let’s take a look at the data with the <code>str</code> function. There are 768 records of 10 variables, including one identifier column, two nominal (character) variables, <code>weight_class</code> and <code>diabetes</code>, and substantial missingness (represented in R by <code>NA</code>).</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" data-line-number="1"><span class="kw"><a href="https://rdrr.io/r/utils/str.html">str</a></span>(pima_diabetes)</a>
<a class="sourceLine" id="cb2-2" data-line-number="2"><span class="co"># > Classes 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 10 variables:</span></a>
<a class="sourceLine" id="cb2-3" data-line-number="3"><span class="co"># > $ patient_id : int 1 2 3 4 5 6 7 8 9 10 ...</span></a>
<a class="sourceLine" id="cb2-4" data-line-number="4"><span class="co"># > $ pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...</span></a>
<a class="sourceLine" id="cb2-5" data-line-number="5"><span class="co"># > $ plasma_glucose: int 148 85 183 89 137 116 78 115 197 125 ...</span></a>
<a class="sourceLine" id="cb2-6" data-line-number="6"><span class="co"># > $ diastolic_bp : int 72 66 64 66 40 74 50 NA 70 96 ...</span></a>
<a class="sourceLine" id="cb2-7" data-line-number="7"><span class="co"># > $ skinfold : int 35 29 NA 23 35 NA 32 NA 45 NA ...</span></a>
<a class="sourceLine" id="cb2-8" data-line-number="8"><span class="co"># > $ insulin : int NA NA NA 94 168 NA 88 NA 543 NA ...</span></a>
<a class="sourceLine" id="cb2-9" data-line-number="9"><span class="co"># > $ weight_class : chr "obese" "overweight" "normal" "overweight" ...</span></a>
<a class="sourceLine" id="cb2-10" data-line-number="10"><span class="co"># > $ pedigree : num 0.627 0.351 0.672 0.167 2.288 ...</span></a>
<a class="sourceLine" id="cb2-11" data-line-number="11"><span class="co"># > $ age : int 50 31 32 21 33 30 26 29 53 54 ...</span></a>
<a class="sourceLine" id="cb2-12" data-line-number="12"><span class="co"># > $ diabetes : chr "Y" "N" "Y" "N" ...</span></a></code></pre></div>
<div id="easy-machine-learning" class="section level1">
<h1 class="hasAnchor">
<a href="#easy-machine-learning" class="anchor"></a>Easy Machine Learning</h1>
<p>If you don’t want to fuss with details any more than necessary, <code>machine_learn</code> is the function for you. It makes it as easy as possible to implement machine learning models by putting all the details in the background so that you don’t have to worry about them. Of course it might be wise to worry about them, and we’ll get to how to do that further down, but for now, you can automatically take care of problems in the data, do basic feature engineering, and tune multiple machine learning models using cross validation with <code>machine_learn</code>.</p>
<p><code>machine_learn</code> always gets the name of the data frame, then any columns that should not be used by the model (uninformative columns, such as IDs), then the variable to be predicted with <code>outcome =</code>. If you want <code>machine_learn</code> to run faster, you can have that—at the expense of a bit of predictive power—by setting its <code>tune</code> argument to <code>FALSE</code>.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" data-line-number="1">quick_models <-<span class="st"> </span><span class="kw"><a href="../../reference/machine_learn.html">machine_learn</a></span>(pima_diabetes, patient_id, <span class="dt">outcome =</span> diabetes)</a>
<a class="sourceLine" id="cb3-2" data-line-number="2"><span class="co"># > Training new data prep recipe...</span></a>
<a class="sourceLine" id="cb3-3" data-line-number="3"><span class="co"># > Variable(s) ignored in prep_data won't be used to tune models: patient_id</span></a>
<a class="sourceLine" id="cb3-4" data-line-number="4"><span class="co"># > </span></a>
<a class="sourceLine" id="cb3-5" data-line-number="5"><span class="co"># > diabetes looks categorical, so training classification algorithms.</span></a>
<a class="sourceLine" id="cb3-6" data-line-number="6"><span class="co"># > </span></a>
<a class="sourceLine" id="cb3-7" data-line-number="7"><span class="co"># > After data processing, models are being trained on 12 features with 768 observations.</span></a>
<a class="sourceLine" id="cb3-8" data-line-number="8"><span class="co"># > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 rf's, 50 xgb's, and 100 glm's</span></a>
<a class="sourceLine" id="cb3-9" data-line-number="9"><span class="co"># > Training with cross validation: Random Forest</span></a>
<a class="sourceLine" id="cb3-10" data-line-number="10"><span class="co"># > Training with cross validation: eXtreme Gradient Boosting</span></a>
<a class="sourceLine" id="cb3-11" data-line-number="11"><span class="co"># > Training with cross validation: glmnet</span></a>
<a class="sourceLine" id="cb3-12" data-line-number="12"><span class="co"># > </span></a>
<a class="sourceLine" id="cb3-13" data-line-number="13"><span class="co"># > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***</span></a>
<a class="sourceLine" id="cb3-14" data-line-number="14"><span class="co"># > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***</span></a></code></pre></div>
<p><code>machine_learn</code> has told us four things: it has created a recipe for data preparation (which lets you apply exactly the same data cleaning and feature engineering when you want predictions on a new dataset), it is ignoring <code>patient_id</code> when tuning models as we told it to, it is training classification algorithms because the outcome variable <code>diabetes</code> is categorical, and it has executed cross validation for three machine learning models: random forests, XGBoost, and regularized regression. Let’s see what the models look like.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" data-line-number="1">quick_models</a>
<a class="sourceLine" id="cb4-2" data-line-number="2"><span class="co"># > Algorithms Trained: Random Forest, eXtreme Gradient Boosting, and glmnet</span></a>
<a class="sourceLine" id="cb4-3" data-line-number="3"><span class="co"># > Model Name: diabetes</span></a>
<a class="sourceLine" id="cb4-4" data-line-number="4"><span class="co"># > Target: diabetes</span></a>
<a class="sourceLine" id="cb4-5" data-line-number="5"><span class="co"># > Class: Classification</span></a>
<a class="sourceLine" id="cb4-6" data-line-number="6"><span class="co"># > Performance Metric: AUROC</span></a>
<a class="sourceLine" id="cb4-7" data-line-number="7"><span class="co"># > Number of Observations: 768</span></a>
<a class="sourceLine" id="cb4-8" data-line-number="8"><span class="co"># > Number of Features: 12</span></a>
<a class="sourceLine" id="cb4-9" data-line-number="9"><span class="co"># > Models Trained: 2020-02-28 08:42:20 </span></a>
<a class="sourceLine" id="cb4-10" data-line-number="10"><span class="co"># > </span></a>
<a class="sourceLine" id="cb4-11" data-line-number="11"><span class="co"># > Models tuned via 5-fold cross validation over 9 combinations of hyperparameter values.</span></a>
<a class="sourceLine" id="cb4-12" data-line-number="12"><span class="co"># > Best model: Random Forest</span></a>
<a class="sourceLine" id="cb4-13" data-line-number="13"><span class="co"># > AUPR = 0.71, AUROC = 0.85</span></a>
<a class="sourceLine" id="cb4-14" data-line-number="14"><span class="co"># > Optimal hyperparameter values:</span></a>
<a class="sourceLine" id="cb4-15" data-line-number="15"><span class="co"># > mtry = 5</span></a>
<a class="sourceLine" id="cb4-16" data-line-number="16"><span class="co"># > splitrule = extratrees</span></a>
<a class="sourceLine" id="cb4-17" data-line-number="17"><span class="co"># > min.node.size = 20</span></a></code></pre></div>
<p>Everything looks as expected, and the best model is a random forest that achieves performance of AUROC = 0.85. Not bad for one line of code.</p>
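<p>If training time matters more than the last bit of predictive power, the <code>tune</code> argument described above can be switched off. The following is a minimal sketch rather than part of the original walkthrough; the object name <code>fast_models</code> is an illustrative assumption, and how much time is saved will depend on your data and hardware.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Same call as before, but skip hyperparameter tuning for a faster, rougher fit.
# fast_models is a hypothetical name chosen for this sketch.
fast_models <- machine_learn(pima_diabetes, patient_id, outcome = diabetes,
                             tune = FALSE)

# Printing the object summarizes what was trained, as with quick_models
fast_models</code></pre></div>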
<p>Now that we have our models, we can make predictions using the <code>predict</code> function. If you provide a new data frame to <code>predict</code> it will make predictions on the new data; otherwise, it will make predictions on the training data.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" data-line-number="1">predictions <-<span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/stats/predict.html">predict</a></span>(quick_models)</a>
<a class="sourceLine" id="cb5-2" data-line-number="2">predictions</a>
<a class="sourceLine" id="cb5-3" data-line-number="3"><span class="co"># > "predicted_diabetes" predicted by Random Forest last trained: 2020-02-28 08:42:20</span></a>
<a class="sourceLine" id="cb5-4" data-line-number="4"><span class="co"># > Performance in training: AUROC = 0.85</span></a>
<a class="sourceLine" id="cb5-5" data-line-number="5"><span class="co"># > # A tibble: 768 x 11</span></a>
<a class="sourceLine" id="cb5-6" data-line-number="6"><span class="co"># > diabetes predicted_diabe… patient_id pregnancies plasma_glucose diastolic_bp</span></a>
<a class="sourceLine" id="cb5-7" data-line-number="7"><span class="co"># > * <fct> <dbl> <int> <int> <int> <int></span></a>
<a class="sourceLine" id="cb5-8" data-line-number="8"><span class="co"># > 1 Y 0.678 1 6 148 72</span></a>
<a class="sourceLine" id="cb5-9" data-line-number="9"><span class="co"># > 2 N 0.153 2 1 85 66</span></a>
<a class="sourceLine" id="cb5-10" data-line-number="10"><span class="co"># > 3 Y 0.460 3 8 183 64</span></a>
<a class="sourceLine" id="cb5-11" data-line-number="11"><span class="co"># > 4 N 0.00927 4 1 89 66</span></a>
<a class="sourceLine" id="cb5-12" data-line-number="12"><span class="co"># > 5 Y 0.566 5 0 137 40</span></a>
<a class="sourceLine" id="cb5-13" data-line-number="13"><span class="co"># > # … with 763 more rows, and 5 more variables: skinfold <int>, insulin <int>,</span></a>
<a class="sourceLine" id="cb5-14" data-line-number="14"><span class="co"># > # weight_class <chr>, pedigree <dbl>, age <int></span></a></code></pre></div>
<p>We get a message about when the model was trained and how well it performed in training, and we get back a data frame that looks sort of like the original, but has a new column <code>predicted_diabetes</code> that contains the model-generated probability each individual has diabetes, and contains changes that were made preparing the data for model training, e.g. missingness has been filled in and <code>weight_class</code> has been split into a series of “dummy” variables.</p>
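<p>To predict on new data instead, pass a data frame as the second argument to <code>predict</code>. The sketch below is illustrative only: it assumes the first five rows of <code>pima_diabetes</code> can stand in for newly arrived patients, and the names <code>new_patients</code> and <code>new_predictions</code> are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Train as in the example above
quick_models <- machine_learn(pima_diabetes, patient_id, outcome = diabetes)

# Stand-in for new patients: the first five rows of the built-in data
new_patients <- pima_diabetes[1:5, ]

# A data frame passed to predict() is scored instead of the training data
new_predictions <- predict(quick_models, new_patients)
new_predictions</code></pre></div>
<p>In real use the new data would come from outside the training set; reusing training rows here only keeps the sketch self-contained.</p>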
<p>We can plot how effectively the model is able to separate diabetic from non-diabetic individuals by calling the <code>plot</code> function on the output of <code>predict</code>.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" data-line-number="1"><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>(predictions)</a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-6-1.png" width="576"></p>
<p>If you want outcome-class predictions in addition to predicted probabilities, the <code>outcome_groups</code> argument accomplishes that. If it is <code>TRUE</code>, the overall accuracy of predictions is maximized. If it is a number, it represents the relative cost of a false-negative to a false-positive outcome. The example below says that one false negative is as bad as two false positives. If you want risk groups instead, see the <code>risk_groups</code> argument.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" data-line-number="1">quick_models <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb7-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/stats/predict.html">predict</a></span>(<span class="dt">outcome_groups =</span> <span class="dv">2</span>) <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb7-3" data-line-number="3"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>()</a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-7-1.png" width="576"></p>
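<p>For the <code>risk_groups</code> argument mentioned above, a call might look like the following sketch. It rests on our reading of the <code>predict</code> help file: we assume a named numeric vector supplies the group names in increasing order of risk along with their relative sizes, and that the groups are returned in a column called <code>predicted_group</code>. Check <code>?predict.model_list</code> before relying on either detail; the group names and sizes here are arbitrary.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Train as in the example above
quick_models <- machine_learn(pima_diabetes, patient_id, outcome = diabetes)

# Assumption: names give the groups (lowest risk first), numbers give relative sizes
grouped_predictions <- predict(quick_models,
                               risk_groups = c(low = 30, moderate = 40,
                                               high = 20, extreme = 10))

# Count how many patients fall into each group (column name assumed, see above)
table(grouped_predictions$predicted_group)</code></pre></div>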
</div>
<div id="data-profiling" class="section level1">
<h1 class="hasAnchor">
<a href="#data-profiling" class="anchor"></a>Data Profiling</h1>
<p>It is always a good idea to be aware of where there are missing values in data. The <code>missingness</code> function helps with that. In addition to looking for values R sees as missing, it looks for other values that might represent missing, such as <code>"NULL"</code>, and issues a warning if it finds any. Like many <code>healthcareai</code> functions, it has a <code>plot</code> method so you can inspect the results more quickly and intuitively by passing the output to <code>plot</code>.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" data-line-number="1"><span class="kw"><a href="../../reference/missingness.html">missingness</a></span>(pima_diabetes) <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb8-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>()</a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-8-1.png" width="576"></p>
<p>It’s good that we don’t have any missingness in our ID or outcome columns. We’ll see how missingness in predictors is addressed further down.</p>
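<p>As a quick cross-check on the plot, the share of <code>NA</code> values per column can also be computed with base R. This is only a sketch: unlike <code>missingness</code>, it does not look for stand-in values such as <code>"NULL"</code>, and the name <code>pct_missing</code> is an illustrative choice.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Percent of NA values in each column, using base R only
pct_missing <- colMeans(is.na(pima_diabetes)) * 100

# Show the columns with the most missingness first
sort(round(pct_missing, 1), decreasing = TRUE)</code></pre></div>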
</div>
<div id="data-preparation" class="section level1">
<h1 class="hasAnchor">
<a href="#data-preparation" class="anchor"></a>Data Preparation</h1>
<p>To get an honest picture of how well a model performs (and an accurate estimate of how well it will perform on yet-unseen data), it is wise to hide a small portion of observations from model training and assess model performance on this “validation” or “test” dataset. In fact, <code>healthcareai</code> does this automatically and repeatedly under the hood, so it’s not strictly necessary, but it’s still a good idea. The <code>split_train_test</code> function simplifies this, and it ensures the test dataset has proportionally similar characteristics to the training dataset. By default, 80% of observations are used for training; that proportion can be adjusted with the <code>p</code> parameter. The <code>seed</code> parameter controls randomness so that you can get the same split every time you run the code if you want strict reproducibility.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" data-line-number="1">split_data <-<span class="st"> </span><span class="kw"><a href="../../reference/split_train_test.html">split_train_test</a></span>(<span class="dt">d =</span> pima_diabetes,</a>
<a class="sourceLine" id="cb9-2" data-line-number="2"> <span class="dt">outcome =</span> diabetes,</a>
<a class="sourceLine" id="cb9-3" data-line-number="3"> <span class="dt">p =</span> <span class="fl">.9</span>,</a>
<a class="sourceLine" id="cb9-4" data-line-number="4"> <span class="dt">seed =</span> <span class="dv">84105</span>)</a></code></pre></div>
<p><code>split_data</code> contains two data frames, named <code>train</code> and <code>test</code>.</p>
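<p>A quick way to confirm the split behaved as intended is to look at the size of each piece and at the outcome proportions in each. The sketch below uses only base R on the two data frames and repeats the split so that it runs on its own.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Same split as above
split_data <- split_train_test(d = pima_diabetes,
                               outcome = diabetes,
                               p = .9,
                               seed = 84105)

# How many rows ended up in each piece
nrow(split_data$train)
nrow(split_data$test)

# The share of each outcome class should be similar in both pieces
prop.table(table(split_data$train$diabetes))
prop.table(table(split_data$test$diabetes))</code></pre></div>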
<p>One of the major workhorse functions in <code>healthcareai</code> is <code>prep_data</code>. It is called under the hood by <code>machine_learn</code>, so you don’t have to worry about these details if you don’t want to, but eventually you’ll want to customize how your data is prepared; this is where you do that. The help file <code><a href="../../reference/prep_data.html">?prep_data</a></code> describes what the function does and how it can be customized. Here, let’s customize preparation to scale and center numeric variables and avoid collapsing rare factor levels into “other”.</p>
<p>The first arguments to <code>prep_data</code> are the same as those to <code>machine_learn</code>: data frame, ignored columns, and the outcome column. Then we can specify prep details.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" data-line-number="1">prepped_training_data <-<span class="st"> </span><span class="kw"><a href="../../reference/prep_data.html">prep_data</a></span>(split_data<span class="op">$</span>train, patient_id, <span class="dt">outcome =</span> diabetes,</a>
<a class="sourceLine" id="cb10-2" data-line-number="2"> <span class="dt">center =</span> <span class="ot">TRUE</span>, <span class="dt">scale =</span> <span class="ot">TRUE</span>,</a>
<a class="sourceLine" id="cb10-3" data-line-number="3"> <span class="dt">collapse_rare_factors =</span> <span class="ot">FALSE</span>)</a>
<a class="sourceLine" id="cb10-4" data-line-number="4"><span class="co"># > Training new data prep recipe...</span></a></code></pre></div>
<p>The “recipe” that the above message refers to is a set of instructions for how to transform a dataset the way we just transformed our training data. Any machine learning that we do (within <code>healthcareai</code>) on <code>prepped_training_data</code> will retain that recipe and apply it before making predictions on new data. That means that when you have models making predictions in production, you don’t have to figure out how to transform the data or worry about encountering missing data or new category levels.</p>
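<p>Although <code>predict</code> applies the recipe for you, it can be instructive to apply it by hand. The sketch below assumes that <code>prep_data</code> accepts a previously prepared data frame through its <code>recipe</code> argument, as described in <code>?prep_data</code>; the name <code>prepped_test_data</code> is hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Recreate the split and the prepared training data from the steps above
split_data <- split_train_test(d = pima_diabetes, outcome = diabetes,
                               p = .9, seed = 84105)
prepped_training_data <- prep_data(split_data$train, patient_id, outcome = diabetes,
                                   center = TRUE, scale = TRUE,
                                   collapse_rare_factors = FALSE)

# Reuse the stored recipe so the test rows get the same transformations
prepped_test_data <- prep_data(split_data$test, recipe = prepped_training_data)

# Inspect the transformed test data
str(prepped_test_data)</code></pre></div>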
</div>
<div id="model-training" class="section level1">
<h1 class="hasAnchor">
<a href="#model-training" class="anchor"></a>Model Training</h1>
<p><code>machine_learn</code> takes care of data preparation and model training for you, but if you want more precise control, <code>tune_models</code> and <code>flash_models</code> are the model-training functions you’re looking for. They differ in that <code>tune_models</code> searches over hyperparameters to optimize model performance, while <code>flash_models</code> trains models at set hyperparameter values. So, <code>tune_models</code> produces better models, but takes longer (approaching 10x longer at default settings).</p>
<p>Let’s tune all three available models: random forests (“RF”), regularized regression (i.e. lasso and ridge, “GLM”), and gradient-boosted decision trees (i.e. XGBoost, “XGB”). To optimize model performance, let’s crank <code>tune_depth</code> up a little from its default value of ten. That will tune the models over more combinations of hyperparameter values in the search for the best model. This will increase training time, so be cautious with it at first, but for this modest-sized dataset, the entire process takes less than a minute to complete on a laptop.</p>
<p>Let’s also select “PR” as our model metric. That optimizes for area under the precision-recall curve rather than the default of area under the receiver operating characteristic curve (“ROC”). This is usually a good idea when one outcome category is much more common than the other category.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" data-line-number="1">models <-<span class="st"> </span><span class="kw"><a href="../../reference/tune_models.html">tune_models</a></span>(<span class="dt">d =</span> prepped_training_data,</a>
<a class="sourceLine" id="cb11-2" data-line-number="2"> <span class="dt">outcome =</span> diabetes,</a>
<a class="sourceLine" id="cb11-3" data-line-number="3"> <span class="dt">tune_depth =</span> <span class="dv">25</span>,</a>
<a class="sourceLine" id="cb11-4" data-line-number="4"> <span class="dt">metric =</span> <span class="st">"PR"</span>)</a>
<a class="sourceLine" id="cb11-5" data-line-number="5"><span class="co"># > Variable(s) ignored in prep_data won't be used to tune models: patient_id</span></a>
<a class="sourceLine" id="cb11-6" data-line-number="6"><span class="co"># > </span></a>
<a class="sourceLine" id="cb11-7" data-line-number="7"><span class="co"># > diabetes looks categorical, so training classification algorithms.</span></a>
<a class="sourceLine" id="cb11-8" data-line-number="8"><span class="co"># > </span></a>
<a class="sourceLine" id="cb11-9" data-line-number="9"><span class="co"># > After data processing, models are being trained on 13 features with 692 observations.</span></a>
<a class="sourceLine" id="cb11-10" data-line-number="10"><span class="co"># > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 125 rf's, 125 xgb's, and 250 glm's</span></a>
<a class="sourceLine" id="cb11-11" data-line-number="11"><span class="co"># > Training with cross validation: Random Forest</span></a>
<a class="sourceLine" id="cb11-12" data-line-number="12"><span class="co"># > Training with cross validation: eXtreme Gradient Boosting</span></a>
<a class="sourceLine" id="cb11-13" data-line-number="13"><span class="co"># > Training with cross validation: glmnet</span></a>
<a class="sourceLine" id="cb11-14" data-line-number="14"><span class="co"># > </span></a>
<a class="sourceLine" id="cb11-15" data-line-number="15"><span class="co"># > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***</span></a>
<a class="sourceLine" id="cb11-16" data-line-number="16"><span class="co"># > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***</span></a></code></pre></div>
<p>You can compare performance across models with <code>evaluate</code>.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" data-line-number="1"><span class="kw"><a href="../../reference/evaluate.html">evaluate</a></span>(models, <span class="dt">all_models =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb12-2" data-line-number="2"><span class="co"># > # A tibble: 3 x 3</span></a>
<a class="sourceLine" id="cb12-3" data-line-number="3"><span class="co"># > model AUPR AUROC</span></a>
<a class="sourceLine" id="cb12-4" data-line-number="4"><span class="co"># > <chr> <dbl> <dbl></span></a>
<a class="sourceLine" id="cb12-5" data-line-number="5"><span class="co"># > 1 Random Forest 0.703 0.842</span></a>
<a class="sourceLine" id="cb12-6" data-line-number="6"><span class="co"># > 2 glmnet 0.688 0.836</span></a>
<a class="sourceLine" id="cb12-7" data-line-number="7"><span class="co"># > 3 eXtreme Gradient Boosting 0.687 0.820</span></a></code></pre></div>
<p>For more detail, you can examine how models perform across hyperparameters by plotting the model object. Here we extract the best model by name and plot only its performance across hyperparameter values. It looks like extratrees is a superior split rule for this model.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" data-line-number="1">models[<span class="st">"Random Forest"</span>] <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb13-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>()</a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-13-1.png" width="864"></p>
<div id="faster-model-training" class="section level2">
<h2 class="hasAnchor">
<a href="#faster-model-training" class="anchor"></a>Faster Model Training</h2>
<p>If you’re feeling the need for speed, <code>flash_models</code> is the function for you. It uses fixed sets of hyperparameter values to train the models, so you still get a model customized to your data, but without burning the electricity and time to precisely optimize all the details. Here we’ll use <code>models = "RF"</code> to train only a random forest.</p>
<p>If you want to train a model on fixed hyperparameter values, but you want to choose those values, you can pass them to the <code>hyperparameters</code> argument of <code>tune_models</code>. Run <code><a href="../../reference/get_hyperparameter_defaults.html">get_hyperparameter_defaults()</a></code> to see the default values and get a list you can customize.</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" data-line-number="1">untuned_rf <-<span class="st"> </span><span class="kw"><a href="../../reference/flash_models.html">flash_models</a></span>(<span class="dt">d =</span> prepped_training_data,</a>
<a class="sourceLine" id="cb14-2" data-line-number="2"> <span class="dt">outcome =</span> diabetes,</a>
<a class="sourceLine" id="cb14-3" data-line-number="3"> <span class="dt">models =</span> <span class="st">"RF"</span>,</a>
<a class="sourceLine" id="cb14-4" data-line-number="4"> <span class="dt">metric =</span> <span class="st">"PR"</span>)</a>
<a class="sourceLine" id="cb14-5" data-line-number="5"><span class="co"># > Variable(s) ignored in prep_data won't be used to tune models: patient_id</span></a>
<a class="sourceLine" id="cb14-6" data-line-number="6"><span class="co"># > </span></a>
<a class="sourceLine" id="cb14-7" data-line-number="7"><span class="co"># > diabetes looks categorical, so training classification algorithms.</span></a>
<a class="sourceLine" id="cb14-8" data-line-number="8"><span class="co"># > </span></a>
<a class="sourceLine" id="cb14-9" data-line-number="9"><span class="co"># > After data processing, models are being trained on 13 features with 692 observations.</span></a>
<a class="sourceLine" id="cb14-10" data-line-number="10"><span class="co"># > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 5 rf's</span></a>
<a class="sourceLine" id="cb14-11" data-line-number="11"><span class="co"># > Training at fixed values: Random Forest</span></a>
<a class="sourceLine" id="cb14-12" data-line-number="12"><span class="co"># > </span></a>
<a class="sourceLine" id="cb14-13" data-line-number="13"><span class="co"># > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***</span></a>
<a class="sourceLine" id="cb14-14" data-line-number="14"><span class="co"># > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***</span></a></code></pre></div>
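<p>The <code>hyperparameters</code> argument mentioned above can be sketched as follows. This is a minimal illustration, not output from this vignette: the grid values are arbitrary, and it assumes that <code>hyperparameters</code> accepts a named list of data frames keyed by model (here <code>rf</code>), with column names matching those shown by <code>get_hyperparameter_defaults()</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# See the default values and the hyperparameter names each model expects
get_hyperparameter_defaults()

# Hypothetical grid: every combination of these values is evaluated
rf_grid <- expand.grid(
  mtry = c(2, 4, 6),
  splitrule = c("gini", "extratrees"),
  min.node.size = 5
)

# Assumes prepped_training_data from the data preparation step above
grid_rf <- tune_models(d = prepped_training_data,
                       outcome = diabetes,
                       models = "rf",
                       metric = "PR",
                       hyperparameters = list(rf = rf_grid))</code></pre></div>
<p>With three values of <code>mtry</code> and two split rules, this grid has six rows, so six hyperparameter combinations are compared in each cross-validation fold.</p>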
</div>
</div>
<div id="model-interpretation" class="section level1">
<h1 class="hasAnchor">
<a href="#model-interpretation" class="anchor"></a>Model Interpretation</h1>
<div id="interpret" class="section level2">
<h2 class="hasAnchor">
<a href="#interpret" class="anchor"></a>Interpret</h2>
<p>If you trained a GLM model, you can extract model coefficients from it with the <code>interpret</code> function. These are coefficient estimates from a regularized logistic or linear regression model. If you didn’t scale your predictors (which is the default in <code>prep_data</code>), these will be in natural units (e.g. in the plot below, a unit increase in plasma glucose corresponds to an expected log-odds increase of diabetes of just over one). Importantly, natural units mean that you can’t interpret the size of the coefficients as the importance of the predictor. To get that interpretation, scale your features during data preparation by calling <code>prep_data</code> with <code>scale = TRUE</code> and then <code>flash_models</code> or <code>tune_models</code>.</p>
<p>In this plot, the low value of <code>weight_class_normal</code> signifies that people with normal weight are less likely to have diabetes. Similarly, plasma glucose is associated with increased risk of diabetes after accounting for other variables.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" data-line-number="1"><span class="kw"><a href="../../reference/interpret.html">interpret</a></span>(models) <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb15-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>()</a>
<a class="sourceLine" id="cb15-3" data-line-number="3"><span class="co"># > Warning in interpret(models): Interpreting glmnet model, but Random Forest</span></a>
<a class="sourceLine" id="cb15-4" data-line-number="4"><span class="co"># > performed best in cross-validation and will be used to make predictions. To use</span></a>
<a class="sourceLine" id="cb15-5" data-line-number="5"><span class="co"># > the glmnet model for predictions, extract it with x['glmnet'].</span></a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-15-1.png" width="672"></p>
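<p>To make the log-odds interpretation above concrete, here is a small base-R calculation. The coefficient of 1 and the 20% baseline risk are made-up numbers for illustration, not values taken from the fitted model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical coefficient: log-odds change per one-unit increase in a predictor
coefficient <- 1

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(coefficient)

# Effect on a hypothetical patient whose baseline risk is 20%
baseline_risk <- 0.20
new_risk <- plogis(qlogis(baseline_risk) + coefficient)

round(c(odds_ratio = odds_ratio, new_risk = new_risk), 2)</code></pre></div>
<p>Because the change in probability depends on the baseline risk, the same coefficient moves a low-risk patient and a high-risk patient by different amounts.</p>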
</div>
<div id="variable-importance" class="section level2">
<h2 class="hasAnchor">
<a href="#variable-importance" class="anchor"></a>Variable Importance</h2>
<p>Tree-based methods such as random forest and boosted decision trees can’t provide coefficients like regularized regression models can, but they can provide information about how important each feature (aka predictor, aka variable) is for making accurate predictions. You can see these “variable importances” by calling <code>get_variable_importance</code> on your model object. As with <code>interpret</code> and many other functions in <code>healthcareai</code>, you can plot the output of <code>get_variable_importance</code> with a simple <code>plot</code> call.</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" data-line-number="1"><span class="kw"><a href="../../reference/get_variable_importance.html">get_variable_importance</a></span>(models) <span class="op">%>%</span></a>
<a class="sourceLine" id="cb16-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>()</a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-16-1.png" width="672"></p>
</div>
<div id="explore" class="section level2">
<h2 class="hasAnchor">
<a href="#explore" class="anchor"></a>Explore</h2>
<p>The <code>explore</code> function reveals how a model makes its predictions. It takes the most important features in a model and builds a variety of “counterfactual” observations across those features to see what predictions the model would make at various combinations of feature values. To see the effect of more features, adjust the <code>n_use</code> argument to <code>plot</code>; to plot different features, specify <code>x_var</code> and <code>color_var</code>.</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb17-1" data-line-number="1"><span class="kw"><a href="../../reference/explore.html">explore</a></span>(models) <span class="op">%>%</span><span class="st"> </span></a>
<a class="sourceLine" id="cb17-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>()</a>
<a class="sourceLine" id="cb17-3" data-line-number="3"><span class="co"># > With 4 varying features and n_use = 2, using median to aggregate predicted outcomes across age and pregnancies. You could turn `n_use` up to see the impact of more features.</span></a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-17-1.png" width="576"></p>
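<p>The plotting options described above might be used as follows. This is a sketch that reuses the <code>models</code> object trained earlier. It assumes that <code>x_var</code> and <code>color_var</code> take feature names as strings, and it picks <code>age</code> and <code>pregnancies</code> only because the message above lists them among the varied features.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(healthcareai)

# Assumes `models` from the model training step above
explored <- explore(models)

# Show three features instead of the default two
plot(explored, n_use = 3)

# Choose which features go on the x-axis and the color scale
plot(explored, x_var = "age", color_var = "pregnancies")</code></pre></div>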
</div>
</div>
<div id="prediction" class="section level1">
<h1 class="hasAnchor">
<a href="#prediction" class="anchor"></a>Prediction</h1>
<p><code>predict</code> will automatically use the best-performing model from training (evaluated out-of-fold in cross validation). If no new data is passed to <code>predict</code> it will return out-of-fold predictions from training. The predicted probabilities appear in the <code>predicted_diabetes</code> column.</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" data-line-number="1"><span class="kw"><a href="https://rdrr.io/r/stats/predict.html">predict</a></span>(models)</a>
<a class="sourceLine" id="cb18-2" data-line-number="2"><span class="co"># > "predicted_diabetes" predicted by Random Forest last trained: 2020-02-28 08:43:02</span></a>
<a class="sourceLine" id="cb18-3" data-line-number="3"><span class="co"># > Performance in training: AUPR = 0.7</span></a>
<a class="sourceLine" id="cb18-4" data-line-number="4"><span class="co"># > # A tibble: 692 x 11</span></a>
<a class="sourceLine" id="cb18-5" data-line-number="5"><span class="co"># > diabetes predicted_diabe… patient_id pregnancies plasma_glucose diastolic_bp</span></a>
<a class="sourceLine" id="cb18-6" data-line-number="6"><span class="co"># > * <fct> <dbl> <int> <int> <int> <int></span></a>
<a class="sourceLine" id="cb18-7" data-line-number="7"><span class="co"># > 1 Y 0.691 1 6 148 72</span></a>
<a class="sourceLine" id="cb18-8" data-line-number="8"><span class="co"># > 2 N 0.142 2 1 85 66</span></a>
<a class="sourceLine" id="cb18-9" data-line-number="9"><span class="co"># > 3 Y 0.432 3 8 183 64</span></a>
<a class="sourceLine" id="cb18-10" data-line-number="10"><span class="co"># > 4 N 0.0219 4 1 89 66</span></a>
<a class="sourceLine" id="cb18-11" data-line-number="11"><span class="co"># > 5 Y 0.534 5 0 137 40</span></a>
<a class="sourceLine" id="cb18-12" data-line-number="12"><span class="co"># > # … with 687 more rows, and 5 more variables: skinfold <int>, insulin <int>,</span></a>
<a class="sourceLine" id="cb18-13" data-line-number="13"><span class="co"># > # weight_class <chr>, pedigree <dbl>, age <int></span></a></code></pre></div>
<p>To get predictions on a new dataset, pass the new data to <code>predict</code>, and it will automatically be prepared based on the recipe generated on the training data. We can plot the predictions to see how well our model is doing, and we see that it’s separating diabetic from non-diabetic individuals pretty well, although there are a fair number of non-diabetics with high predicted probabilities of diabetes. This may be due to optimizing for precision-recall, or may indicate pre-diabetic patients.</p>
<p>Above, <code>predict</code> returned predicted probabilities. Here, we also request risk-group predictions, defining four risk groups (low, moderate, high, and extreme) containing 30%, 40%, 20% and 10% of patients, respectively.</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" data-line-number="1">test_predictions <-<span class="st"> </span></a>
<a class="sourceLine" id="cb19-2" data-line-number="2"><span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/stats/predict.html">predict</a></span>(models, </a>
<a class="sourceLine" id="cb19-3" data-line-number="3"> split_data<span class="op">$</span>test, </a>
<a class="sourceLine" id="cb19-4" data-line-number="4"> <span class="dt">risk_groups =</span> <span class="kw"><a href="https://rdrr.io/r/base/c.html">c</a></span>(<span class="dt">low =</span> <span class="dv">30</span>, <span class="dt">moderate =</span> <span class="dv">40</span>, <span class="dt">high =</span> <span class="dv">20</span>, <span class="dt">extreme =</span> <span class="dv">10</span>)</a>
<a class="sourceLine" id="cb19-5" data-line-number="5"> )</a>
<a class="sourceLine" id="cb19-6" data-line-number="6"><span class="co"># > Prepping data based on provided recipe</span></a>
<a class="sourceLine" id="cb19-7" data-line-number="7"><span class="kw"><a href="https://rdrr.io/r/graphics/plot.html">plot</a></span>(test_predictions)</a></code></pre></div>
<p><img src="healthcareai_files/figure-html/unnamed-chunk-19-1.png" width="576"></p>
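<p>To see what percentage-based risk groups mean, the following base-R sketch bins simulated probabilities at the matching quantiles. It only illustrates the idea: the probabilities are simulated, and it makes no claim about how <code>predict</code> computes its cutoffs internally.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Simulated predicted probabilities, standing in for real model output
set.seed(1)
predicted_prob <- runif(200)

# Same group sizes as in the predict call above, in percent
group_pct <- c(low = 30, moderate = 40, high = 20, extreme = 10)

# Cutoffs sit at the cumulative percentages: 0%, 30%, 70%, 90%, 100%
cutoffs <- quantile(predicted_prob, probs = c(0, cumsum(group_pct) / 100))

# Assign each probability to the group whose interval contains it
risk_group <- cut(predicted_prob,
                  breaks = cutoffs,
                  labels = names(group_pct),
                  include.lowest = TRUE)

table(risk_group)</code></pre></div>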
</div>
<div id="saving-moving-and-loading-models" class="section level1">
<h1 class="hasAnchor">
<a href="#saving-moving-and-loading-models" class="anchor"></a>Saving, Moving, and Loading Models</h1>
<p>Everything we have done above happens “in memory”. It’s all within one R session, so there’s no need to save anything to disk or load anything back into R. Putting a machine learning model in production typically means moving the model into a production environment. To do that, save the model with the <code>save_models</code> function.</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb20-1" data-line-number="1"><span class="kw"><a href="../../reference/save_models.html">save_models</a></span>(models, <span class="dt">file =</span> <span class="st">"my_models.RDS"</span>)</a></code></pre></div>
<p>The above code will store the <code>models</code> object with all its metadata in the <code>my_models.RDS</code> file in the working directory, which you can identify with <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code>. You can move that file to any other directory or machine, even across operating systems, and pull it back into R with the <code>load_models</code> function.</p>
<p>The only tricky part is pointing <code>load_models</code> at the model file. If you don’t provide a file path, i.e. call <code><a href="../../reference/save_models.html">load_models()</a></code>, you’ll get a dialog box from which you can choose your model file. Otherwise, you can give <code>load_models</code> an absolute path to the file, e.g. <code><a href="../../reference/save_models.html">load_models("C:/Users/user.name/Documents/diabetes/my_models.RDS")</a></code>, or a path relative to your working directory, which again you can find with <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code>, e.g. <code><a href="../../reference/save_models.html">load_models("data/my_models.RDS")</a></code>. If the model file is in your working directory (often the directory of your R script or project), you can load it by file name alone.</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb21-1" data-line-number="1">models <-<span class="st"> </span><span class="kw"><a href="../../reference/save_models.html">load_models</a></span>(<span class="st">"my_models.RDS"</span>)</a></code></pre></div>
<p>That will reestablish the <code>models</code> object in your R session. You can confirm this by clicking on the “Environment” tab in RStudio or running <code><a href="https://rdrr.io/r/base/ls.html">ls()</a></code> to list all objects in your R session.</p>
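<p>When a model file lives outside the working directory, it can help to build the path explicitly and check that the file exists before loading. The sketch below uses the <code>data</code> subdirectory from the example above as a hypothetical location; adjust it to wherever you moved the file.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Relative paths are resolved against the working directory
relative_path <- file.path("data", "my_models.RDS")
absolute_path <- file.path(getwd(), relative_path)

# Only try to load the models if the file is actually there
if (file.exists(absolute_path)) {
  models <- healthcareai::load_models(absolute_path)
} else {
  message("No model file found at ", absolute_path)
}</code></pre></div>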
</div>
<div id="a-regression-example" class="section level1">
<h1 class="hasAnchor">
<a href="#a-regression-example" class="anchor"></a>A Regression Example</h1>
<p>All the examples above have been classification tasks, predicting a yes/no outcome. Here’s an example of a full regression modeling pipeline on a silly problem: predicting individuals’ ages. The code is very similar to classification.</p>
<div class="sourceCode" id="cb22"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb22-1" data-line-number="1">regression_models <-<span class="st"> </span><span class="kw"><a href="../../reference/machine_learn.html">machine_learn</a></span>(pima_diabetes, patient_id, <span class="dt">outcome =</span> age)</a>
<a class="sourceLine" id="cb22-2" data-line-number="2"><span class="co"># > Training new data prep recipe...</span></a>
<a class="sourceLine" id="cb22-3" data-line-number="3"><span class="co"># > Variable(s) ignored in prep_data won't be used to tune models: patient_id</span></a>
<a class="sourceLine" id="cb22-4" data-line-number="4"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-5" data-line-number="5"><span class="co"># > age looks numeric, so training regression algorithms.</span></a>
<a class="sourceLine" id="cb22-6" data-line-number="6"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-7" data-line-number="7"><span class="co"># > After data processing, models are being trained on 14 features with 768 observations.</span></a>
<a class="sourceLine" id="cb22-8" data-line-number="8"><span class="co"># > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 rf's, 50 xgb's, and 100 glm's</span></a>
<a class="sourceLine" id="cb22-9" data-line-number="9"><span class="co"># > Training with cross validation: Random Forest</span></a>
<a class="sourceLine" id="cb22-10" data-line-number="10"><span class="co"># > Training with cross validation: eXtreme Gradient Boosting</span></a>
<a class="sourceLine" id="cb22-11" data-line-number="11"><span class="co"># > Training with cross validation: glmnet</span></a>
<a class="sourceLine" id="cb22-12" data-line-number="12"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-13" data-line-number="13"><span class="co"># > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***</span></a>
<a class="sourceLine" id="cb22-14" data-line-number="14"><span class="co"># > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***</span></a>
<a class="sourceLine" id="cb22-15" data-line-number="15"><span class="kw"><a href="https://rdrr.io/r/base/summary.html">summary</a></span>(regression_models)</a>
<a class="sourceLine" id="cb22-16" data-line-number="16"><span class="co"># > Models trained: 2020-02-28 08:43:23</span></a>
<a class="sourceLine" id="cb22-17" data-line-number="17"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-18" data-line-number="18"><span class="co"># > Models tuned via 5-fold cross validation over 10 combinations of hyperparameter values.</span></a>
<a class="sourceLine" id="cb22-19" data-line-number="19"><span class="co"># > Best performance: RMSE = 9.1, MAE = 6.5, Rsquared = 0.41</span></a>
<a class="sourceLine" id="cb22-20" data-line-number="20"><span class="co"># > By Random Forest with hyperparameters:</span></a>
<a class="sourceLine" id="cb22-21" data-line-number="21"><span class="co"># > mtry = 4</span></a>
<a class="sourceLine" id="cb22-22" data-line-number="22"><span class="co"># > splitrule = variance</span></a>
<a class="sourceLine" id="cb22-23" data-line-number="23"><span class="co"># > min.node.size = 17</span></a>
<a class="sourceLine" id="cb22-24" data-line-number="24"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-25" data-line-number="25"><span class="co"># > Out-of-fold performance of all trained models:</span></a>
<a class="sourceLine" id="cb22-26" data-line-number="26"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-27" data-line-number="27"><span class="co"># > $`Random Forest`</span></a>
<a class="sourceLine" id="cb22-28" data-line-number="28"><span class="co"># > # A tibble: 10 x 9</span></a>
<a class="sourceLine" id="cb22-29" data-line-number="29"><span class="co"># > mtry splitrule min.node.size RMSE Rsquared MAE RMSESD RsquaredSD MAESD</span></a>
<a class="sourceLine" id="cb22-30" data-line-number="30"><span class="co"># > <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl></span></a>
<a class="sourceLine" id="cb22-31" data-line-number="31"><span class="co"># > 1 4 variance 17 9.07 0.410 6.53 0.655 0.0283 0.276</span></a>
<a class="sourceLine" id="cb22-32" data-line-number="32"><span class="co"># > 2 3 variance 4 9.08 0.412 6.59 0.694 0.0283 0.305</span></a>
<a class="sourceLine" id="cb22-33" data-line-number="33"><span class="co"># > 3 5 variance 7 9.09 0.404 6.53 0.601 0.0246 0.235</span></a>
<a class="sourceLine" id="cb22-34" data-line-number="34"><span class="co"># > 4 4 extratrees 17 9.16 0.417 6.66 0.814 0.0408 0.441</span></a>
<a class="sourceLine" id="cb22-35" data-line-number="35"><span class="co"># > 5 7 variance 2 9.20 0.391 6.62 0.587 0.0209 0.212</span></a>
<a class="sourceLine" id="cb22-36" data-line-number="36"><span class="co"># > # … with 5 more rows</span></a>
<a class="sourceLine" id="cb22-37" data-line-number="37"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-38" data-line-number="38"><span class="co"># > $`eXtreme Gradient Boosting`</span></a>
<a class="sourceLine" id="cb22-39" data-line-number="39"><span class="co"># > # A tibble: 10 x 13</span></a>
<a class="sourceLine" id="cb22-40" data-line-number="40"><span class="co"># > eta max_depth gamma colsample_bytree min_child_weight subsample nrounds</span></a>
<a class="sourceLine" id="cb22-41" data-line-number="41"><span class="co"># > <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int></span></a>
<a class="sourceLine" id="cb22-42" data-line-number="42"><span class="co"># > 1 0.0291 4 5.73 0.730 0.248 0.763 570</span></a>
<a class="sourceLine" id="cb22-43" data-line-number="43"><span class="co"># > 2 0.176 7 9.76 0.518 3.53 0.744 46</span></a>
<a class="sourceLine" id="cb22-44" data-line-number="44"><span class="co"># > 3 0.0990 2 0.350 0.624 2.33 0.526 626</span></a>
<a class="sourceLine" id="cb22-45" data-line-number="45"><span class="co"># > 4 0.423 5 6.79 0.643 3.80 0.940 69</span></a>
<a class="sourceLine" id="cb22-46" data-line-number="46"><span class="co"># > 5 0.432 5 6.23 0.505 14.2 0.356 30</span></a>
<a class="sourceLine" id="cb22-47" data-line-number="47"><span class="co"># > # … with 5 more rows, and 6 more variables: RMSE <dbl>, Rsquared <dbl>,</span></a>
<a class="sourceLine" id="cb22-48" data-line-number="48"><span class="co"># > # MAE <dbl>, RMSESD <dbl>, RsquaredSD <dbl>, MAESD <dbl></span></a>
<a class="sourceLine" id="cb22-49" data-line-number="49"><span class="co"># > </span></a>
<a class="sourceLine" id="cb22-50" data-line-number="50"><span class="co"># > $glmnet</span></a>
<a class="sourceLine" id="cb22-51" data-line-number="51"><span class="co"># > # A tibble: 20 x 8</span></a>
<a class="sourceLine" id="cb22-52" data-line-number="52"><span class="co"># > alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD</span></a>
<a class="sourceLine" id="cb22-53" data-line-number="53"><span class="co"># > <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl></span></a>
<a class="sourceLine" id="cb22-54" data-line-number="54"><span class="co"># > 1 0 0.00128 9.37 0.377 6.74 0.578 0.0798 0.358</span></a>
<a class="sourceLine" id="cb22-55" data-line-number="55"><span class="co"># > 2 0 0.00367 9.37 0.377 6.74 0.578 0.0798 0.358</span></a>
<a class="sourceLine" id="cb22-56" data-line-number="56"><span class="co"># > 3 0 0.00896 9.37 0.377 6.74 0.578 0.0798 0.358</span></a>
<a class="sourceLine" id="cb22-57" data-line-number="57"><span class="co"># > 4 0 0.0218 9.37 0.377 6.74 0.578 0.0798 0.358</span></a>
<a class="sourceLine" id="cb22-58" data-line-number="58"><span class="co"># > 5 0 0.0367 9.37 0.377 6.74 0.578 0.0798 0.358</span></a>
<a class="sourceLine" id="cb22-59" data-line-number="59"><span class="co"># > # … with 15 more rows</span></a></code></pre></div>
<p>Let’s make a prediction on a hypothetical new patient. Note that the model handles missingness in <code>insulin</code> and a new category level in <code>weight_class</code> without a problem (but warns about it).</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb23-1" data-line-number="1">new_patient <-<span class="st"> </span><span class="kw"><a href="https://rdrr.io/r/base/data.frame.html">data.frame</a></span>(</a>
<a class="sourceLine" id="cb23-2" data-line-number="2"> <span class="dt">pregnancies =</span> <span class="dv">0</span>,</a>
<a class="sourceLine" id="cb23-3" data-line-number="3"> <span class="dt">plasma_glucose =</span> <span class="dv">80</span>,</a>
<a class="sourceLine" id="cb23-4" data-line-number="4"> <span class="dt">diastolic_bp =</span> <span class="dv">55</span>,</a>
<a class="sourceLine" id="cb23-5" data-line-number="5"> <span class="dt">skinfold =</span> <span class="dv">24</span>,</a>
<a class="sourceLine" id="cb23-6" data-line-number="6"> <span class="dt">insulin =</span> <span class="ot">NA</span>,</a>
<a class="sourceLine" id="cb23-7" data-line-number="7"> <span class="dt">weight_class =</span> <span class="st">"???"</span>,</a>
<a class="sourceLine" id="cb23-8" data-line-number="8"> <span class="dt">pedigree =</span> <span class="fl">.2</span>,</a>
<a class="sourceLine" id="cb23-9" data-line-number="9"> <span class="dt">diabetes =</span> <span class="st">"N"</span>)</a>
<a class="sourceLine" id="cb23-10" data-line-number="10"><span class="kw"><a href="https://rdrr.io/r/stats/predict.html">predict</a></span>(regression_models, new_patient)</a>
<a class="sourceLine" id="cb23-11" data-line-number="11"><span class="co"># > Warning in ready_with_prep(object, newdata, mi): The following variables(s) had the following value(s) in predict that were not observed in training. </span></a>
<a class="sourceLine" id="cb23-12" data-line-number="12"><span class="co"># > weight_class: ???</span></a>
<a class="sourceLine" id="cb23-13" data-line-number="13"><span class="co"># > Prepping data based on provided recipe</span></a>
<a class="sourceLine" id="cb23-14" data-line-number="14"><span class="co"># > "predicted_age" predicted by Random Forest last trained: 2020-02-28 08:43:23</span></a>
<a class="sourceLine" id="cb23-15" data-line-number="15"><span class="co"># > Performance in training: RMSE = 9.07</span></a>
<a class="sourceLine" id="cb23-16" data-line-number="16"><span class="co"># > # A tibble: 1 x 9</span></a>
<a class="sourceLine" id="cb23-17" data-line-number="17"><span class="co"># > predicted_age pregnancies plasma_glucose diastolic_bp skinfold insulin</span></a>
<a class="sourceLine" id="cb23-18" data-line-number="18"><span class="co"># > * <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> </span></a>
<a class="sourceLine" id="cb23-19" data-line-number="19"><span class="co"># > 1 23.7 0 80 55 24 NA </span></a>
<a class="sourceLine" id="cb23-20" data-line-number="20"><span class="co"># > # … with 3 more variables: weight_class <fct>, pedigree <dbl>, diabetes <fct></span></a></code></pre></div>
</div>
</div>
<div class="col-md-3 hidden-xs hidden-sm" id="sidebar">
<div id="tocnav">
<h2 class="hasAnchor">
<a href="#tocnav" class="anchor"></a>Contents</h2>
<ul class="nav nav-pills nav-stacked">
<li><a href="#easy-machine-learning">Easy Machine Learning</a></li>
<li><a href="#data-profiling">Data Profiling</a></li>
<li><a href="#data-preparation">Data Preparation</a></li>
<li>
<a href="#model-training">Model Training</a><ul class="nav nav-pills nav-stacked">
<li><a href="#faster-model-training">Faster Model Training</a></li>
</ul>
</li>
<li>
<a href="#model-interpretation">Model Interpretation</a><ul class="nav nav-pills nav-stacked">
<li><a href="#interpret">Interpret</a></li>
<li><a href="#variable-importance">Variable Importance</a></li>
<li><a href="#explore">Explore</a></li>
</ul>
</li>
<li><a href="#prediction">Prediction</a></li>
<li><a href="#saving-moving-and-loading-models">Saving, Moving, and Loading Models</a></li>
<li><a href="#a-regression-example">A Regression Example</a></li>
</ul>
</div>
</div>
</div>
<footer><div class="copyright">
<p>Developed by Levi Thatcher, Michael Levy, Mike Mastanduno, Taylor Larsen, Taylor Miller, Rex Sumsion.</p>
</div>
<div class="pkgdown">
<p>Site built with <a href="https://pkgdown.r-lib.org/">pkgdown</a> 1.4.1.</p>
</div>
</footer>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/docsearch.js/2.6.1/docsearch.min.js" integrity="sha256-GKvGqXDznoRYHCwKXGnuchvKSwmx9SRMrZOTh2g4Sb0=" crossorigin="anonymous"></script><script>
docsearch({
apiKey: 'ac39465bc37cbef616f5de1e646b6037',
indexName: 'healthcareai',
inputSelector: 'input#search-input.form-control',
transformData: function(hits) {
return hits.map(function (hit) {
hit.url = updateHitURL(hit);
return hit;
});
}
});
</script>
</body>
</html>