---
title: "Lab6"
author: {Student Name}
date: "2023-03-01"
output: html_document
---
## Case Study: Eel Distribution Modeling
This week's lab follows a project modeling the eel species Anguilla australis, described by Elith et al. (2008). There are two data sets for this lab. You'll use one to train and evaluate your model, then use your model to make predictions on the other. Finally, you'll compare your model's performance to that of the model used by Elith et al.
## Data
Download the training and evaluation data sets (eel.model.data.csv, eel.eval.data.csv) from GitHub here:
https://github.com/MaRo406/eds-232-machine-learning/blob/main/data
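
One way to load the two files is sketched below. It assumes both CSVs have already been downloaded into a local folder; the folder name `data` and the helper `read_eel()` are illustrative choices, not part of the lab.

```r
library(readr)

# Hypothetical helper: read one of the eel CSVs from a local folder.
# `data_dir` is an assumed location, not something the lab specifies.
read_eel <- function(file, data_dir = "data") {
  read_csv(file.path(data_dir, file), show_col_types = FALSE)
}

# Example calls, once the files are in place:
# eel_model <- read_eel("eel.model.data.csv")
# eel_eval  <- read_eel("eel.eval.data.csv")
```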
### Preprocess
Create a recipe to prepare your data for the XGBoost model.
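
A minimal recipe sketch follows. It assumes the outcome column is `Angaus` and that every other column is a predictor; the helper name `make_eel_recipe()` and the choice of steps are assumptions, so adjust them to what your data actually contain.

```r
library(tidymodels)

# Sketch of a preprocessing recipe. Assumes `train_data` has the outcome
# column `Angaus` (already converted to a factor) and that all remaining
# columns are predictors.
make_eel_recipe <- function(train_data) {
  recipe(Angaus ~ ., data = train_data) |>
    step_dummy(all_nominal_predictors()) |>  # xgboost needs numeric predictors
    step_zv(all_predictors())                # drop zero-variance columns
}

# eel_rec <- make_eel_recipe(eel_train)
```

If the data include an identifier column, consider giving it a non-predictor role before modeling.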
### Split and Resample
Split the model data (eel.model.data.csv) into a training and test set, stratified by outcome score (Angaus). Use 10-fold CV to resample the training set.
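
The split and resampling step could look like the sketch below. The 75/25 proportion and the helper name `split_eel()` are assumptions; the lab only fixes the stratification variable and the number of folds.

```r
library(tidymodels)

# Hypothetical helper: stratified train/test split plus 10-fold CV.
# `prop = 0.75` is an assumed proportion, not specified by the lab.
split_eel <- function(model_data, prop = 0.75) {
  # Classification models in tidymodels expect a factor outcome
  model_data <- model_data |> mutate(Angaus = factor(Angaus))
  eel_split <- initial_split(model_data, prop = prop, strata = Angaus)
  eel_train <- training(eel_split)
  list(
    split = eel_split,
    train = eel_train,
    test  = testing(eel_split),
    folds = vfold_cv(eel_train, v = 10)
  )
}

# set.seed(123)  # any seed, so the split is reproducible
# eel_parts <- split_eel(eel_model)
```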
## Tuning XGBoost
### Tune Learning Rate
Following the XGBoost tuning strategy outlined in lecture, first we conduct tuning on just the learning rate parameter:
1. Create a model specification using {xgboost} for the estimation
- Only specify one parameter to tune()
2. Set up a grid to tune your model by using a range of learning rate parameter values: expand.grid(learn_rate = seq(0.0001, 0.3, length.out = 30))
- Use appropriate metrics argument(s).
- Computational efficiency becomes a factor as models get more complex and data get larger. Record the time it takes to run, and do this for each tuning phase you run. You could use {tictoc} or Sys.time().
3. Show the performance of the best models and the estimates for the learning rate parameter values associated with each.
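
The three steps above can be sketched as follows. The helper `tune_learn_rate()`, the choice of `roc_auc` and `accuracy` as metrics, and leaving the number of trees at the engine default are all assumptions; follow the lecture's settings where they differ.

```r
library(tidymodels)

# Only the learning rate is marked for tuning; everything else stays at
# the parsnip/xgboost defaults (an assumption for this sketch).
lr_spec <- boost_tree(learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Grid of learning rates given in the lab instructions
lr_grid <- expand.grid(learn_rate = seq(0.0001, 0.3, length.out = 30))

# Hypothetical helper: tune over `grid` and record the elapsed time.
tune_learn_rate <- function(rec, folds, grid = lr_grid) {
  wf <- workflow() |> add_recipe(rec) |> add_model(lr_spec)
  start <- Sys.time()
  res <- tune_grid(
    wf,
    resamples = folds,
    grid = grid,
    metrics = metric_set(roc_auc, accuracy)  # assumed metrics
  )
  list(results = res, elapsed = Sys.time() - start)
}

# lr_tuned <- tune_learn_rate(eel_rec, eel_folds)
# lr_tuned$elapsed
# show_best(lr_tuned$results, metric = "roc_auc")
```

The same timing can be done with {tictoc} by wrapping the `tune_grid()` call in `tic()` and `toc()`.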
### Tune Tree Parameters
1. Create a new specification where you set the learning rate (which you already optimized) and tune the tree parameters.
2. Set up a tuning grid. This time use grid_latin_hypercube() to get a representative sampling of the parameter space.
3. Show the performance of the best models and the estimates for the tree parameter values associated with each.
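
A sketch of this phase is below. It takes "tree parameters" to mean `tree_depth`, `min_n`, and `loss_reduction`, which is an assumption about the lecture's grouping; `make_tree_spec()` and the grid size of 20 are likewise illustrative.

```r
library(tidymodels)

# Hypothetical helper. `best_learn_rate` is a placeholder for the value
# chosen in the learning-rate phase.
make_tree_spec <- function(best_learn_rate) {
  boost_tree(
    learn_rate = best_learn_rate,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune()
  ) |>
    set_engine("xgboost") |>
    set_mode("classification")
}

# Latin hypercube sample over the three tree parameters.
# size = 20 is an arbitrary choice for this sketch.
tree_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  size = 20
)

# tree_wf    <- workflow() |> add_recipe(eel_rec) |> add_model(make_tree_spec(0.1))
# tree_tuned <- tune_grid(tree_wf, resamples = eel_folds, grid = tree_grid)
# show_best(tree_tuned, metric = "roc_auc")
```

Newer {dials} releases may print a deprecation message for `grid_latin_hypercube()` that points to `grid_space_filling()`; the grid it returns is still usable.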
### Tune Stochastic Parameters
1. Create a new specification where you set the learning rate and tree parameters (which you already optimized) and tune the stochastic parameters.
2. Set up a tuning grid. Use grid_latin_hypercube() again.
3. Show the performance of the best models and the estimates for the stochastic parameter values associated with each.
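
The stochastic phase might be set up as in the sketch below. It treats `mtry` and `sample_size` as the stochastic parameters, which is an assumption; `make_stoch_spec()`, `make_stoch_grid()`, and the grid size are illustrative names and values.

```r
library(tidymodels)

# Hypothetical helper. `best_learn_rate` and `best_tree` are placeholders
# for the values selected in the two earlier phases; `best_tree` is
# assumed to be a one-row data frame such as the output of select_best().
make_stoch_spec <- function(best_learn_rate, best_tree) {
  boost_tree(
    learn_rate = best_learn_rate,
    tree_depth = best_tree$tree_depth,
    min_n = best_tree$min_n,
    loss_reduction = best_tree$loss_reduction,
    mtry = tune(),
    sample_size = tune()
  ) |>
    set_engine("xgboost") |>
    set_mode("classification")
}

# mtry has no default upper bound, so it is finalized against the
# predictor columns (ideally the columns left after preprocessing).
make_stoch_grid <- function(train_predictors, size = 20) {
  grid_latin_hypercube(
    finalize(mtry(), train_predictors),
    sample_size = sample_prop(),
    size = size
  )
}

# stoch_grid <- make_stoch_grid(dplyr::select(eel_train, -Angaus))
```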
## Finalize workflow and make final prediction
1. How well did your model perform? What types of errors did it make?
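
A possible way to finalize the workflow and inspect the errors is sketched below. The helper `finalize_and_test()` and the use of `roc_auc` for selecting the best parameters are assumptions. The confusion matrix separates false presences from false absences, which speaks to the question about error types.

```r
library(tidymodels)

# Hypothetical helper: plug the best tuned values into the workflow,
# refit on the training set, and evaluate on the held-out test set.
finalize_and_test <- function(wf, tuned, data_split, metric = "roc_auc") {
  best <- select_best(tuned, metric = metric)
  final_wf <- finalize_workflow(wf, best)
  # last_fit() trains on the training portion of the split and
  # predicts on the testing portion
  final_fit <- last_fit(final_wf, split = data_split)
  preds <- collect_predictions(final_fit)
  list(
    workflow  = final_wf,
    fit       = final_fit,
    metrics   = collect_metrics(final_fit),
    confusion = conf_mat(preds, truth = Angaus, estimate = .pred_class)
  )
}

# final <- finalize_and_test(stoch_wf, stoch_tuned, eel_split)
# final$metrics
# final$confusion
```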
## Fit your model to the evaluation data and compare performance
1. Now use your final model to predict on the other data set (eel.eval.data.csv).
2. How does your model perform on this data?
3. How do your results compare to those of Elith et al.?
- Use {vip} to compare variable importance
- What do your variable importance results tell you about the distribution of this eel species?
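
The evaluation step could be sketched as follows. It assumes the evaluation data use the same column names as the training data and that `Angaus` is a factor with levels "0" and "1" in both; rename or recode first if your files differ. The helper `evaluate_on_new_data()` is hypothetical.

```r
library(tidymodels)
library(vip)

# Hypothetical helper: fit the finalized workflow on the training data,
# predict on the evaluation data, and summarize the result.
evaluate_on_new_data <- function(final_wf, train_data, eval_data) {
  fitted <- fit(final_wf, data = train_data)
  # augment() appends .pred_class, .pred_0 and .pred_1 to eval_data
  preds <- augment(fitted, new_data = eval_data)
  list(
    fitted = fitted,
    # Treat "1" (the second factor level) as the event of interest
    auc = roc_auc(preds, truth = Angaus, .pred_1, event_level = "second"),
    # Variable importance plot from the underlying xgboost fit
    importance = vip(extract_fit_parsnip(fitted))
  )
}

# eval_out <- evaluate_on_new_data(final$workflow, eel_train, eel_eval)
# eval_out$auc
# eval_out$importance
```

The AUC gives one number to set beside the performance reported by Elith et al., and the plot shows which predictors the model relied on most.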