---
title: "Application with palaeodata"
output: rmarkdown::html_vignette
#output: rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{Application with palaeodata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# set options to limit threads used by imported libraries
options(cores=2)
options(mc.cores=2)
# xgboost uses data.table
data.table::setDTthreads(1)
```
# SDMs with tidymodels for palaeo data
In this article, we show how a Species Distribution Model can be fitted with
`tidysdm` on time-scattered (i.e. palaeontological, archaeozoological, or
archaeological) data, with samples covering different time periods. We recommend
users first read the "tidysdm overview" article, which introduces a number of
functions and concepts used in the present article.
We first load `tidysdm`:
```{r}
library(tidysdm)
```
# Preparing your data
We start by loading a set of calibrated radiocarbon dates for horses, spanning
from 22k to 8k years ago.
```{r load_presences}
data(horses)
horses
```
We convert our dataset into an `sf` data.frame so that we can easily plot it
(here `tidyterra` shines):
```{r}
library(sf)
horses <- st_as_sf(horses, coords = c("longitude", "latitude"))
st_crs(horses) <- 4326
```
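Equivalently, the CRS can be set in a single step when converting, by passing
the `crs` argument of `sf::st_as_sf()`:

```r
library(sf)
# one-step conversion: create the sf object and assign WGS84 (EPSG:4326) at once
horses <- st_as_sf(horses, coords = c("longitude", "latitude"), crs = 4326)
```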
As a background to our presences, we will use the land mask for the present,
taken from `pastclim`, and cut to cover only Europe:
```{r echo=FALSE, results='hide'}
library(pastclim)
set_data_path(on_CRAN = TRUE)
```
```{r}
library(pastclim)
land_mask <- pastclim::get_land_mask(time_bp = 0, dataset = "Example")
europe_poly <- vect(region_outline$Europe)
crs(europe_poly) <- "lonlat"
land_mask <- crop(land_mask, europe_poly)
land_mask <- mask(land_mask, europe_poly)
```
And use `tidyterra` to plot:
```{r cast_to_sf, fig.width=6, fig.height=4}
library(tidyterra)
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = time_bp))
```
We now thin our presences, so that retained locations are at least 100 km and
2000 years apart.
```{r thin_by_dist}
set.seed(123)
horses <- thin_by_dist_time(horses,
dist_min = km2m(100),
interval_min = y2d(2000),
time_col = "time_bp",
lubridate_fun = pastclim::ybp2date
)
nrow(horses)
```
And see what we have left:
```{r plot_thinned, fig.width=6, fig.height=4}
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = time_bp))
```
We now need a time series of palaeoclimate reconstructions. In this vignette,
we will use the example dataset from `pastclim`. This dataset only has
reconstructions every 5k years for the past 20k years, at 1 degree resolution,
with 3 bioclimatic variables. It suffices for illustrative purposes, but
we recommend downloading higher-quality datasets with `pastclim` for
real analyses. As with the land mask, we will crop the reconstructions to cover
only Europe:
```{r load_climate}
library(pastclim)
climate_vars <- c("bio01", "bio10", "bio12")
climate_full <- pastclim::region_series(
bio_variables = climate_vars,
data = "Example",
crop = region_outline$Europe
)
```
Now we thin the observations to keep only one per raster cell (an equal-area
projection would be preferable here), and remove any locations outside the
desired area:
```{r thin_by_cell}
set.seed(123)
horses <- thin_by_cell_time(horses,
raster = climate_full,
time_col = "time_bp",
lubridate_fun = pastclim::ybp2date
)
nrow(horses)
```
Let's see what we have left of our points:
```{r, fig.width=6, fig.height=4}
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = time_bp))
```
Now we sample pseudo-absences (we will constrain them to be at least 70 km away
from any presences), selecting three times the number of presences:
```{r}
set.seed(123)
horses <- sample_pseudoabs_time(horses,
n_per_presence = 3,
raster = climate_full,
time_col = "time_bp",
lubridate_fun = pastclim::ybp2date,
method = c("dist_min", km2m(70))
)
```
Let's see our presences and absences:
```{r, fig.width=6, fig.height=4}
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = class))
```
Now let's get the climate for these locations. `pastclim` requires a data frame
with two columns of coordinates and a column of times in years before present
(where negative values represent time in the past). We manipulate the `sf`
object accordingly:
```{r climate_for_locations}
horses_df <- horses %>%
dplyr::bind_cols(sf::st_coordinates(horses)) %>%
mutate(time_bp = date2ybp(time_step)) %>%
as.data.frame() %>%
select(-geometry)
# get climate
horses_df <- location_slice_from_region_series(horses_df,
region_series = climate_full
)
# add the climate reconstructions to the sf object, and remove the time_step
# as we don't need it for modelling
horses <- horses %>%
bind_cols(horses_df[, climate_vars]) %>%
select(-time_step)
```
# Fit the model by cross-validation
Next, we need to set up a `recipe` to define how to handle our dataset. We don't
want to transform our data, so we just
need to define the formula (*class* is the outcome,
all other variables are predictors; note that, for `sf` objects, `geometry` is
automatically ignored as a predictor):
```{r recipe}
horses_rec <- recipe(horses, formula = class ~ .)
horses_rec
```
We can quickly check that we have the variables that we want with:
```{r}
horses_rec$var_info
```
We now build a `workflow_set` of different models, defining which
hyperparameters we want to tune. We will use *glm*, *gam*, *random forest* and
*boosted trees* as
our models; of these, only *random forest* and *boosted trees* have tunable hyperparameters. For the most
commonly used models, `tidysdm` automatically chooses the most important
parameters, but it is possible to fully customise model specifications.
```{r workflow_set}
horses_models <-
# create the workflow_set
workflow_set(
preproc = list(default = horses_rec),
models = list(
# the standard glm specs (no params to tune)
glm = sdm_spec_glm(),
# the standard gam specs (no params to tune)
gam = sdm_spec_gam(),
# rf specs with tuning
rf = sdm_spec_rf(),
# boosted tree model (gbm) specs with tuning
gbm = sdm_spec_boost_tree()
),
# make all combinations of preproc and models,
cross = TRUE
) %>%
# set formula for gams
update_workflow_model("default_gam",
spec = sdm_spec_gam(),
formula = gam_formula(horses_rec)
) %>%
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
```
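As mentioned above, model specifications can be customised. As a sketch,
the `sdm_spec_*()` functions take a `tune` argument controlling which
hyperparameters are flagged for tuning (see `?sdm_spec_rf` for the exact
options accepted by your version of `tidysdm`):

```r
library(tidysdm)
# sketch: tune all available hyperparameters of the random forest,
# rather than only the ones tidysdm selects by default
rf_all_tuned <- sdm_spec_rf(tune = "all")
# or use the engine defaults with no tuning at all
rf_defaults <- sdm_spec_rf(tune = "none")
```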
Note that *gams* are unusual, in that we need to specify a formula defining
which variables will be fitted with smooths. By default, `gam_formula()` fits a
smooth to every continuous predictor, but a custom formula can be provided instead.
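For instance, a custom formula might smooth only some of the predictors. The
snippet below is purely illustrative (keeping `bio12` as a linear term is an
arbitrary choice); such a formula would be passed via the `formula` argument of
`update_workflow_model()`, as in the chunk above:

```r
# illustrative custom gam formula: smooth bio01 and bio10, keep bio12 linear
custom_gam_formula <- class ~ s(bio01) + s(bio10) + bio12
```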
We now want to set up a spatial block cross-validation scheme to tune and assess
our models:
```{r training_cv, fig.width=6, fig.height=4}
library(tidysdm)
set.seed(1005)
horses_cv <- spatial_block_cv(horses, v = 5)
autoplot(horses_cv)
```
We can now use the block CV folds to
tune and assess the models:
```{r tune_grid}
set.seed(123)
horses_models <-
horses_models %>%
workflow_map("tune_grid",
resamples = horses_cv, grid = 5,
metrics = sdm_metric_set(), verbose = TRUE
)
```
Note that `workflow_set` correctly detects that we have no tuning parameters for
*glm* and *gam*. We can have a look at the performance of our models with:
```{r, fig.width=6, fig.height=4}
autoplot(horses_models)
```
Now let's create an ensemble, selecting the best set of parameters for each model
(this is only relevant for the random forest and boosted trees, as there were no
hyperparameters to tune for the glm and gam). We will use the Boyce continuous index
as our metric to choose the best random forest and boosted tree. When members are added
to an ensemble, they are automatically fitted to the full training dataset, and so are
ready to make predictions.
```{r}
horses_ensemble <- simple_ensemble() %>%
add_member(horses_models, metric = "boyce_cont")
```
And visualise it:
```{r, fig.width=6, fig.height=4}
autoplot(horses_ensemble)
```
# Projecting to other times
We can now make predictions with this ensemble (using the default option of taking
the mean of the predictions from each model) for the Last Glacial Maximum (LGM;
we use the 20,000 years ago time slice, the closest one available in the example dataset).
```{r get_lgm}
climate_lgm <- pastclim::region_slice(
time_bp = -20000,
bio_variables = climate_vars,
data = "Example",
crop = region_outline$Europe
)
```
And predict using the ensemble:
```{r plot_lgm, fig.width=6, fig.height=4}
prediction_lgm <- predict_raster(horses_ensemble, climate_lgm)
ggplot() +
geom_spatraster(data = prediction_lgm, aes(fill = mean)) +
scale_fill_terrain_c()
```
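The ensemble does not have to be summarised with the mean: `predict_raster()`
forwards additional arguments to the ensemble's `predict()` method, which also
accepts other summary functions (this sketch assumes your version of `tidysdm`
supports the `fun` argument; see `?predict.simple_ensemble`):

```r
# sketch: summarise the ensemble predictions with the median instead of the mean
prediction_lgm_median <- predict_raster(horses_ensemble, climate_lgm,
  fun = "median"
)
```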