-
Notifications
You must be signed in to change notification settings - Fork 16
/
seegSDM_tutorial.Rmd
374 lines (263 loc) · 15 KB
/
seegSDM_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
Example workflow / tutorial for the seegSDM package
========================================================
This document runs through a typical workflow for distribution modelling using the ```seegSDM``` package.
It will includes importing and checking data, running BRT ensembles in parallel, examining the fitted models and calculating and outputting prediciton rasters. Please report any issues via the [issues tracking system](https://github.com/SEEG-Oxford/seegSDM/issues).
### Contents
* [Installing the package](#install)
* [Loading data](#load)
* [Quality control](#quality)
* [Generating pseudo-absences](#pseudo)
* [Extracting covariate data](#extract)
* [Running a single BRT model](#BRT)
* [Running a BRT ensemble in parallel](#ensemble)
* [Summarizing the BRT ensemble](#vis)
* [Outputting the results](#output)
### <a id="install"></a>Installing the package
To install seegSDM straight from github we use the ```install_github``` function in the ```devtools``` package.
If it isn't already installed, install devtools from CRAN and load it
```{r message=FALSE}
install.packages('devtools')
library(devtools)
```
```seegSDM``` currently depends on version 2.1-0.4 or later of the ```gbm``` R package.
This version isn't on CRAN yet, so you'll need to install it from github first.
```{r message=FALSE}
install_github('harrysouthworth/gbm')
```
Now we can use ```install_github``` to install ```seegSDM``` and then load the package
```{r message=FALSE}
install_github('SEEG-Oxford/seegSDM')
library(seegSDM)
```
### <a id="load"></a>Loading data
Next we load in some occurrence data. Here we'll use fake occurrence data provided with the package, though you can import your own using e.g. ```read.csv```. The occurrence object has a number of different columns, giving coodinates of observations of a fake disease (both polygon and point data) as well as information needed for model fitting.
```{r}
# load the data
data(occurrence)
# look at the first 6 lines
head(occurrence)
```
Most of the ```seegSDM``` functions use ```SpatialPoints*``` objects (from the ```sp``` package) so we convert ```occurrence``` into one of these. We can do this using the function ```occurrence2SPDF``` which makes some assumptions about whats in ```occurrence``` (see the helpfile for details).
```{r}
# convert to a SpatialPoints object
occ <- occurrence2SPDF(occurrence)
```
Next we load a bunch of raster files containing covariates for the model. Again, we use some fake data rasters which are provided with the package. You can import your own using ```raster``` from the ```raster``` package or maybe using the ````seegSDM``` function ```importRasters``` to make things a little easier.
```{r fig.width=8.5, fig.height=8}
# load the covariate rasters
data(covariates)
# see a summary
covariates
# and plot them
plot(covariates)
```
### <a id="quality"></a>Quality control
There are currently two functions to check the quality of incoming data: ```checkRasters``` which checks that rasters match up with an expected template raster and ```checkOccurrence``` which runs a number of checks on the incoming occurrence data to make sure there aren't any serious errors, and to tidy up some more minor ones.
First we'll run ```checkRasters``` to make sure the ```covariates``` lines up with a template raster that we know is correct.
```{r}
# load a template raster to check covariates against
data(template)
# run checkRasters
checkRasters(covariates, template)
```
If everything is fine the original object is returned (so here R prints a summary), otherwise an error is thrown. See ```?checkRasters``` for more details of the checks that are done.
Next we use ```checkOccurrence``` run a whole bunch of checks (```?checkOccurrence``` for details) on the incoming occurrence data. We need a few other raster layers to do this, including an evidence consensus layer and a raster brick of admin unit maps (again, these are fake data).
We load and plot the evidence consensus layer:
```{r fig.width = 6, fig.height = 6}
data(consensus)
plot(consensus)
```
and a raster brick with (fake) GAUL codes at 4 different levels.
```{r fig.width = 8.5, fig.height = 8}
# load and plot the four layers
data(admin)
plot(admin)
```
Now we can run ```checkOccurrence```
```{r}
# overwriting the occ object we defined before
occ <- checkOccurrence(occurrence, consensus, admin)
```
some of the occurrence points were in polygons with areas geater than the default allowed maximum. The maximum allowed area can be altered. Other than this all of the checks were passed. If any major problems had been detected, ```checkOccurrence``` would have thrown an error message.
### <a id="pseudo"></a>Generating pseudo-absences & extracting the data
There are various different schools of thought on how to select pseudo-absences for presence-only species distribution modelling.
The recent Dengue distribution paper by [Bhatt et al.](http://dx.doi.org/10.1038/nature12060) used a distance threshold and an evidence consensus layer to select both pseudo-absence and pseudo-presence points. The function ```extractBhatt``` takes a vector of three required parameters and applies this procedure. It also extracts the values of the covariates for all of these points. For polygon occurrence records the multiple values are efficiently extracted and summarised using the function ```extractAdmin```. For point data the ```raster``` package function ```extract``` is used.
Other pseduo-data generation methods can be carried out using the more general functions ```bgSample``` and ```bgDistance```. Here we apply ```extractBhatt``` to our data using some arbitrary parameter settings.
```{r}
# to make sure this tutorial is reproducible, we set the seed
# for the random number generator
set.seed(1)
# run extractBhatt, defining the last covariate as a factor
lis <- extractBhatt(c(2, 1, 5),
occ,
covariates,
consensus,
admin,
factor = c(FALSE, FALSE, TRUE),
return_points = TRUE)
```
With ```return_points = TRUE```, ```extractBhatt``` produces a list with three elements: a dataframe for modelling and the locations of the pseudo-presence and pseudo-absence records. We have a look at the object and plot the points.
```{r fig.width = 7, fig.height = 6}
# look what's in the list returned
names(lis)
# summarise the dataframe which we'll use for modelling
summary(lis$data)
# evidence consensus layer as a background
plot(consensus)
# add the pseudo-absences in light blue (note they aren't in high scoring
# consensus regions)
points(lis$pseudo_absence, pch = 16, col = 'light blue')
# and pseudo-presences in purple (not in the very low scoring regions)
points(lis$pseudo_presence, pch = 16, col = 'purple')
# and add the occurrence points
points(occ, pch = 16)
```
### <a id="BRT"></a>Running a single BRT model
We're now ready to run a BRT model. The ```gbm.step``` function in the ```dismo``` package (which ```seegSDM``` loads) runs a cross-validation procedure to pick the best number of trees (an important parameter in BRT) and runs the final model. ```seegSDM``` provides a wrapper function ```runBRT``` for ```gbm.step``` with a set of SEEG-preferred default settings. Only four arguments need to be provided: the dataframe, the indicies for the presence/pseudo-absence and covariate columns and a ```RasterBrick``` object to predict to.
```{r message = FALSE}
brt <- runBRT(lis$data, 4:6, 1, covariates, gbm.coords = 2:3)
```
```runBRT``` returns a list giving the model, a raster of the predicted probability of presence and data to plot the relative influence and covariate effects. The last two are used in the BRT ensemble modelling, but we can visualise the single model using functions from the ```gbm``` package.
We can plot the individual marginal effect curves for each covariate...
```{r fig.width=12, fig.height=4}
par(mfrow = c(1, nlayers(covariates)))
for (i in 1:nlayers(covariates)) {
plot(brt$model, i)
}
```
...the 2-dimensional interaction between the first two covariates...
```{r fig.width=6, fig.height=6}
plot(brt$model, 1:2)
```
...the relative influence of each covariate...
```{r fig.width=6, fig.height=6}
summary(brt$model)
```
... (internal) cross-validation statistics...
```{r}
getStats(brt)
```
...and the map of predicted habitat suitability produced by ```runBRT```.
```{r fig.width=7, fig.height=7}
plot(brt$pred, zlim = c(0, 1))
```
### <a id="ensemble"></a>Running a BRT ensemble in parallel
The plots above show one of the drawbacks of single BRT models: they fit very jerky effects of environmental covariates. We can get smoother and more realistic effect curves through fitting and averaging an ensemble of multiple BRT models. Ensembling also allows us to reduce reliance of the model on the arbitrary selection of parameters for pseudo-data generation and enables us to produce estimates of uncertainty in both the model and its predictions.
Next we'll set up an ensemble of models fitted using pseudo-data generated in different ways by altering the parameters passed to ```extractBhatt```. We first define the ranges of parameters to use then get all the different combinations using R's ```expand.grid``` function.
```{r}
# pseudo-absences per occurrence:
na <- c(1, 5, 10)
# pseudo-presences per occurrence:
np <- c(0.1, 0.05, 0.01)
# distance (in decimal degrees) within which to sample these
mu <- c(1, 3, 5)
pars <- expand.grid(na = na, np = np, mu = mu)
# each row contains a different configuration of parameters
head(pars)
# now there are 3 * 3 * 3 = 27 different combinations and models to run
nrow(pars)
```
To use all of the ```seegSDM``` functions for running ensembles we need to change these into a list, which we can do using ```lapply``` and a small function to subset ```pars```.
```{r}
sub <- function (i, pars) {
pars[i, ]
}
par_list <- lapply(1:nrow(pars), sub, pars)
```
Unfortunately running BRT ensembles can be time consuming and it's preferable to run them in parallel across multiple processors. There are a number of different R packages to help with this, but here we use the ```snowfall``` package.
Loading seegSDM has already loaded snowfall, so the first thing we need to do is set up a parallel cluster using the function ```sfInit```, which is as easy as this:
```{r warning = FALSE}
# set up a cluster of four cpus in with parallel execution.
# You may want to run a different number depending on your computer!
sfInit(cpus = 4, parallel = TRUE)
```
We need to export any functions we want to run in parallel out to each processor. Everything we'll use is in ```seegSDM```, so we export the whole package using ```sfLibrary```
```{r}
sfLibrary(seegSDM)
```
Now we're ready to run some code in parallel. We use the function ```sfLapply``` which acts like ```lapply```, except that each element of the list is processed in parallel. We run ```extractBhatt``` over these different parameter settings.
```{r}
# carry out data extraction for the ensemble in parallel
# takes around 9s on my 4-core machine
data_list <- sfLapply(par_list,
extractBhatt,
occ,
covariates,
consensus,
admin,
factor = c(FALSE, FALSE, TRUE))
```
`data_list` is a list with each element containing a different dataframe for modelling. We run the BRT ensemble by using `sfLapply` again, this time with `runBRT`.
```{r}
# fit the models of the ensemble in parallel
# takes around 45s on my 4-core machine
model_list <- sfLapply(data_list,
runBRT,
4:6,
1,
covariates,
gbm.coords = 2:3)
```
```model_lis``` now contains a list the objects returned by each ```runBRT``` run. We can get cross-validation statistics for each of these using ```getStats```. We do that now, also in parallel, then combine them into a single matrix.
```{r fig.width= 7, fig.height = 7}
# get cv statistics in parallel
stat_lis <- sfLapply(model_list, getStats)
# Now we've finished with the parallel cluster, we should shut it down
sfStop()
# convert the list into a matrix using the do.call function
stats <- do.call('rbind', stat_lis)
# look at it
head(stats)
# and produce a boxplot of a few imnportant statistics
boxplot(stats[, 3:7], col = 'grey', ylim = c(0, 1))
```
We now have a list of model outputs, each of which contains the fitted model, predictions and information for plotting. We can pull out individual model runs and plot the predictions.
```{r fig.width = 10, fig.height = 5.5}
par(mfrow = c(1, 2))
plot(model_list[[1]]$pred, main = 'run 1', zlim = c(0, 1))
plot(model_list[[27]]$pred, main = 'run 27', zlim = c(0, 1))
```
### <a id="vis"></a>Summarizing the BRT ensemble
There are three helper functions to help us summarize the BRT ensemble: `getRelInf`, `getEffectPlots` and `combinePreds`. `getRelInf` combines the models to get summaries of the relative influence of the covariates across all the models. It returns a matrix of means and quantiles, and optionally produces a boxplot.
```{r fig.width=5, fig.height = 5}
relinf <- getRelInf(model_list, plot = TRUE)
relinf
```
`getEffectPlots` performs a similar operation for the effect plots, optionally plotting mean effects with uncertainty intervals.
```{r fig.width=12, fig.height = 4}
par(mfrow = c(1, 3))
effect <- getEffectPlots(model_list, plot = TRUE)
```
`combinePreds` combines the prediction maps (on the probability scale) from multiple models and returns rasters giving the mean, median and quantiles of the ensemble predictions. unlike the previous two functions, `combinePreds` needs a `RasterBrick` or `RasterStack` object with each layers giving a single prediction. So we need to create one of these before we can use `combinePreds`. Note that we can also run `combinePreds` in parallel to save some time if the rasters are particularly large.
```{r fig.width = 9, fig.height = 8}
# lapply to extract the predictions into a list
preds <- lapply(model_list, function(x) x$pred)
# coerce the list into a rasterbrick
preds <- brick(preds)
# now we can run combinePreds
preds <- combinePreds(preds)
# plot the resulting maps
plot(preds, zlim = c(0, 1))
```
We can create a simple map of prediction uncertainty by subtracting the lower from the upper quantile.
```{r fig.width = 10, fig.height = 5.5}
# calculate uncertainty
preds$uncertainty <- preds[[4]] - preds[[3]]
# plot mean and uncertainty
par(mfrow = c(1, 2))
# plot mean
plot(preds$mean,
zlim = c(0, 1),
main = 'mean')
# and uncertainty
plot(preds$uncertainty,
col = topo.colors(100),
main = 'uncertainty')
```
### <a id="output"></a>Outputting the results
Outputting the prediction maps can be done with the function `writeRaster` in the `raster` package. See `?writeFormats` for the raster formats that can be written to, and how to specify them.
```{r eval = FALSE}
# write the mean prediction and uncertainty rasters as Geo-Tiffs
writeRaster(preds$mean, 'c:/prediction_map', format = 'GTiff')
writeRaster(preds$uncertainty, 'c:/uncertainty_map', format = 'GTiff')
```