---
title: "Application with palaeodata"
output: rmarkdown::html_vignette
#output: rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{Application with palaeodata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# set options to limit threads used by imported libraries
options(cores=2)
options(mc.cores=2)
# xgboost uses data.table
data.table::setDTthreads(1)
```
# SDMs with tidymodels for palaeo data
In this article, we show how a Species Distribution Model can be fitted with
`tidysdm` on time-scattered (i.e. palaeontological, archaeozoological, or
archaeological) data, with samples covering different time periods. We recommend
users first read the "tidysdm overview" article, which introduces a number of
functions and concepts used in the present article.
We first load `tidysdm`:
```{r}
library(tidysdm)
```
# Preparing your data
We start by loading a set of calibrated radiocarbon dates for horses, spanning
from 22k to 8k years ago.
```{r load_presences}
data(horses)
horses
```
We convert our dataset into an `sf` data.frame so that we can easily plot it
(here `tidyterra` shines):
```{r}
library(sf)
horses <- st_as_sf(horses, coords = c("longitude", "latitude"))
st_crs(horses) <- 4326
```
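Equivalently, the CRS can be set in a single step when converting, by passing
the `crs` argument of `sf::st_as_sf()`:

```r
library(sf)
# one-step conversion: create the sf object and assign WGS84 (EPSG:4326) at once
horses <- st_as_sf(horses, coords = c("longitude", "latitude"), crs = 4326)
```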
As a background to our presences, we will use the land mask for the present,
taken from `pastclim`, and cut to cover only Europe:
```{r echo=FALSE, results='hide'}
library(pastclim)
set_data_path(on_CRAN = TRUE)
```
```{r}
library(pastclim)
land_mask <- pastclim::get_land_mask(time_bp = 0, dataset = "Example")
europe_poly <- vect(region_outline$Europe)
crs(europe_poly) <- "lonlat"
land_mask <- crop(land_mask, europe_poly)
land_mask <- mask(land_mask, europe_poly)
```
And use `tidyterra` to plot:
```{r cast_to_sf, fig.width=6, fig.height=4}
library(tidyterra)
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = time_bp))
```
We now thin our presences, so that retained locations are at least 100 km and
2000 years apart.
```{r thin_by_dist}
set.seed(123)
horses <- thin_by_dist_time(horses,
dist_min = km2m(100),
interval_min = y2d(2000),
time_col = "time_bp",
lubridate_fun = pastclim::ybp2date
)
nrow(horses)
```
And see what we have left:
```{r plot_thinned, fig.width=6, fig.height=4}
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = time_bp))
```
We now need a time series of palaeoclimate reconstructions. In this vignette,
we will use the example dataset from `pastclim`. This dataset only has
reconstructions every 5k years for the past 20k years, at 1 degree resolution,
with 3 bioclimatic variables. It suffices for illustrative purposes, but
we recommend downloading higher-quality datasets with `pastclim` for
real analyses. As with the land mask, we will crop the reconstructions to cover
only Europe:
```{r load_climate}
library(pastclim)
climate_vars <- c("bio01", "bio10", "bio12")
climate_full <- pastclim::region_series(
bio_variables = climate_vars,
data = "Example",
crop = region_outline$Europe
)
```
Now we thin the observations to keep only one per raster cell (an equal-area
projection would be preferable here), and remove any locations outside the
desired area:
```{r thin_by_cell}
set.seed(123)
horses <- thin_by_cell_time(horses,
raster = climate_full,
time_col = "time_bp",
lubridate_fun = pastclim::ybp2date
)
nrow(horses)
```
Let's see what we have left of our points:
```{r, fig.width=6, fig.height=4}
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = time_bp))
```
Now we sample pseudo-absences (we will constrain them to be at least 70 km away
from any presences), selecting three times the number of presences:
```{r}
set.seed(123)
horses <- sample_pseudoabs_time(horses,
n_per_presence = 3,
raster = climate_full,
time_col = "time_bp",
lubridate_fun = pastclim::ybp2date,
method = c("dist_min", km2m(70))
)
```
Let's see our presences and absences:
```{r, fig.width=6, fig.height=4}
ggplot() +
geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
scale_fill_terrain_c() +
geom_sf(data = horses, aes(col = class))
```
Now let's get the climate for these locations. `pastclim` requires a data frame
with two columns of coordinates and a column of times in years before present
(where negative values represent time in the past). We manipulate the `sf`
object accordingly:
```{r climate_for_locations}
horses_df <- horses %>%
dplyr::bind_cols(sf::st_coordinates(horses)) %>%
mutate(time_bp = date2ybp(time_step)) %>%
as.data.frame() %>%
select(-geometry)
# get climate
horses_df <- location_slice_from_region_series(horses_df,
region_series = climate_full
)
# add the climate reconstructions to the sf object, and remove the time_step
# as we don't need it for modelling
horses <- horses %>%
bind_cols(horses_df[, climate_vars]) %>%
select(-time_step)
```
# Fit the model by cross-validation
Next, we need to set up a `recipe` to define how to handle our dataset. We don't
want to transform our data, so we just
need to define the formula (*class* is the outcome,
all other variables are predictors; note that, for `sf` objects, `geometry` is
automatically ignored as a predictor):
```{r recipe}
horses_rec <- recipe(horses, formula = class ~ .)
horses_rec
```
We can quickly check that we have the variables that we want with:
```{r}
horses_rec$var_info
```
We now build a `workflow_set` of different models, defining which
hyperparameters we want to tune. We will use *glm*, *gam*, *random forest* and
*boosted trees* as
our models; of these, only *random forest* and *boosted trees* have tunable hyperparameters. For the most
commonly used models, `tidysdm` automatically chooses the most important
parameters, but it is possible to fully customise model specifications.
```{r workflow_set}
horses_models <-
# create the workflow_set
workflow_set(
preproc = list(default = horses_rec),
models = list(
# the standard glm specs (no params to tune)
glm = sdm_spec_glm(),
# the standard gam specs (no params to tune)
gam = sdm_spec_gam(),
# rf specs with tuning
rf = sdm_spec_rf(),
# boosted tree model (gbm) specs with tuning
gbm = sdm_spec_boost_tree()
),
# make all combinations of preproc and models,
cross = TRUE
) %>%
# set formula for gams
update_workflow_model("default_gam",
spec = sdm_spec_gam(),
formula = gam_formula(horses_rec)
) %>%
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
```
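As mentioned above, model specifications can be customised. As a sketch,
the `sdm_spec_*()` functions take a `tune` argument controlling which
hyperparameters are flagged for tuning (see `?sdm_spec_rf` for the exact
options accepted by your version of `tidysdm`):

```r
library(tidysdm)
# sketch: tune all available hyperparameters of the random forest,
# rather than only the ones tidysdm selects by default
rf_all_tuned <- sdm_spec_rf(tune = "all")
# or use the engine defaults with no tuning at all
rf_defaults <- sdm_spec_rf(tune = "none")
```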
Note that *gams* are unusual, in that we need to specify a formula defining
which variables will be fitted with smooths. By default, `gam_formula()` fits a
smooth to every continuous predictor, but a custom formula can be provided instead.
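For instance, a custom formula might smooth only some of the predictors. The
snippet below is purely illustrative (keeping `bio12` as a linear term is an
arbitrary choice); such a formula would be passed via the `formula` argument of
`update_workflow_model()`, as in the chunk above:

```r
# illustrative custom gam formula: smooth bio01 and bio10, keep bio12 linear
custom_gam_formula <- class ~ s(bio01) + s(bio10) + bio12
```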
We now want to set up a spatial block cross-validation scheme to tune and assess
our models:
```{r training_cv, fig.width=6, fig.height=4}
library(tidysdm)
set.seed(1005)
horses_cv <- spatial_block_cv(horses, v = 5)
autoplot(horses_cv)
```
We can now use the block CV folds to
tune and assess the models:
```{r tune_grid}
set.seed(123)
horses_models <-
horses_models %>%
workflow_map("tune_grid",
resamples = horses_cv, grid = 5,
metrics = sdm_metric_set(), verbose = TRUE
)
```
Note that `workflow_set` correctly detects that we have no tuning parameters for
*glm* and *gam*. We can have a look at the performance of our models with:
```{r, fig.width=6, fig.height=4}
autoplot(horses_models)
```
Now let's create an ensemble, selecting the best set of parameters for each model
(this is only relevant for the random forest and boosted trees, as there were no
hyperparameters to tune for the glm and gam). We will use the Boyce continuous index
as our metric to choose the best random forest and boosted tree. When members are added
to an ensemble, they are automatically fitted to the full training dataset, and so are
ready to make predictions.
```{r}
horses_ensemble <- simple_ensemble() %>%
add_member(horses_models, metric = "boyce_cont")
```
And visualise it:
```{r, fig.width=6, fig.height=4}
autoplot(horses_ensemble)
```
# Projecting to other times
We can now make predictions with this ensemble (using the default option of taking
the mean of the predictions from each model) for the Last Glacial Maximum (LGM;
we use the 20,000 years ago time slice, the closest one available in the example dataset).
```{r get_lgm}
climate_lgm <- pastclim::region_slice(
time_bp = -20000,
bio_variables = climate_vars,
data = "Example",
crop = region_outline$Europe
)
```
And predict using the ensemble:
```{r plot_lgm, fig.width=6, fig.height=4}
prediction_lgm <- predict_raster(horses_ensemble, climate_lgm)
ggplot() +
geom_spatraster(data = prediction_lgm, aes(fill = mean)) +
scale_fill_terrain_c()
```
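The ensemble does not have to be summarised with the mean: `predict_raster()`
forwards additional arguments to the ensemble's `predict()` method, which also
accepts other summary functions (this sketch assumes your version of `tidysdm`
supports the `fun` argument; see `?predict.simple_ensemble`):

```r
# sketch: summarise the ensemble predictions with the median instead of the mean
prediction_lgm_median <- predict_raster(horses_ensemble, climate_lgm,
  fun = "median"
)
```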