Fetching contributors…
Cannot retrieve contributors at this time
293 lines (205 sloc) 15 KB
title: "03 Using `Cache` with `SpaDES`"
- "Eliot J. B. McIntire"
date: '`r strftime(Sys.Date(), "%B %d %Y")`'
fig_width: 7
number_sections: yes
self_contained: yes
toc: yes
vignette: >
%\VignetteIndexEntry{03 Caching SpaDES simulations}
%\VignetteDepends{igraph, raster, SpaDES}
As part of a reproducible work flow, caching of various function calls are a critical component.
Down the road, it is likely that an entire work flow from raw data to publication, decision support, report writing, presentation building etc., could be built and be reproducible anywhere, on demand.
The `reproducible::Cache` function is built to work with any R function.
However, it becomes very powerful in a `SpaDES` context because we can build large, powerful applications that are transparent and tied to the raw data that may be *many* conceptual steps upstream in the workflow.
To do this, we have built several customizations within the `SpaDES` package.
Important to this is dealing correctly with the `simList`, which is an object that has slot that is an environment. But more important are the various tools that can be used at higher levels, *i.e.*, not just for "standard" functions.
# Caching as part of `SpaDES`
Some of the details of the `simList`-specific features of this `Cache` function include:
- The function converts all elements that have an environment as part of their attributes into a format that has no unique environment attribute, using `format` if a function, and `as.list` in the case of the `simList` environment.
- When used within SpaDES modules, `Cache` (capital C) does not require that the argument `cacheRepo` be specified.
If called from inside a SpaDES module, `Cache` will use the `cacheRepo` argument from a call to `cachePath(sim)`, taking the `sim` from the call stack.
Similarly, if no `cacheRepo` argument is specified, then it will use `getOption("spades.cachePath")`, which will, by default, be a temporary location with no persistence between R sessions!
To persist between sessions, use `SpaDES::setPaths()` every session.
In a `SpaDES` context, there are several levels of caching that can be used as part of a reproducible workflow.
Each level can be used to a modeler's advantage; and, all can be -- and are often -- used concurrently.
## At the `spades` level
And entire call to `spades` or `experiment` can be cached.
This will have the effect of eliminating any stochasticity in the model as the output will simply be the cached version of the `simList`.
This is likely most useful in situations where reproducibility is more important than "new" stochasticity (*e.g.*, building decision support systems, apps, final version of a manuscript).
```{r examples, echo=TRUE, message=FALSE}
library(igraph) # for %>%
mySim <- simInit(
times = list(start = 0.0, end = 5.0),
params = list(
.globals = list(stackName = "landscape", burnStats = "testStats"),
randomLandscapes = list(.plotInitialTime = NA),
fireSpread = list(.plotInitialTime = NA)
modules = list("randomLandscapes", "fireSpread"),
paths = list(modulePath = system.file("sampleModules", package = "SpaDES.core")))
This functionality can be achieved within a `spades` call.
```{r spades}
# compare caching ... run once to create cache
system.time(outSim <- spades(Copy(mySim), cache = TRUE, notOlderThan = Sys.time()))
Note that if there were any visualizations (here we turned them off with `.plotInitialTime = NA` above) they will happen the first time through, but not the cached times.
```{r spades-cached}
# vastly faster 2nd time
system.time(outSimCached <- spades(Copy(mySim), cache = TRUE))
all.equal(outSim, outSimCached)
## At the `experiment` level
This functionality can be achieved within an experiment call.
This can be done 2 ways, either: "internally" through the cache argument, which will cache *each spades call*; or, "externally" which will *cache the entire experiment*.
If there are lots of `spades` calls, then the former will be slow as the `simList` will be digested once per `spades` call.
### Using cache argument
```{r experiment-cache}
system.time(sims1 <- experiment(mySim, replicates = 2, cache = TRUE))
# internal -- second time faster
system.time(sims2 <- experiment(mySim, replicates = 2, cache = TRUE))
all.equal(sims1, sims2)
### Wrapping `experiment` with `Cache`
Here, the `simList` (and other arguments to experiment) is hashed once, and if it is found to be the same as previous, then the returned list of `simList` objects is recovered.
This means that even a very large experiment, with many replicates and combinations of parameters and modules can be recovered very quickly.
Here we show that you can output objects to disk, so the list of `simList` objects doesn't get too big.
Then, when we recover it in the Cached version, all the files are still there, the list of `simList` objects is small, so very fast to recover.
```{r Cache-experiment}
# External
outputs(mySim) <- data.frame(objectName = "landscape")
system.time(sims3 <- Cache(experiment, mySim, replicates = 3, .plotInitialTime = NA,
clearSimEnv = TRUE))
The second time is way faster. We see the output files in the same location.
```{r Cache-experiment-2}
system.time(sims4 <- Cache(experiment, mySim, replicates = 3, .plotInitialTime = NA,
clearSimEnv = TRUE))
all.equal(sims3, sims4)
dir(outputPath(mySim), recursive = TRUE)
Notice that speed up can be enormous; in this case ~100 times.
## Module-level caching
If the parameter `.useCache` in the module's metadata is set to `TRUE`, then *every* event in the module will be cached.
That means that every time that module is called from within a spades or experiment call, `Cache` will be called.
Only the objects inside the `simList` that correspond to the `inputObjects` or the `outputObjects` from the module metadata will be assessed for caching.
For general use, module-level caching would be mostly useful for modules that have no stochasticity, such as data-preparation modules, GIS modules etc.
In this example, we will use the cache on the randomLandscapes module.
This means that each subsequent call to spades will result in identical outputs from the `randomLandscapes` module (only!).
This would be useful when only one random landscape is needed simply for trying something out, or putting into production code (*e.g.*, publication, decision support, etc.).
```{r module-level, echo=TRUE}
# Module-level
params(mySim)$randomLandscapes$.useCache <- TRUE
system.time(randomSim <- spades(Copy(mySim), .plotInitialTime = NA,
notOlderThan = Sys.time(), debug = TRUE))
# vastly faster the second time
system.time(randomSimCached <- spades(Copy(mySim), .plotInitialTime = NA,
debug = TRUE))
Test that only layers produced in `randomLandscapes` are identical, not `fireSpread`.
```{r test-module-level}
layers <- list("DEM", "forestAge", "habitatQuality", "percentPine", "Fires")
same <- lapply(layers, function(l) identical(randomSim$landscape[[l]],
names(same) <- layers
print(same) # Fires is not same because all non-init events in fireSpread are not cached
## Event-level caching
If the parameter `.useCache` in the module's metadata is set to a *character or character vector*, then that or those event(s), identified by their name, will be cached.
That means that every time the event is called from within a `spades` or `experiment` call, `Cache` will be called.
Only the objects inside the `simList` that correspond to the `inputObjects` or the `outputObjects` as defined in the module metadata will be assessed for caching inputs or outputs, respectively.
The fact that all and only the named `inputObjects` and `outputObjects` are cached and returned may be inefficient (*i.e.*, it may cache more objects than are necessary) for individual events.
Similar to module-level caching, event-level caching would be mostly useful for events that have no stochasticity, such as data-preparation events, GIS events etc.
Here, we don't change the module-level caching for randomLandscapes, but we add to it a cache for only the "init" event for `fireSpread`.
```{r event-level, echo=TRUE}
params(mySim)$fireSpread$.useCache <- "init"
system.time(randomSim <- spades(Copy(mySim), .plotInitialTime = NA,
notOlderThan = Sys.time(), debug = TRUE))
# vastly faster the second time
system.time(randomSimCached <- spades(Copy(mySim), .plotInitialTime = NA,
debug = TRUE))
## Function-level caching
Any function can be cached using: `Cache(FUN = functionName, ...)`.
This will be a slight change to a function call, such as:
`projectRaster(raster, crs = crs(newRaster))`
`Cache(projectRaster, raster, crs = crs(newRaster))`.
```{r function-level, echo=TRUE}
ras <- raster(extent(0, 1e3, 0, 1e3), res = 1)
system.time(map <- Cache(gaussMap, ras, cacheRepo = cachePath(mySim),
notOlderThan = Sys.time()))
# vastly faster the second time
system.time(mapCached <- Cache(gaussMap, ras, cacheRepo = cachePath(mySim)))
all.equal(map, mapCached)
## Working with the Cache manually
Since the cache is simply an `archivist` repository, all `archivist` functions will work as is.
In addition, there are several helpers in the `reproducible` package, including `showCache`, `keepCache` and `clearCache` that may be useful.
Also, one can access cached items manually (rather than simply rerunning the same `Cache` function again).
```{r manual-cache}
# examine a part of the Cache
showCache(mySim)[tagKey == "function", -c("artifact")]
if (requireNamespace("archivist")) {
# get the RasterLayer that was produced with the gaussMap function:
map <- unique(showCache(mySim, userTags = "gaussMap")$artifact) %>%
archivist::loadFromLocalRepo(repoDir = cachePath(mySim), value = TRUE)
# Reproducible Workflow
In general, we feel that a liberal use of `Cache` will make a re-useable and reproducible work flow.
`shiny` apps can be made, taking advantage of `Cache`.
Indeed, much of the difficulty in managing data sets and saving them for future use, can be accommodated by caching.
## Nested Caching
- Imagine we have large model, with many modules, with replication and alternative module collections (e.g., alternative fire models)
- To run this would have a nested structure with the following functions:
```{r, eval=FALSE, echo=TRUE}
simInit --> many .inputObjects calls
experiment --> many spades calls --> many module calls --> many event calls --> many function calls
Lets say we start to introduce caching to this structure. We start from the "inner" most functions that we could imaging Caching would be useful. Lets say there are some GIS operations, like `raster::projectRaster`, which operates on an input shapefile. We can Cache the `projectRaster` call to make this much faster, since it will always be the same result for a given input raster.
If we look back at our structure above, we see that we still have LOTS of places that are not Cached. That means that the experiment call will still spawn many spades calls, which will still spawn many module calls, and many event calls, just to get to the one `Cache(projectRaster)` call which is Cached. This function will likely be called hundreds of times (because `experiment` runs the `spades` call 100 times due to replication). This is good, but **Cache does take some time**. So, even if `Cache(projectRaster)` takes only 0.02 seconds, calling it hundreds of times means maybe 4 seconds. If we are doing this for many functions, then this will be too slow.
We can start putting `Cache` all up the sequence of calls. Unfortunately, the way we use Cache at each of these levels is a bit different, so we need a slightly different approach for each.
#### Cache the `experiment` call
This will assess the `simList` (the objects, times, modules, etc.) and if they are all the same, it will return the final list of `simList`s that came from the first `experiment` call. NOTE: because this can be large, it is likely that you want `clearSimEnv = TRUE`, and have all objects that are needed after the experiment call saved to disk. Any stochasticity/randomness inside modules will be frozen. This is likely ok if the objective is to show results in a web app (via shiny or otherwise) or another visualization about the experiment outputs, e.g., comparing treatments, once sufficient stochasticity has been achieved.
`mySimListOut <- Cache(experiment, mySim, clearSimEnv = TRUE)`
#### Cache the `spades` calls inside `experiment`
`experiment(cache = TRUE)`
This will cache each of the `spades` calls inside the `experiment` call. That means that there are as many cache events as there are replicates and experimental treatments, which, again could be a lot. Like caching the `experiment` call, stochasticity/randomness will be frozen. Note, one good use of this is when you are making iterative, incremental replication, e.g.,
`mySimOut <- experiment(mySim, replicates = 5, cache = TRUE)`
You decide after waiting 10 minutes for it to finish, that you need more replication. Rather than start from zero replicates, you can just pick up where you left off:
`mySimOut <- experiment(mySim, replicates = 10, cache = TRUE)`
This will only add 5 more replicates.
#### Cache a whole module
Pass `.useCache = TRUE` as a parameter to the module, during the `simInit`
Some modules are inherently non-random, such as GIS modules, or parameter fitting statistical modules. We expect these to be identical results each time, so we can safely cache the entire module.
```{r, eval=FALSE, echo=TRUE}
parameters = list(
FireModule = list(.useCache = TRUE)
mySim <- simInit(..., params = parameters)
mySimOut <- spades(mySim)
The messaging should indicate the caching is happening on every event in that module.
***Note: This option REQUIRES that the metadata in inputs and outputs be exactly correct, i.e., all inputObjects and outputObjects must be correctly identified and listed in the defineModule metadata***
***If the module is cached, and there are errors when it is run, it almost is garanteed to be a problem with the `inputObjects` and `outputObjects` incorrectly specified.***
#### Cache individual functions
`Cache(<functionName>, <other arguments>)`
This will allow fine scale control of individual function calls.
## Data-to-decisions
Once nested Caching is used all the way up to the `experiment` level and even further up (e.g., if there is a shiny module), then even very complex models can be put into a complete workflow.
The current vision for `SpaDES` is that it will allow this type of "data to decisions" complete workflow that allows for deep, robust models, across disciplines, with easily accessible front ends, that are quick and responsive to users, yet can handle data changes, module changes, etc.