# Summary statistics in simulations

## Summary statistics in simulations

Often, we want to calculate some summary statistics from an object, such as a the mean value of a map, and plot that as a graph, updated at some interval, so that we can see how that statistic behaves over the simulation.

What's the best way to do this in a `SpaDES` simulation?

There are two main ways to achieve this for a simulation, both of which are implemented at the module level and differ only in whether you are the author/developer of the module in question.

In the examples below, we will be using a raster of forest age and we will calculate the mean forest age annually in our simulation. As a reminder, the code below shows how to calculate the mean forest age in regular R code (i.e., not directly in a `SpaDES` simulation).

```library(ggplot2)
library(raster)

# create a dummy raster of forest age (based on the init from the randomLandscapes sample module)
inMemory <- TRUE
nx <- 100
ny <- 100
template <- raster(nrows = ny, ncols = nx, xmn = -nx/2, xmx = nx/2, ymn = -ny/2, ymx = ny/2)
speedup <- max(1, nx/5e2)

forestAge <- gaussMap(template, scale = 10, var = 0.1, speedup = speedup, inMemory = inMemory)
forestAge[] <- round(getValues(forestAge),1)*20

Plot(forestAge, new = TRUE)
cellStats(forestAge, "mean")```

Note, we are using `cellStats()` from the `raster` package, which can efficiently calculate several built-in statistics. Although custom statistics functions can be used with `cellStats()`, extra care must be taken to make these efficient with large rasters.

### Summary statistics for a module that you developed

If you are the author/developer of a module, simply do the following:

1. Add a new module parameter and output to the module metadata.

Add a new parameter, `summaryInterval`, which defines how often the summary statistics will be calculated. Remember, that this interval is based on the time unit of the module, so for a module with `timeunit = "year"` summary events can be scheduled annually by setting `summaryInterval` to `1`.

Add a new entry to the `outputObjects` data.frame in the module metadata at the top of the module code file. This is the object that will store the summary values over the course of the simulation. You can explicitly set this object's name or allow the name to be passed by the user as a global parameter in a simulation.

```defineModule(sim, list(
# other module metadata not shown
parameters = rbind(
# other parameters not shown
defineParameter("summaryInterval", "integer", 1, NA, NA, "time interval between summarize events")
),
outputObjects = data.frame(
objectName = "meanForestAge", objectClass = "numeric", other = NA_character_,
stringsAsFactors = FALSE)
))```

Define a new event type and make sure it is scheduled during 'init'. The 'summarize' event will call the new module function `moduleNameSummarize()` which we will define below. Optionally, the 'summarize' event can reschedule itself (see below) using the new module parameter `summaryInterval`.

```doEvent.moduleName <- function(sim, eventTime, eventType, debug = FALSE) {
if (eventType == "init") {
# do stuff for this event
sim <- sim\$moduleNameInit(sim)

# schedule the next events
# other event types such as plot and save omitted here
sim <- scheduleEvent(sim, start(sim), "moduleName", "summarize", .last())
} else if (eventType == "summarize" ) {
# do stuff for this event
sim <- sim\$moduleNameSummarize(sim)

# schedule the next event
sim <- scheduleEvent(sim, time(sim) + params(sim)\$moduleName\$summaryInterval,
"moduleName", "summarize", .last())
}
# other event types omitted
return(invisible(sim))
}```

Note: the exact name of the 'summarize' event isn't important, as long as it meaningfully conveys what is happening during the event. Because we want to make sure that all other events for a particular time are run before the summary statistic is calculated, we are setting the 'summarize' event priority high using `.last()` (note that 'save' and 'plot' priorities are typically set even higher using `.last()+1` or `.last()+2`).

3. Add an new event function that calculates the statistic(s) of interest.

The 'init' event should also be updated to a) initialize the object which will store the values calculated during the 'summarize' event, and b) create an index object to be used internally to keep track of which element of the statistic vector to update each interval. It's important to pre-allocate the statistic vector to improve code performance.

```## event functions

moduleNameInit <- function(sim) {
# all other pieces of this function are omitted

nSamples <- ceiling((end(sim) - start(sim)) / params(sim)\$moduleName\$summaryInterval)
sim\$meanForestAge <- rep(NA_real_, nSamples)
sim\$meanForestAgeIndex <- 1L

return(sim)
}

# other event functions omitted

moduleNameSummarize <- function(sim) {
sim\$meanForestAge[sim\$meanForestAgeIndex] <- cellStats(sim\$forestAge, "mean")
sim\$meanForestAgeIndex <- sim\$meanForestAgeIndex + 1L

# you could Plot the summary here or create a new 'summaryPlot' event
df <- data.frame(year = 1:length(sim\$meanForestAge), meanAge = sim\$meanForestAge)
q <- ggplot(df, aes(x = year, y = meanAge)) +
geom_point()
Plot(q)
}```
4. Update your module's `reqdPkgs` field in the metadata if you are using additional packages.

### Summary statistics for a module not developed by you

In many cases, you will be interested in running simulations using a module developed by someone else. Although you could modify this module per above, adding in the relevant summary outputs, a better approach is to create a new module whose sole purpose it is to summarize the outputs of the other module.

A carefully built summary module can then be reused with multiple other modules, simply by specifying which simulation objects should be summarized.

#### Demonstration apps

LandWeb Demonstration App