vignettes/Against_SPDs.Rmd

---
title: "Why Not to Use SPDs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Why Not to Use SPDs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev='png',
  fig.width=7, 
  fig.height=5.5 # ,
  # dev.args=list(antialias = "none")
)
```

```{r setup}
library(carbondate)
```

## Summed Probability Distributions
Currently, the most commonly-used approach to summarise calendar age information from multiple ^14^C determinations is via summed probability distributions (SPD). These are **not statistically valid** estimators of the calendar age of a potential future sample. It is our view that they should not be used in any dates-as-data approach to provide a population proxy.  

For an SPD, the posterior calendar age density of each object is first calculated independently) from the others as using the function above. These individual densities are then summed/averaged to give an SPD estimate. The independence assumed in the separate calibration of each sample, followed by subsequent summarisation, generates a contradiction. 

Additionally, the SPDs approach fundamentally does not model the samples in the calendar age domain. Consequently, it is also not able to deal with inversions in the calibration curve where there are multiple disjoint calendar periods which are consistent with the observed determinations; or with plateau periods. 

The SPD function is **ONLY** provided here as a comparison with the other routines. To calculate the SPD for a set of radiocarbon determinations (here we use the example dataset `armit` [@armit2014]) see the example below, where we also plot the results.

```{r find_spd, out.width="100%"}
spd <- FindSummedProbabilityDistribution(
  calendar_age_range_BP = c(1000, 4500), 
  rc_determinations = armit$c14_age, 
  rc_sigmas = armit$c14_sig, 
  F14C_inputs = FALSE, 
  calibration_curve = intcal20,
  plot_output = TRUE)
```

**Note:** The summary functions for plotting the predictive joint calendar age density using the rigorous Bayesian non-parametric alternative to SPDs (described in the vignette [Non-parametric Summed Density](Non-parametric-summed-density.html)) can also optionally plot the SPD without having to calculate it separately first.

## Illustration of why not to use SPDs

### Fitting to two Normals

Consider the simulated `two_normals` dataset. We know that the calendar age density that underlies these ^14^C value is a mixture of two normal densities - one centred at 3500 cal yr BP (with a 1$\sigma$ standard deviation of 200 cal yrs); and another (more concentrated) centred at 5000 cal yr BP (with a 1$\sigma$ standard deviation of 100 cal yrs). However, when we calculate the SPD we obtain:

```{r two_normals_spd, echo = 2,  out.width="100%"}
oldpar <- par(no.readonly = TRUE)

two_normals_spd <- FindSummedProbabilityDistribution(
  calendar_age_range_BP=c(2500, 7000),
  rc_determinations= two_normals$c14_age,
  rc_sigmas = two_normals$c14_sig,
  calibration_curve=intcal20,
  plot_output = TRUE)

par(new = TRUE, 
    mgp = c(3, 0.7, 0),
    xaxs = "i",
    yaxs = "i",
    mar = c(5, 4.5, 4, 2) + 0.1,
    las = 1)
xlim <- rev(range(two_normals_spd$calendar_age_BP))
ylim <- c(0, 3 * max(two_normals_spd$probability))
plot(
    NULL,
    NULL,
    type = "n",
    ylim = ylim,
    xlim = xlim,
    axes = FALSE,
    xlab = NA,
    ylab = NA,
    xaxs = "i",
    yaxs = "i")
# Show true underlying calendar age density
weights_true <- c(0.45, 0.55)
cluster_means_true_calBP <- c(3500, 5000)
cluster_precisions_true <- 1 / c(200, 100)^2

# Find and plot true exact density
truedens <- function(t, w, truemean, trueprec) {
  dens <- 0
  for(i in 1:length(w)) {
    dens <- dens + w[i] * dnorm(t, mean = truemean[i], sd = 1/sqrt(trueprec[i]))
  }
  dens
}
curve(truedens(
  x,
  w = weights_true,
  truemean = cluster_means_true_calBP,
  trueprec = cluster_precisions_true),
      from = 2500, to = 7000, n = 401,
      lwd = 2,
      col = "red", add = TRUE)

# Reset plotting parameters
par(oldpar)
```

Here we have manually overlain the true (known) shared calendar age density in red. As we can see, the SPD captures does capture some broad features but does not reconstruct the truth well, and is hard to interpret. In particular, the SPD is highly variable, showing multiple peaks, due to the wiggliness of the calibration curve. The SPD peak shown around 5300 cal yr BP is entirely spurious, yet almost of the same magnitude as its peak around 3500 cal yr BP (which is a part of the genuine density). 

### An improvement using our library approaches

While this is jumping forward somewhat, to evidence that our methods provide better reconstructions, we run the same example using our Bayesian non-parametric summarisation approach (shown in purple as _Polya Urn_) and obtain:

```{r two_normals_compare, echo = FALSE,  out.width="100%"}
oldpar <- par(no.readonly = TRUE)

polya_urn_output <- PolyaUrnBivarDirichlet(
  rc_determinations = two_normals$c14_age,
  rc_sigmas = two_normals$c14_sig,
  calibration_curve=intcal20,
  n_iter = 1e4,
  show_progress = FALSE)

two_normals_DPMM <- PlotPredictiveCalendarAgeDensity(
  output_data = polya_urn_output,
  show_SPD = TRUE)

par(new = TRUE, 
    mgp = c(3, 0.7, 0),
    xaxs = "i",
    yaxs = "i",
    mar = c(5, 4.5, 4, 2) + 0.1,
    las = 1)
xlim <- rev(range(two_normals_DPMM[[1]]$calendar_age_BP))
ylim <- c(0, 3 * max(two_normals_DPMM[[1]]$density_mean))
plot(
    NULL,
    NULL,
    type = "n",
    ylim = ylim,
    xlim = xlim,
    axes = FALSE,
    xlab = NA,
    ylab = NA,
    xaxs = "i",
    yaxs = "i")
# Show true underlying calendar age density
weights_true <- c(0.45, 0.55)
cluster_means_true_calBP <- c(3500, 5000)
cluster_precisions_true <- 1 / c(200, 100)^2

# Find and plot true exact density
truedens <- function(t, w, truemean, trueprec) {
  dens <- 0
  for(i in 1:length(w)) {
    dens <- dens + w[i] * dnorm(t, mean = truemean[i], sd = 1/sqrt(trueprec[i]))
  }
  dens
}
curve(truedens(
  x,
  w = weights_true,
  truemean = cluster_means_true_calBP,
  trueprec = cluster_precisions_true),
      from = 2500, to = 7000, n = 401,
      lwd = 2,
      col = "red", add = TRUE)

# Reset plotting parameters
par(oldpar)
```


# References