Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aoa() appears to return incorrect thresholds (different from Meyer & Pebesma 2021) #46

Closed
mikemahoney218 opened this issue Dec 11, 2022 · 2 comments

Comments

@mikemahoney218
Copy link

Hi all,

Adapting some code from the MEE-AOA repo, I believe I can calculate an AOA like this:

set.seed(123)

library(CAST)
library(caret)
library(virtualspecies)

npoints <- 50
meansPCA <- c(3, -1)
sdPCA <- c(2, 2)
simulateResponse <- c("bio2","bio5","bio10", "bio13", "bio14","bio19")
studyarea <- c(-15, 65, 30, 75)
predictors_global <- raster::brick(
  system.file(
    "extdata/bioclim_global.grd", 
    package = "CAST"
  )
)

predictors <- crop(predictors_global, extent(studyarea))
mask <- predictors[[1]]
values(mask)[!is.na(values(mask))] <- 1
response_vs <- generateSpFromPCA(
  predictors[[simulateResponse]],
  means = meansPCA,
  sds = sdPCA, 
  plot = FALSE
)
response <- response_vs$suitab.raster
mask <- rasterToPolygons(mask,dissolve=TRUE)

samplepoints <- spsample(mask,npoints,"random")
trainDat <- extract(predictors,samplepoints,df=TRUE)
trainDat$response <- extract (response,samplepoints)
trainDat <- trainDat[complete.cases(trainDat),]

model <- train(trainDat[,names(predictors)],
               trainDat$response,
               method="rf",
               importance=TRUE,
               trControl = trainControl(method="none"))

AOA <- aoa(trainDat, model=model)

According to the 2021 paper, I believe the AOA threshold after this should be equal to "the 75-percentile plus 1.5 times the IQR of the DI values of the cross-validated training data". Calculating that using quantile and IQR gives us these results:

di <- attr(AOA$AOA, "TrainDI")

(threshold_quantile <- stats::quantile(di, 0.75))
#>       75% 
#> 0.3059488
(threshold_iqr <- (1.5 * stats::IQR(di)))
#> [1] 0.3392091
threshold_quantile + threshold_iqr
#>       75% 
#> 0.6451579

But the AOA threshold returned by aoa() doesn't match that calculation:

AOA$parameters$threshold
#> [1] 0.4770295

If I'm right and this is unexpected, it seems to be due to the use of boxplot.stats() here:

thres <- grDevices::boxplot.stats(TrainDI)$stats[5]

That gives us the threshold that CAST returns:

grDevices::boxplot.stats(di)$stats[5]
#> [1] 0.4770295

But I'm not entirely sure what boxplot.stats() actually does. For instance, imagine that we cut off the last di value in our vector:

di[50]
#> [1] 0.2120274
di <- di[1:49]

Because it's a rather low number, both our 75% percentile and IQR increase:

(threshold_quantile <- stats::quantile(di, 0.75))
#>       75% 
#> 0.3101567
(threshold_iqr <- (1.5 * stats::IQR(di)))
#> [1] 0.3523555
threshold_quantile + threshold_iqr
#>       75% 
#> 0.6625121

But boxplot.stats() returns the same value as before:

grDevices::boxplot.stats(di)$stats[5]
#> [1] 0.4770295

Created on 2022-12-11 by the reprex package (v2.0.1)

Apologies if I'm misunderstanding something here! The return here just didn't match my expectations.

@mikemahoney218
Copy link
Author

Looking at the source of grDevices::boxplot.stats, I see:

> grDevices::boxplot.stats
function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) 
{
    if (coef < 0) 
        stop("'coef' must not be negative")
    nna <- !is.na(x)
    n <- sum(nna)
    stats <- stats::fivenum(x, na.rm = TRUE)
    iqr <- diff(stats[c(2, 4)])
    if (coef == 0) 
        do.out <- FALSE
    else {
        out <- if (!is.na(iqr)) {
            x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * 
                iqr)
        }
        else !is.finite(x)
        if (any(out[nna], na.rm = TRUE)) 
            stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
    }
    conf <- if (do.conf) 
        stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
    list(stats = stats, n = n, conf = conf, out = if (do.out) x[out & 
        nna] else numeric())
}

The relevant lines here are:

out <- if (!is.na(iqr)) {
  x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
}
#> [...]
stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)

In context, that means that the AOA threshold winds up equaling:

max(di[!(di > (quantile(di, 0.75) + 1.5 * IQR(di)))])

Which means that the threshold is going to be the value in di closest to, but not more than quantile(di, 0.75) + (1.5 * IQR(di)), which, particularly for smaller data, may be a significantly different value. The returned value will always be lower than (or the same as) the 75th percentile plus 1.5 times the IQR. Is this expected behavior?

HannaMeyer added a commit that referenced this issue Dec 12, 2022
@HannaMeyer
Copy link
Owner

Thanks for finding that! I fixed it according to your suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants