sample does not generate the expected number of background points #194

Closed
gabrieldansereau opened this issue May 16, 2023 · 2 comments · Fixed by #207

@gabrieldansereau
Member

When generating a given number of background points (e.g. 1000), SpeciesDistributionToolkit.sample will often return fewer background sites (~970), which is counterintuitive when the goal is to generate as many background points as there are occurrences.

This is because we use StatsBase.sample(keys(replace(layer, false => nothing)), n; kwargs...) internally, which uses replace=true by default to sample with replacement. The sampling can then select the same key multiple times, and since each key can only be marked once in the resulting layer, duplicate draws collapse into a single site. This results in fewer background sites than requested, especially with small layers. Using replace=false fixes the issue.
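
To make the effect of duplicate draws concrete, here is a minimal sketch using only the Random standard library as a stand-in for StatsBase.sample. The `candidates` vector, its size, and the seed are assumptions chosen for illustration; they are not taken from the package or from the example below.

```julia
using Random

# Hypothetical stand-in for the keys of the candidate background cells.
candidates = collect(1:2000)
n = 265
rng = MersenneTwister(42)

# With replacement: each draw is independent, so a cell can be picked twice.
with_replacement = rand(rng, candidates, n)

# Without replacement: take the first n cells of a random permutation.
without_replacement = shuffle(rng, candidates)[1:n]

# Marking cells in a Boolean layer collapses duplicates, so the number of
# distinct cells is what a later sum over the layer would count.
println(length(unique(with_replacement)))     # at most n
println(length(unique(without_replacement)))  # exactly n
```

The smaller the pool of candidate cells relative to `n`, the more duplicate draws are expected with replacement, which matches the observation that the shortfall is larger with small layers.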

@tpoisot Should we make replace=false the default? I think the users' intent will most often be to generate a given number of background sites. Otherwise, I'll add something to the function documentation and vignette to make the fix more obvious.

Here's an example adapted from the vignettes:

using SpeciesDistributionToolkit
using CairoMakie
using Random
spatial_extent = (left = 3.0, bottom = 55.2, right = 19.7, top = 64.9)
rangifer = taxon("Rangifer tarandus tarandus"; strict = false)
query = [
    "occurrenceStatus" => "PRESENT",
    "hasCoordinate" => true,
    "decimalLatitude" => (spatial_extent.bottom, spatial_extent.top),
    "decimalLongitude" => (spatial_extent.left, spatial_extent.right),
    "limit" => 300,
]
presences = occurrences(rangifer, query...)
for i in 1:3
    occurrences!(presences)
end
temperature = SimpleSDMPredictor(RasterData(WorldClim2, BioClim); spatial_extent...)
presencelayer = mask(temperature, presences, Bool)
absmask = pseudoabsencemask(SurfaceRangeEnvelope, presencelayer)

And then we have:

julia> sum(presencelayer)
265

julia> Random.seed!(42);

julia> abs = SpeciesDistributionToolkit.sample(absmask, sum(presencelayer));

julia> sum(abs) # fewer sites
251

julia> Random.seed!(42);

julia> abs2 = SpeciesDistributionToolkit.sample(absmask, sum(presencelayer); replace=false);

julia> sum(abs2) # same number of sites
265
@tpoisot
Member

tpoisot commented May 16, 2023

I was contemplating renaming this method anyways. Let me look into it.

@tpoisot
Member

tpoisot commented May 17, 2023

Two things -- I think I will rename this method to rarefy, because it's closer to what sample does conceptually.

I think the replace=true was here to allow using layers with fewer points than requested, but there's obviously a better way to handle this. I will open a PR later.
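
For illustration only, one possible way to handle a layer with fewer candidate cells than requested is to cap the request at the number of available cells and then sample without replacement. The sketch below is an assumption about what such handling could look like, not the approach taken in the eventual PR; `sample_up_to` is a hypothetical helper that does not exist in SpeciesDistributionToolkit.

```julia
using Random

# Hypothetical helper, not part of SpeciesDistributionToolkit.
# Returns up to `n` distinct elements of `candidates`, warning on a shortfall.
function sample_up_to(rng::AbstractRNG, candidates::AbstractVector, n::Integer)
    k = min(n, length(candidates))
    if k < n
        @warn "Only $k candidate cells are available, fewer than the $n requested"
    end
    # The first k elements of a random permutation are k distinct cells.
    return shuffle(rng, candidates)[1:k]
end

rng = MersenneTwister(42)
few = sample_up_to(rng, collect(1:10), 265)       # all 10 cells, in random order
enough = sample_up_to(rng, collect(1:2000), 265)  # 265 distinct cells
```

Warning rather than throwing keeps the behaviour that replace=true was presumably meant to allow, while making the shortfall visible to the user.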
