Commit
Bring Fauxcurrences back in the main package (#137)
* ➕ add Fauxcurrences back in

* 🧑‍🏫 Fauxcurrences vignette

* ➕ update Fauxcurrences in root
tpoisot authored Feb 13, 2023
1 parent 797432c commit 4eca03b
Showing 4 changed files with 101 additions and 52 deletions.
3 changes: 1 addition & 2 deletions Fauxcurrences/Project.toml
@@ -1,13 +1,12 @@
name = "Fauxcurrences"
uuid = "a2d61402-033a-4ca9-aef4-652d70cf7c9c"
authors = ["Timothée Poisot <timothee.poisot@umontreal.ca>"]
version = "0.1.0"
version = "0.2.0"

[deps]
Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
GBIF = "ee291a33-5a6c-5552-a3c8-0f29a1181037"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
SimpleSDMLayers = "2c645270-77db-11e9-22c3-0f302a89c64c"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
1 change: 1 addition & 0 deletions Project.toml
@@ -6,6 +6,7 @@ version = "0.0.1"
[deps]
ArchGDAL = "c9ce4bd3-c3d5-55b8-8973-c0e20141b8c3"
Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
Fauxcurrences = "a2d61402-033a-4ca9-aef4-652d70cf7c9c"
GBIF = "ee291a33-5a6c-5552-a3c8-0f29a1181037"
GDAL = "add2ef01-049f-52c4-9ee2-e494f65e021a"
MakieCore = "20f20a25-4f0e-4fdf-b5d1-57303727442b"
@@ -1,13 +1,13 @@
# # Fauxcurrences.jl
# # Generation of fauxcurrences

# This package is a *clean-room*, *feature-equivalent* re-write in Julia of most
# of the functionalities of the [`fauxcurrence` package for R][paper]. The
# original code is licensed under the GPL, and this package is licensed under
# the MIT. For this reason, the original code, and any document distributed with
# it, has not been consulted during the implementation; the work is entirely
# based on the published article. As detailed in the following sections, the two
# packages do not have *feature parity*, but there is an overlap in the most
# significant functions.
# The `Fauxcurrences.jl` package is a *clean-room*, *feature-equivalent*
# re-write in Julia of most of the functionalities of the [`fauxcurrence`
# package for R][paper]. The original code is licensed under the GPL, and this
# package is licensed under the MIT. For this reason, the original code, and any
# document distributed with it, has not been consulted during the
# implementation; the work is entirely based on the published article. As
# detailed in the following sections, the two packages do not have *feature
# parity*, but there is an overlap in the most significant functions.

# [paper]: https://onlinelibrary.wiley.com/doi/full/10.1111/ecog.05880

@@ -24,8 +24,9 @@
# ## Why?

# Interoperability: this package uses `SimpleSDMLayers`, `Distances`, and `GBIF`
# as backends, making it fit very snugly with the rest of the (Eco)Julia
# ecosystem, and working towards integration of tools to build SDMs at scale.
# as backends, making it fit very snugly with the rest of the
# `SpeciesDistributionToolkit` ecosystem, and working towards integration of
# tools to build SDMs at scale.

# Expandability: the package is built on modular functions, to ensure that
# custom workflows can be built, while maintaining a general interface.
@@ -61,9 +62,7 @@
# (for example using the `GBIF.jl` package), and a layer (one of the
# `SimpleSDMLayer` types, predictor or response does not matter).

using Fauxcurrences
using SimpleSDMLayers
using GBIF
using SpeciesDistributionToolkit

#-

@@ -78,20 +77,27 @@ Random.seed!(616525434012345)
# this is a matrix with longitudes in the first row, and latitudes in the second
# row. The matrix *has* to be column-major, with observations as columns. To
# make sure that we cover a reasonable spatial extent, we will look at the
# default (small) raster distributed with `SimpleSDMLayers`.
# rather small area in space (part of the Gaspésie region):

layer = geotiff(SimpleSDMPredictor, joinpath(dirname(pathof(SimpleSDMLayers)), "..", "data", "connectivity.tiff"))
bbox = (bottom = 43.285203, left = -68.631592, top = 47.487513, right = -59.458008)
layer = SimpleSDMPredictor(RasterData(CHELSA1, BioClim); layer = 1, bbox...)
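
# A quick way to check that the layer covers the intended extent is to look at
# the range of its coordinates, using the same accessors that the GBIF query
# below relies on:

extrema(longitudes(layer)), extrema(latitudes(layer))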

# This covers a small area in the Laurentians, north of Montréal, so looking for
# raccoons, deer, and white-footed mice is a safe bet:
# We will pick two species (a skunk, and *Iris versicolor*):

taxa = taxon.(["Procyon lotor", "Odocoileus virginianus", "Peromyscus leucopus"])
taxa =
    taxon.([
        "Mephitis mephitis",
        "Iris versicolor",
    ])

observations = [occurrences(t,
    "hasCoordinate" => "true",
    "decimalLatitude" => extrema(latitudes(layer)),
    "decimalLongitude" => extrema(longitudes(layer)),
    "limit" => 100) for t in taxa]
observations = [
    occurrences(t,
        "hasCoordinate" => "true",
        "occurrenceStatus" => "PRESENT",
        "decimalLatitude" => extrema(latitudes(layer)),
        "decimalLongitude" => extrema(longitudes(layer)),
        "limit" => 150) for t in taxa
]

# The last step is to turn these occurrences into a matrix of longitudes and
# latitudes:
@@ -100,11 +106,13 @@ obs = [Fauxcurrences.get_valid_coordinates(o, layer) for o in observations]
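
# Each element of `obs` is a coordinate matrix in the format described earlier:
# longitudes in the first row, latitudes in the second row, and one column per
# usable occurrence. As a purely illustrative (made-up) example, a matrix of
# three such points would look like:

example_points = [-66.5 -64.2 -61.8; 45.1 46.3 47.0]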

# ### Parameters for the run

# Decide on the number of points to generate, and the weight matrix. The number
# of points to generate is, by default, the number of observations in the
# original dataset, but this can be changed to generate balanced samples:
# We need to decide on the number of points (pseudo-occurrences) to generate,
# and the weight matrix. The number of points to generate is, by default, the
# number of observations in the original dataset, but this can be changed to
# generate balanced samples. Here, we will pick 80 as a target number of
# occurrences:

points_to_generate = fill(30, length(obs))
points_to_generate = fill(80, length(obs))
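
# If, instead, we wanted the number of simulated points to mirror the size of
# each observed dataset (the default behavior described above), one way to get
# these numbers is to count the columns of each coordinate matrix (assuming, as
# described above, that each entry of `obs` has one column per observation).
# The `points_as_observed` variable below is only an illustration, and is not
# used in the rest of this run:

points_as_observed = [size(o, 2) for o in obs]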

# The weight matrix is used to determine whether intra or inter-specific
# distances are more important in the distribution score. For example, this will
@@ -121,7 +129,8 @@ W = Fauxcurrences.weighted_components(length(obs), 0.75)
# re-written many, many times. The upside is that, if this step fits in your
# memory, the entire workflow will also fit in your memory.

obs_intra, obs_inter, sim_intra, sim_inter = Fauxcurrences.preallocate_distance_matrices(obs; samples=points_to_generate);
obs_intra, obs_inter, sim_intra, sim_inter =
    Fauxcurrences.preallocate_distance_matrices(obs; samples = points_to_generate);
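
# To get a rough sense of whether this is the case, one possible check (a
# sketch, not part of the package API) is to measure the total size of the
# pre-allocated matrices in bytes:

sum(Base.summarysize, (obs_intra, obs_inter, sim_intra, sim_inter))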

# The entire workflow is designed to use multiple species (as most relevant
# questions will require multiple species). Using a single-species approach only
@@ -141,7 +150,7 @@ Fauxcurrences.measure_interspecific_distances!(obs_inter, obs);
# This is a two-step process, involving first the pre-allocation of coordinate
# matrices, and second the population of these matrices using random points.

sim = Fauxcurrences.preallocate_simulated_points(obs; samples=points_to_generate);
sim = Fauxcurrences.preallocate_simulated_points(obs; samples = points_to_generate);

# The actual bootstrapping can be a little longer:

@@ -154,10 +163,22 @@ Fauxcurrences.bootstrap!(sim, layer, obs, obs_intra, obs_inter, sim_intra, sim_i
# matrix where the maximum distance is larger than the maximum distance in the
# empirical matrix.

bin_intra = [Fauxcurrences._bin_distribution(obs_intra[i], maximum(obs_intra[i])) for i in 1:length(obs_intra)];
bin_inter = [Fauxcurrences._bin_distribution(obs_inter[i], maximum(obs_inter[i])) for i in 1:length(obs_inter)];
bin_s_intra = [Fauxcurrences._bin_distribution(sim_intra[i], maximum(obs_intra[i])) for i in 1:length(obs_intra)];
bin_s_inter = [Fauxcurrences._bin_distribution(sim_inter[i], maximum(obs_inter[i])) for i in 1:length(obs_inter)];
bin_intra = [
    Fauxcurrences._bin_distribution(obs_intra[i], maximum(obs_intra[i])) for
    i in axes(obs_intra, 1)
];
bin_inter = [
    Fauxcurrences._bin_distribution(obs_inter[i], maximum(obs_inter[i])) for
    i in axes(obs_inter, 1)
];
bin_s_intra = [
    Fauxcurrences._bin_distribution(sim_intra[i], maximum(obs_intra[i])) for
    i in axes(obs_intra, 1)
];
bin_s_inter = [
    Fauxcurrences._bin_distribution(sim_inter[i], maximum(obs_inter[i])) for
    i in axes(obs_inter, 1)
];

# This version of the `_bin_distribution` function allocates an array for
# the bins, but the internal function overwrites it, to avoid unwanted
@@ -175,12 +196,12 @@ D = Fauxcurrences.score_distributions(W, bin_intra, bin_s_intra, bin_inter, bin_
# ### Set up the actual run

# This step has infinitely many variations, as `Fauxcurrences` only offers a
# method to do *a single step forward*. In most cases, using *e.g.*
# method to perform *a single step forward*. In most cases, using *e.g.*
# `ProgressMeter` will be a good way to track the progress of the run, and to
# allow, for example, to stop it when a condition of criteria (absolute/relative
# divergence, globally, on average, or per-species) are met. For the sake of
# simplicity, we return the sum of all divergences, measures for 500000
# timesteps.
# allow you, for example, to stop it when a collection of criteria
# (absolute/relative divergence, globally, on average, or per-species) are met.
# For the sake of simplicity, we return the sum of all divergences, measured for
# 500000 timesteps.

progress = zeros(Float64, 500_000)
progress[1] = sum(D)
@@ -197,33 +218,61 @@ progress[1] = sum(D)
# structure of your points, the problem might be more or less difficult to
# solve. In practice, using `ProgressMeter` will make the timing of the whole
# process a lot more informative. In this simple version, we rely on a manually
# created progress report:

for i in 2:length(progress)
    progress[i] = Fauxcurrences.step!(sim, layer, W, obs_intra, obs_inter, sim_intra, sim_inter, bin_intra, bin_inter, bin_s_intra, bin_s_inter, progress[i-1])
    if iszero(i%20_000)
        println("[$(lpad(round(Int64, 100*(i/length(progress))), 3))%]\tJS-divergence: $(round(progress[i]; digits=3))")
# created progress report. Note that we stop the process when we have done at
# least 10⁴ steps, with no improvement over the last 10³.

for i in axes(progress, 1)[2:end]
    progress[i] = Fauxcurrences.step!(
        sim,
        layer,
        W,
        obs_intra,
        obs_inter,
        sim_intra,
        sim_inter,
        bin_intra,
        bin_inter,
        bin_s_intra,
        bin_s_inter,
        progress[i - 1],
    )
    if i > 10_000
        if abs(progress[i] - progress[i - 1000]) <= 1e-3
            break
        end
    end
    if iszero(i % 20_000)
        println(
            "[$(lpad(round(Int64, 100*(i/length(progress))), 3))%]\tJS-divergence: $(round(progress[i]; digits=3))",
        )
    end
end

# The call to `step!` is... lengthy. The reason for this is very simple: `step!`
# will update as much information as it can *in place* when a change is made.
# This means that there are no objects creates (only changed), but the downside
# This means that there are no objects created (only changed). The downside
# is that the function needs to be given a lot of information.

# Depending on the number of species (and the structure of the weight matrix),
# this step will take a little while. This is because, assuming we want *n*
# points for *r* species, each step has a complexity on the order of *rnΒ²*,
# which isn't terribly good.
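
# For the run above, with *r* = 2 species and *n* = 80 simulated points per
# species, this is on the order of 2 × 80² = 12800 pairwise distances per step,
# or roughly 6.4 × 10⁹ distance evaluations over the 500000 steps requested
# (fewer if the loop stops early).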

# ### Congratulations, your run is done!

# Here, it would make sense to look at the total improvement (or to plot the
# timeseries of the improvement):

println("Improvement: $(round(progress[begin]/progress[end]; digits=2)) Γ—")
println(
"Improvement: $(round(progress[begin]/progress[findlast(x -> x>0, progress)]; digits=2)) Γ—",
)
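
# One way to plot the timeseries mentioned above. This sketch assumes a Makie
# backend (for example `CairoMakie`) is installed; it is not required anywhere
# else in this document:

using CairoMakie
trace = progress[1:findlast(x -> x > 0, progress)]
lines(eachindex(trace), trace; axis = (; xlabel = "step", ylabel = "sum of divergences"))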

# Note that for a small number of iterations (like we used here), this
# improvement is unlikely to be very large; note also that the returns (in terms
# of improvement over time) are very much diminishing. The good news is that
# re-starting the process is as easy as running the loop with calls to `step!`
# again, as the packages has modified the matrices, and is ready to restart at
# any time.
# again, as the package has modified the matrices, and is ready to restart at
# any time. This is useful for checkpointing.
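
# As an illustration of such a restart, here is a sketch that continues the
# optimization for a few more steps, this time with the progress display
# handled by `ProgressMeter.jl` (an optional package, assumed to be installed;
# it is not a dependency of anything shown above). The `extra_steps` and
# `restart` names are only used for this example:

using ProgressMeter
extra_steps = 10_000
restart = zeros(Float64, extra_steps + 1)
restart[1] = progress[findlast(x -> x > 0, progress)]
prog = Progress(extra_steps; desc = "Restart: ")
for i in 2:lastindex(restart)
    restart[i] = Fauxcurrences.step!(
        sim, layer, W,
        obs_intra, obs_inter, sim_intra, sim_inter,
        bin_intra, bin_inter, bin_s_intra, bin_s_inter,
        restart[i - 1],
    )
    next!(prog; showvalues = [(:divergence, round(restart[i]; digits = 3))])
end

# Because `step!` keeps updating the same matrices in place, the scores stored
# in `restart` pick up exactly where `progress` left off.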

# You can also look at the per-matrix score, out of all the distance bins (set
# as a package-wide variable, `Fauxcurrences._number_of_bins`, which you are
2 changes: 1 addition & 1 deletion src/SpeciesDistributionToolkit.jl
@@ -18,7 +18,7 @@ using Reexport
@reexport using SimpleSDMDatasets
@reexport using GBIF
@reexport using SimpleSDMLayers
# @rexport using Fauxcurrences
@reexport using Fauxcurrences

# SimpleSDMLayers to wrap everything together
include("integrations/datasets_layers.jl")

2 comments on commit 4eca03b

@tpoisot
Member Author


@JuliaRegistrator register subdir=Fauxcurrences

@JuliaRegistrator


Registration pull request created: JuliaRegistries/General/77618

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a Fauxcurrences-v0.2.0 -m "<description of version>" 4eca03b58d37aac95f8785f3f3379a73882fae07
git push origin Fauxcurrences-v0.2.0

Also, note the warning: Version 0.2.0 skips over 0.1.0
This can be safely ignored. However, if you want to fix this you can do so. Call register() again after making the fix. This will update the Pull request.
