Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Homogenize and simplify spatialstats.py #276

Merged
merged 41 commits into from
Aug 9, 2022

Conversation

rhugonnet
Copy link
Contributor

@rhugonnet rhugonnet commented Aug 1, 2022

This PR aims to address several issues linked to spatialstats, in particular the ease-of-use, clarity of parameters and tests!

Summary of changes

  • The arguments to pass to sample_empirical_variogram in relation to skgstat.MetricSpace have been simplified, and further described. In particular the subsample argument is now automatically calculated to match the number pairwise samples found in a single ensemble (N**2/2 if N is the subsample), even if the method uses more complex distinct ensembles (as is the case for the default cdist_equidistant method),
  • Some minors fixes have been implemented to avoid user-induced mistakes in nd_binning, plot_1d_binning and plot_2d_binning.
  • The format of variogram model functions has been homogenized to use only skgstat.models, which results in the deletion of the vgm, cov functions that existed in xdem.spatialstats, and the replacement of occurences in fit_sum_model_variogram and neff functions (for fitting to empirical variograms, and spatial integration),
  • The format for returning/passing modelled variograms in xDEM has been homogenized to a pd.Dataframe object with columns "models", "range", "psills" and "smooth", described in related function output/input. The format is tested before each function run to provide clear ValueError messages in case of wrong user input,
  • Three basic functions get_func_sum_vgm_models, covariance_from_vgm and correlation_from_vgm have been added to convert variogram model parameters of a pd.Dataframe into a function of sum of variogram with spatial lags, a spatial covariance function with spatial lags, and also a spatial correlation function with spatial lags,
  • The neff functions to estimate a number of effective samples based on spatial correlation have been improved, clarified, and further tested. In particular, the neff_circular_approx_theoretical now contains exact circular integration formulas for any number of summed spherical, gaussian, exponential and cubic models. It is used to thoroughly test the neff_circular_approx_numerical function that can integrate numerically any number of summed variogram of any form. Those two fonctions are to be used when the user solely provides an area value. When the shape of the area is provided, the neff_exact or neff_approx_hugonnet, based on double sum of covariance with exact coordinates, can be used to derive the number of effective samples. The double covariance sum has been vectorized for computational speed, and the random sampling of the neff_approx_hugonnet is now tested with random_state arguments.
  • For now, the documentation examples are updated with the new function names and parameters, but those need further simplification ("beginner example", "advanced example").

Additional changes after answers to questions below:

(Old) Questions for @adehecq and @erikmannerfelt to move forward with this PR:

  1. Are all the function and parameters names fine with you? There's probably a bit improvement to do there, and once we have some good names we stick to them! But it's a bit hard to find those alone sometimes 😅.
  2. Should I do a core neff function, which defaults to one of the other functions, and can be used to choose which calculation? Should I change the others to hidden functions _neff_...? This would remove them from API.
  3. My idea for the updated documentation is to have: (i) beginner examples that fit in just 10 lines of code to exemplify each parameter and the basic behaviour of a function, (ii) advanced examples where things are more detailed/discussed.
  4. For the purpose of 3., I would add an "error_pipeline" that automatically combines steps that go together: nd_binning and interp_nd_binning, a new "standardization" function, and a new "error_map" function; then another one that combines sample_empirical_variogram, fit_sum_model_variogram and neff to get the error in an area? Something along these lines.

To-do-list for related issues

As well as:

  • Add more tests for neff function, including exact integration for spherical, exponential, gaussian and cubic models,
  • Add tests for plotting functions of spatialstats to ensure clear error message with wrong user input.

xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
xdem/spatialstats.py Outdated Show resolved Hide resolved
@adehecq
Copy link
Member

adehecq commented Aug 8, 2022

Thanks for this huge effort ! 👏 I made some (mostly minor) comments directly in the code. We will see with practice how it goes.
Below are an answer to your questions.

  • Are all the function and parameters names fine with you? There's probably a bit improvement to do there, and once we have some good names we stick to them! But it's a bit hard to find those alone sometimes 😅.

I went through each function's name and commented when I thought some changes could be useful. We should particularly try to be consistent between the use of vgm and variogram.
Of course, the different neff functions are a bit puzzling, but with the proper documentation it should be ok.
I could not add a comment at that line because it was not edited, but apparently distance_latlon is not used anymore and could be removed.

  • Should I do a core neff function, which defaults to one of the other functions, and can be used to choose which calculation? Should I change the others to hidden functions _neff_...? This would remove them from API.

Mmmh. Maybe a good idea to have just one function with a parameter to use one or the other.

  • My idea for the updated documentation is to have: (i) beginner examples that fit in just 10 lines of code to exemplify each parameter and the basic behaviour of a function, (ii) advanced examples where things are more detailed/discussed.

Sounds perfect!

  • For the purpose of 3., I would add an "error_pipeline" that automatically combines steps that go together: nd_binning and interp_nd_binning, a new "standardization" function, and a new "error_map" function; then another one that combines sample_empirical_variogram, fit_sum_model_variogram and neff to get the error in an area? Something along these lines.

Yes, I think it is important to have a simple pipeline that runs all the steps.

adehecq
adehecq previously approved these changes Aug 8, 2022
@rhugonnet
Copy link
Contributor Author

Merging now that comments are accounted for.
The error pipeline + new documentation will be done in a separate PR.

@rhugonnet rhugonnet merged commit 43a34cb into GlacioHack:main Aug 9, 2022
@rhugonnet rhugonnet deleted the simplify_spstats branch August 9, 2022 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment