Homogenize and simplify `spatialstats.py` #276

rhugonnet · 2022-08-01T13:08:22Z

This PR aims to address several issues linked to spatialstats, in particular the ease-of-use, clarity of parameters and tests!

Summary of changes

The arguments to pass to sample_empirical_variogram in relation to skgstat.MetricSpace have been simplified and further described. In particular, the subsample argument is now automatically calculated to match the number of pairwise samples found in a single ensemble (N**2/2 if N is the subsample), even if the method uses more complex distinct ensembles (as is the case for the default cdist_equidistant method),
Some minor fixes have been implemented to avoid user-induced mistakes in nd_binning, plot_1d_binning and plot_2d_binning,
The format of variogram model functions has been homogenized to use only skgstat.models, which results in the deletion of the vgm and cov functions that existed in xdem.spatialstats, and the replacement of their occurrences in the fit_sum_model_variogram and neff functions (for fitting to empirical variograms, and spatial integration),
The format for returning/passing modelled variograms in xDEM has been homogenized to a pd.DataFrame object with columns "models", "range", "psills" and "smooth", described in the documentation of the related function outputs/inputs. The format is checked before each function runs, to provide clear ValueError messages in case of wrong user input,
Three basic functions get_func_sum_vgm_models, covariance_from_vgm and correlation_from_vgm have been added to convert the variogram model parameters of a pd.DataFrame into, respectively, a function returning the sum of variogram models at a given spatial lag, a spatial covariance function of the spatial lag, and a spatial correlation function of the spatial lag,
The neff functions to estimate a number of effective samples based on spatial correlation have been improved, clarified, and further tested. In particular, neff_circular_approx_theoretical now contains exact circular integration formulas for any number of summed spherical, gaussian, exponential and cubic models. It is used to thoroughly test the neff_circular_approx_numerical function, which can numerically integrate any number of summed variogram models of any form. Those two functions are to be used when the user solely provides an area value. When the shape of the area is provided, neff_exact or neff_approx_hugonnet, based on a double sum of covariances with exact coordinates, can be used to derive the number of effective samples. The double covariance sum has been vectorized for computational speed, and the random sampling of neff_approx_hugonnet is now tested with random_state arguments.
For now, the documentation examples are updated with the new function names and parameters, but those need further simplification ("beginner example", "advanced example").
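
To illustrate the parameter table and the three conversion functions listed above, here is a minimal, dependency-free sketch. It is not the code of this PR: the rows of the table are represented as plain dicts instead of a pd.DataFrame, the "smooth" column is left out, the helper names (check_params, sum_variogram_func, covariance_func, correlation_func) are hypothetical, and the model formulas assume an effective-range convention that may differ in detail from skgstat.models.

```python
import math

# Hypothetical stand-in for the rows of the parameter table described above.
# The PR itself uses a pandas DataFrame with these column names.
REQUIRED_COLUMNS = ("models", "range", "psills")
KNOWN_MODELS = ("spherical", "exponential", "gaussian")


def check_params(rows):
    """Raise a clear ValueError if the parameter table is malformed."""
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"Row {i} is missing column(s): {missing}")
        if row["models"] not in KNOWN_MODELS:
            raise ValueError(f"Row {i}: unknown model {row['models']!r}")
        if row["range"] <= 0 or row["psills"] <= 0:
            raise ValueError(f"Row {i}: range and psills must be > 0")


def single_variogram(model, h, r, psill):
    """One model at lag h, with r the effective range (assumed convention)."""
    if model == "spherical":
        return psill if h >= r else psill * (1.5 * h / r - 0.5 * (h / r) ** 3)
    if model == "exponential":
        return psill * (1.0 - math.exp(-3.0 * h / r))
    # Remaining known model: gaussian
    return psill * (1.0 - math.exp(-4.0 * h**2 / r**2))


def sum_variogram_func(rows):
    """Return gamma(h), the sum of all variogram models at lag h."""
    check_params(rows)

    def gamma(h):
        return sum(
            single_variogram(row["models"], h, row["range"], row["psills"])
            for row in rows
        )

    return gamma


def covariance_func(rows):
    """Return C(h) = total sill - gamma(h)."""
    gamma = sum_variogram_func(rows)
    total_sill = sum(row["psills"] for row in rows)
    return lambda h: total_sill - gamma(h)


def correlation_func(rows):
    """Return rho(h) = C(h) / total sill, equal to 1 at zero lag."""
    cov = covariance_func(rows)
    total_sill = sum(row["psills"] for row in rows)
    return lambda h: cov(h) / total_sill
```

With a short-range spherical model and a long-range exponential model, `correlation_func(rows)(0.0)` is 1 and the covariance tends to 0 at lags well beyond the largest range.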

Additional changes after answers to questions below:

Renamed all occurrences of vgm in parameter or function names to variogram,
Corrected typos, unclear descriptions and parameters,
Added a leading _ to the function names std_err, std_err_finite, distance_latlon, create_circular_mask and create_circular_ring to remove them from the API,
Added additional tests on output of sample_empirical_variogram and _choose_cdist_equidistant_sampling_parameters,
Corrected sphinx syntax error,
Addressed issues related to testing and added two pages to the wiki: https://github.com/GlacioHack/xdem/wiki/Stuck-on-a-mysterious-error-during-CI%3F and https://github.com/GlacioHack/xdem/wiki/Memo-on-Python-and-numpy-%60dtype%60,
Questions 3-4 will be resolved in a separate PR.

(Old) Questions for @adehecq and @erikmannerfelt to move forward with this PR:

Are all the function and parameter names fine with you? There's probably a bit of improvement to do there, and once we have some good names we stick to them! But it's a bit hard to find those alone sometimes 😅.
Should I do a core neff function, which defaults to one of the other functions, and can be used to choose which calculation? Should I change the others to hidden functions _neff_...? This would remove them from API.
My idea for the updated documentation is to have: (i) beginner examples that fit in just 10 lines of code to exemplify each parameter and the basic behaviour of a function, (ii) advanced examples where things are more detailed/discussed.
For the purpose of 3., I would add an "error_pipeline" that automatically combines steps that go together: nd_binning and interp_nd_binning, a new "standardization" function, and a new "error_map" function; then another one that combines sample_empirical_variogram, fit_sum_model_variogram and neff to get the error in an area? Something along these lines.

To-do-list for related issues

As well as:

Add more tests for neff function, including exact integration for spherical, exponential, gaussian and cubic models,
Add tests for plotting functions of spatialstats to ensure clear error message with wrong user input.
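
As a sketch of what such an exact-versus-numerical test can look like for the spherical model, the example below compares a midpoint-rule integration with the closed-form integral. It assumes the circular approximation takes the form neff = A / ∫ 2πh ρ(h) dh over a disk of area A, with ρ the spatial correlation; the function names are hypothetical and this is not the test code of the PR.

```python
import math


def spherical_correlation(h, r):
    """Correlation of a spherical model with range r (unit sill, no nugget)."""
    if h >= r:
        return 0.0
    return 1.0 - 1.5 * h / r + 0.5 * (h / r) ** 3


def neff_circular_numerical(area, correlation, n_steps=20000):
    """Midpoint-rule integration of 2*pi*h*rho(h) over a disk of given area."""
    radius = math.sqrt(area / math.pi)
    dh = radius / n_steps
    integral = 0.0
    for k in range(n_steps):
        h = (k + 0.5) * dh
        integral += 2.0 * math.pi * h * correlation(h) * dh
    return area / integral


def neff_circular_spherical_exact(area, r):
    """Closed form of the same integral for a single spherical model."""
    radius = math.sqrt(area / math.pi)
    # The correlation is zero beyond the range, so integration stops there.
    x = min(radius, r) / r
    # Integral of 2*pi*h*(1 - 1.5*h/r + 0.5*h**3/r**3) from 0 to x*r
    integral = 2.0 * math.pi * r**2 * (x**2 / 2.0 - x**3 / 2.0 + x**5 / 10.0)
    return area / integral


# Compare both estimates for a disk larger than the correlation range.
area = math.pi * 500.0**2
exact = neff_circular_spherical_exact(area, 100.0)
numerical = neff_circular_numerical(area, lambda h: spherical_correlation(h, 100.0))
assert abs(numerical - exact) / exact < 1e-4
```

When the disk radius exceeds the range r, the closed form reduces to neff = 5A / (πr²).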

xdem/spatialstats.py

adehecq · 2022-08-08T12:52:00Z

Thanks for this huge effort! 👏 I made some (mostly minor) comments directly in the code. We will see with practice how it goes.
Below are answers to your questions.

Are all the function and parameter names fine with you? There's probably a bit of improvement to do there, and once we have some good names we stick to them! But it's a bit hard to find those alone sometimes 😅.

I went through each function's name and commented when I thought some changes could be useful. We should particularly try to be consistent between the use of vgm and variogram.
Of course, the different neff functions are a bit puzzling, but with the proper documentation it should be ok.
I could not add a comment at that line because it was not edited, but apparently distance_latlon is not used anymore and could be removed.

Should I do a core neff function, which defaults to one of the other functions, and can be used to choose which calculation? Should I change the others to hidden functions _neff_...? This would remove them from API.

Mmmh. Maybe a good idea to have just one function with a parameter to use one or the other.
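
A single entry point with a method parameter could look roughly like the sketch below. Everything in it is hypothetical (function names, parameter names, defaults). The "exact" branch uses the double sum neff = N² / ΣᵢΣⱼ ρ(dᵢⱼ), which assumes the same variance at every point, and the "subsample" branch only sums over a random subset of rows of that double sum, in the spirit of an approximate method but not the actual neff_approx_hugonnet implementation.

```python
import math
import random


def _neff_exact(correlation, coords):
    """Double sum of correlations over all pairs of points."""
    n = len(coords)
    total = 0.0
    for xi, yi in coords:
        for xj, yj in coords:
            total += correlation(math.hypot(xi - xj, yi - yj))
    return n**2 / total


def _neff_subsample(correlation, coords, subsample, random_state):
    """Same sum, restricted to a random subset of k rows and rescaled."""
    rng = random.Random(random_state)
    n = len(coords)
    k = min(subsample, n)
    rows = rng.sample(range(n), k)
    total = 0.0
    for i in rows:
        xi, yi = coords[i]
        for xj, yj in coords:
            total += correlation(math.hypot(xi - xj, yi - yj))
    return n * k / total


def neff(correlation, coords, method="exact", subsample=100, random_state=None):
    """Hypothetical core function dispatching to the hidden helpers."""
    if method == "exact":
        return _neff_exact(correlation, coords)
    if method == "subsample":
        return _neff_subsample(correlation, coords, subsample, random_state)
    raise ValueError(f"Unknown method {method!r}: use 'exact' or 'subsample'")
```

Keeping the helpers private and exposing only `neff` would leave a single documented function in the API, while passing `random_state` keeps the subsampled estimate reproducible in tests.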

My idea for the updated documentation is to have: (i) beginner examples that fit in just 10 lines of code to exemplify each parameter and the basic behaviour of a function, (ii) advanced examples where things are more detailed/discussed.

Sounds perfect!

For the purpose of 3., I would add an "error_pipeline" that automatically combines steps that go together: nd_binning and interp_nd_binning, a new "standardization" function, and a new "error_map" function; then another one that combines sample_empirical_variogram, fit_sum_model_variogram and neff to get the error in an area? Something along these lines.

Yes, I think it is important to have a simple pipeline that runs all the steps.

rhugonnet · 2022-08-09T08:54:28Z

Merging now that comments are accounted for.
The error pipeline + new documentation will be done in a separate PR.

rhugonnet added 28 commits March 27, 2021 14:26

Merge pull request #6 from GlacioHack/main

f7f6108

Update upstream

Merge branch 'GlacioHack:main' into main

9c97f54

Merge branch 'GlacioHack:main' into main

3a48737

Merge remote-tracking branch 'upstream/main'

c7f18be

Merge remote-tracking branch 'upstream/main'

bcd5163

fix deramp residual and optimizer

864fabd

Merge remote-tracking branch 'upstream/main'

cad1f68

Merge remote-tracking branch 'upstream/main'

561e4f2

Merge remote-tracking branch 'upstream/main'

270060e

Merge remote-tracking branch 'upstream/main'

3acaa2b

Merge remote-tracking branch 'upstream/main'

66c4ff9

Merge remote-tracking branch 'upstream/main'

3044b37

Merge remote-tracking branch 'upstream/main'

fc957ea

Merge remote-tracking branch 'upstream/main'

d238a36

Merge remote-tracking branch 'upstream/main'

0d8e922

Remove homemade variogram functions to consistently use those of skgstat.models

8386bae

Update variogram function to skgstat

844e617

Adjust tests to updated variogram functions

b1529b1

Change spread metric to NMAD for consistency across gallery examples

500293a

Improve function descriptions, fix or improve minor issues with parameter input

4c62706

Add test using multiple cores for variogram estimation

582407f

Improve neff exact integration: step commit

9bb376b

Improve neff functions and test on exact and numerical circular integration

c84e4fb

Add exact integration and test for cubic variogram model

4bcdb46

Add correlation and covariance functions

e662601

Add vectorized computation for double covariance sum and homogenize function checks, names, descriptions

9d3c477

Vectorize covariance calculations and add associated tests

1a16938

Update variogram and neff function names and parameters

bee10b3

rhugonnet requested review from adehecq and erikmannerfelt August 5, 2022 13:28

rhugonnet added 4 commits August 5, 2022 16:20

Fix line number of neff code snapshot

2ef9bb0

Fix sphinx syntax errors

0d51ee5

Fix the passing of dataframe values

12fb254

Add tests, fix minor issues and sphinx syntax

8086d04