
Shorten tests and vectorize patches_method #292

Merged
merged 35 commits into from
Sep 8, 2022

Conversation

@rhugonnet (Contributor) commented Aug 25, 2022

Summary

The test part of the CI now runs in 6min (including the test on documentation building), instead of 20min before! We should be able to get that under 3min once the terrain.py functions are vectorized 😄

This PR improves the speed of many tests, especially those linked to functions of spatialstats.py and fit.py that require heavy processing during optimization or sampling.
The patches_method is reworked and can now be performed by convolution. This dramatically increases computing speed relative to the sample size drawn; however, if the convolution is performed on an entire raster, it can still be slower than random patch sampling.
Two points are left to a later PR: improving the speed of terrain.py functions by convolution, which will naturally speed up the related tests and examples, and improving the speed of test_coreg.py, which will improve after the advances planned for the module in the coming weeks/months.
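The convolution-based reworking of patches_method can be illustrated with a minimal, hypothetical sketch (not xdem's actual implementation): the mean of every possible square patch is obtained in one vectorized pass by convolving with a uniform kernel, instead of looping over randomly drawn patches.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(42)
dh = rng.normal(size=(100, 100))  # stand-in for an elevation-difference array

k = 5  # patch width in pixels (illustrative value)
kernel = np.ones((k, k)) / k**2  # uniform averaging kernel

# "valid" mode keeps only patches fully inside the array
patch_means = fftconvolve(dh, kernel, mode="valid")

# Each convolution output equals the mean of the corresponding k x k patch
assert np.isclose(patch_means[0, 0], dh[:k, :k].mean())
```

This also shows the trade-off noted above: the convolution computes all patch means at once (a 96 x 96 grid here), which can be wasteful when only a few random patches are needed.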

@adehecq: It will be interesting to discuss where to put some of the functions listed further below, which probably fit better in geoutils.

As a change summary, this PR:

  1. Makes the data from the load_ref_and_diff function available to the test classes of test_spatialstats.py, then accesses the variables through e.g. self.ref to avoid duplicated loading in every function;
  2. Subsamples 10,000 points of the raster in test_nd_binning to reduce computing time;
  3. Saves and reads intermediate files during TestBinning, TestVariogram and TestNeffEstimation of test_spatialstats.py to avoid duplicating long processing steps;
  4. Moves TestBinning to the beginning of test_spatialstats.py so that point 3 works correctly;
  5. Adds a functionality in xdem.spatialstats.interp_nd_binning to recognize pd.Interval columns that are (inescapably) converted to strings when saving to CSV, to support pd.DataFrame objects initiated from files read on disk (now done in point 3);
  6. Improves the function xdem.spatialstats._choose_cdist_equidistant_sampling_parameters to respect the number of pairwise samples and to allow small sample sizes while avoiding skgstat errors (the minimum is 10);
  7. Improves tests for xdem.spatialstats._choose_cdist_equidistant_sampling_parameters to test the new functions;
  8. Constrains the xdem.spatialstats.sample_empirical_variogram calls with subsample=10 to reduce computing time;
  9. Constrains the number_effective_samples calls that depend on neff_hugonnet_approx with subsample=10 to reduce computing time;
  10. Fixes a small mistake in computing maxlag in xdem.spatialstats.sample_empirical_variogram;
  11. Adds the option of passing a niter argument to scipy.optimize.basinhopping in xdem.fit.robust_sumsin_fit;
  12. Constrains the test_fit.py functions using basinhopping to niter<25, depending on the test case (need to converge to the test function);
  13. Shortens the computing time of code lines for documentation in docs/source/code and adjusts the line changes in the docs;
  14. Introduces a wrapper for Raster objects for the patches_method;
  15. Homogenizes the old patches method with loop/quadrants into a _patches_loop_quadrants function;
  16. Adds a _patches_convolution method based on convolution.
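Point 5 addresses a CSV round-trip issue that can be sketched as follows (the parsing helper below is illustrative, not xdem's exact code): pd.Interval bins become strings such as "(0.0, 10.0]" when written to CSV, so they must be parsed back into pd.Interval objects after reading the file.

```python
import pandas as pd

def str_to_interval(s: str) -> pd.Interval:
    # Strip the bracket characters and split on the comma
    left, right = s.strip("[]()").split(",")
    # pd.Interval defaults to closed="right", matching its string form "(a, b]"
    return pd.Interval(float(left), float(right), closed="right")

interval = pd.Interval(0.0, 10.0)
s = str(interval)  # the form an interval takes in a CSV cell
assert str_to_interval(s) == interval
```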

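Point 11 can be sketched roughly as below; the wrapper name and default value are hypothetical, and only the pass-through of niter to scipy.optimize.basinhopping reflects the change described:

```python
import numpy as np
import scipy.optimize

def fit_with_basinhopping(cost, x0, niter: int = 100):
    # niter is forwarded straight to basinhopping; a small value
    # trades exploration for speed, which is what the tests exploit
    return scipy.optimize.basinhopping(cost, x0, niter=niter)

# A simple convex cost: the global minimum is at the origin
cost = lambda x: np.sum(x**2)
res = fit_with_basinhopping(cost, x0=np.array([3.0, -2.0]), niter=5)
assert cost(res.x) < 1e-4
```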
Additionally, this PR adds some functions that might be good to move to geoutils:

  • The conversion from a Raster or ndarray input plus an exclusion/inclusion mask (as Vector, np.ndarray or gpd.GeoDataFrame) into a 1D np.ndarray of included terrain, or a 2D np.ndarray with NaNs on excluded terrain, is now consistently performed by the function _preprocess_values_with_mask_to_array (used in the infer_... functions and patches_method, for now).
  • A convolution function wrapper that calls either _scipy_convolution or _numba_convolution. Scipy has methods that are quite efficient for large arrays with any kernel size, while numba is very fast for small kernel sizes (source: https://laurentperrinet.github.io/sciblog/posts/2017-09-20-the-fastest-2d-convolution-in-the-world.html). Strangely, the numba convolution currently fails with a segmentation fault, and the error has been impossible to trace so far...
  • A mean_filter_nan function that adapts arrays containing NaNs to compute the mean and count of valid samples by convolution.
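A minimal sketch of the NaN-aware filtering idea behind mean_filter_nan (assumed from the description above, not the actual implementation): zero-fill the NaNs, convolve the values and a validity mask separately, and divide the two to get the mean over valid samples only, with the mask convolution giving the valid count.

```python
import numpy as np
from scipy.signal import fftconvolve

arr = np.array([[1.0, np.nan, 3.0],
                [4.0, 5.0, np.nan],
                [7.0, 8.0, 9.0]])
kernel = np.ones((3, 3))

valid = np.isfinite(arr).astype(float)         # 1 where data is valid, 0 at NaNs
filled = np.where(np.isfinite(arr), arr, 0.0)  # zero-fill the NaNs

sums = fftconvolve(filled, kernel, mode="valid")   # sum of valid values per window
counts = fftconvolve(valid, kernel, mode="valid")  # number of valid values per window

mean = sums / counts  # NaN-aware mean over each window
assert np.isclose(mean[0, 0], np.nanmean(arr))
```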

Resolves #289
Resolves #294
Resolves #284

To-do list:

The twelve labors of Hercules (goal: reduce each test under 1s, or up to 5s for fitting/sampling tests that require more processing):

  • (Now 5s) 55.24s call tests/test_fit.py::TestRobustFitting::test_robust_simsin_fit_noise_and_outliers
  • (Now 5s) 39.07s call tests/test_fit.py::TestRobustFitting::test_robust_sumsin_fit
  • (Now 11.21s) 32.04s call tests/test_docs.py::TestDocs::test_example_code
  • (Now 0.91s) 21.46s call tests/test_spatialstats.py::TestNeffEstimation::test_spatial_error_propagation
  • (Now 3.12s) 16.34s call tests/test_spatialstats.py::TestVariogram::test_sample_multirange_variogram_default
  • (Done by beating the four horsemen of the apocalypse) 800s+ call tests/test_docs.py::TestDocs::test_build
  • (Now 0.05s) 13.05s call tests/test_spatialstats.py::TestBinning::test_interp_nd_binning
  • (Now 0.21s) 11.99s call tests/test_spatialstats.py::TestBinning::test_nd_binning
  • (Unchanged) 9.95s call tests/test_coreg.py::TestCoregClass::test_blockwise_coreg_large_gaps
  • (Now 3.98s) 9.75s call tests/test_spatialstats.py::TestVariogram::test_estimate_model_spatial_correlation_and_infer_from_stable
  • (Unchanged) 9.70s call tests/test_coreg.py::TestCoregClass::test_icp_opencv
  • (Unchanged) 9.22s call xdem/terrain.py::xdem.terrain.aspect

The four horsemen of the apocalypse (first three unchanged to avoid damaging the quality of the examples):

  • plot-spatial-error-propagation-py: 0 minutes 32.042 seconds
  • plot-heterosc-estimation-modelling-py: 0 minutes 41.727 seconds
  • plot-standardization-py: 0 minutes 39.583 seconds
  • (Now 34s) plot-variogram-estimation-modelling-py: 6 minutes 46.937 seconds (because of patches_method)

@rhugonnet rhugonnet marked this pull request as draft August 25, 2022 21:43
@rhugonnet rhugonnet changed the title Shorten tests and examples Shorten tests and vectorize patches_method Aug 31, 2022
@rhugonnet rhugonnet marked this pull request as ready for review September 4, 2022 21:22
@rhugonnet (Contributor, Author) commented Sep 4, 2022

Ready for review!
After a lot of searching, there were actually three separate issues combined that made our tests fail: two now opened as GlacioHack/geoutils#293 and GlacioHack/geoutils#294, and the propagation of scipy.optimize.least_squares output floating-point precision into the Longyearbyen dDEM.

To fix them in this PR:

(Resolved review threads on .gitignore, xdem/coreg.py and xdem/spatialstats.py.)
@adehecq (Member) commented Sep 7, 2022

> Additionally, this PR adds some functions that might be good to move to geoutils: […]

We should indeed transfer all the convolution functionalities to geoutils in the long term I think (when the numba convolution works!). Maybe just raise an issue for the time being?
The first functionality is useful, but I'm not sure exactly where it would fit. It's essentially handling many different cases.

adehecq previously approved these changes Sep 7, 2022

@adehecq (Member) left a comment:
Great job speeding up all these tests!!

@rhugonnet (Contributor, Author) commented:

I'm force merging because tests passed two days ago on the finalized PR.

However, some tests now fail in coreg.py due to the GeoUtils PR we merged yesterday (GlacioHack/geoutils#300); we should address this in a different PR.

@rhugonnet rhugonnet merged commit a8acdce into GlacioHack:main Sep 8, 2022
@rhugonnet rhugonnet deleted the shorten_tests branch September 8, 2022 12:48