
Shorten tests and vectorize patches_method #292

Merged
merged 35 commits into from
Sep 8, 2022

Conversation

@rhugonnet (Contributor) commented Aug 25, 2022

Summary

The test part of the CI now runs in 6min (including the test on documentation building), instead of 20min before! We should be able to get that under 3min once the terrain.py functions are vectorized 😄

This PR improves the speed of many tests, especially those linked to functions of spatialstats.py and fit.py that require heavy processing during optimization or sampling.
The patches_method is reworked and can now be performed by convolution. This dramatically increases computing speed relative to the sample size drawn; however, if the convolution is performed on an entire raster, it can still be slower than random patch sampling.
Two points are left to a later PR: improving the speed of terrain.py functions by convolution, which will naturally speed up the related tests and examples, and improving the speed of test_coreg.py, which will improve after the advances planned for the module in the coming weeks/months.
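The convolution-based reworking of patches_method can be illustrated with a minimal, hypothetical sketch (not xdem's actual implementation): the mean of every possible square patch is obtained in one vectorized pass by convolving with a uniform kernel, instead of looping over randomly drawn patches.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(42)
dh = rng.normal(size=(100, 100))  # stand-in for an elevation-difference array

k = 5  # patch width in pixels (illustrative value)
kernel = np.ones((k, k)) / k**2  # uniform averaging kernel

# "valid" mode keeps only patches fully inside the array
patch_means = fftconvolve(dh, kernel, mode="valid")

# Each convolution output equals the mean of the corresponding k x k patch
assert np.isclose(patch_means[0, 0], dh[:k, :k].mean())
```

This also shows the trade-off noted above: the convolution computes all patch means at once (a 96 x 96 grid here), which can be wasteful when only a few random patches are needed.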

@adehecq: It will be interesting to discuss where to put some of the functions listed further below, which probably fit better in geoutils.

As a change summary, this PR:

  1. Makes the data from the load_ref_and_diff function available to the test classes of test_spatialstats.py, then accesses the variables through e.g. self.ref to avoid duplicated loading in every function;
  2. Subsamples 10,000 points of the raster in test_nd_binning to reduce computing time;
  3. Saves and reads intermediate files during TestBinning, TestVariogram and TestNeffEstimation of test_spatialstats.py to avoid duplicating long processing steps;
  4. Moves TestBinning to the beginning of test_spatialstats.py so that point 3 works correctly;
  5. Adds a functionality in xdem.spatialstats.interp_nd_binning to recognize pd.Interval columns that are (inescapably) converted to strings when saving to CSV, to support pd.DataFrame objects initiated from files read on disk (now done in point 3);
  6. Improves the function xdem.spatialstats._choose_cdist_equidistant_sampling_parameters to respect the number of pairwise samples and to allow small sample sizes while avoiding skgstat errors (the minimum is 10);
  7. Improves tests for xdem.spatialstats._choose_cdist_equidistant_sampling_parameters to test the new functions;
  8. Constrains the xdem.spatialstats.sample_empirical_variogram calls with subsample=10 to reduce computing time;
  9. Constrains the number_effective_samples calls that depend on neff_hugonnet_approx with subsample=10 to reduce computing time;
  10. Fixes a small mistake in computing maxlag in xdem.spatialstats.sample_empirical_variogram;
  11. Adds the option of passing a niter argument to scipy.optimize.basinhopping in xdem.fit.robust_sumsin_fit;
  12. Constrains the test_fit.py functions using basinhopping to niter<25, depending on the test case (need to converge to the test function);
  13. Shortens the computing time of code lines for documentation in docs/source/code and adjusts the line changes in the docs;
  14. Introduces a wrapper for Raster objects for the patches_method;
  15. Homogenizes the old patches method with loop/quadrants into a _patches_loop_quadrants function;
  16. Adds a _patches_convolution method based on convolution.
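Point 5 addresses a CSV round-trip issue that can be sketched as follows (the parsing helper below is illustrative, not xdem's exact code): pd.Interval bins become strings such as "(0.0, 10.0]" when written to CSV, so they must be parsed back into pd.Interval objects after reading the file.

```python
import pandas as pd

def str_to_interval(s: str) -> pd.Interval:
    # Strip the bracket characters and split on the comma
    left, right = s.strip("[]()").split(",")
    # pd.Interval defaults to closed="right", matching its string form "(a, b]"
    return pd.Interval(float(left), float(right), closed="right")

interval = pd.Interval(0.0, 10.0)
s = str(interval)  # the form an interval takes in a CSV cell
assert str_to_interval(s) == interval
```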

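Point 11 can be sketched roughly as below; the wrapper name and default value are hypothetical, and only the pass-through of niter to scipy.optimize.basinhopping reflects the change described:

```python
import numpy as np
import scipy.optimize

def fit_with_basinhopping(cost, x0, niter: int = 100):
    # niter is forwarded straight to basinhopping; a small value
    # trades exploration for speed, which is what the tests exploit
    return scipy.optimize.basinhopping(cost, x0, niter=niter)

# A simple convex cost: the global minimum is at the origin
cost = lambda x: np.sum(x**2)
res = fit_with_basinhopping(cost, x0=np.array([3.0, -2.0]), niter=5)
assert cost(res.x) < 1e-4
```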
Additionally, this PR adds some functions that might be good to move to geoutils:

  • The conversion from a Raster or ndarray input plus an exclusion/inclusion mask (as Vector, np.ndarray or gpd.GeoDataFrame) into a 1D np.ndarray of included terrain, or a 2D np.ndarray with NaNs on excluded terrain, is now consistently performed by the function _preprocess_values_with_mask_to_array (used in the infer_... functions and patches_method, for now).
  • A convolution function wrapper that calls either _scipy_convolution or _numba_convolution. Scipy has methods that are quite efficient for large arrays with any kernel size, while numba is very fast for small kernel sizes (source: https://laurentperrinet.github.io/sciblog/posts/2017-09-20-the-fastest-2d-convolution-in-the-world.html). Strangely, the numba convolution currently fails with a segmentation fault, and the error has been impossible to trace so far...
  • A mean_filter_nan function that adapts arrays containing NaNs to compute the mean and count of valid samples by convolution.
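A minimal sketch of the NaN-aware filtering idea behind mean_filter_nan (assumed from the description above, not the actual implementation): zero-fill the NaNs, convolve the values and a validity mask separately, and divide the two to get the mean over valid samples only, with the mask convolution giving the valid count.

```python
import numpy as np
from scipy.signal import fftconvolve

arr = np.array([[1.0, np.nan, 3.0],
                [4.0, 5.0, np.nan],
                [7.0, 8.0, 9.0]])
kernel = np.ones((3, 3))

valid = np.isfinite(arr).astype(float)         # 1 where data is valid, 0 at NaNs
filled = np.where(np.isfinite(arr), arr, 0.0)  # zero-fill the NaNs

sums = fftconvolve(filled, kernel, mode="valid")   # sum of valid values per window
counts = fftconvolve(valid, kernel, mode="valid")  # number of valid values per window

mean = sums / counts  # NaN-aware mean over each window
assert np.isclose(mean[0, 0], np.nanmean(arr))
```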

Resolves #289
Resolves #294
Resolves #284

To-do list:

The twelve labors of Hercules (goal: reduce each test under 1s, or up to 5s for fitting/sampling tests that require more processing):

  • (Now 5s) 55.24s call tests/test_fit.py::TestRobustFitting::test_robust_simsin_fit_noise_and_outliers
  • (Now 5s) 39.07s call tests/test_fit.py::TestRobustFitting::test_robust_sumsin_fit
  • (Now 11.21s) 32.04s call tests/test_docs.py::TestDocs::test_example_code
  • (Now 0.91s) 21.46s call tests/test_spatialstats.py::TestNeffEstimation::test_spatial_error_propagation
  • (Now 3.12s) 16.34s call tests/test_spatialstats.py::TestVariogram::test_sample_multirange_variogram_default
  • (Done by beating the four horsemen of the apocalypse) 800s+ call tests/test_docs.py::TestDocs::test_build
  • (Now 0.05s) 13.05s call tests/test_spatialstats.py::TestBinning::test_interp_nd_binning
  • (Now 0.21s) 11.99s call tests/test_spatialstats.py::TestBinning::test_nd_binning
  • (Unchanged) 9.95s call tests/test_coreg.py::TestCoregClass::test_blockwise_coreg_large_gaps
  • (Now 3.98s) 9.75s call tests/test_spatialstats.py::TestVariogram::test_estimate_model_spatial_correlation_and_infer_from_stable
  • (Unchanged) 9.70s call tests/test_coreg.py::TestCoregClass::test_icp_opencv
  • (Unchanged) 9.22s call xdem/terrain.py::xdem.terrain.aspect

The four horsemen of the apocalypse (first three unchanged to avoid damaging the quality of the examples):

  • plot-spatial-error-propagation-py: 0 minutes 32.042 seconds
  • plot-heterosc-estimation-modelling-py: 0 minutes 41.727 seconds
  • plot-standardization-py: 0 minutes 39.583 seconds
  • (Now 34s) plot-variogram-estimation-modelling-py: 6 minutes 46.937 seconds (because of patches_method)

@rhugonnet rhugonnet marked this pull request as draft August 25, 2022 21:43
@rhugonnet rhugonnet changed the title Shorten tests and examples Shorten tests and vectorize patches_method Aug 31, 2022
@rhugonnet rhugonnet marked this pull request as ready for review September 4, 2022 21:22
@rhugonnet (Contributor, Author) commented Sep 4, 2022

Ready for review!
After a lot of searching, there were actually three separate issues combined that made our tests fail: two now opened as GlacioHack/geoutils#293 and GlacioHack/geoutils#294, and the propagation of scipy.optimize.least_squares output floating-point precision into the Longyearbyen dDEM.

To fix them in this PR:

(Resolved review threads on .gitignore, xdem/coreg.py and xdem/spatialstats.py.)
@adehecq (Member) commented Sep 7, 2022

> Additionally, this PR adds some functions that might be good to move to geoutils: […]

We should indeed transfer all the convolution functionalities to geoutils in the long term I think (when the numba convolution works!). Maybe just raise an issue for the time being?
The first functionality is useful, but I'm not sure exactly where it would fit. It's essentially handling many different cases.

adehecq previously approved these changes Sep 7, 2022

@adehecq (Member) left a comment:
Great job speeding up all these tests!!

@rhugonnet (Contributor, Author) commented:

I'm force merging because tests passed two days ago on the finalized PR.

However, some tests now fail in coreg.py due to the GeoUtils PR we merged yesterday (GlacioHack/geoutils#300); we should address this in a different PR.

@rhugonnet rhugonnet merged commit a8acdce into GlacioHack:main Sep 8, 2022
@rhugonnet rhugonnet deleted the shorten_tests branch September 8, 2022 12:48