Add robust polynomial, sum of sinusoids fitting #151

Merged · 39 commits · Sep 8, 2021

Conversation

@rhugonnet (Contributor) commented Jun 22, 2021

Polynomial fitting + sum-of-sinusoids fitting + across/along-track sampling

Resolves #50

coefs, deg = xdem.spatial_tools.robust_polynomial_fit(x, y, estimator='Theil-Sen')

Two robust polynomial fitting solutions

1/ One combining sklearn.linear_model and sklearn.preprocessing.PolynomialFeatures: it fits a polynomial with the robust estimators Linear Regression, Theil-Sen (a median-based approach), RANSAC or Huber.
2/ One using scipy.optimize.least_squares with a specific loss function; some loss functions are quite robust to outliers (see example in tests).

Both are implemented in the same function.

Linear regression can be solved with either scipy or sklearn, while Theil-Sen, RANSAC and Huber are only available through sklearn. A sketch of both approaches follows below.
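A minimal sketch of the two approaches on synthetic data with outliers (illustrative only, not the internals of robust_polynomial_fit):

    import numpy as np
    from scipy.optimize import least_squares
    from sklearn.linear_model import TheilSenRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic quadratic data with a few strong outliers
    rng = np.random.default_rng(42)
    x = np.linspace(0, 10, 200)
    y = 2 + 0.5 * x - 0.1 * x**2 + rng.normal(0, 0.1, x.size)
    y[::20] += 10

    # 1/ sklearn: polynomial features fed to a robust estimator
    model = make_pipeline(PolynomialFeatures(degree=2), TheilSenRegressor())
    model.fit(x.reshape(-1, 1), y)

    # 2/ scipy: least_squares with an outlier-robust loss function
    def residuals(coefs, x, y):
        return np.polyval(coefs, x) - y

    result = least_squares(residuals, x0=np.ones(3), args=(x, y), loss="soft_l1")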

Input/Output

Simple input: x and y, a choice of estimator, a cost function cost_func (by default, the median absolute error), and a few other options described in the docs.
Simple output: the polynomial degree as an integer, and the coefficients as a vector.

Choosing the best polynomial

Simply using the polynomial that has the lowest cost (the least spread between true and predicted values) is known not to be a good approach for choosing the optimal degree, as it can lead to overfitting. Here I wrote a simple function that selects the polynomial of smallest degree within a percentage margin of the "best cost" found by the fit.

So, for instance, if a degree 1 fit has a cost of 100, a degree 2 fit a cost of 20, a degree 3 fit a cost of 5, and the degree 4 to 6 fits have costs between 4 and 5, the function (with a margin of 20% by default) will select degree 3 as the optimal solution. A sketch of this rule follows below.
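A sketch of that selection rule (choose_degree is a hypothetical helper, not the PR's actual function):

    def choose_degree(costs, margin=0.2):
        """Return the smallest degree whose cost is within `margin` of the best cost.

        costs[i] is the fit cost for polynomial degree i + 1.
        """
        best = min(costs)
        for degree, cost in enumerate(costs, start=1):
            if cost <= best * (1 + margin):
                return degree

    # The example from the text: costs for degrees 1 to 6
    print(choose_degree([100, 20, 5, 4.5, 4.2, 4.8]))  # -> 3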

EDIT:

TODO list to finalize the PR

  • Fix tests
  • Re-organize estimator subfunction wrappers to be more generic
  • Refactor poly and sumofsin functions to call new wrapper functions
  • Add kwargs argument that can be identified to any subfunction call (same logic as in spatialstats.py)
  • Fix circular import
  • Move functions to new module fit.py, following "Sort the mess in spatial_tools.py" (#157)

@erikmannerfelt (Contributor):

Nice! I presume this is the first step toward your bias corrections?

Is scikit-learn already installed due to some other dependency? It is not a direct dependency in the environment file.

@rhugonnet (Contributor, Author):

> Nice! I presume this is the first step toward your bias corrections?
>
> Is scikit-learn already installed due to some other dependency? It is not a direct dependency in the environment file.

Yes, the first step towards a series of BiasCorr classes. I thought it would make more sense to partition it, especially as those 1D fitting functions can be used for other applications!

Oops! I didn't check; it was in my local xdem environment for some reason.
I guess the main question then is: should we have scikit-learn as a dependency, @adehecq @erikmannerfelt?
It's becoming very popular (linked to Dask, for example), and it has a lot of great statistical tools (including Gaussian Processes), so I'm clearly for it 😄

@adehecq (Member) commented Jun 22, 2021

Looks very promising!! :-)

Regarding the dependency on scikit-learn (and other packages in general), I believe that unless not importing the module makes xdem useless (e.g. rasterio), we should not make it a hard dependency. So basically, it should only be imported within the functions where it is needed.
We could have two requirement files, one with only the hard dependencies and one with all of them, so that it's easier to install if one wants to use all the xdem functionality.

@rhugonnet (Contributor, Author):

For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked it in the function, as @erikmannerfelt did for cv2 and richDEM.
Is this what you meant by "importing within the function", @adehecq? I'm not sure of the best practice here.

Should we open an issue (improvement) for creating a file for "full environment" and one for "minimal environment"?
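The flag pattern looks roughly like this (a sketch, not the exact code from 78e2ef7):

    # At module level: try the optional import once, record whether it worked
    try:
        import sklearn
        _has_sklearn = True
    except ImportError:
        _has_sklearn = False

    def robust_polynomial_fit(x, y, estimator="Theil-Sen"):
        # Check the flag and raise an explicit, traceable error for the user
        if not _has_sklearn:
            raise ValueError("Optional dependency needed. Install 'scikit-learn'.")
        ...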

@adehecq (Member) commented Jun 22, 2021

> For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked it in the function, as @erikmannerfelt did for cv2 and richDEM. Is this what you meant by "importing within the function", @adehecq? I'm not sure of the best practice here.
>
> Should we open an issue (improvement) for creating a file for a "full environment" and one for a "minimal environment"?

My idea was to have the import statement within the function directly. I was discussing it with Fabien and there are pros/cons to each approach:

  • with the current approach, if you don't catch and raise a specific error (as you did perfectly here!), the error becomes difficult to trace for the user. So we need to make sure that the error is caught everywhere in the code.
  • with the import within the function, it's more transparent and an ImportError is raised when calling the function, which is then easy to understand (see the sketch below). But if using the function in multiple processes, the module will be loaded each time...
    Any opinion on this, @erikmannerfelt?
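For comparison, the in-function alternative would look like this (a sketch):

    def robust_polynomial_fit(x, y, estimator="Theil-Sen"):
        # Imported only when the function is called: a missing dependency
        # raises a plain ImportError right here, which is easy to understand
        import sklearn.linear_model
        ...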

@erikmannerfelt (Contributor) left a comment:

Nice functionality! I have some quite small remarks, but generally I like it!

@@ -398,6 +408,149 @@ def hillshade(dem: Union[np.ndarray, np.ma.masked_array], resolution: Union[floa
# The output is scaled by "(x + 0.6) / 1.84" to make it more similar to GDAL.
return np.clip(255 * (shaded + 0.6) / 1.84, 0, 255).astype("float32")

def get_xy_rotated(raster: gu.georaster.Raster, myang: float) -> tuple[np.ndarray, np.ndarray]:
"""
Rotate x, y axes of image to get along- and cross-track distances.
Contributor:

This returns pixel coordinates rotated around the lower left corner, right? Where do the "cross-track distances" come in?

Contributor:

It could also forego the raster class (if we want) by using xdem.coreg._get_x_and_y_coords on a transform and the shape of the array. I don't know if that is necessary or better; just a suggestion.

Member:

Yes, it could be a nice idea to combine both with an optional rotation argument.
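For reference, the rotation being discussed is a standard 2D rotation of the pixel coordinate grids; a sketch assuming coordinates relative to the lower-left corner (a hypothetical helper, not the PR's get_xy_rotated):

    import numpy as np

    def rotate_xy(xx, yy, angle):
        """Rotate coordinate grids by `angle` (degrees) around the origin,
        here taken as the lower-left corner of the raster."""
        theta = np.deg2rad(angle)
        xx_rot = xx * np.cos(theta) - yy * np.sin(theta)
        yy_rot = xx * np.sin(theta) + yy * np.cos(theta)
        # With the angle aligned to the satellite track, one rotated axis
        # gives along-track distance and the other across-track distance
        return xx_rot, yy_rot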

@erikmannerfelt (Contributor):

> > For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked it in the function, as @erikmannerfelt did for cv2 and richDEM. Is this what you meant by "importing within the function", @adehecq? I'm not sure of the best practice here.
> > Should we open an issue (improvement) for creating a file for a "full environment" and one for a "minimal environment"?
>
> My idea was to have the import statement within the function directly. I was discussing it with Fabien and there are pros/cons to each approach:
>
> • with the current approach, if you don't catch and raise a specific error (as you did perfectly here!), the error becomes difficult to trace for the user. So we need to make sure that the error is caught everywhere in the code.
> • with the import within the function, it's more transparent and an ImportError is raised when calling the function, which is then easy to understand. But if using the function in multiple processes, the module will be loaded each time...
>   Any opinion on this, @erikmannerfelt?

I think it is generally advised to import at the top level, because running the function multiple times would otherwise run the imports multiple times. This is not very slow, but it's also not instantaneous; see StackOverflow for reference. I think both approaches are fine; as you mention, there are pros and cons to both. I think we should just be consistent, and for now we have the _has_blabla syntax in other places.

@erikmannerfelt (Contributor):

> For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked it in the function, as @erikmannerfelt did for cv2 and richDEM. Is this what you meant by "importing within the function", @adehecq? I'm not sure of the best practice here.
>
> Should we open an issue (improvement) for creating a file for a "full environment" and one for a "minimal environment"?

My two cents are that environment.yml should contain all optional dependencies as well. Most people (that are not developers) will either use pip, whereby the dependencies/optionals are parsed from setup.py, or conda-forge, whereby they are read from the conda recipe (which only specifies the mandatory dependencies, thus skipping the optionals). Thus, environment.yml fills a narrow niche: it is mostly used by people who want to develop, who want main instead of the latest release, or who want to modify the code. In these cases, I think there's no real drawback to having them install the optional dependencies as well (see the setup.py sketch below).

Thoughts, @adehecq and @rhugonnet ?
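For context, declaring scikit-learn as an optional dependency in setup.py would look roughly like this (a sketch; the dependency lists are illustrative, not xdem's actual ones):

    from setuptools import find_packages, setup

    setup(
        name="xdem",
        packages=find_packages(),
        install_requires=["numpy", "scipy", "rasterio"],  # mandatory
        extras_require={"opt": ["scikit-learn"]},  # pip install xdem[opt]
    )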

@rhugonnet (Contributor, Author):

> My two cents are that environment.yml should contain all optional dependencies as well. Most people (that are not developers) will either use pip, whereby the dependencies/optionals are parsed from setup.py, or conda-forge, whereby they are read from the conda recipe (which only specifies the mandatory dependencies, thus skipping the optionals). Thus, environment.yml fills a narrow niche: it is mostly used by people who want to develop, who want main instead of the latest release, or who want to modify the code. In these cases, I think there's no real drawback to having them install the optional dependencies as well.
>
> Thoughts, @adehecq and @rhugonnet?

Fully agree, setup.py will do the job for the mandatory ones (I almost forgot about it).

@adehecq (Member) commented Jun 22, 2021

> > For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked it in the function, as @erikmannerfelt did for cv2 and richDEM. Is this what you meant by "importing within the function", @adehecq? I'm not sure of the best practice here.
> > Should we open an issue (improvement) for creating a file for a "full environment" and one for a "minimal environment"?
>
> My two cents are that environment.yml should contain all optional dependencies as well. Most people (that are not developers) will either use pip, whereby the dependencies/optionals are parsed from setup.py, or conda-forge, whereby they are read from the conda recipe (which only specifies the mandatory dependencies, thus skipping the optionals). Thus, environment.yml fills a narrow niche: it is mostly used by people who want to develop, who want main instead of the latest release, or who want to modify the code. In these cases, I think there's no real drawback to having them install the optional dependencies as well.
>
> Thoughts, @adehecq and @rhugonnet?

Just to clarify, where is the information on the dependencies for conda-forge stored? I thought it was in environment.yml. If not, then I agree that this file is a good place to put all the optional requirements. But in that case, we could probably merge it with dev-environment.yml, if it's only to be used by people developing/testing the code.
And what is the point of dev-requirements.txt?
In any case, it would be great to have a place where we explain where dependencies should be added. For now there are at least 3-4 files that would need updating, no?

@adehecq (Member) commented Jun 22, 2021

> I think generally it is advised to import at the top-level because running the function multiple times would otherwise run the imports multiple times.

But a module that was previously imported is not imported again, so the function will take a little more time on the first call, but after that it should be almost the same, no?
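For reference, Python does cache imports: after the first import, the module lives in sys.modules and re-importing is a near-instant lookup, as this quick check illustrates:

    import sys
    import time

    t0 = time.perf_counter()
    import sklearn  # first import: pays the full module-loading cost
    t1 = time.perf_counter()
    import sklearn  # already cached in sys.modules: near-instant lookup
    t2 = time.perf_counter()

    print(f"first: {t1 - t0:.2f} s, cached: {t2 - t1:.6f} s")
    assert "sklearn" in sys.modules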

@adehecq (Member) commented Jun 29, 2021

Looks great! This makes me think we could use your robust polynomial fit in the coreg deramp.
I have some minor comments, but after that it's good for me.

@adehecq (Member) commented Jul 27, 2021

@rhugonnet what's the status of this PR?

@rhugonnet (Contributor, Author):

> @rhugonnet what's the status of this PR?

On it, trying to homogenize things and push a final version!

@rhugonnet (Contributor, Author):

Again, I have an assertion error that happens only in CI, while everything passes locally... and this is for a calculation with a fixed random_state.
So glad to be back 😭

@rhugonnet (Contributor, Author):

@adehecq @erikmannerfelt All ready for approval to be merged, except for that one test that seems to fail randomly in CI (while it never does locally; the random_state is fixed and gives a constant result...).

@rhugonnet (Contributor, Author):

I'm still desperately trying to understand just the base: why 🙏

@rhugonnet (Contributor, Author):

> I'm still desperately trying to understand just the base: why

After a few hours, I still can't trace it back. 😢
I'm giving up for now; let's just open an issue that references this comment.
Description here:

test_robust_sumsin_fit randomly fails, in CI only, on the following assertion:

        for i in range(2):
            assert coefs[3*i] == pytest.approx(true_coefs[3*i], abs=0.02)

The function call is based on scipy.optimize.basinhopping. I cannot reproduce the issue locally, even when setting up the xdem-dev environment the same way as in CI.
This is very strange because the test uses a random state (random_state=42) that is passed both to xdem.spatial_tools.subsample_raster and to basinhopping (its seed argument), and locally the output is indeed not subject to any random variation (it is constant).
The error seems to occur 75% of the time in CI. When it does, the scipy version (1.7.1) and numpy version (1.20.3) are exactly the same as the ones I have locally, where there is no error after 15+ runs. So the issue doesn't seem to come from there either...
The mystery remains!
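For reference, the seeding in question looks roughly like this (a sketch with a placeholder cost function, not the actual test code):

    import numpy as np
    import scipy.optimize

    random_state = 42  # also passed to subsample_raster in the test

    def cost(params):
        # Placeholder for the sum-of-sinusoids misfit minimized in the test
        return float(np.sum(params**2))

    # `seed` fixes basinhopping's internal random displacements
    result = scipy.optimize.basinhopping(cost, x0=np.array([1.0, 1.0, 1.0]),
                                         seed=random_state)

One unconfirmed possibility: even with identical seeds and library versions, basinhopping's local minimizations run through compiled BLAS/LAPACK routines, so tiny floating-point differences between the CI and local builds could send the optimizer down a different path.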

@adehecq (Member) left a comment:

Great job!
I haven't looked in detail at your latest changes, but if you took our comments into account, it should be alright.
It's annoying that you couldn't find the cause of the randomly failing test...

@rhugonnet rhugonnet merged commit bd40f92 into GlacioHack:main Sep 8, 2021
@rhugonnet rhugonnet deleted the angle_binning branch September 8, 2021 07:53
Development

Successfully merging this pull request may close these issues:

Add polynomial fit, sinusoidal fit, along/across-track sampling