Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a function to enrich STM using data from another dataset #66

Merged
merged 33 commits into from
May 6, 2024
Merged

Conversation

SarahAlidoost
Copy link
Contributor

@SarahAlidoost SarahAlidoost commented Mar 12, 2024

closes #63

This PR implements two functions:

  • stm.enrich_from_dataset: for both points and raster datasets
  • utils.crop: a util function for cropping a dataset, either points or raster

see #69 on why I changed _io.py in this PR.

Copy link
Member

@rogerkuou rogerkuou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sarah, thanks for the nice implementation! It works well in general.

I have two comments regarding adding more checks on the incoming data. See above.

These comments come from a test I made with KNMI data. I also attached the notebook and the small example dataset here.

example_test.zip

stmtools/stm.py Outdated Show resolved Hide resolved
stmtools/stm.py Outdated Show resolved Hide resolved
stmtools/stm.py Outdated

# do selection
indexers = {coord: ds[coord] for coord in list(datapoints.coords.keys())}
selections = datapoints.sel(indexers, method="nearest")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test using sel to get nearest time location, with datetime64[ns] format. This works under two conditions:

  1. For time dimension, in a specific combination: method=nearest + datetime64[ns] format, Xarray requires the time coordinates of ds to be monotonic increasing or decreasing.
  2. For lat and lon dimension, the duplicated values should not exists in the corresponding coordinates.

Can we add check in the beginning of enrich_from_dataset to enforce this?

An example check for time:

# This should be True
np.all(np.diff(ds['time'].values) > 0) or np.all(np.diff(ds['time'].values) < 0) 

An example check for lat and lon:

# Following should be True
np.unique(ds['lat'].values).shape == ds['lat'].values.shape
np.unique(ds['lon'].values).shape == ds['lon'].values.shape

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. It seems that xarray does not enforce cf conventions. I added two util functions to do the checks and call them in stem _enrich_from_points_block here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that with KDTree, these two conditions are not needed. So I removed the checks from _enrich_from_points_block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some tests for these conditions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rogerkuou to check if the points are unique, the test np.unique(ds['lat'].values).shape == ds['lat'].values.shape is not enough because it only checks the duplicates in one dimension here lat. However, for example, points can be located on one line.

Instead, we need a test if there are cases where (lat, lon, time) are duplicated. Functions like xarray.Dataset.drop_duplicates and pandas.DataFrame.duplicated can be used to write a test. But these functions only work on dim and not coords. In our cases, lat and lon are coords and space is the dim. So we might need to use unstack which leads to memory problems.

As discussed, scipy KDTree works if coordinates (lat, lon, time) are duplicated and the values of variables e.g. temperature are the same too, I added a test for this. If the values of variables are not the same for duplicated coordinates, MacOs and linux behave differently to pick up a value related to the nearest neighbor. For this extreme case, I suggest creating a new issue and fixing it in another PR.

@rogerkuou rogerkuou mentioned this pull request Apr 9, 2024
@SarahAlidoost
Copy link
Contributor Author

Hi Sarah, thanks for the nice implementation! It works well in general.

I have two comments regarding adding more checks on the incoming data. See above.

These comments come from a test I made with KNMI data. I also attached the notebook and the small example dataset here.

example_test.zip

thanks for the review. It is ready for another look.

Copy link
Member

@rogerkuou rogerkuou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @SarahAlidoost, thanks for the nice implementation!

I doced the follow-up issues in #75 and #76

I will merge it now.

@rogerkuou rogerkuou merged commit 2c62a0f into main May 6, 2024
16 checks passed
@SarahAlidoost SarahAlidoost deleted the fix_63 branch May 6, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Querying temporal attributes to an STM
2 participants