Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first/last: getting first/last element of empty array should error #408

Closed
LukeWeidenwalker opened this issue Feb 27, 2023 · 10 comments
Closed
Labels
wontfix This will not be worked on
Milestone

Comments

@LukeWeidenwalker
Copy link
Contributor

LukeWeidenwalker commented Feb 27, 2023

Process ID: first/last

Describe the issue:
Currently, according to example 4 in the docs, calling first or last on an empty array should return null:

Example 4
The input array is empty: return null.
first(data = []) => null

In our implementation this is causing problems when calling first or last as a reducer on an empty datacube, because it suggests that the process should return the value null and return that as the reduced array (i.e. reduced dimension ending up as [null]). In numpy this is only possible for float-type arrays, where I can return [np.nan], but for all other datatypes, such a missing data value does not exist.

Proposed solution:
IMHO first and last should just error if the array is empty, similar to how calling array[0] on an empty array will result in an index out of bounds exception.

Additional context:
This is also how numpy/xarray handles this case:
>>> np.take(np.array([]), indices=0)
IndexError: cannot do a non-empty take from an empty axes.

PR in openeo-processes-dask

cc @ValentinaHutter

@m-mohr
Copy link
Member

m-mohr commented Feb 27, 2023

In "dimension-based" reducers such as reduce_dimension this case doesn't actually happen as empty dimensions are not a thing in openEO as they would be dropped if empty. But, this case can happen in e.g. aggregate_spatial where you get am empty array when you have no data for a given polygon. In this case you need to return something as "void" is not defined in openEO, every process needs to return something. So I'm not sure how I can fix this. Also, this behavior is actually defined in all reducers, e.g. mean, median, max, min, sum etc. Can't you just check if the array is empty and return null in this special case (or catch the error)?

@m-mohr m-mohr added wontfix This will not be worked on and removed bug patch labels Feb 27, 2023
@m-mohr m-mohr closed this as completed Feb 27, 2023
@m-mohr m-mohr reopened this Feb 27, 2023
@m-mohr m-mohr closed this as not planned Won't fix, can't repro, duplicate, stale Feb 27, 2023
@LukeWeidenwalker
Copy link
Contributor Author

empty dimensions are not a thing in openEO

Okay, but first/last have the ignore_nodata parameter, so a dimension with only NaN values would effectively lead to the same problem!

Also, this behavior is actually defined in all reducers, e.g. mean, median, max, min, sum etc.

True, but it's effectively only a problem for the not-only-numerical reducers (e.g. first/last, any/all). For the numerical-only reducers it makes sense to return np.nan and have the array be implicitly cast to float in the parent process. So a potential change would only need to touch those four reducers.

  1. I think the core problem here is that there seems to be a hidden assumption that all datacube implementations (ours being based on numpy/xarray) have a native nodata value for all types of arrays. Afaik this is the case in R, but this doesn't exist in numpy (I just read NEP-26, this has been discussed for a decade, but there's currently only np.nan for float arrays).

  2. More pragmatically, I'd argue that it's not necessary or intuitive for first/last or any/all to ever return a null-value. For first/last, I'd throw an error if nothing is available. For any/all we can simply define what happens and return a boolean in all cases. For reference, in Python it works like this: np.array([]).any() -> False and np.array([]).all() -> True.

@m-mohr
Copy link
Member

m-mohr commented Feb 27, 2023

ignore_nodata is available in most numerical reducers. If that's set to true, it indeed can lead to the same issue. That's why it's probably a good idea to return a value.

Throwing an error doesn't make sense in the aggregate_spatial use case, I think. If there's no data it should indicate that there's no data instead of aborting the whole workflow.

Again, can't you handle the special cases explicitly in code?

@LukeWeidenwalker
Copy link
Contributor Author

ignore_nodata is available in most numerical reducers. If that's set to true, it indeed can lead to the same issue. That's why it's probably a good idea to return a value.

I agree that for these reducers it's fine to return a null value, because we can use the float NaN for that. But that's not the case for these other reducers.

Throwing an error doesn't make sense in the aggregate_spatial use case

I think the error would be thrown by the reducer and caught within aggregate_spatial, so it doesn't mean that the whole workflow has to be aborted!

Again, can't you handle the special cases explicitly in code?

I can't think of a good way to do it. E.g. I don't want to go into reduce_dimension, reduce_spatial and aggregate_spatial and just blanket catch any such impossible type coercions that arose from the reducer.

@m-mohr
Copy link
Member

m-mohr commented Feb 28, 2023

I agree that for these reducers it's fine to return a null value, because we can use the float NaN for that. But that's not the case for these other reducers.

I'm really not into this, but can't you set None/null for other data types? Alternatively use a masked array or so?

I think the error would be thrown by the reducer and caught within aggregate_spatial, so it doesn't mean that the whole workflow has to be aborted!

Ah, if you are referring to throwing an error in the context of openEO processes I assume it means openEO error, which always aborts execution. If it's possible to recover, you can still catch it internally and make it a warning or so, sure. But if you catch, you'd eventually also just set some kind of no-data values for the geometry that failed, which is essentially what the process wants you to do, no?

I can't think of a good way to do it. E.g. I don't want to go into reduce_dimension, reduce_spatial and aggregate_spatial and just blanket catch any such impossible type coercions that arose from the reducer.

Aren't we here just talking about empty arrays? I'm already seeing if len(data) == 0: return np.nan. What's still missing? It feels like I'm missing something crucial while looking at Open-EO/openeo-processes-dask#66 ...

@LukeWeidenwalker
Copy link
Contributor Author

But if you catch, you'd eventually also just set some kind of no-data values for the geometry that failed, which is essentially what the process wants you to do, no?

Yes, but I can only do that within the parent process, not from within the child process! From the perspective of the child process, this operation isn't doable, so it should error out. The first item of an empty array is not a null value!

openEO error, which always aborts execution

Hmm - is this set in stone? I think it does make sense for a Child process graph to throw a standardised OpenEO error and its parent process attempting to catch it.

Aren't we here just talking about empty arrays? I'm already seeing if len(data) == 0: return np.nan. What's still missing? It feels like I'm missing something crucial while looking at Open-EO/openeo-processes-dask#66

Ah, the crucial piece is that the code as proposed in this PR leads to the same issue I raise in #410, In short, np.nan is a float64, so if I attempt to grab the first item from a string array and it doesn't exist, I would suddenly cast my output array to float64. This can lead to unexpected behaviour further down the line, because my dtypes are no longer what I'd expect them to be.

@LukeWeidenwalker
Copy link
Contributor Author

@m-mohr could we reopen this one please? Forgot about it yesterday because it's closed, but it does seem worth discussing in the next telco!

@m-mohr m-mohr reopened this Mar 15, 2023
@m-mohr
Copy link
Member

m-mohr commented Mar 15, 2023

Sure, although I'm not sure what exactly to discuss? Isn't this the basically same issue (regarding nodata) and "solution" as in #410?

@m-mohr m-mohr added this to the 2.0.0 milestone Mar 15, 2023
@LukeWeidenwalker
Copy link
Contributor Author

Update:
Ignore all of my argument about nodata, we'll have to addess that separately.
However, I'm still not convinced that the equivalent of array[0] on an empty array should return anything at all, rather than throw an IndexOutOfBoundException.

@m-mohr
Copy link
Member

m-mohr commented Mar 30, 2023

But an empty array means there is no data, i.e. the process indicates that there is no data by returning null, which is the nodata value. Doesn't that makes sense?

This is how openEO tries to work meaningful on raster data cubes without always throwing errors and making use of no-data values. Other processes also return null on empty arrays, e.g. all, any, array_find, max, min, ... basically every reducer except for count. Deviating from this for just first and last doesn't seem very consistent. If you want to enforce an error on empty arrays, use array_element instead.

@m-mohr m-mohr closed this as not planned Won't fix, can't repro, duplicate, stale Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants