Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aggregate_spatial: crs of data and geometry mismatch undefined #499

Open
ValentinaHutter opened this issue Mar 25, 2024 · 10 comments
Open

Comments

@ValentinaHutter
Copy link

aggregate_spatial: crs of data and geometry mismatch undefined

Undefined crs handling:
In aggregate_spatial, the input parameters data and geometries specify a raster datacube and a vector datacube, the output is a vector datacube.
A user might define the geometries in the wgs84 CRS, while the data coming from openeo must not necessarily be in wgs84.
There are three options to handle this:

  1. The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.
  2. The data gets reprojected to the geometries CRS and the resulting vector datacube has the CRS of the input geometries.
  3. An error is thrown, as the user should reproject data or geometries before.

Wouldn't it make sense to define or at least note it on the specification level?

Proposed solution:
We currently use Nr. 1, as it means using the CRS from the first input parameter.

Additional context:
The process is currently being tested on various input datasets, which differ in their CRS.

Backends normally handle the data CRS themselves, but for using this process, it can easily happen that there are two CRS options, which are both equally valid.

@m-mohr
Copy link
Member

m-mohr commented Mar 31, 2024

Option 1 was the intended behavior. We should clarify this indeed. PRs are welcome.

@soxofaan
Copy link
Member

soxofaan commented Apr 5, 2024

  1. The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.
  2. The data gets reprojected to the geometries CRS and the resulting vector datacube has the CRS of the input geometries.

I think it should be a mix of 1 and 2.

While one could argue that in most use cases it's practically not very relevant which one you reproject to the other, I guess the most obvious choice for backend implementations is to reproject the geometries to the raster data CRS (e.g. geometries are typically just EPSG4326, and relatively cheap to reproject to the native raster data CRS, which you want to stay in for processing efficiency).

On the other hand, I think the user expectation is to get the same output geometries (in same CRS) as in the input vector cube. This might be even vital for ML application where you want to "join" the aggregation output data with target variables that are associated with the original input geometries.

So:

1+2: The geometries are reprojected to the data CRS and resulting vector datacube has the CRS of the input geometries

@ValentinaHutter
Copy link
Author

Thanks for the input - I agree, it makes sense to have both options! I think it is fine, to leave this decision to the individual backend implementations - but this should really be documented somewhere for new developers. Does it make sense to include this in the specification? Maybe, we could add a section to the specification, such as "note for developers" or simply "note"?

@soxofaan
Copy link
Member

soxofaan commented Apr 5, 2024

Maybe my explanation was a bit messy, I didn't mean to have/keep both options. I meant:

  • for the aggregation itself (mainly relevant for backend developer): project geometries to the raster data (first part of option 1)
  • for the resulting output (relevant to end user too): use the input geometry CRS (seconds part of option 2)

That second bullet point should certainly be documented in the process spec (as it is both relevant to backend dev and end user).
The first bullet point could be a recommendation, but I'm not sure it is a vital part to be documented

@clausmichele
Copy link
Member

If we agree on what @soxofaan proposed, we have to make sure to reuse the input geometries without reprojecting them twice, since it could lead to differences due to floating point rounding (it just append to me!).

@m-mohr
Copy link
Member

m-mohr commented Jul 10, 2024

Yeah, thinking more about it the important part is probably that you get the geometries as provided, i.e. in the CRS of the source geometry without changes to the coordinates etc.

We should probably describe that and then also name how this is done in the background. Maybe it's actually simpler to reproject only the source data to the geometry CRS?

@soxofaan
Copy link
Member

soxofaan commented Jul 16, 2024

The key point is indeed that the output geometries are exactly the same as the input geometries, without any (back and forth) projections or CRS changes in between. This probably needs some clarification in the docs.

How the aggregations are practically calculated is more an implementation detail for the backend implementer I think and I'm not sure the typical user will care about the difference between:

  1. reproject the geometries to the raster data CRS and aggregate.
    • This is probably the most straightforward and cheapest approach in most, if not all, practical use cases.
    • This is the approach we currently take in the geopyspark backend FYI.
  2. reproject the raster data to the geometry CRS and aggregate.
    • maybe there are reasons this is scientifically/mathematically more correct?
    • but what if there is no reasonable (implied) spatial resolution associated with the geometry CRS? If your geometry is in a default EPSG 4326, you don't want to reproject your 10 meter raster data pixels to 1 degree pixels, right?

Regardless, I'm not sure we should over-specify this aspect. However putting a recommendation (for option 1 I think) for the sake of reproducible results could be an option

@clausmichele
Copy link
Member

The process definition should also be fixed since now it's mentioned to use dimension of type geometries, but from the parameter description below and from the official API docs it should be geometry instead:
https://processes.openeo.org/#aggregate_spatial
https://api.openeo.org/#tag/EO-Data-Discovery/operation/describe-collection

@clausmichele
Copy link
Member

1. reproject the geometries to the raster data CRS and aggregate.
   
   * This is probably the most straightforward and cheapest approach in most, if not all, practical use cases.
   * This is the approach we currently take in the geopyspark backend FYI.

@soxofaan which CRS should we use in the output vector cube in your opinion? The raster cube projection or the vector cube projection?

@soxofaan
Copy link
Member

In my opinion the output vector cube should have the exactly geometry dimension as a whole (so same geometries, same order, same CRS) as the input vector cube.

This is what I intended to say with:

The key point is indeed that the output geometries are exactly the same as the input geometries, without any (back and forth) projections or CRS changes in between. This probably needs some clarification in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants