Changes for the UDF data model #14
Hi,

A couple of questions regarding the proposed UDF format:

- What software library implements this format and allows easy conversion to established "runtime" formats like numpy, shapely or xarray?
- Why do we need to invent new complex formats outside existing approaches, with feature dimensions and irregular spatial dimensions?
- What software exists that the UDF developer can use to process this format in Python and R?

We should use common and well-established data formats, so that it is easy for the user to implement UDFs and easy to exchange data between the back-end and the UDF REST server. Using GeoJSON makes it very clear to the back-end what kind of data was processed, and it is directly supported by the most common geo-data processing libraries: GEOS and OGR. pandas.DatetimeIndex is used for the time stamps, so there is no need for data conversion.

If you need raster data with more than three dimensions (e.g. two regular time axes) or you have dense data like MODIS (HDF) or climate model data (netCDF), then the hypercube format can be used. It is implemented in Python via xarray, hence the UDF developer can use xarray directly, with no data conversion required. Arrays are supported in R directly. The RasterCollectionTile is a more specialized variant of the hypercube that supports an irregular time dimension and 2D raster slices, which may be better suited for remote sensing data of non-geostationary satellites that are sparse in time and space. The Python building blocks are pandas.DatetimeIndex and numpy arrays, which can be used in the Python UDF code directly.

About structured and unstructured data:

Regarding the machine learning model:

[1] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/functions/feature_collections_buffer.py
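To make the "no data conversion" point concrete, here is a minimal sketch of a Python UDF operating directly on an xarray hypercube. The dimension names, sample values and the function itself are purely illustrative and not part of any agreed API.

```python
# Minimal sketch: a Python UDF body working directly on an xarray hypercube.
# The back-end is assumed to have deserialized the exchange format into an
# xarray.DataArray with dimensions ("t", "y", "x"); all names are illustrative.
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the data a back-end would normally provide.
times = pd.date_range("2018-06-01", periods=4, freq="D")
cube = xr.DataArray(
    np.random.rand(4, 3, 3),
    dims=("t", "y", "x"),
    coords={"t": times},
    name="ndvi",
)

def udf_temporal_mean(cube: xr.DataArray) -> xr.DataArray:
    """Reduce the time dimension; no format conversion needed inside the UDF."""
    return cube.mean(dim="t")

result = udf_temporal_mean(cube)
print(result.dims, result.shape)  # ('y', 'x') (3, 3)
```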
I think we should have a discussion in person to figure out how to proceed. I see room for alignment between the API/processes and UDFs and both sides have good points, which we should align for the best outcome. A good place to discuss would be (before?) the 3rd year planning on the 12th of September, so that we can release a "stable" 1.0 of the API at the end of the year, which includes the UDFs, of course. cc @edzer
For starters, we defined it as our main view on EO data in the glossary (https://open-eo.github.io/openeo-api/glossary/), and from a user's / UDF script developer's view it might get complicated if you have to deal with different data models without knowing when or under what circumstances you get which of them as data in the UDF. If this were somehow specified, I would be happy with that.
The exchange format is still JSON and the
I agree, but what we have currently defined works mainly out of practicability with Python. In my opinion we need a more generic approach for the metadata that ships with the data, so that all back-ends and UDF services are able to interpret the data correctly. Exchanging JSON between back-ends and UDF services is the lowest common denominator, simply because every back-end needs JSON anyway to be compliant with the API. Besides that, we might have the option to use NetCDF, which supports multidimensional data handling, but then we would still have some work to do on agreeing on a common metadata handling. The only downside is that maybe not every back-end supports NetCDF.
Yes, that's right, those are built on arrays (please also consider my explanation of what MultidimensionalArrays are). But what about the metadata on the indices in those arrays? I do see your point for having those models, but you make some assumptions when calling them like
Yes, a back-end developer has to know it. But the UDF script developer might not be a developer of the back-end and hence needs more explanatory data, or at least a common way to know what data to expect.
GeoJSON is still on the table. For a potential vector data cube, I would model the spatial feature data as GeoJSON and hence as
Yes, that's right again.
OK, but why wouldn't we split this into separate UDF calculation requests, one for each tile time series?
The multidimensional array would be a one-dimensional one in this case. A list would have an index dimension (regular, with attributes offset=0 or 1 and delta=1); if you have a named list, then use an irregular dimension with explicit values, as in the sketch below.
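A minimal sketch of the two cases; the key names (name, offset, delta, values) are illustrative assumptions, not the agreed schema:

```python
# Sketch of how a plain list vs. a named list could be described as a
# one-dimensional array; key names are illustrative assumptions.
regular_index_dim = {
    "name": "index",
    "offset": 0,   # first index value
    "delta": 1,    # regular spacing, so no explicit values needed
    "size": 5,
}

irregular_named_dim = {
    "name": "key",
    "values": ["red", "green", "blue"],  # irregular: enumerate the labels
}

list_as_1d_array = {
    "dimensions": [regular_index_dim],
    "data": [10, 20, 30, 40, 50],
}

named_list_as_1d_array = {
    "dimensions": [irregular_named_dim],
    "data": [0.8, 0.5, 0.1],
}
```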
I understand and would agree that this should somehow be covered. I agree with @m-mohr that a face-to-face meeting would be a good idea to discuss this further; I'm looking forward to it.
I do not understand this. Why would the UDF developer not know what to expect? There is a Python API specifically designed for Python UDF developers with well-known data formats. The UDF developer expects a specific format as input, since he designs his algorithms for this format. If the back-end does not provide this format in the UdfData object, then the UDF raises an exception stating what it expects, for example: a hypercube with one temporal and two spatial dimensions plus vector points as a FeatureCollectionTile, because its job is to sample hypercubes with vector points to generate new time-stamped vector points with new attributes as output.

The UDF developer can tell the back-end with "in code key-words" what the UDF expects and what it produces, so that the back-end can check whether the user provides the correct input for the UDF in the process graph and whether the UDF output is compatible with downstream nodes in the process graph. The designer of the process graph is responsible for knowing what data formats are expected by the UDF that he wants to use. Hence, the UDF node must have data source nodes as inputs that provide the required formats.
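A rough illustration of this fail-fast idea; the accessor names on udf_data are hypothetical placeholders, not the actual openeo-udf API:

```python
# Sketch of the "in code key-words" idea: the UDF declares what it expects
# and fails fast if the back-end passes something else. The attributes on
# udf_data are hypothetical placeholders, not the real openeo-udf API.

EXPECTED_CUBE_DIMS = ("t", "y", "x")

def run_udf(udf_data):
    cubes = getattr(udf_data, "hypercubes", [])              # hypothetical accessor
    features = getattr(udf_data, "feature_collections", [])  # hypothetical accessor

    if not cubes or tuple(cubes[0].dims) != EXPECTED_CUBE_DIMS:
        raise ValueError(
            "This UDF expects a hypercube with one temporal and two spatial "
            "dimensions (t, y, x)."
        )
    if not features:
        raise ValueError("This UDF expects a point FeatureCollection to sample with.")

    # ... sample the cube at the point locations and return
    # time-stamped points with new attributes ...
```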
No, that was perfectly clear. You want to have a multi-dimensional array with metadata and any kind of objects as array values and axis definitions. And that is in my opinion not a good approach, because it is not a common format that you can use for processing. If you have arbitrary objects as data, then the processing library that is used in the UDF must support these kinds of objects. It must know what kind of operators can be applied to these objects (+, -, *, /, ...). There is no software available that supports arbitrary objects.

In Python these data formats have self-describing metadata: the number of features, the number of dimensions, dimension names, units, the shape of the multi-dimensional array and so on. In addition, we can put all metadata that is necessary for algorithms as a dictionary into the UdfData object. I have no problem with that.

I think we should have a telephone conference with Edzer and Matthias before the 12th of September to discuss this topic.
To keep this issue up-to-date: internally it was discussed to have a closer look at https://covjson.org/spec/. Apache Parquet was also considered as an exchange file format.
@flahn Is Parquet still considered or was it not good enough?
Apache Arrow does not support geopandas or xarray, hence it cannot be used to process hypercubes or vector data efficiently. covjson does not support vector data well enough to be used as an exchange format between back-ends and UDF REST services. Apache Parquet is a columnar data format specific to Hadoop applications. We can consider this format as the exchange format, however, we need to specify what to store in it.
I would like to suggest an exchange format that suits all requirements (single file, JSON, easy to serialize, data cube support, vector data support, image collection support) and that can be implemented as a JSON format or a relational table structure. It supports in a single file:
Here is a simple data cube example:
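Since the original snippet is not reproduced here, the following is only an illustrative sketch of what such a single-file data cube could look like, written as a Python dict and serialized to JSON; all field names are assumptions, the attached schemas are authoritative.

```python
# Illustrative sketch only; field names ("dimensions", "data", ...) are
# assumptions about the proposed format, not the attached schema.
import json

data_cube = {
    "id": "ndvi_cube",
    "description": "A tiny 2x2 raster with two time steps",
    "dimensions": [
        {"name": "t", "values": ["2018-06-01T00:00:00", "2018-06-02T00:00:00"]},
        {"name": "y", "offset": 50.0, "delta": -0.01, "size": 2, "unit": "degree"},
        {"name": "x", "offset": 8.0, "delta": 0.01, "size": 2, "unit": "degree"},
    ],
    "data": [
        [[0.1, 0.2], [0.3, 0.4]],   # t = 2018-06-01
        [[0.5, 0.6], [0.7, 0.8]],   # t = 2018-06-02
    ],
}

print(json.dumps(data_cube, indent=2))
```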
A simple feature example:
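Again, in place of the original snippet, an illustrative sketch: the geometries and properties follow plain GeoJSON, while the wrapper keys (id, timestamps, features) are assumptions.

```python
# Illustrative sketch of the simple feature part; geometry and properties are
# plain GeoJSON, the wrapper keys are assumptions about the proposed format.
simple_features = {
    "id": "sample_points",
    "timestamps": ["2018-06-01T00:00:00", "2018-06-02T00:00:00"],
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [8.0, 50.0]},
            "properties": {"ndvi": 0.42},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [8.01, 50.01]},
            "properties": {"ndvi": 0.37},
        },
    ],
}
```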
Attached are the JSON schemas and the Python classes that define the data format:
Haven't looked into it in detail and I'm not up to date on the recent discussions, but some questions from my side (without any judgement, just for clarification):
It looks quite good for a JSON model. But I'm concerned about the referencing. In the covJSON specification there is a part under domain called "referencing" which creates a relation between named single dimensions and the reference system. I think this might be useful to avoid confusion with the axis order (e.g. lat/lon vs. lon/lat). If this is predefined for space (x, y) and time (t), we should document this somewhere.
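For context, the CoverageJSON referencing idea looks roughly like the following (rendered here as a Python dict from memory of https://covjson.org/spec/, so treat the details as approximate): named axes are tied to a reference system, which removes any ambiguity about axis order.

```python
# Approximate, from-memory rendering of CoverageJSON-style referencing;
# see https://covjson.org/spec/ for the authoritative structure.
referencing = [
    {
        "coordinates": ["x", "y"],
        "system": {
            "type": "GeographicCRS",
            "id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84",
        },
    },
    {
        "coordinates": ["t"],
        "system": {"type": "TemporalRS", "calendar": "Gregorian"},
    },
]
```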
No. The suggested schema can be implemented using a table structure as well: SQLite?
The user should not see the communication between the back-end and a UDF REST server, except for Python or R exceptions if the code fails. But these will only show the Python or R API errors, not the exchange format.
Because these are different data types. We should not force everything into a data cube and lose important features, like feature-specific time stamps or sparse image collections with intersecting time intervals. For the UDFs we can simply focus on data cubes and simple features; that should be absolutely sufficient. However, the suggested format provides many more features.
The order of the dimensions for a data cube is set with the dim option:
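The original snippet is not reproduced here; the sketch below only illustrates the idea, with assumed key names:

```python
# Illustrative sketch: the "dim" entry fixes the storage order of the axes,
# so "data" is read as data[t][y][x]. Key names are assumptions.
data_cube = {
    "id": "ndvi_cube",
    "dim": ["t", "y", "x"],          # explicit dimension order
    "data": [                        # nested lists follow exactly that order
        [[0.1, 0.2], [0.3, 0.4]],
        [[0.5, 0.6], [0.7, 0.8]],
    ],
}
```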
This makes it clear to me in what order the dimensions are stored in the data field. We can simply state this as the expected default.
Agreed. I'm just asking because there's some metadata that could also be interesting for a user, and if that's (partially) passed into the UDF, it should be aligned with how the rest of the API works anyway. But that's more about naming and some structures, and I guess you are open to aligning that, as it wouldn't really change the way your proposal works.
Oh, I see. The last paragraph was important. So the user would only use datacubes (and SF?) in the UDF, but the UDF server and back-end could use different types of exchange formats. That is important, because the data that is passed through the processes of a process graph is never an image collection (vector data is to be discussed, I guess), but always a data cube. |
I was wondering whether the UDF data schema could be remodelled. In my opinion we can model the data as data cubes. The reason is that raster and feature collection tiles are simply very special cases of your current hypercube model. Even the structured data can be modeled as 1-dimensional or 2-dimensional data.
The reason for this generalization is simply to make it clearer for both a back-end developer and a UDF service developer how the data is structured, how it can be translated into language-dependent objects, e.g. stars in R or geopandas in Python, and how the UDF results have to be interpreted by the back-end.

The current UDF request model looks like this:
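The original schema is not reproduced here, so the following is only a hedged sketch of what the current request roughly bundles, based on the types mentioned in this thread; the field names are illustrative, not the actual openeo-udf schema.

```python
# Hedged sketch of the current UDF request; field names are illustrative.
udf_request = {
    "code": {"language": "python", "source": "def hyper_ndvi(data): ..."},
    "data": {
        "proj": "EPSG:4326",
        "raster_collection_tiles": [],    # 2D slices plus an irregular time axis
        "feature_collection_tiles": [],   # GeoJSON features plus time stamps
        "hypercubes": [],                 # dense n-dimensional arrays
        "structured_data": [],            # tables, lists, dicts
        "machine_learning_models": [],    # references to trained models
    },
}
```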
By revisiting the implementation of Edzer's stars package for R and the thoughts @m-mohr put into the dimension extension in STAC, I would put up for discussion something along the lines of the sketch at the end of this comment.

I'm not sure about the machine learning models. Do they really have to be part of a general UDF API, or might it be better to load those in the UdfCode? As I see it, the critical part is that they need to be loaded from the local file system of the UDF service, which might be solved by uploading such data into the back-end's personal workspace and mounting this in the UDF service instance.
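As announced above, a hedged sketch of the more generic, dimension-centric request: every input is a data cube whose axes are described explicitly, loosely following the stars / STAC dimension vocabulary. All names and values are assumptions put up for discussion, not a finished proposal.

```python
# Hedged sketch of a dimension-centric UDF request; all keys are assumptions.
proposed_request = {
    "code": {"language": "python", "source": "def udf(cubes): ..."},
    "data": [
        {
            "id": "sentinel2_b4",
            "dimensions": [
                {"name": "t", "type": "temporal",
                 "values": ["2018-06-01T00:00:00", "2018-06-02T00:00:00"]},
                {"name": "y", "type": "spatial", "axis": "y",
                 "extent": [50.0, 50.02], "reference_system": "EPSG:4326"},
                {"name": "x", "type": "spatial", "axis": "x",
                 "extent": [8.0, 8.02], "reference_system": "EPSG:4326"},
            ],
            "data": [[[0.1, 0.2], [0.3, 0.4]],
                     [[0.5, 0.6], [0.7, 0.8]]],
        }
    ],
}
```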