Collection name differences #52
This is indeed a big issue for portability. I'd say that establishing some naming conventions would make sense. All products already have names as given by the organisation providing the original data, so keeping to that would make sense; but then we have unofficial variants, such as vegetation indices derived from the original data, spatial subsets, different processing and compositing levels... In some cases, like for vegetation indices, one could even make products that are generated lazily, where accessing them would either calculate the index on the fly or provide a cached version if it has already been precomputed for the area of interest. Maybe there could also be two fields for identifying a product: one that indicates the sensor that the data is from (e.g. Landsat 8 OLI), and one that indicates which derivative it is (e.g. TOC 30m NDVI)? Or perhaps one field with a separator (e.g. …).
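To illustrate the two schemes floated above, here is a minimal sketch in Python. The field names (`platform`, `derivative`) and the separator character are assumptions for illustration only, not part of any openEO specification:

```python
# Hypothetical product identifier with two fields, plus a helper to
# collapse them into a single separator-joined string. Field names and
# the separator are invented for this sketch.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductId:
    platform: str    # sensor the data comes from, e.g. "Landsat 8 OLI"
    derivative: str  # which derived product it is, e.g. "TOC 30m NDVI"

    def as_single_field(self, sep: str = "/") -> str:
        """Single-field variant: both parts joined by a separator."""
        return f"{self.platform}{sep}{self.derivative}"


pid = ProductId("Landsat 8 OLI", "TOC 30m NDVI")
print(pid.as_single_field())  # Landsat 8 OLI/TOC 30m NDVI
```

Either representation carries the same information; the two-field form is easier to filter on, while the single string is easier to pass around as an ID.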
Very good ideas, @GreatEmerald! I hadn't read it so far, but came to a similar idea during a telco with Chris Holmes where he presented STAC and how they define metadata about assets, especially EO image collections and granules. Afterwards, I had the idea that we should move away from using the IDs all over the place. They are good for some purposes, but not standardized. We could either do that or take a completely different route by selecting image collections by properties/metadata. We could have a process that just filters image collections by metadata and returns the most plausible one, e.g. give me the raw data (= non-derived data) for image collections that are taken with the platform ….
Started implementation with commit 8463ba7
Related issue: I think it's a good idea to harmonize the values in eo:platform as much as possible so that a search on this field gives better/more predictable results. I'd suggest adding a recommendation to follow a list of platform names (see below). This issue is inspired by the CEOS OpenSearch best practice document. In an older version (1.1 D3) they had this ToDo (cited for the potential lists of platform names):
Source: CEOS OpenSearch Best Practice Document Version 1.1D3, page 29 In the most recent version they have decided to go with the NASA document:
Source: CEOS OpenSearch Best Practice Document Version 1.2, page 32
There is a point that different backends might not provide exactly the same data, but with the same metadata. Similarly, there's a possibility that a search would return more than one result, which would lead to ambiguity... On one hand, we want to be able to reuse the same data across different backends, but on the other hand, we want to make sure that the data is indeed the same. Also, if we go with just defining metadata, so that the actual name of the dataset doesn't matter, we still need to standardise the metadata, which is the same amount of effort as standardising the name itself. In which case, why even have a name for the dataset in the first place? Those could be autogenerated from the metadata, for instance like Google Earth Engine has, ….
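The "autogenerate the name from the metadata" idea could look something like the sketch below. The metadata keys and the slash-separated, upper-cased layout (loosely modelled on Google Earth Engine's identifier style) are assumptions for illustration:

```python
# Hypothetical: derive a collection identifier from standardized metadata
# fields, so the name itself never has to be agreed on separately.
# Field names and the "A/B/C" layout are invented for this sketch.
def collection_id(meta: dict) -> str:
    """Build a deterministic identifier from metadata fields."""
    parts = [meta["platform"], meta["product"], meta["collection"]]
    return "/".join(p.upper().replace(" ", "_") for p in parts)


meta = {"platform": "Landsat 8", "product": "SR", "collection": "C1"}
print(collection_id(meta))  # LANDSAT_8/SR/C1
```

The point of the sketch: if the metadata is standardised, the identifier becomes a pure function of it, so two backends with the same data would automatically agree on the name.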
@GreatEmerald wrote:
At first, I really liked this idea, but after some more thought it doesn't seem to have advantages over our previously discussed approach. The name would be restricted to the pre-defined fields and may not even be distinct, i.e. it could lead to collections with the same name. In the end, it would be the same thing as selecting by metadata in a process. Your example translated into a process would be something like:
Of course, we would need to pre-define common names / rules for satellites, vendors etc.
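The metadata-filtering process discussed here (later referred to in this thread as `find_collections`) might behave roughly like this sketch. The function name, the record structure, and the `eo:platform`/`derived` keys are assumptions based on the discussion, not a defined openEO process:

```python
# Sketch of selecting a collection by metadata instead of by ID.
# A back-end would hold the catalog; the process filters it by
# exact-match criteria. All names here are illustrative.
def find_collection(collections: list, criteria: dict) -> list:
    """Return collections whose metadata matches all given criteria."""
    return [c for c in collections
            if all(c.get(k) == v for k, v in criteria.items())]


catalog = [
    {"id": "S2_L1C", "eo:platform": "sentinel-2a", "derived": False},
    {"id": "S2_NDVI", "eo:platform": "sentinel-2a", "derived": True},
]
matches = find_collection(catalog, {"eo:platform": "sentinel-2a", "derived": False})
print([c["id"] for c in matches])  # ['S2_L1C']
```

As the comment above notes, this only works predictably if the metadata values themselves (platform names, processing levels) follow pre-defined rules; otherwise the filter can return zero or several collections.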
Yes, standardising metadata is more important than standardising the name. From your example one could generate a name as well (or just not use names altogether). Though I'm not sure why you use ….

Hm, another idea is to allow for a metadata field that is a sort of alias for commonly used datasets. For instance, Landsat 8 SR as produced by LaSRC could have an alias ….
Thanks for the clarification. I was not really sure what some of the abbreviations were referring to. Your examples make more sense for sure, but I'm still not sure what exactly "C1" is about. Is that some arbitrary name or is there some hidden meaning? @GreatEmerald I'll also take this issue into the STAC discussions, maybe they have some ideas or solutions.
"C1" stands for Collection 1, as opposed to Collection 0 or Pre-Collection. (Proba-V also has a distinction between Collection 1.01 and Collection 1.02 tiles in the file-based system they use; not everything is in C1.02, and there are some tiles that are in both C1.01 and C1.02.) |
3rd year planning: No actions planned, further explore with @aljacob how find_collections could help. |
We may run into the same issue with ….
Although the metadata is already somehow standardized through STAC. So I think there's a difference. |
Yes, for sure it helps, though it doesn't completely solve the issue. Generally speaking, if we take a user workflow like in GEE, the user inputs "Sentinel-2" into the search box and then manually selects the collection they actually want (so in the end the collection identifier is sent to the backend). When we have multiple backends, the results of the query can be quite different (e.g. many more options on EURAC for the same query). That means that the user will need to manually change the collection name before running the job on a different backend because the data is different, and there is no way that e.g. the EURAC backend can figure out what the user really wants from such a query.

That's not necessarily a deal-breaker, of course. But I think we should aim to make the users' lives easier, so when the differences between collections on different backends are superficial (e.g. bounding box only), it would be nice if the user did not need to manually inspect the inputs every time.

I think the original vision of openEO was that a user can write a script and run it on different backends, whereas with the current direction the user can write a script and run it on different backends only after double-checking the inputs and applying an additional process graph that harmonises the data to the backend they were using previously. This is perhaps unavoidable in some cases, but hopefully there are cases where we could in fact avoid that, or at least make it easier for users.
Yes, that's a good description of the problem, but I don't see what you are actually proposing to do.
Basically, standardising some of the naming when the products are supposed to be equivalent. So e.g. the process graph that harmonises a given collection to a variant on another backend should result in a new data cube/collection with the same name as on the other backend; or, if they are already largely equivalent, have the different backends call it the same. I can see if we can discuss this internally a bit more and see if we can come up with some more ideas...
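One way to picture the "call it the same when it's equivalent" proposal is a shared alias table that maps backend-specific collection names to a canonical name. All backend and collection names below are invented for illustration:

```python
# Hypothetical alias table mapping (backend, local name) pairs to one
# shared canonical collection name. Every name here is made up.
ALIASES = {
    ("backend-a", "S2_L2A_T32TPS"): "SENTINEL2_L2A",
    ("backend-b", "s2_msi_l2a"): "SENTINEL2_L2A",
}


def canonical_name(backend: str, local_name: str) -> str:
    """Resolve a backend-specific name; fall back to the local name."""
    return ALIASES.get((backend, local_name), local_name)


print(canonical_name("backend-b", "s2_msi_l2a"))  # SENTINEL2_L2A
```

A script written against the canonical name would then be portable across the backends listed in the table, while truly backend-specific variants simply keep their local names.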
As far as I understood at the last in-person meeting, we basically agreed that standardized naming is very unlikely to happen/work well, and therefore we are going to try other options (e.g. find_collection). Therefore, I was a bit surprised that you brought it up again, but of course I'm open to good solutions.
We discussed this internally and the conclusion was that we may only want to standardise names of collections that are truly standard, e.g. Sentinel-2 as it is provided by ESA Science Hub and not modified in any way, so that user code could truly be portable for those. (As far as we can tell, none of the backends provide that so far.) Any local backend variants should have a different name, and it should be up to users to figure out how to get what they want on each backend, changing the code accordingly every time they want to switch backends. So good documentation of collections on each backend is very important (and clients should allow searching through the descriptions, so that users can use e.g. …).

Having bits of process graphs that harmonise products across backends can be very inefficient (e.g. no sense in resampling bands if the user doesn't care about higher-resolution bands anyway), so we're wary of that option as well. Expert knowledge on the user side may be better.

As for …
Thanks @GreatEmerald for the input.
For now, I'd say standardizing collection names should be out of scope and we should focus on getting openEO fully working. Afterwards, we or someone else may provide recommendations regarding the naming.
The important part here is "clients". As we agreed to not implement search on the back-end level, the search has to be implemented on the client-side.
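The client-side search over collection descriptions agreed on here could be sketched as follows. The structure of the collection records is an assumption (loosely based on what `GET /collections`-style listings contain), not a defined client API:

```python
# Client-side search: the back-end only lists its collections, and the
# client filters IDs and descriptions locally. Record layout is assumed.
def search_collections(collections: list, term: str) -> list:
    """Case-insensitive substring search over id and description."""
    term = term.lower()
    return [c for c in collections
            if term in c["id"].lower()
            or term in c.get("description", "").lower()]


cols = [
    {"id": "S2_L2A", "description": "Sentinel-2 surface reflectance"},
    {"id": "L8_SR", "description": "Landsat 8 surface reflectance (LaSRC, C1)"},
]
print([c["id"] for c in search_collections(cols, "landsat")])  # ['L8_SR']
```

This keeps the back-end API minimal, at the cost of each client downloading and filtering the full collection listing itself.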
Fine, means no action to take for us.
As it's broken anyway at the moment and I agree that it may be of limited use, I'll remove it from the process descriptions.

To conclude: we may take action later to standardize the naming of common collections (i.e. data directly from the organization capturing it). Other than that, there's nothing to do on the API side. I'll close this for now. Feel encouraged to comment/reopen whenever required.
The products (GET /data) often deliver quite different names for the data sets. That leads to a problem with the scripts, as you need to rename the products in the process graphs. This could also be a problem with the band IDs. Could we have a shared subset?