
Collection name differences #52

Closed
m-mohr opened this issue Mar 9, 2018 · 18 comments


@m-mohr
Member

m-mohr commented Mar 9, 2018

The products (GET /data) often deliver quite different names for the data sets. This leads to a problem with scripts, as you need to rename the products in the process graphs. This could also be a problem with the band ids. Could we have a shared subset?

@GreatEmerald
Member

This is indeed a big issue for portability. I'd say that establishing some naming conventions would make sense. All products already have names as given by the organisation providing the original data, so keeping to that would make sense; but then we have unofficial variants, such as vegetation indices derived from the original data, spatial subsets, different processing and compositing levels...

In some cases, like for vegetation indices, one could even make products that are made lazily, where accessing it would either calculate the index on the fly, or provide a cached version if it has already been precomputed for the area of interest.

Maybe there could also be two fields for identifying a product, one that indicates the sensor that the data is from (e.g. Landsat 8 OLI), and one that indicates which derivative it is (e.g. TOC 30m NDVI)? Or perhaps one field with a separator (e.g. /).

@m-mohr m-mohr added this to the v0.4 milestone Mar 26, 2018
@m-mohr
Member Author

m-mohr commented Mar 29, 2018

Very good ideas, @GreatEmerald! I hadn't read this so far, but came to a similar idea during a telco with Chris Holmes, where he presented STAC and how they define metadata about assets, especially EO image collections and granules. Afterwards, I had the idea that we should move away from using the IDs all over the place. They are good for some purposes, but not standardized.

We could either do that or take a completely different route by selecting image collections by properties/metadata. We could have a process that filters image collections by metadata and returns the most plausible one, e.g. "give me the raw (= non-derived) data for image collections taken with the platform Landsat and the sensor ETM+". In theory that should always return the same image collection regardless of its name. One could also still filter by the id itself. Of course, we would need more metadata for data discovery (e.g. where data has been derived from, provider, license, platform, sensor, ...). See #64 for more thoughts on this.

Partially this could also work for derived data, e.g. cloud coverage could be specified. This data probably wouldn't be completely comparable, but that's not really possible anyway.

I think that's a promising way to go for a problem that seemed unsolvable in my mind. ;)
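The metadata-filtering process described above could be sketched roughly as follows. This is a minimal illustration only: the field names (`platform`, `sensor`, `derived_from`) and the catalog entries are hypothetical placeholders, not part of the openEO API.

```python
# Hypothetical sketch: selecting a collection by metadata instead of by id.
# Field names and catalog entries are illustrative assumptions.

def filter_collections(collections, **criteria):
    """Return all collections whose metadata matches every given criterion."""
    return [
        c for c in collections
        if all(c.get(key) == value for key, value in criteria.items())
    ]

catalog = [
    {"id": "LE07_L1", "platform": "landsat-7", "sensor": "ETM+",
     "derived_from": None},
    {"id": "LE07_NDVI", "platform": "landsat-7", "sensor": "ETM+",
     "derived_from": "LE07_L1"},
]

# "Raw" data only: matching platform/sensor and not derived from anything,
# regardless of what each back-end calls the collection.
raw = filter_collections(catalog, platform="landsat-7", sensor="ETM+",
                         derived_from=None)
```

The point is that the same filter would select the same data on every back-end, even if the ids differ.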

@m-mohr
Member Author

m-mohr commented Apr 4, 2018

Started implementation with commit 8463ba7

@m-mohr
Member Author

m-mohr commented Apr 26, 2018

Related issue: I think it's a good idea to harmonize the values in eo:platform as much as possible so that a search on this field gives better/more predictable results. I'd suggest adding a recommendation to follow a list of platform names (see below).

This issue is inspired by the CEOS OpenSearch best practice document.

In an older version (1.1 D3) they had this ToDo (cited for the potential lists of platform names):

OpenSearch servers supporting collection (or granule) search based on satellite name are
recommended to pass the satellite name via the {eo:platform} search parameter defined in
[OGC 13-026r5].
To avoid name mismatches (e.g. "SPOT 5" versus "SPOT5" versus "SPOT-5" or "SEASAT 1"
versus "Seasat"), implementations are encouraged to support names as defined in TBD.

NOTE 3 – Harmonised platform names?
This TBD above should be replaced by a reference to a recommended naming. Candidates are:

Source: CEOS OpenSearch Best Practice Document Version 1.1D3, page 29

In the most recent version they have decided to go with the NASA document:

OpenSearch servers supporting collection (or granule) searches based on satellite name are
recommended to pass the satellite name via the {eo:platform} search parameter defined in [OGC
13-026r8].
To avoid name mismatches (e.g. "SPOT 5" versus "SPOT5" versus "SPOT-5" or "SEASAT 1"
versus "Seasat"), implementations are encouraged to support names as defined in the SKOS
description of satellite and missions is in the Global Change Master Directory (See skos:prefLabel
of the platform as defined in http://gcmdservices.gsfc.nasa.gov/static/kms/platforms/platforms.rdf).

Source: CEOS OpenSearch Best Practice Document Version 1.2, page 32
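The kind of harmonisation CEOS recommends could be sketched as a simple normalisation-plus-lookup. The table below is made up for illustration; a real implementation would use the skos:prefLabel values from the GCMD platform list referenced above.

```python
import re

# Sketch: map name variants such as "SPOT 5" / "SPOT5" / "SPOT-5" onto one
# preferred label. PREFERRED is an illustrative stand-in for the GCMD list.

PREFERRED = {
    "spot5": "SPOT-5",
    "seasat1": "Seasat",
    "seasat": "Seasat",
}

def normalize_platform(name):
    """Collapse case, spaces and dashes, then look up the preferred label."""
    key = re.sub(r"[^a-z0-9]", "", name.lower())
    return PREFERRED.get(key, name)  # unknown names pass through unchanged
```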

@GreatEmerald
Member

There is the issue that different backends might provide data that is not exactly the same, but carries the same metadata. Similarly, there's a possibility that a search would return more than one result, which would lead to ambiguity... On one hand, we want to be able to reuse the same data across different backends, but on the other hand, we want to make sure that the data is indeed the same.

Also, if we go with just defining metadata, so that the actual name of the dataset doesn't matter, we still need to standardise the metadata, which is the same amount of effort as standardising the name itself. In that case, why even have a name for the dataset in the first place? Names could be autogenerated from the metadata, for instance like Google Earth Engine does: ${vendor}/${satellite}/${processing_level}/${composite_length}/... (e.g. VITO/PROBAV/C1/S1_TOC_100M, though we'd probably want VITO/PROBAV/C1/S1/TOC/100M).
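Autogenerating such an id could be sketched as below. The field names and their order are illustrative assumptions, not a defined scheme.

```python
# Sketch: build a GEE-style collection id from metadata fields.
# FIELDS and the example metadata are hypothetical.

FIELDS = ("vendor", "satellite", "collection", "composite_length",
          "atmospheric_correction", "spatial_resolution")

def collection_id(meta):
    """Join the known metadata fields into a slash-separated identifier."""
    return "/".join(str(meta[f]) for f in FIELDS if f in meta)

meta = {
    "vendor": "VITO",
    "satellite": "PROBAV",
    "collection": "C1",
    "composite_length": "S1",
    "atmospheric_correction": "TOC",
    "spatial_resolution": "100M",
}
# collection_id(meta) → "VITO/PROBAV/C1/S1/TOC/100M"
```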

@m-mohr m-mohr modified the milestones: v0.4, v0.3 Jul 4, 2018
@m-mohr
Member Author

m-mohr commented Jul 6, 2018

@GreatEmerald wrote:

Those could be autogenerated from the metadata. For instance like Google Earth Engine has, ${vendor}/${satellite}/${processing_level}/${composite_length}/... (e.g. VITO/PROBAV/C1/S1_TOC_100M, though we'd probably want to have VITO/PROBAV/C1/S1/TOC/100M).

At first, I really liked this idea, but after some more thought it doesn't seem to have advantages over our previously discussed approach. The name would be restricted to the pre-defined fields and may not even be distinct, i.e. it could lead to collections with the same name. In the end, it would be the same as selecting by metadata in a process. Your example translated into a process would be something like:

{
  "process_id": "get_data",
  "vendor": "VITO",
  "satellite": "ProbaV",
  "processing_level": "C1/S1_TOC",
  "composite_length": "100M"
}

Of course, we would need to pre-define common names/rules for satellites, vendors, etc.
In the end it probably won't work without users checking whether the data sets are really the same, but they also have to rely on the back-end not to pre-process the data and to deliver good-enough metadata.
It is important to get this working somehow; otherwise this has a huge impact on process graph sharing. If the data retrieval has to be changed due to different "names", we can't load process graphs from other back-ends, as we can't change them on the fly. Actually, this led to an idea to allow variables in process graphs. I'll write an issue on that.

@GreatEmerald
Member

Yes, standardising metadata is more important than standardising the name. From your example one could generate a name as well (or just not use names altogether). Though I'm not sure why you use processing_level: "C1/S1_TOC" instead of something like collection: "C1", atmospheric_correction: "TOC", composite_length: "S1" (and your composite_length should be spatial_resolution).

Hm, another idea is to allow for a metadata field that is a sort of alias for commonly used datasets. For instance, Landsat 8 SR as produced by LaSRC could have an alias Landsat8SR. These datasets are standard (within a particular collection), and it might be faster/easier for users to refer to it than to declare all metadata every time.
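The alias idea could be sketched as a lookup that expands a short, well-known name into the full metadata it stands for. The alias names and metadata fields below are made up for illustration.

```python
# Sketch: a registry of aliases for standard datasets. An alias expands to
# the metadata that would otherwise have to be declared every time.
# ALIASES and its fields are hypothetical.

ALIASES = {
    "Landsat8SR": {
        "vendor": "USGS",
        "satellite": "Landsat-8",
        "processing": "LaSRC",
        "product": "surface_reflectance",
    },
}

def resolve_alias(name):
    """Return the metadata for a known alias, or None if it is unknown."""
    return ALIASES.get(name)
```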

@m-mohr
Member Author

m-mohr commented Jul 30, 2018

Thanks for the clarification. I was not really sure what some of the abbreviations were referring to. Your examples make more sense for sure, but I'm still not sure what exactly "C1" is about. Is that an arbitrary name or is there some hidden meaning? @GreatEmerald

I'll also take this issue into the STAC discussions; maybe they have some ideas or solutions.

@GreatEmerald
Member

"C1" stands for Collection 1, as opposed to Collection 0 or Pre-Collection. (Proba-V also has a distinction between Collection 1.01 and Collection 1.02 tiles in the file-based system they use; not everything is in C1.02, and there are some tiles that are in both C1.01 and C1.02.)

@m-mohr m-mohr changed the title Product name differences Collection name differences Apr 2, 2019
@m-mohr
Member Author

m-mohr commented Sep 13, 2019

3rd year planning: No actions planned; further explore with @aljacob how find_collections could help.

@GreatEmerald
Member

We may run into the same issue with find_collections: we still need to standardise the way to find collections somehow. It doesn't matter whether we standardise the name or the metadata, since the name is part of the metadata anyway.

@m-mohr
Member Author

m-mohr commented Oct 11, 2019

Although the metadata is already somewhat standardized through STAC, so I think there's a difference.

@GreatEmerald
Member

Yes, for sure it helps, though doesn't completely solve the issue.

Generally speaking, if we take a user workflow like in GEE, the user inputs "Sentinel-2" into the search box, and then manually selects the collection they actually want (so in the end the collection identifier is sent to the backend). When we have multiple backends, the results of the query can be quite different (e.g. many more options on EURAC for the same query). That means that the user will need to manually change the collection name before running the job on a different backend because the data is different, and there is no way that e.g. the EURAC backend can figure out what the user really wants from such a query. That's not necessarily a deal-breaker, of course.

But I think we should aim to make the users' lives easier, so when the differences between collections on different backends are superficial (e.g. bounding box only), it would be nice if the user did not need to manually inspect the inputs every time. I think the original vision of openEO was that a user can write a script and run it on different backends, whereas with the current direction the user can write a script and run it on different backends only after double-checking the inputs and applying an additional process graph that harmonises the data to the backend they were using previously. This is perhaps unavoidable in some cases, but hopefully there are cases where we can in fact avoid it, or at least make it easier for users.

@m-mohr
Member Author

m-mohr commented Oct 11, 2019

Yes, that's a good description of the problem, but I don't get what you are actually proposing to do.

@GreatEmerald
Member

Basically standardising some of the naming, when the products are supposed to be equivalent. So e.g. the process graph that harmonises a given collection to a variant on another backend should result in a new data cube/collection with the same name as on the other backend; or if they are already largely equivalent, have the different backends call it the same.

I can see if we can discuss this internally a bit more and see if we can come up with some more ideas...

@m-mohr
Member Author

m-mohr commented Oct 15, 2019

As far as I understood at the last in-person meeting, we basically agreed that a standardized naming is very unlikely to happen/work well and therefore we are going to try other options (e.g. find_collection). Therefore, I was a bit surprised that you brought it up again, but of course I'm open to good solutions.

@GreatEmerald
Member

We discussed this internally and the conclusion was that we may only want to standardise names of collections that are truly standard, e.g. Sentinel-2 as it is provided by ESA Science Hub, and not modified in any way, so that the user code could truly be portable for those. (As far as we can tell, none of the backends provide that so far.) For any local backend variants, they should have a different name, and it should be up to users to figure out how to get what they want on each backend, and they should change the code accordingly every time they want to switch backends. So good documentation of collections on each backend is very important (and clients should allow searching through the descriptions, so that users can use e.g. search_collections("Sentinel-2") and get only those entries that are related to Sentinel-2, and not the other several thousand that the backend may provide).

Having bits of process graphs that harmonise products across backends can be very inefficient (e.g. no sense in resampling bands if the user doesn't care about higher resolution bands anyway), so we're wary of that option as well. Expert knowledge on the user side may be better.

As for find_collection as a process, we find that it may be of limited use in the end; it would be pretty bad if it finds the wrong collection or a collection that is similar but not the same, as the output may be different without the user being aware of that (and the reasons behind it).

@m-mohr
Member Author

m-mohr commented Nov 1, 2019

Thanks @GreatEmerald for the input.

We discussed this internally and the conclusion was that we may only want to standardise names of collections that are truly standard, e.g. Sentinel-2 as it is provided by ESA Science Hub, and not modified in any way, so that the user code could truly be portable for those.

For now, I'd say standardizing collection names should be out of scope and we should focus on getting openEO fully working. Afterwards, we or someone else may provide recommendations regarding the naming.

clients should allow searching through the descriptions, so that users can use e.g. `search_collections("Sentinel-2")`

The important part here is "clients". As we agreed to not implement search on the back-end level, the search has to be implemented on the client side.
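Such a client-side search could be sketched as below. `search_collections` and the catalog entries are hypothetical; a real client would fetch the collection list from the back-end first.

```python
# Sketch: the back-end only lists collections; the client filters id, title
# and description locally. Names and entries are illustrative assumptions.

def search_collections(collections, term):
    """Return collections whose id, title or description contains the term."""
    term = term.lower()
    return [
        c for c in collections
        if any(term in str(c.get(field, "")).lower()
               for field in ("id", "title", "description"))
    ]

catalog = [
    {"id": "S2_L2A", "description": "Sentinel-2 Level-2A surface reflectance"},
    {"id": "PROBAV_S1_TOC_100M", "description": "Proba-V 1-day TOC composite"},
]
```

For example, `search_collections(catalog, "Sentinel-2")` would return only the Sentinel-2 entry, not the other collections the back-end may provide.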

Having bits of process graphs that harmonise products across backends can be very inefficient (e.g. no sense in resampling bands if the user doesn't care about higher resolution bands anyway), so we're wary of that option as well. Expert knowledge on the user side may be better.

Fine, that means no action to take for us.

As for find_collection as a process, we find that it may be of limited use in the end; it would be pretty bad if it finds the wrong collection or a collection that is similar but not the same, as the output may be different without the user being aware of that (and the reasons behind it).

As it's broken anyway at the moment and I agree that it may be of limited use, I'll remove it from the process descriptions.

To conclude: We may take action later to standardize the naming of common collections (i.e. data directly from the organization capturing it). Other than that, there's nothing to do on the API side. I'll close this for now. Feel encouraged to comment/reopen whenever required.

@m-mohr m-mohr closed this as completed Nov 1, 2019