
ML Data Cube Regularization #444

Open: PondiB wants to merge 21 commits into base: ml

Conversation

PondiB (Member) commented Jun 19, 2023

Regularized data cubes are a necessity for machine learning and deep learning on EO time series data. This process aims to eliminate the need for users to chain multiple processes to obtain a consistent data cube.

PondiB (Member, Author) commented Jun 20, 2023

@m-mohr, could you take a look whenever you have a moment? I have fixed most of the failures, but this one is taking me much longer to trace.

m-mohr (Member) commented Jun 21, 2023

fyi: I won't get to it anytime soon, sorry.

PondiB (Member, Author) commented Jun 21, 2023

> fyi: I won't get to it anytime soon, sorry.

Thanks for getting back. It's fine. I'll figure it out soon.

PondiB changed the title from "data cube regularization" to "ml data cube regularization" on Sep 14, 2023
PondiB changed the title from "ml data cube regularization" to "data cube regularization for machine learning processes" on Sep 14, 2023
soxofaan (Member) commented

I'm not sure I understand why this process is necessary. The description talks about "irregular", but if your data is in an openEO data cube, then it's pretty regular already. Your time instants could be spaced unevenly, but that doesn't mean that an ML model could not handle that.

This process looks like a combination of aggregate_temporal_period and resample_spatial, but:

  • aggregate_temporal_period uses a different period specification format
  • aggregate_temporal_period has a reducer argument, which ml_regularize_data_cube is missing, I guess
  • resample_spatial has projection and method arguments (and some more), which are also missing here

In this state, I think ml_regularize_data_cube is missing quite a few parameters.

More generally: is there a compelling reason to define ml_regularize_data_cube if we already have aggregate_temporal_period and resample_spatial?
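
For illustration, a minimal sketch of this chaining with existing processes, assuming the openEO Python client; the backend URL, collection id, extents, and parameter values are all illustrative:

```python
import openeo

# Connect to an openEO backend (URL is illustrative).
connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

# Load a (possibly irregular) Sentinel-2 time series; names and extents are illustrative.
cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 5.0, "south": 51.0, "east": 5.1, "north": 51.1},
    temporal_extent=["2022-01-01", "2022-12-31"],
    bands=["B02", "B03", "B04", "B08"],
)

# Temporal regularization: aggregate observations into monthly composites.
cube = cube.aggregate_temporal_period(period="month", reducer="mean")

# Spatial regularization: resample everything to a common 10 m grid.
cube = cube.resample_spatial(resolution=10, method="bilinear")
```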

jdries (Contributor) commented Sep 25, 2023

PondiB (Member, Author) commented Sep 25, 2023

@soxofaan thanks for the feedback. On the OEMC project we are planning a new openEO backend with more focus on ML and DL capabilities for satellite image time series.

A regular data cube in our case means that: (a) there is a unique field function; (b) the spatial support is georeferenced; (c) temporal continuity is assured; (d) all spatiotemporal locations share the same set of attributes; and (e) there are no gaps or missing values in the spatiotemporal extent.

In our discussion there were two philosophies, as shown in the image below, and we would like to support both, i.e. (1) allowing users to define their processes before the ML/DL operations, and (2) not bothering users with the underlying processes.
[Screenshot 2023-09-25 at 14 54 41]

@jdries cool, I will check out the examples.

jdries (Contributor) commented Sep 27, 2023

Nice, this is exactly what I happen to be working on at the moment, in support of a couple of projects using ML.

Maybe you already know, but openEO has a mechanism to build this kind of convenience function as a combination of existing processes: the openEO 'user-defined processes' (UDP). Using this has a couple of advantages:

  • The process definition is very formal and falls back to the definitions of the individual processes, so there is less specification work to be done.
  • Backends that support the individual processes can easily support the convenience process, even without requiring an explicit implementation. This is extremely important if we want to reach the goal of cross-backend compatibility.
  • Backends that do not support the individual processes can still support the convenience process.
  • If you want a special (e.g. faster) implementation of the convenience process, that's also possible.

I see this case arising more often, so maybe we can create an open-source GitHub repo with the definitions of these UDPs. That would allow users to reference the central repo, or allow backends to import those definitions.
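
As a minimal sketch of what such a UDP could look like, assuming the openEO Python client's UDP helpers (Parameter, datacube_from_process, save_user_defined_process); the backend URL, process id, and parameter names are illustrative, not part of this proposal:

```python
import openeo
from openeo.api.process import Parameter

connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

# Parameters of the convenience process (names are illustrative).
data = Parameter.raster_cube(name="data", description="Irregular input data cube.")
period = Parameter.string(name="period", description="Compositing period, e.g. 'month'.")
resolution = Parameter.number(name="resolution", description="Target spatial resolution in metres.")

# Express the convenience process with existing processes:
# spatial resampling followed by temporal aggregation.
regularized = (
    connection.datacube_from_process(
        "resample_spatial", data=data, resolution=resolution, method="bilinear"
    )
    .aggregate_temporal_period(period=period, reducer="mean")
)

# Store it on the backend as a reusable user-defined process.
connection.save_user_defined_process(
    user_defined_process_id="regularize_datacube",
    process_graph=regularized,
    parameters=[data, period, resolution],
)
```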

Now about the actual process:

  • Spatial regularization is something that openEO already supports by default, without requiring any process. If a user loads a mix of Sentinel-2 bands at different resolutions, we, for instance, return a data cube with the right UTM zone as the projection system and the highest resolution. So I'm not sure if we need this.
  • Cloud masking is tricky and unfortunately still needs sensor-specific implementations to do it right. Not sure how that would work with a convenience process? The most generic approach I can think of is some kind of binarized cloud mask, and then using a 'distance to cloud' metric in the compositing. The sits_regularize (1) method mentions sorting images by cloud percentage, but I'm not sure how this translates to openEO data cubes.
  • There are different possible methods to select the best available pixel from a given compositing interval. The optimal choice depends somewhat on the length of the interval and the number of observations per interval. A relatively generic method is to use the distance to the middle of the interval, combined with the distance to cloud; a small sketch of this scoring idea follows after this list. It has the advantage over (1) that you try to ensure that the actually selected observations are spaced as evenly in time as possible.

(1) https://rdrr.io/cran/sits/man/sits_regularize.html
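
As a concrete illustration of that last point (not an existing openEO process), here is a small NumPy sketch that assumes a per-observation cloud-distance raster is available; it scores each observation by distance to the interval middle combined with distance to cloud and picks the best one per pixel:

```python
import numpy as np

def best_available_pixel(values, obs_doy, dist_to_cloud, interval_mid_doy,
                         w_time=1.0, w_cloud=1.0, max_cloud_dist=50):
    """Per-pixel best-available-pixel compositing for one interval.

    values:           (t, y, x) observations, NaN where masked/invalid
    obs_doy:          (t,) acquisition day-of-year of each observation
    dist_to_cloud:    (t, y, x) distance to the nearest cloud, in pixels
    interval_mid_doy: day-of-year of the middle of the compositing interval
    """
    # Temporal score: observations closer to the interval middle score higher.
    time_score = -w_time * np.abs(obs_doy - interval_mid_doy)[:, None, None]
    # Cloud score: observations farther from clouds score higher (capped).
    cloud_score = w_cloud * np.minimum(dist_to_cloud, max_cloud_dist)
    score = time_score + cloud_score
    # Exclude masked observations from the competition.
    score = np.where(np.isnan(values), -np.inf, score)
    # Index of the best observation per pixel, then gather its value.
    best = np.argmax(score, axis=0)
    return np.take_along_axis(values, best[None, ...], axis=0)[0]
```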

m-mohr (Member) left a comment

> Maybe you already know, but openEO has a mechanism to build this kind of convenience function that is a combination of existing processes, the openEO 'user defined processes' (UDP).

Yeah, maybe some of these processes should go into openeo-community-examples if they can be built on top of other processes? This could also apply to the ard_* processes. All of these are very heavyweight processes that may not fit 100% into the current process landscape. I'll take this to the PSC for discussion.

I think we should at least consider trying to solve this use case with existing processes, i.e. add a "process_graph" member to the process description.
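
As a rough sketch of that idea, a process description carrying a "process_graph" member could decompose the convenience process into existing processes. Shown here as a Python dict for brevity; the process id, parameters, and node names are illustrative, not the actual proposal:

```python
# Sketch of a process description with a "process_graph" member that expresses
# the convenience process in terms of existing processes.
ml_regularize_data_cube = {
    "id": "ml_regularize_data_cube",
    "summary": "Regularize a data cube for ML (convenience process).",
    "parameters": [
        {"name": "data", "description": "Input data cube.",
         "schema": {"type": "object", "subtype": "datacube"}},
        {"name": "period", "description": "Compositing period.",
         "schema": {"type": "string"}},
        {"name": "resolution", "description": "Target spatial resolution.",
         "schema": {"type": "number"}},
    ],
    "returns": {"schema": {"type": "object", "subtype": "datacube"}},
    "process_graph": {
        "aggregate": {
            "process_id": "aggregate_temporal_period",
            "arguments": {
                "data": {"from_parameter": "data"},
                "period": {"from_parameter": "period"},
                "reducer": {"process_graph": {
                    "mean": {
                        "process_id": "mean",
                        "arguments": {"data": {"from_parameter": "data"}},
                        "result": True,
                    }
                }},
            },
        },
        "resample": {
            "process_id": "resample_spatial",
            "arguments": {
                "data": {"from_node": "aggregate"},
                "resolution": {"from_parameter": "resolution"},
            },
            "result": True,
        },
    },
}
```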

m-mohr (Member) commented Dec 8, 2023

@PondiB I think it would make sense to make PRs against the ml branch because otherwise all changes from the ML branch will also appear in this PR. This leads to confusion. Please rebase your changes against the ML branch if necessary and set the base branch of the PR to ml.

PondiB (Member, Author) commented Dec 8, 2023

> @PondiB I think it would make sense to make PRs against the ml branch because otherwise all changes from the ML branch will also appear in this PR. This leads to confusion. Please rebase your changes against the ML branch if necessary and set the base branch of the PR to ml.

Sure.

PondiB changed the base branch from draft to ml on Dec 8, 2023, 13:35
PondiB changed the title from "data cube regularization for machine learning processes" to "ML Data Cube Regularization" on Dec 12, 2023